Communicated by Theodore H. Bullock
On Neural Circuits and Cognition

Michael S. Gazzaniga
Center for Neuroscience, University of California at Davis, Davis, CA 95616 USA

1 Introduction
Those of us trying to deal with the interfaces between human brain function and mind often wonder what it is we can communicate to our colleagues dealing with related issues but from different perspectives. We are all seeking principles of function, core ideas that help us understand how the nervous system accomplishes its goals. Ideally, those of us working at the level of human cognition might help define the problem for neuroscientists, mathematicians, and engineers. By studying issues from a cognitive view, tempered perhaps with an evolutionary perspective, we may be in a position to define the type of neural unit and organizational logic that is essential in the brain's capacity to enable perceptual and cognitive activities.

In the following, a current major assumption in neuroscience is challenged: the idea that a larger brain with more cells is responsible for the greater computational capacity of the human being. Consider Passingham's main conclusion to his fascinating book, The Human Primate (1981):

Relatively simple changes in the genetic control of growth can have far-reaching effects on form. The human brain differs from the chimpanzee brain in its extreme development of cortical structures, in particular the cerebellar cortex and the association areas of the neocortex. But the proportions of the different areas are predictable from the rules governing the construction of primate brains of differing size. . . . Furthermore there appears to be a basic uniformity in the number and types of cells used in the building of the different neocortical areas; and the human brain follows the general pattern for mammals. Even in the case of the two speech areas we believe we can detect regions in the monkey brain which are alike in their basic cellular organization. The evolution of the human brain appears to have been characterized more by an expansion of existing areas than any more radical reconstructions.
This belief that bigger brains mean a better substrate for complex operations is echoed by Willerman and colleagues (Willerman et al. 1991):

Neural Computation 7, 1-12 (1995)
© 1994 Massachusetts Institute of Technology
Brain size is correlated with cortical surface area so that larger size might reflect more cortical columns available for analyzing high-noise or low-redundancy signals, thus enabling more efficient information processing pertinent to IQ test performance.

Thus, it is commonly believed that the uniqueness of the human brain can be traced to its larger size. It has more neurons, more cortical columns, and in that truth lies, somewhere, the secret to the human experience. Indeed, this view seems entirely consistent with many other observations in both humans and animals. The disproportionately large cortical representation of some sensory and motor regions of the cortex in animals and humans is well established, as is the correlation between the large inferior colliculus of the echolocating bat and dolphin and the enlarged optic lobes of some highly visual fish. In short, the idea that a larger brain structure reflects an increase in function is ubiquitous.

Even Charles Darwin promoted the idea that big brains were the reason for the uniqueness of the human condition. In The Descent of Man (1981) he said there is "no fundamental difference between man and the higher mammals in their mental faculties" (p. 35). Further, he went on to add that "the difference in mind between man and the higher animals, great as it is, is certainly one of degree and not of kind" (p. 105). He did not want to be part of any thinking that there may be critical qualitative differences between the subhuman primate and man (Preuss 1993). As Preuss has pointed out, Darwin left the actual anatomy to his colleague Thomas Henry Huxley. At that time Richard Owen, another anatomist, argued there was a special structure in the human brain, the "hippocampus minor." However, Huxley showed this structure was also found in other primates, thereby undercutting the idea that the human brain was qualitatively different in any way from the primate brain.
So here we had Darwin, the genius who had articulated the idea of natural selection and the notion of diversity, arguing for a straight-line evolution between primates and humans. Organisms were the product of selection pressures, and as a result a rich diversity occurred in the evolution of species. Yet, when it came to the brain and to mind itself, it would seem safe to say that Darwin thought the human brain to be a blown-up monkey brain, a nervous system that had some sort of monotonic relationship with its closest ancestor. Nonetheless, for a number of years we have been collecting evidence that the human brain does not rely for its unique capacities on cell number so much as on the appearance of unique and specialized circuits. Observing that the human brain is larger, with more cells, is not sufficient by itself to explain its increased capacities. I would like to suggest that the human brain has unique organizational features that distinguish it from other brains and, in particular, from the nonhuman primate brain. If for no other reason than that we are a different species adapted to
a different niche, one would assume there would be differences in brain organization. There are a variety of findings arising from different aspects of the study of the human brain that support this view. Not only does work on the cognitive capacities of split-brain patients support this assertion, but so do studies on the functions of the cerebral commissures, as well as studies assessing the effects of cortical lesions. When these facts are considered together in light of evolutionary theory, a view emerges that suggests that the essential neuronal characteristics crucial to specific mental activities will be the products of specific neural circuit organization. In short, the complexity of human mental capacity is derived from genetically determined neural circuits (Gazzaniga 1992). After millions of years of natural selection, we have accumulated a variety of circuits that enable us to carry out specific aspects of human cognition. Just as comparative neurobiologists have demonstrated the presence of specialized circuitry in lower animals that reflects adaptations to specific niches (Bullock 1993; also see Arbas et al. 1991), it is argued that similar demonstrations will be made in human neuroscience. In arguing for the importance of specialized local circuits, it is helpful to keep in mind how minute neural systems, such as those seen in the ant, can nonetheless support complex social behaviors. Just as dedicated electronic circuits can support complex functions, so too can dedicated neuronal circuits. When we see complex behavior in a big-brained animal, we assume it has something to do with the big brain. But as William James (1890) suggested, the human really possesses far more instincts than do animals, not fewer, and it is that fact that makes humans the more flexibly intelligent. In short, those big brains may be bigger because they are housing many more special circuits.

2 Evidence for Specialized Circuits
Consider the human brain. It has two halves, the left and the right. We know the left cortex is specialized for language and speech and the right has some specializations as well. Each half cortex is the same size and has roughly the same number of nerve cells. The cortices are connected by the corpus callosum. The total, linked cortical mass is assumed somehow to contribute to our unique human intelligence. What would happen to intelligence if the two half brains were disconnected, leaving the left operating independently of the right and vice versa? The brain is divided when split-brain surgery is performed in patients who suffer from epilepsy. Would split-brain patients lose half of their cognitive capacity since the left, talking hemisphere would now operate with only half of the total brain cortex? A cardinal feature of split-brain research is that following disconnection of the human cerebral hemispheres, the verbal IQ of the patient
remains intact (Gazzaniga 1965; Nass and Gazzaniga 1987; Zaidel 1990) and the problem-solving capacity, such as that seen in hypothesis formation tasks, remains unchanged for the left hemisphere (LeDoux et al. 1977). While there can be deficits in recall capacity (Phelps et al. 1991) and in some other performance measures, the overall capacity for problem solving seems unaffected. In other words, isolating essentially half of the cortex from the dominant left hemisphere causes no major change in cognitive functions. Following surgery, the integrated 1200-1300 g brain becomes two isolated 600-650 g brains, each about the size of a chimpanzee brain. The left remains unchanged from its preoperative capacity, while the largely disconnected, same-size right hemisphere is seriously impoverished on a variety of tasks (Gazzaniga and Smylie 1984). While the largely isolated right hemisphere remains superior to the isolated left hemisphere for some activities, such as the recognition of upright faces, some attentional skills, and perhaps also emotional processes, it is poor at problem solving and many other mental activities (Gazzaniga 1989). A brain system (the right hemisphere) with roughly the same number of neurons as one that easily cognates (the left hemisphere) is not capable of higher-order cognition. This represents strong evidence that simple cortical cell number cannot fully explain human intelligence.

3 Brain Asymmetry and Language Processes
Perhaps the most influential and dominant idea that more cortical area means higher-level function comes from the work of Geschwind and Levitsky (1968). Over the past 25 years, their report that the left hemisphere has a larger planum temporale solidified the belief that somehow more brain area meant higher-level function. Specifically, they concluded their classic paper by stating:

Our data show that this area is significantly larger on the left side, and the differences observed are easily of sufficient magnitude to be compatible with the known functional asymmetries.

Since this classic finding makes a strong case for the relationship between cortical area and function, we have recently re-examined the issue of whether the left planum temporale is larger than the right planum. With "standard" 3D magnetic resonance reconstructions of normal brains, careful measurement of the posterior temporal region using the same methods as Geschwind and Levitsky found approximately the same percentage of brains showing apparent left-larger asymmetry. However, this measure is not a true 3D reconstruction since it does not take into account the natural curvature of the cortical surface from one coronal slice to another. When we used a true 3D reconstruction algorithm on this cortical region, we found that the cortical surface area of the region is not reliably
asymmetrical (Loftus et al. 1993a). As many brains showed a larger cortical surface area in the right as in the left hemisphere in our sample of 10 brains.

4 Brain Asymmetries and Individual Differences
Using the true 3D reconstruction algorithm we have also now examined some 27 other regions for possible reliable asymmetries (Loftus et al. 1993b). In brief, magnetic resonance (MR) images were acquired of 13 young, normal, right-handed males. Computer representations of the cortical surface in the 26 hemispheres were reconstructed from the images using previously established methods that have proven to be highly reliable (Jouandet et al. 1989, 1990). Twenty-seven gyri in the left and right hemispheres were identified on each subject's MR images, and the surface areas in the corresponding portions of the model hemispheres were measured. For each subject and for each region, a left-right asymmetry score was computed based on the difference in surface area of the left and right homologues. A region was classified as asymmetric if the side difference was larger than 20%. The asymmetry scores of the 27 regions in a single brain constitute a subject's hemispheric asymmetry profile. The number of asymmetric regions in each profile ranged from 5 to 14. A collective asymmetry profile was based on the mean asymmetry scores of all the subjects for each region. None of these means reached criterion. At the same time, however, all the subjects showed asymmetries scattered throughout the cortex; the unique pattern of those asymmetries in each individual profile resulted in a mean profile with no asymmetry. Clearly, since all of these subjects were healthy adults with normal cognitive skills, the particular pattern of morphometric asymmetries found in any one individual cannot explain a physical basis for those skills. These data suggest that the simplistic idea that greater cortical area on one side reflects particular functions is wrong. A second subject with the same cognitive skills might well have a wholly different pattern of asymmetries. Again, the answers must lie in the nature of specialized circuits.
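The profile computation just described lends itself to a short sketch. The per-region left-right comparison and the 20% criterion come from the text above; the exact normalization of the asymmetry score was not specified in the original report, so the difference-over-mean form used here, and the illustrative surface areas, are assumptions for illustration only.

```python
# Hypothetical sketch of the hemispheric asymmetry profile described above.
# The 20% criterion is from the text; the normalization (difference relative
# to the mean regional area) is an assumed form, not the published one.

def asymmetry_score(left_area, right_area):
    """Signed left-right asymmetry as a fraction of the mean regional area."""
    mean_area = (left_area + right_area) / 2.0
    return (left_area - right_area) / mean_area

def asymmetry_profile(left_areas, right_areas, criterion=0.20):
    """Classify each region as left-larger ('L'), right-larger ('R'),
    or not asymmetric ('-') under the given criterion."""
    profile = []
    for left, right in zip(left_areas, right_areas):
        score = asymmetry_score(left, right)
        if score > criterion:
            profile.append('L')
        elif score < -criterion:
            profile.append('R')
        else:
            profile.append('-')
    return profile

# Two subjects with opposite individual asymmetries: each has a clear
# profile, yet their mean scores cancel, illustrating how a collective
# profile can show no asymmetry at all.
subject_a = asymmetry_profile([130, 90], [100, 120])   # ['L', 'R']
subject_b = asymmetry_profile([100, 120], [130, 90])   # ['R', 'L']
```

The toy pair of subjects makes the section's point concrete: individually patterned asymmetries can average out to a flat group profile.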
With the view being put forward here, it becomes important to try to specify what is meant by the idea of specialized circuits. Are these proposed differences at the neuroanatomic systems level or at the more physiologic synaptic level of organization, or both? At this point in our understanding of the anatomy and physiology of the nervous system, it is premature to lay out how particular local circuit features might yield differences in network functions. However, there is increasing anatomical evidence of ample candidates that could explain differences in function. For example, local circuits in various cortical areas can differ in their morphological constituents, in their chemical organization, and in the details of their connectivity. These factors can vary between cortical areas
in the same organism, between homologous cortical areas of different organisms, and over the course of the life span. There is also direct evidence for species differences existing at the level of basic anatomy. A brief review of some of this literature suggests that there are qualitative differences between the nonhuman primate and the human brain, differences that might explain the differences in capacities of the two species.

5 Possible Physiologic and Anatomical Differences in Synaptic Function
Many lines of research suggest that cortical areas within a given species contain differing proportions of morphologically and neurochemically defined cell types. For example, primary and secondary visual, somatosensory, and auditory cortices have been shown to express differing distributions of calbindin- and tachykinin-immunoreactive fibers (DeFelipe et al. 1990), and the density of parvalbumin-containing chandelier cells differs between prefrontal and visual cortical regions (Lewis and Lund 1990). It has also been recently reported that there is a unique population of large pyramidal neurons in the left Brodmann's area 45 that may be related to this area's involvement in speech (Hayes and Lewis 1993). Differences in cortical connectivity exist between species, and it has been suggested that these differences may reflect the niche in which an organism exists. The squirrel monkey and bush baby show species differences in the connections of the interblob regions of their visual cortices. In the bush baby, layer IIIB nonblobs receive input from lamina IV alpha, while in the squirrel monkey this layer receives input from lamina IV beta. This difference alters the inputs to lamina IIIB from magnocellular in the bush baby to parvocellular in the squirrel monkey (Lachica et al. 1993). Finally, there are some fascinating possible clues emerging from recent work on human brain tissue. For example, there have been suggestions that dendritic spines in the human might have different physiological properties from those seen in other animals. Shepherd and his colleagues have studied presumed normal cortical tissue removed from epileptic patients (Williamson et al. 1993). Comparing the membrane and synaptic properties of human and rodent dentate granule cells, several differences were noted.
First, there was less spike frequency adaptation in the human relative to the rodent, and second, the human tissue showed feedback inhibition while the rodent tissue showed both feedforward and feedback inhibition. The differences noted are consistent with neuronal modeling work Shepherd and his colleagues (Shepherd et al. 1989) have carried out. This work suggests that by simply adding a few calcium channels on the "dendritic" spine, vastly
different and more complex computational capacities can result, with the spines allowing for a greater information processing capability. These early studies are only suggestive, but they are exciting and may point the way to new ways of thinking about possible differences in the basic physiology of neurons between species.

6 System Level Differences in Cortical Anatomical Organization
In our own work we have shown how the nonhuman primate and human visual systems have different organizational properties. Specifically, when comparing other primates and humans with the anterior commissure intact but with the corpus callosum sectioned, visual information is seen to transfer easily in the monkey, but not in humans (Gazzaniga 1988). This suggests there is a marked difference between the two species with respect to how visual information transfers between the two cerebral hemispheres. We have also shown that lesions to primary visual cortex in humans render patients blind (Holtzman 1984; Fendrich et al. 1992), whereas monkeys with similar lesions are capable of residual vision (Pasik and Pasik 1982). When residual vision is seen in the human, as is the case with so-called "blindsight," we have argued that it reflects incomplete damage to the primary visual cortex. When residual vision is seen in the monkey it must reflect capacities of other secondary visual system processes (Fendrich et al. 1992; Gazzaniga et al. 1994). While there are many examples of system-level differences between primates and other lower animals (for review see Preuss 1993), less attention has been paid to differences between nonhuman primates and humans such as those just described. Yet the observations described above mandate that there are differences in anatomical organization, even though the monkey visual system and the human visual system have virtually identical sensory capacities (see Harwerth et al. 1993). Careful psychophysical measurement of acuity, color, and other parameters reveals identical sensitivities. Additionally, at the level of basic anatomical processes, both also have approximately 1.2 million retinal ganglion cells (Curcio and Allen 1990). And even though the gray matter volume of human primary visual cortex, the area striata, is three times larger than it is in Macaca mulatta and five times larger than it is for the owl monkey, Aotus (Frahm et al. 1984), V1 has the same number of cells in both the rhesus monkey and human brain (Williams 1993). To explain the differences between monkey and human behavior, one has to consider possible differences that might exist at the level of basic neuronal organization of the visual system, given the results of the studies on the anterior commissure and the V1 lesion work. It remains to be determined whether these differences are to be understood in terms of connectivity of major processing areas or at the level of synaptic function.
It is known, for example, that V1 in humans has greater striation than in the monkey, thereby suggesting greater dendritic density and hence basic differences in neural organization.

7 General Discussion
The practice of arguing from similarities between species has been criticized by many. Perhaps Stott (1983) says it best when summarizing his complaints about studies on the relationship between brain size and intelligence:

The first objection that may be made to this reasoning is that extrapolation from interspecific to intraspecific differences is an offense against the realities of evolution. Each species has developed behavioural capabilities which were advantageous for its survival, and as such these would be common to all normal individuals. The capabilities of each species differ qualitatively according to the ecological niche in which each evolved. The application of the human-centered concept of intelligence to these essentially incomparable capabilities is naive anthropomorphism. All attempts to produce by selection a generally more intelligent strain within an animal species have met with failure. The strain selected for maze running would as likely as not come 'at the bottom of the class' for discrimination learning, and so on. The human species developed a larger brain along with the necessity of operating in more complex ways in a larger range of situations. It is therefore reasonable to assume that every organically intact human brain has the brain capacity for the development of the distinctively human capabilities, irrespective of the small variations in head size which are mainly an aspect of body size.

And yet, while Stott seems to have it right on how mere brain size cannot explain the unique capacities of the human, Pinker has recently lamented that although Chomsky regards language as deeply biological in nature, Chomsky does not believe it is a product of natural selection. Chomsky leaves open the possibility that it is instead a concomitant of massive interactions of millions of neurons.
Consider Pinker's assessment (Pinker 1994):

If Chomsky maintains that grammar shows signs of complex design, but is skeptical that natural selection manufactured it, what alternative does he have in mind? What he repeatedly mentions is physical law. Just as the flying fish is compelled to return to the water and calcium-filled bones are compelled to be white, human brains might, for all we know, be compelled to contain circuits for Universal Grammar. He writes: "These skills [e.g., learning a grammar] may well have arisen as a concomitant of structural properties of the brain that developed for other reasons. Suppose that there was selection for bigger brains, more cortical surface, hemispheric specialization for analytic processing, or many other structural properties that can be imagined. The brain that evolved might well have all sorts of special properties that are not individually selected; there would be no miracle in this, but only the normal workings of evolution. We have no idea, at present, how physical laws apply when 10¹⁰ neurons are placed in an object the size of a basketball, under the special conditions that arose during human evolution." We may not, just as we don't know how physical laws apply under the special conditions of hurricanes sweeping through junkyards, but the possibility that there is an undiscovered corollary of the laws of physics that causes human-sized and shaped brains to develop the circuitry for Universal Grammar seems unlikely for many reasons. At the microscopic level, what set of physical laws could cause a surface molecule guiding an axon along a thicket of glial cells to cooperate with millions of other such molecules to solder together just the kinds of circuits that would compute something as useful to an intelligent social species as grammatical language? The vast majority of the astronomical ways of wiring together a large neural network would surely do something else: bat sonar, or nest-building, or go-go dancing, or, most likely of all, random neural noise. At the level of the whole brain, the remark that there has been selection for bigger brains is, to be sure, common in writings about human evolution (especially from paleoanthropologists). Given that premise, one might naturally think that all kinds of computational abilities might come as a by-product. But if you think about it for a minute, you should quickly see that the premise has to have it backwards. Why would evolution ever have selected for sheer bigness of brain, that bulbous, metabolically greedy organ? A large-brained creature is sentenced to a life that combines all the disadvantages of balancing a watermelon on a broomstick, running in place in a down jacket, and, for women, passing a large kidney stone every few years. Any selection on brain size itself would surely have favored the pinhead. Selection for more powerful computational abilities (language, perception, reasoning, and so on) must have given us a big brain as a by-product, not the other way around!
Neuroscientists have had a hard time accepting the view that big brains may come as a by-product of other processes active in establishing the uniqueness of each species' nervous system. Yet basic biologists have known for years how specialized circuits define the differences between fish and reptile, reptile and mammal, snail and octopus, worm and jellyfish, and so on (see Bullock 1993). It seems only logical that such processes would contribute to defining the neural processes supporting unique human capacities, especially language. Big brains (corrected for body size) may get bigger because they collect more specialized circuits. There are certainly a multitude of commonalities among all species, and these provide the strength of much of biological research. At the same time, there are crucial differences between species, such as those reviewed here, and in the present context we find work in human brain research suggesting that unique aspects of human behavior may be supported by specialized neural circuitry. Thus, I am arguing that major clues to understanding how the brain enables human cognitive function will come from understanding the microcircuitry of the human brain.
Acknowledgments

Aided by NIH Grants NINDS 5 R01 NS22626-09 and NINDS 5 P01 NS17778-012, and the James S. McDonnell Foundation.
References

Arbas, E. A., Meinertzhagen, I. A., and Shaw, S. R. 1991. Evolution in nervous systems. Annu. Rev. Neurosci. 14, 9-38.
Bullock, T. H. 1993. How are more complex brains different? One view and an agenda for comparative neurobiology. Brain Behav. Evol. 41(2), 88-96.
Curcio, C. A., and Allen, K. A. 1990. Topography of ganglion cells in human retina. J. Comp. Neurol. 300, 5-25.
Darwin, C. 1981. The Descent of Man. Princeton University Press (facsimile edition), Princeton, NJ.
DeFelipe, J., Hendry, S. H. C., Hashikawa, T., Molinari, M., and Jones, E. G. 1990. A microcolumnar structure of monkey cerebral cortex revealed by immunocytochemical studies of double bouquet cell axons. Neuroscience 37, 655-673.
Fendrich, R., Wessinger, C. M., and Gazzaniga, M. S. 1992. Residual vision in a scotoma: Implications for blindsight. Science 258, 1489-1491.
Frahm, H. D., Stephan, H., and Baron, G. 1984. Comparison of brain structure volumes in Insectivora and primates of area striata (AS). J. Hirnforsch. 25, 537-557.
Gazzaniga, M. S. 1965. Some effects of cerebral commissurotomy in monkey and man. Diss. Abstr. 26.
Gazzaniga, M. S. 1988. Interhemispheric integration. In Dahlem Conference, P. Rakic, ed. John Wiley, New York.
Gazzaniga, M. S. 1989. Organization of the human brain. Science 245, 947-952.
Gazzaniga, M. S. 1992. Nature's Mind. Basic Books, New York.
Gazzaniga, M. S., and Smylie, C. S. 1984. Dissociation of language and cognition: A psychological profile of two disconnected right hemispheres. Brain 107, 145-153.
Gazzaniga, M. S., Wessinger, C. M., and Fendrich, R. 1994. Blindsight reconsidered. Contemp. Issues Psychol. 3(3), 93-96.
Geschwind, N., and Levitsky, W. 1968. Human brain: Left-right asymmetries in temporal speech region. Science 161, 186-187.
Harwerth, R. S., Smith III, E. L., and De Santis, L. 1993. Behavioral perimetry in monkeys. Invest. Ophthalmol. Vis. Sci. 34(1), 31-40.
Hayes, T. L., and Lewis, D. A. 1993. Hemispheric differences in layer III pyramidal neurons of the anterior language area. Arch. Neurol. 50, 501-505.
Holtzman, J. D. 1984. Interactions between cortical and subcortical visual areas: Evidence from human commissurotomy patients. Vis. Res. 24, 801-813.
James, W. 1890. Principles of Psychology. Henry Holt, New York.
Jouandet, M. L., Tramo, M. J., Herron, D. M., Hermann, A., Loftus, W. C., Bazell, J., and Gazzaniga, M. S. 1989. Brainprints: Computer-generated two-dimensional maps of the human cerebral cortex in vivo. J. Cog. Neurosci. 1, 88-117.
Jouandet, M. L., Tramo, M. J., Thomas, C. E., Newton, C. H., Loftus, W. C., Weaver, J. B., and Gazzaniga, M. S. 1990. Brainprints: Inter- and intraobserver reliability. Soc. Neurosci. Abstr. 16, 1151.
Lachica, E. A., Beck, P. D., and Casagrande, V. A. 1993. Intrinsic connections of layer III of striate cortex in squirrel monkey and bush baby: Correlations with patterns of cytochrome oxidase. J. Comp. Neurol. 328, 163-187.
LeDoux, J. E., Risse, G., Springer, S., Wilson, D. H., and Gazzaniga, M. S. 1977. Cognition and commissurotomy. Brain 100, 87-104.
Lewis, D. A., and Lund, J. S. 1990. Heterogeneity of chandelier neurons in monkey neocortex: Corticotropin-releasing factor and parvalbumin-immunoreactive populations. J. Comp. Neurol. 293, 599-615.
Loftus, W. C., Tramo, M. J., Thomas, C. E., Green, R. L., Nordgren, R. A., and Gazzaniga, M. S. 1993a. Three-dimensional quantitative analysis of hemispheric asymmetry in the human superior temporal region. Cerebral Cortex 3(4), 348-385.
Loftus, W. C., Hutsler, J. J., and Gazzaniga, M. S. 1993b. Averaged brains are not real brains: Demonstration of human brain variability with respect to anatomical asymmetry. Soc. Neurosci. Abstr. 19, 559.
Nass, R., and Gazzaniga, M. S. 1987. Lateralization and specialization of the human central nervous system. In Handbook of Physiology, F. Plum, ed., pp. 701-761. The American Physiological Society, Bethesda, MD.
Pasik, P., and Pasik, T. 1982. Visual functions in monkeys after total removal of visual cerebral cortex. Contrib. Sensory Physiol. 7, 147-200.
Passingham, R. E. 1981. The Human Primate. W. H. Freeman, Oxford and San Francisco.
Phelps, E. A., Hirst, W., and Gazzaniga, M. S. 1991. Deficits in recall following partial and complete commissurotomy. Cerebral Cortex 1, 492-498.
Preuss, T. M. 1993. The role of neurosciences in primate evolutionary biology: Historical commentary and prospectus. In Primates and Their Relatives in Phylogenetic Perspective, R. D. E. MacPhee, ed. Plenum Press, New York.
Shepherd, G. M., Woolf, T. B., and Carnevale, N. T. 1989. Comparisons between active properties of distal dendritic branches and spines: Implications for neuronal computations. J. Cog. Neurosci. 1, 273-286.
Stott, D. 1983. Brain size and 'intelligence.' Br. J. Dev. Psychol. 1, 279-287.
Willerman, L., Schultz, R., Rutledge, J. N., and Bigler, E. D. 1991. In vivo brain size and intelligence. Intelligence 15, 223-228.
Williams, R. 1993. Personal communication.
Williamson, A., Spencer, D. D., and Shepherd, G. M. 1993. Comparison between the membrane and synaptic properties of human and rodent dentate granule cells. Brain Res. 622, 194-202.
Zaidel, E. 1990. Language functions in the two hemispheres following complete cerebral commissurotomy and hemispherectomy. In Handbook of Neuropsychology, Vol. 4, F. Boller and J. Grafman, eds. Elsevier Science Publishers B.V. (Biomedical Division), Amsterdam.
Received September 1, 1993; accepted May 16, 1994.
Communicated by Michael Jordan
NOTE
The EM Algorithm and Information Geometry in Neural Network Learning

Shun-ichi Amari
Department of Mathematical Engineering, University of Tokyo, Bunkyo-ku, Tokyo 113, Japan

Hidden units play an important role in neural networks, although their activation values are unknown in many learning situations. The EM algorithm (a statistical algorithm) and the em algorithm (an information-geometric one) have been proposed in this connection, and the effectiveness of such algorithms is recognized in many areas of research. The present note points out that these two algorithms are equivalent under a certain condition, although they are different in general.
1 Hidden Variables in Stochastic Neural Net
The behavior of a neural network is specified by the relation between input and output signals, although hidden neurons play a fundamental role in it. However, when we design a neural network or specify an adequate learning rule, the roles and the values of the hidden variables are often unknown, so that we need to estimate them from the observable input-output data. This is an important and interesting problem in the theory of neural computation.

Let us consider a probability model for a neural network whose structural parameters (synaptic weights and thresholds) are summarized in a vector form u = (u_1, ..., u_n). Given an input vector signal x, both the output vector y and the hidden variable vector z are stochastically determined from x and u. In other words, the whole behavior of the network is specified by the conditional probability p(y, z | x; u), and the input-output relation is given by the marginal distribution p(y | x; u) = Σ_z p(y, z | x; u). This is a very simple example, and we often use much more complex types of hidden variables. When the probability model is of the exponential family type, it has a sufficient statistic s which is a vector function of hidden random variables r_h and observable random variables r_v,
s = s(r_v, r_h)

Neural Computation 7, 13-18 (1995)
© 1994 Massachusetts Institute of Technology
Two different methods have so far been proposed to solve such hidden variable problems. One is the EM algorithm (Expectation and Maximization), originating from statistics (Jordan and Jacobs 1994). It is applied to the hierarchical mixture model of expert networks (Jordan and Jacobs 1994). The other is the em algorithm (e- and m-geodesic projections), originating from information geometry (Amari 1991; Amari et al. 1992; Byrne 1992) and applied to Boltzmann machines. [It was later found that a not well known paper by Csiszár and Tusnády (1984) had already proposed the em algorithm.] The present note shows that these two methods are equivalent under a certain condition. See Amari (1994) for details.
2 EM Algorithm
Let us consider the set S consisting of all the conditional probabilities p(y, z | x). Here, we assume that S is an exponential family of distributions. In this case, a sufficient statistic s exists, and, when s is observed, the maximum likelihood estimator (m.l.e.) determines a distribution in S. However, not all the distributions in S are realizable by neural networks. Neural networks can realize conditional distributions of the form p(y, z | x; u) specified by the network parameter u. The set of such realizable distributions is a subset of S. It forms a neural network submanifold N embedded in S, where u is a coordinate system of N.

When T examples (y_t, z_t; x_t), t = 1, ..., T, are observed, we have the m.l.e. distribution p̂ determined from the sufficient statistic s in S (or in the product space S_1 × ... × S_T when the distributions are not identical, depending on t). However, this p̂ does not in general belong to N. The maximum likelihood estimate û is calculated from s or p̂.

When the z_t are missing, we cannot obtain p̂ but know candidates of distributions where the y_t are known but the z_t may be arbitrarily assigned. That is, in the sufficient statistic s(r_v, r_h), r_v is observed but r_h may be assigned arbitrarily. In this case, the given partial data (y_t, x_t) defines the set of candidate distributions p(r_v, r_h) where r_h are arbitrary. Such candidates form a submanifold D in S. The true p̂ should lie in D (see Fig. 1).

The EM algorithm is as follows. Let p_i and u_i be the candidates p_i ∈ D, u_i ∈ N at the ith step (i = 1, 2, ...). The initial p_1 is chosen arbitrarily from D.

M-step: u_{i+1} is the m.l.e. from the current distribution p_i in D.

E-step: p_{i+1} is given by substituting the conditional expectation E[s | r_v; u_{i+1}], conditioned on the observed partial data r_v, for the unknown s.

It is proved that this procedure converges locally to the m.l.e. from the observed data.
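To make the two steps concrete, here is a minimal sketch of EM for a one-dimensional two-component gaussian mixture, where the hidden variable z_t is the component label of sample x_t. The model, the initialization, and all names are illustrative and not taken from the note:

```python
import math, random

def em_gaussian_mixture(xs, n_iter=200):
    """Sketch of EM for a 1-d two-component gaussian mixture.
    The hidden variable z_t is the (unobserved) component label of x_t.
    E-step: replace the missing z_t by their conditional expectations
            (posterior responsibilities) under the current parameters.
    M-step: maximum likelihood re-estimation given those expectations."""
    pi = 0.5                                  # mixing weight of component 1
    mu = [min(xs), max(xs)]                   # illustrative initialization
    sigma2 = [1.0, 1.0]
    for _ in range(n_iter):
        # E-step: posterior responsibility of component 1 for each sample
        r = []
        for x in xs:
            d0 = math.exp(-(x - mu[0]) ** 2 / (2 * sigma2[0])) / math.sqrt(sigma2[0])
            d1 = math.exp(-(x - mu[1]) ** 2 / (2 * sigma2[1])) / math.sqrt(sigma2[1])
            r.append(pi * d1 / ((1 - pi) * d0 + pi * d1))
        # M-step: m.l.e. given the expected hidden labels
        n1 = sum(r)
        n0 = len(xs) - n1
        pi = n1 / len(xs)
        mu[1] = sum(ri * x for ri, x in zip(r, xs)) / n1
        mu[0] = sum((1 - ri) * x for ri, x in zip(r, xs)) / n0
        sigma2[1] = sum(ri * (x - mu[1]) ** 2 for ri, x in zip(r, xs)) / n1 + 1e-9
        sigma2[0] = sum((1 - ri) * (x - mu[0]) ** 2 for ri, x in zip(r, xs)) / n0 + 1e-9
    return pi, mu, sigma2
```

The local convergence noted above applies here too: the fixed point reached depends on the initial guess drawn from D.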
Figure 1: EM and em algorithms.
3 EM Algorithm and Information Geometry
Information geometry is a new differential geometry introduced naturally in the manifold of probability distributions (Amari 1985; see also Amari and Han 1989; Murray and Rice 1993). It defines two dually coupled geodesics: the e-geodesic and the m-geodesic. When two probability distributions p_0(x) and p_1(x) on a random variable x are connected by their mixture,
p_t(x) = (1 − t) p_0(x) + t p_1(x),

the curve p_t(x), where t is the parameter of the curve, is called the m-geodesic in the manifold S = {p(x)} of all the probability distributions. When they are connected by
p_t(x) = c(t) p_0(x)^{1−t} p_1(x)^t,

where c(t) is the normalization constant, the curve is called the e-geodesic. They can be generalized to the case of conditional distributions in a similar way.
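In the discrete case both geodesics are easy to compute. The following sketch (hypothetical helper names) connects two distributions on three atoms and normalizes the e-geodesic by c(t):

```python
def m_geodesic(p0, p1, t):
    """Mixture (m-)geodesic: p_t(x) = (1 - t) p0(x) + t p1(x)."""
    return [(1 - t) * a + t * b for a, b in zip(p0, p1)]

def e_geodesic(p0, p1, t):
    """Exponential (e-)geodesic: p_t(x) = c(t) p0(x)^(1-t) p1(x)^t,
    with c(t) chosen so that p_t sums to one."""
    unnorm = [a ** (1 - t) * b ** t for a, b in zip(p0, p1)]
    z = sum(unnorm)                  # z = 1 / c(t)
    return [u / z for u in unnorm]

# two illustrative distributions on three atoms
p0 = [0.7, 0.2, 0.1]
p1 = [0.1, 0.3, 0.6]
```

The m-geodesic needs no normalization (mixtures of distributions are distributions), while the e-geodesic does; this asymmetry is exactly the duality exploited below.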
Now the em algorithm proposed by Csiszár and Tusnády (1984) and Amari et al. (1992) is as follows, where the Fisher information metric is used to define orthogonality.

m-step: Project p_i orthogonally to the manifold N by the m-geodesic. This gives u_{i+1}.

e-step: Project u_{i+1} orthogonally to the manifold D by the e-geodesic. This gives p_{i+1}.

A nice property of the em algorithm is that the m-projection is the one minimizing the Kullback-Leibler divergence D(p_i ‖ p_u) over p_u ∈ N and that the e-projection is the one minimizing D(p ‖ p_{u_{i+1}}) over p ∈ D. So this can be written in a dual gradient-descent form (see also Neal and Hinton 1994).

It is believed that the EM and em algorithms are equivalent (Amari et al. 1992; Byrne 1992; Neal and Hinton 1994; Csiszár and Tusnády 1984). However, they are not equivalent in general. We have the following new theorem.

Theorem. The EM and em algorithms are equivalent when D is m-flat (that is, the m-geodesic connecting two points of D is included in D) and the conditional expectation E[s | r_v; q] at the distribution q ∈ D is linear in r_v.

The condition of the theorem holds asymptotically when the number of observations is large. It also holds when all the random variables are discrete. We give an example in which the two algorithms are different (Appendix 1). The proof of the theorem is sketched in Appendix 2.

4 Conclusions
We have shown the condition guaranteeing the equivalence of the EM and em algorithms. It is a pleasant surprise that they happen to have such similar names. It is interesting to see that they are different in general. Information geometry is expected to elucidate the global structure and to lead to new learning algorithms.

The algorithm is applied to the multilayer perceptron to give a new learning rule by Amari and independently by D. Rumelhart (personal communication). We analyzed the case of the ordinary one-hidden-layer, single-output-unit analog perceptron, in which small normal noises with variance σ² are added to the neurons. The loss function to be minimized is the squared sum of the differences of the outputs o_i = f_i and the targets t_i in the conventional case. But the stochastic model automatically gives the loss
when σ is small, where f'_i is the derivative of the sigmoid output function evaluated at the ith input. This shows that smaller losses are automatically assigned to those signals whose outputs are saturated. This suggests promising features of stochastic modeling to be studied further.

Appendix 1. An Example in which the EM and em Algorithms Are Different

Let x_1 and x_2 be two independent random variables subject to the normal distribution N(u, u²), that is, with mean u and variance u². The statistics
x̄ = (x_1 + x_2)/2,   s = (x_1² + x_2²)/2
are sufficient. We assume that s_v = x̄ is observed but s_h = s is hidden. The EM algorithm gives the m.l.e. û = (√3 − 1) x̄ ≈ 0.732 x̄. In this case, the manifold S is the set of normal distributions with coordinates (μ, σ²). The model N is given by the parabola μ = u, σ² = u² in S. The observed x̄ gives the candidate vertical line D: μ = x̄, σ² arbitrary, because s is unknown. N and D intersect at (x̄, x̄²), giving the minimum of the divergence. Hence, the em algorithm gives u* = x̄, different from the m.l.e.
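The gap between the two answers is easy to check numerically. The observed mean satisfies x̄ ~ N(u, u²/2), so maximizing its likelihood over u by a crude grid search (illustrative ranges, not part of the note) recovers the m.l.e. (√3 − 1) x̄ ≈ 0.732 x̄, while the em algorithm's answer is u* = x̄:

```python
import math

def mle_from_mean(xbar, lo=0.05, hi=5.0, steps=200000):
    """Grid search (illustrative ranges) maximizing the likelihood of the
    observed mean xbar ~ N(u, u^2/2) over u > 0.  Up to constants the
    log-likelihood is l(u) = -log u - (xbar - u)^2 / u^2."""
    best_u, best_l = lo, -float("inf")
    for i in range(1, steps + 1):
        u = lo + (hi - lo) * i / steps
        l = -math.log(u) - (xbar - u) ** 2 / u ** 2
        if l > best_l:
            best_u, best_l = u, l
    return best_u

xbar = 1.0
u_EM = mle_from_mean(xbar)   # what the EM algorithm converges to (the m.l.e.)
u_em = xbar                  # what the em algorithm gives (u* = xbar)
```

The two estimates differ by more than 25 percent here, which is the point of the example: D is m-flat, but the conditional expectation of the hidden statistic is not linear in the observed one.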
Appendix 2. Sketch of the Proof of the Theorem

When D is m-flat, the sufficient statistic s decomposes into visible and hidden parts s = (r_v, r_h). Let p_u be a point in N and let q* be the e-projection of p_u to D. It is shown that the e-projection is the point in D minimizing D(q ‖ p_u) (see Amari 1985). We then show that the e-projection keeps the conditional probability p(r_h | r_v) invariant, that is, p_u and q* have the same conditional distribution. This is proved from the decomposition of the Kullback divergence,

D(q ‖ p_u) = D(q(r_v) ‖ p_u(r_v)) + E_{q(r_v)}[ D(q(r_h | r_v) ‖ p_u(r_h | r_v)) ],
where the first term of the right-hand side is the divergence with respect to the marginal distributions of r_v and the second is with respect to the conditional distributions of r_h conditioned on r_v. The second term is minimized when the conditional distribution at q is equal to that at p_u, but the first term cannot be made free because q ∈ D. This proves that the e-projection q* is the one having the same conditional probability and hence the same conditional expectation of r_h. This is the e-step.

The E-step of the EM algorithm replaces the missing r_h by its conditional expectation r̂_h = E[r_h | r_v; u]. However, the e-projection uses the unconditional expectation of r_h at q* to define the guessed data. So they
are equivalent when and only when the conditional and unconditional expectations coincide at any point on D defined by the observed data. This leads to the conditions of the theorem.
References

Amari, S. 1985. Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics 28, Springer.
Amari, S. 1991. Dualistic geometry of the manifold of higher-order neurons. Neural Networks 4, 443-451.
Amari, S. 1994. Information Geometry of the EM and em Algorithms for Neural Networks. METR 94-4, University of Tokyo.
Amari, S., and Han, T. S. 1989. Statistical inference under multiterminal rate restrictions: A differential geometrical approach. IEEE Trans. Information Theory IT-35, 217-227.
Amari, S., Kurata, K., and Nagaoka, H. 1992. Information geometry of Boltzmann machines. IEEE Trans. Neural Networks 3(2), 260-271.
Byrne, W. 1992. Alternating minimization and Boltzmann machine learning. IEEE Trans. Neural Networks 3, 612-620.
Csiszár, I., and Tusnády, G. 1984. Information geometry and alternating minimization procedures. In Statistics and Decisions, E. F. Dedewicz et al., eds., Supplementary issue, pp. 205-237. Oldenburg Verlag, Munich.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6, 181-214.
Murray, M. K., and Rice, J. W. 1993. Differential Geometry and Statistics. Chapman and Hall, London.
Neal, R. M., and Hinton, G. E. 1994. A new version of the EM algorithm that justifies incremental and other variants. To appear.
Received November 12, 1993; accepted April 11, 1994
NOTE
Communicated by Steve Nowlan
Convergence Theorems for Hybrid Learning Rules Michel Benaim Department of Mathematics, University of California at Berkeley, Berkeley, CA 94720 USA
1 Introduction
Several heuristic hybrid algorithms for feedforward neural networks, which combine unsupervised learning of hidden units with supervised learning of output units [see, e.g., Moody and Darken (1989); Nowlan (1990); Poggio and Girosi (1990); Benaim and Tomasini (1992); Benaim (1994)], have recently been proposed. The purpose of this note is to present some convergence theorems for such learning rules.

2 Hybrid Learning Rules
Consider a one-hidden-layer neural network with k input units, l hidden units, and one output. We let v ∈ (R^k)^l denote the weight matrix from the input layer to the hidden layer and w ∈ R^l denote the weight vector from the hidden layer to the output unit. With a hybrid algorithm, v is trained according to an unsupervised rule,

v_{n+1} = v_n − γ_{n+1} ∇_v C(v_n, X_{n+1})   (2.1)

where {X_n}_{n≥0} ⊂ R^k is a sequence of input patterns and {γ_n}_{n≥0} ⊂ R_+ is a sequence of learning rates. The function C: R^{kl} × R^k → R_+ is a local cost associated to the unsupervised algorithm. Such an algorithm can be, for example, a k-means algorithm as in Moody and Darken (1989) or a "soft competitive algorithm" based on a maximum likelihood principle as in Nowlan (1990), Benaim and Tomasini (1991, 1992), Marroquin and Girosi (1993), and Benaim (1994), among others. The output weight vector w is trained according to a supervised rule:
w_{n+1} = w_n − γ_{n+1} ∇_w D(v_n, w_n, X_{n+1}, Y_{n+1})   (2.2)

where Y_n ∈ R is the desired output (target) of the network when X_n is given as input and D: R^{kl} × R^l × R^k × R → R_+ is a local cost function that measures the distortion between the network's output and the target.

Neural Computation 7, 19-24 (1995)
© 1994 Massachusetts Institute of Technology
3 Convergence Results
We suppose that the training set (i.e., the sequence of inputs and targets presented to the network) is described by a joint probability law ν(dx, dy) over R^k × R. The probability density of the input data is the marginal over R^k:

μ(dx) = ∫_R ν(dx, dy)
To analyze the asymptotic behavior of the hybrid rule (2.1, 2.2) we introduce the averaged ordinary differential equation (ODE):

dv/dt = −∇C̄(v)   (3.1)

dw/dt = −∇_w D̄(v, w)   (3.2)

where

C̄(v) = ∫ C(v, x) μ(dx)

and

D̄(v, w) = ∫ D(v, w, x, y) ν(dx, dy)
It is clear that such an ODE is not given by a gradient vector field, as is the case for most nonhybrid algorithms. Therefore the classical results on stochastic gradients [see, e.g., White (1989)] cannot be applied to prove the convergence of 2.1 and 2.2. For the sake of simplicity we make the following assumptions:

i. The maps C and D are C¹.

ii. The sequence {(X_n, Y_n)}_{n≥0} is a sequence of independent identically distributed random variables having ν as probability law.

iii. Σ_n γ_n = ∞.

iv. There exists δ > 0 such that Σ_n γ_n^{1+δ} < ∞.

v. There exists a compact set K ⊂ R^{kl} × R^l such that the sequence {(v_n, w_n)}_{n≥0} solution to 2.1 and 2.2 remains in K with probability one.
The theorems given below follow from Benaim (1993a,b). An outline of the proof is given in the appendix.

Theorem 1. If the equilibria of 3.1 and 3.2 are isolated, then any sequence {(v_n, w_n)}_{n≥0} solution to 2.1 and 2.2 converges with probability one toward an equilibrium of 3.1 and 3.2.
It is often assumed that the output unit is linear and trained according to a least mean square minimization. In this situation it may happen that the equilibria of 3.1 and 3.2 are never isolated. The next theorem is devoted to this case. Let H_i(x, v) denote the value of the ith hidden unit when x is given as input. Let H(x, v) = [H_1(x, v), ..., H_l(x, v)]^T denote the vector of hidden units. We assume that
vi. The network's output is given as the weighted sum

o = ⟨w, H(x, v)⟩ = Σ_{i=1}^{l} w_i H_i(x, v).

vii. The error function D is the quadratic error

D(v, w, x, y) = ½ ‖o − y‖².

Under this set of assumptions, equation 3.2 has the particular form:
dw/dt = −A(v)w + B(v)   (3.3)
where A(v) is the l × l matrix defined by

A(v) = ∫ H(x, v) H(x, v)^T μ(dx)

and B(v) is the l-dimensional vector

B(v) = ∫ y H(x, v) ν(dx, dy)

Theorem 2. Assume the equilibria of 3.2 are isolated. Then the limit set of any solution to 2.1 and 2.2 is almost surely a connected compact subset of the equilibria set of the ODE (3.1 and 3.3).

This result extends previous results obtained with constant learning rate and a specific architecture in Benaim (1994).
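A minimal instance of the hybrid rule 2.1/2.2 is a radial-basis network whose centers follow a stochastic k-means update (unsupervised, winner-take-all) while the output weights follow an LMS update on the quadratic error (supervised), with learning rates γ_n = n^{-0.7} satisfying assumptions iii and iv. The sketch below is illustrative only; the network, cost functions, and constants are not taken from the paper:

```python
import math, random

def hybrid_train(samples, n_centers=4, n_steps=20000):
    """Sketch of a hybrid rule of the form 2.1/2.2 (illustrative, not the
    paper's experiments).  Centers v follow a stochastic k-means update
    (unsupervised, winner-take-all); output weights w follow an LMS update
    on the quadratic error (supervised).  Learning rates gamma_n = n**-0.7
    satisfy sum gamma_n = inf and sum gamma_n**(1+delta) < inf."""
    rng = random.Random(1)
    v = [rng.uniform(0.0, 1.0) for _ in range(n_centers)]   # hidden centers
    w = [0.0] * n_centers                                   # output weights

    def h(x, vi):                     # gaussian hidden unit H_i(x, v)
        return math.exp(-8.0 * (x - vi) ** 2)

    for n in range(1, n_steps + 1):
        g = n ** -0.7
        x, y = rng.choice(samples)
        # unsupervised step (2.1): move the winning center toward x
        i = min(range(n_centers), key=lambda j: (x - v[j]) ** 2)
        v[i] += g * (x - v[i])
        # supervised step (2.2): stochastic gradient of (o - y)^2 / 2 in w
        hs = [h(x, vj) for vj in v]
        out = sum(wj * hj for wj, hj in zip(w, hs))
        for j in range(n_centers):
            w[j] += g * (y - out) * hs[j]
    return v, w, h
```

Note that the combined update is not a stochastic gradient of any single cost, which is why Theorems 1 and 2 rather than the classical results of White (1989) are needed.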
Appendix

Detailed proofs of Theorems 1 and 2 are given in Benaim (1993b). In this appendix we describe the main idea of the proofs. These results are based on the following general theorem concerning Robbins-Monro algorithms. We assume given a probability space (Ω, F, {F_n}_{n≥0}, P) with an increasing sequence of sigma-algebras {F_n}_{n≥0}.
Let {z_n}_{n≥0} ∈ R^N be a solution to the following stochastic algorithm:

z_{n+1} − z_n = γ_{n+1} [F(z_n) + U_{n+1}]   (A.1)

where F: R^N → R^N is Lipschitz and {U_n}_{n≥1} is a sequence of random variables, F_n-measurable, such that

E(U_{n+1} | F_n) = 0   and   sup_{n≥0} E(‖U_n‖^q) < ∞

for some q ≥ 2. The sequence {γ_n}_{n≥0} is a sequence of nonnegative real numbers such that Σ_n γ_n = ∞ and Σ_n γ_n^δ < ∞ for some δ > 1.

It is clear that the algorithm (2.1 and 2.2) can be put in the general form given by A.1 with N = kl + l, z_n = (v_n, w_n), and F the vector field given by 3.1 and 3.2.
Theorem 3 (Benaim 1993a,b). Assume the sequence {z_n} solution to A.1 is bounded (with probability one). Then the limit set of {z_n} is (with probability one) a nonempty compact connected set, invariant under the flow of F and included in the set of chain-recurrent points for F.

Let Φ denote the flow induced by F. A point p is said to be chain-recurrent if for all d > 0, T > 0 there exists a finite sequence of partial trajectories {Φ_t(y_i) : 0 ≤ t ≤ t_i}; i = 0, ..., k − 1; t_i ≥ T; such that d(y_0, p) < d, d(Φ_{t_i}(y_i), y_{i+1}) < d for i = 0, ..., k − 2, and d(Φ_{t_{k−1}}(y_{k−1}), p) < d.

… is uniformly distributed for any … and |v| > 0. The same results can be obtained as the limit of a gamma prior (Neal 1992; Williams 1993b).
Peter M. Williams
to show that

−log P(w) = W log E_W

to within an additive constant. If the noise level β = 1/σ² is known, or assumed known, the objective function to be minimized in place of M is now

L = β E_D + W log E_W   (4.2)

In practice β is generally not known in advance, and similar treatment can be given to β as was given to α. This leads to

−log P(D | w) = ½ N log E_D

assuming the gaussian noise model.⁶ The negative log posterior −log P(w | D) is now given by

L = ½ N log E_D + W log E_W   (4.3)

which replaces 2.1 as the loss function to be minimized. It is worth noting that if α and β are assumed known, differentiation of 2.1 yields ∇M = β ∇E_D + α ∇E_W, with 1/β as the variance of the noise process and 1/α as the mean absolute value of the weights. Differentiation of 4.3 yields ∇L = β̂ ∇E_D + α̂ ∇E_W where

β̂ = N / (2 E_D)   (4.4)

is the reciprocal of the sample variance of the noise and

α̂ = W / E_W   (4.5)

is the reciprocal of the sample mean of the size of the weights. This means that minimizing L is effectively equivalent to minimizing M assuming α and β are continuously adapted to the current sample values β̂ and α̂.
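The identity ∇L = β̂ ∇E_D + α̂ ∇E_W can be checked numerically on a toy model. The sketch below (a two-parameter linear fit; all names are illustrative, not the paper's networks) compares finite differences of L against β̂ ∇E_D + α̂ ∇E_W, with β̂ and α̂ computed as in 4.4 and 4.5:

```python
import math

def losses(w, data):
    """Evaluate E_D, E_W, and L = (N/2) log E_D + W log E_W for a toy
    linear model y = w[0]*x + w[1] (illustrative).  E_D is half the sum
    of squared errors; E_W is the sum of |w_i|."""
    ED = 0.5 * sum((w[0] * x + w[1] - t) ** 2 for x, t in data)
    EW = sum(abs(wi) for wi in w)
    N, W = len(data), len(w)
    L = 0.5 * N * math.log(ED) + W * math.log(EW)
    return ED, EW, L

def grad_identity_gap(w, data, eps=1e-6):
    """Numerically check dL/dw_i = beta_hat * dE_D/dw_i + alpha_hat * dE_W/dw_i
    with beta_hat = N/(2 E_D) and alpha_hat = W/E_W (4.4 and 4.5).
    Returns the largest discrepancy over the components of w (should be
    at the level of finite-difference error, away from w_i = 0)."""
    ED, EW, _ = losses(w, data)
    N, W = len(data), len(w)
    beta_hat, alpha_hat = N / (2 * ED), W / EW
    gaps = []
    for i in range(len(w)):
        wp = list(w); wp[i] += eps
        wm = list(w); wm[i] -= eps
        dL = (losses(wp, data)[2] - losses(wm, data)[2]) / (2 * eps)
        dED = (losses(wp, data)[0] - losses(wm, data)[0]) / (2 * eps)
        dEW = (losses(wp, data)[1] - losses(wm, data)[1]) / (2 * eps)
        gaps.append(abs(dL - (beta_hat * dED + alpha_hat * dEW)))
    return max(gaps)
```

The check only holds where |w_i| is differentiable, that is, away from w_i = 0, which is exactly the case treated in Section 7.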
5 Priors, Regularization Classes, and Initialization

For simplicity Section 2.2 assumed a single weight prior for all parameters. In fact different priors are suitable for the three types of parameter found in feedforward networks, distinguished by their different transformational properties.

⁶The ½ comes from the fact that E_D is measured in squared units. Assuming Laplacian noise this term becomes N log E_D with E_D = Σ_p |y_p − t_p|.
Bayesian Regularization
5.1 Internal Weights. These are weights on connections to or from hidden units. The argument of Section 2.2 suggests a Laplace prior. MacKay (1992) points out, however, that there are advantages in dividing such weights into separate classes, with each class c having its own adaptively determined scale. This leads by the arguments of Section 4 to the more general cost function

L = ½ N log E_D + Σ_c W_c log E_W^c   (5.1)

where summation is over regularization classes, W_c is the number of weights in class c, and E_W^c = Σ_{i∈c} |w_i| is the sum of absolute values of weights in that class. A simple classification uses two classes consisting of (1) weights on connections with output units as destinations and (2) weights on connections with hidden units as destinations. More refined classifications might be suitable for specific applications.
(5.1) where summation is over regularization classes, Wc is the number of weights in class c, and EE, = C,,, lwjl is the sum of absolute values of weights in that class. A simple classification uses two classes consisting of (1) weights on connections with output units as destinations and ( 2 ) weights on connections with hidden units as destinations. More refined classifications might be suitable for specific applications. 5.2 Biases. Regularization classes must be exclusive but need not be exhaustive. Parameters belonging to no regularization class are unregularized. This corresponds to a uniform prior. This is appropriate for biases that transform as location parameters (Williams 1993b). The prior suitable for a location parameter is one with constant density. Biases are therefore excluded from regularization.
5.3 Direct Connections. If direct connections are allowed between input and output units, the argument of Section 2.2 does not apply. There is no intrinsic symmetry in the signs of these weights. It is then reasonable to use a gaussian prior contributing an extra term ½ W_d log E_W^d to the right-hand side of 5.1, where d is the class of direct connections, W_d is the number of direct connections, and E_W^d = ½ Σ_{j∈d} w_j² is half the sum of their squares.
5.4 Initialization. It is natural to initialize the weights in the network in accordance with the assumed prior. For internal weights with the Laplace prior, this is done by setting each weight to ±a log r, where r is uniformly random in (0, 1), the sign is chosen independently at random, and a > 0 determines the scale. a is then the average initial size of the weights. Satisfactory results are obtained with a = 1/√m for input weights and a = 1.6/√m for remaining weights, where m is the fan-in of the destination unit. The network function corresponding to the initial guess then has roughly unit variance outputs for unit variance inputs, assuming the natural hyperbolic tangent as transfer function. All biases are initially set to zero.
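The sampling rule above is straightforward to implement. The following sketch (hypothetical helper) draws weights as ±a log r and relies on the fact that E[−log r] = 1 for r uniform in (0, 1), so the expected absolute size of each weight is a:

```python
import math, random

def init_laplace_weights(n, a, rng):
    """Draw n weights from the Laplace prior via the rule above:
    |w| = a * (-log r) with r uniform in (0, 1) and a random sign.
    Since E[-log r] = 1, the expected absolute weight size is a."""
    ws = []
    for _ in range(n):
        r = rng.random() or 1e-12       # guard against log(0)
        sign = 1.0 if rng.random() < 0.5 else -1.0
        ws.append(sign * a * -math.log(r))
    return ws
```

This is the inverse-CDF method for the Laplace density p(w) ∝ exp(−|w|/a), so the initial sample really does follow the assumed prior rather than merely matching its scale.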
6 Multiple Outputs and Noise Levels
Suppose the regression network has n output units. In general the noise levels will be different for each output. The data misfit term then becomes Σ_i β_i E_D^i, where summation is over output units and, assuming independent gaussian noise, E_D^i = ½ Σ_p (y_{pi} − t_{pi})² is the error on the ith output, summed over training patterns.⁷ If each β_i = 1/σ_i² is known, the objective function becomes

L = Σ_i β_i E_D^i + W log E_W   (6.1)

in place of 4.2, assuming a single regularization class. Otherwise integrating over each β_i with the 1/β_i prior gives

L = ½ N Σ_i log E_D^i + Σ_c W_c log E_W^c

in place of 5.1, assuming multiple regularization classes.

6.1 Multiple Noise Levels. Even in the case of a single output regression network there may be reason to suspect that the noise level differs between different parts of the training set.⁸ In that case the training set can be partitioned into two or more subsets and the term ½ N log E_D in 5.1 is replaced by ½ Σ_s N_s log E_D^s, where N_s is the number of patterns in subset s, with Σ_s N_s = N, and E_D^s = ½ Σ_{p∈s} (y_p − t_p)² is the data error over that subset.
7 Nonsmooth Optimization and Pruning
The practical problem from here on is assumed to be unconstrained minimization of 5.1. The objective function L is nondifferentiable, however, on account of the discontinuous derivative of |w_i| at each w_i = 0. This is a case of nonsmooth optimization (Fletcher 1987, Ch. 14). On the other hand, since L has discontinuities only in its first derivative and these are easily located, techniques applicable to smooth problems can still be effective (Gill et al. 1981, §4.2). Most optimization procedures applied to L as objective function are therefore likely to converge despite the discontinuities, though with a significant proportion of weights assuming negligibly small terminal values, at least for real noisy data. These are weights that an exact line search

⁷In many applications it will be unwise to assume that the noise is independent across outputs. This is often a reason for not using multiple output regression models in practice, unless one is willing to include cross terms (y_{pi} − t_{pi})(y_{pj} − t_{pj}) in the data error and re-estimate the inverse of the noise covariance matrix during training.

⁸Typically this arises when training items relate to domains with an intrinsic topology. For example, predictability of some quantity of interest may vary over different regions of space (mineral exploration) or periods of time (forecasting).
would have set to exact zeros. They are in fact no longer free parameters of the model and should not be included in the counts W_c of weights in the various regularization classes. For consistency, these numbers should be reduced during the course of training, otherwise the trained network will be over-regularized. The rest of the paper is devoted to this issue.⁹

The approach is as follows. It is assumed that the training process consists of iterating through a sequence of weight vectors w_0, w_1, ... to a minimum of L. If these are considered to be joined by straight lines, the current weight vector traces out a path in weight space. Occasionally this path crosses one of the hyperplanes w_i = 0, where w_i is one of the components of the weight vector. This means that w_i is changing sign. The question is whether w_i is on its way from being sizeably positive to being sizeably negative, or vice versa, or whether |w_i| is executing a Brownian motion about w_i = 0. The proposal is to pause when the path crosses, or is about to cross, a hyperplane and decide which case applies. This is done by examining ∂L/∂w_i. If ∂L/∂w_i has the same sign on both sides of w_i = 0, w_i is on its way elsewhere. If it has different signs, more specifically the same sign as w_i on either side, this is where w_i wishes to remain, since L increases in either direction. In the second case the proposal is to freeze w_i permanently at zero and exclude it from the count of free parameters. From then on the search continues in a lower dimensional subspace.

With this in mind there are three problems to solve. The first concerns the behavior of L at w_i = 0 and a convenient definition of ∂L/∂w_i in such a case. The second concerns the method of setting weights to exact zeros and the third concerns the implementation of pruning and the recount of free parameters.¹⁰

⁹Typical features of Laplace regularization can be sampled by applying some preferred optimization algorithm directly to the objective functions given by 4.3 or 5.1. This corresponds to the "quick and dirty" method of MacKay (1992, §6.1).

¹⁰The following discussion assumes batch training. Regularization using stochastic techniques is outside the present scope.

7.1 Defining the Derivative. For convenience we write the objective function 5.1 as L = L_D + L_W, where L_D = ½ N log E_D and L_W = Σ_c W_c log E_W^c. The problem in defining ∂L/∂w_i lies with the second term, since |w_i| is not differentiable at w_i = 0.

Suppose that w_i belongs to regularization class c and consider variation of w_i about w_i = 0, keeping all other weights fixed. This gives the cusp-shaped graph for L_W shown in Figure 1, which has a discontinuous
Figure 1: Space-like data gradient at w_i = 0.

derivative at w_i = 0. Its one-sided values are ±ᾱ_c, depending on the sign of w_i, where

1/ᾱ_c = E_W^c / W_c

is the mean absolute value of weights in class c. The two corresponding tangents to the curve are shown as dashed lines.

Consider small perturbations in w_i around w_i = 0, keeping other weights fixed. So far as the regularizing term L_W alone is concerned, w_i will be restored to zero, since a change in either direction increases L_W. The full objective function, however, is L = L_D + L_W, so that behavior under small perturbations is governed by the sum of the two terms ∂L_D/∂w_i and ∂L_W/∂w_i. Figure 1 shows one possibility for the relationship between them. Here ∂L_D/∂w_i is "space-like" with respect to ±ᾱ_c.¹¹ This is stable, since ∂L/∂w_i, which is the sum of the two, has the same sign as w_i in either direction. Small perturbations in w_i will be restored to zero. Contrast this with Figure 2, where ∂L_D/∂w_i is now "time-like" with respect to ±ᾱ_c. Increasing w_i will escape the origin, since the negative ∂L_D/∂w_i outweighs the positive ∂L_W/∂w_i = ᾱ_c. In short, ∂L/∂w_i is negative for small positive w_i. It follows that the criterion for stability at w_i = 0 is that
−ᾱ_c ≤ ∂L_D/∂w_i ≤ +ᾱ_c   (7.1)

¹¹This is a reference to Minkowski's formulation of special relativity, with the tangents at the origin playing the role of a section of the light cone.
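In code, the stability test of 7.1 amounts to comparing the magnitude of the data gradient with ᾱ_c at each weight sitting at zero. The helpers below are illustrative, not from the paper:

```python
def bound_to_zero(dLD, alpha_bar):
    """True iff a weight at w_i = 0 is 'space-like', i.e. the data gradient
    satisfies -alpha_bar <= dL_D/dw_i <= alpha_bar, so that the full
    objective L increases in either direction away from zero."""
    return abs(dLD) <= alpha_bar

def frozen_indices(ws, grads, alpha_bar):
    """Indices of weights currently at zero that the stability criterion
    keeps frozen there (illustrative helper)."""
    return [i for i, (w, g) in enumerate(zip(ws, grads))
            if w == 0.0 and bound_to_zero(g, alpha_bar)]
```

Weights identified this way are the ones excluded from the counts W_c of free parameters during training.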
Figure 2: Time-like data gradient at w_i = 0.

If L is given by 5.1, so that L_D = ½ N log E_D, then ∂L_D/∂w_j = β̂ ∂E_D/∂w_j with β̂ given by 4.4. The criterion for stability can then be written in terms of E_D as

|∂E_D/∂w_j| ≤ ᾱ_c / β̂
and a similar argument establishes 3.3 in the case of a single regularization class when α and β are assumed known.

It is convenient to define the objective function partial derivative ∂L/∂w_j at w_j = 0 as follows. If w_j is bound to zero, i.e., the partial derivative ∂L_D/∂w_j is space-like, ∂L/∂w_j is defined to be zero. If it is time-like, it is defined to be the value of the downhill derivative. Explicitly, using the abbreviations
a = ᾱ_c and b = ∂L_D/∂w_j, then

∂L/∂w_j =
  b + a   if w_j > 0
  b − a   if w_j < 0
  b + a   if w_j = 0 and b < −a
  b − a   if w_j = 0 and b > a
  0       otherwise

… y_2) as target values.¹⁷ Simple three-layer networks were used with 2 input, 2 output, and from 5 to 20 hidden units. Results are shown in Figure 5. For comparability with MacKay's results, a single regularization class was used and it was assumed that the noise level σ = 0.05 was known in advance. The objective function to be minimized is therefore 6.1 with β_1 = β_2 = 1/σ². The ordinate in Figure 5 is twice the final value of the first term on the right-hand side of 6.1. This is a dimensionless χ² quantity whose expectation is 400 ± 20 relative to the actual noise process used in constructing the training set. Results on a test set, also of size 200 and drawn from the same distribution as the training set, are shown in Figure 6 using the same error units. Comparison with results on a further test set, of the same size and drawn from the same distribution, is shown in Figure 7. This confirms MacKay's observation that generalization error on a test set is a noisy quantity, so that many data would have to be devoted to a test set for test error to be a reliable way of setting regularization parameters.
¹⁷Training and test sets used here are the same as those in MacKay (1992), by courtesy of David MacKay.
Figure 5: Plot showing the data error of 148 trained networks. Ten networks were trained for each of 16 network architectures with hidden units ranging from 5 to 20. Twelve outliers relating to small numbers of hidden units have been excluded. The dotted line is 400 − W, where W is the empirically determined number of free parameters remaining after Laplace regularized training, averaged over each group of 10 trials.
Figure 6: Test error versus number of hidden units.
650
I
I
I
++ +
600 t
+
++
550
+
++ +
,%--
I
I
t
++
t
+ :. .'
:
..'
t
5-12 hidden units + 13-20 hidden units 0
t.'
y= z
400
450
500
550
600
...
650
Figure 7 Errors on two test sets. Performance on both training and test sets settles down after around 13 hidden units. Little change is observed when further hidden units are added since the extra connections are pruned by the regularizer as shown by the dotted line in Figure 5. This contrasts with MacKay's results using the sum of squares regularizer for which the training error continues to decrease as more hidden units are added and where the training error for approaching 20 hidden units differs very little from the best possible unregularized fit. MacKay's approach is to evaluate the "evidence" for each solution and to choose a number of hidden units that maximizes this quantity, which in this case is approximately 11 or 12. The present heuristic is to supply the network with ample hidden units and to allow the regularizer to prune these to a suitable number. Provided the initial number of hidden units is sufficient, the results are largely independent of the number of units initially supplied. 9.1 Varying the Noise. For a further demonstration of Laplace pruning, the problem is changed to one in which the network has a single output. Multiple output regression networks are unusual in practice, especially ones satisfying a relation such as y1(x1.x2)= y2(x1 n/2. x2). There is also the possibility that the hidden units divide themselves into two groups, each serving one of the two outputs exclusively, which can make it difficult to interpret results. We therefore consider interpolation of just one of the outputs considered above, specifically the cosine expression y1. The same 200 input pairs (XI. x2) were used as for MacKay's
Bayesian Regularization
training set, but varying amounts of gaussian noise were added to the target outputs. Results using a network with 50 hidden units, and with noise varying from 0.01 to 0.19 in increments of 0.01, are shown in Figure 8. In this case the noise was resampled on each trial, so that each of the 190 different networks was trained on a different training set. Two regularization classes were used and it was no longer assumed that the noise level was known in advance. The objective function is therefore given by equation 5.1, with input and output weights forming the two classes.

Figure 8: Data error versus noise level for an initial 50 hidden units.

The data error in Figure 8 is again shown in χ² units, whose expected value is now 200 relative to the actual noise process, since there is only one output unit. Specifically the ordinate in Figure 8 measures ∑_p ((y_p − f_p)/σ)², where σ is the abscissa and p ranges over the 200 training items. The actual data error increases proportionately with the noise, so that the normalized quantity is effectively constant. Figure 9 shows mean numbers of live hidden units, with one standard deviation error bars, in networks corresponding to each of the 19 noise levels. This is the number of hidden units remaining in the trained network after the pruning implicit in Laplace regularization. Note that the number of initially free parameters in a 50 hidden unit network with 2 inputs and 1 output is 201, so that with 200 data points the initial ratio of data points to free parameters is approximately 1. This should be contrasted with the statement in MacKay (1992) that the numerical approximation needed by the evidence framework, when used with gaussian regularization, seems to break down significantly when this ratio is less than 3 ± 1.
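The χ² normalization used here is easy to reproduce: if gaussian noise of standard deviation σ is added to targets that the network fits well, the data error measured as ∑_p ((y_p − f_p)/σ)² has expected value equal to the number of training items, whatever the noise level. A minimal sketch (using a stand-in cosine target, not the robot-arm training set):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200

for sigma in (0.01, 0.10, 0.19):
    x = rng.uniform(-1.0, 1.0, N)
    f = np.cos(2.0 * x)                 # stand-in for a perfectly fitted network
    y = f + rng.normal(0.0, sigma, N)   # noisy targets
    chi2 = np.sum(((y - f) / sigma) ** 2)
    print(sigma, chi2)                  # stays near N = 200 at every noise level
```

The normalized error fluctuates around 200 with standard deviation √(2N) ≈ 20, which is why the curve in Figure 8 is effectively flat.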
Figure 9 indicates that, if it is to be correct to claim that results are effectively independent of the number of hidden units used, provided there are enough of them, there ought to be little purpose in using networks with more than 20 hidden units for noise levels higher than 0.05. To verify this, a further 190 networks were trained using an initial architecture of 20 hidden units. Results for the final numbers of hidden units are shown in Figure 10.

Figure 9: Live hidden units versus noise level for an initial 50 hidden units.

Figure 10: Live hidden units versus noise level for an initial 20 hidden units.

Comparison with Figure 9 shows that if more than 20 hidden units are available for noise levels below 0.05, the network will use them. But for higher noise levels, there is no significant difference in the number of hidden units finally used, whether 20 or 50 are initially supplied. The algorithm also works for higher noise levels. Figure 11 shows corresponding results for noise levels from 0.05 to 0.95 in increments of 0.05. Note that in all these demonstrations with varying noise, the level is automatically detected by the regularizer, and the number of hidden units, or more generally the number of parameters, is accommodated to suit the level of noise detected.
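Counting "live" hidden units in a trained network amounts to inspecting which units keep nonzero connections after pruning. A hypothetical sketch (the array shapes and helper name are illustrative, not the author's code):

```python
import numpy as np

def live_hidden_units(W_in, W_out):
    """Count hidden units with at least one nonzero input weight
    and a nonzero output weight after Laplace pruning."""
    has_input = np.any(W_in != 0.0, axis=0)   # W_in: (n_inputs, n_hidden)
    has_output = W_out != 0.0                 # W_out: (n_hidden,)
    return int(np.sum(has_input & has_output))

# Three hidden units: the second has no input weights left,
# the third has no output weight, so only one unit is live.
W_in = np.array([[0.5, 0.0, 1.2],
                 [0.3, 0.0, 0.0]])
W_out = np.array([1.0, 0.7, 0.0])
print(live_hidden_units(W_in, W_out))   # 1
```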
Figure 11: Data error and live hidden units versus larger noise levels for 20 hidden units.

9.2 Posterior Weight Distribution. It was noted in Section 3 that the weights arrange themselves at a minimum so that the sensitivity of the data error to each of the nonzero weights in a given regularization class is the same, assuming Laplace regularization is used. For the weights themselves, the posterior conditional distributions in a given class are roughly uniform over an interval. Figure 12 shows the empirical distributions for a sample of 500 trained networks. These plots answer the question "what is the probability that the size of a randomly chosen input (output) weight of a trained network lies between x and x + δx, conditional on its being nonzero?" The unconditional distributions have discrete components at the origin. The probability of an output weight being zero was 0.38 and the probability of an input weight being zero was 0.47. These networks were trained on the cosine output of the robot arm problem using MacKay's sampling of the noise at the 0.05 level.
Figure 12: Empirical posterior distributions of the size of nonzero input and output weights for 500 trained networks, each using 20 hidden units. Mean values are 0.55 for input weights and 1.31 for output weights. The natural hyperbolic tangent was used as transfer function for hidden units.

10 Summary and Conclusions

This paper has argued that the ∑|w| regularizer is more appropriate for the hidden connections of feedforward networks than the ∑w² regularizer. It has shown how to deal with discontinuities in the gradient of |w| and how to recount the free parameters of the network as they are pruned by the regularizer. No numerical approximations need be made and the method can be applied exactly, even to small noisy data sets where the ratio of free parameters to data points may approach unity.
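As an illustration of how an L1 penalty produces exact zeros, here is a standard proximal-gradient (soft-thresholding) sketch on a linear least-squares problem. This is not the author's training algorithm, and the problem and parameter values are invented for illustration; it shows only the generic pruning effect of a Laplace-style penalty:

```python
import numpy as np

# Synthetic linear regression: only the first three of ten weights matter.
rng = np.random.default_rng(1)
N, W = 100, 10
X = rng.normal(size=(N, W))
w_true = np.zeros(W)
w_true[:3] = [2.0, -1.5, 1.0]
t = X @ w_true + 0.1 * rng.normal(size=N)

alpha, eta = 5.0, 0.001   # L1 strength and step size (illustrative values)
w = np.zeros(W)
for _ in range(5000):
    grad = X.T @ (X @ w - t)                                   # gradient of E_D
    w = w - eta * grad
    w = np.sign(w) * np.maximum(np.abs(w) - eta * alpha, 0.0)  # prox of alpha*sum|w|

print(w)   # superfluous weights end up exactly zero
```

Unlike gaussian weight decay, which only shrinks weights toward zero, the soft-threshold step sets small weights to exact zeros, so the pruned architecture can be read off the trained weight vector.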
Appendix

The evidence framework (MacKay 1992; Thodberg 1993) proposes to set the regularizing parameters α and β by maximizing

P(D) = ∫ P(D|w) P(w) dw

considered as a function of α and β. This quantity is interpreted as the evidence for the overall model, including both the underlying architecture and regularizing parameters. From equations 2.1 and 2.2 it follows that

P(D) = (Z_W Z_D)⁻¹ ∫ e^(−M(w)) dw

To evaluate the integral analytically, M is usually approximated by a quadratic in the neighborhood of a maximum of the posterior density at w = w_MP, where ∇M vanishes. The approximation is then

M(w) = M(w_MP) + ½ (w − w_MP)ᵀ A (w − w_MP)    (A.1)

where A = ∇∇M is the Hessian of M evaluated at w_MP. It follows that

−log P(D) = αE_W + βE_D + ½ log det A + log Z_W + log Z_D + constant

where the constant, which also takes account of the order of the network symmetry group, does not depend explicitly on α or β. Now the Laplace regularizer E_W is locally a hyperplane. This means that ∇∇E_W vanishes identically, so that A = βH, where H = ∇∇E_D is the Hessian of the data error alone. Assuming the Laplace regularizer and gaussian noise, Z_W = (2/α)^W and Z_D = (2π/β)^(N/2), so that

−log P(D) = αE_W + βE_D + ½ log det H − W log α − ((N − k)/2) log β + constant

where k is the full dimension of the weight vector. Setting to zero the partial derivatives with respect to α and β yields α = W/E_W and β = (N − k)/2E_D, so that

1/α = E_W / W    (A.2)

and

1/β = 2E_D / (N − k)    (A.3)
These should be compared with 4.4 and 4.5. If A.2 and A.3 are used as re-estimation formulas during training, the difference between the evidence framework and the method of integrating over hyperparameters reduces, in the case of Laplace regularization, to the difference between the factors N − k and N when re-estimating β.⁷ In many applications the differences in results, when using these two factors with Laplace regularization, are not sufficiently clear to decide the matter empirically, and it needs to be settled on other grounds (Wolpert 1993; MacKay 1994). In the present context, this paper prefers the method of integrating over hyperparameters for reasons of simplicity. Its main purpose, however, is to advocate the Laplace over the gaussian regularizer, in which case the difference between these two methods of setting regularizing parameters appears less significant.

⁷If β is assumed known, the methods are apparently equivalent. For multiple regularization classes the same argument leads, on either approach, to the re-estimation formula α_c = W_c/E_W^c for each regularization class c. For the multiple noise levels envisaged in Section 6, however, results will generally differ unless the levels are known in advance. Note that in saying that the Laplace regularizer is locally a hyperplane, it is assumed that none of the regularized weights vanishes, otherwise the Hessian A is not defined and the quadratic assumption A.1 is no longer meaningful. It is therefore assumed that zero weights are also pruned for the Laplace regularizer when using the evidence framework (compare Thodberg 1993, for pruning with the gaussian regularizer).
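On either approach the hyperparameter updates are simple moment formulas. A hedged sketch of one re-estimation step (the function name and example values are illustrative; E_D and E_W would come from the current state of training):

```python
def reestimate(E_D, E_W, N, W, k=None):
    """One hyperparameter update (symbols as in the text).

    alpha = W / E_W on either approach. For beta, the evidence framework
    uses the factor N - k (k = full dimension of the weight vector),
    whereas integrating over hyperparameters uses N.
    """
    alpha = W / E_W
    beta_integrate = N / (2.0 * E_D)
    beta_evidence = (N - k) / (2.0 * E_D) if k is not None else None
    return alpha, beta_integrate, beta_evidence

print(reestimate(E_D=100.0, E_W=20.0, N=200, W=40, k=40))
# the two beta estimates differ only through the factor N - k versus N
```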
Acknowledgments

I am grateful to Dr. Perry Eaton, Dr. Colin Barnett, and other members of the Geophysical Department of Newmont Exploration Limited for stimulating discussions on the subject of this paper and related topics over the last few years.

References

Bishop, C. M. 1993. Curvature-driven smoothing: A learning algorithm for feedforward networks. IEEE Trans. Neural Networks 4(5), 882-884.
Buntine, W. L., and Weigend, A. S. 1991. Bayesian back-propagation. Complex Syst. 5, 603-643.
Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., Jackel, L., and Hopfield, J. 1987. Large automatic learning, rule extraction, and generalization. Complex Syst. 1, 877-922.
Fletcher, R. 1987. Practical Methods of Optimization (2nd ed.). John Wiley, New York.
Gill, P. E., Murray, W., and Wright, M. H. 1981. Practical Optimization. Academic Press, New York.
Hassibi, B., and Stork, D. G. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 164-171. Morgan Kaufmann, San Mateo, CA.
Jaynes, E. T. 1968. Prior probabilities. IEEE Trans. Syst. Sci. Cybernet. 4(3), 227-241.
Le Cun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 598-605. Morgan Kaufmann, San Mateo, CA.
MacKay, D. J. C. 1992. A practical Bayesian framework for backprop networks. Neural Comp. 4(3), 448-472.
MacKay, D. J. C. 1994. Hyperparameters: Optimise, or integrate out? In Maximum Entropy and Bayesian Methods, Santa Barbara, 1993, G. Heidbreder, ed. Kluwer, Dordrecht. (In press.)
Møller, M. F. 1993a. Exact calculation of the product of the Hessian matrix of feedforward network error functions and a vector in O(n) time. Report DAIMI PB-432, Computer Science Department, Aarhus University, Denmark.
Møller, M. F. 1993b. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6(4), 525-533.
Neal, R. M. 1992. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Tech. Rep. CRG-TR-92-1, Department of Computer Science, University of Toronto.
Neal, R. M. 1993. Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 475-482. Morgan Kaufmann, San Mateo, CA.
Nowlan, S. J., and Hinton, G. E. 1992. Adaptive soft weight tying using gaussian mixtures. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds., pp. 993-1000. Morgan Kaufmann, San Mateo, CA.
Pearlmutter, B. A. 1994. Fast exact multiplication by the Hessian. Neural Comp. 6(1), 147-160.
Plaut, D. C., Nowlan, S. J., and Hinton, G. E. 1986. Experiments on learning by backpropagation. Tech. Rep. CMU-CS-86-126, Carnegie Mellon University, Pittsburgh, PA 15213.
Thodberg, H. H. 1993. Ace of Bayes: Application of neural networks with pruning. Manuscript 1132E, The Danish Meat Research Institute.
Tikhonov, A. N., and Arsenin, V. Y. 1977. Solutions of Ill-Posed Problems. John Wiley, New York.
Tribus, M. 1969. Rational Descriptions, Decisions and Designs. Pergamon Press, Oxford.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1991. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 875-882. Morgan Kaufmann, San Mateo, CA.
Williams, P. M. 1991. A Marquardt algorithm for choosing the step-size in backpropagation learning with conjugate gradients. Cognitive Science Research Paper CSRP 229, University of Sussex.
Williams, P. M. 1993a. Aeromagnetic compensation using neural networks. Neural Comp. Appl. 1, 207-214.
Williams, P. M. 1993b. Improved generalization and network pruning using adaptive Laplace regularization. In Proceedings of 3rd IEE International Conference on Artificial Neural Networks, pp. 76-80. Institution of Electrical Engineers, London.
Wolpert, D. H. 1993. On the use of evidence in neural networks. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 539-546. Morgan Kaufmann, San Mateo, CA.
Received February 16, 1994; accepted May 20, 1994.
Communicated by Stephen Nowlan and John Bridle
Bayesian Regularization and Pruning Using a Laplace Prior

Peter M. Williams
School of Cognitive and Computing Sciences, University of Sussex, Falmer, Brighton, BN1 9QH, U.K.
Standard techniques for improved generalization from neural networks include weight decay and pruning. Weight decay has a Bayesian interpretation, with the decay function corresponding to a prior over weights. The method of transformation groups and maximum entropy suggests a Laplace rather than a gaussian prior. After training, the weights then arrange themselves into two classes: (1) those with a common sensitivity to the data error and (2) those failing to achieve this sensitivity and that therefore vanish. Since the critical value is determined adaptively during training, pruning, in the sense of setting weights to exact zeros, becomes an automatic consequence of regularization alone. The count of free parameters is also reduced automatically as weights are pruned. A comparison is made with results of MacKay using the evidence framework and a gaussian regularizer.

1 Introduction
Neural networks designed for regression or classification need to be trained using some form of stabilization or regularization if they are to generalize well beyond the original training set. This means finding a balance between complexity of the network and information content of the data. Denker et al. (1987) distinguish formal and structural stabilization. Formal stabilization involves adding an extra term to the cost function that penalizes more complex models. In the neural network literature this often takes the form of weight decay (Plaut et al. 1986) using the penalty function ∑_j w_j², where summation is over components of the weight vector. Structural stabilization is exemplified in polynomial curve fitting by explicitly limiting the degree of the polynomial. Examples relating to neural networks are found in the pruning algorithms of Le Cun et al. (1990) and Hassibi and Stork (1993). These use second-order information to determine which weight can be eliminated next at the cost of minimum increase in data misfit. They do not by themselves, however, give a criterion for when to stop pruning. This paper advocates a type of formal regularization in which the penalty term is proportional to the logarithm of the L1 norm of the weight vector ∑_j |w_j|. This simultaneously provides both forms of stabilization without the need for additional assumptions.

Neural Computation 7, 117-143 (1995)
© 1994 Massachusetts Institute of Technology

2 Probabilistic Interpretation
Choice of regularizer corresponds to a preference for a particular type of model. From a Bayesian point of view the regularizer corresponds to a prior probability distribution over free parameters w of the model. Using the notation of MacKay (1992), the regularized cost function can be written as

M(w) = βE_D(w) + αE_W(w)    (2.1)

where E_D measures the data misfit, E_W is the penalty term, and α, β > 0 are regularizing parameters determining a balance between the two. Equation 2.1 corresponds, by taking negative logarithms and ignoring constant terms, to the probabilistic relation

P(w|D) ∝ P(D|w) P(w)

where P(w|D) is the posterior density in weight space, P(D|w) is the likelihood of the data D, and P(w) is the prior density over weights.¹ According to this correspondence

P(D|w) = Z_D⁻¹ exp(−βE_D)  and  P(w) = Z_W⁻¹ exp(−αE_W)    (2.2)

where Z_D = Z_D(β) and Z_W = Z_W(α) are normalizing constants. It follows that the process of minimizing

M(w) = −log P(w|D) + constant

is equivalent to finding a maximum of the posterior density.

2.1 The Likelihood Function for Regression Networks. Suppose a training set of pairs (x_p, t_p), p = 1, ..., N, is to be fitted by a neural network model with adjustable weights w. The x_p are input vectors and the t_p are target outputs. The network is assumed for simplicity to have a single output unit. Let y_p = f(x_p, w), p = 1, ..., N, be the corresponding network outputs, where f is the network mapping, and assume that the measured values t_p differ from the predicted values y_p by an additive noise process

t_p = y_p + ν_p
¹The notation is somewhat schematic. See Buntine and Weigend (1991), MacKay (1992), and Neal (1993) for more explicit notations.
If the ν_p have independent normal distributions, each with zero mean and the same known standard deviation σ, the likelihood of the data is

P(D|w) = ∏_{p=1}^{N} (2πσ²)^(−1/2) exp(−(t_p − y_p)²/2σ²)

which implies, according to 2.2, that

E_D = ½ ∑_{p=1}^{N} (y_p − t_p)²    (2.3)

with β = 1/σ² and Z_D = (2π/β)^(N/2). As α → 0 we have the improper uniform prior over w, so that P(w|D) ∝ P(D|w) and M is proportional to E_D. This means that least squares fitting, which minimizes E_D alone, is equivalent to simple maximum likelihood estimation of parameters assuming gaussian noise. Other models of the noise process are possible, but the gaussian model is assumed here throughout.²

2.2 Weight Prior. A common choice of weight prior assumes that weights have identical independent normal distributions with zero mean. If {w_j | j = 1, ..., W} are components of the weight vector, then according to 2.2

E_W = ½ ∑_{j=1}^{W} w_j²    (Gauss)    (2.4)

where 1/α is the variance. Alternatively, if the absolute values of the weights have exponential distributions, then

E_W = ∑_{j=1}^{W} |w_j|    (Laplace)    (2.5)

where 1/α is the mean absolute value. Another possibility is the Cauchy distribution, with density for each weight proportional to [1 + (α_j w_j)²]⁻¹, where 1/α_j is the median absolute value.³

It turns out that the Laplace prior has a special connection with network pruning that derives principally from the behavior of the derivative of |x| in the neighborhood of the origin. This is described in the next section and explored in the rest of the paper. It is nonetheless interesting to

²This paper concerns regression networks in which the target values are real numbers, but the same ideas can be applied to classification networks where the targets are exclusive class labels.
³A penalty function w²/(1 + w²), similar to log(1 + w²), is the basis of Weigend et al. (1991).
ask whether there might be an a priori reason for this prior to be especially suitable for neural network models. Jaynes (1968) offers two principles, transformation groups and maximum entropy, for setting up probability distributions in the absence of frequency data. These can be applied to neural networks as follows.

For any feedforward network in which there are no direct connections between input and output units, there is a functionally equivalent network in which the weight on a given connection has the same size but opposite sign. This is also true if there are direct connections, except for the direct connections. This is evident if the transfer function σ is odd, such as the hyperbolic tangent. It is true, more generally, provided there are constants b, c such that σ(x − b) + σ(b − x) = c. For example, b = 0, c = 1 for the logistic function. Consistency then demands that the prior for a given weight w_j should be a function of |w_j| alone.

If it is assumed that all that is known about |w_j| is its scale, and that the scale of a positive quantity is determined by its mean rather than some higher order moment, then the most noncommittal distribution for |w_j| according to the principle of maximum entropy is the exponential distribution, since this is the maximum entropy distribution for a positive quantity constrained to have a given mean (Tribus 1969). It would follow that the signed weight w_j has the two-sided exponential or Laplace density (α/2)e^(−α|w_j|), where 1/α is the mean absolute value. Under the assumption of independence, for the joint distribution this leads to the Laplace expression 2.5 for the regularizing term, with Z_W = (2/α)^W as normalizing constant. The gaussian prior would be obtained if constraints were placed on the first two moments of the distribution of the signed weights. The crux of the present argument is that constraining the mean of the signed weights to be zero is not an adequate expression of the intrinsic symmetry in the signs of the weights. A zero mean distribution need not be symmetric, and a symmetric distribution need not have a mean. Note that the present argument uses a specific property of neural network models that does not apply to regression models generally.⁴
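The practical difference between the gaussian penalty 2.4 and the Laplace penalty 2.5 lies in their gradients: the gaussian pull on a weight, proportional to w_j, fades as the weight shrinks, while the Laplace pull α·sgn(w_j) keeps a constant magnitude, which is what later allows weights to be driven to exact zero. A small numerical sketch:

```python
import numpy as np

w = np.array([-2.0, -0.1, 0.5, 3.0])

E_gauss = 0.5 * np.sum(w ** 2)   # equation 2.4
E_laplace = np.sum(np.abs(w))    # equation 2.5

grad_gauss = w                   # dE_W/dw_j = w_j: pull fades as |w_j| -> 0
grad_laplace = np.sign(w)        # dE_W/dw_j = sgn(w_j): constant magnitude

print(E_gauss, E_laplace)        # 6.63 5.6
print(grad_gauss, grad_laplace)
```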
3 Comparison of Sensitivities
It is revealing to compare the conditions for a minimum of the overall cost function in the cases of Gauss and Laplace weight priors. Recalling that M = βE_D + αE_W, it follows that, at a minimum of M where ∂M/∂w_j = 0,

∂E_D/∂w_j = −(α/β) w_j    (3.1)

⁴A possible alternative would be to assume each |w_j| has a log-normal distribution, or a mixture of a log-normal and an exponential distribution; compare Nowlan and Hinton (1992). For an approach to formal stabilization, more in the style of Tikhonov and Arsenin (1977), see Bishop (1993).
assuming E_W is given by the gaussian regularizer (2.4). Sensitivity of the data misfit to a given weight is proportional to its size, and therefore unequal for different weights. Furthermore, if w_j is to vanish at a minimum of M then ∂E_D/∂w_j = 0. This is the same condition as for an unregularized network, so that gaussian weight decay contributes nothing toward network pruning in the strict sense.

Condition 3.1 should be contrasted with Laplacian weight decay (2.5), where sufficient conditions for a stationary point are, as we shall see, that

∂E_D/∂w_j = −(α/β) sgn(w_j)  if w_j ≠ 0    (3.2)

|∂E_D/∂w_j| ≤ α/β  if w_j = 0    (3.3)

Equation 3.2 means that, at a minimum, the nonzero weights must arrange themselves so that the sensitivity of the data misfit to each is the same. Equation 3.3 means that there is a definite cut-off point for the contribution that each weight must make. Unless the data misfit is sufficiently sensitive to the weight on a given connection, that weight is set to zero and the connection can be pruned. At a minimum the weights therefore divide themselves into two classes: (1) those with common sensitivity α/β and (2) those that fail to achieve this sensitivity and that therefore vanish. It turns out that the critical ratio α/β can be determined adaptively during training. Pruning is therefore automatic and performed entirely by the regularizer.

4 Elimination of α and β
The regularizing parameters α and β are not generally known in advance. MacKay (1992) proposes the evidence framework for determining these parameters. This paper uses the method of integrating over hyperparameters (Buntine and Weigend 1991); a comparison is made in the Appendix. The weight prior in 2.2 depends on α and can be written as
P(w | α) = Z_W(α)⁻¹ exp(−αE_W)    (4.1)
where α is now considered as a nuisance parameter. If a prior P(α) is assumed, α can be integrated out by means of

P(w) = ∫ P(w | α) P(α) dα
Since α is a scale parameter, it is reasonable to use the improper 1/α ignorance prior.⁵ Using 2.5 and 4.1 with P(α) = 1/α it is straightforward

⁵This means assuming that log α is uniformly distributed or, equivalently, that log Kαᵛ is uniformly distributed for any K > 0 and |v| > 0. The same results can be obtained as the limit of a gamma prior (Neal 1992; Williams 1993b).
Peter M. Williams
to show that

−log P(w) = W log E_W

to within an additive constant. If the noise level β = 1/σ² is known, or assumed known, the objective function to be minimized in place of M is now

L = βE_D + W log E_W    (4.2)
In practice β is generally not known in advance, and similar treatment can be given to β as was given to α. This leads to

−log P(D | w) = ½ N log E_D

assuming the gaussian noise model.⁶ The negative log posterior −log P(w | D) is now given by

L = ½ N log E_D + W log E_W    (4.3)
which replaces 2.1 as the loss function to be minimized. It is worth noting that if α and β are assumed known, differentiation of 2.1 yields ∇M = β∇E_D + α∇E_W, with 1/β as the variance of the noise process and 1/α as the mean absolute value of the weights. Differentiation of 4.3 yields ∇L = β̂∇E_D + α̂∇E_W, where

1/β̂ = 2E_D/N    (4.4)

is the sample variance of the noise and

1/α̂ = E_W/W    (4.5)

is the sample mean of the size of the weights. This means that minimizing L is effectively equivalent to minimizing M with α and β continuously adapted to the current sample values α̂ and β̂.
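As a sketch of this adaptive scheme (a toy linear model with invented data, not one of the networks used later in the paper), gradient descent on L with α̂ and β̂ recomputed from 4.4 and 4.5 at every step behaves like weight decay whose parameters track the current noise and weight statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic linear problem; a linear "network" keeps the sketch short.
X = rng.normal(size=(50, 3))
t = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=50)
w = rng.normal(scale=0.1, size=3)
N, W = len(t), len(w)

for _ in range(20_000):
    r = X @ w - t
    E_D = 0.5 * r @ r                    # data misfit
    E_W = np.abs(w).sum()                # Laplace regularizer
    beta_hat = N / (2 * E_D)             # 4.4: 1/beta_hat = sample noise variance
    alpha_hat = W / E_W                  # 4.5: 1/alpha_hat = mean |w|
    # grad L = beta_hat * grad E_D + alpha_hat * grad E_W
    w -= 1e-4 * (beta_hat * (X.T @ r) + alpha_hat * np.sign(w))
```

After training, the recovered weights should approximate the generating values, with the uninformative middle weight driven toward zero.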
5 Priors, Regularization Classes, and Initialization

For simplicity, Section 2.2 assumed a single weight prior for all parameters. In fact different priors are suitable for the three types of parameter found in feedforward networks, distinguished by their different transformational properties.

⁶The ½ comes from the fact that E_D is measured in squared units. Assuming Laplacian noise, this term becomes N log E_D with E_D = Σ_p |y_p − t_p|.
5.1 Internal Weights. These are weights on connections to or from hidden units. The argument of Section 2.2 suggests a Laplace prior. MacKay (1992) points out, however, that there are advantages in dividing such weights into separate classes with each class c having its own adaptively determined scale. This leads by the arguments of Section 4 to the more general cost function
L = ½ N log E_D + Σ_c W_c log E_W^c    (5.1)

where summation is over regularization classes, W_c is the number of weights in class c, and E_W^c = Σ_{j∈c} |w_j| is the sum of absolute values of weights in that class. A simple classification uses two classes consisting of (1) weights on connections with output units as destinations and (2) weights on connections with hidden units as destinations. More refined classifications might be suitable for specific applications.

5.2 Biases. Regularization classes must be exclusive but need not be exhaustive. Parameters belonging to no regularization class are unregularized. This corresponds to a uniform prior, which is appropriate for biases since they transform as location parameters (Williams 1993b). The prior suitable for a location parameter is one with constant density. Biases are therefore excluded from regularization.
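The class-structured loss of equation 5.1 can be transcribed directly; a minimal sketch (the function name and example class layout are invented):

```python
import numpy as np

def class_loss(E_D, N, classes):
    """Eq. 5.1: L = 0.5*N*log(E_D) + sum over classes of W_c*log(E_W^c),
    where E_W^c is the sum of |w_j| in class c and W_c counts its live
    (nonzero) weights, in line with the recount of free parameters
    discussed later in the paper."""
    L = 0.5 * N * np.log(E_D)
    for w in classes.values():
        W_c = int(np.count_nonzero(w))
        L += W_c * np.log(np.abs(w).sum())
    return float(L)

L = class_loss(E_D=0.25, N=200,
               classes={"hidden": np.array([0.5, -0.2, 0.0]),  # one pruned weight
                        "output": np.array([1.3, -0.7])})
```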
5.3 Direct Connections. If direct connections are allowed between input and output units, the argument of Section 2.2 does not apply. There is no intrinsic symmetry in the signs of these weights. It is then reasonable to use a gaussian prior contributing an extra term ½ W_d log E_W^d to the right-hand side of 5.1, where d is the class of direct connections, W_d is the number of direct connections, and E_W^d = ½ Σ_{j∈d} w_j² is half the sum of their squares.
5.4 Initialization. It is natural to initialize the weights in the network in accordance with the assumed prior. For internal weights with the Laplace prior, this is done by setting each weight to ±a log r, where r is uniformly random in (0, 1), the sign is chosen independently at random, and a > 0 determines the scale. a is then the average initial size of the weights. Satisfactory results are obtained with a = 1/√m for input weights and a = 1.6/√m for remaining weights, where m is the fan-in of the destination unit. The network function corresponding to the initial guess then has roughly unit variance outputs for unit variance inputs, assuming the natural hyperbolic tangent as transfer function. All biases are initially set to zero.
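The ±a log r recipe can be sketched as follows (the fan-in value is illustrative). Since |w| = −a log r is exponentially distributed with mean a, the resulting w is Laplace-distributed with average size a:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10                       # fan-in of the destination unit (illustrative)
a = 1.0 / np.sqrt(m)         # scale = average initial weight size

r = rng.uniform(size=100_000)                   # r uniform in (0, 1)
sign = rng.choice([-1.0, 1.0], size=r.shape)    # independent random sign
w = sign * a * np.log(r)                        # w = +/- a log r

assert abs(np.abs(w).mean() - a) < 0.01   # mean |w| is a
assert abs(w.mean()) < 0.01               # symmetric about zero
```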
6 Multiple Outputs and Noise Levels
Suppose the regression network has n output units. In general the noise levels will be different for each output. The data misfit term then becomes Σ_i β_i E_D^i, where summation is over output units and, assuming independent gaussian noise, E_D^i = ½ Σ_p (y_pi − t_pi)² is the error on the ith output, summed over training patterns.⁷ If each β_i = 1/σ_i² is known, the objective function becomes
L = Σ_i β_i E_D^i + W log E_W    (6.1)

in place of 4.2, assuming a single regularization class. Otherwise, integrating over each β_i with the 1/β_i prior gives
L = ½ N Σ_i log E_D^i + Σ_c W_c log E_W^c

in place of 5.1, assuming multiple regularization classes.

6.1 Multiple Noise Levels. Even in the case of a single output regression network there may be reason to suspect that the noise level differs between different parts of the training set.⁸ In that case the training set can be partitioned into two or more subsets, and the term ½ N log E_D in 5.1 is replaced by ½ Σ_s N_s log E_D^s, where N_s is the number of patterns in subset s, with Σ_s N_s = N, and E_D^s = ½ Σ_{p∈s} (y_p − t_p)² is the data error over that subset.
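The multiple-output objective with known noise levels (equation 6.1) can be sketched as follows (the function and sample values are invented for illustration):

```python
import numpy as np

def multi_output_loss(Y, T, sigmas, w):
    """Eq. 6.1: L = sum_i beta_i*E_D^i + W*log(E_W), with beta_i = 1/sigma_i**2
    known in advance and one regularization class. Y and T are
    (patterns x outputs) arrays of predictions and targets."""
    E_D = 0.5 * ((Y - T) ** 2).sum(axis=0)   # per-output data errors E_D^i
    beta = 1.0 / np.asarray(sigmas) ** 2     # known noise precisions
    return float(beta @ E_D + np.count_nonzero(w) * np.log(np.abs(w).sum()))

Y = np.array([[1.0, 0.0], [0.5, 1.0]])   # predictions: 2 patterns, 2 outputs
T = np.array([[0.8, 0.1], [0.5, 0.9]])   # targets
L = multi_output_loss(Y, T, sigmas=[0.1, 0.2], w=np.array([0.5, -0.5]))
```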
7 Nonsmooth Optimization and Pruning
The practical problem from here on is assumed to be unconstrained minimization of 5.1. The objective function L is nondifferentiable, however, on account of the discontinuous derivative of |w_j| at each w_j = 0. This is a case of nonsmooth optimization (Fletcher 1987, Ch. 14). On the other hand, since L has discontinuities only in its first derivative and these are easily located, techniques applicable to smooth problems can still be effective (Gill et al. 1981, §4.2). Most optimization procedures applied to L as objective function are therefore likely to converge despite the discontinuities, though with a significant proportion of weights assuming negligibly small terminal values, at least for real noisy data. These are weights that an exact line search

⁷In many applications it will be unwise to assume that the noise is independent across outputs. This is often a reason for not using multiple output regression models in practice, unless one is willing to include cross terms (y_pi − t_pi)(y_pj − t_pj) in the data error and re-estimate the inverse of the noise covariance matrix during training.

⁸Typically this arises when training items relate to domains with an intrinsic topology. For example, predictability of some quantity of interest may vary over different regions of space (mineral exploration) or periods of time (forecasting).
would have set to exact zeros. They are in fact no longer free parameters of the model and should not be included in the counts W_c of weights in the various regularization classes. For consistency, these numbers should be reduced during the course of training, otherwise the trained network will be over-regularized. The rest of the paper is devoted to this issue.⁹

The approach is as follows. It is assumed that the training process consists of iterating through a sequence of weight vectors w⁰, w¹, … to a minimum of L. If these are considered to be joined by straight lines, the current weight vector traces out a path in weight space. Occasionally this path crosses one of the hyperplanes w_j = 0, where w_j is one of the components of the weight vector. This means that w_j is changing sign. The question is whether w_j is on its way from being sizeably positive to being sizeably negative, or vice versa, or whether |w_j| is executing a Brownian motion about w_j = 0. The proposal is to pause when the path crosses, or is about to cross, a hyperplane and decide which case applies. This is done by examining ∂L/∂w_j. If ∂L/∂w_j has the same sign on both sides of w_j = 0, w_j is on its way elsewhere. If it has different signs, more specifically the same sign as w_j on either side, this is where w_j wishes to remain, since L increases in either direction. In the second case the proposal is to freeze w_j permanently at zero and exclude it from the count of free parameters. From then on the search continues in a lower dimensional subspace.

With this in mind there are three problems to solve. The first concerns the behavior of L at w_j = 0 and a convenient definition of ∂L/∂w_j in such a case. The second concerns the method of setting weights to exact zeros, and the third concerns the implementation of pruning and the recount of free parameters.¹⁰

7.1 Defining the Derivative. For convenience we write the objective function 5.1 as L = L_D + L_W, where

L_D = ½ N log E_D  and  L_W = Σ_c W_c log E_W^c

The problem in defining ∂L/∂w_j lies with the second term, since |w_j| is not differentiable at w_j = 0. Suppose that w_j belongs to regularization class c and consider variation of w_j about w_j = 0, keeping all other weights fixed. This gives the cusp-shaped graph for L_W shown in Figure 1, which has a discontinuous

⁹Typical features of Laplace regularization can be sampled by applying some preferred optimization algorithm directly to the objective functions given by 4.3 or 5.1. This corresponds to the "quick and dirty" method of MacKay (1992, §6.1).

¹⁰The following discussion assumes batch training. Regularization using stochastic techniques is outside the present scope.
Figure 1: Space-like data gradient at w_j = 0.

derivative at w_j = 0. Its one-sided values are ±α̂_c, depending on the sign of w_j, where

α̂_c = W_c / E_W^c

so that 1/α̂_c is the mean absolute value of weights in class c. The two corresponding tangents to the curve are shown as dashed lines.

Consider small perturbations in w_j around w_j = 0, keeping other weights fixed. So far as the regularizing term L_W alone is concerned, w_j will be restored to zero, since a change in either direction increases L_W. The full objective function, however, is L = L_D + L_W, so that behavior under small perturbations is governed by the sum of the two terms ∂L_D/∂w_j and ∂L_W/∂w_j. Figure 1 shows one possibility for the relationship between them. Here ∂L_D/∂w_j is "space-like" with respect to ±α̂_c.¹¹ This is stable since ∂L/∂w_j, which is the sum of the two, has the same sign as w_j in either direction. Small perturbations in w_j will be restored to zero. Contrast this with Figure 2, where ∂L_D/∂w_j is now "time-like" with respect to ±α̂_c. Increasing w_j will escape the origin since ∂L_D/∂w_j is more negative than ∂L_W/∂w_j = α̂_c is positive. In short, ∂L/∂w_j is negative for small positive w_j. It follows that the criterion for stability at w_j = 0 is that

|∂L_D/∂w_j| < α̂_c    (7.1)

¹¹This is a reference to Minkowski's formulation of special relativity, with the tangents at the origin playing the role of a section of the light cone.
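The stability test, together with the downhill convention for the derivative at w_j = 0 defined in this section, can be sketched in one dimension. A minimal illustration (the function name is invented; a stands for α̂_c and b for ∂L_D/∂w_j):

```python
def laplace_partial(w, b, a):
    """dL/dw for L = L_D + a*|w|, where b = dL_D/dw and a > 0 is the
    regularizer slope. At w == 0: the downhill one-sided derivative when
    the data gradient is "time-like" (|b| > a), else 0 (w bound to zero)."""
    if w > 0:
        return b + a
    if w < 0:
        return b - a
    if b + a < 0:          # escaping in the positive direction lowers L
        return b + a
    if b - a > 0:          # escaping in the negative direction lowers L
        return b - a
    return 0.0             # space-like: criterion 7.1 holds, w stays at zero

assert laplace_partial(0.0, b=-0.5, a=1.0) == 0.0    # stable: |b| < a
assert laplace_partial(0.0, b=-2.0, a=1.0) == -1.0   # time-like: escapes upward
assert laplace_partial(0.3, b=-0.5, a=1.0) == 0.5    # ordinary case: b + a
```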
Figure 2: Time-like data gradient at w_j = 0.

If L is given by 5.1, so that L_D = ½ N log E_D, then ∂L_D/∂w_j = β̂ ∂E_D/∂w_j with β̂ given by 4.4. The criterion for stability can then be written in terms of E_D as
|∂E_D/∂w_j| < α̂_c / β̂

and a similar argument establishes 3.3 in the case of a single regularization class when α and β are assumed known.

It is convenient to define the objective function partial derivative ∂L/∂w_j at w_j = 0 as follows. If w_j is bound to zero, i.e., the partial derivative ∂L_D/∂w_j is space-like, ∂L/∂w_j is defined to be zero. If it is time-like, it is defined to be the value of the downhill derivative. Explicitly, using the abbreviations a = α̂_c and b = ∂L_D/∂w_j,

∂L/∂w_j =
    b + a  if w_j > 0
    b − a  if w_j < 0
    b + a  if w_j = 0 and b + a < 0
    b − a  if w_j = 0 and b − a > 0
    0      otherwise

… with (y_1, y_2) as target values.¹⁷ Simple three-layer networks were used with 2 input, 2 output, and from 5 to 20 hidden units. Results are shown in Figure 5. For comparability with MacKay's results, a single regularization class was used and it was assumed that the noise level σ = 0.05 was known in advance. The objective function to be minimized is therefore 6.1 with β_i = β = 1/σ². The ordinate in Figure 5 is twice the final value of the first term on the right-hand side of 6.1. This is a dimensionless χ² quantity whose expectation is 400 ± 20 relative to the actual noise process used in constructing the training set. Results on a test set, also of size 200 and drawn from the same distribution as the training set, are shown in Figure 6 using the same error units. Comparison with results on a further test set, of the same size and drawn from the same distribution, is shown in Figure 7. This confirms MacKay's observation that generalization error on a test set is a noisy quantity, so that many data would have to be devoted to a test set for test error to be a reliable way of setting regularization parameters.
¹⁷Training and test sets used here are the same as those in MacKay (1992), by courtesy of David MacKay.
Figure 5: Plot showing the data error of 148 trained networks. Ten networks were trained for each of 16 network architectures with hidden units ranging from 5 to 20. Twelve outliers relating to small numbers of hidden units have been excluded. The dotted line is 400 - W where W is the empirically determined number of free parameters remaining after Laplace regularized training, averaged over each group of 10 trials.
Figure 6: Test error versus number of hidden units.
Figure 7: Errors on two test sets.

Performance on both training and test sets settles down after around 13 hidden units. Little change is observed when further hidden units are added, since the extra connections are pruned by the regularizer, as shown by the dotted line in Figure 5. This contrasts with MacKay's results using the sum of squares regularizer, for which the training error continues to decrease as more hidden units are added and where the training error for approaching 20 hidden units differs very little from the best possible unregularized fit. MacKay's approach is to evaluate the "evidence" for each solution and to choose a number of hidden units that maximizes this quantity, which in this case is approximately 11 or 12. The present heuristic is to supply the network with ample hidden units and to allow the regularizer to prune these to a suitable number. Provided the initial number of hidden units is sufficient, the results are largely independent of the number of units initially supplied.

9.1 Varying the Noise. For a further demonstration of Laplace pruning, the problem is changed to one in which the network has a single output. Multiple output regression networks are unusual in practice, especially ones satisfying a relation such as y_1(x_1, x_2) = y_2(x_1 + π/2, x_2). There is also the possibility that the hidden units divide themselves into two groups, each serving one of the two outputs exclusively, which can make it difficult to interpret results. We therefore consider interpolation of just one of the outputs considered above, specifically the cosine expression y_1. The same 200 input pairs (x_1, x_2) were used as for MacKay's
Figure 8: Data error versus noise level for an initial 50 hidden units.

training set, but varying amounts of gaussian noise were added to the target outputs. Results using a network with 50 hidden units and with noise varying from 0.01 to 0.19 in increments of 0.01 are shown in Figure 8. In this case the noise was resampled on each trial, so that each of the 190 different networks was trained on a different training set. Two regularization classes were used, and it was no longer assumed that the noise level was known in advance. The objective function is therefore given by equation 5.1 with input and output weights forming the two classes. The data error in Figure 8 is again shown in χ² units whose expected value is now 200 relative to the actual noise process, since there is only one output unit. Specifically, the ordinate in Figure 8 measures Σ_p [(y_p − t_p)/σ]², where σ is the abscissa and p ranges over the 200 training items. The actual data error increases proportionately with the noise, so that the normalized quantity is effectively constant.

Figure 9 shows mean numbers of live hidden units, with one standard deviation error bars, in networks corresponding to each of the 19 noise levels. This is the number of hidden units remaining in the trained network after the pruning implicit in Laplace regularization. Note that the number of initially free parameters in a 50 hidden unit network with 2 inputs and 1 output is 201, so that with 200 data points the initial ratio of data points to free parameters is approximately 1. This should be contrasted with the statement in MacKay (1992) that the numerical approximation needed by the evidence framework, when used with gaussian regularization, seems to break down significantly when this ratio is less than 3 ± 1.
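The χ² normalization used for this ordinate can be illustrated with synthetic stand-in data (not the robot-arm training set): when the residuals are exactly the gaussian noise of the assumed level, the expectation of the normalized data error is the number of training items.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.05
t = rng.normal(size=200)                  # stand-in targets
y = t + sigma * rng.normal(size=200)      # predictions off by the noise alone
chi2 = (((y - t) / sigma) ** 2).sum()     # normalized data error

# chi-squared with 200 degrees of freedom: mean 200, standard deviation 20
assert abs(chi2 - 200.0) < 60.0
```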
Figure 9 indicates that there ought to be little purpose in using networks with more than 20 hidden units for noise levels higher than 0.05, if it is to be correct to claim that results are effectively independent of the number of hidden units used, provided there are enough of them. To verify this, a further 190 networks were trained using an initial architecture of 20 hidden units. Results for the final numbers of hidden
Figure 9: Live hidden units versus noise level for an initial 50 hidden units.
Figure 10: Live hidden units versus noise level for an initial 20 hidden units.

units are shown in Figure 10. Comparison with Figure 9 shows that, if more than 20 hidden units are available for noise levels below 0.05, the network will use them. But for higher noise levels there is no significant difference in the number of hidden units finally used, whether 20 or 50 are initially supplied. The algorithm also works for higher noise levels: Figure 11 shows corresponding results for noise levels from 0.05 to 0.95 in increments of 0.05. Note that in all these demonstrations with varying noise, the level is automatically detected by the regularizer, and the number of hidden units, or more generally the number of parameters, is accommodated to suit the level of noise detected.
Figure 11: Data error and live hidden units versus larger noise levels for 20 hidden units.

9.2 Posterior Weight Distribution. It was noted in Section 3 that the weights arrange themselves at a minimum so that the sensitivity of the data error to each of the nonzero weights in a given regularization class is the same, assuming Laplace regularization is used. For the weights themselves, the posterior conditional distributions in a given class are roughly uniform over an interval. Figure 12 shows the empirical distributions for a sample of 500 trained networks. These plots answer the question "what is the probability that the size of a randomly chosen input (output) weight of a trained network lies between x and x + δx, conditional on its being nonzero?" The unconditional distributions have discrete components at the origin. The probability of an output weight being zero was 0.38 and the probability of an input weight being zero was 0.47. These networks were trained on the cosine output of the robot arm problem using MacKay's sampling of the noise at the 0.05 level.
10 Summary and Conclusions
This paper has argued that the Σ|w| regularizer is more appropriate for the hidden connections of feedforward networks than the Σw² regularizer. It has shown how to deal with discontinuities in the gradient of |w| and how to recount the free parameters of the network as they are
Figure 12: Empirical posterior distributions of the size of nonzero input and output weights for 500 trained networks, each using 20 hidden units. Mean values are 0.55 for input weights and 1.31 for output weights. The natural hyperbolic tangent was used as transfer function for hidden units.

pruned by the regularizer. No numerical approximations need be made, and the method can be applied exactly even to small noisy data sets where the ratio of free parameters to data points may approach unity.
Appendix

The evidence framework (MacKay 1992; Thodberg 1993) proposes to set the regularizing parameters α and β by maximizing
P(D) = ∫ P(D | w) P(w) dw

considered as a function of α and β. This quantity is interpreted as the evidence for the overall model, including both the underlying architecture and the regularizing parameters. From equations 2.1 and 2.2 it follows that

P(D) = (Z_W Z_D)⁻¹ ∫ e⁻ᴹ dw

To evaluate the integral analytically, M is usually approximated by a quadratic in the neighborhood of a maximum of the posterior density at w = w_MP, where ∇M vanishes. The approximation is then

M(w) = M(w_MP) + ½ (w − w_MP)ᵀ A (w − w_MP)    (A.1)
where A = ∇∇M is the Hessian of M evaluated at w_MP. It follows that

−log P(D) = αE_W + βE_D + ½ log det A + log Z_W + log Z_D + constant
where the constant, which also takes account of the order of the network symmetry group, does not depend explicitly on α or β. Now the Laplace regularizer E_W is locally a hyperplane. This means that ∇∇E_W vanishes identically, so that A = βH, where H = ∇∇E_D is the Hessian of the data error alone. Assuming the Laplace regularizer and gaussian noise, Z_W = (2/α)^W and Z_D = (2π/β)^{N/2}, so that

−log P(D) = αE_W + βE_D + ½ log det H − W log α − ½ (N − k) log β + constant
where k is the full dimension of the weight vector. Setting to zero the partial derivatives with respect to α and β yields α = W/E_W and β = (N − k)/2E_D, so that

1/α = E_W / W    (A.2)

and

1/β = 2E_D / (N − k)    (A.3)

These should be compared with 4.4 and 4.5. If A.2 and A.3 are used as re-estimation formulas during training, the difference between the evidence framework and the method of integrating over hyperparameters reduces, in the case of Laplace regularization, to the difference between the factors N − k and N when re-estimating β.¹⁸ In many applications the differences in results, when using these two factors with Laplace regularization, are not sufficiently clear to decide the matter empirically, and it needs to be settled on other grounds (Wolpert 1993; MacKay 1994). In the present context, this paper prefers the method of integrating over hyperparameters for reasons of simplicity. Its main purpose, however, is to advocate the Laplace over the gaussian regularizer, in which case the difference between these two methods of setting regularizing parameters appears less significant.

¹⁸If β is assumed known, the methods are apparently equivalent. For multiple regularization classes the same argument leads, on either approach, to the re-estimation formula α_c = W_c/E_W^c for each regularization class c. For the multiple noise levels envisaged in Section 6, however, results will generally differ unless the levels are known in advance. Note that in saying that the Laplace regularizer is locally a hyperplane, it is assumed that none of the regularized weights vanishes, otherwise the Hessian A is not defined and the quadratic assumption A.1 is no longer meaningful. It is therefore assumed that zero weights are also pruned for the Laplace regularizer when using the evidence framework (compare Thodberg 1993, for pruning with the gaussian regularizer).
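The residual difference between the two schemes reduces to the N versus N − k factor in the β re-estimate. A numerical sketch with illustrative values (N, k, E_D, and E_W are invented for this example):

```python
# Illustrative values: N data points, k live parameters, final data misfit E_D.
N, k, E_D, E_W = 200, 40, 0.25, 10.0

beta_integrate = N / (2 * E_D)        # 4.4: integrating over hyperparameters
beta_evidence = (N - k) / (2 * E_D)   # A.3: evidence framework
alpha_either = 40 / E_W               # A.2 and 4.5 agree on alpha = W/E_W

assert beta_evidence < beta_integrate              # evidence assumes noisier data
assert beta_integrate / beta_evidence == N / (N - k)
```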
Acknowledgments

I am grateful to Dr. Perry Eaton, Dr. Colin Barnett, and other members of the Geophysical Department of Newmont Exploration Limited for stimulating discussions on the subject of this paper and related topics over the last few years.

References

Bishop, C. M. 1993. Curvature-driven smoothing: A learning algorithm for feedforward networks. IEEE Trans. Neural Networks 4(5), 882-884.

Buntine, W. L., and Weigend, A. S. 1991. Bayesian back-propagation. Complex Syst. 5, 603-643.

Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., Jackel, L., and Hopfield, J. 1987. Large automatic learning, rule extraction, and generalization. Complex Syst. 1, 877-922.

Fletcher, R. 1987. Practical Methods of Optimization (2nd ed.). John Wiley, New York.

Gill, P. E., Murray, W., and Wright, M. H. 1981. Practical Optimization. Academic Press, New York.

Hassibi, B., and Stork, D. G. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 164-171. Morgan Kaufmann, San Mateo, CA.

Jaynes, E. T. 1968. Prior probabilities. IEEE Trans. Syst. Sci. Cybernet. 4(3), 227-241.

Le Cun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 598-605. Morgan Kaufmann, San Mateo, CA.

MacKay, D. J. C. 1992. A practical Bayesian framework for backprop networks. Neural Comp. 4(3), 448-472.

MacKay, D. J. C. 1994. Hyperparameters: Optimise, or integrate out? In Maximum Entropy and Bayesian Methods, Santa Barbara, 1993, G. Heidbreder, ed. Kluwer, Dordrecht. (In press.)

Møller, M. F. 1993a. Exact calculation of the product of the Hessian matrix of feedforward network error functions and a vector in O(n) time. Report DAIMI PB-432, Computer Science Department, Aarhus University, Denmark.

Møller, M. F. 1993b.
A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6(4), 525-533.

Neal, R. M. 1992. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Tech. Rep. CRG-TR-92-1, Department of Computer Science, University of Toronto.

Neal, R. M. 1993. Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 475-482. Morgan Kaufmann, San Mateo, CA.
Bayesian Regularization
143
Nowlan, S. J., and Hinton, G. E. 1992. Adaptive soft weight tying using gaussian mixtures. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds., pp. 993-1000. Morgan Kaufmann, San Mateo, CA.

Pearlmutter, B. A. 1994. Fast exact multiplication by the Hessian. Neural Comp. 6(1), 147-160.

Plaut, D. C., Nowlan, S. J., and Hinton, G. E. 1986. Experiments on learning by backpropagation. Tech. Rep. CMU-CS-86-126, Carnegie Mellon University, Pittsburgh, PA 15213.

Thodberg, H. H. 1993. Ace of Bayes: Application of neural networks with pruning. Manuscript 1132E, The Danish Meat Research Institute.

Tikhonov, A. N., and Arsenin, V. Y. 1977. Solutions of Ill-Posed Problems. John Wiley, New York.

Tribus, M. 1969. Rational Descriptions, Decisions and Designs. Pergamon Press, Oxford.

Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1991. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 875-882. Morgan Kaufmann, San Mateo, CA.

Williams, P. M. 1991. A Marquardt algorithm for choosing the step-size in backpropagation learning with conjugate gradients. Cognitive Science Research Paper CSRP 229, University of Sussex.

Williams, P. M. 1993a. Aeromagnetic compensation using neural networks. Neural Comp. Appl. 1, 207-214.

Williams, P. M. 1993b. Improved generalization and network pruning using adaptive Laplace regularization. In Proceedings of the Third IEE International Conference on Artificial Neural Networks, pp. 76-80. Institution of Electrical Engineers, London.

Wolpert, D. H. 1993. On the use of evidence in neural networks. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 539-546. Morgan Kaufmann, San Mateo, CA.
Received February 16, 1994; accepted May 20, 1994.
Communicated by Vladimir Vapnik
Empirical Risk Minimization versus Maximum-Likelihood Estimation: A Case Study

Ronny Meir
Department of Electrical Engineering, Technion, Haifa 32000, Israel
We study the interaction between input distributions, learning algorithms, and finite sample sizes in the case of learning classification tasks. Focusing on the case of normal input distributions, we use statistical mechanics techniques to calculate the empirical and expected (or generalization) errors for several well-known algorithms learning the weights of a single-layer perceptron. In the case of spherically symmetric distributions within each class we find that the simple Hebb rule, corresponding to maximum-likelihood parameter estimation, outperforms the other more complex algorithms, based on error minimization. Moreover, we show that in the regime where the overlap between the classes is large, algorithms with low empirical error do worse in terms of generalization, a phenomenon known as overtraining.

1 Introduction
The problem of pattern recognition can formally be stated as follows (Vapnik 1982): in a certain environment that is characterized by a probability density P(x), instances x appear randomly and independently. The instructor classifies these instances into one of k classes, using the conditional probability distribution function P(y | x), where y = 0, 1, ..., k − 1 is a class label. Neither the properties of the environment P(x) nor the decision rule P(y | x) is known. In the remainder of the paper we restrict ourselves to two-class classification, denoting the class labels by y = ±. In the case of parametric classification one considers a class of parameterized functions f_w(x) = sgn[h_w(x)], depending on a parameter vector w. The objective of pattern recognition is then to estimate the value w* of the parameters w, which minimizes the probability of misclassification for an input instance x drawn randomly according to the environmental probability distribution P(x). This quantity, which (following Vapnik 1982) we term the expected error, is given by

ε(w) = ∫ dx dy P(x, y) 1_y[f_w(x)]    (1.1)

where the complementary indicator function 1_y(z) is 1 if y ≠ z and zero otherwise. It should be noted that the expected error is sometimes termed

Neural Computation 7, 144-157 (1995)
© 1994 Massachusetts Institute of Technology
generalization or prediction error by other authors. In typical situations one is exposed to a set of m training pairs D^m = {(x^1, y^1), ..., (x^m, y^m)}, each drawn independently at random according to the unknown probability distribution P(x, y). It is then common to define the empirical error

ν(w, D^m) = (1/m) Σ_{l=1}^{m} 1_{y^l}[f_w(x^l)]    (1.2)

which is a finite sample approximation to the expected error. It can be shown that under a wide range of conditions (Vapnik 1982) the empirical error converges uniformly to the expected error almost surely, when the sample size m increases without bound. The rate at which this occurs, however, is a complicated and problem-specific issue, which is at the heart of the learning problem. In fact, it is the objective of efficient learning strategies to ensure that this convergence occurs at the highest possible rate. In particular, one is often interested in learning algorithms that produce the lowest possible expected error for finite sample sizes. It has been pointed out by Vapnik (1982) that there is no reason to expect that the parameter value w^m minimizing the empirical risk for a finite sample size m is indeed the best approximation to the true minimizer w*. Moreover, from a practical point of view, minimizing the empirical error in the pattern recognition problem may be problematic as well. As can be seen from equation 1.2, the empirical error is a piecewise constant function, rendering any gradient-based method useless. One possible solution to these problems is choosing the parameter value w to minimize an auxiliary function termed the training error,

ε_t(w, D^m) = (1/m) Σ_{l=1}^{m} V[y^l, h_w(x^l)]    (1.3)
Here V[y^l, h_w(x^l)] is a differentiable distance measure between the desired and actual outcome of the classifier. One then usually considers gradient-based learning algorithms, which utilize the information available from the gradient vector in order to minimize the function ε_t(w, D^m). The widely used backpropagation algorithm is a special case of this strategy (at least when applied to classification problems). As we show in Section 3, it is actually possible for finite sample sizes to obtain a lower expected error by choosing w to minimize the training, rather than the empirical, error. A general framework for judging the performance of any classifier is Bayesian theory (Duda and Hart 1973). Denoting by P(x | y) the conditional probability for input x given that it belongs to class y = ±, one can show that the optimal classifier is obtained by the decision rule: assign x to class + if

ln P(x | +) − ln P(x | −) + ln (p+/p−) > 0    (1.4)

where p± are the prior probabilities for class ±. In the remainder of this work we will be interested in the important case of normal probability
distributions P(x | y), for which the optimal Bayes classifier is given by the quadratic classifier

F(x) = sgn[ −(1/2)(x − u+)ᵀ Σ+⁻¹ (x − u+) + (1/2)(x − u−)ᵀ Σ−⁻¹ (x − u−) − (1/2) ln(|Σ+|/|Σ−|) + ln(p+/p−) ]    (1.5)

In the above equation u± and Σ± stand for the mean vectors and covariance matrices, respectively, of the normal distributions P(x | ±), and |Σ±| stand for the determinants. As can be readily seen from equation 1.5, in the case where the covariance matrices Σ± are equal (Σ+ = Σ− = Σ), the optimal Bayes classifier reduces to a linear discriminant function (single-layer perceptron),

F(x) = sgn[ (u+ − u−)ᵀ Σ⁻¹ x − (1/2)(u+ᵀ Σ⁻¹ u+ − u−ᵀ Σ⁻¹ u−) + ln(p+/p−) ]    (1.6)
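For the modern reader, equations 1.5 and 1.6 translate directly into a few lines of linear algebra. The sketch below is an illustration only: the function names and the two-dimensional test case are not from the paper, and numpy is assumed.

```python
import numpy as np

def quadratic_bayes(x, u_plus, u_minus, S_plus, S_minus, p_plus=0.5):
    """Bayes-optimal quadratic classifier (equation 1.5) for two gaussian
    classes N(u_plus, S_plus) and N(u_minus, S_minus)."""
    d_p, d_m = x - u_plus, x - u_minus
    h = (-0.5 * d_p @ np.linalg.inv(S_plus) @ d_p
         + 0.5 * d_m @ np.linalg.inv(S_minus) @ d_m
         - 0.5 * np.log(np.linalg.det(S_plus) / np.linalg.det(S_minus))
         + np.log(p_plus / (1.0 - p_plus)))
    return np.sign(h)

def linear_bayes(x, u_plus, u_minus, S, p_plus=0.5):
    """Equal-covariance special case: the linear discriminant (equation 1.6)."""
    Sinv = np.linalg.inv(S)
    h = ((u_plus - u_minus) @ Sinv @ x
         - 0.5 * (u_plus @ Sinv @ u_plus - u_minus @ Sinv @ u_minus)
         + np.log(p_plus / (1.0 - p_plus)))
    return np.sign(h)
```

When the two covariance matrices coincide, the quadratic terms cancel and the two rules agree, which is exactly the reduction from equation 1.5 to 1.6.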
We restrict ourselves in the remainder of the paper to the analysis of the single-layer case, both in the case where it is optimal (equal covariance matrices) and otherwise. We are then able to calculate analytically the empirical, training, and expected errors for a variety of choices of the training error. In performing the calculations for the different learning models we follow the line of research pioneered by E. Gardner (1988), which enables the analytic calculation of many relevant quantities. Much of the earlier work along these lines has focused on rather simple situations such as uniform distributions and computationally unfeasible learning algorithms (see Watkin et al. 1993 for a review). More recently Griniasti and Gutfreund (1991) and Meir and Fontanari (1992) have studied more practical gradient-based learning algorithms, while Biehl and Mietzner (1993) and Barkai et al. (1993) have focused on more realistic input distributions of the type discussed in this paper. In fact, it is the goal of this paper to combine the above extensions, thus prompting a theoretical investigation into the performance of several well-known learning algorithms and their dependence on the task being solved. The remainder of the paper is organized as follows. In Section 2 we specialize the above general framework to the particular case studied in this work. Section 3 then introduces several training error functions, corresponding to distinct learning algorithms, while in Section 4 we present the analysis of the models under various circumstances. Finally, in Section 5 we summarize our findings and list several open problems. The mathematical details of the derivation, based on the work of E. Gardner (1988), are briefly reviewed in the appendix but are not essential for an understanding of the paper.
2 The Linear Threshold Classifier
We consider a single-layer perceptron with inputs x ∈ R^d and output

o = sgn[h_w(x)] = sgn(w · x)    (2.1)
We have set the bias term to zero for simplicity and, thus, without loss of generality, impose the normalization condition ||w|| = 1. We further assume that the conditional probability distribution of the input vector x given class label y = ± is gaussian,

P(x | y = ±) = (2πσ±²)^(−d/2) exp(−||x − u±||² / 2σ±²)    (2.2)
Here we have assumed the covariance matrices Σ± to be multiples of the unit matrix in d dimensions, i.e., Σ± = σ±² 1. The geometric meaning of this assumption is that the pattern distribution around each class center is spherically symmetric. To simplify the calculations we follow Barkai et al. (1993) and take the mean vectors u± to be orthogonal, u+ · u− = 0, and of equal magnitude, ||u+|| = ||u−|| = u. Furthermore, we will assume throughout that the prior class probabilities p± are equal to 1/2. The probability of error for a linear classifier of weight vector w and input distribution as in equation 2.2 is readily evaluated, yielding

ε(w) = (1/2)[H(w · u+/σ+) + H(−w · u−/σ−)]    (2.3)

where we have used H(y) = ∫_y^∞ dx e^{−x²/2}/√(2π). As mentioned in Section 1, in the case where the variances are equal, i.e., σ+ = σ− = σ, the linear discriminant function is optimal. The optimal weight w* can be read directly from equation 1.6, yielding

w* = (u+ − u−)/(√2 u)    (2.4)

giving rise to the minimal expected error

ε(w*) = H(u/√2 σ)    (2.5)
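The closed forms in equations 2.3-2.5 are easy to check by simulation. The following is a hedged sketch (parameter values chosen for illustration, not taken from the paper), with H implemented through the complementary error function:

```python
import numpy as np
from math import erfc, sqrt

def H(y):
    # H(y) = ∫_y^∞ dx e^{-x²/2} / √(2π)
    return 0.5 * erfc(y / sqrt(2.0))

rng = np.random.default_rng(0)
d, u, sigma, n = 50, 1.0, 1.0, 20_000

# orthogonal class centers of equal magnitude u, as in Section 2
u_plus = np.zeros(d); u_plus[0] = u
u_minus = np.zeros(d); u_minus[1] = u
w_star = (u_plus - u_minus) / (sqrt(2.0) * u)   # equation 2.4, ||w*|| = 1

# draw n points from each class and measure the misclassification rate
x_plus = u_plus + sigma * rng.standard_normal((n, d))
x_minus = u_minus + sigma * rng.standard_normal((n, d))
err = 0.5 * ((x_plus @ w_star < 0).mean() + (x_minus @ w_star > 0).mean())

err_theory = H(u / (sqrt(2.0) * sigma))         # equation 2.5
```

With u = σ = 1 the theoretical minimum is H(1/√2) ≈ 0.24, and the Monte Carlo estimate agrees to within sampling noise.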
We note that the factor √2 in equation 2.4 guarantees the normalization ||w|| = 1 (keeping in mind the orthogonality condition u+ · u− = 0). Obviously, if one knew u± in advance one could plug them directly into equation 2.4 and obtain w*. The problem of learning is approximating w* as well as possible from a limited data set.

3 Minimizing the Training Error
As discussed in Section 1, the goal of learning is to calculate parameter values w* that yield minimal expected error. However, with access to only a finite set of input/output pairs D^m, the exact calculation of the
expected error is impossible. This fact, together with the nondifferentiability of the empirical error, led us to consider minimizing the training error, equation 1.3. For the case of single-layer perceptrons it is useful to define the stability, given by

Δ^l = y^l w · x^l    (3.1)

The complementary indicator function 1_{y^l}[f_w(x^l)] of equation 1.2 is then replaced by Θ(−Δ^l), where the Heaviside function Θ(x) is zero for negative x and 1 otherwise. The empirical error is then given by

ν(w, D^m) = (1/m) Σ_{l=1}^{m} Θ(−Δ^l)    (3.2)

With a slight abuse of the notation for V(·), equation 1.3, the training error for our problem then takes the form (Griniasti and Gutfreund 1991)

ε_t(w, D^m) = (1/m) Σ_{l=1}^{m} V(Δ^l)    (3.3)
In particular we focus here on three different functions V(Δ), which give rise to distinct learning algorithms. The first function studied is the one giving rise to the well-known perceptron learning algorithm,

V_P(Δ) = −Δ Θ(−Δ)    (3.4)

Gradient-descent learning using this function gives rise to the perceptron learning rule (in batch mode). We note in passing that the well-known result on the nonconvergence of the perceptron learning algorithm in the nonseparable case (Minsky and Papert 1988) does not apply in batch mode, since the algorithm always converges to a local minimum as long as the gradient descent is done properly. Another related function, recently proposed by Frean (1992),

V_F(Δ) = [1 − exp(λΔ)] Θ(−Δ)    (3.5)

was motivated by the desire to construct an efficient learning algorithm in the nonlinearly separable case. The motivation for this proposal was the observation that in the perceptron error function, equation 3.4, the price paid for strongly "wrong" patterns (large negative Δ) is very high (proportional to |Δ|). Thus, outliers are expected to wreak havoc with this learning algorithm. In the newly proposed algorithm (i.e., gradient-descent dynamics on the function 1.3 with V = V_F), outliers are suppressed due to the exponential decay. The Frean function interpolates nicely between the perceptron function for small λ and the empirical error, i.e., the fraction of misclassifications (equation 1.2), when λ becomes large. Note that the functions V_P and V_F are continuous, in spite of the step functions appearing in their definitions.
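The two error functions of equations 3.4 and 3.5 are a few lines of vectorized code. A minimal sketch (the function names are hypothetical, numpy assumed):

```python
import numpy as np

def V_perceptron(delta):
    # equation 3.4: linear penalty on misclassified (delta < 0) patterns
    return -delta * (delta < 0)

def V_frean(delta, lam=1.0):
    # equation 3.5: exponentially saturating penalty, suppressing outliers
    return (1.0 - np.exp(lam * delta)) * (delta < 0)
```

For small λ one has V_F(Δ) ≈ λ V_P(Δ), while for large λ and Δ < 0 the penalty saturates at 1, recovering the misclassification count — the interpolation described above.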
Finally, in the Hebb learning algorithm (recently studied in a very similar context by Barkai et al. 1993) one has

w = γ Σ_{l=1}^{m} y^l x^l    (3.6)

where γ assures the normalization ||w|| = 1. One should note that this algorithm can be derived as the gradient-descent dynamics performed on the training error ε_t with

V_H(Δ) = −Δ    (3.7)

provided initially w = 0. We note in passing that the Hebb rule is different from the other rules studied in that it is not an error-correcting rule, i.e., all patterns contribute to the weight modification, whether they are correctly classified or not. In this work we do not solve the dynamical process resulting from the gradient-descent procedure, but rather compute the properties of the global minima of the above energy functions. While it is possible that the error functions studied contain local minima, a mathematical study of these local minima is beyond the scope of this paper. In particular, in Section 4 we present the empirical and expected errors predicted by minimizing the different training errors discussed in this section.

4 Analysis
The analytic results we derive are obtained in the so-called thermodynamic limit, d → ∞, m → ∞, and α = m/d < ∞, and utilize the framework developed by Gardner (1988) and Gardner and Derrida (1988). We will be particularly interested in the case where the class centers u±, as well as their widths σ±, are of order unity, so that the inputs from the two classes overlap considerably and thus the minimum expected error is nonzero (see, for example, equation 2.3). Before discussing the results it is important to make the following observation. For small values of α, the normalized sample size, both the perceptron and the Frean error functions yield zero empirical error. This results from the fact that for a small number of training examples (small α) a hyperplane can always separate the data without error, yielding Δ^l > 0 for all values of l. In this paper we focus on values of α larger than this critical size, since all error-correcting algorithms are identical below this point. This observation does not hold for the Hebb rule, which yields nonzero empirical error even for small α. Using the equations described in the appendix allows us to analyze the empirical and expected errors produced by the various learning algorithms studied. Following the notation of Barkai et al. (1993) we define the variables R± through

w · u± = u R±    (4.1)
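The overlaps of equation 4.1 compress everything the expected error needs into the two numbers R±. A small helper sketch (hypothetical function names, assuming a unit-norm w and class centers of equal magnitude u, as in Section 2):

```python
import numpy as np
from math import erfc, sqrt

def H(y):
    # H(y) = ∫_y^∞ dx e^{-x²/2} / √(2π)
    return 0.5 * erfc(y / sqrt(2.0))

def overlaps(w, u_plus, u_minus):
    # equation 4.1: w · u± = u R±, with ||u±|| = u and ||w|| = 1
    u = np.linalg.norm(u_plus)
    return (w @ u_plus) / u, (w @ u_minus) / u

def expected_error(w, u_plus, u_minus, s_plus, s_minus):
    # expected error in terms of the overlaps (equation 4.3, equivalently 2.3)
    u = np.linalg.norm(u_plus)
    R_plus, R_minus = overlaps(w, u_plus, u_minus)
    return 0.5 * (H(u * R_plus / s_plus) + H(-u * R_minus / s_minus))
```

For the optimal weight w* of equation 2.4 one recovers R± = ±1/√2 and the minimal error H(u/√2 σ) of equation 2.5.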
Ronny Meir
150
In the thermodynamic limit, the empirical and expected errors can be simply expressed in terms of these variables, as explained in the appendix. Denoting by w^m(D^m) the value of w minimizing the training error, we obtain the following results for the average empirical and expected errors, where the average is taken with respect to the probability distribution of the data D^m. The average empirical error is given by a remarkably simple expression (equation 4.2), in which the parameters T± are algorithm dependent; their specific values can be obtained by solving the equations given in the appendix. The average expected error is similarly given by the expression

ε = (1/2)[H(uR+/σ+) + H(−uR−/σ−)]    (4.3)

In the case of the Hebb rule, the overlap variables R± can be written out explicitly (see also Barkai et al. 1993),

R+ = (1/√2)(1 + σ²/αu²)^(−1/2)  (Hebb rule)    (4.4)

with R− = −R+ and σ² = σ+² + σ−². We note that for consistency we must have T± → 0 when α → ∞, so that the empirical error converges to the expected error, as it should.
4.1 Equal Covariance Matrices. In this case we take σ+ = σ− = 1, in which case the linear discriminant function is optimal. Moreover, as can clearly be seen in equation 1.6, the optimal bias vanishes (keeping in mind ||u+|| = ||u−||). For large α we find, for all training functions considered, that the expected error decays toward the optimal value ε_min according to a power law

ε(α) − ε_min ≈ c/α    (4.5)

where the constant c is again algorithm specific. However, since for the perceptron and Frean error functions the replica symmetric solution is unstable (see appendix), it is not inconceivable that the decay rate is somewhat different. For the Hebb rule, equation 3.6, the solution is exact and the value c is given by Barkai et al. (1993). It is interesting that choosing w to minimize the empirical error ν(w) (the so-called zero temperature Gibbs algorithm) yields an expected error that decays asymptotically like 1/√α, as observed by Barkai et al. (1993). This implies that minimizing the empirical error is a suboptimal strategy in this case. The dependence of the empirical and expected errors on the normalized sample size α is plotted in Figure 1 for the learning algorithms studied, in the case u = 1 and σ± = 1.
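The qualitative picture can be reproduced with a small experiment. The sketch below is not the paper's exact simulation protocol; the parameters (d = 50, α = 4, u = σ = 1), step size, and iteration count are assumed. It compares the Hebb estimate of equation 3.6 with batch gradient descent on the perceptron error of equation 3.4:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(1)
d, u, sigma, alpha = 50, 1.0, 1.0, 4.0
m = int(alpha * d)
H = lambda t: 0.5 * erfc(t / sqrt(2.0))

u_plus = np.zeros(d); u_plus[0] = u
u_minus = np.zeros(d); u_minus[1] = u

# training set: m/2 examples per class, labels y^l = ±1
y = np.repeat([1.0, -1.0], m // 2)
X = np.where(y[:, None] > 0, u_plus, u_minus) + sigma * rng.standard_normal((m, d))

def gen_error(w):
    # expected error of a unit-norm linear classifier (equation 2.3)
    return 0.5 * (H(w @ u_plus / sigma) + H(-(w @ u_minus) / sigma))

# Hebb rule (equation 3.6): normalized sum of y^l x^l
w_hebb = (y[:, None] * X).sum(axis=0)
w_hebb /= np.linalg.norm(w_hebb)

# batch gradient descent on the perceptron training error (equation 3.4),
# renormalizing to ||w|| = 1 after each step
w_p = rng.standard_normal(d)
w_p /= np.linalg.norm(w_p)
for _ in range(500):
    delta = y * (X @ w_p)                                  # stabilities, eq. 3.1
    grad = -((delta < 0)[:, None] * y[:, None] * X).mean(axis=0)
    w_p -= 0.1 * grad
    w_p /= np.linalg.norm(w_p)

err_hebb, err_perc = gen_error(w_hebb), gen_error(w_p)
```

On typical runs the Hebb vector attains an expected error close to the theoretical value implied by equation 4.4 (about 0.28 at these parameters), with the perceptron solution usually somewhat higher, consistent with Figure 1b.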
Figure 1: (a) Empirical error for a mixture of gaussians with u = 1 and σ± = 1. The lowest possible empirical error is given by the solid line, the dotted line is the result of the Frean algorithm (with λ = 1), the dashed-dotted line is that of the perceptron algorithm, while the Hebb algorithm is given by the dashed line. (b) Same as (a), but for the expected error. Results of simulations for 50 inputs are presented by crosses for the perceptron algorithm, by + for Frean's algorithm, and by * for the Hebb rule. The size of the error bars is approximately ±(0.01-0.02).
As can be seen in Figure 1a, the Hebb rule yields the highest empirical error, followed by the perceptron and Frean error functions. It should be noted, however, that the Hebb rule gives rise to the lowest expected error, as can be seen in Figure 1b. We have added to Figure 1b results
of numerical simulations performed with d = 50 and averaged over 100 cases. As can be seen, there is a small discrepancy between the analytic and numerical results for the perceptron and Frean functions. After having increased the system size to 200 and observing no noticeable change, we concluded that the difference is probably due to replica symmetry breaking effects, as discussed in the appendix. Another source for the discrepancy between the analytic results and the simulation is the possible existence of local minima in the training functions. All numerical results were obtained using the conjugate gradient method to minimize equation 1.3 (keeping in mind the constraint ||w|| = 1). This superiority of the Hebb rule over the others is in fact not surprising, as can be seen from the following argument. As was claimed above, in the present case the optimal Bayes classifier is given in equation 2.4 by the difference between the gaussian centers u+ and u−. Now, the Hebb rule can be written in the form

w ∝ Σ_{l∈+} x^l − Σ_{l∈−} x^l    (4.6)

where l ∈ ± refers to those inputs arising from class ±, respectively. Now, it is well known that the sample average is the maximum likelihood estimate for the mean of a normally distributed random variable (Duda and Hart 1973). Thus, we expect that in this situation the Hebb rule, which simply calculates the sample mean for each class, will indeed be a good learning strategy, assuming that maximum likelihood is an efficient strategy. It is interesting to note that there exists a strategy, the so-called James-Stein method (James and Stein 1961), which in the case of spherically symmetric normal distributions is guaranteed to yield a better estimator of the true mean than the sample mean computed through the maximum-likelihood approach. Moreover, one can show (Strawderman 1971) that under certain conditions an optimal strategy exists (different from the James-Stein one) that yields the "best" estimator for the true mean. Thus, at least in this case, maximum-likelihood estimation is provably suboptimal, although performing better than empirical error minimization. These results are an illustration of the observation of Vapnik (1982) that the weight value minimizing the empirical error is not necessarily optimal for finite sample sizes. In fact, in the present model it turns out that minimizing the empirical error is the worst strategy among those studied (for finite sample sizes). We note in passing that the result for the Hebb rule can be derived without recourse to statistical mechanics, by using simple probabilistic arguments. For details of a similar calculation the reader is referred to Watkin et al. (1992).

4.2 Unequal Covariance Matrices. As remarked in Section 4.1, it would seem that the simplest learning algorithm, namely the Hebb rule,
performs best in the situation where the covariance matrices are both equal to the unit matrix. This prompted us to consider the slightly more complex situation described by equation 2.2 with σ+ ≠ σ−, as a more realistic scenario. Since in this case the linear classifier is no longer Bayes optimal, it behooves us to find the best linear classifier. A surprisingly simple result (Fukunaga 1990) demonstrates that under the condition that h_w(x) = w · x is normal, the best linear classifier is given by

w* = [a Σ− + (1 − a) Σ+]⁻¹ (u+ − u−)    (4.7)

where a is determined by the specific optimality criterion used. We note that in the limit of high dimension (large d), h_w(x) is expected to be gaussian under a wide range of conditions (central limit theorem), and thus the above result should be of wide applicability. In the case studied here, namely Σ± = σ±² 1, we find that again the optimal linear classifier is proportional to u+ − u−, as in the case of equal variances. In this case, however, the optimal bias term is no longer zero, as could be surmised on geometric grounds. This would lead us to conjecture, based on our arguments in Section 4.1, that the Hebb rule is best (for linear classifiers) in this case as well. This is in fact borne out by the calculations, and can be seen clearly in Figure 2, where we plot the expected error for several learning algorithms in the case u = 1, σ+ = 1, and σ− = 0.3. It is interesting to note that in this case the perceptron learning algorithm in fact produces slightly worse results (i.e., higher expected error) than those obtained by minimizing the empirical error (at least for the range of α appearing in the figure). In fact, we expect that as the relative width of the two gaussians decreases (keeping their centers fixed) the empirical error offers a continually improving estimate, since in this case the overlap between the two classes becomes negligible. We note also from Figure 2b that the Frean error function yields a lower expected error than that obtained by minimizing the empirical error, if the sample size is large enough. However, this effect is small and may be an artifact of the instability of the replica symmetric solution. We conclude from the above results that the Hebb rule is the best learning rule of those studied in the case where the input distribution within each class is a spherically symmetric gaussian, i.e., Σ± = σ±² 1.
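Equation 4.7 amounts to a single linear solve. A sketch (hypothetical function name; the parameter a is left as a free knob standing in for the optimality criterion):

```python
import numpy as np

def best_linear_w(u_plus, u_minus, S_plus, S_minus, a=0.5):
    """Best linear discriminant direction for two gaussian classes with
    unequal covariances (equation 4.7), returned with unit norm."""
    M = a * S_minus + (1.0 - a) * S_plus
    w = np.linalg.solve(M, u_plus - u_minus)
    return w / np.linalg.norm(w)
```

For Σ± = σ±² 1 the bracketed matrix is a multiple of the identity for any a, so the direction reduces to u+ − u−, matching the remark in the text.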
As can be seen from equation 4.7, in the case where Σ± are not spherically symmetric we expect a different choice of weights to be optimal. Unfortunately, the analytic calculation of the learning curve in the case of a general covariance matrix Σ± becomes much more complicated, and will not be pursued in this paper.

5 Conclusion
We have studied the interaction between data distributions, learning algorithms, and finite sample size. In particular, we have looked at the
Figure 2: (a) Empirical error for a mixture of gaussians with u = 1, σ+ = 1, and σ− = 0.3. The lowest possible empirical error is given by the solid line, the dotted line is the result of the Frean algorithm (with λ = 1), the dashed-dotted line is that of the perceptron algorithm, while the Hebb algorithm is given by the dashed line. (b) Same as (a), but for the expected error.

classification of mixtures of gaussians with spherically symmetric covariance matrices, of possibly different variances. Comparing several learning algorithms for single-layer perceptrons, we have found the Hebb rule to be the best choice under the above conditions. An interesting result of our work is that algorithms yielding low empirical error do worse in terms of the expected error. This result is perhaps not surprising, since we have focused on situations where the data
arising from the two classes overlap considerably, giving rise to problems of overfitting. The important case of efficient algorithms for general covariance matrices requires further study. In any case, it is clear that the Hebb rule is not efficient under these conditions, since it does not even converge asymptotically to the optimal weight vector (see equation 4.7). It would thus be worthwhile to further investigate how the other algorithms, which are more sensitive to the input distribution, behave under these conditions. A related line of research would be the investigation of quadratic classifiers, which are known to be optimal for arbitrary mixtures of gaussians.
Appendix

In this appendix we briefly describe the mathematical technique used to obtain the analytical results presented in this paper. To calculate the minimal training error we first define a partition function Z(D^m), given by

Z(D^m) = ∫ dw δ(w · w − 1) exp[−β m ε_t(w, D^m)]    (A.1)

from which the average minimal training error is easily seen to be given by

⟨ε_t⟩_min = −lim_{β→∞} (1/βm) ⟨ln Z(D^m)⟩_{D^m}    (A.2)

In this expression, ⟨ln Z(D^m)⟩_{D^m} stands for an average over the distribution of the data D^m. Thus, all results derived will be average case results. Without going into the mathematical details, which are lengthy but unilluminating, we present the general equations needed in order to calculate the overlap functions R± defined in equation 4.1, which are shown to be relevant for the calculation of the behavior of the system. For simplicity we present results for the case where the prior probabilities of the classes are equal, i.e., p+ = p− = 1/2. Our calculations were done using the elegant formulation of Griniasti and Gutfreund (1991), where details of a similar calculation can be found (see also Meir and Fontanari 1992). We focus on the regime above α_c, which is the value of α above which zero empirical error is impossible. All algorithms, except for the Hebb rule, yield identical results for α ≤ α_c, since they all lead to zero empirical (and training) error in this regime. In order to calculate R± we need to solve the following three equations for R± and the auxiliary variable x.
(A.3)
Here Dt = e^{−t²/2} dt/√(2π) is a gaussian measure and Δ± are the minima, respectively, of the functions

where the form of V(Δ) is given for the three error functions studied in this paper in Section 3. The variables T± appearing in the empirical error results of equation 4.2 are given in terms of x for each algorithm: by σ± x in the case of the perceptron function, and by 2λ± σ±/αu for the Hebb algorithm (notice that in this case R− = −R+ irrespective of whether σ+ = σ−). In all cases x is obtained through a solution of equations A.3. The expression for T± in the case of the Frean error function is much more complex, requiring the solution of several other equations, and will not be presented here. The calculations reported above were done using the so-called replica symmetric assumption (Mezard et al. 1987). We have found, however, that this ansatz is incorrect in the regime considered [except for the Hebb rule, which can be solved without recourse to replicas (Watkin et al. 1992)]. In spite of this fact, it has been found in many similar problems (see Watkin et al. 1993 for a review) that the replica symmetric solution provides a good approximation even in the regime where replica symmetry breaking takes place. The numerical results presented in Section 4 lend further credence to this claim.

Acknowledgments

The author is grateful to J. F. Fontanari for many helpful discussions and to M. Biehl and H. S. Seung for sending him copies of their work prior to publication. The author also thanks the anonymous referees for pointing out the nonoptimality of the maximum-likelihood estimator, as well as other useful comments. Research supported in part by the Ollendorff Center of the Electrical Engineering Department at the Technion and by the Lady Davis Foundation.

References

Barkai, N., Seung, H. S., and Sompolinsky, H. 1993. Scaling laws in learning classification tasks. Phys. Rev. Lett. 70(20), 3167-3170.
Biehl, M., and Mietzner, A. 1993. Statistical mechanics of unsupervised learning. Preprint, Universität Würzburg.
Duda, R. O., and Hart, P. E.
1973. Pattern Classification and Scene Analysis. Wiley, New York.
Frean, M. 1992. A "thermal" perceptron learning rule. Neural Comp. 4(6), 946-957.
Fukunaga, K. 1990. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, CA.
Gardner, E. 1988. The space of interactions in neural network models. J. Phys. A 21, 257-270.
Gardner, E., and Derrida, B. 1988. Optimal storage properties of neural networks. J. Phys. A 21, 271-284.
Griniasti, M., and Gutfreund, H. 1991. Learning and retrieval in attractor neural networks above saturation. J. Phys. A 24, 715.
James, W., and Stein, C. 1961. Estimation with quadratic loss. Proc. Fourth Berkeley Symp. Math. Statist. Prob. 1, 311-319.
Meir, R., and Fontanari, J. H. 1992. Calculation of learning curves for inconsistent algorithms. Phys. Rev. A 45(12), 8874-8884.
Mezard, M., Parisi, G., and Virasoro, M. A. 1987. Spin Glass Theory and Beyond. World Scientific, Singapore.
Minsky, M., and Papert, S. 1988. Perceptrons. MIT Press, Cambridge, MA.
Strawderman, W. E. 1971. Proper Bayes minimax estimators for the multivariate normal mean. Ann. Math. Statist. 42, 385-388.
Vapnik, V. N. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin.
Watkin, T. L. H., Rau, A., Bollé, D., and van Mourik, J. 1992. Learning multiclass classification problems. J. Phys. I (Paris) 2, 167-180.
Watkin, T. L. H., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Mod. Phys. 65(2), 499-556.
Received July 1, 1993; accepted April 11, 1994.
Communicated by Naftali Tishby
Learning a Decision Boundary from Stochastic Examples: Incremental Algorithms with and without Queries

Yoshiyuki Kabashima* and Shigeru Shinomoto
Department of Physics, Kyoto University, Kyoto 606, Japan
Even if it is not possible to reproduce a target input-output relation, a learning machine should be able to minimize the probability of making errors. A practical learning algorithm should also be simple enough to go without memorizing example data, if possible. Incremental algorithms such as error backpropagation satisfy this requirement. We propose incremental algorithms that provide fast convergence of the machine parameter θ to its optimal choice θ₀ with respect to the number of examples t. We will consider the binary choice model whose target relation has a blurred boundary and the machine whose parameter θ specifies a decision boundary to make the output prediction. The question we wish to address here is how fast θ can approach θ₀, depending upon whether in the learning stage the machine can specify inputs as queries to the target relation, or the inputs are drawn from a certain distribution. If queries are permitted, the machine can achieve the fastest convergence, (θ − θ₀)² ~ O(t^-1). If not, O(t^-1) convergence is generally not attainable. For learning without queries, we showed in a previous paper that the error minimum algorithm exhibits a slow convergence, (θ − θ₀)² ~ O(t^-2/3). We propose here a practical algorithm that provides a rather fast convergence, O(t^-4/5). It is possible to further accelerate the convergence by using more elaborate algorithms. The fastest convergence turned out to be O[(ln t)² t^-1]. This scaling is considered optimal among possible algorithms, and is not due to the incremental nature of our algorithm.
1 Introduction
An ideal objective of machine learning is to identify a target input-output relation. Even if all the examples can be reproduced by adjusting machine parameters, the relation acquired via examples is generally not identical to the target relation, and the central issue is then the probability of error in the prediction of a novel example (Valiant 1984; Baum and Haussler 1989; Levin et al. 1990; Amari et al. 1992; Sompolinsky and Barkai 1993).

*Present address: Department of Physics, Nara Women's University, Nara 630, Japan.
Neural Computation 7, 158-172 (1995)
© 1994 Massachusetts Institute of Technology
However, in most practical applications of learning algorithms, it is still hard to reproduce all the examples drawn from a target relation, so that it is definite already in the learning stage that the target relation is not reproducible by the machine (Rumelhart et al. 1986; Sejnowski and Rosenberg 1987). The objective of learning in this case is not necessarily to look for the target relation, but just to obtain the best output prediction for an individual input. When considering the binary choice model whose target relation is stochastic, it is obvious that the best prediction is to choose an output that appears more often than the alternative. Thus the learning machine has to partition the input space so as to minimize the prediction error. In a previous paper, we discussed the error minimum and the maximum-likelihood algorithms as strategies to find a decision boundary that partitions the input space (Kabashima and Shinomoto 1992). In the error minimum algorithm, a parameter or a set of parameters θ is readjusted so that the controlled decision boundary makes the minimum number of empirical errors. We found in this case that the parameter θ converges to the optimal choice θ₀ rather slowly, (θ − θ₀)² ~ O(t^-2/3). Though the anomalous fractional exponent 2/3 is theoretically interesting, the error minimum algorithm cannot be called efficient due to this exponent. We noticed that problems with a similar origin have been discussed independently in various fields of science: the time scaling of the intervals of shocks observed in the Burgers equation (Burgers 1974; Kardar et al. 1986), mathematical economics (Manski 1975; Kim and Pollard 1990; Kawanabe and Amari 1993), pattern recognition (Kohonen 1989), and statistical decision theory (Haussler 1991). Apart from the asymptotic scaling, Barkai et al. (1993) studied the problem specific to the high-dimensional error minimum algorithm.
They estimated the number of examples needed to attain a significant inference using a machine with a large number of parameters. In the maximum-likelihood algorithm, a probability distribution function is selected out of a family of hypothetical functions so as to maximize the (log) likelihood for given data. This can also be utilized for finding a decision boundary. The decision boundary is determined as a hypersurface on which the hypothetical probabilities for alternative classes balance each other. If the true distribution happens to be included in the family of hypothetical distribution functions, the decision boundary θ eventually converges to the optimal choice θ₀. In this case, we obtain rapid convergence, (θ − θ₀)² ~ O(t^-1). If the true distribution is not available, however, the decision boundary does not converge to the optimal choice. The naive maximum-likelihood algorithm is thus not efficient either. To reproduce an arbitrary probability distribution function, we have to prepare a family of functions with an infinite number of parameters. When the number of parameters is finite, the determination of the boundary must have some error. If on the one hand we prepare a number of parameters to approximate any distribution function and to obtain the
asymptotic dependence of O(t^-1) to a certain precision, then the prefactor of t^-1 will become large and the asymptotic regime of O(t^-1) will not be reached within a practical number of examples. For a given number of examples, there must be an optimal number of parameters. Using the strategy of changing the number of parameters depending on the number of examples, we can obtain a fairly rapid convergence of the decision boundary θ to the optimal choice θ₀. The question we wish to address here is how fast (θ − θ₀)² converges to zero. The O(t^-1) convergence is not attainable. We will propose a practical algorithm that provides a rather fast convergence, (θ − θ₀)² ~ O(t^-4/5). A more elaborate algorithm can provide more rapid convergence, O(t^-2p/(2p+1)), p = 2, 3, .... Although larger p appears preferable, it has to pay the price of a larger prefactor. The best choice of p depending upon the number of examples gives the fastest convergence, which turned out to be O[(ln t)² t^-1]. In comparison with passive learning, which leaves the inputs to be drawn from a certain distribution, there must be an advantage in introducing the freedom of queries into learning; the machine can specify inputs to inquire about respective outputs. Seung et al. (1992) and Freund et al. (1992) showed that the information gain per example remains finite even in the limit t → ∞ if queries are allowed, while it decays as t^-1 if not. We propose here a practical algorithm that gives the fastest convergence, (θ − θ₀)² ~ O(t^-1). The use of queries in neural networks was discussed by Baum (1991), in which for a realizable target relation an exponentially fast convergence is proved under some conditions. This exponential convergence is due to the deterministic nature of the target relation, and is in principle not attainable for our stochastic relation (Cramér 1946). For the algorithm to be practical, there is another requirement in addition to the problems of convergence.
The learning algorithm should be simple enough to work without storing the example data, if possible. We will introduce incremental algorithms that do not require memory of the previous examples. This incremental nature is extremely important for a practical algorithm, as it greatly reduces the computational burden. The popularity of the error backpropagation algorithm is due to its being of this nature. The backpropagation algorithm, however, can be considered as a kind of maximum-likelihood algorithm. Instead, what we introduce here is not a naive maximum-likelihood algorithm that can give a wrong decision boundary, but an efficient and practical algorithm for finding the optimal decision boundary for output prediction. First, we are going to discuss the problem of dividing a one-dimensional space that has a blurred boundary (see Fig. 1). Every example consists of the real input x ∈ [a, b] and the binary output s = ±1. In learning without queries, t inputs x are drawn independently from some nonsingular distribution p(x). On the other hand, the machine can specify inputs x in learning with queries. The target relation is stochastic, namely for every input x, the output s is drawn from a certain conditional probability distribution p(s | x). We will assume that p(s = +1 | x) = 1 − p(s = −1 | x), as a function of x, is infinitely differentiable and monotonically increasing.

Figure 1: A conditional probability distribution function p(s | x) describing the one-dimensional blurred boundary for the binary choice.

The learning machine knows nothing other than that. Even if perfect knowledge of p(s | x) is acquired, it is impossible to give a perfect prediction of the individual output for every input. The best prediction for the individual examples is attained if we separate the input space into positive and negative regions depending on p(s = +1 | x) > p(s = −1 | x) and p(s = +1 | x) < p(s = −1 | x). Learning a decision boundary does not necessarily require complete knowledge of the conditional probability distribution p(s | x); it suffices to find a (directed) boundary at which the alternative probabilities balance. As p(s = +1 | x) is a monotonically increasing function of x in this one-dimensional case, the optimal decision boundary x = θ₀ is a single point that satisfies p(s = +1 | x) = p(s = −1 | x) at x = θ₀. We will first show the incremental algorithm that enables the fastest convergence (θ − θ₀)² ~ O(t^-1) with queries allowed, and then discuss the efficient algorithms for learning without queries. Second, we are going to discuss the higher dimensional case to see whether the scaling forms obtained in the one-dimensional case are critically dependent on the dimension of the parameter space. The higher dimensionality affects the convergence by a factor, but presumably does not deteriorate the scaling form, though we have not succeeded in finding a concrete algorithm to accelerate the convergence up to O[(ln t)² t^-1].
2 Learning with Queries
We consider first an active learning process in which the learning machine can specify inputs as queries to a target relation. For every input x ∈ [a, b],
an output s = ±1 is drawn from the conditional probability p(s | x). The nature of getting s = ±1 consists of two factors: the mean and the fluctuation. If for instance p(s = +1 | x) > p(s = −1 | x) at some point x, then there is a tendency to find s = +1 more often than s = −1 at this point. However, if the alternative probabilities do not differ significantly, a number of examples are needed before we are sure which is larger. An appropriate series of queries brings about a fast convergence of the decision boundary x = θ to its optimal choice θ₀. Provided that p(s = +1 | x) is a monotonically increasing function of x, it will be preferable on average to push back the hypothetical boundary x = θ if one gets s = +1 at the boundary, and push it forward otherwise. We propose the learning algorithm as follows. Let θ_t denote the hypothetical boundary determined via t examples, and assume that the query given at this point is x = θ_t. Given the output s for this input x = θ_t, the machine moves the parameter θ_t to

θ_{t+1} = θ_t − s α_t    (2.1)

where α_t (> 0) is a step size that can depend on t. We assume that the conditional probability p(s = +1 | x) can be expanded around x = θ₀ as

p(s = +1 | x) = 1/2 + k₁(x − θ₀) + k₂(x − θ₀)² + ···    (2.2)

In the vicinity of θ₀, the mean and the variance of s = ±1 are approximated as

⟨s⟩_x ≃ 2k₁(x − θ₀),    ⟨s²⟩_x − ⟨s⟩_x² ≃ 1    (2.3)

where ⟨···⟩_x = Σ_{s=±1} ··· p(s | x). As the hypothetical boundary x = θ_t comes close to the optimal position θ₀, the mean drift force toward the optimal boundary becomes weak, while its fluctuation remains large. If we keep α_t constant, θ_t is subject to both the drift force and the fluctuation and will not converge to θ₀. It was proven (Robbins and Monro 1951; Kushner and Clark 1978) that θ_t strongly converges to its optimal value θ₀ provided that

Σ_t α_t = ∞,    Σ_t α_t² < ∞    (2.4)
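The update 2.1 is easy to exercise numerically. The sketch below is illustrative rather than taken from the paper: it uses the target p(s = +1 | x) = x on [0, 1] (so θ₀ = 0.5 and k₁ = 1) and the schedule α_t = 1/(2k₁t), which this section later identifies as optimal; the ensemble-averaged squared deviation should then approach 1/(4k₁²t):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target: p(s=+1 | x) = x on [0, 1], i.e. theta_0 = 0.5, k1 = 1.
theta0, k1 = 0.5, 1.0
T = 2000       # number of queries per run
runs = 2000    # ensemble size used to estimate u(t) = <(theta_t - theta_0)^2>

theta = rng.uniform(0.0, 1.0, size=runs)    # random initial boundaries
for t in range(1, T + 1):
    alpha = 1.0 / (2.0 * k1 * t)            # schedule alpha_t = 1/(2 k1 t)
    p_plus = np.clip(theta, 0.0, 1.0)       # query at x = theta_t
    s = np.where(rng.random(runs) < p_plus, 1.0, -1.0)
    theta = theta - s * alpha               # update 2.1

u = np.mean((theta - theta0) ** 2)
print(u)   # should lie close to the prediction 1/(4 k1^2 T)
```

Clipping p_plus to [0, 1] only matters in the first few steps, when the large initial step size can push θ outside the input range; afterward the restoring drift dominates.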
The α_t dependence of the way in which θ_t converges has not been studied in detail. We are going to investigate the convergence of θ_t by means of a physical interpretation of its dynamics. We found that 2.4 is just a sufficient condition for the convergence of θ_t, and it is even possible to give a successful schedule {α_t} which is outside of condition 2.4. Owing to the coexistence of the drift force (the mean of s) and the fluctuation (the variance of s), equation 2.1 with equations 2.3 can be interpreted as Brownian motion in a quadratic drift potential. These dynamics can be approximated by the Langevin equation,

dz/dt = α[−2k₁z + η(t)]    (2.5)

where z = θ_t − θ₀, α = α_t is generally dependent on t, and η(t) is white noise characterized by the statistical properties ⟨η(t)⟩ = 0, ⟨η(t)η(t′)⟩ = δ(t − t′). The Fokker-Planck equation is an alternative description of the stochastic dynamics,

∂P(z, t)/∂t = 2k₁α ∂[zP(z, t)]/∂z + (α²/2) ∂²P(z, t)/∂z²    (2.6)

where P(z, t) is the ensemble distribution of learning machines with parameters z = θ_t − θ₀ at the moment t. From equation 2.5 or 2.6 we are able to obtain the evolution equation of the mean square deviation u = ⟨z²⟩ = ⟨(θ_t − θ₀)²⟩,

du/dt = −4k₁αu + α²    (2.7)

From this equation, we can find the optimal series of α_t for obtaining the fastest convergence of u. This is performed by minimizing the right-hand side of equation 2.7, which is achieved by adjusting α = 2k₁u. This gives the solution u(t) = 1/[4k₁²(t + const.)] ≃ 1/(4k₁²t), and hence α ≃ 1/(2k₁t). This strategy is similar to what we employed to discuss the finite time scaling of energy in simulated annealing (Shinomoto and Kabashima 1991). The present model is in some sense similar to a thermodynamic system. The mean square deviation u is proportional to α in equilibrium, so α corresponds to the "temperature" of the thermodynamic system. The feature specific to this model is that the drift potential is also proportionally dependent on α. To obtain the optimal learning schedule one has to have knowledge of k₁ and the mean square deviation u(t). In a practical learning situation, a rough estimate of k₁ might be available, but the deviation u(t) at the moment t is unknown. Instead, we can fix a reasonable schedule {α_t} first that would give a fast convergence of the (unknown) deviation. By substituting the learning schedule α_t = A/t, we can solve equation 2.7. The asymptotic form of the solution turns out to be
u(t) ≃ A²[(4k₁A − 1)t]^-1    for A > 1/(4k₁)
u(t) ∝ (ln t)/t              for A = 1/(4k₁)
u(t) ∝ t^(-4k₁A)             for 0 < A < 1/(4k₁)    (2.8)

This result shows that the learning schedule α_t = 1/(2k₁t) is optimal (see Fig. 2a), which is in agreement with the solution of the optimal strategy. All the learning schedules here satisfy the conventional condition for convergence 2.4. Though convergent, it is intriguing to see that the asymptotic form exhibits a qualitative deterioration in its exponent for A < 1/(4k₁) (see Fig. 2b).
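The regimes in 2.8 can be checked by integrating 2.7 directly with α_t = A/t. Substituting τ = ln t turns 2.7 into du/dτ = −4k₁Au + A²e^(-τ), which a plain forward-Euler sweep integrates stably; the fitted log-log slope of u(t) should approach −1 for A > 1/(4k₁) and −4k₁A for A < 1/(4k₁). The constants below are illustrative:

```python
import math

def slope_of_u(k1, A, tau_max=math.log(1e6), dtau=1e-3):
    """Integrate du/dtau = -4*k1*A*u + A**2*exp(-tau) (equation 2.7 with
    alpha_t = A/t, tau = ln t) and return the late-time log-log slope."""
    u, tau = 0.1, 0.0                 # arbitrary initial deviation u(t=1)
    u_mid = None
    tau_mid = tau_max - 4.0           # fit over the last four e-foldings
    while tau < tau_max:
        u += dtau * (-4.0 * k1 * A * u + A**2 * math.exp(-tau))
        tau += dtau
        if u_mid is None and tau >= tau_mid:
            u_mid = u
    return (math.log(u) - math.log(u_mid)) / (tau_max - tau_mid)

k1 = 1.0
s_fast = slope_of_u(k1, A=0.5)    # 4*k1*A = 2   > 1: expect slope near -1
s_slow = slope_of_u(k1, A=0.125)  # 4*k1*A = 0.5 < 1: expect slope near -0.5
print(s_fast, s_slow)
```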
Figure 2: Asymptotic behavior of the mean square deviation u(t) = ⟨(θ_t − θ₀)²⟩ obtained with various values of A for the learning schedule α_t = A/t. (a) Prefactor of u(t) ~ O(t^-1) for the case A > 1/(4k₁); (b) qualitative deterioration in the exponent of the asymptotic decay u(t) ~ O(t^(-4k₁A)), seen for A < 1/(4k₁). The closed circles are the results of numerical experiments.
Next, we wish to see what happens if we change the learning schedule from A/t to A/t^β. It is still easy to integrate equation 2.7, and the asymptotic form is

u(t) ≃ (A/4k₁) t^-β + u(0) exp[−4k₁A(t^(1-β) − 1)/(1 − β)]    (2.9)

The mean square deviation converges to zero if 0 < β ≤ 1. Note that the learning schedules with 0 < β < 1/2 are outside the conventional condition for convergence 2.4. The convergence for 0 < β < 1/2 is not so surprising,
Figure 3: The mean square deviation u(t) = ⟨(θ_t − θ₀)²⟩ vs. t for various learning schedules α_t = A/t^β, β = 0, 0.125, 0.25, and 0.5. The average is taken over 1000 sets of t examples for the target relation p(s = +1 | x) = x. The mean square fit for u(t) ~ O(t^-γ), respectively, gives γ = −0.0081 ± 0.0048, 0.1226 ± 0.0045, 0.2534 ± 0.0051, and 0.5271 ± 0.0043. These are in good agreement with the theoretical asymptotes u(t) = (A/4k₁) t^-β shown as lines in the figure.
and can be reasonably understood from a physical point of view. As described before, the parameter α works as a kind of "temperature" in this system, and in equilibrium u(t) is proportional to α. If we reduce the temperature too rapidly, the ensemble of systems does not attain the equilibrium distribution, and will be partially "frozen." This happens for β > 1, in which case the second term of the right-hand side of equation 2.9 does not vanish even in the limit t → ∞. On the other hand, if one reduces the temperature slowly, the ensemble equilibrates almost every time, which makes the mean square deviation u proportional to α_t. This happens for 0 < β < 1, in which case the first term of the right-hand side of equation 2.9 is dominant, implying u(t) ∝ α_t. The result of numerical simulation is shown in Figure 3.
Smaller β gives slower convergence, which is not preferable. On the other hand, the second term of equation 2.9, which represents a memory effect with respect to the initial condition, exhibits more rapid decay for smaller β. We are often faced with a learning situation in which the target relation itself depends on time. In such a case, the machine has to be sufficiently adaptive to the temporal change. Amari (1967) illustrated that learning with a fixed step size, which is similar to the case β = 0 in the present framework, is adaptive to a time-varying target. As it is easy for small β to erase the memory, we are then able to choose sufficiently small A in order to obtain a smaller mean square deviation. The number of examples required to obtain a certain precision for parameter estimation is not critically dependent on the choice of β. This argument with respect to the sample complexity was discussed by Kabashima and Shinomoto (1993).

3 Learning without Queries

If queries are not allowed, the machine has to make the most of the information available from examples drawn from the distribution p(s, x) = p(s | x)p(x). If there is a symmetry with respect to the inversion, p(s, θ₀ + x) = p(−s, θ₀ − x), and we can assume this symmetry in advance, then most of the data from this joint probability can be utilized in determining the hypothetical boundary, and the machine can attain the fastest convergence of O(t^-1). There is, however, no symmetry in general. Due to the absence of symmetry, we have to prepare more in the inference, and this makes the convergence slower. An easy way of utilizing an incremental algorithm similar to the preceding one is to prepare a window for accepting inputs. Let us assume a window of an interval 2r_t centered at the hypothetical boundary θ_t. The parameter θ_t is updated when the input x falls in the window,

θ_{t+1} = θ_t − s α_t/(2r_t)    if x ∈ [θ_t − r_t, θ_t + r_t]    (3.1)

and does not change otherwise. This algorithm is similar to the vector quantization procedure LVQ2, proposed by Kohonen (1989). In the original LVQ2, the window size is fixed, but we are going to control the window size so as to obtain exact convergence. We will hereafter assume that the probability distribution p(x) is infinitely differentiable and is expanded around x = θ₀ as

p(x) = p₀ + h₁(x − θ₀) + h₂(x − θ₀)² + ···    (3.2)

The probability that the machine receives s = +1 for an input that falls in the window is obtained via

p(s = +1 | x ∈ [θ_t − r_t, θ_t + r_t]) = ∫_{θ_t−r_t}^{θ_t+r_t} p(s = +1 | x) p(x) dx / ∫_{θ_t−r_t}^{θ_t+r_t} p(x) dx    (3.3)
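A minimal sketch of the windowed update 3.1, assuming the same illustrative target p(s = +1 | x) = x, with inputs now drawn uniformly from [0, 1] rather than queried. The constants A and r₀ and the clipping of θ to the input range are assumptions; the window shrinks as r_t ∝ t^(-1/5), for which the mean square deviation should decay roughly as t^(-4/5):

```python
import numpy as np

rng = np.random.default_rng(1)

# Target p(s=+1 | x) = x, inputs x drawn uniformly from [0, 1] (no queries).
theta0, k1 = 0.5, 1.0
A, r0 = 0.5, 0.2            # hypothetical schedule constants
T, runs = 10000, 1000

theta = np.full(runs, 0.8)                  # start away from the optimum
for t in range(1, T + 1):
    alpha = A / t                           # learning schedule alpha_t = A/t
    r = r0 * t ** (-0.2)                    # window schedule r_t ~ O(t^(-1/5))
    x = rng.uniform(0.0, 1.0, size=runs)
    s = np.where(rng.random(runs) < x, 1.0, -1.0)
    inside = np.abs(x - theta) <= r         # update only if x falls in the window
    theta = theta - inside * s * alpha / (2.0 * r)   # update 3.1
    theta = np.clip(theta, 0.0, 1.0)        # keep the boundary in the input range

u = np.mean((theta - theta0) ** 2)
print(u)   # decays roughly as O(t^(-4/5))
```

The clip only matters in the earliest steps, where the step size α_t/(2r_t) is still of order one.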
The Langevin equation for the corresponding dynamics 3.1 is given by (3.4), where z = θ_t − θ₀, and η(t) is the white noise characterized by ⟨η(t)⟩ = 0 and ⟨η(t)η(t′)⟩ = δ(t − t′). The second term on the right-hand side of equation 3.4 does not vanish even in the limit z → 0. This term is the origin of the systematic error due to the asymmetry in the joint distribution p(s, x). To eliminate this systematic error, one has to shorten the interval r_t itself. On the other hand, updates become infrequent as one narrows the window, and then the relative intensity of fluctuations increases by (2r)^(-1/2). The trade-off between these tendencies determines the optimal choice of the window size. We assume the learning schedule to be α_t = A/t, and seek the optimal window schedule from among r_t ~ O(t^-γ). From the Langevin equation 3.4 we obtain the evolution of the mean square deviation u = ⟨z²⟩, (3.5)

Minimizing this with respect to γ ... due to these orthogonality conditions. Thus the resultant move for the positive example is not necessarily to the left, and vice versa. Using this kernel function, we found that the optimal window size, which decays more slowly than that of the original model, scales as r_t ~ O(t^(-1/(2p+1)))
for p > 2. The resultant asymptotic scaling of the mean square deviation turns out to be

u(t) ~ O(p² t^(-2p/(2p+1)))    (3.8)

The same exponent 2p/(2p + 1) was obtained by Barron and Cover (1991), although they did not give an estimate of the prefactor p² here. The scaling 3.8 implies that the machine can attain an asymptotic scaling arbitrarily close to O(t^-1). For a finite number of examples, however, it is not necessarily advantageous to use a kernel with larger p, as the prefactor p² increases rapidly with p. The optimal p depends on the number of examples t in such a way that

p ≈ (ln t)/4    (3.9)

By substituting this into equation 3.8 we obtain the fastest convergence,

u(t) ~ O[(ln t)² t^-1]    (3.10)

This scaling form would presumably be optimal, although we have not succeeded in proving this.

4 Higher Dimensional Case
In this section, we wish to discuss whether the asymptotic scaling forms we obtained for the one-dimensional model are critically dependent on the dimension of the parameter space as well as the dimension of the input vector space. In higher dimensional problems, there are two main causes for the impossibility of reproducing the input-output relation. First, the target input-output relation is originally stochastic, as is seen in our one-dimensional paradigm. In this case, it is impossible to reproduce individual examples even if the machine can produce an arbitrary decision boundary surface. Second, the target relation is deterministic and has a clear separation boundary in the input space, but the machine cannot reproduce the separation boundary due to limitations in its adaptability. For example, consider the combination of a target dichotomy with a round boundary and a machine that can produce a decision boundary by using only a finite number of hyperplanes. In practical applications, these two causes are presumably mixed; the target relation is more or less stochastic, and, moreover, the learning machine cannot produce an optimal decision boundary for this stochastic relation. To shed light on the effect of the dimensionality, we will consider here the first case. Namely, the target relation is stochastic and the machine can produce the best decision boundary for this stochastic relation. Our model is as follows. For the (D + 1)-dimensional unit vector x, the output s = ±1 is drawn from the stochastic relation,

p(s = +1 | x) = 1/2 + k₁(θ₀ · x) + k₂(θ₀ · x)² + ···    (4.1)
where θ₀ is a (D + 1)-dimensional unit vector and (a · b) is the inner product. The optimal decision boundary is a hyperplane (containing the origin) normal to θ₀. The machine is able to choose any hyperplane containing the origin, or equivalently, a (unit) normal vector θ. The machine is actually capable of producing an optimal decision boundary by choosing θ = θ₀. The parameter space is a D-dimensional manifold. In learning with queries, the machine chooses inputs x randomly from the hypothetical boundary. The rule for updates can be of the form,
θ_{t+1} = (θ_t + s α_t x)/√(1 + α_t²)    (4.2)

where the factor 1/√(1 + α_t²) is added to keep the norm unity. The dynamics is again similar to Brownian motion in a (D-dimensional) quadratic potential, if the input vector x is drawn uniformly from the hypothetical boundary. On a D-dimensional locally flat coordinate z ∝ θ_t − θ₀ normal to θ₀, we can estimate the average restoring force. The dynamics is then expressed by the Langevin equation, dz/dt = ...

... > μ₀, the first term dominates this equation.
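A sketch of a normalized incremental update of this kind, for a (D + 1)-dimensional unit parameter vector with queries drawn uniformly from the hypothetical boundary. The sign convention (s = +1 rotates θ toward the query point x), the schedule α_t = A/t, and the constants k₁, A, and the dimension are illustrative assumptions, and the target 4.1 is truncated at first order:

```python
import numpy as np

rng = np.random.default_rng(2)

D1 = 3                      # (D+1)-dimensional input space, so D = 2
k1, A = 0.4, 2.0            # target steepness and schedule constant (assumed)
T, runs = 20000, 200

theta0 = np.zeros(D1); theta0[-1] = 1.0     # optimal normal vector
theta = rng.normal(size=(runs, D1))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)
flip = (theta @ theta0) < 0                 # start in theta0's hemisphere
theta[flip] *= -1.0                         # (avoids the antipodal saddle)

for t in range(1, T + 1):
    alpha = A / t
    v = rng.normal(size=(runs, D1))         # query x: a uniform unit vector
    v -= np.sum(v * theta, axis=1, keepdims=True) * theta     # orthogonal to
    x = v / np.linalg.norm(v, axis=1, keepdims=True)          # theta (boundary)
    p_plus = 0.5 + k1 * (x @ theta0)        # target 4.1, first order only
    s = np.where(rng.random(runs) < p_plus, 1.0, -1.0)[:, None]
    theta = (theta + s * alpha * x) / np.sqrt(1.0 + alpha**2) # keeps |theta| = 1

print(np.mean(theta @ theta0))   # alignment with theta0 approaches 1
```

Because θ·x = 0 and |x| = 1, the division by √(1 + α_t²) preserves unit norm exactly.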
3. In the initial undamaged state, e₀ < B, ensuring that the network does not follow nonstored patterns if the latter are presented as inputs. Therefore, at t = 0 and m¹(0) ≈ 0, the argument of the first term is negative, and increased noise increases the magnitude of the overlap m¹(1) (the overlap of the input pattern). However, as the dynamics evolve, the argument of the first term becomes positive, and the increased noise reduces the final overlap.
Figure 2 displays the map [m¹(t + 1) | m¹(t)] defined by 4.5, describing stimulus-dependent retrieval. In the baseline, undamaged state there is only one (stable) fixed point solution, with m¹ = 1 (full curve). After the weakening of external projections, two additional fixed points may appear (dotted curve). The lowest fixed point is not visible on the scale of Figure 2, but it always coexists with the middle fixed point since, by 4.5, m¹(1) > 0 when m¹(0) = 0. It is easy to see that the middle fixed point is unstable, while the two extreme ones are stable. The middle, unstable, fixed point denotes the critical value C_c that divides the map; only input patterns with overlap m¹ > C_c will converge to the higher fixed point, signifying successful retrieval. As the initial overlap of an input pattern is essentially zero, equation 4.5 will converge to its lower fixed point whenever it exists, resulting in stimulus-retrieval failure. The compensatory potentials of internal synaptic strengthening and increased noise level are illustrated in Figure 3. In both cases we see that with sufficient compensatory increase the original situation (where only a single, high overlap, fixed point exists) is restored. Note that
D. Horn and E. Ruppin
Figure 2: The map [m(t + 1) | m(t)] generated by iterating the overlap equation. a = 0.05, p = 0.1, e₀ = 0.035, c = 1, and T = T₀ = 0.005. In the initial undamaged state (full curve) e = 0.035 and in the decreased input case e = 0.015. Recall that B remains fixed and its value is determined by 4.7, with e₀ = 0.035, c₀ = 1, and p = 0.1. All figures are based on this choice of initial synaptic strengths and threshold, as well as on a = 0.05.
as the noise level is increased the magnitude of the highest fixed point decreases monotonically, in accordance with observation 3. To study spontaneous retrieval, we calculate the overlap m^μ(0) between the initial network state S and each memory pattern ξ^μ. As shown in the Appendix, the maximal overlap m_max has an almost deterministic value. This enables us to model spontaneous retrieval by entering m_max as the initial m(0) in 4.5, letting e = 0 (no external stimulus is present),
Compensatory Mechanisms in an Attractor Neural Network
Figure 3: The map [m(t + 1) | m(t)] after a decrease in the magnitude of external projections (e = 0.015) and a compensatory increase in the internal synaptic strength (long-dashed curve, c = 2.5 and T = 0.005), or in the noise level (dashed curve, c = 1 and T = 0.015). These curves should be compared with the c = 1, T = 0.005 curve in Figure 2.
and iterating the overlap equation. The map [m(t + 1) | m(t)] generated in this fashion has a form similar to that shown previously in the stimulus-driven mode, as illustrated in Figure 4. Analogous to the case of stimulus-dependent retrieval, the middle fixed point denotes the critical value C_c, such that only when m_max > C_c does expression 4.5 converge to its higher fixed point, denoting spontaneous retrieval of a stored memory pattern.
Figure 4: The maps [m(t + 1) | m(t)] of spontaneous and stimulus retrieval after a decrease in the magnitude of external projections, and following a compensatory increase in the internal synaptic strength. T = 0.005.
Figure 4 illustrates the effects of synaptic changes on spontaneous retrieval. As e decreases (to e = 0.010; compare with Fig. 2), the curve corresponding to the stimulus-driven map is shifted to the right, approaching the spontaneous-retrieval curve (e = 0). Following observation 1, increasing c results in a leftward (and upward) shift of both curves, possibly maintaining successful stimulus retrieval (by eliminating the lower fixed points, as illustrated in this example), but causing a continuing decrease in the value of C_c, such that spontaneous retrieval may arise. Note also that increasing c tends to further decrease the difference between the spontaneous and stimulus-driven retrieval maps. Depending on the values of e, c, and T, each of the following three retrieval scenarios may occur:

1. The basic stimulus-retrieval mode is preserved.
2. Spontaneous retrieval emerges (C_c < m_max), while stimulus retrieval is preserved.

3. Stimulus-driven retrieval is lost (a lower fixed point appears in the stimulus-driven map).
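The three scenarios reflect the fixed-point structure of the overlap map. As an illustration, a hypothetical cubic map (not the paper's equation 4.5) with two stable fixed points, 0.05 and 0.95, separated by an unstable one at 0.45 that plays the role of C_c, shows how iteration selects the final state according to which side of C_c the initial overlap falls on:

```python
import numpy as np

# Hypothetical stand-in for an overlap map [m(t+1) | m(t)]: fixed points
# at 0.05 (stable), 0.45 (unstable, the critical value), 0.95 (stable).
def f(m):
    return m + 0.5 * (m - 0.05) * (m - 0.45) * (0.95 - m)

def iterate(m, steps=1000):
    for _ in range(steps):
        m = f(m)
    return m

# Locate fixed points via sign changes of f(m) - m on a grid over [0, 1].
grid = np.linspace(0.0, 1.0, 1000)
g = f(grid) - grid
fixed = grid[:-1][g[:-1] * g[1:] < 0]
print(fixed)                  # near 0.05, 0.45, 0.95

low = iterate(0.30)           # below C_c: failure (low fixed point)
high = iterate(0.50)          # above C_c: convergence to the retrieval state
print(low, high)
```

In the paper's model, raising c or T reshapes the curve so that the lower crossings disappear (restoring retrieval) or the unstable crossing drops below m_max (producing spontaneous retrieval).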
Similar to its effect on stimulus-dependent retrieval, an increase of the noise level would enhance the level of spontaneous retrieval, by decreasing the negative argument of the dominant first term in 4.8, but would gradually decrease the final overlap. In the next section we use equation 4.5 to study quantitatively the relation between the two retrieval modes, characterized as

Stimulus-dependent retrieval mode: m(0) = 0, e > 0
Spontaneous retrieval mode: m(0) = m_max, e = 0    (4.9)

It should be noted that the derivation of 4.5 is based on the assumption that the overlap m singled out is significantly higher than the overlaps with all other memory patterns, which are considered as background noise. This is different from the situation in the spontaneous mode, where a few memory patterns may have initial overlaps that do not fall far from m_max. Hence, the results obtained by iterating 4.5 in this mode are only an approximation to the actual emergence of spontaneous retrieval in the network. As we shall show, in sparsely coded, low memory-load networks simulation results are in close agreement with these estimates.
5 Numerical Results
We turn now to simulations examining the behavior of a network under variations of synaptic strength and noise level, and compare these results with the analytic approximations obtained by iterating equation 4.5. All the simulations presented in this section were performed in a network of N = 400 neurons, storing M = 20 memory patterns, with coding level p = 0.1. Optimal thresholds were set for e₀ = 0.035 and c₀ = 1. Performance was measured by the final overlap averaged over 100 trials, denoted the average final overlap. In the initial, undamaged state, the values of the synaptic strengths and threshold were set such that perfect memory retrieval at low noise levels was attained, as shown by the full curve in Figure 5a. Figure 5a displays simulation results demonstrating that an increase in the noise level can compensate for the deterioration of memory retrieval due to a decrease in the external input. For fixed T, performance
Figure 5: Stimulus-dependent retrieval performance, measured by the average final overlap m, as a function of the noise level T. Each curve displays this relation at a different magnitude of external input projections e. (a) Simulation results. (b) Analytic approximation.
decreases rapidly as e is decreased. If the decrease in e is not too large, an increase in T restores stimulus-dependent retrieval performance. The first three curves are qualitatively similar, characterized by a peak of the retrieval performance at some e-dependent optimal level of noise. Eventually, at low e levels retrieval is lost. Figure 5b presents analytical results describing the effect of noise on the dynamic evolution of the network, obtained by iterating the macroscopic overlap equation 4.5. These results bear strong resemblance to those obtained in simulations.²

²A discrepancy between analytic approximations and simulations regarding the behavior of the undamaged network at low noise should be noted. In general, there is close correspondence between theory and simulations at low noise values as well. The case shown in Figure 5 is an exception, which arises because precisely for these parameter values there is a sharp change in the performance near zero temperature. If e is slightly lowered to 0.032, the retrieval performance (in both analysis and simulations) is near zero, and when e is slightly increased to 0.038, the retrieval performance (in both analysis and simulations) is almost perfect.
The initial sharp rise in performance as T is increased above some point (at high enough e values) is made clear by considering the map [m(t+1) | m(t)] displayed in Figure 3; at this point the noise level is sufficient to eliminate the two lower fixed points, and there is a crossover to the highest fixed point. As e is decreased, higher T values are required to eliminate the lower fixed points, and the value of the higher fixed point decreases. As illustrated in Figure 5b, there is a crossover point (e ≈ 0.013) where retrieval performance drops sharply. The map [m(t+1) | m(t)] presented in Figure 6 shows that in this parameter region the crossover to the higher fixed point no longer occurs (see the dashed curve) and the solution of 4.5 is always obtained at the lower fixed point. The results of a simulation examining the compensatory potential of strengthening internal connections are shown in Figure 7. As e is decreased, the best possible performance is achieved with increasing c values. The macroscopic overlap equation fails to give an accurate account of stimulus-dependent retrieval at high c levels; as the internal synapses are strengthened and spontaneous retrieval arises, there is no longer a single significant overlap. The combined compensatory potential of internal synaptic strengthening and increased noise is illustrated in Figure 8. The effect is synergistic: high stimulus-dependent retrieval performance is achieved already at a fairly small increase of the synaptic and noise levels. Figure 9a and b illustrate that synaptic strengthening and increased noise eventually generate spontaneous retrieval. The analytic approximation is in fair correspondence with the simulation. Due to interference from other memories with high initial overlap, spontaneous retrieval in the network is lower than the theoretical prediction. In the previous section we have seen that three retrieval scenarios may occur, depending on the values of e, c, and T.
As spontaneous retrieval depends only on c and T, the remaining parameter e determines whether stimulus-dependent retrieval is maintained as spontaneous retrieval emerges. In our network this combined retrieval mode is obtained with fairly high levels of e, c, and T, but it may exist also at lower levels, depending on the memory load α, the spontaneous activity q, and the initial external strength e_0. Finally, we wish to point out another adverse feature of compensation, relevant to the decreased specificity of stimulus-dependent retrieval: when the undamaged network is presented with a nonmemorized input pattern, it converges to a state that has no significant overlap with any of the memorized patterns. However, after compensatory synaptic changes take place, the network may respond to the presentation of a nonstored pattern by converging to a state that has high overlap with one of the memory states, and thus erroneously retrieve nonqueried patterns. As illustrated in Figure 10, retrieval specificity begins to deteriorate at moderate compensatory levels, before spontaneous retrieval arises.
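The kind of retrieval experiment reported in this section can be sketched in a few lines. The following is a minimal stand-in, not the authors' code: it assumes a standard Tsodyks–Feigel'man coupling matrix, an external field of the form e·ξ¹, a Glauber-style stochastic update, and an illustrative threshold θ = 0.045 (the paper's threshold is fixed by its equation 4.7, which is not reproduced in this excerpt).

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, p, q = 400, 20, 0.1, 0.05          # parameters quoted in Section 5
xi = (rng.random((M, N)) < p).astype(float)   # 0/1 memory patterns
J = (xi - p).T @ (xi - p) / N                 # Hebbian TF couplings
np.fill_diagonal(J, 0.0)

def overlap(S, mu):
    # normalized overlap of state S with memory pattern mu
    return (xi[mu] - p) @ S / (N * p * (1 - p))

def run(c, e, T, theta=0.045, steps=30):
    # theta is an assumed value, not taken from the paper
    S = (rng.random(N) < q).astype(float)     # random low-activity start
    for _ in range(steps):
        h = c * (J @ S) + e * xi[0]           # internal + external field
        prob = 1.0 / (1.0 + np.exp(-(h - theta) / T))
        S = (rng.random(N) < prob).astype(float)
    return overlap(S, 0)

r_stim = run(c=1.0, e=0.035, T=0.005)    # baseline stimulus-driven retrieval
r_spont = run(c=1.0, e=0.0, T=0.005)     # no stimulus, undamaged network
print("stimulus-driven:", round(r_stim, 2), " no stimulus:", round(r_spont, 2))
```

With the external field at its baseline strength the sketch converges onto the stimulated pattern, while with e = 0 the activity dies out rather than retrieving spontaneously, in line with the undamaged-network behavior described above.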
Figure 6: The map [m(t+1) | m(t)] after a decrease in the magnitude of the external projections, and an optimal compensatory increase of the noise level. While at e = 0.014 the fixed point has a large value of 0.9, at e = 0.012 it drops sharply (even at the optimal noise T = 0.019) to 0.25. This shows that e ≈ 0.013 is a crossover between large and small m values.

6 Discussion
Motivated by Stevens' hypothesis, we have constructed a neural model supporting the idea that the synaptic regenerative processes observed in the frontal cortices of schizophrenics, concomitant with the denervation of MTL projections, are not a mere "epiphenomenon," but have a compensatory role. Schizophrenic symptomatology involves complicated cognitive and perceptual phenomena, whose description certainly
Figure 7: Stimulus-dependent retrieval performance, measured as the average final overlap m, as a function of the internal synaptic strength c. Each curve displays this relation at a different strength of external input projections e. T = 0.005.
requires much more elaborate representations than a simple associative memory model of the kind we have used. However, whatever their neural realization may be, schizophrenic symptoms such as delusions or hallucinations frequently appear in the absence of any apparent external trigger. It therefore seems plausible that the emergence of spontaneous activation of stored patterns is an essential element in their pathogenesis. The decrease of retrieval specificity may underlie schizophrenic thought disorders such as loosening of associations, where a unifying theme is absent from the patient's discourse; one may contend that, due to decreased specificity, numerous patterns in different modules may be activated concomitantly and compete with each other, making the maintenance of a serially ordered cognitive process an increasingly difficult task.

Figure 8: The final overlap m as a function of internal synaptic strength c. Both simulations and analytical results are displayed. e = 0.015 and T = 0.009. This should be compared with the e = 0.015 and T = 0.005 curve in Figure 7 and the e = 0.015 and c = 1 curve of Figure 5.

Figure 9: (a) Spontaneous retrieval, measured as the highest final overlap m achieved with any of the stored memory patterns, displayed as a function of the noise level T. c = 1. (b) Spontaneous retrieval as a function of the internal synaptic compensation factor c. T = 0.009. In both cases e = 0, q = 0.05, yielding m_max = 0.111 as the starting point for iterating the overlap equation.

Delusions and hallucinations tend to concentrate upon a limited set of recurring cognitive and perceptual themes. This cannot be accounted for by a model where spontaneous retrieval is homogeneously distributed among all stored memory patterns. To obtain a nonhomogeneous distribution, the compensatory regeneration of internal synapses should have an additional Hebbian-like activity-dependent term, as, for example,
This learning process is assumed to proceed on a much slower time scale than the retrieval dynamics. Nevertheless, their coexistence can lead to interesting phenomena: as some memory pattern is spontaneously retrieved, its corresponding basin of attraction is further enlarged. This therefore increases the probability of spontaneously retrieving memories that have already been retrieved. If spontaneous retrieval emerges, then via this positive feedback loop any bias in the network's initial state can break the symmetry underlying the generation of a homogeneous distribution of retrieved states, and an inhomogeneous distribution can be obtained. We have assumed in our analysis that only states of high overlap with one of the stored memories are cognitively significant. We have thus neglected all spurious states to which the network can converge. The simulation results presented in Table 1 indicate that when the network size is small (N = 400) the role of the spurious states in spontaneous retrieval is
Figure 10: Decreased specificity: the final overlap m as a function of internal synaptic strength c, for two values of external synaptic strength e (e = 0.015 and e = 0.025). The input stimulus is a random pattern that does not correspond to any memory pattern. In each trial, m is taken as the highest final overlap achieved with any of the stored memory patterns. T = 0.009.
rather small; if the network does not converge to one of the memories, it often ends up in a state with very low activity that has negligible overlap with the memories. However, the percentage of mixed states increases considerably as the network size is increased. Hence, in large networks spontaneous retrieval seems likely to consist mostly of spurious states, together with a few memory states. Yet two additional factors may in turn enhance the relative percentage of memory states spontaneously retrieved in large networks. First, as the coding rate p is decreased [and
Table 1: Distribution of Final Attractor States Generated in a Network with Spontaneous Retrieval.(a)

               Memory   Spurious   Near-zero activity
  N = 400
    c = 1.5       0         0            100
    c = 2.0      18         3             79
    c = 2.5      61         9             30
  N = 800
    c = 2.0       0         0            100
    c = 2.5      11         4             85
    c = 3.0      31        34             35
  N = 1600
    c = 3.0       8        20             72
    c = 3.25     14        46             40
    c = 3.5      21        68             11

(a) Stored memory states, spurious states, and near-zero activity states are counted (percentages of trials). The results are shown for three networks of different size N (keeping the memory load α = 0.05 fixed), while varying the internal synaptic strength c. In all simulations presented e = 0.015, T = 0.009, and p = 0.1. Convergence to a stored memory pattern was considered as such when the final overlap with that pattern was above 0.9.
cortical networks are considered to have very low "coding" rates (Abeles et al. 1990)], the percentage of memory retrieval is significantly increased; for example, in a simulation performed in a network of size N = 1600 (with α = 0.05, c = 2.5, e = 0.015, T = 0.009) and coding rate p = 0.05, the network converged to a memory state in 44% of the trials, to a near-zero activity state in 38%, and to a spurious state in only 18%.³ Second, as preliminary results seem to indicate, the incorporation of synaptic Hebbian changes like those suggested in 6.1 is likely to markedly increase the percentage of memory states to which the network spontaneously converges, due to their enlarged basins of attraction. The question of how spurious states are distinguished from memory states has been addressed elsewhere in the ANN literature (e.g., Parisi 1986; Shinomoto 1987; Ruppin and Yeshurun 1991). In Alzheimer's disease, synaptic degenerative processes damage the intramodular (i.e., internal) synaptic connections (DeKosky and Scheff 1990) that store the memory patterns. Hence, although synaptic compensation (performed by strengthening the remaining synapses) may slow down memory deterioration, the demise of memory facilities is inevitable (Horn et al. 1993). Simulations we have performed show that spontaneous retrieval does not emerge when the primary damage compensated for involves intramodular connections. In schizophrenia, the internal synaptic matrix of memories presumably remains intact, and synaptic compensatory changes may successfully maintain memory functions. However, as we have shown, when internal synaptic strengthening compensates for external synaptic denervation, spontaneous retrieval emerges. Despite a number of suggestive findings, there is currently no proof that a global abnormality of neurotransmission is a primary feature of schizophrenia (Mesulam 1990). Motivated by Stevens' theory, we have focused on the neuroanatomical synaptic changes, without referring to any specific neurotransmitter. However, it should be noted that symptoms like delusions and hallucinations are known to be responsive to dopaminergic agents. Building upon recent data that may support the possibility that the initial abnormality in schizophrenia involves a hypodopaminergic state, Cohen and Servan-Schreiber (1992) have shown that schizophrenic deficits may result from a reduction of adrenergic neuromodulatory tone in the prefrontal areas. In parallel, we have shown that increased noise, which is computationally equivalent to decreased neural adrenergic gain [see Cohen and Servan-Schreiber (1992) for a review of this data], may result in adverse positive symptoms. However, in accordance with Stevens' theory, this additional noise arises from synaptic reinnervation, and is independent of the level of dopaminergic activity. On the physiological level, it is predicted that at some stages of the disease, due to the increased noise level, increased spontaneous activity should be observed. This prediction is obviously difficult to examine directly via electrophysiological measurements.

³As the coding rate is lowered, spontaneous memory retrieval is achieved at lower compensation values, so a direct comparison of the retrieval obtained with different coding levels at the same c levels is not possible.
Yet numerous EEG studies in schizophrenics show increased sensitivity to activation procedures (i.e., more frequent spike activity) (Kaplan and Sadock 1991), together with a significant increase in slow-wave delta activity that may reflect increased spontaneous activity (Jin et al. 1990). Our model can be tested by quantitatively examining the correlation between a recent premortal history of florid psychotic symptoms and postmortem neuropathological findings of synaptic compensation in schizophrenic subjects. Quoting Mesulam (1990), "One would have expected neuropathology to provide a gold standard for research on schizophrenia, but this is not yet so." It is our hope that neural network models may encourage detailed neuropathological studies of synaptic changes in neurological and psychiatric disorders, which, in turn, would enable more quantitative modeling.
Appendix: The Calculation of m_max

Let us consider a network of N neurons, storing M (0,1) patterns ξ^μ, μ = 1, ..., M. Each memory pattern ξ^μ is generated with Prob(ξ_i^μ = 1) = p, and the initial state S has spontaneous activity Prob(S_i = 1) = q. The initial overlap with pattern μ is

m^μ(0) = [1/(N p(1−p))] Σ_i (ξ_i^μ − p) S_i.     (A.1)

For every f > 0 (using Markov's inequality),

Prob{ m^μ(0) > b/[p(1−p)] } ≤ e^{−N η(b)},   η(b) = f b − ln E[e^{f(ξ−p)S}].     (A.2)

To find the tightest of these bounds we differentiate η(b) with respect to f and solve

∂η(b)/∂f = 0     (A.3)

to find the corresponding f that maximizes η(b). As we have M = e^{Nρ} stored memory patterns, the expected number of memories whose overlap exceeds b/[p(1−p)] is bounded by

M e^{−N η(b)} = e^{N[ρ − η(b)]}.     (A.4)

As is evident from equation A.4, the probability that the maximal initial overlap m_max is larger than b/[p(1−p)] decreases exponentially once η(b) > ρ. At low values of b, many memories will have an overlap larger than b/[p(1−p)]; at high values of b there will be no such memory, with probability almost 1. Hence, m_max is found by searching for the b* such that η(b*) = ρ, i.e., the point at which the expected number of memories whose overlap is larger than m_max = b*/[p(1−p)] is 1. To this end, for every b [from 0 to p(1−p)] we search for the best f-value by solving A.3, calculate the corresponding η(b) by A.2, and stop whenever η(b) = ρ. Some values of m_max as a function of q, for three different networks, are displayed in Table 2. Although m_max decreases monotonically as the network size increases (keeping α fixed), the value of m_max remains nonvanishing even when considering large, "cortical-like" networks.
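The search described above is easy to run numerically. The sketch below is not the authors' code; it writes out the moment generating function of y = (ξ − p)S explicitly, uses a ternary search as a stand-in for solving A.3, and bisects on b to find the point where the expected number of exceeding memories drops to one.

```python
import math

def eta(b, p, q, f):
    # Chernoff exponent f*b - ln E[exp(f*y)] for y = (xi - p)*S,
    # with xi ~ Bernoulli(p) and S ~ Bernoulli(q) independent (A.2)
    mgf = (1 - q) + q * ((1 - p) * math.exp(-f * p) + p * math.exp(f * (1 - p)))
    return f * b - math.log(mgf)

def eta_star(b, p, q):
    # tightest bound: maximize the concave eta(b, .) over f (stands in for A.3)
    lo, hi = 0.0, 60.0
    for _ in range(200):
        f1, f2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if eta(b, p, q, f1) < eta(b, p, q, f2):
            lo = f1
        else:
            hi = f2
    return eta(b, p, q, (lo + hi) / 2)

def m_max(N, M, p, q):
    # find b* with eta(b*) = rho = ln(M)/N, i.e. expected number of
    # memories with overlap above b*/[p(1-p)] equal to 1 (A.4)
    rho = math.log(M) / N
    lo, hi = 0.0, p * (1 - p)
    for _ in range(60):
        mid = (lo + hi) / 2
        if eta_star(mid, p, q) < rho:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / (p * (1 - p))

m400 = m_max(N=400, M=20, p=0.1, q=0.05)
print(round(m400, 3))   # close to the 0.111 quoted in Table 2 for N = 400, q = 0.05
```

For the N = 400 network with q = 0.05 the search lands near 0.111, the value quoted for this case in Table 2 and used as the starting point for iterating the overlap equation.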
Table 2: Some Typical m_max Values.(a)

    q        N = 400    N = 2000    N = 10000
    0.01     0.06       0.029       0.014
    0.03     0.091      0.045       0.022
    0.05     0.111      0.057       0.028
    0.07     0.128      0.066       0.033
    0.09     0.143      0.074       0.037
    0.11     0.156      0.082       0.041
    0.13     0.168      0.088       0.044
    0.15     0.179      0.094       0.047

(a) In all three networks the memory load α = M/N = 0.05 is kept constant and p = 0.1.
Acknowledgment

We are grateful to Professor Isaac Meilijson for helpful discussions and comments.
References

Abeles, M., Vaadia, E., and Bergman, H. 1990. Firing patterns of single units in the prefrontal cortex and neural network models. Network 1, 13.
Amit, D. J., Parisi, G., and Nicolis, S. 1990. Neural potentials as stimuli for attractor neural networks. Network 1, 75-88.
Cohen, J. D., and Servan-Schreiber, D. 1992. Context, cortex, and dopamine: A connectionist approach to behavior and biology in schizophrenia. Psychol. Review 99(1), 45-77.
Connors, B. W., and Gutnick, M. J. 1990. Intrinsic firing patterns of diverse neocortical neurons. Trends Neurosci. 13(3), 99-104.
DeKosky, S. T., and Scheff, S. W. 1990. Synapse loss in frontal cortex biopsies in Alzheimer's disease: Correlation with cognitive severity. Ann. Neurol. 27(5), 457-464.
Haley, D. C. 1952. Estimation of the dosage mortality relationship when the dose is subject to error. Tech. Rep. TR-15, August 29, Stanford University.
Heit, G., Smith, M. E., and Halgren, E. 1988. Neural encoding of individual words and faces by the human hippocampus and amygdala. Nature (London) 333, 773-775.
Hoffman, R., and Dobscha, S. 1989. Cortical pruning and the development of schizophrenia: A computer model. Schizophrenia Bull. 15(3), 477.
Hoffman, R. E. 1987. Computer simulations of neural information processing and the schizophrenia-mania dichotomy. Arch. Gen. Psychiat. 44, 178.
Horn, D., Ruppin, E., Usher, M., and Herrmann, M. 1993. Neural network modeling of memory deterioration in Alzheimer's disease. Neural Comp. 5, 736-749.
Jin, Y., Potkin, S. G., Rice, D., Sramek, J., et al. 1990. Abnormal EEG responses to photic stimulation in schizophrenic patients. Schizophrenia Bull. 16(4), 627-634.
Kaplan, H. I., and Sadock, B. J. 1991. Synopsis of Psychiatry. Williams & Wilkins, Baltimore.
Mesulam, M. M. 1990. Schizophrenia and the brain. N. Engl. J. Med. 322(12), 842-845.
Parisi, G. 1986. Asymmetric neural networks and the process of learning. J. Phys. A: Math. Gen. 19, L675-L680.
Ruppin, E., and Yeshurun, Y. 1991. Recall and recognition in an attractor neural network of memory retrieval. Connect. Sci. 3(4), 381-399.
Shinomoto, S. 1987. A cognitive and associative memory. Biol. Cybern. 57, 197-211.
Squire, L. R. 1992. Memory and the hippocampus: A synthesis from findings with rats, monkeys, and humans. Psychol. Rev. 99, 195-231.
Stevens, J. R. 1992. Abnormal reinnervation as a basis for schizophrenia: A hypothesis. Arch. Gen. Psychiat. 49, 238-243.
Tsodyks, M. V. 1988. Associative memory in asymmetric diluted network with low activity level. Europhys. Lett. 7, 203-208.
Tsodyks, M. V., and Feigel'man, M. V. 1988. The enhanced storage capacity in neural networks with low activity level. Europhys. Lett. 6, 101-105.
Received July 13, 1993; accepted April 14, 1994.
Communicated by William W. Lytton
Compensatory Mechanisms in an Attractor Neural Network Model of Schizophrenia

D. Horn
School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel
E. Ruppin
Department of Computer Science, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel
We investigate the effect of synaptic compensation on the dynamic behavior of an attractor neural network receiving its input stimuli as external fields projecting on the network. It is shown how, in the face of weakened inputs, memory performance may be preserved by strengthening internal synaptic connections and increasing the noise level. Yet these compensatory changes necessarily have adverse side effects, leading to spontaneous, stimulus-independent retrieval of stored patterns. These results can support Stevens' recent hypothesis that the onset of schizophrenia is associated with frontal synaptic regeneration, occurring subsequent to the degeneration of temporal neurons projecting on these areas.

1 Introduction
A prominent feature of attractor neural networks (ANNs) as models for associative memory is their robustness, i.e., their ability to maintain performance in the face of damage to their neurons and synapses. The robustness of biological systems is due, however, not just to their distributed structure, but also to the compensatory mechanisms that they employ. In a recent paper (Horn et al. 1993), we have shown that when some of the synapses are deleted, compensatory strengthening of all the remaining ones can rehabilitate the system, and that different compensation strategies can account for the observed variation in the progression of Alzheimer's disease. The ANN we examined represented an isolated cortical module, receiving its input as an initial state into which the network is "clamped," after which it evolves in an autonomous manner. In this work, we study an ANN representing a cortical module receiving input patterns as persistent external fields projecting on the network

Neural Computation 7, 182-205 (1995) © 1994 Massachusetts Institute of Technology
[as, for example, in Amit et al. (1990)], presumably arising from other cortical modules. We examine the network's potential to compensate for the weakening of the external input field. It is shown that, to a certain limit, memory performance may be preserved by strengthening the internal synaptic connections and by increasing the noise that stands for other, nonspecific external connections. However, these compensatory changes necessarily have adverse side effects, leading to spontaneous, stimulus-independent retrieval of stored patterns. Our interest in studying synaptic deletion and compensation in an external-input-driven model is motivated by Stevens' recent hypothesis concerning the possible role of such synaptic changes in the pathogenesis of schizophrenia (Stevens 1992). Schizophrenia is a devastating psychiatric disease, whose broad clinical picture ranges from "negative," deficit symptoms, including pervasive blunting of affect, thought, and socialization, to "positive" symptoms such as florid hallucinations and delusions. Its worldwide prevalence is approximately 1%, and even with the most up-to-date treatment the majority of patients suffer from chronic deterioration. While the introduction of relatively objective criteria has improved diagnostic uniformity, and dopamine-blocking neuroleptic drugs have enhanced symptomatic relief, the diagnosis still remains phenomenologic, and the treatment palliative. Our goal in this paper is to provide a computational account of Stevens' theory of the pathogenesis of schizophrenia, in the framework of an ANN model. Only a few neural network models of schizophrenia have been proposed. Hoffman (1987; Hoffman and Dobscha 1989) has previously presented Hopfield-like ANN models of schizophrenic disturbances.
He has demonstrated on small networks that when, due to synaptic deletion, the network's memory capacity becomes overloaded, the memories' basins of attraction are distorted and "parasitic foci" emerge, which he suggested could underlie some schizophrenic symptoms such as hallucinations and delusions. This scenario implies, however, that a considerable deterioration of memory function should accompany the appearance of psychotic symptomatology already in the early stages of the disease process, in contrast with the clinical data (Mesulam 1990; Kaplan and Sadock 1991). We shall show that when the broad spectrum of synaptic changes that occur in accordance with Stevens' theory is considered, memory functions may remain preserved while spontaneous retrieval rises. The latter may be an important mechanism participating in the generation of some psychotic symptoms. Cohen and Servan-Schreiber have presented connectionist feedforward backpropagation networks that were able to simulate normal and schizophrenic performance in several attention- and language-related tasks (Cohen and Servan-Schreiber 1992). In the framework of a model corresponding to the assumed function of the prefrontal cortex, they demonstrate that some schizophrenic functional deficits can arise from the neuromodulatory effects of dopamine, which may take place in schizophrenia. Their model obtains an impressive quantitative fit with human performance in a broad spectrum of cognitive phenomena. They also provide a thorough review of previous neural models of schizophrenia, to which the interested reader is referred. In this work the discussion is restricted to memory retrieval, assuming that the bulk of long-term memory has already been stored. The next section defines the network and its dynamics. Section 3 describes its role as a functional model of Stevens' hypothesis. In Section 4 we derive an analytic approximation of the network performance, and study the relation between stimulus-driven retrieval and spontaneous retrieval following synaptic deletion and compensation. The results of this approximation and of corresponding simulations are presented in Section 5. Finally, the relevance of our findings to Stevens' hypothesis concerning the pathogenesis of schizophrenia is discussed.

2 The Model
We build upon a biologically motivated variant of Hopfield's ANN model, proposed by Tsodyks and Feigel'man (TF) (Tsodyks and Feigel'man 1988). Each neuron i is described by a binary variable S_i = {1,0}, denoting an active (firing) or passive (quiescent) state, respectively. M = αN distributed memory patterns are stored in the network. The elements of each memory pattern are chosen to be 1 (0) with probability p (1 − p), respectively, with p ≪ 1. [...] the first term dominates this equation.
3. In the initial undamaged state, e_0 < θ, ensuring that the network does not follow nonstored patterns if the latter are presented as inputs. Therefore, at t = 0, with m^1(0) ≈ 0, the argument of the first term is negative, and increased noise increases the magnitude of the overlap m^1(1) (the overlap with the input pattern). However, as the dynamics evolve, the argument of the first term becomes positive, and the increased noise reduces the final overlap.
Figure 2 displays the map [m^1(t+1) | m^1(t)] defined by 4.5, describing stimulus-dependent retrieval. In the baseline, undamaged state there is only one (stable) fixed point solution, with m^1 = 1 (full curve). After the weakening of external projections, two additional fixed points may appear (dotted curve). The lowest fixed point is not visible on the scale of Figure 2, but it always coexists with the middle fixed point since, by 4.5, m^1(1) > 0 when m^1(0) = 0. It is easy to see that the middle fixed point is unstable, while the two extreme ones are stable. The middle, unstable fixed point denotes the critical value C_r that divides the map; only input patterns with overlap m^1 > C_r will converge to the higher fixed point, signifying successful retrieval. As the initial overlap of an input pattern is essentially zero, equation 4.5 will converge to its lower fixed point whenever it exists, resulting in stimulus-retrieval failure. The compensatory potentials of internal synaptic strengthening and an increased noise level are illustrated in Figure 3. In both cases we see that with a sufficient compensatory increase the original situation (where only a single, high-overlap fixed point exists) is restored. Note that
Figure 2: The map [m(t+1) | m(t)] generated by iterating the overlap equation. α = 0.05, p = 0.1, e_0 = 0.035, c = 1, and T = T_0 = 0.005. In the initial undamaged state (full curve) e = 0.035, and in the decreased-input case e = 0.015. Recall that θ remains fixed and its value is determined by 4.7, with e_0 = 0.035, c_0 = 1, and p = 0.1. All figures are based on this choice of initial synaptic strengths and threshold, as well as on α = 0.05.
as the noise level is increased, the magnitude of the highest fixed point decreases monotonically, in accordance with observation 3. To study spontaneous retrieval, we calculate the overlap m^μ(0) between the initial network state S and each memory pattern ξ^μ. As shown in the Appendix, the maximal overlap m_max has an almost deterministic value. This enables us to model spontaneous retrieval by entering m_max as the initial m(0) in 4.5, letting e = 0 (no external stimulus is present),
[”.
Figure 3: The map [m(t+1) | m(t)] after a decrease in the magnitude of external projections (e = 0.015) and a compensatory increase in the internal synaptic strength (long-dashed curve, c = 2.5 and T = 0.005), or in the noise level (dashed curve, c = 1 and T = 0.015). These curves should be compared with the c = 1, T = 0.005 curve in Figure 2.
and iterating the overlap equation. The map [m(t+1) | m(t)] generated in this fashion has a form similar to that shown previously in the stimulus-driven mode, as illustrated in Figure 4. Analogous to the case of stimulus-dependent retrieval, the middle fixed point denotes the critical value C_r, such that only when m_max > C_r does expression 4.5 converge to its higher fixed point, denoting spontaneous retrieval of a stored memory pattern.
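The geometry of these maps can be reproduced with a simple stand-in. The overlap equation 4.5 itself is not reproduced in this excerpt, so the sketch below uses a generic Gaussian-noise threshold map m(t+1) = Φ((c·m(t) + e − θ)/T) with illustrative values e = 0.06 and θ = 0.55 (assumptions, not the paper's parameters). It shows the three fixed points at low noise, the failure of retrieval from m(0) = 0, and the collapse to a single, lower-overlap retrieval fixed point when T is raised.

```python
import math

def Phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f(m, c, e, theta, T):
    # stand-in overlap map m(t+1) = f(m(t))
    return Phi((c * m + e - theta) / T)

def fixed_points(c, e, theta, T, n=2000):
    """Count crossings of f(m) with the diagonal on a grid over [0, 1]."""
    g = [f(i / n, c, e, theta, T) - i / n for i in range(n + 1)]
    return sum(1 for a, b in zip(g, g[1:]) if a * b < 0)

def iterate(m0, c, e, theta, T, steps=200):
    m = m0
    for _ in range(steps):
        m = f(m, c, e, theta, T)
    return m

# Weakened input (e < theta), low noise: three fixed points, and
# iteration from m(0) = 0 is trapped near the lowest one.
low_T_fp = fixed_points(c=1.0, e=0.06, theta=0.55, T=0.1)
low_T_m = iterate(0.0, c=1.0, e=0.06, theta=0.55, T=0.1)

# Same input, compensatory noise increase: the lower fixed points are
# eliminated, retrieval from m(0) = 0 succeeds, but the final overlap
# is reduced (m ≈ 0.54 here rather than near 1).
high_T_fp = fixed_points(c=1.0, e=0.06, theta=0.55, T=0.5)
high_T_m = iterate(0.0, c=1.0, e=0.06, theta=0.55, T=0.5)

print(low_T_fp, round(low_T_m, 3))
print(high_T_fp, round(high_T_m, 3))
```

In the paper's model, raising c also shifts the map and can restore retrieval; this stand-in only reproduces the noise route, since its background term at m = 0 vanishes at low T.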
Figure 4: The maps [m(t+1) | m(t)] of spontaneous and stimulus retrieval after a decrease in the magnitude of external projections, and following a compensatory increase in the internal synaptic strength. T = 0.005.
Figure 4 illustrates the effects of synaptic changes on spontaneous retrieval. As e decreases (to e = 0.010; compare with Fig. 2), the curve corresponding to the stimulus-driven map is shifted to the right, approaching the spontaneous-retrieval curve (e = 0). Following observation 1, increasing c results in a leftward (and upward) shift of both curves, possibly maintaining successful stimulus retrieval (by eliminating the lower fixed points, as illustrated in this example), but causing a continuing decrease in the value of C_r, such that spontaneous retrieval may arise. Note also that increasing c tends to further decrease the difference between the spontaneous and stimulus-driven retrieval maps. Depending on the values of e, c, and T, each of the following three retrieval scenarios may occur:

1. The basic, stimulus-retrieval mode is preserved.
2. Spontaneous retrieval emerges (C_r < m_max), while stimulus retrieval is preserved.

3. Stimulus-driven retrieval is lost (a lower fixed point appears in the stimulus-driven map).
Similar to its effect on stimulus-dependent retrieval, an increase of the noise level would enhance the level of spontaneous retrieval, by decreasing the negative argument of the dominant first term in 4.8, but would gradually decrease the final overlap. In the next section we use equation 4.5 to study quantitatively the relation between the two retrieval modes, characterized as Stimulus-dependent with parameters m(0) = 0. e > 0 retrieval mode (4.9) Spontaneous retrieval mode with parameters m(0) = mmax3e = 0 It should be noted that the derivation of 4.5 is based on the assumption that the overlap m singled out is significantly higher than the overlaps with all other memory patterns, which are considered as background noise. This is different from the situation in the spontaneous mode, where a few memory patterns may have initial overlaps that do not fall far from mmax. Hence, the results obtained by iterating 4.5 in this mode are only an approximation to the actual emergence of spontaneous retrieval in the network. As we shall show, in sparsely coded, low memory-load networks simulation results are in close agreement with these estimates.
5 Numerical Results
We turn now to simulations examining the behavior of a network under variations of synaptic strength and noise level, and compare these results with the analytic approximations obtained by iterating equation 4.5. All the simulations presented in this section were performed in a network of N = 400 neurons, storing M = 20 memory patterns, with coding level p = 0.1. Optimal thresholds were set for e0 = 0.035 and c0 = 1. Performance was measured by the final overlap averaged over 100 trials, denoted the average final overlap. In the initial, undamaged state, the values of the synaptic strengths and threshold were set such that perfect memory retrieval at low noise levels was attained, as shown by the full curve in Figure 5a. Figure 5a displays simulation results demonstrating that an increase in the noise level can compensate for the deterioration of memory retrieval due to a decrease in the external input. For fixed T, performance
Figure 5: Stimulus-dependent retrieval performance, measured by the average final overlap m, as a function of the noise level T. Each curve displays this relation at a different magnitude of the external input projections e. (a) Simulation results. (b) Analytic approximation.
decreases rapidly as e is decreased. If the decrease in e is not too large, an increase in T restores stimulus-dependent retrieval performance. The first three curves are qualitatively similar, each characterized by a peak of retrieval performance at some e-dependent optimal level of noise. Eventually, at low e levels, retrieval is lost. Figure 5b presents analytical results describing the effect of noise on the dynamic evolution of the network, obtained by iterating the macroscopic overlap equation 4.5. These results bear a strong resemblance to those obtained in simulations.²

²A discrepancy between the analytic approximations and the simulations regarding the behavior of the undamaged network at low noise should be noted. In general, there is close correspondence between theory and simulations also at low noise values; the case shown in Figure 5 is an exception that arises because, precisely for these parameter values, there is a sharp change in performance near zero temperature. If e is slightly lowered to 0.032, the retrieval performance (in both analysis and simulations) is near zero, and when e is slightly increased to 0.038, the retrieval performance (in both analysis and simulations) is almost perfect.
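The kind of simulation just described can be sketched as follows. The coupling matrix and overlap below are assumed Tsodyks-Feigel'man-style low-activity forms, and every size and parameter value is our own illustrative choice (M = 5 rather than 20, zero-noise synchronous dynamics), so this is a caricature of the setup, not a reproduction of it:

```python
import numpy as np

# Caricature of an attractor network with internal couplings of strength c
# and an external projection of strength e carrying a stimulus pattern.
rng = np.random.default_rng(1)
N, M, p = 400, 5, 0.1
xi = (rng.random((M, N)) < p).astype(float)   # (0,1) memory patterns

c = 1.0                                        # internal synaptic strength
J = c * (xi - p).T @ (xi - p) / (N * p * (1 - p))  # assumed T-F couplings
np.fill_diagonal(J, 0.0)
theta = 0.3                                    # firing threshold (illustrative)

def overlap(V, mu):
    x = xi[mu] - p
    return (x @ V) / (x @ xi[mu])              # self-normalized overlap

def run(V, e, steps=30):
    # Synchronous zero-noise dynamics; the external projection of
    # strength e carries memory pattern 0 (the stimulus).
    for _ in range(steps):
        V = (J @ V + e * xi[0] > theta).astype(float)
    return V

m_stim = overlap(run(np.zeros(N), e=0.4), 0)   # stimulus-driven retrieval
m_none = overlap(run(np.zeros(N), e=0.0), 0)   # no input: stays quiescent
```

Starting from a quiescent state, the external projection pulls the network into the cued pattern's basin, after which the internal couplings sustain it; without input the network stays quiescent.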
Compensatory Mechanisms in an Attractor Neural Network
The initial sharp rise in performance obtained as T is increased above some point (at high enough e values) is made clear by considering the map [m(t+1) | m(t)] displayed in Figure 3: at this point the noise level is sufficient to eliminate the two lower fixed points, and there is a crossover to the highest fixed point. As e is decreased, higher T values are required to eliminate the lower fixed points, and the value of the higher fixed point decreases. As illustrated in Figure 5b, there is a crossover point (e ≈ 0.013) where retrieval performance drops sharply. The map [m(t+1) | m(t)] presented in Figure 6 shows that in this parameter region the crossover to the higher fixed point no longer occurs (see the dashed curve), and the solution of 4.5 is always obtained at the lower fixed point.

The results of a simulation examining the compensatory potential of strengthening the internal connections are shown in Figure 7. As e is decreased, the best possible performance is achieved with increasing c values. The macroscopic overlap equation fails to give an accurate account of stimulus-dependent retrieval at high c levels; as the internal synapses are strengthened and spontaneous retrieval arises, there is no longer a single significant overlap. The combined compensatory potential of internal synaptic strengthening and increased noise is illustrated in Figure 8. The effect is synergistic, as high stimulus-dependent retrieval performance is achieved already at a fairly low increase of the synaptic and noise levels. Figure 9a and b illustrates that synaptic strengthening and increased noise eventually generate spontaneous retrieval. The analytic approximation is in fair correspondence with the simulation; due to interference from other memories with high initial overlap, spontaneous retrieval in the network is lower than the theoretical prediction. In the previous section we saw that three retrieval scenarios may occur, depending on the values of e, c, and T.
As spontaneous retrieval depends only on c and T, the remaining parameter e determines whether stimulus-dependent retrieval is maintained as spontaneous retrieval emerges. In our network this combined retrieval mode is obtained at fairly high levels of e, c, and T, but it may exist also at lower levels, depending on the memory load α, the spontaneous activity q, and the initial external strength e0. Finally, we wish to point out another adverse feature of compensation, relevant to the decreased specificity of stimulus-dependent retrieval: when the undamaged network is presented with a nonmemorized input pattern, it converges to a state that has no significant overlap with any of the memorized patterns. However, after compensatory synaptic changes take place, the network may respond to the presentation of a nonstored pattern by converging to a state that has a high overlap with one of the memory states, thus erroneously retrieving nonqueried patterns. As illustrated in Figure 10, retrieval specificity begins to deteriorate at moderate compensatory levels, before spontaneous retrieval arises.
D. Horn and E. Ruppin
Figure 6: The map [m(t+1) | m(t)] after a decrease in the magnitude of the external projections, and an optimal compensatory increase of the noise level. While at e = 0.014 the fixed point has a large value of 0.9, at e = 0.012 it drops sharply (even at the optimal noise T = 0.019) to 0.25. This shows that e ≈ 0.013 is a crossover between large and small m values.

6 Discussion
Motivated by Stevens' hypothesis, we have constructed a neural model supporting the idea that the synaptic regenerative processes observed in the frontal cortices of schizophrenics, concomitantly with the denervation of MTL projections, are not a mere "epiphenomenon," but have a compensatory role. Schizophrenic symptomatology involves complicated cognitive and perceptual phenomena, whose description certainly
Figure 7: Stimulus-dependent retrieval performance, measured as the average final overlap m, as a function of the internal synaptic strength c. Each curve displays this relation at a different strength of the external input projections e. T = 0.005.
requires much more elaborate representations than a simple associative memory model of the kind we have used. However, whatever their neural realization may be, schizophrenic symptoms such as delusions or hallucinations frequently appear in the absence of any apparent external trigger. It therefore seems plausible that the emergence of spontaneous activation of stored patterns is an essential element in their pathogenesis. The decrease of retrieval specificity may underlie schizophrenic thought disorders such as loosening of associations, where a unifying theme is absent from the patient's discourse; one may contend that, due to decreased specificity, numerous patterns in different modules may be activated concomitantly and compete with each other, making the maintenance of a serially ordered cognitive process an increasingly difficult task.

Figure 8: The final overlap m as a function of internal synaptic strength c. Both simulation and analytical results are displayed. e = 0.015 and T = 0.009. This should be compared with the e = 0.015, T = 0.005 curve in Figure 7 and with the e = 0.015, c = 1 curve of Figure 5.

Delusions and hallucinations tend to concentrate upon a limited set of recurring cognitive and perceptual themes. This cannot be accounted for by a model in which spontaneous retrieval is homogeneously distributed among all stored memory patterns. To obtain a nonhomogeneous distribution, the compensatory regeneration of internal synapses should have an additional Hebbian-like activity-dependent term, as, for example, in equation 6.1.

Figure 9: (a) Spontaneous retrieval, measured as the highest final overlap m achieved with any of the stored memory patterns, displayed as a function of the noise level T. c = 1. (b) Spontaneous retrieval as a function of the internal synaptic compensation factor c. T = 0.009. In both cases e = 0 and q = 0.05, yielding mmax = 0.111 as the starting point for iterating the overlap equation.
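Equation 6.1 itself is not reproduced in this excerpt, so the positive feedback loop it creates is caricatured below: a scalar "basin strength" per memory stands in for the basin of attraction, the retrieved memory is simply the strongest one, and the increment gamma is arbitrary. The point is only that a tiny initial bias is amplified into an inhomogeneous retrieval distribution:

```python
# Caricature of the slow positive-feedback loop: each spontaneous retrieval
# of a memory enlarges its basin of attraction (here, a scalar strength),
# making that memory more likely to be retrieved again.
M, gamma = 20, 0.05
strength = [1.0] * M
strength[3] += 0.01          # tiny initial bias toward memory 3

for _ in range(100):
    mu = max(range(M), key=lambda i: strength[i])  # retrieved memory
    strength[mu] += gamma                          # slow Hebbian-like enlargement
```

After the loop, memory 3 has absorbed all the reinforcement while the other strengths are untouched: the initial symmetry among the stored patterns is broken, as described in the text.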
Figure 10: Decreased specificity. The final overlap m as a function of internal synaptic strength c, for two values of the external synaptic strength e (e = 0.015 and e = 0.025). The input stimulus is a random pattern that does not correspond to any memory pattern. In each trial, m is taken as the highest final overlap achieved with any of the stored memory patterns. T = 0.009.

This learning process is assumed to proceed on a much slower time scale than the retrieval dynamics. Nevertheless, their coexistence can lead to interesting phenomena: as some memory pattern is spontaneously retrieved, its corresponding basin of attraction is further enlarged. This therefore increases the probability of spontaneously retrieving memories that have already been retrieved. If spontaneous retrieval emerges, then via this positive feedback loop any bias in the network's initial state can break the symmetry underlying the generation of a homogeneous distribution of retrieved states, and an inhomogeneous distribution can be obtained.

We have assumed in our analysis that only states with a high overlap with one of the stored memories are cognitively significant. We have thus neglected all spurious states to which the network can converge. The simulation results presented in Table 1 indicate that when the network size is small (N = 400) the role of the spurious states in spontaneous retrieval is
rather small; if the network does not converge to one of the memories, it often ends up in a state with very low activity that has negligible overlap with the memories. However, the percentage of mixed states increases considerably as the network size is increased. Hence, in large networks spontaneous retrieval seems likely to consist mostly of spurious states, together with a few memory states. Yet two additional factors may in turn enhance the relative percentage of memory states spontaneously retrieved in large networks. First, as the coding rate p is decreased [and
Table 1: Distribution of Final Attractor States Generated in a Network with Spontaneous Retrieval.ᵃ

       N      c     Memory   Spurious   Near-zero activity
     400    1.5        0        0           100
     400    2.0       18        3            79
     400    2.5       61        9            30
     800    2.0        0        0           100
     800    2.5       11        4            85
     800    3.0       31       34            35
    1600    3.0        8       20            72
    1600    3.25      14       46            40
    1600    3.5       21       68            11

ᵃStored memory states, spurious states, and near-zero activity states are counted (in percent of trials). The results are shown for three networks of different size N (keeping the memory load α = 0.05 fixed), while varying the internal synaptic strength c. In all simulations presented, e = 0.015, T = 0.009, and p = 0.1. Convergence to a stored memory pattern was counted as such when the final overlap with that pattern was above 0.9.
cortical networks are considered to have very low "coding" rates (Abeles et al. 1990)], the percentage of memory retrieval is significantly increased; for example, in a simulation performed in a network of size N = 1600 (with α = 0.05, c = 2.5, e = 0.015, T = 0.009) and coding rate p = 0.05, the network converged to a memory state in 44% of the trials, to a near-zero activity state in 38%, and to a spurious state in only 18%.³ Second, as preliminary results seem to indicate, the incorporation of Hebbian synaptic changes like those suggested in 6.1 is likely to markedly increase the percentage of memory states to which the network spontaneously converges, due to their enlarged basins of attraction. The question of how spurious states are distinguished from memory states has been addressed elsewhere in the ANN literature (e.g., Parisi 1986; Shinomoto 1987; Ruppin and Yeshurun 1991).

In Alzheimer's disease, synaptic degenerative processes damage the intramodular (i.e., internal) synaptic connections (DeKosky and Scheff 1990) that store the memory patterns. Hence, although synaptic compensation (performed by strengthening the remaining synapses) may slow down memory deterioration, the demise of memory faculties is inevitable (Horn et al. 1993). Simulations we have performed show that spontaneous retrieval does not emerge when the primary damage compensated for involves intramodular connections. In schizophrenia, the internal synaptic matrix of memories presumably remains intact, and synaptic compensatory changes may successfully maintain memory function. However, as we have shown, when internal synaptic strengthening compensates for external synaptic denervation, spontaneous retrieval emerges.

Despite a number of suggestive findings, there is currently no proof that a global abnormality of neurotransmission is a primary feature of schizophrenia (Mesulam 1990). Motivated by Stevens' theory, we have focused on the neuroanatomical synaptic changes, without referring to any specific neurotransmitter. It should be noted, however, that symptoms like delusions and hallucinations are known to be responsive to dopaminergic agents. Building upon recent data that may support the possibility that the initial abnormality in schizophrenia involves a hypodopaminergic state, Cohen and Servan-Schreiber (1992) have shown that schizophrenic deficits may result from a reduction of adrenergic neuromodulatory tone in the prefrontal areas. In parallel, we have shown that increased noise, which is computationally equivalent to decreased neural adrenergic gain [see Cohen and Servan-Schreiber (1992) for a review of these data], may result in adverse positive symptoms. However, in accordance with Stevens' theory, this additional noise arises from synaptic reinnervation, and is independent of the level of dopaminergic activity. On the physiological level, it is predicted that at some stages of the disease, due to the increased noise level, increased spontaneous activity should be observed. This prediction is obviously difficult to examine directly via electrophysiological measurements.

³As the coding rate is lowered, spontaneous memory retrieval is achieved at lower compensation values, so a direct comparison of the retrieval obtained with different coding levels at the same c levels is not possible.
Yet numerous EEG studies in schizophrenics show increased sensitivity to activation procedures (i.e., more frequent spike activity) (Kaplan and Sadock 1991), together with a significant increase in slow-wave delta activity that may reflect increased spontaneous activity (Jin et al. 1990). Our model can be tested by quantitatively examining the correlation between a recent premortem history of florid psychotic symptoms and postmortem neuropathological findings of synaptic compensation in schizophrenic subjects. Quoting Mesulam (1990), "One would have expected neuropathology to provide a gold standard for research on schizophrenia, but this is not yet so." It is our hope that neural network models may encourage detailed neuropathological studies of synaptic changes in neurological and psychiatric disorders, which, in turn, would enable more quantitative modeling.
Appendix: The Calculation of mmax

Let us consider a network of N neurons, storing M (0,1) patterns ξ^μ, μ = 1, …, M. Each memory pattern ξ^μ is generated with Prob(ξᵢ^μ = 1) = p. Markov's inequality yields, for every t > 0, an exponential bound, with exponent η(b), on the probability that the initial overlap of a given memory with the network's activity state exceeds b/[p(1 − p)] (equations A.1 and A.2). To find the tightest of these bounds we differentiate η(b) with respect to t and solve equation A.3 to find the t that maximizes η(b). As we have M = e^{Nρ} stored memory patterns, the expected number of memories whose overlap exceeds b/[p(1 − p)] is e^{N[ρ − η(b)]} (equation A.4).

As is evident from equation A.4, the probability that the maximal initial overlap mmax is larger than b/[p(1 − p)] decreases exponentially with N whenever η(b) − ρ > 0. At low values of b, many memories will have an overlap larger than b/[p(1 − p)]; at high values of b there will be no such memory, with probability close to 1. Hence, mmax is found by searching for the b* satisfying η(b*) = ρ, i.e., the point at which the expected number of memories whose overlap is larger than mmax = b*/[p(1 − p)] equals 1. To this end, for every b [from 0 to p(1 − p)] we find the best t-value by solving A.3, calculate the corresponding η(b) by A.2, and stop whenever η(b) = ρ. Some values of mmax as a function of q, for three different networks, are displayed in Table 2. Although mmax decreases monotonically as the network size increases (keeping α fixed), it remains nonvanishing even for large, "cortical-like" networks.
Table 2: Some Typical mmax Values.ᵃ

       q     N = 400   N = 2000   N = 10000
     0.01     0.060      0.029      0.014
     0.03     0.091      0.045      0.022
     0.05     0.111      0.057      0.028
     0.07     0.128      0.066      0.033
     0.09     0.143      0.074      0.037
     0.11     0.156      0.082      0.041
     0.13     0.168      0.088      0.044
     0.15     0.179      0.094      0.047

ᵃIn all three networks the memory load α = M/N = 0.05 is kept constant, and p = 0.1.
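As a rough cross-check of Table 2, mmax can also be estimated by direct sampling. The overlap normalization below is inferred from the b/[p(1 − p)] factor used in the appendix (a Tsodyks-Feigel'man-style overlap is assumed), and the Monte Carlo estimator is our own, not the paper's analytic search:

```python
import numpy as np

def estimate_m_max(N, M, p, q, trials=200, seed=0):
    # Average, over random draws, of the maximal initial overlap between a
    # random activity state of mean activity q and M random (0,1) patterns
    # of coding level p (assumed overlap normalization).
    rng = np.random.default_rng(seed)
    maxima = []
    for _ in range(trials):
        V = (rng.random(N) < q).astype(float)        # spontaneous state
        xi = (rng.random((M, N)) < p).astype(float)  # memory patterns
        m = (xi - p) @ V / (N * p * (1 - p))         # overlaps with all memories
        maxima.append(m.max())
    return float(np.mean(maxima))
```

For N = 400, M = 20, p = 0.1 the estimates grow with q and are of the same order as the corresponding column of Table 2.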
Acknowledgment

We are grateful to Professor Isaac Meilijson for helpful discussion and comments.
References

Abeles, M., Vaadia, E., and Bergman, H. 1990. Firing patterns of single units in the prefrontal cortex and neural network models. Network 1, 13.
Amit, D. J., Parisi, G., and Nicolis, S. 1990. Neural potentials as stimuli for attractor neural networks. Network 1, 75-88.
Cohen, J. D., and Servan-Schreiber, D. 1992. Context, cortex, and dopamine: A connectionist approach to behavior and biology in schizophrenia. Psychol. Review 99(1), 45-77.
Connors, B. W., and Gutnick, M. J. 1990. Intrinsic firing patterns of diverse neocortical neurons. Trends Neurosci. 13(3), 99-104.
DeKosky, S. T., and Scheff, S. W. 1990. Synapse loss in frontal cortex biopsies in Alzheimer's disease: Correlation with cognitive severity. Ann. Neurol. 27(5), 457-464.
Haley, D. C. 1952. Estimation of the dosage mortality relationship when the dose is subject to error. Tech. Rep. TR-15, August 29, Stanford University.
Heit, G., Smith, M. E., and Halgren, E. 1988. Neural encoding of individual words and faces by the human hippocampus and amygdala. Nature (London) 333, 773-775.
Hoffman, R., and Dobscha, S. 1989. Cortical pruning and the development of schizophrenia: A computer model. Schizophrenia Bull. 15(3), 477.
Hoffman, R. E. 1987. Computer simulations of neural information processing and the schizophrenia-mania dichotomy. Arch. Gen. Psychiat. 44, 178.
Horn, D., Ruppin, E., Usher, M., and Herrmann, M. 1993. Neural network modeling of memory deterioration in Alzheimer's disease. Neural Comp. 5, 736-749.
Jin, Y., Potkin, S. G., Rice, D., Sramek, J., et al. 1990. Abnormal EEG responses to photic stimulation in schizophrenic patients. Schizophrenia Bull. 16(4), 627-634.
Kaplan, H. I., and Sadock, B. J. 1991. Synopsis of Psychiatry. Williams & Wilkins, Baltimore.
Mesulam, M. M. 1990. Schizophrenia and the brain. N. Engl. J. Med. 322(12), 842-845.
Parisi, G. 1986. Asymmetric neural networks and the process of learning. J. Phys. A: Math. Gen. 19, L675-L680.
Ruppin, E., and Yeshurun, Y. 1991. Recall and recognition in an attractor neural network of memory retrieval. Connect. Sci. 3(4), 381-399.
Shinomoto, S. 1987. A cognitive and associative memory. Biol. Cybern. 57, 197-211.
Squire, L. R. 1992. Memory and the hippocampus: A synthesis from findings with rats, monkeys, and humans. Psychol. Rev. 99, 195-231.
Stevens, J. R. 1992. Abnormal reinnervation as a basis for schizophrenia: A hypothesis. Arch. Gen. Psychiat. 49, 238-243.
Tsodyks, M. V. 1988. Associative memory in asymmetric diluted network with low activity level. Europhys. Lett. 7, 203-208.
Tsodyks, M. V., and Feigel'man, M. V. 1988. The enhanced storage capacity in neural networks with low activity level. Europhys. Lett. 6, 101-105.
Received July 13, 1993; accepted April 14, 1994.
Communicated by Michael Jordan
Real-Time Control of a Tokamak Plasma Using Neural Networks

Chris M. Bishop
Neural Computing Research Group, Department of Computer Science, Aston University, Birmingham, B4 7ET, U.K.
Paul S. Haynes, Mike E. U. Smith, Tom N. Todd, and David L. Trotman
AEA Technology, Culham Laboratory, Oxfordshire OX14 3DB, U.K.
In this paper we present results from the first use of neural networks for real-time control of the high-temperature plasma in a tokamak fusion experiment. The tokamak is currently the principal experimental device for research into the magnetic confinement approach to controlled fusion. In an effort to improve the energy confinement properties of the high-temperature plasma inside tokamaks, recent experiments have focused on the use of noncircular cross-sectional plasma shapes. However, the accurate generation of such plasmas represents a demanding problem involving simultaneous control of several parameters on a time scale as short as a few tens of microseconds. Application of neural networks to this problem requires fast hardware, for which we have developed a fully parallel custom implementation of a multilayer perceptron, based on a hybrid of digital and analogue techniques.

1 Introduction
Neural Computation 7, 206-217 (1995)  © 1994 Massachusetts Institute of Technology

Figure 1: Schematic cross section of a tokamak experiment showing the toroidal vacuum vessel (outer D-shaped curve) and the plasma (shaded). Also shown are the radial (R) and vertical (Z) coordinates. To a good approximation, the tokamak can be regarded as axisymmetric about the Z-axis, and so the plasma boundary can be described by its cross-sectional shape at one particular toroidal location.

Fusion of the nuclei of hydrogen provides the energy source that powers the sun. It also offers the possibility of a practically limitless terrestrial source of energy. However, the harnessing of this power has proved to be a highly challenging problem. One of the most promising approaches is based on magnetic confinement of a high-temperature (10^7-10^8 K) plasma in a device called a tokamak (from the Russian for "toroidal magnetic chamber"), as illustrated schematically in Figure 1. At these temperatures the highly ionized plasma is an excellent electrical conductor, and can be confined and shaped by strong magnetic fields. Early tokamaks had plasmas with circular cross sections, for which feedback control of the plasma position and shape is relatively straightforward. However,
recent tokamaks, such as the COMPASS experiment at Culham Laboratory, as well as most next-generation tokamaks, are designed to produce plasmas whose cross sections are strongly noncircular. Figure 2 illustrates some of the plasma shapes that COMPASS is designed to explore. These novel cross sections provide substantially improved energy confinement properties and thereby significantly enhance the performance of the tokamak. Unlike circular cross-section plasmas, highly noncircular shapes are more difficult to produce and to control accurately, since currents through several control coils must be adjusted simultaneously. Furthermore, during a typical plasma pulse, the shape must evolve, usually from some initial near-circular shape. Due to uncertainties in the current and pressure distributions within the plasma, the desired accuracy of plasma control can be achieved only by making real-time measurements of the position and shape of the boundary, and using error feedback to adjust the currents in the control coils.

Figure 2: Cross sections of the COMPASS vacuum vessel showing some examples of potential plasma shapes. The solid curve is the boundary of the vacuum vessel, and the plasma is shown by the shaded regions. Again, R and Z are the radial and vertical coordinates, respectively, in units of meters.

The physics of the plasma equilibrium is determined by the force balance between the thermal pressure of the plasma and the pressure of the magnetic field, and is relatively well understood. Particular plasma configurations are described in terms of solutions of the Grad-Shafranov equation (Shafranov 1958), given by

    ∂²Ψ/∂R² − (1/R) ∂Ψ/∂R + ∂²Ψ/∂Z² = −μ₀ R I(Ψ, R)        (1.1)

where the coordinates R and Z are defined in Figure 1, the function Ψ is called the poloidal flux function, and the plasma boundary corresponds to a surface of constant Ψ. The function I(Ψ, R) specifies the plasma current density, and for the work reported here we have chosen the following representation
which is motivated by plasma physics considerations. Here b is a constant, β controls the ratio of plasma pressure to magnetic field energy density, and the parameters α₁ and α₂ are numbers ≥ 1 that can be varied to generate a variety of current profiles. Fortunately, the plasma configurations obtained by solution of the Grad-Shafranov equation are relatively insensitive to the precise choice of representation for the function I(Ψ, R). Due to the nonlinear nature of the Grad-Shafranov equation, a general analytic solution is not possible. However, for a given current density function I(Ψ, R), the Grad-Shafranov equation can be solved by iterative numerical methods, with boundary conditions determined by currents flowing in the external control coils that surround the vacuum vessel. On the tokamak itself, it is changes in these currents that are used to alter the position and cross-sectional shape of the plasma. Numerical solution of the Grad-Shafranov equation represents the standard technique for post-shot analysis of the plasma, and is also the method used to generate the training dataset for the neural network, as described in the next section. However, this approach is computationally very intensive and is therefore unsuitable for feedback control purposes. For real-time control it is necessary to have a fast (typically ≤ 50 μsec) determination of the plasma boundary shape. This information can be extracted directly from a variety of diagnostic systems, the most important being local magnetic measurements taken at a number of points around the perimeter of the vacuum vessel. Most tokamaks have several tens or hundreds of small pick-up coils located at carefully optimized points around the torus for this purpose. We shall represent these magnetic signals collectively as a vector m.
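The iterative numerical solution mentioned above can be illustrated with a toy relaxation scheme. Everything below is our own simplification: a Grad-Shafranov-type elliptic operator, a constant right-hand side standing in for the current-density term, unit constants, a small rectangular grid, and plain Jacobi sweeps. A real free-boundary equilibrium code is far more elaborate than this sketch:

```python
import numpy as np

# Toy Jacobi relaxation for  psi_RR - (1/R) psi_R + psi_ZZ = f(R, Z)
# with psi = 0 on the boundary of a small rectangular (R, Z) grid.
nR, nZ = 24, 24
R = np.linspace(1.0, 2.0, nR)          # keep R > 0, away from the axis
dR = R[1] - R[0]
dZ = 1.0 / (nZ - 1)
f = -np.ones((nR, nZ))                 # constant stand-in source term
psi = np.zeros((nR, nZ))

def residual(psi):
    # Max-norm residual of the discretized equation on interior points.
    r = 0.0
    for i in range(1, nR - 1):
        for j in range(1, nZ - 1):
            lap = ((psi[i+1, j] - 2*psi[i, j] + psi[i-1, j]) / dR**2
                   - (psi[i+1, j] - psi[i-1, j]) / (2 * dR * R[i])
                   + (psi[i, j+1] - 2*psi[i, j] + psi[i, j-1]) / dZ**2)
            r = max(r, abs(lap - f[i, j]))
    return r

diag = 2.0 / dR**2 + 2.0 / dZ**2
res0 = residual(psi)
for _ in range(600):                   # Jacobi sweeps
    new = psi.copy()
    for i in range(1, nR - 1):
        for j in range(1, nZ - 1):
            new[i, j] = ((psi[i+1, j] + psi[i-1, j]) / dR**2
                         - (psi[i+1, j] - psi[i-1, j]) / (2 * dR * R[i])
                         + (psi[i, j+1] + psi[i, j-1]) / dZ**2
                         - f[i, j]) / diag
    psi = new
res = residual(psi)
```

Each sweep solves the discretized equation for the center point given its neighbors; the residual shrinks steadily, illustrating why such iterative solvers are accurate but far too slow for microsecond-scale feedback control.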
The position and shape of the plasma boundary can be described in terms of a set of geometric parameters, such as vertical position and elongation, which we collectively denote by y_k. These parameters are illustrated in Figure 3 and will be discussed in more detail in the next section. The basic problem that has to be addressed, therefore, is to find a representation of the (nonlinear) mapping from the magnetic signals m to the values of the geometric parameters y_k that can be implemented in suitable hardware for real-time control. The conventional approach presently in use on many tokamaks involves approximating the mapping between the measured magnetic signals and the geometric parameters by a single linear transformation. However, the intrinsic nonlinearity of the mapping suggests that a representation in terms of feedforward neural networks should give significantly improved results (Lister and Schnurrenberger 1991; Bishop et al. 1992; Lagin et al. 1993). Figure 4 shows a block diagram of the control loop for the neural network approach to tokamak equilibrium control.

Figure 3: Schematic illustration of a cross section of the toroidal vacuum vessel showing the definitions of various coordinates and parameters. The elliptical curve denotes the plasma boundary, whose center is at R = R₀, Z = Z₀ and whose minor radius is a. The parameter κ describes the elongation of the plasma, and θ is called the poloidal angle. The triangularity δ (not shown) describes the departure of the plasma boundary from a simple ellipse. (Values of κ = 1 and δ = 0 correspond to a circular plasma boundary.)
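The nonlinear mapping just described can be sketched as a forward pass. The sizes below (16 inputs, 4 hidden units, 3 outputs) anticipate numbers quoted later in the paper, and the random weights are placeholders rather than a trained network:

```python
import numpy as np

# Toy forward pass of the kind of mapping described: a multilayer perceptron
# with one tanh hidden layer and linear outputs, taking 16 normalized
# magnetic signals to 3 geometric parameters (e.g., Z0, R0, elongation).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 16)), np.zeros(4)   # 4 hidden units
W2, b2 = rng.normal(size=(3, 4)), np.zeros(3)    # 3 linear outputs

def mlp(m_normalized):
    h = np.tanh(W1 @ m_normalized + b1)           # tanh hidden layer
    return W2 @ h + b2                            # linear output layer

y = mlp(rng.normal(size=16))
```

A forward pass of this size is small enough to be evaluated within the tens-of-microseconds budget quoted in the abstract, which is what makes the hardware implementation described later feasible.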
2 Software Simulation Results
The dataset for training and testing the network was generated by numerical solution of equation 1.1 using a free-boundary equilibrium code. This code contains a detailed description of the COMPASS hardware configuration, and allows the boundary conditions to be expressed directly in terms of currents in the control coils. The database currently consists of over 2,000 equilibria spanning the wide range of plasma positions and shapes available in COMPASS. Each configuration takes several minutes to generate on a fast workstation.

Figure 4: Block diagram of the control loop used for real-time feedback control of plasma position and shape. The neural network provides a fast nonlinear mapping from the measured magnetic signals onto the values of a set of geometric parameters y_k (illustrated in Fig. 3) that describe the position and shape of the plasma boundary. These parameters are compared with their desired values, and the resulting error signals are used to correct the currents in a set of feedback control coils using standard linear PD (proportional-differential) controllers.

For a large class of equilibria, the plasma boundary can be reasonably well represented in terms of a simple parameterization, governed by an angle-like variable θ, given by
    R(θ) = R₀ + a cos(θ + δ sin θ)
    Z(θ) = Z₀ + a κ sin θ                       (2.1)

where we have defined the following parameters:

R₀  radial distance of the plasma center from the major axis of the torus,
Z₀  vertical distance of the plasma center from the torus midplane,
a   minor radius, measured in the plane Z = Z₀,
κ   elongation,
δ   triangularity.

Thus, for instance, if the triangularity parameter δ is zero, the boundary is described by an ellipse with elongation κ. These parameters (except for the triangularity) are illustrated in Figure 3. Each of the entries in the database has been fitted using the form in equation 2.1, so that the equilibria are labeled with the appropriate values of the shape parameters.

On the COMPASS experiment, there are some 120 magnetic signals that could be used to provide inputs to the network. Since each input could either be included or excluded, there are potentially 2^120 possible sets of inputs that might be considered. To find a computationally tractable procedure for selecting a suitable subset of inputs, we have used forward sequential selection (Fukunaga 1990), based on a simple linear mapping (discussed shortly) to provide a selection criterion. Simulations aimed at finding a network suitable for use in real-time control have so far concentrated on 16 inputs, since this is the number available from the initial hardware configuration.

It is important to note that the transformation from magnetic signals to flux surface parameters involves an exact linear invariance. This follows from the fact that if all of the currents are scaled by a constant factor, then the magnetic fields will be scaled by this factor, and the geometry of the plasma boundary will be unchanged. It is important to take advantage of this prior knowledge and to build it into the network structure, rather than force the network to learn it by example. We therefore normalize the vector m of input signals to the network by dividing by a quantity proportional to the total plasma current. A scaling of the magnetic signals by a common factor then leaves the network inputs (and hence the network outputs) unchanged.
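The normalization just described can be sketched as follows (the function and variable names are our own):

```python
import numpy as np

# Dividing the magnetic signal vector m by a quantity proportional to the
# total plasma current makes the network inputs exactly invariant under a
# common rescaling of all currents.
def normalize_inputs(m, total_plasma_current):
    return m / total_plasma_current

m = np.array([0.8, -1.2, 0.4, 2.0])
x1 = normalize_inputs(m, 100.0)
x2 = normalize_inputs(3.5 * m, 3.5 * 100.0)   # every current scaled by 3.5
```

Because the invariance is built in rather than learned, it holds exactly for any scale factor, which is the first of the three advantages listed below.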
Compared with learning by example, this explicit use of prior knowledge brings three distinct advantages: (1) the network exhibits exact, rather than approximate, invariance to rescaling of the currents; (2) the relative output accuracy can be maintained over a wide range of plasma current (which typically varies from a few kA to a few hundred kA during the plasma pulse); and (3) the network training can be performed with a smaller dataset than would otherwise be possible, which can be generated for just one value of total plasma current. Note that the normalization has to be incorporated into the hardware implementation of the network, as will be discussed in Section 3. The results presented in this paper are based on a multilayer perceptron architecture having a single layer of hidden units with "tanh" activation functions, and linear output units. Networks are trained by minimization of a sum-of-squares error using a standard conjugate gradient optimization algorithm, and the number of hidden units is optimized by measuring performance with respect to an independent test set. Results from the neural network mapping are compared with those from the optimal linear mapping, that is, the single linear transformation that minimizes the same sum-of-squares error as is used in the neural network training algorithm, as this represents the method currently used on a number of present-day tokamaks. This minimization can be expressed in terms of a set of linear equations whose solution can be found efficiently and robustly using the technique of singular value decomposition
Real-Time Control of a Tokamak Plasma
Figure 5: Plots of the values from the test set versus the values predicted by the linear mapping for the three equilibrium parameters, together with the corresponding plots for the neural network with four hidden units.
(Press et al. 1992). Note that the same normalization of the inputs was used here as in the neural network case. Initial results were obtained on networks having three output units, corresponding to the values of vertical position Z0, major radius R0, and elongation κ, these being parameters that are of interest for real-time feedback control. The smallest normalized test set error of 11.7 is obtained from the network having 16 hidden units. By comparison, the optimal linear mapping gave a normalized test set error of 18.3. This represents a reduction in error of about 30% in going from the linear mapping to the neural network. Such an improvement, in the context of this application, is very significant. For the experiments on real-time feedback control described in Section 4 the currently available hardware permitted only networks having four hidden units, and so we consider the results from this network in more detail. Figure 5 shows plots of the network predictions for various parameters versus the corresponding values from the test set portion of the database. Analogous plots for the optimal linear map predictions versus the database values are also shown. Comparison of the corresponding figures shows the poorer predictive capability of the linear approach, even for this suboptimal network topology.

3 Hardware Implementation
The hardware implementation of the neural network must have a bandwidth of ~20 kHz in order to cope with the fast time scales of the plasma evolution. It must also have an output precision of at least 8 bits in order to ensure that the final accuracy that is attainable will not be limited by the hardware system. We have chosen to develop a fully parallel custom implementation of the multilayer perceptron, based on analogue signal paths with digitally stored synaptic weights (Bishop et al. 1993a). A VME-based modular construction has been chosen as this allows flexibility in changing the network architecture, ease of loading network weights, and simplicity of data acquisition. Three separate types of card have been developed, as follows:

- Combined 16-input buffer and signal normalizer: This provides an analogue hardware implementation of the input normalization described earlier. For future flexibility this makes use of an EPROM (erasable programmable read-only memory) to provide independent scaling of groups of 8 inputs by an arbitrary function of an external reference signal. In the present application the reference signal is taken to be the plasma current (determined by a magnetic pick-up loop called a Rogowski coil) and the function is chosen to be a simple inverse proportionality.

- 16x4 matrix multiplier: The synaptic weights are produced using 12-bit frequency-compensated multiplying DACs (digital-to-analogue converters) that can be configured to allow 4-quadrant multiplication of analogue signals by a digitally stored number. The weights are obtained as a 12-bit 2's-complement representation from the VME backplane. Note that the DACs are being used here as digitally controlled attenuators, and not in their usual role of converting digital signals into analogue signals. Synaptic weights are downloaded (prior to the plasma pulse) via the VME backplane from a central control computer, using an addressing technique to label the individual weights.

- 4-channel sigmoid module: There are many ways to produce a sigmoidal nonlinearity, and we have opted for a solution using two transistors configured as a long-tailed pair, to generate a "tanh" sigmoidal transfer characteristic. The principal drawback of such an approach is the strong temperature sensitivity due to the appearance of temperature in the denominator of the exponential transistor transfer characteristic. An elegant solution to this problem has been found by exploiting a chip containing five transistors in close
thermal contact. Two of the transistors form the long-tailed pair; one of the transistors is used as a heat source, and the remaining two transistors are used to measure temperature. External circuitry provides active thermal feedback control, and stability to changes in ambient temperature over the range 0 to 50°C is found to be well within the acceptable range. A separate 12-bit DAC system, identical to the ones used on the matrix multiplier cards but with a fixed DC input, is used to provide a bias for each sigmoid. The complete network is constructed by mounting the appropriate combination of cards in a VME rack and configuring the network topology using front panel interconnections. The system includes extensive diagnostics, allowing voltages at all key points within the network to be monitored as a function of time via a series of multiplexed output channels. 4 Results from Real-Time Feedback Control
Figure 6 shows the first results obtained from real-time control of the plasma in the COMPASS tokamak using neural networks. The evolution of the plasma elongation, under the control of the neural network, is plotted as a function of time during a plasma pulse. Here the desired elongation has been preprogrammed to follow a series of steps as a function of time. The remaining two network outputs (radial position R0 and vertical position Z0) were digitized for post-shot diagnosis, but were not used for real-time control. The graph clearly shows the network responding and generating the required elongation signal in close agreement with the reconstructed values. The typical residual error is of order 0.07 on elongation values up to around 1.5. Part of this error is attributable to residual offset in the integrators used to extract magnetic field information from the pick-up coils, and this is currently being corrected through modifications to the integrator design. An additional contribution to the error arises from the restricted number of hidden units available with the initial hardware configuration. While these results represent the first obtained using closed-loop control, it is clear from earlier software modeling of larger network architectures (such as 32-16-4) that residual errors of order a few percent should be attainable. The implementation of such larger networks is being pursued, following the successes with the smaller system. Neural networks have already been used with great success for fast interpretation of the data from tokamak plasma diagnostics to determine the spatial and temporal profiles of quantities such as temperature and density (Bishop et al. 1993b, 1993c; Bartlett and Bishop 1993). There is currently considerable interest in extending these techniques to allow real-time feedback control of the profiles to give more complete determination of the plasma configuration than is possible by boundary shape
Figure 6: Plot of the plasma elongation κ as a function of time during shot no. 9576 on the COMPASS tokamak, during which the elongation was being controlled in real time by the neural network. The solid curve shows the value of elongation given by one of the network outputs. The dashed curve shows the post-shot reconstruction of the elongation obtained from a simple "filament" code, which gives relatively rapid post-shot plasma shape reconstruction but with limited accuracy. The circles denote reconstructions obtained from the full equilibrium code, which gives closer agreement with the network predictions.
control alone. For such applications, neural networks appear to offer one of the most promising approaches.

Acknowledgments

We would like to thank Peter Cox, Jo Lister, and Colin Roach for many useful discussions and technical contributions. This work was partially supported by the UK Department of Trade and Industry.

References

Bishop, C. M., Cox, P., Haynes, P. S., Roach, C. M., Smith, M. E. U., Todd, T. N., and Trotman, D. L. 1992. A neural network approach to tokamak equilibrium control. In Neural Network Applications, J. G. Taylor, ed., pp. 114-128. Springer-Verlag, Berlin.
Bishop, C. M., Haynes, P. S., Roach, C. M., Smith, M. E. U., Todd, T. N., and Trotman, D. L. 1993a. Hardware implementation of a neural network for plasma position control in COMPASS-D. In Proceedings of the 17th Symposium on Fusion Technology, Rome, 2, 997-1001.
Bishop, C. M., Roach, C. M., and von Hellerman, M. 1993b. Automatic analysis of JET charge exchange spectra using neural networks. Plasma Phys. Controlled Fusion 35, 765-773.
Bishop, C. M., Strachan, I. G. D., O'Rourke, J., Maddison, G., and Thomas, P. R. 1993c. Reconstruction of tokamak density profiles using feedforward networks. Neural Comput. Appl. 1, 4-16.
Bartlett, D. V., and Bishop, C. M. 1993. Development of neural network techniques for the analysis of JET ECE data. In Proceedings of the 8th International Workshop on ECE and ECRH (EC8, 1992).
Fukunaga, K. 1990. Statistical Pattern Recognition, 2nd ed. Academic Press, San Diego.
Lagin, L., Bell, R., Davis, S., Eck, T., Jardin, S., Kessel, C., McEnerney, J., Okabayashi, M., Popyack, J., and Sauthoff, N. 1993. Application of neural networks for real-time calculations of plasma equilibrium parameters for PBX-M. In Proceedings of the 17th Symposium on Fusion Technology, Rome, 2, 1057-1061.
Lister, J. B., and Schnurrenberger, H. 1991. Fast non-linear extraction of plasma parameters using a neural network mapping. Nuclear Fusion 31, 1291-1300.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1992. Numerical Recipes in C: The Art of Scientific Computing, 2nd ed. Cambridge University Press, Cambridge.
Shafranov, V. D. 1958. On magnetohydrodynamical equilibrium configurations. Sov. Phys. JETP 8, 710.
Received November 5, 1993; accepted May 20, 1994.
REVIEW
Communicated by Vladimir Vapnik
Regularization Theory and Neural Networks Architectures Federico Girosi Michael Jones Tomaso Poggio Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
We had previously shown that regularization principles lead to approximation schemes that are equivalent to networks with one layer of hidden units, called regularization networks. In particular, standard smoothness functionals lead to a subclass of regularization networks, the well known radial basis functions approximation schemes. This paper shows that regularization networks encompass a much broader range of approximation schemes, including many of the popular general additive models and some of the neural networks. In particular, we introduce new classes of smoothness functionals that lead to different classes of basis functions. Additive splines as well as some tensor product splines can be obtained from appropriate classes of smoothness functionals. Furthermore, the same generalization that extends radial basis functions (RBF) to hyper basis functions (HBF) also leads from additive models to ridge approximation models, containing as special cases Breiman's hinge functions, some forms of projection pursuit regression, and several types of neural networks. We propose to use the term generalized regularization networks for this broad class of approximation schemes that follow from an extension of regularization. In the probabilistic interpretation of regularization, the different classes of basis functions correspond to different classes of prior probabilities on the approximating function spaces, and therefore to different types of smoothness assumptions. In summary, different multilayer networks with one hidden layer, which we collectively call generalized regularization networks, correspond to different classes of priors and associated smoothness functionals in a classical regularization principle.
Three broad classes are (1) radial basis functions that can be generalized to hyper basis functions, (2) some tensor product splines, and (3) additive splines that can be generalized to schemes of the type of ridge approximation, hinge functions, and several perceptron-like neural networks with one hidden layer. Neural Computation 7, 219-269 (1995)
© 1995 Massachusetts Institute of Technology
F. Girosi, M. Jones, and T. Poggio
1 Introduction
In recent years we and others have argued that the task of learning from examples can be considered in many cases to be equivalent to multivariate function approximation, that is, to the problem of approximating a smooth function from sparse data, the examples. The interpretation of an approximation scheme in terms of networks and vice versa has also been extensively discussed (Barron and Barron 1988; Poggio and Girosi 1989, 1990a,b; Girosi 1992; Broomhead and Lowe 1988; Moody and Darken 1988, 1989; White 1989, 1990; Ripley 1994; Omohundro 1987; Kohonen 1990; Lapedes and Farber 1988; Rumelhart et al. 1986; Hertz et al. 1991; Kung 1993; Sejnowski and Rosenberg 1987; Hurlbert and Poggio 1988; Poggio 1975). In a series of papers we have explored a quite general approach to the problem of function approximation. The approach regularizes the ill-posed problem of function approximation from sparse data by assuming an appropriate prior on the class of approximating functions. Regularization techniques (Tikhonov 1963; Tikhonov and Arsenin 1977; Morozov 1984; Bertero 1986; Wahba 1975, 1979, 1990) typically impose smoothness constraints on the approximating set of functions. It can be argued that some form of smoothness is necessary to allow meaningful generalization in approximation type problems (Poggio and Girosi 1989, 1990). A similar argument can also be used (see Section 9.1) in the case of classification where smoothness is a condition on the classification boundaries rather than on the input-output mapping itself. Our use of regularization, which follows the classical technique introduced by Tikhonov, identifies the approximating function as the minimizer of a cost functional that includes an error term and a smoothness functional, usually called a stabilizer. In the Bayesian interpretation of regularization (see Kimeldorf and Wahba 1971; Wahba 1990; Bertero et al. 1988; Marroquin et al. 1987; Poggio et al.
1985) the stabilizer corresponds to a smoothness prior, and the error term to a model of the noise in the data (usually gaussian and additive). In Poggio and Girosi (1989, 1990) and Girosi (1992) we showed that regularization principles lead to approximation schemes that are equivalent to networks with one "hidden" layer, which we call regularization networks (RN). In particular, we described how a certain class of radial stabilizers, and the associated priors in the equivalent Bayesian formulation, lead to a subclass of regularization networks, the already-known radial basis functions (Powell 1987, 1992; Franke 1982, 1987; Micchelli 1986; Kansa 1990a,b; Madych and Nelson 1990a,b; Dyn 1987, 1991; Hardy 1971, 1990; Buhmann 1990; Lancaster and Salkauskas 1986; Broomhead and Lowe 1988; Moody and Darken 1988, 1989; Poggio and Girosi 1990; Girosi 1992). The regularization networks with radial stabilizers we studied include many classical one-dimensional (Schumaker 1981; de Boor 1978) as well as multidimensional splines and approximation tech-
Regularization Theory and Neural Networks
niques, such as radial and nonradial gaussian, thin-plate splines (Duchon 1977; Meinguet 1979; Grimson 1982; Cox 1984; Eubank 1988) and multiquadric functions (Hardy 1971, 1990). In Poggio and Girosi (1990a,b) we extended this class of networks to Hyper Basis Functions (HBF). In this paper we show that an extension of regularization networks, which we propose to call Generalized Regularization Networks (GRN), encompasses an even broader range of approximation schemes including, in addition to HBF, tensor product splines, many of the general additive models, and some of the neural networks. As expected, GRN have approximation properties of the same type as already shown for some of the neural networks (Girosi and Poggio 1990a; Cybenko 1989; Hornik et al. 1989; White 1990; Irie and Miyake 1988; Funahashi 1989; Barron 1991, 1994; Jones 1992; Mhaskar and Micchelli 1992, 1993; Mhaskar 1993a,b). The plan of the paper is as follows. We first discuss the solution of the variational problem of regularization. We then introduce three different classes of stabilizers, and the corresponding priors in the equivalent Bayesian interpretation, that lead to different classes of basis functions: the well-known radial stabilizers, tensor-product stabilizers, and the new additive stabilizers that underlie additive splines of different types. It is then possible to show that the same argument that extends radial basis functions to hyper basis functions also leads from additive models to some ridge approximation schemes, defined as
f(x) = Σ_{p=1}^{n} h_p(w_p · x)
where h_p are appropriate one-dimensional functions. Special cases of ridge approximation are Breiman's hinge functions (1993), projection pursuit regression (PPR) (Friedman and Stuetzle 1981; Huber 1985; Diaconis and Freedman 1984; Donoho and Johnstone 1989; Moody and Yarvin 1991), and multilayer perceptrons (Lapedes and Farber 1988; Rumelhart et al. 1986; Hertz et al. 1991; Kung 1993; Sejnowski and Rosenberg 1987). Simple numerical experiments are then described to illustrate the theoretical arguments. In summary, the chain of our arguments shows that some ridge approximation schemes are approximations of regularization networks with appropriate additive stabilizers. The form of h_p depends on the stabilizer, and includes in particular cubic splines (used in typical implementations of PPR) and one-dimensional gaussians. Perceptron-like neural networks with one hidden layer and with a gaussian activation function are included. It seems impossible, however, to directly derive from regularization principles the sigmoidal activation functions typically used in feedforward neural networks. We discuss, however, in a simple example, the close relationship between basis functions of the hinge, the sigmoid, and the gaussian type. The appendices deal with observations related to the main results of the paper and more technical details.
2 The Regularization Approach to the Approximation Problem
Suppose that the set g = {(x_i, y_i) ∈ R^d × R, i = 1, ..., N} of data has been obtained by random sampling a function f, belonging to some space of functions X defined on R^d, in the presence of noise, and suppose we are interested in recovering the function f, or an estimate of it, from the set of data g. This problem is clearly ill-posed, since it has an infinite number of solutions. To choose one particular solution we need to have some a priori knowledge of the function that has to be reconstructed. The most common form of a priori knowledge consists in assuming that the function is smooth, in the sense that two similar inputs correspond to two similar outputs. The main idea underlying regularization theory is that the solution of an ill-posed problem can be obtained from a variational principle, which contains both the data and prior smoothness information. Smoothness is taken into account by defining a smoothness functional φ[f] in such a way that lower values of the functional correspond to smoother functions. Since we look for a function that is simultaneously close to the data and also smooth, it is natural to choose as a solution of the approximation problem the function that minimizes the following functional:

H[f] = Σ_{i=1}^{N} (y_i − f(x_i))² + λφ[f]   (2.1)
where λ is a positive number that is usually called the regularization parameter. The first term is enforcing closeness to the data, and the second smoothness, while the regularization parameter controls the trade-off between these two terms, and can be chosen according to cross-validation techniques (Allen 1974; Wahba and Wold 1975; Golub et al. 1979; Craven and Wahba 1979; Utreras 1979; Wahba 1985) or to some other principle, such as structural risk minimization (Vapnik 1988). It can be shown that for a wide class of functionals φ, the solutions of the minimization of the functional (2.1) all have the same form. Although a detailed and rigorous derivation of the solution of this problem is out of the scope of this paper, a simple derivation of this general result is presented in Appendix A. In this section we just present a family of smoothness functionals and the corresponding solutions of the variational problem. We refer the reader to the current literature for the mathematical details (Wahba 1990; Madych and Nelson 1990a; Dyn 1987). We first need to give a more precise definition of what we mean by smoothness and define a class of suitable smoothness functionals. We refer to smoothness as a measure of the "oscillatory" behavior of a function. Therefore, within a class of differentiable functions, one function will be said to be smoother than another one if it oscillates less. If we look at the functions in the frequency domain, we may say that a function is smoother than another one if it has less energy at high frequency (smaller bandwidth). The high frequency content of a function can be
measured by first high-pass filtering the function, and then measuring the power, that is, the L2 norm, of the result. In formulas, this suggests defining smoothness functionals of the form

φ[f] = ∫ ds |f̃(s)|² / G̃(s)   (2.2)

where the tilde indicates the Fourier transform, G̃ is some positive function that tends to zero as ||s|| → ∞ (so that 1/G̃ is a high-pass filter) and for which the class of functions such that this expression is well defined is not empty. For a well-defined class of functions G (Madych and Nelson 1990a; Dyn 1991; Dyn et al. 1989) this functional is a seminorm, with a finite-dimensional null space N. The next section will be devoted to giving examples of the possible choices for the stabilizer φ. For the moment we just assume that it can be written as in equation 2.2, and make the additional assumption that G is symmetric, so that its Fourier transform G̃ is real and symmetric. In this case it is possible to show (see Appendix A for a sketch of the proof) that the function that minimizes the functional (2.1) has the form

f(x) = Σ_{i=1}^{N} c_i G(x − x_i) + Σ_{α=1}^{k} d_α ψ_α(x)   (2.3)

where {ψ_α}_{α=1}^{k} is a basis in the k-dimensional null space N of the functional φ, that in most cases is a set of polynomials, and therefore will be referred to as the "polynomial term" in equation 2.3. The coefficients d_α and c_i depend on the data, and satisfy the following linear system:
(G + λI)c + Ψ^T d = y   (2.4)

Ψc = 0   (2.5)

where I is the identity matrix, and we have defined

(y)_i = y_i,  (c)_i = c_i,  (d)_α = d_α,  (G)_{ij} = G(x_i − x_j),  (Ψ)_{αi} = ψ_α(x_i)
Notice that if the data term in equation 2.1 is replaced by Σ_{i=1}^{N} V[f(x_i) − y_i], where V is any differentiable function, the solution of the variational principle still has the form 2.3, but the coefficients can no longer be found by solving a linear system of equations (Girosi 1991; Girosi et al. 1991). The existence of a solution to the linear system shown above is guaranteed by the existence of the solution of the variational problem. The case of λ = 0 corresponds to pure interpolation. In this case the existence of an exact solution of the linear system of equations depends on the properties of the basis function G (Micchelli 1986).
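Fitting a regularization network of the form 2.3 reduces, in the simplest setting, to solving this linear system. A minimal sketch, under the simplifying assumption of a gaussian basis function (whose null space is empty, so the polynomial term drops and equations 2.4-2.5 reduce to (G + λI)c = y); the choice of σ and the test data are illustrative, not from the paper:

```python
import numpy as np

def fit_regularization_network(X, y, lam, sigma=0.2):
    """Fit c in f(x) = sum_i c_i G(x - x_i) by solving (G + lam*I)c = y.
    With a gaussian basis function the null space is empty, so the
    polynomial term of equation 2.3 can be dropped."""
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    G = np.exp(-d2 / (2.0 * sigma**2))        # (G)_ij = G(x_i - x_j)
    c = np.linalg.solve(G + lam * np.eye(len(y)), y)
    return c, G

def predict(x, X, c, sigma=0.2):
    """Evaluate f(x) = c . g(x), with [g(x)]_i = G(x - x_i)."""
    g = np.exp(-np.sum((x - X)**2, axis=-1) / (2.0 * sigma**2))
    return c @ g

# pure interpolation (lam = 0): the network reproduces the data exactly
X = np.linspace(0.0, 1.0, 7)[:, None]
y = np.sin(2.0 * np.pi * X[:, 0])
c, G = fit_regularization_network(X, y, lam=0.0)
assert np.allclose(G @ c, y)
```

For λ > 0 the same code returns a smoothed solution that no longer passes exactly through the data, illustrating the trade-off controlled by the regularization parameter.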
The approximation scheme of equation 2.3 has a simple interpretation in terms of a network with one layer of hidden units, which we call a Regularization Network (RN). Appendix B describes the extension to the vector output scheme. In summary, the argument of this section shows that using a regularization network of the form 2.3, for a certain class of basis functions G, is equivalent to minimizing the functional 2.1. In particular, the choice of G is equivalent to the corresponding choice of the smoothness functional 2.2.

2.1 Dual Representation of Regularization Networks. Consider an approximating function of the form 2.3, neglecting the "polynomial term" for simplicity. A compact notation for this expression is
f(x) = c · g(x)   (2.6)

where g(x) is the vector of functions such that [g(x)]_i = G(x − x_i). Since the coefficients c satisfy the linear system 2.4, solution 2.6 becomes
f(x) = [(G + λI)^{−1} y] · g(x)

We can rewrite this expression as

f(x) = Σ_{i=1}^{N} y_i b_i(x) = y · b(x)   (2.7)

in which the vector b(x) of basis functions is defined as

b(x) = (G + λI)^{−1} g(x)   (2.8)
and now depends on all the data points and on the regularization parameter λ. The representation 2.7 of the solution of the approximation problem is known as the dual of equation 2.6, and the basis functions b_i(x) are called the equivalent kernels, because of the similarity between equation 2.7 and the kernel smoothing technique that we will define in Section 2.2 (Silverman 1984; Hardle 1990; Hastie and Tibshirani 1990). While in equation 2.6 the "difficult" part is the computation of the vector of coefficients c, the set of basis functions g(x) being easily built, in equation 2.7 the "difficult" part is the computation of the basis functions b(x), the coefficients of the expansion being explicitly given by the y_i. Notice that b(x) depends on the distribution of the data in the input space and that the kernels b_i(x), unlike the kernels G(x − x_i), are not translated replicas of the same kernel. Notice also that, as shown in Appendix B, a dual representation of the form 2.7 exists for all the approximation schemes that consist of linear superpositions of arbitrary numbers of basis functions, as long as the error criterion that is used to determine the parameters of the approximation is quadratic. The dual representation provides an intuitive way of looking at the approximation scheme 2.3: the value of the approximating function at an
evaluation point x is explicitly expressed as a weighted sum of the values y_i of the function at the examples x_i. This concept is not new in approximation theory, and has been used, for example, in the theory of quasi-interpolation. The case in which the data points {x_i} coincide with the multi-integers Z^d, where Z is the set of integer numbers, has been extensively studied in the literature, and it is also known as Schoenberg's approximation (Schoenberg 1946a, 1969; Rabut 1991, 1992; Madych and Nelson 1990a; Jackson 1988; de Boor 1990; Buhmann 1990, 1991; Dyn et al. 1989). In this case, an approximation f* to a function f is sought of the form

f*(x) = Σ_{j ∈ Z^d} f(j) ψ(x − j)   (2.9)

where ψ is some fast-decaying function that is a linear combination of radial basis functions. The approximation scheme 2.9 is therefore a linear superposition of radial basis functions in which the functions ψ(x − j) play the role of equivalent kernels. Quasi-interpolation is interesting because it could provide good approximation without the need of solving complex minimization problems or solving large linear systems. For a discussion of such noniterative training algorithms see Mhaskar (1993b) and references therein. Although difficult to prove rigorously, we can expect the kernels b_i(x) to decrease with the distance of the data points x_i from the evaluation point, so that only the neighboring points affect the estimate of the function at x, providing therefore a "local" approximation scheme. Even if the original basis function G is not "local," like the multiquadric G(x) = √(||x||² + σ²), the basis functions b_i(x) are bell-shaped, local functions, whose locality will depend on the choice of the basis function G, on the density of data points, and on the regularization parameter λ. This shows that apparently "global" approximation schemes can be regarded as local, memory-based techniques (see equation 2.7) (Mhaskar 1993b).
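The quasi-interpolation idea can be illustrated numerically. A minimal sketch, under the simplifying assumption that ψ is a single gaussian bump (rather than the combinations of radial basis functions discussed in the text): on the integer grid, the scheme f*(x) = Σ_j f(j) ψ(x − j) reproduces low-degree polynomials almost exactly, with no linear system to solve.

```python
import math

def quasi_interpolate(f, x, half_width=20):
    """Schoenberg-style quasi-interpolation on the integer grid:
    f*(x) = sum_j f(j) * psi(x - j), with a gaussian psi (an
    illustrative simplification of the kernels in the text)."""
    psi = lambda t: math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    return sum(f(j) * psi(x - j) for j in range(-half_width, half_width + 1))

# affine functions are reproduced almost exactly: by Poisson summation
# the residual is of order exp(-2*pi^2), i.e. about 1e-8
f = lambda t: 2.0 * t + 1.0
assert abs(quasi_interpolate(f, 0.3) - f(0.3)) < 1e-6
```

The absence of any fitting step is the point: the function values at the grid points serve directly as expansion coefficients, exactly as in the dual representation 2.7.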
It should be noted, however, that these techniques do not have the highest possible degree of locality, since the parameter that controls the locality is the regularization parameter λ, which is the same for all the kernels. It is possible to devise even more local techniques, in which each kernel has a parameter that controls its locality (Bottou and Vapnik 1992; Vapnik, personal communication). When the data are equally spaced on an infinite grid, we expect the basis functions b_i(x) to become translation invariant, and therefore the dual representation 2.7 becomes a convolution filter. For a study of the properties of these filters in the case of one-dimensional cubic splines see the work of Silverman (1984), who gives explicit results for the shape of the equivalent kernel. Let us consider some simple experiments that show the shape of the equivalent kernels in specific situations. We first considered a data set composed of 36 equally spaced points on the domain [0,1] × [0,1], at the nodes of a regular grid with spacing equal to 0.2. We use the multiquadric
Figure 1: (a) The multiquadric function. (b) An equivalent kernel for the multiquadric basis function in the case of two-dimensional equally spaced data. (c, d, e) The equivalent kernels b3, b5, and b6, for nonuniform one-dimensional multiquadric interpolation (see text for explanation).
-/,
basis functions G(x) = where (T has been set to 0.2. Figure l a shows the original multiquadric function, and Figure 1b the equivalent kernel b16, in the case of A = 0, where, according to definition 2.8 36
b,(x) = c(G-~),,G(x - x,) ,=I
All the other kernels, except those close to the border, are very similar, since the data are equally spaced and translation invariance holds approximately. Consider now a one-dimensional example with a multiquadric basis function:

G(x) = √(x² + σ²)
Regularization Theory and Neural Networks
227
The data set was chosen to be a nonuniform sampling of the interval [0,1], that is, the set {0.0, 0.1, 0.2, 0.3, 0.4, 0.7, 1.0}
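These equivalent kernels can be computed directly from b_i(x) = Σ_j (G^{-1})_{ij} G(x − x_j). A minimal sketch is given below; the value σ = 0.2 is an assumption carried over from the two-dimensional experiment. It also checks the cardinal property b_i(x_k) = δ_ik, which holds for pure interpolation (λ = 0):

```python
import numpy as np

# nonuniform 1-D data set from the text; sigma = 0.2 is an assumption
x_pts = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.7, 1.0])
sigma = 0.2

def G(r):
    # one-dimensional multiquadric basis function
    return np.sqrt(r**2 + sigma**2)

Gmat = G(x_pts[:, None] - x_pts[None, :])   # interpolation matrix G_ij = G(x_i - x_j)
Ginv = np.linalg.inv(Gmat)

def b(i, x):
    # equivalent kernel b_i(x) = sum_j (G^-1)_ij G(x - x_j), lambda = 0
    return sum(Ginv[i, j] * G(x - x_pts[j]) for j in range(len(x_pts)))
```

Plotting b(3, ·), b(5, ·), and b(6, ·) on a fine grid should reproduce the bell shapes of Figure 1c–e.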
In Figure 1c, d, and e we have drawn, respectively, the equivalent kernels b3, b5, and b6, under the same definitions. Notice that all of them are bell-shaped, although the original basis function is an increasing, cup-shaped function. Notice, moreover, that the shape of the equivalent kernels changes from b3 to b6, becoming broader in moving from a high to a low sample density region. This phenomenon has been shown by Silverman (1984) for cubic splines, but we expect it to appear in much more general cases. The connection between regularization theory and the dual representation 2.7 becomes clear in the special case of "continuous" data, for which the regularization functional has the form

H[f] = ∫ dx [y(x) − f(x)]² + λ φ[f]   (2.10)
where y(x) is the function to be approximated. This functional can be intuitively seen as the limit of the functional 2.1 when the number of data points goes to infinity and their spacing is uniform. It is easily seen that, when the stabilizer φ[f] is of the form 2.2, the solution of the regularization functional 2.10 is
f(x) = y(x) * B(x)   (2.11)

where B(x) is the Fourier transform of

B̃(s) = G̃(s) / (λ + G̃(s))

[see Poggio et al. (1988) for some examples of B(x)]. The solution 2.11 is therefore a filtered version of the original function y(x) and, consistently with the results of Silverman (1984), has the form 2.7, where the equivalent kernels are translates of the function B(x) defined above. Notice the effect of the regularization parameter: for λ = 0 the equivalent kernel B(x) is a Dirac delta function, and f(x) = y(x) (no noise), while for λ → ∞ we have B(x) = G(x)/λ and f = (G/λ) * y (a low-pass filter). The dual representation is illuminating and especially interesting for the case of a multi-output network (approximating a vector field) that is discussed in Appendix B.
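The filtering interpretation can be sketched numerically on a periodic grid, assuming the equivalent-kernel filter has Fourier transform G̃(s)/(λ + G̃(s)), a form consistent with both limits quoted above; the particular G̃, target function, and λ values are illustrative:

```python
import numpy as np

# sketch of eq. 2.11 on a periodic grid: f = y * B, with the assumed filter
# B_hat(s) = G_hat(s) / (lam + G_hat(s)); grid, target, and G_hat are illustrative
n = 256
x = np.linspace(0, 1, n, endpoint=False)
y = np.sign(np.sin(2 * np.pi * x))      # target function with sharp jumps
s = np.fft.fftfreq(n, 1.0 / n)          # integer frequencies of the grid
G_hat = np.exp(-s**2 / 200.0)           # a fast-decaying G tilde (assumption)

def smooth(lam):
    B_hat = G_hat / (lam + G_hat)       # equivalent-kernel filter
    return np.real(np.fft.ifft(np.fft.fft(y) * B_hat))

f0 = smooth(0.0)     # lambda = 0: B is a delta, f = y (no smoothing)
f1 = smooth(1e-3)    # lambda > 0: low-pass filtered version of y
```

With λ = 0 the filter is identically 1 and the data are reproduced exactly; with λ > 0 the high frequencies of the square wave are attenuated.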
(Moody and Darken 1988, 1989). A normalized radial basis function expansion is a function of the form

f(x) = Σ_{i=1}^N c_i G(x − x_i) / Σ_{i=1}^N G(x − x_i)   (2.12)

The only difference between equation 2.12 and radial basis functions is the normalization factor in the denominator, which is an estimate of the probability distribution of the data. A discussion about the relation between normalized gaussian basis function networks, gaussian mixtures, and gaussian mixture classifiers can be found in the work of Tresp et al. (1993). In the rest of this section we show that a particular version of this approximation scheme has again a tight connection to regularization theory. Let P(x, y) be the joint probability of inputs and outputs of the network, and let us assume that we have a sample of N pairs {(x_i, y_i)}_{i=1}^N randomly drawn according to P. Our goal is to build an estimator (a network) f that minimizes the expected risk

I[f] = ∫ dx dy P(x, y) [y − f(x)]²   (2.13)

This cannot be done, since the probability P is unknown, and usually the empirical risk

I_emp[f] = (1/N) Σ_{i=1}^N [y_i − f(x_i)]²   (2.14)

is minimized instead. An alternative consists in obtaining an approximation of the probability P(x, y) first, and then in minimizing the expected risk. If this option is chosen, one could use the regularization approach to probability estimation (Vapnik and Stefanyuk 1978; Aidu and Vapnik 1989; Vapnik 1982) that leads to the well-known technique of Parzen windows. A Parzen window estimator P* for the probability distribution of a set of data {z_i}_{i=1}^N has the form

P*(z) = (1/N) Σ_{i=1}^N Φ((z − z_i)/h)   (2.15)

where Φ is an appropriate kernel, for example a gaussian, whose L1 norm is 1, and where h is a positive parameter that, for simplicity, we set to 1 from now on. If the joint probability P(x, y) in the expected risk 2.13 is approximated with a Parzen window estimator P*, we obtain an approximated expression for the expected risk, I*[f], that can be explicitly minimized. In order to show how this can be done, we notice that we need to approximate the probability distribution P(x, y), and therefore
the random variable z of equation 2.15 is z = (x, y). Hence, we choose a kernel of the following form:¹
Φ(z) = K(‖x‖) K(y)

where K is a standard one-dimensional, symmetric kernel, like the gaussian. The Parzen window estimator of P(x, y) is therefore

P*(x, y) = (1/N) Σ_{i=1}^N K(‖x − x_i‖) K(y − y_i)   (2.16)

An approximation to the expected risk is therefore obtained as

I*[f] = ∫ dx dy P*(x, y) [y − f(x)]²
In order to find an analytical expression for the minimum of I*[f] we impose the stationarity constraint:

δI*[f]/δf = 0

that leads to the following equation:

Σ_{i=1}^N K(‖x − x_i‖) ∫ dy K(y − y_i) [y − f(x)] = 0

Performing the integral over y in the term containing f(x), and using the fact that ‖K‖_{L1} = 1, we obtain

f(x) Σ_{i=1}^N K(‖x − x_i‖) = Σ_{i=1}^N K(‖x − x_i‖) ∫ dy y K(y − y_i)
Performing a change of variable in the integral of the previous expression and using the fact that the kernel K is symmetric, we finally conclude that the function that minimizes the approximated expected risk is

f(x) = Σ_{i=1}^N y_i K(‖x − x_i‖) / Σ_{i=1}^N K(‖x − x_i‖)   (2.17)

The right-hand side of the equation converges to f when the number of examples goes to infinity, provided that the scale factor h tends to zero at an appropriate rate. This form of approximation is known as kernel regression, or the Nadaraya-Watson estimator, and it has been the subject of extensive study in the statistics community (Nadaraya 1964; Watson 1964; Rosenblatt 1971; Priestley and Chao 1972; Gasser and Müller 1985; Devroye and Wagner 1980). A similar derivation of equation 2.17 has been given by Specht (1991), but we should remark that this equation

¹ Any kernel of the form Φ(z) = K(x, y) in which the function K is even in each of the variables x and y would lead to the same conclusions that we obtain for this choice.
is usually derived in a different way, within the framework of locally weighted regression, assuming a locally constant model (Härdle 1990) with a local weight function K. Notice that this equation has the form of equation 2.12, in which the centers coincide with the examples, and the coefficients c_i are simply the values y_i of the function at the data points x_i. On the other hand, the equation is an estimate of f that is linear in the observations y_i and has therefore also the general form of equation 2.7. The Parzen window estimator, and therefore expression 2.17, can be derived in the framework of regularization theory (Vapnik and Stefanyuk 1978; Aidu and Vapnik 1989; Vapnik 1982) under a smoothness assumption on the probability distribution that has to be estimated. This means that in order to derive equation 2.17, a smoothness assumption has to be made on the joint probability distribution P(x, y), rather than on the regression function as in equation 2.2.

3 Classes of Stabilizers
In the previous section we considered the class of stabilizers of the form

φ[f] = ∫_{R^d} ds |f̃(s)|² / G̃(s)   (3.1)
and we have seen that the solution of the minimization problem always has the same form. In this section we discuss three different types of stabilizers belonging to the class 3.1, corresponding to different properties of the basis functions G. Each of them corresponds to different a priori assumptions on the smoothness of the function that must be approximated. 3.1 Radial Stabilizers. Most of the commonly used stabilizers have radial symmetry, that is, they satisfy the following equation:
φ[f(x)] = φ[f(Rx)]

for any rotation matrix R. This choice reflects the a priori assumption that all the variables have the same relevance, and that there are no privileged directions. Rotation invariant stabilizers correspond to radial basis functions G(‖x‖). Much attention has been dedicated to this case, and the corresponding approximation technique is known as radial basis functions (Powell 1987, 1990; Franke 1982, 1987; Micchelli 1986; Kansa 1990a,b; Madych and Nelson 1990a; Dyn 1987, 1991; Hardy 1971, 1990; Buhmann 1990; Lancaster and Salkauskas 1986; Broomhead and Lowe 1988; Moody and Darken 1988, 1989; Poggio and Girosi 1990; Girosi 1992). The class of admissible radial basis functions is the class of conditionally positive definite functions (Micchelli 1986) of any order, since it has been shown
Regularization Theory and Neural Networks
231
(Madych and Nelson 1990a; Dyn 1991) that in this case the functional of equation 3.1 is a seminorm, and the associated variational problem is well defined. All the radial basis functions can therefore be derived in this framework. We explicitly give two important examples.
3.1.1 Duchon Multidimensional Splines. Duchon (1977) considered measures of smoothness of the form

φ[f] = ∫_{R^d} ds ‖s‖^{2m} |f̃(s)|²

In this case G̃(s) = 1/‖s‖^{2m} and the corresponding basis function is therefore

G(x) = ‖x‖^{2m−d} ln ‖x‖   if 2m > d and d is even
G(x) = ‖x‖^{2m−d}          otherwise                      (3.2)
In this case the null space of φ[f] is the vector space of polynomials of degree at most m − 1 in d variables, whose dimension is

k = ( d + m − 1 choose d )

These basis functions are radial and conditionally positive definite, so that they represent just particular instances of the well known radial basis functions technique (Micchelli 1986; Wahba 1990). In two dimensions, for m = 2, equation 3.2 yields the so-called "thin plate" basis function G(x) = ‖x‖² ln ‖x‖ (Harder and Desmarais 1972; Grimson 1982).

3.1.2 The Gaussian. A stabilizer of the form

φ[f] = ∫_{R^d} ds e^{‖s‖²/β} |f̃(s)|²

where β is a fixed positive parameter, has G̃(s) = e^{−‖s‖²/β} and as basis function the gaussian function (Poggio and Girosi 1989; Yuille and Grzywacz 1988). The gaussian function is positive definite, and it is well known from the theory of reproducing kernels (Aronszajn 1950) that positive definite functions (Stewart 1976) can be used to define norms of the type 3.1. Since φ[f] is a norm, its null space contains only the zero element, and the additional null space terms of equation 2.3 are not needed, unlike in Duchon splines. A disadvantage of the gaussian is the appearance of the scaling parameter β, while Duchon splines, being homogeneous functions, do not depend on any scaling parameter. However, it is possible to devise good heuristics that furnish suboptimal, but still good, values of β, or good starting points for cross-validation procedures.
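The positive definiteness of the gaussian can be illustrated numerically (this is an illustration, not a proof): the Gram matrix built on any set of distinct points should have strictly positive eigenvalues, which is what makes φ[f] a norm with an empty null space. The point set and β below are arbitrary assumptions:

```python
import numpy as np

# gaussian Gram matrix on a set of distinct random points; beta and the
# points are arbitrary choices for this sketch
rng = np.random.default_rng(2)
X = rng.standard_normal((20, 3))                    # 20 points in R^3
beta = 0.5
D2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)   # squared pairwise distances
K = np.exp(-beta * D2)                              # K_ij = G(||x_i - x_j||)
min_eig = np.linalg.eigvalsh(K).min()               # should be strictly positive
```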
3.1.3 Other Basis Functions. Here we give a list of other functions that can be used as basis functions in the radial basis functions technique, and that are therefore associated with the minimization of some functional. In the following, we indicate as "p.d." the positive definite functions, which do not need any polynomial term in the solution, and as "c.p.d. k" the conditionally positive definite functions of order k, which need a polynomial of degree k in the solution. It is a well known fact that positive definite functions tend to zero at infinity, whereas conditionally positive definite functions tend to infinity.
G(r) = e^{−βr²}          gaussian, p.d.

G(r) = √(r² + c²)        multiquadric, c.p.d. 1

G(r) = 1/√(r² + c²)      inverse multiquadric, p.d.

G(r) = r^{2n+1}          thin plate splines, c.p.d. n

G(r) = r^{2n} ln r       thin plate splines, c.p.d. n
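The asymptotic behavior noted above can be checked directly; β, c, and n below are arbitrary positive values (c is the usual multiquadric offset, an assumption, since the scan drops the constants):

```python
import math

# the basis functions from the list above; beta, c, n are arbitrary
beta, c, n = 1.0, 1.0, 2

def gaussian(r):      return math.exp(-beta * r**2)        # p.d.
def multiquadric(r):  return math.sqrt(r**2 + c**2)        # c.p.d. 1
def inv_mq(r):        return 1.0 / math.sqrt(r**2 + c**2)  # p.d.
def spline_odd(r):    return r**(2*n + 1)                  # c.p.d. n
def spline_log(r):    return r**(2*n) * math.log(r)        # c.p.d. n

# positive definite basis functions tend to zero at infinity,
# conditionally positive definite ones tend to infinity
R = 1e3
vanishing = [gaussian(R), inv_mq(R)]
growing = [multiquadric(R), spline_odd(R), spline_log(R)]
```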
3.2 Tensor Product Stabilizers. An alternative to choosing a radial function G̃(s) in the stabilizer 3.1 is a tensor product type of basis function, that is, a function of the form

G̃(s) = ∏_{j=1}^d g̃(s_j)   (3.3)

where s_j is the jth coordinate of the vector s, and g̃ is an appropriate one-dimensional function. When g̃ is positive definite the functional φ[f] is clearly a norm and its null space is empty. In the case of a conditionally positive definite function the structure of the null space can be more complicated and we do not consider it here. Stabilizers with G̃(s) as in equation 3.3 have the form

φ[f] = ∫_{R^d} ds |f̃(s)|² / ∏_{j=1}^d g̃(s_j)

which leads to a tensor product basis function

G(x) = ∏_{j=1}^d g(x_j)

where x_j is the jth coordinate of the vector x and g(x) is the Fourier transform of g̃(s). An interesting example is the one corresponding to the choice

g̃(s) = 1/(1 + s²)

whose Fourier transform is, up to a constant, g(x) = e^{−|x|}, so that G(x) = e^{−‖x‖_{L1}}.
This basis function is interesting from the point of view of VLSI implementations, because it requires the computation of the L1 norm of the input vector x, which is usually easier to compute than the Euclidean norm L2. However, this basis function is not very smooth, and its performance in practical cases should first be tested experimentally. Notice that if the approximation is needed for computing derivatives, smoothness of an appropriate degree is clearly a necessary requirement (see Poggio et al. 1988). We notice that the choice

g̃(s) = e^{−s²}

leads again to the gaussian basis function G(x) = e^{−‖x‖²}.
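Since the Fourier transform of 1/(1 + s²) is, up to constants, e^{−|x|}, the tensor product basis function factors into the L1-norm exponential. A quick check of that factorization (the test point is arbitrary):

```python
import math

# with one-dimensional factors g(x_j) = exp(-|x_j|), the tensor product
# basis function equals exp(-||x||_L1); the test point is arbitrary
x = [0.3, -1.2, 0.5]

product = 1.0
for xj in x:
    product *= math.exp(-abs(xj))                 # product of 1-D factors

l1_form = math.exp(-sum(abs(xj) for xj in x))     # exp(-||x||_L1)
```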
3.3 Additive Stabilizers. We have seen in the previous section how some tensor product approximation schemes can be derived in the framework of regularization theory. We now will see that it is also possible to derive the class of additive approximation schemes in the same framework, where by additive approximation we mean an approximation of the form

f(x) = Σ_{μ=1}^d f_μ(x^μ)   (3.4)
where x^μ is the μth component of the input vector x and the f_μ are one-dimensional functions that will be defined as the additive components of f (from now on Greek letter indices will be used in association with components of the input vectors). Additive models are well known in statistics (Hastie and Tibshirani 1986, 1987, 1990; Stone 1985; Wahba 1990; Buja et al. 1989) and can be considered as a generalization of linear models. They are appealing because, being essentially a superposition of one-dimensional functions, they have a low complexity, and they share with linear models the feature that the effects of the different variables can be examined separately. The simplest way to obtain such an approximation scheme is to choose, if possible, a stabilizer that corresponds to an additive basis function:

G(x) = Σ_{μ=1}^d θ_μ g(x^μ)   (3.5)

where the θ_μ are certain fixed parameters and g is a one-dimensional basis function. Such a choice would lead to an approximation scheme of the form 3.4 in which the additive components f_μ have the form

f_μ(x^μ) = θ_μ Σ_{i=1}^N c_i g(x^μ − x_i^μ)   (3.6)
Notice that the additive components are not independent at this stage, since there is only one set of coefficients c_i. We postpone the discussion of this point to Section 4.2. We would like then to write stabilizers corresponding to the basis function 3.5 in the form 3.1, where G̃(s) is the Fourier transform of G(x). We notice that the Fourier transform of an additive function like the one in equation 3.5 exists only in the generalized sense (Gelfand and Shilov 1964), involving the δ distribution. For example, in two dimensions we obtain

G̃(s) = θ_x g̃(s_x) δ(s_y) + θ_y g̃(s_y) δ(s_x)   (3.7)

and the interpretation of the reciprocal of this expression is delicate. However, almost additive basis functions can be obtained if we approximate the delta functions in equation 3.7 with gaussians of very small variance. Consider, for example in two dimensions, the stabilizer

φ[f] = ∫_{R²} ds |f̃(s)|² / [θ_x g̃(s_x) e^{−s_y²/ε²} + θ_y g̃(s_y) e^{−s_x²/ε²}]   (3.8)

This corresponds to a basis function of the form

G(x, y) = θ_x g(x) e^{−ε²y²} + θ_y g(y) e^{−ε²x²}   (3.9)
In the limit of ε going to zero the denominator in expression 3.8 approaches equation 3.7, and the basis function 3.9 approaches a basis function that is the sum of one-dimensional basis functions. In this paper we do not discuss this limit process in a rigorous way. Instead we outline another way to obtain additive approximations in the framework of regularization theory. Let us assume that we know a priori that the function f that we want to approximate is additive, that is

f(x) = Σ_{μ=1}^d f_μ(x^μ)
We then apply the regularization approach and impose a smoothness constraint, not on the function f as a whole, but on each single additive component, through a regularization functional of the form (Wahba 1990; Hastie and Tibshirani 1990):

H[f] = Σ_{i=1}^N [y_i − f(x_i)]² + λ Σ_{μ=1}^d (1/θ_μ) φ[f_μ]

where the θ_μ are given positive parameters that allow us to impose different degrees of smoothness on the different additive components. The minimizer of this functional is found with the same technique described in Appendix A, and, skipping null space terms, it has the usual form

f(x) = Σ_{i=1}^N c_i G(x − x_i)   (3.10)

where

G(x − x_i) = Σ_{μ=1}^d θ_μ g(x^μ − x_i^μ)

as in equation 3.5. We notice that the additive component of equation 3.10 can be written as

f_μ(x^μ) = Σ_{i=1}^N c_i^μ g(x^μ − x_i^μ)

where we have defined
c_i^μ = c_i θ_μ

The additive components are therefore not independent because the parameters θ_μ are fixed. If the θ_μ were free parameters, the coefficients c_i^μ would be independent, as well as the additive components. Notice that the two ways we have outlined for deriving additive approximation from regularization theory are equivalent. They both start from a priori assumptions of additivity and smoothness of the class of functions to be approximated. In the first technique the two assumptions are woven together in the choice of the stabilizer (equation 3.8); in the second they are made explicit and exploited sequentially.

4 Extensions: From Regularization Networks to Generalized Regularization Networks
In this section we will first review some extensions of regularization networks, and then will apply them to radial basis functions and to additive splines. A fundamental problem in almost all practical applications in learning and pattern recognition is the choice of the relevant input variables. It may happen that some of the variables are more relevant than others, that some variables are just totally irrelevant, or that the relevant variables are linear combinations of the original ones. It can therefore be useful to work not with the original set of variables x, but with a linear transformation of them, Wx, where W is a possibly rectangular matrix. In the framework of regularization theory, this can be taken into account by making the assumption that the approximating function f has the form f(x) = F(Wx) for some smooth function F. The smoothness assumption is now made
directly on F, through a smoothness functional φ[F] of the form 3.1. The regularization functional is expressed in terms of F as

H[F] = Σ_{i=1}^N [y_i − F(z_i)]² + λ φ[F]

where z_i = W x_i. The function that minimizes this functional is clearly, according to the results of Section 2, of the form

F(z) = Σ_{i=1}^N c_i G(z − z_i)

(plus eventually a polynomial in z). Therefore the solution for f is

f(x) = F(Wx) = Σ_{i=1}^N c_i G(Wx − W x_i)   (4.1)
This argument is rigorous for given and known W, as in the case of classical radial basis functions. Usually the matrix W is unknown, and it must be estimated from the examples. Estimating both the coefficients c_i and the matrix W by least squares is usually not a good idea, since we would end up trying to estimate a number of parameters that is larger than the number of data points (though one may use regularized least squares). Therefore, it has been proposed (Moody and Darken 1988, 1989; Broomhead and Lowe 1988; Poggio and Girosi 1989, 1990a) that the approximation scheme of equation 4.1 be replaced with a similar one, in which the basic shape of the approximation scheme is retained, but the number of basis functions is decreased. The resulting approximating function, which we call the Generalized Regularization Network (GRN), is

f(x) = Σ_{α=1}^n c_α G(Wx − W t_α)   (4.2)

where n < N and the centers t_α are chosen according to some heuristic, or are considered as free parameters (Moody and Darken 1988, 1989; Poggio and Girosi 1989, 1990a). The coefficients c_α, the elements of the matrix W, and eventually the centers t_α, are estimated according to a least squares criterion. The elements of the matrix W could also be estimated through cross-validation (Allen 1974; Wahba and Wold 1975; Golub et al. 1979; Craven and Wahba 1979; Utreras 1979; Wahba 1985), which may be a formally more appropriate technique. In the special case in which the matrix W and the centers are kept fixed, the resulting technique is one originally proposed by Broomhead and Lowe (1988), and the coefficients satisfy the following linear equation

G^T G c = G^T y
where we have defined the following vectors and matrices:

(y)_i = y_i,   (c)_α = c_α,   (G)_{iα} = G(W x_i − W t_α)
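A minimal numerical sketch of this fixed-W, fixed-centers technique in one dimension (so W is the identity): gaussian basis functions, centers on a grid, and coefficients from the normal equations G^T G c = G^T y. The target function, sample sizes, and width σ are illustrative assumptions:

```python
import numpy as np

# toy regression problem: y = sin(2*pi*x) sampled at N points (assumption)
rng = np.random.default_rng(0)
N, n = 50, 10
X = rng.uniform(0, 1, N)
y = np.sin(2 * np.pi * X)

t = np.linspace(0, 1, n)       # fixed centers t_alpha
sigma = 0.1                    # assumed gaussian width
G = np.exp(-(X[:, None] - t[None, :])**2 / (2 * sigma**2))   # (G)_{i alpha}

c = np.linalg.solve(G.T @ G, G.T @ y)     # normal equations G^T G c = G^T y
mse = float(np.mean((G @ c - y)**2))      # training error of the expansion
```

With n = 10 centers for N = 50 data points the expansion reproduces the training data closely while using far fewer basis functions than examples.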
This technique, which has become quite common in the neural network community, has the advantage of retaining the form of the regularization solution, while being less complex to compute. A complete theoretical analysis has not yet been given, but some results, in the case in which the matrix W is set to identity, are already available (Sivakumar and Ward 1991; Poggio and Girosi 1989). The next sections discuss approximation schemes of the form 4.2 in the cases of radial and additive basis functions.

4.1 Extensions of Radial Basis Functions. In the case in which the basis function is radial, the approximation scheme of equation 4.2 becomes

f(x) = Σ_{α=1}^n c_α G(‖x − t_α‖_W)

where we have defined the weighted norm

‖x‖²_W = x^T W^T W x   (4.3)
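A quick numerical check that the weighted norm depends on W only through the product W^T W, so that W may be taken upper triangular via the Cholesky decomposition: replacing W by RW for any rotation R leaves the norm unchanged, since (RW)^T(RW) = W^T W. The matrices and test vector are arbitrary:

```python
import numpy as np

# ||x||_W^2 = x^T W^T W x is invariant under W -> R W for any rotation R
rng = np.random.default_rng(3)
W = rng.standard_normal((2, 2))
x = rng.standard_normal(2)

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation matrix

n1 = x @ W.T @ W @ x                # weighted norm with W
n2 = x @ (R @ W).T @ (R @ W) @ x    # weighted norm with R W
```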
The basis functions of equation 4.2 are not radial any more, or, more precisely, they are radial in the metric defined by equation 4.3. This means that the level curves of the basis functions are not circles, but ellipses, whose axes need not be aligned with the coordinate axes. Notice that in this case what is important is not the matrix W itself, but rather the symmetric matrix W^T W. Therefore, by the Cholesky decomposition, it is sufficient to consider W to be upper triangular. The optimal center locations t_α satisfy the following set of nonlinear equations (Poggio and Girosi 1990a,b):

t_α = Σ_i P_i^α x_i / Σ_i P_i^α,   α = 1, …, n   (4.4)

where the P_i^α are coefficients that depend on all the parameters of the network and are not necessarily positive. The optimal centers are then a weighted sum of the example points. Thus in some cases it may be more efficient to "move" the coefficients P_i^α rather than the components of t_α (for instance when the dimensionality of the inputs is high relative to the number of data points). The approximation scheme defined by equation 4.2 has been discussed in detail in Poggio and Girosi (1990a) and Girosi (1992), so we will not discuss it further. In the next section we will consider its analogue in the case of additive basis functions.
4.2 Extensions of Additive Splines. In the previous sections we have seen an extension of the classical regularization technique. In this section we derive the form that this extension takes when applied to additive splines. The resulting scheme is very similar to projection pursuit regression (Friedman and Stuetzle 1981; Huber 1985; Diaconis and Freedman 1984; Donoho and Johnstone 1989; Moody and Yarvin 1991). We start from the "classical" additive spline, derived from regularization in Section 3.3:

f(x) = Σ_{i=1}^N c_i Σ_{μ=1}^d θ_μ g(x^μ − x_i^μ)   (4.5)

In this scheme the smoothing parameters θ_μ should be known, or can be estimated by cross-validation. An alternative to cross-validation is to consider the parameters θ_μ as free parameters, and estimate them with a least squares technique together with the coefficients c_i. If the parameters θ_μ are free, the approximation scheme of equation 4.5 becomes the following:

f(x) = Σ_{i=1}^N Σ_{μ=1}^d c_i^μ g(x^μ − x_i^μ)
where the coefficients c_i^μ are now independent. Of course, now we must estimate N × d coefficients instead of just N, and we are likely to encounter an overfitting problem. We then adopt the same idea presented in Section 4, and consider an approximation scheme of the form

f(x) = Σ_{α=1}^n Σ_{μ=1}^d c_α^μ g(x^μ − t_α^μ)   (4.6)

in which the number of centers is smaller than the number of examples, reducing the number of coefficients that must be estimated. We notice that equation 4.6 can be written as

f(x) = Σ_{μ=1}^d f_μ(x^μ)

where each additive component has the form

f_μ(x^μ) = Σ_{α=1}^n c_α^μ g(x^μ − t_α^μ)

Therefore another advantage of this technique is that the additive components are now independent, each of them being a one-dimensional radial basis function expansion. We can now use the same argument from Section 4 to introduce a linear transformation of the inputs x → Wx, where W is a d′ × d matrix.
Calling w_μ the μth row of W, and performing the substitution x → Wx in equation 4.6, we obtain

f(x) = Σ_{α=1}^n Σ_{μ=1}^{d′} c_α^μ g(w_μ · x − t_α^μ)   (4.7)

We now define the following one-dimensional function:

h_μ(y) = Σ_{α=1}^n c_α^μ g(y − t_α^μ)

and rewrite the approximation scheme of equation 4.7 as

f(x) = Σ_{μ=1}^{d′} h_μ(w_μ · x)   (4.8)
Notice the similarity between equation 4.8 and the projection pursuit regression technique: in both schemes the unknown function is approximated by a linear superposition of one-dimensional functions, whose arguments are projections of the original variables on certain vectors that have been estimated. In projection pursuit regression the choice of the functions h_μ(y) is left to the user. In our case the h_μ are one-dimensional radial basis function expansions, for example, cubic splines, or gaussians. The choice depends, strictly speaking, on the specific prior, that is, on the specific smoothness assumptions made by the user. Interestingly, in many applications of projection pursuit regression the functions h_μ have indeed been chosen to be cubic splines, but other choices are flexible Fourier series, rational approximations, and orthogonal polynomials (see Moody and Yarvin 1991). Let us briefly review the steps that bring us from the classical additive approximation scheme of equation 3.6 to a projection pursuit regression-like type of approximation:

1. The regularization parameters θ_μ of the classical approximation scheme 3.6 are considered as free parameters.

2. The number of centers is chosen to be smaller than the number of data points.

3. The true relevant variables are assumed to be some unknown linear combination of the original variables.

We notice that in the extreme case in which each additive component has just one center (n = 1), the approximation scheme of equation 4.7 becomes

f(x) = Σ_{μ=1}^{d′} c^μ g(w_μ · x − t^μ)   (4.9)
When the basis function g is a gaussian we call (somewhat improperly) a network of this type a gaussian multilayer perceptron (MLP) network, because if g were a threshold or sigmoidal function this would be a multilayer perceptron with one layer of hidden units. The sigmoidal function, typically used instead of the threshold, cannot be derived directly from regularization theory because it is not symmetric, but we will see in Section 6 the relationship between a sigmoidal function and the absolute value function, which is a basis function that can be derived from regularization. There are a number of computational issues related to how to find the parameters of an approximation scheme like the one of equation 4.7, but we do not discuss them here. We present instead, in Section 7, some experimental results, and will describe the algorithm used to obtain them.

5 The Bayesian Interpretation of Generalized Regularization Networks
It is well known that a variational principle such as equation 2.1 can be derived not only in the context of functional analysis (Tikhonov and Arsenin 1977), but also in a probabilistic framework (Kimeldorf and Wahba 1971; Wahba 1980, 1990; Poggio et al. 1985; Marroquin et al. 1987; Bertero et al. 1988). In this section we illustrate this connection informally, without addressing the related mathematical issues. Suppose that the set g = {(x_i, y_i) ∈ R^d × R}_{i=1}^N of data has been obtained by random sampling a function f, defined on R^d, in the presence of noise, that is

f(x_i) = y_i + ε_i,   i = 1, …, N   (5.1)
where the ε_i are random independent variables with a given distribution. We are interested in recovering the function f, or an estimate of it, from the set of data g. We take a probabilistic approach, and regard the function f as the realization of a random field with a known prior probability distribution. Let us define:
- P[f | g] as the conditional probability of the function f given the examples g.
- P[g | f] as the conditional probability of g given f. If the function underlying the data is f, this is the probability that by random sampling the function f at the sites {x_i}_{i=1}^N the set of measurements {y_i}_{i=1}^N is obtained. This is therefore a model of the noise.

- P[f] as the a priori probability of the random field f. This embodies our a priori knowledge of the function, and can be used to impose constraints on the model, assigning significant probability only to those functions that satisfy those constraints.
Assuming that the probability distributions P[g | f] and P[f] are known, the posterior distribution P[f | g] can now be computed by applying the Bayes rule:

P[f | g] ∝ P[g | f] P[f]   (5.2)

We now make the assumption that the noise variables in equation 5.1 are normally distributed. Therefore the probability P[g | f] can be written as

P[g | f] ∝ e^{−(1/2σ²) Σ_{i=1}^N (y_i − f(x_i))²}

where σ² is the variance of the noise. The model for the prior probability distribution P[f] is chosen in analogy with the discrete case (when the function f is defined on a finite subset of an n-dimensional lattice) for which the problem can be formalized (see for instance Marroquin et al. 1987). The prior probability P[f] is written as

P[f] ∝ e^{−α φ[f]}   (5.3)

where φ[f] is a smoothness functional of the type described in Section 3 and α a positive real number. This form of probability distribution gives high probability only to those functions for which the term φ[f] is small, and embodies the a priori knowledge that one has about the system. Following the Bayes rule 5.2 the a posteriori probability of f is written as

P[f | g] ∝ e^{−[(1/2σ²) Σ_{i=1}^N (y_i − f(x_i))² + α φ[f]]}   (5.4)

One simple estimate of the function f from the probability distribution 5.4 is the so-called maximum a posteriori (MAP) estimate, which considers the function that maximizes the a posteriori probability P[f | g], and therefore minimizes the exponent in equation 5.4. The MAP estimate of f is therefore the minimizer of the following functional:

H[f] = Σ_{i=1}^N (y_i − f(x_i))² + λ φ[f]

where λ = 2σ²α. This functional is the same as that of equation 2.1, and from here it is clear that the parameter λ, which is usually called the "regularization parameter," determines the trade-off between the level of the noise and the strength of the a priori assumptions about the solution, therefore controlling the compromise between the degree of smoothness of the solution and its closeness to the data. Notice that functionals of the type 5.3 are common in statistical physics (Parisi 1988), where φ[f] plays the role of an energy functional. It is interesting to notice that, in that case, the correlation function of the physical system described by φ[f] is the basis function G(x).
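The identity λ = 2σ²α amounts to the observation that multiplying the MAP exponent by the constant 2σ² does not change its minimizer. A sketch with arbitrary numbers:

```python
# with lam = 2 * sigma^2 * alpha, the MAP exponent and the regularization
# functional differ only by a positive constant factor:
#   (1/(2 sigma^2)) * sum_sq + alpha * phi == (sum_sq + lam * phi) / (2 sigma^2)
# the numerical values below are arbitrary stand-ins
sigma2, alpha = 0.3, 2.0
lam = 2 * sigma2 * alpha
sum_sq, phi = 1.7, 0.9     # stand-ins for sum_i (y_i - f(x_i))^2 and phi[f]

lhs = sum_sq / (2 * sigma2) + alpha * phi
rhs = (sum_sq + lam * phi) / (2 * sigma2)
```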
As we have pointed out (Poggio and Girosi 1989; Rivest, personal communication), prior probabilities can also be seen as a measure of complexity, assigning high complexity to the functions with small probability. It has been proposed by Rissanen (1978) to measure the complexity of a hypothesis in terms of the bit length needed to encode it. It turns out that the MAP estimate mentioned above is closely related to the minimum description length principle: the hypothesis f, which for given g can be described in the most compact way, is chosen as the "best" hypothesis. Similar ideas have been explored by others (see for instance Solomonoff 1978). They connect data compression and coding with Bayesian inference, regularization, function approximation, and learning.

6 Additive Splines, Hinge Functions, Sigmoidal Neural Nets
In the previous sections we have shown how to extend RN to schemes that we have called GRN, which include ridge approximation schemes of the PPR type, that is

f(x) = Σ_{μ=1}^{d′} h_μ(w_μ · x)

where

h_μ(y) = Σ_{α=1}^n c_α^μ g(y − t_α^μ)
The form of the basis function g depends on the stabilizer, and a list of "admissible" G has been given in Section 3. These include the absolute value g(x) = |x|, corresponding to piecewise linear splines, and the function g(x) = |x|³, corresponding to cubic splines (used in typical implementations of PPR), as well as gaussian functions. Though it may seem natural to think that sigmoidal multilayer perceptrons may be included in this framework, it is actually impossible to derive directly from regularization principles the sigmoidal activation functions typically used in multilayer perceptrons. In the following section we show, however, that there is a close relationship between basis functions of the hinge, the sigmoid, and the gaussian type.

6.1 From Additive Splines to Ramp and Hinge Functions. We will consider here the one-dimensional case, since multidimensional additive approximations consist of one-dimensional terms. We consider the approximation with the lowest possible degree of smoothness: piecewise linear. The associated basis function g(x) = |x| is shown in Figure 2a, and the associated stabilizer is given by

φ[f] = ∫ ds s² |f̃(s)|²
Regularization Theory and Neural Networks
Figure 2: (a) Absolute value basis function |x|. (b) Sigmoidal-like basis function σ_L(x). (c) Gaussian-like basis function g_L(x).

This assumption thus leads to approximating a one-dimensional function as the linear combination with appropriate coefficients of translates of |x|. It is easy to see that a linear combination of two translates of |x| with appropriate coefficients (positive and negative and equal in absolute value) yields the piecewise linear threshold function σ_L(x) also shown in Figure 2b. Linear combinations of translates of such functions can be used to approximate one-dimensional functions. A similar derivative-like linear combination of two translates of σ_L(x) with appropriate coefficients yields the gaussian-like function g_L(x) also shown in Figure 2c. Linear combinations of translates of this function can also be used to approximate a function. Thus any given approximation in terms of g_L(x) can be rewritten in terms of σ_L(x), and the latter can in turn be expressed in terms of the basis function |x|. Notice that the basis functions |x| underlie the "hinge" technique proposed by Breiman (1993), whereas the basis functions σ_L(x) are sigmoidal-like and the g_L(x) are gaussian-like. The arguments above show the close relations between all of them, despite the fact that only |x| is strictly a "legal" basis function from the point of view of regularization [g_L(x) is not, though the very similar but smoother gaussian is]. Notice also that |x| can be expressed in terms of "ramp" functions, that is, |x| = x_+ + x_-. Thus a one-hidden-layer perceptron using the activation function σ_L(x) can be rewritten in terms of a generalized regularization network with basis function |x|. The equivalent kernel is effectively local only if there exist a sufficient number of centers for each dimension (w_μ · x). This is the case for projection pursuit regression but not for usual one-hidden-layer perceptrons.
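The chain of constructions just described can be checked numerically. The following sketch (not from the paper; the half-width h = 1 and the 1/2 scaling are illustrative choices) builds the sigmoidal-like function from two translates of |x|, and the gaussian-like bump from two translates of the sigmoidal-like function:

```python
# Sketch: the piecewise-linear basis functions of Figure 2, built out of
# translates of |x| as described in the text (h = 1 is an assumed half-width).

def abs_basis(x):
    return abs(x)

def sigma_L(x, h=1.0):
    # Two translates of |x| with coefficients +1/2 and -1/2: a piecewise
    # linear "sigmoid" that saturates at -1 for x <= -h and +1 for x >= h.
    return 0.5 * (abs_basis(x + h) - abs_basis(x - h))

def g_L(x, h=1.0):
    # Derivative-like difference of two translates of sigma_L: a triangular,
    # gaussian-like bump centered at 0, vanishing for |x| >= 2h.
    return 0.5 * (sigma_L(x + h, h) - sigma_L(x - h, h))
```

Any approximation written with g_L can thus be mechanically rewritten with σ_L, and then with |x|, which is the equivalence claimed in the text.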
These relationships imply that it may be interesting to compare how well each of these basis functions is able to approximate some simple function. To do this we used the model f(x) = \sum_{\alpha} c_\alpha g(w_\alpha x - t_\alpha) to approximate the function h(x) = sin(2πx) on [0,1], where g(x) is one of the basis functions of Figure 2. Fifty training points and 10,000 test points
F. Girosi, M. Jones, and T. Poggio
Figure 3: Approximation of sin(2πx) using 8 basis functions of the (a) absolute value type, (b) sigmoidal-like type, and (c) gaussian-like type.
were chosen uniformly on [0,1]. The parameters were learned using the iterative backfitting algorithm (Friedman and Stuetzle 1981; Hastie and Tibshirani 1990; Breiman 1993) that will be described in Section 7. We looked at the function learned after fitting 1, 2, 4, 8, and 16 basis functions. Some of the resulting approximations are plotted in Figure 3. The results show that the performance of all three basis functions is fairly close as the number of basis functions increases. All models did a good job of approximating sin(2πx). The absolute value function did slightly worse and the "gaussian" function did slightly better. It is interesting that the approximation using two absolute value functions is almost identical to the approximation using one "sigmoidal" function, which again shows that two absolute value basis functions can sum to equal one "sigmoidal" piecewise linear function.
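A simplified version of this experiment is easy to reproduce. The sketch below (not the paper's code) fixes the centers on a grid and fits only the linear coefficients by least squares, instead of backfitting all parameters; the number and placement of the 8 knots are assumptions:

```python
# Approximating h(x) = sin(2*pi*x) on [0,1] with a linear combination of
# absolute-value basis functions |x - t_j| (fixed knots, least-squares fit).
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, 50)          # 50 training points, as in the text
y_train = np.sin(2 * np.pi * x_train)

centers = np.linspace(0.1, 0.9, 8)           # 8 fixed knots (an assumption)

def design(x):
    # Constant and linear columns plus one |x - t_j| column per knot.
    cols = [np.ones_like(x), x] + [np.abs(x - t) for t in centers]
    return np.column_stack(cols)

coeffs, *_ = np.linalg.lstsq(design(x_train), y_train, rcond=None)

x_test = rng.uniform(0.0, 1.0, 10000)        # 10,000 test points, as in the text
mse = np.mean((design(x_test) @ coeffs - np.sin(2 * np.pi * x_test)) ** 2)
```

Even with fixed knots, a handful of |x| translates yields a close piecewise-linear fit of the sine, consistent with the comparison reported above.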
7 Numerical Illustrations

7.1 Comparing Additive and Nonadditive Models. To illustrate some of the ideas presented in this paper and to provide some practical intuition about the various models, we present numerical experiments comparing the performance of additive and nonadditive networks on two-dimensional problems. In a model consisting of a sum of two-dimensional gaussians, the model can be changed from a nonadditive radial basis function network to an additive network by "elongating" the gaussians along the two coordinate axes x and y. This allows us to measure the performance of a network as it changes from a nonadditive scheme to an additive one. Five different models were tested. The first three differ only in the variances of the gaussians along the two coordinate axes. The ratio of the
x variance to the y variance determines the elongation of the gaussian. These models all have the same form and can be written as

    f(x) = \sum_{i=1}^{N} c_i \left[ G_1(x - x_i) + G_2(x - x_i) \right]

where

    G_1 = e^{-[(x^2/\sigma_1^2) + (y^2/\sigma_2^2)]}, \qquad G_2 = e^{-[(x^2/\sigma_2^2) + (y^2/\sigma_1^2)]}
The models differ only in the values of σ1 and σ2. For the first model, σ1 = 0.5 and σ2 = 0.5 (RBF); for the second model, σ1 = 10 and σ2 = 0.5 (elliptical gaussian); and for the third model, σ1 = ∞ and σ2 = 0.5 (additive). These models correspond to placing two gaussians at each data point x_i, with one gaussian elongated in the x direction and one elongated in the y direction. In the first case (RBF) there is no elongation, in the second case (elliptical gaussian) there is moderate elongation, and in the last case (additive) there is infinite elongation. The fourth model is a generalized regularization network model, of the form 4.9, that uses a gaussian basis function:

    f(x) = \sum_{\alpha=1}^{n} c_\alpha e^{-(w_\alpha \cdot x - t_\alpha)^2}
In this model, to which we referred earlier as a gaussian MLP network (equation 4.9), the weight vectors, centers, and coefficients are all learned. In order to see how sensitive the performances were to the choice of basis function, we also repeated the experiments for model 4 with a sigmoid (which is not a basis function that can be derived from regularization theory) replacing the gaussian basis function. In our experiments we used the standard sigmoid function:

    \sigma(x) = \frac{1}{1 + e^{-x}}
Models 1 to 5 are summarized in Table 1: notice that only model 5 is a multilayer perceptron in the standard sense. In the first three models, the centers were fixed in the learning algorithm and equal to the training examples. The only parameters that were learned were the coefficients c_i, which were computed by solving the linear system of equations 2.4. The fourth and the fifth models were trained by fitting one basis function at a time according to the following recursive algorithm with backfitting (Friedman and Stuetzle 1981; Hastie and Tibshirani 1990; Breiman 1993):

- Add a new basis function;
- Optimize the parameters w_α, t_α, and c_α using the "random step" algorithm (Caprile and Girosi 1990) described below;
Table 1: The Five Models Tested in Our Numerical Experiments.
- Backfitting: for each basis function α added so far:
  - hold the parameters of all other functions fixed;
  - reoptimize the parameters of function α;
- Repeat the backfitting stage until there is no significant decrease in L2 error.
The "random step" algorithm (Caprile and Girosi 1990) is a stochastic optimization algorithm that is very simple to implement and that usually finds good local minima. The algorithm works as follows: pick random changes to each parameter such that each random change lies within some interval [a, b]. Add the random changes to each parameter and then calculate the new error between the output of the network and the target values. If the error decreases, then keep the changes and double the length of the interval for picking random changes. If the error increases, then discard the changes and halve the size of the interval. If the length of the interval becomes less than some threshold, then reset the length of the interval to some larger value. The five models were each tested on two different functions: a two-dimensional additive function
    h_{add}(x, y) = \sin(2\pi x) + 4(y - 0.5)^2

and the two-dimensional Gabor function

    g_{Gabor}(x, y) = e^{-\|x\|^2} \cos[0.75\pi(x + y)]
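The "random step" procedure described above can be sketched in a few lines. In this sketch (not the authors' implementation) the symmetric interval, reset value, and iteration count are illustrative assumptions:

```python
# The "random step" stochastic optimization algorithm: perturb all parameters
# by uniform random amounts; keep improvements and double the interval,
# discard failures and halve it, resetting the interval when it underflows.
import random

def random_step(error_fn, params, n_iter=5000, interval=1.0,
                threshold=1e-8, reset=1.0):
    params = list(params)
    best_err = error_fn(params)
    for _ in range(n_iter):
        # Propose a random change for each parameter within [-interval, interval].
        trial = [p + random.uniform(-interval, interval) for p in params]
        err = error_fn(trial)
        if err < best_err:
            params, best_err = trial, err   # keep the changes...
            interval *= 2.0                 # ...and double the interval
        else:
            interval /= 2.0                 # discard and halve the interval
        if interval < threshold:
            interval = reset                # reset to some larger value
    return params, best_err

# Toy usage: minimize a quadratic error around an arbitrary target vector.
random.seed(0)
target = [0.3, -1.2, 2.0]
start = [0.0, 0.0, 0.0]
err0 = sum((p - t) ** 2 for p, t in zip(start, target))
fitted, err = random_step(
    lambda p: sum((pi - ti) ** 2 for pi, ti in zip(p, target)), start)
```

The doubling/halving rule adapts the step size: large steps while far from a minimum, progressively finer steps near it.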
Table 2: A Summary of the Results of Our Numerical Experiments. Each table entry contains the L2 errors for both the training set and the test set.

                     Model 1   Model 2   Model 3   Model 4   Model 5
    h_add(x, y)
      Training      0.000036  0.000067  0.000001  0.000170  0.000743
      Test          0.011717  0.001598  0.000007  0.001422  0.026699
    g_Gabor(x, y)
      Training      0.000000  0.000000  0.000000  0.000001  0.000044
      Test          0.003818  0.344881  67.95237  0.033964  0.191055
The training data for the functions h_add and g_Gabor consisted of 20 points picked from a uniform distribution on [0,1] x [0,1] and [-1,1] x [-1,1], respectively. Another 10,000 points were randomly chosen to serve as test data. The results are summarized in Table 2 (see Girosi et al. 1993 for a more extensive description of the results). As expected, the results show that the additive model 3 was able to approximate the additive function h_add(x, y) better than both the RBF model 1 and the elliptical gaussian model 2, and that there seems to be a smooth degradation of performance as the model changes from the additive to the radial basis function. Just the opposite results are seen in approximating the nonadditive Gabor function g_Gabor(x, y), shown in Figure 4a. The RBF model 1 did very well, while the additive model 3 did a very poor job, as shown in Figure 4b. However, Figure 4c shows that the GRN scheme (model 4) gives a fairly good approximation, because the learning algorithm finds better directions for projecting the data than the x and y axes used in the pure additive model. Notice that the first three models we considered had a number of parameters equal to the number of data points, and were supposed to exactly interpolate the data, so that one may wonder why the training errors are not exactly zero. The reason is the ill-conditioning of the associated linear system, which is a typical problem of radial basis functions (Dyn et al. 1986).

8 Hardware and Biological Implementation of Network Architectures
We have seen that different network architectures can be derived from regularization by making somewhat different assumptions on the classes of functions used for approximation. Given the basic common roots, one is tempted to argue, and numerical experiments support the claim, that there will be small differences in average performance of the various architectures (see also Lippmann 1989; Lippmann and Lee 1991).
Figure 4: (a) The function to be approximated, g(x, y). (b) Additive gaussian model approximation of g(x, y) (model 3). (c) GRN approximation of g(x, y) (model 4).

It therefore becomes interesting to ask which architectures are easier to implement in hardware. All the schemes that use the same number of centers as examples, such as RBF and additive splines, are expensive in terms of memory requirements (if there are many examples) but have a simple learning stage. More interesting are the schemes that use fewer centers than examples (and use the linear transformation W). There are at least two perspectives for our discussion: we can consider implementation of radial vs. additive schemes, and we can consider different activation functions. Let us first discuss radial vs. nonradial functions, such as a gaussian RBF vs. a gaussian MLP network. For VLSI implementations, the main difference is in computing a scalar product rather than an L2 distance, which is usually more expensive both for digital and analog VLSI. The L2 distance, however, might be replaced with the L1 distance, that is, a sum of absolute values, which can be computed efficiently. Notice that a radial basis function scheme that uses the L1 norm has been derived in Section 3.2 from a tensor-product stabilizer.
Let us now consider different activation functions. Activation functions such as the gaussian, the sigmoid, or the absolute value are equally easy to compute, especially if look-up table approaches are used. In analog hardware it is somewhat simpler to generate a sigmoid than a gaussian, although gaussian-like shapes can be synthesized with fewer than 10 transistors (J. Harris, personal communication). In practical implementations other issues, such as trade-offs between memory and computation and on-chip learning, are likely to be much more relevant than the specific chosen architecture. In other words, a general conclusion about ease of implementation is not possible: none of the architectures we have considered holds a clear edge. From the point of view of biological implementations the situation is somewhat different. The hidden unit in MLP networks with sigmoidal-like activation functions is a plausible, albeit much oversimplified, model of real neurons. The sigmoidal transformation of a scalar product seems much easier to implement in terms of known biophysical mechanisms than the gaussian of a multidimensional Euclidean distance. On the other hand, it is intriguing to observe that HBF centers and tuned cortical neurons behave alike (Poggio and Hurlbert 1994). In particular, a gaussian HBF unit is maximally excited when each component of the input exactly matches each component of the center. Thus the unit is optimally tuned to the stimulus value specified by its center. Units with multidimensional centers are tuned to complex features, made of the conjunction of simpler features. This description is very much like the customary description of cortical cells optimally tuned to some more or less complex stimulus. So-called place coding is the simplest and most universal example of tuning: cells with roughly bell-shaped receptive fields have peak sensitivities for given locations in the input space, and, by overlapping, cover all of that space.
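The tuning property just described is easy to illustrate numerically. In the sketch below (purely illustrative; the center, weight vector, and scale factor are arbitrary choices, not values from the paper), a gaussian unit's response drops for any departure from its optimal stimulus, while a sigmoidal unit's response does not decrease when a positively correlated input is scaled up:

```python
# Hypothetical comparison of a gaussian HBF unit, exp(-||x - t||^2), and a
# sigmoidal MLP unit, sigmoid(x . w), under scaling of the input.
import math

t = [0.5, -1.0, 2.0]   # gaussian center (arbitrary choice)
w = [0.5, -1.0, 2.0]   # sigmoid weight vector (arbitrary choice)

def gaussian_unit(x):
    return math.exp(-sum((xi - ti) ** 2 for xi, ti in zip(x, t)))

def sigmoid_unit(x):
    return 1.0 / (1.0 + math.exp(-sum(xi * wi for xi, wi in zip(x, w))))

x_opt = list(t)                      # optimal stimulus for the gaussian unit
x_scaled = [1.5 * xi for xi in x_opt]  # same direction, larger magnitude
```

The gaussian unit peaks at x = t and declines for the scaled input; the sigmoidal unit, whose response depends only on the projection x · w, responds at least as strongly to the scaled input.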
Thus tuned cortical neurons seem to behave more like gaussian HBF units than like the sigmoidal units of MLP networks: the tuned response function of cortical neurons mostly resembles exp(-||x - t||^2) more than it does σ(x · w). When the stimulus to a cortical neuron is changed from its optimal value in any direction, the neuron's response typically decreases. The activity of a gaussian HBF unit would also decline with any change in the stimulus away from its optimal value t. For the sigmoidal unit, though, certain changes away from the optimal stimulus will not decrease its activity, for example, when the input x is multiplied by a constant greater than 1. How might, then, multidimensional gaussian receptive fields be synthesized from known receptive fields and biophysical mechanisms? The simplest answer is that cells tuned to complex features may be constructed from a hierarchy of simpler cells tuned to incrementally larger conjunctions of elementary features. This idea, popular among physiologists, can immediately be formalized in terms of gaussian radial basis functions, since a multidimensional gaussian function can be decomposed into the product of lower dimensional gaussians (Ballard
Figure 5: An implementation of the normalized radial basis function scheme. A "pool" cell (dotted circle) summates the activities of the hidden units and then divides the output of the network. The division may be approximated in a physiological implementation by shunting inhibition.
1986; Mel 1988, 1990, 1992; Poggio and Girosi 1990a). There are several biophysically plausible ways to implement gaussian RBF-like units (see Poggio and Girosi 1989; Poggio 1990), but none is particularly simple. Ironically, one of the plausible implementations of an RBF unit may exploit circuits based on sigmoidal nonlinearities (see Poggio and Hurlbert 1994). In general, the circuits required for the various schemes described in this paper are reasonable from a biological point of view (Poggio and Girosi 1989; Poggio 1990). For example, the normalized basis function scheme of Section 2.2 could be implemented as outlined in Figure 5, where a "pool" cell summates the activities of all hidden units and shunts the output unit with a shunting inhibition approximating the required division operation.

9 Summary and Remarks
A large number of approximation techniques can be written as multilayer networks with one hidden layer. In past papers (Poggio and
Figure 6: Several classes of approximation schemes and corresponding network architectures can be derived from regularization with the appropriate choice of smoothness priors and associated stabilizers and basis functions, showing the common Bayesian roots.

Girosi 1989, 1990; Girosi 1992) we showed how to derive radial basis functions, hyper basis functions, and several types of multidimensional splines from regularization principles. We had not used regularization to yield approximation schemes of the additive type (Wahba 1990; Hastie and Tibshirani 1990), such as additive splines, ridge approximation of the projection pursuit regression type, and hinge functions. In this paper, we show that appropriate stabilizers can be defined to justify such additive schemes, and that the same extensions that lead from RBF to HBF lead from additive splines to ridge function approximation schemes of the projection pursuit regression type. Our generalized regularization networks include, depending on the stabilizer (that is, on the prior knowledge about the functions we want to approximate), HBF networks, ridge approximation, tensor product splines, and perceptron-like networks with one hidden layer and appropriate activation functions (such as the gaussian). Figure 6 shows a diagram of the relationships. Notice that HBF networks and ridge approximation networks are directly related in the special case of normalized inputs (Maruyama et al. 1992).
We now feel that a common theoretical framework justifies a large spectrum of approximation schemes in terms of different smoothness constraints imposed within the same regularization functional to solve the ill-posed problem of function approximation from sparse data. The claim is that many different networks and corresponding approximation schemes can be derived from the variational principle

    \min_f H[f] = \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2 + \lambda \phi[f]

They differ because of different choices of the stabilizer φ, which correspond to different assumptions of smoothness. In this context, we believe that the Bayesian interpretation is one of the main advantages of regularization: it makes clear that different network architectures correspond to different prior assumptions about the smoothness of the functions to be approximated. The common framework we have derived suggests that differences between the various network architectures are relatively minor, corresponding to different smoothness assumptions. One would expect each architecture to work best for the class of functions defined by the associated prior (that is, the stabilizer), an expectation that is consistent with the numerical results in this paper (see also Donoho and Johnstone 1989).

9.1 Classification and Smoothness. From the point of view of regularization, the task of classification, instead of regression, may seem to present a problem, since the role of smoothness is less obvious. Consider for simplicity binary classification, in which the output y is either 0 or 1, and let P(x, y) = P(x) P(y | x) be the joint probability of the input-output pairs (x, y). The average cost associated with an estimator f(x) is the expected risk (see Section 2.2)

    \int dx\, dy\, P(x, y) \left[ y - f(x) \right]^2
The problem of learning is now equivalent to minimizing the expected risk on the basis of N samples of the joint probability distribution P(x, y), and it is usually solved by minimizing the empirical risk (2.14). Here we discuss two possible approaches to the problem of finding the best estimator:

- If we look for an estimator in the class of real valued functions, it is well known that the minimizer f_0 of the expected risk is the so-called regression function, that is

      f_0(x) = P(1 \mid x)

  Therefore, a real valued network f trained on the empirical risk (2.14) will approximate, under certain conditions of consistency
  (Vapnik 1982; Vapnik and Chervonenkis 1991), the conditional probability distribution of class 1, P(1 | x). In this case our final estimator f is real valued, and in order to obtain a binary estimator we have to apply a threshold function to it, so that our final solution turns out to be

      \theta[f(x) - 1/2]

  where θ is the Heaviside function.

- We could look for an estimator with range {0, 1}, for example of the form f(x) = θ[g(x)]. In this case the expected risk becomes the average number of misclassified vectors. The function that minimizes the expected risk is not the regression function any more, but a binary approximation to it.
We argue that in both cases it makes sense to assume that f (and g) is a smooth real-valued function, and therefore to use regularization networks to approximate it. The argument is that a natural prior constraint for classification is smoothness of the classification boundaries, since otherwise it would be impossible to generalize the correct classification from a set of examples. Furthermore, a condition that usually provides smooth classification boundaries is smoothness of the underlying regressor: a smooth function usually has "smooth" level crossings. Thus both approaches described above suggest imposing smoothness of f or g, that is, approximating f or g with a regularization network.

9.2 Complexity of the Approximation Problem. So far we have discussed several approximation techniques only from the point of view of representation and architecture, and we did not discuss how well they perform in approximating functions from different function spaces. Since these techniques are derived under different a priori smoothness assumptions, we clearly expect them to perform optimally when those a priori assumptions are satisfied. This makes it difficult to compare their performances, since we expect each technique to work best on a different class of functions. However, if we measure performance by how quickly the approximation error goes to zero when the number of parameters of the approximation scheme goes to infinity, very general results from the theory of linear and nonlinear widths (Timan 1963; Pinkus 1986; Lorentz 1962, 1986; DeVore et al. 1989; DeVore 1991; DeVore and Yu 1991) suggest that all techniques share the same limitations. For example, when approximating an s times continuously differentiable function in d variables with some function parameterized by n parameters, one can prove that even the "best" nonlinear parameterization cannot achieve an accuracy better than the Jackson type bound, that is, O(n^{-s/d}).
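Inverting this bound makes the exponential dependence on the dimension explicit (a standard rearrangement, not an equation from the paper; the symbols follow the text above):

```latex
% Jackson-type bound on the approximation error with n parameters:
\epsilon(n) = O\!\left(n^{-s/d}\right)
% Solving \epsilon = n^{-s/d} for n gives the number of parameters
% needed to reach accuracy \epsilon:
n = O\!\left(\epsilon^{-d/s}\right)
% For fixed smoothness s, n grows exponentially with the dimension d
% (the curse of dimensionality); if s grows linearly with d, the exponent
% d/s stays bounded and the curse disappears.
```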
Here the adjective "best" is used in the sense defined by DeVore et al. (1989)
in their work on nonlinear n-widths, which restricts the sets of nonlinear parameterizations to those for which the optimal parameters depend continuously on the function to be approximated. Notice that, although this is a desirable property, not all approximation techniques may have it, and therefore these results may not always be applicable. However, the basic intuition is that a class of functions has an intrinsic complexity that increases exponentially in the ratio d/s, where s is a smoothness index, that is, a measure of the amount of constraints imposed on the functions of the class. Therefore, if the smoothness index is kept constant, we expect that the number of parameters needed in order to achieve a certain accuracy increases exponentially with the number of dimensions, irrespective of the approximation technique, showing the phenomenon known as "the curse of dimensionality" (Bellman 1961). Clearly, if we consider classes of functions with a smoothness index that increases with the number of variables, then a rate of convergence independent of the dimensionality can be obtained, because the increase in complexity due to the larger number of variables is compensated by the decrease due to the stronger smoothness constraint. To make this concept clear, we summarize in Table 3 a number of different approximation techniques, and the constraints that can be imposed on them in order to make the approximation error O(1/\sqrt{n}), that is, "independent of the dimension," and therefore immune to the curse of dimensionality. Notice that since these techniques are derived under different a priori assumptions, the explicit forms of the constraints are different. For example, in entries 5 and 6 of Table 3 (Girosi and Anzellotti 1992, 1993; Girosi 1993) the result holds in H^{2m,1}(R^d), that is, the Sobolev space of functions whose derivatives up to order 2m are integrable (Ziemer 1989).
Notice that the number of derivatives that are integrable has to increase with the dimension d in order to keep the rate of convergence constant. A similar phenomenon appears in entries 2 and 3 (Barron 1991, 1993; Breiman 1993), but in a less obvious way. In fact, it can be shown (Girosi and Anzellotti 1992, 1993) that, for example, the spaces of functions considered by Barron (entry 2) and Breiman (entry 3) are the sets of functions that can be written respectively as f(x) = \|x\|^{1-d} * \lambda and f(x) = \|x\|^{2-d} * \lambda, where λ is any function whose Fourier transform is integrable, and * stands for the convolution operator. Notice that, in this way, it becomes more apparent that these spaces of functions become more and more constrained as the dimension increases, due to the more and more rapid fall-off of the terms \|x\|^{1-d} and \|x\|^{2-d}. The same phenomenon is also very clear in the results of Mhaskar (1993a), who proved that the rate of convergence of approximation of functions with s continuous derivatives by multilayer feedforward neural networks is O(n^{-s/d}): if the number of continuous derivatives s increases linearly with the dimension d, the curse of dimensionality disappears, leading to a rate of convergence independent of the dimension. It is important to emphasize that in practice the parameters of the
Table 3: Approximation Schemes and Corresponding Function Spaces with the Same Rate of Convergence O(n^{-1/2}). (Columns: function space, norm, approximation scheme.)

The function σ is the standard sigmoidal function, the function |x|_+ in the third entry is the ramp function, and the function G_m in the fifth entry is a Bessel potential, that is, the Fourier transform of (1 + \|s\|^2)^{-m/2} (Stein 1970). H^{2m,1}(R^d) is the Sobolev space of functions whose derivatives up to order 2m are integrable (Ziemer 1989).
approximation scheme have to be estimated from a finite amount of data (Vapnik and Chervonenkis 1971, 1981, 1991; Vapnik 1982; Pollard 1984; Geman et al. 1992; Haussler 1989; Baum and Haussler 1989; Baum 1988; Moody 1991a,b). In fact, what one does in practice is to minimize the empirical risk (see equation 2.14), while what one would really like to minimize is the expected risk (see equation 2.13). This introduces an additional source of error, sometimes called "estimation error," that usually depends on the dimension d in a much milder way than the approximation error, and can be estimated using the theory of uniform convergence of relative frequencies to probabilities (Vapnik and Chervonenkis 1971, 1981, 1991; Vapnik 1982; Pollard 1984). Specific results on the generalization error, which combine both approximation and estimation error, have been obtained by Barron (1991, 1994) for sigmoidal neural networks, and by Niyogi and Girosi (1994) for gaussian radial basis functions. Although these bounds are different, they all have the same qualitative behavior: for a fixed number of data points the generalization error first decreases as the number of parameters increases, then reaches a minimum and starts increasing again, revealing the well-known phenomenon of overfitting. For a general description of how the approximation and estimation errors combine to bound the generalization error see Niyogi and Girosi (1994).
9.3 Additive Structure and the Sensory World. In this last section we address the surprising relative success of additive schemes of the ridge approximation type in real world applications. As we have seen, ridge approximation schemes depend on priors that combine additivity of one-dimensional functions with the usual assumption of smoothness. Do such priors capture some fundamental property of the physical world? Consider, for example, the problem of object recognition, or the problem of motor control. We can recognize almost any object from any of many small subsets of its features, visual and nonvisual. We can perform many motor actions in several different ways. In most situations, our sensory and motor worlds are redundant. In terms of GRN this means that instead of high-dimensional centers, any of several lower-dimensional centers, that is, components, are often sufficient to perform a given task. This means that the "and" of a high-dimensional conjunction can be replaced by the "or" of its components (low-dimensional conjunctions): a face may be recognized by its eyebrows alone, or a mug by its color. To recognize an object, we may use not only templates comprising all its features, but also subtemplates comprising subsets of features, and in some situations the latter, by themselves, may be fully sufficient. Additive, small centers (in the limit, of dimensionality one) with the appropriate W are of course associated with stabilizers of the additive type. Splitting the recognizable world into its additive parts may well be preferable to reconstructing it in its full multidimensionality, because a system composed of several independent, additive parts is inherently more robust than a whole simultaneously dependent on each of its parts. The small loss in uniqueness of recognition is easily offset by the gain against noise and occlusion. There is also a possible meta-argument that we mention here only for the sake of curiosity.
It may be argued that humans would not be able to understand the world if it were not additive, because too many examples would be needed, given the high dimensionality of any sensory input such as an image. Thus one may be tempted to conjecture that our sensory world is biased toward an "additive structure."
Appendix A: Derivation of the General Form of Solution of the Regularization Problem

We have seen in Section 2 that the regularized solution of the approximation problem is the function that minimizes a cost functional of the following form:

    H[f] = \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2 + \lambda \phi[f]   (A.1)
where the smoothness functional φ[f] is given by

    \phi[f] = \int_{\mathbb{R}^d} ds\, \frac{|\tilde{f}(s)|^2}{\tilde{G}(s)}
The first term measures the distance between the data and the desired solution f, and the second term measures the cost associated with the deviation from smoothness. For a wide class of functionals φ the solutions of the minimization problem A.1 all have the same form. A detailed and rigorous derivation of the solution of the variational principle associated with equation A.1 is outside the scope of this paper. We present here a simple derivation and refer the reader to the current literature for the mathematical details (Wahba 1990; Madych and Nelson 1990; Dyn 1987). We first notice that, depending on the choice of G, the functional φ[f] can have a nonempty null space, and therefore there is a certain class of functions that are "invisible" to it. To cope with this problem we first define an equivalence relation among all the functions that differ by an element of the null space of φ[f]. Then we express the first term of H[f] in terms of the Fourier transform of f:

    f(x_i) = \int_{\mathbb{R}^d} ds\, \tilde{f}(s)\, e^{i x_i \cdot s}
obtaining the functional

    H[\tilde{f}] = \sum_{i=1}^{N} \left[ y_i - \int_{\mathbb{R}^d} ds\, \tilde{f}(s)\, e^{i x_i \cdot s} \right]^2 + \lambda \int_{\mathbb{R}^d} ds\, \frac{|\tilde{f}(s)|^2}{\tilde{G}(s)}
Then we notice that since f is real, its Fourier transform satisfies the constraint

    \tilde{f}^*(s) = \tilde{f}(-s)
so that the functional can be rewritten as

    H[\tilde{f}] = \sum_{i=1}^{N} \left[ y_i - \int_{\mathbb{R}^d} ds\, \tilde{f}(s)\, e^{i x_i \cdot s} \right]^2 + \lambda \int_{\mathbb{R}^d} ds\, \frac{\tilde{f}(s)\, \tilde{f}(-s)}{\tilde{G}(s)}
In order to find the minimum of this functional we take its functional derivative with respect to \tilde{f} and set it to zero:

    \frac{\delta H[\tilde{f}]}{\delta \tilde{f}(t)} = 0   (A.2)
¹For simplicity of notation we take all the constants that appear in the definition of the Fourier transform to be equal to 1.
F. Girosi, M. Jones, and T. Poggio
We now proceed to compute the functional derivatives of the first and second term of $H[\tilde f]$. For the first term we have

$$ \frac{\delta}{\delta \tilde f(t)} \sum_{i=1}^N \left( y_i - f(x_i) \right)^2 = -2 \sum_{i=1}^N [y_i - f(x_i)] \int_{R^d} ds\, \delta(s - t)\, e^{i x_i \cdot s} = -2 \sum_{i=1}^N [y_i - f(x_i)]\, e^{i x_i \cdot t} $$

For the smoothness functional we have

$$ \frac{\delta \phi[\tilde f]}{\delta \tilde f(t)} = 2\, \frac{\tilde f(-t)}{\tilde G(t)} $$

Using these results we can now write equation A.2 as

$$ \sum_{i=1}^N [y_i - f(x_i)]\, e^{i x_i \cdot t} = \lambda\, \frac{\tilde f(-t)}{\tilde G(t)} $$

Changing $t$ into $-t$ and multiplying by $\tilde G(-t)$ on both sides of this equation we get

$$ \tilde f(t) = \tilde G(-t) \sum_{i=1}^N \frac{y_i - f(x_i)}{\lambda}\, e^{-i x_i \cdot t} $$

We now define the coefficients

$$ c_i = \frac{y_i - f(x_i)}{\lambda}, \qquad i = 1, \ldots, N $$

assume that $G$ is symmetric (so that its Fourier transform is real), and take the Fourier transform of the last equation, obtaining

$$ f(x) = \sum_{i=1}^N c_i\, \delta(x - x_i) * G(x) = \sum_{i=1}^N c_i\, G(x - x_i) $$

We now recall that we had defined as equivalent all the functions differing by a term that lies in the null space of $\phi[f]$, and therefore the most general solution of the minimization problem is

$$ f(x) = \sum_{i=1}^N c_i\, G(x - x_i) + p(x) $$

where $p(x)$ is a term that lies in the null space of $\phi[f]$, that is, a set of polynomials for the most common choices of stabilizer $\phi$.
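As a small numerical illustration of this solution (our sketch, not part of the paper): with a Gaussian basis function, which is strictly positive definite, the null-space term $p(x)$ can be dropped, and the coefficients $c_i$ follow from the linear system $(G + \lambda I)c = y$ of Section 2. The function names below are ours.

```python
import numpy as np

def gaussian_matrix(A, B, sigma=1.0):
    """Matrix of basis-function values G(a - b) for a Gaussian G."""
    r2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-r2 / (2.0 * sigma ** 2))

def fit(X, y, lam=1e-8, sigma=1.0):
    """Coefficients of f(x) = sum_i c_i G(x - x_i), via (G + lam*I) c = y."""
    G = gaussian_matrix(X, X, sigma)
    return np.linalg.solve(G + lam * np.eye(len(X)), y)

def predict(Xnew, X, c, sigma=1.0):
    """Evaluate f(x) = sum_i c_i G(x - x_i) at the rows of Xnew."""
    return gaussian_matrix(Xnew, X, sigma) @ c
```

As the regularization parameter lam tends to zero the network interpolates the data exactly, consistent with the interpolation case discussed in the paper.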
Figure 7: The most general network with one hidden layer and vector output. Notice that this approximation of a q-dimensional vector field has, in general, fewer parameters than the alternative representation consisting of q networks with one-dimensional outputs. If the only free parameters are the weights from the hidden layer to the output (as for simple RBF with n = N, where N is the number of examples), the two representations are equivalent.

Appendix B: Approximation of Vector Fields with Regularization Networks

Consider the problem of approximating a q-dimensional vector field $y(x)$ from a set of sparse data, the examples, which are pairs $(x_i, y_i)$ for $i = 1, \ldots, N$. Choose a generalized regularization network as the approximation scheme, that is, a network with one "hidden" layer and linear output units. Consider the case of $N$ examples, $n \le N$ centers, input dimensionality $d$, and output dimensionality $q$ (see Fig. 7). Then the approximation is

$$ y(x) = \sum_{\alpha=1}^n c_\alpha\, G(x - t_\alpha) \eqno{(B.1)} $$
where $G$ is the chosen basis function and the coefficients $c_\alpha$ are now q-dimensional vectors:³ $c_\alpha = (c_\alpha^1, \ldots, c_\alpha^\mu, \ldots, c_\alpha^q)$.

³The components of an output vector will always be denoted by superscript Greek indices.
Here we assume, for simplicity, that $G$ is positive definite, in order to avoid the need of additional polynomial terms in the previous equation. Equation B.1 can be rewritten in matrix notation as

$$ y(x) = C\, g(x) \eqno{(B.2)} $$

where the matrix $C$ is defined by $(C)_{\mu,\alpha} = c_\alpha^\mu$ and $g$ is the vector with elements $[g(x)]_\alpha = G(x - t_\alpha)$. Assuming, for simplicity, that there is no noise in the data [which is equivalent to choosing $\lambda = 0$ in the regularization functional (2.1)], the equations for the coefficients $c_\alpha$ can be found by imposing the interpolation conditions:

$$ y_i = C\, g(x_i), \qquad i = 1, \ldots, N $$

Introducing the following notation

$$ (Y)_{\mu,i} = y_i^\mu, \qquad (C)_{\mu,\alpha} = c_\alpha^\mu, \qquad (G)_{\alpha,i} = G(x_i - t_\alpha) $$

the matrix of coefficients $C$ is given by

$$ C = Y\, G^+ $$

where $G^+$ is the pseudoinverse of $G$ (Penrose 1955; Albert 1972). Substituting this expression in equation B.2, the following expression is obtained:

$$ y(x) = Y\, G^+ g(x) $$

After some algebraic manipulations, this expression can be rewritten as

$$ y(x) = \sum_{i=1}^N b_i(x)\, y_i $$

where the functions $b_i(x)$, which are the elements of the vector $b(x)$, depend on the chosen $G$ according to

$$ b(x) = G^+ g(x) $$

Therefore, it follows (though it is not so well known) that the vector field $y(x)$ is approximated by the network as a linear combination of the example fields $y_i$. Thus, for any choice of the regularization network and any choice of the (positive definite) basis function, the estimated output vector is always a linear combination of the output example vectors, with coefficients $b$ that depend on the input value. The result is valid for all networks with one hidden layer and linear outputs, provided that the mean square error criterion is used for training.
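A small numerical sketch of this result (ours, not from the paper; the function names are hypothetical): build the matrices Y and G and the vector g(x) as defined above and evaluate the network as Y G⁺ g(x), so the output is a combination of the example output vectors with input-dependent coefficients b(x) = G⁺ g(x).

```python
import numpy as np

def make_predictor(X, Y, T, sigma=1.0):
    """Vector-field regularization network of Appendix B.
    X: (N, d) example inputs; Y: (q, N) example outputs, (Y)_{mu,i} = y_i^mu;
    T: (n, d) centers. Returns the map x -> y(x) = Y G^+ g(x)."""
    def g(x):
        # [g(x)]_alpha = G(x - t_alpha), with a Gaussian basis function
        return np.exp(-((x - T) ** 2).sum(-1) / (2.0 * sigma ** 2))
    G = np.stack([g(xi) for xi in X], axis=1)  # (G)_{alpha,i} = G(x_i - t_alpha)
    Gp = np.linalg.pinv(G)                     # pseudoinverse G^+
    def y(x):
        b = Gp @ g(x)   # b(x) = G^+ g(x): coefficients of the example fields
        return Y @ b    # linear combination of the output example vectors
    return y
```

With n = N and the centers placed at the data points, G is square and invertible, so the predictor interpolates: y(x_i) = y_i.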
Acknowledgments

We are grateful to P. Niyogi, H. Mhaskar, J. Friedman, J. Moody, V. Tresp, and one of the (anonymous) referees for useful discussions and suggestions. This paper describes research done within the Center for Biological and Computational Learning in the Department of Brain and Cognitive Sciences and at the Artificial Intelligence Laboratory at MIT. This research is sponsored by grants from the Office of Naval Research under contracts N00014-91-J-0385 and N00014-92-J-1879 and by a grant from the National Science Foundation under contract ASC-9217041 (which includes funds from ARPA provided under the HPCC program). Support for the A.I. Laboratory's artificial intelligence research is provided by ARPA-ONR contract N00014-91-J-4038. Tomaso Poggio is supported by the Uncas and Helen Whitaker Chair at the Whitaker College, Massachusetts Institute of Technology.
References

Aidu, F. A., and Vapnik, V. N. 1989. Estimation of probability density on the basis of the method of stochastic regularization. Avtom. Telemek. (4), 84-97.
Albert, A. 1972. Regression and the Moore-Penrose Pseudoinverse. Academic Press, New York.
Allen, D. 1974. The relationship between variable selection and data augmentation and a method for prediction. Technometrics 16, 125-127.
Aronszajn, N. 1950. Theory of reproducing kernels. Trans. Am. Math. Soc. 68, 337-404.
Ballard, D. H. 1986. Cortical connections and parallel processing: Structure and function. Behav. Brain Sci. 9, 67-120.
Barron, A. R., and Barron, R. L. 1988. Statistical learning networks: A unifying view. In Symposium on the Interface: Statistics and Computing Science, Reston, Virginia.
Barron, A. R. 1991. Approximation and estimation bounds for artificial neural networks. Tech. Rep. 59, Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL.
Barron, A. R. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39(3), 930-945.
Barron, A. R. 1994. Approximation and estimation bounds for artificial neural networks. Machine Learn. 14, 115-133.
Baum, E. B. 1988. On the capabilities of multilayer perceptrons. J. Complex. 4, 193-215.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Bellman, R. E. 1961. Adaptive Control Processes. Princeton University Press, Princeton, NJ.
Bertero, M. 1986. Regularization methods for linear inverse problems. In Inverse Problems, C. G. Talenti, ed. Springer-Verlag, Berlin.
Bertero, M., Poggio, T., and Torre, V. 1988. Ill-posed problems in early vision. Proc. IEEE 76, 869-889.
Bottou, L., and Vapnik, V. 1992. Local learning algorithms. Neural Comp. 4(6), 888-900.
Breiman, L. 1993. Hinging hyperplanes for regression, classification, and function approximation. IEEE Trans. Inform. Theory 39(3), 999-1013.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Buhmann, M. D. 1990. Multivariate cardinal interpolation with radial basis functions. Construct. Approx. 6, 225-255.
Buhmann, M. D. 1991. On quasi-interpolation with radial basis functions. Numerical Analysis Reports DAMTP 1991/NA3, Department of Applied Mathematics and Theoretical Physics, Cambridge, England.
Buja, A., Hastie, T., and Tibshirani, R. 1989. Linear smoothers and additive models. Ann. Statist. 17, 453-555.
Caprile, B., and Girosi, F. 1990. A nondeterministic minimization algorithm. A.I. Memo 1254, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA.
Cox, D. D. 1984. Multivariate smoothing spline functions. SIAM J. Numer. Anal. 21, 789-813.
Craven, P., and Wahba, G. 1979. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross validation. Numer. Math. 31, 377-403.
Cybenko, G. 1989. Approximation by superposition of a sigmoidal function. Math. Control Systems Signals 2(4), 303-314.
de Boor, C. 1978. A Practical Guide to Splines. Springer-Verlag, New York.
de Boor, C. 1990. Quasi-interpolants and approximation power of multivariate splines. In Computation of Curves and Surfaces, M. Gasca and C. A. Micchelli, eds., pp. 313-345. Kluwer Academic Publishers, Dordrecht, Netherlands.
DeVore, R. A. 1991. Degree of nonlinear approximation. In Approximation Theory, VI, C. K. Chui, L. L. Schumaker, and D. J. Ward, eds., pp. 175-201. Academic Press, New York.
DeVore, R. A., and Yu, X. M. 1991. Nonlinear n-widths in Besov spaces. In Approximation Theory, VI, C. K. Chui, L. L. Schumaker, and D. J. Ward, eds., pp. 203-206. Academic Press, New York.
DeVore, R., Howard, R., and Micchelli, C. 1989. Optimal nonlinear approximation. Manuscripta Math.
Devroye, L. P., and Wagner, T. J. 1980. Distribution-free consistency results in nonparametric discrimination and regression function estimation. Ann. Statist. 8, 231-239.
Diaconis, P., and Freedman, D. 1984. Asymptotics of graphical projection pursuit. Ann. Statist. 12(3), 793-815.
Donoho, D. L., and Johnstone, I. M. 1989. Projection-based approximation and a duality with kernel methods. Ann. Statist. 17(1), 58-106.
Duchon, J. 1977. Spline minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables, Lecture Notes in Mathematics, 571, W. Schempp and K. Zeller, eds. Springer-Verlag, Berlin.
Dyn, N. 1987. Interpolation of scattered data by radial functions. In Topics in Multivariate Approximation, C. K. Chui, L. L. Schumaker, and F. I. Utreras, eds. Academic Press, New York.
Dyn, N. 1991. Interpolation and approximation by radial and related functions. In Approximation Theory, VI, C. K. Chui, L. L. Schumaker, and D. J. Ward, eds., pp. 211-234. Academic Press, New York.
Dyn, N., Levin, D., and Rippa, S. 1986. Numerical procedures for surface fitting of scattered data by radial functions. SIAM J. Sci. Stat. Comput. 7(2), 639-659.
Dyn, N., Jackson, I. R. H., Levin, D., and Ron, A. 1989. On multivariate approximation by integer translates of a basis function. Computer Sciences Tech. Rep. 886, University of Wisconsin-Madison.
Eubank, R. L. 1988. Spline Smoothing and Nonparametric Regression, Vol. 90 of Statistics, Textbooks and Monographs. Marcel Dekker, Basel.
Franke, R. 1982. Scattered data interpolation: Tests of some methods. Math. Comp. 38(5), 181-200.
Franke, R. 1987. Recent advances in the approximation of surfaces from scattered data. In Topics in Multivariate Approximation, C. K. Chui, L. L. Schumaker, and F. I. Utreras, eds. Academic Press, New York.
Friedman, J. H., and Stuetzle, W. 1981. Projection pursuit regression. J. Am. Statist. Assoc. 76(376), 817-823.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Gasser, Th., and Müller, H. G. 1985. Estimating regression functions and their derivatives by the kernel method. Scand. J. Statist. 11, 171-185.
Gelfand, I. M., and Shilov, G. E. 1964. Generalized Functions. Vol. 1: Properties and Operations. Academic Press, New York.
Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Comp. 4, 1-58.
Girosi, F. 1991. Models of noise and robust estimates. A.I. Memo 1287, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Girosi, F. 1992. On some extensions of radial basis functions and their applications in artificial intelligence. Comput. Math. Applic. 24(12), 61-80.
Girosi, F. 1993. Regularization theory, radial basis functions and networks. In From Statistics to Neural Networks: Theory and Pattern Recognition Applications, V. Cherkassky, J. H. Friedman, and H. Wechsler, eds. Subseries F, Computer and Systems Sciences. Springer-Verlag, Berlin.
Girosi, F., and Anzellotti, G. 1992. Rates of convergence of approximation by translates. A.I. Memo 1288, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Girosi, F., and Anzellotti, G. 1993. Rates of convergence for radial basis functions and neural networks. In Artificial Neural Networks for Speech and Vision, R. J. Mammone, ed., pp. 97-113. Chapman & Hall, London.
Girosi, F., and Poggio, T. 1990. Networks and the best approximation property. Biol. Cybernet. 63, 169-176.
Girosi, F., Poggio, T., and Caprile, B. 1991. Extensions of a theory of networks for approximation and learning: Outliers and negative examples. In Advances in Neural Information Processing Systems 3, R. Lippmann, J. Moody, and D. Touretzky, eds. Morgan Kaufmann, San Mateo, CA.
Girosi, F., Jones, M., and Poggio, T. 1993. Priors, stabilizers and basis functions: From regularization to radial, tensor and additive splines. A.I. Memo No. 1430, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Golub, G., Heath, M., and Wahba, G. 1979. Generalized cross validation as a method for choosing a good ridge parameter. Technometrics 21, 215-224.
Grimson, W. E. L. 1982. A computational theory of visual surface interpolation. Proc. R. Soc. London B 298, 395-427.
Harder, R. L., and Desmarais, R. M. 1972. Interpolation using surface splines. J. Aircraft 9, 189-191.
Härdle, W. 1990. Applied Nonparametric Regression, Vol. 19 of Econometric Society Monographs. Cambridge University Press, Cambridge.
Hardy, R. L. 1971. Multiquadric equations of topography and other irregular surfaces. J. Geophys. Res. 76, 1905-1915.
Hardy, R. L. 1990. Theory and applications of the multiquadric-biharmonic method. Computers Math. Applic. 19(8/9), 163-208.
Hastie, T., and Tibshirani, R. 1986. Generalized additive models. Statist. Sci. 1, 297-318.
Hastie, T., and Tibshirani, R. 1987. Generalized additive models: Some applications. J. Am. Statist. Assoc. 82, 371-386.
Hastie, T., and Tibshirani, R. 1990. Generalized Additive Models, Vol. 43 of Monographs on Statistics and Applied Probability. Chapman & Hall, London.
Haussler, D. 1989. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Tech. Rep. UCSC-CRL-91-02, University of California, Santa Cruz.
Hertz, J. A., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Huber, P. J. 1985. Projection pursuit. Ann. Statist. 13(2), 435-475.
Hurlbert, A., and Poggio, T. 1988. Synthesizing a color algorithm from examples. Science 239, 482-485.
Irie, B., and Miyake, S. 1988. Capabilities of three-layered perceptrons. IEEE Int. Conf. Neural Networks 1, 641-648.
Jackson, I. R. H. 1988. Radial basis function methods for multivariate approximation. Ph.D. thesis, University of Cambridge, U.K.
Jones, L. K. 1992. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist. 20(1), 608-613.
Kansa, E. J. 1990a. Multiquadrics: A scattered data approximation scheme with applications to computational fluid dynamics, I. Comput. Math. Applic. 19(8/9), 127-145.
Kansa, E. J. 1990b. Multiquadrics: A scattered data approximation scheme with applications to computational fluid dynamics, II. Comput. Math. Applic. 19(8/9), 147-161.
Kimeldorf, G. S., and Wahba, G. 1971. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Statist. 41(2), 495-502.
Kohonen, T. 1990. The self-organizing map. Proc. IEEE 78(9), 1464-1480.
Kung, S. Y. 1993. Digital Neural Networks. Prentice Hall, Englewood Cliffs, NJ.
Lancaster, P., and Salkauskas, K. 1986. Curve and Surface Fitting. Academic Press, London.
Lapedes, A., and Farber, R. 1988. How neural nets work. In Neural Information Processing Systems, D. Z. Anderson, ed., pp. 442-456. American Institute of Physics, New York.
Lippmann, R. P. 1989. Review of neural networks for speech recognition. Neural Comp. 1, 1-38.
Lippmann, R. P., and Lee, Y. 1991. A critical overview of neural network pattern classifiers. Presented at Neural Networks for Computing Conference, Snowbird, UT.
Lorentz, G. G. 1962. Metric entropy, widths, and superposition of functions. Am. Math. Monthly 69, 469-485.
Lorentz, G. G. 1986. Approximation of Functions. Chelsea, New York.
Madych, W. R., and Nelson, S. A. 1990a. Multivariate interpolation and conditionally positive definite functions. II. Math. Comput. 54(189), 211-230.
Madych, W. R., and Nelson, S. A. 1990b. Polyharmonic cardinal splines: A minimization property. J. Approx. Theory 63, 303-320.
Marroquin, J. L., Mitter, S., and Poggio, T. 1987. Probabilistic solution of ill-posed problems in computational vision. J. Am. Stat. Assoc. 82, 76-89.
Maruyama, M., Girosi, F., and Poggio, T. 1992. A connection between HBF and MLP. A.I. Memo No. 1291, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Meinguet, J. 1979. Multivariate interpolation at arbitrary points made simple. J. Appl. Math. Phys. 30, 292-304.
Mel, B. W. 1988. MURPHY: A robot that learns by doing. In Neural Information Processing Systems, D. Z. Anderson, ed. American Institute of Physics, New York.
Mel, B. W. 1990. The sigma-pi column: A model of associative learning in cerebral neocortex. Tech. Rep. 6, California Institute of Technology.
Mel, B. W. 1992. NMDA-based pattern-discrimination in a modeled cortical neuron. Neural Comp. 4, 502-517.
Mhaskar, H. N. 1993a. Approximation properties of a multilayered feedforward artificial neural network. Adv. Comp. Math. 1, 61-80.
Mhaskar, H. N. 1993b. Neural networks for localized approximation of real functions. In Neural Networks for Signal Processing III, Proceedings of the 1993 IEEE-SP Workshop, C. A. Kamm et al., eds., pp. 190-196. IEEE Signal Processing Society, New York.
Mhaskar, H. N., and Micchelli, C. A. 1992. Approximation by superposition of a sigmoidal function. Adv. Appl. Math. 13, 350-373.
Mhaskar, H. N., and Micchelli, C. A. 1993. How to choose an activation function. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds. Morgan Kaufmann, San Mateo, CA.
Micchelli, C. A. 1986. Interpolation of scattered data: Distance matrices and conditionally positive definite functions. Construct. Approx. 2, 11-22.
Moody, J. 1991a. Note on generalization, regularization, and architecture selection in nonlinear learning systems. In Proceedings of the First IEEE-SP Workshop on Neural Networks for Signal Processing, pp. 1-10. IEEE Computer Society Press, Los Alamitos, CA.
Moody, J. 1991b. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, eds., pp. 847-854. Morgan Kaufmann, Palo Alto, CA.
Moody, J., and Darken, C. 1988. Learning with localized receptive fields. In Proceedings of the 1988 Connectionist Models Summer School, G. Hinton, T. Sejnowski, and D. Touretzky, eds., pp. 133-143. Palo Alto, CA.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1(2), 281-294.
Moody, J., and Yarvin, N. 1991. Networks with learned unit response functions. In Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, eds., pp. 1048-1055. Morgan Kaufmann, Palo Alto, CA.
Morozov, V. A. 1984. Methods for Solving Incorrectly Posed Problems. Springer-Verlag, Berlin.
Nadaraya, E. A. 1964. On estimating regression. Theor. Prob. Appl. 9, 141-142.
Niyogi, P., and Girosi, F. 1994. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. A.I. Memo 1467, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Omohundro, S. 1987. Efficient algorithms with neural network behaviour. Complex Syst. 1, 273.
Parisi, G. 1988. Statistical Field Theory. Addison-Wesley, Reading, MA.
Parzen, E. 1962. On estimation of a probability density function and mode. Ann. Math. Statist. 33, 1065-1076.
Penrose, R. 1955. A generalized inverse for matrices. Proc. Cambridge Philos. Soc. 51, 406-413.
Pinkus, A. 1986. N-widths in Approximation Theory. Springer-Verlag, New York.
Poggio, T. 1975. On optimal nonlinear associative recall. Biol. Cybernet. 19, 201-209.
Poggio, T. 1990. A theory of how the brain might work. Cold Spring Harbor Symp. Quantit. Biol., 899-910.
Poggio, T., and Girosi, F. 1989. A theory of networks for approximation and learning. A.I. Memo No. 1140, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Poggio, T., and Girosi, F. 1990a. Networks for approximation and learning. Proc. IEEE 78(9).
Poggio, T., and Girosi, F. 1990b. Extension of a theory of networks for approximation and learning: Dimensionality reduction and clustering. In Proceedings Image Understanding Workshop, pp. 597-603, Pittsburgh, Pennsylvania, September 11-13. Morgan Kaufmann, Palo Alto, CA.
Poggio, T., and Girosi, F. 1990c. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247, 978-982.
Poggio, T., and Hurlbert, A. 1994. Observations on cortical mechanisms for object recognition and learning. In Large-Scale Neuronal Theories of the Brain, C. Koch and J. Davis, eds. In press.
Poggio, T., Torre, V., and Koch, C. 1985. Computational vision and regularization theory. Nature 317, 314-319.
Poggio, T., Voorhees, H., and Yuille, A. 1988. A regularized solution to edge detection. J. Complex. 4, 106-123.
Pollard, D. 1984. Convergence of Stochastic Processes. Springer-Verlag, Berlin.
Powell, M. J. D. 1987. Radial basis functions for multivariable interpolation: A review. In Algorithms for Approximation, J. C. Mason and M. G. Cox, eds. Clarendon Press, Oxford.
Powell, M. J. D. 1992. The theory of radial basis function approximation in 1990. In Advances in Numerical Analysis Volume II: Wavelets, Subdivision Algorithms and Radial Basis Functions, W. A. Light, ed., pp. 105-210. Oxford University Press, Oxford.
Priestley, M. B., and Chao, M. T. 1972. Non-parametric function fitting. J. R. Statist. Soc. B 34, 385-392.
Rabut, C. 1991. How to build quasi-interpolants: Applications to polyharmonic B-splines. In Curves and Surfaces, P.-J. Laurent, A. Le Méhauté, and L. L. Schumaker, eds., pp. 391-402. Academic Press, New York.
Rabut, C. 1992. An introduction to Schoenberg's approximation. Comput. Math. Applic. 24(12), 149-175.
Ripley, B. D. 1994. Neural networks and related methods for classification. Proc. R. Soc. London, in press.
Rissanen, J. 1978. Modeling by shortest data description. Automatica 14, 465-471.
Rosenblatt, M. 1971. Curve estimates. Ann. Math. Statist. 42, 1815-1842.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature (London) 323(9), 533-536.
Schoenberg, I. J. 1946a. Contributions to the problem of approximation of equidistant data by analytic functions, Part A: On the problem of smoothing of graduation, a first class of analytic approximation formulae. Quart. Appl. Math. 4, 45-99.
Schoenberg, I. J. 1969. Cardinal interpolation and spline functions. J. Approx. Theory 2, 167-206.
Schumaker, L. L. 1981. Spline Functions: Basic Theory. John Wiley, New York.
Sejnowski, T. J., and Rosenberg, C. R. 1987. Parallel networks that learn to pronounce English text. Complex Syst. 1, 145-168.
Silverman, B. W. 1984. Spline smoothing: The equivalent variable kernel method. Ann. Statist. 12, 898-916.
Sivakumar, N., and Ward, J. D. 1991. On the best least square fit by radial functions to multidimensional scattered data. Tech. Rep. 251, Center for Approximation Theory, Texas A&M University.
Solomonoff, R. J. 1978. Complexity-based induction systems: Comparison and convergence theorems. IEEE Trans. Inform. Theory 24.
Specht, D. F. 1991. A general regression neural network. IEEE Trans. Neural Networks 2(6), 568-576.
Stein, E. M. 1970. Singular Integrals and Differentiability Properties of Functions. Princeton University Press, Princeton, NJ.
Stewart, J. 1976. Positive definite functions and generalizations, an historical survey. Rocky Mountain J. Math. 6, 409-434.
Stone, C. J. 1985. Additive regression and other nonparametric models. Ann. Statist. 13, 689-705.
Tikhonov, A. N. 1963. Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl. 4, 1035-1038.
Tikhonov, A. N., and Arsenin, V. Y. 1977. Solutions of Ill-Posed Problems. W. H. Winston, Washington, DC.
Timan, A. F. 1963. Theory of Approximation of Functions of a Real Variable. Macmillan, New York.
Tresp, V., Hollatz, J., and Ahmad, S. 1993. Network structuring and training using rule-based knowledge. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds. Morgan Kaufmann, San Mateo, CA.
Utreras, F. 1979. Cross-validation techniques for smoothing spline functions in one or two dimensions. In Smoothing Techniques for Curve Estimation, T. Gasser and M. Rosenblatt, eds., pp. 196-231. Springer-Verlag, Heidelberg.
Vapnik, V. N. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin.
Vapnik, V. N., and Chervonenkis, A. Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Th. Prob. Applic. 17(2), 264-280.
Vapnik, V. N., and Chervonenkis, A. Y. 1981. The necessary and sufficient conditions for the uniform convergence of averages to their expected values. Teor. Veroyat. Primen. 26(3), 543-564.
Vapnik, V. N., and Chervonenkis, A. Y. 1991. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recog. Image Anal. 1(3), 283-305.
Vapnik, V. N., and Stefanyuk, A. R. 1978. Nonparametric methods for restoring probability densities. Automat. Telemek. 8, 38-52.
Wahba, G. 1975. Smoothing noisy data by spline functions. Numer. Math. 24, 383-393.
Wahba, G. 1979. Smoothing and ill-posed problems. In Solution Methods for Integral Equations and Applications, M. Golberg, ed., pp. 183-194. Plenum Press, New York.
Wahba, G. 1980. Spline bases, regularization, and generalized cross-validation for solving approximation problems with large quantities of noisy data. In Proceedings of the International Conference on Approximation Theory in Honour of George Lorentz, J. Ward and E. Cheney, eds., Austin, TX, January 8-10, 1980. Academic Press, New York.
Wahba, G. 1985. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Ann. Statist. 13, 1378-1402.
Wahba, G. 1990. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59. SIAM, Philadelphia.
Wahba, G., and Wold, S. 1975. A completely automatic French curve. Commun. Statist. 4, 1-17.
Watson, G. S. 1964. Smooth regression analysis. Sankhya A 26, 359-372.
White, H. 1989. Learning in artificial neural networks: A statistical perspective. Neural Comp. 1, 425-464.
White, H. 1990. Connectionist nonparametric regression: Multilayer perceptrons can learn arbitrary mappings. Neural Networks 3, 535-549.
Yuille, A., and Grzywacz, N. 1988. The motion coherence theory. In Proceedings of the International Conference on Computer Vision, pp. 344-354. IEEE Computer Society Press, Washington, DC.
Ziemer, W. P. 1989. Weakly Differentiable Functions: Sobolev Spaces and Functions of Bounded Variation. Springer-Verlag, New York.
Received February 2, 1994; accepted June 22, 1994.
Regularized kriging as a generalization of simple, universal, and bayesian kriging. Stochastic Environmental Research and Risk Assessment 20:4, 243-258. [CrossRef] 37. C. Alippi, F. Scotti. 2006. Exploiting Application Locality to Design Low-Complexity, Highly Performing, and Power-Aware Embedded Classifiers. IEEE Transactions on Neural Networks 17:3, 745-754. [CrossRef] 38. L. Weruaga, B. Kieslinger. 2006. Tikhonov Training of the CMAC Neural Network. IEEE Transactions on Neural Networks 17:3, 613-622. [CrossRef] 39. Y. Abe, Y. Iiguni. 2006. Interpolation capability of the periodic radial basis function network. IEE Proceedings - Vision, Image, and Signal Processing 153:6, 785. [CrossRef]
40. Jin Cong. 2006. A novel watermarking algorithm for resistant geometric attacks using feature points matching. Information Management & Computer Security 14:1, 75-98. [CrossRef] 41. G. Valentini. 2005. An Experimental Bias-Variance Analysis of SVM Ensembles Based on Resampling Techniques. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 35:6, 1252-1271. [CrossRef] 42. G. Corani, G. Guariso. 2005. Coupling Fuzzy Modeling and Neural Networks for River Flood Prediction. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 35:3, 382-390. [CrossRef] 43. S. Ferrari, I. Frosio, V. Piuri, N.A. Borghese. 2005. Automatic Multiscale Meshing Through HRBF Networks. IEEE Transactions on Instrumentation and Measurement 54:4, 1463-1470. [CrossRef] 44. Theodoros Evgeniou, Constantinos Boussios, Giorgos Zacharia. 2005. Generalized Robust Conjoint Estimation. Marketing Science 24:3, 415-429. [CrossRef] 45. J. C. Lemm, J. Uhlig, A. Weiguny. 2005. Bayesian approach to inverse quantum statistics: Reconstruction of potentials in the Feynman path integral representation of quantum theory. The European Physical Journal B 46:1, 41-54. [CrossRef] 46. Tomasz Czekaj, Wen Wu, Beata Walczak. 2005. About kernel latent variable approaches and SVM. Journal of Chemometrics 19:5-7, 341-354. [CrossRef] 47. A. Krzyzak, D. Schafer. 2005. Nonparametric Regression Estimation by Normalized Radial Basis Function Networks. IEEE Transactions on Information Theory 51:3, 1003-1010. [CrossRef] 48. Alain Rakotomamonjy, Xavier Mary, St�phane Canu. 2005. Non-parametric regression with wavelet kernels. Applied Stochastic Models in Business and Industry 21:2, 153-163. [CrossRef] 49. Vera Kurková, Marcello Sanguineti. 2005. Error Estimates for Approximate Optimization by the Extended Ritz Method. SIAM Journal on Optimization 15:2, 461. [CrossRef] 50. I. Steinwart. 2005. Consistency of Support Vector Machines and Other Regularized Kernel Classifiers. 
IEEE Transactions on Information Theory 51:1, 128-142. [CrossRef] 51. C.K. Loo, M. Rajeswari, M.V.C. Rao. 2004. Novel Direct and Self-Regulating Approaches to Determine Optimum Growing Multi-Experts Network Structure. IEEE Transactions on Neural Networks 15:6, 1378-1395. [CrossRef] 52. G.L. Wang, Y.F. Li, D.X. Bi. 2004. Support Vector Machine Networks for Friction Modeling. IEEE/ASME Transactions on Mechatronics 9:3, 601-606. [CrossRef] 53. Miroslaw Galicki, Lutz Leistritz, Ernst Bernhard Zwick, Herbert Witte. 2004. Improving Generalization Capabilities of Dynamic Neural NetworksImproving Generalization Capabilities of Dynamic Neural Networks. Neural Computation 16:6, 1253-1282. [Abstract] [PDF] [PDF Plus]
54. Lorenzo Rosasco , Ernesto De Vito , Andrea Caponnetto , Michele Piana , Alessandro Verri . 2004. Are Loss Functions All the Same?Are Loss Functions All the Same?. Neural Computation 16:5, 1063-1076. [Abstract] [PDF] [PDF Plus] 55. C. Cervellera, M. Muselli. 2004. Deterministic Design for Neural Network Learning: An Approach Based on Discrepancy. IEEE Transactions on Neural Networks 15:3, 533-544. [CrossRef] 56. D. Bi, Y.F. Li, S.K. Tso, G.L. Wang. 2004. Friction Modeling and Compensation for Haptic Display Based on Support Vector Machine. IEEE Transactions on Industrial Electronics 51:2, 491-500. [CrossRef] 57. M. Arif, T. Ishihara, H. Inooka. 2004. Intelligent Learning Controllers for Nonlinear Systems using Radial Basis Neural Networks. Control and Intelligent Systems 32:2. . [CrossRef] 58. S. Ferrari, M. Maggioni, N.A. Borghese. 2004. Multiscale Approximation With Hierarchical Radial Basis Functions Networks. IEEE Transactions on Neural Networks 15:1, 178-188. [CrossRef] 59. I. Goethals, T. Van Gestel, J. Suykens, P. Van Dooren, B. De Moor. 2003. Identification of positive real models in subspace identification by using regularization. IEEE Transactions on Automatic Control 48:10, 1843-1847. [CrossRef] 60. Emmanuel Guigon . 2003. Computing with Populations of Monotonically Tuned NeuronsComputing with Populations of Monotonically Tuned Neurons. Neural Computation 15:9, 2115-2127. [Abstract] [PDF] [PDF Plus] 61. R. Genov, G. Cauwenberghs. 2003. Kerneltron: support vector "machine" in silicon. IEEE Transactions on Neural Networks 14:5, 1426-1434. [CrossRef] 62. P. Cerveri, C. Forlani, A. Pedotti, G. Ferrigno. 2003. Hierarchical radial basis function networks and local polynomial un-warping for X-ray image intensifier distortion correction: A comparison with global techniques. Medical & Biological Engineering & Computing 41:2, 151-163. [CrossRef] 63. Ping Guo, M.R. Lyu, C.L.P. Chen. 2003. Regularization parameter estimation for feedforward neural networks. 
IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 33:1, 35-44. [CrossRef] 64. Zhe Chen , Simon Haykin . 2002. On Different Facets of Regularization TheoryOn Different Facets of Regularization Theory. Neural Computation 14:12, 2791-2846. [Abstract] [PDF] [PDF Plus] 65. C. Alippi. 2002. Selecting accurate, robust, and minimal feedforward neural networks. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 49:12, 1799-1810. [CrossRef] 66. M. Sgrenzaroli, A. Baraldi, H. Eva, G. De Grandi, F. Achard. 2002. Contextual clustering for image labeling: an application to degraded forest assessment in Landsat TM images of the Brazilian Amazon. IEEE Transactions on Geoscience and Remote Sensing 40:8, 1833-1848. [CrossRef]
67. B. Heisele, A. Verri, T. Poggio. 2002. Learning and vision machines. Proceedings of the IEEE 90:7, 1164-1177. [CrossRef] 68. Emmanuel Guigon , Pierre Baraduc . 2002. A Neural Model of Perceptual-Motor AlignmentA Neural Model of Perceptual-Motor Alignment. Journal of Cognitive Neuroscience 14:4, 538-549. [Abstract] [PDF] [PDF Plus] 69. S. Belongie, J. Malik, J. Puzicha. 2002. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24:4, 509-522. [CrossRef] 70. Jiann-Ming Wu . 2002. Natural Discriminant Analysis Using Interactive Potts ModelsNatural Discriminant Analysis Using Interactive Potts Models. Neural Computation 14:3, 689-713. [Abstract] [PDF] [PDF Plus] 71. T. Blu, M. Unser. 2002. Wavelets, fractals, and radial basis functions. IEEE Transactions on Signal Processing 50:3, 543-553. [CrossRef] 72. A. Alessandri, M. Sanguineti, M. Maggiore. 2002. Optimization-based learning with bounded error for feedforward neural networks. IEEE Transactions on Neural Networks 13:2, 261-273. [CrossRef] 73. P. Cerveri, C. Forlani, N. A. Borghese, G. Ferrigno. 2002. Distortion correction for x-ray image intensifiers: Local unwarping polynomials and RBF neural networks. Medical Physics 29:8, 1759. [CrossRef] 74. A.P. Engelbrecht. 2001. A new pruning heuristic based on variance analysis of sensitivity information. IEEE Transactions on Neural Networks 12:6, 1386-1399. [CrossRef] 75. Christophe Andrieu , Nando de Freitas , Arnaud Doucet . 2001. Robust Full Bayesian Learning for Radial Basis NetworksRobust Full Bayesian Learning for Radial Basis Networks. Neural Computation 13:10, 2359-2407. [Abstract] [PDF] [PDF Plus] 76. V. Kurkova, M. Sanguineti. 2001. Bounds on rates of variable-basis and neural-network approximation. IEEE Transactions on Information Theory 47:6, 2659-2665. [CrossRef] 77. Koji Tsuda. 2001. The subspace method in Hilbert space. Systems and Computers in Japan 32:6, 55-61. [CrossRef] 78. G. 
De Nicolao, G. Ferrari-Trecate. 2001. Regularization networks: fast weight calculation via Kalman filtering. IEEE Transactions on Neural Networks 12:2, 228-235. [CrossRef] 79. C. Alippi, V. Piuri, F. Scotti. 2001. Accuracy versus complexity in RBF neural networks. IEEE Instrumentation & Measurement Magazine 4:1, 32-36. [CrossRef] 80. A. Ruiz, P.E. Lopez-de-Teruel. 2001. Nonlinear kernel-based statistical pattern analysis. IEEE Transactions on Neural Networks 12:1, 16-32. [CrossRef] 81. J. Pruvost, J. Legrand, P. Legentilhomme. 2001. Three-Dimensional Swirl Flow Velocity-Field Reconstruction Using a Neural Network With Radial Basis Functions. Journal of Fluids Engineering 123:4, 920. [CrossRef]
82. J. F. G. de Freitas , M. Niranjan , A. H. Gee . 2000. Hierarchical Bayesian Models for Regularization in Sequential LearningHierarchical Bayesian Models for Regularization in Sequential Learning. Neural Computation 12:4, 933-953. [Abstract] [PDF] [PDF Plus] 83. J. Lemm, J. Uhlig, A. Weiguny. 2000. Bayesian Approach to Inverse Quantum Statistics. Physical Review Letters 84:10, 2068-2071. [CrossRef] 84. Emilio Salinas , L. F. Abbott . 2000. Do Simple Cells in Primary Visual Cortex Form a Tight Frame?Do Simple Cells in Primary Visual Cortex Form a Tight Frame?. Neural Computation 12:2, 313-335. [Abstract] [PDF] [PDF Plus] 85. Andrea Baraldi, Palma Blonda, Flavio Parmiggiani, Giuseppe Satalino. 2000. Contextual clustering for image segmentation. Optical Engineering 39:4, 907. [CrossRef] 86. D.J.H. Wilson, G.W. Irwin, G. Lightbody. 1999. RBF principal manifolds for process monitoring. IEEE Transactions on Neural Networks 10:6, 1424-1434. [CrossRef] 87. F. Aires, M. Schmitt, A. Chedin, N. Scott. 1999. The "weight smoothing" regularization of MLP for Jacobian stabilization. IEEE Transactions on Neural Networks 10:6, 1502-1510. [CrossRef] 88. G. De Nicolao, G.F. Trecate. 1999. Consistent identification of NARX models via regularization networks. IEEE Transactions on Automatic Control 44:11, 2045-2049. [CrossRef] 89. A. Alessandri, M. Baglietto, T. Parisini, R. Zoppoli. 1999. A neural state estimator with bounded errors for nonlinear systems. IEEE Transactions on Automatic Control 44:11, 2028-2042. [CrossRef] 90. V.N. Vapnik. 1999. An overview of statistical learning theory. IEEE Transactions on Neural Networks 10:5, 988-999. [CrossRef] 91. S. Chen, Y. Wu, B.L. Luk. 1999. Combined genetic algorithm optimization and regularized orthogonal least squares learning for radial basis function networks. IEEE Transactions on Neural Networks 10:5, 1239-1243. [CrossRef] 92. P. Niyogi, F. Girosi, T. Poggio. 1998. 
Incorporating prior information in machine learning by creating virtual examples. Proceedings of the IEEE 86:11, 2196-2209. [CrossRef] 93. Gokaraju K. Raju, Charles L. Cooney. 1998. Active learning from process data. AIChE Journal 44:10, 2199-2211. [CrossRef] 94. J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, M. Anthony. 1998. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory 44:5, 1926-1940. [CrossRef] 95. Tomaso Poggio , Federico Girosi . 1998. A Sparse Representation for Function ApproximationA Sparse Representation for Function Approximation. Neural Computation 10:6, 1445-1454. [Abstract] [PDF] [PDF Plus]
96. Federico Girosi . 1998. An Equivalence Between Sparse Approximation and Support Vector MachinesAn Equivalence Between Sparse Approximation and Support Vector Machines. Neural Computation 10:6, 1455-1480. [Abstract] [PDF] [PDF Plus] 97. C. C. Holmes , B. K. Mallick . 1998. Bayesian Radial Basis Functions of Variable DimensionBayesian Radial Basis Functions of Variable Dimension. Neural Computation 10:5, 1217-1233. [Abstract] [PDF] [PDF Plus] 98. Christopher K. I. Williams . 1998. Computation with Infinite Neural NetworksComputation with Infinite Neural Networks. Neural Computation 10:5, 1203-1216. [Abstract] [PDF] [PDF Plus] 99. L.I. Perlovsky. 1998. Conundrum of combinatorial complexity. IEEE Transactions on Pattern Analysis and Machine Intelligence 20:6, 666-670. [CrossRef] 100. A. Krzyzak, T. Linder. 1998. Radial basis function networks and complexity regularization in function learning. IEEE Transactions on Neural Networks 9:2, 247-256. [CrossRef] 101. A. Lipman, W. Yang. 1997. VLSI hardware for example-based learning. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 5:3, 320-328. [CrossRef] 102. I. Scott, B. Mulgrew. 1997. Nonlinear system identification and prediction using orthogonal functions. IEEE Transactions on Signal Processing 45:7, 1842-1853. [CrossRef] 103. Rajesh P. N. Rao, Dana H. Ballard. 1997. Dynamic Model of Visual Recognition Predicts Neural Response Properties in the Visual CortexDynamic Model of Visual Recognition Predicts Neural Response Properties in the Visual Cortex. Neural Computation 9:4, 721-763. [Abstract] [PDF] [PDF Plus] 104. Wael El-Deredy. 1997. Pattern recognition approaches in biomedical and clinical magnetic resonance spectroscopy: a review. NMR in Biomedicine 10:3, 99-124. [CrossRef] 105. Alexandre Pouget, Terrence J. Sejnowski. 1997. Spatial Transformations in the Parietal Cortex Using Basis FunctionsSpatial Transformations in the Parietal Cortex Using Basis Functions. 
Journal of Cognitive Neuroscience 9:2, 222-237. [Abstract] [PDF] [PDF Plus] 106. H. N. Mhaskar, Nahmwoo Hahm. 1997. Neural Networks for Functional Approximation and System IdentificationNeural Networks for Functional Approximation and System Identification. Neural Computation 9:1, 143-159. [Abstract] [PDF] [PDF Plus] 107. C. K. Chui, Xin Li, H. N. Mhaskar. 1996. Limitations of the approximation capabilities of neural networks with one hidden layer. Advances in Computational Mathematics 5:1, 233-243. [CrossRef] 108. Partha Niyogi, Federico Girosi. 1996. On the Relationship between Generalization Error, Hypothesis Complexity, and Sample Complexity for Radial Basis FunctionsOn the Relationship between Generalization Error, Hypothesis
Complexity, and Sample Complexity for Radial Basis Functions. Neural Computation 8:4, 819-842. [Abstract] [PDF] [PDF Plus] 109. Lizhong Wu , John Moody . 1996. A Smoothing Regularizer for Feedforward and Recurrent Neural NetworksA Smoothing Regularizer for Feedforward and Recurrent Neural Networks. Neural Computation 8:3, 461-489. [Abstract] [PDF] [PDF Plus] 110. H. N. Mhaskar . 1996. Neural Networks for Optimal Approximation of Smooth and Analytic FunctionsNeural Networks for Optimal Approximation of Smooth and Analytic Functions. Neural Computation 8:1, 164-177. [Abstract] [PDF] [PDF Plus] 111. David Lowe, Robert Matthews. 1995. Shakespeare vs. fletcher: A stylometric analysis by radial basis functions. Computers and the Humanities 29:6, 449-461. [CrossRef] 112. Elias S. ManolakosNeural Networks and Applications to Communications . [CrossRef] 113. Yoshihiro YamanishiSupervised Inference of Metabolic Networks from the Integration of Genomic Data and Chemical Information 189-211. [CrossRef]
NOTE
Communicated by Peter Dayan
A Counterexample to Temporal Differences Learning

Dimitri P. Bertsekas
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA

Sutton's TD(λ) method aims to provide a representation of the cost function in an absorbing Markov chain with transition costs. A simple example is given where the representation obtained depends on λ. For λ = 1 the representation is optimal with respect to a least-squares error criterion, but as λ decreases toward 0 the representation becomes progressively worse and, in some cases, very poor. The example suggests a need to understand better the circumstances under which TD(0) and Q-learning obtain satisfactory neural network-based compact representations of the cost function. A variation of TD(0) is also given, which performs better on the example.

1 Introduction
Neural Computation 7, 270-279 (1995) © 1995 Massachusetts Institute of Technology

We consider a Markov chain with states 0, 1, 2, ..., n. The transition from state i to state j has probability p_{ij} and cost g(i, j). We assume that state 0 is cost-free and absorbing, and that it is eventually reached from every other state with probability one. In other words, p_{00} = 1 and g(0, 0) = 0, and from every state i there is a path of positive-probability transitions that leads to 0. For each initial state i we want to estimate the expected total cost J(i) up to reaching state 0. We consider approximations within a class of differentiable functions J̃(i, w) parameterized by a vector w. For example, J̃(i, w) may be the output of a neural network when the input is i and the vector of weights is w. Sutton's TD(λ) method (Sutton 1988) is a gradient-like algorithm for obtaining a suitable vector w after observing a large number of simulated trajectories of the Markov chain. The method has attracted considerable attention, and has been used successfully in a more general setting by Tesauro (1992) for the training of a neural network to play backgammon. See Barto et al. (1994) for a nice and comprehensive survey of related issues.

For λ ∈ [0, 1], TD(λ) performs an infinite number of simulation runs, each ending at the absorbing state 0. Within the total number of runs, each state is encountered an infinite number of times. If (i_1, i_2, ..., i_N, 0) is the typical trajectory, a positive stepsize γ is selected, and the vector
w is modified at the end of the kth transition by an increment that is proportional to γ and to the temporal difference d_k given by

d_k = g(i_k, i_{k+1}) + J̃(i_{k+1}, w) − J̃(i_k, w),    k = 1, ..., N    (1.1)
where i_{N+1} = 0. The increment also involves the preceding gradients with respect to w, ∇J̃(i_m, w), m = 1, ..., k, which are evaluated at the vector w prevailing at the beginning of the simulation run. (An alternative possibility, for which the analysis of this paper also holds, is to evaluate these gradients at the current value of w.) The method is as follows. Following the state transition (i_1, i_2), set

w := w + γ d_1 ∇J̃(i_1, w)    (1.2)

Following the state transition (i_2, i_3), set

w := w + γ d_2 [λ ∇J̃(i_1, w) + ∇J̃(i_2, w)]    (1.3)
...

Following the state transition (i_N, 0), set

w := w + γ d_N [λ^{N−1} ∇J̃(i_1, w) + λ^{N−2} ∇J̃(i_2, w) + ... + ∇J̃(i_N, w)]    (1.4)
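As an illustrative sketch (not part of the original note), the updates 1.2-1.4 can be carried out along one trajectory with a single running sum of discounted gradients. The linear form J̃(i, w) = iw of Section 2 is assumed here for concreteness, so that ∇J̃(i, w) = i, and the gradients are evaluated at the current value of w (the alternative mentioned above):

```python
def td_lambda_run(states, costs, w, lam, step):
    """Apply the TD(lambda) updates (1.2)-(1.4) along one trajectory
    (i_1, ..., i_N, 0), assuming the linear form J~(i, w) = i*w,
    so grad J~(i, w) = i and J~(0, w) = 0.

    states: [i_1, ..., i_N]; costs[k]: transition cost g(i_k, i_{k+1}).
    """
    traj = list(states) + [0]          # append the absorbing state 0
    z = 0.0                            # running sum of discounted gradients
    for k in range(len(states)):
        i, j = traj[k], traj[k + 1]
        d = costs[k] + j * w - i * w   # temporal difference d_k, equation 1.1
        z = lam * z + i                # discount old gradients, add grad at i_k
        w += step * d * z              # the update (1.2)-(1.4)
    return w
```

For instance, on the two-state trajectory (2, 1, 0) with costs g(2, 1) = 0 and g(1, 0) = 1, one run from w = 0 with λ = 1 and γ = 0.1 moves w to 0.3.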
By adding equations 1.2-1.4 for λ = 1, and by using the temporal differences formula 1.1, we see that the TD(1) iteration corresponding to a complete trajectory can be written (with the convention J̃(0, w) = 0) as

w := w + γ Σ_{k=1}^{N} [ Σ_{m=k}^{N} g(i_m, i_{m+1}) − J̃(i_k, w) ] ∇J̃(i_k, w)

so it is a gradient iteration for minimizing the sum of squares

(1/2) Σ_{k=1}^{N} [ Σ_{m=k}^{N} g(i_m, i_{m+1}) − J̃(i_k, w) ]²

It follows, as originally discussed by Sutton (1988), that TD(1) can be viewed as a form of incremental gradient or backpropagation method for minimizing over w the sum of the squared differences between the sample costs of the states i visited by the simulation and the estimates J̃(i, w). This method has satisfactory convergence behavior, and is supported by classical results on stochastic approximation and stochastic gradient methods (see, e.g., Poljak and Tsypkin 1973; Kushner and Clark 1978; Poljak 1987; Bertsekas and Tsitsiklis 1989), and by more recent analyses of deterministic incremental gradient methods by Luo (1991), Luo and Tseng (1993), and Mangasarian and Solodov (1993). Thus TD(1) will typically tend to yield a value of w that minimizes a weighted sum of the squared errors

[c(i) − J̃(i, w)]²
where c(i) is the average sample cost corresponding to state i, and the weights of different states are determined by the relative frequencies with which these states are visited during the simulation. An alternative view that leads to similar conclusions is to consider TD(1) as a stochastic gradient method for minimizing an expected value of the square of the error J(i) − J̃(i, w).

On the other hand, for λ < 1 the convergence behavior of TD(λ) is unclear, unless w contains enough parameters to make possible an exact representation of J(i) by J̃(i, w) for all states i (a lookup table representation), as shown in various forms by Sutton (1988), Dayan (1992), Tsitsiklis (1993), and Jaakkola et al. (1993). Actually, Sutton's and Dayan's convergence results apply to the slightly more general case of linear representations, under a restrictive linear independence condition on the set of observation vectors. Basically, TD(λ) can be viewed as a form of incremental gradient method where there are some error terms in the gradient direction. These error terms depend on w as well as λ, and they typically do not diminish when w is equal to the value where TD(1) converges, unless λ = 1 or a lookup table representation is used. Thus, in general, the limit obtained by TD(λ) depends on λ, as has also been shown by Dayan (1992). Nonetheless, there are accounts of good practical performance of TD(λ), even with λ substantially less than 1. For example, Tesauro (1992) reports that his backgammon program performs better when trained with small than with high values of λ.

2 An Example
In the following example we use a linear approximation of the form J̃(i, w) = iw, and we find that as λ is reduced from the value 1, TD(λ) converges to an increasingly poor value ŵ(λ). For a deliberate choice of the problem data, we obtain ŵ(0) ≈ −ŵ(1), that is, a reversal of sign of J̃(i, w) (see Fig. 2).

In our example the state transitions and associated costs are deterministic. In particular, from state i we move to state i − 1 with a given cost g_i. Let all simulation runs start at state n and end at 0 after visiting all the states n − 1, n − 2, ..., 1 in succession. The temporal difference associated with the transition from i to i − 1 is
g_i + J̃(i − 1, w) − J̃(i, w) = g_i − w

and the corresponding gradient is

∇J̃(i, w) = i
The iteration of TD(λ) corresponding to a complete trajectory is given by

w := w + γ Σ_{k=1}^{n} [g_k − w] [λ^{n−k} n + λ^{n−k−1} (n − 1) + ... + k]    (2.1)

and is linear in w. Suppose that the stepsize γ is either constant and small enough that the iteration 2.1 is contracting, or else diminishing at a rate that is inversely proportional to the number of simulation runs performed thus far. Then the TD(λ) iteration 2.1 converges to the scalar ŵ(λ) for which the increment in the right-hand side of equation 2.1 is zero, that is,
Σ_{k=1}^{n} [g_k − ŵ(λ)] [λ^{n−k} n + λ^{n−k−1} (n − 1) + ... + k] = 0    (2.2)

In particular, we have

ŵ(λ) = Σ_{k=1}^{n} g_k [λ^{n−k} n + ... + k] / Σ_{k=1}^{n} [λ^{n−k} n + ... + k]    (2.3)
It can be seen that ŵ(1) minimizes over w the sum of squared errors

Σ_{i=1}^{n} [J(i) − J̃(i, w)]²    (2.4)

where

J(i) = g_1 + ... + g_i,    J̃(i, w) = iw,    ∀ i = 1, ..., n
Indeed the optimality condition for minimization of the function 2.4 over w is

Σ_{i=1}^{n} i (g_1 + ... + g_i − iw) = 0

which, when solved for w, gives a solution equal to ŵ(1) as given by equation 2.2. Figures 1 and 2 show the form of the cost function J(i), and the representations J̃[i, ŵ(1)] and J̃[i, ŵ(0)] provided by TD(1) and TD(0), respectively, for n = 50 and for the following two cases:
1. g_1 = 1, g_i = 0, ∀ i ≠ 1
2. g_n = −(n − 1), g_i = 1, ∀ i ≠ n
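As a numerical sketch (an illustration, not part of the original text), the closed-form limit ŵ(λ) of Section 2 can be evaluated directly for both cases with n = 50, using the identity λ^{n−k} n + λ^{n−k−1} (n − 1) + ... + k = Σ_{i=k}^{n} λ^{i−k} i for the coefficient of g_k:

```python
def w_hat(g, lam):
    """Limit w-hat(lambda) of TD(lambda) for the deterministic chain in
    which state i moves to i-1 at cost g[i-1] (that is, g_i)."""
    n = len(g)
    # coefficient of (g_k - w): lam^{n-k} n + lam^{n-k-1} (n-1) + ... + k
    c = [sum(lam ** (i - k) * i for i in range(k, n + 1))
         for k in range(1, n + 1)]
    return sum(gk * ck for gk, ck in zip(g, c)) / sum(c)

n = 50
case1 = [1.0] + [0.0] * (n - 1)         # g_1 = 1, g_i = 0 for i != 1
case2 = [1.0] * (n - 1) + [-(n - 1.0)]  # g_i = 1 for i != n, g_n = -(n - 1)
```

For case 2 this gives ŵ(1) ≈ 0.94 and ŵ(0) ≈ −0.96, the sign reversal visible in Figure 2.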
Figure 1: Form of the cost function J(i), and the linear representations J̃[i, ŵ(1)] and J̃[i, ŵ(0)] provided by TD(1) and TD(0), respectively, for the case g_1 = 1, g_i = 0, ∀ i ≠ 1.
It can be seen that TD(0) can yield a very poor approximation to the cost function.

The above example can be generalized with similar results. For instance, the cost of transition from i to i − 1 may be random, in which case the costs g_i must be replaced by their expected values in equations 2.2 and 2.3. The trajectories need not all start at state n. The results are qualitatively similar if the successor state of state i is randomly chosen. Also, similar behavior can be observed in a variety of stochastic examples that can be constructed with our deterministic example as a "building block." The example indicates that for λ < 1, TD(λ) is in need of further justification for the case of a compact cost function representation.

The example also relates to one of Watkins' Q-learning methods (Watkins 1989). These methods have the advantage that they apply to discounted Markovian decision problems and stochastic shortest path problems (as defined in Bertsekas and Tsitsiklis 1989), where there are multiple actions available at each state and the objective is not just to obtain the optimal cost, but also to find an optimal action at each state. Strong convergence results have been recently shown by Tsitsiklis (1993) for the
most commonly used Q-learning method in the case of a lookup table representation. TD(0) can be viewed as a special case of this Q-learning method for the situation where there is only one action available at each state, so our conclusions also apply to the corresponding neural network versions.

Figure 2: Form of the cost function J(i), and the linear representations J̃[i, ŵ(1)] and J̃[i, ŵ(0)] provided by TD(1) and TD(0), respectively, for the case g_n = −(n − 1), g_i = 1, ∀ i ≠ n.

3 A Partial Remedy
In view of the preceding example, it is interesting to ask whether there is a modified version of TD(0) that yields the exact cost values in the case of a lookup table representation and approximates the cost values better when compact representations are used. For the case of a lookup table representation, we know that TD(0) can be viewed as a Robbins-Monro method for solving the system of equations

Σ_{j=0}^{n} p_{ij} [g(i, j) + J̃(j, w)] − J̃(i, w) = 0,    i = 1, ..., n    (3.1)
that is, for finding a w for which the expected value of the temporal difference vanishes at each state i. For the case of a compact representation, it is thus reasonable to consider a weighted least-squares problem that aims at making the size of the expected temporal differences small in an aggregate sense, that is, a problem of the form

min_w Σ_{i=1}^{n} q_i ( E_j{ d(i, j) | i } )²    (3.2)

where

d(i, j) = g(i, j) + J̃(j, w) − J̃(i, w)

denotes the temporal difference associated with the transition from i to j, E_j{· | i} denotes conditional expected value over j given i, and q_i is a nonnegative weight for each state i. A simulation-based gradient method for solving such a problem is to update w following a transition from i_k by the iteration

w := w + γ E_j{d(i_k, j) | i_k} ( ∇J̃(i_k, w) − E_j{ ∇J̃(j, w) | i_k } )    (3.3)
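A sketch of a single update of iteration 3.3 (illustrative code, not from the paper; the helper arguments `jt` and `grad` standing for J̃ and ∇J̃ are hypothetical names, and w is treated as a scalar for simplicity, as in the example of Section 2):

```python
def iteration_3_3_step(P, g, jt, grad, w, i, step):
    """One update of iteration (3.3) from state i: the expected temporal
    difference times (gradient at i minus the expected successor gradient).
    P[i][j]: transition probability; g[i][j]: transition cost."""
    n = len(P)
    exp_d = sum(P[i][j] * (g[i][j] + jt(j, w) - jt(i, w)) for j in range(n))
    exp_grad = sum(P[i][j] * grad(j, w) for j in range(n))
    return w + step * exp_d * (grad(i, w) - exp_grad)
```

For the deterministic chain of Section 2 with J̃(i, w) = iw, this reduces to w := w + γ(g_i − w), since the gradient difference i − (i − 1) equals 1.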
The relative frequencies of visits to different states determine the relative weights in the corresponding least-squares problem 3.2. Note that the expected temporal difference

E_j{ d(i_k, j) | i_k } = Σ_{j=0}^{n} p_{i_k j} d(i_k, j)

at i_k and the expected gradient

E_j{ ∇J̃(j, w) | i_k } = Σ_{j=0}^{n} p_{i_k j} ∇J̃(j, w)
over the successor states j appear in the right-hand side of this iteration. Thus the computational requirements per iteration are increased over TD(λ), unless the system is deterministic. The method 3.3 is apparently new, although an iteration similar to 3.3 and its sampled version given below (cf. equation 3.6) have been independently developed by Baird and are briefly described in Baird (1993) and Harmon et al. (1994) (this was pointed out by one of the reviewers).

For the deterministic example of Section 2, the iteration 3.3 takes the form

w := w + γ (g_i − w) [i − (i − 1)] = w + γ (g_i − w)
so the iteration corresponding to a full trajectory (n, n − 1, ..., 1, 0) is

w := w + γ Σ_{k=1}^{n} (g_k − w)    (3.4)
When γ is smaller than 1/n, this iteration converges to

ŵ = (1/n) Σ_{k=1}^{n} g_k    (3.5)

This corresponds to a linear approximation that is exact for state n, that is, J(n) = J̃(n, ŵ), regardless of the costs of the other states. In particular, for the example of Figure 1, we obtain in the limit ŵ = 1/n, while for the example of Figure 2, we obtain ŵ = 0. The corresponding approximations J̃(i, ŵ) = iŵ are not as good as those obtained by TD(1), but they are much better than those obtained by TD(0). While it is unclear whether such a conclusion can be reached in a more general setting, in the author's limited experimentation with some stochastic problems, iteration 3.3 has produced substantially better compact cost representations than TD(0).

There is a simpler version of iteration 3.3 that does not require averaging over the successor states j. In this version, the two expected values in iteration 3.3 are replaced by two independent single-sample values. In particular, w is updated by

w := w + γ d(i_k, i_{k+1}) [∇J̃(i_k, w) − ∇J̃(i′_{k+1}, w)]    (3.6)
where i_{k+1} and i′_{k+1} correspond to two independent transitions starting from i_k. It can be seen that this iteration yields in the limit values of w that solve the least-squares problem 3.2. It is necessary to use two independently generated states i_{k+1} and i′_{k+1} in order that the expected value of the product d(i_k, i_{k+1}) [∇J̃(i_k, w) − ∇J̃(i′_{k+1}, w)] given i_k be equal to the term E_j{d(i_k, j) | i_k} ( ∇J̃(i_k, w) − E_j{∇J̃(j, w) | i_k} ) appearing in the right-hand side of equation 3.3.

The variant of iteration 3.6 where a single sample (i_{k+1} = i′_{k+1}) is used, that is,
w := w + γ d(i_k, i_{k+1}) [∇J̃(i_k, w) − ∇J̃(i_{k+1}, w)]    (3.7)
has been discussed by Dayan (1992). It aims at solving the problem

min_w Σ_{i=1}^{n} q_i E_j{ d(i, j)² | i }    (3.8)

where q_i are the nonnegative weights also appearing in equation 3.2, which are determined by the relative frequencies of the visits to different states during the simulation. This problem involves a weighted sum of second moments of the temporal differences, which is not as desirable an objective as the weighted sum of the squares of the means of the temporal differences, which is minimized by iteration 3.3. In particular, in the case
of a lookup table representation, iteration 3.3 yields the exact cost values, while solving problem 3.8 can give other values that may also depend on the weights q_i. Thus it appears that iteration 3.7 is unsuitable for Markov chains that are not deterministic.
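To make the distinction concrete, the following sketch (not from the paper: the three-state chain, costs, and step size are invented for illustration) iterates the *expected* versions of updates 3.6 and 3.7 with a lookup-table representation, where ∇V̂(i, w) is simply the ith unit vector. The two-sample update recovers the exact costs, while the single-sample variant settles on different values:

```python
import numpy as np

# Hypothetical chain (not the paper's Figures 1-2): state 0 moves to state 1 or
# state 2 with probability 1/2 at zero cost; states 1 and 2 move to a cost-free
# terminal state with costs c1 and c2.  Lookup table: V(i, w) = w[i].
c1, c2 = 0.0, 2.0
exact = np.array([(c1 + c2) / 2, c1, c2])       # J(0), J(1), J(2)

def fixed_point(two_samples, step=0.05, sweeps=20000):
    e = np.eye(3)
    w = np.zeros(3)
    for _ in range(sweeps):
        d = np.array([w[1] - w[0], w[2] - w[0]])    # d(0,1), d(0,2)
        if two_samples:
            # iteration 3.6 in expectation: the transition sample and the
            # gradient sample are independent, so the expectations factor
            dw = d.mean() * (e[0] - 0.5 * (e[1] + e[2]))
        else:
            # iteration 3.7 in expectation: one sample plays both roles,
            # coupling the temporal difference with the gradient term
            dw = 0.5 * (d[0] * (e[0] - e[1]) + d[1] * (e[0] - e[2]))
        dw += 0.5 * (c1 - w[1]) * e[1]              # transition 1 -> terminal
        dw += 0.5 * (c2 - w[2]) * e[2]              # transition 2 -> terminal
        w = w + step * dw
    return w

print(fixed_point(True))    # recovers the exact costs [1, 0, 2]
print(fixed_point(False))   # biased values [1, 0.5, 1.5]
```

In line with the remark above, the single-sample variant minimizes the weighted second moments of the temporal differences, so even a lookup table does not recover the exact costs when the chain is stochastic.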
Acknowledgments

Research supported by the SBIR program through a contract with Alphatech, Inc.
References

Barto, A. G., Bradtke, S. J., and Singh, S. P. 1994. Learning to act using real-time dynamic programming. J. Artificial Intelligence, in press.
Baird, L. C. 1993. Advantage updating. Tech. Rep. WL-TR-93-1146, Wright Lab., Wright-Patterson Air Force Base, OH.
Bertsekas, D. P., and Tsitsiklis, J. N. 1989. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.
Dayan, P. 1992. The convergence of TD(λ) for general λ. Machine Learn. 8, 341-362.
Harmon, M. E., Baird, L. C., and Klopf, A. H. 1994. Advantage updating applied to a differential game. NIPS Conf., Denver, Colorado, submitted.
Jaakkola, T., Jordan, M. I., and Singh, S. P. 1993. On the convergence of stochastic iterative dynamic programming algorithms. MIT Computational Cognitive Science Tech. Rep. 9307.
Kushner, H. J., and Clark, D. S. 1978. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York.
Luo, Z. Q. 1991. On the convergence of the LMS algorithm with adaptive learning rate for linear feedforward networks. Neural Comp. 3, 226-245.
Luo, Z. Q., and Tseng, P. 1993. Analysis of an approximate gradient projection method with applications to the back propagation algorithm. Department of Electrical and Computer Engineering, McMaster University, Hamilton, Ontario, and Department of Mathematics, University of Washington, Seattle.
Mangasarian, O. L., and Solodov, M. V. 1993. Serial and parallel backpropagation convergence via nonmonotone perturbed minimization. Computer Sciences Tech. Rep. No. 1149, Computer Science Department, University of Wisconsin-Madison, April 1993.
Poljak, B. T. 1987. Introduction to Optimization. Optimization Software Inc., New York.
Poljak, B. T., and Tsypkin, Y. Z. 1973. Pseudogradient adaptation and training algorithms. Automation Remote Control, 45-68.
Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learn. 3, 9-44.
Tesauro, G. 1992. Practical issues in temporal difference learning. Machine Learn. 8, 257-277.
Tsitsiklis, J. N. 1993. Asynchronous stochastic approximation and Q-learning. LIDS Report P-2172, MIT.
Watkins, C. J. C. H. 1989. Learning from delayed rewards. Ph.D. thesis, Cambridge University, England.
Received April 21, 1994; accepted July 22, 1994.
Communicated by Joshua Alspector

NOTE

New Perceptron Model Using Random Bitstreams

Eel-wan Lee
Soo-Ik Chae
Department of Electronic Engineering, Seoul National University, San 56-1, Shilim-dong, Gwanak-gu, Seoul 151-742, Korea
A very high precision is needed to implement the adder using stochastic computation (or pulse arithmetic) in modern VLSI technology. In this paper we propose a new model of the perceptron using random bitstreams that alleviates this problem.

1 Introduction
The perceptron is a formal neuron that accepts N inputs and outputs a single bit. If the sum of its weighted inputs is greater than or equal to its threshold, then its output is +1; otherwise, it is −1, assuming that its threshold is zero (McCulloch and Pitts 1943). We assume that each input is bipolar (+1 or −1) and that the weights are in the range [−1, +1]. In this note, we focus on perceptrons using stochastic computation (Mars and Poppelbaum 1981). There have been several works on neural networks using stochastic computation (Tomlinson et al. 1990; Alspector et al. 1989; Kondo and Sawada 1991). For stochastic computation we represent a signal with a random bitstream. Because a multiplier can then be implemented with a simple logic gate, this representation is attractive for parallel implementation of artificial neural networks. In the bipolar representation, where each ONE bit has weight +1/L and each ZERO bit has weight −1/L, the expectation value of a random bitstream X and its variance are
$$E(X) = x \tag{1.1}$$

$$\sigma_X^2 = \frac{1 - x^2}{L} \tag{1.2}$$

If the probability of being in the ONE state of the bitstream is P, then x = 2P − 1. We represent each signal with a bipolar bitstream of given length L. The accuracy of a signal depends on its value and the bitstream length. Figure 1a shows a circuit diagram for a conventional perceptron with stochastic computation. First, we convert the synaptic weights into bipolar random bitstreams. Then, we multiply each bipolar input with its corresponding weight bitstream through an XNOR gate. We convert each bit in the bitstreams of weighted inputs into a current and add it
Neural Computation 7, 280-283 (1995)
© 1995 Massachusetts Institute of Technology
on a capacitor C1, which is used as a KCL-based adder. We connect a comparator to the capacitor as a thresholding unit to generate a 1-bit output, and we reset the capacitor voltage to Vdd/2 with CLK2.

Figure 1: Circuit diagrams for the conventional perceptron (a) and the proposed perceptron (b).

This conventional perceptron integrates its weighted-input bitstreams on the capacitor in both the spatial and temporal domains to add the weighted inputs. Because the N inputs are distributed in space and each input is represented with L bits in time, NL bits must be integrated on the capacitor. This perceptron determines its state from the sign bit once every L clocks, by adding the NL bits in the N random bitstreams of weighted inputs. If L and N are large, the limitation of the conventional stochastic perceptron is obvious, because the voltage range of the capacitor is limited and there is a lower limit on the current driven by a current source. This lower bound on the current sources is due to the difficulty of matching the current-driving capability of the transistors.

We propose a new perceptron to alleviate this limitation. The multiplication part of the proposed perceptron is the same as that of the conventional one. The proposed model determines its local state by adding the N bits from the weighted-input bitstreams at every cycle. It integrates the local states on an up/down counter and takes the sign bit of the counter as a global state once every L clocks. With this scheme, we reset the capacitor voltage to Vdd/2 with CLK1 and the counter to zero with CLK2. We can thus separate the decision into two domains, spatial and temporal. The capacitor in Figure 1b is required to be large enough only for the N-bit addition. Therefore, the limit on the number of inputs can be alleviated considerably.

2 Properties of the Proposed Perceptron Model
Although the proposed model can alleviate the limit of the input number by dividing the addition of the pulse into two domains as mentioned
above, it introduces two types of errors. The first type is the error due to the finite pulse length L. We can control this error by lengthening the pulse stream. The second type occurs when the weighted sum is near zero; the model can then make an incorrect decision even though the variance of the bitstream due to the finite pulse length approaches 0. Assume that the weighted inputs are composed of the 3 bitstreams +0.4, +0.5, −1.0 in a 3-input perceptron. The net sum in the conventional perceptron is 0.4 + 0.5 − 1.0 = −0.1, which produces the output −1. However, the net sum in the proposed perceptron is

$$2\,\big(P[\text{three inputs are ONE}] + P[\text{two inputs are ONE}]\big) - 1$$
$$= 2\left[\frac{1+0.4}{2}\cdot\frac{1+0.5}{2}\cdot\frac{1-1.0}{2} + \frac{1+0.4}{2}\cdot\frac{1+0.5}{2}\cdot\frac{1+1.0}{2} + \frac{1+0.4}{2}\cdot\frac{1-0.5}{2}\cdot\frac{1-1.0}{2} + \frac{1-0.4}{2}\cdot\frac{1+0.5}{2}\cdot\frac{1-1.0}{2}\right] - 1$$
$$= 2 \cdot 0.525 - 1 = 0.05 > 0 \tag{2.1}$$
which produces the output +1. We proved that the incorrect decision never occurs when the absolute value of the net sum exceeds 2 ln 2 − 1 if the pulse length is infinite; the detailed proof is omitted from this note for brevity. It can be shown that the incorrect decision probability decreases as the number of inputs increases. If the number of inputs is N and the distribution of the weights is assumed to be uniform in [−1, 1], the variance of the net sum is N/3. The distribution of the net sum approaches a gaussian distribution as N grows; the incorrect decision probability can then be estimated from this approximation.
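The example above can be replayed numerically. The sketch below (the stream length and random seed are arbitrary choices, not from the note) draws three bipolar bitstreams with the stated weighted inputs and compares the two decision rules: thresholding the sum of all NL bits once versus thresholding the N-bit sum at every cycle and counting the signs:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 100_000                          # bitstream length (illustrative)
x = np.array([0.4, 0.5, -1.0])       # the three weighted inputs of the example
# bipolar streams: a bit is ONE (+1) with probability (1 + x)/2, else ZERO (-1)
bits = np.where(rng.random((3, L)) < (1 + x[:, None]) / 2, 1, -1)

# conventional perceptron: integrate all N*L bits, threshold once;
# the accumulated charge tracks 0.4 + 0.5 - 1.0 = -0.1 < 0
conventional = 1 if bits.sum() >= 0 else -1

# proposed perceptron: threshold the N-bit sum every cycle (local state) and
# take the sign of the up/down counter after L cycles (global state)
local_states = np.where(bits.sum(axis=0) >= 0, 1, -1)
proposed = 1 if local_states.sum() >= 0 else -1

print(conventional, proposed)        # -1 and +1: the two rules disagree
```

The per-cycle majority is ONE with probability 0.525, so the counter drifts positive even though the mean analog sum is negative, reproducing the incorrect decision discussed in the text.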
We have also checked this result for several values of N with simulations, which are summarized in Table 1.

3 Conclusion
We explained the difficulty in adding the weighted inputs in the conventional perceptron using random pulse streams if its input number is
Table 1: Incorrect Decision Probability of the Proposed Neuron Model.

Number of inputs    3     5     7     9     11    13
P_incorrect (%)     1.91  1.17  0.74  0.56  0.68  0.51
large and the observation length of the pulse stream is long. We proposed a new perceptron to overcome this difficulty. Furthermore, we verified that the probability of an incorrect decision in the proposed method becomes negligible once the number of inputs exceeds 10.
References

Alspector, J., Gupta, B., and Allen, R. B. 1989. Performance of a stochastic learning microchip. In Advances in Neural Information Processing Systems, Vol. 1, pp. 748-760. Morgan Kaufmann.
Kondo, Y., and Sawada, Y. 1991. Functional abilities of a stochastic logic neural network. IEEE Trans. Neural Networks 3, 434-443.
Mars, P., and Poppelbaum, W. J. 1981. Stochastic and Deterministic Averaging Processors. Peter Peregrinus.
McCulloch, W. S., and Pitts, W. 1943. A logical calculus of ideas immanent in nervous activity. Bull. Math. Biophys. 5, 127-147.
Tomlinson, M. S., Walker, D. J., and Sivilotti, M. A. 1990. A digital neural network architecture for VLSI. Proc. IJCNN 2, 545-550.
Received April 4, 1994; accepted July 6, 1994.
NOTE

Communicated by Peter Dayan

On the Ordering Conditions for Self-organizing Maps

Marco Budinich¹
John G. Taylor
Centre for Neural Networks, King's College, London, England
We present a geometric interpretation of ordering in self-organizing feature maps. This view provides simpler proofs of the Kohonen ordering theorem and of convergence to an ordered state in the one-dimensional case. At the same time it explains intuitively the origin of the problems in higher-dimensional cases. Furthermore, it provides a geometric view of the known characteristics of learning in self-organizing nets.
Self-organizing neural networks have been proposed to model the feature maps of the brain (von der Malsburg 1973; Kohonen 1984), but the underlying theory is still not completely understood (see, e.g., Erwin et al. 1992; Ritter et al. 1992, and references therein). In what follows we focus on Kohonen nets (Kohonen 1984). These nets map a continuous vectorial input space, the space of the patterns {ξ}, onto a lattice of neurons. Here we consider a one-dimensional lattice, i.e., a string of neurons. The kth neuron has weights w_k and its response to a pattern ξ is ξ · w_k. In this view both patterns and neurons can be thought of as points in space. Figure 1 contains two different representations of a net with two-dimensional input and five neurons. The standard learning algorithm is

1. set the weights to initial random values;
2. select a pattern at random, say ξ, and feed it to the neurons;

3. find the output neuron with maximal output, say m;²

¹Permanent address: Dip. di Fisica & INFN, Via Valerio 2, 34127 Trieste, Italy.
²This definition can be tricky unless pattern and weight vectors are somehow normalized. Since both patterns and weights define points in space, the problem can be circumvented by defining the most active neuron for pattern ξ as the neuron whose weights define the nearest point to ξ. Simple algebra shows that the two definitions are equivalent.
Neural Computation 7, 284-289 (1995) © 1995 Massachusetts Institute of Technology
Figure 1: A simple example: a two-dimensional input space mapped to a one-dimensional lattice of five neurons. The drawings represent the net and its representation in pattern/weight space.

4. train m and its neighbors up to a distance d by the Hebb rule; the training affects 2d + 1 neurons:

$$w_k := w_k + \alpha(\xi - w_k), \qquad |k - m| \le d \tag{1.1}$$
5. update the parameters d and α according to a predefined schedule and, if the learning loops are not yet finished, go to 2.

The order parameter D is given by

$$D = \sum_{k=1}^{n-1} \|w_{k+1} - w_k\| - \|w_n - w_1\| \ge 0 \tag{1.2}$$
equality holding if and only if the neurons are aligned and sorted; obviously D evolves during learning. The Kohonen ordering theorem (Kohonen 1984) applies to the case of one-dimensional input (i.e., scalar w and ξ) and gives necessary and sufficient conditions to lower D in a learning step ["D increases (if and only if) ξ lies outside of a fold of length 2ξ at the selected node" (Kohonen 1984, p. 451)]. The theorem also states that if the neurons are ordered (D = 0), learning leaves D unchanged. Subsequently Cottrell and Fort (1987) proved the stronger result of convergence to an ordered state.
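For the one-dimensional case the convergence result is easy to reproduce. The sketch below (the learning schedule, net size, and seed are arbitrary choices for illustration) trains a string of five neurons on scalar patterns and monitors the order parameter D, which drops to zero once the weights become sorted:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
w = rng.random(n)                       # scalar weights, random initial values

def order_parameter(w):
    # D = sum of neighbor distances minus the end-to-end distance:
    # nonnegative, and zero exactly when the weights are monotonically ordered
    return np.sum(np.abs(np.diff(w))) - abs(w[-1] - w[0])

alpha, d = 0.1, 1                       # fixed schedule for this sketch
for _ in range(20_000):
    xi = rng.random()                   # pattern drawn uniformly from [0, 1]
    m = int(np.argmin(np.abs(w - xi)))  # most active neuron = nearest weight
    lo, hi = max(0, m - d), min(n, m + d + 1)
    w[lo:hi] += alpha * (xi - w[lo:hi])     # train m and its neighbors

print(order_parameter(w))               # ~0: the string has become ordered
```

Once D reaches zero it stays there, in agreement with the statement that learning leaves an ordered state unchanged.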
The main result of this paper is a geometric interpretation that

• gives an intuitive necessary and sufficient condition for the decrease of D that applies to the more general case of d-dimensional input spaces and that is trivial to prove;

• simplifies greatly also the proof of Cottrell and Fort (1987) while providing intuitive evidence for the difficulties to be faced in the proof of convergence in higher dimensions.
Following Kohonen (1984), and for inputs and weights in arbitrary dimensions, we calculate the change ΔD = D′ − D produced in a learning step. For brevity we take just five neurons (n = 5) and suppose that the training affects the third neuron and its immediate neighbors (m = 3, d = 1). From 1.2, and applying 1.1 to neurons 2, 3, and 4, we obtain after simple algebra
$$\Delta D = \|w_2 + \alpha(\xi - w_2) - w_1\| - \|w_2 - w_1\| + \|w_4 + \alpha(\xi - w_4) - w_5\| - \|w_4 - w_5\| - \alpha\big(\|w_3 - w_2\| + \|w_4 - w_3\|\big)$$
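As a consistency check, this closed-form change can be compared against a direct recomputation of D for a random configuration (the weights, pattern, and α below are arbitrary; the update w_k := w_k + α(ξ − w_k) for neurons 2, 3, 4 follows rule 1.1):

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.random((5, 2))      # five neurons with two-dimensional weights
xi = rng.random(2)          # one training pattern
alpha = 0.3

def order_parameter(w):
    gaps = np.linalg.norm(np.diff(w, axis=0), axis=1)
    return gaps.sum() - np.linalg.norm(w[-1] - w[0])

# one learning step with m = 3, d = 1: neurons 2, 3, 4 (indices 1, 2, 3) move
w_new = w.copy()
w_new[1:4] += alpha * (xi - w_new[1:4])

# closed-form change: only the two outer gaps change nontrivially, while the
# two inner gaps shrink by the factor (1 - alpha) and the end-to-end term
# ||w_5 - w_1|| is untouched
nrm = np.linalg.norm
delta = (nrm(w[1] + alpha * (xi - w[1]) - w[0]) - nrm(w[1] - w[0])
         + nrm(w[3] + alpha * (xi - w[3]) - w[4]) - nrm(w[3] - w[4])
         - alpha * (nrm(w[2] - w[1]) + nrm(w[3] - w[2])))

print(np.isclose(order_parameter(w_new) - order_parameter(w), delta))  # True
```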
- a( 0 the inphase state is unstable on its whole domain of existence [it disappears, as well as the anti-phase state, at g N 1.93 where the period T ( g ) vanishes]. The intermediate solution is always stable and the anti-phase state is unstable at low g and becomes stable when it merges with the intermediate solution. Qualitatively similar results were found for all the values of 71 and 7 2 we have considered. The in-phase locked state was always unstable and a stable anti-phase state was achieved at large but finite g. In a recent work van Vreeswijk et al. (1994)studied the Lapicque model when the time course of the synapse is described by an 0: function. This is a special case of the interaction we have used. Their conclusion is that excitation is desynchronizing in agreement with our results. As a consequence, a large network of excitatory integrate-and-fire neurons cannot synchronize in-phase even at finite coupling strength [except if the interaction is instantaneous (Mirollo and Strogatz 199011. This fact was also found by Tsodyks et al. (1993) and the stability of the asynchronous state that may then arise was recently examined (Abbott and van Vreeswijk 1993; Treves 1993). Note also that at high coupling (T small) and given 71 = 0.3 the in-phase solution remains unstable even for very small values of 7 2 (a real eigenvalue is larger than 1);this was checked for 72 as small as lop3. Therefore even a very fast rise of the interaction cannot stabilize in-phase synchronization. 4.2 The Model of Connor e t al. For the model of Connor et al., at large coupling strength, the phase shift between two neurons depends drastically on the synaptic time course. This is illustrated in Figure 10, for a fixed decay time constant 71 = 3 msec. The results are qualitatively different depending on the value of the rise time 72. 
For τ2 = 1 msec, the system locks in phase above 2.6 mS/cm², but for τ2 = 2 msec the phase shift between the two neurons increases and anti-phase is reached at g = 1.3 mS/cm². This behavior is similar to what we found above for the integrate-and-fire model. In the first case, large networks are expected to synchronize in phase at high coupling. This is confirmed by simulations: for g above 2.6 mS/cm², full synchrony is achieved at a time
Synchrony in Excitatory Neural Networks
Figure 10: Dephasing, for the model of Connor et al., as a function of the coupling for a pair of excitatory neurons. Here τ1 = 3 msec, while τ2 = 2 msec (dashed line) or τ2 = 1 msec (solid line).

of the order of 100 msec. In contrast, for τ2 = 2 msec and g = 1.3 mS/cm² the system stabilizes in a symmetric three-cluster state. In this state the network is broken into three similar groups of neurons. In each group all the neurons are locked in phase, while the phase shift between the clusters is T/3. Note that a stability analysis grounded on phase reduction reveals that this state is unstable at weak coupling. Clustering has also been found recently in models of thalamic networks (Golomb and Rinzel 1994; Golomb et al. 1994). In the present model it is found only in the strong coupling regime.

5 Conclusion
It has been proposed in Kopell (1988) that synaptic interactions should be classified as synchronizing or desynchronizing, rather than excitatory or inhibitory, when dealing with synchronization in systems of neurons. The results of the present work support this point of view, since we have shown that the time course of the synaptic interaction plays a role as significant as its excitatory or inhibitory nature. To understand collective states of neural systems, one cannot separate synaptic properties from cellular properties. This stands out clearly in our study, since the synchronizing effect of excitation was shown to depend on the response of the neurons to perturbations. The main result of this work is that for neurons with type I response, excitation is desynchronizing in a large range of synaptic parameters that includes physiologically realistic values. Even if the synaptic times are very short, the interaction is desynchronizing for this type of neuron. In contrast, for neurons of type II sufficiently fast excitation can be synchronizing. These results are based on general arguments valid at weak coupling. The study of specific examples has allowed us to extend them to intermediate and, in some cases, strong values of the coupling strength. We have also given examples of some of the consequences of these results for the dynamics of a large network of identical excitatory neurons. The trend of the neurons to lock out-of-phase induces frustration in the network, which then settles in partially coherent states, such as rotating wave states. An important characteristic of these rotating waves is that the activities of the neurons are correlated with phase shifts. When the frustration effects in the network become too strong, a transition to the completely asynchronous state can take place in spite of the homogeneity of the network and the absence of external noise. In this paper we have focused on excitatory interactions. However, the reduction to phase models can also be used to predict the effect of inhibitory synapses. For neurons of type I, one can show that under very general conditions inhibition can be synchronizing, leading to a bistability where both the in-phase locked state and the anti-phase locked state of a pair of identical inhibitory neurons are stable.
If this synchronizing effect is sufficiently strong (for instance, for fast synapses), the anti-phase state can even lose stability, and the in-phase state is then the only stable locked state. Similar results have been found for an integrate-and-fire model in van Vreeswijk et al. (1994). A more systematic study of this effect and its consequences for large networks will be published elsewhere (Hansel et al. 1994). We have considered only large homogeneous and fully connected networks. An important issue is to assess the effect of the heterogeneities found in biological situations: dispersion of neural characteristics (membrane time constant, ionic conductances, etc.), various sources of noise, and the connectivity pattern. It would be very interesting to determine whether these can counterbalance to some extent the desynchronizing effect of excitation by effectively reducing the frustration in the system (Hansel and Mato 1993; Tsodyks et al. 1993). This would give some insight into the ubiquity of partially coherent states and phase shifts in cross-correlations for biological systems. Finally, let us remark that response functions are amenable to experiments (Reyes and Fetz 1993a,b). It would be very interesting to determine
the responses of neurons in biological systems where collective effects have been observed. Are type I responses representative? If not, this would, for instance, question the relevance of integrate-and-fire models for modeling such systems. In particular, such observations would be very interesting for central pattern generators (CPGs). Can synchronization properties in such systems be related to the response function of the neurons? One may also wonder whether the type of response can be modified by neuromodulatory effects leading to different patterns of synchrony. If so, this could have consequences from the functional point of view.
Appendix

The Hodgkin-Huxley Model. The HH model provides the simplest framework to describe spike generation in a real biological situation, namely the squid's giant axon. An HH neuron is described by a set of four variables X = (V, m, h, n), where V is the membrane potential, m and h are the activation and inactivation variables of the sodium current, and n is the activation variable of the potassium current. The corresponding equations read (Hodgkin and Huxley 1952):

$$C\,\frac{dV}{dt} = I - g_{Na}\, m^3 h\, (V - V_{Na}) - g_K\, n^4 (V - V_K) - g_l (V - V_l) \tag{A.1}$$

$$\frac{dm}{dt} = \frac{m_\infty(V) - m}{\tau_m(V)} \tag{A.2}$$

$$\frac{dh}{dt} = \frac{h_\infty(V) - h}{\tau_h(V)} \tag{A.3}$$

$$\frac{dn}{dt} = \frac{n_\infty(V) - n}{\tau_n(V)} \tag{A.4}$$

I is the external current injected into the neuron. It determines the neuron's firing rate. The parameters g_Na, g_K, and g_l are the maximum conductances per unit surface for the sodium, potassium, and leak currents, V_Na, V_K, and V_l are the corresponding reversal potentials, and C is the capacitance per unit surface. For the squid's axon, typical values of the parameters (at 6.3°C) are V_Na = 50 mV, V_K = -77 mV, V_l = -54.4 mV, g_Na = 120 mS/cm², g_K = 36 mS/cm², g_l = 0.3 mS/cm², and C = 1 μF/cm². The functions m_∞(V), h_∞(V), and n_∞(V) and the characteristic times (in milliseconds) τ_m, τ_n, τ_h are given by x_∞(V) = a_x/(a_x + b_x), τ_x = 1/(a_x + b_x) with x = m, n, h, and

a_m = 0.1(V + 40)/{1 - exp[(-V - 40)/10]}, b_m = 4 exp[(-V - 65)/18],
a_h = 0.07 exp[(-V - 65)/20], b_h = 1/{1 + exp[(-V - 35)/10]},
a_n = 0.01(V + 55)/{1 - exp[(-V - 55)/10]}, b_n = 0.125 exp[(-V - 65)/80].

For small values of I the system reaches a stable fixed point (V_eq = -65 mV for I = 0 μA/cm²). At I_1 = 9.78 μA/cm² the system undergoes an inverted Hopf bifurcation to the spiking regime. This behavior agrees
with the electrophysiological observation on the squid's axon that the oscillations start with finite amplitude and frequency. The periodic emission of spikes stops at I_2 = 154.5 μA/cm², where the fixed point becomes stable again.

The Model of Connor et al. The model of Connor et al. (1977) incorporates, in addition to the sodium and delayed-rectifier potassium currents of the HH model, an A current, and displays a much wider range of firing frequencies than the Hodgkin-Huxley model. It is well known that the firing rates (from 50 to 120 Hz) achieved by the Hodgkin-Huxley model with the usual parameters are much higher than is commonly observed in other preparations. On the basis of numerous observations, the so-called A current is often considered to widen the frequency range. Indeed, this potassium current is characterized by an inactivation that is much slower than its activation, and it plays a major role when one tries to depolarize a neuron starting from a situation of hyperpolarization. The slow deinactivation of the A current then tends to impede fast membrane depolarization, and firing rates ranging from 0 to 300 Hz are thus obtained in the model of Connor et al., with a linear current-frequency relation at low frequencies. The role of the A current in low-frequency spiking was recently investigated in detail by Rush and Rinzel (1994). The parameterizations of the sodium and delayed-rectifier potassium currents in the HH model and the model of Connor et al. are very similar. Parameters for these currents are V_Na = 55 mV, V_K = -72 mV, V_l = -17 mV, g_Na = 120 mS/cm², g_K = 20 mS/cm², g_l = 0.3 mS/cm², and C = 1 μF/cm²; x_∞(V) = a_x/(a_x + b_x), τ_x = 1/(a_x + b_x) with x = m, n, h, and

a_m = 0.1(V + 29.7)/{1 - exp[(-V - 29.7)/10]}, b_m = 4 exp[(-V - 54.7)/18],
a_h = 0.07 exp[(-V - 48)/20], b_h = 1/{1 + exp[(-V - 18)/10]},
a_n = 0.01(V + 46.7)/{1 - exp[(-V - 46.7)/10]}, b_n = 0.125 exp[(-V - 56.7)/80].

The A current is described in a similar way:
the membrane equation A.1 contains the additional current term

$$-g_A (V - V_A)\, A^3 B \tag{A.5}$$

and the gating variables A and B obey

$$\frac{dA}{dt} = \frac{A_\infty(V) - A}{\tau_A(V)}, \qquad \frac{dB}{dt} = \frac{B_\infty(V) - B}{\tau_B(V)} \tag{A.6}$$

where

$$A_\infty(V) = \left\{ \frac{0.0761\, \exp[(V + 94.22)/31.84]}{1 + \exp[(V + 1.17)/28.93]} \right\}^{1/3} \tag{A.7}$$

$$\tau_A(V) = 0.3632 + \frac{1.158}{1 + \exp[(V + 55.96)/20.12]} \tag{A.8}$$

$$B_\infty(V) = \left\{ \frac{1}{1 + \exp[(V + 53.3)/14.54]} \right\}^{4} \tag{A.9}$$

$$\tau_B(V) = 1.24 + \frac{2.678}{1 + \exp[(V + 50)/16.027]} \tag{A.10}$$
The reversal potential V_A = -75 mV is slightly different from V_K, and the conductance g_A is set here to 47.7 mS/cm². The only difference from the parameterization of Connor et al. (1977) is that we did not introduce the temperature scaling factor in the kinetics of the activation and inactivation variables.
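The bifurcation scenario quoted above for the HH model is easy to check numerically. The sketch below (forward-Euler step, simulation length, initial conditions, and spike threshold are arbitrary choices, not from the paper) integrates equations A.1-A.4 with the squid-axon parameters and counts spikes:

```python
import numpy as np

# squid-axon parameters quoted in the appendix
C, gNa, gK, gl = 1.0, 120.0, 36.0, 0.3
VNa, VK, Vl = 50.0, -77.0, -54.4

def rates(V):
    am = 0.1 * (V + 40.0) / (1.0 - np.exp(-(V + 40.0) / 10.0))
    bm = 4.0 * np.exp(-(V + 65.0) / 18.0)
    ah = 0.07 * np.exp(-(V + 65.0) / 20.0)
    bh = 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))
    an = 0.01 * (V + 55.0) / (1.0 - np.exp(-(V + 55.0) / 10.0))
    bn = 0.125 * np.exp(-(V + 65.0) / 80.0)
    return am, bm, ah, bh, an, bn

def simulate(I, T=200.0, dt=0.01):
    """Forward-Euler integration; returns the number of spikes in T msec."""
    V, m, h, n = -65.0, 0.05, 0.6, 0.32        # near the I = 0 resting state
    spikes, above = 0, False
    for _ in range(int(T / dt)):
        am, bm, ah, bh, an, bn = rates(V)
        INa = gNa * m**3 * h * (V - VNa)
        IK = gK * n**4 * (V - VK)
        Il = gl * (V - Vl)
        V += dt * (I - INa - IK - Il) / C
        m += dt * (am * (1 - m) - bm * m)
        h += dt * (ah * (1 - h) - bh * h)
        n += dt * (an * (1 - n) - bn * n)
        if V > 0.0 and not above:              # count upward 0 mV crossings
            spikes += 1
        above = V > 0.0
    return spikes

print(simulate(12.0))   # repetitive firing above the Hopf bifurcation at I_1
print(simulate(0.0))    # quiescent at rest
```

Consistent with the text, the neuron is silent at I = 0 and fires repetitively at finite frequency for currents above I_1 ≈ 9.78 μA/cm².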
Acknowledgments

We are thankful to H. Sompolinsky and M. Tsodyks for most helpful discussions. We are indebted to D. Golomb, M.-L. Monnet, S. Seung, and H. Sompolinsky for careful and critical reading of the manuscript. D.H. acknowledges the hospitality of the Center for Neural Computation and the Racah Institute of Physics of the Hebrew University. This work was partially supported by a Projet Concerté de Coopération Scientifique of the Ministère des Affaires Étrangères. Part of the simulations were performed on the CRAY-C98 of IDRIS. While we were completing this paper we learned about the related work of C. van Vreeswijk et al. (1994). We thank B. Ermentrout for having brought it to our attention.
References

Abbott, L. F., and van Vreeswijk, C. 1993. Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E 48, 1483-1490.
Bush, P., and Douglas, R. 1991. Synchronization of bursting action potential discharge in a model network of neocortical neurons. Neural Comp. 3, 19-30.
Connor, J. A., Walter, D., and McKown, R. 1977. Neural repetitive firing: Modifications of the Hodgkin-Huxley axon suggested by experimental results from crustacean axons. Biophys. J. 18, 81-102.
Ermentrout, G. B., and Kopell, N. 1984. Frequency plateaus in a chain of weakly coupled oscillators, I. SIAM J. Math. Anal. 15, 215-237.
Ermentrout, G. B., and Kopell, N. 1991. Multiple pulse interactions and averaging in systems of coupled neural oscillators. J. Math. Biol. 29, 195-217.
Golomb, D., and Rinzel, J. 1994. Clustering in globally coupled inhibitory neurons. Physica D 71, 259-282.
Golomb, D., Hansel, D., Shraiman, B., and Sompolinsky, H. 1992. Clustering in globally coupled phase oscillators. Phys. Rev. A 45, 3516-3530.
Golomb, D., Wang, X. J., and Rinzel, J. 1994. Synchronization properties of spindle oscillations in a thalamic reticular nucleus model. J. Neurophysiol. 72, 1109-1126.
Grannan, E. R., Kleinfeld, D., and Sompolinsky, H. 1992. Stimulus-dependent synchronization of neuronal assemblies. Neural Comp. 4, 550-569.
Hansel, D., and Mato, G. 1993. Patterns of synchrony in a heterogeneous Hodgkin-Huxley neural network with weak coupling. Physica A 200, 662-669.
Hansel, D., Mato, G., and Meunier, C. 1993a. Phase dynamics for weakly coupled Hodgkin-Huxley neurons. Europhys. Lett. 23, 367-372.
Hansel, D., Mato, G., and Meunier, C. 1993b. Clustering and slow switching in globally coupled phase oscillators. Phys. Rev. E 48, 3470-3477.
Hansel, D., Mato, G., and Meunier, C. 1993c. Phase reduction and neural modeling. In Functional Analysis of the Brain Based on Multiple-Site Recordings, October 1992. Concepts Neurosci. 4, 192-210.
Hansel, D., Mato, G., and Meunier, C. 1994. In preparation.
Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London) 117, 500-544.
Kaneko, K. 1990. Clustering, coding, switching, hierarchical ordering and control in a network of chaotic elements. Physica D 41, 136-172.
Kopell, N. 1988. Toward a theory of modeling central pattern generators. In Neural Control of Rhythmic Movements in Vertebrates, A. Cohen, ed., pp. 369-413. John Wiley, New York.
Kuramoto, Y. 1984. Chemical Oscillations, Waves and Turbulence. Springer, New York.
Kuramoto, Y. 1991. Collective synchronization of pulse-coupled oscillators and excitable units. Physica D 50, 15-30.
Mirollo, R. E., and Strogatz, S. H. 1990. Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math. 50, 1645-1662.
Monnet, M.-L., Hansel, D., Mato, G., and Meunier, C. 1994. In preparation.
Neu, J. 1979. Coupled chemical oscillators. SIAM J. Appl. Math. 37, 307-315.
Okuda, K. 1993. Variety and generality of clustering in globally coupled oscillators. Physica D 63, 424-436.
Pinsky, P. 1994. Mathematical models of hippocampal neurons and neural networks: Exploiting multiple time scales. Ph.D. thesis, University of Maryland.
Reyes, A. D., and Fetz, E. E. 1993a. Two modes of interspike interval shortening by brief transient depolarizations in cat neocortical neurons. J. Neurophysiol. 69, 1661-1672.
Reyes, A. D., and Fetz, E. E. 1993b. Effects of transient depolarizing potentials on the firing rate of cat neocortical neurons. J. Neurophysiol. 69, 1673-1683.
Rogawski, M. A. 1985. The A-current: How ubiquitous a feature of excitable cells is it? TINS 8, 214-219.
Rush, M. E., and Rinzel, J. 1994. The potassium A-current, low firing rates and rebound excitation in Hodgkin-Huxley models. Preprint.
Stone, E., and Holmes, P. 1991. Unstable fixed points, heteroclinic cycles and exponential tails in turbulence production. Phys. Lett. A 155, 29-42.
Strogatz, S. H., and Mirollo, R. E. 1991. Stability of incoherence in a population of coupled oscillators. J. Stat. Phys. 63, 613-635.
Traub, R., and Miles, R. 1991. Neuronal Networks of the Hippocampus. Cambridge University Press, New York.
Treves, A. 1993. Mean field analysis of neuronal spike dynamics. Network 4, 259-284.
Tsodyks, M., Mitkov, I., and Sompolinsky, H. 1993. Patterns of synchrony in an integrate-and-fire network. Phys. Rev. Lett. 71, 1280-1283.
Tuckwell, H. C. 1988. Introduction to Theoretical Neurobiology. Cambridge University Press, New York.
Watanabe, S., and Strogatz, S. H. 1993. Integrability of a globally coupled oscillator array. Phys. Rev. Lett. 70, 2391-2394.
van Vreeswijk, C., Abbott, L. F., and Ermentrout, G. B. 1994. When inhibition not excitation synchronizes neural firing. J. Comp. Neurosci., submitted.
Decorrelated Hebbian Learning
the presented input pattern. Consequently, different neurons tend to match different features in the input space and thus a set of feature detectors is formed. Several authors (Sanger 1989; Foldiak 1990; Oja 1989) relate competitive and unsupervised learning to vector quantization techniques and principal component analysis. Kohonen (1984) developed a competitive learning paradigm that preserves the topological properties of the input space. In the Kohonen model, the classical winner-take-all competitive learning is altered so that several neighboring neurons in a predefined ordered lattice are updated simultaneously. In dealing with very different input signals, a case where the traditional Kohonen network fails, substantial improvements were made by introducing a "dynamically adapted neighbourhood" of the neurons in the Kohonen lattice (Kangas et al. 1990). An alternative approach is the so-called "neural-gas" model of Martinez et al. (1993), in which the updated units are neighbors not in the predescribed lattice but in the input space itself. A different learning algorithm, called competitive Hebbian learning, was formulated for neurons with sigmoidal activation functions (White 1992). In this algorithm the neurons learn simultaneously (not in the "winner-take-all" fashion) to respond to different features of the training set, under a restriction on the correlation between different neuron outputs. The aim of this work is to introduce a learning paradigm for radial basis functions with the following characteristics:

1. It is a competitive learning, but it is not based on the "winner-and-neighbors-take-all" strategy.
2. It results in a strong local response to each input pattern while guaranteeing uncorrelated neural outputs (without using lateral connections).

3. The learning is unsupervised and is derived via gradient updates of a particular cost function.

The cost function, as shown later, has two competing terms. The first term corresponds to Hebbian (Hebb 1949) learning, while the second term penalizes the correlation between the outputs of the gaussians and hence has an anti-Hebbian effect. Because of this penalty on the correlation, we refer to the presented learning paradigm as decorrelated Hebbian learning (DHL). An application of the DHL to clustering the (input) data space is straightforward. After being presented with the input patterns, the DHL partitions the input space into minimally overlapping clusters that effectively cover only the region where the input data are present. Furthermore, if the sum of the outputs of the gaussians is normalized to one, the clustering problem becomes related to the problem of the input
G. Deco and D. Obradovic
probability density estimation. In the latter, the input probability density is approximated by a sum of normal distributions. Although clustering of the input space (or input probability density estimation) is a well-defined task in itself, its result can be further utilized in the approximation of an unknown function that maps the input space into a measurable output space. This paper also presents a DHL-based learning algorithm for approximating a real-valued function with a single-hidden-layer gaussian neural network. The approximation algorithm has two stages, with the DHL being used in the first one. In the second stage, the function approximation is constructed as a weighted piecewise-linear approximation of the function on the localized regions defined by the unsupervised learning in the DHL. The application of the DHL is illustrated on chaotic time-series prediction.

2 Theoretical Foundation
Let us define a single-layer network of gaussian neurons with centers $w_i$ and widths $\sigma_i$, whose activations are normalized by partition to one. The activation of the $i$th neuron is defined as

$$y_i(\xi) = \frac{z_i(\xi)}{\sum_k z_k(\xi)} \qquad (2.1)$$

$$z_i(\xi) = \exp\!\left(-\frac{\|\xi - w_i\|^2}{2\sigma_i^2}\right) \qquad (2.2)$$

and where $\xi$ is the input vector. The function given by 2.1 is the well-known softmax (or Potts) function and it has a probability interpretation (Bridle 1989; Jacobs et al. 1991; Nowlan 1991). Furthermore, let us define a cost function $H'$ as follows:

$$H' = -\sum_i \int d\xi\, P(\xi)\, y_i(\xi) + \alpha \sum_i \sum_{j \neq i} \int d\xi\, P(\xi)\, y_i(\xi)\, y_j(\xi) \qquad (2.3)$$
where $P(\xi)$ is the probability density of the input patterns. Unsupervised learning is defined herein as minimization of the cost function presented in equation 2.3 with respect to the centers $w_i$. The first term on the right-hand side of 2.3 induces attraction of all neurons toward the region of the input space where input data exist. The second term of the cost function penalizes overlapping of the outputs of the neurons with respect to each pattern. In other words, by minimizing $H'$ we reward the coverage of relevant regions of the input space with gaussian functions but, at the same time, penalize the overlapping of the gaussian units. The first term in the cost function tries to position the centers of the gaussian radial basis functions in a way that they approximate the distribution of
the input data. The second term is a repulsion term between the centers of the localized gaussian functions that causes the decorrelation between the gaussian neurons. This decorrelation effect is very important since it implies an optimal use of the representation resources (the gaussian neurons). Optimality in the latter case stands for avoiding the redundant information about the input distribution that would be present if the gaussians overlapped. The reason for applying the softmax in our case is threefold. The first reason is that the result of the clustering with the normalized gaussians has a probabilistic interpretation as an approximation of the input probability density function. The second reason is that the resulting cost function will no longer depend on the Lagrange multiplier $\alpha$, as will be seen in equation 2.7. Finally, the third reason is to improve the influence on the cost function in 2.3 of gaussian centers initially placed outside the regions where the input data are present. In the case where the normalization is not present, the influence of the "badly" placed gaussian centers is negligible, i.e., their activation is almost zero, and they might remain inactive. On the other hand, the normalization might improve their influence and make it easier for the learning to move them to the regions covered by the input data. Due to the softmax normalization the cost function $H'$ becomes:
$$H' = -\sum_i \left[\int d\xi\, P(\xi)\, y_i(\xi)\right] + \alpha \sum_i \sum_{j \neq i} \left[\int d\xi\, P(\xi)\, y_i(\xi)\, y_j(\xi)\right] \qquad (2.5)$$

$$= -1 + \alpha \left[ 1 - \sum_i \int d\xi\, P(\xi)\, y_i^2(\xi) \right] \qquad (2.6)$$

where we have used the fact that $\sum_k y_k(\xi) = 1$. Ignoring the constant terms in 2.6, we define a new cost function

$$H = -\sum_i \left[\int d\xi\, P(\xi)\, y_i^2(\xi)\right] \qquad (2.7)$$
We see that minimization of the cost function H is equivalent to minimization of H'. The advantage is that H does not depend on α. To derive a learning rule that minimizes the cost function of 2.7 we apply the gradient method. The weights are updated as follows:

Δw_i = -η ∂H/∂w_i                                                       (2.8)

where η is the learning coefficient. After some algebra it is easy to find the updating rule for the centers of the gaussians by using 2.8. The update rule is as follows:

Δw_i = (2η/σ²) ∫ dξ P(ξ) y_i(ξ) [ y_i(ξ) - ∑_k y_k²(ξ) ] (ξ - w_i)      (2.9)
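A per-pattern (stochastic) version of 2.9 is straightforward to implement. The sketch below is our own illustration, not the authors' code: the function name is invented, the factor 2/σ² is absorbed into the learning coefficient η, and the exponential width annealing used in the paper is omitted.

```python
import numpy as np

def dhl_step(W, xi, sigma, eta):
    """One per-pattern DHL update of the gaussian centers W (m x d).

    Implements a stochastic version of equation 2.9: the softmax-normalized
    activations y_i are computed for the pattern xi, and each center is
    attracted to (or repelled from) xi according to y_i*(y_i - sum_k y_k^2).
    The factor 2/sigma^2 is absorbed into eta (an assumption of this sketch).
    """
    d2 = np.sum((xi - W) ** 2, axis=1)        # squared distances to all centers
    g = np.exp(-d2 / (2.0 * sigma ** 2))      # gaussian activations
    y = g / np.sum(g)                         # softmax normalization
    coef = y * (y - np.sum(y ** 2))           # decorrelating gain per center
    return W + eta * coef[:, None] * (xi - W)

# toy usage: 20 centers, 1-d input drawn from [0, 0.3] U [0.5, 0.8]
rng = np.random.default_rng(0)
W = rng.uniform(0.0, 1.0, size=(20, 1))
for _ in range(2000):
    lo, hi = (0.0, 0.3) if rng.random() < 0.5 else (0.5, 0.8)
    W = dhl_step(W, rng.uniform(lo, hi, size=1), sigma=0.05, eta=0.5)
```

Each update attracts the centers whose normalized activation exceeds the mean squared activation ∑_k y_k² toward the pattern and repels the rest, which is the decorrelating effect discussed above.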
G. Deco and D. Obradovic
Equation 2.9 defines the DHL learning rule for the centers. The widths of all the gaussians are kept equal, and they are selected to cover the whole region where input data exist with small overlap. Consequently, the common width sets the resolution of the clustering. To avoid local minima during the training of the centers with DHL, we usually start with a gaussian width that initially covers the whole input region and then decays exponentially until an a priori set minimum resolution width is reached.¹ Once the centers of the gaussians are determined, the outputs of the neurons are used to obtain a weighted piece-wise linear approximation of the single-output function by minimizing the following cost function:
E = ∑_{j=1}^{N} [ z_j - ∑_{i=1}^{m} y_i(ξ_j) (a_i + b_iᵀ ξ_j) ]²         (2.10)

where z_j is the jth pattern of the function output, N is the number of patterns, and m is the number of gaussian neurons. The idea of blending together local maps was first discussed by Shepard (1968), and recently several authors have applied this idea in the neural network context (see, for example, Stokbro et al. 1990; Ritter 1991; Omohundro 1991; Martinetz et al. 1993). In 2.10 the parameters to be adjusted are a_i and the vectors b_i, whose dimension is equal to the number of function inputs. Hence, the total number of parameters to be determined in the second learning stage is m·[1 + (number of function inputs)]. It is easy to see that if the parameters b_i are set to zero, the cost function in 2.10 corresponds to the LMS (least mean square) problem presented in Moody and Darken (1988). The cost function in 2.10 has a unique (or many equally good) solution due to the linearity of the approximation error with respect to its parameters. The application of y_i(ξ_j) as a scaling factor guarantees that the linear approximation is localized to the region where the input information is maximized.

3 Simulations
To demonstrate the efficiency of decorrelated Hebbian learning (DHL), we first analyze a simple artificial example. A single input is distributed uniformly over two closed regions, [0, 0.3] and [0.5, 0.8]. We study the evolution of a layer of 20 gaussian neurons whose centers are initially allocated randomly in the region [0, 1]. Figure 1 shows the evolution of each center position as a function of the iteration epoch, using equation 2.9. During the training, all the gaussian neurons have the same width.¹

¹This strategy results from mean field annealing and is based on the Peierls inequality (Bilbro et al. 1992).

Figure 1: Evolution of the center positions of 20 gaussian neurons during the DHL as a function of epoch number. The input is a uniformly distributed variable over the ranges [0, 0.3] and [0.5, 0.8].

The initial value of the width, equal to 10, was decreased exponentially in every epoch according to the predefined update law. The final value of the width after 400 epochs was 0.02. As seen in Figure 1, the algorithm converged to a uniform and decorrelated distribution of the centers over the region of the two input data sets. Therefore, the DHL resulted in a set of minimally overlapping clusters that cover the region of the input space where the actual data are located.

In addition to clustering, we present herein an application of the DHL in the neural network approximation of an unknown function. The application of gaussian neural networks in function approximation was studied by Moody and Darken (1988). As they point out, the one-stage supervised learning that minimizes the function approximation error often places the centers of the gaussians outside of the input
region and makes the corresponding standard deviations large. Consequently, the localization properties of gaussian neurons are not utilized, and supervised learning in gaussian networks has no advantage over supervised learning in, for example, sigmoidal networks. In order to retain the local properties of gaussian networks, Moody and Darken (1988) introduce a two-stage algorithm whose first stage places the centers of the gaussians only in the region of the input space where input data actually exist. The center locations are determined by the k-means clustering algorithm (Lloyd 1982), while the widths (standard deviations) are determined using "P nearest neighbor" heuristics. Once the centers and standard deviations are determined, the second stage of the learning process defined in Moody and Darken (1988) finds the optimal values of the heights of the gaussians so that the approximation error of the function is minimized. Hence, in the second training stage the prediction error is linear in the unknown parameters (heights), and the corresponding optimization has a unique solution. A possible shortcoming of this method is that the centers and the widths are determined by an optimization criterion that does not explicitly depend on the gaussian functions. Furthermore, the second stage is often too restrictive because it allows only the simplest linear coupling between the network output (function approximation) and the output of the gaussian neurons. In this paper we introduce a novel two-stage function approximation algorithm that has the following characteristics:

1. The DHL is used in the first stage to cluster the input space into minimally overlapping regions that contain input patterns.
2. The function approximation is constructed in the second stage as a weighted linear or piece-wise linear approximation of the function on the localized regions defined by the unsupervised learning in the first stage.

Therefore, the second learning stage allows more appropriate constructions of the gaussian-based function approximant than in Moody and Darken (1988), since the initial clustering explicitly depends on the gaussian functions. The example used herein to illustrate the use of DHL in function approximation is the standard Mackey-Glass time-series benchmark. The delay difference equation of Mackey and Glass (1977) can be expressed as
x(t+1) = (1 - b) x(t) + a x(t - τ) / [1 + x¹⁰(t - τ)]                   (3.1)
We used a = 0.2, b = 0.1, and τ = 17, the same parameters as in Moody and Darken (1989). The training set contains four inputs, x(t), x(t-6), x(t-12), x(t-18), and the network has to predict the output x(t+85).
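Equation 3.1 can be iterated directly to produce the training data. The following sketch is our own (the constant initial history x0 is an assumption; the paper does not specify the initialization):

```python
import numpy as np

def mackey_glass(n, a=0.2, b=0.1, tau=17, x0=1.2):
    """Iterate equation 3.1: x(t+1) = (1-b) x(t) + a x(t-tau) / (1 + x(t-tau)**10).

    The constant initial history x0 is an assumption of this sketch.
    """
    x = np.full(n + tau, x0)
    for t in range(tau, n + tau - 1):
        x[t + 1] = (1 - b) * x[t] + a * x[t - tau] / (1 + x[t - tau] ** 10)
    return x[tau:]

# training pairs as described in the text: inputs x(t), x(t-6), x(t-12),
# x(t-18) and target x(t+85)
x = mackey_glass(2000)
lags, horizon = (0, 6, 12, 18), 85
rows = range(18, len(x) - horizon)
X = np.array([[x[t - lag] for lag in lags] for t in rows])
z = np.array([x[t + horizon] for t in rows])
```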
Figure 2: The projection of the initial distribution of the centers of gaussian neurons and the training data onto the space spanned by the first three inputs in the Mackey-Glass time-series prediction.

The DHL is used as an unsupervised learning method for positioning the centers of gaussian neurons to fit the Mackey-Glass chaotic attractor. The test set consists of 500 points not contained in the training set of 1000 data points. The initial value of the width for the gaussians was 5, and it was forced to decay exponentially to 0.1 over 10,000 epochs. The initial and final distributions of the gaussians due to DHL learning are depicted in Figures 2 and 3. The resulting centers of the gaussians after DHL learning are then used for linear and piece-wise linear approximation. In the linear optimization, the widths of all gaussians (100 parameters) were fixed to different values, and for each of them the optimal parameters (gaussian heights) were obtained. Since the cost function is linear in the parameters, the optimal heights were obtained by computing the pseudoinverse of the corresponding data matrix. The best result was a normalized error of 0.2127, obtained with a gaussian width of 0.6. The same process was repeated in the case of the piece-wise linearization, with the difference that the number of parameters was now 500. The smallest value of the cost function was 0.056, achieved with a constant gaussian width of 0.01. The sizes of the optimal gaussian widths in the linear and piece-wise linear maps seem to decrease with the number of free parameters in these two approximation methods.
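The linear second stage can be written in a few lines: with the centers and a common width fixed, the network output is linear in the heights, so the least-squares optimum follows from the pseudoinverse of the activation matrix. The sketch below is illustrative only (the function name, the toy 1-d target, and all numerical choices are ours, not the Mackey-Glass setup of the paper):

```python
import numpy as np

def fit_heights(centers, width, X, z):
    """Second-stage linear fit: with centers and a common width fixed, the
    output sum_i a_i y_i(x) is linear in the heights a, so the least-squares
    optimum follows from the pseudoinverse (function name is ours)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-d2 / (2.0 * width ** 2))      # raw gaussian activations
    Y = G / G.sum(axis=1, keepdims=True)      # softmax-normalized activations
    a = np.linalg.pinv(Y) @ z                 # optimal heights
    return a, Y @ a                           # heights and fitted values

# toy check on a smooth 1-d target (not the paper's Mackey-Glass data)
X = np.linspace(0.0, 1.0, 200)[:, None]
centers = np.linspace(0.0, 1.0, 20)[:, None]
z = np.sin(2.0 * np.pi * X[:, 0])
a, zhat = fit_heights(centers, 0.05, X, z)
err = np.sqrt(np.mean((z - zhat) ** 2))
```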
Figure 3: The projection of the final distribution of the centers of gaussian neurons (after the DHL) and the training data onto the space spanned by the first three inputs in the Mackey-Glass time-series prediction.

In order to compare the DHL method with some existing methods, we repeated the same linear and piece-wise linear approximation but with the centers determined by the k-means algorithm of Lloyd (1982). The number of gaussians was 100, the same as in the DHL case. For the same values of the gaussian widths as in the DHL case, the resulting cost function value in the linear approximation was 0.55, and for the piece-wise approximation it was 0.19. The results from the DHL- and k-means-based linear and piece-wise linear approximations are presented in Table 1.

Table 1: Results from Linear and Piece-wise Linear Approximations

                                                      Number of        Normalized error
                                                      parameters       on the test set
DHL + optimal linear output layer                     401 + 100 = 501  0.2127
k-means + optimal linear output layer                 401 + 100 = 501  0.5504
DHL + weighted piece-wise linear approximation        401 + 500 = 901  0.056
k-means + weighted piece-wise linear approximation    401 + 500 = 901  0.1966

In addition to the linear and piece-wise linear approximation where the centers obtained by the DHL learning were kept constant, we also performed a full-scale approximation (all 901 network parameters, including centers, widths, output weights, and the output bias) based on the batch quasi-Newton method with a line search. All the parameters in the network were optimized, with the initial values of the widths, the output bias, and the heights chosen randomly, while the initial center positions were those resulting from the DHL phase. The cost function of the quasi-Newton method was 0.0376, which is better than the result of the linear and the weighted piece-wise linear (WPWL) approximations. This is understandable, since the number of parameters updated by the quasi-Newton method is 901, while the number of parameters in the WPWL is 500 and only 100 in the linear approximation. On the other hand, the WPWL and the linear method provided the optimal result in a single step (for a fixed gaussian width), while the quasi-Newton method required several hundred epochs.
References

Bilbro, G., Snyder, W., Garnier, S., and Gault, J. 1992. Mean field annealing: A formalism for constructing GNC-like algorithms.
Bridle, J. 1989. Probabilistic interpretation of feedforward classification network outputs, with relationship to statistical pattern recognition. In Neuro-Computing: Algorithms, Architectures and Applications, F. Fogelman-Soulie and J. Herault, eds. Springer-Verlag, Berlin.
Foldiak, P. 1990. Forming sparse representations by local anti-Hebbian learning. Biol. Cybern. 64, 165-170.
Hebb, D. 1949. The Organization of Behavior. Wiley, New York.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Hubel, D. H., and Wiesel, T. N. 1962. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160, 106-154.
Jacobs, R., Jordan, M., Nowlan, S., and Hinton, G. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.
Kangas, J., Kohonen, T., and Laaksonen, J. 1990. Variants of self-organizing maps. IEEE Trans. Neural Networks 1(1), 93-99.
Kohonen, T. 1984. Self-Organization and Associative Memory. Springer Series in Information Sciences, Vol. 8. Springer-Verlag, Heidelberg.
Lloyd, S. P. 1982. Least squares quantization in PCM. Bell Laboratories internal tech. rep.; IEEE Trans. Inform. Theory IT-28(2).
Mackey, M., and Glass, L. 1977. Oscillation and chaos in physiological control systems. Science 197, 287.
Martinetz, T., Berkovich, S., and Schulten, K. 1993. Neural gas network for vector quantization and its application to time-series prediction. IEEE Trans. Neural Networks, in press.
Moody, J., and Darken, C. 1988. Learning with localized receptive fields. In Proc. 1988 Connectionist Models Summer School, D. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Nowlan, S. 1991. Soft competitive adaptation: Neural network learning algorithms based on fitting statistical mixtures. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Oja, E. 1989. Neural networks, principal components, and subspaces. Int. J. Neural Syst. 1(1), 61-68.
Omohundro, S. 1991. Bumptrees for efficient function, constraint, and classification learning. Adv. Neural Inform. Process. Syst. 3, 693-699.
Ritter, H. 1991. Learning with the self-organizing map. In Artificial Neural Networks 1, T. Kohonen, K. Makisara, O. Simula, and J. Kangas, eds., pp. 357-364. Elsevier Science Publishers, Amsterdam.
Rumelhart, D., and Zipser, D. 1985. Feature discovery by competitive learning. Cog. Sci. 9, 75-112.
Sanger, T. 1989. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks 2, 459-473.
Shepard, D. 1968. A two-dimensional interpolation function for irregularly spaced data. Proc. 23rd Natl. Conf. ACM, 517-523.
Stokbro, K., Umberger, D. K., and Hertz, J. A. 1990. Exploiting neurons with localized receptive fields to learn chaos. Preprint 90/28 S, Nordita, Copenhagen, Denmark.
White, R. 1992. Competitive Hebbian learning: Algorithm and demonstrations. Neural Networks 5, 261-275.
Received March 5, 1993; accepted August 19, 1994.
Communicated by Michael Jordan
Identification Using Feedforward Networks

Asriel U. Levin¹ and Kumpati S. Narendra
Center for Systems Science, Department of Electrical Engineering, Yale University, New Haven, CT 06520 USA

This paper is concerned with the identification of an unknown nonlinear dynamic system when only the inputs and outputs are accessible for measurement. Specifically, we investigate the use of feedforward neural networks as models for the input-output behavior of such systems. Relying on the approximation capabilities of feedforward neural networks, and under mild assumptions regarding the properties of the underlying nonlinear system, it is shown that there exists a feedforward network that for almost all inputs (an open and dense set) will display the input-output behavior of the system.

1 Introduction

In recent years several authors (e.g., Jordan 1986; Narendra and Parthasarathy 1990) have suggested the use of recursive neural network models of the form²
y(k+1) = N[y(k), y(k-1), ..., y(k-l+1), u(k), u(k-1), ..., u(k-l+1)]    (1.1)

for the identification of nonlinear dynamic systems. While empirical tests and simulations show such structures³ to be effective identification models, it is not clear how universal these models are and whether they can describe the input-output behavior of general nonlinear dynamic systems. An often quoted result (Takens 1981) establishes the existence of input-output models for homogeneous systems. For nonhomogeneous systems, however, only local results stating sufficient conditions for the existence of such models in a restricted region of operation around an equilibrium point have been reported (Leontaritis and Billings 1985; Levin and Narendra 1992). In this paper, using some fundamental concepts from system theory and differential topology, we establish the existence of global input-output models for a general class of nonlinear systems. To keep the

¹Present address: Wells Fargo Nikko Investment Advisors, Advanced Strategies and Research Group, 45 Fremont St., San Francisco, CA 94105 USA.
²In the following, N(·) denotes a feedforward neural network.
³To be referred to as input-output models.
Neural Computation 7, 349-357 (1995)
© 1995 Massachusetts Institute of Technology
presentation concise, only sketches of the proofs are provided in the appendix. For a comprehensive analysis of the identification of nonlinear dynamic systems using neural networks, the reader is referred to Levin (1992) and Levin and Narendra (1992).

2 Systems and Networks
A general finite-dimensional, deterministic, discrete-time, time-invariant process is described by the equations

x(k+1) = f[x(k), u(k)]
y(k) = h[x(k)]                                                          (2.1)

where x ∈ ℝⁿ are the internal states of the system, u ∈ ℝʳ are the inputs, y ∈ ℝᵐ are the outputs, f: ℝⁿ × ℝʳ → ℝⁿ, and h: ℝⁿ → ℝᵐ.⁴ Assuming f and h to be unknown, the problem of identification is to construct a model that displays the same input-output behavior as the original system. In general, the states are not directly accessible; thus identification has to be done using only input and output measurements.

2.1 Linear Systems. Extensive literature exists on linear system identification. It is well known that the input-output behavior of a single-input single-output (SISO) linear system of order n
x(k+1) = A x(k) + b u(k)
y(k) = cᵀ x(k)                                                          (2.2)
where A is an n × n matrix and b and c are n-dimensional vectors, can be realized by a recursive relation of the form

y(k+1) = ∑_{i=1}^{n} α_i y(k-i+1) + ∑_{i=1}^{n} β_i u(k-i+1)            (2.3)

Hence, for linear systems, for a given order of the underlying system, identification reduces to the task of parameter estimation. Once the nonlinear domain is entered, the problem becomes substantially more complex. One cannot hope to identify an unknown nonlinear system by using an arbitrarily chosen model, no matter how sophisticated the parameter estimation techniques being used. Hence, in order to be able to use the adaptivity of neural networks for the identification of nonlinear dynamic systems, an appropriate model that can theoretically realize the input-output behavior of the observed system needs to be employed.

⁴For clarity of exposition the results will be stated for r = m = 1.
2.2 Possible Neural Network Models. Relying on the approximation capabilities of feedforward neural networks (Cybenko 1989; Hornik et al. 1989), each of the functions f and h can be approximated by a multilayer feedforward neural network with appropriate input and output dimensions. Thus, the input-output behavior of 2.1 can be realized by a system of the form

x̂(k+1) = N_f[x̂(k), u(k)]
ŷ(k) = N_h[x̂(k)]                                                        (2.4)

Since the system's states are assumed not to be accessible, training such a network to identify the system requires the use of dynamic backpropagation, which is a computationally very intensive procedure, and hence hard and slow to implement (Levin and Narendra 1992). If instead, as in the linear case, it is possible to determine the future outputs of the system as a function of past observations of the inputs and outputs, i.e., there exists a number l and a continuous function Φ: Yˡ × Uˡ → Y such that the recursive (or input-output) model
y(k+1) = Φ[y(k), y(k-1), ..., y(k-l+1), u(k), u(k-1), ..., u(k-l+1)]    (2.5)

has the same input-output behavior as the original system 2.1, then Φ(·) can be realized by a feedforward neural network, resulting in the model given in 1.1. Since both the inputs and outputs to the network are directly observable at each instant of time, static backpropagation (or any other supervised training method) can be used to train the network. In the following we establish sufficient conditions for the existence of such models.

3 Some Useful Notions
The results presented will make use of the following concepts:

Genericity: In the real world, no continuous quantity or functional relationship is ever perfectly determined. The only physically meaningful properties of a mapping, consequently, are those that remain valid when the map is slightly deformed, i.e., its stable properties. A property is generic if it is stable and dense, that is, if any function may be deformed by an arbitrarily small amount into a function that possesses that property. Physically, only stable maps can be observed, but if a property is generic, all observed maps will possess it.

Transversality: One such property is transversality,⁵ which concerns the "typical" manner in which manifolds and maps intersect:

Definition 1. Let X and Y be smooth manifolds and f: X → Y be a smooth mapping. Let W be a submanifold of Y and x a point in X.

⁵For an excellent introduction see Guillemin and Pollack (1974).
Then f intersects W transversally at x (denoted by f ⋔ W at x) if either one of the following holds:

1. f(x) ∉ W.
2. f(x) ∈ W and T_f(x)Y = T_f(x)W + (df)_x(T_xX), where T_aB denotes the tangent space to B at a.

f intersects W transversally (denoted by f ⋔ W) if f ⋔ W at x for all x ∈ X.
Let X and Y be smooth manifolds and W be a closed submanifold of Y. The genericity of transversality means that the set of smooth mappings f: X → Y that intersect W transversally is open and dense in C^∞. The key to transversality is families of mappings. Suppose f_s: X → Y is a family of smooth maps, indexed by a parameter s that ranges over a set S. Consider the map F: X × S → Y defined by F(x, s) = f_s(x). We require that the mapping vary smoothly by assuming S to be a manifold and F to be smooth. The central theorem is

Theorem 1 (Transversality Theorem). Suppose F: X × S → Y is a smooth map of manifolds and let W be a submanifold of Y. If F ⋔ W, then for almost every s ∈ S (i.e., generic s), f_s is transversal to W.

Finally, an important consequence of transversality is given by the following proposition (Golubitsky and Guillemin 1973):

Proposition 1. Let X and Y be smooth manifolds and W be a submanifold of Y. Suppose dim W + dim X < dim Y. Let f: X → Y be a smooth mapping and suppose that f ⋔ W. Then f(X) ∩ W = ∅.
For example, if two lines are picked at random in three-dimensional space, they will not intersect (which suits our intuition well). When applying the notion of transversality to the identification of nonlinear systems, we will make use of this last result to determine the minimum number of past observations required to build an input-output model.

Generic Observability: One of the fundamental concepts of systems theory, which concerns the ability to determine the states of a dynamic system from observations of its inputs and outputs, is observability:

Definition 2. A dynamic system is said to be observable if for any two states x₁ and x₂ there exists an input sequence of finite length l, U_l = [u(0), u(1), ..., u(l-1)], such that Y_l(x₁, U_l) ≠ Y_l(x₂, U_l), where Y_l is the output sequence.

A desirable situation would be if any input sequence of length l sufficed to determine the state uniquely for some integer l. This form of observability will be referred to as strong observability. It readily follows that any observable linear system is strongly observable with l = n, n being the order of the linear system.
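For the linear system 2.2 the claim can be checked directly: n output measurements determine the state exactly when the observability matrix, whose rows are cᵀ, cᵀA, ..., cᵀAⁿ⁻¹, has full rank n. A small sketch (the function name and example pairs are our own):

```python
import numpy as np

def observability_matrix(A, c):
    """Stack c^T, c^T A, ..., c^T A^(n-1); the linear system 2.2 is
    observable iff this matrix has full rank n (function name is ours)."""
    n = A.shape[0]
    rows = [c]
    for _ in range(n - 1):
        rows.append(rows[-1] @ A)             # next row: previous row times A
    return np.vstack(rows)

# an observable pair: two measurements y(0), y(1) determine the state
A = np.array([[0.0, 1.0], [-0.2, 0.9]])
c = np.array([1.0, 0.0])
rank = np.linalg.matrix_rank(observability_matrix(A, c))     # full rank: 2

# an unobservable pair: c never distinguishes the second state component
A2 = np.diag([0.5, 0.5])
c2 = np.array([1.0, 0.0])
rank2 = np.linalg.matrix_rank(observability_matrix(A2, c2))  # rank 1
```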
Unfortunately, unlike the linear case, global strong observability is too stringent a requirement and may not hold (or hold only locally) for most nonlinear systems of the form 2.1. However, practical determination of the state can still be achieved if there exists an integer l such that almost any (generic) input sequence of length greater than or equal to l will uniquely determine the state. This will be termed (l-step) generic observability. It is this notion of generic observability that will help us establish the existence of global input-output models for 2.1.

The Observer: If a system is observable, then the states can be computed from input-output observations. A formal way of looking at this is that there exists another system Σ' (static or dynamic) that, when fed with the input and output observations of Σ, will generate the state of the system at its outputs. This latter system is referred to as the observer.

4 Main Results
The existence of an input-output model for a general nonlinear system will be derived in several stages. First we show how observability of a system can be described as a transversal intersection between maps. Through that, the genericity of transversal intersections will be used to prove the genericity of generically observable systems. On the other hand, we prove that a generically observable system can be realized by an input-output model. Finally, bringing the two together, we conclude that generic systems of the form 2.1 can be identified using the recursive model 1.1. Since the true system is not known, certain assumptions concerning its structure need to be made to make the problem meaningful and tractable. We will make the following assumptions concerning the system:

1. f and h are smooth.
2. The system is state invertible.⁶
3. An upper bound on the order of the system is given.
To express the observability of Σ in terms of transversality conditions, we need the notion of the diagonal:

Definition 3. Let X be a smooth manifold and let x ∈ X. The diagonal Δ(X × X) is the set of points of the form (x, x).

⁶We will call the system 2.1 state invertible if, for a given u, f defines a diffeomorphism on X. For a given input sequence, the invertibility of a system guarantees that the future as well as the past of a state are unique. State-invertible systems arise naturally when continuous-time systems are sampled or when an Euler approximation is used to discretize a differential equation. Since we are generally interested in modeling processes whose underlying behavior is governed by deterministic differential equations, we can limit ourselves to invertible systems.
Recalling the definition of observability, a system is observable if for a given input the mapping from the state space to the output is injective, i.e., Y(x₁, U) = Y(x₂, U) iff x₁ = x₂. This is equivalent to saying that for any x₁ ≠ x₂, (Y_l(x₁, U_l), Y_l(x₂, U_l)) ∉ Δ(Y_l × Y_l). Now, from Proposition 1, transversality implies empty intersection if dim Δ(Y_l × Y_l) + 2 dim X < 2 dim Y_l, and since dim Δ(Y_l × Y_l) = dim Y_l = l and dim X = n, observability of the system is equivalent to a transversal intersection with the diagonal if

l ≥ 2n + 1

With these in mind, the following can be shown:⁷

1. Homogeneous systems are generically observable: The system

x(k+1) = f[x(k)]
y(k) = h[x(k)]                                                          (4.1)
is observable for generic f and h if at least 2n + 1 output measurements are taken. The equivalent result for homogeneous continuous-time systems of the form ẋ = f(x), y = h(x) was proven by Aeyels (1981). Furthermore, this is equivalent to Takens' result on the attractor dimension of a chaotic system (Takens 1981).

2. Generic observability is generic: The system 2.1 is observable after 2n + 1 measurements, for generic f and h and a generic subset U'₂ₙ₊₁ ⊂ U₂ₙ₊₁ (U₂ₙ₊₁ denoting an input sequence of length 2n + 1).

Once we establish the genericity of generically observable systems,⁸ we need to show that such systems can be realized by input-output models. This will be done through the use of an observer:

3. Existence of a continuous observer for generically observable systems: For a generically observable system there exists a continuous function Ψ: U₂ₙ₊₁ × Y₂ₙ₊₁ → X such that x(k) = Ψ[U₂ₙ₊₁(k), Y₂ₙ₊₁(k)] for all input sequences except on an open set Ū₂ₙ₊₁ containing the singular input sequences.⁹
⁷Sketches of the proofs are given in the appendix. For the full proofs the reader is referred to Levin (1992) and Levin and Narendra (1992, 1995).
⁸Since generic observability requires 2n + 1 measurements, from now on by generic observability we will mean (2n + 1)-step generic observability.
⁹The open set Ū₂ₙ₊₁ can be made as small as necessary once we assume that the set of singular input sequences is of measure zero. Although, in general, one could come up with pathological examples for which the notions of measure one and genericity do not agree, such cases seem to be more of a mathematical peculiarity, and we conjecture that for physical systems the two notions coincide.
With this preamble, and with some simple algebra, the existence of an input-output model for a generically observable system can be established:

4. Generically observable systems can be realized by input-output models: For a generically observable system of the form 2.1 there exists a continuous mapping Φ: U₂ₙ₊₁ × Y₂ₙ₊₁ → Y such that the input-output behavior of the recursive model

y(k+1) = Φ[y(k), ..., y(k-2n), u(k), ..., u(k-2n)]                      (4.2)

will be identical to that of 2.1 for all input sequences except an open set Ū₂ₙ₊₁ (as small as necessary) containing the singular input sequences.

Finally, combining the above results and relying on the approximation properties of multilayer neural networks, we have:

5. Generically, nonlinear systems can be realized by multilayer feedforward networks: For generic f and h, and for any ε > 0, there exists a feedforward neural network N_Φ(·) such that

|y(k+1) - N_Φ[y(k), ..., y(k-2n), u(k), ..., u(k-2n)]| < ε

for all input sequences except an open set Ū₂ₙ₊₁ (of measure < ε) containing the singular input sequences.
5 Conclusion
The fact that generic observability is a generic property of systems implies that practically all systems (satisfying the above assumptions) can be identified using input-output models and hence realized by feedforward networks. The number of past observations that needs to be employed to successfully identify the system depends on the system's order, which may not be available. The result guarantees, however, that even without this information a finite number of past observations suffices to predict the future; hence, by reaching further back into the past, we are assured that identification can be achieved.
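As a concrete illustration of the recursive model 4.2, the sketch below builds the regressor of 2n + 1 past outputs and inputs for a toy first-order system and fits a one-hidden-layer network. Everything here is our own construction, not the authors': the toy system is invented, and a random hidden layer with a least-squares output layer stands in for full backpropagation training.

```python
import numpy as np

rng = np.random.default_rng(2)

# toy first-order nonlinear system (n = 1), invented for illustration:
# x(k+1) = 0.8*tanh(x(k)) + u(k), with y(k) = x(k) fully observed
N = 1000
u = rng.uniform(-1.0, 1.0, N)
y = np.zeros(N)
for k in range(N - 1):
    y[k + 1] = 0.8 * np.tanh(y[k]) + u[k]

# regressor of model 4.2 with 2n + 1 = 3 past outputs and inputs
l = 3
rows = range(l - 1, N - 1)
Phi = np.array([np.concatenate([y[k - l + 1:k + 1], u[k - l + 1:k + 1]])
                for k in rows])
target = y[l:]                                # y(k+1) for each row

# one-hidden-layer network: random tanh hidden units plus a least-squares
# output layer -- a crude stand-in for backpropagation training
W = rng.standard_normal((Phi.shape[1], 50))
bias = rng.standard_normal(50)
H = np.tanh(Phi @ W + bias)
v = np.linalg.lstsq(H, target, rcond=None)[0]
err = np.sqrt(np.mean((H @ v - target) ** 2))
```

Because both the regressor and the target are directly observable, fitting the model is an ordinary static supervised learning problem, as the paper argues.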
Appendix: Sketches of Proofs

Result 1: The basic idea here is to show that for almost all f and h, the mapping x¹(k) × x²(k) → Y¹₂ₙ₊₁(k) × Y²₂ₙ₊₁(k) given by equation 4.1 intersects the diagonal transversally. We assume h is a Morse function (a generic property of maps). Let f_i(x) ≜ f(x, u_i), where u_i denotes the input at time i. For a given f, Σ will
be observable if the mapping Θ: Δ(F × F) × X × X → ℝ²ⁿ⁺¹ × ℝ²ⁿ⁺¹ defined by

Θ(f, f, x₁, x₂) = [Y₂ₙ₊₁(x₁), Y₂ₙ₊₁(x₂)]

is transversal to W = Δ(ℝ²ⁿ⁺¹ × ℝ²ⁿ⁺¹). To prove that this is true for a generic f, we will consider the family of maps F(x, s) = f(x) + s g(x), where s is a parameter and g is a smooth function. Specifically, we will construct the function g(x) such that F(x, s) intersects the diagonal transversally. Now, from the transversality theorem, if Θ ⋔ W, then for a generic f the corresponding map is transversal to W. Since the diagonal is of dimension 2n + 1 and the states map to a 2n-dimensional manifold, transversality means that for given f and h the mapping h ∘ f^(2n+1): x(k) → Y₂ₙ₊₁(k) does not intersect the diagonal, and hence the system is observable. The construction of the family of mappings needs to be carried out for four possible cases:

1. Neither x₁ nor x₂ is periodic with period ≤ 2n + 1.
2. Either x₁ or x₂ is periodic with period ≤ 2n + 1.
3. Both x₁ and x₂ are periodic with period ≤ 2n + 1.
4. x₁ and x₂ are on the same trajectory.
The proof then shows how transversality is achieved for each of the above cases.

Result 2: With input present, the system can be viewed as a map Θ: C^∞ × U₂ₙ₊₁ → Y₂ₙ₊₁ from the Cartesian product of the space of smooth functions and the set of input sequences to the set of corresponding output sequences. To achieve the desired result we need to show that injectiveness of this map is open and dense in the input space. Openness is an immediate consequence of the stability of injectiveness. To show denseness, we note that for any given input sequence the system can be described as a homogeneous system; thus, from the first result, a small perturbation of f(·) will achieve injectiveness.

Result 3: Since the singular set over which the system is not observable is of measure zero, it can be enclosed by a compact set A_ε whose measure can be made as small as necessary. Outside this set the mapping

[Y₂ₙ₊₁(k), U₂ₙ₊₁(k)] = Ω[x(k), U₂ₙ₊₁(k)]

is bijective; hence, by the Tietze extension theorem, there exists a continuous map that is equal to the inverse of Ω on the complement of A_ε. This map is the desired observer.
Identification Using Feedforward Networks
Result 4: From the above two results, a generic set of systems is generically observable, and for generically observable systems one can construct a continuous observer (except on a small set). Combining the two, we get the desired result.

Result 5: Follows immediately from the approximation properties of feedforward neural networks (Cybenko 1989; Hornik et al. 1989).

Acknowledgments
The first author wishes to thank Felipe Pait and Eduardo Sontag for helpful discussions. This work was supported by NSF Grant ECS-8912397.

References

Aeyels, D. 1981. Generic observability of differentiable systems. SIAM J. Control Optim. 19, 595-603.
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2, 303-314.
Golubitsky, M., and Guillemin, V. 1973. Stable Mappings and Their Singularities. Springer-Verlag, Berlin.
Guillemin, V., and Pollack, A. 1974. Differential Topology. Prentice-Hall, Englewood Cliffs, NJ.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Jordan, M. 1986. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pp. 531-546. Lawrence Erlbaum, Hillsdale, NJ.
Leontaritis, I., and Billings, S. 1985. Input-output parametric models for nonlinear systems. Part I: Deterministic nonlinear systems. Int. J. Control 41, 303-328.
Levin, A. 1992. Neural networks in dynamical systems: A system theoretic approach. Ph.D. thesis, Yale University, New Haven, CT.
Levin, A., and Narendra, K. 1992. Control of nonlinear dynamical systems using neural networks. Part II: Observability and identification. Tech. Rep. 9116, Center for Systems Science, Yale University, New Haven, CT.
Levin, A., and Narendra, K. 1995. Recursive identification using feedforward neural networks. International Journal of Control. To appear.
Narendra, K., and Parthasarathy, K. 1990. Identification and control of dynamical systems using neural networks. IEEE Trans. Neural Networks 1, 4-27.
Takens, F. 1981. Detecting strange attractors in turbulence. In Lecture Notes in Mathematics, D. Rand and L. Young, eds., Vol. 898, pp. 366-381. Springer-Verlag, Berlin.
Received December 16, 1993; accepted July 22, 1994.
Communicated by Hervé Bourlard
An HMM/MLP Architecture for Sequence Recognition

Sung-Bae Cho*
ATR Human Information Processing Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
Jin H. Kim
KAIST Center for Artificial Intelligence Research, 373-1 Koosung-dong, Yoosung-ku, Taejeon 305-701, Republic of Korea
This paper presents a hybrid architecture of hidden Markov models (HMMs) and a multilayer perceptron (MLP). This exploits the discriminative capability of a neural network classifier while using HMM formalism to capture the dynamics of input patterns. The main purpose is to improve the discriminative power of the HMM-based recognizer by additionally classifying the likelihood values inside them with an MLP classifier. To appreciate the performance of the presented method, we apply it to the recognition problem of on-line handwritten characters. Simulations show that the proposed architecture leads to a significant improvement in generalization performance over conventional approaches to sequential pattern recognition. 1 Introduction
The multilayer perceptron (MLP) has been recognized as a powerful tool for pattern classification problems (Lippmann 1989b). Its strengths are its discriminative power and its capability to learn and represent implicit knowledge, but it is generally limited to the classification of static patterns without sequential processing. Several researchers have proposed original architectures with feedback loops that provide dynamic and implicit memory (Jordan 1986; Elman 1988; Waibel et al. 1989). However, current neural network topologies are inefficient in modeling temporal structures. An alternative approach to sequence recognition is to use hidden Markov models (HMMs). An HMM provides a good probabilistic representation of temporal sequences having large variations, and has been widely used for automatic speech recognition (Rabiner 1989). The main drawback of independently trained HMM-based recognizers, however, is

*Permanent Address: Computer Science Department, Yonsei University, 134 Shinchon-dong, Sudaemoon-ku, Seoul 120-749, Republic of Korea.
Neural Computation 7, 358-369 (1995)
© 1995 Massachusetts Institute of Technology
their weak discriminative power. The maximum likelihood (ML) estimation procedures typically used for training HMMs are suitable for modeling the time-sequential order and variability of input observation sequences, but the recognition task requires more powerful discrimination. A solution is to train HMMs with the maximum mutual information (MMI) criterion. The standard ML criterion uses a separate training sequence of observations to derive the model parameters for each model. In contrast, the MMI criterion maximizes the average mutual information between the observation sequence and the complete set of models. However, MMI training involves a number of practical difficulties. The Baum-Welch algorithm (Rabiner 1989) is a robust and efficient algorithm for ML estimation, but it cannot be applied directly to MMI. As a result, work on MMI training was forced to use slow and somewhat unreliable gradient descent methods (Bahl et al. 1986). To alleviate this problem, several attempts have been made to combine the classification power of the MLP with the temporal sequence modeling capability of the HMM. This paper is inspired by previous attempts to combine HMMs with neural networks. We present a method in which HMMs provide an MLP with input vectors through which the temporal variations have been filtered. This method takes the likelihoods inside the HMMs of all class models and presents them to an MLP to better estimate posterior probabilities. To evaluate the performance of the hybrid architecture, we conducted classification experiments using on-line handwritten characters. Although a serious theoretical investigation is beyond the scope of this paper, experimental results show that the hybrid architecture can achieve a 20% error rate reduction over that obtained by the conventional HMM-based recognizer.

2 Method
The key idea in the proposed HMM/MLP hybrid architecture is (1) to convert a dynamic input sample to a static pattern sequence by using an HMM-based recognizer and (2) to recognize the sequence by using a trained MLP classifier. A block diagram of the hybrid architecture is compared with that of a conventional HMM-based recognizer in Figure 1. A usual HMM-based recognizer assigns one Markov model to each class. Recognition with HMMs involves accumulating scores for an unknown input across the nodes in each class model, and selecting the class model that provides the maximum accumulated score. In contrast, the proposed architecture replaces the maximum-selection part with an MLP classifier.

Figure 1: (a) Conventional architecture of an HMM-based recognizer; (b) the hybrid HMM/MLP architecture.

2.1 Hidden Markov Models. An HMM can be thought of as a directed graph consisting of N nodes (states) and arcs (transitions) representing the relationships between them. We denote the state at time t as q_t, and an observation sequence as X = (X_1, X_2, ..., X_T), where each observation X_t is one of the observation symbols and T is the number of observations in the sequence. Each node stores the initial state probability, π = {π_i | π_i = P(q_1 = i), i = 1, 2, ..., N}, and the observation symbol probability distribution, B = {b_j(X_t) | b_j(X_t) = a posteriori probability of observation X_t given q_t = j}, and each arc contains the state transition probability distribution, A = {a_ij | a_ij = P(q_{t+1} = j | q_t = i), i, j = 1, 2, ..., N}. Using these parameters, the observation sequence can be modeled by an underlying Markov chain whose state transitions are not directly observable.
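For concreteness, the parameters π, A, and B defined above can be held in a small data structure. A minimal Python sketch (the function name and the random row-stochastic initialization are ours, not from the paper; in practice the parameters are estimated from training sequences):

```python
import random

def make_hmm(n_states=10, n_symbols=8, seed=0):
    """Randomly initialized parameters for the discrete HMM defined above:
    pi[i]   = P(q_1 = i)                 (initial state probability)
    A[i][j] = P(q_{t+1} = j | q_t = i)   (state transition probability)
    B[j][v] = P(X_t = v | q_t = j)       (observation symbol probability)
    """
    rng = random.Random(seed)

    def stochastic(n):
        # a random probability vector of length n (entries sum to 1)
        row = [rng.random() for _ in range(n)]
        s = sum(row)
        return [r / s for r in row]

    pi = stochastic(n_states)
    A = [stochastic(n_states) for _ in range(n_states)]
    B = [stochastic(n_symbols) for _ in range(n_states)]
    return pi, A, B
```

The 10-state, 8-symbol default matches the model sizes used in the experiments of Section 3.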
Given a model λ_i = (A, B, π) and an unknown input sequence X = (X_1, X_2, ..., X_T), the matching score is obtained by summing the probability of the observation sequence X being generated by the model over all possible state sequences q:

P(X | λ_i) = Σ_q P(X, q | λ_i)    (2.1)

Then, we select the maximum as

i* = argmax_{1 ≤ i ≤ c} P(X | λ_i)    (2.2)

and classify the input sample as class i*. For a given λ_i, an efficient method for computing equation 2.1, known as the forward-backward algorithm (Rabiner 1989), is as follows:

- Initialization:  α_1(i) = π_i b_i(X_1),  1 ≤ i ≤ N    (2.3)
- Induction:  α_{t+1}(j) = [Σ_{i=1}^N α_t(i) a_ij] b_j(X_{t+1}),  1 ≤ t ≤ T - 1,  1 ≤ j ≤ N    (2.4)
- Termination:  P(X | λ_i) = Σ_{i=1}^N α_T(i)    (2.5)

Notice that in this formulation the score for a model is computed as a sum over all states of the model, but it is usual to specify distinguished final states for each model. In that case, the score is the sum of the forward variables α_T(k) at the final states. Training an HMM involves adjusting the model parameters (A, B, π) to maximize the probability of all training sequences given the model (Rabiner 1989).

2.2 Hybrid Architecture. As shown in the previous section, an HMM calculates the likelihood with parameters λ_i = (A, B, π) by equations 2.3 through 2.5. With infinite training data and a model space that includes the true source, the global ML estimate is optimal in the sense that it yields an unbiased estimate with minimum variance. However, when constructing an HMM-based recognizer, training data are not unlimited and the model space does not include the source. Thus, in some cases, the matching score of observation sequences generated by the correct model may be less than that generated by an alternative model. To overcome this shortcoming, many researchers have attempted to combine the advantages of the time-alignment function of HMMs and the
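The forward recursion of equations 2.3 through 2.5 can be sketched in Python. This is an illustrative implementation only (the paper gives no code), following the standard Rabiner formulation:

```python
def forward(pi, A, B, X):
    """Forward algorithm for P(X | lambda), lambda = (A, B, pi).
    Returns (score, alpha), where alpha[t][i] is the forward variable
    alpha_{t+1}(i) = P(X_1..X_{t+1}, q_{t+1} = i | lambda)."""
    N = len(pi)
    # Initialization (eq. 2.3): alpha_1(i) = pi_i * b_i(X_1)
    alpha = [[pi[i] * B[i][X[0]] for i in range(N)]]
    # Induction (eq. 2.4)
    for t in range(1, len(X)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(N)) * B[j][X[t]]
            for j in range(N)
        ])
    # Termination (eq. 2.5): sum over all states
    # (or restrict the sum to distinguished final states)
    return sum(alpha[-1]), alpha
```

For long sequences the raw forward variables underflow; scaling or log-domain computation is the usual remedy, and this is consistent with the very small matching scores noted in Section 3.2.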
powerful discriminative capability of neural networks. Some researchers have shown that HMMs can be considered a subset of recurrent neural networks, resulting in several alternatives to the traditional HMM training algorithms (Niles 1990; Young 1990). Bourlard and Morgan (1991) have also proposed a hybrid method, called discriminative HMM, for exploiting the advantages of the neural network classifier within the HMM framework. Other researchers have used neural networks for preprocessing, one unit at a time, and/or for postprocessing, to refine or integrate information at a static pattern level, leaving temporal processing to the HMM (Lippmann 1989a; Bengio et al. 1991). In this paper, we propose a novel hybrid architecture that combines HMM and MLP in a manner quite different from those used in the above hybrid methods. The key idea of the proposed method is to generate fixed-dimensional feature vectors from a temporal input sequence and to classify these static, time-normalized feature vectors with a discriminative MLP classifier. This architecture is seemingly similar to some previous methods, but is nevertheless essentially different. In Lippmann (1989a), HMM state transition information was treated as the input to an MLP. However, that approach was severely restricted in that it used only the temporal information of the input. This HMM/MLP hybrid was inadequate, because the discriminative capability of the back-end classifier was limited by rough HMM segmentations due to word insertion and deletion errors. In our architecture, the time-normalized vectors are created from partial likelihoods of statistically trained HMMs. The basic idea is to classify the nodal matching scores, each α_T(k) in equation 2.5, of the complete set of models by an MLP instead of simply selecting the model that generates the maximum accumulated score. The overall organization is shown in Figure 1b.
The hybrid architecture takes the likelihood patterns inside the HMMs and presents them to an MLP to estimate the posterior probability of class ω_i as follows:

P(ω_i | X) ≈ f( Σ_k v_ik f( Σ_j Σ_l w_jl,k α_T(j, l) ) )    (2.6)

where w_jl,k is a weight from the jth input node at the lth state to the kth hidden node, v_ik is a weight from the kth hidden node to the ith class output, and f is a sigmoid function such as f(x) = 1/(1 + e^{-x}). Here, α_T(j, l) is the value of the forward variable α_T(l) at the jth class model. Rather than simply selecting the model producing the maximum value of P(X | λ_i), the proposed method has an MLP perform additional classification with all the likelihood values inside the HMMs. In this method, the HMM yields a kind of static pattern in which the inherent temporal variations have been processed, and the MLP classifier discriminates them as belonging to one particular class. The hybrid method automatically focuses on those parts of the model that are important for discriminating between sequentially similar patterns. In the conventional HMM-based approach, only the patterns in
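The scoring of equation 2.6 can be sketched as follows, with the final forward variables of all class models flattened into one input vector. The weight layout and function names are illustrative assumptions, not the paper's implementation:

```python
import math

def sigmoid(x):
    # f(x) = 1 / (1 + e^{-x}), as in equation 2.6
    return 1.0 / (1.0 + math.exp(-x))

def hybrid_posteriors(alpha_T, W, V):
    """Equation 2.6: feed the final forward variables alpha_T(j, l)
    of all class models (j = class model, l = state) through a
    two-layer MLP.  W[k] holds the weights from every (j, l) input
    to hidden node k; V[i] holds the weights from every hidden node
    to class output i."""
    inputs = [a for model in alpha_T for a in model]   # flatten over (j, l)
    hidden = [sigmoid(sum(w * x for w, x in zip(W[k], inputs)))
              for k in range(len(W))]
    return [sigmoid(sum(v * h for v, h in zip(V[i], hidden)))
            for i in range(len(V))]
```

In the paper's numeral experiment this corresponds to 10 models × 10 states = 100 inputs, 20 hidden nodes, and 10 class outputs.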
the specified class are involved in the estimation of parameters; there is no role for patterns in the other classes. The hybrid method uses more information than the conventional approach: it uses knowledge of the potential confusions in the particular training data to be recognized. Since it uses more information, there are certainly reasons to suppose that the hybrid method will prove superior to the conventional approach. In this method, the MLP will learn prior probabilities as well as correct the assumptions made about the probability density functions used in the HMMs.

3 Results
3.1 On-Line Handwriting Recognition. The success of the HMM approach in the speech recognition area has stimulated considerable research effort in applying it to the problem of handwritten script recognition (Nag et al. 1986; Kundu and Bahl 1988; Kundu et al. 1989; Tappert 1991). The reason for this trend is that the rules governing the interpretation of temporal patterns can be learned by an HMM trained on sample data, whether the samples are speech features or image features. We used a data set of handwritten characters as a source of both training and test samples to give an idea of the practical application of the presented method to sequential pattern recognition. An input character consists of a set of strokes, each of which begins with a pen-down movement and ends with a pen-up movement. Several preprocessing algorithms were applied to successive data points within each stroke to reduce quantization noise and fluctuations due to the writer's pen motion. The processes used were wild point reduction, dot reduction, hook analysis, three-point smoothing, peak-preserving filtering, and N-point normalization. A sequence of preprocessed data points was approximated by a sequence of 8-directional straight-line segments (the chain code), as used by Freeman (1974). A left-right HMM was used, in which no transitions are allowed to states whose indices are lower than that of the current state. The HMM consisted of 10 nodes, each of which incorporated 8 observation symbols. This is based on discrete output probability distributions over the 8 chain codes. Lastly, the nodal matching scores of all models (10 × number of classes) were provided as inputs to the MLP classifier in the hybrid architecture.
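The 8-directional chain coding described above can be sketched as follows (a simplified illustration; the paper's preprocessing steps such as smoothing and N-point normalization are omitted):

```python
import math

def chain_code(points):
    """Map each consecutive pair of pen points to one of 8 directions
    (0 = east, counting counterclockwise in 45-degree steps), in the
    spirit of Freeman (1974)."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        angle = math.atan2(y1 - y0, x1 - x0)            # in [-pi, pi]
        codes.append(int(round(angle / (math.pi / 4))) % 8)
    return codes
```

The resulting symbol sequence is what the discrete HMMs above observe: each code is one of the 8 observation symbols.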
The tasks were to classify Arabic numerals, uppercase letters, and lowercase letters collected from 13 different writers. For training the HMM and MLP classifiers, 40 samples
[Plot: recognition rate (%) versus number of training data per model, with curves for training data and test data.]
Figure 2: Recognition rates of HMM-based recognizer depending on the number of training samples.
for each class were used, while an additional 500 samples were used as test inputs for recognition. In addition, we collected another 500 samples for a validation set. The error backpropagation (EBP) algorithm was used for training the MLP, and the iterative estimation process was stopped when the recognition rate over the validation set was optimized. This early stopping mechanism was adopted mainly to prevent the network from overtraining. The parameter values used for training were a learning rate of 0.4 and a momentum of 0.6. An input vector is classified as belonging to the output class associated with the highest output activation. First, we investigated the recognition performance of an HMM-based recognizer with differing numbers of training samples, from 1 to 40 per class. Figure 2 shows the recognition rates for classifying the 500 test samples of numerals in all cases. The correct recognition rate is seen to depend on the number of training samples. However, once the number of training samples reaches 30, the recognition rate shows little variation. This is a strong indication that the accuracy of the HMM increases with more training data, but the recognition rate reaches a limit when about 40 samples are used per model.
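The validation-based early stopping described above can be sketched generically; `train_epoch` and `evaluate` are hypothetical callables standing in for one EBP pass and a validation-set evaluation:

```python
def train_with_early_stopping(train_epoch, evaluate, max_epochs=1000, patience=10):
    """Stop training when the validation recognition rate stops
    improving.  `train_epoch` runs one pass of backpropagation;
    `evaluate` returns the recognition rate on the validation set.
    Returns the best rate seen and the epoch at which it occurred."""
    best_rate, best_epoch, since_best = -1.0, 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        rate = evaluate()
        if rate > best_rate:
            best_rate, best_epoch, since_best = rate, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:   # no improvement for `patience` epochs
                break
    return best_rate, best_epoch
```

A real implementation would also snapshot the weights at the best epoch; the `patience` window is our assumption, since the paper only says training stopped when the validation rate was optimized.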
Table 1: Performance Comparison for Numerals (% correct).
Method                           Training data   Test data
NN                               88.7            74.4
HMM                              95.2            83.6
HMM with final state             95.5            84.2
HMM + linear combination         95.5            83.4
HMM + perceptron with logistic   97.2            84.2
HMM + multilayer perceptron      99.5            85.4
To apply the presented hybrid method to numeral recognition, we implemented a two-layered perceptron having 100 input, 20 hidden, and 10 output nodes. The input was provided by 10 HMM models, each of which consists of 10 nodes. This network, however, did not converge, because the (floating point) nodal matching scores of the HMMs were too small. Therefore, we encoded the output values α_T(k) of each HMM state as one of 10 values between zero and one: we assigned 1.0 if α_T(k) > 0.1, 0.9 if 0.1 ≥ α_T(k) > 0.01, and so on. The selection of the encoding scheme is largely ad hoc, and no serious attempt was made to find an optimal coding scheme, although this may be an important issue. Our objective here is to have the MLP train on the likelihood patterns produced by the HMMs, because the neural network would at least do no harm unless it was very badly trained. Before running the proposed hybrid method, we examined the following three intermediate conditions:

1. HMM with final state: the simplest standard condition, which goes some way toward being able to do what the MLP postprocessor can. For this, we included in the training of the HMMs a set of final transition probabilities, which become the weights applied to the final alphas, α_T(k).
2. HMM + linear combination: a linear combination trained as an MLP in the hybrid method.

3. HMM + perceptron with logistic: a "single-layer perceptron," that is, a single layer of weights followed by a logistic.
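The ad hoc decade encoding of the forward variables described in the text (1.0 if α_T(k) > 0.1, 0.9 if 0.1 ≥ α_T(k) > 0.01, and so on) can be sketched as:

```python
def encode_alpha(a, levels=10):
    """Quantize a forward variable into one of `levels` values in (0, 1]:
    1.0 for a > 0.1, 0.9 for 0.1 >= a > 0.01, and so on by decades,
    with a floor at the smallest level (including a == 0).
    The floor behavior is our assumption; the paper only gives the
    first two decades of the scheme."""
    threshold = 0.1
    for level in range(levels):
        if a > threshold:
            return round(1.0 - 0.1 * level, 1)
        threshold /= 10.0
    return round(1.0 - 0.1 * (levels - 1), 1)   # floor at the last level
```

This maps the tiny floating-point likelihoods onto a fixed, well-scaled input range for the MLP, which is what let the network converge.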
The recognition rates (% correct) pertaining to these various methods are summarized in Table 1. To appreciate the performance of a conventional neural network for sequence pattern recognition, we also implemented another two-layered perceptron using the same input as the HMM-based recognizer, which is denoted as NN in the table. This network has 10 input, 20 hidden, and 10 or 26 output nodes. Obviously, the simple neural network does not come close to minimizing the error rate on the training data.
Table 2: Performance Comparison for Uppercase Letters (% correct).
Method                           Training data   Test data
NN                               85.6            73.2
HMM                              91.9            76.4
HMM with final state             92.5            78.4
HMM + linear combination         88.9            69.2
HMM + perceptron with logistic   96.1            82.0
HMM + multilayer perceptron      95.3            83.4
Table 3: Performance Comparison for Lowercase Letters (% correct).
Method                           Training data   Test data
NN                               75.8            59.0
HMM                              87.8            72.0
HMM with final state             87.9            72.2
HMM + linear combination         77.4            56.8
HMM + perceptron with logistic   94.4            77.2
HMM + multilayer perceptron      96.1            77.6
It is seen that the performance is somewhat improved by fixing the HMM to insist that the model be in the final state at the end of the input, but this is not as good as the proposed method. It clearly shows that there is room to improve performance through additional classification that takes into account both the true data and their potential nemeses. The other two intermediate conditions show that the final-frame probability distributions cannot be effectively classified by a single-layer perceptron or a linear discriminant. In the case of numerals, the overall recognition rate for 10 classes with the hybrid method is 85.4% for a total of 500 characters. This is a useful improvement over the performance obtained with an HMM-based recognizer trained with ML optimization (83.6% recognition rate), as well as over that of a neural network using character direction sequences as inputs (74.4% recognition rate). This improvement may be practically significant, but it is not impressive for a method that should give some net benefit by construction. However, the fact that similar (or bigger) improvements were obtained for uppercase and lowercase letters provides evidence that this is a real effect. Tables 2 and 3 show the recognition rates for the uppercase letters and the lowercase letters, respectively. Figure 3 shows the rate of recognition errors made on the test samples in each of the three experiments.

Figure 3: A comparison of error rates of MLP, HMM, and the hybrid method for numerals, uppercase letters, and lowercase letters.

As the figure shows, the proposed method led to a reduction in error rate in every case. Overall, the number of errors fell by 20%. In summary, the hybrid architecture gave better discriminative capability than the conventional HMM classifiers. We may thus assert that these improvements are mainly due to the excellent discriminative capability of the MLP. However, even in the hybrid system, there was a big performance gap between training data and test data. This problem, which is related to the generalization issue in pattern classification and learning, still requires more investigation.

4 Concluding Remarks
In this paper, we have proposed a hybrid architecture of HMMs and an MLP to improve recognition accuracy in sequential pattern recognition. From the results of preliminary experiments on recognizing on-line handwritten characters, we have seen that the hybrid architecture performs well despite some limitations of the coding techniques. We believe that with additional work on the encoding method, not only of the neural network but also of the HMMs, this hybrid method has the potential to be used for recognizing handwritten script in much the same way it has been used for handwritten character recognition. Several topics remain for further research. The relatively easy one is to increase the recognition rate of each classifier to practical levels through
engineering work. An investigation into different model structures for each of them would be interesting, since the left-right HMM and the MLP may not be the most appropriate models for script recognition. Furthermore, although it is intuitively clear why the proposed method succeeds, we have been unable to prove that it always will.
Acknowledgments This work was supported in part by a grant from the Korea Science and Engineering Foundation (KOSEF) and Center for Artificial Intelligence Research (CAIR), the Engineering Research Center (ERC) of Excellence Program.
References

Bahl, L. R., Brown, P. F., de Souza, P. V., and Mercer, R. L. 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. Proc. ICASSP'86, I, 49-52.
Bengio, Y., Mori, R. D., Flammia, G., and Kompe, R. 1991. Global optimization of a neural network-hidden Markov model hybrid. Proc. IJCNN-92, II, 789-794.
Elman, J. L. 1988. Finding structure in time. CRL Tech. Rep. 8801, Univ. California, San Diego.
Freeman, H. 1974. Computer processing of line drawing images. Comput. Surv. 6(1), 57-98.
Jordan, M. I. 1986. Serial order: A parallel distributed processing approach. Tech. Rep. 8604, Univ. California, San Diego.
Kundu, A., and Bahl, P. 1988. Recognition of handwritten script: A hidden Markov model based approach. Proc. ICASSP'88, 928-931.
Kundu, A., He, Y., and Bahl, P. 1989. Recognition of handwritten word: First and second order hidden Markov model based approach. Pattern Recog. 22(3), 283-297.
Lippmann, R. P. 1989a. Review of neural networks for speech recognition. Neural Comp. 1, 1-38.
Lippmann, R. P. 1989b. Pattern classification using neural networks. IEEE Commun. Mag., 47-64.
Morgan, N., and Bourlard, H. 1990. Continuous speech recognition using multilayer perceptrons with hidden Markov models. Proc. ICASSP'90, I, 113-116.
Nag, R., Wong, K. H., and Fallside, F. 1986. Script recognition using hidden Markov models. Proc. ICASSP'86, 2071-2074.
Niles, L. T., and Silverman, H. F. 1990. Combining hidden Markov model and neural network classifiers. Proc. ICASSP'90, I, 417-420.
Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257-286.
Tappert, C. C. 1991. Online handwriting recognition with hidden Markov models. Proc. Fifth Handwrit. Conf. Int. Graphonomics Soc., 204-206.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. J. 1989. Phoneme recognition using time-delay neural networks. IEEE Trans. ASSP 37, 328-339.
Young, S. J. 1990. Competitive training in hidden Markov models. Proc. ICASSP'90, II, 681-684.
Received April 13, 1993; accepted July 1, 1994.
Communicated by Eric Baum
Learning Linear Threshold Approximations Using Perceptrons

Tom Bylander
Division of Mathematics, Computer Science, and Statistics, The University of Texas at San Antonio, San Antonio, Texas 78249 USA
We demonstrate sufficient conditions for polynomial learnability of suboptimal linear threshold functions using perceptrons. The central result is as follows. Suppose there exists a vector w* of n weights (including the threshold) with "accuracy" 1 - α, "average error" η, and "balancing separation" σ, i.e., with probability 1 - α, w* correctly classifies an example x; over examples incorrectly classified by w*, the expected value of |w* · x| is η (the source of inaccuracy does not matter); and over a certain portion of correctly classified examples, the expected value of |w* · x| is σ. Then, with probability 1 - δ, the perceptron achieves accuracy at least 1 - [ε + α(1 + η/σ)] after O[n ε^{-2} σ^{-2} ln(1/δ)] examples.
1 Introduction
Recently, the perceptron and other linear-threshold learning algorithms have been receiving increasing attention. The perceptron, despite its attractive convergence properties (Rosenblatt 1962), fell into disfavor after its limitations appeared too severe (Minsky and Papert 1969). Even the surge of interest in neural networks, especially the justification for hidden layers, was in part based on the limitations of the perceptron (Rumelhart et al. 1986). Nowadays, the perceptron and linear threshold functions do not seem so bad after all. When the examples are linearly separable, Baum (1990) shows that exponentially sized weights are not a problem given modest distributional assumptions, i.e., it is likely that the perceptron will find a highly accurate function with a polynomial number of examples if the distribution of examples is chosen independently from the linear threshold function to be learned. Littlestone (1988, 1989) has developed algorithms for learning linear-threshold functions that outperform the perceptron when there are many irrelevant attributes. For several datasets that are not linearly separable, Gallant (1990) and Shavlik et al. (1991) show that the performance of the perceptron is comparable to more sophisticated learning algorithms, despite the cycling behavior of the perceptron. Neural Computation 7, 370-379 (1995)
© 1995 Massachusetts Institute of Technology
Of course, none of this changes the inherent representational limitations of linear threshold functions. If no linear threshold function is adequate for a given situation, there is no choice but to consider different representations. However, learning with multilayer networks has been shown to be hard in a number of contexts (Blum and Rivest 1988; Judd 1990), so it makes sense to first try perceptrons (or Littlestone’s algorithms) given their simplicity and convergence properties before trying something more complicated with unclear convergence. Also, linear threshold functions and their sigmoidal cousins are common components of neural networks, so understanding the behavior of linear threshold functions is an important part of understanding neural networks. This paper presents more evidence in favor of the perceptron, by providing a theoretical explanation of when the perceptron can be expected to perform well on nonlinearly separable examples. This indirectly supports the aforementioned empirical results of Gallant (1990) and Shavlik et al. (1991). In particular, we demonstrate sufficient conditions for when the perceptron can be used to efficiently learn accurate, albeit suboptimal, linear threshold functions. However, neither the perceptron nor any other learning algorithm is likely to perform well in all cases. Hoffgen and Simon (1992) show that it is NP-hard to find the optimal linear threshold function for a set of examples. Furthermore, they show that if the representation of a weight is bounded by a constant, then, for any constant k, it is NP-hard to find a linear-threshold function that is no more than k times worse than optimal. Hoffgen and Simon’s results are distribution-free, i.e., no restrictions on the distribution of examples are made. 
To circumvent this difficulty, we characterize a given distribution by three parameters of the optimal linear threshold function: 1 − α is its accuracy, σ is the average separation of a certain portion of the correctly classified examples from the threshold, and η is the average error of incorrectly classified examples from the threshold (to be defined precisely in the next section). For given α, σ, and η, then with probability 1 − δ, the perceptron's accuracy will be at least 1 − [ε + α(1 + η/σ)] after O[nε⁻²σ⁻²(ln 1/δ)] iterations, where n is the number of Boolean attributes. A modified version of the convergence proof from Minsky and Papert (1969) and an inequality from Hoeffding (1963) are used to demonstrate this result. Note that this result does not guarantee convergence to the optimal linear threshold function. However, if the average error η is small relative to the separation parameter σ, and if the optimal linear threshold function has a high accuracy, then the perceptron is likely to achieve a good result. However, it is easy to generate a set of examples in which η/σ is arbitrarily high, and so the result of this paper is consistent with Höffgen and Simon's hardness results.

The outline of the paper is as follows. First, concepts and notation are defined. Then, the perceptron is analyzed in the context of the above framework. Next, a generalization of the parameters is discussed. Additional coverage on the utility and limitations of linear threshold functions can be found elsewhere (e.g., Minsky and Papert 1969; Duda and Hart 1973; Gallant 1990; Littlestone 1988; Shavlik et al. 1991).

2 Concepts and Notation
The concepts underlying our analysis are closely related to PAC-learnability (Valiant 1984). Our development is tailored to Boolean vectors and linear threshold functions and can be viewed for the most part as a specialization of agnostic learning (Kearns et al. 1992) and robust learning (Höffgen and Simon 1992).

Let x denote an example, specifically x ∈ {−1, 1}ⁿ, i.e., a vector of n Boolean attributes where true and false are encoded as 1 and −1, respectively. Let X = {−1, 1}ⁿ be the set of examples with n attributes. Let D be a probability distribution on labeled examples from X, i.e., each positive example (x, +) and negative example (x, −) has a given probability according to D. Let H be the class of linear threshold functions on X. Each linear threshold function h ∈ H, then, has a specific accuracy based on the probability distribution D:

acc(h, D) = Σ_{x∈X} D(x, h(x))

and there is a maximum, or optimal, accuracy that can be achieved:

opt(H, D) = max_{h∈H} acc(h, D)
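To make the definitions of acc(h, D) and opt(H, D) concrete, here is a small illustrative sketch (not from the paper; the toy distribution over labeled examples and the brute-force candidate grid are my own assumptions) that computes both quantities by enumeration for X = {−1, 1}²:

```python
import itertools

# Toy distribution D over labeled examples on X = {-1, 1}^2:
# maps (x, label) -> probability.  Illustrative numbers only.
D = {
    ((1, 1), '+'): 0.4,
    ((1, -1), '+'): 0.1,
    ((1, -1), '-'): 0.1,
    ((-1, -1), '-'): 0.4,
}

def h(w, theta, x):
    """Linear threshold function: answers '+' iff w . x > theta."""
    return '+' if sum(wi * xi for wi, xi in zip(w, x)) > theta else '-'

def acc(w, theta):
    """acc(h, D): probability mass of labeled examples that h gets right."""
    return sum(p for (x, label), p in D.items() if h(w, theta, x) == label)

# Brute-force opt(H, D) over a small grid of integer weights/thresholds.
candidates = [(w, t) for w in itertools.product([-1, 0, 1], repeat=2)
              for t in (-1, 0, 1)]
opt = max(acc(w, t) for w, t in candidates)
```

Because the example (1, −1) carries both labels with mass 0.1 each, no deterministic h can exceed 0.9 accuracy here, and the grid achieves exactly that.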
Given a source of labeled examples randomly drawn according to distribution D, we wish to find as accurate a linear threshold function as possible in a reasonable amount of time. In agnostic learning and robust learning, the goal is to come within ε of opt(H, D) with high probability. However, Höffgen and Simon (1992) show that this problem is NP-hard for the class of linear threshold functions. An alternative is to consider whether one can efficiently come within ε of some suboptimal accuracy with high probability.

Let w denote n weights for representing a linear threshold function, where weights are real numbers (though note that the perceptron only requires integers) and the nth element corresponds to w's threshold. For convenience, we transform labeled examples with n − 1 Boolean attributes to positive examples with n attributes. For (x′, l) ∈ {−1, 1}ⁿ⁻¹ × {+, −}, an example x = (x_1, x_2, . . . , x_n) is constructed as follows:

x_i = x′_i if l = + and 1 ≤ i ≤ n − 1
x_i = −x′_i if l = − and 1 ≤ i ≤ n − 1
x_n = 1 if l = +
x_n = −1 if l = −

With this transformation, a weight vector w correctly classifies a labeled example (x′, l) if and only if w · x > 0 for the transformed example x.

Our analysis assumes the existence of a weight vector w*, normalized so that ||w*|| = 1, with the following properties:

- With probability 1 − α, w* · x > 0.
- Over incorrectly classified examples, i.e., examples where w* · x ≤ 0, the expected value of w* · x is −η.
- There exists a c > 0 such that the expected value of w* · x in the range 0 < w* · x ≤ c is σ, and the probability that 0 < w* · x ≤ c is αη/σ.

Using the last property, it follows that the expected value of w* · x when w* · x ≤ c is 0. 1 − α, η, and σ are, respectively, called the accuracy, average error, and balancing separation of w*. The analysis will show that the perceptron will meet or exceed the suboptimal accuracy 1 − [ε + α(1 + η/σ)] with probability 1 − δ after O[nε⁻²σ⁻²(ln 1/δ)] examples. For convenience, I shall use err(ε) = ε + α(1 + η/σ), although err is really also a function of w* and the distribution D. It is possible that some weight vector has a lower err(ε) than the optimal weight vector, i.e., a slightly higher value for α (lower accuracy) is offset by a much lower value for η/σ (some combination of lower average error and higher balancing separation). The analysis does not require that w* have optimal accuracy. In general then, it is better to assume that w* is the weight vector with minimum err(ε).

The PERCEPTRON algorithm (see Fig. 1) uses the current weight vector w_i to classify examples. x_i is the first example misclassified by w_i. Once a mistake is made, w_i + x_i is assigned to w_{i+1}, and i is incremented, which makes the new weight vector the current weight vector. Let v_i = w* · x_i. That is, if v_i is positive, then v_i is the amount by which w* separates x_i from 0. Otherwise, −v_i is the error of w* on x_i. Let N stand for the number of examples classified by the perceptron. Let M be the number of mistakes over the N examples. For the perceptron, M is also the number of updates. Call 1 − M/N the online accuracy of the perceptron.

3 Linear Threshold Approximation Using Perceptrons
The key property of PERCEPTRON is that the average value of v_i = w* · x_i progresses toward zero. When this happens, the perceptron's mistakes that w* would have correctly classified must be "balanced" by the perceptron's mistakes that w* would have incorrectly classified, i.e., a certain proportion of the mistakes made by the perceptron must correspond to misclassifications by w*. The following theorem describes this behavior more precisely.

PERCEPTRON
  i ← 1
  w_1 ← 0
  for each example x
    classify x using w_i
    if w_i misclassifies x then
      x_i ← x
      w_{i+1} ← w_i + x_i
      i ← i + 1

Figure 1: Perceptron algorithm for linear threshold approximation.
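A minimal executable sketch of the PERCEPTRON algorithm of Figure 1, together with the label-folding transformation of Section 2, so that a correct classification corresponds to w · x > 0. The function names, the use of NumPy, and cycling over a fixed example list are my own choices; the figure itself describes a stream of examples:

```python
import numpy as np

def transform(x_prime, label):
    """Fold the label into the example (Section 2): append a threshold
    coordinate and negate everything for negative examples, so that a
    weight vector w is correct on (x', label) iff w . x > 0."""
    x = np.append(np.asarray(x_prime, dtype=float), 1.0)
    return x if label == '+' else -x

def perceptron(examples, n_examples):
    """Figure 1: start from w = 0; on each mistake (w . x <= 0), add the
    misclassified example to the weight vector.  Returns the final weight
    vector and the number of mistakes M over N = n_examples examples."""
    w = np.zeros(len(examples[0]))
    mistakes = 0
    for i in range(n_examples):
        x = examples[i % len(examples)]   # cycle through the fixed list
        if np.dot(w, x) <= 0:             # mistake
            w = w + x                     # perceptron update
            mistakes += 1
    return w, mistakes
```

On linearly separable data the online accuracy 1 − M/N approaches 1; on nonseparable data the cycling behavior mentioned in the introduction appears, and Theorem 1 bounds how often mistakes can occur.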
Theorem 1. If there exists a weight vector w* with accuracy 1 − α, balancing separation σ, and average error η, then, for any δ > 0 such that δ ≤ 0.25 and for any ε > 0 such that err(ε) = ε + α(1 + η/σ) < 0.5, PERCEPTRON will have online accuracy at least 1 − err(ε) with probability 1 − δ after 8nε⁻²σ⁻² ln(2/δ) examples.

The following lemma is used to prove Theorem 1. It gives an upper bound for the sum of the v_i's.

Lemma 2. Σ_{i=1}^{M} v_i ≤ √(Mn)
Proof. This proof is a simple variation of Minsky and Papert's proof of the perceptron convergence theorem (Minsky and Papert 1969). First, define

G(w) = (w* · w) / ||w||

Because ||w*|| = 1 by definition, it follows that G(w) ≤ 1. Consider in turn the numerator and denominator of G(w_{M+1}). For the numerator,

w* · w_{i+1} = w* · (w_i + x_i) = w* · w_i + w* · x_i = w* · w_i + v_i

Because w* · w_1 = 0, it follows that w* · w_{M+1} = Σ_{i=1}^{M} v_i. For the denominator,

||w_{i+1}||² = w_{i+1} · w_{i+1}
            = (w_i + x_i) · (w_i + x_i)
            = ||w_i||² + 2 w_i · x_i + ||x_i||²
            = ||w_i||² + 2 w_i · x_i + n
            ≤ ||w_i||² + n

The inequality follows because x · x = n and, by definition, w_i misclassifies x_i, hence w_i · x_i ≤ 0. Because ||w_1||² = 0, ||w_{M+1}||² ≤ Mn; thus ||w_{M+1}|| ≤ √(Mn). Therefore,

1 ≥ G(w_{M+1}) = (Σ_{i=1}^{M} v_i) / ||w_{M+1}|| ≥ (Σ_{i=1}^{M} v_i) / √(Mn)

and the inequality of the lemma follows. □
Define v̄ to be the sum of the v_i's divided by N:

v̄ = (1/N) Σ_{i=1}^{M} v_i

Lemma 2 implies that v̄ ≤ √(Mn)/N ≤ √(n/N). Before proving Theorem 1, I introduce the following inequalities from Hoeffding (1963) for proving probability bounds on sums of independent variables.

Lemma 3 (Hoeffding 1963). If ȳ is the sample mean of N independent variables y_i, a ≤ y_i ≤ b, and t is a positive value, then the following two inequalities hold:

P(ȳ − E[ȳ] ≥ t) ≤ e^(−2Nt²/(b−a)²)
P(E[ȳ] − ȳ ≥ t) ≤ e^(−2Nt²/(b−a)²)
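A quick numerical sanity check of Lemma 3 (my own illustration, not from the paper): for Bernoulli(1/2) variables on [0, 1], the estimated probability of a large deviation of the sample mean should fall below the Hoeffding bound e^(−2Nt²/(b−a)²):

```python
import math
import random

random.seed(0)

def deviation_probability(N, t, trials=2000):
    """Estimate P(ybar - E[ybar] >= t) for y_i ~ Bernoulli(0.5) on [0, 1]."""
    hits = 0
    for _ in range(trials):
        ybar = sum(random.random() < 0.5 for _ in range(N)) / N
        if ybar - 0.5 >= t:
            hits += 1
    return hits / trials

N, t = 100, 0.1
bound = math.exp(-2 * N * t * t / (1 - 0) ** 2)   # here (b - a) = 1
estimate = deviation_probability(N, t)
```

For these values the bound is e^(−2) ≈ 0.135, while the true binomial tail probability is well under that, so the inequality holds with room to spare.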
Proof of Theorem 1. Let N ≥ 8nε⁻²σ⁻² ln(2/δ). Assume P[M/N > err(ε)] > δ. That is, the probability that more than an err(ε) fraction of the examples are misclassified is greater than δ.

First, Lemma 2 is applied, and the result used later. If δ ≤ 0.25, then ln(2/δ) > 2, so N ≥ 8nε⁻²σ⁻² ln(2/δ) implies N > 16nε⁻²σ⁻², and by Lemma 2:

v̄ ≤ √(n/N) < √(n/(16nε⁻²σ⁻²)) = εσ/4

Now by the definition of balancing separation and average error, there exists a c ≥ σ such that the probability that w* · x ≤ c is α(1 + η/σ) = err(0), and the expected value of w* · x when w* · x ≤ c is 0.

Let L be the number of examples such that w* · x ≤ c. Using Hoeffding's inequality:

P[L/N − err(0) ≥ t] ≤ e^(−2Nt²)

When N ≥ 8nε⁻²σ⁻² ln(2/δ), then E[L] = N · err(0), and choosing t = εσ/(4√n) results in

P[L/N ≥ err(0) + εσ/(4√n)] ≤ δ/2

Note that −√n ≤ w* · x ≤ √n because ||w*|| = 1 and ||x|| = √n. This implies that σ ≤ √n, and so εσ/(4√n) ≤ ε/4. Assuming that P[M/N > err(ε)] > δ, it follows that M/N > L/N + 3ε/4 with probability greater than δ/2.

Let s be the sum of the L values w* · x with w* · x ≤ c. Because −√n ≤ w* · x ≤ √n, by Hoeffding's inequality:

P[E[s/N] − s/N ≥ t] ≤ e^(−2Nt²/(2√n)²) = e^(−Nt²/(2n))

When N ≥ 8nε⁻²σ⁻² ln(2/δ), then E[s] = 0, and choosing t = εσ/2 results in

P[s/N ≤ −εσ/2] ≤ δ/2

Assuming that P[M/N > err(ε)] > δ, both M/N > L/N + 3ε/4 and s/N > −εσ/2 hold with probability greater than zero. At worst, v̄ will include all the w* · x with w* · x ≤ c. Any additional w* · x must be at least σ. Thus, with probability greater than zero,

v̄ > −εσ/2 + 3εσ/4 = εσ/4

However, the result from Lemma 2 ensures that v̄ < εσ/4, which contradicts a nonzero probability for v̄ > εσ/4. Therefore, the assumption must be wrong, i.e., it must be the case that P[M/N > err(ε)] ≤ δ when

N ≥ (8n/(ε²σ²)) ln(2/δ)

which proves the theorem. □
It is noteworthy that the accuracy and average error parameters, α and η, do not affect this upper bound on the number of examples. I believe that, with the careful use of Chernoff bounds, N could be shown to vary linearly with err(ε). However, the special conditions for applying Chernoff bounds would complicate the analysis considerably.
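For intuition about the magnitudes involved, the bound of Theorem 1 can be evaluated directly; a hedged sketch (the function name and the example parameter values are my own, not from the paper):

```python
import math

def sample_bound(n, eps, sigma, delta):
    """Sufficient sample size from Theorem 1: 8 n eps^-2 sigma^-2 ln(2/delta)."""
    return math.ceil(8 * n * eps ** -2 * sigma ** -2 * math.log(2 / delta))

# e.g., n = 20 Boolean attributes, eps = 0.1, sigma = 0.5, delta = 0.05
N = sample_bound(20, 0.1, 0.5, 0.05)
```

Note the quadratic dependence on 1/ε and 1/σ: halving either tolerance quadruples the number of examples, while the confidence parameter δ enters only logarithmically.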
PERCEPTRON uses many more examples to converge to a suboptimal accuracy than the number of examples necessary to identify a weight vector that is within ε of optimal with probability 1 − δ. It is well known that the separation σ must be exponentially small in the number of attributes n in some cases (Minsky and Papert 1969), and so the above bound will be exponentially large in n in these cases. However, it follows from the results in Blumer et al. (1989) that finding the most accurate linear threshold function on O{nε⁻² ln[1/(εδ)]} examples would suffice for this task.

4 Generalizing the Separation Parameter
The analysis assumes that there exists a c > 0 such that the expected value of w* · x in the range 0 < w* · x ≤ c is σ, and the probability that 0 < w* · x ≤ c is αη/σ. This condition can be relaxed in the following way. There exists a c > 0 and an α′ > 0 such that the expected value of w* · x in the range 0 < w* · x ≤ c is σ, and P(0 < w* · x ≤ c) = α′ ≥ αη/σ. That is, the proportion α′ of examples represented by the balancing separation σ is sufficient to ensure that the expected value of w* · x when w* · x ≤ c is greater than or equal to 0. The above analysis can be easily modified to demonstrate that the perceptron will achieve at least 1 − (ε + α + α′) accuracy with probability 1 − δ.

Note that in the separable case, α is 0, leaving the relationship between α′ and σ as the limiting constraint on the learnability of linear threshold functions with 1 − ε accuracy. This is the focus of Baum's analysis of the perceptron in the separable case assuming a uniform distribution over the unit sphere (Baum 1990), i.e., for this distribution, only a small fraction of examples has a small separation. Baum then shows that the perceptron learns using O(nε⁻³) examples.

5 Remarks
A similar, but somewhat more involved, analysis can be applied to the weighted majority algorithm (Littlestone 1989). This algorithm and other algorithms developed by Littlestone (1988) are interesting because their bounds are logarithmic in n rather than linear. Using the framework of this paper, it can be shown that O[(ln n)ε⁻²σ⁻²(ln 1/δ)] examples are sufficient to achieve at least 1 − err(ε) = 1 − [ε + α(1 + η/σ)] accuracy with probability 1 − δ using the weighted majority algorithm (Bylander 1993). A drawback is that an updating factor in the algorithm depends on the balancing separation σ, so if nothing is known about σ in advance, one must instantiate the algorithm multiple times using different guesses for σ.

The perceptron algorithm can be modified to return a single weight vector that is likely to have at least 1 − err(ε) accuracy, e.g., first obtain a sufficiently large "test" sample of examples; then, run the perceptron on a sufficient number of additional examples, testing each weight vector against the test sample; and finally, return the weight vector with the highest accuracy on the test sample. This is similar to the conversion of mistake-bound algorithms to PAC-learning algorithms given in Littlestone (1989, Chapter 5).

An unanswered question is the complexity of learning linear threshold functions with better than 1 − err(0) accuracy. In some sense, the proof of Theorem 1 assumes that the perceptron will make mistakes on the "worst" examples, i.e., all the examples misclassified by w*, plus those that make the least progress toward w*. For various kinds of noise (Sloan 1988), one might be able to demonstrate that the perceptron is better behaved.

More generally, whenever it is hard to learn near-perfect or near-optimal concepts, learning good suboptimal approximations is still a reasonable possibility. Understanding when this is possible is an important direction for future research.

Acknowledgments

This paper is a revised version of Bylander (1993). Thanks to Bruce Rosen and anonymous reviewers for comments.

References

Baum, E. B. 1990. The perceptron algorithm is fast for nonmalicious distributions. Neural Comp. 2(2), 248-260.
Blum, A., and Rivest, R. L. 1988. Training a 3-node neural network is NP-complete. Proc. First Annual Workshop Computational Learning Theory, 9-18.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. JACM 36(4), 929-965.
Bylander, T. 1993. Polynomial learnability of linear threshold approximations. Proc. Sixth Annual ACM Conf. Computational Learning Theory, 297-302, Santa Cruz, California.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
Gallant, S. I. 1990. Perceptron-based learning algorithms. IEEE Trans. Neural Networks 1(2), 179-191.
Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. J. Am. Statist. Assoc. 58, 13-30.
Höffgen, K.-U., and Simon, H.-U. 1992. Robust trainability of single neurons. Proc. Fifth Annual ACM Workshop Computational Learning Theory, 428-439, Pittsburgh, Pennsylvania.
Judd, J. S. 1990. Neural Network Design and the Complexity of Learning. MIT Press, Cambridge, MA.
Kearns, M. J., Schapire, R. E., and Sellie, L. M. 1992. Towards efficient agnostic learning. Proc. Fifth Annual ACM Workshop Computational Learning Theory, 341-352, Pittsburgh, Pennsylvania.
Littlestone, N. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learn. 2(4), 285-318.
Littlestone, N. 1989. Mistake bounds and logarithmic linear-threshold learning algorithms. Ph.D. thesis, University of California, Santa Cruz, California.
Minsky, M. L., and Papert, S. A. 1969. Perceptrons. MIT Press, Cambridge, MA.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing, Vol. 1. MIT Press, Cambridge, MA.
Shavlik, J., Mooney, R. J., and Towell, G. 1991. Symbolic and neural learning algorithms: An experimental comparison. Machine Learn. 6(2), 111-143.
Sloan, R. 1988. Types of noise in data for concept learning. Proc. First Annual Workshop Computational Learning Theory, 91-96.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27(11), 1134-1142.
Received November 2, 1993; accepted August 10, 1994.
Communicated by Scott Fahlman
An Algorithm for Building Regularized Piecewise Linear Discrimination Surfaces: The Perceptron Membrane

Guillaume Deffuant*
CREST, 15, rue G. Péri, 92245 Malakoff, France

The perceptron membrane is a new connectionist model that aims at solving discrimination (classification) problems with piecewise linear surfaces. The discrimination surfaces of perceptron membranes are defined by the union of convex polyhedrons. Starting from only one convex polyhedron, new facets and new polyhedrons are added during learning. Moreover, the positions and orientations of the facets are continuously adapted according to the training examples. Considering each facet as a perceptron cell, a geometric credit assignment provides a local training domain to each perceptron of the network. This enables one to apply statistical theorems on the probability of good generalization for each unit on its learning domain, and gives a reliable criterion for perceptron elimination (using the Vapnik-Chervonenkis dimension). Furthermore, a regularization procedure is implemented. The model's efficiency is demonstrated on well-known problems such as the 2-spirals or waveforms.

1 Introduction
Feedforward neural networks can be considered as examples of nonparametric regression estimators (Geman et al. 1992). A typical nonparametric inference problem is the estimation of arbitrary decision boundaries for a discrimination task, based on a collection of labeled (preclassified) training examples. The term "nonparametric" means that no structure, or class of boundaries (like linear or quadratic surfaces), is assumed a priori, as it would be in the case of a parametric model.

The main problem of this approach is now well identified: it is the bias/variance dilemma (Geman et al. 1992). This dilemma can be related to the richness of the a priori hypotheses set in which the estimator is searched (Deffuant 1992). The variance comes from a high sensitivity to nonrelevant particularities of the training set. Completely model-free approaches, using enormous a priori hypotheses sets enabling approximation of any function, suffer from high variance. These approaches, therefore, require prohibitively large training sets to converge. On the contrary, poor hypotheses sets, which are less sensitive to the training set, are likely to have a high bias. This happens when the function to approximate is far from all possible hypotheses. The main difficulty of nonparametric estimation is therefore to achieve a tradeoff between variance and bias. This can be done by the adaptation of the hypotheses set (the set of possible models) to the training data.

The algorithms of architecture modulation in connectionist networks seldom take this tradeoff explicitly into account. Two main strategies of architecture modulation can be found in the literature:

- Growing networks: new units are added to the network and incrementally improve its classification results on the training examples. Some authors keep the traditional multilayer architecture (Nadal 1989; Mézard and Nadal 1989; Marchand et al. 1990; Fahlman and Lebiere 1990), while others propose a hierarchy of multilayer networks (Bochereau et al. 1990). The tree architecture shows some advantages considering the growth procedure (Utgoff 1989; Frean 1990; Deffuant 1990a,b). With such procedures, the bias is low, but the variance is difficult to control.

- Shrinking networks: the network is initialized with a large structure and useless elements are deleted. The main methods are "weight decay" (Hinton 1986; Scalettar and Zee 1988) or direct pruning (Sietsma and Dow 1988). These procedures provide a "smoothing" or a regularization of the solution that decreases the variance. However, the choice of the smoothing criterion is quite difficult.

*Permanent Address: ADAPT, 3 rue de l'Arrivée, 75749 Paris, France.

Neural Computation 7, 380-398 (1995)
© 1995 Massachusetts Institute of Technology
This paper describes a new connectionist model, the perceptron membrane, in which the bias/variance tradeoff is central. It involves concurrent growing and shrinking procedures. The originality of the model (compared to other connectionist models) lies in the use of its geometric interpretation: it is equivalent to an adaptive polyhedral surface defining the boundaries between examples of different classes. This geometric interpretation is widely used in the learning procedures. The units of the network adapt their weights according to a "geometric credit assignment" providing fast and efficient learning. Moreover, geometric considerations are used to achieve a compromise between bias and variance. On one hand, new convex polyhedrons can be added in order to improve the classification of the training examples. This increases the richness of the model and enables one to decrease the bias. On the other hand, regularization procedures are performed. These procedures maximize the regularity of the discrimination surface. They are founded on local observations of the surface and of the corresponding training examples. These procedures control the variance according to the training set size.

The paper is organized as follows: Section 2 defines mathematically the perceptron membrane and gives details about the bias/variance dilemma, Section 3 describes the geometric credit assignment, Sections 4 and 5 describe the search for a bias/variance compromise, Section 6 describes the global algorithm, and Section 7 is devoted to descriptions of simulation examples.

2 Nonparametric Discrimination by Piecewise Linear Surfaces

2.1 The Problem. We consider a training set T_N of N d-dimensional vectors {x¹, x², . . . , xⁱ, . . . , xᴺ}, labeled by {y¹, y², . . . , yⁱ, . . . , yᴺ} in {0, 1}. The training set is supposed to be drawn from a probability distribution D(x, y) on ℝᵈ × {0, 1}. The goal is to approximate the optimal decision surface S. On one side of the surface, the mean value E[y | x] of y conditioned on x is higher than 1/2; on the other side it is the contrary. In this paper, we consider a piecewise linear surface, a membrane, that adapts itself to the training examples to approximate S. A membrane M is defined as the boundary of the union of convex polyhedrons (CPs). The facets of the surface are defined by perceptron cells (Rosenblatt 1962).
Definition 1. Perceptron Membrane. Let {w¹, w², . . . , wᵖ} be a set of d-dimensional vectors, and {b¹, b², . . . , bᵖ} a set of real numbers, defining the weights and biases of perceptron cells, and let {C₁, C₂, . . . , C_c} be a set of subsets of {1, 2, . . . , p}, defining a set of convex polyhedrons. The perceptron membrane M defined by these perceptrons and convex polyhedrons is the boundary of the set

I(M) = ⋃_{j=1}^{c} ⋂_{i∈C_j} {x | wⁱ · x + bⁱ ≥ 0}

where · is the scalar product in ℝᵈ.
Example. The membrane M of Figure 1 is represented in bold lines; I(M) is the part of space in gray. Let wⁱ and bⁱ be the weights and bias of perceptron i. For the membrane of Figure 1, we have

I(M) = ⋃_{j=1}^{3} ⋂_{i∈C_j} {x | wⁱ · x + bⁱ ≥ 0}

with one index set C_j per convex polyhedron.
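Definition 1 has a direct computational reading: a point lies in I(M) iff it satisfies every half-space of at least one convex polyhedron. A minimal sketch (the list-of-half-spaces representation and all names are my own, not the author's):

```python
# A membrane is a list of convex polyhedrons (CPs); each CP is a list of
# (w, b) pairs defining half-spaces w . x + b >= 0.
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def inside(membrane, x):
    """True iff x is in I(M): x satisfies all facets of at least one CP."""
    return any(all(dot(w, x) + b >= 0 for (w, b) in cp) for cp in membrane)

# Two axis-aligned unit-height boxes in the plane, each as 4 half-spaces:
# [x0, x1] x [0, 1].
square = lambda x0, x1: [((1, 0), -x0), ((-1, 0), x1),
                         ((0, 1), 0.0), ((0, -1), 1.0)]
membrane = [square(0.0, 1.0), square(2.0, 3.0)]
```

The two-class discrimination rule of Section 2.1 then answers 1 for points where `inside` is true and 0 elsewhere.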
Figure 1: A membrane. I(M) is defined by three convex polyhedrons.

Definition 2. Active part of hyperplane (or perceptron). The active part of a perceptron is the part of the hyperplane that is included in M, i.e., is included in the boundary of I(M) (the bold lines on Fig. 1). A point x is in the active part of perceptron k iff the following conditions are verified:

i. x belongs to hyperplane k:

wᵏ · x + bᵏ = 0

ii. x belongs to facet k of some CP j:

∃j such that k ∈ C_j and ∀i ∈ C_j, wⁱ · x + bⁱ ≥ 0

iii. x is not in the interior of I(M):

∀l ∈ {1, . . . , c}, k ∉ C_l ⇒ ∃i ∈ C_l such that wⁱ · x + bⁱ < 0
The perceptron membrane can be used for a two-class discrimination problem: it answers 1 for the points located inside I(M), and 0 for the others. It is well known that such functions are equivalent to 2-hidden-layer feedforward networks (Bochereau et al. 1990; Sethi 1990), in which the weights of the hidden layers are fixed according to the logical relations. Perceptron membranes are therefore a subset of feedforward networks. However, their geometric interpretation makes them easier to manipulate, as shown below.

2.2 The Bias/Variance Dilemma. The bias/variance dilemma always arises in nonparametric estimation problems. Let T_N be a training set of size N drawn from the distribution D, and f(x, T_N) be the classification result at x of the estimator derived from T_N. The performance of the estimator is measured by the quadratic error between f(x, T_N) and the best classification choice S(x) corresponding to the optimal decision surface S. To evaluate the estimation method, one must average this performance over all possible training sets T_N drawn from D. The average over the possible sets is denoted by ⟨ ⟩, corresponding to the quenched average in Seung et al. (1992). It can be shown that the quenched average of the estimation performance splits into two terms (Geman et al. 1992):

⟨[f(x, T_N) − S(x)]²⟩ = [⟨f(x, T_N)⟩ − S(x)]² + ⟨[f(x, T_N) − ⟨f(x, T_N)⟩]²⟩

The first term is called the bias and the second the variance of the estimation method. High bias appears when the hypotheses set of the estimator is too poor compared to the function to approximate: every function f(x, T_N) is far from S(x). On the contrary, high variance occurs when the set of possible functions is too rich. The estimator then has more chances to be sensitive to nonrelevant particularities of the training set.

In nonparametric estimation, the hypotheses set is very rich. With sufficiently large training sets, many nonparametric estimators can approximate arbitrarily well optimal decision rules (they are then said to be consistent). Among many others, "CART" (Classification and Regression Trees; Breiman et al. 1984) and "MARS" (Multivariate Adaptive Regression Splines; Friedman 1991), as well as feedforward neural networks (White 1992), can be used as consistent nonparametric estimators. It can be easily shown that perceptron membranes are also consistent (Deffuant 1992).

In practice, it is important to control the bias/variance compromise for a given learning set. The main mathematical tool to achieve this control is the Vapnik-Chervonenkis dimension (Vapnik and Chervonenkis 1981). As shown afterward, the perceptron membranes use this tool for the achievement of the bias/variance tradeoff.

3 Regression for a Fixed Structure of the Membrane
As soon as a perceptron membrane includes several perceptrons, the well-known "credit assignment problem" arises: how should the responsibility for an answer be assigned among the different perceptrons? Considering the perceptron membrane as a feedforward network and using the backpropagation algorithm (Le Cun 1985; Rumelhart et al. 1986) is one solution. The proposed geometric credit assignment shows advantages for establishing the bias/variance tradeoff.
3.1 Definition of the Perceptron Learning Domain: The Geometric Credit Assignment. The learning procedure uses the definition of a learning domain for each perceptron.

Definition 3. Learning domain of a perceptron (see Figs. 2 and 3). Let

- M be a membrane,
- H be a hyperplane (perceptron) of M,
- x be a point of ℝᵈ,
- p be the orthogonal projection of x on H;

x belongs to the learning domain L(H) of the perceptron H for the membrane M if and only if

C1. p is located in the active part of the hyperplane H, and

C2. the segment [x, p[ does not intersect the membrane M.
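The projection step of Definition 3 can be sketched directly. The active-part test below follows Definition 2, while condition C2 is checked only approximately, by sampling points along [x, p[ and requiring that the inside/outside status never flips before reaching H; this numerical stand-in for the exact segment/membrane intersection test, and all names, are my own:

```python
import numpy as np

def project(w, b, x):
    """Orthogonal projection of x on the hyperplane w . x + b = 0."""
    w = np.asarray(w, float)
    return x - (np.dot(w, x) + b) / np.dot(w, w) * w

def inside(facets, cps, x):
    """x in I(M): all half-spaces of at least one CP are satisfied."""
    return any(all(np.dot(facets[i][0], x) + facets[i][1] >= 0 for i in cp)
               for cp in cps)

def active(facets, cps, k, p, tol=1e-9):
    """Definition 2: p on hyperplane k, on facet k of some CP containing k,
    and not in the interior of I(M)."""
    w, b = facets[k]
    if abs(np.dot(w, p) + b) > tol:                              # condition (i)
        return False
    on_facet = any(k in cp and
                   all(np.dot(facets[i][0], p) + facets[i][1] >= -tol
                       for i in cp)
                   for cp in cps)                                # condition (ii)
    interior = any(k not in cp and
                   all(np.dot(facets[i][0], p) + facets[i][1] > tol
                       for i in cp)
                   for cp in cps)                                # condition (iii)
    return on_facet and not interior

def in_learning_domain(facets, cps, k, x, steps=50):
    """Definition 3: C1 via the active-part test; C2 checked numerically."""
    x = np.asarray(x, float)
    p = project(facets[k][0], facets[k][1], x)
    if not active(facets, cps, k, p):                            # condition C1
        return False
    status = inside(facets, cps, x)
    return all(inside(facets, cps, x + t * (p - x)) == status
               for t in np.linspace(0.0, 1.0, steps, endpoint=False))  # C2
```

The representation matches Definition 1: `facets` is a table of (w, b) pairs and `cps` a list of index sets C_j.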
This learning domain is computed as if each training example were the center of a spherical perturbation that propagates toward the membrane and hits it orthogonally. It can happen that the perturbation hits no active part of the perceptron membrane. In this case, its influence is neglected. The learning domain provided by the geometric credit assignment has an interesting property:

Theorem 1. The discriminations made by perceptron H in L(H) are equal to the discriminations made by the whole membrane in L(H).

Sketch of Proof. By definition of L(H), the only part of the membrane that is in L(H) is the active part of H. Therefore, in this part of space, the discriminations are totally determined by H. □

This property is important because it allows one to "factorize" the error minimization over individual perceptrons: the local error minimizations provide a global error minimization. The credit assignment is therefore very efficient. Moreover, this property will be used by the perceptron elimination procedure.

3.2 The Training Algorithm. The learning domain is used in the training algorithm as follows:
Training algorithm (one training cycle): repeat N times (N being the size of the training set):

i. choose randomly a training example x;

ii. for each perceptron H of the membrane: if x belongs to the learning domain of H, then the hyperplane H is attracted or repulsed by x so that the membrane tends to absorb class 1 examples and to reject class 0 ones.

Figure 2: The geometric credit assignment. Example B repulses perceptrons 5 and 1. Example A repulses perceptrons 1, 2, 3, and not 4.

Figure 3: Learning domain. The learning domain of perceptron 1 is hatched.

The repulsion or attraction of the hyperplanes is derived from the delta rule (Rumelhart et al. 1986) applied to one perceptron cell alone. Considering a perceptron of weights w and bias b, the modification of the parameters is performed by example x′ as follows:
for 1 5 j 5 d
+ b ) - y'] f'(w.x' + b )
where w,are the components of w, p is the step of the training algorithm, and f the sigmoid function.
4 Bias Reduction This section describes the procedures that add new degrees of freedom to the model and consequently make it possible to reduce the bias.
The Perceptron Membrane
387
Figure 4: CP construction. The perceptrons are created to isolate a chosen example (pointed by the arrow) from examples of the other class.
4.1 Initialization of the Membrane: Creation of the First Convex Polyhedron. Let the inside class of the membrane be the class 1. Then, a class 1 example e is searched in the training set. A CP C that contains only class 1 examples is built around e, according to the following procedure:
CP construction (see Fig. 4): i. Initialize C with the median hyperplane between e and its nearest neighbor of class 0, such that e is on the positive side of the hyperplane. ii. Iterate the construction by adding the median hyperplane between e and its nearest neighbor of class 0 inside C, such as e is on the positive side of the hyperplane. iii. Stop the construction when C contains only examples of class 1. The CP built according to this method allows initialization of the membrane, which is then trained according to the geometric credit assignment. 4.2 Recruitment. The purpose is to add a new CP to the membrane definition, or to dig a convex hole into I ( M ) .
Recruitment procedure: i. Search for a misclassified example. ii. If such an example e is found, a convex polyhedron C that contains only e-class examples, is built around it by the CP construction procedure. iii. Case 1: e is of class 1, then, C is added to the membrane definition (see Fig. 5);
Case2: e is of class 0, then every CP that intersects with C is replaced by its intersection with the complementary of C (see Fig. 6).
388
Guillaume Deffuant
Figure 5: CP recruitment (e is of class 1). The new CP is built around a class 1 example. Before recruitment I ( M ) = {(1>2)},and after recruitment I ( M ) = { ( 1 ? 2 )(3,4.5)}. >
Figure 6: CP recruitment (e is of class 0). The new CP is built around a class 0 example. Before recruitment I(M) = {(1,2)}, and after recruitment I(M) = {(1,2), (3,4,5)}.

4.3 Perceptron Duplication. The procedure is the following:

Perceptron duplication:
i. Choose at random a perceptron of the membrane.

ii. Copy the perceptron and slightly modify its weights and bias at random.

iii. Create a new concave or convex (with equal probability) deformation of the membrane (see Fig. 7). The convex deformation is obtained by simply adding the new perceptron to the corresponding CP. The concave deformation is obtained by creating a new CP in which the new perceptron replaces the original one (see Fig. 7).
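The duplication step can be sketched as follows. This is an illustrative sketch: the membrane is represented as a list of CPs, each a list of (weights, bias) pairs, and the noise amplitude is an assumed value:

```python
import random

def duplicate_perceptron(membrane, noise=0.05, rng=random):
    """Duplicate a random perceptron and deform the membrane.

    `membrane` is a list of CPs; each CP is a list of (weights, bias)
    pairs, with `weights` a tuple.  The noise amplitude is illustrative.
    """
    cp = rng.choice(membrane)
    w, b = rng.choice(cp)
    w2 = tuple(wi + rng.uniform(-noise, noise) for wi in w)
    b2 = b + rng.uniform(-noise, noise)
    if rng.random() < 0.5:
        # Convex deformation: add the perturbed copy to the same CP.
        cp.append((w2, b2))
    else:
        # Concave deformation: new CP in which the copy replaces the original.
        membrane.append([(w2, b2) if (wp, bp) == (w, b) else (wp, bp)
                         for wp, bp in cp])
    return membrane
```

Starting from I(M) = {(1,2)}, the convex branch yields {(1,2,3)} and the concave branch {(1,2), (1,3)}, matching Figure 7.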
Figure 7: Perceptron duplication. (a) Convex perceptron duplication; (b) concave perceptron duplication. Before duplication I(M) = {(1,2)}, and after duplication in case (a) I(M) = {(1,2,3)}, and in case (b) I(M) = {(1,2), (1,3)}.
Figure 8: Regularization. Useless irregularities are removed from the membrane. Before regularization I(M) = {(1,2), (3,4,5)}, and after I(M) = {(1,2,3), (2,3,4,5)}. Perceptrons 2 and 3 are shared by both CPs.

5 Variance Reduction

5.1 Regularization. The regularization theory (Tikhonov and Arsenin 1977) leads to reduction of the variance by selecting the most regular solutions of a problem. The same idea is applied in the perceptron membranes. Periodically, useless irregularities of the surface are eliminated (as shown in Fig. 8). This is done by "linking" pairs of CPs. Two CPs are said to be linked when the same perceptron is present in their definitions. The procedure considers all pairs of CPs that intersect, and tests whether sharing a perceptron improves the membrane efficiency on the training set (i.e., whether the training-set error of the inside part of the CP after linking is lower than or equal to the one before linking). If this test is positive, the perceptron is shared by both CPs.

Figure 9: Elimination process. The learning domain of perceptron 3 is empty on one side; it is therefore useless. Its removal from the membrane is followed by a membrane reorganization. Before elimination I(M) = {(1,2), (3,4,5,6)} and after I(M) = {(1,2), (1,4,5,6)}. Perceptron 1 is shared by both CPs and perceptron 3 has been removed.

5.2 Removal of Perceptrons. Let us assume that the set of training examples belonging to a perceptron learning domain is approximately constant during one training pass (in practice, we check that the number of training examples in the learning domain of the perceptron is stabilized). Then the relevance for generalization of the perceptron on this training set can be evaluated according to the result of Ehrenfeucht et al. (1988). This result gives a bound on the training set size under which there exists a distribution (in the worst case) leading with good chance to a bad generalization rate. The theorem uses the Vapnik-Chervonenkis dimension (Vapnik and Chervonenkis 1981) of a class of functions:

Theorem 2 (Ehrenfeucht et al. 1988). Let F be a class of {0,1}-valued functions on R^d with Vapnik-Chervonenkis dimension d_VC >= 2. For any epsilon, 0 < epsilon < 1/8, and for any N such that N <= d_VC/(32 epsilon), there exists a distribution for which, with probability greater than 1/20, a function learned from N examples makes a generalization error greater than epsilon.

Taking epsilon = 1/8: if the number of training examples in a perceptron learning domain is lower than about d/4, there is a probability greater than 1/20 of making an error in generalization greater than 1/8 in this learning domain. According to the theorem, this error rate is the same for the global membrane on this learning domain. This gives a reliable criterion to select useless perceptrons. In practice, as the theorem holds in the worst case, the threshold can be higher than d/4. A value close to N = 2d seems to provide satisfactory results in various cases (see Section 7). Moreover, a perceptron receiving no perturbation from one side is useless because it discriminates no example. Such perceptrons must also
be removed in any case. Note that the regularization procedure must be applied to the CP that lost a perceptron, to optimize the modification of the membrane (see Fig. 9).

6 Development Algorithm
The development algorithm is similar to simulated annealing. An artificial temperature T in [0,1] rules the probability of CP recruitment and perceptron duplication. Elimination and regularization procedures are applied after every training pass. Starting from an initial value T0, T decreases slowly according to the number of errors and the complexity of the membrane. Learning stops when T reaches 0. The development algorithm is the following:

initialization of T: T := T0.
while (T > 0) do:
    perform one training pass
    remove useless perceptrons
    regularize the membrane
    recruit new CPs with probability T
    duplicate perceptrons with probability T
    decrease T:

        T := T - alpha / (error + nper * d)
where error is the classification error on the last training pass, nper is the number of perceptrons of the membrane, and alpha is a parameter. alpha enables one to control the rapidity of the decrease in T. The temperature decreases slowly when error and nper are high. T0, alpha, and mu (the learning rate) are the only parameters of the algorithm. Several examples of membrane developments are given in the next section.

7 Simulation Results

7.1 Holed Square Problem with a Uniform Distribution of Examples. The experimental conditions are the following:
Five hundred learning examples are drawn from a uniform distribution in a hypercube of dimension d, each coordinate being drawn uniformly between -1 and +1. We made experiments for the values d = 2, d = 5, and d = 10.
Table 1: Results for the Double Square Problem (for 500 Training Examples).

Dimension        Training passes   Perceptron number   CP number   Generalization rate
2 (no noise)           80                 9                4.2            96.5
2 (10% noise)         120                10.2              5.8            85.2
5 (no noise)          210                11                5.2            90.4
5 (10% noise)         300                12.2              6.4            73.6
10 (no noise)         450                15                7.2            80.9
10 (10% noise)        550                17.2              8.4            63.7
Learning examples x such that (x1, x2) is inside the square of center (0,0) and of width 1.4, and outside the square of center (0,0) and of width 0.6, belong to class 1; the others belong to class 0. The behavior of the system has also been tested on noisy learning bases: the noise is introduced by giving a wrong label to 10% of the learning examples (see Fig. 10). The chosen parameter values are:

for d = 2:  alpha = 0.25,  T0 = 0.5, mu = 0.1
for d = 5:  alpha = 0.125, T0 = 0.2, mu = 0.1
for d = 10: alpha = 0.1,   T0 = 0.1, mu = 0.1
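The construction of the training set just described can be sketched as follows (a minimal sketch; the function name is illustrative):

```python
import random

def holed_square_example(d=2, noise=0.0, rng=random):
    """Draw one (x, label) pair for the holed square problem.

    Coordinates are uniform in [-1, 1]^d.  The label depends only on the
    first two coordinates: class 1 between the square of width 1.4 and
    the square of width 0.6, both centered at the origin; class 0
    elsewhere.  With probability `noise` the label is flipped.
    """
    x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    half = max(abs(x[0]), abs(x[1]))       # Chebyshev distance to the center
    label = 1 if 0.3 < half <= 0.7 else 0  # half-widths 0.3 and 0.7
    if rng.random() < noise:
        label = 1 - label
    return x, label

# A 500-example noisy base as used in the experiments.
training_set = [holed_square_example(d=2, noise=0.1) for _ in range(500)]
```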
It can be verified that this problem is difficult for a two-layer feedforward network trained with the backpropagation algorithm because of numerous possibilities for local minima. Table 1 gives the average results for five membrane developments, for noisy and nonnoisy data. An example of membrane development for noisy data in dimension 2 is shown in Figure 10, and an example with nonnoisy data in dimension 3 is shown in Figure 11.

7.2 Two Spirals. It is well known that this problem is very hard for backpropagation networks because of the numerous configurations of local minima (see Baum and Lang 1991, for instance). The perceptron membrane easily succeeds in solving the problem in very few training passes. An example of membrane development is shown in Figure 12. The number of training examples is 192 (96 examples of each class). The chosen parameter values are alpha = 0.3, T0 = 1, mu = 0.02. The final membrane involves 29 perceptrons and the result is reached after 150 training passes.
Figure 10: Membrane development for the holed square problem in dimension 2 (snapshots at t = 30, 40, 50, 60, 70, 110). The learning set includes 500 examples. A noise of 10% has been introduced: a training example has a probability of 0.1 of having a wrong label. t is the number of training passes. The final membrane has 8 perceptrons and 4 CPs.

7.3 Example of the Waveforms. The example of the waves was first introduced in Breiman et al. (1984) to study the behavior of classification and regression trees. This is a three-class problem that is based on the 21-dimensional waveforms f1, f2, f3 shown in Figure 13. Class 1 examples are generated as noisy linear combinations of f1 and f2, class 2 of f2 and f3, and class 3 of f3 and f1. More precisely, examples of class 1 are generated as follows: Let x_i be the components of the generated example,
u be a number drawn from a uniform distribution on [0,1],
Figure 11: Membrane development for the holed square problem in dimension 3. The learning set includes 500 examples. t is the number of training passes. The final membrane has 8 perceptrons and 4 CPs.

For i = 1 to 21: let g_i be a number drawn from a gaussian distribution of mean 0 and standard deviation 1; x_i is then defined by

x_i = u * f1_i + (1 - u) * f2_i + g_i
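The class 1 generation rule can be sketched as follows. This is a minimal sketch; the triangular shapes and apex positions assumed below for f1, f2, f3 follow the common reconstruction of Breiman et al.'s waveforms and are an assumption here:

```python
import random

def base_waveform(peak):
    """Triangular waveform on i = 1..21 with its apex at `peak` (assumed shape)."""
    return [max(6 - abs(i - peak), 0) for i in range(1, 22)]

# f1, f2, f3: apex positions 11, 15, 7 are an assumed reconstruction.
F = [base_waveform(11), base_waveform(15), base_waveform(7)]

def class1_example(rng=random):
    """x_i = u * f1_i + (1 - u) * f2_i + g_i, u ~ U[0,1], g_i ~ N(0,1)."""
    u = rng.uniform(0.0, 1.0)
    return [u * f1 + (1.0 - u) * f2 + rng.gauss(0.0, 1.0)
            for f1, f2 in zip(F[0], F[1])]
```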
Class 2 and 3 examples are generated similarly, with a circular permutation in the set {f1, f2, f3}. To allow easy comparisons, the learning sets used for experiments are exactly the same as in Breiman et al. (1984). These sets are made of 100 examples of class 1, 85 of class 2, and 115 of class 3. The test set is made of 5000 examples, independent of the learning examples, and with equal
Figure 12: Membrane development for the 2-spirals problem (snapshots at t = 30, 40, 54). The learning set includes 192 examples (96 of each class). t is the number of training passes. The final membrane has 29 perceptrons and 13 CPs.
proportions for the three classes. All the parameters of the distributions being known, an analytic expression can be derived for the Bayes error rate. Using this rule on a test sample of size 5000 gives a recognition rate of 86%. Table 2 gives the results for the average of five membrane developments on this training set, and Table 3 summarizes the performances of other models on the same data. The chosen parameter values are alpha = 0.25, T0 = 0.1, mu = 0.1. Note that the average structure of the membrane is very simple: two CPs for three perceptrons.
Figure 13: Waveform.

Table 2: Average Results of Five Membrane Developments for the Waveform Problem.

Training passes   Perceptron number   Convex number   Generalization rate
     120                3.2                 2                83.3
8 Conclusion and Future Work
The procedures of recruitment, elimination, and regularization enable the perceptron membrane to reach a good compromise between bias and variance. The qualities of the algorithm are founded on a good use of the model's geometric properties. In particular, a novel geometric credit assignment is very efficient for the adaptation of polyhedron facets. Moreover, perceptron elimination is performed according to statistical criteria related to the generalization probability of the network. The efficiency of this approach has been illustrated on well-known examples such as the 2-spirals and the waveform problems. Future work will focus on possible improvements for the geometric credit assignment and the development algorithm.

Table 3: Performances on Test Sets for the Waveform Problem.

Bayes rule   Discriminant analysis   CART   Nearest neighbors   K-means   LVQ    Multilayer network
   86%               74%              80%          78%            82%    82.7%         81.6%

Membranes performing tasks other than
discrimination, such as function approximation and process control, are also under study.
Acknowledgments

I would like to thank T. Fuhs, L. Bochereau, P. Bourgine, F. Varela, and the reviewers for their help in improving earlier versions of the paper.

References

Baum, E., and Lang, K. 1991. Constructing hidden units using examples and queries. In Proc. NIPS 91, 904-910.

Bochereau, L., Bourgine, P., and Epesse-Priso, H. 1990. Generalist vs. specialist neural networks. In Proc. Cog. 90 1, 41-49.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. 1984. Classification and Regression Trees. Wadsworth International Group.

De Bollivier, M., Gallinari, P., and Thiria, S. 1990. Cooperation of neural nets for robust classification. In Proc. IJCNN 1990, Vol. I, pp. 113-120, San Diego.

Deffuant, G. 1990a. Neural units recruitment algorithms. In Proc. IJCNN 1990, San Diego.

Deffuant, G. 1990b. Dualite local/global et algorithmes de recrutement. In Proc. Neuro-Nimes, 1990.

Deffuant, G. 1992. Reseaux connexionnistes auto-construits. Ph.D. dissertation, EHESS, 54 Bd Raspail, Paris.

Ehrenfeucht, A., Haussler, D., Kearns, M., and Valiant, L. G. 1988. A general lower bound on the number of examples needed for learning. In Proceedings of the Annual Workshop on Computational Learning Theory 1988. Morgan Kaufmann, San Mateo, CA.

Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems II, D. S. Touretzky, ed., pp. 524-532. Morgan Kaufmann, San Mateo, CA.

Frean, M. 1990. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Comp. 2, 198-209.

Friedman, J. H. 1991. Multivariate adaptive regression splines. Ann. Statist. 19, 1-141.

Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Comp. 4, 1-58.

Hinton, G. E. 1986. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society (Amherst 1986), pp. 1-12. Erlbaum, Hillsdale, NJ.

Le Cun, Y. 1985. A learning scheme for asymmetric threshold networks. In Cognitiva (CESTA-AFCET, ed.), pp. 599-604.

Marchand, M., Golea, M., and Rujan, P. 1990. A convergence theorem for sequential learning in two-layer perceptrons. Europhys. Lett. 11, 487-492.

Mezard, M., and Nadal, J. P. 1989. Learning in feedforward layered networks: The Tiling algorithm. J. Phys. 21, 1087-1092.

Nadal, J. P. 1989. Study of a growth algorithm for a feedforward network. Int. J. Neural Syst. 1(1).

Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.

Rumelhart, D. E., Hinton, G., and Williams, R. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, J. L. McClelland, D. E. Rumelhart, and the PDP Research Group, eds.

Scalettar, R., and Zee, A. 1988. Emergence of grandmother memory in feedforward networks: Learning with noise and forgetfulness. In Connectionist Models and Their Implications: Readings from Cognitive Science, D. Waltz and J. A. Feldman, eds., pp. 309-332. Ablex, Norwood, NJ.

Sethi, K. 1990. Entropy nets. In Proc. IJCNN 1990, San Diego.

Seung, H. S., Sompolinsky, H., and Tishby, N. 1992. Statistical mechanics of learning from examples. Phys. Rev. A 45(8), 6056-6091.

Sietsma, J., and Dow, R. J. F. 1988. Neural net pruning - why and how. In IEEE International Conference on Neural Networks, San Diego, Vol. I, pp. 325-333. IEEE, New York.

Tikhonov, A. N., and Arsenin, V. N. 1977. Solutions of Ill-Posed Problems. Winston, Washington.

Utgoff, P. E. 1989. Perceptron trees: A case study in hybrid concept representations. Connection Sci. 1(4), 161-186.

Vapnik, V. N., and Chervonenkis, Y. 1981. On the uniform convergence of relative frequencies to their probabilities. In Theory of Probability and Its Applications, Vol. XXVI, pp. 532-553.

White, H. 1992. Artificial Neural Networks: Approximation and Learning Theory. Blackwell, Oxford.
Received January 24, 1994; accepted June 16, 1994.
Communicated by John Hertz
The Upward Bias in Measures of Information Derived from Limited Data Samples

Alessandro Treves
SISSA, Biophysics, via Beirut 2-4, 34013 Trieste, Italy

Stefano Panzeri
SISSA, Biophysics and Mathematical Physics, via Beirut 2-4, 34013 Trieste, Italy
Extracting information measures from limited experimental samples, such as those normally available when using data recorded in vivo from mammalian cortical neurons, is known to be plagued by a systematic error, which tends to bias the estimate upward. We calculate here the average of the bias, under certain conditions, as an asymptotic expansion in the inverse of the size of the data sample. The result agrees with numerical simulations, and is applicable, as an additive correction term, to measurements obtained under such conditions. Moreover, we discuss the implications for measurements obtained through other usual procedures.

1 Introduction
A thorough quantitative understanding of information processing in the mammalian nervous systems will eventually require reliable measurements of the amounts of information carried, in well-defined situations, by the activity of nerve cells. Although most system-level neuroscience research is definitely still of a qualitative nature, there have already been several attempts (Eckhorn and Pöpel 1975; Optican and Richmond 1987; Tovee et al. 1993) to quantify the information present in the response of cortical neurons, in vivo, with the animal performing, e.g., simple perceptual tasks.1 These attempts have met with certain technical difficulties, the most serious of which has been called the limited sampling problem. It stems from the fact that while information is defined in terms of probability

1. Information measures are, of course, always relative, in particular, relative to the way chosen to quantify cell responses. Simpler response measures, such as a cell's firing rate in a given window, will yield information values lower than (or at most similar to) those produced by more complete characterizations, e.g., measures that capture the detailed time-course of the response (Optican and Richmond 1987; Tovee et al. 1993). While one may wish to consider this as a problem of underestimating some ill-defined "true" information, it has nothing to do with the overestimating discussed in the following.
Neural Computation 7, 399-407 (1995)
© 1995 Massachusetts Institute of Technology
distributions, in measuring it from real data one has, in practice, to estimate approximate forms of the probability distributions from a data sample of limited size, N. For N → ∞ the estimated distributions are expected to tend to the "true" underlying distributions, but for a series of reasons (such as trying to keep the animal alert and interested) N often has to be limited to relatively small values. In such a situation, it turns out that naive measurements of information are always, on average, overestimated. Methods have been developed that try to correct for (Optican et al. 1991) or avoid (Chee-Orts and Optican 1993; Hertz et al. 1992; Kjaer et al. 1994) this upward bias, but only on a rather empirical basis. Here we explain how to calculate the upward bias directly. Although the result is only an asymptotic expansion, whose convergence is not guaranteed, and moreover it is valid only under certain conditions, it proves useful and clarificatory even beyond those conditions.

2 Evaluation of the Bias
To be concrete, let us assume that we want to measure the average information carried by the response r of a neuron (or of several neurons) about a stimulus s presented to the animal. We assume that s is drawn at random from a discrete set S of S elements. Likewise, we initially require that the response space R be discretized, to include a total of R distinct responses. If the actual, raw responses are real numbers (e.g., the firing rates of several cells in a given time period, or the weights of the firing train of one neuron on the principal components of the covariance matrix), we assume that they have been binned into R different boxes. We stress that R is the total number of response bins, independently of what is the underlying dimensionality, if any, of the raw response space.2 The binning procedure must satisfy an independence condition, i.e., that the number of times a given bin r is occupied should depend only on an underlying probability P(r), and not on the occupancy of the other bins. This condition is violated by most usual binning procedures that involve some prior smoothing of the data, e.g., by convolutions with a gaussian distribution. This introduces correlations among bins, which complicate the analysis that follows. On the other hand, the simplest straightforward binning procedure, that simply allots raw responses to the response interval r that they happen to lie in, does satisfy the independence condition. Averaging over S and R, the amount of information we aim at is

I = Σ_s Σ_r P(s,r) log₂ [ P(s,r) / (P(s) P(r)) ]    (2.1)
2. If, e.g., the raw responses are the firing rates of two cells, which are then discretized into R1 and R2 bins, respectively, we set R = R1 × R2.
where P(s,r) is the underlying joint probability distribution of stimuli and responses, and P(s) and P(r) the separate ones for stimuli and for responses; obviously P(r) = Σ_s P(s,r) while P(s) = Σ_r P(s,r). In practice, we have a total of N experimental trials (i.e., stimulus-response pairs) available, so we get a raw estimate of the information

I_N = Σ_s Σ_r P_N(s,r) log₂ [ P_N(s,r) / (P_N(s) P_N(r)) ]    (2.2)
where the P_N's are the experimental frequency-of-occupancy tables, e.g., P_N(r) = N_r/N, with N_r the number of times response r occurred. The difference, or bias, between I_N and I of course fluctuates depending on the particular outcomes of the N trials performed. We can estimate the average of the difference, however, by averaging ⟨⟨· · ·⟩⟩ over all possible outcomes of the N trials, keeping the underlying probability distributions fixed. We assume that P_N(s) is given by a binomial distribution of mean P(s).3 The procedure is first to rewrite I_N in a more convenient form, and then to use the replica trick (Edwards and Anderson 1975) to convert the logs into limits of a power:

log₂ x = lim_{n→0} (xⁿ - 1) / (n ln 2)    (2.3)
Note that the replica trick, which in other contexts is the initial step for sophisticated calculations with subtle implications, here reduces to the trivial expedient of calculating the average of a log as a limit in the average of a power. As the frequency tables approach, for large N, the underlying probability distributions, e.g., lim_{N→∞} P_N(r|s) = P(r|s), we write

P_N(·) = P(·) + δP_N(·)    (2.4)

and

P_Nⁿ(·) = Σ_{k≥0} C(n,k) P^{n-k}(·) δP_N^k(·)    (2.5)

The binomial expansion is not assured to converge because in some cases δP_N(·) will be larger than P(·), and thus outside the convergence radius. From a purely formal point of view, one can be slightly more rigorous by using, instead of P_N(·), the fictitious frequency table P̃_N(·) = (1 - ε)P_N(·) + εP(·). ε has the role of a mass parameter, and if it is sufficiently close to 1 the binomial expansion will converge. However, in the end one wants to take the limit ε → 0, and the problem will show up again. Ultimately, our binomial expansion is only asymptotic, and does not yield a convergent series (cf. the expansion in the coupling constant in quantum electrodynamics, Dyson 1952). Separating out the terms with k = 0, which can easily be shown to give just I, one gets equation 2.6.

3. We have carried out the analysis also for the case in which P_N(s) = P(s), and even for non-independent and/or non-discrete responses (Panzeri and Treves, in preparation).
We now take the average over all possible outcomes of the N trials, using the independence condition, and therefore write, independently for each term of the triple sum (2.7), where N′ = N when considering a P_N(r) term, and N′ = N_s with a P_N(r|s) term. Then
where equation 2.9 represents successive contributions in the asymptotic expansion of ⟨I_N⟩. These contributions can be computed explicitly using elementary combinatorics. We have carried out the calculation up to A₈(x); for the first few we find

A₁(x) = 0

A₂(x) = -(1/N′) x^{n-1} (1 - x)

A₃(x) = (1/(N′)²) x^{n-2} (2x² - 3x + 1)

A₄(x) = (1/(N′)³) x^{n-3} (3Nx³ - 6Nx² + 3Nx - 6x³ + 12x² - 7x + 1)    (2.10)
In fact, one can see that in general A_k gives a contribution of at most order 1/(N′)^{k/2} for k even, and 1/(N′)^{(k+1)/2} for k odd. Performing the n → 0 limit and grouping homogeneous terms, one finds

⟨I_N⟩ = I + Σ_{m=1}^{∞} C_m    (2.11)
where the first few correction terms are

C₁ = (S - 1)(R - 1) / (2N ln 2)

C₄ = (1/(120N⁴ ln 2)) { Σ_s P⁻³(s) ( Σ_r [19P⁻³(r|s) - 30P⁻²(r|s) + 10P⁻¹(r|s)] + 1 ) }
   - (1/(120N⁴ ln 2)) { Σ_r [19P⁻³(r) - 30P⁻²(r) + 10P⁻¹(r)] + 1 }
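The size of the leading term can be checked numerically: when stimuli and responses are independent the true information is zero, so the average plug-in estimate is pure upward bias and should approach C₁. An illustrative Monte Carlo sketch follows; the sizes S = 4, R = 8, N = 256 and the uniform distributions are assumptions made here for the demonstration:

```python
import math
import random

def plugin_information(counts, N):
    """Plug-in (raw) estimate I_N, in bits, from a joint count table counts[s][r]."""
    S, R = len(counts), len(counts[0])
    Ps = [sum(counts[s]) / N for s in range(S)]
    Pr = [sum(counts[s][r] for s in range(S)) / N for r in range(R)]
    info = 0.0
    for s in range(S):
        for r in range(R):
            Psr = counts[s][r] / N
            if Psr > 0:
                info += Psr * math.log2(Psr / (Ps[s] * Pr[r]))
    return info

def average_bias(S=4, R=8, N=256, trials=200, rng=random):
    """Average I_N over independent uniform (s, r) draws, for which I = 0,
    so the returned value is the average upward bias itself."""
    total = 0.0
    for _ in range(trials):
        counts = [[0] * R for _ in range(S)]
        for _ in range(N):
            counts[rng.randrange(S)][rng.randrange(R)] += 1
        total += plugin_information(counts, N)
    return total / trials

# Leading correction term C1 = (S - 1)(R - 1) / (2 N ln 2) for S=4, R=8, N=256.
c1 = (4 - 1) * (8 - 1) / (2 * 256 * math.log(2))
```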
The remarkable fact about the first term, C₁, is that, due to a series of cancellations, it depends only on the sizes S and R of the stimulus and response sets, but it is invariant with respect to the probability distributions of stimuli and responses. The following terms, instead, depend explicitly on averages of inverse powers of P_N(s) (the actual presentation frequency of each stimulus; instances with P_N(s) = 0 must be excluded from the average) and of P(r|s) and P(r) (the underlying probability distributions of the responses). The dependence is very strong, that is, with growing m the factors P^{m-1}(·) in the denominators can be quite small, and produce strong fluctuations of the corresponding correction terms as the underlying probability distribution varies by tiny amounts. At best, when the underlying distributions are uniformly flat (I = 0!), term C_m is roughly of size (S × R/N)^m = (C₁)^m, and thus the expansion can be expected to be reasonably well behaved only when at least C₁ < 1.

σ = 16.97. In (b) the effect of differing cortical and retinal shapes is simulated using a cortical sheet of size 512 × 512 and a retinal sheet of size 128 × 512. The initial values of x(r) are amended to x(r) = 0.25 r₁ and training patterns are drawn from 0 ≤ v_x < 128, 0 ≤ v_y < 512, q_max = 12.8, z_max = 14.08. Other parameters in (a) and (b) as in Figure 10.
Several structural models build in biases in preferred orientations (Bauer and Dow 1991; Braitenberg 1985; Dow and Bauer 1984). Most other models could also be modified to favor certain features. In models where training patterns are used, sensory deprivation has been simulated by biases in the training set. Training biases lead to biases in feature preferences (Obermayer et al. 1992a), which may be consistent with experimental findings (Blakemore and van Sluyters 1975; Stryker et al. 1978). Increased ability to control the distribution of specificities and feature preferences distinguishes iterative spectral models (Swindale 1982, 1992) from similar one-step models (Niebur and Wörgötter 1993; Rojer and Schwartz 1990). (See Appendix 5.2.1 and 5.2.2.) One-step models generate a single, fixed distribution of orientation specificities (taken as orientation vector length) (Fig. 9a). Although optical imaging tends to underestimate orientation specificities through spatial averaging, it still reveals a distribution favoring higher orientation specificity than the one-step spectral models predict (Fig. 9b).
E. Erwin, K. Obermayer, and K. Schulten
Figure 8: Histogram showing that preferred orientations in optical imaging data (animal NMI) are approximately evenly distributed. Each of 20 bins represents orientations in a 9° range.

Iterative spectral models allow the inclusion of functions linking development of distinct feature vector components and allow the possibility to reproduce any observed distribution of orientation specificities, preferred orientations, or ocularities, although so far no attempt has been made to precisely match experimental data. Linking functions can also be used to give correlations between otherwise independent feature components. Ultimately, however, the physiological basis of any linking function must be found if the model is to be used to predict map development.

3.7 Maps of Different Features Are Correlated. As explained in Section 2, the patterns of ocular dominance and orientation preference in macaque striate cortex are not independent. The two patterns are "globally orthogonal" such that the principal axes of the map patterns, measured on a length of about several ocular dominance bands, are not coincident, and may even be perpendicular. The two patterns also exhibit "local orthogonality" such that singularities and saddle points tend to align with the centers of ocular dominance bands, and isoorientation lines intersect ocular dominance band borders at approximately right angles. Spectral models (Niebur and Wörgötter 1993; Rojer and Schwartz 1990; Swindale 1982, 1992) can be easily extended to include both ocular dominance and orientation preferences in three-dimensional feature
Orientation and Ocular Dominance

Figure 9: Histograms comparing the distribution of normalized orientation specificities in maps from a one-step spectral model to experimental data. (a) One-step spectral models always generate a fixed distribution favoring low orientation specificities [data from the model of Niebur and Wörgötter (1993)]. (b) Optical imaging tends to underestimate orientation specificity compared to other experimental methods, yet still reveals a distribution favoring higher specificities than the one-step spectral models.
vectors. An array of these three-dimensional vectors can be componentwise convolved with a Mexican-hat kernel to generate ocular dominance and orientation preference patterns simultaneously. The two map patterns would not, however, be correlated unless the feature components were linked during pattern generation. The Appendix (5.2.2) demonstrates two examples of linking functions that can be added to iterative spectral models. In a simple case, model cells are encouraged to develop (three-dimensional) feature vectors with approximately the same length. Thus cells with high monocularity will tend to have low orientation specificity and vice versa, which leads to the emergence of singularities in the centers of ocular dominance bands and to slabs of similar orientation preference intersecting ocular dominance borders preferentially at steep angles, i.e., local orthogonality. A more physiologically interpretable linking function used by Swindale (1992) couples the separate feature components by reducing the speed at which orientation preference grows in regions where ocularity is high. Singularities with low orientation specificity will more likely
develop in the centers of single-eye dominance bands where growth of orientation preference was slowed (Fig. 6a). Figure 6a and b compares maps with and without the linking function. With the linking function, the otherwise distinct feature maps are locally coupled such that a tendency toward local orthogonality between isoorientation and ocular dominance borders develops. Close inspection reveals several instances where the orientation preference map is distorted such that orientation domain borders are "kinked" at the ocular dominance band borders (Fig. 6a, see arrow). Such kinks are not seen in present macaque maps. Kinks in the model result from the specific linking function used. This linking function also predicts a course of development in which strong orientation preference occurs first along the ocular dominance borders, and develops more slowly in the monocular regions. No other known model produces these kinks. Thus, observation of such a pattern in future experimental data from any species would support this model's developmental hypothesis. A simple extension (see Appendix 5.2.1) to the model of Rojer and Schwartz (1990), whereby both ocular dominance and orientation preference are derived from a single filtered noise array, generates maps with complete local orthogonality. Yet global orthogonality cannot be achieved in this simple model. Using an anisotropic filter would result in anisotropic map patterns, but both patterns would necessarily be elongated along the same axis. Since Swindale's (1992) model allows different filters for the orientation and ocular dominance components, the wavelengths and anisotropies of the two patterns may be separately specified to give global orthogonality while still maintaining the same degree of local orthogonality. Although local and global orthogonality appear to be distinct properties of macaque maps, no other model currently treats them independently. 
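A one-step spectral construction in the spirit of Rojer and Schwartz (1990) can be sketched by bandpass-filtering complex white noise, with preferred orientation and orientation specificity read off as phase and modulus of the filtered field. This is an illustrative sketch; the grid size and the annular-filter parameters are assumptions:

```python
import numpy as np

def one_step_orientation_map(n=64, k0=8.0, dk=2.0, seed=0):
    """Filter complex white noise with an annular bandpass in the Fourier
    domain; half the argument of the result gives preferred orientation,
    its modulus the orientation specificity (vector length)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    kx = np.fft.fftfreq(n) * n
    ky = np.fft.fftfreq(n) * n
    kr = np.sqrt(kx[None, :] ** 2 + ky[:, None] ** 2)
    band = np.exp(-((kr - k0) ** 2) / (2 * dk ** 2))   # annular bandpass filter
    field = np.fft.ifft2(np.fft.fft2(z) * band)
    preferred = np.angle(field) / 2.0      # orientation in (-pi/2, pi/2]
    specificity = np.abs(field)
    return preferred, specificity
```

Because a single filtered array fixes the joint statistics of phase and modulus once and for all, such a one-step construction yields the single, fixed specificity distribution discussed above, in contrast to iterative spectral models with linking functions.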
In simulations of the simultaneous development of orientation and ocular dominance, competitive Hebbian models (Figs. 10 and 11) generate patterns that include all of the types of local correlations between these two patterns that have been observed in the macaque, but do not reproduce global orthogonality.5 These correlations have been demonstrated for the self-organizing map (Obermayer et al. 1992b,c) and are also present when the elastic-net approach (Durbin and Mitchison 1990) is appropriately extended (see Appendix 5.4.2). The correlations trivially emerge when patterns with the undesired combinations, e.g., low orientation specificity combined with binocularity, are excluded from the training set. However, they also occur when the training set includes all possible combinations of feature preferences. For the latter case, the emergence of correlations between features can best be explained in the dimension-reduction framework (Fig. 12). In this framework cortical maps are described as mappings between a high-dimensional feature space and a two-dimensional cortical space that obey certain continuity and diversity constraints (Durbin and Mitchison 1990; Kohonen 1987; Obermayer et al. 1990). When training patterns are presented with equal probability out of an appropriate manifold in feature space, the magnification factor of the map between feature space and cortical coordinates will be approximately constant. Consequently, regions where one feature-vector component changes rapidly coincide with regions where other components change slowly. In regions where two feature components change fairly rapidly, they tend to do so along orthogonal axes in the cortex. If orientation selectivity and ocular dominance are represented by Cartesian coordinates as described in the Appendix,

5Global orthogonality, however, can be heuristically introduced by allowing different neighborhood functions to act on different components of the feature vector.

Orientation and Ocular Dominance

445

Figure 10: Model output from the self-organizing map (Obermayer et al. 1990, 1992c) in the format of Figure 4. Model size is 512 x 512 with periodic boundary conditions for the r1- and r2-axes. Training patterns v = {v_x, v_y, v_q sin(2v_φ), v_q cos(2v_φ), v_z} were chosen with uniform probability from 0 ≤ v_x, v_y < 328, 0 ≤ v_φ < π, 0 ≤ v_q < q_max, |v_z| < z_max, q_max = 51.2, z_max = 56.32. Initial values: x(r) = r1, y(r) = r2, q = 0.01 q_max, z = 0, with φ uniformly distributed over all angles. In the function h_SOM(·), σ = 16.97. Output is shown after 1,000,000 iterations with ε = 0.02.
Figure 11: Model output from the elastic-net model (Durbin and Mitchison 1990) in the same format as Figure 4. Model size is 256 x 256 with periodic boundary conditions for r1 and r2. Initial values and training patterns as in Figure 10, with q_max = 61.44, z_max = 46.08. In the function h_EN(·), σ = 2.771. Output is shown after 2,000,000 iterations with α = 0.4, β = 0.0001.
the model maps will then develop with local orthogonality between orientation and ocular dominance columns, similar to what has been found in the macaque maps. The generalized deformable model of Yuille (Yuille et al. 1991) can be made to produce maps similar to those of the elastic-net model. Yet Yuille points out that the model may be generalized by modifying the definition of the norm used to enforce similarity between neighboring neurons. Different norms could lead to other types of correlations that might occur in other species, such as coincident regions of rapid change in orientation and ocular dominance. The magnitude of the correlations between orientation preference and ocularity cannot be adequately determined from the current experimental data, because noise and slight movements of cortex during recording tend to destroy such correlations. Thus while we note that the SOM,
Figure 12: Dimension-reduction: This figure shows how points in a two-dimensional array might be mapped into a three-dimensional feature space with components φ1, φ2, and φ3, representing such features as visual field location and ocular dominance. Dimension-reduction models often constrain the map to fill the input space with near-uniform density while maintaining continuity. This leads to maps where rapid changes in one feature vector component are correlated with slow changes in other vector components.
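The dimension-reduction mapping described above can be illustrated with a minimal Kohonen self-organizing map. This is a toy sketch: the grid size, neighborhood schedule, and learning rates below are arbitrary illustrative choices, not the parameters used in any of the figures.

```python
import numpy as np

def train_som(grid=16, dim=3, iters=4000, seed=0):
    """Minimal Kohonen self-organizing map: a grid x grid cortical sheet of
    units, each with a `dim`-dimensional feature vector, trained on samples
    drawn uniformly from the feature manifold [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    w = rng.random((grid, grid, dim))               # initial feature vectors
    gy, gx = np.mgrid[0:grid, 0:grid]
    for t in range(iters):
        v = rng.random(dim)                         # training pattern
        d2 = ((w - v) ** 2).sum(axis=2)
        iy, ix = np.unravel_index(d2.argmin(), d2.shape)   # best-matching unit
        sigma = 4.0 * (0.05 / 4.0) ** (t / iters)   # shrinking neighborhood
        eta = 0.5 * (0.02 / 0.5) ** (t / iters)     # decaying learning rate
        h = np.exp(-((gy - iy) ** 2 + (gx - ix) ** 2) / (2 * sigma ** 2))
        w += eta * h[..., None] * (v - w)           # pull neighborhood toward v
    return w
```

After training, neighboring units carry similar feature vectors (continuity) while the sheet spreads to cover the manifold with near-uniform density (diversity), which is the source of the gradient trade-offs discussed in the text.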
EN, and Rojer and Schwartz models predict stronger correlations than are observed experimentally, a quantitative comparison is currently not recommended.

3.8 Correlations between Orientation Preference Coordinates and Cortical Coordinates. Several structural models imply particular relationships between the coordinate systems representing cortical location and orientation preference. For example, they may arrange cells preferring horizontal (or radial) stimuli in columns running in one direction across cortex while columns of cells preferring vertical (or concentric) stimuli run in the perpendicular direction (Bauer and Dow 1991; Dow and Bauer 1984).
Figure 13: The pinwheel model (Braitenberg and Braitenberg 1979) tiles the plane with hexagonal hypercolumns, each containing a +1 singularity. Six −1/2 singularities will be formed at the vertices where adjacent hypercolumns meet. Two versions of the model were suggested: (a) and (b). In each case, orientation preferences (short bars) are nearly perfectly correlated with the cortical orientation of the isoorientation lines (longer lines).

The implied link between coordinate systems is often visible if the maps are drawn using oriented line segments to directly represent preferred orientations of cells. Displaying maps from the pinwheel models (Braitenberg 1985; Braitenberg and Braitenberg 1979) in this way, line segments representing preferred orientations appear aligned along curves that either radiate out from, or circle around, the +1 vortices (Fig. 13a and b). In this model, such an arrangement of the orientation-selective cells arises from a simple, plausible scheme of synaptic connections. Although cortical maps are not as well ordered as this simplified model, this predicted link between cortical and retinal coordinates could be present to some degree. A numerical test for such a link can be performed by comparing preferred orientations with the orientation of the isoorientation region contours. Alternatively, the preferred orientations can be compared to the local orientation of the gradient vector of orientation preference with respect to cortical location, since this gradient vector is generally perpendicular to the isoorientation borders. In separate versions of the pinwheel model, the orientation preference vectors are either almost all perpendicular to (Fig. 13a) or almost all parallel to (Fig. 13b) the orientation gradient vectors. These trends are demonstrated in Figure 15g and h. When analyzed in this way, the macaque optical imaging data show no preferred angle of intersection between orientation preference and its gradient vector (Fig. 15a) and thus no link between retinal and cortical coordinates. Links between orientation preference and cortical coordinates are completely absent from models that treat orientation preference as an abstract
count for many of the prominent features of lateral map organization, like singularities and fractures. Analyzing the maps generated by this model as above reveals that a strong correlation can occur between a cell's orientation preference in retinal coordinates and the orientation of the isoorientation bands in cortical coordinates. This results in orientation preference vectors aligned with the local direction of the orientation gradient (Fig. 15c), similar to but weaker than the correlations seen for the pinwheel model (Fig. 13b). Although the relationship has not been well studied, the strength of the correlations does depend on model parameters, and there appear to be some parameter regimes where such correlations are not apparent. A related model by Linsker (1986c) produces maps that show a similar type of correlation (Fig. 15d), although in this case resembling the alternate version of the pinwheel model (Fig. 13a). As Linsker (1986c) noted, when cortical cells have receptive fields containing parallel subfields of opposing types, such as excitatory and inhibitory (likewise for ON and OFF), the degree of similarity between receptive fields will depend not only on their orientation but also on their relative location and internal structure. Two cortical cells with identical receptive field structure that are in partially overlapping locations in the retina would have greater similarity if they were displaced along the axis of the subfield alignment than if they were displaced along the perpendicular axis. Thus if the growth of receptive fields is influenced by the degree of receptive field similarity, correlations can develop between orientation preference (receptive field alignment) and the direction of orientation column alignment in cortex. Tanaka's model of correlation-based learning (Tanaka 1991a; Miyashita and Tanaka 1992), as well as the high-dimensional version of the self-organizing map (Obermayer et al. 1990), are both similar to Miller's model in that orientation preferences develop through alignment of subregions in the receptive fields and growth of columnar structure is related to the overlap of receptive fields. We have examined data from one sample map from Tanaka's model and found that it did not show any correlations between retinal and cortical coordinates (Fig. 15e). We have likewise not observed the high-dimensional self-organizing map to predict a link between coordinate systems (Fig. 15b). It is unknown whether such a correlation could develop for some other choices of parameters. Correlations between retinal and cortical coordinates are not seen in macaque maps (Fig. 15a), although they could be present in maps from other species. Since the measure of correlations introduced here has not previously been used to test model and experimental data, additional study will be required to determine the effect of model parameters on such correlations, and whether they occur in differently organized maps from other species. Differences between the models above suggest a few tentative hypotheses. First, comparing the self-organizing map model and the models of Linsker and Miller suggests that the presence of contrasting types of subfields (ON/OFF or +/−) increases the likelihood that correlations will develop. The phase of two receptive fields will have less impact on their degree of overlap if there is a single type of subfield, as in the self-organizing map model. Second, the self-organizing map and Tanaka's models indicate that the inclusion of some scatter in the topographic projection from retinal to cortical locations could cause any correlation that may develop between the direction of subfield alignment and receptive field location in retinal coordinates to not be visible in the cortical map. Third, correlations appear to be more likely in models that consider only linear development rules, omitting refinements that could be due to more complex nonlinear processes.
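The numerical test described in this section, comparing each cell's preferred orientation with the local direction of the orientation gradient, can be sketched as follows. Working through the Cartesian components sin 2φ and cos 2φ avoids the wrap-around of φ itself; the discretization is our own illustrative choice, not taken from the original analysis.

```python
import numpy as np

def intersection_angles(phi):
    """Angle between each cell's preferred orientation (phi in radians,
    180-degree periodic) and the local direction of the orientation
    gradient.  Returns angles in [0, pi/2]; a flat histogram of these
    angles means no link between the two coordinate systems, as reported
    for the macaque data."""
    a, b = np.sin(2 * phi), np.cos(2 * phi)   # Cartesian components (unit q)
    ay, ax = np.gradient(a)
    by, bx = np.gradient(b)
    dphi_dx = 0.5 * (b * ax - a * bx)         # chain rule: d(phi)/dx
    dphi_dy = 0.5 * (b * ay - a * by)         # chain rule: d(phi)/dy
    grad_dir = np.arctan2(dphi_dy, dphi_dx)   # direction of steepest change
    diff = np.mod(np.abs(phi - grad_dir), np.pi)
    return np.minimum(diff, np.pi - diff)     # fold to [0, pi/2]
```

A histogram of the returned angles peaked near 0 or near π/2 would indicate the kind of link predicted by the pinwheel-type models (Fig. 15g,h), whereas a flat histogram corresponds to the macaque result (Fig. 15a).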
3.9 Orientation Maps Are Not a Linear Transformation of a Conservative Vector Field. A spectral model proposed by Rojer and Schwartz (1990) used the gradient of a bandpass-filtered noise pattern to characterize cortical orientation maps (see Appendix 5.2.1). The model does generate maps that superficially resemble experimentally observed maps (Fig. 16). However, since the model maps are derived through a linear mapping from a conservative vector field (in which vectors are always perpendicular to the field gradient), the model predicts a unique type of link between cortical and orientation preference coordinates (Erwin et al. 1993). This relationship restricts the range of patterns the model can produce, as is easily demonstrated visually near singularities (Fig. 17). One way to numerically demonstrate these correlations, and to show that they are not present in macaque data, is to multiply the preferred orientations (180° periodic) in the maps by two to give a vector field (360° periodic). Analyzing the resulting vector field in a manner similar to the method of Figure 15 reveals that the direction of these vectors is strongly correlated with the direction of their gradient vector field for the model map. Similar correlations do not appear in spectral models that do not involve conservative vector fields (e.g., Niebur and Worgotter 1993; Swindale 1982). However, such correlations do also occur in Gotz's (1987) version of the pinwheel model. Analyzing the macaque data in a similar manner reveals that it cannot be derived from a linear mapping of a conservative field. This discussion helps illuminate the utility of models that attempt to characterize map patterns in simple equations. Without Rojer and Schwartz's model it is unlikely that we would have noted that macaque orientation maps are not a linear function of a conservative vector field.
Knowing this property of experimental maps, new models should be tested to ensure that such a relationship has not been unintentionally included.
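The conservativeness test can be sketched as follows: double the preferred orientations into a 360°-periodic vector field and check whether its curl vanishes, which is equivalent to the closed-path condition of equation 5.3. The discrete-curl estimator below is our own illustrative choice.

```python
import numpy as np

def curl_of_doubled_field(phi, q=None):
    """Multiply preferred orientations by two (180 -> 360 degree periodic)
    to obtain the vector field v = (q sin 2phi, q cos 2phi) and return its
    discrete curl, d(v2)/dr1 - d(v1)/dr2.  For a map that is a linear
    transformation of a conservative field, as in the Rojer and Schwartz
    (1990) model, this curl vanishes everywhere; macaque maps violate it."""
    if q is None:
        q = np.ones_like(phi, dtype=float)
    v1 = q * np.sin(2 * phi)        # component paired with dr1
    v2 = q * np.cos(2 * phi)        # component paired with dr2
    dv1_dr2, _ = np.gradient(v1)    # axis 0 treated as r2
    _, dv2_dr1 = np.gradient(v2)    # axis 1 treated as r1
    return dv2_dr1 - dv1_dr2
```

Applying this to a map built from the gradient of a smooth potential gives a curl of essentially zero, while a generic orientation map yields a curl of order one, mirroring the model-versus-macaque contrast described above.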
Figure 16: Orientation and ocular dominance map from a combined version of the models of Rojer and Schwartz (1990). Model size 512 x 512, H(s) = [1 + e^{α(p_c − δ/2 − ||s||)}]^{−1} [1 + e^{−α(p_c + δ/2 − ||s||)}]^{−1}, p_c = 4.96, δ = 0.96, α = 1.5625. Noise array n(r) values are normally distributed around 0 with variance 1.0. Note that the medium gray orientation contour, which indicates 0°, exits all of the +1/2 singularities (exactly one-half of the singularities) from the left or right side only. See Figure 17 for an explanation.

4 Discussion
In this contribution we have investigated several models for the structure and the formation of orientation and ocular dominance maps. The results of our comparison between model predictions and experimental data obtained from the upper layers of macaque striate cortex are summarized in Table 2. References to articles on each model are given in Table 1. Many of the models are also briefly described in the Appendix. Data for our comparisons come primarily from implementations of selected models on computers at our site. Generally our implementation followed closely the published description of the models and parameters. However, we extended a few models to include simultaneous
Table 2: Summary Comparison of Model Predictions^a

General Properties

Global disorder: Included in all models except several structural models (icecube, Gotz, pinwheel).

Power spectrum: Miller, Yuille et al., and Tanaka maps often have lowpass, rather than bandpass, power spectra.

Anisotropies: All models here can produce anisotropic map patterns.

Orientation Maps

Singularities: Absent from icecube model. Arise spontaneously in many models of map formation. Several structural models (pinwheel, one form of Bauer and Dow) suggested 360°-periodic singularities. Overall orientations of singularities are restricted in Rojer and Schwartz.

Saddle points: Absent only in icecube model.

Fractures: Structural models tend to omit fractures. All others include fractures as loci of rapid, continuous orientation change. Miller, Linsker may include actual discontinuities, but the map resolution is too low to allow a meaningful distinction between rapid change and discontinuity.

Linear zones: Present to varying degrees in all models. Less prominent in SOM-h and correlation-based models.

Linked coordinates: Pinwheel, Gotz, and Bauer and Dow predict a link between a cell's preferred orientation and the direction of isoorientation columns. For some parameters, Miller and Linsker suggest a similar link. A link has not been observed in macaque data, nor in the remaining models.

Conservative maps: Rojer and Schwartz, and Gotz maps are a linear transformation of a conservative vector field. Macaque maps, as well as other model maps, are not.

Distribution of specificities: Most models that include a notion of feature specificity can be tuned to approximate experimentally observed distributions of specificity. Among spectral models, the iterative approach (Swindale) allows finer control over the distribution of feature specificities than the one-step approach (Rojer and Schwartz, Niebur and Worgotter).

^a Model abbreviations are explained in Table 1.

Table 2: Continued.

Orientation deprivation and bias: Due to the method of learning by examples, competitive Hebbian models can easily simulate learning under exposure to a restricted or biased set of oriented visual features. The other models here have not been applied to the same problem.

Ocular Dominance

Monocular deprivation: All models that include ocular dominance can simulate development or appearance of maps in monocularly deprived animals.

Strabismus: Miller, SOM, EN, and Tanaka models successfully reproduce development of maps in strabismic animals.

Relationships between Ocular Dominance and Orientation Maps

Joint pattern development: Very few joint models of ocularity and orientation were proposed (SOM-h, SOM-l, Swindale). We have extended the EN and Rojer and Schwartz models to test their generalizability. The model of Miller is currently being similarly extended, with no conclusive results at present.

Orientation specificity and binocularity: All joint models correlate higher orientation specificity with binocularity and place singularities preferentially away from OD borders. SOM-l, EN, and Rojer and Schwartz include a greater degree of correlation than observed in macaque.

Local orthogonality: All joint models include some preference for ORI borders to be perpendicular to OD borders. SOM-l, EN, and Rojer and Schwartz include a greater degree of correlation than observed in macaque. Swindale's model makes a unique fine-scale prediction that has not been seen experimentally.

Global orthogonality: Local and global orthogonality appear to be separate properties of experimental maps. Only Swindale currently treats them separately in a model.
development of orientation and ocular dominance so that we could compare them with the favorable results of the SOM models. We extended only several representative models where the extensions seemed to be a direct continuation of the model's principles and equations. Our extensions to the spectral model of Rojer and Schwartz, the correlation-based
Figure 17: (a)-(d) Examples of vector fields (outside) and the associated orientation map (inside, local tangents to the curves) for typical singularities that can occur in the experimental data. The singularity (d) is an example of a feature not allowed by the model of Rojer and Schwartz (1990), because the curl of the associated vector field does not vanish at this location.
learning model of Miller, and the elastic-net model are described in Appendices 5.2.1, 5.3, and 5.4.2. Among the pattern models, the spectral models perform better than the earlier structural models, mainly because they account for global disorder and for the coexistence of linear zones and singularities. The filtered-noise approach for orientation selectivity (Niebur and Worgotter 1993) and for ocular dominance (Rojer and Schwartz 1990) captures most of the important features of the individual maps, except for the high degree of feature selectivity that is observed in the macaque. Models by Swindale (1980, 1982, 1992) provide the currently best description of the individual orientation and ocular dominance patterns found in the macaque. Additionally, they can account for many correlations between the maps. Such a close match to experimental patterns has not yet been achieved in the more physiological high-dimensional models. The particular form of the function used in Swindale's model to link development of orientation and ocular dominance leads to a prediction of occasional sudden changes in direction, or "kinks," in the isoorientation region borders at ocular dominance borders. This prediction is unique to Swindale's model. If such kinks are found in future high-resolution experimental images, it would support the model's prediction that orientation preference develops (or refines) first in binocular regions. Swindale's model is also unique in including separate mechanisms for generating local and global orthogonality. This extra freedom may be required to explain the structure of experimental maps. Correlation-based learning models have led to valuable insight into the role of Hebbian learning in receptive field development (Linsker 1986a,b; Miller 1992; Yuille et al. 1989). They were not expected to predict the structure of cortical maps with as much precision. It is, however, instructive to note how the inclusion of realistic receptive field properties impacts the cortical map patterns. Correlation-based learning models perform well for ocular dominance (Miller et al. 1989). When applied to the formation of orientation maps (Linsker 1986c; Miller 1992), the ON/OFF-competition model underrepresents linear zones, and produces maps without a bandpass power spectrum. These points might be related to the low resolution of the maps necessitated by high computational demand. Linsker's model always predicts a link between preferred orientation and the direction of its vector gradient. Miller's model also predicts a link for some model parameters. Such a link is not present in the macaque data, thus constraining the range of parameters for which the model could apply to macaque data. If maps from different species are shown in the future to possess such a link, this would provide strong support for the correlation-based learning approach. Competitive Hebbian models (Durbin and Mitchison 1990; Goodhill and Willshaw 1990; Obermayer et al. 1990, 1992c) lead to the currently best description of the observed patterns from a developmental perspective. These models attempt to describe the developmental process on a mesoscopic level, spatially as well as temporally, which has the advantage that the level of description matches the resolution of the experimental data. These models do not involve the microscopic concepts of neuron, synapse, and spike, which makes it somewhat more difficult to relate model predictions to experimental data. Competitive Hebbian models make qualitatively correct predictions with respect to all the principles we have outlined above, except that they have not yet addressed the issue of global orthogonality as separate from local orthogonality. These models could be extended by, for example, including separate neighborhood functions for ocular dominance and orientation preference.
For correlations between orientation and ocular dominance maps, the competitive Hebbian models give the most realistic predictions. As expected, the predictions of the extended elastic-net model closely match those of the low-dimensional SOM algorithm. Since Yuille's generalized deformable model (Yuille et al. 1991) can be reduced to the elastic net, it should be equally capable of matching the experimental data if extended. Our extended version of the Rojer and Schwartz model failed to reproduce some of the experimentally observed correlations between orientation and ocularity. This observation is not intended to show a deficiency in their model as originally published. Rather, we wish to show how easily the property of local orthogonality and qualitatively correct correlations between singularities and ocularity emerge when the model is extended in a simple way. In our simulations with an extended version of the correlation-based learning model of Miller, maps with both well-organized orientation and ocular dominance failed to develop. We cannot, however, conclude that a more appropriate parameter regime does not exist. Further work on this joint model is in progress.
More stringent tests of the postulated mechanisms of activity-dependent neural development must rely on experiments that (1) monitor the actual time course of pattern formation and that (2) study pattern development under experimentally modified conditions (deprivation experiments). While progress has been made (Bonhoeffer et al. 1993; Lowel and Singer 1993; Hubel et al. 1977; Kim and Bonhoeffer 1993; Obermayer et al. 1994; Rauschecker 1991; Tanaka 1991b,c), there is currently not enough data on the spatial patterns available to constrain the present models. Unfortunately, no anatomical correlate has yet been found for orientation selectivity and binocularity in the upper layers of monkey striate cortex. This quantity must be assessed physiologically and, therefore, after birth, which currently limits investigations to the final, refinement phase of orientation and ocular dominance development. Further evidence to decide between proposed mechanisms might be derived from interspecies comparisons. The underlying assumption is that mechanisms of visual cortex development should be fairly universal and that any model of value should be able to account for interspecies variations. A few studies modeling cat and monkey patterns have been reported (Jones et al. 1991; Miller 1992; Obermayer et al. 1990; Rojer and Schwartz 1990; Swindale 1981). Yet most studies focused on properties of the experimental patterns that arise from very basic assumptions like broken rotational symmetry, which leads to global map anisotropies. Consequently, most of the models were able to account for the observed interspecies variations. As more and better data become available (e.g., Blasdel et al. 1993), fewer of the existing models may continue to be useful. Finally, one would like to have relatively simple models that make predictions about several aspects of cortical organization.
Some current models do make predictions about features other than orientation preference and ocular dominance, such as receptive field location (Durbin and Mitchison 1990; Goodhill 1993; Jones et al. 1991; Obermayer et al. 1990, 1992c; Miyashita and Tanaka 1992; Yuille et al. 1991), color selectivity (Barrow and Bray 1992a), receptive field subfields and spatial phase (Barrow and Bray 1992b; Berns et al. 1993; Linsker 1986c; Miller 1992, 1994; Miyashita and Tanaka 1992; Yuille et al. 1989), and correlations with locations of cytochrome-oxidase blobs (e.g., Gotz 1988). Correlations between maps of different features are predicted by all of these models, and could be tested in suitably designed experiments.

5 Appendix: Model Descriptions

5.1 General Nomenclature. This Appendix gives brief formulations of several of the models included in this study. The model descriptions are intended to (1) ease comparison between different approaches by presenting models with common symbols, and (2) provide sufficient detail to allow interpretation of model parameters given in figure captions. By
necessity, the descriptions here reduce the complexity of some models. Refer to the original references for fuller descriptions and more general formulations. Response properties of cortical cells or small cortical regions at each cortical location r are represented by a feature vector Φ(r). In the "low-dimensional" representation each component stands for a selected response property. Ocular dominance is represented by a scalar z(r), where positive and negative numbers code for eye preferences and zero indicates binocularity. Preferred orientation φ(r) and degree of preference for that orientation q(r) are denoted by the more convenient Cartesian components {q(r) sin[2φ(r)], q(r) cos[2φ(r)]} (Swindale 1982), where the factor of two enforces the assumption that the orientation maps code for the 180°-periodic orientation rather than the 360°-periodic direction of a stimulus.6 Additional features, such as the retinal location {x(r), y(r)} of the receptive field or the preferred direction in color space, can be incorporated. In the "high-dimensional" representation, the feature vector codes for the effective strength of the connections between a cortical cell and each of a set of N receptor cells in one or more input layers: Φ(r) = {w1(r), w2(r), ..., wN(r)}.
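The low-dimensional feature-vector convention can be sketched in a few lines; this is a minimal illustration, and the function names are ours, not the paper's.

```python
import numpy as np

def features_to_cartesian(phi, q, z):
    """Low-dimensional feature vector per cortical site (Swindale 1982
    convention): preferred orientation phi (180-degree periodic) and
    specificity q are stored as {q sin(2 phi), q cos(2 phi)}; ocular
    dominance z is kept as a scalar component."""
    return np.stack([q * np.sin(2 * phi), q * np.cos(2 * phi), z], axis=-1)

def cartesian_to_features(f):
    """Inverse conversion: recover (phi, q, z) from the Cartesian form."""
    a, b, z = f[..., 0], f[..., 1], f[..., 2]
    return 0.5 * np.arctan2(a, b), np.hypot(a, b), z
```

The factor of two in the angle makes orientations that differ by 180° map to the same Cartesian point, which is exactly the identification the text describes.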
The subscript r for cortical location will be omitted in the equations below, except where necessary for clarity.

5.2 Spectral Models. Spectral models generate orientation and ocular dominance patterns either by convolving an array of random feature vectors with an appropriate kernel k(r) in the space domain, or by filtering a noise array with an appropriate filter H(s) in the Fourier domain. Convolution or filtering may be carried out either iteratively or in one step.
5.2.1 One-Step Spectral Models. The models of Rojer and Schwartz (1990). Let n(r) be a white-noise pattern of independently chosen random numbers gaussian-distributed around 0 and let k(r) be the space-domain representation of a bandpass filter H(s). Then an ocular dominance-like pattern may be derived from

z = n * k    (5.1)
where * denotes convolution. An orientation map is derived through a similar process by taking the vector gradient, with respect to the cortical coordinates r1 and r2, of the filtered noise array. The preferred orientation φ is then taken as the angular direction of this vector divided by two, and in a simple extension of the model, an orientation specificity q may be taken from the length of the vector.

6This assumption is based in part on the appearance of the singularities.
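The one-step construction just described can be sketched as follows. The ideal band-pass filter and its band edges are illustrative assumptions, not the filter used by Rojer and Schwartz.

```python
import numpy as np

def rojer_schwartz(size=128, lo=0.05, hi=0.15, seed=0):
    """One-step spectral model in the spirit of Rojer and Schwartz (1990):
    band-pass filter white noise to get an ocular-dominance-like pattern
    z = n * k, then take its gradient to get preferred orientation phi
    (angle halved) and, in the simple extension used in the text, the
    specificity q (vector length).  `lo`/`hi` are band edges in cycles
    per pixel, chosen for illustration only."""
    rng = np.random.default_rng(seed)
    n = rng.normal(size=(size, size))                 # white-noise array n(r)
    rho = np.hypot(*np.meshgrid(np.fft.fftfreq(size),
                                np.fft.fftfreq(size)))  # radial frequency
    H = ((rho >= lo) & (rho <= hi)).astype(float)     # ideal band-pass H(s)
    z = np.real(np.fft.ifft2(np.fft.fft2(n) * H))     # z = n * k
    gy, gx = np.gradient(z)                           # vector gradient of z
    phi = 0.5 * np.arctan2(gy, gx)                    # orientation = angle / 2
    q = np.hypot(gx, gy)                              # specificity = length
    return z, phi, q
```

Because phi and q come from a gradient, the resulting orientation vector field is conservative, which is precisely the property criticized in Section 3.9.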
Due to the gradient operation, the orientation vector field is linearly related to a conservative field, and the model wrongly predicts correlations between orientation preferences and cortical locations such that
∮ [q sin(2φ) dr1 + q cos(2φ) dr2] = 0    (5.3)
is fulfilled for every closed path (Erwin et al. 1993). Rojer and Schwartz proposed separate models for orientation preference and ocular dominance, and omitted orientation specificity. For comparing their predictions with other models, we extend their model by considering z(r) to be simultaneously an ocular dominance pattern and the precursor of an orientation array, and consider q to represent orientation specificity.

The model of Niebur and Worgotter (1993). An orientation map is derived by applying a bandpass filter H(s) and an inverse Fourier transform IFT to a white-noise array N(s) of independent, uniformly distributed elements in the Fourier domain. The Cartesian coordinates of the orientation vector are given by the real and imaginary parts of the resulting array ñ:

q(r) sin[2φ(r)] = Re ñ(r)    (5.4)
q(r) cos[2φ(r)] = Im ñ(r)    (5.5)

5.2.2 Iterative Spectral Models. Iterative models begin with a random distribution of small feature preferences ||Φ0||

τ1 ≫ τ2, we apply adiabatic elimination to obtain the analytical esn. We can show, following Schieve et al. (1991), that it takes the form
τ1 dV1/dt = −V1 + w φ(V1) + ···    (3.6)
We also simulate the coupled equations 3.5 to generate a histogram of the stationary distribution of X1, from which we can obtain the numerical
Toru Ohira and Jack D. Cowan
526
esn by the tuning procedure as before. The simulation was performed with
parameter values

τ1 = 50, σ² = 0.5, w = 4.6, β = 1.0, θ = 2.3    (3.7)
The value of τ2 was varied between 1 and 50. The distribution of X1 was calculated after sufficiently long iterations beyond transients, for 1000 trials. The numerical esn was constructed in the form of equation 2.1 with values for wn, βn, and θn obtained by tuning the histograms,
(3.8)

The results are shown in Figure 4, where we have plotted the functions w φ(V1) and wn φn(V1) obtained with the analytical esn and the numerical esn, respectively, with varying r = τ1/τ2.
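A sketch of the kind of simulation described here, assuming a generic sigmoidal self-exciting neuron with additive noise integrated by the Euler-Maruyama method. The precise forms of equations 2.1 and 3.5 are not reproduced in this excerpt, so the drift and noise scaling below are assumptions; the parameter values follow equation 3.7.

```python
import numpy as np

def simulate_esn(tau1=50.0, w=4.6, beta=1.0, theta=2.3, sigma2=0.5,
                 dt=0.1, steps=100000, burn=10000, seed=0):
    """Euler-Maruyama simulation of a single self-exciting neuron with
    additive noise,
        tau1 dV = [-V + w f(V)] dt + sqrt(2 sigma2) dW,
    with sigmoid f(V) = 1 / (1 + exp(-beta (V - theta))).  The histogram
    of the stationary distribution is the object the tuning procedure
    fits.  NOTE: drift and noise scaling are generic assumptions, not
    the exact Ohira-Cowan equations."""
    rng = np.random.default_rng(seed)
    f = lambda v: 1.0 / (1.0 + np.exp(-beta * (v - theta)))
    dW = rng.normal(0.0, np.sqrt(dt), steps)       # Wiener increments
    amp = np.sqrt(2.0 * sigma2 / tau1)             # noise amplitude per unit time
    trace = np.empty(steps)
    v = 0.0
    for t in range(steps):
        v += (dt / tau1) * (-v + w * f(v)) + amp * dW[t]
        trace[t] = v
    hist, edges = np.histogram(trace[burn:], bins=60, density=True)
    return trace, hist, edges
```

Fitting the parameters of an effective single-neuron equation to the histogram returned here is the "tuning" step that produces the numerical esn compared against the analytical one in Figure 4.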
Figure 4: Comparison of the numerical (solid line) and analytical (dashed line) esn approximations. The values of τ_2 are (a) 1, (b) 5, (c) 25, (d) 50.
We see that the numerical esn and the analytical esn agree well in the region r ≫ 1 where adiabatic elimination is valid. Further, the results show the gradual deviation of the analytical from the numerical esn as we relax the condition τ_1 ≫ τ_2. Comparisons using larger networks are left for the future; however, we expect to find qualitatively similar results. In addition, this result suggests a possible use of the numerical esn as a test for the validity of various analytical esn models.

4 Discussion
We have studied the properties of a single self-exciting neuron under the influence of additive noise. Explicit analytical solutions of the Fokker-Planck approximation enable us to construct numerical esns. Our preliminary investigation has shown that such a numerical representation of neural networks can guide analytical studies of stochastic neural networks. A natural extension of this work is to study the properties of a few interacting neurons with additive noise. However, the multivariable nonlinear Fokker-Planck equation cannot be solved explicitly even for a few-neuron system, and some approximation method, such as adiabatic elimination, is necessary. One way to analyze stochastic neural networks is to use a master equation with discrete neural states (Ohira and Cowan 1993). Such a master equation can be constructed given transition rates between states, which can be obtained from the crossing rates either from experiments or from analytical modeling as discussed in this work.

Acknowledgments

Most of this work was done as the Ph.D. thesis project of one of us (T.O.) at the University of Chicago. The work was supported in part by a Robert R. McCormick fellowship at the University of Chicago, and in part by Grant N0014-89-J-1099 from the U.S. Department of the Navy, Office of Naval Research.

References

Babcock, K. L., and Westervelt, R. M. 1986. Stability and dynamics of simple electronic neural networks with added inertia. Physica 23D, 464.
Babcock, K. L., and Westervelt, R. M. 1987. Dynamics of simple electronic neural networks. Physica 28D, 464.
Buhmann, J., and Schulten, K. 1987. Influence of noise on the function of a "physiological" neural network. Biol. Cybern. 56, 313.
Bulsara, A. R., and Schieve, W. C. 1991. Single effective neuron: Macroscopic potential and noise-induced bifurcations. Phys. Rev. A 44, 7913.
Bulsara, A. R., Boss, R., and Jacobs, E. W. 1989. Noise effects in an electronic model of a single neuron. Biol. Cybern. 61, 211.
Bulsara, A. R., Jacobs, E. W., Zhou, T., Moss, F., and Kiss, L. 1991. Stochastic resonance in a single neuron model: Theory and analog simulation. J. Theor. Biol. 152, 531.
Cowan, J. D. 1968. Statistical mechanics of nervous nets. In Neural Networks, E. R. Caianiello, ed., p. 181. Springer-Verlag, Berlin.
Cowan, J. D. 1972. Stochastic models of neuroelectric activity. In Statistical Mechanics: New Concepts, New Problems, New Applications, Proc. of the Sixth IUPAP Conference on Statistical Mechanics, S. A. Rice, K. F. Freed, and J. C. Light, eds. University of Chicago Press, Chicago.
Haken, H. 1985. Synergetics. Springer-Verlag, Berlin.
Hansel, D., and Sompolinsky, H. 1993. Solvable model of spatiotemporal chaos. Phys. Rev. Lett. 71, 2710-2713.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA 81, 3088-3092.
Li, Z., and Hopfield, J. J. 1989. Modeling the olfactory bulb and its neural oscillatory processings. Biol. Cybern. 61, 379-392.
Longtin, A., Bulsara, A., and Moss, F. 1991. Time-interval sequences in bistable systems and the noise-induced transmission of information by sensory neurons. Phys. Rev. Lett. 67, 656.
McAuley, J. D., and Stampfli, J. 1994. Analysis of the effects of noise on a model for the neural mechanism of short-term active memory. Neural Comp. 6, 668.
Ohira, T., and Cowan, J. D. 1993. Master equation approach to stochastic neurodynamics. Phys. Rev. E 48, 2259-2266.
Ricciardi, L. M., and Sacerdote, L. 1979. The Ornstein-Uhlenbeck process as a model for neuronal activity. Biol. Cybern. 35, 1-9.
Schieve, W., Bulsara, A., and Davis, G. 1991. The single effective neuron. Phys. Rev. A 43, 2613.
Smith, C. E. 1992. A heuristic approach to stochastic models of single neurons. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds. Academic Press, Boston.
Sompolinsky, H., Crisanti, A., and Sommers, H. 1988. Chaos in random neural networks. Phys. Rev. Lett. 61, 259.
Stratonovich, R. L. 1963. Topics in the Theory of Random Noise, Vol. 1. Gordon and Breach, New York.
Zhou, T., Moss, F., and Jung, P. 1990. Escape-time distributions of a periodically modulated bistable system with noise. Phys. Rev. A 42, 3161.
Zipser, D. 1991. Recurrent network model of the neural mechanism of short-term active memory. Neural Comp. 3, 179.
Received November 3, 1993; accepted September 9, 1994.
Communicated by Laurence Abbott
Memory Recall by Quasi-Fixed-Point Attractors in Oscillator Neural Networks

Tomoki Fukai
Department of Electronics, Tokai University, Kitakaname 1117, Hiratsuka, Kanagawa, Japan
Masatoshi Shiino Department of Applied Physics, Tokyo Institute of Technology, Ohokayama, Meguro, Tokyo, Japan
It is shown that approximate fixed-point attractors rather than synchronized oscillations can be employed by a wide class of neural networks of oscillators to achieve associative memory recall. This computational ability of oscillator neural networks is ensured by the fact that reduced dynamic equations for phase variables in general involve two terms that can be respectively responsible for the emergence of synchronization and the cessation of oscillations. Thus the cessation occurs in memory retrieval if the corresponding term dominates in the dynamic equations. A bottomless feature of the energy function for such a system makes the retrieval states quasi-fixed points, which admit continual rotating motion for a small portion of the oscillators when an extensive number of memory patterns are embedded. An approximate theory based on the self-consistent signal-to-noise analysis enables one to study the equilibrium properties of the neural network of phase variables with the quasi-fixed-point attractors. As far as memory retrieval by the quasi-fixed points is concerned, the equilibrium properties, including the storage capacity, of oscillator neural networks are proved to be similar to those of Hopfield-type neural networks.

Neural Computation 7, 529-548 (1995) © 1995 Massachusetts Institute of Technology

1 Introduction

Synchronization of oscillatory neural activity, which was reported by electrophysiological experiments in various cortical regions (Freeman 1975; Eckhorn et al. 1988; Gray and Singer 1989; Ahissar and Vaadia 1990), has attracted growing attention in neural information processing since it is expected to give a solution to the segmentation problem. Visual perception by networks of oscillators was discussed in terms of synchronization and desynchronization (Schuster and Wagner 1990; Grossberg and Somers 1991; von der Malsburg and Buhmann 1992; Sompolinsky
and Tsodyks 1994), and the possibility of using oscillatory activity in associative memory neural networks was pointed out (Abbott 1990). An advantage of using synchronization in neural information processing may be its potential ability to access multiple memory traces simultaneously. Indeed, it was shown that associative memory oscillator networks exhibit continual transitions between embedded memory patterns, which may be interpreted as such a simultaneous access to multiple memory traces (Fukai 1994a).

On the other hand, an associative memory network of the Wilson-Cowan oscillators (Wilson and Cowan 1972) was shown to behave, in a certain case, as a fixed-point attractor neural network such as the Hopfield model with symmetric connections (Fukai 1994b). For a particular oscillator neural network whose phase-variable description is mathematically equivalent to an analog neural network with a monotonic response function (Fukai and Shiino 1994), the content-addressable memory achieved by the so-called oscillator death (Shiino and Frankowicz 1989; Ermentrout and Kopell 1990) was analytically studied by means of the self-consistent signal-to-noise analysis (SCSNA) (Shiino and Fukai 1992, 1993). It was found that the oscillator network functions like the Hopfield network with fixed-point attractors except that a small portion of neural oscillators shows oscillating behavior in the retrieval states. This implies that attractor neural networks using fixed points, which usually employ such formal neurons as described by firing rates or response functions, make sense in a more general case including information coding by the phase degrees of freedom in oscillator neural networks. It has, however, not yet been clarified whether memory encoding by the oscillator death is a generic phenomenon to be seen in a wide class of oscillator neural networks of associative memory.
The purpose of the present paper is to show that such memory information processing is a rather generic feature of oscillator neural networks. We will argue that the dynamic equations for the phases in oscillator networks are in general decomposed into two terms that can be respectively responsible for synchronization and cessation of oscillations under appropriate conditions. Therefore the resultant dynamic behavior in the retrieval depends on the relative weights of the two terms in the time evolution equations. The phase diagram showing the retrieval region is obtained by solving the equations for the equilibrium order parameters derived from an analytic treatment based on the SCSNA. Provided that the oscillator death plays a key role in memory retrieval, the equilibrium properties, including the storage capacity, of oscillator networks are similar to those of the Hopfield neural networks. One might argue that using fixed-point type attractors implies the loss of an advantage of oscillator neural networks in encoding binding information. What is to be emphasized, however, is that oscillator neural networks have alternative possibilities in adopting dynamic attractors for a particular aim of memory information processing.
2 Phase Description of Neural Networks of Oscillators
The dynamic behavior of an oscillator exhibiting a periodic orbit is described by two degrees of freedom, amplitude and phase, which in general are coupled to each other. In a system of coupled oscillators, however, the dynamics can be described solely by the equations for the phase degrees of freedom if the strength of the mutual interactions between the oscillators is small enough to be treated as a perturbation (Kuramoto 1984). This reduction of dynamic degrees of freedom yields a minimal model with which to analyze dynamic behavior such as the mutual entrainment of coupled oscillators. In the present paper, we will follow the phase description of oscillator neural networks, since we are interested in the dynamic features that are free of, rather than specific to, a particular model. Let X be an n-dimensional (n ≥ 2) dynamic variable describing the motion of an oscillator. In the present paper, the coupled oscillator system of the associative memory is assumed to obey the following time evolution equations for N identical oscillators:

dX_i/dt = F(X_i) + ε Σ_j J_ij G(X_j),  i = 1, ..., N   (2.1)
where the first term describes the periodic orbit of a single oscillator, and the second term describes mutual interactions to be treated as perturbations whose smallness is indicated by the parameter ε. As will be shown later, the connections J_ij are assumed to be specified by the local Hebb learning rule. When ε is sufficiently small, the coupled oscillator system given in 2.1 can be reduced to a coupled system of the phase variables or rotators θ_i (i = 1, ..., N) that are defined along, and in the neighborhoods of, the unperturbed periodic orbits of the isolated oscillators. The derivation is briefly shown in Appendix A, and the resultant dynamic equations for the phases are

dθ_i/dt = ω + ε Z(θ_i) Σ_j J_ij V(θ_j)   (2.2)

with Z(θ) and V(θ) being periodic functions of θ. Z(θ_i) measures the sensitivity of the phase to perturbations. In general, the above expression yields a good approximation of the coupled oscillator system 2.1 when the mutual interactions are small. To make 2.2 mathematically tractable, we assume that the oscillation is approximately viewed as a sinusoidal one. This assumption allows us to express Z and V in terms of linear combinations of {sin θ_i} and {cos θ_i}, say Z(θ_i) = a cos θ_i + b sin θ_i and V(θ_i) = c cos θ_i + d sin θ_i, and
equation 2.2 can be rewritten in the following form:

dθ_i/dt = ω + K' Σ_{j≠i} J_ij sin(θ_j - θ_i + δ') + K Σ_{j≠i} J_ij sin(θ_j + θ_i + δ)   (2.3)
In the above expression, the constants K, K', δ, and δ' are determined by both the intrinsic properties of the oscillator and the detailed structure of the mutual connections specified by G(X) in 2.1. The second term on the right-hand side of 2.3 with small δ' is expected to describe the slow mutual entrainment of the phases of the oscillators, and such mutual entrainment has attracted much attention (Abbott 1990; Kuramoto et al. 1992). The third term in 2.3 is usually neglected, and its dynamic effects have not been closely investigated, since K should be small when mutual entrainment takes place. An important observation, however, is that this term indeed plays a central role in the memory encoding by approximate fixed-point attractors, or the oscillator death, in associative memory networks of oscillators. The dynamic behavior exhibited by a particular oscillator neural network should therefore essentially depend on the relative magnitudes of K and K' in the corresponding phase equations.

In the next section, we discuss the retrieval properties, including the phase diagram, of the phase rotator neural network derived above. In so doing, we omit the second term to confine ourselves to the study of the memory encoding by the oscillator death, and assume that the mutual connections are given by a local learning rule of the Hebb type with p random memory patterns ξ_i^μ = ±1 (i = 1, ..., N; μ = 1, ..., p). These assumptions allow an approximate treatment of the equilibrium properties of the network with an extensive number of memory patterns by means of the SCSNA (Shiino and Fukai 1992, 1993). Thus the network to be studied is
dθ_i/dt = η - Σ_{j≠i} J_ij sin(θ_i + θ_j)   (2.4)

J_ij = (1/N) Σ_{μ=1}^{p} ξ_i^μ ξ_j^μ   (2.5)
Note that δ was eliminated by a uniform shift of the phase variables, θ_i + δ/2 → θ_i. In 2.4, the interaction term between a pair of phase rotators depends on sin θ_i cos θ_j + cos θ_i sin θ_j. If either of the two terms were absent, the equilibrium properties of the network 2.4 could easily be studied using a formal equivalence between the equilibrium states of the phase rotator network and those of a monotonic analog-neuron network (hereafter we refer to this type of phase rotator network as the monotonic phase rotator network; see Fukai and Shiino 1994). For the general case described by 2.4, however, such a formal equivalence is not trivially ensured. Although this situation makes the analytical study of equilibrium states with the SCSNA rather complicated, we can still manage to derive the phase diagram for the rotator network 2.4 used as an associative memory network with approximate fixed-point attractors.
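Before the analysis, retrieval by frozen phases (the oscillator death) can be illustrated numerically. The sketch below is our own construction, not the authors' code: it assumes the concrete form dθ_i/dt = η - Σ_j J_ij sin(θ_i + θ_j) for the network 2.4, Hebbian couplings 2.5, and illustrative parameter values.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, eta, dt = 400, 2, 0.3, 0.05

xi = rng.choice([-1.0, 1.0], size=(p, N))   # random +/-1 memory patterns
J = xi.T @ xi / N                            # local Hebb rule (2.5)
np.fill_diagonal(J, 0.0)

# initialize near the first pattern: theta ~ 0 where xi = +1, ~ pi where xi = -1
theta = np.where(xi[0] > 0, 0.0, np.pi) + 0.3 * rng.standard_normal(N)

for _ in range(4000):
    # assumed dynamics (2.4): dtheta_i/dt = eta - sum_j J_ij sin(theta_i + theta_j)
    coupling = (np.sin(theta) * (J @ np.cos(theta))
                + np.cos(theta) * (J @ np.sin(theta)))
    theta += dt * (eta - coupling)

m_c = np.mean(xi[0] * np.cos(theta))   # overlap order parameters
m_s = np.mean(xi[0] * np.sin(theta))
print(f"m_c = {m_c:.3f}, m_s = {m_s:.3f}, "
      f"m_c^2 + m_s^2 = {m_c**2 + m_s**2:.3f}, 2*m_s*m_c = {2*m_s*m_c:.3f}")
```

With these values the phases should freeze near 0 and π, and the measured overlaps approach the fixed-point relations derived in the next section.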
3 The Equilibrium States of the Phase Rotator Network
We first note that our system 2.4 has a formal energy function,

E = -η Σ_i θ_i - (1/2) Σ_{i≠j} J_ij cos(θ_i + θ_j)   (3.1)
although it is not bounded from below due to the existence of the first term. Accordingly, the stability of the fixed-point type attractors of 2.4 is not necessarily ensured; runaway solutions implied by the occurrence of translational motion of the phase variables may appear. However, as shown below, the retrieval states that are exactly given by the fixed-point attractors of 2.4 remain in existence when the network is loaded with a finite number of patterns.

Defining the order parameters m_c^μ and m_s^μ in the large-N limit as

m_c^μ = (1/N) Σ_i ξ_i^μ cos θ_i,  m_s^μ = (1/N) Σ_i ξ_i^μ sin θ_i   (3.2)

one obtains the following equilibrium condition of the oscillator network for finite values of p by setting dθ_i/dt = 0 in 2.4:

η = Σ_μ ξ_i^μ (m_c^μ sin θ_i + m_s^μ cos θ_i)   (3.3)

Assuming the stored patterns {ξ_i^μ} to be random, we will be concerned with the case where m_c^μ = m_s^μ = 0 for μ > 1, under the condition that the first pattern {ξ_i^1} is retrieved. Then one has

η = ξ_i^1 (m_c sin θ_i + m_s cos θ_i)   (3.4)

and summing this over i yields

η = 2 m_s m_c   (3.5)

where the superscripts 1 of m_c, m_s, and ξ_i are omitted for brevity. One can easily solve 3.4 for the frozen phase angles θ_i (3.6), which, substituted into the definitions 3.2, gives self-consistent expressions for m_c and m_s (3.7). Since it follows from 3.5 and 3.7 that m_c^2 + m_s^2 = 1, one has

m_c = cos δ,  m_s = sin δ   (3.8)

η = sin 2δ   (3.9)
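These relations can be checked directly. The sketch below (our notation and branch choices, not the authors' code) constructs the frozen-phase retrieval state for a given |η| ≤ 1 and verifies the relations 3.4, 3.5, and 3.8-3.9:

```python
import numpy as np

eta = 0.4                                  # must satisfy |eta| <= 1
delta = 0.5 * np.arcsin(eta)               # from eta = sin(2*delta), eq. 3.9
rng = np.random.default_rng(2)
xi = rng.choice([-1.0, 1.0], size=1000)    # the retrieved pattern

# frozen phases: theta_i = delta where xi_i = +1, pi + delta where xi_i = -1
theta = np.where(xi > 0, delta, np.pi + delta)

m_c = np.mean(xi * np.cos(theta))          # -> cos(delta), eq. 3.8
m_s = np.mean(xi * np.sin(theta))          # -> sin(delta)
assert abs(m_c**2 + m_s**2 - 1.0) < 1e-9
assert abs(2 * m_s * m_c - eta) < 1e-9     # eq. 3.5
# every rotator satisfies the equilibrium condition 3.4
assert np.allclose(xi * (m_c * np.sin(theta) + m_s * np.cos(theta)), eta)
print(f"delta = {delta:.4f}, m_c = {m_c:.4f}, m_s = {m_s:.4f}")
```

For |η| > 1 the construction fails (np.arcsin returns nan), matching the existence condition stated below.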
Then we see that the equilibrium solutions of 2.4 representing the retrieval states exist only for |η| ≤ 1. The above analysis reveals how the present oscillator network functions properly as an associative memory under the local learning rule 2.5; in the retrieval state each phase rotator gets frozen at the angle value given by 3.6, and perfect memory retrieval ensues owing to 3.8.

We now proceed to deal with the case of an infinite number of stored patterns. The SCSNA provides a simple and powerful method to study the equilibrium properties of analog-neuron networks when p, N → ∞ with α = p/N fixed (Shiino and Fukai 1992, 1993). It gives a set of equations for the order parameters describing the equilibrium states of the network systems. We can apply the method to the present phase rotator network and study the equilibrium phase diagram of the model. Due to the complication arising from the loss of formal equivalence between the phase rotator network and an analog-neuron network, the resultant order-parameter equations form a rather complicated nine-dimensional system instead of the simple three-dimensional one obtained for conventional analog-neuron networks. The equations for the order parameters are given as follows (see Appendix B for the details of the derivation): (3.10) (3.11) (3.12) (3.13) (3.14) (3.15) (3.16) (3.17) (3.18) (3.19) (3.20)
(3.21) (3.22)
(3.21) A = (3.22) The angle @(xl,x~;~) at equilibrium is obtained as a function of the two gaussian-noise variables by solving
ar
+ xl) sin@+ (Em, + x2)cos 0 + -sin 1-A + -(sl cos’ o + C? sin’ O ) 1-A
rl = ((m,
(V
r
c1+ s2+ 2c2s, - 2c1s2
Ocos O (3.23)
(3.24)

The above equations for the order parameters can be solved numerically. A point that requires delicate treatment in the present analysis arises because the fixed-point condition 3.23 in general gives more than one solution. Since the SCSNA itself does not involve any criteria for choosing an available solution, we assume a priori that the solution is obtained by analogy with Maxwell's rule in the statistical mechanics of monotonic analog-neuron networks. The Maxwell rule, which is stated in the form of an equal-area law, is the rule to pick up a suitable solution for the saddle-point equations of a thermodynamic system by ensuring the minimization of the free energy. Using the formal equivalence discussed previously, an analogous rule was obtained for the monotonic phase rotator network and was successful in deriving its equilibrium phase diagram (Fukai and Shiino 1994). A schematic description of the rule for the phase rotator network is given in Appendix C.

The retrieval states are characterized by the solutions of the order-parameter equations with nonvanishing m_c and m_s. Besides these, the equations give spin-glass type solutions with vanishing m_c and m_s, and nonvanishing q_c, q_s, and q_sc. The phase diagram of the rotator neural network 2.4 is shown in Figure 1a for various values of η. The solid curve gives upper bounds obtained by the SCSNA for the retrieval phase (RI) in which retrieval is achieved by approximate fixed points (see below). Spin-glass solutions exist in the whole region of the phase diagram. Note that the phase boundary curve yields a broad peak at around η ≈ 0.07 and vanishes at both ends. Compared to the phase diagram of the monotonic rotator network (Fukai and Shiino 1994), we observe that the retrieval phase is extended into the region with relatively large values of η.

It was found that the fixed-point condition 3.23 possesses no solution for certain narrow ranges of x_1 and x_2 in most regions of the retrieval phase RI except for η ≤ 0.02. This implies that some phase rotators cannot be fixed at any value in equilibrium, and thus the network contains a small number of continually rotating phase variables. Since, however, the fraction of these rotating components was found to be very small (less
Tomoki Fukai and Masatoshi Shiino
536
Figure 1: (a) Phase diagram of the phase rotator network 2.4 (solid curve: SCSNA upper bound of the retrieval phase; SG: spin-glass region). (b) Values of m_c and m_s at the critical storage capacity predicted by the SCSNA and obtained by the simulations.
than 1% of all rotators), we simply neglected the contributions from the rotating components in the quasi-fixed-point retrieval states in evaluating the gaussian integrals for the order-parameter equations.

Numerical simulations of the phase rotator network 2.4 with N = 600 to 1000 were conducted, and they confirmed that the phase boundary in fact coincides with the one obtained by the SCSNA as far as the quasi-fixed-point retrieval states are concerned. The values of m_c and m_s predicted by the SCSNA at the critical storage capacity and those obtained by the simulations are plotted in Figure 1b, which confirms the validity of our theoretical analysis. The presence of the rotating components in the retrieval states was not clearly observed for the sizes of the networks used in the numerical simulations. The simulations further revealed that the neural network acquires larger upper bounds for another retrieval phase (RII) if it is allowed to exhibit oscillations with small but sizable amplitudes in the retrieval states. Figure 2a, b, and c shows the behavior of several θ_i(t)'s observed in the memory retrieval for α = 0.1 when η = 0.1, 0.3, and 0.5, which are respectively related to regions RI, RII, and SG. As is seen from the figures, the phase variables, which are attracted by fixed points near 0 (for ξ_i = 1) or π (for ξ_i = -1) in RI, begin to exhibit small oscillations around those points in RII (one of the phases even exhibits a translational motion). Since the amplitudes remain small, there should practically be no problem in regarding those oscillatory states, which presumably appear through Hopf bifurcations, as the retrieval states of the phase rotator network. As shown in Figure 2c, most of the phases exhibit translational motion in SG until the network state finally settles in a small oscillation around a state having no statistical correlation with any memory pattern.
The phase boundary between RII and SG was obtained by numerical simulations and is drawn as the dashed curve in Figure 1a. It shows that the maximal storage capacity (≈ 0.15) of the present phase rotator network is quantitatively similar to that of the Hopfield neural network. In Figure 3, we show the values of m_c in the retrieval phases RI and RII as a function of η when α is fixed at 0.1. In region RI, the SCSNA is applicable to evaluating the values, and they are shown by the solid line.
Figure 2: Note that the vectors I_L and I_R represent the intensities falling on the left and right retinas, respectively, and are indexed by spatial location. S represents the vector of the disparities to be extracted. That is, the output S_i of output unit i represents the disparity at spatial location i. By setting some of the synapses to zero we obtain the disjoint receptive fields of the Becker and Hinton paradigm (Fig. 1).

patterns. The variables have distributions P_N(n) and P_P(a), respectively. Note that D and P_P(a) are assumed to be known, but P_N(n) and the functional form of F(n, a) are unknown. The input distribution is given by

P_D(D) = ∫∫ δ[D - F(n, a)] P_N(n) P_P(a) [dn][da]
and can be observed by the system. Let the output of the system be S = G(D, γ), where G is a function of a set of parameters γ to be determined. For example, the function G(D, γ) could be represented by a multilayer perceptron with γ being the synaptic weights. By approximation theory, it can be shown that a large variety of neural networks can approximate any input-output function arbitrarily well given enough hidden nodes (Hornik et al. 1991). We can combine these formulas to give

S = G[F(n, a), γ]   (2.1)
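A concrete, deliberately simple instance of this setup can be sampled; everything below (the additive form F(n, a) = a + n, the gaussian choices for P_P and P_N, and the single gain parameter γ) is our illustrative assumption, not the authors' model:

```python
import numpy as np

rng = np.random.default_rng(3)
M = 100_000

a = rng.normal(0.0, 1.0, M)      # true signal a ~ P_P(a) (assumed gaussian)
n = rng.normal(0.0, 0.5, M)      # noise n ~ P_N(n) (assumed gaussian)
D = a + n                        # data D = F(n, a); here F is simply additive

def G(data, gamma):
    """Output map S = G(D, gamma); here a single linear gain parameter."""
    return gamma * data

# The derived distribution P_DD(S; gamma) is what the system can observe:
# push samples of D through G and inspect the resulting statistics.
S = G(D, 0.8)
print(f"Var[S] = {S.var():.3f}, prior Var[a] = {a.var():.3f}")
```

Adjusting γ so that the sampled statistics of S match those of the prior on a is the self-organization criterion developed below.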
A. L. Yuille, S. M. Smirnakis, and L. Xu
Figure 3: The parameters γ are adjusted to minimize the Kullback-Leibler distance between the prior distribution (P_P) of the true signal (a) and the derived distribution (P_DD) of the network output (S).

The aim of self-organizing the network is to ensure that the parameters γ are chosen so that the outputs S are as close to the a's (or some simple transformation of the a's) as possible. We claim that this can be achieved by adjusting the parameters γ so as to make the derived distribution of the outputs, P_DD(S; γ) = ∫ δ[S - G(D, γ)] P_D(D) [dD], as close as possible to P_P(S). This can be seen to be a consistency condition for a Bayesian theory. From Bayes's formula we obtain the condition:
∫ P(S | D) P_D(D) [dD] = ∫ P(D | S) P_P(S) [dD] = P_P(S)   (2.2)
This is equivalent to our condition provided we identify P(S | D) with δ[S - G(D, γ)]. To make this more precise we must define a measure of similarity between the two distributions P_P(S) and P_DD(S; γ). An attractive measure is the Kullback-Leibler distance (the entropy of P_DD relative to P_P):

∫ P_DD(S; γ) log [P_DD(S; γ)/P_P(S)] [dS]
Thus our theory (see Fig. 3) corresponds to adjusting the parameters γ to minimize the Kullback-Leibler distance between P_P(S) and P_DD(S; γ). This measure can be divided into two parts: (1) -∫ P_DD(S; γ) log P_P(S) [dS] and (2) ∫ P_DD(S; γ) log P_DD(S; γ) [dS]. As we now show, both terms have very intuitive interpretations.

Suppose that P_P(S) can be expressed as a Markov random field [i.e., the spatial distribution of P_P(S) has a local neighborhood structure, as is commonly assumed in Bayesian models of vision]. Then, by the
Hammersley-Clifford theorem, we can write P_P(S) = e^{-βE_P(S)}/Z, where E_P(S) is an energy function with local connections [for example, E_P(S) = Σ_i (S_i - S_{i+1})^2], β is an inverse temperature, and Z is a normalization constant. Then the first term can be written as

-∫ P_DD(S; γ) log P_P(S) [dS] = ∫∫ δ[S - G(D, γ)] P_D(D) βE_P(S) [dD][dS] + log Z
  = ∫ βE_P[G(D, γ)] P_D(D) [dD] + log Z
  = β⟨E_P[G(D, γ)]⟩_D + log Z   (2.4)

We can ignore the log Z term since it is a constant (independent of γ). Minimizing the first term with respect to γ will therefore try to minimize the energy of the outputs averaged over the inputs, ⟨E_P[G(D, γ)]⟩_D, which is highly desirable [since it has a close connection to the minimal energy principles in Poggio et al. (1985) and Clark and Yuille (1990)]. It is important, however, to avoid the trivial solution G(D, γ) = constant, or solutions where G(D, γ) is very small for most inputs. Fortunately these solutions will be discouraged by the second term.

The second term, ∫ P_DD(S; γ) log P_DD(S; γ) [dS], can be interpreted as the negative of the entropy of the derived distribution of the output. Minimizing it with respect to γ is a maximum entropy principle that will encourage variability in the outputs G(D, γ) and hence prevent the trivial solutions. The two terms combine to determine the γ so that the energy of the output variables is minimized while maximizing their variability. This is closely related to Becker and Hinton's method of maximizing the mutual information between pairs of output variables, essentially assuming a spatially constant prior distribution for S. At the same time it is reminiscent of other organizational principles for early vision based on information theory (Atick and Redlich 1990).

How can one guarantee that the optimal solution to our criteria will indeed extract the signal? This will depend on a number of factors: (1) the forms of the functions F and G, (2) the forms of the probability distributions P_N(n) and P_P(a), and (3) whether the prior P_P is indeed correct or not. It is straightforward to write down the conditions for the derived distribution to be equal to the prior distribution (assuming that the prior is correct). This is a stronger condition than requiring the Kullback-Leibler distance to be minimal (though, if equality is possible, minimizing Kullback-Leibler would lead to it). It is
P_DD(S; γ) = P_P(S)   (2.5)

If one could find γ* so that G[F(n, a), γ*] = a, ∀n, a, then the equation could be solved exactly. The condition G[F(n, a), γ*] = a, however, is
too strong. It requires that the function G, which can be thought of as a nonlinear filter, is able to completely eliminate the dependence on n. We have assumed that the correct prior is known by the system, perhaps by being specified genetically. An alternative possibility is that the prior itself is learned by a method reminiscent of Occam's razor: the goodness of the prior is evaluated based on the Kullback-Leibler distance after self-organization, and a more complex prior is chosen if this distance is large (see also Mumford 1992).

3 Connection to Becker and Hinton
In this section, we show that the case of disparity extraction implemented by Becker and Hinton based on their principle of mutual information maximization arises as a special case of our formalism, by choosing a particular prior. The Becker and Hinton method (Becker and Hinton 1992) for extracting the disparity involves maximizing the mutual information between two network output units S_1, S_2 with spatially disjoint receptive fields, under the assumption that disparity is spatially coherent. S_1 and S_2 denote the scalar values of two units in the output layer of a neural network, indexed by spatial location. The mutual information between S_1, S_2 is given by
I(S_1, S_2; γ) = -⟨log P_DD(S_1; γ)⟩ - ⟨log P_DD(S_2; γ)⟩ + ⟨log P_DD(S_1, S_2; γ)⟩
  = H(S_1; γ) - H(S_1 | S_2; γ)   (3.1)
From this equation we see that we want to maximize the entropy, H(S_1; γ), of S_1 while minimizing the conditional entropy, H(S_1 | S_2; γ), of S_1 given S_2, which forces S_1 to be a deterministic function of S_2 (alternatively, by symmetry, we can interchange the roles of S_1 and S_2). For the discussion below we will use our criterion to reproduce the case in which this last term forces S_1 = S_2.

By contrast, in our version (see Fig. 4) we propose to minimize the expression ⟨log P_DD(S_1, S_2; γ)⟩ - ∫ log P_P(S_1, S_2) P_DD(S_1, S_2; γ) [dS]. If we ensure that the prior P_P(S_1, S_2) ∝ e^{-τ(S_1-S_2)^2}, then, for large τ, our second term will force S_1 ≈ S_2 and our first term will maximize the entropy of the joint distribution of S_1, S_2. We argue that this is effectively the same as Becker and Hinton (1992), since maximizing the joint entropy of S_1, S_2 with S_1 constrained to equal S_2 is equivalent to maximizing the individual entropies of S_1 and S_2 with the same constraint.

To be more concrete, we consider Becker and Hinton's implementation of the mutual information maximization principle in the case of units with continuous outputs. They assume that the outputs of units 1, 2 are gaussian² and perform steepest descent to maximize the symmetrized

²We assume for simplicity that these gaussians have zero mean.
Bayesian Self-organization
[Figure 4 here: side-by-side network diagrams of the Becker-Hinton proposal ("Maximize Mutual Information") and our proposal ("Minimize Kullback-Leibler distance"). Each network takes left and right intensity inputs, passes them through a hidden layer, and produces output units S_1 and S_2.]
Figure 4: Comparing our theory with Becker and Hinton's. Observe that setting P_p(S_1, S_2) ∝ e^{-τ(S_1 - S_2)²} forces S_1 = S_2 for large τ, implementing their assumption that the disparity is spatially coherent.

form of the mutual information between S_1 and S_2:
\[
I = \log V(S_1) + \log V(S_2) - 2 \log V(S_1 - S_2) \tag{3.2}
\]
where V(·) stands for variance over the set of inputs. They assume that the difference between the two outputs can be expressed as uncorrelated additive noise, S_1 = S_2 + N. Therefore, their criterion amounts to maximizing

\[
E_{BH}[V(S_2), V(N)] = \log\{V(S_2) + V(N)\} + \log V(S_2) - 2 \log V(N) \tag{3.3}
\]
For our scheme we make similar assumptions about the distributions of S_1 and S_2. We then see that, up to additive constants independent of τ,
A. L. Yuille, S. M. Smirnakis, and L. Xu
⟨log P_DD(S_1, S_2)⟩ = -½ log{⟨S_1²⟩⟨S_2²⟩ - ⟨S_1 S_2⟩²} = -½ log{V(S_2) V(N)} [since ⟨S_1 S_2⟩ = ⟨(S_2 + N) S_2⟩ = V(S_2) and ⟨S_1²⟩ = V(S_2) + V(N)]. We now observe that if we choose the prior distribution P_p(S_1, S_2) ∝ e^{-τ(S_1 - S_2)²}, our criterion corresponds to minimizing E_YSX[V(S_2), V(N)], where

\[
E_{YSX}[V(S_2), V(N)] = -\log V(S_2) - \log V(N) + \tau V(N) \tag{3.4}
\]

It is easy to see that maximizing E_BH[V(S_2), V(N)] will try to make V(S_2) as large as possible and force V(N) to zero [recall that, by definition, V(N) ≥ 0]. On the other hand, minimizing our energy will try to make V(S_2) as large as possible and will force V(N) to 1/τ. Since τ appears as the inverse of the variance of the gaussian prior for S = (S_1, S_2), making τ large will force the prior distribution to approach δ(S_1 - S_2). Thus, in the case of large τ, our method has the same effect as the Becker and Hinton algorithm. For this to be true, it is important to choose a network architecture satisfying the requirement that the output units representing disparity have spatially disjoint receptive fields (see Fig. 4). If this were not the case, the output units would run the risk of getting entrained on the receptive field overlap, provided it has the right probability structure. Even though we did not pursue this issue in the above analysis, it is, in principle, possible to implement such architectural constraints by defining a prior distribution on the weights of the network. Note that, in principle, maximizing the mutual information between S_1, S_2 can only determine the network output up to transformations that leave the mutual information invariant. Which solution the network will settle at depends on the specifics of the implementation and on initial conditions. For instance, in the Becker and Hinton example the network sometimes settles so that S_1 ≈ S_2, and sometimes so that S_1 = -S_2. This may not always be desirable. In this context, the ability to choose a prior affords a natural way to restrict the possible space of solutions.
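As a quick sanity check of this comparison, the following sketch (our own illustration, not the authors' simulation; all numbers are invented) evaluates the two one-dimensional energies over a range of noise variances V(N), confirming that minimizing E_YSX settles V(N) near 1/τ while maximizing E_BH drives V(N) toward zero:

```python
import numpy as np

# Sanity-check sketch (illustration only, not the authors' simulation):
# evaluate the two one-dimensional energies as functions of the noise
# variance V(N), holding V(S2) fixed.  E_BH (equation 3.3) is maximized and
# pushes V(N) toward 0; E_YSX (equation 3.4) is minimized and settles V(N)
# at 1/tau.

def e_bh(v_s2, v_n):
    return np.log(v_s2 + v_n) + np.log(v_s2) - 2.0 * np.log(v_n)

def e_ysx(v_s2, v_n, tau):
    return -np.log(v_s2) - np.log(v_n) + tau * v_n

tau, v_s2 = 100.0, 1.0
v_n = np.linspace(1e-4, 0.2, 2001)
best = v_n[np.argmin(e_ysx(v_s2, v_n, tau))]
print(best)  # close to 1/tau = 0.01
```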
4 Reformulating for Implementation in a General Setting
Our proposal requires us to minimize the Kullback-Leibler distance (equation 2.3) with respect to γ. In the previous section, we showed that Becker and Hinton's implementation of the mutual information maximization principle for disparity extraction arises as a special case of our formalism, for a particular prior. Therefore, their simulation already represents a concrete example of how our scheme can be implemented. In the present section, we expand further by outlining two general implementation strategies based on variants of stochastic learning. First observe that by substituting the form of the derived distribution, P_DD(S; γ) = ∫ δ[S - G(D, γ)] P_D(D) [dD], into equation 2.3 and integrating out the S variable we obtain

\[
KL(\gamma) = \int P_D(D) \, \log \frac{P_{DD}[G(D, \gamma); \gamma]}{P_p[G(D, \gamma)]} \, [dD] \tag{4.1}
\]
This is the form of the Kullback-Leibler distance that we assume in the implementation strategies we describe below:

1. Assuming a representative sample {D^μ : μ ∈ Λ} of inputs, we can approximate KL(γ) by Σ_{μ∈Λ} log{P_DD[G(D^μ, γ); γ] / P_p[G(D^μ, γ)]}. We can now, in principle, perform stochastic learning using backpropagation: pick inputs D^μ at random and update the weights γ using log{P_DD[G(D^μ, γ); γ] / P_p[G(D^μ, γ)]} as the error function. To do this, however, we need expressions for P_DD[G(D^μ, γ); γ] and its derivative with respect to γ. If the function G(D, γ) can be restricted to being 1-1 (artificially increasing the dimensionality of the output space if necessary), then we can obtain the analytic expressions P_DD[G(D, γ); γ] = P_D(D)/|det(∂G/∂D)| and ∂ log P_DD[G(D, γ); γ]/∂γ = -(∂G/∂D)^{-1}(∂²G/∂D ∂γ), where ^{-1} denotes the matrix inverse. To see this we observe that

\[
P_{DD}(S; \gamma) = \int \delta[S - G(D, \gamma)] \, P_D(D) \, [dD] = \frac{P_D(D^*)}{\left| \det (\partial G / \partial D)(D^*, \gamma) \right|} \tag{4.2}
\]

where D* = G^{-1}(S, γ) and we assume that the function G is 1-1. It follows directly that

\[
\frac{\partial \log P_{DD}[G(D, \gamma); \gamma]}{\partial \gamma} = -\left( \frac{\partial G}{\partial D} \right)^{-1} \frac{\partial^2 G}{\partial D \, \partial \gamma} \tag{4.3}
\]

Substituting back into the K-L measure (equation 4.1) means that we must minimize with respect to γ the cost function E[γ, D], averaged over a sample of D (where we have dropped terms that are independent of γ):

\[
E[\gamma, D] = -\log \left\{ \left| \det \frac{\partial G}{\partial D}(D, \gamma) \right| \, P_p[G(D, \gamma)] \right\} \tag{4.4}
\]

We implement this by stochastic learning: pick an input D at random, set γ_new = γ_old - ε(∂E/∂γ) (where ε is the learning rate), and repeat. This involves calculating ∂E/∂γ. After some algebra we find that
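This first strategy can be sketched in a toy setting; everything concrete below is our own choice for illustration (the 1-1 map G(D, γ) = tanh(γD), a unit gaussian prior P_p on the output, unit gaussian inputs, and a numerical gradient in place of the analytic expression):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch of implementation strategy 1 under assumed choices: a 1-1 map
# G(D, gamma) = tanh(gamma * D), a unit gaussian prior P_p on the output S,
# inputs D drawn from a unit gaussian, and a numerical gradient in place of
# the analytic expression.  The per-sample cost follows the form of
# equation 4.4: E = -log|dG/dD| - log P_p[G(D, gamma)].

def cost(d, gamma):
    t = np.tanh(gamma * d)
    dG_dD = gamma * (1.0 - t ** 2)                         # dG/dD for G = tanh(gamma D)
    log_prior = -0.5 * t ** 2 - 0.5 * np.log(2.0 * np.pi)  # assumed N(0,1) prior
    return -np.log(np.abs(dG_dD)) - log_prior

# Stochastic learning: gamma_new = gamma_old - eps * dE/dgamma.
gamma, eps, h = 2.0, 0.02, 1e-5
for _ in range(2000):
    d = rng.normal()
    grad = (cost(d, gamma + h) - cost(d, gamma - h)) / (2.0 * h)
    gamma -= eps * grad
print(gamma)  # settles at a finite gamma > 0 balancing the two terms
```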
Competition and Multiple Cause Models
The equivalent for the noisy-or has

which lacks the reduction in the gradient as p^NO → 1. In the case that the s_j are themselves stochastic choices from underlying independent binomials, we need an estimate of the expected cost under the cross-entropy error measure, namely
\[
\langle E \rangle_{\{s_j\}} = \left\langle f_i \log p_i^c + (1 - f_i) \log(1 - p_i^c) \right\rangle_{\{s_j\}}
\]

One way to do this would be to collect samples of the {s_j}. Another way, which is a rather crude approximation, but which has worked, is to use f_i log p̂_i + (1 - f_i) log(1 - p̂_i), where p̂_i is given by

(4.1)
The term on the left is just a mean field inspired approximation to the activation function from equation 3.7 (using p_j in place of s_j). The extra term on the right takes partial account of the possibility that none of the s_j is on; this possibility is underestimated in the term Σ_j p_j b_ij, which is insensitive to the generative priority of the p_j, in that the s_j are first generated from the p_j before the f_i are picked. For this, we employ just the noisy-or, written in terms of the odds b_ji. We used this mean field approximation to generate the results in Figure 3. Figure 5 shows how both the approximation in equation 4.1 and the simpler approximation p̃_i = 1 - 1/(1 + Σ_j p_j b_ij) compare to the true value of p_i^c in a case like the one before of two causes, where p_1 = 1, c_{1i} = 0.5, and across different values of p_2 and c_{2i}. An anonymous referee pointed out the substantial difference between the true p_i^c = 0.67 and the approximate value 0.5 for p_2 = 1 and c_{2i} = 0.5. From our experiments, the important case seems to be as c_{2i} → 1, and we can see that p̂ is better than p̃ in this limit.
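The gap left by such mean field plug-ins can be seen in a toy calculation. This is our own illustration, assuming an activation of the odds form 1 - 1/(1 + Σ_i s_i b_i) (the form underlying the simpler approximation quoted above); the exact expectation over binary causes differs from the value obtained by substituting the means:

```python
import itertools

# Toy illustration of why a mean-field plug-in is only approximate.  We
# assume an activation f(s) = 1 - 1/(1 + sum_i s_i * b_i) and compare its
# exact expectation over binary s_i ~ Bernoulli(p_i) with the plug-in that
# replaces each s_i by its mean p_i.  All numbers are invented.

def f(s, b):
    return 1.0 - 1.0 / (1.0 + sum(si * bi for si, bi in zip(s, b)))

def exact(p, b):
    total = 0.0
    for s in itertools.product([0, 1], repeat=len(p)):
        w = 1.0
        for si, pi in zip(s, p):
            w *= pi if si else (1.0 - pi)
        total += w * f(s, b)
    return total

p, b = [0.5, 0.5], [1.0, 1.0]
print(exact(p, b))  # 5/12 = 0.4166..., the exact expectation
print(f(p, b))      # 0.5, the plug-in overestimates here
```

The nonlinearity of f in the sum is what makes the plug-in biased; it is exact only when the p_i are all 0 or 1.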
5 Discussion
We have addressed the problem of how multiple causes can jointly specify an image, in the somewhat special case in which they interact at most weakly: different causes describe different parts of the same image. We used this last constraint in the form of a generative model in which the probability distribution of the value of each pixel is specified on any occasion by just one cause (or a null or bias cause). This is the generative form of Keeler, Rumelhart, and Leow's summing forward model in their ISR architecture. The model is more competitive than previous
Peter Dayan and Richard S. Zemel
Figure 5: Mean-field approximations to p_i^c. The graphs show the ratios of p̂_i and p̃_i to p_i^c for the case of two causes, where p_1 = 1 and c_{1i} = 0.5. The behavior of p̃_i at c_{2i} = 1 and small p_2 exhibits the insensitivity mentioned in the text.
schemes, such as the noisy-or, linear combination, or combination using a sigmoid activation function, and provides a principled way of learning sparse distributed representations. It has applications outside the self-supervised autoencoding examples that have motivated our work. For instance, one could use a function based on this for the supervised learning in Nowlan and Sejnowski's (1993) model of motion segmentation, in which each local region in an image is assumed to support at most one image velocity. There is a natural theoretical extension of this model to the case of generating gray values for pixels rather than black or white ones. This uses the same notion of competition as above (at most one cause is responsible for generating the value of a pixel) but allows different causes to maintain different probabilities t_{ik} of setting y_j = k, where k corresponds to a real-valued activation of the pixel. The b_{ij} odds again determine the amount of responsibility generator i takes for setting the value of pixel j, and the t_{ik} would determine what i would do with the pixel if it is given the opportunity. This scheme also requires a bias, giving the probability that y_j = k if none of the causes wins in the f_j competition. This makes
for the case of binary s_j. Note that equation 3.7 is a simple case of equation 5.1 where t_{i1} = 1 for each cause and the bias is zero. Once again, we can sample from the distribution generating the s_j to calculate the expected cost of coding y, using this as the prior. We have considered the case where k can be black (0) or white (1) as a way of formalizing a write-white-and-black imaging model (Saund 1995). Unfortunately a mean field version of equation 5.1, combining the two terms in a manner analogous to equation 4.1, yields a poor approximation. Causes with b_i very large, p_i moderate, and t_{i0} = 1 can outweigh causes with b_i moderate, p_i = 1, and t_{i1} = 1. Saund (1995) used a technique that separates out the contributions from causes that try to turn the pixel black from those that try to turn it white before recombining them. This can be seen as a different mean field approximation to equation 5.1. However it did not perform well in the examples we tried, suggesting that it might rely for its success on Saund's more powerful activation scheme, which has an inner optimization loop. The weak interaction that the competitive schemes use is rather particular: in general there may be causes that are separable on different dimensions but that interact strongly in producing an output (e.g., base pitch and timbre for a musical note, or illumination and object location for an image). The same competitive scheme as here could be used within a dimension (e.g., notes at different gross pitches might have roughly separable spectrograms like the horizontal bars in the figure) but learning how they combine is more complicated, introducing such issues as
the binding problem. Yet it has applications to many interesting and difficult problems, such as image segmentation, where complex occlusion instances can be described based on the fact that each local image region can be accounted for by a single opaque object (Zemel and Sejnowski, 1995).
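The competitive generative idea used throughout (at most one cause sets each pixel) can be sketched as a sampling procedure. Everything concrete below (the weights, the tables, and the odds form of the claim probability) is an invented illustration rather than the exact model of the paper:

```python
import random

random.seed(0)

# Toy sketch of the competitive generative idea: each pixel value is set by
# at most one cause.  Active causes (s_i = 1) compete for a pixel with
# weights b[i]; the winning cause i draws the pixel value k from its own
# table t[i]; if no cause claims the pixel, a bias distribution is used.
# All numbers here are invented for illustration.

def categorical(dist):
    u, acc = random.random(), 0.0
    for k, pk in enumerate(dist):
        acc += pk
        if u <= acc:
            return k
    return len(dist) - 1

def sample_pixel(s, b, t, t_bias):
    weights = [si * bi for si, bi in zip(s, b)]
    total = sum(weights)
    # Pixel is claimed with probability total / (1 + total), mirroring the
    # odds form 1 - 1/(1 + sum) quoted in the text.
    if total == 0.0 or random.random() > total / (1.0 + total):
        return categorical(t_bias)
    r = random.uniform(0.0, total)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0.0:
            return categorical(t[i])
    return categorical(t[-1])

s = [1, 0]                       # cause 0 present, cause 1 absent
b = [4.0, 4.0]                   # competition odds for this pixel
t = [[0.1, 0.9], [0.9, 0.1]]     # cause 0 favors white (k = 1)
t_bias = [0.5, 0.5]
samples = [sample_pixel(s, b, t, t_bias) for _ in range(5000)]
print(sum(samples) / len(samples))  # near 0.8 * 0.9 + 0.2 * 0.5 = 0.82
```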
Acknowledgments

We are very grateful to Virginia de Sa, Geoff Hinton, Terry Sejnowski, Paul Viola, and Chris Williams for helpful discussions, to Eric Saund for generously sharing unpublished results, and to two anonymous reviewers for their helpful comments. Support was from grants to Geoff Hinton from the Canadian NSERC and to Terry Sejnowski from the ONR.
References

Barlow, H. 1961. The coding of sensory messages. In Current Problems in Animal Behaviour, pp. 331-360. Cambridge University Press, Cambridge.
Barlow, H., Kaushal, T., and Mitchison, G. 1989. Finding minimum entropy codes. Neural Comp. 1, 412-423.
Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. 1995. The Helmholtz machine. Neural Comp., in press.
Foldiak, P. 1990. Forming sparse representations by local anti-Hebbian learning. Biol. Cybern. 64, 165-170.
Hinton, G. E., and Zemel, R. S. 1994. Autoencoders, minimum description length, and Helmholtz free energy. In Advances in Neural Information Processing Systems, 6, pp. 3-10. Morgan Kaufmann, San Mateo, CA.
Jacobs, R. A., Jordan, M. I., and Barto, A. G. 1991a. Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cog. Sci. 15, 219-250.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991b. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.
Keeler, J. D., Rumelhart, D. E., and Leow, W. K. 1991. Integrated segmentation and recognition of hand-printed numerals. In Advances in Neural Information Processing Systems, 3, R. P. Lippmann, J. Moody, and D. S. Touretzky, eds., pp. 557-563. Morgan Kaufmann, San Mateo, CA.
Nowlan, S. J. 1990. Competing Experts: An Experimental Investigation of Associative Mixture Models. Tech. Rep. CRG-TR-90-5, Department of Computer Science, University of Toronto, Canada.
Nowlan, S. J., and Sejnowski, T. J. 1993. Filter selection model for generating visual motion signals. In Advances in Neural Information Processing Systems, 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 369-376. Morgan Kaufmann, San Mateo, CA.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
Saund, E. 1994. Unsupervised learning of mixtures of multiple causes in binary data. In Advances in Neural Information Processing Systems, 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds. Morgan Kaufmann, San Mateo, CA.
Saund, E. 1995. A multiple cause mixture model for unsupervised learning. Neural Comp., in press.
Schmidhuber, J. H. 1992. Learning factorial codes by predictability minimization. Neural Comp. 4, 863-879.
Zemel, R. S. 1993. A minimum description length framework for unsupervised learning. Ph.D. Dissertation, Computer Science, University of Toronto, Canada.
Zemel, R. S., and Sejnowski, T. J. 1995. Grouping components of three-dimensional moving objects in area MST of visual cortex. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. Touretzky, and T. Leen, eds. Morgan Kaufmann, San Mateo, CA. To appear.
Received April 21, 1994; accepted September 20, 1994.
R. Rovatti et al.
the classes of the sorted neighbors is defined and used to decide which class best classifies x. Many methods (Cover and Hart 1967; Tomek 1976; Dudani 1976; Parthasarathy and Chatterji 1990) have been proposed for this latter decision, which possibly leads from A(x) to the correct association C(x). One of the most investigated strategies makes the decision dependent on a single piece of knowledge, setting k = 1 and then associating the element under examination to the class of its nearest neighbor. This approach can be easily generalized to k > 1 with a majority decision rule associating x with the class that most frequently appears in A(x). Formal proofs exist (Cover and Hart 1967) which ensure that these decision policies behave optimally when the training set T has strong regularities and is infinite. A decision rule model is developed in this paper to generalize the majority mechanism as well as some earlier generalizations. The information about the knowledge base structure can be encoded in this general model by adapting two distinct parts of the decision procedure. One of the two adaptive options is investigated in this paper, while a preliminary description of a second methodology can be found in Kovács et al. (1993b).

3 A Model for the Generalized Majority Decision Rule
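The classical majority rule just described can be sketched as follows (toy data; a squared Euclidean dissimilarity is assumed purely for illustration):

```python
from collections import Counter

# Minimal sketch of the classical k-NN majority rule described above
# (toy data, squared Euclidean dissimilarity assumed for illustration).

def knn_majority(x, training, k):
    # training: list of (point, label) pairs
    by_distance = sorted(
        training,
        key=lambda pl: sum((a - b) ** 2 for a, b in zip(pl[0], x)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

training = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
            ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
print(knn_majority((0.05, 0.1), training, k=3))  # "A"
```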
If we consider the product class space C^k = C × C × ... × C, we may think of A(x) as a point in C^k, while the decision criteria become a partition of that set into n subsets. Let us assume that, given a certain class c_i, some points P_i, P'_i, P''_i, ... ∈ C^k can be taken as the prototypes for the k-tuples A(x) of those x that should be classified as c_i. Let us define a measure of similarity σ : C^k × C^k → R by assessing the similarity between a prototype and a given class k-tuple. When a point x has to be examined, the similarities σ[A(x), P_i], σ[A(x), P'_i], ... with the established prototypes are evaluated for each class c_i. Then a decision is taken assuming that the greater the similarity between a prototype and A(x), the greater our confidence in classifying x in the class associated to that prototype. In the following we will assume a simple definition of similarity between X = (x_1, x_2, ..., x_k) and Y = (y_1, y_2, ..., y_k), i.e.,

\[
\sigma(X, Y) = \sum_{j=1}^{k} \alpha_j \, \delta(x_j, y_j) + \beta \tag{3.1}
\]
where the α_j and β are real numbers and δ(·,·) is a Kronecker-like operator defined as follows

\[
\delta(x, y) = \begin{cases} a & \text{if } x = y \\ b & \text{otherwise} \end{cases}
\]
Voting Rules for k-NN Classifiers
The association {match → a, mismatch → b} is an arbitrary coding that requires the only obvious condition a ≠ b. Note that the bias term β can be discarded in the classification procedure as long as the final decision policy is concerned only with the relative magnitudes of the similarities. This approach includes the classical association rules if the single prototype

\[
P_i = (c_i, c_i, \ldots, c_i) \in C^k \tag{3.2}
\]

is given for each class c_i and if α_1 = α_2 = ... = α_k = α is assumed. With these positions, a majority voting rule on A(x) is obtained. In fact, if ν[A(x), c_i] is the number of times that c_i appears in A(x), the following monotonic relationship holds between ν[A(x), c_i] and σ[A(x), P_i]:

\[
\sigma[A(x), P_i] = \alpha \left\{ \nu[A(x), c_i] \, a + (k - \nu[A(x), c_i]) \, b \right\} + \beta
\]
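The reduction of the similarity rule to majority voting can be checked directly in a small sketch (our illustration; the coding values and neighbor classes are invented): with equal coefficients, the class whose all-same-class prototype scores highest is the majority class of A(x).

```python
# Check (invented coding and neighbor classes) that the linear similarity
# with equal coefficients reduces to majority voting.

A_CODE, B_CODE = 1.0, 0.0   # arbitrary match/mismatch coding; only a != b matters

def delta(x, y):
    return A_CODE if x == y else B_CODE

def sigma(X, Y, alpha, beta=0.0):
    return sum(al * delta(xj, yj) for al, xj, yj in zip(alpha, X, Y)) + beta

neighbors = ["A", "A", "B", "A", "C"]   # classes of the k = 5 neighbors
alpha = [1.0] * 5                       # isotropic (equal) coefficients
scores = {c: sigma(neighbors, [c] * 5, alpha) for c in ["A", "B", "C"]}
print(max(scores, key=scores.get))  # "A", the majority class
```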
In an adaptive decision rule following this model, both the prototype set and the similarity coefficients should depend on the characteristics of the training set T. In practice an improvement in performance can be achieved by exploiting just one of the two adaptive options of the model. In Kovács et al. (1993b) an adaptive set of prototypes is extracted from the analysis of T and applied to improve the performance of a generic k-NN classifier in the handwritten character recognition task; in that case an isotropic similarity measure is considered in which all the similarity coefficients α_j are equal. In the following, the formal definition of optimum similarity coefficients is developed and used to show how the performance can be increased. The prototype set is given a priori, independently of T, and consists of the n elements P_i of equation 3.2.

4 Optimum Similarity Coefficients
In this section we develop a procedure to encode the statistical features of the training set in the similarity coefficients α_j. In an ideal training set, each neighbor would be classified according to the same statistical distribution (Cover and Hart 1967). In this case, discriminating among neighbors by means of the similarity coefficients is of no use. Yet, when the training set is finite and sparse, its specific features are nontrivial and can be exploited to improve the classification performance, as they subsume the detailed structure of T in a few significant parameters. Moreover, relying only on statistical features and not on distance information as used in Dudani (1976) and Parthasarathy and Chatterji (1990), we make the overall approach applicable even when such a distance is not a true metric (Kovács and Guerrieri 1992).
For each class c_i ∈ C, its unique prototype P_i = (c_i, c_i, ..., c_i) ∈ C^k can be compared with the k-tuple A(x) = (a_1, a_2, ..., a_k) to construct the random match-mismatch vectors Δ[c_i, A(x)] = (δ_{i1}, δ_{i2}, ..., δ_{ik}) ∈ {a, b}^k such that δ_{ij} = δ(c_i, a_j). Due to equation 3.1, the similarity measure is a function of the match-mismatch vector of its arguments. The performance of such a measure is, therefore, dependent on the probability of the random event C(x) = c_i when the realization of Δ[c_i, A(x)] is known, i.e.,

\[
\Pi = \mathrm{Prob}\{ C(x) = c_i \mid \Delta[c_i, A(x)] \} \tag{4.1}
\]
In fact, an optimally behaving similarity measure depending only on the match-mismatch vector is high when Π is high and vice versa, encoding the statistically correct decision policy that classifies x as c_i when the corresponding Π is the largest.
We may think of tabulating every possible realization of Π (i.e., corresponding to every possible Δ[c_i, A(x)] ∈ {a, b}^k) in a suitable array Π_p with p = 1, ..., 2^k. Given the linear model of similarity presented in equation 3.1, we approximate the conditional probabilities in the least-squares sense, minimizing

\[
E = \sum_{p=1}^{2^k} \left( \sum_{j=1}^{k} \alpha_j \delta_{pj} + \beta - \Pi_p \right)^2 \tag{4.2}
\]

which can be minimized by solving the k + 1 simultaneous equations ∂E/∂α_j = 0 (j = 1, ..., k) and ∂E/∂β = 0.
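The least-squares fit itself is routine. The sketch below (with invented target probabilities Π_p that are exactly linear in the match-mismatch components) recovers the coefficients with a standard solver:

```python
import itertools
import numpy as np

# Sketch of the least-squares fit of equation 4.2, with invented target
# probabilities Pi_p that are exactly linear in the match-mismatch
# components, so the fit recovers the generating coefficients.

a, b, k = 1.0, -1.0, 3
vectors = list(itertools.product([a, b], repeat=k))    # all 2**k vectors

weights = [0.5, 0.3, 0.2]                              # toy: nearer neighbors matter more
Pi = np.array([0.5 + 0.4 * sum(w * d for w, d in zip(weights, v))
               for v in vectors])                      # valid probabilities in [0.1, 0.9]

X = np.array([list(v) + [1.0] for v in vectors])       # design matrix [delta | 1]
coef, *_ = np.linalg.lstsq(X, Pi, rcond=None)
alpha, beta = coef[:k], coef[k]
print(alpha, beta)  # approximately [0.2, 0.12, 0.08] and 0.5
```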
Let us assume that the conditional probabilities Π_p can be estimated for each of the 2^k possible match-mismatch vectors. This assumption is legitimate and will be further discussed in Section 5. In this case a compact closed-form solution can be derived. In fact, as the index p scans the collection of every possible match-mismatch vector, the following equalities hold:

\[
\sum_{p=1}^{2^k} \delta_{pj} = 2^{k-1}(a + b), \qquad j = 1, 2, \ldots, k \tag{4.3}
\]

\[
\sum_{p=1}^{2^k} \delta_{pj_1} \delta_{pj_2} = \begin{cases} 2^{k-1}(a^2 + b^2) & \text{if } j_1 = j_2 \\ 2^{k-2}(a + b)^2 & \text{otherwise} \end{cases} \tag{4.4}
\]
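These counting identities are easy to verify exhaustively for a small k and an arbitrary coding:

```python
import itertools

# Exhaustive numerical check of the counting identities (equations 4.3 and
# 4.4) for a small k and an arbitrary match/mismatch coding (a, b).

a, b, k = 2.0, -3.0, 4
vectors = list(itertools.product([a, b], repeat=k))

for j in range(k):                                   # equation 4.3
    assert sum(v[j] for v in vectors) == 2 ** (k - 1) * (a + b)

for j1 in range(k):                                  # equation 4.4
    for j2 in range(k):
        s = sum(v[j1] * v[j2] for v in vectors)
        if j1 == j2:
            assert s == 2 ** (k - 1) * (a ** 2 + b ** 2)
        else:
            assert s == 2 ** (k - 2) * (a + b) ** 2
print("identities hold")
```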
These equalities allow us to find a closed-form solution to the problem stated in 4.2:

\[
\alpha_j = \frac{2^{-k+1}}{a - b} \sum_{p=1}^{2^k} \Pi_p \, \frac{2\delta_{pj} - a - b}{a - b} \tag{4.5}
\]

with a corresponding expression (4.6) for the bias β.
Note that the derivation of 4.5 and 4.6 is completely independent of the actual values (a, b) chosen to code the match and mismatch conditions. In fact we have that

\[
\frac{2\delta_{pj} - a - b}{a - b} = \begin{cases} +1 & \text{if } \delta_{pj} = a \\ -1 & \text{if } \delta_{pj} = b \end{cases} \tag{4.7}
\]
It is then possible to argue that the relative magnitudes of the optimal interpolating coefficients determined by means of 4.5 are independent of the coding of the match and mismatch conditions. Moreover, the structure of 4.5 shows how the relative magnitudes of the optimal similarity coefficients quantify the correlation between Π_p and coding-independent equivalents of the components of the match-mismatch vector. This property links the least-squares approach to the correlation learning methodology (Hebb 1949) and gives an a posteriori interpretation of the results. We may finally show that this approach maintains its consistency in the case of an ideal training set (Cover and Hart 1967). In fact, we may exploit the Bayes rule to write 4.1 in the usual form

(4.8)

Let us assume that the probability distributions of the instances of each class are continuous and independent and that the cardinality of the training set grows to infinity. It can be shown (Cover and Hart 1967) that the class distributions of the k-nearest neighbors tend with probability one to the class distribution of the instance we are considering. In this case, the classes of each neighbor and the class of the instance are independently and equally distributed. Thus, from 4.8 we see that if two match-mismatch vectors Δ_1 and Δ_2 can be obtained from each other by means of a permutation of their components, then Π_{Δ_1} = Π_{Δ_2}. The value of Π_{Δ[c_i, A(x)]} depends only on the number of a's and b's in the match-mismatch vector. Let us indicate with M_m the subset of those match-mismatch vectors with m matches (and k - m mismatches) and with Π̂_m the value of 4.1 common to those vectors. We may take 4.5 and obtain
Yet, as the inner sum scans M_m, each component δ_j assumes the value a for \binom{k-1}{m-1} of the vectors and the value b for \binom{k-1}{m} of them. Thus, from 4.7 we may derive

\[
\alpha_j = \frac{2^{-k+1}}{a - b} \sum_{m=0}^{k} \hat{\Pi}_m \binom{k}{m} \frac{2m - k}{k}
\]

from which it follows that α_1 = α_2 = ... = α_k. We may recall our discussion at the end of Section 3 to conclude that our methodology behaves consistently in the ideal case, indicating that the majority voting technique is the best linear voting rule.
5 Results

The above methodology has been applied to a problem of handwritten character recognition. We used the 44,951 upper case letter examples in the NIST Special Database 3 in order to train our system (i.e., to extract the optimal similarity coefficients given by equation 4.5) and all the 11,941 upper case letter images of NIST Test Data 1 (Garris and Wilkinson 1992) to test its validity. We considered three different existing k-NN classifiers based on different preprocessing and feature extraction algorithms, proposed also in Kovács et al. (1993a). The first classifier uses noise filtering, deskewing, and size normalization, while feature extraction is based on the distance transform (Kovács and Guerrieri 1992). The second classifier differs from the first due to the lack of the deskew operation and the use of a chain code histogram feature (Takahashi 1991). The third classifier has the same preprocessing as the first, and its feature extraction is the same as the second classifier's. The dissimilarity measure used in all classifiers is the semimetric described in Kovács and Guerrieri (1992), which was specifically designed to cope with the character recognition task. It is important to stress that all classifiers, in spite of the fact that they are based on the same training set, are quite different in terms of neighborhoods and recognition performance, due to their algorithmic differences. The set of prototypes has been selected assuming n = 26 and
\[
P_i = (c_i, c_i, \ldots, c_i), \qquad i = 1, \ldots, n
\]
Figure 1: Classification performances for different values of k (neighborhood size) using optimal similarity coefficients.

i.e., expressing the plain idea that if all the neighbors of an element belong to the same class, the element itself belongs to that class. According to our methodology, the training set produced 44,951 neighborhoods that have been matched against the 26 prototypes, giving rise to 1,168,726 samples of match-mismatch vectors. These were enough to successfully estimate all the necessary conditional probabilities by using the corresponding relative frequencies. In Figure 1, a comparison between the classification performances of the optimal similarity coefficients applied to the third classifier is shown for several values of the neighborhood size k. During classification, the similarity is used to define the confidence level to associate with the decision; this allows us to introduce a reject option. Hence, an extensive comparison can be made by plotting the error rate as a function of the rejection rate. It can be noted that all error curves are monotonically decreasing functions of the rejection rate, demonstrating that the definition of a similarity-based confidence is well posed. At high reject rate levels every curve tends to saturate at an error level depending on the neighborhood size:
Table 1: Optimal Similarity Coefficients (k = 7).

Position   Classifier 1   Classifier 2   Classifier 3
   1         1.0            1.0            1.0
   2         0.591945       0.621187       0.607418
   3         0.484895       0.422457       0.462484
   4         0.354451       0.273057       0.394019
   5         0.281967       0.219016       0.365996
   6         0.293107       0.224698       0.303665
   7         0.174177       0.179763       0.256124
the greater the k, the lower the asymptotic error rate value. Yet, subsequent improvements due to the increase of k tend to vanish even in the low-rejection region. This trend supports our assumption on the possibility of estimating the necessary conditional probabilities for each of the 2^k match-mismatch vectors. In fact, as the quality of the classifier rapidly increases with k up to its bound, there is no need to consider big neighborhoods. Thus, k can be kept reasonably low (Fig. 1 shows that a good choice for this database is k = 7) and it can be assumed that the estimation of the 2^k possible Π_p remains reliable. In Table 1, the optimal similarity coefficients, found by means of equation 4.5 for the three classifiers under examination, are reported for k = 7. The values have been normalized to the first coefficient. It is worth noting that we may recall the correlation encoding point of view from Section 4 to confirm the intuitive idea that, generally speaking, the nearer the neighbor, the greater its information content. This heuristic notion was also suggested in Dudani (1976) and Parthasarathy and Chatterji (1990), while Table 1 tends to validate this idea a posteriori. Table 2 shows the performance of the similarity coefficients in Table 1 and that of majority voting when no rejection is allowed. We found that a simple majority rule is not able to classify each element of the test set, because sometimes there is more than one class with the same maximum number of occurrences in the neighborhood. In this case, a further rule has to be applied to obtain the classification. However, assuming an always correct or always incorrect tie breaking, best-case and worst-case classification results can be obtained. Obviously, once a policy has been defined, the real answer of the system will lie somewhere between them. In Table 2 these theoretically best and worst case performances are listed together with the 1-NN case and our similarity coefficients.
It is worth noting that the careful adaptation of the similarity coefficients makes our methodology always comparable to, if not better than, a majority voting approach with an ideal tie-breaking policy that could not, in any case, be known a priori.
Table 2: Error Rate of Voting Rules at 0% Reject.

                          Majority
Classifier   1-NN    Worst case   Best case   Similarity coefficients
1            7.31%   7.29%        6.68%       6.46%
2            7.08%   6.57%        6.09%       6.11%
3            6.15%   5.72%        5.21%       5.28%
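The best-case/worst-case bracketing used in Table 2 can be sketched as follows (a hypothetical helper, assuming only the tie-handling logic described in the text):

```python
from collections import Counter

def majority_with_tie_bounds(neighbor_labels, true_label):
    """Bracket a majority vote's correctness: when several classes tie
    for the maximum count, an ideal tie breaker picks the true class if
    it is among the tied ones (best case), while an adversarial tie
    breaker avoids it (worst case)."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    tied = [c for c, n in counts.items() if n == top]
    if len(tied) == 1:
        correct = tied[0] == true_label
        return correct, correct
    return true_label in tied, False

best, worst = majority_with_tie_bounds(["A", "A", "B", "B", "C"], "A")
```

For the neighborhood [A, A, B, B, C] with true class A, the bounds are (True, False): an ideal tie breaker resolves the A/B tie correctly, an adversarial one does not. Averaging each bound over a test set yields the best- and worst-case error rates of Table 2.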
Moreover, once a tie-breaking strategy is chosen, a measure of confidence in the decision still has to be defined if more information on the classification is needed. A further rule is therefore needed for this purpose. On the contrary, the previous discussion of Figure 1 highlights that the similarity measure is a natural and well-posed confidence definition. To test its full performance, our methodology is finally compared with the consensual voting technique and with a neural classifier based on the same 64-dimensional feature vector used in the neighbor computation. Consensual voting is a generalization of the 1-NN rule commonly used when a reject option is required. As with the 1-NN rule, the instance under examination is assigned to the class of its first neighbor. Then, the confidence associated with this decision is determined by the number of nearest neighbors assigned to the same class as the first. The neural classifier is based on a three-layer feedforward neural network whose layers contain, respectively, 64, 325, and 26 sigmoidal units. Training and testing sets are the same for the three classifiers. In Figure 2, the error rate is shown as a function of the rejection rate when consensual and weighted voting are applied to the third classifier with k = 7. A notable improvement over consensual voting is observable for low levels of the rejection rate while, at the other extreme, the two curves approach the same asymptotic value since any uncertain case is discarded when large rejection rates are allowed. The neural classifier and the weighted voting approach show approximately the same recognition quality in a broad range of rejection rates. This is an indirect confirmation that the connectionist lesson about adaptivity can be received in the context of statistical classifiers based on the k-NN approach.

6 Conclusion
A generalized voting rule for k-NN classifiers has been presented that allows adaptivity to be incorporated in the decision procedure. A statistically based procedure has been developed to exploit this adaptivity and tighten the links between the training set features and the decision rules. These stronger links are expected to improve the performance of
R. Rovatti et al.
Figure 2: Comparison between consensual voting, neural classification, and weighted voting (error rate vs. rejection rate, %).

the classifier when the training set is far from ideal conditions. Moreover, as the adapted voting rule is based on an approximation of the probability of correct classification, it provides a well-behaved confidence measure that can be extremely useful for semantic postprocessing and for rejection. The proposed technique has been applied to three existing k-NN classifiers for the recognition of handwritten characters. Improvements have been measured over the classical 1-NN rule as well as over the consensual voting rule in treating uncertain cases, achieving the same recognition quality as a neural network trained on the same examples.
References

Bottou, L., and Vapnik, V. 1992. Local learning algorithms. Neural Comp. 4, 888-901.

Cao, J., Shridhar, M., Kimura, F., and Ahmadi, M. 1992. Statistical and neural classification of handwritten numerals: A comparative study. Proc. Int. Conf. Pattern Recognition, The Netherlands, 643-646.
Cover, T. M., and Hart, P. E. 1967. Nearest neighbor pattern classification. IEEE Transact. Inform. Theory 13, 21-27.

Dudani, S. A. 1976. The distance-weighted k-nearest-neighbor rule. IEEE Transact. Syst. Man Cybern. 4, 325-327.

Garris, M. D., and Wilkinson, R. A. 1992. NIST special database 3 and test data 1. NIST Advanced Systems Division, Image Recognition Group.

Hebb, D. O. 1949. The Organization of Behavior. Wiley, New York.

Kawabata, T. 1991. Generalization effects of k-neighbor interpolation training. Neural Comp. 3, 409-417.

Kovacs, Zs. M., and Guerrieri, R. 1992. Computer recognition of handwritten characters using the distance transform. Electron. Lett. 28, 1825-1827.

Kovacs, Zs. M., Guerrieri, R., and Baccarani, G. 1993a. Cooperative classifiers for high quality handprinted character recognition. Proc. World Congr. Neural Networks, Oregon, 186-189.

Kovacs, Zs. M., Ragazzoni, R., Rovatti, R., and Guerrieri, R. 1993b. Improved handwritten character recognition using 2nd order information from training set. Electron. Lett. 14, 1308-1309.

Lee, Y. 1991. Handwritten digit recognition using K nearest-neighbor, radial-basis, and backpropagation neural networks. Neural Comp. 3, 440-449.

MacKay, D. J. C. 1992a. Bayesian interpolation. Neural Comp. 4, 415-447.

MacKay, D. J. C. 1992b. The evidence framework applied to classification networks. Neural Comp. 4, 720-736.

Martin, G. L., and Pittman, J. A. 1991. Recognizing hand-printed letters and digits using backpropagation learning. Neural Comp. 3, 258-267.

Parthasarathy, G., and Chatterji, B. 1990. A class of new KNN methods for low sample problems. IEEE Transact. Syst. Man Cybern. 3, 715-718.

Richard, M. D., and Lippmann, R. P. 1991. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comp. 3, 461-483.

Takahashi, H. 1991. A neural net OCR using geometrical and zonal pattern features. Proc. Int. Conf. Document Anal. Recognition, France, 821-828.

Tomek, I. 1976. A generalization of the k-NN rule. IEEE Transact. Syst. Man Cybern. 2, 121-126.

Vapnik, V., and Bottou, L. 1993. Local algorithms for pattern recognition and dependencies estimation. Neural Comp. 5, 893-909.
Received June 10, 1994; accepted October 21, 1994.
Communicated by John Platt
Regularization in the Selection of Radial Basis Function Centers

Mark J. L. Orr
Centre for Cognitive Science, University of Edinburgh, 2, Buccleuch Place, Edinburgh EH8 9LW, UK

Subset selection and regularization are two well-known techniques that can improve the generalization performance of nonparametric linear regression estimators, such as radial basis function networks. This paper examines regularized forward selection (RFS), a combination of forward subset selection and zero-order regularization. An efficient implementation of RFS into which either delete-1 or generalized cross-validation can be incorporated and a reestimation formula for the regularization parameter are also discussed. Simulation studies are presented that demonstrate improved generalization performance due to regularization in the forward selection of radial basis function centers.

1 Introduction
In linear regression, subset selection is used to identify subsets of fixed functions of the independent variables (regressors), which can model most of the variation in the dependent variable. Finding the smallest subset that explains a given fraction of this variation is usually intractable, and suboptimal algorithms that do not search through all possible combinations of regressors are often used in practice. One of these, forward selection, starts with an empty model and then recursively adds the current most explanatory regressor to a growing subset until some criterion is met (Rawlings 1988). Chen et al. (1991) used forward selection to choose the hidden units (centers) of radial basis function (RBF) networks to produce parsimonious networks. They also described an efficient implementation of forward selection, which they called orthogonal least squares (OLS). The criterion used to halt center selection was a simple threshold on the fraction of variance explained by the subset model. If the threshold is chosen so that too much variance is explained by the chosen regressors, poor generalization performance (overfit) results. To avoid this, Orr (1993) introduced regularized forward selection (RFS) in which high hidden-to-output weights are penalized by using zero-order regularization (also known as ridge regression in statistics and weight decay in neural networks). Subsequently, Chen et al. (1995), by penalizing

Neural Computation 7, 606-623 (1995)
© 1995 Massachusetts Institute of Technology
the orthogonalized weights, found an efficient implementation of RFS, which they called regularized orthogonal least squares (ROLS). The purpose of the present paper is twofold. First, delete-1 or generalized cross-validation (GCV), both more effective ways of halting center selection than a fixed threshold on the explained variance, can easily be incorporated into OLS or ROLS. Second, if there are good a priori grounds for believing that the target function has some global smoothness property, then the combination of regularization and cross-validated selection will give better generalization performance than cross-validated selection alone. In addition, a reestimation formula for the regularization parameter is derived that lets the data choose its value. The combination of regularization and subset selection is rare but not unknown. Barron and Xiao (1991) use a derivative-based roughness penalty to avoid overfitting in the selection of subsets of polynomial regressors. Breiman (1992) used stacking to combine a number of different regression estimators of two types: one type used backward elimination (with various subset sizes) and the other type used ridge regression (with various values for the regularization parameter). The MARS algorithm (Friedman 1991) forwardly splits (then backwardly merges) spline basis functions using a GCV criterion and employs a kind of ridge regression. However, the regularization used in MARS merely fulfils a computational requirement: by fixing the regularization parameter just large enough to maintain numerical stability, the necessity for a numerically sensitive but computationally slow matrix inversion algorithm is avoided. Here, the combination of forward selection and zero-order regularization is explored with a view to improving the generalization performance of a single linear regression estimator, a radial basis function network.
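The combination can be sketched in a naive form, without the orthogonalization speed-ups reviewed in Section 3 (gaussian regressors, zero-order penalty, and greedy selection by penalized error; the function names and the fixed subset size are illustrative assumptions, not the paper's code):

```python
import numpy as np

def rbf_design(x, centers, r):
    """Full design matrix F: gaussian responses of every candidate center."""
    return np.exp(-((x[:, None] - centers[None, :]) / r) ** 2)

def regularized_forward_selection(F, y, lam, n_centers):
    """Greedily add the column of F that most reduces the penalized
    error E = e'e + lam * w'w; return selected column indices and weights."""
    p, M = F.shape
    selected = []
    for _ in range(n_centers):
        best_j, best_E = None, np.inf
        for j in range(M):
            if j in selected:
                continue
            H = F[:, selected + [j]]
            w = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)
            e = y - H @ w
            E = e @ e + lam * (w @ w)
            if E < best_E:
                best_E, best_j = E, j
        selected.append(best_j)
    H = F[:, selected]
    w = np.linalg.solve(H.T @ H + lam * np.eye(len(selected)), H.T @ y)
    return selected, w
```

Each greedy step here refits every remaining candidate from scratch; the ROLS machinery reviewed below exists precisely to avoid this cost.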
Some alternative methods of building up radial basis function networks, such as resource-allocating networks (Platt 1991; Kadirkamanathan and Niranjan 1993) or growing cell structures (Fritzke 1994), can be compared to RFS. In common with these methods, but unlike Moody and Darken (1989), RFS uses the output values as well as the input vectors of the training set to determine the center placement. However, in contrast, these other methods all involve adaptive centers (in position and size) and consequently some kind of gradient descent learning procedure and multiple passes through the data. In RFS the available centers are all fixed, but there is a process of selection to determine which ones are included in the network. The other methods search a continuous space (of weights, positions, and sizes) that grows in dimension as centers are added while RFS heuristically searches a discrete space of different combinations of fixed centers. Another difference is that the other approaches all involve several preset parameters and thresholds (used for adding new centers and performing gradient descent) that must be tuned to each new problem. RFS, applied to RBF networks, has only one preset parameter, the basis function width. One last difference is that the other methods are all naturally suited for on-line applications where the training data arrive sequentially in time. Although the RFS method could be adapted for that case, its fast implementation (ROLS, described below) depends, at least in its current form, on the data being available all at once. The next section briefly reviews RBF networks, regularization, and forward selection and Section 3 reviews OLS and ROLS, algorithms that efficiently implement unregularized and regularized forward selection. Section 4 shows how cross-validation can be integrated into OLS and ROLS and Section 5 derives a reestimation formula for the regularization parameter. Section 6 reports the results of some simulation studies and the final section presents conclusions.

2 Regularization and Forward Selection
The general linear model has the form

f(x) = Σ_{j=1}^{m} w_j h_j(x)

where the regressors, {h_j(·)}_{j=1}^{m}, are fixed functions of the input, x ∈ R^n, and only the coefficients, {w_j}_{j=1}^{m}, are unknown. The output, y ∈ R, is assumed to be scalar, for simplicity. To perform linear regression with this model on the training set {(x_i, y_i)}_{i=1}^{p} the system of equations

y = H w + e

is solved. The p × m elements of the design matrix H are the responses of the m regressors to the p inputs of the training set, y = [y_1 ... y_p]^T is the p-dimensional vector of training set outputs, and the vector e contains p unknown errors between these (measured) outputs and their true values. The goal is to find the best linear combination of the columns of H (i.e., the best value for w) to explain y according to some criterion. The normal criterion is minimization of the sum of squared errors,

E = e^T e

in which case the solution is

w = (H^T H)^{-1} H^T y    (2.1)

In radial basis function networks (Broomhead and Lowe 1988) the regressors are distinguished by a set of points (centers) {c_j}_{j=1}^{m} in the input space and a set of scale factors (radii) {r_j}_{j=1}^{m} such that

h_j(x) = φ(‖x − c_j‖ / r_j)    (2.2)
where φ(·) is a nonlinear function that monotonically decreases as its argument increases from zero, for example the gaussian function φ(z) = exp(−z²). Each regressor is associated with a hidden unit in a feedforward architecture with a single hidden layer and the coefficients {w_j}_{j=1}^{m} are the weights from the hidden units to the output unit. The radii can be kept constant (r_j = r, 1 ≤ j ≤ m), since such networks, even thus restricted, are still universal approximators (Park and Sandberg 1991). The fixed radius r can be set by some heuristic (Moody and Darken 1989). About half the maximum distance separating pairs of input training points often gives good results, I find. The components of multidimensional (n > 1) inputs, which may have widely different variances in the training set, should be rescaled to all have the same (e.g., unit) variance or, equivalently, an appropriate non-Euclidean metric should be employed in 2.2. If too many centers are used the large number of free parameters available in the regression will cause the network to be oversensitive to the details of the particular training set and result in poor generalization performance (overfit). An extreme case is if the set of centers is chosen to be the set of training inputs (c_i = x_i, 1 ≤ i ≤ p) in which case H is square of dimension p (I will call this the full design matrix and denote it by F). Then the normal equation 2.1 results in strict interpolation in which the training set is exactly reproduced by the network (Broomhead and Lowe 1988). There are two main ways to avoid overfit. The first, regularization (Tikhonov and Arsenin 1977; Bishop 1991), reduces the "number of good parameter measurements" (MacKay 1992) in a large model (e.g., the full model) by adding a weight penalty term to the minimization criterion. For example, minimization of the energy

E = e^T e + λ w^T w

is zero-order regularization (Press et al. 1992), or ridge regression as it is known in statistics (Hocking 1983), and results in the solution

w = (F^T F + λ I_p)^{-1} F^T y

(where I_p is the p × p identity matrix). The regularization parameter, λ, has to be chosen a priori or estimated from the data (see Section 5). The second way to avoid overfit is to explicitly limit the complexity of the network by allowing only a subset of the possible centers to participate. This method has the added advantage of producing parsimonious networks. Broomhead and Lowe (1988) suggested choosing such a subset randomly from the training inputs. However, a better approach is to choose the subset that best explains the variation in the dependent variable, and this is what the subset selection methods of regression analysis are for (Rawlings 1988). If forward selection is used, centers are picked one at a time from some large set (e.g., a regular array covering the sample space, or all the training set inputs) and added to an initially empty
subset model until some criterion is met. Regularized forward selection is formulated as follows (the unregularized formulation can be obtained by setting λ = 0 throughout). At the mth step the old design matrix, H_{m−1}, is augmented by a new column,

H_m = [H_{m−1}  f_j]

where f_j is chosen from the columns of the full design matrix F. After including the new column in the subset and finding the regularized weight

w_m = (H_m^T H_m + λ I_m)^{-1} H_m^T y

the minimized energy is

Ê_m = e_m^T e_m + λ w_m^T w_m = y^T P_m y

where

P_m = I_p − H_m (H_m^T H_m + λ I_m)^{-1} H_m^T

When λ = 0, P_m is a projection matrix projecting p-dimensional vectors perpendicular to the space spanned by the columns of H_m. The criterion used to select the best column (center) from F is the constraint
Ê_m^{(j)} ≤ Ê_m^{(i)},  1 ≤ i ≤ M    (2.3)

[...]

In the regularized case (λ > 0) orthogonalization is possible only if the roughness penalty term depends on the orthogonalized weights, w̃_m, and not the ordinary weights, w_m (Chen et al. 1995). Then the minimized energy is

Ê_m = y^T P_m y

where

P_m = I_p − H̃_m (H̃_m^T H̃_m + λ I_m)^{-1} H̃_m^T = P_{m−1} − f̃_j f̃_j^T / (λ + f̃_j^T f̃_j)

The selected f̃_j is the one that maximizes

(y^T f̃_j)² / (λ + f̃_j^T f̃_j)    (3.3)
and it becomes h̃_m, the last column of H̃_m. A more efficient alternative to orthogonalizing each f_i, 1 ≤ i ≤ p, at each step using 3.2 is to recursively compute the matrix

F̃_m = F̃_{m−1} − h̃_m h̃_m^T F̃_{m−1} / (h̃_m^T h̃_m)    (3.4)

(with F̃_0 = F initially) whose columns are precisely the {f̃_i} from which the selection at step m + 1 is made. To recover the unregularized weight vector w_m at the end, note that the components of the regularized weight vector w̃_m are given by

w̃_j = h̃_j^T y / (λ + h̃_j^T h̃_j),  1 ≤ j ≤ m

Then 3.1 can be used to obtain

w_m = U_m^{-1} w̃_m

an easy inversion since U_m is triangular. U_m can be recursively computed as

U_m = [ U_{m−1}   (H̃_{m−1}^T H̃_{m−1})^{-1} H̃_{m−1}^T f_j ]
      [ 0         1                                      ]    (3.5)

(with U_1 = 1 initially). Note that the matrix H̃_{m−1}^T H̃_{m−1} is diagonal. The efficiency of the orthogonalization scheme derives from the relative ease of computing 3.3 instead of 2.3, even with the overheads of 3.4 and 3.5. The computational cost (number of floating point operations) required to select one center from a pool of size M with p patterns in the training set is, to first order, proportional to Mp with orthogonalization. Without orthogonalization, the cost is roughly proportional to Mp². If the input training points are used as the pool of selectable centers then M = p and the corresponding costs are p² and p³, respectively.

4 Cross-Validation
The previous two sections described an algorithm for making selections but without mentioning criteria for halting the selection process. In Chen et al. (1991) a simple fixed threshold on the fraction of unexplained variance was used, so that the last center was selected as soon as the condition

Ê_m < ε y^T y

was satisfied, for some threshold 0 < ε < 1. When λ > 0, Ê_m is more than just the residual square error since it contains a weight penalty component. If the mean square residual error (MSRE) is used to halt center selection in RFS the correct formula to use is

MSRE = (1/p) y^T P_m² y

where

P_m = I_p − Σ_{j=1}^{m} h̃_j h̃_j^T / (λ + h̃_j^T h̃_j)    (4.3)
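The double role of P_m, giving both the penalized energy y^T P_m y and the residual sum of squares y^T P_m² y, can be checked numerically (a small self-contained sketch with random data; not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, lam = 30, 5, 0.1
H = rng.normal(size=(p, m))     # design matrix of a small subset model
y = rng.normal(size=p)

A = H.T @ H + lam * np.eye(m)
w = np.linalg.solve(A, H.T @ y)             # regularized (ridge) weights
e = y - H @ w                               # residuals
P = np.eye(p) - H @ np.linalg.solve(A, H.T)

E_direct = e @ e + lam * (w @ w)            # penalized energy e'e + lam w'w
E_proj = y @ P @ y                          # equals y' P y
msre_direct = (e @ e) / p                   # mean square residual error
msre_proj = y @ P @ P @ y / p               # equals y' P^2 y / p
```

Both pairs agree to machine precision, which is why, when λ > 0, y^T P_m y cannot be used directly as an unpenalized error estimate and y^T P_m² y must be used instead.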
Other criteria for choosing subset size are described in Hocking (1983). One general method is cross-validation, which has a number of variations, two of which are delete-1 cross-validation (Allen 1974; Stone 1974) and generalized cross-validation (Golub et al. 1979). In delete-1 cross-validation (or predicted sum of squares, PRESS) generalization performance is measured by the average (over all training examples) of the squared prediction error when the network is tested on one example and trained on the remainder. If f̂_m^{(i)}(·) is the network output (at selection step m) when trained on all but the ith training example, the average predicted sum of squares is

PRESS_m = (1/p) Σ_{i=1}^{p} [y_i − f̂_m^{(i)}(x_i)]²    (4.4)

Good generalization performance is associated with low values of PRESS so in forward selection the subset size is determined by the point at which this measure reaches a minimum. For nonlinear regression problems with long training times, e.g., multilayer perceptrons trained with backpropagation, delete-1 cross-validation is too expensive to compute, but for linear systems, such as RBF networks, it can be derived analytically (Golub et al. 1979) as

PRESS_m = (1/p) ‖ [diag(P_m)]^{-1} P_m y ‖²

where diag(·) denotes the matrix obtained by zeroing all off-diagonal terms. In ROLS the expansion of P_m in terms of orthogonal vectors 4.3 allows PRESS to be computed very efficiently. P_m y and diag(P_m) can be recursively updated at each step and the product of [diag(P_m)]^{-1} and P_m y is equivalent to a mere element-by-element division of two p-dimensional vectors. Generalized cross-validation, given by

GCV_m = p y^T P_m² y / [trace(P_m)]²    (4.5)

is similar to the delete-1 form but the average over the diagonal elements of P_m makes it even easier to compute than PRESS since the scalar quantities ‖P_m y‖² and trace(P_m) can both be computed recursively. PRESS and GCV tend to choose similar subset sizes and are both much better at avoiding overfit than a fixed threshold or MSRE (as is shown in Section 6). Figure 1d shows the fit to the example training set using PRESS to halt the selection of radial basis function centers.

5 Automatic Estimation of λ
As is shown in the next section, while cross-validation scores such as PRESS and GCV are certainly good criteria for avoiding overfit, using regularization as well further decreases the likelihood of overfit. First, however, a simple reestimation formula is derived that can be integrated into ROLS for letting the data choose a value for the regularization parameter, λ. The formula is based on GCV minimization, like Gu and Wahba (1991), except they used the Newton method. An alternative reestimation formula results from maximizing Bayesian evidence (MacKay 1992). Differentiating 4.5 with respect to λ and setting the result to zero gives a minimum when

y^T P_m (∂P_m/∂λ) y trace(P_m) = y^T P_m² y ∂trace(P_m)/∂λ    (5.1)

However, from 4.3 it can be shown that

y^T P_m (∂P_m/∂λ) y = λ w̃_m^T (H̃_m^T H̃_m + λ I_m)^{-1} w̃_m
This result allows 5.1 to be rearranged into the reestimation formula

λ = y^T P_m² y (∂trace(P_m)/∂λ) / [trace(P_m) w̃_m^T (H̃_m^T H̃_m + λ I_m)^{-1} w̃_m]    (5.2)

where

trace(P_m) = p − Σ_{j=1}^{m} h̃_j^T h̃_j / (λ + h̃_j^T h̃_j),  ∂trace(P_m)/∂λ = Σ_{j=1}^{m} h̃_j^T h̃_j / (λ + h̃_j^T h̃_j)²    (5.3)
A new value of λ can be reestimated after each forward selection step by using the previous value in the right-hand side of 5.2, initializing λ = 0 prior to the first step. Equations 5.3-5.6 are not fully recursive since the value of λ changes after each step. In other words, each term in each summation must be recomputed at each step (instead of just the last term if λ had been fixed). However, the extra computation this involves is proportional only to m (assuming the results of {h̃_j^T h̃_j}_{j=1}^{m} and {(y^T h̃_j)²}_{j=1}^{m} are cached) and is thus negligible compared to the complexity of the whole algorithm (which is proportional to Mp operations per selected center where M, p ≫ m; see Section 3).

6 Simulation Studies
In this section RFS is applied to two simulated learning problems. The first involves a one-dimensional Hermite polynomial and is used to compare ordinary forward selection (using the OLS algorithm and various halting criteria) to RFS (using ROLS, a GCV criterion, and λ reestimation) and then to compare RFS to an alternative method of building RBF networks, the RAN-EKF algorithm (Kadirkamanathan and Niranjan 1993). The second is a multivariate problem with data from a simulated alternating current series circuit and is used to compare RFS with the MARS algorithm (Friedman 1991), which is based on recursive splitting of spline basis functions. In the first problem the target function is the Hermite polynomial,

f(x) = 1.1 (1 − x + 2x²) exp(−x²/2)

from which are taken noisy samples. Figure 2 shows a typical data set and some fits. To properly assess the performance of the different selection methods 1000 training sets were generated, each with different inputs {x_i}_{i=1}^{p} (sampled uniformly from the range [−4, 4]) and errors {e_i}_{i=1}^{p} (sampled from the same normal distribution). Each algorithm used Cauchy basis functions [φ(z) = 1/(1 + z²)] with a radius of r = 1.5 and drew centers from a set of 100 equally spaced points in the interval [−5, 5]. Ordinary forward selection (implemented by OLS) with three different halting criteria, (1) a fixed threshold on the unexplained variance, (2) minimization of MSRE, and (3) minimization of PRESS, was compared with (4) regularized forward selection, implemented by ROLS, using a GCV criterion and λ reestimation. The results are shown in the four plots, one for each algorithm, of Figure 3. The plots display data error (horizontal axis) against fit error (vertical axis) and there is a point in each plot for every training set. The data error is the root mean square error between the data and the true function (i.e., √((1/p) Σ_{i=1}^{p} e_i²)) and is concentrated around the value σ = 0.5, the standard deviation of the normal distribution from which the errors were
sampled.

Figure 2: One of the Hermite polynomial training sets. (a) The true function (solid curve) is sampled at p = 100 random positions in the range [−4, 4] and gaussian noise of standard deviation σ = 0.5 is added. OLS-PRESS performed worse on this training set, as measured by fit error, than on 999 other similar sets (see Fig. 3). (b) The poor fit produced by OLS with a fixed threshold on variance. (c) The OLS-PRESS fit with overfitting near the edges of the sample space. (d) The RFS fit achieved by the ROLS algorithm with a GCV criterion and λ reestimation.

The fit error is the root mean square error between the fit and the true function over a set of 100 equally spaced points in the same range from which the training inputs were sampled. It objectively measures how well a particular algorithm generalizes from a particular training set, but of course, like the data error, is realizable only in synthetic examples such as this where the true target function is known. As can be seen from Figure 3a, unregularized forward selection with a fixed threshold can lead to extremely bad generalization (note the logarithmic scale of the vertical axis) if the training set contains an above
average data error.

Figure 3: Plots of data error (horizontal axis) versus fit error (vertical axis) for 1000 training sets (similar to the one shown in Fig. 2a) and four fitting algorithms: (a) OLS with a fixed threshold on variance, (b) OLS-MSRE, (c) OLS-PRESS, and (d) ROLS-GCV with λ reestimation. Logarithmic scales have been used on the vertical axes in (a) and (b) to embrace the larger dynamic range of the OLS-threshold and OLS-MSRE fit errors. The ringed point in (c), the worst OLS-PRESS fit, corresponds to the training data used in Figure 2.

The algorithm accommodates the extra noise in the training set by selecting extra centers, which cause overfit. Halting the selection of centers after MSRE has stopped decreasing (Fig. 3b) is slightly better and appears to be relatively indifferent to the size of the data error but, like the fixed threshold, produces many very bad fits. In contrast, using PRESS gives much improved performance (Fig. 3c). Similar results are obtained with GCV. Finally, using regularized forward selection and GCV for both halting selection and λ reestimation (Fig. 3d) shows still further improvement over the unregularized algorithm. However, examination of the handful of training sets with fit errors above about 0.4 in Figure 3c revealed that these, the poorest fits, had
resulted from overfitting close to the edges of the area of input space from which the training inputs were drawn (the sample space). The training set used in Figure 2, which is the one upon which OLS-PRESS performed least well and corresponds to the ringed point in Figure 3c, illustrates this. Absence of training points or chance alignments between training points near the extremes of the sample space can cause local overfitting (Fig. 2a and c) since cross-validation is dependent on the presence of data to constrain the fit. As is well known from the Bayesian interpretation of regularization (MacKay 1992), regularization provides an extra a priori constraint (the fitted function should be smooth) which allows the fit to extrapolate gracefully across the edges of the sample space (Fig. 2d). Missing data in interior regions of the sample space would also produce opportunities for local overfitting that regularization could ameliorate (Orr 1993). RFS results on the Hermite polynomial function were compared to those of the RAN-EKF algorithm (Kadirkamanathan and Niranjan 1993), which adapts the positions and sizes of existing basis functions as well as adding new ones. The same number of training examples (p = 40), the same test set (200 uniformly spaced noiseless samples in the range [−4, 4]), and the same type of radial basis functions (gaussians) were used as in their study. The results were averaged over 100 runs for each of several different noise levels in the training data. The randomly chosen 40 training set inputs in each run were used as the centers of the basis functions (of radius r = 1.5) from which each network was built. Figure 4 shows how the average over 100 runs of the number of selected centers and the root mean square error (of the fit over the test set) varies with noise variance. The RAN-EKF data have been read off Figure 4 of Kadirkamanathan and Niranjan (1993).
The number of centers chosen by RFS tends to drop as the noise level increases (the opposite trend to RAN-EKF, see Fig. 4a), and the RFS fit error is consistently smaller than that of RAN-EKF (Fig. 4b). These results suggest that RFS is more accurate than RAN-EKF and better at producing parsimonious networks. RFS was also applied to a problem from Friedman (1991) involving data from a simulated alternating current series circuit where the input vectors come from a four-dimensional space,

x = [R  ω  L  C]^T

with resistance (R ohms), angular frequency (ω radians per second), inductance (L henries), and capacitance (C farads) in the ranges

0 ≤ R ≤ 100,  40π ≤ ω ≤ 560π,  0 ≤ L ≤ 1,  1 × 10^{-6} ≤ C ≤ 11 × 10^{-6}
Figure 4: (a) The number of selected centers and (b) the fit error as a function of noise level (variance) for RFS (averaged over 100 runs) and the RAN-EKF algorithm.

The two dependent variables are impedance (Z ohms) and phase (φ radians), given by

Z(x) = √(R² + (ωL − 1/ωC)²)    (6.1)

φ(x) = tan^{-1}((ωL − 1/ωC) / R)    (6.2)
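A generator for this data set can be sketched as follows (a hypothetical helper, not Friedman's or the paper's code; np.arctan2 replaces tan^{-1} so the phase stays well defined even when R = 0):

```python
import numpy as np

def circuit_data(p, rng):
    """Sample p input vectors [R, omega, L, C] uniformly from the stated
    ranges and compute the impedance Z (6.1) and phase phi (6.2)."""
    R = rng.uniform(0, 100, p)
    omega = rng.uniform(40 * np.pi, 560 * np.pi, p)
    L = rng.uniform(0, 1, p)
    C = rng.uniform(1e-6, 11e-6, p)
    reactance = omega * L - 1.0 / (omega * C)
    Z = np.sqrt(R ** 2 + reactance ** 2)     # impedance, equation 6.1
    phi = np.arctan2(reactance, R)           # phase, equation 6.2
    return np.stack([R, omega, L, C], axis=1), Z, phi

X, Z, phi = circuit_data(200, np.random.default_rng(0))
```

Gaussian noise of the sizes given in the text (σ_Z = 175, σ_φ = 0.44) would then be added to Z and phi to form the training targets.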
Following the procedure in Friedman (1991) as closely as possible, training sets of various sizes (p = 100, 200, 400) were replicated 100 times each.
Table 1: The Average (over 100 Runs) Scaled Mean Square Error of the RFS and MARS Fits to the Impedance (Z) and Phase (φ) Data for Different Training Set Sizes (p).

            p = 100   p = 200   p = 400
Z    RFS    0.45      0.26      0.14
Z    MARS   0.28      0.12      0.07
φ    RFS    0.26      0.20      0.16
φ    MARS   0.24      0.16      0.12
The p random input vectors of each set were drawn randomly from the above ranges and gaussian errors of size oz = 175 and 04 = 0.44 (to give 3/1 signal-to-noiseratios) were added to the corresponding p impedance and phase values. All four components of the input vectors were standardized (to have zero mean and unit variance) before RFS was applied. The pool of selectable centers was set to be the p standardized inputs of the training set and gaussian basis functions were used. The fixed radius was set at r = 3.5, which is about half the maximum distance between any two standardized input points in the four-dimensional space (z2 &). The two sets of data (impedance and phase) were processed separately and the quality of each fit determined by mean square error (MSE) scaled by the variance of the function,
where f(·) is the true function [either Z(·) or φ(·)] with mean f̄ over the randomly chosen test inputs {x_k} (with N = 5000), f̂(·) is the RFS fit trained on data with standardized inputs, and {x̃_k} are standardized versions of the test inputs. Table 1 shows average (over 100 replications) MSE values for the different training set sizes with corresponding figures for the MARS algorithm. The latter were read from Tables 9 and 11 of Friedman (1991) from the rows pertaining to mi = 2 (the best value, for this problem, of the MARS interaction parameter) and the columns labelled ISE.¹ As can be seen from the table, MARS is much more accurate than RFS for the impedance data (by about a factor of 2 in MSE) and slightly better for the phase data. Further investigations are required to fully explain the difference between the two methods.

¹Friedman (1991) claimed to have calculated a Monte Carlo approximation to (scaled) integrated square error (ISE), which is given by equation 6.3 times a factor V (the volume of the unscaled sample space, in this case 1.63). However, as communicated to me privately by Friedman, the factor V was omitted, which means he was really calculating (scaled) mean square error (MSE) as given by equation 6.3.
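The scaled error measure is straightforward to compute once the true and fitted values are in hand (a sketch; the function name is ours):

```python
import numpy as np

def scaled_mse(f_true, f_fit):
    """Mean square error of the fit, scaled by the variance of the
    true function values over the test inputs."""
    f_true = np.asarray(f_true, dtype=float)
    f_fit = np.asarray(f_fit, dtype=float)
    return np.mean((f_true - f_fit)**2) / np.var(f_true)
```

A constant predictor equal to the mean of the true values scores exactly 1, so values well below 1 indicate a useful fit.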
7 Conclusions

Zero-order regularization (with automatic estimation of the regularization parameter) along with either delete-1 or generalized cross-validation can be incorporated into an efficient algorithm for performing regularized forward selection (RFS) of linear regressors, such as the centers of RBF networks. While cross-validation alone is an effective method for limiting the number of selected centers to avoid overfit, the experimental evidence supports the additional use of regularization to further reduce overfit. The extra information about the target function implicit in regularization (namely, that it has some degree of smoothness) improves generalization performance, particularly in areas of the sample space, such as the edges, where training data are sparse. In tests, RFS performed better (in terms of accuracy and network size) than RAN-EKF (an alternative technique for constructing RBF networks) on a simple one-dimensional problem but proved less accurate than MARS (a recursive splitting algorithm using splines) on a more complex multivariate problem.

Acknowledgments
I thank Sheng Chen, Roy Hougen, Warren Sarle, and two anonymous referees for useful comments and references. This work was supported by Grant RR21748 from the U.K. Joint Councils Initiative in Human Computer Interaction and Cognitive Science.

References

Allen, D. M. 1974. The relationship between variable selection and data augmentation and a method for prediction. Technometrics 16(1), 125-127.
Barron, A. R., and Xiao, X. 1991. Discussion of "Multivariate adaptive regression splines" by J. H. Friedman. Ann. Stat. 19, 67-82.
Bishop, C. 1991. Improving the generalization properties of radial basis function neural networks. Neural Comp. 3(4), 579-588.
Breiman, L. 1992. Stacked Regression. Tech. Rep. TR-367, Department of Statistics, University of California, Berkeley.
Broomhead, D. S., and Lowe, D. 1988. Multivariate functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Chen, S., Chng, E. S., and Alkadhimi, K. 1995. Regularised orthogonal least squares algorithm for constructing radial basis function networks. International Journal of Control, submitted.
Chen, S., Cowan, C. F. N., and Grant, P. M. 1991. Orthogonal least squares learning for radial basis function networks. IEEE Trans. Neural Networks 2(2), 302-309.
Friedman, J. H. 1991. Multivariate adaptive regression splines (with discussion). Ann. Stat. 19, 1-141.
Fritzke, B. 1994. Supervised learning with growing cell structures. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 255-262. Morgan Kaufmann, San Mateo, CA.
Golub, G. H., Heath, M., and Wahba, G. 1979. Generalised cross-validation as a method for choosing a good ridge parameter. Technometrics 21(2), 215-223.
Gu, C., and Wahba, G. 1991. Minimising GCV/GML scores with multiple smoothing parameters via the Newton method. SIAM J. Sci. Stat. Comp. 12(2), 383-398.
Hocking, R. R. 1983. Developments in linear regression methodology: 1959-1982 (with discussion). Technometrics 25, 219-249.
Hoerl, A. E., and Kennard, R. W. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(3), 55-67.
Kadirkamanathan, V., and Niranjan, M. 1993. A function estimation approach to sequential learning with neural networks. Neural Comp. 5(6), 954-975.
MacKay, D. J. C. 1992. Bayesian interpolation. Neural Comp. 4(3), 415-447.
Moody, J., and Darken, C. J. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1(2), 281-294.
Orr, M. J. L. 1993. Regularised centre recruitment in radial basis function networks. Research Paper 59, Centre for Cognitive Science, Edinburgh University.
Park, J., and Sandberg, I. W. 1991. Universal approximation using radial-basis-function networks. Neural Comp. 3(2), 246-257.
Platt, J. 1991. A resource-allocating network for function interpolation. Neural Comp. 3(2), 213-225.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. 1992. Numerical Recipes in C, 2nd ed. Cambridge University Press, Cambridge, UK.
Rawlings, J. O. 1988. Applied Regression Analysis. Wadsworth & Brooks/Cole, Pacific Grove, CA.
Stone, M. 1974. Cross-validation choice and the assessment of statistical predictions. J. R. Stat. Soc. (B) 36, 111-147.
Tikhonov, A. N., and Arsenin, V. Y. 1977. Solutions of Ill-Posed Problems. Winston, Washington.
Received March 14, 1994; accepted September 13, 1994
Communicated by Richard Lippmann
Bootstrapping Confidence Intervals for Clinical Input Variable Effects in a Network Trained to Identify the Presence of Acute Myocardial Infarction

William G. Baxt* Department of Emergency Medicine and Medicine, University of California, San Diego Medical Center, San Diego, CA 92093 USA

Halbert White Department of Economics and Institute for Neural Computation, University of California, San Diego, San Diego, CA 92093 USA
1 Introduction
*Present address: Department of Emergency Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104 USA.

Neural Computation 7, 624-638 (1995)

The artificial neural network has been successfully applied to a broad range of clinical settings (Widrow and Hoff 1960; Rumelhart et al. 1986; McClelland et al. 1988; Weigend et al. 1990; Hudson et al. 1988; Smith et al. 1988; Saito and Nakano 1988; Kaufman et al. 1990; Hiraiwa et al. 1990; Cios et al. 1990; Marconi et al. 1989; Eberhard et al. 1991; Mulsant and Servan-Schreiber 1988; Bounds et al. 1990; Yoon et al. 1989). Such a network has been adapted for use as an aid to the clinical diagnosis of acute myocardial infarction (heart attack) (Baxt 1990, 1991, 1992a; Harrison et al. 1991). Both initial retrospective and subsequent prospective studies have revealed that this network performed more accurately than either physicians or other electronic data processing technologies (Baxt 1990, 1991; Goldman et al. 1988). Since nonlinear artificial networks are known to be capable of identifying relationships between input data that are not apparent to human analysis (Weigend et al. 1990), one hope has been that the network could be utilized to identify relationships in clinical data that have not been revealed by previous study. The inherent problem in this hope has been the difficulty of identifying how artificial neural networks derive their output. One indirect way that this can be approached is by the stepwise perturbation of isolated individual input variables across a large number of patterns, coupled with an analysis of the effect this has on network output. Prior application of this analysis to the artificial neural network trained to identify the presence of acute myocardial infarction revealed that one could gain a
© 1995 Massachusetts Institute of Technology
general impression about which clinical variables have the greatest effect on network output (diagnosis) (Baxt 1992b). This process revealed that the network relied on the electrocardiographic variables that had in the past been shown to be predictive of myocardial infarction. Surprisingly, however, the network also used variables that had not been shown to be highly specific for the presence of myocardial infarction. Although these findings were interesting, these results gave only an impression of effects because there was no way to tell if the observed effects were a true reflection of the actual relationships or a result of random sampling variation. The work presented here was undertaken to develop a statistical approach to accomplish this.
2 Methods
To assess the effects of sampling variation in the patient population on the trained weights of an artificial neural network, and in consequence on the effects attributed by the network to the predictor variables, we use a resampling method known as the bootstrap (Efron 1982). The basic idea underlying the bootstrap is that one may investigate the sampling variability in a statistic of interest (e.g., the effect of a given predictor variable) by computing that statistic repeatedly from randomly drawn samples having the same probability distribution as the sample at hand ("resampling"). One may then observe the distribution of the statistic of interest over the resampling experiments. Under appropriate conditions, the distribution of the resampled statistics corresponds to the sampling distribution of the statistic of interest (Giné and Zinn 1990). In the present context, interest attaches to the average effect on the conditional probability of myocardial infarction associated with a perturbation in each of the underlying clinical predictor variables. Let f_0(x) denote the true but unknown probability of myocardial infarction, given that a (randomly chosen) patient presents with attributes x (a vector consisting of numerical indicators of the 19 input variables listed in Table 1). Mathematically, we represent this as f_0(x) = P[T = 1 | X = x], where for a randomly chosen individual, T is the target variable equal to 1 if heart attack, 0 otherwise, and X is the input vector. Let Δ_i f_0(x) represent the change ("delta") in conditional probability associated with perturbing the ith component of x (attribute) in a prescribed manner, e.g., changing the T wave inversion of the patient from present to absent. (By convention, a change of sign is made when changes are made in the opposite direction, as from absent to present.) There are two related quantities that may serve as the focus of our attention.
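For a binary attribute, the delta and its sign convention can be made concrete with a short sketch, a trained model standing in for the unknown probability function (the helper name is ours, and we adopt the absent → present direction as positive here; a sketch, not the paper's code):

```python
import numpy as np

def mean_delta(model, X, i):
    """Average change in predicted infarction probability when binary
    attribute i is flipped, taking absent -> present as the positive
    direction and reversing the sign for present -> absent flips."""
    X_pert = X.copy()
    X_pert[:, i] = 1.0 - X_pert[:, i]           # flip attribute i
    sign = np.where(X[:, i] == 0.0, 1.0, -1.0)  # sign convention by direction
    return float(np.mean(sign * (model(X_pert) - model(X))))
```

With a toy linear "probability" model whose coefficient on attribute 0 is 0.3, every pattern contributes +0.3 under this convention, whichever way the flip goes.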
The first is the average effect of perturbing the attribute over a given population, represented by

Δ̄_i f_0 = ∫ Δ_i f_0(x) dP(x)
where P is the population distribution over attributes. The second is the average effect of perturbing the attribute over the sample available to us, denoted
Δ̂_i f_0 = n⁻¹ Σ_{p=1}^n Δ_i f_0(x_p)
where x_p is the observed input for the pth patient, p = 1, …, n. When Δ̄_i f_0 is zero, the attribute makes no contribution on average in the population to the prediction of myocardial infarction; when Δ̂_i f_0 is zero, the attribute makes no contribution on average in the sample available to us. When these quantities are positive or negative, the attribute makes a corresponding positive or negative impact on the probability of infarction on average, over the population or sample, respectively. Because the population that generates our input patterns is not necessarily representative of the population of potential emergency room patients at large, and because we wish to be conservative in the inferences we draw, we focus our attention on Δ̂_i f_0, which we call the "true sample mean delta." By clearly drawing inferences on the input patterns available to us, there is less possibility for misinterpreting our conclusions as somehow applying to the population of all emergency room patients. The immediate difficulty in observing Δ̂_i f_0 is that the probability function f_0 is unknown. Nevertheless, we may approximate f_0, quite well in principle, by the output function of a neural network trained to diagnose infarction, as in previous work (Baxt 1991). We denote this trained network output function as f̂. The "network sample mean delta," denoted
Δ̂_i f̂ = n⁻¹ Σ_{p=1}^n Δ_i f̂(x_p)
can be computed from our data, and provides an informative estimate of Δ̂_i f_0. Nevertheless, because f̂ is an estimate, it is subject to sampling variation. Effects that are truly zero could appear to be nonzero and vice versa as a result. The bootstrap method mentioned above can be used to assess this sampling variation: one draws a large number N of pseudosamples of size n independently of the original sample, and computes statistics from each such pseudosample. The distribution of the statistics from the pseudosamples can then be used to draw conclusions about the distribution of the original statistic. There are two different ways that the pseudosamples can be drawn for the bootstrap analysis. The first method, pairs sampling, entails drawing pseudosamples of n patterns with replacement from the original sample. The second method, residual sampling, involves using the input patterns from the original sample, but perturbing the associated targets in such a way that the probabilistic relation between target and inputs is maintained, but the perturbation is
independent of the inputs and at the same time typical of the random variation in network errors found in the original data. The first method is computationally straightforward and provides an unconditional bootstrap distribution for the statistics of interest, i.e., a distribution that does not take into account the input patterns actually observed, but only the underlying distribution that gave rise to them, so that Δ̄_i f_0 is estimated. The second method involves a little more computation (to generate proper pseudotargets) and provides a conditional bootstrap distribution, i.e., a distribution that takes into account the input patterns actually observed, so that Δ̂_i f_0 is estimated. The bootstrap residual sampling approach is appropriate here, because it forces us to draw inferences based on the sample available to us, making it completely clear that our results do not pertain to a general population, such as the population of emergency room patients at large. We wish to make it as difficult as possible for ourselves or others to indulge in overstating the generality and applicability of our results. Thus, let Δ_i f̂^(j)(x_p), p = 1, …, n, represent the individual (i) delta values for the jth pseudosample, j = 1, …, N. A resampled estimate of Δ̂_i f̂ (hence Δ̂_i f_0) is the pseudosample mean delta

Δ̂_i f̂^(j) = n⁻¹ Σ_{p=1}^n Δ_i f̂^(j)(x_p)
It turns out that the distribution of Δ̂_i f̂ around Δ̂_i f_0 is the same as that of Δ̂_i f̂^(j) around Δ̂_i f̂. We can observe the latter by resampling, and from this assess the probability that Δ̂_i f̂ is nonzero by chance. We can also construct a confidence interval for Δ̂_i f_0. The first step in implementing our procedure is to train a two hidden layer network with logistic output squasher on 706 patients 18 years or older who presented to the emergency department with anterior chest pain, as reported previously (Baxt 1991). The inputs are x_p, and the targets are T_p = 1 if infarction, 0 otherwise, p = 1, …, n, n = 706. The weights of this trained network can then be used to compute network output f̂ and network sample mean delta Δ̂_i f̂ by perturbing the ith input (e.g., changing T wave inversion from present to absent and vice versa) and averaging the change in output over the sample population. We use 10 hidden units in the first layer and 10 hidden units in the second layer. These choices for the number of hidden units and the choice of two hidden layers are made to ensure that it is at least plausible that our network is capable of approximating whatever might be the true relation between targets and inputs (i.e., f_0) to a relatively high degree of precision. We train using least-squares-based backpropagation, and stop training when performance on an independent test set is optimized. An appealing alternative would have been to use a cost function based on cross-entropy as in Rumelhart et al. (1994); however, we stick to standard backpropagation to maximize comparability to our previous study (Baxt
1991), and because software for least-squares-based backpropagation is readily available to us and to others. The validity of our approach is unaffected by our choice of cost function for training. In order to generate the quantities Δ_i f̂^(j) needed for the next step of our procedure, we proceed just as we did to obtain Δ̂_i f̂, but instead of using the original training sample, we use a "resample" of pseudo-observations of input-target pairs (X_p, T*_p), p = 1, …, n. Note that the inputs are precisely those of the original dataset, but that the targets T*_p are different for each resample. The inputs are kept the same so that our conclusions may be interpreted as being conditional on the population of symptoms present in our sample; we again stress that we wish to be conservative in this regard in drawing our conclusions. To create the targets T*_p represents a challenge: we must randomly change the outcome (infarction, noninfarction) associated with input X_p without changing the systematic part of the probabilistic relationship between T_p and X_p embodied by the conditional probability f̂(X_p), so that when training is performed using the pseudosample, the same relationship can be learned. To see how this challenge can be met, consider that for a uniform random variable U independent of X we have
P[U < z | X] = P[U < z] = z,  for z ∈ [0, 1]
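Both resampling schemes can be sketched in a few lines. In the residual scheme, the uniform-variable fact above yields pseudotargets with P[T* = 1 | x_p] = f̂(x_p) (a sketch with our own function names, not the authors' code):

```python
import numpy as np

def pairs_pseudosample(X, T, rng):
    """Pairs sampling: draw n (input, target) pairs with replacement."""
    idx = rng.integers(0, len(T), size=len(T))
    return X[idx], T[idx]

def residual_pseudotargets(f_hat, rng):
    """Residual sampling: keep the original inputs, and draw pseudotargets
    T* = 1{U < f_hat(x_p)} with U ~ Uniform[0, 1], so that
    P[T* = 1 | x_p] = f_hat(x_p)."""
    U = rng.uniform(size=len(f_hat))
    return (U < f_hat).astype(int)
```

Retraining the network on each pseudosample and recomputing the deltas then yields the bootstrap distribution described in the text.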
Yaser S. Abu-Mostafa
When a new type of hint is identified in a given application, it should also be expressed in terms of virtual examples. The resulting error measure E_m will represent the hint to the learning process.

3.3 Duplicate Examples. Duplicate examples are perhaps the easiest way to use certain types of hints, most notably invariance hints. If we start with a set of training examples from the target function f
[x_1, f(x_1)], [x_2, f(x_2)], …, [x_N, f(x_N)]
and then assert that f is invariant under some transformation of x into x', it follows that we also know the value of f on x'_1, x'_2, …, x'_N. In effect, we have a duplicate set of training examples

[x'_1, f(x'_1)], [x'_2, f(x'_2)], …, [x'_N, f(x'_N)]

where f(x'_n) = f(x_n), that can be used along with the original set. For instance, duplicate examples in the form of new 2D views of a 3D object are generated in Poggio and Vetter (1992) based on existing prototypes. A theoretical analysis of duplicate examples versus virtual examples is given in Leen (1995). When duplicate examples are used to represent a hint, the rest of the learning machinery is already in place. The training error E_0 can still be used as the objective function, with the augmented training set now consisting of the original examples and the duplicate examples. In many cases, the duplication process "inherits" the probability distribution that was used to generate the original examples, which is usually the target distribution P. A balance, of sorts, is automatically maintained between the hint and training examples since both are learned through the same set of examples. The same software for learning from examples can be used unaltered. On the other hand, there are two main advantages to virtual examples over duplicate examples. To pinpoint these advantages, let us consider the original training set
[x_1, f(x_1)], [x_2, f(x_2)], …, [x_N, f(x_N)]

[where the error on example [x_n, f(x_n)] is given by [g(x_n) − f(x_n)]² as usual] together with the following restricted set of virtual examples

(x_1, x'_1), (x_2, x'_2), …, (x_N, x'_N)

where the error on example (x_n, x'_n) is given by [g(x_n) − g(x'_n)]². Clearly, if all errors are zero, this will be equivalent to the case of duplicate examples. However, when the errors are nonzero, there is a difference. In the case of duplicate examples, there is a built-in linkage between the training error and the hint error; they cannot be controlled independently. On
Hints
the other hand, if we separate the two errors by using virtual examples, we have independent control over how much to emphasize the training error versus the hint error. This is the first advantage of virtual examples. Maintaining independent control over the errors on the training set and on different hints is essential if the errors are to go down in some prescribed balance, as we will discuss in Section 4. Notice also that when we use duplicate examples, we are in effect using a fixed set of N virtual examples to represent the hint. The fixed set will result in a generalization error on the hint the same way that representing f by a fixed set of examples results in the usual generalization error. [In terms of the VC dimensions of Section 2, VC(G; H) plays the role of VC(G) for the hint.] This leads to the second advantage of using virtual examples: They are unlimited in number. We can generate a fresh virtual example every time we need one, since we do not need to know the value of the target function. Thus, there is no generalization error on the hint when we use virtual examples.
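The distinction can be made concrete with two error measures over the same invariance transformation (a sketch; the function names are ours, squared error as in the text):

```python
import numpy as np

def duplicate_hint_error(g, x, f_x, transform):
    """Duplicate examples: error of g on [x', f(x')], using f(x') = f(x),
    so target values are required and the training error is linked in."""
    return float(np.mean((g(transform(x)) - f_x)**2))

def virtual_hint_error(g, x, transform):
    """Virtual examples: error [g(x) - g(x')]^2 -- no target values needed,
    so fresh examples can be generated without limit."""
    return float(np.mean((g(x) - g(transform(x)))**2))
```

For an evenness hint (x' = −x), for example, a hypothesis that is itself even has zero virtual-example error regardless of how far it is from f, while its duplicate-example error still depends on the targets.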
4 Objective Functions
When hints are available in a learning situation, the objective function to be optimized by the learning algorithm is no longer confined to E_0 (the error on the training examples of f). This section addresses how to combine E_0 with the different hints to create a new objective function.
4.1 Adaptive Minimization. If the learning algorithm had complete information about f, it would search for a hypothesis g for which E(g, f) = 0. However, f being unknown means that the point E = 0 cannot be directly identified. The most any learning algorithm can do given the hints H_0, H_1, …, H_M is to reach a hypothesis g for which all the error measures E_0, E_1, …, E_M are zeros (assuming that overfitting is not an issue). If that point is reached, regardless of how it is reached, the job is done. However, it is seldom the case that we can reach the zero-error point, because either (1) it does not exist (i.e., no hypothesis can satisfy all the hints simultaneously, which implies that no hypothesis can replicate f exactly), or (2) it is difficult to reach (i.e., the computing resources do not allow us to exhaustively search the space of hypotheses looking for this point). In either case, we will have to settle for a point where the E_m's are "as small as possible." How small should each E_m be? A balance has to be struck, otherwise some E_m's may become very small at the expense of the others. This situation would mean that some hints are overlearned while the others are underlearned. Knowing that we are really trying to minimize E, and that the E_m's are merely a vehicle to this end, the criterion for balancing
Yaser S. Abu-Mostafa
658
the E_m's should be based on how small E is likely to be. This is the idea behind Adaptive Minimization. Given E_0, E_1, ..., E_M, we form an estimate Ê of the actual error E

Ê = Ê(E_0, E_1, E_2, ..., E_M)

and use it as the objective function to be minimized. This estimate of E becomes the common thread that balances between the errors on the different hints. The formula for Ê expresses the impact of each E_m on the ultimate performance. Such a formula is of theoretical interest in its own right. Ê is minimized by the learning algorithm. For instance, if backpropagation is used, the components of the gradient will be

∂Ê/∂w_i = Σ_{m=0}^{M} (∂Ê/∂E_m)(∂E_m/∂w_i)

which means that regular backpropagation can be used on each of the hints, with ∂Ê/∂E_m used as the "weight" for hint H_m. Equivalently, a batch of examples from the different hints would be used with a number of examples from H_m in proportion to ∂Ê/∂E_m. This idea is discussed further when we talk about schedules in Section 4.3.

4.2 Simple Estimate. In Cataltepe and Abu-Mostafa (1994), a simple formula for Ê(E_0, ..., E_M) is derived and tested for the case of a binary target function f : R^n → {0, 1} that has two invariance hints. The learning model is a sigmoidal neural network, g : R^n → [0, 1]. The difference between f and g is viewed as a "noise" function n:

n(x) = 1 - g(x)  if f(x) = 1
n(x) = g(x)      if f(x) = 0
Let μ and σ² be the mean and variance of n(x). In terms of μ and σ², the error measure E(g, f) is given by
E = ℰ{[f(x) - g(x)]²} = ℰ[n²(x)] = μ² + σ²
Similarly, the error on each of the two invariance hints is given by

E_m = ℰ{[g(x) - g(x′)]²} = ℰ{[n(x) - n(x′)]²} = 2σ²

assuming that n(x) and n(x′) are independent random variables. Given the training examples, one can obtain a direct estimate [μ] of μ
Hints
659
Figure 7: The error estimate in the case of overfitting (training set size = 20).

and, combining this estimate with E_0, E_1, and E_2, one can get an estimate of σ²

[σ²] = (2(E_0 - [μ]²) + E_1 + E_2) / 6

Finally, we get an estimate of E, based solely on the training examples of f and the virtual examples of the hints, by combining [μ] and [σ²]

Ê = [μ]² + [σ²]
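As a minimal sketch of this estimate (function and variable names are ours; the pooling weights 2, 1, 1 over a total of 6 follow from E_0 − [μ]² estimating σ² once and each hint error estimating 2σ²):

```python
def error_estimate(E0, E1, E2, mu_hat):
    """Estimate of the test error E from the training error E0, the
    two invariance-hint errors E1, E2, and a direct estimate mu_hat
    of the noise mean mu."""
    # Pool three estimates of 2*sigma^2: 2*(E0 - mu^2), E1, and E2.
    sigma2_hat = (2.0 * (E0 - mu_hat ** 2) + E1 + E2) / 6.0
    # E = mu^2 + sigma^2
    return mu_hat ** 2 + sigma2_hat
```

For a consistency check: with μ = 0.2 and σ² = 0.01, exact values E_0 = 0.05 and E_1 = E_2 = 0.02 recover Ê = 0.05.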
Figures 7 and 8 illustrate the performance of this estimate in two cases, one where overfitting occurs and the other where it does not. The figures show the pass number of regular backpropagation versus the training error (E_0), test error (E), and the estimate of the test error (Ê). Notice that Ê is closer to the actual E than E_0 is (E_0 is the de facto estimate of E in the absence of hints). Ê is roughly monotonic in E and, as seen in Figure 7, exhibits the same increase due to overfitting that E exhibits. The significant difference between Ê and E is in the form of (almost) a constant. However, constants do not affect descent operations. Thus, Ê provides a better objective function than E_0, even with the simplifying assumptions made.
Figure 8: The error estimate without overfitting.

4.3 Schedules. The question of objective functions can be posed as a scheduling question: If we are simultaneously minimizing the interrelated quantities E_0, ..., E_M, how do we schedule which quantity to minimize at which step? To start with, let us explore how simultaneous minimization of a number of quantities is done. Perhaps the most common method is that of penalty functions (Wismer and Chattergy 1978). To minimize E_0, E_1, ..., E_M, we minimize the penalty function
Σ_{m=0}^{M} α_m E_m

where each α_m is a nonnegative number that may be constant (exact penalty function) or variable (sequential penalty function). Any descent method can be employed to minimize the penalty function once the α_m's are selected. The α_m's are weights that reflect the relative emphasis or "importance" of the corresponding E_m's. The choice of the weights is usually crucial to the quality of the solution. In the case of hints, even if the α_m's are determined, we still do not have the explicit values of the E_m's (recall that E_m is the expected value of the error e_m on a virtual example of the hint). Instead, we will estimate
E_m by drawing several examples and averaging their error. Suppose that we draw N_m examples of H_m. The estimate for E_m would then be

Ê_m = (1/N_m) Σ_{n=1}^{N_m} e_m^(n)

where e_m^(n) is the error on the nth example. Consider a batch of examples consisting of N_0 examples of H_0, N_1 examples of H_1, ..., and N_M examples of H_M. The total error of this batch is

Σ_{m=0}^{M} Σ_{n=1}^{N_m} e_m^(n)

If we take N_m ∝ α_m, this total error will be a proportional estimate of the penalty function

Σ_{m=0}^{M} α_m E_m
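The translation of weights into per-hint example counts (N_m ∝ α_m) can be sketched as follows; the largest-remainder rounding is an implementation choice, not prescribed by the text:

```python
import numpy as np

def batch_counts(alphas, batch_size):
    """Translate penalty weights alpha_m into per-hint example counts
    N_m proportional to alpha_m, rounded so the counts sum exactly to
    batch_size (largest-remainder rounding)."""
    alphas = np.asarray(alphas, dtype=float)
    ideal = batch_size * alphas / alphas.sum()
    counts = np.floor(ideal).astype(int)
    remainder = batch_size - counts.sum()
    # Give the leftover slots to the hints with the largest fractional parts.
    order = np.argsort(ideal - counts)[::-1]
    counts[order[:remainder]] += 1
    return counts
```

For instance, `batch_counts([0.5, 0.25, 0.25], 8)` yields four examples of H_0 and two each of H_1 and H_2.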
In effect, we translated the weights into a schedule, where different hints are emphasized, not by magnifying their error, but by representing them with more examples. We make a distinction between a fixed schedule, where the number of examples of each hint in the batch is predetermined (whether time-invariant or time-varying, deterministic or stochastic), and an adaptive schedule, where run-time determination of the number of examples is allowed (how many examples of which hint go into the next batch depends on how things have gone so far). For instance, constant α_m's correspond to a fixed schedule. Even if the α_m's are variable but predetermined, we still get a fixed (time-varying) schedule. When the α_m's are variable and adaptive, the resulting schedule is adaptive. We can use uniform batches that consist of N examples of one hint at a time, or, more generally, mixed batches where examples of different hints are allowed within the same batch. For instance, as we discussed before, Adaptive Minimization can be implemented using backpropagation on a mixed batch where hint H_m is represented by a number of examples proportional to ∂Ê/∂E_m. If we are using a linear descent method with a small learning rate, a schedule that uses mixed batches is equivalent to a schedule that alternates between uniform batches (with frequency equal to the frequency of examples in the mixed batch). Figure 9 shows a fixed schedule that alternates between uniform batches, giving the examples of the function (E_0) twice the emphasis of the other hints (E_1 and E_2). The schedule defines a turn for each hint to be learned. If we are using a nonlinear descent method, it is generally more difficult to ascertain a direct translation from mixed batches to uniform batches. The implementation of a given schedule (expressed in terms of uniform batches for simplicity) goes as follows: (1) the algorithm decides
Figure 9: A fixed schedule for learning from hints.
which hint (which m for m = 0, 1, ..., M) to work on next, according to some criterion; (2) the algorithm then requests a batch of examples of this hint; (3) it performs its descent on this batch; and (4) when it is done, it goes back to step (1). For fixed schedules, the criterion for selecting the hint can be evaluated ahead of time, while for adaptive schedules, the criterion depends on what happens as the algorithm runs. Here are some simple schedules.
Simple Rotation: This is the simplest possible schedule that tries to balance between the hints. It is a fixed schedule that rotates between H_0, H_1, ..., H_M. Thus, at step k, a batch of N examples of H_m is processed, where m = k mod (M + 1).

Weighted Rotation: This is the next step in fixed schedules that tries to give different emphasis to different E_m's. The schedule rotates between the hints, visiting H_m with frequency α_m. The choice of the α_m's can achieve balance by emphasizing the hints that are more important or harder to learn. The schedule of Figure 9 is a weighted rotation with α_0 = 0.5 and α_1 = α_2 = 0.25.
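The two fixed schedules can be sketched as follows; the deficit-based selection in the weighted version is one of several valid ways to realize the stated visit frequencies:

```python
def simple_rotation(k, M):
    """Fixed schedule: at step k, process hint m = k mod (M + 1)."""
    return k % (M + 1)

def weighted_rotation_order(alphas, length):
    """Deterministic weighted rotation: visit hint m with frequency
    close to alpha_m by always serving the hint whose visit count lags
    its target share the most (an illustrative tie-breaking rule)."""
    visits = [0] * len(alphas)
    total = sum(alphas)
    order = []
    for k in range(1, length + 1):
        deficit = lambda i: k * alphas[i] / total - visits[i]
        m = max(range(len(alphas)), key=deficit)
        visits[m] += 1
        order.append(m)
    return order
```

With `alphas = [0.5, 0.25, 0.25]`, a length-4 order visits H_0 twice and H_1, H_2 once each, matching the emphasis of the Figure 9 schedule.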
Maximum Error: This is the simplest adaptive schedule that tries to achieve the same type of balance as simple rotation. At each step k, the algorithm processes the hint with the largest error E_m. The algorithm uses estimates of the E_m's to make its selection.

Maximum Weighted Error: This is the adaptive counterpart to weighted rotation. It selects the hint with the largest value of α_m E_m. The choice of the α_m's can achieve balance by making up for disparities between the numerical ranges of the E_m's. Again, the algorithm uses estimates of the E_m's.

Adaptive schedules attempt to answer the question: Given a set of values for the E_m's, which hint is the most underlearned? The above schedules answer the question by comparing the individual E_m's. Adaptive Minimization answers the question by relating the E_m's to the actual error E. Here is the uniform-batch version of Adaptive Minimization:
Adaptive Minimization Schedule: Given E_0, E_1, ..., E_M, make M + 1 estimates of E, each based on all but one of the hints:

Ê(0, E_1, E_2, ..., E_M)
Ê(E_0, 0, E_2, ..., E_M)
Ê(E_0, E_1, 0, ..., E_M)
...
Ê(E_0, E_1, E_2, ..., 0)

and choose the hint for which the corresponding estimate is the smallest. The idea is that if the absence of E_m resulted in the most optimistic view of E, then E_m carries the worst news and, hence, the mth hint requires immediate attention.

5 Application
In this section, we describe the details of the application of hints to forecasting in the FX markets (Abu-Mostafa 1995). We start by discussing the very noisy nature of financial data that makes this type of application particularly suited for the use of hints. A financial market can be viewed as a system that takes in a lot of information (fundamentals, news events, rumors, who bought what when, etc.) and produces an output f (say up/down price movement for simplicity). A model, e.g., a neural network, attempts to simulate the market (Fig. 10), but it takes an input x, which is only a small subset of the information. The "other information" cannot be modeled and plays the role of noise as far as x is concerned. The network cannot determine the target output f based on x alone, so it approximates it with its output g. It is typical that this approximation will be correct only slightly more than half the time.
Figure 10: Illustration of the nature of noise in financial markets.

What makes us consider x "very noisy" is that g and f agree only 1/2 + ε of the time (50% performance range). This is in contrast to the typical pattern recognition application, such as optical character recognition, where g and f agree 1 − ε of the time (100% performance range). It is not the poor performance per se that poses a problem in the 50% range, but rather the additional difficulty of learning in this range. Here is why. In the 50% range, a performance of 1/2 + ε is good, while a performance of 1/2 − ε is disastrous. During learning, we need to distinguish between good and bad hypotheses based on a limited set of N examples. The problem with the 50% range is that the number of bad hypotheses that look good on N points is huge. This is in contrast to the 100% range where a good performance is as high as 1 − ε. The number of bad hypotheses that look good here is limited. Therefore, one can have much more confidence in a hypothesis that was learned in the 100% range than one learned in the 50% range. It is not uncommon to see a random trading policy making good money for a few weeks, but it is very unlikely that a random character recognition system will read a paragraph correctly. Of course this problem would diminish if we used a very large set of examples, because the law of large numbers would make it less and less
Figure 11: Illustration of the symmetry hint in FX markets.
likely that g and f can agree 1/2 + ε of the time just by "coincidence." However, financial data have the other problem of nonstationarity. Because of the continuous evolution in the markets, old data may represent patterns of behavior that no longer hold. Thus, the relevant data for training purposes are limited to fairly recent times. Put together, noise and nonstationarity mean that the training data will not contain enough information for the network to learn the function. More information is needed, and hints can be the means of providing it. Even simple hints can result in significant improvement in the learning performance. Figure 1 showed the learning performance for FX trading with and without the symmetry hint. Figure 11 illustrates this hint as it applies to the U.S. Dollar versus the German Mark. The hint asserts that if a pattern in the price history implies a certain move in the market, then this implication holds whether you are looking at the market from the U.S. Dollar viewpoint or the German Mark viewpoint. Formally, in terms of normalized prices, the hint translates to invariance under inversion of these prices. Notice that the hint says nothing about whether the market should go up or down. It requires only that the prediction be consistent from both sides of this symmetric market. Is the symmetry hint valid? The ultimate test for this is how the learning performance is affected by the introduction of the hint. The formulation of hints is an art. We use our experience, common sense, and analysis of the market to come up with a list of what we believe to be valid properties of this market. We then represent these hints by virtual examples, and proceed to incorporate them in the objective function. The improvement in performance will only be as good as the hints we put in. It is also possible to use soft hints (hints that are less reliable), taking into consideration how much confidence we have in them.
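One way to generate virtual examples of this symmetry hint is sketched below; the input encoding (a window of normalized prices, with the other currency's viewpoint obtained by inverting each ratio) is our illustrative assumption, not the paper's exact normalization:

```python
import numpy as np

rng = np.random.default_rng(1)

def symmetry_virtual_example(window=21):
    """Fresh virtual example of the FX symmetry hint.

    Illustrative encoding: the input is a window of normalized prices
    (price ratios near 1), and viewing the market from the other
    currency inverts each ratio.
    """
    x = np.exp(rng.normal(0.0, 0.01, window))
    return x, 1.0 / x

def symmetry_hint_error(g, x, x_inv):
    # The hint is invariance under inversion of the normalized prices:
    # the prediction must be consistent from both sides of the market,
    # so any gap between g(x) and g(x_inv) is penalized.
    return (g(x) - g(x_inv)) ** 2

# A hypothesis depending only on |log x| is exactly invariant:
g_inv = lambda x: float(np.sum(np.abs(np.log(x))))
```

As with any invariance hint, no knowledge of the correct market move is needed to produce these pairs, so fresh virtual examples can be drawn indefinitely.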
The two curves in Figure 1 show the annualized percentage returns (cumulative daily, unleveraged, transaction cost included) for a sliding
Figure 12: British Pound performance with and without hint.
1-year test window in the period from April 1988 to November 1990, averaged over the four major FX markets with more than 150 runs per currency. The error bar in the upper left corner is 3 standard deviations long (based on 253 trading days, assuming independence between different runs). The plots establish a statistically significant differential in performance due to the use of hints. This differential holds to varying degrees for the four currencies: the British Pound, the German Mark, the Japanese Yen, and the Swiss Franc (versus the U.S. Dollar), as seen in Figures 12-15. In each market, only the closing prices for the preceding 21 days were used for inputs. The objective function we chose was based on the maximization of the total return on the training set, not the minimization of the mean square error, and we used simple filtering methods on the inputs and outputs of the networks. In each run, the training set consisted of 500 days, and the test was done on the following 253 days. Figures 12-15 show the results of these tests averaged over all the runs. All four currencies show an improved performance when the symmetry hint is used. The statistics of resulting trades are as follows. We are in the market about half the time, each trade takes 4 days on the average, the hit rate (percentage of winning days) is close to 50%, and the annualized
Figure 13: German Mark performance with and without hint.

percentage return without the hint is about 5% and with the hint is about 10%. Notice that having the return as the objective function resulted in a fairly good return even with a modest hit rate. Since the goal of hints is to add information to the training data, the differential in performance is likely to be less dramatic if we start out with more informative training data. Similarly, an additional hint may not have a pronounced effect if we have already used a few hints in the same application. There is a saturation in performance in any market that reflects how well the future can be forecast from the past. (Believers in the efficient market hypothesis (Malkiel 1973) consider this saturation to be at zero performance.) Hints will not make us forecast a market better than whatever that saturation level may be. They will, however, enable learning from examples to approach that level.

6 Summary
The main practical hurdle that faced learning from hints was the fact that hints came in different shapes and forms and could not be easily integrated into the standard learning paradigms. Since the introduction
Figure 14: Japanese Yen performance with and without hint.

of systematic methods for learning from hints 5 years ago, hints have become a regular value-added tool. This paper reviewed the method for using different hints as part of learning from examples. The method does not restrict the learning model, the descent technique, or the use of regularization. In this method, all hints are treated on equal footing, including the examples of the target function. Hints are represented in a canonical way using virtual examples. The performance on the hints is captured by the error measures E_0, E_1, ..., E_M, and the learning algorithm attempts to simultaneously minimize these quantities. This gives rise to the idea of balancing between the different hints in the objective function. The Adaptive Minimization algorithm achieves this balance by relating the E_m's to the test error E. Hints are particularly useful in applications where the information content of the training data is limited. Financial applications are a case in point because of the nonstationarity and the high level of noise in the data. We reviewed the application of hints to forecasting in the four major foreign-exchange markets. The application illustrates how even a simple hint can have a decisive impact on the performance of a real-life system.
Figure 15: Swiss Franc performance with and without hint.
Acknowledgments

I wish to acknowledge the members of the Learning Systems Group at Caltech, Mr. Eric Bax, Ms. Zehra Cataltepe, Mr. Joseph Sill, and Ms. Xubo Song, for many valuable discussions. In particular, Ms. Cataltepe was very helpful throughout this work.

References

Abu-Mostafa, Y. 1989. The Vapnik-Chervonenkis dimension: Information versus complexity in learning. Neural Comp. 1, 312-317.
Abu-Mostafa, Y. 1990. Learning from hints in neural networks. J. Complex. 6, 192-198.
Abu-Mostafa, Y. 1993a. Hints and the VC dimension. Neural Comp. 5, 278-288.
Abu-Mostafa, Y. 1993b. A method for learning from hints. In Advances in Neural Information Processing Systems, S. Hanson et al., eds., Vol. 5, pp. 73-80. Morgan Kaufmann, San Mateo, CA.
Abu-Mostafa, Y. 1995. Financial market applications of learning from hints. In
Neural Networks in the Capital Markets, A. Refenes, ed., pp. 221-232. Wiley, London, UK.
Akaike, H. 1969. Fitting autoregressive models for prediction. Ann. Inst. Stat. Math. 21, 243-247.
Al-Mashouq, K., and Reed, I. 1991. Including hints in training neural networks. Neural Comp. 3, 418-427.
Amaldi, E. 1991. On the complexity of training perceptrons. In Proceedings of the 1991 International Conference on Artificial Neural Networks (ICANN '91), T. Kohonen, K. Makisara, O. Simula, and J. Kangas, eds., pp. 55-60. North Holland, Amsterdam.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36, 929-965.
Cataltepe, Z., and Abu-Mostafa, Y. 1994. Estimating learning performance using hints. In Proceedings of the 1993 Connectionist Models Summer School, M. Mozer et al., eds., pp. 380-386. Erlbaum, Hillsdale, NJ.
Cover, T., and Thomas, J. 1991. Elements of Information Theory. Wiley-Interscience, New York.
Duda, R., and Hart, P. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
Fyfe, W. 1992. Invariance hints and the VC dimension. Ph.D. thesis, Computer Science Department, Caltech (Caltech-CS-TR-92-20).
Hecht-Nielsen, R. 1990. Neurocomputing. Addison-Wesley, Reading, MA.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation, Lecture Notes, Vol. 1. Santa Fe Institute Studies in the Sciences of Complexity.
Hinton, G. 1987. Learning translation invariant recognition in a massively parallel network. Proc. Conf. Parallel Architectures and Languages Europe, 1-13.
Hinton, G., Williams, C., and Revow, M. 1992. Adaptive elastic models for handprinted character recognition. In Advances in Neural Information Processing Systems, J. Moody, S. Hanson, and R. Lippmann, eds., Vol. 4, pp. 512-519. Morgan Kaufmann, San Mateo, CA.
Hu, M. 1962. Visual pattern recognition by moment invariants. IRE Trans. Inform. Theory IT-8, 179-187.
Judd, J. S. 1990.
Neural Network Design and the Complexity of Learning. MIT Press, Cambridge, MA.
Leen, T. 1995. From data distributions to regularization in invariant learning. Neural Comp. (to appear).
Malkiel, B. 1973. A Random Walk Down Wall Street. W. W. Norton, New York.
McClelland, J., and Rumelhart, D. 1988. Explorations in Parallel Distributed Processing. MIT Press, Cambridge, MA.
Minsky, M., and Papert, S. 1988. Perceptrons, expanded edition. MIT Press, Cambridge, MA.
Moody, J. 1992. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Advances in Neural Information Processing Systems, J. Moody, S. Hanson, and R. Lippmann, eds., Vol. 4, pp. 847-854. Morgan Kaufmann, San Mateo, CA.
Moody, J., and Wu, L. 1994. Statistical analysis and forecasting of high frequency
foreign exchange rates. In Proceedings of Neural Networks in the Capital Markets, Y. Abu-Mostafa et al., eds.
Omlin, C., and Giles, C. L. 1992. Training second-order recurrent neural networks using hints. In Machine Learning: Proceedings of the Ninth International Conference, ML-92, D. Sleeman and P. Edwards, eds. Morgan Kaufmann, San Mateo, CA.
Poggio, T., and Vetter, T. 1992. Recognition and structure from one 2D model view: Observations on prototypes, object classes and symmetries. AI Memo No. 1347, Massachusetts Institute of Technology.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, D. Rumelhart et al., eds., Vol. 1, pp. 318-362. MIT Press, Cambridge, MA.
Suddarth, S., and Holden, A. 1991. Symbolic neural systems and the use of hints for developing complex systems. Int. J. Man-Machine Studies 35, 291.
Vapnik, V., and Chervonenkis, A. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob. Appl. 16, 264-280.
Weigend, A., and Rumelhart, D. 1991. Generalization through minimal networks with application to forecasting. In Proceedings INTERFACE '91-Computing Science and Statistics (23rd Symposium), E. Keramidas, ed., pp. 362-370. Interface Foundation of North America.
Weigend, A., Huberman, B., and Rumelhart, D. 1990. Predicting the future: A connectionist approach. Int. J. Neural Syst. 1, 193-209.
Weigend, A., Rumelhart, D., and Huberman, B. 1991. Generalization by weight elimination with application to forecasting. In Advances in Neural Information Processing Systems, R. Lippmann, J. Moody, and D. Touretzky, eds., Vol. 3, pp. 875-882. Morgan Kaufmann, San Mateo, CA.
Wismer, D., and Chattergy, R. 1978. Introduction to Nonlinear Optimization. North Holland, Amsterdam.
Received May 10, 1994; accepted December 20, 1994.
ARTICLE
Communicated by Maxwell Stinchcombe
Topology and Geometry of Single Hidden Layer Network, Least Squares Weight Solutions

Frans M. Coetzee
Virginia L. Stonick
Electrical and Computer Engineering Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3890 USA

In this paper the topological and geometric properties of the weight solutions for multilayer perceptron (MLP) networks under the MSE error criterion are characterized. The characterization is obtained by analyzing a homotopy from linear to nonlinear networks in which the hidden node function is slowly transformed from a linear to the final sigmoidal nonlinearity. Two different geometric perspectives for this optimization process are developed. The generic topology of the nonlinear MLP weight solutions is described and related to the geometric interpretations, error surfaces, and homotopy paths, both analytically and using carefully constructed examples. These results illustrate that although the natural homotopy provides a practically valuable heuristic for training, it suffers from a number of theoretical and practical difficulties. The linear system is a bifurcation point of the homotopy equations, and solution paths are therefore generically discontinuous. Bifurcations and infinite solutions further occur for data sets that are not of measure zero. These results weaken the guarantees on global convergence and exhaustive behavior normally associated with homotopy methods. However, the analyses presented provide a clear understanding of the relationship between linear and nonlinear perceptron networks, and thus a firm foundation for development of more powerful training methods. The geometric perspectives and generic topological results describing the nature of the solutions are further generally applicable to network analysis and algorithm evaluation.

1 Introduction
Linear networks are well understood both qualitatively and quantitatively in terms of projection operators, error surfaces, matrix forms, and manipulations. Theorems that afford a similar level of description and manipulation for all corresponding nonlinear networks do not yet exist. In previous papers we addressed these issues for a single layer perceptron (SLP). We used a natural homotopy to define a globally convergent

Neural Computation 7, 672-705 (1995)
© 1995 Massachusetts Institute of Technology
Weight Solutions for MLP Networks
673
constructive weight optimization method, and an intuitive geometric perspective on the weight optimization process for the nonlinear network (Coetzee and Stonick 1993, 1994a,b). The natural homotopy approach corresponds to changing the node nonlinearity from a linear to a nonlinear sigmoidal function as a free parameter (the homotopy parameter) is varied. Here we extend the analysis of this natural homotopy to multilayer perceptron (MLP) networks with one hidden layer of sigmoidal neurons. Yang and Yu (1993) used the natural homotopy approach as a practical heuristic to obtain improved convergence during training in a multilayer neural network but did not address the theoretical considerations that support application of the approach. Without these theoretical underpinnings, the homotopy approach offers no guarantees on existence of solutions or global convergence, and provides no insight into the optimization process or mapping abilities of neural networks. Here we extend the geometric perspectives developed for the SLP to the MLP homotopy equations and describe the topological nature of the weight solutions. The perspective resulting from this approach indicates that although the natural homotopy is a practically valuable heuristic for training networks, it suffers from a number of theoretical and practical difficulties. Specifically, the linear system forms a bifurcation point of the homotopy equations, and solutions to the initial system are generically points of discontinuity along the solution path. Bifurcations and infinite solutions at intermediate points on the homotopy path can further occur for data sets that are not of measure zero. These results weaken the guarantees on global convergence and exhaustive behavior that are normally associated with homotopy methods. However, the geometric perspectives arising from the homotopy approach provide a clear understanding of the relationship between linear and nonlinear neural networks. 
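The natural homotopy deforms each node function from linear to sigmoidal as the homotopy parameter varies. One simple way to realize such a deformation is a linear blend between the identity and a sigmoid; this parameterization is illustrative only (the paper's exact form is derived in its Section 3):

```python
import numpy as np

def node_function(u, tau):
    """Homotopy node function: the identity at tau = 0 and the tanh
    sigmoid at tau = 1 (a linear blend, chosen here for illustration)."""
    return (1.0 - tau) * u + tau * np.tanh(u)
```

At tau = 0 the network is linear and its least-squares weights are known in closed form; sweeping tau toward 1 gradually turns on the nonlinearity.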
Insight regarding uniqueness of weights for networks trained on finite data sets results. Thus this work complements the work of Sussmann (1992), Albertini and Sontag (1992), and Chen et al. (1993), all of which describe weight uniqueness for infinite training sets, and that by Poston et al. (1991) and Sartori and Antsaklis (1991) on training with more hidden nodes than data samples. The complete results, describing geometric formulations and the generic topological nature of the weight solutions, also are valuable for general algorithm evaluation and analysis. For example, these results provide bounds on data sizes that will generically ensure nonsingularity of the Jacobian of the gradient, a necessary condition for many optimization procedures (e.g., conjugate gradient procedures). This paper is organized as follows. Relevant background on the basic homotopy method is reviewed in Section 2. The natural homotopy used in this paper is defined and equations defining the optimal weight solutions are derived in Section 3. These equations form the basis for the critical geometric interpretations presented in Section 4. Results in
Frans M. Coetzee and Virginia L. Stonick
674
Section 5.1 extend the linear network analysis of Baldi and Hornik (1989) to the general MLP architecture, while the nonlinear MLP is addressed in Section 5.2. The impact of these results on homotopy path following is discussed in Section 5.3. Carefully constructed examples are used in Section 6 to illustrate homotopy path behavior. Final implications of these results on robustness of the natural homotopy for neural networks are discussed in Section 7. The majority of the proofs are relegated to the Appendix to facilitate ease of reading.

2 Homotopy Methods
In this section, basic homotopy methods are briefly described with emphasis on properties and constraints critical to our development. For a more complete introduction, we recommend Garcia and Zangwill (1981), Morgan (1987), or Richter and DeCarlo (1983). Specific results relevant to the application of homotopy to neural networks are discussed in more depth in Coetzee and Stonick (1994a). Homotopy methods provide a constructive way to find the solutions to a set of equations by mapping the known solutions, from a simple initial system, to the desired solution of the unsolved system of equations. Homotopy methods are appropriate for optimization if the optimization problem can be reduced to solving systems of equations. Mathematically, the basic homotopy method is as follows: Given a final set of equations f(x) = 0, f : D ⊂ ℝ^n → ℝ^n, with an unknown solution, an initial system of equations g(x) = 0, g : D ⊂ ℝ^n → ℝ^n, with a known solution is constructed. A homotopy function h : D × T → ℝ^n is defined in terms of an embedded parameter τ ∈ T ⊂ ℝ, such that

h(x, τ) = { g(x)   when τ = 0
          { f(x)   when τ = 1          (2.1)

The objective is to solve the final equations by solving h(x, τ) = 0 numerically for x for increasing values of τ, starting at τ = 0 where the solution is known by construction, and continuing to τ = 1. Intuitively, incrementing τ in small increments yields an efficient numerical solution procedure; the solution for the previous value of τ can be used as the initial guess for the current value of τ. For differentiable systems, the problem is reduced to that of solving the Davidenko implicit differential equation,

H_x (∂x/∂τ) + ∂h/∂τ = 0          (2.2)

where H_x ∈ ℝ^{n×n} is the Jacobian of h with respect to x. Homotopy methods are advantageous as they are possibly globally convergent and can be constructed to be exhaustive. However, computing a final solution using the approach is successful only if solutions for h(x, τ) = 0 exist for all τ and connect the initial solutions to the final solutions.
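The stepping procedure just described can be sketched in a few lines. The convex combination h(x, τ) = (1 − τ)g(x) + τf(x), the scalar example, and the finite-difference derivative below are illustrative assumptions, not the paper's construction (the natural homotopy of Section 3 instead deforms the node nonlinearity):

```python
import numpy as np

def continuation(f, g, x0, steps=100, newton_iters=20, tol=1e-10):
    """Track the root of h(x, t) = (1 - t)*g(x) + t*f(x) from a known
    root x0 of g (at t = 0) to a root of f (at t = 1)."""
    x = float(x0)
    for t in np.linspace(0.0, 1.0, steps + 1)[1:]:
        h = lambda u, t=t: (1.0 - t) * g(u) + t * f(u)
        for _ in range(newton_iters):   # Newton correction at this t
            eps = 1e-7
            dh = (h(x + eps) - h(x - eps)) / (2.0 * eps)
            step = h(x) / dh
            x -= step
            if abs(step) < tol:
                break
    return x

# Example: deform g(x) = x - 1 into f(x) = x**3 - 2; the tracked root
# lands on the real cube root of 2.
root = continuation(lambda x: x**3 - 2.0, lambda x: x - 1.0, x0=1.0)
```

Each Newton solve is warm-started from the previous τ value, which is exactly the "previous solution as initial guess" idea above; the sketch silently assumes the path exists and stays bounded, which Sections 5 and 6 show can fail for the MLP homotopy.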
Weight Solutions for MLP Networks
Figure 1: Problematic homotopy paths.
Although the general homotopy theory allows higher-dimensional manifold solutions (Alexander 1978; Alexander and Yorke 1978), practical numerical algorithms require that the solutions form bounded paths as τ is varied (Watson et al. 1987; Garcia and Zangwill 1981). Hence, at each τ, the solutions should consist only of isolated points. Bifurcations and path crossings further result in difficulties; depending on the choice of exit branch, the solution being tracked can change from a local minimum to a local maximum at such a point. Full rank and continuity of the Jacobian [H_x  ∂h/∂τ] are necessary and sufficient for well-behaved paths (Garcia and Zangwill 1981). The paths illustrated in Figure 1 are problematic for numerical solution procedures. A homotopy with solution paths (A) and (A′) does not connect the initial and final solutions. Along (B) there exists a functional relationship between parameters of the solutions of the homotopy equations at τ = 0, while a unique solution exists for τ > 0 (τ = 0 is a bifurcation point). If the exit point at τ = 0 cannot be found, or, in the case of multiple exit paths, all of the paths found reliably, the homotopy algorithm fails. A bifurcation for an intermediate value of τ is illustrated at (C). We now proceed to derive the equations specifying a natural homotopy between linear and nonlinear networks, and to analyze the path behavior thereof.
3 Multilayer Perceptron Network Homotopy Formulation
We consider MLP networks with one hidden layer of sigmoidal node transfer functions and linear output nodes. The network has n inputs, m hidden nodes, and k outputs. The weight w_ij connects input node j to hidden node i, and c_ij maps from hidden node i to output node j. The natural homotopy mapping between linear and nonlinear networks is defined by parameterizing the neural network hidden node nonlinearity in terms of τ:

σ(x, τ) = { x        when τ = 0
          { σ_f(x)   when τ = 1          (3.1)

where σ_f is the node nonlinearity of the final network, assumed to be monotonically increasing and saturating at ±1 for large positive/negative values. In addition, the deformation satisfies the following properties:

i.   σ(x, τ) → ±∞ as x → ±∞, ∀ τ ∈ [0, 1)          (3.2)
ii.  (∂/∂x) σ(x, τ) → χ(τ) ≠ 0 as x → ±∞, ∀ τ ∈ [0, 1)          (3.3)
iii. σ(x, τ) is C^∞          (3.4)

where χ(τ) is a smooth positive valued function. These conditions can easily be met by deformation of most widely used sigmoidal functions, including the standard deformation

σ(x, τ) = (1 − τ)x + τ tanh(x)          (3.5)

which is used in our examples. The input data x[i] ∈ ℝ^n and the desired data output y[i] ∈ ℝ^k, i = 1, 2, …, L, are collected in the data matrices

X = [x[1] x[2] … x[L]] ∈ ℝ^{n×L}
Y = [y[1] y[2] … y[L]] ∈ ℝ^{k×L}          (3.6)
The inputs are mapped to hidden node activations α ∈ ℝ^{m×L} by the input layer weights W ∈ ℝ^{m×n}. The hidden node trace Φ ∈ ℝ^{L×m} is then mapped via the output layer weights C ∈ ℝ^{k×m} to produce the output Z, as described by the following feedforward equations:

α = WX ∈ ℝ^{m×L}          (3.7)
Φ = σ(α^T) ∈ ℝ^{L×m}          (3.8)
Z = CΦ^T ∈ ℝ^{k×L}          (3.9)
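The feedforward equations 3.7-3.9 and the standard deformation 3.5 translate directly into code; the network sizes below are arbitrary choices for illustration. The check at the end confirms that at τ = 0 the network collapses to the linear map (CW)X:

```python
import numpy as np

def sigma(x, tau):
    """Standard deformation: linear at tau = 0, tanh at tau = 1."""
    return (1.0 - tau) * x + tau * np.tanh(x)

def forward(W, C, X, tau):
    """Feedforward map of eqs. 3.7-3.9 for an n-input, m-hidden,
    k-output network with linear output nodes."""
    alpha = W @ X               # m x L hidden node activations (3.7)
    Phi = sigma(alpha.T, tau)   # L x m hidden node trace (3.8)
    return C @ Phi.T            # k x L output (3.9)

# At tau = 0 the network collapses to the linear map (C W) X.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))     # m = 3 hidden nodes, n = 2 inputs
C = rng.normal(size=(1, 3))     # k = 1 output
X = rng.normal(size=(2, 5))     # L = 5 samples
assert np.allclose(forward(W, C, X, 0.0), C @ W @ X)
```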
The error matrix E ∈ ℝ^{k×L} and error criterion ε² are defined, respectively, by

E = Y − CΦ^T          (3.10)
ε² = (vec E)^T vec E = tr E^TE          (3.11)
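The equivalence of the two forms in 3.11 (both equal the squared Frobenius norm of E) can be verified numerically; the matrix size here is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(4)
E = rng.normal(size=(2, 7))      # arbitrary error matrix
v = E.flatten(order="F")         # vec stacks columns (column-major)
assert np.isclose(v @ v, np.trace(E.T @ E))
assert np.isclose(v @ v, np.linalg.norm(E, "fro") ** 2)
```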
where tr denotes the trace of a matrix. As described in Section 2 and Coetzee and Stonick (1994a), it is essential that functional dependencies among parameters be eliminated for solutions to form paths. Linear dependency can be removed by observing that if each row of a matrix W′ lies in N(X^T) = Im(X)^⊥, then WX = (W + W′)X [where N(·) and Im(·) denote the null and the range spaces of the matrices, respectively]. Thus, as in the case of the SLP (Coetzee and Stonick 1994a), a reduced QR-decomposition of X^T, such that X^T = QR with Q ∈ ℝ^{L×s}, R ∈ ℝ^{s×n}, and rank Q = s, is used to generate a new coordinate set β = RW^T ∈ ℝ^{s×m}. The activations α = WX = W(QR)^T = WR^TQ^T = β^TQ^T remain the same. The weights β ∈ ℝ^{s×m} are linearly independent weight combinations. Each column β_i corresponds to a nonredundant set of inputs for each single hidden layer node. Each hidden node is excited by the row basis of the input data matrix X. In the following sections, it is assumed that the linearly nonredundant weights β are used. The homotopy equations for optimization are found by setting the differential equal to zero. In matrix calculus notation (cf. Magnus and Neudecker 1988, Ch. 4) the necessary equations are

CΦ^TΦ − YΦ = 0          (3.12)
[vec(E^TC)]^T R_Φ (I_m ⊗ Q) = 0          (3.13)

where

R_Φ = diag{vec σ′(α^T)}          (3.14)
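The QR-based reduction can likewise be checked numerically: the activations WX depend on W only through β = RW^T. The sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, L = 4, 3, 10
X = rng.normal(size=(n, L))
W = rng.normal(size=(m, n))

Q, R = np.linalg.qr(X.T)   # reduced QR: Q is L x s, R is s x n (s = n here)
beta = R @ W.T             # s x m nonredundant weight coordinates
assert np.allclose(W @ X, beta.T @ Q.T)   # activations are unchanged
```

Any component of W in the null space of X^T is annihilated by the product RW^T, which is precisely the dependency-elimination step described above.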
The Hessian of the homotopy function follows from the second differential:

H = [ H_{c,c}     H_{c,β}  ]
    [ H_{c,β}^T   H_{β,β} ]

with H_{c,c} ∈ ℝ^{km×km}, H_{c,β} ∈ ℝ^{km×ms}, H_{c,β}^T ∈ ℝ^{ms×km}, and H_{β,β} ∈ ℝ^{ms×ms}, where

H_{c,c} = Φ^TΦ ⊗ I_k
H_{c,β} = [−I_m ⊗ E + K_{mk}(C ⊗ Φ^T)] R_Φ (I_m ⊗ Q)
H_{β,β} = (I_m ⊗ Q^T)[R_Φ (C^TC ⊗ I_L) R_Φ − diag{vec M_α^T}](I_m ⊗ Q)
M_α = C^TE ⊙ σ″(α)          (3.15)

and K_{mk} denotes the commutation matrix.
The above formulation 3.12-3.15 allows both input and output weights to vary over all permissible values. Using 3.12, it is possible to solve directly for the optimal weights C in terms of the hidden node weights¹ β as C = Y(Φ^†)^T + C¹, where the rows of C¹ lie in N(Φ). The necessary equations 3.12-3.15 are then reduced to the following:

[vec E^TY(Φ^†)^T]^T R_Φ (I_m ⊗ Q) = 0          (3.16)
The corresponding reduced Hessian is

(I_m ⊗ Q^T)[R_Φ Γ^TΓ R_Φ − diag{vec M_α^T}](I_m ⊗ Q) ∈ ℝ^{ms×ms}          (3.17)

where

Γ = I_m ⊗ E − K_{mk}[Y(Φ^†)^T ⊗ Φ^T]
This formulation 3.16-3.17 is complete and convenient and is used in Section 6; by using only the hidden node weights the number of variables is reduced. Both sets of homotopy equations, 3.12-3.15 and 3.16-3.17, have specific symmetries that can be used to reduce the number of paths that need to be tracked to ensure that all solutions are computed. Reversing the sign of all the weights leading up to and away from a hidden node, as well as permuting the hidden nodes, leaves the network performance invariant (Sussmann 1992; Albertini and Sontag 1992; Chen et al. 1993). These results imply that the existing solutions can be separated into equivalence classes. Chen et al. (1993) showed that, for the network architectures we consider, a total of 2^m m! equivalence classes are created, and inequalities specify a region in weight space where a single representative of each solution class can be found. Using these results, it is in principle possible to find all solutions by tracking only one solution in each equivalence class, reducing the number of solutions to the homotopy equations. Note that the objective pursued by the above researchers was to find uniqueness of weights for an infinite set of arbitrary inputs that completely specify the network mapping. Given this condition, the symmetries described above are the only invariant transformations, and it is

¹This process of reducing variables using pseudoinverses has a long history (cf. Golub and Pereyra 1973).
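For the final nonlinearity σ_f = tanh (an odd function), the sign-flip and permutation symmetries are easy to verify numerically; the shapes and random data below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 2))     # m = 3 hidden nodes, n = 2 inputs
C = rng.normal(size=(2, 3))     # k = 2 outputs
X = rng.normal(size=(2, 6))     # L = 6 samples

def out(W, C):
    return C @ np.tanh(W @ X)

S = np.diag([1.0, -1.0, 1.0])   # flip the sign of hidden node 2
P = np.eye(3)[[2, 0, 1]]        # permute the hidden nodes

# Sign flip works because tanh is odd; permutation because tanh acts
# elementwise, so it commutes with row reordering.
assert np.allclose(out(W, C), out(S @ W, C @ S))
assert np.allclose(out(W, C), out(P @ W, C @ P.T))
```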
simple to show that weight solutions are isolated. However, we consider a finite set of data, as is typically the case in practice. Thus other weight sets may exist that produce the same output for the given data, and the weight solutions are not necessarily point sets. The topological nature of all these solutions is of primary interest for the homotopy approach, to ensure well-behaved paths, and in descent algorithms, to ensure invertibility of the Hessian matrices.

4 Geometric Interpretation
For the two equivalent homotopy equation formulations derived above, it is not obvious whether the parameters {β, C} or {β} are functionally independent, and hence, whether the solutions to the homotopy paths will be isolated (if they exist). In addition, it has to be verified that paths originating at the linear system solutions can be reliably followed to solve the final system. Addressing these questions is the main topic of the rest of this paper. As in the case of the single layer perceptron, geometric formulations provide the mathematical insight necessary to resolve these questions. For this work let f : ∏_{i=1}^{n} U_i → V be a mapping of variables x_i ∈ U_i, i = 1, 2, …, n from various allowable domains U_i. Let a be an index set for these variables, i.e., x_{a(i)} ∈ {x_1, x_2, …, x_n}, i = 1, 2, …, r, and let x_ā = {x_1, x_2, …, x_n}\x_a be the rest of the variables. Then with the function f we associate various manifolds

Y_{x_{a(1)}, x_{a(2)}, …, x_{a(r)}}(x_ā) = {f(x_1, x_2, …, x_n) | ∀ x_{a(i)} ∈ U_{a(i)}, i = 1, 2, …, r}          (4.1)
generated by varying the subset of variables x_{a(1)}, x_{a(2)}, …, x_{a(r)} over all allowable values, while keeping the rest of the variables at preset values. Where confusion will not arise, the dependence on x_ā is neglected. Manifolds associated with each mapping defined by the feedforward network when both C and β are varied are denoted by Y, and by U when C is solved explicitly in terms of β using the pseudoinverse. Using this notation, the following manifolds are used extensively in our analyses:
• Y_{C,β}(τ) = {C[σ(Qβ)]^T | ∀ C ∈ ℝ^{k×m}, ∀ β ∈ ℝ^{s×m}} ⊂ ℝ^{k×L} is the manifold generated by varying both the input and output layer weights over the allowable weight space.

• Y_C(β, τ) = {C[σ(Qβ)]^T | ∀ C ∈ ℝ^{k×m}} ⊂ ℝ^{k×L} is the manifold generated by varying the output layer weights for a fixed input layer weight set β.

• U_β = {Y(Φ^†)^T[σ(Qβ)]^T | ∀ β ∈ ℝ^{s×m}} ⊂ ℝ^{k×L} is the manifold generated by varying the input layer weights, but solving explicitly for an ideal set of output weights using the pseudoinverse. Note that U_β and Y_{β,C} are fundamentally different in topology.

• W = {σ(Qβ) | ∀ β ∈ ℝ^s} is the manifold generated by varying the input layer weights for a single hidden node or single layer perceptron, and previously analyzed in detail (Coetzee and Stonick 1993, 1994a,b).
We now proceed to discuss two different geometric interpretations of the necessary equations.

4.1 Projection Interpretation. Geometrically, minimization of the least squares error norm defines a projection of the desired output data onto a data set generated by varying the parameters over their allowed range. This geometric perspective was used for the SLP in Coetzee and Stonick (1994a) and can be extended to homotopy formulations for the MLP. Specifically, the solution to 3.12-3.13 defines a projection of Y onto Y_{C,β}(τ). (For ease of visualization it is convenient in this case to identify ℝ^{k×L} with ℝ^{kL} using the vec operator.) A set of weights generating the same output for given input data is associated with each projection and forms a solution set. Note that symmetries inherent in the network, as discussed in Section 3, prevent specification of a solution by a single weight set. In neither formulation do the parameters necessarily form an allowable coordinate system for the associated data manifolds. The homotopy approach will, however, still track paths if each point on Y_{C,β} is generated only by isolated weight sets, i.e., the manifold is a local immersion of the weight space. The topology of these weight sets is analyzed in Section 5.2, and provides a characterization of the weights at a particular value of the homotopy parameter τ, and hence, of whether paths are being tracked or not. Similarly, solving the vector equation 3.16 can be interpreted as finding the orthogonal projection of Y onto the manifold U_β. Note that in this case the data surface is defined in terms of both the input and output data, rather than just the input data; this dependence leads to an involved geometric formulation. However, since the weight solutions defined by the two formulations are the same, it is sufficient for our objectives to analyze only one. Unlike for the SLP, these projection interpretations do not provide much intuitive insight into the mapping capabilities of the MLP.
The high dimensionality of the spaces even for simple examples hinders development of intuition. The next subsection describes an alternative perspective that makes use of results arising from the geometric analysis for the SLP, to clearly delineate the influence of the input and output layer weights. This view allows for insight into the actual neural mapping and construction of illustrative examples for visualization of the homotopy process.

4.2 Intersection Interpretation. For simplicity, first consider the case where k = 1.

Figure 2: Intersection of the hyperplane Y_C(β), generated by two hidden layer nodes (indicated by vectors), and the single layer perceptron mapping W = σ(Qβ) of the input data surface. y_p is the projection of y onto Y_C(β).

The input data surface generated by a single hidden layer neuron, W = σ(Qβ), is the same for all of the hidden node neurons, since each receives the same input data. For a given set of weights β, each hidden neuron weight vector β_i defines a vector in ℝ^L from the origin to the point σ(Qβ_i) on W. Using m hidden layer neurons, a total of m such vectors are generated. The output of the network (defined by 3.9) is formed by linearly combining these vectors. Allowing all possible output weights C generates an m-dimensional subspace, and optimal output weights result from projection of the desired data vector y onto this subspace. Simultaneous optimization of both input and output weights corresponds to selecting a subspace of dimension p ≤ m that intersects the surface W = σ(Qβ) such that at least p linearly independent vectors exist in the intersection, and such that the hyperplane is closest to the desired vector y. This optimization process is shown in Figure 2, where the two vectors corresponding to two hidden units are found on the surface W = σ(Qβ), and Y_C(β) is the hyperplane spanned by these units. When τ = 0, W is a subspace, and linear combinations of vectors in this plane simply select specific subspaces of this plane. This result
explains why hidden nodes do not modify the mapping of a linear network. When τ > 0, W corresponds to a smooth distortion of the linear subspace spanned by the rows of the data matrix (Coetzee and Stonick 1994a, Theorem 1). In general, the number of hidden layer nodes directly determines whether the problem can be solved exactly. If there are more or at least an equal number of nodes as there are samples (m ≥ L), there are usually a sufficient number of vectors that can be used to span the data space ℝ^L [cf. Lemma 1, Appendix A; Poston et al. (1991) (Theorem 3.1) and Sartori and Antsaklis (1991) (Lemma 1)]. Thus, any desired signal y can be generated exactly by the network. However, if the number of hidden nodes is less than the number of samples, a characterization of all p ≤ m-dimensional subspaces that intersect W is needed to perform global optimization. These subspaces are not necessarily of the same dimension as the surface W. If k > 1, i.e., there is more than one output node, the desired output at each output node is projected onto a common hidden node subspace Y_C(β) such that a measure of total projection error (including individual projection errors) is minimized by the choice of linear subspace. Therefore, the distinct desired outputs jointly will determine the optimum hidden layer weights. In this case, Figure 2 remains the same, except that multiple points y_k for each output node are projected onto Y_C(β). The additional outputs represent further constraints on the solution set, with an expected reduction in the measure of viable solution sets in weight space. The geometric interpretation provides insight into the functional dependency among the weights. Consider the input weights β to be optimal and fixed. If these produce p ≤ m linearly independent hidden node vectors, then a p-dimensional coordinate system can be defined for the subspace Y_C(β), consisting of a linear combination of the m output weights.
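The exact-fit claim for m ≥ L can be illustrated by solving a linear system for the output weights once the hidden node vectors are fixed (a numerical sketch with random data, not the cited proofs):

```python
import numpy as np

# With as many hidden nodes as samples (m = L), generic hidden node
# vectors sigma(Q beta_i) span R^L, so any target y is matched exactly.
rng = np.random.default_rng(5)
L = s = 4
m = L
Q = np.linalg.qr(rng.normal(size=(L, s)))[0]   # L x s orthonormal basis
beta = rng.normal(size=(s, m))
Phi = np.tanh(Q @ beta)          # L x m hidden node vectors
y = rng.normal(size=L)
c = np.linalg.solve(Phi, y)      # output weights solving Phi c = y
assert np.allclose(Phi @ c, y)
```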
Since Y_C(β) is a plane, it is a simple matter to characterize all projections of y by particular solutions. Given β, it is therefore simple to ensure an isolated and unique path for the weights C in the formulation 3.12-3.17. Thus, for both formulations of the necessary equations 3.12-3.15 and 3.16-3.17, the hidden layer weights β have to be isolated, or be reconstructed from a parameterization with isolated solutions, in order to define viable solution paths as the homotopy parameter is varied. The coordinates β define a valid coordinate system for W (Coetzee and Stonick 1994a, Theorem 1). However, in the MLP, this condition is not sufficient to guarantee isolated solutions β ∈ ℝ^{s×m}. For example, equivalent performance results from any two weight sets β and β* such that span{σ(Qβ)} = span{σ(Qβ*)}. Requiring isolated solutions corresponds to having any optimal plane intersecting W (see Fig. 2) be generated only by isolated hidden node weights in ℝ^{s×m}. Since the weights β define a coordinate system on W, this condition requires that the intersection of Y_C(β) and W consist only of isolated points. For the example in Figure 2 it is clear that the intersection of the plane and W corresponds to an infinite number of allowable hidden unit vectors. These vectors correspond to a manifold of weight solutions β of dimension at least as large as that of the intersection. In addition to independent variation of the columns of β, constrained variation among columns can further increase the dimension of this solution manifold. This constrained variation corresponds to the fact that it is possible for planes of different orientation to result in the same performance. From this geometric perspective, the topological nature of the solutions therefore depends on the intersection of the tangent planes with W at the hidden node vectors, and the plane Y_C(β) spanned by the hidden node vectors. This geometric intuition is formalized in Section 5.2. The networks analyzed here have linear output nodes and one hidden layer. However, the geometric interpretation can be modified to deal with more hidden layers and nonlinear outputs. For example, for a single nonlinear output neuron, the plane generated by m vectors on the input single layer perceptron data surface of dimension s [in this paper Y_C(β)] is deformed by a secondary single layer perceptron with m inputs. This secondary deformation results from the same nonlinearity as used in the hidden layer node and is thus described by the same analysis performed for the SLP (Coetzee and Stonick 1993, 1994a). The optimal weights result from finding minimal length projections onto this secondary surface.

5 MLP Weight Solutions Topology

This section presents a formal topological analysis of the natural homotopy solutions, and quantifies the geometric interpretations presented in Sections 4.1-4.2. Linear and nonlinear cases are analyzed in Sections 5.1 and 5.2, respectively.
Implications of these topological weight characterizations on the natural homotopy method for the MLP are discussed in Section 5.3. In all cases, proofs have been relegated to the Appendix to aid the flow of the discussion.

5.1 Linear System Analysis (τ = 0). In the linear case σ′(x) = 1 and σ″(x) = 0 ∀ x ∈ ℝ, so R_Φ = I_{mL}, M_α = 0, and Φ = Qβ. Applying these identities to 3.12-3.13 results in the following initial set of equations:

Cβ^TΣ_Xβ − Σ_YXβ = 0          (5.1)
Σ_XβC^TC − Σ_XYC = 0          (5.2)

where Σ_X = Q^TQ, Σ_YX = YQ, and Σ_XY = Σ_YX^T. Note that in our formulation Σ_X is always invertible by virtue of the explicit QR-decomposition of the input data X. An analysis of these linear neural network equations was performed by Baldi and Hornik (1989) assuming n inputs, n outputs,
and p ≤ n hidden nodes (assuming linearity a priori). Their results are not sufficiently general to deal with the architectures we consider. However, all the components for extension to the architectures we consider are present, and corresponding (or parallel) results follow from rearranging parts of their proofs and taking proper care in dealing with pseudoinverses, index sets, and matrix dimensions. These straightforward extensions are stated below without proof. In Theorem 1 below the following notational convention will be used. An index set J is a set of r integers J(l), l = 1, 2, …, r, with J ⊂ {1, 2, …, n}. For a given matrix A ∈ ℝ^{m×n} the matrix A_J ∈ ℝ^{m×r} is the matrix whose columns are selected from A according to the index set J. The set {1, 2, …, n}\J is denoted by J̄. The following theorems require some care to prove, but follow directly from those in Baldi and Hornik (1989):

Theorem 1 (Baldi and Hornik, Restated). If C is a rank 1 ≤ r ≤ min(k, m) solution of the necessary linear equations, then C is of the form

C = [U_J  0_{k×(m−r)}] D          (5.3)

where D ∈ ℝ^{m×m} is nonsingular and arbitrary, J an index set, and U a set of left singular vectors (not necessarily unique) of Σ = Σ_YX Σ_X^{−1} Σ_XY. Therefore, C is an element of one of C(k, r) linear equivalence classes, with C(k, r) the binomial coefficient. To each C of the form 5.3 there corresponds a set β^T described by

β^T = C^†Σ_YXΣ_X^{−1} + (I_m − C^†C)Z          (5.4)

where Z ∈ ℝ^{m×s} satisfies

[P_C^⊥Σ_YX ⊗ (I_m − C^†C)] vec Z = 0          (5.5)

Here P_C^⊥ denotes the projection matrix onto the space orthogonal to the span of C. Theorem 1 provides a complete description of the form of the weight solutions to the linear problem. Given the data matrix X, an eigendecomposition (SVD) of Σ is performed. If the singular values are distinct, this decomposition is unique. In that case, for a required rank r of C, one of the C(k, r) equivalence classes can be selected for C using 5.3. Equivalence holds for any two solutions C and C′ that are related by an invertible transformation. For each such C, it is possible to find an affine subspace (of at least trivial dimension) of weights β^T using 5.4 and 5.5. If the eigenvalues of Σ are not distinct, the singular vectors of Σ are not unique; in that case C is further equivalent up to rotation in the invariant subspaces of Σ. However, this rotational equivalence can be subsumed by the invertible transformation D in 5.3. Finally, from examination of 5.4 it follows that the weight solutions are always unbounded. The structure of the performance surface at the critical points (saddle point or extrema) can be determined from the eigenvalues of Σ at that point:
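Theorem 1's linear setting can be probed numerically. The sketch below (with assumed shapes) checks only the simplest consequence: when m ≥ s, suitable weights reproduce the unconstrained least squares projection YQQ^T:

```python
import numpy as np

rng = np.random.default_rng(3)
k, m, n, L = 2, 5, 3, 12
X, Y = rng.normal(size=(n, L)), rng.normal(size=(k, L))
Q, R = np.linalg.qr(X.T)    # Q: L x s orthonormal, here s = n = 3
s = Q.shape[1]

# Illustrative choice: beta picks out the s basis directions and C
# carries the regression coefficients (extra hidden nodes unused).
beta = np.hstack([np.eye(s), np.zeros((s, m - s))])   # s x m
C = np.hstack([Y @ Q, np.zeros((k, m - s))])          # k x m
Z = C @ (Q @ beta).T        # linear network output C beta^T Q^T

assert np.allclose(Z, Y @ Q @ Q.T)   # least squares projection of Y
```

Note that many other {C, β} pairs give the same output, which is the unbounded affine solution set described by 5.4.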
Theorem 2. Let the eigenvalues of Σ be given by λ_1 > λ_2 > … > λ_q ≥ 0, with multiplicity m(λ_i), i = 1, 2, …, q. Then for every index set J containing an element j such that there exists k ∉ J with λ_k > λ_j, the equivalence class generated by U_J corresponds to a saddle point. If no such k exists, then the critical point is either a local minimum or a saddle point. Finally, if n ≥ m (more input nodes than hidden layer nodes) then all critical points except the global minimum are saddle points in the following sense:
Theorem 3. If Σ is full rank, and C is not full rank [r < min(k, m)], then a critical point is a saddle point, when considered in the space of all matrices C of rank at most min(k, m).

The full implication of these results for the natural homotopy method can be discussed using only knowledge of the topological nature of the solutions for the nonlinear case. The nonlinear case is described in the next subsection, and discussion of the natural homotopy for the MLP follows in Section 5.3.

5.2 Nonlinear System Analysis (τ > 0). If the node transfer function is nonlinear, neither the hidden node data surface W nor the set Y_{C,β} is a linear subspace. As described in Section 4.1, the topological nature of solutions relates to the definition of a coordinate system on the manifold Y_{C,β} ⊂ ℝ^{k×L}. A solution to the optimization equations is found by projecting the desired data vector onto Y_{β,C}. Every point (such as the projection of the desired data vector) on Y_{C,β} is the image of the set of weights that produce the same output given the input data. Lemma 2 below provides an explicit characterization of this weight set in terms of the number of linearly independent input nodes s, hidden nodes m, output nodes k, and the number of input data samples L:
Lemma 2. Let {C, β} ↦ CΦ^T define a mapping of ℝ^{k×m} × ℝ^{s×m} to the manifold Y_{C,β} ⊂ ℝ^{k×L} with Jacobian J_Z ∈ ℝ^{kL×m(s+k)}. Then the inverse image of a point on Y_{C,β} is a manifold of dimension ρ = max{0, m(k + s) − rank J_Z}, where, generically with respect to {Q, β}:

(i) if kL ≤ km, then rank J_Z = kL and ρ = m(k + s) − kL.
(ii) if km ≤ kL, then km ≤ rank J_Z ≤ min{kL, m(k + s)} and m(k + s) − min{kL, m(k + s)} ≤ ρ ≤ ms.
Note that the dimension of the weight manifold corresponding to a desired vector projection depends on the rank of the Jacobian J_Z, which defines the tangent space to Y_{C,β}. As expected from the geometric perspective (Section 4.2), the dimension ρ, and hence rank J_Z, is dependent on the intersection of the hidden node span with the hidden node data manifold W, and how this intersection changes as the weights β vary.
Lemma 3 below formalizes this interpretation. Precisely, rank J_Z is dependent on the joint transversality of the subspace spanned by the hidden node vectors, span{Φ}, and the tangent planes T_i to the hidden node manifold W at each of the hidden node vectors β_i:

Lemma 3. The Jacobian J_Z is full rank except on a set of measure zero in C ∈ ℝ^{k×m} if

(i) Fan-out architecture: k ≥ m, L ≥ m + s, and [Φ T_i], i = 1, 2, …, m, is full rank.
(ii) Fan-in architecture: k ≤ m, L ≥ m + sp′, where p′ = ⌈m/k⌉, and [Φ T_{α_1} … T_{α_{p′}}], where α_i, i = 1, 2, …, p′, is an index set in {1, 2, …, m}, has full rank.
By characterization of the generic intersection of these tangent spaces T_i (cf. Lemma 4) and span{Φ}, it is possible to generate generic statements quantifying the dimension ρ of the weight solution sets. In particular, conditions on data sizes and network architectures that ensure that the inverse weight manifold dimension ρ is zero are of interest. In this case the weight solutions are isolated points in weight space since solutions to the necessary equations define projections onto the manifold Y_{C,β}. The solutions to the homotopy equations will then form paths as τ is varied and can be tracked using well-established numerical procedures (cf. Sections 2 and 4.1). The theorem and corollary below provide a sufficient bound on the data size L to ensure that ρ = 0, and that the solutions form paths.

Theorem 5 (Immersion). If L ≥ max{m + ⌈ms/k⌉, m + s⌈m/k⌉}, then except for a set {C, β, Q} of measure zero in ℝ^{k×m} × ℝ^{s×m} × ℝ^{L×s}, the map {C, β} ↦ CΦ^T defines a local coordinate system (immersion) on the manifold Y_{C,β} ⊂ ℝ^{k×L}.

Corollary 6 (Path Theorem). If L ≥ max{m + ⌈ms/k⌉, m + s⌈m/k⌉}, then except for a set {Q} of measure zero in ℝ^{L×s}, the set of {Y} ∈ ℝ^{k×L} having an isolated solution has nonzero measure.

In summary, the above theorems show that, depending on the number of hidden, output, and linearly independent input nodes, the neural network weight solutions consist of finite dimensional manifolds when τ > 0. If the number of samples L is small relative to the number of parameters in the network, then the solutions form higher dimensional manifolds, while if L is sufficiently large, the intersection set has dimension 0, and the solutions are generically isolated. We now discuss how the results of this section describing the weight solution topology for the nonlinear case (τ > 0), combined with the results of Section 5.1 for the linear system (τ = 0), impact the feasibility of the natural homotopy approach.
5.3 Discussion. The analysis in Section 5.1 proved that a number of different equivalence classes of weight solutions result for the linear case (τ = 0). These solutions form higher-dimensional manifolds, and are not isolated. All points are either minima or saddle points of the quadratic error surface. If it were known that a particular solution path retains the initial solution classification of the extremal point (i.e., minima map to minima, maxima to maxima, etc.), only the minima would have to be tracked to perform optimization using homotopy. However, as will be illustrated in Section 6.3, this condition does not hold for the MLP. Thus implementing the homotopy approach requires that all possible critical points of the necessary equations be tracked. Since equivalence classes for different choices of rank for C do not subsume, there are

N = ∑_{r=1}^{min(k,m)} C(k, r)          (5.6)

equivalence classes that describe the initial system solutions. A graphic interpretation of the linear solution topology is shown in Figure 3. All solutions emanating from each of these classes have to be tracked. Analysis of the nonlinear case (τ > 0; Section 5.2) showed that these solutions generically form higher dimensional manifolds if there are not enough data samples. If there are enough data, the solutions at each τ are isolated and will therefore form paths as τ is varied (due to the differentiability of the node nonlinearity). Generally, there will be a change in the dimension of the solution manifold as τ is varied, as illustrated in Figure 3. The change in the dimension of the solution manifold as τ is varied reflects a change in the rank of the homotopy Jacobian. Thus the homotopy Jacobian generically changes rank as τ changes from τ = 0 to τ = ε > 0, and the linear system is a bifurcation point of the homotopy equations. The MLP natural homotopy thus generally requires tracking manifolds. Even given enough data (when the problem generically reduces to tracking paths) additional issues still remain for the homotopy approach to be successful. First, solutions should exist for all τ > 0. Second, it has to be established that each original solution has a path emanating from it for τ > 0. Third, a solution path should connect to a solution of the final system of equations. From a practical, if not theoretical, perspective, bifurcations should not occur for values of 0 < τ < 1. In the following section, we present examples constructed using the geometric interpretation in Section 4.2 that illustrate that usually most of these conditions cannot be guaranteed.

6 Homotopy Path Behavior
In this section a simple example of the multilayer perceptron is used to illustrate path behavior of the natural homotopy.

Frans M. Coetzee and Virginia L. Stonick

Figure 3: For T = 0, the solution set in weight space consists of equivalence classes of solutions forming manifolds of various dimensions. When T > 0 each equivalence class can give rise to extensions of the original manifolds, lower-dimensional manifolds, or isolated solutions.

The network and the data are shown in Figure 4a, as is the equivalent network using weights β. For this example Q^T = [1 α], L = 2, s = 1, m = 1, and k = 1. The different associated data manifolds of Section 4 are illustrated for a fixed arbitrary value of T in Figure 4b. From the linear analysis results there is only one equivalence class of linear system solutions. Also, due to symmetry considerations (Section 3), only values of β > 0 need to be considered. The following undesirable path characteristics are illustrated in the following sections:
- The solution retains the higher-dimensional manifold structure from T = 0 to T = 1 (Section 6.1).
- A manifold of solutions exists at T = 0, there are no minima solutions for β in ℝ when 0 < T < 1 (except for limit sets at β → 0, β → ∞), and no finite solutions exist at T = 1 (Section 6.2).
- Bifurcations occur in the solutions for a nonzero measure set of desired values y (Section 6.3).

In each case the projection operator, error surface, and homotopy path descriptions are presented and discussed. While interrelated, using all of these perspectives facilitates complete understanding of the weight optimization process.
Weight Solutions for MLP Networks
Figure 4: Example network (a) and original and equivalent networks (b). Notation: the hidden node (SLP) surface W (heavy line) for a given T is generated by deforming span{Q}; the shaded region Y_{c,β} is the set of all possible outputs from varying c and β. For fixed β, a specific hidden node vector in W is selected (indicated by the arrow); varying c generates Y_c(β) (dashed line). The minimum error results from projection of y onto ∂Y_{c,β}.
6.1 Higher Dimensional Manifold Solutions. Let α = 1, y ∈ ℝ², and y ≠ 0. The data manifolds and the desired data vector are illustrated in Figure 5. The linear system has one equivalence class of solutions of the form cβ = const for some appropriate constant, which scales the vector Q = [1 1]^T into the projection y_p of y onto Q. When T > 0, the set W = σ(Qℝ) = Qℝ, and therefore y_p ∈ Y_{c,β}(T) is specified by the infinite set of possible solutions cβ = const. It follows that the hyperbolic solution manifold of the linear case is retained as T increases, as illustrated in Figure 6. In this case the network can implement the mapping exactly, and the vector Q is such that span{σ(Qβ)} = span{Q}. However, this solution is not stable with respect to variation in Q; an arbitrarily small perturbation of the data vector Q will result in the solutions forming a path.

6.2 Manifold Collapse. In the following sections, it is convenient to assume that c is explicitly solved in terms of β, and optimization of β alone is considered. For this example 1 < α < ∞ and y = [1 α]^T. In this example it is illustrated how the dimension of the manifold of solutions can change abruptly as T is varied, and that no finite weight solutions exist corresponding to minima. The data manifolds and projection operator perspective are illustrated in Figure 7, the error surface in Figure 8, and the homotopy paths in Figure 9.

Figure 5: Data resulting in a manifold of solutions for both the linear and nonlinear system. The desired vector y has a unique projection y_p onto the invariant plane generated by varying the input and output weights. v_1 and v_2 are representative of the infinite possible set of hidden node vectors having equivalent performance.

When T = 0, the set Y_{c,β} = span{Q} (indicated by the heavy dashed line in Fig. 7) and there is zero error for all β ≠ 0, since y ∈ Y_{c,β}. When T = δ > 0, the set Y_{c,β} changes dramatically, since the single layer perceptron surface W (heavy curved line) no longer forms a plane. It can be shown that there is no hidden node weight that allows y ∈ Y_{c,β}(T), and there is always a nonzero error. However, it can also be shown that there are sequences of β producing hidden node vectors approaching span{Q}. In Figure 7, the sequence of vectors v_0, v_1, ..., v_j, ... corresponding to increasing values of β approaches the optimal set span{Q} as β → ∞. Therefore, the error → 0 as β → ∞. Similarly, as β ↓ 0, the hidden node vector also approaches span{Q}, and c → 0, as illustrated by the sequence of vectors v_0, v_-1, .... The error surface for different values of T and β is shown in
Figure 6: Manifold solution connecting initial and final system when span{Q} is invariant under σ.
Figure 8. At β = 0 the network produces a constant zero output and the error is 1 + α². Therefore, the error surface is discontinuous (in β) at β = 0. The error surface has an internal maximum, and monotonically approaches zero as β → ∞ and as β ↓ 0 for 0 < T < 1. The homotopy paths corresponding to the different values of T are shown in Figure 9. When T = 0, the solution set is ℝ\{0}, while for 0 < T < 1 the only solution that exists corresponds to the maximum in the error surface in Figure 8. The solutions corresponding to minima approach limiting sets at 0 and ∞. As T ↑ 1, there are no solutions in ℝ. In this example infinite weight sets are the only optimal set of solutions. By varying T the manifold changes dimension; this can result in an arbitrarily large change in the weight solution for an infinitesimal variation in T. There are no finite minima for the problem, and unreachable limit sets with large basins of attraction result, with corresponding numerical difficulties. Note further that this behavior occurs for nonzero measure sets of Q and y; for example, one set of possible values is given by y and Q such that y₂/y₁ > α > 1. Therefore generic exception arguments cannot be made.
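The manifold-collapse behavior can be explored numerically. The sketch below is hedged: the paper's natural homotopy deformation of the node nonlinearity is defined earlier in the article, so the convex blend σ_T(u) = (1 − T)u + T·tanh(u) used here, and the value α = 2, are illustrative assumptions; the output weight c is eliminated by orthogonal projection, as assumed in Section 6.2.

```python
import numpy as np

# Toy network of Section 6: one hidden node, two data samples.
# Q^T = [1, alpha], y = [1, alpha]^T with alpha > 1 (Section 6.2).
alpha = 2.0
Q = np.array([1.0, alpha])
y = np.array([1.0, alpha])

def sigma(u, T):
    """Assumed node deformation: linear at T = 0, sigmoidal at T = 1.
    (Illustrative stand-in for the paper's natural homotopy.)"""
    return (1.0 - T) * u + T * np.tanh(u)

def error(beta, T):
    """MSE after solving the output weight c optimally, i.e., projecting
    y onto the line spanned by the hidden node vector v = sigma(Q*beta)."""
    v = sigma(Q * beta, T)
    return y @ y - (v @ y) ** 2 / (v @ v)

T = 0.5
betas = np.linspace(0.05, 20.0, 400)
E = np.array([error(b, T) for b in betas])

# The error is strictly positive for every finite beta > 0, but decays
# toward zero in both limits beta -> 0 and beta -> infinity: the only
# "minima" are unreachable limit sets, and the surface has an internal
# maximum (compare the description of Figure 8).
print("smallest grid error:", E.min())
print("internal maximum near beta =", betas[E.argmax()])
```

With this blend the qualitative picture of Section 6.2 is reproduced: a strictly positive error surface whose infimum is attained only at the limit sets β → 0 and β → ∞.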
Figure 7: Sequence v_0, v_1, ... of hidden node vectors generated by increasing β, and sequence v_0, v_-1, v_-2, ... generated by decreasing β. In both cases the span of the hidden node vector approaches the space spanned by Q as β ≠ 0 becomes arbitrarily large or arbitrarily small.
6.3 Path Bifurcation. In this section it is illustrated that the solutions can undergo bifurcation for nonzero measure sets of values of α and y. Here α > 1 and y is chosen such that at T = 0 there is no zero-error solution for β. However, y ∈ Y_{c,β} when 0 < T_c < T, as depicted in Figure 10. When T < T_c, an optimal weight β results from the projection y_p of the desired vector y onto ∂Y_{c,β}. It follows that for T < T_c there is a unique, isolated solution β. When T > T_c, the problem can be solved with zero error by two choices of β, whose corresponding hidden node vectors span the same space containing y. Corresponding error surfaces for different values of β and T are illustrated in Figure 11. The error is once again discontinuous in β at β = 0. When T = 0, the hidden node weight solution set is β ∈ ℝ\{0}. Below the critical value T = T_c one local minimum occurs, generated by the projection of y onto the boundary ∂Y_{c,β}. When T = T_c zero error results, and
Figure 8: Error surface for hidden layer node weight as T is varied. The error is symmetric in β around β = 0, and discontinuous in β at β = 0. For T = 0, the hidden node surface is a plane, and for all β ∈ ℝ\{0} a solution exists with zero error; for 0 < T < 1 one local maximum occurs. For 0 < T < 1 the error becomes arbitrarily small as β → 0, ∞.

for T_c < T < 1 two local minima, separated by a local maximum, appear. As β → ∞, the hidden node vector approaches span{Q} and a constant error equal to that of the linear case appears. The homotopy paths for this example are shown in Figure 12. When T = 0, the solution set is ℝ\{0}, while for T positive but less than T_c there is only one solution, corresponding to a minimum. A bifurcation occurs at T = T_c. Three paths result, two of which are minima of zero error (paths a and c). Path b corresponds to the local maximum for T_c < T < 1. Paths a and b diverge to infinity at T = 1, while path c leads to a finite solution. Note that the set of y ∈ ℝ² exhibiting this behavior does not have measure zero; this set is given by y ∈ … . When T > T_c, the error is zero, but two hidden node vectors can generate the same span containing y, and correspond to two branches of the solution. With a regularization term added, the error function becomes norm-coercive, and solutions at infinity do not appear. Furthermore, since the Hessian of the initial solutions will generically be nonsingular, the solution to the initial system will generically form points rather than the higher-dimensional manifolds described earlier. The addition of sum-of-squares regularization terms for standard linear regression is well known and simple to analyze. The problem is that of finding a matrix A such that Z = AX, minimizing E' = tr{(Y − Z)(Y − Z)^T + λAA^T}, where λ is a proportionality constant. For almost all λ ∈ ℝ there is a unique, explicit solution.
The standard geometric projection perspective holds, since it is simple to show that the regularization term need not be added explicitly but is equivalent to the addition of
Figure 11: Error surface for hidden layer node weight as T is varied. The error is symmetric in β around β = 0, and discontinuous in β at β = 0. When T = 0, the hidden node surface is a plane, and for all β ∈ ℝ\{0} a solution exists; for 0 < T < T_c one local minimum occurs. When T = T_c zero error results, and for T_c < T < 1 two local minima, separated by a local maximum, appear. For 0 < T < 1 the error approaches the linear system error as β → ∞.

Figure 12: Homotopy paths for hidden layer node, showing different solution states: at T = 0, β ∈ ℝ\{0}, while for 0 < T < T_c = 0.67 there is only one solution. A bifurcation occurs when T = T_c = 0.67, resulting in three bifurcation paths: two, a and c, corresponding to minima of zero error, and one, b, to a local maximum for T_c < T < 1. Paths a and b diverge to infinity at T = 1, while path c leads to a finite solution.

additional data to X and Y and performing an unregularized regression (Allen 1974). However, in the neural network case, due to the hidden layer, the above results need to be modified. First, it should be noted that the regularization term (6.1) is invariant under the symmetries described in Section 3, and there are therefore always multiple initial solutions. It is not known whether the only solutions to the regularized equations form an equivalence class as described in Section 3. If this is the case, then the norm coercivity and the symmetry imply that each initial solution is a minimum, connects to a solution of the final equations, and that these final solutions form an equivalence class. Hence, given the solution to the initial system, a globally convergent method results whereby one path from the initial system can be tracked to the final system to obtain a complete equivalence class of solutions to the final equations. Unfortunately, it cannot be guaranteed that this equivalence class contains the global minima of the neural network. The major problem with this approach lies in characterizing the initial system of equations. The product Cβ^T at the solutions is different from the optimal linear regressor A previously described. It can be shown that the regularized problem reduces to finding β so that
In the higher-dimensional case finding the multiple initial solutions analytically does not appear to be possible. Hence, the basic tenet of homotopy, that of transforming simple initial solutions into final solutions, is violated.
Naturally, other regularization terms can be considered; however, in all cases, careful analysis is required before any success could be claimed for such a procedure. In defense of the work presented here, we note that most of the results describing the topological nature of the data manifolds (Lemmas 1-4, Theorem 5) are independent of how the error is measured, and can be used for general analysis. The equivalence of orthogonal projection and optimization discussed in Section 4 is no longer valid, although the rest of the geometric picture (such as how a network produces its outputs) remains intact.

7 Conclusions
In this paper the topology and geometry of the weight solutions of the natural homotopy for the multilayer perceptron under the MSE criterion have been developed. Different geometric interpretations of the weight solution process have been presented, and related to error surface and homotopy path descriptions. These geometric perspectives provide the insight into the neural mapping needed both for characterizing the topological nature of the solutions and for illustrating possible path behavior by carefully constructed examples. In the linear case, the solutions generally consist of equivalence classes of unbounded, higher-dimensional manifolds. In the nonlinear case, the solutions generically form paths if enough data are available. However, using examples, we have shown that these paths can have multiple bifurcations, that minima might not exist, and that infinite weight solutions can occur. Furthermore, this path behavior occurs for data sets that are not necessarily of measure zero. To prove that the initial and final system solutions connect, the homomorphism induced by the necessary equations should be nonzero for a nontrivial homology theory (Alexander 1978; Alexander and Yorke 1978). However, degree theory cannot be used to verify this for the neural networks described in this paper (at least in Euclidean space), since the solutions are never bounded for the linear case, and in general not for the nonlinear case either. Therefore, open-set degree theory (Lloyd 1978) is not applicable. Even if such a connection could be established, the numerical difficulties are severe. The most profound difficulty is that of finding a path (or lower-dimensional manifold) emanating from one of the linear equivalence classes. In the case of the neural network, such an exit point is not known a priori, nor is it known whether multiple exit points exist.
Therefore, it cannot be guaranteed that all exiting solution paths are found, solutions can undergo arbitrarily large changes of magnitude as T is varied, and the underlying motivation for the homotopy approach is lost. This dilemma is in sharp contrast to predictable bifurcation problems, where it is known where the exit homotopy path occurs from a higher-dimensional manifold [cf. the eigenvector problem discussed in Keller (1977), where the zero vector is always the exit path] or where an
entrance point is known (T < 0) and it is adequate to step through the bifurcation point (Durbin et al. 1989). The fact that a linear (convex) system is transformed to a nonlinear (nonconvex) system is not entirely responsible for the difficulties faced by the homotopy method described in this paper. Linear systems are often used as the initial system of equations in homotopy methods [e.g., the commonly used fixed point and convex-linear homotopy methods (Garcia and Zangwill 1981) use a linear system as an initial point]. In fact, for networks with no hidden layer, the homotopy approach described here does successfully lead to a solution (Coetzee and Stonick 1994a). Also, convexity of the initial system is often crucial for formulating degree arguments to prove that at least one of the initial system solutions connects to a solution of the final equations. Rather, the difficulties for the neural networks described in this paper result from the fact that the functional relationships among the system weight solutions do not vary smoothly or predictably as the homotopy parameter is changed. In contrast, for a single perceptron (even if there is not enough data to specify all the weights exactly) the functional relationships among the linear and nonlinear weights are the same, and hence it is possible to reliably find a combination of weights that can be used as parameters in the homotopy method (Coetzee and Stonick 1994a). Yang and Lu (1993) found the natural homotopy to be useful in obtaining faster convergence during training. Based on the analysis presented in this paper, it is clear that this method suffers from some fundamental difficulties that weaken a number of claims made based on the numerical results. In particular, there is no reason to believe that "good" solutions can be obtained via homotopy, or that infinite weight solutions can be avoided.
The bifurcation when moving from the linear to the nonlinear system prevents claims from being made that any preferred solution path is tracked. However, this does not mean that the method does not provide a viable practical heuristic to aid in obtaining convergence of the method, simply that no strong claims as to the global convergence and exhaustive nature often guaranteed by homotopy methods can be made. Perhaps other homotopy methods may be formulated that do not suffer from the same difficulties. A serious limitation on the use of any homotopy method for global optimization results from the implicit assumption that optimization can be reduced to the solution of systems of equations. Even if a homotopy can be constructed that is theoretically guaranteed to find all solutions, such a method generally does not have a descent property. Therefore new solutions do not necessarily have lower error than solutions that have already been found, and all solutions have to be compared to identify the global minimum. If a large number of stationary points exists on the error surface, this process might not be feasible. Recently reported numerical estimates (Goffe et al. 1994) indicate that this might be the case in the neural network problem. However, homotopy methods might
still offer some advantage over standard descent procedures since they can be constructed to be globally convergent and to produce multiple solutions without repetition (standard approaches are prey to repeatedly finding only solutions with large basins of attraction). Hence a reasonable optimization approach results from continuing the homotopy solution process until an acceptable minimum is found. The geometric formulations and generic results describing the nature of the solutions presented in this paper are independently valuable for constructing and evaluating other algorithms. For example, direct application of Theorem 6 provides bounds on data size generically ensuring non-singular Hessians of the error, a necessity in some optimization procedures (e.g., conjugate gradient procedures). Careful investigation of the differential geometric properties of the various data manifolds described in Section 4.1 could also provide valuable insight into the type of mappings that can be implemented by perceptron networks.
Appendix A: Nonlinear Multilayer Analysis

Note: Lemma 1 is a general result that subsumes results by Poston et al. (1991) (Theorem 3.1) and Sartori and Antsaklis (1991) (Lemma 1) for m ≤ L. The proof in this Appendix follows a process of analytic continuation and contradiction similar to that of Poston et al. (1991), but allows for the more general node nonlinearity required by homotopy.

Lemma 1. Given {Q, β} ∈ ℝ^{L×s} × ℝ^{s×m}. Then, except for a set of measure zero in ℝ^{L×s} × ℝ^{s×m}, the matrix σ(Qβ) is full rank.

Proof. Let Q and β be as stated. If L > m, select the first m rows of σ(Qβ); if L ≤ m, select the first L columns; this generates a square matrix of size m′ = min(m, L). Consider the matrix determinant det σ(Qβ): ℝ^{m′s+sm′} → ℝ. Since det σ(Qβ) is analytic everywhere in both Q and β, it follows that if the determinant vanishes identically on a manifold of dimension 2m′s then it vanishes identically over all of ℝ^{m′s+sm′}. Therefore, if there exists one {Q, β} such that the matrix is full rank, then the theorem follows. A general example is constructed as follows: let Q be constructed by taking the first m′ rows of the matrix p ⊗ I_s, and β by using the first m′ columns of d^T ⊗ I_s, where d and p are vectors with p_1 = 1 and d > 0. Let Q_β be generated by taking the first m′ × m′ submatrix
Therefore, as γ → ∞, P → … (3.3), where G_0 = G(t_0).
Francois Chapeau-Blondeau and Nicolas Chambet
To preserve a coherent link with the synapse description of Section 2 involves relating equation 3.2 to equation 2.8, and equation 3.3 to equation 2.7. Both of these relations point to the fact that an accurate identification can be made between these two pairs of equations only if the conductance G(t) operates far below saturation. In such a case G_sat − G(t) ≈ G_sat, and equation 2.8 can be reduced to equation 3.2. At the same time, a G(t) that remains far below G_sat is associated with …

… Λ > 0, q̃(t) = q(t) − q_m(t), and q_m(t) is the trajectory the coordinates q are required to follow, assumed to be bounded, and at least twice continuously differentiable, with bounded first and second derivatives. It is also convenient to rewrite equation 2.2 as

s(t) = dq̃/dt + Λq̃(t)
Note that this algebraic definition of the error metric s also has a dynamic interpretation: the actual tracking errors q̃ are the output of an exponentially stable linear filter driven by s. Thus, a controller capable of maintaining the condition s = 0 will produce exponential convergence of q̃(t) to zero, and hence exponential convergence of the actual joint trajectories to the desired trajectory q_m(t). The following sections discuss the design of control laws for equation 2.1 that asymptotically drive s to 0, thus also asymptotically assuring perfect tracking of the specified desired trajectory. Section 2.1 reviews the structure of such control laws when perfect information about the dynamics (2.1) is available, and how standard adaptive techniques can "tune up" these controllers in the face of uncertainty on the mass properties of the system. Section 2.2 then discusses how "neural" networks can be used to greatly extend the adaptive capability of the controllers in Section 2.1, permitting uncertainty on the actual structure of the nonlinear functions appearing in 2.1. Section 2.3 next discusses the selection of an appropriate network architecture for use in the controller, and finally Section 2.4 presents the complete specification of the new controller and learning algorithm.

2.1 Stable Adaptive Robot Control and Linear Parameterizations. The state vector for the process is specified in terms of the coordinates q and their derivatives, so that x^T = [q^T, q̇^T] ∈ ℝ^{2n}. With perfect knowledge of H, C, and E, and exact measurements of the state vector, the above derived signals can be used to design an effective nonlinear tracking control algorithm for equation 2.1. Indeed, the control law

τ = −K_D s + τ^{nl}

where K_D is a symmetric positive definite matrix and the nonlinear components are given by

τ^{nl} = H(q)q̈_r + C(q, q̇)q̇_r + E(q, q̇)

will produce asymptotically convergent closed-loop tracking of any smooth desired trajectory q_m (Slotine and Li 1991), with asymptotically stable closed-loop tracking error dynamics given by

H ṡ + C s + K_D s = 0
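The filter interpretation of the error metric s introduced above can be illustrated with a scalar simulation. In this hedged sketch, the filter constant, time step, and initial condition are illustrative assumptions; holding s ≡ 0 makes the tracking error decay exponentially.

```python
import numpy as np

# Scalar version of the error-metric filter: s = d(qt)/dt + lam * qt,
# so d(qt)/dt = -lam * qt + s.  With s held at 0, the tracking error
# qt decays exponentially, as claimed in the text.
lam = 2.0   # illustrative filter constant (Lambda > 0)
dt = 1e-3   # Euler integration step
qt = 1.0    # initial tracking error q~(0)

for _ in range(int(3.0 / dt)):   # simulate 3 time units with s = 0
    qt += dt * (-lam * qt)

print(qt)  # close to exp(-lam * 3) ~ 2.5e-3
```

A bounded nonzero s would likewise produce a bounded q̃, which is the property the controllers below exploit: driving s to zero drives the tracking error to zero.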
Adaptive Control of Robot Manipulators
Since a practical controller implementation has at best partial information about the exact structure of the dynamics, the required nonlinear terms are usually not known exactly. To compensate adaptively for this uncertainty requires first obtaining a factorization of the nonlinear components of the control law:

τ^{nl} = H(q)q̈_r + C(q, q̇)q̇_r + E(q, q̇) = Y(q, q̇, q̇_r, q̈_r) a    (2.4)

Substantial prior knowledge about the system dynamics must be exploited to separate the (assumed known) nonlinear functions comprising the elements of H, C, and E from the (unknown but constant) physical parameters a. Such a factorization is always possible for the rigid body dynamics of a fixed-base manipulator, when the physical uncertainty is on the mass properties of the individual manipulator links (Khosla and Kanade 1985), and arises naturally from the structure of the Lagrangian equations of motion. Using this factorization, but perhaps lacking exact knowledge of the mass properties of the manipulator, the nonlinear components can be implemented using estimates, â, of the true physical parameters, a:

τ = −K_D s + Y â    (2.5)

Such a controller results in the closed-loop dynamics

H ṡ + C s + K_D s = Y ã

where ã = â − a, and the model error Y ã thus acts as a perturbation on the otherwise asymptotically stable closed-loop dynamics. The fundamental result of Slotine and Li (1987) demonstrates that the effects of these perturbations can be asymptotically eliminated by continuously tuning the estimates of the physical parameters according to the adaptation law

dâ/dt = −Γ Y^T s    (2.6)
where Γ is a constant, symmetric, positive definite matrix controlling the rate of adaptation. Indeed, the formal analysis in Slotine and Li (1991) shows that the coupled learning and control strategy, equations 2.5 and 2.6, ensures globally stable operation and asymptotically perfect tracking of any sufficiently smooth desired trajectory. Implementation of the above algorithm, however, requires exact prior knowledge of the component functions of the matrix Y. Of course, for an ideal robotic model, elementary physics directly provides these functions, and for relatively "clean" manipulator designs that are well modeled by this analysis, the above algorithm can be shown to perform extremely well in practice (Larkin 1993; Niemeyer and Slotine 1991; Slotine and Li 1988). However, for many other nonlinear systems whose dynamics can also be represented as in equation 2.1, the physics may be too
Robert M. Sanner and Jean-Jacques E. Slotine
complex or too poorly understood to provide an explicit, closed-form description of each of the nonlinear functions in H, C, and E. For example, the hydrodynamic and hydrostatic forces on an underwater vehicle, or the attitude-dependent solar pressure, aerodynamic, and gravitational torques on a satellite, or even the exact form of friction effects in a slightly more complete robot model, all may be quite difficult to model analytically, leaving the specific nature of some of the functions in τ^{nl} unknown. Moreover, by "hardcoding" into Y a description of the expected environment E, through the choice of specific functions assumed to model these forces, the system may become excessively "rigid," incapable of responding appropriately to unexpectedly different environments. The available "library" of possible responses in this case may not be sufficiently complete to respond appropriately to changes in its nominally assumed environment. These relatively unstructured sources of uncertainty in the dynamics 2.1 can be just as significant as the parameterized uncertainty examined above. They cannot, however, be addressed by the above adaptive techniques, since the prerequisite linear parameterization cannot be determined. The next section thus demonstrates how the established function approximation abilities of "neural" networks (Cybenko 1989; Funahashi 1989; Girosi and Poggio 1990; Hornik et al. 1989) can be used to compensate for this kind of uncertainty, giving the controller the ability to learn the actual component functions of H, C, and E, and thus greatly extending its flexibility and applicability.

2.2 Functional Parameterization and "Neural" Networks. Consider instead the following alternative representation of the nonlinear component of the required control input:
τ^{nl} = M(x) v    (2.7)

or, in component form,

τ^{nl}_i = Σ_{j=1}^{2n+1} M_{i,j}(x) v_j    (2.8)

where v_l = q̈_{r,l}, v_{l+n} = q̇_{r,l} for l = 1 ... n, and v_{2n+1} = 1. Unlike expansion 2.4, which decomposes τ^{nl} into a matrix of known functions, Y, multiplying a vector of unknown constants, a, this expansion decomposes τ^{nl} into a matrix of n(2n + 1) (potentially) unknown functions, M, multiplying a vector of known signals, v. Note that equation 2.7 is merely a more compact method for expressing τ^{nl}: the components of M are just the components of H, C, and E. Thus the nonlinear components of the required control always admit (trivially) the representation τ^{nl} = Mv, while only under specific circumstances can they be represented as τ^{nl} = Ya.
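The contrast between the two parameterizations can be made concrete with a single-link arm. All numbers and symbols below are illustrative assumptions, not the paper's example: the nonlinear torque m·l²·q̈_r + m·g·l·cos(q) can be written either as Y·a (known functions, unknown constants) or as M(x)·v (unknown functions, known signals).

```python
import numpy as np

# Illustrative single-link arm: H(q) = m*l^2 (constant inertia), no
# Coriolis term, gravity/environment torque E(q) = m*g*l*cos(q).
m, l, g = 2.0, 0.5, 9.81            # "true" physical parameters
q, qr_dot, qr_ddot = 0.3, 0.1, 0.4  # arbitrary signal values

tau_nl = m * l**2 * qr_ddot + m * g * l * np.cos(q)

# Linear parameterization (2.4): known functions Y, unknown constants a.
Y = np.array([qr_ddot, np.cos(q)])
a = np.array([m * l**2, m * g * l])

# Functional parameterization (2.7): unknown functions M(x), known
# signals v = [qr_ddot, qr_dot, 1] (here n = 1, so v has 2n+1 = 3 entries).
M = np.array([m * l**2, 0.0, m * g * l * np.cos(q)])
v = np.array([qr_ddot, qr_dot, 1.0])

print(tau_nl, Y @ a, M @ v)  # all three agree
```

The adaptive law of Section 2.1 learns the two constants in a; the functional approach below must instead learn the entries of M as functions of the state.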
Without the ability to determine a Ya factorization, an adaptive controller capable of producing the required control input must instead learn each of the unknown component functions, M_{i,j}(x), as opposed to the conventional model, which must learn only the unknown constants, a. If such a controller used estimates, M̂_{i,j}, in place of the true required functions, the closed-loop dynamics would be

H ṡ + C s + K_D s = M̃ v

where here M̃_{i,j} = M̂_{i,j} − M_{i,j}. Unlike the Ya parameterization considered above, however, it is by no means obvious how the functional estimates M̂_{i,j} should be implemented, nor how they could be continuously tuned so as to eliminate the effects of the perturbations M̃v. To address first the implementation issue, note that for the rigid body dynamics of robots and spacecraft, the components of the matrices H and C are continuous functions of their arguments. Provided that the same is also true of the environmental forces, E, to which the system is subjected, each component of M can be uniformly approximated on any closed, bounded subset of the state space by an appropriately designed "neural" network (Cybenko 1989; Funahashi 1989; Girosi and Poggio 1990; Hornik et al. 1989). That is, given a closed, bounded subset A ⊂ ℝ^{2n} and a prespecified accuracy, ε_M, there exist values for the design parameters N, c_{i,j,k}, and ξ_k so that

|M_{i,j}(x) − Σ_{k=1}^N c_{i,j,k} g_k(x, ξ_k)| ≤ ε_M

for any x ∈ A. This expansion approximates a component of the matrix M using a single hidden layer "neural" network design with the state vector x as the network input; here g_k is the model of the signal processing performed by a single "neural" element or node, ξ_k is a vector of "input weights" associated with node k, and c_{i,j,k} is the output weight associated with that node. The inherent flexibility of the representations afforded by these networks naturally suggests their use in the "functional" adaptive controller discussed above. Indeed, defining

τ̂^{nl}_i = Σ_{j=1}^{2n+1} N̂_{i,j}(x, p) v_j    (2.9)

which uses the network expansion

N̂_{i,j}(x, p) = Σ_{k=1}^N c_{i,j,k} g_k(x, ξ_k)

this structure can accurately approximate the required nonlinear control input for appropriate values of the free network parameters N, c_{i,j,k}, and
ξ_k, which here have been collected into the parameter vector p. To explicitly determine the accuracy of this expansion, define d = τ̂^{nl} − τ^{nl}, so that

|d_i(t)| ≤ ε_M Σ_{j=1}^{2n+1} |v_j(t)|
for any inputs x ∈ A. Since the assumed smoothness of q_m(t) assures that each |v_j(t)| is bounded whenever x(t) ∈ A, over this subset of the state space the discrepancy between the "neural" approximation and the required nonlinear terms can be made arbitrarily small by appropriate design of the network employed. Throughout the discussion that follows, the free parameters in the implementation of the network will be collected together into a single parameter vector p as above. In the general case, p thus consists of the number of nodes, all the input and output weights in the network, and any additional parameters (such as biases or scale factors) that may influence the signal processing performed by each node. Since M is assumed to be unknown a priori, in principle a learning algorithm would need to search for values of each of these different parameters so that the above inequality holds. In many of the specific learning algorithms that follow, however, certain of the network parameters may be fixed to preselected values, determined for example by the size and location of the set A and some measure of the smoothness of the functions the network must approximate. In these cases, p will contain only those parameters of the network that may vary during its operation. In fact, for many classes of "neural" networks, small amounts of additional prior information about the nature of the functions in M (beyond continuity) can be exploited to effectively preassign many of the network design parameters, thus dramatically reducing the number of values that must be learned in order to approximate the specific functions in M. The following section briefly reviews some recently developed methods for determining appropriate values for certain network design parameters, especially for classes of radial basis function networks, i.e., networks in which g_k(x, ξ_k) = g(α_k‖x − ξ_k‖) for a given continuous function g and some positive scaling parameter α_k.
2.3 Network Architecture Selection. Equation 2.9 is simply equation 2.8 where each component of the matrix M is approximated by one of the outputs of a single hidden layer network. The network used in equation 2.9 has the 2n components of the state vector, x, as its input, and 2n² + n outputs, N_{i,j}(x, p), representing the approximations to each M_{i,j}(x). This network is thus being used to "patch together" approximations to the functions M_{i,j} using a collection of simple computing elements g_k. In this approximation theoretic sense, "neural" computation is related to Fourier series, spline, and wavelet expansions. In a network
Adaptive Control of Robot Manipulators
with one hidden layer, for example, selection of a set of input weights for the approximation is comparable to choosing a set of frequencies for a Fourier series expansion, or knots for a spline expansion, or translation parameters for a wavelet expansion. Choice of the output weights in a single hidden layer network is then equivalent to, in each of these three cases, determining the degree to which each resulting basis function contributes to the approximation of M_{i,j}. A similar identification can be made between the components of a "fuzzy" logic approximation and the parameters of a single hidden layer network (Jang and Sun 1993; Wang 1992). Thus, in addition to the standard sigmoidal network models (Rumelhart and McClelland 1986), fuzzy basis function networks (Jang and Sun 1993; Wang 1993), generalized spline and radial basis function networks (Broomhead and Lowe 1988; Girosi et al. 1995; Poggio and Girosi 1990), or even wavelet networks (Pati and Krishnaprasad 1993; Cannon and Slotine 1995; Sanner and Slotine 1992; Zhang and Benveniste 1992), can be used to implement the required control input. Use of these latter two models in particular allows use of powerful approximation theoretic tools that have recently been developed (Daubechies 1992; Powell 1992; Walter 1994), to explicitly bound the size of the required networks and to effectively select fixed values for other network design parameters. For example, Sanner and Slotine (1992) develop a constructive design procedure for gaussian radial basis function networks, using a sampling theoretic analysis that exploits the conjoint space-frequency localization of the gaussian. In this latter construction, if the smooth restriction to A of each of the functions M_{i,j}(x) produces functions with integrable Fourier transforms, the network input weights can be chosen to encode a regular mesh of "sampling" points, kΔ, covering the set A.
Here each k is an integer multi-index that both labels and defines the input weights used in the network, and the mesh size Δ is chosen inversely proportional to the effective bandwidth of the restrictions of M_{i,j}. The same scaling parameter, σ, is used for each node, describing the effective bandwidth of the gaussian low-pass "filter," and is chosen directly proportional to the assumed bandwidth of the functions being approximated. The required output weights in this construction are then identified with the samples of a continuous function, c_{i,j}(x), related to M_{i,j}(x) through a simple convolution, so that c_{i,j,k} = c_{i,j}(kΔ) (note that it is more convenient to use the multi-index label k in place of the scalar index k in these constructions). The resulting network expansion then has the form

N_{i,j}(x, p) = Σ_{k : dist(A, kΔ) ≤ ρ} c_{i,j}(kΔ) exp(−σ² ‖x − kΔ‖²)

where the distance measure used to terminate the summation is given by dist(A, kΔ) ≜ inf_{z ∈ A} ‖z − kΔ‖.
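The truncated gaussian expansion of this form can be sketched in a few lines. This is an illustrative reconstruction only: the weights and centers below are placeholders, not the convolution samples c_{i,j}(kΔ) prescribed by the construction.

```python
import numpy as np

# Illustrative sketch of a truncated gaussian expansion: a weighted sum of
# gaussian bumps centered on mesh points k*Delta retained near the set A.
# Weights and centers here are placeholders, not the convolution samples
# c_{i,j}(k*Delta) of the construction.
def gaussian_net(x, weights, centers, sigma):
    x = np.atleast_1d(np.asarray(x, dtype=float))
    centers = np.atleast_2d(np.asarray(centers, dtype=float))
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distances ||x - k*Delta||^2
    return float(np.dot(weights, np.exp(-sigma ** 2 * d2)))

# A node contributes fully at its own center and decays with distance:
y0 = gaussian_net([0.0], [1.0], [[0.0]], sigma=2.0)   # -> 1.0
y1 = gaussian_net([0.5], [1.0], [[0.0]], sigma=2.0)   # exp(-4 * 0.25) = exp(-1)
```

Each retained node thus acts as a localized "sample" of the target function; the output weights determine how strongly each bump contributes.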
Robert M. Sanner and Jean-Jacques E. Slotine
The total number of nodes N employed by the construction is then simply the number of centers, kΔ, contained within the set {x ∈ R^{2n} | dist(A, x) ≤ ρ}.

Σ_j N_{i,j}(x, p) v_j
Similar to the algorithms considered in Polycarpou and Ioannou (1991), if estimates of these parameters are updated using the projection algorithm

ṗ = Proj{ −Γ (∂N/∂p)^T s }

then the time derivative of the function V = ½(s^T H s + p̃^T Γ^{−1} p̃) satisfies the bound
Noting that p̃ is contained in a compact set if p is, and recalling the assumed continuity of the second derivatives of N_{i,j} and the smoothness of the model trajectory q^d, ‖r(x(t), p(t), v(t))‖ is bounded for all t if x(t) and p(t) are confined to compact sets. Thus since each component of p is confined to the compact set [p_lo, p_hi] by choice of adaptation mechanism, and since (1 − m) vanishes if x lies outside the compact set A, (1 − m)‖r‖ is a uniformly bounded function of time. The stability and convergence properties of this more general adaptive algorithm are thus identical to those considered in Sections 2.4 and 3, augmenting d in inequality 3.7 by the uniform bound on (1 − m)‖r‖. Note especially that knowledge of the uniform bound on (1 − m)‖r‖ is not required to implement the adaptation law; this new perturbation to the closed-loop dynamics serves only to increase the asymptotic tracking error bound. In component form, this new adaptation law can be written as
where the indicated partial derivatives are evaluated at (x(t), p(t)). In particular, when p contains only the output weights, the adaptation mechanism in Section 2.4 is seen to be a special case of this more general adaptation scheme. Like backpropagation techniques (Rumelhart and McClelland 1986), this approach to tuning each network parameter considers only the first-order components of the nonlinear impact of parameter mistuning; the higher order effects are simply treated as additional disturbances to the closed-loop dynamics. The disadvantage is that generally ‖r‖ may be quite large, despite the fact that r → 0 as p̃ → 0. Since the above argument again says nothing about the convergence of p, the neglected higher
order terms may contribute substantially to the asymptotic tracking error bound. More sophisticated methods of parameter adaptation are required to overcome this limitation, taking explicitly into account the exact nonlinear impact of parameter variations. New methods should also permit the actual number of nodes to vary during the learning, perhaps "sampling" densely with "high bandwidth" nodes in regions where the required functions locally exhibit a low degree of smoothness, then sampling sparsely with low bandwidth nodes in regions of greater smoothness. Stable, on-line versions of such techniques are the subject of current research (Cannon and Slotine 1995). 4.3 Disturbances and Unmodeled Dynamics. Actual physical systems are at best approximately modeled by deterministic, finite dimensional differential equations such as 2.1. In general, there may be additional dynamic effects that couple to the rigid body motions captured by equation 2.1, as well as a variety of additional external influences on the system, some of which might best be modeled as stochastic. A complete analysis must thus assess the sensitivity of the convergence proof given above to these neglected physical effects. Significantly, the robust control and adaptation mechanisms utilized in the algorithms developed above were originally developed to accommodate the impact of just such disturbances on the idealized model (2.1). Since these mechanisms are central to the "neural" controller developed above, accommodating the impact of the unmeasurable network approximation error d, the algorithm naturally inherits the ability to accommodate additional disturbance sources. For example, suppose that instead of equation 2.1, a more complete description of the dynamics is
H(q)q̈ + C(q, q̇)q̇ + E(q, q̇) + τ_d + η = τ

where τ_d ∈ R^n is an unmeasurable disturbance torque, and η ∈ R^n represents the additive effects of any unmodeled dynamics. Of course, if the time variations of either τ_d or η have a functional dependence on the instantaneous values q(t) and q̇(t), their effects can simply be included in the definition of E, allowing the adaptive networks to eliminate their effects. In the more general case, the disturbance τ_d is assumed to be independent of the states q and q̇, but uniformly bounded in time. The unmodeled dynamics are assumed to couple with the evolution of the system states through equations of the form

μ ζ̇(t) = F ζ(t) + Φ₁ q̇(t) + Φ₂ q(t)
η(t) = C ζ(t)

where μ is a small positive constant, and the eigenvalues of F all have negative real parts. The dimension of the state space of these unmodeled
dynamics is possibly unknown, but assumed to be finite. Such a model captures, among other effects, the dynamics of the motors used to drive each robotic joint (Reed and Ioannou 1989). Two cases can now be identified: μ = 0 and μ > 0. If μ = 0, the system is perturbed only by the bounded disturbance torque τ_d. In this case, the analysis of Section 3 contains an additional term; by simply increasing each sliding gain, k_i, by τ̄_d = sup_t ‖τ_d(t)‖, the disturbance merely acts to augment the asymptotic bound on the energy in the tracking errors.
If μ > 0, the unmodeled dynamics may couple to the dynamics (2.1) used to design the controller structure. Intuitively in this case, stability can be preserved by preventing the unmodeled dynamics from becoming excited, for instance, by assuring that the input torques are neither excessively large nor too rapidly changing. In fact, a formal analysis of this situation (Reed and Ioannou 1989; Slotine and Li 1991) shows that the robust adaptation algorithms above will still provide stable, convergent operation, provided essentially that the feedback gains and learning rates are small compared to the bandwidth 1/μ of the unmodeled dynamics, and that the total input torque is sufficiently smooth. This latter constraint requires avoiding the discontinuous inputs possible when the sliding controller is active. By instead replacing the discontinuous sgn(s_i) terms in the sliding controller with the smoother sat(s_i/Φ), where Φ describes the width of a boundary layer whose size is inversely proportional to the bandwidth 1/μ, these discontinuities are avoided and the unmodeled dynamics are not excited (Slotine and Li 1991). Note that, with this choice of Φ, in the limiting case μ → 0 corresponding to the ideal dynamics (2.1), the saturation function approaches the sign function used in the controller of Section 2.4.

5 Robotic Example
As a relatively simple example with which to illustrate the essential features of the algorithm, consider a planar, two-joint robotic manipulator,
whose actual dynamics can be written in the form 2.1 with

H₁,₁ = a₁ + 2a₃ cos(q₂) + 2a₄ sin(q₂)
H₁,₂ = H₂,₁ = a₂ + a₃ cos(q₂) + a₄ sin(q₂)
H₂,₂ = a₂
C₁,₁ = −h(q₂) q̇₂
C₁,₂ = −h(q₂)(q̇₁ + q̇₂)
C₂,₁ = h(q₂) q̇₁
C₂,₂ = 0
E₁ = E₂ = 0

where h(q₂) = a₃ sin(q₂) − a₄ cos(q₂) (Slotine and Li 1991). For this simulation the parameters a₁ = 3.3, a₂ = 0.97, a₃ = 1.04, and a₄ = 0.6 are used, and the robot is initialized so that x(0) = [0, 0.75, 0, 0]^T. Note that, given this structure, the matrix C₀, used in Section 4.2.1, can be written as

C₀(q₁, q₂) = h(q₂) [ 0  −1  −1  −1
                    1   0   0   0 ]
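For concreteness, the inertia and Coriolis matrices above can be sketched directly. This is a hedged reconstruction using the standard two-link form of Slotine and Li (1991) and the parameter values quoted in the text; the function and variable names are our own.

```python
import numpy as np

# Hedged sketch of the two-link dynamics above; parameter values are those
# quoted in the text, and the matrix entries follow the standard two-link
# form of Slotine and Li (1991). Names are our own.
a1, a2, a3, a4 = 3.3, 0.97, 1.04, 0.6

def h(q2):
    return a3 * np.sin(q2) - a4 * np.cos(q2)

def H(q):
    # symmetric inertia matrix H(q); depends on q2 only
    h12 = a2 + a3 * np.cos(q[1]) + a4 * np.sin(q[1])
    return np.array([[a1 + 2 * a3 * np.cos(q[1]) + 2 * a4 * np.sin(q[1]), h12],
                     [h12, a2]])

def C(q, qdot):
    # Coriolis/centripetal matrix, with the skew-symmetry of Hdot - 2C
    # characteristic of rigid-body dynamics
    hv = h(q[1])
    return np.array([[-hv * qdot[1], -hv * (qdot[0] + qdot[1])],
                     [hv * qdot[0], 0.0]])
```

A quick consistency check on this form: H(q) is symmetric, and Ḣ − 2C is skew-symmetric, the passivity property the adaptive design exploits.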
The desired trajectories are chosen to be

q₁^d(t) = 1.33[1 − cos(0.75πt)]
q₂^d(t) = 0.75 cos(2πt)
so that the set A_d ⊂ R⁴ can be taken as A_d = [0, 2.66] × [−0.75, 0.75] × [−π, π] × [−1.5π, 1.5π], or, for convenience, as the unit ball centered at x₀ = [1.5, 0, 0, 0]^T with respect to the scaled infinity norm ‖x‖_{∞,α} = max_i |x_i/α_i|, using scale factors α₁ = 1.5, α₂ = 1, α₃ = 3.5, and α₄ = 5. The set A is taken as a slightly larger superset of A_d,

A = {x | ‖x − x₀‖_{∞,α} ≤ 1 + Ψ}
and the modulation function is then computed so that m(x) = 0 for u(t) ≤ 1 and m(x) = 1 for u(t) ≥ 1 + Ψ, with a continuous transition in between, where u(t) = ‖x(t) − x₀‖_{∞,α} and the width of the transition region is Ψ = 0.1. Using the dimensionality reduction described in Section 4.2.1, the network used in the control law has the two inputs q₁ and q₂ and the 2² + 2³ = 12 outputs needed to implement approximations to the functions in H(q₁, q₂) and C₀(q₁, q₂) (actually, note here that the true matrices H and C₀ are functions of q₂ only). A gaussian network is employed, using the construction parameters Δ = 0.25, σ = 2π, and ρ = 5Δ = 1.25, so that, given the above definition of the set A, this network has a total of 437 nodes and 5244 output weights which must be learned.
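The node count can be reproduced with a short calculation. This is a sketch under two stated assumptions: that the truncation test uses the unscaled sup-norm distance to A, and that A projects onto the box [−0.15, 3.15] × [−1.1, 1.1] in the (q₁, q₂) network inputs (the scaled ball of radius 1.1 about x₀).

```python
import itertools

# Sketch reproducing the node count of the gaussian-mesh construction,
# assuming (i) the truncation test uses the unscaled sup-norm distance to A
# and (ii) A projects onto the box [-0.15, 3.15] x [-1.1, 1.1] in the
# (q1, q2) inputs. Both are our reading of the text, not given explicitly.
Delta = 0.25
rho = 5 * Delta                       # truncation radius, rho = 1.25
lo, hi = (-0.15, -1.1), (3.15, 1.1)   # projection of A onto (q1, q2)

def dist_to_box(p):
    # sup-norm distance from the point p to the box [lo, hi]
    return max(max(l - c, c - h, 0.0) for c, l, h in zip(p, lo, hi))

ks = range(-40, 41)  # generous range for the integer multi-index k
centers = [(k1 * Delta, k2 * Delta)
           for k1, k2 in itertools.product(ks, ks)
           if dist_to_box((k1 * Delta, k2 * Delta)) <= rho]
print(len(centers))   # 437 nodes; with 12 outputs, 437 * 12 = 5244 weights
```

Under these assumptions the retained mesh is 23 × 19 centers, recovering the quoted totals of 437 nodes and 5244 output weights.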
Since each entry of H and C₀ is either a sine, cosine, constant, or a sum of these, each component function may be assumed to have the general form κ₁ + κ₂ cos(2πη^T q) + κ₃ sin(2πη^T q) with ‖η‖ ≤ 1 and |κ_i| ≤ 5; the choice of construction parameters and the analysis of Sanner (1993) then suggest that each of the required output weights is conservatively bounded in magnitude by |c_{i,j,k}| ≤ 16.
so that Ici,j,kl 5 16. This bound is used with the weight decay adaptation laws 2.14 and 4.2, using a different w for each output weight. The weight decay parameter, WO, is taken as wo = 5 and the adaptation gains are 'y;,j,k = 2, for each i, j, and k. The initial condition on each output weight is taken to be zero, simulating a total lack of prior knowledge of the parameters required by the control law. The gains of the sliding controller are taken so that kl(ql392rf)
= k2(91,92,t) = 2511ii'(t)II +30ll[q(t)il'(t)]ll
using worst case bounds on the magnitude of the elements of H and CO. The error metric s is computed using equation 2.2 with h = 201; and finally, the gains KD = 1001are used for the linear feedback components of the control law 2.14. Figures 1 and 2 compare the results of attempting to force the robot to follow the model trajectory, first without use of the adaptive gaussian networks, i.e., using just the linear feedback and (if required) sliding components of the control law 2.10, then including the adaptive network contributions. Use of the networks improves the initial worst case tracking errors by a factor of three over the I'D controller, and this ratio rapidly improves as the network learns more about the dynamics of the arm. Figure 3 plots the average energy of each component of the error metric s during the simulation. As predicted above, these quantities are asymptotically converging to small values. 6 Concluding Remarks
In this paper, "neural" networks have been used to extend an existing nonlinear control methodology for a special class of multivariable systems. By merging the techniques developed in our previous analysis of "neural" adaptive control algorithms with powerful, physically motivated methods of exploiting the mechanical passivity of these new systems, we have developed a class of adaptive control algorithms with a much broader range of applicability than either of its progenitors (Sanner and Slotine 1992 or Slotine and Li 1987) provide. In the process, networks have been incorporated into the existing framework of robotics and control theory, allowing a precise determination of effective adaptive control
structures that exploit these devices.

[Figure 1: Comparison of robot tracking performance under PD and adaptive control laws. Two panels plot the joint 1 and joint 2 tracking errors against time (sec.) under the PD controller and under the adaptive controller.]

[Figure 2: Comparison of control signals used in the PD and adaptive tracking simulations for the two-degree-of-freedom robot. Two panels plot the applied torques against time (sec.) under the PD controller and under the adaptive controller.]

This technique seems to us more straightforward than the alternative of attempting to reinterpret control theory in terms of the properties of "neural" networks. The above control and adaptation laws may appear formidable in their complexity, but this is mostly due to the large number of parameters that may appear in the control law, and the concomitant notational baggage needed to describe how each is used and modified. This increase in parameters is to be expected given the increased flexibility allowed by the algorithm; the underlying control and adaptation mechanisms, however, are quite simple. Each network output is multiplied by one of the components of v (or w_j) and the results are summed to implement the adaptive components of the control law. Each output weight of each network changes according to the product of the output of the node to which it is attached, the signal it multiplies, and the appropriate component of the
tracking error.

[Figure 3: Convergence of the average energy in each component of the error metric s, plotted against time (sec.).]

A weight decay term is added if the weight magnitudes become excessively large, and adaptation is halted (but decay may continue) whenever the state vector leaves the set on which good network approximation can be ensured. Since the "neural" controller does not require an explicit Ya linear parameterization of the dynamics, it is capable of solving adaptive robotic problems for which such parameterizations are impossible, even when the functional form of the equations of motion is quite well known. For example, the dynamics of "free-floating" robotic manipulators, that is, manipulators that are mounted on orbital or submersible bases whose orientation is not independently controlled, can also be written in the form 2.1, but cannot be linearly parameterized in terms of the (possibly unknown) mass properties of the manipulator and its load (Papadopoulos 1990). Nonetheless, the above algorithm can be shown to produce results for such systems comparable to those shown for the fixed-base manipulator examined in Section 5 (Sanner and Vance 1995). In the face of real-world uncertainty on the physical properties of the system or of the environment with which it interacts, properly utilized "neural" networks can thus represent a significant new enabling technology in robotics, providing unique solutions for important practical problems that otherwise cannot be solved with established adaptive control techniques. The stability and convergence properties of the algorithm described provide the assurances of reliability and effectiveness
needed to make such controllers viable alternatives to existing control algorithms. Of course, to implement the full algorithm described above in real time on a serial, digital microprocessor would currently pose a difficult computational task for a typical six-degree-of-freedom manipulator. However, one of the promises of the above approach to function approximation and estimation is the future availability of hardware that implements the required computations in parallel. Such parallel computations are indeed felt to underlie the sensorimotor coordination of living organisms, which likely are not equipped with structured, Lagrangian models of the dynamic systems that govern their movement, and must rather construct approximate representations of these dynamics from the aggregates of relatively simple processing and actuating units at their disposal. The algorithms detailed in this paper, while not intended to provide a plausible explanation of this capability in living creatures, help solidify and formalize recent progress toward reliably endowing cybernetic constructions with these capabilities.
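The weight-adaptation mechanism summarized above (node output times multiplying signal times tracking-error component, with weight decay and a modulation gate) can be sketched as follows. The array shapes, gains, gate convention, and decay trigger are illustrative assumptions, not the paper's exact laws 2.14 and 4.2.

```python
import numpy as np

# Hedged sketch of the adaptation summarized above. Each output weight
# c[i, j, k] changes according to the product of the gaussian node output
# g[k], the signal v[j] it multiplies, and the error-metric component s[i].
# Adaptation is gated off outside the approximation set A (m = 1 there),
# and a decay term engages when a weight magnitude grows excessively
# large. All shapes, gains, and the decay trigger are illustrative.
def update_weights(c, g, v, s, m, gamma=2.0, w0=5.0, c_max=16.0, dt=1e-3):
    # c: (I, J, K) output weights; g: (K,); v: (J,); s: (I,); m in [0, 1]
    cdot = -(1.0 - m) * gamma * np.einsum('i,j,k->ijk', s, v, g)
    decay = np.where(np.abs(c) > c_max, -w0 * c, 0.0)  # weight decay term
    return c + dt * (cdot + decay)
```

With m = 1 (state outside A) the gradient term vanishes and only decay, if triggered, changes the weights, mirroring the halt-but-decay behavior described above.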
References

Apostol, T. M. 1974. Mathematical Analysis. Addison-Wesley, Reading, MA.
Arimoto, S., Kawamura, S., and Miyazaki, F. 1984. Bettering operation of robots by learning. J. Robot. Syst. 1(2), 123-140.
Atkeson, C. 1989. Learning arm kinematics and dynamics. Annu. Rev. Neurosci. 12, 157-183.
Atkeson, C. G., and Reinkensmeyer, D. J. 1990. Using associative content-addressable memories to control robots. In Neural Networks for Control, T. W. Miller, R. S. Sutton, and P. J. Werbos, eds. MIT Press, Cambridge, MA.
Barron, A. R. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. IT 39, 930-945.
Barto, A. G., Sutton, R. S., and Anderson, C. W. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cyber. 13, 834-846.
Braitenberg, V. 1984. Vehicles: Experiments in Synthetic Psychology. MIT Press, Cambridge, MA.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Cannon, M., and Slotine, J. J. E. 1995. Space-frequency localized basis function networks for nonlinear system estimation and control. Neurocomputing 7(5).
Craig, J. J. 1986. Introduction to Robotics: Mechanics and Control. Addison-Wesley, Reading, MA.
Cybenko, G. 1989. Approximations by superposition of a sigmoidal function. Math. Cont. Sig. Syst. 2, 303-314.
Daubechies, I. 1992. Ten Lectures on Wavelets. SIAM, Philadelphia, PA.
DeVore, R., Howard, R., and Micchelli, C. 1989. Optimal nonlinear approximation. Manuscripta Mathematica, Vol. 63, pp. 469-478.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Girosi, F., and Anzellotti, G. 1992. Rates of convergence of approximation by translates. Artificial Intelligence Lab. Memo, No. 1288. MIT, Cambridge, MA.
Girosi, F., and Poggio, T. 1990. Networks and the best approximation property. Biol. Cybern. 63, 169-176.
Girosi, F., Jones, M., and Poggio, T. 1995. Regularization theory and neural network architectures. Neural Comp.
Gomi, H., and Kawato, M. 1993. Neural-network control for a closed-loop system using feedback-error-learning. Neural Networks 7(1).
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Ioannou, P., and Datta, A. 1991. Robust adaptive control: A unified approach. Proc. IEEE 79(12), 1736-1768.
Ioannou, P., and Kokotovic, P. V. 1984. Instability analysis and the improvement of robustness of adaptive control. Automatica 20(5), 583-594.
Jang, J.-S., and Sun, C.-T. 1993. Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Trans. Neural Networks 4(1), 156-158.
Jordan, M. I. 1990. Learning inverse mappings using forward models. Proc. 6th Yale Workshop Adaptive Learning Syst., 146-151.
Jordan, M. I., and Rumelhart, D. E. 1992. Forward models: Supervised learning with a distal teacher. Cog. Sci. 16, 307-354.
Kawato, M., Furukawa, K., and Suzuki, R. 1987. A hierarchical neural-network model for control and learning of voluntary movement. Biol. Cybern. 57, 169-185.
Kelly, S. E., Kon, M. A., and Raphael, L. A. 1994. Pointwise convergence of wavelet expansions. Bull. Am. Math. Soc. 30(1), 87-94.
Khosla, P., and Kanade, T. 1985. Parameter identification of robot dynamics. IEEE Conf. Decision Control, Fort Lauderdale, FL.
Larkin, D. 1993. Implementation of an adaptive controller. Rob. Ind. Assoc. Conf., Detroit.
Messner, W., Horowitz, R., Kao, W.-W., and Boals, M. 1991. A new adaptive learning rule. IEEE Trans. Autom. Cont. 36(2), 188-197.
Mhaskar, H. N. 1993. Approximation properties of a multilayered feedforward artificial neural network. Adv. Comp. Math. 1, 61-80.
Miller, W. T., Glanz, F. H., and Kraft, L. G. 1987. Application of a general learning algorithm to the control of robotic manipulators. Int. J. Robot. Res. 6, 84-98.
Miller, T. W., Sutton, R. S., and Werbos, P. J., eds. 1990. Neural Networks for Control. MIT Press, Cambridge, MA.
Narendra, K. S., and Annaswamy, A. 1989. Stable Adaptive Systems. Prentice-Hall, Englewood Cliffs, NJ.
Narendra, K. S., and Parthasarathy, K. 1990. Identification and control of dynamical systems using neural networks. IEEE Trans. Neural Networks 1, 4-27.
Narendra, K. S., and Parthasarathy, K. 1991. Gradient methods for the optimization of dynamical systems containing neural networks. IEEE Trans. Neural Networks 2, 252-262.
Niemeyer, G., and Slotine, J.-J. E. 1991. Performance in adaptive manipulator control. Int. J. Robot. Res. 10(2).
Papadopoulos, E. G. 1990. On the dynamics and control of space manipulators. Ph.D. Thesis, Department of Mechanical Engineering, MIT, Cambridge, MA.
Pati, Y. C., and Krishnaprasad, P. S. 1993. Analysis and synthesis of feedforward networks using discrete affine wavelet transformations. IEEE Trans. Neural Networks 4, 73-85.
Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. Proc. IEEE 78(9), 1481-1497.
Polycarpou, M., and Ioannou, P. 1991. Identification and Control of Nonlinear Systems Using Neural Network Models: Design and Stability Analysis. TR No. 9109-01, USC Dept. EE-Systems.
Powell, M. J. D. 1992. The theory of radial basis function approximation in 1990. In Advances in Numerical Analysis, Vol. II: Wavelets, Subdivision Algorithms, and Radial Basis Functions, W. A. Light, ed., pp. 105-210. Oxford University Press, Oxford.
Reed, J. S., and Ioannou, P. A. 1989. Instability analysis and robust adaptive control of robotic manipulators. IEEE Trans. Robot. Aut. 5, 381-386.
Rudin, W. 1991. Functional Analysis, 2nd ed. McGraw-Hill, New York.
Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, Cambridge, MA.
Sanner, R. M. 1993. Stable adaptive control and recursive identification of nonlinear systems using radial gaussian networks. Ph.D. Thesis, Department of Aeronautics and Astronautics, MIT, Cambridge, MA.
Sanner, R. M., and Slotine, J.-J. E. 1992. Gaussian networks for direct adaptive control. IEEE Trans. Neural Networks 3(6), 837-863.
Sanner, R. M., and Vance, E. E. 1995. Adaptive control of free-floating space robots using 'neural' networks. SSL Report 94-05, Department of Aeronautical Engineering, University of Maryland, College Park, MD. Proc. 1995 American Control Conference, in press.
Shadmehr, R., and Mussa-Ivaldi, F. 1994. Adaptive representation of dynamics during learning of a motor task. J. Neurosci. 14(5), 3208-3224.
Slotine, J.-J. E., and Di Benedetto, M. D. 1990. Hamiltonian adaptive control of spacecraft. IEEE Trans. Aut. Control AC-35, 848-852.
Slotine, J.-J. E., and Li, W. 1987. On the adaptive control of robotic manipulators. Int. J. Robot. Res. 6(3).
Slotine, J.-J. E., and Li, W. 1988. Adaptive manipulator control: A case study. IEEE Trans. Autom. Control 33(11).
Slotine, J.-J. E., and Li, W. 1991. Applied Nonlinear Control. Prentice-Hall, Englewood Cliffs, NJ.
Slotine, J.-J. E., and Sanner, R. M. 1993. Neural networks for adaptive control and recursive identification: A theoretical framework. In Essays on
Control: Perspectives in the Theory and its Applications, H. L. Trentelman and J. C. Willems, eds., pp. 381-436. Birkhauser, Boston.
Strang, G., and Fix, G. 1973. A Fourier analysis of the finite element variational method. In Constructive Aspects of Functional Analysis, G. Geymonat, ed., pp. 793-840. Cremonese, Rome.
Sweldens, W., and Piessens, R. 1995. Quadrature formulae and asymptotic error expansions for wavelet approximations of smooth functions. SIAM J. Num. Anal. (in press).
Walter, W. G. 1950. An imitation of life. Sci. Am., 42-45.
Walter, W. G. 1951. A machine that learns. Sci. Am., 60-63.
Walter, G. G. 1994. Wavelets and Other Orthogonal Systems with Applications. CRC Press, Boca Raton, FL.
Wang, L.-X. 1992. Fuzzy basis functions, universal approximation, and orthogonal least-squares learning. IEEE Trans. Neural Networks 3(5), 807-814.
Wang, L.-X. 1993. Stable adaptive fuzzy control of nonlinear systems. IEEE Trans. Fuzzy Logic 1(2), 146-155.
Wiener, N. 1961. Cybernetics: or Control and Communication in the Animal and the Machine, 2nd ed. MIT Press, Cambridge, MA.
Zemanian, A. H. 1965. Distribution Theory and Transform Analysis. McGraw-Hill, New York.
Zhang, Q., and Benveniste, A. 1992. Wavelet networks. IEEE Trans. Neural Networks 3, 889-898.

Received May 12, 1994; accepted November 23, 1994.
Communicated by Richard Lippmann
A Modular and Hybrid Connectionist System for Speaker Identification

Younès Bennani
C.N.R.S., L.I.P.N. URA-1507, University of Paris-Nord, Av. J-B. Clément, 93430 Villetaneuse, France
This paper presents and evaluates a modular/hybrid connectionist system for speaker identification. Modularity has emerged as a powerful technique for reducing the complexity of connectionist systems, and allowing a priori knowledge to be incorporated into their design. Text-independent speaker identification is an inherently complex task where the amount of training data is often limited. It thus provides an ideal domain to test the validity of the modular/hybrid connectionist approach. To achieve such identification, we develop, in this paper, an architecture based upon the cooperation of several connectionist modules, and a Hidden Markov Model module. When tested on a population of 102 speakers extracted from the DARPA-TIMIT database, perfect identification was obtained.

1 Introduction
Connectionist systems have gained widespread acceptance for tackling problems where the relationship between input and desired output is highly complex and nonlinear. Unfortunately, the required connectionist system often has a large number of parameters, while the amount of training data is frequently very limited. This places a serious constraint on the ability of the system to correctly generalize. Two ways to handle this problem are to attempt to reduce the system's complexity and to incorporate a priori knowledge into its architecture. Since a complex problem can often be decomposed into a series of much simpler subproblems, decomposing the single connectionist system into a set of modules that tackles each of these subproblems, while cooperating together to solve the global problem, is a powerful method to both reduce complexity and incorporate a priori knowledge about the problem. In addition, nonconnectionist modules can be used, especially if, for certain subproblems, they have clear advantages over alternative connectionist modules. In this paper, we will describe a modular/hybrid connectionist system for speaker identification. A review of connectionist approaches for speaker recognition can be found in Bennani and Gallinari (1994). We present in Section 2 the modular architecture, the learning, and the Neural Computation 7, 791-798 (1995) © 1995 Massachusetts Institute of Technology
Younes Bennani
identification strategies. In Section 3, we show how to add new speakers to the system. Finally, the results are discussed in Section 4.
2 A Modular Connectionist Architecture for Text-Independent Speaker Identification

2.1 Architecture of the Modular Connectionist System. We have used a significant part of the TIMIT database, containing around 100 speakers from the first two dialects. A full description of this database can be found in Fisher et al. (1987). LPCC analysis of order 16 was performed on the speech signal. As shown in Hampshire and Waibel (1989), Rudasi and Zahorian (1991), and Bennani and Gallinari (1991), it is relatively easy to train a system to perform identification on a small population. However, when the population size increases, the performance of the system progressively degrades. We have thus decided to use a method that breaks the population down into subgroups. Within the classes of females and males it may be possible to distinguish many subclasses, which group together speakers with similar vocal characteristics. We have found that precisely such a subdivision is possible by using a k-means clustering technique, with each speaker labeled by a majority vote over the set of speech vectors (LPCC) formed by the training data. In essence, this subdivision of the population of speakers reflects an underlying structure in the problem, which is a form of a priori knowledge. We will refer to each of these subgroups as a typology, and our proposed system is based on using a separate connectionist module for each typology. The system illustrated in Figure 1 consists of two types of networks: a typology detector and expert modules. Each expert module of the system is dedicated to the discrimination between speakers of the same typology. The specialized module for typology detection plays the role of an information gating network. At this level the system architecture can be designed in two ways. The first case (Fig. 1) is where the typology detection module contributes to the final score in the form of a weight factor for the scores of the expert modules.
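The case-1 decision rule just described, a gating-weighted sum of expert scores accumulated over input windows, can be sketched as follows (a minimal NumPy illustration; the dummy expert and gate callables and all probability values below are invented, not taken from the paper):

```python
import numpy as np

def identify(windows, experts, gate):
    """Case-1 fusion: weight each expert's per-speaker probabilities by the
    typology detector's output, then accumulate over all input windows.

    experts[j](w) -- vector of P(L_i | w, T_j) over all speakers
                     (zero for speakers outside typology j)
    gate(w)       -- vector of P(T_j | w) over typologies
    """
    total = 0.0
    for w in windows:
        g = gate(w)
        # per-window activation: sum_j P(L_i | w, T_j) P(T_j | w)
        total = total + sum(g[j] * experts[j](w) for j in range(len(experts)))
    return int(np.argmax(total))  # index of the identified speaker

# Illustrative dummy modules: 2 typologies, 2 speakers, 3 input windows.
experts = [lambda w: np.array([0.9, 0.0]), lambda w: np.array([0.0, 0.8])]
gate = lambda w: np.array([0.3, 0.7])
print(identify(range(3), experts, gate))
```

Because the typologies partition the speaker population, each expert contributes scores only for its own speakers, so in practice a single term of the inner sum is nonzero per speaker.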
The second case is where the typology detection module serves to orient the input toward the appropriate expert module. It is, however, necessary to note that identification time is significantly shorter in the second case, since preselecting the expert module saves the need to compute the other experts' outputs. However, with the first architecture, an error occurring during typology detection can be compensated by the expert modules, which is not the case with the second architecture. Case 1 will be used in this section and case 2 in Section 3. Training of similar modular multiexpert approaches has been studied by Jacobs et al. (1991), who used them for control tasks (Jacobs and Jordan 1993).
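The k-means/majority-vote subdivision of the speaker population described above can be sketched as follows (a plain Lloyd iteration in NumPy; the data shapes, initialization, and iteration count are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def assign_typologies(lpcc, speaker_ids, k, iters=20, seed=0):
    """Cluster LPCC frames with k-means, then give each speaker the
    typology (cluster) where the majority of its frames fall.

    lpcc        -- (num_frames, dim) array of cepstral vectors
    speaker_ids -- (num_frames,) array mapping each frame to a speaker
    k           -- number of typologies
    """
    rng = np.random.default_rng(seed)
    centers = lpcc[rng.choice(len(lpcc), k, replace=False)]
    for _ in range(iters):
        # nearest-center label for every frame
        d = ((lpcc[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):                      # recompute cluster centers
            if (labels == j).any():
                centers[j] = lpcc[labels == j].mean(0)
    # majority vote per speaker over its frame labels
    typology = {}
    for s in np.unique(speaker_ids):
        votes = np.bincount(labels[speaker_ids == s], minlength=k)
        typology[s] = int(votes.argmax())
    return typology

# Invented toy data: two speakers with well-separated frame clusters.
frames = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
                    np.random.default_rng(2).normal(10, 0.1, (20, 2))])
speakers = np.array([0] * 20 + [1] * 20)
print(assign_typologies(frames, speakers, k=2))
```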
Modular/Hybrid Connectionist System
Figure 1: Architecture of the modular connectionist system. The speech coefficients X enter the system and are fed to each typology expert module and to the typology detector module. Each expert module outputs a probability, P(L_i | X, T_j), for speaker i within typology j. The typology detector module outputs a probability, P(T_j | X), that the speech X belongs to typology j. The decision module combines these probabilities to produce a final probability per speaker, P(L_i | X), given by Σ_j P(L_i | X, T_j) P(T_j | X).

2.2 Learning Phase. The components of our speaker identification architecture are TDNN-type modules (Waibel et al. 1987; Lang and Hinton 1988). The input size of these networks is fixed, as in the majority of connectionist models. This poses a problem with speech data, where the sentences are not all of the same size. To handle this problem, we proceed by sliding a fixed-size window over each sentence. More specifically, we proceed in the following fashion. We divide each sentence into a set of successive windows. Each window is composed of 25 spectral vectors or frames, with an overlap of 5 frames between two successive windows. Each window is an input to the system. We will call this type of TDNN an STDNN (for shift TDNN). All the modules have virtually the same architecture. They are three-layer nets with the following topology. The input layer has 16 x 25 cells, which correspond to the 25 successive time frames (= 0.25 sec total) over the LPCCs. The first hidden layer has 12 feature extractors or independent cells, replicated 21 times. Each cell is connected to 5 consecutive input time frames, and this local window is shifted one frame to the right for the next hidden cell in this layer. The second hidden layer has 10 feature extractors, connected to 7 consecutive
columns of 12 cells in the previous layer with an overlap of 6 columns. The output layer is fully connected to the last hidden layer. LPCC vectors for all speakers are clustered by a k-means algorithm. Then each speaker is given the typology where the majority of its LPCC vectors are found. This k-means clustering thus provides an initialization of the typologies. For the population used in this work (102 speakers) we have found that 16 typologies produced a balanced set of clusters. The typology detection STDNN module is trained to classify the input speech according to the typology label found by the k-means technique. Expert modules are trained to recognize the speakers within each typology.

2.3 Identification Phase. For identification, all frames of a sentence are presented to the system, in successive windows of 25 acoustic vectors. At time t, when presented with a window W_t, the system produces m activations a_t(L_i), i = 1, ..., m, computed as follows: at time t, the output of the typology detector is used as a weighting factor for the outputs of the expert modules. So the activation a_t(L_i) at time t is given by
a_t(L_i) = Σ_j P(L_i | W_t, T_j) P(T_j | W_t)    (2.1)

Notice that, since our typologies form a partition of the speaker population, only one term in (2.1) is nonzero. However, this formulation would allow for a more general case where typologies could overlap. Successive activations of the system are accumulated over the duration of the sentence to give the final activation A(L_i) for each speaker:

A(L_i) = Σ_t a_t(L_i)    (2.2)

The final speaker identification decision is given by L_i* = arg max_i A(L_i).

For n ≫ 1, one can readily verify that the variance is O(1/n), and thus zero in the thermodynamic limit (n → ∞), which is the underlying assumption of self-averaging in statistical mechanics calculations. For p = n, the variance is zero since the VS collapses to a single point. In Figure 2, we plot Σ² as a function of α for a perceptron of dimension n = 10 and n = 100, with the number of test examples set to the number of training examples. For small values of α, there is a correspondingly large test error variance, decreasing monotonically with increasing α. The variance of the test error for α close to 1 is small, indicating that students in the VSs generated by random training sets have almost equal test errors. For large n, Σ² decays as 2(1 − α)/n which, for fixed α, decreases like 1/n.
Test Error Fluctuations
3 Optimal Test Set Size
In this section, we turn our attention to the partitioning of a data set of examples into a training set and a test set. That is, given a data set of l elements, how many elements should be assigned to the training set, and the rest to the test set, given that we wish to produce a student with a low generalization function? Looking for a student that has a low test error does not necessarily mean that the student will have a low generalization function, unless we can show that the test error will (at least on average) be close to the generalization function. As mentioned in the introduction, by applying the central limit theorem, the difference between the generalization function and the test error will be distributed in a gaussian manner (Feller 1970) with mean zero. The standard deviation of this distribution is over the realizations of the test set. This means, for example, that the generalization function, with probability 0.84, will not lie more than one standard deviation above the test error. This bound, however, is dependent on the actual test error value, whereas we will here be interested in the typical upper bound when one takes into account the version space and different possible training sets. We therefore replace the test error by its average, the generalization error, and the standard deviation over test sets alone by that over test sets, students, and training sets. That is, we define the average probabilistic upper bound on the generalization function as
ε_ub(m | l) = ε_g + τΣ
Setting τ = 1, we will be 84% confident that the generalization function will, on average, not be more than one standard deviation above the test error. Similarly, for τ = 2, we will be 98% confident that ε(w | w⁰) will, on average, be less than two standard deviations above the test error.⁵ If we fix the size of the data set, l, we can consider the variance and generalization error as a function of the test set size, m, the training set size being given by p = l − m. In Figure 3 the generalization error and standard deviation are plotted for a perceptron of dimension n = 400 and data set size l = 200. For small m, the standard deviation is large and the generalization error is small, the perceptron having been trained on a relatively large number of examples. This situation reverses as m is increased, which gives rise to a minimum in the upper bound ε_ub(m | l) at m = m*. We note from Figure 3 that this is at m* = 24 for τ = 1. The dependence of m* on these quantities for finite n and l is rather complicated; however, in the limit of large n, and setting l = γn, we obtain the following scaling law for the optimal test set size,
m* ≃ (1/2) [2τ(1 − γ) n]^{2/3}
⁵Here we have employed standard results about the percentage of the normal curve less than a certain number of standard deviations from the mean.
D. Barber, D. Saad, and P. Sollich
Figure 3: The standard deviation, generalization error, and upper bound (τ = 1) plotted against the test set size m. The dimension of the perceptron is n = 400, with data set size l = 200 and training set size p = l − m. As m increases, the deviation of errors decreases, whereas the generalization error increases (as the number of training examples decreases). The value m* for which the upper bound is minimized represents the optimal test set size; in this case, m* = 24.
Or, writing this as the optimal fraction of the data set to be used for testing,

m*/l ≃ [2τ(1 − γ) n]^{2/3} / (2γn)

For fixed τ, γ, the optimal test fraction tends to zero as n increases to infinity. Even though the fraction of test examples tends to zero, there is still a very large number of test examples, enough that the test error will be close to the generalization function. For fixed n, τ, the optimal test fraction tends to zero as γ approaches 1, as the perceptron then has increasingly more data at its disposal to learn the teacher. For τ tending to zero, we recover the normal case in which we utilize all the data set as training examples, regardless of the test error fluctuations.
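Numerically, the optimal split can be found by minimizing the upper bound over m directly. The error models in the sketch below are illustrative stand-ins (a linear learning curve and a CLT-style standard deviation), not the paper's exact expressions:

```python
import numpy as np

def optimal_test_size(l, n, tau):
    """Toy numeric version of the split criterion: minimize
    eps_ub(m) = eps_g(l - m) + tau * sigma(m) over the test set size m.

    Assumed illustrative models: eps_g(p) = (1 - p/n)/2 for p <= n,
    and sigma(m) ~ eps_g * sqrt(2/m) from the central limit theorem.
    """
    m = np.arange(1, l)
    eps_g = np.maximum(1 - (l - m) / n, 0.0) / 2   # toy learning curve
    sigma = eps_g * np.sqrt(2.0 / m)               # toy test-error std
    return int(m[np.argmin(eps_g + tau * sigma)])

print(optimal_test_size(l=200, n=400, tau=1.0))
```

With τ = 0 the bound reduces to the generalization error alone, so the minimizer uses essentially all data for training, matching the "normal case" above.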
4 Confidence in the Training/Testing Procedure
One way to quantify trust in the training/testing procedure for a learning machine is to compare the results of training and testing the machine on different sets, and to see whether the test errors are close. We have in mind the following scenario. We divide a data set of 2p examples into two disjoint sets of equal cardinality, a "left" and a "right" half. Perceptron w_1 is trained on the right set and then tested on the left, and w_2 is trained on the left set and tested on the right. This generates two test errors, ε_test^(1) and ε_test^(2), for perceptrons w_1 and w_2, respectively. If the difference between ε_test^(1) and ε_test^(2) is large, our confidence in the training/testing procedure would be small. A quantity that measures the mean square difference between the test errors of the perceptrons is

Δ² = ⟨(ε_test^(1) − ε_test^(2))²⟩ = var(ε_test^(1)) + var(ε_test^(2)) − 2 cov(ε_test^(1), ε_test^(2))

where we have defined the variance

var(ε_test^(i)) = ⟨(ε_test^(i))²⟩ − ⟨ε_test^(i)⟩²

and the covariance

cov(ε_test^(1), ε_test^(2)) = ⟨ε_test^(1) ε_test^(2)⟩ − ⟨ε_test^(1)⟩⟨ε_test^(2)⟩
Numerical simulations were performed to calculate cov(ε_test^(1), ε_test^(2)) for a range of values of n and α, and the covariances were found to be an order of magnitude smaller than the variances calculated from the results of the previous section. The effect of the covariance cov(ε_test^(1), ε_test^(2)) is thus relatively weak (see Fig. 4). The results (Fig. 4) demonstrate how the root mean square difference between ε_test^(1) and ε_test^(2) decreases as the data set size increases. For small α, there is minimal information supplied to both perceptrons about the teacher and the two students vary greatly in their errors. As α increases, the VSs become more constrained around the teacher and the degree of belief in the training/testing procedure increases. As the dimension, n, of the perceptron is increased, Δ² scales with 1/n. We briefly mention that the error defined by ε^(1,2) = (ε_test^(1) + ε_test^(2))/2 resembles the leave-out-half cross-validation error (Shao 1993).⁶ ε^(1,2) has variance var(ε_test)/2 + cov(ε_test^(1), ε_test^(2))/2. A negative covariance signifies that if, for example, ε_test^(1) is greater than the average value, ε_test^(2) will tend to be smaller. Our simulations are negative and thus give rise to a
⁶Half the data set is used to train the student, and the other half to test it. This equipartitioning of the data set is random, with each realization giving rise to a test error value. The average of these values is taken to give the leave-out-half test error.
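The left/right swap procedure of this section can be sketched generically as follows (the `fit` and `test_error` callables and the least-squares example are invented interfaces, not the paper's simulation code):

```python
import numpy as np

def swap_test_gap(X, y, fit, test_error, seed=0):
    """Train on the right half / test on the left, and vice versa; return
    both test errors and their absolute difference (a small gap indicates
    trust in the training/testing procedure)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    left, right = idx[: len(X) // 2], idx[len(X) // 2:]
    w1 = fit(X[right], y[right])          # perceptron (1): train on right
    w2 = fit(X[left], y[left])            # perceptron (2): train on left
    e1 = test_error(w1, X[left], y[left])
    e2 = test_error(w2, X[right], y[right])
    return e1, e2, abs(e1 - e2)

# Invented example: a min-norm least-squares "student" and squared error.
fit = lambda A, b: np.linalg.pinv(A) @ b
err = lambda w, A, b: np.mean((A @ w - b) ** 2) / 2
rng = np.random.default_rng(1)
n, P = 64, 80
w0 = rng.normal(size=n)                   # teacher
X = rng.normal(size=(P, n))
y = X @ w0
print(swap_test_gap(X, y, fit, err))
```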
slightly sharpened error distribution about the mean compared to using two independent test and training sets, for which the cov(ε_test^(1), ε_test^(2)) term would be absent.

Figure 4: The crosses are the simulated values of Δ, the root mean square deviation between the two test error values generated by training perceptron (1) on the right half of the data set and testing it on the left, and vice versa for perceptron (2). The perceptron is of dimension n = 64. The dots are the approximation to Δ, which neglects the covariance term cov(ε_test^(1), ε_test^(2)).

5 Summary and Outlook
We have explicitly calculated the variance in the test error of a linear n-dimensional spherical perceptron and found that it decays with the system size n as 1/n. Furthermore, the variance decreases monotonically to zero as the number of training examples approaches the system size. Employing the variance, we found the optimal test set size m*, defined by minimizing the average upper bound on the generalization function given the test error. That is, for a data set of size l, an upper bound on the expected error that a student perceptron will make on a random test example by training on l − m and testing on m examples is minimized for m = m*. For large n, m* scales with n^{2/3}. A simple measure of the confidence in the training/testing procedure was given, being the difference between the test error values for two identical perceptrons trained and tested on complementary halves of the same data set. This difference necessarily decays to zero as the number of training examples increases. Extensions to this work for the case of noise and weight decay (Barber et al. 1995) and nonlinear systems are in progress. Although the model examined in this work is rather modest, it readily admits a theoretical treatment of variances that we hope goes some way to increasing interest in finite size effects.

Appendix A
For a single test example, the average of the test error ε_test(w | M, w⁰) over the VS can be written

⟨ε_test⟩_W = (1/2mn) Σ_{i,j} x_i x_j (⟨r_i r_j⟩_W + r⁰_i r⁰_j)

where r = w − c, c is the center of the VS, and r⁰ = Pw⁰. To perform the VS average, we transform the coordinate system, under a rotation matrix R, to express the hyperspherical VS in canonical coordinates.

If it is expected that the expert's probability for A is greater than 1/2 when A occurs, and less than 1/2 when Ā occurs, then μ_1 > 0 > μ_0. A special case is when the DM has no prior view of its own and, thus, wants to adopt
Experts’ Probability Assessments
the expert's opinion as its posterior. This occurs when the DM's prior log-odds equals zero, meaning that it considers the events A and Ā to be equally probable. By setting μ_1 = −μ_0 and (μ_1 − μ_0)/σ² = 1, the DM's posterior log-odds equals the expert's log-odds. The opposite alternative is that the DM has a poor opinion of the expert, in which case it may be that μ_1 < μ_0. Here the DM thinks that the expert usually predicts the wrong event. This does not make the expert's opinion less valuable; it means that the DM should expect the opposite of what the expert states. Several examples of the use of the supra Bayesian procedure are presented in Lindley (1982). In one example, the data provided by a weather forecaster are combined with the DM's prior distribution to generate a predictor of rain whose performance is modestly better than that of either the forecaster or the DM alone. The weather forecaster supplied probabilities of rain in a subsequent 12-hr period in Chicago from July 1972 to June 1976. The DM summarized the forecaster's stated probabilities in the event of rain, and in the event of no rain, using normal distributions. The normal distributions were fit to the forecaster's data using maximum likelihood estimation. The forecaster's predictions were then combined with the DM's prior distribution using equation 2.8. The prior probability of rain in any time period was 0.25. The analyses revealed that the normal distributions provided a good fit to the forecaster's data, though there is some evidence of mild skewness in this data. In addition, it was shown that the forecaster had a slight tendency to underestimate the probability of rain. The supra Bayesian approach as presented so far is easily extended to the case of multiple experts. Let q = [q_1, ..., q_m]^T denote the vector of the experts' log-odds for the event A. Bayes' rule is as above (equation 2.6) with the vector q replacing the scalar q_1:
lo(A | H, q) = log [p(q | A, H) / p(q | Ā, H)] + lo(A | H)    (2.9)

The first term on the right-hand side involves the DM's joint distribution for the expert log-odds q_1, ..., q_m and, thus, includes the DM's views concerning dependencies among the experts. The normal assumption studied above extends by using the multivariate normal distribution:

p(q | A, H) = N(μ_1, Σ),   p(q | Ā, H) = N(μ_0, Σ)    (2.10)
where the vectors μ_1 and μ_0 are the m-dimensional means for q given A and Ā, respectively, and Σ is the m × m covariance matrix. The DM's posterior log-odds is given by

lo(A | H, q) = [q − (μ_0 + μ_1)/2]^T Σ^{-1} (μ_1 − μ_0) + lo(A | H)    (2.11)

This equation includes a bias adjustment term (μ_0 + μ_1)/2 and a coefficient Σ^{-1}(μ_1 − μ_0) for the experts' log-odds q.
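Equation 2.11 is straightforward to evaluate. A minimal NumPy sketch (the means, covariance, and expert log-odds below are invented illustrative values, not data from the paper):

```python
import numpy as np

def posterior_log_odds(q, mu1, mu0, Sigma, prior_lo):
    """Supra-Bayesian pooling of expert log-odds (equation 2.11).

    q        -- vector of m expert log-odds for event A
    mu1, mu0 -- expected expert log-odds given A and given not-A
    Sigma    -- m x m covariance matrix of the expert log-odds
    prior_lo -- the DM's prior log-odds lo(A | H)
    """
    bias = 0.5 * (mu0 + mu1)                    # bias adjustment term
    coeff = np.linalg.solve(Sigma, mu1 - mu0)   # Sigma^{-1} (mu1 - mu0)
    return (q - bias) @ coeff + prior_lo

# Invented values: two mildly correlated experts, uninformative prior.
mu1 = np.array([1.0, 0.8])      # mean log-odds when A occurs
mu0 = np.array([-1.0, -0.8])    # mean log-odds when A does not occur
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
q = np.array([0.9, 0.5])        # the experts' stated log-odds
print(posterior_log_odds(q, mu1, mu0, Sigma, prior_lo=0.0))
```

With a single well-calibrated expert (μ_1 = −μ_0 and (μ_1 − μ_0)/σ² = 1), the posterior log-odds reduces to the expert's stated log-odds, as noted in the text.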
Robert A. Jacobs
Supra Bayesian techniques are also applicable when there are multiple discrete events. Suppose that there are n exclusive events A_1, ..., A_n, and suppose that each expert provides a probability distribution over these events. Let p_ij = p_i(A_j | H_i) denote the probability assigned by expert i to event A_j, and let q_ij = log p_ij denote the logarithm of this probability. Then the logarithm of the DM's posterior probability for event A_s is given by
log p(A_s | Q, H) = c + log p(Q | A_s, H) + log p(A_s | H)    (2.12)
where Q is an m × n matrix with element q_ij in the ith row and jth column, and c is a constant (c is used as a generic symbol for a constant, though it is not the same constant each time that it appears). The DM assumes that for each event A_s, the logarithms of the expert probabilities q_ij have a multivariate normal distribution with means
E(q_ij | A_s, H) = μ_ijs    (2.13)

and covariances

cov(q_ij, q_kl | A_s, H) = σ_ijkl    (2.14)
A consequence of this assumption is that log p(Q | A_s, H), the second term on the right-hand side of equation 2.12, is a quadratic form in the qs plus a normalizing term that depends only on the covariances and, thus, not on the event A_s. This normalizing term can be incorporated into the constant c, so that log p(Q | A_s, H) can be written as

log p(Q | A_s, H) = c − (1/2) Σ_{ij,kl} (q_ij − μ_ijs) σ^{ijkl} (q_kl − μ_kls)    (2.15)
where σ^{ijkl} are the elements of a matrix that is inverse to that with elements σ_ijkl. The equation for computing the DM's posterior log-probability for event A_s (equation 2.12) may now be rewritten as

log p(A_s | Q, H) = c + Σ_{ij} β_ijs q_ij + α_s + log p(A_s | H)    (2.16)

where

β_ijs = Σ_{kl} σ^{ijkl} μ_kls    (2.17)

and

α_s = −(1/2) Σ_{ij,kl} μ_ijs σ^{ijkl} μ_kls    (2.18)
That is, for the DM to use the opinions provided by the m experts about the n events, it should form its posterior log-probabilities by linearly combining its prior log-probabilities with the logarithms of the experts'
stated probabilities. The linear coefficients depend on the DM's assessments of the experts, expressed through the means μ_ijs and covariances σ_ijkl.
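Equations 2.16-2.18 can be sketched directly. In the toy example below the inverse covariance over the flattened (i, j) entries is taken to be the identity, and all means and probabilities are invented for illustration:

```python
import numpy as np

def posterior_event_probs(Q, mu, Sinv, prior):
    """Posterior over n events from m experts' log-probabilities
    (equations 2.16-2.18). Q is the m x n matrix of expert log-probs;
    mu[s] is the m x n matrix of means of Q given event s; Sinv is the
    inverse covariance over the flattened (i, j) entries; prior is the
    DM's prior distribution over the n events."""
    n = Q.shape[1]
    q = Q.ravel()
    log_post = np.empty(n)
    for s in range(n):
        mus = mu[s].ravel()
        beta = Sinv @ mus                    # linear coefficients (2.17)
        alpha = -0.5 * mus @ Sinv @ mus      # bias term (2.18)
        log_post[s] = beta @ q + alpha + np.log(prior[s])
    p = np.exp(log_post - log_post.max())    # the constant c drops out
    return p / p.sum()

# Invented toy case: one expert, two events, identity inverse covariance.
Q = np.log(np.array([[0.9, 0.1]]))
mu = np.log(np.array([[[0.8, 0.2]], [[0.2, 0.8]]]))  # mu[s] given event s
p = posterior_event_probs(Q, mu, np.eye(2), prior=np.array([0.5, 0.5]))
print(p)
```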
Lindley (1985, 1988) provides many more details regarding the supra Bayesian procedure. For example, he notes that the assumption that the experts' log-probabilities have a normal distribution is often untenable, and discusses how, in many situations, this problem can be overcome by considering contrasts instead of log-probabilities. He also shows that the computational requirements of the supra Bayesian procedure can be considerably reduced by assuming that the means and covariances of the normal distribution that characterizes the experts' log-probabilities have a restricted structure. The reader is referred to the Lindley papers for these and other matters. In summary, we have shown how a DM can use Bayes' rule to combine its prior beliefs with experts' probability assessments. To make the resulting computations tractable, it is often necessary to assume that either the experts' log-odds or log-probabilities are normally distributed. The framework was reviewed in the cases of a single expert, multiple experts, a single discrete event, and multiple discrete events. In all instances, it was shown that with the normality assumption the DM's posterior log-odds (or log-probabilities) is a linear combination of its prior log-odds (or log-probabilities) and those of the experts. We have already raised concerns about the supra Bayesian approach based on issues of computational expense. Some researchers have also expressed reservations based on more theoretical matters. Note, for example, that the DM assesses the joint probability of the experts' opinions given the quantity of interest θ and its knowledge H, whereas each expert's opinion is based on its knowledge H_i. Because the DM does not know each expert's knowledge, it would appear as if the DM needs to assess the probability of each expert's knowledge to evaluate the joint probability of the experts' opinions.
This assessment, however, is omitted from the supra Bayesian framework.¹ Other theoretical reservations concern the order in which particular operations are performed (cf. French 1985; Genest and Zidek 1986). For example, suppose that some objective evidence becomes available such that all experts agree on the likelihood function derived from this evidence. Consider the following two procedures for determining a final distribution: (1) each expert updates its prior belief via Bayes' rule and then the experts' opinions are combined into an aggregate distribution; (2) the aggregate distribution of the experts' prior beliefs is first formed and then this distribution is updated through Bayes' rule. An aggregation method that produces the same final distribution regardless of which procedure is followed is said to possess the property of external Bayesianity. This property is often considered a reasonable requirement of an

¹We thank an anonymous reviewer for raising this issue.
aggregation technique. Note, however, that supra Bayesian methods are not externally Bayesian. As a second example, suppose that subsets of the original events A_1, ..., A_n are grouped into a new set of events B_1, ..., B_r, where r < n. For instance, let the events be defined in terms of two discrete quantities, X and Y, and, using new subscript notation, let A_jk denote the event where X = j and Y = k. Let B_j denote the event that X = j (Y may take on any value). The DM is interested in the marginal probabilities of the events B_j, but the experts provide probabilities for the original events A_jk. There are at least two procedures that the DM can use to obtain the marginal probabilities: (1) compute the marginals for each expert by p_i(B_j) = Σ_k p_i(A_jk) and then combine the results; (2) combine the experts' probabilities about A_jk to obtain p(A_jk | Q, H) and then compute the marginal probabilities by p(B_j) = Σ_k p(A_jk | Q, H). If an aggregation method produces the same final marginal distribution regardless of which procedure is followed, it is said to possess the marginalization property. This property is often considered a desirable property of an aggregation technique, but it is not characteristic of supra Bayesian methods. Lindley (1985, 1988) argued that neither external Bayesianity nor the marginalization property is a reasonable requirement and, therefore, it is of no consequence that supra Bayesian techniques do not possess these features. Interestingly, McConway (1981) has shown that the only aggregation technique to possess the marginalization property is the linear opinion pool.
3 Linear Opinion Pools
This section reviews the use of linear opinion pools. When using this aggregation procedure, the DM defines its own opinion to be the linear combination of the experts' distributions, with the constraint that the resulting combination is also a distribution:

p(θ | H) = Σ_{i=1}^{m} w_i p_i(θ | H_i)    (3.1)
where the w_i are linear coefficients or weights. A necessary condition to meet the constraint is that the weights sum to one. It is also often assumed that the weights are nonnegative (see Lawson and Hanson (1974) for the solution to least squares problems with nonnegativity constraints). The focus of this section is on methods for assigning values to the weights. Two cases are considered: weights as veridical probabilities and minimum error weights. Weights as veridical probabilities-The DM adopts the veridical assumption when it assumes that the quantity of interest, θ, is generated by one of the experts' distributions P_1, ..., P_m, though it is uncertain as to which one. The weight w_i is the probability that P_i is the "true" distribution and, thus, the linear opinion pool gives the marginal distribution for θ.
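The pool itself is just a convex combination of the experts' distributions. A minimal sketch (the distributions and weights below are invented):

```python
import numpy as np

def linear_opinion_pool(expert_dists, weights):
    """Convex combination of experts' probability distributions.

    expert_dists -- (m, k) array; row i is expert i's distribution over k events
    weights      -- length-m weights, nonnegative and summing to one
    """
    w = np.asarray(weights, dtype=float)
    assert np.all(w >= 0) and abs(w.sum() - 1.0) < 1e-9
    return w @ np.asarray(expert_dists)   # itself a valid distribution

P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3]])
pooled = linear_opinion_pool(P, [0.5, 0.5])
print(pooled)   # approximately [0.5, 0.3, 0.2]
```

Because the weights are nonnegative and sum to one, the pooled vector is guaranteed to be a probability distribution, which is exactly the constraint stated above.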
Statistical models known as mixture models typically adopt the veridical assumption. For example, within the artificial neural network literature, the veridical assumption is used in the mixtures-of-experts (ME) architecture proposed by Jacobs et al. (1991). It is assumed that each data item is generated as follows: given an input, one of several processes is selected from a conditional multinomial probability distribution (the distribution is conditioned on the input); the quantity of interest is sampled from the conditional distribution associated with the chosen process (again, the distribution is conditioned on the input). The ME architecture is a multinetwork architecture consisting of a “gating” network and several ”expert” networks. It uses the gating network to learn the multinomial distribution, and the different expert networks to learn the distributions associated with the different processes. The output of the architecture is the linear combination of the outputs of the expert networks. The gating network‘s outputs, which are nonnegative and sum to one, serve as the weights. The architecture’s training procedure combines aspects of associative and competitive learning. This procedure adjusts the parameters of the gating network so that, for a given input, the ith weight tends toward the probability that the distribution produced by the ith expert network is the true distribution. The training of the experts is as follows. Given an input, each expert produces an estimate of the true distribution of the quantity of interest. The expert whose estimate most closely matches the true distribution is called the winner of the competition; all other experts are called losers. It is assumed that one, and only one, expert’s estimate can match the true distribution. Each expert updates its parameters in proportion to its relative performance in the competition. 
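A minimal forward pass of such an architecture can be sketched with linear experts and a softmax gating network (the parameter shapes and values below are illustrative, not Nowlan's or Jacobs et al.'s exact networks):

```python
import numpy as np

def me_forward(x, expert_params, gate_params):
    """Forward pass of a minimal mixture-of-experts: linear experts and a
    softmax (multinomial) gating network. The pooled output is a linear
    opinion pool whose weights depend on the input."""
    logits = gate_params @ x
    g = np.exp(logits - logits.max())
    g = g / g.sum()                                  # nonnegative, sum to one
    outputs = np.array([W @ x for W in expert_params])
    return g @ outputs, g                            # pooled output, weights

# Invented parameters: two linear experts and a two-way gate.
x = np.array([1.0, 2.0])
experts = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])]
gates = np.array([[1.0, 0.0], [0.0, 1.0]])
out, g = me_forward(x, experts, gates)
print(out, g)
```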
The overall effect is that the experts adaptively partition the data set so that each expert tends to closely approximate the true distribution for a restricted set of inputs, and different experts tend to learn the true distribution for different input sets. Because different experts receive training information on different sets of inputs, the experts' outputs are relatively independent. The veridical assumption is, therefore, often appropriate in this context. One example of the use of the veridical assumption is provided by Nowlan (1990), who trained a mixtures-of-experts architecture on a vowel classification task. The data consisted of the first two formant values for 10 vowels spoken by 75 different speakers (Peterson and Barney 1952). In one set of simulations, 20 expert networks and a single gating network comprised the architecture; all networks received the formant values as inputs. During the course of training, different expert networks became specialized for classifying different sets of vowels. Nowlan (1990) suggested that the nature of the task decomposition discovered by the architecture is related to the positions of the vocal articulators for each of the vowel utterances; in one instance, for example, an expert network became specialized for distinguishing among the set of vowels that is spoken with the tongue toward the front of the mouth.
Despite the successes of statistical mixture models, there exist many circumstances requiring the combination of multiple experts' opinions in which it is unrealistic to assume that the experts' opinions are independent and that the veridical assumption is valid. The experts may be nonadaptive, in which case they may come to the situation in which their opinions need to be aggregated with correlated outputs. Alternatively, the experts may be adaptive, but have opinions that are dependent because they have similar biases or receive correlated training information. For these reasons, other methods for choosing a linear opinion pool's weights have been studied. Minimum error weights-Several researchers have proposed that the weights of the linear opinion pool be selected by performing regression of the probability of θ against the expert opinions P_1, ..., P_m. Two such methods, referred to as constrained and unconstrained regression, are commonly used. The linear coefficients are constrained to sum to one when using constrained regression; unconstrained regression places no constraint on the sum of the coefficients, and it employs an intercept term. Although these methods are useful for combining probability distributions, they have a broader scope of applicability in the sense that they can be used to combine any type of function approximation. Here we deviate from our practice of only considering the aggregation of probability distributions. Instead we consider the case in which experts provide point estimates of an uncertain quantity θ conditional on some independent variables. The DM pools these expert opinions into an aggregate point estimate. First we present the two regression methods. Next we show how each method can be justified from a Bayesian perspective in some circumstances. Let f_i denote the function that gives expert i's point estimate of θ. Suppose that the DM believes that the experts' opinions are unbiased, meaning that E(f_i − θ) = 0.
The DM may form an unbiased, minimum variance estimate of θ, denoted f, by taking a weighted average of the expert opinions (Bates and Granger 1969):

f = Σ_{i=1}^{m} w_i f_i    (3.2)
The weights w_i must be selected so that they minimize the variance of f, and so that they sum to one. A consequence of this weight selection is that the variance of the DM's estimate is guaranteed to be less than or equal to the variance of each expert's estimate. Because an unbiased estimator's variance equals its expected squared error, this means that the DM's estimate is as good as or better than any of the experts' estimates (Dickenson 1973, 1975; Perrone 1993). Dickenson used Lagrange optimization to find the optimal weights. Suppose that the expert errors θ − f_i are normally distributed with zero means, and let σ_ij denote the covariance between expert i's and expert j's
Experts’ Probability Assessments
879
errors. The weights are found by minimizing the objective function

w^T Σ w + λ(1 − I^T w)   (3.3)
The first term on the right-hand side is the variance, or expected squared error, of f. The second term gives the constraint that the weights sum to one. The solution to this optimization problem is

w = Σ^{-1} I (I^T Σ^{-1} I)^{-1}   (3.4)
where w = (w_1, ..., w_m) is the vector of weights, Σ is the covariance matrix for the experts' errors, and I is a vector whose elements are equal to one. If, for example, the DM believes that the experts' errors are independent, then the ith weight is proportional to the ith expert's precision 1/var(f_i). Disadvantages of this constrained regression procedure can be illustrated by considering the case of two experts (Granger and Ramanathan 1984). The DM's estimate is

f = w f_1 + (1 − w) f_2   (3.5)

which can be rewritten as

θ − f_2 = w (f_1 − f_2) + ε   (3.6)

where ε = θ − f is the error. The weight w is chosen so as to minimize the expected squared error. A drawback of this procedure is that although the error ε is uncorrelated with the difference f_1 − f_2, it is not necessarily uncorrelated with the individual expert estimates f_1 and f_2. It is, therefore, possible to estimate the error from the expert estimates. In this sense, the constrained regression procedure is not optimal. One possibility is to remove the constraint that the weights sum to one, but then the DM's estimate would no longer be guaranteed to be unbiased even when the experts' estimates are unbiased. An alternative is to include an additional unbiased estimate of θ, namely its unconditional mean E(θ). In this case, the quantity θ is given by

θ = w_1 f_1 + w_2 f_2 + w_3 E(θ) + ε   (3.7)

where w_1 + w_2 + w_3 = 1. The weights are chosen via least-squares regression in which w_3 E(θ) is a constant and w_1 and w_2 are unconstrained. Because w_3 E(θ) is a constant, the DM's estimate is a linear combination of the expert estimates plus an intercept term, and the error ε is uncorrelated with the expert estimates. Granger and Ramanathan (1984) advocated the use of an intercept term and the removal of the constraint that the weights sum to one. That is, the DM's estimate should be a linear combination of the expert
Robert A. Jacobs
880
estimates plus an intercept term, and the weights should be selected via unconstrained least-squares regression. This method has the advantage that it yields an unbiased pooled estimate even if the expert estimates are biased. Researchers have debated the relative merits of the constrained regression (no intercept term, weights sum to one) and unconstrained regression (intercept term, no constraint on the weights) procedures. It is clear that the unconstrained method has more "degrees of freedom" and, thus, will achieve a smaller sum of squared errors on a set of training items (Granger and Ramanathan 1984). Nonetheless, due to possible overfitting of the training data, it is uncertain which procedure will perform better on novel data (Clemen 1986).

Meir (1994) quantified the bias and variance of linear opinion pools in the case of linear least-squares regression. Of greatest interest for our purposes is that he studied the situation where the data set is partitioned into disjoint subsets, and a different subset is used to train each expert. As compared to the case in which all experts are trained on the full data set, this training scheme can, in many situations, lead to a linear opinion pool with good performance because it results in a large decrease in the pool's variance due to the independence of the experts' opinions. This decrease tends to more than offset the concomitant increase in the pool's bias.

Bordley (1982, 1986) has shown that the constrained and unconstrained regression methods can be deduced from a Bayesian approach. In the case of constrained regression, Bordley (1982) assumed that (1) the DM's prior distribution on θ is diffuse (that is, any value of θ is equally likely); (2) the DM considers the expert errors θ − f_i to be normally distributed with mean zero and covariance matrix Σ; and (3) the expert errors are uncorrelated with the DM's prior estimates of θ. Under these assumptions, p(θ | f_1, ..., f_m), the DM's posterior distribution on θ, is a multivariate normal whose mean is given by the linear opinion pool without an intercept term and whose weights sum to one. The weights are the optimal weights selected via Lagrange optimization as described above (equation 3.4). In the case of unconstrained regression, Bordley (1986) replaced assumptions (1) and (3) with the assumptions that the DM's prior distribution on θ is normal, and that the expert errors are correlated with the DM's prior estimates. Under these assumptions, the mean of the DM's posterior distribution is given by a linear opinion pool with an intercept term and with no constraints on the weights; the weights are selected via unconstrained least-squares regression. When the experts' estimates are biased, the DM's estimate may be written in the form

f = E(θ) + ∑_{i=1}^m w_i [f_i − E(f_i)]
where E(f_i) is the DM's expected value for expert i's estimate. That is, the DM computes its posterior expectations by adjusting its prior expectations, either positively or negatively, in proportion to the degree to which the experts' estimates deviate from what it had expected their values to be. This is reasonable in the sense that if the experts' estimates are what the DM expects them to be, then the DM has not gained any information, and its posterior expectation equals its prior expectation.

Perrone (1993) presented several examples of the use of linear opinion pools when the experts are neural networks. One set of simulations compared different classifiers on a face recognition task. The database consisted of images of 16 human male faces. Different images of the faces were generated under various lighting conditions and with a variety of locations and orientations of the faces. During the first stage of training, 10 neural networks were individually trained to classify the faces. The networks had identical architectures, though they differed in the initial values of their weights. The outputs of the networks were combined into a linear opinion pool during a second stage of training using constrained least-squares regression. On a novel set of images, the linear opinion pool significantly outperformed all of the individual networks composing the pool. Additional empirical results can be found in Hashem (1993) and Perrone (1993).

Recently, researchers have proposed combining linear opinion pools based on constrained or unconstrained least-squares regression with model selection techniques to achieve systems with good generalization properties (e.g., Breiman 1992; LeBlanc and Tibshirani 1993; Wolpert 1992). This combination is a special case of what Wolpert (1992) referred to as stacked generalization. To illustrate the combination, we contrast it with leave-one-out cross-validation, a common model selection procedure.
Let {(x_k, θ_k)} be a set of input-output data items, and let f_i^{(-k)}(x_k) denote the output of the ith expert in response to the input x_k when this expert has been trained using all data except data item k. The prediction error for expert i is defined as

e_i = ∑_k [θ_k − f_i^{(-k)}(x_k)]²   (3.9)

Leave-one-out cross-validation is a "winner-take-all" model selection scheme that selects the expert with the smallest prediction error. In contrast, the combination of linear opinion pools with leave-one-out cross-validation defines the prediction error in terms of the linear aggregation of the experts' outputs:

e = ∑_k [θ_k − ∑_i w_i f_i^{(-k)}(x_k)]²   (3.10)

The weights w_i are selected to minimize the prediction error via constrained or unconstrained regression. Breiman (1992) compared stacked generalization with a variety of conventional statistical techniques on a wide range of linear regression tasks.
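The leave-one-out stacking scheme of equations 3.9 and 3.10 can be sketched as follows. This is a minimal NumPy illustration, not Breiman's setup: the data, the experts' input subsets, and the model choices are invented for the example. Each expert is a linear model fit on a different subset of the input variables, and the stacking weights are chosen by unconstrained least squares on the leave-one-out predictions.

```python
# Sketch of stacked generalization (Wolpert 1992; Breiman 1992) with
# linear-model experts. All data and expert subsets are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n, d = 60, 6
X = rng.normal(size=(n, d))
theta = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0, 3.0]) + 0.1 * rng.normal(size=n)

subsets = [[0, 1, 2], [2, 3, 4], [0, 4, 5]]   # each expert sees different inputs

def fit_predict(cols, train_idx, x_new):
    """Least-squares fit on the given rows/columns, then predict x_new."""
    A = X[np.ix_(train_idx, cols)]
    coef, *_ = np.linalg.lstsq(A, theta[train_idx], rcond=None)
    return x_new[cols] @ coef

# Leave-one-out predictions: F[k, i] = f_i^{(-k)}(x_k), as in equation 3.9.
F = np.empty((n, len(subsets)))
for k in range(n):
    train = [j for j in range(n) if j != k]
    for i, cols in enumerate(subsets):
        F[k, i] = fit_predict(cols, train, X[k])

# Stacking weights minimizing sum_k (theta_k - sum_i w_i F[k, i])^2
# (equation 3.10), here via unconstrained least squares.
w, *_ = np.linalg.lstsq(F, theta, rcond=None)
pooled_err = np.sum((theta - F @ w) ** 2)
expert_err = [np.sum((theta - F[:, i]) ** 2) for i in range(len(subsets))]
print(w, pooled_err, min(expert_err))
```

Because the unconstrained fit can always reproduce any single expert (by setting that expert's weight to one), the pooled prediction error on the stacking targets can never exceed the best individual expert's.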
The target functions had 40 input variables and one output variable. In the first stage of the simulations, a set of experts was formed. Each expert was a linear model, and the experts differed because each received a different subset of the input variables. The outputs of the experts were aggregated in a second stage of the simulations via a least-squares procedure that did not contain an intercept term and that was constrained so that all the coefficients were nonnegative. The performance of this system was superior to the performances of three regression methods based upon cross-validation methodology. Additional empirical and theoretical results regarding stacked generalization can be found in Breiman (1992), LeBlanc and Tibshirani (1993), and Wolpert (1992).

In summary, we have reviewed methods that a DM can use to linearly combine experts' probability assessments. Two cases were considered: weights as veridical probabilities and minimum error weights. In practice, linear opinion pools have proven popular because they often yield useful results with a moderate amount of computation. Objections to their use have been raised, however, on theoretical grounds. To give just one example, it has been argued that the DM should combine the experts' opinions in such a way as to preserve any form of expert agreement regarding the independence of the events in question (Genest and Wagner 1987). That is, it should be the case that

p(A ∩ B | P_1, ..., P_m) = p(A | P_1, ..., P_m) p(B | P_1, ..., P_m)   (3.11)

whenever it is each expert's belief that p_i(A ∩ B) = p_i(A) p_i(B) for all i, for events A and B. This property is referred to as the independence preservation property. Note that it is not possessed by linear opinion pools except when a single expert has a weight of one and all other experts have a weight of zero, a situation referred to as a dictatorship.
Genest and Wagner (1987) argued, however, that the independence preservation property is not a reasonable requirement of an aggregation procedure.
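The minimum-variance weights of equation 3.4 are straightforward to compute. The sketch below uses NumPy, and the error covariance matrix is a made-up example; it forms w = Σ^{-1} I (I^T Σ^{-1} I)^{-1} and checks that the pooled variance w^T Σ w does not exceed any single expert's error variance, as the text asserts.

```python
# Minimum-variance (constrained) pooling weights, equation 3.4.
# Sigma is a hypothetical covariance matrix of the experts' errors.
import numpy as np

Sigma = np.array([[1.0, 0.3, 0.2],
                  [0.3, 2.0, 0.5],
                  [0.2, 0.5, 1.5]])
ones = np.ones(3)

Sinv_ones = np.linalg.solve(Sigma, ones)
w = Sinv_ones / (ones @ Sinv_ones)   # equation 3.4; weights sum to one
pooled_var = w @ Sigma @ w           # variance of the pooled estimate
print(w, pooled_var)
```

With independent errors (a diagonal Sigma), each weight comes out proportional to the expert's precision 1/var(f_i), as noted in the text; with correlated errors the weights can even be negative.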
4 Value of Information from Dependent Experts
The major feature that makes the aggregation of expert opinions difficult is the high correlation or dependence that typically occurs among these opinions. This problem was alluded to in the preceding discussion; it is explicitly studied in this section, where we review the results of Clemen and Winkler (1985). These authors showed that, given certain assumptions, m dependent experts are worth the same as k independent experts, where k ≤ m. In some cases, an exact value for k can be given; in other cases, lower and upper bounds can be placed on k. Clemen and Winkler (1985) assumed that the experts provide point estimates f_i of the uncertain quantity θ, and that these estimates are unbiased, meaning that E(f_i − θ) = 0. The vector ε = (ε_1, ..., ε_m) denotes the experts' errors, where ε_i = f_i − θ. It is assumed that the joint probability of the experts' errors, p(ε | θ), is normally distributed with mean zero and covariance matrix Σ. The DM's prior distribution for θ, p(θ), is a normal distribution with mean μ_0 and variance σ_0². It is assumed that the DM's prior estimation error μ_0 − θ is uncorrelated with any of the experts' errors. Using Bayes' rule, the DM's posterior distribution is given by
p(θ | f_1, ..., f_m) ∝ p(ε | θ) p(θ)   (4.1)

This distribution is normal with mean

μ* = (σ_0^{-2} μ_0 + I^T Σ^{-1} f) σ*²   (4.2)

and variance

σ*² = (σ_0^{-2} + I^T Σ^{-1} I)^{-1}   (4.3)

where I is a vector whose elements are equal to one, and f = (f_1, ..., f_m) is the vector of expert opinions. Much of the analysis given below uses the fact that the posterior mean μ* is a weighted average of the prior mean and the experts' estimates, and that the weights depend on the covariance matrix Σ. The posterior variance σ*² also depends on Σ. Suppose that the expert errors are independent, and that each expert has an error variance of σ². As a matter of notation, we use m to denote the number of experts when these experts are dependent, and k to denote the number when they are independent. The DM's prior variance can be written in the form σ_0² = σ²/k_0. The posterior variance is then

σ*² = σ²/(k_0 + k)   (4.4)
By comparing equations 4.3 and 4.4, we can determine the number of independent experts with error variance σ² that yields the same posterior variance as m dependent experts with covariance matrix Σ. This number, denoted k(σ², Σ), can be written

k(σ², Σ) = σ² I^T Σ^{-1} I   (4.5)
That is, under the given assumptions, k(σ², Σ) independent experts are equivalent to m dependent experts. Clemen and Winkler (1985) considered three cases. The first case assumes that Σ is an intraclass correlation matrix, meaning that all expert variances are equal and all correlations are equal. The common variance and correlation are denoted σ² and ρ, with ρ > 0. The inverse of the covariance matrix, Σ^{-1}, takes a relatively simple form under these conditions, and the equivalent number of independent experts is

k(σ², Σ) = m[1 + (m − 1)ρ]^{-1}   (4.6)
If m > 1, then k(σ², Σ) < m, meaning that positive dependence among the experts' errors reduces the information value of their opinions. The stronger the dependence, the greater is the reduction, because ∂k(σ², Σ)/∂ρ < 0. The equivalent number of independent experts is a concave function of m, whose limit may be written as

lim_{m→∞} k(σ², Σ) = ρ^{-1}   (4.7)
In other words, there is an upper limit on the number of equivalent independent experts, and on the precision of the information that can be attained by consulting dependent experts. This limit is surprisingly low. For example, if ρ = 0.8, then in the limit k(σ², Σ) = 1.25. After the first expert, who is worth one independent expert, all other experts combined are worth only one-fourth of an independent expert. As a second example, if ρ = 0.25, then in the limit k(σ², Σ) = 4.0. Consulting an infinite number of dependent experts, in this case, is no better than consulting four independent experts.

The second case considered by Clemen and Winkler (1985) assumes that the correlations among the expert errors are positive and equal, but that the error variances may differ. It is also assumed that the weights used to compute the DM's posterior mean are positive (equation 4.2). Define σ_M² and σ_m² as follows:

σ_M² = max_i {σ_i²};   σ_m² = min_i {σ_i²}   (4.8)

where σ_i² is expert i's variance. Then it can be shown that

k(σ², Σ_M) < k(σ², Σ) < k(σ², Σ_m)   (4.9)

where Σ_M and Σ_m have intraclass correlation structure with correlation ρ and variances σ_M² and σ_m², respectively. In other words, an increase in the expert variances leads to a decrease in the equivalent number of independent experts. In the general case, both the correlations among the expert errors and the expert variances may vary. As before, assume that the weights used to compute the DM's posterior mean are positive. Define

ρ_R = max_{i>j} {ρ_ij};   ρ_r = min_{i>j} {ρ_ij};   ρ_0 = 0   (4.10)
where ρ_ij is the correlation between expert i's and expert j's errors. Let Σ_R have common correlation ρ_R, Σ_r have common correlation ρ_r, and Σ_0 have common correlation ρ_0, with variances equal to those in Σ. It can be shown that

k(σ², Σ_R) < k(σ², Σ) < k(σ², Σ_r) < k(σ², Σ_0)   (4.11)
An increase in the correlation among expert errors leads to a decrease in the equivalent number of independent experts. As Clemen and Winkler (1985) pointed out, dependence among the experts can occasionally be helpful. For example, suppose that the expert errors have a common variance and correlation, but that the correlation is negative (-m^{-1} ≤ ρ < 0; the lower bound is necessary to keep Σ positive definite). Then

m < k(σ², Σ) < m²   (4.12)
where m > 1. Negative dependence can, therefore, lead to increases in the number of equivalent independent experts. As a second example, when one expert has a very large variance, it may be useful to include additional experts whose errors have high positive correlations with those of the first expert, but with smaller variances. Despite these examples, Clemen and Winkler concluded that it will generally be advantageous to include experts that are believed not to be highly correlated with each other or with the prior information, even if this means using experts with relatively high variance.

In conclusion, it appears that different aggregation procedures are appropriate for different situations. The simplest circumstance occurs when the experts' errors are uncorrelated. This may occur if it is possible to train different experts on independent data sets (Meir 1994). Alternatively, this situation may be approximated if the experts are competitive; the experts learn different mappings because they adaptively partition the data set (Jacobs et al. 1991). Aggregation is more complicated when the experts' opinions are dependent. In this case, it may be necessary for the DM to model the dependencies among the experts to achieve good performance. Two classes of aggregation procedures were reviewed in this article. With problems of easy or moderate difficulty, a "quick and dirty" procedure such as the linear opinion pool may be sufficient. A more computationally intensive technique, such as a supra Bayesian method, may be necessary for problems of greater complexity.

Recent years have seen a large increase in the number of studies of how people combine multiple sources of information, particularly in the context of visual perception. Linear opinion pools are often used to model people's perceptual cue aggregations, though some experimental results suggest that such pools are not always perfectly suited for this role.
Modifications of these pools are, therefore, often explored. For example, Young et al. (1993), in their study of how people combine object motion and texture gradient visual cues to extract depth information, proposed two modifications to the basic linear opinion pool. One modification, called cue promotion, involves the use of one visual cue to provide missing information required by another cue to yield accurate perceptual judgments. The second modification is that the coefficients of the linear opinion pool may change with the visual environment, a technique referred to as dynamic reweighting. Unfortunately, there is currently no well-articulated theory to guide researchers in the selection of a model that is well suited to the circumstance that they study. That is, it is not known which perceptual or cognitive phenomena are best modeled using a linear opinion pool, which phenomena should be characterized using a modified linear opinion pool, and which phenomena require a more complex model such as a supra Bayesian model. This will surely be a topic of many future studies.
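As a toy sketch of the dynamic reweighting idea (this is not the model of Young et al. 1993; the cue estimates and reliability values below are hypothetical), the pool's coefficients can be recomputed from a per-cue reliability signal each time the viewing conditions change:

```python
# Toy dynamically reweighted linear opinion pool: each cue's weight is its
# (hypothetical) reliability, normalized so that the weights sum to one.
import numpy as np

def pooled_depth(cue_estimates, reliabilities):
    r = np.asarray(reliabilities, dtype=float)
    w = r / r.sum()                  # reweight as conditions change
    return w @ np.asarray(cue_estimates, dtype=float)

# Motion cue degraded (low reliability), texture cue sharp:
print(pooled_depth([2.0, 3.0], [0.2, 0.8]))   # leans toward texture (about 2.8)
```

When a cue's reliability drops to zero, its weight vanishes and the pool falls back on the remaining cues, which is the qualitative behavior dynamic reweighting is meant to capture.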
Acknowledgments
This work was supported in part by NIH grant MR-54770.

References

Abidi, M. A., and Gonzalez, R. C. 1992. Data Fusion in Robotics and Machine Intelligence. Academic Press, San Diego, CA.
Agnew, C. E. 1985. Multiple probability assessments by dependent experts. J. Am. Stat. Assoc. 80, 343-347.
Bates, J. M., and Granger, C. W. J. 1969. The combination of forecasts. Operational Res. Q. 20, 451-467.
Bordley, R. F. 1982. The combination of forecasts: A Bayesian approach. J. Operational Res. Soc. 33, 171-174.
Bordley, R. F. 1986. Linear combination of forecasts with an intercept: A Bayesian approach. J. Forecasting 5, 243-249.
Breiman, L. 1992. Stacked Regression. Tech. Rep. TR-367, Department of Statistics, University of California, Berkeley.
Chatterjee, S., and Chatterjee, S. 1987. On combining expert opinions. Am. J. Math. Management Sci. 7, 271-295.
Clark, J. J., and Yuille, A. L. 1990. Data Fusion for Sensory Information Processing Systems. Kluwer Academic Publishers, Norwell, MA.
Clemen, R. T. 1986. Linear constraints and the efficiency of combined forecasts. J. Forecasting 5, 31-38.
Clemen, R. T., and Winkler, R. L. 1985. Limits for the precision and value of information from dependent sources. Oper. Res. 33, 427-442.
Cooke, R. M. 1990. Statistics in expert resolution: A theory of weights for combining expert opinion. In Statistics in Science: The Foundations of Statistical Methods in Biology, Physics, and Economics, R. Cooke and D. Costantini, eds. Kluwer Academic Publishers, The Netherlands.
DeGroot, M. H., and Fienberg, S. E. 1986. Comparing probability forecasters: Basic binary concepts and multivariate extensions. In Bayesian Inference and Decision Techniques, P. Goel and A. Zellner, eds. Elsevier Science Publishers, Amsterdam.
Dickenson, J. P. 1973. Some statistical results in the combination of forecasts. Oper. Res. Q. 24, 253-260.
Dickenson, J. P. 1975. Some comments on the combination of forecasts. Oper. Res. Q. 26, 205-210.
Dosher, B. A., Sperling, G., and Wurst, S. A. 1986. Tradeoffs between stereopsis and proximity luminance covariance as determinants of perceived 3D structure. Vision Res. 26, 973-990.
Drucker, H., Schapire, R., and Simard, P. 1993. Improving performance in neural networks using a boosting algorithm. In Advances in Neural Information
Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds. Morgan Kaufmann, San Mateo, CA.
French, S. 1980. Updating of belief in the light of someone else's opinion. J. Royal Statist. Soc. A 143, 43-48.
French, S. 1985. Group consensus probability distributions: A critical survey. In Bayesian Statistics 2, J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, eds. Elsevier Science Publishers, North-Holland.
Gelfand, A. E., Mallick, B. K., and Dey, D. K. 1995. Modeling expert opinion arising as a partial probabilistic specification. J. Am. Statist. Assoc. 90, 598-604.
Genest, C., and McConway, K. J. 1990. Allocating the weights in the linear opinion pool. J. Forecasting 9, 53-73.
Genest, C., and Wagner, C. G. 1987. Further evidence against independence preservation in expert judgement synthesis. Aequat. Math. 32, 74-86.
Genest, C., and Zidek, J. V. 1986. Combining probability distributions: A critique and an annotated bibliography. Statist. Sci. 1, 114-148.
Graham, N. V. S. 1989. Visual Pattern Analyzers. Oxford University Press, New York.
Granger, C. W. J., and Ramanathan, R. 1984. Improved methods of combining forecasts. J. Forecasting 3, 197-204.
Hashem, S. 1993. Optimal Linear Combinations of Neural Networks. Tech. Rep. SMS 94-4, School of Industrial Engineering, Purdue University.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Comp. 3, 79-87.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6, 181-214.
Lawson, C. L., and Hanson, R. J. 1974. Solving Least Squares Problems. Prentice-Hall, Englewood Cliffs, NJ.
LeBlanc, M., and Tibshirani, R. 1993. Combining Estimates in Regression and Classification. Tech. Rep., Department of Preventive Medicine and Biostatistics, University of Toronto.
Lindley, D. V. 1982. The improvement of probability judgements. J. Royal Statist. Soc. A 145, 117-126.
Lindley, D. V. 1985. Reconciliation of discrete probability distributions. In Bayesian Statistics 2, J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, eds. North-Holland, Amsterdam.
Lindley, D. V. 1988. The use of probability statements. In Accelerated Life Testing and Experts' Opinions in Reliability, C. A. Clarotti and D. V. Lindley, eds. North-Holland, Amsterdam.
Lindley, D. V., Tversky, A., and Brown, R. V. 1979. On the reconciliation of probability assessments. J. Royal Statist. Soc. A 142, 146-180.
McConway, K. J. 1981. Marginalization and linear opinion pools. J. Am. Statist. Assoc. 76, 410-414.
Meir, R. 1994. Bias, Variance, and the Combination of Estimators: The Case of Linear Least Squares. Tech. Rep. 922, Department of Electrical Engineering, Technion, Haifa, Israel.
Morris, P. A. 1974. Decision analysis expert use. Manage. Sci. 20, 1233-1241.
Morris, P. A. 1977. Combining expert judgements: A Bayesian approach. Manage. Sci. 23, 679-693.
Nakayama, K., and Shimojo, S. 1990. Toward a neural understanding of visual surface representation. Cold Spring Harbor Symp. Quant. Biol. 55, 911-924.
Nowlan, S. J. 1990. Competing Experts: An Experimental Investigation of Associative Mixture Models. Tech. Rep. CRG-TR-90-5, Department of Computer Science, University of Toronto.
Perrone, M. P. 1993. Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization. Ph.D. thesis, Department of Physics, Brown University.
Peterson, G. E., and Barney, H. L. 1952. Control methods used in a study of vowels. J. Acoust. Soc. Am. 24, 175-184.
Roberts, H. V. 1965. Probabilistic prediction. J. Am. Stat. Assoc. 60, 50-62.
Shuford, E. H., Albert, A., and Massengil, H. E. 1966. Admissible probability measurement procedures. Psychometrika 31, 125-145.
Stein, B. E., and Meredith, M. A. 1993. The Merging of the Senses. MIT Press, Cambridge, MA.
Trueswell, J. C., and Hayhoe, M. M. 1993. Surface segmentation mechanisms and motion perception. Vision Res. 33, 313-328.
Winkler, R. L. 1969. Scoring rules and the evaluation of probability assessors. J. Am. Statist. Assoc. 64, 1073-1078.
Winkler, R. L. 1981. Combining probability distributions from dependent information sources. Manage. Sci. 27, 479-488.
Wolpert, D. H. 1992. Stacked generalization. Neural Networks 5, 241-259.
Xu, L., Krzyzak, A., and Suen, C. Y. 1992. Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Trans. Systems, Man, Cybernet. 22, 418-435.
Young, M. J., Landy, M. S., and Maloney, L. T. 1993. A perturbation analysis of depth perception from combinations of texture and motion cues. Vision Res. 33, 2685-2696.
Zeki, S. 1993. A Vision of the Brain. Blackwell Scientific Publications, Oxford, UK.
Received March 29, 1994; accepted March 3, 1995.
Communicated by Michael Jordan
The Helmholtz Machine
Peter Dayan, Geoffrey E. Hinton, Radford M. Neal
Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, Ontario M5S 1A4, Canada
Richard S. Zemel
CNL, The Salk Institute, PO Box 85800, San Diego, CA 92186-5800 USA
Discovering the structure inherent in a set of patterns is a fundamental aim of statistical inference or learning. One fruitful approach is to build a parameterized stochastic generative model, independent draws from which are likely to produce the patterns. For all but the simplest generative models, each pattern can be generated in exponentially many ways. It is thus intractable to adjust the parameters to maximize the probability of the observed patterns. We describe a way of finessing this combinatorial explosion by maximizing an easily computed lower bound on the probability of the observations. Our method can be viewed as a form of hierarchical self-supervised learning that may relate to the function of bottom-up and top-down cortical processing pathways.

1 Introduction
Following Helmholtz, we view the human perceptual system as a statistical inference engine whose function is to infer the probable causes of sensory input. We show that a device of this kind can learn how to perform these inferences without requiring a teacher to label each sensory input vector with its underlying causes. A recognition model is used to infer a probability distribution over the underlying causes from the sensory input, and a separate generative model, which is also learned, is used to train the recognition model (Zemel 1994; Hinton and Zemel 1994; Zemel and Hinton 1995). As an example of the generative models in which we are interested, consider the shift patterns in Figure 1, which are on four 1 x 8 rows of binary pixels. These were produced by the two-level stochastic hierarchical generative process described in the figure caption. The task of learning is to take a set of examples generated by such a process and induce the model.

Neural Computation 7, 889-904 (1995) © 1995 Massachusetts Institute of Technology

Figure 1: Shift patterns. In each of these six patterns the bottom row of square pixels is a random binary vector, the top row is a copy shifted left or right by one pixel with wraparound, and the middle two rows are copies of the outer rows. The patterns were generated by a two-stage process. First the direction of the shift was chosen, with left and right being equiprobable. Then each pixel in the bottom row was turned on (white) with a probability of 0.2, and the corresponding shifted pixel in the top row and the copies of these in the middle rows were made to follow suit. If we treat the top two rows as a left retina and the bottom two rows as a right retina, detecting the direction of the shift resembles the task of extracting depth from simple stereo images of short vertical line segments. Copying the top and bottom rows introduces extra redundancy into the images that facilitates the search for the correct generative model.

Note that underlying any pattern there are multiple simultaneous causes. We call each possible set of causes an explanation of the pattern. For this particular example, it is possible to infer a unique set of causes for most patterns, but this need not always be the case. For general generative models, the causes need not be immediately evident from the surface form of patterns. Worse still, there can be an exponential number of possible explanations underlying each pattern. The computational cost of considering all of these explanations makes standard maximum likelihood approaches such as the Expectation-Maximization algorithm (Dempster et al. 1977) intractable. In this paper we describe a tractable approximation to maximum likelihood learning implemented in a layered hierarchical connectionist network.

2 The Recognition Distribution
The log probability of generating a particular example, d, from a model with parameters θ is

\[ \log p(d \mid \theta) = \log \sum_\alpha p(\alpha \mid \theta)\, p(d \mid \alpha, \theta) \tag{2.1} \]
where the α are explanations. If we view the alternative explanations of an example as alternative configurations of a physical system there is a precise analogy with statistical physics. We define the energy of explanation α to be

\[ E_\alpha(\theta, d) = -\log p(\alpha \mid \theta)\, p(d \mid \alpha, \theta) \tag{2.2} \]

The posterior probability of an explanation given d and θ is related to its energy by the equilibrium or Boltzmann distribution, which at a temperature of 1 gives

\[ P_\alpha = \frac{e^{-E_\alpha}}{\sum_{\alpha'} e^{-E_{\alpha'}}} \tag{2.3} \]

where indices θ and d in the last expression have been omitted for clarity. Using E_α and P_α, equation 2.1 can be rewritten in terms of the Helmholtz free energy, which is the difference between the expected energy of an explanation and the entropy of the probability distribution across explanations:

\[ \log p(d \mid \theta) = -F(d; \theta) = -\sum_\alpha P_\alpha E_\alpha - \sum_\alpha P_\alpha \log P_\alpha \tag{2.4} \]
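The identity in equation 2.4, and the bound obtained from any other distribution over explanations, can be checked directly on a small example. A minimal numerical sketch, with made-up probabilities that are not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: three explanations alpha for one observation d,
# with made-up prior p(alpha | theta) and likelihood p(d | alpha, theta).
p_alpha = np.array([0.5, 0.3, 0.2])
p_d_given_alpha = np.array([0.9, 0.1, 0.4])

E = -np.log(p_alpha * p_d_given_alpha)        # energies, equation 2.2
log_p_d = np.log(np.sum(p_alpha * p_d_given_alpha))

def neg_free_energy(Q):
    # -F = expected negative energy plus the entropy of Q
    return -np.sum(Q * E) - np.sum(Q * np.log(Q))

# At the Boltzmann distribution (equation 2.3), -F equals log p(d | theta) ...
P = np.exp(-E) / np.sum(np.exp(-E))
assert np.isclose(neg_free_energy(P), log_p_d)

# ... and any other distribution over explanations gives a lower bound.
for _ in range(100):
    Q = rng.dirichlet(np.ones(3))
    assert neg_free_energy(Q) <= log_p_d + 1e-12
```

The gap between log p(d | θ) and −F is exactly the Kullback-Leibler divergence that appears in the next section.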
So far, we have not gained anything in terms of computational tractability, because we still need to compute expectations under the posterior distribution P, which in general has exponentially many terms and cannot be factored into a product of simpler distributions. However, we know (Thompson 1988) that any probability distribution over the explanations will have at least as high a free energy as the Boltzmann distribution (equation 2.3). Therefore we can restrict ourselves to some class of tractable distributions and still have a lower bound on the log probability of the data. Instead of using the true posterior probability distribution, P, for averaging over explanations, we use a more convenient probability distribution, Q. The log probability of the data can then be written as

\[ \log p(d \mid \theta) = -\sum_\alpha Q_\alpha E_\alpha - \sum_\alpha Q_\alpha \log Q_\alpha + \sum_\alpha Q_\alpha \log\left[ Q_\alpha / P_\alpha \right] \tag{2.5} \]
\[ = -F(d; \theta, Q) + \sum_\alpha Q_\alpha \log\left[ Q_\alpha / P_\alpha \right] \tag{2.6} \]

where F is the free energy based on the incorrect or nonequilibrium posterior Q. Making the dependencies explicit, the last term in equation 2.5 is the Kullback-Leibler divergence between Q(φ, d) and the posterior distribution P(θ, d) (Kullback 1959). This term cannot be negative, so by ignoring it we get a lower bound on the log probability of the data given the model. In our work, the distribution Q is produced by a separate recognition model that has its own parameters, φ. These parameters are optimized at the same time as the parameters of the generative model, θ, to maximize the overall fit function −F(d; θ, φ) = −F[d; θ, Q(φ)]. Figure 2 shows
graphically the nature of the approximation we are making and the relationship between our procedure and the EM algorithm.

Figure 2: Graphic view of our approximation. The surface shows a simplified example of −F(θ, Q) as a function of the generative parameters θ and the recognition distribution Q, with the global maximum likelihood solution marked. As discussed by Neal and Hinton (1994), the Expectation-Maximization algorithm ascends this surface by optimizing alternately with respect to θ (the M-step) and Q (the E-step). After each E-step, the point on the surface lies on the line defined by Q_α = P_α, and on this line, −F = log p(d | θ). Using a factorial recognition distribution parameterized by φ restricts the surface over which the system optimizes (labeled "constrained posterior"). We ascend the restricted surface using a conjugate gradient optimization method. For a given θ, the difference between log p(d | θ) = max_Q{−F(θ, Q)} and −F(θ, Q) is the Kullback-Leibler penalty in equation 2.5. That EM gets stuck in a local maximum here is largely for graphic convenience, although neither it, nor our conjugate gradient procedure, is guaranteed to find its respective global optimum. Showing the factorial recognition as a connected region is an arbitrary convention; the actual structure of the recognition distributions cannot be preserved in one dimension.

From equation 2.5, maximizing −F is equivalent to maximizing the log probability
Figure 3: A simple three-layer Helmholtz machine modeling the activity of 5 binary inputs (layer 1) using a two-stage hierarchical model. Generative weights (θ) are shown as dashed lines, including the generative biases, the only such input to the units in the top layer. Recognition weights (φ) are shown with solid lines. Recognition and generative activation functions are described in the text.
of the data minus the Kullback-Leibler divergence, showing that this divergence acts like a penalty on the traditional log probability. The recognition model is thus encouraged to be a good approximation to the true posterior distribution P. However, the same penalty also encourages the generative model to change so that the true posterior distributions will be close to distributions that can be represented by the recognition model.

3 The Deterministic Helmholtz Machine
A Helmholtz machine (Fig. 3) is a simple implementation of these principles. It is a connectionist system with multiple layers of neuron-like binary stochastic processing units connected hierarchically by two sets of weights. Top-down connections θ implement the generative model. Bottom-up connections φ implement the recognition model.
The key simplifying assumption is that the recognition distribution for a particular example d, Q(φ, d), is factorial (separable) in each layer. If there are h stochastic binary units in a layer ℓ, the portion of the distribution P(θ, d) due to that layer is determined by 2^h − 1 probabilities. However, Q(φ, d) makes the assumption that the actual activity of any one unit in layer ℓ is independent of the activities of all the other units in that layer, given the activities of all the units in the lower layer, ℓ − 1, so the recognition model need only specify h probabilities rather than 2^h − 1. The independence assumption allows F(d; θ, φ) to be evaluated efficiently, but this computational tractability is bought at a price, since the true posterior is unlikely to be factorial: the log probability of the data will be underestimated by an amount equal to the Kullback-Leibler divergence between the true posterior and the recognition distribution. The generative model is taken to be factorial in the same way, although one should note that factorial generative models rarely have recognition distributions that are themselves exactly factorial. Recognition for input example d entails using the bottom-up connections φ to determine the probability q_j^ℓ(φ, d) that the jth unit in layer ℓ has activity s_j^ℓ = 1. The recognition model is inherently stochastic: these probabilities are functions of the 0/1 activities s^{ℓ−1} of the units in layer ℓ − 1. We use
\[ q_j^\ell(\phi, s^{\ell-1}) = \sigma\!\left( \sum_i s_i^{\ell-1} \phi_{ij}^{\ell-1,\ell} \right) \tag{3.1} \]

where σ(x) = 1/[1 + exp(−x)] is the conventional sigmoid function, and s^{ℓ−1} is the vector of activities of the units in layer ℓ − 1. All units have recognition biases as one element of the sums, all the activities at layer ℓ are calculated after all the activities at layer ℓ − 1, and the s_j^1 are the activities of the input units. It is essential that there are no feedback connections in the recognition model. In the terms of the previous section, α is a complete assignment of s_j^ℓ for all the units in all the layers other than the input layer (for which ℓ = 1). The multiplicative contributions to the probability of choosing that assignment using the recognition weights are q_j^ℓ for units that are on and 1 − q_j^ℓ for units that are off:

\[ Q_\alpha(\phi, d) = \prod_{\ell \ge 2} \prod_j \left( q_j^\ell \right)^{s_j^\ell} \left( 1 - q_j^\ell \right)^{1 - s_j^\ell} \tag{3.2} \]
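A stochastic bottom-up recognition pass following equations 3.1 and 3.2 can be sketched as follows; the weight shapes and variable names are illustrative only, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recognize(d, phi, rng):
    """One stochastic bottom-up pass (equations 3.1 and 3.2).

    d   : binary input vector (layer 1 activities).
    phi : list of (weight matrix, bias vector), one pair per higher layer.
    Returns the sampled explanation alpha and log Q_alpha(phi, d).
    """
    s = np.asarray(d, dtype=float)
    alpha, log_q = [], 0.0
    for W, b in phi:                                  # layers 2, 3, ...
        q = sigmoid(s @ W + b)                        # equation 3.1
        s = (rng.random(q.shape) < q).astype(float)   # sample binary units
        alpha.append(s)
        # multiplicative contributions of equation 3.2, in log form
        log_q += np.sum(s * np.log(q) + (1 - s) * np.log(1 - q))
    return alpha, log_q

rng = np.random.default_rng(1)
phi = [(rng.normal(size=(8, 4)), np.zeros(4)),
       (rng.normal(size=(4, 2)), np.zeros(2))]
alpha, log_q = recognize(rng.integers(0, 2, 8), phi, rng)
```

Because the pass is strictly feedforward, each layer's probabilities depend only on the sampled activities of the layer below, as the factorial assumption requires.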
The Helmholtz free energy F depends on the generative model through E_α(θ, d) in equation 2.2. The top-down connections θ use the activities s^{ℓ+1} of the units in layer ℓ + 1 to determine the factorial generative probabilities p_j^ℓ(θ, s^{ℓ+1}) over the activities of the units in layer ℓ. The obvious rule to use is the sigmoid:

\[ p_j^\ell(\theta, s^{\ell+1}) = \sigma\!\left( \sum_k s_k^{\ell+1} \theta_{kj}^{\ell+1,\ell} \right) \tag{3.3} \]
including a generative bias (which is the only contribution to units in the topmost layer). Unfortunately this rule did not work well in practice for the sorts of inputs we tried. Appendix A discusses the more complicated method that we actually used to determine p_j^ℓ(θ, s^{ℓ+1}). Given this, the overall generative probability of α is

\[ p(\alpha \mid \theta) = \prod_{\ell \ge 2} \prod_j \left( p_j^\ell \right)^{s_j^\ell} \left( 1 - p_j^\ell \right)^{1 - s_j^\ell} \tag{3.4} \]
We extend the factorial assumption to the input layer ℓ = 1. The activities s² in layer 2 determine the probabilities p_j^1(θ, s²) of the activities in the input layer. Thus

\[ p(d \mid \theta, \alpha) = \prod_j \left( p_j^1 \right)^{s_j^1} \left( 1 - p_j^1 \right)^{1 - s_j^1} \tag{3.5} \]

Combining equations 2.2, 3.4, and 3.5, and omitting dependencies for clarity,

\[ E_\alpha(\theta, d) = -\log p(\alpha \mid \theta)\, p(d \mid \theta, \alpha) \tag{3.6} \]
\[ = -\sum_\ell \sum_j \left[ s_j^\ell \log p_j^\ell + \left( 1 - s_j^\ell \right) \log\left( 1 - p_j^\ell \right) \right] \tag{3.7} \]
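The energy of equation 3.6 can be evaluated with one top-down pass. A minimal sketch, assuming the simple sigmoid generative rule of equation 3.3 and illustrative layer sizes and names:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_bernoulli(s, p):
    return np.sum(s * np.log(p) + (1 - s) * np.log(1 - p))

def energy(layers, theta, top_bias):
    """E_alpha(theta, d) = -log p(alpha | theta) p(d | theta, alpha).

    layers   : binary vectors [s^1 (the data d), s^2, ..., s^L], bottom-up.
    theta    : theta[l] = (W, b) generating layer l+1 from layer l+2
               (0-indexed, so theta[0] generates the input layer from layer 2).
    top_bias : generative biases, the only input to the top layer.
    """
    # top layer is driven only by its generative biases
    E = -log_bernoulli(layers[-1], sigmoid(top_bias))
    # every lower layer is generated by the one above it
    for l in range(len(layers) - 1):
        W, b = theta[l]
        p = sigmoid(layers[l + 1] @ W + b)   # sigmoid rule, equation 3.3
        E -= log_bernoulli(layers[l], p)
    return E

rng = np.random.default_rng(2)
layers = [rng.integers(0, 2, n).astype(float) for n in (8, 4, 2)]
theta = [(rng.normal(size=(4, 8)), np.zeros(8)),
         (rng.normal(size=(2, 4)), np.zeros(4))]
E = energy(layers, theta, np.zeros(2))
assert E > 0.0   # a negative log probability is always positive
```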
Putting together the two components of F, an unbiased estimate of the value of F(d; θ, φ) based on an explanation α drawn from Q_α is

\[ \hat{F}_\alpha(d; \theta, \phi) = E_\alpha + \log Q_\alpha \tag{3.8} \]
\[ = -\sum_\ell \sum_j \left[ s_j^\ell \log p_j^\ell + \left( 1 - s_j^\ell \right) \log\left( 1 - p_j^\ell \right) \right] + \sum_{\ell \ge 2} \sum_j \left[ s_j^\ell \log q_j^\ell + \left( 1 - s_j^\ell \right) \log\left( 1 - q_j^\ell \right) \right] \tag{3.9} \]

One could perform stochastic gradient ascent in the negative free energy across all the data, −F(θ, φ) = −Σ_d F(d; θ, φ), using equation 3.9 and a form of REINFORCE algorithm (Barto and Anandan 1985; Williams 1992). However, for the simulations in this paper, we made a number of mean-field inspired approximations, in that we replaced the stochastic binary activities s_j^ℓ by their mean values under the recognition model. We took

\[ s_j^\ell \approx q_j^\ell(\phi, d) \tag{3.10} \]
we made a similar approximation for p_j^ℓ, which we discuss in Appendix A, and we then averaged the expression in equation 3.9 over α to give the overall free energy:

\[ F(\theta, \phi) = \sum_d \sum_\ell \sum_j \mathrm{KL}\!\left[ q_j^\ell, p_j^\ell \right] \tag{3.11} \]
where the innermost term in the sum is the Kullback-Leibler divergence between generative and recognition distributions for unit j in layer ℓ for example d:

\[ \mathrm{KL}\!\left[ q_j^\ell, p_j^\ell \right] = q_j^\ell \log \frac{q_j^\ell}{p_j^\ell} + \left( 1 - q_j^\ell \right) \log \frac{1 - q_j^\ell}{1 - p_j^\ell} \]
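This per-unit term is easy to verify numerically; a small sketch, with variable names of our choosing:

```python
import numpy as np

def unit_kl(q, p):
    """Per-unit Kullback-Leibler terms of equation 3.11, summed over a layer;
    q and p are vectors of recognition and generative probabilities."""
    return np.sum(q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p)))

q = np.array([0.2, 0.7, 0.5])
assert np.isclose(unit_kl(q, q), 0.0)                 # zero when the models agree
assert unit_kl(q, np.array([0.3, 0.6, 0.5])) > 0.0    # positive otherwise
```

Since every term is nonnegative, F can only fall when the generative and recognition probabilities for each unit move toward one another.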
Weights θ and φ are trained by following the derivatives of F(θ, φ) in equation 3.11. Since the generative weights θ do not affect the actual activities of the units, there are no cycles, and so the derivatives can be calculated in closed form using the chain rule. Appendix B gives the appropriate recursive formulas. Note that this deterministic version introduces a further approximation by ignoring correlations arising from the fact that, under the real recognition model, the actual activities at layer ℓ + 1 are a function of the actual activities at layer ℓ rather than their mean values. Figure 4 demonstrates the performance of the Helmholtz machine in a hierarchical learning task (Becker and Hinton 1992), showing that it is capable of extracting the structure underlying a complicated generative model. The example shows clearly the difference between the generative (θ) and the recognition (φ) weights, since the latter often include negative side-lobes around their favored shifts, which are needed to prevent incorrect recognition.

4 The Wake-Sleep Algorithm
The derivatives required for learning in the deterministic Helmholtz machine are quite complicated because they have to take into account the effects that changes in an activity at one layer will have on activities in higher layers. However, by borrowing an idea from the Boltzmann machine (Hinton and Sejnowski 1986; Ackley et al. 1985), we get the wake-sleep algorithm, a very simple learning scheme for layered networks of stochastic binary units that approximates the correct derivatives (Hinton et al. 1995). Learning in the wake-sleep algorithm is separated into two phases. During the wake phase, data d from the world are presented at the lowest layer, and binary activations of units at successively higher layers are picked according to the recognition probabilities q_j^ℓ(φ, s^{ℓ−1}) determined by the bottom-up weights. The top-down generative weights from layer ℓ + 1 to layer ℓ are then altered to reduce the Kullback-Leibler divergence between the actual activations and the generative probabilities p_j^ℓ(θ, s^{ℓ+1}). In the sleep phase, the recognition weights are turned off and the top-down weights are used to activate the units. Starting at the top layer, activities are generated at successively lower layers based on the current top-down weights θ. The network thus generates a random instance from
its generative model. Since it has generated the instance, it knows the true underlying causes, and therefore has available the target values for the hidden units that are required to train the bottom-up weights. If the bottom-up and the top-down activation functions are both sigmoid (equations 3.1 and 3.3), then both phases use exactly the same learning rule, the purely local delta rule (Widrow and Stearns 1985).
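The two phases can be sketched for a network with a single hidden layer; this is a simplified illustration with hypothetical names and sizes, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class WakeSleep:
    """Minimal wake-sleep sketch: nh hidden units model nv visible units."""

    def __init__(self, nv, nh, lr=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.R = np.zeros((nv, nh))   # recognition weights phi (bottom-up)
        self.rb = np.zeros(nh)        # recognition biases
        self.G = np.zeros((nh, nv))   # generative weights theta (top-down)
        self.gb = np.zeros(nv)        # generative biases of the input layer
        self.hb = np.zeros(nh)        # generative biases of the hidden layer
        self.lr = lr

    def sample(self, p):
        return (self.rng.random(p.shape) < p).astype(float)

    def wake(self, d):
        # recognize the datum, then train the generative side by the delta rule
        h = self.sample(sigmoid(d @ self.R + self.rb))
        p = sigmoid(h @ self.G + self.gb)
        self.G += self.lr * np.outer(h, d - p)
        self.gb += self.lr * (d - p)
        self.hb += self.lr * (h - sigmoid(self.hb))

    def sleep(self):
        # dream an instance top-down, then train the recognition side on it
        h = self.sample(sigmoid(self.hb))
        v = self.sample(sigmoid(h @ self.G + self.gb))
        q = sigmoid(v @ self.R + self.rb)
        self.R += self.lr * np.outer(v, h - q)
        self.rb += self.lr * (h - q)

net = WakeSleep(nv=8, nh=3)
for _ in range(50):
    net.wake(net.rng.integers(0, 2, 8).astype(float))
    net.sleep()
```

Both updates are the same purely local delta rule; the only difference between the phases is which set of weights produced the activities and which is being trained.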
[Figure 4 weight diagrams, facing page: recognition arrays 2-3 (largest weight 3.7) and 1-2 (11.7); generative arrays 3-2 (13.3) and 2-1 (13.3); biases to layer 2 shown separately (38.4 and 3.0).]
Unfortunately, there is no single cost function that is reduced by these two procedures. This is partly because the sleep phase trains the recognition model to invert the generative model for input vectors that are distributed according to the generative model rather than according to the real data, and partly because the sleep phase learning does not follow the correct gradient. Nevertheless, Q_α = P_α at the optimal end point, if it can be reached. Preliminary results by Brendan Frey (personal communication) show that this algorithm works well on some nontrivial tasks.
5 Discussion

The Helmholtz machine can be viewed as a hierarchical generalization of the type of learning procedure described by Zemel (1994) and Hinton and Zemel (1994). Instead of using a fixed independent prior distribution for each of the hidden units in a layer, the Helmholtz machine makes this prior more flexible by deriving it from the bottom-up activities of units in the layer above. In related work, Zemel and Hinton (1995) show that a system can learn a redundant population code in a layer of hidden units, provided the activities of the hidden units are represented by a point in a multidimensional constraint space with pre-specified dimensionality. The role of their constraint space is to capture statistical dependencies among the hidden unit activities, and this can again be achieved in a more uniform way by using a second hidden layer in a hierarchical generative model of the type described here.

Figure 4: Facing page. The shifter. Recognition and generative weights for a three-layer Helmholtz machine's model for the shifter problem (see Fig. 1 for how the input patterns are generated). Each weight diagram shows recognition or generative weights between the given layers (1-2, 2-3, etc.) and the number quoted is the magnitude of the largest weight in the array. White is positive, black negative, but the generative weights shown are the natural logarithms of the ones actually used. The lowest weights in the 2-3 block are the biases to layer 3; the biases to layer 2 are shown separately because of their different magnitude. All the units in layer 2 are either silent, or respond to one or two pairs of appropriately shifted bits. The recognition weights have inhibitory side lobes to stop their units from responding incorrectly. The units in layer 3 are shift tuned, and respond to the units in layer 2 of their own shift direction.
Note that under the imaging model (equation A.2 or A.3), a unit in layer 3 cannot specify that one in layer 2 should be off, forcing a solution that requires two units in layer 3. One aspect of the generative model is therefore not correctly captured. Finding weights equivalent to those shown is hard, requiring many iterations of a conjugate gradient algorithm. To prevent the units in layers 2 and 3 from being permanently turned off early in the learning, they were given fixed, but tiny, generative biases (θ = 0.05). Additional generative biases to layer 3 are shown in the figure; they learn the overall probability of left and right shifts.
The old idea of analysis-by-synthesis assumes that the cortex contains a generative model of the world and that recognition involves inverting the generative model in real time. This has been attempted for nonprobabilistic generative models (MacKay 1956; Pece 1992). However, for stochastic ones it typically involves Markov chain Monte Carlo methods (Neal 1992). These can be computationally unattractive, and their requirement for repeated sampling renders them unlikely to be employed by the cortex. In addition to making learning tractable, its separate recognition model allows a Helmholtz machine to recognize without iterative sampling, and makes it much easier to see how generative models could be implemented in the cortex without running into serious time constraints. During recognition, the generative model is superfluous, since the recognition model contains all the information that is required. Nevertheless, the generative model plays an essential role in defining the objective function F that allows the parameters φ of the recognition model to be learned. The Helmholtz machine is closely related to other schemes for self-supervised learning that use feedback as well as feedforward weights (Carpenter and Grossberg 1987; Luttrell 1992, 1994; Ullman 1994; Kawato et al. 1993; Mumford 1994). By contrast with adaptive resonance theory (Carpenter and Grossberg 1987) and the counter-streams model (Ullman 1994), the Helmholtz machine treats self-supervised learning as a statistical problem: one of ascertaining a generative model that accurately captures the structure in the input examples. Luttrell (1992, 1994) discusses multilayer self-supervised learning aimed at faithful vector quantization in the face of noise, rather than our aim of maximizing the likelihood. The outputs of his separate low-level coding networks are combined at higher levels, and thus their optimal coding choices become mutually dependent.
These networks can be given a coding interpretation that is very similar to that of the Helmholtz machine. However, we are interested in distributed rather than local representations at each level (multiple cause rather than single cause models), forcing the approximations that we use. Kawato et al. (1993) consider forward (generative) and inverse (recognition) models (Jordan and Rumelhart 1992) in a similar fashion to the Helmholtz machine, but without this probabilistic perspective. The recognition weights between two layers do not just invert the generative weights between those layers, but also take into account the prior activities in the upper layer. The Helmholtz machine fits comfortably within the framework of Grenander's pattern theory (Grenander 1976) in the form of Mumford's (1994) proposals for the mapping onto the brain. As described, the recognition process in the Helmholtz machine is purely bottom-up: the top-down generative model plays no direct role and there is no interaction between units in a single layer. However, such effects are important in real perception and can be implemented using iterative recognition, in which the generative and recognition activations interact to produce the final activity of a unit. This can introduce
substantial theoretical complications in ensuring that the activation process is stable and converges adequately quickly, and in determining how the weights should change so as to capture input examples more accurately. An interesting first step toward interaction within layers would be to organize their units into small clusters with local excitation and longer-range inhibition, as is seen in the columnar structure of the brain. Iteration would be confined within layers, easing the complications.

Appendix A: The Imaging Model

The sigmoid activation function given in equation 3.3 turned out not to work well for the generative model for the input examples we tried, such as the shifter problem (Fig. 1). Learning almost invariably got caught in one of a variety of local minima. In the context of a one-layer generative model and without a recognition model, Saund (1994, 1995) discussed why this might happen in terms of the underlying imaging model, which is responsible for turning binary activities in what we call layer 2 into probabilities of activation of the units in the input layer. He suggested using a noisy-or imaging model (Pearl 1988), for which the weights 0 ≤ θ_{kj}^{ℓ+1,ℓ} ≤ 1 are interpreted as probabilities that s_j^ℓ = 1 if unit s_k^{ℓ+1} = 1, and are combined as

\[ p_j^\ell(\theta, s^{\ell+1}) = 1 - \prod_k \left( 1 - s_k^{\ell+1} \theta_{kj}^{\ell+1,\ell} \right) \tag{A.1} \]
The noisy-or imaging model worked somewhat better than the sigmoid model of equation 3.3, but it was still prone to fall into local minima. Dayan and Zemel (1995) suggested a yet more competitive rule based on the integrated segmentation and recognition architecture of Keeler et al. (1991). In this, the weights 0 ≤ θ_{kj}^{ℓ+1,ℓ} are interpreted as the odds that s_j^ℓ = 1 if unit s_k^{ℓ+1} = 1, and are combined as

\[ p_j^\ell(\theta, s^{\ell+1}) = 1 - \frac{1}{1 + \sum_k s_k^{\ell+1} \theta_{kj}^{\ell+1,\ell}} \tag{A.2} \]

For the deterministic Helmholtz machine, we need a version of this activation rule that uses the probabilities q^{ℓ+1} rather than the binary samples s^{ℓ+1}. This is somewhat complicated, since the obvious expression 1 − 1/(1 + Σ_k q_k^{ℓ+1} θ_{kj}^{ℓ+1,ℓ}) turns out not to work. In the end (Dayan and Zemel 1995) we used a product of this term and the deterministic version of the noisy-or:

\[ p_j^\ell(\theta, q^{\ell+1}) = \left[ 1 - \frac{1}{1 + \sum_k q_k^{\ell+1} \theta_{kj}^{\ell+1,\ell}} \right] \left[ 1 - \prod_k \frac{1}{1 + q_k^{\ell+1} \theta_{kj}^{\ell+1,\ell}} \right] \tag{A.3} \]
Appendix B gives the derivatives of this. We used the exact expected value of equation A.2 if there were only three units in layer ℓ + 1, because it is computationally inexpensive to work it out. For convenience, we used the same imaging model (equations A.2 and A.3) for all the generative connections. In general one could use different types of connections between different levels.
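The three activation rules of this appendix can be compared in their mean-field forms; a sketch under our reading of equation A.3, with illustrative inputs and function names of our choosing:

```python
import numpy as np

# Mean-field forms of the generative activation rules, for one unit receiving
# from a layer with probabilities q through nonnegative weights theta.
def sigmoid_rule(q, theta, bias=0.0):          # equation 3.3
    return 1.0 / (1.0 + np.exp(-(q @ theta + bias)))

def noisy_or(q, theta):                        # equation A.1, weights in [0, 1]
    return 1.0 - np.prod(1.0 - q * theta)

def competitive(q, theta):                     # equation A.3, weights as odds
    return (1.0 - 1.0 / (1.0 + q @ theta)) * \
           (1.0 - np.prod(1.0 / (1.0 + q * theta)))

q = np.array([0.0, 0.5, 1.0])
assert 0.0 <= noisy_or(q, np.array([0.3, 0.3, 0.3])) <= 1.0
assert 0.0 <= competitive(q, np.array([2.0, 0.5, 4.0])) <= 1.0
# unlike the sigmoid, both "or"-like rules output exactly 0 when no cause is on
assert noisy_or(np.zeros(3), np.array([0.3, 0.3, 0.3])) == 0.0
assert competitive(np.zeros(3), np.array([2.0, 0.5, 4.0])) == 0.0
```

The last two assertions show the qualitative difference from the sigmoid rule that the appendix exploits: with no active causes, the noisy-or and competitive rules cannot turn an input unit on.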
Appendix B: The Derivatives

Write F(d; θ, φ) for the contribution to the overall error in equation 3.11 for input example d, including the input layer:

\[ F(d; \theta, \phi) = \sum_\ell \sum_j \left[ q_j^\ell \log \frac{q_j^\ell}{p_j^\ell} + \left( 1 - q_j^\ell \right) \log \frac{1 - q_j^\ell}{1 - p_j^\ell} \right] \]

where for the input layer q_j^1 is fixed at the binary value of the jth input. Then the total derivative for input example d with respect to the activation of a unit in layer ℓ is

\[ \frac{d F(d; \theta, \phi)}{d q_j^\ell} = \frac{\partial F(d; \theta, \phi)}{\partial q_j^\ell} + \sum_k \frac{d F(d; \theta, \phi)}{d q_k^{\ell+1}} \frac{\partial q_k^{\ell+1}}{\partial q_j^\ell} \]

since changing q_j^ℓ affects the generative priors at layer ℓ − 1, and the recognition activities at all layers higher than ℓ. These derivatives can be calculated in a single backward propagation pass through the network, accumulating dF(d; θ, φ)/dq_j^ℓ as it goes. The use of standard sigmoid units in the recognition direction makes ∂q_k^{ℓ+1}/∂q_j^ℓ completely conventional. Using equation A.3 makes

\[ \frac{\partial p_j^\ell}{\partial q_k^{\ell+1}} = \frac{\theta_{kj}^{\ell+1,\ell}}{\left( 1 + \sum_i q_i^{\ell+1} \theta_{ij}^{\ell+1,\ell} \right)^2} \left[ 1 - \prod_i \frac{1}{1 + q_i^{\ell+1} \theta_{ij}^{\ell+1,\ell}} \right] + \left[ 1 - \frac{1}{1 + \sum_i q_i^{\ell+1} \theta_{ij}^{\ell+1,\ell}} \right] \left[ \prod_i \frac{1}{1 + q_i^{\ell+1} \theta_{ij}^{\ell+1,\ell}} \right] \frac{\theta_{kj}^{\ell+1,\ell}}{1 + q_k^{\ell+1} \theta_{kj}^{\ell+1,\ell}} \]
One also needs the corresponding derivative with respect to the generative weights, which for equation A.3 has the same form with the roles of q_k^{ℓ+1} and θ_{kj}^{ℓ+1,ℓ} exchanged:

\[ \frac{\partial p_j^\ell}{\partial \theta_{kj}^{\ell+1,\ell}} = \frac{q_k^{\ell+1}}{\left( 1 + \sum_i q_i^{\ell+1} \theta_{ij}^{\ell+1,\ell} \right)^2} \left[ 1 - \prod_i \frac{1}{1 + q_i^{\ell+1} \theta_{ij}^{\ell+1,\ell}} \right] + \left[ 1 - \frac{1}{1 + \sum_i q_i^{\ell+1} \theta_{ij}^{\ell+1,\ell}} \right] \left[ \prod_i \frac{1}{1 + q_i^{\ell+1} \theta_{ij}^{\ell+1,\ell}} \right] \frac{q_k^{\ell+1}}{1 + q_k^{\ell+1} \theta_{kj}^{\ell+1,\ell}} \]

This is exactly what we used for the imaging model in equation A.3. However, it is important to bear in mind that p_j^ℓ(θ, s^{ℓ+1}) should really be a function of the stochastic choices of the units in layer ℓ + 1. The contribution to the expected cost F is a function of ⟨log p_j^ℓ(θ, s^{ℓ+1})⟩ and ⟨log[1 − p_j^ℓ(θ, s^{ℓ+1})]⟩, where ⟨·⟩ indicates averaging over the recognition distribution. These are not the same as log⟨p_j^ℓ(θ, s^{ℓ+1})⟩ and log[1 − ⟨p_j^ℓ(θ, s^{ℓ+1})⟩], which is what the deterministic machine uses. For other imaging models it is possible to take this into account.
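A quick numerical check of the A.3 derivative, under the same mean-field reading as above, compares the analytic form against central finite differences; variable names are ours:

```python
import numpy as np

def competitive(q, theta):
    """Mean-field imaging rule: product of the odds term and the
    deterministic noisy-or (our reading of equation A.3)."""
    return (1.0 - 1.0 / (1.0 + q @ theta)) * \
           (1.0 - np.prod(1.0 / (1.0 + q * theta)))

def competitive_grad(q, theta):
    """Analytic dp/dq_k for the rule above (product rule on the two factors)."""
    x = q @ theta
    off = np.prod(1.0 / (1.0 + q * theta))   # probability that all causes miss
    u = 1.0 - 1.0 / (1.0 + x)
    v = 1.0 - off
    return theta / (1.0 + x) ** 2 * v + u * off * theta / (1.0 + q * theta)

rng = np.random.default_rng(3)
q, theta = rng.random(5), 2.0 * rng.random(5)
g = competitive_grad(q, theta)
eps = 1e-6
for k in range(5):
    dq = np.zeros(5)
    dq[k] = eps
    fd = (competitive(q + dq, theta) - competitive(q - dq, theta)) / (2 * eps)
    assert abs(fd - g[k]) < 1e-6
```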
Acknowledgments

We are very grateful to Drew van Camp, Brendan Frey, Geoff Goodhill, Mike Jordan, David MacKay, Mike Revow, Virginia de Sa, Nici Schraudolph, Terry Sejnowski, and Chris Williams for helpful discussions and comments, and particularly to Mike Jordan for extensive criticism of an earlier version of this paper. This work was supported by NSERC and IRIS. G. E. H. is the Noranda Fellow of the Canadian Institute for Advanced Research. The current address for R. S. Z. is Baker Hall 330, Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15213.
References

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. 1985. A learning algorithm for Boltzmann machines. Cog. Sci. 9, 147-169.
Barto, A. G., and Anandan, P. 1985. Pattern recognizing stochastic learning automata. IEEE Trans. Syst. Man Cybernet. 15, 360-374.
Becker, S., and Hinton, G. E. 1992. A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature (London) 355, 161-163.
Carpenter, G., and Grossberg, S. 1987. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comp. Vision Graphics Image Process. 37, 54-115.
Dayan, P., and Zemel, R. S. 1995. Competition and multiple cause models. Neural Comp. 7, 565-579.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. B 39, 1-38.
Grenander, U. 1976-1981. Lectures in Pattern Theory I, II and III: Pattern Analysis, Pattern Synthesis and Regular Structures. Springer-Verlag, Berlin.
Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. 1995. The wake-sleep algorithm for unsupervised neural networks. Science 268, 1158-1160.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., pp. 282-317. MIT Press, Cambridge, MA.
Hinton, G. E., and Zemel, R. S. 1994. Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 3-10. Morgan Kaufmann, San Mateo, CA.
Jordan, M. I., and Rumelhart, D. E. 1992. Forward models: Supervised learning with a distal teacher. Cog. Sci. 16, 307-354.
Kawato, M., Hayakama, H., and Inui, T. 1993. A forward-inverse optics model of reciprocal connections between visual cortical areas. Network 4, 415-422.
Keeler, J. D., Rumelhart, D. E., and Leow, W. K. 1991. Integrated segmentation and recognition of hand-printed numerals. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. Moody, and D. S. Touretzky, eds., pp. 557-563. Morgan Kaufmann, San Mateo, CA.
Kullback, S. 1959. Information Theory and Statistics. Wiley, New York.
Luttrell, S. P. 1992. Self-supervised adaptive networks. IEE Proc. Part F 139, 371-377.
Luttrell, S. P. 1994. A Bayesian analysis of self-organizing maps. Neural Comp. 6, 767-794.
MacKay, D. M. 1956. The epistemological problem for automata. In Automata Studies, C. E. Shannon and J. McCarthy, eds., pp. 235-251. Princeton University Press, Princeton, NJ.
Mumford, D. 1994. Neuronal architectures for pattern-theoretic problems. In Large-Scale Theories of the Cortex, C. Koch and J. Davis, eds., pp. 125-152. MIT Press, Cambridge, MA.
Neal, R. M. 1992. Connectionist learning of belief networks. Artificial Intelligence 56, 71-113.
Neal, R. M., and Hinton, G. E. 1994. A new view of the EM algorithm that justifies incremental and other variants. Biometrika (submitted).
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
Pece, A. E. C. 1992. Redundancy reduction of a Gabor representation: A possible computational role for feedback from primary visual cortex to lateral geniculate nucleus. In Artificial Neural Networks, I. Aleksander and J. Taylor, eds., Vol. 2, pp. 865-868. Elsevier, Amsterdam.
Saund, E. 1994. Unsupervised learning of mixtures of multiple causes in binary data. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 27-34. Morgan Kaufmann, San Mateo, CA.
Saund, E. 1995. A multiple cause mixture model for unsupervised learning. Neural Comp. 7, 51-71.
Thompson, C. J. 1988. Classical Equilibrium Statistical Mechanics. Clarendon Press, Oxford.
Ullman, S. 1994. Sequence seeking and counterstreams: A model for bidirectional information flow in the cortex. In Large-Scale Theories of the Cortex, C. Koch and J. Davis, eds., pp. 257-270. MIT Press, Cambridge, MA.
Widrow, B., and Stearns, S. D. 1985. Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learn. 8, 229-256.
Zemel, R. S. 1994. A Minimum Description Length Framework for Unsupervised Learning. Ph.D. Dissertation, Computer Science, University of Toronto, Canada.
Zemel, R. S., and Hinton, G. E. 1995. Learning population codes by minimizing description length. Neural Comp. 7, 549-564.
Received August 29, 1994; accepted December 22, 1994.
Communicated by Jack Cowan
Spontaneous Excitations in the Visual Cortex: Stripes, Spirals, Rings, and Collective Bursts

Corinna Fohlmeister, Wulfram Gerstner, Raphael Ritz, and J. Leo van Hemmen
Physik-Department der TU München, D-85747 Garching bei München, Germany
As a simple model of the cortical sheet, we study a locally connected net of spiking neurons. Refractoriness, noise, axonal delays, and the time course of excitatory and inhibitory postsynaptic potentials are taken into account explicitly. In addition to a low-activity state and depending on the synaptic efficacy, four different scenarios evolve spontaneously, viz., stripes, spirals, rings, and collective bursts. Our results can be related to experimental observations of drug-induced epilepsy and hallucinations.

1 Introduction
What do spontaneous coherent excitations in the primary visual cortex look like in time and, what interests us here, in space? This is a fascinating question whose solution is, to some extent, now within reach of computational neuroscience. It is generally believed that this kind of excitation occurs in drug-induced epilepsy and, presumably, also in hallucinations. Hallucinations (Klüver 1967; Siegel and West 1975; Siegel 1977; Cowan 1985) are perceptions in the absence of a visual stimulus. They can occur even in subjects that have been completely blinded by a retinal disease (Zeki 1993). It was Klüver (1967) who in the twenties started experiments to classify what he called "form constants," which meanwhile have turned out to be universal characteristics of the first stage of drug-induced imagery, most notably under LSD. There are at least four categories of form constant: grating and filigree, spiral, tunnel and funnel, and cobweb. The imagery of the second stage is much more complex and, without any doubt, involves several areas of the brain. We mention two key questions: Are the form constants generated in the primary visual cortex (areas V1 and V2) or are they due to functional feedback, i.e., feedback from other areas with different functions? Second, can we understand the form constants theoretically?

Neural Computation 7, 905-914 (1995)
© 1995 Massachusetts Institute of Technology
There exists a mathematically very elegant analysis by Ermentrout and Cowan (1979). Their main hypothesis, which we will adopt as well, is that the form constants can be modeled as elementary excitations in the primary visual cortex. The model uses rate coding and takes the complex logarithm (Schwartz 1977) as the retinocortical map. The patterns follow from a bifurcation analysis in a neighborhood of the homogeneous low-activity state, a linear theory. A final result is that parallel stripes of active and quiescent neurons constitute elementary excitations of the model. Due to the retinocortical map, some of the cortical stripe patterns should appear as spirals on the retina. One may wonder, though, what the spontaneous excitations are in a "realistic" nonlinear cortical network of spiking, noisy neurons. This is the question we will focus on. In so doing we can, and will, verify the above hypothesis. In the context of our model we conclude that several, but not all, form constants occur as spontaneous excitations. Furthermore, we do encounter spatiotemporal activity patterns as found experimentally in drug-induced epilepsy. At the same time as us, but in a network of integrate-and-fire neurons without delays, local inhibition, and noise, Milton et al. (1993) found spirals as elementary excitations that evolve out of a fixed excitation center. Spirals are inconsistent with the parallel stripes referred to above. Below we will clarify the situation and show that there is in fact a sequence of at least four scenarios. In so doing we will avoid any external input and exploit several neural characteristics which have been incorporated into our own spike response model.

2 Spike Response Model
The essentials of neuronal behavior are the absolute and relative refractory period, the response at the soma resulting from synaptic input (usually described by an alpha function), the omnipresent delays, and noise. All these ingredients have been incorporated in the spike response model (Gerstner and van Hemmen 1992, 1993; Gerstner et al. 1993). It presents a faithful but simplified description of the neurons themselves without taking recourse to differential equations. This is essential since we have to study the spatial activity of a large system of neurons (say N ≥ 20,000) over a long period of time. We discretize time by units Δt = 1 msec, the width of a spike, and label the neurons on a two-dimensional square lattice by the index i. The state of a neuron is described by S_i ∈ {0, 1}. If the potential h_i at the hillock of neuron i reaches the threshold ϑ, then the neuron is expected to fire. We describe this stochastic behavior through a noise parameter β in the transition probability

Prob{S_i(t + Δt) = 1 | h_i(t)} = (1/2){1 + tanh[β(h_i(t) − ϑ)]}    (2.1)
This is the conditional probability that neuron i fires at time t + Δt given h_i(t). In the noise-free limit β → ∞ we get S_i(t + Δt) = Θ[h_i(t) − ϑ], where Θ is the Heaviside step function: Θ(x) = 1 for x ≥ 0 and Θ(x) = 0 for x < 0. In the numerics to be described below we have taken β = 25 and ϑ = 0.12. The spike response model describes the response of a neuron, both the sender and the receiver, to a spike. If a neuron has fired a spike, it exhibits refractory behavior for a while, i.e., it cannot or can hardly spike. This is taken care of by the refractory function η, which is −∞ during the absolute refractory period and negative but increasing to zero thereafter,

h_i^refr(t) = Σ_{τ>0} η(τ) S_i(t − τ)    (2.2)

Here we take η(τ) = −∞ for τ = 1 and zero elsewhere. The spike travels along an axon and reaches a synapse on the dendritic tree of neuron i after Δ_i msec. Let the synaptic strength be J_ij and denote the alpha function by ε. Then we obtain for the total input at the hillock of neuron i

h_i^syn(t) = Σ_j J_ij Σ_{τ>0} ε(τ) S_j(t − τ − Δ_i)    (2.3)

where ε(τ) = (τ/τ_s²) exp(−τ/τ_s) so that Σ_τ ε(τ) = 1; here τ_s = 2 msec. For the sake of computational simplicity we have assumed that the delays Δ_i depend on i (instead of, say, j). In this work the Δ_i are taken from {0, 1, 2} with equal probability. Furthermore, J_ii always vanishes. The neurons considered so far are pyramidal cells. The stellate cells are modeled by an inhibitory loop, which is assigned to each neuron,
h_i^inh(t) = Σ_{τ>0} ε^inh(τ) S_i(t − τ − Δ_i^inh)    (2.4)

where ε^inh(τ) first assumes a strongly negative value during 5 msec (shunting inhibition) and then decays exponentially with a time constant τ_inh = 6 msec. Moreover, Δ_i^inh ∈ {3, 4, 5} is a uniformly distributed random variable. It is known that stellate cells operate locally. This we have simplified to a strictly local interaction; for details, see Gerstner et al. (1993). Putting things together we find

h_i(t) = h_i^refr(t) + h_i^syn(t) + h_i^inh(t)    (2.5)
which is to be substituted into 2.1. What is left is specifying the J_ij in 2.3. Since we are concerned with visual percepts such as hallucinations it seems natural, even imperative (Zeki 1993), to model the primary visual cortex. We will work with a simplified model of cortical connectivity. Inside a column the pyramidal cells experience an excitatory interaction.
Different columns with strongly different direction preferences are expected to inhibit each other. The upshot is a "Mexican hat,"¹

J_ij = A exp(−r_ij/λ_1) − B exp(−r_ij/λ_2)    (2.6)

with A ≫ B. Here r_ij is the Euclidean distance between i and j. A second possibility, which has also been studied, is

J_ij = A for r_ij ≤ r_0    and    J_ij = −B for r_0 < r_ij ≤ r_max    (2.7)
with r_max ≤ 30 and, again, A ≫ B. We use free boundary conditions. In our numerical simulations we have seen no difference between 2.6 and 2.7. Alternatively, and giving rise to the very same scenarios, one can replace J_ij in 2.3 by D ε_ij, where ε_ij = 1 with probability exp[−(r_ij − 1)/λ_exp] or exp{−[(r_ij − 1)/λ_Gauss]²}; otherwise ε_ij vanishes. Typical values for the λs are in the range between 2 and 5. The probabilities have been chosen in such a way that nearest neighbors (r_ij = 1) are always connected. D is a drug parameter. Summarizing, we have explicitly modeled the various interactions including the stellate cells, the delays that are abundantly present in the cortex, and the noise. We now turn to the network behavior itself.

3 Drug-Induced Collective Excitations

As in the experiments (Siegel and West 1975; Siegel 1977), we study a network without external input. In its normal state we then encounter spontaneous activity in the form of incoherent low-frequency firing. Fixing B and increasing the excitatory coupling A (in 2.6 or 2.7) or D, so as to model the influence of hallucinogens, we find four successive scenarios (see Figs. 1-4). We always start with random initial conditions, unless stated otherwise, and find, depending on A or D:

1. Stripes. Once A (or D) has become large enough, say A > A_c^(1), an excitation can propagate through the lattice. Just above A_c^(1) the stripes are relatively short, but they become longer with increasing A (see Fig. 1). As time proceeds, the stripes propagate. Their length does not grow, but they get slightly curved (the more so with increasing A) as the neurons in the center of a line segment are stimulated more strongly than those at the ends and, hence, their propagation is faster. Behind a stripe the neurons experience inhibition due to the stellate cells, which get activated a bit later.

2. Spirals.
As A (or D) increases further, the stripes get longer and more curved so that for A > A_c^(2) they regroup and build a spiral (see

¹Interestingly, Hebbian learning of random contours gives rise to the very same form. It is plain that Dale's law is inconsistent with a Mexican hat, but this form has been very popular. It is a simple matter, though, to redefine the sign of the bonds and at the same time shift the threshold ϑ.
Figure 1: Scenario 1, stripes. (a) 90 × 90 network with locally homogeneous couplings A = 0.16, B = 0.02, while r_0 = 15 and r_max = 20; cf. 2.7. (b) 90 × 90 network with locally sparse, excitatory couplings whose probability decreases with the distance; cf. Section 2. Here λ_Gauss = 2 and D = 0.056. Note the similarity of the two figures despite their different microscopic structure. For all figures we have taken random initial conditions.
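For concreteness, the two locally homogeneous coupling choices, 2.6 and 2.7, can be tabulated in a few lines of code. This is an illustrative sketch, not from the paper: the grid size, the distance convention, and the function names are our own, while the parameter values are those quoted for Figures 1a and 2a.

```python
import numpy as np

def coupling_step(r, A=0.16, B=0.02, r0=15.0, rmax=20.0):
    """Step coupling of eq. 2.7: +A inside r0, -B out to rmax, 0 beyond."""
    return np.where(r <= r0, A, np.where(r <= rmax, -B, 0.0))

def coupling_mexican_hat(r, A=0.12, B=0.02, lam1=15.0, lam2=100.0):
    """Mexican-hat coupling of eq. 2.6: difference of two exponentials."""
    return A * np.exp(-r / lam1) - B * np.exp(-r / lam2)

# Euclidean distances r_ij from a reference neuron on an L x L square lattice
L = 90
ii, jj = np.meshgrid(np.arange(L), np.arange(L), indexing="ij")
r = np.hypot(ii - L // 2, jj - L // 2)

J_step = coupling_step(r)
J_hat = coupling_mexican_hat(r)

# the self-coupling J_ii always vanishes in the model
J_step[L // 2, L // 2] = 0.0
J_hat[L // 2, L // 2] = 0.0
```

Both profiles are excitatory at short range and inhibitory at intermediate range, which is the ingredient behind the propagating stripes and the trailing inhibition described in Section 3.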
Figure 2: Scenario 2, spirals. (a) 90 × 90 network with A = 0.12, B = 0.02, λ_1 = 15 and λ_2 = 100; cf. 2.6. Two or more spirals may coexist as shown in (b), where we have a 90 × 90 network with excitatory couplings whose probability decreases with the distance; cf. Section 2. Here λ_Gauss = 2.83 and D = 0.1.
Figure 3: Scenario 3, rings. (a) 90 × 90 network with A = 0.14, B = 0.02, λ_1 = 15 and λ_2 = 100; cf. 2.6. The two rings annihilate each other where they meet. New rings originate from the two centers. In (b) we show a 150 × 150 network with excitatory couplings whose probability decreases with the distance; cf. Figures 1b and 2b. Here λ_Gauss = 2.83 and D = 0.12.
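The dynamics of Section 2 can be sketched as a discrete-time update. The sketch below is deliberately simplified relative to the paper's model: it uses nearest-neighbour excitation A and strictly local inhibition C in place of the full kernels 2.3-2.4, drops the axonal delays, and keeps only the 1-msec absolute refractory period. Apart from β, ϑ, and τ_s, all names and values are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 60                      # lattice side (smaller than the paper's 90 x 90)
beta, theta = 25.0, 0.12    # noise parameter and threshold of eq. 2.1
tau_s = 2.0                 # synaptic time constant of eq. 2.3

# alpha function eps(tau) = (tau / tau_s^2) exp(-tau / tau_s), tabulated in 1-ms bins
tau = np.arange(1, 20)
eps = (tau / tau_s**2) * np.exp(-tau / tau_s)
eps /= eps.sum()            # normalized so that sum_tau eps(tau) = 1

def neighbour_sum(S):
    """Sum of the four nearest neighbours (free boundary conditions)."""
    out = np.zeros_like(S, dtype=float)
    out[1:, :] += S[:-1, :]
    out[:-1, :] += S[1:, :]
    out[:, 1:] += S[:, :-1]
    out[:, :-1] += S[:, 1:]
    return out

def step(history, A=0.2, C=0.4):
    """One 1-ms update of eqs. 2.1-2.5 with the simplifications stated above."""
    h_syn = A * sum(e * neighbour_sum(S) for e, S in zip(eps, reversed(history)))
    h_inh = -C * sum(e * S for e, S in zip(eps, reversed(history)))
    h = h_syn + h_inh
    p = 0.5 * (1.0 + np.tanh(beta * (h - theta)))   # eq. 2.1
    p[history[-1] == 1] = 0.0                       # eta(1) = -inf: no firing 1 ms after a spike
    return (rng.random((L, L)) < p).astype(np.int8)

# short run from random initial conditions, keeping only the history the kernels need
history = [(rng.random((L, L)) < 0.02).astype(np.int8) for _ in range(len(eps))]
for _ in range(30):
    history.append(step(history))
    history.pop(0)
```

Sweeping the excitatory strength A upward in such a sketch is the analogue of the "drug" axis along which the paper's four scenarios appear.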
Figure 4: Scenario 4, collective bursts. (a) 90 × 90 network with A = 2.4, B = 0.02, λ_1 = 8.4 and λ_2 = 100; cf. 2.6. (b) 150 × 150 network with excitatory couplings whose probability decreases with the distance; cf. Figures 1b-3b. Here we have an exponential distribution with λ_exp = 3. Furthermore, D = 0.14.
Fig. 2). Plainly, spirals rotate. The number of their arms (1, 2, or 3) depends on the random initialization. Spirals are also extremely stable. Once they exist, one can even increase A suddenly to a strength corresponding to scenario 4, but nevertheless the spirals survive.

3. Rings (see Fig. 3). There may be several centers generating new rings all the time. These propagate outward. If two nonconcentric rings hit each other, they annihilate their common part while moving outward. The reason is simply the inhibition that follows a front. The thickness of a ring increases with A or D, respectively.

4. Collective bursts. These are complex pulsating patterns. Here A (or D) is so large that a few active neurons ignite the whole system in 20-25 msec (cf. Fig. 4), after which inhibition takes over and a quiescent state sets in. The frequency is in a range between 10 and 20 Hz. The resulting activity pattern vaguely resembles an epileptic state.

Interestingly, and in agreement with experiments (Siegel 1977), the "objects" in scenarios 2 and 3 have different length scales that vary from one scene to the next (even for the academic case of fixed parameter values). The width of the stripes in scenario 1 depends on A (or D). The patterns in scenario 4 have all length scales. Indirect experimental evidence confirming scenarios 2-3 has been found by Petsche et al. (1974) in the occipital cortex of a rabbit with penicillin-induced epilepsy. Quite surprisingly, even for the complex pattern of scenario 4 experimental data are available (cf. Siegel and West 1975, p. 123).

4 Discussion
The "wetware" of the primary visual cortex apparently allows a variety of spontaneous excitations that are similar to patterns found in excitable media (Tyson and Keener 1988; Meron 1992; Cross and Hohenberg 1993). They arise due to intrinsic nonlinearities of the neuronal dynamics and resemble experimental hallucinogen-induced activity patterns rather closely, but not completely. These excitations, however, are in the cortex, and it is a natural question what they would look like on the retina. To answer this question we have taken Figure 2a, positioned it extrafoveally so that the complex logarithm offers a fair description of the retinocortical map, and applied the inverse map. The result is shown in Figure 5, where the Archimedean spiral in the cortex reappears as a quasi-logarithmic spiral on the retina. In passing we note that we find several, but not all four, types of "form constant" as described by Klüver (1967). This may be due to the initial conditions that we had chosen, viz., random ones. It is an open problem, though, what the generic initial conditions are that generate, e.g., a hexagonal pattern. The performance of a large network does not depend on the details of the model once the neural essentials have been incorporated. An example is provided by the three different kinds of coupling that we assumed
Figure 5: Retinal pattern (left) corresponding to the cortical activity pattern of Figure 2a (right). The retinal picture is the result of the inverse retinocortical map applied extrafoveally; cf. Schwartz (1977). If (x, y) is a point in the cortex and (r, φ) is on the retina, then parameters have been chosen in such a way that r = exp x and φ = y.

in the primary visual cortex, viz., the locally homogeneous ones (2.6) and (2.7) and the locally sparse, excitatory ones whose probability decreases with the distance (cf. Figs. 1-4). It has been stressed by Zeki (1993, pp. 324-326 and 342-343) that hallucinations do depend on reentry into area V1 or V2. On the basis of the present work, and in agreement with Ermentrout and Cowan (1979), we tentatively suggest that the form constants are mainly generated in the primary visual cortex. Through functional feedback they may be, and we expect are, modified and combined with other objects, e.g., from memory. Under this proviso we are then led to the following interpretation. Scenarios 2 and 3 are in a one-to-one correspondence with the experimental hallucinatory spirals, tunnels, and funnels, the more so since spirals are very stable and, thus, dominant. They have also been observed indirectly in drug-induced epilepsy. On the other hand, scenario 1 gives room to many interpretations. Scenario 4 is a high-dose one and hard to reach since the system usually has to pass through the previous three scenarios, where it can get stuck. Nevertheless, it has been "seen." One has to realize, though, that pictures drawn by patients may give rise to contradictory results, as is illustrated nicely by Siegel and West (1975, p. 135). Here both a quasi-logarithmic spiral, which is "seen" by most people, and a purely Archimedean one are shown; the two spirals were observed by two different persons under
the influence of ketamine and LSD, respectively. In fact, the two different pictures with Archimedean and logarithmic spirals would constitute a fascinating problem for theory, if they were reproducible. In summary, we have exhibited several scenarios that appear as the synaptic efficacy is increased in a locally connected neuronal network. All of them have been observed, some in the cortex, others through hallucinations. Our model reproduces some, but not all, of the form constants as they are found in hallucinations. Hence it may well be that the basic hypothesis that they are generated in the primary visual cortex is too simple-minded in relation to cortical processing. There is little doubt, however, that all these spontaneous excitations with their typical spatiotemporal behavior do occur in the cortex. An analytic treatment of the model under consideration will be presented elsewhere (Fohlmeister et al. 1995).

Acknowledgments
WG has been supported by the Deutsche Forschungsgemeinschaft under Grant He 1729/2-2.

References

Cowan, J. D. 1985. What do drug-induced visual hallucinations tell us about the brain? In Synaptic Modification, Neuron Selectivity, and Nervous System Organization, W. B. Levy, J. A. Anderson, and S. Lehmkuhle, eds., pp. 223-241. Lawrence Erlbaum, Hillsdale, NJ.
Cross, M. C., and Hohenberg, P. C. 1993. Pattern formation outside of equilibrium. Rev. Mod. Phys. 65, 851-1112.
Ermentrout, G. B., and Cowan, J. D. 1979. A mathematical theory of visual hallucination patterns. Biol. Cybern. 34, 137-150.
Fohlmeister, C., Gerstner, W., Ritz, R., and van Hemmen, J. L. 1995. Manuscript in preparation.
Gerstner, W., and van Hemmen, J. L. 1992. Associative memory in a network of 'spiking' neurons. Network 3, 139-164.
Gerstner, W., and van Hemmen, J. L. 1993. Coherence and incoherence in a globally coupled ensemble of pulse-emitting units. Phys. Rev. Lett. 71, 312-315.
Gerstner, W., Ritz, R., and van Hemmen, J. L. 1993. A biologically motivated and analytically soluble model of collective oscillations in the cortex: I. Theory of weak locking. Biol. Cybern. 68, 363-374.
Klüver, H. 1967. Mescal and the Mechanisms of Hallucination. The University of Chicago Press, Chicago.
Meron, E. 1992. Pattern formation in excitable media. Phys. Rep. 218, 1-66.
Milton, J. G., Chu, P. H., and Cowan, J. D. 1993. Spiral waves in integrate-and-fire neural networks. In Neural Information Processing Systems, S. J. Hanson,
J. D. Cowan, and C. L. Giles, eds., Vol. 5, pp. 1001-1006. Morgan Kaufmann, San Mateo, CA.
Petsche, H., Prohaska, O., Rappelsberger, P., Vollmer, R., and Kaiser, A. 1974. Cortical seizure patterns in multidimensional view: The information content of equipotential maps. Epilepsia 15, 439-463.
Schwartz, E. 1977. Spatial mapping in the primate sensory projection: Analytic structure and relevance to perception. Biol. Cybern. 25, 181-194.
Siegel, R. K. 1977. Hallucinations. Sci. Am. 237(4), 132-140.
Siegel, R. K., and West, L. J. 1975. Hallucinations: Behavior, Experience, and Theory. Wiley, New York.
Tyson, J. J., and Keener, J. P. 1988. Singular perturbation theory of travelling waves in excitable media (a review). Physica D 32, 327-361.
Zeki, S. 1993. A Vision of the Brain. Blackwell Scientific, Oxford.
Received April 11, 1994; accepted December 22, 1994.
Communicated by Erkki Oja
Time-Domain Solutions of Oja's Equations

J. L. Wyatt, Jr.
Research Laboratory of Electronics, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
I. M. Elfadel Masimo Corporation, 26052 Merit Circle, Suite 103, Laguna Hills, C A 92652 U S A
Oja's equations describe a well-studied system for unsupervised Hebbian learning of principal components. This paper derives the explicit time-domain solution of Oja's equations for the single-neuron case. It also shows that, under a linear change of coordinates, these equations are a gradient system in the general multi-neuron case. This latter result leads to a new Lyapunov-like function for Oja's equations.

1 Introduction: Oja's Equations in the Single-Neuron Case
The principal component (PC) of a random vector x is the dominant eigenvector of the covariance matrix C of x. Principal components have been widely used in data compression and neural network applications. Oja has devised an algorithm for estimating the principal component given a sequence of samples x_k from the probability distribution for x (Oja and Karhunen 1985; Oja 1989, 1992). It is appealingly simple in that it does not estimate the entries of C, and it automatically stabilizes the growth of the principal component estimate, normalizing it eventually to unit length. Oja's algorithm can be understood in terms of a linear neuron with random input vector x_k ∈ R^n at time k, a vector of weights w_k ∈ R^n, and a scalar output

y_k = w_k^T x_k    (1.1)

The weights evolve according to the rule

w_{k+1} = w_k + Δw_k    (1.2)

Δw_k = η y_k (x_k − y_k w_k)    (1.3)

where η > 0 governs the step size. The first term, η y_k x_k, represents Hebb's rule, while the second term, −η y_k^2 w_k, introduced by Oja, limits the growth of ||w_k||, the Euclidean norm of the weight vector.

Neural Computation 7, 915-922 (1995) © 1995 Massachusetts Institute of Technology
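The stochastic rule 1.1-1.3 is easy to simulate. Below is a minimal numpy sketch, not from the paper: the covariance matrix, step size, and iteration count are illustrative choices. The weight vector's norm stabilizes near 1 and the vector aligns with the dominant eigenvector of C.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([3.0, 1.0, 0.5])        # sqrt-covariance; C = A A^T has dominant eigenvector e1
eta = 0.01                          # step size (eta in eq. 1.3)
w = np.array([0.2, 0.3, 0.1])       # arbitrary non-unit initial weights
for _ in range(30000):
    x = A @ rng.normal(size=3)      # zero-mean sample with covariance C
    y = w @ x                       # eq. 1.1
    w += eta * y * (x - y * w)      # eqs. 1.2-1.3: Hebb term plus Oja's stabilizing term
print(np.linalg.norm(w))            # ≈ 1: the norm is automatically stabilized
print(abs(w[0]))                    # ≈ 1: w aligns with the principal component
```

Note that no entry of C is ever estimated; only the raw samples x_k are touched, as the text emphasizes.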
If x has zero mean, then one can immediately verify that ⟨Δw_k⟩, the mean of Δw_k, satisfies

⟨Δw_k⟩ = η (C w_k − w_k w_k^T C w_k)    (1.4)

If we choose η equal to the intersample interval δt and let δt → 0, then (1.4) converges to the drift ordinary differential equation

ẇ(t) = C w(t) − w(t) w(t)^T C w(t)    (1.5)

where the discrete-time and continuous-time solutions are related by

w(t = k δt) = ⟨w_k⟩    (1.6)
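Equation 1.5 can be checked numerically against the explicit solution w(t) = e^{Ct} w_0 / (1 − w_0^T w_0 + w_0^T e^{2Ct} w_0)^{1/2} derived in Section 2 and Appendix 1. The sketch below uses illustrative values, with C taken diagonal so that e^{Ct} is trivial to form:

```python
import numpy as np

C = np.diag([2.0, 1.0, 0.5])            # symmetric covariance (diagonal for simplicity)
w0 = np.array([0.3, 0.4, 0.2])

def w_closed(t):
    # explicit solution of eq. 1.5: unstable Hebb numerator, stabilizing scalar denominator
    eCt_w0 = np.exp(np.diag(C) * t) * w0
    den = np.sqrt(1.0 - w0 @ w0 + w0 @ (np.exp(2.0 * np.diag(C) * t) * w0))
    return eCt_w0 / den

w, dt, T = w0.copy(), 1e-4, 5.0
for _ in range(int(T / dt)):
    w = w + dt * (C @ w - (w @ C @ w) * w)   # Euler step of eq. 1.5
print(np.max(np.abs(w - w_closed(T))))       # small discretization error
```

Both trajectories approach the dominant eigenvector e1 with unit norm, as the global convergence discussion suggests.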
2 Closed-Form Solution to Oja's Equation in the One-Neuron Case

For any initial condition w(0) = w_0 ∈ R^n, the solution to 1.5 for all positive time is given by the formula

w(t) = e^{Ct} w_0 / (1 − w_0^T w_0 + w_0^T e^{2Ct} w_0)^{1/2}    (2.1)

The derivation of 2.1 is given in Appendix 1. The numerator of 2.1 is the unstable pure Hebb's rule solution that would result if the second term on the right side of 1.5 were neglected. The scalar denominator in 2.1 results from the second term on the right-hand side of 1.5. This explicit solution is useful in studying the global convergence properties of 1.5 for initial states far from the equilibrium solutions.

3 Oja's Equation in the Multineuron Case
We now consider a multineuron generalization of 1.5 (Oja and Karhunen 1985; Williams 1985; Oja 1989, 1992). For the case of a system of p interconnected neurons processing a sequence of random input vectors x_k of length n, p ≤ n, the weights are represented by a weight matrix W ∈ R^{n×p}, where the jth column of W represents the weight vector for the jth neuron. The vector of neuron outputs at time k is y_k ∈ R^p, where

y_k = W_k^T x_k

In this algorithm the weight matrix is updated according to

W_{k+1} = W_k + ΔW_k

ΔW_k = η (x_k y_k^T − W_k y_k y_k^T)

and if x has zero mean, the mean ⟨W_k⟩ evolves in the continuous-time limit (as in 1.4-1.6) according to Oja's multineuron equation

Ẇ = C W − W W^T C W    (3.1)

Equation 3.1 reduces to 1.5 in the special case p = 1.
4 Oja's Equation Is a Gradient System When Viewed in the Appropriate Set of Coordinates
The result in this section is unexpected. [See, e.g., the remarks in Baldi and Hornick (1994) below their eq. (18).] In particular, the reader can easily verify that 1.5 and 3.1 as written are not gradient systems, since the Jacobian matrix of the right-hand side is not symmetric for most covariance matrices C. However, the property of being a gradient system depends on the choice of coordinates used. (This fact is not immediately obvious, but the reader can easily verify it by considering any linear, gradient, vector ordinary differential equation under an arbitrary linear change of coordinates.) Oja's equation 3.1 becomes a gradient system in the new coordinates

Z ≜ C^{1/2} W    (4.1)

where C^{1/2} is the positive (semi)definite square root of the covariance matrix of x. Substituting 4.1 into 3.1 yields

Ż = C Z − Z Z^T Z    (4.2)

and we show in Appendix 2 that

Ż = C Z − Z Z^T Z = −∇Φ(Z)    (4.3)

with

Φ(Z) ≜ (1/4) ||C − Z Z^T||_F^2    (4.4)

where ||·||_F represents the Frobenius norm of a matrix [i.e., with tr[A] representing the matrix trace, ||A||_F = (tr[A A^T])^{1/2}, the square root of the sum of the squares of the entries]. Oja's equation 3.1, viewed under the coordinate change 4.1, is a gradient descent system with scalar potential Φ(Z). In minimizing Φ, the system 4.2 is seeking to approximate a "generalized square root" of C as closely as possible in the Frobenius norm. Since 3.1 is a gradient system in one set of coordinates, it follows that the solutions in any set of coordinates cannot exhibit sustained oscillations. Furthermore, the decay to equilibrium cannot exhibit damped oscillations, since the eigenvalues of the linearized equations about every equilibrium point must be real.

5 A New Lyapunov-Like Function
If C is nonsingular, then we can return to the original set of coordinates W, where the reader can easily verify that 4.3 and 4.4 take the form

Ẇ = C W − W W^T C W

Thus

Ψ(W) ≜ ||C^{1/2} (I − W W^T) C^{1/2}||_F^2    (5.1)

is a strict Lyapunov-like function for 3.1, in the sense that Ψ̇ ≤ 0 along trajectories of 3.1 and Ψ̇ = 0 only at equilibria. Note that Ψ(W) is similar, but not identical, in form to the mean-square reconstruction error (Xu 1993; Plumbley 1994),

e(W) ≜ E{||x − W W^T x||^2} = ||C (I − W W^T)||_F^2 = ||(I − W W^T) C||_F^2

In minimizing Ψ, the system 3.1 seeks to evolve a weight matrix W such that W W^T approximates the identity matrix on the range of W as closely as possible.

6 Values of the Lyapunov-Like Function at Equilibria
At any matrix W_e that is an equilibrium for 3.1, Ψ has the simple form

Ψ(W_e) = ||C||_F^2 − tr[W_e^T C^2 W_e]    (6.1)

since for any W,

Ψ(W) = tr[C^2 − W^T C (Ẇ + C W)]

where the term involving Ẇ = C W − W W^T C W vanishes at equilibrium. Now consider the various continua of equilibria in which the weight vectors {w_1, …, w_p} (i.e., the columns of W) are orthonormal linear combinations of some set of p distinct eigenvectors of C. These can be written in the matrix form

W_e = E_p O

where the p columns of E_p ∈ R^{n×p} are any distinct, unit-length eigenvectors of C, and O ∈ R^{p×p} is an arbitrary orthogonal matrix. Then

Ψ(W_e) = Σ_{j=1}^n λ_j^2 − Σ_{k=1}^p λ_{j_k}^2    (6.2)

where λ_j is the jth eigenvalue of C and λ_{j_k} is the eigenvalue of C corresponding to the kth column of E_p, as shown in Appendix 3. Note that Ψ is independent of O and thus constant over each of these continua. The value of Ψ is smallest over the continuum of matrices whose orthonormal columns are linear combinations of the p dominant eigenvectors of C, i.e., the stable equilibria (Oja 1989).
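These claims are straightforward to check numerically. The sketch below uses illustrative values (a diagonal C with known eigenvalues, p = 2, a small Euler step): it integrates 3.1, verifies that Ψ of eq. 5.1 is nonincreasing along the trajectory, and compares the limiting value with eq. 6.2 for the stable continuum spanned by the two dominant eigenvectors.

```python
import numpy as np

lams = np.array([3.0, 2.0, 1.0, 0.5])   # eigenvalues of C
C = np.diag(lams)                       # diagonal, so C^{1/2} = diag(sqrt(lams))
sqrtC = np.diag(np.sqrt(lams))
p = 2

def psi(W):                             # eq. 5.1
    R = sqrtC @ (np.eye(4) - W @ W.T) @ sqrtC
    return np.sum(R * R)

rng = np.random.default_rng(1)
W = 0.1 * rng.normal(size=(4, p))       # small random initial weight matrix
vals, dt = [psi(W)], 1e-3
for _ in range(30000):
    W = W + dt * (C @ W - W @ W.T @ C @ W)   # Euler step of eq. 3.1
    vals.append(psi(W))
print(all(b <= a + 1e-9 for a, b in zip(vals, vals[1:])))   # True: Ψ nonincreasing
print(vals[-1])                         # ≈ sum(lams**2) - (3**2 + 2**2) = 1.25, eq. 6.2
```

In the Z coordinates of 4.1 this Euler loop is exactly gradient descent on Φ, which is why Ψ = 4Φ decreases monotonically for a small enough step.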
7 Remarks

A special case of 2.1, valid only for initial conditions satisfying ||w_0|| = 1, has appeared in Chu (1986). The differential equation 4.3 and scalar function Φ(Z) in 4.4 have appeared for the single-neuron case p = 1 in Yuille et al. (1989), and 4.3 has appeared for the multineuron case in Plumbley (1994). It has apparently not been noted in the literature that 4.3 is simply Oja's equation 3.1 expressed in a new set of variables. The Lyapunov function Ψ applies directly to the evolution of W(t), in contrast to those in Plumbley (1994), which apply to the evolution of the orthogonal projection operator P(t) associated with W(t). It can be of use in showing global convergence. We are grateful for access to a preprint of Baldi and Hornick (1994), a useful survey of the field. The results in Sections 1-4 first appeared in Wyatt and Elfadel (1994), along with a more intricate closed-form solution to 3.1 for the two-neuron case.

Appendix 1: Derivation of Solution for Single-Neuron Case

Consider the Oja equation 1.5 for the one-neuron case

ẇ = C w − w w^T C w = (C − w^T C w I_n) w = A(t) w

where A(t) ≜ [C − w^T(t) C w(t) I_n] and I_n is the n × n identity. Note that the quantity α(t) ≜ w^T(t) C w(t) is a scalar, and therefore for any pair (t_1, t_2) of time instants, we have A(t_1) A(t_2) = A(t_2) A(t_1). For any initial condition w_0, the solution to the Oja equation can then be written as

w(t) = exp[∫_0^t A(τ) dτ] w_0 = e^{Ct} e^{−∫_0^t α(τ) dτ} w_0

From the above expression, we deduce that

α(t) = w(t)^T C w(t) = e^{−2 ∫_0^t α(τ) dτ} w_0^T C e^{2Ct} w_0    (A.1)
which can be written as

(d/dt) e^{2 ∫_0^t α(τ) dτ} = 2 w_0^T C e^{2Ct} w_0,   so that   e^{2 ∫_0^t α(τ) dτ} = w_0^T e^{2Ct} w_0 + K

where K is an integration constant that can be computed using the initial condition w_0, which gives K = 1 − w_0^T w_0. Therefore,

w(t) = e^{Ct} w_0 / (1 − w_0^T w_0 + w_0^T e^{2Ct} w_0)^{1/2}

as claimed in 2.1.

Appendix 2: Proof That Oja's Equation Is a Gradient System in the Variables Z

The objective of this appendix is to prove 4.3 using the expression of Φ given in 4.4,

Φ(Z) = (1/4) tr{(C − Z Z^T)(C − Z Z^T)^T} = (1/4) tr{(C − Z Z^T)(C − Z Z^T)}
     = (1/4) tr{Z Z^T Z Z^T − C Z Z^T − Z Z^T C + C C^T}

It is not difficult to prove that

∂ tr(C Z Z^T)/∂Z_{ij} = 2 (C Z)_{ij}

which can be written more compactly as

∇ tr(C Z Z^T) = ∇ tr(Z Z^T C) = 2 C Z    (A.3)

Moreover, notice that

∂ tr(Z Z^T Z Z^T)/∂Z_{ij} = 4 (Z Z^T Z)_{ij}    (A.4)

The right-hand side is nothing but 4 times the ijth term of Z Z^T Z. Therefore A.4 can be written compactly as

∇ tr(Z Z^T Z Z^T) = 4 Z Z^T Z    (A.5)

Combining A.3 and A.5, we get

−∇Φ(Z) = C Z − Z Z^T Z

In other words, the Oja equation under the linear transformation 4.1 is the gradient system

Ż = −∇Φ(Z)

as claimed.

Appendix 3: Values of Ψ at Equilibria
To verify 6.2, let Λ be the p × p diagonal matrix with λ_{j_k} at the kth diagonal position. Note that

C E_p = E_p Λ
E_p^T E_p = I_{p×p}

and thus, using 6.1,

Ψ(W_e) = Ψ(E_p O) = ||C||_F^2 − tr[O^T E_p^T C^2 E_p O]
       = ||C||_F^2 − tr[O^T E_p^T E_p Λ^2 O]
       = ||C||_F^2 − tr[O^T Λ^2 O] = tr[C^2] − tr[Λ^2]
       = Σ_{j=1}^n λ_j^2 − Σ_{k=1}^p λ_{j_k}^2

as claimed.

Acknowledgments

We gratefully acknowledge helpful conversations with Mitch Trott, George Verghese, Terry Sanger, and Lei Xu. This work was supported by NSF and ARPA under Contract MIP-91-17724.
References

Baldi, P., and Hornick, K. 1994. Learning in linear neural networks: A survey. IEEE Trans. Neural Networks (in press).
Chu, M. T. 1986. Curves on S^{n-1} that lead to eigenvalues or their means of a matrix. SIAM J. Alg. Disc. Math. 7(3), 425-432.
Oja, E. 1989. Neural networks, principal components and subspaces. Int. J. Neural Syst. 1(1), 61-68.
Oja, E. 1992. Principal components, minor components, and linear neural networks. Neural Networks 5, 927-935.
Oja, E., and Karhunen, J. 1985. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. J. Math. Anal. Appl. 106, 69-84.
Plumbley, M. 1994. Lyapunov functions for the convergence of principal component algorithms. Neural Networks 8(1), 11-23.
Williams, R. 1985. Feature Discovery Through Error-Correcting Learning. Tech. Rep. 8501, U.C. San Diego, San Diego, CA.
Wyatt, J., and Elfadel, I. 1994. On the solutions to Oja's equations. Neural Networks for Computing, Proceedings of the Snowbird Conference, April 1994.
Xu, L. 1993. Least mean square error reconstruction principle for self-organizing neural nets. Neural Networks 6, 627-648.
Yuille, A. L., Kammen, D. M., and Cohen, D. S. 1989. Quadrature and the development of orientation selective cortical cells by Hebb rules. Biol. Cybernet. 61, 183-194.
Received April 13, 1994; accepted December 22, 1994
Communicated by C. Lee Giles
Learning the Initial State of a Second-Order Recurrent Neural Network during Regular-Language Inference

Mikel L. Forcada, Rafael C. Carrasco
Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant, Spain
Recent work has shown that second-order recurrent neural networks (2ORNNs) may be used to infer regular languages. This paper presents a modified version of the real-time recurrent learning (RTRL) algorithm used to train 2ORNNs, which learns the initial state in addition to the weights. The results of this modification, which adds extra flexibility at a negligible cost in time complexity, suggest that it may be used to improve the learning of regular languages when the size of the network is small.

1 Introduction

A number of recent papers (Giles et al. 1992a,b; Siegelmann et al. 1992; Watrous and Kuhn 1992a,b) have explored the ability of second-order recurrent neural networks (2ORNNs) to learn simple regular grammars (that is, languages accepted by small automata) from positive and negative training word samples. This letter presents a modified version of the real-time recurrent learning (RTRL) algorithm used by these authors [except Watrous and Kuhn (1992a,b), where the backpropagation through time (BPTT) method was used], which in turn is based on a training method by Williams and Zipser (1989). Instead of using a randomly selected or fixed initial state, the new model learns the initial state in addition to the weights. This adds flexibility to the learning process at a negligible cost in time, as shown by the preliminary results presented. We are currently working on the extension of this modification to other recurrent network architectures.
2 The Original Model

A summary of the model used by Giles et al. (1992a,b) follows. The architecture used is that of a second-order recurrent neural network, with N hidden, recurrent neurons (with states labeled S_j) and L input, nonrecurrent neurons (with states labeled I_k) for character input (see Fig. 1).

Neural Computation 7, 923-930 (1995)
@ 1995 Massachusetts Institute of Technology
Figure 1: The architecture of the second-order recurrent neural network of Giles et al. (1992a,b).

The network reads one character per cycle. The product of each hidden neuron state with each input neuron state, S_j I_k, is fed, modified by weight W_{ijk}, to hidden neuron i in each cycle. This represents a transition of the underlying deterministic automaton: the network goes from state q ({S_j}_{j=1,…,N}) to state q' ({S'_j}_{j=1,…,N}) after reading symbol σ ({I_k}_{k=1,…,L}). The dynamics is given by the equation

S_i^{(t+1)} = g(Ξ_i^{(t)})    (2.1)

where

Ξ_i^{(t)} = Σ_{j,k} W_{ijk} S_j^{(t)} I_k^{(t)}    (2.2)

Function g is the sigmoid 1/[1 + exp(−x)]. Characters are usually (but not necessarily) encoded in a one-hot or unary fashion: each input neuron corresponds to a character. Giles et al. (1992a,b) train their network using a second-order version of Williams and Zipser's (1989) real-time recurrent learning (RTRL) algorithm. The state of a preselected hidden neuron, call it S_acc, is chosen to represent acceptance (S_acc ∈ [1 − τ, 1]) or rejection (S_acc ∈ [0, τ]).
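The dynamics 2.1-2.2 amount to one tensor contraction per input character. A minimal numpy sketch follows; the network sizes, random weights, and the fixed initial state are illustrative choices, not values from the paper:

```python
import numpy as np

def step(W, S, sym):
    """One cycle of the 2ORNN, eqs. 2.1-2.2: S_i' = g(sum_jk W_ijk S_j I_k)."""
    I = np.zeros(W.shape[2])
    I[sym] = 1.0                          # one-hot (unary) character encoding
    xi = np.einsum('ijk,j,k->i', W, S, I) # Xi_i^(t)
    return 1.0 / (1.0 + np.exp(-xi))      # sigmoid g

rng = np.random.default_rng(0)
N, L = 3, 2                               # hidden neurons, input symbols
W = rng.normal(scale=0.5, size=(N, N, L)) # small random weights around 0
S = np.full(N, 0.5)                       # a fixed initial state, as in the original model
for sym in [0, 1, 1, 0]:                  # read the word "0110"
    S = step(W, S, sym)
acc = S[0]                                # state of the preselected acceptance neuron
print(0.0 < acc < 1.0)                    # True: sigmoid outputs stay in (0, 1)
```

With a one-hot input the contraction collapses to the matrix-vector product W[:, :, sym] @ S, which is why the per-symbol cost depends on the current character only through a slice of the weight tensor.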
The target values for S_acc are chosen to be 1 − τ/2 and τ/2, respectively; the usual choice is τ = 0.2. The error in the state of the acceptance neuron, E_acc = (S_acc^{target} − S_acc)^2 / 2, after reading a word of length ℓ is minimized by varying the weights W_{lmn} using gradient descent:

ΔW_{lmn} = −α ∂E_acc/∂W_{lmn} + η ΔW'_{lmn}    (2.3)

with a learning rate α and a momentum term η; ΔW'_{lmn} stands for the previous value of ΔW_{lmn}. This requires the evaluation of all the ∂S_i^{(t)}/∂W_{lmn} derivatives, which is computationally intensive, and is carried out in a recurrent way in each symbol step t, based on the values obtained in symbol step t − 1, with t varying from 1 to ℓ, where ℓ is the length of the word. The recurrent formula is

∂S_i^{(t)}/∂W_{lmn} = g'(Ξ_i^{(t−1)}) [δ_{il} S_m^{(t−1)} I_n^{(t−1)} + Σ_{j,k} W_{ijk} I_k^{(t−1)} ∂S_j^{(t−1)}/∂W_{lmn}]    (2.4)

where g' is the derivative of the sigmoid function, δ_{il} is Kronecker's delta (δ_{il} = 1 if i = l and zero otherwise), and

∂S_j^{(0)}/∂W_{lmn} = 0    (2.5)

In the original model of Giles et al. (1992a,b), the initial states of the hidden neurons S_i^{(0)} are fixed (either randomly chosen from the interval [0.0, 1.0] or taking S_i^{(0)} = δ_{i0}) for each learning run. The initial weight set is also a set of small random values around 0.0 (both positive and negative), unique for each job. Each update of the whole set of derivatives in real time has a time complexity of O(N^4 L^2); however, due to the one-hot codification of inputs, it reduces to O(N^4 L). The space needed to store the derivative information is O(N^3 L).

In most cases, Giles et al. (1992a,b) find that once the network has learned to classify all the words in the training set, the states of the hidden neurons visit only small regions (clusters) of the available configuration space during recognition. These clusters can be identified as the states of the inferred deterministic finite automaton (DFA), and the transitions occurring among these clusters as symbols are read determine the transition function of the DFA. The automaton extraction algorithm uses a partition of the configuration hypercube [0.0, 1.0]^N to locate these clusters.

3 The Modified Model
The new model learns the initial state of all hidden neurons in addition to the weights, at almost no extra cost. The gradient-descent formulas
for learning the initial state are then

ΔS_i^{(0)} = −α ∂E_acc/∂S_i^{(0)} + η ΔS'^{(0)}_i    (3.1)

with ΔS'^{(0)}_i being the previous value of ΔS_i^{(0)}, and with

∂S_i^{(t)}/∂S_j^{(0)} = g'(Ξ_i^{(t−1)}) Σ_{l,k} W_{ilk} I_k^{(t−1)} ∂S_l^{(t−1)}/∂S_j^{(0)}    (3.2)

The parameters α and η have the same meaning as in the weight-updating formulas (2.3-2.5); their values may be different in the initial-state updating formulas, but, to avoid parameter proliferation, we have chosen to use the same values for the learning of weights and initial states.¹ The initial values of the derivatives are

∂S_i^{(0)}/∂S_j^{(0)} = δ_{ij}    (3.3)

The initial values of the initial states S_i^{(0)}, before any learning takes place, are randomly chosen and unique for each run. If the empty word is in the training set, instead of applying the update rule in equation 3.1, which would only modify the value of S_acc^{(0)}, we choose to use this information more efficiently to accelerate convergence by updating directly the initial state of the acceptance neuron: each time the empty word is presented for learning, we simply set the state of S_acc^{(0)} to 1 − τ if the word is in the positive example set or to τ if it is in the negative example set, and leave all the other S_i^{(0)} unchanged. For nonempty words, whenever the value of an updated S_i^{(0)} falls outside the [0.0, 1.0] interval, it is automatically clipped so that it falls within the valid range.²

The increases in space complexity [O(N^2) is added to O(N^3 L)] and time complexity [O(N^3 L) is added to O(N^4 L^2) in the general case, and for one-hot encoding, only O(N^3) is added to O(N^4 L)] due to this modification are both negligible, and therefore impose no burden on the learning process while allowing for increased flexibility. The automaton extraction used by Giles et al. (1992a,b) may be used without modification in our new model, since the clustering of states is not dependent on the fact that the initial state is also learned.

¹The optimum (fastest learning) values may actually be different.
²As one of the referees has pointed out, this clipping is not needed if the initial state is treated as the output of a sigmoid and the input of this sigmoid (in the range ]−∞, +∞[) is learned instead. This approach is conceptually more elegant, may be implemented at no extra computational cost, and has been found to give equivalent results in test runs. We plan to use it from now on.
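Both recurrences, 2.4 for the weights and 3.2 for the initial state, propagate a table of derivatives forward one symbol at a time. The sketch below is illustrative (small dimensions, random weights, a finite-difference check in place of training); the paper's full training loop, word scheduling, and momentum are omitted:

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(W, S0, word, L):
    """Run the 2ORNN over a word while propagating RTRL derivative tables.

    dW_tab[i, l, m, n] = dS_i/dW_lmn  (eq. 2.4 form, starting from eq. 2.5);
    dS0_tab[i, j] = dS_i/dS_j^(0)     (eq. 3.2 form, starting from the identity, eq. 3.3)."""
    N = len(S0)
    S = S0.copy()
    dW_tab = np.zeros((N, N, N, L))
    dS0_tab = np.eye(N)
    for sym in word:
        I = np.zeros(L)
        I[sym] = 1.0
        S_new = g(np.einsum('ijk,j,k->i', W, S, I))
        gp = S_new * (1.0 - S_new)                    # g'(Xi) for the sigmoid
        direct = np.zeros((N, N, N, L))
        for i in range(N):
            direct[i, i] = np.outer(S, I)             # delta_il * S_m * I_n term
        recur = np.einsum('ijk,k,jlmn->ilmn', W, I, dW_tab)
        dW_tab = gp[:, None, None, None] * (direct + recur)
        dS0_tab = gp[:, None] * np.einsum('ilk,k,lj->ij', W, I, dS0_tab)
        S = S_new
    return S, dW_tab, dS0_tab

# Finite-difference check of one entry of the weight-derivative table
rng = np.random.default_rng(0)
N, L, word = 3, 2, [0, 1, 1]
W = rng.normal(scale=0.5, size=(N, N, L))
S0 = np.full(N, 0.5)
S, dW_tab, dS0_tab = forward(W, S0, word, L)
h = 1e-6
Wp = W.copy(); Wp[1, 2, 0] += h
Wm = W.copy(); Wm[1, 2, 0] -= h
fd = (forward(Wp, S0, word, L)[0] - forward(Wm, S0, word, L)[0]) / (2 * h)
print(np.max(np.abs(fd - dW_tab[:, 1, 2, 0])))        # ≈ 0
```

The O(N^3 L)-sized table dW_tab and the O(N^2) table dS0_tab make the complexity figures quoted above concrete: the initial-state table is the small extra cost the modification adds.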
Table 1: Learning the Odd-Number-of-Ones Grammar with a 2ORNN Having 2 Hidden Neurons, with and without Optimization of the Initial State.

Initial state | Number of epochs for 11 consecutive runs
Learned       | 925, (failed), 109, 122, (2 failed), 314, 863, 136, 109
Fixed         | (all jobs failed)

Table 2: Learning the Odd-Number-of-Ones Grammar with a 2ORNN Having 3 Hidden Neurons, with and without Optimization of the Initial State.

Initial state | Number of epochs for 11 consecutive runs
Learned       | (failed), 128, (3 failed), 107, (3 failed), 737, (failed)
Fixed         | (4 failed), 214, 363, (5 failed)
4 Learning the Odd-Number-of-Ones Language
Tables 1 and 2 show preliminary results of the new model compared with the original model; the grammar chosen for these experiments is the odd-number-of-ones language on Σ = {0, 1}, which is recognized by a simple two-state DFA. The training set (positive and negative examples) consisted of the 1023 words of Σ* that have length 9 or less. The learning process starts with 63 randomly chosen words, and adds 16 words to the set each time the current learning set is completely classified. The maximum number of epochs (an epoch is the basic learning unit, corresponding to one pass with the current learning set) was set to 1000, the learning rate α to 2.0, and the learning momentum η to 0.2. The network was trained with τ = 0.1, that is, to accept with a target value for S_acc of 0.95 and to reject with a target value of 0.05; errors below τ were taken as successes. According to the prescriptions explained above, the initial values for the initial states of the hidden neurons and for the weights were random and unique for each run and model. Note that the value used for τ, 0.1, is substantially smaller than the one used by Giles et al. (1992a,b): τ = 0.4, with target values 0.8 for acceptance and 0.2 for rejection. We have found that for this particular grammar, we either had to use a very small value of τ or had to update the weights after all words in the training set, misclassified or not, contrary to the "update only when misclassified" rule used by Giles et al. (1992a,b), to achieve learning in a reasonable number of epochs. This behavior has not been observed for other grammars.

Table 1 shows convergence results for a network with 2 hidden neurons and 2 or 3 input neurons. The third neuron is for an end-of-word symbol, used in the original model to compensate for the fact that an inadequate choice of the fixed initial state may preclude a correct classification of the empty word and make the network too rigid. This way we also avoid giving the new model an unfair advantage in the comparison. As may be seen, while the original model fails to learn the grammar in 1000 epochs in 11 out of 11 runs, our modified model learns it successfully in 7 out of 11 runs. This is mainly due to the fact that in the new model, the initial state may move around in state space and reach a favorable position, whereas in the old model, the initial state is fixed and additional transitions are needed to take care of the end-of-word symbol. When the number of neurons (and therefore the dimensionality of the state space) is limited, the additional adaptiveness of the new model improves the chances that the grammar is learned.

Table 2 shows results for 3 hidden neurons. In this case, the old model reaches convergence in 1000 epochs only in 2 out of 11 runs, and the new model in 3 out of 11. Apparently, the effect of the modification in the model is much smaller in this case. Here, three neurons are enough to accommodate both kinds of automata, and the effect of learning the initial state is almost negligible [an indication of this is given by the fact that some of the automata extracted using the method described by Giles et al. (1992a,b) were not the minimal, two-state automaton, as was the case for all the runs made with 2 hidden neurons].

5 Learning Tomita's 4th Grammar

Table 3: Learning the Tomita-4 Grammar with a 2ORNN Having 3 Hidden Neurons, with and without Optimization of the Initial State.

Initial state | Number of epochs for 22 consecutive runs
Learned       | 221, 501, (failed), 237, 247, 224, 520, 399, (failed), 349, 896, 576, 180, (failed), 293, 540, 205, 181, 179, 573, 195, 227
Fixed         | 873, 455, 302, 360, (failed), 396, 231, 764, 587, (failed), 321, (4 failed), 370, (failed), 297, (failed), 238, 621, (failed)
Tomita's (Tomita 1982) 4th grammar (for strings on {0, 1} not containing "000" as a substring) is a typical test grammar used in studies of regular grammar inference. The minimal DFA has four states. Giles et al. (1992a) used this grammar to test the capability of 2ORNNs to learn regular grammars. Their experiments included networks with 3, 4, and 5 hidden neurons. Table 3 shows the results of learning this grammar with a network with three hidden neurons, which seems to be the minimum number needed. The test set consists of the 1023 strings having length 9 or less, with the same word-addition scheme; the maximum number of epochs is set to 1000, the value of τ is 0.4, corresponding to the target values 0.2 and 0.8 for rejection and acceptance, respectively, and the learning parameters are α = 2.0 and η = 0.2, the same as in the previous runs. The results show that the new model, where the initial state is also learned, reaches convergence more easily than the original model, which failed to converge in 1000 epochs in 9 out of 22 runs as compared to 3 out of 22 for the new model. Even after eliminating the runs that failed in both cases, the average number of epochs is somewhat smaller for the new model (355 versus 447). It may be argued that, again, the network is too small to accommodate easily the possible additional DFA transitions and states that may be due to the presence of an end-of-string symbol, but easily accommodates the smaller DFAs inferred by the new model.

Table 4 shows the results of learning Tomita's 4th grammar with a network with four hidden neurons, one more than in the previous test set. All the other parameters are the same. As could be expected for a larger network where both kinds of automata may be accommodated easily (nonminimal automata are sometimes inferred), the new model does not make a big difference. The original model fails to learn the grammar in 1000 epochs in 3 out of 20 runs, whereas the new model fails in 5 runs. The average number of epochs, not taking the failed jobs into account, is 375 for the old model and 235 for the new, a fact that somehow compensates for the higher number of failures by the new model.

Table 4: Learning the Tomita-4 Grammar with a 2ORNN Having 4 Hidden Neurons, with and without Optimization of the Initial State.

Initial state | Number of epochs for 22 consecutive runs
Learned       | 196, (failed), 137, 291, 169, (failed), 340, 286, 163, 276, (failed), 341, 151, 208, (failed), 296, (failed), 266, 290, 258, 346, 377
Fixed         | 937, 240, 211, (failed), 270, 406, 314, 254, 281, 576, 183, 178, 322, 320, 320, 908, 525, (2 failed), 278, 476, 169
6 Concluding Remarks
The results presented show that for the new model introduced in this paper, the learning process is faster when the recurrent network is small for a given grammar (when the number of neurons is larger, the speed is not appreciably affected by the improvement). Indeed, the modification may sometimes be critical to achieve learning in a reasonable number of epochs. This occurs at a negligible increase in time complexity in most cases. An extension of this modification to other recurrent architectures and a more thorough experimental suite to assess its effect on the learning process are under way.
Acknowledgments

The authors wish to thank the Dirección General de Investigación Científica y Técnica of the Government of Spain for support through project CICYT/TIC93-0633-C02.

References

Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., and Lee, Y. C. 1992a. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Comp. 4, 393-405.
Giles, C. L., Miller, C. B., Chen, D., Sun, G. Z., Chen, H. H., and Lee, Y. C. 1992b. Extracting and learning an unknown grammar with recurrent neural networks. In Advances in Neural Information Processing Systems, J. Moody et al., eds., Vol. 4, pp. 317-324. Morgan Kaufmann, San Mateo, CA.
Siegelmann, H. T., Sontag, E. D., and Giles, C. L. 1992. The complexity of language recognition by neural networks. In Information Processing 92, Vol. 1, pp. 329-335. Elsevier/North-Holland, Amsterdam.
Tomita, M. 1982. Dynamic construction of finite-state automata from examples, using hill-climbing. Proc. Fourth Annu. Cogn. Sci. Conf. 105.
Watrous, R. L., and Kuhn, G. M. 1992a. Induction of finite-state automata using second-order recurrent networks. In Advances in Neural Information Processing Systems, J. Moody et al., eds., Vol. 4, pp. 306-316. Morgan Kaufmann, San Mateo, CA.
Watrous, R. L., and Kuhn, G. M. 1992b. Induction of finite-state languages using second-order recurrent networks. Neural Comp. 4, 406-414.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270.
Received August 2, 1994; accepted January 11, 1995.
Communicated by C. Lee Giles
An Algebraic Framework to Represent Finite State Machines in Single-Layer Recurrent Neural Networks

R. Alquézar
Software Dept., Universitat Politècnica de Catalunya (UPC)

A. Sanfeliu
Institut de Cibernètica (UPC - CSIC), Diagonal 647, 2a, 08028 Barcelona, Spain
In this paper we present an algebraic framework to represent finite state machines (FSMs) in single-layer recurrent neural networks (SLRNNs), which unifies and generalizes some of the previous proposals. This framework is based on the formulation of both the state transition function and the output function of an FSM as a linear system of equations, and it permits an analytical explanation of the representational capabilities of first-order and higher-order SLRNNs. The framework can be used to insert symbolic knowledge in RNNs prior to learning from examples and to keep this knowledge while training the network. This approach is valid for a wide range of activation functions, whenever some stability conditions are met. The framework has already been used in practice in a hybrid method for grammatical inference reported elsewhere (Sanfeliu and Alquézar 1994).

1 Introduction

The representation of finite-state machines (FSMs) in recurrent neural networks (RNNs) has attracted the attention of researchers for several reasons, ranging from the pursuit of hardware implementations to the integration (and improvement) of symbolic and connectionist approaches to grammatical inference and recognition. Some previous works (Minsky 1967; Alon et al. 1991; Goudreau et al. 1994) have shown how to build different RNN models, using hard-limiting activation functions, that perfectly simulate a given finite state machine. None of these approaches yields the minimum-size RNN that is required. Minsky's method (1967) uses a recurrent layer of McCulloch-Pitts units to implement the state transition function, and a second layer of OR gates to cope with the output function; the recurrent layer has n × m units, where n is the number of states and m is the number of input symbols. The method by Alon et al. (1991) uses a three-layer recurrent network that needs a number of threshold cells of order n^{3/4} × m. Recently, Goudreau et al.
(1994) have proven that, while second-order single-layer RNNs (SLRNNs) can easily implement any n-state automaton using n recurrent units (and a total number of n² × m weights), first-order SLRNNs have limited representational capabilities. Other studies have been devoted to the design of methods for incorporating symbolic knowledge into RNNs made up of sigmoid activation units, both for first-order (Frasconi et al. 1991) and second-order (Omlin and Giles 1992) RNNs. This may yield faster learning and better generalization performance, as it permits a partial substitution of training data by symbolic rules, when compared with full inductive approaches that infer finite-state automata from examples (Pollack 1991; Giles et al. 1992). This paper introduces a linear model for FSM representation in SLRNNs (Section 2), which improves the models reported elsewhere (Sanfeliu and Alquézar 1992; Alquézar and Sanfeliu 1993). A study of the analytical conditions that are needed to ensure the stability of the state representation is included. This new model unifies and generalizes some of the previous proposals (Minsky 1967; Goudreau et al. 1994) and explains the limitations of first-order SLRNNs (Section 3). A related method for inserting symbolic knowledge prior to learning in RNNs, which is valid for a wide class of activation functions, is described (Section 4). This method has been used in a hybrid approach for grammatical inference reported recently (Sanfeliu and Alquézar 1994).

R. Alquézar and A. Sanfeliu, Neural Computation 7, 931-949 (1995) © 1995 Massachusetts Institute of Technology

2 The FS-SLRNN Linear Model of Finite State Machine Representation
in Single-Layer Recurrent Neural Networks

2.1 Basic Definitions and Background. In the following we consider single-layer recurrent neural networks (SLRNNs)¹ such as the one shown in Figure 1. An SLRNN has M inputs, which are labeled x_1, x_2, ..., x_M, and a single layer of N units (or neurons) U_1, U_2, ..., U_N, whose outputs (or activation values) are labeled y_1, y_2, ..., y_N. The values at time t of inputs x_i (1 ≤ i ≤ M) and unit outputs y_j (1 ≤ j ≤ N) are denoted by x_i^t and y_j^t, respectively. The activation values of the neurons represent collectively the state of the SLRNN, which is stored in a bank of latches. Each unit computes its output value based on the current state vector S^t = [y_1^{t-1}, y_2^{t-1}, ..., y_N^{t-1}]^T and the input vector I^t = [x_1^t, x_2^t, ..., x_M^t]^T, so the network is fully connected. Some number P of the neurons (1 ≤ P ≤ N) can be used to supply an output vector O^t = [y_1^t, y_2^t, ..., y_P^t]^T in order to accomplish a given task. Those neurons that are not involved in the output function of the SLRNN are usually called hidden units. The equations that describe the dynamic behavior of the SLRNN are

a_k^t = f(w_k, I^t, S^t)    for 1 ≤ k ≤ N    (2.1)

y_k^t = g(a_k^t)    for 1 ≤ k ≤ N    (2.2)

¹ We approximately follow the notation that is described by Goudreau et al. (1994).
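The dynamics defined by equations 2.1-2.2 are straightforward to prototype. The sketch below is an illustration only (not code from the paper): one time step of an SLRNN in NumPy, with a first-order and a second-order aggregation function f and the sigmoid as activation function g.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def slrnn_step_first_order(W, x, y_prev, g=sigmoid):
    """One time step of a first-order SLRNN (equations 2.1-2.2).

    W      : (N, M + N) weight matrix; row k holds w_k.
    x      : (M,) input vector I^t.
    y_prev : (N,) previous activations, i.e., the state vector S^t.
    Returns the new activation vector y^t.
    """
    a = W @ np.concatenate([x, y_prev])   # net inputs a_k^t = f(w_k, I^t, S^t)
    return g(a)                           # y_k^t = g(a_k^t)

def slrnn_step_second_order(W2, x, y_prev, g=sigmoid):
    """Second-order variant (equation 2.4): W2 has shape (N, M, N)."""
    a = np.einsum('kij,i,j->k', W2, x, y_prev)  # sum of products x_i * y_j
    return g(a)
```

Iterating either step function over an input sequence, starting from an initial state vector, runs the network; the first P components of each y^t form the output vector O^t.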
Figure 1: Single-layer recurrent neural network (SLRNN) architecture.

where f is a weighted sum of terms that combines inputs and received activations to give the net input values a_k, g is usually a nonlinear function, and w_k is a vector of weights associated with the incoming connections of unit U_k. The SLRNNs can be classified according to the types of their f and g functions, which will be referred to as the aggregation and activation functions of the SLRNN, respectively. The usual choices for function f characterize the SLRNN either as first-order type:

f(w_k, I^t, S^t) = Σ_{i=1}^{M} w_{ki} x_i^t + Σ_{j=1}^{N} w_{k(M+j)} y_j^{t-1}    (2.3)

or second-order type (Giles et al. 1992)²:

f(w_k, I^t, S^t) = Σ_{i=1}^{M} Σ_{j=1}^{N} w_{kij} x_i^t y_j^{t-1}    (2.4)
The most common choices for the activation function g are the hard-limiting threshold function

g_h(a) = 1 if a ≥ 0, 0 otherwise    (2.5)

² In the first-order case, the SLRNN has N × (M + N) weights, while in the second-order case the number of weights rises to N² × M.
and the sigmoid function

g_s(a) = 1 / (1 + e^{-a})    (2.6)

The former has been used mainly for implementation purposes (Minsky 1967) and to perform analytical studies of computability (Alon et al. 1991; Goudreau et al. 1994). The latter allows the use of gradient descent methods, such as the RTRL algorithm (Williams and Zipser 1989), to train the SLRNN to learn a sequential task (e.g., symbol prediction (Cleeremans et al. 1989) or string classification (Giles et al. 1992)). A Mealy machine (Booth 1967) is an FSM defined as a quintuple (I, O, S, δ, η), where I is a set of m input symbols, O is a set of p output symbols, S is a set of n states, δ : I × S → S is the state transition function, and η : I × S → O is the output function. In a Moore machine (Booth 1967), the output function depends only on the states (i.e., η : S → O).

2.2 Construction of the Linear Model of FSM Representation in SLRNNs. We now show that the representation of a finite-state machine (FSM) in an SLRNN can be modeled as two linear systems of equations. We refer to this algebraic representation as a finite state single-layer recurrent neural network (FS-SLRNN) model. First, we concentrate our attention on the state transition function δ of an FSM and assume that the SLRNN is just concerned with its implementation. Later, we will discuss the representation of the output function η. When an SLRNN is running at a discrete time step t, one can think of the M input signals x_i as encoding a symbol a ∈ I of the machine input alphabet, the feedback of recurrent units as representing the current state q ∈ S reached after the sequence of previous symbols, and the output of the N neurons y_j as standing for the destination state q' that results from the current transition. Thus, the set of N unit equations can be seen as implementing the state transition δ(a, q) = q' that occurs at time t.
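The Mealy machine definition above can be made concrete with a short sketch (hypothetical code; the transition and output tables below are the standard odd-parity recognizer, which is the machine used in Figure 2, with the output flagging whether the destination state is odd parity):

```python
from dataclasses import dataclass

@dataclass
class MealyMachine:
    """A Mealy machine (I, O, S, delta, eta) as in Booth (1967)."""
    delta: dict   # (input symbol, state) -> next state
    eta: dict     # (input symbol, state) -> output symbol
    start: str

    def run(self, inputs):
        q, out = self.start, []
        for a in inputs:
            out.append(self.eta[(a, q)])  # output of transition delta(a, q)
            q = self.delta[(a, q)]
        return out

# Odd-parity recognizer: state 'odd' iff the number of '1's seen is odd;
# the output is 1 exactly when the destination state is 'odd'.
parity = MealyMachine(
    delta={('0', 'even'): 'even', ('1', 'even'): 'odd',
           ('0', 'odd'): 'odd',   ('1', 'odd'): 'even'},
    eta={('0', 'even'): 0, ('1', 'even'): 1,
         ('0', 'odd'): 1,  ('1', 'odd'): 0},
    start='even')
```

For example, `parity.run("101")` ends with output 0, since "101" contains an even number of 1s.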
Since an FSM has a finite number D of possible transitions (at most D = mn), at any given time step t, the network dynamic equations should implement one of them. Without loss of generality, let us assume that δ is defined for all the pairs (a ∈ I, q ∈ S). Hence, if we number all the FSM state transitions δ(a, q) with an index d (1 ≤ d ≤ mn), the finite set of equations

y_{dk} = g(f(w_k, I_d, S_d))    for 1 ≤ d ≤ mn, for 1 ≤ k ≤ N    (2.7)

should be satisfied by the weights of an equivalent SLRNN at any arbitrary time t, where I_d and S_d refer to the input and state vectors that encode the input symbol a and current state q of the d-th transition δ(a, q), respectively,
and the activation values y_{dk} (1 ≤ k ≤ N) are related to the code of its destination state.³ The system of nonlinear equations (2.7) describes a complete deterministic state transition function δ where only the weights w_k (1 ≤ k ≤ N) of the SLRNN are unknown. Due to the difficulty of studying nonlinear systems analytically, it is often desirable to convert them into linear systems if possible. In this case, we can transform equations (2.7) into a manageable linear system by performing the following steps:

1. Drop the time variable, i.e., convert the dynamic equations into static ones.

2. Use the inverse of the nonlinear activation function g.

The first step is justified by the fact that the set of equations (2.7) must be fulfilled for arbitrary time steps t. However, this simplification can be made only as long as the stability of the state codes is guaranteed (see next subsection). In addition, the SLRNN must be initialized to reproduce the code of the start state on the unit activation values before running any sequence of transitions. Concerning the use of the inverse function g⁻¹, it must be satisfied that for all points y that can appear in the state codes, either a unique g⁻¹(y) exists, or if the inverse is not unique (as for the hard-limiter g_h) then there exists an established criterion for selecting the proper value in any case.⁴ Therefore, the preceding transformations convert the nonlinear system of equations (2.7) into the linear system⁵:

g⁻¹(y_{dk}) = f(w_k, I_d, S_d)    for 1 ≤ d ≤ mn, for 1 ≤ k ≤ N    (2.8)
which can be described in matrix representation as

A W = B    (2.9)
where A (D × E) is the array of the neuron inputs, W (E × N) is the (transposed) weight array, B (D × N) is the array of the neuron linear outputs, D = mn is the number of transitions in the FSM, and E is the number of inputs to each neuron.

³ It should be noted that the ordering defined by the index d is static and arbitrary. In particular, we can use the following ordered list: δ(a_1, q_1), δ(a_1, q_2), ..., δ(a_1, q_n), δ(a_2, q_1), ..., δ(a_m, q_n).
⁴ Such a criterion will be applied in the representation of Minsky's general solution under our FS-SLRNN model.
⁵ Caution: this does not mean that a linear sequential machine (Booth 1967) is involved. A linear machine would require a linear function g.
Figure 2: Odd parity recognizer.

For a first-order SLRNN, E = M + N, and equation 2.9 takes the following form⁶: row d of A is (I_{d1}, ..., I_{dM}, S_{d1}, ..., S_{dN}), column k of W is (w_{k1}, ..., w_{k(M+N)})^T, and B_{dk} = a_{dk}, where I_{di} and S_{dj} refer to the ith (jth) element of the vectors that respectively encode the input symbol and state of the dth transition of the FSM. For a second-order SLRNN, E = MN, and equation 2.9 takes the analogous form in which row d of A consists of the MN products I_{di} S_{dj} and column k of W holds the second-order weights w_{kij}.
The above construction will be illustrated with an example. Consider the odd-parity recognizer shown in Figure 2. The coefficient arrays A and B that are obtained for a first-order and for a second-order SLRNN (with sigmoid activation function), by using local encoding and applying the procedure explained so far, are shown in Figures 3 and 4, respectively, where each row is labeled by the associated transition. Note that the first-order system is not solvable. Furthermore, note that the use of local encoding for both symbols and states in a second-order SLRNN implies an identity diagonal matrix A, and therefore a solvable system. With respect to the output function η : I × S → O of the Mealy machine, two approaches can be taken to represent it using the FS-SLRNN model. If, as introduced in the beginning of this section, the output vector O is considered as part of the state vector S, then the representation of the output function just corresponds to some part of the linear system

⁶ If a bias term is included (corresponding to a weighted input signal 1 for each unit) M is increased by one.
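The odd-parity construction is easy to reproduce numerically. The sketch below is illustrative code (not the paper's): it builds A and B for the first- and second-order cases using local (one-hot, 0/1) encodings and a constant C chosen as in Figures 3 and 4, and confirms that the first-order coefficient matrix is rank deficient while the second-order one is the identity.

```python
import numpy as np
from itertools import product

# Odd-parity recognizer: 2 input symbols, 2 states, local (one-hot) encoding.
symbols = {'0': [1, 0], '1': [0, 1]}
states  = {'even': [1, 0], 'odd': [0, 1]}
delta   = {('0', 'even'): 'even', ('1', 'even'): 'odd',
           ('0', 'odd'): 'odd',   ('1', 'odd'): 'even'}

C = 10.0  # chosen so that g_s(C) ~ 1 and g_s(-C) ~ 0

rows1, rows2, rhs = [], [], []
for a, q in product(symbols, states):          # one row per transition d
    I, S = symbols[a], states[q]
    rows1.append(I + S)                        # first order:  E = M + N
    rows2.append([i * s for i in I for s in S])  # second order: E = M * N
    rhs.append([C if b else -C for b in states[delta[(a, q)]]])

A1, A2, B = np.array(rows1, float), np.array(rows2, float), np.array(rhs, float)

print(np.linalg.matrix_rank(A1))  # 3 < 4: rank-deficient first-order system
print(np.linalg.matrix_rank(A2))  # 4: second-order A is the 4x4 identity
```

Solving A1 W = B in the least-squares sense leaves a large residual (the system is inconsistent), which is the algebraic counterpart of the statement that the first-order SLRNN cannot represent this state transition function directly.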
Figure 3: Representation in a first-order FS-SLRNN model of the state transition function for the odd parity recognizer. The value C is chosen such that g_s(C) ≈ 1 and g_s(−C) ≈ 0.

Figure 4: Representation in a second-order FS-SLRNN model of the state transition function for the odd parity recognizer. The value C is chosen such that g_s(C) ≈ 1 and g_s(−C) ≈ 0.
A W = B of equation 2.9 (e.g., the first P columns in arrays W and B), that is

a_{dk} = g⁻¹(y_{dk}) = f(w_k, I_d, S_d)    for 1 ≤ d ≤ mn, for 1 ≤ k ≤ P    (2.10)
Goudreau et al. (1994) demonstrated that second-order SLRNNs can implement any output function η for all the FSMs following this scheme, but first-order SLRNNs need to be augmented to implement some η functions, either by adding a feedforward layer of output units (Minsky 1967; Elman 1990), or by allowing a one-time-step delay before reading the output (Goudreau and Giles 1993). Therefore, for (first-order) augmented SLRNNs, the appropriate approach is to separate the representation of the output function η from that of the state transition function δ. To this end, the N recurrent units are preserved for state encoding, and P new nonrecurrent units O_l are added to provide the output vector O^{t+1} = [o_1^{t+1}, ..., o_P^{t+1}], which is given by

o_l^{t+1} = g(f(w_l, S^{t+1}))    for 1 ≤ l ≤ P    (2.11)

where o_l^{t+1} refers to the activation value of output unit O_l at time t + 1, and w_l is the vector of weights of unit O_l.
In this case, after a "linearization" process for the output units, an additional linear system is yielded to represent η:

g⁻¹(o_{dl}) = w_{l1} + Σ_{i=1}^{N} w_{l(1+i)} y_{di}    for 1 ≤ d ≤ mn, for 1 ≤ l ≤ P    (2.12)

where the values ... ⁸

⁸ Augmented SLRNN for first-order type.
2. Initialize the weights of the hidden and output units to any of the solutions W_S and W_O that result from solving the above linear systems.⁹

Usually, a neural learning scheme updates all the network weights to minimize an error function. In such a case, the inserted rules may be degraded and eventually forgotten as the weights are modified to cope with the training data. Although this behavior allows for rule refinement and it may be valid to learn a given task, an alternative approach that preserved the inserted FSM would be preferable if this were known to be part of a task solution. To that end, a constrained neural learning procedure must be followed. For example, if an on-line gradient-descent algorithm such as RTRL (Williams and Zipser 1989) is used, then the free weights of a recurrent unit k should be changed according to

Δw_{kl}(t) = −α [ ∂E(t)/∂w_{kl} + Σ_{w_{kh} ∈ D(w_k)} (∂E(t)/∂w_{kh}) (∂w_{kh}/∂w_{kl}) ]    (4.1)
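A constrained step of this kind can be sketched as follows. This is illustrative code only: the linear dependence w_det = c + T · w_free is a hypothetical parameterization standing in for the linear relations fixed by the underdetermined system solution, and the two gradient vectors would in practice be supplied by RTRL.

```python
import numpy as np

def constrained_update(w_free, grad_free, grad_det, T, c, alpha=0.1):
    """One constrained gradient-descent step in the spirit of equation 4.1.

    The determined weights are assumed to satisfy w_det = c + T @ w_free
    (a hypothetical linear parameterization of the inserted-FSM solution).
    grad_free, grad_det : dE/dw for the free and determined weights.
    """
    # Chain rule: total gradient w.r.t. a free weight picks up the
    # contribution of the determined weights, since dw_det/dw_free = T
    # is a known constant matrix.
    total_grad = grad_free + T.T @ grad_det
    w_free = w_free - alpha * total_grad   # update only the free weights
    w_det = c + T @ w_free                 # re-impose the linear constraints
    return w_free, w_det
```

Because w_det is recomputed from the constraints after every step, the inserted symbolic knowledge is preserved exactly throughout training.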
where α is the learning rate, E(t) is the overall network error at time t, D(w_k) denotes the subset of determined weights in unit k, and the partial derivatives ∂w_{kh}/∂w_{kl} are known constants given by the linear relations among weights. The RTRL algorithm itself can be employed to compute both the partial derivatives ∂E(t)/∂w_{kl} and ∂E(t)/∂w_{kh}. On the other hand, the weights w_{kh} in D(w_k) should be changed at each step, after updating all the free weights w_{kl}, to keep the linear constraints specified in the underdetermined solution of the system.

5 Conclusions
An algebraic linear framework to represent finite state machines (FSMs) in discrete-time single-layer recurrent neural networks (SLRNNs), which has been termed an FS-SLRNN model, has been presented. The scheme is based on the transformation of the nonlinear constraints imposed on the network dynamics by the given FSM and data encoding into a set of static linear equations to be satisfied by the network weights. This transformation leads to an exact emulation of the FSM when some stability conditions, which have been stated, are met; otherwise the linear model is just a static approximation of the network dynamics. It has been proved, using the FS-SLRNN model, that first-order SLRNNs have some limitations in their representation capability, which are caused by the existence of some linear relations (that always hold) among the equations associated with the state transitions. To overcome

⁹ The preferred initialization is that in which the sum of squared weights is minimal and the weights are nonzero.
this problem, a first-order SLRNN may need to represent a larger equivalent machine with split states. Furthermore, first-order SLRNNs need to be augmented (e.g., with an output layer) to be able to represent every output mapping. According to these requirements, the method for FSM implementation in augmented first-order SLRNNs by Minsky (1967) has been generalized. On the other hand, second-order (or higher-order) SLRNNs can easily implement all the FSMs, since their corresponding FS-SLRNN models, given an orthogonal data encoding, are characterized by the full rank of the system matrix. The method for FSM implementation in second-order SLRNNs by Goudreau et al. (1994) can be seen as a particular case of this class of solutions. The actual requirements on the network activation function have been determined, and these have been shown to be quite weak (i.e., a large spectrum of activation functions can be used for FSM implementation in SLRNNs). The framework proposed can be used to insert symbolic knowledge into discrete-time SLRNNs prior to neural learning from examples. This can be done by initializing the network weights to any of the possible solutions of an underdetermined linear system representing the inserted (partial) FSM with an excess of recurrent units. In comparison with other published methods (Frasconi et al. 1991; Omlin and Giles 1992) that insert FSMs into RNNs, the method proposed is more general, since it can be applied to a wide variety of both activation and aggregation functions. Moreover, a new distinguishing feature of our insertion method is that it allows the inserted rules to be kept during subsequent learning, by training a subset of free weights and updating the others to force the satisfaction of the linear constraints in the system solution. The ultimate aim of this paper is to establish a well-defined link between symbolic and connectionist sequential machines.
Linear algebra has been used as a suitable tool to aid the study of representational issues. Further research is needed to fully exploit the proposed model for other applications, such as the determination of the smallest SLRNN that simulates a given FSM, and the development of improved learning techniques for grammatical inference (Sanfeliu and Alquézar 1994).

Appendix 1

Let A_S W_S = B_S be the linear model that represents the state transition function δ' of the maximally split automaton (with m inputs and mn states) equivalent to the original automaton (m inputs, n states, and transition function δ) of the given FSM, using a first-order SLRNN and an orthogonal data encoding (with values (0, 1)) for both states and inputs. A_S is a matrix of m²n rows (for a complete δ') and (m + mn + 1) columns, whereas B_S also has m²n rows but just mn columns. The rows of both A_S and B_S can be organized into m blocks of mn rows, where each block (R_i^A
and R_i^B) corresponds to the transitions with the input symbol a_i. Let r_{ij}^A and r_{ij}^B denote the jth row of blocks R_i^A and R_i^B, respectively (which are associated with transition δ'(a_i, q_j) of the maximally split automaton). Each block of rows is divided into n subblocks (as many as states in the original FSM), and each subblock (R_{ik}^A and R_{ik}^B) has as many rows as states resulting from the split of the state q_k. The rows in any subblock R_{ik}^B of matrix B_S are identical since they code the same destination state (identified by the pair (a_i, q_k)). Let K(j) be a function that indicates the number of the subblock k to which the jth row of any block belongs. The columns of B_S (denoted as c_{uv}) are labeled by two subindexes, u = 1, ..., m and v = 1, ..., n, which associate c_{uv} with the unit that flags the state of the split automaton characterized by "being the destination of a transition with the uth input symbol from a state equivalent to the vth state of the original automaton." Finally, let a_{(ij)(uv)} be the element in the row r_{ij}^B and in the column c_{uv}, and let y_{(ij)(uv)} = g(a_{(ij)(uv)}) be the corresponding activation value. Since we deal with a first-order SLRNN and an orthogonal encoding is followed for both inputs and states, Theorem 1 establishes that the rank of A_S is mn + m − 1 and there exist (mn − 1)(m − 1) linear relations among the rows of A_S, these being the following:
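Whatever form the individual relations take, the rank statement of Theorem 1 is easy to check numerically. The sketch below is illustrative code (an assumption-laden reconstruction: orthogonal (0,1) encoding plus a bias input, which accounts for the m + mn + 1 columns stated above); since a complete δ' visits every (input symbol, split state) pair, A_S can be built by enumerating one row per pair, and its rank does not depend on δ' itself.

```python
import numpy as np
from itertools import product

def split_A(m, n):
    """Coefficient matrix A_S of a first-order SLRNN for the maximally
    split automaton: m input symbols, m*n split states, orthogonal (0,1)
    encoding, one bias column; one row per transition (m**2 * n rows)."""
    rows = []
    for i, j in product(range(m), range(m * n)):
        x = [0] * m
        x[i] = 1               # one-hot input symbol a_i
        s = [0] * (m * n)
        s[j] = 1               # one-hot split state
        rows.append(x + s + [1])  # bias input
    return np.array(rows, float)

# Theorem 1: rank(A_S) = mn + m - 1, for any m and n.
for m, n in [(2, 2), (3, 2), (2, 4)]:
    assert np.linalg.matrix_rank(split_A(m, n)) == m * n + m - 1
```

The rank deficit relative to the m + mn + 1 columns reflects the two sum constraints (each row's input code and state code both sum to 1, matching the bias), which is the source of the linear relations among the transition equations.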
Algorithms for Generalized Learning Automata
V. V. Phansalkar and M. A. L. Thathachar

K > 0 is a real number and n a positive integer. h' is the derivative of h. Different values of K, L, and n can be used for each (i, j). Changing L to L_{ij} would change the constraints, but the analysis would remain the same. The results hold equally well for differing values of K, L, and n. {C_{ij}(k) : 1 ≤ i ≤ M, 1 ≤ j ≤ n(i), k ≥ 0} is a sequence of iid random variables with zero mean and variance σ², σ being a positive constant. For example, they can be iid, taking values ±σ with equal probability. The analysis of this algorithm is essentially the same as that given in Thathachar and Phansalkar (1994). The algorithm is approximated by the Langevin equation using weak convergence techniques and it is known that the Langevin equation globally optimizes the appropriate function, when σ is small enough. In practice, a constant value of σ need not be used from the beginning. A high value of σ can be used initially to increase the speed of the algorithm. After a finite number of steps σ is fixed at a sufficiently low value. The above results hold in these cases, since the initial steps when σ is reduced from an initial high value to a low fixed value can be ignored in the analysis. This is due to the fact that σ is kept constant after a finite number of steps.

7 Simulation Results
In this section, simulation results for the algorithms discussed in Sections 4, 5, and 6 are presented.

Example 2. The example used for the local algorithms is the example that is presented in Section 4. The REINFORCE algorithm is simulated with the learning parameter b = 0.8. In 25 simulation runs, u1(k) reached 6.8 and u2(k) reached −6.8 in 12,500 steps. In one simulation carried up to 36 × 10⁶ steps, the magnitude reached was 15.9. The modified algorithm is simulated with b = 0.8. The bounds were set to ±5 and the value of K_{ij} was 1. u1 and u2 converged to values around +5 and −5, respectively, in an average of 730 steps. Next, an example is presented to show that the global algorithm of Section 6 works in cases where the local algorithm of Section 5 does not work.

Example 3. A single unit interacts with the environment in this example. The context vectors arrive uniformly from [0,1] × [0,1]. The unit has two actions. A = {a1, a2} is the set of actions. The internal state of the unit is u = (u1, u2)^T and g is the probability generating function. If c is the context vector from the environment,

Prob{action = a | c, u} = g(c, a, u)
Figure 2: Optimal regions for example 3.
If 1.0 ≤ c1 + c2 ≤ 1.5 then

E[r | c, a1] = 1 − E[r | c, a2] = 0.9

otherwise

E[r | c, a1] = 1 − E[r | c, a2] = 0.1
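The environment of Example 3 is simple to simulate. The sketch below is illustrative code (not from the paper): it draws context vectors uniformly from the unit square and returns a binary reward with the success probabilities just stated, so action a1 pays off inside the band 1.0 ≤ c1 + c2 ≤ 1.5 and a2 outside it.

```python
import random

def context():
    """Context vectors arrive uniformly from [0,1] x [0,1]."""
    return (random.random(), random.random())

def reward(c, action):
    """Binary reward for Example 3.

    The probability of reward 1 is 0.9 when the chosen action matches
    the optimal one for the band 1.0 <= c1 + c2 <= 1.5, and 0.1 otherwise.
    """
    in_band = 1.0 <= c[0] + c[1] <= 1.5
    p = 0.9 if (action == 'a1') == in_band else 0.1
    return 1 if random.random() < p else 0
```

A unit restricted to a single hyperplane cannot separate the band from its complement exactly, which is why the best it can do is minimize the misclassification probability, as discussed next.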
Denote the region where a1 is optimal by A1. a2 is optimal in A2, the complement of A1. The regions are shown in Figure 2. The unit is capable of learning only a single hyperplane. The best it can do in this case is to minimize the probability of misclassification. This occurs when the hyperplane learned is c1 + c2 − 1 = 0. The optimal values of both u1 and u2 are 10. L can be any value greater than 10 so that the optimal value can be reached; the value of L was fixed at 20. σ was fixed at 0.1 and the learning parameter, b, was set to 0.1. Twenty simulation runs were conducted with the initial conditions u1 = u2 = 0. The system converged to the global maximum in every simulation. The average number of steps taken to converge to the global optimum was 15 × 10⁵. The algorithm of Section 5 was also tried out on this example. Simulations were conducted for two values of the learning parameter, 0.1
and 0.25. The algorithm converged to values that do not give the global maximum, converging to negative values in some cases. The same initial conditions as for the global algorithm were used. Thus, it is seen that the global algorithm converges to the global maximum while the local counterpart fails. The time taken per run on the VAX 8810 was 115 sec for the local algorithm and 135 sec for the global algorithm. The timings are for a run of 15 × 10⁵ steps.

8 Conclusions

In this paper long-term analysis of the REINFORCE algorithm for GLA is attempted using weak convergence techniques. It is shown that this algorithm can be approximated by an appropriate gradient ascent ODE. However, an example illustrated the fact that this algorithm can lead to unbounded behavior. To overcome this, the optimization problem is reposed as a constrained optimization problem and a different algorithm is suggested. Under some assumptions, a one-to-one correspondence is established between the stable equilibrium points of the ODE approximating the new algorithm and the local maxima of the constrained optimization problem. In this paper the basic version of the REINFORCE algorithm (Williams 1988) is analyzed, but similar results would hold even for its generalizations (Williams 1992). The algorithms discussed in Sections 4 and 5 are local optimization algorithms. In Section 6 a global algorithm for reinforcement learning based on the constant temperature heat bath technique is presented and analyzed. It is approximated by a stochastic differential equation (SDE) and the invariant probability measure of the SDE concentrates on the global maxima. The algorithm is based on the algorithm presented in Section 4. Simulations confirm the analytical results and show that the algorithm converges to the global optimum even in cases where its local equivalent does not.

Appendix

Proof of Theorem 1.
Let ξ(k) be the vector consisting of the SRS at instant k, the actions of all the units at instant k, and the context vector from the environment at instant k. Since the units have a finite number of actions and the context vectors are assumed to arrive from a compact set, ξ(k) ∈ S, where S is a compact metric space. The following properties hold.

1. {u(k), ξ(k − 1)} is a Markov process.

2. Algorithm 4.1 can be written in vector form as

u^b(k + 1) = u^b(k) + b G[u^b(k), ξ^b(k)]    (A.1)
It is seen that G(·, ·) is bounded over all compact sets since g_i is bounded away from zero over all compact sets.

3. For a given b > 0, define a one step transition probability on S by
P^b(ξ, B | u) = Prob{ξ^b(k) ∈ B | ξ^b(k − 1) = ξ, u^b(k) = u}

P^b(·, · | u) is independent of k as the algorithm 4.1 is not explicitly dependent on k.

... (A.6)

The equalities in A.6 hold because of the assumption that all active constraints at u' are strictly active. Shift ū to the origin by the transformation

ε = u − ū
The following two cases are considered to calculate dε/dt.

Case 1: (i, j) ∈ I

dε_{ij}/dt = ∂f/∂u_{ij}[h(ū + ε)] + K_{ij}{h_{ij}(ū_{ij} + ε_{ij}) − ū_{ij} − ε_{ij}}

Since |ū_{ij}| < L_{ij}, the neighborhood considered can be such that |ū_{ij} + ε_{ij}| < L_{ij} in the neighborhood. Thus,

h_{ij}(ū_{ij} + ε_{ij}) = ū_{ij} + ε_{ij}

It is known that h(ū) = u'. Let

h(ū + ε) = u' + ε'

Then ε'_{kl} = 0 if (k, l) ∈ A and ε'_{kl} = ε_{kl} if (k, l) ∈ I, so

dε_{ij}/dt = ∂f/∂u_{ij}(u' + ε')

As (i, j) ∈ I in this case, expanding to first order gives

dε_{ij}/dt = ∂f/∂u_{ij}(u') + (ε')^T ∇(∂f/∂u_{ij})(u')

and since the FONKT conditions imply ∂f/∂u_{ij}(u') = 0 for (i, j) ∈ I, and since ε'_{kl} = 0 for (k, l) ∈ A and ε'_{kl} = ε_{kl} otherwise,

dε_{ij}/dt = Σ_{(k,l) ∈ I} (∂²f/∂u_{ij}∂u_{kl})(u') ε_{kl}
Case 2: (i, j) ∈ A

dε_{ij}/dt = ∂f/∂u_{ij}[h(ū + ε)] + K_{ij}{h_{ij}(ū_{ij} + ε_{ij}) − ū_{ij} − ε_{ij}}

since h(ū + ε) = u' + ε' and h_{ij}(ū_{ij} + ε_{ij}) = u'_{ij} if ε_{ij} is small enough, as |ū_{ij}| > L_{ij}. Retaining only the first-order terms and after simplification,

dε_{ij}/dt = Σ_{(k,l) ∈ I} (∂²f/∂u_{ij}∂u_{kl})(u') ε_{kl} − K_{ij} ε_{ij}

Let H_II denote the Hessian of f restricted to I and H_AI the Hessian restricted to those {(i, j), (k, l)} such that (i, j) ∈ A and (k, l) ∈ I. Also, let Λ_A denote the diagonal matrix with diagonal elements −K_{ij}, (i, j) ∈ A.
ε_I is the part of ε restricted to I and ε_A the part restricted to A. The first-order approximation can then be written as

d/dt (ε_I, ε_A)^T = ( H_II  0 ; H_AI  Λ_A ) (ε_I, ε_A)^T    (A.7)
The eigenvalues of the matrix in A.7 are the eigenvalues of H_II and Λ_A. Since the second-order sufficiency conditions hold, H_II is negative definite, and since K_{ij} > 0 for all (i, j), Λ_A is also negative definite. Thus, all the eigenvalues are in the open left half plane and therefore ū is a locally asymptotically stable equilibrium point. □
Proof of Lemma 5. As u' is an equilibrium point of the ODE, by Lemma 2 h(u') satisfies the FONKT conditions. Define the sets A and I as in Lemma 4, except that they are defined here with respect to h(u'). Transfer the origin to u' by the transformation

ε = u − u'

ε_I and ε_A are defined as in Lemma 4. We get the same linearized model as A.7. As ε = 0 is stable, by assumption, the linearized model is also locally asymptotically stable. Therefore no eigenvalue of H_II or Λ_A can be in the closed right half plane. Therefore H_II, which is real and symmetric, has all its eigenvalues in the open left half plane. Therefore H_II is negative definite, which is precisely the second-order sufficient Kuhn-Tucker condition under the assumptions of the Lemma. This implies that u' is a local maximum. □
Acknowledgment

This work was partially supported by a grant under the Indo-US Project N00014-92-J-1324.
References

Aluffi-Pentini, F., Parisi, V., and Zirilli, F. 1985. Global optimization and stochastic differential equations. J. Opt. Theory Appl. 47, 1-26.
Barto, A. G., and Anandan, P. 1985. Pattern recognizing stochastic learning automata. IEEE Trans. Syst. Man Cybern. 15, 360-375.
Barto, A. G., Sutton, R. S., and Brouwer, P. S. 1981. Associative search network: A reinforcement learning associative memory. Biol. Cybern. 40, 201-211.
Chiang, T., Hwang, C., and Sheu, S. 1987. Diffusion for global optimization in R^n. SIAM J. Control Opt. 25, 737-753.
Gelfand, S. B., and Mitter, S. K. 1989. Simulated annealing with noisy or imprecise energy measurements. J. Opt. Theory Appl. 62, 49-62.
Gelfand, S. B., and Mitter, S. K. 1990. Recursive Stochastic Algorithms for Global Optimization in R^d. Center for Intelligent Control Systems, Report CICS-P-187, MIT, Cambridge, MA.
Geman, S., and Hwang, C. R. 1986. Diffusions for global optimization. SIAM J. Control Opt. 24, 1031-1043.
Hirsch, M., and Smale, S. 1974. Differential Equations, Dynamical Systems and Linear Algebra. Academic Press, New York.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. 1983. Optimization by simulated annealing. Science 220, 671-680.
Kushner, H. J. 1984. Approximation and Weak Convergence Methods for Random Processes, with Applications to Stochastic Systems Theory. MIT Press, Cambridge, MA.
McCormick, G. P. 1967. Second order conditions for constrained minima. SIAM J. Appl. Math. 15, 641-652.
Narendra, K. S., and Thathachar, M. A. L. 1974. Learning automata: A survey. IEEE Trans. Syst. Man Cybern. 4, 323-334.
Narendra, K. S., and Thathachar, M. A. L. 1989. Learning Automata: An Introduction. Prentice Hall, Englewood Cliffs, NJ.
Phansalkar, V. V. 1991. Learning Automata Algorithms for Connectionist Systems: Local and Global Convergence. Ph.D. thesis, Indian Institute of Science, India.
Sutton, R. S. 1992. (Guest Editor). Special issue on reinforcement learning. Machine Learn. 8.
Thathachar, M. A. L., and Phansalkar, V. V. 1995. Learning the global maximum with parametrised learning automata. IEEE Trans. Neural Networks 6, 398-406.
Watkins, C. J. C. H., and Dayan, P. 1992. Technical note: Q-learning. Machine Learn. 8, 279-292.
Williams, R. J. 1986. Reinforcement Learning in Connectionist Networks: A Mathematical Analysis. ICS Report 8605, Institute for Cognitive Science, University of California, San Diego.
Williams, R. J. 1988. Toward a Theory of Reinforcement Learning Connectionist Systems. Tech. Rep. NU-CCS-88-3, Northeastern University.
Williams, R. J. 1992. Simple statistical gradient following algorithms for connectionist reinforcement learning. Machine Learn. 8, 229-256.
Zangwill, W. 1969. Nonlinear Programming: A Unified Approach. Prentice Hall, Englewood Cliffs, NJ.
Received February 15, 1994; accepted December 5, 1994.
Communicated by Scott Fahlman
Initializing Weights of a Multilayer Perceptron Network by Using the Orthogonal Least Squares Algorithm

Mikko Lehtokangas, Jukka Saarinen, Kimmo Kaski
Tampere University of Technology, Microelectronics Laboratory, P.O. Box 692, FIN-33101 Tampere, Finland
Pentti Huuhtanen
University of Tampere, Department of Mathematical Sciences, P.O. Box 607, FIN-33101 Tampere, Finland
Usually the training of a multilayer perceptron network starts by initializing the network weights with small random values, and then the weight adjustment is carried out by using an iterative gradient descent-based optimization routine called backpropagation training. If the random initial weights happen to be far from a good solution or they are near a poor local optimum, the training will take a lot of time since many iteration steps are required. Furthermore, it is very possible that the network will not converge to an adequate solution at all. On the other hand, if the initial weights are close to a good solution the training will be much faster and the possibility of obtaining adequate convergence increases. In this paper a new method for initializing the weights is presented. The method is based on the orthogonal least squares algorithm. The simulation results obtained with the proposed initialization method show a considerable improvement in training compared to the randomly initialized networks. In light of practical experiments, the proposed method has proven to be fast and useful for initializing the network weights.

1 Introduction
The multilayer perceptron (MLP) network is one of the best known and most commonly used neural network models. Its weights are usually trained by using an iterative gradient descent-based optimization routine called the backpropagation (BP) algorithm (Rumelhart et al. 1986). The main drawback of backpropagation training is the slow and unreliable convergence in the training phase. Two major reasons for the poor training performance of this basic approach are the problem of determining optimal steps, i.e., size and direction in the weight space in consecutive

Neural Computation 7, 982-999 (1995) © 1995 Massachusetts Institute of Technology
iterations and the problem of weight initialization. It is apparent that the training speed and convergence can be improved by solving either one of these problems. Most studies have concentrated on optimizing the step size. This has resulted in many improved variations of the standard BP. The proposed methods include, for instance, the addition of a momentum term (Rumelhart et al. 1986), an adaptive learning rate (Jacobs 1988), and second-order algorithms (Fahlman 1988; Schiffmann et al. 1992; Pfister and Rojas 1993). Some of these BP variations have been shown to give quite impressive results in terms of convergence rate (Schiffmann et al. 1992). However, the improved training algorithms do not guarantee adequate convergence because of the initialization problem. If the initial weight values are poor, the training speed is bound to be slower even if improved BP algorithms are used. In the worst case the network may converge to a poor local optimum. Therefore, it is important to improve the initialization strategy as well as the training algorithms. A common way to handle weight initialization is to restart the training with new random initial values if the previous ones did not lead to adequate convergence (Schmidt et al. 1993). In many problems this approach can be too expensive to be an adequate strategy for practical usage, since the time required for training can increase to an unacceptable length. A simple and obvious nonrandom initialization strategy is to linearize the network and then calculate the initial weights by using linear regression. The network can be linearized by replacing the sigmoidal activation functions with their first-order Taylor approximations. This approach has been used, for instance, by Burrows and Niranjan (1993). The advantage of this approach is that if the problem is more linear than nonlinear, then most of the training is done before the iterative weight adjusting is even started.
However, if the problem is highly nonlinear this method does not perform any better than random initialization. Some other kinds of initialization procedures have been studied by Drago and Ridella (1992), Wessels and Barnard (1992), Kim (1993), and Li et al. (1993). In this study a new initialization method is proposed. The method is based on the orthogonal least squares (OLS) algorithm, which has been successfully used in training the radial basis function network (Chen et al. 1991). The proposed method concentrates only on calculating the initial weight values; the weight adjusting is done afterward by using the standard BP algorithm. This means that all improved BP-type training algorithms can be readily used to improve the convergence rate even further. The paper is organized as follows. In Section 2 the structure of the MLP network is described briefly. In Section 3 the OLS algorithm is presented. In Section 4 the MLP network is slightly modified so that the OLS method can be applied for the weight initialization. Other important details of the initialization method are also explained in that section.
In Section 5 the results of the practical simulations are presented. In the simulations the MLP network has been used to model some widely used benchmark problems and nonlinear time series. General discussion of the OLS method is presented in Section 6. Finally, the conclusions are presented in Section 7.

2 Multilayer Perceptron Network
We shall concentrate on initializing the weights of a three-layer perceptron network with a single output, as shown in Figure 1. The number of input neurons is p and the number of hidden neurons is q. The weights to be initialized are w_ij (between input and hidden neurons) and v_j (between hidden and output neurons). There are also the bias terms of the hidden and output neurons, which are denoted by w_0j and v_0. The activation function was chosen to be the hyperbolic tangent (tanh) function. Since the output neuron also has the tanh activation function, the modeled data must be scaled between -1 and 1 before this network configuration can be used. The mathematical formula for the network can be written as

y = \tanh\left( v_0 + \sum_{j=1}^{q} v_j \tanh\left( w_{0j} + \sum_{i=1}^{p} w_{ij} x_i \right) \right)   (2.1)
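As an illustration (not from the paper; a NumPy sketch with hypothetical names), the network function of equation 2.1 can be written as:

```python
import numpy as np

def mlp_forward(x, W, w0, v, v0):
    """Network function of equation 2.1: tanh hidden layer and tanh output.

    x  : (p,) input vector
    W  : (q, p) input-to-hidden weights w_ij
    w0 : (q,) hidden bias terms w_0j
    v  : (q,) hidden-to-output weights v_j
    v0 : scalar output bias v_0
    """
    h = np.tanh(w0 + W @ x)        # hidden-layer activations
    return np.tanh(v0 + v @ h)     # tanh output: values lie in (-1, 1)
```

Because the output unit is also a tanh, the network output is confined to (-1, 1), which is why the modeled data must be scaled into that range.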
The training of the network was done by using the standard backpropagation algorithm (Rumelhart et al. 1986), which minimizes the squared output error by using the weight update rule

\theta(n) = \theta(n-1) - \eta \frac{\partial E}{\partial \theta}   (2.2)

where \theta is a weight (w_ij or v_j), n is the step number, E is the output error, and \eta is the learning rate.

3 Orthogonal Least Squares Algorithm
In this study we have considered the MLP network as a regression model in which the hidden neurons are the regressors. In the weight initialization phase the problem is to choose the best available regressors; in other words, to select the hidden units with the best available initial weights. An efficient algorithm for the optimal regressor selection is the orthogonal least squares (OLS) algorithm, which has been successfully used in training the radial basis function network (Chen et al. 1991). The OLS algorithm concentrates on finding the most significant regressors for a regression model of the form

t^l = v_0 + \sum_{j=1}^{M} v_j R_j(l) + \varepsilon^l   (3.1)
Figure 1: Three-layer perceptron network with a single output.

where t^l is the desired output, v_j are the model parameters, and R_j(l) are known as the regressors, which are fixed functions of the input x^l, i.e., R_j(l) = R_j[x^l]. The error \varepsilon^l is assumed to be uncorrelated with the regressors R_j(l). The parameter M is the number of regressor candidates. Having many different regressor candidates, the problem is now to select the q most significant of them. An efficient solution to this problem is given by the OLS method, which is explained in the following. It is apparent that equation 3.1 can be written in matrix form as

t = Rv + E   (3.2)

where t = [t^1 t^2 ... t^n]^T, v = [v_0 v_1 v_2 ... v_M]^T, E = [\varepsilon^1 \varepsilon^2 ... \varepsilon^n]^T, and

R = \begin{bmatrix} 1 & R_1[x^1] & \cdots & R_M[x^1] \\ \vdots & \vdots & & \vdots \\ 1 & R_1[x^n] & \cdots & R_M[x^n] \end{bmatrix} = [r_1 r_2 \cdots r_{M+1}]   (3.3)
Now, the square of the projection Rv is the part of the desired output variance that can be accounted for by the regressors. Usually different regressors are correlated, which means that it is not clear how an individual regressor contributes to the output variance. Therefore the OLS method involves the transformation of the regressor set r_j into a set of orthogonal basis vectors; this makes it possible to calculate the individual contributions to the desired output variance. The regression matrix R can be decomposed into

R = HU   (3.4)
where U is an (M + 1) x (M + 1) upper triangular matrix with 1's on the diagonal,

U = \begin{bmatrix} 1 & \alpha_{12} & \cdots & \alpha_{1(M+1)} \\ 0 & 1 & \cdots & \alpha_{2(M+1)} \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \end{bmatrix}   (3.5)

and H is an n x (M + 1) matrix with orthogonal columns h_j such that

H^T H = B   (3.6)

The matrix B is diagonal, and its diagonal elements are

b_{jj} = h_j^T h_j,  j = 1, 2, ..., (M + 1)   (3.7)

The space spanned by the set of orthogonal basis vectors h_j is the same as the space spanned by the original regressors r_j. Therefore equation 3.2 can be rewritten as

t = Hg + E   (3.8)

where Uv = g. The least squares estimate for the new parameter vector g can be calculated from

g = (H^T H)^{-1} H^T t = B^{-1} H^T t   (3.9)

or element by element from

g_j = \frac{h_j^T t}{h_j^T h_j},  j = 1, 2, ..., (M + 1)   (3.10)

The orthogonal decomposition of equation 3.4 can be obtained by using, for instance, the Gram-Schmidt algorithm. The computational procedure of the Gram-Schmidt algorithm is as follows:
h_1 = r_1

\alpha_{ij} = \frac{h_i^T r_j}{h_i^T h_i},  i = 1, 2, ..., (j - 1)

h_j = r_j - \sum_{i=1}^{j-1} \alpha_{ij} h_i,  j = 2, 3, ..., (M + 1)   (3.11)

Since the regressors h_i and h_j are orthogonal for i \neq j, the sum of squares of t^l is

t^T t = \sum_{j=1}^{M+1} g_j^2 h_j^T h_j + E^T E   (3.12)
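The classical Gram-Schmidt procedure above can be sketched as follows (an illustrative NumPy implementation, not from the paper; it assumes R has full column rank):

```python
import numpy as np

def gram_schmidt(R):
    """Classical Gram-Schmidt orthogonalization of the columns of R (eq. 3.11).

    Returns H with mutually orthogonal columns and a unit upper triangular U
    such that R = H U (eq. 3.4). Assumes the columns of R are independent.
    """
    n, m = R.shape
    H = np.zeros((n, m))
    U = np.eye(m)
    for j in range(m):
        h = R[:, j].copy()
        for i in range(j):
            # alpha_ij = h_i^T r_j / (h_i^T h_i)
            U[i, j] = (H[:, i] @ R[:, j]) / (H[:, i] @ H[:, i])
            h -= U[i, j] * H[:, i]
        H[:, j] = h
    return H, U
```

By construction the j-th column of H U reproduces r_j exactly, and the columns of H are pairwise orthogonal up to rounding error.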
If t is the desired output vector after its mean has been removed, then its variance estimate is given by

\mathrm{var}(t) = \frac{1}{n} t^T t = \frac{1}{n} \sum_{j=1}^{M+1} g_j^2 h_j^T h_j + \frac{1}{n} E^T E   (3.13)
The sum term in equation 3.13 is the part of the desired output variance that can be explained by the regressors h_j. Thus each regressor has its unique contribution to the total sum, and the problem is to find the q regressors that have the largest contributions. At this point it is useful to define the error reduction ratio due to h_j as

[\mathrm{err}]_j = \frac{g_j^2 h_j^T h_j}{t^T t},  j = 1, 2, ..., (M + 1)   (3.14)

This ratio gives the relative contribution of regressor j to the whole variance. Now it is convenient to select the best regressors one by one. The practical procedure can be described loosely as follows:

1. Calculate the error reduction ratio for each of the original regressors (i.e., let h_j = r_j) and select the one with the largest ratio. Let the selected regressor be h_1 and drop it from among the r_j.

2. Use the remaining r_j regressors as candidates for obtaining h_2. Taking one regressor at a time, calculate an h_2 candidate and the corresponding error reduction ratio. Select the one with the largest ratio, calculate h_2 using it, and drop it from among the r_j.

3. Continue the one-by-one selection as in step 2 until the q best regressors have been selected.

The q best regressors are those that were dropped from among the r_j during the selection procedure. In the above algorithm the orthogonalization is done partially, such that only h_1, ..., h_q are calculated. Furthermore, h_1 corresponds to the best regressor and h_q to the qth best regressor. More details of this procedure can be found in Chen et al. (1991).
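The selection procedure above can be sketched as follows (an illustrative NumPy implementation, not from the paper; the function name and the rank tolerance are our own):

```python
import numpy as np

def ols_select(R, t, q):
    """Greedily select the q regressors (columns of R) with the largest
    error reduction ratios (eq. 3.14), using partial Gram-Schmidt
    orthogonalization as in Chen et al. (1991).

    Returns the indices of the selected columns, best first.
    """
    n, m = R.shape
    selected = []
    H = []                               # orthogonal vectors built so far
    tt = t @ t
    for _ in range(q):
        best_err, best_j, best_h = -np.inf, None, None
        for j in range(m):
            if j in selected:
                continue
            h = R[:, j].copy()
            for hk in H:                 # orthogonalize against chosen regressors
                h -= (hk @ R[:, j]) / (hk @ hk) * hk
            hh = h @ h
            if hh < 1e-12:               # (nearly) dependent candidate: skip
                continue
            g = (h @ t) / hh
            err = g * g * hh / tt        # error reduction ratio [err]_j
            if err > best_err:
                best_err, best_j, best_h = err, j, h
        selected.append(best_j)
        H.append(best_h)
    return selected
```

Note that only the q chosen directions are ever orthogonalized, mirroring the partial orthogonalization described in the text.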
4 Weight Initialization by Using the OLS Algorithm
To be able to apply the OLS algorithm we must first modify, in the weight initialization phase, the network model expressed by equation 2.1. By replacing the tanh function of the output unit by its first-order Taylor approximation we obtain

y = v_0 + \sum_{j=1}^{M} v_j \tanh\left( w_{0j} + \sum_{i=1}^{p} w_{ij} x_i \right)   (4.1)

Clearly, equation 4.1 is the same as equation 3.1 when we denote

R_j = \tanh\left( w_{0j} + \sum_{i=1}^{p} w_{ij} x_i \right)   (4.2)
where j = 1, ..., M and R_0 = 1. The relationship between the network output and the desired output is t = y + \varepsilon. One should note that in the initialization phase the number of hidden units is M, which should be significantly larger than the desired number of hidden units q. In this study we used M = 10q. As equation 4.2 shows, each of the M hidden units corresponds to one regressor, so we can now use the OLS algorithm to select the q best of them. Before the OLS algorithm can be used, we must somehow generate the M candidate hidden units. One simple way is to initialize the weights in the candidate regressors by using uniformly distributed random numbers. In the simulations, which will be presented later, we used random numbers from the interval [-4, 4]. If a regressor is selected by the OLS algorithm, then the initial weights of the selected regressor become initial values of the network. As can be seen, each regressor has p + 1 weights. Thus the number of inputs determines the dimension of the weight space formed by the regressors. It is quite obvious that the smaller the dimension of the weight space, the fewer degrees of freedom exist to initialize the regressors. This implies that the OLS approach is bound to work better when the given network has only a few inputs. After we have selected the q hidden units (or regressors), we have determined the initial values for the weights w_ij and w_0j. Now the initialization of the weights v_j and v_0 remains. Since the weights between the input and hidden layers have initialized values, we can calculate the outputs of the hidden neurons for each of the training patterns. Also, since we have linearized the output neuron (see equation 4.1), the network is completely linear after the hidden layer. Thus it is very simple to form a linear regression for the linear part of the network and initialize the weights v_j and v_0 with the regression coefficients.
This ends the initialization phase and the final weight adjustment is carried out by using the standard BP method. The initialization phase can be summarized as follows:
Phase 1. Linearize the activation function of the output neuron so that the resulting network takes the same form as equation 3.1. Let the network have M hidden units or regressors such that M >> q. Initialize the regressors, i.e., the weights w_ij and w_0j, with uniformly distributed random values. Select the q best regressors by using the OLS algorithm, and let the initial values of the selected regressors be the initial values for the network.

Phase 2. Calculate the outputs of the q previously selected hidden units for each training pattern. Form a linear regression for the linear part of the network, and let the obtained regression coefficients be the initial values for the weights v_j and v_0.
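Both phases can be sketched end to end as follows (an illustrative NumPy implementation, not from the paper; the names and the rank tolerance are our own, and the greedy OLS selection of Section 3 is inlined so that the sketch is self-contained):

```python
import numpy as np

def ols_initialize(X, t, q, M_factor=10, wscale=4.0, rng=None):
    """Two-phase weight initialization sketch: generate M = M_factor*q random
    candidate hidden units, keep the q best by greedy OLS selection, then set
    the output weights by linear regression on the linearized output unit.

    X : (n, p) training inputs; t : (n,) desired outputs scaled into (-1, 1).
    Returns (W, w0, v, v0) ready for backpropagation fine-tuning.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    M = M_factor * q
    # Phase 1: candidate regressors with weights uniform on [-wscale, wscale]
    Wc = rng.uniform(-wscale, wscale, size=(M, p))
    w0c = rng.uniform(-wscale, wscale, size=M)
    Rcand = np.tanh(w0c + X @ Wc.T)          # (n, M) candidate hidden outputs
    # Greedy OLS selection (partial Gram-Schmidt, largest error reduction first)
    sel, H = [], []
    for _ in range(q):
        best_err, best_j, best_h = -1.0, None, None
        for j in range(M):
            if j in sel:
                continue
            h = Rcand[:, j].copy()
            for hk in H:                     # orthogonalize against chosen ones
                h -= (hk @ Rcand[:, j]) / (hk @ hk) * hk
            hh = h @ h
            if hh < 1e-12:                   # (nearly) dependent candidate
                continue
            err = (h @ t) ** 2 / hh          # unnormalized error reduction
            if err > best_err:
                best_err, best_j, best_h = err, j, h
        sel.append(best_j)
        H.append(best_h)
    W, w0 = Wc[sel], w0c[sel]
    # Phase 2: linear regression for the output weights v_j and the bias v_0
    Hq = np.tanh(w0 + X @ W.T)
    A = np.column_stack([np.ones(n), Hq])
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    return W, w0, coef[1:], coef[0]
```

The returned weights are only a starting point; as described in the text, the final weight adjustment is still carried out with standard BP.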
5 Experiments
The proposed method has been tested by using the MLP network to model some widely used benchmark problems and nonlinear time series. In time series modeling the main aim is to construct a model that predicts the future value from the past and present values of the series. In the MLP scheme this means that the inputs are past and present observations of the series, i.e., x = [x_t x_{t-1} ... x_{t-p+1}], and the output y is the prediction of the future value x_{t+1}. Before the network can be used, we must define the number of input and hidden units. For the widely used benchmark problems we used the "standard" architectures. The network sizes for the time series problems were determined by using the predictive minimum description length (PMDL) principle (Rissanen 1989). The PMDL procedure searches for the best model structure among the available structures (Rissanen 1994). This means that the resulting model is optimal only compared to the other tested models; thus it is always possible to introduce a new untested structure that is more optimal according to the PMDL principle. The consistency of the PMDL method has been proven (Wax 1988; Wei 1992). In the following experiments the used model structures are expressed by the notation MLP(p, q). The effect of the OLS initialization is studied by using visually representative training curves. In other words, we plotted the normalized mean square error (NMSE) as a function of the training epochs. After an epoch, each of the training patterns has been applied once to the network. The NMSE is defined as

\mathrm{NMSE} = \frac{1}{n \sigma^2} \sum_{t=1}^{n} \left( \hat{x}_{t+1} - x_{t+1} \right)^2   (5.1)

where \sigma^2 is the variance of the desired outputs and n is the number of training patterns. For the binary problems one could argue whether to use the NMSE or a correctness-of-classification metric. In this work we chose to use the NMSE metric also in the binary problems, for the following reasons.
First, for the correctness-of-classification metric to be used, it is necessary to select some threshold value to classify the outputs as correct or incorrect; this would also cause arguments, as a proper threshold value may be problem dependent. Second, the stricter NMSE metric can always be used; after all, it is the sum-squared error that we try to minimize in the training phase. Moreover, the correctness of classification can be roughly estimated from the NMSE. For instance, consider the XOR problem, and assume that for three of the four patterns the network gives exactly the correct answer, while for the fourth pattern the output is 0.5 (exactly between 0 and 1). Then the NMSE is 0.25. Now, if we train another network and obtain an NMSE below 0.25, we can be absolutely sure that all patterns are classified correctly if our threshold value is 0.5; this is obvious if we compare the two situations. If the error is spread over all patterns, the situation is even clearer. The results
for the XOR problem (presented later) will show clearly that with OLS initialization all the trials gave an "all-correct" solution, whereas with random initialization there were solutions in which some of the outputs would have been classified as incorrect.

Since the OLS approach has some randomness, the training procedure was repeated 100 times by using different regressor candidates each time. Similarly, the comparative training with random initialization was also repeated 100 times with different initializations each time. In random initialization, uniformly distributed random numbers from the interval [-0.5, 0.5] were used. The plotted curves are the averages of the 100 repetitions. Also, we plotted upper and lower deviation curves in the same picture to see the variation between the worst and best training runs (or trials). The upper deviation curve was obtained as an average of those error values that were greater than the average curve, and the lower one is the average of those error values that were smaller than the average curve.

Experiment 1. The first experiment was the XOR problem. The network structure used was MLP(2,2). The problem was to train the network so that the output unit turns on if one or the other of the inputs is on, but not both. The results for this widely used benchmark problem are depicted in Figure 2. For this simple problem the results given by the OLS initialization method are quite impressive; it seems that the backpropagation training is needed only to fine-tune the network. The convergence without OLS initialization is quite poor considering the simplicity of this problem. Especially notable is the large deviation between the best and worst runs with the basic training scheme.

Experiment 2. The second problem was a generalized case of the XOR problem. Namely, the XOR problem can be regarded as a 2-bit parity problem. In an n-bit parity problem the output is to be one if an odd number of the n inputs are on.
In this experiment we tried to solve the 4-bit parity problem with the MLP(4,4) network. The results for this problem are depicted in Figure 3. In the basic training scheme there were many trials in which the network did not converge to any reasonable solution; the large deviation between the best and worst runs is also unacceptable. Only the best trials seemed to give a reasonable solution. In the OLS initialization scheme the network converged to a reasonable solution in all the cases, and the deviation between the best and worst trials is comparatively small.

Experiment 3. The third problem was another generalized case of the XOR problem. The XOR problem can also be regarded as a 2 x 2 sized chessboard: the two inputs are the coordinates of the squares on the chessboard, and for white squares the output is off while for black squares the output is on. In this experiment we trained the MLP(2,6) network on a 4 x 4 sized chessboard problem. The results are depicted in Figure 4. As can be seen, the basic network scheme is not able to
Figure 2: Training curves for the XOR problem. Solid lines are the average curves and dashed lines are upper and lower deviations, respectively. (a) Training curves with random initialization method. (b) Training curves with OLS initialization method.
learn the problem at all. With OLS initialization a reasonable result was obtained in all the runs.

Experiment 4. The well-known chaotic time series called the Henon map was used in the fourth experiment. It is defined as

x_{t+1} = \alpha_0 + \alpha_1 x_t^2 + \alpha_2 x_{t-1}   (5.2)

in which \alpha_0 = 1.0, \alpha_1 = -1.4, and \alpha_2 = 0.3. The initial values were x_{-1} = 1.0 and x_0 = 0.4. The model selection procedure suggested that a good network structure for this mapping problem is MLP(3,3). With this network structure we obtained the training results shown in Figure 5. Obviously the OLS weight initialization improves the training properties in this problem as well: first, the training starts at a lower error level; second, the convergence rate is better; and third, the deviation between the worst and best training runs is smaller.
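As an illustration (not from the paper; hypothetical helper names), the Henon series, the input/output pairs for an MLP(p, q) predictor, and the NMSE of equation 5.1 might be generated as:

```python
import numpy as np

def henon_series(n, a0=1.0, a1=-1.4, a2=0.3, x_init=(1.0, 0.4)):
    """Generate the Henon map x_{t+1} = a0 + a1*x_t^2 + a2*x_{t-1} (eq. 5.2)."""
    x = list(x_init)                      # x_{-1} = 1.0, x_0 = 0.4
    for _ in range(n):
        x.append(a0 + a1 * x[-1] ** 2 + a2 * x[-2])
    return np.array(x[2:])

def embed(series, p):
    """Form (inputs, targets): x = [x_t ... x_{t-p+1}] predicting x_{t+1}."""
    X = np.array([series[t - p + 1:t + 1][::-1]
                  for t in range(p - 1, len(series) - 1)])
    y = series[p:]
    return X, y

def nmse(pred, target):
    """Normalized mean square error of equation 5.1."""
    return np.mean((pred - target) ** 2) / np.var(target)
```

With p = 3 this yields the training pairs used by the MLP(3,3) model above; a trivial mean predictor scores an NMSE of 1.0, so useful models should score well below that.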
Figure 3: Training curves for the 4-bit parity problem. Solid lines are the average curves and dashed lines are upper and lower deviations, respectively. (a) Training curves with random initialization method. (b) Training curves with OLS initialization method.

Experiment 5. The fifth benchmark problem was a time series generated by the formula

x_{t+1} = \cos(\alpha x_t) + \beta x_{t-1} |x_{t-2}|   (5.3)

in which \alpha = -2.0 and \beta = 1.8. All the initial values were zeros for this series. The training simulations were performed using the predictive MDL optimum network MLP(3,4). The training curves obtained for the random and OLS initialization are depicted in Figure 6. The results are very similar to those of the fourth experiment. The biggest difference is that the worst runs of the random initialization are significantly worse than the worst runs of the OLS initialization, especially at the end of training.

Experiment 6. The sixth example was a time series with an even more complex nonlinear structure and some additive gaussian noise. The series is defined as
x_{t+1} = \cos(x_{t-1}) + \left[ \alpha_1 + \alpha_2 \exp(\alpha_3 x_t^2) \right] + \beta_1 x_t |x_{t-1}| + \beta_2 x_{t-1}^2 + \varepsilon_{t+1}   (5.4)
Figure 4: Training curves for the 4 x 4 sized chessboard problem. Solid lines are the average curves and dashed lines are upper and lower deviations, respectively. (a) Training curves with random initialization method. (b) Training curves with OLS initialization method.

in which \alpha_1 = -0.2, \alpha_2 = 6.0, \alpha_3 = -0.4, \beta_1 = -0.4, and \beta_2 = 0.6. The initial values were x_{-1} = 1.0 and x_0 = 0.5. The signal-to-noise ratio was \sigma_x / \sigma_\varepsilon = 20. In this case the PMDL optimum architecture was found to be MLP(6,7). With this architecture we obtained the training curves shown in Figure 7. In this case the results are not as good as in the previous examples, although the OLS method significantly lowers the upper deviation curve. However, there is a way to improve the results with the OLS method: instead of using 10q regressor candidates, we can increase that number. The increase in candidates means that we scan the weight space more densely before backpropagation training. This in turn increases the chance that some of the tested initial weights are near a good solution, so the risk of getting stuck in a poor local minimum is reduced. An additional experiment with 100q regressor candidates was made for this series. The result is depicted in Figure 8. Now a clear improvement can be seen in both the average and the upper deviation curves. Notable also is that the training starts at a significantly lower
Figure 5: Training curves for the Henon map. Solid lines are the average curves and dashed lines are upper and lower deviations, respectively. (a) Training curves with random initialization method. (b) Training curves with OLS initialization method.

error level than in the experiments shown in Figure 7. As the upper deviation curve is now lower than the average curve of the random initialization method, we can conclude that the risk of getting stuck in a poor local minimum has been substantially reduced. The increase in candidate regressors will, however, increase the computational effort needed for the OLS method. These computational efforts are discussed in the next section.

6 Discussion
In the previous section we illustrated the training speed in terms of training epochs. However, in the OLS scheme computational effort is also needed in the initialization. Here we compare the total training time of the two methods in terms of the total floating point operations needed to train the network. First the number of flops needed to train the network with the basic scheme was measured, and then the number of flops needed to train the network to the same error level with the OLS method.
Figure 6: Training curves for the second time series. Solid lines are the average curves and dashed lines are upper and lower deviations, respectively. (a) Training curves with random initialization method. (b) Training curves with OLS initialization method.
The speedup values presented in Table 1 are calculated from these two measures. In experiments 1-3 the error level after the OLS initialization was already below the error level obtained with the basic scheme, so in those cases the flops needed for the initialization were the measure for the OLS scheme. The speedup values in these three cases are also lower bounds, since it could be possible to obtain the same error level with the basic scheme if more and more training epochs were used. As the results show, the OLS initialization method was certainly useful in solving the benchmark problems presented in this work. The network learned the problems faster, and the possibility that the net would converge to a poor local minimum was significantly reduced. However, no guarantees are given that the OLS method will avoid poor local minima, or even reduce that risk, in all existing problems. Based on the given experiments, the OLS method is certainly a useful method for the initialization problem.
Figure 7: Training curves for the third time series. Solid lines are the average curves and dashed lines are upper and lower deviations, respectively. (a) Training curves with random initialization method. (b)Training curves with OLS initialization method. Table 1: Comparison of the Training Speed of the Basic Approach and OLS Approach in 'Terms of Total Floating Point Operations Needed to Train the Network."
Exp. 1
2 3 4 5 6
Name
Speedup ( x times)
XOR 4-bit parity 4 x 4 chessboard Henon map series time series noisy time series
> 40 > 40 > 20 2.3 5.0 2.4
"he values given indicate how many times faster the training is accomplished with the OLS scheme.
Figure 8: Training curves for the third time series with OLS initialization method. The number of candidate regressors was 100q.
7 Conclusions

In this study we proposed a new method for the initialization of the weights in an MLP network. The method is based on the OLS algorithm. The proposed method scans the weight space and selects the best of the tested points to be the initial values. The weight space scanning is performed prior to the backpropagation training. This can save a considerable amount of computational effort, since fewer training epochs are needed. Also, the experiments presented show that the risk of getting stuck in a poor local minimum can be significantly reduced with this method. Based on the given experiments, we conclude that the proposed method is potentially useful in training the MLP network. One should note that the proposed method works on networks with a single hidden layer and a single output. In future work one main aim is to generalize this method to networks with more than one hidden layer and more than one output.
Acknowledgments

This work has been supported by the Academy of Finland. The authors wish to thank the reviewers for their valuable comments on the manuscript.

References

Burrows, T. L., and Niranjan, M. 1993. The Use of Feed-Forward and Recurrent Neural Networks for System Identification. Tech. Rep. CUED/F-INFENG/TR158, Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, England.

Chen, S., Cowan, C. F. N., and Grant, P. M. 1991. Orthogonal least squares learning algorithm for radial basis function networks. IEEE Trans. Neural Networks 2(2), 302-309.

Drago, G. P., and Ridella, S. 1992. Statistically controlled activation weight initialization (SCAWI). IEEE Trans. Neural Networks 3(4), 627-631.

Fahlman, S. E. 1988. An Empirical Study of Learning Speed in Backpropagation Networks. Tech. Rep. CMU-CS-88-162, Carnegie Mellon University.

Jacobs, R. A. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks 1, 295-307.

Kim, L. S. 1993. Initializing weights to a hidden layer of a multilayer neural network by linear programming. Proc. Int. Joint Conf. Neural Networks, IJCNN-93, 2, 1701-1704.

Li, G., Alnuweiri, H., Wu, Y., and Li, H. 1993. Acceleration of backpropagation through initial weight pre-training with delta rule. Proc. IEEE Int. Conf. Neural Networks, ICNN-93, 1, 580-585.

Pfister, M., and Rojas, R. 1993. Speeding-up backpropagation: A comparison of orthogonal techniques. Proc. Int. Joint Conf. Neural Networks, IJCNN-93, 1, 517-523.

Rissanen, J. 1989. Stochastic Complexity in Statistical Inquiry, Series in Computer Science, Vol. 15. World Scientific Publishing Co., Singapore.

Rissanen, J. 1994. Information theory and neural nets. In Mathematical Perspectives on Neural Networks, P. Smolensky, M. Mozer, and D. Rumelhart, eds. Lawrence Erlbaum, Hillsdale, NJ.

Rumelhart, D., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., Ch. 8, pp. 318-362. MIT Press, Cambridge, MA.

Schiffmann, W., Joost, M., and Werner, R. 1992. Optimization of the Backpropagation Algorithm for Training Multilayer Perceptrons. Tech. Rep., University of Koblenz, Institute of Physics, Rheinau 3-4, W-5400 Koblenz.

Schmidt, W. F., Raudys, S., Kraaijveld, M. A., Skurikhina, M., and Duin, R. P. W. 1993. Initializations, back-propagation and generalization of feed-forward classifiers. Proc. IEEE Int. Conf. Neural Networks, ICNN-93, 1, 598-604.
Wax, M. 1988. Order selection for AR models by predictive least squares. IEEE Trans. Acoust. Speech Signal Process. 36(4), 581-588.

Wei, C. 1992. On predictive least squares principles. Ann. Stat. 20(1), 1-42.

Wessels, L. F. A., and Barnard, E. 1992. Avoiding false local minima by proper initialization of connections. IEEE Trans. Neural Networks 3(6), 899-905.
Received March 15, 1994; accepted November 23, 1994.
Communicated by Federico Girosi
Learning and Generalization in Radial Basis Function Networks J. A. S. Freeman D. Saad Department of Physics, University of Edinburgh, Edinburgh EH9 3JZ, United Kingdom
The two-layer radial basis function network, with fixed centers of the basis functions, is analyzed within a stochastic training paradigm. Various definitions of generalization error are considered, and two such definitions are employed in deriving generic learning curves and generalization properties, both with and without a weight decay term. The generalization error is shown analytically to be related to the evidence and, via the evidence, to the prediction error and free energy. The generalization behavior is explored; the generic learning curve is found to be inversely proportional to the number of training pairs presented. Optimization of training is considered by minimizing the generalization error with respect to the free parameters of the training algorithms. Finally, the effect of the joint activations between hidden-layer units is examined and shown to speed training.

1 Introduction
Within the context of supervised learning in neural networks, one is primarily interested in minimizing the average deviation of the actual network output from the desired output over the entire space of possible inputs. This quantity is not directly available within the paradigm of learning from a training set, and so is usually estimated with some approximation scheme, such as the mean sum-squared error on a set of test points that were not employed during training. Generalization error can be investigated analytically by making an assumption concerning the process that generated the training set. One can then analyze properties of learning in the typical case, such as the decay rate of the generalization error with the number of training patterns and the optimal settings of parameters controlling the training algorithm. Several methods exist that facilitate such investigations, such as the VC and PAC frameworks (Vapnik and Chervonenkis 1971; Haussler 1994) and the statistical mechanics framework (see Watkin et al. 1993, for a review). This paper utilizes a Bayesian approach in which a probability distribution is constructed over the weight space of the network. Similar approaches can be found in MacKay (1992) and Bruce and Saad (1994). To date, such analytic investigations of generalization error have primarily focused on the one-layer perceptron, either in boolean or linear form, and on simple extensions of this, such as the committee machine (see, for instance, Schwarze 1993), as these architectures are analytically tractable, unlike the general multilayer perceptron. This paper calculates generalization error for a more complicated network: the two-layer radial basis function network (RBF). The RBF is representationally powerful, being a universal approximator for continuous functions in the limit of an infinite number of hidden units (Hartman et al. 1990). It has been successfully employed in a number of applications, including chaotic time-series prediction (Casdagli 1989), speech recognition (Niranjan and Fallside 1990), and data classification (Musavi et al. 1992). Generalization error for the RBF has been considered both analytically and empirically to some extent: Niyogi and Girosi (1994) derive a bound under the assumption that the training algorithm always finds a globally optimal solution, but require only weak constraints on the function that generated the training set; they do not consider regularization. This paper also contains an extensive bibliography pertaining to the topic of generalization. Botros and Atkeson (1991) compare the performance of various choices for the basis functions. Further afield, bounds have been derived for the case in which the hidden layer consists of units with sigmoidal transfer functions (Barron 1993, 1994).

Neural Computation 7, 1000-1020 (1995) © 1995 Massachusetts Institute of Technology
The typical training methodology employed for the RBF is to fix the parameters of the first layer utilizing some algorithm to ensure that the positions of the training data in input space are adequately represented by the basis functions, and then either to solve a system of linear equations or to use some training algorithm such as gradient descent to set the parameters of the second layer. Training is computationally inexpensive as compared to multilayer perceptrons. This paper first presents a detailed specification of the RBF model to be analyzed. Various definitions of generalization error are then considered, and two such definitions selected for the analysis. The expressions for generalization error are derived and linked to the evidence; finally the behavior of the network is examined from several perspectives.

2 The Radial Basis Function Network
The RBF architecture consists of a two-layer fully connected network (see Fig. 1), with an input layer that performs no computation. With no loss of generality, only a single output node is utilized in the analysis. Each hidden node is parameterized by two quantities: a center m in input space, corresponding to the vector defined by the weights between the node and the input nodes, and a width σ_b². These parameters are assumed to be fixed by a suitable process, such as a clustering algorithm or maximizing the likelihood of the parameters with respect to the training data.

Figure 1: RBF network architecture. [Input nodes 1-N, hidden nodes 1-H, one output node.]

The activation function of the hidden nodes is radially symmetric in input space; the magnitude of the activation given a particular datapoint is a decreasing function of the distance between the input vector of the datapoint and the center of the basis function. The distance metric employed is Euclidean. The role of the hidden units is to perform a nonlinear transformation of the input space into the space of activations of the hidden units; it is this transformation that gives the RBF a much greater representational power than the linear perceptron. The output layer computes a linear combination of the activations of the basis functions, parameterized by the weights w between hidden and output layers. Within this model, the basis functions will be taken as gaussian; each hidden node will have a width σ_b² corresponding to the variance of the gaussian. The overall function computed by the network is therefore

f(x, w) = Σ_{b=1}^{H} w_b exp( −‖x − m_b‖² / 2σ_b² )    (2.1)
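The mapping of equation 2.1 can be sketched directly; this is an illustrative implementation rather than the authors' code, assuming gaussian basis functions with a common width σ_b (all names are ours):

```python
import numpy as np

def rbf_forward(x, centers, width, w):
    """Output of a two-layer RBF network: a linear combination of
    gaussian basis activations, as in equation 2.1."""
    # squared Euclidean distance from the input to each center m_b
    d2 = np.sum((centers - x) ** 2, axis=1)
    # radially symmetric gaussian activations of the H hidden nodes
    phi = np.exp(-d2 / (2.0 * width ** 2))
    # linear output layer, parameterized by the weights w
    return w @ phi
```

The hidden layer performs the nonlinear transformation into activation space; the output node is purely linear in w, which is what makes the later Bayesian analysis tractable.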
The training data D will be taken to consist of P input-output pairs indexed 1…P: (x_p, y_p); the data will be assumed to be generated by a teacher RBF and corrupted under some noise process, with the input points being drawn from a symmetric gaussian distribution of variance σ_ξ². The centers of the teacher will be taken to be identical to those of the student and to possess an identical width parameter σ_b². The fact that student and teacher centers are identical implies that the function to be learned is exactly realizable. In the terminology of learning theory, this means that the approximation error is zero; the generalization error is equivalent to the estimation error (see Niyogi and Girosi 1994, for an overview). The training algorithm for the weights that impinge on the student output node will be considered stochastic in nature; this requires that an expression for the probability of a student weight vector given the training data and training algorithm parameters be defined. Modeling the noise process as zero-mean additive gaussian noise leads to the following form for the probability of the dataset given the weights and training algorithm parameters:¹
P(D | w, γ, β) = (1/Z_D) exp(−βE_D)    (2.2)

where E_D = (1/2) Σ_p [y_p − f(x_p, w)]² is the sum-squared training error and Z_D = (2π/β)^{P/2}. This form resembles a Gibbs distribution over student space; it also corresponds to imposing the constraint that minimization of the training error is equivalent to maximizing the likelihood of the data (Levin et al. 1989). This distribution can be realized practically by employing the Langevin training algorithm, which is simply the gradient descent algorithm with an appropriate noise term added to the weights at each update (Rognvaldsson 1994). Furthermore, it has been shown that gradient descent, considered as a stochastic process due to random order of presentation of the training data, solves a Fokker-Planck equation for which the stationary distribution can be approximated by a Gibbs distribution (Radons et al. 1990). To prevent overdependence of the distribution of student weight vectors on the details of the noise, it is necessary to introduce a regularizing factor, which can be viewed as a prior distribution over student space:
P(w | γ) = (1/Z_W) exp(−γE_W)    (2.3)

where E_W is a penalty term based, for instance, on the magnitude of the student weight vector² and Z_W = ∫_w dw exp(−γE_W).

¹Note that, strictly, P(D | w, γ, β) should be written P[(y₁,…,y_P) | (x₁,…,x_P), w, γ, β], as it is desired to predict the output terms from the input terms, rather than both jointly.
²Note that for the ubiquitous penalty term E_W = (1/2)‖w‖², Z_W = (2π/γ)^{H/2}.
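The Langevin realization mentioned above can be sketched as plain gradient descent on βE_D + γE_W with gaussian noise of variance 2η injected at each update, whose stationary distribution is proportional to exp(−βE_D − γE_W). A hypothetical sketch for the linear output layer (names and step-size handling are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_step(w, phi, y, beta, gamma, eta):
    """One Langevin update for the output weights: a gradient step on
    beta*E_D + gamma*E_W plus gaussian noise of variance 2*eta, whose
    stationary distribution is the Gibbs form exp(-beta*E_D - gamma*E_W)."""
    resid = phi @ w - y                      # f(x_p, w) - y_p for each pattern
    grad = beta * phi.T @ resid + gamma * w  # gradient of beta*E_D + gamma*E_W
    return w - eta * grad + np.sqrt(2.0 * eta) * rng.normal(size=w.shape)
```

Iterating this step and discarding a burn-in period yields samples from the Gibbs distribution over student space, up to discretization error in η.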
Employing Bayes' theorem, one can derive an expression for the probability of a student weight vector given the training data and training algorithm parameters:

P(w | D, γ, β) = exp(−βE_D − γE_W) / Z    (2.4)

Here, Z = ∫_w dw exp(−βE_D − γE_W) is the partition function over student space. The quantity P(D | γ, β) has been termed the evidence for dataset D given the training algorithm parameters (MacKay 1992). It is proportional to the partition function, and thus closely related to the free energy, F = −(1/β) log Z, an important quantity in the statistical mechanics framework (see, for instance, Hertz et al. 1989). It is of interest to relate analytically the evidence to generalization error, as certain conjectures concerning this relation have been made on intuitive grounds (MacKay 1992).

3 Generalization Error
There are several approaches that can be taken in defining generalization error. The most prominent class of definitions focuses on the expectation of the difference between the desired network output and the actual output, as measured by some appropriate error measure, taken over the entire input space. The square of the difference between desired and actual output is the typical error measure employed, which for a particular student network gives

E = ∫_x dx P(x) [f(x, w⁰) − f(x, w)]²    (3.1)

where w⁰ is the weight vector of the teacher.³ From a practical viewpoint, one only has access to the empirical risk, or training error, (1/P) Σ_p [y_p − f(x_p, w)]². This quantity is an approximation to the expected risk, defined as the expectation of [y − f(x, w)]² with respect to the joint distribution P(x, y). With an additive noise model, the expected risk simply decomposes to E + σ_η², where σ_η² is the variance of the noise. Some authors equate the expected risk with generalization error by considering the squared difference between the noisy teacher and the student (see, for instance, Hansen 1993). A more detailed discussion of these quantities can be found in Niyogi and Girosi (1994).

³This definition is equivalent to the distance in the L₂(P) norm between f(x, w⁰) and f(x, w), where L₂(P) is the set of square-integrable functions with respect to the measure defined by P.
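Equation 3.1 can be estimated by Monte Carlo over the gaussian input distribution; a sketch (function names are ours):

```python
import numpy as np

def generalization_error_mc(f_teacher, f_student, sigma_xi, N,
                            n_samples=100000, seed=0):
    """Monte Carlo estimate of equation 3.1: the expected squared
    difference between teacher and student outputs, with inputs drawn
    from the symmetric gaussian input distribution P(x)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(scale=sigma_xi, size=(n_samples, N))
    diff = np.array([f_teacher(xi) - f_student(xi) for xi in x])
    return np.mean(diff ** 2)
```

In practice one cannot query the teacher, of course; this estimator is only a tool for checking the analytic expressions derived below against simulation.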
If a stochastic training algorithm is employed, such as the Langevin variant of gradient descent described previously, giving some probability distribution over weight space conditioned on the training data, there are two possibilities for the generalization error. If, as is usually the case practically, the algorithm selects a single weight vector from the ensemble, a procedure that here will be termed Gibbs learning, then equation 3.1 becomes⁴

E_G = ∫_x dx P(x) ∫_w dw P(w | D, γ, β) [f(x, w⁰) − f(x, w)]²    (3.2)
A second possibility arises from considering a Bayes-optimal approach. This requires one to take the expectation of the estimate of the network, which is impractical due to the computation involved, but can be approximated by performing a succession of training runs:

E_B = ∫_x dx P(x) [ f(x, w⁰) − ∫_w dw P(w | D, γ, β) f(x, w) ]²    (3.3)
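The distinction between Gibbs learning and the Bayes-optimal approach can be illustrated numerically: E_G averages the error of individual students, while E_B measures the error of the averaged output. A sketch, with a list of trained students standing in for weight vectors drawn from a succession of training runs (names are ours):

```python
import numpy as np

def gibbs_and_bayes_errors(teacher, students, xs):
    """Estimate E_G (mean error of a single sampled student) and E_B
    (error of the ensemble-averaged output) on a set of test inputs.
    Their difference is the variance of the student outputs, so
    E_G >= E_B always holds."""
    t = np.array([teacher(x) for x in xs])
    outs = np.array([[f(x) for x in xs] for f in students])  # runs x points
    e_g = np.mean((outs - t) ** 2)                # error first, then average
    e_b = np.mean((outs.mean(axis=0) - t) ** 2)   # average output first
    return e_g, e_b
```

Two students that err in opposite directions give a nonzero E_G but can give E_B = 0, which is the intuition behind the Bayes generalizer's advantage.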
These two quantities are related by

E_G = E_B + ∫_x dx P(x) { ∫_w dw P(w | D, γ, β) f(x, w)² − [ ∫_w dw P(w | D, γ, β) f(x, w) ]² }    (3.4)

so that E_G exceeds E_B by the variance of the student output under the ensemble.
To investigate the generic performance of the architecture, it is desirable to eliminate the dependence of generalization error on the particular dataset used. An average over possible datasets, denoted by ⟨⟨…⟩⟩, will be utilized for this purpose; the average is taken with additive gaussian noise η on the data.
An alternative measure of generalization performance is a quantity known as prediction error (Levin et al. 1989), E_P = −log P(y | x, D), which is derived from the probability of the network correctly predicting a data point drawn from a known probability distribution. Prediction error is closely linked to both the free energy F and the evidence.

⁴It is worth noting that by taking β/γ → ∞, the distribution of student weight vectors becomes a delta function centered on the weight vector that minimizes the empirical risk. This situation is commonly considered in the computational learning theory literature, but is unrealistic for neural networks, where often in practice only locally optimal solutions are found.
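Because the network output is linear in the second-layer weights, the Gibbs distribution of equation 2.4 is gaussian, so the posterior mean and covariance and the partition function Z (and hence the evidence and free energy F = −(1/β) log Z) are available in closed form. A sketch under that reading, taking E_W = (1/2)‖w‖² (function names are ours):

```python
import numpy as np

def gibbs_posterior(phi, y, beta, gamma):
    """Closed-form Gibbs posterior over output weights for a model linear
    in w: exp(-beta*E_D - gamma*E_W) is gaussian, with E_D the
    sum-squared training error and E_W = 0.5*||w||^2."""
    H = phi.shape[1]
    S = beta * phi.T @ phi + gamma * np.eye(H)   # posterior precision
    cov = np.linalg.inv(S)
    mean = beta * cov @ phi.T @ y                # posterior mean
    # log Z = log \int dw exp(-beta*E_D - gamma*E_W), by completing the square
    logZ = (0.5 * H * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(S)[1]
            - 0.5 * beta * y @ y + 0.5 * mean @ S @ mean)
    return mean, cov, logZ
```

With log Z in hand, the free energy and (up to the fixed normalizers Z_D and Z_W) the evidence follow immediately, which is what makes the analytic link to generalization error possible in this model.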
4 Calculation of Generalization Error
The calculation of generalization error will focus on both E_G and E_B; a link to prediction error is developed via an analytic relation between E_G and the evidence. Recalling that the teacher centers are equal in number and position to those of the student, and signifying the difference between student and teacher weight vectors, w − w⁰, by w*, the definition of E_G becomes

E_G = ∫_x dx P(x) ∫_w dw P(w | D, γ, β) [ Σ_b w*_b exp(−‖x − m_b‖²/2σ_b²) ]²    (4.1)

Since the input vectors are drawn from a symmetric gaussian distribution with mean 0, variance σ_ξ², on performing the integral over input space one obtains

⟨⟨E_G⟩⟩ = ⟨⟨ ∫_w dw P(w | D, γ, β) Σ_{bc} w*_b w*_c G_{bc} ⟩⟩    (4.2)

where

G_{bc} = (2σ_ξ²/σ_b² + 1)^{−N/2} exp[ −‖m_b − m_c‖²/(4σ_b²) − ‖m_b + m_c‖²/(4(σ_b² + 2σ_ξ²)) ]

with m_b, m_c referencing the centers. Employing the definition of P(w | D, γ, β) as in equation 2.4, taking E_D as the sum-squared training error with η_p as the noise on training example p, and defining E_W = (1/2)‖w‖² as the prior over weight space allows equation 4.2 to be rewritten as a ratio of gaussian integrals over student space whose quadratic form is given by a matrix A⁻¹ (equations 4.3 and 4.4).
Now, by taking the derivative of the numerator of equation 4.3 with respect to the elements of the matrix A⁻¹, ⟨⟨E_G⟩⟩ can be expressed in terms of the partition function Z. Recalling that the evidence is proportional to Z, one can immediately relate the evidence to the generalization error (equation 4.5). At this point it is also possible to relate generalization error to prediction error: the relationship between prediction error and evidence is relatively simple to derive (equation 4.6), and employing it in equation 4.5 links ⟨⟨E_G⟩⟩ directly to the prediction error (equation 4.7). Returning to the derivation of E_G, calculating the evidence and performing the partial derivatives of equation 4.5 leads to an expression for ⟨⟨E_G⟩⟩ as a trace over the matrices G and A (equation 4.8).
It remains to consider the average ⟨⟨…⟩⟩ over datasets and the gaussian noise on the datasets. Performing the noise average, recalling that only the training outputs contain noise terms, yields equation 4.9.
To progress further and perform the dataset average, it is necessary to know the form of A. To this end, it will be assumed that A⁻¹ is of the form

        | Θ  Φ  ⋯  Φ |
A⁻¹ =   | Φ  Θ  ⋯  Φ |    (4.10)
        | ⋮  ⋮  ⋱  ⋮ |
        | Φ  Φ  ⋯  Θ |

That is, all diagonal entries are equal to Θ, and all off-diagonal entries are equal to Φ. This induces A to take on the form:
A has the same structure, with diagonal entries Θ̃ and off-diagonal entries Φ̃ given by

Θ̃ = [Θ + Φ(H − 2)] / [(Θ − Φ)(Θ + Φ(H − 1))],   Φ̃ = −Φ / [(Θ − Φ)(Θ + Φ(H − 1))]    (4.11)
The implications of this assumption for the RBF model are twofold: first, the equality of diagonal entries corresponds to all the centers receiving an equal amount of activation via the training data.⁵ For the particular case of a symmetric input distribution centered at the origin of input space, this assumption breaks down only when the centers are dissimilar in distance from the origin and the variance of the input distribution is not of sufficient magnitude for the distribution to be approximately uniform in the regions covered by the basis functions. Second, the equality of off-diagonal entries requires each pair of basis functions to receive a similar joint activation via the training data. This assumption is satisfied except when the centers are not approximately equidistant from each other and the spread of the basis functions is not sufficient to allow considerable overlap between each pair of receptive fields.

⁵The authors thank an anonymous referee for pointing out that a common procedure for selecting basis function parameters is to maximize the likelihood of the inputs of the training data under a mixture model given by a linear combination of the basis functions; constraining the priors of the mixture model to be equal encourages this property of equal activation to be satisfied.
Unfortunately, this selection of form for A⁻¹ is not sufficient to allow the dataset average to be carried out, as the x_p do not separate into independent factors. One can approximate A⁻¹ as

A⁻¹ ≈ ⟨A⁻¹⟩_{x_p}    (4.12)

where ⟨…⟩_{x_p} denotes an average over datasets. Utilizing the central limit theorem, the neglected variance in the distribution of (1/P) Σ_p φ_b(x_p) φ_c(x_p) decreases as 1/P. Note that this implies that the calculation of generalization error holds strictly only in the asymptotic regime of large P, but it will be shown via simulations that the results are a good approximation for nonasymptotic P. The integral over datasets can now be performed as a straightforward gaussian, yielding the final expression for generalization error (equation 4.13).
For notational convenience, a matrix Λ constructed from β²σ_η²P and G is introduced in equation 4.13. From this, via equation 3.4, one can calculate ⟨⟨E_B⟩⟩ ∝ (1/P²) tr(ΛGΛᵀ) plus a term quadratic in the teacher weights (equation 4.14).
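The dataset-dependent matrix A can be evaluated numerically. Here is a sketch, assuming A⁻¹ is the posterior precision γI + β Σ_p φ(x_p)φ(x_p)ᵀ over the training inputs (our reading of equations 4.3-4.4), under which the diagonal/off-diagonal structure assumed in equation 4.10 can be checked for a given arrangement of centers:

```python
import numpy as np

def precision_matrix(X, centers, width, beta, gamma):
    """Numerically evaluate A^{-1}, taken here (our assumption) as
    gamma*I + beta * sum_p phi(x_p) phi(x_p)^T over training inputs X.
    For centers equidistant from the origin and from each other, the
    diagonal entries should cluster around one value (Theta) and the
    off-diagonal entries around another (Phi)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # P x H
    phi = np.exp(-d2 / (2.0 * width ** 2))
    return gamma * np.eye(centers.shape[0]) + beta * phi.T @ phi
```

With centers placed asymmetrically, the spread of the diagonal (or off-diagonal) entries grows, which is exactly the violation scenario examined in the simulations.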
To examine the validity of the assumptions for A⁻¹, simulations were conducted in which the empirical value of E_G was calculated via equation 4.9 by generating random training data and numerically evaluating A. The simulations were carried out for three scenarios: first, the case in which the conditions for the assumed form of A⁻¹ were exactly satisfied; second, certain basis functions receiving an impoverished supply of training data, thus violating the equality of diagonal entries; finally, the interactions between different pairs of basis functions being unequal, which violates the equality of off-diagonal entries.⁶ Comparisons of the mean values of E_G found by simulation, E_G^SIM, with those found analytically via equation 4.13 are shown in Figure 2. Note that the variances of the simulation distributions quickly become negligible. When the assumptions are satisfied, E_G^SIM rapidly converges to E_G. Violation of the assumption of diagonal equality gives rise to a systematic error, while violation of the off-diagonal assumption causes the convergence to slow, but introduces negligible systematic error. This lack of significant effect is explicable by an examination of the definition of G: the result of introducing differing interactions between the basis functions is simply to vary ‖m_b + m_c‖; the effect of this will always be overwhelmed by that of other terms, particularly if the ratio of σ_ξ² to σ_b² is large. It can be concluded, therefore, that the calculation of generalization error is invalid only for the cases in which P is near to 0 or in which the basis functions receive significantly different levels of activation via the training data.

⁶Each simulation was run 50 times with the following parameter settings (denoting the angle between m_b and m_c as θ_{b,c}). Common to all simulations: N = 3, H = 4, σ_ξ² = 1, β = 0.5, γ = 1, σ_b² = 2, σ_η² = 1. Assumptions satisfied: ∀b: ‖m_b‖ = 1; ∀b, c, b ≠ c: θ_{b,c} = 2π/3. Diagonal violation: ‖m₁‖ = ‖m₂‖ = 1, ‖m₃‖ = ‖m₄‖ = 4, ∀b, c, b ≠ c: θ_{b,c} = 2π/3. Off-diagonal violation: ∀b: ‖m_b‖ = 1, θ_{1,2} = θ_{3,4} = π/6, θ_{1,4} = θ_{2,3} = π.

5 Analysis of Generalization Error
The equations derived for E_G and E_B do not admit a straightforward intuitive understanding of the effect of varying parameters such as the number of training patterns, the noise level, and the training parameters γ and β. To promote such an understanding, the behavior of the expressions for generalization error will initially be examined under simplifying limiting conditions.

5.1 Noiseless Training Data. Taking the σ_η² → 0 limit while treating β as a free parameter leads to the conclusion that, for both E_G and E_B, optimal training occurs when β → ∞ (see Fig. 3). This is intuitively plausible; if the training data are not noisy then no training error should be tolerated, so forcing the distribution over student space to become a delta function centered on the value of w that sets the error to zero is reasonable. Note that in the β → ∞ limit, the prior on student space becomes irrelevant.

5.2 No Weight Decay: the γ → 0 Limit. Considering the γ → 0 limit allows one to analyze the dependence of E_G and E_B on the number of training examples, P. The assumption of the diagonal versus off-diagonal form for A⁻¹ induces a similar form on the matrix G; referencing the diagonal and off-diagonal elements of G by G_D and G_O, respectively, and defining the matrix Ω, which is both P and β independent, one obtains equations 5.1 and 5.2.
Figure 2: Analytic E_G (unbroken line) versus mean of E_G^SIM (dashed line), examining the validity of the assumption of form for A⁻¹ under various distributions of the centers of the basis functions. The error bars are plotted at 1 standard deviation of the simulation mean. (a) Assumptions satisfied. (b) Violation of diagonal assumption.
Figure 2: Continued. (c) Violation of off-diagonal assumption.

It is apparent that both E_G and E_B are inversely proportional to the number of training examples. This result is somewhat similar to that found for the linear perceptron in this limit, whereby E_G and E_B are inversely proportional to P − N − 1 (Hansen 1993; Bruce and Saad 1994). In addition, the γ → 0 limit brings to light an interesting difference between E_G and E_B. Examining E_B, it is apparent that β plays no role; the expression is independent of the error sensitivity. This result is in contrast to that for E_G, in which the first term is minimized by taking β → ∞. This hints that, in the Bayes generalizer, it is only the ratio of γ to β that is important, as is the case for the linear perceptron (Bruce and Saad 1994), while the Gibbs generalizer is dependent on both β and γ separately. This discrepancy is explicated by recalling equation 3.4; E_G consists of a variance term, minimized by taking β → ∞, and a term identical to E_B. Both E_G and E_B are independent of N, the dimensionality of input space, in this limit.

5.3 The General Case: Noise and Weight Decay. To gain some understanding of the variation of E_G and E_B with P, γ, and β in the general case, consider Figures 4, 5, 6, and 7. Examining first Figure 4, in which E_B is plotted against P and β for a constant value of γ, it is apparent that there is a minimum in the
Figure 3: E_G as a function of number of examples P and error sensitivity β for σ_η² → 0.
generalization error surface at a constant value of β. When γ is set to its optimal value, the value of β at the minimum can be shown empirically to be inversely proportional to the variance of the noise, σ_η². Similarly, plotting E_B against P and γ (Fig. 5) demonstrates a minimum in the generalization error surface at a constant value of γ. This minimum, for β set to an optimal value, is a function of both ‖w⁰‖² and Σ_{bc} w⁰_b w⁰_c. An entirely different pattern of results emerges for E_G. Considering Figure 6, the optimal value of β rapidly becomes infinite as P increases. This discrepancy is due to the fact that the Gibbs generalizer requires the selection of a single weight vector from the ensemble of students, so it is advantageous to penalize any training error maximally once a reasonable amount of training data is available. The Bayes generalizer, on the other hand, employs a weighted average of students to make a prediction; noise on the training data output values can to some extent be compensated for by this average, and so it is not desirable to force the ensemble to become a delta function. Focusing on E_G as a function of P and γ (Fig. 7), an analogous result is apparent: the optimal value of
Figure 4: Generalization error E_B as a function of number of examples P and error sensitivity β. The minimum in E_B with respect to β is independent of P.
γ is initially infinite, but as P → ∞, the optimal value of γ tends to an expression similar in dependence to that for E_B.
5.4 Analytic Determination of Optimal Parameters. It is not possible to find closed-form analytic expressions for the optimal settings of β and γ for either E_G or E_B generally, but for the case in which there is no interaction between the basis functions, as may occur when the variance of the input distribution is large compared to the width of the basis functions, such expressions can be obtained; these can then be elaborated upon to some extent to suggest the form of the actual dependencies of β_opt and γ_opt. For the Bayes-optimal generalizer, by minimizing ⟨⟨E_B⟩⟩ with respect to the training parameters, the optimal settings were determined to be those of equations 5.3 and 5.4.
Figure 5: Generalization error E_B as a function of number of examples P and weight decay parameter γ. The minimum in E_B with respect to γ is independent of P.
The form of equations 5.3 and 5.4 proves that only the ratio of γ to β, 2Hσ_η²/‖w⁰‖², determines whether the parameter settings are optimal. For the Gibbs generalizer the expressions for optimal parameters (equations 5.5 and 5.6) are a little more complicated. Under this assumption of no interactions between the basis functions, the results for optimal parameters closely resemble those found for the perceptron (Bruce and Saad 1994), an architecture that can also be viewed as
Figure 6: Generalization error E_G as a function of number of examples P and error sensitivity β. At the minimum in E_G with respect to β, β → ∞ as P → ∞.
having no interactions between units of the layer immediately preceding the output layer. Allowing terms linear in the interaction parameter, G_O, leads to optimal parameters that have an additional dependence on the cross-correlation of the teacher RBF weight vector, Σ_{bc} w⁰_b w⁰_c. For instance, the optimal ratio of γ_opt to β_opt for E_B becomes (with G_O small)

γ_opt / β_opt = 2σ_η²HG_D² / [ (G_D − G_O)G_O Σ_{bc} w⁰_b w⁰_c + (G_D − G_O)² ‖w⁰‖² ]    (5.7)

The effect of admitting all terms in G_O for E_B can only be examined empirically. As in the G_O = 0 case, β_opt was found to be linearly dependent on γ, and vice versa, with the gradient of the γ_opt versus β dependence being the reciprocal of that for β_opt versus γ. This form of relationship implies that E_B can still be minimized by finding the correct ratio of γ to β; it is unnecessary to find absolute values for these quantities. Thus, the optimal values define a straight line in training parameter space.
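The claim that only the ratio of γ to β matters for the Bayes generalizer can be checked directly for the linear-in-weights output layer: the posterior-mean weights, which determine the Bayes-optimal prediction, depend on β and γ only through γ/β. A sketch (names are ours):

```python
import numpy as np

def bayes_mean_weights(phi, y, beta, gamma):
    """Posterior-mean output weights for the gaussian Gibbs posterior.
    Scaling beta and gamma by a common factor leaves the solution of
    (beta*phi^T phi + gamma*I) w = beta*phi^T y unchanged, so the
    Bayes prediction depends only on the ratio gamma/beta."""
    H = phi.shape[1]
    return np.linalg.solve(beta * phi.T @ phi + gamma * np.eye(H),
                           beta * phi.T @ y)
```

The Gibbs generalizer, by contrast, also feels the posterior variance, which shrinks as β grows at fixed γ/β; this is why β and γ enter E_G separately.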
Figure 7: Generalization error E_G as a function of number of examples P and weight decay parameter γ. As P → ∞, the value of γ at the minimum in E_G with respect to γ becomes constant.

In the case of E_B, the dependence of γ_opt and β_opt on the noise variance σ_η² can also be found; again, as in the G_O = 0 case, γ_opt is proportional to σ_η² while β_opt is inversely proportional to it.

Definition 2.1. A set E ⊂ X is approximately flat if for every ε > 0, there is a finite dimensional subspace L of X and an x ∈ E such that E ⊂ (x + L)^ε. Every flat set is approximately flat, but the reverse is not true. Note that if E is approximately flat, then so is the closure of E. To see this, pick arbitrary ε > 0 and note that there is a finite dimensional subspace L such that (x + L)^{ε/2} contains E for some x ∈ E. But this implies that the closure of E is contained in (x + L)^ε.

Definition 2.2. For E ⊂ X and ε > 0, the ε-dimension of E is defined by
dim_ε E = min{dim L : (∃x ∈ E)[E ⊂ (x + L)^ε]}    (2.2)

where min ∅ := +∞. In these terms, E is approximately flat if its ε-dimension is finite for all ε > 0. An approximately flat set is arbitrarily close to being contained in a finite dimensional subset. Intuitively, the consequence of this is that in forming the span of E, the precision problem will arise when interest attaches to functions outside of the finite dimensional subspaces L that solve the problem min{dim L : (∃x ∈ E)[E ⊂ (x + L)^ε]}. The following result will be used repeatedly.
Precision and Approximate Flatness in Neural Networks
Lemma 2.1. If the closure of E is compact in (X, τ), then E is approximately flat.

Proof. Pick ε > 0 and x ∈ E. Let {B(x_i, ε) : i = 1,…,m} be a finite cover of the necessarily compact closure E − x by ε balls. Let L be the span of the x_i. Because E ⊂ (x + {x_i : i = 1,…,m})^ε and the metric is translation invariant, E must be a subset of (x + L)^ε. □
Lemma 2.1 has a partial converse in Banach spaces.

Lemma 2.2. If E is a norm bounded, approximately flat subset of a Banach space (X, ‖·‖), then E has compact closure.

The intuition has two parts: (1) a norm bounded flat set has compact closure, being a bounded subset of a finite dimensional real vector space; (2) up to any ε > 0, a norm bounded approximately flat set is flat.

Proof. Let E be a norm bounded, approximately flat set. The compactness of cl(E), the closure of E, must be shown. Because Banach spaces are complete, to prove compactness, it is sufficient to show that any sequence xⁿ in E contains a Cauchy subsequence. [Any sequence in cl(E) is arbitrarily close to a sequence in E. A Cauchy subsequence will necessarily converge to some point in X. Because the sequence is in E, the subsequence is in E, hence the limit of the subsequence must belong to the closure of E.] Let xⁿ be a sequence in E. The proof will be an application of the Diagonal Method and the following. Fact: for every ε > 0, there is an infinite set N_ε ⊂ ℕ such that for all n₁, n₂ ∈ N_ε, d(x^{n₁}, x^{n₂}) < 2ε. To see that this is true, let L be a finite dimensional linear subspace of X such that E ⊂ (x + L)^{ε/2} for some x ∈ E. For each xⁿ, pick yⁿ ∈ Y := (x + L) ∩ cl(E^{ε/2}), where for any set S, cl(S) denotes the closure of S. [Such a yⁿ exists because E ⊂ (x + L)^{ε/2}.] Because E is norm bounded and x + L is closed, Y is a norm bounded subset of a finite dimensional affine subspace of X. Hence Y is compact. Therefore yⁿ has a convergent subsequence y^{n′} converging to a limit point y ∈ Y. For all sufficiently large n′, d(y^{n′}, y) < ε/2. Therefore, the triangle inequality implies that for all sufficiently large n′, d(x^{n′}, y) < ε. Applying the triangle inequality once more, for all sufficiently large n₁′, n₂′, d(x^{n₁′}, x^{n₂′}) < 2ε. To apply the Diagonal Method, let ε_k ↓ 0. Inductively define a decreasing sequence N_k of infinite subsets of ℕ by N₁ = N_{ε₁} and N_k = N_{k−1} ∩ N_{ε_k}. The Fact just given implies that the inductive step always delivers a further infinite subset. Enumerate each N_k as {n_{k,1}, n_{k,2}, n_{k,3}, …}. The requisite Cauchy subsequence is x^{n_{k,k}}. □
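The ε-dimension of Definition 2.2 can be probed numerically for a finite family of functions sampled on a grid. A crude sketch via the singular value decomposition (translating by the first member and using the sampled L² metric are our simplifying choices, not the paper's):

```python
import numpy as np

def eps_dimension(F, eps):
    """Numerical proxy for the eps-dimension of a family of functions,
    given as rows of F sampled on a common grid: the smallest rank k
    such that, after translating by one member, every member lies
    within eps of a k-dimensional subspace (found via the SVD)."""
    E = F - F[0]                          # translate so the family contains 0
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    for k in range(len(s) + 1):
        approx = U[:, :k] * s[:k] @ Vt[:k]   # best rank-k approximation
        if np.max(np.linalg.norm(E - approx, axis=1)) <= eps:
            return k
    return len(s)
```

A family that genuinely lies in a low-dimensional affine subspace is detected exactly; for a compact family of network activation functions, the ε-dimension found this way is small even for small ε, which is the precision problem in computational form.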
Although flatness is not a topological concept, approximate flatness depends on the metric chosen. The smaller the ε-balls, the more difficult it is for a set to be approximately flat. In a similar fashion, the smaller the ε-balls, the more open covers of any given set there are, making it more difficult for a set to be compact.

Maxwell B. Stinchcombe

2.2 Implications. As noted above, approximate flatness has implications for understanding the practical limitations of artificial neural networks. As an illustrative example, consider the implications for the power to detect differences between multivariate populations (Stinchcombe and White 1993a,b).
Example 2.1. When G is an analytic, nonpolynomial function, and Q₁ ≠ Q₂ are two distinct distributions supported on a compact set K, the set of τ such that

∫ G(x′τ) dQ₁(x) = ∫ G(x′τ) dQ₂(x)    (2.3)

is a closed analytic variety with empty interior, implying it has Lebesgue measure 0 (Stinchcombe 1994, Ch. 1, Theorem 6). Thus, maximizing |∫ G(x′τ) d(Q₁ − Q₂)(x)| over τ in a compact set T having nonempty interior provides a test for the (in)equality of Q₁ and Q₂. This tests for arbitrary differences between Q₁ and Q₂ by testing for the difference between their integrals against a compact set of functions. This can be implemented on data by solving

max_{τ∈T} | (1/#I₁) Σ_{i∈I₁} G(x_i′τ) − (1/#I₂) Σ_{i∈I₂} G(x_i′τ) |    (2.4)

where I₁ indexes a random sample drawn from the distribution Q₁, I₂ indexes an independent random sample drawn from the distribution Q₂, and T is a compact set having nonempty interior. Too large a maximum indicates that the distributions are different; a small maximum indicates that they are the same. Recall that the Riesz representation theorem identifies Q₁ and Q₂ as elements of the dual space of C(K) in Example 2.1. Lemma 2.1 implies that to a fair degree of approximation, the dimension of the compact set of functions, {x ↦ G(x′τ) : τ ∈ T}, is finite, hence its codimension is, to the same degree of approximation, infinite. This means that Q₁ and Q₂ can be anywhere in an infinite dimensional subset of distributions, and the test will be approximately blind to their differences. In plainer language, there are some differences between Q₁ and Q₂ that are hard to see with this test. The extremely high precision needed to see such differences translates to a need for an extremely large amount of data to find them statistically. The discussion of the advantages of including RBF functions in searching for a functional relation between inputs and outputs applies here too. There are differences between Q₁ and Q₂ that an RBF-based version of this test will be more (and less) sensitive to, but a test based on both SLFFs and RBFs will be more sensitive than either.⁵

⁵In implementation one would need to take care of the higher potential for false positives with a more sensitive test.
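A sketch of the test statistic of equation 2.4, with a finite grid over τ standing in for the compact set T and G = tanh (both choices are ours, for illustration):

```python
import numpy as np

def neural_two_sample_stat(X1, X2, taus, G=np.tanh):
    """Empirical two-sample statistic in the spirit of equation 2.4:
    the maximum, over a grid of directions tau, of the absolute
    difference of sample means of G(x' tau) between the two samples."""
    def moments(X):
        return np.array([np.mean(G(X @ t)) for t in taus])
    return np.max(np.abs(moments(X1) - moments(X2)))
```

Identical populations give a statistic near zero while a location shift inflates it; but, as the flatness argument above predicts, distributions differing only in directions "orthogonal" to the compact family {x ↦ G(x′τ)} would leave the statistic nearly unchanged.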
Precision and Approximate Flatness in Neural Networks
The same set of conclusions about the power of tests to detect alternatives arises in the literature on testing for arbitrary misspecifications of statistical models. Bierens (1990), Stinchcombe and White (1993b), Zheng (1994a,b), and Bierens and Ploeberger (1994) all base specification tests on the cross moments of estimated residuals with compact (and therefore approximately flat) classes of functions of the independent variables.

3 Approximate Flatness: Specific Instances
This section begins with an examination of sufficient conditions for approximate flatness for single hidden layer feedforward networks. Following this will be a parallel analysis for radial basis function networks, and then for certain classes of compound networks. The section concludes with a flatness-based comparison of artificial neural networks with other methods of finding functional relations between inputs and outputs.

3.1 Single Hidden Layer Feedforward Networks. The results and examples will involve the set E(G, T) from 1.2. All activation functions are assumed to be measurable. Often continuity, differentiability, or other conditions will be required. For example, the following (slight) generalization of Cybenko's (1989) sigmoids will play a role.

Definition 3.1. The activation function G is extendable to [−∞, +∞], or extendable, or asymptotically constant if both lim_{r→−∞} G(r) and lim_{r→+∞} G(r) exist as finite numbers.6

Theorem 3.1. The set E(G, T) is approximately flat in the following spaces under the following conditions:
a. In C(K) with the topology of uniform convergence if G is continuous and T is bounded.
b. In C_m(K) with the topology of uniform convergence of the first m derivatives if G is m times continuously differentiable and T is bounded.
c. In L^p(R^r, μ), p ∈ [1, ∞), with the norm topology if μ(F) = 0 for all affine subspaces F of R^r of dimension r − 1 or less, G is bounded and extendable and its discontinuities are isolated in R, and T is arbitrary.
d. In S^m_p(R^r, μ), p ∈ [1, ∞), with the norm topology if G is m times continuously differentiable with bounded derivatives and T is bounded.
e. In L^p(R^r, μ), p ∈ [1, ∞), with the weak topology if G is bounded, and T is arbitrary.
f. In S^m_p(R^r, μ), p ∈ [1, ∞), m > 0, with the weak topology if G is m times continuously differentiable with bounded derivatives, and T is bounded.

6 A referee helpfully pointed out that extendable functions are also known as asymptotically constant functions.
Maxwell B. Stinchcombe
g. In L^0(R^r, μ) with the topology of convergence in probability if μ(F) = 0 for all affine subspaces F of R^r of dimension r − 1 or less, G is extendable and its discontinuities are isolated in R, and T is arbitrary.

Some discussion and examples are in order before the proof. First, the isolated discontinuities condition in (c) and (g) allows for hard limiter and many other discontinuous activation functions. It should also be noted that in each of the seven cases covered, the conditions on G and T are known to allow for the denseness of the span of E(G, T). In other words, Theorem 3.1 is not proving approximate flatness for networks that fail to have universal approximation properties. What is perhaps surprising is the relative lack of restrictions needed for approximate flatness in the smallest (coarsest) topologies, Theorem 3.1, c, e, and g. A direct way to understand the intuition for these results is to note that it is relatively easy to be compact in these small topologies because there are fewer open covers of any given set, implying that relatively large sets can be compact. The first example shows that the results for C_m(K) and C(K) cannot generally include unbounded T, even for the well-known logistic activation function.
Example 3.1. Suppose that G is the logistic squasher, G(t) = 1/[1 + exp(−t)]. With r = 1, consider the set of functions E = {G(ax + b) : a, b ∈ R} in the space C(K), where K is the compact interval [0, 1]. It can be shown that there exists a δ > 0 and an infinite set of g ∈ E achieving sup norm distance greater than δ from each other. Because the set E is norm bounded, Lemma 2.2 implies that E is not approximately flat. This failure of compactness can lead to the nonexistence of best fits, in which case iterative algorithms such as backpropagation will not converge. This nonexistence problem disappears with the addition of an appropriate complexity regularization term (Stinchcombe 1994, Ch. 3, Lemma 5). The second example shows that the result for L^p(R^r, μ) = S^0_p(R^r, μ) cannot generally include nonextendable activation functions when the norm topology is used.
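The separated family in Example 3.1 can be exhibited concretely. The sketch below (the weight size and step locations are assumptions of the demo) builds near-step logistic ridge functions G(a(x − c)) with large a and distinct step locations c; any two are at sup-norm distance close to 1 on K = [0, 1], so no finite δ-net covers the family and E cannot be compact.

```python
# Pairwise-separated near-step logistic functions on [0, 1]: a numeric
# illustration of the non-compactness in Example 3.1.
import numpy as np

def G(t):
    # numerically stable logistic squasher
    out = np.empty_like(t, dtype=float)
    pos = t >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-t[pos]))
    out[~pos] = np.exp(t[~pos]) / (1.0 + np.exp(t[~pos]))
    return out

x = np.linspace(0.0, 1.0, 2001)          # dense grid on K = [0, 1]
a = 1000.0                               # large input-to-hidden weight
centers = np.linspace(0.1, 0.9, 20)      # distinct step locations
funcs = [G(a * (x - c)) for c in centers]

# smallest pairwise sup-norm distance over the 20 functions
d = min(np.max(np.abs(f - g))
        for i, f in enumerate(funcs) for g in funcs[i + 1:])
print(d)   # close to 1: the family is uniformly separated
```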
Example 3.2. Suppose that G is the sine function, G(t) = sin(t). With r = 1, Fourier analysis tells us that the set E = {G(ax + b) : a, b ∈ R} contains a countably infinite set of orthogonal elements of L^2([0, 1], λ), where λ is the Lebesgue measure on the unit interval (set b = 0 and consider a = 2nπ, n ∈ N). Further, the functions have norm uniformly bounded below by a strictly positive number. Thus, there is a countably infinite set of functions in E at distances from each other that are uniformly bounded below, so that E cannot have compact closure. Because the set E is norm bounded, Lemma 2.2 implies that E is not approximately flat. The third example demonstrates the difficulties in extending results for S^m_p(R^r, μ) to unbounded sets of τs (input-to-hidden weights).
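A quadrature check of Example 3.2 (an illustrative sketch; the grid-based inner product is an assumption of the demo): the functions sin(2πnx) are pairwise orthogonal in L^2([0, 1], λ), each with norm 1/√2, hence pairwise at distance 1 from one another.

```python
# Numeric verification of the orthogonality and uniform separation of the
# sine ridge functions in Example 3.2.
import numpy as np

x = np.linspace(0.0, 1.0, 20001)
h = x[1] - x[0]

def inner(f, g):                         # L^2([0,1], Lebesgue) inner product
    return np.sum(f * g) * h

fs = [np.sin(2 * np.pi * n * x) for n in range(1, 6)]

norms = [np.sqrt(inner(f, f)) for f in fs]              # each ~ 1/sqrt(2)
cross = max(abs(inner(fs[i], fs[j]))
            for i in range(5) for j in range(i + 1, 5)) # ~ 0
dists = min(np.sqrt(inner(fs[i] - fs[j], fs[i] - fs[j]))
            for i in range(5) for j in range(i + 1, 5)) # each ~ 1
print(norms[0], cross, dists)
```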
Example 3.3. Suppose that μ({0}) > 0. Then for m ≥ 1,

lim_{|a|→∞} ||G(ax + b)||_{S^m_p(R, μ)} = ∞

for some b, provided G is not identically equal to 0. Even if μ is nonatomic, this limit may be infinite. To see this, suppose that G^(1)(·), the first derivative of G, is a smooth (C^∞) function that: is equal to 0 outside of the interval [−1.3, +1.3]; is strictly increasing (respectively, decreasing) in the interval (−1.3, −1) [respectively, (+1, +1.3)]; is equal to 1 inside the interval [−1, +1]; and has slope with absolute value greater than or equal to 1 on the intervals [−1.2, −1.1] and [+1.1, +1.2]. If μ is the uniform distribution on [−1.2, +1.2], then again we have

lim_{|a|→∞} ∫ |D²G(ax)|² dμ(x) = ∞.
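The blow-up in Example 3.3 can be seen numerically. The sketch below substitutes the logistic squasher for the paper's specially constructed activation (an assumed stand-in, not the example's G): with μ uniform on [−1.2, +1.2], the squared L²(μ) norm of D²G(ax) grows like a³, so doubling a multiplies it by roughly 8, and Sobolev norms cannot stay bounded over unbounded input-to-hidden weights.

```python
# Cubic growth of the second-derivative term of the Sobolev norm as the
# input-to-hidden weight a grows.
import numpy as np

def G2(t):                       # G'' for the logistic G
    g = 1.0 / (1.0 + np.exp(-t))
    return g * (1 - g) * (1 - 2 * g)

x = np.linspace(-1.2, 1.2, 240001)
h = x[1] - x[0]

def sq_norm(a):                  # int |D^2 G(ax)|^2 dmu, mu uniform on [-1.2, 1.2]
    return np.sum((a**2 * G2(a * x))**2) * h / 2.4

r = sq_norm(20.0) / sq_norm(10.0)
print(r)                         # close to 2^3 = 8: cubic growth in a
```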
Proof of Theorem 3.1. By Lemma 2.1, it is sufficient to show compactness in the relevant topology. Parts a, b, c, d, e, and f are the easiest, and part c follows directly from g. The given proof of part g uses techniques from Robinson's nonstandard analysis [for excellent introductions see Hurd and Loeb (1985), Lindstrøm (1988), Anderson (1990), or Stigum (1990, Ch. V)].7

a, b, and d: Because T is bounded in R^{r+1}, its closure cl(T) is compact. In each of these three cases, the mapping τ ↦ G_τ, where G_τ(x) := G(x'τ), is continuous on R^{r+1}. This implies that the image of the compact set cl(T) under the given continuous mapping is compact, and the image clearly contains E(G, T).

e and f: By Alaoglu's Theorem (e.g., Royden 1968, Theorem 10.7.17, p. 202), any norm bounded subset of a Banach space has compact closure in the weak topology. The uniform bound on G gives a uniform bound on the L^p-norm of elements of E(G, T) in e; the boundedness of T delivers a bound on the Sobolev norm in f. (Note also that f follows directly from d.)

g: By Robinson's theorem (e.g., Lindstrøm 1988, Proposition III.2.1, p. 52), to show that E ⊂ X has compact closure in the metric space (X, d), it is sufficient to show that each point in *E is nearstandard in *X. By the transfer principle (e.g., Lindstrøm 1988, Theorem V.2.4, p. 77), it is sufficient to show that each point in {G_τ(x) := G(x'τ) : τ ∈ *R^{r+1}} is nearstandard. Because the discontinuities of G are isolated, there are at most countably many of them. If τ is nearstandard, then except on countably many affine subspaces F, G_τ is infinitesimally close to the function on R^r defined by x ↦ G(°x'°τ), where, for nearstandard x ∈ *R^{r+1}, °x denotes the standard part of x. Because each F has mass 0, this implies that G_τ is nearstandard.

7 A metatheorem guarantees that any proof that uses nonstandard analysis has a proof that avoids all reference to nonstandard constructions.
For τ with one or more components infinite, that is, for τ not nearstandard, define

L_τ = {x ∈ *R^r : x'τ = 0}    (3.1)

For some sufficiently large ε ≈ 0 and for all x ∉ L_τ^ε, x'τ ≈ +∞ or x'τ ≈ −∞. Let H_τ^+ denote the set of x ∉ L_τ^ε such that x'τ ≈ +∞, and let H_τ^− denote the set of x ∉ L_τ^ε such that x'τ ≈ −∞. Let g_+ denote lim_{r→+∞} G(r), g_− denote lim_{r→−∞} G(r), and g_0 be an arbitrary real number. There are two cases to consider, depending on whether or not L_τ^ε contains any nearstandard points. If not, then either the nearstandard points are all contained in H_τ^+, in which case G_τ is infinitely close to the constant function equal to g_+, or they are all contained in H_τ^−, in which case G_τ is infinitely close to the constant function equal to g_−. Suppose now that L_τ^ε contains nearstandard points. Let L denote the affine subspace of dimension r − 1 in R^r, L = °L_τ^ε. Pick infinitesimal δ ≥ ε and infinite integer N such that L_τ^δ contains *L ∩ [−N, +N]^r. Because μ(L) = 0, the overspill principle (e.g., Lindstrøm 1988, Corollary I.2.4, p. 12) implies that *μ(L_τ^δ) ≈ 0. This and the definition of the metric d in turn imply that G_τ is infinitely close to the standard function

g(x) = { g_+ if x ∈ H_+;  g_0 if x ∈ L;  g_− if x ∈ H_− }    (3.2)

where H_+ and H_− are the interiors of the sets °H_τ^+ and °H_τ^−, respectively.
c: The conclusion follows from g and the observation that when f^n converges to f in probability and the f^n are uniformly bounded, then f^n converges to f in L^p-norm. □

The assumption on μ in c and g [that μ(F) = 0 for any affine subspace F of dimension r − 1 or less] can be dispensed with. However, this would complicate the construction of the function g(x) in 3.2, and require the additional assumption that the activation function be bounded in g.

3.2 Radial Basis Function Networks. Essentially the same set of approximate flatness results holds for RBF or elliptical basis networks (introduced in Park and Sandberg 1993b). For λ > 0, let S_λ denote the set of symmetric, positive semidefinite r × r matrices with all eigenvalues greater than or equal to λ. Distance between elements of S_λ is measured as the Euclidean distance between the vectorized versions of the matrices (i.e., as points in R^{r×r}). For S ⊂ S_λ, C ⊂ R^r, and G a function from R to R, define

E(G, S, C) = {G[(x − c)'Σ(x − c)] : Σ ∈ S, c ∈ C}    (3.3)

The RBFs in 1.4 are the special cases where Σ = I/σ, σ > 0, with I the r × r identity matrix.
Theorem 3.2. The set E(G, S, C) is approximately flat in the following spaces under the following conditions:
a. In C(K) with the topology of uniform convergence if G is continuous and S and C are bounded.
b. In C_m(K) with the topology of uniform convergence of the first m derivatives if G is m times continuously differentiable and S and C are bounded.
c. In L^p(R^r, μ), p ∈ [1, ∞), with the norm topology if μ(F) = 0 for any affine subspace F of R^r of dimension r − 1 or less (e.g., if μ has a density with respect to Lebesgue measure), G is continuous, lim_{t→∞} G(t) exists as a finite number, S is an arbitrary subset of S_λ for some λ > 0, and C is arbitrary.
d. In S^m_p(R^r, μ), p ∈ [1, ∞), with the norm topology if G is m times continuously differentiable with bounded derivatives and S and C are bounded.
e. In L^p(R^r, μ), p ∈ [1, ∞), with the weak topology if G is bounded, and S and C are arbitrary.
f. In S^m_p(R^r, μ), p ∈ [1, ∞), m > 0, with the weak topology if G is m times continuously differentiable with bounded derivatives and S and C are bounded.
g. In L^0(R^r, μ) with the topology of convergence in probability if μ(F) = 0 for any affine subspace F of R^r of dimension r − 1 or less, G is continuous, lim_{t→∞} G(t) exists as a finite number, S is an arbitrary subset of S_λ for some λ > 0, and C is arbitrary.

Proof of Theorem 3.2. The proofs of all but g can be adapted directly from the proof of Theorem 3.1.

g: Fix a standard λ > 0 and let g_+ denote lim_{t→∞} G(t). It is sufficient to show that for all c ∈ *R^r and all Σ ∈ *S_λ, the function G_{c,Σ} defined by G_{c,Σ}(x) = G[(x − c)'Σ(x − c)] is nearstandard. If both c and Σ are nearstandard, then continuity implies that G_{c,Σ} is nearstandard. If c has one or more infinite components, then the fact that each eigenvalue of Σ is greater than or equal to λ makes G_{c,Σ}(x) ≈ g_+ for all nearstandard x. Thus, G_{c,Σ} is infinitely close to the function identically equal to g_+. Finally, suppose that Σ has infinite components and that c = (c_1, . . . , c_i, . . . , c_r) is nearstandard. Let M denote the set of x having x_i = °c_i for some i ∈ {1, . . . , r}. For all nearstandard x outside of M^ε for some infinitesimal ε, G_{c,Σ}(x) ≈ g_+. By overspill and the assumption that μ(F) = 0 for any affine subspace F of dimension less than or equal to r − 1, *μ(M^ε) ≈ 0. Thus, G_{c,Σ} is again infinitesimally close to the function identically equal to g_+. □
It should be noted that the assumption that μ(F) = 0 for affine F can be dispensed with in c and g, at the cost of a proof that is a welter of special cases. Also, with the assumption that μ(M) = 0 for all lower dimensional manifolds M in R^r (a consequence of μ having a density with respect to Lebesgue measure), the continuity assumption on G can be weakened as in Theorem 3.1.
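A minimal sketch of the elliptical basis units of equation 3.3 (the gaussian outer function and the particular matrix are assumptions of the demo): a unit G[(x − c)'Σ(x − c)] together with a check that Σ lies in S_λ.

```python
# One elliptical basis unit and a membership check for the set S_lambda of
# symmetric matrices with eigenvalues bounded below by lambda.
import numpy as np

def elliptical_unit(x, c, Sigma, G):
    d = x - c
    return G(d @ Sigma @ d)

def in_S_lambda(Sigma, lam):
    """Symmetric with all eigenvalues >= lambda."""
    return (np.allclose(Sigma, Sigma.T)
            and np.min(np.linalg.eigvalsh(Sigma)) >= lam)

G = lambda t: np.exp(-t)                     # gaussian-type outer function
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # eigenvalues ~ 0.79 and 2.21
print(in_S_lambda(Sigma, 0.5))
print(elliptical_unit(np.zeros(2), np.zeros(2), Sigma, G))  # G(0) = 1.0
```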
3.3 Compound Networks. In general, any class of networks continuously parameterized by a finite dimensional set of parameters gives rise to a compact set of functions when bounds are imposed on the parameters (Theorems 3.1 and 3.2, a, b, d, and f). However, approximate flatness may also arise with unbounded sets of parameters (Theorems 3.1 and 3.2, c, e, and g). Thus, it is important to examine more directly the approximate flatness of more complex networks. This examination will be limited to two popular classes of compound networks.
3.3.1 Networks by Linear Combination. The first class arises when one constructs a network by taking linear combinations of different types of activation functions. This could be implemented by (say) building networks by a process that allows the choice of either an RBF or an SLFF to be the next unit added. Mathematically, this corresponds to choosing linear combinations of points in the union of two approximately flat sets. When two or more types of network are combined in this fashion, the relevant fact is
Lemma 3.1. Any finite union of compact sets is compact.

Proof. Immediate. □
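Concretely, such a combination of unit types can be sketched as follows (the activations, shapes, and parameter values are all assumptions of the demo): one output function mixing logistic SLFF units with gaussian RBF units.

```python
# A linear combination over two unit types: SLFF units G1(x'tau) plus
# RBF units G2(|x - c|^2 / sigma).
import numpy as np

def slff_unit(x, tau):                 # inner function x'tau, logistic outer
    return 1.0 / (1.0 + np.exp(-(x @ tau[:-1] + tau[-1])))

def rbf_unit(x, c, sigma):             # gaussian radial unit
    return np.exp(-np.sum((x - c) ** 2, axis=-1) / sigma)

def compound_net(x, slff_params, rbf_params):
    """Weighted sum over both unit types, one beta per unit."""
    out = sum(b * slff_unit(x, tau) for b, tau in slff_params)
    out += sum(b * rbf_unit(x, c, s) for b, c, s in rbf_params)
    return out

x = np.array([[0.0, 0.0], [1.0, -1.0]])               # two inputs in R^2
slff_params = [(0.5, np.array([1.0, -1.0, 0.0]))]     # (beta, tau)
rbf_params = [(2.0, np.array([1.0, -1.0]), 0.5)]      # (beta, c, sigma)
y = compound_net(x, slff_params, rbf_params)
print(y)    # one output per input row
```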
The resultant network output functions are of the form

Σ_{k=1}^{K} Σ_{l_k=1}^{L_k} β_{k,l_k} G_k[ι_{k,l_k}(x)]    (3.4)

where each G_k may be a separate activation function, and each ι_{k,l_k} is drawn from the class of inner functions appropriate to networks of type k [e.g., ι_{k,l_k}(x) = x'τ_{k,l_k} for SLFF networks, or ι_{k,l_k}(x) = (x − c_{k,l_k})'(x − c_{k,l_k})/σ_{k,l_k} for RBF networks]. Combining two or more different types of networks does reach further into more dimensions for any given number of nonlinear units at any given degree of precision. It thereby ameliorates some of the precision problem, but the approximate flatness of the resultant network shows that it cannot altogether avoid it.

3.3.2 Networks by Composition. The second class of compound networks to be considered may have multiple nonlinear layers. These give rise to output functions of the form f_n ∘ f_{n−1} ∘ · · · ∘ f_1, where f_1 is an element of the first layer, f_k, 2 ≤ k ≤ n, is an element of a kth, higher, layer, and "∘" denotes the composition of functions.
Example 3.4. A popular type of classifier network has output functions of the form

G[Σ_{l=1}^{L} β_l G(x'τ_l)]    (3.5)
where G is a strictly increasing function from R onto (0, 1). Here f_2 is the function G, and f_1 is drawn from the class of SLFF output functions. The ascending towers of multilayer networks begin with linear combinations of the classifiers just given,

g_2(x) = Σ_{k=1}^{K} ρ_k G[Σ_{l=1}^{L_k} β_{k,l} G(x'τ_{k,l})]    (3.6)

Here both f_2 and f_1 are drawn from classes of SLFF output functions. More layers can be had by iterating this process,

g_n(x) = G[g_{n−1}(x)]    (3.7)

when n is odd, and

g_n(x) = Σ_{k=1}^{K} ρ_k g_{n−1,k}(x)    (3.8)

when n is even. Sigma-pi networks also arise as compositions. Their output functions are of the form

Σ_j β_j Π_i G(x'τ_{j,i})    (3.9)
where β_j ∈ R and τ_{j,i} ∈ R^{r+1}. Here, output functions are linear combinations of functions of the form f_2 ∘ f_1, where the f_2 yield products of the components of their input vectors, and the f_1 are of the form G(x'τ) for some τ. Both of the results regarding this class of compound networks involve f_1 and f_2 being drawn from compact sets of functions. Because the lower layer f_1 are often linear combinations of elements of a compact set, the following is helpful.8

Lemma 3.2. If E is a compact set of a complete metrizable convex topological vector space (X, τ), then the set

aco(E, b) := { g = Σ_{j=1}^{∞} β_j f_j : f_j ∈ E, Σ_{j=1}^{∞} |β_j| ≤ b }    (3.10)

has compact closure for any b ≥ 0.

Considering those vectors of βs satisfying β_j = 0 for j ≥ J + 1 shows that this Lemma covers the case of finite linear combinations.

Proof. Robertson and Robertson (1973, Lemma III.4.7, p. 53) show that if E is compact, then so is the set b · E. Further, Robertson and Robertson (1973, Corollary to Theorem III.6.5, p. 60) show that the closed absolutely convex envelope of a compact set is compact. The closure of the set aco(E, b) is the closed absolutely convex envelope of b · E. □
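A small numeric illustration of the set aco(E, b) in Lemma 3.2 (the parameter ranges and the logistic choice of E are assumptions of the demo): with Σ|β_j| forced to equal b, every combination is uniformly sup-norm bounded by b, a necessary consequence of the compactness the lemma asserts.

```python
# Random elements of aco(E, b) for E a bounded-parameter logistic family:
# every absolutely convex combination stays inside the sup-norm ball of
# radius b.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 501)

def G(t):
    return 1.0 / (1.0 + np.exp(-t))

b = 3.0
worst = 0.0
for _ in range(200):                       # random elements of aco(E, b)
    J = int(rng.integers(1, 10))
    beta = rng.normal(size=J)
    beta *= b / np.sum(np.abs(beta))       # enforce sum |beta_j| = b
    params = rng.uniform(-5, 5, size=(J, 2))   # bounded taus = (a, c)
    g = sum(bj * G(aj * x + cj) for bj, (aj, cj) in zip(beta, params))
    worst = max(worst, np.max(np.abs(g)))
print(worst)   # never exceeds b, since sup|G| <= 1
```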
The first result concerns compositions in C(K).

8 Note that all of the metrizable vector spaces under analysis here are complete and convex; that is, every Cauchy sequence converges and they have a neighborhood basis of convex sets.
Theorem 3.3. If F_1 is a compact subset of C(K_1), K_1 a compact subset of R^r, and F_2 is a compact subset of C(K_2), K_2 a compact subset of R containing all points of the form f_1(x) where f_1 ∈ F_1 and x ∈ K_1, then the set of functions

F_2 ∘ F_1 := {g(x) = f_2[f_1(x)] : f_2 ∈ F_2 and f_1 ∈ F_1}    (3.11)

is a compact subset of C(K_1).

Applying Lemma 3.2, we can take F_1 to be the set of linear combinations of elements of a compact set E(G, T) with the sum of the absolute values of the weights bounded, and F_2 to be the compact singleton set F_2 = {G}. This delivers compactness of the closure of the set of classifier networks in 3.5 when the β_l are absolutely summable. This in turn implies that the precision problem may arise at each nonlinear level in 3.6 and 3.8.

Proof of Theorem 3.3. Let g^n be an arbitrary sequence in F_2 ∘ F_1. Taking subsequences at most twice, pick a subsequence g^{n'} where the f_2^{n'} and f_1^{n'} in the representation of g^{n'} converge to some f_2* and f_1*. For each x ∈ K_1, g^{n'}(x) converges to g*(x) := f_2*[f_1*(x)]. This pointwise convergence of the continuous functions g^{n'} to a continuous function g* implies that the convergence is uniform over the compact set K_1. □

It would be a mistake to attempt to compose subsets of the spaces (X, τ) based on probability measures μ: a distribution on the input space is mapped to many distributions on the space of outputs by the many elements f_1 ∈ F_1, so the measure of distance for the upper layer is not well defined. However, it is possible to analyze composition with continuous bounded functions. Let C_b(R) denote the space of continuous bounded functions from R to R with the sup norm topology.

Theorem 3.4. If F_2 is a compact subset of C_b(R) and F_1 is a compact subset of either L^p(R^r, μ) with the norm topology or L^0(R^r, μ) with the topology of convergence in probability, then F_2 ∘ F_1 is compact in L^p(R^r, μ) or L^0(R^r, μ), respectively.

Proof. Let g^n = f_2^n ∘ f_1^n be a sequence in F_2 ∘ F_1. By the compactness of F_2 and F_1, there exists a subsequence (still denoted by n) g^n = f_2^n ∘ f_1^n with the property that f_2^n converges to some f_2* ∈ F_2 and f_1^n converges to some f_1* ∈ F_1. To treat the case of convergence in probability, L^0(R^r, μ), note that every subsequence of f_1^n has a further subsequence n' converging μ-a.e. to f_1*. Further, restricted to any compact subset of R, the convergent sequence f_2^{n'} is equicontinuous. Pick a compact K ⊂ R such that μ{f_1* ∈ K} > 1 − ε. For arbitrary δ > 0 and all sufficiently large n', μ{f_1^{n'} ∈ K^δ} > 1 − ε because f_1^{n'} is converging to f_1* μ-a.e. Because cl(K^δ), the closure of K^δ, is compact, the equicontinuity of f_2^{n'} restricted to cl(K^δ) implies that for all sufficiently large n', |f_2^{n'}[f_1^{n'}(x)] − f_2*[f_1*(x)]| is less than ε for all x in a set having μ-measure at least 1 − ε. Because ε was arbitrary, the proof is complete.
To treat the L^p case, take convergent subsequences f_2^n converging to f_2* ∈ F_2 and f_1^n converging to f_1* ∈ F_1. Because L^p convergence implies convergence in probability, the previous step implies that every subsequence has a further subsequence n' converging μ-a.e. to g* := f_2* ∘ f_1*. Because the sequence f_2^{n'} is uniformly bounded, this implies that

∫ |f_2^{n'}[f_1^{n'}(x)] − g*(x)|^p dμ(x)    (3.12)

converges to 0 by Lebesgue's Dominated Convergence Theorem. □
3.4 A Comparison with Other Methods. This work has shown that many of the leading classes of artificial neural networks are linear combinations of elements of approximately flat sets. As well as providing a justification for combining different types of networks, this gives insight into the limitations of artificial neural network techniques. There will always be dimensions in which these techniques are relatively blind. However, the ideas of flatness and approximate flatness also provide some insight into why artificial neural network techniques are so powerful compared to other methods of finding functional relations between inputs and outputs. In a series regression context (e.g., Fourier analysis or polynomial regression), the amount and type of data or training examples determine how many nonlinear terms should be included. One then considers linear combinations of this fixed number of nonlinear functions. In other words, at each stage, one is using linear combinations of a flat set of functions. Approximately flat sets are much larger than flat sets, and it is the switch from flat sets to approximately flat sets that is behind the exceptional power of artificial neural networks.
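The flat-versus-approximately-flat contrast can be dramatized numerically. In the sketch below (the target, the grids, and the parameter counts are all assumptions of the demo), a steep ridge is fit by (i) linear combinations of a flat, fixed 4-dimensional polynomial basis and (ii) an affine combination of one logistic unit whose inner weights are also chosen. With the same number of parameters, the adaptive class fits far better.

```python
# Fixed basis (flat set) versus adaptive inner weights (approximately flat
# set), both with 4 free parameters, fitting a steep logistic ridge.
import numpy as np

def G(t):
    return 1.0 / (1.0 + np.exp(-t))

x = np.linspace(0.0, 1.0, 400)
y = G(20 * (x - 0.3))                       # target function

# (i) flat set: fixed basis {x^3, x^2, x, 1}
P = np.vander(x, 4)
mse_flat = np.mean((y - P @ np.linalg.lstsq(P, y, rcond=None)[0]) ** 2)

# (ii) adaptive inner weights: beta0 + beta1 * G(a*x + c), grid over (a, c)
best = np.inf
for a in np.linspace(1, 30, 30):
    for c in np.linspace(-15, 5, 41):
        B = np.column_stack([np.ones_like(x), G(a * x + c)])
        r = np.mean((y - B @ np.linalg.lstsq(B, y, rcond=None)[0]) ** 2)
        best = min(best, r)
print(mse_flat, best)    # adaptive fit is orders of magnitude better
```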
References

Adams, R. A. 1975. Sobolev Spaces. Academic Press, New York.
Anderson, R. M. 1990. Nonstandard methods in mathematical economics. Working Paper No. 90-143, Institute of Business and Economic Research, Berkeley.
Bierens, H. 1990. A consistent conditional moment test of functional form. Econometrica 58, 1443-1458.
Bierens, H., and Ploeberger, W. 1994. Asymptotic theory of integrated conditional moment tests. Working Paper, Department of Economics, Southern Methodist University, Dallas, TX.
Choi, J. Y., and Choi, C.-H. 1992. Sensitivity analysis of multilayer perceptrons with differentiable activation functions. IEEE Trans. Neural Networks 3(1), 101-107.
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, 303-314.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Gallant, A. R., and White, H. 1992. On learning the derivatives of an unknown mapping with neural networks. Neural Networks 5(1), 129-138.
Ghosh, J., and Tumer, K. 1994. Robust classification by combining multiple neural networks: An analysis of decision boundaries. Photocopy, Department of Electrical and Computer Engineering, University of Texas at Austin.
Goffe, W. L., Ferrier, G. D., and Rogers, J. 1994. Global optimization of statistical functions with simulated annealing. J. Economet. 60(3), 65-99.
Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2), 251-257.
Hornik, K. 1993. Some new results on neural network approximation. Neural Networks 6(8), 1069-1072.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Hornik, K., Stinchcombe, M., and White, H. 1990. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 3, 551-560.
Hurd, A. E., and Loeb, P. A. 1985. An Introduction to Nonstandard Real Analysis. Academic Press, New York.
Jordan, M. 1989. Generic constraints on underspecified target trajectories. In Proceedings of the 1989 International Joint Conference on Neural Networks, Vol. 1, pp. 217-225. IEEE Press, New York.
Kufner, A. 1980. Weighted Sobolev Spaces. B. G. Teubner, Leipzig.
Kufner, A., and Sandig, A. M. 1987. Some Applications of Weighted Sobolev Spaces. B. G. Teubner, Leipzig.
Lindstrøm, T. 1988. An invitation to nonstandard analysis. In Nonstandard Analysis and Its Applications, N. Cutland, ed., pp. 1-105. Cambridge University Press, Cambridge.
Maz'ja, V. G. 1985. Sobolev Spaces. Springer-Verlag, New York.
Park, J., and Sandberg, I. W. 1991. Universal approximation using radial basis-function networks. Neural Comp. 3(2), 246-257.
Park, J., and Sandberg, I. W. 1993a. Approximation and radial-basis function networks. Neural Comp. 5(2), 305-316.
Park, J., and Sandberg, I. W. 1993b. Nonlinear approximations using elliptic basis function networks. Circuits Syst. Signal Process. 13(1), 99-113.
Piche, S. 1995. The selection of weight accuracies for madalines. IEEE Trans. Neural Networks 6(2), 432-445.
Ricketts, I. W. 1992. Cervical cell image inspection: A task for artificial neural networks. Neural Comp. 3(1), 15-18.
Robertson, A. P., and Robertson, W. 1973. Topological Vector Spaces. Cambridge University Press, Cambridge.
Royden, H. 1968. Real Analysis. Macmillan, New York.
Showalter, R. E. 1977. Hilbert Space Methods for Partial Differential Equations. Pitman, London.
Stigum, B. 1990. Toward a Formal Science of Economics. MIT Press, Cambridge, MA.
Stinchcombe, M. 1994. Notes on Econometrics and Artificial Neural Networks. Working Paper, Department of Economics, University of Texas at Austin.
Stinchcombe, M., and White, H. 1989. Universal approximation using feedforward networks with nonsigmoid hidden layer activation functions. In Proceedings of the International Joint Conference on Neural Networks, Washington, DC, Vol. I, pp. 613-617. SOS Printing, San Diego. (Reprinted in Artificial Neural Networks: Approximation & Learning Theory, H. White, ed. Blackwell, Oxford, 1992.)
Stinchcombe, M., and White, H. 1990. Approximating and learning unknown mappings using multilayer feedforward networks with bounded weights. In Proceedings of the International Joint Conference on Neural Networks, Washington, DC, Vol. III, pp. 7-16. SOS Printing, San Diego. (Reprinted in Artificial Neural Networks: Approximation & Learning Theory, H. White, ed. Blackwell, Oxford, 1992.)
Stinchcombe, M., and White, H. 1993a. Using feedforward networks to distinguish multivariate populations. In Proceedings of the International Joint Conference on Neural Networks, Vol. I, pp. 788-793. IEEE Press, New York.
Stinchcombe, M., and White, H. 1993b. Consistent specification testing with unidentified nuisance parameters using duality and Banach space limit theory. U.C.S.D. Discussion Paper 93-14R3, April.
Zheng, J. 1994a. A specification test of conditional parametric distributions using kernel estimation methods. Working Paper, Department of Economics, University of Texas at Austin.
Zheng, J. 1994b. A residual-based consistent test of parametric regression models. Working Paper, Department of Economics, University of Texas at Austin.
Received July 11, 1994; accepted December 6, 1994.
Communicated by Wolfgang Maass
Lower Bounds on the VC Dimension of Smoothly Parameterized Function Classes Wee Sun Lee Peter L. Bartlett Department of Systems Engineering, RSISE, Australian National University, Canberra, ACT 0200, Australia
Robert C. Williamson Department of Engineering, Australian National University, Canberra, ACT 0200, Australia
We examine the relationship between the VC dimension and the number of parameters of a threshold smoothly parameterized function class. We show that the VC dimension of such a function class is at least k if there exists a k-dimensional differentiable manifold in the parameter space such that each member of the manifold corresponds to a different decision boundary. Using this result, we are able to obtain lower bounds on the VC dimension proportional to the number of parameters for several thresholded function classes including two-layer neural networks with certain smooth activation functions and radial basis functions with a gaussian basis. These lower bounds hold even if the magnitudes of the parameters are restricted to be arbitrarily small. In Valiant's probably approximately correct learning framework, this implies that the number of examples necessary for learning these function classes is at least linear in the number of parameters. 1 Introduction
Smoothly parameterized functions are often used as classification functions by thresholding the outputs to create binary valued functions. This is done because differentiability allows the use of gradient-based algorithms in learning the functions. Examples of frequently used smoothly parameterized functions include feedforward neural networks with sigmoidal activation functions such as tanh and radial basis functions with a gaussian basis. In considering the number of examples necessary to learn these functions, we utilize Valiant's probably approximately correct (PAC) framework (Valiant 1984). In this framework, for any desired target function and any probability distribution of examples, the learning algorithm is required to produce with high probability a hypothesis that classifies most of the randomly chosen examples correctly.

Neural Computation 7, 1040-1053 (1995) © 1995 Massachusetts Institute of Technology

It has been shown in Blumer et al. (1989) that the number of examples necessary and sufficient for PAC learning a function class is proportional to a combinatorial dimension known as the Vapnik-Chervonenkis (VC) dimension of the function class.

Definition 1. Let F be a class of {0, 1}-valued functions defined on a set X. A finite set S ⊂ X is said to be shattered if for any subset S+ of S, there is an f ∈ F such that f(x) = 1 for all x ∈ S+ and f(y) = 0 for all y ∈ S \ S+. The VC dimension of F is the cardinality of the largest subset of X that is shattered by F.
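Definition 1 can be checked by brute force on small examples. The sketch below (a hypothetical helper, not from the paper) verifies that the 1-D threshold functions {x ↦ 1[x ≥ t]} shatter any single point but, by monotonicity, no two-point set, so their VC dimension is 1.

```python
# Brute-force shattering check for a finite class of classifiers against a
# finite point set, per Definition 1.
from itertools import product

def shatters(points, classifiers):
    """True if every 0/1 labeling of `points` is realized by some classifier."""
    realized = {tuple(f(x) for x in points) for f in classifiers}
    return all(lab in realized
               for lab in product([0, 1], repeat=len(points)))

def thresh(t):
    return lambda x: 1 if x >= t else 0

# thresholds below, between, and above the sample points suffice to check
classifiers = [thresh(t) for t in [-1.0, 0.25, 0.75, 2.0]]

print(shatters([0.5], classifiers))        # one point: shattered
print(shatters([0.0, 0.5], classifiers))   # two points: labeling (1, 0) impossible
```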
Bounds on the VC dimension of several specific parameterized function classes are known. For example, the class of threshold functions formed from a vector space of real valued functions of dimension d is known to have VC dimension d (Dudley 1978). This includes functions formed from linear combinations of linearly independent fixed basis functions such as polynomials and radial basis functions with fixed bases. Other classes with known bounds include certain neural networks with threshold activation functions with VC dimension O(W log W) (Baum and Haussler 1989; Maass 1992; Sakurai 1993) and networks with piecewise polynomial activation functions that have VC dimension O(W² log q), where W is the number of parameters and q is the number of pieces (Maass 1993; Goldberg and Jerrum 1993). In this paper, we consider thresholded smoothly parameterized function classes. We give general lower bounds on the VC dimension of a thresholded smoothly parameterized function class in terms of the number of "useful" parameters. Obviously, function classes can be parameterized in such a way that a lot of the parameters are redundant. For example, a neural network where the activation functions are linear has VC dimension no more than the number of inputs plus one, regardless of the number of parameters used. Similarly, a network with a tanh(x) activation function but only one unit in each hidden layer has the same decision boundary as a linear function, regardless of the number of layers used. We show that if there exists a k-dimensional differentiable manifold in the parameter space such that each member of the manifold corresponds to a different decision boundary, then the VC dimension is at least k. Using this result, we find lower bounds for the VC dimension of some two-layer neural networks. Certain two- and three-layer feedforward neural networks with tanh(x) activation function are known to have VC dimension at least Ω(W log W).
This bound is obtained simply by letting the tanh(x) network approximate a network with the same architecture but threshold activation function [with VC dimension Ω(W log W) (Baum and Haussler 1989; Maass 1992; Sakurai 1993)] when the weights are large enough. However, if the inputs and weights are bounded such that the input-output map is "nearly linear," these techniques will no longer
W. S . Lee, P. L. Bartlett, and R. C. Williamson
provide the bounds. Notable increases in performance have also been observed in experiments when the norm of the weights is minimized along with the empirical error (Hertz et al. 1991). Heuristic explanations for this include suggestions that the VC dimension of such networks decreases significantly and approaches that of a linear classifier (see, for example, Boser et al. 1992) as the weights are constrained to be small. We give a lower bound on the VC dimension proportional to the number of weights even when the inputs and weights are restricted to arbitrarily small open sets around the origin. This shows that the VC dimension of the network does not approach the VC dimension of a linear classifier as the allowable size of the weights is reduced. In Valiant's PAC framework, the number of examples required for learning this class of neural networks remains at least proportional to the number of weights, regardless of bounds on the size of the weights and inputs. Previous results bounding the VC dimension of neural networks from below that we are aware of hold only for certain networks with activation functions that can approximate the threshold function (Baum and Haussler 1989; Bartlett 1993; Maass 1992; Sakurai 1993). We are able to give lower bounds proportional to the number of weights for neural networks with a large class of analytic activation functions that are not necessarily able to approximate the threshold function. These techniques also give a lower bound for radial basis functions with a gaussian basis when the centers are adaptable. The bound is approximately n times better (where n is the input dimension) than the best previous lower bound, which follows from the bound in Anthony and Holden (1993) for gaussian radial basis functions with fixed centers.

2 Parameterized Function Classes

A function f : A × X → R can be used to form a parameterized function class F by letting each parameter a ∈ A define a function in the class F.
An example of this is an artificial neural network where A is the set of weights and X is the set of inputs. Such functions are used as classification functions by thresholding the output.
Definition 2. Let A be an open subset of R^m and X be an open subset of R^n. Let f : A × X → R be some (fixed) continuous function. We use f to define decision regions D_a^+ and D_a^- by

D_a^+ := {x ∈ X : f(a, x) > 0}
D_a^- := {x ∈ X : f(a, x) < 0}
The region of input space where the function is positive is separated from the region where it is negative by the decision boundary.
VC Dimension of Parameterized Function Classes
Definition 3. The boundary of a ∈ A in the open set X ⊆ R^n, denoted bdy(a), is defined by

bdy(a) := X \ (D_a^+ ∪ D_a^-) = {x ∈ X : f(a, x) = 0}
For thresholded real-valued function classes, the following definition of the VC dimension is useful.
Definition 4. Let θ : R → {0, 1} be defined by θ(x) = 1 if x > 0, and θ(x) = 0 otherwise. Let Z be some set and let F be a class of functions from Z to R. The thresholded function class formed from the function class F is the class F_θ = {θ ∘ f : f ∈ F}. Let x = (x_1, ..., x_m) ∈ R^m, and let θ(x) = (θ(x_1), ..., θ(x_m)). Let T ⊆ R^m, and write θ(T) = {θ(x) : x ∈ T}. For any sequence z = (z_1, ..., z_m) ∈ Z^m, let F|_z = {(f(z_1), ..., f(z_m)) : f ∈ F}. The VC dimension of F_θ is

VCdim(F_θ) := sup{m : ∃z ∈ Z^m, |θ(F|_z)| = 2^m}

We will need the inverse function theorem and implicit function theorem from calculus (see Spivak 1965).
Definition 5. For f : R^n → R^m, f(a_1, ..., a_n) = (f_1(a), ..., f_m(a))^T, let Df(a) denote the Jacobian matrix of f at a, where the Jacobian matrix is the m × n matrix whose (i, j) entry is the partial derivative ∂f_i(a)/∂a_j:

Df(a) = [∂f_i(a)/∂a_j],  i = 1, ..., m,  j = 1, ..., n
Theorem 6 (Inverse Function Theorem). Let a ∈ R^n and suppose that f : R^n → R^n is continuously differentiable in an open set containing a, and det[Df(a)] ≠ 0. Then there is an open set V containing a and an open set W containing f(a) such that f : V → W has a continuous inverse f^{-1} : W → V which is differentiable and for all y ∈ W satisfies Df^{-1}(y) = {Df[f^{-1}(y)]}^{-1}.

Theorem 7 (Implicit Function Theorem). Suppose f : R^m × R^n → R^m is continuously differentiable on an open set containing (a, b) and f(a, b) = 0. Let M be the m × m matrix of partial derivatives of f with respect to its first m arguments at (a, b). If det M ≠ 0, there is an open set B ⊆ R^n containing b and an open set A ⊆ R^m containing a, with the following property: for each x ∈ B, there is a unique y ∈ A such that f(y, x) = 0. The function h : x ↦ y is differentiable.
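As a hypothetical numerical illustration (not from the paper) of Theorem 7 with m = n = 1, take f(y, x) = y³ + x - 2, so f(1, 1) = 0 and the 1 × 1 matrix of partials with respect to y at (1, 1) is 3, which is nonsingular. The implicitly defined function h can be recovered numerically by Newton's method and compared with the closed form h(x) = (2 - x)^(1/3).

```python
def f(y, x):
    # f(y, x) = 0 implicitly defines y = h(x) near (y, x) = (1, 1)
    return y ** 3 + x - 2

def dfdy(y, x):
    # partial derivative of f with respect to its first argument
    return 3 * y ** 2

def h(x, y0=1.0, steps=50):
    """Solve f(y, x) = 0 for y near y0 by Newton's method."""
    y = y0
    for _ in range(steps):
        y -= f(y, x) / dfdy(y, x)
    return y

x = 1.2
y = h(x)
print(abs(f(y, x)) < 1e-12)                # f(h(x), x) = 0
print(abs(y - (2 - x) ** (1 / 3)) < 1e-9)  # matches the closed form
```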
We want to relate the VC dimension of the thresholded function class to the number of parameters that are not redundant in some sense. Hence, we will consider subsets of the parameter space that form differentiable manifolds in the space such that each member of the manifold defines a different boundary. For our purposes, C¹ manifolds will be sufficient.
Definition 8. (a) If U and V are open sets in R^n, a continuously differentiable function h : U → V with a continuously differentiable inverse h^{-1} : V → U is called a diffeomorphism. (b) A subset M of R^n is called a k-dimensional manifold (in R^n) if for every point x ∈ M, there is an open set U containing x, an open set V ⊆ R^n, and a diffeomorphism h : U → V such that h(U ∩ M) = V ∩ (R^k × {0}) = {y ∈ V : y_{k+1} = ... = y_n = 0}. (c) If for all a_1, a_2 ∈ M, a_1 ≠ a_2 ⇒ bdy(a_1) ≠ bdy(a_2), then we say that M has unique decision boundaries.
Let the function class F be {f(a, ·) : a ∈ A}, where A is an open subset of R^m and f is continuously differentiable. Let g : A × X^m → R^m be defined by g(a, x_1, ..., x_m) = (f(a, x_1), ..., f(a, x_m))^T. For a fixed x, define g_x(a) = g(a, x). The next lemma is a simple consequence of the Inverse Function Theorem.
Lemma 9. Let φ be any diffeomorphism from an open subset of A to a subset of R^m and let ψ_x = g_x ∘ φ^{-1}. If VCdim(F_θ) < k, where F_θ is the thresholded function class formed from F, then for every a ∈ A and every x ∈ X^m, g_x(a) = 0 ⇒ rank[Dψ_x(b)] < k, where b = φ(a).

Proof. Suppose VCdim(F_θ) < k but there is an a ∈ A, an x ∈ X^m, and a diffeomorphism φ with φ(a) = b such that g(φ^{-1}(b), x) = 0 but rank[Dψ_x(b)] ≥ k. Because the rank is at least k, we can choose a submatrix of k linearly independent rows from the matrix Dψ_x(b), corresponding to k examples from X. From this submatrix we can then choose k linearly independent columns, corresponding to k parameters. Let x' ∈ X^k be the k components of x corresponding to the k linearly independent rows of Dψ_x(b) we picked, and let b' ∈ R^k be the k components of b corresponding to the k columns. Define h_{x'} : R^k → R^k so that h_{x'}(b') comprises the k components of ψ_x(b) corresponding to those k rows (the remaining components of b are held fixed). Then the k × k matrix Dh_{x'}(b') is nonsingular. By the Inverse Function Theorem, h_{x'} has a continuous inverse at h_{x'}(b') = 0. Note that a function γ : X → Y is continuous if and only if the inverse image of any open subset of Y is an open subset of X. Since the inverse of h_{x'} at h_{x'}(b') = 0 is continuous, the inverse image of an open subset of R^k containing b' is an open subset containing h_{x'}(b') = 0. So we can pick 2^k points, one from each orthant, in a small enough open subset containing h_{x'}(b') = 0. The inverse of h_{x'} at these 2^k points gives us 2^k points in the neighborhood of b'. Applying φ^{-1} to the appropriate points gives us the 2^k functions in A required to get |θ(F|_{x'})| = 2^k. So the VC dimension must be at least k, contradicting VCdim(F_θ) < k. □

From the previous lemma, it is clear that to show that the VC dimension is at least k, all we have to do is pick a parameter and k points from its decision boundary such that the rank of the corresponding Jacobian matrix is k. [This technique was used in Bartlett (1993) to give lower bounds on the VC dimension of neural networks with threshold activation functions.] However, this may not be easy to do. The following theorem gives conditions that may be easier to check in some cases.

Theorem 10. Let A be an open subset of R^m, let X be an open subset of R^n, and let f : A × X → R be a continuously differentiable function (in all of its arguments). Let F := {f(a, ·) : a ∈ A}. If there exists a k-dimensional manifold M ⊆ A that has unique decision boundaries, then VCdim(F_θ) ≥ k.

Proof. The VC dimension is always greater than or equal to zero, so the case k = 0 is trivial. Assume the manifold with unique boundaries M is of dimension k ≥ 1 but VCdim(F_θ) < k. Recall that g : A × X^m → R^m is defined by g(a, x_1, ..., x_m) = (f(a, x_1), ..., f(a, x_m))^T, g_x(a) = g(a, x), and ψ_x = g_x ∘ φ^{-1}, where φ is an appropriate diffeomorphism that defines the manifold M at a. Then Lemma 9 implies that for every a ∈ M with b = φ(a), and for every x ∈ X^m, g(a, x) = 0 ⇒ rank[Dψ_x(b)] < k. Let b = (w, 0) and let γ_x : R^k → R^m be defined by γ_x(ξ) = ψ_x(ξ, 0). Then rank[Dγ_x(w)] < k as well. Pick a ∈ M and x ∈ X^m such that g(a, x) = 0 and r = rank[Dγ_x(w)] < k is the largest possible. The rank r is greater than zero, because if it were not, the boundary could not change as we change w. By permuting x, w, and the components of γ_x(w), x becomes x', w becomes w' = (c, d), γ_x(w) becomes γ_{x'}(w') = (α(w'), β(w')), and Dγ_x(w) becomes
Dγ_{x'}(w') = [ D_c α(c, d)   D_d α(c, d)
               D_c β(c, d)   D_d β(c, d) ]

where α and c both have r components and the r × r matrix D_c α(c, d) is nonsingular. By the Implicit Function Theorem, with α(c, d) = 0, there exists an open set E around d and a continuously differentiable function h : E → R^r such that α(h(d), d) = 0 for each d ∈ E. We will now show that β(h(d), d) is also zero for each d ∈ E. Because Dγ_{x'}(w') is of rank r, there exists some matrix K such that D_c β(c, d) = K D_c α(c, d) and D_d β(c, d) = K D_d α(c, d). Differentiating β(h(d), d) with respect to d by the chain rule:

D_d[β(h(d), d)] = D_c β · Dh(d) + D_d β
= K D_c α · Dh(d) + K D_d α
= K (D_c α · Dh(d) + D_d α)
= K D_d[α(h(d), d)]
= 0

[because α(h(d), d) = 0 for all d ∈ E]. Since β(w') = 0, it follows that β(h(d), d) = 0 for all d ∈ E. Write β(w') = (f(w', x_{r+1}), ..., f(w', x_m))^T. We can substitute an arbitrary x' from bdy(w') for one of these x_i without increasing the rank of Dγ_{x'}(w'), because by hypothesis we have picked the x_i that give the largest rank. Hence the rank of the Jacobian matrix of the function after substitution remains r. Differentiating β(w') (with the new x_i) using the chain rule as above, we find that f(w'', x_i) = 0 for any w'' ∈ graph(h), where graph(h) = {(h(d), d) : d ∈ E}. Since we picked x_i arbitrarily from bdy(w'), bdy(w') ⊆ bdy(w'') for any w'' ∈ graph(h). From the continuity of the components of D_c α(w'), and hence of its determinant, D_c α is nonsingular in a neighborhood of w'. Choose w'' ∈ graph(h) that is also in this neighborhood. Again, we can substitute any x'' from the boundary of w'' into β(w'') without changing the rank of the Jacobian matrix. Since w' and w'' both lie on graph(h), we can again differentiate using the chain rule to show that bdy(w'') ⊆ bdy(w'). So w' and w'' must have the same boundary, which is a contradiction. Thus if VCdim(F_θ) < k, any k-dimensional manifold M must contain distinct a' = φ^{-1}(b') and a'' = φ^{-1}(b'') such that bdy(a') = bdy(a''), where φ is the diffeomorphism that defines M at a'. □

As a simple example, we consider the VC dimension of the linear classifier (perceptron) when the parameters are restricted to an open set.
Example 11. Consider the linear function f : A × R^n → R such that f(a, x) = a_0 + a_1 x_1 + ... + a_n x_n, where A is any open subset of R^{n+1}. Choose an open subset A' ⊆ A such that none of the parameters (a_0, ..., a_n) is zero; this can always be done because A is an open set. Let M be the projection of A' onto the subspace where a_0 ≠ 0 is held constant. Then M is an n-dimensional manifold with unique decision boundaries. One way to see that the boundaries are unique is to check the intersections of the boundaries with the axes of R^n. Using Theorem 10, the VC dimension of the thresholded linear function class is at least n.
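A brute-force check in the spirit of Example 11 (a hypothetical sketch, not code from the paper): the thresholded affine class shatters the n standard basis vectors even when every parameter is confined to a tiny ball around a point with no zero coordinates.

```python
import itertools

def classify(a, x):
    # thresholded affine function: 1 if a_0 + a . x > 0, else 0
    return 1 if a[0] + sum(ai * xi for ai, xi in zip(a[1:], x)) > 0 else 0

n = 4
delta, eta = 1e-3, 1e-4  # parameters stay within 2e-4 of (-delta, delta, ..., delta)
points = [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]  # e_1..e_n

patterns = set()
for labels in itertools.product([0, 1], repeat=n):
    # a_0 + a_i decides the label of e_i; nudge a_i just above or below delta
    a = [-delta] + [delta + (eta if l else -eta) for l in labels]
    patterns.add(tuple(classify(a, p) for p in points))

print(len(patterns) == 2 ** n)  # all 16 dichotomies of e_1, ..., e_n realized
```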
3 Two Layer Neural Networks
3.1 Tanh Activation Function. We first consider finding a lower bound for the VC dimension of a two-layer neural network with tanh(x) activation functions when both the inputs and weights (parameters) are restricted to an arbitrary open subset that includes the origin.
Definition 12. A two-layer feedforward network with an n-dimensional input x = (x_1, ..., x_n) ∈ R^n, k hidden units with tanh activation, and weights w = (v_{10}, ..., v_{kn}, w_0, ..., w_k) ∈ R^W (where W = kn + 2k + 1) is a function f : R^W × R^n → R given by

f(w, x) = w_0 + Σ_{i=1}^{k} w_i tanh(v_i · x + v_{i0})

where v_i = (v_{i1}, ..., v_{in}) and v_i · x = Σ_{j=1}^{n} v_{ij} x_j. The weights w_0 and v_{i0}, i = 1, ..., k, are called the offsets.
By permuting the hidden units, the input-output map of the network remains unchanged. Furthermore, because the tanh activation function is odd, changing the signs of all the input weights, the offset, and the output weight of a hidden unit will not change the input-output map. It has been shown that any two networks with the same input-output behavior must be related by a transformation from the finite group generated by these transformations, provided that the networks are irreducible (Sussmann 1992). A net is reducible if any of the following conditions hold: (1) w_i = 0 for some i = 1, ..., k; (2) there exist two different indices j_1, j_2 ∈ {1, ..., k} such that |γ_{j_1}(x)| = |γ_{j_2}(x)| for all x ∈ R^n, where γ_j(x) = v_j · x + v_{j0}; or (3) v_j = 0 for some j ∈ {1, ..., k}.
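These symmetries are easy to verify numerically. The sketch below (hypothetical code, not from the paper) implements the network of Definition 12 and checks that flipping the signs of one hidden unit's input weights, offset, and output weight leaves the input-output map unchanged, because tanh is odd.

```python
import math
import random

def two_layer_tanh(w0, w, V, v0, x):
    """f(w, x) = w0 + sum_i w_i * tanh(v_i . x + v_i0)  (cf. Definition 12)."""
    return w0 + sum(
        wi * math.tanh(sum(a * b for a, b in zip(vi, x)) + vi0)
        for wi, vi, vi0 in zip(w, V, v0)
    )

random.seed(0)
n, k = 3, 2
w0 = random.gauss(0, 1)
w = [random.gauss(0, 1) for _ in range(k)]
V = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]
v0 = [random.gauss(0, 1) for _ in range(k)]

# Flip every sign associated with hidden unit 0.
w_f = [-w[0]] + w[1:]
V_f = [[-a for a in V[0]]] + V[1:]
v0_f = [-v0[0]] + v0[1:]

x = [0.2, -0.5, 1.3]
y1 = two_layer_tanh(w0, w, V, v0, x)
y2 = two_layer_tanh(w0, w_f, V_f, v0_f, x)
print(abs(y1 - y2) < 1e-12)  # same input-output map
```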
This means that the input-output maps of the networks are essentially unique up to a finite group of transformations. Unfortunately, uniqueness of the input-output map is not sufficient (although it is necessary) for uniqueness of the decision boundaries. For example, multiplying the function by a constant results in a network with the same decision boundary but different parameters. Although we do not know the largest dimension for a manifold with unique boundaries in the parameter space, we can use Sussmann's result to find such a manifold of dimension not too much smaller than the number of parameters.

Theorem 13. Let F be the class of two-layer feedforward networks with k hidden units with tanh activation, input space X = {(x_1, ..., x_n) ∈ R^n : |x_i| < C} (where C is a constant greater than zero), and k(n + 2) + 1 weights restricted to an open set that includes the origin. Then VCdim(F_θ) ≥ p, where p = (k - 1)(n + 1) + 1 is the number of weights of a network with n - 1 inputs and k - 1 hidden units.

Proof. Delete all the weights from the nth input and all the weights connected to the kth hidden unit, except the weight connecting the two of them, as shown in Figure 1. Let the smaller network (with weights deleted) be

f(w, x) = w_0 + Σ_{i=1}^{k-1} w_i tanh(v_i' · x' + v_{i0}) + w_k tanh(v_{kn} x_n)
        = g(w', x') + w_k tanh(v_{kn} x_n)
Figure 1: Network with weights deleted.
where w' = (v_{10}, ..., v_{k-1,n-1}, w_0, ..., w_{k-1}), v_i' = (v_{i1}, ..., v_{i,n-1}), and x' = (x_1, ..., x_{n-1}). We will also fix w_k and v_{kn} to be constants. This is equivalent to working on a manifold whose codimension is the number of fixed and deleted weights. Now, g(w', x') is a network with n - 1 inputs and k - 1 hidden units (the network within the box in Fig. 1). Since reducible nets are nowhere dense in the parameter space, and the number of different parameter values with the same input-output map for an irreducible net is finite, we can always find an open subset of weights such that the input-output map is unique for all the parameter values in the subset. The input-output map of g(w', ·) is unique not only over the whole of R^{n-1} (as shown in Sussmann 1992) but also for any open set of x' values (in particular, the open set satisfying |x_i| < C, i = 1, ..., n - 1), because g(w', ·) is an analytic function. We can also choose the open subset of weights such that the boundary exists for all weights in the set, i.e., choose an open set of W' × X' such that the outputs are in the range of h : x_n ↦ -w_k tanh(v_{kn} x_n). When the output of f is zero, we have g(w', x') = -w_k tanh(v_{kn} x_n). Fix w_k and v_{kn} so that they are not adjustable. Then the boundary of f
is graph(tanh^{-1}(-g(w', ·)/w_k)/v_{kn}). We will show that this is unique for each w' in the open set of weights. Let R be the range of h, where h : x_n ↦ -w_k tanh(v_{kn} x_n). Because g(w', ·) is a continuous function, B := g^{-1}(w', R) is an open set of X'. The boundary of f can exist only for x' ∈ B. Since the input-output map is unique for g(w', ·) on the domain B for each w', graph(g) is unique (on the domain B) for each w'. This implies that the boundary of f is unique, since tanh^{-1} is a one-to-one function. So we have found a manifold of unique boundaries whose dimension is the number of weights in a net with n - 1 inputs and k - 1 hidden units. Theorem 10 then gives the desired result. □

Since a network with the standard sigmoid activation 1/(1 + e^{-x}) is equivalent to a tanh(x) net up to translation and change of coordinates of the weights, the same result holds for networks with the standard sigmoid activation. It would be interesting to know whether the VC dimension for an arbitrary open set of parameters (which does not necessarily include the origin) is also proportional to the number of parameters (when boundaries exist).

3.2 Other Activation Functions. Similar bounds can be found using the same techniques for networks with other analytic activation functions, provided the networks have unique input-output mappings up to a finite group of transformations when they are irreducible. It has been shown that odd activation functions that satisfy the independence property (IP) have this property (Albertini et al. 1993). For networks with no offset, the weak independence property (WIP) is sufficient.
Definition 14. The function σ : R → R satisfies the independence property (IP) if, for every positive integer l, for any nonzero real numbers b_1, ..., b_l, and for any real numbers β_1, ..., β_l for which

(b_i, β_i) ≠ ±(b_j, β_j) for all i ≠ j,

the functions

1, x ↦ σ(b_1 x + β_1), ..., x ↦ σ(b_l x + β_l)

are linearly independent (where x ∈ R). The function σ satisfies the weak independence property (WIP) if the above linear independence property holds for all pairs (b_i, β_i) with β_i = 0, i = 1, ..., l.

Obviously IP implies WIP. The following conditions for IP and WIP are from Albertini et al. (1993).

Lemma 15. If σ is a polynomial, WIP does not hold. If σ is odd, infinitely differentiable, and σ^{(k)}(0) ≠ 0 for an infinite number of values of k, then σ satisfies the property WIP.
Lemma 16. Assume that σ is a real-analytic function, and it extends to a function σ̃ : C → C analytic on a subset D ⊆ C of the form

D = {z ∈ C : |Im z| ≤ λ} \ {z_0, z̄_0}

for some 0 < λ < ∞. Here z̄_0 is the complex conjugate of z_0, Im z_0 = λ, and z_0 and z̄_0 are singularities; that is, there is a sequence z_n → z_0 such that |σ̃(z_n)| → ∞, and similarly for z̄_0. Then σ satisfies property IP.
Functions that satisfy the property IP include the tanh function considered earlier. Most rational functions also satisfy this property.
Theorem 17. Let F be the class of two-layer neural networks with k hidden units and an activation function that is odd, real analytic, and satisfies the property IP. If the input space is R^n, then VCdim(F_θ) ≥ p, where p is the number of weights in a network with n - 1 inputs and k - 1 hidden units. For a network with no offsets it is sufficient if the activation function is odd, real analytic, and satisfies WIP.

The proof is essentially the same as for the tanh activation function.

3.3 Radial Basis Functions. Another smoothly parameterized function class commonly used for classification is the radial basis function with a gaussian basis.
Definition 18. A k-term radial basis function with n inputs x = (x_1, ..., x_n) ∈ R^n, gaussian basis functions, and parameters w = (c_{11}, ..., c_{kn}, w_1, ..., w_k) ∈ R^W (where W = kn + k) is a function f : R^W × R^n → R given by

f(w, x) = Σ_{i=1}^{k} w_i exp(-||x - c_i||²)

where c_i = (c_{i1}, ..., c_{in}) are the centers, w_i ≠ 0, i = 1, ..., k, and ||·|| denotes the Euclidean norm.

The VC dimension of a k-term radial basis function has been shown to be at least k (Anthony and Holden 1993). This bound is tight when the centers are not adjustable. When the centers are adjustable, we give a lower bound of kn - n. First we will need a result on uniqueness of input-output mappings similar to that for the tanh network. It is well known (see, e.g., Powell 1987) that for any l > 0 and any input dimension, the functions

exp(-||x - c_1||²), ..., exp(-||x - c_l||²)

are linearly independent provided the centers are distinct. This implies that the input-output mappings are unique up to permutation of the centers if none of the w_i is zero.
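A sketch of the function class of Definition 18 (hypothetical code, not from the paper), together with a numerical check of the permutation symmetry just noted: reordering the (w_i, c_i) pairs leaves the input-output map unchanged, which is why uniqueness can hold only up to permutation of the centers.

```python
import math
import random

def rbf(w, centers, x):
    """f(w, x) = sum_i w_i * exp(-||x - c_i||^2)  (cf. Definition 18)."""
    return sum(
        wi * math.exp(-sum((xj - cj) ** 2 for xj, cj in zip(x, ci)))
        for wi, ci in zip(w, centers)
    )

random.seed(1)
n, k = 2, 4
w = [random.gauss(0, 1) for _ in range(k)]
centers = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(k)]

# Permute the (w_i, c_i) pairs: the input-output map is unchanged.
perm = [2, 0, 3, 1]
w_p = [w[i] for i in perm]
c_p = [centers[i] for i in perm]

x = [0.3, -0.7]
print(abs(rbf(w, centers, x) - rbf(w_p, c_p, x)) < 1e-12)
```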
Theorem 19. Let F be the class of k-term radial basis functions with gaussian basis functions. If the input space is R^n, then VCdim(F_θ) ≥ p, where p = kn - n is the number of parameters in a (k - 1)-term radial basis function with n - 1 inputs.
Proof. As in the proof of Theorem 13, we will work on a manifold formed by projecting onto the subspace where some parameters are either zero or constant. Set c_{i1} = 0 for i ≠ 1, c_{1j} = 0 for j ≠ 1, and fix w_1 and c_{11} to nonzero constants. So at a boundary we have

w_1 exp(-(x_1² - 2x_1 c_{11} + c_{11}² + Σ_{j=2}^{n} x_j²)) = -Σ_{i=2}^{k} w_i exp(-(x_1² + ||x' - c_i'||²))

where x' = (x_2, ..., x_n) and c_i' = (c_{i2}, ..., c_{in}), which implies

x_1 = (1/(2c_{11})) log[ -Σ_{i=2}^{k} w_i exp(-||x' - c_i'||²) / (w_1 exp(-(Σ_{j=2}^{n} x_j² + c_{11}²))) ]
The argument of the log function will be positive by the assumption that x lies on a boundary. Since the log function is one-to-one and (k - 1)-term radial basis functions have unique input-output maps, the boundaries are unique where they exist. The log and exp functions are continuous, so for x_1 in an open interval, the regions of input and (adjustable) parameter space where the boundaries exist are open sets. □

4 Conclusions
We have derived a relationship between the number of "useful" parameters in a class of smooth functions and its VC dimension. Using this relationship, we have obtained lower bounds on the VC dimension proportional to the number of parameters for neural networks with tanh activation functions when the weights and inputs are restricted to an arbitrarily small open set that includes the origin. It would be interesting to know if this is also true for any open set of parameters that does not include the origin (if the decision boundaries exist for the set of parameters; otherwise the VC dimension is trivially zero). To do that using the same techniques would require solving the boundary uniqueness problem under such conditions. We have also obtained lower bounds on
the VC dimension proportional to the number of parameters for networks with certain real analytic activation functions that are not necessarily sigmoids. For radial basis functions with a gaussian basis, we obtained bounds proportional to the number of parameters. We expect the same results to hold for other smooth radial basis functions with nonpolynomial basis (Anthony and Holden 1993; Powell 1987), but proving this using the same techniques would require proving boundary uniqueness for the functions.
Acknowledgments This research was supported by the Australian Research Council and the Australian Telecommunications and Electronics Research Board. We would like to thank Adam Kowalczyk for helpful comments.
References

Albertini, F., Sontag, E. D., and Maillot, V. 1993. Uniqueness of weights for neural networks. In Artificial Neural Networks for Speech and Vision, R. Mammone, ed., pp. 115-125. Chapman and Hall, London.
Anthony, M., and Holden, S. B. 1993. On the power of polynomial discriminators and radial basis function networks. Proc. Sixth Workshop Comp. Learning Theory, 158-164.
Bartlett, P. L. 1993. Lower bounds on the Vapnik-Chervonenkis dimension of multi-layer threshold networks. Proc. Sixth Workshop Comp. Learning Theory, 144-150.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach. 36(4), 929-965.
Boser, B., Guyon, I., and Vapnik, V. 1992. A training algorithm for optimal margin classifiers. Proc. Fifth Workshop Comp. Learning Theory, 144-152.
Dudley, R. M. 1978. Central limit theorems for empirical measures. Ann. Prob. 6(6), 899-929.
Goldberg, P., and Jerrum, M. 1993. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Proc. Sixth Workshop Comp. Learning Theory, 361-369.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Maass, W. 1993. Agnostic PAC-learning of functions on analog neural nets. Preprint, Graz, Austria.
Maass, W. 1994. Neural nets with superlinear VC-dimension. Neural Comp. 6, 877-884.
Powell, M. J. D. 1987. Radial basis functions for multivariable interpolation: A review. In Algorithms for Approximation, J. C. Mason and M. G. Cox, eds., pp. 143-167. Clarendon Press, Oxford.
Sakurai, A. 1993. Tighter bounds of the VC-dimension of three-layer networks. Proc. World Congress on Neural Networks.
Spivak, M. 1965. Calculus on Manifolds. Benjamin/Cummings, Menlo Park, CA.
Sussmann, H. J. 1992. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks 5, 589-593.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27(11), 1134-1143.
Received May 26, 1994; accepted December 6, 1994.
Communicated by Nicolo Cesa-Bianchi
Agnostic PAC Learning of Functions on Analog Neural Nets

Wolfgang Maass
Institute for Theoretical Computer Science, Technische Universitaet Graz, Klosterwiesgasse 32/2, A-8020 Graz, Austria
We consider learning on multilayer neural nets with piecewise polynomial activation functions and a fixed number k of numerical inputs. We exhibit arbitrarily large network architectures for which efficient and provably successful learning algorithms exist in the rather realistic refinement of Valiant's model for probably approximately correct learning ("PAC learning") where no a priori assumptions are required about the "target function" (agnostic learning), arbitrary noise is permitted in the training sample, and the target outputs as well as the network outputs may be arbitrary reals. The number of computation steps of the learning algorithm LEARN that we construct is bounded by a polynomial in the bit-length n of the fixed number of input variables, in the bound s for the allowed bit-length of weights, in 1/ε, where ε is some arbitrary given bound for the true error of the neural net after training, and in 1/δ, where δ is some arbitrary given bound for the probability that the learning algorithm fails for a randomly drawn training sample. However, the computation time of LEARN is exponential in the number of weights of the considered network architecture, and is therefore only of interest for neural nets of small size. This article provides details to the previously published extended abstract (Maass 1994).

1 Introduction
Neural Computation 7, 1054-1078 (1995) © 1995 Massachusetts Institute of Technology

The investigation of learning on multilayer feedforward neural nets has become a large and fruitful research area. It would be desirable to develop also an adequate theory of learning on neural nets that helps us to understand and predict the outcomes of experiments. The most commonly considered theoretical framework for learning on neural nets is Valiant's model (Valiant 1984) for probably approximately correct learning ("PAC learning"). In this model one can analyze both the required number of training examples (the "sample complexity") and the required number of computation steps for learning on neural nets. With regard to sample complexity the theoretical investigation of PAC learning on neural nets has been rather successful. It has led to the discovery of an essential mathematical parameter of each neural net N: the
Vapnik-Chervonenkis dimension of N, commonly referred to as the VC dimension of N. The VC dimension of N determines the number of randomly drawn training examples that are needed in the PAC model to train N (Blumer et al. 1989). It has been shown that the VC dimension of any feedforward neural net N with linear threshold gates and w weights can be bounded by O(w log w) (Cover 1968; Baum and Haussler 1989). Recently it has also been shown that this upper bound is optimal in the sense that there are arbitrarily large neural nets N with w weights whose VC dimension is bounded from below by Ω(w log w) (Maass 1993). Since the PAC model is a worst case model with regard to the choice of the distribution on the examples, it predicts bounds for the sample complexity that tend to be somewhat too large in comparison with experimental results. The quoted upper bound for the VC dimension of a neural net implies that the sample complexity provides no obstacle for efficient (i.e., polynomial time) learning on neural nets in Valiant's PAC model. However, a number of negative results due to Judd (1990), Blum and Rivest (1988), and Kearns and Valiant (1989) show that, even for arrays (N_n) of very simple multilayer feedforward neural nets (where the number of nodes in N_n is polynomially related to the parameter n), in the PAC model there are no learning algorithms for N_n whose number of computation steps can be bounded by a polynomial in n. Although these negative results are based on unproven conjectures from computational complexity theory, such as NP ≠ RP, they have effectively halted the further theoretical investigation of learning algorithms for multilayer neural nets within the framework of the PAC model. A closer look shows that the type of asymptotic analysis that has been carried out for these negative results is not the only one possible.
In fact, a different kind of asymptotic analysis appears to be more adequate for a theoretical analysis of learning on relatively small neural nets with analog (i.e., numerical) inputs. We propose to investigate PAC learning on a fixed neural net N with a fixed number k of numerical inputs (for example, k sensory data). The asymptotic question that we consider is whether N can learn any target function with arbitrary precision if sufficiently many randomly drawn training examples are provided. More precisely, we consider the question whether there exists an efficient learning algorithm for N whose number of computation steps can be bounded by a polynomial in the bit-length n of the k numerical inputs, a bound s for the allowed bit-length of weights, as well as 1/ε, where ε is an arbitrary given bound for the true error of N after the training, and 1/δ, where δ is an arbitrary given bound for the probability that the training fails for a randomly drawn sample. In this paper, we simultaneously turn to a more realistic refinement of the PAC model that is essentially due to Haussler (1992) and that was further developed by Kearns et al. (1990). This refinement of the PAC model is more adequate for the analysis of learning on neural nets,
1056
Wolfgang Maass
since it requires no unrealistic a priori assumptions about the nature of the "target concept" or "target function" that the neural net is supposed to learn ("agnostic learning"), and it allows for arbitrary noise in the sample. Furthermore it allows us to consider situations where both the target outputs in the sample and the actual outputs of the neural net are arbitrary real numbers (instead of boolean values). Hence in contrast to the regular PAC model we can also investigate in this more flexible framework the learning (and approximation) of complicated real valued functions by a neural net. In Definitions 1.1 and 1.2 we will give a precise definition of the type of neural network models that we consider in this paper: high order multilayer feedforward neural nets with piecewise polynomial activation functions. In Definition 2.2 we will give a precise definition of the refinement of the PAC learning model that we consider in this paper. We will show in Theorem 2.5 that, even in the stronger version of PAC learning considered here, the required number of training examples provides no obstacle to efficient learning. This is demonstrated by giving an upper bound for the pseudo-dimension dim_P(F) of the associated function class F. It was previously shown by Haussler (1992) that for the learning of classes of functions with nonbinary outputs the pseudo-dimension plays a role that is similar to the role of the VC dimension for the learning of concepts. We will prove in Theorem 2.1 that for arbitrarily complex first-order neural nets N with piecewise linear activation functions there exists an efficient and provably successful learning algorithm for N. This positive result is extended to high order neural nets with piecewise polynomial activation functions in Theorem 3.1. One should note that these results do not show that there exists an efficient learning algorithm for every neural net.
Rather they exhibit a special class of neural nets N̂ for which there exist efficient learning algorithms. This special class of neural nets N̂ is "universal" in the sense that for every high order neural net N with piecewise polynomial activation functions there exists a somewhat larger neural net N̂ in this class such that every function computable on N is also computable on N̂. Hence our positive results about efficient and provably successful learning on neural nets can in principle be applied to real-life learning problems in the following way. One first chooses a neural net N that is powerful enough to compute, respectively approximate, those functions or distributions that are potentially to be learned. One then goes to a somewhat larger neural net N̂ that can simulate N and that has the previously mentioned special structure that allows us to design an efficient learning algorithm for N̂. One then trains N̂ with a randomly drawn sample. The previously described transition from N to N̂ provides a curious theoretical counterpart to a recipe that is frequently recommended by practitioners as a way to reduce the chance that backpropagation gets stuck in local minima: to carry out such training on a neural net that has
Agnostic PAC Learning
1057
somewhat more units than necessary for computing the desired target functions (Rumelhart and McClelland 1986; Lippmann 1987). The positive learning results of Theorems 2.1 and 3.1 are also of interest from the more general point of view of computational learning theory. Learnability in the refinement of the PAC model considered here for "agnostic learning" (i.e., learning without a priori assumptions about the target concept) is a rather strong property. In fact, this property is so strong that there exist hardly any positive results for learning with interesting concept classes and function classes as hypotheses in this model. Even some of the relatively few interesting concept classes that are learnable in the usual PAC model (such as monomials of boolean variables) lead to negative results in this refinement of the PAC learning model (Kearns et al. 1992). Hence it is a rather noteworthy fact that function classes defined by arbitrarily complex analog neural nets yield positive results in this refined version of the PAC model. One should note, however, that the asymptotic analysis that we use here for the investigation of learning on neural nets is orthogonal to the one underlying the quoted negative result for agnostic PAC learning with monomials (there one assumes that the number of input variables goes to infinity). Hence one should not interpret our result as saying that learning with hypotheses defined by analog neural nets is easier than learning with monomials (or other boolean formulas) as hypotheses. Our result shows that learning with a fixed number of numerical inputs is provably feasible on a multilayer neural net, whereas boolean formulas such as monomials are not suitable for dealing with numerical inputs, and it makes no sense to carry out an asymptotic analysis of learning with a fixed number of boolean inputs (since there exist then only finitely many different hypotheses). Definition 1.1.
A network architecture (or "neural net") N of order v with k input nodes and l output nodes is a labeled acyclic directed graph (V, E). It has k nodes with fan-in 0 ("input nodes") that are labeled by 1, …, k, and l nodes with fan-out 0 ("output nodes") that are labeled by 1, …, l. Each node g of fan-in r > 0 is called a computation node (or gate), and is labeled by some activation function γ_g : R → R and some polynomial I_g(y_1, …, y_r) of degree ≤ v. We assume that the ranges of activation functions of output nodes in N are bounded. The coefficients of all polynomials I_g(y_1, …, y_r) for gates g in N are called the programmable parameters of N. Assume that N has w programmable parameters, and that some numbering of these has been fixed. Then each assignment α ∈ R^w of reals to the programmable parameters of N defines an analog circuit N^α, which computes a function x ↦ N^α(x) from R^k into R^l in the following way: Assume that some input x ∈ R^k has been assigned to the input nodes of N. If a gate g in N has r immediate predecessors in (V, E) which output y_1, …, y_r ∈ R, then g outputs γ_g[I_g(y_1, …, y_r)].
Any parameters that occur in the definitions of the activation functions γ_g of N are referred to as architectural parameters of N.

Definition 1.2. A function γ : R → R is called piecewise polynomial if there are thresholds t_1, …, t_s ∈ R and polynomials P_0, …, P_s such that t_1 < … < t_s and for each i ∈ {0, …, s}: t_i ≤ x < t_{i+1} ⇒ γ(x) = P_i(x) (we set t_0 := −∞ and t_{s+1} := ∞). We refer to t_1, …, t_s together with all coefficients in the polynomials P_0, …, P_s as the parameters of γ. If the polynomials P_0, …, P_s are of degree ≤ 1 then we call γ piecewise linear. Note that we do not require that γ is continuous (or monotone).

2 Learning on Neural Nets with Piecewise Linear Activation Functions
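A piecewise polynomial function in the sense of Definition 1.2 is straightforward to evaluate directly; the following sketch (an illustration, not code from the paper) represents each polynomial P_i by its coefficient list.

```python
import bisect

def piecewise_poly(thresholds, coeff_lists):
    # thresholds t_1 < ... < t_s; coeff_lists gives P_0, ..., P_s as
    # coefficient lists [c_0, c_1, ...]; P_i is applied on [t_i, t_{i+1})
    # with t_0 = -inf and t_{s+1} = +inf.  Continuity is NOT required.
    assert len(coeff_lists) == len(thresholds) + 1
    def gamma(x):
        i = bisect.bisect_right(thresholds, x)   # number of thresholds <= x
        return sum(c * x ** k for k, c in enumerate(coeff_lists[i]))
    return gamma

# A piecewise linear example: the "ramp" x -> max(0, min(1, x)),
# with thresholds t_1 = 0, t_2 = 1 and pieces P_0 = 0, P_1 = x, P_2 = 1.
ramp = piecewise_poly([0.0, 1.0], [[0.0], [0.0, 1.0], [1.0]])
```

With degree ≤ 1 coefficient lists, as here, the resulting γ is piecewise linear in the sense of the definition.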
We show in this section that for any network architecture N with piecewise linear activation functions there exists another network architecture N̂ that not only can compute, but also learn any function f : R^k → R^l that can be computed by N. The only difference between N and N̂ is that each computation node in N̂ has fan-out ≤ 1 (i.e., the computation nodes of N̂ form a tree, but there is no restriction on the fan-out of input nodes), whereas the nodes in N may have arbitrary fan-out. If N has only one output node and depth ≤ 2 (i.e., N has at most one layer of "hidden units") then one can set N̂ := N. For a general network architecture one applies the standard construction for transforming a directed acyclic graph into a tree. The construction of N̂ from N proceeds recursively from the output level towards the input level: every computation node ν with fan-out m > 1 is replaced by m nodes with fan-out 1, which all use the same activation function as ν and which all get the same input as ν. It is obvious that for this classical construction from circuit theory (Savage 1976) the depth of N̂ is the same as the depth of N. To bound the size (i.e., number of gates) of N̂, we first note that the fan-out of the input nodes does not have to be changed. Hence the transformation of the directed acyclic graph of N into a tree is only applied to the subgraph of depth depth(N) − 1, which one gets from N by removing its input nodes. Furthermore one can easily see that the transformation does not increase the fan-in of any node. Obviously the fan-in of any gate in N is bounded by size(N) − 1. Therefore the tree that provides the graph-theoretic structure for N̂ has, in addition to its k input nodes, up to Σ_{d=1}^{depth(N)} size(N)^d ≤ size(N)^{depth(N)+1}/(size(N) − 1) computation nodes. Hence for bounded depth the increase in size is polynomially bounded. Let Q_n be the set of rational numbers that can be written as quotients of integers with bit-length ≤ n.
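The recursive duplication described above can be sketched in a few lines (a toy illustration with hypothetical node names; in the real construction each duplicated node also carries along its activation function and polynomial).

```python
def tree_size(preds, v):
    # number of nodes in the tree obtained by duplicating every
    # computation node of the DAG along each path to the output node v
    return 1 + sum(tree_size(preds, u) for u in preds[v])

def tree_depth(preds, v):
    # the duplication leaves the depth unchanged
    return 1 + max((tree_depth(preds, u) for u in preds[v]), default=0)

# Diamond-shaped net: output g3 reads g1 and g2, which both read g0
# (input nodes are omitted, since their fan-out need not be changed).
preds = {"g0": [], "g1": ["g0"], "g2": ["g0"], "g3": ["g1", "g2"]}
size_of_tree = tree_size(preds, "g3")    # g0 gets duplicated once
depth_of_tree = tree_depth(preds, "g3")
```

Here the diamond with 4 computation nodes becomes a tree with 5 nodes (g0 appears twice), while the depth stays 3, matching the geometric-sum size bound in the text.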
Let F : R^k → R^l be some arbitrary function, which we will view as a "prediction rule." For any given instance (x, y) ∈ R^k × R^l we measure
the error of F by ||F(x) − y||_1, where ||(z_1, …, z_l)||_1 := Σ_{i=1}^{l} |z_i|. For any distribution A over some subset of R^k × R^l we measure the true error of F with regard to A by E_{(x,y)∈A}[||F(x) − y||_1], i.e., the expected value of the error of F with respect to distribution A.
Theorem 2.1. Let N be an arbitrary network architecture of first order (i.e., v := 1) with k input nodes and l output nodes, and let N̂ be the associated network architecture as defined above. We assume that all activation functions in N are piecewise linear with architectural parameters from Q. Let B ⊆ R be an arbitrary bounded set. Then there exists a polynomial m(1/ε, 1/δ) and a learning algorithm LEARN such that for any given s, n ∈ N and any distribution A over Q_n^k × (Q_n ∩ B)^l the following holds: For a sample ζ = ((x_i, y_i))_{i=1,…,m} of m ≥ m(1/ε, 1/δ) examples that are drawn independently according to A, the algorithm LEARN computes from ζ, s, n in polynomially in m, s, and n many computation steps an assignment α̂ of rational numbers to the programmable parameters of the associated network architecture N̂ such that

E_{(x,y)∈A}[ ||N̂^α̂(x) − y||_1 ] ≤ inf_{α∈Q_s^w} E_{(x,y)∈A}[ ||N^α(x) − y||_1 ] + ε

with probability ≥ 1 − δ (with regard to the random drawing of ζ).
Consider the special case where the distribution A over Q_n^k × (Q_n ∩ B)^l is of the form

A((x, y)) = D(x) if y = N^{α_T}(x), and A((x, y)) = 0 otherwise,

for some arbitrary distribution D over the domain Q_n^k and some arbitrary "target weights" α_T ∈ Q_s^w. Then the term

inf_{α∈Q_s^w} E_{(x,y)∈A}[ ||N^α(x) − y||_1 ]

is equal to 0. Hence the preceding theorem implies that with the learning algorithm LEARN the "learning network" N̂ can "learn" with arbitrarily small true error any target function N^{α_T} that is computable on N with rational "weights" α_T. Thus by choosing N sufficiently large, one can guarantee that N̂ can learn any target function that might arise in the context of a specific learning problem. In addition the theorem also applies to the quite realistic situation where the learner receives examples (x, y) of the form (x, N^{α_T}(x) + noise), or even if there exists no "target function" N^{α_T} that would "explain" the actual distribution A of examples (x, y) ("agnostic learning"). Before we give the proof of Theorem 2.1 we first show that its claim may be viewed as a learning result within a refinement of Valiant's PAC model (Valiant 1984). This refined version of the PAC model (essentially
due to Haussler 1992) is better applicable to real world learning situations than the usual PAC model:

• It makes no a priori assumptions about the existence of a "target concept" or "target function" of a specific type that explains the empirical data (i.e., the "sample").

• It allows for arbitrary "noise" in the sample (however, it does not attempt to remove the "noise"; instead it models the distribution including the "noise").

• It is not restricted to the learning of "concepts" (i.e., 0-1 valued functions), since it allows arbitrary real numbers as predictions of the learner and as target outputs in the sample. Hence it is, for example, also applicable for investigating learning (and approximation) of complicated real valued functions.

Of course one cannot expect miracles from a learner in such a real-world learning situation. It is in general impossible for him to produce a hypothesis with arbitrarily small true error with regard to the distribution A. This is clearly the case if the distribution A produces inconsistent data, or if A is generated by a target function (with added noise) that is substantially more complicated than any hypothesis function that the learner could possibly produce within his limited resources (e.g., with a fixed neural network architecture). Hence the best that one can expect from the learner is that he produces a hypothesis h whose true error with regard to A is almost optimal in comparison with all possible hypotheses from a certain pool T (the "touchstone class" in the terminology of Kearns et al. 1992). This provides the motivation for the following definition, which slightly generalizes those in Haussler (1992) and Kearns et al. (1992).

Definition 2.2. Let A = ∪_{n∈N} A_n be an arbitrary set of distributions over finite subsets of Q^k × Q^l such that for any n ∈ N the bit-length of any point (x, y) that is drawn according to a distribution A ∈ A_n is bounded by a polynomial in n.
Let T = (T_s)_{s∈N} be an arbitrary family of functions from R^k into R^l (with some fixed representation system) such that any f ∈ T_s has a representation whose bit-length is bounded by some polynomial in s. Let H be some arbitrary class of functions from R^k into R^l. One says that T is efficiently learnable by H assuming A if there is an algorithm LEARN and a function m(ε, δ, s, n) that is bounded by a polynomial in 1/ε, 1/δ, s, and n such that for any ε, δ ∈ (0, 1) and any natural numbers s, n the following holds: If one draws independently m ≥ m(ε, δ, s, n) examples according to some arbitrary distribution A ∈ A_n, then LEARN computes from such a sample, with a number of computation steps that is polynomial in the parameter s and the bit-length of the sample, the representation of some h ∈ H which has with probability ≥ 1 − δ the property

E_{(x,y)∈A}[ ||h(x) − y||_1 ] ≤ inf_{f∈T_s} E_{(x,y)∈A}[ ||f(x) − y||_1 ] + ε.
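Outside the paper's specific setting, the flavor of the guarantee in Definition 2.2 can be illustrated for a finite hypothesis pool by plain empirical risk minimization: pick the hypothesis with the smallest average L1 error on the sample. The sample and pool below are hypothetical toy choices.

```python
def empirical_l1_error(h, sample):
    # (1/m) * sum_i |h(x_i) - y_i| for scalar-valued hypotheses
    return sum(abs(h(x) - y) for x, y in sample) / len(sample)

def erm(pool, sample):
    # empirical risk minimization: return the hypothesis in the
    # finite pool with the smallest average L1 error on the sample
    return min(pool, key=lambda h: empirical_l1_error(h, sample))

# Hypothetical noisy sample near y = x, and three candidate rules.
sample = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1)]
pool = [lambda x: 0.0, lambda x: x, lambda x: 2.0 * x]
best = erm(pool, sample)
```

No hypothesis in the pool fits the noisy sample exactly; in the agnostic spirit, the learner simply returns the one whose error is (almost) optimal within the pool.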
… 1 in N̂. Thus altogether at most m^{O(1)} different systems L_j of linear inequalities have to be considered. Hence the algorithm LEARN generates, for each of the polynomially in m many partitions of x_1, …, x_m that arise in the previously described fashion from thresholds between linear pieces of activation functions of gates in N̂, and for each assignment of weights from {−1, 0, 1} to edges between computation nodes in N̂, a separate system L_j of linear inequalities, for j = 1, …, p(m). By construction one can bound p(m) by a polynomial in m (if the size of N̂ can be viewed as a constant). We now expand each of the systems L_j [which has only O(1) variables] into a linear programming problem LP_j with O(m) variables (it should be noted that it is essential that these 2m additional variables were not yet present in our preceding consideration, since otherwise we would have arrived at exponentially in m many systems of linear inequalities L_j). We add to L_j, for each of the l output nodes ν of N̂, 2m new variables u_i^ν, v_i^ν for i = 1, …, m, and the 4m inequalities

t_i^ν(S) ≤ (y_i)_ν + u_i^ν,    t_i^ν(S) ≥ (y_i)_ν − v_i^ν,
u_i^ν ≥ 0,                     v_i^ν ≥ 0.
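These slack variables are the standard linear-programming encoding of an L1 objective: for a fixed net output t and target y, the smallest feasible values are u = max(0, t − y) and v = max(0, y − t), so that u + v = |t − y|. A small self-check of this fact (an illustration with hypothetical values, not the paper's code):

```python
def l1_slacks(t, y):
    # smallest u, v with t <= y + u, t >= y - v, u >= 0, v >= 0
    u = max(0.0, t - y)
    v = max(0.0, y - t)
    return u, v

outputs = [0.5, -1.0, 2.0]   # hypothetical net outputs t_i
targets = [1.0, -1.0, 0.5]   # hypothetical sample targets y_i
cost = 0.0
for t, y in zip(outputs, targets):
    u, v = l1_slacks(t, y)
    # the four inequalities added to the linear program, per i and output node
    assert t <= y + u and t >= y - v and u >= 0.0 and v >= 0.0
    cost += u + v            # LP objective: sum of the slacks u_i + v_i
# cost now equals sum_i |t_i - y_i|
```

Minimizing the linear objective Σ (u_i^ν + v_i^ν) subject to these linear constraints therefore minimizes the total L1 training error, while keeping the whole problem linear in the variables.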
… > 0 for all remaining scaling parameters c in c. It follows that for these values of c, β each term that represents the input of some gate g in N̂[c]^β for some network input from Q_n^k has a value in Q_ρ, for a parameter ρ that is bounded by a polynomial in s, size(N), depth(N), k, and n. Hence whenever the input s_1 of some gate g in N̂[c]^β satisfies for some network input from Q_n^k the strict inequality "s_1 < s_2" (for some threshold s_2 of this gate g), the inequality "s_1 + 2^{−ρ} ≤ s_2" is also satisfied. Analogously each scaling parameter c > 0 in c satisfies c ≥ 2^{−ρ}. These observations imply that the values for the parameters c, β that result by the transformation from α' give rise to a feasible solution for one of the linear programming problems LP_j, for some j ∈ {1, …, p(m)}. The cost

Σ_{i=1}^{m} Σ_{ν output node of N̂} (u_i^ν + v_i^ν)

of this feasible solution can be chosen to be Σ_{i=1}^{m} ||N^{α'}(x_i) − y_i||_1 (for each i, ν set at least one of u_i^ν, v_i^ν equal to 0). This implies that the optimal solution of LP_j has a cost of at most Σ_{i=1}^{m} ||N^{α'}(x_i) − y_i||_1. Hence we have Σ_{i=1}^{m} ||N̂^α̂(x_i) − y_i||_1 ≤ Σ_{i=1}^{m} ||N^{α'}(x_i) − y_i||_1 by the definition of algorithm LEARN. Therefore the desired inequality 2.2 follows from 2.3. This completes the proof of Theorem 2.1. □

Remark 2.7. The algorithm LEARN can be sped up substantially on a parallel machine. Furthermore, if the individual processors of the parallel machine are allowed to use random bits, hardly any global control is required for this parallel computation. The number of processors that
are needed can be bounded by m^{O(w²)} · poly(n, s). Each processor picks at random one of the systems L_j of linear inequalities and solves the corresponding linear programming problem LP_j. Then the parallel machine compares in a "competitive phase" the costs Σ_{i=1}^{m} ||h_j(x_i) − y_i||_1 of the solutions h_j that have been computed by the individual processors. It outputs the weights α̂ for N̂ that correspond to the best one of these solutions h_j.
In this parallelized version of LEARN the only interaction between individual processors occurs in the competitive phase. Even without any coordination between individual processors one can ensure that with high probability each of the relevant linear programming problems LP_j for j = 1, …, p(m) is solved by at least one of the individual processors, provided that there are slightly more than p(m) such processors with random bits. Each processor simply picks at random one of the problems LP_j and solves it. It turns out that the computation time of each individual processor (and hence the parallel computation time of LEARN) is polynomial in m and in the total number w of weights in N̂. The construction of the systems L_j [for j = 1, …, p(m)] in the proof of Theorem 2.1 implies that only polynomially in m and w many random bits are needed to choose randomly one of the linear programming problems LP_j, j = 1, …, p(m). Furthermore, with the help of some polynomial time algorithm for linear programming each problem LP_j can be solved with polynomially in m and w many computation steps. The total number of processors for this parallel version of LEARN is simply exponential in w. However, even on a parallel machine with fewer processors the same randomized parallel algorithm gives rise to a rather interesting heuristic learning algorithm. Such a "scaled-down" version of LEARN is no longer guaranteed to find probably an approximately optimal weight setting in the strict sense of the PAC learning model. However, it may provide satisfactory performance for a real world learning problem in case that not only a single one, but a certain fraction of all linear programming problems LP_j, yields for this learning problem a satisfactory solution.
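In the spirit of the competitive phase just described, selecting the best of many independently generated candidate hypotheses by their empirical L1 cost can be sketched as follows (hypothetical random slopes stand in for the actual LP solutions):

```python
import random

def l1_cost(h, sample):
    # total L1 error of hypothesis h on the training sample
    return sum(abs(h(x) - y) for x, y in sample)

def competitive_phase(candidates, sample):
    # keep the candidate with the smallest empirical L1 cost
    return min(candidates, key=lambda h: l1_cost(h, sample))

random.seed(0)
sample = [(float(x), 3.0 * float(x)) for x in range(5)]
# each "processor" contributes one randomly chosen candidate slope
candidates = [lambda x, a=random.uniform(0.0, 4.0): a * x
              for _ in range(50)]
best = competitive_phase(candidates, sample)
```

The candidates are generated with no coordination at all; only the final comparison of costs requires global interaction, mirroring the structure of the parallelized LEARN.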
One may compare this heuristic consideration with the somewhat analogous situation for backpropagation, where one hopes that for a certain fraction of randomly chosen initial settings of the weights one is reasonably close to a global minimum of the error function.
3 Learning on Neural Nets with Piecewise Polynomial Activation Functions
In this section we extend the learning result from Section 2 to high order network architectures with piecewise polynomial activation functions.
Theorem 3.1. Let N be some arbitrary high order network architecture with k inputs and l outputs. We assume that all activation functions of gates in N are piecewise polynomial with architectural parameters from Q. Then one can construct an associated first-order network architecture N̂ with activation functions from the class {Heaviside, x ↦ x, x ↦ x²} such that the same learning property as in Theorem 2.1 holds.

Remark 3.2. Analogously to Remark 2.3 (d) one can also formulate the result of Theorem 3.1 in terms of the strong version of the PAC learning model from Definition 2.2. Furthermore, on a parallel machine one can speed up the learning algorithm that is constructed in the proof of Theorem 3.1 in the same fashion as described in Remark 2.7 for the piecewise linear case.

Proof of Theorem 3.1. The only difference to the proof of Theorem 2.1 lies in the different construction of the "learning network" N̂. One can easily see that because of the binomial formula y·z = ½[(y + z)² − y² − z²] all high order gates in N can be replaced by first-order gates through the introduction of new first-order intermediate gates with activation function x ↦ x². Nevertheless the construction of N̂ is substantially more difficult compared with the construction in the preceding section. Piecewise polynomial activation functions of degree > 1 give rise to a new source of nonlinearity when one tries to describe the role of the programmable parameters by a system of inequalities. Assume for example that g is a gate on level 1 with input α_1x_1 + α_2x_2 and activation function γ(y) = y². Then this gate g outputs α_1²x_1² + 2α_1α_2x_1x_2 + α_2²x_2². Hence the variables α_1, α_2 will not occur linearly in an inequality that describes the comparison of the output of g with some threshold of a gate at the next level. This example shows that it does not suffice to push all nontrivial weights to the first level.
Instead one has to employ a more complex network construction that was introduced for a different purpose (it had been introduced to get an a priori bound for the size of weights in the proof of Theorem 3.1 in Maass 1993; see Maass 1995c for a complete version). That construction does not ensure that the output of the network architecture N̂ is, for all values of its programmable parameters, contained in [b_1, b_2]^l if the ranges of the activation functions of all output gates of N are contained in [b_1, b_2]. Therefore we supplement the network architecture from the proof of Theorem 3.1 in Maass (1993) by adding after each output gate of that network a subcircuit that computes the function
z ↦ b_1 if z < b_1,
z ↦ z if b_1 ≤ z ≤ b_2,
z ↦ b_2 if z > b_2.
This subcircuit can be realized with gates that use the Heaviside activation function, gates with the activation function x ↦ x, and "virtual gates" that compute the product (y, z) ↦ y·z. These "virtual gates"
can be realized with the help of 3 gates with activation function x ↦ x² via the binomial formula (see above). The parameters b_1, b_2 of this subcircuit are treated like architectural parameters in the subsequent linear programming approach, since we want to keep them fixed. Regarding the size of the resulting network architecture N̂ we would like to mention that the number of gates in N̂ is bounded by a polynomial in the number of gates in N and the number of polynomial pieces of the activation functions in N, provided that the depth of N, the order of gates in N, and the degrees of the polynomial pieces of the activation functions in N are bounded by a constant. The key point of the resulting network architecture N̂ is that for fixed network inputs the conditions on the programmable parameters of N̂ can be expressed by linear inequalities, and that any function that is computable on N is also computable on N̂. Apart from the different construction of N̂ the definition and the analysis of the algorithm LEARN proceed analogously as in the proof of Theorem 2.1. Only the parameter ρ is defined here slightly differently, by ρ := size(N̂)·(n + s)·3^{depth(N̂)}. If one assumes that all architectural parameters of N as well as b_1, b_2 are from Q_s, one can show that any function h : R^k → R^l that is computable on N with programmable parameters from Q_s can be computed on N̂ with programmable parameters from Q_{s·3^{depth(N̂)}}. Furthermore, any linear inequality "s_1 < s_2" that arises in the description of this computation of h on N̂ for an input from Q_n^k (where s_1, s_2 are gate inputs and thresholds, respectively) can be replaced by the stronger statement "s_1 + 2^{−ρ} ≤ s_2". This observation justifies the use of the parameter ρ in the linear programming problems that occur in the design of the algorithm LEARN.
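The "virtual" product gate is exactly the quoted binomial identity y·z = ½[(y + z)² − y² − z²], realized by three squaring gates; a minimal sketch:

```python
def square_gate(x):
    # a first-order gate with activation function x -> x^2
    return x * x

def product_gate(y, z):
    # "virtual gate" (y, z) -> y * z built from three square gates,
    # via the binomial identity y*z = ((y+z)^2 - y^2 - z^2) / 2
    return 0.5 * (square_gate(y + z) - square_gate(y) - square_gate(z))
```

Note that each squaring gate takes a first-order (linear) combination of its inputs, so the multiplication is obtained without any gate of order greater than one.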
Note that in contrast to the proof of Theorem 2.1 there are no scaling factors involved in these linear programming problems (because of the different design of N̂). Since N̂ contains gates with the Heaviside activation function, the algorithm LEARN has to solve not only one, but polynomially in m many linear programming problems (analogously as in the proof of Theorem 2.1). □
4 Conclusion
It has been shown in this paper that positive theoretical results about efficient PAC learning on neural nets are still possible, in spite of the well-known negative results about learning of boolean functions with many input variables (Judd 1990; Blum and Rivest 1988; Kearns and Valiant 1989). Those negative results carried over the traditional asymptotic analysis of algorithms for digital computation, in which one assumes that the number n of boolean input variables goes to infinity. However, this analysis is not quite adequate for many applications of
neural nets, where one considers a fixed neural net and the input is given in the form of relatively few analog inputs (e.g., sensory data). In addition, for many practical applications of neural nets the number of input variables is first reduced by suitable preprocessing methods. For such applications of neural nets we have shown in this paper that efficient and provably successful learning is possible, even in the most demanding refinement of the PAC learning model. In this most realistic version of the PAC learning model no a priori assumptions are required about the nature of the "target function," and arbitrary noise in the input data is permitted. Furthermore, this learning model is not restricted to neural nets with boolean output. Hence our positive learning results are also applicable to the learning and approximation of complicated real valued functions, such as occur, for example, in process control. The proofs of the main theorems of this paper (Theorems 2.1 and 3.1) employ rather sophisticated results from statistics and algebraic geometry to provide a bound not just for the apparent error (i.e., the error on the training set) of the trained neural net, but also for its true error (i.e., its error on new examples from the same distribution). In addition, these positive learning results employ rather nontrivial variable transformation techniques to reduce the nonlinear optimization problem for the weights of the considered multilayer neural nets to a family of linear programming problems. The new learning algorithm LEARN that we introduce solves all of these linear programming problems, and then takes their best solution to compute the desired assignment of weights for the trained neural net.
This paper has introduced another idea into the theoretical analysis of learning on neural nets that promises to bear further fruit: rather than insisting on designing an efficient learning algorithm for every neural net, we design learning algorithms for a subclass of neural nets N̂ whose architecture is particularly suitable for learning. This may not be quite what we want, but it suffices as long as there are arbitrarily "powerful" network architectures N̂ that support our learning algorithm. It is likely that this idea can be pursued further with the goal of identifying more sophisticated types of special network architectures that admit very fast learning algorithms.
Acknowledgments
I would like to thank Peter Auer, Phil Long, Hal White, and two anonymous referees for their helpful comments. References Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Blum, A., and Rivest, R. L. 1988. Training a 3-node neural network is NP-complete. In Proceedings of the 1988 Workshop on Computational Learning Theory, pp. 9-18. Morgan Kaufmann, San Mateo, CA.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36(4), 929-965.
Cover, T. M. 1968. Capacity problems for linear machines. In Pattern Recognition, L. Kanal, ed., pp. 283-289. Thompson Book Co., Washington, DC.
Goldberg, P., and Jerrum, M. 1993. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Proc. of the 6th Annual ACM Conference on Computational Learning Theory, 361-369. ACM Press, New York, NY.
Haussler, D. 1992. Decision theoretic generalizations of the PAC model for neural nets and other learning applications. Inform. Comp. 100, 78-150.
Haussler, D., Kearns, M., Littlestone, N., and Schapire, R. E. 1991. Equivalence of models for polynomial learnability. Inform. Comp. 95, 129-161.
Judd, J. S. 1990. Neural Network Design and the Complexity of Learning. MIT Press, Cambridge, MA.
Kearns, M., and Schapire, R. E. 1990. Efficient distribution-free learning of probabilistic concepts. Proc. 31st IEEE Symp. Foundations Comp. Sci., 382-391.
Kearns, M., and Valiant, L. 1989. Cryptographic limitations on learning boolean formulae and finite automata. Proc. 21st ACM Symp. Theory Comp., 433-444.
Kearns, M. J., Schapire, R. E., and Sellie, L. M. 1992. Toward efficient agnostic learning. Proc. 5th ACM Workshop Comp. Learning Theory, 341-352.
Lippmann, R. P. 1987. An introduction to computing with neural nets. IEEE ASSP Mag., 4-22.
Maass, W. 1992. Bounds for the Computational Power and Learning Complexity of Analog Neural Nets. IIG-Report 349, Technische Universität Graz.
Maass, W. 1993. Bounds for the computational power and learning complexity of analog neural nets (extended abstract). Proc. 25th ACM Symp. Theory Comput., 335-344.
Maass, W. 1994.
Agnostic PAC-learning of functions on analog neural nets (extended abstract). In Advances in Neural Information Processing Systems, Vol. 6, pp. 311-318. Morgan Kaufmann, San Mateo, CA.
Maass, W. 1995a. Perspectives of current research about the complexity of learning on neural nets. In Theoretical Advances in Neural Computation and Learning, V. P. Roychowdhury, K. Y. Siu, and A. Orlitsky, eds., pp. 295-336. Kluwer Academic Publishers, Boston.
Maass, W. 1995b. Vapnik-Chervonenkis dimension of neural nets. In Handbook of Brain Theory and Neural Networks, M. A. Arbib, ed. MIT Press, Cambridge, MA (in press).
Maass, W. 1995c. Computing on analog neural nets with arbitrary real weights. In Theoretical Advances in Neural Computation and Learning, V. P. Roychowdhury, K. Y. Siu, and A. Orlitsky, eds., pp. 153-172. Kluwer Academic Publishers, Boston, MA.
Milnor, J. 1964. On the Betti numbers of real varieties. Proc. Am. Math. Soc. 15, 275-280.
Papadimitriou, C. H., and Steiglitz, K. 1982. Combinatorial Optimization: Algorithms and Complexity. Prentice Hall, Englewood Cliffs, NJ.
Pollard, D. 1990. Empirical Processes: Theory and Applications. NSF-CBMS Regional Conf. Ser. Prob. Statist. 2.
Renegar, J. 1992. On the computational complexity and geometry of the first order theory of the reals, Part I. J. Symbolic Comp. 13, 255-299.
Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, Cambridge, MA.
Savage, J. E. 1976. The Complexity of Computing. Wiley, New York.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27, 1134-1142.
Received May 14, 1994; accepted November 9, 1994.
Communicated by Radford Neal
Convex Potentials and their Conjugates in Analog Mean-Field Optimization

I. M. Elfadel*

Massachusetts Institute of Technology, Research Laboratory of Electronics, Room 36-881, Cambridge, MA 02139 USA

This paper deals with the problem of mapping hybrid (i.e., both discrete and continuous) constrained optimization problems onto analog networks. The saddle-point paradigm of mean-field methods in statistical physics provides a systematic procedure for finding such a mapping via the notion of effective energy. Specifically, it is shown that within this paradigm, to each closed bounded constraint set is associated a smooth convex potential function. Using the conjugate (or the Legendre-Fenchel transform) of the convex potential, the effective energy can be transformed to yield a cost function that is a natural generalization of the analog Hopfield energy. Descent dynamics and deterministic annealing can then be used to find the global minimum of the original minimization problem. When the conjugate is hard to compute explicitly, it is shown that a minimax dynamics, similar to that of Arrow and Hurwicz in Lagrangian optimization, can be used to find the saddle points of the effective energy. As an illustration of its wide applicability, the effective energy framework is used to derive Hopfield-like energy functions and descent dynamics for two classes of networks previously considered in the literature, winner-take-all networks and rotor networks, even when the cost function of the original optimization problem is not quadratic.

1 Introduction

The analysis and design of analog, parallel, distributed architectures for solving optimization problems has been one of the most researched areas in the neural network literature.
Since the seminal work of Hopfield and Tank (1985) on using Hopfield's analog network (Hopfield 1984) to solve the traveling salesman problem, significant research effort has been devoted to applying Hopfield's approach to solving other combinatorial optimization problems (Platt 1989). It is now well known that Hopfield's analog model is closely related to the mean-field model of the Ising spin

*Current address: Masimo Corporation, 26052 Merit Circle, Suite 103, Laguna Hills, CA 92653.
Neural Computation 7, 1079-1104 (1995) © 1995 Massachusetts Institute of Technology
lattice (Amit 1989 and references therein; also Marroquin 1985; Yuille 1987). This connection, once realized, has led researchers, especially in the physics community, to adapt the mean-field concepts of statistical physics to the construction and analysis of new analog architectures for solving optimization problems (Peterson and Soderberg 1989; Simic 1990; Kosowsky and Yuille 1994).

There are two main reasons for this strong interest in analog optimization. The first is algorithmic. Deterministic mean-field networks offer the possibility of using continuation-based optimization methods that are likely to be faster than simulated annealing in finding the global optimum (Peterson and Soderberg 1989; Geiger and Girosi 1991). The second reason is practical. It has been noticed by a number of researchers that analog optimization algorithms can be mapped onto nonlinear RC circuits possessing Lyapunov-like functions that get minimized as a result of the circuits' natural properties (Harris et al. 1989).

Although optimization is a very mature field in its concepts and methods, little has been accomplished to bring these concepts and methods to bear on the mean-field analog optimization paradigm. In a very recent paper, however, Yuille and Kosowsky (1994) clarified, in the context of the linear assignment problem, a number of connections between mean-field optimization and the more classical approaches of linear programming with barrier function and interior point methods.

This paper is written with an objective similar to that of Yuille and Kosowsky (1994): gaining a deeper theoretical understanding of analog mean-field optimization and relating it to the established methods of mathematical optimization. Specifically, we will concentrate on the mean-field cost function (or effective energy) introduced in Peterson and Soderberg (1989) and generalized in Gislén et al. (1992) and show that it provides a unified framework for mapping constrained optimization problems, combinatorial and continuous, onto unconstrained analog networks. The effective energy has the distinguishing feature of associating, with the constraint sets, convex potential functions. This important fact allows us to analyze the effective energy using the classical notion of the conjugate (or Legendre-Fenchel transform) of a convex function (Rockafellar 1970). In particular, we will show that the derivation of Hopfield-like cost functions results in a straightforward manner from applying the Legendre-Fenchel transform to Peterson and Soderberg's effective energy. Moreover, we will show that the "natural" dynamics for finding the extrema of the effective energy is that of minimax optimization rather than gradient descent. When interpreted in the context of a neural network architecture, the Legendre-Fenchel transform allows us to view the neuron input and output as conjugate (or dual) variables in the same sense that voltage and current are conjugate variables in electrical circuits.

This paper is organized as follows. In Section 2, we sketch the derivation of the effective energy cost function, give a general expression of the
potential functions associated with the constraint sets, and state their important convexity property. Next we introduce the conjugate of a convex potential and use it to derive a Hopfield-like cost function from the effective energy. In Section 3, we study the ascent-descent dynamic system that allows us to find the saddle points of the effective energy. In particular, we show that it possesses a Lyapunov function (different from the effective energy!) that is nonincreasing along its trajectories. We also give a Hopfield-like descent dynamics that generalizes the usual analog Hopfield dynamics to the case of neurons whose states are vectors constrained to live in a closed bounded set. In Sections 4 and 5, we specialize our results to the important cases of winner-take-all (WTA) networks (Peterson and Soderberg 1989; Simic 1991; Waugh and Westervelt 1993; Elfadel 1993) and rotor neural networks (Gislén et al. 1992; Zemel et al. 1993). In particular, we show how descent dynamic systems can be derived for these two classes of networks from the general framework, even when the original cost function is not quadratic. A summary of the main contributions of this paper will be given in Section 6.

2 Cost Functions and Conjugate Functions
2.1 Potential Functions and Effective Energy. Consider the nonlinear programming problem

minimize E(x_1, …, x_N)
subject to x_k ∈ S_k ⊂ ℝ^n,  1 ≤ k ≤ N   (2.1)

The function E(x_1, …, x_N) will be interchangeably called a cost function or an energy function. The variable x_k denotes the state of the kth neuron in a network of N neurons. For each k ∈ [1, N], the subset S_k is a constraint set that will be assumed compact. Note that this compactness assumption includes the case when S_k is finite. Note also that in this formulation we allow the constraint set to be neuron-dependent, so that the state at one site can be discrete [e.g., S_k = {0, 1} as in the discrete Hopfield model (Hopfield 1982)], while at another, it can be continuous [e.g., S_k = the unit sphere of ℝ^3 as in the rotor model (Gislén et al. 1992)]. In other words, the nonlinear program (2.1) might involve both continuous and combinatorial variables.

The traditional way of dealing with optimization problems similar to 2.1 is through the use of penalty and barrier function methods (Luenberger 1984, Chapter 12), whereby the constrained optimization problem is approximated by an unconstrained optimization problem. Both methods involve a tradeoff between cost and constraint. In a penalty method, the cost function is modified by adding a term that prescribes a high cost for violating the constraint, whereas in a barrier method, the added term tends to favor points that are interior to the constraint set. In practice, a penalty method is used when the constraint sets S_k are defined by functional equalities, whereas a barrier method is used when the constraint
sets are robust, which is the typical situation when they are defined with functional inequalities. Analog mean-field optimization results in the replacement, rather than the approximation, of the original cost function with a new cost function that incorporates the constraints in its functional form. We have emphasized the word replacement because the new cost function depends on a new set of variables different from the original ones. These new variables are the approximate mean-field values of x_k, 1 ≤ k ≤ N, with respect to a hypothetical Gibbs system having the cost function E(x_1, …, x_N) as an energy function. Following Peterson and Soderberg (1989) and Gislén et al. (1992), the statistical physics paradigm of the saddle-point method allows us to associate, with the constrained optimization problem (2.1), the following cost function (also called effective energy):
E_eff(v, w, T) = Ẽ(v) + T ∑_{k=1}^N ⟨w_k, v_k⟩ − T ∑_{k=1}^N φ_k(w_k)   (2.2)

where v and w are the vectors in ℝ^{nN} that result from the concatenation of the v_k and the w_k, 1 ≤ k ≤ N, respectively, and where ⟨w_k, v_k⟩ denotes the scalar product of the two vectors w_k and v_k. The function φ_k : ℝ^n → ℝ represents the "potential" of the kth neuron. The actual form of φ_k depends crucially on the constraint set S_k. Note that in the above expression of E_eff (2.2), we have used the notation Ẽ to denote the extension of the cost function E to the new set of variables v_k, 1 ≤ k ≤ N. Of course, we have

Ẽ(x_1, …, x_N) = E(x_1, …, x_N)  whenever x_k ∈ S_k, 1 ≤ k ≤ N
To see how the variables v_k, w_k and the potential functions φ_k arise from the optimization problem (2.1), let us, for simplicity's sake, treat the case of a single neuron with state x_1 constrained to lie in S_1 ⊂ ℝ^n. First, we associate with the cost function E(x_1) the Gibbs distribution

P(x_1) = (1/Z) exp[−E(x_1)/T]

with the partition function Z given by

Z = ∫_{S_1} exp[−E(x_1)/T] dμ(x_1)

where μ(x_1) is a positive measure on the compact constraint surface S_1. We use the sifting property of the Dirac delta function to transform the integrand of Z into an integral, i.e.,

exp[−E(x_1)/T] = ∫_{ℝ^n} δ(x_1 − v_1) exp[−E(v_1)/T] dv_1
Next, we transform the Dirac delta function into an integral using the formula

δ(x_1 − v_1) = c ∫_{iℝ^n} exp[⟨w_1, x_1 − v_1⟩] dw_1

where iℝ^n denotes the set of n-dimensional vectors with purely imaginary components, and c is a normalization constant. Collecting the above expressions, we can now write the partition function as

Z = c ∫_{ℝ^n} ∫_{iℝ^n} exp[−E(v_1)/T − ⟨w_1, v_1⟩] ( ∫_{S_1} exp[⟨w_1, x_1⟩] dμ(x_1) ) dw_1 dv_1

The inner integral is always positive; therefore we can always define

φ_1(w_1) = log ∫_{S_1} exp[⟨w_1, x_1⟩] dμ(x_1)   (2.3)

The partition function then becomes

Z = c ∫_{ℝ^n} ∫_{iℝ^n} exp[−E_eff(v_1, w_1, T)/T] dw_1 dv_1

where

E_eff(v_1, w_1, T) = E(v_1) + T ⟨w_1, v_1⟩ − T φ_1(w_1)

Repeating the above steps for the arguments x_2, …, x_N gives the effective energy (2.2) and for Z the following formula:

Z = c ∫∫ exp[−E_eff(v, w, T)/T] dw dv

The above integral expression of Z is exact. It is, however, very difficult, if not impossible, to compute. The saddle-point method (Haykin 1994, pp. 338-340) allows us to approximate Z using the values of the integrand computed at the saddle points of E_eff. In statistical physics, finding an approximation of the partition function is important since it contains all the information necessary to characterize the thermodynamic system at thermal equilibrium. From an optimization viewpoint, we are more interested in the saddle points of E_eff than in the partition function itself. The picture here is the following. If we know the saddle points of E_eff as a function of the temperature T, then as T approaches zero, the saddle points will approach the global minima of the cost function E(x_1, …, x_N). This is because as T approaches zero, the major contribution to the partition function comes from the points where the cost function E(x_1, …, x_N) reaches its global minimum. Note also that in the limit of small temperature, the Gibbs probability distribution becomes concentrated around the global minimum. Analog algorithms for obtaining the
saddle points of E_eff at a given temperature will be discussed at length in Section 3.

Expression 2.3 endows the potential functions φ_k with a number of special properties:

1. As a function of the variable w_k ∈ ℝ^n, each φ_k is infinitely differentiable.

2. The dependence of φ_k on the constraint set is through the integration domain, the integrand being independent of S_k.

3. If the constraint set is finite, the measure μ will be supported by the points of S_k. The generic form of μ is

dμ(x) = ∑_{s∈S_k} α_s δ(s − x) dx   (2.4)

where the α_s's are nonnegative. These parameters can be understood as weights for the neuron states.

4. The most important property for our purposes is that each function φ_k is convex. So as not to overload this section, we will postpone the proof of this property to Section 5.
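Property 4 can be checked numerically for a finite constraint set. The sketch below assumes the log-partition form of 2.3 together with the discrete measure 2.4, and uses a hypothetical three-state constraint set in ℝ² with uniform weights α_s = 1; the numerically estimated Hessian of φ comes out positive semidefinite at randomly sampled points:

```python
import numpy as np

def phi(w, states, alpha):
    # Potential 2.3 under the discrete measure 2.4:
    # phi(w) = log sum_s alpha_s exp(<w, s>)
    return np.log(np.sum(alpha * np.exp(states @ w)))

def num_hessian(f, w, h=1e-4):
    # Central-difference estimate of the Hessian of a scalar function f at w.
    n = len(w)
    I = np.eye(n) * h
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(w + I[i] + I[j]) - f(w + I[i] - I[j])
                       - f(w - I[i] + I[j]) + f(w - I[i] - I[j])) / (4 * h * h)
    return H

# Hypothetical three-state neuron in R^2, uniform weights alpha_s = 1.
states = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
alpha = np.ones(3)

rng = np.random.default_rng(0)
min_eigs = [np.linalg.eigvalsh(num_hessian(lambda u: phi(u, states, alpha),
                                           rng.normal(size=2))).min()
            for _ in range(10)]
assert min(min_eigs) > -1e-6   # Hessian PSD at every sampled point: phi is convex
```

The Hessian of this log-partition function is the covariance matrix of the neuron state under the induced Gibbs weights, which is why it can never have a negative eigenvalue.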
It should be stressed that the saddle-point approximation applies to any energy function E(x_1, …, x_N), whether it is quadratic or not, and whether its arguments are discrete or continuous. This feature distinguishes the saddle-point method from other mean-field methods like the gaussian-trick method (Simic 1990), the probability decomposition method (Meir 1992), or the circular distribution method (Zemel et al. 1993). These methods assume that E(x_1, …, x_N) is quadratic. Another important feature of the saddle-point approximation is that it allows the consideration of hybrid cost functions in which the arguments belong to different constraint sets. Both discrete and continuous constraint sets can be handled in the same framework.

It is interesting to note that the expressions of E_eff in 2.2 and φ_k in 2.3 can be given in an ad hoc manner without reference to an underlying Gibbs distribution or to the saddle-point approximation. This amounts to replacing the original cost function with the effective energy and to replacing the constraints with their potential functions. Note that the "mechanics" of Lagrangian optimization is similar: the cost function is replaced with the Lagrangian, and the constraints are imposed using the Lagrange multipliers.

The arguments of the effective energy E_eff(v, w, T) call for the following two remarks. First, the new variables w_k are auxiliary variables that play, in the context of constrained mean-field optimization, a role similar to that of the Lagrange multipliers in classical optimization in that they are associated with the constraint sets S_k. We will return to this point in Section 3. The second remark is that the saddle-point method, which is rooted in statistical physics, introduces a new parameter into the optimization
problem: the temperature. In the context of classical optimization algorithms, one can think of the temperature as the barrier or penalty parameter in constrained optimization (Luenberger 1984, p. 370; see also Yuille and Kosowsky 1994). Another interpretation of the temperature parameter is that it is the Lagrange multiplier associated with the constant internal energy constraint in the framework of the maximum entropy principle (Meir 1992). In the context of mean-field optimization, the temperature plays the role of a continuation parameter that will allow us, in the zero limit, to recover the solution of the original optimization problem from the saddle points of the effective energy.

In the sequel, we will abuse notation and denote the cost function extended to the space of the v variable with the symbol E instead of Ẽ. Moreover, we will lump the summation over the potential functions into one potential function Φ with argument w. With this new notation, the effective energy function will be written as

E_eff(v, w, T) = E(v) + T ⟨w, v⟩ − T Φ(w)   (2.5)

The feasible points of the original minimization problem belong to the constraint set

C = ∏_{k=1}^N S_k   (2.6)
the Cartesian product of the constraint sets S_k. Note that since the φ_k's are convex, their sum Φ is also convex. It follows that for a given v, the effective energy is always concave with respect to w. One way of checking this is to notice that the second partial derivative of E_eff(v, w, T) with respect to w is equal to −T HΦ(w), the negative of the Hessian of Φ(w) scaled by the positive factor T. For the rest of this paper, we will assume that the function E(v) is at least twice continuously differentiable.

2.2 Conjugate Functions and Hopfield Energy. As a result of the concavity of E_eff(v, w, T) with respect to w, the following definition is always meaningful:

E*(v, T) = max_w E_eff(v, w, T)   (2.7)
Because the first term in E_eff(v, w, T) is independent of w and because the function w ↦ ⟨w, v⟩ − Φ(w) is concave in the variable w, we can write

E*(v, T) = E(v) + T max_w [⟨w, v⟩ − Φ(w)]   (2.8)

E*(v, T) = E(v) + T ∑_{k=1}^N max_{w_k} [⟨w_k, v_k⟩ − φ_k(w_k)]   (2.9)
where the second term is a direct result of the separability of the summation term in 2.2. The quantity

Φ*(v) ≜ max_w [⟨w, v⟩ − Φ(w)]   (2.10)

is called the Legendre-Fenchel transform (or the conjugate) of the convex function Φ(w). A thorough mathematical account of the theory and applications of the Legendre-Fenchel transformation can be found in Sections 12 and 26 of Rockafellar (1970). The book by Strang (1986) contains, in Chapter 8, a lucid introductory treatment of these concepts. With the above definition of Φ*, we get

E*(v, T) = E(v) + T Φ*(v)   (2.11)

From 2.8 and 2.9, it is easy to see that

Φ*(v) = ∑_{k=1}^N φ*_k(v_k)

Hopfield (1984) proposed the energy function¹

E_H(V) = −(1/2) ∑_{i,j} T_{ij} V_i V_j + (1/λ) ∑_i ∫_0^{V_i} g^{−1}(ξ) dξ   (2.12)

as a Lyapunov function for his analog network dynamics [see equation (5) in Hopfield 1984]. Each term in the above integral summation can be identified with one of the conjugates φ*_k(v_k), while the parameter λ, which controls the slope at the origin of the sigmoidal nonlinearity g, can be identified with the inverse temperature in 2.11. It follows that 2.11 can be construed as a generalization of the Hopfield energy to the general case where the neurons take their values in arbitrary constraint sets. Note that in this generalization, different neurons can take their values in different constraint sets. For instance, in the same network, we can have a binary neuron to model a yes-or-no decision, an n-state neuron to model a competitive mechanism with n outcomes, e.g., a winner-take-all neuron (Peterson and Soderberg 1989; Simic 1991; Waugh and Westervelt 1993; Elfadel 1993), or a continuous-state neuron that conveys directional information, e.g., a rotor neuron (Gislén et al. 1992; Zemel et al. 1993). The Hopfield energy of this hybrid network is given by 2.11. The nature of a particular neuron, say the kth in the network, is encoded in the convex potential φ_k or its Legendre-Fenchel transform φ*_k. The interaction between the different neurons is encoded in the extended cost function E(v_1, …, v_N).
In Sections 4 and 5, we give explicit expressions of φ*_k for the case of winner-take-all networks and rotor networks. The following important property of the Legendre-Fenchel transform provides a theoretical justification of the deterministic annealing algorithm as used in Hopfield-like networks.

¹We are assuming that there are no external inputs.
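For a single binary neuron (S_k = {0, 1} with unit weights α_s = 1, so that φ(w) = log(1 + e^w)), the conjugate 2.10 can be evaluated by brute force and compared against the entropic closed form that plays the role of the integral term of the Hopfield energy; a sketch:

```python
import numpy as np

def phi(w):
    # Potential of a binary neuron, S = {0, 1} with unit weights:
    # phi(w) = log(e^0 + e^w) = log(1 + e^w), a special case of 2.3/2.4.
    return np.logaddexp(0.0, w)

def phi_star(v):
    # Conjugate 2.10 by brute force: max_w [w v - phi(w)] over a fine grid.
    w = np.linspace(-30.0, 30.0, 200001)
    return np.max(w * v - phi(w))

# The maximization yields the entropic form v log v + (1 - v) log(1 - v).
for v in (0.1, 0.3, 0.5, 0.9):
    closed_form = v * np.log(v) + (1 - v) * np.log(1 - v)
    assert abs(phi_star(v) - closed_form) < 1e-4
```

The WTA and rotor cases treated in Sections 4 and 5 generalize this information-theoretic entropy form to vector-valued neurons.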
Proposition 1. Let Φ be a smooth function on ℝ^{nN}. Then Φ is convex if and only if its Legendre-Fenchel transform Φ* is convex on ℝ^{nN}.

Proof. See Strang (1986, p. 731). □
As a consequence of this general result, it follows that when the parameter T is high, the cost function E* is dominated by the convex function Φ*. The algorithm of deterministic annealing (Hopfield and Tank 1985; Peterson and Soderberg 1989; Geiger and Girosi 1991; Gislén et al. 1992) uses this fact to find a minimizing point of E*(v, T) at high temperature,

v*(T) = argmin_v E*(v, T) = argmin_v max_w E_eff(v, w, T)   (2.13)

and then track this point as the temperature is decreased until it reaches a value close to zero. As a function of temperature, the tracked point v*(T) defines a continuation arc that connects the convex optimization problem at high temperature with the generally nonconvex optimization problem at low temperatures. Of course, there is no theoretical guarantee that the point found at the end of the continuation arc will be a feasible global minimum of the original cost function E(x), but this method was found to perform rather adequately in practice (Hopfield and Tank 1985; Peterson and Soderberg 1989; Geiger and Girosi 1991; Gislén et al. 1992). Note that in order to get a solution x* of the original optimization problem, i.e., one that lies in the feasible set C (see 2.6), from v*(T) in the limit of small T, an additional step might be required. For instance, in the case of a binary network, one possibility is to threshold the components of v*(T) and map them to either 0 or 1.

The fact that the constrained minimization of E(x) has led to solving a maximin problem should not be surprising. The situation is similar to that of Lagrangian optimization, where finding the minimizing point and the Lagrange multiplier corresponding to the active constraints leads to the solution of a minimax problem: minimization with respect to the constrained variable followed by a maximization with respect to the Lagrange multiplier (Luenberger 1984, Local Duality Theorem, p. 399). The natural question that arises is whether

min_v max_w E_eff(v, w, T) = max_w min_v E_eff(v, w, T)

In other words, we want to make sure that the result of the search process for the saddle points of E_eff(v, w, T) is independent of the order in which it is carried out with respect to the arguments v and w. The following general theorem (Rockafellar 1970, Corollary 37.6.2) shows that under rather mild conditions, this independence is indeed guaranteed.

Theorem 2. Let C and D be two nonempty compact convex sets in ℝ^p and let K be a continuous convex-concave function on C × D. Then K has a saddle point with respect to C × D, i.e., there exists (v̄, w̄) ∈ C × D such that

K(v̄, w) ≤ K(v̄, w̄) ≤ K(v, w̄)   ∀(v, w) ∈ C × D

The theorem states that obtaining the point (v̄, w̄) where K has a saddle point could be done either by minimizing with respect to v [which gives K(v̄, w)] and then maximizing with respect to w, or by maximizing with respect to w [which gives K(v, w̄)] and then minimizing with respect to v. To apply this theorem to the effective energy, we note that E_eff(v, w, T) is always a concave function with respect to the variable w ∈ D, where D is an arbitrary compact convex set of ℝ^{nN}. Moreover, under the assumption that the cost function E(v) has a local minimum at v̄, we can choose a compact convex neighborhood C of v̄ such that E_eff is convex on C. Applying Theorem 2, we get

min_{v∈C} max_{w∈D} E_eff(v, w, T) = max_{w∈D} min_{v∈C} E_eff(v, w, T)   (2.14)
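The exchange of min and max in 2.14 can be checked on a one-dimensional instance (a sketch; the quadratic cost, the binary-neuron potential, and the particular sets C and D are illustrative assumptions, not taken from the paper):

```python
import numpy as np

T = 0.5
E = lambda v: 0.5 * (v - 0.7) ** 2        # hypothetical convex cost on C
Phi = lambda w: np.logaddexp(0.0, w)      # binary-neuron potential (convex)

v_grid = np.linspace(0.0, 1.0, 2001)      # C: compact convex set
w_grid = np.linspace(-6.0, 6.0, 2001)     # D: compact convex set

# E_eff(v, w, T) = E(v) + T w v - T Phi(w): convex in v, concave in w
Eeff = E(v_grid)[:, None] + T * np.outer(v_grid, w_grid) - T * Phi(w_grid)[None, :]

min_max = Eeff.max(axis=1).min()          # min over v of max over w
max_min = Eeff.min(axis=0).max()          # max over w of min over v
assert abs(min_max - max_min) < 1e-3      # the two orders agree (Theorem 2)
```

The saddle point of this toy instance is interior to C × D, so the grid values of the two nested optimizations coincide up to discretization error.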
For a given w, we have

min_{v∈C} E_eff(v, w, T) = T min_{v∈C} [E(v)/T + ⟨w, v⟩] − T Φ(w)

If we denote by

Ê(w, T) ≜ min_{v∈C} [E(v)/T + ⟨w, v⟩]   (2.15)

then 2.14 can be written as

min_{v∈C} E*(v, T) = max_{w∈D} T [Ê(w, T) − Φ(w)]

The function Ê(w, T) can be interpreted as the dual of the function E(v)/T (Luenberger 1984, p. 399). Both sides of the above equality are meaningful. Indeed, on the left-hand side, a continuous convex function is being minimized on the compact convex set C, while on the right-hand side, a continuous concave function is being maximized on the compact convex set D.

3 Analog Algorithms

3.1 Critical Points. Since the effective energy function is at least twice continuously differentiable, a necessary condition for a point (v̄, w̄) ∈ ℝ^{nN} × ℝ^{nN} to be a saddle point of E_eff is that
∇_v E_eff(v̄, w̄, T) = 0,   ∇_w E_eff(v̄, w̄, T) = 0   (3.1)

where ∇_z designates the gradient with respect to the vector z. The subscript z will be omitted when the context allows it. We can restate 3.1 by saying that (v̄, w̄) is a solution of the fixed-point equations

∇E(v̄) + T w̄ = 0   (3.2)

T v̄ − T ∇Φ(w̄) = 0   (3.3)
In this section, we will propose two continuous-time dynamic systems for solving the fixed-point equations 3.2 and 3.3. In Peterson and Soderberg (1989) and Gislén et al. (1992), iterated-map methods were proposed to find a solution of the fixed-point equations corresponding to the WTA (Peterson and Soderberg 1989) and the rotor (Gislén et al. 1992) neural networks. The main reason we are interested in continuous-time algorithms rather than iterated-map ones is the plausibility of analog hardware implementations in the former case. For instance, the "softmax" nonlinearity (or the generalized sigmoid mapping), which is the basic building block in analog WTA networks (Peterson and Soderberg 1989; Simic 1990; Waugh and Westervelt 1993), has been shown to possess a simple hardware implementation as an analog, reciprocal VLSI circuit (Elfadel and Wyatt 1993). A related but nonreciprocal circuit has been proposed by a number of authors (see Waugh and Westervelt 1993 and the references therein).

The main feature of the algorithms that we propose is that they take into account the saddle-point structure of the effective energy function. As has been mentioned, this structure is due to the fact that the effective energy is concave in the auxiliary variable w and convex in the neighborhood of the local minima of the cost function E(v). Defining gradient descent dynamics with respect to both v and w is not compatible with this convex-concave structure. Since the domain of concavity of the function w ↦ ⟨v, w⟩ − Φ(w) is the whole space, a gradient descent with respect to w will yield large values of ‖w‖. In the case of the WTA network, these large values could saturate the softmax mapping, thus leading the algorithm to converge to trivial solutions.

3.2 Minimax Dynamics. We will denote by (v̄, w̄) a saddle point of E_eff(v, w, T) and denote by O ⊂ ℝ^{nN} an open neighborhood of v̄ where E(v) is convex. The first continuous-time algorithm is natural in that it implements a gradient descent with respect to the v variable so as to minimize E(v) and a gradient ascent with respect to the concave part of E_eff(v, w, T) so as to compute the convex conjugate of the potential function Φ(w). Specifically, we write

v̇ = −(η/T) ∇_v E_eff(v, w, T)   (3.4)

ẇ = +(η/T) ∇_w E_eff(v, w, T)   (3.5)
where η is a positive gain factor. Since its actual value does not affect our proofs, we will assume that η = 1. The state variables v and w belong to O and ℝ^{nN}, respectively. Here also it is important to draw an analogy with classical optimization. Consider the nonlinear program

minimize f(v)
subject to g(v) = 0   (3.6)

with f : ℝ^p → ℝ defining the cost function and g : ℝ^p → ℝ^q, q < p, defining the constraints. The Lagrangian of this nonlinear program is given by

Λ(v, λ) = f(v) + ⟨λ, g(v)⟩   (3.7)

where λ ∈ ℝ^q are the q Lagrange multipliers associated with the q constraints. The Lagrange equations of 3.7 are

∇_v Λ(v, λ) = 0

∇_λ Λ(v, λ) = 0

In 1958 (Luenberger 1984, p. 453), Arrow and Hurwicz proposed the following continuous dynamic system to solve the above Lagrange equations:

v̇ = −∇_v Λ(v, λ) = −[∇f(v) + ⟨λ, ∇g(v)⟩]   (3.8)

λ̇ = +∇_λ Λ(v, λ) = +g(v)   (3.9)

The discretized version of the above dynamic system belongs to a class of optimization algorithms known under the name of first-order Lagrange methods (Luenberger 1984, p. 429). Taking the gradients of E_eff (2.2) with respect to v and w, we get for 3.4 and 3.5

v̇ = −w − (1/T) ∇E(v)   (3.10)

ẇ = v − ∇Φ(w)   (3.11)

Comparing these equations with those of Arrow and Hurwicz, it becomes clear that the auxiliary variable w in the effective energy is playing the role of the Lagrange multiplier λ in the Lagrangian. There are, however, two fundamental differences between 3.10 and 3.11, on the one hand, and 3.8 and 3.9 on the other:

1. The first difference is the coupling between the two equations. Both the variables v and w appear explicitly in 3.10 and 3.11. Because the Lagrangian is linear with respect to the Lagrange multipliers, only the variable v appears in the ascent equation of the Lagrange method 3.9.
2. The second is that the constraint equation appears only implicitly in the descent dynamics (3.10) through the auxiliary variable w. This situation occurs in Lagrange's methods only if the constraints are linear in v.

We now prove that the saddle points of the effective energy are indeed locally stable equilibrium points of the dynamic system 3.10 and 3.11.

Theorem 3. Let (v̄, w̄) be a saddle point of E_eff(v, w, T). Then (v̄, w̄) is a locally stable equilibrium point of 3.10 and 3.11.

For the proof, we need the following lemma.

Lemma 4. Let M be a symmetric, stable matrix, i.e., all the (real) eigenvalues of M are nonpositive, and let S be a skew-symmetric matrix, i.e., S^T = −S. Then the matrix A = M + S is stable, i.e., all its eigenvalues have nonpositive real parts.

Proof. See Appendix A. □
Now we give the proof of Theorem 3.

Proof. Linearizing the right-hand sides of equations 3.10 and 3.11 around the equilibrium point (v̄, w̄), we get the block matrix

[ −(1/T) HE(v̄)   −I      ]
[ I               −HΦ(w̄) ]

where H denotes the Hessian operator. Note that because Φ is convex on the whole space, the eigenvalues of the symmetric matrix −HΦ(w̄) are all nonpositive. Similarly, the eigenvalues of the symmetric matrix −HE(v̄) are all nonpositive in a neighborhood of v̄. Now, we apply Lemma 4 to the symmetric, stable matrix

M = [ −(1/T) HE(v̄)   0      ]
    [ 0               −HΦ(w̄) ]

and the skew-symmetric matrix

S = [ 0   −I ]
    [ I    0 ]

It follows that the eigenvalues of the linearized system around the saddle point all have nonpositive real parts, and, therefore, the dynamic system is locally stable. □
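For a single binary neuron and a toy quadratic cost (both assumed here purely for illustration), the ascent-descent system 3.10-3.11 can be integrated by forward Euler; the trajectory settles at a point satisfying the fixed-point equations 3.2 and 3.3, consistent with Theorem 3:

```python
import numpy as np

sigma = lambda w: 1.0 / (1.0 + np.exp(-w))  # grad of Phi(w) = log(1 + e^w)
T = 0.5
grad_E = lambda v: v - 0.7                  # hypothetical cost E(v) = (v - 0.7)^2 / 2

v, w, dt = 0.5, 0.0, 0.01
for _ in range(20000):                      # forward-Euler integration of 3.10-3.11
    v_dot = -w - grad_E(v) / T              # 3.10
    w_dot = v - sigma(w)                    # 3.11
    v, w = v + dt * v_dot, w + dt * w_dot

# At the saddle point, 3.2 and 3.3 hold: grad E(v) + T w = 0 and v = grad Phi(w).
assert abs(grad_E(v) + T * w) < 1e-6
assert abs(v - sigma(w)) < 1e-6
```

Note the skew-symmetric coupling between the two equations: the descent in v and the ascent in w trade energy, and the symmetric parts of the linearization damp it out, exactly as in the proof above.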
To obtain local asymptotic stability, strict convexity of both the potential function Φ at w̄ and the cost function E at v̄ is required. To ensure strict convexity, a classical trick is to add a positive definite quadratic function of v − v̄ to the cost function E(v) and a positive definite quadratic function of w − w̄ to the potential function Φ(w). We will not pursue this trick here, but we note that if
E(v) has a strict local minimum at a point v̄, then there exists a neighborhood of that point where E(v) is strictly convex. We will assume that this neighborhood is O and show that this local strict convexity assumption in O is actually sufficient to guarantee the local asymptotic stability of the ascent-descent algorithm. Our argument will use the following Lyapunov function.

Theorem 5. Assume that E(v) has a strict local minimum at v̄. Then there is a neighborhood O ⊂ ℝ^{nN} of v̄ such that the continuous function

L(v, w, T) = (1/2) ‖w + (1/T) ∇E(v)‖² + (1/2) ‖v − ∇Φ(w)‖²   (3.12)

is strictly decreasing along the trajectories of 3.10 and 3.11 with initial conditions in O × ℝ^{nN}.
Proof. By 3.10 and 3.11, the two terms of 3.12 are (1/2)‖v̇‖² and (1/2)‖ẇ‖². Differentiating L(v, w, T) along the trajectories of 3.10 and 3.11, and noting that d v̇/dt = −ẇ − (1/T) HE(v) v̇ and d ẇ/dt = v̇ − HΦ(w) ẇ, so that the cross terms cancel, we get

d/dt L(v, w, T) = ⟨v̇, d v̇/dt⟩ + ⟨ẇ, d ẇ/dt⟩
              = −⟨(1/T) HE(v) v̇, v̇⟩ − ⟨HΦ(w) ẇ, ẇ⟩
              ≤ −⟨(1/T) HE(v) v̇, v̇⟩
              < 0

The first inequality is a result of the fact that the Hessian matrix of the potential function is positive semidefinite on ℝ^{nN} (convexity of Φ), while the last strict inequality results from the fact that the Hessian matrix of the cost function E(v) is positive definite on O. It follows that L(v, w, T) is a strictly decreasing function along the trajectories of the dynamic system given by 3.10 and 3.11. Standard results from stability theory (e.g., Vidyasagar 1978, Chapter 5) can then be used to prove that the above dynamic system is locally asymptotically stable in O × ℝ^{nN}. □

It is interesting to note that the functions E(v) and Φ(w) play essentially symmetric roles modulo the temperature parameter T. This symmetry is apparent in the ascent-descent dynamic system used to find the saddle points. Note also that the result of the above theorem remains valid if we assume instead that v̄ is a local minimum rather than a strict local minimum and that the potential function is strictly convex over ℝ^{nN}. However, in the statistical mechanics formulation of constrained optimization problems, the potential function Φ that
results from the imposition of constraints may fail to be strictly convex. A case in point is that of WTA networks, where the softmax nonlinearity has a convex potential function that is not strictly convex. See Section 4.

3.3 Descent Dynamics. An alternative way for finding the saddle points of the effective energy function $E_{\mathrm{eff}}(v,w,T)$ is to first explicitly solve for the conjugate function according to 2.10 and then define a gradient descent on the cost function $E^*(v,T)$ defined in 2.8. The dynamic system is then simply

$$\dot v = -\nabla E^*(v,T) = -\nabla E(v) - T\nabla\Phi^*(v) \qquad (3.13)$$

It is clear that if the above dynamic system is started in the neighborhood $\mathcal{O}$ of the local minimum $\bar v$ of $E(v)$, the descent algorithm on $E^*(v,T)$ will converge² to $\bar v$. There is yet another descent dynamic system that admits $E^*(v,T)$ as a Lyapunov function. It is, however, defined in terms of the auxiliary variable $w$ as

$$\dot w = -w - \frac{1}{T}\nabla E(v) \qquad (3.14)$$

and the input-output constraint

$$v = \nabla\Phi(w) \qquad (3.15)$$
Note that the right-hand sides of 3.14 and 3.10 are identical. To show that the cost function $E^*(v,T)$ is indeed nonincreasing along the trajectories of 3.14, we need the following lemma, whose proof is given in Appendix B.

Lemma 6. The gradients of a smooth convex $\Phi$ and its conjugate $\Phi^*$ satisfy the equation

$$\nabla\Phi^*[\nabla\Phi(w)] = w, \qquad \forall w \in \Re^{nN} \qquad (3.16)$$

We can now state the following:

Theorem 7. The cost function $E^*(v,T)$ is nonincreasing along the trajectories of the dynamic system (3.14).

Proof. Along the trajectories of 3.14, we have

$$\frac{d}{dt}E^*(v,T) = \langle\nabla E^*(v,T), \dot v\rangle = \langle\nabla E(v) + T\nabla\Phi^*(v),\; H_\Phi(w)\dot w\rangle$$

Now because of the input-output constraint (3.15) and by virtue of Lemma 6, we have

$$\nabla\Phi^*(v) = \nabla\Phi^*[\nabla\Phi(w)] = w$$

²We assume that $\mathcal{O}$ does not contain other critical points of $E(v)$ than $\bar v$.
I. M. Elfadel
1094
It follows that

$$\nabla E^*(v,T) = \nabla E(v) + Tw = -T\dot w$$

Therefore,

$$\frac{d}{dt}E^*(v,T) = -T\langle\dot w, H_\Phi(w)\dot w\rangle$$

Since $\Phi$ is convex, its Hessian matrix is always positive semidefinite. Therefore,

$$\frac{d}{dt}E^*(v,T) \le 0 \qquad \square$$
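The descent dynamics above is easy to exercise numerically. The following sketch is an illustration only: it assumes the softmax potential of Section 4, $\Phi(w) = \ln\sum_j e^{w_j}$, whose conjugate is the negative entropy $\Phi^*(v) = \sum_j v_j \ln v_j$, together with a hypothetical random convex quadratic cost $E(v)$; a small-step Euler discretization of 3.14 and 3.15 then exhibits the nonincreasing $E^*(v,T)$ of Theorem 7.

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

# Hypothetical convex cost E(v) = 0.5 v'Qv + c'v, for illustration only.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
Q = B @ B.T + 0.1 * np.eye(4)          # positive definite by construction
c = rng.standard_normal(4)
E = lambda v: 0.5 * v @ Q @ v + c @ v
grad_E = lambda v: Q @ v + c

T, dt = 0.5, 1e-3
w = rng.standard_normal(4)
E_star = []
for _ in range(5000):
    v = softmax(w)                                   # input-output constraint (3.15)
    E_star.append(E(v) + T * np.sum(v * np.log(v)))  # E*(v,T) = E(v) + T * Phi*(v)
    w += dt * (-w - grad_E(v) / T)                   # Euler step of the dynamics (3.14)

# E*(v,T) is (numerically) nonincreasing along the trajectory
assert all(b <= a + 1e-6 for a, b in zip(E_star, E_star[1:]))
```

The step size must be small for the discrete iteration to inherit the continuous-time Lyapunov property; the monotone decrease mirrors Theorem 7.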
It is important to note that the dynamics of 3.14 under the input-output constraint 3.15 is the essence of Hopfield's analog neural networks (Hopfield 1984). The framework provided by the effective energy function (2.2) shows that stable dynamic systems other than the Hopfield type are plausible. One such system is one in which the input-output constraint is given by

$$w = -\frac{1}{T}\nabla E(v) \qquad (3.17)$$

and the dynamics is defined by

$$\dot v = -v + T\nabla\Phi(w) \qquad (3.18)$$

This system can be construed as the conjugate (or dual) of 3.13 and 3.14. Moreover, it can be shown that it implements a local ascent dynamics on the energy surface given by $\hat E(w,T) - \Phi(w)$, where $\hat E(w,T)$ is the dual of the cost function $E(v)/T$ as given in 2.15. The locality is imposed by the requirement that the conjugate system be defined in the neighborhood of a local minimum of the energy function $E(v)$ so that its local convexity can be guaranteed.

The rest of the paper is devoted to specializing the previous results to some of the neural networks that have been proposed in the literature. We will deal with two important cases: the winner-take-all (WTA) network (Peterson and Soderberg 1989) (also known as the Potts network in the statistical physics literature) and the rotor network (Gislén et al. 1992; Zemel et al. 1993). For both cases, we will show that the potential functions of the constraint sets are indeed convex and that the conjugates of the convex potentials have the form of information-theoretic entropy. Thus we can rigorously identify the cost function $E^*(v,T)$ with the mean-field free energy (Meir 1992).
4 Winner-Take-All Networks
4.1 Cost Functions. For the WTA case, the constraint set $S_k$ for each site $k$ is the set of vertices of the unit simplex. We will denote the unit simplex by $\mathrm{conv}(S)$, the convex hull of $S$. The discrete measure on this set (2.4) is given by

$$\mu(s) = \sum_{j=1}^{n}\delta(s - e_j) \qquad (4.1)$$

Using 2.3, it is easily seen that the potential function associated with $S$ is

$$\phi(w) = \ln\sum_{j=1}^{n} e^{w_j} \qquad (4.2)$$

Proposition 8. The function $\phi$ defined in 4.2 is convex.

Proof. It is sufficient to prove that the Hessian of $\phi$ at any point is positive semidefinite. Let $w$ be an arbitrary point in $\Re^n$. The Hessian of $\phi$ at $w$ will be denoted by $D$, and simple algebra shows that

$$D = \mathrm{diag}(f) - ff^T$$

where $f \triangleq \nabla\phi(w)$. To prove that $D$ is positive semidefinite, it is sufficient to prove that $\zeta^T D\zeta \ge 0$ for every $\zeta \in \Re^n$.
Proof. Applying the definition of a conjugate function, we want to maximize

$$\psi(v,w) = \langle v, w\rangle - \phi(w)$$

with respect to $w$. This maximum is reached at $w^*$ such that

$$v = \int_{S^n} s\, p(w^*, s)\, d\mu(s)$$

Multiplying through by $w^*$, we get

$$\psi(v, w^*) = \int_{S^n} \langle w^*, s\rangle\, p(w^*, s)\, d\mu(s) - \phi(w^*)$$

But from the definition of the probability density function $p(w,s)$, we have

$$\langle w^*, s\rangle = \phi(w^*) + \ln p(w^*, s)$$

Equation 5.5 then follows from the fact that

$$\int_{S^n} p(w^*, s)\, d\mu(s) = 1 \qquad \square$$
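The same entropy identity holds in the discrete (WTA) case, where the potential is log-sum-exp and the conjugate over the simplex is $\sum_i v_i \ln v_i$, and there it can be checked directly. In the sketch below the simplex point $v$ and the sampled $w$ values are arbitrary; the candidate maximizer $w^* = \ln v$ is the discrete analog of the stationarity condition above.

```python
import numpy as np

rng = np.random.default_rng(2)
phi = lambda w: np.log(np.exp(w).sum())   # log-sum-exp potential (WTA case)

v = rng.random(5)
v /= v.sum()                              # interior point of the unit simplex
neg_entropy = np.sum(v * np.log(v))

# w* = ln v attains <v, w> - phi(w) = sum_i v_i ln v_i  (since sum_i v_i = 1)
w_star = np.log(v)
assert abs(v @ w_star - phi(w_star) - neg_entropy) < 1e-12

# and no sampled w does better, since the conjugate is a supremum
for _ in range(1000):
    w = 3 * rng.standard_normal(5)
    assert v @ w - phi(w) <= neg_entropy + 1e-12
```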
Using the general results established in Section 3, one can show that the following continuous-time dynamic system

$$\dot w_k = -w_k - \frac{1}{T}\nabla_{v_k}E(v), \qquad 1 \le k \le N \qquad (5.6)$$

$$v_k = \nabla\phi(w_k), \qquad 1 \le k \le N \qquad (5.7)$$

has, for locally stable equilibrium points, the local minima of the rotor network Hopfield energy given by

$$E^*(v,T) = E(v) + T\sum_{k=1}^{N}\int_{S^n} p[w_k(v_k), s]\,\ln p[w_k(v_k), s]\, d\mu(s)$$

where $p[w_k(v_k), s]$ is given by 5.3. Here also, the Hopfield cost function $E^*(v,T)$, which is a Lyapunov function for 5.6, has an appealing thermodynamic structure, for the integral term is nothing but the negative of the information-theoretic entropy associated with the probability density function $p[w_k(v_k), s]$. In other words, $E^*(v,T)$ has the form of Helmholtz's free energy: energy $-$ temperature $\times$ entropy. In the 2D case, equations 5.6 and 5.7 reduce to equations (6) and (7) in Zemel et al. (1993) if we assume that the cost function $E(v)$ is quadratic. Note that the saddle-point method for deriving the effective energy $E_{\mathrm{eff}}(v,w,T)$ does not require such an assumption.
5.2 General Case. It is clear that Proposition 10 remains true if the neuron state is constrained to lie on a surface other than a sphere. In fact, the only requirement is that the integral used in the definition of the potential function 5.1 exist and be finite for every $w \in \Re^n$. Moreover, formula 5.5 and its proof remain unchanged for any compact constraint surface. Thus the ascent-descent dynamics of 3.10 and 3.11 as well as the descent dynamics of 3.13 and 3.14 remain valid for the general constrained neuron case.
6 Conclusions
In this paper, we have investigated a generalization of the effective energy function, first introduced by Peterson and Soderberg (1989), using the notion of a conjugate function. We have shown that to each closed bounded constraint set on the neuron state, we can associate a smooth convex potential function. We have also shown that using the conjugate of the convex potentials, we can derive, from the effective energy, a cost function that is a natural generalization of the analog Hopfield energy. Descent dynamics and deterministic annealing can then be used to find the global minimum of the original minimization problem. When the conjugate is hard to compute explicitly, we have shown that a minimax dynamic system can be used to find the saddle points of the effective energy. We have also proved that the saddle points of the effective energy are locally stable equilibrium points for the minimax dynamic system. Furthermore, we have demonstrated that the minimax dynamics possesses a Lyapunov function that is nonincreasing along its trajectories. The general effective energy framework allows us to treat hybrid networks, i.e., networks in which different neurons have different state spaces, in a unified manner. As an illustration of the generality of this framework, we have used the above results to derive Hopfield-like cost functions and descent dynamic systems for two classes of networks previously considered in the literature: winner-take-all networks and rotor networks.

Appendix A: Proof of Lemma 4

Proof. To prove that $A = M + S$ is stable, let $\lambda = \sigma + i\rho$ be a complex eigenvalue of $A$ with a complex eigenvector $e = a + ib$, where $a$ and $b$ are real vectors, and $i = \sqrt{-1}$. Denote by $e^* \triangleq a^T - ib^T$ the conjugated transpose of $e$. Then on the one hand

$$e^*Ae = \lambda e^*e = (\sigma + i\rho)\|e\|^2$$
and on the other hand

$$e^*Ae = a^TMa + b^TMb + 2ia^TSb$$

Therefore

$$\sigma\|e\|^2 = a^TMa + b^TMb \le 0$$

the last inequality following from the fact that the matrix $M$ is negative semidefinite. Therefore, the real part of any eigenvalue of $A$ is nonpositive, i.e., $A$ is stable. $\square$

Appendix B: Proof of Lemma 6

Proof. By definition of the conjugate function, we have

$$\Phi^*(v) = \max_z\,[\langle z, v\rangle - \Phi(z)]$$

This maximum is reached at a point $z = w$ such that

$$v = \nabla\Phi(w) \qquad (B.1)$$

But

$$\Phi^*(u) \ge \langle z, u\rangle - \Phi(z), \qquad \forall u, \forall z$$

Therefore

$$\Phi(w) = \max_u\,[\langle w, u\rangle - \Phi^*(u)]$$

which is reached at $u = v$ such that

$$w = \nabla\Phi^*(v) \qquad (B.2)$$

From B.1 and B.2, we get

$$\nabla\Phi[\nabla\Phi^*(v)] = v \qquad \mathrm{and} \qquad \nabla\Phi^*[\nabla\Phi(w)] = w \qquad \square$$
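Lemma 6 says that the gradient maps of $\Phi$ and $\Phi^*$ invert each other. A concrete check, using the strictly convex potential $\Phi(w) = \sum_i \ln\cosh(w_i)$, a standard two-state-neuron choice used here purely as an illustration, whose gradient pair is tanh/arctanh:

```python
import numpy as np

# Illustration of Lemma 6 with Phi(w) = sum_i ln cosh(w_i):
# grad Phi = tanh (componentwise), grad Phi* = arctanh, and the two
# gradient maps invert each other.
w = np.linspace(-3.0, 3.0, 25)
v = np.tanh(w)                                  # v = grad Phi(w)
assert np.allclose(np.arctanh(v), w)            # grad Phi*[grad Phi(w)] = w
assert np.allclose(np.tanh(np.arctanh(v)), v)   # grad Phi[grad Phi*(v)] = v
```

Applied componentwise, the two identities of the proof above are just the statement that tanh and arctanh are inverses on $(-1, 1)$.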
Acknowledgments

I would like to acknowledge the continuous support and encouragement I have received from John Wyatt. I would like to thank Alan Yuille for many helpful discussions. Finally, I would like to thank an anonymous reviewer for many detailed and thoughtful comments on the first version of this paper. Work supported in part by NSF and ARPA under Contract No. MIP-91-17724.
References
Amit, D. J. 1989. Modeling Brain Function. Cambridge University Press, Cambridge.
Callen, H. B. 1960. Thermodynamics. John Wiley, New York.
Elfadel, I. M. 1993. Global dynamics of winner-take-all networks. In SPIE Proceedings, Vol. 2032, pp. 127-137. San Diego, CA.
Elfadel, I. M., and Wyatt, J. L., Jr. 1994. The "softmax" nonlinearity: Derivation using statistical mechanics and useful properties as a multiterminal analog circuit element. In Advances in Neural Information Processing, J. D. Cowan, G. Tesauro, and J. Alspector, eds., Vol. 6, pp. 882-887. Morgan Kaufmann, San Mateo, CA.
Elfadel, I. M., and Yuille, A. L. 1993. Mean-field phase transitions and correlation functions for Gibbs random fields. J. Math. Imaging Vision 3(2), 167-186.
Geiger, D., and Girosi, F. 1991. Parallel and deterministic algorithms from MRFs: Surface reconstruction. IEEE Trans. PAMI 13(5), 401-412.
Gislén, L., Peterson, C., and Soderberg, B. 1992. Rotor neurons: Basic formalism and dynamics. Neural Comp. 4, 737-745.
Harris, J. G., Koch, C., Luo, J., and Wyatt, J. 1989. Resistive fuses: Analog hardware for detecting discontinuities in early vision. In Analog VLSI Implementation of Neural Systems, C. Mead and M. Ismail, eds., pp. 27-55. Kluwer Academic Publishers, Boston, MA.
Haykin, S. 1994. Neural Networks: A Comprehensive Foundation. Macmillan, New York, NY.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Hopfield, J. J. 1984. Neurons with graded responses have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Hopfield, J. J., and Tank, D. W. 1985. "Neural" computation of decisions in optimization problems. Biol. Cybern. 52, 141-152.
Kosowsky, J. J., and Yuille, A. L. 1994. The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks 7(3), 477-490.
Luenberger, D. G. 1984. Linear and Nonlinear Programming, 2nd ed. Addison-Wesley, Reading, MA.
Marroquin, J. L. 1985. Probabilistic solution of inverse problems. Ph.D. thesis, MIT.
Meir, R. 1992. On deriving deterministic learning rules from stochastic systems. Int. J. Neural Syst. 2, 283-289.
Ortega, J. M., and Rheinboldt, W. C. 1970. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York.
Parisi, G. 1988. Statistical Field Theory. Addison-Wesley, Reading, MA.
Peterson, C., and Soderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1(1), 3-22.
Platt, J. 1989. Constraint methods for neural networks and computer graphics. Ph.D. thesis, California Institute of Technology, Pasadena, CA.
Rockafellar, R. T. 1970. Convex Analysis. Princeton University Press, Princeton, NJ.
Simic, P. D. 1990. Statistical mechanics as the underlying theory of "elastic" and "neural" optimization. Network 1, 89-103.
Simic, P. D. 1991. Constrained nets for graph matching and other quadratic assignment problems. Neural Comp. 3, 169-281.
Strang, G. 1986. Introduction to Applied Mathematics. Wellesley-Cambridge Press, Cambridge, MA.
Vidyasagar, M. 1978. Nonlinear Systems Analysis. Prentice-Hall, Englewood Cliffs, NJ.
Waugh, F. R., and Westervelt, R. M. 1993. Analog neural networks with local competition. I. Dynamics and stability. Phys. Rev. E, in press.
Yuille, A. L., and Kosowsky, J. J. 1994. Statistical physics algorithms that converge. Neural Comp. 6, 341-356.
Yuille, A. L. 1987. Energy functions for early vision and analog networks. AI Memo #987, MIT.
Zemel, R. S., Williams, C. K. I., and Mozer, M. C. 1993. Directional-unit Boltzmann machines. In Advances in Neural Information Processing, S. Hanson, J. D. Cowan, and C. Lee Giles, eds., Vol. 5, pp. 172-179. Morgan Kaufmann, San Mateo, CA.
Received July 9, 1993; accepted October 31, 1994.
Communicated by William Lytton
Patterns of Functional Damage in Neural Network Models of Associative Memory

Eytan Ruppin
Dept. of Computer Science, School of Mathematics, Tel-Aviv University, Ramat-Aviv 69978, Israel

James A. Reggia
Departments of Computer Science and Neurology, A.V. Williams Bldg., University of Maryland, College Park, MD 20742 USA
Current understanding of the effects of damage on neural networks is rudimentary, even though such understanding could lead to important insights concerning neurological and psychiatric disorders. Motivated by this consideration, we present a simple analytical framework for estimating the functional damage resulting from focal structural lesions to a neural network model. The effects of focal lesions of varying area, shape, and number on the retrieval capacities of a spatially organized associative memory are quantified, leading to specific scaling laws that may be further examined experimentally. It is predicted that multiple focal lesions will impair performance more than a single lesion of the same size, that slit-like lesions are more damaging than rounder lesions, and that the same fraction of damage (relative to the total network size) will result in significantly less performance decrease in larger networks. Our study is clinically motivated by the observation that in multi-infarct dementia, the size of metabolically impaired tissue correlates with the level of cognitive impairment more than the size of structural damage. Our results account for the detrimental effect of the number of infarcts rather than the overall size of the structural damage, and for the "multiplicative" interaction between Alzheimer's disease and multi-infarct dementia.

1 Introduction
Understanding the response of neural nets to structural/functional damage is important for assessing the performance of neural network hardware, and in gaining understanding of the mechanisms underlying neurological and psychiatric disorders. Recently, there has been a growing interest in constructing neural models to study how specific pathological neuroanatomical and neurophysiological changes can result in various clinical manifestations, and to investigate the functional organization of Neural Computation
7, 1105-1127 (1995) © 1995 Massachusetts Institute of Technology
the symptoms that result from specific brain pathologies (reviewed in Reggia et al. 1994; Ruppin 1995). In the area of associative memory models specifically, early computational studies found an increase in memory impairment with increasing lesion severity (Wood 1978) (in accordance with Lashley's classical "mass action" principle), and showed that slowly developing lesions can have less pronounced effects than equivalent acute lesions (Anderson 1983). More recently, it was shown that the gradual pattern of clinical deterioration manifested in the majority of Alzheimer's patients can be explained, and that different synaptic compensation rates can account for the observed variation in the severity and progression rate of this disease (Horn et al. 1993; Ruppin and Reggia 1994). Previous work, however, is limited in that model elements have no spatial relationships to one another (all elements are conceptually equidistant). Thus, as there is no way to represent focal (localized) damage in such networks, it has not been possible to study the functional effects of focal lesions on memory and to compare them with those caused by diffuse lesions. This paper presents the first computational study of the effect of focal lesions on memory performance with spatially organized neural networks. It is motivated by the observation that in neural network models, a focal structural lesion (that is, the permanent and complete inactivation of some group of adjacent elements) is accompanied by a surrounding functional lesion composed of structurally intact but functionally impaired elements. This region of functional impairment occurs due to the loss of innervation from the structurally damaged region. It is the combined effect of both regions that determines the actual extent of performance decrease in the network. From a modeling perspective, this paper presents a simple but general approach to analyzing the functional effects of focal lesions.
This approach is used to derive scaling laws that quantify the effects of spatial characteristics of focal lesions such as their number and shape on the performance of network models of associative memory. Beyond its computational interest, the study of the effects of focal damage on the performance of neural network models can lead to a better understanding of functional impairments accompanying focal brain lesions. In particular, we are interested in multiinfarct dementia, a frequent cause of dementia (chronic deterioration of cognitive and memory capacities) characterized by a series of multiple, aggregating focal lesions. The distinction made in the model network considered here between structural and functional lesions has a clinical parallel: "structural" lesions represent regions of infarcted (dead) tissue, as measured by structural imaging methods such as computerized tomography, and "functional" lesions represent regions of metabolically impaired tissue surrounding the infarcted tissue, as measured by functional imaging techniques such as positron emission tomography. Interestingly, in multiinfarct dementia the correlation between the volume of the primary infarct region and the severity of the resulting cognitive deficit is unclear and controversial
(Meyer et al. 1988; del Ser et al. 1990; Liu et al. 1990; Tatemichi et al. 1990; Gorelick et al. 1992). In contrast, there is a strong relationship between the total volume of metabolically impaired tissue measured in the chronic phase and the severity of multiinfarct dementia (Mielke et al. 1992; Heiss et al. 1993a, 1993b). This highlights the importance of studying functional impairment after focal lesions. The reader familiar with the clinical stroke literature should note that the functional lesions modeled in this paper are not the "penumbra" perilesion areas of compromised blood supply and acute metabolic changes that surround focal infarcts during the acute postinfarct period. Rather, they are regions of reduced metabolic activity that are observed in chronic multiinfarct dementia patients months after the last infarct episode. The reduced metabolic activity in these areas is probably a result of both residual postinfarct neuropathological damage and the loss of innervation from the primary infarct region (Mies et al. 1983; Heiss et al. 1993b). Intuitively, it is clear that in large enough lesions the functional damage resulting from loss of innervation should scale proportionally to the lesion circumference. This entails, in turn, that the functional damage should depend on the spatial characteristics of the structural lesion, such as its shape and the number of spatially distinct sublesions composing it. This work is devoted to a formal and quantitative study of these dependencies, and to a discussion of their possible clinical implications. In Section 2, we derive a theoretical framework that characterizes the effects of focal lesions on an associative network's performance. This framework, which is formulated in very general terms, is then examined via simulations with a specific associative memory network in Section 3.
These simulations show a fair quantitative fit with the theoretical predictions, and are compared with simulations examining performance with diffuse damage. The effects of various parameters characterizing the network's architecture on postlesion performance are further investigated in Section 4. Finally, our results are discussed in Section 5, and are evaluated in light of some relevant clinical data.

2 Analytical Scaling Rules
The model network we study consists of a two-dimensional array of units whose edges are connected, forming a torus to eliminate edge effects. Each unit is connected primarily to its nearby neighbors, as in the cortex (Thomson and Deuchars 1994), where the probability of a connection existing between two units is a Gaussian density function of the distance between them in the array. The unit of distance here is the distance between two neighboring elements in the array. Our analysis pertains to the case where, in the predamaged network, all units have similar average activation and performance levels.¹

Figure 1: A sketch of a structural (dark shading) and surrounding functional (light shading) rectangular lesion. The a and b values denote the lengths of the rectangle's sides, and d is the functional impairment span.

A focal structural lesion (anatomical lesion), denoting an area of damage and neuronal death, is modeled by permanently clamping the activity of the lesioned units to zero at the onset of the lesion. As a result of this primary structural lesion, the activity of surrounding units may be decreased, resulting in a secondary functional lesion, as illustrated in Figure 1. We are primarily interested in large focal lesions, where the area s of the lesion is significantly greater than the local neighborhood region from which each unit receives its inputs. Throughout our analysis we shall hold the working assumption that traversing from the border of the lesion outward, the activity of units, and with it, the network performance level, gradually rises from zero until it reaches its normal, predamaged levels, at some distance d from the lesion's border (see Fig. 1). We denote d as the functional impairment span. This assumption reflects the notion that units that are closer to the lesion border lose more viable inputs than units that are farther away from the lesion. Since s is large relative to each element's connectivity neighborhood, d is determined primarily by the effect of the inactive regions at the periphery of the lesion. We may therefore assume that the value of d is independent of the lesion size, and depends specifically on the parameters defining the network's connectivity and dynamics. In Section 4 we will use computer simulations to verify that d is invariant over lesion size, and examine its dependence on the network parameters.

¹The analysis presented in this section is general in the sense that it does not rely on any specific connectivity or activation values. Note, however, that the above statement is true in general for associative memory networks, when the activity of each unit is averaged over a time span sufficiently long for the cueing and retrieval of a few stored patterns.
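A minimal sketch of such a spatially organized random connectivity follows. The grid size, the dispersion parameter, and the unnormalized Gaussian fall-off are illustrative assumptions; the exact parameterization used in the paper is given in its Appendix A.

```python
import numpy as np

def torus_distance(p, q, L):
    """Wrap-around Euclidean distance between grid points p, q on an L x L torus."""
    d = np.abs(np.array(p) - np.array(q))
    d = np.minimum(d, L - d)
    return float(np.sqrt((d ** 2).sum()))

def random_connections(L, sigma, rng):
    """Boolean adjacency matrix: the probability of a connection between two
    units is a Gaussian function of their torus distance (normalization omitted)."""
    coords = [(x, y) for x in range(L) for y in range(L)]
    n = L * L
    C = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            p = np.exp(-torus_distance(coords[i], coords[j], L) ** 2 / (2 * sigma ** 2))
            C[i, j] = C[j, i] = rng.random() < p
    return C

C = random_connections(8, 1.0, np.random.default_rng(0))
assert C.shape == (64, 64) and (C == C.T).all() and not C.diagonal().any()
```

Smaller sigma gives the short-range, spatially organized connectivity the paper studies; larger sigma approaches diffuse connectivity.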
Let the intact baseline performance level of the network be denoted as P(0), and let the network area be A. The network's performance is quantified by some measure P ranging from 0 to 1. For example, if the network is an associative memory (as we study numerically in the next section), P denotes how accurately the network retrieves the correct memorized patterns given a set of input cues (defined formally in equation A.5 of Appendix A). In the predamaged network all units have an approximately similar level of activity and performance. Then, a structural lesion of area s (dark shading in Fig. 1), causing an additional functional lesion of area $A_s$ (light shading in Fig. 1), results in a performance level of approximately

$$P(s) \cong P(0) - \frac{A_s}{A - s}\,\Delta P \qquad (2.1)$$

where $P_{A_s}$ denotes the average level of performance over $A_s$, and $\Delta P = P(0) - P_{A_s}$. $P(s)$ hence reflects the performance level over the remaining viable parts of the network, discarding the structurally damaged region.² Bearing these definitions in mind, the effect of focal lesions on the network's performance level can be characterized by the following rules.

2.1 A Single Lesion. Consider a symmetric, circular structural lesion of size $s = \pi r^2$. The area of functional damage following such a lesion is $A_s = \pi[(r+d)^2 - r^2] = \pi d^2 + 2\sqrt{\pi}\,d\sqrt{s}$. In networks that operate well below the limit of their capacity and hence have significant functional reserves, the second term dominates since s is assumed to be large relative to d, and therefore

Rule 1: $A_s \cong 2\sqrt{\pi}\,d\sqrt{s}$  (2.2)

and (substituting the expression for $A_s$ in 2.1)

$$P(s) \cong P(0) - \frac{k\sqrt{s}}{A - s} \qquad (2.3)$$

for some constant $k = 2\sqrt{\pi}\,d\,\Delta P$. Thus, the area of functional damage surrounding a single focal structural lesion is proportional to the square root of the structural lesion's area. Some analytic performance/lesioning curves (for various k values) are illustrated in Figure 2. Note the different qualitative shape of these curves as a function of k. As is evident, the shape of these curves reflects two conflicting tendencies; they are initially concave (in light of rule 1) and then turn convex (as s increases and the remaining viable area is decreased). Letting x = s/A be the fraction of structural damage, we have

²Alternatively, it is possible to measure the performance over the entire network. This would not affect our findings as long as the same measure is used in both the analysis and simulations, as the mapping between the two performance measures is order preserving.
Figure 2: Theoretically predicted network performance as a function of a single focal structural lesion’s size (area): analytic curves obtained for different k values; A = 1600.
Corollary 1.

$$P(x) \cong P(0) - \frac{k\sqrt{x}}{(1-x)\sqrt{A}} \qquad (2.4)$$
that is, the same fraction x of damage results in less performance decrease in larger networks! This surprising result testifies to the possible protective value of having functional "modular" cortical networks of large size. Corollary 1 results from the fact that the functional damage does not scale up linearly with the structural lesion size, but only as the square root of the latter.

2.2 Varying Shape and Number. Expressions 2.3 and 2.4 are valid also when the structural lesion has a square shape. The resulting functional lesion of an s-size square structural lesion is $A_s = 4d^2 + 4d\sqrt{s}$. To
study the effect of the structural lesion's shape, we consider the area $A_{s[n]}$ of a functional lesion resulting from a rectangular focal lesion of size $s = a \cdot b$ (see Fig. 1), where, without loss of generality, $n = a/b \ge 1$. Then, for large n (i.e., elongated lesions), the area of functional damage is

$$A_{s[n]} = 2d(a + b) + 4d^2 = 2d(n+1)\sqrt{\frac{s}{n}} + 4d^2 \cong 2d\sqrt{n}\sqrt{s} + 4d^2 \le \frac{\sqrt{n}}{2}A_s \qquad (2.5)$$

where in the last step we neglect the contribution of the size-invariant term $4d^2$. The functional damage of a rectangular structural lesion of fixed size increases as its shape is more elongated. More quantitatively we have

Rule 2: $A_{s[n]} \le \frac{\sqrt{n}}{2}A_s$  (2.6)

and

$$P(s) \ge P(0) - \frac{k\sqrt{ns}}{2(A - s)} \qquad (2.7)$$
Next, to study the effect of the number of lesions, consider the area $A_{s,m}$ of a functional lesion composed of m focal rectangular structural lesions (with sides $a = n \cdot b$), each of area s/m. Using expression 2.5, we have

$$A_{s,m} = m\left[2d\left(2d + \sqrt{n}\sqrt{\frac{s}{m}}\right)\right] = \sqrt{m}\left[2d\left(2d\sqrt{m} + \sqrt{n}\sqrt{s}\right)\right] \ge \sqrt{m}\,A_{s[n]} \qquad (2.8)$$

The functional damage hence increases as the number of the focal lesions m increases (total structural lesion area held constant), in accordance with

Rule 3: $A_{s,m} \ge \sqrt{m}\,A_{s[n]}$  (2.9)

which is always valid, irrespective of the value of d, and

$$P(s) \cong P(0) - \frac{k\sqrt{mns}}{2(A - s)} \qquad (2.10)$$
At first glance, the second and third rules seem to indicate that the functional damage caused by varying the shape or by varying the number of focal lesions behaves according to scaling laws of similar order. However, it should be noted that while rule 3 presents a lower bound on the functional damage that may actually be significantly larger, and involves no approximations, rule 2 presents an upper bound on the actual functional damage. As we shall show in the next section, the number of lesions actually affects the network performance significantly more than its precise shape (maintaining the total structural area fixed). Let $A[x]$ denote the functional damage caused by a single focal square lesion of area x (so $A[s]$ is $A_s$). Since $\sqrt{u}\,A[s] \cong A[u \cdot s]$ (by rule 1), then following rules 2 and 3 we obtain the following corollaries:

Corollary 2. $A_{s[n]} \cong A[n/4 \cdot s]$  (2.11)

That is, the functional damage area following a rectangular structural lesion of area s and sides-ratio n is approximately equal to the functional damage area following a larger single square structural lesion of area $n/4 \cdot s$ (for large n).

Corollary 3. $A_{s,m} \ge A[m \cdot s]$  (2.12)

In other words, the functional damage following multiple lesions composed of m rectangular focal structural lesions having total area s is greater than the functional damage following a single square lesion of area $m \cdot s$. As is evident, the analysis presented in this section is based on several simplifying approximations. As such, it cannot be expected to yield an exact match with numerical results from computer simulations. However, as demonstrated in the next section, the scaling rules developed have the same shape as the numerical data, matching quite well at times, testifying to their validity.

3 Numerical Results
We now turn to examine the effect of lesions on the performance of an associative memory network via simulations. The goals of these simulations are twofold: to examine how accurately the general but approximate theoretical results presented above describe the actual performance degradation in a specific associative network, and to compare the effects of focal lesions to those of diffuse ones, as the effect of diffuse damage cannot be described as a limiting case within the framework of our analysis. Our simulations were performed using a standard Tsodyks-Feigel'man attractor neural network (Tsodyks and Feigel'man 1988). This is a Hopfield-like network that has several features that make it more biologically plausible (Horn et al. 1993), such as low activity and nonzero thresholds. Spatially organized attractor networks can function reasonably well as associative memory devices (Karlholm
1993), and a biologically inspired realization of attractor networks using cortical columns as its elements has also been proposed (Lansner and Fransen 1994). The recent findings of delayed, poststimulus, sustained activity in memory-related tasks, both in the temporal (Miyashita and Chang 1988) and frontal (Wilson et al. 1993) cortices, provide support to the plausibility of such attractor networks as a model of associative cortical areas. A detailed formulation of the network used and simulation parameters is given in Appendix A. Each unit's connectivity is parameterized by σ, where smaller σ values denote a shorter (and more spatially organized) connectivity range. The network's performance level is quantified by an overlap measure m ranging in the interval [−1, +1]. The overlap m measures the similarity between the network's end state and the cued memory pattern (which is the desired response), averaged over many trials with different input cues. We now describe the results of simulations examining the scaling rules derived in the previous section.

3.1 Performance Decrease with a Single Lesion. Figure 3 plots the network's performance as a function of the area of a single square-shaped focal lesion. As is evident, the spatially organized connectivity enables the network to maintain its memory retrieval capacities in the face of focal lesions of considerable size. As the connectivity dispersion σ increases, focal lesions become more damaging. Also plotted in Figure 3 is the analytical curve calculated via rule 1 and expression 2.3 with k = 5, which matches well with the actual performance of the spatially connected network parameterized by σ = 1. Concentrating on the study of focal lesions in a spatially connected network, we shall adhere to the values σ = 1 and k = 5 hereafter, and compare the analytical and numerical results.
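The overlap measure itself is simple to compute. The sketch below uses a common form of the Tsodyks-Feigel'man overlap for {0,1} patterns with coding level a; the exact normalization, thresholds, and dynamics of the paper's Appendix A are not reproduced here.

```python
import numpy as np

def overlap(state, pattern, a):
    """Retrieval overlap m between a {0,1} network state and a stored pattern
    with coding level a. A common Tsodyks-Feigel'man form; the exact
    normalization in the paper's Appendix A may differ."""
    N = len(pattern)
    return float((pattern - a) @ state) / (a * (1.0 - a) * N)

a, N = 0.1, 1000
pattern = np.zeros(N)
pattern[:100] = 1.0                  # a stored pattern with a*N active units
shifted = np.zeros(N)
shifted[500:600] = 1.0               # an unrelated state with the same activity level

assert abs(overlap(pattern, pattern, a) - 1.0) < 1e-9   # perfect retrieval -> m = 1
assert abs(overlap(shifted, pattern, a)) < 0.2          # unrelated state -> m near 0
```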
The performance of the network as a function of the fraction of the network lesioned, for different network areas A, is displayed in Figure 4. The analytical curves, plotted using equation 2.4 (with k = 5), are qualitatively similar to the numerical results (with σ = 1). The sparing effect of large networks is marked.
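The kind of simulation described above can be sketched in a few lines. The following toy Python version is a simplified stand-in, not the paper's exact Tsodyks-Feigel'man parameterization: a Hopfield-style binary network on a lattice with short-range connectivity, a square focal lesion implemented by clamping a block of units to zero, and performance measured by the overlap m between the network's end state and the cued pattern. All sizes and parameter names (L, M, p, sigma) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy lattice attractor network with spatially organized connectivity.
# Hypothetical parameters: L x L lattice, M sparse patterns, coding level p.
L, M, p, sigma = 20, 3, 0.1, 3.0
N = L * L
pats = (rng.random((M, N)) < p).astype(float)

# Covariance (Hebbian-style) storage rule.
W = np.zeros((N, N))
for xi in pats:
    v = xi - p
    W += np.outer(v, v)
np.fill_diagonal(W, 0.0)

# Keep only short-range connections (radius ~ 2 * sigma on the lattice).
coords = np.array([(i, j) for i in range(L) for j in range(L)])
d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
W *= (d2 <= (2 * sigma) ** 2)

def retrieve(cue, alive, steps=15):
    """Synchronous 0/1 dynamics; lesioned units are clamped to zero."""
    s = cue * alive
    for _ in range(steps):
        s = ((W @ s) > 0.0).astype(float) * alive
    return s

def overlap(s, xi):
    """Overlap m: similarity of the end state to the cued memory pattern."""
    return float((s - p) @ (xi - p)) / (N * p * (1 - p))

# Square focal structural lesion: clamp a 7 x 7 block of units to zero.
alive = np.ones((L, L))
alive[5:12, 5:12] = 0.0
alive = alive.ravel()

m_intact = overlap(retrieve(pats[0], np.ones(N)), pats[0])
m_lesioned = overlap(retrieve(pats[0], alive), pats[0])
print(m_intact, m_lesioned)
```

With an intact network the cued pattern should be retrieved nearly perfectly, while the focal lesion lowers the overlap roughly in proportion to the pattern units it removes plus the perilesional functional damage.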
3.2 The Effects of Shape and Number. To examine rule 2, a rectangular structural lesion of area s = 300 was induced in the network. As shown in Figure 5a, as the ratio between the sides is increased while holding the area constant, the network's performance decreases further, but this effect is relatively mild (note the values on the vertical axis). There is fair quantitative agreement with the theoretical predictions obtained using 2.7, which are also plotted in Figure 5. The effect of varying the lesion number while keeping the overall lesion area fixed, stated in rule 3, is demonstrated in Figure 5b, which shows the effect of multiple lesions composed of 2, 4, 8, and 16 separate focal lesions. This effect
Eytan Ruppin and James A. Reggia
Figure 3: Network performance as a function of focal lesion size. Simulation results obtained in three different networks, each characterized by a distinct distribution of spatially organized connectivity, and analytic results calculated with k = 5 using equation 2.2.
is much stronger than that seen with lesion shape (note the values on the vertical axis). As is also evident, the analytical results computed using 2.10 correspond quite closely with the numerical ones. However, in both figures, the analytically calculated performance is consistently higher than that actually achieved in simulations, as the d² term is omitted in the analytic approximation. Note also (Fig. 5b) that as the lesion size is increased the analytic results correspond better to the simulation results, as d becomes smaller in relation to √s. To compare the effects of focal and diffuse lesions, the performance achieved with a diffuse lesion of similar size is arbitrarily plotted on the 20th x-ordinate. It is interesting to note that a large multiple focal lesion (s = 512) can cause a larger performance decrease than a diffuse lesion of similar size. That is, at some point, when the size of each individual focal
Figure 4: Network performance as a function of the fraction of focal damage, in networks of different sizes. Both analytical (a) and numerical (b) results are displayed.

lesion becomes small in relation to the width of each unit's connectivity, our analysis loses its validity, and rule 3 no longer holds. Hence, the effect of a diffuse lesion on the network's performance cannot be calculated by viewing it as a "limiting case" of multiple focal lesions.

3.3 Diffuse Lesions in Spatially Organized Networks. Figure 6 displays how the performance of the network degrades when diffuse structural lesions of increasing size are inflicted upon it by randomly selecting units on the lattice and clamping their activity to zero. While the performance of nonspatially connected networks manifests the classical sharp decline [denoted "catastrophic breakdown" (Amit 1989)] at some critical lesion size (Fig. 6, σ = 30), the performance of spatially connected networks (Fig. 6, σ = 1) degrades in a more gradual manner as the size of the diffuse lesion increases. It is of interest that this "graceful" degradation parallels the gradual clinical and cognitive decline observed in the majority of Alzheimer patients (Katzman 1986; Katzman et al. 1988). A comparison of Figures 3 and 6 demonstrates that diffuse lesions are generally more detrimental than a single focal lesion of identical area.
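One intuition behind this focal/diffuse difference can be checked directly: in a spatially connected lattice, a diffuse lesion removes some afferents from nearly every surviving unit, whereas an equal-area focal lesion deafferents only the perilesional units. The sketch below compares the mean fraction of lost afferent connections per surviving unit under the two lesion types; the lattice size, connectivity radius R, and lesion area are hypothetical choices, not the paper's parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Compare lost afferent input under focal vs. diffuse lesions of equal area
# in a lattice network with short-range connectivity. Purely illustrative.
L, R, area = 30, 3, 49
coords = np.array([(i, j) for i in range(L) for j in range(L)])
d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
neigh = (d2 > 0) & (d2 <= R * R)          # each unit's input neighborhood

def mean_input_loss(alive):
    """Average fraction of a surviving unit's inputs that were lesioned."""
    lost = (neigh & ~alive[None, :]).sum(1) / neigh.sum(1)
    return lost[alive].mean()

# Focal lesion: one contiguous 7 x 7 block.
focal = np.ones((L, L), bool)
focal[10:17, 10:17] = False
focal = focal.ravel()

# Diffuse lesion: the same number of units, scattered at random.
diffuse = np.ones(L * L, bool)
diffuse[rng.choice(L * L, size=area, replace=False)] = False

print(mean_input_loss(focal), mean_input_loss(diffuse))
```

The diffuse lesion produces a markedly larger mean input loss, consistent with the comparison of Figures 3 and 6, even though the number of destroyed units is identical.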
Figure 5: Network performance as a function of structural focal lesion shape (a) and number (b), while keeping the total structural lesion area constant. Both numerical and analytical results are displayed. The simulations were performed in a network whose connectivity was generated with σ = 1; the analytical results are for the corresponding k = 5. In Figure 5b, the x-ordinate denotes the number of separate sublesions (1, 2, 4, 8, 16) and, for comparison, the performance achieved with a diffuse lesion of similar size is plotted arbitrarily on the 20th x-ordinate.

4 The Functional Impairment Span d
The correspondence obtained between the theoretical and simulation results presented in the previous section testifies to the validity of the lesion-invariant impairment span that has been central to our analysis. This assumption is further supported directly by extensive simulations demonstrating that the span d remains practically invariant when the lesion size is varied (for large lesions). We now turn to study the influence of several factors, such as the spatial connectivity distribution and the noise level T (defined in Appendix A), on the functional impairment span.³ The simulation results described below are compared with analytical results obtained by iterating the overlap system of equations B.3 derived in Appendix B. These equations describe the dependence of an overlap vector, whose components are "local" overlaps measured at consecutive distances from the border of the lesion (and hence termed distance-overlaps), on various parameters of the network.

³Note that the average activity levels and performance have a similar functional impairment span. This follows since, due to the decrease in synaptic inputs to neurons in the functional lesion region, the probability of retrieval errors is mainly due to neurons that should fire (i.e., they belong to the cued pattern) but are silent, and not due to neurons that fire erroneously (i.e., they do not belong to the cued pattern).

Figure 6: Network performance as a function of diffuse lesion size. Simulation results obtained in four different networks, each characterized by a distinct distribution of spatially organized connectivity.

Figure 7a displays the distance-overlap span obtained with an almost noiseless (T = 0.001) network and with a network with noisy dynamics (T = 0.020). As is evident from the analytical results, as the noise level is increased the span d is markedly increased (e.g., from roughly d = 3 to d = 6 in Fig. 7a). In other words, the functional damage is significantly larger for increased noise levels. As one would expect, increasing the noise levels also results in decreased performance levels. As shown in Figure 7b, after decreasing the synaptic connectivity by randomly eliminating some fraction of the synapses of each unit (leaving
Figure 7: Performance levels as a function of distance from the lesion border, with different (a) noise levels and (b) connectivity. Both analytical and simulation results are displayed; s = 200.

each unit with 40 instead of 60 incoming synapses), the network's performance decreases in a manner similar to the noisy dynamics case, but there is only a slight increase in d. In our simplified network the effects of random synaptic deletion are essentially equivalent to those of random neural damage (inactivation of some units), so Figure 7b also illustrates how diffuse neuronal degeneration in the region of the functional lesion (as observed in the perilesion area after stroke) would affect the severity of multiinfarct dementia. The distance-overlap span calculated via the analytic approximation (B.3) qualitatively agrees with the simulation results, but is shifted upward at short distances compared with the latter. This shift results primarily from approximations in the theoretical derivation: neglecting the effects of the noncued memory patterns, and neglecting the correlations that evolve between subsequent external inputs to the same unit due to the spatially organized connectivity. Studying the dependency of the span d on the connectivity dispersion parameter σ [or its equivalent r in the theoretical expressions (B.3)] requires larger networks than those that we could practically simulate. Hence, only analytical results are presented in Figure 8. As is evident, increased r levels result in a marked increase in the distance-overlap span,
Figure 8: Performance levels as a function of distance from the lesion border, with different r values. Analytical results.

and in a more gradual performance gradient. As one would expect, at sufficiently high levels of r the typical gradient of the distance-overlaps' span vanishes (not shown in Fig. 8). The practically negligible change in the length of the span d following random synaptic deletion (demonstrated in Fig. 7) may then be understood by noticing that, on the one hand, random synaptic deletion tends to increase the noise in the network and hence increase d, but on the other hand it tends to shorten the average connectivity span and hence decrease d.

5 Discussion
We have presented a simple analytical framework for studying the effects of focal lesions on the functioning of spatially organized neural networks. The analysis presented is quite general and a similar approach could be
adopted to investigate the effect of focal lesions in other neural models, such as models of random neural networks (Minai and Levy 1993) or cortical map organization (Sutton et al. 1994). Using this analysis, specific scaling rules have been formulated describing the functional effects of structural focal lesions on memory retrieval performance in associative attractor networks. The functional lesion scales as the square root of the size of a single structural lesion, and the form of the resulting performance curve depends on the impairment span d. Surprisingly, the same fraction of damage results in significantly less performance decrease in larger networks, pointing to their relative robustness. As to the effects of shape and number, elongated structural lesions cause more damage than more symmetrical ones. However, the number of sublesions is the most critical factor determining the functional damage and performance decrease in the model. Numerical studies show that under some conditions multiple lesions can damage performance more than diffuse damage, even though the amount of lost innervation is always less in a multiple focal lesion than with diffuse damage. The main parameter determining the relation between structural and functional damage is the length of the impairment span d. This span has been found to increase with the noise level T of the network and the connectivity dispersion parameter σ. It should be noted that when d gets large (in relation to the network dimensions) multiple lesions are likely to "interact" (i.e., their resulting functional lesions are likely to intersect) and may increase the overall performance deterioration. In the introduction we described the parallel between the structural/functional distinction that underlies this study and a similar distinction made with regard to infarcted tissue versus metabolically impaired regions in multiinfarct dementia.
What are the clinical implications of this study with respect to the latter disease? Our results indicate a significant role for the number of infarcts in determining the extent of functional damage and dementia in multiinfarct disease. In our model, multiple focal lesions cause a much larger deficit than their simple "sum," i.e., a single lesion of equivalent total size. This is consistent with clinical studies suggesting that the main factors related to the prevalence of dementia after stroke are the infarct number and site, and not the overall infarct size, which is related to the prevalence of dementia in a significantly weaker manner (Tatemichi et al. 1990; Tatemichi 1990; del Ser et al. 1990). As noted by Hachinski (1983), "In general, the effect of additional lesions of the brain increases with the number of lesions already present, so that the deficits in the brain do not add up, they multiply." We have found that decreasing the connectivity of each unit, and decreasing the fidelity of network dynamics by increasing the noise level, may lead not only to a decrease in the overall level of performance, but also to an increase in the length of the distance-overlap span in the perilesion area. The degenerative synaptic changes occurring as Alzheimer's disease progresses are known to lead to a reduction in the number of synapses in a unit volume of the cortex (e.g., DeKosky and Scheff 1990), and the accompanying synaptic compensatory changes increase the level of noise in the system (Horn et al. 1993). This offers a plausible explanation for the "multiplicative" interaction occurring between coexisting Alzheimer's and multiinfarct dementia (Tatemichi 1990), where cortical atrophy contributes as an independent variable to the severity of stroke symptomatology (Levine et al. 1986) and increases the severity of stroke symptomatology in Alzheimer patients. The loss of innervation from a focal stroke region to its immediate surroundings, such as that studied in this paper, may be viewed as a sort of "local diaschisis." In contrast, global, interhemispheric diaschisis denotes the "disconnection" of neural structures that are far apart in the brain, and may lead to structurally normal regions with reduced metabolism, as observed in several neurological disorders (Feeney and Baron 1986). Such a metabolic depression of apparently intact structures involving Papez's circuit and basal anterior regions has recently been observed in human patients suffering from "pure" amnesia (Fazio et al. 1992). Given more information concerning the patterns of connectivity between these structures, it may be possible in the future to study the functional consequences of interhemispheric diaschisis. Future studies of diaschisis may also address the effects of subcortical infarcts, which frequently accompany cortical lesions in multiinfarct dementia. Interestingly, it has very recently been shown that, as with cortical infarction, with subcortical infarction the number of infarcts but not the volume of infarction (as measured in computerized tomography scans) is significantly associated with cognitive impairment in stroke patients (Corbett et al. 1994).
Appendix A: The Numerical Simulations

The attractor network used in this study is composed of N units, where each unit i is described by a binary variable Si = {1, 0} denoting an active (firing) or passive (quiescent) state, respectively. M distributed memory patterns ξ^μ, where the superscript μ indicates a pattern index, are stored in the network. The elements of each memory pattern are randomly chosen to be 1 or 0 with probability p or 1 − p, respectively, with p

…, j > 0, are tapped delay lines linking x(t − j) to y(t). The leading weight thus adjusts just as would a weight connected to a neuron with only that one input (see Section 2.1). The delay weights attempt to decorrelate the past input from the present output. Thus the filter is kept from "shrinking" by its leading weight. The utility of this rule for performing blind deconvolution is demonstrated in Section 5.2.

2.4 For Weights with Time Delays. Consider a weight, w, with a time delay, d, and a sigmoidal nonlinearity, g, so that
y(t) = g[w x(t − d)]   (2.25)

We can maximize the entropy of y with respect to the time delay, again by maximizing the log slope of y (as in 2.6):

Δd ∝ ∂/∂d (ln |y′|)   (2.26)

The crucial step in this derivation is to realize that

(∂/∂d) x(t − d) = −(∂/∂t) x(t − d)   (2.27)

Calling this quantity simply −ẋ, we may then write

(∂/∂d) (ln |y′|) = −wẋ g″(u)/g′(u), with u = w x(t − d)   (2.28)

Our general rule is therefore given as follows:

Δd ∝ −wẋ g″(u)/g′(u)   (2.29)

³The corresponding rules for noncausal filters are substantially more complex.
Information-Maximization
When g is the tanh function, for example, this yields the following rule for adapting the time delay:

Δd ∝ 2wẋy   (2.30)
This rule holds regardless of the architecture in which the network is embedded, and it is local, unlike the Δw rule in 2.16. It bears a resemblance to the rule proposed by Platt and Faggin (1992) for adjustable time delays in the network architecture of Jutten and Herault (1991). The rule has an intuitive interpretation. First, if w = 0, there is no reason to adjust the delay. Second, the rule maximizes the delivered power of the inputs, stabilizing when ⟨ẋy⟩ = 0. As an example, if y received several sinusoidal inputs of the same frequency, ω, and different phase, each with its own adjustable time delay, then the time delays would adjust until the phases of the time-delayed inputs were all the same. Then, for each input, ⟨ẋy⟩ would be proportional to ⟨cos ωt · tanh(sin ωt)⟩, which would be zero. In adjusting delays, therefore, the rule will attempt to line up similar signals in time, and cancel time delays caused by the same signal taking alternate paths. We hope to explore, in future work, the usefulness of this rule for adjusting time delays and tap-spacing in blind separation and blind deconvolution tasks.

2.5 For a Generalized Sigmoid Function. In Section 4, we will show how it is sometimes necessary not only to train the weights of the network, but also to select the form of the nonlinearity, so that it can "match" input pdf's. In other words, if the input to a neuron is u, with a pdf of f_u(u), then our sigmoid should approximate, as closely as possible, the cumulative distribution of this input:

g(u) = ∫_{−∞}^{u} f_u(v) dv   (2.31)
One way to do this is to define a "flexible" sigmoid that can be altered to fit the data, in the sense of 2.31. An example of such a function is the asymmetric generalized logistic function (see also Baram and Roth 1994) described by the differential equation:

y′ = y^p (1 − y)^r   (2.32)
where p and r are positive real numbers. Numerical integration of this equation produces sigmoids suitable for very peaked (p, r > 1; see Fig. 2b) and flat, unit-like (p, r < 1; see Fig. 2c) input distributions. So by varying these coefficients, we can mold the sigmoid so that its slope fits unimodal distributions of varying kurtosis. By having p ≠ r,
Anthony J. Bell and Terrence J. Sejnowski
Figure 2: The generalized logistic sigmoid (top row) of 2.32, and its slope, y′ (bottom row), for (a) p = r = 1, (b) p = r = 5, and (c) p = r = 0.2. Compare the slope of (b) with the pdf in Figure 5a: it provides a good match for natural speech signals.
we can also account for some skew in the distributions. When we have chosen values for p and r, perhaps by some optimization process, the rules for changing a single input-output weight, w, and a bias, w₀, are subtly altered from 2.14 and 2.11, but clearly the same when p = r = 1:

Δw ∝ 1/w + x[p(1 − y) − ry]   (2.33)

Δw₀ ∝ p(1 − y) − ry   (2.34)
The importance of being able to train a general function of this type will be explained in Section 4.
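A minimal numerical sketch of 2.32-2.34, assuming the differential-equation form dy/du = y^p (1 − y)^r for equation 2.32: a simple Euler integration from y(0) = 1/2 produces the flexible sigmoid, and for p = r = 1 it should reduce to the standard logistic function. The grid range and step size are illustrative choices.

```python
import numpy as np

# Euler integration of dy/du = y^p (1 - y)^r from y(0) = 1/2 (eq. 2.32).
def generalized_logistic(u, p=1.0, r=1.0, eps=1e-9):
    du = u[1] - u[0]
    i0 = int(np.argmin(np.abs(u)))     # grid index closest to u = 0
    y = np.empty_like(u)
    y[i0] = 0.5
    for i in range(i0, len(u) - 1):    # integrate forward
        y[i + 1] = np.clip(y[i] + du * y[i]**p * (1 - y[i])**r, eps, 1 - eps)
    for i in range(i0, 0, -1):         # integrate backward
        y[i - 1] = np.clip(y[i] - du * y[i]**p * (1 - y[i])**r, eps, 1 - eps)
    return y

def delta_w(w, x, y, p=1.0, r=1.0):
    """Equation 2.33: update for a single input-output weight (up to a rate)."""
    return 1.0 / w + x * (p * (1.0 - y) - r * y)

def delta_w0(y, p=1.0, r=1.0):
    """Equation 2.34: bias update (up to a rate)."""
    return p * (1.0 - y) - r * y

u = np.linspace(-6.0, 6.0, 1201)
y_flex = generalized_logistic(u, p=1.0, r=1.0)
y_logistic = 1.0 / (1.0 + np.exp(-u))
print(np.max(np.abs(y_flex - y_logistic)))   # small Euler-integration error
```

Note that at p = r = 1 the bias update vanishes exactly at y = 1/2, consistent with the symmetric logistic case.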
3 Background to Blind Separation and Blind Deconvolution

Blind separation and blind deconvolution are related problems in signal processing. In blind separation, as introduced by Herault and Jutten (1986), and illustrated in Figure 3a, a set of sources, s_1(t), ..., s_N(t) (different people speaking, music, etc.), is mixed together linearly by a matrix A. We do not know anything about the sources, or the mixing process. All
Figure 3: Network architectures for (a) blind separation of five mixed signals, and (b) blind deconvolution of a single signal.

we receive are the N superpositions of them, x_1(t), ..., x_N(t). The task is to recover the original sources by finding a square matrix, W, which is a permutation and rescaling of the inverse of the unknown matrix, A. The problem has also been called the "cocktail-party" problem.⁴ In blind deconvolution, described in Haykin (1991, 1994a) and illustrated in Figure 3b, a single unknown signal s(t) is convolved with an unknown tapped delay-line filter a_1, ..., a_K, giving a corrupted signal x(t) = a(t) * s(t), where a(t) is the impulse response of the filter. The task is to recover s(t) by convolving x(t) with a learnt filter w_1, ..., w_L, which reverses the effect of the filter a(t). There are many similarities between the two problems. In one, sources are corrupted by the superposition of other sources. In the other, a source is corrupted by time-delayed versions of itself. In both cases, unsupervised learning must be used because no error signals are available. In both cases, second-order statistics are inadequate to solve the problem. For example, for blind separation, a second-order decorrelation technique such as that of Barlow and Foldiak (1989) would find uncorrelated, or linearly independent, projections, y, of the input data, x. But it could only find a symmetric decorrelation matrix, which would not suffice if the mixing matrix, A, were asymmetric (Jutten and Herault 1991). Similarly, for blind deconvolution, second-order techniques based on the autocorrelation function, such as prediction-error filters, are phase-blind. They do not have sufficient information to estimate the phase of the corrupting filter, a(t), only its amplitude (Haykin 1994a).
⁴Though for now, we ignore the problem of signal propagation delays.
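The two problem setups can be made concrete with a few lines of data generation (no learning involved yet). Laplacian noise serves here as a convenient super-gaussian stand-in for speech; the dimensions and filter taps are arbitrary illustrative choices, not values from the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000

# (a) Blind separation: N sources mixed by an unknown square matrix A.
S = rng.laplace(size=(5, T))          # super-gaussian stand-ins for speech
A = rng.uniform(-1, 1, size=(5, 5))   # unknown mixing matrix
X = A @ S                              # the N observed superpositions
# Goal: find W such that WA is a rescaled permutation matrix.

# (b) Blind deconvolution: one source corrupted by an unknown causal filter.
s = rng.laplace(size=T)
a = np.array([1.0, 0.6, -0.3, 0.1])   # unknown tapped delay-line filter
x = np.convolve(s, a)[:T]             # corrupted signal x(t) = a(t) * s(t)
# Goal: learn taps w_1..w_L whose convolution with x undoes a.
print(X.shape, x.shape)
```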
The reason why second-order techniques fail is that these two "blind" signal processing problems are information-theoretic problems. We are assuming, in the case of blind separation, that the sources, s, are statistically independent and non-gaussian, and in the case of blind deconvolution, that the original signal, s(t), consists of independent symbols (a white process). Then blind separation becomes the problem of minimizing the mutual information between outputs, u_i, introduced by the mixing matrix A; and blind deconvolution becomes the problem of removing from the convolved signal, x(t), any statistical dependencies across time, introduced by the corrupting filter a(t). The former process, the learning of W, is called the problem of independent component analysis, or ICA (Comon 1994). The latter process, the learning of w(t), is sometimes called the whitening of x(t). Henceforth, we use the term redundancy reduction when we mean either ICA or the whitening of a time series. In either case, it is clear in an information-theoretic context that second-order statistics are inadequate for reducing redundancy, because the mutual information between two variables involves statistics of all orders, except in the special case that the variables are jointly gaussian. In the various approaches in the literature, the higher-order statistics required for redundancy reduction have been accessed in two main ways. The first way is the explicit estimation of cumulants and polyspectra. See Comon (1994) and Hatzinakos and Nikias (1994) for the application of this approach to separation and deconvolution, respectively. The drawbacks of such direct techniques are that they can sometimes be computationally intensive, and may be inaccurate when cumulants higher than fourth order are ignored, as they usually are.
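The inadequacy of second-order statistics is easy to demonstrate numerically: a 45-degree rotation of two independent uniform sources is exactly uncorrelated, so any purely second-order decorrelator would accept it, yet a fourth-order statistic reveals the remaining dependence. The sample size and statistic below are purely illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Uncorrelated is not independent: rotate two independent uniform sources.
T = 200_000
s = rng.uniform(-1, 1, size=(2, T))
c = np.cos(np.pi / 4)
R = np.array([[c, c], [-c, c]])            # 45-degree rotation (orthogonal)
y = R @ s

corr = np.corrcoef(y)[0, 1]                # second-order: ~0, looks "done"
# Fourth-order dependence: cov(y1^2, y2^2) is nonzero for these
# sub-gaussian sources, exposing the unresolved rotation.
dep = np.mean(y[0]**2 * y[1]**2) - np.mean(y[0]**2) * np.mean(y[1]**2)
print(corr, dep)
```

The correlation is essentially zero while the fourth-order statistic is clearly nonzero, which is exactly the ambiguity that mutual-information-based criteria resolve and covariance-based ones cannot.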
It is currently not clear why direct approaches can be surprisingly successful despite errors in the estimation of the cumulants, and in the usage of these cumulants to estimate mutual information. The second main way of accessing higher-order statistics is through the use of static nonlinear functions. The Taylor series expansions of these nonlinearities yield higher-order terms. The hope, in general, is that learning rules containing such terms will be sensitive to the right higher-order statistics necessary to perform ICA or whitening. Such reasoning has been used to justify both the Herault-Jutten (or H-J) approach to blind separation (Comon et al. 1991) and the so-called "Bussgang" approaches to blind deconvolution (Bellini 1994). The drawback here is that there is no guarantee that the higher-order statistics yielded by the nonlinearities are weighted in a way relating to the calculation of statistical dependency. For the H-J algorithm, the standard approach is to try different nonlinearities on different problems to see if they work. Clearly, it would be of benefit to have some method of rigorously linking our choice of a static nonlinearity to a learning rule performing gradient ascent in some quantity relating to statistical dependency. Because of the infinite number of higher-order statistics involved in statistical dependency, this has generally been thought to be impossible. As we now show, this belief is incorrect.

4 When Does Information Maximization Reduce Statistical Dependence?

In this section, we consider under what conditions the information maximization algorithm presented in Section 2 minimizes the mutual information between outputs (or time points) and therefore performs redundancy reduction. Consider a system with two outputs, y1 and y2 (two output channels in the case of separation, or two time points in the case of deconvolution). The joint entropy of these two variables may be written as (Papoulis 1984, equation 15-93):

H(y1, y2) = H(y1) + H(y2) − I(y1, y2)   (4.1)
Maximizing this joint entropy consists of maximizing the individual entropies while minimizing the mutual information, I(y1, y2), shared between the two. When this latter quantity is zero, the two variables are statistically independent, and the pdf can be factored: f_{y1,y2}(y1, y2) = f_{y1}(y1) f_{y2}(y2). Both ICA and the "whitening" approach to deconvolution are examples of minimizing I(y1, y2) for all pairs y1 and y2. This process is variously known as factorial code learning (Barlow 1989), predictability minimization (Schmidhuber 1992), as well as independent component analysis (Comon 1994) and redundancy reduction (Barlow 1961; Atick 1992). The algorithm presented in Section 2 is a stochastic gradient ascent algorithm that maximizes the joint entropy in 4.1. In doing so, it will, in general, reduce I(y1, y2), reducing the statistical dependence of the two outputs. However, it is not guaranteed to reach the absolute minimum of I(y1, y2), because of interference from the other terms, the H(y_i). Figure 4 shows one pathological situation where a "diagonal" projection (Fig. 4c) of two independent, uniformly distributed variables x1 and x2 is preferred over an "independent" projection (Fig. 4b). This is because of a "mismatch" between the input pdf's and the slope of the sigmoid nonlinearity. The learning procedure is able to achieve higher values in Figure 4c for the individual output entropies, H(y1) and H(y2), because the pdf's of x1 + x2 and x1 − x2 are triangular, more closely matching the slope of the sigmoid. This interferes with the minimization of I(y1, y2). In many practical situations, however, such interference will have minimal effect. We conjecture that only when the pdf's of the inputs are sub-gaussian (meaning their kurtosis, or fourth-order standardized cumulant, is less than 0) may unwanted higher entropy solutions for logistic
Figure 4: An example of when joint entropy maximization fails to yield statistically independent components. (a) Two independent input variables, x1 and x2, having uniform (flat) pdf's, are input into an entropy maximization network with sigmoidal outputs. Because the input pdf's are not well matched to the nonlinearity, the "diagonal" solution (c) has higher joint entropy than the "independent-component" solution (b), despite its having nonzero mutual information between the outputs. The values given are for illustration purposes only.

sigmoid networks be found by combining inputs in the way shown in Figure 4c (Kenji Doya, personal communication). Many real-world analog signals, including the speech signals we used, are super-gaussian. They have longer tails and are more sharply peaked than gaussians (see Fig. 5).

Figure 5: Typical probability density functions for (a) speech, (b) rock music, and (c) gaussian white noise. The kurtosis of pdf's (a) and (b) was greater than 0, and they would be classified as super-gaussian.

For such signals, in our experience, maximizing the joint entropy in simple logistic sigmoidal networks always minimizes the mutual information between the outputs (see the results in Section 5). We can tailor conditions so that the mutual information between outputs is minimized, by constructing our nonlinear function, g(u), so that it matches, in the sense of 2.31, the known pdf's of the independent variables. When this is the case, H(y) will be maximized [meaning f_y(y) will be the flat unit distribution] only when u carries one single independent variable. Any linear combination of the variables will produce a "more gaussian" f_u(u) (due to central limit tendencies) and a resulting suboptimal (nonflat) f_y(y). We have presented, in Section 2.5, one possible "flexible" nonlinearity. This suggests a two-stage algorithm for performing independent component analysis. First, a nonlinearity such as that defined by 2.32 is optimized to approximate the cumulative distributions, 2.31, of known independent components (sources). Then networks using this nonlinearity are trained using the full weight matrix and bias vector generalization of 2.33 and 2.34:
ΔW ∝ [Wᵀ]⁻¹ + [p(1 − y) − ry] xᵀ   (4.2)

Δw₀ ∝ p(1 − y) − ry   (4.3)
This way, we can be sure that the problem of maximizing the mutual information between the inputs and outputs, and the problem of minimizing the mutual information between the outputs, have the same solution. This argument is well supported by the analysis of Nadal and Parga (1995), who independently reached the conclusion that in the low-noise limit, information maximization yields factorial codes when both the nonlinear function, g(u), and the weights, w, can be optimized. Here, we provide a practical optimization method for the weights and a framework for optimizing the nonlinear function. Having discussed these caveats, we now present results for blind separation and blind deconvolution using the standard logistic function.

5 Methods and Results
The experiments presented here were obtained using 7-second segments of speech recorded from various speakers (only one speaker per recording). All signals were sampled at 8 kHz from the output of the auxiliary microphone of a Sparc-10 workstation. No special postprocessing was performed on the waveforms, other than normalizing their amplitudes so they were appropriate for use with our networks (input values roughly between −3 and 3). The method of training was stochastic gradient ascent, but because of the costly matrix inversion in 2.14, weights were usually adjusted based on the summed ΔWs of small "batches" of length B, where 5 ≤ B ≤ 300. Batch training was made efficient using
vectorized code written in MATLAB. To ensure that the input ensemble was stationary in time, the time index of the signals was permuted. This means that at each iteration of the training, the network would receive input from a random time point. Various learning rates⁵ were used (0.01 was typical). It was helpful to reduce the learning rate during learning for convergence to good solutions.

5.1 Blind Separation Results. The architecture in Figure 3a and the algorithm in 2.14 and 2.15 were sufficient to perform blind separation. A random mixing matrix, A, was generated with values usually uniformly distributed between −1 and 1. This was used to make the mixed time series, x, from the original sources, s. The matrices s and x, then, were both N × M matrices (N signals, M time points), and x was constructed from s by (1) permuting the time index of s to produce sᵗ, and (2) creating the mixtures, x, by multiplying by the mixing matrix: x = Asᵗ. The unmixing matrix W and the bias vector w₀ were then trained. An example run with five sources is shown in Figure 6. The mixtures, x, formed an incomprehensible babble. The unmixed solution was reached after around 10⁶ time points were presented, equivalent to about 20 passes through the complete time series,⁶ though much of the improvement occurred on the first few passes through the data. Any residual interference in u is inaudible. This is reflected in the permutation structure of the matrix WA:
WA ≈ a scaled permutation matrix: one substantial (boxed) entry per row and column, with off-permutation entries of magnitude roughly 0.00–0.14   (5.1)
As can be seen, only one substantial entry (boxed) exists in each row and column. The interference was attenuated by between 20 and 70 dB in all cases, and the system was continuing to improve slowly with a learning rate of 0.0001. In our most ambitious attempt, 10 sources (six speakers, rock music, raucous laughter, a gong, and the Hallelujah chorus) were successfully separated, though the fine tuning of the solution took many hours and required some annealing of the learning rate (lowering it with time).

¹The learning rate is defined as the proportionality constant in 2.14–2.15 and 2.23–2.24. ²This took on the order of 5 min on a Sparc-10. Two hundred data points were presented at a time in a "batch," then the weights were changed with a learning rate of 0.01 based on the sum of the 200 accumulated Δw's.

For two sources, convergence is normally achieved in less than one pass through the data (50,000 data points), and on a Sparc-10 on-line learning
Information-Maximization
Figure 6: A 5 × 5 information-maximization network performed blind separation, learning the unmixing matrix W. The outputs, u, are shown here unsquashed by the sigmoid. They can be visually matched to their corresponding sources, s, even though their order was different and some (for example, u1) were recovered as negative (upside down).

can occur at twice the speed at which the sounds themselves are played. Real-time separation for more than, say, three sources may require further work to speed convergence, or special-purpose hardware. In all our attempts at blind separation, the algorithm has failed under only two conditions: (1) when more than one of the sources were gaussian white noise, and (2) when the mixing matrix A was almost singular.
Both are understandable. First, no procedure can separate out independent gaussian sources since the sum of two gaussian variables has itself a gaussian distribution. Second, if A is almost singular, then any unmixing W must also be almost singular, making the learning in 2.14 quite unstable in the vicinity of a solution.
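The first failure condition can be checked directly: any linear mixture of independent gaussians is itself gaussian, so no higher-order statistic distinguishes mixed from unmixed signals. A small sketch (helper name and sample sizes are ours, chosen for illustration) using excess kurtosis, which is zero for a gaussian, as the test statistic:

```python
import numpy as np

rng = np.random.default_rng(1)

def excess_kurtosis(v):
    """Fourth standardized moment minus 3 (zero for a gaussian)."""
    v = (v - v.mean()) / v.std()
    return (v ** 4).mean() - 3.0

n = 200_000
g1, g2 = rng.standard_normal(n), rng.standard_normal(n)

# A mixture of independent gaussians is itself gaussian: nothing to latch onto
mix = 0.6 * g1 + 0.8 * g2
print(excess_kurtosis(mix))                   # close to 0

# A non-gaussian (Laplacian) source keeps a signature through mixing
lap = rng.laplace(size=n)
print(excess_kurtosis(0.6 * lap + 0.8 * g2))  # clearly positive
```

A single gaussian source poses no problem; it is only when two or more sources are gaussian that every rotation of them is an equally valid "solution."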
In contrast with these results, our experience with tests on the H-J network of Jutten and Herault (1991) has been that it occasionally fails to converge for two sources and only rarely converges for three, on the same speech and music signals we used for separating 10 sources. Cohen and Andreou (1992) report separation of up to six sinusoidal signals of different frequencies using analog VLSI H-J networks. In addition, in Cohen and Andreou (1995), they report results with mixed sine waves and noise in 5 × 5 networks, but no separation results for more than two speakers.

How does convergence time scale with the number of sources, N? The difficulty in answering this question is that different learning rates are required for different N and for different stages of convergence. We expect to address this issue in future work, and employ useful heuristic or explicit second-order techniques (Battiti 1992) to speed convergence. For now, we present rough estimates for the number of epochs (each containing 50,000 data vectors) required to reach an average signal-to-noise ratio on the output channels of 20 dB. At such a level, approximately 80% of each output channel amplitude is devoted to one signal. These results were collected for mixing matrices of unit determinant, so that convergence would not be hampered by having to find an unmixing matrix with especially large entries. Therefore these convergence times may be lower than for randomly generated matrices. The batch size, B, was in each case 20. The average numbers of epochs to convergence (over 10 trials) and the computer times consumed per epoch (on a Sparc-10) are given in the following table:

No. of sources, N:       2     3     4
Learning rate:           0.1   0.1   0.1
Epochs to convergence:   …     …     …
Time in secs./epoch:     …     …     …
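The training loop described at the start of this section can be sketched in NumPy. The exact update of equations 2.14–2.15 is not shown in this excerpt, so the standard infomax form ΔW ∝ (Wᵀ)⁻¹ + (1 − 2y)xᵀ with a logistic sigmoid is assumed here; the function name and all parameter values are illustrative, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def infomax_batch_train(x, lr=0.01, batch=200, epochs=20):
    """Train an unmixing matrix W and bias w0 by information maximization.

    x: (N, M) array of mixed signals.  Gradients are summed over each batch
    and applied once per batch, with the time index permuted so the input
    ensemble is stationary, as described in the text."""
    N, M = x.shape
    W, w0 = np.eye(N), np.zeros((N, 1))
    for _ in range(epochs):
        perm = rng.permutation(M)                  # random time order
        for start in range(0, M - batch + 1, batch):
            xb = x[:, perm[start:start + batch]]
            u = np.clip(W @ xb + w0, -50, 50)
            y = 1.0 / (1.0 + np.exp(-u))           # logistic sigmoid
            # summed gradient: the costly inverse is computed once per batch
            dW = batch * np.linalg.inv(W.T) + (1.0 - 2.0 * y) @ xb.T
            dw0 = (1.0 - 2.0 * y).sum(axis=1, keepdims=True)
            W += lr * dW / batch
            w0 += lr * dw0 / batch
    return W, w0

# Two super-gaussian sources, mixed by a fixed well-conditioned matrix
s = np.sign(rng.standard_normal((2, 5000))) * rng.standard_normal((2, 5000)) ** 2
A = np.array([[1.0, 0.5], [0.5, 1.0]])
W, w0 = infomax_batch_train(A @ s)
P = W @ A   # approaches a scaled permutation matrix as training succeeds
```

As training succeeds, each row and column of WA develops one dominant entry, mirroring the structure of equation 5.1.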
…and Q the m × m matrix with elements q_ij = ⟨z_i z_j⟩, i ≠ j, and q_ii = 0 for all i. The angled brackets indicate an ensemble average. We will assume that a = 1, i.e., the learning rates in the system are all equal at all times. Now,

Q = ⟨zzᵀ⟩                                   (A.12)
  = ⟨(I − U)W x xᵀ Wᵀ (I − U)ᵀ⟩             (A.13)
  = (I − U) W C Wᵀ (I − U)ᵀ                 (A.14)

where C_ij = ⟨x_i x_j⟩ for all i, j. Hence,

dU/dt = (I − U) W C Wᵀ (I − U)ᵀ             (A.15)

The transform from x to z is G, where G(t) = [I − U(t)] W(t), U(t) is the value of U at time t, etc. Then

dG/dt = (I − U) dW/dt − (dU/dt) W           (A.16)
Colin Fyfe
Appendix B: Rate of Change of U at U = 0. We will show the required result for the basic peer-inhibitory interneuron model. Equivalent results for the other models are similarly shown.
Theorem 1. At the solutions U = 0, w_i = a_i c_i (for any a_i) of dG/dt = 0,

du_ij/dt = 0   for all i ≠ j

if λ_i ≠ 0, i.e., the ith eigenvalue is not zero.

Proof. As U → 0,

dU/dt = (I − U) W C Wᵀ (I − U)ᵀ → W C Wᵀ

Now w_i C w_jᵀ = 0 for all i ≠ j and w_i C w_iᵀ = λ_i |w_i|². Therefore W C Wᵀ is a diagonal matrix of the form diag{a_1²k_1, a_2²k_2, …, a_m²k_m}, where k_i = λ_i |c_i|², λ_i being the ith eigenvalue (note |w_i|² = a_i² |c_i|², so the diagonal entries are just λ_i |w_i|²). Therefore, du_ij/dt = 0 for all i ≠ j. □
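The heart of the proof is that WCWᵀ is diagonal whenever each w_i is a scaled eigenvector of C. A numerical check of this step (dimensions and the scales a_i are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 5

M = rng.standard_normal((n, n))
C = M @ M.T                               # a generic covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
c = eigvecs[:, -m:].T                     # rows c_i: m unit eigenvectors of C

a = rng.uniform(0.5, 2.0, size=m)         # arbitrary scales a_i
W = a[:, None] * c                        # w_i = a_i c_i

D = W @ C @ W.T
off = D - np.diag(np.diag(D))
assert np.allclose(off, 0.0, atol=1e-9)   # WCW^T is diagonal
assert np.allclose(np.diag(D), a ** 2 * eigvals[-m:])  # entries a_i^2 lambda_i
```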
Acknowledgments
I would like to register my thanks to the unknown reviewers who, after ploughing through the first version of this paper, provided such valuable assistance in improving it.
Asymmetry in Interneuron Learning
Received September 15, 1993; accepted January 20, 1995
Communicated by Jean Pierre Nadal
Learning and Generalization with Minimerror, A Temperature-Dependent Learning Algorithm

Bruno Raffin* Mirta B. Gordon†
CEA/Département de Recherche Fondamentale sur la Matière Condensée, SPSMS/MDN, Centre d'Études Nucléaires de Grenoble, 17, rue des Martyrs, 38054 Grenoble Cedex 9, France
We study the numerical performance of Minimerror, a recently introduced learning algorithm for the perceptron that has been shown analytically to be optimal on learning both linearly and nonlinearly separable functions. We present its implementation on learning linearly separable boolean functions. Numerical results are in excellent agreement with the theoretical predictions.
1 Introduction
An important feature of neural networks is their ability to learn rules, like pattern classification, from examples. Given a training set of input-output pairs, the network's synaptic weights are determined with a learning algorithm. If the network's architecture is adapted to the rule to be learned, error-free learning is possible, and in general many solutions without training errors exist. The best weights are those that minimize the generalization error ε_g, i.e., the probability that the output to an unlearned input is wrong. If the rule is not realizable with the current network's architecture, the weights should minimize the fraction of training errors ε_t, i.e., the fraction of patterns of the learning set to which the network gives a wrong answer.

The simplest network, the single-layer perceptron, has an input layer and a single output unit. In principle, it is able to learn any linearly separable training set. There are several learning algorithms that are guaranteed to converge to an error-free solution if such a solution exists (Anlauf and Biehl 1989; Krauth and Mézard 1987; Rujan 1993). However, if the training set is not separable, these algorithms never stop. Algorithms that always converge (Frean 1992; Gallant 1990; Nabutovsky and Domany 1991; Seewer and Rujan 1992), on the other hand, may miss the error-free solution even if it exists. Moreover, none of these algorithms gives optimal weights. Thus, a learning algorithm for the perceptron converging to the best solution in all cases, without need of prior knowledge about linear separability of the training set, is still lacking.

Learning of more general problems needs more complex architectures. In the case of binary classification, a two-layer perceptron with one hidden layer of neurons is sufficient to learn any boolean function. However, there is no theoretical ground relating the hidden layer size to the task complexity. Among the attempts to learn from examples in multilayered perceptrons, constructivist approaches, which build up the network by successive addition of hidden units until the output training error vanishes, are very promising (Fahlman and Lebiere 1990; Frean 1990; Golea and Marchand 1990; Knerr et al. 1990; Martinez and Esteve 1992; Mézard and Nadal 1989; Nadal 1989; Peretto and Gordon 1992; Rujan and Marchand 1989; Sirat and Nadal 1990). The hidden units, which are nothing but single perceptrons, have to learn internal representations of the input patterns that may be separable or not. Attempts to generate small networks, which usually show the best generalization rates, are doomed to failure unless a perceptron learning algorithm ensuring optimal performance, whether the learning set is separable or not, is used.

In this paper we study the performance on learning linearly separable rules of a recently introduced learning algorithm for the perceptron (Gordon et al. 1993) that satisfies the above requirements. It is based on the minimization of a cost function that may be interpreted as a noisy evaluation of the number of training errors.

*Present address: L.I.F.O./Dt. d'Informatique, BP 6759, 45067 Orléans Cedex 2, France. †Member of C.N.R.S.

Neural Computation 7, 1206-1224 (1995) © 1995 Massachusetts Institute of Technology
The noise in the error counting process, introduced through a temperature T, makes the cost function differentiable, enabling a gradient search of its minimum. Analytic results (Gordon and Grempel 1995) show that, depending on α = P/N, the size P of the training set relative to the number N of weights to be determined, there is a temperature T*(α) at which errors have to be "counted" to reach optimal performance. If the training set is not separable, the weights minimizing the cost function at T*(α) are the best trade-off between perceptron robustness and number of training errors. If the rule to be learned is linearly separable, they endow the perceptron with a generalization error that is numerically indistinguishable from the lowest bound, given by Bayes optimal learning (Opper and Haussler 1991).

The paper is organized as follows: in Section 2 we discuss the cost function and the theoretical predictions; the algorithm's implementation is presented in Section 3; our numerical results, characterized by the training error, the generalization error, and the distribution of the stabilities of the learning set after training, are presented in Section 4 and discussed in Section 5. The conclusion is presented in Section 6.

2 The Cost Function and Theoretical Results
We consider a binary-output perceptron of N real- or discrete-valued input units (i = 1, …, N) and real synaptic weights w = (w_1, …, w_N). The output to any input σ = (σ_1, …, σ_N) is

σ_out = sign(w · σ)   (2.1)

Given a learning set of P = αN patterns, of inputs ξ^μ = (ξ_1^μ, …, ξ_N^μ) (μ = 1, …, P) and corresponding outputs τ^μ, the stability (Krauth et al. 1988) of pattern μ is defined by

γ^μ = τ^μ (w · ξ^μ) / ||w||   (2.2)

where we introduced the norm of the weights vector:

||w|| = (Σ_{i=1}^N w_i²)^{1/2}   (2.3)

The unit vector w/||w|| is normal to the hyperplane in input space that separates patterns with positive from those with negative outputs. The absolute value of the stability, |γ^μ|, is the distance from pattern μ to the hyperplane; γ^μ > 0 if the pattern is well classified, γ^μ < 0 otherwise. Thus, high positive stabilities characterize robust learning against pattern or weight corruption. Minimerror is based on the minimization of the following cost function:

E(w; α, β) = Σ_{μ=1}^P V(γ^μ; β)   (2.4a)
where the noise parameter β = 1/T is the inverse of the learning temperature, and

V(γ; β) = ½ [1 − tanh(βγ/2)]   (2.4b)

represents the contribution of a pattern with stability γ to the cost function. For T = 0, i.e., β = ∞ (which is different from taking the limit T → 0, as is discussed below), we have V = 0 if γ > 0, V = 1 if γ < 0; so that E(w; α, ∞) is strictly the number of training errors. At finite learning temperature T, patterns with γ ≫ 0 have V ≈ 0 and those
with γ ≪ 0 have V ≈ 1, so that for patterns far from the separating hyperplane, E(w; α, β) still counts the number of errors. But patterns with stabilities within a window of width ≈ 2T on both sides of the hyperplane, i.e., −2/β < γ < 2/β, contribute to E(w; α, β) proportionally to 1 − βγ/2. Thus, although counting less than a "full" error, even well-learned patterns (having small positive stabilities) contribute positively to the cost function. In the limit of infinite temperature (T → ∞; β → 0) the contribution of each pattern to 2.4 is proportional to (minus) its stability. In this limit, minimization of the cost function (to first order in β) leads to Hebb's rule (Gordon et al. 1993).

The theoretical properties of the minima of E(w; α, β) were obtained within the statistical mechanics replica approach (Seung et al. 1992; Watkin et al. 1993). We summarize here the main results; technical details will be reported elsewhere (Gordon and Grempel 1995). The cost function 2.4 is considered as an energy in w space. A Boltzmann probability dP(w) exp[−ηE(w; α, β)]/Z(α, β) is assigned to each weight w, where dP(w) = dw δ(N − w · w) restricts the phase space to normalized w, and Z(α, β) = ∫ dP(w) exp[−ηE(w; α, β)] is the canonical partition function. The generic properties of the cost function's minimum are given by the limit E_m(α, β) = lim_{η→∞} η^{−1} ⟨ln Z(α, β)⟩_{(ξ,τ)}, where ⟨…⟩_{(ξ,τ)} stands for the average over all the possible realizations of input patterns in the learning set. In the case considered here, of a linearly separable task, the outputs τ^μ are given by a teacher perceptron of weights w*, thus guaranteeing linear separability. In the large-N limit, the average may be performed using the replica method, assuming symmetry among replicas. The result is
the free energy E_m(α, β) of equation 2.5, obtained as an extremum over the order parameters of a Gaussian average of W, where the notation Dx = exp(−x²/2) dx/√(2π) has been used, and

W(λ, x; z) = V(λ; β) + (λ − x)²/(2z)   (2.6)
The order parameters q_ab = N^{−1}(w_a · w_b) = q and r_a = N^{−1}(w_a · w*) = r are the overlaps between two replicated solutions and between a solution and the teacher perceptron w*, respectively. If the cost function is differentiable, as is the case for any finite β, the λ-integral is dominated
in the limit η → ∞ by the saddle point of the integrand, λ(x; z), solution of

λ(x; z) = x + zβ / (4 cosh²[βλ(x; z)/2])   (2.7)
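Equation 2.7 can be explored numerically. The sketch below brackets its solutions λ(x; z) by bisection (the fixed-point form λ = x + zβ/(4cosh²(βλ/2)) is our reading of the saddle-point condition) and confirms the regime change discussed below: a single branch when u = zβ² is below the critical value 6√3, several coexisting solutions above it:

```python
import numpy as np

def f(lam, x, z, beta):
    """lam - [x + z*beta/(4 cosh^2(beta*lam/2))]: zero at a solution of 2.7."""
    return lam - x - z * beta / (4.0 * np.cosh(0.5 * beta * lam) ** 2)

def bisect(a, b, x, z, beta, iters=80):
    """Plain bisection; f(a) and f(b) must have opposite signs."""
    fa = f(a, x, z, beta)
    for _ in range(iters):
        m = 0.5 * (a + b)
        if f(m, x, z, beta) * fa > 0:
            a, fa = m, f(m, x, z, beta)
        else:
            b = m
    return 0.5 * (a + b)

def solutions(x, z, beta):
    """All solutions lambda(x; z) of 2.7, located via sign changes of f."""
    grid = np.linspace(x - 1.0, x + z * beta + 1.0, 4001)
    vals = f(grid, x, z, beta)
    return [bisect(a, b, x, z, beta)
            for a, b, fa, fb in zip(grid, grid[1:], vals, vals[1:])
            if fa * fb < 0]

beta = 2.0
u_c = 6.0 * np.sqrt(3.0)                  # critical value of u = z*beta^2

# u < u_c: lambda(x; z) is single valued for every x ...
assert len(solutions(0.0, 0.9 * u_c / beta ** 2, beta)) == 1
# ... while for u > u_c several solutions coexist at intermediate x
# (the outer two are the branches of Figure 1):
assert len(solutions(-2.5, 5.0 * u_c / beta ** 2, beta)) == 3
```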
Minimizing 2.5 with respect to the parameters q and r, and taking the limit q → 1, we obtain two implicit equations, 2.8.a and 2.8.b, built from the Gaussian averages ∫ Dx erfc(−xr/…) and ∫ Dx [λ(x; c) − x]²,
where c = lim_{η→∞} η(1 − q). Equations 2.8.a and 2.8.b are implicit equations for c and r as functions of α. These parameters determine all the properties of the learning rule, in particular the generalization error ε_g = π^{−1} arccos(r), which is the probability that the perceptron gives a wrong output to a pattern not belonging to the learning set, and the distribution of stabilities of the patterns in the training set, ρ(γ, β), which characterizes the robustness of the trained network. Generalizing the approach used in the case of nonseparable tasks (Griniasty and Gutfreund 1991), we find the distribution ρ(γ, β) of equation 2.9.
The equations for the random-output case may be obtained from the above by disregarding 2.8.b and substituting r = 0 in 2.8.a and 2.9. The fraction of errors of the trained perceptron on the learning set, or training error ε_t(α, β), is deduced from 2.9 by integration over the negative stabilities. All three quantities, ε_g(α, β), ε_t(α, β), and ρ(γ, β), are averaged values over all the possible training sets of size α.

The nature of the solution of 2.7 depends crucially upon the value of u = cβ². If u < u_c = 6√3, λ(x; c) is a single-valued function, but for u > u_c, it has two branches, as is shown in Figure 1. The absolute minimum of W jumps from one branch to the other at a point x*. As a result, a gap appears in the distribution of stabilities of the patterns in the learning set. At any finite β, the gap lies in the range λ_−(x*; c) < γ < λ_+(x*; c), where λ_±(x*; c) are the two solutions of 2.7 at x*. It may be shown (Gordon and Grempel 1995) that, in the gapless regime, the performance of this rule is qualitatively similar to that of Hebb's rule (Vallet 1989). The regime with a gap may always be attained by choosing β sufficiently large.
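Equations 2.2 and 2.4 translate directly into code. The sketch below (helper names are ours, not the paper's) also checks the β → ∞ limit, where the cost reduces exactly to the number of training errors:

```python
import numpy as np

def stabilities(w, xi, tau):
    """gamma^mu = tau^mu (w . xi^mu)/||w||, equation 2.2."""
    return tau * (xi @ w) / np.linalg.norm(w)

def V(gamma, beta):
    """Per-pattern cost, equation 2.4b."""
    return 0.5 * (1.0 - np.tanh(0.5 * beta * gamma))

def cost(w, xi, tau, beta):
    """E(w; alpha, beta): summed contributions, equation 2.4a."""
    return V(stabilities(w, xi, tau), beta).sum()

rng = np.random.default_rng(4)
N, P = 20, 40
xi = rng.standard_normal((P, N))
tau = np.sign(xi @ rng.standard_normal(N))   # teacher-generated outputs

w = rng.standard_normal(N)                   # an arbitrary student
n_errors = int((stabilities(w, xi, tau) < 0).sum())

# Large beta (T ~ 0): V approaches a step function and E counts errors
assert np.isclose(cost(w, xi, tau, beta=1e6), n_errors, atol=1e-6)
```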
Figure 1: Saddle-point solution λ(x; c) for u > u_c, showing the left and right branches, λ_l(x; c) and λ_r(x; c), respectively, and the gap at x*.
Let us first discuss the theoretical results for T = 0. The cost function 2.4 is then the number of training errors. It has been investigated theoretically by several authors, both on learning random outputs (Gardner 1987) and linearly separable rules (Györgyi 1990; Györgyi and Tishby 1990; Seung et al. 1992). Their results may be summarized as follows: if the training set is separable there is a finite volume in w space, i.e., infinitely many solutions, without training errors; the corresponding generalization error (displayed as "Boltzmann" in Fig. 6 below for comparison) is relatively large, reflecting the fact that simply minimizing the training error is far from being a good learning strategy. If the training set is not separable, which arises for random outputs if α > 2, the volume of solutions in w space that minimizes ε_t shrinks to zero, an indication that the solution is either unique or a set of isolated points.

We turn now to Minimerror, in the limit T → 0, i.e., β → ∞. Strikingly, the saddle point equations in this limit are different from those obtained by setting β = ∞ in 2.4. In the separable regime they reduce to those of the maximal stability perceptron (MSP) (Gardner and Derrida 1988; Opper et al. 1990), which is the particular solution that, besides minimizing ε_t, maximizes the stability of the
least stable pattern of the learning set. Moreover, the volume of points in w space that minimize 2.4 is zero for all values of β, whether the training set is separable or not. Therefore, minimizing E(w; α, β) at successively decreasing temperatures is different from minimizing the number of training errors at T = 0. Taking the limit β → ∞ lifts the degeneracy inherent to the cost function at T = 0 by selecting one particular solution, the MSP. The generalization error of the MSP, although not optimal, is lower than that of the perceptron barely minimizing the number of training errors.

Although the fraction of training errors with Minimerror reaches its minimum value only in the limit β → ∞, the theoretical results at finite temperature show that it is not worth looking for this limit. Moreover, it has been established that the behavior of the training error with T shows two distinct regimes: a high-temperature regime, in which the departure Δε_t of ε_t(α, β) from its minimum value ε_t^min is large, and a low-temperature regime in which Δε_t is vanishingly small. The crossover between these two regimes occurs at a finite, α-dependent, temperature, T*(α) = 1/β*(α), at which Δε_t is already small. Moreover, the weights that minimize the cost function at T*(α) have remarkable properties. If the training set is nonseparable, which arises for random outputs and α > 2, the weights that minimize ε_t endow a large fraction of learned patterns with vanishingly small stabilities (Griniasty and Gutfreund 1991), which may be an undesirable feature (Amit et al. 1990). Weights obtained by learning at T*(α) are the best theoretical trade-off between learning capacity and robustness, because they endow well-learned patterns with finite stabilities at the price of accepting only ≈ 0.1% more training errors than ε_t^min.
In the case of learning a linearly separable rule, ε_g(α, β) turns out to be a nonmonotonic function of β that goes through a minimum at β*(α). This minimum is numerically indistinguishable from the generalization error of Bayes algorithm, which gives the lowest bound to the generalization error of a perceptron learning a separable rule. Finally, in the separable and in the nonseparable cases, the distributions of stabilities ρ(γ) for T < T* present gaps at both sides of the origin, at γ_+ and −|γ_−|, respectively, with |γ_−| > γ_+. Therefore, wrongly learned patterns are farther from the hyperplane than a large fraction of well-learned ones, meaning that there is a confidence interval of width |γ_−| on both sides of the separating hyperplane within which only well-learned patterns are located.

A final remark concerns the validity of the above theoretical results. First, they are strictly valid in the thermodynamic limit N → ∞, P → ∞, at fixed α = P/N. Also, predictions within the statistical mechanics approach are generic, in the sense that they are obtained after averaging over all the possible learning sets. Thus, finite-size deviations and statistical fluctuations may arise in numerical simulations and in practical applications.
3 The Learning Algorithm
The theoretical results suggest starting with normalized hebbian synaptic weights (which minimize the high-temperature limit of the cost function), and tracking the minimum of E(w; α, β) at decreasing values of T. The search should stop when the number of errors stops decreasing. In theory, this is expected to occur at a value of T close to T*(α) = 1/β*(α). Numerical simulations on learning random input-output patterns (Gordon and Berchier 1993) led us to consider learned patterns at a lower temperature than unlearned ones during the training phase, an asymmetry that prevented the algorithm from getting stuck at local minima. Therefore, in our implementation, the function minimized in the intermediate stages of the learning algorithm is
E(w; α, β_+, β_−) = Σ_{γ^μ≥0} V(γ^μ; β_+) + Σ_{γ^μ<0} V(γ^μ; β_−) + (√N − ||w||)²   (3.1)

with β_+/β_− = const > 1. The last term of 3.1, imposing weight normalization, restricts the search of synaptic weights to the hypersphere of radius √N. Clearly, 2.4 being independent of the normalization of w, the condition ||w|| = √N that minimizes 3.1 does not add any supplementary constraint to the problem. It ensures faster convergence at any finite β, because it restricts the search region, and prevents ||w|| from increasing without bounds if the training set is linearly separable. A last minimization with β_+ = β_− = β gives the normalized weights that minimize 2.4 at temperature T = 1/β.

To summarize, our learning algorithm starts with weights given by Hebb's rule, and initial values of β_+ and β_−. We determine the minimum w(β_+, β_−) of 3.1. Then, β_+ and β_− are increased through β_− ← β_− + δβ, keeping β_+/β_− constant, and 3.1 is again minimized with the previous minimum w(β_+, β_−) as initial condition. Notice that this annealing schedule, in which β_− is stepwise increased, is not equivalent to decreasing T by constant intervals. Successive minimizations at decreasing temperatures are performed until the number of errors in the training set vanishes for the first time or stops decreasing. One further minimization at β_+ = β_− = β gives the weights w(β) that minimize 2.4 at temperature T = 1/β. In contrast with the implementation of learning random outputs, where β_− and β_+ must be increased very slowly to obtain satisfactory results (Gordon and Berchier 1993), preliminary tests showed that the separable case is much less sensitive to parameter tuning. Results reported in this paper were obtained with a faster procedure, in which β_− and β_+ are increased by larger steps, and a conjugate gradient minimization (Press et al. 1986) is performed at each temperature. This schedule selects one particular path to reach the minimum of 2.4 that differs from simple gradient descent, but does not affect the properties
Table 1: Implementation of Minimerror^a

Init: Initialize parameters:
    β_− ← β_−^init;        {β_−^init = 0.1 for α ≤ 2, β_−^init = 1 for α > 2}
    δβ_− ← δβ_−^init;      {δβ_− = β_−^init ∀α}
    β_+/β_− ← ω^init;      {ω^init = 10 ∀α}
    it_max ← it^max;       {maximal allowed number of steps of β_−: it^max = 10 for α < 2, it^max = 20 for α ≥ 2}
    β ← β_+;               {value for the last minimization}
    it ← 0;                {number of actual iterations}
  Initialize and normalize synaptic weights:
    for i = 1 to N do
        w_i ← Σ_{μ=1}^P τ^μ ξ_i^μ       {w = w^Hebb}
    end do;
    for i = 1 to N do
        w_i ← w_i √N / ||w||            {therefore ||w||² = N}
    end do;

Learn: while ε_t(it) > 0 and it < it_max do
    Find w(β_+, β_−), the minimum of 3.1, with conjugate gradient;
    Count ε_t(it), the number of errors on the training set;
    if ε_t(it) > 0 then
        it ← it + 1;                    {count number of iterations}
        β_− ← β_− + δβ_−; β_+ ← ωβ_−;   {decrease the temperatures}
    else
        β ← β_+;                        {set last minimization temperature}
    end if;
end while;

End: Find w(β), the minimum of 2.4, with conjugate gradient; stop.

^a Comments in brackets contain the values of the parameters used in our simulations.
of the final weights determined by the last minimization. In our simulations, this minimization was done at several values of T, to test the theoretical predictions. In practice, the best generalization performance is obtained if the last minimization is done at temperature T = 1/β_+*, where β_+* is the value of β_+ when the stopping condition is met. The algorithm's implementation and the numerical values used in our simulations are summarized in Table 1.
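Table 1 can be transcribed compactly in NumPy. The sketch below follows the Init/Learn/End structure, but replaces the conjugate-gradient minimizations with plain gradient descent on 3.1; the functional form of 3.1 as written here, the step size, and the iteration counts are our assumptions, not the authors' code:

```python
import numpy as np

def minimerror_fit(xi, tau, ratio=10.0, beta0=0.1, dbeta=0.1,
                   it_max=20, inner=300, lr=0.05):
    """Sketch of Table 1.  xi: (P, N) patterns, tau: (P,) +/-1 targets.
    ratio = beta_plus / beta_minus is kept constant (> 1)."""
    P, N = xi.shape

    def grad(w, bp, bm):
        """Gradient of the intermediate cost 3.1 with respect to w."""
        nw = np.linalg.norm(w)
        gamma = tau * (xi @ w) / nw
        beta = np.where(gamma >= 0, bp, bm)      # asymmetric temperatures
        dV = -beta / (4.0 * np.cosh(0.5 * beta * gamma) ** 2)
        dgamma = tau[:, None] * xi / nw - np.outer(gamma, w) / nw ** 2
        g = dV @ dgamma
        g += -2.0 * (np.sqrt(N) - nw) * w / nw   # norm-penalty term of 3.1
        return g

    # Init: Hebbian weights, normalized so that ||w||^2 = N
    w = tau @ xi
    w *= np.sqrt(N) / np.linalg.norm(w)

    # Learn: anneal beta_minus upward, minimizing 3.1 at each temperature
    bm = beta0
    bp = ratio * bm
    for _ in range(it_max):
        bp = ratio * bm
        for _ in range(inner):
            w -= lr * grad(w, bp, bm)
        if np.sum(tau * (xi @ w) < 0) == 0:      # stopping condition met
            break
        bm += dbeta

    # End: last (symmetric) minimization of 2.4 at beta = bp
    for _ in range(inner):
        w -= lr * grad(w, bp, bp)
    return w * np.sqrt(N) / np.linalg.norm(w)

# Teacher-generated, hence linearly separable, training set
rng = np.random.default_rng(6)
N, P = 20, 10
xi = rng.standard_normal((P, N))
tau = np.sign(xi @ rng.standard_normal(N))
w = minimerror_fit(xi, tau)
train_errors = int(np.sum(tau * (xi @ w) < 0))
```

On a small separable problem like this one (α = 0.5), the annealing should drive the training error to zero within a few temperature steps.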
4 Numerical Results
We present here simulation results on learning a linearly separable rule. Input patterns of the learning set, ξ^μ = (ξ_1^μ, …, ξ_N^μ) (μ = 1, …, P), were selected at random. To guarantee that the problem is linearly separable, a teacher perceptron, whose weights w* = (w_1*, w_2*, …, w_N*) were chosen at random and normalized, ||w*||² = N, gives the corresponding outputs τ^μ through

τ^μ = sign(w* · ξ^μ)   (4.1)
The weights w(α, β) determined by the learning algorithm through a last minimization at temperature T = 1/β have an overlap R(α, β) with the teacher,

R(α, β) = (1/N) Σ_{i=1}^N w_i* w_i(α, β)   (4.2)
which depends on α = P/N. This overlap measures how close the trained (student) perceptron is to the teacher. The mean value of R over different learning sets, ⟨R⟩, is related to the generalization error of the learning rule through (Seung et al. 1992; Watkin et al. 1993)

ε_g(α, β) = (1/π) arccos(R(α, β))   (4.3)
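The measurement pipeline of equations 4.1–4.3 is easy to reproduce: build a normalized teacher, compute the overlap R of a student with it, and compare ε_g = arccos(R)/π against a direct error estimate on fresh patterns (the noisy-teacher student below is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200

# Teacher with ||w*||^2 = N; equation 4.1 defines the target outputs
w_star = rng.standard_normal(N)
w_star *= np.sqrt(N) / np.linalg.norm(w_star)

# A student: here simply a noisy, renormalized copy of the teacher
w = w_star + 0.5 * rng.standard_normal(N)
w *= np.sqrt(N) / np.linalg.norm(w)

R = (w_star @ w) / N                      # overlap, equation 4.2
eps_g = np.arccos(R) / np.pi              # generalization error, equation 4.3

# Cross-check against the empirical error rate on fresh random patterns
xi = rng.standard_normal((20_000, N))
empirical = np.mean(np.sign(xi @ w_star) != np.sign(xi @ w))
assert abs(empirical - eps_g) < 0.02
```

The agreement reflects the geometric fact that, for isotropic inputs, the disagreement probability of two hyperplanes equals their angle divided by π.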
Because theoretical results are generic and strictly valid only in the thermodynamic limit, we realized simulations with N = 50 and 100, for α = 0.5, 1, 2, 3, 4, 5, and 6, with different samples of the training set, and averaged over the samples. To compare with the theory, the last minimization was performed at two different values, β = 5 and β = 10, of the noise parameter. For each final β we determined the training error,

ε_t(α, β) = ⟨P^{−1} Σ_{μ=1}^P Θ(−γ^μ)⟩   (4.4)

where ⟨…⟩ stands for the average over the samples, the generalization error 4.3, and the histograms representing the distribution of stabilities ρ(γ).
The training errors for β = 5 and β = 10 are displayed in Figure 2. Smooth curves are the theoretical predictions, corresponding to N → ∞, showing that at each learning temperature T = 1/β, there is an upper learning-set size, α*(β), beyond which a crossover to a regime with learning errors occurs. For example, α*(β = 5) ≈ 1. Conversely, for each α, there is a temperature 1/β* below which ε_t(α, β) becomes vanishingly small. Simulation results for α = 0.5 and 1 reached ε_t = 0 on all our samples. This is clearly a finite-size effect; the theory predicts a vanishingly
Figure 2: Training error vs. α. Simulation results correspond to mean values over 10 samples. Data have been horizontally expanded for α ≥ 2, for better visibility. Continuous curves are the theoretical predictions for N → ∞.

small training error, but not strictly zero. Results corresponding to α ≥ 2, horizontally expanded for visualization purposes, are slightly larger than the theoretical ones.¹ The results on learning random patterns (Gordon and Berchier 1993) suggest that this deviation with respect to the theory may be due to finite-size effects.

The generalization errors ε_g(α, β) for β = 5 and 10 are shown in Figure 3a and b, respectively. The MSP and the theoretical predictions at noise parameter β are displayed for comparison. Simulation results are in excellent agreement with the theoretical predictions, and show that at each learning temperature there is a range of α for which our algorithm has lower generalization error than the MSP. Simulation and theoretical results at both temperatures, the bayesian lower bound, and the MSP are expanded in Figure 4 for α > 2. It is apparent that finite-temperature learning has lower generalization error than the zero-temperature limit, provided the learning temperature is adequately tuned. As will be shown later, the algorithm finds the optimal learning temperature automatically.¹

¹The largest discrepancy occurs for β = 10 at α = 4, and is less than Δ(ε_t) ≈ 3 × …

Finite
Figure 3: (a) Generalization error vs. α, for β = 5. Simulation results correspond to mean values over 10 samples. The continuous curve is the theoretical prediction for N → ∞. The MSP generalization error is plotted for comparison. (b) Generalization error vs. α, for β = 10. Simulation results correspond to mean values over 10 samples. The continuous curve is the theoretical prediction for N → ∞. The MSP generalization error is plotted for comparison.
size effects on the generalization error seem very small: no significant differences were found between simulation results with N = 50 and N = 100. Finally, the distribution of stabilities ρ(γ) of the learning set in three different regimes is presented as histograms in Figure 5a, b, and c, for α = 0.5, 2, and 6, respectively, at β = 5. Smooth curves are the theoretical predictions. Figure 5a corresponds to the vanishingly small error regime. The distribution of stabilities has a gap γ₊(α; β), meaning that the patterns are farther than γ₊ from the separating hyperplane. This gap characterizes robust learning against small perturbations of the weights. Although the weights found with Minimerror at finite β correspond to
Bruno Raffin and Mirta B. Gordon
Figure 4: Generalization error vs. α. Comparison of results at β = 5 and β = 10.

a smaller gap than the one of the MSP (which is reached in the limit β → ∞), they endow the perceptron with higher generalization performance, as was already discussed. This rather unexpected result shows that not only the stability of the least stable pattern, but the whole distribution of stabilities plays a role in the perceptron's generalization performance. Figure 5b displays the distribution of stabilities at the crossover to the regime with training errors. The positive gap is vanishingly small, and a band of negative stabilities, corresponding to a small fraction of not-learned patterns, appears far from the origin: the distribution of negative stabilities presents a gap |γ₋|, indicating that there is a confidence interval of width |γ₋| on both sides of the separating hyperplane, containing only well-learned patterns. Figure 5c shows the distribution of stabilities for α = 6, β = 5, which corresponds to a temperature well above T*, i.e., in the finite-error regime. The fraction of training errors is ≈ 2% (see Fig. 2) and the generalization error is larger than optimal (see Fig. 4). Correspondingly, the stabilities distribution shows no gap. In practice, the optimal results are obtained if the last minimization is done with β = β*, where β* is the value of the noise parameter β at which the stopping condition is met. The numerical generalization
Figure 5a: Histogram of stabilities obtained from simulations on 10 samples with N = 50. The continuous curves are the theoretical predictions for N → ∞.

errors thus obtained are displayed in Figure 6, together with the Bayes algorithm lower bound, showing that the results of Minimerror are excellent. In the same figure the generalization error of the perceptron minimizing the number of training errors ε_t, named the "Boltzmann algorithm" (Opper and Haussler 1991), is also displayed to show the improvement brought about by evaluating the number of training errors at finite temperature.

5 Discussion
To understand what our algorithm is doing, it is useful to look at the prescription of a simple gradient descent minimization of 2.4:
w(t + 1) = w(t) + δw(t),    δw = −ε ∂E(w; β)/∂w    (5.1)
Dropping out terms arising from the normalization of the synaptic weights, 5.1 may be cast, like other perceptron learning rules, in the
Figure 5b: Histogram of stabilities obtained from simulations on 10 samples with N = 50. The continuous curves are the theoretical predictions for N → ∞.

form of a hebbian-like iterative learning:
δw(t) ∝ Σ_μ c^μ(t) τ^μ ξ^μ    (5.2)
where the coefficient c^μ is rule-dependent. For example, the MSP has c^μ = Θ(κ − γ^μ), where κ is the stability imposed on the least stable pattern, meaning that only patterns with stabilities lower than κ are "learned" at each iteration. The Widrow-Hoff rule corresponds to c^μ = 1 − γ^μ: patterns with γ^μ < 1 are learned while those with γ^μ > 1 are unlearned. Our rule has
c^μ ∝ 1 / [4T cosh²(γ^μ/2T)]    (5.3)

which is maximal at γ = 0 and decays exponentially for |γ| > 2T. Thus, mainly patterns within a window of width 2T on both sides of the separating hyperplane, with both positive and negative stabilities, are learned. Those outside this window have vanishingly small coefficients. At the start, T is high and all the patterns contribute to learning with almost the
Figure 5c: Histogram of stabilities obtained from simulations on 10 samples with N = 50. The continuous curves are the theoretical predictions for N → ∞.

same strength. This is Hebb's rule. By decreasing T, the window gets narrower, restricting learning to patterns closer and closer to the separating hyperplane. When the temperature is low enough, the number of patterns within the 2T window becomes vanishingly small. All the available information contained in the learning set is exhausted, and it is not worthwhile to lower the temperature any further. Within this intuitive picture, the two temperatures of our algorithm may be interpreted as two different windows: a narrow one for patterns with positive stabilities, and a large one for patterns with negative stabilities. Thus, synaptic weight modifications are more sensitive to patterns not yet learned than to those already learned.

6 Conclusion
We studied the performance of Minimerror, a temperature-dependent learning algorithm for the binary-output perceptron, which has been shown analytically to be optimal at learning both linearly and nonlinearly
(Figure 6 legend: Minimerror, N = 100, 40 tests; Boltzmann shown dashed.)
Figure 6: Generalization error vs. α, obtained after the last minimization at β = β*. The theoretical predictions of the Bayes algorithm and of the perceptron that minimizes ε_t are drawn for comparison.
separable functions from examples, if the temperature is correctly chosen. The main interest of Minimerror is that it is the first learning algorithm that converges automatically to the optimal solution for both kinds of training sets. Numerical simulations of the nonseparable case that confirmed the theoretical predictions were previously reported (Gordon and Berchier 1993). Here we studied the problem of learning linearly separable rules. Our results, obtained at several learning temperatures, are in very good agreement with the theoretical predictions for the training error, the generalization error, and the distribution of stabilities of the training set. Moreover, in our implementation, the algorithm finds automatically the best learning temperature for each particular training set. The weights so determined endow the perceptron with minimal generalization error and maximal robustness. Applications of Minimerror to constructive algorithms for multilayered perceptrons are currently in progress.
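The annealed learning dynamics described in Section 5 can be sketched numerically. The following is an illustration rather than the authors' implementation: the cosh⁻² form of the coefficient c^μ, the single annealed temperature (instead of the two windows), and the learning rate are all assumptions consistent with the qualitative description above.

```python
import numpy as np

def stabilities(w, X, tau):
    # gamma^mu = tau^mu (w . xi^mu) / ||w||: signed distance of each
    # pattern from the separating hyperplane.
    return tau * (X @ w) / np.linalg.norm(w)

def learning_step(w, X, tau, T, lr=0.5):
    # Hebbian-like update delta_w ~ sum_mu c^mu tau^mu xi^mu (eq. 5.2).
    # The cosh^-2 coefficient (an assumed form) is maximal at gamma = 0
    # and negligible for |gamma| > 2T: only patterns inside the window learn.
    gamma = stabilities(w, X, tau)
    c = 1.0 / (4 * T * np.cosh(gamma / (2 * T)) ** 2)
    w = w + lr * (c * tau) @ X
    return w / np.linalg.norm(w)          # keep the weights normalized

rng = np.random.default_rng(0)
N, P = 20, 40
w_teacher = rng.normal(size=N)
X = rng.normal(size=(P, N))
tau = np.sign(X @ w_teacher)              # linearly separable targets

w = rng.normal(size=N)
for T in np.geomspace(5.0, 0.05, 200):    # anneal T from high to low
    w = learning_step(w, X, tau, T)

train_err = float(np.mean(np.sign(X @ w) != tau))
```

At high T every pattern contributes nearly equally (Hebb's rule); as T decreases, learning concentrates on patterns near the hyperplane, mirroring the window picture of Section 5.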
References

Amit, D. J., Evans, M. R., Horner, H., and Wong, K. Y. M. 1990. Retrieval phase diagrams for attractor neural networks with optimal interactions. J. Phys. A: Math. Gen. 23, 3361-3381.
Anlauf, J. K., and Biehl, M. 1989. The AdaTron: An adaptive perceptron algorithm. Europhys. Lett. 10, 687-692.
Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems, Vol. 2, pp. 574-582. Morgan Kaufmann, San Mateo, CA.
Frean, M. 1990. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Comp. 2, 198-209.
Frean, M. 1992. A "thermal" perceptron learning rule. Neural Comp. 4(6), 946-957.
Gallant, S. I. 1990. Perceptron-based learning algorithms. IEEE Trans. Neural Networks 1, 179-191.
Gardner, E. 1987. Maximum storage capacity in neural networks. Europhys. Lett. 4, 481-485.
Gardner, E., and Derrida, B. 1988. Optimal storage properties of neural network models. J. Phys. A 21, 271.
Golea, M., and Marchand, M. 1990. A growth algorithm for neural network decision trees. Europhys. Lett. 12, 205-210.
Gordon, M., and Grempel, D. 1995. Learning with a temperature dependent algorithm. Europhys. Lett. (in press).
Gordon, M. B., and Berchier, D. 1993. Minimerror: A perceptron learning rule that finds the optimal weights. In ESANN'93, pp. 105-110. D facto, Brussels.
Gordon, M. B., Peretto, P., and Berchier, D. 1993. Learning algorithms for perceptrons from statistical physics. J. Phys. I France 3, 377-387.
Griniasty, M., and Gutfreund, H. 1991. Learning and retrieval in attractor neural networks above saturation. J. Phys. A: Math. Gen. 24, 715-734.
Gyorgyi, G. 1990. Inference of a rule by a neural network with thermal noise. Phys. Rev. Lett. 64, 2957-2960.
Gyorgyi, G., and Tishby, N. 1990. Statistical theory of learning a rule. In Neural Networks and Spin Glasses, pp. 3-36. World Scientific, Singapore.
Knerr, S., Personnaz, L., and Dreyfus, G. 1990. Single-layer learning revisited: A stepwise procedure for building and training a neural network. In Neurocomputing: Algorithms, Architectures and Applications. Springer, Berlin.
Krauth, W., and Mézard, M. 1987. Learning algorithms with optimal stability in neural networks. J. Phys. A 20, L745-L752.
Krauth, W., Nadal, J.-P., and Mézard, M. 1988. The roles of stability and symmetry in the dynamics of neural networks. J. Phys. A: Math. Gen. 21, 2995-3011.
Martinez, D., and Esteve, D. 1992. The offset algorithm: Building and learning method for multilayer neural networks. Europhys. Lett. 18, 95-100.
Mézard, M., and Nadal, J. P. 1989. Learning in feedforward layered neural networks: The tiling algorithm. J. Phys. A 22, 2191-2203.
Nabutovsky, D., and Domany, E. 1991. Learning the unlearnable. Neural Comp. 3, 604-616.
Nadal, J. P. 1989. Study of a growth algorithm for a feedforward network. J. Phys. A: Math. Gen. 22, 2191-2203.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithm for learning a perceptron. Phys. Rev. Lett. 66, 2677-2680.
Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalise. J. Phys. A 23, L581-L586.
Peretto, P., and Gordon, M. B. 1992. Monoplane: A constructive learning algorithm for one-hidden layer feedforward neural networks. In Neural Networks for Computing. Snowbird, Utah.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1986. Numerical Recipes. Cambridge University Press, Cambridge, England.
Rujan, P. 1993. A fast method for calculating the perceptron with maximal stability. J. Phys. I France 3, 277-290.
Rujan, P., and Marchand, M. 1989. Learning by activating neurons: A new approach to learning in neural networks. Complex Syst. 3, 229.
Seewer, S., and Rujan, P. 1992. The generalization probability of a perceptron using the A-rule. J. Phys. A: Math. Gen. 25, L505-L510.
Seung, H. S., Sompolinsky, H., and Tishby, N. 1992. Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056-6091.
Sirat, J. A., and Nadal, J. P. 1990. Neural trees: A new tool for classification. Network 1, 423-438.
Vallet, F. 1989. The Hebb rule for learning linearly separable functions: Learning and generalisation. Europhys. Lett. 8, 747.
Watkin, T. L. H., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Mod. Phys. 65, 499-556.
Received August 23, 1994; accepted January 16, 1995.
Communicated by Maxwell Stinchcombe
Regularized Neural Networks: Some Convergence Rate Results

Valentina Corradi
Department of Economics, University of Pennsylvania, Philadelphia, PA, USA

Halbert White
Department of Economics, University of California at San Diego, and Institute for Neural Computation, San Diego, CA, USA
In a recent paper, Poggio and Girosi (1990) proposed a class of neural networks obtained from the theory of regularization. Regularized networks are capable of approximating arbitrarily well any continuous function on a compactum. In this paper we consider in detail the learning problem for the one-dimensional case. We show that in the case of output data observed with noise, regularized networks are capable of learning and approximating (on compacta) elements of certain classes of Sobolev spaces, known as reproducing kernel Hilbert spaces (RKHS), at a nonparametric rate that optimally exploits the smoothness properties of the unknown mapping. In particular we show that the total squared error, given by the sum of the squared bias and the variance, will approach zero at a rate of n^(−2m/(2m+1)), where m denotes the order of differentiability of the true unknown function. On the other hand, if the unknown mapping is a continuous function but does not belong to an RKHS, then there still exists a unique regularized solution, but this is no longer guaranteed to converge in mean square to a well-defined limit. Further, even if such a solution converges, the total squared error is bounded away from zero for all n sufficiently large.

1 Introduction
The purpose of this paper is to describe a learning method for hidden layer feedforward networks of growing complexity that provides, for general classes of functions of a single variable and for discretely sampled and noisy target data, an estimate of an unknown mapping converging to the truth at a rate that optimally exploits the smoothness properties of the unknown function. To achieve this, we limit ourselves to certain classes of activation functions and exploit the theory available for regularized solutions of Fredholm integral equations of the first kind (e.g., Groetsch 1984). Treatment of multivariate input is possible, but not in the space allotted here.

Neural Computation 7, 1225-1244 (1995) © 1995 Massachusetts Institute of Technology
The major thrust of our results is as follows: Suppose that an unknown "target" function f on [0,1] is an element of a special class of spaces called reproducing kernel Hilbert spaces (RKHSs) and is observed with noise at n points labeled x_i for i = 1, 2, …, n. That is, suppose that y_i = f(x_i) + ε_i is observed, where the ε_i are orthogonal elements of some L² probability space, all having mean 0 and L² norm σ. If the x_i are chosen properly (Assumption 3.4, discussed at more length below) then, using a method of training a neural network with n hidden units called regularization, we can avoid overfitting and obtain an approximation rate of (σ²/n)^(2m/(2m+1)) for functions having m derivatives, the first m − 1 of which are absolutely continuous and the mth in L²([0,1]). (We are indebted to the reviewer for this succinct overview.) Our approach is based on the fact that if we use certain types of sufficiently kinky polynomial spline activation functions (as introduced by Stinchcombe and White 1990 and defined in Section 4) and we consider a continuum of hidden units, then the neural network model can be interpreted as a Fredholm integral equation of the first kind, with network weights learned through the method of regularization. Networks obtained from the theory of regularization have already been considered by Poggio and Girosi (1990, appendix C.1), who claim that such networks can approximate any continuous function arbitrarily well over compacta. More recently Xu et al. (1994) consider the connection between radial basis function networks and kernel regression for the multivariate case and provide some convergence rate results; however, they consider only the non-noisy data case, and their approach is different.
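The exponent 2m/(2m+1) in the rate just quoted approaches 1 as the smoothness m grows, so smoother targets are learned at nearly the parametric rate; a quick computation (purely illustrative):

```python
from fractions import Fraction

# Rate exponents 2m/(2m+1) for m = 1, ..., 5: 2/3, 4/5, 6/7, 8/9, 10/11,
# increasing toward (but never reaching) the parametric exponent 1.
exponents = [Fraction(2 * m, 2 * m + 1) for m in range(1, 6)]
```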
Here we show that regularized networks converge to the true mapping in mean square (the squared L²-norm) at a rate of n^(−2m/(2m+1)) for functions belonging to certain classes of Sobolev spaces, known as reproducing kernel Hilbert spaces (RKHS), which have m derivatives, the first m − 1 of which are absolutely continuous. On the other hand, for functions belonging to C[0,1] but not belonging to any RKHS, there still exists a unique regularized solution, but this is not guaranteed to converge to a well-defined limit. Even if such a solution converges, the squared approximation error to a function not in an RKHS is bounded away from zero as the size of the training set increases. While Poggio and Girosi's (1990) approximation claim is true for such functions, the optimal approximation cannot be learned from the sample target data by regularization. This paper is organized as follows. In Section 2 we give the general setup of the problem, and in Section 3 we state Lukas's theorem (Lukas 1988), a general result on convergence rates for regularized solutions. In Section 4 we obtain our result by showing that the assumptions of this theorem are satisfied for networks with certain types of sufficiently kinky polynomial spline activation functions. Finally in Section 5 we show that, provided a limit exists, the approximation error of a regularized solution to functions belonging to C[0,1] but not belonging to an RKHS is bounded away from zero for all n sufficiently large. Since the necessary
mathematical background may be somewhat unfamiliar, we provide a brief synopsis of the relevant material on operators in Hilbert spaces and RKHSs in an Appendix.

2 Underlying Framework
A Fredholm equation of the first kind (see Definition A.1) has the form

f(x) = Kβ(x) := ∫₀¹ K(x, y) β(y) dy    (2.1)
where x and y are scalars, f and K are known functions, and β is a function to be found. K is interpreted as an operator on the space of square integrable functions on a compact support. The integral expression in 2.1 can be interpreted directly as the output of a single hidden layer feedforward network, with inputs x ∈ [0,1], input to hidden unit weights y, a hidden layer with a continuum of hidden units having activations K(x, y), and hidden to output weights β(y), which, when convolved with the hidden unit activations, deliver the network output f(x). Equation 2.1 resembles the continuous network model considered by Irie and Miyake (1988) in the case of univariate input (i.e., d = 1). Irie and Miyake propose a Fourier integral over R in place of 2.1, expressing the weights β(y) in terms of the Fourier transforms of K and f. However, permitting the weights to range over all of R can create practical difficulties. By considering 2.1, we can avoid these difficulties. A Riemann sum approximation to 2.1 has the form
fₙ(x) = (1/n) Σᵢ₌₁ⁿ K(x, yᵢ) βᵢ    (2.2)
where βᵢ = β(yᵢ) and yᵢ = i/n. Equation 2.2 is the output function for a standard two-layer feedforward network with n hidden units, hidden unit activations K(x, yᵢ), and hidden to output weights βᵢ. This resembles the case considered by Hornik, Stinchcombe, and White (1990), hereafter HSW. Theorem 3.1 in HSW (1990) also provides, under certain conditions on f and K, a degree of approximation result. However, because of the nature of the Riemann sum approximation, that rate does not exploit the smoothness of f. When K is a Green's function associated with some differential operator, 2.2 can also be interpreted as the output of a radial basis function network, as in Poggio and Girosi (1990) and in Xu et al. (1994). The idea pursued here is to solve the learning problem by obtaining an estimate of the function β, using an approximate regularized solution of 2.1. Given the estimate of β we can then recover an estimate of Kβ(·), that is, an estimate of the unknown mapping. Also by evaluating the
function β(y) at some y ∈ [0,1], we can get a specific value for any hidden unit's output weight. This solution to the learning problem contrasts dramatically with standard learning procedures, such as backpropagation, which may require many passes through the data. A single pass through the data delivers the optimal network weights in this approach. A great theoretical advantage of this approach is that we can exploit results already obtained for regularized solutions to 2.1. We shall rely on a general result by Lukas (1988) (Lukas's theorem) and check that the assumptions of this theorem are satisfied for two types of activation functions belonging to the class of sufficiently kinky polynomial splines.

3 A General Result on Convergence Rates for Regularized Solutions: Lukas's Theorem
Suppose we observe noisy output data, i.e.,

yᵢ = f₀(xᵢ) + εᵢ = Kβ₀(xᵢ) + εᵢ,    i = 1, 2, …, n    (3.1)
where f₀ denotes the true unknown mapping of interest. The regularized solution β̂ₙ is defined as the minimizer with respect to β ∈ L² of

(1/n) Σᵢ₌₁ⁿ [yᵢ − Kβ(xᵢ)]² + αₙ ‖β‖²    (3.2)

where αₙ is a scalar regularization factor such that αₙ → 0 as n → ∞, and ‖β‖² = ∫₀¹ β²(y) dy. The explicit solution is given by Wahba (1977) as
β̂ₙ(·) = η(·)′ (Qₙ + n αₙ I)⁻¹ y    (3.3)

where y = (y₁, …, yₙ)′ and η = (η_{x₁}, …, η_{xₙ})′ with η_{xᵢ}(y) = K(xᵢ, y), and Qₙ is the n × n matrix with i,jth entry Q(xᵢ, xⱼ) = (η_{xᵢ}, η_{xⱼ}). Note that 3.3 delivers an explicit expression for the estimator of the hidden to output weights. Further,
f̂ₙ(·) = Q(·)′ (Qₙ + n αₙ I)⁻¹ y    (3.4)

where Q(·) = [Q_{x₁}(·), …, Q_{xₙ}(·)]′, so 3.4 gives an explicit expression for the resulting network output. From equations 3.3 and 3.4, the role of the regularization factor is clear. Provided αₙ does not approach zero too fast, the term n αₙ will increase in such a way as to compensate for the fact that as n increases,
Qₙ becomes more and more poorly conditioned. Thus, if αₙ → 0 at an appropriate rate, (Qₙ + n αₙ I)⁻¹ in general will be bounded. Because an explicit solution is available, we avoid the need to undertake nonlinear iterative estimation, such as the method of backpropagation. The present approach has connections to a procedure known in the statistics literature as "adaptive ridge regression" (Judge et al. 1985, ch. 22). When β(y) = β and K(x, y) = x for all y ∈ [0,1], so that Kβ(x) = βx, then equation 3.2 can be rewritten as
(1/n) Σᵢ₌₁ⁿ (yᵢ − β xᵢ)² + αₙ β²    (3.2′)

The minimizer of equation 3.2′ with respect to the scalar β is the adaptive ridge estimator

β̂ₙ = Σᵢ₌₁ⁿ xᵢ yᵢ / (Σᵢ₌₁ⁿ xᵢ² + n αₙ)    (3.3′)
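Equations 3.3 and 3.4 reduce to finite linear algebra. The sketch below is illustrative rather than the paper's experiment: it assumes the m = 1 spline kernel of Section 4, for which Q = KK* has the closed form Q(x, y) = min(x, y) (this closed form, the target function, and the noise level are our assumptions).

```python
import numpy as np

def Q(x, y):
    # Q(x, y) = min(x, y): the kernel Q = KK* when K(x, u) = 1{u <= x},
    # the m = 1 truncated-power spline kernel (assumed for illustration).
    return np.minimum.outer(x, y)

rng = np.random.default_rng(1)
n, alpha = 50, 1e-3
x = np.arange(1, n + 1) / n                 # equally spaced design x_i = i/n
f0 = x * (1 - x)                            # smooth target with f0(0) = 0
y = f0 + 0.05 * rng.normal(size=n)          # noisy observations, as in 3.1

Qn = Q(x, x)                                # Gram matrix, entries Q(x_i, x_j)
c = np.linalg.solve(Qn + n * alpha * np.eye(n), y)   # (Q_n + n alpha I)^{-1} y
fhat = Qn @ c                               # network output at the x_i (cf. 3.4)
mse = float(np.mean((fhat - f0) ** 2))      # squared error against the truth
```

A single linear solve delivers the fitted network, in line with the remark that an explicit solution avoids iterative procedures such as backpropagation.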
To study the behavior of regularized solutions we impose the following assumptions, which rely on definitions given in the Appendix. Readers unfamiliar with the theory of reproducing kernel Hilbert spaces should certainly look this appendix over first.

Assumption 3.1. f₀ ∈ Hˢ[0,1] for some s ≥ 1, where Hˢ is a reproducing kernel Hilbert space as defined in Definition A.16, and β₀ ∈ N(K)⊥ ⊂ L²[0,1], where β₀ is such that f₀(x) = Kβ₀(x).

Essentially, Assumption 3.1 guarantees the existence of a solution to 2.1. From Definition A.17, we know that f₀ ∈ Hᵖ for p = 0, 1.

Assumption 3.2. {εᵢ} is a sequence of identically distributed random variables such that
E(εᵢ) = 0,    E(εᵢ εⱼ) = δᵢⱼ σ²

where δᵢⱼ = 1 if i = j and 0 otherwise, and σ² < ∞.
Assumption 3.3. The eigenvalues of Q, say λⱼ, decline to zero as j → ∞ at a rate equal to j⁻²ᵖ, with p > 1/2.

This assumption imposes restrictions on the choice of K, in particular on the smoothness of K. It rules out the use of the familiar logistic activation function, as the logistic is an analytic function, so that the eigenvalues decline to zero at an exponential rate (see Wahba 1990, section 8.1); Assumption 3.3
is thus violated. In fact if the eigenvalues of Q decay too fast, then Qₙ is very poorly conditioned, and even if αₙ → 0 at a proper rate as n → ∞, (Qₙ + n αₙ I)⁻¹ will tend to be unbounded. For this numerical reason, even if the target function is very smooth, e.g., analytic, we may prefer to approximate it with a less smooth kernel (activation function).
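The numerical point can be illustrated directly: the Gram matrix of a spline-type kernel has polynomially decaying eigenvalues, while that of an analytic kernel decays far faster and is much more poorly conditioned. A rough sketch (the Gaussian stands in for a generic analytic kernel; both kernel choices are illustrative assumptions):

```python
import numpy as np

n = 200
x = np.arange(1, n + 1) / n
# Spline-type kernel Q(x, y) = min(x, y): eigenvalues decay polynomially.
Q_spline = np.minimum.outer(x, x) / n
# Analytic Gaussian kernel: eigenvalues decay nearly exponentially.
Q_smooth = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1) / n

ev_spline = np.sort(np.linalg.eigvalsh(Q_spline))[::-1]
ev_smooth = np.sort(np.linalg.eigvalsh(Q_smooth))[::-1]

# Ratio of the 20th to the largest eigenvalue: vastly smaller for the
# analytic kernel, i.e. its Gram matrix is far more poorly conditioned.
r_spline = float(ev_spline[19] / ev_spline[0])
r_smooth = float(ev_smooth[19] / ev_smooth[0])
```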
Assumption 3.4. Let p be as in Assumption 3.3, and let {xᵢ} be a sequence of nonstochastic scalars for which there exist a constant ν with 0 < ν < 1 − 1/(4p) and a sequence kₙ → 0 such that for any f, g ∈ Hˢ, s ≥ 1, we have

| (1/n) Σᵢ₌₁ⁿ f(xᵢ) g(xᵢ) − ∫₀¹ f(x) g(x) dF(x) | ≤ kₙ ‖f‖_ν ‖g‖_ν

where F is a distribution function on [0,1], and the norm ‖·‖_ν is as in Definition A.16.
Assumption 3.4 is a condition on the asymptotic design of the data points and on the goodness of the discretization. As we will see in the proof of Theorem 4.6 below, such a condition is satisfied for the case of equally spaced data, that is, xᵢ = i/n, i = 1, 2, …, n.
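For the equally spaced design the quality of the discretization is easy to probe: the gap between the sample average (1/n) Σ f(xᵢ)g(xᵢ) and ∫₀¹ f g dx shrinks as n grows. The functions below are illustrative choices with a closed-form integral, not those of the paper:

```python
import numpy as np

f = lambda t: np.exp(t)
g = lambda t: np.cos(t)
# Closed form: int_0^1 e^t cos(t) dt = (e (cos 1 + sin 1) - 1) / 2.
exact = (np.e * (np.cos(1.0) + np.sin(1.0)) - 1.0) / 2.0

def gap(n):
    # |(1/n) sum_{i=1}^n f(x_i) g(x_i) - int_0^1 f g dx| for x_i = i/n.
    x = np.arange(1, n + 1) / n
    return abs(float(np.mean(f(x) * g(x))) - exact)

gaps = [gap(n) for n in (10, 100, 1000)]   # shrinks roughly like 1/n
```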
Assumption 3.5. There exist sequences {kₙ} and {αₙ} such that if s ≥ max{ν, p} then kₙ αₙ^(−1/(4p)) → 0, and if s > ν then kₙ αₙ^(ν/2 − s/2 − 1/(4p)) → 0.
Assumption 3.5 says that the faster kₙ approaches zero, the faster αₙ may approach zero.
Theorem 3.1 (Lukas's theorem; Lukas 1988). Given Assumptions 3.1–3.5, if s ≥ ν + 2, then αₙ is optimal, in the sense of guaranteeing that the squared bias and the variance of the estimate of f₀ have the same order of magnitude, if and only if it satisfies the rate condition given by Lukas (1988), where † denotes the generalized inverse and ~ means "of the same order of magnitude." If instead p < s ≤ ν + 2, then αₙ is optimal if and only if the corresponding rate condition holds. With this choice for αₙ, the optimal convergence rate follows.
In practice, we need to choose αₙ from the data. If we pick αₙ too large, then the approximate solution does not fit the data well; if instead we pick αₙ too small, then the approximate solution has too large a norm. A thorough discussion of the choice of αₙ is given by Wahba (1977), who suggests choosing αₙ via the method of weighted cross-validation. Wahba also shows that for f₀ ∈ Hˢ, s ≥ 2, as defined in Assumption 3.1, the sequence of αₙ obtained via weighted cross-validation is of the correct order.

4 A Convergence Rate Result for Neural Network Models with a Continuum of Units
In this section we show that the assumptions of Lukas's theorem are satisfied for two types of activation functions, namely

K(x, y) = (x − y)₊^(m−1) / (m − 1)!    (4.1)

where u₊ = u if u ≥ 0 and is 0 otherwise, and

K(x, y) = min(x, y) (1 − max(x, y))    (4.2)
From Definition 2.2 in Stinchcombe and White (1990), an activation function A ∈ C(R) is a sufficiently kinky spline if (i) there are a finite number of knots, say x₁ < x₂ < … < x_k, such that on each interval between consecutive knots A coincides with a polynomial Pᵢ of finite degree, and (ii) either one of the highest order polynomials Pᵢ adjoins a lower order polynomial, or all the polynomials have the same order and two of them have different highest order coefficients. The kernels given in 4.1 and 4.2 can thus be considered as particular types of sufficiently kinky polynomial spline activation functions. In particular, the scaling weight in 4.1 is assumed to be equal to 1 and the bias term (y) belongs to [0,1], while in 4.2 both the bias term and the scaling weight belong to [0,1]. Such kernels can also be interpreted as radial basis functions for the case d = 1. Before moving on to a direct check of Lukas's assumptions, we state some basic facts relating integral and differential equations.

Fact 4.1 (Tricomi 1956, p. 31; Jerri 1985, p. 60). The kernel 4.1 is the Green's function associated with the following boundary value problem:
Lf = f^(m)    s.t.    f^(j)(0) = 0 for j = 0, 1, …, m − 1

where L is a differential operator of order m. In terms of equation 4.1, we have that β(y) = f^(m)(y) and KLf = f; hence LKβ = β, so K⁻¹ exists and is equal to L.

Fact 4.2 (Tricomi 1956, p. 131). The kernel 4.2 is the Green's function associated with the following boundary value problem:
L*f = f^(2)    s.t.    f(0) = f(1) = 0

where L* is a differential operator of order 2. In this case β(y) = f^(2)(y) and, as above, KL*f = f and L*Kβ = β, so K⁻¹ exists and is equal to L*.

Fact 4.3. Define the collection of functions on [0,1]
W₀ᵐ = { f : f^(j)(0) = 0 for j = 0, 1, …, m − 1; f^(m−1) absolutely continuous; ∫₀¹ (f^(m))²(x) dx < ∞ }

Then (i) for m ≥ 1, W₀ᵐ is an RKHS with RK given by Q(x, y) = ∫₀¹ K(x, u) K(u, y) du, where K is defined as in 4.1; and (ii) ‖f‖_Q = ‖f^(m)‖.

Proof. (i) From Fact 4.1 we know that there exists a solution to the Fredholm equation of the first kind 2.1, and the solution is of the type β(y) = f^(m)(y). By Theorem A.14, the existence of a solution implies that f belongs to an RKHS with RK given by Q(·, ·). (ii) ‖f‖_Q = ‖Q^(−1/2) f‖ = ‖f^(m)‖, since Q^(−1/2) = K⁻¹ = L, which is a differential operator of order m. □

Fact 4.4. Define the collection of functions on [0,1]

W₀² = { f : f(0) = f(1) = 0; f^(j) absolutely continuous for j = 0, 1; ∫₀¹ (f^(2))²(x) dx < ∞ }
0 for all n. Since, by assumption, δₙ → 0 as n → ∞, it follows that Λₙ ≥ Λ > 0 for all n sufficiently large. □
Appendix

Definition A.1 (Fredholm Equations of the First Kind). Let f : [0,1] → R and K : [0,1] × [0,1] → R be given. By a Fredholm integral equation of the first kind is meant an equation of the type

f(x) = ∫₀¹ K(x, y) β(y) dy
where β : [0,1] → R is to be found. The mapping K is called a kernel. Fredholm equations of the first kind are typically ill-posed, in that small perturbations in f can cause large perturbations in β. A method for solving an ill-posed problem is called a regularization method. □
Definition A.2 (Hilbert Space). A Hilbert space H is a vector space equipped with an inner product (·, ·) : H × H → R (a symmetric and bilinear map) such that H is complete with respect to the norm ‖f‖ := (f, f)^(1/2), f ∈ H. □
The space L² of measurable functions f : [0,1] → R such that ∫₀¹ f(x)² dx < ∞, equipped with the inner product (f, g) := ∫₀¹ f(x) g(x) dx, is a Hilbert space. Two elements of a Hilbert space, f and g, are said to be orthogonal if (f, g) = 0. Given a subset of a Hilbert space, M ⊂ H, we denote the
orthogonal complement of M as

M⊥ = { y ∈ H : (y, x) = 0 for any x ∈ M }
A Hilbert space valued operator K is a linear map from H to H. The operator is compact if for every norm bounded sequence {fₙ}, the sequence {Kfₙ} has a convergent subsequence. Compactness implies boundedness and continuity. [K is bounded if ‖K‖ = sup{‖Ky‖ : ‖y‖ ≤ 1} < C for some constant C; K is continuous if ‖Ky₁ − Ky₂‖ < ε whenever ‖y₁ − y₂‖ < δ.] If (Kf, g) = (f, K*g) for all f, g ∈ H, then K* is the adjoint operator of K. K is self-adjoint, or symmetric, if K = K*. Our interest centers on the operator K defined as
Kβ := ∫₀¹ K(·, y) β(y) dy    (A.2)
The problem of interest can now be represented as solving the operator (integral) equation

f = Kβ

for the unknown function β.

Fact A.3 (Groetsch 1980, p. 140; 1984, p. 67). If the kernel K in equation A.2 is such that K ∈ L²([0,1]²), then K : L²[0,1] → L²[0,1] is a compact linear operator. If in addition K is symmetric, then K is self-adjoint, i.e., K = K*. Compactness implies boundedness of the operator; in fact compact operators are continuous and continuous linear operators are bounded. □
Definition A.4 (Range and Null Space of K) (Groetsch 1980, p. 112). For K : L²[0,1] → L²[0,1], the range is

R(K) = { f ∈ L²[0,1] : Kβ = f for some β ∈ L²[0,1] }

and the null space is

N(K) = { β ∈ L²[0,1] : Kβ = 0 }

The following four identities hold:

N(K)⊥ = cl R(K*)    N(K) = R(K*)⊥
N(K*)⊥ = cl R(K)    N(K*) = R(K)⊥

Thus if K is self-adjoint, the identities above reduce to:

N(K) = R(K)⊥    N(K)⊥ = cl R(K)    □
Definition A.5 (Reproducing Kernel Hilbert Spaces, RKHS) (Wahba 1990, p. 1). A Hilbert space H of real functions on [0,1] is said to be an RKHS if, for any x ∈ [0,1], the evaluation functional Lₓ : H → R, i.e., Lₓf = f(x), is a bounded linear functional.
The Hilbert space of square integrable functions on [0,1], L²[0,1], is not an RKHS, in that elements of L²[0,1] are not even defined pointwise; e.g., |f(x)| may be unboundedly large for some x ∈ [0,1]. For the purpose of regularization, the leading examples of RKHSs are the Sobolev spaces of order m = 1, 2, … defined as W^(m,2)[0,1] = { f : f^(m−1) absolutely continuous, ∫₀¹ (f^(m))² < ∞ }. Thus, polynomials of order m and certain trigonometric functions belong to such an RKHS. □

Definition A.6 (Reproducing Kernel, RK) (Wahba 1990, p. 2). If H is an RKHS, then, for any x ∈ [0,1] there exists, by the Riesz representation theorem, an element Q(x, ·) ∈ H with the "reproducing" property

Lₓf = [Q(x, ·), f] = f(x)    for all f ∈ H

where [·, ·] denotes the inner product in an RKHS. We call Q the reproducing kernel (RK). The inner product [·, ·] depends on the specific RKHS we are considering. Hereafter [·, ·]_Q denotes the inner product in an RKHS with RK Q; the subscript Q is suppressed when there are no ambiguities. For example, for f, g ∈ H, [f, g] = ∫₀¹ K†f(x) K†g(x) dx, where K† is the generalized inverse of K, with K defined as in A.2. Put
With f
Q-r = Q(.. Y)
= Q(x. ')
= Qr,
we also have
IQx, Q7l = Q(x3 Y)
Given the last equation, it follows that Q is a positive definite kernel, that is x
w I Q ( x I , x I2) 0
0
IJ=l
Fact A.7 (Aronszajn 1950, p. 344). If ℋ is an RKHS, then (i) ℋ has a unique RK, Q; and (ii) the converse holds: to every positive definite kernel Q there corresponds one and only one class of functions forming a Hilbert space and admitting Q as a reproducing kernel. □

When Q is an RK, the operator Q, where Q = KK*, is called a Hilbert-Schmidt operator. It is positive definite, compact, and self-adjoint.

Definition A.8 (Eigenvalues and Eigenvectors). A scalar λ is called an eigenvalue of a bounded linear operator K : L²[0,1] → L²[0,1] if there exists a nonzero vector φ ∈ L², called an eigenvector associated with λ, such that Kφ = λφ. □

Fact A.9. If K is self-adjoint, all its eigenvalues are real and countable in number.

Fact A.10 (Aronszajn 1950, pp. 342 and 344). Let

Qf = ∫₀¹ Q(·, y) f(y) dy.
Regularized Neural Networks
1241
Then (i) if ℋ_Q is an RKHS with RK Q, then ‖f‖_Q = ‖Q^{†/2}f‖, where Q† denotes the generalized inverse of Q = KK*. If the domain of Q† is equal to the range of Q, then the true inverse exists and Q† = Q⁻¹. As we will see in more detail in Definition A.16 below, ‖Q^{†/2}f‖ is well defined provided the eigenvalues of Q approach zero sufficiently fast. (ii) Convergence in the norm ‖·‖_Q implies pointwise convergence and convergence in the L²-norm. □
Theorem A.11 (Mercer, Hilbert, Schmidt Theorem) (Riesz and Nagy 1955, p. 242). Let Q be an RK and Q the associated Hilbert-Schmidt operator. The following two statements hold:

(i) Q(x, y) = Σ_{ν=1}^∞ λ_ν φ_ν(x) φ_ν(y), where λ_1 ≥ λ_2 ≥ … (repeated according to the dimension of the eigenspace) are the eigenvalues of Q and φ_1, φ_2, … are the associated eigenfunctions, forming a complete orthonormal system for N(Q)⊥ = H_0, where H_0 = {f ∈ N(Q)⊥ ⊂ L²[0,1] : Σ_{ν=1}^∞ (f, φ_ν)² < ∞}. Moreover, by compactness, λ_n → 0 as n → ∞.

(ii) f(x) = Σ_{ν=1}^∞ (f, φ_ν) φ_ν(x) for any f ∈ ℋ_Q.

Note that if Kβ = f with compact and symmetric kernel K, then the theorem above holds with Q = KK. □
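Statement (i) can be checked numerically. The sketch below is an illustration only, not from the paper: it discretizes the Hilbert-Schmidt operator for the assumed positive definite kernel Q(x, y) = min(x, y) on a uniform grid, and verifies that the eigenexpansion reconstructs the kernel and that the eigenvalues decay to zero.

```python
import numpy as np

# Illustrative check of the Mercer expansion; the kernel Q(x, y) = min(x, y)
# and the grid discretization are assumptions for this sketch.
n = 200
h = 1.0 / n
x = (np.arange(n) + 0.5) * h              # midpoint grid on [0, 1]
Q = np.minimum.outer(x, x)                # kernel matrix Q(x_i, x_j)

lam, phi = np.linalg.eigh(Q * h)          # discretized operator Qf ~ sum_j Q(., x_j) f(x_j) h
lam, phi = lam[::-1], phi[:, ::-1]        # sort eigenvalues in descending order
phi = phi / np.sqrt(h)                    # normalize: sum_k phi_i(x_k) phi_j(x_k) h = delta_ij

Q_rec = (phi * lam) @ phi.T               # sum_nu lam_nu phi_nu(x) phi_nu(y)
print(np.max(np.abs(Q_rec - Q)))          # tiny: the expansion reconstructs Q
print(lam[:3], lam[-1])                   # eigenvalues decay toward zero
```

The rapid eigenvalue decay seen here is exactly the compactness property the theorem invokes.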
Theorem A.12 (Picard Theorem) (Diaz and Metcalf 1970; Groetsch 1980, p. 156). Let K : L²[0,1] → L²[0,1] be a compact and self-adjoint operator, let λ_1 ≥ λ_2 ≥ … be the eigenvalues of KK* = KK, and let (φ_1, φ_2, …) be the associated eigenfunctions. In order that the equation Kβ = f have a solution [i.e., f ∈ R(K)], a set of necessary and sufficient conditions is:

(i) f ∈ N(K)⊥ = cl R(K);

(ii) Σ_{j=1}^∞ λ_j^{-1} |(f, φ_j)|² < ∞.

The solution will be of the type β = K†f, where K† = K⁻¹ if the latter exists. □

Further, if K is compact but not self-adjoint, then the theorem is still valid, provided condition (i) is replaced by (i′) f ∈ N(K*)⊥ = cl R(K).
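The Picard conditions can also be seen numerically. In the sketch below (an illustration under assumed choices, not from the paper), K is the discretized integral operator with the symmetric kernel K(x, y) = min(x, y); the Picard sum is finite for f = Kβ but blows up for a rough f that does not lie in R(K).

```python
import numpy as np

# Illustrative check of the Picard criterion; kernel and grid are assumptions.
n = 300
h = 1.0 / n
x = (np.arange(n) + 0.5) * h
K = np.minimum.outer(x, x) * h             # compact self-adjoint operator, discretized
mu, phi = np.linalg.eigh(K)                # eigenpairs of K
lam = mu ** 2                              # eigenvalues of K K* = K^2

beta = np.sin(2 * np.pi * x)               # a genuine L^2 solution
f_good = K @ beta                          # f in R(K): condition (ii) holds
f_bad = np.random.default_rng(0).standard_normal(n)  # rough f, not in R(K)

def picard_sum(f):
    """Condition (ii): sum_j lam_j^{-1} |(f, phi_j)|^2."""
    return np.sum((phi.T @ f) ** 2 / lam)

print(picard_sum(f_good))   # finite (approximately beta @ beta here)
print(picard_sum(f_bad))    # many orders of magnitude larger: no L^2 solution
```

For f = Kβ the sum collapses exactly to the squared norm of β, which is why the first value is moderate while the second explodes.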
Theorem A.13 (Groetsch 1984, pp. 89-90). Let K : L²[0,1] → L²[0,1] be given. R(K) is an RKHS with RK given by

Q(x, y) = ∫₀¹ K(x, u) K(u, y) du.
Theorem A.14. Given that K : L²[0,1] → L²[0,1] is a compact linear operator, the two following conditions are equivalent: (i) there exists a solution β ∈ L²[0,1] to Kβ = f; and (ii) f ∈ ℋ_Q[0,1], where ℋ_Q is an RKHS with RK Q, given by

Q(x, y) = ∫₀¹ K(x, u) K(u, y) du.

Note that the ‖·‖_Q-norm is tighter than the L²-norm.

Proof. (i ⇒ ii) A solution exists if and only if the Picard criterion is satisfied, i.e., if f ∈ R(K); by Theorem A.13, R(K) is an RKHS with RK Q. Now, by Fact A.7 (ii), Q(x, y) defines one and only one class of functions, so R(K) = ℋ_Q. (ii ⇒ i) f ∈ ℋ_Q = R(K); thus the Picard criterion is satisfied and there exists a solution. □

Fact A.15. The space ℋ_Q can also be defined as

ℋ_Q = {f ∈ L²[0,1] : Σ_{j=1}^∞ λ_j^{-1} |(f, φ_j)|² < ∞}.
Proof. From Fact A.10 (i), ‖f‖_Q = ‖Q^{†/2}f‖, and by Theorem A.12 (ii), ‖f‖²_Q = Σ_{j=1}^∞ λ_j^{-1} |(f, φ_j)|² < ∞, where λ_j and φ_j are the jth eigenvalue and eigenvector of Q. □
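Continuing the numerical illustration (same assumed kernel and grid as in the earlier sketches, not from the paper), the two characterizations of ‖f‖_Q, via Q^{†/2} in Fact A.10 (i) and via the eigenexpansion in Fact A.15, can be computed and compared directly:

```python
import numpy as np

# Two routes to the RKHS norm ||f||_Q; kernel and grid choices are illustrative.
n = 150
h = 1.0 / n
x = (np.arange(n) + 0.5) * h
K = np.minimum.outer(x, x) * h
Q = K @ K                                  # Q = K K* (K is symmetric here)
lam, phi = np.linalg.eigh(Q)

f = K @ np.cos(np.pi * x)                  # f in R(K), hence f lies in H_Q

keep = lam > 1e-10 * lam.max()             # generalized inverse: drop tiny eigenvalues
# Route 1 (Fact A.10 i): ||f||_Q = ||Q^{+/2} f||.
Q_half_pinv = (phi[:, keep] / np.sqrt(lam[keep])) @ phi[:, keep].T
norm1 = np.linalg.norm(Q_half_pinv @ f)
# Route 2 (Fact A.15): ||f||_Q^2 = sum_j lam_j^{-1} (f, phi_j)^2.
c = phi[:, keep].T @ f
norm2 = np.sqrt(np.sum(c ** 2 / lam[keep]))
print(norm1, norm2)                        # the two routes agree
```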
Definition A.16 (Hilbert Scale) (Lukas 1988, p. 110). For s ∈ ℝ₊ define

H^s = {f ∈ H_0 : Σ_{j=1}^∞ λ_j^{-s} (f, φ_j)² < ∞}.

The collection {H^s, s ∈ ℝ₊} is called a scale of Hilbert spaces; the higher is s, the stronger is the norm ‖·‖_{H^s}. In fact, for ν < s, H^s ⊂ H^ν with continuous embedding, in the Rellich theorem sense; that is, convergence in the H^s-norm implies convergence in the H^ν-norm, for all ν < s. From Theorems A.11 and A.13, it follows that a solution to (A.1) exists, i.e., f ∈ ℋ_Q, if and only if f ∈ H^s for some s ≥ 1. Thus H^s is an RKHS if and only if s ≥ 1. (Thus, H_0 appearing in Theorem A.11 is not an RKHS.)

Definition A.17 (Lukas 1988, p. 111). For μ ∈ ℝ₊, the space S^μ is defined to be the Hilbert completion of

{β ∈ N(K)⊥ ⊂ L²[0,1] : Kβ ∈ H^μ}

under the inner product (β, g)_{S^μ} = (Kβ, Kg)_{H^μ}, where (f, g)_{H^μ} = Σ_{j=1}^∞ λ_j^{-μ} (f, φ_j)(g, φ_j). It should be pointed out that L² ⊂ S⁰, in that L²[0,1] is continuously embedded in S⁰; further, S¹ = N(K)⊥.
Fact A.18 (Lukas 1988, p. 111). Given β ∈ S^μ, the map K : S^μ → H^μ is an isometric isomorphism onto its range, so (β, g)_{S^μ} = (Kβ, Kg)_{H^μ}, where H^μ is an RKHS if and only if μ ≥ 1. If μ = 0, then we have (β, g)_{S⁰} = (Kβ, Kg)_{H⁰} ≤ (Kβ, Kg). The last inequality follows from the fact that H⁰ = N(Q)⊥ ⊂ L²[0,1]. □

Remark A.19. Given f(x) = Kβ(x) = ∫₀¹ K(x, τ) β(τ) dτ, if K is symmetric and K⁻¹ is a differential operator of order m, i.e., K⁻¹f := f^{(m)}, under some boundary conditions on f and/or on its derivatives f^{(j)}, then K is said to be the Green's function for the boundary value problem K⁻¹f = f^{(m)} (subject to boundary conditions). Let Q = KK; then the eigenvectors of Q form a complete orthonormal system in L²[0,1] (see Tricomi 1956, p. 134). Thus N(Q)⊥ = L² and, since the eigenvectors of K are the same as those of Q, we also have N(K)⊥ = L². It follows that (i)

H⁰ = L²[0,1] ⊂ S⁰

and (ii)

(β, g)_{S⁰} = (Kβ, Kg)_{H⁰} = (Kβ, Kg),

where (·, ·) is the usual inner product in L². Thus convergence in the H⁰-norm can be interpreted as convergence in the more familiar L²-norm.

Acknowledgments

We wish to thank the Editor and an anonymous reviewer for very useful comments and suggestions on an earlier version of this paper. This research was supported by NSF grants SES 92-09023 and ISI 92-03532.

References

Adams, R. A. 1975. Sobolev Spaces. Academic Press, New York.
Agmon, S. 1965. Lectures on Elliptic Boundary Value Problems. Van Nostrand, Toronto.
Aronszajn, N. 1950. Theory of reproducing kernels. Trans. Am. Math. Soc. 68, 337-404.
Cox, D. D. 1983. Asymptotics for M-type smoothing splines. Ann. Statist. 11, 530-551.
Cox, D. D. 1984. Multivariate smoothing splines. SIAM J. Numerical Anal. 21, 789-813.
Diaz, J. B., and Metcalf, F. T. 1970. On iteration procedures for equations of the first kind, Ax = y, and Picard's criterion for the existence of a solution. Math. Comput. 24, 923-935.
Gallant, A. R. 1981. On the bias of flexible Fourier forms and an essentially unbiased form: The Fourier flexible form. J. Econometr. 15, 211-246.
Groetsch, C. W. 1980. Elements of Applicable Functional Analysis. Dekker, New York.
Groetsch, C. W. 1984. The Theory of Tikhonov Regularization for Fredholm Equations of the First Kind. Pitman, New York.
Hornik, K., Stinchcombe, M., and White, H. 1990. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 3, 551-560.
Irie, B., and Miyake, S. 1988. Capabilities of three-layered perceptrons. In Proceedings of the 1988 IEEE International Conference on Neural Networks, Vol. I, pp. 593-606. IEEE Press, New York.
Jerri, A. J. 1985. An Introduction to Integral Equations with Applications. Dekker, New York.
Judge, G., Griffiths, W. E., Hill, R. C., Lutkepohl, H., and Lee, T. C. 1985. The Theory and Practice of Econometrics. Wiley, New York.
Lukas, M. A. 1988. Convergence rates for regularized solutions. Math. Comp. 51, 107-131.
Nashed, M. Z., and Wahba, G. 1974. Generalized inverses in reproducing kernel spaces: An approach to regularization of linear operator equations. SIAM J. Math. Anal. 5, 974-987.
Poggio, T., and Girosi, F. 1990. Networks and the best approximation property. Biol. Cybernet. 63, 169-176.
Riesz, F., and Nagy, B. 1955. Functional Analysis. Ungar, New York.
Stinchcombe, M., and White, H. 1990. Approximating and learning unknown mappings using multilayer feedforward neural networks with bounded weights. In Proceedings of the International Joint Conference on Neural Networks, Vol. III, pp. 7-16. IEEE Press, New York.
Tricomi, F. G. 1956. Integral Equations. Dover, New York (reprint 1987).
Wahba, G. 1977. Practical approximate solutions to linear operator equations when the data are noisy. SIAM J. Numerical Anal. 14, 651-667.
Wahba, G. 1990. Spline Models for Observational Data. SIAM, Philadelphia.
Xu, L., Krzyzak, A., and Yuille, A. 1994. On radial basis function and kernel regression: Statistical consistency, convergence rates, and receptive field size. Neural Networks 7, 609-628.
Received July 26, 1993; accepted February 28, 1995.
Communicated by Steven Nowlan
On the Practical Applicability of VC Dimension Bounds

Sean B. Holden*
Mahesan Niranjan
Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, England

This article addresses the question of whether some recent Vapnik-Chervonenkis (VC) dimension-based bounds on sample complexity can be regarded as a practical design tool. Specifically, we are interested in bounds on the sample complexity for the problem of training a pattern classifier such that we can expect it to perform valid generalization. Early results using the VC dimension, while being extremely powerful, suffered from the fact that their sample complexity predictions were rather impractical. More recent results have begun to improve the situation by attempting to take specific account of the precise algorithm used to train the classifier. We perform a series of experiments based on a task involving the classification of sets of vowel formant frequencies. The results of these experiments indicate that the more recent theories provide sample complexity predictions that are significantly more applicable in practice than those provided by earlier theories; however, we also find that the recent theories still have significant shortcomings.

1 Introduction
Of the small number of existing, alternative theories that aim to model the phenomenon of generalization, one of the most widely studied is that based on computational learning theory (Anthony and Biggs 1992; Natarajan 1991), which uses ideas originally introduced by Valiant (1984) and Blumer et al. (1989). It has become clear that a parameter of fundamental importance in this theory is the Vapnik-Chervonenkis (VC) dimension, which we define in full below. The VC dimension can be regarded as a measure of the capacity or expressive power of a connectionist network or other pattern classifier. In this article we address the following question: do the VC dimension bounds available at present in any way constitute a practically applicable design tool, in the sense that they can be used in practice to guide the design of a pattern classifier? This type of question is not often asked by

*Present address: Department of Computer Science, University College London, Gower Street, London WC1E 6BT, England.

Neural Computation 7, 1265-1288 (1995) © 1995 Massachusetts Institute of Technology
1266
Sean B. Holden and Mahesan Niranjan
researchers in computational learning theory, where the emphasis tends to be on the production of powerful theoretical results. However, despite the significant intrinsic interest inspired by such results, the long-term aim of such studies must be to provide powerful and generally applicable tools for the design of machine learning systems, and, consequently, it is important that some attempt is made to assess the available theoretical results from this point of view.

The results presented in this article can be regarded as an extension of those obtained by Cohn and Tesauro (1992), who have made a detailed study of the average generalization performance of various networks applied to some simple problems, and compared the results with the worst-case bounds provided by some VC dimension based results. However, there are three important differences. First, all our experiments use types of networks for which either exact results or very good bounds on the VC dimension are known. This is advantageous for the reasons discussed in Section 3; some, but not all, of the experiments in Cohn and Tesauro (1992) used networks with this property. Second, our networks can be trained without the need to use the backpropagation algorithm; use of this algorithm leads, as discussed in Cohn and Tesauro (1992), to the need to be extremely careful in the control of possible associated random and systematic experimental errors. Third, whereas in Cohn and Tesauro (1992) the experiments are based on synthetic data for rather unrealistic problems, namely the "majority," "real-valued threshold," "majority-XOR," and "threshold-XOR" problems, the experiments presented here are based on real data, namely a large set of formant frequencies for 10 different vowels uttered by people of different age and gender; these data were introduced by Peterson and Barney (1952). Additionally, in this work we concentrate specifically on the investigation of recent bounds due to Haussler et al. (1990, 1994), which were considered only quite briefly in Cohn and Tesauro (1992), but which were found to perform better than earlier bounds in the situations considered. Finally, we discuss in detail the difficulties involved in applying these recent bounds in practice.

1.1 Why are VC Dimension Results Useful? Results based on the VC dimension are useful because they tell us about the ability of a classifier to generalize after it has been trained. There is at present no single, complete theory of generalization that provides us with general and easily applied design guidelines; such a theory would obviously be highly desirable. Results based on the VC dimension have taken various different forms; the best known form (at least, in the connectionist network research community), which appears in the work of Blumer et al. (1989), Baum and Haussler (1989), Holden and Rayner (1995), Shawe-Taylor and Anthony (1991), and others, is as follows. Assume that the classifier of interest takes inputs in ℝⁿ and produces outputs in {0, 1}, and assume that training examples are generated independently according to some
VC Dimension Bounds
1267
arbitrary distribution P on ℝⁿ × {0, 1}. Assume for the moment that our classifier is a connectionist network, and let the network of interest have architecture A (the network can be any type of feedforward network; we ignore the details for the time being). Finally, assume that we have a parameter 0 < ε ≤ 1/4. Then there exists a value k, which is a function only of A and ε, such that if

1. the network can learn at least a fraction 1 - ε/2 of k randomly drawn training examples and

2. all future examples are also drawn according to P,

then there is a probability¹ close to 1 that the actual generalization error of the network is at most ε, where generalization error is defined as the probability that, for a random example (x, o) drawn according to P, the output of the trained network for the input x is not equal to o.

This sounds like, and indeed is, a very powerful result. It is completely independent of the actual distribution P that governs the way in which examples are generated, and it is also independent of the actual algorithm used to train the network. The drawback is that all known upper bounds on the required value of k are rather large, in the sense that they lead to numbers of training examples that we would not in general expect to be able to load with the required accuracy on a network the size of A. This observation was verified experimentally in Cohn and Tesauro (1992). There are two main reasons for this (see Haussler et al. 1994); unfortunately, the result is limited by precisely the characteristics that make it so powerful. First, the result is valid for all distributions P, even the ones that we would never expect to govern the occurrence of data in practice. The second reason is that the result is independent of the algorithm used to train the network, and the explanation here is rather more subtle. Assuming the structure of the network is fixed, then given a particular vector w of weights the network computes a function f_w : ℝⁿ → {0, 1}. We denote by F the class of all such functions, so

F = {f_w : w ∈ ℝ^W}   (1.1)

where W is the total number of weights used by the network and we assume that weights are real-valued. The result described above would apply even if we were able to use a training algorithm that always provides a function having acceptable error on the training examples (assuming that at least one such function exists in F), but that in addition always provides the function that, of all such functions, is the one that provides the worst possible performance on future examples generated according to P.

¹The exact probability involved here can be quantified in terms of a further parameter δ; in this case k is a function of A, ε, and δ. Further elaborations are also possible; we omit the full details here, and refer the reader to Blumer et al. (1989) and Shawe-Taylor and Anthony (1991).
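To see the scale of the numbers involved, the sketch below evaluates a sample-size bound of the kind described. The formula uses the constants commonly quoted from Blumer et al. (1989); exact constants vary between statements of the theorem, so the figures should be read as illustrative only, not as the bound discussed in this article.

```python
import math

# Illustrative PAC sample-size bound; the constants follow the commonly
# quoted Blumer et al. (1989) form and should be treated as an assumption.
def sample_bound(d, eps, delta=0.01):
    """Training-set size sufficient for generalization error <= eps with
    probability >= 1 - delta, for a function class of VC dimension d."""
    return math.ceil(max((4.0 / eps) * math.log2(2.0 / delta),
                         (8.0 * d / eps) * math.log2(13.0 / eps)))

# Even a modest network is predicted to need a very large training set:
for d in (10, 100, 1000):
    print(d, sample_bound(d, 0.1))
```

For a class of VC dimension 100 and ε = 0.1 this already calls for tens of thousands of examples, which is precisely the impracticality the text describes.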
1268
Sean B. Holden and Mahesan Niranjan
Of the two reasons stated for the large size of the standard VC dimension bounds, we might intuitively expect that the first, the distribution independence, is the most significant. However, recent results obtained by Haussler et al. (1990, 1994) suggest that by considering the precise training algorithm used it may be possible to maintain distribution independence and obtain quite practical bounds; note, however, that the model of machine learning used is different in some respects from that illustrated here, and is described in the next section. This is quite definitely an encouraging result; to say that we know something definite about the distribution governing the occurrence of data would, in practice, tend to mean that we would have a significant amount of a priori knowledge about the problem being addressed, and it is clearly desirable to maintain distribution independence if possible.

This article is organized as follows. In Section 2 we briefly review some of the most recent theoretical bounds on the number of examples required when training a classifier or other system under specific conditions. In Section 3 we describe the experiments used to investigate the quality of these bounds; the results of these experiments are described in Section 4 and discussed in full in Section 5, where we also discuss the general practical applicability of the relevant theory. Section 6 concludes the article.

2 Recent VC Dimension Bounds
In this section we provide a brief summary of some of the results in the articles by Haussler et al. (1990, 1994), which are investigated in the remainder of this article. Let X be an environment, which we identify with the set of all possible inputs to the system of interest. This is typically ℝⁿ or a subset such as [0,1]ⁿ, where n is the number of inputs to the system; it can also be a set such as {0,1}ⁿ. Given any class F of functions with domain X and range {0,1} we define its VC dimension VCdim(F) in the usual manner. Given an arbitrary set of k points in X, each function f ∈ F induces a dichotomy or two-coloring on the set by dividing it into two disjoint subsets consisting of the points mapped to 1 and the points mapped to 0. Given such a set we can apply all the functions in F and count the total number of distinct dichotomies obtained. The VC dimension of F is defined as the size k of the largest subset of X for which we can obtain all 2^k possible dichotomies. For examples of VC dimensions for various relevant classes of functions see Anthony and Biggs (1992), Anthony and Holden (1994), Blumer et al. (1989), Bartlett (1992), Maass (1993), Sontag (1992), Holden (1994), and Wenocur and Dudley (1981), and references therein.

The task of training a classifier to solve a given problem can be modeled as that of identifying some target function g_T : X → {0,1}, which is assumed to be a member of some class G of target concepts. We assume
that a sequence T_k of k training examples is generated as follows. The sequence

T_k = ((x_1, o_1), (x_2, o_2), …, (x_k, o_k))   (2.1)
(1990) an algorithm, called the randomized 1-inclusion graph prediction strategy, is constructed that has the following property: if it is provided with a set Tk of examples generated as described above, along with a further input xk+l drawn independently according to P, then the probability that its prediction is not in fact equal to gT(xk+l) is at most VCdim(G)/(k + 1). The fact that an algorithm exists that is capable of providing this performance can be used to obtain a further result described in the next subsection. Some degree of care needs to be taken in interpreting the results of Haussler et a[. (1990). In particular, recall that the generalization error t g ( k )denotes the probability of error for new inputs x generated according to P and classified using a classifier trained on a specific sequence Tk. This is not equivalent to the probability that a single new xk+l generated according to P is misclassified by such a network. In the former case a single trial corresponds to the generation of a single input x according to
Sean B. Holden and Mahesan Niranjan
1270
P, whereas in the latter case a single trial corresponds to the generation of (k + 1) inputs according to P. Formally,2 fg(k) =
:f(x)
# 8T(X))1
(2.2)
where f denotes the function computed by a classifier after training on some sequence Tk. The generalization error t,(k) is the standard measure of generalization performance used in practice. 2.2 Using a Bayes Optimal Classification Algorithm. We noted above that by producing results that are too powerful-by making them independent of the actual distributions or algorithms used-we can obtain results that are rather impractical. In the model of learning being used at present a further source of such problems has been introduced. Specifically, results must apply even if the actual target function gT being used is highly unrealistic. In Haussler et d . (1994) this problem is addressed by introducing a probability distribution P on 6 that governs the way in which target functions appear. The article then considers the performance of a classifier that is optimum in the sense that it implements a Bayes optimal classification algorithm (Duda and Hart 1973). In this case, it can be shown using the result given in the previous subsection that the expected generalization error is
where the expectation is taken over all k-element training sequences and all target functions, and the result holds regardless of the actual distribution; P and P. The bound of equation 2.3 was proved in Haussler et al. (1994), in which it was also conjectured that it will be possible to obtain an improved bound of (2.4) Two important points should be noted here. The first regards the use of a class of target concepts and corresponding distribution P . The use of a class 4 in the theory effectively models the fact that our classifier might, in practice, have to be applied to a selection of different problems. The distribution P can be thought of as encoding our prior beliefs about which function(s) will have to be learned. In this article we consider a single, specific problem (described in the next section). This specific problem corresponds to a specific gT E G, and we can therefore assume that P assigns a probability of 1 to this particular g T and a probability of 0 2We use the notation P[&]to denote the probability of the event & according to the distribution P.
VC Dimension Bounds
1271
to all other target functions. As the results of equations 2.3 and 2.4 are independent of the actual distribution P, they still apply. (This assumption can in fact be problematic and is discussed further in Section 5.) The second point that should be noted is that the Bayes optimal clussification algorithm, which is assumed in deriving equation 2.3, is distinct from the Buyes classifier (Duda and Hart 1973) for a given problem. The Bayes optimal classification algorithm tells us an optimum way of predicting the output associated with a new input on the basis of a finite quantity of training data for the model of machine learning described above, whereas the Bayes classifier tells us how to classify new examples to obtain the smallest possible probability of error, given complete information about the statistics of a pattern classification problem. In fact, in the model of machine learning considered, a function exists-namely gT-that classifies all examples correctly. The Bayes classifier therefore makes no errors for new examples and has an associated error probability of zero. To end this section, it is relevant to mention some further attempts that have been made to obtain more realistic results than those obtained using the standard VC dimension theory. One such attempt has involved the introduction of the effective VC dimension (Guyon et al. 1992; Bottou et al. 19941, and techniques based on statistical physics have also been used for this purpose. A comprehensive review of the latter work is given by Watkin et al. (1993). We will not discuss either of these alternative techniques further in this article. 3 Experiments Using the PetersodBarney Data 3.1 The PetersodBarney Data. The data used for the experiments were derived from a database containing the first four formant frequencies for 10 different vowel sounds uttered by people of different age and gender; this database was originally due to Peterson and Barney (1952). 
For the purposes of this study, a two-class pattern classification problem was constructed in which we attempt to discriminate between the front vowels [i], 111, [el, and [ael (class 1) and the mid vowels [a] and 101, and back vowels [U] and [ul (class 2). Figure 1 illustrates the entire set of available examples as it appears using only formants 2 and 3; in the following experiments all four formant frequencies were used as inputs to the networks. Class 1 contains a total of 600 examples, and class 2 a total of 594 examples. There were no conflicting examples in the complete set of 1194 examples, in the sense that no two examples exist with equal input vectors but conflicting classifications.
3.2 The Networks Used. The networks used in the experiments were specific examples of Linearly Weighted Connectionist Networks. These networks have been studied for many years; examples can be found in Nils-
Sean B. Holden and Mahesan Niranjan
1272
-
PetemonBarney data second and third formants 0
x x X
1200
"
X
?00
1000
1500
Zoo0 2500 Third Fhrmant (Hr)
3000
3500
4000
Figure 1: The Peterson/Barney data. Only two formants are shown in this figure. Examples in class 1 are displayed using x and examples in class 2 using 0. son (1965) and Cover (1965), and an extensive review can be found in Holden (1993). This class of networks computes functions of the general form,
(3.1) where
(3.2) In equations 3.1 and 3.2, wT = [ wo w1 . . . w,,, ] E Rm+'is a vector of W = rn 1 real-valued weights, the : X + R are m fixed, typically nonlinear basis functions, and 7-t denotes the step function,
+
{0
1
%(')
=
if y > 1/2 otherwise
(3.3)
Standard, linear perceptrons are clearly a specific case of this definition, but are not very useful for our purposes. In the following experiments we used two other network types, namely polynomial networks and radial basis function networks having fixed centers. In the former case, the basis functions are products of elements of the input vector xT = [ X I x2 . . . xn 1,
VC Dimension Bounds
1273
for example, q$(x) = xlx$x:o. If a network of this type has n inputs and uses basis functions corresponding to all possible products of up to d input elements then we call it an ( n , d )discriminator. For example, an ( n .2) discriminator computes functions of the form (3.4)
(Illd)
It is possible to show that an (77. d) discriminator has W = weights. In the case of radial basis functions we use inverse mulfiquadric basis functions of the form (3.5)
where each y l E X is a fixed center chosen according to the technique described below and // . 11 is a suitable norm; in our case we assume that X = R" and use the Euclidean norm. There are two reasons for using these networks in preference to more usual alternatives, such as multilayer perceptrons or radial basis function networks with adapting centers. First, in both cases we have very good results for the VC dimension of the network. In the case of polynomial networks we have VCdim(.F) = W and in the case of radial basis function networks we have W - 1 5 VCdim(.F) 5 W; this is proved in Anthony and Holden (1994) (see also Anthony and Holden 1993). In the following work we assume that VCdim(F) = W - 1 in the case of radial basis function networks. In the more usual cases mentioned the best that we can do at present is to bound the VC dimension for some specific cases, and it is not even known in general whether the bounds available are tight. Second, there is a technique available for training these networks that has significant advantages, when compared with the nonlinear optimization required for training the alternative network types, in that it allows us to significantly reduce the likelihood that various potential sources of random and systematic experimental error will affect our results. It is well-known (see Wan 1990; Gish 1990) that when addressing a two-class pattern classification problem using a sufficiently powerful connectionist network with a single, real-valued output we can obtain an approximation to the posterior probability that a given input is in class 1 by minimizing the usual squared error, k
for the examples in a training set Tk. We can therefore obtain an approximation to a Bayes classifier using a network of the form of equation 3.1. Of course, it is important to remember that we are unlikely to obtain the exact Bayes classifier, and consequently that the measured generalization
Sean B. Holden and Mahesan Niranjan
errors obtained are in fact likely to be worse than those obtained using the true Bayes classifier. (The points raised above regarding the distinction between the Bayes classifier and the Bayes optimal classification algorithm should be recalled at this point.) We must be rather careful to consider precisely how much experimental results obtained using classifiers trained by minimizing ξ(w) can tell us about the quality of the bounds in equations 2.3 and 2.4. We use this approach as it is much closer to the types of technique used in practice than the Bayes optimal classification algorithm, and as the latter algorithm is in general likely to be extremely difficult to implement in full. The performance of the Bayes optimal classification algorithm depends on the value of k, as the algorithm only has access to a finite number of training examples. Although minimizing ξ(w) can allow us to approximate the Bayes classifier under suitable conditions, it should be noted that it still corresponds to training a classifier using k examples. As the Bayes optimal classification algorithm is the optimal procedure for predicting outputs corresponding to new inputs within the model of machine learning described above, we should expect classifiers designed by minimizing ξ(w) to perform worse in general than the Bayes optimal classification algorithm. This is discussed further in Section 5. The weight vector that minimizes ξ(w) can be obtained easily as

w = P⁺o
(3.7)

where P is the k × W matrix of basis-function outputs, with elements [P]ij = φj(xi),   (3.8)
oT = [o1 o2 ... ok], and P⁺ denotes the Moore-Penrose pseudoinverse of P (Golub and Van Loan 1989). By training using this technique we obtain a unique, global minimum of ξ(w). We therefore avoid the potential introduction of errors due to convergence to local minima, and in addition we avoid several other potential sources of error, as it is not necessary to choose initial weights, learning rates, momentum constants, training batch size, training cutoff time, or order of pattern presentation as in many alternative techniques. A potential problem with this training technique is that it is not guaranteed to find a weight vector that correctly classifies all the examples in Tk, even if such a weight vector exists. This is discussed in Section 5.
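This training step is easy to sketch in full. The following example is our own illustration, not code from the article: the helper names and the toy data are assumptions. It builds the design matrix of an (n, 2) discriminator and solves for the weights with the Moore-Penrose pseudoinverse.

```python
import numpy as np
from math import comb
from itertools import combinations_with_replacement

def design_matrix(X, d):
    # Columns: all monomials of total degree <= d in the inputs, i.e. the
    # basis functions of an (n, d) discriminator; the number of columns
    # is the weight count W = comb(n + d, d).
    k, n = X.shape
    cols = [np.ones(k)]
    for deg in range(1, d + 1):
        for idx in combinations_with_replacement(range(n), deg):
            cols.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(cols)

def train_pseudoinverse(P, o):
    # w = pinv(P) @ o is the least-squares minimizer of ||P w - o||^2
    # (minimum-norm if P is rank-deficient): no initial weights, learning
    # rates, or local-minima issues.
    return np.linalg.pinv(P) @ o

# XOR-like toy problem, solvable by a (2, 2) discriminator.
X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
o = np.array([-1., 1., 1., -1.])
P = design_matrix(X, 2)
assert P.shape[1] == comb(2 + 2, 2)   # W = 6 weights
w = train_pseudoinverse(P, o)
print(np.sign(P @ w))                  # matches the targets o
```

Because the weights are obtained in closed form, repeated runs never depend on initialization, which is exactly the property exploited above to reduce experimental error.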
3.3 The Experiments. To assess the bounds of equations 2.3 and 2.4 we conducted six experiments, three using polynomial networks and three using radial basis function networks. The polynomial networks
VC Dimension Bounds
were a (4,2) discriminator, a (4,4) discriminator, and a (4,5) discriminator, having 15, 70, and 126 weights, respectively. The radial basis function networks again had 15, 70, and 126 weights, and the centers used were chosen at random such that they were uniformly distributed in the subset of R4 populated by available inputs.3 For each radial basis function network the same set of centers was used throughout the relevant experiment. The networks were trained using the method described; all six networks are powerful enough to learn exactly the entire set of 1194 available examples illustrated in Figure 1. There is an important point that should be noted regarding the choice of networks and the interpretation of the results that are presented below. The actual bounds in equations 2.3 and 2.4 require that we know the VC dimension of the class G of possible target concepts; recall also that we must assume that the network can always learn the training examples exactly. The former point is a significant shortcoming of the current theory, as clearly we are unlikely in practice to be in a position to draw any conclusion about the VC dimension of G. However, because we assume that our network can always learn the available training examples exactly we can assume that VCdim(G) ≤ VCdim(F), and for the purposes of the following work we assume that VCdim(G) = VCdim(F). This issue is discussed in full in Section 5. In each experiment values of k in the range 50 to 790 were examined, using steps of size 20. For each value of k the relevant network was trained for 40 different, randomly selected sets of training examples. In each case the generalization error was estimated using a further (disjoint), randomly selected set of 350 test examples. This allowed us to obtain estimates of the expected and worst case generalization error.
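The experimental protocol just described might be sketched as follows. This is a simplified stand-in of our own: the function and parameter names are assumptions, and any training-and-classification routine returning predicted labels can be plugged in.

```python
import numpy as np

def estimate_generalization(X, y, train_classify, ks, n_trials=40,
                            n_test=350, seed=0):
    # For each training-set size k: train on n_trials randomly selected
    # training sets, estimate the generalization error on a disjoint,
    # randomly selected test set, and record the expected and worst-case
    # error over the trials.
    rng = np.random.default_rng(seed)
    results = {}
    for k in ks:
        errs = []
        for _ in range(n_trials):
            idx = rng.permutation(len(X))
            train, test = idx[:k], idx[k:k + n_test]
            pred = train_classify(X[train], y[train], X[test])
            errs.append(float(np.mean(pred != y[test])))
        results[k] = (np.mean(errs), np.max(errs))  # (expected, worst case)
    return results
```

In the experiments above, ks would run from 50 to 790 in steps of 20, with 40 trials per size and 350 test examples per trial.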
Sets of training and testing examples were generated by selecting examples uniformly at random, without replacement, from the entire set of available examples. Examples were selected without replacement to reflect the fact that as real speech has a high degree of variability, it is unlikely that any real set of examples will ever contain two or more identical sets of formants. When training and testing the networks, sets of examples were chosen such that there were equal numbers of examples from each of the two classes. A potential problem exists in using this method for selecting examples in that it does not exactly reflect the process of generating examples that is assumed by the theory, in which inputs xi are selected according to an arbitrary P and outputs are formed as gT(xi). All the experiments were repeated using an alternative selection method in which training and

3This, of course, involves designing the networks having taken into account the characteristics of all the available data, and it is not clear whether the current theory strictly allows us to do this. We do not regard this as a problem in this case, however, as enough is known about the characteristics of speech formants to allow good guesses for the relevant ranges in which to place centers to be made without making any reference to the actual data.
testing examples were chosen uniformly at random with replacement and without forcing the sets to contain equal numbers of examples from each of the two classes. The training and testing sets were still forced to be disjoint. The results of the second set of experiments are given in Appendix A; precisely the same conclusions can be drawn from either set of experiments. A final point should be noted regarding the manner in which examples are selected. Use of a disjoint testing set reflects a standard experimental procedure, whereas the theory allows training and testing sets to have common elements. This suggests that our measured generalization errors might be higher than those obtained if we allowed training and testing sets that are not disjoint and hence correspond more exactly to the theory.

4 Experimental Results
Figures 2, 3, 4, 5, 6, and 7 show the results obtained using a (4,2) discriminator, a (4,4) discriminator, a (4,5) discriminator, a radial basis function network with 15 weights, a radial basis function network with 70 weights, and a radial basis function network with 126 weights, respectively. Perhaps the most important observation that can be made here is that the fully proved bound of equation 2.3 in fact overestimates both the expected and worst case generalization error in these cases by a significant factor. Although this bound is a great improvement on those typically encountered using the earlier theories, it still provides significant overestimates. (Note, however, that we must be cautious in drawing the latter conclusion, for reasons discussed in the next section.) The conjectured bound of equation 2.4 appears to be more realistic. In fact this bound also bounds the worst measured generalization errors in all the experiments conducted, with only a very few exceptions such as, for example, in Figure 11 in Appendix A. Given that, as noted above, our networks cannot in general be expected to perform as well as the Bayes optimal classification algorithm, and also considering some other factors, discussed below, that lead us to expect that our measured generalization errors are worse than would be obtained if we were able to match exactly the conditions required by the theory, we conjecture that if it were possible to match exactly the required conditions then the worst measured generalization errors might also be bounded by equation 2.4 in the instances studied. Note again, however, that we must be cautious in drawing such conclusions for reasons discussed in the next section.

5 Discussion
As a result of the specific assumptions involved, the theory described in Section 2 appears to be quite difficult to interpret in any truly practical
Figure 2: Results obtained for a (4,2) discriminator. The two upper dotted lines show the theoretical bounds of equations 2.3 and 2.4, respectively, assuming VCdim(G) = VCdim(F); for each value of k the bound on the expected generalization error is shown. The upper and lower dashed lines show the best and worst measured generalization errors, and the final, solid line shows the average generalization error. Individual results for specific training sequences are marked as dots.

sense. In particular, the fact that we must assume that the classifier implements a Bayes optimal classification algorithm after training, that the available examples are noise free, and that VCdim(G) is known are all significant shortcomings of the present theory, and should be addressed.

5.1 Optimal Classification Algorithms and Noise-Free Data. The first of these assumptions was mentioned above: it is unlikely in practice that it will be possible to implement exactly the Bayes optimal classification algorithm studied by Haussler et al. (1994). In our experiments we have attempted to solve this problem, and to use an approach more similar to that generally used in practice, by using a standard error minimization technique. As argued above, we consequently expect our measured generalization performances to be worse in general than those that could be obtained using the Bayes optimal classification algorithm. The assumption that data are noise free is more problematic. It is highly unlikely to be a fully valid assumption in practice. Even in the case of the data used in the experiments described herein, which were
Figure 3: Results obtained for a (4,4) discriminator. The plots are as described in Figure 2.
collected with significant care, it is unlikely to be a completely valid assumption (Peterson and Barney 1952; Nowlan 1994). However, a simple intuitive argument regarding this problem is as follows: if we make the assumption when it is not in fact the case we are likely to overfit the data and consequently increase the generalization error obtained. As a result of these two considerations we can therefore expect that the actual generalization errors measured are worse than those that would be obtained using a Bayes optimal classification algorithm with truly noise free data. This is important as the theoretical bounds of equations 2.3 and 2.4 nonetheless apply in all our experiments, and this suggests that these bounds may in fact overestimate expected generalization error to a greater extent than that suggested directly by our experimental results. (Note, however, that as a result of considerations discussed in the next subsection, it is not certain that the results can be interpreted in this manner.) There are also two further reasons for drawing this conclusion. First, as noted above, we force training and testing sets to be disjoint. Second, and again as noted above, our training technique does not guarantee to learn correctly all the examples in each Tk. If at any time this is the case then we obtain a measured generalization error corresponding to a network that learns exactly some subset of Tk.
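The last caveat is easy to demonstrate: when the basis is not powerful enough for the given labeling, the pseudoinverse solution misclassifies part of the training set. A minimal illustration of our own (not from the article):

```python
import numpy as np

# A linear basis {1, x} on the line cannot realize the non-monotone
# labeling below, so the least-squares weights must misclassify at
# least one training example.
x = np.array([0.0, 1.0, 2.0, 3.0])
P = np.column_stack([np.ones_like(x), x])
o = np.array([-1.0, 1.0, -1.0, 1.0])      # non-monotone +/-1 targets
w = np.linalg.pinv(P) @ o
consistent = bool(np.all(np.sign(P @ w) == o))
print(consistent)  # False
```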
Figure 4: Results obtained for a (4,5) discriminator. The plots are as described in Figure 2.
5.2 Knowing the Target Class. The assumption that we have some knowledge of VCdim(G) is possibly the most important shortcoming of the current theory because, as noted above, it is highly unlikely to be a good assumption in practice. [This problem, and the related problem of choosing P, are obviously quite similar to the ubiquitous problem of choosing a prior over weight vectors in the standard Bayesian treatments of learning; see for example Buntine and Weigend (1991).] In fact, the assumption that in practice we will encounter target functions gT drawn from a class G does not itself accurately model the actual situation that we generally encounter when designing a pattern classifier. Although the assumption that gT is some member of a class G is a good one if we wish to consider general learning algorithms that work in a variety of different circumstances, it is more usual that we approach a specific problem; that is, we wish only to learn some specific gT. This is precisely the case in this article, and was discussed above. We might therefore expect that we can assume any G' such that gT ∈ G' and further assume that F = G'. Our experimental results suggest that this may in fact be a good strategy. The assumption that F = G' seems reasonable as the Bayes optimal classification algorithm must itself know G to make a prediction (although its hypothesis is not necessarily a member of G), and the assumption that gT ∈ G' = F also seems reasonable because all our classifiers can learn exactly all the data that are available to us. A
Figure 5: Results obtained for a radial basis function network with 15 weights. The plots are as described in Figure 2.

theoretical result relevant to this problem can be found in Haussler et al. (1990) (Theorem 4.1 of that article). This result upper bounds a particular measure of generalization performance for a consistent classifier using an expression that depends on the VC dimension of F and is independent of the characteristics of G. It is important to note that this problem has two main consequences in the context of this article. The first is that there is some uncertainty regarding how the theoretical bounds should be placed in relation to the experimental results. Our conclusion that these bounds are better than earlier ones is still highly likely to be sound, simply as a result of the degree of improvement observed (see Cohn and Tesauro 1992). Also, this observation serves to accentuate the difficulty of applying this theory to practical classifier design.

5.3 Choosing a Prior. There is a further, rather subtle difficulty in the case where we are only interested in a single, specific gT, and consequently the prior P assigns probability 1 to gT and probability 0 to all other members of G.4 When this is the case, the Bayes optimal classification algorithm has an error probability of 0, and hence will always outperform both the worst-case bounds of equations 2.3 and 2.4 and our

4This was brought to our attention by an anonymous reviewer.
experimental algorithm. Consequently, it is very difficult to draw any firm conclusions from our experimental results. However, this observation does once again serve to illustrate the difficulty of applying the theory in a practical situation. It is possible that this problem could, to some extent, be addressed by arguing that the prior P can be made uniform over more than one function in G, either to provide a very crude approximation to the fact that the real data are likely to be noisy, or to model the fact that there is no strictly "correct" target function that separates different vowel types in the desired manner. However, it is not certain that this allows us to overcome the problem and further research is required here.

Figure 6: Results obtained for a radial basis function network with 70 weights. The plots are as described in Figure 2.

5.4 Further Experiments. Further experiments would now be useful to investigate these bounds further. In particular, experiments using a larger set of data would be interesting, as well as useful in the sense that they would allow generalization errors to be calculated using a set of more than 350 test examples. Unfortunately, the requirement that all training examples are learned exactly makes experiments using large sets of real data difficult. It would also be interesting, in the case of radial basis function networks with fixed centers, to examine the effect of using a different set of randomly chosen centers each time a network is trained,
rather than using the same set for an entire experiment. We have not examined this approach as the time required to perform an experiment in this case is likely to become prohibitive. Finally, it would be interesting to investigate the extent to which the assumptions of the theory can be violated before the bounds become invalid. For example, how good are the bounds for cases where the training set cannot be learned perfectly?

Figure 7: Results obtained for a radial basis function network with 126 weights. The plots are as described in Figure 2.

6 Conclusion
In this article we have addressed the question of whether some recent bounds on the sample complexity of the task of training a pattern classifier such that it performs valid generalization can be used as a practical design tool. The bounds considered, although they are probably the most "practical" available at present within the general framework of computational learning theory, require us to make several assumptions that will not in general be accurate in practice. In particular, it is necessary to assume that our classifier implements a Bayes-optimal classification algorithm, that all data are noise free, and that the VC dimension of the class G of target functions is known. The last of these assumptions forms at present the most important shortcoming of this theory. The need to make these assumptions makes it rather difficult to fully assess the bounds or
to apply them in the design of practical pattern classifiers. At present, the only conclusion that can be drawn regarding the use of these bounds in practice is that they appear to provide an approximate, probably pessimistic guide to expected generalization error, and appear therefore to be applicable in certain circumstances as an initial aid to design. In the experiments performed the bounds were also found to be valid for worst case generalization error in most cases. However, a detailed consideration of the theory suggests that it may not be possible to draw any firm conclusions from the experimental results. This conclusion is a rather pessimistic one. However, we note, finally, that these bounds are still rather more practically applicable, although unfortunately less powerful, than earlier bounds obtained in computational learning theory, and that they therefore provide an excellent starting point for further research.
Appendix A: Experimental Results Obtained Using the Alternative Example Selection Technique

Figures 8 to 13 are exactly analogous to Figures 2 to 7, the only difference being that in producing these figures the alternative method for selecting examples, described in Section 3, was used. The centers used by the radial basis function networks were identical to those used in the experiments described above.
Acknowledgments

Thanks are due to Martin Anthony for his comments on the initial draft of this article, and for many useful discussions. Thanks are also due to the reviewers for various helpful comments. This research was supported by SERC Research Grant GR/H16759.
References

Anthony, M., and Biggs, N. 1992. Computational Learning Theory. Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, Cambridge.
Anthony, M., and Holden, S. B. 1993. On the power of polynomial discriminators and radial basis function networks. Proc. Sixth Annu. ACM Conf. Comput. Learning Theory 158-164.
Anthony, M., and Holden, S. B. 1994. Quantifying generalization in linearly weighted neural networks. Complex Syst. 8, 91-114.
Bartlett, P. L. 1992. Lower Bounds on the Vapnik-Chervonenkis Dimension of Multi-Layer Threshold Nets. Tech. Rep. IML92/3, University of Queensland, Department of Electrical Engineering, Intelligent Machines Laboratory.
Figure 8: Results obtained for a (4,2) discriminator. The plots are as described in Figure 2.
"
100
200
300
4" so0 Size of training sequence
600
700
800
Figure 9: Results obtained for a (4,4) discriminator. The plots are as described in Figure 2.
Figure 10: Results obtained for a (4,5) discriminator. The plots are as described in Figure 2.
Figure 11: Results obtained for a radial basis function network with 15 weights. The plots are as described in Figure 2.
Figure 12: Results obtained for a radial basis function network with 70 weights. The plots are as described in Figure 2.
Figure 13: Results obtained for a radial basis function network with 126 weights. The plots are as described in Figure 2.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Machinery 36(4), 929-965.
Bottou, L., Cortes, C., and Vapnik, V. 1994. On the effective VC dimension. Unpublished manuscript.
Buntine, W. L., and Weigend, A. S. 1991. Bayesian back-propagation. Complex Syst. 5, 603-643.
Cohn, D., and Tesauro, G. 1992. How tight are the Vapnik-Chervonenkis bounds? Neural Comp. 4(2), 249-269.
Cover, T. M. 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electronic Computers EC-14, 326-334.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
Gish, H. 1990. A probabilistic approach to the understanding and training of neural network classifiers. Proc. IEEE Int. Conf. Acoustics, Speech Signal Process. 1361-1364.
Golub, G. H., and Van Loan, C. F. 1989. Matrix Computations, 2nd ed. Johns Hopkins, Baltimore.
Guyon, I., Vapnik, V., Boser, B., Bottou, L., and Solla, S. A. 1992. Structural risk minimization for character recognition. In Advances in Neural Information Processing Systems, Vol. 4, pp. 471-479. Morgan Kaufmann, San Mateo, CA.
Haussler, D., Littlestone, N., and Warmuth, M. K. 1990. Predicting {0,1}-Functions on Randomly Drawn Points. Tech. Rep. UCSC-CRL-90-54, Computer Research Laboratory, Applied Sciences Building, University of California, Santa Cruz, Santa Cruz, CA.
Haussler, D., Kearns, M., and Schapire, R. 1994. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learn. 14, 83-113.
Holden, S. B. 1993. On the theory of generalization and self-structuring in linearly weighted connectionist networks. Ph.D. thesis, Cambridge University Engineering Department.
Cambridge University Engineering Department Report number CUED/F-INFENG/TR.161.
Holden, S. B. 1994. Neural networks and the VC dimension. Proceedings of the IMA International Conference on Mathematics in Signal Processing, pp. 73-84. Oxford University Press, Oxford.
Holden, S. B., and Rayner, P. J. W. 1995. Generalization and PAC learning: Some new results for the class of generalized single layer networks. IEEE Trans. Neural Networks 6(2), 368-380.
Maass, W. 1993. Bounds for the computational power and learning complexity of analog neural nets. Proc. Twenty-Fifth Annu. ACM Symp. Theory Computing 335-344.
Natarajan, B. K. 1991. Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Mateo, CA.
Nilsson, N. J. 1965. Learning Machines: Foundations of Trainable Pattern-Classifying Systems. McGraw-Hill, New York.
Nowlan, S. J. 1994. Private communication.
Peterson, G. E., and Barney, H. L. 1952. Control methods used in a study of the vowels. J. Acoust. Soc. Am. 24, 175-184.
Shawe-Taylor, J., and Anthony, M. 1991. Sample sizes for multiple-output threshold networks. Network 2, 107-117.
Sontag, E. D. 1992. Feedforward nets for interpolation and classification. J. Computer Syst. Sci. 45(1), 20-48.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27(11), 1134-1142.
Wan, E. A. 1990. Neural network classification: A Bayesian interpretation. IEEE Trans. Neural Networks 1(4), 303-305.
Watkin, T. L. H., Rau, A., and Biehl, M. 1993. The statistical mechanics of learning a rule. Rev. Mod. Phys. 65(2), 499-556.
Wenocur, R. S., and Dudley, R. M. 1981. Some special Vapnik-Chervonenkis classes. Discrete Math. 33, 313-318.
Received March 4, 1994; accepted September 27, 1994.
Communicated by Scott Fahlman
The Target Switch Algorithm: A Constructive Learning Procedure for Feed-Forward Neural Networks Colin Campbell Advanced Computing Research Centre, Bristol University, Bristol BS8 1TR, United Kingdom
C. Perez Vicente Facultat de Fisica, Dept. de Fisica Fonamental, Universitat de Barcelona, Diagonal 647, 08028 Barcelona, Spain
We propose an efficient procedure for constructing and training a feedforward neural network. The network can perform binary classification for binary or analogue input data. We show that the procedure can also be used to construct feedforward neural networks with binary-valued weights. Neural networks with binary-valued weights are potentially straightforward to implement using microelectronic or optical devices and they can also exhibit good generalization. 1 Introduction
A number of authors have proposed constructive algorithms that generate the architecture of a feedforward neural network in addition to determining the weights required. These algorithms can generate cascade architectures (Fahlman and Lebiere 1990), tree-structured architectures (Frean 1990a; Mezard and Nadal 1989), tower architectures (Gallant 1990), and networks with a single hidden layer (Marchand et al. 1990; Zollner et al. 1992) or two hidden layers (Martinez and Estève 1992). An efficient constructive algorithm should generate a neural network exhibiting good generalization. The generalization ability of a neural network is improved if the number of free parameters in the network is minimized. Thus an efficient constructive algorithm should reduce the number of weights in the network by generating a minimal number of hidden nodes. This leads to better generalization for most pattern distributions (Baum and Haussler 1989). We can also reduce the number of free parameters in the network if the weights are constrained to a restricted number of values. Along these lines Nowlan and Hinton (1992) have shown that enforcing weight-sharing dramatically improves generalization ability. Other factors bear on the generalization ability of the network constructed. For example, in binary classification tasks it is important to learn both classes of patterns symmetrically: if one class of

Neural Computation 7, 1245-1264 (1995) © 1995 Massachusetts Institute of Technology
patterns is embedded differently from the other class then the network can exhibit poor generalization (Zollner et al. 1992). Thus generalization ability is dependent on properties of the learning process in addition to the number of free parameters available. In this paper we propose an algorithm for generating a feedforward neural network for binary classification tasks. Our aim has been to enhance generalization ability by reducing the number of degrees of freedom in the network and by treating both target values symmetrically. Thus in Section 2 we use an efficient method (called target switching) for minimizing the number of hidden nodes while in Section 3 we show that it is possible to generate solutions with binary-valued weights. Apart from being easier to implement in hardware, binary-weight networks have good generalization abilities, at least for Boolean problems. We illustrate this improvement in generalization using the Shift Detection and Mirror Symmetry problems in Section 4. By contrast limited weight resolution has been difficult to implement with other constructive algorithms such as Cascade Correlation (Hoehfeld and Fahlman 1992) or the Upstart algorithm (Frean 1990a), where large weight values are needed to correct wrongly on or wrongly off errors.

2 The Target Switch Algorithm
We will consider a neural network with N input nodes (labeled by index j) and one output node. Let us suppose we wish to map inputs ξ^μ onto a set of targets η^μ, where μ is the pattern index and η^μ = ±1 (though ξ^μ may have analogue or binary components). Weights leading from input j to a hidden node i will be denoted Wij. We will use a ±1 updating function for the hidden and output nodes. Thus if S is an input then the corresponding internal representation Si on the hidden nodes would be
Si = sign(Σj Wij Sj − Ti)

where Ti is the threshold at hidden node i. We will define the sign function as having an output of +1 if its argument is greater than or equal to zero and −1 otherwise. For binary classification tasks the patterns belong to two sets: patterns with target η^μ = +1 (the set P+) and those with target η^μ = −1 (the set P−). For binary inputs it is always possible to find a set of weights and thresholds that will correctly store all the patterns belonging to one of these sets and at least one member belonging to the other set (Gallant 1986b; Frean 1990b; Marchand et al. 1990). For example, suppose pattern μ = 1 has target +1. If we use weights Wij = ξ_j^1 and a threshold Ti = N then Si = sign(Σj Wij Sj − Ti) gives an output Si = +1 if S is equal to ξ^1 and −1 otherwise. Usually it is possible to exceed this minimal solution
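The minimal construction just described can be verified directly. This sketch is our own (not the authors' code); the weight choice Wij = ξ_j^1 with threshold Ti = N is inferred from the surrounding text, and sign(0) is taken as +1 as defined above:

```python
import numpy as np

def sgn(x):
    # Sign function as defined in the text: +1 for arguments >= 0, else -1.
    return np.where(np.asarray(x) >= 0, 1, -1)

def store_pattern(xi1):
    # Hidden unit with weights W_j = xi1_j and threshold T = N. The weighted
    # sum equals N only when the +/-1 input matches xi1 exactly, so the unit
    # outputs +1 for xi1 and -1 for every other +/-1 input.
    N = len(xi1)
    return lambda s: int(sgn(np.dot(xi1, s) - N))

xi1 = np.array([1, -1, 1, -1])
unit = store_pattern(xi1)
print(unit(xi1))                       # 1
print(unit(np.array([1, 1, 1, -1])))   # -1
```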
Table 2: Generalization Performance on the Shift Detection Problem.a

Method                           Test % correct
Backpropagation                  67.3 ± 5.7
Cross-validation                 83.5 ± 5.1
Weight-decay (Weigend et al.)    89.8 ± 3.0
Soft-share (5 components)        95.6 ± 2.7
Soft-share (10 components)       97.1 ± 2.1

a Percentage of new patterns correctly classified using a training set of 100 patterns and a validation set of 1000 patterns.
tions decreased the number of hidden nodes further but, surprisingly, we did not find evidence of any consequent improvement in generalization ability. Nowlan and Hinton (1992) have also used the Shift Detection problem to study the performance of a neural network with soft weight-sharing. They used a similar 20-input-node network with 10 hidden nodes; 100 training patterns were used in addition to a set of 1000 validation examples to tune the parameters in the model. Apart from standard backpropagation (without a validation study) and cross-validation, these authors also compare weight-sharing with the weight-elimination method proposed by Weigend et al. (1991). Their results are summarized in Table 2. In addition to their results we also used a nearest-neighbor classifier with the same dataset and obtained a generalization performance of 75.1 ± 3.1% for comparison. Despite the absence of a validation set our algorithm outperforms cross-validation when the Minover rule is used. Both the Hebb-like rule and the binary Hebb-like rule also compare well with standard backpropagation. The best performance (Minover algorithm with binary weights) also compares favorably with soft weight-sharing since we did
C. Campbell and C. Perez Vicente
1258
Table 3: Generalization Using the Target Switch Algorithm on the Mirror Symmetry Problem with 30 Inputs.a

Method                      p = 100        p = 200
Minover                     68.4 ± 3.2%    83.1 ± 3.1%
Minover (binary weights)    80.4 ± 4.8%    92.6 ± 3.0%

a p is the number of training patterns. The values represent an average over 100 networks with 400 test examples per network.
not use the validation set in obtaining the 91.4% generalization performance. The Target Switch algorithm also has the advantages of determining the number of hidden nodes required and of guaranteed convergence [gradient descent methods have the disadvantage that spurious local minima proliferate in the presence of weight-sharing (Fontanari and Koberle 1990)]. 4.1.2 Mirror Symmetry Problem. For the Mirror Symmetry problem the output of the network is +1 if the input bit string is exactly symmetrical about its center, otherwise the output is −1. This problem is known to have two exact solutions: one with binary weights and N hidden nodes and a second using real weights and two hidden nodes (Minsky and Papert 1988). For randomly constructed inputs the output will be −1 with high probability. Consequently the target value ±1 was selected with 50% probability, the first half of the input bit string was randomly constructed from components ±1 (both selected with 50% probability), and the second half of the string was symmetrical or random depending on the target value determined. Generalization performance was evaluated using a test set drawn from the same pattern distribution. The results for p training patterns are given in Table 3. We can compare this performance with the neural decision lists of Marchand and Golea (1993), who report generalization rates of 69.7 ± 7.5% (p = 100) and 80.1 ± 3.5% (p = 200) for a 30-input-node network performing the same problem. For 100 training patterns an average of 5.2 dichotomy nodes was generated for real weights and 13.2 for binary weights (for p = 200 these numbers were 11.4 and 17.3, respectively). In both these examples generalization performance is clearly improved by using binary weights in place of real weights. Using the Target Switch algorithm we have observed similar improvements for other Boolean problems such as "2-or-more clumps" (Denker et al. 1987), motion detection (distinguishing a shift from no shift), etc. 4.2 Analogue Input Data. For analogue input data and real weights the Target Switch algorithm will converge if the input vectors of the training set are of the same length. In this case we can enforce the minimal
Target Switch Algorithm
1259
Table 4: Generalization Performance Using the Backpropagation Algorithm for the Aspect-Angle Independent Classification of Sonar Returns Reported by Gorman and Sejnowski (1988).

Number of hidden nodes    Generalization performance (%)
0                         77.1 ± 8.1
2                         81.9 ± 6.2
3                         82.0 ± 7.3
6                         83.5 ± 5.6
12                        84.7 ± 5.7
24                        84.5 ± 5.7
solution storing one member of one target set and all the members of the other target set. Geometrically this would correspond to a tangential hyperplane isolating one target value on the surface of a hypersphere. In general the input vectors can be of arbitrary length, so this construction is not always possible. However, for real weights, convergence can still be guaranteed if we note that, while a particular distribution may exclude either a ⊕- or a ⊖-dichotomy, the construction of at least one of the two is always possible (since there will be one vector, or a set of vectors, of maximal length). Thus, in the worst case, we could store one member of P+ and all the members of P−, or one member of P− and all the members of P+. This minimal solution stores one pattern per pair of hidden nodes (the node storing all patterns of one target value and none of the other can be replaced by a node clamped at that value, or alternatively by a threshold at a cascade or tree node). As an example we have used the algorithm on the sonar problem of Gorman and Sejnowski (1988), which involves classification using analogue input data. For the sonar problem the task is to classify sonar returns from a roughly cylindrical rock or a metal cylinder. For the aspect-angle independent experiment we trained the network using the Minover algorithm with a maximum of 200 iterations through the training set. The 208 examples (104 of each class) were divided into 13 disjoint sets, each with 16 examples. Using 12 of these sets as training data and the thirteenth as the test set, we cycled through the data using each set once as a test set. For each of these 13 sets we also averaged over 30 initial weight configurations for the Minover algorithm. Averaging these results we obtained 85.0 ± 7.2% generalization on the test sets with an average of 9.2 hidden nodes generated.
This compares favorably with previous results for backpropagation reported by Gorman and Sejnowski (1988) and reproduced in Table 4 (these authors also report a generalization performance of 82.7% for a nearest-neighbor classifier on the same dataset).
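The leave-one-set-out evaluation described above can be sketched as follows; `train_fn` and `eval_fn` are hypothetical placeholders for Target Switch training and test-set scoring, not functions from the paper:

```python
from statistics import mean

def thirteen_fold(examples, train_fn, eval_fn):
    """Cycle through 13 disjoint sets of 16 examples (208 total):
    each set serves once as the test set while the other 12 sets
    (192 examples) are used for training; return the mean score."""
    folds = [examples[i * 16:(i + 1) * 16] for i in range(13)]
    scores = []
    for k in range(13):
        train = [x for i, f in enumerate(folds) if i != k for x in f]
        model = train_fn(train)              # e.g. Target Switch + Minover
        scores.append(eval_fn(model, folds[k]))
    return mean(scores)

# toy check with placeholder train/eval functions
acc = thirteen_fold(list(range(208)), lambda tr: len(tr),
                    lambda m, te: m / 192.0)
assert abs(acc - 1.0) < 1e-9   # each training split has 192 examples
```

In the experiment reported above this outer loop is additionally averaged over 30 initial weight configurations per fold.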
Table 5: Generalization Performance Using the Glass Identification Problem.a

Length of scale, T    Generalization performance (%)    Average number of hidden nodes
20                    77.4 ± 6.6%                       3.8
30                    80.5 ± 5.4%                       2.6
40                    80.7 ± 5.3%                       2.6
50                    81.3 ± 5.3%                       2.3

a Each real input is converted into a bit string of length T using a thermometer code.
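The thermometer code used for Table 5 can be sketched as follows. This is one reading of the scheme in the text; the exact treatment of the boundary cases (x = 0 and x = 1) is our assumption:

```python
def thermometer(x, T):
    """Thermometer code: x in [0, 1] becomes a string of T ±1 bits,
    with bits 0 .. floor(x*T) set to +1 (capped at T-1) and the
    remainder set to -1."""
    k = min(int(x * T), T - 1)
    return [1 if i <= k else -1 for i in range(T)]

assert thermometer(0.5, 4) == [1, 1, 1, -1]
assert thermometer(1.0, 4) == [1, 1, 1, 1]
```

Each of the 9 real-valued glass attributes is encoded this way, so the Boolean network sees 9T ±1 inputs.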
As an alternative approach to analogue input data we can also reduce the mapping to a Boolean problem by using an analogue-to-digital conversion such as thermometer coding. Using thermometer coding the inputs are scaled to lie between 0 and 1, with a real number x converted to a string of T bits such that bits 0 to xT (rounded down) are set to 1 and the remainder set to −1. For example, we have used this approach with the glass identification problem from the UCI Database Repository (Murphy and Aha 1994). This problem involves binary classification (float-processed or non-float-processed glass) based on 9 real-valued attributes. There are 163 examples and we used two-thirds as training data and one-third as test data. Twenty trials were attempted with random allocation of examples between the training and test data. The network was trained using the Minover algorithm with 200 iterations through the training set. For real weights generalization improved with the length of the scale T (see Table 5). By using thermometer coding and reducing the mapping to a Boolean problem it is also possible to find a solution with binary weights. Using clipped Minover to obtain the binary weights we found generalization passed through a peak as T was increased, the maximum value being 78.1 ± 5.1% with an average of 10.1 hidden nodes at T = 40 [step (6) in the dichotomy procedure was found to marginally improve generalization at a cost of considerably increased training time]. For the same dataset and partition between training and test data these generalization results may be compared with 74.3 ± 6.6% for C4 (Quinlan 1986), an efficient tree-induction algorithm capable of implementing complex decision rules, and 76.4 ± 6.7% for neural decision lists (Marchand and Golea 1993). 4.3 Noisy Input Data. In some applications the training data can be corrupted by noise.
A constructive algorithm has the potential disadvantage that it could converge on a perfect solution, overfitting the data and giving poor generalization. To rectify this problem we perform a validation study followed by pruning to remove any redundant di-
chotomy nodes. We record the number of patterns stored by each pair of dichotomy nodes during the learning process. We then successively remove the pairs of dichotomy nodes that store the fewest patterns (these tend to capture the outliers in the noisy data), recording generalization performance against the validation set. After finding the peak in generalization performance any redundant nodes are removed. As an illustration we trained a network using examples from the majority rule (the target is +1 if the number of 1s in the input string is greater than the number of −1s, and −1 otherwise). The network had 20 input nodes and real weights trained using the Minover rule. One hundred training patterns were used, with training noise introduced by randomly flipping 20% of the input bits. A validation set of 1000 examples was used. For a sample of 100 such networks the validation study reduced the average number of hidden nodes from 16.6 to 10.4 and improved generalization by 2.3%. 5 Conclusion
In this paper we have introduced a general procedure for constructing a feedforward neural network for binary classification tasks using analogue or discrete weights. The network can be constructed quickly with short training times [for example, by using the Hebb-like rule or by discarding step (6) in the dichotomy procedure]. Alternatively, if generalization is to be maximized, a longer training period is required [for example, by using all the steps in the dichotomy procedure with the Minover algorithm in step (1)]. In the latter case generalization performance compares favorably with a number of alternative algorithms. For Boolean problems the extension to binary-weight learning (Section 3) can give improved generalization performance (compared to real weights). By using an analogue-to-digital conversion (such as thermometer coding) it is also possible to handle analogue input data with binary weights. Within this approach there is scope for a number of variants that would be worthy of further investigation. For example, weight elimination can enhance the generalization performance of a neural network for certain problems. Some authors have considered incorporating weight elimination into Hebb-like rules (Kurten 1992) and algorithms such as Minover (Kuhlmann et al. 1992), and it would be interesting to investigate these alternatives. Clipped Hebb and clipped Minover are not the most efficient learning rules for binary-weight learning and it would also be worth trying other binary-weight learning procedures in step (1), e.g., the Harmonic Rule, Directed Drift (Venkatesh 1991, 1993), gradient descent procedures (Perez 1990; Perez et al. 1991, 1992), Tabu search (Amaldi and Nicolis 1989), and genetic algorithms (Kohler 1990). It would also be worth investigating alternative heuristics for obtaining dichotomies of the pattern sets. Faster heuristics may involve switching a number of target signs simultaneously rather than one at a time.
One of the most interesting points to emerge from our investigation is that learning with binary weights can be readily implemented using constructive algorithms. Furthermore, at least for Boolean problems, binary weights have important advantages in terms of generalization performance and implementation simplicity. In general the amount of information carried by binary-valued weights is less, hence more hidden nodes are typically required. However, the increase in the number of hidden nodes is not that large for typical pattern distributions. This observation agrees with theoretical estimates suggesting neural networks with binary-valued weights have comparatively high storage capacities (Krauth and Opper 1989; Krauth and Mezard 1989; Barkai and Kanter 1991). In fact this increase in the number of hidden nodes can be viewed as an advantage of these models, since the computationally intensive weight/input-vector multiplication has been effectively reduced by introducing more processors (i.e., hidden nodes).
Acknowledgment We gratefully acknowledge support from the Acciones Integradas programme (UK/Spain) Grant 83 (1993/94). Note: The programs used in this study are available by anonymous ftp from ftp.cs.bristol.ac.uk (cf. switch.doc).
References Amaldi, E., and Nicolis, S. 1989. Stability-capacity diagram of a neural network with Ising bonds. J. Phys. (France) 50, 2333-2345. Anlauf, J. K., and Biehl, M. 1990. Properties of an adaptive perceptron algorithm. In Parallel Processing in Neural Systems and Computers, R. Eckmiller, G. Hartmann, and G. Hauske, eds., pp. 153-156. North-Holland, Amsterdam. Barkai, E., and Kanter, I. 1991. Storage capacity of a multilayer neural network with binary weights. Europhys. Lett. 14, 107-112. Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160. Baum, E. B., and Lyuu, Y.-D. 1991. The transition to perfect generalization in perceptrons. Neural Comp. 3, 386-401. Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., and Jackel, L. 1987. Automatic learning, rule extraction and generalization. Complex Syst. 1, 877-922. Fahlman, S., and Lebiere, C. 1990. The cascade correlation architecture. In Advances in Neural Information Processing Systems, D. Touretzky, ed., Vol. 2, pp. 524-532. Morgan Kaufmann, San Mateo, CA. Fontanari, J. F., and Koberle, R. 1990. Landscape statistics of the binary perceptron. J. Phys. (France) 51, 1403-1413.
Frean, M. 1990a. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Comp. 2, 198-209. Frean, M. 1990b. Small nets and short paths: Optimising neural computation. Ph.D. thesis, University of Edinburgh, Center for Cognitive Science. Gallant, S. I. 1986a. Optimal linear discriminants. IEEE Proc. 8th Conf. Pattern Recognition 849-852. Gallant, S. I. 1986b. Three constructive algorithms for network learning. Eighth Annu. Conf. Cog. Sci. Soc., Amherst, MA, 652-660. Gallant, S. I. 1990. Perceptron-based learning algorithms. IEEE Trans. Neural Networks 1, 179-191. Gorman, R. P., and Sejnowski, T. J. 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75-89. Gyorgyi, G. 1990. First-order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A41, 7097-7100. Hoehfeld, M., and Fahlman, S. E. 1992. Learning with limited numerical precision using the cascade-correlation algorithm. IEEE Trans. Neural Networks 3, 602-611. Kohler, H. M. 1990. Adaptive genetic algorithm for the binary perceptron problem. J. Phys. A23, L1271-L1276. Krauth, W., and Mezard, M. 1987. Learning algorithms with optimal stability in neural networks. J. Phys. A20, L745-L752. Krauth, W., and Mezard, M. 1989. Storage capacity of memory networks with binary couplings. J. Phys. (France) 50, 3057-3066. Krauth, W., and Opper, M. 1989. Critical storage capacity of the J = ±1 neural network. J. Phys. A22, L519-L586. Kuhlmann, P., Garces, R., and Eissfeller, H. 1992. A dilution algorithm for neural networks. J. Phys. A25, L593-L598. Kurten, K. E. 1992. Adaptive architectures for Hebbian network models. J. Phys. (France) 2, 615-624. Marchand, M., and Golea, M. 1993. On learning simple neural concepts: From halfspace intersections to neural decision lists. Network 4, 67-85. Marchand, M., Golea, M., and Rujan, P. 1990. A convergence theorem for sequential learning in two-layer perceptrons. Europhys. Lett. 11, 487-492. Martinez, D., and Esteve, D. 1992. The offset algorithm: Building and learning method for multilayer neural networks. Europhys. Lett. 18, 95-100. Mezard, M., and Nadal, J.-P. 1989. Learning in feedforward layered networks: The tiling algorithm. J. Phys. A22, 2191-2203. Minsky, M., and Papert, S. 1988. Perceptrons, 2nd ed., p. 252. MIT Press, Cambridge, MA. Murphy, P. M., and Aha, D. W. 1994. UCI Repository of Machine Learning Databases. University of California, Department of Information and Computer Science, Irvine, CA. Nowlan, S. J., and Hinton, G. E. 1992. Simplifying neural networks by soft weight-sharing. Neural Comp. 4, 473-493. Perez Vicente, C. J. 1990. A learning algorithm for binary synapses. Lecture Notes Phys. 368, 167-174. Perez Vicente, C. J., Carrabina, J., Garrido, F., and Valderrama, E. 1991. Learning
algorithm for feed-forward neural networks with discrete synapses. Lecture Notes Comp. Sci. 540, 144-152. Perez, C. J., Carrabina, J., and Valderrama, E. 1992. Study of a learning algorithm for neural networks with discrete synaptic couplings. Network 3, 165-176. Quinlan, J. R. 1986. Induction of decision trees. Machine Learn. 1, 81. Saad, D., and Marom, E. 1990. Training feed forward nets with binary weights via a modified CHIR algorithm. Complex Syst. 4, 573-586. Seung, H. S., Sompolinsky, H., and Tishby, N. 1992. Statistical mechanics of learning from examples. Phys. Rev. A45, 6056-6091. Venkatesh, S. S. 1991. On learning binary weights for majority functions. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory, L. G. Valiant and M. K. Warmuth, eds., pp. 257-266. Morgan Kaufmann, San Mateo, CA. Venkatesh, S. S. 1993. Directed drift: A new linear threshold algorithm for learning binary weights on-line. J. Comp. Syst. Sci. 46, 198-217. Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1991. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., Vol. 3, pp. 875-882. Morgan Kaufmann, San Mateo, CA. Zollner, R., Schmitz, H. J., Wunsch, F., and Krey, U. 1992. Fast generating algorithm for a general three-layer perceptron. Neural Networks 5, 771-777.
Received February 2, 1994; accepted January 26, 1995
Communicated by John Platt
LeRec: A NN/HMM Hybrid for On-Line Handwriting Recognition Yoshua Bengio* Yann LeCun Craig Nohl Chris Burges AT&T Bell Laboratories, Rm 4G332, 101 Crawfords Corner Road, Holmdel, NJ 07733 USA We introduce a new approach for on-line recognition of handwritten words written in unconstrained mixed style. The preprocessor performs a word-level normalization by fitting a model of the word structure using the EM algorithm. Words are then coded into low-resolution "annotated images" where each pixel contains information about trajectory direction and curvature. The recognizer is a convolution network that can be spatially replicated. From the network output, a hidden Markov model produces word scores. The entire system is globally trained to minimize word-level errors. 1 Introduction Natural handwriting is often a mixture of different "styles": lower case printed, upper case, and cursive. A reliable recognizer for such handwriting would greatly improve interaction with pen-based devices, but its implementation presents new technical challenges. Characters taken in isolation can be very ambiguous, but considerable information is available from the context of the whole word. We propose a word recognition system for pen-based devices based on four main modules: a preprocessor that normalizes a word, or word group, by fitting a geometric model to the word structure using the EM algorithm; a module that produces an "annotated image" from the normalized pen trajectory; a replicated convolutional neural network that spots and recognizes characters; and a hidden Markov model (HMM) that interprets the network's output by taking word-level constraints into account. The network and the HMM are jointly trained to minimize an error measure defined at the word level. Many on-line handwriting recognizers exploit the sequential nature of pen trajectories by representing the input in the time domain. While *Also, Department IRO, Universite de Montreal, C.P. 6128, Succ.
Centre-Ville, Montreal, Qc, H3C 3J7, Canada. Neural Computation 7, 1289-1303 (1995) © 1995 Massachusetts Institute of Technology
Yoshua Bengio et al.
1290
these representations are compact and computationally advantageous, they tend to be sensitive to stroke order, writing speed, and other irrelevant parameters. In addition, global geometric features, such as whether a stroke crosses another stroke drawn at a different time, are not readily available in temporal representations. To avoid this problem we designed a representation, called AMAP, that preserves the pictorial nature of the handwriting. In addition to recognizing characters, the system must also correctly segment the characters within the words. To choose the optimal segmentation and take advantage of contextual and linguistic structure, the neural network is combined with a graph-based postprocessor, such as an HMM. One approach, which we call INSEG, is to recognize a large number of heuristically segmented candidate characters and combine them optimally with a postprocessor (Burges et al. 1992; Schenkel et al. 1993). Another approach, which we call OUTSEG, is to delay all segmentation decisions until after the recognition, as is often done in speech recognition. An OUTSEG recognizer must accept entire words as input and produce a sequence of scores for each character at each location on the input (Matan et al. 1992; Keeler et al. 1991; Schenkel et al. 1993). Since the word normalization cannot be done perfectly, the recognizer must be robust with respect to relatively large distortions, size variations, and translations. An elastic word model, e.g., an HMM, can extract word candidates from the network output. The HMM models the long-range sequential structure while the neural network spots and classifies characters using local spatial structure. 2 Word Normalization
Input normalization reduces intracharacter variability, simplifying character recognition. We propose a new word normalization scheme based on fitting a geometric model of the word structure. Our model has four "flexible" lines representing, respectively, the ascenders line, the core line, the base line, and the descenders line (see Fig. 1). Points (x, y) on the lines are parameterized as follows:
y = f_i(x) = k(x − x0)² + s(x − x0) + y0i    (2.1)
where k controls curvature, s is the skew, and (x0, y0) is a translation vector. The parameters k, s, and x0 are shared among all four curves, whereas each curve has its own vertical translation parameter y0i. The free parameters of the fit are actually k, s, a (ascender y0 minus baseline y0), b (baseline y0), c (core line y0 minus baseline y0), and d (baseline y0 minus descender y0), as shown in Figure 1. x0 is determined by taking the average abscissa of vertical extrema points. The lines of the model are fitted to the extrema of vertical displacement: the upper two lines to the vertical maxima of the pen trajectory, and the lower two to the minima.
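The curve family of equation 2.1 can be sketched as follows; the shared parameter values and per-curve offsets below are illustrative, not fitted:

```python
def flexible_line(x, k, s, x0, y0i):
    """One of the four model curves (equation 2.1):
    y = k*(x - x0)**2 + s*(x - x0) + y0i, where k is the curvature,
    s the skew, x0 the shared horizontal offset, and y0i the curve's
    own vertical translation."""
    return k * (x - x0) ** 2 + s * (x - x0) + y0i

# The four curves share k, s, x0 and differ only in y0i; the offsets
# here are made-up values for ascender, core, base and descender lines.
offsets = {"ascender": 3.0, "core": 2.0, "base": 1.0, "descender": 0.0}
model = {name: (lambda x, y0=y0: flexible_line(x, k=0.01, s=0.1,
                                               x0=5.0, y0i=y0))
         for name, y0 in offsets.items()}

assert model["base"](5.0) == 1.0   # at x = x0 each curve equals its y0i
```

In the actual fit, a, b, c, and d parameterize these four offsets relative to the baseline, as described above.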
LeRec: Hybrid for On-Line Handwriting Recognition
1291
Figure 1: Word normalization model: ascenders and core curves fit y-maxima whereas descenders and baseline curves fit y-minima. There are six parameters: a (ascenders curve height relative to baseline), b (baseline absolute vertical position), c (core line position), d (descenders curve position), k (curvature), s (angle). The line parameters θ = {a, b, c, d, k, s} are tuned to maximize the joint probability of observed points and parameter values:

θ* = argmax_θ [log P(X | θ) + log P(θ)]    (2.2)
P(X | θ) is modeled by a mixture of gaussians (one gaussian per curve), whose means are the functions of x given in equation 2.1:

P(X | θ) = Π_j Σ_k w_k N(y_j; f_k(x_j), σ)    (2.3)

where N(y; μ, σ) is the likelihood of y under a univariate Normal model (mean μ, standard deviation σ). The w_k are the mixture parameters, some of which are set to 0 in order to constrain the upper (lower) points to be fitted to the upper (lower) curves. They are computed a priori using measured frequencies of associations of extrema to curves on a large set of words. Priors P(θ) on the parameters (modeled here with Normal
distributions) are important to prevent the collapse of the curves. They can be used to incorporate a priori information about the word geometry, such as the expected position of the baseline, or of the height of the word. These priors are also used as initial values in the EM optimization of the fit function. The prior distribution for each parameter (independently) is a Normal, with the standard deviation controlling the strength of the prior. In our experiments, these priors were set using some heuristics applied to the input data itself. The priors for the curvature (k) and angle (s) are set to 0, while the ink points themselves are preprocessed to attempt to remove the overall angle of the word (looking for a near horizontal projection with minimum entropy). To compute the prior for the baseline, the mean and standard deviation of y-position are computed (after rough angle removal). The baseline (b) prior is taken to be one standard deviation below the mean. The core line (c) prior is taken to be two standard deviations above the baseline. The ascender (descender) line prior is taken to be between 1.8 (-0.9) and 3.0 (-2.0) times the core height prior, depending on the maximum (minimum) vertical position in the word. The discrete random variables that associate each point with one of the curves are taken as hidden variables of the EM algorithm. One can thus derive an auxiliary function that can be analytically (and cheaply) solved for the six free parameters θ. Convergence of the EM algorithm was typically obtained within two to four iterations (of maximization of the auxiliary function).
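The E-step of this fit, the posterior association of each extremum with one of the four curves, can be sketched as follows. This is a hedged illustration of the mixture model in equation 2.3, with a single shared σ assumed for simplicity:

```python
import math

def normal_pdf(y, mu, sigma):
    """Univariate Normal density N(y; mu, sigma)."""
    return (math.exp(-0.5 * ((y - mu) / sigma) ** 2)
            / (sigma * math.sqrt(2 * math.pi)))

def responsibilities(y, means, weights, sigma=1.0):
    """Posterior probability that an extremum at vertical position y
    was generated by each curve, given the current curve means f_k(x)
    and mixture weights w_k.  Weights set to zero keep upper points
    off the lower curves (and vice versa), as in the text."""
    p = [w * normal_pdf(y, m, sigma) for w, m in zip(weights, means)]
    z = sum(p)
    return [pi / z for pi in p]

# e.g. a maximum near the core line (means are illustrative values)
r = responsibilities(y=2.1, means=[3.0, 2.0, 1.0, 0.0],
                     weights=[0.25, 0.25, 0.25, 0.25])
assert abs(sum(r) - 1.0) < 1e-9
assert max(r) == r[1]   # the closest curve (here the core line) dominates
```

The M-step then maximizes the resulting auxiliary function in closed form for the six free parameters, as stated above.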
3 AMAP
The recognition of handwritten characters from a pen trajectory on a digitizing surface is often done in the time domain (Tappert et al. 1990; Guyon et al. 1991). Typically, trajectories are normalized, and local geometric or dynamic features are sometimes extracted. The recognition is performed using curve matching (Tappert et al. 1990), or other classification techniques such as time-delay neural networks (Guyon et al. 1991). While these representations have several advantages, their dependence on stroke ordering and individual writing styles makes them difficult to use in high accuracy, writer-independent systems that integrate the segmentation with the recognition. Since the intent of the writer is to produce a legible image, it seems natural to preserve as much of the pictorial nature of the signal as possible, while at the same time exploiting the sequential information in the trajectory. We propose a representation scheme, called AMAP, where pen trajectories are represented by low-resolution images in which each picture element contains information about the local properties of the trajectory.
An AMAP can be viewed as a function in a multidimensional space where each dimension is associated with a local property of the trajectory, such as the direction of motion φ, the X position, and the Y position of the pen. The value of the function at a particular location (φ, X, Y) in the space represents a smooth version of the "density" of features in the trajectory that have values (φ, X, Y) (in the spirit of the generalized Hough transform). An AMAP is implemented as a multidimensional array (in our case 5 × 20 × 18) obtained by discretizing the continuous "feature density" function, which varies smoothly with position (X, Y) and other variables such as direction of motion φ, into "boxes." Each of these array elements is assigned a value equal to the integral of the feature density function over the corresponding box. In practice, an AMAP is computed as follows. At each sample on the trajectory, one computes the position of the pen (X, Y) and orientation of the motion φ (and possibly other features, such as the local curvature c). Each element in the AMAP is then incremented by the amount of the integral over the corresponding box of a predetermined point-spread function centered on the coordinates of the feature vector. The use of a smooth point-spread function (say a gaussian) ensures that smooth deformations of the trajectory will correspond to smooth transformations of the AMAP. An AMAP can be viewed as an "annotated image" in which each pixel is a feature vector. A particularly useful feature of the AMAP representation is that it makes very few assumptions about the nature of the input trajectory. It does not depend on stroke ordering or writing speed, and it can be used with all types of handwriting (capital, lower case, cursive, punctuation, symbols). Unlike many other representations (such as global features), AMAPs can be computed for complete words without requiring segmentation.
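The accumulation step just described can be sketched as follows. This is a simplification: a Gaussian point-spread evaluated on the grid replaces the exact box integral, and each sample is assumed already quantized to one feature plane and scaled to grid coordinates:

```python
import numpy as np

def amap(samples, shape=(5, 20, 18), sigma=1.0):
    """Accumulate an AMAP from trajectory samples.  Each sample is
    (feature_index, X, Y): the feature plane to increment and the pen
    position scaled to the grid.  A Gaussian point-spread over (X, Y)
    stands in for the box integral of the text."""
    F, W, H = shape
    a = np.zeros(shape)
    xs, ys = np.meshgrid(np.arange(W), np.arange(H), indexing="ij")
    for f, x, y in samples:
        a[f] += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return a

m = amap([(0, 10.0, 9.0), (4, 3.0, 3.0)])   # one "vertical", one "curvature"
assert m.shape == (5, 20, 18)
assert m[0].argmax() == 10 * 18 + 9   # density peaks at the sample position
```

The smoothness of the point-spread function is what guarantees that small deformations of the trajectory produce small changes in the array.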
In the experiments we used AMAPs with five features at each pixel location: four features are associated with four orientations (0°, 45°, 90°, and 135°), and the fifth is associated with local curvature. For example, when there is a nearly vertical segment in an area, nearby pixels will have a strong value for the first ("vertical") feature. Near endpoints or points of high spatial curvature on the trajectory, the fifth ("curvature") feature will be high. Curvature information is obtained by computing the cosine of the angle between successive elementary segments of the trajectory. Because of the integration of the gaussian point-spread function, the curvature feature at a given pixel depends on the curvature at different points of the trajectory in the vicinity of that pixel. 4 Convolutional Neural Networks
Image-like representations such as AMAPs are particularly well suited for use in combination with multilayer convolutional neural networks (MLCNNs) (LeCun et al. 1989; LeCun et al. 1990). MLCNNs are feed-
forward neural networks whose architectures are tailored for minimizing the sensitivity to translations, rotations, or distortions of the input image. They are trained to recognize and spot characters with a variation of the backpropagation algorithm (Rumelhart et al. 1986; LeCun 1986). Each unit in an MLCNN is connected only to a local neighborhood in the previous layer. Each unit can be seen as a local feature detector whose function is determined by the learning procedure. Insensitivity to local transformations is built into the network architecture by constraining sets of units located at different places to use identical weight vectors, thereby forcing them to detect the same feature on different parts of the input. The outputs of the units at identical locations in different feature maps can be collectively thought of as a local feature vector. Features of increasing complexity and scale are extracted by the neurons in the successive layers. Because of weight-sharing, the number of free parameters in the system is greatly reduced. Furthermore, MLCNNs can be scanned (replicated) over large input fields containing multiple unsegmented characters (whole words) very economically by simply performing the convolutions on larger inputs. Instead of producing a single output vector, such an application of an MLCNN produces a sequence of output vectors. The outputs detect and recognize characters at different (and overlapping) locations on the input. These multiple-input, multiple-output MLCNNs are called space displacement neural networks (SDNNs) (Matan et al. 1992; Keeler et al. 1991; Schenkel et al. 1993). 
One of the best networks we found for character recognition has 5 layers arranged as illustrated in Figure 2: layer 1, convolution with 8 kernels of size 3 x 3; layer 2, 2 x 2 subsampling; layer 3, convolution with 25 kernels of size 5 x 5; layer 4, convolution with 84 kernels of size 4 x 4; layer 5, 2 x 1 subsampling; classification layer, 95 radial basis function (RBF) units (one per class). The subsampling layers are essential to the network's robustness to distortions. Hidden units of a subsampling layer apply the squashing nonlinearity to a scaled and offset sum of their inputs (from the same feature map at the previous layer). For each feature map, there are two learned parameters in a subsampling layer: the scaling and the bias, which control the effect of the nonlinearity. The output layer is one (single MLCNN) or a series of (SDNN) 95-dimensional vectors, with a distributed target code for each character corresponding to the weights of the RBF units. The choice of input field dimension was based on the following considerations. We estimated that at least 4 or 5 pixels were necessary for the core of characters (between the baseline and the core line). Furthermore, very wide characters (such as "w") can have a 3 to 1 aspect ratio. On the vertical dimension, it is necessary to leave room for ascenders and descenders (at least one core height each). In addition, extra borders allow outer edges of the characters to lie at the center of the receptive field of some units in the first layer, thereby improving the accuracy. Once the
LeRec: Hybrid for On-Line Handwriting Recognition
[Figure 2 layout: input AMAP 5@20x18 -> 3x3 convolution -> feature maps 8@18x16 -> 2x2 subsampling -> feature maps 8@9x8 -> 5x5 convolution -> feature maps 25@5x4 -> 4x4 convolution -> output code 84@2x1.]

Figure 2: Convolutional neural network character recognizer. This architecture is robust to local translations and distortions, with subsampling, shared weights, and local receptive fields.

number of subsampling layers and the sizes of the kernels are chosen, the sizes of all the layers, including the input, are determined unambiguously. The only architectural parameters that remain to be selected are the number of feature maps in each layer, and the information as to what feature map is connected to what other feature map. In our case, the subsampling rates were chosen as small as possible (2 x 2), and the kernels as small as possible in the first layer (3 x 3) to limit the total number of connections. Kernel sizes in the upper layers are chosen to be as small as possible while satisfying the size constraints mentioned above. The last subsampling layer performs a vertical subsampling to make the network more robust to errors of the word normalizer (which tends to create variations in vertical position). Several architectures were tried (but clearly not exhaustively), varying the type of layers (convolution, subsampling), the kernel sizes, and the number of feature maps. Larger architectures did not necessarily perform better and required considerably more time to train. A very small architecture with half the input field also performed worse, because of insufficient input resolution. Note that the input resolution is nonetheless much lower than for optical character recognition, because the angle and curvature provide more information than a single grey level at each pixel. Training proceeded in two phases. First, we kept the centers of the RBFs fixed, and trained the network weights so as to maximize the logarithm of the output RBF corresponding to the correct class (maximum log-likelihood). This is equivalent to minimizing the mean-squared error between the previous layer and the center of the correct-class RBF.
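The determinism noted above (once the kernel and subsampling sizes are fixed, every layer size follows from the input size) can be checked mechanically. The sketch below reproduces the layer sizes of the architecture just described, under the usual "valid convolution, non-overlapping subsampling" assumptions:

```python
def output_sizes(input_hw, layers):
    """Propagate (height, width) through 'valid' convolutions and
    non-overlapping subsampling steps: once kernel and subsampling
    sizes are chosen, every layer size is determined."""
    h, w = input_hw
    sizes = [(h, w)]
    for kind, (kh, kw) in layers:
        if kind == "conv":        # valid convolution shrinks by k - 1
            h, w = h - kh + 1, w - kw + 1
        elif kind == "sub":       # non-overlapping subsampling divides
            h, w = h // kh, w // kw
        sizes.append((h, w))
    return sizes

# The architecture of Figure 2, starting from a 20 x 18 input AMAP:
arch = [("conv", (3, 3)), ("sub", (2, 2)),
        ("conv", (5, 5)), ("conv", (4, 4)), ("sub", (2, 1))]
sizes = output_sizes((20, 18), arch)
# [(20, 18), (18, 16), (9, 8), (5, 4), (2, 1), (1, 1)]
```

The intermediate sizes match the feature-map dimensions shown in Figure 2 (18x16 after the first convolution, 9x8 after subsampling, 5x4 and 2x1 after the upper convolutions).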
[Figure 3 layout: two pipelines, the INSEG architecture for word recognition and the OUTSEG architecture for word recognition, each proceeding from a raw word through normalization to a normalized word (the INSEG branch generating cut hypotheses) and ending with candidate words such as "Script".]
Figure 3: INSEG and OUTSEG architectures for word recognition.

This bootstrap phase was performed on isolated characters. In the second phase, all the parameters, network weights, and RBF centers were trained globally to minimize a discriminant criterion at the word level. This is described in more detail in the next section.
5 Segmentation and Postprocessing

The convolutional neural network can be used to give scores associated to characters when the network (or a piece of it corresponding to a single character output) has an input field, called a segment, that covers a connected subset of the whole word. A segmentation is a sequence of such segments that covers the whole word. Because there are often many possible segmentations, sophisticated tools such as hidden Markov models and dynamic programming are used to search for the best segmentation. In this paper, we consider two approaches to the segmentation problem, called INSEG (for input segmentation) and OUTSEG (for output segmentation). In both approaches, the postprocessor can be decomposed into two levels: (1) character-level scores and constraints obtained from the observations, and (2) word-level constraints (e.g., from a grammar or dictionary). The INSEG and OUTSEG systems share the second level. The INSEG and OUTSEG architectures are depicted in Figure 3.
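To see why such search machinery is needed, the following sketch counts the segmentations induced by a set of candidate cut points. The cut positions are hypothetical, not the output of the actual cutter; the point is only that the number of segmentations grows exponentially with the number of cuts:

```python
def segmentations(cuts):
    """All segmentations of a word: sequences of segments (i, j) that
    exactly cover the interval from the first to the last cut point.
    For n cut points there are 2**(n - 2) such segmentations."""
    first, last = cuts[0], cuts[-1]

    def cover(start):
        if start == last:
            return [[]]            # the word is fully covered
        out = []
        for end in cuts:
            if end > start:        # one candidate segment (start, end)
                out += [[(start, end)] + rest for rest in cover(end)]
        return out

    return cover(first)

segs = segmentations([0, 1, 2, 3])
# 4 cut points -> 4 segmentations, from [(0,1),(1,2),(2,3)] down to [(0,3)]
```

Even a modest number of candidate cuts therefore rules out exhaustive scoring, which is why dynamic programming over a graph of segments is used instead.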
In an INSEG system, the network is applied to a number of heuristically segmented candidate characters. A cutter generates candidate cuts, which represent a potential boundary between two character segments. It also generates definite cuts, which we assume no segment can cross. A combiner then generates the candidate segments, based on the cuts found. The cutter module finds candidate cuts in cursive words (note that the data can be cursive, printed, or mixed). A superset of such cuts is first found, based on the pen direction of motion along each stroke. Next, several filters are applied to remove incorrect cuts. The filters use vertical projections, proximity to the baseline, and other similar characteristics. Horizontal strokes of "T"s that run into the next character (with no pen up) are also cut here. Next, the combiner module generates segments based on these cuts. Heuristic filters are again used to reduce the number of candidate segments to a manageable number. For example, segments falling across definite cuts, or that are too wide, or that contain too many strokes, are removed from the list of candidates; and segments that contain too little ink are forcibly combined with other segments. Finally, some segments (such as the horizontal or vertical strokes of "T"s, other vertical strokes that lie geometrically inside other strokes, etc.) are also forcibly combined into larger segments. The network is then applied to each of the resulting segments separately. The resulting scores are attached to nodes of an observation graph in which the connectivity and transition probabilities on arcs represent segmentation and geometrical constraints (e.g., segments must not overlap and must cover the whole word, and some transitions between characters are more or less likely given the geometrical relations between their images).
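A minimal dynamic-programming search over such an observation graph can be sketched as follows. The segment labels and costs below are invented for illustration; the real system additionally carries transition probabilities and geometrical constraints on the arcs:

```python
import math

def best_segmentation(n_cuts, segment_costs):
    """Shortest-path search over an observation graph whose nodes are
    cut positions 0..n_cuts-1 and whose arcs are candidate segments
    (i, j) carrying a recognized label and a cost (e.g., a negative
    log-likelihood).  Returns the minimum total cost of covering the
    whole word and the winning sequence of labeled segments."""
    best = [math.inf] * n_cuts
    back = [None] * n_cuts
    best[0] = 0.0
    for j in range(1, n_cuts):
        for (i, jj), (label, cost) in segment_costs.items():
            if jj == j and best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = (i, label)
    path, j = [], n_cuts - 1
    while back[j] is not None:
        i, label = back[j]
        path.append((i, j, label))
        j = i
    return best[n_cuts - 1], path[::-1]

# Toy example with hypothetical costs for a two-letter word:
segs = {(0, 1): ("t", 0.2), (1, 2): ("o", 0.3), (0, 2): ("b", 1.5)}
cost, path = best_segmentation(3, segs)
# cost == 0.5, path == [(0, 1, 't'), (1, 2, 'o')]
```

The search naturally enforces that segments do not overlap and that they cover the whole word, since only complete paths from the first to the last cut are considered.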
Each node in the observation graph thus represents a segment of the input image and a candidate classification for this segment, with a corresponding score or cost. In an OUTSEG system, all segmentation decisions are delayed until after the recognition (Matan et al. 1992; Keeler et al. 1991; Schenkel et al. 1993), as is often done in speech recognition (Bengio et al. 1992). The AMAP of the entire word is shown to an SDNN, which produces a sequence of output vectors equivalent to scanning the single-character network over all possible pixel locations on the input. The Euclidean distances between each output vector and the targets are interpreted as log-likelihoods of the output given a class. To construct an observation graph, we use a set of character HMMs, modeling the sequence of network outputs observed for each character. We used three-state HMMs for each character, with a left and right state to model transitions and a center state for the character itself. The observation graph is obtained by connecting these character models, allowing any character to follow any character. On top of the constraints given in the observation graph, additional
constraints that are independent of the observations are given by what we call a grammar graph, which can embody lexical constraints. These constraints can be given in the form of a dictionary or of a character-level grammar (with transition probabilities), such as a trigram. Recognition searches for the best path in the observation graph that is compatible with the grammar graph. When the grammar graph has a complex structure (e.g., a dictionary), the product of the grammar graph with the observation graph can be huge. To avoid generating such a large data structure, we define the nodes of this product graph procedurally, and we instantiate nodes only along the paths explored by the graph search (and pruning) algorithm. With the OUTSEG architecture, there are several ways to put together the within-character constraints of the HMM observation graph and the between-character constraints of the grammar graph. The approach generally followed in HMM speech recognition systems consists of taking the product of these two graphs and searching for the best path in the combined graph. This is equivalent to using the costs and connectivity of the grammar graph to connect together the character HMM models from the observation graph, i.e., to provide the transition probabilities between the character HMMs (after making duplicates of the character models for each corresponding character in the grammar graph). Variations of this scheme include pruning the search (e.g., with beam search) and separating the search in the observation graph and the grammar graph. A crucial contribution of our system is the joint training of the neural network and the postprocessor with respect to a single criterion that approximates word-level errors. We used the following discriminant criterion: minimize the total cost (sum of negative log-likelihoods) along the "correct" paths (the ones that yield the correct interpretations), while maximizing the costs of all the paths (correct or not).
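The behavior of this criterion can be illustrated with explicit path costs. In the actual system the sums over paths are computed by forward recursions over the graphs rather than by enumeration; the costs below are invented:

```python
import math

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def word_level_criterion(costs, correct):
    """Discriminant word-level criterion sketched in the text: the
    difference between the 'free energy' of all paths and that of the
    correct paths (costs are negative log-likelihoods, lower = better).
    The value is >= 0 and close to 0 when a correct path dominates."""
    all_energy = logsumexp([-c for c in costs])
    correct_energy = logsumexp([-costs[i] for i in correct])
    return all_energy - correct_energy

# One cheap correct path among expensive wrong ones -> near-zero loss:
near_zero = word_level_criterion([0.1, 8.0, 9.0], correct=[0])
# A cheap *wrong* path -> large loss, hence strong gradients:
large = word_level_criterion([5.0, 0.1, 9.0], correct=[0])
```

This matches the gradient behavior described next: when the correct path already dominates, the loss (and hence the backpropagated gradient) is nearly zero; a low-cost incorrect path makes the loss large.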
The discriminant nature of this criterion can be shown with the following example. If the cost of a path associated with the correct interpretation is much smaller than that of all other paths, the criterion is very close to 0 and almost no gradient is backpropagated. On the other hand, if the lowest cost path yields an incorrect interpretation but differs from a path of correct interpretation on a subpath, then very strong gradients will be propagated along that subpath, whereas the other parts of the sequence will generate almost no gradient. Within a probabilistic framework, this criterion corresponds to maximizing the mutual information (MMI) between the observations and the correct interpretation (Nadas et al. 1988). The mutual information I(C, Y) between the correct interpretation C (sequence of characters) and the transformed observations Y (sequence of outputs of the last layer of the neural net before the RBFs) can be rewritten as follows, using Bayes' rule:

I(C, Y) = log [P(Y | C) / P(Y)]   (5.1)
where P(Y | C) is the likelihood of the transformed observations Y constrained by the knowledge of the correct interpretation sequence C, P(Y) is the unconstrained likelihood of Y (i.e., taking all interpretations possible in the model into account), and P(C) is the prior probability of the sequence of characters C. Interestingly, when the class priors are fixed, maximizing I(C, Y) is equivalent to maximizing the posterior probability of the correct sequence C, given the observations Y (also known as the maximum a posteriori, or MAP, criterion):

log P(C | Y) = log P(Y | C) + log P(C) - log P(Y) = I(C, Y) + log P(C)
Both the MMI and MAP criteria are more discriminant than the maximum likelihood criterion [maximizing P(Y | C)] because the parameters are used not to model the type of observations corresponding to a particular class C, but rather to discriminate between classes. The most discriminant criterion is the number of classification errors on the training set but, unfortunately, it is computationally very difficult to optimize such a discrete criterion directly. During global training, the MMI criterion was optimized with a modified stochastic gradient descent procedure that uses second derivatives to compute optimal learning rates (LeCun 1989) (this can be seen as a stochastic version of the Levenberg-Marquardt algorithm with a diagonal approximation of the Hessian). This optimization operates on all the parameters in the system, most notably the network weights and the RBF centers. Experiments described in the next section showed important reductions in error rates when training with this word-level criterion instead of just training the network separately for each character. Similar combinations of neural networks with HMMs or dynamic programming have been proposed in the past for speech recognition problems (Bengio et al. 1992).

6 Experimental Results
In the first set of experiments, we evaluated the generalization ability of the neural network classifier coupled with the word normalization preprocessing and the AMAP input representation. All results are in writer-independent mode (different writers in training and testing). Tests on a database of isolated characters were performed separately on the four types of characters: upper case (2.99% error on 9122 patterns), lower case (4.15% error on 8201 patterns), digits (1.4% error on 2938 patterns), and punctuation (4.3% error on 881 patterns). Experiments were performed with the network architecture described above. To enhance the robustness of the recognizer to variations in position, size, orientation, and other distortions, additional training data were generated by applying local affine transformations to the original characters.
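A sketch of this data-generation step, applied to a pen trajectory, might look as follows. The transformation ranges here are illustrative assumptions, not the values used in the experiments:

```python
import math
import random

def random_affine(points, max_rot=0.1, max_scale=0.1, max_shear=0.1):
    """Generate an extra training example by applying a small random
    affine map (rotation, scaling, shear) to a character's trajectory.
    The ranges are hypothetical, chosen only to keep distortions local."""
    a = random.uniform(-max_rot, max_rot)            # rotation (radians)
    s = 1.0 + random.uniform(-max_scale, max_scale)  # isotropic scale
    sh = random.uniform(-max_shear, max_shear)       # horizontal shear
    ca, sa = math.cos(a), math.sin(a)
    return [(s * (ca * x - sa * y) + sh * y, s * (sa * x + ca * y))
            for x, y in points]

random.seed(0)
stroke = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
jittered = random_affine(stroke)   # same number of points, slightly distorted
```

Each distorted copy is then preprocessed into an AMAP exactly like a real example, so the network sees plausible variations in position, size, and slant.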
The second and third sets of experiments concerned the recognition of lower case words (writer independent). The tests were performed on a database of 881 words. First we evaluated the improvement brought by the word normalization to the INSEG system. For the OUTSEG system we have to use a word normalization, since the network sees a whole word at a time. With the INSEG system, and before doing any word-level training, we obtained without word normalization 7.3 and 3.5% word and character errors (adding insertions, deletions, and substitutions) when the search was constrained within a 25,461-word dictionary. When using the word normalization preprocessing instead of a character-level normalization, error rates dropped to 4.6 and 2.0% for word and character errors, respectively, i.e., a relative drop of 37 and 43% in word and character error, respectively. In the third set of experiments, we measured the improvement obtained with the joint training of the neural network and the postprocessor with the word-level criterion, in comparison to training based only on the errors made at the character level. Training was performed with a database of 3500 lower case words. For the OUTSEG system, without any dictionary constraints, the error rates dropped from 38 and 12.4% word and character error to 26 and 8.2%, respectively, after word-level training, i.e., a relative drop of 32 and 34%. For the INSEG system and a slightly improved architecture, without any dictionary constraints, the error rates dropped from 22.5 and 8.5% word and character error to 17 and 6.3%, respectively, i.e., a relative drop of 24.4 and 25.6%. With a 25,461-word dictionary, errors dropped from 4.6 and 2.0% word and character errors to 3.2 and 1.4%, respectively, after word-level training, i.e., a relative drop of 30.4 and 30.0%. Even lower error rates can be obtained by drastically reducing the size of the dictionary to 350 words, yielding 1.6 and 0.94% word and character errors.
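The relative drops quoted throughout this section are plain relative error reductions; for instance, the dictionary-constrained INSEG numbers check out as:

```python
def relative_drop(before, after):
    """Relative error reduction in percent, as used in the figures
    quoted above (e.g., word error 4.6% -> 3.2%)."""
    return 100.0 * (before - after) / before

word_drop = relative_drop(4.6, 3.2)   # ~30.4
char_drop = relative_drop(2.0, 1.4)   # ~30.0
```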
The AMAP preprocessing with bidimensional multilayer convolutional networks was also compared with another approach developed in our laboratory (Guyon et al. 1991), based on a time-domain representation and a one-dimensional convolutional network (or time-delay neural network). The networks were not trained on the same data, but were both tested on the same database of 17,858 isolated characters provided by AT&T Global Information Solutions (formerly NCR) for comparing a variety of commercial character recognizers with the recognizers developed in our laboratory. Error rates for the AMAP network were, respectively, 2.0, 5.4, 6.7, and 2.5% on digits, upper case, lower case, and a reduced set of punctuation symbols. On the same categories, the time-delay neural network (based on a temporal representation) obtained 2.6, 6.4, 7.7, and 5.1% errors, respectively. However, we noticed that the two networks often made errors on different patterns, probably because they are based on different input representations. Hence we combined their outputs (by a simple sum), and obtained on the same classes 1.4, 3.9, 5.3, and 2.2% errors, i.e., a very substantial improvement. This can be explained because
[Figure 4 layout: bar chart of isolated-character comparative raw error rates, with Bell Labs at 18.9% and commercial recognizers #1-#4 at 30.8%, 32.5%, 34%, and 39%.]

Figure 4: Comparative results on a benchmark test conducted by AT&T-GIS on isolated character recognition (uppers, lowers, digits, symbols). The last four bars represent the results obtained by four competing commercial recognizers. The floor (12.9%) represents the best result we could obtain by not counting irreducible confusions as errors.
when these recognizers not only make errors on different patterns but also have good rejection properties, the highest scoring class tends to have a low score when it is not the correct class. AT&T-GIS conducted a test in which such a combined system was compared with 4 commercial classifiers on the printable ASCII set (isolated characters, including upper and lower case, digits, and punctuation). On this benchmark task, because characters are given in isolation without baseline information, there are inherent confusions between many sets of characters, such as ("0", "o"), ("P", "p"), ("2", "z", "Z"), ("I", "i" with no dot, "l"), (";", "i"), etc. We estimated that the best one could hope for because of these confusions was around a 12.9% error rate (by not counting these confusions as errors with our best recognizer). Our recognizer obtained 18.9%, that is, 6% worse than this estimated floor. The
error rates obtained by the commercial recognizers were, in decreasing order of performance, 30.8, 32.5, 34.0, and 39.0%. These are, respectively, 17.9, 19.6, 21.1, and 26.1% above our estimated floor. These results are illustrated in the bar chart of Figure 4. Note, however, that the results are slightly biased by the fact that we are comparing a laboratory prototype to established commercial systems with real-time performance.

7 Conclusion
We have demonstrated a new approach to on-line handwritten word recognition that uses word- or sentence-level preprocessing and normalization, image-like representations, convolutional neural networks, graph-based word models, and global training using a highly discriminant word-level criterion. Excellent accuracy on various writer-independent tasks was obtained with this combination.
Acknowledgments

We would like to thank Isabelle Guyon for the fruitful exchanges of ideas and information on our approaches to the problem. Mike Miller and his colleagues at AT&T-GIS are gratefully acknowledged for providing the database and running the benchmarks. We would also like to acknowledge the help of other members of our department, in particular, Donnie Henderson, John Denker, and Larry Jackel. Y. B. would also like to acknowledge the support of the Natural Sciences and Engineering Research Council of Canada.
References

Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. 1992. Global optimization of a neural network-hidden Markov model hybrid. IEEE Trans. Neural Networks 3(2), 252-259.

Burges, C., Matan, O., LeCun, Y., Denker, J., Jackel, L., Stenard, C., Nohl, C., and Ben, J. 1992. Shortest path segmentation: A method for training a neural network to recognize character strings. Proc. Int. Joint Conf. Neural Networks 3, 165-172.

Guyon, I., Albrecht, P., Le Cun, Y., Denker, J. S., and Hubbard, W. 1991. Design of a neural network character recognizer for a touch terminal. Pattern Rec. 24(2), 105-119.

Keeler, J., Rumelhart, D., and Leow, W. 1991. Integrated segmentation and recognition of hand-printed numerals. In Neural Information Processing Systems, R. P. Lippmann, J. M. Moody, and D. S. Touretzky, eds., Vol. 3, pp. 557-563. Morgan Kaufmann, San Mateo, CA.
LeCun, Y. 1986. Learning processes in an asymmetric threshold network. In Disordered Systems and Biological Organization, E. Bienenstock, F. Fogelman-Soulie, and G. Weisbuch, eds., pp. 233-240. Springer-Verlag, Berlin.

LeCun, Y. 1989. Generalization and Network Design Strategies. Tech. Rep. CRG-TR-89-4, Department of Computer Science, University of Toronto.

LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L. 1989. Backpropagation applied to handwritten zip code recognition. Neural Comp. 1, 541-551.

LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L. 1990. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, D. Touretzky, ed., Vol. 2, pp. 396-404. Morgan Kaufmann, San Mateo, CA.

Matan, O., Burges, C., LeCun, Y., and Denker, J. 1992. Multi-digit recognition using a space displacement neural network. In Advances in Neural Information Processing Systems, J. Moody, S. Hanson, and R. Lippmann, eds., Vol. 4, pp. 488-495. Morgan Kaufmann, San Mateo, CA.

Nadas, A., Nahamoo, D., and Picheny, M. 1988. On a model-robust training method for speech recognition. IEEE Trans. Acoustics, Speech Signal Process. ASSP-36(9), 1432-1436.

Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.

Schenkel, M., Weissman, H., Guyon, I., Nohl, C., and Henderson, D. 1993. Recognition-based segmentation of on-line hand-printed words. In Advances in Neural Information Processing Systems, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., Vol. 5, pp. 723-730. Morgan Kaufmann, San Mateo, CA.

Tappert, C., Suen, C., and Wakahara, T. 1990. The state of the art in on-line handwriting recognition. IEEE Trans. Pattern Anal. Machine Intelligence 12(8), 787-808.
Received June 17, 1994; accepted January 19, 1995.
Index
Volume 7, By Author

Abbott, L. F. - See Idiart, M.
Abu-Mostafa, Y. Hints (Review) 7(4):639-671
Allinson, N. M. - See Yin, H.
Alquézar, R. and Sanfeliu, A. An Algebraic Framework to Represent Finite State Machines in Single-Layer Recurrent Neural Networks (Letter) 7(5):931-949
Amari, S. The EM Algorithm and Information Geometry in Neural Network Learning (Note) 7(1):13-18
Barber, D., Saad, D., and Sollich, P. Test Error Fluctuations in Finite Linear Perceptrons (Letter) 7(4):809-821
Bartlett, E. B. - See Kim, K.
Bartlett, P. - See Lee, W. S.
Bauer, H.-U. Development of Oriented Ocular Dominance Bands as a Consequence of Areal Geometry (Letter) 7(1):36-50
Baxt, W. G. and White, H. Bootstrapping Confidence Intervals for Clinical Input Variable Effects in a Network Trained to Identify the Presence of Acute Myocardial Infarction (Letter) 7(3):624-638
Bell, A. J. and Sejnowski, T. J. An Information-Maximization Approach to Blind Separation and Blind Deconvolution (Article) 7(6):1129-1159
Benaim, M. Convergence Theorems for Hybrid Learning Rules (Note) 7(1):19-24
Bengio, Y., LeCun, Y., Nohl, C., and Burges, C. LeRec: A NN/HMM Hybrid for On-Line Handwriting Recognition (Letter) 7(6):1289-1303
Bennani, Y. A Modular and Hybrid Connectionist System for Speaker Identification (Letter) 7(4):791-798
Berk, B. - See Idiart, M.
Bertsekas, D. P. A Counterexample to Temporal Differences Learning (Note) 7(2):270-279
Bishop, C. M. Training with Noise is Equivalent to Tikhonov Regularization (Letter) 7(1):108-116
Bishop, C. M., Haynes, P. S., Smith, M. E. U., Todd, T. N., and Trotman, D. L. Real-Time Control of a Tokamak Plasma Using Neural Networks (Letter) 7(1):206-217
Bruske, J. and Sommer, G. Dynamic Cell Structure Learns Perfectly Topology Preserving Map (Letter) 7(4):845-865
Bylander, T. Learning Linear Threshold Approximations Using Perceptrons (Letter) 7(2):370-379
Buchanan, J. T. - See Murphey, C. R.
Budinich, M. Sorting with Self-organizing Maps (Letter) 7(6):1188-1190
Budinich, M. and Taylor, J. G. On the Ordering Conditions for Self-organizing Maps (Note) 7(2):284-289
Burges, C. - See Bengio, Y.
Campbell, C. and Perez Vicente, C. The Target Switch Algorithm: A Constructive Learning Procedure for Feed-Forward Neural Networks (Letter) 7(6):1245-1264
Cannas, S. A. Arithmetic Perceptrons (Letter) 7(1):173-181
Carrasco, R. C. - See Forcada, M. L.
Chae, S. I. - See Lee, E. W.
Chambet, N. - See Chapeau-Blondeau, F.
Chapeau-Blondeau, F. and Chambet, N. Synapse Models for Neural Networks: From Ion Channel Kinetics to Multiplicative Coefficient wij (Letter) 7(4):713-734
Cherkassky, V. and Mulier, F. Self-organization as an Iterative Kernel Smoothing Process (Letter) 7(6):1165-1177
Cho, S.-B. and Kim, J. H. An HMM/MLP Architecture for Sequence Recognition (Letter) 7(2):358-369
Coetzee, F. M. and Stonick, V. L. Topology and Geometry of Single Hidden Layer Network, Least Squares Weight Solutions (Article) 7(4):672-705
Corradi, V. and White, H. Regularized Neural Networks: Some Convergence Rate Results (Letter) 7(6):1225-1244
Cowan, J. D. - See Ohira, T.
Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. The Helmholtz Machine (Letter) 7(5):889-904
Dayan, P. and Zemel, R. S. Competition and Multiple Cause Models (Letter) 7(3):565-579
Deco, G., Finnoff, W., and Zimmerman, H. G. Unsupervised Mutual Information Criterion for Elimination of Overtraining in Supervised Multilayer Networks (Letter) 7(1):86-107
Deco, G. and Obradovic, D. Decorrelated Hebbian Learning for Clustering and Function Approximation (Letter) 7(2):338-348
Deffuant, G. An Algorithm for Building Regularized Piecewise Linear Discrimination Surfaces: The Perceptron Membrane (Letter) 7(2):380-398
Edelman, S. Representation of Similarity in Three-Dimensional Object Discrimination (Letter) 7(2):408-423
Elfadel, I. M. Convex Potentials and their Conjugates in Analog Mean-field Optimization (Letter) 7(5):1079-1104
Elfadel, I. M. - See Wyatt, J. L., Jr.
Engel, A. K. - See Konig, P.
Erwin, E., Obermayer, K., and Schulten, K. Models of Orientation and Ocular Dominance Columns in the Visual Cortex: A Critical Comparison (Review) 7(3):425-468
Finnoff, W. - See Deco, G.
Fohlmeister, C., Gerstner, W., Ritz, R., and van Hemmen, J. L. Spontaneous Excitations in the Visual Cortex: Stripes, Spirals, Rings, and Collective Bursts (Letter) 7(5):905-914
Forcada, M. L. and Carrasco, R. C. Learning the Initial State of a Second-Order Recurrent Neural Network during Regular-Language Inference (Letter) 7(5):923-930
Freeman, J. A. S. and Saad, D. Learning and Generalization in Radial Basis Function Networks (Letter) 7(5):1000-1020
Fukai, T. and Shiino, M. Memory Recall By Quasi-Fixed-Point Attractors in Oscillator Neural Networks (Letter) 7(3):529-548
Fyfe, C. Introducing Asymmetry into Interneuron Learning (Letter) 7(6):1191-1205
Gazzaniga, M. S. On Neural Circuits and Cognition (View) 7(1):1-12
Gerstner, W. - See Fohlmeister, C.
Girosi, F., Jones, M., and Poggio, T. Regularization Theory and Neural Networks Architectures (Review) 7(2):219-269
Golomb, B. A. - See Gray, M. S.
Gordon, M. B. - See Raffin, B.
Gray, M. S., Lawrence, D. T., Golomb, B. A., and Sejnowski, T. J. A Perceptron Reveals the Face of Sex (Note) 7(6):1160-1164
Guerrieri, R. - See Rovatti, R.
Hansel, D., Mato, G., and Meunier, C. Synchrony in Excitatory Neural Networks (Letter) 7(2):307-337
Haynes, P. S. - See Bishop, C. M.
Hinton, G. E. - See Dayan, P.
Hinton, G. E. - See Zemel, R. S.
Holden, S. B. and Niranjan, M. On the Practical Applicability of VC Dimension Bounds (Letter) 7(6):1265-1288
Horn, D. and Ruppin, E. Compensatory Mechanisms in an Attractor Neural Network Model of Schizophrenia (Letter) 7(1):182-205
Huuhtanen, P. - See Lehtokangas, M.
Idiart, M., Berk, B., and Abbott, L. F. Reduced Representation by Neural Networks with Restricted Receptive Fields (Letter) 7(3):507-517
Jacobs, R. A. Methods for Combining Experts' Probability Assessments (Review) 7(5):867-888
Jones, M. - See Girosi, F.
Kabashima, Y. and Shinomoto, S. Learning a Decision Boundary from Stochastic Examples: Incremental Algorithms with and without Queries (Letter) 7(1):158-172
Kaski, K. - See Lehtokangas, M.
Kim, J. H. - See Cho, S.-B.
Kim, K. and Bartlett, E. B. Error Estimation by Series Association for Neural Network Systems (Letter) 7(4):799-808
Konig, P., Engel, A. K., Roelfsema, P. R., and Singer, W. How Precise Is Neuronal Synchronization? (Letter) 7(3):469-485
Kovacs, Z. M. - See Rovatti, R.
Lawrence, D. T. - See Gray, M. S.
LeCun, Y. - See Bengio, Y.
Lee, E.-W. and Chae, S.-I. New Perceptron Model Using Random Bitstreams (Note) 7(2):280-283
Lee, W. S., Bartlett, P., and Williamson, R. C. Lower Bounds on the VC Dimension of Smoothly Parameterized Function Classes (Letter) 7(5):1040-1053
Leen, T. K. From Data Distributions to Regularization in Invariant Learning (Letter) 7(5):974-981
Lehtokangas, M., Saarinen, J., Huuhtanen, P., and Kaski, K. Initializing Weights of a Multilayer Perceptron Network by Using the Orthogonal Least Squares Algorithm (Letter) 7(5):982-999
Levin, A. U. and Narendra, K. S. Identification Using Feedforward Networks (Letter) 7(2):349-357
Lowe, D. G. Similarity Metric Learning for a Variable-Kernel Classifier (Letter) 7(1):72-85
Maass, W. Agnostic PAC Learning of Functions on Analog Neural Nets (Letter) 7(5):1054-1078
Mato, G. - See Hansel, D.
Meir, R. Empirical Risk Minimization versus Maximum-Likelihood Estimation: A Case Study (Letter) 7(1):144-157
Meunier, C. - See Hansel, D.
Index Mitchison, G. A Type of Duality between Self-organizing Maps and Minimal Wiring (Letter)
1311
7(1):25-35
Moore, L. E. - See Murphey, C. R. Mulier, F. - See Cherkassky, V. Murphey, C. R., Moore, L. E., and Buchanan, J. T. Quantitative Analysis of Electrotonic Structure and Membrane Properties of NMDA-Activated Lamprey Spinal Neurons (Letter)
7(3):486-506
Narendra, K. S. - See Levin, A. U. Neal, R. M. - See Dayan, P. Niranjan, M. - See Holden, S. B. Nohl, C. - See Bengio, Y. Obermayer, K. - See Erwin, E. Obradovic, D. - See Deco, G. Ohira, T.and Cowan, J. D. Stochastic Single Neurons (Letter)
7(3):518-528
Orr, M. J. L. Regularization in the Selection of Radial Basis Function Centers (Letter)
7(3):606-623
Panzeri, S. - See Treves, A.
Pearlmutter, B. A. Time-Skew Hebb Rule in a Nonisopotential Neuron (Letter)
7(4):706-712
Perez Vicente, C. - See Campbell, C.
Phansalkar, V. V. and Thathachar, M. A. L. Local and Global Optimization Algorithms for Generalized Learning Automata (Letter)
7(5):950-973
Poggio, T. - See Girosi, F.
Qian, N. Generalization and Analysis of the Lisberger-Sejnowski VOR model (Letter)
7(4):735-752
Raffin, B. and Gordon, M. B. Learning and Generalization with Minimerror, A Temperature-Dependent Learning Algorithm (Letter)
7(6):1206-1224
Ragazzoni, R. - See Rovatti, R.
Reggia, J. A. - See Ruppin, E.
Ritz, R. - See Fohlmeister, C.
Roelfsema, P. R. - See Konig, P.
Rovatti, R., Ragazzoni, R., Kovács, Z. M., and Guerrieri, R. Adaptive Voting Rules for k-Nearest Neighbors Classifiers (Letter)
7(3):594-605
Ruppin, E. - See Horn, D.
Ruppin, E. and Reggia, J. A. Patterns of Functional Damage in Neural Network Models of Associative Memory (Letter)
7(5):1105-1127
Saad, D. - See Barber, D.
Saad, D. - See Freeman, J. A. S.
Saarinen, J. - See Lehtokangas, M.
Sajda, J. - See Tiňo, P.
Sanfeliu, A. - See Alquézar, R.
Sanner, R. M. and Slotine, J.-J. E. Stable Adaptive Control of Robot Manipulators Using "Neural" Networks (Letter)
7(4):753-790
Saund, E. A Multiple Cause Mixture Model for Unsupervised Learning (Letter)
7(1):51-71
Sejnowski, T. J. - See Bell, A. J.
Sejnowski, T. J. - See Gray, M. S.
Schulten, K. - See Erwin, E.
Shiino, M. - See Fukai, T.
Shinomoto, S. - See Kabashima, Y.
Singer, W. - See Konig, P.
Slotine, J.-J. E. - See Sanner, R. M.
Smirnakis, S. M. - See Yuille, A. L.
Smith, M. E. U. - See Bishop, C. M.
Sollich, P. - See Barber, D.
Sommer, G. - See Bruske, J.
Stinchcombe, M. Precision and Approximate Flatness in Artificial Neural Networks (Letter)
7(5):1021-1039
Stonick, V. L. - See Coetzee, F. M.
Taylor, J. G. - See Budinich, M.
Thathachar, M. A. L. - See Phansalkar, V. V.
Tiňo, P. and Sajda, J. Learning and Extracting Initial Mealy Automata with a Modular Neural Network Model (Letter)
7(4):822-844
Todd, T. N. - See Bishop, C. M.
Treves, A. and Panzeri, S. The Upward Bias in Measures of Information Derived from Limited Data Samples (Letter)
7(2):399-407
Trotman, D. L. - See Bishop, C. M.
van Hemmen, J. L. - See Fohlmeister, C.
Wang, R. A Simple Competitive Account of Some Response Properties of Visual Neurons in Area MSTd (Letter)
7(2):290-306
White, H. - See Baxt, W. G.
White, H. - See Corradi, V.
Williams, P. M. Bayesian Regularization and Pruning Using a Laplace Prior (Letter)
7(1):117-143
Williamson, R. C. - See Lee, W. S.
Wyatt, J. L., Jr. and Elfadel, I. M. Time-Domain Solutions of Oja's Equations (Letter)
7(5):915-922
Xu, L. - See Yuille, A. L.
Yin, H. and Allinson, N. M. On the Distribution and Convergence of Feature Space in Self-organizing Maps (Letter)
7(6):1178-1187
Yuille, A. L., Smirnakis, S. M., and Xu, L. Bayesian Self-organization Driven by Prior Probability Distributions (Letter)
7(3):580-593
Zemel, R. S. - See Dayan, P.
Zemel, R. S. and Hinton, G. E. Learning Population Codes by Minimizing Description Length (Letter)
7(3):549-564
Zimmerman, H. G. - See Deco, G.