Review of Neural Networks for Speech Recognition

Richard P. Lippmann*
MIT Lincoln Laboratory, Lexington, MA 02173, USA
The performance of current speech recognition systems is far below that of humans. Neural nets offer the potential of providing massive parallelism, adaptation, and new algorithmic approaches to problems in speech recognition. Initial studies have demonstrated that multilayer networks with time delays can provide excellent discrimination between small sets of pre-segmented difficult-to-discriminate words, consonants, and vowels. Performance for these small vocabularies has often exceeded that of more conventional approaches. Physiological front ends have provided improved recognition accuracy in noise, and a cochlea filter bank that could be used in these front ends has been implemented using micro-power analog VLSI techniques. Techniques have been developed to scale networks up in size to handle larger vocabularies, to reduce training time, and to train nets with recurrent connections. Multilayer perceptron classifiers are being integrated into conventional continuous-speech recognizers. Neural net architectures have been developed to perform the computations required by vector quantizers, static pattern classifiers, and the Viterbi decoding algorithm. Further work is necessary for large-vocabulary continuous-speech problems, to develop training algorithms that progressively build internal word models, and to develop compact VLSI neural net hardware.

1 State of the Art for Speech Recognition
Speech is the most natural form of human communication. Compact implementations of accurate, real-time speech recognizers would find widespread use in many applications including automatic transcription, simplified man-machine communication, and aids for the hearing impaired and physically disabled. Unfortunately, current speech recognizers perform poorly on talker-independent continuous-speech recognition tasks that people perform without apparent difficulty. Although children learn to understand speech with little explicit supervision and adults take speech recognition ability for granted, it has proved to be a difficult task

*This work was sponsored by the Department of the Air Force. The views expressed are those of the author and do not reflect the official policy or position of the U.S. Government.
Neural Computation 1, 1-38 (1989)
© 1989 Massachusetts Institute of Technology
to duplicate with machines. As noted by Klatt (1986), this is due to variability and overlap of information in the acoustic signal, to the need for high computation rates (a human-like system must match inputs to 50,000 words in real time), to the multiplicity of analyses that must be performed (phonetic, phonemic, syntactic, semantic, and pragmatic), and to the lack of any comprehensive theory of speech recognition. The best existing speech recognizers perform well only in artificially constrained tasks. Performance is generally better when training data is provided for each talker, when words are spoken in isolation, when the vocabulary size is small, and when restrictive language models are used to constrain allowable word sequences. For example, talker-dependent isolated-word recognizers can be trained to recognize 105 words with 99% accuracy (Paul 1987). Large-vocabulary talker-dependent word recognition accuracy with sentence context can be as high as 95% for 20,000 words from sentences in office memos spoken with pauses between words (Averbuch et al. 1987). Accuracy for a difficult 997-word talker-independent continuous-speech task using a strong language model (an average of only 20 different words possible after any other word) can be as high as 96% (Lee and Hon 1988). This word accuracy score translates to an unacceptable sentence accuracy of roughly 50%. In addition, the word accuracy of this high-performance recognizer when tested with no grammar model is typically below 70% correct. Results such as these illustrate the poor low-level acoustic-phonetic matching provided by current recognizers. These recognizers depend heavily on constraining grammars to achieve good performance. Humans do not suffer from this problem. We can recognize clearly spoken but contextually inappropriate words in anomalous sentences such as "John drank the guitar" almost perfectly (Marslen-Wilson 1987).
The current best performing speech recognition algorithms use Hidden Markov Model (HMM) techniques. Good introductions to these techniques and to digital signal processing of speech are available in (Lee and Hon 1988; Parsons 1986; Rabiner and Juang 1986; Rabiner and Schafer 1978). The HMM approach provides a framework which includes an efficient decoding algorithm for use in recognition (the Viterbi algorithm) and an automatic supervised training algorithm (the forward-backward algorithm). New neural-net approaches to speech recognition must have the potential to overcome the limitations of current HMM systems. These limitations include poor low-level and poor high-level modeling. Poor low-level acoustic-phonetic modeling leads to confusions between acoustically similar words while poor high-level speech understanding or semantic modeling restricts applications to simple situations where finite state or probabilistic grammars are acceptable. In addition, the first-order Markov assumption makes it difficult to model coarticulation directly and HMM training algorithms can not currently learn the topological structure of word and sub-word models. Finally, HMM theory does not
(Figure 1 shows speech input passing through signal processing to produce spectral patterns (100 per second), pattern matching against word models consisting of stored spectral patterns and sequences, and pattern sequence classification producing word scores and a selected word.)
Figure 1: Block diagram of an isolated word recognizer.

specify the structure of implementation hardware. It is likely that the high computation and memory requirements of current algorithms will require new approaches to parallel hardware design to produce compact, large-vocabulary, continuous-speech recognizers.

2 The Potential of Neural Nets
Neural nets for speech recognition have been explored as part of the recent resurgence of interest in this area. Research has focused on evaluating new neural net pattern classification and training algorithms using real speech data and on determining whether parallel neural net architectures can be designed which perform the computations required by important speech recognition algorithms. Most work has focused on isolated-word recognition. A block diagram of a simple isolated-word recognizer is shown in figure 1. Speech is input to this recognizer and a word classification decision is output on the right. Three major operations are required. First, a preprocessor must extract important information from the speech waveform. In most recognizers, an input pattern containing spectral information from a frame of speech is extracted every 10 msec using Fast Fourier Transform (FFT) or Linear Predictive Coding (LPC) (Parsons 1986; Rabiner and Schafer 1978) techniques. Second, input patterns from the preprocessor must be compared to stored exemplar patterns in word models to compute local frame-to-frame distances. Local distances are used in a third step to time align input pattern sequences to stored exemplar pattern sequences that form word models and arrive at whole-word matching scores. Time alignment compensates for variations in talking rate and pronunciation. Once these operations have been performed, the selected word to output is that word with the highest whole-word matching score.

This paper reviews research on complete neural net recognizers and on neural nets that perform the above three operations. Auditory preprocessors that attempt to mimic cochlea and auditory nerve processing are first reviewed. Neural net structures that can compute local distance scores are then described. Classification results obtained using static speech patterns as inputs are then followed by results obtained with dynamic nets that allow continuous-time inputs. Techniques to integrate neural net and conventional approaches are then described, followed by a brief review of psychological and physiological models of temporal pattern sequence recognition. The paper ends with a summary and suggestions for future research. Emphasis throughout is placed on studies that used large public-domain speech data bases or that first presented new approaches.

3 Auditory Preprocessors
A preprocessor extracts important parameters from the speech waveform to compress the amount of data that must be processed at higher levels and provide some invariance to changes in noise, talkers, and the acoustic environment. Most conventional preprocessors are only loosely modeled on the cochlea and perform simple types of filtering and data compression motivated by Fourier analysis and information theory. Recent physiological studies of cochlea and auditory nerve responses to complex stimuli have led to more complex physiological preprocessors designed to closely mimic many aspects of auditory nerve response characteristics. Five of these preprocessors and the VLSI cochlea filter listed in table 1 are reviewed in this section. Good reviews of many of these preprocessors and of response properties of the cochlea and auditory nerve can be found in (Greenberg 1988a; 1988b). The five preprocessors in table 1 rely on periodicity or synchrony information in filter-bank outputs. Synchrony information is related to the short-term phase of a speech signal and can be obtained from the arrival times of nerve spikes on the auditory nerve. It could increase recognition performance by supplementing the spectral magnitude information used in current recognizers. Synchrony information is typically obtained by filtering the speech input using sharp bandpass filters with characteristics similar to those of the mechanical filters in the cochlea. The resulting filtered waveforms are then processed using various types of time domain analyses that could be performed using analog neural net circuitry.
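The threshold-crossing analysis behind synchrony front ends such as Ghitza's can be sketched in a few lines (a minimal illustration in Python, not any study's actual model; the function names and parameters here are invented for this sketch):

```python
import math

def crossing_intervals(x, threshold=0.0):
    """Sample counts between successive upward threshold crossings
    of one bandpass-filtered waveform."""
    crossings = [i for i in range(1, len(x)) if x[i - 1] < threshold <= x[i]]
    return [b - a for a, b in zip(crossings, crossings[1:])]

def interval_histogram(x, threshold=0.0, max_interval=64):
    """Histogram of crossing intervals; peaks mark dominant periodicities
    in the filter output, i.e. synchrony information."""
    hist = [0] * (max_interval + 1)
    for d in crossing_intervals(x, threshold):
        if d <= max_interval:
            hist[d] += 1
    return hist

# A 100-sample sinusoid with a period of 20 samples:
# its crossing intervals cluster at the period.
x = [math.sin(2 * math.pi * i / 20) for i in range(100)]
hist = interval_histogram(x)
```

A real front end would apply this analysis to the output of each channel of a cochlea-like filter bank and pool the per-channel histograms.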
Study | Processing | Comments
Deng and Geisler (1987) | Cross-Channel Correlation of Neural Outputs | Physiologically Plausible (Untested for Speech Recognition)
Ghitza (1988) | Creates Histogram of Time Intervals Between Threshold Crossings of Filter Outputs | Improved Speech Recognition in Noise
Hunt and Lefebvre (1988) | Periodicity and Onset Detection | Improved Speech Recognition in Noise and with Spectral Tilt
Lyon and Mead (1988) | Tapped Transmission Line Filter with 49 Outputs | Implemented Using Micropower VLSI Techniques
Seneff (1988) | Provides Periodicity and Spectral Magnitude Outputs | Synchrony Spectrograms Provide Enhanced Spectral Resolution (Untested for Speech Recognition)
Shamma (1988) | Lateral Inhibition Across Cochlea Filter Outputs | Physiologically Plausible (Untested for Speech Recognition)
Table 1: Recent Physiological Preprocessors.

Spectrograms created using physiological preprocessors for steady-state vowels and other speech sounds illustrate an improvement in the ability to visually identify vowel formants (resonant frequencies of the vocal tract) in noise (Deng and Geisler 1987; Ghitza 1988; Seneff 1988; Shamma 1988). Comparisons to more conventional front ends using existing speech recognizers have been performed by Beet (Beet et al. 1988), Ghitza (1988), and by Hunt and Lefebvre (1988). These comparisons demonstrated significant performance improvements in noise (Ghitza 1988; Hunt and Lefebvre 1988) and with filtering that tilts the
input spectrum up at high frequencies (Hunt and Lefebvre 1988). Extensive comparisons have not, however, been made between physiological preprocessors and conventional preprocessors when the conventional preprocessors incorporate current noise and stress compensation techniques. Positive results from such comparisons and more detailed theoretical analyses would do much to foster the acceptance of these new and computationally intensive front ends.

Lyon and Mead (1988) describe a filter bank that could be used in a physiological preprocessor. This filter bank was carefully modeled after the cochlea, provides 49 analog outputs, and has been implemented using micropower analog VLSI CMOS processing. Extra circuitry would be required to provide synchrony or spectral magnitude information for a speech recognizer. This recent work demonstrates how preprocessors can be miniaturized using analog VLSI techniques. The success of this approach is beginning to demonstrate that ease of implementation using VLSI techniques may be more important when comparing alternative neural net approaches than computational requirements on serial von Neumann machines.

4 Computing Local Distance Scores
Conventional speech recognizers compute local frame-to-frame distances by comparing each new input pattern (vector of parameters) provided by a preprocessor to stored reference patterns. Neural net architectures can compute local frame-to-frame distances using fine-grain parallelism for both continuous-observation and discrete-observation recognizers. New neural net algorithms can also perform vector quantization and reduce the dimensionality of input patterns.

Local distances for continuous-observation recognizers are functions related to log likelihoods of probability distributions. Simple log likelihood functions such as those required for independent Gaussian or binomial distributions can be calculated directly without training using single-layer nets with threshold-logic nonlinearities (Lippmann 1987; Lippmann et al. 1987). More complex likelihood functions can be computed using multilayer perceptrons (Cybenko 1988; Lapedes and Farber 1988; Lippmann et al. 1987), hierarchical nets that compute kernel functions (Albus 1981; Broomhead and Lowe 1988; Hanson and Burr 1987; Huang and Lippmann 1988; Moody 1988; Moody and Darken 1988), or high-order nets (Lee et al. 1986; Rumelhart et al. 1986a). Training to produce these complex functions is typically longest with multilayer perceptrons. These nets, however, often provide architectures with fewer nodes, simpler nodal processing elements, and fewer weights. They also may develop internal abstractions in hidden layers that can be related to meaningful acoustic-phonetic speech characteristics such as formant transitions and that also could be applied to many different speech recognition tasks.

Discrete-observation recognizers first perform vector quantization and label each input with one particular symbol. Symbols are used to calculate local distances via look-up tables that contain symbol probabilities for each reference pattern. The look-up table calculation can be performed by simple single-layer perceptrons. The perceptron for any reference pattern must have as many inputs as there are symbols. Weights must equal symbol probabilities and all inputs must be equal to zero except for that corresponding to the current input symbol. Alternatively, a multilayer perceptron could be used to store probabilities for symbols that have been seen and interpolate between these probabilities for unseen symbols.

The vector quantization operation can be performed using an architecture similar to that used by Kohonen's feature-map net (Kohonen 1984). Inputs to the feature-map net feed an array of codebook nodes containing one node for each symbol. Components of the Euclidean distance between the input and the reference pattern represented by weights to each node are computed in each node. The codebook node with the smallest Euclidean distance to the input is selected using lateral inhibition or other maximum-picking techniques (Lippmann et al. 1987). This process guarantees that only the one node with the minimum Euclidean distance to the input has a unity output, as required. Weights used in this architecture can be calculated using the feature-map algorithm or any other standard vector quantization algorithm based on Euclidean distances, such as k-means clustering (Duda and Hart 1973). Kohonen's feature-map vector quantizer is an alternative, sequentially trained neural net algorithm. It has been tested successfully in an experimental speech recognizer (Kohonen 1988; Kohonen et al. 1984) but not evaluated with a large public speech data base.
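The codebook computation described above can be illustrated with a small sketch (plain Python; k-means stands in for the feature-map training algorithm, and all names here are invented for illustration):

```python
def euclidean2(a, b):
    """Squared Euclidean distance between two frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(frame, codebook):
    """Return the index of the nearest codebook entry: the symbol
    a discrete-observation recognizer would assign to this frame."""
    return min(range(len(codebook)), key=lambda i: euclidean2(frame, codebook[i]))

def kmeans(frames, k, iterations=10):
    """Plain k-means clustering, one standard way to train the codebook."""
    codebook = [list(f) for f in frames[:k]]            # naive initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in frames:
            clusters[quantize(f, codebook)].append(f)
        for i, members in enumerate(clusters):
            if members:                                  # keep old center if empty
                dim = len(members[0])
                codebook[i] = [sum(m[d] for m in members) / len(members)
                               for d in range(dim)]
    return codebook

# Toy 2-dimensional "spectral frames" forming two clusters:
frames = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.1)]
cb = kmeans(frames, 2)
```

In the neural net architecture, the `min` operation above is what lateral inhibition or another maximum-picking circuit would compute in parallel.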
A version with a small number of nodes but including training logic has been implemented in VLSI (Mann et al. 1988). Experiments with a discrete-observation HMM recognizer (Mann et al. 1988) and with a template-based recognizer (Naylor and Li 1988) demonstrated that this algorithm provides performance similar to that provided by conventional clustering procedures such as k-means clustering (Duda and Hart 1973). The feature-map algorithm incrementally trains weights to a two-dimensional grid of nodes such that after training, nodes that are physically close in the grid correspond to input patterns that are close in Euclidean distance. One advantage of this topological organization is that averaging outputs of nodes that are physically close using nodes at higher levels corresponds to a probability smoothing technique often used in speech recognizers called Parzen smoothing (Duda and Hart 1973). This averaging can be performed by nodes with limited fan-in and short connections. The auto-associative multilayer perceptron (Elman and Zipser 1987; Hinton 1987) is a neural net algorithm that reduces the dimensionality of continuous-valued inputs. It is a multilayer perceptron with the same
number of input and output nodes and one or more layers of hidden nodes. This net is trained to reproduce the input at the output nodes through a small layer of hidden nodes. Outputs of hidden nodes after training can be used as reduced-dimension inputs for speech processing as described in (Elman and Zipser 1987; Fallside et al. 1988). Recent theoretical analyses have demonstrated that auto-associative networks are closely related to a standard statistical technique called principal components analysis (Baldi and Hornik 1989; Bourlard and Kamp 1988). Auto-associative nets are thus not a new analytical tool but instead a technique to perform the processing required by principal components analysis.
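As a rough illustration of this equivalence, the principal direction that a linear auto-associative net with one hidden node would converge to can be computed directly (a hypothetical sketch using power iteration, not any study's code):

```python
def principal_component(data, iterations=50):
    """Leading eigenvector of the sample covariance matrix, by power
    iteration. A linear auto-associative net with one hidden node
    converges to the span of this same direction."""
    n, dim = len(data), len(data[0])
    mean = [sum(x[d] for x in data) / n for d in range(dim)]
    centered = [[x[d] - mean[d] for d in range(dim)] for x in data]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(dim)]
           for i in range(dim)]
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return v

# Points spread mostly along the x = y diagonal, so the principal
# direction is close to (1, 1) normalized:
data = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.0), (4, 4.1)]
v = principal_component(data)
```

Projecting each centered input onto `v` gives the same one-dimensional representation that the hidden node of the trained linear auto-associative net would provide, up to scale.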
5 Static Classification of Speech Segments

Many neural net classifiers have been applied to the problem of classifying static input patterns formed from a spectral analysis of pre-segmented words, phonemes, and vowels. Table 2 summarizes results of some representative studies. Introductions to many of the classifiers listed in this table and to neural net training algorithms are available in (Cowan and Sharp 1988; Hinton 1987; Lippmann et al. 1987). Unless otherwise noted, error rates in this and other tables refer to talker-dependent training and testing, multilayer perceptrons were trained using back-propagation (Rumelhart et al. 1986a), and systems were trained and tested on different data sets. The number of tokens in this and other tables refers to the total number of speech samples available for both training and testing, and the label "multi-talker" refers to results obtained by testing and training using data from the same group of talkers. The label "talker-independent" refers to results obtained by training using one group of talkers and testing using a separate group with no common members.

Input patterns for studies in table 2 were applied at once as one whole static spectrographic (frequency versus time) pattern. Neural nets were static and didn't include internal delays or recurrent connections that could take advantage of the temporal nature of the input for real-time processing. This approach might be difficult to incorporate in real-time speech recognizers because it would require long delays to perform segmentation and form the input patterns in an input storage buffer. It would also require accurate pre-segmentation of both testing and training data for good performance. This pre-segmentation was performed by hand in many studies. Multilayer perceptrons and hierarchical nets such as the feature-map classifier and Kohonen's learning vector quantizer (LVQ) have been used to classify static patterns.
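The kind of back-propagation training used throughout these studies can be sketched in a few lines (a toy net on XOR-like data, illustrative only; the studies in table 2 used spectral inputs and far larger nets, and every name here is invented for this sketch):

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class MLP:
    """A 2-input, H-hidden, 1-output perceptron trained by back-propagation."""
    def __init__(self, hidden=4, seed=0):
        rnd = random.Random(seed)
        # Each hidden node: two input weights plus a bias.
        self.w1 = [[rnd.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
        # Output node: one weight per hidden node plus a bias.
        self.w2 = [rnd.uniform(-1, 1) for _ in range(hidden + 1)]

    def forward(self, x):
        self.h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in self.w1]
        s = sum(wi * hi for wi, hi in zip(self.w2, self.h)) + self.w2[-1]
        self.o = sigmoid(s)
        return self.o

    def train(self, x, target, lr=0.5):
        o = self.forward(x)
        delta_o = (o - target) * o * (1 - o)        # output error term
        for i, hi in enumerate(self.h):
            delta_h = delta_o * self.w2[i] * hi * (1 - hi)  # back-propagated
            self.w2[i] -= lr * delta_o * hi
            self.w1[i][0] -= lr * delta_h * x[0]
            self.w1[i][1] -= lr * delta_h * x[1]
            self.w1[i][2] -= lr * delta_h
        self.w2[-1] -= lr * delta_o

# XOR-like data: no single hyperplane separates the classes, so a
# single-layer perceptron cannot solve it but a hidden layer can.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
net = MLP()
for _ in range(5000):
    for x, t in data:
        net.train(x, t)
```

The slow convergence the text notes (tens of thousands of trials in some studies) is visible even in this toy setting: thousands of passes over four patterns are needed.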
Excellent talker-dependent recognition accuracy near that of experimental HMM and commercial recognizers has been provided by multilayer perceptrons using small sets of words and digits. Hierarchical nets have provided performance similar to that of
Study | Network | Speech Materials | Error Rate
Elman and Zipser (1987) | Multilayer Perceptron (MLP), 16 x 20 Inputs | 1 Talker, CV's /b,d,g/ /i,a,u/, 505 Tokens | Cons. - 5%, Vowels - 0.5%
Huang and Lippmann (1988) | MLP, Feature Map Classifier (FMC), 2 Inputs | 67 Talkers, 10 Vowels, 671 Tokens | Gaussian, FMC, MLP - 20%; FMC Trains Fastest
Kammerer and Kupper (1988) | MLP, 16 x 16 Inputs | 11 Talkers, 20 Words, 5720 Tokens | Talker Dep. - 0.4%, Talker Indep. - 2.7%
Kohonen (1988) | Learning Vector Quantizer (LVQ), 15 Inputs | Labeled Finnish Speech, 3010 Tokens | Gaussian - 12.9%, kNN - 12.0%, LVQ - 10.9%
Lippmann and Gold (1987) | MLP, 11 x 2 Inputs | 16 Talkers, 7 Digits, 2,912 Tokens | Gaussian - 8.7%, kNN - 6%, MLP - 7.6%
Peeling and Moore (1987) | MLP, 19 x 60 Inputs | 40 Talkers, 10 Digits, 16,000 Tokens | Talker Dep. - 0.3%, Multi Talker - 1.9%
Table 2: Recognition of Speech Patterns Using Static Neural Nets.

multilayer perceptrons but with greatly reduced training times and typically more connection weights and nodes.

5.1 Multilayer Perceptrons. Multilayer perceptron classifiers have been applied to speech problems more often than any other neural net classifier. A simple example from Huang and Lippmann (1988) presented in figure 2 illustrates how these nets can form complex decision regions with speech data. Input data obtained by Peterson and Barney (1952)
consisted of the first two formants from vowels spoken by men, women, and children. Decision regions shown in the right side of figure 2 were formed by the two-layer perceptron with 50 hidden nodes, trained using back-propagation, shown on the left. Training required more than 50,000 trials. Decision region boundaries are near those that are typically drawn by hand to separate vowel regions, and the performance of this net is near that provided by commonly used conventional k-nearest neighbor (kNN) and Gaussian classifiers (Duda and Hart 1973).

A more complex experiment was performed by Elman and Zipser (1987) using spectrographic-like inputs. Input patterns formed from 16 filter-bank outputs sampled 20 times over a time window of 64 msec were fed to nets with one hidden layer and 2 to 6 hidden nodes. The analysis time window was centered by hand on the consonant voicing onset. Networks were trained to recognize consonants or vowels in consonant-vowel (CV) syllables composed of the consonants /b,d,g/ and the vowels /i,a,u/. Error rates were roughly 5% for consonant recognition and 0.5% for vowel recognition. An analysis indicated that hidden nodes often become feature detectors and differentiate between important subsets of sound types such as consonants versus vowels. This study demonstrated the importance of choosing a good data representation for speech and of normalizing speech inputs. It also raised the important question of training time because many experiments on this small data base required more than 100,000 training trials.

Lippmann and Gold (1987) performed another early study to compare multilayer perceptrons and conventional classifiers on a digit classification task. This study was motivated by single-talker results obtained
Figure 2: Decision regions formed by a 2-layer perceptron using back-propagation training and vowel formant data. (The figure shows the network, with first- and second-formant inputs and one output node for each of ten vowels, and the decision regions it forms in the F1-F2 plane, labeled with the vowels of words such as heed, hid, head, had, hud, hod, hawed, who'd, and heard.)
by Burr (1988a). Inputs were 22 cepstral parameters from two speech frames located automatically by finding the maximum-energy frame for each digit. One- to three-layer nets with from 16 to 256 nodes in each hidden layer were evaluated using digits from the Texas Instruments (TI) 20-Word Speech Data Base (Doddington and Schalk 1981). Multilayer perceptron classifiers outperformed a Gaussian but not a kNN classifier. Hidden layers were required for good performance. A single-layer perceptron provided poor performance, much longer training times, and sometimes never converged during training. Most rapid training (fewer than 1000 trials) was provided by all three-layer perceptrons. These results demonstrate that the simple hyperplane decision regions provided by single-layer perceptrons are sometimes not sufficient and that rapid training and good performance can be obtained by tailoring the size of a net for a specific problem. The digit data used in these experiments was also used to test a multilayer perceptron chip implemented in VLSI (Raffel et al. 1987). This chip performed as well as computer simulations when down-loaded with weights from those simulations.

Kammerer and Kupper obtained surprisingly good recognition results for words from the TI 20-word data base (Kammerer and Kupper 1988). A single-layer perceptron with spectrogram-like input patterns performed slightly better than a DTW template-based recognizer. Words were first time normalized to provide 16 input frames with 16 2-bit spectral coefficients per frame. Expanding the training corpus by temporally distorting training tokens reduced the error slightly, and best performance was provided by single- and not multilayer perceptrons. Talker-dependent error rates were 0.4% (14/3520) for the single-layer perceptron and 0.7% (25/3520) for the DTW recognizer.
These error rates are better than all but one of the commercial recognizers evaluated in (Doddington and Schalk 1981) and demonstrate good performance for a single-layer perceptron without hidden nodes. Talker-independent performance was evaluated by leaving out the training data for each talker, one at a time, and testing using that talker's test data. Average talker-independent error rates were 2.7% (155/5720) for the single-layer perceptron and 2.5% (145/5720) for the DTW recognizer. Training time was 6 to 25 minutes per talker on an array processor for the talker-dependent studies and 5 to 9 hours for the talker-independent studies.

Peeling and Moore (1987) obtained extremely good recognition results for digit classification. A multilayer perceptron with one hidden layer and 50 hidden nodes provided best performance. Its talker-dependent error rate was low and near that provided by an advanced HMM recognizer. Spectrogram-like input patterns were generated using a 19-channel filter-bank analyzer with 20 msec frames. Nets could accommodate 60 input frames (1.2 seconds), which was enough for the longest duration word. Shorter words were padded with zeros and positioned randomly in the 60-frame input buffer. Nets were trained using different numbers of layers and hidden units and speech data from the RSRE
40-speaker digit data base. Multi-talker experiments explored performance when recognizers were tested and trained using data from all talkers. Error rates were near zero (0.25%, 5/2000) for talker-dependent experiments and low (1.9%, 78/4000) for multi-talker experiments. Error rates of an advanced HMM recognizer under the same conditions were 0.2% (4/2000) and 0.6% (25/4000) respectively. The computation required for recognition using multilayer perceptrons was typically less than one fifth of that required for the HMM recognizer.

The good small-vocabulary word recognition results obtained by both Kammerer and Kupper (1988) and Peeling and Moore (1987) suggest that back-propagation can develop internal feature detectors to extract important invariant acoustic events. These results must be compared to those of other experiments which attempted to classify digits without time alignment. Burton, Shore, and Buck (Burton et al. 1985; Shore and Burton 1983) demonstrated that talker-dependent error rates using the TI 20-Word Data Base can be as low as 0.3% (8/2560) for digits and 0.8% (40/5120) for all words using simple vector-quantization recognizers that do not perform time alignment. These results suggest that digit recognition is a relatively simple task where dynamic time alignment is not necessary and talker-dependent accuracy remains high even when temporal information is discarded. The good performance of multilayer perceptrons is thus not surprising. These studies and the multilayer perceptron studies do, however, suggest designs for implementing computationally efficient real-time digit and small-vocabulary recognizers using analog neural-net VLSI processing.

5.2 Hierarchical Neural Nets that Compute Kernel Functions. Hierarchical neural net classifiers which use hidden nodes that compute kernel functions have also been used to classify speech patterns.
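The kernel computation performed by such hidden nodes can be sketched as follows (a generic Gaussian radial basis function; the centers and width here are arbitrary illustrative values, not taken from any of the studies below):

```python
import math

def rbf_node(x, center, width):
    """Output of one kernel-function (radial basis) hidden node:
    near one when the input is close to the node's center,
    near zero far away."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-d2 / (2.0 * width ** 2))

def rbf_layer(x, centers, width):
    """Hidden-layer outputs for one input pattern."""
    return [rbf_node(x, c, width) for c in centers]

# Two centers in a 2-dimensional (e.g. formant-like) input space:
centers = [(0.0, 0.0), (1.0, 1.0)]
h = rbf_layer((0.1, 0.0), centers, width=0.5)
```

A linear output layer on top of these hidden outputs can then be trained rapidly, which is the source of the training-time advantage discussed below.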
These nets have the advantage of rapid training and the ability to use combined supervised/unsupervised training data. Huang and Lippmann (1988) described a net called a feature-map classifier and evaluated the performance of this net on the vowel data plotted in figure 2 and on difficult artificial problems. A block diagram of the feature-map classifier is shown in figure 3. Intermediate codebook nodes in this net compute kernel functions related to the Euclidean distance between the input and cluster centers represented by these nodes. The lower feature-map net is first trained without supervision to form a vector quantizer and the upper perceptron-like layer is then trained with supervision using a modified version of the LMS algorithm. This classifier was compared to the multilayer perceptron shown in figure 2 and to a kNN classifier. All classifiers provided an error rate of roughly 20%. The 2-layer perceptron, however, required more than 50,000 supervised training trials for convergence. The feature-map classifier reduced the amount of supervised training required by three orders of magnitude, to fewer than 50 trials. Similar results were obtained with artificial problems.

Figure 3: Block diagram of the hierarchical feature-map classifier. (The figure shows the input feeding a lower layer trained without supervision and an upper layer trained with supervision to produce the output.)

Kohonen and co-workers (Kohonen et al. 1988) compared a neural-net classifier called a learning vector quantizer (LVQ) to Bayesian and kNN classifiers. The structure of the learning vector quantizer is similar to that of the feature-map classifier shown in figure 3. Training differs from that used with the feature-map classifier in that a third stage of supervised training is added which adjusts weights to intermediate codebook nodes when a classification error occurs. Adjustments alter decision region boundaries slightly but maintain the same number of codebook nodes. Bayesian, kNN, and LVQ classifiers were used to classify 15-channel speech spectra manually extracted from stationary regions of Finnish speech waveforms. All classifiers were tested and trained with separate sets of 1550 single-frame patterns that were divided into 18 phoneme classes (Kohonen et al. 1988). A version of the LVQ classifier with 117 codebook nodes provided the lowest error rate of 10.9%, averaging over results where training and testing data sets are interchanged. The Bayesian and kNN classifiers had slightly higher error rates of 12.9% and 12.0% respectively. Training time for the LVQ classifier was roughly 10 minutes on an IBM PC/AT. These results and those of
Huang and Lippmann (1988) demonstrate that neural nets that use kernel functions can provide excellent performance on speech tasks using practical amounts of training time. Other experiments on artificial problems described in (Kohonen et al. 1988) illustrate trade-offs in training time. Boltzmann machines provided near-optimal performance on these problems, followed by the LVQ classifier and multilayer perceptrons. Training times were 5 hours on an array processor for the Boltzmann machine, 1 hour on a Masscomp MC 5600 for the multilayer perceptron, and roughly 20 minutes on the Masscomp for the LVQ classifier.

Two recent studies (Niranjan and Fallside 1988; Bridle 1988) have begun to explore a hierarchical net where nodes in a hidden layer compute kernel functions called radial basis functions (Broomhead and Lowe 1988). These nets are similar to previous classifiers that use the method of potential functions (Duda and Hart 1973). They have an advantage over multilayer perceptrons in that once the locations of the kernel functions are established, weights to the output nodes are determined uniquely by solving a least squares problem using matrix-based approaches. Initial results with small amounts of speech data consisting of vowels (Niranjan and Fallside 1988) and words (Bridle 1988) have been encouraging. Further work must explore techniques to assign the locations of kernel functions and adjust scale factors that determine the range of influence of each kernel function.

6 Dynamic Classification of Speech Segments
New dynamic neural net classifiers that incorporate short delays, temporal integration, or recurrent connections have been developed specifically for speech recognition. Spectral inputs for these classifiers are applied to input nodes sequentially, one frame at a time. These classifiers could thus be integrated into real-time speech recognizers more easily than static nets because accurate pre-segmentation is typically not required for good performance and only short delays are used. Both multilayer nets with delays and nets with recurrent connections have been used to classify acoustically similar words, consonants, and vowels. Excellent performance has been obtained using time-delay nets in many studies including those by Lang and Hinton (1988) and by Waibel et al. (1987; 1988). Performance for small vocabularies often slightly exceeded that provided by high-performance experimental HMM recognizers. Techniques have also been developed to scale nets up for larger vocabularies and to speed up training times for both feed-forward and recurrent nets. Rapid training has been demonstrated using a hierarchical learning vector quantizer with delays, and Boltzmann machines have provided good performance but with extremely long training times.
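The time-delay idea described above can be sketched as a one-dimensional convolution over the frame sequence with a temporal integration stage at the output. The following is a minimal NumPy illustration with random weights; the layer sizes (3-frame and 5-frame windows, 16 coefficients per frame) loosely echo the Waibel et al. architecture discussed later in this section, but this is a toy sketch, not a reimplementation of any cited system:

```python
import numpy as np

def tdnn_layer(frames, w, b):
    """Apply one time-delay layer: each output frame sees a short
    window of input frames with shared weights (a 1-D convolution)."""
    n_frames, n_in = frames.shape
    win = w.shape[0] // n_in                    # window length in frames
    outs = []
    for t in range(n_frames - win + 1):
        window = frames[t:t + win].reshape(-1)  # concatenate the window
        outs.append(np.tanh(window @ w + b))    # sigmoid-like nonlinearity
    return np.array(outs)

def classify(frames, w1, b1, w2, b2):
    """Two time-delay layers followed by integration over time."""
    h = tdnn_layer(frames, w1, b1)
    o = tdnn_layer(h, w2, b2)
    return o.sum(axis=0).argmax()               # integrate, pick top class

rng = np.random.default_rng(0)
frames = rng.normal(size=(15, 16))              # 15 frames of 16 coefficients
w1 = rng.normal(size=(3 * 16, 8)); b1 = np.zeros(8)  # 3-frame window, 8 hidden
w2 = rng.normal(size=(5 * 8, 3)); b2 = np.zeros(3)   # 5-frame window, 3 classes
label = classify(frames, w1, b1, w2, b2)
```

Because the window weights are shared across time, the net tolerates modest misalignment of the input token, which is why accurate pre-segmentation is less critical than for static classifiers.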
6.1 Time-Delay Multilayer Perceptrons.

Some of the most promising neural-net recognition results have been obtained using multilayer perceptrons with delays and some form of temporal integration in output nodes (Lang and Hinton 1988; Waibel et al. 1987; Waibel et al. 1988). Table 3 summarizes results of six representative studies. Early results on consonant and vowel recognition were obtained by Waibel and co-workers (Waibel et al. 1987) using the multilayer perceptron with time delays shown in figure 4.

Study                     Network               Speech Materials            Error Rate
Lang and Hinton (1988)    Time Delay MLP,       100 Talkers, "B,D,E,V",     Multi Talker - 7.8%
                          16 Inputs             768 Tokens
Unnikrishnan, Hopfield,   Time Concentration    1 Talker, Digits,           0.7%
and Tank (1988)           Net, 32 Inputs        432 Tokens
Waibel et al. (1987)      Time Delay MLP,       3 Japanese Talkers,         /b,d,g/ - 1.5%
                          16 Inputs             /b,d,g/, Many Contexts,
                                                > 4,000 Tokens
Waibel, Sawai, and        Time Delay MLP,       1 Japanese Talker,          /b,d,g,p,t,k/ - 1.4%
Shikano (1988)            16 Inputs             18 Cons., 5 Vowels,         18 Cons. - 4.1%
                                                > 10,000 Tokens             5 Vowels - 1.4%
Watrous (1988)            Temporal Flow         1 Talker, Phonemes,         /b,d,g/ - 0.8%
                          Structured MLP,       Words, > 2,000 Tokens       rapid/rabid - 0.8%
                          16 Inputs                                         /i,a,u/ - 0.0%
McDermott and             Time Delay LVQ,       3 Japanese Talkers,         /b,d,g/ - 1.7%
Katagiri (1988)           16 Inputs             /b,d,g/, > 4,000 Tokens

Table 3: Recognition of Speech Using Time-Delay Neural Nets.

Figure 4: A time-delay multilayer perceptron.

The boxes labeled τ in this figure represent fixed delays. Spectral coefficients from 10 msec speech frames (16 per frame) are input on the lower left. The three boxes on the bottom thus represent an input buffer containing a context of three frames. Outputs of the nodes in these boxes (16 x 3 spectral coefficients) feed 8 hidden nodes in the first layer. Outputs from these nodes are buffered across the five boxes in the first hidden layer to form a context of five frames. Outputs from these boxes (8 x 5 node outputs) feed three hidden nodes in the second hidden layer. Outputs from these three nodes are integrated over time in a final output node. In initial experiments (Waibel et al. 1987), the time-delay net from figure 4 was trained using back-propagation to recognize the voiced stops /b,d,g/. Separate testing and training sets of 2000 voiced stops spoken by three talkers were excised manually from a corpus of 5260 Japanese words. Excised portions sampled the consonants in varying phonetic contexts and contained 15 frames (150 msec) centered by hand around the vowel onset. The neural net classifier provided an error rate of 1.5% compared to an error rate of 6.5% provided by a simple discrete-observation HMM recognizer. Training the time-delay net took several days on a four-processor Alliant computer. More recent work (Waibel et al. 1988) has led to techniques that merge smaller nets designed to recognize small sets of
consonants and vowels into large nets which can recognize all consonants at once. These techniques greatly reduce training time, improve performance, and are a practical approach to the scaling problem. Experiments resulted in low error rates of 1.4% for the consonants /b,d,g,p,t,k/ and 1.4% for the vowels /i,a,u,e,o/. The largest net designed from smaller subnets provided a talker-dependent error rate for one talker of 4.1% for 18 consonants. An advanced discrete-observation HMM recognizer provided an error rate of 7.3% on this task. These two studies demonstrate that good performance can be provided by time-delay nets when the network structure is tailored to a specific problem. They also demonstrate how small nets can be scaled up to solve large classification problems without scaling up training times substantially. Lang and Hinton (1988) describe an extensive series of experiments that led to a similar high-performance time-delay net. This net was designed to classify the four acoustically similar isolated words "B", "D", "E", and "V" that are the most confusable subset of the spoken alphabet. A multi-talker recognizer for 100 male talkers was first trained and tested using pre-segmented 144 msec speech samples taken from around the vowel onset in these words. A technique called multi-resolution training was developed to shorten training time. This involved training nets with smaller numbers of hidden nodes, splitting weight values to hidden nodes to create larger desired nets, and then re-training the larger nets. A multi-resolution trained net provided an error rate of 8.6%. This result, however, required careful pre-segmentation of each word. Pre-segmentation was not required by another net which allowed continuous speech input and classified the input as the word corresponding to the output node whose output value reached the highest level.
Training used simple automatic energy-based segmentation techniques to extract 216 msec of speech from around the vowel onset in each word. This resulted in an error rate of 9.5%. Outputs were then trained to be high and correct for the 216 msec speech segments as before, but also low for counter-example inputs selected randomly from the left-over background noise and vowel segments. Inclusion of counter-examples reduced the error rate to 7.8%. This performance compares favorably with the 11% error rate estimated for an enhanced HMM recognizer on this data base, based on its performance with the complete E-set (Bahl et al. 1988; Lang and Hinton 1988). Watrous (1988) also explored multilayer perceptron classifiers with time delays that extended earlier exploratory work on nets with recurrent connections (Watrous and Shastri 1987). These multilayer nets differed from those described above in that recurrent connections were provided on output nodes, target outputs were Gaussian-shaped pulses, and the delays and network structure were carefully adjusted by hand to extract important speech features for each classification task. Networks were tested using hand-segmented speech and isolated words from one talker. Good discrimination was obtained for many different recognition tasks.
For example, the error rate was 0.8% for the consonants /b,d,g/, 0.8% for the word pair "rapid/rabid," and 0.0% for the vowels /i,a,u/. Watrous has also explored the use of gradient methods of nonlinear optimization to decrease training time (Watrous 1986). Rossen et al. (1988) recently described another time-delay classifier. It uses more complex input data representations than the time-delay nets described above and a brain-state-in-a-box neural net classifier to integrate information over time from lower-level networks. Good classification performance was obtained for six stop consonants and three vowels. Notable features of this work are training to reject noise inputs as in (Lang and Hinton 1988) and the use of modular techniques to build large nets from smaller trained modules as in (Waibel et al. 1988). Other recent work demonstrating good phoneme and syllable classification using structured multilayer perceptron nets with delays is described in (Harrison and Fallside 1988; Homma et al. 1988; Irino and Kawahara 1988; Kamm et al. 1988; Leung and Zue 1988). Unnikrishnan, Hopfield, and Tank (1988) obtained low error rates on digit classification using a time-concentration neural net that does not rely on simple delays alone. This net, described in (Tank and Hopfield 1987), uses variable-length delay lines designed to disperse impulsive inputs such that longer delays result in more dispersion. Impulsive inputs to these delay lines are formed by enhancing spectral peaks in the outputs of 32 bandpass filters. Outputs of the delay lines are multiplied by weights and summed to form separate matched filters for each word. These matched filters concentrate energy in time and produce a large output pulse at the end of the correct word. Limited evaluations reported in (Unnikrishnan et al. 1988) for digit strings from one talker demonstrated good performance using a modified form of back-propagation training.
A prototype version of this recognizer using discrete analog electronic devices was also constructed (Tank and Hopfield 1987). Tests performed by Gold with a large speech data base and a hierarchical version of the time-concentration net that included both allophone and word models yielded performance that was no better than that of an existing HMM recognizer (Gold 1988).

6.2 Hierarchical Nets that Compute Kernel Functions.

McDermott and Katagiri (1988) used Kohonen's LVQ classifier on the same /b,d,g/ speech data base used by Waibel et al. (1987). They were able to obtain an error rate of 1.7%, which is not statistically different from the 1.5% error rate obtained by Waibel et al. using the time-delay net shown in figure 4 (Waibel et al. 1987). Inputs for the LVQ classifier consisted of a 7-frame window of 16 filterbank outputs. The nearest of 150 codebook nodes were determined as the 15-frame speech samples were passed through this 7-frame window. The normalized distances between nearest nodes and 112-element input patterns were integrated over time and used to classify speech inputs. The error rate without the final stage of LVQ training was high (7.3%). It dropped to 1.7% after LVQ training was complete. This result demonstrates that nets with kernel functions and delays can perform as well as multilayer perceptrons with delays. These nets train faster but require more computation and memory during use. In this application, for example, the LVQ classifier required 17,000 weights, more than 30 times as many as the time-delay net used in (Waibel et al. 1987). If memory is not an important limitation, rapid search techniques such as hashing and k-d trees described in (Omohundro 1987) can be applied to the LVQ classifier to greatly reduce the time required to find nearest neighbors. This would make the differences in computation time between these alternative approaches small on existing serial von Neumann computers.

6.3 Nets with Recurrent Connections.

Nets with recurrent connections have not been used as extensively for speech recognition problems as feed-forward nets because they are more difficult to train, analyze, and design. Table 4 summarizes results of three representative studies. Initial work explored the use of recurrent Boltzmann machines. These nets typically provided good performance on small problems but required extremely long training times. More recent studies have focused on modified back-propagation training algorithms described in (Almeida 1987; Jordan 1986; Pineda 1987; Rohwer and Forrest 1987; Rumelhart et al. 1986a; Watrous 1988) that can be used with recurrent nets and time-varying inputs.
Study                     Network            Speech Materials            Error Rate
Anderson, Merrill,        Recurrent Net,     20 Talkers, CVs             Talker Indep. - 13.1%
and Port (1988)           36 Inputs          /b,d,g,p,t,k/ + /a/,
                                             561 Tokens
Prager, Harrison, and     Boltzmann          6 Talkers, 11 Vowels,       Multi Talker - 15%
Fallside (1986)           Machine,           264 Tokens
                          2048 Inputs
Robinson and              Recurrent Net,     7 Talkers, 27 Phonemes,     Talker Dep. - 22.7%
Fallside (1988b)          20 Inputs          558 Sentences               Multi Talker - 30.8%

Table 4: Recognition of Speech Using Recurrent Neural Nets.
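The feedback structure common to these studies — the previous output vector delayed one time step and fed back as a "state" input alongside the current frame, as in figure 5 — can be sketched as follows. This is a toy NumPy illustration with random, untrained weights; the sizes (20 inputs, 27 outputs) merely echo the Robinson and Fallside entry in table 4:

```python
import numpy as np

def recurrent_forward(frames, w_in, w_state, w_out):
    """Recurrent net in the style of figure 5: the previous output
    vector y(t-1) is fed back as a state input at each frame."""
    state = np.zeros(w_out.shape[1])              # y(t-1), initially zero
    outputs = []
    for x in frames:
        h = np.tanh(x @ w_in + state @ w_state)   # hidden sees frame + state
        state = np.tanh(h @ w_out)                # new output is next state
        outputs.append(state)
    return np.array(outputs)

rng = np.random.default_rng(1)
frames = rng.normal(size=(10, 20))    # 10 speech frames, 20 inputs each
w_in = rng.normal(size=(20, 12))      # input -> hidden
w_state = rng.normal(size=(27, 12))   # fed-back outputs -> hidden
w_out = rng.normal(size=(12, 27))     # hidden -> 27 phoneme outputs
y = recurrent_forward(frames, w_in, w_state, w_out)
```

Training such a net with back-propagation requires unrolling this loop over time, which is exactly the replication-at-every-time-step procedure attributed to Rumelhart et al. below.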
Figure 5: A recurrent neural net classifier.

Prager, Harrison, and Fallside (Prager et al. 1986) performed one of the first experiments to evaluate the use of Boltzmann machines for speech recognition. At the time this study was performed, the Boltzmann machine training algorithm described in (Ackley et al. 1985) was the only well-known technique that could be used to train nets with recurrent connections. This training algorithm is computationally intensive because simulated annealing procedures (Kirkpatrick et al. 1983) are used to perform a probabilistic search of connection weights. Binary input and output data representations were developed to apply Boltzmann machines to an 11-vowel recognition task. One successful net used 2048 input bits to represent 128 spectral values and 8 output bits to specify the vowel. Nets typically contained 40 hidden nodes and 7320 links. Training used 264 tokens from 6 talkers and required 6 to 15 hours of processing on a high-speed array processor. The resulting multi-talker error rate was 15%. Prager, Harrison, and Fallside (Prager et al. 1986) also explored the use of a Boltzmann machine recognizer inspired by single-order Markov model approaches to speech recognition. A block diagram of this recurrent net is presented in figure 5. The output of this net is delayed and fed back to the input to "carry" nodes that provide information about the prior state. This net was trained to identify words in two sentences spoken by one talker. Training required 4 to 5 days of processing on a VAX 11/750 computer and performance was nearly perfect on the training sentences. Other recent work on Boltzmann machines (Bengio
and De Mori 1988; Kohonen et al. 1988; Prager and Fallside 1987) demonstrates that good performance can be provided at the expense of excessive training time. Preliminary work on analog VLSI implementations of the training algorithm required by Boltzmann machines has demonstrated practical learning times for small hardware networks (Alspector and Allen 1987). Many types of recurrent nets have been proposed that can be trained with modified forms of back-propagation. Jordan (1986) appears to have been the first to study nets with recurrent connections from output to input nodes as in figure 5. He used these nets to produce pattern sequences. Bourlard and Wellekens (1988) recently proved that such nets could be used to calculate local probabilities required in HMM recognizers, and Robinson and Fallside (1988a) pointed out the relationship between these nets and the state space equations used in classical control theory. Nets with recurrent self-looping connections on hidden and output nodes were studied by Watrous and Shastri (1987) for a speech recognition application. Nets with recurrent connections from hidden nodes to input nodes were studied by Elman (1988) and by Servan-Schreiber, Cleeremans, and McClelland (1988) for natural language applications. Two recent studies have explored recurrent nets similar to the net shown in figure 5 trained with modified forms of back-propagation. Robinson and Fallside (1988b) used such a net to label speech frames with one of 27 phoneme labels using hand-marked testing and training data. Training used an algorithm suggested by Rumelhart et al. (1986a) that, in effect, replicates the net at every time step during training. Talker-dependent error rates were 22.7% for the recurrent net and 26.0% for a simple feed-forward net with delays between input nodes to provide input context. Multi-talker error rates were 30.8% for the recurrent net and 40.8% for the feed-forward net.
A 64-processor array of transputers provided practical training times in these experiments. Anderson, Merrill, and Port (1988) also explored recurrent nets similar to the net in figure 5. Stimuli were CV syllables formed from six stop consonants and the vowel /a/ that were hand-segmented to contain 120 msec of speech around the vowel onset. Nets were trained on data from 10 talkers, tested on data from 10 other talkers, and contained one or two hidden layers with different numbers of hidden nodes. Best performance (an error rate of 13.1%) was provided by a net with two hidden layers.

7 Integrating Neural Net and Conventional Approaches
Researchers are beginning to combine conventional HMM and DTW speech recognition algorithms with neural net classification algorithms and also to design neural net architectures that perform computations required by important speech recognition algorithms. This may lead
to improved recognition accuracy and also to new designs for compact real-time hardware. Combining the good discrimination of neural net classifiers with the automatic scoring and training algorithms used in HMM recognizers could lead to rapid advances by building on existing high-performance recognizers. Studies that have combined neural net and conventional approaches to speech recognition are listed in table 5. Many (Bourlard and Wellekens 1987; Burr 1988b; Huang et al. 1988; Sakoe and Iso 1987) integrate multilayer perceptron classifiers with conventional DTW and HMM recognizers and one (Lippmann and Gold 1987) provides a neural-net architecture that could be used to implement an HMM Viterbi decoder. One study (Bourlard and Wellekens 1987) demonstrated how a multilayer perceptron could be integrated into a DTW continuous-speech recognizer to improve recognition performance.

Study                    Approach                       Comments
Bourlard and             MLP Provides Allophone         Good Performance on
Wellekens (1987)         Distance Scores for            918-Word, Talker-Dependent,
                         DTW Recognizer                 Continuous-Speech Task
Burr (1988a)             MLP Classifier After           Tested on Single-Talker
                         Energy-Based DTW               E-Set
Huang and                Second-Stage MLP               Improved Performance for
Lippmann (1988)          Discrimination After           "B,D,G" from TI Alpha-Digit
                         HMM Recognizer                 Data Base
Lippmann and             "Viterbi-Net" Neural Net       Same Good Performance on
Gold (1987)              Architecture for HMM           Large Data Base as Robust
                         Viterbi Decoder                HMM Recognizer
Sakoe and Iso (1987)     MLP Provides Distance          No Hand Labeling Required,
                         Scores for DTW Recognizer      Untested

Table 5: Studies Combining Neural Net and Conventional Approaches.
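The hybrid pattern used in several of the studies in table 5 — a neural net supplies per-frame local distances and dynamic time warping integrates them over time — can be sketched as follows. The recursion is the standard DTW step; the random distance matrix merely stands in for real per-frame network scores:

```python
import numpy as np

def dtw_score(local_dist):
    """Dynamic time warping over a matrix of local distances
    local_dist[i, j] between input frame i and template frame j."""
    n, m = local_dist.shape
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # standard DTW recursion: best of the three predecessors
            d[i, j] = local_dist[i - 1, j - 1] + min(
                d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

# In the hybrid recognizers, local_dist would come from MLP output
# nodes; random values keep this sketch self-contained.
rng = np.random.default_rng(2)
local = rng.random((8, 6))      # 8 input frames vs. a 6-frame template
best = dtw_score(local)         # lower accumulated distance = better match
```

The division of labor is the one the text describes: the net provides frame-level discrimination, the warping recursion provides integration over time.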
7.1 Integrating Multilayer Perceptron Classifiers with DTW and HMM Recognizers.

At least three groups have proposed recognizers where multilayer perceptrons compute distance scores used in DTW or HMM recognizers (Bourlard and Wellekens 1987; Burr 1988a; Sakoe and Iso 1987). Bourlard and Wellekens (1987) demonstrated how the multilayer perceptron shown in figure 6 could be used to calculate allophone distance scores required for phoneme and word recognition in a DTW discrete-observation recognizer. One net had inputs from 15 frames of speech centered on the current frame, 50 hidden nodes, and 26 output nodes. Outputs corresponded to allophones in a 10-digit German vocabulary. Inputs were from 60 binary variables per frame. One input bit was on in each frame to specify the codebook entry that represented that frame. The multilayer perceptron was trained using hand-labeled training data to provide a high output only for the output node corresponding to the current input allophone. Recognition then used dynamic time warping with local distances equal to values from output nodes. This provides good discrimination from the neural net and integration over time from the DTW algorithm. Perfect performance was provided for recognition of 100 tokens from one talker. Bourlard and Wellekens (1987) also used a multilayer perceptron with contextual input and DTW to recognize words from a more difficult 919-word talker-dependent continuous-speech task. The net covered an input context of 9 frames, used one of 132 vectors to quantize each frame, had 50 or 200 hidden nodes, and had 50 output nodes corresponding to 50 German phonemes. This net was trained using 100 hand-segmented sentences and tested on 188 other sentences containing roughly 7300 phonemes. The phoneme error rate was 41.6% with 50 hidden nodes and 37% with 200 hidden nodes.
These error rates were both lower than the 47.5% error rate provided by a simple discrete-observation HMM recognizer with duration modeling and one probability histogram per phoneme. Bourlard and Wellekens suggested that performance could be improved and the need for hand-segmented training data could be eliminated by embedding multilayer perceptron back-propagation training in an iterative Viterbi-like training loop. This loop could progressively improve segmentation for DTW or HMM recognizers. Iterative Viterbi training was not performed because the simpler single-pass training required roughly 200 hours on a SUN-3 workstation. As noted above, Bourlard and Wellekens (1988) also recently proved that recurrent neural nets could calculate local probabilities required in HMM recognizers. Sakoe and Iso (1987) suggested a recognition structure similar to that of Bourlard and Wellekens (1987) where a multilayer perceptron with delays between input nodes computes local distance scores. They, however, do not require output nodes of the multilayer perceptron to represent sub-word units such as phonemes. Instead, a training algorithm is described that is similar to the iterative Viterbi-like training loop suggested
by Bourlard and Wellekens (1987) but for continuous input parameters. No results were presented for this approach. Burr (1988a) gave results for a recognizer where words were first aligned based on energy information to provide a fixed 20 input frames of spectral information. These inputs were fed to nine outputs representing members of the E-set ("B,C,D,E,G,P,T,V,Z"). This recognizer was trained and tested using 180 tokens from one talker. Results were nearly perfect when the initial parts of these words were oversampled. Huang and Lippmann demonstrated how a second stage of analysis using a multilayer perceptron could decrease the error rate of an HMM recognizer (Huang and Lippmann 1988). The Viterbi backtraces from an HMM recognizer were used to segment input speech frames, and average HMM log probability scores for the segments were provided as inputs to single- and multilayer perceptrons. Performance was evaluated using the letters "B,D,G" spoken by the 16 talkers in the TI alpha-digit data base. Ten training tokens per letter were used to train the HMM and neural net recognizer for each talker and the 16 other tokens were used for testing. Best performance was provided by a single-layer perceptron, which almost halved the error rate. The error rate dropped from 7.2% with the HMM recognizer alone to 3.8% with the neural net postprocessor.
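The Huang and Lippmann style of post-processing can be sketched as a small discriminative layer applied to HMM segment scores. The scores and weights below are invented for illustration; in the study the perceptron weights were trained on the segment-score vectors:

```python
import numpy as np

def postprocess(segment_scores, w, b):
    """Single-layer perceptron post-processor: re-classify a token
    from the average HMM log-probability score of each word model."""
    return int((segment_scores @ w + b).argmax())

# Hypothetical scores for one token against three word models ("B", "D", "G").
hmm_scores = np.array([-12.0, -11.5, -20.3])   # average log probabilities
w = np.eye(3)        # identity weights: with no training, trust the HMM
b = np.zeros(3)
choice = postprocess(hmm_scores, w, b)          # index of the chosen word
```

With trained (non-identity) weights, the perceptron can re-weight and combine the competing models' scores, which is where the reported error-rate reduction comes from.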
Figure 6: A feed-forward multilayer perceptron that was used to compute allophone distance scores for a DTW recognizer.
Figure 7: A recurrent neural net called a Viterbi net that performs the calculations required in an HMM Viterbi decoder.
7.2 A Neural Net Architecture to Implement a Viterbi Decoder.

Lippmann and Gold (1987) described a neural-net architecture called a Viterbi net that could be used to implement the Viterbi decoder used in many continuous-observation HMM recognizers using analog VLSI techniques. This net is shown in figure 7. Nodes represented by open triangles correspond to nodes in a left-to-right HMM word model. Each of these triangles represents a threshold-logic node followed by a fixed delay. Small subnets in the upper part of the figure select the maximum of two inputs as described in (Lippmann et al. 1987) and subnets in the lower part sum all inputs. A temporal sequence of input vectors is presented at the input and the output is proportional to the log probability calculated by a Viterbi decoder. The structure of the Viterbi net illustrates how neural net components can be integrated to design a complex net that performs the calculations required by an important conventional algorithm. The Viterbi net differs from the Viterbi decoding algorithm normally implemented in software and was thus evaluated using 4000 word tokens from the 9-talker 35-word Lincoln Stress-Style speech data base. Connection strengths in Viterbi nets with 15 internal nodes (one node per HMM model state) were adjusted based on parameter estimates obtained from the forward-backward algorithm. Inputs consisted of 12 mel cepstra and 13 differential mel cepstra that were updated every 10 msec. Performance was good and almost identical to that of current robust HMM isolated-word recognizers (Lippmann and Gold 1987). The error
rate was 0.56%, or only 23 out of 4095 tokens wrong. One advantage an analog implementation of this net would have over digital approaches is that the frame rate could be increased to provide improved temporal resolution without requiring higher clock rates.

8 Other Nets for Pattern Sequence Recognition
In addition to the neural net models described above, other nets motivated primarily by psychological and physiological findings and by past work on associative memories have been proposed for speech recognition and pattern sequence recognition. Although some of these nets represent new approaches to the problem of pattern sequence recognition, few have been integrated into speech recognizers and none have been evaluated using large speech data bases.

8.1 Psychological Neural Net Models of Speech Perception.

Three neural net models have been proposed that are primarily psychological models of speech perception (Elman and McClelland 1986; MacKay 1987; Marslen-Wilson 1987; Rumelhart et al. 1986b). The COHORT model developed by Marslen-Wilson (1987) assumes a left-to-right real-time acoustic-phonetic analysis of speech as in current recognizers. It accounts for many psychophysical results in speech recognition such as the existence of a time when a word becomes unambiguously recognized (the recognition point), the word frequency effect, and recognition of contextually inappropriate words. This model, however, is descriptive and is not expressed as a computational model. Hand-crafted versions of the TRACE and Interactive Activation models developed by Elman, McClelland, Rumelhart, and co-workers were tested with small speech data bases (Elman and McClelland 1986; Rumelhart et al. 1986b). These models are based on neuron-like nodes, include both feed-forward and feed-back connections, use nodes with multiplicative operations, and emphasize the benefits that can be obtained by using co-articulation information to aid in word recognition. These models are impractical because the problems of time alignment and training are not addressed and the entire network must be copied on every new time step. The Node Structure Theory developed by MacKay (1987) is a qualitative neural theory of speech recognition and production.
It is similar in many ways to the above models, but considers problems related to talking rate, stuttering, internal speech, and rhythm.
8.2 Physiological Models for Temporal Pattern Recognition.

Neural net approaches motivated primarily by physiological and behavioral results have also been proposed to perform some component of the time alignment task (Cohen et al. 1987; Dehaene et al. 1987; Wong and Chen 1986). Wong and Chen (1986) and Dehaene et al. (1987) describe similar
models that have been tested with a small amount of speech data. These models include neurons with shunting or multiplicative nodes similar to those that have been proposed in the retina to compute direction of motion (Poggio and Koch 1987). Three neurons can be grouped to form a "synaptic triad" that can be used to recognize two-component pattern sequences. This triad will have a strong output only if the modulator input goes "high" and then, a short time later, the primary input goes "high." Synaptic triads can be arranged in sequences and in hierarchies to recognize features, allophones, and words (Wong and Chen 1986). In limited tests, hand-crafted networks could recognize a small set of words spoken by one talker (Wong and Chen 1986). More interesting is a proposed technique for training such networks without supervision (Dehaene et al. 1987). If effective, this could make use of the large amount of unlabeled speech data that is available and lead to automatic creation of sub-word models. Further elaboration is necessary to describe how networks with synaptic triads could be trained and used in a recognizer. Cohen and Grossberg proposed a network called a masking field that has not yet been tested with speech input (Cohen and Grossberg 1987).
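The synaptic-triad behavior described above — a strong output only when the primary input goes high shortly after the modulator — can be sketched with a decaying modulator trace that gates the primary input. This is a toy illustration of the timing behavior only, not the shunting dynamics of the cited models:

```python
def triad_output(modulator, primary, decay=0.5):
    """Synaptic-triad sketch: the modulator leaves a decaying trace
    that multiplicatively gates the primary input, so the output is
    strong only when the primary fires soon after the modulator."""
    trace, outputs = 0.0, []
    for m, p in zip(modulator, primary):
        trace = max(m, trace * decay)   # modulator trace decays over time
        outputs.append(trace * p)       # primary gated by the trace
    return outputs

# Modulator high at t=0; primary high at t=1 and again (late) at t=3.
seq = triad_output([1, 0, 0, 0], [0, 1, 0, 1])
```

The early primary event (t=1) produces a much stronger response than the late one (t=3), which is the sequence-selectivity that lets triads be chained into detectors for features, allophones, and words.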
Figure 8: A model called a masking field that can be used to detect pattern sequences.
This network is shown in figure 8. Inputs are applied to the bottom subnet, which is similar to a feature-map net (Kohonen et al. 1984). Typically, only one node in this subnet has a "high" output at any time. Subnet node outputs feed short-term storage nodes whose outputs decay slowly over time. Different input pattern sequences thus lead to different amplitude patterns in short-term storage. For example, the input C-A-T sampled at the end of the word will yield an intensity pattern in short-term storage with node C low, node A intermediate, and node T high. The input T-A-C will yield a pattern with node C high, node A intermediate, and node T low. These intensity patterns are weighted and fed to nodes in a masking field with weights adjusted to detect different patterns. The masking field is designed such that all nodes compete to be active and nodes representing longer patterns inhibit nodes representing shorter patterns. This approach can recognize short isolated pattern sequences but has difficulty recognizing patterns with repeated sub-sequences because nodes in short-term storage corresponding to those sub-sequences can become saturated. Further elaboration is necessary to describe how masking fields should be integrated into a full recognizer. Other recent studies (Jordan 1986; Stornetta et al. 1988; Tattersall et al. 1988) have also proposed using slowly-decaying nodes as short-term storage to provide history useful for pattern recognition and pattern sequence generation.

8.3 Sequential Associative Memories.

A final approach to pattern sequence recognition is to build a sequential associative memory for pattern sequences as described in (Amit 1988; Buhmann and Schulten 1988; Hecht-Nielsen 1987; Kleinfield 1986; Sompolinsky and Kanter 1986). These nets extend past work on associative memories by Hopfield and Little (Hopfield 1982; Little 1974) to the case where pattern sequences instead of static patterns can be restored.
Recognition in this approach corresponds to the net settling into a desired sequence of stable states, one after the other, when driven by an input temporal pattern sequence. Dynamic associative memory models developed by Amit, Kleinfield, Sompolinsky, and Kanter (Amit 1988; Kleinfield 1986; Sompolinsky and Kanter 1986) use long and short delays on links to generate and recognize pattern sequences. Links with short delays mutually excite a small set of nodes to produce stable states. Links with long delays excite nodes in the next expected stable state. Transitions between states thus occur at predetermined times that depend on the delays in the links. A net developed by Buhmann and Schulten (1988) uses probabilistic nodes to produce sequencing behavior similar to that produced by a Markov chain. Transitions in this net occur stochastically but at some average rate. A final net described by Hecht-Nielsen (1987) is a modified version of Grossberg's avalanche net (Grossberg 1988). The input to this net is similar in structure to Kohonen's feature map. It differs in that nodes have different rise and fall time constants and overall network activity is
controlled such that only the outputs of a few nodes are “high” at any time. A few relatively small simulations have been performed to explore the behavior of the sequential associative memories. Simulations have demonstrated that these nets can complete pattern sequences given the first element of a sequence (Buhmann and Schulten 1988) and also perform such functions as counting the number of input patterns presented to a net (Amit 1988). Although this approach is theoretically very interesting and may be a good model of some neural processing, no tests have been performed with speech data. In addition, further work is necessary to develop training procedures and useful decoding strategies that could be applied in a complete speech recognizer.

9 Summary of Past Research
The performance of current speech recognizers is far below that of humans. Neural nets offer the potential of providing massive parallelism, adaptation, and new algorithmic approaches to speech recognition problems. Researchers are investigating:

1. New physiologically based front ends,
2. Neural net classifiers for static speech input patterns,
3. Neural nets designed specifically to classify temporal pattern sequences,
4. Combined recognizers that integrate neural net and conventional recognition approaches,
5. Neural net architectures that implement conventional algorithms, and
6. VLSI hardware neural nets that implement both neural net and conventional algorithms.
Physiological front ends have provided improved recognition accuracy in noise (Ghitza 1988; Hunt and Lefebvre 1988) and a cochlea filter-bank that could be used in these front ends has been implemented using micro-power VLSI techniques (Lyon and Mead 1988). Many nets can compute the complex likelihood functions required by continuous-distribution recognizers and perform the vector quantization required by discrete-observation recognizers. Kohonen’s feature map algorithm (Kohonen et al. 1984) has been used successfully to vector quantize speech and preliminary VLSI hardware versions of this net have been built (Mann et al. 1988). Multilayer perceptron networks with delays have provided excellent discrimination between small sets of difficult-to-discriminate speech inputs (Kammerer and Kupper 1988; Lang and Hinton 1988; Peeling and
Moore 1987; Waibel et al. 1987; Waibel et al. 1988; Watrous 1988). Good discrimination was provided for a set of 18 consonants in varying phonetic contexts (Waibel et al. 1988), similar E-set words such as "B,D,E,V" (Lang and Hinton 1988), and digits and words from small vocabularies (Kammerer and Kupper 1988; Peeling and Moore 1987; Watrous 1988). In almost all cases, a neural net approach performed as well as or slightly better than a more conventional HMM or DTW recognizer (Kammerer and Kupper 1988; Lang and Hinton 1988; Peeling and Moore 1987; Waibel et al. 1987; 1988), while providing a parallel architecture that could be used for implementation and a computationally simple, incremental training algorithm. Approaches to the problem of scaling a network up in size to discriminate between members of a large set have been proposed and demonstrated (Waibel et al. 1988). For example, a net that classifies 18 consonants accurately was constructed from subnets trained to discriminate between smaller subsets of these consonants. Algorithms that use combined unsupervised/supervised training and provide high performance and extremely rapid training have also been demonstrated (Huang and Lippmann 1988; Kohonen et al. 1988). New training algorithms are under development (Almeida 1987; Jordan 1986; Pineda 1987; Rohwer and Forrest 1987; Watrous 1988) that can be used with recurrent networks. Preliminary studies have explored recognizers that combine conventional and neural net approaches. Promising continuous-speech recognition results have been obtained by integrating multilayer perceptrons into a DTW recognizer (Bourlard and Wellekens 1987) and a multilayer perceptron post-processor has improved the performance of an isolated-word HMM recognizer (Huang et al. 1988). Neural net architectures have also been designed for important conventional algorithms.
For example, recurrent neural net architectures have been developed to implement the Viterbi decoding algorithm used in many HMM speech recognizers (Lippmann and Gold 1987) and also to compute local probabilities required in discrete-observation HMM recognizers (Bourlard and Wellekens 1988). Many new neural net models have been proposed for recognizing temporal pattern sequences. Some are based on physiological data and attempt to model the behavior of biological nets (Dehaene et al. 1987; Cohen et al. 1987; Wong and Chen 1986) while others attempt to extend existing auto-associative networks to temporal problems (Amit 1988; Buhmann and Schulten 1988; Kleinfield 1986; Sompolinsky and Kanter 1986). New learning algorithms and net architectures will, however, be required to provide the real-time response and automatic learning of internal word and phrase models required for high-performance continuous speech recognition. This remains a major unsolved problem in the field of neural nets.
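For reference, the Viterbi computation that such recurrent architectures emulate can be written compactly in the log domain. This is a generic dynamic-programming sketch, not the neural net implementation itself, and the two-state left-to-right model at the end is a hypothetical toy:

```python
import numpy as np

NEG = -1e9  # stand-in for log(0)

def viterbi(log_init, log_trans, log_obs):
    """Most likely HMM state sequence via dynamic programming (log domain).

    log_init:  (S,)   log initial-state probabilities
    log_trans: (S, S) log transition probabilities, entry [i, j] for i -> j
    log_obs:   (T, S) log observation likelihoods for each frame and state
    """
    T, S = log_obs.shape
    score = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)     # best-predecessor pointers
    for t in range(1, T):
        cand = score[:, None] + log_trans  # score of each i -> j move
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_obs[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):          # trace pointers backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())

# Hypothetical two-state left-to-right toy model over three frames.
log_init = np.array([0.0, NEG])                     # must start in state 0
log_trans = np.log([[0.5, 0.5], [1e-9, 1.0]])
log_obs = np.array([[0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]])
path, log_score = viterbi(log_init, log_trans, log_obs)   # path: [0, 1, 1]
```

Each frame requires only a max and an add per state pair, which is the regular, local computation that makes the algorithm attractive for parallel neural net implementation.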
10 Suggestions for Future Work
Further work should emphasize networks that provide rapid response and could be used with real-time speech input. They must include internal mechanisms to distinguish speech from background noise and to determine when a word has been presented. They also must operate with continuous acoustic input and not require hand marking of test speech data, long internal delays, or duplication of the network for new inputs. Short-term research should focus on a task that current recognizers perform poorly on, such as accurate recognition of difficult sets of isolated words. Such a task would not require excessive computation resources or extremely large data bases. A potential initial problem is talker-independent recognition of difficult E-set words or phonemes as in (Lang and Hinton 1988; Waibel et al. 1988). Techniques developed using small difficult vocabularies should be extended to larger vocabularies and continuous speech as soon as feasible. Efforts should focus on: developing training algorithms to construct sub-word and word models automatically without excessive supervision, developing better front-end acoustic-phonetic feature extraction, improving low-level acoustic/phonetic discrimination, integrating temporal sequence information over time, and developing more rapid training techniques. Researchers should continue integrating neural net approaches to classification with conventional approaches to training and scoring. Longer-term research on continuous-speech recognition must address the problems of developing high-level speech-understanding systems that can learn and use internal models of the world. These systems must be able to learn and use syntactic, semantic, and pragmatic constraints. Efforts on building neural net VLSI hardware for speech recognition should also continue. The development of compact real-time speech recognizers is a major goal of neural net research.
Parallel neural-net architectures should be designed to perform the computations required by successful algorithms, and these architectures should then be implemented and tested. Recent developments in analog VLSI neural nets suggest that this approach has the potential to provide the high computation rates required for both front-end acoustic analysis and high-level pattern matching. All future work should take advantage of the many speech data bases that currently exist and use results obtained on these data bases with experimental HMM and DTW recognizers as benchmarks. Descriptions of some common data bases and comments on their availability are in (Pallett 1986; Price et al. 1988). Detailed evaluations using large speech data bases are necessary to guide research and permit comparisons between alternative approaches. Results obtained on a few locally-recorded speech samples are often misleading and are not informative to other researchers. Research should also build on the current state of knowledge in neural networks, pattern classification theory, statistics, and conventional HMM
and DTW approaches to speech recognition. Researchers should become familiar with these areas and not duplicate existing work. Introductions to current HMM and DTW approaches are available in (Dixon and Martin 1979; Lee and Hon 1988; Parsons 1986; Rabiner and Juang 1986; Rabiner and Schafer 1978) and introductions to statistics and pattern classification are available in many books including (Duda and Hart 1973; Fukunaga 1972; Nilsson 1965).
Acknowledgments
I would like to thank members of the Royal Signals and Radar Establishment, including John Bridle and Roger Moore, for discussions regarding the material in this paper. I would also like to thank Bill Huang and Ben Gold for interesting discussions and Carolyn for her patience.

References

Ackley, D.H., G.E. Hinton, and T.J. Sejnowski. 1985. A Learning Algorithm for Boltzmann Machines. Cognitive Science 9, 147-160. Albus, J.S. 1981. Brain, Behavior, and Robotics. BYTE Books. Almeida, L.B. 1987. A Learning Rule for Asynchronous Perceptrons with Feedback in a Combinatorial Environment. In: 1st International Conference on Neural Networks. IEEE, II-609. Alspector, J. and R.B. Allen. 1987. A Neuromorphic VLSI Learning System. In: Advanced Research in VLSI: Proceedings of the 1987 Stanford Conference, ed. P. Losleben, 313-349. Cambridge: MIT Press. Amit, D.J. 1988. Neural Networks for Counting Chimes. Proceedings of the National Academy of Sciences, USA 85, 2141-2145. Anderson, S., J. Merrill, and R. Port. 1988. Dynamic Speech Categorization With Recurrent Networks. Technical Report 258, Department of Linguistics and Department of Computer Science, Indiana University. Averbuch, A., L. Bahl, and R. Bakis. 1987. Experiments with the Tangora 20,000 Word Speech Recognizer. In: Proceedings IEEE International Conference on Acoustics Speech and Signal Processing, Dallas, TX, 701-704. Bahl, L.R., P.F. Brown, P.V. De Souza, and R.L. Mercer. 1988. Modeling Acoustic Sequences of Continuous Parameters. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, New York, NY, 40-43. Baldi, P. and K. Hornik. 1988. Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima. Neural Networks 2, 53-58. Beet, S.W., H.E.G. Powrie, R.K. Moore, and M.J. Tomlinson. 1988. Improved Speech Recognition Using a Reduced Auditory Representation.
In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, New York, NY, 75-78.
Bengio, Y. and R. De Mori. 1988. Use of Neural Networks for the Recognition of Place of Articulation. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, New York, NY, 103-106. Bourlard, H. and Y. Kamp. 1988. Auto-Association by Multilayer Perceptrons and Singular Value Decomposition. Biological Cybernetics 59, 291-294. Bourlard, H. and C.J. Wellekens. 1988. Links Between Markov Models and Multilayer Perceptrons. Technical Report Manuscript M-263, Philips Research Laboratory, Brussels, Belgium. . 1987. Speech Pattern Discrimination and Multilayer Perceptrons. Technical Report Manuscript M-211, Philips Research Laboratory, Brussels, Belgium. Scheduled to appear in the December issue of Computer, Speech and Language. Bridle, J. 1988. Neural Network Experience at the RSRE Speech Research Unit. ATR Workshop on Neural Networks and Parallel Distributed Processing, Osaka, Japan. Broomhead, D.S. and D. Lowe. 1988. Radial Basis Functions, Multi-Variable Functional Interpolation and Adaptive Networks. Technical Report RSRE Memorandum No. 4148, Royal Signals and Radar Establishment, Malvern, Worcester, Great Britain. Buhmann, J. and K. Schulten. 1988. Noise-Driven Temporal Association in Neural Networks. Europhysics Letters 4, 1205-1209. Burr, D.J. 1988a. Experiments on Neural Net Recognition of Spoken and Written Text. In: IEEE Transactions on Acoustics, Speech and Signal Processing, 36, 1162-1168. . 1988b. Speech Recognition Experiments with Perceptrons. In: Neural Information Processing Systems, ed. D. Anderson, 144-153. New York: American Institute of Physics. Burton, D.K., J.E. Shore, and J.T. Buck. 1985. Isolated-Word Speech Recognition Using Multisection Vector Quantization Codebooks. In: IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-33, 837-849. Cohen, M. and S. Grossberg. 1987.
Masking Fields: A Massively Parallel Neural Architecture for Learning, Recognizing, and Predicting Multiple Groupings of Patterned Data. Applied Optics 26, 1866-1891. Cohen, M.A., S. Grossberg, and D. Stork. 1987. Recent Developments in a Neural Model of Real-Time Speech Analysis and Synthesis. In: 1st International Conference on Neural Networks, IEEE. Cowan, J.D. and D.H. Sharp. 1988. Neural Nets and Artificial Intelligence. Daedalus 117, 85-121. Cybenko, G. 1988. Continuous Valued Neural Networks with Two Hidden Layers are Sufficient. Technical Report, Department of Computer Science, Tufts University. Dehaene, S., J. Changeux, and J. Nadal. 1987. Neural Networks that Learn Temporal Sequences by Selection. Proceedings of the National Academy of Sciences, USA, Biophysics 84, 2727-2731. Deng, Li and C. Daniel Geisler. 1987. A Composite Auditory Model for Processing Speech Sounds. Journal of the Acoustical Society of America 82:6, 2001-2012. Dixon, N.R. and T.B. Martin. 1979. Automatic Speech and Speaker Recognition. New York: IEEE Press.
Doddington, G.R. and T.B. Schalk. 1981. Speech Recognition: Turning Theory into Practice. IEEE Spectrum, 26-32. Duda, R.O. and P.E. Hart. 1973. Pattern Classification and Scene Analysis. New York: John Wiley & Sons. Elman, J.L. 1988. Finding Structure in Time. CRL Technical Report 8801, University of California, San Diego, CA. Elman, J.L. and J.L. McClelland. 1986. Exploiting Lawful Variability in the Speech Wave. In: Invariance and Variability in Speech Processes, eds. J.S. Perkell and D.H. Klatt. New Jersey: Lawrence Erlbaum. Elman, J.L. and D. Zipser. 1987. Learning the Hidden Structure of Speech. ICS Report 8701, Institute for Cognitive Science, University of California, San Diego, La Jolla, CA. Fallside, F., T.D. Harrison, R.W. Prager, and A.J.R. Robinson. 1988. A Comparison of Three Connectionist Models for Phoneme Recognition in Continuous Speech. ATR Workshop on Neural Networks and Parallel Distributed Processing, Osaka, Japan. Fukunaga, K. 1972. Introduction to Statistical Pattern Recognition. New York: Academic Press. Ghitza, O. 1988. Auditory Neural Feedback as a Basis for Speech Processing. In: Proceedings IEEE International Conference on Acoustics Speech and Signal Processing, New York, NY, 91-94. Gold, B. 1988. A Neural Network for Isolated Word Recognition. In: Proceedings IEEE International Conference on Acoustics Speech and Signal Processing, New York, NY, 44-47. Greenberg, S. 1988a. The Ear as a Speech Analyzer. Journal of Phonetics 16, 139-149. . 1988b. Special Issue on “Representation of Speech in the Auditory Periphery.” Journal of Phonetics 16. Grossberg, S. 1988. Nonlinear Neural Networks: Principles, Mechanisms, and Architectures. Neural Networks 1, 17-61. Hanson, S.J. and D.J. Burr. 1987. Knowledge Representation in Connectionist Networks. Technical Report, Bell Communications Research, Morristown, New Jersey. Harrison, T.D. and F. Fallside. 1988. A Connectionist Structure for Phoneme Recognition.
Technical Report CUED/F-INFENG/TR.15, Cambridge University Engineering Department. Hecht-Nielsen, R. 1987. Nearest Matched Filter Classification of Spatiotemporal Patterns. Applied Optics 26, 1892-1899. Hinton, G.E. 1987. Connectionist Learning Procedures. Technical Report CMU-CS-87-115, Carnegie Mellon University, Computer Science Department. Homma, T., L.E. Atlas, and R.J. Marks. 1988. An Artificial Neural Network for Spatio-Temporal Bipolar Patterns: Application to Phoneme Classification. In: Neural Information Processing Systems, ed. D. Anderson, 31-40. New York: American Institute of Physics. Hopfield, J.J. 1982. Neural Networks and Physical Systems with Emergent Collective Computational Abilities. Proceedings of the National Academy of Sciences, USA 79, 2554-2558.
Huang, W.M. and R.P. Lippmann. 1988. Neural Net and Traditional Classifiers. In: Neural Information Processing Systems, ed. D. Anderson, 387-396. New York: American Institute of Physics. Huang, W.M., R.P. Lippmann, and T. Nguyen. 1988. Neural Nets for Speech Recognition. In: Conference of the Acoustical Society of America, Seattle, WA. Hunt, M.J. and C. Lefebvre. 1988. Speaker Dependent and Independent Speech Recognition Experiments With an Auditory Model. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing 1, New York, 215-218. Irino, T. and H. Kawahara. 1988. A Study on the Speaker Independent Feature Extraction of Japanese Vowels by Neural Networks. ATR Workshop on Neural Networks and Parallel Distributed Processing, Osaka, Japan. Jordan, M.I. 1986. Serial Order: A Parallel Distributed Processing Approach. Institute for Cognitive Science Report 8604, University of California, San Diego. Kamm, C., T. Landauer, and S. Singhal. 1988. Training an Adaptive Network to Spot Demisyllables in Continuous Speech. ATR Workshop on Neural Networks and Parallel Distributed Processing, Osaka, Japan. Kammerer, B. and W. Kupper. 1988. Experiments for Isolated-Word Recognition with Single and Multi-Layer Perceptrons, Abstracts of 1st Annual INNS Meeting, Boston. Neural Networks 1, 302. Kirkpatrick, S., C.D. Gelatt, and M.P. Vecchi. 1983. Optimization by Simulated Annealing. Science 220, 671-680. Klatt, D.H. 1986. The Problem of Variability In Speech Recognition and Models of Speech Perception. In: Invariance and Variability in Speech Processes, eds. J.S. Perkell and D.H. Klatt, 300-324. New Jersey: Lawrence Erlbaum. Kleinfield, D. 1986. Sequential State Generation by Model Neural Networks. Proceedings of the National Academy of Sciences, USA, Biophysics 83, 9469-9473. Kohonen, T. 1988. An Introduction to Neural Computing. Neural Networks 1, 3-16. Kohonen, T. 1984. Self-Organization and Associative Memory. Berlin: Springer-Verlag. Kohonen, T., G. Barna, and R.
Chrisley. 1988. Statistical Pattern Recognition with Neural Networks: Benchmarking Studies. In: IEEE Annual International Conference on Neural Networks, San Diego, July. Kohonen, T., K. Makisara, and T. Saramaki. 1984. Phonotopic Maps - Insightful Representation of Phonological Features for Speech Recognition. In: IEEE Proceedings of the 7th International Conference on Pattern Recognition. Lang, K.J. and G.E. Hinton. 1988. The Development of the Time-Delay Neural Network Architecture for Speech Recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University. Lapedes, A. and R. Farber. 1988. How Neural Nets Work. In: Neural Information Processing Systems, ed. D. Anderson, 442-456. New York: American Institute of Physics. Lee, Kai-Fu and Hsiao-Wuen Hon. 1988. Large-Vocabulary Speaker-Independent Continuous Speech Recognition Using HMM. In: Proceedings IEEE
International Conference on Acoustics, Speech and Signal Processing 1, 123-126. Lee, Y.C., G. Doolen, H.H. Chen, G.Z. Sun, T. Maxwell, H.Y. Lee, and C.L. Giles. 1986. Machine Learning Using a Higher Order Correlation Network. Physica D, 276-306. Leung, H.C. and V.W. Zue. 1988. Some Phonetic Recognition Experiments Using Artificial Neural Nets. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing 1. Lippmann, R.P., B. Gold, and M.L. Malpass. 1987. A Comparison of Hamming and Hopfield Neural Nets for Pattern Classification. Technical Report TR-769, MIT Lincoln Laboratory. Lippmann, R.P. 1987. An Introduction to Computing with Neural Nets. IEEE ASSP Magazine 4:2, 4-22. Lippmann, R.P. and Ben Gold. 1987. Neural Classifiers Useful for Speech Recognition. In: 1st International Conference on Neural Networks, IEEE, IV-417. Little, W.A. 1974. The Existence of Persistent States in the Brain. Mathematical Biosciences 19, 101-120. Lyon, R.F. and C. Mead. 1988. An Analog Electronic Cochlea. IEEE Transactions on Acoustics, Speech and Signal Processing 36, 1119-1134. MacKay, D.G. 1987. The Organization of Perception and Action. New York: Springer-Verlag. Mann, J., J. Raffel, R. Lippmann, and B. Berger. 1988. A Self-Organizing Neural Net Chip. Neural Networks for Computing Conference, Snowbird, Utah. Marslen-Wilson, W.D. 1987. Functional Parallelism in Spoken Word-Recognition. In: Spoken Word Recognition, eds. U.H. Frauenfelder and L.K. Tyler. Cambridge, MA: MIT Press. McDermott, E. and S. Katagiri. 1988. Phoneme Recognition Using Kohonen’s Learning Vector Quantization. ATR Workshop on Neural Networks and Parallel Distributed Processing, Osaka, Japan. Moody, J. 1988. Speedy Alternatives to Back Propagation. Neural Networks for Computing Conference, Snowbird, Utah. Moody, J. and C. Darken. 1988. Learning with Localized Receptive Fields. Technical Report YALEU/DCS/RR-649, Yale Computer Science Department, New Haven, CT. Naylor, J. and K.P. Li. 1988.
Analysis of a Neural Network Algorithm for Vector Quantization of Speech Parameters, Abstracts of 1st Annual INNS Meeting, Boston. Neural Networks 1, 310. Nilsson, Nils J. 1965. Learning Machines. New York: McGraw-Hill. Niranjan, M. and F. Fallside. 1988. Neural Networks and Radial Basis Functions in Classifying Static Speech Patterns. Technical Report CUED/F-INFENG/TR 22, Cambridge University Engineering Department. Omohundro, S.M. 1987. Efficient Algorithms with Neural Network Behavior. Complex Systems 1, 273-347. Pallett, D.S. 1986. A PCM/VCR Speech Database Exchange Format. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, Tokyo, Japan, 317-320.
Parsons, T. 1986. Voice and Speech Processing. New York: McGraw-Hill. Paul, D.B. 1987. A Speaker-Stress Resistant HMM Isolated Word Recognizer. ICASSP 87, 713-716. Peeling, S.M. and R.K. Moore. 1987. Experiments in Isolated Digit Recognition Using the Multi-Layer Perceptron. Technical Report 4073, Royal Signals and Radar Establishment, Malvern, Worcester, Great Britain. Peterson, Gordon E. and Harold L. Barney. 1952. Control Methods Used in a Study of Vowels. The Journal of the Acoustical Society of America 24, 175-184. Pineda, F.J. 1987. Generalization of Back-Propagation to Recurrent Neural Networks. Physical Review Letters 59, 2229-2232. Poggio, T. and C. Koch. 1987. Synapses that Compute Motion. Scientific American 256, 46-52. Prager, R.W. and F. Fallside. 1987. A Comparison of the Boltzmann Machine and the Back Propagation Network as Recognizers of Static Speech Patterns. Computer Speech and Language 2, 179-183. Prager, R.W., T.D. Harrison, and F. Fallside. 1986. Boltzmann Machines for Speech Recognition. Computer Speech and Language 1, 2-27. Price, P., W.M. Fisher, J. Bernstein, and D.S. Pallett. 1988. The DARPA 1000-Word Resource Management Database for Continuous Speech Recognition. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, New York 1, 651-654. Rabiner, L.R. and B.H. Juang. 1986. An Introduction to Hidden Markov Models. IEEE ASSP Magazine 3:1, 4-16. Rabiner, Lawrence R. and Ronald W. Schafer. 1978. Digital Processing of Speech. New Jersey: Prentice-Hall. Raffel, J., J. Mann, R. Berger, A. Soares, and S. Gilbert. 1987. A Generic Architecture for Wafer-Scale Neuromorphic Systems. In: 1st International Conference on Neural Networks, IEEE. Robinson, A.J. and F. Fallside. 1988a. A Dynamic Connectionist Model for Phoneme Recognition. nEuro '88, Paris, France. . 1988b. Static and Dynamic Error Propagation Networks with Application to Speech Coding. In: Neural Information Processing Systems, ed. D. Anderson, 632-641.
New York: American Institute of Physics. Rohwer, R. and B. Forrest. 1987. Training Time-Dependencies in Neural Networks. In: 1st International Conference on Neural Networks, IEEE, II-701. Rossen, M.L., L.T. Niles, G.N. Tajchman, M.A. Bush, J.A. Anderson, and S.E. Blumstein. 1988. A Connectionist Model for Consonant-Vowel Syllable Recognition. In: Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, New York, NY, 59-66. Rumelhart, D.E., G.E. Hinton, and R.J. Williams. 1986a. Interactive Processes in Speech Perception: The TRACE Model. In: Parallel Distributed Processing: Vol. 2, Psychological and Biological Models, eds. D.E. Rumelhart and J.L. McClelland. Cambridge, MA: MIT Press. . 1986b. Learning Internal Representations by Error Propagation. In: Parallel Distributed Processing: Vol. 1, Foundations. Cambridge, MA: MIT Press. Sakoe, H. and K. Iso. 1987. Dynamic Neural Network - A New Speech Recognition
Model Based on Dynamic Programming and Neural Network. IEICE Technical Report 87, NEC Corporation. Seneff, S. 1988. A Joint Synchrony/Mean-Rate Model of Auditory Speech Processing. Journal of Phonetics 16, 55-76. Servan-Schreiber, D., A. Cleeremans, and J.L. McClelland. 1988. Encoding Sequential Structure in Simple Recurrent Networks. Technical Report CMU-CS-88-183, Carnegie Mellon University. Shamma, S. 1988. The Acoustic Features of Speech Sounds in a Model of Auditory Processing: Vowels and Voiceless Fricatives. Journal of Phonetics 16, 77-91. Shore, J.E. and D.K. Burton. 1983. Discrete Utterance Speech Recognition Without Time Alignment. IEEE Transactions on Information Theory IT-29, 473-491. Sompolinsky, H. and I. Kanter. 1986. Temporal Association in Asymmetrical Neural Networks. Physical Review Letters 57, 2861-2864. Stornetta, W.S., T. Hogg, and B.A. Huberman. 1988. A Dynamical Approach to Temporal Pattern Processing. In: Neural Information Processing Systems, ed. D. Anderson, 750-759. New York: American Institute of Physics. Tank, D. and J.J. Hopfield. 1987. Concentrating Information in Time: Analog Neural Networks with Applications to Speech Recognition Problems. In: 1st International Conference on Neural Networks, IEEE. Tattersall, G.D., P.W. Linford, and R. Linggard. 1988. Neural Arrays for Speech Recognition. British Telecommunications Technology Journal 6, 140-163. Unnikrishnan, K.P., J.J. Hopfield, and D.W. Tank. 1988. Learning Time-Delayed Connections in a Speech Recognition Circuit. Neural Networks for Computing Conference, Snowbird, Utah. Waibel, Alex, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. 1987. Phoneme Recognition Using Time-Delay Neural Networks. Technical Report TR-1-006, ATR Interpreting Telephony Research Laboratories, Japan. Scheduled to appear in the March 1989 issue of IEEE Transactions on Acoustics, Speech and Signal Processing. Waibel, Alex, H. Sawai, and K. Shikano. 1988. Modularity and Scaling in Large Phonemic Neural Nets.
Technical Report TR-1-0034, ATR Interpreting Telephony Research Laboratories, Japan. Watrous, R.L. 1988. Speech Recognition Using Connectionist Networks. Ph.D. thesis, University of Pennsylvania. . 1986. Learning Algorithms for Connectionist Networks: Applied Gradient Methods of Nonlinear Optimization. Technical Report MS-CIS-87-51, Linc Lab 72, University of Pennsylvania. Watrous, R.L. and Lokendra Shastri. 1987. Learning Phonetic Features Using Connectionist Networks: An Experiment in Speech Recognition. In: 1st International Conference on Neural Networks, IEEE, IV-381. Wong, M.K. and H.W. Chen. 1986. Toward a Massively Parallel System for Word Recognition. In: Proceedings IEEE International Conference on Acoustics Speech and Signal Processing, 37.4.1-37.4.4.
Received 10 November; accepted 14 November 1988
Communicated by Richard Lippmann
Modular Construction of Time-Delay Neural Networks for Speech Recognition Alex Waibel Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA and ATR Interpreting Telephony Research Laboratories, Twin 21 MID Tower, Osaka, 540, Japan
Several strategies are described that overcome limitations of basic network models as steps towards the design of large connectionist speech recognition systems. The two major areas of concern are the problem of time and the problem of scaling. Speech signals continuously vary over time and encode and transmit enormous amounts of human knowledge. To decode these signals, neural networks must be able to use appropriate representations of time and it must be possible to extend these nets to almost arbitrary sizes and complexity within finite resources. The problem of time is addressed by the development of a Time-Delay Neural Network; the problem of scaling by Modularity and Incremental Design of large nets based on smaller subcomponent nets. It is shown that small networks trained to perform limited tasks develop time invariant, hidden abstractions that can subsequently be exploited to train larger, more complex nets efficiently. Using these techniques, phoneme recognition networks of increasing complexity can be constructed that all achieve superior recognition performance.

1 Introduction

Numerous studies have recently demonstrated powerful pattern recognition capabilities emerging from connectionist models or “artificial neural networks” (Rumelhart and McClelland 1986; Lippmann 1987). Most are trained on mere presentations of suitable sets of input/output training data pairs. Most commonly these networks learn to perform tasks by effective use of hidden units as intermediate abstractions or decisions in an attempt to create complex, non-linear decision functions. While these properties are indeed elegant and useful, they are, in their most simple form, not easily applicable to decoding human speech.
Neural Computation 1, 39-46 (1989)
© 1989 Massachusetts Institute of Technology
2 Temporal Processing
One problem in speech recognition is the problem of time. A human speech signal is produced by moving the articulators towards target positions that characterize a particular sound. Since these articulatory motions are subject to physical constraints, they commonly do not reach clean, identifiable phonetic targets and hence describe trajectories or signatures rather than a sequence of well defined phonetic units. Properly representing and capturing the dynamic motion of such signatures, rather than trying to classify momentary snapshots of sounds, must therefore be a goal for suitable models of speech. Another consequence of the dynamic nature of speech is the general absence of any unambiguous acoustic cue that indicates when a particular sound occurs. As a solution to this problem, segmentation algorithms have been proposed that presegment the signal before classification is carried out. Segmentation, however, is an errorful classification problem in itself and, when in error, sets up subsequent recognition procedures for recognition failure. To overcome this problem, a suitable model of speech should instead simply scan the input for useful acoustic clues and base its overall decision on the sequence and co-occurrence of a sufficient set of detected lower level clues. This then presumes the existence of translation invariant feature detectors, i.e., detectors that recognize an acoustic event independent of its precise location in time. A “Time Delay Neural Network” (TDNN) (Lang 1987; Waibel et al. 1987) possesses both of these properties. It consists of TDNN-units that, in addition to computing the weighted sum of their current input features, also consider the history of these features. This is done by introducing varying delays on each of the inputs and processing (weighting) each of these delayed versions of a feature with a separate weight. In this fashion each unit can learn the dynamic properties of a set of moving inputs.
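A single unit of this kind can be sketched as follows. This is a deliberately simplified illustration: the sizes, the random weights, and the tanh nonlinearity are assumptions for the demonstration, not details of any published net. Because the same weights are applied at every frame position, the unit's response to a local temporal pattern is the same wherever the pattern occurs in time:

```python
import numpy as np

rng = np.random.default_rng(0)

def tdnn_unit(inputs, weights, bias=0.0):
    """One TDNN-style unit: a weighted sum over current and delayed inputs.

    inputs:  (n_features, n_frames) spectrogram-like array
    weights: (n_features, n_delays) one weight per feature per delay step
    """
    n_feat, n_delays = weights.shape
    steps = inputs.shape[1] - n_delays + 1
    # Slide the delay window over time, applying the same tied weights
    # at every position so responses are translation invariant.
    return np.array([np.tanh(np.sum(inputs[:, t:t + n_delays] * weights) + bias)
                     for t in range(steps)])

# Hypothetical sizes: 16 spectral coefficients over 15 frames.
spectrogram = rng.standard_normal((16, 15))
w = rng.standard_normal((16, 3))          # current input plus two delays
activations = tdnn_unit(spectrogram, w)   # one activation per window position
evidence = activations.sum()              # evidence integrated over time
```

Embedding the same feature patch at two different frame positions produces identical activations at the corresponding window positions, which is the translation-invariance property discussed in the text.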
The second property, "translation invariance," is implemented by TDNN-units that scan an input token over time, in search of important local acoustic clues, instead of applying one large network to the entire input pattern. Translation invariant learning in these units is achieved by forcing the network to develop useful hidden units regardless of position in the utterance. In our implementation this was done by linking the weights of time-shifted instantiations of the net during a scan through the input token (thus removing relative timing information). Figure 1 illustrates a TDNN trained to discriminate between the voiced stop consonants /b,d,g/ (see Waibel et al. 1987 for a more detailed description of its operation). The three-category TDNN shown here (Fig. 1) has been evaluated over a large number of phonetic tokens (/b,d,g/). These tokens were generated by extracting the 150 msec intervals around pertinent phonemes from a phonetically hand-labeled, large vocabulary database of isolated
Modular Construction of Time-Delay Neural Networks
Figure 1: The TDNN architecture (input: "BA"). Eight hidden units in hidden layer 1 are fully interconnected with a set of 16 spectral coefficients and two delayed versions, illustrated by the window over 48 input units. Each of these eight units in hidden layer 1 produces patterns of activation as the window moves through the input speech. A five-frame window scanning these activation patterns over time then activates each of three units in hidden layer 2. These activations over time are in turn integrated into one single output decision. Note that the final decision is based on the combined acoustic evidence, independent of where in the given input interval (15 frames or 150 msec) the /b/, /d/, or /g/ actually occurred.
Japanese utterances (Waibel et al. 1987). While isolated pronunciation provided relatively well articulated tokens, the data nevertheless included significant variability due to different phonetic contexts (e.g., "DO" vs. "DI") and position in the utterance. Recognition experiments with three different male speakers showed that discrimination scores between 97.5% and 99.1% could be obtained.¹ These scores compare favorably with those obtained using several standard implementations of Hidden Markov Model speech recognition algorithms (Waibel et al. 1987). To understand the operation of the TDNNs, the weights and activation patterns of trained /b,d,g/-nets have been extensively evaluated (Waibel et al. 1987). Several interesting properties were observed:

1. The TDNNs developed linguistically plausible features in the hidden units, such as movement detectors for first and second formants, segment boundary detectors, etc.

2. The TDNN has developed alternate internal representations that can link quite different acoustic realizations to the same higher level concept (here: phoneme). This is possible due to the multilayer arrangement used.
3. The hidden units fire in synchrony with detected lower layer events. These units therefore operate independently of precise time alignment or segmentation and could lead to translation-invariant phoneme recognition.
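The architecture of Figure 1 can be sketched end-to-end with the dimensions stated in the caption (15 input frames of 16 coefficients, eight hidden-1 units over a three-frame window, three hidden-2 units over a five-frame window, and temporal integration into one decision). This is a minimal sketch with random placeholder weights and an assumed sigmoid nonlinearity, not the trained network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tdnn_layer(x, w, b):
    """Scan a window of frames over x (T, F) with weights shared across all
    time shifts; this weight linking is what yields translation invariance.
    w : (units, window, F), b : (units,).  Returns (T - window + 1, units)."""
    units, window, F = w.shape
    T = x.shape[0]
    return sigmoid(np.array(
        [[np.sum(x[t:t + window] * w[u]) + b[u] for u in range(units)]
         for t in range(T - window + 1)]))

rng = np.random.default_rng(1)
x = rng.standard_normal((15, 16))                                      # 15 frames, 16 coefficients
h1 = tdnn_layer(x, rng.standard_normal((8, 3, 16)) * 0.1, np.zeros(8)) # -> (13, 8)
h2 = tdnn_layer(h1, rng.standard_normal((3, 5, 8)) * 0.1, np.zeros(3)) # -> (9, 3)
scores = h2.sum(axis=0)         # integrate each category's evidence over time
print(int(np.argmax(scores)))   # 0, 1, or 2 -> /b/, /d/, or /g/
```

Because each category's activations are summed over all nine positions before the decision, the classification is independent of where in the 150 msec interval the stop consonant occurred.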
Our results suggest that the TDNN has most of the desirable properties needed for robust speech recognition performance.

3 The Problem of Scaling
Encouraged by the good performance and the desirable properties of the model, we wanted to extend TDNNs to the design of large scale connectionist speech recognition systems. Some simple preliminary considerations, however, raise serious questions about the extendibility of connectionist design: Is it feasible, within limited resources and time, to build and train ever larger neural networks? Is it possible to add new knowledge to existing networks? With speech being one of the most complex and all-encompassing human cognitive abilities, this question of scaling must be addressed. As a first step, let us consider the problem of extending the scope of our networks from tackling the three-category task of all voiced stops (/b,d,g/) to the task of dealing with all stop consonants (/b,d,g,p,t,k/). The first row in table 1 shows the recognition scores of two individually

¹All recognition scores in this paper were obtained from evaluation on test data that was not included in training.
Method                            bdg      ptk      bdgptk
Individual TDNNs                  98.3%    98.7%
TDNN: Max. Activation                               60.5%
Retrain BDGPTK                                      98.3%
Retrain Combined Higher Layers                      98.1%
Retrain with V/UV-units                             98.4%
Retrain with Glue                                   98.4%
All-net Fine Tuning                                 98.6%
Table 1: From /b,d,g/ to /b,d,g,p,t,k/; modular scaling methods.

trained three-category nets, one trained on the voiced stop consonant discrimination task (/b,d,g/) and the other on the voiceless stop consonant discrimination task (/p,t,k/). A naive attempt at combining these two nets by simply choosing the maximally activated output unit from the two separately trained nets resulted in failure, as seen by the low recognition score (60.5%) in the second row. This is to be expected, since neither network was trained using the other's phonetic categories, and independent output decisions minimize the error for only small subsets of the task. A larger network (/b,d,g,p,t,k/-net) with six output units was therefore trained. Twenty hidden units (instead of eight) were used in hidden layer 1 and six in hidden layer 2. Good performance could now be achieved (98.3%), but significantly more processing had to be expended to train this larger net. While the task size was only doubled, the number of connections to be trained actually tripled. To make matters worse, more training data is generally needed to achieve good generalization in larger networks, and the search complexity in a higher dimensional weight space increases dramatically as well. Even without increasing the number of training tokens in proportion to the number of connections, the complete /b,d,g,p,t,k/-net training run still required 18 days on a 4-processor Alliant supermini and had to be restarted several times before an acceptable solution had been found. The original /b,d,g/-net, by comparison, took only three days. It is clear that learning time increases more than linearly with task size, not to mention practical limitations such as available training data and computational capabilities. In summary, the dilemma between performance and resource limitations must
be addressed if neural networks are to be applied to large real-world tasks. Our proposed solutions are based on three observations:

1. Networks trained to perform a smaller task may not produce outputs that are useful for solving a more complex task, but the knowledge and internal abstractions developed in the process may indeed be valuable.

2. Learning complex concepts in (developmental) stages based on previously learned knowledge is a plausible model of human learning and should be applied in connectionist systems.

3. To increase competence, connectionist learning strategies should build on existing distributed knowledge rather than trying to undo, ignore, or relearn such knowledge.
Four experiments have been performed:

1. The previously learned hidden abstractions from the first layer of a /b,d,g/-net and a /p,t,k/-net were frozen by keeping their connections to the input fixed. Only connections from these two hidden layers 1 to a combined hidden layer 2 and to the output layer were retrained. While only modest (a few hours of) additional training was necessary at the higher layers, the recognition performance (98.1%) was found to be almost as good as for the monolithically trained /b,d,g,p,t,k/-net (see table 1). The small difference in performance might have been caused by the absence of features needed to merge the two subnets (here, for example, the voicing feature distinguishing voiced from voiceless stops).

2. Hidden features from hidden layer 1 are fixed as in the previous experiment, but four additional class-distinctive features are incorporated at the first hidden layer. These four units were excised from a net that was exclusively trained to perform voiced/unvoiced discrimination. The voiced/unvoiced net could be trained in little more than one day, and combination training at the higher layers was accomplished in a few hours. A high recognition rate of 98.4% was achieved.

3. The hidden units from hidden layer 1 are fixed as before, and four additional free units are incorporated. These free units are called connectionist glue, since they are intended to fit or glue together two distinct, previously trained nets. This network is shown in figure 2. The four glue units can be seen to have free connections to the input that are trained along with the higher layer combinations. In this fashion they can discover additional features that are needed to accurately perform the larger task. In addition to training the original /b,d,g/- and /p,t,k/-nets, combination training using glue
units was accomplished in two days. The resulting net achieved a recognition rate of 98.4%.
4. All-net fine tuning was performed on the previous network. Here, all connections of the entire net were freed once again for several hours of learning, to perform small additional weight adjustments. While each of these learning iterations was indeed very slow, only a few iterations were needed to fine tune the entire network for a best performance of 98.6%.

Only modest additional training beyond that required to train the subcomponent nets was necessary in these experiments. Performance, however, was as good as or better than that provided by a monolithically trained net and as high as the performance of the original smaller subcomponent nets.
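The freeze-and-glue scheme above can be sketched as gradient masking: the first-layer weights of the two pre-trained subnets are held fixed, while the four glue units' input connections remain trainable. This is an illustrative sketch (the dimensions follow the text: 8 + 8 frozen hidden units plus 4 glue units over 48 inputs); the update itself uses a dummy random gradient, not real backpropagation.

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = np.vstack([rng.standard_normal((8, 48)),   # frozen /b,d,g/ hidden units
                rng.standard_normal((8, 48)),   # frozen /p,t,k/ hidden units
                np.zeros((4, 48))])             # 4 free "glue" units, trainable
trainable = np.zeros((20, 48), dtype=bool)
trainable[16:] = True                           # only the glue rows may change

def sgd_step(W, grad, lr=0.1):
    """Apply a gradient step, masking out the frozen subnet weights so that
    previously learned abstractions are preserved rather than relearned."""
    return W - lr * np.where(trainable, grad, 0.0)

W1_new = sgd_step(W1, rng.standard_normal(W1.shape))
print(np.array_equal(W1_new[:16], W1[:16]))     # True: frozen knowledge kept
print(bool(np.any(W1_new[16:] != W1[16:])))     # True: glue units did move
```

All-net fine tuning then corresponds to simply setting the whole mask to `True` for a few final, slow iterations.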
Figure 2: Combination of a /b,d,g/-net and a /p,t,k/-net using four additional units in hidden layer 1 as free "Connectionist Glue."
4 Conclusion
We have described connectionist networks with delays that can represent the dynamic nature of speech and demonstrated techniques to scale these networks up in size for increasingly large recognition tasks. Our results suggest that it is possible to train larger neural nets in a modular, incremental fashion from smaller subcomponent nets without loss in recognition performance. These techniques have been applied successfully to the design of neural networks capable of discriminating all consonants in spoken isolated utterances (Waibel et al. 1988). With recognition rates of 96%, these nets were found to compare very favorably (Waibel et al. 1988) with competing recognition techniques in use today.

Acknowledgments

The author would like to acknowledge his collaborators: Toshiyuki Hanazawa, Geoffrey Hinton, Kevin Lang, Hidefumi Sawai, and Kiyohiro Shikano, for their contributions to the work described in this paper. This research would also not have been possible without the enthusiastic support of Akira Kurematsu, President of ATR Interpreting Telephony Research Laboratories.

References

Lang, K. 1987. Connectionist Speech Recognition. Ph.D. thesis proposal, Carnegie Mellon University.

Lippmann, R.P. 1987. An Introduction to Computing with Neural Nets. IEEE ASSP Magazine 4, 4-22.

Rumelhart, D.E. and J.L. McClelland. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: MIT Press.

Waibel, A., T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. 1987. Phoneme Recognition Using Time-Delay Neural Networks. Technical Report TR-1-0006, ATR Interpreting Telephony Research Laboratories. Also scheduled to appear in IEEE Transactions on Acoustics, Speech and Signal Processing, March 1989.

Waibel, A., H. Sawai, and K. Shikano. 1988. Modularity and Scaling in Large Phonemic Neural Networks. Technical Report TR-1-0034, ATR Interpreting Telephony Research Laboratories; IEEE Transactions on Acoustics, Speech and Signal Processing, to appear.
Received 6 November; accepted 8 December 1988
Communicated by John Wyatt
A Silicon Model Of Auditory Localization

John Lazzaro and Carver A. Mead
Department of Computer Science, California Institute of Technology, MS 256-80, Pasadena, CA 91125, USA
The barn owl accurately localizes sounds in the azimuthal plane, using interaural time difference as a cue. The time-coding pathway in the owl's brainstem encodes a neural map of azimuth, by processing interaural timing information. We have built a silicon model of the time-coding pathway of the owl. The integrated circuit models the structure as well as the function of the pathway; most subcircuits in the chip have an anatomical correlate. The chip computes all outputs in real time, using analog, continuous-time processing.
1 Introduction

The principles of organization of neural systems arose from the combination of the performance requirements for survival and the physics of neural elements. From this perspective, the extraction of time-domain information from auditory data is a challenging computation; the system must detect changes in the data which occur in tens of microseconds, using neurons which can fire only once per several milliseconds. Neural approaches to this problem succeed by closely coupling algorithms and implementation. We are building silicon models of the auditory localization system of the barn owl, to explore the general computational principles of time-domain processing in neural systems. The barn owl (Tyto alba) uses hearing to locate and catch small rodents in total darkness. The owl localizes the rustles of the prey to within one to two degrees in azimuth and elevation (Knudsen et al. 1979). The owl uses different binaural cues to determine azimuth and elevation. The elevational cue for the owl is interaural intensity difference (IID). This cue is a result of a vertical asymmetry in the placement of the owl's ear openings, as well as a slight asymmetry in the left and right halves of the owl's facial ruff (Knudsen and Konishi 1979). The azimuthal cue is interaural time difference (ITD). The ITDs are in the microsecond range, and vary as a function of azimuthal angle of the sound source (Moiseff and Konishi 1981). The external nucleus of the owl's inferior colliculus (ICx) contains the neural substrate of sound localization, a map of auditory space (Knudsen and Konishi 1978). Neurons in the ICx respond

Neural Computation 1, 47-57 (1989)
© 1989 Massachusetts Institute of Technology
maximally to stimuli located in a small area in space, corresponding to a specific combination of IID and ITD. There are several stages of neural processing between the cochlea and the computed map of space in the ICx. Each primary auditory fiber initially divides into two distinct pathways. One pathway processes intensity information, encoding elevation cues, whereas the other pathway processes timing information, encoding azimuthal cues. The time-coding and intensity-coding pathways recombine in the ICx, producing a complete map of space (Takahashi and Konishi 1988).

2 A Silicon Model of the Time-Coding Pathway
We have built an integrated circuit that models the time-coding pathway of the barn owl, using analog, continuous-time processing. Figure 1 shows the floorplan of the chip. The chip receives two inputs, corresponding to the sound pressure at each ear of the owl. Each input connects to a silicon model of the cochlea, the organ that converts the sound energy present at the eardrum into the first neural representation of the auditory system. In the cochlea, sound is coupled into a traveling wave structure, the basilar membrane, which converts time-domain information into spatially-encoded information, by spreading out signals in space according to their time scale (or frequency). The cochlea circuit is a one-dimensional physical model of this traveling wave structure; in engineering terms, the model is a cascade of second-order sections, with exponentially scaled time constants (Lyon and Mead 1988). In the owl, inner hair cells contact the basilar membrane at discrete intervals, converting basilar-membrane movement into a graded, halfwave rectified electrical signal. Spiral ganglion neurons connect to each inner hair cell, producing action potentials in response to inner-hair-cell electrical activity. The temporal pattern of action potentials encodes the shape of the sound waveform at each basilar-membrane position. Spiral ganglion neurons also reflect the properties of the cochlea; a spiral ganglion neuron is most sensitive to tones of a specific frequency, the neuron’s characteristic frequency. In our chip, inner hair cell circuits connect to taps at discrete intervals along the basilar-membrane model. These circuits compute signal processing operations (half-wave rectification and nonlinear amplitude compression) that occur during inner hair cell transduction. Each inner hair cell circuit connects to a spiral ganglion neuron circuit. 
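The cascade of second-order sections with exponentially scaled time constants can be sketched in discrete time. This is a software stand-in for the analog circuit, not the chip itself: the filter stages here are standard digital low-pass biquads (RBJ "Audio EQ Cookbook" coefficients), and the sample rate, Q, and frequency scaling factor are illustrative assumptions; only the 62 taps and the half-wave rectification at each tap follow the text.

```python
import numpy as np

def biquad_lowpass(x, fc, q, fs):
    """One discrete-time second-order low-pass section (direct form I),
    standing in for one stage of the analog cascade."""
    w0 = 2.0 * np.pi * fc / fs
    alpha = np.sin(w0) / (2.0 * q)
    b0 = b2 = (1.0 - np.cos(w0)) / 2.0
    b1 = 1.0 - np.cos(w0)
    a0, a1, a2 = 1.0 + alpha, -2.0 * np.cos(w0), 1.0 - alpha
    y = np.zeros_like(x)
    for n in range(len(x)):
        xn1 = x[n - 1] if n >= 1 else 0.0
        xn2 = x[n - 2] if n >= 2 else 0.0
        yn1 = y[n - 1] if n >= 1 else 0.0
        yn2 = y[n - 2] if n >= 2 else 0.0
        y[n] = (b0 * x[n] + b1 * xn1 + b2 * xn2 - a1 * yn1 - a2 * yn2) / a0
    return y

fs = 32000
t = np.arange(2048) / fs
x = np.sin(2 * np.pi * 3000 * t)          # a 3 kHz test tone
taps = []
for k in range(62):                        # 62 output taps, as on the chip
    fc = 8000.0 * (0.93 ** k)              # exponentially scaled time constants
    x = biquad_lowpass(x, fc, 0.8, fs)
    taps.append(np.maximum(x, 0.0))        # hair cell model: half-wave rectify
print(len(taps))                           # 62
```

As in the traveling-wave structure, a tone propagates through the early stages and is progressively attenuated once the stage cutoffs fall below its frequency, so signal energy is spread out in space according to time scale.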
This integrate-to-threshold neuron circuit converts the analog output of the inner-hair-cell model into fixed-width, fixed-height pulses. Timing information is preserved by greatly increasing the probability of firing near the zero crossings of the derivative of the neuron’s input. In the owl, the spiral ganglion neurons project to the nucleus magnocellularis (NM), the first nucleus of the time-coding pathway. The NM
Figure 1: Floorplan of the silicon model of the time-coding pathway of the owl. Sounds for the left ear and right ear enter the respective silicon cochleas at the lower left and lower right of the figure. Inner hair cell circuits tap each silicon cochlea at 62 equally-spaced locations; each inner hair cell circuit connects directly to a spiral ganglion neuron circuit. The square box marked with a pulse represents both the inner hair cell circuit and spiral ganglion neuron circuit. Each spiral ganglion neuron circuit generates action potentials; these signals travel down silicon axons, which propagate from left to right for spiral ganglion neuron circuits from the left cochlea, and from right to left for spiral ganglion circuits from the right cochlea. The rows of small rectangular boxes, marked with the symbol At, represent the silicon axons. 170 NL neuron circuits, represented by small circles, lie between each pair of antiparallel silicon axons. Each NL neuron circuit connects directly to both axons, and responds maximally when action potentials present in both axons reach that particular neuron at the same time. In this way, ITDs map into a neural place code. Each vertical wire which spans the array combines the response of all NL neuron circuits which correspond to a specific ITD. These 170 vertical wires form a temporally smoothed map of ITD, which responds to a wide range of input sound frequencies. The nonlinear inhibition circuit near the bottom of the figure increases the selectivity of this map. The time-multiplexing scanner transforms this map into a signal suitable for display on an oscilloscope.
acts as a specialized relay station; neurons in the NM preserve timing information, and project bilaterally to the nucleus laminaris (NL), the first nucleus in the time-coding pathway that receives inputs from both ears. For simplicity, our chip does not model the NM; each spiral ganglion neuron circuit directly connects to a silicon NL. Neurons in the NL are most sensitive to binaural sounds with a specific ITD. In 1948, Jeffress proposed a model to explain the encoding of ITD in neural circuits (Jeffress 1948). In the Jeffress model applied to the owl, axons from the ipsilateral and contralateral NM, with similar characteristic frequencies, enter the NL from opposite surfaces. The axons travel antiparallel, and action potentials counterpropagate across the NL; the axons act as neural delay lines. NL neurons are adjacent to both axons. Each NL neuron receives synaptic connections from both axons, and fires maximally when action potentials present in both axons reach that particular neuron at the same time. In this way, ITD is mapped into a neural place coding; the ITD that maximally excites an NL neuron depends on the position of the neuron in the NL. Anatomical and physiological evidence in the barn owl supports this theory (Carr and Konishi 1988). The chip models the anatomy of the NL directly (Fig. 1). Two silicon cochleas lie at opposite ends of the chip; spiral ganglion neuron circuits from each cochlea, with similar characteristic frequencies, project to separate axon circuits, which travel antiparallel across the chip. The axon circuit is a discrete neural delay line; for each action potential at the axon's input, a fixed-width, fixed-height pulse travels through the axon, section by section, at a controllable velocity (Mead 1989). NL neuron circuits lie between each pair of antiparallel axons at every discrete section, and connect directly to both axons.
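The Jeffress scheme can be sketched discretely. This is a software illustration, not the circuit: spike trains are binary arrays, the two delay lines are index shifts in opposite directions, and "coincidence" is the product of the two delayed trains; the nine taps and the two-step ITD are arbitrary illustrative choices.

```python
import numpy as np

def jeffress_place_map(left, right, n_taps):
    """Place code for ITD: counter-propagating delay lines mean the
    coincidence detector at tap k sees the left signal delayed by k steps
    and the right signal delayed by (n_taps - 1 - k) steps, so each tap
    is tuned to a different interaural time difference."""
    T = len(left)
    out = np.zeros(n_taps)
    for k in range(n_taps):
        l = np.pad(left, (k, 0))[:T]                 # left delayed by k
        r = np.pad(right, (n_taps - 1 - k, 0))[:T]   # right delayed by n-1-k
        out[k] = np.sum(l * r)                       # count coincident spikes
    return out

# Spike trains: the right ear hears the same clicks two steps later.
left = np.zeros(64)
left[np.arange(4, 60, 7)] = 1.0
right = np.zeros(64)
right[np.arange(4, 60, 7) + 2] = 1.0

itd_map = jeffress_place_map(left, right, n_taps=9)
print(int(np.argmax(itd_map)))   # 5: the tap whose delay difference matches the ITD
```

Coincidence occurs only at the tap where the internal delay difference cancels the external ITD, so the peak position of the output vector is the place code for azimuth.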
Simultaneous action potentials at both inputs excite the NL neuron circuit; if only one input is active, the neuron generates no output. For each pair of antiparallel axons, there is a row of 170 NL neuron circuits across the chip. These neurons form a place encoding of ITD. Our silicon NL differs from the owl's NL in several ways. The silicon NL neurons are perfect coincidence detectors; in the owl, NL neurons also respond, with reduced intensity, to monaural input. In the owl, many axons from each side converge on an NL neuron; in the chip, only two silicon axons converge on each silicon NL neuron. Finally, the brainstem of the owl contains two NLs, symmetric about the midline; each NL primarily encodes one half of the azimuthal plane. For simplicity, our integrated circuit has only one copy of the NL, which encodes all azimuthal angles. In the owl, the NL projects to a subdivision of the central nucleus of the inferior colliculus (ICc), which in turn projects to the ICx. The ICx integrates information from the time-coding pathway and from the amplitude-coding pathway to produce a complete map of auditory space. The final output of our integrated circuit models the responses of ICx
neurons to ITDs. In response to ITDs, ICx neurons act differently from NL neurons. Experiments suggest mechanisms for these differences; our integrated circuit implements several of these mechanisms to produce a neural map of ITD. Neurons in the NL and ICc respond to all ITDs that result in the same interaural phase difference (IPD) of the neuron’s characteristic frequency; neurons in the ICx respond to only the one true ITD. This behavior suggests that ICx neurons combine information from many frequency channels in the ICc, to disambiguate ITDs from IPDs; indeed, neurons in the NL and ICc reflect the frequency characteristics of spiral ganglion neurons, whereas ICx neurons respond equally to a wide range of frequencies. In our chip, all NL neuron outputs corresponding to a particular ITD are summed to produce a single output value. NL neuron outputs are current pulses; a single wire acts as a dendritic tree to perform the summation. In this way, a two-dimensional matrix of NL neurons reduces to a single vector; this vector is a map of ITD, for all frequencies. In the owl, inhibitory circuits between neurons tuned to the same ITD may also be present, before summation across frequency channels. Our model does not include these circuits. Neurons in the ICc are more selective to ITDs than are neurons in the NL; in turn, ICx neurons are more selective to ITDs than are ICc neurons, for low frequency sounds. At least two separate mechanisms join to increase selectivity. The selectivity of ICc and ICx neurons increases with the duration of a sound, for sounds lasting less than 5 milliseconds, implying that the ICc and perhaps the ICx may use temporal integration to increase selectivity (Wagner and Konishi, in preparation). Our chip temporally integrates the vector that represents ITD; the time constant of integration is adjustable. 
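The temporal integration with an adjustable time constant, followed by the chip's winner-take-all selection, can be mimicked in software as a leaky integrator feeding a hard maximum. This is an illustrative analogue, not the shunting-inhibition circuit; the 170 channels follow the text, while the frame count, noise level, and time constant are assumptions.

```python
import numpy as np

def smooth_and_select(maps, tau):
    """Leaky temporal integration of successive ITD maps (time constant tau,
    in frames), then a hard winner-take-all: a software analogue of the
    chip's adjustable smoothing and global shunting inhibition."""
    state = np.zeros(maps.shape[1])
    a = np.exp(-1.0 / tau)
    for m in maps:                      # integrate each incoming map
        state = a * state + (1.0 - a) * m
    winner = np.zeros_like(state)
    winner[np.argmax(state)] = state.max()   # single-peak output map
    return winner

rng = np.random.default_rng(3)
noisy = rng.random((50, 170)) * 0.5     # 170 ITD channels, as on the chip
noisy[:, 42] += 1.0                     # consistent evidence at one ITD
out = smooth_and_select(noisy, tau=10.0)
print(int(np.argmax(out)))              # 42
```

The smoothing suppresses frame-to-frame noise while the winner-take-all removes the residual activity at non-winning ITDs, leaving a single-peaked map.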
Nonlinear inhibitory connections between neurons tuned to different ITDs in the ICc and ICx also increase sensitivity to ITDs; application of an inhibitory blocker to either the ICc or ICx decreases sensitivity to ITD (Fujita and Konishi, in preparation). In our chip, a global shunting inhibition circuit (Lazzaro et al. 1988) processes the temporally integrated vector that represents ITD. This nonlinear circuit performs a winner-take-all function, producing a more selective map of ITD. The chip time-multiplexes this output map on a single wire for display on an oscilloscope.

3 Comparison of Responses
We presented periodic click stimuli to the chip (Fig. 2), and recorded the final output of the chip, a map of ITD. Three signal-processing operations, computed in the ICx and ICc of the owl, improve the original encoding of ITDs in the NL: temporal integration, integration of information over many frequency channels, and inhibition among neurons tuned to different ITDs. In our chip, we can disable the inhibition and temporal-integration operations, and observe the unprocessed map of ITD (Fig. 3). By combining the outputs of 62 rows of NL neurons, each tuned to a separate frequency region, the maps in figure 3 correctly encode ITD, despite random variations in axonal velocity and cochlear delay. Figure 5 shows this variation in velocity of axonal propagation, due to circuit element imperfections. Figure 4 shows maps of ITD taken with the inhibition and temporal integration operations enabled. Most maps show a single peak, with little activity at other positions. Figure 6a is an alternative representation of the map of ITD computed by the chip. We recorded the map position of the neuron with maximum signal energy, for different ITDs. Carr and Konishi (1988) performed a similar experiment in the owl's NL (Fig. 6b), mapping the time delay of an axon innervating the NL, as a function of position in the NL. The linear properties of our chip map are the same as those of the owl map.

Figure 2: Input stimulus for the chip. Both left and right ears receive a periodic click waveform, at a frequency of 475 Hz. The time delay between the two signals, notated as δt, is variable.
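A stimulus of this kind can be sketched as two click trains, the right ear's a delayed copy of the left's. The 475 Hz click rate and variable interaural delay follow the text; the sample rate, duration, and the particular 400 µs delay are illustrative assumptions, not the chip's drive electronics.

```python
import numpy as np

def click_pair(freq_hz=475.0, delta_t=400e-6, fs=1_000_000, dur=0.01):
    """Periodic click trains for the two ears; the right ear's train is a
    copy of the left's, delayed by the interaural time difference delta_t."""
    period = int(round(fs / freq_hz))            # ~2.1 ms between clicks
    left = np.zeros(int(dur * fs))
    left[::period] = 1.0
    shift = int(round(delta_t * fs))
    right = np.concatenate([np.zeros(shift), left])[:len(left)]
    return left, right

left, right = click_pair()
print(np.flatnonzero(left)[:2], np.flatnonzero(right)[:2])
```

Sweeping `delta_t` over the microsecond range while holding the click rate fixed reproduces the family of stimuli used for the maps.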
Figure 3: Map of ITD, taken from the chip. The nonlinear inhibition and temporal smoothing operations were turned off, showing the unprocessed map of ITD. The vertical axis of each map corresponds to neural activity level, whereas the horizontal axis of each map corresponds to linear position within the map. The stimulus for each plot is the periodic click waveform of Figure 2; δt is shown in the upper left corner of each plot, measured in milliseconds. Each map is an average of several maps recorded at 100 millisecond intervals; averaging is necessary to capture a representation of the quickly changing, temporally unsmoothed response. The encoding of ITD is present in the maps, but false correlations add unwanted noise to the desired signal. Since we are using a periodic stimulus, large time delays are interpreted as negative delays, and the map response wraps from one side to the other at an ITD of 1.2 milliseconds.

Figure 4: Map of ITD, taken from the chip. The nonlinear inhibition and temporal smoothing operations were turned on, showing the final output map of ITD. Format is identical to Figure 3. Off-chip averaging was not used, since the chip temporally smooths the data. Most maps show a single peak, with little activity at other positions, due to nonlinear inhibition. The maps do not reflect the periodicity of the individual frequency components of the sound stimulus; additional experiments with a noise stimulus confirm the phase-disambiguation property of the chip.

Figure 5: Variation in the pulse width of a silicon axon, over about 100 axonal sections. Axons were set to fire at a slower velocity than in the owl model, for more accurate measurement. In this circuit, a variation in axon pulse width indicates a variation in the velocity of axonal propagation; this variation is a potential source of localization error.

4 Conclusions

Traditionally, scientists have considered analog integrated circuits and neural systems to be two disjoint disciplines. The two media are different in detail, but the physics of computation in silicon technology and in neural technology are remarkably similar. Both media offer a rich palette of primitives in which to build a structure; both pack a large number of imperfect computational elements into a small space; both are ultimately limited not by the density of devices, but by the density of interconnect. Modeling neural systems directly in a physical medium subjects the researcher to many of the same pressures faced by the nervous system over the course of evolutionary time. We have built a 220,000-transistor chip that models, to a first approximation, a small but significant part of a spectacular neural system. In doing so we have faced many design problems solved by the nervous system. This experience has forced us to a high level of concreteness in specifying this demanding computation. This chip represents only the first few stages of auditory processing, and thus is only a first step in auditory modeling. Each individual circuit in the chip is only a first approximation to its physiological counterpart. In addition, there are other auditory pathways to explore: the intensity-coding localization pathway, the elevation localization pathway in mammals, and, most formidably, the sound-understanding structures that receive input from these pathways.
Figure 6: (a) Chip data showing the linear relationship of silicon NL neuron position and ITD. For each ITD presented to the chip, the output map position with the maximal response is plotted. The linearity shows that silicon axons have a uniform mean time delay per section. (b) Recordings of the NM axons innervating the NL in the barn owl (Carr and Konishi 1988). The figure shows the mean time delays of contralateral fibers recorded at different depths during one penetration through the 7 kHz region.
Acknowledgments

We thank M. Konishi and his entire research group, in particular S. Volman, I. Fujita, and L. Proctor, as well as D. Lyon, M. Mahowald, T. Delbruck, L. Dupré, J. Tanaka, and D. Gillespie, for critically reading and correcting the manuscript, and for consultation throughout the project.
We thank Hewlett-Packard for computing support, and DARPA and MOSIS for chip fabrication. This work was sponsored by the Office of Naval Research and the System Development Foundation.
References

Carr, C.E. and M. Konishi. 1988. Axonal Delay Lines for Time Measurement in the Owl's Brainstem. Proc. Nat. Acad. Sci. 85, 8311-8315.

Fujita, I. and M. Konishi. In preparation.

Jeffress, L.A. 1948. A Place Theory of Sound Localization. J. Comp. Physiol. Psychol. 41, 35-39.

Knudsen, E.I., G.G. Blasdel, and M. Konishi. 1979. Sound Localization by the Barn Owl Measured with the Search Coil Technique. J. Comp. Physiol. 133, 1-11.

Knudsen, E.I. and M. Konishi. 1979. Mechanisms of Sound Localization in the Barn Owl (Tyto alba). J. Comp. Physiol. 133, 13-21.

Knudsen, E.I. and M. Konishi. 1978. A Neural Map of Auditory Space in the Owl. Science 200, 795-797.

Lazzaro, J.P., S. Ryckebusch, M.A. Mahowald, and C.A. Mead. 1988. Winner-Take-All Networks of O(n) Complexity. Proc. IEEE Conf. Neural Information Processing Systems, Denver, CO.

Lyon, R.F. and C. Mead. 1988. An Analog Electronic Cochlea. IEEE Trans. Acoust., Speech, Signal Processing 36, 1119-1134.

Mead, C.A. 1989. Analog VLSI and Neural Systems. Reading, MA: Addison-Wesley.

Moiseff, A. and M. Konishi. 1981. Neuronal and Behavioral Sensitivity to Binaural Time Differences in the Owl. J. Neurosci. 1, 40-48.

Takahashi, T.T. and M. Konishi. 1988. Projections of the Nucleus Angularis and Nucleus Laminaris to the Lateral Lemniscal Nuclear Complex of the Barn Owl. J. Compar. Neurol. 274, 221-238.

Wagner, H. and M. Konishi. In preparation.
Received 26 October; accepted 9 November 1988.
Communicated by Carver Mead
Criteria for Robust Stability In A Class Of Lateral Inhibition Networks Coupled Through Resistive Grids

John L. Wyatt, Jr., and David L. Standley
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
In the analog VLSI implementation of neural systems, it is sometimes convenient to build lateral inhibition networks by using a locally connected on-chip resistive grid to interconnect active elements. A serious problem of unwanted spontaneous oscillation often arises with these circuits and renders them unusable in practice. This paper reports on criteria that guarantee these and certain other systems will be stable, even though the values of designed elements in the resistive grid may be imprecise and the location and values of parasitic elements may be unknown. The method is based on a rigorous, somewhat novel mathematical analysis using Tellegen's theorem (Penfield et al. 1970) from electrical circuits and the idea of a Popov multiplier (Vidyasagar 1978; Desoer and Vidyasagar 1975) from control theory. The criteria are local in that no overall analysis of the interconnected system is required for their use, empirical in that they involve only measurable frequency response data on the individual cells, and robust in that they are insensitive to network topology and to unmodelled parasitic resistances and capacitances in the interconnect network. Certain results are robust in the additional sense that specified nonlinear elements in the grid do not affect the stability criteria. The results are designed to be applicable, with further development, to complex and incompletely modelled living neural systems.

1 Introduction

In the VLSI implementation of lateral inhibition and certain other types of networks, active cells are locally interconnected through an on-chip resistive grid. Linear resistors fabricated in, e.g., polysilicon, could yield a very compact realization, and nonlinear resistive grids, made from MOS transistors, have been found useful for image segmentation (Hutchinson et al. 1988). Networks of this type can be divided into two classes: feedback systems and feedforward-only systems. In the feedforward case

Neural Computation 1, 58-67 (1989)
© 1989 Massachusetts Institute of Technology
one set of amplifiers imposes signal voltages or currents on the grid and another set reads out the resulting response for subsequent processing, while in a feedback arrangement the same amplifiers both "write to" the grid and "read from" it. Feedforward networks of this type are inherently stable, but feedback networks need not be. A practical example is one of Mahowald and Mead's retina chips (Mead and Mahowald 1988; Mead 1988) that achieves edge enhancement by means of lateral inhibition through a resistive grid. Figure 1a shows a single cell in an earlier version of this chip, and figure 1b illustrates the network of interconnected cells. Experiment has shown that the individual cells in this system are open-circuit stable and remain stable when the output of amplifier #2 is connected to a voltage source through a resistor, but the interconnected system oscillates so badly that the earlier design is scarcely usable¹ (Mahowald and Mead 1988). Such oscillations can readily occur in most resistive grid circuits with active elements and feedback, even when each individual cell is quite stable. Analysis of the conditions of instability by conventional methods appears hopeless, since the number of simultaneously active feedback loops is enormous.

This paper reports a practical design approach that rigorously guarantees such a system will be stable if the active cells meet certain criteria. The work begins with the naïve observation that the system would be stable if we could design each individual cell so that, although internally active, it acts like a passive system as seen from the resistive grid. The design goal in that case would be that each cell's output impedance should be a positive-real (Vidyasagar 1978; Desoer and Vidyasagar 1975; Anderson and Vongpanitlerd 1973) function. This is sometimes possible in practice; we will show that the original network in figure 1a would satisfy this condition in the absence of certain parasitic elements.
Furthermore, it is a condition one can verify experimentally by frequency-response measurements. It is obvious that a collection of cells that appear passive at their terminals will form a stable system when interconnected through a passive medium such as a resistive grid, and that the stability of such a system is robust to perturbations by passive parasitic elements in the network. The work reported here goes beyond that observation to provide (i) a demonstration that the passivity or positive-real condition is much stronger than we actually need and that weaker conditions, more easily achieved in practice, suffice to guarantee robust stability of the linear active network model, and (ii) an extension of the analysis to the nonlinear domain that furthermore rules out sustained large-signal oscillations under certain conditions. A key feature of the integrated circuit environment that makes these results applicable is the almost total absence of on-chip inductance.

¹The later design reported in (Mead and Mahowald 1988) avoids stability problems altogether, at a small cost in performance, by redesigning the circuits to passively sense the grid voltage in a "feedforward" style as described above.

While the cells can appear inductive, as in figure 3c,
Figure 1: (a) This photoreceptor and signal processor circuit, using two MOS amplifiers, realizes spatial lateral inhibition and temporal sharpening by communicating with similar cells through a resistive grid. The resistors will often be nonlinear by design. (b) Interconnection of cells through a hexagonal resistive grid. Cells are drawn as 2-terminal elements with the power supply and signal output lines suppressed. The voltage on the capacitor in any given cell is affected both by the local light intensity incident on that cell and by the capacitor voltages on neighboring cells of identical design. The necessary ingredients for instability - active elements and signal feedback - are both present in this system. (c) Grid resistors with a nonlinear characteristic of the form i = tanh(v) can be useful in image segmentation (Hutchinson et al. 1988).
the absence of inductance in our grid models makes these theorems possible.

Note that these results do not apply directly to networks created by interconnecting neuron-like elements, as conventionally described in the literature on artificial neural systems. The "neurons" in, e.g., a Hopfield network (Hopfield 1984) are unilateral 2-port elements in which the input and output are both voltage signals. The input voltage uniquely and instantaneously determines the output voltage of such a neuron model, but the output can only affect the input via the resistive grid. In contrast, the cells in our system are 1-port electrical elements (temporarily ignoring the optical input channel) in which the port voltage and port current are the two relevant signals, and each signal affects the other through the cell's internal dynamics (modeled as a Thévenin equivalent impedance) as well as through the grid's response.

It is apparent that uncontrolled spontaneous oscillation is a potential problem in living neural systems, which typically also consist of active elements arranged in feedback loops. Biological systems have surely solved the same problem we attack in this paper. It is reasonable to believe that stability has strongly constrained the set of network configurations nature has produced. Whatever Nature's solutions may be, we suspect they have at least three features in common with the ones proposed here: (1) robustness in the face of wide component variation and the presence of parasitic network elements, (2) reliance on empirical data rather than anything we would recognize as a theory or analytic method, (3) stability strategies based on predominantly local information available to each network element.

Several reports on this work have appeared and will appear in (Wyatt and Standley 1988; Standley 1989; Standley and Wyatt 1989; 1988a; 1988b) during its development; a longer tutorial exposition will be given in the second printing of (Mead 1988).
2 The Linear Theory

2.1 Terminology. The output impedance of a linear system is a measure of the voltage response due to a change in output current while the input (light intensity in this case) is held constant. This standard electrical engineering concept will play a key role here. Figure 2a illustrates one experimental method for measuring the output impedance, and figure 2b is a standard graphical representation of an impedance, known as a Nyquist diagram. Similar plots have been used in experimental physiology (Cole 1932). In the context of this work, an impedance is said to be positive-real (Vidyasagar 1978; Desoer and Vidyasagar 1975; Anderson and Vongpanitlerd 1973) if it is stable (i.e., has no poles or zeroes in the right-half plane) and its Nyquist diagram lies entirely in the right-half plane (i.e., in the
Figure 2: (a) Simplified experimental measurement of the output impedance of a cell. A sinusoidal current i = A cos(ωt) is injected into the output and the voltage response v = B cos(ωt + φ) is measured. The impedance, which has magnitude B/A and phase φ, is typically treated as a complex number Z(iω) that depends on the frequency ω. (b) Example of the Nyquist diagram of an impedance. This is a plot in the complex plane of the value of the impedance, measured or calculated at purely sinusoidal frequencies, ranging from zero upward toward infinity. It is not essential to think of Nyquist diagrams as representing complex numbers: they are simply polar plots in which radius represents impedance magnitude and angle to the horizontal axis represents phase. The diagram shown here is the Nyquist plot of the positive-real impedance in equation (2.1).
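The measurement in (a) amounts to a single-frequency Fourier projection of the two sampled waveforms. A minimal sketch, with made-up values for A, B, and φ:

```python
import numpy as np

def impedance_from_waveforms(t, i_t, v_t, omega):
    """Estimate Z(i*omega) from sampled steady-state drive i(t) and
    response v(t) by projecting both onto exp(-i*omega*t); the ratio
    of the two phasors has magnitude B/A and phase phi."""
    basis = np.exp(-1j * omega * t)
    return np.sum(v_t * basis) / np.sum(i_t * basis)

# hypothetical waveforms: A = 1, B = 2, v lagging i by 30 degrees
omega = 2 * np.pi * 5.0
t = np.arange(0, 2.0, 1e-4)                 # an integer number of cycles
i_t = 1.0 * np.cos(omega * t)
v_t = 2.0 * np.cos(omega * t - np.pi / 6)
Z = impedance_from_waveforms(t, i_t, v_t, omega)
```

Sampling over a whole number of periods makes the cross terms in the projection cancel, so the recovered Z has magnitude 2 and phase close to minus 30 degrees.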
language of complex numbers, Re{Z(iω)} ≥ 0 for all purely sinusoidal frequencies ω). Figure 2a is an example, while the system represented in figure 4 is stable but not positive-real.

Figure 3: (a) Elementary model for an MOS amplifier. These amplifiers have a relatively high output resistance, which is determined by a bias setting (not shown). (b) Linearity allows this simplification of the network topology for the circuit in figure 1a without loss of information relevant to stability. The capacitor in figure 1a has been absorbed into the output capacitance of amp #2. (c) Passive network realization of the output impedance given in equation (2.1) for the network in (b).

A deep link between positive-real functions, physical networks and
passivity is established by the classical result in linear circuit theory which states that H(s) is positive-real if and only if it is possible to synthesize a 2-terminal network of positive linear resistors, capacitors, inductors and ideal transformers that has H(s) as its driving-point impedance (Anderson and Vongpanitlerd 1973).

Figure 4: Nyquist diagram of an impedance that satisfies the Popov criterion, defined as follows. A linear impedance Z(s) satisfies the Popov criterion if (1 + τs)Z(s) is positive-real for some τ > 0. The "Popov multiplier" (1 + τs) modifies the Nyquist diagram by stretching and rotating it counterclockwise for ω > 0. The impedance plotted here is active and thus is not positive-real, but the rotation due to the (1 + τs) term can make it positive-real for an appropriate value of τ. The Popov criterion is a condition on the linear elements that is weaker than passivity: active elements satisfying this criterion are shown to pose no danger of instability even when nonlinear resistors and capacitors are present in the grid.

This work was originally motivated by the following linear analysis of a model for the circuit in figure 1a. For an initial approximation to the output impedance of the cell we use the elementary model shown in figure 3a for the amplifiers and simplify the circuit topology within a single cell as shown in figure 3b.
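Both the positive-real test and the Popov test reduce to checks on sampled frequency-response data. A sketch of the two checks, using a hypothetical second-order impedance (not the cell impedance of equation (2.1)) that is stable yet active above its resonance, so it fails the first test but passes the second for a suitable multiplier:

```python
import numpy as np

def is_positive_real(Z, omegas):
    """Sampled positive-real test: Re{Z(i*omega)} >= 0 at every probed
    frequency (stability of Z must be verified separately)."""
    return all(Z(1j * w).real >= 0 for w in omegas)

def satisfies_popov(Z, omegas, taus):
    """Search a grid of multipliers tau > 0 for one that makes
    (1 + tau*s)Z(s) positive-real on the sampled frequencies."""
    return any(
        all(((1 + tau * 1j * w) * Z(1j * w)).real >= 0 for w in omegas)
        for tau in taus
    )

# hypothetical cell impedance: stable, but active (Re < 0) above resonance
Z = lambda s: 1.0 / (s * s + 0.1 * s + 1.0)
omegas = np.logspace(-2, 2, 400)
taus = np.linspace(1.0, 20.0, 40)
```

For this example Re{(1 + τiω)Z(iω)} has the sign of 1 − ω² + 0.1τω², which is nonnegative for all ω once τ ≥ 10, so the grid search succeeds.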
Straightforward calculations show that the output impedance is given by
This is a positive-real impedance that could be realized by a passive network of the form shown in figure 3c, where
Of course this model is oversimplified, since the circuit does oscillate. Transistor parasitics and layout parasitics cause the output impedance of the individual active cells to deviate from the form given in equations (2.1) and (2.2), and any very accurate model will necessarily be quite high order. The following theorem shows how far one can relax the positive-real condition and still guarantee that the entire network is robustly stable. It obviously applies to a much wider range of linear networks than has been discussed here. A linear network is said to be stable if for any initial condition the transient response converges asymptotically to a constant.

Theorem 1. Consider the class of linear networks of arbitrary topology, consisting of any number of positive 2-terminal resistors and capacitors and of N lumped linear impedances Zₙ(s), n = 1, 2, …, N, that are open- and short-circuit stable in isolation, i.e., that have no poles or zeroes in the closed right-half plane. Every such network is stable if at each frequency ω ≥ 0 there exists a phase angle Θ(ω) such that 0 ≥ Θ(ω) ≥ −90° and |∠Zₙ(iω) − Θ(ω)| < 90°, n = 1, 2, …, N.

An equivalent statement of this last condition is that the Nyquist plot of each cell's output impedance for ω ≥ 0 never intersects the 2nd quadrant of the complex plane (figure 4 is an example), and that no two cells' output impedance phase angles can ever differ by as much as 180°. If all the active cells are designed identically and fabricated on the same chip, their phase angles should track fairly closely in practice, and thus this second condition is a natural one. The theorem is intuitively reasonable and serves as a practical design goal.
The assumptions guarantee that the cells cannot resonate with one another at any purely sinusoidal frequency s = iω since their phase angles can never differ by as much as 180°, and they can never resonate with the resistors and capacitors since they can never appear simultaneously active and inductive at any sinusoidal frequency. A more advanced argument (Standley and Wyatt 1989) shows that exponentially growing instabilities are also ruled out.
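The phase condition of Theorem 1 is easy to test directly on measured Nyquist data. A sketch, with made-up phase samples; the feasibility test simply intersects the interval of admissible Θ values at each frequency:

```python
import numpy as np

def theorem1_feasible(phase_deg):
    """Test Theorem 1's phase condition on sampled frequency-response
    data. phase_deg[n][k] is the phase (degrees) of cell n's output
    impedance at the k-th frequency; at each frequency we need some
    Theta in [-90, 0] with |phase_n - Theta| < 90 for every cell n."""
    phases = np.asarray(phase_deg, dtype=float)
    lo = np.maximum(phases.max(axis=0) - 90.0, -90.0)  # Theta must exceed this
    hi = np.minimum(phases.min(axis=0) + 90.0, 0.0)    # and stay below this
    return bool(np.all(lo < hi))

# two hypothetical cells whose phase angles track closely: condition holds
ok = theorem1_feasible([[10.0, 30.0, -40.0],
                        [5.0, 20.0, -50.0]])
# a cell whose Nyquist plot enters the 2nd quadrant breaks it
bad = theorem1_feasible([[120.0, 10.0],
                         [0.0, 0.0]])
```

The two failure modes named in the text both appear here: a phase in the 2nd quadrant empties the interval against the 0° bound, and two cells 180° apart empty it against each other.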
3 Stability Result for Networks with Nonlinear Resistors and Capacitors

The previous results for linear networks can afford some limited insight into the behavior of nonlinear networks. If a linearized model is stable, then the equilibrium point of the original nonlinear network must be locally stable. But the result in this section, in contrast, applies to the full nonlinear circuit model and allows one to conclude that in certain circumstances the network cannot oscillate or otherwise fail to converge even if the initial state is arbitrarily far from the equilibrium point. Figure 4 introduces the Popov criterion, which is the basis of the following theorem. This is the first nonlinear result of its type that requires no assumptions on the network topology.
Theorem 2. Consider any network consisting of nonlinear resistors and capacitors and linear active cells with output impedances Zₙ(s), n = 1, 2, …, N. Suppose

(a) the nonlinear resistor and capacitor characteristics, i_j = g_j(v_j) and q_k = h_k(v_k), respectively, are monotone increasing continuously differentiable functions, and
(b) the impedances Zₙ(s) all satisfy the Popov criterion for some common value of τ > 0.
Then the network is stable in the sense that, for any initial condition at t = 0, (3.1)
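The kind of global convergence asserted here can be illustrated on a toy network containing only the passive nonlinear grid elements of hypothesis (a); the active cells are omitted and all element values are made up. A chain of nodes with monotone tanh resistors and monotone nonlinear capacitors relaxes to equilibrium even from a state far away from it:

```python
import numpy as np

def simulate_chain(v0, t_end=40.0, dt=0.01):
    """Forward-Euler simulation of a 1-D chain of nodes coupled by
    nonlinear grid resistors i = tanh(v) and loaded by nonlinear
    capacitors q = h(v) = v + 0.1*v**3 (both monotone increasing, as
    required by hypothesis (a) of Theorem 2); active cells omitted."""
    v = np.array(v0, dtype=float)
    for _ in range(int(t_end / dt)):
        flow = np.tanh(v[:-1] - v[1:])    # branch current, node j -> j+1
        i = np.zeros_like(v)
        i[:-1] -= flow                    # current leaving node j
        i[1:] += flow                     # current entering node j+1
        v += dt * i / (1.0 + 0.3 * v**2)  # dv/dt = i / h'(v), h'(v) > 0
    return v

v_final = simulate_chain([1.0, -0.5, 0.0, 0.8, -1.0])
```

The node voltages settle to a common value rather than oscillating; the substance of Theorem 2 is that this convergence survives adding active cells, provided their impedances satisfy the Popov criterion.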
Acknowledgments We sincerely thank Professor Carver Mead of Cal Tech for encouraging this work, which was supported by Defense Advanced Research Projects Agency (DARPA) Contract No. N00014-87-K-0825 and National Science Foundation (NSF) Contract No. MIP-8814612.
References

Anderson, B.D.O. and S. Vongpanitlerd. 1973. Network Analysis and Synthesis: A Modern Systems Theory Approach. Englewood Cliffs, NJ: Prentice-Hall.
Cole, K.S. 1932. Electric Phase Angle of Cell Membranes. J. General Physiology 15, 641-649.
Desoer, C. and M. Vidyasagar. 1975. Feedback Systems: Input-Output Properties. New York: Academic Press.
Hopfield, J.J. 1984. Neurons with Graded Response have Collective Computational Properties Like Those of Two-state Neurons. Proc. Nat. Acad. Sci. 81, 3088-3092.
Hutchinson, J., C. Koch, J. Luo, and C. Mead. 1988. Computing Motion Using Analog and Binary Resistive Networks. Computer 21:3, 52-64.
Mahowald, M.A. and C.A. Mead. 1988. Personal communication.
Mead, C.A. 1988. Analog VLSI and Neural Systems. Reading, MA: Addison-Wesley.
Mead, C.A. and M.A. Mahowald. 1988. A Silicon Model of Early Visual Processing. Neural Networks 1:1, 91-97.
Penfield, P., Jr., R. Spence, and S. Duinker. 1970. Tellegen's Theorem and Electrical Networks. Cambridge, MA: MIT Press.
Standley, D.L. 1989. Design Criteria Extensions for Stable Lateral Inhibition Networks in the Presence of Circuit Parasitics. Proc. 1989 IEEE Int. Symp. on Circuits and Systems, Portland, OR.
Standley, D.L. and J.L. Wyatt, Jr. 1989. Stability Criterion for Lateral Inhibition Networks that is Robust in the Presence of Integrated Circuit Parasitics. IEEE Trans. Circuits and Systems, to appear.
Standley, D.L. and J.L. Wyatt, Jr. 1988a. Stability Theorem for Lateral Inhibition Networks that is Robust in the Presence of Circuit Parasitics. Proc. IEEE Int. Conf. on Neural Networks I, San Diego, CA, 27-36.
Standley, D.L. and J.L. Wyatt, Jr. 1988b. Circuit Design Criteria for Stability in a Class of Lateral Inhibition Neural Networks. Proc. IEEE Conf. on Decision and Control, Austin, TX, 328-333.
Vidyasagar, M. 1978. Nonlinear Systems Analysis. Englewood Cliffs, NJ: Prentice-Hall.
Wyatt, J.L., Jr. and D.L. Standley. 1988. A Method for the Design of Stable Lateral Inhibition Networks that is Robust in the Presence of Circuit Parasitics. In: Neural Information Processing Systems, ed. D. Anderson, American Institute of Physics, New York, 860-867.
Received 30 September; accepted 13 October 1988.
Communicated by David Mumford
Two Stages of Curve Detection Suggest Two Styles of Visual Computation

Steven W. Zucker† Allan Dobbins Lee Iverson
Computer Vision and Robotics Laboratory, McGill Research Centre for Intelligent Machines, McGill University, Montréal, Québec, Canada
The problem of detecting curves in visual images arises in both computer vision and biological visual systems. Our approach integrates constraints from these two sources and suggests that there are two different stages to curve detection, the first resulting in a local description, and the second in a global one. Each stage involves a different style of computation: in the first stage, hypotheses are represented explicitly and coarsely in a fixed, preconfigured architecture; in the second stage, hypotheses are represented implicitly and more finely in a dynamically-constructed architecture. We also show how these stages could be related to physiology, specifying the earlier parts in a relatively fine-grained fashion and the later ones more coarsely.

1 Introduction
An extensive mythology has developed around curve detection. In extrapolating from orientation-selective neurons in the visual cortex (Hubel and Wiesel 1962), it is now widely held that curve detection is simply a matter of "integrating" the responses of these cells. More specifically, the mythology holds that this integration process is global, that the initial estimates are local, and that the relationship between them will become clear as a more detailed understanding of cortical circuitry is uncovered. However, this mythical process of "integration" has turned out to be elusive, the search for it has led, instead, to a series of dilemmas, and the quantity of physiological data is exploding. It is rarely clear how new details of cortical circuitry relate to different components of the curve detection problem. We believe that this situation is typical of vision in general, and amounts to ascribing too little function to the earlier stages, and too

†Senior Fellow, Canadian Institute for Advanced Research.
Neural Computation 1, 68-81 (1989)
@ 1989 Massachusetts Institute of Technology
much to the later ones. For curve detection, virtually all of the complexity is delegated to the process of "integration," so it is not surprising that successful approaches have remained elusive. Part of the problem is that models of integrative processes have been rich in selected detail, but poor in abstract function. In the sense that it is often useful to see the forest before the trees, we submit that solutions will likely be found by considering both coarse-grained and fine-grained models, and that such models will suggest a partitioning of function whose abstraction varies with granularity. To make this point concretely, we here outline a coarse-grained solution to the curve detection problem from a computational perspective, and sketch how it could map onto physiology. The sketch is coarse enough to serve as an organizational framework, but fine enough to suggest particular physiological constraints. One of these comprises our first, coarse-grain prediction: curve detection naturally decomposes into two stages, the first in which a local description is computed, and the second in which a global description is computed. These computations are sufficiently different that we are led to hypothesize two different styles of visual computation.

2 The Dilemma of Curve Detection
The initial measurement of orientation information is broadly tuned, which suggests the averaging necessary to counteract retinal (sensor) sampling, quantization, and noise. However, the end result of curve detection is unexpectedly precise: corners can be distinguished from arcs of high curvature, and nearby curves can be distinguished from one another to a hyperaccurate level, even though they might pass through the same receptive field. An analogous dilemma exists for computer vision systems, even with the spectacular numerical precision of which computers are capable: quantization and noise imply smoothing, but smoothing blurs corners, endpoints, and nearby curves into confusion (Zucker 1986). At the foundation is a chicken-and-egg problem: if the points through which the curve passed, together with the locations of discontinuities, were known, then the actual properties of the curve could be inferred. But initially they are not known, so any smoothing inherent in the inference process is potentially dangerous.

3 Two Stages of Curve Detection
We have discovered a computational solution to this dilemma, which involves decomposing the full problem into two stages, each of which has a rather different character. In the first stage, the local properties of the curve are computed: its trace (the set of retinotopic points through which the curve passes), its tangent (or orientation at those points), and
its curvature. In the second stage, these properties are refined to create a global model of the curve. This much, proceeding from local to global, is standard; the style of the computations is not.

The key to the first stage is to infer the local properties coarsely, not in fine detail, but without sacrificing reliability or robustness. Coarseness is here related to quantization, which must limit error propagation without blurring over corners. Observe that this is precisely what is lacking in the standard myth, where errors (e.g., about placing discontinuities) can have far-reaching consequences. The result is a style of computation in which the different (quantized) possibilities are made explicit, and arranged in a fixed, preconfigured computational architecture that imposes no a priori ordering over them. Each distinct hypothesis, say rough orientation and curvature at every position, forms a unit in a fixed network that strongly resembles neural-network-style models. Reliability and robustness are then maintained by the network; hence the local description is not computed locally! A mapping onto orientation hypercolumns will be discussed shortly.

The second stage embodies a rather different style of computation. Now the possibilities no longer need be general, but are constrained to be in the range dictated by the first stage. Thus the architecture can be tailored to each problem, that is, constructed adaptively rather than preconfigured, and variables can be represented implicitly. With these highly focused resources, the key limitation on precision is implementation, and it need not be hampered by uncontrolled error propagation. From the outside, this constructive style of computation holds certain key properties in common with later visual areas, such as V4 and IT, where receptive field structure has been shown to vary with problem constraints (e.g., Maunsell and Newsome 1987; Moran and Desimone 1985).

4 The Model of Curve Detection
In physiological terms, neurons are said to be orientation selective if they respond differentially to stimulus (edge or line) orientation. We take this operational statement one step further by defining orientation selection to be the inference of a local description of the curve everywhere along it, and postulate orientation selection as the goal of our first stage. In the second stage, global curves are inferred through this local description. The various stages of our process are shown in figure 1, and expanded below.

4.1 Stage 1: Inferring the Tangent Field. Formally, orientation selection amounts to inferring the trace of the curve, or the set of points (in the image) through which the curve passes, its (approximate) tangent and curvature at those points, and their discontinuities (Zucker 1986). We refer to such information as the tangent field, and note that, since the initial
measurements are discrete, this will impose constraints on the (inferred) tangents, curvatures, and discontinuities (Parent and Zucker 1985). This first stage of orientation selection is in turn modeled as a two-step process:

Step 1.1. Initial Measurement of the local fit at each point to estimate orientation and curvature. These estimates derive from a model of simple cell receptive fields instantiated at multiple scales and orientations at each image position. However, these local measurements are inherently inaccurate (e.g., broadly tuned), so we require:

Step 1.2. Interpretation into an explicit distributed representation of tangent and curvature by establishing consistency between the local measurements. This is accomplished by modifying them according to their geometric relationships with nearby estimates.

4.2 Stage 2: Inferring a Covering of the Curve. Since the tangent is the first derivative of a curve (with respect to arc length), the global curve can be recovered as an integral through the tangent field. Such a view typically leads to sequential recovery algorithms (e.g., Kass and Witkin 1987). But these algorithms require global parameters, starting points, and some amount of topological structure (i.e., which tangent point follows which); in short, they are biologically implausible. In contrast, we propose a novel approach in which a collection of short, dynamically modifiable curves ("snakes" in computer vision; see Montanari 1971; Kass et al. 1988) move in parallel. The key idea behind our approach is to recover the global curve by computing a covering of it; i.e., a set of objects whose union is equivalent to the original curve. The elements of the covering are unit-length dynamic splines, initially equivalent to the elements of the tangent field, but which then evolve according to a potential distribution constructed from the tangent field.
The evolution takes two forms: (i) a migration in position to achieve smooth coverings; and (ii) a "growth" to triple their initial length. Furthermore, since the splines are initially independent, it is not known which should be grouped into the covering of each distinct global curve. For graphical purposes we represent this by creating each one with a different "color," and include a second process which converts overlapping splines to the same color. In the end, then, the cover is given by a collection of overlapping splines, or short "snakes," each of which is the same color. Again, there are two conceptually distinct steps to Stage 2 of the algorithm (David and Zucker 1989):
Step 2.1. Constructing the Potential Distribution from the discrete tangent field. Each entry in the tangent field actually represents a discretization of the many possible curves in the world that could project onto that particular (tangent, curvature) hypothesis. Now these pieces
must be put together, so consider a measure (or envelope) over all of these possible curves. Assuming the curves are continuous but not necessarily differentiable everywhere, each contribution to the potential can be modeled as a Gaussian (the Wiener measure) oriented in the direction of the tangent field entry. The full potential distribution is their pointwise sum; see figure 3.
Step 2.2. Spline Dynamics. The discrete entities in the tangent field are converted into unit splines initialized in the valleys of the potential distribution. They evolve according to a variational scheme that depends on spline properties (tension and rigidity) as well as the global potential.
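Steps 2.1 and 2.2 can be miniaturized as follows. The Gaussian widths, the descent step, and the single-point spline are all assumptions of this sketch; the true scheme is variational, with tension and rigidity terms that are omitted here. Each tangent-field entry contributes an oriented Gaussian trough, and a displaced point descends the summed potential into the valley along the curve:

```python
import numpy as np

def make_potential(points, thetas, sigma_perp=1.0, sigma_par=3.0):
    """Step 2.1 in miniature: each tangent-field entry contributes a
    Gaussian trough elongated along its tangent direction, and the
    full potential is the pointwise sum. Returns a function P(x, y)."""
    pts = np.asarray(points, float)
    th = np.asarray(thetas, float)
    def P(x, y):
        dx, dy = x - pts[:, 0], y - pts[:, 1]
        along = dx * np.cos(th) + dy * np.sin(th)   # offset along tangent
        perp = -dx * np.sin(th) + dy * np.cos(th)   # offset across it
        return -np.sum(np.exp(-0.5 * (perp / sigma_perp) ** 2
                              - 0.5 * (along / sigma_par) ** 2))
    return P

def descend(P, x, y, steps=200, lr=0.05, eps=1e-4):
    """Step 2.2 in miniature: move one spline point downhill on the
    potential by numerical gradient descent."""
    for _ in range(steps):
        gx = (P(x + eps, y) - P(x - eps, y)) / (2 * eps)
        gy = (P(x, y + eps) - P(x, y - eps)) / (2 * eps)
        x, y = x - lr * gx, y - lr * gy
    return x, y

# hypothetical tangent field: horizontal tangents sampled along y = 0
P = make_potential([(k, 0.0) for k in range(8)], [0.0] * 8)
x_f, y_f = descend(P, 3.5, 1.5)   # a displaced point migrates to the valley
```

The displaced point slides into the valley at y = 0 while staying put horizontally, which is the migration behavior described in (i) above.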
5 Implementing the Model
Each stage of the model has different implementation requirements. To differentiate between smooth curves, curves with corners, crossing curves, and branching curves, it is necessary to represent each possible tangent (orientation) and curvature value at every possible position. Smooth curves are then represented as a single (tangent, curvature) hypothesis at each (retinotopic) trace point, corners as multiple tangents at a single point, and bifurcations as a single tangent but multiple curvatures at a single point. Orientation hypercolumns in the visual cortex are thus a natural representational substrate, with explicit representation of each possible orientation and curvature at each position. This leads to a new observation regarding discontinuities, namely that explicit neurons to represent them are unnecessary, and to our first physiological prediction:

Prediction 1. Crossings, corners, and bifurcations are represented at the early processing stages by multiple neurons firing within a "hypercolumn."

5.1 Stage 1, Step 1: Intra-Columnar Initial Measurements. We first seek a physiologically plausible mechanism for measuring orientation and curvature. Observe that an orientation-selective cortical neuron carries information about the tangent to curves as they pass through its receptive field, and an ensemble of such cells of different size carries information about how orientation is changing over it. Such differences are related to curvature (or deviation from straightness), and adding appropriate rectification leads to a model of endstopped neurons (Dobbins et al. 1987; cf. Hubel and Wiesel 1965). This model exhibits curvature-selective response at the preferred orientation, as do endstopped neurons. Thus
Prediction 2. Endstopped neurons carry the quantized representation of orientation and (non-zero) curvature at each position.
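To make the endstopping idea concrete, here is a minimal numerical sketch. It is our illustration, not the circuit of Dobbins et al. (1987): the function names, the straight-slot idealization of a simple cell, and the response profile are all assumptions. An arc of curvature kappa deviates from its chord by the sagitta, roughly kappa * length^2 / 8, so a long slot loses response to curvature faster than a short one, and their rectified difference is curvature-selective.

```python
def slot_response(kappa, length, width=0.05):
    """Idealized oriented simple cell: how well an arc of curvature kappa
    fits a straight slot of the given length (1 = perfect fit, 0 = none).
    The arc's maximal deviation from its chord (the sagitta) is about
    kappa * length**2 / 8 for small curvature."""
    sagitta = abs(kappa) * length ** 2 / 8.0
    return max(0.0, 1.0 - sagitta / width)

def endstopped_response(kappa, short=0.5, long_=1.0):
    """Rectified difference of a short and a long simple cell at the same
    orientation: zero for straight stimuli, positive for curved ones."""
    return max(0.0, slot_response(kappa, short) - slot_response(kappa, long_))
```

A straight stimulus (kappa = 0) drives both slots equally and the difference vanishes; increasing curvature penalizes the long slot faster than the short one, so the rectified difference grows with |kappa| over a working range, which is the curvature selectivity the text attributes to endstopped neurons.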
Two Stages of Curve Detection
73
Figure 1: An illustration of the different stages of curve detection. In (a) we show a section of a fingerprint image; note the smooth curves and discontinuities around the "Y" in the center. (b) Graphical illustration of the initial information, or those orientation/curvature hypotheses resulting from convolutions above the noise level. (c) The discrete tangent field resulting from the relaxation process after 2 iterations; note that most of the spurious initial responses have been eliminated. (d) Final snake positions, or coverings of the global curves. (e) The potential distribution constructed from the entries in the tangent field.
By varying the components one obtains cells selective for different ranges and signs of curvature. Thus the initial measurements can be built up by intra-columnar local circuits, with the match to each (quantized) orientation and curvature represented explicitly as, say, firing rate in endstopped neurons. However, these measurements of orientation and curvature are broadly tuned; nearby curves are blurred together and multiple possibilities arise at many positions. Introducing further non-linearities into the initial measurements eliminates some spurious responses (Zucker et al. 1988), but the broadly-tuned smearing remains. We thus seek an abstract principle by which these broadly tuned responses can be refined into crisper distributions.

5.2 Stage 1, Step 2: Inter-Columnar Iterative Refinement. Again curvature enters the model, but now as a way of expressing the relationship between nearby tangent (orientation) hypotheses. Consider an arc of a curve, and observe that tangents to this arc must conform to certain position and orientation constraints for a given amount of curvature; we refer to such constraints geometrically as co-circularity (Fig. 2a). Discretizing all continuous curves in the world that project into the columnar space of coarse (orientation, curvature) hypotheses partitions these curves into equivalence classes, examples of which are shown in figure 2b (Parent and Zucker 1985; Zucker et al. 1988). Interpreting the (orientation, curvature) hypotheses as endstopped neurons, such co-circularly-consistent relationships are what is to be expected of the firing pattern between endstopped neurons in nearby orientation hypercolumns given such a curve as stimulus. Turning this around, when such inter-columnar patterns arise from the initial measurements, a curve from one of the equivalence classes is to be expected.
Such inter-columnar interactions can be viewed physiologically as excitatory and inhibitory projections between endstopped cells at nearby positions (adjacent hypercolumns), and can be used as follows. Since curvature is a relationship between tangents at nearby positions, two tangents should support one another if and only if they agree under a curvature hypothesis, and co-circularity provides the measure of such support. In addition, two tangents that disagree with the curvature estimate should detract support from one another. Relaxation labeling provides a formal mechanism for defining such support, and for specifying how to use it (Hummel and Zucker 1983). Mathematically it amounts to gradient descent; physiologically it can be viewed as a mechanism for specifying how the response of neighboring neurons will interact. In summary:
Prediction 3. Inter-columnar interactions exist between curvature consistent (co-circular) tangent hypotheses.
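The co-circularity constraint behind this prediction can be written down directly. The sketch below is our paraphrase, not the authors' implementation: the function name, the Gaussian tuning width, and the [-1, 1] scaling are assumptions, though the geometric relation itself follows the co-circularity construction of Parent and Zucker (1985). Any circle through positions p_i and p_j makes equal angles with the chord joining them, so the orientation expected at p_j, given orientation t_i at p_i, is 2*phi - t_i, where phi is the chord direction.

```python
import math

def cocircular_support(pi, ti, pj, tj, sigma=0.35):
    """Compatibility between tangent hypothesis (pi, ti) and (pj, tj).
    For any circle through pi and pj the two tangents make equal angles
    with the chord, so the orientation expected at pj given ti is
    2*phi - ti, where phi is the chord direction from pi to pj.
    Returns a value in [-1, 1]: positive (excitatory) when tj agrees
    with the co-circular prediction, negative (inhibitory) otherwise."""
    phi = math.atan2(pj[1] - pi[1], pj[0] - pi[0])
    expected = 2.0 * phi - ti
    # fold the orientation difference into (-pi/2, pi/2]
    d = (tj - expected + math.pi / 2.0) % math.pi - math.pi / 2.0
    return 2.0 * math.exp(-d * d / (2.0 * sigma * sigma)) - 1.0
```

Two collinear tangents support each other maximally, as do two tangents lying on a common circle, while a tangent perpendicular to the co-circular prediction is maximally inhibited.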
Figure 2: (a) The geometric relationships necessary for defining the compatibilities between two label pairs at points i and j. (b) Compatibilities between coarse (orientation, curvature) hypotheses at nearby positions. 8 distinct orientations and 7 curvatures were represented, and 3 examples are shown. (top) The labels which give positive (left) and negative (right) support for a diagonal orientation that curves slightly left; (middle) positive and negative support for a straight curvature class; (bottom) positive and negative support for the maximum curvature class. The magnitude of the interactions varies as well, roughly as a Gaussian superimposed on these diagrams. The values were obtained by numerically solving a 6-dimensional closest point problem (Zucker et al. 1988). Physiologically these projective fields represent inter-columnar interactions. Multiplied by the original tangent receptive fields, they represent the units for building the potential distribution that guides Stage 2.
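The inter-columnar refinement that uses these compatibilities can be sketched as a few relaxation-labeling updates. This is a toy version of the scheme of Hummel and Zucker (1983); the specific update rule, step size, and clipping are our simplifications, and `relax` is a hypothetical name.

```python
import numpy as np

def relax(p, r, iterations=3, step=0.5):
    """Illustrative relaxation-labeling refinement: each hypothesis
    confidence p[i] in [0, 1] is nudged by the net support
    s[i] = sum_j r[i, j] * p[j] it receives from compatible hypotheses
    (r[i, j] > 0) and penalized by incompatible ones (r[i, j] < 0),
    then clipped back to the unit interval."""
    p = np.asarray(p, dtype=float)
    for _ in range(iterations):
        support = r @ p
        p = np.clip(p + step * support, 0.0, 1.0)
    return p
```

Running this on three mutually supporting hypotheses and one that they all contradict drives the consistent trio to full confidence and extinguishes the outlier within the two to three iterations mentioned in the text.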
Given interaction, the next question relates to precision. Earlier we hypothesized that this first stage was coarse. Computational experiments (Zucker et al. 1988), psychophysics (Link and Zucker 1988), and the range of receptive field sizes in striate cortex (Dobbins et al. 1988) provide independent evidence about the quantization of curvature:

Prediction 4. The initial representation of curvature in the visual cortex is quantized into 5 ± 2 distinct classes; namely, straight, curved to the left a small amount, curved to the left a large amount, and similarly to the right.

Relaxation processes can be realized iteratively, and computational experiments suggest that about 3 iterations suffice (Zucker et al. 1988). At this time we can only speculate how these iterations relate to physiology, but perhaps the first iteration is carried out by a recurrent network within V1, and the subsequent iterations through the feed-forward and feed-back projections to extrastriate cortex (e.g., V2 or V4 in monkey). There is no doubt, however, that interactions beyond the classical receptive field abound (Allman et al. 1985). The advantage of this style of "coarse modeling" is that a number of testable physiological hypotheses do emerge, and we are now beginning to explore them. The requirement of initial curvature estimates led to the connection with endstopping, and the current model suggests roles for inter-columnar interactions. In particular, we predict that they should be a function of position and orientation, a prediction for which some support exists (e.g., Nelson and Frost 1985) in the zero-curvature case; experiments with curved stimuli remain to be done.

5.3 Stage 2: Potential Distributions and Evolving Spline Covers. The tangent field serves as a coarse model for the curve, represented locally. The next task is to infer a smooth, global curve running through it.
Figure 3: Illustration of how a potential distribution is constructed from tangent field entries. (a) A small number of tangents, showing the individual contributions from each one. (b) As more tangents are included, long "valleys" begin to form when the individual entries are added together. (c) The complete tangent field and potential distribution as shown in figure 1. Physiologically one might think of such potentials as being mapped onto neuronal membranes. Not shown is the possible effect of attention in gating the tangent field contributions, the smallest unit for which could correspond to a tangent field entry.

We perform this inference in a rather different kind of architecture, one that involves potential distributions constructed specifically for each instance. It proceeds as follows. The potential distribution is created by adding together contributions from each element in the tangent field; see figure 3. Changing the representation from the tangent field to the potential distribution changes what is explicit and what is implicit in the representation. In Stage 1 there were discrete coarse entities; now there are smooth valleys that surround each of the global curves, with a separation between them. The "jaggies" imposed by the initial image sampling have been eliminated, and interpolation to sub-pixel resolution is viable. To recover the curves through the valleys, imagine creating, at each tangent field entry, a small spline of unit length oriented according to the tangent and curvature estimates (Fig. 4). By construction, we know that this spline will be born in a valley of the tangent field potential distribution, so the splines are then permitted to migrate both to smooth out the curve and to find the true local minima. But the inference of a cover for the global curves requires that the splines overlap, so that each point on every curve is covered by at least one spline. We therefore let the splines extend in length while they migrate in position, until they each reach a prescribed length. The covering is thus composed of these extensible splines, which have grown in the valleys of the tangent field potential. Their specific dynamics and properties are described more fully in David and Zucker (1989). It is difficult to interpret these ideas physiologically within the classical view of neurons, in which inputs are summed and transformed into
an output train of action potentials. Dendrites simply support passive diffusion of depolarization. Recently, however, a richer view of neuronal processing has emerged, with a variety of evidence pointing to active dendritic computation and dendro-dendritic interaction (Schmitt and Worden 1979). Active conductances in dendrites functionally modify the geometry, and dendro-dendritic interactions suggest that the output transformation is not uniquely mediated by the axon. Taken together, these facts imply that patterns of activity can be sustained in the dendritic arbor, and that this membrane could be the substrate of the above potential distribution computations. For this to be feasible, however, we require

Prediction 5. The mapping of the potential distribution onto the neuronal membrane implies that the retinotopic coordinates are similarly mapped (at least in open neighborhoods) onto the membrane.

The large constructed potential distributions may bear some resemblance to the large receptive fields observed in areas V4 and IT (Maunsell and Newsome 1987). While any such relationship is clearly speculative at this time, it should be noted that they have two key similarities: (i) extremely large receptive fields (potential distributions) have been created, but they maintain about the same orientation selectivity as in V1 (Desimone et al. 1985); (ii) their structure can change. We have stressed how structure is controlled by upward flowing information, but it should be modifiable by "top-down" attentional influences as well (Maunsell and Newsome 1987; Moran and Desimone 1985). Attention could easily "gate" the tangent field entries at the creation of the potential, which leads to:

Prediction 6. There exists a smallest scale of attentional control, and it corresponds (in size) to the scale of the unit potential contributions.

6 Conclusions
This paper is both constructive and speculative. On the constructive side, we have outlined a computational solution to the curve detection problem that fills the wide gulf between initial broad measurements of orientation and precise final descriptions of global curve structure. Much of the mythology that has developed around curve detection is due, we believe, to ascribing too little function to the first (measurement) stage, and too much function to the second (integration) stage. Our solution was to interpose a stable description, the tangent field, between the stages, to represent the local properties of curves (and their discontinuities). Three points emerged: (i) represent the local structure coarsely,
not in fine detail, so that (ii) the different possibilities can be represented explicitly; and (iii) do not assume that local properties must be computed purely locally. Once the tangent field was in place, the task for the second, global stage could then be posed, and led to the introduction of the mathematical notion of a cover to suggest parallel (and hence at least not biologically implausible) mechanisms for recovering global information.

Figure 4: Illustration of the splines in motion. Initially, each spline is born at a tangent field location, with unit length. Then, according to the potential distribution shown in figure 1e, the splines migrate in position (to find minima in the distribution) and in length, so that they overlap and fill in short gaps. At convergence, the length of each spline has tripled. Not shown is the fact that each spline is born with a different "color," and that, as they overlap, the colors equilibrate to a unique value for the entire covering of each global curve. Also, those splines that migrate to positions unsupported by the potential distribution are eliminated at convergence. (a) Initial distribution; (b) and (c) intermediate iterations; (d) final convergence. Physiologically one might think of the spline computations as being supported by localized dendritic or dendro-dendritic interactions.
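The two ingredients of Stage 2, building the potential and letting spline points slide into its valleys, can be sketched together. This is a toy discretization under assumed profiles: the elongated-Gaussian contribution and the finite-difference descent are our choices for illustration, not the spline dynamics of David and Zucker (1989), and both function names are hypothetical.

```python
import numpy as np

def tangent_potential(gx, gy, tangents, along=2.0, across=0.4):
    """Sum per-tangent contributions into a global potential whose valleys
    (minima) trace the curves: each tangent-field entry (x, y, theta)
    contributes a negative Gaussian trough elongated along its orientation."""
    v = np.zeros_like(gx, dtype=float)
    for x, y, theta in tangents:
        dx, dy = gx - x, gy - y
        u = dx * np.cos(theta) + dy * np.sin(theta)    # along the tangent
        w = -dx * np.sin(theta) + dy * np.cos(theta)   # across the tangent
        v -= np.exp(-(u / along) ** 2 - (w / across) ** 2)
    return v

def descend(potential, x, y, step=0.2, iters=100, eps=1e-3):
    """Slide one spline sample point downhill by central-difference
    gradient descent on a callable potential V(x, y)."""
    for _ in range(iters):
        dvx = (potential(x + eps, y) - potential(x - eps, y)) / (2 * eps)
        dvy = (potential(x, y + eps) - potential(x, y - eps)) / (2 * eps)
        x, y = x - step * dvx, y - step * dvy
    return x, y
```

Horizontal tangents placed along y = 0 carve a long valley there, and a point released nearby migrates into it; the full model additionally lets splines lengthen as they move until they overlap into a cover.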
Finally, we introduced the notion of a potential distribution as the representation for mediating the local-to-global transition between the two stages.

The paper has also been speculative. Problems in vision are complex, and computational modeling can certainly help in understanding them. But in our view computational modeling cannot proceed without direct constraints from the biology, and modeling, like curve detection, should involve both coarse-grained and finer-grained theories. We attempted to illustrate how such constraints could be abstracted by speculating how our model could map onto physiology. While much clearly remains to be done, the role for curvature at several levels now seems evident. That such roles for curvature would have emerged from more traditional neural network modeling seems doubtful.

Two different styles of computation emerged in the two stages of curve detection. Although we stressed their differences in the paper, in closing we should like to stress their similarities. Both stages enjoy formulations as variational problems, and recognizing the hierarchy of visual processing, we cannot help but postulate that the second, fine stage of curve detection may well be the first, coarse stage of shape description. The fine splines then become the coarse units of shape.

Acknowledgments

This research was sponsored by NSERC grant A4470. We thank R. Milson and especially C. David for their contributions to the second stage of this project.

References

Allman, J., F. Miezin, and E. McGuinness. 1985. Stimulus Specific Responses from Beyond the Classical Receptive Field: Neurophysiological Mechanisms for Local-global Comparisons in Visual Neurons. Annual Rev. Neurosci. 8, 407-430.

David, C. and S.W. Zucker. 1989. Potentials, Valleys, and Dynamic Global Coverings. Technical Report 89-1, McGill Research Center for Intelligent Machines, McGill University, Montreal.

Desimone, R., S. Schein, J. Moran, and L. Ungerleider. 1985.
Contour, Color, and Shape Analysis Beyond the Striate Cortex. Vision Research 25, 441-452.

Dobbins, A., S.W. Zucker, and M.S. Cynader. 1987. Endstopping in the Visual Cortex as a Substrate for Calculating Curvature. Nature 329, 438-441.

Dobbins, A., S.W. Zucker, and M.S. Cynader. 1988. Endstopping and Curvature. Technical Report 88-3, McGill Research Center for Intelligent Machines, McGill University, Montreal.

Hubel, D.H. and T.N. Wiesel. 1962. Receptive Fields, Binocular Interaction and Functional Architecture in the Cat's Visual Cortex. J. Physiol. (London) 160, 106-154.
Hubel, D.H. and T.N. Wiesel. 1965. Receptive Fields and Functional Architecture in Two Non-striate Visual Areas (18 and 19) of the Cat. J. Neurophysiol. 28, 229-289.

Hummel, R. and S.W. Zucker. 1983. On the Foundations of Relaxation Labeling Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 5, 267-287.

Kass, M., A. Witkin, and D. Terzopoulos. 1988. Snakes: Active Contour Models. Int. J. Computer Vision 1, 321-332.

Kass, M. and A. Witkin. 1987. Analyzing Oriented Patterns. Computer Vision, Graphics, and Image Processing 37, 362-385.

Link, N. and S.W. Zucker. 1988. Corner Detection in Curvilinear Dot Grouping. Biological Cybernetics 59, 247-256.

Maunsell, J. and W. Newsome. 1987. Visual Processing in Monkey Extrastriate Cortex. Ann. Rev. Neuroscience 10, 363-401.

Montanari, U. 1971. On the Optimum Detection of Curves in Noisy Pictures. CACM 14, 335-345.

Moran, J. and R. Desimone. 1985. Selective Attention Gates Visual Processing in the Extrastriate Cortex. Science 229, 782-784.

Nelson, J.J. and B.J. Frost. 1985. Intracortical Facilitation among Co-oriented, Co-axially Aligned Simple Cells in Cat Striate Cortex. Exp. Brain Res. 61, 54-61.

Parent, P. and S.W. Zucker. 1985. Trace Inference, Curvature Consistency, and Curve Detection. CVaRL Technical Report CIM-86-3, McGill University. IEEE Transactions on Pattern Analysis and Machine Intelligence, in press.

Schmitt, F. and F. Worden. 1979. The Neurosciences: Fourth Study Program. Cambridge, MA: MIT Press.

Zucker, S.W. 1986. The Computational Connection in Vision: Early Orientation Selection. Behaviour Research Methods, Instruments, and Computers 18, 608-617.

Zucker, S.W., C. David, A. Dobbins, and L. Iverson. 1988. The Organization of Curve Detection: Coarse Tangent Fields and Fine Spline Coverings. Proc. 2nd Int. Conf. on Computer Vision, Tarpon Springs, Florida.
Received 14 October; accepted 23 October 1988.
Communicated by Dana Ballard
Part Segmentation for Object Recognition

Alex Pentland
Vision Sciences Group, The Media Lab, Massachusetts Institute of Technology, Room E15-410, 20 Ames Street, Cambridge, MA 02139, USA
Visual object recognition is a difficult problem that has been solved by biological visual systems. An approach to object recognition is described in which the image is segmented into parts using two simple, biologically-plausible mechanisms: a filtering operation to produce a large set of potential object "parts," followed by a new type of network that searches among these part hypotheses to produce the simplest, most likely description of the image's part structure.

1 Introduction
In order to recognize objects one must be able to compute a stable, canonical representation that can be used to index into memory (Binford 1971; Marr and Nishihara 1978; Hoffman and Richards 1985). The most widely accepted theory of how people recognize objects seems to be that they first segment the object into its component parts; recognition then occurs by using this part description to classify the object, perhaps by use of an associative network. Despite the importance of object recognition, most vision research, and especially neural network research, has been aimed at understanding early visual processing. In part this focus on early vision is because the uniform, parallel operations typical of early vision are easily mapped onto neural networks, and are more easily understood than the nonhomogeneous, nonlinear processing required to segment an object into parts and then recognize it. As a consequence, the process of object recognition is little understood.

The goal of this research is to automatically recover accurate part descriptions for object recognition. I have approached this objective by developing a system that segments an imaged object into convex parts using a neural network that is similar to that described by Hopfield and Tank (Hopfield and Tank 1985), but which uses a temporally-decaying feedback loop to achieve considerably better performance. For the sake of efficiency and simplicity I have used silhouettes, obtained from grey-scale images by intensity, motion, and texture thresholding, rather than operating on the grey-scale images directly.
Neural Computation 1, 82-91 (1989)
@ 1989 Massachusetts Institute of Technology
2 A Computational Theory of Segmentation

Many machine vision systems employ matched filters to find particular 2-D shapes in an image, typically using a multiresolution approach that allows efficient search over a wide range of scales. Thus, in machine vision, a natural way to locate the parts of a silhouetted object is to make filter patterns that cover the spectrum of possible 2-D part-shapes (as is shown in figure 1(a)), match these 2-D patterns against the silhouette, and then pick the best-matching filter. If the match is sufficiently good, then we register the detection of a part whose shape is roughly that of the best-matching filter. A biological version of this approach might use many hypercolumns, each containing receptive fields with excitatory regions shaped as in figure 1. The cell with the best-matching excitatory field would be selected by introducing strong lateral inhibition within the hypercolumn in order to suppress all but the best-responding cells. This arrangement of receptive fields and within-hypercolumn inhibition produces receptive fields with oriented, center-surround spatial structure, such as is shown in figure 1(b).

The major problem with such a filtering/receptive-field approach is that all such techniques incorporate a noise threshold that balances the number of false detections against the number of missed targets. Thus we will either miss many of the object's parts because they don't quite fit any of our 2-D patterns, or we will have a large number of false detections. This false-alarm versus miss problem occurs in almost every image processing domain, and there are only two general approaches to overcoming it. The first is to improve the discriminating power of the filter so as to improve the false-alarm/miss tradeoff. The success of this approach depends upon precise characterization of the target and so is not applicable to this problem.
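The filtering step can be sketched as template correlation over the binary silhouette. This is our toy version, not the paper's code: the function name, the agreement score, and the threshold value are assumptions. The threshold directly controls the false-alarm/miss tradeoff just discussed; setting it low turns every weak match into a candidate part hypothesis.

```python
import numpy as np

def part_hypotheses(silhouette, patterns, threshold=0.55):
    """Slide each binary 2-D part pattern over the binary silhouette and
    record every placement whose pixel-agreement fraction exceeds a
    (deliberately low) threshold. Returns (pattern index, row, col, score)
    tuples: one candidate part hypothesis per surviving placement."""
    H, W = silhouette.shape
    hyps = []
    for k, pat in enumerate(patterns):
        h, w = pat.shape
        for r in range(H - h + 1):
            for c in range(W - w + 1):
                window = silhouette[r:r + h, c:c + w]
                agree = np.mean(window == pat)  # fraction of matching pixels
                if agree > threshold:
                    hyps.append((k, r, c, agree))
    return hyps
```

With a high threshold this reduces to the detection scheme above (best match or nothing); with a low threshold it produces the large hypothesis set that the second, "best explanation" approach then searches.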
In the second approach, each non-zero response of a filter/receptive field is considered as a hypothesis about the object's part structure rather than as a detection. One therefore uses a very low threshold to obtain a large number of hypotheses, and then searches through them to find the "real" detections. This approach depends upon having some method of measuring the likelihood of a set of hypotheses, i.e., of measuring how good a particular segmentation into parts is as an explanation of the image data. It is this second, "best explanation" approach that I have adopted in this paper.

2.1 Global Optimization: The Likelihood Principle and Occam's Razor. The notion that vision problems can be solved by optimizing some "goodness of fit" measure is perhaps the most powerful paradigm found in current computational research (Hopfield and Tank 1985; Ballard
et al. 1983; Hummel and Zucker 1983; Poggio et al. 1985). Although heuristic measures are sometimes employed, the most attractive schemes have been based on the likelihood principle (the scientific principle that the most likely hypothesis is the best one); i.e., they have posed the problem in terms of an a priori model with unknown parameter values, and then searched for the parameter settings that maximize the likelihood of the model given the image data. Recently it has been proven (Rissanen 1983) that one method of finding this maximum likelihood estimate is by use of the formal, information-theoretic version of Occam's Razor: the scientific principle that the simplest hypothesis is the best one. In information theory the simplicity or complexity of a description is measured by the number of bits (binary digits) needed to encode both the description and the remaining residual noise. This new result tells us that both the likelihood principle and Occam's Razor agree that the best description of image data is the one that provides the bitwise shortest encoding. This method of finding the maximum likelihood estimate is particularly useful in vision problems because it gives us a simple way to produce maximum likelihood estimates using image models that are too complex for direct optimization (Leclerc 1988). In particular, to find the maximum likelihood estimate of an object's part structure one needs only to find the shortest description of the image data in terms of parts.

Figure 1: (a) Two-dimensional binary patterns used to segment silhouettes into parts. (b) Spatial structure of a receptive field corresponding to one of these binary patterns.

2.2 A Computational Procedure. How can the shortest/most likely image description be computed? Let {H} be a set of n part hypotheses h_i produced by our filters/receptive fields, and let {H*} be a subset of {H} containing m hypotheses. The particular elements which comprise {H*} can be indicated by a vector x consisting of n - m zeros and m ones, with a one in slot i indicating that hypothesis h_i is an element of {H*}.

The presence of part hypothesis h_i in the set {H} indicates that a particular pattern from among those illustrated in figure 1(a) has at least a minimal correspondence to the image data at some particular image location. Let us designate the number of image pixels at which h_i and the image agree (have the same value) by a_ii, and the number of image pixels at which h_i and the image disagree (have different values) by e_ii. Then h_i provides an encoding of the image which saves S(h_i) bits as compared to a simple pixel-by-pixel description of the image pixel values. The amount of this savings, in bits, is:

S(h_i) = k_1 a_ii - k_2 e_ii - k_3    (2.1)
where k_1 is the average number of bits needed to specify a single image pixel value, k_2 is the average number of bits needed to specify that a particular pixel is erroneously encoded by h_i, and k_3 is the cost of specifying h_i itself. The ratio between k_1 and k_2 is our a priori estimate of the signal-to-noise ratio, including both image noise and noise from quantization of the set of 2-D shape patterns. The parameter k_3 is equal to the minus log of the probability of a particular part hypothesis. By default we make k_3 equal for all h_i; however, we can easily incorporate a priori knowledge about the likelihood of each h_i by setting k_3 to the minus log probability associated with each h_i.

Equation 2.1 allows us to find the single hypothesis which provides the best image description by simply maximizing S(h_i) over all the hypotheses h_i. To find the overall maximum-likelihood/simplest description, however, we must search among the power set of {H} to find that subset {H*} which maximizes S(x). Thus we must be able to account for interactions between the various h_i in {H*}. Let a_ij be the number of image pixels at which h_i, h_j, and the image all agree, and e_ij be the number of image pixels at which both h_i and h_j disagree with the image. We then define a matrix A with values a_ii on the diagonal and values -(1/2)a_ij for i ≠ j, and similarly a matrix E with values e_ii on the diagonal and values -(1/2)e_ij for i ≠ j. Ignoring points where three or more h_i overlap, the savings generated by encoding the image data using {H*} (as specified by the vector x) is simply

S(x) = k_1 x A x^T - k_2 x E x^T - k_3 x x^T    (2.2)
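Equation 2.2 can be evaluated directly for small hypothesis sets. The sketch below is illustrative only: the matrix entries and the k settings are made-up numbers, and exhaustive search over subsets stands in for the network solution described next.

```python
import numpy as np
from itertools import product

def savings(x, A, E, k1=8.0, k2=8.0, k3=4.0):
    """Savings of equation 2.2 for a 0/1 selection vector x, where A holds
    agreement counts (a_ii on the diagonal, -a_ij/2 off it) and E the
    corresponding error counts."""
    x = np.asarray(x, dtype=float)
    return k1 * x @ A @ x - k2 * x @ E @ x - k3 * x @ x

# Two hypotheses agreeing with the image on 30 and 25 pixels, overlapping
# on 10 pixels (off-diagonal -5), and mis-encoding 2 and 3 pixels.
A = np.array([[30.0, -5.0], [-5.0, 25.0]])
E = np.array([[2.0, 0.0], [0.0, 3.0]])
best = max(product((0, 1), repeat=2), key=lambda x: savings(x, A, E))
```

Here keeping both hypotheses is the shortest encoding (savings 312, versus 220 or 172 for either alone), so the search returns (1, 1); with larger overlap or error counts the quadratic cross-terms would instead favor dropping one hypothesis.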
Equation 2.2 can easily be extended to include overlaps between three or more parts by adding in additional terms that express these higher-order overlaps. However, these higher-order overlaps are expensive to calculate. Moreover, such high-order overlaps seem to be infrequent in real imagery. I have chosen, therefore, to assume that in the final solution there are a negligible number of image points covered by three or more h_i. Note that we are not assuming that this is true of the entire set {H}, where such high-order overlaps will be common. The important consequence of this assumption is that the maximum of the savings function S(x) over all x is also the maximum of equation 2.2.

The solution to equation 2.2 is straightforward when the matrix Q

Q = k_1 A - k_2 E - k_3 I    (2.3)
is positive (or negative) definite. Unfortunately, this is not the case in this problem. As a consequence, relaxation techniques (Hummel and Zucker 1983) such as the Hopfield-Tank network (Hopfield and Tank 1985) give a very poor solution. I have therefore devised a new method of solution (and a corresponding network) which can provide a good solution to equation 2.2.

This new technique is a type of continuation method: one first picks a problem related to the original problem that can be solved, and then iteratively solves a series of problems that are progressively closer to the original problem, each time using the last solution as the starting point for the next iteration. In the problem at hand, equation 2.2 is easily solved when k_3 is large enough, as then Q is diagonally dominant and thus negative definite. Therefore, I can obtain a globally good solution by first solving using a large k_3, and then (using that answer as the starting point) progressively re-solving with smaller and smaller values of k_3 until the desired solution is obtained. Because k_3 is the cost of adding a model to our description, the effect of this continuation technique is to solve for the largest, most prominent parts first, and then to progressively add in smaller and smaller parts until the entire figure is accounted for.

The neural network interpretation of this solution method is a Hopfield-Tank network placed in a feedback loop where the diagonal weights are initially quite large and decay over time until they finally reach the desired values. In each "time step" the Hopfield-Tank network stabilizes, the diagonal weights are reduced, and the network outputs are fed back into the inputs. When the diagonal weights reach their final values, the desired outputs are obtained. It can be shown that for many well-behaved problems (for example, when the largest eigenvalues are all of one sign, with opposite-signed
eigenvalues of much smaller magnitude) this feedback technique will produce an answer that is on average substantially better than that obtained by Hopfield-Tank or relaxation methods. As with relaxation techniques (Hummel and Zucker 1983), this feedback method can be applied to problems with asymmetric weights.

A biological equivalent of our solution method is to use a set of hypercolumns (each containing cells with the excitatory subfields illustrated in figure 1) that are tied together by a Hopfield-Tank network augmented by a time-decaying feedback loop. The action of this network is to suppress activity in all but a small subset of the hypercolumns. After this network has stabilized, each of the remaining active cells corresponds exactly to one part of the imaged object. The characteristics of that cell's excitatory subfield correspond to the shape of the imaged part.

3 Segmentation Examples
This technique has been tested on over two hundred synthetic images, with widely varying noise levels (Pentland 1988). In these tests the number of visible parts was correctly determined 85-95% of the time (depending on noise level), with largely obscured or very small parts accounting for almost all of the errors. Estimates of part shape were similarly accurate. The following three examples illustrate this segmentation performance.

The first example uses synthetic range data with a dynamic range of 4 bits. In this example, only 72 2-D shape patterns were employed, in order to illustrate the effects of coarse quantization in both orientation and size. The intent of this example is to demonstrate that a high-quality segmentation into parts can be achieved despite coarse quantization in orientation, size, and range values, and despite wide variation in the weights. In the remaining examples, the 2-D shape patterns shown in figure 1(a) were employed.

Figure 2(a) shows an intensity image of a CAD model; synthetic range data from this model are shown in figure 2(b). These range data were histogrammed and automatically thresholded, producing the silhouette shown in figure 2(c). Figure 2(d) shows the operation of our new solution method. The parameter k_3 is initially set to a large value, thus making equation 2.2 diagonally dominant. In this first step only the very largest parts are recovered, as is shown in the first frame of figure 2(d). The parameter k_3 is then progressively reduced and the equation re-solved, allowing smaller and smaller parts to be recovered. This is shown in the remaining frames of figure 2(d). This solution method therefore constructs a scale hierarchy of object parts, with the largest and most visible parts at the top of the hierarchy and the smallest parts on the bottom. This scale hierarchy can be useful in matching and recognition processes.
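The coarse-to-fine progression just described can be sketched end to end. This is a toy stand-in, not the paper's network: sequential threshold updates replace the Hopfield-Tank dynamics, and the schedule constants and function name are our assumptions.

```python
import numpy as np

def continuation_solve(A, E, k1=1.0, k2=1.0, k3_final=1.0, factor=0.7):
    """Maximize S(x) = k1*xAx - k2*xEx - k3*xx by continuation: start with
    k3 large enough that Q = k1*A - k2*E - k3*I is diagonally dominant,
    settle the network, then shrink k3 and re-settle from the last answer."""
    n = A.shape[0]
    k3 = k3_final + np.abs(k1 * A - k2 * E).sum(axis=1).max()
    x = np.zeros(n)
    while True:
        Q = k1 * A - k2 * E - k3 * np.eye(n)
        for _ in range(10):                  # settle at this value of k3
            for i in range(n):               # sequential unit updates
                # turn unit i on iff doing so increases x Q x^T
                gain = Q[i, i] + 2.0 * (Q[i] @ x - Q[i, i] * x[i])
                x[i] = 1.0 if gain > 0 else 0.0
        if k3 <= k3_final:
            return x
        k3 = max(k3_final, k3 * factor)
```

For instance, with two heavily overlapping hypotheses for one part and an independent hypothesis for a smaller part (agreements 10, 9, and 6; overlap 8; no errors), the largest part switches on at a higher k_3 than the small one, and the redundant overlapping hypothesis never turns on, yielding x = (1, 0, 1).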
Alex Pentland
Figure 2: (a) Intensity image of a CAD model. (b) Range image of this model. (c) Silhouette of the range data. (d) This sequence of images illustrates how our continuation method constructs a scale-space description of part structure, first recovering only large, important parts and then recovering progressively smaller part structure. (e) Final segmentation into parts obtained using only very coarsely quantized 2-D patterns; 3-D models corresponding to recovered parts are used to illustrate the recovered structure. (f) Segmentations for a 5:1 ratio of the parameters k_i, showing that the segmentation is stable.

The final segmentation for this figure is shown in figure 2(e); here 3-D volumetric models have been substituted for their corresponding¹ 2-D shapes in order to better illustrate how the silhouette was segmented into parts. The z dimension of these 3-D models is arbitrarily set equal to the smaller of the x and y dimensions. It can be seen that, apart from coarse quantization in orientation and size, the part segmentation is a good one.

¹That is, for each 2-D pattern we substituted a 3-D CAD model whose outline corresponds exactly to the 2-D shape pattern.
Part Segmentation for Object Recognition
One important question is the stability of segmentation with respect to these parameters. Figure 2(f) shows the results of varying the ratio of the parameters k1, k2, and k3 over a range of 5:1. It can be seen that the part segmentation is stable, although as the relative cost of each model increases (the final value of k3 becomes large) small details (such as the feet) disappear.

The second example of segmenting a silhouette into parts uses a real image of a person, shown in figure 3(a). A silhouette was produced by automatic thresholding of a fractal measure of texture smoothness; this silhouette is shown in figure 3(b). The resulting segmentation into parts is shown in figure 3(c). An example of segmenting a more complex silhouette into parts uses the Rites of Spring, a drawing by Picasso, shown in figure 3(d). The area within the box was digitized and the intensity thresholded to produce a
Figure 3: (a) Image of a person. (b) Silhouette produced by thresholding a fractal texture measure. (c) Automatic segmentation into parts. (d) The Rites of Spring, by Picasso. (e) Digitized version. (f) The automatic segmentation into parts.
coarse silhouette, as shown in figure 3(e). The automatic segmentation is shown in figure 3(f). It is surprising that such a good segmentation can be produced from this hand-drawn, coarsely digitized image (note that very small details, e.g., the goat's horns, were missed because they were smaller than any of the 2-D patterns).

4 Summary
I have described a method for segmenting 2-D images into their component parts, a critical stage of processing in many theories of object recognition. This method uses two stages: a detection stage, which uses matched filters to extract hypotheses about part structure, and an optimization stage, where all hypotheses about the object's part structure are combined into a globally optimal (i.e., simplest, most likely) explanation of the image data. The first stage is implemented by local competition among the filters illustrated in figure 1(a), and the second stage is implemented by a new type of neural network that gives substantially better answers than previously suggested optimization networks. This new network may be described as a relaxation or Hopfield-Tank network augmented by time-decaying feedback. For additional details the reader is referred to Pentland (1988).

Acknowledgments

This research was made possible in part by National Science Foundation Grant No. IRI-8719920. I wish to especially thank Yvan Leclerc for his comments, and for reviving my interest in minimal length encoding.
References

Ballard, D.H., G.E. Hinton, and T.J. Sejnowski. 1983. Parallel Visual Computation. Nature 306, 21-26.
Binford, T.O. 1971. Visual Perception by Computer. Proceedings of the IEEE Conference on Systems and Control, Miami.
Hoffman, D. and W. Richards. 1985. Parts of Recognition. In: From Pixels to Predicates, ed. A. Pentland. New Jersey: Ablex Publishing Co.
Hopfield, J.J. and D.W. Tank. 1985. Neural Computation of Decisions in Optimization Problems. Biological Cybernetics 52, 141-152.
Hummel, R.A. and S.W. Zucker. 1983. On the Foundations of Relaxation Labeling Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 5, 267-287.
Leclerc, Y. 1988. Constructing Simple Stable Descriptions for Image Partitioning. Proc. DARPA Image Understanding Workshop, April 6-8, Boston, MA, 365-382.
Marr, D. and K. Nishihara. 1978. Representation and Recognition of the Spatial Organization of Three-dimensional Shapes. Proceedings of the Royal Society London B 200, 269-294.
Pentland, A. 1988. Automatic Recovery of Deformable Part Models. Massachusetts Institute of Technology Media Lab Vision Sciences Technical Report 104.
Poggio, T., V. Torre, and C. Koch. 1985. Computational Vision and Regularization Theory. Nature 317, 314-319.
Rissanen, J. 1983. Minimum-length Description Principle. Encyclopedia of Statistical Sciences 5, 523-527. New York: Wiley.
Received 23 September; accepted 8 November 1988.
Communicated by Richard Andersen
Computing Optical Flow in the Primate Visual System

H. Taichi Wang and Bimal Mathur
Science Center, Rockwell International, Thousand Oaks, CA 91360, USA
Christof Koch
Computation and Neural Systems Program, Divisions of Biology and Engineering and Applied Sciences, 216-76, California Institute of Technology, Pasadena, CA 91125, USA
Computing motion on the basis of the time-varying image intensity is a difficult problem for both artificial and biological vision systems. We show how gradient models, a well known class of motion algorithms, can be implemented within the magnocellular pathway of the primate's visual system. Our cooperative algorithm computes optical flow in two steps. In the first stage, assumed to be located in primary visual cortex, local motion is measured, while spatial integration occurs in the second stage, assumed to be located in the middle temporal area (MT). The final optical flow is extracted in this second stage using population coding, such that the velocity is represented by the vector sum of neurons coding for motion in different directions. Our theory, relating the single-cell to the perceptual level, accounts for a number of psychophysical and electrophysiological observations and illusions.

1 Introduction
In recent years, a number of theories have been advanced at both the computational and the psychophysical level, explaining aspects of biological motion perception (for a review see Ullman 1981; Nakayama 1985; Hildreth and Koch 1987). One class of motion algorithms exploits the relation between the spatial and the temporal intensity change at a particular point (Fennema and Thompson 1979; Horn and Schunck 1981; Marr and Ullman 1981; Hildreth 1984). In this article we address in detail how these algorithms can be mapped onto neurons in striate and extrastriate primate cortex (Ballard et al. 1983). Our neuronal implementation is derived from the most common version of the gradient algorithm, proposed within the framework of machine vision (Horn and Schunck 1981). Due to the "aperture" problem inherent in their definition of optical flow, only the component of

Neural Computation 1, 92-103 (1989) © 1989 Massachusetts Institute of Technology
motion along the local spatial brightness gradient can be recovered. In their formulation, optical flow is then computed by minimizing a two-part quadratic variational functional. The first term forces the final optical flow to be compatible with the locally measured motion component ("constraint line term"), while the second term imposes the constraint that the final flow field should be as smooth as possible. Such a "smoothness" or "continuity" constraint is common to most early vision algorithms.

2 A Neural Network Implementation
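For reference, the two-part quadratic functional of Horn and Schunck (1981) just described can be written out explicitly, with image brightness derivatives I_x, I_y, I_t and flow components u, v (the weighting constant is written here as λ to match the λ used later in the text; Horn and Schunck themselves write it as α²):

```latex
E(u,v) = \iint \left( I_x u + I_y v + I_t \right)^2 \, dx\, dy
       \;+\; \lambda \iint \left( \|\nabla u\|^2 + \|\nabla v\|^2 \right) dx\, dy
```

The first integral is the constraint line term, and the second is the smoothness term penalizing spatial variation of the flow field.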
Commensurate with this method, and in agreement with psychophysical results (e.g. Welch 1989), our network extracts the optical flow in two stages (Fig. 1). In a preliminary stage, the time-varying image I(i,j) is projected onto the retina and relayed to cortex via two neuronal pathways providing information as to the spatial location of image features (S neurons) and temporal changes in these features (T neurons): S(i,j) = ∇²G * I(i,j) and T(i,j) = ∂(∇²G * I(i,j))/∂t, where * is the convolution operator and G the 2-D Gaussian filter (Enroth-Cugell and Robson 1966; Marr and Hildreth 1980; Marr and Ullman 1981). In the first processing stage, the local motion information is represented using a set of n ON-OFF orientation- and direction-selective cells U, each with preferred direction indicated by the unit vector Θ_k:

U(i,j,k) = -T(i,j) ∇_k S(i,j) / (ε + |∇S(i,j)|²)    (2.1)
where ε is a constant and ∇_k is the spatial derivative along the direction Θ_k. This derivative is approximated by projecting the convolved image S(i,j) onto a "simple" type receptive field, consisting of a 1 by 7 pixel positive (ON) subfield adjacent to a 1 by 7 pixel negative (OFF) subfield. The cell U responds optimally if a bar or grating oriented at right angles to Θ_k moves in direction Θ_k. Note that U is proportional to the product of a transient cell (T) with a sustained simple cell with an odd-symmetric receptive field, with an output proportional to the magnitude of the local component velocity (as long as |∇_k S(i,j)| > ε). At each location i,j, n such neurons code for motion in n directions. Equation (2.1) differs from the standard gradient model, in which U = -T/∇_k S, by including a gain control term, ε, such that U does not diverge if the stimulus contrast decreases to zero. ε is set to a fixed fraction of the square of the maximal magnitude of the gradient ∇S over all values of i,j. Our gradient-like scheme can be approximated, for small enough values of the local contrast (i.e. if |∇S(i,j)|² < ε), by -T(i,j)∇_k S(i,j). Under this condition, our model can be considered a second-order model, similar to the correlation or spatio-temporal energy models (Hassenstein and Reichardt 1956; Poggio and Reichardt 1973; Adelson and Bergen 1985; Watson and Ahumada
1985). We also require a set of n ON-OFF orientation- but not direction-selective neurons E, with E(i,j,k) = |∇_k S(i,j)|, where the absolute value ensures that these neurons only respond to the magnitude of the spatial gradient, but not to its sign. We assume that the final optical flow field is computed in a second stage, using a population coding scheme such that the velocity is represented within a set of n' neurons V at location i,j, each with preferred direction Θ_k, with V(i,j) = Σ_k V(i,j,k) Θ_k. Note that the response of any individual cell V(i,j,k) is not the projection of the velocity field V(i,j) onto Θ_k. For any given visual stimulus, the state of the V neurons is determined by minimizing the neuronal equivalent of the above mentioned variational functional. The first term, enforcing the constraint that the final optical flow should be compatible with the measured data, has the form:

L0 = Σ_{i,j,k} E²(i,j,k) [Σ_{k'} V(i,j,k') cos(k'-k) - U(i,j,k)]²    (2.2)
where cos(k’- k ) is a shorthand for the cos of the angle between @k’ and @k. The term E2(i,j , k ) ensures that the local motion components U ( i ,j , k ) only have an influence when there is an appropriate oriented local pattern; thus E2 prevents velocity terms incompatible with the measured data from contributing significantly to Lo. In order to sharpen the orientation tuning of E , we square the output of E (the orientation tuning curve of E has a half-width of about 60”). The smoothness term, minimizing the square of the first derivative of the optical flow (Horn and Schunck 1981) takes the following form:
Our algorithm computes the state of the V neurons that minimizes L0 + λL1 (λ is a free parameter, usually set to 10). We can always find this state by evolving V(i,j,k) on the basis of the steepest descent rule: ∂V/∂t = -∂(L0 + λL1)/∂V. Since the variational functional is quadratic in V, the right-hand side of the above differential equation is linear in V. Conceptually, we can think of the coefficients of this linear equation as synaptic weights, while the left-hand side can be interpreted as a capacitive term, determining the dynamics of our model neurons. In other words, the state of the V neurons evolves by summing the synaptic contributions from E, U, and neighboring V neurons and updating accordingly. Thus, the system "relaxes" into its final and unique state. To mimic neuronal responses more accurately, the output of our model neurons S, T, E, U, and V is set to zero if the net input is negative.
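The two stages can be sketched end to end in a few dozen lines. This is an illustrative Python rendering of the equation forms described in the text (gain-controlled gradient measurement, a cosine-projection data term weighted by E², and nearest-neighbor smoothness), not the authors' implementation; the wraparound boundary handling via `np.roll` and the step size `dt` are choices of this sketch.

```python
import numpy as np

def _smooth(img, sigma):
    """Separable Gaussian blur (the G filter)."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    g = np.exp(-x**2 / (2.0 * sigma**2)); g /= g.sum()
    tmp = np.apply_along_axis(np.convolve, 1, img, g, mode='same')
    return np.apply_along_axis(np.convolve, 0, tmp, g, mode='same')

def _laplacian(img):
    return (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
            np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4.0 * img)

def first_stage(frame0, frame1, n=16, sigma=2.0, eps_frac=0.05):
    """S and T channels plus the U (direction-selective) and E
    (orientation-selective) cells: U = -T grad_k(S) / (eps + |grad S|^2),
    E = |grad_k(S)|, with U half-wave rectified."""
    S0 = _laplacian(_smooth(np.asarray(frame0, float), sigma))
    S1 = _laplacian(_smooth(np.asarray(frame1, float), sigma))
    T = S1 - S0                                  # temporal derivative of S
    gy, gx = np.gradient(S0)                     # spatial gradient of S
    eps = max(eps_frac * np.max(gx**2 + gy**2), 1e-12)
    U = np.empty((n,) + S0.shape); E = np.empty_like(U)
    for k in range(n):
        th = 2.0 * np.pi * k / n
        grad_k = gx * np.cos(th) + gy * np.sin(th)
        E[k] = np.abs(grad_k)
        U[k] = np.maximum(-T * grad_k / (eps + gx**2 + gy**2), 0.0)
    return S0, T, U, E

def relax(U, E, lam=10.0, dt=0.002, n_iter=300):
    """Second stage: steepest descent on L0 + lam*L1 for the V cells.
    U, E have shape (n, H, W)."""
    n = U.shape[0]
    k = np.arange(n)
    C = np.cos(2.0 * np.pi * (k[:, None] - k[None, :]) / n)   # cos(k - k')
    V = np.zeros_like(U)
    for _ in range(n_iter):
        proj = np.tensordot(C, V, axes=(1, 0))    # sum_k' V(k') cos(k'-k)
        dL0 = 2.0 * np.tensordot(C, E**2 * (proj - U), axes=(0, 0))
        lap = (np.roll(V, 1, 1) + np.roll(V, -1, 1) +
               np.roll(V, 1, 2) + np.roll(V, -1, 2) - 4.0 * V)
        V -= dt * (dL0 - 2.0 * lam * lap)         # dV/dt = -d(L0+lam*L1)/dV
    return np.maximum(V, 0.0)                     # rectified outputs

def population_flow(V):
    """Final optical flow as the vector sum over preferred directions."""
    th = 2.0 * np.pi * np.arange(V.shape[0]) / V.shape[0]
    return (np.tensordot(np.cos(th), V, axes=(0, 0)),
            np.tensordot(np.sin(th), V, axes=(0, 0)))
```

With a uniform stimulus the smoothness term vanishes and the descent converges to the least-squares fit of the data term alone, so the population vector points in the stimulus direction at every location.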
Figure 1: Computing motion in neuronal networks. (a) Schematic representation of our model. The image I is projected onto the rectangular 64 by 64 retina and sent to the first processing stage via the S and T channels. A set of n = 16 ON-OFF orientation- (E) and direction-selective (U) cells code local motion in 16 different directions Θ_k. These cells are most likely located in layers 4Cα and 4B of V1. Neurons with overlapping receptive field positions i,j but different preferred directions Θ_k are arranged here in 16 parallel planes. The ON subfield of one such U cell is shown in Fig. 4a. The output of both E and U cells is relayed to a second set of 64 by 64 by 16 V cells where the final optical flow is represented via population coding, V(i,j) = Σ_k V(i,j,k) Θ_k, with n' = 16. Each cell V(i,j,k) in this second stage receives input from cells E and U at location i,j as well as from neighboring neurons at different spatial locations. We assume that the V units correspond to a subpopulation of MT cells. (b) Polar plot of the median neuron (solid line) in MT of the owl monkey in response to a field of random dots moving in different directions (Baker et al. 1981). The tuning curve of one of our model V cells in response to a moving bar is superimposed (dashed line). Figure courtesy of J. Allman and S. Petersen.
3 Physiology and Psychophysics
Since the magnocellular pathway in primates is the one processing motion information (Livingstone and Hubel 1988; DeYoe and Van Essen 1988), we assume that the U and E neurons would be located in layers 4Cα and 4B of V1 (see also Hawken et al. 1988) and the V neurons in area MT, which contains a very high fraction of direction- and speed-tuned neurons (Allman and Kass 1971; Maunsell and Van Essen 1983). All 2n neurons U and E with receptive field centers at location i,j then project to the n' MT cells V in an excitatory or inhibitory (via interneurons) manner, depending on whether the angle between the preferred direction of motion of the pre- and post-synaptic neuron is smaller or larger than 90°.¹ Anatomically, we then predict that each MT cell receives input from V1 (or V2) cells located in all different orientation-columns. The smoothness constraint of equation (2.3) results in massive interconnections among neighboring V cells (Fig. 1a).

Our model can explain a number of perceptual phenomena. When two identical square gratings, oriented at a fixed angle to each other, are moved perpendicular to their orientation (Fig. 2a), human observers see the resulting plaid pattern move coherently in the direction given by the intersection of their local constraint lines ("velocity constraint combination rule"; in the case of two gratings moving at right angles at the same velocity, the resultant is the diagonal; Adelson and Movshon 1982). The response of our network to such an experiment is illustrated in figure 2: the U cells only respond to the local component of motion (component selectivity; Fig. 2b), while the V cells respond to the global motion (Fig. 2c), as can be seen by computing the vector sum over all V cells at every location (pattern selectivity; Fig. 2d). About 30% of all MT cells do respond in this manner, signaling the motion of the coherently moving plaid pattern (Movshon et al. 1985).
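The velocity constraint combination rule mentioned above has a compact algebraic form: a grating with unit normal n_i and normal speed s_i constrains the pattern velocity v to the line n_i · v = s_i, and the perceived plaid velocity is the intersection of the two constraint lines. A standard construction (not code from the paper):

```python
import numpy as np

def intersection_of_constraints(th1, s1, th2, s2):
    """Pattern velocity from two grating constraints n_i . v = s_i,
    where th_i is the direction of grating i's normal (radians) and
    s_i its speed along that normal."""
    N = np.array([[np.cos(th1), np.sin(th1)],
                  [np.cos(th2), np.sin(th2)]])
    return np.linalg.solve(N, np.array([s1, s2], float))

# Two gratings at +/-45 degrees moving at equal speed: the plaid moves
# along the diagonal (here the x axis), faster than either component.
v = intersection_of_constraints(np.pi / 4, 1.0, -np.pi / 4, 1.0)
```

For two gratings at right angles moving at the same speed this recovers the diagonal resultant described in the text.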
In fact, under the conditions of rigid motion in the plane observed in Adelson and Movshon's experiments, both their "velocity space combination rule" and the "smoothness" constraint converge to the solution perceived by human observers (for more results see Wang et al. 1989). Given the way the response of the U neurons varies with visual contrast (equation 2.1), our model predicts that if the two gratings making up the plaid pattern differ in contrast, the final pattern velocity will be biased in the direction of the component velocity of the grating with the larger contrast. Recent psychophysical experiments support this conjecture (Stone et al. 1988). It should be noted that the optical flow field is not represented explicitly within neurons in the second stage, but only implicitly, via vector addition. Our algorithm reproduces both "motion capture" (Ramachandran and

¹The appropriate weight of the synaptic connection between U and V is cos(k - k')U(k')E²(k'). Various biophysical mechanisms can implement the required multiplicative interaction as well as the synaptic power law (Koch and Poggio 1987).
Figure 2: (a) Two gratings moving towards the lower right (one at -26° and one at -64°), the former moving at twice the speed of the latter. The amplitude of the composite is the sum of the amplitudes of the individual bars. The neuronal responses of a 12 by 12 pixel patch (outlined in (a)) are shown in the next three sub-panels. (b) Response of the U neurons to this stimulus. The half-wave rectified output of all 16 cells is plotted in a radial coordinate system at each location as long as the response is significantly different from zero. (c) The output of the V cells using the same needle diagram representation after the network converged. (d) The resulting optical flow field, extracted from (c) via population coding, corresponding to a coherent plaid moving towards the right, similar to the perception of human observers (Adelson and Movshon 1982) as well as to the response of a subset of MT neurons in the macaque (Movshon et al. 1985). The final optical flow is within 5% of the correct flow field.
Anstis 1983; see Wang et al. 1989) and "motion coherence" (Williams and Sekuler 1984), as illustrated in figure 3. As demonstrated previously, these phenomena can be explained, at least qualitatively, by a smoothness or local rigidity constraint (Yuille and Grzywacz 1988; Bülthoff et al. 1989). Finally, γ-motion, a visual illusion first reported by the Gestalt psychologists (Lindemann 1922; Kofka 1931; for a related illusion in man and fly see Bülthoff and Götz 1979), is also mimicked by our algorithm. This illusion arises from the initial velocity measurement stage and does not rely on the smoothness constraint. Cells in area MT respond well not only to motion of a bar or grating but also to a moving random dot pattern (Albright 1984; Allman et al. 1985). Similarly, our algorithm detects a random-dot figure moving over a stationary random-dot background, as long as the spatial displacement between two consecutive frames is not too large (Fig. 4). An interesting distinction arises between direction-selective cells in V1 and MT. While the optimal orientation of V1 cells is always perpendicular to their optimal direction, this is only true for about 60% of MT cells (type I cells; Albright 1984; Rodman and Albright 1989). 30% of MT cells respond strongly to flashed bars oriented parallel to the cells' preferred direction of motion (type II cells). If we identify our V cells with this MT subpopulation, we predict that type II cells should respond to an extended bar (or grating) moving parallel to its edge (Fig. 4).

4 Discontinuities in the Optical Flow
The major drawback of all motion algorithms is the degree of smoothness required, smearing out any discontinuities in the optical flow field, such as those arising along occluding objects or along a figure-ground boundary. It has been shown previously how this can be dealt with by introducing the concept of line processes, which explicitly code for the presence of discontinuities in the motion field (Hutchinson et al. 1988; see also Poggio et al. 1988). If the spatial gradient of the optical flow between two neighboring points is larger than some threshold, the flow field "is broken"; that is, the process or "neuron" coding for a motion discontinuity at that location is switched on and no smoothing occurs. If little spatial variation exists, the discontinuity remains off. The performance of the original version of the Horn and Schunck algorithm is greatly improved using this idea (Hutchinson et al. 1988). Perceptually, it is known that the visual system uses motion to segment different parts of visual scenes (Baker and Braddick 1982; van Doorn and Koenderink 1983; Hildreth 1984). But what about the possible cellular correlate of line processes? Allman and colleagues (Allman et al. 1985) first described cells in area MT of the owl monkey whose "true" receptive field extended well beyond the classical receptive field as mapped with bar or spot stimuli. About half of all MT cells have an antagonistic direction-selective
Figure 3: (a) In "motion coherence," a cloud of dots is perceived to move in the direction of their common motion component. In this sequence, all dots have an upward velocity component, while their horizontal velocity component is random. (b) The final velocity field only shows the motion component common to all dots. Humans observe the same phenomenon (Williams and Sekuler 1984). (c) In γ-motion, a dark stimulus flashed onto a homogeneously lit background appears to expand (Lindemann 1922; Kofka 1931). It will disappear with a motion of contraction. (d) Our algorithm perceives a similar expansion when a disk appears. The contour of the stimulus is projected onto the final optical flow.

surround, such that the response of the cell to motion of a random dot display or an edge within the center of the receptive field can be modified by moving a stimulus within a very large surrounding region. The response depends on the difference in speed and direction of motion between the center and the surround, and is maximal if the surround
Figure 4: (a) A dark bar (outlined) is moved parallel to its orientation towards the right. Due to the aperture problem, those U neurons whose receptive fields only "see" the straight elongated edges of the bar - and not the leading or trailing edges - will fail to respond to this moving stimulus, since motion remains invisible on the basis of purely local information. The ON subfield of the receptive field of a vertically oriented U cell is superimposed onto the image. (b) It is only after information has been integrated, following the smoothing process inherent in the second stage of our algorithm, that the V neurons respond to this motion. The type II cells of Albright (1984) in MT should respond to this stimulus. (c) Response of our algorithm to a random-dot figure-ground stimulus. The central 10 by 10 pixel square moved by 1 pixel toward the left. (d) Final optical flow after smoothing. The V cells detect the figure, similar to cells in MT. The contour of the translated square is projected onto the final optical flow.
moves at the same speed as the stimulus in the center but in the opposite direction. Thus, tantalizing hints exist as to the possible neuronal basis of motion discontinuities. Recently, two variations of Horn and Schunck's (1981) original algorithm have been proposed, based on computational considerations (Uras et al. 1988; Yuille and Grzywacz 1988). Both algorithms can be mapped onto the neuronal network we have proposed, with minimal changes either in the U stage (Uras et al. 1988) or by increasing the connectivity among more distant V cells (Yuille and Grzywacz 1988). It remains an open challenge to provide both psychophysical and electrophysiological evidence to evaluate the validity of these and similar schemes (e.g. Nakayama and Silverman 1988). We are currently trying to extend our model to account for the intriguing phenomenon of "motion transparency," such as when two fields of random dots moving in opposite directions are perceived to form the 3-D motion field associated with a transparent and rotating cylinder (Siegel and Andersen 1988).

Acknowledgments

We thank John Allman, David Van Essen, and Alan Yuille for many fruitful discussions and Andrew Hsu for computing the figure-ground response. Support for this research came from NSF grant EET-8700064, an ONR Young Investigator Award, and an NSF Presidential Young Investigator Award to C.K.

References

Adelson, E.H. and J.R. Bergen. 1985. Spatio-temporal Energy Models for the Perception of Motion. J. Opt. Soc. Am. A 2, 284-299.
Adelson, E.H. and J.A. Movshon. 1982. Phenomenal Coherence of Moving Visual Patterns. Nature 300, 523-525.
Albright, T.L. 1984. Direction and Orientation Selectivity of Neurons in Visual Area MT of the Macaque. J. Neurophysiol. 52, 1106-1130.
Allman, J.M. and J.H. Kass. 1971. Representation of the Visual Field in the Caudal Third of the Middle Temporal Gyrus of the Owl Monkey (Aotus trivirgatus). Brain Res. 31, 85-105.
Allman, J., F. Miezin, and E. McGuinness. 1985.
Direction- and Velocity-specific Responses from Beyond the Classical Receptive Field in the Middle Temporal Area (MT). Perception 14, 105-126.
Ballard, D.H., G.E. Hinton, and T.J. Sejnowski. 1983. Parallel Visual Computation. Nature 306, 21-26.
Baker, C.L. and O.J. Braddick. 1982. Does Segregation of Differently Moving Areas Depend on Relative or Absolute Displacement? Vision Res. 22, 851-856.
Baker, J.F., S.E. Petersen, W.T. Newsome, and J.M. Allman. 1981. Visual Response Properties of Neurons in Four Extrastriate Visual Areas of the Owl
Monkey (Aotus trivirgatus): A Quantitative Comparison of Medial, Dorsomedial, Dorsolateral, and Middle Temporal Areas. J. Neurophysiol. 45, 397-416.
Bülthoff, H.H., J.J. Little, and T. Poggio. 1989. Parallel Computation of Motion: Computation, Psychophysics and Physiology. Nature, in press.
Bülthoff, H.H. and K.G. Götz. 1979. Analogous Motion Illusion in Man and Fly. Nature 278, 636-638.
DeYoe, E.A. and D.C. Van Essen. 1988. Concurrent Processing Streams in Monkey Visual Cortex. Trends Neurosci. 11, 219-226.
Enroth-Cugell, C. and J.G. Robson. 1966. The Contrast Sensitivity of Retinal Ganglion Cells of the Cat. J. Physiol. (Lond.) 187, 517-552.
Fennema, C.L. and W.B. Thompson. 1979. Velocity Determination in Scenes Containing Several Moving Objects. Comput. Graph. Image Proc. 9, 301-315.
Hassenstein, B. and W. Reichardt. 1956. Systemtheoretische Analyse der Zeit-, Reihenfolgen- und Vorzeichenauswertung bei der Bewegungsperzeption des Rüsselkäfers Chlorophanus. Z. Naturforschung 11b, 513-524.
Hildreth, E.C. 1984. The Measurement of Visual Motion. Cambridge, MA: MIT Press.
Hildreth, E.C. and C. Koch. 1987. The Analysis of Visual Motion. Ann. Rev. Neurosci. 10, 477-533.
Horn, B.K.P. and B.G. Schunck. 1981. Determining Optical Flow. Artif. Intell. 17, 185-203.
Hutchinson, J., C. Koch, J. Luo, and C. Mead. 1988. Computing Motion using Analog and Binary Resistive Networks. IEEE Computer 21, 52-61.
Koch, C. and T. Poggio. 1987. Biophysics of Computation. In: Synaptic Function, eds. G.M. Edelman, W.E. Gall, and W.M. Cowan, 637-698. New York: John Wiley.
Kofka, K. 1931. In: Handbuch der normalen und pathologischen Physiologie 12, eds. A. Bethe et al. Berlin: Springer.
Lindemann, E. 1922. Experimentelle Untersuchungen über das Entstehen und Vergehen von Gestalten. Psych. Forsch. 2, 5-60.
Livingstone, M. and D. Hubel. 1988. Segregation of Form, Color, Movement, and Depth: Anatomy, Physiology and Perception. Science 240, 740-749.
Marr, D. and E.C. Hildreth. 1980.
Theory of Edge Detection. Proc. R. Soc. Lond. B 207, 187-217.
Marr, D. and S. Ullman. 1981. Directional Selectivity and its Use in Early Visual Processing. Proc. R. Soc. Lond. B 211, 151-180.
Maunsell, J.H.R. and D. Van Essen. 1983. Functional Properties of Neurons in Middle Temporal Visual Area of the Macaque Monkey. II. Binocular Interactions and Sensitivity to Binocular Disparity. J. Neurophysiol. 49, 1148-1167.
Movshon, J.A., E.H. Adelson, M.S. Gizzi, and W.T. Newsome. 1985. The Analysis of Moving Visual Patterns. In: Exp. Brain Res. Suppl. 11: Pattern Recognition Mechanisms, eds. C. Chagas, R. Gattass, and C. Gross, 117-151. Heidelberg: Springer.
Nakayama, K. 1985. Biological Motion Processing: A Review. Vision Res. 25, 625-660.
Nakayama, K. and G.H. Silverman. 1988. The Aperture Problem II: Spatial Integration of Velocity Information along Contours. Vision Res. 28, 747-753.
Poggio, T., E.B. Gamble, and J.J. Little. 1988. Parallel Integration of Visual Modules. Science 242, 337-340.
Poggio, T. and W. Reichardt. 1973. Considerations on Models of Movement Detection. Kybernetik 13, 223-227.
Ramachandran, V.S. and S.M. Anstis. 1983. Displacement Threshold for Coherent Apparent Motion in Random-dot Patterns. Vision Res. 23, 1719-1724.
Rodman, H. and T. Albright. 1989. Single-unit Analysis of Pattern-motion Selective Properties in the Middle Temporal Area (MT). Exp. Brain Res., in press.
Siegel, R.M. and R.A. Andersen. 1988. Perception of Three Dimensional Structure from Motion in Monkey and Man. Nature 331, 259-261.
Stone, L.S., J.B. Mulligan, and A.B. Watson. 1988. Neural Determination of the Direction of Motion: Contrast Affects the Perceived Direction of Motion. Neurosci. Abstr. 14, 502.5.
Ullman, S. 1981. Analysis of Visual Motion by Biological and Computer Systems. IEEE Computer 14, 57-69.
Uras, S., F. Girosi, A. Verri, and V. Torre. 1988. A Computational Approach to Motion Perception. Biol. Cybern. 60, 79-87.
van Doorn, A.J. and J.J. Koenderink. 1983. Detectability of Velocity Gradients in Moving Random-dot Patterns. Vision Res. 23, 799-804.
Wang, H.T., B. Mathur, A. Hsu, and C. Koch. 1989. Computing Optical Flow in the Primate Visual System: Linking Computational Theory with Perception and Physiology. In: The Computing Neurone, eds. R. Durbin, C. Miall, and G. Mitchison. Reading: Addison-Wesley. In press.
Watson, A.B. and A.J. Ahumada. 1985. Model of Human Visual-motion Sensing. J. Opt. Soc. Am. A 2, 322-341.
Welch, L. 1989. The Perception of Moving Plaids Reveals Two Motion Processing Stages. Nature, in press.
Williams, D. and R. Sekuler. 1984. Coherent Global Motion Percepts from Stochastic Local Motions. Vision Res. 24, 55-62.
Yuille, A.L. and N.M. Grzywacz. 1988.
A Computational Theory for the Perception of Coherent Visual Motion. Nature 333, 71-73.
Received 28 October; accepted 6 December 1988.
Communicated by Richard Lippmann
A Multiple-Map Model for Pattern Classification

Alan Rojer and Eric Schwartz
Computational Neuroscience Laboratory, New York University Medical Center, and Courant Institute of Mathematical Sciences, New York University, New York, NY 10016, USA
A characteristic feature of vertebrate sensory cortex (and midbrain) is the existence of multiple two-dimensional map representations. Some workers have considered single-map classification (e.g. Kohonen 1984) but little work has focused on the use of multiple maps. We have constructed a multiple-map classifier, which permits abstraction of the computational properties of a multiple-map architecture. We identify three problems which characterize a multiple-map classifier: classification in two dimensions, mapping from high dimensions to two dimensions, and combination of multiple maps. We demonstrate component solutions to each of the problems, using Parzen-window density estimation in two dimensions, a generalized Fisher discriminant function for dimensionality reduction, and split/merge methods to construct a "tree of maps" for the multiple-map representation. The combination of components is modular and each component could be improved or replaced without affecting the other components. The classifier training procedure requires time linear in the number of training examples; classification time is independent of the number of training examples and requires constant space. Performance of this classifier on Fisher's iris data, Gaussian clusters on a five-dimensional simplex, and digitized speech data is comparable to competing algorithms, such as nearest-neighbor, back-propagation and Gaussian classifiers. This work provides an example of the computational utility of multiple-map representations for classification. It is one step towards the goal of understanding why brain areas such as visual cortex utilize multiple map-like representations of the world.

1 Introduction
One of the most prominent features of the vertebrate sensory system is the use of multiple two-dimensional maps to represent the world. The observational data base for cortical maps is excellent, and this area represents one of the better-understood aspects of large-scale brain architecture. Recently, through the use of a system for computer-aided neuroanatomy, we have been able to obtain high-precision reconstructions of primary visual cortex map and column architectures, have constructed accurate models of both columnar and topographic architecture of primary visual cortex, and have suggested several computational algorithms which are contingent on the specific forms of column and map architecture which occur in this first visual area of monkey cortex (Schwartz et al. 1988). We expect to be able to extend these methods and ideas to other cortical areas. There is thus good progress in the areas of measuring, modeling, and computing with single-map representations. However, the problem of how to make use of multiple maps has been little explored. Other workers have considered the application of single-map representations to classification. Kohonen (1984) has developed an algorithm for representing a feature space in a map; this algorithm constructs a space-variant representation, in rough analogy to the space-variant nature of primate visual cortex. However, this work does not provide a computational model for computing with multiple maps. We believe that a classifier utilizing a multiple-map architecture must incorporate the following modules:

- An efficient algorithm for classification in two dimensions.
- A projection of high dimensional data into a two-dimensional representation.
- An algorithm for combining multiple two-dimensional representations.

Our strategy in this work has been to use simple components to construct our multiple-map classifier. In particular, we were seeking algorithms which require one pass through the data and which are not sensitive to convergence issues (e.g. local minima in an energy function). We are interested in the overall properties of the classifier, and we are trying to deemphasize the role of the individual components, which are modular and hence subject to improvement or replacement.

Neural Computation 1, 104-115 (1989) © 1989 Massachusetts Institute of Technology

2 Classification in Two Dimensions
We assume that the items, or instances, we wish to classify are represented as vectors x ∈ R^d, where each component of x is a feature measurement. Each instance belongs to a class k. We also have a training set, a set of instances of known class (training examples). We refer to the set of training examples in class k as X_k. Our problem is to construct a set of discriminant functions f_k : R^d → R, k = 1, ..., c. An arbitrary instance x is assigned to the class k for which f_k(x) is maximal. Because the instances are represented as vectors, we can refer to the distance between an instance and a training example as ||x − x′||. We compute discriminant functions

f_k(x) = Σ_{x′ ∈ X_k} g(||x − x′||),
where g(r) is some function which decreases as r increases. If we let g be a probability density (i.e. nonnegative and integrating to one over its support) this is the Parzen-window estimate (Parzen 1962) for the a posteriori density; i.e. f_k(x) ≈ p(k|x). Since this is the same term which is maximized in the Bayes classifier, our classifier performance approaches the Bayesian limit as the approximation above approaches the actual probability density. This algorithm is related to the nearest-neighbor classifier. Its principal novelty is to use maps to store f_k(x). Then, given the training examples, we can compute f_k(x) by convolution in one pass. For illustration, see figure 1. We depict a two-class, one-dimensional classifier. The "map" is simply a segment of the real line. The training examples are shown on the x-axis as boxes. The weighting function g(x) is a Gaussian function. The individual convolutions g(x) * δ(x − x′) are shown as dotted lines. The class-specific density estimates, which are also the discriminant functions, are shown as a solid and broken line, respectively. We consider a two-dimensional, three-class problem in figure 2 and figure 3. The weighting function is a circular two-dimensional Gaussian function. The instances have been drawn from prespecified two-dimensional multivariate normal distributions; this permits construction of a Bayes classifier to determine minimal error rate. In figure 2, we show a comparison between the Parzen-window density estimates f_k(x) and the actual probability density functions for each class. In figure 3, the classifier is compared to a Bayesian classifier. The visual comparison indicates that the classifier is capturing much of the character of the Bayesian classifier. When the classifier was trained on 400 samples from each class, and tested on 300 (different) instances, its error rate was 16.0%, which may be compared to 14.4% for the Bayesian classifier. One important issue in the application of this method is the choice of the weighting function (or kernel). We have typically used Gaussian kernels, in which case we need to choose the kernel variance σ² (or covariance matrix Σ_m in higher dimensions). This is a difficult problem in general; we have used heuristic algorithms. For example, if we desire an isotropic kernel, we might use σ = N^(−1/m)·√λ₁, where λ₁ is the largest eigenvalue of the covariance matrix resulting from the projection of the data into an m-dimensional map. The factor N^(−1/m) arises from the heuristic decision to give each training instance an equal amount of map volume; since the kernel is m-dimensional, the volume scales as σ^m. More generally, we could use Σ_m = h·PΣPᵀ, where P is the projection into the map (see below). In experimental studies, we have found that the performance of the classifier is insensitive to small changes in the kernel size or shape.
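The map-based estimate above can be sketched in a few lines of Python: each class's training points are histogrammed into a two-dimensional map, the map is convolved once with a Gaussian kernel, and classification is then a lookup followed by an argmax over the class maps. The grid size, kernel width, and the two-cluster test data are our own illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def gaussian_kernel(half=7, sigma=2.0):
    # Discrete Gaussian on a (2*half+1)^2 grid, normalized to sum to one
    ax = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def fit_maps(x, y, n_classes, bins=64, extent=(0.0, 1.0)):
    """One 2-D map per class: histogram the class's training points,
    then convolve once with the kernel to get the Parzen estimate f_k."""
    kernel = gaussian_kernel()
    maps = np.empty((n_classes, bins, bins))
    for k in range(n_classes):
        pts = x[y == k]
        h, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=bins,
                                 range=[extent, extent])
        maps[k] = fftconvolve(h, kernel, mode="same")
    return maps

def classify(maps, x, bins=64, extent=(0.0, 1.0)):
    """Assign each point to the class whose map value is largest."""
    lo, hi = extent
    idx = np.clip(((x - lo) / (hi - lo) * bins).astype(int), 0, bins - 1)
    return np.argmax(maps[:, idx[:, 0], idx[:, 1]], axis=0)

# Two well-separated Gaussian clusters (toy stand-in for the figure 2/3 data):
rng = np.random.default_rng(0)
a = rng.normal([0.3, 0.3], 0.05, (400, 2))
b = rng.normal([0.7, 0.7], 0.05, (400, 2))
x = np.vstack([a, b])
y = np.repeat([0, 1], 400)
maps = fit_maps(x, y, 2)
acc = (classify(maps, x) == y).mean()   # near 1.0 for separated clusters
```

Note that the entire training cost is one histogram pass plus one convolution per class, which is the "one pass through the data" property the text emphasizes.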
Figure 1: A one-dimensional two-class classifier. Class 0 instances are shown as solid boxes; class 1 instances are shown as open boxes. The estimated a posteriori density p(k|x) (here, for one-dimensional x) is shown as a solid line for class 0, and a broken line for class 1. The Parzen-window function g(||x − x′||) for each paradigm x′ is shown as a dotted line. The classifier operates by choosing the class for which the estimated a posteriori density is maximized. Thus, samples drawn with feature measurements below 0.7 would be assigned to class 0. Samples drawn with x > 0.7 would be assigned to class 1.
Figure 2: Comparison of actual binormal density against estimated density f_k(x) computed by our classifier. Here a higher density corresponds to a darker region of the plot. The top row shows density plots for three binormal distributions in R². The bottom row shows the estimates f_k(x) computed by the classifier for 400 samples drawn from each of these three distributions. The weighting function used is a circular Gaussian with a variance approximately 1/60 the width of the figure.
3 From d Dimensions to Two Dimensions
In the previous section, we showed that the performance (as measured by error rate) approaches that of the Bayesian classifier for two-dimensional data drawn from Gaussian distributions.

Figure 3: Comparison of decision regions computed by our classifier against decision regions of a hypothetical Bayes classifier which had complete knowledge of the underlying class distributions. Regions in the Bayes classifier are clipped due to round-off.

Actually there is nothing in our derivation which restricts us to two dimensions; a d-dimensional classifier is defined as above, except that the density estimate p_k(x) will require a d-dimensional map. In practice we limit ourselves to two dimensions for three reasons. First, our original motivation is to understand the functional utility of laminar structures such as neocortex for pattern classification in the brain. Second, two-dimensional maps can be processed
by conventional image processing software and hardware, and the user of the classifier gets the benefit of visual displays of the intermediate structures in the classifier (e.g. p_k(x) and the computed partition). Finally, the number of bins required to store the maps p_k(x) grows exponentially with the dimensionality d, and thus favors small d. Restriction to two dimensions introduces an interesting aspect to the classification problem. Although our original data is in d dimensions, i.e. the data is composed of d measurements, we must somehow extract only two measurements or combinations of measurements with which to construct our classifier. The classification problem then spawns a problem of feature derivation. We can formulate the dimensionality reduction problem as construction of a function P : R^d → R² which maps d-dimensional instances to two-dimensional map positions. The two dimensions of the map constitute the two derived features. We need to specify what kind of function P we will allow. To date, we have only considered linear projections, but nonlinear functions could also be used (e.g. Kohonen's self-organizing feature map; Kohonen 1984). We apply the generalized Fisher discriminant, which was first introduced for a projection to R (Fisher 1936) and later generalized to a domain of arbitrary dimensionality (Bryan 1951). A discussion of the technique may be found in (Duda and Hart 1973). The two vectors which comprise P turn out to be the eigenvectors associated with the two largest eigenvalues in the generalized eigenvalue system

S_b u = λ S_w u,    (3.1)

where S_w is the "within-class" scatter matrix, given by

S_w = Σ_k Σ_{x ∈ X_k} (x − c_k)(x − c_k)ᵀ,    (3.2)

with c_k the class mean and N_k the number of training examples representing class k, and S_b is the "between-class" scatter matrix, given by

S_b = Σ_k N_k (c_k − x̄)(c_k − x̄)ᵀ.    (3.3)

Here, x̄ is the mean over all the training examples. In practice, we have found that S_w is nonsingular, so the system can be reduced to a standard eigenvalue problem

S_w⁻¹ S_b u = λ u.    (3.4)
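A compact sketch of this projection in Python (our own illustrative implementation, not the authors' code): the scatter matrices of equations 3.2 and 3.3 are accumulated in one pass over the class partition, and the columns of P are read off the top two eigenvectors of S_w⁻¹S_b, as in equation 3.4.

```python
import numpy as np

def fisher_projection(x, y, n_classes, out_dim=2):
    """Generalized Fisher discriminant: project R^d data to out_dim dims."""
    mean = x.mean(axis=0)
    d = x.shape[1]
    s_w = np.zeros((d, d))              # within-class scatter (eq. 3.2)
    s_b = np.zeros((d, d))              # between-class scatter (eq. 3.3)
    for k in range(n_classes):
        xk = x[y == k]
        ck = xk.mean(axis=0)
        s_w += (xk - ck).T @ (xk - ck)
        diff = (ck - mean)[:, None]
        s_b += len(xk) * (diff @ diff.T)
    # S_w is nonsingular in practice, so solve the ordinary problem (eq. 3.4)
    vals, vecs = np.linalg.eig(np.linalg.inv(s_w) @ s_b)
    order = np.argsort(vals.real)[::-1]
    p = vecs[:, order[:out_dim]].real   # columns of P : R^d -> R^out_dim
    return x @ p, p

# Three well-separated Gaussian classes in four dimensions (toy data):
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(m, 0.3, (50, 4))
               for m in ([0, 0, 0, 0], [2, 0, 0, 0], [0, 2, 0, 0])])
y = np.repeat([0, 1, 2], 50)
proj, p = fisher_projection(x, y, 3)    # proj has shape (150, 2)
```

As the text notes, the one-pass accumulation of second-order statistics is what keeps training time linear in the number of training examples.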
The extraction of the principal eigenvalue and its eigenvector is realizable using a typical Hebb synapse model with a fixed-length weight vector (Oja 1982). The performance of the discriminant can be observed with Fisher's classical iris data. This data describes a four-dimensional, three-class
problem. Figure 4 depicts the classifier constructed from the projection of the iris data into the two-dimensional subspace which maximizes the ratio described above. The classes can be seen to be fairly well separated. The classifier was tested by splitting the 150 instances into a 100-instance training set and a 50-instance test set. After training on the 100 instances, the classifier achieved 98% correct classification on the 50 instances in the test set. By comparison, the nearest-neighbor classifier operating on the same training and test sets achieved 98% correct classification, the Gaussian classifier achieved 94% correct classification, and a multilayer perceptron trained using back-propagation¹ achieved 96% correct classification.

4 Using Multiple Maps: A Tree of Maps
The previous example showed that for a relatively easy four-dimensional three-class problem, the generalized Fisher discriminant analysis was adequate to obtain a map which permitted good classifier performance. But in general, the discriminant analysis does not yield enough separation. For example, consider a regular five-dimensional simplex; this is a set of six equidistant points on the unit sphere in R⁵. Locate a spherical multivariate normal distribution at each vertex of the simplex. This is a point swarm whose density declines as exp(−r²), where r is the distance from the vertex. We construct the classifier by utilizing discriminant analysis to find a projection P : R⁵ → R². With 600 training points and 300 test points, the error rate is 36%. Fortunately, we are not confined to one map. One method of using multiple maps utilizes a split/merge technique to reduce one many-class problem to several problems, each with fewer classes. We merge the original base classes into superclasses, each of which is represented by the union of training examples from its underlying base classes. We then apply discriminant analysis to the newly formed superclasses. If we can achieve an adequate separation, we proceed as above with a separate map for each superclass. If necessary, we can again perform merges among the elements of a superclass, until we have divided each superclass into component base classes. The consequence of this approach is to create a tree of maps. From the root of the tree, we project instances into a superclass. If the superclass is a base class, we assign that class to the instance. Otherwise, the instance is assigned to a superclass, which has its own map. We project the instance into that map, and continue as above, until the instance lands in a region assigned to a base class. In the training phase, we construct

¹A multilayer perceptron with 4 hidden units was trained using back-propagation (Rumelhart et al. 1986) with 5000 iterations through the training set, with ε = 0.02 and α = 0.
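In code, the descent through such a tree of maps is a short loop. The sketch below is purely illustrative: each internal node pairs a per-node classifier (here, toy lambdas on raw coordinates rather than the projection-plus-map classifiers of sections 2 and 3) with its list of children, and the example tree mirrors a simplex-style partition in which {2-5} is split off first, then resolved into {2}, {3}, and {4,5}.

```python
def classify_tree(node, x):
    """Descend the tree of maps: an internal node is (classifier, children),
    a leaf is a base-class label; stop when a base class is reached."""
    while not isinstance(node, int):
        clf, children = node
        node = children[clf(x)]
    return node

# Toy tree: the root separates {0}, {1}, and superclass {2-5}; the second
# node resolves {2}, {3}, {4,5}; the third resolves {4}, {5}.  The lambdas
# are hypothetical stand-ins for the per-node map classifiers.
tree = (lambda x: min(int(x[0]), 2),
        [0, 1,
         (lambda x: min(int(x[1]), 2),
          [2, 3,
           (lambda x: int(x[0]) % 2, [4, 5])])])

print(classify_tree(tree, [0.5, 0.0]))   # lands in base class 0 at the root
```

Classification cost is one map lookup per level of the tree, independent of the number of training examples.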
Figure 4: Iris classifier constructed from 100 training points drawn from the iris data (shaded background). Projection of four-dimensional iris data points into the two-dimensional subspace which maximizes the ratio of between-class variance to within-class variance (foreground data points).

a classifier in R² for each internal node in the tree to classify instances into one of the superclasses for that node. We can illustrate this algorithm with the simplex data. We partition the classes so that three maps are used to classify. In the first map, we merge classes 2-5 into a superclass, letting classes 0 and 1 remain as base classes. In the second map, we will resolve the superclass composed of classes 2-5 into classes 2 and 3 and a superclass formed of 4 and 5. Finally, in the third map, we will resolve the superclass formed from 4 and 5 above into component base classes. The error rate of the three-map classifier is found to be 4.7%, a dramatic improvement over the single-map classifier (36% error rate). This may be compared to error rates of 2%, 6% and 2.7% respectively for the Gaussian, nearest-neighbor and
multilayer perceptron classifiers.² We have also applied our classifier to real-world data which consisted of 22 cepstral parameters from digitized speech.³ Each of 16 data sets represented one speaker; seven classes (monosyllabic words) were present in each data set. Each set consisted of 70 training instances and 112 test instances. The results are summarized:

Classifier:                Multiple-map   Gaussian⁴   Nearest-neighbor   Multilayer perceptron⁵
Average error rate (%):        6.5           6.0            5.9                  6.3
Range (%):                  1.8-12.5      3.6-10.7       1.8-15.2             1.8-11.6
from which it may be seen that all four classifiers under consideration had closely comparable performance.

5 Automatic Generation of the Map Tree
In the preceding examples of multiple map usage, we interactively chose a map tree. In this section we explore a simple approach to automatic generation of the map tree. This is a clustering problem; we want to group classes into superclasses which in some way reflect the natural similarity between classes. We introduce the distance matrix A for the classes. For any interclass distance measure dist(i, j), A_ij = A_ji = dist(i, j). We use a very simple tree generation algorithm: we treat A as a graph with each class represented by a node, and each edge weighted according to interclass distance. We then compute the minimal spanning tree. We form superclasses by recursively removing the largest edge in the tree, yielding two subtrees, each of which forms a superclass. We can use a variety of interclass distance measures; we have experimented with distances between class means, overlap of the one-dimensional Fisher discriminant projections, and overlap of the two-dimensional Fisher discriminant projections.

²A multilayer perceptron with 7 hidden units was trained using back-propagation (Rumelhart et al. 1986) with 5000 iterations through the training set, with ε = 0.02 and α = 0.
³We are grateful to R. Lippmann of MIT Lincoln Laboratory for providing this data.
⁴Covariance matrix estimates were obtained by pooling data from all 16 speakers for each class.
⁵Multilayer perceptrons for each speaker used 15 hidden units.
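A minimal sketch of this tree-generation procedure, using Prim's algorithm for the minimal spanning tree and a flood fill to read off the two superclasses produced by deleting the heaviest edge. The interclass distance measure here (distance between class means on a line) is an illustrative choice, not the paper's.

```python
def minimal_spanning_tree(dist):
    """Prim's algorithm on a dense distance matrix; returns edges (i, j, w)."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        i, j = min(((a, b) for a in in_tree for b in range(n)
                    if b not in in_tree), key=lambda e: dist[e[0]][e[1]])
        edges.append((i, j, dist[i][j]))
        in_tree.add(j)
    return edges

def split(classes, edges):
    """Delete the heaviest MST edge and return the two superclasses."""
    heavy = max(edges, key=lambda e: e[2])
    rest = [e for e in edges if e != heavy]
    side = {heavy[0]}                     # flood-fill one side of the cut
    changed = True
    while changed:
        changed = False
        for i, j, _ in rest:
            if (i in side) != (j in side):
                side |= {i, j}
                changed = True
    return sorted(side), sorted(set(classes) - side)

# Four classes with means on a line: {0,1} and {2,3} separate first.
means = [0.0, 1.0, 5.0, 6.0]
dist = [[abs(a - b) for b in means] for a in means]
mst = minimal_spanning_tree(dist)
print(split(range(4), mst))              # -> ([0, 1], [2, 3])
```

Applying `split` recursively to each resulting superclass yields the full map tree.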
6 Discussion

Our classifier is inspired by the prevalence of maps in the vertebrate brain. Its components are two-dimensional map units which implement Parzen-window density estimation, a dimensionality reduction methodology, and a scheme for decomposing a problem so that it can be solved by a system of maps. The training and running costs are favorably low. It admits an easy formal description. The intermediate results of classifier construction, e.g. the density estimates p_k(x) and the partition computed by the classifier, are easily observed by a human user. This allows insight into the structure of the data that is hard to gain from other algorithms. Many of the operations can be implemented with conventional image processing operations (and thus can take advantage of special-purpose image processing hardware). The error rate is comparable to popular parametric, nonparametric, and neural network classifiers. Only a few other workers have considered the role of maps in pattern classification. In particular, Kohonen (1984) has considered iterative algorithms for "self-organizing feature maps." We wish to distinguish his work from ours. In our classifier, the map function comprises a linear projection of a data instance to determine a position in the map followed by a reference to that position. Kohonen maps an instance via a distance computation at each node of his map, followed by a winner-take-all cycle to obtain the nearest neighbor to the instance among all the map nodes. The projection we use is computed with one pass through the training set to compute second-order statistics, which are diagonalized in a step whose cost is related only to the dimensionality of the data, not the number of samples. Kohonen uses a very large number of iterations through the training set. Most importantly, we emphasize the use of multiple maps, which is not considered in (Kohonen 1984).
We could use a Kohonen-type feature map as a module in our classifier (replacing the Fisher discriminant analysis) although we would then sacrifice these advantages. Tree classifiers have been considered at length in Breiman et al. (1984). There are similarities between their classifiers and ours at classification time, although the training algorithms are quite distinct. The principal difference in classifier operation is that we use two-dimensional density estimation at each node, while they use one-dimensional linear discriminants. Their discriminant is typically a threshold comparison of one feature value, although they also describe an iterative technique for obtaining a discriminant from a linear combination of a subset of the feature variables. There are much larger differences in the classifier training algorithms; we present an example of a simple heuristic for generating map trees (based on minimal spanning trees) whereas they examine a large set of possible splits in the data to generate trees. We wish to emphasize that our tree classifier is one possible technique for utilizing multiple maps; examination of alternative approaches is an important research problem.
Perceptual (and probably cognitive) functions of the brain are mediated by laminar cortical systems. Three carefully investigated systems (monkey vision, bat echolocation, and auditory localization in the owl) are committed to multiple two-dimensional spatial maps. The present paper describes the first attempt to construct a pattern classification system which has high performance and which is based on a multiple parallel map-like representation of feature vectors. The algorithms described in this paper allow us to begin to investigate the pattern classification and perceptual performance of such map-based architectures.

Acknowledgments

Supported by AFOSR-88-0275.

References

Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone. 1984. Classification and Regression Trees. Belmont, CA: Wadsworth.
Bryan, J.G. 1951. The Generalized Discriminant Function: Mathematical Foundation and Computation Routine. Harvard Educ. Rev. 21, 90-95.
Duda, R.O. and P.E. Hart. 1973. Pattern Classification and Scene Analysis. New York: Wiley.
Fisher, R.A. 1936. The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugenics 7, 179-188.
Kohonen, T. 1984. Self-organization and Associative Memory. New York: Springer-Verlag.
Oja, E. 1982. A Simplified Neuron Model as a Principal Component Analyzer. J. Math. Biol. 15, 267-273.
Parzen, E. 1962. On Estimation of a Probability Density Function and Mode. Ann. Math. Stat. 33, 1065-1076.
Rumelhart, D.E., G.E. Hinton, and R.J. Williams. 1986. Learning Representations by Back-propagating Errors. Nature 323, 533-536.
Schwartz, E.L., B. Merker, E. Wolfson, and A. Shaw. 1988. Computational Neuroscience: Applications of Computer Graphics and Image Processing to Two and Three Dimensional Modeling of the Functional Architecture of Visual Cortex. IEEE Computer Graphics and Applications, July 1988, 13-28.
Received 1 October; accepted 18 November 1988.
Communicated by David A. Robinson
A Control Systems Model of Smooth Pursuit Eye Movements with Realistic Emergent Properties

R. J. Krauzlis and S. G. Lisberger
Department of Physiology and Neuroscience Graduate Program, University of California, San Francisco, CA 94143, USA
Visual tracking of objects in a noisy environment is a difficult problem that has been solved by the primate oculomotor system, but remains unsolved in robotics. In primates, smooth pursuit eye movements match eye motion to target motion to keep the eye pointed at smoothly moving targets. We have used computer models as a tool to investigate possible computational strategies underlying this behavior. Here, we present a model based upon behavioral data from monkeys. The model emphasizes the variety of visual signals available for pursuit and, in particular, includes a sensitivity to the acceleration of retinal images. The model was designed to replicate the initial eye velocity response observed during pursuit of different target motions. The strength of the model is that it also exhibits a number of emergent properties that are seen in the behavior of both humans and monkeys. This suggests that the elements in the model capture important aspects of the mechanism of visual tracking by the primate smooth pursuit system. 1 Introduction Computer models have advanced our understanding of eye movements by providing a framework in which to test ideas suggested by behavioral and physiological studies. Our knowledge of the smooth pursuit system is at a stage where such models are especially useful. We know that pursuit eye movements are a response to visual motion and that they are used by primates to stabilize the retinal image of small moving targets. Lesion and electrophysiological studies have identified several cortical and subcortical sites that are involved in pursuit (see Lisberger et al. 1987 for review), but the precise relationship between the visual motion signals recorded at these sites and those used by pursuit remains unclear. We present a model that was designed to replicate the monkey's initial eye velocity response as a function of time for different pursuit target motions.
The model emphasizes the variety of visual signals available for pursuit and minimizes the computations done by motor pathways. The structure of the model is based upon behavioral experiments which

Neural Computation 1, 116-122 (1989) © 1989 Massachusetts Institute of Technology
have characterized how different aspects of visual motion act to initiate pursuit. In monkeys, the pursuit system responds not only to the velocity of retinal images, but also to smooth accelerations and to the abrupt accelerations that accompany the onset of target motion (Krauzlis and Lisberger 1987; Lisberger and Westbrook 1985). Therefore, the model includes three parallel pathways that are sensitive to these three aspects of visual motion.

2 Structure of the Model
Our model is drawn in figure 1. Within each pathway, a time-delayed signal related to the motion of retinal images (T − E, called retinal "slip") is processed by a nonlinear gain element derived from our behavioral experiments and a filter. The first pathway is sensitive to slip velocity (e) and its gain is linear. The second and third pathways are both sensitive to slip acceleration (ė), but in different ways. The impulse acceleration pathway is sensitive to the large accelerations that accompany step changes in target velocity, but has a dead zone in the gain element that renders it insensitive to smaller accelerations. The smooth acceleration pathway is sensitive to gradual changes in image velocity. The outputs of the gain elements in each pathway are low pass filtered to produce three signals with different dynamics (Ė_v, Ė_i, Ė_a) that are then summed and integrated to give a command for eye velocity (E′). The integrator makes our model reproduce the fact that the pursuit system interprets visual inputs as commands for eye accelerations (Lisberger et al. 1981). Eye velocity (E) is obtained by passing the eye velocity command through a low pass filter that represents the eye muscles and orbital tissues. The behavior of the model was refined by matching its performance under open-loop conditions to that of the monkey. We removed visual feedback by setting the value of feedback gain to zero, and compared the model's output to averages of the monkey's eye velocity in behavioral trials where visual feedback was electronically eliminated for 200 ms (methods in Morris and Lisberger 1987). First, we stimulated the model with steps in target velocity, which activate the slip velocity and impulse acceleration pathways.
We adjusted the filters in these pathways to obtain the fit in figure 2A, where the rising phase of the model's output (dotted lines) matches the rising phase of the monkey's eye velocity (solid lines) during the open-loop portion of the monkey's response. Then, we stimulated the model with steps in target acceleration, which activate the slip velocity and smooth acceleration pathways, and adjusted the filter in the smooth acceleration pathway (Fig. 2B).
3 Emergent Properties of the Model
We tested the model by restoring visual feedback and providing steps of target velocity in closed-loop conditions. Figure 2C shows that the model matches the rising phase of the monkey's response and makes a realistic transition into steady-state tracking. Like the monkey, it reaches steady-state velocity at later times for higher target speeds. The contribution of each pathway to the initiation of pursuit can be assessed by setting its
Figure 1: Model of smooth pursuit eye movement system. Boxes contain transfer functions expressed in Laplace notation. Abbreviations: T, target velocity; e, slip velocity; ė, slip acceleration; Ė_v, output of slip velocity pathway; Ė_i, output of slip impulse acceleration pathway; Ė_a, output of slip acceleration pathway; Ė′, eye acceleration command; E′, eye velocity command; E, eye velocity. Parameters used: τ = 0.065; T_v = 0.030; T_i = 0.020; T_a = 0.010; T_p = 0.015. Functions for gain elements: slip velocity pathway: y = ax; a = 8.3. Slip impulse acceleration pathway: for x > c, y = a log(bx + 1); a = 17500, b = 0.00015, c = 3000. Slip smooth acceleration pathway: for d > x > e, y = a log(bx + 1); for x < e, y = (cx²) a log(bx + 1); a = 28, b = 0.1, c = 0.0016, d = 500, e = 18.5. Equations given for impulse and smooth acceleration pathways apply for x > 0. For x < 0, equivalent odd functions are used.
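The three-pathway architecture can be illustrated with a rough forward-Euler simulation in Python. Parameter values are taken from the figure 1 caption, but the numerical scheme, the finite-difference estimate of slip acceleration, and the extension of the smooth-acceleration gain beyond x = 500 (which the caption leaves unspecified) are our own simplifications, so this sketches the architecture rather than reproducing the published fits.

```python
import numpy as np

DT = 0.001                                # integration step (s)
DELAY = int(0.065 / DT)                   # visual delay tau = 65 ms
T_V, T_I, T_A, T_P = 0.030, 0.020, 0.010, 0.015

def odd(f):
    """Extend a gain defined for x > 0 to an odd function, per the caption."""
    return lambda x: np.sign(x) * f(abs(x))

g_v = lambda x: 8.3 * x                   # linear slip velocity gain
g_i = odd(lambda x: 17500 * np.log(0.00015 * x + 1) if x > 3000 else 0.0)
g_a = odd(lambda x: 28 * np.log(0.1 * x + 1) if x > 18.5
          else 0.0016 * x ** 2 * 28 * np.log(0.1 * x + 1))

def simulate(target, t_end=1.0):
    n = int(t_end / DT)
    slip = np.zeros(n)                    # retinal slip T - E
    eye = np.zeros(n)                     # eye velocity E
    lp_v = lp_i = lp_a = 0.0              # pathway low-pass filter states
    cmd = 0.0                             # eye velocity command (integrator)
    for k in range(1, n):
        slip[k] = target(k * DT) - eye[k - 1]
        d = k - DELAY
        e_now = slip[d] if d >= 0 else 0.0
        e_prev = slip[d - 1] if d >= 1 else 0.0
        e_dot = (e_now - e_prev) / DT     # delayed slip acceleration
        lp_v += DT / T_V * (g_v(e_now) - lp_v)
        lp_i += DT / T_I * (g_i(e_dot) - lp_i)
        lp_a += DT / T_A * (g_a(e_dot) - lp_a)
        cmd += DT * (lp_v + lp_i + lp_a)  # visual drive commands eye acceleration
        eye[k] = eye[k - 1] + DT / T_P * (cmd - eye[k - 1])  # plant low-pass
    return eye

# Step in target velocity of 15 deg/s at t = 0.1 s, closed loop:
eye = simulate(lambda t: 15.0 if t >= 0.1 else 0.0)
```

With the loop closed, the eye velocity stays at zero until the visual delay elapses, then rises toward the target velocity, with the impulse and smooth acceleration pathways shaping the transient.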
gain to zero, effectively "lesioning" that limb of the model. When the slip impulse acceleration pathway is lesioned, the rising phase is delayed and more sluggish, but the transition to steady-state tracking is unchanged (Fig. 2D, open arrow). This pathway contributes exclusively to the initial 25-50 ms of the response to steps in target velocity and allows the model to reproduce the observation that the earliest component of the pursuit response is sensitive to target direction, but relatively insensitive to target speed (Lisberger and Westbrook 1985). When the smooth acceleration pathway is eliminated, the rising phase is unchanged, but there is a large overshoot in the transition to steady-state tracking (Fig. 2D, filled arrow). Thus, the smooth acceleration pathway normally decelerates the eye as eye velocity approaches target velocity. This is an emergent property of the model, since this pathway was tuned by adjusting its contribution to the acceleration of the eye as shown in figure 2B. The model rings at a relatively high frequency, 5 Hz, in response to steps in target velocity (solid lines in figure 3A). Similar oscillations are seen in the behavior of both humans and monkeys (Goldreich and Lisberger 1987; Robinson et al. 1986). If the model is driven at this resonant frequency, the output lags target velocity by 180 degrees (Fig. 3B, filled arrow), an effect that is also seen in the monkey's behavior (Goldreich and Lisberger 1987). The high frequency properties of the model depend upon the presence of the smooth acceleration pathway. If this pathway is eliminated, the spontaneous oscillations still occur, but now at only 1.6 Hz (Fig. 3A, open arrow), and the phase lag in the response to sinusoidal inputs increases (Fig. 3B, open arrow).

4 Discussion

An important property of our model is that it allows independent control over the initiation and maintenance phases of pursuit.
The rising phase is determined mainly by the slip velocity and impulse acceleration pathways. The steady-state behavior is determined primarily by the smooth acceleration pathway. Since the differentiator in the smooth acceleration pathway introduces a phase lead, the steady-state behavior of the model has a higher frequency response than the rising phase. The exact frequency of ringing depends upon the total delay around the smooth acceleration pathway. For example, if the delay in the visual input is increased by 30 ms, the model will ring at 3.8 Hz, similar to what is normally observed in humans (Robinson et al. 1986) or in monkeys when the delay in visual feedback is increased (Goldreich and Lisberger 1987). The amount of ringing depends upon the gain element in the smooth acceleration pathway. Lowering the gain dampens or eliminates the ringing; increasing the gain produces persistent ringing. Such variations are also seen on individual trials in the monkey's behavior.
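The qualitative dependence of ringing frequency on loop delay can be illustrated with a toy simulation. The sketch below is our own illustration, not the published model: it collapses the visual pathways into a single gain acting on delayed retinal slip, so only the delay-frequency relationship, not the quantitative behavior, carries over.

```python
def simulate(delay_s, gain, target=10.0, dt=0.001, t_end=2.0):
    """Euler simulation of a toy pursuit loop: eye acceleration is
    proportional to retinal slip (target - eye velocity) seen only
    after a visual delay. A sketch, not the authors' model."""
    n_delay = int(round(delay_s / dt))
    slip_hist = [0.0] * n_delay        # slip signals still "in transit"
    eye = 0.0
    trace = []
    for _ in range(int(t_end / dt)):
        slip_hist.append(target - eye)   # current retinal slip
        delayed_slip = slip_hist.pop(0)  # slip from delay_s seconds ago
        eye += gain * delayed_slip * dt  # integrate the acceleration command
        trace.append(eye)
    return trace

def ringing_freq(trace, target=10.0, dt=0.001, skip_s=0.5):
    """Estimate oscillation frequency from zero crossings of the error."""
    err = [e - target for e in trace[int(skip_s / dt):]]
    crossings = sum(1 for a, b in zip(err, err[1:]) if a * b < 0)
    return crossings / (2.0 * len(err) * dt)   # two crossings per cycle
```

With these arbitrary parameters a 65 ms loop delay rings near 4 Hz, and lengthening the delay lowers the frequency, qualitatively matching the delay manipulation described in the text.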
R. J. Krauzlis and S. G. Lisberger
Figure 2: Comparison of the eye velocity output from the model and the monkey. In all panels, dotted lines show the model's output and solid lines show the monkey's eye velocity. A: Steps in target velocity of 5, 15, and 25 d/s under open-loop conditions. The monkey's open-loop response lasts for the first 200 ms of his response. B: Steps in target acceleration of 45, 64, and 120 d/s² under open-loop conditions. C: Steps in target velocity of 5, 15, and 25 d/s under closed-loop conditions. D: Effect of lesioning either acceleration pathway on the model's response to a step in target velocity of 20 d/s. Open arrow, model's response when gain in slip impulse acceleration pathway is set to zero. Closed arrow, model's response when gain in slip smooth acceleration pathway is set to zero.

The strength of our model is that it uses open-loop behavioral data to embody the pursuit system's sensitivity to different aspects of visual motion. Although it was designed to replicate the dynamics of the initiation of pursuit, the model also serendipitously solves a problem noted by Robinson (Robinson et al. 1986), namely, that the rising phase of pursuit is sluggish compared to the frequency of its ringing. The emergence of realistic steady-state properties in the model indicates that the visual elements in its pathways capture important aspects of signal processing within the smooth pursuit system. Our model does not include a sensitivity to position errors, which can affect steady-state tracking in monkeys (Morris and Lisberger 1987). It also does not include the topographic organization of the visual system and therefore cannot reproduce the retinotopic deficits in the initiation of pursuit or the directional deficits
in the maintenance of pursuit which are seen after lesions of cortical areas MT and MST (Dursteler et al. 1987; Newsome et al. 1985). However, it should be possible to embed the dynamics of our model within a topographic structure that would account for these effects.
Figure 3: Ringing and frequency response of the model. A: Solid lines, model's closed-loop response to steps in target velocity of 10 and 20 d/s, viewed on a larger time scale than in figure 2C. Dotted line, model's response to step in target velocity of 10 d/s when gain in smooth acceleration pathway is set to zero. B: Driving the model with sine wave target velocity at 5.0 Hz under closed-loop conditions. Filled arrow, the intact model's response lags target motion by 180 degrees. Open arrow, model's response when gain in smooth acceleration pathway is set to zero. Phase lag is now 325 degrees.
Acknowledgments

This research was supported by NIH Grants EY03878 and EY07058.
References

Dursteler, M.R., R.H. Wurtz, and W.T. Newsome. 1987. Directional Pursuit Deficits Following Lesions of the Foveal Representation within the Superior Temporal Sulcus of the Macaque Monkey. J. Neurophysiol. 57, 1262-1287.
Goldreich, D. and S.G. Lisberger. 1987. Evidence that Visual Inputs Drive Oscillations in Eye Velocity during Smooth Pursuit Eye Movements in the Monkey. Soc. Neurosci. Abstr. 13, 170.
Krauzlis, R.J. and S.G. Lisberger. 1987. Smooth Pursuit Eye Movements are Not Driven Simply by Target Velocity. Soc. Neurosci. Abstr. 13, 170.
Lisberger, S.G., C. Evinger, G.W. Johanson, and A.F. Fuchs. 1981. Relationship between Eye Acceleration and Retinal Image Velocity during Foveal Smooth Pursuit Eye Movements in Man and Monkey. J. Neurophysiol. 46, 229-249.
Lisberger, S.G., E.J. Morris, and L. Tychsen. 1987. Visual Motion Processing and Sensory-motor Integration for Smooth Pursuit Eye Movements. Ann. Rev. Neurosci. 10, 97-129.
Lisberger, S.G. and L.E. Westbrook. 1985. Properties of Visual Inputs that Initiate Horizontal Smooth Pursuit Eye Movements in Monkeys. J. Neurosci. 5, 1662-1673.
Morris, E.J. and S.G. Lisberger. 1987. Different Responses to Small Visual Errors during Initiation and Maintenance of Smooth-pursuit Eye Movements in Monkeys. J. Neurophysiol. 58, 1351-1369.
Newsome, W.T., R.H. Wurtz, M.R. Dursteler, and A. Mikami. 1985. Deficits in Visual Motion Processing Following Ibotenic Acid Lesions of the Middle Temporal Area of the Macaque Monkey. J. Neurosci. 5, 825-840.
Robinson, D.A., J.L. Gordon, and S.E. Gordon. 1986. A Model of the Smooth Pursuit Eye Movement System. Biol. Cybern. 55, 43-57.
Received 15 August; accepted 1 October 1988.
Communicated by Patricia Churchland
The Brain Binds Entities and Events by Multiregional Activation from Convergence Zones Antonio R. Damasio Department of Neurology, Division of Behavioral Neurology and Cognitive Neuroscience, University of Iowa College of Medicine, Iowa City, IA, USA
The experience of reality, in both perception and recall, is spatially and temporally coherent and "in-register." Features are bound in entities, and entities are bound in events. The properties of these entities and events, however, are represented in many different regions of the brain that are widely separated. The degree of neural parcellation is even greater when we consider that the perception of most entities and events also requires a motor interaction on the part of the perceiver (such as eye movements and hand movements) and often includes a recordable modification of the perceiver's somatic state. The question of how the brain achieves integration starting with the bits and pieces it has to work with is the binding problem. Here we propose a new solution for this problem, at the level of neural systems that integrate functional regions of the telencephalon.

1 Introduction
Neural Computation 1, 123-132 (1989) © 1989 Massachusetts Institute of Technology

Data from cognitive psychology, neurophysiology, and neuroanatomy indicate unequivocally that the properties of objects and events that we perceive through various sensory channels engage geographically separate sensory regions of the brain (Posner 1980; Van Essen and Maunsell 1983; Damasio 1985; Livingstone and Hubel 1988). The need to "bind" together the fragmentary representations of visual information has been noted by Treisman and Gelade (1980), Crick (1984), and others, but clearly the problem is a much broader one and includes the need to integrate both the sensory and motor components in both perception and recall, at all scales and at all levels. This broader concept of binding is closer to that of Sejnowski (1986). The traditional and by now untenable solution to the binding problem has been that the components provided by different sensory portals end up being displayed together in so-called multimodal cortices, where the most detailed and integrated representations of reality are achieved. This intuitively reasonable view suggests that perception depends on a unidirectional process which provides a gradual refinement of signal extraction
along a cascade aimed towards integrative cortices in anterior temporal and anterior frontal regions. Some of the most influential accounts for the neural basis of cognition in the post-war period, as well as major discoveries of neurophysiology and neuroanatomy over the past two decades, have seemed compatible with this view. After all, anatomical projections do radiate from primary sensory cortices toward structures in the hippocampus and prefrontal cortices via a multi-stage sequence (Pandya and Kuypers 1969; Jones and Powell 1970; Nauta 1971; Van Hoesen 1982), and the farther away neurons are from primary sensory cortices, the larger their receptive fields become, and the less unimodal their responses are (Desimone and Ungerleider 1989). However, there are several lines of evidence on the basis of which this traditional solution can be rejected.

2 Experimental Evidence
Evidence from Experimental Neuroanatomy: The notion that integration of perceptual or recalled components depends on a single neural meeting ground calls for the identification of a neuroanatomical site that would receive projections from all neural regions involved in the processing of entities and events as they occur in experience. Despite considerable exploration no such region has yet been found. The anterior temporal cortices and the hippocampus do receive projections from multiple sensory areas, but not from motor regions (Van Hoesen 1982). The anterior frontal cortices, the most frequently mentioned candidates for ultimate integration, are even less suited for that role. The sensory and motor streams that reach them remain segregated in different regions (Goldman-Rakic 1988). In other words, there seems to be no structural foundation to support the intuition that temporal and spatial integration occur at a single site. Advances in experimental neuroanatomy have added a new element to neuroanatomical reasoning about this problem: at every stage of the chain of forward cortical projections, there exist prominent projections back to the originating sites. Moreover, the systems are just as rich in multi-stage, reciprocating feedback projections as they are in feedforward projections (Van Hoesen 1982; Van Essen 1985; Livingstone and Hubel 1987). The neuroanatomical networks revealed by these studies allow for both forward convergence of some parallel processing streams, and for the flow of signaling back to points of origin. In the proposal we will describe below, such networks operate as coherent phase-locked loops in which patterns of neural activity in "higher" areas can trigger, enhance, or suppress patterns of activity in "lower" areas. 
Evidence from Experimental Neuropsychology in Humans with Focal Cerebral Lesions: If temporal and frontal integrative cortices were the substrate for the integration of neural activity on which binding depends, the bilateral destruction of those cortices in humans should: (a) preclude the
perception of reality as a coherent multimodal experience and reduce experience to disjointed, modal tracks of sensory or motor processing; (b) reduce the integration of even such modal track processing; and (c) disable memory for any form of past integrated experience and interfere with all levels and types of memory. However, the results of bilateral destruction of the anterior temporal lobes, as well as bilateral destruction of prefrontal cortices, falsify these predictions (see Fig. 1). Coherent perceptual experience is not altered by bilateral damage to the anterior temporal regions, nor does such damage disturb perceptual quality (see Corkin 1984; Damasio et al. 1985; Damasio et al. 1987). Our patient Boswell is a case in point.

Figure 1: Fundamental divisions of human cerebral cortex depicted in a simplified diagram of the external and internal views of the left hemisphere. The motor and premotor cortices include cytoarchitectonic fields 4, 6, and 8. The early and intermediate sensory cortices include the primary visual, auditory, and somatosensory regions (respectively fields 17, 41/42, and 3/1/2), and the surrounding association cortices (fields 18/19, 7, 39, 22, 40, 5). The temporal "integrative" cortices include fields 37, 36, 35, 20, 21, 38, and 28, i.e., neocortical as well as limbic and paralimbic areas. The frontal "integrative" cortices include fields 44, 45, 46, 9, 10, 11, 12, 13, and 25, i.e., prefrontal neocortices as well as limbic cortices.

His extensive, bilateral damage in
anterior temporal cortices and hippocampus disables his memory for unique autobiographical events, but not his ability to perceive the world around him in fully integrated fashion and to recall and recognize the entities and events that he encounters or participates in, at the non-unique level. His binding ability breaks down at the level of unique events, when the integration of extremely complex combinatorial arrangements of entities is required. Bilateral lesions in prefrontal cortices, especially when restricted to the orbitofrontal sector, are also compatible with normal perception and even with normal memory for most entities and events except for those that pertain to the domain of social knowledge (Eslinger and Damasio 1985; Damasio and Tranel 1988). Finally, it is damage to certain sectors of sensory association cortices that can affect both the quality of some aspects of perception within the modality served by those cortices, and recognition and recall. Depending on precisely which region of visual cortex is affected, lesions in early visual association cortices can disrupt perception of shape, or color, or texture, or stereopsis, or spatial placement of the physical components of a stimulus (Damasio 1985; Damasio et al. 1989). A patient may lose the ability to perceive color and yet perceive shape, depth and motion normally. More importantly, damage within some sectors of modal association cortices can disturb recall and recognition of stimuli presented through that modality, even when basic perceptual processing is not compromised. For instance, patients may become unable to recognize familiar faces that they perceive flawlessly (although, intriguingly, they can discriminate familiar from unfamiliar faces at a covert level; Tranel and Damasio 1985; 1988).
The key point is that damage in a posterior and unimodal association cortex can disrupt recall and recognition at virtually every level of the binding chain, from the entity-categorical level to the event-unique level. It can preclude the kind of integrated experience usually attributed to the anterior cortices.

3 A New View on the Binding Problem
The evidence then indicates: (a) that substantial binding, relative to entities or parts thereof, occurs in unimodal cortices and can support recall and recognition at the level of categories; (b) that recall and recognition at category level are generally not impaired by damage confined to anterior integrative cortices, i.e., knowledge recalled at categoric levels depends largely on posterior sensory cortices and interconnected motor cortices; (c) that recall and recognition of knowledge at the level of unique entities or events requires both anterior and posterior sensory cortices, i.e., a more complex network is needed to map uniqueness; anterior integrative structures alone are not sufficient to record and reconstruct unique knowledge.

The implication is that the early and intermediate posterior sensory cortices contain fragmentary records of featural components which can be reactivated, on the basis of appropriate combinatorial arrangements (by fragmentary featural components we mean "parts of entities," at a multiplicity of scales, most notably at feature level, e.g., color, movement, texture, shape and parts thereof). They also contain records of the combinatorial arrangements of features that defined entities ("local" or "entity" binding), but do not contain records of the spatial and temporal relationships assumed by varied entities within an event ("non-local" or "event" binding). The latter records, the complex combinatorial codes needed for event recall, are inscribed in anterior cortices. In this perspective the posterior cortices contain the fragments with which any experience of entities or events can potentially be re-enacted, but only contain the binding mechanism to re-enact knowledge relative to entities. Posterior cortices require binding mechanisms in anterior structures in order to guide the pattern of multiregional activations necessary to reconstitute an event. Thus posterior cortices contain both basic fragments and local binding records and are essential for recreating any past experience. Anterior cortices contain non-local or event-binding records and are only crucial for reconstitution of contextually more complex events. Perhaps the most important distinction between this perspective and the traditional view is that higher-order anterior cortices are seen as repositories of combinatorial codes for inscriptions that lie elsewhere and can be reconstructed elsewhere, rather than being the storage site for the more refined "multimodal" representations of experiences. Although anterior cortices receive multimodal projections, we conceptualize the records they harbor as amodal.
If parts of the representation of an entity are distributed over distant regions of the brain, then mechanisms must be available to bind together the fragments. A proposal for a new solution to the binding problem (Damasio 1989) is illustrated in figure 2 and presented in outline as follows:

1. The neural activity prompted by perceiving the various physical properties of any entity occurs in fragmented fashion and in geographically separate regions located in early sensory cortices and in motor cortices. So-called "integrative" cortices do not contain such fragmentary inscriptions.
2. The integration of multiple aspects of external and internal reality in perceptual or recalled experiences depends on the phase-locked coactivation of geographically separate sites of neural activity within the above-mentioned sensory and motor cortices, rather than on a transfer and spatial integration of different representations towards anterior higher-order cortices. Consciousness of those coactivations depends on their being attended to, i.e., on simultaneous enhancement of a pertinent set of activity against background activity.
Figure 2: Simplified diagram of some aspects of the proposed neural architecture. V, SS, and A depict early and intermediate sensory cortices in visual, somatosensory, and auditory modalities. In each of those sensory sectors, separate functional regions are represented by open and filled dots. Note feedforward projections (black lines) from those regions toward several orders of convergence zones (CZ1, CZ2, CZn), and note also feedback projections from each CZ level toward originating regions (red lines). H depicts the hippocampal system, one of the structures where signals related to a large number of activity sites can converge. Note outputs of H toward the last station of feedforward convergence zones (CZn) and toward noncortical neural stations (NC) in basal forebrain, brain stem, and neurotransmitter nuclei. Feedforward and feedback pathways should not be seen as rigid channels. They are conceived as facilitated lines which become active when concurrent firing in early cortices or CZs takes place. Furthermore, those pathways terminate over neuron ensembles, in distributed fashion, rather than on specific single neurons.

3. The patterns of neural activity that correspond to distinct physical properties of entities are recorded in the same neural ensembles in which they occur during perception, but the combinatorial arrangements (binding codes) that describe their pertinent linkages in entities and in events (their spatial and temporal coincidences) are stored in separate neural ensembles called convergence zones.

4. Convergence zones trigger and synchronize neural activity patterns
corresponding to topographically organized fragment representations of physical structure that were pertinently associated in experience, on the basis of similarity, spatial placement, temporal sequence, or temporal coincidence, or combinations thereof. The triggering and synchronization depend on feedback projections from the convergence zone to multiple cortical regions where fragment records can be activated.
5. Convergence zones are located throughout the telencephalon, at multiple neural levels, in association cortices of different orders, limbic cortices and subcortical limbic nuclei, and non-limbic subcortical nuclei such as the basal ganglia.

6. The geographic location of convergence zones for different entities varies among individuals but is not random. It is constrained by the subject matter of the recorded material (its domain), by the contextual complexity of events (the number of component entities that interact in an event and the relations they adopt), and by the anatomical design of the system. Convergence zones that bind features into entities are located earlier in the processing streams, and convergence zones that bind entities into progressively more complex events are placed gradually more anteriorly in the processing streams.

7. The representations inscribed in the above architecture, both those that preserve topographic/topologic relationships and those that code for combinatorial arrangements, are committed to populations of neuron ensembles and their synapses, in distributed form.

8. The co-occurrence of activities in multiple sites that is necessary for binding conjunctions is achieved by recurrent feedback interactions.
Thus, we propose that the processing does not proceed in a single direction but rather through temporally coherent phase-locking amongst multiple regions. Although the convergence zones that realize the more encompassing integration are placed more anteriorly, it is activity in the more posterior cortical regions that is more directly related to conscious experience. By means of feedback, convergence zones repeatedly return processing to earlier cortices, where activity can proceed again towards the same or other convergence zones. Integration takes place when activations occur within the same time window, in earlier cortices. There is no need to postulate a "final" and single integration area. This model accommodates the segregation of neural processing streams that neuroanatomical and neurophysiological data continue to reveal so consistently, and is compatible with the increase in receptive fields of neurons that occurs in cerebral cortex in the posterior-anterior direction. It accords with the proposal that fewer and fewer neurons placed anteriorly in the system are projected on by structures upstream and thus subtend a broader compass of feed-forwarding regions. Broad receptive field neurons serve as pivots for reciprocating feedback projections rather than as accumulators of the knowledge inscribed at earlier levels. They are intermediaries in a continuous process that centers on early cortices.

4 Conclusions
The problem of how the brain copes with the fragmentary representations of information is central to our understanding of brain function. It is not enough for the brain to analyze the world into its component parts: the brain must bind together those parts that make whole entities and events, both for recognition and recall. Consciousness must necessarily be based on the mechanisms that perform the binding. The hypothesis suggested here is that the binding occurs in multiple regions that are linked together through activation zones; that these regions communicate through feedback pathways to earlier stages of cortical processing where the parts are represented; and that the neural correlates of consciousness should be sought in the phase-locked signals that are used to communicate between these activation zones. Several questions are raised by this new view. For instance, what is the precise nature of the feedback signals that provide temporally coherent phase-locking among multiple regions? How large are the convergence zones in different parts of the brain? How are the decisions made to store an aspect of experience in a particular zone? There are several possible approaches to test the hypothesis proposed here. One approach is to develop new techniques for recording from many neurons simultaneously in communicating brain regions. Another relies on neuropsychological experiments in neurological patients with small focal lesions in key areas of putative networks dedicated to specific cognitive processes. Finally, modeling studies should illuminate the collective properties of convergence zones and provide us with the intuition we need to sharpen our questions.

Acknowledgments

Supported by NINCDS Grant PO1 NS19632.
References

Corkin, S. 1984. Lasting Consequences of Bilateral Medial Temporal Lobectomy: Clinical Course and Experimental Findings in HM. Seminars in Neurology 4, 249-259.
Crick, F. 1984. Function of the Thalamic Reticular Complex: The Searchlight Hypothesis. Proc. Natl. Acad. Sci. USA 81, 4586-4590.
Damasio, A. 1989. Multiregional Retroactivation: A Systems Level Model for Some Neural Substrates of Cognition. Cognition, in press.
Damasio, A. 1985. Disorders of Complex Visual Processing. In: Principles of Behavioral Neurology, ed. M.M. Mesulam, Contemporary Neurology Series, 259-288. Philadelphia: F.A. Davis.
Damasio, A., P. Eslinger, H. Damasio, G.W. Van Hoesen, and S. Cornell. 1985. Multimodal Amnesic Syndrome Following Bilateral Temporal and Basal Forebrain Damage. Archives of Neurology 42, 252-259.
Damasio, A., H. Damasio, D. Tranel, K. Welsh, and J. Brandt. 1987. Additional Neural and Cognitive Evidence in Patient DRB. Society for Neuroscience 13, 1452.
Damasio, A. and D. Tranel. 1988. Domain-specific Amnesia for Social Knowledge. Society for Neuroscience 14, 1289.
Damasio, A.R., H. Damasio, and D. Tranel. 1989. Impairments of Visual Recognition as Clues to the Processes of Memory. In: Signal and Sense: Local and Global Order in Perceptual Maps, eds. G. Edelman, E. Gall, and M. Cowan, Neuroscience Institute Monograph. Wiley and Sons.
Desimone, R. and L. Ungerleider. 1989. Neural Mechanisms of Visual Processing in Monkeys. In: Handbook of Neuropsychology, Disorders of Visual Processing, ed. A. Damasio, in press.
Eslinger, P. and A. Damasio. 1985. Severe Disturbance of Higher Cognition after Bilateral Frontal Lobe Ablation. Neurology 35, 1731-1741.
Goldman-Rakic, P.S. 1988. Topography of Cognition: Parallel Distributed Networks in Primate Association Cortex. Annual Review of Neuroscience 11, 137-156.
Jones, E.G. and T.P.S. Powell. 1970. An Anatomical Study of Converging Sensory Pathways within the Cerebral Cortex of the Monkey. Brain 93, 793-820.
Livingstone, M. and D. Hubel. 1988. Segregation of Form, Color, Movement, and Depth: Anatomy, Physiology, and Perception. Science 240, 740-749.
Livingstone, M. and D. Hubel. 1987. Connections between Layer 4B of Area 17 and Thick Cytochrome Oxidase Stripes of Area 18 in the Squirrel Monkey. Journal of Neuroscience 7, 3371-3377.
Nauta, W.J.H. 1971. The Problem of the Frontal Lobe: A Reinterpretation. J. Psychiat. Res. 8, 167-187.
Pandya, D.N. and H.G.J.M. Kuypers. 1969. Cortico-cortical Connections in the Rhesus Monkey. Brain Res. 13, 13-36.
Posner, M.I. 1980. Orienting of Attention. Quarterly Journal of Experimental Psychology 32, 3-25.
Sejnowski, T.J. 1986. Open Questions about Computation in Cerebral Cortex. In: Parallel Distributed Processing, eds. J.L. McClelland and D.E. Rumelhart, 372-389. Cambridge: MIT Press.
Tranel, D. and A. Damasio. 1988. Nonconscious Face Recognition in Patients with Face Agnosia. Behavioral Brain Research 30, 235-249.
Tranel, D. and A. Damasio. 1985. Knowledge without Awareness: An Autonomic Index of Facial Recognition by Prosopagnosics. Science 228, 1453-1454.
Treisman, A. and G. Gelade. 1980. A Feature-integration Theory of Attention. Cognitive Psychology 12, 97-136.
Van Essen, D.C. 1985. Functional Organization of Primate Visual Cortex. In: Cerebral Cortex, eds. A. Peters and E.G. Jones, 259-329. Plenum Publishing.
Van Essen, D.C. and J.H.R. Maunsell. 1983. Hierarchical Organization and Functional Streams in the Visual Cortex. Trends in Neuroscience 6, 370-375.
Van Hoesen, G.W. 1982. The Primate Parahippocampal Gyrus: New Insights Regarding its Cortical Connections. Trends in Neurosciences 5, 345-350.
Received 18 November; accepted 25 November 1988.
Communicated by David Touretzky
Product Units: A Computationally Powerful and Biologically Plausible Extension to Backpropagation Networks Richard Durbin David E. Rumelhart Department of Psychology, Stanford University, Stanford, CA 94305, USA
We introduce a new form of computational unit for feedforward learning networks of the backpropagation type. Instead of calculating a weighted sum this unit calculates a weighted product, where each input is raised to a power determined by a variable weight. Such a unit can learn an arbitrary polynomial term, which would then feed into higher level standard summing units. We show how learning operates with product units, provide examples to show their efficiency for various types of problems, and argue that they naturally extend the family of theoretical feedforward net structures. There is a plausible neurobiological interpretation for one interesting configuration of product and summing units.

1 Introduction
The success of multilayer networks based on generalized linear threshold units depends on the fact that many real-world problems can be well modeled by discriminations based on linear combinations of the input variables. What about problems for which this is not so? It is clear that for some tasks higher order combinations of some of the inputs, or ratios of inputs, may be appropriate to help form a good representation for solving the problem (for example cross-correlation terms can give translational invariance). This observation led to the proposal of "sigma-pi units" which apply a weight not only to each input, but also to all second and possibly higher order products of inputs (Rumelhart, Hinton, and McClelland; Maxwell et al. 1987). The weighted sum of all these terms is then passed through a non-linear thresholding function. The problem with sigma-pi units is that the number of terms, and therefore weights, increases very rapidly with the number of inputs, and becomes unacceptably large for use in many situations. Normally only one or a few of the non-linear terms are relevant. We therefore propose a different type of unit, which represents a single higher order term, but learns which one to represent. The output of this unit, which we will call a product unit, is

y = ∏_{i=1}^{N} x_i^{p_i}    (1)

Neural Computation 1, 133-142 (1989) © 1989 Massachusetts Institute of Technology
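To make the product unit concrete, here is a minimal sketch of its forward computation (our own illustration; the function name is hypothetical, and positive inputs are assumed — negative inputs require the complex-domain treatment discussed in section 2):

```python
def product_unit(x, p):
    """Output of a single product unit: y = x_1^p_1 * x_2^p_2 * ... * x_N^p_N.
    Each input is raised to a power given by its (learnable) weight,
    and the results are multiplied together."""
    y = 1.0
    for xi, pi in zip(x, p):
        y *= xi ** pi
    return y
```

Integer weights select a polynomial term (p = [2, 1] gives x1² · x2), while negative weights permit ratios (p = [1, -1] gives x1/x2).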
Figure 1: Two suggested forms of possible network incorporating product units. Product units are shown with a Π and summing units with a Σ. (a) Each summing unit gets direct connections from the input units, and also from a group of dedicated product units. (b) There are alternating layers of product and summing units, finishing with a summing unit. The output of all our summing units was squashed using the standard logistic function, 1/(1 + e^(-x)); no non-linear function was applied to the output from product units.
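A sketch of the figure 1b arrangement in our own hypothetical code: a hidden layer of product units feeds a logistic summing unit, and, as in the caption, only the summing unit's output is squashed.

```python
import math

def logistic(z):
    # standard logistic squashing function, 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def fig1b_net(x, product_exponents, summing_weights, bias=0.0):
    """One product layer feeding one logistic summing unit (Fig. 1b sketch).
    product_exponents holds one exponent vector per product unit; no
    squashing is applied to the product units' outputs. Positive inputs
    are assumed here."""
    hidden = []
    for p in product_exponents:
        h = 1.0
        for xi, pi in zip(x, p):
            h *= xi ** pi
        hidden.append(h)
    z = bias + sum(w * h for w, h in zip(summing_weights, hidden))
    return logistic(z)
```

For example, a single product unit with exponents [1, 1] computes x1 · x2, and the summing unit then squashes that product into (0, 1).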
We will treat the p_i in the same way as variable weights, training them by gradient descent on the output sum square error. In fact such units provide much more generality than just allowing polynomial terms, since the p_i can take fractional and negative values, permitting ratios. However, simple products can still be represented by setting the p_i to zero or one. Related types of units were previously considered by Hanson and Burr (1987). There are various ways in which product units could be used in a network. One way is for a few of them to be made available as inputs to a standard thresholded summing unit in addition to the original raw inputs, so that the output can now consider some polynomial terms (Fig. 1a). This approach has a direct neurobiological interpretation (see the discussion). Alternatively there could be a whole hidden layer of product units feeding into a subsequent layer of summing units (Fig. 1b). We do not envision product units replacing summing units altogether; the attractions are rather in mixing them, particularly in alternating layers so that we can form weighted sums of arbitrary products. This is analogous to alternating disjunctive and conjunctive layers in general forms for logical functions.

2 Theory
In order to discuss the equations governing learning in product units it is convenient to rewrite equation 1.1 in terms of exponentials and logarithms:

Product Units for Backpropagation Networks

y = \exp\left( \sum_{i=1}^{N} p_i \ln x_i \right)   (2.1)

In this form we can see that a product unit acts like a summing unit whose inputs are preprocessed by taking logarithms, and whose output is passed through an exponential, rather than a squashing function. If x_i is negative then \ln x_i = \ln|x_i| + \mathrm{i}\pi, which is complex, and so equation (2.1) becomes
y = \exp\left( \sum_{i=1}^{N} p_i \ln|x_i| \right) \exp\left( \mathrm{i}\pi \sum_{i:\, x_i < 0} p_i \right)   (2.2)

We want to be able to consider negative inputs because the non-linear characteristics of product units, which we want to use computationally, are centered on the origin. There are two main alternatives for dealing with the resulting complex-valued expressions. One is to handle the whole network in the complex domain, and at the end fit the real component to the data (either ignoring the complex component or fitting it to 0). The other is to keep the system in the real domain by ignoring the imaginary component of the output from each product unit, restricting us to real-valued weights. For most problems the latter seems preferable. In the case where all the exponents p_i are integral, as with a true polynomial term, the approximation of ignoring the imaginary component is exact. Given this, it can be viewed that we are extending the space of polynomial terms to fractional exponents in a well behaved fashion, so as to permit smooth learning of the exponents. Additionally, in simulations we seem to gain nothing for the added complexity of working in the complex domain (it doubles the number of equations and weight variables). On the other hand, for some physical problems it may be appropriate to consider complex-valued networks. In order to train the weights by gradient descent we need to be able to calculate two sets of derivatives for each unit. First we need the derivative of the output y with respect to each weight p_i so as to be able to update the weights. Second we need the derivative with respect to each input x_i so as to be able to propagate the error back to previous layers using the chain rule. Let us set I_i equal to 1 if x_i is negative, otherwise 0, and define U, V by

U = \sum_{i=1}^{N} p_i \ln|x_i|, \qquad V = \sum_{i=1}^{N} p_i I_i

Then the equations we need for the real-valued version are

y = e^{U} \cos \pi V   (2.3)
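Equations 2.1 to 2.3 translate directly into code. The sketch below (function and variable names are ours, not the paper's) computes the real-valued product unit output, assuming every input is nonzero so that ln|x_i| is defined:

```python
import math

def product_unit(x, p):
    """Real-valued product unit output (eq. 2.3): y = exp(U) * cos(pi * V),
    with U = sum_i p_i * ln|x_i| and V = sum_i p_i * I_i,
    where I_i = 1 if x_i < 0 and 0 otherwise.  Assumes all x_i != 0."""
    U = sum(pi * math.log(abs(xi)) for xi, pi in zip(x, p))
    V = sum(pi for xi, pi in zip(x, p) if xi < 0)
    return math.exp(U) * math.cos(math.pi * V)
```

For integer exponents the discarded imaginary component is exactly zero, so the output equals the true product: product_unit([-2.0, 3.0], [2.0, 1.0]) evaluates (to floating-point accuracy) to (-2)^2 * 3 = 12, while fractional exponents such as product_unit([4.0], [0.5]) give smooth roots.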
It is possible to add an extra constant input to a product unit, corresponding to the bias for a summing unit. In this case the appropriate constant is -1, since a positive value would simply multiply the output by a scalar, which is irrelevant when there is a variable multiplicative weight from the output to a higher level summing unit. Although this multiplicative bias is often eventually redundant, we have found it to be important during the learning process for some tasks, such as the symmetry task (see below and Fig. 2). One property that we have lost with product units is invariance under transformation of the input space: they are vulnerable to translation and rotation, in the sense that a learnable problem may no longer be learnable after translation. Summing units with a threshold are not vulnerable to such transformations. If desired, we can regain translational robustness by introducing new parameters μ_i to allow an explicit change of origin. This would replace x_i by (x_i - μ_i) in all the above equations. We can once again learn the μ_i by gradient descent. With the μ_i present a product unit can approximate a linear threshold unit arbitrarily closely, by working on only a small region of the exponential function. Alternatively, we can notice that the rotational and translational vulnerability of single product units is in part compensated for if a number of them are being used in parallel, which will often be the case. This is because a single product transforms to a set of products in a rotated and translated space. In any case, there may be some benefit to the asymmetry of a product unit's capabilities under affine transformation of the input space. For non-geometric sets of input variables this type of extra computational power may well be useful.

3 Results
Many of the problems that are studied using networks use Boolean input. For product units it is best to use Boolean values -1 and 1, in which case the exponential terms in equation (2.3) disappear, and the units behave like cosine summing units with 1 and 0 inputs. Examples of the use of product units for learning Boolean functions are provided by networks that learn the parity and symmetry functions. These functions are hard to learn using summing units: the parity function requires as many hidden units as inputs, while symmetry requires 2 hidden units, but often gets stuck in a local minimum unless more are given. Both functions are learned rapidly using a single product hidden unit (Fig. 2a,b). A good example of a problem that multilayer nets with product units find good solutions for is the multiplexing task shown in figure 2c. Here two of the inputs code which of the four remaining inputs to output. This task has a biological interpretation as an attentional mechanism, and is therefore relevant for computational models of sensory information processing. Indeed, the neurobiological interpretation of just the type of hybrid net
Figure 2: Examples of product unit networks that solve "hard" binary problems. In each case there is a standard thresholded summing output unit (Σ) and one or more "hidden" product units (Π). The weight values are shown by each arrow, and there is also a constant bias value shown inside each unit's circle. Product unit biases can be considered to have constant -1 input (see text). In each case the network was found by training from data. (a) Parity. The output is 1 if an even number of inputs is on, 0 if an odd number is on. (b) Symmetry. The output is 1 if the input pattern is mirror symmetric (as shown here), 0 otherwise. For summing unit network solutions to the symmetry and parity problems see (Rumelhart, Hinton, and Williams 1986). (c) Multiplexer. Here the values of the two lefthand input units encode in binary fashion which of the four right hand inputs is transmitted to the output. Examples are shown. Where there is a dot the value of the input unit (1 or -1) is irrelevant. An 'x' stands for either 1 or -1.
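The single-product-unit parity solution of Figure 2a can be checked numerically: with inputs coded as -1 and 1, every ln|x_i| term vanishes, so a product unit with all exponents equal to 1 outputs cos(pi * V) = prod_i x_i. A sketch (our own naming, with a direct threshold in place of the figure's trained output weights):

```python
import math
from itertools import product as patterns

def product_unit(x, p):
    # y = exp(U) * cos(pi * V)  (eq. 2.3); U = 0 when every |x_i| = 1
    U = sum(pi * math.log(abs(xi)) for xi, pi in zip(x, p))
    V = sum(pi for xi, pi in zip(x, p) if xi < 0)
    return math.exp(U) * math.cos(math.pi * V)

def parity(x):
    """1 if an even number of inputs are -1, else 0."""
    return 1 if product_unit(x, [1.0] * len(x)) > 0 else 0

# check against direct counting on all 4-input patterns
for x in patterns([-1.0, 1.0], repeat=4):
    assert parity(list(x)) == (1 if sum(v < 0 for v in x) % 2 == 0 else 0)
```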
used here (see below) suggests a substrate and mechanism for attentional processes in the brain. We can measure the informational capacity of a unit by the number of random Boolean patterns that it can learn (more precisely, the number at which the probability of storing them all perfectly drops to a half;
[Table 1: storage-success percentages for net structures "6 1", "12 1", "6 2 1", and "6 2 1 fixed output", for M between 12 and 40 random vectors, with one column for product units and one for summing units; see the caption below.]
Table 1: Results on storage of random data. The number of successful storage attempts in 100 trials is shown in the last two columns for various net structures and numbers of vectors, M. Storage is termed successful if all input vectors produce output on the correct side of 0.5. Input vectors were random, x_i = -1 or 1, and output values for each vector were random 0 or 1. The "6 1" and "12 1" nets had a single learning unit with 6 or 12 inputs. For these comparisons the output of a product unit was passed through the standard summing unit squashing function, e^u/(1 + e^u). The single summing units do not attain the M = 2N theoretical limit (Cover 1965), presumably because the squashing function output creates local minima not present for a simple perceptron. The "6 2 1" nets had 2 hidden units (either product or summing) and one summing output unit, which was trainable for the first set of results, and fixed with all weights equal for the second set. These results indicate that storage capacity for product units is at least 3 bits per weight, as opposed to no more than 2 bits per weight for summing units, and that fixed output units do not drastically reduce computational power in multilayer networks.

Mitchison and Durbin 1989). For a single summing unit with N inputs the capacity can be shown theoretically to be 2N (Cover 1965). The empirical capacity of a single product unit is significantly higher than this at around 3N (table 1). The relative improvement is maintained in a comparison of multilayer networks with product hidden units compared with ones consisting purely of summing units (table 1), indicating that product units cooperate well with summing units. We can also consider the performance of product units when the inputs are real valued. An example is the ability of a network with two
Figure 3: Performance on a task with real-valued input: learning a circular domain. In each case the network and a plot of its response function over the range -2.0 to 2.0 are shown. A variety of local minima were found using two product unit networks (which in theory could solve the problem exactly), whereas there was only one solution found using two summing units. Although non-optimal, the product unit solutions were always better than the summing unit solutions. (a) The ideal product unit network, used to generate the data. (b) An example of a good empirical solution with product units (2% misclassified, MSE 0.03). (c) An example of a poor product unit local minimum (13% misclassified, MSE 0.10). (d) The solution essentially always obtained with two summing hidden units (38% misclassified, MSE 0.24).

product hidden units to learn to respond to a circular region around the origin. In fact it appears that there are many local minima for this problem, and although the network occasionally finds the "correct" solution (Fig. 3a), it more often finds other solutions such as those shown in figure 3b,c. However these solutions are not bad: the average mean square error (MSE) for product unit networks is 0.09 (average of 10) with 88% correct data classification (whether the output is the correct side of 0.5), whereas the best corresponding summing unit network gives an MSE of 0.24, and only 62% correct classification.
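The ideal network of Figure 3a can be realized as two product units computing x_1^2 and x_2^2 feeding a logistic summing unit; the exponents, gain, and radius below are our own illustrative choices, not the values printed in the figure:

```python
import math

def product_unit(x, p):
    # eq. 2.3; terms with a zero exponent are skipped (inputs with a
    # nonzero exponent are assumed nonzero so the logarithm is defined)
    U = sum(pi * math.log(abs(xi)) for xi, pi in zip(x, p) if pi != 0.0)
    V = sum(pi for xi, pi in zip(x, p) if pi != 0.0 and xi < 0)
    return math.exp(U) * math.cos(math.pi * V)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def in_circle(x1, x2, radius=1.0, gain=10.0):
    """Output near 1 inside the circle of the given radius, near 0 outside."""
    h1 = product_unit([x1, x2], [2.0, 0.0])  # x1 squared
    h2 = product_unit([x1, x2], [0.0, 2.0])  # x2 squared
    return sigmoid(gain * (radius ** 2 - h1 - h2))
```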
4 Discussion
We have proposed a new type of computational unit to be used in layered networks along with standard thresholded summing units. The underlying idea behind this unit is that it can learn to represent any generalized polynomial term in the inputs. It can therefore help to form a better representation of the data in cases where higher order combinations of the inputs are significant. Unlike sigma-pi units, which to some extent perform the same task, product units do not increase the number of free parameters, since there is only one weight per input, as with summing units. Although we have been unable to prove that product units are guaranteed to learn a learnable task, as can be shown for a single simplified summing unit (Rosenblatt 1962), we have shown that product units can be trained efficiently using gradient descent, and allow much simpler solutions of various standard learning problems. In addition, as isolated units they have a higher empirical learning capacity than summing units, and they act efficiently to create a hidden layer representation for an output summing unit (table 1).

There is a natural neurobiological interpretation for this type of combination of product and summing units in terms of a single neuron. Local regions of dendritic arbor could act as product units whose outputs are summed at the soma. Equation (2.1) shows that a product unit acts like a summing unit with an exponential output function, whose inputs are preprocessed by passing them through a log function. Both these transfer functions are realistic. When there are voltage sensitive dendritic channels, such as NMDA receptors, the post-synaptic voltage response is qualitatively exponential around a critical voltage level (Collingridge and Bliss 1987); an effect that will be influenced by other local input apart from the specific input at the synapse. Presynaptically, there are saturation effects giving an approximately logarithmic form to the voltage dependency of transmitter release.
In fact just these features have been presented as problems with the standard thresholded summing model of neurons. Standard summing inputs could still be made using neurotransmitters that do not stimulate voltage sensitive channels. As far as learning in the biological model is concerned, it is acceptable that the second layer summing weights, corresponding to the degree of influence of dendritic regions at the soma, are not very variable. Systems with fixed summing output layers are nearly as computationally powerful as fully variable ones, both in theory (Mitchison and Durbin 1989) and in simulations (table 1). Learning at the input synapses is still essentially Hebbian (equation 2.2b), with an additional term when the input x_i is negative. Although the periodic form of this term appears unbiological, some type of additional term is not unreasonable for inhibitory input, which may well have different learning characteristics. Alternatively, it might be that the learning model only applies to excitatory input. Further consideration of this neurobiological model is required, but it seems likely that this
approach will lead to a plausible new computational model of a neuron that is potentially much more powerful than the standard McCulloch-Pitts model.

One possible criticism of introducing a new type of unit is that it is trivially going to improve the representational capabilities of the networks: one can always improve a fit to data by making a model more complex, and this is rarely worth the price of throwing away elegance. The defence to this must be that the extension is in some sense natural, which we believe that it is. Product units provide the continuous analogy to general Boolean conjunctions in the same way that summing units are continuous analogs of Boolean disjunctions (although both continuous forms are much more powerful, sufficiently so that either can represent any arbitrary disjunction or conjunction on Boolean input). In fact many of the proofs of capabilities of networks to perform general tasks rely on the "abuse" of thresholded summing units to perform multiplicative or conjunctive tasks, often in alternating layers with units being used in an additive or disjunctive fashion. Such proofs will be much simpler for networks with both product and summing units, indicating that such networks are more appropriate for finding simple models of general data.

It might be argued that in opening up such generality the special properties of learning networks will be lost, because they no longer provide strong constraints on the type of model that is created. We feel that this misses the point. The real justification for layered network models appears when a number of different output functions are fit to some set of data. By using a layered model the fit of each function influences and constrains the fit of all the others. If there is some underlying natural representation this will be modeled by the intermediate layers, since it will be appropriate for all the output functions.
This cross-constraining of learning is not easily available in many other systems, which therefore miss out on a vast amount of data that is relevant, although indirectly so. Product units provide a natural extension of the use of summing units in this framework.
Acknowledgments
R.M.D. is a Lucille P. Markey Visiting Fellow at Stanford University. We thank T.J. Sejnowski for pointing out the neurobiological interpretation.

References

Collingridge, G.L. and T.V.P. Bliss. 1987. NMDA receptors - their role in long-term potentiation. Trends Neurosci. 10, 288-293.
Cover, T. 1965. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Trans. Elect. Comp. 14, 326-334.
Hanson, S.J. and D.J. Burr. 1987. Knowledge Representation in Connectionist Networks. Technical Report, Bell Communications Research, Morristown, NJ.
Maxwell, T., C.G. Giles, and Y.C. Lee. 1987. Generalization in Neural Networks, the Contiguity Problem. In: Proceedings IEEE First International Conference on Neural Networks 2, 41-45.
Mitchison, G.J. and R.M. Durbin. 1988. Bounds on the Learning Capacity of Some Multilayer Networks. Biological Cybernetics, in press.
Rosenblatt, F. 1962. Principles of Neurodynamics. New York: Spartan.
Rumelhart, D.E., G.E. Hinton, and J.L. McClelland. 1986. A General Framework for Parallel Distributed Processing. In: Parallel Distributed Processing 1, 45-76. Cambridge, MA, and London: MIT Press.
Rumelhart, D.E., G.E. Hinton, and R.J. Williams. 1986. Learning Internal Representations by Error Propagation. In: Parallel Distributed Processing 1, 318-362. Cambridge, MA, and London: MIT Press.
Received 11 November; accepted 17 December 1988.
Communicated by Scott Kirkpatrick
Deterministic Boltzmann Learning Performs Steepest Descent in Weight-Space

Geoffrey E. Hinton
Department of Computer Science, University of Toronto, 10 King's College Road, Toronto M5S 1A4, Canada
The Boltzmann machine learning procedure has been successfully applied in deterministic networks of analog units that use a mean field approximation to efficiently simulate a truly stochastic system (Peterson and Anderson 1987). This type of "deterministic Boltzmann machine" (DBM) learns much faster than the equivalent "stochastic Boltzmann machine" (SBM), but since the learning procedure for DBMs is only based on an analogy with SBMs, there is no existing proof that it performs gradient descent in any function, and it has only been justified by simulations. By using the appropriate interpretation for the way in which a DBM represents the probability of an output vector given an input vector, it is shown that the DBM performs steepest descent in the same function as the original SBM, except at rare discontinuities. A very simple way of forcing the weights to become symmetrical is also described, and this makes the DBM more biologically plausible than back-propagation (Werbos 1974; Parker 1985; Rumelhart et al. 1986).

1 Introduction
The promising results obtained by Peterson and Anderson (1987) using a DBM are hard to assess because they present no mathematical guarantee that the learning does gradient descent in any error function (except in the limiting case of a very large net with small random weights). It is quite conceivable that in a DBM the computed gradient might have a small systematic difference from the true gradient of the normal performance measure for each training case, and when these slightly incorrect gradients are added together over many cases their resultant might bear little relation to the resultant of the true casewise gradients (see Fig. 1).

2 The Learning Procedure for Stochastic Boltzmann Machines
A Boltzmann machine (Hinton and Sejnowski 1986) is a network of symmetrically connected binary units that asynchronously update their states

Neural Computation 1, 143-150 (1989) © 1989 Massachusetts Institute of Technology
according to a stochastic decision rule. The units have states of 1 or 0 and the probability that unit i adopts the state 1 is given by

p_i = \sigma\left( \frac{1}{T} \sum_j s_j w_{ij} \right)   (2.1)

where s_j is the state of the j-th unit, w_{ij} is the weight on the connection between the j-th and the i-th unit, T is the "temperature" and \sigma is a smooth non-linear function defined as

\sigma(x) = \frac{1}{1 + e^{-x}}   (2.2)
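Equations 2.1 and 2.2 describe one step of a stochastic (Gibbs-sampling) update; a minimal sketch (function and variable names are ours), updating a single unit in place:

```python
import math, random

def stochastic_update(s, w, i, T=1.0):
    """One asynchronous update of binary unit i (eqs. 2.1 and 2.2):
    turn the unit on with probability sigma((1/T) * sum_j s_j * w[i][j])."""
    net = sum(s[j] * w[i][j] for j in range(len(s)) if j != i)
    p_i = 1.0 / (1.0 + math.exp(-net / T))
    s[i] = 1 if random.random() < p_i else 0
    return s[i]
```

Repeated sweeps of such updates over all unclamped units bring the network toward thermal equilibrium at temperature T.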
If the binary states of units are updated asynchronously and repeatedly using equation 2.1, the network will reach "thermal equilibrium" so that the relative probabilities of global configurations are determined by their energies according to the Boltzmann distribution:

\frac{P_\alpha}{P_\beta} = e^{-(E_\alpha - E_\beta)/T}   (2.3)

where P_\alpha is the probability of a global configuration and E_\alpha is its energy defined by

E_\alpha = -\sum_{i<j} s_i^\alpha s_j^\alpha w_{ij}   (2.4)

where s_i^\alpha is the binary state of unit i in the \alpha-th global configuration, and bias terms are ignored because they can always be treated as weights on connections from a permanently active unit. At any given temperature, T, the Boltzmann distribution is the one that minimizes the Helmholtz free energy, F, of the distribution. F is defined by the equation
Figure 1: The true gradients of the performance measure are a and b for two training cases. Even fairly accurate estimates, â and b̂, can have a resultant that points in a very different direction.
F = \langle E \rangle - T H   (2.5)
where \langle E \rangle is the expected value of the energy given the probability distribution over configurations and H is the entropy of the distribution. It can be shown that minima of F (which will be denoted by F*) satisfy the equation

F^* = -T \ln \sum_\alpha e^{-E_\alpha / T}   (2.6)
In a stochastic Boltzmann machine, the probability of an output vector, O_\beta, given an input vector, I_\alpha, is represented by

P^-(O_\beta \mid I_\alpha) = e^{-(F^*_{\alpha\beta} - F^*_\alpha)/T}   (2.7)
where F^*_{\alpha\beta} is the minimum free energy with I_\alpha and O_\beta clamped, and F^*_\alpha is the minimum free energy with just I_\alpha clamped. A very natural way to observe P^-(O_\beta \mid I_\alpha) is to allow the network to reach thermal equilibrium with I_\alpha clamped, and to observe the probability of O_\beta. The key to Boltzmann machine learning is the simple way in which a small change to a weight, w_{ij}, affects the free energy and hence the log probability of an output vector in a network at thermal equilibrium:

\frac{\partial F^*}{\partial w_{ij}} = -\langle s_i s_j \rangle   (2.8)
where \langle s_i s_j \rangle is the expected value of s_i s_j in the minimum free energy distribution. The simple relationship between weight changes and log probabilities of output vectors makes it easy to teach the network an input-output mapping. The network is "shown" the mapping that it is required to perform by clamping an input vector on the input units and clamping the required output vector on the output units (with the appropriate conditional probability). It is then allowed to reach thermal equilibrium at T = 1, and at equilibrium each connection measures how often the units it connects are simultaneously active. This is repeated for all input-output pairs so that each connection can measure \langle s_i s_j \rangle^+, the expected probability, averaged over all cases, that unit i and unit j are simultaneously active at thermal equilibrium when the input and output vectors are both clamped. The network must also be run in just the same way but without clamping the output units to measure \langle s_i s_j \rangle^-, the expected probability that both units are active at thermal equilibrium when the output vector is determined by the network. Each weight is then updated by

\Delta w_{ij} = \epsilon \left( \langle s_i s_j \rangle^+ - \langle s_i s_j \rangle^- \right)   (2.9)
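The two-phase weight update is a one-line rule once the correlations have been measured; a sketch (our own naming), taking the clamped and free-running correlation matrices as already estimated at equilibrium:

```python
def boltzmann_weight_update(w, corr_clamped, corr_free, eps=0.1):
    """Eq. 2.9: delta w_ij = eps * (<s_i s_j>+ - <s_i s_j>-), where the
    correlations are measured with the outputs clamped (+) and free (-)."""
    n = len(w)
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] += eps * (corr_clamped[i][j] - corr_free[i][j])
    return w
```

Because the same increment is applied to w_ij and w_ji whenever the correlation estimates are symmetric, an initially symmetric weight matrix stays symmetric under this rule.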
It follows from equation 2.7 and equation 2.8 that if \epsilon is sufficiently small this performs steepest descent in an information theoretic measure, G, of the difference between the behavior of the output units when they are clamped and their behavior when they are not clamped:

G = \sum_\alpha \sum_\beta P^+(I_\alpha, O_\beta) \ln \frac{P^+(O_\beta \mid I_\alpha)}{P^-(O_\beta \mid I_\alpha)}   (2.10)
where I_\alpha is a state vector over the input units, O_\beta is a state vector over the output units, P^+ is a probability measured at thermal equilibrium when both the input and output units are clamped, and P^- is a probability measured when only the input units are clamped. Stochastic Boltzmann machines learn slowly, partly because of the time required to reach thermal equilibrium and partly because the learning is driven by the difference between two noisy variables, so these variables must be sampled for a long time at thermal equilibrium to reduce the noise. If we could achieve the same simple relationships between log probabilities and weights in a deterministic system, learning would be much faster.

3 Mean field theory
Under certain conditions, a stochastic system can be approximated by a deterministic one by replacing the stochastic binary variables of equation 2.1 by deterministic real-valued variables that represent their mean values:

p_i = \sigma\left( \frac{1}{T} \sum_j p_j w_{ij} \right)   (3.1)

We could now perform discrete, asynchronous updates of the p_i using equation 3.1, or we could use a synchronous, discrete time approximation of the set of differential equations

\frac{dp_i}{dt} = -p_i + \sigma\left( \frac{1}{T} \sum_j p_j w_{ij} \right)   (3.2)
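The deterministic machine replaces equilibrium sampling with repeated application of equation 3.1 until the p_i settle; a minimal sketch (our own naming, with a fixed iteration count in place of a convergence test):

```python
import math

def mean_field_settle(p, w, T=1.0, steps=200):
    """Iterate eq. 3.1, p_i = sigma((1/T) * sum_j p_j * w[i][j]),
    asynchronously until the p_i approach a fixed point."""
    n = len(p)
    for _ in range(steps):
        for i in range(n):
            net = sum(p[j] * w[i][j] for j in range(n) if j != i)
            p[i] = 1.0 / (1.0 + math.exp(-net / T))
    return p
```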
We shall view the p_i as a representation of a probability distribution over all binary global configurations. Since many different distributions can give rise to the same mean values for the p_i, we shall assume that the distribution being represented is the one that maximizes the entropy, subject to the constraints imposed on the mean values by the p_i. Equivalently, it is the distribution in which the p_i are treated as the mean values of independent stochastic binary variables. Using equation 2.5 we can calculate the free energy of the distribution represented by the state of a DBM (at T = 1):
F = -\sum_{i<j} w_{ij} p_i p_j + \sum_i \left[ p_i \ln p_i + (1 - p_i) \ln(1 - p_i) \right]   (3.3)

The network of locally-tuned units (Fig. 1a) has between 100 and 10,000 internal units arranged in a single layer, while the backpropagation network (Fig. 1b) has two internal layers each containing 20 sigmoidal units. The backpropagation network thus has 541 adjustable parameters (weights and thresholds) total.
Fast Learning in Networks of Locally-Tuned Processing Units
Figure 5 contrasts the prediction accuracy E (Normalized Prediction Error) versus number of internal units for three versions of our algorithm to the backpropagation benchmark (A) of Lapedes and Farber (1987). The three versions of the learning algorithm are:

1. Nearest neighbor prediction. Here, the nearest data point in the training set is used as a predictor. This behavior is actually a special case of the network of equation 1.4 where each input/output training pair {x^α, f^α} defines a processing unit of infinitely narrow width (σ^α → 0).

2. Adaptive processing units with one unit per data point. Here, the amplitudes are determined by LMS, the widths by the global first
Figure 3: Adaptive k-means clustering applied to phoneme data (abscissa: frequency of first formant).
John Moody and Christian J. Darken
Figure 4: One thousand successive integer timesteps for the Mackey-Glass chaotic time series with delay parameter τ = 17 (abscissa: time, 0 to 1000).
nearest neighbors heuristic, and the centers are fixed to be training data vectors.

3. Self-organizing, adaptive processing units. Similar to 2, but the training set has ten times more exemplars than the network has processing units and the processing unit centers are found using k-means clustering.

The backpropagation benchmark used a training set with 500 exemplars. For all methods, prediction accuracies were measured on a 500-member test set. Note that versions 1 and 2 require storage of past time series data; version 2 assigns a processing unit to each data point. Hence, neither of these methods is appropriate for real-time signal processing with fixed memory. However, version 3 is fully adaptive in that a fixed set of network parameters can be varied in response to new data in real time and does not, in principle, require storage of previous data. In order
to make a fair comparison to the backpropagation benchmark, however, we optimized our networks on a fixed training set, rather than measure time-averaged, real-time performance. As is apparent from figure 5 (note horizontal reference line), method 2 achieves a prediction accuracy equivalent to backpropagation with about 7 times as much training data, while method 3 requires about 27 times
Figure 5: Comparison of prediction accuracy vs. number of internal units for four methods: (1) first nearest neighbor, (2) adaptive units (one unit per training vector), (3) self-organizing units (ten training vectors per processing unit), and (A) backpropagation. The methods are described in the text. For backpropagation, the abscissa indicates the number of training vectors. The horizontal line associated with (A) is provided for visual reference and is not intended to suggest a scaling law. In fact, the scaling law is not known. (Ordinate: log10 normalized prediction error; abscissa: log10 number of units; curves are labeled "Gaussian Units" and "Back Prop Benchmark.")
as much data to reach equivalent accuracy. These differences in training data requirements stem from the fact that a backpropagation network is fully supervised, learns a global fit to the function, and is thus a more powerful generalizer, while the network of locally-tuned units learns only local representations. However, the larger data requirements of the networks of locally-tuned processing units are outweighed by dramatic improvements in computational efficiency. The following table shows E versus computational time measured in Sun 3/60 seconds for each of the three methods described above in the 1000 internal units case.

Locally-Tuned Network Model   Normalized Prediction Error   Computation Time (Sun 3/60 CPU secs)
Version 1                     17.1%                         67 secs (0.019 hours)
Version 2                     9.9%                          229 secs (0.064 hours)
Version 3                     6.1%                          1858 secs (0.51 hours)
The Lapedes and Farber backpropagation results required a few minutes to a fraction of an hour of Cray X/MP time running at an impressive 90 MFlops (Lapedes 1988). Their implementation used the conjugate gradient minimization technique and achieved approximately 5% Normalized Error. Our simulations on the Sun 3/60 probably achieved no more than 90 KFlops (the LINPACK benchmark). Taking these differences into account, the networks of locally-tuned processing units learned hundreds to thousands of times faster than the backpropagation network. Our own experiences with backpropagation applied to this problem are consistent with this difference. Using a variety of methods including on-line learning, off-line learning, gradient descent, and conjugate gradient, we have been unable to get a backpropagation network to come close to matching the prediction accuracy of locally-tuned networks in a few hours of Sun 3 time. Although our discussion has focused on algorithms which can be implemented as real-time adaptive systems, such as backpropagation and the networks of locally-tuned units we have presented, a number of off-line algorithms for multidimensional function modeling achieve excellent performance both in terms of efficiency and precision. These include methods of exact interpolation based on rational radially-symmetric basis functions (Powell 1985). In spite of the apparent locality of their functional representations, these methods are actually global since the basis functions do not drop off exponentially fast. Furthermore, they require a separate basis function for each data point and require computing time which scales as N^3 in the size of the data set. Finally, these algorithms are not adaptive.
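In contrast with these off-line methods, version 3 of the locally-tuned network is adaptive: k-means places the centers and an LMS rule fits the amplitudes. The one-dimensional sketch below uses our own names and toy data, and substitutes hand-set widths for the paper's global first-nearest-neighbor heuristic:

```python
import math, random

def kmeans(data, k, iters=20):
    """Batch k-means to place the unit centers (1-D data)."""
    centers = random.sample(data, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: (x - centers[i]) ** 2)
            clusters[nearest].append(x)
        centers = [sum(cl) / len(cl) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

def rbf_predict(x, centers, widths, amps):
    """Weighted sum of Gaussian responses, one locally-tuned unit per center."""
    return sum(a * math.exp(-((x - c) ** 2) / (w ** 2))
               for a, c, w in zip(amps, centers, widths))

def train_amplitudes(data, targets, centers, widths, lr=0.1, epochs=200):
    """Fit the output amplitudes by the LMS rule; centers and widths fixed."""
    amps = [0.0] * len(centers)
    for _ in range(epochs):
        for x, t in zip(data, targets):
            phis = [math.exp(-((x - c) ** 2) / (w ** 2))
                    for c, w in zip(centers, widths)]
            err = t - sum(a * phi for a, phi in zip(amps, phis))
            for i, phi in enumerate(phis):
                amps[i] += lr * err * phi
    return amps
```

Because only the amplitudes are trained once the centers settle, each new data point costs a single linear LMS step, which is where the speed advantage over backpropagation comes from.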
Approximation methods based on local linear and local quadratic fitting have been championed recently by Farmer and Sidorowich (1987). These algorithms utilize local representations in the input space, but are not appropriate for real-time use since they require multi-dimensional tree data structures which are cumbersome to modify on the fly and would be extremely difficult to implement in special purpose hardware. The off-line methods require explicit storage of past data and assume that all such data is retained. In contrast, the neural net approach requires storage of only a fixed set of tunable network parameters; this number is independent of the total amount of data observed over time. An alternative adaptive approach based upon hashing and a hierarchy of locally-tuned representations has been explored by Moody (1989a) with very favorable results.

Note Added in Proof

After this manuscript was accepted for publication, we learned that Hanson and Burr (1987) had suggested using a single layer of locally-tuned units in place of two layers of sigmoidal or threshold units. This was also suggested independently by Lapedes and Farber (1988). These authors, along with Lippmann (1987), describe constructions whereby localized "bumps" or convex regions can be built from two layers of sigmoidal or threshold units respectively.

Acknowledgments

We gratefully acknowledge helpful comments from and discussions with Andrew Barron, Doyne Farmer, Walter Heiligenberg, Alan Lapedes, Y.C. Lee, Richard Lippmann, Bartlett Mel, Demetri Psaltis, Terry Sejnowski, John Shewchuk, John Sidorowich, and Tony Zador. We furthermore wish to thank Bill Huang and Richard Lippmann for kindly providing us with the phoneme data. This research was supported by ONR grant N00014-86-K-0310, AFOSR grant F49620-88-C-0025, and a Purdue Army subcontract.

References

Farmer, J.D. and J.J. Sidorowich. 1987. Predicting chaotic time series. Physical Review Letters, 59, 845.
Hanson, S.J. and D.J. Burr. 1987.
Knowledge representation in connectionist networks. Bellcore Technical Report.

Huang, W.Y. and R.P. Lippmann. 1988. Neural net and traditional classifiers. In: Neural Information Processing Systems, ed. D.Z. Anderson, 387-396. American Institute of Physics.
John Moody and Christian J. Darken
Kohonen, T. 1988. Self-Organization and Associative Memory. Berlin: Springer-Verlag.

Lapedes, A.S. 1988. Personal communication.

Lapedes, A.S. and R. Farber. 1987. Nonlinear signal processing using neural networks: Prediction and system modeling. Technical Report, Los Alamos National Laboratory, Los Alamos, New Mexico.

Lapedes, A.S. and R. Farber. 1988. How neural nets work. In: Neural Information Processing Systems, ed. D.Z. Anderson, 442-456. American Institute of Physics.

Lippmann, R.P. 1987. An introduction to computing with neural nets. IEEE ASSP Magazine, 4:4.

Lloyd, S.P. 1957. Least squares quantization in PCM. Bell Laboratories Internal Technical Report. Reprinted in IEEE Transactions on Information Theory, IT-28:2, 1982.

MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematics, Statistics, and Probability, eds. L.M. LeCam and J. Neyman. Berkeley: U. California Press, 281.

Moody, J. 1989a. Fast learning in multi-resolution hierarchies. In: Advances in Neural Information Processing Systems, ed. D. Touretzky. Morgan-Kaufmann, Publishers.

Moody, J. 1989b. In preparation.

Moody, J. and C. Darken. 1988. Learning with localized receptive fields. In: Proceedings of the 1988 Connectionist Models Summer School, eds. Touretzky, Hinton, and Sejnowski. Morgan-Kaufmann, Publishers.

Omohundro, S. 1987. Efficient algorithms with neural network behavior. Complex Systems, 1, 273.

Powell, M.J.D. 1985. Radial basis functions for multivariate interpolation: a review. Technical Report DAMTP 1985/NA12, Department of Applied Mathematics and Theoretical Physics, Cambridge University, Silver Street, Cambridge CB3 9EW, England.
Received 10 October 1988; accepted 27 October 1988.
REVIEW
Unsupervised Learning
What use can the brain make of the massive flow of sensory information that occurs without any associated rewards or punishments? This question is reviewed in the light of connectionist models of unsupervised learning and some older ideas, namely the cognitive maps and working models of Tolman and Craik, and the idea that redundancy is important for understanding perception (Attneave 1954), the physiology of sensory pathways (Barlow 1959), and pattern recognition (Watanabe 1960). It is argued that (1) The redundancy of sensory messages provides the knowledge incorporated in the maps or models. (2) Some of this knowledge can be obtained by observations of mean, variance, and covariance of sensory messages, and perhaps also by a method called "minimum entropy coding." (3) Such knowledge may be incorporated in a model of "what usually happens" with which incoming messages are automatically compared, enabling unexpected discrepancies to be immediately identified. (4) Knowledge of the sort incorporated into such a filter is a necessary prerequisite of ordinary learning, and a representation whose elements are independent makes it possible to form associations with logical functions of the elements, not just with the elements themselves.

1 Introduction

Much of the information that pours into our brains throughout the waking day arrives without any obvious relationship to reinforcement, and is unaccompanied by any other form of deliberate instruction. What use can be made of this impressive flow of information?
In this article I hope, first, to show that it is the redundancy contained in these messages that enables the brain to build up its "cognitive maps" or "working models" of the world around it; second, to suggest initial steps by which these might be formed; and third, to propose a structure for the maps or models that automatically ensures their access and use in everyday perception, and represents percepts in a form suitable for detecting the new associations involved in ordinary learning and conditioning. Self-organization has been a major preoccupation of those interested in neural networks since the early days, and the volume edited by Yovits et

Neural Computation 1, 295-311 (1989) © 1989 Massachusetts Institute of Technology
H.B. Barlow
al. (1962) gives an overview of some of this work; it is interesting to compare this with the systematic and much more developed treatment in the book on the subject by Kohonen (1984). One goal has been to explain topographic projections of sensory pathways and the occurrence of feature-selective neurons without depending completely on genetic specification (see especially von der Malsburg 1973; Nass and Cooper 1975; Cooper et al. 1979; Perez et al. 1975; Fukushima 1975, 1980; Swindale 1980, 1982; Barrow 1987). Another goal has been to explain the automatic separation and classification of clustered sensory stimuli (Rosenblatt 1959, 1962; Uttley 1958, 1979). The informon (Uttley 1970), for example, separated frequently occurring patterns from among a background of randomly associated elements, and it mimicked many aspects of the model of Rescorla and Wagner (1972) for conditioning and learning (Uttley 1975). Grossberg (1980) mainly emphasized the interactions between supervised and unsupervised learning. The adaptive critic in the pole-balancing scheme described by Barto et al. (1983) improved learning performance by observing the pattern of recurring correction-movements and their outcomes. Self-organization may be mediated by the competitive learning analyzed by Rumelhart and Zipser (1985), which has been applied to the generation of feature specificity by Barrow (1987) and to a hippocampal model by Rolls (1989). The hierarchical mapping scheme of Linsker (1986, 1988) shows spontaneous self-organization, and his infomax principle develops further some ideas related to those Uttley (1979) proposed. Linsker's networks can produce an organization reminiscent of the cortex both spontaneously, and in response to regularities of the incoming signals. From an informational viewpoint the recent exploration by Pearlmutter and Hinton (1986) of unsupervised procedures for discovering regularities in the input is especially relevant.
Much of this paper has antecedents in the above work as well as in theories about the importance of redundancy in perception (Attneave 1954; Barlow 1959) and pattern recognition (Watanabe 1960, 1985). However, I have also tried to relate unsupervised learning to ideas about cognitive processes developed by Tolman (1932) and Craik (1943). Since these ideas provide a link with traditional psychology they will be briefly described.
1.1 Cognitive Maps and Working Models. Tolman (1932) worked within the behaviorists' tradition, but he disagreed with the rigidity of their explanations, feeling that these did not adequately convey the richness of the knowledge about their environment that maze-running rats clearly possessed and freely utilized. As he said, "behavior reeks of purpose and of cognition," and the structured knowledge of the environment that he argued for was subsequently called a cognitive map. Craik (1943), in his shorter, more philosophically oriented, book proposed that "thought models, or parallels, reality." These working models embodied the essential features and interactions in the world that fed the senses,
so that the outcomes of various possible actions could be correctly predicted; this is very similar to the way Tolman thought of cognitive maps being used by his rats. What is the source of the extensive and well-organized knowledge of the environment implied by the possession of a cognitive map or working model? Though their structure may be genetically determined, the specific evidence they incorporate can be obtained only from the sensory messages received by the brain, and it is argued below that it is the statistical regularities in these messages that must be used for this purpose. This is an extraordinarily complex and difficult task, for it requires something like a major program of scientific research to be conducted at a precognitive level. There is plenty of room for genetic help in doing this, but once the nature of the task has been defined the statistical aspects can be approached systematically. In the next sections this is attempted for the first few steps, and a new method of finding these regularities - minimum entropy coding - is proposed.

2 Redundancy Provides Knowledge
There are genuine conceptual difficulties in applying information theory to the nervous system. These start with the paradox that although redundancy is claimed to be terribly important, sensory pathways are said to eliminate or reduce it rather than preserve it. Some of these difficulties (such as that one) disappear upon better understanding of information theory, but others do not: it is, for instance, difficult to apply the concepts when one is uncertain about the information-bearing features of the messages in nerve fibres, and about the overall plan used to represent information in the brain. In the next section these difficulties are avoided by talking about the sensory stimuli applied to the animal rather than the messages these arouse, and by doing this the definitions can be made precise. In principle, the maximum rate of presentation of usable information to the senses can be specified if one knows the psychophysical facts about their discriminatory capacities; call this C bits/sec. Now look at the actual rate at which information is delivered, and call this H bits/sec; then the redundancy is simply C - H bits/sec, or 100 x (C - H)/C %. There remains a problem about measuring H, for the lower limit to its value can be calculated only if one knows all there is to know about the constraints operating in the world that gives rise to our sensations, and this point can obviously never be reached. Fortunately the concept of redundancy remains useful even if H is calculated using incomplete knowledge of the constraints, for this defines an upper limit to H and a lower limit to the redundancy. It is confusing to refer to these C - H bits/sec as information, but the technically correct term, redundancy, is almost equally misleading,
for it suggests that this part of the sensory inflow is useless or irrelevant, whereas it is the potential source of all the available knowledge about the constant or semiconstant patterns and regularities in an animal’s environment. Knowledge is perhaps the best term for it, though it may seem paradoxical that this knowledge of the world around us can be derived only from the redundancy of the messages. The point can be illustrated by briefly considering what nonredundant sensory stimuli would be like. Completely nonredundant stimuli are indistinguishable from random noise. Thus, such a visual stimulus would look like a television set tuned between stations, and an auditory stimulus would sound like the hiss on an unconnected telephone line. Though meaningless to the recipient, technically such signals convey information at the maximum rate because they cannot be predicted at all from other parts of the message; H = C and there is no redundancy. Thus, redundancy is the part of our sensory experience that distinguishes it from noise; the knowledge it gives us about the patterns and regularities in sensory stimuli must be what drives unsupervised learning. With this in mind one can begin to classify the forms the redundancy takes and the methods of handling it. 3 Finding and Using Sensory Redundancy
Some features of sensory stimuli are almost universal. For instance, the upper part of the visual field is imaged on the lower part of the retina in an erect animal, and it is almost always more brightly illuminated. In animals such as cats that have a reflecting tapetum one usually finds that it is confined to the part receiving the image of the lower, dimmer, part of the visual field while the reflecting tapetum is replaced by a densely absorbing pigment in the part receiving the bright image; the result is to greatly reduce the amount of scattered light obscuring the image in its dimmer parts. The many ways that redundant aspects of sensory stimuli are reflected in permanent features of the sensory system are themselves interesting, but here we are concerned with learning-like responses. To exploit redundant features the brain must determine characteristics of the stimuli that behave in a nonrandom manner, so one can consider methodically the various measures that could be made on the messages in order to characterize these regularities statistically. 3.1 Mean. One starts with the mean, taken over the recent past. In vision, this can assume any value from a few thousandths up to many thousands of cd/m2, but it behaves in a very nonrandom manner because it tends to stay rather constant for quite long periods. I have been sitting at my desk for the past hour, and during this time the mean luminance has always been close to 10 cd/m2; the constancy of this mean is a highly nonrandom feature and the visual system takes advantage of it to adjust
the sensitivity of the pathways to suit the limited range of retinal illuminations it will receive. Much is understood about these adaptational mechanisms, but the principles are well understood by communication engineers and I shall go ahead to consider more interesting types of redundancy. However the way in which coding by the retina changes with the mean luminance of the images is a simple paradigm of unsupervised learning, and the one we are closest to understanding physiologically.
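This paradigm can be sketched in a few lines of Python. The sketch is an editor's illustration under simple assumptions (an exponentially weighted running mean with an arbitrary adaptation rate), not a model of actual retinal mechanisms:

```python
# A minimal sketch of adaptation to the mean: the "expected" value is a
# running mean of recent input, and what is transmitted is the deviation
# from that expectation. The adaptation rate is an arbitrary assumption.

def adapt(signals, rate=0.2, mean0=0.0):
    """Return each signal re-expressed relative to a slowly adapting mean."""
    mean = mean0
    out = []
    for s in signals:
        out.append(s - mean)         # transmit: deviation from what is expected
        mean += rate * (s - mean)    # expectation slowly tracks the input
    return out

# A steady luminance of 10 cd/m^2 gives responses that decay toward zero as
# the mean adapts; a sudden jump to 50 produces a large transient response.
resp = adapt([10.0] * 20 + [50.0])
```

A constant input thus fades from the output while any change is signaled strongly, which is the sense in which the adapted mean acts as an "expected value."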
3.2 Variance. The variance of sensory signals probably does not show the constancy over short periods combined with very large changes over long periods that is characteristic of the mean, though a walker in mist or a fish in murky water would certainly be exposed to signals with an exceptionally low range of image contrasts and hence low variance. After the transformations in the retina, taking account of changes in the variance of the input signals is actually very nearly equivalent to adjusting for the mean values of the signals in the "on" and "off" systems, and it has been suggested that such contrast gain control occurs in primary visual cortex (Ohzawa et al. 1982, 1985). One might perhaps consider next the higher moments of the distributions of input stimuli on the many input channels, but it is hard to imagine that adapting to these would have any great advantages and I know of no evidence that natural systems respond in any way to them. Hence the next step is the large one of considering the patterns of correlation between the inputs on different channels.
3.3 Covariance. The simplest measure of the correlated activity of sensory pathways would be the covariance between pairs of them. Just as adaptational mechanisms take advantage of the mean by using it as an expected value and expressing values relative to it, so one might take advantage of covariance by devising a code in which the measured correlations are "expected" in the input, but removed from the output by forming a suitable set of linear combinations of the input signals. It is possible to form an uncorrelated set of signals in a neural network with a rather simple scheme of connection and rule of synaptic modification (Barlow 1989; Barlow and Foldiak 1989; see also Kohonen 1984). The essential idea is that each neuron's output feeds back to the other inputs through anti-Hebbian synapses, so that correlated activity among the outputs is discouraged. Such a network would account for many perceptual phenomena hitherto explained in terms of fatigue of pattern selective elements in sensory pathways, and it also offers a mechanism for some forms of the "unconscious inference" described by von Helmholtz (1925) and modern psychologists of perception (Rock 1983). These aspects are discussed in the references cited above, and here some of the possible extensions of the principle will be mentioned.
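The essential idea can be sketched numerically. This is a toy version under assumptions of my own (symmetric lateral weights, a small fixed learning rate, Gaussian inputs), not the exact circuit of Barlow and Foldiak (1989):

```python
import numpy as np

# Toy decorrelating network: each output feeds back to the others through
# anti-Hebbian lateral weights, so correlated activity among the outputs is
# discouraged. Learning rate, weight symmetry, and data are assumptions.

rng = np.random.default_rng(0)

def decorrelate(X, lr=0.01, steps=20):
    """X: (n_samples, n_units). Learn lateral weights W (zero diagonal)."""
    n = X.shape[1]
    W = np.zeros((n, n))
    for _ in range(steps):
        for x in X:
            y = x + W @ x                # output = input + lateral feedback
            dW = -lr * np.outer(y, y)    # anti-Hebbian: weaken co-active pairs
            np.fill_diagonal(dW, 0.0)    # no self-connection
            W += dW
    return W

# Two correlated input channels built from a shared source z.
z = rng.standard_normal(500)
noise = rng.standard_normal(500)
X = np.stack([z, 0.6 * z + 0.8 * noise], axis=1)
W = decorrelate(X)
Y = X + X @ W.T
# After learning, the outputs in Y are far less correlated than the inputs.
```

Correlated activity is "expected" by the learned lateral weights and removed, so only departures from the usual correlation structure appear strongly in the output.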
So far it has been supposed that the covariance is worked out from paired values occurring at the same moment, but this need not be the case. Sutton and Barto (1981) have discussed temporal relationships in conditioning, and there are several synaptic mechanisms that might depend on the correlation between synaptic input at one moment and postsynaptic depolarization at a later moment; a transmitter might cause lingering "eligibility" for subsequent reinforcement, or a synaptic rewarding factor or reverse transmitter released by a depolarized neuron might be optimally picked up by presynaptic terminals some moments after they had themselves been active. Decorrelating networks based on such principles would "expect" events that occurred in often-repeated sequences, and would tend to respond less strongly to frequently occurring sequences and more strongly to abnormal ones. It is easy to see how such a mechanism might explain aftereffects of motion. A consequence of using covariances is that, since the inputs are taken in pairs, the number of computations increases in proportion to the square of the number of inputs. This means that it would be impossible to decorrelate the whole sensory input; the best that could be done would be to decorrelate local sets of sensory fibers. However, the process could then be repeated, possibly organizing the decorrelated outputs of the first stage according to principles other than their topographical proximity, such as proximity in color space or similarity of direction of motion (Barlow 1981; Ballard 1984). Such hierarchical decorrelation processes may have considerable potential, but there is no denying that the methods so far considered only begin the task of finding regularities in the sensory input. 3.4 Rules for Combination or Agglomeration. 
Decorrelation separates variables that are correlated, but if the correlation between two variables is very strong they might be conveying the same message, and then one should combine them. For instance, taste information is carried by a large number of nerve fibers each of which has its characteristic mixture of sensitivities to the four primary qualities, salt, sweet, sour, and bitter. We have shown (Barlow and Foldiak 1989) how these can be decorrelated in groups of four to yield the four primary qualities, but one might expect all the outputs for one quality then to be combined on to a much smaller number of elements, for without doing this they just seem to replicate the information needlessly. There is need for an operation of this sort in many situations: for instance, to exploit the fact that there are only two dimensions of color (in addition to luminance), to exploit the prevalence of edges in ordinary images, to combine in one entity the host of sensory experiences for which we use a single word or name, and to do the same for a commonly repeated phrase or cliche. Pearlmutter and Hinton (1986) consider a related problem, that of finding input patterns that occur more often than would be expected if the constituent features occurred independently.
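A toy version of such an agglomeration rule can be written down, under illustrative assumptions of my own (a fixed correlation threshold and simple averaging, neither taken from the text):

```python
import numpy as np

# Toy agglomeration rule: merge channels whose pairwise correlation exceeds
# a threshold, replacing each group by its average. The threshold value and
# averaging rule are illustrative assumptions, not a proposed mechanism.

def merge_redundant(X, threshold=0.95):
    """X: (n_samples, n_channels). Return (reduced array, channel groups)."""
    n = X.shape[1]
    corr = np.corrcoef(X.T)
    groups, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        group = [i] + [j for j in range(i + 1, n)
                       if j not in assigned and corr[i, j] > threshold]
        assigned.update(group)
        groups.append(group)
    reduced = np.stack([X[:, g].mean(axis=1) for g in groups], axis=1)
    return reduced, groups

rng = np.random.default_rng(1)
z = rng.standard_normal(300)
w = rng.standard_normal(300)
# Four hypothetical "taste fibers": two near-copies of quality z, two of w.
X = np.stack([z, z + 0.01 * rng.standard_normal(300),
              w, w + 0.01 * rng.standard_normal(300)], axis=1)
reduced, groups = merge_redundant(X)
# reduced has one column per underlying quality instead of one per fiber.
```

Channels that almost always say the same thing are collapsed into one element, so the information is no longer needlessly replicated.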
Finding that some combinations occur more often than expected is the converse of finding that some combinations do not occur at all, as is the case when the number of degrees of freedom or dimensions in a set of messages is less than would appear from the form of the messages. The set of N features spans less than an N-dimensional space because certain combinations do not occur, and exploiting this is just the kind of simplification that would enable one to make useful cognitive maps and working models. Principal component analysis will do what is required, and it is believed that the method described in the next section will also, but it is natural to look for network methods, especially as these have already achieved some success (for example, Oja 1982; Rumelhart and Zipser 1985; Pearlmutter and Hinton 1986; Foldiak 1989).

3.5 Minimum Entropy Coding. As with decorrelation the idea is to find a set of symbols to represent sensory messages such that, in the normal environment, each symbol is as nearly as possible independent of the others, but there are two differences: first, it is applicable to discretely coded, logically distinct variables rather than continuous ones, and second it takes into account all possible nonrandom relations between the outputs, not just the pairwise relationships of the covariance matrix. To make the principle clear the simple example of coding keyboard characters in 7 binary digits to find alternatives to the familiar 7-bit ASCII code will be considered. The advantages of examining this are its familiarity, its simplicity, and the fact that samples of normal English text are readily available from which the nonrandom character frequencies can be determined. If a sample of ordinary text is regarded simply as a string of independent characters randomly selected from the alphabet with the probabilities given by their frequency of occurrence in ordinary text, the average entropy of the characters H_c is given by the familiar expression:

H_c = -Σ_i p_i log p_i     (3.1)

where p_i are the probabilities of the mutually exclusive set of characters. Each of the characters is represented by a 7-bit word, and the entropies for each bit can be obtained by measuring their frequencies in a sample of text. The entropy expression for the bits takes the form:

H_b = -(P_b log P_b + Q_b log Q_b)     (3.2)

where H_b is the average entropy of the bth bit, P_b is its probability, and Q_b is 1 - P_b. An estimate of the average character entropy can be obtained by adding the 7-bit entropies, but it is important to realize that this can never be less, and will usually be greater, than the character entropy given by the original expression (3.1). The reason for this is the lack of independence between the values of the bits; if it were true for all the 7
bits that their values were completely independent of the other bits occurring in any combination, then the two estimates would be equal. The object is to find a code for which this is true, or as nearly true as possible, and the method of doing this is to find a code that minimizes the sum of the bit entropies - hence the name. If the minimum is reached and the bits are truly independent we call it a factorial code, since each bit probability or its complement is then a factor of the probability of each of the input states. The maneuver can be looked at another way. The seven binary digits of the ASCII code can carry a maximum of 7 bits, but actually carry less when used to transmit normal text, for two reasons. First, the bit probabilities are a long way from 1/2, which would yield the maximum bit entropy; this form of redundancy is explicit and causes no trouble, for the probability of each of the 7 bits is available wherever they are transmitted and easily measured. Second, there are complicated interdependencies among the bits, so the conditional bit probabilities are not the same as the unconditional ones; this form of redundancy is troublesome, for it is not available wherever the bits are transmitted and to describe it completely one needs to know the conditional probabilities of each bit for all combinations of other bits. If both of these forms of redundancy were taken into account the information conveyed per ASCII word would of course be the same as H_c of expression (3.1), i.e., about 4.3 bits, and no change of the code would alter this. However, changing the code does change the relative amounts of the two forms of redundancy, and by finding one that minimizes the sum of the bit entropies one maximizes the redundancy that results from bit probabilities deviating from 1/2. This leaves less room for redundancy from interdependencies between the bits; the troublesome form of redundancy is squeezed out by maximizing the other less troublesome form.
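The two entropy estimates of expressions (3.1) and (3.2) are easy to compute for a small sample. The text sample and the use of the plain 7-bit ASCII assignment below are illustrative assumptions:

```python
from collections import Counter
from math import log2

# Compare the character entropy (expression 3.1) with the sum of the bit
# entropies (expression 3.2) for a 7-bit binary code. The sum can never be
# less than the character entropy; the two are equal only when the bits are
# independent, i.e., for a factorial code.

def char_entropy(text):
    """Average entropy per character, -sum p_i log2 p_i."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def sum_of_bit_entropies(text, code):
    """code maps each character to a 7-character bit string."""
    total = len(text)
    H = 0.0
    for b in range(7):
        p = sum(1 for ch in text if code[ch][b] == "1") / total
        if 0.0 < p < 1.0:
            H += -(p * log2(p) + (1 - p) * log2(1 - p))
    return H

text = "in the next sections this is attempted for the first few steps"
code = {ch: format(ord(ch), "07b") for ch in set(text)}  # plain 7-bit ASCII

Hc = char_entropy(text)
Hbits = sum_of_bit_entropies(text, code)
# Hbits >= Hc always; minimizing Hbits over candidate codes squeezes out the
# troublesome redundancy carried by interdependencies between the bits.
```

Searching over codes for the one that minimizes `Hbits` is exactly the minimum entropy principle: the gap between the two quantities measures how far the code is from factorial.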
The minimum entropy principle should be generally applicable and clearly goes further than decorrelation, which considers only the outputs in pairs. It can also be used to compare and select from codes that change the number of channels or dimensionality of the messages. The entropy is a locally computable quantity, and by minimizing it one can increase the independence of the outputs without actually measuring the frequencies of all the possible output states, which would often be an impossible task. An accompanying article (Barlow et al. 1989) goes into some of the practical and theoretical problems in finding minimum entropy codes. In this section it has been suggested that the statistical regularities of the incoming sensory messages might be measured and used to change the way they are coded or represented. It is easy to see that this would have advantages, analogous to those conferred by automatic gain control, in ensuring a compact representation within the dynamic range of the representative elements, but there may be more profound benefits attached to a representation in which the variables are independent in the environment to which the representation has been adapted. To understand these one must consider the main task for which our perceptions are used, namely the detection of new associations and their utilization in ordinary learning and conditioning.

4 Ordinary Learning Requires Previous Knowledge
Over the past 20 years the work of Kamin (1969), Rescorla and Wagner (1972), Mackintosh (1974, 1983), Dickinson (1980), and others has brought about a very big change in the way theorists approach the learning problem. Whereas previously they tended to think in terms of mechanistic links whose strengths were increased or decreased according to definable laws, attention has now shifted to the computational problem that an animal solves when it learns. This started with the realization and experimental demonstration of the fact that the detection of new associations is strongly dependent on other previously and concurrently learned associations, many of which may be "silent" in that they do not themselves produce overt and obvious effects on outward behavior. As a result of this change it is at last appreciated that the brain studied in the learning laboratory is doing a profoundly difficult job: it is deducing causal links from which it can benefit in the world around it, and it does this by detecting suspicious coincidences; that is, it picks up associations that are surprising, new, or different among those that the experimenter offers it. To detect new associations one must detect changes in the probabilities of certain events, and once this is realized an important role for unreinforced experience becomes clear: it is to find out and record the a priori probabilities, that is, the normal state of affairs, or what usually happens. Though this elementary fact does not seem to have been much emphasized by learning theorists it is obviously crucial, for how can something be recognized as new and surprising if there is no preexisting knowledge about what is old and expected?

4.1 Detecting New Associations.
The basic step in learning is to detect that event C predicts U; C might be the conditional, U the unconditional stimulus of Pavlovian conditioning, or C might be a motor action and U a reinforcement in operant conditioning, or they might be successive elements in a learned sequence. Unsupervised learning can help with at least two aspects of this process: first, the separate representation of a wide range of alternative Cs, and second, the acquisition of knowledge of the normal probabilities of occurrence of these possible conditional stimuli. It is often tacitly assumed that all alternative conditioning stimuli can be separated by the brain and their occurrences independently registered in some way, but one should not blandly ignore the whole problem of pattern recognition, and the massive interconnections we know exist
between the neurons of the brain mean that the host of alternative Cs are unlikely to be completely separable unless there are specific mechanisms for ensuring that they are. The tacit assumption that the probabilities of occurrence of these stimuli, or of their cooccurrence with U, are known is equally unjustified, though it is evident that if they were not there would be no sound basis for judging that a particular C had become a good predictor of U. The logical steps necessary to detect an association between C and U will be considered in more detail to show the importance both of knowledge of their normal probabilities and of the separability of alternative conditional stimuli. The only way to establish that C usefully predicts U is to disprove the null hypothesis that the number of occasions U follows C is no more than would be expected from chance coincidences of the two events; it is easy to see that if this null hypothesis is correct, no benefit can possibly result from using C as a predictor of U. To know the expected rate of chance coincidences one must either have measured the normal rate of the compound event (U following C) directly, or have knowledge of the normal probabilities of occurrence, P(C) and P(U); further, if these probabilities are to be used it must be reasonable to assume they are independent. This prior knowledge is clearly necessary before new predictive associations can be detected reliably. Now consider the difficulties that arise if a particular C cannot be fully resolved or separated from the alternative
Cs.
Failure of resolution or separation means that the registration of the occurrence of an event is contaminated by occurrences of other events. Estimates of the probabilities of occurrence of C both with and without U would be misleading if based on these contaminated counts, and their use would cause failures to detect associations that were present and the detection of spurious associations that did not exist. Thus, if counts of alternative events like C are to be used to detect causal factors, they must be adequately resolved or separated if learning is to be efficient and reliable.

4.2 Independence Is Needed for Versatile Learning. Now reconsider the two ways, measurement and calculation, of estimating the compound event probability P(U following C). Directly measuring it is adequate and plausible when one has prior expectations about the possible conditional stimuli C, especially as in either scheme one must somehow be able to detect the occurrence of this sequence when it occurs. But calculating P(U following C) from P(C) and P(U) is much more versatile, for the following reason. Measuring the rates of N coincidences such as "U following C" just gives these rates and no more, whereas knowledge of the probabilities of N independent events enables one to calculate the probability of all possible logical functions of those events, at least in principle. This gigantic increase in the number of null hypotheses whose predictions can be specified and tested gives an enormous advantage to the method of calculating, rather than measuring, the expected coincidence rates. However, calculating P(U following C) from the probabilities of its constituents depends on the formation of a representation in which the constituent events can be relied on to be independent until the association that is to be detected occurs. To summarize: to detect a suspicious coincidence that signals a new causal factor in the environment one should have access to prior knowledge of the probabilities of simpler constituent events, and these simpler events should be separately registered and independent on the null hypothesis from which one wishes to advance. It is obviously an enormously difficult and complicated task to generate such a representation, and the types of coding discussed above are only first steps; however, the versatility of subsequent learning must depend critically on how well the task has been done.
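The null-hypothesis test described above can be sketched as follows; the two-standard-deviation criterion and the particular counts are illustrative assumptions, not quantities from the text:

```python
from math import sqrt

# Sketch of the null-hypothesis test: is "U follows C" more frequent than
# chance coincidence of independent events would predict? The 2-sigma
# criterion is an illustrative choice, not taken from the text.

def suspicious_coincidence(n_C, n_U, n_CU, n_trials, sigmas=2.0):
    """n_C, n_U: counts of C and U; n_CU: count of 'U following C';
    n_trials: number of observed time steps."""
    pC, pU = n_C / n_trials, n_U / n_trials
    expected = n_trials * pC * pU          # chance rate if C and U independent
    sd = sqrt(expected * (1 - pC * pU))    # binomial standard deviation
    return n_CU > expected + sigmas * sd

# 1000 time steps: C occurs 100 times, U occurs 50 times, so chance predicts
# about 5 coincidences. Observing 30 rejects the null hypothesis; observing 6
# is within the range chance alone would produce.
assert suspicious_coincidence(100, 50, 30, 1000)
assert not suspicious_coincidence(100, 50, 6, 1000)
```

The calculation is only valid if P(C) and P(U) are estimated from separately registered, independent events, which is exactly the requirement argued for above.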
4.3 Some Other Issues. The approach taken here might be criticized on the grounds that the problem facing the brain in learning is considered in too abstract a manner, the actual mechanisms being ignored. For example, the logic of the situation requires that the numbers of occurrences and joint occurrences be somehow stored, and one might point to this as the major problem, rather than the way the numbers are used. It certainly is a major problem, but the attitude adopted here is that one is not going to get far in understanding learning without recognizing the logic of inductive inference, since this dictates what quantities actually need to be stored; it seems obvious that this problem should be looked at first. There must be many ways in which the brain fails to perform the idealized operations required to detect new causal factors. It performs approximations and estimates, not exact calculations, but one cannot appreciate the mistakes an approximation will lead to without knowing what the exact calculation is. It is likely that many of the features of learning stem from the nature of the problem being tackled, not from the specific details of the mechanisms, and it is foolish to confuse the one with the other through failing to attend to the complexity of the task the brain appears to perform so effectively. There is another somewhat irrelevant issue. If it was known with certainty that a predictive relation between C and U existed it would still have to be decided whether it should be acted on. This theoretically depends on whether P(U following C ) is high enough for the reward obtained when U does follow C to outweigh the penalty attached to the behavior needed to reap the reward when U fails to materialize; that is a different matter from deciding whether the relation exists, and for the present it can be ignored.
H.B. Barlow
4.4 Storing and Accessing the Model. So far no means has been proposed for performing the computations suggested above, nor for storing and accessing the knowledge of the environment that the model contains. One possibility is to form a massive memory for the usual rates of occurrence of various combinations of sensory inputs. Something like this may underlie our ability to say "the almond is blossoming unusually early this year" and to make similar cognitive judgments, but the comparative judgments of everyday perception are certainly made in quite a different way. When we see white walls in a dimly lit room we do not observe their luminance, then refer to a memorized look-up table that tells us what luminances are to be called white when the mean luminance is such and such; instead we have mechanisms (admittedly not yet fully understood) that automatically compare the signals generated by the image of the wall with signals from other regions, and then attach the label "white" wherever this comparison yields a high value. This automatic comparison was regarded above as a way of eliminating the redundancy involved in signaling the mean luminance on every channel, and it should now be clear how the various other suggested forms of recoding do much the same operation for other "expected" statistical regularities in the sensory messages. One can regard the model or map as something automatically held up for comparison with the current input; it is like a negative filter through which incoming messages are automatically passed, so that what emerges is the difference between what is actually happening and what one would expect to happen, based on past experience. In this way past experience can be made continuously and automatically available.

5 Discussion
Since the early days of information theory it has been suggested that the redundancy of sensory stimuli is particularly important for understanding perception. Attneave (1954) was the first to point this out, and I have periodically argued for its importance in understanding both the physiological mechanisms of sensory coding, and higher level functions including intelligence (Barlow 1959, 1961, 1987). One can actually trace the line of thought back to von Helmholtz (1877, 1925), and particularly to the writings of Ernst Mach (1886) and Karl Pearson (1892) about "The Economy of Thought." To what extent is this line of thought the same as that of Tolman and Craik on cognitive maps and working models? They are certainly closely related, for they both say that the regularities in the sensory messages must be recorded by the brain for it to know what usually happens. However, redundancy reduction is the more specific form of the hypothesis, for I think it also implies that the knowledge contained in the map or model is stored in such a form that the current sensory scene is automatically compared with it and the discrepancies passed on for further consideration - the idea of the model as a negative filter. There is perhaps something contradictory and intuitively hard to accept in this notion, especially when applied to the cognitive knowledge of our environments to which we have conscious access. When we become aware that the almond is blossoming unusually early, we think this is an absolute judgment based on comparisons with past, positive experiences, and not the result of a discrepancy between present experience and unconscious rememberings of past blossomings. Perhaps the negative filter idea applies only to the unconscious knowledge that our perceptions use so effectively, with quite different mechanisms employed at the higher levels to which we have conscious access. On the other hand redundancy reduction may be the deeper view of how our brains handle sensory input, for it may describe the goal of the whole process, dictating the form of representation as well as what is represented; we should not be too surprised if our introspections turn out to be misleading on such a matter, for they may be concerned with guiding us how to tell others about our experiences, not with informing us how the brain goes about its business. The discussion should have demonstrated that there is a close relationship between the properties of the elements that represent the current scene, the model that tells one "what usually happens," and the ease with which new associations can be detected and learned. But the recoding methods suggested above are unlikely to be complete, and it is worth listing other factors that must be important in determining the utility of representations.
5.1 Other Factors Affecting the Utility of Representations.

1. The best method of detecting a target in a noisy background is to derive a signal that picks up all the energy available from the signal with the minimum contamination by energy from its background - the principle of the matched detector. This principle must be very important when detecting events in the environment that are associated with rewards or punishments, but there is no guarantee that the code elements of a minimum entropy code (or any other code that is unguided by reinforcement) will be well matched to these classes of events. Though a priori probabilities can be calculated for any logical function of the inputs if the representative elements are independent, this calculation is not necessarily as accurate as that obtained from a matched filter.
2. It is also important that a coding scheme should lead to appropriate generalization. Probably representative elements should start by responding to a wider class of events than that to which, under the influence of "shaping," they ultimately respond. To meet this requirement mechanisms additional to minimum entropy coding are required.
3. Items such as the markings of prey, predators, or mates may have a biological significance that is arbitrary from an informational viewpoint.

4. Sensory scenes and stimuli that have been reinforced obviously have special importance, and they should therefore have a key role in classifying sensory stimuli.

It is clear that the minimum entropy principle is not the only one on which the representation of sensory information should be based. Nonetheless a code selected on this principle stores a wealth of knowledge about the statistical structure of the normal environment, and the independence of the representative elements gives such a representation enormous versatility. It is relatively easy to devise learning schemes capable of detecting specific associations, but higher mammals appear to be able to make associations with entities of the order of complexity that we would use a word to describe. As George Boole (1854) pointed out, words are to the elements of our sensations like logical functions to the variables that compose them. We cannot of course suppose that an animal can form an association with any arbitrary logical function of its sensory messages, but animals have capacities that tend in that direction, and it is these capacities that the kind of representative schemes considered here might be able to mimic.
References

Attneave, F. 1954. Informational aspects of visual perception. Psychol. Rev. 61, 183-193.
Ballard, D.H. 1984. Parameter networks. Artificial Intell. 22, 235-267.
Barlow, H.B. 1959. Sensory mechanisms, the reduction of redundancy, and intelligence. In National Physical Laboratory Symposium No. 10, The Mechanisation of Thought Processes. Her Majesty's Stationery Office, London.
Barlow, H.B. 1961. Possible principles underlying the transformations of sensory messages. In Sensory Communication, W. Rosenblith, ed., pp. 217-234. MIT Press, Cambridge, MA.
Barlow, H.B. 1981. Critical limiting factors in the design of the eye and visual cortex. The Ferrier lecture, 1980. Proc. Roy. Soc. London B 212, 1-34.
Barlow, H.B. 1987. Intelligence: The art of good guesswork. In The Oxford Companion to the Mind, R.L. Gregory, ed., pp. 381-383. Oxford University Press, Oxford.
Barlow, H.B. 1989. A theory about the functional role and synaptic mechanism of visual after-effects. In Vision: Coding and Efficiency, C. Blakemore, ed. Cambridge University Press, Cambridge.
Barlow, H.B., and Foldiak, P. 1989. Adaptation and decorrelation in the cortex. In The Computing Neuron, R. Durbin, C. Miall, and G. Mitchison, eds. Addison-Wesley, New York.
Barlow, H.B., Kaushal, T.P., and Mitchison, G.J. 1989. Finding minimum entropy codes. Neural Comp. 1, 412-423.
Barrow, H.G. 1987. Learning receptive fields. Proc. IEEE First Int. Conf. Neural Networks, Cat. #87TH0191-7, pp. IV-115-IV-121.
Barto, A.G., Sutton, R.S., and Anderson, C.W. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transact. Systems, Man, Cybernet. SMC-13(5), 835-846.
Boole, G. 1854. An Investigation of the Laws of Thought. Dover Publications Reprint, New York.
Cooper, L.N., Liberman, F., and Oja, E. 1979. A theory for the acquisition and loss of neuron specificity in visual cortex. Biol. Cybernet. 33, 9-28.
Craik, K.J.W. 1943. The Nature of Explanation. Cambridge University Press, Cambridge.
Dickinson, A. 1980. Contemporary Animal Learning Theory. Cambridge University Press, Cambridge.
Foldiak, P. 1989. Adaptive network for optimal linear feature extraction. In Proc. IEEE/INNS Int. Joint Conf. Neural Networks, Washington, DC.
Fukushima, K. 1975. Cognitron: A self-organising multi-layered neural network. Biol. Cybernet. 20, 121-136.
Fukushima, K. 1980. Neocognitron: A self-organising neural network model for a mechanism of pattern recognition unaffected by shift of position. Biol. Cybernet. 36, 193-202.
Grossberg, S. 1980. How does a brain build a cognitive code? Psychol. Rev. 87, 1-51.
Kamin, L.J. 1969. Predictability, surprise, attention and conditioning. In Punishment and Aversive Behavior, B.A. Campbell and R.M. Church, eds., pp. 279-296. Appleton-Century-Crofts, New York.
Kohonen, T. 1984. Self-Organisation and Associative Memory. Springer-Verlag, Berlin.
Linsker, R. 1986. From basic network principles to neural architecture (series). Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512, 8390-8394, 8779-8783.
Linsker, R. 1988. Self-organisation in a perceptual network. Computer (March 1988), 105-117.
Mach, E. 1886. The Analysis of Sensations, and the Relation of the Physical to the Psychical. Translation of the 1st, revised from the 5th, German edition by S. Waterlow. Open Court, Chicago and London. (Also Dover reprint, New York, 1959.)
Mackintosh, N.J. 1974. The Psychology of Animal Learning. Academic Press, London.
Mackintosh, N.J. 1983. Conditioning and Associative Learning. Oxford University Press, Oxford.
Nass, M.M., and Cooper, L.N. 1975. A theory for the development of feature detecting cells in visual cortex. Biol. Cybernet. 19, 1-18.
Ohzawa, I., Sclar, G., and Freeman, R.D. 1982. Contrast gain control in the cat visual cortex. Nature (London) 298, 266-278.
Ohzawa, I., Sclar, G., and Freeman, R.D. 1985. Contrast gain control in the cat's visual system. J. Neurophysiol. 54, 651-667.
Oja, E. 1982. A simplified neuron as a principal component analyser. J. Math. Biol. 15, 267-273.
Pearlmutter, B.A., and Hinton, G.E. 1986. G-maximization: An unsupervised learning procedure for discovering regularities. In Proc. Conf. Neural Networks Comp. American Institute of Physics.
Pearson, K. 1892. The Grammar of Science. Walter Scott, London.
Perez, R., Glass, L., and Shlaer, R. 1975. Development of specificities in the cat visual cortex. J. Math. Biol. 1, 275-288.
Rescorla, R.A., and Wagner, A.R. 1972. A theory of Pavlovian conditioning: Variations in effectiveness of reinforcement and non-reinforcement. In Classical Conditioning II: Current Research and Theory, A.H. Black and W.F. Prokasy, eds., pp. 64-99. Appleton-Century-Crofts, New York.
Rock, I. 1983. The Logic of Perception. MIT Press, Cambridge, MA.
Rolls, E.T. 1989. The representation and storage of information in neuronal networks in the primate cerebral cortex and hippocampus. In The Computing Neuron, R. Durbin, C. Miall, and G. Mitchison, eds. Addison-Wesley, New York.
Rosenblatt, F. 1959. Two theorems of statistical separability in the Perceptron. In National Physical Laboratory Symposium No. 10, Mechanisation of Thought Processes, 419-456. Her Majesty's Stationery Office, London.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, Washington, DC.
Rumelhart, D.E., and Zipser, D. 1985. Feature discovery by competitive learning. Cog. Sci. 9, 75-112. (Also in Parallel Distributed Processing, Vol. 1, pp. 151-193, MIT Press, Cambridge, MA, 1986.)
Sutton, R.S., and Barto, A.G. 1981. Towards a modern theory of adaptive networks: Expectation and prediction. Psychol. Rev. 88, 135-170.
Swindale, N.V. 1980. A model for the formation of ocular dominance stripes. Proc. Roy. Soc. London B 208, 243-264.
Swindale, N.V. 1982. A model for the formation of orientation columns. Proc. Roy. Soc. London B 215, 211-230.
Tolman, E.C. 1932. Purposive Behavior in Animals and Men. The Century Company, New York.
Uttley, A.M. 1958. Conditional probability as a principle in learning. In Actes du 1er Congrès de Cybernétique, Namur, 1956, J. Lemaire, ed. Gauthier-Villars, Paris.
Uttley, A.M. 1970. The Informon: A network for adaptive pattern recognition. J. Theoret. Biol. 27, 31-47.
Uttley, A.M. 1975. The Informon in classical conditioning. J. Theoret. Biol. 49, 355-376.
Uttley, A.M. 1979. Information Transmission in the Nervous System. Academic Press, London.
von der Malsburg, C. 1973. Self-organisation of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100.
von Helmholtz, H. 1877. The Sensations of Tone. (Translated by A.J. Ellis, 1885.) Reprinted Dover, New York, 1954.
von Helmholtz, H. 1925. Physiological Optics, Volume 3: The Theory of the Perceptions of Vision. Translated from 3rd German edition (1910). Optical Society of America, Washington, DC.
Watanabe, S. 1960. Information-theoretical aspects of inductive and deductive inference. IBM J. Res. Dev. 4, 208-231.
Watanabe, S. 1985. Pattern Recognition: Human and Mechanical. Wiley, New York.
Yovits, M.C., Jacobi, G.T., and Goldstein, G.D. 1962. Self-Organizing Systems. Spartan Books, Washington, DC.
Received 29 December 1988; accepted 15 February 1989
REVIEW
The Vapnik-Chervonenkis Dimension: Information versus Complexity in Learning

Yaser S. Abu-Mostafa
California Institute of Technology, Pasadena, CA 91125 USA
When feasible, learning is a very attractive alternative to explicit programming. This is particularly true in areas where the problems do not lend themselves to systematic programming, such as pattern recognition in natural environments. The feasibility of learning an unknown function from examples depends on two questions:

1. Do the examples convey enough information to determine the function?
2. Is there a speedy way of constructing the function from the examples?

These questions contrast the roles of information and complexity in learning. While the two roles share some ground, they are conceptually and technically different. In the common language of learning, the information question is that of generalization and the complexity question is that of scaling. The work of Vapnik and Chervonenkis (1971) provides the key tools for dealing with the information issue. In this review, we develop the main ideas of this framework and discuss how complexity fits in.

1 Introduction

We start by formalizing a simple setup for learning from examples. We have an environment such as the set of visual images, and we call the set X. In this environment we have a concept defined as a function f : X → {0, 1}, such as the presence or absence of a tree in the image. The goal of learning is to produce a hypothesis, also defined as a function g : X → {0, 1}, that approximates the concept f, such as a pattern recognition system that recognizes trees. To do this, we are given a number of examples (x_1, f(x_1)), ..., (x_N, f(x_N)) from the concept, such as images with trees and images without trees. In generating the examples, we assume that there is an unknown probability distribution P on the environment X. We pick each example independently according to this probability distribution. The statements in the paper hold true for any probability distribution P, which sounds

Neural Computation 1, 312-317 (1989) © 1989 Massachusetts Institute of Technology
very strong indeed. The catch is that the same P that generated the examples is the one that is used to test the system, which is a plausible assumption. Thus, we learn the tree concept by being exposed to "typical" images. While X can be finite or infinite (countable or uncountable), we shall use a simple language that assumes no measure-theoretic complications. The hypothesis g that we produce approximates f in the sense that g would rarely be significantly different from f (Valiant 1984). This definition allows for two tolerance parameters ε and δ. With probability ≥ 1 - δ, g will differ from f at most ε of the time. The δ parameter protects against the small, but nonzero, chance that the examples happen to be very atypical. A learning algorithm is one that takes the examples and produces the hypothesis. The performance is measured by the number of examples needed to produce a good hypothesis as well as the running time of the algorithm.

2 Generalization
We start with a simple case that may look at first as having little to do with what we think of as generalization. Suppose we make a blind guess of a hypothesis g, without even looking at any examples of the concept f. Now we take some examples of f and test g to find out how well it approximates f. Under what conditions does the behavior of g on the examples reflect its behavior in general? This turns out to be a very simple question. On any point in X, f and g either agree or disagree. Define the agreement set
A = {x ∈ X : f(x) = g(x)}.
The question now becomes: How does the frequency of the examples in A relate to the probability of A? Let π be the probability of A, i.e., the probability that f(x) = g(x) on a point x picked from X according to the probability distribution P. We can consider each example as a Bernoulli trial (coin flip) with probability π of success (f = g) and probability 1 - π of failure (f ≠ g). With N examples, we have N independent, identically distributed Bernoulli trials. Let n be the number of successes (n is a random variable), and let ν = n/N be the frequency of success. Bernoulli's theorem states that, by taking N sufficiently large, ν can be made arbitrarily close to π with very high probability. In other words, if you take enough examples, the frequency of success will be a good estimate of the probability of success. Notice that this does not say anything about the probability of success itself, but rather about how the probability of success can be estimated from the frequency of success. If on the examples we get 90% right, we
should get about 90% right overall. If we get only 10% right, we should continue to get about the same. We are only predicting that the results of the experiment with the examples will persist, provided there are enough examples. How does this case relate to learning and generalization? After all, we do not make a blind guess when we learn, but rather construct a hypothesis from the examples. However, at a closer look, we find that we do make a guess, not of a hypothesis but of a set of hypotheses. For example, when the backpropagation algorithm (Rumelhart et al. 1986) is used in a feedforward network, we are implicitly guessing that there is a good hypothesis among those that are obtained by setting the weights of the given network in some fashion. The set of hypotheses G would then be the set of all functions g that are obtained by setting the weights of the network in any fashion. Therefore, when learning deals with a limited domain of representation, such as a given network with free weights, we in effect make a guess G of hypotheses. The learning algorithm then picks a hypothesis g ∈ G that mostly agrees with f on the examples. The question of generalization now becomes: Does this choice, which is based on the behavior on the examples, hold in general? We can approach this question in a way similar to the previous case. We define, for every g ∈ G, the agreement set

A_g = {x ∈ X : f(x) = g(x)}.
These sets are different for different g's. Let π_g be the probability of A_g, i.e., the probability that f(x) = g(x) on a point x picked from X according to the probability distribution P, for the particular g ∈ G in question. We can again define random variables n_g (the number of successes with respect to different g's) and the frequencies of success ν_g = n_g/N. At this point the problem looks exactly the same as the previous one, and one may expect the same answer. There is one important difference. In the simple Bernoulli case, the issue was whether ν converged to π. In the new case, the issue is whether the ν_g's converge to the π_g's in a uniform manner as N becomes large. In the learning process, we decide on one g but not another based on the values of ν_g. If we had the ν_g's converge to the π_g's, but not in a uniform manner, we could be fooled by one erratic g. For example, we may be picking the hypothesis g with the maximum ν_g. With nonuniform convergence, the g we pick can have a poor π_g. We want the probability that there is some g ∈ G such that ν_g differs significantly from π_g to be very small. This can be expressed formally as

Pr [ sup_{g ∈ G} |ν_g - π_g| > ε ] is small,

where sup denotes the supremum.
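The gap between the single-hypothesis Bernoulli case and the supremum over a whole hypothesis set can be seen in a small simulation. In this Python sketch the agreement probabilities and the number of hypotheses are illustrative assumptions of ours:

```python
import random

random.seed(1)

def nu(pi, n_examples):
    # Frequency of success in n_examples Bernoulli trials with success probability pi.
    return sum(random.random() < pi for _ in range(n_examples)) / n_examples

# Single fixed hypothesis: nu converges to pi as N grows (Bernoulli's theorem).
for N in (10, 100, 10_000):
    print(N, nu(0.9, N))

# Many hypotheses, each with true agreement pi_g = 0.5 (pure chance):
# picking the g with the maximum nu_g on few examples is badly misleading.
N, num_hypotheses = 20, 1000
best_nu = max(nu(0.5, N) for _ in range(num_hypotheses))
print(best_nu)  # well above 0.5, so sup over g of |nu_g - pi_g| is large
```

Each individual ν_g is a fine estimate of its π_g, yet the maximum over a thousand of them is systematically optimistic; this is exactly why the convergence must be uniform over G.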
3 The V-C Dimension
A condition for uniform convergence, hence generalization, was found by Vapnik and Chervonenkis (1971). The key is the inequality

Pr [ sup_{g ∈ G} |ν_g - π_g| > ε ] ≤ 4 m(2N) e^(-ε²N/8),
where m is a function that depends on G. We want the right-hand side of the inequality to be small for large N, in order to achieve uniform convergence. The factor e^(-ε²N/8) is very helpful, since it is exponentially decaying in N. Unless the factor m(2N) grows too fast, we should be OK. For example, if m(2N) is polynomial in N, the right-hand side will go to zero as N goes to infinity. What is the function m? It depends on the set of hypotheses G. Intuitively, m(N) measures the flexibility of G in expressing an arbitrary concept on N examples. For instance, if G contains enough hypotheses to be able to express any concept on 100 examples, one should not really expect any generalization with only 100 examples, but rather a memorization of the concept on the examples. On the other hand, if gradually more and more concepts cannot be expressed by any hypothesis in G as N grows, then the agreement on the examples means something, and generalization is probable. Formally, m(N) measures the maximum number of different binary functions on the examples x_1, ..., x_N induced by the hypotheses g_1, g_2, ... ∈ G. For example, if X is the real line and G is the set of rays of the form x ≤ a, i.e., functions of the form

g(x) = 1 if x ≤ a, and g(x) = 0 if x > a,

then m(N) = N + 1. The reason is that on N points one can define only N + 1 different functions of the above form by sliding the value of a from left of the leftmost point all the way to right of the rightmost point. There are two simple facts about the function m. First, m(N) ≤ |G| (where |·| denotes the cardinality), since G cannot induce more functions than it has. This fact is useful only when G is a finite set of hypotheses. The second fact is that m(N) ≤ 2^N, since G cannot induce more binary functions on N points than there are binary functions on N points. Indeed, there are choices of G (trivially the set of all hypotheses on X) for which m(N) = 2^N. For those cases, the V-C inequality does not guarantee uniform convergence. The main fact about m(N) that helps the characterization of G as far as generalization is concerned is that m(N) is either identically equal to 2^N for all N, or else is bounded above by N^d + 1 for a constant d. This striking fact can be proved in a simple manner (Cover 1965; Vapnik and Chervonenkis 1971). The latter case implies a polynomial m(N) and guarantees generalization. The value of d matters only in how fast convergence is achieved. This is of practical importance because it determines the number of examples needed to guarantee generalization within given tolerance parameters. The value of d turns out to be the smallest N at which G starts failing to induce all possible 2^N binary functions on any N examples. Thus, the former case can be considered the case d = ∞. d is called the V-C dimension (Baum and Haussler 1989; Blumer et al. 1986).
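The ray example can be checked directly by enumerating the labelings a hypothesis class induces on a sample. This Python sketch is ours (the function name and sample points are illustrative):

```python
def m_rays(points):
    # Number of distinct binary labelings of the points induced by rays
    # g(x) = 1 iff x <= a, as a slides over the real line.
    labelings = set()
    candidates = [min(points) - 1.0] + sorted(points)  # one value of a per "gap"
    for a in candidates:
        labelings.add(tuple(x <= a for x in points))
    return len(labelings)

print(m_rays([0.5, 1.7, 3.2]))   # N + 1 = 4
print(m_rays(list(range(10))))   # N + 1 = 11
```

With rays, the labeling that assigns 1 to the rightmost of two points and 0 to the leftmost is already impossible, so the class stops inducing all 2^N labelings at N = 2, in line with the polynomial growth m(N) = N + 1.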
4 Interpretation
Training a network with a set of examples can be thought of as a process for selecting a hypothesis g with a favorable performance on the examples (large ν_g) from the set G. Depending on the characteristics of G, one can predict how this performance will generalize. This aspect of the characteristics of G is captured by the parameter d, the V-C dimension. If the number of examples N is large enough with respect to d, generalization is expected. This means that maximizing ν_g will approximately maximize π_g, the real indicator of how well the hypothesis approximates the concept. In general, the more flexible (expressive, large) G is, the larger its V-C dimension d. For example, the V-C dimension of feedforward networks grows with the network size (Baum and Haussler 1989); the total number of weights in a one-hidden-layer network is an approximate lower bound for the V-C dimension of the network. While a bigger network stands a better chance of being able to implement a given function, its demands on the number of examples needed for generalization are bigger. These are often conflicting criteria. The V-C dimension indicates only the likelihood of generalization, that is, whether the behavior on the examples is going to persist, for better or for worse. The ability of the network to approximate a given function in principle is a separate issue. The running time of the learning algorithm is a key concern (Judd 1988; Valiant 1984). As the number of examples increases, the running time generally increases. However, this dependency is a minor one. Even with few examples, an algorithm may need an excessive amount of time to manipulate the examples into a hypothesis. The independence of this complexity issue from the above discussion regarding information is apparent. Without a sufficient number of examples, no algorithm, slow or fast, can produce a good hypothesis.
Yet a sufficient number of examples is of little use if the computational task of digesting the examples into a hypothesis proves intractable.
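The practical import of the V-C inequality can be sketched by solving it numerically for N: given a polynomial growth function, how many examples make the right-hand side smaller than δ? In this Python sketch, the substitution m(2N) ≤ (2N)^d + 1 and the specific parameter values are our illustrative assumptions:

```python
import math

def examples_needed(d, eps, delta):
    # Smallest N with 4 * m(2N) * exp(-eps^2 * N / 8) <= delta,
    # taking m(2N) <= (2N)**d + 1 as in the polynomial case.
    N = 1
    while 4 * ((2 * N) ** d + 1) * math.exp(-eps ** 2 * N / 8) > delta:
        N += 1
    return N

for d in (1, 3, 10):
    print(d, examples_needed(d, eps=0.1, delta=0.05))
```

The required N grows with d, which quantifies the conflicting criteria above: a more expressive class stands a better chance of containing a good hypothesis, but demands more examples before its performance on the examples can be trusted.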
Acknowledgments
The support of the Air Force Office of Scientific Research under Grant AFOSR-88-0213 is gratefully acknowledged.

References

Baum, E.B., and Haussler, D. 1989. What size network gives valid generalization. Neural Comp. 1, 151-160.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. 1986. Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension. Proc. ACM Symp. Theory Computing 18, 273-282.
Cover, T.M. 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electronic Comput., 326-334.
Judd, J.S. 1988. On the complexity of loading shallow neural networks. J. Complex. 4, 177-192.
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1. MIT Press, Cambridge, MA.
Valiant, L.G. 1984. A theory of the learnable. Commun. ACM 27, 1134-1142.
Vapnik, V.N., and Chervonenkis, A. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob. Appl. 16, 264-280.
Received 6 July 1989; accepted 25 July 1989
NOTE
Communicated by Ellen Hildreth
Linking Linear Threshold Units with Quadratic Models of Motion Perception

Humbert Suarez
Christof Koch
Computation and Neural Systems Program, 216-76, California Institute of Technology, Pasadena, CA 91125 USA
Behavioral experiments on insects (Hassenstein and Reichardt 1956; Poggio and Reichardt 1976) as well as psychophysical evidence from human studies (Van Santen and Sperling 1985; Adelson and Bergen 1985; Watson and Ahumada 1985) support the notion that short-range motion perception is mediated by a system with a quadratic type of nonlinearity, as in correlation (Hassenstein and Reichardt 1956), multiplication (Torre and Poggio 1978), or squaring (Adelson and Bergen 1985). However, there is little physiological evidence for quadratic nonlinearities in directionally selective cells. For instance, the response of cortical simple cells to a moving sine grating is half-wave instead of full-wave rectified, as it should be for a quadratic nonlinearity (Movshon et al. 1978; Holub and Morton-Gibson 1981), and is linear for low contrast (Holub and Morton-Gibson 1981). Complex cells have full-wave rectified responses, but are also linear in contrast. Moreover, a detailed theoretical analysis of possible biophysical mechanisms underlying direction selectivity concludes that most do not have quadratic properties except under very limited conditions (Grzywacz and Koch 1987). Thus, it is presently mysterious how a system can show quadratic properties while its individual components do not. We briefly discuss here a simple population encoding scheme offering a possible solution to this problem. We assume a population of n directionally selective cells whose output is zero if the "somatic potential" x is below a certain threshold x_T and whose output is linear for small x above this value,

f(x) = α [x - x_T],

where [x] = x if x > 0 and 0 otherwise (Fig. 1a). We do not consider further the mechanism generating direction selectivity but will assume that the perceptual response to motion R is given by the sum of the responses of a large number of these neurons.
Thus, if the moving stimulus induces the intracellular response x in all n cells, we have

R = Σ_{i=1}^{n} α⌊x − x_{T,i}⌋    (0.1)

Neural Computation 1, 318-320 (1989) © 1989 Massachusetts Institute of Technology
Figure 1: (a) Schematic input-output relationship of a highly idealized directionally selective cell. If the "somatic potential" x is below a threshold x_T (here 2), the cell remains silent; above this threshold the output of the cell is α(x − x_T), and saturates for x = x_T + x_m. For the simulations described here, we use x_m = 4 and α = 1. (b) The sum R of the responses for a group of 50 such units with x_T uniformly distributed between x = 1 and x = 3 (see arrows). R is quadratic for small values of x and saturates for large values. The dashed curve is 12.5(x − 1)² and corresponds to the expected mean of R for thresholds uniformly distributed between 1 and 3.
If x_T is the same for all cells, R = nα(x − x_T) as long as x > x_T. However, we will now assume that the threshold varies randomly from cell to cell, let us say distributed uniformly between x_{T1} and x_{T2}. If x falls within these values, the function α(x − x_T) is randomly sampled across this interval prior to summation, and the system then simply computes the area below f(x), similar to Monte-Carlo integration methods. For a sufficiently large value of n, we then have that R is proportional to (x − x_{T1})² [in general, if f(x) is an mth order polynomial, R will be proportional to (x − x_{T1})^{m+1}]. Alternatively, if x_T were constant for all cells while the "somatic potential" of the neurons was distributed uniformly randomly between x_{T1} and x_{T2} for a given input, the same quadratic behavior in x would be obtained. In particular, this random variation could be obtained by summing over a population of cells that is broadly tuned for the direction of motion with a certain distribution of preferred directions.
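A minimal simulation of this averaging argument (the population size, threshold range, and gain follow Figure 1; the random seed is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50                                # number of threshold-linear units
alpha = 1.0                           # gain above threshold
x_t = rng.uniform(1.0, 3.0, size=n)   # thresholds uniform on [x_T1, x_T2] = [1, 3]

def population_response(x):
    """Sum of n threshold-linear units: R(x) = sum_i alpha * max(x - x_Ti, 0)."""
    return alpha * np.maximum(x - x_t, 0.0).sum()

# For x_T1 < x < x_T2 the random sampling of thresholds approximates the integral
#   n/(x_T2 - x_T1) * integral_{x_T1}^{x} alpha*(x - t) dt = 12.5*(x - 1)**2,
# i.e. a quadratic response built entirely from linear threshold elements.
for x in (1.5, 2.0, 2.5, 3.0):
    print(x, population_response(x), 12.5 * (x - 1.0) ** 2)
```

With only 50 units the match to the quadratic is already close; the residual scatter is the Monte-Carlo sampling error, which shrinks as n grows.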
In all cases, for values of x much higher than x_{T2}, the output R will grow linearly, since every unit is then above threshold. Finally, more realistic neurons saturate at some output value, that is, f(x) = αx_m for x > x_T + x_m; R will then saturate also. The output of this simple system thus approximates a square function for motion in the preferred direction over a range of positive input values (Fig. 1b). By using this averaging technique as well as ON- and OFF-rectified "neurons," systems that show quadratic behavior, including full-wave rectification, could in principle be built out of linear threshold units, thereby linking the properties of single cells with the observed behavioral responses. It is rather elegant that this can be accomplished solely by taking into account the random variations in neuronal properties. Note that detailed simulations of more realistic neuronal models are needed to verify the applicability of this mechanism to biological visual systems.
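The full-wave construction can be sketched by pairing two such populations, one driven by x (ON) and one by −x (OFF); the mirrored-threshold pairing and shared threshold sample are our illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 50
x_t = rng.uniform(1.0, 3.0, n)   # thresholds shared by both half-populations

def half_wave(u):
    """ON-type population: sum of threshold-linear units driven by u."""
    return np.maximum(u - x_t, 0.0).sum()

def full_wave(x):
    """ON population sees x, OFF population sees -x: response quadratic in |x|."""
    return half_wave(x) + half_wave(-x)

# full_wave is symmetric in x, approximating ~12.5*(|x| - 1)**2 for 1 < |x| < 3,
# even though every individual unit is only half-wave rectified.
```

Each half-population on its own reproduces the half-wave rectification seen in simple cells; only the sum shows the full-wave quadratic behavior required by the perceptual models.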
References

Adelson, E.H., and Bergen, J.R. 1985. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A2, 284-299.
Grzywacz, N.M., and Koch, C. 1987. Functional properties of models for direction selectivity in retina. Synapse 1, 417-434.
Hassenstein, B., and Reichardt, W. 1956. Systemtheoretische Analyse der Zeit-, Reihenfolgen- und Vorzeichenauswertung bei der Bewegungsperzeption des Rüsselkäfers Chlorophanus. Z. Naturforschung 11b, 513-524.
Holub, R.A., and Morton-Gibson, M. 1981. Response of visual cortical neurons of the cat to moving sinusoidal gratings: Response-contrast functions and spatiotemporal interactions. J. Neurophysiol. 46, 1244-1259.
Movshon, J.A., Thompson, I.D., and Tolhurst, D.J. 1978. Spatial summation in the receptive fields of simple cells in the cat's striate cortex. J. Physiol. 283, 53-88.
Poggio, T., and Reichardt, W.E. 1976. Visual control of orientation behavior in the fly: Part II: Towards the underlying neural interactions. Q. Rev. Biophys. 9, 377-438.
Torre, V., and Poggio, T. 1978. A synaptic mechanism possibly underlying directional selectivity to motion. Proc. R. Soc. London Ser. B 202, 409-416.
Van Santen, J.P.H., and Sperling, G. 1985. Elaborated Reichardt detectors. J. Opt. Soc. Am. A2, 300-320.
Watson, A.B., and Ahumada, A.J. 1985. Model of human visual-motion sensing. J. Opt. Soc. Am. A2, 322-341.
Received 6 April 1989; accepted 4 July 1989.
NOTE
Communicated by Terrence Sejnowski
Neurogammon Wins Computer Olympiad Gerald Tesauro IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA
Neurogammon 1.0 is a backgammon program which uses multilayer neural networks to make move decisions and doubling decisions. The networks learned to play backgammon by backpropagation training on expert data sets. At the recently held First Computer Olympiad in London, Neurogammon won the backgammon competition with a perfect record of five wins and no losses, thereby becoming the first learning program ever to win a tournament.
Neural network learning procedures are being widely investigated for many classes of practical applications. Board games such as chess, go, and backgammon provide a fertile testing ground because performance measures are clear and well defined. Furthermore, expert-level play can be of tremendous complexity. Learning programs have been studied in games environments for many years, but heretofore have not reached significant levels of performance. Neurogammon 1.0 represents the culmination of previous research in backgammon learning networks (Tesauro and Sejnowski 1989; Tesauro 1988; Tesauro 1989) in the form of a fully functioning program. Neurogammon contains one network which makes doubling cube decisions and a set of six networks which make move decisions in different phases of the game. Each network has a standard fully connected feedforward architecture with a single hidden layer, and was trained by the well-known backpropagation algorithm (Rumelhart et al. 1986). The move-making networks were trained on a set of positions from 400 games in which the author played both sides. A "comparison paradigm," described in (Tesauro 1989), was used to teach the networks that the move selected by the expert should score higher than each of the other possible legal moves. The doubling network was trained on a separate set of about 3000 positions which were classified according to a crude nine-point ranking scale of doubling strength. The training of each network proceeded until maximum generalization performance was obtained, as measured by performance on a set of test positions not used in training. The resulting program appears to play at a substantially higher level than conventional backgammon programs. At the Computer Olympiad in London, held on August 9-15, 1989, and organized by David Levy,

Neural Computation 1, 321-323 (1989) © 1989 Massachusetts Institute of Technology
Neurogammon competed against five other opponents: three commercial programs (Video Gammon/USA, Mephisto Backgammon/W. Germany, and Saitek Backgammon/Netherlands) and two non-commercial programs (Backbrain/Sweden and AI Backgammon/USA). Hans Berliner's BKG program was not entered in the competition. In matches to 11 points, Neurogammon defeated Video Gammon by 12-7, Mephisto by 12-5, Saitek by 12-9, Backbrain by 11-4, and AI Backgammon by 16-1, to take the gold medal in the backgammon competition. Also, in unofficial matches to 15 points against two other commercial programs, Fidelity Backgammon Challenger and Sun Microsystems' Gammontool, Neurogammon won by scores of 16-3 and 15-8 respectively. There were also a number of unofficial matches against intermediate-level humans at the Olympiad. Neurogammon won three of these and lost one. Finally, in an official challenge match on the last day of the Olympiad, Neurogammon put up a good fight but lost to a human expert, Ossi Weiner of West Germany, by a score of 2-7. Weiner said that he was surprised at how much the program plays like a human, how rarely it makes mistakes, and that he had to play extremely carefully in order to beat it. In summary, Neurogammon's victory at the Computer Olympiad demonstrates, along with similar recent advances in fields such as speech recognition (Lippmann 1989) and optical character recognition (Le Cun et al. in press), that neural networks can be practical learning devices for tackling hard computational tasks. It also suggests that machine learning procedures of this type might be useful in other games. However, there is still much work to be done both in extracting additional information from the data sets within the existing approach, as well as in developing new approaches such as unsupervised learning based on outcome, which would supplement what can be achieved with supervised learning from expert data.
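The "comparison paradigm" used for the move-making networks can be sketched as a pairwise training rule: a scoring network is nudged so that the position reached by the expert's move scores higher than each legal alternative. The feature size, network size, and logistic pairwise loss below are our illustrative assumptions, not Neurogammon's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

n_in, n_hid = 8, 4                       # toy feature and hidden-layer sizes
W1 = rng.normal(0, 0.1, (n_hid, n_in))   # hidden weights
w2 = rng.normal(0, 0.1, n_hid)           # output weights

def score(pos):
    """Scalar evaluation of a board position (feature vector)."""
    return w2 @ np.tanh(W1 @ pos)

def grads(pos):
    """Gradients of score(pos) with respect to w2 and W1."""
    h = np.tanh(W1 @ pos)
    return h, np.outer(w2 * (1 - h ** 2), pos)

def train_pair(expert_pos, other_pos, lr=0.1):
    """One gradient step pushing score(expert_pos) above score(other_pos)."""
    global W1, w2
    d = score(expert_pos) - score(other_pos)
    g = 1.0 / (1.0 + np.exp(d))          # weight from the -log sigmoid(d) loss
    he, Ge = grads(expert_pos)
    ho, Go = grads(other_pos)
    w2 += lr * g * (he - ho)
    W1 += lr * g * (Ge - Go)

expert = rng.normal(size=n_in)                        # expert's chosen move
others = [rng.normal(size=n_in) for _ in range(5)]    # alternative legal moves
for _ in range(200):
    for o in others:
        train_pair(expert, o)
```

After training, the expert's position outscores every alternative, which is all the comparison paradigm requires: only the ranking of moves matters, not absolute scores.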
References

Tesauro, G., and Sejnowski, T.J. 1989. A parallel network that learns to play backgammon. Artificial Intelligence 39, 357-390.
Tesauro, G. 1988. Neural network defeats creator in backgammon match. Tech. Rep. no. CCSR-88-6, Center for Complex Systems Research, University of Illinois at Urbana-Champaign.
Tesauro, G. 1989. Connectionist learning of expert preferences by comparison training. In D. Touretzky, (Ed.), Advances in Neural Information Processing Systems, 99-106. Morgan Kaufmann Publishers.
Rumelhart, D.E., et al. 1986. Learning representations by back-propagating errors. Nature 323, 533-536.
Lippmann, R.P. 1989. Review of neural networks for speech recognition. Neural Comp. 1, 1-38.
LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., and Jackel, L.D. (in press). Backpropagation applied to handwritten zip code recognition. Neural Computation.
Received 30 August 1989; accepted 30 August 1989.
Communicated by Christof Koch
Surface Interpolation in Three-Dimensional Structure-from-Motion Perception Masud Husain Stefan Treue Richard A. Andersen Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Although it is appreciated that humans can use a number of visual cues to perceive the three-dimensional (3-D) shape of an object, for example, luminance, orientation, binocular disparity, and motion, the exact mechanisms employed are not known (De Yoe and Van Essen 1988). An important approach to understanding the computations performed by the visual system is to develop algorithms (Marr 1982) or neural network models (Lehky and Sejnowski 1988; Siegel 1987) that are capable of computing shape from specific cues in the visual image. In this study we investigated the ability of observers to see the 3-D shape of an object using motion cues, so-called structure-from-motion (SFM). We measured human performance in a two-alternative forced choice task using novel dynamic random-dot stimuli with limited point lifetimes. We show that the human visual system integrates motion information spatially and temporally (across several point lifetimes) as part of the process for computing SFM. We conclude that SFM algorithms must include surface interpolation to account for human performance. Our experiments also provide evidence that local velocity information, and not position information derived from discrete views of the image (as proposed by some algorithms), is used to solve the SFM problem by the human visual system.

1 Introduction
Recovering the three-dimensional (3-D) structure of a moving object from its two-dimensional (2-D) projection is considered an "ill-posed" problem (Poggio and Koch 1985) since there are an infinite number of interpretations of a given 2-D pattern of motion. Several elegant algorithms have been formulated for computing SFM, each using a number of constraints to restrict the number of possible solutions (Ullman 1979, 1984; Longuet-Higgins and Prazdny 1980; Hoffman 1982; Grzywacz and Hildreth 1987). None of them use surface interpolation. Rather, these algorithms compute the relative position of isolated points.

Neural Computation 1, 324-333 (1989) © 1989 Massachusetts Institute of Technology

Existing schemes therefore
require that the tracked points on an object must be present continuously over the entire duration of the SFM computation. This leads to the powerful prediction that if the visual system is forced to sample a new set of points on the same object, the old set of points should not improve the perception of SFM since a new model of the object would have to be computed with each new set of sample points. An alternative approach to solving the SFM problem is to compute a 3-D surface representation by interpolating a surface between the sample points (Andersen and Siegel 1988). Such a scheme would sample the movement of as many points as possible across the surfaces of the object, and interpolate locally across these measurements to compute a continuous surface. New sets of points can easily be integrated into the representation and thereby improve its accuracy while the information of disappearing points is preserved in the interpolated surface. (We apply the term “surface interpolation” in a general way since the surface could be generated in physical as well as in velocity space.)
2 Experiments
In our experiments we examined how the human visual system performs the SFM computation when confronted with continuously changing sets of sample points. We used novel "structured" and "unstructured" dynamic random-dot stimuli with limited point lifetimes (Morgan and Ward 1980; Zucker 1984; Siegel and Andersen 1988). The structured stimulus was computed from the parallel projection of points covering the surface of a transparent rotating cylinder (Fig. 1). All subjects, whether naive or experienced, have reported the perception of a revolving hollow cylinder when viewing the structured display. The unstructured stimulus was generated by randomly displacing the velocity vectors present in the structured display within the boundaries of the stimulus, thereby conserving the population of vectors but destroying the spatial relationship between them (see Siegel and Andersen 1988). Each point was displayed for a "lifetime" of only 100 msec (7 frames), after which it was replotted randomly at another position on the surface of the cylinder. In the first frame, points were randomly assigned positions in their life cycle. Thus, between two given frames of the stimulus only about 15% of the points "died" and were randomly replotted ("desynchronized case"). Under these conditions, using a reaction time task, we have found that subjects detect the change from an unstructured to a structured display reliably (> 80% correct) but take as much as 900 msec to react, as shown in figure 2. This observation would suggest that the computation of SFM builds up over time and that new points can be integrated into the representation, which is partially computed by the old points. Unfortunately, it is not possible to determine from these data how much of this reaction time is needed as visual input and how much of it is computation time in the brain or motor reaction time. This is an important question since performance should improve when the stimulus is seen for longer than the point lifetime if surface interpolation is used.

Figure 1: Schematic description of stimulus creation. All movies were created off-line on a PDP 11/73 computer that was also used to run the experiments. For the structured stimulus (E) 126 or 12 points were first randomly plotted on a two-dimensional surface (A). They were then parallel projected onto the surface of a transparent cylinder that was rotated at an angular rotation rate of 35° sec⁻¹ (B). Each point existed for a predetermined point lifetime after which it was randomly repositioned. The moving points were then parallel projected onto the two-dimensional CRT screen (HP 13118; P31 phosphor) (C) that was viewed by two highly trained observers (D) (ST and MH). The resulting velocity distribution in the structured stimulus is sinusoidal along any horizontal line across the stimulus, with the fastest speeds in the center of the display. The unstructured stimulus (F) was created by randomly shuffling the paths of the points in the structured display. Observers viewed the display binocularly from a distance of 57 cm; the stimuli subtended 6° of visual angle at the eye. The display rate was 70 Hz and mean luminance was 1 cd m⁻².
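The structured-stimulus geometry can be sketched in a few lines. The dot count, rotation rate, frame rate, and point lifetime follow the text; the unit cylinder radius and the exact replotting scheme are our simplifications:

```python
import numpy as np

rng = np.random.default_rng(2)

# Dots live at angle theta and height y on a transparent cylinder rotating at a
# fixed angular rate; parallel projection keeps x = cos(theta), so horizontal
# speed is sinusoidal across the display, fastest at the center.
n_dots, lifetime, frame_dt = 126, 7, 1.0 / 70.0   # 7 frames at 70 Hz ~ 100 msec
omega = np.deg2rad(35.0)                          # 35 deg/sec rotation

theta = rng.uniform(0, 2 * np.pi, n_dots)         # angular position on cylinder
y = rng.uniform(-1, 1, n_dots)                    # height on cylinder
age = rng.integers(0, lifetime, n_dots)           # desynchronized life cycles

def next_frame():
    """Advance the rotation one frame; replot dots whose lifetime has expired."""
    global theta, y, age
    theta = (theta + omega * frame_dt) % (2 * np.pi)
    age += 1
    dead = age >= lifetime
    theta[dead] = rng.uniform(0, 2 * np.pi, dead.sum())
    y[dead] = rng.uniform(-1, 1, dead.sum())
    age[dead] = 0
    return np.column_stack([np.cos(theta), y])    # parallel projection (x, y)

frames = [next_frame() for _ in range(70)]        # one second of stimulus
```

Because ages are initialized randomly, only about 1/7 of the dots are replotted on any given frame, reproducing the "desynchronized case" described above; setting all ages equal instead would give the synchronized displays.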
2.1 Perceptual Buildup and Surface Interpolation. In order to address this issue we presented equal numbers of structured and unstructured stimuli of 40 to 1700 msec duration in random order and asked subjects to indicate in a two-alternative forced-choice paradigm whether they saw a rotating cylinder or an unstructured noise pattern. Figure 3 (filled symbols) shows that performance peaked only after viewing stimuli longer than 5 times the point lifetime (> 500 msec), being hardly above chance after one point lifetime. Current algorithms (which do not use surface interpolation) would not have predicted improved performance when viewing stimuli of more than one point lifetime. It could be argued, however, that the visual system selects a number of points from the display and needs to track their relative positions as a group for the duration of their lifetime. Since in our stimulus the points are not synchronized it is very unlikely that all the points in such a group are at the same point in their life cycle, that is, their onsets and offsets do not occur at the same time. So, because groups of dots constantly form
[Figure 2 plot: reaction time (msec, 0-1000) versus point lifetime; an arrow and dotted line mark the point lifetime used for the 2AFC task (98 msec).]
Figure 2: Reaction time for detecting the transition from an "unstructured" to a "structured" cylinder. Observers were shown movies that started with the unstructured version of the cylinder, which after an unpredictable time changed into the structured display. The task was to press a key as soon as the structured cylinder was detected. (For further details see Siegel and Andersen 1988). The arrow and dotted line indicate the point lifetime used for the two-alternative forced-choice experiments described in the text. The regression line is a best-fit third-order polynomial. Each data point represents the mean of about 100 trials.
Figure 3: Percentage accuracy in a two-alternative forced-choice paradigm plotted as a function of duration of display in multiples of point lifetime (point lifetime was kept at 100 msec). Observers were shown movies of different duration containing either the cylinder or the unstructured stimulus and were asked to distinguish between them. The dots in the display were either desynchronized (open symbols), or the onsets and offsets of all the dots were synchronized (filled symbols). Note that in both cases peak performance is not reached until over 5 times the lifetime of each point, that is, > 490 msec. The regression lines are best-fit fourth-order polynomials (r > 0.97 for both). Each data point represents the mean of 200 to 600 trials.

and dissolve, it might be argued that it simply takes a long time before one finds a group in which all the dots are "in phase." Therefore, we asked our subjects to view displays in which all the points appeared and disappeared together, that is, they were synchronized. Figure 3 (open symbols) shows that performance was indistinguishable from the desynchronized case. Another important consideration is that a surface interpolation may be used only when the high density of dots in the stimulus already perceptually constitutes a surface. Under different conditions, when the points are not dense enough to constitute an apparent surface by themselves, an alternative algorithm might be used. To investigate if the perceptual buildup we observed occurs only with a high density of dots, we decreased the number of points to less than a tenth of the original 126 points. Figure 4 (open symbols) shows that the time course using 12
points is even longer, with performance peaking only after more than 10 point lifetimes. To control for the possibility that the buildup in performance is not due to the presentation of new points but to some other effect we performed another experiment. We showed stimuli of the same duration in which the 12 points, after living through their first lifetime, were not randomly replotted but repositioned to the location they originally occupied at the beginning of the movie. They then moved through the same path as before and at the end of their lifetime were again replotted at their original starting position, thus beginning the cycle again. These "oscillating" stimuli therefore contained the same number of points with the same point lifetime as used in the previous experiment, but after the passage of the first point lifetime they contained no new information. The results are plotted in figure 4 (filled symbols). It is evident that subjects did not perform above chance under these conditions. Thus, the visual system can improve its performance dramatically when presented with
Figure 4: Percentage accuracy plotted as a function of display duration when 12 desynchronized points were used (point lifetime again 100 msec). Open symbols show the results from the experiment comparable to Figure 3. In this case perceptual buildup is more gradual and long-lasting. Peak performance is not reached until a stimulus length of more than 10 point lifetimes, that is, over 1 sec. The regression line is a best-fit third-order polynomial (r = 0.96). Filled symbols show the results from the experiment in which points were replotted to their original position at the end of their point lifetimes (for details see text).
new sets of points and this is not due to a requirement to view stimuli for an extended period of time. These results suggest that the brain uses surface interpolation in computing the shape of 3-D surfaces from motion. As predicted, the accuracy of the object representation rises to some maximum value with the presentation of new data points, and the performance of the system is not influenced by whether the points are synchronized or not (cf. Fig. 3). Moreover, given fewer points, it predictably takes longer to compute an accurate surface representation (cf. Fig. 4). As expected, the surface representation integrates information over space, since performance was better with larger numbers of points, and over time, since several point lifetimes were required for the computation.

2.2 Position- versus Velocity-Based Computation. A second issue is whether the visual system measures position or local velocities in computing SFM. Position-based algorithms sample position information derived from a few discrete image views of a moving object and attempt to reach a rigid 3-D interpretation from the 2-D sample frames (Ullman 1984; Grzywacz and Hildreth 1987; Grzywacz et al. 1988). Velocity-based algorithms measure the local velocities of points on an image and use the global velocity field to compute 3-D SFM (Longuet-Higgins and Prazdny 1980; Hoffman 1982; Grzywacz and Hildreth 1987). To date, neither position- nor velocity-based algorithms have used surface interpolation, and all velocity-based algorithms have used instantaneous velocity, whereas the nervous system requires 50-80 msec to measure velocity accurately (McKee and Welch 1985; Nakayama 1985). A modified position-based scheme could incorporate measurements from new sets of points to improve performance by smoothing over the computed 3-D locations of points to interpolate a surface (E.C. Hildreth and S. Ullman, personal communications).
However, there are several reasons to believe the nervous system uses a velocity-based algorithm with surface interpolation. In our displays the angular extent of the individual movements is quite small, approximately 3.5°, since they are of finite point lifetime. Position-based algorithms require large displacements of 30-50° (Grzywacz and Hildreth 1987). Other experimental support from our laboratory for the velocity-based surface scheme comes from the finding that the minimum point lifetime required for perceiving SFM (Treue et al. 1988) corresponds to the minimum viewing time required to measure accurately the velocity of a moving stimulus (McKee and Welch 1985). This correspondence is preserved with changes in stimulus velocity: the point lifetime threshold falls in parallel for both tasks as velocity is increased. This correlation is further strengthened by the fact that subjects can detect motion in our displays with point lifetimes lower than the ones required for comparable performance in detecting SFM, suggesting that the perception of motion per se is not sufficient but that an accurate velocity field has to be measured. Finally,
our laboratory as well as other investigators have shown that lesions of area MT, a region in primate visual cortex that contains neurons tuned to global stimulus direction and velocity (Movshon et al. 1985; Allman et al. 1985), impair perception of both coherent motion (Newsome and Paré 1988) and SFM (Siegel and Andersen 1986, 1988).

3 General Discussion
There are two possible levels at which a surface interpolation of the velocity field might occur. One is at a 2-D level in which the velocities of points moving on the 2-D retinal image are computed and an interpolation process fills in to form a dense 2-D velocity field from which a 3-D interpretation will be computed by a later process. In the second possibility the 3-D surface is immediately computed from the local 2-D velocities and the interpolation process operates on the 3-D image representation. At present we do not have evidence to distinguish between these two possibilities. A large number of algorithms for 2-D velocity measurement have been proposed that perform some velocity integration, averaging, or smoothing (Hildreth and Koch 1987; Horn and Schunck 1981; Zucker and Iverson 1986; Yuille and Grzywacz 1988; Bülthoff et al. 1989). Some of these algorithms have also been implemented in neural networks (Wang et al. 1989). Since all these algorithms integrate motion over local spatial neighborhoods they can account for a number of perceptual phenomena. Unfortunately, they cannot deal with transparent objects such as our rotating cylinder, since vectors (with opposing directions) from the front and rear surfaces would be assigned to one surface, and the averaging of velocities over a patch would yield zero velocity. Evidently, an additional requirement for the successful application of these algorithms to transparent objects is the segregation of surfaces prior to the smoothing operation. For our stimulus, a simple solution is to assign motion in one direction to one surface. To investigate this issue we are presently recording from visual cortex in awake macaque monkeys. Preliminary results indicate that transparent motions in different directions are already separated at the level of V1 (Erickson et al. 1989).
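The failure of naive velocity averaging on transparent motion, and the benefit of segregating directions before smoothing, can be illustrated with a one-dimensional toy example (sign-based segregation is a minimal stand-in for true surface segregation):

```python
import numpy as np

rng = np.random.default_rng(3)

# In a patch containing both the front and rear surfaces of the cylinder,
# leftward and rightward velocity vectors coexist. Averaging over the whole
# patch cancels them; segregating by direction first recovers both surfaces.
v_front = np.full(50, +1.0) + rng.normal(0, 0.1, 50)   # rightward vectors
v_rear = np.full(50, -1.0) + rng.normal(0, 0.1, 50)    # leftward vectors
patch = np.concatenate([v_front, v_rear])

naive = patch.mean()              # both motions cancel to ~0
right = patch[patch > 0].mean()   # front surface recovered, ~+1
left = patch[patch < 0].mean()    # rear surface recovered, ~-1
print(naive, right, left)
```

The naive average is near zero even though no dot in the patch is near zero velocity, which is exactly why smoothing must follow, not precede, surface segregation for transparent objects.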
Acknowledgments

We are grateful to Shabtai Barash, Martyn Bracewell, Roger Erickson, Norberto Grzywacz, Ellen Hildreth, and Shimon Ullman for their comments on earlier drafts of this manuscript. This work was supported by grants from the NIH, the Sloan Foundation, and the Whitaker Health Sciences Foundation. M.H. is a Harkness Fellow and S.T. is a Fellow of the Evangelisches Studienwerk Villigst, F.R.G. and is supported by the Educational Foundation of America.
References
Allman, J., Miezin, F., and McGuinness, E. 1985. Stimulus specific responses from beyond the classical receptive field. Ann. Rev. Neurosci. 8, 407-430.
Andersen, R.A., and Siegel, R.M. 1989. Local and global order in perceptual maps. In Signal and Sense, G.M. Edelman, W.E. Gall, and W.M. Cowan, eds., in press. Wiley, New York.
Bülthoff, H., Little, J., and Poggio, T. 1989. A parallel algorithm for real-time computation of optical flow. Nature (London) 337, 549-553.
De Yoe, E.A., and Van Essen, D.C. 1988. Concurrent processing streams in monkey visual cortex. Trends Neurosci. 11, 219-226.
Erickson, R.G., Snowden, R.J., Andersen, R.A., and Treue, S. (in press). Directional neurons in awake rhesus monkeys: Implications for motion transparency. Soc. Neurosci. Abstr.
Grzywacz, N.M., and Hildreth, E.C. 1987. Incremental rigidity scheme for recovering structure from motion: Position-based versus velocity-based formulations. J. Opt. Soc. Am. A4, 503-518.
Grzywacz, N.M., Hildreth, E.C., Inada, V.K., and Adelson, E.H. 1988. The temporal integration of 3-D structure from motion: A computational and psychophysical study. In Organization of Neural Networks, W. von Seelen, G. Shaw, and U.M. Leinhos, eds., pp. 239-259. VCH, Weinheim.
Hildreth, E.C., and Koch, C. 1987. The analysis of visual motion: From computational theory to neuronal mechanisms. Ann. Rev. Neurosci. 10, 477-533.
Hoffman, D.D. 1982. Inferring local surface orientation from motion fields. J. Opt. Soc. Am. 72, 888-892.
Horn, B.K.P., and Schunck, B.G. 1981. Determining optical flow. Artificial Intelligence 17, 185-203.
Lehky, S.R., and Sejnowski, T.J. 1988. Network model of shape-from-shading: Neural function arises from both receptive and projective fields. Nature (London) 333, 452-454.
Longuet-Higgins, H.C., and Prazdny, K. 1980. The interpretation of a moving retinal image. Proc. R. Soc. London Ser. B 208, 385-397.
Marr, D. 1982. Vision. Freeman, San Francisco.
McKee, S.P., and Welch, L. 1985. Sequential recruitment in the discrimination of velocity. J. Opt. Soc. Am. A2, 243-251.
Morgan, M.J., and Ward, R. 1980. Interocular delay produces depth in subjectively moving noise patterns. Q. J. Exp. Psychol. 32, 387-395.
Movshon, J.A., Adelson, E.H., Gizzi, M.S., and Newsome, W.T. 1985. The analysis of moving visual patterns. In Pattern Recognition Mechanisms (Exp. Br. Res. Suppl. 11), C. Chagas, R. Gattas, and C. Gross, eds., pp. 117-151. Springer-Verlag, Heidelberg.
Nakayama, K. 1985. Biological image motion processing: A review. Vision Res. 25, 625-660.
Newsome, W.T., and Paré, E.B. 1988. A selective impairment of motion perception following lesions of the middle temporal visual area (MT). J. Neurosci. 8, 2201-2211.
Poggio, T., and Koch, C. 1985. Ill-posed problems in early vision: From computational theory to analog networks. Proc. R. Soc. London Ser. B 226, 303-323.
Siegel, R.M. 1987. A parallel distributed processing model for the ability to obtain three-dimensional structure from visual motion in monkey and man. Soc. Neurosci. Abstr. 13, 630.
Siegel, R.M., and Andersen, R.A. 1986. Motion perceptual deficits following ibotenic acid lesions of the middle temporal area in the behaving rhesus monkey. Soc. Neurosci. Abstr. 12, 1183.
Siegel, R.M., and Andersen, R.A. 1988. Perception of three-dimensional structure from motion in monkey and man. Nature (London) 331, 259-261.
Treue, S., Husain, M., and Andersen, R.A. 1988. Human perception of 3-D structure from motion: Spatial and temporal characteristics. Soc. Neurosci. Abstr. 14, 1251.
Ullman, S. 1979. The Interpretation of Visual Motion. MIT Press, Cambridge, MA.
Ullman, S. 1984. Maximizing rigidity: The incremental recovery of 3-D structure from rigid and nonrigid motion. Perception 13, 255-274.
Wang, H.T., Mathur, B., and Koch, C. 1989. Computing optical flow in the primate visual system. Neural Comp. 1, 92-103.
Yuille, A.L., and Grzywacz, N.M. 1988. A computational theory for the perception of coherent visual motion. Nature (London) 333, 71-74.
Zucker, S.W. 1984. Type I and Type II processes in early orientation selection. In Figural Synthesis, P.C. Dodwell and T. Caelli, eds., 283-300. Lawrence Erlbaum, London.
Zucker, S.W., and Iverson, L. 1986. From orientation selection to optical flow. Memo CIM-86-2, Computer Vision and Robotics Laboratory, McGill Research Center for Intelligent Machines.
Received 17 April 1989; accepted 16 June 1989.
Communicated by Gordon Shepherd
A Winner-Take-All Mechanism Based on Presynaptic Inhibition Feedback

Alan L. Yuille
Harvard University, Division of Applied Sciences, G12c Pierce Hall, Cambridge, MA 02138 USA
Norberto M. Grzywacz
Center for Biological Information Processing, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, E25-201, Cambridge, MA 02139 USA
A winner-take-all mechanism is a device that determines the identity and amplitude of its largest input (Feldman and Ballard 1982). Such mechanisms have been proposed for various brain functions. For example, a theory of visual velocity estimation (Grzywacz and Yuille 1989) postulates that a winner-take-all mechanism selects the strongest responding cell in the cortex's middle temporal area (MT). This theory proposes circuitry that links the directionally selective cells in the primary visual cortex to MT cells, making them velocity selective. Generally, several velocity cells would respond, but only the winner would determine the perception. In another theory, a winner-take-all mechanism guides the spotlight of attention to the most salient image part (Koch and Ullman 1985). Also, such mechanisms improve the signal-to-noise ratios of VLSI emulations of brain functions (Lazzaro and Mead 1989). Although computer algorithms for winner-take-all mechanisms exist (Feldman and Ballard 1982; Koch and Ullman 1985), good biologically motivated models do not. A candidate for a biological mechanism is lateral (mutual) inhibition (Hartline and Ratliff 1957). In some theoretical mutual-inhibition networks, the inhibition sums linearly with the excitatory inputs and the result is passed through a threshold nonlinearity (Hadeler 1974). However, these networks work only if the difference between winner and losers is large (Koch and Ullman 1985). We propose an alternative network, in which the output of each element feeds back to inhibit the inputs to other elements. The action of this presynaptic inhibition is nonlinear, with a possible biophysical substrate. This paper shows that the new network converges stably to a solution that both relays the winner's identity and amplitude and suppresses information on the losers with arbitrary precision. We prove these results mathematically and illustrate the effectiveness of the network and some of its variants by computer simulations.
Neural Computation 1, 334-347 (1989) © 1989 Massachusetts Institute of Technology
Figure 1: The layout of the presynaptic inhibitory network. The dashed lines represent the inputs to the individual network elements. Each element excites inhibitory elements (similar to interneurons), which act on the presynaptic inputs of the other elements (no self-inhibition). Excitatory and inhibitory synapses are labeled with + and - signs, respectively.

Figure 1 illustrates our winner-take-all network, which consists of a number of elements that feed back to inhibit the presynaptic inputs of each other. Each element would receive a positive input, I_i, from previous neurons if the inhibition was not present. [The network does not have self-inhibition, which if present would lead to a gain control mechanism (Reichardt et al. 1983).] The network is updated by the following equation

τ dx_i/dt = -x_i + I_i K(x_1, x_2, ..., x_{i-1}, x_{i+1}, ..., x_n)   (1.1)
where x_i(t) is the state of the ith network element, τ is a constant, and the function K is symmetric in all its variables, decreases (or remains constant) as they increase, and tends to zero when any of them goes to infinity. The initial values of the x_i are set to I_i. The first term on the right-hand side of equation 1.1 corresponds to a time decay and the second contains the inhibition, which is implemented by the function K. It is not necessary that every function K meeting the above criteria implements a winner-take-all network. Later, we discuss the effectiveness of some biologically motivated examples. Now, the following case is studied:
K(x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) = e^{-λ Σ_{j≠i} x_j}   (1.2)

where λ is a constant. It is now shown that the network described by equation 1.1, using the function K given in equation 1.2, implements a winner-take-all operation. More precisely, we prove the following winner-take-all theorem:
Theorem 1. Given a system described by equations 1.1 and 1.2 with initial conditions x_i = I_i (we discuss alternative initial conditions later), if I_w = max_i I_i, then for sufficiently large λ (strictly speaking, as λ → ∞), it follows that x_w → I_w and x_i → 0 (for i ≠ w) as the system tends to equilibrium.
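As a numerical check of the theorem, the dynamics of equations 1.1 and 1.2 can be integrated directly. This is a minimal sketch: the inputs I and the values of λ, τ, and the step size are illustrative choices, not taken from the paper; the energy of equation 1.8 is also tracked to confirm that it never increases along the trajectory.

```python
import numpy as np

# Euler integration of tau dx_i/dt = -x_i + I_i K, with the inhibition
# K = exp(-lam * sum_{j != i} x_j) of equation 1.2 and x_i(0) = I_i.
I = np.array([1.0, 0.8, 0.6])      # inputs; I[0] should win
lam, tau, dt = 30.0, 1.0, 0.01     # illustrative values

x = I.copy()
energies = []
for _ in range(10000):             # integrate to t = 100
    K = np.exp(-lam * (x.sum() - x))
    x = x + dt * (-x + I * K) / tau
    # Lyapunov function of equation 1.8 in the variables z_i = exp(-lam x_i)
    z = np.exp(-lam * x)
    energies.append(np.sum((z * np.log(z) - z) / (lam * I)) + np.prod(z))

print(x)  # x[0] should approach I[0] = 1.0; the losers should approach 0
```

At equilibrium the winner relays its own amplitude while the losers are suppressed, and the tracked energy decreases monotonically, as the proof below requires.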
Proof. The proof proceeds in three stages. We first show that if I_i > I_j, then x_i(t) > x_j(t). Next, we use this result to prove that if the system converges to an equilibrium state, then the theorem holds. Finally, we show that there is a Lyapunov function associated with equation 1.1, and hence the system must converge to an equilibrium state.

First, observe that the update equation preserves the ordering of the elements, that is, x_i(t) > x_j(t) if and only if I_i > I_j. To see this we notice that the ordering can be violated only if at some time x_i(t) = x_j(t). However, since I_i > I_j, this implies dx_i/dt > dx_j/dt from equation 1.1. Thus, it is impossible for the ordering to change. [Strictly speaking, we need only x_w(t) ≥ x_i(t) for all i ≠ w, for the rest of the proof to be correct.]

Second, we show that under the assumption that the system reaches equilibrium, the theorem holds. In equilibrium, dx_i/dt = 0 and the solutions obey

u_i e^{-u_i} = λ I_i e^{-Σ_j u_j}   (1.3)

where u_i = λ x_i. The function u e^{-u} is shown in figure 2; it has a single maximum at u = 1. From equation 1.3 it follows that

u_i e^{-u_i} / (u_j e^{-u_j}) = I_i / I_j   (1.4)

Combining equation 1.4 with the ordering constraint shows that it is impossible to have more than one network element such that u_i > 1. In
fact, if u_i > u_j > 1, then u_i e^{-u_i} < u_j e^{-u_j} (Fig. 2) and I_i > I_j (by the ordering constraint), which is inconsistent with equation 1.4. Also, if λ is sufficiently large, then for at least one network element, u_i > 1 in equilibrium. In fact, taking the product of equation 1.3 for all i gives

e^{(N-1) Σ_j u_j} ∏_i u_i = λ^N ∏_i I_i   (1.5)

If u_i were less than 1 for all i, then the left-hand side of equation 1.5 would be bounded from above. The right-hand side, however, can be
Figure 2: The graph of the function u e^{-u} and its relation to the equilibrium state of the network. In the final state, the ratio of the values of this function for two network elements is independent of λ. If the final values u_i and u_j (with u_i > u_j) of the elements are less than 1 (the small λ case, labeled by S, s), then little suppression occurs and increasing the value of u_i will increase the value of u_j. If, however, the final value of u_i > 1 (for large λ, labeled by L, l), then suppression occurs, since increasing the value u_i causes the value of u_j to decrease and u_j → 0 as u_i → ∞.
made arbitrarily large by increasing λ. Thus, if λ is sufficiently large, then at least one u_i > 1. From this result, and the one after equation 1.4, it follows that for large λ there is only one network element such that u_i > 1: the one corresponding to the largest I_i. Thus, from equation 1.5, the winner's output is such that u_w → ∞ as λ → ∞. This implies that the losers' outputs go to zero, that is, u_i → 0, to maintain the validity of equation 1.4 (Fig. 2). This proves the theorem at equilibrium, because from equation 1.3 and the losers' condition (u_i → 0), the winner and the losers are such that x_w → I_w and x_i → 0, respectively.

Third, we demonstrate that a Lyapunov function exists for equation 1.1, and hence the system always settles into a final state solution of equation 1.1. It is convenient to perform a change of variables to z_i = e^{-λ x_i}. Equation 1.1 transforms into

τ dz_i/dt = -z_i log z_i - λ I_i ∏_j z_j   (1.6)

This can be rewritten as

τ dz_i/dt = -λ I_i z_i ∂E/∂z_i   (1.7)

where

E = Σ_j (z_j log z_j - z_j) / (λ I_j) + ∏_j z_j   (1.8)
satisfies the properties of a Lyapunov function: it is bounded from below and always decreases with time. (The system can be thought of as performing a form of steepest descent in an energy landscape given by E.) This completes the proof of the winner-take-all theorem.

How large should λ be for the theorem to be practical? We derive two lower bounds for λ (necessary conditions). (It will be argued, supported by computer simulations, that if λ is slightly larger than these bounds, then winner-take-all occurs, so the conditions may be sufficient.) The first bound comes from the application of perturbation theory to equation 1.1 to find when a winner solution exists (without proving that the system finds it). The analysis gives necessary and sufficient conditions […]. If λ > T_0/I_w, then the inhibition expressed in equation 1.12, coupled with equation 1.1 with initial conditions x_i = I_i, generates a winner-take-all network. Similar to the winner-take-all theorem, there is an ordering constraint, that is, if I_i > I_j, then x_i > x_j. It follows that the function K eventually becomes 1 for the winner. Otherwise, there is a contradiction, because if K = 0 always, then the winner and the losers fall to 0, implying λ Σ_{i≠w} x_i → 0, eventually making λ Σ_{i≠w} x_i < T_0, and K becomes 1. Also, it follows from the ordering constraint that if K becomes 1 for the winner, then K never becomes 0 again. To see this, note that if I_i > I_j, then λ Σ_{k≠j} x_k > λ Σ_{k≠i} x_k. Thus, if K is about to become zero for the winner, then K is zero for the losers. At that moment, dx_i/dt < 0 for the losers, and so λ Σ_{i≠w} dx_i/dt < 0, implying that λ Σ_{i≠w} x_i stays below T_0, and the winner keeps rising. Thus, from equation 1.1, for t → ∞, x_w → I_w, and the winner "kills" the losers by making K = 0 for them. The losers' elimination is guaranteed if λ > T_0/I_w, because in this case, λ Σ_{j≠i} x_j > T_0 for every loser i. Computer simulations show that the inhibition in equation 1.12 (Fig.
4e) and a version that uses inhibition with a hyperbolic threshold (Fig. 4e) give rise to good winner-take-all networks.

In conclusion, we propose a network that relays information on the identity and amplitude of its largest input with arbitrary precision. This winner-take-all network is based on presynaptic inhibition feedback, where the allowable inhibitory mechanisms are biophysically plausible. Besides the biological motivation, another advantage of this network is spatial homogeneity, which does not hold for some other networks (Koch and Ullman 1985).

For the network to have arbitrary precision, its parameter λ must be sufficiently large; must it be so large that it is biologically unrealistic? We now argue that this is not the case. To know when λ is "too" large, one must consider its physiological meaning. The paper demonstrates that there are several neural mechanisms that may underlie the winner-take-all network. Consider, for example, the threshold mechanism of equation 1.12. The proof following this equation shows that "sufficiently large λ" means that the winner's activity alone is enough to make the excitatory inputs to the losers subthreshold. The neurobiological literature suggests that such inhibition is not exaggeratedly large. For example, in the motor control of swimming in leeches, inhibitory synapses can kill other neurons' activity single-handedly (Stent and Kristan 1981). For mechanisms different from that in equation 1.12, "sufficiently large λ" implies an inhibitory synapse that is stronger than the minimal necessary to kill other neurons' activity. Figure 4 shows that the necessary strength depends on the specific mechanism, the number of inputs to the winner-take-all network, and its precision.

A complete experimental verification of the existence of such a network is very difficult with present techniques, but three specific predictions may already be tested. First, a pair of this network's cells would inhibit each other. [Mutual inhibition occurs in the interaction between synergist and antagonist groups of neurons in the motor system of mammals (Sherrington 1947) and invertebrates (Stent and Kristan 1981). Sherrington (1947) suggested that this mutual inhibition leads to "singleness of action," which strongly resembles the action of a winner-take-all mechanism.] Second, this mutual inhibition would have to be presynaptic. [Physiologists found that presynaptic inhibition is common in the nervous system, as, for example, in the spinal cord (Eccles et al. 1961), retina (Masland et al. 1984), and hippocampus (Colmers et al. 1987). Also, this type of inhibition implies the existence of axoaxonic synapses, which appear in electron micrographs at numerous locations of the mammalian central nervous system (Schmidt 1971; Somogyi 1977; Saint Marie and Peters 1984).] Third, the inhibition in the network would be highly nonlinear, possibly with a threshold or of a shunting-inhibition type. [A new technique to investigate the linearity of inhibition in the visual system without intracellular recordings has been recently described (Amthor and Grzywacz 1989).]

Issues of the time dynamics of the network are interesting, but beyond the scope of this paper. In particular, we are investigating the stability of the network including time delays in the connections. Some results on the stability of inhibitory networks (Wyatt and Standley 1989) and analog neural networks with delay (Marcus and Westervelt 1989) have been developed.
Preliminary computer simulations suggest that our network is stable, at least for small time delays (C.M. Marcus, personal communication).
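The thresholded mechanism discussed above can be sketched in the same way. Equation 1.12 itself falls outside this excerpt, so the all-or-none inhibition used below (K = 1 while the summed presynaptic inhibition is under a threshold T_0, and K = 0 otherwise) is an assumed form consistent with the surrounding text; λ is chosen to satisfy the condition λ > T_0/I_w, and all numerical values are illustrative.

```python
import numpy as np

# Winner-take-all with thresholded presynaptic inhibition.  The form of K
# below is an assumption consistent with the text around equation 1.12;
# all numerical values are illustrative.
I = np.array([1.0, 0.8, 0.6])
T0, lam = 5.0, 10.0            # lam > T0 / I.max(), the condition in the text
tau, dt = 1.0, 0.01

x = I.copy()
for _ in range(5000):          # integrate to t = 50
    inhibition = lam * (x.sum() - x)
    K = (inhibition < T0).astype(float)
    x = x + dt * (-x + I * K) / tau

print(x)  # the winner recovers its input; the losers are "killed"
```

All elements first decay together; once the summed activity of the others falls under the threshold for the winner, its input switches back on and its activity alone keeps the losers subthreshold, exactly the behavior argued in the text.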
Acknowledgments

We thank Tomaso Poggio for critically reading the manuscript. Also, we are grateful to the action editor for emphasizing the importance of the initial conditions for the success of the network. A.L.Y. was supported by the Brown-Harvard-MIT Center for Intelligent Control Systems with United States Army Research Office Grant DAAL03-86-K-0171. N.M.G. was supported by Grant BNS-8809528 from the National Science Foundation, by the Sloan Foundation, by a grant to Tomaso Poggio and Ellen Hildreth from the Office of Naval Research, Cognitive and Neural Systems Division, and by Grant IRI-8719394 to Tomaso Poggio, Ellen Hildreth, and Edward Adelson from the National Science Foundation.
References
Alger, B.E., and Nicoll, R.A. 1982. Feed-forward dendritic inhibition in rat hippocampal pyramidal cells studied in vitro. J. Physiol. 328, 105-123.
Amthor, F.R., and Grzywacz, N.M. 1989. Retinal directional selectivity is accounted for by shunting inhibition (submitted for publication).
Colmers, W.F., Lukowiak, K., and Pittman, Q.J. 1987. Presynaptic action of neuropeptide Y in area CA1 of the rat hippocampal slice. J. Physiol. 383, 285-299.
Coombs, J.S., Eccles, J.C., and Fatt, P. 1955. The inhibitory suppression of reflex discharges from motoneurones. J. Physiol. 130, 396-413.
Eccles, J.C., Eccles, R.M., and Magni, F. 1961. Central inhibitory action attributable to presynaptic depolarization produced by muscle afferent volleys. J. Physiol. 159, 147-166.
Elias, S.A., and Grossberg, S. 1975. Pattern formation, contrast control, and oscillations in the short term memory of shunting on-center off-surround networks. Biol. Cybern. 20, 69-98.
Feldman, J.A., and Ballard, D.H. 1982. Connectionist models and their properties. Cog. Sci. 6, 205-254.
Grzywacz, N.M., and Koch, C. 1987. Functional properties of models for direction selectivity in the retina. Synapse 1, 417-434.
Grzywacz, N.M., and Yuille, A.L. 1989. A model for the estimate of local image velocity by cells in the visual cortex (submitted for publication).
Hadeler, K.P. 1974. On the theory of lateral inhibition. Kybernetik 14, 161-165.
Hartline, H.K., and Ratliff, F. 1957. Inhibitory interaction of receptor units in the eye of Limulus. J. Gen. Physiol. 40, 357-376.
Kemp, J.A. 1984. Intracellular recordings from rat visual cortex cells in vitro and the action of GABA. J. Physiol. 349, 13P.
Koch, C., and Ullman, S. 1985. Selecting one among the many: A simple network implementing shifts in selective visual attention. Human Neurobiol. 4, 219-227.
Lazzaro, J., and Mead, C.A. 1989. A silicon model of auditory localization. Neural Comp. 1, 47-57.
Marcus, C.M., and Westervelt, R.M. 1989. Dynamics of analog neural networks with time delay. In Advances in Neural Information Processing Systems 1, D.S. Touretzky, ed., pp. 568-576. Morgan Kaufmann, San Mateo, CA.
Masland, R.H., Mills, J.W., and Cassidy, C. 1984. The functions of acetylcholine in the rabbit retina. Proc. R. Soc. London Ser. B 223, 121-139.
Reichardt, W.E., Poggio, T., and Hausen, K. 1983. Figure-ground discrimination by relative movement in the visual system of the fly. Part II: Towards the neural circuitry. Biol. Cybern. 46, 1-30.
Saint Marie, R.L., and Peters, A. 1984. The morphology and synaptic connections of spiny stellate neurons in monkey visual cortex (area 17): A Golgi-electron microscopic study. J. Comp. Neurol. 233, 213-235.
Schmidt, R.F. 1971. Presynaptic inhibition in the vertebrate central nervous system. Ergeb. Physiol. 63, 20-101.
Sherrington, C.S. 1947. The Integrative Action of the Nervous System. Yale University Press, New Haven.
Somogyi, P. 1977. A specific "axo-axonal" interneuron in the visual cortex of the rat. Brain Res. 136, 345-350.
Stent, G.S., and Kristan, W.B. 1981. In Neurobiology of the Leech, K.J. Muller, J.G. Nicholls, and G.S. Stent, eds., pp. 113-146. Cold Spring Harbor Laboratory, Cold Spring Harbor, NY.
Torre, V., and Poggio, T. 1978. A synaptic mechanism possibly underlying directional selectivity to motion. Proc. R. Soc. London Ser. B 202, 409-416.
Wyatt, J.L., Jr., and Standley, D. 1989. Criteria for robust stability in a class of lateral inhibition networks coupled through resistive grids. Neural Comp. 1, 58-67.
Received 14 March 1989; accepted 16 June 1989.
Communicated by Andrew Barto
An Analysis of the Elastic Net Approach to the Traveling Salesman Problem

Richard Durbin*
King's College Research Centre, Cambridge CB2 1ST, England
Richard Szeliski
Artificial Intelligence Center, SRI International, Menlo Park, CA 94025 USA
Alan Yuille
Division of Applied Sciences, Harvard University, Cambridge, MA 02138 USA
This paper analyzes the elastic net approach (Durbin and Willshaw 1987) to the traveling salesman problem of finding the shortest path through a set of cities. The elastic net approach jointly minimizes the length of an arbitrary path in the plane and the distance between the path points and the cities. The tradeoff between these two requirements is controlled by a scale parameter K. A global minimum is found for large K, and is then tracked to a small value. In this paper, we show that (1) in the small K limit the elastic path passes arbitrarily close to all the cities, but that only one path point is attracted to each city, (2) in the large K limit the net lies at the center of the set of cities, and (3) at a critical value of K the energy function bifurcates. We also show that this method can be interpreted in terms of extremizing a probability distribution controlled by K. The minimum at a given K corresponds to the maximum a posteriori (MAP) Bayesian estimate of the tour under a natural statistical interpretation. The analysis presented in this paper gives us a better understanding of the behavior of the elastic net, allows us to better choose the parameters for the optimization, and suggests how to extend the underlying ideas to other domains.

1 Introduction

The traveling salesman problem (Lawler et al. 1985) is a classical problem in combinatorial optimization. The task is to find the shortest possible tour through a set of N cities that passes through each city exactly once. This problem is known to be NP-complete, and it is generally believed

*Current address: Department of Psychology, Stanford University, Stanford, CA 94305 USA.
Neural Computation 1, 348-358 (1989) © 1989 Massachusetts Institute of Technology
that the computational power needed to solve it grows exponentially with the number of cities. In this paper we analyze a recent parallel analog algorithm based on an elastic net approach (Durbin and Willshaw 1987) that generates good solutions in much less time. This approach uses a fast heuristic method with a strong geometrical flavor that is based on the tea trade model of neural development (Willshaw and Von der Malsburg 1979). It will work in a space of any dimension, but for simplicity we will assume the two-dimensional plane in this paper. Below we briefly review the algorithm.

Let {X_i}, i = 1 to N, represent the positions of the N cities. The algorithm manipulates a path of points in the plane, specified by {Y_j}, j = 1 to M (M larger than N), so that they eventually define a tour (that is, eventually each city X_i has some path point Y_j converge to it). The path is updated each time step according to
ΔY_j = α Σ_i w_ij (X_i - Y_j) + βK (Y_{j+1} + Y_{j-1} - 2Y_j)

where

w_ij = e^{-|X_i - Y_j|²/2K²} / Σ_k e^{-|X_i - Y_k|²/2K²}
α and β are constants, and K is the scale parameter. Informally, the α term pulls the path toward the cities, so that for each X_i there is at least one Y_j within distance approximately K. The β term pulls neighboring path points toward each other, and hence tries to make the path short. The update equations are integrable, so that ΔY_j = -K ∂E/∂Y_j for an "energy" function, E, given by

E({Y_j}, K) = -αK Σ_i log Σ_j e^{-|X_i - Y_j|²/2K²} + (β/2) Σ_j |Y_j - Y_{j+1}|²   (1.1)
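The update rule and annealing of K can be sketched as follows. This is a minimal illustration on four cities: the city layout, number of path points, annealing schedule, and the values of α and β are illustrative choices, not the settings used in the paper.

```python
import numpy as np

# Elastic net sketch: path points Y_j on a ring are pulled toward the
# cities X_i while K is annealed downward.  All settings are illustrative.
cities = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])      # X_i
M = 16
theta = 2 * np.pi * np.arange(M) / M
Y = 0.5 + 0.3 * np.column_stack([np.cos(theta), np.sin(theta)])  # initial ring
alpha, beta = 0.2, 2.0

K = 0.2
for _ in range(30):                  # anneal the scale parameter K downward
    for _ in range(25):
        d2 = ((cities[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # |X_i - Y_j|^2
        # w_ij = exp(-d2 / 2K^2) normalized over j; subtract the row minimum
        # before exponentiating for numerical stability at small K
        logits = -(d2 - d2.min(axis=1, keepdims=True)) / (2 * K * K)
        w = np.exp(logits)
        w /= w.sum(axis=1, keepdims=True)
        pull = (w[:, :, None] * (cities[:, None, :] - Y[None, :, :])).sum(0)
        tension = np.roll(Y, 1, axis=0) + np.roll(Y, -1, axis=0) - 2 * Y
        Y = Y + alpha * pull + beta * K * tension
    K *= 0.9

# every city should now have a nearby path point
gaps = np.sqrt(((cities[:, None, :] - Y[None, :, :]) ** 2).sum(-1)).min(axis=1)
print(gaps)
```

At large K each city's pull is spread over many path points; as K shrinks the weights concentrate on the nearest point, which is drawn onto the city, leaving a short closed tour.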
For fixed K the path will converge to a (possibly local) minimum of E. At large values of K the energy function is smoothed and there is only one minimum. At small values of K […]

[…] yields a second-order differential equation for η̄ in terms of a sum over the other η_ν. As in equation 2.6, the sum will be controlled by some dominant term η̄, and the equation for this term is

(3.6)

where r_1, r_2, and r_3 are numerical constants. It is not possible to exhaustively determine all possible classes of solutions to equation 3.6, although it is easy to rule out simple exponential solutions. One can, however, look for certain specific classes of solutions, such as the polynomial-time solutions found in the previous section. If we assume a solution of the form η̄ ∼ t^z for some exponent z, then the second derivative term is of order t^{z-2}, and can be neglected relative to the first derivative term, which is of order t^{z-1}. Similarly, the η̄^{-1}(dη̄/dt)² term is also of order t^{z-2} and can also be neglected. The resulting equation for the exponent z thus has exactly the same form as in the zero momentum case of Section 2, and therefore the rate of convergence is the same as in equation 2.7. We have also verified this in numerical simulations.
4 Convergence in Networks with Hidden Units
We now consider networks with a single hidden layer. Let w_ij represent the weights from the input layer to the hidden layer, and Ω_i represent the weights from the hidden layer to the output unit. The total input activation of the hidden units is given by u_i = Σ_j w_ij x_j, and the corresponding hidden unit outputs are o_i = g(u_i). The total input activation of the output unit is now given by h = Σ_i Ω_i o_i. For this network, the resulting equation analogous to equation 2.4 for the rate of change of the output for pattern μ is now more complicated:

(4.1)

Since this equation explicitly depends on the second layer weights Ω_i, we also need an equation governing how these weights are changing with time. This equation comes directly from the gradient-descent learning rule:

(4.2)
The right-hand side of equation 4.1 has two terms. The first term, depending on the hidden unit output states, is analogous to equation 2.4 for single-layer convergence, and will in fact give the same rate of convergence as in the single-layer case, because the hidden unit outputs are
Gerald Tesauro, Yu He, and Subutai Ahmad
always of order 1. The second term, which depends on the slope of the hidden unit transfer function times the second layer weights, could give a different convergence rate if it dominates the first term, which is of order 1. In general, we do not expect this to happen, because the saturation of the hidden unit states generally causes the g′ terms to vanish faster than the growth of the second-layer weights. Certainly this is true for a sigmoid transfer function: the weights can grow only logarithmically (as shown below), and any polynomial decrease of g′ will kill the second term. However, for purposes of argument, let us assume that the hidden unit states do not saturate, and that the g′ terms remain of order 1. This will give us the maximum possible effect of the second term. Expanding equations 4.1 and 4.2 in a small η expansion as before, and suppressing indices for convenience, we obtain the following coupled system of equations:

(4.3)

(4.4)
First we look for polynomial-time solutions to this system of equations of the form η̄ ∼ t^z, Ω ∼ t^λ, with λ > 0 and z < 0. Equation 4.4 yields the following expression for λ:

λ = z(γ + β - 1) + 1 > 0   (4.5)

Equation 4.3, together with the above expression for λ, then gives for the exponent z:

z = -3 / (3γ + 4β - 4) < 0   (4.6)
The constraint z < 0 will be satisfied provided that γ > (4/3)(1 - β). The constraint λ > 0 will be satisfied, assuming γ is always positive, whenever β > 1. This will also guarantee that z < 0. To summarize, we have shown that polynomial solutions for both the weights and the errors are possible when the transfer function exponent β > 1. It is interesting to note that these solutions converge slightly faster than in the single-layer case. For example, with γ = 2 and β = 2, η ∼ t^{-3/10} in the multilayer case, but as shown previously, η goes to zero only as t^{-1/4} in the single-layer case. We emphasize that this slight speed-up will be obtained only when the hidden unit states do not saturate. To the extent that the hidden units saturate and their slopes become small, the convergence rate will return to the single-layer rate. When β = 1, as with the sigmoid transfer function, the above polynomial solution is not possible. Instead, one must look for solutions in which the weights increase only logarithmically. It is easy to verify that the following is a self-consistent leading order solution to equations 4.3 and 4.4 when β = 1:
η̄ ∼ t^{-1/γ},   Ω ∼ log t   (4.7)
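The exponent relations above are easy to check numerically. The sketch below restates equations 4.5 and 4.6; the single-layer exponent formula z = -1/(γ + 2β - 2) is an assumption of ours, inferred from the two single-layer examples quoted in the text (t^{-1/4} for γ = β = 2, and t^{-1/γ} for β = 1).

```python
# Convergence exponents for eta ~ t**z (error) and Omega ~ t**lam (weights).

def z_multilayer(gamma, beta):
    """Equation 4.6: error-decay exponent with one hidden layer."""
    return -3.0 / (3 * gamma + 4 * beta - 4)

def weight_exponent(gamma, beta):
    """Equation 4.5: growth exponent of the second-layer weights."""
    return z_multilayer(gamma, beta) * (gamma + beta - 1) + 1

def z_single_layer(gamma, beta):
    """Assumed single-layer exponent, consistent with the quoted examples."""
    return -1.0 / (gamma + 2 * beta - 2)

# gamma = beta = 2: t**(-3/10) with hidden units vs. t**(-1/4) without
print(z_multilayer(2, 2), z_single_layer(2, 2), weight_exponent(2, 2))
```

For β > 1 the weight exponent is positive and the multilayer error decays slightly faster, matching the comparison made in the text.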
Asymptotic Convergence of Backpropagation
(Figure 1 legend: 0 hidden units, 3 hidden units, 10 hidden units, 50 hidden units.)
Figure 1: Plot of total training set error versus epochs of training time on a log-log scale for networks learning the majority function using backpropagation without momentum. The networks had 23 input units and varying numbers of hidden units in a single layer, as indicated. The training set contained 200 patterns generated with a uniform random distribution. The straight-line behavior at long times indicates power-law decay of the error function. In each case, the slope is approximately -1, indicating E ∼ 1/t.
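The 1/t decay in the figure is easy to reproduce on a smaller scale. This is a sketch, not the figure's setup: a single sigmoid unit with quadratic error (so γ = 2) is trained by batch gradient descent on the 3-input majority function, and the learning rate and epoch counts are illustrative.

```python
import numpy as np

# Single sigmoid output unit, quadratic error (gamma = 2), batch gradient
# descent on the 3-input majority function.  Theory predicts the error
# eventually decays roughly as 1/t, i.e. slope -1 on a log-log plot.
X = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)], float)
T = (X.sum(axis=1) >= 2).astype(float)      # majority targets
w = np.zeros(3)
b = 0.0
lr = 0.5

errors = {}
checkpoints = {10**3, 10**4, 10**5}
for t in range(1, 10**5 + 1):
    y = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    delta = (y - T) * y * (1 - y)           # dE/dh for E = sum (y - T)^2
    w -= lr * 2 * X.T @ delta
    b -= lr * 2 * delta.sum()
    if t in checkpoints:
        errors[t] = np.sum((y - T) ** 2)

slope = np.log10(errors[10**5] / errors[10**4])   # decade slope, near -1 expected
print(errors, slope)
```

The measured slope over the last decade of training should sit near -1, consistent with the power-law behavior reported in the figure.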
Recall that in the single-layer case, η ∼ t^{-1/γ}. Therefore, the effect of multiple layers is to provide only a logarithmic speed-up of convergence, and only when the hidden units do not saturate. For practical purposes, then, we expect the convergence of networks with hidden units to be no different empirically from networks without hidden units. This is in fact what we have found in our numerical simulations, as illustrated in Figure 1.

5 Discussion

We have obtained results for the asymptotic convergence of gradient-descent learning that are valid for a wide variety of error functions and transfer functions. In typical situations, we expect the same rate of convergence to be obtained regardless of whether or not the network has hidden units. However, in some cases it may be possible to obtain a slight polynomial speed-up with nonsigmoidal units, or a logarithmic speed-up with sigmoidal units. We point out that in all cases, the sigmoid provides the maximum possible convergence rate, and is therefore a "good" transfer function to use in that sense. We have not attempted analysis of networks with multiple layers of hidden units; however, the structure of equation 4.1 suggests a recursive structure in which one accumulates additional factors of g′ as one adds more layers. To the extent that the hidden unit states saturate and the g′ factors vanish, we conjecture that the rate of convergence would be no different even in networks with arbitrary numbers of hidden layers. We have also examined some modifications to strict gradient-descent learning, and have found that, while momentum terms and margins do not affect the rate of convergence, adaptive learning rate schemes can have a big effect. Another important finding is that the expected rate of convergence does not depend on the use of all 2^n input patterns in the training set. The same behavior should be seen for general subsets of training data. This is also in agreement with our numerical results, and with the results of Ahmad (1988) and Ahmad and Tesauro (1988). In conclusion, our analysis is only the first step toward a more complete theoretical understanding of gradient-descent learning in feed-forward networks. It would be of great interest to extend our analysis to times earlier in the learning process, when not all of the errors are small. The formalism developed in this paper might provide some of the ingredients of such an analysis.
It might also provide a framework for the analysis of the numbers, sizes, and shapes of the basins of attraction for gradient-descent learning in feed-forward networks. Another topic of great importance is the behavior of the generalization performance, i.e., the error on a set of test patterns not used in training, which was not addressed in this paper. Finally, the type of analysis used in this paper might provide insight into the development and selection of new learning algorithms that might scale more favorably than backpropagation.
References

Ahmad, S. 1988. A study of scaling and generalization in neural networks. Master's Thesis, University of Illinois at Urbana-Champaign.
Ahmad, S., and Tesauro, G. 1988. Scaling and generalization in neural networks: A case study. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky et al., eds. Morgan Kaufmann, San Mateo, CA.
Hinton, G.E. 1987. Connectionist learning procedures. Tech. Rep. CMU-CS-87-115, Department of Computer Science, Carnegie-Mellon University.
Jacobs, R.A. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks 1, 295-307.
Le Cun, Y. 1985. A learning procedure for asymmetric network. Proc. Cognitiva 85 (Paris), 599-604.
Minsky, M., and Papert, S. 1969. Perceptrons. MIT Press, Cambridge, MA.
Parker, D.B. 1985. Learning-logic. Tech. Rep. TR-47, MIT Center for Computational Research in Economics and Management Science.
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.
Werbos, P. 1974. Ph.D. Thesis, Harvard University.
Widrow, B., and Hoff, M.E. 1960. Adaptive switching circuits. Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, Part 4, 96-104.
Received 12 September 1988; accepted 5 June 1989.
Communicated by Steven Zucker
Learning by Assertion: Two Methods for Calibrating a Linear Visual System

Laurence T. Maloney
Center for Neural Science, Department of Psychology, New York University, New York, NY 10003 USA
Albert J. Ahumada
NASA Ames Research Center, Moffett Field, CA 94035 USA
A visual system is geometrically calibrated if its estimates of the spatial properties of a scene are accurate: straight lines are judged straight, angles are correctly estimated, and collinear line segments are perceived to fall on a common line. This paper describes two new calibration methods for a model visual system whose photoreceptors are initially at unknown locations. The methods can also compensate for optical distortions that are equivalent to remapping of receptor locations (e.g., spherical aberration). The methods work by comparing visual input across eye/head movements; they require no explicit feedback and no knowledge about the particular contents of a scene. This work has implications for development and calibration in biological visual systems.

1 Introduction
It's likely that no biological visual system is ever perfectly calibrated, but considerable evidence exists that biological visual systems do compensate for optical distortions and initial uncertainty about the position of photoreceptors in the retinal photoreceptor lattice (Banks 1976; Hirsch and Miller 1987). Recent anatomical work, for example, demonstrates apparent disorder in the retinal lattice outside the central fovea, increasing with eccentricity (Hirsch and Miller 1987). Further, the optics of the eye change throughout the life span (Banks 1976; Weale 1982), suggesting that calibration may continue in the adult. Previous work in visual neural development suggests a variety of sources of information that drive calibration (Meyer 1988; Purves and Lichtman 1985; Shatz 1988), and there are computational models of visual neural development based on these cues (Sejnowski 1987). Yet, although biological visual systems are known to require patterned visual stimulation to achieve normal organization (Movshon and Van Sluyters

Neural Computation 1, 392-401 (1989) © 1989 Massachusetts Institute of Technology
1981), few models require such stimulation to function. Exceptions include Kohonen (1982) and Toet, Blom, and Koenderink (1987). Further, while all these models could in principle compensate for disorder in the retinal lattice, none of them addresses the problem of compensation for optical distortion. We describe two methods for calibrating a simple linear visual system that work by comparing visual input across eye/head movements. These methods can organize the receptive fields of a simple visual system so as to compensate for irregularities in the retinal photoreceptor lattice and optical irregularities equivalent to distortions in the lattice. They require no explicit feedback and no knowledge about the particular contents of a scene, but instead work by "asserting" that the internal representation of the scene behave in a prespecified way under eye and head movements. We demonstrate that these methods can be used to calibrate the simple, linear visual system described next. In the final section, we discuss the implications of this work for other models of visual processing.

2 A Model Linear Visual System
The model visual system has N photoreceptors arranged in a receptor array. The locations of these receptors are initially unknown. The light image is the mean intensity of light at each location in the receptor array. The output of a receptor is the value of the light image at the location assigned to the receptor. The mapping from light image to receptor array is assumed to be linear. The output (measured intensity) from the ith receptor is denoted p_i. In vector notation, the instantaneous input from the receptor array is p = [p_1, ..., p_N]^T. Figure 1 also shows an ideal receptor array of N receptors at specified, known locations. This receptor array may be a square or hexagonal grid of receptors, but it need not be. The input from the ith receptor in this ideal array is denoted π_i, and the input from the ideal array is denoted π = [π_1, ..., π_N]^T. The real array is connected to the ideal by linear receptive fields (one is shown in the figure). The visual system is calibrated when the receptive fields translate the input of the irregular, real array to what the ideal array would have sampled. Without some restriction on the light images sampled, there need be, of course, no connection between samples taken by the real array and those taken by the ideal array. For the remainder of this paper, the set of light images L is assumed to be a space of two-dimensional finite Fourier series of dimension N, where N is the number of receptors in the ideal and in the real arrays. When N is 49, as in the simulations below, the 49-dimensional space of two-dimensional finite Fourier series consists of weighted sums of products of one of

1, sin 2πx, cos 2πx, sin 2π2x, cos 2π2x, sin 2π3x, cos 2π3x

and one of

1, sin 2πy, cos 2πy, sin 2π2y, cos 2π2y, sin 2π3y, cos 2π3y
This lowpass assumption reflects the blurring induced by the optics of the eye; it is commonly made in modeling spatial vision (Maloney 1990). With this assumption, we can show that if there is a solution to the calibration problem for a particular real array, ideal array, and linear subspace of lights, then it must be a linear transformation (Maloney 1990). This linear transformation, W, will depend on the unknown position of receptors in the real array and, consequently, on the optics of the visual system.

Figure 1: An irregular photoreceptor array can be translated to an ideal regular array by a proper choice of receptive field weights.

Figure 2: Equivalent receptive fields. The black dots represent the locations of photoreceptors in the disordered real array. The squares correspond to the receptive fields of two receptors in the ideal array. See text.

Each row of the linear transformation W, written as a matrix, is the weights of the receptive field of one ideal receptor. Because of the irregular distribution of receptors in the real array, the receptive fields for different receptors in the ideal array may be very different. Figure 2 shows two correctly calibrated receptive fields corresponding to two locations in the ideal array pointed to by dashed lines. The receptors in the real array are shown as black dots. The weights assigned to the nearby receptors in the real array are shown in the "exploded" squares at the ends of the dashed lines. These two receptive fields both extract the equivalent information that would have been sampled by an ideal receptor at the location indicated. The problem of calibration is now reduced to learning the unknown linear transformation W, given input only from the real sampling array.
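The existence of such a W can be made concrete with a small numerical sketch (our illustration, not the authors' code): when light images are restricted to the 49-dimensional lowpass Fourier space, the 49 samples taken by the real array determine the image, so a linear W mapping real samples to ideal samples exists. Here the receptor positions are assumed known only so that W can be computed directly; in the paper they are unknown, which is what makes calibration hard.

```python
import numpy as np

rng = np.random.default_rng(0)

def basis_1d(t):
    # 1-D factors of the lowpass space: 1, sin/cos(2*pi*k*t), k = 1..3
    funcs = [np.ones_like(t)]
    for k in (1, 2, 3):
        funcs += [np.sin(2 * np.pi * k * t), np.cos(2 * np.pi * k * t)]
    return np.stack(funcs)

def sampling_matrix(xs, ys):
    # Row n = the 49 two-dimensional basis functions evaluated at receptor n
    bx, by = basis_1d(xs), basis_1d(ys)
    return np.einsum('in,jn->nij', bx, by).reshape(len(xs), -1)

# Ideal array: regular 7x7 grid; real array: the same grid, randomly perturbed
g = (np.arange(7) + 0.5) / 7
ix, iy = [c.ravel() for c in np.meshgrid(g, g)]
S_ideal = sampling_matrix(ix, iy)
S_real = sampling_matrix(ix + rng.uniform(-0.25 / 7, 0.25 / 7, 49),
                         iy + rng.uniform(-0.25 / 7, 0.25 / 7, 49))

# Calibration: W must turn real samples into ideal samples for every
# lowpass image, i.e., W S_real = S_ideal
W = S_ideal @ np.linalg.inv(S_real)

a = rng.normal(size=49)              # coefficients of a random lowpass image
p, ideal = S_real @ a, S_ideal @ a   # real and ideal receptor outputs
assert np.allclose(W @ p, ideal)     # the receptive fields W recover them
```

Because 49 generic samples determine the 49-dimensional image space, S_real is invertible and W is unique up to this construction; the learning problem below must find W without access to the real receptor positions.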
In computer vision this problem is commonly solved by the use of "test patterns." If the contents of the scene are known, then the value of the ideal array can be compared to the value that is correct for the known test pattern, and W can be adjusted to eliminate any discrepancies (Rosenfeld and Kak 1976). We describe algorithms that do not require knowledge of the specific contents of the scene.

3 Eye/Head Movements
Consider the consequences of moving the eye and/or head, while the scene remains unchanged. The eye and head may translate to a new position, change the angle of gaze, rotate, "zoom" in or out on the scene, and so on. A particular eye and head movement serves to transform the value of the real (retinal) array (see Fig. 3). The transform, denoted T, maps the initial value of the real array p to the value after an eye/head movement, p′. Different eye/head movements, of course, correspond to different transformations, T. If the visual system is properly calibrated, then π = Wp and π′ = Wp′ will be related by an equivalent transformation, denoted t. Intuitively, if the retinal image moves rigidly to the left on the real array, then, in a calibrated visual system, it would move rigidly to the left on the ideal array as well. T is a physical transformation induced by actual eye/head movements. t is an internal transformation that simulates the external transformation. The last assumption we make concerning the model visual system is that it can perform transformations, denoted t, on the ideal array that mimic all possible eye and head movements.

Figure 3: Schematic diagram of an assertion. The consequences of an eye/head movement can be computed in two ways. See text for an explanation.

The set of transformations t is easily computed; it is precisely the transformations needed, for example, to compensate computationally for eye movements. The visual system can now compute the outcome of eye/head movements in two ways. It can look at the scene, take the resulting value π = Wp and apply t to get π′. Alternatively, it can perform the physical transformation T by actually moving, and then compute π′ = WTp. If the two methods of computing π′ = tWp = WTp (the two paths sketched by arrows in Fig. 3) produce different answers, then the visual system is not calibrated. Conversely, a specific transformation T constrains the choice of W so that tW = WT. This constraint we term an assertion. We assert, for example, that in a calibrated visual system, moving closer to an object should simply result in scaling the object in size. Any other changes (rippling, flickering, distortion) are indications of failures of calibration.

4 Mathematical Results
To what extent is W constrained by all of the transformations T taken together? For the simple visual system considered here, we have the following mathematical results (Maloney 1989):
Result 1. If W is nonsingular, it is completely determined up to a scaling factor by the assertions generated by all eye and head movements in the scene.
Satisfying the assertions is almost equivalent to calibrating the visual system. The requirement that W be nonsingular avoids pathological solutions where the visual system disconnects itself from the environment. If, for example, all weights in W were set to 0, the visual system, missing all visual input, could never see any failure of rigid transformation under eye/head movements. The second result concerns equivalent receptive fields. Suppose we consider not all eye and head movements, but just (small) eye movements: translations of the retina perpendicular to the line of sight.
The importance of the second result is that it would permit anatomically more regular portions of the retina to serve as a template for organizing equivalent receptive fields elsewhere in the retina. The two results, taken together, suggest that assertions can be used to guide calibration. In the next sections, we develop and illustrate an algorithm for calibration based on assertions.
5 Learning by Assertion
The requirement that WT = tW for all transformations T and all light images determines W as stated above. The penalty term

Σ_T Σ_p ‖ tWp − WTp ‖²   (5.1)

is minimized precisely when this condition holds. We can therefore develop a learning algorithm for calibrating the simple visual system outlined above by minimizing the quadratic penalty in equation 5.1. The algorithm repeats the following steps until the error term (equation 5.1) is sufficiently small:

1. Generate a light image drawn at random from the lowpass subspace of finite Fourier series.
2. Sample the light image to obtain p.
3. Simulate a randomly chosen eye/head movement T.
4. Resample the light image to obtain p′.
5. Compute WTp − tWp, the Euclidean vector difference between the two ways to compute π′.
6. Update W using a modified Widrow-Hoff algorithm.
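The steps above can be illustrated with a minimal numpy sketch (our illustration, not the authors' implementation): light images are drawn from the 49-dimensional lowpass space, the only movement T simulated is a one-grid-step horizontal translation (for periodic images its internal counterpart t is a circular shift of the ideal-array values), and the update in step 6 is an LMS-style step toward making the two outputs agree. The learning rate, iteration count, and restriction to a single movement are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 7                                    # 7x7 arrays, 49 receptors

def basis_1d(s):
    # 1-D factors of the lowpass space: 1, sin/cos(2*pi*k*s), k = 1..3
    funcs = [np.ones_like(s)]
    for k in (1, 2, 3):
        funcs += [np.sin(2 * np.pi * k * s), np.cos(2 * np.pi * k * s)]
    return np.stack(funcs)

def sampling_matrix(xs, ys):
    # Row n = the 49 basis functions evaluated at receptor n
    bx, by = basis_1d(xs), basis_1d(ys)
    return np.einsum('in,jn->nij', bx, by).reshape(len(xs), -1)

grid = (np.arange(N) + 0.5) / N
ix, iy = [c.ravel() for c in np.meshgrid(grid, grid, indexing='ij')]
rx = ix + rng.uniform(-0.25 / N, 0.25 / N, N * N)   # irregular real array
ry = iy + rng.uniform(-0.25 / N, 0.25 / N, N * N)

S = sampling_matrix(rx, ry)              # real-array samples of the scene
S_T = sampling_matrix(rx + 1.0 / N, ry)  # samples after eye movement T

def t(v):
    # the same translation applied internally: a circular shift on the
    # periodic ideal grid
    return np.roll(v.reshape(N, N), -1, axis=0).ravel()

W = rng.normal(scale=0.1, size=(N * N, N * N))   # initial receptive fields
W_fixed = W[0].copy()                    # one receptive field is held fixed
errs = []
for _ in range(5000):
    a = rng.normal(size=N * N)           # random lowpass light image
    p, p_moved = S @ a, S_T @ a          # p and Tp
    err = t(W @ p) - W @ p_moved         # tWp - WTp: zero iff assertion holds
    errs.append(err @ err)
    W += 0.1 * np.outer(err, p_moved) / (p_moved @ p_moved)  # LMS-style step
    W[0] = W_fixed                       # keep the fixed receptive field
assert np.mean(errs[-100:]) < np.mean(errs[:100])   # penalty decreases
```

With only one transformation the assertion constrains W only partially; the simulations described in the text use many more iterations and, in general, an ensemble of movements.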
The Widrow-Hoff algorithm (Widrow and Hoff 1960) compares the correct and actual outputs of a linear transformation and alters the transformation to make them coincide. Our algorithm is computationally identical to the Widrow-Hoff algorithm except that it compares tWp and WTp, two possibly erroneous outputs, and attempts to minimize the discrepancy between them. We do not know the conditions under which the modified algorithm is guaranteed to converge. In Step 5, we assume that the transformation t, corresponding to T, is known to the visual system. We are investigating whether t can be estimated from the visual input with inexact knowledge of T (Maloney 1989).

6 Simulation Results
We have implemented the algorithm corresponding to Result 2 above: eye movements only. For the results reported here we assumed a 7 x 7 square grid ideal array with a 49-receptor irregular real array. The locations of the 49 receptors in the real array were chosen by randomly perturbing about half of the receptors in the regular array by about 0.25 of the spacing between receptors in the regular array. The lowpass space was described above. One receptive field in the ideal array was fixed.
Figure 4: (a) The correct receptive field for one of the receptors in a 7 x 7 square grid ideal array. (b) The receptive field "learned" by the algorithm described in the text.

The remaining 48 equivalent receptive fields were learned. The fixed receptive field could be set to arbitrary (nonzero) values. Figure 4a shows one of the 7 x 7 learned receptive fields after 150,000 iterations of the Widrow-Hoff algorithm; Figure 4b shows the correct receptive field for this element. A linear interpolation algorithm was used to render each 7 x 7 grid as a perspective plot. The receptive field has converged to its desired shape.

7 Generalizations and Predictions
We are currently implementing the algorithm corresponding to Result 1 (eye and head movements). For a nonlinear visual system, equation 5.1 still serves as a constraint guiding calibration that, in combination with other constraints, may be sufficient to guarantee proper calibration. Since equation 5.1 is quadratic, it is plausible that any candidate neural learning algorithm is capable of minimizing it, if the penalty term can be
computed. Since the penalty term represents nonrigid motion induced in the representation by eye or head movements, it is plausible that the penalty term is available to the nervous system. The methods described here use a novel cue, derived from comparison of visual input across successive glances at a scene, to calibrate a simple linear visual system. Previous models of visual neural development, reviewed above, use different cues and methods to calibrate the visual system. The methods outlined here differ from these methods in that (1) they directly optimize a visual capability (stability and rigidity under change of direction of gaze), (2) they can compensate for small optical distortions and remappings, (3) they require structured visual input (actual scenes), and (4) they require successive fixations on a single unchanging scene. Taken as a claim about visual development, the procedures developed here are readily testable empirically. (1) Animals reared in environments lacking structured visual input, or in environments where visual input is rendered perpetually nonrigid, or where it is never possible to fixate the same scene twice, should be perfectly calibrated according to previous theories, but not according to the work developed here. (2) Animals with small optical distortions induced early in development should not be perfectly calibrated according to previous theories, but may be so according to the work described here. (3) The visual system will compensate for retinally stabilized optical distortions in adults; these distortions may include small induced scotomas. Prediction (2) may hold while (3) fails if recalibration in the adult is limited by connectivity restrictions on receptive fields (Purves and Lichtman 1985).
Acknowledgments We thank Marty S. Banks, Jeffrey B. Mulligan, Michael Pavel, Brian A. Wandell, and John I. Yellott, Jr. for useful comments. Send reprint requests to Laurence T. Maloney, Department of Psychology, Center for Neural Science, 6 Washington Place, 8th Floor, New York, NY 10003.
References

Banks, M.S. 1976. Visual recalibration and the development of contrast and optical flow perception. In Perceptual Development in Infancy: The Minnesota Symposia on Child Psychology, Volume 20, A. Yonas, ed., Erlbaum, Hillsdale, NJ.
Hirsch, J., and Miller, W.H. 1987. Does cone positional disorder limit resolution? J. Opt. Soc. Am. A 4, 1481-1492.
Kohonen, T. 1982. Analysis of a simple self-organizing process. Biol. Cybern. 44, 135-140.
Maloney, L.T. 1989. Geometric calibration of a linear visual system. In preparation.
Maloney, L.T. 1990. The consequences of discrete retinal sampling for vision. In Computational Models of Visual Processing, M.S. Landy and J.A. Movshon, eds., MIT Press, Cambridge, MA.
Meyer, R.L. 1988. Activity, chemoaffinity, and competition: Factors in the formation of the retinotectal map. In Cell Interactions in Visual Development, S.R. Hilfer and J.B. Sheffield, eds., Springer-Verlag, New York.
Movshon, J.A., and Van Sluyters, R.C. 1981. Visual neural development. Ann. Rev. Neurosci. 7.
Purves, D., and Lichtman, J.W. 1985. Principles of Neural Development, pp. 251-270. Sinauer, Sunderland, MA.
Rosenfeld, A., and Kak, A.C. 1976. Digital Picture Processing. Academic Press, New York.
Sejnowski, T. 1987. Computational models and the development of topographic projections. Trends Neurosci. 10, 304-305.
Shatz, C.J. 1988. The role of function in the prenatal development of retinogeniculate connections. In Cellular Thalamic Mechanisms, M. Bentivoglio and R. Spreafico, eds., Elsevier Science, Amsterdam.
Toet, A., Blom, J., and Koenderink, J.J. 1987. The construction of a simultaneous functional order in nervous systems. III. The influence of environmental constraints on the resulting functional order. Biol. Cybern. 57, 331-340.
Weale, R.A. 1982. A Biography of the Eye: Development, Growth, Age. H.K. Lewis, London.
Widrow, B., and Hoff, M.E. 1960. Adaptive switching circuits. Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, Part 4, 96-104.
Received 23 May 1989; accepted 13 July 1989.
Communicated by Christoph von der Malsburg
How to Generate Ordered Maps by Maximizing the Mutual Information between Input and Output Signals
Ralph Linsker
IBM Research Division, T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598 USA
A learning rule that performs gradient ascent in the average mutual information between input and output signals is derived for a system having feedforward and lateral interactions. Several processes emerge as components of this learning rule: Hebb-like modification, and cooperation and competition among processing nodes. Topographic map formation is demonstrated using the learning rule. An analytic expression relating the average mutual information to the response properties of nodes and their geometric arrangement is derived in certain cases. This yields a relation between the local map magnification factor and the probability distribution in the input space. The results provide new links between unsupervised learning and information-theoretic optimization in a system whose properties are biologically motivated.

1 Introduction
A great deal is known experimentally about the complex organization of certain biological perceptual systems such as the visual system in cat and monkey. One way to study these systems theoretically is to explore whether there are optimization principles that can correctly predict what signal transformations are carried out at various stages of a perceptual pathway. I have proposed a principle of "maximum information preservation" (Linsker 1988a,b) according to which a processing stage has the property that the values of the output signals from that stage optimally discriminate, in an information-theoretic sense, among the possible sets of input signals to that stage. [See (Linsker 1990) for a review of earlier ideas relating information theory and sensory processing.] The principle in its basic form states: Given a statistically stationary ensemble of input patterns L (to a processing stage) having probability density function (pdf) P_L(L), and a set S of allowed input-output mappings S = {f: L → M}, where each f is characterized by a conditional pdf P(M | L), choose an f ∈ S that maximizes the Shannon information rate or average mutual information (Shannon 1949)

R = ∫ dL P_L(L) ∫ dM P(M | L) log[P(M | L)/P_M(M)]   (1.1)

where P_M(M) ≡ ∫ dL P_L(L) P(M | L). In this paper we study some consequences of R-maximization for a processing stage in which the choice of set S is relatively simple yet biologically motivated.

Neural Computation 1, 402-411 (1989) © 1989 Massachusetts Institute of Technology

2 Information Rate and Gradient Ascent
The type of processing stage we shall consider has the following properties. Each input pattern is denoted by a vector in a space L. There is a set of output "nodes" M, each characterized by a vector x(M) in L space. The response to an input L occurs in three steps: (1) Feedforward activation: Each node M′ receives activation A(L, M′) ≥ 0; this quantity depends on L and the position x(M′) of node M′. (2) Lateral interaction: The activity of each node M at this step is given by B(L, M) = Σ_{M′} g(M′, M) A(L, M′), where g(M′, M) ≥ 0 and Σ_M g(M′, M) = 1 for all M′. (3) Selection of a single output node to be fired: Node M is selected with probability P(M | L) = B(L, M)/Σ_{M′} B(L, M′). (If we view the system as a network, A corresponds to the feedforward connection strengths, and g to lateral excitatory strengths. Lateral inhibitory connections are implicit in the selection of a single firing node at step 3. The requirement that a single node fire makes it easier to compute the information rate, but deprives the system of much of the richness of biological network response.) Thus

P(M | L) = [Σ_{M′} g(M′, M) A(L, M′)] / Σ_{M′} A(L, M′)   (2.1)

The functions A and g are specified. We wish to maximize R over all choices of the set of vectors {x(M)}. We derive a learning rule that, when averaged over the input ensemble, performs gradient ascent on R. The derivative of R with respect to the jth coordinate of x(M₀) is ∂R/∂x_j(M₀) = ∫ dL P_L(L) Z_j(L, M₀), where

Z_j(L, M₀) = [∂A(L, M₀)/∂x_j(M₀)] [Σ_{M′} A(L, M′)]^{−1} Σ_M [log P(M | L) − log P_M(M)] [g(M₀, M) − P(M | L)]   (2.2)

The learning rule is: Select input presentations L according to the pdf P_L(L). For each L in turn, move each node position x(M₀) by a small amount k Z(L, M₀), where k > 0. The rule makes use of one item of "historical" information at each node: P_M(M). If the {x(M)} change slowly over many presentations, an average of the firing incidence of M over an appropriate number of recent presentations can provide a suitable approximation to P_M(M). We can interpret equation 2.2 as follows. Consider each node M in turn. Suppose that (1) P(M | L) > P_M(M); that is, the occurrence of
pattern L conduces to the firing of M. Suppose also that (2) g(M₀, M) > P(M | L). In network terms, g(M₀, M) is the strength of the lateral connection from M₀ to M. By equation 2.1, P(M | L) equals the average strength of the active connections to node M, defined by weighting each g(M′, M) by the activity A(L, M′) of node M′. If both inequalities hold, then the effect of that M term in the right-hand side of equation 2.2 is to tend to move x(M₀) in the direction of increase of A(L, M₀). Reversing the second inequality tends to move x(M₀) so as to decrease A(L, M₀). Stated informally, the derived learning rule has the effect that each node M₀ develops to become more (respectively less) responsive to L if it is relatively strongly (respectively weakly) connected to nodes M that are themselves relatively strongly responsive to L. Three elements, namely Hebb-like modification and cooperation and competition among output nodes for "territory" in the input signal space, are apparent in this description. I emphasize that while we specified the activity dynamics of the processing stage, that is, the relationship between input and output (equation 2.1) as a function of the parameters {x(M)}, we made no assumptions concerning the form of the learning rule by which the {x(M)} are to be adjusted. It is striking that a learning rule combining Hebb-like modification and cooperative and competitive learning in a specific way emerges as the gradient of an important information-theoretic function, the average mutual information between input and output. Our model and result can be mapped directly onto a simple ecological problem in which, for example, different organisms M are differently suited to obtain various types of food L, or food at different locations L. Rather than denoting a rate of transfer of signaling information, R in this case provides a measure of the extent to which the statistical structure of M space reflects, or is matched to, the structure of L space.
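The rule can be checked numerically with a small one-dimensional sketch (ours, not the paper's two-dimensional periodic simulation): with gaussian A and g, averaging Z of equation 2.2 over a discrete uniform input ensemble gives the gradient of the discretized information rate of equation 1.1, so repeated small updates should increase it. The ensemble size, node count, and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta, eta = 20.0, 1.0, 5e-4

Ls = np.linspace(0.0, 1.0, 50)          # discrete 1-D input ensemble, uniform P_L
x = np.sort(rng.uniform(0.0, 1.0, 5))   # positions x(M) of 5 output nodes
idx = np.arange(5)
G = np.exp(-beta * (idx[:, None] - idx[None, :]) ** 2)
G /= G.sum(axis=1, keepdims=True)       # g(M', M), with sum over M equal to 1

def P_of_M_given_L(x):
    A = np.exp(-alpha * (Ls[:, None] - x[None, :]) ** 2)   # A(L, M')
    return A, (A @ G) / A.sum(axis=1, keepdims=True)       # eq. 2.1

def info_rate(P):
    PM = P.mean(axis=0)                                    # P_M(M)
    return np.mean(np.sum(P * np.log(P / PM), axis=1))     # discretized eq. 1.1

R0 = info_rate(P_of_M_given_L(x)[1])
for _ in range(500):
    A, P = P_of_M_given_L(x)
    logratio = np.log(P / P.mean(axis=0))
    dA = 2.0 * alpha * (Ls[:, None] - x[None, :]) * A      # dA(L, M0)/dx(M0)
    Z = (dA / A.sum(axis=1, keepdims=True)) * (
        logratio @ G.T - np.sum(logratio * P, axis=1, keepdims=True))  # eq. 2.2
    x = x + eta * Z.mean(axis=0)        # averaged rule = gradient ascent on R
R1 = info_rate(P_of_M_given_L(x)[1])
assert R1 > R0                          # the information rate has increased
```

Because the averaged update is the exact gradient of the discretized R, small steps increase R; the online rule in the text replaces the average with single presentations.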
The R function is thus a candidate for a function that may be locally optimized (subject to developmental constraints) by familiar mechanisms such as adaptation and competition, at least in sufficiently simple ecological models.

3 Neighbor-Preserving Map Formation
Figure 1 shows the emergence of a neighbor-preserving or "topographic" map as the result of performing gradient ascent in R, for a case in which the L and M spaces are both two-dimensional. (To reduce boundary effects, periodic boundary conditions are imposed; each space can thus be regarded as the surface of a torus.) For this computation P_L(L) is uniform, A(L, M′) ∝ exp(−α|L − x(M′)|²), g(M′, M) ∝ exp(−β|M′ − M|²), and | · | denotes Euclidean distance going the "short way around" the periodic L or M space. The generation of a neighbor-preserving map by a connection modification rule is of course not new (e.g., von der Malsburg and Willshaw 1977; Kohonen 1982). The point of interest
Figure 1: Gradient ascent in the information rate R induces a neighbor-preserving map. The input space L is a unit square. The M space consists of a 10 x 10 square array of nodes. Periodic boundary conditions are imposed (see text). Each node (i, j) (i, j = 1, ..., 10) is initially mapped onto a point x(i, j) in L space, which is randomly chosen from a uniform distribution on a square of side s = 0.7 centered at (0.1i, 0.1j). Thus, a very coarse topographic ordering is initially present. [If the initial {x(i, j)} are entirely random (s = 1), a map having partially disrupted topographic order and a lower lying local maximum of R is obtained.] At each iteration, x(i, j) is changed by Δx(i, j) = (k/K) Σ_{a=1}^{K} Z[L_a, M(i, j)] (see equation 2.2), where {L_a} is an ensemble of input vectors. Parameter values are α = 20, β = 4/9, k = 1, and K = 900 ({L_a} is a 30 x 30 array of points). Plots show the links connecting each x(i, j) with x(i+1, j) and x(i, j+1), after 0 (upper left), 10 (upper right), 15 (lower left), and 40 (lower right) iterations.
is rather that an optimization principle and learning rule yielding this result have emerged from information-theoretic considerations. Figure 1 shows that a square grid in M space is optimally mapped onto a square grid in L space. That is, the "magnification factors" of the mapping M → x(M) are the same in both coordinate directions, and orthogonality of the coordinate axes is preserved. In the next section we prove that this is a consequence of the principle of maximum information preservation under conditions that are more general than those of Figure 1.

4 Coarse-Grained Information Rate
We can derive a useful "coarse-grained" version of equations 1.1 and 2.1 under certain conditions. Suppose that A(L, M′) is negligible for |L − x(M′)| > a₀ and that g(M′, M) is negligible for |M′ − M| > g₀. Suppose also that the following approximations can be made: (1) The mapping M → x(M), which we will call the embedding of the M space in L space, is linear over a local region that is large compared with the length scales a₀ and g₀. (2) For each L, firing is confined to a single such local region in M space. (3) The firing rate is uniform over such a local region. (We will consider a two-dimensional M space, but the generalization to other dimensionalities is straightforward.) Figure 2a shows an orthogonal coordinate grid and unit vectors u, v in M space, and Figure 2b shows a disk cut from the (arbitrary) linear mapping of this grid onto a two-dimensional subspace of L. The mapping is characterized by the lengths f and g of the images of u and v under the mapping, and by the angle θ between them. An area element dM in M space is thus mapped onto an area dL = c dM in L space, where c = fg sin θ. For definiteness we choose A(L, M′) ∝ exp(−α|L − x(M′)|²) and g(M′, M) = (β/π) exp(−β|M′ − M|²). The density of nodes in M space is uniform, and we pass to the continuum limit, so that sums over M become integrals over area elements in M space.
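The area scale factor c can be checked directly: the local Jacobian of the embedding has columns equal to the images of u and v, and its determinant is fg sin θ (a small sketch with illustrative values).

```python
import numpy as np

# Unit vectors u, v in M space map to vectors of lengths f and g in L space,
# with angle theta between them; the area scale factor is |det J| = f g sin(theta).
f, g, theta = 1.5, 0.8, np.pi / 3          # illustrative values
fu = np.array([f, 0.0])                    # image of u
gv = np.array([g * np.cos(theta), g * np.sin(theta)])  # image of v
J = np.column_stack([fu, gv])              # local Jacobian of the embedding
c = abs(np.linalg.det(J))
assert np.isclose(c, f * g * np.sin(theta))   # dL = c dM with c = f g sin(theta)
```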
4.1 Derivation - Qualitative Aspects. We can now express R-maximization as a geometric optimization problem. We first describe qualitatively the main geometric effects that arise. (1) By equation 1.1 we have R = R₁ + R₂ where R₁ = −∫ dM P_M(M) log P_M(M) and R₂ = ∫ dL P_L(L) ∫ dM P(M | L) log P(M | L). (2) The quantity R₁ is the entropy of the pdf P_M(M), and is a maximum when P_M(M) is uniform over M. An example of an embedding that achieves this maximum is one in which the density of nodes M mapped to each region of L space is proportional to P_L(L). (3) The quantity R₂ is the average over input vectors L of the negative of the entropy of the pdf P(M | L). Its value is greater when the embedding is chosen so that the P(M | L) distribution for each L is more sharply localized to a small region of M space. (The intuitive idea is that if each input vector activates fewer nodes, then one's ability to discriminate among the possible input vectors, given knowledge of which node fired, is improved.) This sharpened localization of P(M | L) is achieved in two ways: (a) Since the spread of activation due to the feedforward process A(L, M′) has fixed extent in L space, lowering the density of nodes M′ in the vicinity of L tends to localize A(L, M′), and thereby P(M | L), to a smaller region of M space. This effect favors spreading out the embedding over a larger region of L space. The balance between this effect and the tendency to cluster the nodes in regions of high P_L(L) (item 2 above) determines the "magnification factor" of the mapping (next section). (b) When viewed in L space (Fig. 2b), the contour lines of A(L, M′) are circular, but those of g(M′, M) are in general elliptical. When f = g and sin θ = 1 (Fig. 2c) the contour lines of g(M′, M) become circular in L space, and P(M | L), which is proportional to the convolution of A(L, M′) and g(M′, M), becomes more sharply localized, as shown at the end of this section.

Figure 2: Locally linear mapping of the M-space coordinate system onto a region of L space. (a) Orthonormal vectors u, v and coordinate grid in M space. (b) Arbitrary linear mapping of this grid onto a region of L space. (c) A mapping that maximizes information rate under conditions stated in text.

4.2 Mathematical Details. To derive the coarse-grained information rate R we need to express R₂ in terms of the geometric properties of the embedding, such as the values of (f, g, θ) at each M. The qualitative statements of the previous paragraph apply to a variety of functional forms for A(L, M′) and g(M′, M), and one can, in general, compute the entropy of the P(M | L) distribution numerically. However, when A(L, M′) and g(M′, M) have the gaussian forms assumed above, we can proceed analytically.
The derivation is outlined in the remainder of this paragraph, which may be skipped by the reader interested only in the result and its consequences.

Ralph Linsker

(1) Rewriting P(M | L) of equation 2.1 as a ratio of integrals in L space (using dL' = c dM'), we find that P(M | L) is a two-dimensional gaussian function of distance: P(M | L) = (c/π)(α₊α₋)^{1/2} exp(−α₊ξ² − α₋η²), where α_±⁻¹ = a⁻¹ + b⁻¹[h ± (h² − r²)^{1/2}] and h = (f² + g²)/2. We define L₀ as the point in the embedded M sheet that lies closest to the input vector L, and (ξ, η) as the components of the vector [s(M) − L₀] along the major and minor axes of the elliptical contour lines of g(M', M) in L space. The values of (f, g, θ) at L₀ are used, since the activation for given L is confined to a local region of the embedding centered at L₀. (2) The negative of the entropy of the P(M | L) distribution for given L is u(L) = ∫ dM P(M | L) log P(M | L) = log[(α₊α₋)^{1/2} c/(πe)]. (3) Note that u(L) depends on L only through L₀, and that the integral of P_L(L) over all L sharing the same L₀ is P_M(M₀)/c(M₀), where M₀ is a node in the vicinity of L₀. Therefore R₂ = ∫ dL P_L(L) u(L) = ∫ dM P_M(M) u. Some algebraic manipulation then yields the result stated below.

4.3 Results. The coarse-grained information rate we derive is R = ∫ dM r(M) with
r(M) = p_M(M) log { [(fg sin θ)² α₊α₋]^{1/2} / (π e p_M(M)) }    (4.1)
where (f, g, θ) are functions of M, and p_M(M) is the firing probability per unit area in M space. The firing probability per unit area of the embedding in L space is q(M) ≡ P_M(M)/c, which depends on the P_L(L) distribution and the shape of the embedded surface in L space but is independent of (f, g, θ). When the stated approximations are valid, we see that rate maximization has become a geometric problem: that of embedding an M sheet in L space subject to constraints (such as boundary conditions) so as to maximize the integral over M of equation 4.1. (Only a portion of the mapping from M to L might satisfy the stated approximations. In that case only the contribution of that portion to R is being considered here.) The bracket in equation 4.1, (fg sin θ)² α₊α₋, can be written as Ξ⁻¹ with Ξ = [(α + βρ)² + 2αβρ²ε], where α ≡ b⁻¹, β ≡ a⁻¹, ρ ≡ 1/c, and ε ≡ −c + (f² + g²)/2. When L and M have the same dimension, ρ is called the "magnification factor." Note that ε ≥ 0, with ε = 0 only when f = g = ρ^{-1/2} and sin θ = 1; that is, when a square coordinate grid in M maps onto a square grid in L (Fig. 2c). Any deviation from this square mapping makes a negative contribution to R, akin to a "surface distortion energy" cost term.

5 Magnification Factor
Now consider a number, N, of local regions of the embedding in L space (they need not be near each other). The kth such region has area ΔL_k in L
space, ρ-value ρ_k, firing rate q_k per unit area in L space, and information rate r_k = r(M) per unit area in M space, where M is a node in region k. Assume that f = g = ρ^{-1/2} and sin θ = 1 for each region, so that ε = 0 as derived above. How should a fixed total area of M space (this area is Σ_k ρ_k ΔL_k) be allocated among the N regions so as to maximize the total contribution, Σ_k r_k ρ_k ΔL_k, to R from these regions? The result (obtained using equation 4.1 for each r_k, and the Lagrange multiplier method) is that the ρ_k should be chosen such that the value of (ρ_k + ρ_k² β/α)/q_k is the same for all k. We thus obtain an "equation of state" relating ρ (the area of the M sheet that maps onto a unit area in L space) to the firing probability q of that region of the M sheet (measured per unit area in L space):

ρ = −t/2 + [(t²/4) + λtq]^{1/2}    (5.1)

where t ≡ α/β, and λ is chosen so that the total area of M space being allocated has the desired value. Note that if the L space is two-dimensional and the mapping is bijective, q ∝ P_L(L). Our "equation of state" has two limiting regimes.

and all n' sufficiently large, given the uniform convergence and continuity just established. Now λ(w₀) − λ(w) = [λ(w₀) − λ̄_{n'}(ŵ_{n'})] + [λ̄_{n'}(ŵ_{n'}) − λ̄_{n'}(w)] + [λ̄_{n'}(w) − λ(w)] ≤ 3ε for any ε and all n' sufficiently large, because λ(w₀) − λ̄_{n'}(ŵ_{n'}) ≤ 2ε as just established, λ̄_{n'}(ŵ_{n'}) − λ̄_{n'}(w) ≤ 0 by the optimality of ŵ_{n'}, and λ̄_{n'}(w) − λ(w) ≤ ε by uniform convergence. Because ε is arbitrary, λ(w₀) ≤ λ(w), and because w is arbitrary, w₀ ∈ W*. Because the subsequence {ŵ_{n'}} is arbitrary, every limit point w₀ of such a sequence belongs to W*. Now suppose that inf_{w* ∈ W*} ||ŵ_n − w*|| does not tend to 0. Then there exist ε > 0 and a subsequence {n'} such that ||ŵ_{n'} − w*|| ≥ ε for all n' and all w* ∈ W*. But {ŵ_{n'}} has a limit point that, by the preceding argument, must belong to W*. This is a contradiction to ||ŵ_{n'} − w*|| ≥ ε for all n', so inf_{w* ∈ W*} ||ŵ_n − w*|| → 0. Because the realization of {Z_t} is chosen from a set with probability 1, the conclusion follows. □
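The "equation of state" 5.1 can be exercised numerically. The sketch below is our own illustration with made-up constants α, β and made-up region data: it solves for the multiplier λ by bisection so that a fixed total M-sheet area is allocated, then confirms the Lagrange condition that (ρ_k + ρ_k²β/α)/q_k is the same in every region, and that regions with higher firing probability q receive a larger magnification factor ρ.

```python
import numpy as np

# Illustrative constants and region data (all made up).
alpha, beta = 1.0, 0.5
t = alpha / beta
q = np.array([0.2, 1.0, 3.0, 7.5])    # firing probability per unit L area, by region
dL = np.array([1.0, 2.0, 0.5, 1.0])   # region areas in L space
target = 10.0                         # total M-sheet area to be allocated

def rho(lmbda):
    # equation 5.1: rho = -t/2 + sqrt(t^2/4 + lambda*t*q)
    return -t / 2.0 + np.sqrt(t**2 / 4.0 + lmbda * t * q)

lo, hi = 1e-9, 1e6                    # bisection for the multiplier lambda
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if float(np.sum(rho(mid) * dL)) < target:
        lo = mid
    else:
        hi = mid
lmbda = 0.5 * (lo + hi)
r = rho(lmbda)
state = (r + r**2 * beta / alpha) / q  # the equalized Lagrange quantity (= lambda)
```

Algebraically, squaring equation 5.1 gives ρ²β/α + ρ = λq, so `state` equals λ in every region; the bisection only enforces the total-area constraint.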
Theorem 2. Let (Ω, F, P), {Z_t}, W, and l be as in Theorem 1, and suppose that ŵ_n → w* a.s.-P, where w* is an isolated element of W* interior to W. Suppose in addition that for each z in R^v, l(z, ·) is continuously differentiable of order 2 on W; that E(∇l(Z_t, w*)′ ∇l(Z_t, w*)) < ∞; and that each
Halbert White
458
element of ∇²l is dominated on W by an integrable function; and that A* ≡ E(∇²l(Z_t, w*)) and B* ≡ E(∇l(Z_t, w*) ∇l(Z_t, w*)′) are nonsingular (s × s) matrices, where ∇ and ∇² denote the (s × 1) gradient and (s × s) Hessian operators with respect to w. Then √n(ŵ_n − w*) ⇒ N(0, C*), where C* = A*⁻¹ B* A*⁻¹. If in addition each element of ∇l ∇l′ is dominated on W by an integrable function, then Ĉ_n → C* a.s.-P, where Ĉ_n = Â_n⁻¹ B̂_n Â_n⁻¹, Â_n = n⁻¹ Σ_{t=1}^n ∇²l(Z_t, ŵ_n), and B̂_n = n⁻¹ Σ_{t=1}^n ∇l(Z_t, ŵ_n) ∇l(Z_t, ŵ_n)′. □
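The estimators Â_n, B̂_n, and Ĉ_n of Theorem 2 are easy to compute in practice. The sketch below is a toy illustration of our own (the model, sample size, and parameter values are made up): a one-parameter "network" y = tanh(wx) + noise is fit by nonlinear least squares, and the sandwich covariance estimate Ĉ_n = Â_n⁻¹ B̂_n Â_n⁻¹ is formed from the analytic score and Hessian of the loss l(z, w) = (y − tanh(wx))²/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, w_true, sigma = 4000, 0.8, 0.3
x = rng.normal(size=n)
y = np.tanh(w_true * x) + sigma * rng.normal(size=n)

def score_and_hess(w):
    # per-observation dl/dw and d2l/dw2 for l = 0.5*(y - tanh(w*x))^2
    g = np.tanh(w * x)
    dg = (1.0 - g**2) * x                 # d tanh(wx)/dw
    d2g = -2.0 * g * (1.0 - g**2) * x**2  # d2 tanh(wx)/dw2
    return -(y - g) * dg, dg**2 - (y - g) * d2g

w_hat = 0.5
for _ in range(50):                       # Newton iterations to the NLLS optimum
    s, h = score_and_hess(w_hat)
    w_hat -= np.mean(s) / np.mean(h)

s, h = score_and_hess(w_hat)
A_hat = float(np.mean(h))                 # sample analogue of A* (scalar case)
B_hat = float(np.mean(s**2))              # sample analogue of B*
C_hat = B_hat / A_hat**2                  # sandwich estimate of the asymptotic variance
```

In the scalar case the matrix inverses collapse to division; for s parameters the same recipe uses `np.linalg.inv` on the averaged Hessian.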
See Gallant and White (1988, chap. 5) for a proof of a more general result. The steps of the proof of the present result are completely analogous to those of Gallant and White. The following four results are proven in White (1989b). The notation has been adapted to correspond to that used here. The function m appearing in these results corresponds to -V1 in the results above; use of m permits treatment of situations in which learning need not optimize a performance measure.
Theorem 3. (White 1989b, Proposition 3.1). Let {Z_t} be a sequence of i.i.d. random v × 1 vectors such that |Z_t| ≤ Δ < ∞. Let m: R^v × R^s → R^s be continuously differentiable on R^v × R^s, and suppose that for each w in R^s, M(w) ≡ E(m(Z_t, w)) < ∞. Let {η_n ∈ R⁺} be a decreasing sequence such that Σ_{n=1}^∞ η_n = ∞, lim_{n→∞} sup(η_n⁻¹ − η_{n-1}⁻¹) < ∞, and Σ_{n=1}^∞ η_n^d < ∞ for some d > 1. Define the recursive m-estimator ŵ_n = ŵ_{n-1} + η_n m(Z_n, ŵ_{n-1}), n = 1, 2, …, where ŵ_0 ∈ R^s is arbitrary. (a) Suppose there exists λ: R^s → R twice continuously differentiable such that ∇λ(w)′ M(w) ≤ 0 for all w in R^s. Then either ŵ_n → W₀ ≡ {w : ∇λ(w)′ M(w) = 0} or ŵ_n → ∞ with probability one (w.p.1). (b) Suppose w* ∈ R^s is such that P[ŵ_n → S_ε*] > 0 for all ε > 0, where S_ε* ≡ {w ∈ R^s : ||w − w*|| < ε}. Then M(w*) = 0. If in addition M is continuously differentiable in a neighborhood of w* with ∇M* ≡ ∇M(w*) finite, and if J* ≡ E(m(Z_t, w*) m(Z_t, w*)′) is finite and positive definite, then ∇M* has all eigenvalues in the left half-plane. (c) Suppose the conditions of (a) hold, that M(w) = −∇λ(w), that λ(w) has isolated stationary points, and that the conditions of (b) hold for each w* ∈ {w : ∇λ(w) = 0}. Then as n → ∞ either ŵ_n tends to a local minimum of λ(w) w.p.1 or ŵ_n → ∞ w.p.1. □
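The recursive m-estimator of Theorem 3 is just stochastic approximation. The following sketch (our own toy example, with made-up constants) runs the recursion ŵ_n = ŵ_{n-1} + η_n m(Z_n, ŵ_{n-1}) with m = −∇l for the same toy model as above and decreasing gains η_n of order 1/n.

```python
import numpy as np

rng = np.random.default_rng(1)
w_true, sigma, n_steps = 0.8, 0.3, 50_000
w = 0.0
for n in range(1, n_steps + 1):
    # draw one fresh observation Z_n = (y, x)
    x = rng.normal()
    y = np.tanh(w_true * x) + sigma * rng.normal()
    g = np.tanh(w * x)
    m = (y - g) * (1.0 - g**2) * x     # m(Z_n, w) = -dl/dw
    w += (2.0 / (n + 20)) * m          # gains eta_n ~ c/n: sum = inf, sum of squares < inf
```

The constant 2.0 and the offset 20 are arbitrary tuning choices for the gain sequence; any sequence meeting the theorem's summability conditions behaves similarly in the limit.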
Theorem 4. (White 1989b, Proposition 4.1). Let the conditions of Theorem 3(a,b) hold, and suppose also that |m(Z_n, w)| ≤ Δ < ∞ a.s. for all w in R^s. Let ζ* be the maximum value of the real parts of the eigenvalues of ∇M*, and suppose ζ* < −1/2. Define J(w) ≡ var[m(Z_t, w)] and suppose J is continuous on a neighborhood of w*. Set J* ≡ J(w*) and η_n = n⁻¹.
Learning in Artificial Neural Networks
459
Then the sequence of random elements T_n(a) of C^s[0,1] with sup norm, a ∈ [0,1], defined by T_n(a) = n^{-1/2} S_{[na]} + (na − [na]) n^{-1/2} (S_{[na]+1} − S_{[na]}), with S_n = n(ŵ_n − w*), converges in distribution to a Gaussian Markov process G with G(a) = exp[(ln a)(I + ∇M*)] ∫₀^a exp[−(ln t)(∇M* + I)] dW(t), a ∈ (0,1], where W is a Brownian motion in R^s with W(0) = 0, E(W(1)) = 0, and E(W(1) W(1)′) = J*. In particular, n^{1/2}(ŵ_n − w*) ⇒ N(0, F*), where F* = ∫₀¹ exp(−(ln t)[∇M* + I]) J* exp(−(ln t)[∇M*′ + I]) dt is the unique solution to the equation (∇M* + I/2) F* + F*(∇M*′ + I/2) = −J*. When ∇M* is symmetric, F* = P H P⁻¹, where P is the orthogonal matrix such that P Z P⁻¹ = −∇M*, with Z the diagonal matrix containing the (real) eigenvalues (ζ₁, …, ζ_s) of −∇M* in decreasing order, and H the s × s matrix with elements H_ij = (ζ_i + ζ_j − 1)⁻¹ K*_ij, i, j = 1, …, s, where [K*_ij] = K* = P⁻¹ J* P. □
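The matrix F* of Theorem 4 solves a Lyapunov equation, which can be computed directly. The sketch below (made-up matrices, chosen so that the eigenvalues of ∇M* lie to the left of −1/2) solves (∇M* + I/2)F* + F*(∇M*′ + I/2) = −J* by Kronecker-product vectorization and verifies the residual.

```python
import numpy as np

I2 = np.eye(2)
dM = np.array([[-2.0, 0.3],
               [ 0.0, -1.5]])   # stand-in for the Jacobian of M at w*
J = np.array([[1.0, 0.2],
              [0.2, 0.8]])      # stand-in for J*, positive definite
A = dM + 0.5 * I2
# vec(A F + F A') = (I kron A + A kron I) vec(F), using column-major vec
K = np.kron(I2, A) + np.kron(A, I2)
F = np.linalg.solve(K, (-J).flatten(order="F")).reshape(2, 2, order="F")
residual = A @ F + F @ A.T + J  # should vanish
```

Because A is stable (eigenvalues −1.5 and −1.0) and J is symmetric positive definite, the solution F is symmetric positive definite, as the asymptotic covariance must be.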
Theorem 5. (White 1989b, Proposition 5.1). Let M: R^s → R^s have unique zero w* interior to a convex compact set W ⊂ R^s, and suppose M is continuously differentiable on W with ∇M* finite and nonsingular. Let (Ω, F, P) be a probability space, and suppose there exists a sequence {M_n: Ω × W → R^s} such that for each w in W, M_n(·, w) is measurable-F, and for each ω in Ω, M_n(ω, ·) is continuously differentiable on W with Jacobian ∇M_n(ω, ·). Suppose that for some positive definite matrix B*, n^{1/2} M_n(·, w*) ⇒ N(0, B*), and that M_n(·, w) − M(w) → 0 and ∇M_n(·, w) − ∇M(w) → 0 a.s.-P uniformly on W. Let {ŵ_n: Ω → R^s} be a measurable sequence such that ŵ_n → w* a.s. and n^{1/2}(ŵ_n − w*) is O_P(1). Then, with M̄_n ≡ M_n(·, ŵ_n) and ∇M̄_n ≡ ∇M_n(·, ŵ_n), w̃_n ≡ ŵ_n − ∇M̄_n⁻¹ M̄_n is such that w̃_n → w* a.s. and n^{1/2}(w̃_n − w*) ⇒ N(0, C*), where C* = A*⁻¹ B* A*⁻¹′ and A* ≡ ∇M*. If there exists {B̂_n} such that B̂_n → B* a.s., then with Â_n ≡ ∇M̄_n we have Ĉ_n ≡ Â_n⁻¹ B̂_n Â_n⁻¹′ → C* a.s. □
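The one-step correction of Theorem 5 is cheap to apply after a recursive pass through the data. The sketch below (a toy model of our own, with made-up constants) runs a crude recursive estimator and then forms w̃_n = ŵ_n − ∇M̄_n⁻¹ M̄_n from the full-sample moment; the corrected estimate drives the sample moment close to zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, w_true, sigma = 20_000, 0.8, 0.3
x = rng.normal(size=n)
y = np.tanh(w_true * x) + sigma * rng.normal(size=n)

def M_n(w):
    # sample moment M_n(w) = n^{-1} sum m(Z_t, w), with m = -dl/dw
    g = np.tanh(w * x)
    return float(np.mean((y - g) * (1.0 - g**2) * x))

def dM_n(w):
    # its derivative in w (analytic)
    g = np.tanh(w * x)
    dg = (1.0 - g**2) * x
    return float(np.mean(-dg**2 + (y - g) * (-2.0 * g) * (1.0 - g**2) * x**2))

w_hat = 0.0
for t in range(n):                      # crude recursive pass (Theorem 3 style)
    g = np.tanh(w_hat * x[t])
    w_hat += (2.0 / (t + 20)) * (y[t] - g) * (1.0 - g**2) * x[t]

w_tilde = w_hat - M_n(w_hat) / dM_n(w_hat)   # one Newton correction
```

In the scalar case the Jacobian inverse is a division; Theorem 6 guarantees the correction never enlarges the asymptotic covariance.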
Theorems 3 and 4 establish that the recursive m-estimator satisfies the conditions required of ŵ_n here. The utility of Theorem 5 is that w̃_n can afford an improvement over ŵ_n in the sense of having a smaller asymptotic covariance matrix.

Theorem 6. (White 1989b, Proposition 5.2). Let the conditions of Theorem 5 hold with w* an isolated zero of M(w) ≡ E(m(Z_t, w)), and let W be a convex compact neighborhood of w*. Put M_n(·, w) = n⁻¹ Σ_{t=1}^n m(Z_t, w), so that ∇M_n(·, w) = n⁻¹ Σ_{t=1}^n ∇m(Z_t, w), and suppose that ∇m is dominated on W by an integrable function. Let ŵ_n be the recursive m-estimator and define w̃_n = ŵ_n − ∇M̄_n⁻¹ M̄_n, n = 1, 2, …. Then the conclusions of Theorem 5 hold and F* − C* is positive semidefinite. □

In stating the final result, we make use of mixing measures of stochastic dependence, in particular, uniform (φ-) and strong (α-) mixing.
These are defined as

φ(k) ≡ sup_t sup_{F ∈ F₁^t, G ∈ F_{t+k}^∞ : P(F) > 0} |P(G | F) − P(G)|,
α(k) ≡ sup_t sup_{F ∈ F₁^t, G ∈ F_{t+k}^∞} |P(G ∩ F) − P(G) P(F)|,

where F₁^t ≡ σ(Z₁, …, Z_t) is the σ-field generated by {Z₁, …, Z_t}, and F_t^∞ ≡ σ(Z_t, Z_{t+1}, …) is the σ-field generated by {Z_t, Z_{t+1}, …}. For a discussion of φ(k) and α(k) and the properties of mixing processes {Z_t} [i.e., processes for which φ(k) → 0 or α(k) → 0 as k → ∞], we refer to White (1984b). The sets Θ_n(ψ) and T(ψ, q, Δ) are as defined in Section 5.
Theorem 7. (White 1988, Theorem 4.5). Suppose that the observed data are the realization of a stochastic process {Z_t: Ω → I^{r+1}, t = 1, 2, …}, I = [0,1], on the complete probability space (Ω, F, P), and that P is such that either (1) {Z_t} is i.i.d. or (2) {Z_t} is a stationary mixing process with either φ(k) = φ₀ρ₀^k or α(k) = α₀ρ₀^k, k ≥ 1, for some constants φ₀, α₀ > 0, 0 < ρ₀ < 1. Suppose that θ₀ is the unique element of Θ = L₂(I^r, μ) such that E(Y_t | X_t) = θ₀(X_t). Put Θ_n(ψ) = T(ψ, q_n, Δ_n), where ψ is a cumulative distribution function satisfying a Lipschitz condition, and {q_n} and {Δ_n} are such that q_n and Δ_n are increasing with n, q_n → ∞ and Δ_n → ∞ as n → ∞, Δ_n = o(n^{1/4}), and either (1) q_n Δ_n² log q_n Δ_n = o(n) (for i.i.d. {Z_t}) or (2) q_n Δ_n² log q_n Δ_n = o(n^{1/2}) (for mixing {Z_t}). Then there exists a measurable connectionist sieve estimator θ̂_n: Ω → Θ such that

n⁻¹ Σ_{t=1}^n [Y_t − θ̂_n(X_t)]² − inf_{θ ∈ Θ_n(ψ)} n⁻¹ Σ_{t=1}^n [Y_t − θ(X_t)]² → 0 prob-P.

Further, ρ(θ̂_n, θ₀) → 0 prob-P. □
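In the spirit of Theorem 7, the sketch below fits a single-hidden-layer "sieve" whose width q_n grows slowly with the sample size. To keep the sketch short it draws bounded random hidden weights and fits only the output layer by least squares, which is a simplification: the theorem's estimator optimizes over all weights in Θ_n(ψ). The target θ₀, the growth rule for q_n, and all constants are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
theta0 = lambda u: np.sin(2.0 * np.pi * u)   # target regression E(Y|X=x) on I = [0,1]

def sieve_error(n):
    q = max(3, int(round(n ** (1.0 / 3.0))))  # hidden-layer width q_n grows slowly with n
    x = rng.uniform(size=n)
    y = theta0(x) + 0.2 * rng.normal(size=n)
    a = rng.uniform(-5.0, 5.0, size=q)        # bounded random hidden weights
    b = rng.uniform(-5.0, 5.0, size=q)        # (a crude stand-in for the Delta_n bound)
    H = np.column_stack([np.ones(n), np.tanh(np.outer(x, a) + b)])
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    grid = np.linspace(0.0, 1.0, 400)
    Hg = np.column_stack([np.ones(grid.size), np.tanh(np.outer(grid, a) + b)])
    return float(np.sqrt(np.mean((Hg @ beta - theta0(grid)) ** 2)))  # L2 distance to theta0

err_small, err_large = sieve_error(50), sieve_error(20_000)
```

Typically the L₂ distance to θ₀ shrinks as n (and with it q_n) grows, which is the consistency conclusion ρ(θ̂_n, θ₀) → 0 in miniature.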
Acknowledgments The author is indebted to Richard Durbin, Mark Salmon, and the editor for helpful comments and references. This work was supported by National Science Foundation Grant SES-8806990.
References

Andrews, D. W. K. 1988. Asymptotic normality of series estimators for various nonparametric and semi-parametric estimators. Yale University, Cowles Foundation Discussion Paper 874.
Ash, T. 1989. Dynamic node creation in backpropagation networks. Poster presentation, International Joint Conference on Neural Networks, Washington, D.C.
Barron, A. 1989. Statistical properties of artificial neural networks. Paper presented to the IEEE Conference on Decision and Control.
Bartle, R. 1966. The Elements of Integration. Wiley, New York.
Baum, E., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Billingsley, P. 1979. Probability and Measure. Wiley, New York.
Blum, J. 1954. Multivariate stochastic approximation methods. Ann. Math. Stat. 25, 737-744.
Carroll, S. M., and Dickinson, B. W. 1989. Construction of neural nets using the Radon transform. In Proceedings of the International Joint Conference on Neural Networks, Washington, D.C., pp. I:607-611. IEEE, New York.
Cerny, V. 1985. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. J. Opt. Theory Appl. 45, 41-51.
Cox, D. 1984. Multivariate smoothing splines. SIAM J. Numerical Anal. 21, 789-813.
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control Signals Sys. 2, 303-314.
Davies, R. B. 1977. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 64, 247-254.
Davies, R. B. 1987. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 74, 33-43.
Davis, L. (ed.) 1987. Genetic Algorithms and Simulated Annealing. Pitman, London.
Domowitz, I., and White, H. 1982. Misspecified models with dependent observations. J. Economet. 20, 35-50.
Efron, B. 1982. The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Gallant, A. R., and Nychka, D. 1987. Semi-nonparametric maximum likelihood estimation. Econometrica 55, 363-390.
Gallant, A. R., and White, H. 1988. A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Basil Blackwell, Oxford.
Geman, S., and Hwang, C. 1982. Nonparametric maximum likelihood estimation by the method of sieves. Ann. Stat. 10, 401-414.
Goldberg, D. 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.
Golden, R. 1988. A unified framework for connectionist systems. Biological Cybernetics 59, 109-120.
Grenander, U. 1981. Abstract Inference. Wiley, New York.
Hajek, B. 1985. A tutorial survey of theory and applications of simulated annealing. In Proceedings of the 24th IEEE Conference on Decision and Control, pp. 755-760.
Hajek, B. 1988. Cooling schedules for optimal annealing. Math. Operations Res. 13, 311-329.
Haussler, D. 1989. Generalizing the PAC model for neural net and other learning applications. UCSC Computer Research Laboratory Tech. Rep. UCSC-CRL-89-30.
Hecht-Nielsen, R. 1989. Theory of the back-propagation neural network. In Proceedings of the International Joint Conference on Neural Networks, Washington, D.C., pp. I:593-606. IEEE, New York.
Hirose, Y., Yamashita, K., and Hijiya, S. 1989. Back-propagation algorithm which varies the number of hidden units. Poster presentation, International Joint Conference on Neural Networks, Washington, D.C.
Holland, J. 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.
Hornik, K., Stinchcombe, M., and White, H. 1989a. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-368.
Hornik, K., Stinchcombe, M., and White, H. 1989b. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. UCSD Department of Economics Discussion Paper.
Jennrich, R. 1969. Asymptotic properties of nonlinear least squares estimators. Ann. Math. Stat. 40, 633-643.
Kiefer, J., and Wolfowitz, J. 1952. Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23, 462-466.
Kirkpatrick, S., Gelatt, C. D., Jr., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671-680.
Kuan, C.-M., and White, H. 1989. Recursive M-estimation, nonlinear regression and neural network learning with dependent observations. UCSD Department of Economics Discussion Paper.
Kullback, S., and Leibler, R. A. 1951. On information and sufficiency. Ann. Math. Stat. 22, 79-86.
Kushner, H. 1987. Asymptotic global behavior for stochastic approximations and diffusions with slowly decreasing noise effects: Global minimization via Monte Carlo. SIAM J. Appl. Math. 47, 169-185.
Kushner, H., and Clark, D. 1978. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, Berlin.
Kushner, H., and Huang, H. 1979. Rates of convergence for stochastic approximation type algorithms. SIAM J. Control Optim. 17, 607-617.
Lee, T.-H., White, H., and Granger, C. W. J. 1989. Testing for neglected nonlinearity in time series models: A comparison of neural network methods and alternative tests. UCSD Department of Economics Discussion Paper.
Ljung, L. 1977. Analysis of recursive stochastic algorithms. IEEE Trans. Automatic Control AC-22, 551-575.
Ortega, J., and Rheinboldt, W. 1970. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York.
Parker, D. B. 1982. Learning logic. Invention Report 581-64, File 1, Office of Technology Licensing, Stanford University.
Phillips, P. C. B. 1989. Partially identified econometric models. Econometric Theory 5, 181-240.
Renyi, A. 1961. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium in Mathematical Statistics, Vol. 1, pp. 547-561. University of California Press, Berkeley.
Rheinboldt, W. 1974. Methods for Solving Systems of Nonlinear Equations. SIAM, Philadelphia.
Rinnooy Kan, A. H. G., Boender, C. G. E., and Timmer, G. Th. 1985. A stochastic approach to global optimization. In Computational Mathematical Programming, K. Schittkowski, ed., NATO ASI Series, Vol. F15, pp. 281-308. Springer-Verlag, Berlin.
Robbins, H., and Monro, S. 1951. A stochastic approximation method. Ann. Math. Stat. 22, 400-407.
Rumelhart, D. 1988. Parallel distributed processing. Plenary lecture, IEEE International Conference on Neural Networks, San Diego.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructures of Cognition, D. E. Rumelhart and J. L. McClelland, eds., Vol. 1, pp. 318-362. MIT Press, Cambridge.
Ruppert, D. 1989. Stochastic approximation. In Handbook of Sequential Analysis, B. Ghosh and P. Sen, eds. Marcel Dekker, New York, forthcoming.
Serfling, R. 1980. Approximation Theorems of Mathematical Statistics. Wiley, New York.
Shannon, C. E. 1948. A mathematical theory of communication. Bell System Tech. J. 27, 379-423, 623-656.
Stinchcombe, M., and White, H. 1989. Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. In Proceedings of the International Joint Conference on Neural Networks, Washington, D.C., pp. I:613-617. IEEE, New York.
Stone, M. 1974. Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B 36, 111-133.
Timmer, G. Th. 1984. Global optimization: A stochastic approach. Unpublished Ph.D. Dissertation, Erasmus Universiteit Rotterdam, Centrum voor Wiskunde en Informatica.
Tishby, N., Levin, E., and Solla, S. 1989. Consistent inference of probabilities in layered networks: Predictions and generalization. In Proceedings of the International Joint Conference on Neural Networks, Washington, D.C., pp. II:403-409. IEEE, New York.
van Laarhoven, P. J. M. 1988. Theoretical and Computational Aspects of Simulated Annealing. Centrum voor Wiskunde en Informatica, Amsterdam.
Wahba, G. 1984. Cross-validated spline methods for the estimation of multivariate functions from data on functionals. In Statistics: An Appraisal, H. A. David and H. T. David, eds., pp. 205-235. Iowa State University Press, Ames.
Walk, H. 1977. An invariance principle for the Robbins-Monro process in a Hilbert space. Z. Wahrscheinlichkeitstheor. Verwand. Geb. 39, 135-150.
Werbos, P. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Unpublished Ph.D. Dissertation, Harvard University, Department of Applied Mathematics.
White, H. 1981. Consequences and detection of misspecified nonlinear regression models. J. Am. Stat. Assoc. 76, 419-433.
White, H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50, 1-25.
White, H. 1984a. Maximum likelihood estimation of misspecified dynamic models. In Misspecification Analysis, T. Dijkstra, ed., pp. 1-19. Springer-Verlag, New York.
White, H. 1984b. Asymptotic Theory for Econometricians. Academic Press, New York.
White, H. 1988. Multilayer feedforward networks can learn arbitrary mappings: Connectionist nonparametric regression with automatic and semi-automatic determination of network complexity. UCSD Department of Economics Discussion Paper.
White, H. 1989a. Estimation, Inference and Specification Analysis. Cambridge University Press, New York, forthcoming.
White, H. 1989b. Some asymptotic results for learning in single hidden layer feedforward networks. J. Am. Stat. Assoc., forthcoming.
White, H. 1989c. An additional hidden unit test for neglected nonlinearity in multilayer feedforward networks. In Proceedings of the International Joint Conference on Neural Networks, Washington, D.C., pp. II:451-455. IEEE, New York.
White, H., and Wooldridge, J. 1989. Some results on sieve estimation with dependent observations. In Nonparametric and Semiparametric Methods in Econometrics and Statistics, W. Barnett, J. Powell, and G. Tauchen, eds. Cambridge University Press, New York, forthcoming.
Wiener, N. 1948. Cybernetics. Wiley, New York.
Wooldridge, J. 1989. Some results on specification testing against nonparametric alternatives. MIT Department of Economics Working Paper.
Received 10 August 1989; accepted 26 September 1989.
NOTE
Communicated by Halbert White
Representation Properties of Networks: Kolmogorov's Theorem Is Irrelevant
Federico Girosi
Tomaso Poggio
Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge, MA 02142 USA, and Center for Biological Information Processing, Whitaker College, Cambridge, MA 02142 USA
Many neural networks can be regarded as attempting to approximate a multivariate function in terms of one-input one-output units. This note considers the problem of an exact representation of nonlinear mappings in terms of simpler functions of fewer variables. We review Kolmogorov's theorem on the representation of functions of several variables in terms of functions of one variable and show that it is irrelevant in the context of networks for learning.

1 Kolmogorov's Theorem: An Exact Representation Is Hopeless

A crucial point in approximation theory is the choice of the representation of the approximant function. Since each representation can be mapped into an appropriate network, choosing the representation is equivalent to choosing a particular network architecture. In recent years it has been suggested that a result of Kolmogorov (1957) could be used to justify the use of multilayer networks composed of simple one-input-one-output units. This theorem and a previous result of Arnol'd (1957) can be considered as the definitive disproof of Hilbert's conjecture (his thirteenth problem, Hilbert 1900) that there are continuous functions of three variables not representable as superpositions of continuous functions of two variables. The original statement of Kolmogorov's theorem is the following (Lorentz 1976):
Theorem 1.1. (Kolmogorov 1957). There exist fixed increasing continuous functions h_pq(x) on I = [0,1] such that each continuous function f on Iⁿ can be written in the form

f(x₁, …, x_n) = Σ_{q=1}^{2n+1} g_q( Σ_{p=1}^{n} h_pq(x_p) )

where the g_q are properly chosen continuous functions of one variable.

Neural Computation 1, 465-469 (1989) © 1989 Massachusetts Institute of Technology
Federico Girosi and Tomaso Poggio
466
Figure 1: The network representation of an improved version of Kolmogorov's theorem, due to Kahane (1975). The figure shows the case of a bivariate function. Kahane's representation formula is f(x₁, …, x_n) = Σ_{q=1}^{2n+1} g( Σ_{p=1}^{n} λ_p h_q(x_p) ), where the h_q are strictly monotonic functions and the λ_p are strictly positive constants smaller than 1.
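The superposition of Theorem 1.1 (and Kahane's variant in the caption above) is literally a network with two hidden layers, and its skeleton is trivial to write down; what the theorem does not supply is a usable way to obtain the functions themselves. The sketch below (our own illustration, using smooth stand-in functions rather than Kolmogorov's nonsmooth inner functions) merely evaluates the superposition for caller-supplied g_q and h_pq.

```python
def kolmogorov_net(x, h, g):
    """Evaluate f(x) = sum_{q=1}^{2n+1} g[q]( sum_{p=1}^{n} h[p][q](x[p]) )."""
    n = len(x)
    return sum(g[q](sum(h[p][q](x[p]) for p in range(n)))
               for q in range(2 * n + 1))

# Smooth stand-ins (NOT the theorem's functions): h_pq(t) = (p+1)*t + q, g_q(s) = 2*s.
n = 2
h = [[(lambda t, p=p, q=q: (p + 1) * t + q) for q in range(2 * n + 1)]
     for p in range(n)]
g = [(lambda s, q=q: 2.0 * s) for q in range(2 * n + 1)]
val = kolmogorov_net([0.5, 0.25], h, g)
```

With these linear stand-ins the output can be computed by hand: the inner sum for branch q is 1 + 2q, so the total is Σ_{q=0}^{4} (2 + 4q) = 50.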
This result asserts that every multivariate continuous function can be represented by the superposition of a small number of univariate continuous functions. In terms of networks this means that every continuous function of many variables can be computed by a network with two hidden layers (see Figure 1) whose hidden units compute continuous functions (the functions g_q and h_pq). Does Kolmogorov's theorem, in its present form, prove that a network with two hidden layers is a good and usable representation? The answer is definitely no. There are at least two reasons for this:

1. In a network implementation that has to be used for learning and generalization, some degree of smoothness is required for the functions corresponding to the units in the network. Smoothness of the h_pq and of the g_q is important because the representation must be smooth in order to generalize and be stable against noise. A number of results of Vitushkin (1954, 1977) and Henkin (1964) show, however, that the inner functions h_pq of Kolmogorov's theorem are highly nonsmooth (they can be regarded as "hashing" functions). Because of this "wild" behavior of the inner functions h_pq, the functions g_q need not be smooth, even for differentiable functions f (de Boor 1987).

2. Useful representations for approximation and learning are parametrized representations that correspond to networks with fixed units and modifiable parameters. Kolmogorov's network is not of this type: the form of the g_q (corresponding to units in the second "hidden" layer) depends on the specific function f to be represented (the h_pq are independent of it). g_q is at least as complex, for instance in terms of the bits needed to represent it, as f itself.
A stable and usable exact representation of a function in terms of a network with two or more layers therefore seems hopeless. In fact, the result obtained by Kolmogorov can be considered a "pathology" of the continuous functions: it fails to be true if the inner functions h_q are required to be smooth, as was shown by Vitushkin (1954). The theorem, though mathematically surprising and beautiful, cannot be used by itself in any constructive way in the context of networks for learning. This conclusion seems to echo what Lorentz (1962) wrote, more than 20 years ago, asking "Will it [Kolmogorov's theorem] have useful applications? ... One wonders whether Kolmogorov's theorem can be used to obtain positive results of greater [than trivial] depth." Notice that this leaves open the possibility of finding good and well-founded approximate representations. This argument is discussed at some length in Poggio and Girosi (1989), and a number of results have recently been obtained by several authors (Hornik et al. 1989; Stinchcombe and White 1989; Carroll and Dickinson 1989; Cybenko 1989; Funahashi 1989; Hecht-Nielsen 1989). The next section reviews Vitushkin's main results.

2 The Theorems of Vitushkin
The interpretation of Kolmogorov's theorem in terms of networks is very appealing: the representation of a function requires a fixed number of nodes, polynomially increasing with the dimension of the input space. Unfortunately, these results are somewhat pathological and their practical implications are very limited. The problem lies in the inner functions of Kolmogorov's formula: although they are continuous, theorems of Vitushkin and Henkin (Vitushkin 1964, 1977; Henkin 1964; Vitushkin and Henkin 1967) prove that they must be highly nonsmooth. One could ask if it is
possible to find a superposition scheme in which the functions involved are smooth. The answer is negative, even for functions of two variables, and was given by Vitushkin (1954) in the following theorem:
Theorem 2.1. (Vitushkin 1954). There are r (r = 1, 2, …) times continuously differentiable functions of n ≥ 2 variables not representable by superpositions of r times continuously differentiable functions of fewer than n variables; there are r times continuously differentiable functions of two variables that are not representable by sums and superpositions of continuously differentiable functions of one variable.
We notice that the intuition underlying Hilbert's conjecture and theorem 2.1 is the same: not all functions with a given degree of complexity can be represented in a simple way by means of functions with a lower degree of complexity. The reason for the failure of Hilbert's conjecture is a "wrong" definition of complexity: Kolmogorov's theorem shows that the number of variables is not sufficient to characterize the complexity of a function. Vitushkin showed that such a characterization is possible and gave an explicit formula. Let f be an r times continuously differentiable function defined on Iⁿ with all its partial derivatives of order r belonging to the Lipschitz class of order α on [0,1]ⁿ. Vitushkin puts χ = (r + α)/n and shows that it can be used to measure the inverse of the complexity of a class of functions. In fact, he succeeded in proving the following:
Theorem 2.2. (Vitushkin 1954). Not all functions of a given characteristic χ₀ = q₀/k₀ > 0 can be represented by superpositions of functions of characteristic χ = q/k > χ₀, q ≥ 1.

Theorem 2.1 is easily derived from this result.
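Vitushkin's characteristic is simple arithmetic; the toy computation below (our own example) evaluates χ = (r + α)/n for two classes and illustrates the direction of Theorem 2.2: a class with smaller χ (more complex) cannot in general be built by superposing members of a class with larger χ.

```python
def chi(r, alpha, n):
    """Vitushkin's characteristic chi = (r + alpha)/n: the inverse complexity of the
    class of r-times continuously differentiable functions of n variables whose
    order-r derivatives are Lipschitz of order alpha."""
    return (r + alpha) / n

# C^2 functions of 3 variables vs. C^3 functions of 2 variables (alpha = 1 in both):
chi_hard = chi(2, 1, 3)   # smaller chi: the more complex class
chi_easy = chi(3, 1, 2)   # larger chi: the simpler class
```

Since chi_hard < chi_easy, Theorem 2.2 says the first class is not, in general, representable by superpositions drawn from the second.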
Acknowledgments We acknowledge support from the Defense Advanced Research Projects Agency under contract number N00014-89-J-3139. Tomaso Poggio is supported by the Uncas & Helen Whitaker Chair at MIT.
References

Arnol'd, V. I. 1957. On functions of three variables. Dokl. Akad. Nauk SSSR 114, 679-681.
Carroll, S. M., and Dickinson, B. W. 1989. Construction of neural nets using the Radon transform. In Proceedings of the International Joint Conference on Neural Networks, pp. I:607-611, Washington, D.C., June 1989. IEEE TAB Neural Network Committee.
Cybenko, G. 1989. Approximation by superposition of a sigmoidal function. Math. Control Systems Signals, in press.
de Boor, C. 1987. Multivariate approximation. In The State of the Art in Numerical Analysis, A. Iserles and M. J. D. Powell, eds., pp. 87-109. Clarendon Press, Oxford.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Hecht-Nielsen, R. 1989. Theory of the backpropagation neural network. In Proceedings of the International Joint Conference on Neural Networks, pp. I:593-605, Washington, D.C., June 1989. IEEE TAB Neural Network Committee.
Henkin, G. M. 1964. Linear superpositions of continuously differentiable functions. Dokl. Akad. Nauk SSSR 157, 288-290.
Hilbert, D. 1900. Mathematische Probleme. Nachr. Akad. Wiss. Göttingen, 290-329.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Kahane, J. P. 1975. Sur le theoreme de superposition de Kolmogorov. J. Approx. Theory 13, 229-234.
Kolmogorov, A. N. 1957. On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR 114, 953-956.
Lorentz, G. G. 1976. On the 13-th problem of Hilbert. In Proceedings of Symposia in Pure Mathematics, pp. 419-429. American Mathematical Society, Providence, RI.
Lorentz, G. G. 1962. Metric entropy, widths, and superposition of functions. Am. Math. Monthly 69, 469-485.
Poggio, T., and Girosi, F. 1989. A theory of networks for approximation and learning. A.I. Memo No. 1140, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Stinchcombe, M., and White, H. 1989. Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. In Proceedings of the International Joint Conference on Neural Networks, pp. I:607-611, Washington, D.C., June 1989. IEEE TAB Neural Network Committee.
Vitushkin, A. G. 1954. On Hilbert's thirteenth problem. Dokl. Akad. Nauk SSSR 95, 701-704.
Vitushkin, A. G. 1964. Some properties of linear superposition of smooth functions. Dokl. Akad. Nauk SSSR 156, 1003-1006.
Vitushkin, A. G. 1977. On Representation of Functions by Means of Superpositions and Related Topics. L'Enseignement Mathematique.
Vitushkin, A. G., and Henkin, G. M. 1967. Linear superposition of functions. Russian Math. Surveys 22, 77-125.
Received 17 July 1989; accepted 30 August 1989.
Communicated by Eric Baum
NOTE
Sigmoids Distinguish More Efficiently Than Heavisides Eduardo D. Sontag SYCON-Rutgers Center for Systems and Control, Department of Mathematics, Rutgers University, New Brunswick, NJ 08903 USA
Every dichotomy on a 2k-point set in R^N can be implemented by a neural net with a single hidden layer containing k sigmoidal neurons. If the neurons were of a hardlimiter (Heaviside) type, 2k − 1 would in general be needed.
1 Introduction and Definitions
The main point of this note is to draw attention to the fact mentioned in the title, that sigmoids have different recognition capabilities than hardlimiting nonlinearities. One way to exhibit this difference is through a worst-case analysis in the context of binary classification, and this is done here. Results can also be obtained in terms of VC dimension, and work is in progress in that regard. For technical details and proofs, the reader is referred to Sontag (1989).
Let N be a positive integer. A dichotomy (S−, S+) on a set S ⊆ R^N is a partition S = S− ∪ S+ of S into two disjoint subsets. A function f : R^N → R will be said to implement this dichotomy if it holds that f(u) > 0 for u ∈ S+ and f(u) < 0 for u ∈ S−. Let θ : R → R be any function. We shall say that f is a single hidden layer neural net with k hidden neurons of type θ [or just that f is a "(k, θ)-net"] if there are real numbers w0, w1, ..., wk, τ1, ..., τk, and vectors v1, ..., vk ∈ R^N such that, for all u ∈ R^N,

f(u) = w0 + Σ_{i=1}^{k} wi θ(vi · u − τi)

where the dot indicates inner product. For fixed θ, and under mild assumptions on θ, such neural nets can be used to approximate uniformly arbitrary continuous functions on compact sets. See, for instance, Cybenko (1989) and Hornik et al. (1989). In particular, they can be used to implement arbitrary dichotomies.
Neural Computation 1, 470-472 (1989)
@ 1989 Massachusetts Institute of Technology
In neural net practice, one often takes θ to be the sigmoid

σ(z) = 1/(1 + e^(−z))

or equivalently, up to translations and change of coordinates, the hyperbolic tangent θ(z) = tanh(z). Another usual choice is the hardlimiter or Heaviside function

H(z) = 1 if z ≥ 0, H(z) = 0 if z < 0

which can be approximated well by tanh(γz) when the "gain" γ is large. Most analysis has been done for H, but backpropagation techniques typically use the sigmoid (or equivalently tanh). It is easy to see that arbitrary dichotomies on an l-element set can be implemented by (l − 1, H)-nets, but that some dichotomies on sets of l elements cannot be implemented by nets with fewer than l − 1 Heaviside hidden neurons. We consider functions θ : R → R that satisfy the following two properties:
(S1) t+ := lim_{z→+∞} θ(z) and t− := lim_{z→−∞} θ(z) exist, and t+ ≠ t−.
(S2) There is some point c such that θ is differentiable at c and θ'(c) = μ ≠ 0.
Note that the function H does not satisfy (S2), but the sigmoid of course does. The main result will be stated for these.
2 Main Result and Remarks
Theorem 1. Let θ satisfy (S1) and (S2), and let S be any set of cardinality l = 2k. Then any dichotomy on S can be implemented by some (k, θ)-net.
Thus, using sigmoids we can reduce the number of neurons from 2k − 1 to k, a factor of 2 improvement. Of course this result should not really be surprising, since Heaviside functions have fewer free degrees of freedom [because H(γz) = H(z) for any γ > 0], and in fact its proof is very simple. The idea is to first classify using a net with k − 1 Heaviside hidden neurons plus a direct connection from the inputs to the output, and then to replace these direct connections by just one nonlinear hidden neuron. The differentiability assumption allows this replacement, since it means that at low gains any linear map can be approximated. To conclude this note, we wish to remark that there are "universal" functions θ satisfying (S1)-(S2) and as differentiable as wanted, even real-analytic, such that, for each N and each dichotomy on any finite set S ⊆ R^N, this dichotomy can be implemented by a (1, θ)-net. Of course, the function θ is so complicated as to be purely of theoretical interest, but it serves to indicate that, unless further restrictions are made on (S1)-(S2), much better bounds can be obtained.
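The "easy" direction for Heavisides mentioned above — any dichotomy on l points can be implemented with l − 1 hardlimiter neurons — can be sketched numerically for the one-dimensional case. The construction below (thresholds midway between neighboring points, with one hidden unit flipping the sign wherever the label changes) is an illustrative special case, not taken from Sontag (1989):

```python
import numpy as np

def heaviside_net(points, labels):
    """Build an (l-1, H)-net implementing a given dichotomy on l
    distinct collinear points (illustrative 1-D sketch)."""
    order = np.argsort(points)
    x = np.asarray(points, dtype=float)[order]
    y = np.asarray(labels)[order]
    taus = (x[:-1] + x[1:]) / 2           # thresholds between neighbors
    w0 = 1.0 if y[0] > 0 else -1.0        # output sign at the leftmost point
    ws, current = [], w0
    for yi in y[1:]:
        target = 1.0 if yi > 0 else -1.0
        ws.append(target - current)       # 0 or +/-2: flip where label changes
        current = target
    def f(u):
        # w0 + sum_i w_i * H(u - tau_i), with H(z) = (z >= 0)
        return w0 + sum(w * (u >= t) for w, t in zip(ws, taus))
    return f

pts = [0.0, 1.0, 2.0, 3.0]
labs = [+1, -1, -1, +1]                   # an arbitrary dichotomy on l = 4 points
net = heaviside_net(pts, labs)            # uses l - 1 = 3 Heaviside neurons
signs = [1 if net(p) > 0 else -1 for p in pts]
```

Here `signs` reproduces `labs` exactly; the theorem says a sigmoidal net could do the same with only k = 2 hidden neurons.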
Acknowledgments Supported in part by U.S. Air Force Grant AFOSR-88-0235.
References
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control, Signals, Syst. 2, 303-314.
Hornik, K. M., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2(5), 359-366.
Sontag, E. D. 1989. Sigmoids distinguish more efficiently than Heavisides. Report 89-12, SYCON-Rutgers Center for Systems and Control, August 1989. (Electronic versions available from
[email protected].)
Received 28 July 1989; accepted 15 September 1989.
Communicated by John Allman
How Cortical Interconnectedness Varies with Network Size Charles F. Stevens* Section of Molecular Neurobiology, Yale University School of Medicine, New Haven, CT 06510 USA
When model neural networks are used to gain insight into how the brain might carry out its computations, comparisons between features of the network and those of the brain form an important basis for drawing conclusions about the network's relevance to brain function. The most significant features to be compared, of course, relate to behavior of the units. Another network property that would be useful to consider, however, is the extent to which units are interconnected and the law by which unit-unit connections scale as the network is made larger. The goal of this paper is to consider these questions for neocortex. The conclusion will be that neocortical neurons are rather sparsely interconnected - each neuron receives direct synaptic input from fewer than 3% of its neighbors underlying the surrounding square millimeter of cortex - and the extent of connectedness hardly changes for brains that range in size over about four orders of magnitude. These conclusions support the currently popular notion that the brain's circuits are highly modular and suggest that increased cortex size is mainly achieved by adding more modules.
1 Introduction
Different mammalian species have brains of very different sizes - man's brain is more than a thousand times larger than that of a mouse - but homologous areas are thought to have the same circuits and to operate in the same way. Thus, evolution has scaled what we believe to be the same basic network up many fold and this presents the opportunity to examine how connectedness varies with the size of the network. Since neuronal intercommunications occur at synaptic contacts, estimates of connectedness can be made by simply counting the number of synapses on a neuron. Direct counting is impracticable, however, because a cubic millimeter of cortex contains on the order of a billion synapses. This paper presents a simple theory that provides a way of characterizing connectedness from measurements on cortical thickness and surface area.
*Address correspondence to The Salk Institute, P.O. Box 85800, San Diego, CA 92138-9216.
Neural Computation 1, 473-479 (1989) @ 1989 Massachusetts Institute of Technology
Such data are available for many species so that the way connectedness is scaled can be assessed over a very wide range, about four orders of magnitude, of cortex sizes.
2 A Scaling Law
Brain size will be specified by N, the total number of neurons in neocortex, or in some defined subsystem in neocortex. The goal, then, is to find an expression for the average number of synapses a cortical neuron receives (and gives) as a function of brain size. I begin with a consideration of how the average number of synapses per cortical neuron, q(N), scales with brain size N. Consider some reference brain, a mouse or cat brain, for example, with size n and q(n) synapses per neuron. If evolution scales this brain by a factor s = N/n to a new brain of size N, q should change as

q(N) = q(n) f(s)

where f(s) is some function that gives the increase in q for the larger brain over the reference one. Because we believe all mammalian brains, no matter how large or small, conform to the same general design and operate in the same general way, f(s) should vary continuously with s and the particular brain we select as a reference should not alter the scaling function f(s). A standard result for homogeneous functions (Aczel 1969) implies, then, that f(s) is a power function

f(s) = s^b

for some constant b. Rearrangement of the preceding equations gives the scaling law

q(N) = q(n)(N/n)^b
If every neuron were connected to a constant fraction of the neurons in every sized cortex then b would be 1, whereas if each neuron were connected to the same number of others independent of brain size then b would equal zero; if the degree of interconnectedness decreases as brains become larger (perhaps larger, more powerful brains operate more efficiently by sharing information over a smaller number of units), b would be negative.
3 Relating Measurable Quantities
The next goal in our development is to recast the preceding equation into a form that relates quantities for which data are available so that the accuracy of the scaling law can be evaluated, and so that q(n) and b can be determined.
Two experimental observations provide the key for relating the quantities in the scaling law to measured features of cortical structure. The first is that the density of synapses in cortex, r, is constant (within measurement error) across cortical layers, regions, and species (Aghajanian and Bloom 1967; Armstrong-James and Johnson 1970; Cragg 1967, 1975a,b; Schüz and Palm 1989; Jones and Cullen 1979; O'Kusky and Colonnier 1982; Rakic et al. 1986; Vrensen et al. 1977) and has an average value:

r = 0.6 × 10^9 synapses/mm^3
The second experimental observation is that the number of neurons that underlies a square millimeter of cortical surface, p, is also about constant (Powell 1981; Rockel et al. 1980) across cortical regions and species (with the exception of primate area 17, where it is also constant across species but differs in magnitude from other cortical areas):

p = 1.48 × 10^5 neurons/mm^2

(for primate area 17, p = 3.57 × 10^5 neurons/mm^2)
This quantity p is constant in spite of variations in cortical thickness. The average number of synapses per cortical neuron is, by definition, the total number of cortical synapses Q divided by the number of neurons N. From the experimental observations presented above

Q = rV

where V is the total cortical volume and is given by the product of the average cortical thickness T and the cortical surface area A:

V = AT

N is, according to the preceding, given by

N = pA

This means that the average number of synapses per cortical neuron q(N) is

q(N) = Q/N = rAT/pA = rT/p
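Plugging the two measured constants above into q(N) = rT/p gives a quick order-of-magnitude check (T = 1 mm is an illustrative reference thickness, not a measurement):

```python
# q(N) = rT/p with the constants quoted in the text
r = 0.6e9   # synapse density: synapses per mm^3 of cortex
p = 1.48e5  # neurons underlying each mm^2 of cortical surface
T = 1.0     # reference cortical thickness in mm (illustrative)

q = r * T / p  # average synapses per neuron
print(round(q))  # prints 4054
```

This is on the order of the 4.12 × 10^3 synapses per neuron quoted later in the text for a 1-mm-thick reference cortex; the small difference presumably reflects rounding of the constants.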
From the scaling law, however,

q(N) = q(n)(A/a)^b = (rt/p)(A/a)^b

where a is the surface area and t the cortical thickness of the reference brain. If q(N) is eliminated between these last two equations, a relationship between cortical surface area and thickness results:

T/t = (A/a)^b
Cortical thickness and surface area data available in the literature thus provide a way of testing the scaling law and of evaluating the constant b. Because the scaling law is a power relation, the most convenient form for comparisons with experimental data is obtained by taking logarithms of the preceding equation:

log(T) = b log(A) + [log(t) − b log(a)]

A double logarithmic plot of cortical thickness T versus surface area A should be linear, then, with a slope that gives b, the parameter that characterizes neuronal interconnectedness as a function of brain size.
4 Conclusions
Data are available in the literature for evaluating b in two contexts, a particular cortical subsystem, primate (and one tree shrew) area 17, and the entire cortex from a variety of mammalian species. The advantage of using data from primate primary visual cortex is that this cortical region is comparable from one species to another, but the disadvantage is that a relatively narrow range of cortical sizes is available. Using data for the entire mammalian cortex means that different functional regions might be compared from species to species, but a four order of magnitude range of brain sizes is available. In so far as cortex is uniform, as many believe (Creutzfeldt 1977; Eccles 1984; Lorente de No 1949; Powell 1981), the comparison of mammalian cortex across species is appropriate. If the cross-species comparison is invalid, then my conclusions are limited to the primate visual area. Figure 1 presents a plot of log(T) against log(A) for primate area 17. The data for cortical sizes in this figure vary over a 50-fold range and conform to the expectations of the scaling law with b = 0.07 (least-squares fit). Prothero and Sundsten (1984) have gathered data for thickness and surface area of total mammalian neocortex from 15 species (7 animal orders) and find the regression line for a double logarithmic plot like that in Figure 1 has a slope of 0.09. Their data (in their Figures 1 and 3) range in cortical size over about four orders of magnitude. The scaling law for interconnectedness does indeed seem to provide an adequate fit for the experimental data, and the value of b is small, but not zero. This implies that each neuron is connected to an almost-constant number of other neurons irrespective of brain size. The quantity q(n) for a 1-mm-thick reference cortex is 4.12 × 10^3 synapses per neuron (1.71 × 10^3 for primate area 17).
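The least-squares estimation of b described above can be sketched as a log-log linear fit. The (A, T) pairs below are synthetic values generated exactly on the power law with b = 0.07 (they are not the measured data used in the paper), so the fit should recover the slope:

```python
import numpy as np

# Hypothetical (surface area, thickness) pairs lying on T/t = (A/a)^b
# with b = 0.07; illustrative values only, not the paper's measurements.
A = np.array([100.0, 500.0, 2000.0, 10000.0, 50000.0])  # surface areas, mm^2
T = 0.9 * (A / A[0]) ** 0.07                             # thicknesses, mm

# Taking logs turns the power law into a straight line,
# log(T) = b log(A) + const, so b is the slope of a least-squares fit.
b, intercept = np.polyfit(np.log10(A), np.log10(T), 1)
```

With real measurements, `b` would come out near the 0.07 (area 17) or 0.09 (whole neocortex) values quoted in the text.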
This means that a particular neuron could receive synaptic connections from less than 3% of the neurons underlying the surrounding square millimeter of cortex, so that brain cells are rather sparsely interconnected. Other, but less complete, data for hippocampus (Stevens, unpublished) suggest that interconnectedness grows slowly (less than linearly) with hippocampal size. The sparseness in interconnections places limits on models for neuronal circuits, and suggests,
Figure 1: A double logarithmic plot of cortical thickness (in mm) as a function of cortical surface area (in mm^2) for area 17. Data derived from cortical volumes given by Frahm et al. (1984) and cortical thickness given by Rockel et al. (1980). Animals represented, in order of increasing cortical size, are tree shrew (Scandentia), galago (Prosimian), marmoset, squirrel monkey, macaque, baboon, chimpanzee, and man (Simians). The regression line is log(T) = 0.07 log(A) − 0.047 where T is cortical thickness (mm) and A the surface area (mm^2).
for example, that any large content-addressable memories present in cortex seem not to be based on rich connectedness of large populations of neurons. Further, network models that provide realistic representations of cortical circuits should also embody a scaling law like that used by the brain.
Acknowledgments Supported by NIH No. 12961-14 and the Howard Hughes Medical Institute.
References
Aczel, J. 1969. Applications and Theory of Functional Equations. Academic Press, New York.
Aghajanian, G. K., and Bloom, F. E. 1967. The formation of synaptic junctions in developing rat brain: A quantitative electron microscopic study. Brain Res. 6, 716-727.
Armstrong-James, M., and Johnson, R. 1970. Quantitative studies of postnatal changes in synapses in rat superficial motor cerebral cortex. Z. Zellforsch 110, 559-568.
Cragg, B. G. 1967. The density of synapses and neurones in the motor and visual areas of the cerebral cortex. J. Anat. 101, 639-654.
Cragg, B. G. 1975a. The density of synapses and neurons in normal, mentally defective and ageing human brains. Brain 98, 81-90.
Cragg, B. G. 1975b. The development of synapses in the visual system of the cat. J. Comp. Neurol. 160, 147-168.
Creutzfeldt, O. D. 1977. Generality of the functional structure of the neocortex. Naturwissenschaften 64, 507-517.
Eccles, J. C. 1984. The cerebral neocortex. A theory of its operation. In Cerebral Cortex, Vol. 2, Functional Properties of Cortical Cells, E. G. Jones and A. Peters, eds., pp. 1-36. Plenum Press, New York.
Frahm, H. D., Heinz, S., and Baron, G. 1984. Comparison of brain structure volumes in insectivora and primates. V. Area striata (AS). J. Hirnforsch 25, 537-557.
Jones, D. G., and Cullen, A. M. 1979. A quantitative investigation of some presynaptic terminal parameters during synaptogenesis. Exp. Neurobiol. 64, 245-259.
Lorente de No. 1949. Cerebral cortex: Architecture, intracortical connections, motor projections. In Physiology of the Nervous System, J. Farquhar Fulton, ed., pp. 288-315. Oxford University Press, London.
O'Kusky, J., and Colonnier, M. 1982. A laminar analysis of the number of neurons, glia, and synapses in the visual cortex (area 17) of adult macaque monkeys. J. Comp. Neurol. 210, 278-290.
Powell, T. P. S. 1981. Certain aspects of the intrinsic organisation of the cerebral cortex. In Brain Mechanisms and Perceptual Awareness, O. Pompeiano and C. Ajmone Marsan, eds., pp. 1-9. Raven, New York.
Prothero, J. W., and Sundsten, J. W. 1984. Folding of the cerebral cortex in mammals. A scaling model. Brain Behav. Evol. 24, 152-167.
Rakic, P., Bourgeois, J.-P., Eckenhoff, M. F., Zecevic, N., and Goldman-Rakic, P. S. 1986. Concurrent overproduction of synapses in diverse regions of the primate cerebral cortex. Science 232, 232-235.
Rockel, A. J., Hirons, R. W., and Powell, T. P. S. 1980. The basic uniformity in structure of the neocortex. Brain 103, 221-244.
Schüz, A., and Palm, G. 1989. Density of neurons and synapses in the cerebral cortex of the mouse. J. Comp. Neurol. 286, 442-455.
Vrensen, G., De Groot, D., and Nunes-Cardozo, J. 1977. Postnatal development of neurons and synapses in the visual and motor cortex of rabbits: A quantitative light and electron microscopic study. Brain Res. Bull. 2, 405-416.
Received 29 June 1989; accepted 11 October 1989.
Communicated by Gordon M. Shepherd
A Canonical Microcircuit for Neocortex Rodney J. Douglas Kevan A.C. Martin David Whitteridge MRC Anatomical Neuropharmacology Unit, Department of Pharmacology, South Parks Road, Oxford OX1 3QT, England
We have used microanatomy derived from single neurons, and in vivo intracellular recordings to develop a simplified circuit of the visual cortex. The circuit explains the intracellular responses to pulse stimulation in terms of the interactions between three basic populations of neurons, and reveals the following features of cortical processing that are important to computational theories of neocortex. First, inhibition and excitation are not separable events. Activation of the cortex inevitably sets in motion a sequence of excitation and inhibition in every neuron. Second, the thalamic input does not provide the major excitation arriving at any neuron. Instead the intracortical excitatory connections provide most of the excitation. Third, the time evolution of excitation and inhibition is far longer than the synaptic delays of the circuits involved. This means that cortical processing cannot rely on precise timing between individual synaptic inputs. 1 Introduction
The uniformity of the mammalian neocortex (Hubel and Wiesel 1974; Rockel et al. 1980) has given rise to the proposition that there is a fundamental neuronal circuit (Creutzfeldt 1977; Szentágothai 1978) repeated many times in each cortical area. Here we provide evidence for such a canonical circuit in cat striate cortex, and model its form and functional attributes. The microcircuitry of the striate cortex of the cat is by far the best understood of all cortical areas. The anatomical organization that has emerged from studies (Gilbert and Wiesel 1979; Martin 1988) of neuronal morphology and immunochemistry is one of stereotyped connections between different cell types: pyramidal cells connect principally to other pyramidal cells, and the smooth cells connect principally to pyramidal cells. Pyramidal cells are excitatory; smooth cells are GABAergic and thought to be inhibitory. Some neurons of both types are driven directly by thalamic input and others indirectly. We used these findings and those described below to develop the simplest neuronal circuit that Neural Computation 1, 480-488 (1989) @ 1989 Massachusetts Institute of Technology
Figure 1: Model of cerebral cortex that successfully predicts the intracellular responses of cortical neurons to stimulation of thalamic afferents. Three populations of neurons interact with one another: one population is inhibitory (GABA cells, solid synapses), and two are excitatory (open synapses), representing superficial (P2 + 3) and deep (P5 + 6) layer pyramidal neurons. The layer 4 spiny stellate cells are incorporated with the superficial group of pyramids. Each population receives excitatory input from the thalamus, which is weaker (dashed line) to deep pyramids. The inhibitory inputs activate both GABAA and GABAB receptors on pyramidal cells. The thick line connecting GABA to P5 + 6 indicates that the inhibitory input to the deep pyramidal population is relatively greater than that to the superficial population. However, the increased inhibition is due to enhanced GABAA drive only. The GABAB input to P5 + 6 is similar to that applied to P2 + 3.
showed analogous functional behavior to that which we observed in our intracellular recordings (Fig. 1).
2 Cortical Model
The model circuit consisted of populations of neurons that interacted with one another. The behavior of each population was modeled by a single "cell" that represented the average response of the neurons belonging to that population. The action potential discharge was treated as a rate-encoded output rather than discrete spike events. The populations excited or inhibited one another by inducing changes in the average membrane potential of their target populations, after a transmission delay. The relaxation of the membrane potential was governed by a membrane time constant. The magnitude of excitation or inhibition was determined by the product of the input population's discharge rate, a synaptic coupling coefficient, and a synaptic driving potential. The discharge rate was a thresholded hyperbolic function of the average membrane potential. The synaptic coupling coefficient incorporated the fraction of all synaptic input that was derived from a particular source population, the average efficacy of a synapse from that source, and the sign of its effect (either positive or negative). The synaptic driving potential was the difference between the average membrane potential and the appropriate synaptic reversal potential. The number and characteristics of the populations, and the functional weighting of their interconnections, were optimized by comparing the performance of the model with that of the cortex itself, as described below. The model was programmed in TurboPascal and run on an 8-MHz 80286/287 AT-type computer, which computed a typical model response of 400 msec in 30 sec.
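The population update described above can be sketched in a few lines. One "cell" stands for the population average; the constants and the exact hyperbolic form below are illustrative assumptions, not the fitted parameters of the paper's model:

```python
# Sketch of a rate-model population: membrane potential relaxes toward
# rest with a time constant, driven by (rate x coupling x driving
# potential) terms from each input population. All values illustrative.
def discharge_rate(v, theta=-55.0, r_max=300.0, sigma=5.0):
    """Thresholded hyperbolic function of the average membrane potential."""
    x = v - theta
    return r_max * x / (x + sigma) if x > 0 else 0.0

def step(v, inputs, v_rest=-70.0, tau=10.0, dt=0.5):
    """One Euler step (potentials in mV, times in msec). `inputs` is a
    list of (presynaptic rate, coupling coefficient, reversal potential)
    triples; each contributes rate * coupling * (reversal - v)."""
    drive = sum(r * k * (e_rev - v) for r, k, e_rev in inputs)
    return v + dt * ((v_rest - v) / tau + drive)

# Excitatory drive (reversal ~0 mV) depolarizes the population; a
# GABAergic input (reversal ~-90 mV) would instead hyperpolarize it.
v = -70.0
for _ in range(200):
    v = step(v, [(50.0, 0.002, 0.0)])
```

With this steady excitatory input the potential settles at about −35 mV, where the thresholded hyperbolic nonlinearity yields a nonzero discharge rate; transmission delays, omitted here for brevity, would simply lag each input rate.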
3 Intracellular Recordings
Neurons were recorded from the postlateral gyrus of the striate visual cortex (area 17) of anesthetized, paralyzed cats (Martin and Whitteridge 1984; Douglas et al. 1988), while continuously monitoring vital signs. Glass micropipettes were filled with 2 M K citrate, or a 4% buffered solution of horseradish peroxidase (HRP) in 0.2 M KCl. GABA agonists and antagonists were applied ionophoretically via a multibarrel pipette using a Neurophore (Medical Systems Inc.). The intracellular electrode was mounted in a "piggy-back" configuration on a multibarrel ionophoretic pipette. The tip of the recording electrode was separated from the tips of the ionophoretic barrels by 10-20 µm. Receptive fields of the cortical neurons were first plotted in detail by hand, and then intracellular recordings were made while stimulating the optic radiation (OR) above the lateral geniculate nucleus via bipolar electrodes (0.2-0.4 msec, 200-400 µA). A control period of 100 msec, followed by 300 msec of the intracellular response to OR stimulation, was averaged over up to 32 trials. We used electrical pulse stimulation as a test signal both because it simplifies the analysis of systems, and because it permits the canonical microcircuit hypothesis to be tested in the many cortical areas whose natural stimulus requirements are not yet known. Where possible, HRP was injected intracellularly following data collection to enable morphological identification. Twenty HRP-labeled neurons were recovered (Fig. 2).
4 Results and Discussion
In all 53 cells examined, the stimulus pulse induced a sequence of excitation followed by a lengthy (100-200 msec) hyperpolarizing inhibitory postsynaptic potential (IPSP) that inhibited completely any spontaneous action potential discharge. This general pattern has been reported in visual and other cortical areas, and it is currently supposed that the early excitation is due to activation of thalamic afferents, and that the inhibition arises from feed-forward and feedback excitation of cortical smooth cells. There is strong evidence that inhibition in the cortex involves GABAA receptors, which can be selectively blocked by bicuculline (Sillito 1975). However, we found that there was also a second GABAergic inhibitory mechanism present in vivo, which was insensitive to bicuculline and could be activated by the specific GABAB agonist, baclofen. Baclofen-mediated inhibition has also been observed in in vitro cortical preparations (Connors et al. 1989). Both GABAergic mechanisms were incorporated into the model so that GABAA activation produced an early inhibition of short duration, while the GABAB response evolved more slowly and had a longer duration. For simplicity, both inhibitory processes behaved linearly. This approximation is reasonable since nonlinear inhibition is not prominent in cat visual cortex (Douglas et al. 1988). Using these principles, we were able to model the neuron's response to electrical stimulation by a circuit that consisted simply of two interacting populations: one population of excitatory pyramidal cells and another of inhibitory smooth cells, with thalamic input applied to both populations. However, the temporal forms of the responses obtained from in vivo cortical cells were not all similar. For pyramidal cells, which formed the bulk of our HRP-labeled sample, we could discriminate two different temporal patterns of poststimulus response on the basis of the latency to maximum hyperpolarization (Fig. 2).
These patterns were not correlated with functional properties such as receptive field type or ordinal position, but were strongly correlated with cortical layer (Fig. 2). Hyperpolarization evolved more slowly in morphologically identified pyramidal neurons of layer 2 and 3 than those located in layers 5 and 6. These data suggested that the pyramidal cells might be involved in two different circuits, one for superficial layers (2 and 3), and another for deep layers (5 and 6). Consequently, the model was expanded to incorporate these two populations. Unfortunately, we did not label any spiny stellate cells, which are found only in layer 4. However, the output of spiny stellate cells is directed to the superficial layers as well as layer 4 (Martin and Whitteridge 1984), and so we assumed that they should be incorporated with the population of superficial pyramids. We also assumed that the superficial and deep populations should have similar rules of interconnection. By exploring the properties of the expanded
Figure 2: Correlation of cortical layer with the pattern of the intracellular responses to stimulation of the optic radiation. (a) Hyperpolarization evolved more slowly in morphologically identified pyramidal neurons of layers 2 and 3 than those located in layers 5 and 6. (b,d) (modeled in c, e) Latencies to maximum hyperpolarization (filled arrows) were measured with respect to the stimulus at t = 0. Mean latencies for the two populations were 108.9 ± 7.3 SEM msec (N = 9) and 27.5 ± 1.7 msec (N = 11). Superficial pyramids (e.g., b, position circled in a) always exhibited marked excitation (open arrow), which was less prevalent in deep layers (e.g., d, circled in a). In this figure, stimulus artifacts are removed for clarity. Depths of identified pyramidal cell somata and layer boundaries were measured with respect to the cortical surface and then normalized against the layer 5/6 boundary. The model (Fig. 1) predicted qualitatively similar responses to those observed in vivo (compare b to c and d to e) if and only if GABAA inhibition of deep layers was greater than that of superficial layers.
model, we were able to show that the two layer-correlated response patterns could be elicited from the same basic circuit by modifying the relative intensities of GABAA inhibition applied to the two pyramidal populations. The observed differences in the pattern of response of superficial and deep neurons could be most simply achieved by making the
GABAA inhibition of the deep pyramidal cells four times stronger than that of the superficial cells. The strength of GABAB inhibition was the same in both layers. This configuration (Fig. 1) of the model provided the best fit to the biological data. Alternative combinations of populations and coupling coefficients were markedly less successful in simulating the biological data. Having established the basic configuration, we then modeled the effect of altering the weightings of the inhibitory (GABAergic) connections. The predictions were tested experimentally by recording the intracellular pulse response during ionophoretic application of various GABA agonists and antagonists. These drugs were applied directly to the recorded cell via a multibarrel ionophoretic micropipette that was mounted on the shank of the intracellular pipette. The close agreement between the predicted and the experimental results is shown in Figure 3. The performance of the model depends on the coupling between the pyramidal and smooth cells, and this suggests that excitation and inhibition are not separable events. Activation of the cortex inevitably sets in motion a sequence of excitation and inhibition in every neuron. Moreover, the time evolution of excitation and inhibition is far longer than the synaptic delays of the circuits involved. In particular, the large component of the inhibition derives from the GABAB-like process that extends over some 200 msec. This means that cortical processing cannot rely on precise timing between individual synaptic inputs. The model also predicted that the excitation due to intracortical connections would greatly exceed that of the thalamic afferents, which provide the initial excitation. This amplification of excitation is a consequence of the intracortical divergence (Gilbert and Wiesel 1979; Martin 1988) of pyramidal cell projections. Thus, thalamic input does not provide the major excitation arriving at any neuron.
Instead, the intracortical excitatory connections provide most of the excitation. This excitation would grow explosively, but it is gated by inhibition of the pyramids. It is this intracortical excitatory component that is more strongly inhibited in the deep layers, so the onset of maximum hyperpolarization occurs more rapidly in these cells (compare Fig. 2b and d). The degree to which the intracortical component is normally inhibited can be demonstrated by comparing the form of the excitatory depolarization before and after blockade of GABAA-mediated inhibition (Fig. 4). Bicuculline enhanced predominantly the intracortical excitatory component. This suggests that the GABAA mechanism is activated by the arrival of the thalamic volley, and that the role of tonic cortical inhibition is small.
5 Conclusion
Taken together, these data show that this simple model can provide a remarkably rich description of the average temporal behavior of
Rodney J. Douglas, Kevan A.C. Martin, and David Whitteridge
Figure 3: Comparison of predictions of the model with experimental results following modification of synaptic weights. To simulate the localized effect of ionophoresis, manipulations of the model affected only a small subset of P2+3 or P5+6 cells (Fig. 1). (a) Observed and predicted control response of a deep cell. (b) Sustained GABA ejection (Sigma, 0.5 M, 70 nA) hyperpolarized the membrane of this cell, so reducing the stimulus-induced hyperpolarization. (c) Additional application of the GABAA antagonist bicuculline (Sigma, 100 mM, 200 nA) did not reverse the GABA-induced hyperpolarization, but enhanced excitation (arrow). (d) When GABA was removed, bicuculline further enhanced excitation (arrowed), but did not block late hyperpolarization. (e) Control response of a superficial cell. (f) The GABAB agonist baclofen (Ciba Geigy, 10 mM, 60 nA) hyperpolarized the membrane of this cell, accentuating the early excitation (arrowed).
A Canonical Microcircuit for Neocortex
Figure 4: Blocking GABAA inhibition unmasks intracortical excitation. (a) Observed and predicted excitatory response of a superficial cell during the first 30 msec following the stimulus. The excitatory response was followed by an IPSP similar to that seen in Figure 3e; only the initial phase of the IPSP is seen in this short time window. Thalamic and intracortical components of excitation are indicated in the model response; the stimulus artifact is marked with an open arrow. (b) Sustained application of bicuculline (0.1 M, 100 nA) enhanced the intracortical component, but left the thalamic component largely unaffected. (c) Histogram of experimental results showing that the latency to peak of the bicuculline-affected depolarization (filled bars) corresponds with that of the intracortical component (hatched bars). Both were significantly later than the earliest (thalamic) depolarization (open bars).

populations of cortical neurons when they are activated by pulse stimuli. Because this stimulus is not area specific, the same experimental methods and tests could, in principle at least, be applied to any cortical area, even those whose function is unknown. Similar responses obtained in another cortical area would suggest a basic circuitry similar to that of visual cortex. Furthermore, models of cortical processing that are based
on analogues of neurons (Sejnowski et al. 1988) should exhibit responses to pulse activation of their inputs that are qualitatively similar to those reported here.
Acknowledgments We thank John Anderson for technical assistance, and the E.P. Abrahams Trust for support. R.J.D. acknowledges the support of the Guarantors of Brain, and the SA MRC.
References
Connors, B. W., Malenka, R. C., and Silva, L. R. 1988. Two inhibitory postsynaptic potentials, and GABAA and GABAB receptor-mediated responses in neocortex of rat and cat. J. Physiol. 406, 443-468.
Creutzfeldt, O. D. 1977. Generality of the functional structure of the neocortex. Naturwissenschaften 64, 507-517.
Douglas, R. J., Martin, K. A. C., and Whitteridge, D. 1988. Selective responses of visual cortical cells do not depend on shunting inhibition. Nature (London) 332, 642-644.
Gilbert, C. D., and Wiesel, T. N. 1979. Morphology and intracortical projections of functionally characterised neurons in the cat visual cortex. Nature (London) 280, 120-125.
Hubel, D. H., and Wiesel, T. N. 1974. Uniformity of monkey striate cortex: A parallel between field size, scatter, and magnification factor. J. Comp. Neurol. 158, 295-305.
Martin, K. A. C. 1988. From single cells to simple circuits in the cerebral cortex. Q. J. Exp. Physiol. 73, 637-702.
Martin, K. A. C., and Whitteridge, D. 1984. Form, function, and intracortical projections of spiny neurones in the striate visual cortex. J. Physiol. 353, 463-504.
Rockel, A. J., Hiorns, R. W., and Powell, T. P. S. 1980. The basic uniformity in structure of the neocortex. Brain 103, 221-244.
Sejnowski, T. J., Koch, C., and Churchland, P. S. 1988. Computational neuroscience. Science 241, 1299-1306.
Sillito, A. M. 1975. The contribution of inhibitory mechanisms to the receptive field properties of neurones in the striate cortex of the cat. J. Physiol. 250, 305-329.
Szentágothai, J. 1978. The neuron network of the cerebral cortex: A functional interpretation. Proc. R. Soc. London Ser. B 201, 219-248.
Received 13 July 1989; accepted 2 October 1989.
Communicated by John Wyatt
Synthetic Neural Circuits Using Current-Domain Signal Representations Andreas G. Andreou Kwabena A. Boahen Electrical and Computer Engineering, The Johns Hopkins University, Baltimore, MD 21218 USA
We present a new approach to the engineering of collective analog computing systems that emphasizes the role of currents as an appropriate signal representation and the need for low-power dissipation and simplicity in the basic functional circuits. The design methodology and implementation style that we describe are inspired by the functional and organizational principles of neuronal circuits in living systems. We have implemented synthetic neurons and synapses in analog CMOS VLSI that are suitable for building associative memories and self-organizing feature maps.
1 Introduction
Connectionist architectures, neural networks, and cellular automata (Rumelhart and McClelland 1986; Kohonen 1987; Grossberg 1988; Toffoli 1988) have large numbers of simple and highly connected processing elements and employ massively parallel computing paradigms, features inspired by those found in the nervous system. In a hardware implementation, the physical laws that govern the cooperative behavior of these elements are exploited to process information. This is true both at the system level, where global properties such as energy are used, and at the circuit level, where the device physics are exploited. For example, Hopfield's network (1982) uses the stable states of a dynamic system to represent information; associative recall occurs as the system converges to its local energy minima. On the other hand, Mead's retina (1989) uses the native properties of silicon transistors to perform local automatic gain control. In this paper we discuss the importance of signal representations in the implementation of such systems, emphasizing the role of currents. The paper is organized into six sections: Section 2 describes the roles played by current as well as voltage signals. The metal-oxide-semiconductor (MOS) transistor, the basic element of complementary-MOS (CMOS) very large scale integration (VLSI) technology, is introduced in Section 3. In the subthreshold region, the MOS transistor's behavior strongly resembles that of the ionic channels in excitable cell
Neural Computation 1, 489-501 (1989)
© 1989 Massachusetts Institute of Technology
membranes. Translinear circuits, a computationally rich class of circuits with current inputs and outputs, are reviewed in Section 4. These circuits are based on the exponential transfer characteristics of the transistors, a property that also holds true for certain ionic channels. Simple and useful circuits for neurons and synapses are described in Section 5. Proper choice of signal representations leads to very efficient realizations; a single line provides two-way communication between neurons. Finally, a brief discussion of the philosophy behind the adopted design methodology and implementation style is presented in Section 6.
2 Signals
In an electronic circuit, signals are represented by either voltages or currents.¹ A digital CMOS circuit depends on two well-defined voltage levels for reliable computation. Currents play only an incidental role of establishing the desired voltage levels (through charging or discharging capacitive nodes). Since the abstract Turing model of computation does not specify the actual circuit implementation, two distinct current levels will work as well. In contrast, the circuits described here use analog signals and rely heavily on currents; both currents and voltages have continuous values. At the circuit level, Kirchhoff's current law (KCL) and Kirchhoff's voltage law (KVL) are exploited to implement computational primitives. KCL states that the sum of the currents entering a node equals the sum of the currents leaving it (conservation of charge), so current signals may be summed simply by bringing them to the same node. KVL states that the sum of voltages around a closed loop is zero (conservation of energy); therefore, voltage signals may be summed as well. Actually, the translinear circuits described in Section 4 rely on KVL while avoiding the use of differential voltage signals (not referenced to ground). Voltages are used for communicating results to different parts of the system or for storing information locally. Accumulation of charge on a capacitor (driven by a current source) results in a voltage that represents local memory in the system. This also implements the useful function of temporal integration. Distributed memory can be realized using spatiotemporal patterns of charge, following the biological model (Freeman et al. 1988; Eisenberg et al. 1989). In this type of memory, stored information is represented by limit-cycles in the phase space of a dynamic system.
However, in current VLSI implementations, memory is represented as point attractors (i.e., stable equilibria) in the spatial distributions of charge, as, for example, in our bidirectional associative memory chips (Boahen et al. 1989a,b).
¹This may be an area in which biological systems have a distinct advantage by employing both chemical and electrical signals in the computation.
Figure 1: The MOS transistor. (a) Structure.
3 Devices
The MOS transistor, shown in Figure 1a, has four terminals: the gate (G), the source (S), the drain (D), and the substrate (B, for bulk). The gate and source potentials control the charge density in the channel between the source and the drain, and hence the current passed by the device. The MOS transistor is analogous to an ensemble of ionic channels in the lipid membrane of a cell controlled by the transmembrane potential. We operate the MOS transistor in the so-called "off" region, characterized by gate-source voltages that are below the threshold voltage. In this region charge transport is by diffusion from areas of high carrier concentration to energetically preferred areas of lower carrier concentration. This is referred to as weak-inversion (Vittoz and Fellrath 1977) or subthreshold conduction (Mead 1989; Maher et al. 1989). The transfer characteristics are shown in Figure 1b. These curves are very similar to those for the calcium-controlled sodium channel (Hille 1984, p. 317). In both cases the exponential relationships arise from the Boltzmann distribution. The subthreshold current is given by²

Ids = I0 exp[(κ Vgs + (1 − κ) Vbs)/VT] (1 − exp(−Vds/VT)) (1 + Vds/V0)    (3.1)
²For the sake of brevity, we discuss only the n-type device whose operation depends on the transport of negative charges. The operation of a p-type device is analogous.
Figure 1: Cont'd. (b) Transfer characteristics, (c) output characteristics. To first order, the current is exponentially dependent on both the substrate and the gate voltages. In (b) the dots show measured data from an n-type transistor of size 4 × 4 μm, with Vds = 1.0 V. The solid lines are obtained using equation 3.1 with I0 = 0.72 × 10⁻¹⁸ A and κ = 0.75. The data in (c) are for a similar device with Vbs = 0; it is fitted with V0 = 15.0 V.
where I0 is the zero-bias current and κ measures the effectiveness of the gate potential in controlling the channel current. To first order, the effectiveness of the substrate potential is given by (1 − κ); VT = kT/q, the thermal voltage, equals 26 mV at room temperature, and V0 is the Early voltage, which can be determined from the slope of the Ids versus Vds curves. Notice that Ids changes by a factor of e for a VT/κ ≈ 35 mV change in Vgs. This drain current equation is equivalent to that in Maher
et al. (1989); however, in this form the dependence on the substrate voltage is explicit. This three-parameter model is adequate for rough design calculations but not for accurate simulation of device operation. Refer to Mead (1989, Appendix B) for a more elaborate model. Subthreshold currents are comparable to currents in cell membranes; they range from a few picoamps to a few microamps. For a given gate-source voltage Vgs, the MOS transistor has two distinct modes of operation, determined by the drain-source voltage Vds, as shown by the output characteristics in Figure 1c. The behavior is roughly linear if Vds is less than Vdsat ≈ 100 mV; small changes in Vds cause proportional changes in the drain current. For voltages above Vdsat, the current saturates. In this region the MOS transistor is a current source with output conductance:

gdsat = ∂Ids/∂Vds ≈ Ids/V0    (3.2)

The change in drain current for a small change in gate voltage is given by

gm = ∂Ids/∂Vgs = κ Ids/VT    (3.3)

gm is called the transconductance because it relates a current between two nodes to a voltage at a third node. As we shall see, the subthreshold MOS transistor is a very versatile circuit element because gm >> gdsat.
4 Circuits
Area-efficient (compact) functional blocks can be obtained by using the MOS transistor itself to perform as many circuit functions as possible. The three possible circuit configurations for the transistor are shown in Figure 2: In the common-source mode, it is an inverting amplifier with high voltage gain, gm/gdsat. In the common-drain mode, it is a voltage follower with low output resistance, 1/gm. In the common-gate mode, it is a current buffer with low output conductance, gdsat. In the synthetic neuronal circuits described in the next section the inverting amplifier is used as a feedback element to obtain more ideal circuit operation, while the voltage follower and the current buffer are used to effectively transfer signals between different circuits.
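The voltage gain gm/gdsat quoted for the common-source stage follows directly from the subthreshold model of Section 3. A minimal numeric sketch (using the parameter values fitted for the device of Figure 1; the function names are ours, for illustration only):

```python
import math

# Three-parameter subthreshold model of Section 3, saturation region:
#   Ids = I0 * exp((kappa*Vgs + (1 - kappa)*Vbs) / VT) * (1 + Vds/V0)
# Parameter values are those fitted to the device of Figure 1.
I0, KAPPA, V0, VT = 0.72e-18, 0.75, 15.0, 0.026

def ids(vgs, vbs=0.0, vds=1.0):
    """Saturation-region drain current of the n-type device (equation 3.1)."""
    return I0 * math.exp((KAPPA * vgs + (1 - KAPPA) * vbs) / VT) * (1 + vds / V0)

def gm(i):
    """Transconductance (equation 3.3): kappa * Ids / VT."""
    return KAPPA * i / VT

def gdsat(i):
    """Output conductance in saturation (equation 3.2): Ids / V0."""
    return i / V0

i = ids(0.70)                        # a subthreshold bias point, well under 1 nA
gain = gm(i) / gdsat(i)              # intrinsic gain kappa * V0 / VT, about 430
efold = ids(0.70 + VT / KAPPA) / i   # one VT/kappa step in Vgs scales Ids by e
```

Note that the gain gm/gdsat = κV0/VT is independent of the bias current, which is why a single common-source transistor already makes a useful amplifier.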
The actual computations are performed by current-domain (or current-mode) circuits. A Current-Domain (CD) circuit is one whose input signals and output signals are currents. The simplest CD circuit is shown in Figure 3. This circuit copies the input current to the output and reverses its direction. It is appropriately named a current mirror. The circuit has just two transistors: an input transistor and an output transistor. The input current Iin is converted to a voltage Vb by the input transistor. This voltage sets the gate voltage of the output transistor. Thus, both devices have the same gate-source voltages and will pass the same current if they are identical and have the same drain and substrate voltages. In practice, device mismatch produces random variations in the output current, while the nonzero drain conductance results in systematic variations. More complicated mirror circuits, for example, the Wilson mirror or the Complex mirror (Pavasović et al. 1988), may be used to obtain lower output conductance. By using more output devices, several copies of the input current can be obtained. The current mirror is analogous to a basic synapse structure in biological systems: it is simple in
Figure 2: MOS transistor circuit configurations. (a) Common-source, (b) common-drain, and (c) common-gate modes of operation. In (a) voltage gain is obtained by converting the current produced by the device's transconductance to a voltage across its drain conductance. In (b) a voltage follower/buffer is realized; the gate-source drop is kept constant by using a fixed bias current. In (c) the device serves as a current buffer by transferring the signal from its high-conductance source terminal to the low-conductance drain node.
Figure 3: Current mirror circuits using (a) n-type and (b) p-type transistors. These circuits provide an output current that equals the input current if the devices are perfectly matched. For subthreshold operation, we observe variations of about 10%, on average, using 4 × 4 μm devices.
form, it enforces unidirectional information flow, and it can function over a large range of input and output signal levels. Translinear circuits (Gilbert 1975) are a computationally powerful subclass of CD circuits.³ A translinear circuit is defined as one whose operation depends on the linear relationship between the transconductance and the channel current of the active devices (Equation 3.3). The current mirror in subthreshold operation is an example of a translinear circuit. The Translinear Principle (Gilbert 1975) can be used to synthesize a wide variety of circuits to perform both linear and nonlinear operations on the current inputs, including products, quotients, and power terms with fixed exponents. The Gilbert current multiplier is one of the better-known translinear circuits. Gilbert's elegant analog array normalizer (1984) is an example of a more powerful translinear circuit. One fascinating aspect of translinear circuits is that although the currents in their constitutive elements (the transistors) are exponentially dependent on temperature, the overall input/output relationship is insensitive to isothermal temperature variations. The effect of small local variations in fabrication parameters can also be shown to be temperature independent. Finally, translinear circuits are simple, because an analog representation is used and the native device properties provide the computational primitives.
³Translinear circuits have traditionally been built using bipolar transistors.
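The temperature insensitivity has a compact demonstration. A sketch under idealized assumptions (every device obeying I = I0 exp(V/VT) with identical I0, i.e., taking κ = 1; the function names are ours): around a loop with two clockwise and two counterclockwise gate-source drops, V1 + V2 = V3 + V4 forces I1 I2 = I3 I4, so the output I4 = I1 I2 / I3 is independent of both VT and I0.

```python
import math

def v_of_i(i, I0, VT):
    """Voltage a diode-connected ideal exponential device needs to pass current i."""
    return VT * math.log(i / I0)

def translinear_div_mul(i1, i2, i3, I0, VT):
    """One-quadrant multiplier/divider from a four-device translinear loop:
    the loop constraint V1 + V2 = V3 + V4 gives V4, and device 4 converts
    V4 back into a current."""
    v4 = v_of_i(i1, I0, VT) + v_of_i(i2, I0, VT) - v_of_i(i3, I0, VT)
    return I0 * math.exp(v4 / VT)

# The result equals I1*I2/I3 whatever the temperature-dependent I0 and VT are,
# provided they are the same for all four devices (isothermal operation).
cold = translinear_div_mul(10e-9, 20e-9, 5e-9, I0=1e-18, VT=0.0248)
hot  = translinear_div_mul(10e-9, 20e-9, 5e-9, I0=8e-18, VT=0.0268)
```

Both evaluations return 40 nA: the exponential temperature dependence of each device cancels around the loop.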
5 Synapses and Neurons
In a neuronal circuit, the interaction between neurons is mediated by a large variety of synapses (Shepherd 1979). A neuron receives its inputs from other neurons through synaptic junctions that may have different efficacies. In a VLSI system, the synapses are implemented as a two-dimensional array with the neurons on the periphery. This is because O(N²) synapses are required in a network with N neurons. Generally, two sets of lines (buses) are run between the neurons and the synaptic array; one carries neuronal output to the synapses and the other feeds input to the neurons. However, in networks with reciprocal connections, such as the bidirectional associative memory (Boahen et al. 1989a,b), proper choice of signal representations leads to a more efficient implementation. Our circuit implementations for neurons and synapses are shown in Figure 4. These circuits use voltage to represent a neuron's output (presynaptic signal) and current to represent its inputs (postsynaptic signals). Since currents and voltages may be independently transmitted along the same line, these signal representations allow a neuron's output and
Figure 4: Circuits for synapses and neurons. (a) Reciprocal synapse and (b) neuron. These circuits demonstrate efficient signal representations that use a single line to provide two-way communication. A voltage is used to represent information going one way while a current is used to send information the other way. The synapse circuit in (a) provides bidirectional interaction between two neurons connected to nodes n1 and n2. The neuron circuit in (b) sends out a voltage that mirrors its output current Iout in the synapses while receiving the total current Is from these synapses.
inputs to be communicated using just one line. Voltage output facilitates fan-out while current input provides summation. Thus, in close analogy to actual neuronal microcircuits, the output signal is generated at the same node at which inputs are integrated. The two-transistor synapse circuit (Figure 4a) provides bidirectional interaction between neurons connected to nodes n1 and n2; each transistor serves as a synaptic junction. When s is at ground, voltages applied at nodes n1 and n2 are transformed into currents by the transconductances of M2 and M1, respectively. If these voltages exceed Vdsat, the transistors are in saturation and act as current sources. Thus, changes in the voltage at n1 (n2) do not affect the current in M1 (M2). Actually, for a small change δV in the voltage at n1, the changes in I1 and I2 are related by

δI1 = gdsat δV,    δI2 = gm δV

This gives

δI1/I1 = [VT/(κ V0)] (δI2/I2) ≈ 0.002 (δI2/I2)

Hence, we can double I2 (using the voltage at n1) while disturbing I1 by only 0.2%. The interaction is turned off by setting s to a high voltage, or modulated by applying an analog signal to the substrate. The circuit for the neuron also uses just two transistors (Figure 4b). The net input current In (for activation), formed by summing the inputs at node n, is available at the drain of M1. This device buffers the input current and controls the output voltage. In is fed through a nonlinearity, for example, thresholding (not shown), to obtain Iout, which sets the output voltage Vout. This is accomplished by using M1 as a voltage follower and providing feedback through M2, which functions as an inverting amplifier; M1 adjusts Vout so that the current in M2 equals Iout. Hence, Vout will mirror Iout in the synapses. The feedback makes the output voltage insensitive to changes in the input current In. Actually, the output conductance is approximately gm1 gm2/gdsat2; it is increased by a factor equal to the gain provided by M2. In this case, a small change in Vout produces changes in In and Is (the postsynaptic copy of Iout) that, to first order, satisfy

δIs/Is = [VT/(κ V0)] (δIn/In) ≈ 0.002 (δIn/In)

Hence, if In doubles, the resulting change in Vout disturbs Is by only 0.2%, just as in the previous case. Note that Iout must always exceed a few picoamps to keep Vout above Vdsat. The characteristics of these
circuits, designed using 4 μm × 4 μm devices and fabricated through MOSIS, are shown in Figure 5a-c.
Figure 5: Characteristics of a synthetic neuronal circuit. (a) A simple circuit consisting of two neurons (n1 and n2) and a synapse (s) was built and tested to demonstrate the proposed communication scheme. The currents sent by n1 (n2) and that received by n2 (n1) are denoted by I12 (I21) and Î12 (Î21), respectively. Continued on next page.
6 Discussion
The adopted design methodology is governed by three simple principles: First, the computation is carried out in the analog domain; this gives simple functional blocks and makes efficient use of interconnect lines. Second, the physical properties of silicon-based devices and circuits are used synergistically to obtain the desired result. Third, circuits are designed with power dissipation and area efficiency as prime engineering constraints, not accuracy or speed. We believe power dissipation will be a serious limitation in large-scale analog computing hardware. Unlike digital integrated circuits, the massive parallelism and concurrency attainable with analog computation impose serious limits on the amount of power that each circuit can dissipate. This is why we operate the devices with currents in the nanoamp range and, if possible, picoamps, about the same current levels found in biological systems. This approach is similar to, and strongly influenced by, that of Mead's group at Caltech. Our approach is more minimalistic: we view the transistor itself as the basic building block, not the transconductance amplifier. Thus, currents, rather than differential voltages, are the primary signal representation.
Figure 5: Cont'd. Plots (b) and (c) show how Î21 and Î12 vary as I12 is stepped from 2.0 nA to 100 nA while I21 is held at 50 nA, for various substrate bias voltages. The values Vbs = 0, -50, and -100 mV correspond to weights of 0.93, 0.57, and 0.33, respectively. Notice that these weights modulate signals going both ways symmetrically.
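To first order, the substrate term of equation 3.1 predicts these weights: moving Vbs scales Ids by w = exp((1 − κ)Vbs/VT). A hedged back-of-envelope sketch of our own (the measured weights above also fold in device mismatch and second-order effects, so only rough agreement should be expected):

```python
import math

KAPPA, VT = 0.75, 0.026   # fitted kappa and room-temperature thermal voltage

def substrate_weight(vbs):
    """First-order synaptic weight from substrate bias: the factor by which
    equation 3.1 scales Ids when Vbs moves away from zero."""
    return math.exp((1 - KAPPA) * vbs / VT)

weights = [substrate_weight(v) for v in (0.0, -0.050, -0.100)]
# Predicts roughly 1.0, 0.62, 0.38: the same trend and order as the measured
# weights 0.93, 0.57, and 0.33; the offset at Vbs = 0 reflects device mismatch.
```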
We are not concerned about accuracy or matching in the basic elements because biological systems perform well despite the limited precision of their neurons and synaptic connections. The emerging view is that this is a result of the collective nature of the computation performed, whereby large numbers of elements contribute to the final result. From
a system designer's point of view, this means that random variations in transistor characteristics are not deleterious to the system's performance, whereas systematic variations are and must therefore be kept to a minimum. Indeed, we have observed this in silicon chips. The translinear property of the subthreshold MOS transistor provides a very powerful computational primitive. This property arises from the highly nonlinear relationship between the gate potential and the channel current. In fact, the exponential is the strongest nonlinearity relating a voltage and a current in solid-state devices (Shockley 1963; Gunn 1968). It is interesting to note that the same property holds for voltage-activated ionic channels; however, the conductance dependence is steeper due to correlated charge control of the current (Hille 1984, p. 55). In translinear (current-domain) circuits we have seen a classical example of how a rich form for circuit design emerges from the properties of the basic units (the MOS transistor in subthreshold). To summarize, we have addressed some issues related to the engineering of collective analog computing systems. In particular, we have demonstrated that currents are an appropriate analog signal representation. Current levels comparable to those in excitable membranes are achieved by operating the devices in the subthreshold region, resulting in manageable power dissipation levels. This design methodology and implementation style have been used to build associative memories (Boahen et al. 1989a,b) and self-organizing feature maps in analog VLSI.
Acknowledgments
This research was funded by the Independent Research and Development program of the Applied Physics Laboratory; we thank Robert Jenkins for his personal interest and support. The authors would like to thank Professor Carver Mead of Caltech for encouraging this work.
Philippe Pouliquen and Marc Cohen made excellent comments on the paper, and Sasa Pavasović helped with acquiring the experimental data. We are indebted to Terry Sejnowski, who provided a discussion forum and important insights in the field of neural computation at Johns Hopkins University. We thank the action editor, Professor John Wyatt, for his critical review and insightful comments.
References
Boahen, K. A., Pouliquen, P. O., Andreou, A. G., and Jenkins, R. E. 1989a. A heteroassociative memory using current-mode MOS analog VLSI circuits. IEEE Trans. Circ. Sys. 36 (5), 643-652.
Boahen, K. A., Andreou, A. G., Pavasović, A., and Pouliquen, P. O. 1989b. Architectures for associative memories using current-mode analog MOS circuits.
Proceedings of the Decennial Caltech Conference on VLSI, C. Seitz, ed. MIT Press, Cambridge, MA.
Eisenberg, J., Freeman, W. J., and Burke, B. 1989. Hardware architecture of a neural network model simulating pattern recognition by the olfactory bulb. Neural Networks 2, 315-325.
Freeman, W. J., Yao, Y., and Burke, B. 1988. Central pattern generating and recognizing in olfactory bulb: A correlation learning rule. Neural Networks 1, 277-288.
Gilbert, B. 1975. Translinear circuits: A proposed classification. Electron. Lett. 11 (1), 14-16.
Gilbert, B. 1984. A monolithic 16-channel analog array normalizer. IEEE J. Solid-State Circuits SC-19, 956-963.
Grossberg, S. 1988. Nonlinear neural networks: Principles, mathematics, and architectures. Neural Networks 1, 17-61.
Gunn, J. B. 1968. Thermodynamics of nonlinearity and noise in diodes. J. Appl. Phys. 39 (12), 5357-5361.
Hille, B. 1984. Ionic Channels of Excitable Membranes. Sinauer, Sunderland, MA.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Kohonen, T. 1987. Self-Organization and Associative Memory. Springer-Verlag, New York.
Maher, M. A. C., DeWeerth, S. P., Mahowald, M. A., and Mead, C. A. 1989. Implementing neural architectures using analog VLSI circuits. IEEE Trans. Circ. Sys. 36 (5), 643-652.
Mead, C. A. 1989. Analog VLSI and Neural Systems. Addison-Wesley, Reading, MA.
Pavasović, A., Andreou, A. G., and Westgate, C. R. 1988. An investigation of minimum-size, nano-power MOS current mirrors for analog VLSI systems. JHU Elect. Computer Eng. Tech. Rep. JHU/ECE 88-10.
Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge, MA.
Shepherd, G. M. 1979. The Synaptic Organization of the Brain. Oxford University Press, New York.
Shockley, W. 1963. Electrons and Holes in Semiconductors, p. 90. D. van Nostrand, Princeton, NJ.
Toffoli, T. 1988. Information transport obeying the continuity equation. IBM J. Res. Dev. 32, 29-35.
Vittoz, E. A., and Fellrath, J. 1977. CMOS analog integrated circuits based on weak inversion operation. IEEE J. Solid-State Circuits SC-12, 224-231.
Received 30 March 1989; accepted 13 October 1989.
Communicated by Jack Cowan
Random Neural Networks with Negative and Positive Signals and Product Form Solution Erol Gelenbe Ecole des Hautes Etudes en Informatique (EHEI), Université Paris V, 45 rue des Saints-Pères, 75006 Paris, France
We introduce a new class of random "neural" networks in which signals are either negative or positive. A positive signal arriving at a neuron increases its total signal count or potential by one; a negative signal reduces it by one if the potential is positive, and has no effect if it is zero. When its potential is positive, a neuron "fires," sending positive or negative signals at random intervals to neurons or to the outside. Positive signals represent excitatory signals and negative signals represent inhibition. We show that this model, with exponential signal emission intervals, Poisson external signal arrivals, and Markovian signal movements between neurons, has a product form leading to simple analytical expressions for the system state.
1 Introduction
Consider an open random network of n neurons in which "positive" and "negative" signals circulate. External arrivals of signals to the network can either be positive, arriving at the ith neuron according to a Poisson process of rate Λ(i), or negative, arriving according to a Poisson process of rate λ(i). Positive and negative signals have opposite roles. A negative signal reduces by 1 the potential of the neuron at which it arrives (i.e., it "cancels" an existing signal) or has no effect if the potential is zero. A positive signal adds 1 to the neuron potential. Negative potentials are not allowed at neurons. If the potential at a neuron is positive, it may "fire," sending signals out toward other neurons or to the outside of the network. As signals are sent, they deplete the neuron's potential by the same number. The times between successive signal emissions when neuron i fires are exponentially distributed random variables of average value 1/r(i); hence r(i) is the rate at which neuron i fires. A signal leaving neuron i when it "fires" heads for neuron j with probability p+(i, j) as a positive signal, or as a negative signal with probability p−(i, j), or departs from the network with probability d(i). Let p(i, j) = p+(i, j) + p−(i, j); it is the transition probability of a Markov chain representing the movement of signals between neurons. We shall not
Neural Computation 1, 502-510 (1989) © 1989 Massachusetts Institute of Technology
Random Neural Networks
503
allow the signals leaving a neuron to return directly back to the same neuron: p ( i , i) = 0 for all i. We have n Cp(i,j) + di)= 1 for 1 I i I
(1.1)
3
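To make the dynamics concrete, the model can be simulated directly as a continuous-time Markov chain. The following sketch (the rates and routing probabilities are purely hypothetical, chosen so each row satisfies equation 1.1) estimates the long-run probability that each neuron's potential is positive:

```python
import random

random.seed(0)

# Hypothetical 3-neuron network; all rates and routing probabilities are
# illustrative only, with each row satisfying equation 1.1.
n = 3
Lam = [0.5, 0.3, 0.0]   # Λ(i): external positive (excitatory) arrival rates
lam = [0.0, 0.2, 0.1]   # λ(i): external negative (inhibitory) arrival rates
r   = [1.0, 1.0, 1.0]   # r(i): firing rates
p_plus  = [[0.0, 0.4, 0.2], [0.0, 0.0, 0.5], [0.0, 0.0, 0.0]]
p_minus = [[0.0, 0.0, 0.2], [0.3, 0.0, 0.0], [0.0, 0.0, 0.0]]
# departure probabilities d(i) take up the remaining mass: [0.2, 0.2, 1.0]

k = [0] * n             # neuron potentials
busy = [0.0] * n        # total time each neuron spends with k[i] > 0
t, T_END = 0.0, 20000.0

while t < T_END:
    # competing exponential clocks: external arrivals plus active neurons
    rates = Lam + lam + [r[i] if k[i] > 0 else 0.0 for i in range(n)]
    total = sum(rates)
    dt = random.expovariate(total)
    for i in range(n):
        if k[i] > 0:
            busy[i] += dt
    t += dt
    # choose the next event with probability proportional to its rate
    u, ev = random.random() * total, 0
    while ev < len(rates) - 1 and u > rates[ev]:
        u -= rates[ev]
        ev += 1
    if ev < n:                        # external positive arrival at neuron ev
        k[ev] += 1
    elif ev < 2 * n:                  # external negative arrival
        i = ev - n
        if k[i] > 0:
            k[i] -= 1
    else:                             # neuron i fires one signal
        i = ev - 2 * n
        if k[i] > 0:
            k[i] -= 1
            u = random.random()
            for j in range(n):
                if u < p_plus[i][j]:      # positive signal sent to neuron j
                    k[j] += 1
                    break
                u -= p_plus[i][j]
                if u < p_minus[i][j]:     # negative signal sent to neuron j
                    if k[j] > 0:
                        k[j] -= 1
                    break
                u -= p_minus[i][j]
            # if no branch was taken, the signal departs (probability d(i))

frac = [b / t for b in busy]
print("empirical P[k_i > 0]:", [round(f, 3) for f in frac])
```

For these parameters, the empirical fractions should settle near the quantities q_i obtained by solving equations 2.1 and 2.2 below, as Theorem 1 predicts.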
Positive signals represent excitation and negative signals represent inhibition. Positive external arrivals represent input information, and negative external arrivals can be used to represent the thresholds at each neuron. A simple example of the computational use of this model is presented in Section 3. We show that this new model has a "product form" solution (Gelenbe and Mitrani 1980; Gelenbe and Pujolle 1986). That is, the stationary probability distribution of its state can be written as the product of the marginal probabilities of the state (or potential) of each neuron. This leads to simple expressions for the network state. Previously, product form solutions were known to exist for certain networks with only "positive signals": the queueing networks used in computer and communication system modeling and in operations research (Gelenbe and Mitrani 1980; Gelenbe and Pujolle 1986).

2 The Main Properties of the Model
The main properties of our model are presented in the following theorems.

Theorem 1. Let q_i denote the quantity

q_i = λ⁺(i)/[r(i) + λ⁻(i)]   (2.1)

where the λ⁺(i), λ⁻(i) for i = 1, …, n satisfy the system of nonlinear simultaneous equations:

λ⁺(i) = Σ_j q_j r(j) p⁺(j, i) + Λ(i),   λ⁻(i) = Σ_j q_j r(j) p⁻(j, i) + λ(i)   (2.2)
Let k(t) be the vector of neuron potentials at time t, and k = (k_1, …, k_n) be a particular value of the vector; let p(k) denote the stationary probability distribution

p(k) = lim_{t→∞} Prob[k(t) = k]

If a nonnegative solution {λ⁺(i), λ⁻(i)} exists to equations 2.1 and 2.2 such that each q_i < 1, then

p(k) = Π_{i=1}^n (1 − q_i) q_i^{k_i}
The proof is given in Appendix A. A direct consequence is:

Corollary 1.1. The stationary probability that neuron i fires is given by

lim_{t→∞} Prob[k_i(t) > 0] = q_i = λ⁺(i)/[r(i) + λ⁻(i)]  if q_i < 1

Erol Gelenbe
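Concretely, equations 2.1 and 2.2 can be solved by successive substitution, and the product form of Theorem 1 then gives the full stationary distribution. A minimal sketch on a hypothetical two-neuron network (all parameters below are illustrative):

```python
# Successive substitution for equations 2.1 and 2.2 on a hypothetical
# 2-neuron network (rates and routing probabilities are illustrative only).
n = 2
Lam = [0.4, 0.2]                       # Λ(i): external positive arrival rates
lam = [0.1, 0.0]                       # λ(i): external negative arrival rates
r   = [1.0, 1.5]                       # r(i): firing rates
p_plus  = [[0.0, 0.5], [0.3, 0.0]]     # p⁺(i, j)
p_minus = [[0.0, 0.2], [0.2, 0.0]]     # p⁻(i, j); remaining mass d(i) departs

q = [0.5] * n                          # initial guess
for _ in range(200):
    lp = [Lam[i] + sum(q[j] * r[j] * p_plus[j][i] for j in range(n))
          for i in range(n)]           # λ⁺(i), equation 2.2
    lm = [lam[i] + sum(q[j] * r[j] * p_minus[j][i] for j in range(n))
          for i in range(n)]           # λ⁻(i), equation 2.2
    q = [lp[i] / (r[i] + lm[i]) for i in range(n)]   # equation 2.1

assert all(qi < 1 for qi in q)         # Theorem 1's condition holds here

def stat_prob(k):
    """Product form: p(k) = prod_i (1 - q_i) * q_i**k_i."""
    out = 1.0
    for qi, ki in zip(q, k):
        out *= (1.0 - qi) * qi ** ki
    return out

print("q =", [round(x, 4) for x in q])
print("p(0, 0) =", round(stat_prob((0, 0)), 4))
```

Because each factor (1 − q_i) q_i^{k_i} is a geometric distribution, the probabilities sum to 1 over all k, which is easy to verify numerically.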
2.1 Networks with Some Saturated Neurons. We say that neuron i is saturated if λ⁺(i)/[r(i) + λ⁻(i)] ≥ 1; i.e., in steady state it continuously fires. In many applications, one is interested in working with networks containing some saturated neurons. We have the following extension of Theorem 1. Let NS be the (largest) subset of neurons such that no neuron in NS is saturated, and let S be its complement. Consider the solutions λ⁺(i), λ⁻(i) of the flow equations:

λ⁺(i) = Σ_j q_j r(j) p⁺(j, i) + Λ(i),   λ⁻(i) = Σ_j q_j r(j) p⁻(j, i) + λ(i)

where q_i = λ⁺(i)/[r(i) + λ⁻(i)] if i ∈ NS, and q_i = 1 if i ∈ S.
Theorem 2. Let k(t)_{NS} denote the restriction of the vector k(t) to the neurons in NS. If a positive solution to the flow equations exists, then

lim_{t→∞} P[k_i(t) > 0] = λ⁺(i)/[r(i) + λ⁻(i)]  if i ∈ NS
lim_{t→∞} P[k_i(t) > 0] = 1                     if i ∈ S

and

lim_{t→∞} P[k_{NS}(t) = k_{NS}] = p(k_{NS}) = Π_{i∈NS} (1 − q_i) q_i^{k_i}
We omit the proof of this result.

2.2 Equations 2.1 and 2.2 Describing Signal Flow in Feedforward Networks. Let us now turn to the existence and uniqueness of the solutions λ⁺(i), λ⁻(i), 1 ≤ i ≤ n to equations 2.1 and 2.2, which represent the average arrival rates of positive and negative signals to each neuron. We are unable to guarantee the existence and uniqueness of these quantities for arbitrary networks, except for feedforward networks. A network is said to be feedforward if for any sequence i_1, …, i_s, …, i_r, …, i_m of neurons, i_s = i_r for r > s implies that

Π_{u=1}^{m−1} p(i_u, i_{u+1}) = 0
Theorem 3. If the network is feedforward, then the solutions λ⁺(i), λ⁻(i) to equations 2.1 and 2.2 exist and are unique.

Proof. For any feedforward network, we may construct an isomorphic network by renumbering the neurons so that neuron 1 has no predecessors [i.e., p(i, 1) = 0 for any i], neuron n has no successors [i.e., p(n, i) = 0 for any i], and p(i, j) = 0 if j < i. Thus in the isomorphic network, a signal can possibly (but not necessarily) go directly from neuron i to neuron j only if j is larger than i.
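Under such a renumbering, the recursive computation used in the proof reduces to a single forward pass, since every predecessor of neuron i has a smaller index. A sketch with hypothetical parameters (q_i is clipped to 1 for saturated neurons, as in Section 2.1):

```python
# One-pass solution of the flow equations for a feedforward network that has
# already been renumbered so that p(i, j) = 0 whenever j <= i.
# All rates and routing probabilities below are hypothetical.
n = 3
Lam = [1.2, 0.1, 0.0]                  # Λ(i): external positive arrival rates
lam = [0.0, 0.1, 0.2]                  # λ(i): external negative arrival rates
r   = [1.0, 1.0, 1.0]                  # r(i): firing rates
p_plus  = [[0.0, 0.6, 0.2], [0.0, 0.0, 0.7], [0.0, 0.0, 0.0]]
p_minus = [[0.0, 0.2, 0.0], [0.0, 0.0, 0.1], [0.0, 0.0, 0.0]]

q = []
for i in range(n):                     # predecessors j < i are already solved
    lp = Lam[i] + sum(q[j] * r[j] * p_plus[j][i] for j in range(i))   # λ⁺(i)
    lm = lam[i] + sum(q[j] * r[j] * p_minus[j][i] for j in range(i))  # λ⁻(i)
    q.append(min(lp / (r[i] + lm), 1.0))   # clip: q_i = 1 if saturated

print("q =", [round(x, 4) for x in q])     # the first neuron here is saturated
```

No iteration is needed: each λ⁺(i), λ⁻(i) depends only on already-computed q_j with j < i, which is exactly why existence and uniqueness hold in the feedforward case.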
For such a network, the λ⁺(i) and λ⁻(i) can be calculated recursively as follows. First compute λ⁺(1) = Λ(1), λ⁻(1) = λ(1), and calculate q_1 from equation 2.1; if this yields q_1 ≥ 1, set q_1 = 1 (neuron 1 is saturated), otherwise leave it unchanged. For each successive i such that λ⁺(i), λ⁻(i) have not yet been calculated, proceed as follows: since the q_j for each j < i are known, compute

λ⁺(i) = Σ_j q_j r(j) p⁺(j, i) + Λ(i),   λ⁻(i) =