Elements of Artificial Neural Networks
Kishan Mehrotra, Chilukuri K. Mohan, and Sanjay Ranka
A Bradford Book (Complex Adaptive Systems series), October 1996. ISBN 0-262-13328-8. 344 pp., 144 illus.

Table of Contents

Preface

1 Introduction
1.1 History of Neural Networks
1.2 Structure and Function of a Single Neuron
1.2.1 Biological neurons
1.2.2 Artificial neuron models
1.3 Neural Net Architectures
1.3.1 Fully connected networks
1.3.2 Layered networks
1.3.3 Acyclic networks
1.3.4 Feedforward networks
1.3.5 Modular neural networks
1.4 Neural Learning
1.4.1 Correlation learning
1.4.2 Competitive learning
1.4.3 Feedback-based weight adaptation
1.5 What Can Neural Networks Be Used for?
1.5.1 Classification
1.5.2 Clustering
1.5.3 Vector quantization
1.5.4 Pattern association
1.5.5 Function approximation
1.5.6 Forecasting
1.5.7 Control applications
1.5.8 Optimization
1.5.9 Search
1.6 Evaluation of Networks
1.6.1 Quality of results
1.6.2 Generalizability
1.6.3 Computational resources
1.7 Implementation
1.8 Conclusion
1.9 Exercises

2 Supervised Learning: Single-Layer Networks
2.1 Perceptrons
2.4 Guarantee of Success
2.5 Modifications
2.5.1 Pocket algorithm
2.5.2 Adalines
2.5.3 Multiclass discrimination
2.6 Conclusion
2.7 Exercises

3 Supervised Learning: Multilayer Networks I
3.1 Multilevel Discrimination
3.2 Preliminaries
3.2.1 Architecture
3.2.2 Objectives
3.3 Backpropagation Algorithm
3.4 Setting the Parameter Values
3.4.1 Initialization of weights
3.4.2 Frequency of weight updates
3.4.3 Choice of learning rate
3.4.4 Momentum
3.4.5 Generalizability
3.4.6 Number of hidden layers and nodes
3.4.7 Number of samples
3.5 Theoretical Results*
3.5.1 Cover's theorem
3.5.2 Representations of functions
3.5.3 Approximations of functions
3.6 Accelerating the Learning Process
3.6.1 Quickprop algorithm
3.6.2 Conjugate gradient
3.7 Applications
3.7.1 Weaning from mechanically assisted ventilation
3.7.2 Classification of myoelectric signals
3.7.3 Forecasting commodity prices
3.7.4 Controlling a gantry crane
3.8 Conclusion
3.9 Exercises

4 Supervised Learning: Multilayer Networks II
4.1 Madalines
4.2 Adaptive Multilayer Networks
4.2.6 Tiling algorithm
4.3 Prediction Networks
4.3.1 Recurrent networks
4.3.2 Feedforward networks for forecasting
4.4 Radial Basis Functions
4.5 Polynomial Networks
4.6 Regularization
4.7 Conclusion
4.8 Exercises

5 Unsupervised Learning
5.1 Winner-Take-All Networks
5.1.1 Hamming networks
5.1.2 Maxnet
5.1.3 Simple competitive learning
5.2 Learning Vector Quantizers
5.3 Counterpropagation Networks
5.4 Adaptive Resonance Theory
5.5 Topologically Organized Networks
5.5.1 Self-organizing maps
5.5.2 Convergence*
5.5.3 Extensions
5.6 Distance-Based Learning
5.6.1 Maximum entropy
5.6.2 Neural gas
5.7 Neocognitron
5.8 Principal Component Analysis Networks
5.9 Conclusion
5.10 Exercises

6 Associative Models
6.1 Non-iterative Procedures for Association
6.2 Hopfield Networks
6.2.1 Discrete Hopfield networks
6.2.2 Storage capacity of Hopfield networks*
6.2.3 Continuous Hopfield networks
6.3 Brain-State-in-a-Box Network
6.4 Boltzmann Machines
6.4.1 Mean field annealing
6.5 Hetero-associators

7 Optimization Methods
7.1.2 Solving simultaneous linear equations
7.1.3 Allocating documents to multiprocessors (Discrete Hopfield network; Continuous Hopfield network; Performance)
7.2 Iterated Gradient Descent
7.3 Simulated Annealing
7.4 Random Search
7.5 Evolutionary Computation
7.5.1 Evolutionary algorithms
7.5.2 Initialization
7.5.3 Termination criterion
7.5.4 Reproduction
7.5.5 Operators (Mutation; Crossover)
7.5.6 Replacement
7.5.7 Schema Theorem*
7.6 Conclusion
7.7 Exercises

Appendix A: A Little Math
A.1 Calculus
A.2 Linear Algebra
A.3 Statistics

Appendix B: Data
B.1 Iris Data
B.2 Classification of Myoelectric Signals
B.3 Gold Prices
B.4 Clustering Animal Features
B.5 3-D Corners, Grid and Approximation
B.6 Eleven-City Traveling Salesperson Problem (Distances)
B.7 Daily Stock Prices of Three Companies, over the Same Period
B.8 Spiral Data

Bibliography
Index
Preface

This book is intended as an introduction to the subject of artificial neural networks for readers at the senior undergraduate or beginning graduate levels, as well as professional engineers and scientists. The background presumed is roughly a year of college-level mathematics, and some amount of exposure to the task of developing algorithms and computer programs. For completeness, some of the chapters contain theoretical sections that discuss issues such as the capabilities of algorithms presented. These sections, identified by an asterisk in the section name, require greater mathematical sophistication and may be skipped by readers who are willing to assume the existence of theoretical results about neural network algorithms.

Many off-the-shelf neural network toolkits are available, including some on the Internet, and some that make source code available for experimentation. Toolkits with user-friendly interfaces are useful in attacking large applications; for a deeper understanding, we recommend that the reader be willing to modify computer programs, rather than remain a user of code written elsewhere.

The authors of this book have used the material in teaching courses at Syracuse University, covering various chapters in the same sequence as in the book. The book is organized so that the most frequently used neural network algorithms (such as error backpropagation) are introduced very early, so that these can form the basis for initiating course projects. Chapters 2, 3, and 4 have a linear dependency and, thus, should be covered in the same sequence. However, chapters 5 and 6 are essentially independent of each other and earlier chapters, so these may be covered in any relative order. If the emphasis in a course is to be on associative networks, for instance, then chapter 6 may be covered before chapters 2, 3, and 4. Chapter 6 should be discussed before chapter 7. If the "non-neural" parts of chapter 7 (sections 7.2 to 7.5) are not covered in a short course, then discussion of section 7.1 may immediately follow chapter 6. The inter-chapter dependency rules are roughly as follows.

1 → 2 → 3 → 4
1 → 5
1 → 6
3 → 5.3
6.2 → 7.1

Within each chapter, it is best to cover most sections in the same sequence as the text; this is not logically necessary for parts of chapters 4, 5, and 7, but minimizes student confusion. Material for transparencies may be obtained from the authors. We welcome suggestions for improvements and corrections. Instructors who plan to use the book in a course should
send electronic mail to one of the authors, so that we can indicate any last-minute corrections needed (if errors are found after book production). New theoretical and practical developments continue to be reported in the neural network literature, and some of these are relevant even for newcomers to the field; we hope to communicate some such results to instructors who contact us.

The authors of this book have arrived at neural networks through different paths (statistics, artificial intelligence, and parallel computing) and have developed the material through teaching courses in Computer and Information Science. Some of our biases may show through the text, while perspectives found in other books may be missing; for instance, we do not discount the importance of neurobiological issues, although these consume little ink in the book. It is hoped that this book will help newcomers understand the rationale, advantages, and limitations of various neural network models. For details regarding some of the more mathematical and technical material, the reader is referred to more advanced texts such as those by Hertz, Krogh, and Palmer (1990) and Haykin (1994).

We express our gratitude to all the researchers who have worked on and written about neural networks, and whose work has made this book possible. We thank Syracuse University and the University of Florida, Gainesville, for supporting us during the process of writing this book. We thank Li-Min Fu, Joydeep Ghosh, and Lockwood Morris for many useful suggestions that have helped improve the presentation. We thank all the students who have suffered through earlier drafts of this book, and whose comments have improved this book, especially S. K. Bolazar, M. Gunwani, A. R. Menon, and Z. Zeng. We thank Elaine Weinman, who has contributed much to the development of the text. Harry Stanton of the MIT Press has been an excellent editor to work with. Suggestions on an early draft of the book, by various reviewers, have helped correct many errors. Finally, our families have been the source of much needed support during the many months of work this book has entailed.

We expect that some errors remain in the text, and welcome comments and corrections from readers. The authors may be reached by electronic mail at
[email protected],
[email protected], and
[email protected]. In particular, there has been so much recent research in neural networks that we may have mistakenly failed to mention the names of researchers who have developed some of the ideas discussed in this book. Errata, computer programs, and data files will be made accessible via the Internet.
1 Introduction

If we could first know where we are, and whither we are tending, we could better judge what to do, and how to do it.
—Abraham Lincoln

Many tasks involving intelligence or pattern recognition are extremely difficult to automate, but appear to be performed very easily by animals. For instance, animals recognize various objects and make sense out of the large amount of visual information in their surroundings, apparently requiring very little effort. It stands to reason that computing systems that attempt similar tasks will profit enormously from understanding how animals perform these tasks, and simulating these processes to the extent allowed by physical limitations. This necessitates the study and simulation of Neural Networks.

The neural network of an animal is part of its nervous system, containing a large number of interconnected neurons (nerve cells). "Neural" is an adjective for neuron, and "network" denotes a graph-like structure. Artificial neural networks refer to computing systems whose central theme is borrowed from the analogy of biological neural networks. Bowing to common practice, we omit the prefix "artificial." There is potential for confusing the (artificial) poor imitation for the (biological) real thing; in this text, non-biological words and names are used as far as possible.

Artificial neural networks are also referred to as "neural nets," "artificial neural systems," "parallel distributed processing systems," and "connectionist systems." For a computing system to be called by these pretty names, it is necessary for the system to have a labeled directed graph structure where nodes perform some simple computations. From elementary graph theory we recall that a "directed graph" consists of a set of "nodes" (vertices) and a set of "connections" (edges/links/arcs) connecting pairs of nodes. A graph is a "labeled graph" if each connection is associated with a label to identify some property of the connection. In a neural network, each node performs some simple computations, and each connection conveys a signal from one node to another, labeled by a number called the "connection strength" or "weight" indicating the extent to which a signal is amplified or diminished by a connection.

Not every such graph can be called a neural network, as illustrated in example 1.1 using a simple labeled directed graph that conducts an elementary computation.

EXAMPLE 1.1 The "AND" of two binary inputs is an elementary logical operation, implemented in hardware using an "AND gate." If the inputs to the AND gate are x1 ∈ {0, 1} and x2 ∈ {0, 1}, the desired output is 1 if x1 = x2 = 1, and 0 otherwise. A graph representing this computation is shown in figure 1.1, with one node at which computation (multiplication) is carried out, two nodes that hold the inputs (x1, x2), and one node that holds one output. However, this graph cannot be considered a neural network since the connections
Figure 1.1 AND gate graph: inputs x1, x2 ∈ {0, 1} feed a multiplier node whose output is o = x1 AND x2.

Figure 1.2 AND gate network: the multiplier node now receives the weighted inputs (w1 x1) and (w2 x2), producing o = x1 AND x2.
between the nodes are fixed and appear to play no other role than carrying the inputs to the node that computes their conjunction.

We may modify the graph in figure 1.1 to obtain a network containing weights (connection strengths), as shown in figure 1.2. Different choices for the weights result in different functions being evaluated by the network. Given a network whose weights are initially random, and given that we know the task to be accomplished by the network, a "learning algorithm" must be used to determine the values of the weights that will achieve the desired task. The graph structure, with connection weights modifiable using a learning algorithm, qualifies the computing system to be called an artificial neural network.

EXAMPLE 1.2 For the network shown in figure 1.2, the following is an example of a learning algorithm that will allow learning the AND function, starting from arbitrary values of w1 and w2. The trainer uses the following four examples to modify the weights: {(x1 = 1, x2 = 1; d = 1), (x1 = 1, x2 = 0; d = 0), (x1 = 0, x2 = 1; d = 0), (x1 = 0, x2 = 0; d = 0)}, where d denotes the desired output. Whenever the network's output differs from d, one of the weights w1 or w2 is incremented or decremented by a small fixed step such as 0.1; such adjustments may be repeated until the final result is satisfactory, for example with weights w1 = 5.0, w2 = 0.2.

Can the weights of such a net be modified so that the system performs a different task? For instance, is there a set of values for w1 and w2 such that a net otherwise identical to that shown in figure 1.2 can compute the OR of its inputs? Unfortunately, there is no possible choice of weights w1 and w2 such that (w1 · x1) · (w2 · x2) will compute the OR of x1 and x2. For instance, whenever x1 = 0, the output value (w1 · x1) · (w2 · x2) = 0, irrespective of whether x2 = 1. The node function was predetermined to multiply weighted inputs, imposing a fundamental limitation on the capabilities of the network shown in figure 1.2, although it was adequate for the task of computing the AND function and for functions described by the mathematical expression o = w1 w2 x1 x2.

A different node function is needed if there is to be some chance of learning the OR function. An example of such a node function is (x1 + x2 − x1 · x2), which evaluates to 1 if x1 = 1 or x2 = 1, and to 0 if x1 = 0 and x2 = 0 (assuming that each input can take only a 0 or 1 value). But this network cannot be used to compute the AND function.

Sometimes, a network may be capable of computing a function, but the learning algorithm may not be powerful enough to find a satisfactory set of weight values, and the final result may be constrained due to the initial (random) choice of weights. For instance, the AND function cannot be learnt accurately using the learning algorithm described above if we started from initial weight values w1 = w2 = 0.3, since the solution w1 = 1/0.3 cannot be reached by repeatedly incrementing (or decrementing) the initial choice of w1 by 0.1.

We seem to be stuck with one node function for AND and another for OR. What if we did not know beforehand whether the desired function was AND or OR? Is there some node function such that we can simulate AND as well as OR by using different weight values? Is there a different network that is powerful enough to learn every conceivable function of its inputs? Fortunately, the answer is yes; networks can be built with sufficiently general node functions so that a large number of different problems can be solved, using a different set of weight values for each task.

The AND gate example has served as a takeoff point for several important questions: what are neural networks, what can they accomplish, how can they be modified, and what are their limitations?
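To make example 1.2 concrete, the following sketch (our Python illustration, not code from the book) implements the multiplicative node of figure 1.2 and a trial-and-error weight adjustment of the kind described above. The step of 0.1 matches the example; the initial weights, the stopping test, and the choice to adjust only w1 are assumptions made for brevity.

```python
# Network of figure 1.2: output o = (w1*x1) * (w2*x2).
TRAINING_SET = [  # (x1, x2, desired output) for the AND function
    (1, 1, 1),
    (1, 0, 0),
    (0, 1, 0),
    (0, 0, 0),
]

def net_output(w1, w2, x1, x2):
    return (w1 * x1) * (w2 * x2)

def train_and(w1=0.3, w2=0.5, step=0.1, max_passes=1000):
    """Nudge w1 up or down by a fixed step whenever the output is wrong,
    as in example 1.2 (only w1 is adjusted here, for simplicity)."""
    for _ in range(max_passes):
        total_error = 0.0
        for x1, x2, d in TRAINING_SET:
            error = d - net_output(w1, w2, x1, x2)
            total_error += abs(error)
            if error > 0:
                w1 += step
            elif error < 0:
                w1 -= step
        if total_error < 0.05:  # "satisfactory" result (assumed threshold)
            break
    return w1, w2

print(train_and())  # converges to weights whose product w1*w2 is close to 1
```

As the text notes, this search succeeds only when some multiple of the step size can bring the product w1·w2 close to 1, which is why starting from w1 = w2 = 0.3 fails.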
In the rest of this chapter, we review the history of research in neural networks, and address four important questions regarding neural network systems.
1. How does a single neuron work?
2. How is a neural network structured, i.e., how are different neurons combined or connected to obtain the desired behavior?
3. How can neurons and neural networks be made to learn?
4. What can neural networks be used for?

We also discuss some general issues important for the evaluation and implementation of neural networks.

1.1 History of Neural Networks
Those who cannot remember the past are condemned to repeat it.
—Santayana, "The Life of Reason" (1905-06)

The roots of all work on neural networks are in neurobiological studies that date back to about a century ago. For many decades, biologists have speculated on exactly how the nervous system works. The following century-old statement by William James (1890) is particularly insightful, and is reflected in the subsequent work of many researchers.

The amount of activity at any given point in the brain cortex is the sum of the tendencies of all other points to discharge into it, such tendencies being proportionate
1. to the number of times the excitement of other points may have accompanied that of the point in question;
2. to the intensities of such excitements; and
3. to the absence of any rival point functionally disconnected with the first point, into which the discharges may be diverted.

How do nerves behave when stimulated by different magnitudes of electric current? Is there a minimal threshold (quantity of current) needed for nerves to be activated? Given that no single nerve cell is long enough, how do different nerve cells communicate electrical currents among one another? How do various nerve cells differ in behavior? Although hypotheses could be formulated, reasonable answers to these questions could not be given and verified until the mid-twentieth century, with the advance of neurology as a science.

Another front of attack came from psychologists striving to understand exactly how learning, forgetting, recognition, and other such tasks are accomplished by animals. Psycho-physical experiments have helped greatly to enhance our meager understanding of how individual neurons and groups of neurons work.

McCulloch and Pitts (1943) are credited with developing the first mathematical model of a single neuron. This model has been modified and widely applied in subsequent work.
System-builders are mainly concerned with questions as to whether a neuron model is sufficiently general to enable learning all kinds of functions, while being easy to implement, without requiring excessive computation within each neuron. Biological modelers, on the other hand, must also justify a neuron model by its biological plausibility.

Most neural network learning rules have their roots in statistical correlation analysis and in gradient descent search procedures. Hebb's (1949) learning rule incrementally modifies connection weights by examining whether two connected nodes are simultaneously ON or OFF. Such a rule is still widely used, with some modifications. Rosenblatt's (1958) "perceptron" neural model and the associated learning rule are based on gradient descent, "rewarding" or "punishing" a weight depending on the satisfactoriness of a neuron's behavior. The simplicity of this scheme was also its nemesis; there are certain simple pattern recognition tasks that individual perceptrons cannot accomplish, as shown by Minsky and Papert (1969). A similar problem was faced by the Widrow-Hoff (1960, 1962) learning rule, also based on gradient descent. Despite obvious limitations, accomplishments of these systems were exaggerated and incredible claims were asserted, saying that intelligent machines have come to exist. This discredited and discouraged neural network research among computer scientists and engineers.

A brief history of early neural network activities is listed below, in chronological order.

1938 Rashevsky initiated studies of neurodynamics, also known as neural field theory, representing activation and propagation in neural networks in terms of differential equations.

1943 McCulloch and Pitts invented the first artificial model for biological neurons using simple binary threshold functions (described in section 1.2.2).

1943 Landahl, McCulloch, and Pitts noted that many arithmetic and logical operations could be implemented using methods containing McCulloch and Pitts neuron models.

1948 Wiener presented an elaborate mathematical approach to neurodynamics, extending the work initiated by Rashevsky.

1949 In The Organization of Behavior, an influential book, Hebb followed up on early suggestions of Lashley and Cajal, and introduced his famous learning rule: repeated activation of one neuron by another, across a particular synapse, increases its conductance.

1954 Gabor invented the "learning filter" that uses gradient descent to obtain "optimal" weights that minimize the mean squared error between the observed output signal and a signal generated based upon the past information.

1954 Cragg and Temperly reformulated the McCulloch and Pitts network in terms of the "spin-glass" model well known to physicists.
1956 Taylor introduced an associative memory network using Hebb's rule.

1956 Beurle analyzed the triggering and propagation of large-scale brain activity.

1956 Von Neumann showed how to introduce redundancy and fault tolerance into neural networks and showed how the synchronous activation of many neurons can be used to represent each bit of information.

1956 Uttley demonstrated that neural networks with modifiable connections could learn to classify patterns with synaptic weights representing conditional probabilities. He developed a linear separator in which weights were adjusted using Shannon's entropy measure.

1958 Rosenblatt invented the "perceptron," introducing a learning method for the McCulloch and Pitts neuron model.

1960 Widrow and Hoff introduced the "Adaline," a simple network trained by a gradient descent rule to minimize mean squared error.

1961 Rosenblatt proposed the "backpropagation" scheme for training multilayer networks; this attempt was unsuccessful because he used non-differentiable node functions.

1962 Hubel and Wiesel conducted important biological studies of properties of the neurons in the visual cortex of cats, spurring the development of self-organizing artificial neural models that simulated these properties.

1963 Novikoff provided a short proof for the Perceptron Convergence Theorem conjectured by Rosenblatt.

1964 Taylor constructed a winner-take-all circuit with inhibitions among output units.

1966 Uttley developed neural networks in which synaptic strengths represent the mutual information between firing patterns of neurons.

1967 Cowan introduced the sigmoid firing characteristic.

1967 Amari obtained a mathematical solution of the credit assignment problem to determine a learning rule for weights in multilayer networks. Unfortunately, its importance was not noticed for a long time.

1968 Cowan introduced a network of neurons with skew-symmetric coupling constants that generates neutrally stable oscillations in neuron outputs.

1969 Minsky and Papert demonstrated the limits of simple perceptrons. This important work is famous for demonstrating that perceptrons are not computationally universal, and infamous as it resulted in a drastic reduction in funding support for research in neural networks.
In the next two decades, the limitations of neural networks were overcome to some extent by researchers who explored several different lines of work.

1. Combinations of many neurons (i.e., neural networks) can be more powerful than single neurons. Learning rules applicable to large NNs were formulated by researchers such as Dreyfus (1962), Bryson and Ho (1969), and Werbos (1974), and popularized by McClelland and Rumelhart (1986). Most of these are still based on gradient descent.

2. Often gradient descent is not successful in obtaining a desired solution to a problem. Random, probabilistic, or stochastic methods (e.g., Boltzmann machines) have been developed to combat this problem by Ackley, Hinton, and Sejnowski (1985); Kirkpatrick, Gelatt, and Vecchi (1983); and others.

3. Theoretical results have been established to understand the capabilities of non-trivial neural networks, by Cybenko (1988) and others. Theoretical analyses have been carried out to establish whether networks can give an approximately correct solution with a high probability, even though the correct solution is not guaranteed [see Valiant (1985), Baum and Haussler (1988)].

4. For effective use of available problem-specific information, "hybrid systems" (combining neural networks and non-connectionist components) were developed, bridging the gulf between symbolic and connectionist systems [see Gallant (1986)].

In recent years, several other researchers (such as Amari, Grossberg, Hopfield, Kohonen, von der Malsburg, and Willshaw) have made major contributions to the field of neural networks, such as in self-organizing maps discussed in chapter 5 and in associative memories discussed in chapter 6.

1.2 Structure and Function of a Single Neuron

In this section, we begin by discussing biological neurons, then discuss the functions computed by nodes in artificial neural networks.

1.2.1 Biological neurons

A typical biological neuron is composed of a cell body, a tubular axon, and a multitude of hair-like dendrites, shown in figure 1.3. The dendrites form a very fine filamentary brush surrounding the body of the neuron. The axon is essentially a long, thin tube that splits into branches terminating in little end bulbs that almost touch the dendrites of other cells. The small gap between an end bulb and a dendrite is called a synapse, across which information is propagated. The axon of a single neuron forms synaptic connections with many other
Figure 1.3 A biological neuron.

neurons; the presynaptic side of the synapse refers to the neuron that sends a signal, while the postsynaptic side refers to the neuron that receives the signal. However, the real picture of neurons is a little more complicated.

1. A neuron may have no obvious axon, but only "processes" that receive and transmit information.
2. Axons may form synapses on other axons.
3. Dendrites may form synapses onto other dendrites.

The number of synapses received by each neuron ranges from 100 to 100,000. Morphologically, most synaptic contacts are of two types.

Type I: Excitatory synapses with asymmetrical membrane specializations; membrane thickening is greater on the postsynaptic side. The presynaptic side contains round bags (synaptic vesicles) believed to contain packets of a neurotransmitter (a chemical such as glutamate or aspartate).

Type II: Inhibitory synapses with symmetrical membrane specializations; with smaller ellipsoidal or flattened vesicles. Gamma-amino butyric acid is an example of an inhibitory neurotransmitter.

An electrostatic potential difference is maintained across the cell membrane, with the inside of the membrane being negatively charged. Ions diffuse through the membrane to maintain this potential difference. Inhibitory or excitatory signals from other neurons are
transmitted to a neuron at its dendrites' synapses. The magnitude of the signal received by a neuron (from another) depends on the efficiency of the synaptic transmission, and can be thought of as the strength of the connection between the neurons. The cell membrane becomes electrically active when sufficiently excited by the neurons making synapses onto this neuron. A neuron will fire, i.e., send an output impulse of about 100 mV down its axon, if sufficient signals from other neurons fall upon its dendrites in a short period of time, called the period of latent summation. The neuron fires if its net excitation exceeds its inhibition by a critical amount, the threshold of the neuron; this process is modeled by equations proposed by Hodgkin and Huxley (1952). Firing is followed by a brief refractory period during which the neuron is inactive. If the input to the neuron remains strong, the neuron continues to deliver impulses at frequencies up to a few hundred impulses per second. It is this frequency which is often referred to as the output of the neuron. Impulses propagate down the axon of a neuron and reach up to the synapses, sending signals of various strengths down the dendrites of other neurons.

1.2.2 Artificial neuron models
We begin our discussion of artificial neuron models by introducing oft-used terminology that establishes the correspondence between biological and artificial neurons, shown in table 1.1. Node output represents firing frequency when allowed to take arbitrary nonbinary values; however, the analogy with biological neurons is more direct in some artificial neural networks with binary node outputs, and a node is said to be fired when its net input exceeds a certain threshold.

Figure 1.4 describes a general model encompassing almost every artificial neuron model proposed so far. Even this noncommittal model makes the following assumptions that may lead one to question its biological plausibility.

1. The position on the neuron (node) of the incoming synapse (connection) is irrelevant.

2. Each node has a single output value, distributed to other nodes via outgoing links, irrespective of their positions.
Table 1.1 Terminology

Biological Terminology        Artificial Neural Network Terminology
Neuron                        Node/Unit/Cell/Neurode
Synapse                       Connection/Edge/Link
Synaptic Efficiency           Connection Strength/Weight
Firing Frequency              Node Output
Figure 1.4 General neuron model: a node with inputs x1, ..., xn computes an output f(w1 x1, ..., wn xn).
3. All inputs come in at the same time or remain activated at the same level long enough for computation (of f) to occur. An alternative is to postulate the existence of buffers to store weighted inputs inside nodes.

The next level of specialization is to assume that different weighted inputs are summed, as shown in figure 1.5. The neuron output may be written as f(w1 x1 + ... + wn xn), or f(∑_{i=1}^{n} wi xi), or f(net), where net = ∑_{i=1}^{n} wi xi. The simplification involved here is the assumption that all weighted inputs are treated similarly, and merely summed. When examining biological plausibility of such models, we may pose questions such as the following: If different inputs to a biological neuron come in at different locations, exactly how can these be added up before any other function (f) is applied to them? Some artificial neuron models do not sum their weighted inputs, but take their product, as in "sigma-pi" networks [see Feldman and Ballard (1982), Rumelhart and McClelland (1986)]. Nevertheless, the model shown in figure 1.5 is most commonly used, and we elaborate on it in the rest of this section, addressing the exact form of the function f.

The simplest possible functions are: the identity function f(net) = net; the non-negative identity function f(net) = max(0, net); and the constant functions f(net) = c for some constant value c. Some other functions, commonly used in neural networks, are described below. Node functions whose outputs saturate (e.g., lim_{net→∞} f(net) = 1 and lim_{net→−∞} f(net) = 0) are of great interest in all neural network models. Only such functions will be considered in this chapter. Inputs to a neuron that differ very little are expected to produce approximately the same outputs, which justifies using continuous node functions.
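As an illustration of this summed-input model, here is a minimal Python sketch (ours, not the book's); the function name and example values are arbitrary.

```python
def node_output(weights, inputs, f):
    """Compute f(net) for net = w1*x1 + ... + wn*xn (the model of figure 1.5)."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return f(net)

# Example with the identity node function f(net) = net:
print(node_output([0.5, -0.2], [1.0, 3.0], lambda net: net))  # prints -0.1
```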
Figure 1.5 Weighted input summation: the node computes f(w1 x1 + ... + wn xn).

Figure 1.6 Step function.
The motivation for using differentiable node functions will become clear when we present learning algorithms that conduct gradient descent.

Step functions A commonly used single neuron model is given by a simple step function, shown in figure 1.6. This function is defined in general as follows:

f(net) = a if net < c, and f(net) = b if net > c,   (1.1)

and at c, f(c) is sometimes defined to equal a, sometimes b, sometimes (a + b)/2, and sometimes 0. Common choices are c = 0, a = 0, b = 1; and c = 0, a = −1, b = 1. The
latter case is also called the signum function, whose output is +1 if net > 0, −1 if net < 0, and 0 if net = 0.

The step function is very easy to implement. It also captures the idea of having a minimum threshold (= c in figure 1.6) for the net weighted input that must be exceeded if a neuron's output is to equal b. The state of the neuron in which net > c, so that f(net) = b, is often identified as the active or ON state for the neuron, while the state with f(net) = a is considered to be the passive or OFF state, assuming b > a. Note that b is not necessarily greater than a; it is possible that a node is activated when its net input is less than a threshold.

Though the notion of a threshold appears very natural, this model has the biologically implausible feature that the magnitude of the net input is largely irrelevant (given that we know whether net input exceeds the threshold). It is logical to expect that variations in the magnitudes of inputs should cause corresponding variations in the output. This is not the case with discontinuous functions such as the step function. Recall that a function is continuous if small changes in its inputs produce corresponding small changes in its output. With the step function shown in figure 1.6, however, a change in net from c − ε/2 to c + ε/2 produces a change in f(net) from a to b that is large when compared to ε, which can be made infinitesimally small. Biological systems are subject to noise, and a neuron with a discontinuous node function may potentially be activated by a small amount of noise, implying that this node is biologically implausible.

Another feature of the step function is that its output "saturates," i.e., does not increase or decrease to values whose magnitude is excessively high. This is desirable because we cannot expect biological or electronic hardware to produce excessively high voltages. The outputs of the step function may be interpreted as class identifiers: we may conclude that an input sample belongs to one class if and only if the net input exceeds a certain value. This interpretation of the step-functional neuron appears simplistic when a network contains more than one neuron. It is sometimes possible to interpret nodes in the interior of the network as identifying features of the input, while the output neurons compute the application-specific output based on the inputs received from these feature-identifying intermediate nodes.

Ramp functions The ramp function is shown in figure 1.7. This function is defined in general as follows:
f(net) = a if net < c; f(net) = b if net > d; and f(net) = a + ((net − c)(b − a))/(d − c) otherwise.   (1.2)

Common choices are c = 0, d = 1, a = 0, b = 1; and c = −1, d = 1, a = −1, b = 1.
Figure 1.7 Ramp function.

As in the case of the step function, f(net) = max(a, b) is identified as the ON state, and f(net) = min(a, b) is the OFF state. This node function also implies the existence of a threshold c which must be exceeded by the net weighted input in order to activate the node. The node output also saturates, i.e., is limited in magnitude. But unlike the step function, the ramp is continuous; small variations in net weighted input cause correspondingly small variations (or none at all) in the output. This desirable property is gained at the loss of the simple ON/OFF description of the output: for c < net < d, the output lies strictly between a and b, and the node is neither completely ON nor completely OFF.
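The step, signum, and ramp node functions of equations (1.1) and (1.2) translate directly into code. The sketch below is our own illustration; parameter names follow the text, and the values chosen at the boundary points are among the conventions mentioned above.

```python
def step(net, a=0.0, b=1.0, c=0.0):
    """Step function of equation (1.1); here f(c) is taken to equal b."""
    return b if net >= c else a

def signum(net):
    """Signum variant: +1 if net > 0, -1 if net < 0, and 0 if net = 0."""
    return (net > 0) - (net < 0)

def ramp(net, a=0.0, b=1.0, c=0.0, d=1.0):
    """Ramp function of equation (1.2), assuming c < d: saturates at
    a (OFF) and b (ON), and is linear for c < net < d."""
    if net <= c:
        return a
    if net >= d:
        return b
    return a + (net - c) * (b - a) / (d - c)
```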
The training part of the Adaline (preceding output generation via the step function) can be used for function approximation tasks as well, unlike the perceptron. In this respect, the
behavior of the Adaline is seen to be identical to that of statistical "linear regression" [see Neter, Wasserman, and Kutner (1990)], where a set of (n + 1) linear equations

∑_{j=1}^{P} (d_j − (w_0 + w_1 i_{1,j} + ... + w_n i_{n,j})) i_{l,j} = 0,   l = 0, 1, ..., n (with i_{0,j} = 1),

must be solved for the unknown values w_0, ..., w_n; here P denotes the number of patterns.
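These (n + 1) normal equations can be solved directly with standard linear algebra. The following sketch is ours (it uses NumPy's least-squares routine, not any code from the book); the names and example data are arbitrary.

```python
import numpy as np

def regression_weights(inputs, desired):
    """inputs: P x n array of patterns; desired: length-P vector of targets.
    Returns w = (w0, ..., wn) solving the normal equations above."""
    P = inputs.shape[0]
    # Prepend a column of ones so that w0 acts as the bias weight.
    X = np.hstack([np.ones((P, 1)), inputs])
    w, *_ = np.linalg.lstsq(X, desired, rcond=None)
    return w

# Example: fit a line to three one-dimensional patterns lying on d = -1 + 2x.
X = np.array([[0.0], [1.0], [2.0]])
d = np.array([-1.0, 1.0, 3.0])
print(regression_weights(X, d))  # approximately [-1.  2.]
```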
EXAMPLE 2.6 This example illustrates the execution of the Adaline LMS training algorithm on the data of example 2.3, consisting of 27 one-dimensional input patterns which are not linearly separable. Samples {0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.49, 0.52, 0.56, 0.57, 0.82} belong to one class (with desired output −1), whereas samples {0.12, 0.47, 0.48, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95} belong to the other class (with desired output 1). As in the case of the perceptron, a learning rate of 0.1 was chosen, the bias weight was initialized at −1.0, and w1 was initialized randomly at w1 = −0.36. When the input sample 0.05 was presented, the net input to the Adaline was (0.05)(−0.36) − 1 ≈ −1.0; since this was the desired output, there is no change in the weight vector. When the input sample 0.5 was presented, the net input to the Adaline was (0.5)(−0.36) − 1 = −1.18, leading to a change of Δw1 = (0.1)(0.5)(1.0 − (−1.18)) = 0.109, so that the new weight is w1 = −0.36 + 0.109 ≈ −0.25. Similarly, the bias weight changes to w0 = −1 + (0.1)(1.0 − (−1.18)) ≈ −0.78. The MSE is now 2.4, reducing to about 0.9 in the next 20 presentations, but the number of misclassified samples is still large (13). By the end of 1,000 presentations, the MSE reduces to 0.73, with the least possible number (6) of misclassified samples, for w1 = 2.06, w0 = −1.2, corresponding to a class separation at x ≈ 0.58. In subsequent presentations, MSE remains roughly at this level, although the number of misclassifications occasionally increases during this process.
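The updates traced in example 2.6 correspond to the following LMS training loop (a sketch in our own Python; the data, learning rate, and initial weights follow the example, but the random presentation order is an assumption).

```python
import random

# Data of example 2.6: one-dimensional samples with desired outputs -1 or +1.
CLASS_NEG = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45,
             0.49, 0.52, 0.56, 0.57, 0.82]
CLASS_POS = [0.12, 0.47, 0.48, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75,
             0.8, 0.85, 0.9, 0.95]
DATA = [(x, -1.0) for x in CLASS_NEG] + [(x, 1.0) for x in CLASS_POS]

def train_adaline(eta=0.1, w0=-1.0, w1=-0.36, presentations=1000):
    for _ in range(presentations):
        x, d = random.choice(DATA)   # present one sample
        net = w1 * x + w0            # Adaline's net input
        w1 += eta * x * (d - net)    # LMS rule: delta_w1 = eta * x * (d - net)
        w0 += eta * (d - net)        # bias treated as a weight on input 1
    return w0, w1

w0, w1 = train_adaline()
mse = sum((d - (w1 * x + w0)) ** 2 for x, d in DATA) / len(DATA)
print(w0, w1, mse)
```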
2.5.3 Multiclass discrimination
So far, we have considered dichotomies, or two-class problems. Many important real-life problems require partitioning data into three or more classes. For example, the character recognition problem (for the Roman alphabet) consists of distinguishing between samples of 26 different classes. A layer of perceptrons or Adalines may be used to solve some such multiclass problems. In figure 2.10 four perceptrons are put together to solve a four-class classification problem. Each weight w_ij indicates the strength of the connection from the jth input to the ith node. A sample, supplied as input to a (trained) single-layer perceptron network, is con-
Figure 2.10 A four-node perceptron that solves a four-class problem in n-dimensional input space.
sidered to belong to the ith class if and only if the ith output o_i = 1, and every other output o_k = 0, for k ≠ i. Such networks are trained in the same way as individual perceptrons; in fact, all the perceptrons can be trained separately, in parallel. If all outputs are zeroes, or if more than one output value equals 1, the network may be considered to have failed in the classification task. In networks whose node outputs can have values in between 0 and 1, a "maximum-selector" can be used to select the highest-valued output, possibly subject to the requirement that the highest-valued output must at least exceed some small threshold level. In some m-class problems, one of the classes can be interpreted as an otherwise case, so that a network with only m − 1 nodes (in the perceptron layer) can be used; if all m − 1 outputs have low values, we assume that the network assigns that input sample to the mth class.

2.6 Conclusion

The perceptron and Adaline are the simplest neural network models; indeed, it may be argued that such single-node systems should not even be called networks. Their applicability is restricted to very simple problems. Their chief contribution is in indicating how a simple supervised learning rule similar to gradient descent can be used to make a network learn a task. The mathematical results discussed in this chapter are clean results that indicate the capabilities as well as limitations of these networks. Such easily interpretable results are difficult to obtain for the more complex networks discussed in later chapters. Perceptrons have been used for pattern recognition tasks. There have been many applications of the Adaline in areas such as signal processing [see Widrow et al. (1975), Hassoun and Spitzer (1988)] and control [see Widrow and Smith (1964), Tolat and Widrow (1988)].
2.7 Exercises

1. Which of the following sets of points are linearly separable?
a. Class 1: (0,0,0), (1,1,1), (2,2,2); Class 2: (3,3,3), (4,4,4), (5,5,5)
b. Class 1: (0,0,0), (1,1,1), (4,4,4); Class 2: (2,2,2), (3,3,3)
c. Class 1: (0,0,0,0), (1,0,1,0), (0,1,0,1); Class 2: (1,1,1,1), (1,1,1,2), (1,2,1,1)
d. Class 1: (0,0,0,0,0), (0,1,0,1,0), (1,0,1,0,1); Class 2: (1,1,1,1,1), (1,1,1,1,2), (0,1,2,1,1)

2. Generate 30 patterns in three-dimensional space such that 15 patterns belong to class 1 and the remaining 15 patterns belong to class 2, and they are linearly separable. Apply the perceptron algorithm on the data so generated and comment on the computational time and performance obtained using different learning rates, with η ∈ {10, 1, 0.1, 0.01}.

3. Train an Adaline for the data {(0,0; 0), (0,1; 0), (2,0; 0), (−1,−1; 0), (−1,0; 0), (0,−1; 0), (1,1; 0), (0,3; 1), (1,3; 1), (2,3; 1), (1,2; 1), (1,0; 1), (1,−1; 1), (2,0; 1), (2,1; 1), (2,2; 1), (2,−1; 1)}, where the third element indicates class membership. Choose the initial weights randomly. Evaluate the performance using the following measures.
a. Number of iterations in training
b. Amount of computation time
c. Number of misclassifications
d. Mean squared error

4. For each of the following problems, determine the length of the perceptron training sequence, with η = 1, beginning from w0 = (1,1,1).
a. Class 1: (3,1), (4,2), (5,3), (6,4); Class 2: (2,2), (1,3), (2,6)
b. Class 1: (0.03,0.01), (0.04,0.02), (0.05,0.03), (0.06,0.04); Class 2: (0.02,0.02), (0.01,0.03), (0.02,0.06)
c. Class 1: (300,100), (400,200), (500,300), (600,400); Class 2: (200,200), (100,300), (200,600)

5. Apply the pocket algorithm and the LMS algorithm to the following data sets.
a. Class 1: (3,1), (4,2), (5,3), (6,4); Class 2: (2,2), (1,3), (2,6), (6,1)
b. Class 1: (3,1), (4,2), (5,3), (6,4), (2.5,2.5); Class 2: (2,2), (1,3), (2,6), (2.6,2.6)

6. For a specific problem, assume that we know that available data can be completely separated into two classes using a circle instead of a straight line. Suggest a modification of the perceptron algorithm that determines the equation of the appropriate circle.

7. Does the perceptron algorithm converge to a solution in a finite number of iterations (for linearly separable data) if the sequence of data points is not random, but chosen in a "malicious" manner by an adversary?

8. In the pocket algorithm, is there any advantage gained if we are allowed to store more than one tentative solution in the "pocket"?

9. Is there any advantage obtained by modifying the Adaline training algorithm to include the heuristic of the pocket algorithm (storing the best recent vector)?

10. Illustrate the development of the weights of an Adaline for the following data.
a. Class 1: (0,0,0), (1,1,1), (2,2,2), (0,0,3), (4,1,4), (2,5,5); Class 2: (3,3,3), (4,4,4), (5,5,5), (3,3,0), (4,4,1), (5,5,2)
b. Class 1: (0,0,0), (1,1,1), (4,4,4); Class 2: (2,2,2), (3,3,3)

11. Adalines without the step function can be used for function approximation. Illustrate the development of the weights of a two-input Adaline for the following data, where the third component of each data point is interpreted as the desired result of the network when the first two components are presented. {(0,0,0), (1,1,1), (2,2,2), (0,0,3), (4,1,4), (2,5,5), (3,3,3), (4,4,4), (5,5,5), (3,3,0), (4,4,1), (5,5,2)} What is the best possible mean squared error that can be obtained (using any other method) for the above data?

12. a. Assume that a two-class classification problem is such that samples belong to Class I if and only if

∑_i c_i x_i + ∑_{i,j} c_{i,j} x_i x_j > 0,

where x_1, ..., x_n are input parameters, and the coefficients (c_i, c_{i,j}) are unknown. How would you modify the simple perceptron to solve such a classification problem?
b. Assume that a function to be approximated is of the form

f(x_1, ..., x_n) = ∑_i c_i x_i + ∑_{i,j} c_{i,j} x_i x_j,

where x_1, ..., x_n are input parameters, and the coefficients (c_i, c_{i,j}) are unknown. Suggest a simple modification of the Adaline that can learn to approximate such a function.

13. Would it be possible to train a perceptron in "batch mode," periodically invoking the following weight update rule after presenting all training samples to the perceptron?

Δw = ∑ (misclassified input vectors from class 1) − ∑ (misclassified input vectors from class 2)

Support your answer by presenting a counterexample or by extending the perceptron convergence theorem.

14. Would it be possible to train a perceptron using a variant of the perceptron training algorithm in which the bias weight is left unchanged, and only the other weights are modified? Support your answer by presenting a counterexample or by extending the perceptron convergence theorem.

15. What is the performance obtainable on iris data, given in appendix B.1, using (a) a perceptron, (b) the LMS algorithm, and (c) the pocket algorithm (with ratchet)?
3 Supervised Learning: Multilayer Networks I
Each new machine or technique, in a sense, changes all existing machines and techniques, by permitting us to put them together into new combinations.
—Alvin Toffler (1970)

Perceptrons and other one-layer networks, discussed in the preceding chapter, are seriously limited in their capabilities. Feedforward multilayer networks with non-linear node functions can overcome these limitations, and can be used for many applications. However, the simple perceptron learning mechanism cannot be extended easily when we go from a single layer of perceptrons to multiple layers of perceptrons. More powerful supervised learning techniques for multilayer networks are presented in this chapter and the next.

The focus of this chapter is on a learning mechanism called error "backpropagation" (abbreviated "backprop") for feedforward networks. Such networks are sometimes referred to as "multilayer perceptrons" (MLPs), a usage we do not adopt, particularly since the learning algorithms used in such networks are considerably different from those of (simple) perceptrons. The phrase "backpropagation network" is sometimes used to describe feedforward neural networks trained using the backpropagation learning method.

Backpropagation came into prominence in the late 1980's. An early version of backpropagation was first proposed by Rosenblatt (1961), but his proposal was crippled by the use of perceptrons that compute step functions of their net weighted inputs. For successful application of this method, differentiable node functions are required. The new algorithm was proposed by Werbos (1974) and largely ignored by the scientific community until the 1980's. Parker (1985) and LeCun (1985) rediscovered it, but its modern specification was provided and popularized by Rumelhart, Hinton, and Williams (1986).

Backpropagation is similar to the LMS (least mean squared error) learning algorithm described earlier, and is based on gradient descent: weights are modified in a direction that corresponds to the negative gradient of an error measure. The choice of everywhere-differentiable node functions allows correct application of this method. For weights on connections that directly connect to network outputs, this is straightforward and very similar to the Adaline. The major advance of backpropagation over the LMS and perceptron algorithms is in expressing how an error at a higher (or outer) layer of a multilayer network can be propagated backwards to nodes at lower (or inner) layers of the network; the gradient of these backward-propagated error measures (for inner layer nodes) can then be used to determine the desired weight modifications for connections that lead into these hidden nodes.

The backpropagation algorithm has had a major impact on the field of neural networks and has been widely applied to a large number of problems in many disciplines. Backpropagation has been used for several kinds of applications including classification, function approximation, and forecasting.
Supervised Learning: Multilayer Networks I
3.1 Multilevel Discrimination In this section, we present a first attempt to construct a layered structure of nodes to solve linearly nonseparable classification problems, extending the perceptron approach of chapter 2. Such a network is illustrated in figure 3.1, and contains "hidden" (interior) nodes that isolate useful features of the input data. However, it is not easy to train such a network, since the ideal weight change rule is far from obvious. Given that the network makes an error on some input sample, exactly which weights in the network must be modified, and to what extent? This is an instance of a "credit assignment" problem, where credit or blame for a result must be assigned to many different entities that participated in generating that result. If some detailed domain-specific information about the problem is available, methods such as the perceptron training procedure may be used in different successive steps to achieve successful classification. For instance, we may first allocate each sample to one subgroup or cluster by applying neural or non-neural clustering algorithms, and simple perceptrons may be successful in identifying each sample as belonging to one of these subgroups. Now, using the information about the subgroup to which an input sample belongs, one can make the final classification. °\
o2
Figure 3.1 A feedforward network with one hidden layer
Figure 3.2 An instance of a two-class problem in which elements of each class are found in two clusters. Clusters can be separated by straight lines P1, P2, P3, and P4; G1, G2, G3, and G4 represent the cluster centers.
For instance, in the XOR-like problem of figure 3.2, a layer of four different perceptrons can be separately made to learn to which subgroup (G1, G2, G3, G4) an input sample belongs. Another (output) perceptron with four inputs can be trained to discriminate between classes C1 and C2 based on subgroup information alone. Crucial to this particular case is the availability of the information regarding subgroup membership of different samples. The remainder of this chapter does not assume any such information to be available, and the backpropagation learning algorithm includes mechanisms to train hidden nodes to solve appropriate subtasks.

3.2 Preliminaries

In this section, we discuss the architecture of neural networks for which the backpropagation algorithm has been used, and we also describe the precise task attempted by backpropagation. This is followed by the derivation of the backpropagation algorithm, in section 3.3.

3.2.1 Architecture

The backpropagation algorithm assumes a feedforward neural network architecture, outlined in chapter 1. In this architecture, nodes are partitioned into layers numbered 0 to L, where the layer number indicates the distance of a node from the input nodes. The lowermost layer is the input layer, numbered as layer 0, and the topmost layer is the output layer, numbered as layer L. Backpropagation addresses networks for which L ≥ 2, containing
"hidden layers" numbered 1 to L − 1. Hidden nodes do not directly receive inputs from nor send outputs to the external environment. For convenience of presentation, we will assume that L = 2 in describing the backpropagation algorithm, implying that there is only one hidden layer, as shown in figure 3.1. The algorithm can be extended easily to cases when L > 2. The presentation of the algorithm also assumes that the network is strictly feedforward, i.e., only nodes in adjacent layers are directly connected; this assumption can also be done away with.

Input layer nodes merely transmit input values to the hidden layer nodes, and do not perform any computation. The number of input nodes equals the dimensionality of input patterns, and the number of nodes in the output layer is dictated by the problem under consideration. For instance, if the task is to approximate a function mapping n-dimensional input vectors to m-dimensional output vectors, the network contains n input nodes and m output nodes. An additional "dummy" input node with constant input (= 1) is also often used so that the bias or threshold term can be treated just like other weights in the network. The number of nodes in the hidden layer is up to the discretion of the network designer and generally depends on problem complexity.

Each hidden node and output node applies a sigmoid function to its net input, shown in figure 3.3. As discussed briefly in chapter 1, the main reasons motivating the use of an S-shaped sigmoidal function are that it is continuous, monotonically increasing, invertible, everywhere differentiable, and asymptotically approaches its saturation values as net → ±∞. These basic properties of the sigmoidal function are more important than the specific sigmoidal function chosen in our presentation below, namely,

S(net) = 1/(1 + e^(−net)).

Figure 3.3 A sigmoid function.
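In code, this sigmoid and its derivative (needed by the backpropagation derivation in section 3.3) might be written as follows; this is our own sketch, and the identity S′(net) = S(net)(1 − S(net)) is a standard property of this particular sigmoid.

```python
import math

def sigmoid(net):
    """S(net) = 1 / (1 + exp(-net)); saturates at 0 and 1."""
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_prime(net):
    """Derivative S'(net); for this sigmoid, S' = S * (1 - S)."""
    s = sigmoid(net)
    return s * (1.0 - s)
```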
3.2.2 Objectives
The algorithm discussed in this chapter is a supervised learning algorithm trained using P input patterns. For each input vector x_p, we have the corresponding desired K-dimensional output vector

d_p = (d_p,1, d_p,2, ..., d_p,K)

for 1 ≤ p ≤ P. This collection of input-output pairs constitutes the training set {(x_p, d_p) : p = 1, ..., P}. The length of the input vector x_p is equal to the number of inputs of the given application. The length of the output vector d_p is equal to the number of outputs of the given application.
The training algorithm should work irrespective of the weight values that preceded training, which may initially have been assigned randomly. As with the other networks, one would expect that some training samples are such that any reasonably small network is forced to make some errors in the hope of better performance on the test data. Furthermore, many real-life problems are such that perfect classification or approximation using a small network is impossible, and not all training or test samples can be classified or approximated correctly. As in the case of perceptrons and Adalines, the goal of training is to modify the weights in the network so that the network's output vector

o_p = (o_p,1, o_p,2, ..., o_p,K)

is as close as possible to the desired output vector d_p, when an input vector x_p is presented to the network. Towards this goal, we would like to minimize the cumulative error of the network:

Error = ∑_{p=1}^{P} Err(o_p, d_p).   (3.1)
As discussed in chapter 1, there are many possible choices for the error function Err above. The function Err should be non-negative, and should be small if o_p is close to d_p (and large otherwise). One such function is the "cross-entropy" function, ∑_p (d_p log o_p + (1 − d_p) log(1 − o_p)), suggested by Hinton (1987). The more popular choices for Err are based on norms of the difference vector d_p − o_p. An error measure may be obtained by examining the difference ℓ_p,j = |o_p,j − d_p,j| between the jth components of the actual and desired output vectors. For the entire output vector, a measure indicating its distance from the desired vector is ℓ_p = ((ℓ_p,1)^u + ... + (ℓ_p,K)^u)^(1/u), for some u > 0. For the rest of the section, we will consider u = 2 and define the Err(o_p, d_p) function to be (ℓ_p)². This function is differentiable, unlike the absolute value function. It is also easy to apply gradient descent methods, since the derivative of the resulting quadratic function is linear and thus easier to compute. Hence our goal will be to find a set of weights that minimize

Sum Square Error = ∑_{p=1}^{P} ∑_{j=1}^{K} (ℓ_p,j)²   (3.2)

or mean squared error

MSE = (1/P) ∑_{p=1}^{P} ∑_{j=1}^{K} (ℓ_p,j)².   (3.3)
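Equations (3.2) and (3.3) amount to the following computation (a sketch with our own names; outputs and targets are lists of K-dimensional vectors).

```python
def sum_square_error(outputs, targets):
    """Equation (3.2): sum over patterns p and output nodes j
    of (o_pj - d_pj)^2."""
    return sum((o - d) ** 2
               for op, dp in zip(outputs, targets)
               for o, d in zip(op, dp))

def mean_squared_error(outputs, targets):
    """Equation (3.3): the sum squared error averaged over the P patterns."""
    return sum_square_error(outputs, targets) / len(outputs)
```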
3.3 Backpropagation Algorithm
The backpropagation algorithm is a generalization of the least mean squared algorithm that modifies network weights to minimize the mean squared error between the desired and actual outputs of the network. Backpropagation uses supervised learning in which the network is trained using data for which inputs as well as desired outputs are known. Once trained, the network weights are frozen and can be used to compute output values for new input samples.

The feedforward process involves presenting an input pattern to input layer neurons that pass the input values onto the first hidden layer. Each of the hidden layer nodes computes a weighted sum of its inputs, passes the sum through its activation function and presents the result to the output layer. The following is the scenario for the pth pattern in a feedforward network with L = 2.
1. The ith node in the input layer holds a value of x_{p,i} for the pth pattern.

2. The net input to the jth node in the hidden layer is net^{(1)}_{p,j} = Σ_{i=0}^{n} w^{(1,0)}_{j,i} x_{p,i}. This includes the threshold, with x_{p,0} = 1; the connection from the ith input node to the jth hidden layer node is assigned a weight value w^{(1,0)}_{j,i}, as shown in figure 3.1.

3. The output of the jth node in the hidden layer is x^{(1)}_{p,j} = S(net^{(1)}_{p,j}), where S is a sigmoid function.

4. The net input to the kth node of the output layer is net^{(2)}_{p,k} = Σ_j w^{(2,1)}_{k,j} x^{(1)}_{p,j}, including the threshold; the connection from the jth hidden layer node to the kth output layer node is assigned a weight value w^{(2,1)}_{k,j}.

5. The output of the kth node of the output layer is o_{p,k} = S(net^{(2)}_{p,k}), where S is a sigmoid function.

6. The desired output of the kth node of the output layer is d_{p,k}, and the corresponding squared error is (ℓ_{p,k})^2 = (d_{p,k} − o_{p,k})^2.

In general, weight w^{(i+1,i)}_{k,j} denotes the weight assigned to the link from node j in the ith layer to node k in the (i+1)th layer, and x^{(i)}_{p,j} denotes the output of the jth node in the ith layer for the pth pattern. The error (for pattern p) is given by E_p = Σ_k (ℓ_{p,k})^2. We need to discover w, the vector consisting of all weights in the network, such that the value of E_p is minimized. Suppressing the suffix p for convenience, the expression to be minimized is E = Σ_{k=1}^{K} (ℓ_k)^2.

One way to minimize E is based on the gradient descent method. Since o_k depends on the network weights, E is also a function of the network weights. According to gradient descent, the direction of the weight change of w should be in the same direction as −∂E/∂w. To simplify the calculation of −∂E/∂w, we examine the weight change in a single weight. We calculate the value of ∂E/∂w^{(2,1)}_{k,j} for each connection from the hidden layer to the output layer. Similarly, we calculate the value of ∂E/∂w^{(1,0)}_{j,i} for each connection from the input layer to the hidden layer. The connection weights are then changed by using the values so obtained; this method is also known as the generalized delta rule. In brief, the suggested weight changes are proportional to the negative gradients:

    Δw^{(2,1)}_{k,j} ∝ −∂E/∂w^{(2,1)}_{k,j}    (3.4)
    Δw^{(1,0)}_{j,i} ∝ −∂E/∂w^{(1,0)}_{j,i}    (3.5)
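Steps 1 through 5 can be written out directly. The following is a minimal NumPy sketch of this feedforward computation under the stated assumptions (one hidden layer, sigmoid S at every node); the names W1 and W2 are illustrative:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward(x_p, W1, W2):
        # W1: hidden weights, shape (m, n+1); W2: output weights, shape (K, m+1)
        x = np.concatenate(([1.0], x_p))   # x_{p,0} = 1 handles the threshold
        net1 = W1 @ x                      # step 2: net inputs to hidden nodes
        h = sigmoid(net1)                  # step 3: hidden outputs x^{(1)}_{p,j}
        hb = np.concatenate(([1.0], h))    # threshold input for the output layer
        net2 = W2 @ hb                     # step 4: net inputs to output nodes
        return sigmoid(net2)               # step 5: outputs o_{p,k}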
The derivative of E with respect to a weight w^{(2,1)}_{k,j}, associated with the link from node j of the hidden layer to the kth node of the output layer, is easier to calculate than for a weight w^{(1,0)}_{j,i} connecting the ith node of the input layer to the jth node of the hidden layer. But both calculations use the same general idea: the chain rule of derivatives. We consider the chain rule in some detail in the following paragraphs as it applies to the backpropagation learning algorithm. We assume ℓ_k = |d_k − o_k| in the following derivation.

The error E depends on w^{(2,1)}_{k,j} only through o_k, i.e., no other output term o_{k'}, k' ≠ k, contains w^{(2,1)}_{k,j}. Hence, for the calculations that follow, it is sufficient to restrict attention to the partial derivative of E with respect to o_k and then differentiate o_k with respect to w^{(2,1)}_{k,j}. Since E = Σ_{k'≠k} (ℓ_{k'})^2 + (d_k − o_k)^2, we obtain

    ∂E/∂o_k = −2(d_k − o_k).    (3.6)
Before differentiating o_k, we observe that the output o_k is obtained by applying a node function S to net^{(2)}_k, i.e., o_k = S(net^{(2)}_k), where net^{(2)}_k represents the total input to node k in the output layer, i.e.,

    net^{(2)}_k = Σ_j w^{(2,1)}_{k,j} x^{(1)}_j.

Hence ∂o_k/∂net^{(2)}_k = S'(net^{(2)}_k), where S'(x) = dS(x)/dx. Finally, ∂net^{(2)}_k/∂w^{(2,1)}_{k,j} = x^{(1)}_j. Consequently, the chain rule

    ∂E/∂w^{(2,1)}_{k,j} = (∂E/∂o_k)(∂o_k/∂net^{(2)}_k)(∂net^{(2)}_k/∂w^{(2,1)}_{k,j})    (3.7)

gives

    ∂E/∂w^{(2,1)}_{k,j} = −2(d_k − o_k) S'(net^{(2)}_k) x^{(1)}_j.    (3.8)
Next, consider the derivation of ∂E/∂w^{(1,0)}_{j,i}. The error E depends on w^{(1,0)}_{j,i} through net^{(1)}_j, which, in turn, appears in each o_k for all k = 1, ..., K. Also, o_k = S(net^{(2)}_k), x^{(1)}_j = S(net^{(1)}_j), and net^{(1)}_j = Σ_i w^{(1,0)}_{j,i} x_i. Therefore, using the chain rule of derivatives, we obtain
    ∂E/∂w^{(1,0)}_{j,i} = Σ_{k=1}^{K} (∂E/∂o_k)(∂o_k/∂net^{(2)}_k)(∂net^{(2)}_k/∂x^{(1)}_j)(∂x^{(1)}_j/∂net^{(1)}_j)(∂net^{(1)}_j/∂w^{(1,0)}_{j,i})
                        = Σ_{k=1}^{K} −2(d_k − o_k) S'(net^{(2)}_k) w^{(2,1)}_{k,j} S'(net^{(1)}_j) x_i.
Figure 3.5 Backpropagation training algorithm, for a one-hidden-layer network; each node has the same activation function S.
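A minimal NumPy sketch of such a training loop, combining the feedforward computation above with the gradient expressions (3.8) and the hidden-layer derivative (per-pattern updates with learning rate η; all names are illustrative, and the constant factor 2 is absorbed into the learning rate):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_backprop(X, D, m, eta=0.1, epochs=1000, rng=np.random.default_rng(0)):
        # X: (P, n) input patterns; D: (P, K) desired outputs; m: hidden nodes
        P, n = X.shape
        K = D.shape[1]
        W1 = rng.uniform(-0.5, 0.5, (m, n + 1))   # hidden weights incl. threshold
        W2 = rng.uniform(-0.5, 0.5, (K, m + 1))   # output weights incl. threshold
        for _ in range(epochs):
            for p in range(P):
                x = np.concatenate(([1.0], X[p]))
                h = sigmoid(W1 @ x)                # hidden outputs x^{(1)}
                hb = np.concatenate(([1.0], h))
                o = sigmoid(W2 @ hb)               # network outputs o_p
                # output layer: (d_k - o_k) S'(net2_k), using S' = S(1 - S)
                delta2 = (D[p] - o) * o * (1.0 - o)
                # hidden layer: back-propagate through W2 (skip threshold column)
                delta1 = (W2[:, 1:].T @ delta2) * h * (1.0 - h)
                W2 += eta * np.outer(delta2, hb)   # gradient descent steps
                W1 += eta * np.outer(delta1, x)
        return W1, W2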
A class may be represented by an output vector with a 1 at the kth position if the given sample x_p belongs to the kth class. But if the node function is the sigmoid 1/(1 + exp(−net)), the output value will be 0 only when the net input is −∞, and the output value will be 1 only when the net input is +∞. Since net_k = Σ_j w_{k,j} x_j, and since the x_j values (inputs and hidden node outputs) are finite, this requires weights of infinitely high magnitude. Moreover, since the magnitude of the first derivative of the sigmoid function is very small for large net inputs (cf. figure 3.6), the rate of change of the output is very small for large net inputs. For these reasons, it is preferable to use a smaller value (1 − ε) instead of 1, and a larger value ε instead of 0, as the desired output values. In other words, for each class c_k, the desired network output for a pattern x_p belonging to that class is a real vector d_p of length K, called the target vector for that class. When x_p belongs to class c_k, the elements of d_p satisfy

    d_{p,k} = 1 − ε  and  d_{p,l} = ε  for l ≠ k,

where ε is a small positive real number. Typical choices of ε are between 0.01 and 0.1.

Figure 3.6 Graph of exp(−n)/(1 + exp(−n))^2, the derivative of a sigmoid function.
For example, when K = 3 and ε = 0.05, the target vectors for the three classes are (0.95, 0.05, 0.05), (0.05, 0.95, 0.05), and (0.05, 0.05, 0.95). With such target vectors, the error measure ℓ_{p,j} can be redefined so that the network is not penalized for overshooting the target:

1. If d_{p,j} = 1 − ε and o_{p,j} > d_{p,j}, then ℓ_{p,j} = 0.

2. If d_{p,j} = ε and o_{p,j} < d_{p,j}, then ℓ_{p,j} = 0.

3. Otherwise, ℓ_{p,j} = |o_{p,j} − d_{p,j}|, the absolute value of the difference between o_{p,j} and d_{p,j}.

Figure 3.7 gives a pictorial description of these choices of ℓ_{p,j} for a two-class classification problem, using a single output node. The backpropagation algorithm described in figure 3.5 can be adjusted to accommodate this definition of ℓ_{p,j}, to obtain faster convergence and improved results.
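A minimal sketch of this modified error computation, assuming the targets are exactly ε and 1 − ε:

    import numpy as np

    def modified_errors(o_p, d_p, eps=0.05):
        # o_p, d_p: length-K output and target vectors, targets in {eps, 1-eps}
        ell = np.abs(o_p - d_p)
        ell[(d_p == 1 - eps) & (o_p > d_p)] = 0.0   # overshooting the high target
        ell[(d_p == eps) & (o_p < d_p)] = 0.0       # undershooting the low target
        return ell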
78
3 Supervised Learning: Multilayer Networks I
Network output is not in error for class l,Ipj = 0
Error for class 2 input, / w = 10.95-0.751
Network output is not in error forclass2,/ p j = 0
WV\,VN,\I
0
y/SS///
0.05
0.5 Class 1
0.75 Class 2
0.95
1 *•
Figure 3.7 Two-class classification and the choice of£pj.
The above representation scheme implies that eight different output nodes are used for an eight-class problem. Is it not sufficient to use only three output nodes to generate the eight possible output vectors (0,0,0), (0,0,1), (0,1,0), (0,1,1), ..., (1,1,1), where each output vector represents a separate class? This requires training the network so that, in some cases, the outputs of multiple outermost layer nodes are simultaneously high. Such a binary encoding scheme is feasible, but training such a network is much more difficult than training a network with as many output nodes as the number of classes. The inability to obtain appropriate weights is partly due to a phenomenon called "cross-talk": learning is unsuccessful because different training samples require conflicting changes to be made to the same weight. If there is a separate output node for each class, each weight in the outer layer can focus on learning a specific task (for one class) rather than on performing multiple duties. After training is complete, we can use the following procedure to determine the class membership of a pattern x_p.

1. Compute the output vector o_p and assign x_p to the class k whose target vector is nearest to o_p.

3.4 Setting the Parameter Values

3. Suppose ε > 0 is a constant small enough that E(w − εE'(w)) < E(w). The following procedure successively searches for a large enough learning rate 2^i ε at which the system error no longer decreases; we then choose η = 2^{i−1} ε as the largest learning rate (examined) for which the error does decrease.

    i := 0; w_new := w − εE'(w);
    while E(w_new) < E(w) do
        i := i + 1; w := w_new;
        w_new := w − 2^i εE'(w)
    end-while.
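In Python, this doubling search might be sketched as follows; E and E_grad stand for the error function and its gradient, and are assumptions here, not functions defined in the text:

    def find_learning_rate(w, E, E_grad, eps=1e-3):
        # Double the step size until the error stops decreasing, then return
        # eta = 2^{i-1} * eps, the largest examined rate that still decreased it.
        i = 0
        w_new = w - eps * E_grad(w)
        while E(w_new) < E(w):
            i += 1
            w = w_new
            w_new = w - (2 ** i) * eps * E_grad(w)
        return (2 ** (i - 1)) * eps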
4. The second derivative of the error measure provides information regarding the rate at which the first derivative changes. If the second derivative is low in magnitude, it is safe to assume a steady slope, and large steps can be taken in the direction of the negative gradient, with confidence that the error steadily decreases. But if the second derivative has high magnitude for a given choice of w, the first derivative may be changing significantly at w, and the error surface may be uneven in the neighborhood of w; assumptions of a steady slope are then incorrect, and a smaller choice of η may be appropriate. Although there is clear intuitive justification for this method, the main difficulty is that a large amount of computation is required to obtain the second derivative with good precision.
Figure 3.9 Graph of a jagged error surface (error plotted against weight values, with regions where ∂E/∂w ≠ 0 and points where ∂E/∂w = 0); the neighborhood averaging method may produce improved performance on such surfaces.
Furthermore, this method leads to choosing very small steps when the error surface is highly jagged (as in figure 3.9, where the error is plotted against weight values), so that the system is certain to get stuck in the nearest local minimum of the error surface in the neighborhood of the current weight vector.

3.4.4 Momentum

Backpropagation leads the weights in a neural network to a local minimum of the MSE, possibly substantially different from the global minimum that corresponds to the best choice of weights. This problem can be particularly bothersome if the "error surface" (plotting MSE against network weights) is highly uneven or jagged, as shown in figure 3.9, with a large number of local minima. We may prevent the network from getting stuck in some local minimum by making the weight changes depend on the average gradient of the MSE in a small region rather than the precise gradient at a point. Averaging ∂E/∂w in a small neighborhood can allow the network weights to be modified in the general direction of MSE decrease, without getting stuck in some local minima.

Calculating averages can be an expensive task. A shortcut, suggested by Rumelhart, Hinton, and Williams (1986), is to make the weight changes in each iteration of the backpropagation algorithm depend on the weight changes made in the immediately preceding iteration. This has an averaging effect, and diminishes the drastic fluctuations in weight changes over consecutive iterations. The implementation of this method is straightforward, and is accomplished by adding a momentum term to the weight update rule,

    Δw_{k,j}(t + 1) = η δ_k x_j + α Δw_{k,j}(t),    (3.18)

where Δw_{k,j}(t) is the weight change required at time t, and α is an additional parameter. The effective weight change is an average of the change suggested by the current gradient and the weight change used in the preceding step.
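A minimal sketch of this update rule; grad_term stands for the δ_k x_j term computed by backpropagation, and the names are illustrative:

    def momentum_step(grad_term, prev_dw, eta=0.1, alpha=0.9):
        # Equation (3.18): new change = eta * gradient term + alpha * previous change
        return eta * grad_term + alpha * prev_dw

    # usage inside a training loop:
    #   dw = momentum_step(np.outer(delta2, hb), dw)
    #   W2 += dw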
However, the direction of weight change chosen in the early stages of training can strongly bias future weight changes, possibly restricting the training algorithm (with momentum) to explore only one region of the weight space. Use of the momentum term in the weight update equation introduces yet another parameter, α, whose optimal value depends on the application and is not easy to determine a priori. Values for the momentum coefficient α can be obtained adaptively, as in the case of the learning rate parameter η. A well-chosen value of α can significantly reduce the number of iterations needed for convergence. A value close to 0 implies that past history does not have much effect on the weight change, while a value closer to 1 suggests that the current error has little effect on the weight change.

3.4.5 Generalizability

Given a large network, it is possible that repeated training iterations successively improve the performance of the network on training data, e.g., by "memorizing" training samples, but the resulting network may perform poorly on test data. This phenomenon is called overtraining. One solution is to constantly monitor the performance of the network on the test data. Hecht-Nielsen (1990) proposes that the weights should be adjusted only on the basis of the training set, but the error should be monitored on the test set. Training continues as long as the error on the test set continues to decrease, and is terminated if the error on the test set increases. Training may thus be halted even if the network's performance on the training set continues to improve; this is illustrated in figure 3.10 for a typical problem. To eliminate random fluctuations, performance over the test set is monitored over several iterations, not just one. This method does not suggest using the test data for training: weight changes are computed solely on the basis of the network's performance on training data. With this stopping criterion, the final weights do depend on the test data, but in an indirect manner. Since the weights are not obtained from the current test data, it is expected that the network will continue to perform well on future test data. A sketch of this stopping criterion is given after figure 3.10 below.

A network with a large number of nodes (and therefore edges) is capable of memorizing the training set but may not generalize well. For this reason, networks of smaller sizes are preferred over larger networks. Thus, overtraining can be avoided by using networks with a small number of parameters (hidden nodes and weights).

Injecting noise into the training set has been found to be a useful technique for improving the generalization capabilities of feedforward neural networks. This is especially the case when the size of the training set is small. Each training data point (x_1, ..., x_n) is modified to a point (x_1 ± α_1, x_2 ± α_2, ..., x_n ± α_n), where each α_i is a small randomly generated displacement.
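A minimal sketch of such noise injection; the Gaussian displacement scale sigma is an illustrative choice, not prescribed by the text:

    import numpy as np

    def inject_noise(X, sigma=0.01, rng=np.random.default_rng(0)):
        # Perturb each training point by a small random displacement alpha_i.
        return X + rng.normal(0.0, sigma, size=X.shape)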
Figure 3.10 Change in error with training time, on training set and test set: the error on the training data continues to decrease, while training is halted at the instant when the error on the test data begins to worsen.
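The stopping criterion illustrated in figure 3.10 might be sketched as follows; train_one_epoch and test_error are assumed helper functions, and the patience window reflects the suggestion to monitor the test error over several iterations rather than one:

    def train_with_early_stopping(net, train_one_epoch, test_error,
                                  max_epochs=10000, patience=5):
        # Stop when the test-set error has not improved for `patience` epochs;
        # weight updates themselves use only the training data.
        # train_one_epoch is assumed to return the updated network object.
        best_err, best_net, waited = float("inf"), net, 0
        for _ in range(max_epochs):
            net = train_one_epoch(net)
            err = test_error(net)
            if err < best_err:
                best_err, best_net, waited = err, net, 0
            else:
                waited += 1
                if waited >= patience:
                    break
        return best_net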
3.4.6 Number of hidden layers and nodes

Many important issues, such as determining how many training samples are required for successful learning, and how large a neural network is required for a specific task, are solved in practice by trial and error. These issues are complex because there is considerable dependence on the specific problem being attacked using a neural network. With too few nodes, the network may not be powerful enough for a given learning task. With a large number of nodes (and connections), computation is too expensive. Also, a neural network may have the resources essentially to "memorize" the input training samples; such a network tends to perform poorly on new test samples, and is not considered to have accomplished learning successfully. Neural learning is considered successful only if the system can perform well on test data on which it has not been trained. We emphasize the capability of a network to generalize from input training samples, not to memorize them. Adaptive algorithms have been devised that either begin from a large network and successively remove some nodes and links until network performance degrades to an unacceptable level, or begin from a very small network and introduce new nodes and weights until performance is satisfactory; the network is retrained at each intermediate state. Some such algorithms are discussed in the next chapter. Some researchers, e.g.,
Kung (1987) and Siet (1988), have proposed penalties for choosing the number of hidden nodes. Lippmann (1988) has pointed out that for classification tasks solved using a feedforward neural network with d input nodes, first-hidden-layer nodes often function as hyperplanes that effectively partition d-dimensional space into various regions. Each node in the next layer represents a cluster of points that belong to the same class, as illustrated in figure 3.11. It is assumed that all members of a cluster belong to the same class, and samples of each class may be spread over different clusters.

For a one-hidden-layer network, Mehrotra, Mohan, and Ranka (1990) suggest the following approach to estimate the number of hidden nodes required to solve a classification problem in d-dimensional input space. If the problem is characterized by M clusters of points, where each cluster belongs to a separate region in d-dimensional space, then these regions need to be separated by segments of hyperplanes. Assuming each hyperplane corresponds to a hidden node, the number of hidden nodes needed (m) is a function of the number of clusters (M). In the worst case, as many as M − 1 hidden nodes (hyperplanes) may be needed to separate M clusters, e.g., to separate M collinear points in two-dimensional space when adjacent points belong to different classes. In general, the number of hidden nodes needed will be sufficiently large that R(m, d) ≥ M, where R(m, d) is the maximum number of regions into which m hyperplanes can divide d-dimensional space. Note that R(m, 1) = m + 1 and R(1, d) = 2, while R(m, 0) is assumed to be 0. A small search for the smallest such m is sketched below.

For effective use of the above analysis in designing neural networks for a given classification task, certain clustering procedures may first be required as a preprocessing step. Using a first estimate of the number of clusters of samples, we can obtain an estimate of the number of regions to be separated by hyperplane segments.

Figure 3.11 Types of decision regions that can be formed by two-input feedforward neural networks with one and two layers of hidden units that implement step functions: a network with a single node (step function); a one-hidden-layer network that realizes a convex region, each hidden node realizing one of the lines bounding the region; and a two-hidden-layer network that realizes the union of three convex regions, each box representing a one-hidden-layer network realizing one convex region.
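The smallest such m can be found numerically. The sketch below uses the standard closed form R(m, d) = Σ_{i=0}^{d} C(m, i) for m hyperplanes in general position, which is consistent with the special cases noted above but is our addition, not a formula stated in the text:

    from math import comb

    def R(m, d):
        # Maximum number of regions into which m hyperplanes in general
        # position divide d-dimensional space (d >= 1).
        return sum(comb(m, i) for i in range(d + 1))

    def hidden_nodes_needed(M, d):
        # Smallest m such that R(m, d) >= M clusters/regions.
        m = 0
        while R(m, d) < M:
            m += 1
        return m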
3.4.7 Number of samples

How many samples are needed for good training? This is a difficult question whose answer depends on several factors. A rule of thumb, obtained from related statistical problems, is to have at least five to ten times as many training samples as the number of weights to be trained. Baum and Haussler (1989) suggest the following number, on the basis of the desired accuracy on the test set:

    P ≥ |W| / (1 − a),

where P denotes the (desired) number of patterns (i.e., the size of the training set), |W| denotes the number of weights to be trained, and a denotes the expected accuracy on the test set. Thus, if a network contains 27 weights and the desired test set accuracy is 95% (a = 0.95), then their analysis suggests that the size of the training set should be at least P ≥ 27/0.05 = 540.
The above is a necessary condition. A sufficient condition that ensures the desired performance is

    P ≥ (|W| / (1 − a)) log(n / (1 − a)),

where n is the number of nodes.

The minimum number of hyperplane segments needed to separate samples can be used to provide a measure of the number of samples required for a given neural network for a classification task. We consider the case where the number of hidden nodes exceeds the minimum number required for successful classification of a given number of clusters. Let A(m, d) denote the maximum number of hyperplane segments formed by mutual intersections among m hyperplanes in d-dimensional space. Note that A(0, d) = 0 and A(m, 1) = m + 1. Since each hyperplane contains at least one segment, A(m, d) ≥ m. Since no hyperplane can be divided into more than 2^{m−1} segments by the m − 1 other hyperplanes, we must also have A(m, d) ≤ m · 2^{m−1}. For R(m, d) ≥ M, it can be shown that

    min(m, d)/2 ≤ A(m, d)/R(m, d) = m R(m − 1, d − 1)/R(m, d) ≤ min(m, 2d)/2.    (3.21)
The above result can be used to estimate the required number of "boundary samples," samples that are close to the region boundaries and hence provide the maximum information for classification. This number ought to be proportional to the number of hyperplane segments needed to separate the various regions. When all hyperplanes intersect each other, equation 3.21 implies that this number is proportional to min(m, d) · R(m, d). This gives the minimum number of boundary samples required for successful classification as proportional to min(Mm, Md).

3.5 Theoretical Results*

This section discusses some theoretical properties of feedforward neural networks, illustrating their adequacy for solving large classes of problems. We first present a result of Cover (1965) that shows the usefulness of hidden nodes. We then discuss the ability of multilayer feedforward networks to represent and approximate arbitrary functions.

3.5.1 Cover's theorem

Cover (1965) showed that formulating a classification problem in a space of higher dimensionality (than the original problem) can help significantly, since the patterns of different
classes may become linearly separable when transformed to the higher-dimensional space. This result indicates the usefulness of hidden nodes in various neural networks.

In the rest of this section, we assume that f is a function mapping ℜ^n to ℜ^h, i.e., mapping n-dimensional input vectors to h-dimensional vectors. The role of f can be viewed as analogous to the activation of h hidden nodes when an input vector is presented to a neural network with n input nodes. Consider the two-class classification problem based on the training samples T = {i_ℓ : ℓ = 1, ..., P}, where each i_ℓ is an n-dimensional input vector. This problem is said to be "f-separable" if the result of applying f makes the data linearly separable, i.e., if the set {f(i) : i ∈ T} is linearly separable. Thus f-separability implies that there exists an h-dimensional weight vector w such that w · f(i) > 1 if i belongs to one class and w · f(i) < 1 if i belongs to the other class. (Using 1 in the inequalities, instead of the more commonly used 0, allows us to omit the threshold or bias terms.)

Cover's theorem derives the probability that sets of points in T are f-separable, as a function of the size P of the training set. It relies on Schläfli's formula for counting the number of sets of input vectors that are f-separable. The result also depends on another important concept: the degrees of freedom of the function f, denoted by #(f), which represents the dimensionality of the subspace defined by the elements in {f(i_1), ..., f(i_P)}. The relevance of the degrees of freedom of a function is easy to follow in terms of two simple examples. If we define f to map every input vector i ∈ T to the h-dimensional zero vector, the data are hardly likely to be f-separable, since all of the inputs are mapped to the same point; we say that such a function has zero degrees of freedom. On the other hand, if f maps every input i to "general position," then there is zero probability that any non-trivial subset of {f(i_1), ..., f(i_P)} with h elements is linearly dependent. The statement of Cover's theorem is as follows; we assume that the degrees of freedom of f is h.
THEOREM 3.1 Let f be a function mapping ℜ^n to ℜ^h, where h is a positive integer, with h degrees of freedom. Any set of P randomly chosen n-dimensional input vectors is f-separable with probability

    (1/2^{P−1}) Σ_{j=0}^{h−1} C(P−1, j),

where C(P−1, j) denotes the binomial coefficient.

For a fixed P, this probability increases as h increases. This implies that the use of many hidden nodes makes it very likely that a feedforward network with a step function at its output node will succeed in separating the input data into the appropriate classes. Samples are f-separable with probability ≈ 1 when P ≈ h, i.e., when the number of hidden nodes is roughly equal to the number of input samples.
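This probability is easy to tabulate; a minimal sketch (the function name is ours):

    from math import comb

    def f_separability_probability(P, h):
        # Probability that P random patterns are f-separable when f has
        # h degrees of freedom: 2^{1-P} * sum_{j=0}^{h-1} C(P-1, j).
        return sum(comb(P - 1, j) for j in range(h)) / 2 ** (P - 1)

    # e.g., f_separability_probability(2 * h, h) equals 0.5 for any h >= 1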
3.5.2 Representations of functions

A neural network with one hidden layer can represent any binary function. To understand the reason for this, suppose f(x_1, ..., x_n) is a function of n variables such that x_i ∈ {0, 1} for i = 1, ..., n, and the output f(x_1, ..., x_n) ∈ {0, 1}. Such a function has a unique representation in disjunctive normal form.² If the desired output is a vector in {0, 1}^n, then we can imagine a separate neural network module that predicts each component of this vector; hence it suffices to examine the single-output case. The neural network interpretation of the disjunctive normal form representation is that f can be written as a disjunction of conjunctions realized by step-function units, as sketched below, where σ is a step function.
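One standard construction, given here as an illustration rather than as the book's exact formula: if the DNF of f has J terms, where term j contains the set P_j of unnegated variables and the set N_j of negated variables, then

    f(x_1, ..., x_n) = σ( Σ_{j=1}^{J} σ( Σ_{i∈P_j} x_i + Σ_{i∈N_j} (1 − x_i) − |P_j| − |N_j| + 1/2 ) − 1/2 ),

where σ(z) = 1 if z > 0 and σ(z) = 0 otherwise. Each inner σ is computed by one hidden node realizing the jth conjunction (it fires exactly when all literals of the term are satisfied), and the outer σ realizes the disjunction.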
A similar result also holds in general for continuous functions, and is known as Kolmogorov's theorem (1957). Let f be a continuous function such that f(x_1, ..., x_n) ∈ ℜ, with x_i ∈ [0, 1] for i = 1, 2, ..., n. Kolmogorov's theorem tells us that f can be represented in terms of functions of one variable. There are two kinds of functions in terms of which f is expressed: fixed functions (h_{ij}) and other functions (g_j) whose choice depends on f.

THEOREM 3.2 (Kolmogorov, 1957) There exist fixed increasing continuous functions {..., h_{ij}, ...} and properly chosen continuous functions {..., g_j, ...} so that every continuous function f can be written in the form

    f(x_1, ..., x_n) = Σ_{j=1}^{2n+1} g_j ( Σ_{i=1}^{n} h_{ij}(x_i) ),

where h_{ij} : [0, 1] → ℜ for i = 1, ..., n, j = 1, ..., 2n + 1, and g_j : ℜ → ℜ for j = 1, ..., 2n + 1.

2. As described in books such as Mattson (1993), an expression is in disjunctive normal form (DNF) if it is written as a disjunction (OR) of conjunctions (ANDs) of variables (x_i) and complements of variables (x̄_j). Any binary function of binary variables can be written in DNF, e.g., (x_1 + x_2 x̄_3)(x̄_1 x_2 + x̄_3 x_4) can be written in DNF as x_1 x̄_3 x_4 + x̄_1 x_2 x̄_3, where addition represents OR, multiplication represents AND, and the horizontal bar above a variable indicates NOT.
Hecht-Nielsen (1987) suggested an obvious interpretation of this theorem in terms of neural networks: every continuous function, whose argument is a vector of real components, can be computed by a two-hidden-layer neural network. Nodes in the first hidden layer of this network apply the h transformations to each individual input (x_i), nodes at the second layer compute the g transformations, and the output layer sums the outputs of the second-hidden-layer nodes.

There are some difficulties in obtaining practical applications of this very important mathematical result. The theorem is an existence theorem: it tells us that any multivalued continuous function can be expressed as described above. It is non-constructive, since it does not tell us how to obtain the h and g functions needed for a given problem. Results obtained by Vitushkin (1964, 1977) have shown that the g and h functions may be highly non-smooth. On the other hand, as Girosi and Poggio (1989) have pointed out, "in a network implementation that has to be used for learning and generalization, some degree of smoothness is required for the functions corresponding to the units in the network." In addition, as described in earlier sections, the process of solving a problem using a feedforward network implies that smooth node functions are first chosen, the architecture of the network is "guessed," and only the weights remain to be adjusted. Poggio and Girosi (1990) call such approximations "parameterized representations." In view of the above observations, several researchers focus on a different issue: approximation of a continuous multivalued function within prespecified error limits. Pertinent results are discussed in the following subsection.

3.5.3 Approximations of functions
A well-known result in mathematics shows that functions can be approximated to any degree of accuracy using their Fourier approximations, where the degree of approximation is measured in terms of mean squared error. Understanding the mathematical result requires defining a few technical terms. If f(x_1, ..., x_n) is a function and

    ∫_{[0,1]^n} f²(x) dx < ∞,

then the function is "square-integrable." Suppose f_N(x) is another function that approximates f(x). We can measure the difference between f(x) and f_N(x) in terms of

    ∫_{[0,1]^n} |f(x) − f_N(x)|² dx,
which is called the mean squared error (MSE). Note that this definition of MSE is for continuous functions, whereas the definition presented in chapter 1 was for finite sets of training samples.

THEOREM 3.3 (Fourier series approximation) Let f(x) be a square-integrable function, approximated by the partial Fourier sum

    f_N(x) = Σ_{k_1=−N}^{N} ... Σ_{k_n=−N}^{N} c_k exp(i 2π(k_1 x_1 + ... + k_n x_n)),

where k = (k_1, ..., k_n) and the c_k are the Fourier coefficients of f.