39. Neural Networks
This chapter contains the following articles: Art Neural Nets; Boltzmann Machines; Cerebellar Model Arithmetic Computers; Constructive Learning and Structural Learning; Divide-and-Conquer Methods; Feedforward Neural Nets; Neural Architecture in 3-D; Neural Chips; Neural Net Architecture; Neural Nets, Hopfield; Neural Nets, Recurrent; Neural Nets Based on Biology; Neurocontrollers; Optical Neural Nets; Perceptrons; Self-Organizing Feature Maps.
Wiley Encyclopedia of Electrical and Electronics Engineering, J. Webster (ed.)
Art Neural Nets (Standard Article)
Michael Georgiopoulos (University of Central Florida, Orlando, FL), Gregory L. Heileman (University of New Mexico, Albuquerque, NM), Juxin Huang (Hewlett-Packard, Santa Rosa, CA)
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5101
Online Posting Date: December 27, 1999
Abstract: The sections in this article are: Fuzzy ART; Fuzzy ARTMAP; Templates in Fuzzy ART and Fuzzy ARTMAP: A Geometrical Interpretation; Fuzzy ART Example; Fuzzy ARTMAP Example; Applications; Theoretical Results.
ART NEURAL NETS
ART NEURAL NETS When developing a neural network to perform a particular pattern-classification task, one typically proceeds by gathering a set of exemplars, or training patterns, and then using these exemplars to train the network. Once the network has adequately learned the exemplars, the weights of the network are fixed, and the system can be used to classify future "unseen" patterns. This operational scenario is acceptable when the problem domain is "well-behaved," in the sense that it is possible to select a set of training patterns that, once learned, will allow the network to classify future unseen patterns accurately. Unfortunately, in many realistic situations, the problem domain is not well-behaved. Consider a simple example. Suppose a company wishes to train a neural network to recognize the silhouettes of the parts that are required to produce the products in the company's product line. The appropriate images can be collected and used to train a neural network, a task that is typically computationally time consuming depending on the size of the
network required. After the network has learned this training set (according to some criteria), the training period is ended and weights are fixed. Now assume that at some future time another product is introduced, and that the company wishes to add the component parts of this new product to the knowledge presently stored in the network. This would typically require a retraining of the network using all of the previous training patterns, plus the new ones. Training on only the new patterns could result in the network learning these new patterns quite well, but forgetting the previously learned patterns. Although this retraining may not take as long as the initial training, it is still likely to require a significant amount of time. Moreover, if the neural network is presented with a previously unseen pattern that is quite different from all of the training patterns, in most neural network models there is no built-in mechanism for recognizing the novelty of the input. We have been describing what Grossberg calls the stability–plasticity dilemma (1). This dilemma can be restated as a series of questions: How can a learning system remain adaptive (plastic) in response to a significant input, yet remain stable in response to an irrelevant input? How does the system know when to switch between the plastic and the stable modes? How can the system retain previously learned information, while continuing to learn new things? In response to such questions, Grossberg developed the adaptive resonance theory (ART) (1). An important element of ART that is used to resolve the stability–plasticity dilemma is the feedback that occurs from the output layer to the input layer of these architectures. This feedback mechanism allows for the learning of new information without destroying old information, the automatic switching between stable and plastic modes, and stabilization of the encoding of the pattern classes. 
These feedback connections in ART neural network architectures will be clearly illustrated later when these architectures are described in more detail. Adaptive resonance theory gets its name from the particular way in which learning and recall interplay in these networks. In physics, resonance occurs when a small-amplitude vibration of the proper frequency causes a large-amplitude vibration in an electrical or mechanical system. In an ART network, information in the form of processing-element outputs reverberates back and forth between layers. If the proper patterns develop, a stable oscillation ensues, which is the neural network equivalent of resonance. During this resonant period, learning—or adaptation—can occur. Before the network has achieved a resonant state, no learning takes place, because the time required for changes in the weights is much longer than the time that it takes for the network to achieve resonance. In ART networks, a resonant state can be attained in one of two ways. If the network has previously learned to recognize an input pattern, then the resonant state will be achieved quickly when the input pattern is presented. During resonance, the adaptation process will reinforce the memory of the stored pattern. If the input pattern is not immediately recognized, the network will rapidly search through its stored patterns looking for a match. If no match is found, the network will enter a resonant state, whereupon a new pattern will be stored for the first time. Thus the network responds quickly to previously learned data, yet remains able to learn when novel data are presented.
Adaptive resonance theory was introduced by Grossberg in 1976 as a means of describing how recognition categories are self-organized in neural networks (1). Since this time, a number of specific neural network architectures based on ART have been proposed. Many of these architectures originated from Carpenter, Grossberg, and their colleagues at Boston University. The first ART neural network architecture, named ART1, appeared in the literature in 1987 (2). This model is an unsupervised neural network capable of self-organizing (clustering) arbitrary collections of binary input patterns. Later in 1987 the ART2 neural network architecture was introduced. This architecture is capable of clustering arbitrary collections of real-valued input patterns (3). The ART2 network was made obsolete in 1991, when the simpler Fuzzy ART architecture was proposed (4). Like ART2, Fuzzy ART is able to cluster real-valued input patterns. In addition, for binary-valued inputs, the operation of Fuzzy ART reduces to that of ART1. The ART1, ART2, and Fuzzy ART architectures all perform unsupervised learning. In unsupervised learning (also called self-organization), training patterns of unknown classification are used, and there is no external teaching procedure. An internal teaching function determines how network parameters are adapted based on the nature of the input patterns. In this case, the teaching procedure results in the internal categorization of training patterns according to some measure of similarity among the patterns. That is, similar training patterns are grouped together during the training of the network. These groups (or clusters) are then considered to be the pattern classes into which unknown input patterns are later classified. Supervised learning, on the other hand, requires a set of training patterns of known classification and an external teaching procedure. 
The teaching procedure is used to adapt network weights according to the network’s response to the training patterns. Normally, this adjustment is in proportion to the amount of error present while attempting to classify the current input pattern. The use of supervised learning can logically be separated into two phases—a training phase and a performance phase. In the training phase, a training set is formed from representative samples taken from the environment in which the neural network is expected to operate. This training set should include sample patterns from all the pattern classes being categorized. Next, the training patterns are applied to the network inputs and the external teacher modifies the system through the use of a training algorithm. Once acceptable results have been obtained from the learning phase, the network may be used in the performance phase. In the performance phase, an unknown pattern is drawn from the environment in which the network operates and applied to the network inputs. At this point, the neural network is expected to perform the recognition task for which it has been trained. If the neural network is able to correctly classify with a high probability input patterns that do not belong to the training set, then it is said to generalize. Generalization is one of the most significant concerns when using neural networks to perform pattern classification. A number of ART architectures have been introduced by the Boston University group of researchers for performing supervised learning. These include ARTMAP (5), in which the input patterns must be binary, and Fuzzy ARTMAP (6), ARTEMAP (7), Gaussian ARTMAP (8), and ARTMAP-IC (9),
where the input patterns can be real valued. The primary purpose of the last three contributions to the supervised-ART family is to improve the generalization performance of Fuzzy ARTMAP. In conjunction with the vigorous activity of researchers at Boston University in developing ART architectures, other researchers in the field independently developed, analyzed, and applied ART architectures or ART-like architectures to a variety of problems. A short, and obviously not exhaustive, list of such efforts includes the adaptive fuzzy leader clustering (AFLC) (10), LAPART (11), the integrated adaptive fuzzy clustering (IAFC) (12), the Fuzzy Min-Max (13,14), and the Adaptive Hamming Net (15). In the original ART1 paper (2), a significant portion of the paper is devoted to the analysis of ART1 and its learning properties. Other noteworthy contributions to the analysis and understanding of the learning properties in ART1 can be found in Refs. 16–19. The analysis of Fuzzy ART was initially undertaken in Ref. 4; additional results can be found in Refs. 20 and 21. Properties of learning in the ARTMAP architecture are discussed in Refs. 22 and 23, while properties of learning in the Fuzzy ARTMAP architecture are considered in Ref. 21. From the discussion above, it is evident that the most fundamental ART architectures are Fuzzy ART and Fuzzy ARTMAP (since the binary versions, ART1 and ARTMAP, respectively, can be considered special cases). Hence the next four sections of this chapter are devoted to the description of these fundamental ART architectures. We start with Fuzzy ART, because it is the building block for the creation of the Fuzzy ARTMAP architecture. In particular, we discuss in detail the Fuzzy ART architecture, the operation of the Fuzzy ART architecture, and the operating phases (training and performance) of the Fuzzy ART architecture.
Next, we discuss the Fuzzy ARTMAP architecture, the operation of the Fuzzy ARTMAP architecture, and the operating phases (training and performance) of the Fuzzy ARTMAP architecture. Later, we present a geometrical interpretation of how Fuzzy ART and Fuzzy ARTMAP operate. This gives a clearer (pictorial) explanation of how these two architectures function. Furthermore, we illustrate with simple examples the training phases of the Fuzzy ART and Fuzzy ARTMAP architectures. A number of applications that make use of ART neural network architectures are considered. Finally, properties of learning in ART1, Fuzzy ART, and ARTMAP are discussed.

FUZZY ART

A brief overview of the Fuzzy ART architecture is provided in the following sections. For a more detailed discussion of this architecture, the reader should consult Ref. 4.

Fuzzy ART Architecture

The Fuzzy ART neural network architecture is shown in Fig. 1. It consists of two subsystems, the attentional subsystem and the orienting subsystem. The attentional subsystem consists of two fields of nodes denoted F1a and F2a. The F1a field is called the input field because input patterns are applied to it. The F2a field is called the category or class representation field because it is the field where category representations are formed. These categories represent the clusters to which the input patterns belong. The orienting subsystem consists of a
[Figure 1. Block diagram of the ART1 or Fuzzy ART architecture. The ARTa module contains the attentional subsystem, with the preprocessing field F0a (where I = (a, a^c) is formed from the input a), the field F1a, and the field F2a, connected by top-down weights w_j^a and bottom-up weights W_j^a, and the orienting subsystem, with its reset node governed by the vigilance parameter ρ_a.]
single node (called the reset node), which accepts inputs from the F1a field, the F2a field (this input is not shown in Fig. 1), and the input pattern applied across the F1a field. The output of the reset node affects the nodes in the F2a field. Some preprocessing of the input patterns of the pattern clustering task takes place before they are presented to Fuzzy ART. The first preprocessing stage takes as an input an Ma-dimensional input pattern from the pattern clustering task and transforms it into an output vector a = (a_1, ..., a_Ma), whose every component lies in the interval [0, 1] (i.e., 0 ≤ a_i ≤ 1 for 1 ≤ i ≤ Ma). The second preprocessing stage accepts as an input the output a of the first preprocessing stage and produces an output vector I, such that

I = (a, a^c) = (a_1, ..., a_Ma, a_1^c, ..., a_Ma^c)    (1)

where

a_i^c = 1 − a_i,  1 ≤ i ≤ Ma    (2)
The above transformation is called complement coding. The complement coding operation is performed in Fuzzy ART at a preprocessor field designated by F0a (see Fig. 1). We will refer to the vector I formed in this fashion as the input pattern. We denote a node in the F1a field by the index i (i ∈ {1, 2, ..., 2Ma}), and a node in the F2a field by the index j (j ∈ {1, 2, ..., Na}). Every node i in the F1a field is connected via a bottom-up weight to every node j in the F2a field; this weight is denoted W_ij^a. Also, every node j in the F2a field is connected via a top-down weight to every node i in the F1a field; this weight is denoted w_ji^a. The vector whose components are equal to the top-down weights emanating from node j in the F2a field is designated w_j^a and is referred to as a template. Note that w_j^a = (w_j1^a, w_j2^a, ..., w_j,2Ma^a) for j = 1, ..., Na. The vector of bottom-up weights converging to a node j in the F2a field is designated W_j^a. Note that in Fuzzy ART the bottom-up and top-down weights corresponding to a node j in F2a are equal. Hence, in the forthcoming discussion, we will primarily refer to the top-down weights of the Fuzzy ART architecture. Initially, the top-down weights of Fuzzy ART are chosen to be equal to the "all-ones" vector. The initial top-down weight choices in Fuzzy ART are the values of these weights prior to presentation of any input pattern.
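As a concrete illustration of the two preprocessing stages (a minimal sketch of our own, not code from the article; the function names are hypothetical):

```python
import numpy as np

def rescale(raw, lo, hi):
    """First preprocessing stage: map an Ma-dimensional raw pattern
    into [0, 1]^Ma, given known per-component bounds lo and hi."""
    raw = np.asarray(raw, dtype=float)
    return (raw - lo) / (hi - lo)

def complement_code(a):
    """Second preprocessing stage (complement coding, Eqs. 1-2):
    I = (a, a^c) with a_i^c = 1 - a_i.  The size |I| (sum of
    components) of every complement-coded pattern equals Ma."""
    a = np.asarray(a, dtype=float)
    return np.concatenate([a, 1.0 - a])
```

For a = (0.2, 0.7) this yields I = (0.2, 0.7, 0.8, 0.3), whose size is Ma = 2; complement coding fixes the size of every input pattern, which is what keeps templates from eroding arbitrarily during learning.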
Before proceeding, it is important to introduce the notations w_j^(a,o) and w_j^(a,n). Quite often, templates in Fuzzy ART are discussed with respect to an input pattern I presented at the F1a field. The notation w_j^(a,o) denotes the template of node j in the F2a field of Fuzzy ART prior to the presentation of I. The notation w_j^(a,n) denotes the template of node j in F2a after the presentation of I. Similarly, any other quantities defined with superscripts (a, o) or (a, n) will indicate values of these quantities prior to or after a pattern presentation to Fuzzy ART, respectively.

Operation of Fuzzy ART

As mentioned previously, we will use I to indicate an input pattern applied at F1a, and w_j^a to indicate the template of node j in F2a. In addition, we will use |I| and |w_j^a| to denote the size of I and w_j^a, respectively. The size of a vector in Fuzzy ART is defined to be the sum of its components. We define I ∧ w_j^a to be the vector whose ith component is the minimum of the ith I component and the ith w_j^a component. The ∧ operation is called the fuzzy-min operation, while a related operation designated by ∨ is called the fuzzy-max operation. These operations are shown in Fig. 2 for two two-dimensional vectors, denoted by x and y.

[Figure 2. Illustration of the fuzzy-min (∧) and the fuzzy-max (∨) operations in the two-dimensional space.]

Let us assume that an input pattern I is presented at the F1a field of Fuzzy ART. The appearance of pattern I across the F1a field produces bottom-up inputs that affect the nodes in the F2a field. These bottom-up inputs are given by the equation

T_j(I) = |I ∧ w_j^(a,o)| / (α_a + |w_j^(a,o)|)    (3)

where α_a, which takes values in the interval (0, ∞), is called the choice parameter. It is worth mentioning that if in the above equation w_j^(a,o) is equal to the "all-ones" vector, then this node is referred to as an uncommitted node; otherwise, it is referred to as a committed node. The bottom-up inputs activate a competition process among the F2a nodes, which eventually leads to the activation of a single node in F2a, namely, the node that receives the maximum bottom-up input from F1a. Let us assume that node jm in F2a has been activated through this process. The activation of node jm in F2a indicates that this node is considered as a potential candidate by Fuzzy ART to represent the input pattern I. The appropriateness of this node is checked by examining the ratio

|I ∧ w_jm^(a,o)| / |I|    (4)

If this ratio is smaller than the vigilance parameter ρ_a, then node jm is deemed inappropriate to represent the input pattern I, and as a result it is reset (deactivated). The parameter ρ_a is set to a prespecified value in the interval [0, 1]. The deactivation process is carried out by the orienting subsystem and, in particular, by the reset node. If a reset happens, another node in F2a (different from node jm) is chosen to represent the input pattern I; the deactivation of a node (or nodes) lasts for the entire input pattern presentation. The above process continues until an appropriate node in F2a is found, or until all the nodes in F2a have been considered. If a node in F2a is found appropriate to represent the input pattern I, then learning ensues according to the following rules. Assuming that node jm has been chosen to represent I, the corresponding top-down weight vector w_jm^(a,o) becomes equal to w_jm^(a,n), where

w_jm^(a,n) = I ∧ w_jm^(a,o)    (5)
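The category choice of Eq. (3), the vigilance test of Eq. (4), and the fast-learning update of Eq. (5) can be sketched as follows (our illustrative NumPy sketch; the function names are ours, not from the article):

```python
import numpy as np

def choice(I, w, alpha=0.01):
    """Bottom-up input of Eq. (3): T_j(I) = |I ^ w_j| / (alpha + |w_j|).
    An uncommitted node carries the all-ones template."""
    return np.minimum(I, w).sum() / (alpha + w.sum())

def passes_vigilance(I, w, rho):
    """Vigilance criterion of Eq. (4): |I ^ w_j| / |I| >= rho."""
    return np.minimum(I, w).sum() / I.sum() >= rho

def fast_learn(I, w):
    """Fast-learning update of Eq. (5): the template shrinks toward
    the fuzzy-min of itself and the input."""
    return np.minimum(I, w)
```

Note that an uncommitted (all-ones) template always passes the vigilance test, since |I ∧ 1| = |I|; this is why the search across F2a is guaranteed to terminate.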
It is worth mentioning that in Eq. (5) we might have w_jm^(a,n) = w_jm^(a,o); in this case we say that no learning occurs for the weights of node jm. Also note that Eq. (5) is actually a special case of the learning equations of Fuzzy ART that is referred to as fast learning (4). In this chapter we only consider the fast-learning case. We say that node jm has coded input pattern I if during I's presentation at F1a, node jm in F2a is chosen to represent I, and the jm top-down weights are modified as Eq. (5) prescribes. Note that the weights converging to or emanating from an F2a node other than jm (the chosen node) remain unchanged during I's presentation.

Operating Phases of Fuzzy ART

Fuzzy ART may operate in two different phases: the training phase and the performance phase. The training phase is as follows: Given a collection of input patterns I_1, I_2, ..., I_P (i.e., the training list), we want Fuzzy ART to cluster these input patterns into different categories. Obviously, we expect patterns that are similar to each other to be clustered in the same category. In order to achieve this goal, one must present the training list repeatedly to the Fuzzy ART architecture. We present I_1, then I_2, and eventually I_P; this corresponds to one list presentation. We present the training list as many times as is necessary for Fuzzy ART to cluster the input patterns. The clustering task is considered accomplished (i.e., learning is complete) if the weights in the Fuzzy ART architecture do not change during a list presentation. The aforementioned training scenario is called off-line training, and its step-by-step implementation is as follows:

Off-Line Training Phase of Fuzzy ART

1. Choose the Fuzzy ART network parameters (i.e., α_a, Ma, ρ_a) and the initial weights (i.e., w_j^a).
2. Choose the pth input pattern from the training list.
3. Calculate the bottom-up inputs at the F2a field of the ARTa module due to the presentation of the pth input pattern. These bottom-up inputs are calculated according to Eq. (3). The bottom-up inputs that are actually required include those for all the committed nodes in F2a and the uncommitted node of the lowest index.
4. Choose the node in F2a that is not disqualified and receives the maximum bottom-up input from F1a. Assume that this node is the node with index jm. Check whether this node satisfies the vigilance criterion in ARTa [Eq. (4)].
   a. If node jm satisfies the vigilance criterion, modify the top-down weights emanating from node jm according to learning equation (5). If this is the last pattern in the training list, go to Step 5. Otherwise, go to Step 2 to present the next input pattern in sequence.
   b. If node jm does not satisfy the vigilance criterion, disqualify this node and go to the beginning of Step 4.
5. After all patterns have been presented once:
   a. If in the previous list presentation at least one component of the top-down weight vectors changed, go to Step 2 and present the first input pattern in sequence.
   b. If in the previous list presentation no weight changes occurred, the learning process is complete.

In the performance phase of Fuzzy ART the learning process is disengaged and patterns from a test list are presented in order to evaluate the clustering performance of Fuzzy ART. Specifically, an input pattern from the test list is presented to Fuzzy ART. Through the Fuzzy ART operating rules, discussed previously, a node jm is chosen in F2a that is found appropriate to represent the input pattern. Assuming that some criteria exist for determining how well node jm represents the cluster to which the input pattern presented to Fuzzy ART belongs, we can apply this process to all the input patterns from the test list to determine how well Fuzzy ART clusters them. Of course, our results are heavily dependent on the criteria used to judge the clustering performance of Fuzzy ART. In the following we propose a procedure to judge this performance.
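Before turning to that evaluation procedure, the off-line training steps above can be condensed into a short sketch (ours, assuming fast learning and complement-coded inputs; not a reference implementation):

```python
import numpy as np

def train_fuzzy_art(patterns, rho=0.5, alpha=0.01, max_epochs=100):
    """Off-line (fast-learning) training sketch for Fuzzy ART.

    `patterns` are already complement-coded.  Returns the committed
    templates; training stops once a full list presentation causes no
    weight change, as in Step 5 above."""
    dim = len(patterns[0])
    templates = []                                    # committed templates only
    for _ in range(max_epochs):
        changed = False
        for I in patterns:
            # Committed nodes plus one uncommitted (all-ones) node, per Step 3.
            candidates = templates + [np.ones(dim)]
            disqualified = set()
            while True:
                T = [np.minimum(I, w).sum() / (alpha + w.sum())
                     if j not in disqualified else -1.0
                     for j, w in enumerate(candidates)]
                jm = int(np.argmax(T))                # Step 4: winner-take-all
                w = candidates[jm]
                if np.minimum(I, w).sum() / I.sum() >= rho:   # vigilance check
                    new_w = np.minimum(I, w)                  # Eq. (5)
                    if jm == len(templates):                  # uncommitted codes I
                        templates.append(new_w)
                        changed = True
                    elif not np.array_equal(new_w, templates[jm]):
                        templates[jm] = new_w
                        changed = True
                    break
                disqualified.add(jm)                  # Step 4b: reset node jm
        if not changed:                               # Step 5b: stable list pass
            break
    return templates
```

The uncommitted node always passes the vigilance test, so each pattern presentation terminates, and with fast learning the template set stabilizes after finitely many list presentations.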
First, train Fuzzy ART with a list of training patterns until the learning process is complete. The assumption made here is that the list of training patterns is labeled; that is, the label (category) of each input pattern in the list is known. After training, assign a label to each committed node formed in the F2a field of Fuzzy ART. A committed node formed in F2a is labeled by the output pattern to which most of the input patterns that are represented by this node are mapped. The clustering performance of Fuzzy ART is evaluated by presenting to it, one more time, the input patterns from the training list. For each input pattern from the training list, Fuzzy ART chooses a node in F2a. If the label of this node is the output pattern to which this pattern corresponds, then we say that Fuzzy ART clustered this input pattern correctly. If, on the other hand, the label of this node is different from the output pattern to which this input pattern corresponds, then we say that Fuzzy ART made an erroneous clustering. The aforementioned procedure for evaluating clustering performance is suggested in Ref. 24.

ART1

The ART1 architecture, operation, and operating phases are identical to those of Fuzzy ART, the only difference being
that, in ART1, the input patterns are not complement coded. Hence, in ART1, the preprocessing field F0a of Fig. 1 is not needed.

FUZZY ARTMAP

A brief overview of the Fuzzy ARTMAP architecture is provided in the following sections. For a more detailed discussion of this architecture, the reader should consult Ref. 6.

Fuzzy ARTMAP Architecture

A block diagram of the Fuzzy ARTMAP architecture is provided in Fig. 3. Note that two of the three modules in Fuzzy ARTMAP are Fuzzy ART architectures. These modules are designated ARTa and ARTb in Fig. 3. The ARTa module accepts as inputs the input patterns, while the ARTb module accepts as inputs the output patterns of the pattern classification task. All the previous details are valid for the ARTa module without change. These details are also valid for the ARTb module, where the superscript a is replaced with the superscript b. One of the differences between the ARTa and the ARTb modules in Fuzzy ARTMAP is that for pattern classification tasks (many-to-one maps) it is not necessary to apply complement coding to the output patterns presented to the ARTb module. As illustrated in Fig. 3, Fuzzy ARTMAP contains a module that is designated the inter-ART module. The purpose of this module is to make sure the appropriate mapping is established between the input patterns presented to ARTa and the output patterns presented to ARTb. There are connections (weights) between every node in the F2a field of ARTa and all nodes in the Fab field of the inter-ART module. The weight vector with components emanating from node j in F2a and converging to the nodes of Fab is denoted w_j^(ab) = (w_j1^(ab), ..., w_jk^(ab), ..., w_jNb^(ab)), where Nb is the number of nodes in Fab (the number of nodes in Fab is equal to the number of nodes in F2b). There are also fixed bidirectional connections between a node k in Fab and its corresponding node k in F2b.
Operation of Fuzzy ARTMAP

The operation of the Fuzzy ART modules in Fuzzy ARTMAP is slightly different from the operation of Fuzzy ART described previously. For one thing, resets in the ARTa module of Fuzzy ARTMAP can have one of two causes: (1) the category chosen in F2a does not match the input pattern presented at F1a, or (2) the appropriate map has not been established between an input pattern presented at ARTa and its corresponding output pattern presented at ARTb. This latter type of reset, which Fuzzy ART does not have, is enforced by the inter-ART module via its connections with the orienting subsystem in ARTa (see Fig. 3). This reset is accomplished by forcing the ARTa architecture to increase its vigilance parameter value above the level that is necessary to cause a reset of the activated node in the F2a field. Hence, in the ARTa module of Fuzzy ARTMAP, we identify two vigilance parameter values: a baseline vigilance parameter value ρ̄_a, which is the vigilance parameter of ARTa prior to the presentation of an input/output pair to Fuzzy ARTMAP, and a vigilance parameter ρ_a, which corresponds to the vigilance parameter that is established in ARTa via appropriate resets enforced by the
inter-ART module. Also, the node activated in F2b due to a presentation of an output pattern at F1b can either be the node receiving the maximum bottom-up input from F1b or the node designated by the Fab field in the inter-ART module. The latter type of activation is enforced by the connections between the Fab field and the F2b field. Equations (1)–(5) for the Fuzzy ART module are valid for the ARTa and ARTb modules in Fuzzy ARTMAP. In particular, the bottom-up inputs to the F2a field and the F2b field are given by

T_j^a(I) = |I ∧ w_j^(a,o)| / (α_a + |w_j^(a,o)|)    (6)

and

T_k^b(O) = |O ∧ w_k^(b,o)| / (α_b + |w_k^(b,o)|)    (7)

where in Eq. (7), O stands for the output pattern associated with the input pattern I, while the rest of the ARTb quantities are defined as they were defined for the ARTa module. Similarly, the vigilance ratios for ARTa and ARTb are computed as follows:

|I ∧ w_J^(a,o)| / |I|    (8)

and

|O ∧ w_K^(b,o)| / |O|    (9)

The equations that describe the modifications of the weight vectors w_j^(ab) can be explained as follows. A weight vector emanating from a node in F2a to all the nodes in Fab is initially the "all-ones" vector and, after training that involves this F2a node, all of its connections to Fab, except one, are reduced to the value of zero.

[Figure 3. Block diagram of the ARTMAP or Fuzzy ARTMAP architecture. The ARTa module (fields F1a and F2a, weights w_j^a and W_j^a, vigilance ρ_a, input I) and the ARTb module (fields F1b and F2b, weights w_j^b and W_j^b, vigilance ρ_b, output O) are linked through the inter-ART module (field Fab, weights w_j^(ab)); match tracking connects the inter-ART module to the reset mechanism of ARTa.]

Operating Phases of Fuzzy ARTMAP

The operating phases of Fuzzy ARTMAP are the same as the operating phases of Fuzzy ART, the only difference being that
in the training phases of Fuzzy ARTMAP, input patterns are presented along with corresponding output patterns. As is the case with Fuzzy ART, Fuzzy ARTMAP may operate in two different phases: training and performance. Here we focus on classification tasks, where many inputs are mapped to a single, distinct output. It turns out that for classification tasks, the operations performed at the ARTb and inter-ART modules can be ignored, and the algorithm can be described by simply referring to the top-down weights of the ARTa module. The training phase of Fuzzy ARTMAP works as follows. Given the training list {I_1, O_1}, {I_2, O_2}, ..., {I_P, O_P}, we want Fuzzy ARTMAP to map every input pattern of the training list to its corresponding output pattern. In order to achieve the aforementioned goal, present the training list repeatedly to the Fuzzy ARTMAP architecture. That is, present I_1 to ARTa and O_1 to ARTb, then I_2 to ARTa and O_2 to ARTb, and eventually I_P to ARTa and O_P to ARTb; this corresponds to one list presentation. Present the training list as many times as is necessary for Fuzzy ARTMAP to classify the input patterns. The classification (mapping) task is considered accomplished (i.e., the learning is complete) when the weights do not change during a list presentation. The aforementioned training scenario is called off-line training, and its step-by-step implementation is as follows:

Off-Line Training Phase of Fuzzy ARTMAP

1. Choose the Fuzzy ARTMAP network parameters (i.e., Ma, α_a, ρ̄_a) and the initial weights (i.e., w_j^a).
2. Choose the pth input/output pair from the training list. Set the vigilance parameter ρ_a equal to the baseline vigilance parameter ρ̄_a.
3. Calculate the bottom-up inputs at the F2a field of the ARTa module due to the presentation of the pth input pattern. These bottom-up inputs are calculated according to Eq. (6). When calculating bottom-up inputs at F2a, consider all committed nodes in F2a and the uncommitted node with the lowest index.
4. Choose the node in F2a that is not disqualified and receives the maximum bottom-up input from F1a. Assume that this node has index jm. Check whether this node satisfies the vigilance criterion in ARTa [see Eq. (8)].
ART NEURAL NETS
a. If node jm satisfies the vigilance criterion, go to Step 5.
b. If node jm does not satisfy the vigilance criterion, disqualify this node and go to the beginning of Step 4.
5. Now consider three cases:
a. If node jm is an uncommitted node, designate the mapping of node jm to be the output pattern Op. Note that Op is the output pattern corresponding to the input pattern Ip presented at F1a. Also, the top-down weights corresponding to node jm are modified according to Eq. (5). If this is the last input/output pair in the training list, go to Step 6; otherwise, go to Step 2 to present the next input/output pair in the sequence.
b. If node jm is a committed node and, due to prior learning, node jm is mapped to an output pattern equal to Op, then the correct mapping is achieved, and the top-down weights corresponding to node jm are modified according to Eq. (5). If this is the last input/output pair in the training list, go to Step 6; otherwise, go to Step 2 to present the next input/output pair in the sequence.
c. If node jm is a committed node and, due to prior learning, node jm is mapped to an output pattern different from Op, then the mapping is incorrect, and we disqualify the activated node jm by increasing the vigilance parameter in ARTa to a level sufficient to disqualify node jm. In particular, the vigilance parameter ρa in ARTa becomes

ρa = |I ∧ w_{jm}^{a,o}| / |I| + ε   (10)

where ε is a very small positive quantity. Go to Step 4.
6. After all patterns have been presented once, consider two cases:
a. In the previous list presentation, at least one component of the top-down weight vectors changed. In this case, go to Step 2 and present the first input/output pair in the sequence.
b. In the previous list presentation, no weight changes occurred. In this case, the learning process is complete.

In the performance phase of Fuzzy ARTMAP the learning process is disengaged, and input/output patterns from a test list are presented in order to evaluate its classification performance.
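The off-line training steps above can be condensed into a short program. The sketch below is a simplified, ARTa-only rendition for the classification case described in the text: it assumes complement-coded inputs, fast learning, and the standard Fuzzy ART choice and match functions Tj = |I ∧ wj|/(αa + |wj|) and |I ∧ wj|/|I| for Eqs. (6) and (8). The function names are our own, not part of any standard implementation.

```python
import numpy as np

def fuzzy_and(x, y):
    return np.minimum(x, y)          # fuzzy AND: componentwise minimum

def train_fuzzy_artmap(inputs, labels, rho_bar=0.8, alpha=0.01, max_epochs=50):
    """Off-line training, ARTa-only view: classification with fast learning."""
    templates, template_labels = [], []          # committed w_j^a and their classes
    for _ in range(max_epochs):                  # repeated list presentations
        changed = False
        for I, label in zip(inputs, labels):
            rho = rho_bar                        # Step 2: vigilance at baseline
            disqualified = set()
            while True:
                ws = templates + [np.ones_like(I)]   # committed + one uncommitted
                T = [fuzzy_and(I, w).sum() / (alpha + w.sum()) for w in ws]
                # Steps 3-4: among nodes not disqualified that pass vigilance,
                # the node with the largest bottom-up input wins the search
                j = max((k for k in range(len(ws)) if k not in disqualified
                         and fuzzy_and(I, ws[k]).sum() / I.sum() >= rho),
                        key=lambda k: T[k])
                if j == len(templates):              # Step 5a: uncommitted node
                    templates.append(I.copy())
                    template_labels.append(label)
                    changed = True
                elif template_labels[j] == label:    # Step 5b: correct mapping
                    new_w = fuzzy_and(templates[j], I)
                    changed = changed or not np.array_equal(new_w, templates[j])
                    templates[j] = new_w
                else:                                # Step 5c: match tracking
                    rho = fuzzy_and(I, templates[j]).sum() / I.sum() + 1e-9
                    disqualified.add(j)
                    continue
                break
        if not changed:                              # Step 6b: learning complete
            break
    return templates, template_labels

def classify(I, templates, template_labels, alpha=0.01):
    """Performance phase: learning disengaged; best committed node predicts."""
    T = [fuzzy_and(I, w).sum() / (alpha + w.sum()) for w in templates]
    return template_labels[int(np.argmax(T))]
```

On the six-pattern example worked out later in this article, this sketch commits four nodes and reproduces the node-by-node search behavior traced there.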
In particular, during the performance evaluation of Fuzzy ARTMAP, only the input patterns of the test list are presented to the ARTa module. Every input pattern from the test list chooses a node in the F2a field. If the output pattern to which the activated node in F2a is mapped matches the output pattern to which the presented pattern should be mapped, then Fuzzy ARTMAP has classified the test input pattern correctly; otherwise Fuzzy ARTMAP has committed a classification error.

ARTMAP

The ARTMAP architecture, operation, and operating phases are identical to those of Fuzzy ARTMAP. The only difference
is that the input and output patterns in ARTMAP must be binary vectors.

TEMPLATES IN FUZZY ART AND FUZZY ARTMAP: A GEOMETRICAL INTERPRETATION

We previously referred to the top-down weights emanating from a node in the F2a field as a template. A template corresponding to a committed node is called a committed template, while a template corresponding to an uncommitted node is called an uncommitted template. As we have already mentioned, an uncommitted template has all of its components equal to one. In the original Fuzzy ART paper (4), it is demonstrated that a committed template wja, which has coded input patterns I1 = (a(1), ac(1)), I2 = (a(2), ac(2)), . . ., IP = (a(P), ac(P)), can be written as
w_j^a = I^1 ∧ I^2 ∧ ··· ∧ I^P = (∧_{i=1}^P a(i), ∧_{i=1}^P a^c(i)) = (∧_{i=1}^P a(i), {∨_{i=1}^P a(i)}^c)   (11)

or

w_j^a = (u_j^a, {v_j^a}^c)   (12)

where

u_j^a = ∧_{i=1}^P a(i)   (13)

and

v_j^a = ∨_{i=1}^P a(i)   (14)
Based on the aforementioned expression for wja, we can now state that the weight vector wja can be expressed in terms of the two Ma-dimensional vectors uja and vja. Hence the weight vector wja can be represented, geometrically, by two points in the Ma-dimensional space, uja and vja. Equivalently, wja can be represented, geometrically, by a hyperrectangle Rja with endpoints uja and vja (see Fig. 4 for an illustration with Ma = 2). For simplicity, we refer to hyperrectangles as rectangles because most of our illustrations are in two-dimensional space.
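Equations (12)-(14) translate directly into code. The helper below is our own illustrative sketch; it assumes the fuzzy AND/OR operators are realized as componentwise min/max, as in Fuzzy ART.

```python
import numpy as np

def template_rectangle(coded_patterns):
    """Endpoints u_j, v_j (Eqs. 13-14), template w_j = (u_j, {v_j}^c)
    (Eq. 12), and rectangle size |v_j - u_j| for the patterns a(1), ..., a(P)
    coded by a node; fuzzy AND/OR realized as componentwise min/max."""
    A = np.asarray(coded_patterns, dtype=float)
    u = A.min(axis=0)                  # u_j = AND of the coded patterns
    v = A.max(axis=0)                  # v_j = OR of the coded patterns
    w = np.concatenate([u, 1.0 - v])   # w_j = (u_j, {v_j}^c)
    size = (v - u).sum()               # city-block norm of v_j - u_j
    return u, v, w, size
```

For example, a node that has coded a(1) = (0.2, 0.2) and a(2) = (0.35, 0.35) carries the template (0.2, 0.2, 0.65, 0.65), matching the value of w1 in the Fuzzy ART example below.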
Figure 4. Representation of the template wja = (uja, {vja}c) in terms of the rectangle Rja with endpoints uja and vja (in the figure Ma = 2).
Obviously, the aforementioned representation implies that we can geometrically represent an input pattern I = (a, ac) by a rectangle with endpoints a and a. In other words, I can be represented by a rectangle of size 0, which is the single point a in the Ma-dimensional space. Note that the size of a rectangle Rja with endpoints uja and vja is taken to be equal to the norm of the vector vja − uja, where the norm of a vector in Fuzzy ART or Fuzzy ARTMAP is defined to be the sum of the absolute values of its components. In summary, we will treat wja = (uja, {vja}c) as a rectangle Rja with endpoints uja and vja in the Ma-dimensional space, and I = (a, ac) as the point a in the Ma-dimensional space.

The reason why the rectangle representation of a template wja is so useful is explained below. Consider the template wja,o and its geometrical representative, the rectangle Rja,o with endpoints uja,o and vja,o. Assume that uja,o = ∧_{i=1}^P a(i) and vja,o = ∨_{i=1}^P a(i). Let us now present pattern Î = (â, âc) to Fuzzy ART. Recall that quantities defined with a superscript {a, o} indicate values of these quantities prior to the presentation of Î to Fuzzy ART. Suppose that, during Î's presentation to Fuzzy ART, node j in the F2a field is chosen, and that node j with corresponding weight vector wja,o is appropriate to represent the input pattern Î. We now distinguish two cases.

In case 1 we assume that Î lies inside the rectangle Rja,o that geometrically represents the template wja,o (see Fig. 5). According to the Fuzzy ART rules, wja,o now becomes equal to wja,n, where

w_j^{a,n} = w_j^{a,o} ∧ Î = (u_j^{a,o} ∧ â, {v_j^{a,o} ∨ â}^c) = (u_j^{a,o}, {v_j^{a,o}}^c) = w_j^{a,o}

In this case there is no actual weight change or, equivalently, the size of the rectangle that represents the template of node j remains unchanged.

Figure 5. Input pattern Î = (â, âc), represented by the point â, lies inside rectangle Rja,o that represents template wja,o = (uja,o, {vja,o}c). Learning of Î leaves Rja,o intact.

In case 2, we assume that Î lies outside the rectangle Rja,o that geometrically represents template wja,o (see Fig. 6). Once more, according to the Fuzzy ART rules, wja,o becomes equal to wja,n, where

w_j^{a,n} = w_j^{a,o} ∧ Î = (u_j^{a,o} ∧ â, {v_j^{a,o} ∨ â}^c)
         = ((u_{j1}^{a,o} ∧ â_1, . . ., u_{jMa}^{a,o} ∧ â_{Ma}), {(v_{j1}^{a,o} ∨ â_1, . . ., v_{jMa}^{a,o} ∨ â_{Ma})}^c)
         = (u_j^{a,n}, {v_j^{a,n}}^c)   (15)

In this case there is actual weight change; the size of the rectangle that represents the template of node j is now increased. Thus, during the training process of Fuzzy ART or Fuzzy ARTMAP, the size of a rectangle Rja, which the weight vector wja defines, can only increase, from size zero up to a maximum size that we determine next.

The maximum size of a rectangle is determined by the vigilance parameter ρa. More specifically, with complement coding the size |I| of an input pattern I is equal to Ma. Hence a node j in the F2a field with corresponding weight vector wja,o codes an input pattern I only if the following criterion is satisfied:

|I ∧ w_j^{a,o}| ≥ Ma ρa   (16)

However,

|I ∧ w_j^{a,o}| = |(a, a^c) ∧ (u_j^{a,o}, {v_j^{a,o}}^c)|
              = |(a ∧ u_j^{a,o}, a^c ∧ {v_j^{a,o}}^c)|
              = |(a ∧ u_j^{a,o}, {a ∨ v_j^{a,o}}^c)|
              = Σ_{i=1}^{Ma} (a_i ∧ u_{ji}^{a,o}) + Σ_{i=1}^{Ma} (a_i ∨ v_{ji}^{a,o})^c
              = Σ_{i=1}^{Ma} (a_i ∧ u_{ji}^{a,o}) + Ma − Σ_{i=1}^{Ma} (a_i ∨ v_{ji}^{a,o})
              = Ma − (|a ∨ v_j^{a,o}| − |a ∧ u_j^{a,o}|)
              = Ma − |R_j^{a,n}|   (17)

From the above equations we can see that the rectangle size is allowed to increase provided that the new rectangle size satisfies the constraint

|R_j^{a,n}| ≤ Ma (1 − ρa)   (18)

The above inequality implies that if we choose ρa small (i.e., ρa ≈ 0), then some of the rectangles that the Fuzzy ART architecture defines might fill most of the input pattern space. On the other hand, if ρa is close to 1, all of the rectangles will be small.
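The identity |I ∧ wja,o| = Ma − |Rja,n| of Eq. (17) and the size bound of Eq. (18) can be checked numerically. The endpoint and input values below are our own illustrative choices, not taken from the text.

```python
import numpy as np

Ma, rho = 2, 0.8

# a committed template w_j = (u, {v}^c); u and v chosen for illustration
u = np.array([0.30, 0.30])
v = np.array([0.50, 0.50])
w = np.concatenate([u, 1.0 - v])

# complement-coded input I = (a, a^c) whose point a lies outside the rectangle
a = np.array([0.42, 0.60])
I = np.concatenate([a, 1.0 - a])

# Eq. (17): |I AND w| equals Ma minus the size of the expanded rectangle
lhs = np.minimum(I, w).sum()                       # here 1.5
new_u, new_v = np.minimum(u, a), np.maximum(v, a)  # endpoints after coding I
new_size = (new_v - new_u).sum()                   # here 0.5, so Ma - 0.5 = 1.5
assert abs(lhs - (Ma - new_size)) < 1e-12

# Eq. (18): since |I| = Ma, the vigilance criterion |I AND w|/|I| >= rho
# holds exactly when the expanded rectangle obeys |R_new| <= Ma * (1 - rho)
vigilance_ok = lhs / I.sum() >= rho                # here 0.75 < 0.80: fails
assert vigilance_ok == (new_size <= Ma * (1 - rho))
```

In this instance the expanded rectangle would have size 0.5 > Ma(1 − ρa) = 0.4, so the node is reset rather than allowed to code the input, exactly as the bound predicts.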
Figure 6. Input pattern Î = (â, âc), represented by the point â, lies outside rectangle Rja,o that represents template wja,o = (uja,o, {vja,o}c). Learning of Î creates a new rectangle Rja,n (the rectangle including all the points of rectangle Rja,o and the point â) of larger size than Rja,o.
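The two learning cases of Figs. 5 and 6 can be illustrated in a few lines. The points below are our own choices; the update is the fast-learning rule w ∧ I (componentwise minimum) applied to complement-coded inputs.

```python
import numpy as np

Ma = 2

def update_template(w, I):
    """Fast-learning update: w_new = w AND I (componentwise minimum)."""
    return np.minimum(w, I)

def rect_of(w):
    """Endpoints and size of the rectangle encoded by w = (u, {v}^c)."""
    u, v = w[:Ma], 1.0 - w[Ma:]
    return u, v, (v - u).sum()

w = np.array([0.3, 0.3, 0.5, 0.5])        # rectangle [0.3,0.5] x [0.3,0.5]

# Case 1: the point a_hat lies inside the rectangle -> template unchanged
inside = np.array([0.40, 0.45])
w_case1 = update_template(w, np.concatenate([inside, 1.0 - inside]))

# Case 2: a_hat lies outside -> rectangle grows just enough to enclose it
outside = np.array([0.60, 0.20])
w_case2 = update_template(w, np.concatenate([outside, 1.0 - outside]))
```

After case 2 the decoded rectangle is [0.3, 0.6] × [0.2, 0.5]: the smallest rectangle containing both the old rectangle and the new point, as in Fig. 6.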
It is worth pointing out that during the training process of Fuzzy ART or Fuzzy ARTMAP, compressed representations of the input patterns belonging to the training set are formed at the F2a field. These compressed representations can be visualized as the rectangles corresponding to committed nodes in F2a. The rectangle corresponding to a node includes within its boundaries all the input patterns that have been coded by this node. In Fuzzy ARTMAP, the compressed representations of the input patterns formed in F2a are mapped, during the training process, to appropriate output patterns (classes).

FUZZY ART EXAMPLE

The input patterns of the training list are given below. Furthermore, the Fuzzy ART network parameters are chosen as follows: Ma = 2, ρa = 0.8, αa = 0.01. Finally, the initial weights wja are chosen equal to the "all-ones" vectors.
I1 = (0.20 0.20 0.80 0.80)
I2 = (0.35 0.35 0.65 0.65)
I3 = (0.30 0.50 0.70 0.50)
I4 = (0.50 0.30 0.50 0.70)
I5 = (0.32 0.32 0.68 0.68)
I6 = (0.42 0.42 0.58 0.58)   (19)

First List Presentation

Present Pattern I1. Since no committed nodes exist in Fuzzy ART, node 1 in F2a will be activated and it will code input I1. After learning is over, the top-down vector from node 1 in F2a is equal to w1 = I1. The committed top-down vectors in ARTa, after the presentation of pattern I1 in the first list, are pictorially shown in Fig. 7(a) (see R1a in the figure).

Present Pattern I2. The bottom-up inputs to nodes 1 and 2 in F2a are equal to 0.8457 and 0.4987, respectively. Node 1 will be activated first and it will pass the vigilance criterion, since |I2 ∧ w1|/|I2| = 0.85 ≥ ρa = 0.80. After learning is over, w1 = I1 ∧ I2 = (0.2 0.2 0.65 0.65). The committed top-down vectors in ARTa, after the presentation of pattern I2 in the first list, are pictorially shown in Fig. 7(b) (see R1a in the figure).

Present Pattern I3. The bottom-up inputs to nodes 1 and 2 in F2a are equal to 0.9064 and 0.4987, respectively. Node 1 will be activated first and it will not pass the vigilance criterion, since |I3 ∧ w1|/|I3| = 0.775 < ρa = 0.80. Hence node 1 will be reset, and node 2 will be activated next. Node 2 will pass the vigilance criterion, since |I3 ∧ w2|/|I3| = 1.0 ≥ ρa = 0.80. After learning is over, w2 = I3 = (0.3 0.5 0.7 0.5). The committed top-down vectors in ARTa, after the presentation of pattern I3 in the first list, are pictorially shown in Fig. 7(c) (see R1a and R2a in the figure).

Present Pattern I4. The bottom-up inputs to nodes 1, 2, and 3 in F2a are equal to 0.9064, 0.7960, and 0.4987, respectively. Node 1 will be activated first and it will not pass the vigilance criterion, since |I4 ∧ w1|/|I4| = 0.775 < ρa = 0.80. Hence node 1 will be reset, and node 2 will be activated next. Node 2 will pass the vigilance criterion, since |I4 ∧ w2|/|I4| = 0.8 ≥ ρa = 0.80. After learning is over, w2 = I3 ∧ I4 = (0.3 0.3 0.5 0.5). The committed top-down vectors in ARTa, after the presentation of pattern I4 in the first list, are pictorially shown in Fig. 7(d) (see R1a and R2a in the figure).

Present Pattern I5. The bottom-up inputs to nodes 1, 2, and 3 in F2a are equal to 0.9942, 0.9938, and 0.4987, respectively. Node 1 will be activated first and it will pass the vigilance criterion, since |I5 ∧ w1|/|I5| = 0.85 ≥ ρa = 0.80. After learning is over, w1 = I1 ∧ I2 ∧ I5 = (0.2 0.2 0.65 0.65). The committed top-down vectors in ARTa, after the presentation of pattern I5 in the first list, are pictorially shown in Fig. 7(e) (see R1a and R2a in the figure).

Present Pattern I6. The bottom-up inputs to nodes 1, 2, and 3 in F2a are equal to 0.9122, 0.9938, and 0.4987, respectively. Node 2 will be activated first and it will pass the vigilance criterion, since |I6 ∧ w2|/|I6| = 0.80 ≥ ρa = 0.80. After learning is over, w2 = I3 ∧ I4 ∧ I6 = (0.3 0.3 0.5 0.5). The committed top-down vectors in ARTa, after the presentation of pattern I6 in the first list, are pictorially shown in Fig. 7(f) (see R1a and R2a in the figure).

In the second list presentation, I1, I2, I3, I4, I5, and I6 will be coded by w1, w1, w2, w2, w1, and w2, respectively. Also, in the second list presentation no weight changes will occur, and as a result we can declare the learning complete at the end of the first list presentation.

FUZZY ARTMAP EXAMPLE

The input patterns of the training list are given below. Furthermore, the Fuzzy ARTMAP network parameters are chosen as follows: Ma = 2, ρ̄a = 0.8, αa = 0.01. Finally, the initial weights wja are chosen equal to the "all-ones" vectors.
I1 = (0.20 0.20 0.80 0.80)
I2 = (0.35 0.35 0.65 0.65)
I3 = (0.30 0.50 0.70 0.50)
I4 = (0.50 0.30 0.50 0.70)
I5 = (0.32 0.32 0.68 0.68)
I6 = (0.42 0.42 0.58 0.58)   (20)

The corresponding output patterns are output pattern O1 for input patterns I1 and I2, output pattern O2 for input patterns I3 and I4, and output pattern O3 for input patterns I5 and I6.

First List Presentation

Present Pattern I1. Since no committed nodes exist in the F2a field of Fuzzy ARTMAP, node 1 in F2a will be activated and it will code input I1. After learning is over, the top-down vector from node 1 in F2a is equal to w1 = I1, and node 1 in F2a is mapped to output pattern O1. The committed top-down vectors in ARTa, after the presentation of pattern I1 in the first list, are pictorially shown in Fig. 8(a) (see R1a in the figure). Rectangle R1a is mapped to output pattern O1.
Figure 7. Rectangular representation of top-down templates in F2a during the first list presentation of the input patterns in the Fuzzy ART example.
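The bottom-up and match values quoted in these examples are easy to reproduce. The sketch below assumes the standard Fuzzy ART choice function with αa = 0.01; note that the article truncates rather than rounds some values (e.g., 0.8457 for 1.7/2.01 ≈ 0.84577).

```python
import numpy as np

def choice(I, w, alpha=0.01):
    """Bottom-up input (choice value) T_j = |I AND w_j| / (alpha + |w_j|)."""
    return np.minimum(I, w).sum() / (alpha + w.sum())

def match(I, w):
    """Vigilance (match) ratio |I AND w_j| / |I|."""
    return np.minimum(I, w).sum() / I.sum()

I2 = np.array([0.35, 0.35, 0.65, 0.65])
w1 = np.array([0.20, 0.20, 0.80, 0.80])   # template of node 1 after coding I1
uncommitted = np.ones(4)                  # uncommitted template: all ones

# choice(I2, w1)          -> 1.7 / 2.01, quoted in the example as 0.8457
# choice(I2, uncommitted) -> 2.0 / 4.01, quoted in the example as 0.4987
# match(I2, w1)           -> 1.7 / 2.0 = 0.85 >= rho_a = 0.80
```

So node 1 is both chosen first (largest choice value) and accepted by the vigilance test, exactly as in the presentation of I2 above.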
Present Pattern I2. The bottom-up inputs to nodes 1 and 2 in F2a are equal to 0.8457 and 0.4987, respectively. Node 1 will be activated first and it will pass the vigilance criterion, since |I2 ∧ w1|/|I2| = 0.85 ≥ ρa = 0.80. Also, node 1 in F2a is mapped to output pattern O1, which is the output pattern to which input pattern I2 needs to be mapped. Hence learning will take place, and after learning is over, w1 = I1 ∧ I2 = (0.2 0.2 0.65 0.65). The committed top-down vectors in ARTa, after the presentation of pattern I2 in the first list, are pictorially shown in Fig. 8(b) (see R1a in the figure). Rectangle R1a is mapped to output pattern O1.

Present Pattern I3. The bottom-up inputs to nodes 1 and 2 in F2a are equal to 0.9064 and 0.4987, respectively. Node 1 will be activated first and it will not pass the vigilance criterion, since |I3 ∧ w1|/|I3| = 0.775 < ρa = 0.80. Hence node 1 will be reset, and node 2 will be activated next. Node 2 will pass the vigilance criterion, since |I3 ∧ w2|/|I3| = 1.0 ≥ ρa = 0.80. After learning is over, w2 = I3 = (0.3 0.5 0.7 0.5), and node 2 is mapped to the output pattern O2. The committed top-down vectors in ARTa, after the presentation of pattern I3 in the first list, are pictorially shown in Fig. 8(c) (see R1a and R2a in the figure). Rectangles R1a and R2a are mapped to output patterns O1 and O2, respectively.

Present Pattern I4. The bottom-up inputs to nodes 1, 2, and 3 in F2a are equal to 0.9064, 0.7960, and 0.4987, respectively. Node 1 will be activated first and it will not pass the vigilance criterion, since |I4 ∧ w1|/|I4| = 0.775 < ρa = 0.80. Hence node 1 will be reset, and node 2 will be activated next. Node 2 will pass the vigilance criterion, since |I4 ∧ w2|/|I4| = 0.8 ≥ ρa = 0.80. Also, node 2 is mapped to the output pattern O2, to which the input pattern I4 needs to be mapped. Hence learning will occur, and after learning is over, w2 = I3 ∧ I4 = (0.3 0.3 0.5 0.5). The committed top-down vectors in ARTa,
after the presentation of pattern I4 in the first list, are pictorially shown in Fig. 8(d) (see R1a and R2a in the figure). Rectangles R1a and R2a are mapped to output patterns O1 and O2, respectively.

Present Pattern I5. The bottom-up inputs to nodes 1, 2, and 3 in F2a are equal to 0.9942, 0.9938, and 0.4987, respectively. Node 1 will be activated first and it will pass the vigilance criterion, since |I5 ∧ w1|/|I5| = 0.85 ≥ ρa = 0.80. But node 1 is mapped to the output pattern O1, while the input pattern I5 needs to be mapped to output pattern O3. Hence node 1 will be reset, and the vigilance parameter in ARTa will be raised to a level slightly higher than |I5 ∧ w1|/|I5| = 0.85. Next, node 2 will be activated, and node 2 will not pass the vigilance criterion, since |I5 ∧ w2|/|I5| = 0.80 < ρa = 0.85+. Hence node 2 will be reset and node 3 will be activated next. Node 3 will pass the vigilance criterion, since |I5 ∧ w3|/|I5| = 1.0 ≥ ρa = 0.85+. After learning is over, w3 = I5 = (0.32 0.32 0.68 0.68), and node 3 is mapped to the output pattern O3. The committed top-down vectors in ARTa, after the presentation of pattern I5 in the first list, are pictorially shown in Fig. 8(e) (see R1a, R2a, and R3a in the figure). Rectangles R1a, R2a, and R3a are mapped to output patterns O1, O2, and O3, respectively.

Present Pattern I6. The bottom-up inputs to nodes 1, 2, 3, and 4 in F2a are equal to 0.9122, 0.9938, 0.8955, and 0.4987, respectively. Node 2 will be activated first and it will pass the vigilance criterion, since |I6 ∧ w2|/|I6| = 0.80 ≥ ρa = 0.80. But node 2 is mapped to the output pattern O2, while the input pattern I6 needs to be mapped to output pattern O3. Hence node 2 will be reset, and the vigilance parameter in ARTa will be raised to a level slightly higher than |I6 ∧ w2|/|I6| = 0.80. Next, node 1 will be activated, and node 1 will not pass the vigilance criterion, since |I6 ∧ w1|/|I6| = 0.78 < ρa = 0.80+. Hence node 1 will be reset and node 3 will be activated next.
Figure 8. Rectangular representation of top-down templates in F2a during the first list presentation of the input/output pairs in the Fuzzy ARTMAP example.
Node 3 will pass the vigilance criterion, since |I6 ∧ w3|/|I6| = 0.9 ≥ ρa = 0.80+. Also, node 3 is mapped to the output pattern O3, which is the same output pattern to which the input pattern I6 needs to be mapped. Thus learning will take place, and after learning is over, w3 = I5 ∧ I6 = (0.32 0.32 0.58 0.58). The committed top-down vectors in ARTa, after the presentation of pattern I6 in the first list, are pictorially shown in Fig. 8(f) (see R1a, R2a, and R3a in the figure). Rectangles R1a, R2a, and R3a are mapped to output patterns O1, O2, and O3, respectively.

In the second list presentation, I1, I3, I4, I5, and I6 will be coded by w1, w2, w2, w3, and w3, respectively. On the other hand, pattern I2 will be coded by a new node with template w4 = I2, and node 4 will be mapped to the output pattern O1. In the third list presentation, patterns I1, I2, I3, I4, I5, and I6 will be coded by w1, w4, w2, w2, w3, and w3, respectively. Furthermore, all the input patterns are mapped to the correct output patterns, since nodes 1, 2, 3, and 4 are mapped to the output patterns O1, O2, O3, and O1, respectively. Also, in the third list presentation no weight changes will occur, and as a result we can declare the learning complete at the end of the second list presentation.

Note that in the Fuzzy ART and Fuzzy ARTMAP examples with two-dimensional data, the rectangles formed during learning can be of the trivial type {e.g., a point [R1a in Fig. 7(a)] or a line [R3a in Fig. 8(f)]}.

APPLICATIONS

The classification performance of Fuzzy ARTMAP has been examined on a plethora of pattern classification problems. In the original ARTMAP paper (5), the performance of the network on the mushroom database (25) was investigated. The mushroom database consists of 8124 input/output pairs. The input features of the input vector represent each of the 22 observable features of a mushroom (e.g., cap-shape, gill-spacing, population, habitat).
The output features of the output vector correspond to the classification of the mushroom as "edible" or "poisonous." Based on the results reported in Ref. 5, the Fuzzy ARTMAP system consistently achieved over 99% classification accuracy on the testing set with 1000 training input/output pairs; the testing set is the collection of input/output pairs (out of the 8124 possible) that were not included in the training set (1000 input/output pairs randomly chosen from the collection of 8124 input/output pairs). Classification accuracy of 95% was usually achieved with off-line training on 100-200 input/output pairs. The STAGGER algorithm (26) reached its maximum performance level of 95% accuracy after exposure to 1000 input/output training pairs, and the HILLARY algorithm (27) demonstrated a performance similar to that of STAGGER. Hence, for this database, Fuzzy ARTMAP was found to be an order of magnitude more efficient than the alternative systems.

Frey and Slate (28) developed a benchmark machine learning task that they describe as a "difficult categorization problem." The objective was to identify each of a large number of black-and-white rectangular pixel images as one of the 26 capital letters A-Z. The character images were based on 20 different
fonts, and each letter within these 20 fonts was randomly distorted to produce a database of 20,000 characters. The fonts represent five different stroke styles (simplex, duplex, triplex, complex, and Gothic) and six different letter styles (block, script, italic, English, Italian, and German). Sixteen numerical feature attributes were then obtained from each character image, and each attribute value was scaled to a range of 0-15. The identification task was challenging because of the wide diversity among the different fonts and the primitive nature of the attributes. Frey and Slate used this database to test the performance of a family of classifiers based on Holland's genetic algorithms. The training set consisted of 16,000 exemplars, with the remaining 4000 exemplars used for testing. Genetic algorithm classifiers having different input representations, weight update and rule creation schemes, and system parameters were systematically compared. Training was carried out for five epochs, plus a sixth "verification" pass during which no new rules were created but a large number of unsatisfactory rules were discarded. In the Frey-Slate comparative study, these systems had correct classification rates that ranged from 24.5% to 80.2% on the 4000-item test set. Fuzzy ARTMAP had an error rate on the letter recognition task that was consistently less than one-third that of the best Frey-Slate genetic algorithm classifiers. Of the 28 Fuzzy ARTMAP simulations reported in Ref. 6, the one with the best performance had a 94.7% correct prediction rate on the 4000-item test set after five training epochs. Thus the error rate (5.3%) was less than one-third that of the best simulation in the Frey-Slate comparative study (19.2%).
Another paper (9) compared the performance of Fuzzy ARTMAP and its variants [ART-EMAP (7) and ARTMAP-IC (9)] with other algorithms, such as K-nearest neighbor (29), the ADAP perceptron (30), multisurface pattern separation (31), CLASSIT (32), instance-based classifiers (33), and C4 (34). The databases chosen for this comparison were the diabetes database, the breast cancer database, the heart disease database, and the gallbladder removal database (25). The basic conclusion from this comparison is that Fuzzy ARTMAP, or its variants, performed as well as or better than the variety of methods applied to these benchmark problems.

In a recent publication, Carpenter (35) produced a list of applications where the family of ART networks and their variations have been used successfully. Below we reproduce this list with some additions of our own: a Boeing part retrieval system (36), satellite remote sensing (37,38), robot sensory-motor control (39-41), robot navigation (42), machine vision (43), three-dimensional object recognition (44), face recognition (45), image enhancement (46), Macintosh operating system software (47), automatic target recognition (48-50), electrocardiogram wave recognition (51,52), prediction of protein secondary structure (53), air quality monitoring (54), strength prediction for concrete mixes (55), signature verification (56), adaptive vector quantization (57), tool failure monitoring (58,59), chemical analysis from UV and IR spectra (60), frequency-selective surface design for electromagnetic system devices (61), Chinese character recognition (62), and analysis of musical scores (63).

THEORETICAL RESULTS

In this section we investigate the learning properties of the ART1, Fuzzy ART, and ARTMAP architectures. Some of the
learning properties discussed in this article involve characteristics of the clusters formed in these architectures, while others concern how quickly these architectures converge to a solution for the types of problems they are capable of solving. The latter issue is very important in the neural network literature, and there are very few instances where it has been answered satisfactorily. It is worth noting that all of the results described in this section have been developed and proved elsewhere (2,4,16-18,20). In this article, we present these results in a unified manner, with the purpose of pointing out the similarities in the learning properties of the ART1, Fuzzy ART, and ARTMAP architectures.
Preliminaries

For the properties of learning in ART architectures, it is important to understand the distinctions among the top-down weights emanating from nodes in the F2a field. Consider an input I presented to the ARTa module, and consider an arbitrary template of the ARTa module, designated wa. A component of an input pattern I is indexed by i if it affects node i in the F1a field. Similarly, a component of a template wa is indexed by i if it corresponds to the weight converging on node i in the F1a field. Based on this correspondence between the components of input patterns and templates in ARTa, we can identify three types of learned templates with respect to an input pattern I: subset templates, mixed templates, and superset templates. A template wa is a subset template of pattern I if each of the wa components is smaller than or equal to its corresponding component in I. A template wa is a mixed template of pattern I if some of the wa components are smaller than or equal to their corresponding components in I, while the rest are larger than their corresponding components in I. A template wa is a superset template of pattern I if each of the wa components is larger than or equal to its corresponding component in I. Besides the templates defined above, we also define an uncommitted template to be the vector of top-down weights associated with a node in F2a that has not yet been chosen to represent an input pattern. As before, every component of an uncommitted template is equal to one. With reference to an input pattern I, we also designate nodes in F2a as subset, mixed, superset, or uncommitted, depending on whether their corresponding template is a subset, mixed, superset, or uncommitted template with respect to I.

One of the modeling assumptions required for the validity of some of the results presented in this section is fast learning. Fast learning means that the input/output pairs presented to the ARTMAP architecture, or the inputs presented to the ART1 and Fuzzy ART architectures, are held at the network nodes long enough for the network weights to converge to their steady-state values. The learning equation for the weights provided by Eq. (5) pertains to the fast learning scenario. Whenever the fast learning assumption is not imposed, the weights are modified in a slow learning mode; in the slow learning mode the input/output pairs (ARTMAP) or inputs (ART1, Fuzzy ART) are not applied at the network nodes long enough for the network weights to reach their steady-state values.

We have already defined what we mean by the statement that "learning is complete" in the off-line training phases of the ART architectures. We now provide this definition in more rigorous terms.

Definition 1. In the off-line training phase of an ART architecture, the learning process is declared complete if every input pattern from the list chooses a node in the F2a field that satisfies the direct-access, no-learning conditions. A node j in F2a chosen by a pattern I satisfies the direct-access, no-learning conditions if (a) node j is the first node chosen in F2a by pattern I, (b) node j is not reset, and (c) the top-down weights corresponding to node j (i.e., wja) are not modified. Conditions (a) and (b) are the direct-access conditions and condition (c) is the no-learning condition.

For example, in the case of an input list (I1, O1), (I2, O2), . . ., (IP, OP), assume that list presentation n is the first list presentation at which each one of the input patterns chooses a node in F2a that satisfies the direct-access, no-learning conditions. In particular, assume that I1 chooses node j1, I2 chooses node j2, . . ., and IP chooses node jP, and that nodes j1, j2, . . ., jP satisfy the direct-access, no-learning conditions; the notation jp (1 ≤ p ≤ P) denotes the node in F2a chosen by input pattern Ip, and, as a result, we may have jp = jp′ for p ≠ p′. At the end of the nth list presentation we can declare that learning is complete. In the above example, no modification of the ART weights is performed during list presentation n; hence we can further claim that learning is complete by the end of list presentation n − 1. Obviously, in list presentations ≥ n, input pattern I1 will always choose node j1, input pattern I2 will always choose node j2, and so on.
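The template categories defined in the Preliminaries (subset, mixed, superset, uncommitted) can be expressed as a small predicate. The helper below, `template_type`, is our own illustrative sketch; a template equal to I is reported as a subset here, matching the "smaller than or equal" wording of the definition.

```python
import numpy as np

def template_type(w, I):
    """Classify template w relative to input I: 'uncommitted' (all ones),
    'subset' (every component <= the corresponding component of I),
    'superset' (every component >= it), and 'mixed' otherwise."""
    if np.all(w == 1.0):
        return "uncommitted"
    if np.all(w <= I):
        return "subset"
    if np.all(w >= I):
        return "superset"
    return "mixed"
```

A predicate of this form is all that is needed to state the Order of Search Property below: for small αa, subset templates of I are searched before superset, mixed, and uncommitted templates.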
Properties of Learning

We will state a number of learning properties pertinent to the ART1, Fuzzy ART, and ARTMAP architectures, focusing on properties that are common among the ART architectures under consideration.

Distinct Templates Property. The templates formed in ART1 with fast learning, Fuzzy ART, and ARTMAP with fast learning are distinct.

Direct Access by Perfectly Learned Template Property. In ART1, ARTMAP with fast learning, and Fuzzy ART, if an input pattern I has been perfectly learned by a node in the category representation field, this node will be directly accessed by I whenever I is presented to the network. We say that an input pattern I has been perfectly learned by node j in F2a iff wja = I.

Number of Templates Property. At the completion of the off-line training phase of ART1 and Fuzzy ART with fast learning, with enough nodes in the category representation layer and small values of the network parameter αa, the number of templates created is smaller than the number of patterns in the input list.

Order of Search Property. Suppose that I is an arbitrary pattern from the list of input patterns in ART1 and Fuzzy ART, or from the list of input/output pairs in ARTMAP. Then, if
the network parameter value α_a is small, the largest subset template of I will be searched first. If this subset template is reset, all subset templates will be reset. If all learned subset templates are reset, then superset templates, mixed templates, and uncommitted templates are searched, not necessarily in that order.

Number of List Presentations Property 1. The off-line training phase of ART1, Fuzzy ART, and ARTMAP with fast learning, with enough nodes in the category representation field and small values of the network parameter α_a, will be complete in at most M_a list presentations.

Number of List Presentations Property 2. The off-line training phase of ART1 and Fuzzy ART with fast learning, with enough nodes in the category representation field and small values of the network parameter α_a, will be complete in m list presentations, where m is the number of distinct-size input patterns in the list.

The Distinct Templates Property is one of the desirable properties of the ART architectures. Since templates in ART1, Fuzzy ART, and ARTMAP are compressed representations of the input patterns presented to these architectures, it would be wasteful to create templates that are equal. The Direct Access by Perfectly Learned Template Property is another indication that the ART architectures employ learning rules that behave in a sensible manner. This property is essential for any pattern clustering or pattern classification machine: since templates are compressed representations of the input patterns, we should expect an input pattern to prefer a template equal to itself over any other template created by the architecture.
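The scale of the bound in the Number of List Presentations Property can be checked with a few lines of arithmetic (an illustrative sketch; the sizes M_a = 10 and M_a = 100 match the examples discussed below):

```python
# Number of List Presentations Property 1: for binary inputs with
# M_a components, ARTMAP needs at most M_a list presentations, even
# though the list may contain up to 2**M_a distinct input patterns.
for m_a in (10, 100):
    max_patterns = 2 ** m_a  # size of the binary input space
    print(f"M_a = {m_a}: at most {m_a} presentations "
          f"for up to {max_patterns:.3e} input/output pairs")
```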
The Number of Templates Property provides an upper bound on the number of nodes created in the category representation field of ART1 and Fuzzy ART so that these architectures can learn a list of input patterns repeatedly presented to them. In practice, the number of templates created (or nodes required) in the category representation field is usually much smaller than the number of patterns in the input list, and it is an increasing function of the network parameters α_a and ρ_a. The Order of Search Property is a very interesting result because it identifies the order in which the templates created in the category representation field are accessed. This property is instrumental in demonstrating the Number of List Presentations Properties.

The Number of List Presentations Properties of the ART architectures are somewhat unique in the family of neural network architectures. To illustrate the point, consider the popular back-prop network (64), where not only do we not know how many list presentations are required to learn an arbitrary list of input/output pairs, but often we do not know whether the architecture will converge to a solution at all. By contrast, for the ARTMAP architecture the Number of List Presentations Property 1 tells us that we will need at most M_a list presentations to learn an arbitrary list of binary input/output pairs. The parameter M_a is the number of components of the input pattern a, or the number of ones of the input pattern I. This bound on the number of list presentations is a tight bound, and it is very impressive when we consider a couple of examples. For instance, consider input/output pairs whose input patterns a have M_a = 10 (100) components; the input/output mapping problem given to us in this case can have at most 2^10 ≈ 1000 (2^100 ≈ 10^30) input/output pairs. ARTMAP would need at most 10 (100) presentations to learn this mapping problem. Can you imagine the time required by a back-prop network to learn a mapping problem involving 10^30 input/output pairs?

The Number of List Presentations Property 2 tells us that the upper bound on the number of list presentations required by ART1 and Fuzzy ART to learn a list of input patterns, repeatedly presented to them, can be made tighter. In particular, the number of list presentations required is upper bounded by the number of distinct-size input patterns in the list. For example, if M_a = 100 and the number of distinct-size inputs presented to ART1 is 2, ART1 will require 2 list presentations to learn the list. This property is taken to the extreme in Fuzzy ART, because there the preprocessing of the inputs leaves us with input patterns of the same size (M_a). Hence Fuzzy ART needs only one list presentation to learn the list of input patterns presented to it.

BIBLIOGRAPHY

1. S. Grossberg, Adaptive pattern recognition and universal recoding II: Feedback, expectation, olfaction, and illusions. Biol. Cybernet., 23: 187–202, 1976.
2. G. A. Carpenter and S. Grossberg, A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput. Vision, Graphics, Image Proc., 37: 54–115, 1987.
3. G. A. Carpenter and S. Grossberg, ART 2: Self-organization of stable category recognition codes for analog input patterns. Appl. Opt., 26 (23): 4919–4930, 1987.
4. G. A. Carpenter, S. Grossberg, and D. B.
Rosen, Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4 (6): 759–771, 1991.
5. G. A. Carpenter, S. Grossberg, and J. H. Reynolds, ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4 (5): 565–588, 1991.
6. G. A. Carpenter, S. Grossberg, N. Markuzon, J. H. Reynolds, and D. B. Rosen, Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Trans. Neural Networks, 3 (5): 698–713, 1992.
7. G. A. Carpenter and W. D. Ross, ART-EMAP: A neural architecture for object recognition by evidence accumulation. IEEE Trans. Neural Networks, 6: 805–818, 1995.
8. J. R. Williamson, Gaussian ARTMAP: A neural network for fast incremental learning of noisy multi-dimensional maps. Neural Networks, 9 (5): 881–897, 1996.
9. G. A. Carpenter and N. Markuzon, ARTMAP-IC and medical diagnosis: Instance counting and inconsistent cases. Technical Report CAS/CNS-96-017, Boston University, Boston, 1996.
10. S. C. Newton, S. Pemmaraju, and S. Mitra, Adaptive fuzzy leader clustering model for pattern recognition. IEEE Trans. Neural Networks, 3 (5): 784–800, 1992.
11. M. J. Healy, T. P. Caudell, and S. D. G. Smith, A neural architecture for pattern sequence verification through inferencing. IEEE Trans. Neural Networks, 4 (1): 9–20, 1993.
12. Y. S. Kim and S. Mitra, An adaptive integrated fuzzy clustering model for pattern recognition. Fuzzy Sets Syst., 65: 297–310, 1994.
13. P. K. Simpson, Fuzzy Min-Max neural networks—part 1: Classification. IEEE Trans. Neural Networks, 3 (5): 776–786, 1992.
14. P. K. Simpson, Fuzzy Min-Max neural networks—part 2: Clustering. IEEE Trans. Fuzzy Syst., 1 (1): 32–45, 1993.
15. C. A. Hung and S. F. Lin, Adaptive Hamming net: A fast learning ART1 model without searching. Neural Networks, 8: 605–618, 1995.
16. M. Georgiopoulos, G. L. Heileman, and J. Huang, Convergence properties of learning in ART1. Neural Comput., 2 (4): 502–509, 1990.
17. M. Georgiopoulos, G. L. Heileman, and J. Huang, Properties of learning related to pattern diversity in ART1. Neural Networks, 4 (6): 751–757, 1991.
18. M. Georgiopoulos, G. L. Heileman, and J. Huang, The N–N–N conjecture in ART1. Neural Networks, 5 (5): 745–753, 1992.
19. B. Moore, ART1 and pattern clustering. In D. S. Touretzky, G. Hinton, and T. Sejnowski (eds.), Proceedings of the 1988 Connectionist Summer School. San Mateo, CA: Morgan Kaufmann, 1989, pp. 175–185.
20. J. Huang, M. Georgiopoulos, and G. L. Heileman, Fuzzy ART properties. Neural Networks, 8 (2): 203–213, 1995.
21. M. Georgiopoulos et al., Order of search in Fuzzy ART and Fuzzy ARTMAP: A geometrical interpretation. In Proceedings of the International Conference on Neural Networks. Washington, DC: IEEE Press, 1996, pp. 215–220.
22. G. Bartfai, On the match tracking anomaly of the ARTMAP neural network. Neural Networks, 9 (2): 295–308, 1996.
23. M. Georgiopoulos, J. Huang, and G. L. Heileman, Properties of learning in ARTMAP. Neural Networks, 7: 495–506, 1994.
24. R. Dubes and A. Jain, Clustering techniques: The user's dilemma. Pattern Recognition, 8: 247–260, 1976.
25. P. Murphy and D. Aha, UCI repository of machine learning databases. Technical report, Department of Computer Science, University of California, Irvine, CA, 1994.
26. J. S.
Schlimmer, Concept acquisition through representational adjustment. Doctoral dissertation (Technical Report 87-19), Department of Information and Computer Science, University of California, Irvine, CA, 1987.
27. W. Iba, J. Wogulis, and P. Langley, Trading off simplicity and coverage in incremental concept learning. In Proceedings of the 5th International Conference on Machine Learning. Ann Arbor, MI: Morgan Kaufmann, 1988, pp. 73–79.
28. P. W. Frey and D. J. Slate, Letter recognition using Holland-style adaptive classifiers. Mach. Learning, 6: 161–182, 1991.
29. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: John Wiley & Sons, 1973.
30. J. W. Smith et al., Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care. New York: IEEE Computer Society Press, 1988, pp. 261–265.
31. W. H. Wolberg and O. L. Mangasarian, Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc. Natl. Acad. Sci. USA, 87: 9193–9196, 1990.
32. J. H. Gennari, P. Langley, and D. Fisher, Models of incremental concept formation. Artif. Intell., 40: 11–60, 1989.
33. D. W. Aha, D. Kibler, and M. K. Albert, Instance-based learning algorithms. Mach. Learning, 6: 37–60, 1991.
34. J. R. Quinlan, The effect of noise on concept learning. In R. S. Michalski, J. C. Carbonell, and T. Mitchell (eds.), Machine Learning: An Artificial Intelligence Approach. San Mateo, CA: Morgan Kaufmann, 1986, pp. 149–166.
35. G. A. Carpenter, Distributed learning, recognition, and prediction by ART and ARTMAP networks. Technical Report CAS/CNS-96-004, Boston University, Boston, 1996.
36. T. P. Caudell et al., NIRS: Large scale ART-1 neural architectures for engineering design retrieval. Neural Networks, 7: 1339–1350, 1994.
37. A. Baraldi and F. Parmiggiani, A neural network for unsupervised categorization of multivalued input patterns. IEEE Trans. Geosci. Remote Sensing, 33: 305–316, 1995.
38. S. Gopal, D. M. Sklarew, and E. Lambin, Fuzzy-neural networks in multi-temporal classification of landcover change in the Sahel. In Proceedings of the DOSES Workshop on New Tools for Spatial Analysis. Brussels, Luxembourg: DOSES, EUROSTAT, ECSC-EC-EAEC, 1994, pp. 55–68.
39. I. A. Bachelder, A. M. Waxman, and M. Seibert, A neural system for mobile robot visual place learning and recognition. In Proceedings of the World Congress on Neural Networks (WCNN-93). Hillsdale, NJ: Lawrence Erlbaum, 1993, pp. I512–I517.
40. A. A. Baloch and A. M. Waxman, Visual learning, adaptive expectations, and behavioral conditioning of the mobile robot MAVIN. Neural Networks, 4 (3): 271–302, 1991.
41. A. Dubrawski and J. L. Crowley, Learning locomotion reflexes: A self-supervised neural system for a mobile robot. Robotics Autonomous Syst., 12: 133–142, 1994.
42. A. Dubrawski and J. L. Crowley, Self-supervised neural system for reactive navigation. In Proceedings of the IEEE International Conference on Robotics and Automation, May 1994. New York: IEEE Computer Society Press, 1994, pp. 2076–2081.
43. T. P. Caudell and M. J. Healy, Adaptive Resonance Theory networks in the Encephalon autonomous vision system. In Proceedings of the 1994 IEEE International Conference on Neural Networks. New York: IEEE Press, 1994, pp. II1235–II1240.
44. S. Seibert and A. M.
Waxman, Adaptive 3D object recognition from multiple views. IEEE Trans. Pattern Anal. Mach. Intell., 14: 107–124, 1992.
45. S. Seibert and A. M. Waxman, An approach to face recognition using saliency maps and caricatures. In Proceedings of the World Congress on Neural Networks (WCNN-93). Hillsdale, NJ: Lawrence Erlbaum, 1993, pp. III661–III664.
46. F. Y. Shih, J. Moh, and F. C. Chang, A new ART-based neural architecture for pattern classification and image enhancement without prior knowledge. Pattern Recognition, 25 (5): 533–542, 1992.
47. C. Johnson, Agent learns user's behavior. Electr. Eng. Times, 43–46, 1993.
48. A. M. Bernardon and J. E. Carrick, A neural system for automatic target learning and recognition applied to bare and camouflaged SAR targets. Neural Networks, 8: 1103–1108, 1995.
49. M. W. Koch et al., Cueing, feature discovery, and one-class learning for synthetic aperture radar automatic target recognition. Neural Networks, 8: 1081–1102, 1995.
50. A. M. Waxman et al., Neural processing of targets in visible, multispectral IR and SAR imagery. Neural Networks, 8: 1029–1051, 1995.
51. F. M. Ham and S. W. Han, Quantitative study of the QRS complex using Fuzzy ARTMAP and the MIT/BIH arrhythmia database. In Proceedings of the World Congress on Neural Networks (WCNN-93). Hillsdale, NJ: Lawrence Erlbaum, 1993, pp. I207–I211.
52. Y. Suzuki, Y. Abe, and K. Ono, Self-organizing QRS wave-recognition system in ECG using ART2. In Proceedings of the World Congress on Neural Networks (WCNN-93). Hillsdale, NJ: Lawrence Erlbaum, 1993, pp. IV39–IV42.
53. B. V. Metha, L. Vij, and L. C. Rabelo, Prediction of secondary structures of proteins using Fuzzy ARTMAP. In Proceedings of the World Congress on Neural Networks (WCNN-93). Hillsdale, NJ: Lawrence Erlbaum, 1993, pp. I228–I232.
54. D. Wienke, Y. Xie, and P. K. Hopke, An Adaptive Resonance Theory based artificial neural network (ART2-A) for rapid identification of airborne particle shapes from their scanning electron microscopy images. Chemometrics and Intelligent Laboratory Systems, 1994.
55. J. Kasperkiewicz, J. Racz, and A. Dubrawski, HPC strength prediction using artificial neural networks. J. Comput. Civil Eng., 9: 279–284, 1995.
56. N. A. Murshed, F. Bortozzi, and R. Sabourin, Off-line signature verification without a priori knowledge of class ω2: A new approach. In Proceedings of ICDAR 95: The Third International Conference on Document Analysis and Recognition, 1995.
57. S. Mitra and S. Pemmaraju, Adaptive vector quantization using an ART-based neuro-fuzzy clustering algorithm. In Proceedings of the International Conference on Neural Networks. Washington, DC: IEEE Press, 1996, pp. 211–214.
58. S. Ly and J. J. Choi, Drill condition monitoring using ART-1. In Proceedings of the 1994 IEEE International Conference on Neural Networks. New York: IEEE Press, 1994, pp. II1226–II1229.
59. Y. S. Tarng, T. C. Li, and M. C. Chen, Tool failure monitoring for drilling purposes. In Proceedings of the 3rd International Conference on Fuzzy Logic, Neural Nets and Soft Computing, 1994, pp. 109–111.
60. D. Wienke and G. Kateman, Adaptive Resonance Theory based artificial neural networks for treatment of open-category problems in chemical pattern recognition—application to UV-Vis and IR spectroscopy.
Chemometrics and Intelligent Laboratory Systems, 1994.
61. C. Christodoulou et al., Design of gratings and frequency selective surfaces using Fuzzy ARTMAP neural networks. J. Electromagnet. Waves Appl., 9: 17–36, 1995.
62. K. W. Gan and K. T. Lua, Chinese character classification using Adaptive Resonance network. Pattern Recognition, 25: 877–888, 1992.
63. R. O. Gjerdingen, Categorization of musical patterns by self-organizing neuron-like networks. Music Perception, 7: 339–370, 1990.
64. J. L. McClelland, D. E. Rumelhart, and G. E. Hinton, The appeal of parallel distributed processing. In D. E. Rumelhart and J. L. McClelland (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. Cambridge, MA: MIT Press, 1986.
MICHAEL GEORGIOPOULOS University of Central Florida
GREGORY L. HEILEMAN University of New Mexico
JUXIN HUANG Hewlett-Packard
Wiley Encyclopedia of Electrical and Electronics Engineering
Boltzmann Machines
Standard Article
Laurene V. Fausett, University of South Carolina-Aiken
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5102
Article Online Posting Date: December 27, 1999
Abstract: The sections in this article are Overview of Neural Networks; Boltzmann Machines; Sample Applications of Boltzmann Machines; Operation of a Fixed-Weight Boltzmann Machine; Alternative Formulations of the Basic Boltzmann Machine; Extensions of the Boltzmann Machine; Summary and Conclusions.
BOLTZMANN MACHINES
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

As modern computers become ever more powerful, engineers continue to be challenged to use machines effectively for tasks that are relatively simple for humans but difficult for traditional problem-solving techniques. Artificial neural networks, inspired by biological systems, provide computational methods that can be utilized in many engineering disciplines. Following a brief overview of the features that characterize neural networks in general, we consider the neural networks known as Boltzmann machines. Fixed-weight Boltzmann machines are used for constrained optimization problems, such as those arising in scheduling, management science, and graph theory. They are applied to intractable NP-complete problems to rapidly locate near-optimal solutions. Three constrained optimization problems, the traveling salesman, asset allocation, and scheduling problems, are considered below. Other problems of this type
include maximum cut, independent set, graph coloring, clique partitioning, and clique covering problems (1). A second type of Boltzmann machine is used for input-output mapping problems such as the encoder, seven-segment display, and XOR problems.
OVERVIEW OF NEURAL NETWORKS

Neural networks consist of many simple processing elements, called neurons or units, which are connected by weighted pathways. The neurons communicate with each other by sending signals over these paths. Each neuron processes the input signals that it receives to compute its activation, which becomes the signal that the neuron sends to other units. The weights on the pathways may be fixed when the network is designed, or adapted when the network is trained using examples. Fixed-weight networks are used for constrained optimization problems, and adaptive weights are used for pattern classification and general mapping networks. After training, the neural network is able to recognize an input pattern that is similar to, but not exactly the same as, one of the training patterns.

Neural Network Architectures

The pattern of connections among the neurons is called the neural network architecture. A simple feed-forward neural network, in which the signals flow from left to right, is illustrated in Fig. 1(a). Recurrent neural networks have feedback connections, such as the connection from unit Y2 back to unit X3 in Fig. 1(b).
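As a concrete illustration of the computation described under Neural Network Operation below, the output of unit Y1 in Fig. 1(a) is Y1 = f(x1 w11 + x2 w21 + x3 w31). A minimal sketch follows; the logistic function is only one common choice for the nonlinearity f, and the input and weight values are invented:

```python
import math

def logistic(net):
    """Squash a net-input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def unit_output(inputs, weights, f=logistic):
    """Output signal of one unit: f applied to the weighted sum of its inputs."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return f(net)

# Y1 = f(x1*w11 + x2*w21 + x3*w31) for the network of Fig. 1(a)
y1 = unit_output([1.0, 0.0, 1.0], [0.5, -0.3, 0.2])
```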
Figure 2. Fully interconnected neural network.
Neural Network Operation

In a typical neural network, the signal transmitted over a connection pathway is multiplied by the weight on the path. The neuron first sums the incoming signals and then processes this sum (its net-input) to determine the signal it will transmit. In many neural networks this output signal is a nonlinear function of the net-input, with a range of 0 to 1 (or −1 to 1). For example, for the neural network in Fig. 1(a), the output signal from unit Y1 could be expressed as

Y1 = f(x1 w11 + x2 w21 + x3 w31)

for a suitable nonlinear function f. See Refs. 2 and 3 for further discussion of neural networks.

Figure 1. Simple neural networks with no hidden nodes.

BOLTZMANN MACHINES

Boltzmann machines are neural networks in which the units can have only two states; the present discussion is limited to the case of binary output, that is, a unit is either off (output is 0) or on (output is 1). Furthermore, the net-input does not determine the output value, but only the probability of each output value. The massive parallelism of neural networks in general, and Boltzmann machines in particular, provides a promising approach for computers of the future.

Architecture

The architecture of a Boltzmann machine is very general. The neurons may be fully interconnected, as illustrated in Fig. 2, or only partially interconnected, as shown in Fig. 3. However, the connection pathways are always bidirectional. In other words, if neuron Xi is connected to neuron Xj with weight wij, then Xj is also connected to Xi and the connection has the same weight (i.e., wji = wij).

Figure 3. Partially interconnected neural network for the 4–2–4 encoder problem.

Using a Boltzmann Machine

In recurrent neural networks, the activations of the neurons evolve in such a way that the equilibrium configuration (pattern of activations) represents the problem solution. In a Boltzmann machine, a unit may flip its activation (from 0 to 1, or vice versa); whether this flip occurs depends on the unit's net-input and a parameter known as temperature. The process of selecting a unit at random and allowing it to change its activation (or not, depending on the specified probability function) continues, with the temperature reduced very gradually, until the activations stabilize. The change is accepted stochastically in order to reduce the chances of the network becoming trapped in a local optimum.

The process of gradually reducing the temperature, by which the stochastic behavior of a system is gradually made less and less random, is known as simulated annealing. It is analogous to the physical annealing process used to produce a strong metal (with a regular crystalline structure). During annealing, a molten metal is cooled gradually in order to avoid freezing imperfections into the crystalline structure of the metal.

Boltzmann Machine Weights

Some Boltzmann machines belong to the set of neural networks for which the weights are fixed when the network is designed. These networks are used for constraint satisfaction and constrained optimization problems. Fixed-weight Boltzmann machines for constraint satisfaction and constrained optimization problems are designed so that the network converges to a minimum of an energy function or a maximum of a consensus function. These two formulations are equivalent; the following discussion will use the consensus function.

Other Boltzmann machines undergo a training phase, after which the network can perform the intended task using input data that are similar (but not necessarily identical) to its training input. A Boltzmann machine with learning can solve pattern completion and more general input-output mapping problems.
Although training is characteristic of the majority of neural networks, fixed-weight Boltzmann machines are simpler and more widely used than adaptive-weight Boltzmann machines and will be discussed first.

Designing a Fixed-Weight Boltzmann Machine. Fixed-weight Boltzmann machines are designed by formulating a function (the consensus) that describes the constraints of the problem, together with the objective to be optimized if there is one. Each unit in the network represents a hypothesis; the activation of the unit corresponds to the truth or falsity of the hypothesis. The weights are fixed to represent both the constraints of the problem and the quantity to be optimized. The activity level of each unit is adjusted so that the network will converge to the desired maximum consensus; the pattern of activations then corresponds to the solution of the problem.

The connection between two units controls whether the units are encouraged to be on or off. A positive connection between two units encourages both units to be on; a negative connection encourages one or the other of the units to be off. Each unit also may have a bias (self-connection) that influences its activation regardless of the activations of the other units connected to it.

The weights for the network are determined so that the probability of accepting a change that improves the network configuration is greater than the probability of rejecting it. However, early in the solution process, when the temperature
is relatively high, the probability of accepting a "bad change" or rejecting a "good change" is much closer to 0.5 than later, after the network has cooled.

Neural networks have several potential advantages over traditional techniques for certain types of optimization problems. They can find near-optimal solutions relatively quickly for large problems. They can also handle situations in which some constraints are less important than others.

Training an Adaptive-Weight Boltzmann Machine. The Boltzmann machine is also used for learning tasks. The network architecture may incorporate input, hidden, and output units. Input and output neurons are those for which the correct activations are known; any other units are hidden. During training, a neural network is given a sequence of training patterns, each of which specifies an example of the desired activations for the input and output neurons.

Boltzmann learning requires several cycles during which the network is allowed to reach equilibrium. Each cycle requires starting the network at a fairly high temperature and allowing the appropriate neurons to adjust their activations as the network cools. For each training pattern, the network is allowed to reach equilibrium with the activations of the input and output units held fixed (clamped) to the values given for that pattern. Only the activations of the hidden units change during this phase. After this has been done several times for all training patterns, the probability of each pair of neurons being on is computed as the fraction of the time both are on, averaged over all training runs for all training patterns. The same process is repeated with none of the activations clamped; this is called the free-running phase of training.
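The weight update driven by these two phases (described in the next paragraph) can be sketched as follows. This is a sketch, not the article's exact procedure: the learning rate eta is an assumed parameter, and the co-activation probabilities are taken to be the statistics gathered during the clamped and free-running runs.

```python
def boltzmann_weight_update(w, p_clamped, p_free, eta=0.1):
    """Adjust symmetric weights from co-activation statistics.

    w[i][j]        : current weight between units i and j (w[i][j] == w[j][i])
    p_clamped[i][j]: fraction of time units i and j were both on, clamped phase
    p_free[i][j]   : the same statistic from the free-running phase
    """
    n = len(w)
    for i in range(n):
        for j in range(n):
            if i != j:
                # Increase the weight if co-activation was more likely with
                # the training pattern clamped; decrease it otherwise.
                w[i][j] += eta * (p_clamped[i][j] - p_free[i][j])
    return w
```

Because the statistics are symmetric, the update preserves the symmetry w[i][j] == w[j][i] required by the bidirectional connections.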
Since large positive weights encourage both neurons to be on, the weight on the connection between a pair of neurons is increased if the probability of both units being on was higher in the clamped phase of training than in the free-running phase. On the other hand, if it was less likely for the units to be on simultaneously in the clamped than in the free-running phase, the weight between that pair of units is reduced.

SAMPLE APPLICATIONS OF BOLTZMANN MACHINES

Many constrained optimization problems have been solved using neural networks; if the problem can be formulated as a 0-1 programming problem, then the states of the Boltzmann machine are assigned to the variables, and the cost function and constraints are implemented as the weights of the network. The solution of the traveling salesman problem (TSP) serves as a model for other constrained optimization problems.

Boltzmann machines can be used to generate initial configurations of assets for a generic game (e.g., chess). The desired distribution of playing pieces is subject to restrictions on the number of pieces (of several different types) that are present, as well as some preferences for the relative positions of the pieces. The rules implemented in the network allow for flexibility in assigning locations for available resources, while the probabilistic nature of the network introduces a degree of variability in the solutions generated (4).

The class scheduling/instructor assignment problem is an example of a problem containing both strong and weak constraints. For example, the strong constraints could ensure that a single instructor is not assigned two classes at once, that each class is offered exactly once, and that each instructor is assigned a fair class load. The weak constraints might specify instructors' preferences for class subjects and class time periods (5,6).

Boltzmann learning is illustrated using the encoder problem, which requires that binary patterns presented to the input units pass through a bottleneck (the hidden units) and reproduce the original pattern at the output units. Using input patterns in which only one unit is active, the network learns a more concise representation of the information at the hidden units (7).

OPERATION OF A FIXED-WEIGHT BOLTZMANN MACHINE

Recall that the neurons in a fixed-weight Boltzmann machine represent hypotheses; if the neuron is active, the hypothesis is interpreted to be true; otherwise the hypothesis is considered to be false. The weights in a Boltzmann machine for constraint satisfaction or constrained optimization represent the constraints of the problem and the quantity to be optimized. The weight wij expresses the degree of desirability that units Xi and Xj are both on. The bidirectional nature of the connection requires that wij = wji. A unit may also have a self-connection, wii.

A fixed-weight Boltzmann machine operates to maximize the consensus function
C = Σ_i Σ_{j≤i} wij xi xj

by letting each unit attempt to change its state. The change in consensus, if unit Xi changes its state, is given by

ΔC = (1 − 2xi)(wii + Σ_{j≠i} wij xj)

where xi is the current state of unit Xi. However, unit Xi does not necessarily change its state even if doing so would increase the consensus. The probability of accepting the change in state is given by

Pr[Xi changes state] = 1 / (1 + exp(−ΔC/T))    (1)
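Under the conventions above (binary states xi, symmetric weights wij, self-connections wii), the consensus function, the change in consensus, and the acceptance probability of Eq. (1) translate directly into code (a sketch; the function names are ours):

```python
import math

def consensus(w, x):
    """C = sum over i and j <= i of w[i][j] * x[i] * x[j]."""
    return sum(w[i][j] * x[i] * x[j]
               for i in range(len(x)) for j in range(i + 1))

def delta_consensus(w, x, i):
    """Change in C if unit i flips: (1 - 2*x[i]) * (w[i][i] + sum over j != i of w[i][j]*x[j])."""
    net = w[i][i] + sum(w[i][j] * x[j] for j in range(len(x)) if j != i)
    return (1 - 2 * x[i]) * net

def accept_probability(dc, temperature):
    """Eq. (1): Pr[unit changes state] = 1 / (1 + exp(-dC/T))."""
    return 1.0 / (1.0 + math.exp(-dc / temperature))
```

Note that the self-connection wii enters the consensus once (the j = i term), which is why it appears inside the net-input of the flip rule.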
The parameter T (temperature) is gradually reduced as the network searches for a maximal consensus. Lower values of T make it more likely that the network will accept a change of state that increases consensus, and less likely that it will accept a change that reduces consensus. In general, the initial temperature should be taken large enough so that the probability of accepting the change of state is approximately 0.5, regardless of whether the change is beneficial or detrimental. The temperature is then reduced slowly so that the ratio of the probabilities of two states of the network will continue to obey the Boltzmann distribution, which gives the network its name.

It is convenient to break the iterative process by which the network converges to equilibrium into a number of smaller cycles called epochs. Each epoch consists of a specified number of unit update attempts (usually equal to the number of units in the network). An exponential cooling schedule, in which the temperature is reduced by a given factor after each epoch, is common in practice:

T(k + 1) = αT(k)

Fewer epochs are required at each temperature for larger values of α (such as α = 0.98) than for smaller α (e.g., α = 0.9).

Figure 4. A simple Boltzmann machine.

Simple Fixed-Weight Boltzmann Machine

The weights for a Boltzmann machine are fixed so that the network will tend to make state transitions toward a maximum of the consensus function defined previously. For example, if we wish the simple Boltzmann machine illustrated in Fig. 4 to have exactly one unit on, the weights p and b must be chosen so that improving the configuration corresponds to increasing the consensus. Each unit i is connected to every other unit j with weight wij = −p (p > 0). These weights are penalties for violating the condition that at most one unit is on. In addition, each unit has a self-connection of weight wii = b (b > 0). The self-connection weight is an incentive (bonus) to encourage a unit to become active if it can do so without causing more than one unit to be on.

The relationship between p and b can be deduced by considering the effect on consensus in the following two situations. If unit Xi is off and none of the units connected to Xi is on, allowing Xi to become active will increase the consensus of the network by the amount b. This is a desirable change; since b > 0, it corresponds to an increase in consensus, and the network will be more likely to accept this change than to reject it. On the other hand, if one of the units connected to Xi is already on, attempting to turn unit Xi on would result in a change of consensus of the amount b − p. Thus, for b − p < 0 (i.e., p > b), the effect is to decrease the consensus, and the network will tend to reject this unfavorable change.
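The pieces above (penalty p, bonus b, the acceptance rule of Eq. (1), and the exponential cooling schedule) can be combined into a short simulation of the machine in Fig. 4. This is a sketch; the values b = 1, p = 2, the initial temperature, α = 0.9, and the epoch count are illustrative assumptions:

```python
import math
import random

def run_simple_boltzmann(n=3, b=1.0, p=2.0, t0=10.0, alpha=0.9, epochs=60):
    """Anneal an n-unit machine with bonus b on self-connections and
    penalty -p between distinct units; it converges toward a state
    with exactly one unit on, the maximum of the consensus."""
    w = [[b if i == j else -p for j in range(n)] for i in range(n)]
    x = [random.randint(0, 1) for _ in range(n)]
    t = t0
    for _ in range(epochs):
        for _ in range(n):                      # one epoch = n update attempts
            i = random.randrange(n)
            net = w[i][i] + sum(w[i][j] * x[j] for j in range(n) if j != i)
            dc = (1 - 2 * x[i]) * net           # change in consensus if i flips
            if random.random() < 1.0 / (1.0 + math.exp(-dc / t)):
                x[i] = 1 - x[i]
        t *= alpha                              # exponential cooling schedule
    return x

state = run_simple_boltzmann()
```

At the end of annealing the state has exactly one unit on with overwhelming probability: turning a first unit on raises the consensus by b, while any second unit would change it by b − p < 0 and is rejected once the temperature is low.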
Bonus and penalty connections, with p > b > 0, are used in the traveling salesman problem (TSP) network to represent the constraints for a valid tour, and in an analogous manner for other applications of fixed-weight Boltzmann machines.

Traveling Salesman Problem

The standard TSP serves as a model for many constrained optimization problems. The requirements are that a salesman visit each of a specified group of cities once and only once, returning at the end of the trip to the initial city. It is desired that the tour be accomplished in the shortest possible total distance. Many variations on this basic problem can also be solved using essentially the same approach as described here.
BOLTZMANN MACHINES
503
Architecture. A neural network solution to the TSP is usually formulated with the units arranged in a two-dimensional array. Each row of the array represents a city to be visited; each column corresponds to a position or stage of the tour. Thus, unit Ui, j is on if the ith city is visited at the jth step of the tour. A valid tour is given by a network configuration in which exactly one unit is on in each row and each column. An example of a valid tour, in which city B is visited first, city D second, city C third, and city A last, is illustrated in Fig. 5.
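This row/column encoding is easy to state in code. The helper names below are my own, not from the article; the validity check simply requires exactly one active unit in each row and each column, as in Fig. 5:

```python
def tour_to_units(tour, cities):
    """Map a tour (sequence of city names) onto the 2-D unit array:
    units[i][j] == 1 iff city i is visited at step j of the tour."""
    n = len(cities)
    units = [[0] * n for _ in range(n)]
    for step, city in enumerate(tour):
        units[cities.index(city)][step] = 1
    return units

def is_valid_tour(units):
    """A valid tour has exactly one unit on in each row and column."""
    n = len(units)
    rows_ok = all(sum(row) == 1 for row in units)
    cols_ok = all(sum(units[i][j] for i in range(n)) == 1
                  for j in range(n))
    return rows_ok and cols_ok

cities = ["A", "B", "C", "D"]
units = tour_to_units(["B", "D", "C", "A"], cities)  # the tour of Fig. 5
print(is_valid_tour(units))  # True for this configuration
```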
Figure 5. A valid solution of the four-city traveling salesman problem.

Weights. Although the connections are not shown in Fig. 5, the architecture includes three types of connections. The units within each row (and within each column) are fully interconnected. The weight on each of these connections is −p; in addition, each unit has a self-connection, with weight b. If p > b > 0, the network will evolve toward a configuration in which exactly one unit is on in each row and each column. To complete the formulation of a Boltzmann neural network for the TSP, weighted connections representing distances must be included. In addition to the weights described above (which represent the constraints), a typical unit Ui,j is connected to the units Uk,j−1 and Uk,j+1 (for all k ≠ i) by weights that represent the distances between city i and city k. Since the Boltzmann machine operates to find the maximum of the consensus function, the weights representing distances are the negatives of the actual distances. Units in the last column are connected to units in the first column by connections representing the appropriate distances as well. However, units in a particular column are not connected to units in columns other than those immediately adjacent. The bonus weight b is related to the distance weights. Let d denote the maximum distance between any two cities on the tour and consider the situation in which no unit is on in column j or in row i. Since allowing Ui,j to turn on should be encouraged, the weights should be set so that the consensus will be increased if it turns on. The change in consensus will be b − dk1,i − di,k2, where k1 indicates the city visited at stage j − 1 of the tour and k2 denotes the city visited at stage j + 1 (with city i visited at stage j). This change is greater than (or equal to) b − 2d, so ΔC will be positive if b > 2d.

Thus we see that if p > b, the consensus is larger for a feasible solution (one that satisfies the strong constraints) than for a nonfeasible solution, and if b > 2d the consensus will be higher for a short feasible tour than for a longer one.

Performance. The traveling salesman problem is a nice model for a variety of constrained optimization problems, including the asset allocation and scheduling problems discussed in the next sections. The TSP is, however, a difficult problem for the Boltzmann machine, because in order to go from one valid tour to another, several invalid tours must be accepted. The transition from valid solution to valid solution is not as difficult in many other constrained optimization problems.

Asset Allocation

Consider the problem of distributing a fixed number of assets (such as chess pieces) of several different types on a two-dimensional region (i.e., the chessboard) in arrangements that satisfy certain rules regarding their relative positions with respect to assets of other types. As an example, the placement of pieces on a chessboard must follow certain strong conditions (e.g., two pieces cannot occupy the same square on the chessboard at the same time) as well as weak conditions (e.g., black might consider it desirable to have several other chess pieces in the vicinity of black's king). There are a variety of problems of this type, including distribution of biological species and deployment of military assets.
Architecture. To illustrate the basic approach, consider first two types of chess pieces (assets) representing the black king and other black pieces. The problem to be solved by a Boltzmann machine is to generate a number of arrangements of these pieces on a chessboard, subject to specified restrictions. To accomplish this, the neural network architecture consists of two layers of units (layer X for the king and layer Y for the other pieces); each layer is an 8 × 8 array corresponding to the squares of a chessboard. If a unit is on in the king layer, it signifies that a chess piece (king) is present at that location; if a unit is on in the other layer, it indicates that some other chess piece is present at that location.

Weights. There are several types of weights required for this example. First, each unit has an excitatory self-connection, b, to encourage the unit to be active. Second, the units in each layer are fully interconnected among themselves, with inhibitory weights, which are determined so that the desired number of units will be active in that layer. The units corresponding to the same square on the chessboard (the same physical location) are connected by very strong inhibitory weights, to discourage having a king and another piece on the same square at a given time. Furthermore, if it is desirable to have several other pieces present in the general vicinity of the king, excitatory connections are included between each unit in layer X and the units in layer Y corresponding to nearby positions on the chessboard. The connection paths between units in Fig. 6 show the inhibition between units X23 and Y23 as well as the excitation between X23 and the units in the Y layer that correspond to neighboring board positions. By carefully designing the weights, the network will tend to converge to a configuration that represents a desirable arrangement of the assets.
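The weight scheme just described can be sketched as a function over pairs of units. The numeric constants and the one-square neighborhood rule below are illustrative assumptions, not values specified in the article:

```python
def weight(u, v, b=1.0, p_x=2.0, p_y=2.0, b_xy=0.5, p_h=50.0):
    """Connection weight between units u and v, each written as
    (layer, row, col) with layer 'X' (king) or 'Y' (other pieces)."""
    (lu, ru, cu), (lv, rv, cv) = u, v
    if u == v:
        return b                                # self-connection bonus
    if lu == lv:
        return -p_x if lu == "X" else -p_y      # within-layer inhibition
    if (ru, cu) == (rv, cv):
        return -p_h                             # same square, different layers
    if abs(ru - rv) <= 1 and abs(cu - cv) <= 1:
        return b_xy                             # excitation between neighbors
    return 0.0                                  # no connection otherwise

print(weight(("X", 2, 3), ("Y", 2, 3)))  # -50.0: king and other piece clash
print(weight(("X", 2, 3), ("Y", 3, 4)))  # 0.5: neighboring squares
```

The same-square penalty is deliberately much larger than any bonus, so it overrides the excitatory neighborhood connections.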
However, the random nature of the unit update process causes the network to produce a variety of solutions satisfying the specified relationships among the assets. In order to limit the number of assets of type X to the desired number, nX, there are inhibitory connections with value pX between each pair of units in layer X; similarly, to limit the number of assets of type Y to the desired number, nY, there are inhibitory connections with value pY between each pair of units in layer Y. There are also excitatory connections with weight bXY between the appropriate units in layer X and layer Y, to encourage a desirable arrangement of the assets. The relations between these weights that need to be satisfied so that the network will evolve toward a configuration in which there are nX assets of type X are deduced in a manner similar to that used for the TSP. Assume that at a particular time there are nX − 1 assets of type X. If an inactive unit in layer X is selected, the total bonus signal received by this unit should exceed the total penalty signal. In the worst case, there are no units in other layers encouraging this unit to turn on, and the only bonus signal the unit will receive is bX, its self-bonus. At the same time, it is receiving an inhibitory signal of (nX − 1)pX from the other units that are on in layer X. So, to increase the probability of the unit changing state (a desirable change), we require

(nX − 1)pX ≤ bX    (2)

On the other hand, we want no more than nX assets of type X present. If there are nX assets of type X present and an inactive unit in layer X is selected for update, the probability that it will change state needs to be minimized. The unit receives a total penalty signal of nX pX. Under the most extreme conditions, all units in layer Y that encourage the selected unit to turn on will be on; say there are mY such units. This means that the unit receives a total bonus signal of bX + mY bXY. Since it is not desirable for the unit to turn on, we require

bX + mY bXY ≤ nX pX    (3)

From Eq. (2) and Eq. (3) it follows that

(nX − 1)pX ≤ bX ≤ bX + mY bXY ≤ nX pX    (4)

These inequalities are sufficient to allow the network to evolve toward an activation pattern in which there are nX assets; corresponding inequalities hold in layer Y. To encourage the network to converge to a solution in which the assets have the desired relative arrangement, consider a configuration of the network in which there are nX assets of type X, but some of these assets are located in the wrong place. If a unit that is in the wrong place is selected for update, the probability that it changes state needs to be maximized. This unit receives a total bonus signal of bX (it does not receive any bonus from assets of type Y, since it is not in the region of encouragement for any unit that is on). Furthermore, it receives a total penalty signal of (nX − 1)pX from the other units that are on in layer X. It will be more likely for the unit to turn off if

bX ≤ (nX − 1)pX    (5)

Combining Eq. (2) and Eq. (5), we find that

bX = (nX − 1)pX

By assigning an arbitrary value for pX, the remaining weights can be determined. To prevent the existence of two or more assets of different types in the same physical location, a large penalty connection ph is introduced between units in different layers but with the same subscripts. This penalty signal should override any bonus signals going into any unit.

Figure 6. A portion of a Boltzmann machine for king (X) and other (Y) chess pieces.

Performance. Simulations with four types of assets corresponding to white king, white other, black king, and black other pieces illustrate that the number of assets of each type converges relatively quickly to the desired values. Fluctuations in the locations of assets of one type relative to other types continue until the temperature becomes very small. However, many valid solutions are generated quite quickly. These studies specified the number of assets of each type that should be present throughout the entire region, and that the other pieces should be near the king of the same color. Many extensions of these ideas are possible. For example, there is no additional difficulty encountered if white and black do not have the same number of other pieces. The logic of describing more complicated board arrangements, with more different playing pieces, is a straightforward extension of this simple example. See Ref. 4 for a more detailed description of this example.

A Time-Task-Worker Scheduling Problem

The Boltzmann machine can also be used to solve the classic problem of assigning workers to cover a variety of tasks in different time periods. As a simple example, consider the problem of scheduling instructors to teach classes that may be offered at various times. This problem can be viewed as the intersection of three separate problems: scheduling classes to be given at appropriate time periods, scheduling instructors to teach at certain time periods, and assigning instructors to teach particular classes.
A similar approach could be used for scheduling airline flights and pilots, or many other related problems. The strong constraints for generating a valid schedule include ensuring that each class is taught exactly once, that no instructor is assigned to teach more than one class during any given time period, and so on. It is also desirable that each instructor be responsible for a "fair" share of the class load. In addition, we allow for weak constraints describing instructors' preferences for certain classes and/or time periods. The problem of producing a teaching schedule for a single instructor is closely related to the TSP, with classes corresponding to the cities to be visited and the time periods corresponding to the order in which the cities are visited. Similarly, both the assignment of classes to instructors (during each time period) and the scheduling of each class, in terms of who will teach it and at what time, are instances of the TSP.

Architecture. It is convenient to visualize the architecture of a Boltzmann machine neural network for this problem as composed of several rectangular arrays of neurons, one for each instructor, stacked on top of each other in a three-dimensional array. As a simple example problem, one might assume that there are 20 classes to be taught by six instructors within five possible time periods. The architecture would then be a 5 × 6 × 20 array of neurons, with a 5 × 6 array corresponding to each class, a 6 × 20 array corresponding to each time period, and a 5 × 20 array for each instructor. An active cell, Uijk = 1, means that at time i instructor j teaches class k.

Weights. As in the Boltzmann machines for the traveling salesman and asset allocation problems, each neuron has a self-connection to encourage the unit to be active. To allow for some variation in instructors' preferences for certain classes or time periods, or factors that make it preferable to have certain classes taught at certain times, the bias for each neuron is taken to be equal to a standard bias b plus some preference, which may be between −m and m. Thus, the maximum possible bias for a unit is b + m and the minimum is b − m.
Since each class is to be taught exactly once, only one unit should be on in the array for each class; this is accomplished by fully interconnecting the units within the plane for each class with inhibitory connections of strength −p, with p > b + m > 0. Similarly, the units in each line corresponding to an instructor–class period combination are connected with weights of strength −p, to ensure that each instructor is assigned to no more than one class during any period. Finally, the units within each class offering–time period plane must be fully interconnected to ensure that each instructor is assigned an appropriate number of classes. An inhibitory weight of strength −f is needed to ensure that all instructors teach approximately the same number of classes, r (with 20 classes and six instructors, r = 4). The suitable range of values for the weight f can be deduced by considering a single unit deciding whether to be active or inactive. Ignoring other connections for the time being, the unit should be turned on if the total number of active units in the class offering–time slot plane is less than r and turned off otherwise. To encourage a plane to have exactly r units active, we require that

b − m − (r − 1)f > 0 and b + m − rf < 0

Thus, the base value of the bonus, b, must satisfy

b > (2r − 1)m
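The two conditions above confine f to the open interval ((b + m)/r, (b − m)/(r − 1)), which is nonempty exactly when b > (2r − 1)m. A quick numerical check (the values of b, m, and r below are illustrative):

```python
def f_range(b, m, r):
    """Range of inhibitory strengths f satisfying
    b - m - (r - 1) f > 0  and  b + m - r f < 0."""
    lower = (b + m) / r        # from b + m - r*f < 0
    upper = (b - m) / (r - 1)  # from b - m - (r - 1)*f > 0
    return lower, upper

b, m, r = 8.0, 1.0, 4          # b > (2r - 1) m = 7, so the range is nonempty
lower, upper = f_range(b, m, r)
print(lower, upper)  # 2.25 2.3333333333333335
```

Any f strictly between these bounds encourages each class offering–time period plane to settle at exactly r active units.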
It is possible to have different values of r for different instructors, with either different values of b for each instructor, or the value of b for the instructor with the greatest value of r used for all instructors' class offering–class period planes. In either case, p must be greater than the maximum bias applied to any one unit. It is often the case that a group of students will require the same set of classes, which should therefore be scheduled at different times. Within the plane for each time period, we include an inhibitory connection with strength −c between units that represent classes that should not conflict, to encourage the Boltzmann machine to converge to a schedule without any such conflicts. The value of c can vary depending upon how many such conflicts there are. In general, the sum of such inhibitory connection strengths must be less than b for any given unit. Usually, however, a single class conflicts with no more than one or two other classes. The value of c can range from −b + m + 1 to 0 for a unit that conflicts with only one other class.

Performance. The Boltzmann machine is better suited for the class scheduling/instructor assignment problem than for the TSP, since it is easy to move from one valid schedule to another. The system must pass through only one state with a lower overall consensus to move from one valid schedule to another. Once the transition from a state corresponding to a valid schedule to one with an invalid schedule is made, only transitions resulting in positive changes in consensus are required to reach a state corresponding to a new valid schedule. This application of Boltzmann machines to scheduling problems is based on the discussion in Ref. 6; a more extensive example, solved with a closely related neural network, is presented in Ref. 5.
Boltzmann Machine with Learning

The Boltzmann machine can also be trained for use in input–output transformations, such as pattern completion problems, when examples of the desired network action are available for supervised learning. The most interesting situations for which a learning algorithm is needed are those in which only partial information about the global states of the system is available. Thus, the network is assumed to consist of visible units (input and output units) and hidden units. Part of the training process consists of letting the network converge with the activations of the visible units clamped. After training, the input units are clamped and the network is allowed to find the correct values for the output units. Hidden units are never clamped. A simple example problem is to train the network to reproduce the input pattern on the output units after passing the signals through a hidden layer that has fewer units than the input and output layers (7). This is known as an m–p–m encoder problem, with p, the number of hidden units, less than m, the number of input or output units. The architecture shown in Fig. 3 can be used for a problem with 4 input units, 2 hidden units, and 4 output units. The presence of interconnections among the hidden units and among the output units is significant; however, there are no connections directly from the input units to the output units. A self-connection is also used for each unit, but is not shown.
The agreement between the desired probabilities for the visible units and the probabilities of the visible units when the network is at equilibrium can be increased by changing the weights. Furthermore, the weight changes can be made based on local information.

Algorithm. Boltzmann learning requires information about the probability that any two units, i and j, are both on, in two different equilibrium situations:

PCij is the probability that units i and j are both on when the visible units are clamped
PFij is the probability that units i and j are both on when only the input units are clamped

The training process can be summarized in the following algorithm:

To compute the values of PCij:
  For each training pattern:
    Clamp the visible units
    Perform several cycles of the following two steps (using a different random seed for each trial):
      Let the network converge
      For each pair of units i and j, determine whether they are both on
    Average the results for this pattern to find values of PCij for each i and j
  After the cycle is completed for each training pattern, average the results over all patterns to find the PCij values

To compute the values of PFij:
  For each training pattern:
    Clamp only the input units
    Perform several cycles of the following two steps:
      Let the network converge
      For each pair of units, determine whether they are both on
    Average the results for this pattern to find values of PFij
  After the cycle is completed for each training pattern, average the results over all patterns to find the PFij values

Compare PCij and PFij for each pair of units, and adjust the weight between them:

Δwij = μ(PCij − PFij)

where μ > 0 is the learning rate. The update of the weight connecting two units may be proportional to the difference between the probability that the units are both active when the network is running in the clamped mode and the corresponding probability when the network is in the unclamped mode, as shown in the algorithm above. Alternatively, the network can be trained using a fixed-size weight adjustment, as described in the original presentation of the network (7).

Application. As a simple example, consider the following four training vectors; only one input unit is active in each pattern, and the corresponding output pattern is the same as the input pattern.

Input        Output
(1 0 0 0)    (1 0 0 0)
(0 1 0 0)    (0 1 0 0)
(0 0 1 0)    (0 0 1 0)
(0 0 0 1)    (0 0 0 1)

During the clamped phase of training, only the 2 hidden units adjust their activations, so each epoch consists of 2 unit updates. The annealing schedule was 2 epochs at T = 20, 2 epochs at T = 15, 2 epochs at T = 12, and 4 epochs at T = 10. After the network cools, statistics are gathered for 10 epochs at T = 10 to determine the fraction of the time that units i and j are both on. This process is repeated for each of the four training vectors, and the results for all training vectors are averaged to give PCij for each pair of units that are connected. The process of determining PFij uses the same annealing schedule and gathers statistics for the same number of epochs at T = 10. However, since no units are clamped during this second phase, each epoch consists of 10 unit update attempts. Once the values of PCij and PFij have been found, the weights are updated, and the entire weight update cycle is repeated until the weights have stabilized or the differences between PCij and PFij are sufficiently small. In 250 tests of the 4-2-4 encoder problem, the network always found one of the global minima and remained at that solution. As many as 1810 weight update cycles were required, but the median number was 110 (7). After training, the network can be applied by clamping the input units and allowing the net to converge. The activations of the output units then give the response of the network.

The algorithm as originally presented uses a fixed weight-step increment if PCij > PFij and the same-sized decrement for the weights if PCij < PFij. Difficulties can occur when only a few of the 2^v possible states for the visible units are specified. Rather than trying to demand that other (nonspecified) states never occur, it is recommended to use noisy inputs with low, but nonzero, probabilities. For the published simulations described previously, noise was added on each presentation of a training pattern: a component that is 1 in the true training vector was set to 0 with probability 0.15, and 0 components were set to 1 with probability 0.05.
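The weight adjustment at the heart of the algorithm can be sketched as follows. The co-occurrence statistics PC and PF would be gathered during the clamped and free phases described above; here they are simply supplied as illustrative numbers:

```python
def boltzmann_weight_update(w, pc, pf, mu=0.1):
    """One Boltzmann learning step: w_ij += mu * (PC_ij - PF_ij),
    applied to every (symmetric) pair of distinct units."""
    n = len(w)
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] += mu * (pc[i][j] - pf[i][j])
    return w

# Toy 2-unit statistics: the units co-occur more often when clamped,
# so the connection between them should strengthen.
w  = [[0.0, 0.0], [0.0, 0.0]]
pc = [[0.0, 0.8], [0.8, 0.0]]
pf = [[0.0, 0.3], [0.3, 0.0]]
w = boltzmann_weight_update(w, pc, pf)
print(round(w[0][1], 3))  # 0.05
```

Because PC exceeds PF for this pair, the update increases the weight, pushing the free-running statistics toward the clamped ones.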
ALTERNATIVE FORMULATIONS OF THE BASIC BOLTZMANN MACHINE

Variations

As mentioned earlier, the constraint satisfaction problems to which the Boltzmann machine is applied can be formulated as either maximization or minimization problems. Ackley, Hinton, and Sejnowski (7) define the energy of a configuration as

E = −Σi<j wij xi xj + Σi θi xi

where θi is a threshold and biases are not used. The difference in energy if unit Xk changes from off to on is

ΔEk = −θk + Σi wik xi

The Boltzmann machine is also used with either of two slightly different acceptance conditions, namely:

1. Set the output of the unit to 1 with probability given by Eq. (1), regardless of the current activity of the unit.
2. Accept the proposed change in activation if it improves the solution, but accept a change that moves the solution in the opposite direction with probability given by Eq. (1).

See Refs. 2 and 8 for further discussion.

Markov Chain Process

The Boltzmann machine can be described in terms of a Markov chain process. Each stage consists of the following steps:

1. Generate a potential new configuration of the network
2. Accept or reject the new configuration
3. Reduce the temperature according to the annealing schedule

For the Boltzmann machine, the generating probability is given by the Gaussian distribution

G = T^(−0.5n) exp(−D²/T)

where n is the number of units in the network and D is the number of units whose activations change in going from the current configuration to the new configuration. Note that as T is reduced, the generation probability G also changes. Thus, the probability of generating a candidate configuration depends only on the temperature and the number of units that change their state. In the preceding discussion, all configurations in which exactly one unit changes its state are equally likely to be chosen as the candidate state at any time; configurations in which more than one unit changes its state (D > 1) are generated with probability 0. The probability of accepting the new configuration depends on the current temperature and the change in consensus ΔC that would result, according to Eq. (1). This formulation is useful for theoretical analysis of the process.

Cooling Schedules

The success of a Boltzmann machine is closely related to how slowly the temperature is decreased and how many update trials are performed at each temperature. An exponential cooling schedule, Tk = α^k T0, is very common in practice (8). This cools the system rather quickly at high temperatures and then very slowly at low temperatures. As long as enough trials are performed at each temperature to allow each unit to attempt to change its state several times, good results are obtained. Geman and Geman (9) present a theoretical proof that if Tk ≥ c/ln(1 + k), the system converges to an optimal state (as k → ∞), where k is the number of epochs and c is a constant that does not depend on k. A very slow decrease of the temperature is necessary, but with this slow decrease, only one epoch is required at each value of k.

EXTENSIONS OF THE BOLTZMANN MACHINE

The Boltzmann machine is closely related to several more general types of evolutionary computing. The most important of these more general approaches are summarized in the following sections.

Mean-Field Annealing

One of the most popular modifications of the original Boltzmann machine replaces the probabilistic action of a binary neuron with an analog neuron, the activation of which is the average (mean) value of the binary neuron at any particular temperature. In the mean-field approximation, with an energy function E to be minimized, the value of an analog neuron takes the form

vi = tanh(Ei/T), where Ei = −θi + Σj wji vj
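A minimal sketch of this mean-field computation, solving the equations above by fixed-point iteration at a single temperature (the two-unit network and all parameter values below are illustrative assumptions):

```python
import math

def mean_field_iterate(w, theta, t, iters=200):
    """Iterate v_i = tanh(E_i / T) with E_i = -theta_i + sum_j w_ji v_j
    until (approximate) convergence at temperature T."""
    n = len(w)
    v = [0.1] * n  # small nonzero start, so the symmetric state can break
    for _ in range(iters):
        new_v = []
        for i in range(n):
            e_i = -theta[i] + sum(w[j][i] * v[j] for j in range(n))
            new_v.append(math.tanh(e_i / t))
        v = new_v
    return v

# Two mutually excitatory units with zero thresholds: below the critical
# temperature the activations settle near a nonzero fixed point.
w = [[0.0, 1.0], [1.0, 0.0]]
v = mean_field_iterate(w, theta=[0.0, 0.0], t=0.5)
print(v)
```

In an annealing run, the same iteration would be repeated while T is lowered, which is where the temperature-dependent behavior around Tc described below appears.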
In general, little change occurs to vi for temperatures above a critical value, Tc. Thus, the annealing process can proceed more rapidly at higher temperatures and can be slowed when the temperature reaches the point at which changes to the activations of the neurons in the network occur. Alternatively, the mean-field equations can be solved iteratively. This gives a direct connection between the Boltzmann machine and the continuous Hopfield network with noise (see Ref. 3 for a discussion of the Hopfield network). For further discussion of mean-field annealing see Refs. 2, 10, and 11; it is used for applications to scheduling problems (5) and the knapsack problem (12). Other Related Networks High-order Boltzmann machines (HOBM) allow for terms of higher order in the consensus function than those for the standard Boltzmann machine (in which the consensus function has only first- and second-order terms). The theoretical results, such as uniqueness of the learned solution, which have been established for these HOBM do not hold for the Boltzmann machine with hidden units. See Ref. 13 for discussion and proofs. The Helmholtz machine is a fairly general unsupervised learning architecture with feedback connections; Boltzmann machines are one simple specific variety of Helmholtz machine. For a detailed discussion see Ref. 14; this article also includes an extensive bibliography of relevant papers. For Boltzmann machines in which the hidden and output units have a special hierarchical organization, learning can be accomplished using gradient descent (as for the popular backpropagation training algorithm of feedforward neural networks). Simulations with the N-bit parity problem and detection of hidden symmetries in square pixel arrays have demonstrated the network’s ability to learn quickly and to generalize successfully. See Ref. 15 for further discussion.
SUMMARY AND CONCLUSIONS

One of the potential advantages of a neural network approach to problem solving is its inherent parallelism. Although units updating in parallel may make their decisions to accept or reject a change of state based on information that is not completely up to date, several parallel schemes for the Boltzmann machine have given promising results (1). These schemes can be characterized as either synchronous or asynchronous, and as either limited or unlimited. In limited parallelization, small groups of neurons that do not directly affect each other can update at the same time without any possibility of errors in the calculation of the change of consensus. This scheme, however, is not well suited to massive parallelism, since the number of sets of independent units is small. In synchronous unlimited parallelization, all units compute their changes in consensus and acceptance probabilities independently, and any potential difficulty from erroneously calculated acceptance probabilities is simply ignored. In asynchronous parallelization, each unit has its own cooling schedule, and state transitions are performed simultaneously and independently. Since the probability of any unit changing its state approaches 0 as the network cools, the likelihood of two connected units changing their states based on out-of-date information also decreases. Simulations using this type of parallelization for a variety of combinatorial problems give results that are comparable to other methods.

Problems from many fields can be formulated in a manner for which a layered Boltzmann machine solution is of interest. Applications to biological ecosystems and urban planning are two promising areas. The results presented here suggest that layered Boltzmann machines are an interesting neural network approach to applications for which some variation in the solutions is desirable.

BIBLIOGRAPHY

1. E. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines. Chichester: Wiley, 1989.
2. A. Cichocki and R. Unbehauen, Neural Networks for Optimization and Signal Processing. Chichester: Wiley, 1993.
3. L. V. Fausett, Fundamentals of Neural Networks: Architectures, Algorithms, and Applications. Englewood Cliffs, NJ: Prentice Hall, 1994.
4. W. Elwasif, L. V. Fausett, and S. Harbaugh, Boltzmann machine generation of initial asset distributions. In S. K. Rogers and D. W. Ruck (eds.), Proceedings, Applications and Science of Artificial Neural Networks, SPIE, Vol. 2492, 1995, pp. 331–340.
5. L. Gislen, C. Peterson, and B. Soderberg, Complex scheduling with Potts neural networks, Neural Computat., 4: 805–831, 1992.
6. R. S. Schumann, Analysis of Boltzmann Machine Neural Networks with Applications to Combinatorial Optimization Problems, M.S. thesis, Florida Institute of Technology, 1992.
7. D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Sci., 9: 147–169, 1985.
8. S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, Optimization by simulated annealing, Science, 220 (4598): 671–680, 1983.
9. S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell., PAMI-6: 721–741, 1984.
10. C. Peterson and J. R. Anderson, A mean field learning algorithm for neural networks, Complex Syst., 1: 995–1019, 1987.
11. C. Peterson and B. Soderberg, A new method for mapping optimization problems onto neural networks, Int. J. Neural Syst., 1: 3–22, 1989.
12. M. Ohlsson, C. Peterson, and B. Soderberg, Neural networks for optimization problems with inequality constraints: the knapsack problem, Neural Computat., 5: 331–339, 1993.
13. F. X. Albizuri, A. D'Anjou, M. Grana, and J. A. Lozano, Convergence properties of high-order Boltzmann machines, Neural Netw., 9: 1561–1567, 1996.
14. P. Dayan and G. E. Hinton, Varieties of Helmholtz machine, Neural Netw., 9: 1385–1403, 1996.
15. L. Saul and M. I. Jordan, Learning in Boltzmann trees, Neural Computat., 6: 1174–1184, 1994.
LAURENE V. FAUSETT University of South Carolina—Aiken
BOLTZMANN TRANSPORT EQUATION. See SEMICONDUCTOR BOLTZMANN TRANSPORT EQUATION.
BOOKKEEPING. See ACCOUNTING.
Wiley Encyclopedia of Electrical and Electronics Engineering
Cerebellar Model Arithmetic Computers
Standard Article
S. Commuri, CGN & Associates, Inc., Peoria, IL
F. L. Lewis, The University of Texas at Arlington, Fort Worth, TX
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5104
Article Online Posting Date: December 27, 1999
Abstract. The sections in this article are: Background on CMAC Neural Networks; Background on Nonlinear Dynamical Systems; Passivity-Based Design; Evaluation.
CEREBELLAR MODEL ARITHMETIC COMPUTERS
The nonlinearities in the dynamics of practical physical systems make their control a complex problem. Traditionally, the plant dynamics were first modeled and verified through off-line experimentation. The control was then designed using linear system design techniques or geometric techniques with linear analogues. These techniques were successful when the model accurately described the process. The results for systems with unknown dynamics were at first limited, by and large, to ad hoc techniques and simulations involving assumptions such as certainty equivalence. These approaches are limited by the complexity of the model and cannot accommodate variation of system parameters. This has resulted in the development of controllers that can learn the process dynamics as well as adapt to parametric changes in the system. Adaptive controllers attempt to learn the plant characteristics while simultaneously achieving the control objectives. These controllers tune the adaptation parameters using the input-output measurements of the plant (1–4). While the classical adaptive methods guarantee stability for a large class of systems, the system must satisfy assumptions of linearity in the unknown system parameters, and a regression matrix must be computed for each system by often tedious preliminary off-line analysis. In recent years, learning-based control has emerged as an alternative to adaptive control. Notable among this class of controllers are the neural network (NN) and fuzzy logic-based controllers. In early neural networks, learning was accomplished in an off-line fashion by associating input-output pairs during training cycles. While neural networks are very successful in a variety of applications, such as pattern recognition, classification, and system identification, their application in closed-loop control is fundamentally different.
In the literature, neural networks have been utilized mostly in the indirect control configuration, that is, identification-based control, where back-propagation NN weight tuning is used to identify the system off-line (5–7). These methods are essentially open-loop control and do not guarantee stability in closed-loop applications. Unfortunately, when the neural network is employed in the feedback configuration, the gradients required for tuning cannot be found if the plant has unknown dynamics. Thus proofs of stability and guaranteed tracking performance are absent in these works (6–10). Rigorous research in NNs for closed-loop control is being pursued by several research groups. Narendra et al. (5,6,9) emphasize finding the gradients needed for backpropagation tuning. Sadegh (11) employs approximate calculations of the gradient to establish stability results, and Cristodoulou (12), Ioannou (13), Sadegh (11), and Slotine (14) offer rigorous proofs of performance in terms of tracking error stability and bounded NN weights. All these works assume that the NN is linear in the unknown parameters by employing single-layer NNs or recurrent NNs with special structures. While the use of multilayer NNs in system identification has been rigorously investigated, only recently have researchers focused on closed-loop control of nonlinear systems using multilayer NNs, in either the continuous-time or the discrete-time domain. In Refs. 15–17 it has been shown that NN controllers can effectively control complex nonlinear systems without requiring assumptions such as linearity in the parameters, availability of a regression matrix, or persistency of excitation. These NNs are all multilayer nonlinear networks, and tuning laws guaranteeing tracking as well as stability of both the closed-loop system and the NN have been established for both the continuous-time and discrete-time cases. The approximation property of fully connected NNs is basic to their application in the control of complex dynamical systems. It has been shown that multilayer feedforward NNs are theoretically capable of representing arbitrary mappings if a sufficiently large number of nodes are included in the hidden layers (15). Since all the weights are updated during each learning cycle, the learning is essentially global in nature. This global weight updating does not exploit the local NN structure and thus slows down learning. Furthermore, it is necessary to have a large number of nodes in each layer to guarantee a good function approximation, and it has been shown that the speed of learning is inversely proportional to the number of nodes in a layer (15).
The fully connected NNs suffer from the additional drawback that the function approximation is sensitive to the training data. Thus the effectiveness of a general multilayer NN is limited in problems requiring on-line learning. To address these issues, the cerebellar model articulation controller (CMAC) NN (18) was proposed for closed-loop control of complex dynamical systems (19–25). The CMAC is a nonfully connected perceptron-like associative memory network that computes a nonlinear function over a domain of interest. The CMAC NN is capable of learning nonlinear functions extremely quickly due to the local nature of its weight updating (26). The earliest contributions to the study of the behavior and properties of CMACs were by H. Tolle and his group of researchers. Their findings on the approximation properties of, and learning in, the CMAC are well presented in their classic book Neurocontrol (27). Brown and Harris (28,29) also studied the use of the CMAC in adaptive modeling and control of systems. The importance of the convergence and stability properties of the CMAC in closed-loop control was established by Parks and Militzer (30,31). Ellison (32) independently presented similar results using CMACs for closed-loop control of robots. Recently, Commuri (33–39) established a method for passivity-based design of the learning laws for the CMAC that enables modular design for on-line learning and guaranteed closed-loop control. This article presents a comprehensive study of the use of CMAC NNs in control applications. The structure and properties of the CMAC NN that make it highly suited for closed-loop control are studied, and the weight-update laws for guaranteed stability, tracking performance, and robustness are discussed.

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

BACKGROUND ON CMAC NEURAL NETWORKS

Structure of CMAC Neural Networks

The cerebellar model arithmetic computer (CMAC) is a perceptron-like associative memory that performs nonlinear function mapping over a region of the function space. This highly structured, nonfully connected neural network model was established by J. Albus (22,26) based on a model of the human memory and neuromuscular control system. Figure 1 shows a typical application of a CMAC neural network, where the CMAC is used to manufacture a continuous function g(x) = [g1(x), g2(x), ..., gm(x)]^T, where x ∈ R^n and g : R^n → R^m. The nonlinear function g(x) produced by the CMAC is composed of two primary functions

R : X ⇒ A
P : A ⇒ Y      (1)

where X is the continuous n-dimensional input space, A is an NA-dimensional association space, and Y is the m-dimensional output space. The function R(.) is fixed and maps each point x in the input space X onto an association vector α = R(x) in the association space A. The function P(α) computes an output y ∈ Y by projecting the association vector determined by R(x) onto a vector of adjustable weights w such that

y = P(α) = w^T α      (2)

R(x) in Eq. (1) is the multidimensional receptive field function, which assigns an activation value to each point x = (x1, ..., xn) in the input space X. From Eq. (2), it can be seen that the output of the CMAC is a linear combination of the weights (18). In order to obtain the multidimensional receptive field functions, the input space is first discretized, and activation
functions of finite span are defined on each of the intervals. A receptive field function is said to be active if it has a nonzero activation value for a particular input. Standard CMAC implementations have a finite number of maximally active receptive field functions for any given input vector. Figure 2 depicts some standard receptive field functions, and Fig. 3 shows a multidimensional receptive field function of order 2 with an overlap of 4. The width of the receptive field function controls the output generalization of the CMAC, and the offset between adjacent receptive field functions controls the input quantization and the output resolution (18). Further, the function generated by the CMAC depends on the type of receptive fields used: splines of order one generate staircase functions, while splines of order two generate piecewise-linear output functions. The CMAC is a nonfully connected perceptron-like network that computes a nonlinear function over a domain of interest. Since the receptive field functions have finite span, an element in the input space excites only a finite number of them. Let x be the input vector presented to the network and α be the corresponding vector in the association space A. Let α* be the set of active (nonzero) elements of α. Since the output is a linear combination of these nonzero values, it is necessary only to adjust the weights w attached to α* in Eq. (2) to change the output. Thus the CMAC NN is capable of learning nonlinear functions extremely quickly due to this local nature of its weight updating (26).

Generalization versus Dichotomy. Since the network need not have a unique set of association cells for every possible input pattern, a given association cell can be activated by different input patterns. For example, let two inputs x1 and x2 activate two overlapping sets of association vectors α*1 and α*2.
Now an adjustment in the weights corresponding to α*1 will have the unintended consequence of influencing the output due to α*2, which can be either beneficial or detrimental to the implementation. In general, the network's ability to generalize between similar input patterns is determined by the overlap α*1 ∩ α*2. If α*1 ∩ α*2 is null, then the two input patterns are independent; the degree to which the outputs for two input patterns x1 and x2 are similar is determined by the extent of this overlap (22). Similarly, the network's ability to dichotomize, that is, to produce dissimilar outputs for the two input patterns x1 and x2, depends on the nonintersecting elements of α*1 and α*2.

Figure 1. CMAC architecture for the approximation of a vector function.

Figure 2. Standard CMAC receptive field functions of orders 1, 2, and 4.

Effects of Hash Coding. It can be seen from the above discussion that the CMAC can learn any function by proper choice of the weights. The mapping generated, however, depends on the actual implementation of the CMAC. Let Ap be the number of association cells physically implemented by the CMAC and A* be the number of maximally active elements of A for any given pattern. In practice, Ap is chosen to be at least 100 times A*. Then it can be shown that a
Figure 3. Multidimensional receptive field functions of order 2 and overlap 4.
unique mapping from X → A is theoretically possible if R^n < 99|A*|, where R^n is the number of possible input patterns (22). The number of association cells in any CMAC is determined by the level of discretization of the input space. If the discretization is very fine, there will be too many association cells, and it becomes physically impossible to implement the CMAC. This problem can be solved by hash coding (22,40), in which the physical memory is kept at a manageable size by mapping many association cells to the same physical memory locations. Hashing has the undesirable side effect of "collisions." If the actual number of memory locations available is two thousand, namely Ap = 2000 and A* = 20, then the probability of two or more cells being mapped into the same cell in A is approximately 0.1 (22). As long as this probability is low, collisions are not a serious problem and result only in reduced resolution of the output. Another effect of hashing is interference in the form of unwanted generalization between input vectors. It can be shown that this effect is insignificant as long as the overlap is not large compared to the total number of cells in A*. If, for example, Ap = 20,000 and A* = 20, then the probability of two or more collisions is 0.01, and the probability of two or more cells spuriously overlapping is 0.0002. Thus, in the implementation of the CMAC, it is desirable to keep A* small to minimize the amount of computation required. It is also desirable to keep the ratio A*/Ap small to minimize the probability of overlap between widely separated input patterns.

Constructive Method for Linear Multidimensional Receptive Field Functions

The structure of the CMAC discussed in the preceding section gives insight into the nature of the function generated by the
CMAC. However, in practical applications the reverse is often necessary: given a particular function to approximate, the task is to determine the CMAC structure that will generate the required map. This problem was recently addressed in Ref. 33, where methods to construct CMACs that guarantee an approximation for a class of functions were established. In this subsection these results are summarized.

One-Dimensional Receptive Field Functions. Given x = [x1, x2, ..., xn] ∈ R^n, let [xi_min, xi_max] ⊂ R, 1 ≤ i ≤ n, be the domain of interest. For this domain, select integers Ni and strictly increasing partitions πi = {x_{i,1}, x_{i,2}, ..., x_{i,Ni}}, 1 ≤ i ≤ n (i.e., xi_min = x_{i,1} < x_{i,2} < ··· < x_{i,Ni} = xi_max). For each component of the input space, define the receptive field functions as

μ_{i,1}(xi) = Λ(−∞, x_{i,1}, x_{i,2})(xi)
μ_{i,j}(xi) = Λ(x_{i,j−1}, x_{i,j}, x_{i,j+1})(xi),  1 < j < Ni      (3)
μ_{i,Ni}(xi) = Λ(x_{i,Ni−1}, x_{i,Ni}, ∞)(xi)

where the triangular functions Λ(.) are defined as

Λ(a, b, c)(y) = (y − a)/(b − a)  for a ≤ y ≤ b  (= 1 if a = −∞)
             = (c − y)/(c − b)  for b ≤ y ≤ c  (= 1 if c = ∞)      (4)
             = 0                otherwise

The leftmost and rightmost receptive field functions are selected such that every value of xi corresponds to at least one receptive field function. Given the partition πi = {x_{i,1}, x_{i,2}, ..., x_{i,Ni}}, the one-dimensional receptive field functions selected as in Eqs. (3) and (4) are shown in Fig. 4.

Figure 4. One-dimensional receptive field function: Ni = 5 spanning R^1.

Multidimensional Receptive Field Functions. Given any x = [x1, ..., xn] ∈ R^n, define multidimensional receptive field functions as

R_{j1,j2,...,jn}(x) = μ_{1,j1}(x1) μ_{2,j2}(x2) ··· μ_{n,jn}(xn) / [Σ_{jn=1}^{Nn} ··· Σ_{j1=1}^{N1} Π_{i=1}^{n} μ_{i,ji}(xi)]      (5)

It is easy to see that the receptive fields so defined are normalized n-dimensional second-order splines.

Lemma 1. The multidimensional receptive field functions selected in Eq. (5) satisfy three significant properties:
a. Positivity: R_{j1,j2,...,jn}(x) > 0 for all x ∈ (x_{1,j1−1}, x_{1,j1+1}) × ··· × (x_{n,jn−1}, x_{n,jn+1}).
b. Compact support: R_{j1,j2,...,jn}(x) = 0 for all x ∉ (x_{1,j1−1}, x_{1,j1+1}) × ··· × (x_{n,jn−1}, x_{n,jn+1}).
c. Normalization: Σ_{jn=1}^{Nn} ··· Σ_{j2=1}^{N2} Σ_{j1=1}^{N1} R_{j1,j2,...,jn}(x) = 1 for all x.

According to Lemma 1(b), for any prescribed value of x ∈ R^n, only 2^n values of R_{j1,j2,...,jn}(x) are nonzero.

Salient Properties of the Output of CMAC. Given any element x of the input space, the receptive field values R_{j1,j2,...,jn}(x) are elements in the association space A. The output of the CMAC neural network is computed by projecting this association vector onto a vector of adjustable weights w. Let w_{(j1,...,jn)} be the weight associated with the index (j1, ..., jn). Then the function manufactured by a single-output CMAC can be expressed as

g(x) = Σ_{jn=1}^{Nn} ··· Σ_{j1=1}^{N1} w_{(j1,...,jn)} R_{j1,...,jn}(x) : R^n → R      (6)

A general CMAC is easily constructed by using this framework as follows.

Lemma 2. A multi-input multi-output CMAC with output g(x) : R^n → R^m is a nonlinear mapping defined as

g(x) = [g1(x), g2(x), ..., gm(x)]^T      (7)

where

g_k(x) = Σ_{jn=1}^{Nn} ··· Σ_{j1=1}^{N1} w_{k,(j1,...,jn)} R_{j1,...,jn}(x) : R^n → R      (8)

The function g(x) in Eq. (7) is Lipschitz continuous. In fact, according to the normalization property of Lemma 1(c), Eq. (8) is a convex combination of the weights w.
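To make Eqs. (3)-(5) concrete, the following sketch builds the triangular one-dimensional fields and the normalized multidimensional fields, and checks Lemma 1 numerically. The partition values, variable names, and evaluation point are illustrative assumptions, not taken from the article.

```python
import numpy as np
from itertools import product

def mu(knots, j, x):
    """Triangular receptive field mu_{i,j} on the partition `knots`
    (0-based j), with the end fields saturating to 1 as in Eq. (4)."""
    N = len(knots)
    left = -np.inf if j == 0 else knots[j - 1]
    center = knots[j]
    right = np.inf if j == N - 1 else knots[j + 1]
    if left <= x <= center:
        return 1.0 if left == -np.inf else (x - left) / (center - left)
    if center <= x <= right:
        return 1.0 if right == np.inf else (right - x) / (right - center)
    return 0.0

def receptive_fields(partitions, x):
    """Normalized multidimensional fields R_{j1..jn}(x) of Eq. (5)."""
    vals = {}
    for idx in product(*(range(len(p)) for p in partitions)):
        vals[idx] = np.prod([mu(p, j, xi)
                             for p, j, xi in zip(partitions, idx, x)])
    total = sum(vals.values())
    return {idx: v / total for idx, v in vals.items()}

parts = [np.linspace(-2, 2, 5), np.linspace(-2, 2, 5)]   # pi_1, pi_2
R = receptive_fields(parts, x=(0.3, -1.1))
active = {idx: v for idx, v in R.items() if v > 0}
print(len(active))           # only 2^n fields are active (Lemma 1b)
print(sum(R.values()))       # normalization to 1 (Lemma 1c)
```

For this interior point, exactly 2^2 = 4 fields are nonzero and the field values sum to one, so the CMAC output of Eq. (6) is indeed a convex combination of four weights.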
Function Approximation Properties of CMAC Neural Networks. In recent years, neural networks have been used in the control of systems with unknown dynamics. In the early applications, NNs were used as indirect adaptive controllers: the NN was used to identify the system off-line, and the controllers were developed using the identified model. In later applications, on-line learning laws were developed, and the NNs were used as direct adaptive controllers (see NEURAL NETWORKS FOR FEEDBACK CONTROL). In all of these approaches, the approximation property of fully connected NNs is basic to their application in the control of complex dynamical systems. However, the effectiveness of a general multilayer NN is limited in problems requiring on-line learning. Since in a CMAC only a finite number of receptive fields are active for any given input, an efficient controller for systems with unknown dynamics can be implemented using CMAC NNs. In the early approaches, learning in the CMAC was accomplished off-line: the CMAC was presented with training samples, and the corresponding weights were updated until the network could reconstruct the unknown function with reasonable accuracy over the domain of interest. In these works the CMAC weight update rules were similar to the least mean
squares (LMS) algorithm, which ensured convergence of CMAC learning to some local minimum. The convergence properties of the CMAC were also studied by Wong and Sideris (40), who showed that CMAC learning essentially solves a linear system with methods similar to the Gauss–Seidel method. This results in a highly accurate learning algorithm that converges exponentially fast. Therein the following result was also established.

Theorem 3 (40). Given a set of training samples composed of input-output pairs from R^n → R^m, the CMAC always learns the training set with arbitrary accuracy if the input space is discretized such that no two training samples excite the same set of association cells.

Recently it has been shown that CMACs can be constructed to approximate nonlinear functions with arbitrary accuracy. Consider the partition πi, 1 ≤ i ≤ n, given earlier. Then the following theorem can be proved (33).

Theorem 4. The function estimate g(x) defined in Eq. (6) uniformly approximates any C1-continuous function f(x) : R^n → R^m on Ω ⊂ R^n. Specifically, given any ε > 0 and L, the Lipschitz constant of f(.) on Ω, the maximum partition size δ can be chosen such that

||f(x) − g(x)|| ≤ ε      (9)

where

δ ≤ ε/(mL)      (10)

and

δ = max(||x − y||)  ∀ x, y ∈ [x_{1,j1−1}, x_{1,j1}) × ··· × [x_{n,jn−1}, x_{n,jn}), ∀ ji      (11)

According to the theorem, an estimate of a given function f(x) is given by g(.) = [g1, g2, ..., gm]^T with

g_k(x) = Σ_{jn=1}^{Nn} ··· Σ_{j1=1}^{N1} w_{k,(j1,...,jn)} R_{j1,...,jn}(x)      (12)

for some weights w. In fact, the weights can be shown to be samples of the function components to be approximated at the knot points of the partition.

Implementation Properties of CMAC Neural Networks

The output in Eqs. (7) and (8) of the CMAC can be represented as a function from R^n to R^m and expressed in vector notation as

g(x) = w^T Γ(x)      (13)

where w is a matrix containing the set of weights and Γ is a vector of the receptive field activation values. The definition of w and Γ is not unique, though w^T Γ is equal to the right-hand side of Eqs. (7) and (8). In the implementation of CMAC neural networks, it is customary to employ the following submappings (18,22,26):

R : X ⇒ M
Q : M ⇒ I      (14)
Γ : I × M ⇒ A

where R(x) is the receptive field function described in Eq. (5), Q is a quantization function, M is a matrix of receptive field activation values, and I is an array of column vectors used to identify the locations of the active receptive fields along each input dimension. Let the receptive field functions along each dimension be chosen to have an overlap of two. Then in all only 2^n receptive fields will be active for a given input x. These active receptive fields can be located by constructing a set of active indices of α. Given the partition on the input space, for any x ∈ R^n there exists a unique n-tuple (j1, j2, ..., jn) such that x ∈ Ω_{j1,j2,...,jn}. Let k1, k2, ..., kn be positive integers such that (x_{j1}, x_{j2}, ..., x_{jn}) ∈ π_{k1,k2,...,kn}. Given this index set (k1, k2, ..., kn) and selecting left-hand odometer ordering, the indicator function is constructed as

I = k1 + (k2 − 1)N1 + (k3 − 1)N1N2 + ···      (15)

By Lemma 1, the elements of α not addressed by I are equal to zero. Thus Q is a map from the N1 × N2 × ··· × Nn space composed of the tensor products of the receptive field functions to an (N1N2···Nn)-element one-dimensional space I. The map Γ is now defined by I and M: specifically, the 2^n nonzero values of R(x) are placed into the matrix Γ(x) at the locations specified by I(x). This ordering of the indices uniquely determines w and Γ in Eq. (13).

Corollary 1. Given any C1-function f(.), ideal weights w can be found such that

f(x) = w^T Γ(x) + ε      (16)

where ε is the function estimation error and ||ε|| ≤ εN, with εN a given bound.

BACKGROUND ON NONLINEAR DYNAMICAL SYSTEMS

The earliest use of CMACs in control applications was in the control of robot manipulators (22,26,40). In these applications, the CMAC was first trained to learn the inverse dynamics of the system to be controlled (41,42). The training law used in these applications is similar to the Widrow-Hoff training procedure for linear adaptive elements (43,44),

dw = β (Vo − f(so))

where

dw is the weight vector adjustment,
β is the learning gain between 0 and 1,
Vo is the applied control command vector during the previous control cycle,
so is the observed state of the system in the previous control cycle, and
f(so) is the predicted drive value.

Figure 5. Block diagram of learning controller (32) for robot control. The output of the controller has two components: a fixed part and a variable part that depends on the response determined by the CMAC memory.
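The lookup-and-update cycle behind this training law can be sketched with a deliberately minimal, illustrative CMAC. All sizes, names, and the target function below are assumptions, not taken from the article: C overlapping binary tilings of a scalar input give C active association cells per input, and only those C weights receive the Widrow-Hoff-style correction, which is what makes the learning local and fast.

```python
import numpy as np

class SimpleCMAC:
    """Minimal 1-D CMAC sketch: C offset tilings of order-1 (binary)
    receptive fields; each input activates exactly C cells."""

    def __init__(self, n_tilings=4, n_bins=16, x_min=0.0, x_max=1.0):
        self.C = n_tilings
        self.n_bins = n_bins
        self.x_min, self.x_max = x_min, x_max
        self.w = np.zeros((n_tilings, n_bins))   # one weight row per tiling

    def active_cells(self, x):
        # Each tiling is offset by a fraction of one bin width.
        span = self.x_max - self.x_min
        idx = []
        for t in range(self.C):
            u = (x - self.x_min) / span * self.n_bins + t / self.C
            idx.append(min(int(u), self.n_bins - 1))
        return idx

    def output(self, x):
        # Output is a linear combination of the C active weights (Eq. 2).
        return sum(self.w[t, i] for t, i in enumerate(self.active_cells(x)))

    def train(self, x, target, beta=0.5):
        # Widrow-Hoff-style correction: only the active weights change.
        err = target - self.output(x)
        for t, i in enumerate(self.active_cells(x)):
            self.w[t, i] += beta * err / self.C

cmac = SimpleCMAC()
for _ in range(200):
    for x in np.linspace(0.0, 1.0, 50):
        cmac.train(x, np.sin(2 * np.pi * x))
```

After a few sweeps the table reproduces the target to within the cell resolution; inputs falling outside the trained cells leave all other weights untouched, illustrating the locality discussed above.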
When the system is initialized, the weights contain all zeros, so the output of the CMAC is zero. As the CMAC learns the inverse dynamics of the system, the CMAC output approaches the actual control values required, and the CMAC takes over from the fixed-gain controller (see Fig. 5). To illustrate the application of CMAC NNs in the control of nonlinear systems with unknown dynamics, three classes of systems from the literature are presented. The systems represented by the dynamical equations in the following subsections are important from the standpoint of control, since most physical systems to be controlled can be expressed in the form of these equations. In each case the dynamical representation is given, followed by the CMAC formulation of the controller.
Discrete-Time Representation of a Nonlinear System in Brunowskii Canonical Form

The description of a nonlinear system in Brunowskii canonical form is given as

x1(k + 1) = x2(k)
x2(k + 1) = x3(k)
⋮
x_{n1}(k + 1) = f1(x(k)) + b1 u1(k) + d1(k)
x_{n1+1}(k + 1) = x_{n1+2}(k)
x_{n1+2}(k + 1) = x_{n1+3}(k)
⋮
x_{n1+n2}(k + 1) = f2(x(k)) + b2 u2(k) + d2(k)
⋮
x_{n1+n2+···+n_{m−1}+1}(k + 1) = x_{n1+n2+···+n_{m−1}+2}(k)
x_{n1+n2+···+n_{m−1}+2}(k + 1) = x_{n1+n2+···+n_{m−1}+3}(k)
⋮
x_n(k + 1) = f_m(x(k)) + b_m u_m(k) + d_m(k)      (17)

with the output equation given as

y(k) = [x1(k), x_{n1+1}(k), ..., x_{n1+n2+···+n_{m−1}+1}(k)]^T      (18)

where y(k) denotes the sampled value of y(t) at t = kT, and T is the sampling period. It is assumed that the coefficients bi, 1 ≤ i ≤ m, are known. d(k) = [d1(k), d2(k), ..., dm(k)]^T is an unknown disturbance with known upper bound, so that ||d|| < bd; x(k) = [x1(k), x2(k), ..., xn(k)]^T ∈ R^n; and f = [f1, f2, ..., fm]^T : R^n → R^m is a smooth vector function.

Output Tracking Problem. Given the system in Eqs. (17) and (18), it is required to manufacture a bounded control input u(k) = [u1(k), u2(k), ..., um(k)]^T such that the output y(k) of the system tracks a specified desired output yd(k) = [yd1(k), yd2(k), ..., ydm(k)]^T while ensuring that the states x(k) are bounded. It is assumed that the desired output satisfies

||[yd(k), yd(k + 1), ..., yd(k + n)]^T|| ≤ γ,  k = 0, 1, 2, ..., N − 1      (19)

Table 1 (1 ≤ i ≤ m)
Tracking error: ei(k) = yi(k) − ydi(k)
Filtered tracking error: ri(k) = ei(k) + λ_{i,ni−1} ei(k − 1) + ··· + λ_{i,1} ei(k − ni + 1)
Control input: ui(k) = {−fi(x(k)) − K_{vi} ri(k) − [λ_{i,ni−1} ei(k) + λ_{i,ni−2} ei(k − 1) + ··· + λ_{i,1} ei(k − ni + 2)] + ydi(k + 1)}/bi
Filtered tracking error system: ri(k + 1) = K_{vi} ri(k) + di(k)

Feedback Linearizing Controller. The tracking problem above can be solved using a feedback linearizing controller if the complete dynamics in Eq. (17) are known. In this implementation the system is first expressed in terms of the filtered error system, and the filter gains are selected to make the error dynamics Hurwitz (Table 1). The control input is then
CEREBELLAR MODEL ARITHMETIC COMPUTERS
_x d (k) –
ΛT
+
x (k) _ Plant
Kv _ [0 Λ T ]
CMAC
^ f ( _x (k)) – –
r (k)
e (k)
159
+
–
–
+
Figure 6. Control of an unknown nonlinear system using CMAC neural network. The controller includes an inner feedback linearizing loop and an outer tracking loop.
_x d (k + n)
computed to force the filtered tracking error to be bounded, which in turn guarantees that the error and all its derivatives are bounded (39,45).

Adaptive CMAC Control. In the implementation of the controller in Table 1, it is assumed that the function f(.) is known. In practice, however, f(.) is unknown, since the information on the dynamics of the system is only partially known. The approach of the preceding section can still be used if an estimate f̂(x) of f(.) is available. According to Corollary 1, any nonlinear function can be approximated to any required degree of accuracy using a CMAC neural network. The output of the CMAC is then given as

f̂(x(k)) = ŵ^T(k) Γ(x(k))      (20)

where ŵ is a matrix of weights and Γ(x) is the vector of receptive field activation values based on n-dimensional second-order splines. However, for such a network to ensure small tracking error in closed-loop control, the weights (e.g., sample values of f(.)) associated with the network must be known. Since f(.) is unknown in control applications, it is necessary to learn the weights on-line. In Refs. 39 and 45, a learning law was derived that ensures the stability of the overall filtered tracking error system (Table 1).

Theorem 5. For the system in Eqs. (17) and (18), let the inputs be selected as in Table 1 (39,45). Further, let the estimate of the nonlinearity f̂(.) be manufactured by the CMAC NN in Eq. (20), and let the CMAC weights be tuned on-line by

ŵ(k + 1) = α ŵ(k) − β Γ(k) r^T(k + 1)      (21)

with α, β > 0 design parameters. Then, for small enough outer-loop gains K_{vi} (as specified in the proof), the filtered tracking error r(k) and the weight estimates ŵ(k) are uniformly ultimately bounded (UUB). Further, the filtered tracking error can be made arbitrarily small by proper selection of the feedback gains K_{vi}.

Remark 1. The first term in Eq. (21) is a gradient term that ensures stability of the weight update algorithm. The second term is necessary to overcome the requirement of the persistency of excitation condition (46) for the convergence of the weights and ensures robustness in the closed loop.

Remark 2 (39,45). Γ^T(x(k)) Γ(x(k)) < 1.

The proposed control scheme (Table 1) is shown in Fig. 6. Note that the structure has a nonlinear CMAC inner loop plus a linear outer tracking loop. The CMAC inner loop learns the unknown nonlinear dynamics, while the outer tracking loop ensures stability of the closed-loop system. As the CMAC learns, more of the stabilization role is assumed by the CMAC, which cancels out the nonlinear terms in the dynamics.

Remark 2 explains how CMACs overcome one of the serious difficulties in the implementation of fully connected NNs. In a fully connected NN, the adaptation rate α must satisfy the condition α ||σ^T(x(k)) σ(x(k))|| < 1, where σ(.) is the vector of the activation functions of the nodes. Therefore, as the number of nodes increases, α must decrease, thereby slowing the rate of adaptation (15). In the case of the CMAC, however, since Γ^T(x(k)) Γ(x(k)) < 1, the rate of adaptation can be chosen independently of the partitioning of the input space. This, together with the localized learning in the CMAC, ensures quick convergence of the weights and better tracking performance.

Numerical Example. As an example (45), the controller proposed in the preceding sections is tested on the system given by the following set of equations:

ẋ1 = x2 + u1
ẋ2 = x1 + 2 e^{−(x1² + x2²)} x2 − 0.1 x2 + u2      (22)

The system outputs are

y1 = x1
y2 = x2      (23)
Figure 7. Actual and desired output y1 with the discrete-time CMAC controller.
Figure 8. Actual and desired output y2 with the discrete-time CMAC controller.
The control inputs u1 and u2 are to be selected so that y1 tracks a square wave and y2 tracks a sinusoid of period 2 s. In the implementation of the CMAC controller for the system in Eqs. (22) and (23), the system is first discretized with a sample period of 10 ms. The CMAC is then required to manufacture the nonlinearities in the system dynamics. To achieve this, the receptive fields for the CMAC NN are selected to cover the input space [−2, 2] × [−2, 2] with knot points at intervals of 0.25 along each input dimension. The initial conditions for both states x1 and x2 are taken to be zero. Figures 7 and 8 show the desired and actual outputs for the MIMO system in Eqs. (22) and (23) using the CMAC NN controller (Table 1). Although 578 weights are needed to define the output in Eq. (22), only 8 (2 × 2²) weights are updated at any given instant. In other words, the performance of the CMAC controller is good even though the CMAC controller knows none of the dynamics a priori.
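The weight counts quoted for this example follow directly from the partition: knots every 0.25 on [−2, 2] give 17 knots per input dimension, so a two-output CMAC carries 2 × 17² = 578 weights, while by Lemma 1(b) only 2^n fields per output, that is 2 × 2² = 8 weights, are active at any instant. A quick arithmetic check:

```python
# Weight-count check for the numerical example (values from the text).
n_knots = int((2 - (-2)) / 0.25) + 1    # 17 knots per dimension
n_inputs, n_outputs = 2, 2
total_weights = n_outputs * n_knots ** n_inputs   # 2 * 17^2 = 578
active_weights = n_outputs * 2 ** n_inputs        # 2 * 2^2  = 8
print(total_weights, active_weights)
```

This ratio of active to total weights (8 of 578) is what makes the per-step update cost of the CMAC essentially independent of the fineness of the partition.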
Class of State-Feedback Linearizable Nonlinear Systems

A class of mnth order multi-input multi-output (MIMO) state-feedback linearizable systems in the controllability canonical form is given as

ẋ1 = x2
ẋ2 = x3
⋮
ẋn = f(x) + g(x)u + d
y = x1   (24)

with state xi = [x1 x2 . . . xn]ᵀ ∈ ℝⁿ for i = 1, . . ., m, output yi(t) ∈ ℝᵐ, and control u. It is assumed that the unknown disturbance d(t) ∈ ℝᵐ has a constant known upper bound so that |d(t)| < b_d, and that f, g : ℝ^{mn} → ℝᵐ are smooth unknown functions with |g(x)| ≥ g̲ > 0 for all x, where g̲ is a known lower bound.

Tracking Problem. The output tracking problem for this class of systems can be handled using the same design procedure as in the preceding section. The chief difference is that for this class of systems, the control coefficient g(x) is not constant but a function of the states of the system. Let

x_d(t) ≡ [y_d ẏ_d . . . y_d^{(n−1)}]ᵀ   (25)

be the desired output vector or the trajectory to be tracked. Here the superscript in parentheses indicates the order of the operator d/dt. It is assumed that the desired trajectory vector y_d is continuous and bounded and that the sign of g(x) is known. The state-feedback linearizing controller is implemented as shown in Table 2. The system is first expressed in terms of the filtered error, and the filter gains are selected to make the error dynamics Hurwitz (Table 2). The control input is then computed to force the filtered tracking error to be small, which in turn guarantees that the error and all its derivatives are bounded (34).

CMAC NN Controller. The controller in Table 2 cannot be implemented in practice as the functions f(.) and g(.) are unknown. As seen earlier, the controller can be implemented using estimates of f(.) and g(.). In order to approximate f(.) and g(.), two CMAC NN systems are employed. Using the approximation property of the CMAC, f(.) and g(.) can be written as

f(x) = W_fᵀ Γ_f(x) + ε_f   (26)

g(x) = W_gᵀ Γ_g(x) + ε_g   (27)

where W_f, W_g are vectors and ε_f, ε_g are the maximal function reconstruction errors for f(.) and g(.), respectively. Let f̂(x) and ĝ(x) be the estimates of f(.) and g(.) generated by the CMACs. The controller can then be implemented as in Table 3 (34). The closed-loop implementation is as shown in Fig. 9.

Robot Arm Control

The dynamics of an n-link robot manipulator may be expressed in the Lagrange form as (47)

M(q)q̈ + V_m(q, q̇)q̇ + G(q) + F(q̇) + τ_d = τ   (28)
Table 2
Tracking error: e = x − x_d;  e_{i+1} ≡ y^{(i)}(t) − y_d^{(i)}(t), i = 1, 2, . . ., n − 1
Filtered error: r = Λᵀe, where Λ = [λ1 λ2 . . . λ_{n−1} 1]ᵀ and s^{n−1} + λ_{n−1}s^{n−2} + . . . + λ1 is Hurwitz
Filtered tracking error dynamics: ṙ = f(x) + g(x)u + d + Y_d, where Y_d ≡ −y_d^{(n)} + Σ_{i=1}^{n−1} λ_i e_{i+1}
Control input: u = (1/g(x))[−f(x) − Y_d − Λr]
Closed-loop dynamics: ṙ = −Λr + d
Table 3
Auxiliary control input: u_c = (1/ĝ(x))[−f̂(x) + v], with v = −K_v r − Y_d, K_v > 0
Robustifying control input: u_r = −μ (|u_c|/g̲) sgn(r)
Control input:
u = u_c, if I = 1
u = u_r − (u_c/2) e^{γ(|u_c| − s)}, if I = 0 and |u_c| ≤ s
u = u_r − (u_c/2) e^{−γ(|u_c| − s)}, if I = 0 and |u_c| > s
Indicator: I = 1 if ĝ ≥ g̲ and |u_c| ≤ s; I = 0 otherwise
Design parameters: γ < ln 2/s, μ > 0, M_f, M_g > 0, and s > 0
Weight update for f̂: Ŵ̇_f = M_f Γ_f(x)r − M_f ‖r‖Ŵ_f
Weight update for ĝ: Ŵ̇_g = M_g Γ_g(x)r − M_g ‖r‖Ŵ_g if I = 1; Ŵ̇_g = 0 otherwise

with q(t) ∈ ℝⁿ the joint variable vector, M(q) the inertia matrix, V_m(q, q̇) the coriolis/centripetal vector, G(q) the gravity vector, and F(q̇) the friction component. Bounded unknown disturbances are denoted by τ_d, and τ is the control torque. It is assumed that τ_d is an unknown disturbance with a known upper bound b_d so that ‖τ_d‖ ≤ b_d. The control problem is then to design a control input such that the joint angles q(t) track a desired trajectory q_d(t).

Conventional Controller Design. Traditionally the control problem has been attacked by linearizing the robot system in some region of operation and then designing a linear proportional-derivative (PD) or proportional-integral-derivative (PID) controller for the system. That is, the system in Eq. (28) is first expressed as

q̈(t) = M⁻¹(q){−V_m(q, q̇)q̇ − G(q) − F(q̇) − τ_d} + M⁻¹(q)τ   (29)

In practice, it is known that M⁻¹(.) exists, and hence the linear equivalent of Eq. (28) can be found about any operating point. Thus, given any smooth desired trajectory q_d(t), neglecting the coriolis, gravity, and friction terms, the control input can be designed as

τ = −M̄{K_v ė + K_p e − q̈_d}   (30)

where the tracking error is defined as e(t) ≡ q(t) − q_d(t), M̄ is a constant diagonal matrix approximation of the inertia matrix, and K_v, K_p are constant diagonal matrices of the derivative and proportional gains. With this control, Eq. (29) can be rewritten as

q̈(t) = M⁻¹(q){−V_m(q, q̇)q̇ − G(q) − F(q̇) − τ_d} − M⁻¹(q)M̄{K_v ė + K_p e − q̈_d}   (31)

Simplifying and rearranging, we get

ë(t) + K_v ė(t) + K_p e = M⁻¹(q){−V_m(q, q̇)q̇ − G(q) − F(q̇) − τ_d} + (I − M⁻¹(q)M̄){K_v ė + K_p e − q̈_d}   (32)

Defining

f(q, q̇) = M⁻¹(q){−V_m(q, q̇)q̇ − G(q) − F(q̇)} + (I − M⁻¹(q)M̄){K_v ė + K_p e − q̈_d}

yields

ë(t) + K_v ė(t) + K_p e = f(q, q̇) − M⁻¹(q)τ_d   (33)
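To make the conventional design concrete, the PD law of Eq. (30) can be simulated on a hypothetical one-link, unit-inertia arm; with M = M̄ = 1 and the coriolis, gravity, and friction terms absent, the closed loop reduces exactly to ë + K_v ė + K_p e = 0. The gains, trajectory, and step size below are illustrative choices, not values from the article:

```python
import math

# Hypothetical one-link, unit-inertia arm: Eq. (29) reduces to qddot = tau,
# and the PD law of Eq. (30) (with Mbar = 1) yields exact linear error
# dynamics. Gains and the desired trajectory are illustrative.
def simulate_pd(Kv=5.0, Kp=25.0, dt=1e-3, T=10.0):
    q, qdot, t = 0.0, 0.0, 0.0
    while t < T:
        q_d = 0.3 * math.sin(t)
        qdot_d = 0.3 * math.cos(t)
        qddot_d = -0.3 * math.sin(t)
        e, edot = q - q_d, qdot - qdot_d
        tau = -(Kv * edot + Kp * e - qddot_d)     # Eq. (30), Mbar = 1
        q, qdot = q + dt * qdot, qdot + dt * tau  # Euler step of qddot = tau
        t += dt
    return abs(q - 0.3 * math.sin(t))             # final tracking error
```

Because the plant here matches the design model exactly, the tracking error decays to near zero; the degradation discussed next appears only when the neglected nonlinear terms are present.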
In conventional controller design, it is standard practice to design M̄ such that ‖I − M⁻¹(q)M̄‖ is small. Also, for nominal trajectories, the effects of the centripetal, coriolis, and friction terms on Eq. (31) are small. Therefore f(q, q̇) is small and can be neglected. This design guarantees adequate performance in the designed region of operation, but the tracking response degrades rapidly if the region of operation is enlarged. Moreover, even for a given region of operation, the effect of f(q, q̇) cannot be neglected if the robot is required to operate at high speeds. This in essence becomes a serious bottleneck to enlarging the envelope of the robot performance (47).

Robot Control Using State-Feedback Linearization Approach. The use of CMAC NNs in designing feedback linearizing controllers can be extended to control the robotic system in Eq. (29) (34). Consider a two-link robot arm (47) where the first link is 1 m long and weighs 1 kg and the second link is 1 m long and weighs 2.3 kg. The first joint is required to track a trajectory q_d1 = 0.3 sin(t) and the second joint is required to track a trajectory q_d2 = 0.3 cos(t). The controller parameters were selected as K_v = diag{5, 5} and Λ = diag{5, 5}, and the diagonal elements of the design matrix F are taken to be 10. The response of the system with the CMAC controller
Figure 9. Structure of the feedback linearizing CMAC controller. The controller has two adaptive loops: one for generating the estimate of the unknown function f(.) and the other for generating the estimate of the unknown function g(.).
Figure 10. Robot control—Joint 1 response with CMAC controller.
is shown in Figs. 10 and 11. From these figures it is evident that after a short learning phase, the system is able to track the desired trajectories effectively. Figures 12 and 13 show the response of the system without the CMAC NN in the feedback loop. From these results it is clear that the CMAC NN does improve on the linear design.

Intelligent Control Formulation of the Robot Control Problem. While it is well known that the robot control problem can be satisfactorily addressed using the filtered tracking error formulation of the control problem (16,17,47), this approach would entail a complete redesign of the controller. Here we propose an alternative strategy based on techniques rooted in the intelligent control literature. It will be shown that the thorny problem of the neglected dynamics can be easily handled by adding a feedforward component to the controller designed in Eq. (30). The feedforward component is adaptive in nature and can be manufactured using neural networks, adaptive controllers, or fuzzy logic networks (39,45). Here we restrict the presentation to CMAC neural networks.
Figure 12. Robot control—Joint 1 response without CMAC controller.
Let the modified control be defined as

τ = −M̄{K_v ė + K_p e − q̈_d + f̂(q, q̇)}   (34)

where f̂(q, q̇) is the output generated by a CMAC NN. The error dynamics of the system in Eq. (29) under this new control can be written in the form

ë(t) + K_v ė(t) + K_p e = f(q, q̇) − M⁻¹(q)τ_d − M⁻¹(q)M̄ f̂(q, q̇)   (35)

Defining

N = f(q, q̇) − M⁻¹(q)M̄ f̂(q, q̇) − M⁻¹(q)τ_d   (36)

and the state ē ≡ [eᵀ ėᵀ]ᵀ, the error equation in Eq. (35) can be put in the state-space form

ē̇ = [0, I; −K_p, −K_v] ē + [0; N]   (37)
Figure 11. Robot control—Joint 2 response with CMAC controller.
Figure 13. Robot control—Joint 2 response without CMAC controller.
Now, if the CMAC NN is designed such that

f̂(q, q̇) = M̄⁻¹M(q) f(q, q̇)   (38)

then

f(q, q̇) − M⁻¹(q)M̄ f̂(q, q̇) ≡ 0   (39)

Then in the absence of disturbances, perfect tracking can be achieved. However, since f(q, q̇) and M(q) are not known in practice, the CMAC NN can be designed to learn the dynamics online and ensure that ‖N‖ in Eq. (36) is small. In fact this bound on ‖N‖ influences the overall closed-loop performance and can be made as small as desired by proper choice of the learning laws for the CMAC NN.

Theorem 6 For the system in Eq. (28) let the inputs be selected as in Eq. (36). Let k1 be a positive constant such that

ēᵀ [0, I; −K_p, −K_v] ē ≤ −k1‖ē‖²   (40)

Further let the estimate of the nonlinearity f̂(.) be manufactured by a CMAC NN as in Eq. (41). Let the weights of the CMAC NN be tuned on-line by the following update law:

Ŵ̇ = Γᵀ(x)r − k2‖r‖Ŵ

with k2 a positive design parameter. Then for large enough outer-loop gain k1, the tracking error ē(t) and the weight estimates are UUB. Further, the tracking error can be made arbitrarily small by proper selection of the feedback gains K_p and K_v.

PASSIVITY-BASED DESIGN

Earlier, CMAC controllers were presented that guarantee closed-loop tracking performance for nonlinear systems with unknown dynamics. The stability of these controllers was proved using Lyapunov stability analysis. While this technique guarantees closed-loop stability of the overall system, it does not give insight into the actual selection of CMAC learning laws for a particular application. In recent work (37,38), the CMAC design was studied from an input–output point of view, and conditions that guarantee closed-loop stability were derived. These results give insight into the selection of learning laws for a given class of systems and are presented in the following subsection.

Assumption 1 Let the system in Eqs. (17) and (18) satisfy the following conditions:

a. f(0) = y(0) = 0.
b. The system is completely reachable; that is, for a given x(t_f) there exist a constant N and bounded controls u(k), k = 0, 1, 2, . . ., N − 1, such that the state can be driven from x(0) = 0 to x(t_f = NT).
c. σ(u, y, T_c) is an energy supply rate associated with this system such that

σ(u, y, T_c) = ⟨y, Qy⟩_{T_c} + 2⟨y, Su⟩_{T_c} + ⟨u, Ru⟩_{T_c}   (41)

where Q, R, S are constant matrices with Q and R symmetric, and ⟨., .⟩ is the inner product.

Definition 1 A system is state-strict passive if it is (a) passive (48–53) and (b) there exists a real function Ψ(.) satisfying Ψ(x(k)) > 0 for all x(k) ≠ 0, Ψ(0) = 0, and

Ψ(x(k + 1)) − Ψ(x(k)) ≤ yᵀ(k)u(k) − xᵀ(k)x(k)   (42)

where x is the state of the system. Equation (42) is referred to in the literature as the power form.

Theorem 7 Consider the system of the form shown in Fig. 14. Suppose that the subsystems H1 and H2 are state-strict passive with the supply rates σ1(u, y, T_c) and σ2(u, y, T_c). Further let H1 satisfy ‖y1(k)‖ ≤ α‖x(k)‖,
Universal Approximation to Functions

For a sigmoid σ with limits σ̲ = lim_{z→−∞} σ(z) and σ̄ = lim_{z→+∞} σ(z), the scaled response σ(az) converges to σ̄ for z > 0 and to σ̲ for z < 0 as a → ∞; the normalized response (σ(az) − σ̲)/(σ̄ − σ̲) thus converges to the unit step U(z). Differences of steps form pulses

p_τ(z) = U(z) − U(z − τ)

and a function can be approximated by a sum of such pulses,

f(z) ≈ Σ_{k=−∞}^{∞} f(kτ) p_τ(z − kτ)

so that networks of sigmoidal nodes can approximate f arbitrarily closely. Approximation of g by a family of networks A is formalized by requiring that g lie in the closure of A: (∀ε > 0)(∃f ∈ A) d(g, f) < ε.
THE REPRESENTATIONAL POWER OF A SINGLE-HIDDEN-LAYER NETWORK

Approximation: Metrics and Closure

In order to present precise results on the ability of feedforward networks to approximate to functions, and hence to training sets, we first introduce the corresponding concepts of arbitrarily close approximation of one function (e.g., the network response) to another function (e.g., the target). To measure the approximation error of f by g we consider the distance between them, defined by the size of their difference f − g, where this is well-defined when both are elements of a normed linear vector space (e.g., the space of square-integrable functions, or the usual Euclidean space ℝ^d with the familiar squared distance between vectors). For the space of continuous functions C on X we typically use the metric

d(f, g) = sup |f(x) − g(x)|
The power of even single-hidden-layer feedforward neural networks is revealed in the technical results cited below. A large number of contributions to this issue have been made, with the major ones first appearing in 1989 [e.g., Refs. 45 and 46]. In essence, almost any nonpolynomial node function σ used in such a network can yield arbitrarily close approximations to functions in familiar and useful classes, with the approximation becoming arbitrarily close as the width of the layer is increased. That σ not be a polynomial is clearly a necessary condition, since a single-hidden-layer network with polynomial nodes of degree p can only generate a polynomial of degree p no matter what the width s of the hidden layer. To report these somewhat technical results we first need to define the set M of node functions.

Definition 3 [Leshno et al. (47)]. Let M = {σ} denote the set of node functions such that:

1. The closure of the set of points of discontinuity of any σ ∈ M has zero Lebesgue measure (length).
2. For every compact (closed, bounded) set K ⊂ ℝ, the essential supremum of σ on K, with respect to Lebesgue measure μ, is bounded:

ess sup_{x∈K} |σ(x)| = inf{λ : μ{x : |σ(x)| ≥ λ} = 0} < ∞
For example, property 1 is satisfied if the points of discontinuity have only finitely many limit points, while property 2 is satisfied if σ is bounded almost everywhere. We can now assert

Theorem 1 [Leshno et al. (44), Theorem 1]. Let σ ∈ M. Then the closure of the linear span of σ(w · x − θ) is C(ℝ^d) if and only if σ is not almost everywhere an algebraic polynomial.
where the supremum is taken over x ∈ X, so that distance or error here is worst-case error. Another common metric, for p ≥ 1, is

d(f, g) = ( ∫_{x∈X} |f(x) − g(x)|^p μ(dx) )^{1/p}
Noting that sigmoidal nodes satisfy the conditions of this theorem, we see that networks composed of them enjoy the ability to universally approximate to continuous functions. While the preceding theorem tells us much about the power of feedforward neural networks to approximate functions according to specific norms or metrics, there are issues
FEEDFORWARD NEURAL NETS
that are not addressed. For example, we may wish to approximate not only to a function t(x) but also to several of its derivatives. An approximation, say, using step functions can give an arbitrarily close approximation in sup-norm to a differentiable function of a single variable, yet at no point approximate to its derivative; the approximating function has derivatives that are zero almost everywhere. Results on the use of neural networks to simultaneously approximate to a function and several of its derivatives are provided in Refs. 48 and 49. Results on approximation ability in terms of numbers of nodes have also been developed along lines familiar in nonlinear approximation theory, and these include the work of Barron (50) and Jones (51). They show that in certain cases (in a Hilbert space setting) approximation error decreases inversely with the number s of single hidden layer nodes, and this decrease can in some cases be surprisingly independent of the dimension d of the input.
Universal Approximation to Partial Functions

We now turn to the problem of approximating closely to a partially specified function. The problem format is that we are given a training set T = {(x_i, t_i), i = 1 : n} of input–output pairs that partially specify t = f(x), and we wish to select a net η(·, w) so that the output y_i = η(x_i, w) is close to the desired output t_i for the input x_i. This is the typical situation in applications of neural networks—we do not know f but have points on its graph. If instead you are fortunate enough to be given the function f relating t to x, then you can generate arbitrarily large training sets by sampling the function domain, either deterministically or randomly, and calculating the corresponding responses, thereby reducing this problem to the one we will treat in detail in the next section. The notion of ''closeness'' on the training set T is typically formalized through an error or objective function of the form

E_T = (1/2) Σ_{i=1}^{n} ‖y_i − t_i‖²

Hence, E_T = E_T(w), a function of w, since y depends upon the parameters w defining the selected network η. Of course, there are infinitely many other measures of closeness (e.g., metrics such as ''sup norm'' discussed in the section entitled ''Approximation: Metrics and Closure''). However, it is usually more difficult to optimize for these other metrics through calculus methods, and virtually all training of neural networks takes place using the quadratic metric, even in some cases where eventual performance is reported for other metrics. It is apparent from the results of the section entitled ''Universal Approximation to Functions'' that one can expect a single-hidden-layer network to be able to approximate arbitrarily closely to any given training set T of size n provided that it is wide enough (s1 ≫ 1). An appropriate measure of the complexity of a network that relates to its ability to approximate closely to a training set is given by the notion of Vapnik–Chervonenkis (VC) dimension/capacity. Discussion of VC dimension is available from Vapnik (52), Kearns and Vazirani (53), and Fine (13). Studies of network generalization ability (see section entitled ''Learning and Generalization Behavior'') also rely on VC dimension.

TRAINING A NEURAL NETWORK: BACKGROUND AND ERROR SURFACE

Error Surface

We address the problem (posed in the section entitled ''Universal Approximation to Partial Functions'') of selecting the weights w and thresholds, generically referred to simply as ''weights,'' to approximate closely to a function partially specified by a training set. We are confronted with the following nonlinear optimization problem:

minimize E_T(w) by choice of w ∈ W ⊂ ℝ^p
The inherent difficulty of such problems is aggravated by the typically very high dimension of the weight space W ; networks with hundreds or thousands of weights are commonly encountered in image processing and optical character recognition applications. In order to develop intuition, it is helpful to think of w as being two-dimensional and determining the latitude and longitude coordinates for position on a given portion W of the surface of the earth. The error function E T (w) is then thought of as the elevation of the terrain at that location. We seek the point on W of lowest elevation. Clearly we could proceed by first mapping the terrain, in effect by evaluating E T at a closely spaced grid of points, and then selecting the mapped point of lowest elevation. The major difficulty with this approach is that the number of required grid points grows exponentially in the dimension of W (number of parameter coordinates). What might be feasible on a two-dimensional surface will quickly become impossible when we have, as we usually will, a more than 100-dimensional surface. One expects that the objective function E T (w) for a neural network with many parameters defines a highly irregular surface with many local minima, large regions of little slope (e.g., directions in which a parameter is already at a large value that saturates its attached node for most inputs), and symmetries (see section entitled ‘‘Basic Properties of the Representation by Neural Networks’’). The surface is technically smooth (continuous first derivative) when we use the usual differentiable node functions. However, thinking of it as smooth is not a good guide to our intuition about the behavior of search/optimization algorithms. Figure 3 presents two views of a three-dimensional projection (two parameters selected) of the error surface of a single node network having three inputs and trained on ten input–output pairs. 
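The grid-mapping difficulty described above is just the exponential growth of the number of grid points with parameter dimension; a two-line check (the density of k = 100 points per dimension is an arbitrary illustrative choice):

```python
# Mapping the error surface on a uniform grid: with k grid points along
# each of the p parameter dimensions, the number of E_T evaluations is k**p.
# The density k = 100 per dimension is an illustrative choice.
def grid_evaluations(p, k=100):
    return k ** p

low_dim = grid_evaluations(2)     # a 2-D "terrain": 10,000 evaluations
high_dim = grid_evaluations(100)  # 100 parameters: 10**200 evaluations
```

The jump from 10⁴ to 10²⁰⁰ evaluations is why exhaustive mapping is abandoned in favor of the local descent methods discussed below.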
Multiple Stationary Points

The arguments of the section entitled ''Basic Properties of the Representation by Neural Networks'' establish the existence of multiple minima. Empirical experience with training algorithms shows that different initializations almost always yield different resulting networks. Hence, the issue of many minima is a real one. A construction by Auer et al. (54) shows that one can construct training sets of n pairs, with the inputs
Figure 3. Two views of an error surface for a single node (n = 10, d = 3, s = 1).
drawn from ℝ^d, for a single-node network, with a resulting number of local minima that grows exponentially, on the order of ⌊n/d⌋^d! Hence, not only do multiple minima exist, but there may be huge numbers of them. The saving grace in applications is that we often attain satisfactory performance at many of the local minima and have little incentive to persevere to find a global minimum. Recent techniques involving the use of families of networks trained on different initial conditions also enable us, either through linear combinations of the trained networks (e.g., see Refs. 21 and 55) or through a process of pruning, to achieve good performance.

Outline of Approaches

There is no ''best'' algorithm for finding the weights and thresholds that solve the credit assignment problem, now often called the loading problem—the problem of ''loading'' the training set T into the network parameters. Indeed, it appears that this problem is intrinsically difficult (versions of it are NP-complete). Hence, different algorithms have their staunch proponents, who can always construct instances in which their candidate performs better than most others. In practice today there are four types of optimization algorithms that are used to select network parameters to minimize E_T(w). Good overviews are available in Battiti (56), Bishop (12), Fine (13), Fletcher (57), and Luenberger (58). The first three methods, steepest descent, conjugate gradients [e.g., Møller (59)], and quasi-Newton (see preceding references), are general optimization methods whose operation can be understood in the context of minimization of a quadratic error function. While the error surface is surely not quadratic, for differentiable node functions it will be so in a sufficiently small neighborhood of a local minimum, and such an analysis provides information about the high-precision behavior of the training algorithm. The fourth method, of Levenberg and Marquardt [e.g., Hagan and Menhaj (60), Press et al.
(61)] is specifically adapted to minimization of an error function that arises from a quadratic criterion of the form we are assuming. A variation on all of the above is that of regularization [e.g., Tikhonov (62), Weigend (63)] in which a penalty term is
added to the performance objective function E T (w) so as to discourage excessive model complexity (e.g., the length of the vector of weights w describing the neural network connections). All of these methods require efficient, repeated calculation of gradients and backpropagation is the most commonly relied upon organization of the gradient calculation. We shall only present the steepest-descent algorithm; it has been the most commonly employed and limitations of space preclude presentation of other approaches.
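The penalty-term idea above can be sketched minimally, assuming the simplest ridge-style complexity measure ‖w‖²; the function name and the coefficient lam are illustrative, not taken from the cited references:

```python
import numpy as np

# Regularization as a penalty added to E_T(w): here lam * ||w||^2, which
# discourages long weight vectors. The name and coefficient are illustrative.
def regularized_error(train_error, w, lam=1e-3):
    return train_error + lam * float(np.dot(w, w))

# the penalty's contribution to the gradient is simply 2 * lam * w, added to
# the backpropagated gradient of the data term
```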
TRAINING: BACKPROPAGATION

Notation

Formal exposition of feedforward neural networks (FFNN) requires us to introduce notation, illustrated in Fig. 1, to describe a multiple-layer FFNN; such notation has not yet become standard.

1. Let i generically denote the ith layer, with the inputs occurring in the 0th layer and the last layer being the Lth and containing the outputs.
2. A layer is indexed as the first subscript and separated from other subscripts by a colon (:).
3. It is common in approximation problems (e.g., estimation, forecasting, regression) for the last-layer node to be linear, but nonlinear in pattern classification problems where a discrete-valued response is desired.
4. The number of nodes in the ith layer is given by the width s_i.
5. The jth node function in layer i is F_{i:j}; alternatively we also use σ_{i:j}.
6. The argument of F_{i:j}, when x_m is the input to the net, is denoted c^m_{i:j}.
7. The value of F_{i:j}(c^m_{i:j}) equals a^m_{i:j}; when the net input is x_m we have {x^m_j = a^m_{0:j}}, and the vector of node responses in layer i is a^m_{i:}.
8. The derivative of F_{i:j} with respect to its scalar argument is denoted f_{i:j}.
9. The thresholds or biases for nodes in the ith layer are given by the s_i-dimensional vector b_i = {b_{i:j}}.
10. The weight w_{i:j,k} assigned to the link connecting the kth node output in layer i − 1 to the jth node input in layer i is an element of the matrix W_i.
Hence, in this notation the neural network equations are

a^m_{0:j} = (x_m)_j = x^m_j,   a^m_{0:} = x_m   (1)

c^m_{i:j} = Σ_{k=1}^{s_{i−1}} w_{i:j,k} a^m_{i−1:k} + b_{i:j},   c^m_{i:} = W_i a^m_{i−1:} + b_i   (2)

a^m_{i:j} = F_{i:j}(c^m_{i:j}),   a^m_{i:} = F_i(c^m_{i:}),   a^m_{L:1} = y_m   (3)
For clarity we assume that the network has a single output; the extension to vector-valued outputs is straightforward but obscures the exposition. The discrepancy e_m between the network response y_m to the input x_m and the desired response t_m is given by

e_m = y_m − t_m = a^m_{L:1} − t_m,   e = (e_m)
Figure 4. Information flow in backpropagation.
Combining the last two results yields the backwards recursion

δ^m_{i:j} = f_{i:j}(c^m_{i:j}) Σ_{k=1}^{s_{i+1}} δ^m_{i+1:k} w_{i+1:k,j}   (7a)
and the usual error criterion is

E_m = (1/2)(y_m − t_m)² = (1/2)e_m²,   E_T = Σ_{m=1}^{n} E_m(w) = (1/2)eᵀe   (4)
for i < L. This equation can be rewritten in matrix–vector form using W_{i+1} = [w_{i+1:k,j}], δ^m_i = [δ^m_{i:j}], f^m_i = [f_{i:j}(c^m_{i:j})]:

δ^m_i = (δ^m_{i+1})ᵀ W_{i+1} .∗ f^m_i
where .∗ is the Hadamard product (Matlab element-wise multiplication of matrices). The ''final'' condition, from which we initiate the backwards propagation, is provided by the direct evaluation of

δ^m_{L:1} = f_{L:1}(c^m_{L:1})(a^m_{L:1} − t_m)   (7b)

Backpropagation

A systematic organization of the calculation of the gradient for a multilayer perceptron is provided by the celebrated backpropagation algorithm. We supplement our notation by introducing w as an enumeration of all weights and thresholds/biases in a single vector and defining

δ^m_{i:j} = ∂E_m(w)/∂c^m_{i:j}   (5)
To relate this to our interest in the gradient of E_m with respect to a weight w_{i:j,k} or bias b_{i:j}, note that these parameters affect E_m only through their appearance in Eq. (2). Hence, we obtain an evaluation of all of the elements of the gradient vector in terms of δ^m_{i:j} through

∂E_m/∂w_{i:j,k} = (∂E_m/∂c^m_{i:j})(∂c^m_{i:j}/∂w_{i:j,k}) = δ^m_{i:j} a^m_{i−1:k}   (6a)

∂E_m/∂b_{i:j} = δ^m_{i:j}   (6b)

Thus the evaluation of the gradient, as illustrated in Fig. 4, is accomplished by:

1. A forward pass of the training data through the network to determine the node outputs a^m_{i:j} and inputs c^m_{i:j}
2. A backward pass through the network to determine the δ^m_{i:j} through Eqs. (7a) and (7b)
3. Combining results to determine the gradient through Eqs. (6a) and (6b)
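The forward/backward/combine evaluation can be sketched for a single-hidden-layer net with tanh nodes and a linear output; the sizes and random data are illustrative, and the analytic gradient is checked against a finite difference:

```python
import numpy as np

# Single-hidden-layer net (tanh nodes, linear output) on one training pair;
# deltas follow Eqs. (7a)-(7b) and gradients Eqs. (6a)-(6b). Sizes and the
# random data are illustrative. The (0,0) weight gradient is then checked
# against a one-sided finite difference of E_m = 0.5*(y - t)**2.
rng = np.random.default_rng(1)
x, t = rng.normal(size=3), 0.7
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
w2, b2 = rng.normal(size=4), 0.1

def forward(W1, b1, w2, b2, x):
    c1 = W1 @ x + b1                 # hidden-node inputs, Eq. (2)
    a1 = np.tanh(c1)                 # hidden-node outputs, Eq. (3)
    return c1, a1, w2 @ a1 + b2      # linear output node

c1, a1, y = forward(W1, b1, w2, b2, x)        # forward pass
delta_out = y - t                             # Eq. (7b): f' = 1, linear output
delta1 = (1.0 - a1 ** 2) * (w2 * delta_out)   # Eq. (7a): tanh' = 1 - tanh^2
grad_W1 = np.outer(delta1, x)                 # Eq. (6a)

eps = 1e-6                                    # numerical check of one entry
W1p = W1.copy(); W1p[0, 0] += eps
yp = forward(W1p, b1, w2, b2, x)[2]
num = (0.5 * (yp - t) ** 2 - 0.5 * (y - t) ** 2) / eps
```

The analytic entry `grad_W1[0, 0]` agrees with the finite-difference estimate to within discretization error, which is the usual sanity check for a backpropagation implementation.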
It remains to evaluate δ^m_{i:j}. Note that since E_m depends upon c^m_{i:j} only through a^m_{i:j},

δ^m_{i:j} = (∂E_m/∂a^m_{i:j})(∂a^m_{i:j}/∂c^m_{i:j}) = f_{i:j}(c^m_{i:j}) ∂E_m/∂a^m_{i:j}

If layer i is hidden, then E_m depends upon a^m_{i:j} only through its effects on the layer i + 1, to which it is an input. Hence,

∂E_m/∂a^m_{i:j} = Σ_{k=1}^{s_{i+1}} (∂E_m/∂c^m_{i+1:k})(∂c^m_{i+1:k}/∂a^m_{i:j}) = Σ_{k=1}^{s_{i+1}} δ^m_{i+1:k} w_{i+1:k,j}
DESCENT ALGORITHMS

Overview and Startup Issues

The backpropagation algorithm (BPA), in common usage, refers to a descent algorithm that iteratively selects a sequence of parameter vectors {w_k, k = 1 : T}, for a moderate value of running time T, with the goal of having {E_T(w_k) = E_k} converge to a small neighborhood of a good local minimum rather than to the global minimum

E*_T = min_{w∈W} E_T(w)

Issues that need to be addressed are:
1. Initialization of the algorithm
2. Choice of online (stochastic) versus batch processing
3. Recursive algorithm to search for an error surface minimum
4. Selection of parameters of the algorithm
5. Rules for terminating the algorithmic search
6. Convergence behavior (e.g., local versus global minima, rates of convergence)

The search algorithm is usually initialized with a choice w_0 of parameter vector that is selected at random to have moderate or small values. The random choice is made to prevent inadvertent symmetries in the initial choice from being locked into all of the iterations. Moderate weight values are selected to avoid initially saturating the node nonlinearities; gradients are very small when S-shaped nodes are saturated, and convergence will be slow. It has been argued in (64) that the performance of steepest descent for neural networks is very sensitive to the choice of w_0. In practice, one often trains several times, starting from different initial conditions. One can then select the solution having the smaller minimum or make use of a combination of all the solutions found (21). The descent algorithm can be developed either in a batch mode or in an online/stochastic mode. In the batch mode we attempt, at the (k + 1)st step of the iteration, to reduce the total error over the whole training set, E_T(w_k), to a lower value E_T(w_{k+1}). In the online mode we attempt, at the (k + 1)st step of the iteration, to reduce a selected component E_{m_{k+1}}, the error in the response to excitation x_{m_{k+1}}, of the total error. Over the course of the set of iterations, all components will be selected, usually many times. Each version has its proponents. To achieve true steepest descent on E_T(w) we must do the batch update, in which the search direction is evaluated in terms of all training set elements. In practice, the most common variant of the BPA is online and adjusts the parameters after the presentation of each training set sample.
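The batch/online distinction can be sketched on a linear least-squares toy problem; both loops use the same per-sample gradient, differing only in whether the sum over the training set is taken before each step (all values below are illustrative):

```python
import numpy as np

# Batch vs. online descent on a linear least-squares toy problem: batch sums
# the per-sample gradient over the whole training set before each step;
# online takes a small step after every sample. All values are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true

def batch_step(w, lr=0.05):
    g = X.T @ (X @ w - t) / len(X)   # gradient of E_T(w) over all samples
    return w - lr * g

def online_pass(w, lr=0.05):
    for xm, tm in zip(X, t):         # one small step per pair (x_m, t_m)
        w = w - lr * (xm @ w - tm) * xm
    return w

wb = wo = np.zeros(3)
for _ in range(300):
    wb = batch_step(wb)
    wo = online_pass(wo)
```

On this noiseless problem both modes recover the generating weights; the online iterate follows a noisier path, which is the stochastic behavior discussed next.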
The operation of the online search is more stochastic than that of the batch search, since directions depend upon the choice of training set term. The online mode replaces the large step taken by the batch process (a sum over online-mode-type steps for each training sample) by a sequence of smaller steps in which the weight vector is continually updated as you iterate. This mode makes it less likely that you will degrade performance by a significant erroneous step. There is a belief (e.g., see Ref. 64a, p. 157) that this enables the algorithm to find better local minima through a more random exploration of the parameter space W.

Iterative Descent Algorithms

We now enumerate all network parameters (link weights and biases) in a vector w ∈ W ⊂ ℝ^p. The basic iterative recursion, common to all of the training methods in widespread use today, determines a new parameter vector w_{k+1} in terms of the present vector w_k through a search direction d_k and a scalar learning rate or step size α_k:

w_{k+1} = w_k + α_k d_k   (8)
Typically, descent algorithms are Markovian in that one can define a state such that the future state depends only upon the present state and not upon the succession of past states that led up to it. In the case of basic steepest descent, this state is simply the current value of the parameter and
gradient. In the variation on steepest descent using momentum smoothing, the state depends upon the current parameter value and gradient and the most recent past parameter value. Each of the algorithms in current use determines the next search point by looking locally at the error surface. We can explore the basic properties of descent algorithms by considering the following first-order approximation [i.e., f(x) − f(x_0) ≈ f′(x_0)(x − x_0)] to successive values of the objective/error function:

E_{k+1} − E_k ≈ g(w_k)^T (w_{k+1} − w_k)
(9)
If we wish our iterative algorithm to yield a steady descent, then we must reduce the error at each stage. For increments w_{k+1} − w_k that are not so large that the first-order Taylor series approximation of Eq. (9) is invalid, we see that we must have
g(w_k)^T (w_{k+1} − w_k) = g(w_k)^T (α_k d_k) = α_k g_k^T d_k < 0   (descent condition)
(10)
One way to satisfy Eq. (10) is to have

α_k > 0,   d_k = −g_k
(11)
The particular choice of descent direction in Eq. (11) is the basis of steepest descent algorithms. Other choices of descent direction are made in conjugate gradient methods (59). An "optimal" choice α_k* of the learning rate α_k for a given choice of descent direction d_k is the one that minimizes E_{k+1}:

α_k* = argmin_α E_T(w_k + α d_k)

This choice is truly optimal only if we are at the final stage of the iteration. It is easily verified that the optimal learning rate must satisfy the orthogonality condition

g_{k+1}^T d_k = 0
(12)
The gradient of the error at the end of the iteration step is orthogonal to the search direction along which we have changed the parameter vector. Hence, in the case of steepest descent [Eq. (11)], successive gradients are orthogonal to each other. When the error function is not specified analytically, its minimization along d_k is accomplished through a numerical line search for α_k*. Further analysis of the descent condition can be carried out if one makes the customary assumption that E_T is quadratic, with the representation

E_T(w) = E_T(w_0) + (1/2)(w − w_0)^T H (w − w_0)

(12a)
in terms of the Hessian matrix H of second derivatives of the error with respect to the components of w, evaluated at w_0; H must be positive definite if E_T is to have a unique minimum. The optimality condition for the learning rate α_k derived from the orthogonality condition [Eq. (12)] becomes

α_k* = −(d_k^T g_k) / (d_k^T H d_k)
(13)
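On an assumed quadratic test problem (hypothetical values, not from the article), Eq. (13) and the orthogonality condition of Eq. (12) can be checked directly:

```python
import numpy as np

# Assumed quadratic error E(w) = 0.5 * w^T H w with positive-definite H,
# so the gradient is g(w) = H w and the minimum sits at w* = 0.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
E = lambda w: 0.5 * w @ H @ w
grad = lambda w: H @ w

w = np.array([2.0, -1.0])
g = grad(w)
d = -g                                    # steepest-descent direction, Eq. (11)

alpha_star = -(d @ g) / (d @ H @ d)       # optimal learning rate, Eq. (13)
w_next = w + alpha_star * d

# Orthogonality condition, Eq. (12): the new gradient is orthogonal
# to the search direction along which we moved.
print(grad(w_next) @ d)                   # essentially zero
# No other rate along d does better than alpha_star:
print(all(E(w_next) <= E(w + a * d) for a in np.linspace(0.0, 1.0, 101)))
```

The same check with a grid or golden-section line search in place of the closed form illustrates the numerical line search mentioned above for non-analytic error functions.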
FEEDFORWARD NEURAL NETS
A common simplification is to hold the learning rate at a constant value α. The simplicity of this approach is belied by the need to select the learning rate carefully. If the fixed step size is too large, then we leave ourselves open to overshooting the line-search minimum, we may engage in oscillatory or divergent behavior, and we lose guarantees of a monotone reduction of the error function E_T. For large enough α the algorithm will diverge. If the step size is too small, then we may need a very large number of iterations before we achieve a sufficiently small value of the error function. To proceed further, we assume the quadratic case given by Eq. (12a) and let {λ_j} denote the eigenvalues of the Hessian. It can be shown [e.g., Fine (13, Chapter 5)] that convergence of w_k to the local minimum w* requires, for an arbitrary starting point, that
max_j |1 − αλ_j| < 1,   or   0 < α < 2 / max_j λ_j
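The effect of this bound can be seen numerically. This is a minimal sketch with an assumed diagonal Hessian whose eigenvalues are 1 and 10, so the bound is 2/10 = 0.2:

```python
import numpy as np

# Assumed quadratic error with Hessian eigenvalues 1 and 10. Fixed-rate
# steepest descent, w <- w - alpha * H w, converges iff
# max_j |1 - alpha * lambda_j| < 1, i.e. 0 < alpha < 2/10 = 0.2.
H = np.diag([1.0, 10.0])

def final_error_norm(alpha, iters=200):
    w = np.array([1.0, 1.0])
    for _ in range(iters):
        w = w - alpha * (H @ w)          # steepest descent step
    return np.linalg.norm(w)

print(final_error_norm(0.19))   # just below the bound: shrinks toward w* = 0
print(final_error_norm(0.21))   # just above the bound: blows up
```

Just below the bound the component along the largest eigenvalue contracts by |1 − 0.19·10| = 0.9 per step; just above it, the factor 1.1 compounds into divergence.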