To appear in: Proceedings of Applications of Soft Computing, SPIE International Symposium on Optical Science, Engineering and Instrumentation, San Diego, 27 July - 1 August 1997 (invited paper)
Incremental neuro-fuzzy systems

B. Fritzke
Systembiophysik, Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany
(Other author information: World-Wide-Web: http://www.neuroinformatik.ruhr-uni-bochum.de/ini/PEOPLE/fritzke/top.html)
ABSTRACT
The poor scaling behavior of grid-partitioning fuzzy systems in case of increasing data dimensionality suggests using fuzzy systems with a scatter-partition of the input space. Jang has shown that zero-order Sugeno fuzzy systems are equivalent to radial basis function networks (RBFNs). Methods for finding scatter partitions for RBFNs are available, and it is possible to use them for creating scatter-partitioning fuzzy systems. A fundamental problem, however, is the structure identification problem, i.e., the determination of the number of fuzzy rules and their positions in the input space. The supervised growing neural gas method uses classification or regression error to guide insertions of new RBF units. This leads to a more effective positioning of RBF units (fuzzy rule IF-parts, resp.) than achievable with the commonly used unsupervised clustering methods. Example simulations of the new approach are shown demonstrating superior behavior compared with grid-partitioning fuzzy systems and the standard RBF approach of Moody and Darken.

Keywords: RBFN, fuzzy system, incremental learning, normalization, generalization
1. GRID-PARTITIONING FUZZY SYSTEMS
In this introductory section we recall some fundamental properties of fuzzy systems. Moreover, we describe the standard approach of partitioning the input space of a fuzzy system using a rectangular grid. Fuzzy systems can model continuous input/output relationships. The purpose can be, for example, function approximation (regression), non-linear control, or pattern classification. In the latter case the output of a fuzzy system can sometimes be interpreted as posterior class probabilities which can be used to achieve a classification by means of the standard Bayes rule. A basic component of a fuzzy system is a fuzzy rule. Sometimes these rules are expressed using linguistic labels, such as the rule

  IF (pressure is low) AND (temperature is medium) THEN (valve opening is 0.7).   (1)
Fuzzy membership functions (MFs) associate linguistic labels (e.g. low) with a particular area of one of the input or output variables (e.g. "pressure"). In the example shown above the THEN-part of the rule does not consist of a membership variable but of the "crisp" value 0.7. Fuzzy systems consisting of this kind of rule are called zero-order Sugeno fuzzy systems. In an n-th order Sugeno fuzzy system the THEN-part of each rule consists of a polynomial of degree n in the input variables. In this article we concentrate on zero-order Sugeno fuzzy systems since they have the interesting property of being equivalent to radial basis function networks (RBFNs), as is discussed in section 4. Different shapes of the MFs have been proposed, such as triangular, trapezoidal, or Gaussian. For the following discussion we assume the MFs to have the form of a Gaussian

  m_i(x_j) = exp( −(c_i − x_j)² / (2σ_i²) ).

Thereby, i denotes the index among the different MFs defined for this variable. The variables c_i and σ_i are the center and the width of the Gaussian, respectively. Usually only a moderate number of MFs is defined for each input variable (see Fig. 1 for examples).
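To make the MF definition concrete, the following small Python sketch (an illustration added here, not code from the paper) evaluates three Gaussian MFs for the pressure variable; the label names, centers, and σ value are assumptions chosen to mirror the example of Fig. 1a (spacing 1.0, σ = 0.4).

```python
import numpy as np

def gaussian_mf(x, center, sigma):
    """Gaussian membership function m_i(x) = exp(-(c_i - x)^2 / (2 sigma_i^2))."""
    return np.exp(-(x - center) ** 2 / (2.0 * sigma ** 2))

# Three MFs for the input variable "pressure", spaced 1.0 apart, sigma = 0.4
# (label names and parameter values chosen to mirror Fig. 1a)
pressure_mfs = {"low": 1.0, "medium": 2.0, "high": 3.0}
sigma = 0.4

x = 1.7  # a crisp pressure reading
for label, center in pressure_mfs.items():
    print(label, round(gaussian_mf(x, center, sigma), 3))
# -> low 0.216, medium 0.755, high 0.005
```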
Figure 1. Gaussian MFs: a) pressure MFs (labels low, medium, high), σ = 0.4; b) temperature MFs (labels low, low-medium, medium, medium-high, high), σ = 0.2. Different values of σ lead to differently strong overlap between neighboring MFs. The spacing between MFs is set to 1.0 in both examples.
Figure 2. Normalized Gaussian MFs (all parameters as in Fig. 1): a) pressure MFs, normalized; b) temperature MFs, normalized. Due to the normalization the sum of all MFs equals one at each position. This leads (in case of equal width parameters σ, see text) to sigmoidal shapes for the "outer" MFs, which now enclose all smaller or larger values, respectively. If the MFs overlap only slightly, normalization leads to a flat plateau in the center of each MF which falls off steeply at the border to the next MF (b). In the limiting case (σ → 0) the MFs get step borders and in fact become crisp (non-fuzzy) variables.
Often it is desired that the membership values for one input variable add to unity everywhere. This can be achieved by dividing the membership values m_i(x) by the sum of all membership values, leading to normalized MFs

  M_i(x) = m_i(x) / Σ_j m_j(x).
Fig. 2 illustrates how normalization transforms the MFs shown in Fig. 1. The "outer" MFs now extend to very negative and very positive values, respectively. For the pressure MF, e.g., the linguistic labels low, medium, and high are distinguished. Interpreting all pressure values smaller than the center of the low MF also as low is consistent with everyday language use. One should note, however, that only for equal values σ_i is it always the nearest normalized MF which has the highest value. If the values of the σ_i differ, then normalization leads to exponentially fast dominance of the broadest Gaussian when the input value leaves the convex hull of the Gaussian centers in any direction. The mathematical reason for this is that for any two Gaussians centered at c_1 and c_2 with width parameters σ_1 and σ_2 the following statement holds (as can be shown easily):

  lim_{|x|→∞}  exp(−(x − c_1)²/(2σ_1²)) / [ exp(−(x − c_1)²/(2σ_1²)) + exp(−(x − c_2)²/(2σ_2²)) ] = 1   for σ_1 > σ_2.

Thus, in the case of differently wide Gaussians and normalization, the broadest Gaussian dominates all other ones if the input value x gets very large or very small.
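This effect is easy to check numerically. The following sketch (added for illustration, with arbitrarily chosen centers and widths) normalizes two Gaussians of different width and shows that far away from the centers the broader one approaches 1 after normalization.

```python
import numpy as np

def gaussian(x, c, sigma):
    return np.exp(-(x - c) ** 2 / (2.0 * sigma ** 2))

def normalized_mfs(x, params):
    """Normalized MFs M_i(x) = m_i(x) / sum_j m_j(x)."""
    m = np.array([gaussian(x, c, s) for c, s in params])
    return m / m.sum()

# Two Gaussians with different widths: sigma_1 = 1.0 > sigma_2 = 0.5
params = [(0.0, 1.0), (2.0, 0.5)]

for x in [1.8, 5.0, 20.0, -20.0]:
    print(x, normalized_mfs(x, params))
# Near its center the narrower MF can still have the highest value, but far
# outside the convex hull of the centers the broader Gaussian dominates:
# its normalized value tends to 1 for |x| -> infinity.
```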
Figure 3. IF-part (or rule patch) of the linguistic fuzzy rule (1) over the pressure-temperature input plane: a) non-normalized; b) normalized. For the normalization it is assumed that all 3 × 5 = 15 possible fuzzy rule positions are actually occupied in the fuzzy system.
After the definition of MFs one can formulate rules in terms of linguistic values. In order to combine sub-expressions concerned with different input variables, fuzzy operators such as fuzzy AND (a.k.a. T-norm) or fuzzy OR (a.k.a. T-conorm) are used. We focus our attention on rules which combine their sub-expressions by fuzzy AND. In the case of Gaussian MFs the fuzzy AND can be realized by the arithmetic product. In this case the IF-part of each fuzzy rule can be described by an n-dimensional Gaussian:

  g(x) = exp( −(1/2) (x − c)^T C^{−1} (x − c) ).   (2)

Thereby, x is the n-dimensional input vector, and C^{−1} = diag(1/σ_1², 1/σ_2², ..., 1/σ_n²) is the inverse of the covariance matrix C corresponding to the product of n one-dimensional Gaussians with variances σ_i². The vector c = (c_1, ..., c_n)^T is the center of the n-dimensional product Gaussian and has the corresponding centers of the one-dimensional factor Gaussians as components. As for the one-dimensional MFs, the IF-parts of all fuzzy rules in the fuzzy system can also be normalized. If the fuzzy system consists of N rules, the normalized Gaussians can be written as

  G_i(x) = g_i(x) / Σ_{j=1}^{N} g_j(x).   (3)
The normalization ensures that for each input signal x the sum of all rule activations adds up to unity:

  Σ_{j=1}^{N} G_j(x) = 1.
Each rule is considerably activated only in a particular area of the input space. One can think of the input space as being covered by small patches, each of them corresponding to one fuzzy rule. We will denote the IF-parts of fuzzy rules as rule patches. In Fig. 3 the rule patch of the example fuzzy rule (1) is shown depending on whether normalization takes place or not.

In general, a fuzzy system can have a multi-dimensional output. The i-th output component of a fuzzy system without normalization is given as

  O_i(x) = Σ_{j=0}^{N} w_{ij} g_j(x).

Thereby, w_{ij} is the THEN-part (or output weight) of the fuzzy rule j with respect to the output component i. The extra term g_0(x) is constantly set to 1.0, and its associated output weight w_{i0} usually codes the average value of the component i, so that the other weights only have to encode deviations from this average value. It is also possible to leave out the bias term, in which case the remaining terms have to encode the complete input/output mapping. If normalized n-dimensional Gaussians are used, the i-th output component of the fuzzy system is

  O_i(x) = Σ_{j=1}^{N} w_{ij} g_j(x) / Σ_{j=1}^{N} g_j(x) = Σ_{j=1}^{N} w_{ij} G_j(x).
In this case no bias unit is needed and the output is a weighted average of the output weights. The output weights w_{ij} can be set manually by domain experts. Alternatively, a given training data set

  D = { (ξ^i, ζ^i) | i = 1, ..., M },   ξ^i ∈ R^n, ζ^i ∈ R^m,

could be used to set them appropriately. The output vectors ζ = (ζ_1, ..., ζ_m)^T have a different meaning depending on the general kind of problem. For classification problems the output vector often contains a 1-out-of-m encoding of the class associated with an input data point ξ. In this case it usually consists of a single '1' and (m − 1) '0's. For regression problems ζ contains the desired m-dimensional continuous output. A common goal in both cases is to find output weights that minimize the summed squared error
  E = (1/2) Σ_{i=1}^{M} Σ_{j=1}^{m} ( O_j(ξ^i) − ζ_j^i )².   (4)
If the IF-parts of the fuzzy rules (the Gaussian centers and covariance matrices) are fixed, the determination of suitable weights w_{ij} reduces to a linear least squares problem which can be readily solved by standard matrix techniques like the singular value decomposition.1
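A minimal numerical sketch of this least-squares step (written for this text, not taken from the paper): the rule centers and widths are assumed fixed, here simply drawn at random, the activations follow (2) and (3) with diagonal covariance, and NumPy's least-squares solver plays the role of the pseudo-inverse/SVD computation. All variable names and the toy target function are illustrative.

```python
import numpy as np

def rule_activations(X, centers, sigmas, normalize=True):
    """Activations of N rule patches (n-dim Gaussians with diagonal covariance,
    eq. (2)) for the M input vectors in X, optionally normalized as in eq. (3)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2 / sigmas[None, :, :] ** 2).sum(-1)
    G = np.exp(-0.5 * d2)                         # g_j(x), shape (M, N)
    if normalize:
        G = G / G.sum(axis=1, keepdims=True)      # G_j(x)
    return G

# Toy problem: M = 200 samples, n = 2 inputs, one continuous output
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
t = np.sin(2 * np.pi * X[:, 0]) * X[:, 1]         # illustrative regression targets

centers = rng.uniform(0.0, 1.0, size=(9, 2))      # N = 9 fixed rule patches
sigmas = np.full((9, 2), 0.25)                    # fixed per-dimension widths

G = rule_activations(X, centers, sigmas)          # design matrix, shape (M, N)
w, *_ = np.linalg.lstsq(G, t, rcond=None)         # output weights minimizing eq. (4)
print("training RMSE:", np.sqrt(np.mean((G @ w - t) ** 2)))
```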
2. CRITIQUE OF GRID-PARTITIONING APPROACH
The grid-partitioning approach to fuzzy systems has the advantage that all fuzzy rules can be formulated with a relatively small number of linguistic labels. These labels are heavily re-used in different rules. If each input variable is characterized by k MFs and the input is n-dimensional, then there are k^n possible fuzzy IF-parts, but only k · n different linguistic labels are needed to generate them. There is, however, a serious disadvantage of the grid-partitioning approach: the very regular partition of the input space may be unable to produce a rule set of acceptable size which handles a given data set well. If, for example, the data contains regions with several small clusters of different classes, then small rule patches have to be created to classify the data in this region correctly. Due to the grid partition, however, the fine resolution needed in this particular area is propagated also to areas which are much easier to handle, perhaps because they only contain data belonging to a single class. The grid-partitioning approach enforces a large number of small identical rule patches, although one large patch would theoretically be able to classify the data in this region correctly. The problem is illustrated in Fig. 5. It becomes even more serious as the dimension of the input data increases. The fine partition needed for some problems requires the specification of correspondingly many linguistic labels, too. A limit of understandability is reached at some point, e.g. when more than 12 linguistic labels have to be defined for one variable. Eventually one might be forced to number the labels, which would make it completely impossible to describe the fuzzy system with rules that are intuitively comprehensible.
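The k^n versus k · n scaling can be checked with a few lines of Python (an added illustration; k = 5 labels per variable is an arbitrary choice):

```python
# Number of needed linguistic labels (k*n) vs. possible grid rule IF-parts (k**n)
k = 5
for n in (2, 4, 8, 16):
    print(f"n = {n:2d}: {k * n:3d} labels, {k ** n:,} possible rule patches")
```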
3. SCATTER-PARTITIONING FUZZY SYSTEMS
To eliminate the problems associated with grid partitioning, other ways of dividing the input space into rule patches have been proposed. One approach, known as scatter partitioning, is to allow the IF-parts of the fuzzy rules to be positioned at arbitrary locations in input space. If the rules are represented by n-dimensional Gaussians (2) or normalized Gaussians (3), this means that the centers of the Gaussians are no longer confined to the corners of a rectangular grid. Rather, they can be chosen freely, e.g., by a clustering algorithm working on the training data. If the covariance matrices C_i of the Gaussians are diagonal, then they can still be thought of as being generated as a product of n one-dimensional Gaussian MFs. Each MF, however, is now only used by a single rule, so that the number of different MFs for one input variable may be as high as the total number of fuzzy rules. Therefore, assigning meaningful linguistic labels to these MFs is impossible in all but trivial cases.

A problem to be solved with scatter partitions is to find a suitable number of rules and suitable positions and widths of the rule patches in input space. In this context it is helpful to note that the overall architecture of a fuzzy system with Gaussian MFs and scalar THEN-parts of the fuzzy rules is equivalent to an RBFN, as has been noted by Jang.2 We briefly recapitulate the findings of Jang and then proceed with methods to determine the parameters of a scatter-partitioning fuzzy system.

Figure 4. Classification with a grid-partitioning fuzzy system. a) Two-dimensional data set in the unit square consisting of two classes A (white) and B (black). b) Fuzzy system with 12 rules arranged in a 4 × 3 grid. c) Non-normalized fuzzy system: estimated a posteriori probability that the data belongs to class A. The box shown is the scaled unit cube. One can note that some output values are higher than one, thereby violating the properties of a probability. In regions with no training data the fuzzy system output falls off to a non-zero plateau which corresponds to the bias. Since the training data contains more exemplars from class B, the bias has a value smaller than 0.5. d) Normalized fuzzy system: estimated a posteriori probability that the data belongs to class A. e) Classification according to the non-normalized fuzzy system. The system is unable to classify all training examples correctly. f) Classification according to the normalized fuzzy system. Also here some training examples are misclassified. The probable reason for the misclassifications is the predefined resolution of the fuzzy rules, which is slightly too coarse for the given data set.
Figure 5. Classification with a grid-partitioning fuzzy system. a) Two-dimensional data set in the unit square consisting of two classes A (white) and B (black). b) Fuzzy system with 64 rules arranged in an 8 × 8 grid. c) Non-normalized fuzzy system: estimated a posteriori probability that the data belongs to class A. d) Normalized fuzzy system: estimated a posteriori probability that the data belongs to class A. e) Classification according to the non-normalized fuzzy system. f) Classification according to the normalized fuzzy system.
4. EQUIVALENCE OF FUZZY SYSTEMS AND RADIAL BASIS FUNCTION NETWORKS
Jang2 showed that under certain mild restrictions a fuzzy inference system is equivalent to an RBFN. The following conditions must be fulfilled for the equivalence to hold:
- The number of RBF units is equal to the number of fuzzy IF-THEN rules.
- The output of each fuzzy rule is a constant (the fuzzy system is a zero-order Sugeno fuzzy system).
- The MFs within each rule are chosen as Gaussian functions with the same variance.
- The T-norm operator used to compute the activation of each rule is multiplication.
- Both the RBFN and the fuzzy inference system under consideration use the same method to derive their overall output, i.e. weighted average (with normalization) or weighted sum (without normalization).
The generalization from strictly radial basis functions to ones with a diagonal covariance matrix with possibly different elements in the diagonal is straightforward. In this case the squared Euclidean distance ‖x − c‖² used for computing the Gaussian activations must be replaced by the Mahalanobis distance (x − c)^T C^{−1} (x − c), using the inverse of the covariance matrix C of the respective Gaussian unit.
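A quick numerical check of this equivalence (an added illustration with arbitrarily chosen centers and widths): the product of one-dimensional Gaussian MFs, i.e. the fuzzy AND realized by multiplication, gives exactly the same activation as a single n-dimensional Gaussian with diagonal covariance evaluated via the Mahalanobis distance.

```python
import numpy as np

# Product of one-dimensional Gaussian MFs (the fuzzy AND via multiplication) ...
def rule_if_part_product(x, centers, sigmas):
    return np.prod(np.exp(-(x - centers) ** 2 / (2.0 * sigmas ** 2)))

# ... equals a single n-dimensional Gaussian with diagonal covariance C,
# evaluated via the Mahalanobis distance (x - c)^T C^{-1} (x - c).
def rule_if_part_gaussian(x, centers, sigmas):
    d2 = np.sum((x - centers) ** 2 / sigmas ** 2)
    return np.exp(-0.5 * d2)

x = np.array([1.1, 3.7])              # (pressure, temperature) - illustrative values
c = np.array([1.0, 4.0])              # centers of the factor Gaussians
s = np.array([0.4, 0.2])              # per-dimension widths
print(rule_if_part_product(x, c, s))  # both forms yield the same activation
print(rule_if_part_gaussian(x, c, s))
```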
Figure 6. Classification with a scatter-partitioning fuzzy system (with normalized MFs) constructed by an RBFN according to Moody and Darken. The training data set is that of Fig. 4a. a) Fuzzy system with 20 rules positioned with the LBG clustering algorithm. b) Estimated a posteriori probability that the data belongs to class A. c) Classification result. Since the classes in the training data seem to have very little overlap, it would seem reasonable to require that the fuzzy system map nearly all training data points to the correct class. However, the distribution of rule centers found by LBG contains only 3 rules in the upper right corner of the shown part of the input space, and 4 would be necessary in this case. Since LBG is initialized at random from the data set, one can expect a distribution of the centers proportional to the probability density. Considering this, the fraction of 3/20 = 0.15 rules positioned in the upper right is already more than the value one can expect looking at the given data, namely 4/36 ≈ 0.11. The underlying problem is that LBG does not take the desired outputs for the data into account.
Keeping the equivalence between Sugeno fuzzy systems and RBFNs in mind we now discuss a standard RBFN training method. Thereafter, we describe an incremental RBFN which, according to Jang's results, is also an incremental neuro-fuzzy system.
5. RADIAL BASIS FUNCTION NETWORKS ACCORDING TO MOODY AND DARKEN
Moody and Darken3 have proposed a multi-phase approach to RBFNs. First, a pre-defined number of centers is distributed in input space with a clustering method (e.g., LBG4 or k-means5). The width parameters of the Gaussians are set by a local heuristic, e.g., setting the σ of each unit equal to the distance to the nearest other unit. Moreover, Moody and Darken propose to use normalized activations according to (3). In terms of fuzzy systems, the steps just described correspond to the identification of the IF-parts of the Sugeno fuzzy rules. The THEN-parts of the fuzzy rules or, alternatively, the output weights of the RBFN, are set by pseudo-inverse computation such that the summed squared error (4) for a given training data set is minimized. It is also possible, but usually has no advantage, to compute the output weights iteratively through gradient descent on the error function.6 This multi-phase approach is straightforward and is often reported to be much faster than, e.g., backpropagation training of multi-layer perceptrons on the same data.

A possible problem of the approach, which has for example been noted by Bishop,7 is that the clustering is completely unsupervised and does not take the given desired output information (class labels or continuous output values) into account. Clustering methods usually try to minimize the mean distance between the centers they distribute and the given data (which is only the input part of the training data). This error, however, is of little relevance to many supervised learning problems. The resulting distribution of RBF centers (or rule patches) may, therefore, be poor for the classification or regression problem at hand. Fig. 6 shows an example where this is the case. If one visually analyzes the generated neuro-fuzzy system in Fig. 6, it becomes obvious that at several places there are rules covering neighboring areas and having basically the same output. Such rules could be combined into fewer rules, each covering a correspondingly larger area of the input space. This would free resources which could be used in places where the system can benefit from them more, for example in the upper right corner of the displayed part of the input space. Instead of first constructing a possibly poor neuro-fuzzy system and then improving it later, it would be advantageous if one could immediately build a good system for the problem at hand. This is the goal of the method described in the following section.
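A compact sketch of the multi-phase scheme in the spirit of Moody and Darken (my own illustration, not their code): a plain Lloyd/k-means loop stands in for LBG or k-means, the width heuristic sets σ to the distance to the nearest other center, and the output weights are obtained by linear least squares on the normalized activations. Function and parameter names are assumptions.

```python
import numpy as np

def moody_darken_rbfn(X, T, n_centers=20, n_iter=50, seed=0):
    """Multi-phase RBFN sketch: unsupervised center placement (plain k-means),
    width heuristic, then least-squares output weights on normalized activations."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_centers, replace=False)].copy()

    # Phase 1: k-means clustering of the inputs (the targets T are ignored here)
    for _ in range(n_iter):
        assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for k in range(n_centers):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)

    # Phase 2: width heuristic - sigma of each unit = distance to nearest other center
    dist = np.sqrt(((centers[:, None] - centers[None]) ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)
    sigmas = dist.min(axis=1)

    # Phase 3: normalized Gaussian activations and pseudo-inverse output weights
    G = np.exp(-((X[:, None] - centers[None]) ** 2).sum(-1) / (2.0 * sigmas ** 2))
    G = G / G.sum(axis=1, keepdims=True)
    W, *_ = np.linalg.lstsq(G, T, rcond=None)
    return centers, sigmas, W
```

For a classification problem like that of Fig. 4a, T would contain the 1-out-of-m class encodings; note that phase 1 never looks at T, which is exactly the weakness discussed above.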
6. CREATING SCATTER-PARTITIONING FUZZY SYSTEMS WITH THE SUPERVISED GROWING NEURAL GAS MODEL
The approach to fuzzy systems described in this section has its origin in a neural network model from the area of competitive learning, the so-called unsupervised growing neural gas model.8 This method can be described as follows (for a broader discussion see the reference given):
- It is a vector quantization method which aims at distributing a limited number of centers w_i in the n-dimensional input space. The goal is to minimize the mean quantization error for a given data set

    D_U = { ξ^1, ..., ξ^M },   ξ^i ∈ R^n.

- Centers are adapted on-line by soft competitive learning. In particular, for each presented input signal ξ the winner s (the center nearest to the input signal) is moved towards the input signal:

    w_s := w_s + ε (ξ − w_s).

  To a smaller degree also the direct neighbors within the network topology (see below) are moved towards the input signal.
- At each adaptation step the squared distance between the current input signal and the winner is added to a local error variable of the winner:

    E_s := E_s + ‖ξ − w_s‖².   (5)

  All error variables undergo slow exponential decay to eventually forget earlier accumulated errors.
- Among the centers a topology consisting of neighborhood edges is created through competitive Hebbian learning.9 The principle of this method is to always insert an edge between the winner and the second-nearest center for the current input signal. If such an edge already exists, its age parameter is set to zero. Edges with an age larger than a maximum value are removed again. Competitive Hebbian learning has been shown to generate a subgraph of the Delaunay triangulation corresponding to the current set of centers. The Delaunay triangulation, however, is among all triangulations the best one for function interpolation.10
- A new center is always inserted after a fixed number of input signals has been processed. It is positioned at the center of one of the edges emanating from the center with the largest accumulated error. The error values are reduced and redistributed locally after each insertion (for details see Ref. 8).

This strategy leads to a growth process which reduces the mean distortion error effectively, since new centers are inserted where a large distortion error has occurred in the past. At the same time a topology is created which describes the sub-area of input space where the input data is located. Fig. 7 shows an example simulation.

The growth principle of the unsupervised growing neural gas model can be applied in a very similar fashion to supervised learning. In particular, growing RBFNs have been investigated in the past.11-13 The principle is to associate one RBF unit with each center and to accumulate the classification error or the regression error instead of the quantization error. Therefore, equation (5) is replaced by

  E_s := E_s + ‖O(ξ) − ζ‖².

Thereby, O(ξ) is the m-dimensional output vector of the network and ζ is the m-dimensional desired output for the current input vector ξ.
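The following heavily simplified Python sketch (added for illustration; not the reference implementation of Ref. 8) shows the core of one supervised growing neural gas step and of the error-driven insertion. Parameter names (eps_b, eps_n, lr, max_age), the shared width sigma, and the error redistribution on insertion are assumptions; error decay, width adaptation, and the removal of isolated units are omitted.

```python
import numpy as np

def sgng_step(centers, out_w, errors, edges, x, target, sigma,
              eps_b=0.05, eps_n=0.006, lr=0.1, max_age=50):
    """One on-line step of a (simplified) supervised growing neural gas.
    centers: (N, n) unit positions, out_w: (N, m) output weights,
    errors: (N,) accumulated errors, edges: dict {(i, j): age} with i < j."""
    d2 = ((centers - x) ** 2).sum(axis=1)
    order = np.argsort(d2)
    s1, s2 = int(order[0]), int(order[1])        # winner and second-nearest unit

    g = np.exp(-d2 / (2.0 * sigma ** 2))         # Gaussian activations
    G = g / g.sum()                              # normalized activations
    out = G @ out_w                              # network output O(x)

    errors[s1] += np.sum((out - target) ** 2)    # supervised replacement of eq. (5)
    out_w += lr * np.outer(G, target - out)      # on-line delta-rule weight update

    centers[s1] += eps_b * (x - centers[s1])     # move winner towards the input
    for (i, j) in list(edges):
        if s1 in (i, j):
            k = j if i == s1 else i
            centers[k] += eps_n * (x - centers[k])   # move direct neighbors slightly
            edges[(i, j)] += 1                       # age edges emanating from winner
            if edges[(i, j)] > max_age:
                del edges[(i, j)]
    edges[(min(s1, s2), max(s1, s2))] = 0        # competitive Hebbian learning

def insert_unit(centers, out_w, errors, edges):
    """Insert a new unit halfway along an edge emanating from the unit q with the
    largest accumulated error (towards its neighbor f with the largest error);
    assumes q has at least one topological neighbor."""
    q = int(np.argmax(errors))
    neighbors = [j if i == q else i for (i, j) in edges if q in (i, j)]
    f = max(neighbors, key=lambda k: errors[k])
    r = len(centers)                             # index of the new unit
    centers = np.vstack([centers, 0.5 * (centers[q] + centers[f])])
    out_w = np.vstack([out_w, 0.5 * (out_w[q] + out_w[f])])
    errors[q] *= 0.5                             # redistribute error locally
    errors[f] *= 0.5
    errors = np.append(errors, errors[q])
    edges.pop((min(q, f), max(q, f)), None)      # replace edge q-f by q-r and f-r
    edges[(min(q, r), max(q, r))] = 0
    edges[(min(f, r), max(f, r))] = 0
    return centers, out_w, errors, edges
```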
Figure 7. The unsupervised growing neural gas model adapts to a signal distribution which has different dimensionalities in different areas of the input space. Shown are intermediate stages after a) 0, b) 600, c) 1800, d) 5000, e) 15000, and f) 20000 processed input signals. The last network shown is not necessarily the final one since the growth process could in principle be continued indefinitely. (from Ref. 8)

Interpreted verbally, the classification or regression error for the current input signal is accumulated at the nearest Gaussian. After a while, those Gaussians can be identified which are unable to correctly map the data in their vicinity. Inserting new Gaussians nearby increases the chance that this data is correctly mapped in the future. Further details of the method are:
- The output weights are trained on-line with the delta-rule.6 Due to the on-line nature of the growing neural gas model this is more appropriate (computationally less complex) than repeatedly solving the complete linear system with matrix techniques. At the end of the growth process, however, it is advantageous to compute the optimal weights with least-squares methods once.
- The width parameter σ_i of each Gaussian is set to a fraction (e.g. 0.5) of the mean length of all edges emanating from this unit. This leads to a similar relative overlap of neighboring Gaussians.
- The growth process has to be terminated by a suitable criterion to prevent over-fitting. One possibility is to split the available data into training and validation data. Only the training data is used to determine the network parameters. In parallel, however, the performance on the validation data is observed, and the growth process is stopped when this measure does not improve anymore.

In Fig. 8 the results of the described method are shown for the data from Fig. 4a. It can be seen that the insertions based on accumulated classification error led to a small network (fuzzy system, resp.) which classifies the training data well.

The proposed method is equally well applicable to regression problems. Fig. 9 shows a data set which stems from sampling a Gabor wavelet in a square region of the input space. In Fig. 10 the results for grid-partitioning fuzzy systems and scatter-partitioning fuzzy systems constructed with a supervised growing neural gas are shown. Fuzzy systems of three different sizes were created. The approximation error for the incrementally generated fuzzy systems was consistently lower. The obvious reason for this is the non-uniform distribution of rules found by the growing neural gas network, which matches the non-linearity in the training data well. One should note that the data density in the example is perfectly uniform, so that any reasonable clustering method (like LBG) would have distributed a given number of rules in a uniform fashion, too. Accordingly, for this problem the results of Moody and Darken's approach are very similar to those of a grid-partitioning fuzzy system and are not displayed separately.
Figure 8. Classification with a scatter-partitioning fuzzy system (with normalized MFs) constructed by the supervised growing neural gas approach according to the author. The training data set is that of Fig. 4a. a) Fuzzy system with 16 rules positioned by the classification-error-driven growing neural gas insertion scheme. b) Estimated a posteriori probability that the data belongs to class A. c) Classification result. Even though the system has only 16 rules, it handles the region in the upper right well. One can compare this with the fuzzy system in Fig. 6, which has 20 rules but is unable to map all data in the mentioned region correctly. As was discussed in the caption of Fig. 6, the failure of Moody and Darken's approach in that simulation is no exception. Rather, this must be expected given the data distribution and the random initialization of the rule centers.
Figure 9. Sampled Gabor wavelet as regression training data set: a) data distribution in input space; b) data in combined input/output space. 50 × 50 = 2500 points were generated.
7. DISCUSSION AND OUTLOOK
In this article we first tried to illustrate some problems associated with the traditional grid-partitioning approach to fuzzy systems. In doing so, we also discussed some aspects related to the normalization of MFs. Scatter-partitioning fuzzy systems are a possible way to overcome the outlined disadvantages, in particular the poor scaling behavior with increasing input dimensionality. The well-known equivalence of zero-order Sugeno fuzzy systems and RBFNs suggests using methods developed in the context of RBFNs to construct scatter-partitioning fuzzy systems. The standard method of Moody and Darken has the disadvantage that it uses unsupervised clustering methods to position the RBF units (fuzzy rule IF-parts, resp.). This tends to minimize the quantization error for the given input data, but the resulting network configurations can be poor for supervised learning. The supervised growing neural gas method is able to incrementally build up a network structure during the training process. Since it positions new units on the basis of accumulated classification or regression error, it can find configurations well adapted to the given data. In the examples investigated this resulted in better performance for networks/fuzzy systems of the same size, or in smaller networks to reach a given performance.
Figure 10. Regression with neuro-fuzzy systems. The underlying data is the one displayed in Fig. 9. Each row shows two systems with an equal number of rules: on the left traditional grid-partitioning fuzzy systems, on the right systems constructed with the supervised growing neural gas approach; the two center columns show the respective output of the networks. The given error measure is the square root of the MSE for the complete dataset of 2500 samples. It gives an indication of how different on average the system output is from the true data. The error values produced by the supervised growing neural gas approach are consistently lower. This can be attributed to the ability of this method to position more rules in areas where the given data is difficult to map. In this example the central region of the training data is most difficult. The errors shown in the figure are:

  rules | grid-partitioning fuzzy system | GNG-constructed scatter-partitioning fuzzy system
  16    | 0.134 (4 × 4 grid)             | 0.120
  49    | 0.101 (7 × 7 grid)             | 0.020
  100   | 0.023 (10 × 10 grid)           | 0.007
An obvious question is how the supervised growing neural gas approach copes with high-dimensional problems. This is currently being investigated in the context of the ICE ROUTES project of the European Union, which is concerned with the classification of arctic sea ice and with automated ship routing based on this classification. The most important training data types are SAR (synthetic aperture radar) images and weather information. Feature vectors extracted from this data (e.g., with Gabor filters such as the one shown in Fig. 9) can easily have 40 dimensions or more. Experts provide classifications of the training images into ice classes. Field trips (with ice-breakers) provide actual ground truth information for a small fraction of the data. Results obtained so far are preliminary but already promising.

There are several open research questions as well as possible extensions of the supervised growing neural gas approach. One open question has to do with the error-based insertion strategy. This is very effective in finding areas in the input space where small clusters of different classes lie close to each other or where the function to approximate is highly non-linear. However, regions with overlapping classes or noisy continuous output data also generate much local error. In these regions it would be good to insert only as many units as are necessary to model the Bayes classifier or the true regression function of the data, respectively. Finding suitable criteria that allow the growth process to be stopped locally while it continues elsewhere is, therefore, an active area of investigation. A possible extension is to use a local linear mapping (instead of a constant) as the output of each fuzzy rule. This would correspond to first-order Sugeno fuzzy systems and might permit smaller systems to approximate a given data set. Another possible extension is an a posteriori merge of similar neighboring fuzzy rules. This would lead to still smaller rule sets or networks, respectively, and should further improve generalization.
ACKNOWLEDGEMENTS
For their helpful comments and careful proofreading of the manuscript we would like to thank Rolf P. Würtz, Jan C. Vorbrüggen, Michael Pötzsch and Daniel Gorinevski.
REFERENCES
1. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, Cambridge, 1992 (second edition).
2. J.-S. R. Jang and C.-T. Sun, "Functional equivalence between radial basis function networks and fuzzy inference systems," IEEE Trans. on Neural Networks 4(1), pp. 156-159, 1993.
3. J. E. Moody and C. Darken, "Fast learning in networks of locally-tuned processing units," Neural Computation 1, pp. 281-294, 1989.
4. Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communication COM-28, pp. 84-95, 1980.
5. J. MacQueen, "On convergence of k-means and partitions with minimum average variance," Ann. Math. Statist. 36, p. 1084, 1965 (abstract).
6. G. O. Stone, "An analysis of the delta rule and the learning of statistical associations," in Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, eds., vol. 1, pp. 444-459, MIT Press, Cambridge, 1986.
7. C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
8. B. Fritzke, "A growing neural gas network learns topologies," in Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen, eds., pp. 625-632, MIT Press, Cambridge, MA, 1995.
9. T. M. Martinetz, "Competitive Hebbian learning rule forms perfectly topology preserving maps," in ICANN'93: International Conference on Artificial Neural Networks, pp. 427-434, Springer, Amsterdam, 1993.
10. S. M. Omohundro, "The Delaunay triangulation and function learning," Tech. Rep. TR-90-001, International Computer Science Institute, Berkeley, 1990.
11. B. Fritzke, "Fast learning with incremental RBF networks," Neural Processing Letters 1(1), pp. 2-5, 1994.
12. B. Fritzke, "Supervised learning with growing cell structures," in Advances in Neural Information Processing Systems 6, J. Cowan, G. Tesauro, and J. Alspector, eds., pp. 255-262, Morgan Kaufmann Publishers, San Mateo, CA, 1994.
13. B. Fritzke, "Transforming hard problems into linearly separable ones with incremental radial basis function networks," in HELNET 94/95: Proceedings of the International Workshop on Neural Networks, M. van der Heyden, J. Mrsic-Flogel, and K. Weigel, eds., pp. 54-63, VU University Press, Amsterdam, 1996.