KOHONEN MAPS
Edited by
ERKKI OJA and SAMUEL KASKI
Neural Networks Research Centre, Helsinki University of Technology, P.O. Box 5400, FIN-02015 HUT, Finland
1999
ELSEVIER
AMSTERDAM - LAUSANNE - NEW YORK - OXFORD - SHANNON - SINGAPORE - TOKYO

ELSEVIER SCIENCE B.V., Sara Burgerhartstraat 25, P.O. Box 211, 1000 AE Amsterdam, The Netherlands
© 1999 Elsevier Science B.V. All rights reserved.
This work is protected under copyright by Elsevier Science, and the following terms and conditions apply to its use: Photocopying Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use. Permissions may be sought directly from Elsevier Science Rights & Permissions Department, PO Box 800, Oxford OX5 1DX, UK; phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail:
[email protected]. You may also contact Rights & Permissions directly through Elsevier's home page (http://www.elsevier.nl), selecting first 'Customer Support', then 'General Information', then 'Permissions Query Form'. In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (978) 7508400, fax: (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 171 631 5555; fax: (+44) 171 631 5500. Other countries may have a local reprographic rights agency for payments. Derivative Works Tables of contents may be reproduced for internal circulation, but permission of Elsevier Science is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations. Electronic Storage or Usage Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter. Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier Science Rights & Permissions Department, at the mail, fax and e-mail addresses noted above. Notice No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
First edition 1999
Library of Congress Cataloging in Publication Data
A catalog record from the Library of Congress has been applied for.
ISBN: 0 444 50270 X
The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper). Printed in The Netherlands.
Preface: Kohonen Maps
Professor Teuvo Kohonen is known worldwide as a leading pioneer in neurocomputing. His research interests include the theory of self-organization, associative memories, neural networks, and pattern recognition, on which he has published over 200 research papers and four monograph books. His influence on contemporary research is indicated by the more than 3300 recent publications worldwide in which the most central of his ideas, the Self-Organizing Map (SOM), also known as the Kohonen Map, is analyzed and applied to data analysis and pattern recognition problems. Kohonen's research on the Self-Organizing Map began in early 1981. What was required was an efficient algorithm that would map similar patterns, given as vectors close to each other in the input space, onto contiguous locations in the output space. Numerous experiments were made with the SOM, including the formation of phoneme maps for use in speech recognition. Extensions to supervised learning tasks, the Supervised SOM and Learning Vector Quantization (LVQ) algorithms, brought further improvements. The SOM algorithm was one of the strong underlying factors in the new popularity of neural networks starting in the early 80's. It is the most widely used neural network learning rule in the class of unsupervised algorithms, and has been implemented in a large number of commercial and public domain neural network software packages. The best sources of details and applications of the SOM are Kohonen's books "Self-Organization and Associative Memory" (Springer, 1984) and "Self-Organizing Maps" (Springer, 1995; 2nd extended edition, 1997). Recently, Teuvo Kohonen has been working on a new type of feature extraction algorithm, the Adaptive-Subspace SOM (ASSOM), which combines the old Learning Subspace Method and the Self-Organizing Map. He has shown how invariant feature detectors, for example the well-known wavelet filters for digital images and signals, emerge automatically in the ASSOM. In another sample of his present research, the SOM algorithm is applied to organize large collections of free-form text documents like those available on the Internet. The method is called WEBSOM. In the largest application of the WEBSOM
implemented so far, reported in this book for the first time, about 7 million documents have been organized. Teuvo Kohonen will retire from his office at the Academy of Finland in July, 1999, although he will not retire from his research work. With the decade drawing to an end, during which neural networks in general and the Kohonen Map in particular attained high visibility and much success, it may be time to take a look at the state of the art and the future. Therefore, we decided to organize a high-level workshop in July 1999 on the theory, methodology and applications of the SOM, to celebrate this occasion. Many of the top experts in the field accepted our invitation to participate and submit articles covering their research. The result is contained in this book, expertly compiled and printed by Elsevier. The 30 chapters of this book cover the current status of SOM theory, such as connections of SOM to clustering, vector quantization, classification, and active learning; relation of SOM to generative probabilistic models; optimization strategies for SOM; and energy functions and topological ordering. Most of the chapters, however, are focussed on applications of the SOM. Data mining and exploratory data analysis is a central topic, applied to large databases of financial data, medical data, free-form text documents, digital images, speech, and process measurements. Other applications covered are robotics, printed circuit board optimization and electronic circuit design, EEG classification, human voice analysis, and spectroscopy. Finally, there are a few chapters on biological models related to the SOM such as models of cortical maps and spatio-temporal memory.
Acknowledgements. We wish to thank all the people who have made the WSOM'99 workshop and hence this book possible. We are especially grateful to the rest of the organizing committee: Esa Alhoniemi, Johan Himberg, Jukka Iivarinen, Krista Lagus, Markus Peura, Olli Simula, and Juha Vesanto. Finally, we thank the Academy of Finland for financial support.
Espoo, Finland, April 23, 1999
Erkki Oja
Samuel Kaski
Table of contents
Preface: Kohonen Maps ........ v
Table of contents ........ vii
Analyzing and representing multidimensional quantitative and qualitative data: Demographic study of the Rhône valley. The domestic consumption of the Canadian families.
    M. Cottrell, P. Gaubert, P. Letremy, P. Rousset ........ 1
Value maps: Finding value in markets that are expensive
    G. J. Deboeck ........ 15
Data mining and knowledge discovery with emergent Self-Organizing Feature Maps for multivariate time series
    A. Ultsch ........ 33
From aggregation operators to soft Learning Vector Quantization and clustering algorithms
    N. B. Karayiannis ........ 47
Active learning in Self-Organizing Maps
    M. Hasenjäger, H. Ritter, K. Obermayer ........ 57
Point prototype generation and classifier design
    J. C. Bezdek, L. I. Kuncheva ........ 71
Self-Organizing Maps on non-Euclidean spaces
    H. Ritter ........ 97
Self-Organising Maps for pattern recognition
    N. M. Allinson, H. Yin ........ 111
Tree structured Self-Organizing Maps
    P. Koikkalainen ........ 121
Growing self-organizing networks - history, status quo, and perspectives
    B. Fritzke ........ 131
Kohonen Self-Organizing Map with quantized weights
    P. Thiran ........ 145
On the optimization of Self-Organizing Maps by genetic algorithms
    D. Polani ........ 157
Self organization of a massive text document collection
    T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, A. Saarela ........ 171
Document classification with Self-Organizing Maps
    D. Merkl ........ 183
Navigation in databases using Self-Organising Maps
    S. A. Shumsky ........ 197
A SOM-based sensing approach to robotic manipulation tasks
    E. Cervera, A. P. del Pobil ........ 207
SOM-TSP: An approach to optimize surface component mounting on a printed circuit board
    H. Tokutaka, K. Fujimura ........ 219
Self-Organising Maps in computer aided design of electronic circuits
    A. Hemani, A. Postula ........ 231
Modeling self-organization in the visual cortex
    R. Miikkulainen, J. A. Bednar, Y. Choe, J. Sirosh ........ 243
A spatio-temporal memory based on SOMs with activity diffusion
    N. R. Euliano, J. C. Principe ........ 253
Advances in modeling cortical maps
    P. G. Morasso, V. Sanguineti, F. Frisone ........ 267
Topology preservation in Self-Organizing Maps
    T. Villmann ........ 279
Second-order learning in Self-Organizing Maps
    R. Der, M. Herrmann ........ 293
Energy functions for Self-Organizing Maps
    T. Heskes ........ 303
LVQ and single trial EEG classification
    G. Pfurtscheller, M. Pregenzer ........ 317
Self-Organizing Map in categorization of voice qualities
    L. Leinonen ........ 329
Chemometric analyses with Self Organising Feature Maps: A worked example of the analysis of cosmetics using Raman spectroscopy
    R. Goodacre, N. Kaderbhai, A. C. McGovern, E. A. Goodacre ........ 335
Self-Organizing Maps for content-based image database retrieval
    E. Oja, J. Laaksonen, M. Koskela, S. Brandt ........ 349
Indexing audio documents by using latent semantic analysis and SOM
    M. Kurimo ........ 363
Self-Organizing Map in analysis of large-scale industrial systems
    O. Simula, J. Ahola, E. Alhoniemi, J. Himberg, J. Vesanto ........ 375
Keyword index ........ 389
Analyzing and representing multidimensional quantitative and qualitative data: Demographic study of the Rhône valley. The domestic consumption of the Canadian families.

Marie Cottrell, Patrice Gaubert, Patrick Letremy, Patrick Rousset
SAMOS-MATISSE, Université Paris 1, 90, rue de Tolbiac, F-75634 Paris Cedex 13, France

1. INTRODUCTION

The SOM algorithm is now extensively used for data mining, representation of multidimensional data and analysis of relations between variables ([1], [2], [5], [9], [11], [12], [13], [15], [16], [17]). Compared with any other classification method, the main characteristic of the SOM classification is the conservation of the topology: after learning, "close" observations are associated with the same class or with "close" classes according to the definition of the neighborhood in the SOM network. This feature allows the resulting classification to be considered a good starting point for further developments, as shown in what follows. But in fact its capabilities have not been fully exploited so far. In this chapter, we present some of the techniques that can be derived from the SOM algorithm: the representation of the class contents, the visualization of the distances between classes, a rapid and robust two-level classification based on the quantitative variables, the computation of clustering indicators, and the crossing of the classification with some qualitative variables to interpret the classification and bring out the most important explanatory variables. See [3], [4], [8], [9] for precise definitions of all these techniques. We also define two original algorithms (KORRESP and KACM) to analyze the relations between qualitative variables. The paper is organized as follows: in sections 2 and 3 we present the main tools; in sections 4 and 5 we show real applications in socio-economic fields.

2. THE MAIN TECHNIQUES

Let us give some notation: we consider a set of N observations, where each individual is described by P quantitative real-valued variables and K qualitative variables. The main tool is a Kohonen network, generally a two-dimensional grid with n units, but the method can be used with any topological organization of the network. After learning, each unit i is represented in the space R^P by its weight vector C_i (or code vector). We do not examine here the delicate problem of the learning of the code vectors ([9], [16], [17]), which is supposed to be successfully realized from the N observations restricted to their P quantitative variables.
Classification: After convergence, each observation is classified by a nearest-neighbor method in R^P: an observation belongs to class i if and only if the code vector C_i is the closest among all the code vectors. The distance in R^P is in general the Euclidean distance, but it can be chosen in another way according to the application.
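As an illustration, this class assignment can be written in a few lines. This is only a sketch: it assumes the code vectors have already been learned with some SOM implementation, and the function and variable names are ours, not the authors'.

```python
import numpy as np

def assign_classes(observations, code_vectors):
    """Assign each observation to the SOM class whose code vector is nearest.

    observations : (N, P) array of the quantitative variables
    code_vectors : (n, P) array of learned code vectors C_1, ..., C_n
    Returns an (N,) array of class indices in {0, ..., n-1}.
    """
    # Squared Euclidean distance between every observation and every code vector
    d2 = ((observations[:, None, :] - code_vectors[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Example: 1000 observations described by 7 variables, classified on a 5-unit chain
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 7))
C = rng.normal(size=(5, 7))      # stands in for learned code vectors
classes = assign_classes(X, C)
print(np.bincount(classes))      # size of each class
```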
Representation of the contents: The classes are represented according to the chosen topology of the network, along a chain or on a grid, and all the elements can be drawn inside their classes. So it is possible to see how the observations change from a class to its neighbors and to appreciate the homogeneity of the classes.

Distances between classes: To highlight the true inter-class distances, following the method proposed in [6], we represent each unit by an octagon inside the cell of the SOM map. The bigger the octagon (that is, the closer it comes to the border of its cell), the nearer the code vector is to those of its neighbors. This avoids misleading interpretations and gives an idea of the discrimination of the classes.

Two-level classification: A hierarchical clustering of the n code vectors puts together the most similar SOM classes and provides a second classification into a smaller number of classes. These macro-classes create connected areas in the initial map, so the neighborhood relations between them are kept. This grouping facilitates the interpretation of the contents of the classes.

Crossing with qualitative variables: To interpret the classes according to an explanatory qualitative variable, it is valuable to study the discrete distribution of its modalities in each class. We propose to draw, for example, a frequency pie inside each cell. In this way we make clear the continuity of the classes as well as the cutoffs. We can also associate to a SOM class the most frequent modalities of a qualitative variable and in this manner give a better description of the classes.

3. ANALYSIS OF RELATIONS BETWEEN QUALITATIVE VARIABLES

Let us define here two original algorithms to analyze the relations between qualitative variables. The first one is defined only for two qualitative variables. It is called KORRESP and is analogous to the classical Correspondence Analysis. The second one is devoted to the analysis of any finite number of qualitative variables. It is called KACM and is similar to the Multiple Correspondence Analysis. See [3], [4] for previous related papers.

For both algorithms, we consider a sample of individuals and a number K of qualitative variables. Each variable k = 1, 2, ..., K has m_k possible modalities. For each individual and each variable, there is one and only one modality. If M is the total number of modalities, each individual is represented by a row M-vector with values in {0, 1}. There is only one 1 between the 1st component and the m_1-th one, only one 1 between the (m_1+1)-th component and the (m_1+m_2)-th one, and so on.

In the general case, where K > 2, the data are summarized into a Burt table, which is a cross-tabulation table. It is an M x M symmetric matrix composed of K x K blocks, such that the (k, l)-block B_kl (for k < l) is the (m_k x m_l) contingency table which crosses the variable k and the variable l. The block B_kk is a diagonal matrix whose diagonal entries are the numbers of individuals who have respectively chosen the modalities 1, 2, ..., m_k of variable k. From now on, the Burt table is denoted by B. In the case K = 2, we only need the contingency table T which crosses the two variables. In that case, we write p (resp. q) for m_1 (resp. m_2).
3.1 The KORRESP algorithm

Let K = 2. In the contingency table T, the first qualitative variable has p levels and corresponds to the rows. The second one has q levels and corresponds to the columns. The entry n_ij is the number of individuals categorized by the row i and the column j. From the contingency table, the matrix of relative frequencies (f_ij = n_ij / Σ_ij n_ij) is computed. Then the rows and the columns are normalized so that each sums to 1. The row profile r(i), 1 ≤ i ≤ p, is the discrete probability distribution of the second variable when the first variable has modality i, and the column profile c(j), 1 ≤ j ≤ q, is the discrete probability distribution of the first variable when the second variable has modality j. The classical Correspondence Analysis is a simultaneous weighted Principal Component Analysis on the row profiles and on the column profiles. The distance is chosen to be the χ² distance. In the simultaneous representation, related modalities are projected onto neighboring points.

To define the algorithm KORRESP, we build a new data matrix D: to each row profile r(i) we associate the column profile c(j(i)) which maximizes the probability of j given i, and conversely, we associate to each column profile c(j) the row profile r(i(j)) that is the most probable given j. The data matrix D is the ((p+q) x (q+p)) matrix whose first p rows are the vectors (r(i), c(j(i))) and last q rows are the vectors (r(i(j)), c(j)). The SOM algorithm is processed on the rows of this data matrix D. Note that the inputs are picked at random, alternately among the p first rows and the q last ones, and that the winning unit is computed only on the q first components in the first case and on the p last ones in the second case, according to the χ² distance. After convergence, each modality of both variables is classified into a Voronoï class. Related modalities are classified into the same class or into neighboring classes. This method gives a very quick, efficient way to analyze the relations between two qualitative variables. See [3] for real-world applications.
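The construction of the data matrix D can be sketched as follows. This is only an illustration of the construction step (the χ²-weighted SOM training on the rows of D is not shown), and the function name is ours; the example table is the contingency table of section 4 (table 4.1).

```python
import numpy as np

def korresp_matrix(T):
    """Build the KORRESP data matrix D from a p x q contingency table T.

    Row i of D (i < p) is the row profile r(i) concatenated with the column
    profile c(j(i)) of the most probable column j given row i; the last q rows
    are the row profile r(i(j)) of the most probable row given column j,
    concatenated with c(j).
    """
    T = np.asarray(T, dtype=float)
    p, q = T.shape
    r = T / T.sum(axis=1, keepdims=True)      # row profiles, each sums to 1
    c = (T / T.sum(axis=0, keepdims=True)).T  # column profiles, shape (q, p)
    D = np.empty((p + q, p + q))
    for i in range(p):
        j_star = r[i].argmax()                # most probable column given row i
        D[i] = np.concatenate([r[i], c[j_star]])
    for j in range(q):
        i_star = c[j].argmax()                # most probable row given column j
        D[p + j] = np.concatenate([r[i_star], c[j]])
    return D

# Example with the 5 x 6 contingency table of section 4
T = [[223, 53, 26, 9, 0, 0],
     [85, 112, 86, 34, 7, 1],
     [80, 100, 112, 113, 54, 22],
     [29, 50, 55, 80, 56, 57],
     [34, 18, 35, 57, 69, 126]]
D = korresp_matrix(T)
print(D.shape)   # (11, 11)
```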
3.2 The KACM algorithm

When there are more than two qualitative variables, the above method no longer works. In that case, the data matrix is the Burt table B. The rows are normalized so that each sums to 1. At each step, we pick a normalized row at random, according to the frequency of the corresponding modality. We define the winning unit according to the χ² distance and update the weight vectors as usual. After convergence, we get an organized classification of all the modalities, where related modalities belong to the same class or to neighboring classes. In this case too, the KACM method provides a very interesting alternative to the classical Multiple Correspondence Analysis.

The main advantages of both the KORRESP and KACM methods are their speed and small computing cost. While the classical methods have to use several projections, each carrying a decreasing share of the information, ours provide only one map, which is rough but unique and permits a rapid and complete interpretation. See [3], [4] and [7] for the details and financial applications.
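A rough sketch of the KACM step is given below. It is our own illustration, not the authors' implementation: the χ² weighting by column masses, the one-dimensional chain, and the learning-rate and neighborhood schedules are all assumptions.

```python
import numpy as np

def kacm(burt, n_units=10, n_iter=5000, eps0=0.5, seed=0):
    """Sketch of KACM: a one-dimensional Kohonen chain trained on the
    normalized rows of a Burt table with a chi-squared distance.

    burt : (M, M) Burt table of counts.  Returns (n_units, M) code vectors.
    """
    rng = np.random.default_rng(seed)
    B = np.asarray(burt, dtype=float)
    M = B.shape[0]
    rows = B / B.sum(axis=1, keepdims=True)        # row profiles (each sums to 1)
    mass = B.sum(axis=0) / B.sum()                 # column masses for chi-2 weights
    freq = np.diag(B) / np.diag(B).sum()           # frequency of each modality
    W = rows[rng.choice(M, size=n_units)]          # initial code vectors

    for t in range(n_iter):
        eps = eps0 * (1.0 - t / n_iter)            # decreasing learning rate
        radius = 1 if t < n_iter // 2 else 0       # shrinking neighborhood
        i = rng.choice(M, p=freq)                  # draw a modality by frequency
        x = rows[i]
        d2 = (((x - W) ** 2) / mass).sum(axis=1)   # chi-2 distances to the units
        win = int(d2.argmin())
        lo, hi = max(0, win - radius), min(n_units, win + radius + 1)
        W[lo:hi] += eps * (x - W[lo:hi])           # update winner and neighbors
    return W
```

After training, each modality (each normalized row of B) is assigned to its nearest unit with the same χ² distance, which yields the organized classification of modalities described above.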
4. DEMOGRAPHIC STUDY OF THE RHÔNE VALLEY
The data come from the project ARCHEOMEDES, supported by the EU, in collaboration with the laboratory P.A.R.I.S (University Paris 1). We consider 1783 communes in the Rhône valley, in the south of France. This valley is situated on the two banks of the river Rhône. It includes some big cities (Marseille, Avignon, Arles, ...), some small towns, and many rural villages. A large part is situated in medium mountains, in areas that have been strongly depopulated since the so-called drift from the land. At the same time, in the vicinity of the large or small towns, the communes have attracted many people who work in urban employment. The goal of this study is to understand the relations between the evolution of the population and the professional composition of each commune.

The data include two tables. The first one gives the population numbers of seven censuses (1936, 1954, 1962, 1968, 1975, 1982, 1990). These numbers are normalized by dividing by their sum, to keep the evolution and not the absolute values. The second one contains the current numbers of working population, distributed among six professional categories (farmers, craftsmen, managers, intermediate occupations, clerks, workers). In this second table, the data are transformed into percentages and will be compared with the χ² distance.

The first step consists in defining two classifications of the communes from the two types of data. We use a Kohonen one-dimensional network (a chain) to transform the quantitative variables into qualitative ordered characters. First we classify the communes into 5 classes from the census data, and then into 6 classes from the professional composition data. The first classification into 5 classes is easily interpretable. The classes are arranged according to an evident order: there are the communes with strong increase (aug_for), with medium increase (aug_moy), with relative stability (stable), with medium decrease (dim_moy), and with strong decrease (dim_for). See in fig. 4.1 the code vectors and in fig. 4.2 the contents of the 5 classes.
Fig. 4.1: The code vectors of the 5 classes (first classification). The curves represent the population evolution across the seven censuses.
Fig. 4.2: The contents of the 5 classes (first classification). In each class, all the commune vectors are drawn superposed.

The 6 classes (A, B, C, D, E, F) provided by the second classification, on the professional composition data, are a little more delicate to interpret. Actually they straightforwardly correspond to an order following the relative importance of the farmer category. Class A contains almost no farmers, while class F consists of communes with a majority of farmers (these are very small villages, but the use of the χ² distance restores their importance). See in fig. 4.3 the code vectors and in fig. 4.4 the contents of the 6 classes.
Fig. 4.3: The code vectors of the 6 classes A, B, C, D, E, F (second classification). The curves correspond to the (corrected) proportions of farmers, craftsmen, managers, intermediate occupations, clerks, and workers.
Fig. 4.4: The contents of the 6 classes (second classification). Note that in class A some communes are very specific: they do not have any farmers, but all their inhabitants belong to one or two categories.

From these two classifications, we compute the contingency table, see table 4.1. A quick glance shows a strong dependence between the row variable (census classes) and the column variable (professional classes).

Table 4.1: Contingency table which crosses the two classifications.

              A     B     C     D     E     F
aug_for     223    53    26     9     0     0
aug_moy      85   112    86    34     7     1
stable       80   100   112   113    54    22
dim_moy      29    50    55    80    56    57
dim_for      34    18    35    57    69   126
To analyze this dependence, we use a classical Correspondence Analysis and the KORRESP algorithm. The results are shown in fig. 4.5 and table 4.3.
Fig. 4.5: The first projection of the modalities using a Factorial Correspondence Analysis, axes 1 (0.70) and 2 (0.26). Table 4.3: The two-dimensional Kohonen map with the results of KORRESP.
Both representations suggest using a one-dimensional Kohonen network (a chain) to implement the KORRESP method. The results are shown in table 4.3.

Table 4.3: The one-dimensional Kohonen map with the results of KORRESP.
A        B        C       D        F
aug_for  aug_moy  stable  dim_moy  dim_for
The conclusions are simple. The rural communes where agriculture is dominant are depopulated, while the urban ones have an increasing population. The relations are very precise: we can note the pairs of modalities ((dim_for), F), ((dim_moy), D), ((stable), C), ((aug_moy), B), ((aug_for), A). The SOM-inspired method is very quick and efficient, and gives the main points of the information with only one representation. The classical correspondence method is also useful, but its computation time is longer and it is usually necessary to examine several projections to have a complete analysis, since each axis represents only a percentage of the total information.
5. THE DOMESTIC CONSUMPTION OF THE CANADIAN FAMILIES
The data have been provided by Prof. Simon Langlois from Université Laval. The purpose of the study is to define groups that are homogeneous from the point of view of their consumption choices. The interest of such a clustering is at least twofold. On the one hand, when one has successive surveys that include distinct individuals stemming from the same population, one can build a pseudo-panel, composed of synthetic individuals representative of the groups, which will be comparable from one survey to another. On the other hand, it facilitates the matching of distinct surveys when each one provides different information about samples extracted from the same population. The constitution of groups which are homogeneous for these data allows the linking of all the surveys. For example, it is possible to apply this method to match consumption surveys done for the same period with different samples (each sample being surveyed about only half of the consumption nomenclature). The matching is necessary to build complete consumption profiles. One has to notice that this method does not exactly correspond to the methodology proposed by Deaton, 1985 [10], who follows the same cohort from one survey to another by considering only individuals born at the same time. When pseudo-panels are considered, one usually builds the clusters by crossing some significant variables. For example, to study the households' consumption modes, the significant variables are the age cohort, the education level and the income distribution. Here, we use the Kohonen algorithm to define the clusters and apply it to the data of two consumption surveys. We also compare the results to those obtained from a standard classification.
5.1. The data

We consider two consumption surveys, performed by Statistics Canada in 1986 and 1992, with about 10,000 households each; the households were not the same ones from one survey to the other. The consumption structure is known through a nomenclature of 20 functions, described in Table 5.1.

Table 5.1: Consumption nomenclature.
Alcohol; Food at home; Food away; House costs; Communication; Others; Gifts; Education; Clothes; Housing; Leisure; Lotteries; Furniture; Health; Security; Personal Care; Tobacco; Individual Transport; Collective Transport; Vehicles.

Each household is represented by its consumption structure, expressed as percentages of the total expenditure. The two surveys have been gathered together in order to define classes including individuals belonging to one or the other year. So it will be possible to observe the dynamic evolution of the household groups which have similar consumption structures. One can see that, for any classification method, the classes contain data of the two surveys in almost equal proportions. It seems that there is no temporal effect on the groups, which simplifies the further analyses.
See in Fig. 5.1 the mean consumption structure for the 1992 survey.
Fig. 5.1: Mean consumption profile in 1992.

5.2. The classes

We want to compare two clustering methods:
1) a SOM algorithm using a two-dimensional (8 x 8) grid, which defines 64 classes, whose number is then reduced to 10 macro-classes by a hierarchical clustering of the 64 code vectors, in order to get an easier interpretation of their contents;
2) a hierarchical classification into 10 classes with the Ward method.

5.3. The SOM classes and the macro-classes

Fig. 5.2 represents the 64 SOM classes with their code vectors and the macro-classes, which differ by their texture. First we note that, due to the topological conservation property of the SOM algorithm, the macro-classes group only neighboring SOM classes. In Fig. 5.3, the distances between the SOM classes are drawn, following the method suggested in [6]. Observe that the grouping into 10 macro-classes respects the distances: the changes of macro-class generally occur where the distances are larger.
Fig. 5.2: The 64 SOM classes, their code vectors and the macro-classes.

Fig. 5.3: The distances between the 64 code vectors; in each direction, the classes are more distant when there is more white area.
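The two-level procedure of section 5.2 (an 8 x 8 SOM followed by a hierarchical clustering of the 64 code vectors into 10 macro-classes) can be sketched as follows. This is not the authors' code: it assumes the third-party MiniSom package for the SOM step and SciPy's Ward linkage for the second level, and random Dirichlet draws stand in for the real consumption profiles.

```python
import numpy as np
from minisom import MiniSom                      # third-party SOM implementation
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in for the consumption profiles: N households x 20 budget shares
rng = np.random.default_rng(1)
X = rng.dirichlet(np.ones(20), size=5000)

# Level 1: an 8 x 8 SOM -> 64 classes
som = MiniSom(8, 8, input_len=20, sigma=1.5, learning_rate=0.5, random_seed=1)
som.train_random(X, 20000)
code_vectors = som.get_weights().reshape(64, 20)

# Level 2: Ward hierarchical clustering of the 64 code vectors -> 10 macro-classes
Z = linkage(code_vectors, method="ward")
macro_of_unit = fcluster(Z, t=10, criterion="maxclust")   # macro-class of each unit

# Each household inherits the macro-class of its winning unit
def macro_class(x):
    i, j = som.winner(x)
    return macro_of_unit[i * 8 + j]

labels = np.array([macro_class(x) for x in X])
print(np.bincount(labels)[1:])    # sizes of the 10 macro-classes
```

Clustering the code vectors rather than the raw households keeps the second level fast and, because neighboring units have similar code vectors, tends to produce the connected macro-classes described above.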
The SOM classes could be analyzed, but it is difficult to keep track of and characterize 64 types of classes. Conversely, the 10 macro-classes have well separated features. One can observe that:
1. the sizes of the macro-classes are about 600 to 700 households, or about 1200, except one with a little more than 400; this macro-class gathers only 4 SOM classes, which have a very special profile (as will be seen below);
2. in all the macro-classes, there are as many 1986 data as 1992 ones, so there is no significant effect of the year of the survey;
3. the mean profiles of the 10 macro-classes are well identified, and are different from the mean profile of the whole population.

Nine types of consumption items are at the origin of the differentiation of the macro-classes:
1. macro-class 5 is dominated by the Housing item (38%);
2. macro-class 9 is characterized by the importance of the Housing item (26%) and the Collective Transport;
3. for two macro-classes (1 and 2), the Vehicle purchase makes the difference; while the general mean value for this item is about 5%, the value is 17% in macro-class 1, and the other items are reduced in a homothetic way; in macro-class 2, the value is 36% and the housing expenditure is small, which corresponds to a large representation of house-owners (71% instead of 60% in general);
4. the Food at home item (20%) and the Others item define macro-class 7;
5. in macro-class 3, the Security (insurance) expenditure is double the mean value;
6. macro-class 10 corresponds to a large value of the Gifts item (25%);
7. the Leisure item defines macro-class 8 (with 13%), while Tobacco defines macro-class 4 (with 12%) and Education is dominant in macro-class 6 (10%).

The grouping into 10 macro-classes increases the contrast with respect to the mean consumption profile. All the SOM classes inside a macro-class have more or less the same features, with some specific characteristics.
5.4. Hierarchical clustering

If we consider the 10 classes defined by a hierarchical Ward clustering on the consumption profiles, the results are disappointing. The groups have unequal sizes, and the differentiation between groups is more quantitative than qualitative, and poorer in information. For example, 4 groups have more than 1000 or 2000 elements, while the others have about 200 or 400. Among the 4 largest groups, three (groups 1, 4 and 6) have a mean profile similar to the general mean one, with only one component a little larger. Groups 2 and 7 correspond to a high housing expenditure and cannot be clearly set apart, and so on. Actually, the correctly spotted groups are the small ones, while the others are not very different from one another and are similar to the general population. Furthermore, some specific behaviors, in particular those distinguished by a relatively large importance of the Security or Education expenditures, do not emerge in this clustering.
So from now on, we continue the analysis by using the SOM classification, followed by the grouping into 10 macro-classes.
5.5. Crossing with qualitative variables

To better understand the factors which determine the macro-classes, and to allow their identification, we use a graphic representation of some qualitative variables that were not present in the classification. For that, we use 4 qualitative variables that give a socio-demographic description of the households:
1. the first one (Wealth) is a variable with 5 modalities (poor, quasi-poor, middle, quasi-rich, rich); this variable is defined by combining three simple criteria (the income distribution, the total expenditure, the food expenses), according to the age, the education level and the regional origin;
2. the second one (Age) is the age of the head of household, with 6 modalities (less than 30, 30-39, 40-49, 50-59, 60-69, more than 69);
3. the educational level (Education), with 5 levels (primary, secondary, post-secondary without diploma, post-secondary with diploma, university diploma);
4. the tenure status (Tenure Status), with 2 modalities (owner or tenant).

For each SOM class, we compute the distribution of the four qualitative variables (which did not participate in the classification), and we represent it as a sector diagram (a pie) inside the cell. We observe that there is also a continuity in the variations of the socio-demographic variable distributions among the SOM classes. But they provide other information, different from the previous one.
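A sketch of the computation behind these pies, i.e. the within-class distribution of one qualitative variable for every SOM class, is given below (the drawing of the pies is left out, and the function and variable names are ours).

```python
import numpy as np

def class_distributions(classes, modality, n_classes, n_modalities):
    """For each SOM class, the empirical distribution of a qualitative variable.

    classes   : (N,) class index of each individual (0 .. n_classes-1)
    modality  : (N,) modality index of each individual (0 .. n_modalities-1)
    Returns an (n_classes, n_modalities) array whose rows sum to 1
    (rows of empty classes are left at 0).
    """
    counts = np.zeros((n_classes, n_modalities))
    np.add.at(counts, (classes, modality), 1)            # count co-occurrences
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Example: distribution of a 5-modality variable (e.g. Wealth) over 64 classes
rng = np.random.default_rng(2)
cls = rng.integers(0, 64, size=5000)
wealth = rng.integers(0, 5, size=5000)
dist = class_distributions(cls, wealth, 64, 5)
print(dist[0])    # frequency pie of class 0
```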
Fig. 5.4 : The distribution of the variable Wealth.
Fig. 5.5: The distribution of the variable Age.

Fig. 5.6: The distribution of the variable Education.
Fig. 5.7: The distribution of the variable Tenure Status.

For example, the partitioning of the population according to the poverty-wealth criterion (Wealth) indicates that the classes having a strong proportion of rich or quasi-rich people are rather situated at the extremities of the diagonal from top right to bottom left, the poor and quasi-poor being in the central area. At the same time, the owner-tenant opposition is distributed according to a simple opposition along this diagonal: the first ones are at the bottom left, the second ones at the opposite corner. We rediscover a well-known feature of poor people, who can be owners as well as tenants of their lodgings. The four graphic representations can be analyzed in this way. Actually, it is the combination of these characteristics that we have to examine to interpret the zones of the grid, as gathered by the classification into 10 classes.
6. Conclusion

The SOM algorithm is therefore a powerful tool to analyze multidimensional data and to help understand the underlying structure. We are now working on a local representation of the contents of a class in relation to the neighboring classes, in order to give an interpretation to the significant and discriminant variables. There is no doubt that the related data mining techniques will see large development in many scientific fields where one deals with numerous, high-dimensional data.
References
[1] F. Blayo, P. Demartines: Data analysis: How to compare Kohonen neural networks to other techniques? In Proceedings of IWANN'91, Ed. A. Prieto, Lecture Notes in Computer Science, Springer-Verlag, 469-476, 1991.
[2] F. Blayo, P. Demartines: Algorithme de Kohonen: application à l'analyse de données économiques. Bulletin des Schweizerischen Elektrotechnischen Vereins & des Verbandes Schweizerischer Elektrizitätswerke, 83, 5, 23-26, 1992.
[3] M. Cottrell, P. Letremy, E. Roy: Analyzing a contingency table with Kohonen maps: a Factorial Correspondence Analysis, Proc. IWANN'93, J. Cabestany, J. Mary, A. Prieto Eds., Lecture Notes in Computer Science, Springer-Verlag, 305-311, 1993.
[4] M. Cottrell, S. Ibbou: Multiple correspondence analysis of a crosstabulation matrix using the Kohonen algorithm, Proc. ESANN'95, M. Verleysen Ed., Editions D Facto, Bruxelles, 27-32, 1995.
[5] M. Cottrell, B. Girard, Y. Girard, C. Muller, P. Rousset: Daily Electrical Power Curves: Classification and Forecasting Using a Kohonen Map, From Natural to Artificial Neural Computation, Proc. IWANN'95, J. Mira, F. Sandoval eds., Lecture Notes in Computer Science, Vol. 930, Springer, 1107-1113, 1995.
[6] M. Cottrell, E. de Bodt: A Kohonen Map Representation to Avoid Misleading Interpretations, Proc. ESANN'96, M. Verleysen Ed., Editions D Facto, Bruxelles, 103-110, 1996.
[7] M. Cottrell, E. de Bodt, E. F. Henrion: Understanding the Leasing Decision with the Help of a Kohonen Map. An Empirical Study of the Belgian Market, Proc. ICNN'96 International Conference, Vol. 4, 2027-2032, 1996.
[8] M. Cottrell, P. Rousset: The Kohonen algorithm: A Powerful Tool for Analysing and Representing Multidimensional Quantitative and Qualitative Data, Proc. IWANN'97, 1997.
[9] M. Cottrell, J. C. Fort, G. Pagès: Theoretical aspects of the SOM Algorithm, WSOM'97, Helsinki 1997, Neurocomputing 21, 119-138, 1998.
[10] A. Deaton: Panel data from time series of cross-sections, Journal of Econometrics, 1985.
[11] G. Deboeck, T. Kohonen: Visual Explorations in Finance with Self-Organizing Maps, Springer, 1998.
[12] P. Demartines: Organization measures and representations of Kohonen maps, In: J. Hérault (ed), First IFIP Working Group, 1992.
[13] P. Demartines, J. Hérault: Curvilinear component analysis: a self-organizing neural network for non linear mapping of data sets, IEEE Tr. on Neural Networks, 8, 148-154, 1997.
[14] F. Gardes, P. Gaubert, P. Rousset: Cellulage de données d'enquêtes de consommation par une méthode neuronale, Preprint SAMOS # 69, 1996.
[15] S. Kaski: Data Exploration Using Self-Organizing Maps, Acta Polytechnica Scandinavica, 82, 1997.
[16] T. Kohonen: Self-Organization and Associative Memory, (3rd edition 1989), Springer, Berlin, 1984.
[17] T. Kohonen: Self-Organizing Maps, Springer, Berlin, 1995.
Value Maps: Finding Value in Markets that are Expensive

Guido J. Deboeck
1818 Society, World Bank
e-mail: [email protected]

Based on traditional measures of value such as the price-earnings ratio (PE), many stocks are at present either way overvalued or faring very poorly. The rapid expansion of Internet and technology companies has made surfing stock markets much more hazardous. What investors need is a new guide, a more visual way to assess markets. The main idea of this chapter is to demonstrate how maps can be designed for finding value in stock markets. "Value Maps" presents self-organizing maps of vast amounts of data on company and stock performance, most of which is derived from traditional measures for assessing the growth and value of companies. This chapter, which extends the applications in [1], provides maps for finding value among the largest companies, the best performing companies, those that have improved the most, and those that based on traditional measures have excellent value attributes. The data was selected from over 7150 stocks listed on the New York Stock Exchange (NYSE), the American Exchange (AMEX) and the NASDAQ, including American Depository Receipts (ADRs) of foreign companies listed on US exchanges. Hence the maps in this chapter provide guidance not only to American companies but also to companies of world class whose stock can easily be bought and sold without brokerage accounts on every continent. The Value Maps in this chapter are models for how company and market information could be treated and represented in the future.
Introduction

Benjamin Graham was among the first to study stock returns by analyzing companies displaying common characteristics. In the early 1930's he developed what he called the net current asset approach of investing, which called for buying stocks priced at less than 66 percent of the company's liquidity. Ben Graham and David Dodd wrote in 1940 that people who habitually purchase common stocks at more than about 20 times their average earnings are likely to lose considerable money in the long run [2]. Nevertheless, participation in the stock markets has never been higher, at a time when the average stock prices of US companies are close to 33 times average earnings. In these circumstances, what is value? What stocks have value? Which companies are worth investing in? What markets or sectors are undervalued? Sixty years after Graham and Dodd there is still a lively debate on this. The debate on value has actually increased in intensity in recent years, especially because of the high valuation of American stocks and the emergence and rapid escalation of the prices of Internet stocks. High valuation is usually attributed to stocks with high price-earnings ratios, a company's stock price divided by its per-share earnings. Compared to historical precedent, the current price-earnings ratio of 33 for all stocks listed on the S&P 500 is very high. Compared to the average PEs of the Internet stocks included in the DOT index (http://www.TheStreet.com), the ISDEX index (http://www.internet.com), or the IIX index (http://quote.yahoo.com/q?s=^IIX), three relatively new indices measuring the performance of Internet stocks, the current PE of S&P 500 stocks is relatively low, especially since many Internet stocks have yet to produce earnings. So what has value? Some classic definitions of value can be found in Box 1. Interesting background reading on value investing can be found in [3], [4], and [5].

A long-term maverick on value is Warren Buffett. Since he took control of Berkshire Hathaway in 1965, when shares were trading in the $12-15 range, the per-share book value of Berkshire Hathaway stock has grown at a rate of more than 23 percent annually, which is nearly three times the gains in major stock averages. Buffett's approach to value is to seek intrinsic value, which he defines as 'you have to know the business whose stock you are considering buying' [6]. He recognizes that valuing a business is part art and part science but leaves it to the reader to interpret when a business has value. In Warren's own words, '[a business] has to be selling for less than you think the value of the business is, and it has to be run by honest and able people. If you can buy into a business for less than it's worth today, and you're confident of the management, and you buy into a group of businesses like that, you're going to make money' [6].
Looking at Buffett's portfolio and the stocks held by Berkshire Hathaway Inc, it is clear that Buffett's portfolio is diversified but that the bulk of his assets are tied up in five companies. In fact, one third of the $36 billion worth of Berkshire Hathaway Inc is in one company, Coca-Cola Co, which has a PE of 44, about one third higher than the average for all S&P 500 stocks. More specific guidance on what has value can be found in James O'Shaughnessy's work [7]. After weighing risks, rewards, and long-term base rates, O'Shaughnessy shows that the best overall strategy of uniting growth and value in a single portfolio over the past 42 years produced returns nearly five times better than the S&P 500. The annual compounded rate of return of the united growth and value strategy was 17.1 percent, or 4.91 percent higher than the S&P 500 return of 12.91 percent a year.
Box 1: What is value? There are several definitions of value:
Fair market value: Fair market value or FMV is whatever someone is willing to pay for a similar asset. FMV reflects the amount of cash a buyer would be willing to pay and a seller willing to accept.
Investment value: A company's investment value is unique to all potential buyers; all buyers possess their own rate of return requirements for an asset because each investor has established a unique minimum rate of return.
Book value: a standard value measure based on a company's net worth on an accounting basis.
Liquidation value: what an enterprise could fetch if all assets were sold, all receivables collected and all outstanding bills and debts paid.
Intrinsic value: what someone would conclude a business is worth after an analysis of the company's financial position; the real worth of a company.
O'Shaughnessy suggests that:
(i) all stocks and large stocks with high PE ratios do substantially worse than the market (page 70);
(ii) companies with the 50 lowest PE ratios from among the large stocks do much better than all others, and the three lowest deciles by PE substantially outperform all large stocks;
(iii) over the long term, the market clearly rewards low price-book (PB) ratios; yet the data shows that for 20 years the 50 largest stocks with high price-book ratios did better than all stocks; a high PB ratio is one of the hallmarks of a growth stock (page 99);
(iv) a low price-to-sales (PS) ratio beats the market more than any other value ratio and did so more consistently, in terms of both the 50-stock portfolios and the decile analysis (page 134);
(v) value strategies work when applied both to large stocks and to the universe of common stocks, and they did so at least 88% of the time over all rolling 10-year periods (page 164);
(vi) multifactor models, i.e. using several factors, dramatically enhance returns;
(vii) in all likelihood, adding relative strength to a value portfolio dramatically increases performance because it picks stocks when investors recognize the bargains and begin buying again (page 269).
In sum, Buffett suggests a highly subjective way of assessing value; O'Shaughnessy works through a lot of statistical evidence to suggest a united value and growth investment. Both approaches leave the average investor, who does not have Buffett's skills and time to assess businesses, or who may not have a PhD in statistics to do an in-depth rigorous analysis, in a real quandary. To facilitate investment decisions by the average investor, we demonstrate an alternative approach which is easier and provides a visual way of finding value in expensive markets without requiring elaborate statistical analyses. The proposed approach is based on self-organizing maps, which provide two-dimensional representations of vast quantities of data.
Methodology and Data

Self-organizing maps (SOM) belong to a general class of neural network methods, which are non-linear regression techniques that can be applied to find relationships between inputs and outputs or to organize data so as to disclose hitherto unknown patterns or structures. As this approach has been demonstrated to be highly relevant in many financial, economic and marketing applications [1] and is the subject of this conference, we refer the novice reader to the literature for further explanation of the details of this approach [8], [1].

The data used for this study was derived from Morningstar™, which publishes monthly data on over 7150 stocks listed on the NYSE, AMEX and NASDAQ exchanges. Morningstar's Principia Pro™ was used to select the stocks for each of the maps in this paper. Principia Pro's key features include filtering, custom-tailored reporting, detailed individual-stock summary pages, graphic displays, and portfolio monitoring. Principia Pro does not allow data mining based on the principles of self-organization. Hence we used Principia Pro™ to select data for constructing self-organizing maps. The maps shown here were obtained by using Viscovery® (from Eudaptics Software GmbH in Austria), an impressive tool representing state-of-the-art SOM capability (according to Brian O'Rourke in Financial Engineering News, March 1999). More information regarding Viscovery® can be found at http://www.eudaptics.com. A demo copy of Viscovery® can be downloaded from the same website.

The maps in this article seek to identify the best companies based on how well companies treat their shareholders. The main yardstick used for creating Value Maps is the total return to stockholders. The total return to shareholders includes changes in share prices, reinvestment of dividends, rights and warrants offerings, and cash equivalents such as stocks received in spin-offs. Returns are also adjusted for stock splits, stock dividends and re-capitalizations. The total return to shareholders that companies provide is the one true measure important to investors. It is the gauge against which investment managers and institutional investors are measured, and it should be the measure used to judge the performance of corporate executives. The maps shown in the next section can be used by investors to see how the stocks in their portfolio measure up, to spot new investment opportunities, and to adjust their portfolio to meet the objectives they have chosen. Corporate managers can use the maps to see how their companies stack up against the competition.
Main Findings

The main findings are presented as follows: first we analyze value among the largest companies, next among the best performing companies, then among the companies that have improved the most, and finally among those that based on traditional measures have excellent value attributes.
1. The largest 100 companies

The one hundred largest companies on the NYSE have the most visibility. Microsoft, IBM, GE, Wal-Mart, Exxon, American Express, Coca-Cola, AT&T, Ford and many others are known all around the globe. How to differentiate between them? Applying SOM we found some interesting differences in value. The one hundred largest companies listed on the NYSE were obtained by sorting 7159 companies by market capitalization, i.e. the current stock-market value of a company's equity, in millions. Market capitalization is calculated by multiplying the current share price by the number of shares outstanding as of the most recently completed fiscal quarter. Market capitalization is often used as an indicator of company size. Stocks with market caps of less than $1 billion are often referred to as small-cap stocks, while market caps of more than $5 billion generally denote large-cap stocks. Based on data as of 12-30-98, Microsoft (MSFT) with a market capitalization of $345 billion is the largest company; Anheuser-Busch Companies (BUD), which produces Budweiser, with a capitalization of $31.3 billion ranked
100th.

The main inputs for this analysis included (i) the total return produced over three months, one year, three years, and five years; (ii) the percentage rank of these returns within each industry; (iii) the market capitalization in millions; (iv) value measures including the price-earnings ratio, price-book ratio, price-sales ratio and price-cash flow ratio; and (v) the relative strength of the current stock price compared to the stock price over the past 52 weeks. Equal priority was applied to all inputs; log transformations were applied to all. The initial map size was set to 2000, yielding maps of 54 by 35 nodes. Because of the small number of records, the map tension was set to 3, which encourages interpolation between records.

Figure 1: Self-organizing map of the largest 100 companies (in terms of market capitalization) listed on the New York Stock Exchange, which shows: 1. a main cluster including IBM, Microsoft, Intel, and many others; 2. cluster two (with CMB, ING, FRE, AXA), financial institutions that are leaders in performance; 3. cluster three (with TYC, XRX, TWX, WCOM), companies that are rich in valuations; 4. cluster four (with SUNW, TXN, GPS, BBV), the high fliers; 5. cluster five (with BAC and C, which are banks, and RD and SC, which are oil companies), which provide attractive investment opportunities; and 6. cluster six (with Nokia and Dell), companies that have grown very fast in the past years but have also become very expensive.

The SOM shown in Figure 1 shows six clusters. Summary statistics on each of these clusters are provided in Table 1. About 73% of all companies formed one cluster including IBM, GM, Motorola (MOT) and many others. The average market capitalization in this cluster is $90 billion. The average PE ratio for this group is 37 (slightly higher than the average for the market as a whole); the average PB ratio is about 10. Five other clusters show the more interesting information.

Cluster two includes among others Chase Manhattan Bank (CMB), ING Group (ING), Morgan Stanley Dean Witter & Co (MWD), Freddie Mac (FRE), and AXA ADR (AXA). This group can be labeled the recent leaders in performance, with relatively low capitalization and low PEs, PBs and PSs. The average market capitalization of the companies in this group is $48 billion. Their PE of 24 is 25% less than the current average PE of the market as a whole; the average price-book value is 3.3. Over the past year these companies produced an average return of 48%; over the past three months their return was 53%.

Cluster three includes Tyco International (TYC), Xerox (XRX), Time Warner (TWX) and MCI WorldCom (WCOM). This cluster is similar to cluster two in recent return achievements; however, stocks in this group are much more expensive, with exceptionally high PEs and price-book values twice as high as those of cluster two. This makes the group less attractive than the previous one. We call this group the very rich group.
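Returning to the preprocessing mentioned above (equal priority for all inputs and log transformations applied to all of them), a minimal sketch follows. The column names are hypothetical, the signed-log handling of negative returns and the standardization are our assumptions, and the actual maps were produced with Viscovery, not with this code.

```python
import numpy as np
import pandas as pd

# Hypothetical extract of the Morningstar-derived inputs (one row per company),
# here illustrated with the cluster means reported in Table 1
stocks = pd.DataFrame({
    "tot_ret_3m": [25.0, 53.0, 12.2],
    "tot_ret_1y": [44.7, 48.9, -6.7],
    "market_cap": [90688.0, 48923.0, 95308.0],
    "pe": [37.3, 24.4, 20.0],
    "pb": [9.9, 3.3, 3.2],
}, index=["C1 mean", "C2 mean", "C5 mean"])

# Signed log transform: compresses the heavy right tails of returns,
# capitalization and valuation ratios while keeping the sign of negative returns
features = np.sign(stocks) * np.log1p(stocks.abs())

# Equal priority: every transformed column is standardized to unit variance
features = (features - features.mean()) / features.std()
print(features.round(2))
```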
Table 1: Summary statistics on the biggest 100 companies

Clusters >>                     C1       C2       C5       C3       C4       C6       C0
Matching records (%)            73        5        4        4        3        2        9
Tot Ret 3 Months Mean         25.0     53.0     12.2     41.2     64.5     32.6     40.6
Tot Ret 1 Year Mean           44.7     48.9     -6.7     92.0    114.8    247.5    108.8
Tot Ret 3 Year Mean           34.8     41.9     23.0     52.8     63.0    154.4     53.1
Tot Ret 5 Year Mean           30.0     34.9     24.5     36.1     50.5    117.7     44.9
Market Cap Mean             90,688   48,923   95,308   72,643   32,832   82,670   48,396
PE Current Mean               37.3     24.4     20.0    355.4     45.6     68.5    125.9
Price to Book Mean             9.9      3.3      3.2      6.4     11.8     33.1     14.0
Price to Sales Mean            4.3      1.6     12.3      6.2      3.7      5.6      6.7
Price to Cash Flow Mean       24.7      7.3    280.3     73.2     26.0     47.4     56.3
Relative Strength Mean        12.5     16.0    -28.0     50.8     69.3    174.5     64.0

Notes: C1: F, XON, CHV, BP, DT, TI, TEF, FON, BUD, A, NW, BCS, MOB, AN, AXP, BTY, DEO, FNM, ERICY, ATI, E, MOT, FTE, BLS, ONE, MC, UN, ORCL, HWP, MCD, Aft, FTU, CPQ, WMT, PEP, GTE, INTC, T, HMC, HD, MRK, BEL, AHP, SBC, TOYOY, ABT, BMY, JNJ, MO, GLX, SBH, NTT, DIS, PG, LLY, PFE, MSFT, MDT, G, LU, CSCO, VOD, SGP, WLA, KO, ABBBY, DCX, GM, IBM, UL, BRK.B, AIG, GE; C2: CMB, MWD, ING, FRE, AXA; C3: TYC, XRX, TWX, WCOM; C4: TXN, SUNW, GPS, BBV; C5: C, BAC, RD, SC; C6: NOK.A, DELL; C0: MBK, AEG, EMC, DD, BA, AOL, SAP, ALL.
Cluster four contains Texas Instruments (TXN), Sun Microsystems (SUNW), and Gap (GPS). These had the highest returns over the past year (114% on average). They also had the largest returns in the last three months (64% on average) and relatively low capitalization ($32 billion). However, they had PEs well above the market (45, about one third higher than the market average) and PBs four times those of the companies in cluster two. These can perhaps best be considered the high fliers.
Cluster five can be labeled the underperformers. It includes Bank of America (BAC), Royal Dutch Petroleum (RD) and Shell Transport (SC), which produced negative returns over the past year and only a 12% return over the past three months. Average capitalization in this group is high ($95 billion). Average PEs of 20 and PBs of 3.2 nevertheless make this group very attractive for future investments. Finally, cluster six shows Nokia and Dell, two companies with high returns over the past three years (154% average) and the past year (247% average), but which have in the process become very expensive in terms of PEs and PBs. In sum, a SOM of the one hundred largest companies listed on the NYSE defines six clusters; among them are the recent performance leaders, the very rich, the high fliers, the underperformers and a large group in the middle. Nine companies stand out with attractive valuations. Among them are BAC, CMB, MWD, ING, FRE and AXA, which are financial institutions, and RD and SC, which are oil companies. At the time of this writing (March 1999) banks and oil companies have started to accelerate and have produced significant advances in recent weeks. The average return on the stocks selected via SOM was 7.3% in 2.5 months. Exceptionally good performance included Chase Manhattan Bank (CMB), which increased 25% over 2.5 months (quoted at $59.66 on 12-31-98 and $74.61 on 3-12-99).
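The input preparation used for these maps, and re-used in the later sections, is straightforward to reproduce. The sketch below is an illustration under stated assumptions only: the column names are hypothetical stand-ins for the Morningstar fields, and the signed log transform is our guess at how returns that can be negative were log-transformed, since the text only states that log transformations and equal priority were applied to all inputs. The maps themselves were produced with a SOM tool whose "map tension" parameter has no counterpart in this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical column names standing in for the Morningstar fields used in the text.
FEATURES = ["ret_3m", "ret_1y", "ret_3y", "ret_5y", "ind_rank_1y",
            "mktcap_mil", "pe", "pb", "ps", "pcf", "rel_strength"]

def prepare_som_inputs(df: pd.DataFrame) -> np.ndarray:
    """Log-transform all inputs and give them equal priority before SOM training."""
    x = df[FEATURES].astype(float)
    # Signed log transform so that negative returns remain usable (an assumption;
    # the text only states that log transformations were applied to all inputs).
    x = np.sign(x) * np.log1p(np.abs(x))
    # "Equal priority": scale every column to zero mean and unit variance.
    return ((x - x.mean()) / x.std()).to_numpy()
```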
2. Best Performing Companies

In the previous section we started from size, using market capitalization as the main initial selection criterion for picking companies to map performance and valuations. In this section we ignore size and start with the one hundred best performing companies. We chose the annualized return over the past three years as the main criterion for selecting the one hundred best performing companies. Three years is a reasonable time period because longer periods, five or ten years, often include economic regime changes. Data pre-processing included the same log transformations as described above. Map parameters were also chosen to be consistent with those used earlier. The SOM shown in Figure 2 reveals five clusters. As before, the outliers provide the most interesting information. For example, in the bottom left corner of the map in Figure 2 we find HIST and AOL. HIST is the symbol of Gallery of History, which markets historical documents; AOL is America Online, which provides consumer on-line computer services. Both of these companies have experienced phenomenal growth over the past three years: their average annualized return over the past three years was 120%. Their valuations, however, are also record high. Hence they may be considered the least attractive among the best performing one hundred. At the bottom right of the map in Figure 2 we find a cluster that includes Federal Agricultural Mortgages (FAMCK), Mediware Information Services (MEDW), Bank of Commerce
Figure 2: Self-organizing map of 100 companies that over the past three years produced the most value for shareholders. In the bottom right we find FAMCK and BCOM, which are financial institutions, and MEDW and TSRI, which are service companies that may provide the best investment opportunities in this group.
(BCOM) and TSRI (which provides computer programming services on a contract basis). While the average market capitalization of this group was small (only $120 million), the valuation numbers speak for themselves: PEs of 21 and PBs of 3. The average return in the last three years for this group was 102%; over the past year they regressed by 29%, but in the last three months of 1998 they accelerated by 41%. How did they do out of sample? Bank of Commerce (BCOM) is up 40% and Federal Agricultural Mortgages (FAMCK) increased by 19% in the first quarter of 1999; MEDW and TSRI both decreased by about 30%. Hence the earlier findings are confirmed only in regard to banking and financial institutions.
3. Companies that improved

If neither size nor best return over the past few years is considered relevant for selecting stocks to produce value maps, then maybe the rate at which companies are changing or improving in producing value for shareholders should be used as the primary criterion. For this section we started by computing the difference between the most recent annual return and the annualized return over the past three years. Based on this, the top twenty companies that improved the most over the past three years are shown in Table 2. This acceleration in shareholder value was then used to sort companies to obtain the top one hundred. The SOM of these one hundred companies that produced the most improvement in shareholder value is shown in Figure 3. The map in Figure 3 shows four clusters, of which the ones in the top right and top left corners provide the most interesting selections. The cluster in the top right corner includes Carver Bancorp (CNY) and JMC Group (JMCG), which went from zero to negative total return. Similar to the main cluster in the center of the map, a large number of companies in this group have gone from zero to a small negative return over the past year and then back to positive returns in the past three months. The companies in the top left corner, however, show substantial improvements in return. In cluster 2 (top left corner) we find LML Payment Systems (LMLAF), KeraVision (KERA), ACTV (IATV), Research Frontiers (REFR) and Cypress Biosciences (CYPB). Right next to it we find TeleWest Communications (TWSTY). Companies in both of these clusters have gone from 3 to 5% annualized return over three years to 130 to 140% return over the past year. The former group averaged returns of 85% in the last three months of 1998, while TWSTY produced a 25% return. Out of sample results show that TWSTY increased by 61%, from $28.25 on 12-31-98 to $45.75 on 3-12-99.
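The improvement criterion described above reduces to a one-line computation. A minimal sketch, assuming a pandas DataFrame with hypothetical column names ret_1y (most recent one-year total return) and ret_3y_ann (annualized three-year return):

```python
import pandas as pd

def most_improved(df: pd.DataFrame, n: int = 100) -> pd.DataFrame:
    """Rank companies by the acceleration in shareholder value: the most recent
    one-year total return minus the annualized three-year return."""
    out = df.copy()
    out["improvement"] = out["ret_1y"] - out["ret_3y_ann"]
    return out.sort_values("improvement", ascending=False).head(n)
```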
4. Highest Valuations, or Lowest PE's, PB's and PS's

Traditional measures of valuation are based on price earnings ratios, price book values, and price cash flow and price sales ratios. Hence we searched the database for companies with the lowest price earnings ratios, i.e. PEs less than 15, that have price book ratios less than 1 and price sales ratios less than 0.5. This yielded 376 records, of which several had missing
data. Fortunately SOM allows records with missing data and hence all 376 records were used for the map of companies with the best valuations.

Table 2: Companies that improved the most in shareholder value over the past three years.

Company Name                              Ticker   Exchange   Sector                 Improvement 1-3 year
Grand Union                               GUCO     NNM        Retail                 458.7
Track Data                                TRAC     NNM        Services               418.7
LML Payment Systems                       LMLAF    NASQ       Services               308.8
Business Objects SA ADR                   BOBJY    NNM        Services               202.9
Cellular Communications of Puerto Rico    CLRP     NNM        Services               197.1
Catalyst International                    CLYS     NNM        Technology             193.9
Osicom Technologies                       FIBR     NNM        Technology             162.7
Tops Appliance City                       TOPS     NNM        Retail                 160.6
ACTV                                      IATV     NASQ       Technology             134.0
TeleWest Communications PLC ADR           TWSTY    NNM        Services               125.2
Books-A-Million                           BAMM     NNM        Retail                 123.3
KeraVision                                KERA     NNM        Health                 119.9
Cypress Bioscience                        CYPB     NASQ       Health                 106.4
Ceres Group                               CERG     NNM        Financial               97.6
Invivo                                    SAFE     NNM        Health                  81.8
PharmChem Laboratories                    PCHM     NNM        Health                  75.2
Vaughn Communication                      VGHN     NNM        Services                69.1
Kirin Brewery ADR                         KNBWY    NASQ       Consumer Staple         66.5
Washington Homes                          WHI      NYSE       Industrial Cyclicals    59.8
Elscint                                   ELT      NYSE       Technology              52.0
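The screen that produced the 376 "best valuation" records can be sketched in a few lines. This is an illustration only: the column names pe, pb and ps are hypothetical, and the treatment of missing ratios (a missing value does not disqualify a record) is our reading of the remark that several of the 376 records had missing data.

```python
import pandas as pd

def value_screen(df: pd.DataFrame) -> pd.DataFrame:
    """Keep companies with PE below 15, price/book below 1 and price/sales below 0.5."""
    fails = (df["pe"] >= 15) | (df["pb"] >= 1.0) | (df["ps"] >= 0.5)
    # Comparisons against NaN are False, so records with missing ratios are kept.
    return df[~fails]
```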
Figure 4 shows 11 component planes obtained via a SOM on the 376 records of companies with the best valuations. The top four planes show the distribution of total returns over the past three months, one year, three years and five years (left to right). The next set, on the second row, shows the distribution of the price earnings, price book, price sales and price cash flow ratios (left to right). Finally, on the bottom row are the distributions of the market capitalization, the relative strength, and the rate of improvement over the past three years among all 376 companies. In a color printout of these planes the lowest values are in blue and the highest values are in red; green and yellow areas indicate values closer to the lowest or highest for a particular component, respectively. From Figure 4 we can see visually that the total annualized returns over three and five years are highly correlated, especially among the companies with the highest valuations. Total returns over the past three months and the past year are, however, not highly correlated with the returns over three and five years. Interestingly enough, the returns over three months and over the past year do not entirely overlap. The second row of planes shows some overlap between the price earnings and price book values; however, the distributions of the price sales and price cash flow ratios are substantially different. On all four planes we find companies with very low valuation ratios in the bottom right corner.
Figure 3: Self-organizing map of 100 companies that have most improved in terms of generating return for shareholders over the past three years. The improvement indicator was computed for each company by subtracting the annualized return over three years from the total return in the past year.
Figure 4: Component planes of 376 companies with the lowest price earnings, price book and price sales ratios. The component planes are, from left to right: total return in the last three months, one year, three years and five years; on the second row, the price earnings, price book, price sales and price cash flow ratios; on the third row, the market capitalization, the relative strength of the stock price and the improvement over the past three years.
The companies with the highest market capitalization among these 376 can be found in the upper left corner of the bottom left plane. The companies with the best relative strength, i.e. whose price is closest to its 52-week high, are on the right side of the middle plane on the bottom row. The companies that have been improving the most can be found in the bottom left corner of the bottom right plane. From the above we can discover visually, without statistical analysis, the most interesting selection among the companies with the best valuations. By selecting the darkest area in the bottom right plane, i.e. its left segment, we can filter those companies that improved the most over the past three years from among all those that have the best valuations. This filtering produces a short list of companies that provide interesting investment opportunities. Table 3 shows the results of this filtering process: the top ten obtained by filtering the 376 companies with the best valuations on the basis of the most improvement over the past three years.

Table 3: SOM-selected top ten companies with excellent valuations that have improved a lot over the past three years.
Company                | Ticker | Exchange | Sector      | Ret 3Mo | Ret 1Yr | Ret 3Yr | Ret 5Yr | Mkt Cap | PE   | PB   | PS   | P/CF | Rel Str | Improvement
TFC Enterprises        | TFCE   | NNM      | Financial   |   -7.1  |   73.2  |  -33.8  |  -34.0  |   18.5  |  7.7 |  0.5 |  0.5 |  2.5 |    37   | 107
Bangor Hydro-Electric  | BGR    | NYSE     | Utility     |   31.4  |  107.0  |    6.1  |   -1.8  |   94.4  | 14.9 |  0.7 |  0.5 |  2.6 |    63   | 100
American Health Prop   | AHEPZ  | NNM      |             |   -4.0  |  208.1  |  129.6  |    NA   |    3.7  |  2.4 |  0.3 |  0.5 |  0.6 |   -59   |  78.4
Isle of Capris Casinos | ISLE   | NNM      | Services    |   54.8  |   62.8  |  -13.4  |  -24.3  |   93.5  | 11.3 |  1.0 |  0.2 |  1.4 |    29   |  76.2
Sound Advice           | SUND   | NNM      | Retail      |   33.3  |  116.6  |   42.4  |  -12.2  |   12.1  |  8.6 |  0.8 |  0.1 |  2.5 |    71   |  74.2
OroAmerica             | OROA   | NNM      | Cons. Dur.  |   11.2  |   95.0  |   26.5  |   -8.0  |   62.9  |  8.3 |  0.9 |  0.4 |  4.6 |    54   |  68.5
Network Six            | NWSS   | NASQ     | Technology  |   17.3  |   32.6  |  -32.6  |  -40.7  |    2.9  |  4.8 |  0.8 |  0.3 |  1.6 |     5   |  65.3
Washington Homes       | WHI    | NYSE     | Industrial  |   16.0  |   62.0  |    2.2  |   -9.4  |   46.7  | 10.7 |  0.8 |  0.2 |  6.3 |    28   |  59.8
Astea International    | ATEA   | NNM      | Technology  |  -15.6  |  -11.4  |  -58.0  |    NA   |   22.9  |  0.9 |  0.6 |  0.4 | 19.2 |   -30   |  46.6
Cronos Group           | CRNSF  | NNM      | Services    |   21.4  |   27.5  |  -18.4  |    NA   |   54.9  |  5.3 |  0.6 |  0.3 |  NA  |     1   |  45.9
Average Top 10         |        |          |             |   15.3  |   82.3  |    6.2  |  -18.7  |   36.3  |  7.7 | 0.74 | 0.35 |  4.6 |    23   |  76.1
A few points deserve highlighting:
1. Among the top ten we find companies listed on every exchange: NYSE, NNM (NASDAQ National Market) and NASDAQ (small cap market);
2. The companies listed belong to various sectors: finance, retail, consumer durables, industrial cyclicals, services and utilities; hence good value can be found in almost any sector (internet stocks do not appear here because the Morningstar data we used contained very few internet companies as of 12-30-98 and had yet to recognize the internet as a separate sector);
3. While the average annualized return of the top ten companies was negative over three and five years, in the last year the average return of the top ten improved to 82%; in the last three months of 1998 the average return was 15% in one quarter;
4. The average price earnings ratio of these most improving best-valuation companies is 7.7; the average price book value is 0.74 and the average price sales ratio is 0.35;
5. Most of these companies have improved by 45 to 107% over the past three years, meaning that several have doubled in value to shareholders.

The out-of-sample performance of each of these top ten stocks is shown in Table 4, which lists the prices of each stock on 12-31-1998 and on 3-12-1999 (the time of this writing). It also shows the number of shares that could have been bought on 12-31-1998 for a $10,000 investment in each stock. The last column in Table 4 shows the net return of investing $10,000 in each stock. This portfolio of the ten most improving companies, selected from among 376 companies with the best valuations, would have produced a total return of 8.4% in the first two and a half months of 1999, or 2.8% more than the S&P 500. Annualized this is close to 40%, an added value of 13% over and above the S&P 500.

Table 4: Out-of-sample performance, 12-31-98 to 3-12-99
Company                   Price 12-31-98   Price 3-12-99   Difference   Shares per $10,000   Net Return ($)   Net Return (%)
TFC Enterprises                    1.63            2.44         0.81                 6135            4,969           49.7%
Bangor Hydro-Electric             12.81           13.00         0.19                  781              148            1.5%
American Health Prop               1.88            1.38        -0.50                 5319           (2,660)         -26.6%
Isle of Capris Casinos             3.97            4.38         0.41                 2519            1,033           10.3%
Sound Advice                       3.25            2.38        -0.87                 3077           (2,677)         -26.8%
OroAmerica                         9.88            8.75        -1.13                 1012           (1,144)         -11.4%
Network Six                        4.06            4.38         0.32                 2463              788            7.9%
Washington Homes                   5.88            6.63         0.75                 1701            1,276           12.8%
Astea International                1.69            3.31         1.62                 5917            9,586           95.9%
Cronos Group                       6.38            4.50        -1.88                 1567           (2,947)         -29.5%
Portfolio (total)                                                                                     8,373            8.4%
S&P 500                         1229.23         1297.68        68.45                                  5,569            5.6%
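The arithmetic behind Table 4 can be checked directly. A small sketch, reproducing the TFC Enterprises row (rounding the share count to the nearest whole share is an assumption that matches the published figures):

```python
def position_return(price_start: float, price_end: float, budget: float = 10_000.0):
    """Shares bought with `budget` at the start price, the dollar gain, and the return."""
    shares = round(budget / price_start)
    gain = shares * (price_end - price_start)
    return shares, gain, (price_end - price_start) / price_start

shares, gain, ret = position_return(1.63, 2.44)
print(shares, round(gain), f"{ret:.1%}")   # 6135 4969 49.7%
```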
Conclusions

The maps in this chapter demonstrate how value can be discovered visually, even in expensive markets, through self-organization of vast amounts of company and stock-market data. We have identified interesting investment opportunities based on how well companies treat their shareholders. The total return to shareholders that companies provide is the one true measure important to investors. It is the gauge against which investment managers and institutional investors are measured, and it should be the measure used to judge the performance of corporate executives. We looked at the one hundred largest companies and found some banking and oil companies; we looked at the best performing companies and those that have made the most progress and again found some banks and financial institutions with attractive potential. We looked at 376 companies with the best valuations and identified ten that have most improved and are outperforming the S&P 500 in the first quarter of 1999. The self-organizing maps in this chapter provide a visual way to find interesting investment opportunities. The best opportunities that were identified are companies that have good valuations (based on classic valuation criteria), that may have had negative returns over the past three to five years but recently changed from negative to positive returns. Most of them are accelerating very fast in producing value for shareholders. The attractiveness of this visual way of identifying value in markets is that it reduces the subjectivity embedded in the Warren Buffett approach to assessing which companies have value, and it reduces the need for the elaborate statistical analyses on which the O'Shaughnessy approach was based. The maps presented here are models for the representation of vast quantities of financial and economic data; they demonstrate the speed with which multi-dimensional data can be synthesized to produce meaningful results for making investment decisions. Being able to discern patterns in data fast is particularly important to electronic day traders who, in volatile markets, are constantly picking stocks for very short holding periods, usually not exceeding one day. Of course, electronic day traders, especially those betting on Internet stocks, have a saying that 'those who are discussing value are out of the market; those who are in the market discuss price'!
References
[1] Guido Deboeck & Teuvo Kohonen, Visual Explorations in Finance with Self-Organizing Maps, Springer-Verlag, 1998, 250 pp.
[2] Benjamin Graham, David Dodd, Security Analysis, reprint of 1934 ed., McGraw-Hill, New York, 1997, p. 493.
[3] Timothy P. Vick, Wall Street on Sale: How to Beat the Market as a Value Investor, McGraw-Hill, 1999, 289 pp.
[4] Anthony M. Gallea & William Patalon III, Contrarian Investing: Buy and Sell When Others Won't and Make Money Doing It, Prentice Hall, 1998.
[5] Peter Lynch, Beating the Street, Simon & Schuster, New York, 1993, 318 pp.
[6] Janet Lowe, Warren Buffett Speaks: Wit and Wisdom from the World's Greatest Investor, John Wiley & Sons, New York, 1997.
[7] James O'Shaughnessy, What Works on Wall Street: A Guide to the Best Performing Investment Strategies of All Time, McGraw-Hill, New York, revised edition, 1998, 366 pp.
[8] Teuvo Kohonen, Self-Organizing Maps, Springer-Verlag, 2nd edition, 1997, 426 pp.
Acknowledgments

The author wants to thank Professor Teuvo Kohonen for educating him about self-organizing maps, an approach which makes the representation of multi-dimensional financial data a lot more effective. He would also like to thank Professors Erkki Oja and Samuel Kaski for assembling this important collection of papers on self-organizing maps. He is grateful to Dr Gerhard Kranner, CEO of Eudaptics Software, and Johannes Sixt for all the support they have provided in making the creation of Value Maps so easy.
Data Mining and Knowledge Discovery with Emergent Self-Organizing Feature Maps for Multivariate Time Series

A. Ultsch
Philipps-University of Marburg, Department of Computer Science, Hans-Meerwein-Str., 35032 Marburg, Germany

Self-Organizing Feature Maps, when used appropriately, can exhibit emergent phenomena. SOFM with only few neurons limit this ability; Emergent Feature Maps therefore need to have thousands of neurons. The structures of Emergent Feature Maps can be visualized using U-Matrix methods. U-Matrices lead to the construction of self-organizing classifiers possessing the ability to classify new datapoints. This subsymbolic knowledge can be converted to a symbolic form which is understandable for humans. All these steps were combined into a system for Neuronal Data Mining. This system has been applied successfully for Knowledge Discovery in multivariate time series.

1. Introduction

Data Mining aims to discover so far unknown knowledge in large datasets. The most important step thereby is the transition from subsymbolic to symbolic knowledge. Self-Organizing Feature Maps are very helpful in this task. If appropriately used, they exhibit the ability of emergence, i.e. using the cooperation of many neurons, Emergent Feature Maps are able to build structures on a new, higher level. The U-Matrix method visualizes these structures, corresponding to structures of the high-dimensional input space that would otherwise be invisible. A knowledge conversion algorithm transforms the recognized structures into a symbolic description of the relevant properties of the dataset. In chapter two we briefly introduce our approach to Data Mining and Knowledge Discovery; chapter three clarifies the use of Self-Organizing Feature Maps for Data Mining. Chapter four clarifies the use of Feature Maps in order to obtain emergence. In chapters five and six those steps of Data Mining where Feature Maps can be used are described. Chapter seven is a description of our system, the so-called Neuronal Data Mine, which uses Emergent Feature Maps and Knowledge Conversion. In chapter eight an important application area, Knowledge Discovery in multivariate time series, is described. Chapter nine gives first results of an application of this system.
2. Data Mining and Knowledge Discovery

Since the use of the term Data Mining is quite diverse, we give here a short definition in order to specify our approach to Data Mining and Knowledge Discovery. A more detailed description can be found in [Ultsch 99a]. We define Data Mining as the inspection of a large dataset with the aim of Knowledge Discovery. Knowledge Discovery is the discovery of new knowledge, i.e. knowledge that is unknown in this form so far. This knowledge has to be represented symbolically and should be understandable for human beings as well as useful in knowledge-based systems. The central issue of Data Mining is the transition from data to knowledge. Symbolically represented knowledge, as sought by Data Mining, is a representation of facts in a formal language such that an interpreter with competence to process symbols can utilize this knowledge [Ultsch 87]. In particular, human beings must be able to read, understand and evaluate this knowledge. The knowledge should also be usable by knowledge-based systems.
Figure 1: Steps of Data Mining

The knowledge should be useful for analysis, diagnosis, simulation and/or prognosis of the process which generated the dataset. We call the transition from data, respectively from an unfit knowledge representation, to useful symbolic knowledge Knowledge Conversion [Ultsch 98]. Data Mining can be done in the following steps:
- inspection of the dataset
- clustering
- construction of classifiers
- knowledge conversion and
- validation (see Figure 1 for an overview)

Unfortunately it has to be stated that in many commercial Data Mining tools there is no Knowledge Conversion [Gaul 98]. The terms Data Mining and Knowledge Discovery are often used in those systems in an inflationary way for statistical tools enhanced with a fancy visualization interface [Woods, Kyral 98]. The difference between exploratory statistical analysis and Data Mining lies in the aim which is sought. Data Mining aims at Knowledge Discovery.

3. SOFM for Data Mining

Figure 1 shows the different steps in Data Mining in order to discover knowledge. Statistical techniques are commonly used for the inspection of the data and also for their validation. Self-Organizing Feature Maps can be used for classification and the construction of classifiers. Particularly well suited for these tasks are Emergent Feature Maps, as described in the next chapter. Classifiers constructed with the use of a Self-Organizing Feature Map do, however, not possess a symbolic representation of knowledge. They can be said to contain subsymbolic knowledge. In the step Knowledge Conversion the extraction of knowledge from Self-Organizing Feature Maps is performed. One method to extract knowledge from Self-Organizing Feature Maps is the so-called sig* algorithm, which will be briefly described in chapter six.

4. Emergent vs. Non-emergent Feature Maps

Self-Organizing Feature Maps were developed by Teuvo Kohonen in 1982 [Kohonen 82] and should, to our understanding, exhibit the following interesting and non-trivial property: the ability of emergence through self-organization. Self-organization means the ability of a biological or technical system to adapt its internal structure to structures sensed in the input of the system. This adaptation should be performed in such a way that first, no intervention from the environment is necessary (unsupervised learning) and second, the internal structure of the self-organizing system represents features of the input data that are relevant to the system. A biological example for self-organization is the learning of languages by children. This process can be done by every child at a very early age for different languages and in quite different cultures. Emergence means the ability of a system to produce a phenomenon on a new, higher level. This change of level is termed in physics a "mode" or "phase" change. It is produced by the cooperation of many elementary processes. Emergence happens in natural systems as well as in technical systems. Examples of natural emergent systems are cloud streets, the Brusselator, the BZ reaction, certain slime molds, etc.
Fig. 2: Hexagonal convection cells on a uniformly heated copper plate [Haken 71]

Even crowds of human beings may produce emergent phenomena. An example is the so-called "La Ola" wave in ballgame stadiums. Participating human beings function as the elementary processes which, by cooperation, produce a large-scale wave by rising from their places and throwing their arms up in the air. This wave can be observed on a macroscopic scale and could, for example, be described in terms of wavelength, velocity and repetition rate. Important technical systems that are able to show emergence are in particular laser and maser. In those technical systems billions of atoms (elementary processes) produce a coherent radiation beam. Although Kohonen's Self-Organizing Feature Maps are able to exhibit emergence, they are often used in such a way that emergence is impossible. For emergence it is absolutely necessary that a huge number of elementary processes cooperate. A new level or niveau can only be observed when elementary processes are disregarded and only the overall structures, i.e. structures formed by the cooperation of many elementary processes, are considered. In typical applications of Kohonen's Self-Organizing Feature Maps the number of neurons in the feature map is too small to show emergence. A typical example, which is representative for many others, is taken from [Reutterer 99]. The dataset describes consumers of household goods. Each household is described by a nine-dimensional vector of real numbers. Self-Organizing Feature Maps are used to gain some insight into the structure and segmentation of the market for the household goods. A Self-Organizing Feature Map with three by three, i.e. nine, neurons was used [Reutterer 99]. Using Kohonen's learning scheme, each of the nine neurons represents the bestmatch of several input data. Each neuron is considered to be a cluster.
Common to all non-emergent ways to use Kohonen's feature maps is that the number of neurons is roughly equal to the number of clusters expected to be found in the dataset. A single neuron is typically regarded as a cluster, i.e. all data whose bestmatches fall on this neuron are members of this cluster. It seemed for some time that this type of Kohonen's feature maps performs clustering in a way that is similar to a statistical clustering algorithm called k-means [Ultsch 95]. An absolutely necessary condition for emergence is the cooperation of many elementary processes. Emergence is therefore only expected to happen in Self-Organizing Feature Maps with a large number of neurons. Such feature maps, which we call Emergent Feature Maps, typically have at least some thousands if not tens of thousands of neurons. In particular, the number of neurons may be much bigger than the number of datapoints in the input data. Consequently most of the neurons of Emergent Feature Maps will represent very few input points, if any. Clusters are detected on Emergent Feature Maps not by regarding single neurons but by regarding the overall structure of the whole feature map. This can be done by using U-Matrix methods [Ultsch 94]. With Emergent Feature Maps we could show that Self-Organizing Feature Maps are different and often superior to classical clustering algorithms [Ultsch 95]. A canonical example where this can be seen is a dataset consisting of two different subsets. These subsets are taken from two well-separated toroids that are interlinked like a chain, as can be seen in Figure 3. Using an Emergent Feature Map of dimension 64 by 64 = 4096 neurons, the two separate links of the chain could easily be distinguished. In contrast to this, many statistical algorithms, in particular the k-means algorithm, were unable to produce a correct classification. We think that the task of Data Mining, i.e. the seeking of new knowledge, calls for Emergent Feature Maps. The property of emergence, i.e. the appearance of new structures on a different abstraction level, coincides well with the idea of discovering new knowledge.
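A minimal way to reproduce a chainlink-style data set (Fig. 3) for such experiments is sketched below. This is not the original data set; it simply generates two noisy rings that are interlinked like a chain, which is enough to reproduce the qualitative behaviour described in the text.

```python
import numpy as np

def chainlink(n_per_ring: int = 500, noise: float = 0.05, seed: int = 0):
    """Two interlocking rings: one in the x-y plane, the second in the x-z plane
    and shifted along x so that the rings are linked like a chain."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 2.0 * np.pi, n_per_ring)
    ring1 = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])
    t = rng.uniform(0.0, 2.0 * np.pi, n_per_ring)
    ring2 = np.column_stack([1.0 + np.cos(t), np.zeros_like(t), np.sin(t)])
    data = np.vstack([ring1, ring2]) + noise * rng.standard_normal((2 * n_per_ring, 3))
    labels = np.repeat([0, 1], n_per_ring)
    return data, labels
```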
Fig. 3: Chainlink Dataset

5. Construction of Classifiers

When Emergent Feature Maps with a sufficiently large number of neurons are trained with high-dimensional input data, the datapoints distribute sparsely on the feature map. Regarding the position of the bestmatches, i.e. those neurons whose weights are most similar to a given input point, gives no hint of any structure in the input dataset. In the following picture a three-dimensional dataset consisting of a thousand points was projected onto a 64 by 64 Emergent Feature Map. The topology of the feature map is toroid, i.e. the borders of the map are cyclically connected. The positions of the bestmatches exhibit no structure in the input data. In order that structures of the input data can emerge, we use the so-called U-Matrix method. The simplest of these methods is to sum up the distances between a neuron's weights and those of its immediate neighbours. This sum of distances to its neighbours is displayed as elevation at the position of each neuron. The elevation values of the neurons produce a three-dimensional landscape, the so-called U-Matrix. U-Matrices have the following properties:
- Bestmatches that are neighbours in the high-dimensional input space lie in a common valley.
- If there are gaps in the distribution of input points, hills can be seen on the U-Matrix. The elevation of the hills is proportional to the gap distance in the input space.
- The principal property of Self-Organizing Feature Maps, i.e. conserving the overall topology of the input space, is inherited by the U-Matrix. Neighbouring data in the input space can also be found at neighbouring places on the U-Matrix.
- Topological relations between the clusters are also represented on the two-dimensional layout of the neurons.

Fig. 4: A U-Matrix of the Data

With U-Matrix methods emergence in Kohonen maps has been observed for many different applications, for example: medical diagnosis, economics, environmental science, industrial process control, meteorology, etc. The cluster structure of the input dataset is detected using a U-Matrix. Clusters in the input data can be detected in the U-Matrix as valleys surrounded by hills with more or less elevation, i.e. clusters can be detected, for example, by raising a virtual water level up to a point where the water floods a valley on the U-Matrix. Regarding a U-Matrix, the user can indeed grasp the high-dimensional structure of the data. Neurons that lie in a common valley are subsumed to a cluster. Regions of a feature map that have high elevations in a U-Matrix are not identified with a cluster. Neurons that lie in a valley but are not bestmatches are interpolations of the input data. This makes it possible to cluster data with Emergent Feature Maps. This approach has been extensively tested over the last years and for many different applications. It can be shown that this method gives a very good picture of the high-dimensional and otherwise invisible structure of the data. In many applications meanings for clusters could be detected. Emergent Feature Maps can easily be used to construct classifiers. If the U-Matrix has been separated into valleys corresponding to clusters and hills corresponding to gaps in the data, then an input datapoint can be easily classified by looking at the bestmatch of this datapoint. If the point's bestmatch lies inside a cluster region on the U-Matrix, the input datapoint is added to that cluster. If the bestmatch lies on a hill in the U-Matrix, no classification of this point can be assigned. This is in particular the case if the dataset possesses new features, i.e. aspects that were not included in the data learned so far. With this approach, for example, outliers and erroneous data are easily detected.
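The simplest U-Matrix variant described above, summing the distances of each neuron's weight vector to those of its immediate grid neighbours, can be sketched in a few lines. The sketch assumes a rectangular, non-toroidal grid; for a toroid map like the one used in the text, the neighbour indices would wrap around with modular arithmetic.

```python
import numpy as np

def u_matrix(weights: np.ndarray) -> np.ndarray:
    """For every neuron of a rectangular SOM, sum the Euclidean distances between
    its weight vector and the weights of its 4-connected grid neighbours.
    `weights` has shape (rows, cols, dim)."""
    rows, cols, _ = weights.shape
    u = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    u[r, c] += np.linalg.norm(weights[r, c] - weights[nr, nc])
    return u
```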
6. Knowledge Conversion

The classifiers constructed with Emergent Feature Maps and the U-Matrix described in the last chapter possess the "knowledge" to classify new data. This knowledge is, however, not symbolic. Neither a reason why a particular dataset belongs to a particular cluster, nor why a given dataset cannot be classified, can be given. What is necessary at this point is to convert this type of knowledge to a symbolic form. We have developed an algorithm called sig* in order to perform this Knowledge Conversion [Ultsch 94]. As input sig* takes the classifier as described in the last chapter. A symbolic description of all the weights of the neurons belonging to a particular cluster is constructed. Sig* generates descriptions using decision rules. These decision rules contain as premises conditions on the input data and as conclusions the decision for a particular cluster. Clusters are described by two different types of rules. There are so-called characterization rules, which describe the main characteristics of a cluster. Secondly, there are rules which describe the difference between a particular cluster and neighbouring clusters. The different steps of this Knowledge Conversion can be described as follows:
- selection of the components of the high-dimensional input data that are most significant for the characterization of a cluster
- construction of appropriate conditions for the main properties of a cluster
- composition of the conditions in order to produce a high-level significant description.

To realize the first step, sig* uses a measure of significance for each component of the high-dimensional input data with respect to a cluster. The algorithm uses only very few conditions if the clusters can be easily described. If the clusters are more difficult to describe, sig* uses more conditions, and more rules describing the differences are generated in order to specify the borders of a cluster. For the representation of conditions, sig* uses interval descriptions for characterization rules and splitting conditions for the differentiation rules. The conditions can be combined using "and", "or" or a majority vote. It could be shown that for known classifications sig* reproduces 80 to 90+ % of the classification ability of an Emergent Feature Map.

7. The Neuronal Data Mine NDM

The methods presented in the previous chapters have been developed and refined over the last years and combined into a tool for Data Mining and Knowledge Discovery called Neuronal Data Mine [Ultsch 99a]. This tool contains the following modules:
- Statistics (Inspection of the Data)
- Emergent Feature Maps (Clustering)
- U-Matrix (Construction of Classifiers)
- sig* (Knowledge Conversion)
- Validation
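To give a feeling for the output of the Knowledge Conversion module, the toy sketch below builds a single interval-based characterization rule for one cluster. It is not the sig* algorithm; it only mimics the flavour of its characterization rules, using the deviation of cluster means from the overall means as a crude stand-in for sig*'s significance measure.

```python
import numpy as np

def characterization_rule(cluster: np.ndarray, rest: np.ndarray,
                          names: list, k: int = 3) -> str:
    """Toy sig*-style rule: pick the k components whose cluster mean deviates most
    (in units of the overall standard deviation) and describe the cluster by
    interval conditions on those components."""
    mu_c, mu_all = cluster.mean(axis=0), rest.mean(axis=0)
    sd_all = rest.std(axis=0) + 1e-12
    significance = np.abs(mu_c - mu_all) / sd_all
    picked = np.argsort(significance)[::-1][:k]
    terms = [f"{names[i]} in [{cluster[:, i].min():.2f}, {cluster[:, i].max():.2f}]"
             for i in picked]
    return "IF " + " AND ".join(terms) + " THEN cluster"
```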
Fig. 5: Screen-shot of the user interface of the NDM

8. The NeuroDataMine for Knowledge Discovery in Time Series

One of the latest and most fascinating applications of NDM is Data Mining in multivariate time series. The key for this application is a suitable knowledge representation for temporal structures (see chapter 2). With the definition of unification-based temporal grammars (UTG) this key issue has been solved [Ultsch 99b]. UTGs belong to the class of definitive clause grammars. UTGs describe temporal phenomena by a hierarchy of semiotic description levels. Each semiotic description level consists of a symbol (syntax), a description (semantics) and an explanation useful in the context of a particular application (pragmatics). Temporal phenomena are described on each level using temporal phases and temporal operations. The latter are called connexes. As phases we identified Primitive Patterns, Successions, Events, Sequences and Temporal Patterns. Primitive Patterns represent the different elementary conditions of the process described by the multivariate time series. Successions model the duration, Events the simultaneity of temporal phases. Sequences are used to formulate repetitions. Temporal Patterns finally condense variations in Sequences to a common abstract description of important temporal patterns. The phases described above can be combined using connexes for duration, simultaneity and temporal sequence. These temporal operations are designed to be flexible with regard to time. The connexes do not require a coincidence of events in a mathematical sense. They allow a certain flexibility, i.e. events that are sufficiently close in time are considered to be simultaneous. This is necessary since the multivariate time series stem from natural or technical processes that always have a certain variation in time. A special fuzzy representation was used to represent this flexibility [Ultsch 99b]. This approach leads to only three temporal operations for the representation of temporal features. In other representation formalisms, for example Allen's, many more temporal operations are necessary [Allen 84]. Emergent feature maps are used for Temporal Data Mining in the following steps:
- description of the elementary conditions of the process (Primitive Patterns)
- description of the duration of phases (Successions)
- description of simultaneity (Events)
- detection of start and end points of temporal patterns (Sequences resp. Temporal Patterns).
Fig. 6: Temporal Data Mining
9. Application: Sleep Related Breathing Disorders

In a first application the Temporal Data Mine has been used for a medical problem, the so-called Sleep Related Breathing Disorders (SRBD) [Penzel et al 91]. Humans who suffer from SRBD experience the stopping of breathing during certain periods of sleep. The stopping periods are critical if they last at least 10 seconds and occur more than 40 times per hour [Penzel et al 91]. The multivariate time series considered were: EEG, EMG, EOG, EKG, airflow, thorax and abdomen movements, and saturation of the blood with oxygen.
Fig. 7: Multivariate Time Series for SRBD

For these multivariate time series, two different types of U-Matrix, called "air" and "move", were generated [Guimaraes/Ultsch 99]. The U-Matrix "air" focuses on all aspects of the time series related to airflow. The U-Matrix "move" concentrates on aspects of movements of thorax and abdomen. In the "air" U-Matrix six elementary states were identified. These elementary states (clusters) are considered elementary primitive temporal elements and termed Primitive Patterns. In the U-Matrix "move" nine Primitive Patterns could be identified. The temporal sequence of the Primitive Patterns is represented as paths on the U-Matrices. Using temporal knowledge conversion, six Events and five different Temporal Patterns could be found in the time series. The knowledge was formulated in UTG notation. All semiotic description levels of the UTG (see last chapter) have been presented to an expert in order to evaluate the plausibility of the phases and the descriptions. This showed that all the events found represented important medical properties. In particular, the events could be related to physiological stages like, for example, "obstructive snoring" or "hyperpnoe". Four of the five Temporal Patterns that were discovered were very well known to the expert and could be assigned a medical meaning. One of the Temporal Patterns was a newly discovered pattern inherent in some type of human sleep. This gave a hint of a potentially new way to look at certain types of sleeping disorders. In the following picture an example of a Temporal Pattern and the corresponding multivariate time series is depicted.
Fig. 8: Temporal Knowledge

10. Conclusion

In Data Mining the first step after the inspection of a dataset is the identification of clusters. SOFM with only few neurons implicitly limit the number of clusters to be found in the data. With such feature maps only a very crude insight into the input data can be gained, if at all. In feature maps possessing thousands of neurons, U-Matrix methods can be used to detect emergence. U-Matrices visualize structures in the data by considering the cooperation of many neurons. The structures seen give insights into the otherwise invisible high-dimensional data space. It can be shown that emergent feature maps are superior to other clustering methods, particularly to k-means [Ultsch 95]. The most important step of Data Mining is Knowledge Conversion, i.e. the transition from a subsymbolic to a symbolic representation of knowledge. Emergent feature maps provide an excellent starting point for Knowledge Conversion. Other classifiers such as decision trees focus on the efficiency of the discrimination between clusters. Declarative rules extracted from U-Matrices using sig* provide a description of the significant properties of clusters. The methods described above could be used to analyze multivariate time series. Unification-based temporal grammars (UTG) have been developed as a tool to represent symbolic knowledge for Temporal Data Mining. The approach has been successfully tested for a medical problem regarding sleep disorders.
11. References
[Allen 84] Allen, J.: Towards a General Theory of Action and Time, Artificial Intelligence 23, 1984, pp. 123-154.
[Gaul 98] Gaul, W.: Classification and Positioning of Data Mining Tools, Herausforderungen der Informationsgesellschaft an Datenanalyse und Wissensverarbeitung, 22. Jahrestagung der Gesellschaft für Klassifikation, 1998.
[Guimarães/Ultsch 99] Guimarães, G., Ultsch, A.: A Method for Temporal Knowledge Conversion, to appear.
[Guimarães/Ultsch 96] Guimarães, G., Ultsch, A.: A Symbolic Representation for Patterns in Time Series Using Definitive Clause Grammars, 20th Annual Conference of the Society for Classification, Freiburg, 6th-8th March 1996, pp. 105-111.
[Kohonen 82] Kohonen, T.: Self-Organized Formation of Topologically Correct Feature Maps, Biological Cybernetics, Vol. 43, pp. 59-69, 1982.
[Penzel et al. 91] Penzel, P., Stephan, K., Kubicki, S., Herrmann, W. M.: Integrated Sleep Analysis with Emphasis on Automatic Methods, in: R. Degen, E. A. Rodin (Eds.), Epilepsy, Sleep and Sleep Deprivation, 2nd ed. (Epilepsy Res. Suppl. 2), Elsevier Science Publisher, 1991, pp. 177-204.
[Reutterer 98] Reutterer, T.: Panel Data Based Competitive Market Structure and Segmentation Analysis Using Self-Organizing Feature Maps, Proc. Annual Conf. Society for Classification, p. 92, Dresden, 1998.
[Ultsch 99a] Ultsch, A.: Data Mining und Knowledge Discovery mit Neuronalen Netzen, Technical Report, Department of Computer Science, University of Marburg, Hans-Meerwein-Str., 35032 Marburg.
[Ultsch 99b] Ultsch, A.: Unifikationsbasierte Temporale Grammatiken für Data Mining und Knowledge Discovery in multivariaten Zeitreihen, Technical Report, Department of Computer Science, University of Marburg, March 1999.
[Ultsch 98] Ultsch, A.: The Integration of Connectionist Models with Knowledge-based Systems: Hybrid Systems, Proceedings of the 11th IEEE SMC 98 International Conference on Systems, Man and Cybernetics, 11-14 October 1998, San Diego.
[Ultsch 95] Ultsch, A.: Self-Organizing Neural Networks Perform Different from Statistical k-means Clustering, Gesellschaft für Klassifikation, Basel, 8th-10th March 1995.
[Ultsch 94] Ultsch, A.: The Integration of Neural Networks with Symbolic Knowledge Processing, in: Diday et al., "New Approaches in Classification and Data Analysis", pp. 445-454, Springer Verlag, 1994.
[Ultsch 87] Ultsch, A.: Control for Knowledge-based Information Retrieval, Verlag der Fachvereine, Zürich, 1987.
[Woods/Kyral 97] Woods, E., Kyral, E.: Data Mining, Ovum Evaluates, Catalunya, Spain, 1997.
From Aggregation Operators to Soft Learning Vector Quantization and Clustering Algorithms

Nicolaos B. Karayiannis
Department of Electrical and Computer Engineering, University of Houston, Houston, Texas 77204-4793, USA
This paper presents an axiomatic approach for developing soft learning vector quantization and clustering algorithms based on aggregation operators. The development of such algorithms is based on a subset of admissible aggregation operators that lead to competitive learning vector quantization models. Two broad families of algorithms are developed as special cases of the proposed formulation.
1. INTRODUCTION

Consider the set $\mathcal{X} \subset \mathbb{R}^n$ which is formed by $M$ feature vectors from an $n$-dimensional Euclidean space, that is, $\mathcal{X} = \{x_1, x_2, \ldots, x_M\}$, $x_i \in \mathbb{R}^n$, $1 \le i \le M$. Clustering is the process of partitioning the $M$ feature vectors into $c < M$ clusters, which are represented by the prototypes $v_j \in \mathcal{V}$, $j \in \mathcal{N}_c = \{1, 2, \ldots, c\}$. Vector quantization can be seen as a mapping from an $n$-dimensional Euclidean space $\mathbb{R}^n$ into the finite set $\mathcal{V} = \{v_1, v_2, \ldots, v_c\} \subset \mathbb{R}^n$, also referred to as the codebook. Codebook design can be performed by clustering algorithms, which are typically developed by solving a constrained minimization problem using alternating optimization. These clustering techniques include the crisp c-means [1], fuzzy c-means [1], generalized fuzzy c-means [2], and entropy-constrained fuzzy clustering algorithms [3]. Recent developments in neural network architectures resulted in learning vector quantization (LVQ) algorithms [4-11]. Learning vector quantization is the name used in this paper for unsupervised learning algorithms associated with a competitive neural network. Batch fuzzy learning vector quantization (FLVQ) algorithms were introduced by Tsao et al. [5]. The update equations for FLVQ involve the membership functions of the fuzzy c-means (FCM) algorithm, which are used to determine the strength of attraction between each prototype and the input vectors. Karayiannis and Bezdek [9] developed a broad family of batch LVQ algorithms that can be implemented as the FCM or FLVQ algorithms. The minimization problem considered in this derivation is actually a reformulation of the problem of determining fuzzy c-partitions that was solved by the FCM algorithm [12]. This paper presents an axiomatic approach to soft learning vector quantization and clustering based on aggregation operators.
2. REFORMULATION FUNCTIONS BASED ON AGGREGATION OPERATORS
Clustering algorithms are typically developed to solve a constrained minimization problem which involves two sets of unknowns, namely the membership functions that assign feature vectors to clusters and the prototypes. The solution of this problem is often determined using alternating optimization [1]. Reformulation is the process of reducing an objective function treated by alternating optimization to a function that involves only one set of unknowns, namely the prototypes [9,12]. The function resulting from this process is referred to as the reformulation function. A broad family of batch LVQ algorithms can be derived by minimizing [9]

$$R_V = \frac{1}{M} \sum_{i=1}^{M} D_p\!\left(\{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c}\right), \qquad (1)$$

where $D_p(\{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c})$ is the generalized mean of $\{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c}$, defined as

$$D_p\!\left(\{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c}\right) = \left( \frac{1}{c} \sum_{\ell=1}^{c} \left( \|x_i - v_\ell\|^2 \right)^{p} \right)^{\frac{1}{p}}, \qquad (2)$$

with $p \in \mathbb{R} - \{0\}$. The function (1) was produced by reformulating the problem of determining fuzzy c-partitions that was solved by the FCM algorithm. The reformulation of the FCM algorithm essentially established a link between batch clustering and learning vector quantization [9-11].
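For concreteness, the reformulation function (1)-(2) can be evaluated numerically as follows. This is a straightforward sketch of the formulas above, not code from the paper.

```python
import numpy as np

def reformulation_value(X: np.ndarray, V: np.ndarray, p: float) -> float:
    """Evaluate R_V of Eqs. (1)-(2): the average over all feature vectors of the
    generalized mean (exponent p != 0) of the squared Euclidean distances to the
    c prototypes.  X has shape (M, n), V has shape (c, n)."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # (M, c) squared distances
    gen_mean = (np.mean(d2 ** p, axis=1)) ** (1.0 / p)        # generalized mean per sample
    return float(np.mean(gen_mean))
```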
2.1. Aggregation Operators

The reformulation function (1) is formed by averaging the generalized means of $\{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c}$ over all feature vectors $x_i \in \mathcal{X}$. The generalized mean is perhaps the most widely known and used aggregation operator. Nevertheless, the search for soft LVQ and clustering algorithms can naturally be extended to a broad variety of aggregation operators selected according to the axiomatic requirements that follow. A multivariable function $h(a_1, a_2, \ldots, a_c)$ is an aggregation operator on its arguments if it satisfies the following axiomatic requirements:
Axiom A1: The function $h(\cdot)$ is continuous and differentiable everywhere.

Axiom A2: The function $h(a_1, a_2, \ldots, a_c)$ is idempotent, that is, $h(a, a, \ldots, a) = a$.

Axiom A3: The function $h(\cdot)$ is monotonically nondecreasing in all its arguments, that is, $h(a_1, a_2, \ldots, a_c) \le h(b_1, b_2, \ldots, b_c)$ for any pair of c-tuples $[a_1, a_2, \ldots, a_c]$ and $[b_1, b_2, \ldots, b_c]$ such that $a_i, b_i \in (0, \infty)$ and $a_i \le b_i$, $\forall i \in \mathcal{N}_c$.

Soft LVQ and clustering algorithms were developed by using gradient descent to minimize the reformulation function (1). This function can also be written as

$$R = \frac{1}{M} \sum_{i=1}^{M} h\!\left( \|x_i - v_1\|^2, \|x_i - v_2\|^2, \ldots, \|x_i - v_c\|^2 \right), \qquad (3)$$

where the aggregation operator $h(\cdot)$ is the generalized mean of $\{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c}$. This is an indication that the search for soft LVQ and clustering algorithms can be extended to reformulation functions of the form (3), where $h(\cdot)$ is an aggregation operator in accordance with the axiomatic requirements A1-A3.
2.2. Update Equations

Suppose the development of LVQ algorithms is attempted by using gradient descent to minimize the function (3). The gradient $\nabla_{v_j} R = \partial R / \partial v_j$ of $R$ with respect to the prototype $v_j$ can be determined as

$$\nabla_{v_j} R = -\frac{2}{M} \sum_{i=1}^{M} \alpha_{ij} \, (x_i - v_j), \qquad (4)$$

where $\{\alpha_{ij}\}$ are the competition functions, defined as

$$\alpha_{ij} = \frac{\partial}{\partial (\|x_i - v_j\|^2)} \, h\!\left( \|x_i - v_1\|^2, \|x_i - v_2\|^2, \ldots, \|x_i - v_c\|^2 \right). \qquad (5)$$

The update equation for the prototypes can be obtained according to the gradient descent method as

$$\Delta v_j = -\eta_j \nabla_{v_j} R = \eta_j \sum_{i=1}^{M} \alpha_{ij} \, (x_i - v_j), \qquad (6)$$

where $\eta_j$ is the learning rate for the prototype $v_j$ (the constant $2/M$ can be absorbed into the learning rate) and the competition functions $\{\alpha_{ij}\}$ are determined in terms of $h(\cdot)$ according to (5). The LVQ algorithms derived above can be implemented iteratively. Let $\{v_{j,\nu-1}\}_{j \in \mathcal{N}_c}$ be the set of prototypes obtained after the $(\nu-1)$th iteration. According to the update equation (6), a new set of prototypes $\{v_{j,\nu}\}_{j \in \mathcal{N}_c}$ can be obtained according to

$$v_{j,\nu} = v_{j,\nu-1} + \eta_{j,\nu} \sum_{i=1}^{M} \alpha_{ij,\nu} \, (x_i - v_{j,\nu-1}), \qquad j \in \mathcal{N}_c. \qquad (7)$$

If $\eta_{j,\nu} = \left( \sum_{i=1}^{M} \alpha_{ij,\nu} \right)^{-1}$, then $\{v_{j,\nu-1}\}_{j \in \mathcal{N}_c}$ do not affect the computation of $\{v_{j,\nu}\}_{j \in \mathcal{N}_c}$, which are obtained only in terms of the feature vectors $x_i \in \mathcal{X}$. The algorithms derived above can be implemented as clustering or batch LVQ algorithms [10,11].
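Equation (7) with the normalizing choice of learning rate admits a very compact implementation. The sketch below assumes the competition functions have already been computed for the current prototypes; it is an illustration of the update step, not of any specific algorithm from the paper.

```python
import numpy as np

def batch_soft_lvq_step(X: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """One batch update of Eq. (7) with eta_j = 1 / sum_i alpha_ij, in which case
    the new prototypes are convex combinations of the data and the previous
    prototypes drop out.  X: (M, n) data, alpha: (M, c) competition functions."""
    weights = alpha / alpha.sum(axis=0, keepdims=True)   # column-normalised weights
    return weights.T @ X                                  # (c, n) updated prototypes
```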
2.3. Admissible Reformulation Functions Based on Aggregation Operators

The search for admissible reformulation functions is based on the properties of the competition functions $\{\alpha_{ij}\}$, which regulate the competition between the prototypes $\{v_j\}_{j \in \mathcal{N}_c}$ for each feature vector $x_i \in \mathcal{X}$. The following three axioms describe the properties of admissible competition functions:

Axiom R1: If $c = 1$, then $\alpha_{i1} = 1$, $1 \le i \le M$.

Axiom R2: $\alpha_{ij} \ge 0$, $1 \le i \le M$; $1 \le j \le c$.

Axiom R3: If $\|x_i - v_p\|^2 > \|x_i - v_q\|^2 > 0$, then $\alpha_{ip} < \alpha_{iq}$, $1 \le p, q \le c$, and $p \ne q$.

Axiom R1 indicates that there is actually no competition in the trivial case where all feature vectors $x_i \in \mathcal{X}$ are represented by a single prototype. Thus, the single prototype is equally attracted by all feature vectors $x_i \in \mathcal{X}$. Axiom R2 implies that all feature vectors $x_i \in \mathcal{X}$ compete to attract all prototypes $\{v_j\}_{j \in \mathcal{N}_c}$. Axiom R3 implies that a prototype $v_q$ that is closer in the Euclidean distance sense to the feature vector $x_i$ than another prototype $v_p$ is attracted more strongly by this feature vector. Minimization of a function $R$ defined in terms of an aggregation operator $h(\cdot)$ in (3) does not necessarily lead to competitive LVQ algorithms satisfying the three axiomatic requirements R1-R3. The function $R$ can be made an admissible reformulation function by imposing additional conditions on the aggregation operator $h(\cdot)$. This can be accomplished by utilizing the Axioms R1-R3, which lead to the admissibility conditions for aggregation operators summarized by the following theorem:

Theorem 1: Let $\mathcal{X} = \{x_1, x_2, \ldots, x_M\} \subset \mathbb{R}^n$ be a finite set of feature vectors which are represented by the set of $c < M$ prototypes $\mathcal{V} = \{v_1, v_2, \ldots, v_c\} \subset \mathbb{R}^n$. Consider the function $R$ defined in terms of the multivariable function $h(\cdot)$ in (3). Then, $R$ is an admissible reformulation function in accordance with the axiomatic requirements R1-R3 if:
1. $h(\cdot)$ is a continuous and differentiable everywhere function,
2. $h(\cdot)$ is an idempotent function, i.e., $h(a, a, \ldots, a) = a$,
3. $h(\cdot)$ is monotonically nondecreasing in all its arguments, i.e., $\partial h(a_1, a_2, \ldots, a_c)/\partial a_j \ge 0$, $\forall j \in \mathcal{N}_c$, and
4. $h(\cdot)$ satisfies the condition $\partial h(a_1, a_2, \ldots, a_c)/\partial a_p < \partial h(a_1, a_2, \ldots, a_c)/\partial a_q$ for $a_p > a_q > 0$, $\forall p, q \in \mathcal{N}_c$ and $p \ne q$.

The first three conditions of Theorem 1 indicate that a multivariable function $h(\cdot)$ can be used to construct reformulation functions of the form (3) if it is an aggregation operator in accordance with the axiomatic requirements A1-A3. Nevertheless, Theorem 1 indicates that not all aggregation operators lead to admissible reformulation functions. The subset of all aggregation operators that can be used to construct admissible reformulation functions of the form (3) are those satisfying the fourth condition of Theorem 1.
BASED ON MEAN-TYPE
AGGRE-
The reformulation function (1) can also be written as
with f ( z ) = x 1-m and g(z) = f - x ( x ) = z ~j-~--~. Any function of the form ( 8 ) i s an admissible reformulation function if f(.) and g(.) satisfy the conditions summarized by the following theorem [11]:
51 Theorem 2: Let A" = { x I , x 2 , . . . , X M } C ]~n be a finite set of feature vectors which are represented by the set of c < M prototypes 12 = {Vl,V2,...,vc} C IRn. Consider the function R defined in (3), with
l=1
Then, R is an admissible reformulation with the axiomatic requirements R1-R3 everywhere functions satisfying f ( g ( x ) ) creasing (increasing) functions of x E (decreasing) function of x E (0, oc).
function of the first (second) kind in accordance if f(.) and g(.) are continuous and differentiable = x, f ( x ) and g(x) are both monotonically de(0, c~), and g ' ( x ) i s a monotonically increasing
A broad variety of reformulation functions can be constructed using functions $g(\cdot)$ of the form $g(x) = (g_0(x))^{\frac{1}{1-m}}$, $m \ne 1$, where $g_0(x)$ is called the generator function. The following theorem summarizes the conditions that must be satisfied by admissible generator functions [11]:
Theorem 3: Consider the function $R$ defined in terms of the aggregation operator (9) in (3). Suppose $g(\cdot)$ is defined in terms of the generator function $g_0(\cdot)$, continuous on $(0, \infty)$, as $g(x) = (g_0(x))^{\frac{1}{1-m}}$, $m \ne 1$, and let
\[
r_0(x) = \frac{m}{m-1} \left( g_0'(x) \right)^2 - g_0(x)\, g_0''(x). \qquad (10)
\]
The generator function $g_0(x)$ leads to an admissible reformulation function $R$ if:
- $g_0(x) > 0$, $\forall x \in (0, \infty)$, $g_0'(x) > 0$, $\forall x \in (0, \infty)$, and $r_0(x) > 0$, $\forall x \in (0, \infty)$, or
- $g_0(x) > 0$, $\forall x \in (0, \infty)$, $g_0'(x) < 0$, $\forall x \in (0, \infty)$, and $r_0(x) < 0$, $\forall x \in (0, \infty)$.
If $g_0'(x) > 0$, $\forall x \in (0, \infty)$, and $m > 1$ ($m < 1$), then $R$ is a reformulation function of the first (second) kind. If $g_0'(x) < 0$, $\forall x \in (0, \infty)$, and $m > 1$ ($m < 1$), then $R$ is a reformulation function of the second (first) kind.

3.1. Competition and Membership Functions
Consider the soft LVQ and clustering algorithms produced by minimizing a reformulation function of the form (3), where the aggregation operator $h(\cdot)$ is defined in (9). Using the definition of $h(\cdot)$,
\[
\frac{\partial}{\partial a_j} h(a_1, a_2, \ldots, a_c) = \frac{\partial}{\partial a_j} f\left( \frac{1}{c} \sum_{\ell=1}^{c} g(a_\ell) \right) = \frac{1}{c}\, g'(a_j)\, f'\left( \frac{1}{c} \sum_{\ell=1}^{c} g(a_\ell) \right). \qquad (11)
\]
Since the constant $1/c$ can be incorporated in the learning rate, the competition functions $\{\alpha_{ij}\}$ can be obtained from (5) using (11) as
\[
\alpha_{ij} = g'\left( \|x_i - v_j\|^2 \right) f'(S_i), \quad \text{where} \quad S_i = \frac{1}{c} \sum_{\ell=1}^{c} g\left( \|x_i - v_\ell\|^2 \right). \qquad (12)
\]
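As an illustration of how the competition functions (12) drive a soft LVQ scheme, the following Python sketch performs one sweep of gradient-descent updates. It is only a sketch: the helper name is invented, the FCM-type pair $f(x) = x^{1-m}$, $g(x) = x^{1/(1-m)}$ is used as an example, and the prototype update $v_j \leftarrow v_j + \eta\, \alpha_{ij}(x_i - v_j)$ is assumed to be the gradient-descent step referred to via (5) and (6), which are not reproduced here.

```python
import numpy as np

def soft_lvq_epoch(X, V, m=2.0, eta=0.1):
    """One sweep of a soft LVQ update driven by the competition
    functions alpha_ij = g'(||x_i - v_j||^2) f'(S_i) of Eq. (12).

    Assumes f(x) = x**(1 - m), g(x) = x**(1/(1 - m)) and the update
    v_j <- v_j + eta * alpha_ij * (x_i - v_j) (a sketch only)."""
    g  = lambda x: x ** (1.0 / (1.0 - m))
    gp = lambda x: (1.0 / (1.0 - m)) * x ** (m / (1.0 - m))   # g'(x)
    fp = lambda x: (1.0 - m) * x ** (-m)                      # f'(x)
    for x in X:
        d2 = np.sum((x - V) ** 2, axis=1)     # squared distances to all prototypes
        S = np.mean(g(d2))                    # S_i = (1/c) sum_l g(||x_i - v_l||^2)
        alpha = gp(d2) * fp(S)                # competition functions, Eq. (12)
        V += eta * alpha[:, None] * (x - V)   # every prototype moves toward x
    return V

# toy usage: two Gaussian blobs, four prototypes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
V = rng.normal(2.5, 1, (4, 2))
V = soft_lvq_epoch(X, V)
```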
Suppose $g(\cdot)$ is formed in terms of an admissible generator function $g_0(\cdot)$ as $g(x) = (g_0(x))^{\frac{1}{1-m}}$, $m \ne 1$. In this case, the competition functions $\{\alpha_{ij}\}$ can be obtained as
\[
\alpha_{ij} = \theta_{ij} \left( \frac{1}{c} \sum_{\ell=1}^{c} \left( \frac{g_0(\|x_i - v_j\|^2)}{g_0(\|x_i - v_\ell\|^2)} \right)^{\frac{1}{m-1}} \right)^{-m}, \qquad (13)
\]
where
\[
\theta_{ij} = g_0'\left( \|x_i - v_j\|^2 \right) f_0'\left( S_i^{1-m} \right) = g_0'\left( \|x_i - v_j\|^2 \right) f_0'\left( \left( \frac{1}{c} \sum_{\ell=1}^{c} \left( g_0(\|x_i - v_\ell\|^2) \right)^{\frac{1}{1-m}} \right)^{1-m} \right), \qquad (14)
\]
with $f_0 = g_0^{-1}$.
It can easily be verified that $\{\alpha_{ij}\}$ and $\{\theta_{ij}\}$ satisfy the condition
\[
\frac{1}{c} \sum_{j=1}^{c} \left( \frac{\alpha_{ij}}{\theta_{ij}} \right)^{\frac{1}{m}} = 1, \quad \forall i \in \mathcal{N}_M. \qquad (15)
\]
Fuzzy c-partitions are required to satisfy the constraint
\[
\sum_{j=1}^{c} u_{ij} = 1, \quad \forall i \in \mathcal{N}_M. \qquad (16)
\]
Consider the generator function $g_0(x) = x^q$, $q > 0$. If $m > 1$, then $g_0(x) = x^q$ generates an admissible reformulation function of the first kind for all $q > 0$. For $m < 1$, $g_0(x) = x^q$ generates an admissible reformulation function of the second kind only if $0 < q < 1 - m$. If $q = 1$, then $g_0(x) = x^q$ generates an admissible reformulation function of the first kind for $m > 1$. This is the range of values of $m$ associated with the FCM algorithm. For $q = 1$, $g_0(x) = x^q$ generates an admissible reformulation function of the second kind if $m < 0$. For this generator function, $\{\theta_{ij}\}$ can be obtained from (14) as
\[
\theta_{ij} = \left( \frac{1}{c} \sum_{\ell=1}^{c} \left( \frac{\|x_i - v_\ell\|^2}{\|x_i - v_j\|^2} \right)^{\frac{q}{1-m}} \right)^{\frac{(1-m)(1-q)}{q}}. \qquad (17)
\]
The competition functions $\{\alpha_{ij}\}$ can be obtained from (13) as
\[
\alpha_{ij} = \theta_{ij} \left( \frac{1}{c} \sum_{\ell=1}^{c} \left( \frac{\|x_i - v_j\|^2}{\|x_i - v_\ell\|^2} \right)^{\frac{q}{m-1}} \right)^{-m}, \qquad (18)
\]
where $\{\theta_{ij}\}$ are defined in (17). If $q = 1$, then $\theta_{ij} = 1$, $\forall i, j$, and
\[
\alpha_{ij} = \left( \frac{1}{c} \sum_{\ell=1}^{c} \left( \frac{\|x_i - v_j\|^2}{\|x_i - v_\ell\|^2} \right)^{\frac{1}{m-1}} \right)^{-m}. \qquad (19)
\]
This is the form of the competition functions of the FLVQ algorithm. In this case, $\alpha_{ij} = (c\, u_{ij})^m$, where
\[
u_{ij} = \left( \sum_{\ell=1}^{c} \left( \frac{\|x_i - v_j\|^2}{\|x_i - v_\ell\|^2} \right)^{\frac{1}{m-1}} \right)^{-1} \qquad (20)
\]
are the membership functions of the FCM algorithm. For $q \ne 1$, the membership functions $\{u_{ij}\}$ can be obtained from the competition functions $\{\alpha_{ij}\}$ as $u_{ij} = (\alpha_{ij})^{1/m}/c$. Using (18),
\[
u_{ij} = \frac{1}{c} \left( \theta_{ij} \right)^{\frac{1}{m}} \left( \frac{1}{c} \sum_{\ell=1}^{c} \left( \frac{\|x_i - v_j\|^2}{\|x_i - v_\ell\|^2} \right)^{\frac{q}{m-1}} \right)^{-1}. \qquad (21)
\]
The membership functions (21) resulting from the generator function $g_0(x) = x^q$ with $q \ne 1$ do not satisfy the constraint (16) that is necessary for fuzzy c-partitions. The c-partitions obtained by relaxing the constraint (16) are called soft c-partitions, and include fuzzy c-partitions as a special case. The performance of the LVQ and clustering algorithms corresponding to the generator function $g_0(x) = x^q$, $q > 0$, depends on both parameters $m \in (1, \infty)$ and $q \in (0, \infty)$. The algorithms generated by $g_0(x) = x^q$ produce asymptotically crisp c-partitions for fixed values of $m > 1$ as $q \to \infty$ and for fixed values of $q > 0$ as $m \to 1^{+}$. The partitions produced by the algorithms become increasingly soft for fixed values of $m > 1$ as $q \to 0$ and for fixed values of $q > 0$ as $m \to \infty$.
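The following sketch (illustrative only; the function name is invented and the implementation simply evaluates (17), (20) and (21) as given above) makes the dependence on $m$ and $q$ concrete: for $q = 1$ the rows of the membership matrix sum to one, while for $q \ne 1$ they generally do not.

```python
import numpy as np

def soft_memberships(X, V, m=2.0, q=1.0):
    """Membership functions generated by g_0(x) = x**q.
    q = 1 reproduces the FCM memberships (20); q != 1 gives the
    soft c-partition memberships (21), u_ij = alpha_ij**(1/m) / c."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)     # ||x_i - v_j||^2, shape (M, c)
    c = V.shape[0]
    ratio = d2[:, :, None] / d2[:, None, :]                 # d_ij / d_il
    inner = (ratio ** (q / (m - 1.0))).mean(-1)             # (1/c) sum_l (d_ij/d_il)^(q/(m-1))
    theta = ((1.0 / ratio) ** (q / (1.0 - m))).mean(-1) ** ((1.0 - m) * (1.0 - q) / q)  # Eq. (17)
    return theta ** (1.0 / m) * inner ** (-1.0) / c         # Eq. (21); reduces to (20) for q = 1

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
V = rng.normal(size=(3, 2))
print(soft_memberships(X, V, m=2.0, q=1.0).sum(1))   # q = 1: rows sum to 1 (fuzzy c-partition)
print(soft_memberships(X, V, m=2.0, q=3.0).sum(1))   # q != 1: row sums deviate from 1 (soft)
```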
4. REFORMULATION FUNCTIONS BASED ON ORDERED WEIGHTED AGGREGATION OPERATORS
A broad family of soft LVQ and clustering algorithms can also be developed by minimizing reformulation functions constructed in terms of ordered weighted aggregation operators [10]. Consider the function
\[
R = \frac{1}{M} \sum_{i=1}^{M} f\left( \sum_{\ell=1}^{c} w_\ell\, g\left( \|x_i - v_{[\ell]}\|^2 \right) \right), \qquad (22)
\]
where $\{\|x_i - v_{[\ell]}\|^2\}_{\ell \in \mathcal{N}_c}$ are obtained by ordering $\{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c}$ in ascending order, that is, $\|x_i - v_{[1]}\|^2 \le \|x_i - v_{[2]}\|^2 \le \ldots \le \|x_i - v_{[c]}\|^2$, and the weights $\{w_\ell\}_{\ell \in \mathcal{N}_c}$ satisfy $w_\ell \in [0, 1]$, $\forall \ell \in \mathcal{N}_c$, and $\sum_{\ell=1}^{c} w_\ell = 1$. The reformulation function (1) can be obtained from (22) if $g(x) = x^p$, $f(x) = g^{-1}(x) = x^{\frac{1}{p}}$, and $w_\ell = 1/c$, $\forall \ell \in \mathcal{N}_c$. Any function of the form (22) is an admissible reformulation function if the functions $f(\cdot)$ and $g(\cdot)$ and the weights $\{w_\ell\}_{\ell \in \mathcal{N}_c}$ satisfy the conditions summarized by the following theorem:
Theorem 4: Let $\mathcal{X} = \{x_1, x_2, \ldots, x_M\} \subset \mathbb{R}^n$ be a finite set of feature vectors which are represented by the set of $c < M$ prototypes $\mathcal{V} = \{v_1, v_2, \ldots, v_c\} \subset \mathbb{R}^n$. Consider the function $R$ defined in (3), with
\[
h(a_1, a_2, \ldots, a_c) = f\left( \sum_{\ell=1}^{c} w_\ell\, g\left( a_{[\ell]} \right) \right), \qquad (23)
\]
where $\{a_{[\ell]}\}_{\ell \in \mathcal{N}_c}$ are the arguments of $h(\cdot)$ ordered in ascending order, that is, $a_{[1]} \le a_{[2]} \le \ldots \le a_{[c]}$, and the weights $\{w_\ell\}_{\ell \in \mathcal{N}_c}$ satisfy $w_\ell \in [0, 1]$, $\forall \ell \in \mathcal{N}_c$, and $\sum_{\ell=1}^{c} w_\ell = 1$. Then, $R$ is an admissible reformulation function of the first (second) kind in accordance with the axiomatic requirements R1-R3 if $f(\cdot)$ and $g(\cdot)$ are continuous and differentiable everywhere functions satisfying $f(g(x)) = x$, $f(x)$ and $g(x)$ are both monotonically decreasing (increasing) functions of $x \in (0, \infty)$, $g'(x)$ is a monotonically increasing (decreasing) function of $x \in (0, \infty)$, and $w_1 \ge w_2 \ge \ldots \ge w_c$.
The development of soft LVQ and clustering algorithms can be accomplished by considering aggregation operators of the form (23) corresponding to $g(x) = x^p$ and $f(x) = x^{\frac{1}{p}}$. For $g(x) = x^p$, $g'(x) = p x^{p-1}$ and $g''(x) = p(p-1) x^{p-2}$. The functions $g(x) = x^p$ and $f(x) = x^{\frac{1}{p}}$ are both monotonically decreasing for all $x \in (0, \infty)$ if $p < 0$. Theorem 4 requires that $g'(x)$ be a monotonically increasing function of $x \in (0, \infty)$, which is true if $p(p-1) > 0$. For $p < 0$, this last inequality is valid since $p < 1$. Thus, the function $R$ corresponding to $g(x) = x^p$ is an admissible reformulation function of the first kind if $p \in (-\infty, 0)$. The functions $g(x) = x^p$ and $f(x) = x^{\frac{1}{p}}$ are both monotonically increasing for all $x \in (0, \infty)$ if $p > 0$. In this case, Theorem 4 requires that $g'(x)$ be a monotonically decreasing function for all $x \in (0, \infty)$, which is true if $p(p-1) < 0$. For $p > 0$, this last inequality is valid if $p < 1$. Thus, the function $R$ corresponding to $g(x) = x^p$ is an admissible reformulation function of the second kind if $p \in (0, 1)$. Consider the aggregation operator resulting from (23) if $g(x) = x^p$, which implies that $f(x) = x^{\frac{1}{p}}$. In this case, the reformulation function defined in (3) takes the form
\[
R = \frac{1}{M} \sum_{i=1}^{M} D_p\left( \mathbf{w}, \{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c} \right), \qquad (24)
\]
with
\[
D_p\left( \mathbf{w}, \{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c} \right) = \left( \sum_{\ell=1}^{c} w_\ell \left( \|x_i - v_{[\ell]}\|^2 \right)^p \right)^{\frac{1}{p}}, \qquad (25)
\]
with $p \in \mathbb{R} - \{0\}$. For a given weight vector $\mathbf{w} = [w_1\ w_2 \ldots w_c]^T$ such that $w_\ell \in [0, 1]$, $\forall \ell \in \mathcal{N}_c$, and $\sum_{\ell=1}^{c} w_\ell = 1$, (25) is the ordered weighted generalized mean of $\{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c}$. If $p = 1$, then the ordered weighted generalized mean (25) reduces to the ordered weighted mean. For $\mathbf{w} = [1\ 0 \ldots 0]^T$, $D_p(\mathbf{w}, \{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c}) = \min_{\ell \in \mathcal{N}_c}\{\|x_i - v_\ell\|^2\}$. For $\mathbf{w} = [0\ 0 \ldots 1]^T$, $D_p(\mathbf{w}, \{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c}) = \max_{\ell \in \mathcal{N}_c}\{\|x_i - v_\ell\|^2\}$. If $\mathbf{w} = [w_1\ w_2 \ldots w_c]^T$ with $w_\ell = 1/c$, $\forall \ell \in \mathcal{N}_c$, then (25) coincides with the generalized mean or unweighted $p$-norm of $\{\|x_i - v_\ell\|^2\}_{\ell \in \mathcal{N}_c}$, defined in (2).
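A small Python sketch of the ordered weighted generalized mean (25) illustrates the special cases listed above; the function name is invented and the snippet is not part of the original paper.

```python
import numpy as np

def owg_mean(w, d2, p=1.0):
    """Ordered weighted generalized mean of squared distances, Eq. (25):
    the distances are sorted in ascending order before being weighted."""
    d_sorted = np.sort(np.asarray(d2, dtype=float))      # d_[1] <= ... <= d_[c]
    return (np.dot(w, d_sorted ** p)) ** (1.0 / p)

d2 = [4.0, 1.0, 9.0, 2.0]
c = len(d2)
print(owg_mean([1, 0, 0, 0], d2))          # -> 1.0, the minimum
print(owg_mean([0, 0, 0, 1], d2))          # -> 9.0, the maximum
print(owg_mean([1.0 / c] * c, d2, p=1))    # -> 4.0, the ordinary mean
```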
4.1. Ordered Weighted LVQ Algorithms
Consider the LVQ and clustering algorithms produced by minimizing a reformulation function of the form (3), with the aggregation operator $h(\cdot)$ defined in (23). From the definition of $h(\cdot)$,
\[
\frac{\partial}{\partial a_{[j]}} h(a_1, a_2, \ldots, a_c) = \frac{\partial}{\partial a_{[j]}} f\left( \sum_{\ell=1}^{c} w_\ell\, g\left( a_{[\ell]} \right) \right) = w_j\, g'\left( a_{[j]} \right) f'\left( \sum_{\ell=1}^{c} w_\ell\, g\left( a_{[\ell]} \right) \right). \qquad (26)
\]
The update equation for each $v_{[j]}$ can be obtained from (5) using (26) as
\[
\Delta v_{[j]} = \eta_j \sum_{i=1}^{M} w_j\, \alpha_{i[j]} \left( x_i - v_{[j]} \right), \qquad (27)
\]
where $\{\eta_j\}$ are the learning rates and $\{\alpha_{i[j]}\}$ are the competition functions, defined as
\[
\alpha_{i[j]} = g'\left( \|x_i - v_{[j]}\|^2 \right) f'(S_i), \qquad (28)
\]
where $S_i = \sum_{\ell=1}^{c} w_\ell\, g\left( \|x_i - v_{[\ell]}\|^2 \right)$. If $p = 1/(1-m)$, the function $g(x) = x^{\frac{1}{1-m}}$ leads to admissible reformulation functions of the first kind if $m \in (1, \infty)$. In this case, the competition functions $\{\alpha_{i[j]}\}$ can be obtained from (28) as
\[
\alpha_{i[j]} = \left( \sum_{\ell=1}^{c} w_\ell \left( \frac{\|x_i - v_{[j]}\|^2}{\|x_i - v_{[\ell]}\|^2} \right)^{\frac{1}{m-1}} \right)^{-m}. \qquad (29)
\]
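The sketch below is illustrative only: the helper name is invented and the explicit update step is an assumption consistent with (27)-(29) as given above. It shows the distinctive feature of the ordered weighted algorithms, namely that the weights are attached to rank positions of the sorted squared distances rather than to particular prototypes.

```python
import numpy as np

def ow_lvq_step(x, V, w, m=2.0, eta=0.1):
    """One ordered weighted LVQ step for a single input x.
    Distances are ranked, the competition functions of Eq. (29) are
    computed per rank, and prototype v_[j] is moved according to the
    assumed update v_[j] <- v_[j] + eta * w_j * alpha_i[j] * (x - v_[j])."""
    d2 = np.sum((x - V) ** 2, axis=1)
    order = np.argsort(d2)                                  # order[j]: prototype at rank j
    d2_sorted = d2[order]
    ratio = d2_sorted[:, None] / d2_sorted[None, :]         # d_[j] / d_[l]
    alpha = (ratio ** (1.0 / (m - 1.0)) @ w) ** (-m)        # Eq. (29), one value per rank j
    for j, proto in enumerate(order):
        V[proto] += eta * w[j] * alpha[j] * (x - V[proto])  # weight attached to the rank
    return V

c = 4
w = np.array([0.4, 0.3, 0.2, 0.1])       # non-increasing weights, as required by Theorem 4
rng = np.random.default_rng(2)
V = rng.normal(size=(c, 2))
V = ow_lvq_step(rng.normal(size=2), V, w)
```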
The effect of ordering the squared Euclidean distances $\{\|x_i - v_j\|^2\}_{j \in \mathcal{N}_c}$ is carried in the update equation (27) for each prototype $v_{[j]}$ by the corresponding weight $w_j$, which can be incorporated in the learning rate. In such a case, the update equation for each prototype is independent of the ordering of $\{\|x_i - v_j\|^2\}_{j \in \mathcal{N}_c}$ and takes the form (6), with
\[
\alpha_{ij} = \left( \sum_{\ell=1}^{c} w_\ell \left( \frac{\|x_i - v_j\|^2}{\|x_i - v_{[\ell]}\|^2} \right)^{\frac{1}{m-1}} \right)^{-m}. \qquad (30)
\]
If $w_\ell = 1/c$, $\forall \ell \in \mathcal{N}_c$, then (30) reduces to the competition functions (19) of the FLVQ algorithm, which were produced by minimizing the reformulation function (1).
4.2. Ordered Weighted Clustering Algorithms
For $w_\ell = 1/c$, $\forall \ell \in \mathcal{N}_c$, (30) gives the competition functions (19), which can be written in terms of the membership functions (20) of the FCM algorithm as $\alpha_{ij} = (c\, u_{ij})^m$. This indicates that the competition functions (30) of ordered weighted LVQ algorithms also correspond to a set of membership functions $\{u_{ij}\}$, obtained according to $\alpha_{ij} = (c\, u_{ij})^m$ as
\[
u_{ij} = \frac{1}{c} \left( \sum_{\ell=1}^{c} w_\ell \left( \frac{\|x_i - v_j\|^2}{\|x_i - v_{[\ell]}\|^2} \right)^{\frac{1}{m-1}} \right)^{-1}. \qquad (31)
\]
If $\mathbf{w} = [1\ 0 \ldots 0]^T$, then (31) becomes
\[
u_{ij} = \frac{1}{c} \left( \frac{\min_{\ell \in \mathcal{N}_c}\{\|x_i - v_\ell\|^2\}}{\|x_i - v_j\|^2} \right)^{\frac{1}{m-1}}. \qquad (32)
\]
This is the form of the membership functions of the Minimum FCM algorithm [2]. If $\mathbf{w} = [w_1\ w_2 \ldots w_c]^T$ with $w_\ell = 1/c$, $\forall \ell \in \mathcal{N}_c$, then (31) takes the form of the membership functions (20) of the FCM algorithm [1].
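A short Python sketch of the ordered weighted memberships (31) (illustrative only; the function name is invented) shows how the weight vector interpolates between the Minimum FCM memberships (32) and the standard FCM memberships (20):

```python
import numpy as np

def ow_memberships(X, V, w, m=2.0):
    """Ordered weighted memberships of Eq. (31): the reference distances
    in the denominator are the sorted distances d_[l], weighted by w."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)     # (M, c)
    d2_sorted = np.sort(d2, axis=1)                         # d_[1] <= ... <= d_[c] per object
    s = (d2[:, :, None] / d2_sorted[:, None, :]) ** (1.0 / (m - 1.0)) @ w
    return 1.0 / (V.shape[0] * s)

rng = np.random.default_rng(3)
X, V = rng.normal(size=(5, 2)), rng.normal(size=(3, 2))
w_min = np.array([1.0, 0.0, 0.0])          # -> Minimum FCM memberships, Eq. (32)
w_fcm = np.full(3, 1.0 / 3.0)              # -> standard FCM memberships, Eq. (20)
print(ow_memberships(X, V, w_min, m=2.0))
print(ow_memberships(X, V, w_fcm, m=2.0))
```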
5. CONCLUSIONS
This paper proposed a general framework for developing soft LVQ and clustering algorithms by using gradient descent to minimize a reformulation function based on admissible aggregation operators. This approach establishes a link between competitive LVQ models and operators developed over the years to perform aggregation on fuzzy sets. For mean-type aggregation operators, the development of LVQ and clustering algorithms reduces to the selection of admissible generator functions. This paper studied the properties of soft LVQ and clustering algorithms derived using nonlinear generator functions. Another family of soft LVQ and clustering algorithms was developed by minimizing admissible reformulation functions based on ordered weighted aggregation operators. In addition to its use in the development of soft LVQ and clustering algorithms, the proposed formulation can also provide the basis for exploring the structure of the data by identifying outliers in the feature set. A major study is currently under way, which aims at the evaluation of a broad variety of soft LVQ and clustering algorithms on the segmentation of magnetic resonance images of the brain.
REFERENCES
1. J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum, New York, 1981.
2. N. B. Karayiannis, Proc. Fifth Int. Conf. Fuzzy Syst., New Orleans, LA, September 8-11, 1996, 1036.
3. N. B. Karayiannis, J. Intel. Fuzzy Syst. 5 (1997) 103.
4. T. Kohonen, Self-Organization and Associative Memory, 3rd Edition, Springer-Verlag, Berlin, 1989.
5. E. C.-K. Tsao, J. C. Bezdek, and N. R. Pal, Pattern Recognition 27 (1994) 757.
6. I. Pitas, C. Kotropoulos, N. Nikolaidis, R. Yang, and M. Gabbouj, IEEE Trans. Image Proc. 5 (1996) 1048.
7. N. B. Karayiannis, IEEE Trans. Neural Networks 8 (1997) 505.
8. N. B. Karayiannis, SPIE Proc. 3030 (1997) 2.
9. N. B. Karayiannis and J. C. Bezdek, IEEE Trans. Fuzzy Syst. 5 (1997) 622.
10. N. B. Karayiannis, Proc. 1998 Int. Conf. Fuzzy Syst., Anchorage, AK, May 4-9, 1998, 1388.
11. N. B. Karayiannis, Proc. 1998 Int. Conf. Fuzzy Syst., Anchorage, AK, May 4-9, 1998, 1441.
12. R. J. Hathaway and J. C. Bezdek, IEEE Trans. Fuzzy Syst. 3 (1995) 241.
Active Learning in Self-Organizing Maps
M. Hasenjäger$^a$, H. Ritter$^a$, and K. Obermayer$^b$
$^a$Technische Fakultät, Universität Bielefeld, Postfach 10 01 31, D-33501 Bielefeld, Germany
$^b$Dept. of Computer Science, Technical University of Berlin, FR 2-1, Franklinstr. 28/29, D-10587 Berlin, Germany

The self-organizing map (SOM) was originally proposed by T. Kohonen in 1982 on biological grounds and has since then become a widespread tool for exploratory data analysis. Although introduced as a heuristic, SOMs have been related to statistical methods in recent years, which led to a theoretical foundation in terms of cost functions as well as to extensions to the analysis of pairwise data, in particular of dissimilarity data. In our contribution, we first relate SOMs to probabilistic autoencoders, re-derive the SOM version for dissimilarity data, and review part of the above-mentioned work. Then we turn our attention to the fact that dissimilarity-based algorithms scale with $O(D^2)$, where $D$ denotes the number of data items, and may therefore become impractical for real-world datasets. We find that the majority of the elements of a dissimilarity matrix are redundant and that a sparse matrix with more than 80% missing values suffices to learn a SOM representation of low cost. We then describe a strategy for selecting the most informative dissimilarities for a given set of objects. We suggest selecting (and measuring) only those elements whose knowledge maximizes the expected reduction in the SOM cost function. We find that active data selection is computationally expensive, but may reduce the number of necessary dissimilarities by more than a factor of two compared to a random selection strategy. This makes active data selection a viable alternative when the cost of actually measuring dissimilarities between data objects becomes high.

1. INTRODUCTION
The self-organizing map (SOM) was first described by T. Kohonen in 1982 [1] as a biologically inspired method to generate useful representations of data objects. In the subsequent decade his method has been developed into a widely used tool for exploratory data analysis [2]. The method is simple and intuitive, which made it so popular, and it generates a representation of the data that preserves the important relational properties of the whole dataset while still being amenable to visual inspection.
The SOM combines the two standard paradigms of unsupervised data analysis [3,4], which are the grouping of data by similarity (clustering) and the extraction of explanatory variables (projection methods). Given a representation of data objects in terms of feature vectors which live in a Euclidean feature space, standard SOMs perform a mapping from
the continuous input space to a discrete set of "neurons" (clustering) which are arranged in a lattice. Similar data objects are assigned to the same or to neighboring neurons in such a way that the lattice coordinates (labels) of the neurons often correspond to the relevant combinations of features describing the data. The final representation of the data by the lattice of neurons can be viewed as an "elastic net" of points that is fitted to some curved data manifold in input space, providing a non-linear projection of the data.
SOMs have long been plagued by their origin as a heuristic method (with respect to their property of generating neighborhood-preserving maps), and there have been several efforts to put the SOM on firm theoretical grounds (e.g. [5-10]). By pointing out the relation between topographic clustering methods and source-channel coding problems, Luttrell [11] and later Graepel et al. [12,13] derived cost functions for topographic clustering methods and advocated the use of deterministic annealing strategies (cf. [14]) for their optimization.
The tasks of grouping data, of finding the relevant explanatory variables, and of embedding the data in some low-dimensional "feature space" for the purpose of visualization are not restricted to data that are described via feature vectors and Euclidean distance measures. On the contrary, there is an even stronger need for grouping and for embedding methods if relations between data objects are only defined pairwise, for example via a table of mutual dissimilarities obtained by actually measuring "distances" between objects. Pairwise data are less intuitive, and there are fewer tools available for their analysis. Based on ideas put forward by [11,15], Graepel and Obermayer [16] have recently generalized the SOM method to perform clustering on dissimilarity data and to generate a representation which can be viewed as a non-linear extension of metric multidimensional scaling [17]. If the dissimilarities are given by the Euclidean distances between feature vectors assigned to the objects, the new method reduces to the standard SOM.
The analysis of dissimilarity data, however, faces a serious scaling problem. The number of dissimilarity values grows quadratically with the number of data objects, and for a set of 100 data objects there are already 10,000 dissimilarities to be measured! Data acquisition and data processing may become quite demanding tasks, and the analysis may become infeasible even for a moderate number of data objects. Luckily it turns out that dissimilarity matrices are highly redundant if there is structure in the data to be discovered, a fact that is well known from multidimensional scaling [18,19]. Just consider, for example, a matrix of distances between European cities. Because the cities are located on the earth's surface, three distances per city are in general sufficient to recover their relative locations, all the other distances being redundant. These considerations suggest that the scaling problem for dissimilarity-based algorithms can be overcome by providing strategies (i) for the treatment of missing data, and (ii) for the selection of only those dissimilarity values that carry the relevant information, i.e. for active data selection. Active data selection (e.g. [20-24]) and the missing data problem (e.g. [25,26]) have been well investigated in the past for problems of supervised learning, but applications to unsupervised learning have so far been scarce (but see [27]).
In our contribution we investigate SOMs for the analysis of dissimilarity data, and we specifically address the open problems of missing data and of active data selection. For this purpose, we first review Graepel and Obermayer's [16] work on pairwise topographic clustering in section 2, and we re-derive the generalized SOM utilizing the relationship
Figure 1. Left: Sketch of a neural network autoencoder with three hidden layers. The target representation is formed in the bottleneck layer. Right: A probabilistic autoencoder with three hidden representations.
between clustering and probabilistic autoencoders. In section 3 we then turn to the handling of missing data and to the problem of active data selection. Following [27], we replace every missing distance between two data objects by an average value calculated from the distances to other objects from the same cluster. As a strategy for active data selection we then suggest to include only those dissimilarity values for learning for which we expect the maximum reduction in the clustering cost. In section 4 we provide the results of numerical simulations. We show the performance of the SOM for different percentages of missing data, and we compare the performance of active selection strategies with the performance obtained by selecting new dissimilarity values at random. Section 5, finally, summarizes our conclusions.

2. TOPOGRAPHIC MAPPING OF DISSIMILARITY DATA

2.1. Cost Functions
Fig. 1 (left) shows the sketch of a feedforward neural network architecture called an autoencoder. Autoencoders are typically used to find a representation of the data which makes the regularities in a given dataset explicit. In order to achieve this goal, the input data is first passed through a limited-capacity bottleneck and then reconstructed from the internal representation. The bottleneck enforces a representation using only a few explanatory variables, while the constraint of reliable reconstruction at the output layer ensures that the extracted explanatory variables are indeed relevant. For linear connectionist neurons and a Euclidean distance measure between the data objects, the autoencoder performs a projection of the data into a subspace which is spanned by the first principal components [28]. If the representation in the bottleneck layer is constrained to be sparse - with exactly one neuron active for a given data object - that same architecture generates a clustering solution, as we will see below.
Let us next consider an autoencoder for which the transformations between
representations are probabilistic (Fig. 1 (right)). In an encoding stage, data objects $i$ are mapped to representatives $r$ and $s$ in the bottleneck layer with probabilities $P_1(r|i)$ and $P_2(s|r)$. In the decoding stage, the data objects $i'$ are reconstructed from the internal representation via the probabilistic "decoders" $\hat{P}_2(r'|s)$ and $\hat{P}_1(i'|r')$. Object $i$ and reconstruction $i'$ need not coincide, because the transformations between the representations are probabilistic and because the representation in the bottleneck may be lossy. We now introduce a dissimilarity measure $\delta(i, i')$ which describes the degree of dissimilarity between any two objects $i$ and $i'$. We can think of $\delta$ being derived from a distance measure $d(x, x')$ between associated feature vectors $x$ and $x'$ or being directly measured in an experiment. For any given autoencoder and for any given dissimilarity measure, we then obtain for the average dissimilarity $E$
\[
E = \sum_{i, i'} \sum_{r, s, r'} P_0(i)\, P_1(r|i)\, P_2(s|r)\, \hat{P}_2(r'|s)\, \hat{P}_1(i'|r')\, \delta(i, i'), \qquad (1)
\]
where $P_0(i)$ denotes the probability that a data object $i$ is contained in a given dataset. If the reconstruction of an object $i'$ from a representation $s$ is done in an optimal way, the encoders and the decoders are related by Bayes' theorem,
\[
\hat{P}_1(i'|r') = \frac{P_1(r'|i')\, P_0(i')}{P_1(r')}, \qquad \hat{P}_2(r'|s) = \frac{P_2(s|r')\, P_1(r')}{P_2(s)}, \qquad (2)
\]
so that the average dissimilarity $E$ can be expressed solely in terms of the encoding probabilities. By inserting Eqs. (2) in Eq. (1) we obtain
\[
E = \sum_{i, i'} \sum_{r, s, r'} P_0(i)\, P_1(r|i)\, P_2(s|r)\, \frac{P_2(s|r')\, P_1(r'|i')\, P_0(i')}{P_2(s)}\, \delta(i, i'), \qquad (3)
\]
where $P_2(s)$ is given by
\[
P_2(s) = \sum_{r, i} P_2(s|r)\, P_1(r|i)\, P_0(i). \qquad (4)
\]
The quantity $E$ in Eq. (3) is the cost which is associated with a particular representation $s$. It serves as a quality measure for the representation: the lower the cost $E$, the better the representation $s$ captures the regularities underlying the data. Now we specialize Eq. (3) to describe a representation $r$ of size $N$ of $D$ data objects $i$ as it is obtained via topographic clustering methods. We enforce a sparse representation via the ansatz $P_1(r|i) = m_{ir}$, $i = 1 \ldots D$, $r = 1 \ldots N$, where the assignment variables $m_{ir}$ are equal to one if the data object $i$ is assigned to neuron $r$ and zero otherwise. The assignment variables form the entries of an assignment matrix $\mathcal{M}$ and are constrained by $\sum_r m_{ir} = 1\ \forall i$ to ensure sparseness, i.e. a unique assignment of objects to neurons. Topography is ensured via the second encoding stage, whose probabilities $P_2(s|r)$ can be interpreted as permutation probabilities between neurons. In the following, we assume that these probabilities are given and we treat them as elements of a neighborhood matrix $\mathcal{H} = (h_{rs})_{r,s=1 \ldots N} \in [0,1]^{N \times N}$ with $h_{rs} = P_2(s|r)$. Note that the rows of $\mathcal{H}$ are normalized by $\sum_s h_{rs} = 1\ \forall r$.
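As an illustration of these ingredients (not taken from the paper; the array names and the Gaussian choice for the neighborhood are assumptions), the following Python snippet builds a valid assignment matrix M and a row-normalized neighborhood matrix H for a one-dimensional lattice of N neurons; both are used in the cost function derived next.

```python
import numpy as np

def one_hot_assignments(labels, N):
    """Assignment matrix M (D x N): m_ir = 1 iff object i is assigned to neuron r,
    with exactly one 1 per row (the sparseness constraint sum_r m_ir = 1)."""
    M = np.zeros((len(labels), N))
    M[np.arange(len(labels)), labels] = 1.0
    return M

def lattice_neighborhood(N, sigma=1.0):
    """Row-normalized neighborhood matrix H for a 1-D lattice:
    h_rs = P_2(s|r), chosen here as a Gaussian of the lattice distance |r - s|."""
    r = np.arange(N)
    H = np.exp(-((r[:, None] - r[None, :]) ** 2) / (2.0 * sigma ** 2))
    return H / H.sum(axis=1, keepdims=True)   # rows sum to one

D, N = 8, 5
rng = np.random.default_rng(4)
M = one_hot_assignments(rng.integers(0, N, size=D), N)
H = lattice_neighborhood(N, sigma=1.0)
assert np.allclose(M.sum(axis=1), 1) and np.allclose(H.sum(axis=1), 1)
```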
If we additionally denote the distance between two objects $i$ and $j$ by $d_{ij}$, Eq. (3) reads
\[
E^{\mathrm{TMP}} = \frac{1}{2} \sum_{i,j=1}^{D} \sum_{r,s,t=1}^{N} \frac{m_{ir}\, h_{rs}\, m_{jt}\, h_{ts}}{\sum_{k=1}^{D} \sum_{u=1}^{N} m_{ku}\, h_{us}}\, d_{ij} \qquad (5)
\]
\[
= \frac{1}{2} \sum_{i=1}^{D} \sum_{s=1}^{N} \tilde{m}_{is}\, \tilde{d}_{is}, \qquad (6)
\]
where the factor $\frac{1}{2}$ was introduced for computational convenience and where we have introduced the abbreviations
\[
\tilde{d}_{is} = \frac{\sum_{j=1}^{D} \tilde{m}_{js}\, d_{ij}}{\sum_{k=1}^{D} \tilde{m}_{ks}} \qquad \text{and} \qquad \tilde{m}_{is} = \sum_{u=1}^{N} m_{iu}\, h_{us} \qquad (7)
\]
for the average dissimilarity of object $i$ to objects assigned to neuron $s$ and the "effective" assignment variables $\tilde{m}_{is}$. $E^{\mathrm{TMP}}$ is our cost function for the clustering and the topographic mapping of data objects based on their pairwise dissimilarities (TMP). The optimal clustering solutions are given by the assignment matrices $\mathcal{M}$ for which Eq. (6) is minimal - for a given neighborhood matrix $\mathcal{H}$. These solutions are consistent with the intuitively plausible idea that the "correct" groups of data objects are found if the average dissimilarities between objects that are assigned to the same neurons are minimal.
Let us briefly relate this approach to previous work. If the dissimilarities $d_{ij}$ are given by Euclidean distances between feature vectors, $E^{\mathrm{TMP}}$ is equivalent to the cost function of topographic vector quantization (TVQ) [29]. If there is no coupling between neurons, i.e. if $h_{rs} = \delta_{rs}$, $E^{\mathrm{TMP}}$ reduces to the cost function for pairwise clustering [15].

2.2. Optimization
In order to avoid local minima, optimization of $E^{\mathrm{TMP}}$ with respect to the assignment variables is performed using mean field annealing [30]. Introducing noise in the assignment variables leads to the Gibbs distribution
\[
P_\beta(\mathcal{M}) = \frac{1}{Z_\beta} \exp\left( -\beta\, E^{\mathrm{TMP}} \right) \qquad (8)
\]
for representations, where the sum in the partition function $Z_\beta = \sum_{\{m_{ir}\}} \exp(-\beta E^{\mathrm{TMP}})$ is carried out over all admissible assignment matrices $\mathcal{M}$. For a given value of the inverse computational temperature $\beta$ - which governs the level of noise - the probability of assigning a data object $i$ to a neuron $r$ is then given by the expectation value of $m_{ir}$ with respect to the probability distribution $P_\beta$. Unfortunately, these expectation values cannot be calculated analytically, because $E^{\mathrm{TMP}}$ is not linear in the assignment variables $m_{ir}$. The usual solution to this problem is to approximate the Gibbs distribution Eq. (8) by the factorizing distribution
\[
Q_\beta(\mathcal{M}) = \frac{1}{Z_Q} \exp\left( -\beta \sum_{i=1}^{D} \sum_{r=1}^{N} e_{ir}\, m_{ir} \right) \qquad (9)
\]
and to estimate the quantities $e_{ir}$, the so-called partial assignment costs or mean fields, by minimizing the Kullback-Leibler divergence between $P_\beta$ and $Q_\beta$. As detailed in [16], the optimal mean fields $e_{kr}$ are given by
\[
e_{kr} = \sum_{s=1}^{N} h_{rs} \left( \langle \tilde{d}_{ks} \rangle - \frac{1}{2} \frac{\sum_{j=1}^{D} \langle \tilde{m}_{js} \rangle \langle \tilde{d}_{js} \rangle}{\sum_{l=1}^{D} \langle \tilde{m}_{ls} \rangle} \right), \quad \forall k, r, \qquad (10)
\]
where
\[
\langle m_{kr} \rangle = \frac{\exp(-\beta e_{kr})}{\sum_{s=1}^{N} \exp(-\beta e_{ks})}, \quad \forall k, r, \qquad (11)
\]
are the expectation values of the assignment variables with respect to $Q_\beta$ and
\[
\langle \tilde{d}_{ks} \rangle = \frac{\sum_{j=1}^{D} \langle \tilde{m}_{js} \rangle\, d_{kj}}{\sum_{l=1}^{D} \langle \tilde{m}_{ls} \rangle}, \qquad \langle \tilde{m}_{is} \rangle = \sum_{u=1}^{N} \langle m_{iu} \rangle\, h_{us}, \qquad (12)
\]
are the average effective distances between a data object $k$ and the data objects assigned to neuron $s$. The approximation of $P_\beta$ by $Q_\beta$ is called the mean-field approximation and implicitly assumes that on average the assignments of data items to neurons are independent, in the sense that $\langle m_{ir} m_{jr} \rangle = \langle m_{ir} \rangle \langle m_{jr} \rangle$, where $\langle \cdot \rangle$ denotes the expectation value.
The self-consistent equations (11) and (10) can be solved by fixed point iteration at any value of the temperature parameter $\beta$, $\beta > 0$. In particular, it is possible to employ deterministic annealing in $\beta$, thus finding the unique minimum of the mean field approximation to Eq. (8) at low values of $\beta$, which is then tracked with increasing $\beta$, until at a sufficiently high value of $\beta$ the solution is expected to correspond to a good minimum of the original cost function Eq. (6). For SOM applications, annealing in the computational temperature should generally be preferred over annealing the range of the neighborhood function. It is robust, and it allows one to use $h_{rs}$ solely for encoding permutation probabilities between neurons (cf. [12,31]). Note that for a Euclidean distance measure the standard SOM as described by T. Kohonen in 1982 is obtained from TMP, Eqs. (10) and (11), by omitting the convolution with $h_{rs}$ in Eq. (10) (and considering an on-line update for $\beta \to \infty$) (cf. [12]). Therefore we define a SOM approximation to TMP by:
. 1 f