THE DISSIMILARITY REPRESENTATION FOR PATTERN RECOGNITION Foundations and Applications
SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE* Editors:
H. Bunke (Univ. Bern, Switzerland) P. S. P. Wang (Northeastern Univ., USA)
Vol. 50: Empirical Evaluation Methods in Computer Vision (Eds. H. I. Christensen and P. J. Phillips)
Vol. 51: Automatic Diatom Identification (Eds. H. du Buf and M. M. Bayer)
Vol. 52: Advances in Image Processing and Understanding: A Festschrift for Thomas S. Huang (Eds. A. C. Bovik, C. W. Chen and D. Goldgof)
Vol. 53: Soft Computing Approach to Pattern Recognition and Image Processing (Eds. A. Ghosh and S. K. Pal)
Vol. 54: Fundamentals of Robotics - Linking Perception to Action (M. Xie)
Vol. 55: Web Document Analysis: Challenges and Opportunities (Eds. A. Antonacopoulos and J. Hu)
Vol. 56: Artificial Intelligence Methods in Software Testing (Eds. M. Last, A. Kandel and H. Bunke)
Vol. 57: Data Mining in Time Series Databases (Eds. M. Last, A. Kandel and H. Bunke)
Vol. 58: Computational Web Intelligence: Intelligent Technology for Web Applications (Eds. Y. Zhang, A. Kandel, T. Y. Lin and Y. Yao)
Vol. 59: Fuzzy Neural Network Theory and Application (P. Liu and H. Li)
Vol. 60: Robust Range Image Registration Using Genetic Algorithms and the Surface Interpenetration Measure (L. Silva, O. R. P. Bellon and K. L. Boyer)
Vol. 61: Decomposition Methodology for Knowledge Discovery and Data Mining: Theory and Applications (O. Maimon and L. Rokach)
Vol. 62: Graph-Theoretic Techniques for Web Content Mining (A. Schenker, H. Bunke, M. Last and A. Kandel)
Vol. 63: Computational Intelligence in Software Quality Assurance (S. Dick and A. Kandel)
Vol. 64: The Dissimilarity Representation for Pattern Recognition: Foundations and Applications (Elżbieta Pękalska and Robert P. W. Duin)
Vol. 65: Fighting Terror in Cyberspace (Eds. M. Last and A. Kandel)
*For the complete list of titles in this series, please write to the Publisher.
THE DISSIMILARITY REPRESENTATION FOR PATTERN RECOGNITION Foundations and Applications
Elżbieta Pękalska
Robert P. W. Duin
Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology Delft, The Netherlands
World Scientific
NEW JERSEY · LONDON · SINGAPORE · BEIJING · SHANGHAI · HONG KONG · TAIPEI · CHENNAI
Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
THE DISSIMILARITY REPRESENTATION FOR PATTERN RECOGNITION Foundations and Applications
Copyright © 2005 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-256-530-2
Printed by FuIsland Offset Printing (S) Pte Ltd, Singapore
To t h e ones who ask questions and look for answers
Preface
Progress has not followed a straight ascending line, but a spiral with rhythms of progress and retrogression, of evolution and dissolution.
JOHANN WOLFGANG VON GOETHE
Pattern recognition is both an art and a science. We are able to see structure and recognize patterns in our daily lives and would like to find out how we do this. We can perceive similarities between objects, between people, between cultures and between events. We are able to observe the world around us, to analyze existing phenomena and to discover new principles behind them by generalizing from a collection of bare facts. We are able to learn new patterns, either by ourselves or with the help of a teacher. If we will ever be able to build a machine that does the same, then we will have made a step towards an understanding of how we do it ourselves. The two tasks, the recognition of known patterns and the learning of new ones, appear to be very similar, but are actually very different. The first one builds on existing knowledge, while the second one relies on observations and the discovery of underlying principles. These two opposites need to be combined, but will remain isolated if they are studied separately. Knowledge is formulated in rules and facts. Usually, knowledge is incomplete and uncertain, and modeling this uncertainty is a challenging task: who knows how certain his knowledge is, and how can we ever relate the uncertainty of two different experts? If we really want to learn something new from observations, then at least we should use our existing knowledge for their analysis and interpretation. However, if this leads to destruction of all inherent organization of and relations within objects themselves, as happens when they are represented by isolated features, then all that is lost by (not incorporated in) the representation has to be learned again from the observations. These two closely related topics, learning new principles from observations
and applying existing knowledge in recognition, appear to be hard to combine if we concentrate on these opposites separately. There is a need for an integrated approach that starts in between. We think that the notion of proximity between objects might be a good candidate. It is an intuitive concept that may be quantified and analyzed by statistical means. It does not a priori tear the object into parts, or more formally, does not require neglecting the inherent object structure. Thereby, it offers experts a possibility to model their knowledge of object descriptions and their relations in a structural way. Proximity may be a natural concept, in which the two opposites, statistical and structural pattern recognition, meet. The statistical approach focuses on measuring characteristic numerical features and representing objects as points, usually in a Euclidean or Hilbert feature space. Objects are different if their point representations lie sufficiently far away from each other in this space, which means that the corresponding Euclidean distance between them is large. The difference between classes of objects is learned by finding a discrimination function in this feature space such that the classes, represented by sets of points, are separated as well as possible. The structural approach is applicable to objects with some identifiable structural organization. Basic descriptors or primitives, encoded as syntactic units, are then used to characterize objects. Classes of objects are either learned by suitable syntactic grammars, or the objects themselves are compared by the cost of some specified match procedure. Such a cost expresses the degree of difference between two objects. One of the basic questions in pattern recognition is how to tell the difference between objects, phenomena or events. Note that only when the difference has been observed and characterized, similarity starts to play a role. It suggests that dissimilarity is more fundamental than similarity. Therefore, we decided to focus more on the concept of dissimilarity. This book is concerned with dissimilarity representations. These are numerical representations in which each value captures the degree of commonality between pairs of objects. Since a dissimilarity measure can be defined on arbitrary data given by collections of sensor measurements, shapes, strings, graphs, or vectors, the dissimilarity representation itself becomes very general. The advantages of statistical and structural approaches can now be integrated on the level of representation. As the goal is to develop and study statistical learning methods for dissimilarity representations, they have to be interpreted in suitable mathematical frameworks. These are various spaces in which discrimination
functions can be defined. Since non-Euclidean dissimilarity measures are used in practical applications, a study beyond the traditional use of Euclidean spaces is necessary. This caused us to search for more general spaces. Our work is founded on both mathematical and experimental research. As a result, a trade-off had to be made to present both theory and practice. We realize that the discussions may be hard to follow due to a variety of issues presented and the necessary brevity of explanations. Although some foundations are laid, the work is not complete, as it requires a lot of research to develop the ideas further. In many situations, we are only able to point to interesting problems or briefly sketch new ideas. We are optimistic that our use of dissimilarities as a starting point in statistical pattern recognition will pave the road for structural approaches to extend object descriptions with statistical learning. Consequently, observations will enrich the knowledge-based model in a very generic way with confidences and natural pattern classifications, which will yield improved recognition. This book may be useful for all researchers and students in pattern recognition, machine learning and related fields who are interested in the foundations and application of object representations that integrate structural expert knowledge with statistical learning procedures. Some understanding of pattern recognition as well as familiarity with probability theory, linear algebra and functional analysis will help one in the journey of finding a good representation. Important facts from algebra, probability theory and statistics are collected in Appendices A-D. The reader may refer to [Fukunaga, 1990; Webb, 1995; Duda et al., 2001] for an introduction to statistical pattern recognition, and to [Bunke and Sanfeliu, 1990; Fu, 1982] for an introduction to structural pattern recognition. More theoretical issues are treated in [Cristianini and Shawe-Taylor, 2000; Devroye et al., 1996; Hastie et al., 2001; Vapnik, 1998], while a practical engineering approach is presented in [Nadler and Smith, 1993; van der Heiden et al., 2004]. Concerning mathematical concepts, some online resources can be used, such as http://www.probability.net/, http://mathworld.wolfram.com/, http://planetmath.org/ and http://en.wikipedia.org/.
Credits. This research was supported by a grant from the Dutch Organization for Scientific Research (NWO). The Pattern Recognition Group of the Faculty of Applied Sciences at Delft University of Technology was a fruitful and inspiring research environment. After reorganization within the university, we could continue our work, again supported by NWO, in the
Information and Communication Theory group of the Faculty of Electrical Engineering, Mathematics and Computer Science. We thank both groups and their leaders, prof. Ian T. Young and prof. Inald Lagendijk, all group members and especially our direct colleagues (in alphabetic order): Artsiom Harol, Piotr Juszczak, Carmen Lai, Thomas Landgrebe, Pavel Paclik, Dick de Ridder, Marina Skurichina, David Tax, Sergey Verzakov and Alexander Ypma for the open and stimulating atmosphere, which contributed to our scientific development. We gained an understanding of the issues presented here based on discussions, exercises in creative thinking and extensive experiments carried out in both groups. We are grateful for all the support. This work was finalized while the first author was a visiting research associate in the Artificial Intelligence group at the University of Manchester. She wishes to thank the group for a friendly welcome. This book is an extended version of the PhD thesis of Elżbieta Pękalska and relies on work published or submitted before. All our co-authors are fully acknowledged. We also thank prof. Anil Jain and Douglas Zongker, prof. Horst Bunke and Simon Günter, Pavel Paclik and Thomas Landgrebe, and Volker Roth for providing some dissimilarity data. All the data sets are described in Appendix E. The experiments were conducted using PRTools [Duin et al., 2004b], DD-tools [Tax, 2003] and our own routines.
To the Reader. Our main motivation is to bring attention to the issue of representation and one of its basic ingredients: dissimilarity. If one wishes to describe classes of objects within our approach, this requires a mental shift from logical and quantitative observations of separate features to an intuitive and possibly qualitative perceiving of the similarity between objects. As the similarity judgement is always placed in some context, it can only be expressed after observing the differences. The moment the (dis)similarity judgements are captured in some values, one may proceed from the whole (dissimilarity) to parts (features, details or numerical descriptions). As a result, decision-theoretic methods can be used for learning. The representation used for the recognition of patterns should enable integration of both qualitative and quantitative approaches in a balanced manner. Only then will the process of learning be enhanced. Let it be so. Dear Reader, be inspired! Wishing You an enjoyable journey,
Elżbieta Pękalska and Robert P.W. Duin, Chester / Delft, June 2005.
Notation and basic terminology
Latin symbols
matrices, vector spaces, sets or random variables
scalars, vectors or object identifiers
vectors in a finite-dimensional vector space
basis vector
estimated mean vectors
Gram matrix or Gram operator
estimated covariance matrix
dissimilarity function, dissimilarity measure
dissimilarity matrix
functions
identity matrix or identity operator
centering matrix
number of clusters
space dimensions
kernel
number of objects or vectors, usually in learning
neighborhood sets
p_i: i-th object in the representation set R
probability function
projection operator, projection matrix or probability
orthogonal matrix
R: representation set, R = {p_1, p_2, ..., p_n}
similarity function, similarity measure
similarity matrix or stress function
t_i: i-th object in the training set T
T: training set, T = {t_1, t_2, ..., t_N}
weight vectors
Greek symbols
scalars or parameters
vectors of parameters
Kronecker delta function or Dirac delta function
evaluation functional
dissimilarity matrix used in multidimensional scaling
trade-off parameter in mathematical programming
field, usually R or C, or a gamma function
regularization parameter
i-th eigenvalue
diagonal matrix of eigenvalues
mean, probability measure or a membership function
mean vector
mappings
covariance matrix
dissimilarity function
set, or a closed and bounded subset of R^m
Other symbols
A: σ-algebra
C: set of complex numbers
C^m: m-dimensional complex vector space
D: domain of a mapping
F: set of features
G, K: Krein spaces
H: Hilbert space
indicator or characteristic function
J: fundamental symmetry operator in Krein spaces
Q: set of rational numbers
R: set of real numbers
R_+: set of real positive numbers
R_+ ∪ {0}
R^m: m-dimensional real vector space
S_r^m: m-dimensional spherical space, S_r^m = {x ∈ R^{m+1}: Σ_{i=1}^{m+1} x_i^2 = r^2}
T: transformation group
U, V, X, Z: subsets, subspaces or random variables
Z: set of integers
Sets and pretopology
A, D, ..., Z: sets
{A_1, A_2, ..., A_n}
cardinality of A
generalized interior of A
generalized closure of A
A ∪ B: set union of A and B
A ∩ B: set intersection of A and B
A \ B: set difference of A and B
A Δ B: set symmetric difference, A Δ B = (A\B) ∪ (B\A)
A × B: Cartesian product, A × B = {(a, b): a ∈ A ∧ b ∈ B}
power set, a collection of all subsets of X
neighborhood system
neighborhood basis
neighborhood, pretopological or topological space defined by the neighborhoods N
(pre)topological space defined by a neighborhood basis
neighborhood, pretopological or topological space defined by the generalized closure
algebraic dual space of a vector space X
continuous dual space of a topological space X
(X, ρ): generalized metric space with a dissimilarity ρ
(X, d): metric space with a metric distance d
B_ε(x): ε-ball in a generalized metric space (X, ρ), B_ε(x) = {y ∈ X: ρ(y, x) < ε}
σ-algebra: collection A of subsets of the set Ω satisfying: (1) Ω ∈ A, (2) A ∈ A ⇒ (Ω\A) ∈ A, (3) (∀k A_k ∈ A ∧ A = ∪_{k=1}^∞ A_k) ⇒ A ∈ A
measure: μ: A → R_+ ∪ {0} is a measure on a σ-algebra A if μ(∅) = 0 and μ is additive, i.e. μ(∪_k A_k) = Σ_k μ(A_k) for pairwise disjoint sets A_k
(Ω, A): measurable space; Ω is a set and A is a σ-algebra
(Ω, A, μ): measure space; μ is a measure
(Ω, A, P): probability space
N(μ, Σ): normal distribution with the mean vector μ and the covariance matrix Σ
P(A): probability of an event A ⊆ Ω
P(A|B): conditional probability of A given that B is observed
L(θ|A): likelihood function; a function of θ with a fixed A, such that L(θ|A) = c P(A|Θ = θ) for c > 0
E[X]: expected value (mean) of a random variable X defined over (Ω, A, P); E[X] = ∫_Ω X dP
V(X): variance of a random variable X, V(X) = E[(X − E[X])^2]
standard deviation of a random variable X, equal to √V(X)
μ_k(X): k-th central moment of a random variable X, μ_k(X) = E[(X − E[X])^k]
cumulative distribution function
probability density function
Mappings and functions
φ: X → Y: φ is a mapping (function) from X to Y; X is the domain of φ and Y is the codomain of φ
R_φ: range of φ: X → Y, R_φ = {y ∈ Y: ∃ x ∈ X: y = φ(x)}
φ ∘ ψ: composition of mappings
injection: φ: X → Y such that (x_1 ≠ x_2) ⇒ (φ(x_1) ≠ φ(x_2)) holds for all x_1, x_2 ∈ X; R_φ ≠ Y
surjection: φ: X → Y, X onto Y, such that R_φ = Y
bijection: injection which is also a surjection
homomorphism: linear mapping from one vector space to another
endomorphism: linear mapping from a vector space to itself
isomorphism: homomorphism which is a bijection
automorphism: endomorphism which is an isomorphism
monomorphism: homomorphism which is an injection
linear form: homomorphism from a vector space X to the field Γ
functional: linear form
im(φ): image of a homomorphism φ: X → Y, equal to R_φ
ker(φ): kernel of a homomorphism φ: X → Y, ker(φ) = {x ∈ X: φ(x) = 0}
concave function: f(αx_1 + (1 − α)x_2) ≥ αf(x_1) + (1 − α)f(x_2) holds for all x_1, x_2 ∈ D_f and all α ∈ [0, 1]
convex function: f is convex iff −f is concave
logistic function: f(x) = 1/(1 + exp(−cx))
logarithmic function: f(x) = log(x); here log denotes the natural logarithm
sigmoid function: f(x) = 2/(1 + exp(−x/σ)) − 1
gamma function: Γ(t) = ∫_0^∞ s^{t−1} e^{−s} ds, t > 0
Vectors and vector spaces
V, W, X, Y, Z: vector spaces
Z = X × Y: Cartesian product of vector spaces
Z = X ⊕ Y: direct sum of vector spaces; each z ∈ Z can be uniquely decomposed into x ∈ X and y ∈ Y such that z = x + y and X ∩ Y = {0}
Z = X ⊗ Y: tensor product of vector spaces; for any vector space U and any bilinear map F: X × Y → U, there exists a bilinear map H: Z → U such that F(x, y) = H(x ⊗ y) for all x ∈ X and y ∈ Y
u, v, x, y, z: vectors in finite-dimensional or separable vector spaces
{x_i}_{i=1}^n: {x_1, x_2, ..., x_n}
0: column vector of zeros
1: column vector of ones
e_i: standard basis vector, with the i-th component equal to 1 and all other components equal to 0
x^T: transpose of a real vector
x^†: conjugate transpose of a complex vector
x^T y: inner product of vectors in R^m
x^† y: inner product of vectors in C^m
X^*: algebraic dual space of a vector space X
X': continuous dual space of a topological space X
L(X, Γ): space of linear functionals from X onto the field Γ, equivalent to the algebraic dual X^*
L(X, Y): space of linear operators from X onto Y
L_c(X, Γ): space of continuous linear functionals from X onto the field Γ, equivalent to the continuous dual X'
L_c(X, Y): space of continuous linear operators from X onto Y
Inner product spaces and normed spaces
Ω: closed and bounded set, Ω ⊂ R^m
set of all functions on Ω
C(Ω): set of all continuous functions on Ω
M(Ω): set of function classes, Lebesgue measurable, on Ω
L_p: L_p = {f ∈ C(Ω): (∫_Ω |f(x)|^p dx)^{1/p} < ∞}, p ≥ 1
L_p^M: L_p^M = {f ∈ M(Ω): (∫_Ω |f(x)|^p μ(dx))^{1/p} < ∞}, p ≥ 1
⟨·,·⟩: inner product
‖·‖: norm
‖x‖_p: ℓ_p-norm of x ∈ R^m, ‖x‖_p = (Σ_{i=1}^m |x_i|^p)^{1/p}, p ≥ 1
‖f‖_p: L_p-norm of f ∈ L_p, ‖f‖_p = (∫_Ω |f(x)|^p dx)^{1/p}, p ≥ 1
(X, ⟨·,·⟩): space X equipped with the inner product ⟨·,·⟩
(X, ‖·‖): space X equipped with the norm ‖·‖
(X, ρ): space X equipped with the dissimilarity ρ
X^⊥: orthogonal complement to X
H: Hilbert space
H_K: reproducing kernel Hilbert space with the kernel K
ℓ_p^m: Banach space (R^m, ‖·‖_p), p ≥ 1
Banach space (R^m, ‖·‖_p), p ≥ 1
Indefinite inner product spaces
K_+, K_−: Hilbert spaces (K_+, ⟨·,·⟩) and (K_−, −⟨·,·⟩)
K: Krein space, K = K_+ ⊕ K_− and K_− = K_+^⊥
|K|: Hilbert space associated with a Krein space K, |K| = K_+ ⊕ |K_−|, where |K_−| denotes K_− equipped with the reversed (positive definite) inner product
pseudo-Euclidean space with the signature (p, q)
⟨·,·⟩_K: inner product in a Krein space K
⟨·,·⟩_E: inner product in a pseudo-Euclidean space E
reproducing kernel Krein space with the kernel K
P_+, P_−: fundamental projections
I: identity operator in a Krein space; I = P_+ + P_−
J: fundamental symmetry in a Krein space; J = P_+ − P_−
fundamental symmetry in R^{(p,q)}
[x, y]: H-scalar product, [x, y] = ⟨Jx, y⟩_K
‖x‖_H: H-norm, ‖x‖_H = [x, x]^{1/2}
Operators in inner product spaces and normed spaces
A = (a_ij): matrix or an operator A with the elements a_ij
i-th row of a matrix A
j-th column of a matrix A
det(A): determinant of a matrix A
A ∗ B: Hadamard product, A ∗ B = (a_ij b_ij)
A^{∗p}: Hadamard power, A^{∗p} = (a_ij^p)
a^{∗B}: Hadamard power, a^{∗B} = (a^{b_ij}), where a ∈ R
A^T: transpose of a real matrix A
A^†: conjugate transpose of a complex matrix A
A^×: adjoint of A in a Hilbert space; A^× = A^T or A^× = A^†
A hermitian: A = A^†
A symmetric: A = A^T
A orthogonal: A A^T = I and A^T A = I
A unitary: A A^† = I and A^† A = I
A cnd: A = A^† is conditionally negative definite if x^† A x ≤ 0 for all x ≠ 0 with x^† 1 = 0
A cpd: A = A^† is conditionally positive definite if x^† A x ≥ 0 for all x ≠ 0 with x^† 1 = 0
A nd: A = A^† is negative definite if x^† A x < 0 for x ≠ 0
A nsd: A = A^† is negative semidefinite if x^† A x ≤ 0 for x ≠ 0
A pd: A = A^† is positive definite if x^† A x > 0 for x ≠ 0
A psd: A = A^† is positive semidefinite if x^† A x ≥ 0 for x ≠ 0
Operators in indefinite inner product spaces
L_c(K, Γ): space of continuous linear functionals from a Krein space K into the field Γ
L_c(K, G): space of continuous linear operators from a Krein space K into a Krein space G
A^*: adjoint of an operator A; A ∈ L_c(K, G) is such that ⟨Af, g⟩_G = ⟨f, A^*g⟩_K holds for all f ∈ K and g ∈ G
A J-self-adjoint: A = A^*
A J-isometric: A ∈ L_c(K, G) is isometric if A^*A = I_K
A J-coisometric: A ∈ L_c(K, G) is coisometric if AA^* = I_G
A J-symmetric: ⟨Af, g⟩_K = ⟨f, Ag⟩_K for all f, g ∈ K
A J-unitary: ⟨Af, Ag⟩_K = ⟨f, g⟩_K for all f, g ∈ K
Dissimilarities
d: dissimilarity measure
D: dissimilarity matrix, D = (d_ij)
D^{∗2}: D^{∗2} = (d_ij^2)
D(T, R): dissimilarity representation
S: similarity matrix, S = (s_ij)
d_2: Euclidean distance
D_E, D_2: Euclidean distance matrix
d_p: ℓ_p-distance
D_p: ℓ_p-distance matrix
d_max: ℓ_∞-distance
ℓ_∞-distance matrix
Hausdorff distance
modified-Hausdorff distance
square Mahalanobis distance
Levenshtein distance, normalized Levenshtein distance
Kullback-Leibler divergence
J-coefficient
information radius divergence
Bhattacharyya distance
Chernoff distance
Hellinger coefficient
Tversky dissimilarity and Tversky similarity
cut semimetric based on the set V
Graphs and geometry
cut on X: partition of a set X into V and X\V
G = (V, E): graph with a set of nodes V and a set of edges E = {(u, v): u, v ∈ V}
adjacent nodes: two nodes in a graph joined by an edge
linear hull: hull_Γ(X) = {Σ_{i=1}^n β_i x_i: x_i ∈ X ∧ β_i ∈ Γ}, Γ ⊆ R
cone: {X: hull_{R_+}(X) = X}
convex hull: {Σ_{i=1}^n β_i x_i: x_i ∈ X, β_i ≥ 0 ∧ Σ_{i=1}^n β_i = 1}
hyperprism: figure generated by a flat region in R^m, moving parallel to itself along a straight line
hypercylinder: hyperprism for which the flat region in R^{m−1} is a hypersphere
hypersphere: {x ∈ R^m: ‖x‖_2^2 = R^2}, with the volume V = 2π^{m/2}R^m / (m Γ(m/2)) and the area A = 2π^{m/2}R^{m−1} / Γ(m/2)
hyperplane: m-dimensional hyperplane, {x ∈ R^{m+1}: Σ_{i=1}^{m+1} w_i x_i = w_0}
parallelotope: collection of points in R^m bounded by m pairs of (m−1)-dimensional hyperplanes (a generalization of a parallelogram)
polyhedron: {x ∈ R^m: Ax ≤ b, A ∈ R^{n×m} ∧ b ∈ R^n}
polyhedral cone: {x ∈ R^m: Ax ≤ 0, A ∈ R^{n×m}}
polytope: collection of points bounded by m-dimensional hyperplanes (a generalization of a triangle in 2D)
simplex: polytope; a collection of points in R^m enclosed by (m+1) (m−1)-dimensional hyperplanes
Abbreviations
iff: if and only if
cnd: conditionally negative definite
cpd: conditionally positive definite
nd: negative definite
nsd: negative semidefinite
pd: positive definite
pdf: probability density function
psd: positive semidefinite
k-CDD: k-Centers Data Description
k-NN: k-Nearest Neighbor rule
NN: Nearest Neighbors
k-NNDD: k-Nearest Neighbor Data Description
AL: Average Linkage
CCA: Curvilinear Component Analysis
CH: Compactness Hypothesis
CL: Complete Linkage
CNN: Condensed Nearest Neighbor
CS: Classical Scaling
CPS: Classifier Projection Space
DS: Dissimilarity Space
GNMC: Generalized Nearest Mean Classifier
GMDD: Generalized Mean Data Description
LLE: Locally Linear Embedding
LogC: Logistic regression linear Classifier
LP: Linear Programming
LPDD: Linear Programming Dissimilarity data Description
LSS: Least Square Scaling
MAP: Maximum A Posteriori
MDS: Multidimensional Scaling
ML: Maximum Likelihood
MST: Minimum Spanning Tree
NLC: Normal density based Linear Classifier
NMC: Nearest Mean Classifier
NQC: Normal density based Quadratic Classifier
NN: Nearest Neighbor rule
PCA: Principal Component Analysis
RKHS: Reproducing Kernel Hilbert Space
RKKS: Reproducing Kernel Krein Space
RNLC: Regularized Normal density based Linear Classifier
RNQC: Regularized Normal density based Quadratic Classifier
QC: Quadratic Classifier
QP: Quadratic Programming
SL: Single Linkage
SOM: Self-Organizing Map
SRQC: Strongly Regularized Quadratic Classifier
SV: Support Vector
SVM: Support Vector Machine
SVDD: Support Vector Data Description
SO: Support Object
WNMC: Weighted Nearest Mean Classifier
Contents

Preface
Notation and basic terminology
Abbreviations

1. Introduction
1.1 Recognizing the pattern
1.2 Dissimilarities for representation
1.3 Learning from examples
1.4 Motivation of the use of dissimilarity representations
1.5 Relation to kernels
1.6 Outline of the book
1.7 In summary

2. Spaces
2.1 Preliminaries
2.2 A brief look at spaces
2.3 Generalized topological spaces
2.4 Generalized metric spaces
2.5 Vector spaces
2.6 Normed and inner product spaces
2.6.1 Reproducing kernel Hilbert spaces
2.7 Indefinite inner product spaces
2.7.1 Reproducing kernel Krein spaces
2.8 Discussion

3. Characterization of dissimilarities
3.1 Embeddings, tree models and transformations
3.1.1 Embeddings
3.1.2 Distorted metric embeddings
3.2 Tree models for dissimilarities
3.3 Useful transformations
3.3.1 Transformations in semimetric spaces
3.3.2 Direct product spaces
3.3.3 Invariance and robustness
3.4 Properties of dissimilarity matrices
3.4.1 Dissimilarity matrices
3.4.2 Square distances and inner products
3.5 Linear embeddings of dissimilarities
3.5.1 Euclidean embedding
3.5.2 Correction of non-Euclidean dissimilarities
3.5.3 Pseudo-Euclidean embedding
3.5.4 Generalized average variance
3.5.5 Projecting new vectors to an embedded space
3.5.6 Reduction of dimension
3.5.7 Reduction of complexity
3.5.8 A general embedding
3.5.9 Spherical embeddings
3.6 Spatial representation of dissimilarities
3.6.1 FastMap
3.6.2 Multidimensional scaling
3.6.3 Reduction of complexity
3.7 Summary

4. Learning approaches
4.1 Traditional learning
4.1.1 Data bias and model bias
4.1.2 Statistical learning
4.1.3 Inductive principles
4.1.3.1 Empirical risk minimization (ERM)
4.1.3.2 Principles based on Occam's razor
4.1.4 Why is the statistical approach not good enough for learning from objects?
4.2 The role of dissimilarity representations
4.2.1 Learned proximity representations
4.2.2 Dissimilarity representations: learning
4.3 Classification in generalized topological spaces
4.4 Classification in dissimilarity spaces
4.4.1 Characterization of dissimilarity spaces
4.4.2 Classifiers
4.5 Classification in pseudo-Euclidean spaces
4.6 On generalized kernels and dissimilarity spaces
4.6.1 Connection between dissimilarity spaces and pseudo-Euclidean spaces
4.7 Discussion

5. Dissimilarity measures
5.1 Measures depending on feature types
5.2 Measures between populations
5.2.1 Normal distributions
5.2.2 Divergence measures
5.2.3 Discrete probability distributions
5.3 Dissimilarity measures between sequences
5.4 Information-theoretic measures
5.5 Dissimilarity measures between sets
5.6 Dissimilarity measures in applications
5.6.1 Invariance and robustness
5.6.2 Example measures
5.7 Discussion and conclusions

6. Visualization
6.1 Multidimensional scaling
6.1.1 First examples
6.1.2 Linear and nonlinear methods: examples
6.1.3 Implementation
6.2 Other mappings
6.3 Examples: getting insight into the data
6.4 Tree models
6.5 Summary

7. Further data exploration
7.1 Clustering
7.1.1 Standard approaches
7.1.2 Clustering on dissimilarity representations
7.1.3 Clustering examples for dissimilarity representations
7.2 Intrinsic dimension
7.3 Sampling density
7.3.1 Proposed criteria
7.3.2 Experiments with the NIST digits
7.4 Summary

8. One-class classifiers
8.1 General issues
8.1.1 Construction of one-class classifiers
8.1.2 One-class classifiers in feature spaces
8.2 Domain descriptors for dissimilarity representations
8.2.1 Neighborhood-based OCCs
8.2.2 Generalized mean class descriptor
8.2.3 Linear programming dissimilarity data description
8.2.4 More issues on class descriptors
8.3 Experiments
8.3.1 Experiment I: Condition monitoring
8.3.2 Experiment II: Diseased mucosa in the oral cavity
8.3.3 Experiment III: Heart disease data
8.4 Conclusions

9. Classification
9.1 Proof of principle
9.1.1 NN rule vs alternative dissimilarity-based classifiers
9.1.2 Experiment I: square dissimilarity representations
9.1.3 Experiment II: the dissimilarity space approach
9.1.4 Discussion
9.2 Selection of the representation set: the dissimilarity space approach
9.2.1 Prototype selection methods
9.2.2 Experimental setup
9.2.3 Results and discussion
9.2.4 Conclusions
9.3 Selection of the representation set: the embedding approach
9.3.1 Prototype selection methods
9.3.2 Experiments and results
9.3.3 Conclusions
9.4 On corrections of dissimilarity measures
9.4.1 Going more Euclidean
9.4.2 Experimental setup
9.4.3 Results and conclusions
9.5 A few remarks on a simulated missing value problem
9.6 Existence of zero-error dissimilarity-based classifiers
9.6.1 Asymptotic separability of classes
9.7 Final discussion

10. Combining
10.1 Combining for one-class classification
10.1.1 Combining strategies
10.1.2 Data and experimental setup
10.1.3 Results and discussion
10.1.4 Summary and conclusions
10.2 Combining for standard two-class classification
10.2.1 Combining strategies
10.2.2 Experiments on the handwritten digit set
10.2.3 Results
10.2.4 Conclusions
10.3 Classifier projection space
10.3.1 Construction and the use of CPS
10.4 Summary

11. Representation review and recommendations
11.1 Representation review
11.1.1 Three generalization ways
11.1.2 Representation formation
11.1.3 Generalization capabilities
11.2 Practical considerations
11.2.1 Clustering
11.2.2 One-class classification
11.2.3 Classification

12. Conclusions and open problems
12.1 Summary and contributions
12.2 Extensions of dissimilarity representations
12.3 Open questions

Appendix A On convex and concave functions

Appendix B Linear algebra in vector spaces
B.1 Some facts on matrices in a Euclidean space
B.2 Some facts on matrices in a pseudo-Euclidean space

Appendix C Measure and probability

Appendix D Statistical sidelines
D.1 Likelihood and parameter estimation
D.2 Expectation-maximization (EM) algorithm
D.3 Model selection
D.4 PCA and probabilistic models
D.4.1 Gaussian model
D.4.2 A Gaussian mixture model
D.4.3 PCA
D.4.4 Probabilistic PCA
D.4.5 A mixture of probabilistic PCA

Appendix E Data sets
E.1 Artificial data sets
E.2 Real-world data sets

Bibliography
Index
Chapter 1
Introduction
Thus all human knowledge begins with intuitions, then goes to concepts, and is completed in ideas.
IMMANUEL KANT
1.1 Recognizing the pattern
We recognize many patterns¹ while observing the world. Even in a country never visited before, we recognize buildings, streets, trees, flowers or animals. There are pattern characteristics, learned before, that can be applied in a new environment. Sometimes, we encounter a place with objects that are alien to us, e.g. a garden with an unknown flower species or a market place with strange types of fish. How do we learn these patterns so that this place will look more familiar on our next visit? If we take the time, we are able to learn some patterns by ourselves. If somebody shows us around, points out and explains what is what, we may learn faster and group the observations according to the underlying concepts. What is the first step in this categorization process? Which principle is used in the observations to constitute the first grouping? Are these descriptive features like color, shape or weight? Or is it our basic perception that some objects are somehow different and others are similar? The ability to observe the differences and the similarities between objects seems to be very basic. Discriminating features can be found once we are familiar with similarities. This book is written from the perspective that the most primary observation we can make when studying a group of objects or phenomena is that some are dissimilar and others are similar. From this starting point, we aim to define a theory for learning and recognizing patterns by automatic means: sensors and computers that try to imitate the human ability
¹ We will use the word 'pattern' exclusively to refer to quantitative/qualitative characteristics between objects. In the literature, however, 'pattern' is also used to refer to a single object for which such characteristics are studied. We will avoid this usage here.
Figure 1.1 Fish contours (shapes A-K).
of pattern recognition. We will develop a framework in which the initial
representation of objects is based on dissimilarities, assuming that a human expert can make explicit how to measure them from sensor data. We will develop techniques to generalize from dissimilarity-based representations of sets of objects to the concepts of the groups or classes that can be distinguished. This is in contrast to the traditional paradigm in automatic pattern recognition that starts from a set of numerical features. As stated above, features are defined after dissimilarities have been observed. In a feature-based approach, more human expertise may be included. Consequently, if this is done properly, a feature description should be preferred. If, however, this expertise is not available, then dissimilarities have to be preferred over arbitrarily selected features. There are already many applied studies in this area based on dissimilarities. They lack a foundation and, consequently, consistent ways for building a generalization. This book will contribute to these two. In the first part, Chapters 2 to 5, concepts and theory are developed for dissimilarity-based pattern recognition. In the second part, Chapters 6 to 10, they are used for analyzing dissimilarity data and for finding and classifying patterns. In this chapter, we will first introduce our concepts in an intuitive way.
1.2 Dissimilarities for representation
Human perception and inference skills allow us to recognize the common characteristics of a collection of objects. It is, however, difficult to formalize such observations. Imagine, for instance, the set of fish shape contours [Fish contours, site] as presented in Fig. 1.1. Is it possible to define a simple rule that divides them into two or three groups? If we look at the contours, we find that some of the fish are rather long without characteristic fins (shapes C, H and I), whereas others have distinctive tails as well as fins, say a group
Figure 1.2 Various dissimilarity measures can be constructed for matching two fish shapes. (a) Fish shapes. (b) Area difference: the area of the non-overlapping parts is computed. To avoid scale dependency, the measured difference can be expressed relative to the sum of the areas of the shapes. (c) Measure by covers: one shape is covered by identical balls (such that the ball centers belong to it), taking care that the other shape is covered as well. The shapes are exchanged and the radius of the minimal ball is the sought distance. In both cases above, B is covered such that either A or H are also covered. (d) Measure between skeletons: two shape skeletons are compared by summing up the differences between corresponding parts, weighting missing correspondences more heavily.
of fin-type fish. Judging shapes F and K in the context of all fish shapes presented here, they could be found similar to other fin-type fish: A, B, D, E, G and J. By visual inspection, they do not really appear to be alike, as they seem to be thinner and somewhat larger. If the examples of C, H and I had been absent, the differences between F and K and other fin-type fish would have been more pronounced. Furthermore, shape A could be considered similar to F and K, but also different due to the position and shape of its tail and fins. This simple example shows that without any extra knowledge or a clear context, one cannot claim that the identification of two groups is better than the identification of three groups. This decision relies on a free interpretation of what makes objects similar to be considered as a group. For the purpose of automatic grouping or identification, it is difficult to determine proper features, i.e. mathematically encoded particular properties of the shapes that would precisely discriminate between different fish
and at the same time emphasize the similarity between resembling examples. An alternative is to compare the shapes by matching them as well as possible and determining the remaining differences. Such a match is found with respect to a specified measure of dissimilarity. This measure should take on small values for objects that are alike and large values for distinct objects. There are many ways of comparing two objects, and hence there are many dissimilarity measures. In general, the suitability of a measure depends on the problem at hand and should rely on additional knowledge one has about this particular problem. Example measures are presented in Fig. 1.2, where two fish shapes are compared. Here, the dissimilarity between two similar fish, A and B, is much smaller than between two different fish, B and H. Which to choose depends on expert knowledge or problem characteristics. If there is no clear preference for one measure over the other, a number of measures can be studied and combined. This may be beneficial, especially when different measures focus on different aspects of patterns.
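As a rough illustration of how such a measure can be implemented, the sketch below encodes the area-difference idea of Fig. 1.2(b) for two binary shape masks that are assumed to be already matched (aligned). The NumPy-based encoding, the function name and the toy 5x5 shapes are our own illustrative assumptions, not part of the original text.

```python
import numpy as np

def area_difference(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Area-difference dissimilarity between two aligned binary shape masks.

    The area of the non-overlapping parts (the symmetric difference) is
    divided by the sum of the two areas, which removes the dependency on
    the absolute scale, as sketched in Fig. 1.2(b). The value is 0 for
    identical masks and approaches 1 for disjoint ones.
    """
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    non_overlap = np.logical_xor(a, b).sum()
    total_area = a.sum() + b.sum()
    return float(non_overlap) / float(total_area)

# Toy example: two small "shapes" on a 5x5 grid.
shape_1 = np.zeros((5, 5), dtype=bool)
shape_1[1:4, 1:4] = True
shape_2 = np.zeros((5, 5), dtype=bool)
shape_2[2:5, 1:4] = True
print(area_difference(shape_1, shape_1))  # 0.0: identical shapes
print(area_difference(shape_1, shape_2))  # about 0.33: partly overlapping shapes
```

Cover- and skeleton-based measures, as in Fig. 1.2(c) and (d), would follow the same pattern: a function mapping a pair of raw object descriptions to a single non-negative value.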
1.3 Learning from examples
The question how to extract essential knowledge and represent it in a formal way such that a machine can 'learn' a concept of a class, identify objects or discriminate between them, has intrigued and provoked many researchers. The growing interest inherently led to the establishment of the areas of pattern recognition, machine learning and artificial intelligence. Researchers in these disciplines try to find ways to mimic the human capacity of using knowledge in an intelligent way. In particular, they try to provide mathematical foundations and develop models and methods that automate the process of recognition by learning from a set of examples. This attempt is inspired by the human ability to recognize, for example, what a tree is, given just a few examples of trees. The idea is that a few examples of objects (and possible relations between them) might be sufficient for extracting suitable knowledge to characterize their class. After years of research, some practical problems can now be successfully treated in industrial processing tasks, such as the automatic recognition of damaged products on a conveyor belt, the speed-up of data-handling procedures, or the automatic identification of persons by fingerprints. The algorithms developed so far are very task specific and, in general, they are
still far from reaching the human recognition performance. Although the models designed are becoming more and more complex, it seems that to take them a step further, one will need to analyze their basic underlying assumptions. An understanding of the recognition process is needed; not only the learning approaches (inductive or deductive principles) must be understood, but mainly the basic notions of class, measurement, process and the representation of objects derived from these. The formalized representation of objects (usually in mathematical terms) and the definition of classes determine how the act of learning should be modeled. While many researchers are concerned with various algorithmic procedures, we would like to focus on the issue of representation. This work is devoted to particular representations, namely dissimilarity representations. Below and in the subsequent sections, we will give some insight into the nature of basic problems in pattern recognition and machine learning and motivate the use of dissimilarity representations. While dealing with entities to be compared, we will always refer to them as objects, elements or instances, regardless of whether they are real or abstract. For instance, images, textures and shapes are called objects in the same way as apples and chairs. An appropriate representation of objects is based on data. These are usually obtained by a measurement device and encoded in a numerical way, or given by a set of observations or dependencies presented in a structural form, e.g. a relational graph. It is assumed that objects can, in general, be grouped together. Our aim then is to identify a number of groups (clusters) whose existence supports an understanding of not only the data, but also the problem itself. Such a process is often used to order information and to find suitable or efficient descriptions of the data. The challenge of automatic object recognition is to develop computer methods which learn to identify whether an object belongs to a specific class or learn to distinguish between a number of classes. Typically, the system is first presented with a set of labeled objects, the training set, in some convenient representation. Learning consists of finding the class descriptions such that the system can correctly classify novel examples. In practice, the entire system is trained such that the given examples are (mostly) assigned to the correct class. The underlying assumption is that the training examples are representative and sufficient for the problem at hand. This implies that the system can extrapolate well to previously unseen examples, that is, it can generalize well. There are two principal directions in pattern recognition, statistical and
Table 1.1 Basic differences between statistical and structural pattern recognition [Nadler and Smith, 1993]. Distances are a common factor used for discrimination in both approaches.

Foundation. Statistical: well-developed mathematical theory of vector spaces. Structural: intuitively appealing: human cognition or perception.
Approach. Statistical: quantitative. Structural: qualitative: structural/syntactic.
Descriptors. Statistical: numerical features: vectors of a fixed length. Structural: morphological primitives of a variable size.
Syntax. Statistical: element position in a vector. Structural: encoding process of primitives.
Noise. Statistical: easily encoded. Structural: needs regular structures.
Learning. Statistical: vector-based methods. Structural: graphs, decision trees, grammars.
Dissimilarity. Statistical: metric, often Euclidean. Structural: defined in a matching process.
Discrimination. Statistical: relies on distances or inner products in a vector space. Structural: grammars recognize valid objects; distances often used.
Class overlap. Statistical: due to improper features and probabilistic models. Structural: due to improper primitives leading to ambiguity in the description.
structural (or syntactic) pattern recognition [Jain et al., 2000; Nadler and Smith, 1993; Bunke and Sanfeliu, 1990]. The basic differences are summarized in Table 1.1. Both approaches use features to describe objects, but these features are defined differently. In general, features are functions of (possibly preprocessed) measurements performed on objects, e.g. particular groups of bits in a binary image summarizing it in a discriminative way. The statistical, decision-theoretical approach is (usually) metric and quantitative, while the structural approach is qualitative [Bunke and Sanfeliu, 1990; Nadler and Smith, 1993]. This means that in the statistical approach, features are encoded as purely numerical variables. Together, they constitute a feature vector space, usually Euclidean, in which each object is represented as a point² of feature values. Learning is then inherently restricted to the mathematical methods that one can apply in a vector space, equipped with additional algebraic structures of an inner product, norm and the distance. In contrast, the structural approach tries to describe the structure of objects that intuitively reflects the human perception [Edelman et al., 1998; Edelman, 1999]. The features become primitives (subpatterns), fundamental structural elements, like strokes, corners or other morphological elements.
² In this book, the words 'points' and 'vectors' are used interchangeably. In the rigorous mathematical sense, points and vectors are not the same, as points are defined by fixed sets of coordinates in a vector space, while vectors are defined by differences between points. In statistical pattern recognition, objects are represented as points in a vector space, but for the sake of convenience, they are also treated as vectors, as only then do they define the operations of vector addition and multiplication by a scalar.
[Figure 1.3: block diagram of a general pattern recognition system, with the stages Objects, Measurements, Segmentation, Representation, Generalization/inference and Characterization/Decision.]
Figure 1.3 Components of a general pattern recognition system. A representation is either a numerical description of objects and/or their relations (statistical pattern recognition) or their syntactical encoding by a set of primitives together with a set of operations on objects (structural pattern recognition). Adaptation relies on a suitable change (simplification or enrichment) of a representation, e.g. by a reduction of the number of features, relations or primitives describing objects, or some nonlinear transformation of the features, to enhance the class or cluster descriptions. Generalization is a process of determining a statistical function which finds clusters, builds a class descriptor or constructs a classifier (decision function). Inference describes the process of a syntax analysis, resulting in a (stochastic) grammar. Characterization reflects the final decision (class label) or the data description (determined clusters). Arrows illustrate that the building of the complete system may not be sequential.
8
T h e dissimdarity representation f o r p a t t e r n recognition
essential information ( e g . to segment an object from the image background arid identify a number of characteristic subpatterns), leading to some nunierical or structural representation. Such a representation has evolved from an initial description, derived from the original measurements. Usually, it is not directly the most appropriate one for realizing the task, such as identification or classification. It may be adapted by suitable transformations, e.g. a (nonlinear) rescaling of numerical features or an extension and redefinition of primitives. Then, in the generalization/inference stage, a classifier/identifier is trained, or a grammar3 is determined. These processes should include a careful treatment of unbalanced classes, non-representative data, handling of missing values, a rejection option, combining of inforrnatiori and combiriing of classifiers and a final evaluation. In the last stage, a class is assigned or the data. are characterized ( e g . in terms of clusters and their relations). The design of a complete pattern recognition system may require repetition of some stages to find a satisfactory trade-off between the final recognition accuracy or data description and the computational and storage resources required. Although this research is grounded in statistical pattern recognition, we recognize the necessity of combining numerical and structural information. Dissimilarity measures as the common factor used for discrimination, Table 1.1, seems to be the natural bridge between these two types of information. The integration is realized by a representation. A general discussion on the issue of representation can be found in [Duin et al., 2004al.
1.4 Motivation of the use of dissimilarity representations
The notion of similarity plays a pivotal role in class formation, since it might be seen as a natural link between observations on objects on the one hand and a judgment on their shared properties on the other. In essence, similar objects can be grouped together to form a class, and consequently a class is a set of similar objects. However, there is no such thing as a general object similarity that can be universally measured or applied. A comparison of two objects is always with respect to a frame of reference, i.e. a particular point of view, a context, basic characteristics, a type of domain, or attributes considered (see also Fig. 1.1). This means that background information, or
[Figure 1.4 contrasts two routes from measurements or an intermediate representation: the feature-based route (define a set of features; represent objects as points in a feature vector space; impose the geometry, e.g. of the Euclidean distance between the points) and the dissimilarity-based route (define a dissimilarity measure; interpret the dissimilarities in a suitable space to reflect the distance geometry).]
Figure 1.4 The difference with respect to the geometry between the traditional feature-based (absolute) representations and dissimilarity-based (relative) representations.
the existence of other classes, will influence the way objects are compared. For instance, two brothers may not appear to resemble each other. However, they may appear much more alike if compared in the presence of their parents. The degree of similarity between two objects should be determined relative to a given context or a procedure. Any measurement of similarity of objects will be based on certain assumptions concerning the properties of their relation. Such assumptions come from some model. Similarity can be modeled by a measure of similarity or dissimilarity. These are intimately connected; a small dissimilarity and a large similarity both imply a close resemblance of objects. There exist ways of changing a similarity value into a dissimilarity value and vice versa, but the interpretation of the measure might be affected. In this work, we mostly concentrate on dissimilarities, which, by their construction, focus on the class and object differences. The choice for dissimilarities is supported by the fact that they can be interpreted as distances in suitable vector spaces, and in many cases, they may be more intuitively appealing. In statistical pattern recognition, objects are usually encoded by feature values. A feature is a conjunction of measured values for a particular attribute. For instance, if weight is an attribute for the class of apples, then a feature consists of the measured weights for a number of apples. For a set T of N objects, a feature-based representation relying on a set F of m features is then encoded as an N × m matrix A(T, F), where each
row is a vector describing the feature values for a particular object. Features F are usually interpreted in a Euclidean vector space equipped with the Euclidean metric. This is motivated by the algebraic structure (defined by operations on vectors) being consistent with the geometric (topological) structure defined by the Euclidean distance (which is then defined by the norm). Then all traditional mathematical concepts and methods, such as continuity, convergence or differentiation, are applicable. The continuity of algebraic operations ensures that the local geometry (defined by the Euclidean distance) is preserved throughout the space [Munkres, 2000; Kothe, 1969]. Discrimination techniques operating in vector spaces make use of their homogeneity and other properties. Consequently, such spaces require that, up to scaling, all the features are treated in the same way. Moreover, there is no possibility to relate the learning to the geometry defined between the raw representations of the training examples. The geometry is simply imposed beforehand by the nature of the Euclidean distance between (reduced) descriptions of objects, i.e. between vectors in a Euclidean space; see also Fig. 1.4. The existence of a well-established theory for Euclidean metric spaces made researchers place the learning paradigm in that context. However, the severe restrictions of such spaces simply do not allow discovery of structures richer than affine subspaces. From this point of view, the act of learning is very limited. We argue here that the notion of proximity (similarity or dissimilarity) is more fundamental than that of a feature or a class. According to an intuitive definition of a class as a set of similar objects, proximity plays a crucial role for its constitution, and not features, which may (or may not) come later. From this point of view, features might be a superfluous step in the description of a class. Surely, proximity can be specified by features, such as their weighted linear combination, but the features should be meaningful with respect to the proximity. In other words, the chosen combination of features should reflect the (natural) proximity between the objects. On the other hand, proximity can be directly derived from raw or pre-processed measurements like images or spectra. Moreover, in the case of symbolic objects, graphs or grammars, the determination of numerical features might be an intractable problem, while proximity may be easier to define. This emphasizes that a class of objects is represented by individual examples which are judged to be similar according to a specified measure. A dissimilarity representation of objects is then based on pairwise comparisons and is expressed e.g. as an N × N dissimilarity matrix D(T, T). Each entry of D is a dissimilarity value computed between pairs of objects; see
Figure 1.5 Feature-based (absolute) representation vs. dissimilarity-based (relative) representation. In the former description, objects are represented as points in a feature vector space, while in the latter description, objects are represented by a set of dissimilarity values.
also Fig. 1.5. Hence, each object x is represented by a vector of proximities
D(x, T) to the objects of T (precise definitions will be given in Chapter 4). For a number of years, Goldfarb and colleagues have been trying to establish a new mathematical formalism allowing one to describe objects from a metaphysical point of view, that is, to learn their structure and characteristics from the process of their construction. This aims at unifying the geometric learning models (statistical approach with the geometry imposed by a feature space) and symbolic ones (structural approach) using dissimilarity as a natural bridge. A dissimilarity measure is determined in a process of inductive learning realized by so-called evolving transformation systems [Goldfarb, 1990; Goldfarb and Deshpande, 1997; Goldfarb and Golubitsky, 2001]. Loosely speaking, such a system is composed of a set of primitive structures, basic operations that transform one object
into another (or which generate a particular object) and some composition rules which permit the construction of new operations from existing ones [Goldfarb et al., 1995, 1992, 2004; Goldfarb and Deshpande, 1997; Goldfarb and Golubitsky, 2001]. This is the symbolic component of the integrated model. The geometric component is defined by means of a dissimilarity. Since there is a cost associated with each operation, the dissimilarity is determined by the minimal sum of the costs of operations transforming one object into another (or generating this particular object). In this sense, the operations play the role of features, and the dissimilarity, dynamically learned in the training process, combines the objects into a class. In this book, the study of dissimilarity representations has mainly an epistemological character. It focuses on how we decide (how we make a model to decide) that an entity belongs to a particular class. Since such a decision builds on the dissimilarities, we come closer to the nature of what a class is, as we believe that it is proximity which defines the class. This approach is much more flexible than the one based on features, since now the geometry and the structure of a class are defined by the dissimilarity measure, which can reflect the structure of the objects in some space. Note that the reverse holds in a feature space, that is, a feature space determines the (Euclidean) distance measure, and hence the geometry; see also Fig. 1.4. Although dissimilarity information is further treated in a numerical way, the development of statistical methods dealing with general dissimilarities is the first necessary step towards a unified learning model, as the dissimilarity measure may be developed in a structural approach. Notwithstanding the fact that an integrated model may be constructed for objects containing an inherent, identifiable structure or organization, like apples, shapes, spectra, text excerpts etc., current research is far from being generally applicable [Korkin and Goldfarb, 2002; Goldfarb and Golubitsky, 2001; Goldfarb et al., 2000b, 2004]. On the other hand, there are a number of instances or events which are mainly characterized by discontinuous numerical or categorical information, e.g. gender or number of children. Therefore, we may have to consider heterogeneous types of information to support decisions in medicine, finance, etc. In such cases, the symbolic learning model cannot be directly utilized, but a dissimilarity can be defined. This emphasizes the importance of techniques operating on general dissimilarities. The study of proximity representations is the necessary starting point from which to depart on a journey into alternative inductive learning methodologies. These will learn the proximity measure, and hence a class description, from examples.
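The cost-based view of dissimilarity sketched above can be illustrated with the classical edit (Levenshtein) distance between strings, where the dissimilarity is the minimal total cost of insertions, deletions and substitutions. The unit costs below are an illustrative assumption, not the learned costs of an evolving transformation system.

```python
# Illustrative sketch: dissimilarity as the minimal total cost of operations
# transforming one object (string) into another. Unit costs are assumed here;
# in Goldfarb's setting the costs would be learned rather than fixed.
def edit_distance(a, b, ins=1, dele=1, sub=1):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # delete a[i-1]
                          d[i][j - 1] + ins,       # insert b[j-1]
                          d[i - 1][j - 1] + cost)  # substitute / match
    return d[m][n]

print(edit_distance("spectrum", "structure"))
```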
1.5 Relation to kernels
Kernel methods have become popular in statistical learning [Cristianini and Shawe-Taylor, 2000; Scholkopf and Smola, 2002]. Kernels are (conditionally) positive definite (cpd) functions of two variables, which can be thought to encode similarities between pairs of objects. They are originally defined in vector spaces, e.g. based on a feature representation of objects, and interpreted as generalized inner products in a reproducing kernel Hilbert space (RKHS). They offer a way to construct non-linear decision functions. In 1995, Vapnik proposed an elegant formulation of the largest margin classifier [Vapnik, 1998]. This support vector machine (SVM) is based on the reproducing property of kernels. Since then, many variants of the SVM have been applied to a wide range of learning problems. Before the start of our research project [Duin et al., 1997, 1998, 1999] it was already recognized that the class of cpd functions is restricted. It does not accommodate a number of useful proximity measures already developed in pattern recognition and computer vision. Many existing similarity measures are not positive definite and many existing dissimilarity measures are not Euclidean⁴ or even not metric. Examples are pairwise structural alignments of proteins, variants of the Hausdorff distance, and normalized edit-distances; see Chapter 5. The major limitation of using such kernels is that the original formulation of the SVM relies on a quadratic optimization. This problem is guaranteed to be convex for cpd kernels, and therefore uniquely solvable by standard algorithms. Kernel matrices disobeying these requirements are usually somehow regularized, e.g. by adding a suitable constant to their diagonal. Whether this is a beneficial strategy is an open question. Although our research was inspired by the concept of a kernel, the line we followed heavily deviates from the usage of kernels in machine learning [Shawe-Taylor and Cristianini, 2004]. This is caused by the pattern-recognition background of the problems we aim to solve. Our starting point is a given set of dissimilarities, observed or determined during the development of a pattern recognition system. It is defined by a human expert and his/her insight into the problem. This set is, thereby, an alternative to the definition of features (which also have to originate from such expertise). A given Euclidean distance matrix may be transformed into a kernel and interpreted as a generalized Gram matrix in a proper Hilbert space.
⁴ The dissimilarity measure being Euclidean is inherently related to the corresponding kernel being positive definite; this is explained in Chapter 3.
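As a small numerical illustration of the diagonal regularization mentioned above, the sketch below shifts the spectrum of a symmetric similarity matrix so that it becomes positive semi-definite. The example matrix is an assumption made for illustration only; it is not taken from the measures discussed in this book.

```python
# Sketch: make an indefinite (non-cpd) symmetric similarity matrix positive
# semi-definite by adding a constant to its diagonal, as mentioned in the text.
import numpy as np

S = np.array([[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.8],
              [0.2, 0.8, 1.0]])   # assumed symmetric similarities

eigvals = np.linalg.eigvalsh(S)
print("smallest eigenvalue:", eigvals.min())

if eigvals.min() < 0:
    S_reg = S + (-eigvals.min()) * np.eye(len(S))  # shift the spectrum upward
else:
    S_reg = S
print("regularized smallest eigenvalue:", np.linalg.eigvalsh(S_reg).min())
```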
[Figure 1.6 presents the book as a flow of chapters; recoverable box titles include the characterization of dissimilarity matrices (Chapter 3), learning aspects, a representation review, and conclusions / open problems (Chapter 12).]
Figure 1.6 Conceptual outline of the book
However, many general dissimilarity measures used in pattern recognition give rise to indefinite kernels, which have only recently become of interest [Haasdonk, 2005; Laub and Müller, 2004; Ong et al., 2004], although we had already identified their importance before [Pekalska et al., 2002b]. How to handle these is an important issue in this book.
1.6 Outline of the book
Dissimilarities play a key role in the quest for an integrated statistical-structural learning model, since they are a natural bridge between these two approaches, as explained in the previous sections. This is supported by the theory that (dis)similarity can be considered as a link between perception and higher-level knowledge, a crucial factor in the process of human recognition and categorization [Goldstone, 1999; Edelman et al., 1998; Wharton et al., 1992]. Throughout this book, the investigations are dedicated to dissimilarity (or similarity) representations. The goal is to study both methodology and approaches to learning from such representations. An outline of the book is presented in Fig. 1.6.
The concept of a vector space is fundamental to dissimilarity representations. The dissimilarity value captures the notion of closeness between two objects, which can be interpreted as a distance in a suitable space, or which can be used to build other spaces. Chapter 2 focuses on mathematical characteristics of various spaces, among others (generalized) metric spaces, normed spaces and inner product spaces. These spaces will later become the context in which the dissimilarities are interpreted and learning algorithms are designed. Familiarity with such spaces, their properties and their interrelations is needed for further understanding of learning processes. Chapter 3 discusses fundamental issues of dissimilarity measures and generalized metric spaces. Since a metric distance, particularly the Euclidean distance, is mainly used in statistical learning, its special role is explained and related theorems are given. The properties of dissimilarity matrices are studied, together with some embeddings, i.e. spatial representations (vectors in a vector space found such that the dissimilarities are preserved) of symmetric dissimilarity matrices. This supports the analysis of pairwise dissimilarity data D(T, T) based on a set of examples T. Chapter 4 starts with a brief introduction to traditional statistical learning, followed by a more detailed description of dissimilarity representations. Three different approaches to building classifiers for such representations are considered. The first one uses dissimilarity values directly by interpreting them as neighborhood relations. The second one interprets them in a space where each dimension is a dissimilarity to a particular object. Finally, the third approach relies on a distance-preserving embedding to a vector space, in which classifiers are built. In Chapter 5, various types of similarity and dissimilarity measures are described, together with their basic properties. The chapter ends with a brief overview of dissimilarity measures arising from various applications. Chapters 6 and 7 start from fundamental questions related to exploratory data analysis on dissimilarity data. Data visualization is one of the most basic ways to get insight into relations between data instances. This is discussed in Chapter 6. Other issues related to data exploration and understanding are presented in Chapter 7. They focus on methods of unsupervised learning by reflecting upon the intrinsic dimension of the dissimilarity data, the complexity of the description and the data structure in terms of clusters. A possible approach to outlier detection is analyzed in Chapter 8 by constructing one-class classifiers. These methods are designed to solve problems where mainly one of the classes, called the target class, is present.
Objects of the other, outlier, class occur rarely, cannot be well sampled, e.g. due to the measurement costs, or are untrustworthy. We introduce the problem and study a few one-class classifier methods built on dissimilarity representations. Chapter 9 deals with classification. It practically examines three approaches to learning. For recognition, a so-called representation set is used instead of a complete training set. This chapter explains how to select such a set out of a training set and discusses the advantages and drawbacks of the studied techniques. Chapter 10 investigates combining approaches. These either combine different dissimilarity representations or different types of classifiers. Additionally, it briefly discusses issues concerning meta-learning, i.e. conceptual dissimilarity representations resulting from combining classifiers, one-class classifiers or weak models in general. Chapter 11 discusses the issue of representation in pattern recognition and provides practical recommendations for the use of dissimilarity representations. Overall conclusions are given in Chapter 12. Appendices A-D provide additional information on algebra, probability and statistics. Appendix E describes the data sets used in the experiments.
1.7 In summary
Dissimilarity representations are advantageous for identification and recognition, especially in the following cases:
- sensory data, such as spectra, digital or hyperspectral images,
- data represented by histograms, contours or shapes,
- phenomena that can be described by probability density functions,
- binary files,
- text-related problems,
- when objects are encoded in a structural way by trees, graphs or strings,
- when objects are represented as vectors in a high-dimensional space,
- when the features describing objects are of mixed types,
- as a way of constructing nonlinear classifiers in given vector spaces.
Mathematical foundations for dissimilarity representations rely on:
(1) topology and general topology [Sierpinski, 1952; Cech, 1966; Kothe, 1969; Willard, 1970; Munkres, 2000; Stadler et al., 2001; Stadler and Stadler, 2001b],
(2) linear algebra [Greub, 1975; Bialynicki-Birula, 1976; Noble and Daniel, 1988; Leon, 1998; Lang, 2004],
(3) operator theory [Dunford and Schwarz, 1958; Sadovnichij, 1991],
(4) functional analysis [Kreyszig, 1978; Kurcyusz, 1982; Conway, 1990; Rudin, 1986, 1991],
(5) indefinite inner product spaces [Bognár, 1974; Alpay et al., 1997; Iohvidov et al., 1982; Dritschel and Rovnyak, 1996; Constantinescu and Gheondea, 2001],
(6) probability theory [Feller, 1968, 1971; Billingsley, 1995; Chung, 2001],
(7) statistical pattern recognition [Devijver and Kittler, 1982; Fukunaga, 1990; Webb, 1995; Devroye et al., 1996; Duda et al., 2001],
(8) statistical learning [Vapnik, 1998; Cherkassky and Mulier, 1998; Hastie et al., 2001],
(9) the work of Scholkopf and colleagues [Scholkopf, 1997, 2000; Scholkopf et al., 1999b, 1997a, 1999a, 2000b],
(10) the results of Goldfarb [Goldfarb, 1984, 1985, 1992],
and inspiration from many other researchers.
We will present a systematic approach to study dissimilarity representations and discuss some novel procedures for learning. These are inevitably compared to the nearest neighbor rule (NN) [Cover and Hart, 1967], the method traditionally applied in this context. Although many researchers have thoroughly studied the NN method and its variants, together with the design of perfect dissimilarity measures (appropriate to the character of the NN rule), to our knowledge little attention has been dedicated to alternative approaches. An exception is the family of support vector machines. These rely on a relatively narrow class of (conditionally) positive definite kernels, which, in turn, are special cases of similarity representations [Duin et al., 1997, 1998]. Only recently has interest arisen in indefinite kernels [Haasdonk, 2005; Laub and Müller, 2004; Ong et al., 2004]. The methods presented here are applicable to general (dis)similarity representations, and this is where our main contribution lies. A more detailed description of the overall contributions is presented below.
Representation of objects. A proximity representation quantitatively encodes the proximity between pairs of objects. It relies on the
representation set, R, a relatively small collection of objects capturing the variability in the data. Each object is described by a vector of proximities to R. In the beginning, the representation set may consist of all training examples, as it is reduced later in the process of instance selection. Here, a number of selection criteria are proposed and experimentally investigated for different learning frameworks. In this way, we extend the notion of a kernel to that of a proximity representation. If R is chosen to be the set of training examples, then this representation becomes a generalized kernel. When a suitable similarity measure is selected, a cpd kernel is obtained as a special case. Using a proximity representation, learning can be addressed in a more general way than by using the support vector machine. As such, we develop proximity representations as a first step towards bridging the statistical and structural approaches to pattern recognition. They are successfully used for solving object recognition problems.
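The following minimal sketch illustrates the idea of a dissimilarity representation D(T, R) with a small representation set R, and of training an ordinary vector-space classifier on it. The Euclidean dissimilarity, the toy data, the naive choice of R and the use of scikit-learn are illustrative assumptions, not the specific measures or selection criteria studied in this book.

```python
# Sketch: build a dissimilarity representation D(T, R) and learn in the
# resulting dissimilarity space. Measure, data and classifier are assumptions.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
T = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(2, 1, (30, 5))])  # training objects
y = np.array([0] * 30 + [1] * 30)

R = T[::10]                              # a small representation set (every 10th object)
D_TR = cdist(T, R, metric='euclidean')   # dissimilarity representation D(T, R)

clf = LogisticRegression().fit(D_TR, y)  # any vector-space classifier will do

x_new = rng.normal(1, 1, (1, 5))         # a new object
d_new = cdist(x_new, R, metric='euclidean')
print(clf.predict(d_new))
```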
Data understanding. Understanding data is a difficult task. The main consideration is whether the data sampling is sufficient to describe the problem domain well. Other important questions refer to the intrinsic dimension, the data structure, e.g. in terms of possible clusters, and the means of data visualization. Since there exist many algorithms for unsupervised learning, our primary interest lies in the former questions. In this book, three distinct approaches to learning from dissimilarity representations are proposed. The first one addresses the given dissimilarities directly. The second addresses a dissimilarity representation as a mapping based on the representation set R. As a result, the so-called dissimilarity space is considered, where each dimension corresponds to a dissimilarity to a particular object from R. The third one relies on an approximate embedding of dissimilarities into a (pseudo-)Euclidean space. The approaches are introduced, studied and applied in various situations.
Domain description. The problem of describing a class has gained a lot of attention, since it can be identified in many applications. The area of interest covers all problems where specified targets have to be recognized and anomalies or outlier situations have to be detected. These might be examples of any type of fault detection, abnormal behavior, or rare diseases. The basic assumption that an object belongs to a class is based on the idea that it is similar to other examples within this class. The identification procedure can be realized by a proximity function equipped with a threshold, determining whether or not an instance is a class member. This proximity function can be e.g. a distance to a set of selected prototypes.
Therefore, the data represented by proximities is more natural for building concept descriptors, since the proximity function can directly be built on these proximities. To study this problem, we have not only adopted known algorithms for dissimilarity representations, but have also implemented and investigated new methods. Both in terms of efficiency and performance, our methods were found to perform well.
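A minimal sketch of the proximity-with-threshold idea described above is given below: an object is accepted as a target if its dissimilarity to the nearest prototype is small enough. The prototype choice, the Euclidean measure and the percentile rule for the threshold are assumptions made purely for illustration.

```python
# Sketch of a proximity-based one-class (target) classifier: accept an object
# if its dissimilarity to the nearest target prototype is below a threshold.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
targets = rng.normal(0, 1, (100, 3))          # training objects of the target class
prototypes = targets[:10]                     # a few selected prototypes (assumed choice)

d_train = cdist(targets, prototypes).min(axis=1)
threshold = np.percentile(d_train, 95)        # accept roughly 95% of the target class

def is_target(x):
    return cdist(x[None, :], prototypes).min() <= threshold

print(is_target(np.zeros(3)), is_target(np.full(3, 6.0)))  # likely True, False
```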
Classification. We propose new methodologies to deal with dissimilarity/similarity data. These rely either on an approximate embedding in a pseudo-Euclidean space and construction of the classifiers there, or on building the decision rules in a dissimilarity space, or on designing neighborhood-based classifiers, e.g. the NN rule. In all cases, foundations are established that allow us to handle general dissimilarity measures. Our methods do not require metric constraints, so their applicability is quite universal.
Combining. The possibility to combine various types of information has proved to be useful in practical applications; see e.g. [MCS00, 2000; MCS02, 2002]. We argue that combining either significantly different dissimilarity representations or classifiers different in nature on the same representation can be beneficial for learning. This may be useful when there is a lack of expertise on how a well-discriminating dissimilarity measure should be designed. A few measures can be considered, taking into account different characteristics of the data. For instance, when scanned digits should be compared, one measure focuses on the contour information, while others on the area or on statistical properties.
Applications. The proximity measure plays an important role in many research problems. Proximity representations are widely used in many areas, although often indirectly. They are used for text or image retrieval, data visualization, the process of learning from partially labeled sets, etc. A number of applications are discussed where such measures are found to be advantageous.
In essence. The study of dissimilarity representations applies to all dissimilarities, independently of the way they have been derived, e.g. from raw data or from an initial representation by features, strings or graphs. Expert knowledge on the application can be used to formulate this initial representation and in the definition of the proximity measure. This makes the dissimilarity representations developed natural candidates for combining
the strengths of structural and statistical approaches in pattern recognition and machine learning. The advantage of the structural approach lies in encoding both domain knowledge and the structure of an object. The benefit of the statistical approach lies in a well-developed mathematical theory of vector spaces. First, a description of objects in the structural framework can be found. This can then be quantized to capture the dissimilarity relations between the objects. If necessary, other structurally and statistically derived measures can be designed and combined. The final dissimilarity representation is then used in statistical learning. The results in this work justify the use and further exploration of dissimilarity information for pattern recognition and machine learning.
PART 1
Concepts and theory
Budowałem na piasku
I zawaliło się.
Budowałem na skale
I zawaliło się.
Teraz budując
Zacznę od dymu z komina.
I built on the sand
And it tumbled down.
I built on a rock
And it tumbled down.
Now when I build,
I shall begin
With the smoke from the chimney.
"PODWALINY", LEOPOLD STAFF
"FOUNDATIONS", LEOPOLD STAFF
Chapter 2
Spaces
Ring the bells that still can ring
Forget your perfect offering
There is a crack in everything
That's how the light gets in.
"ANTHEM", LEONARD COHEN
Many dissimilarity measures have been designed and are used in various ways in pattern recognition, machine learning, computer vision and related fields. What is missing, however, is a general and unified framework for learning from examples that are represented by their dissimilarities to a set of representation objects. Different aspects of the measures, such as their Euclidean behavior, metric properties or asymmetry, may lead to different learning approaches. In the statistical approach to pattern recognition, objects are represented as points in a vector space, equipped with an additional algebraic structure of an inner product and the associated norm. This is usually a Euclidean or a Hilbert space. The distance between the points is then naturally measured by the Euclidean or Hilbert distance. If beneficial, other metric distances may be introduced, usually from the family of the lp- or Lp-distances. Classifiers are functions defined by finite vector representations in this vector space. Usually, they are designed based on an assumed model, applied probabilistic reasoning or the use of pairwise distances. The question we begin with is more difficult. How can a learning task be performed given a set of pairwise dissimilarities? Dissimilarities are measured according to a specified dissimilarity measure, which is not necessarily a metric and not necessarily a measure in the strict mathematical sense. It quantifies the similarity or commonality between two objects by taking small values for two similar objects and large values for two distinct objects. Additionally, when possible, sensor measurements or another intermediate description of the set of examples may be given. The challenge is
to discover the structure in the data, identify objects of a particular class or learn to distinguish among the classes, knowing the procedure according to which the dissimilarity is computed and the dissimilarity values between a set of (training) examples. As no vectorial representation of the objects is provided, the challenge is now to use the dissimilarities in a meaningful way. To make use of statistical learning, we must find an appropriate framework for the interpretation of dissimilarities. The concept of a (vector) space is important for the development of a theoretical foundation, both from the representational and algorithmic point of view, since we will rely on numerical procedures and deal with numerical representations of the problems. Dissimilarities quantitatively express the differences between pairs of objects, while learning algorithms usually optimize some error or loss function for a chosen numerical model. Dissimilarities, therefore, have a particular meaning within a frame of specified assumptions and models. Spaces possessing different characteristics will allow different interpretations of the dissimilarity data, which will lead to different learning algorithms. Therefore, before discussing dissimilarity representations and learning methods, we need essential concepts and properties of various spaces. This chapter is motivated by the lack of a consistent and clearly identifiable mathematical theory on general dissimilarity measures, not only in the pattern recognition field, but also in mathematics. In its foundations, such a theory should rely on the notion of nearness between two objects. Therefore, the theory of spaces plays a key role, since such a nearness can easily be introduced here. Most of the existing theories deal with norms, which are often used to define metrics. Usually, Euclidean, city block or max-norm distances are considered. Other interesting contributions can be found in various subfields of mathematics, such as non-Euclidean geometries [Blumenthal, 1953; Coxeter, 1998], differential geometry [Kreyszig, 1991; Struik, 1988], algebras [Paulsen, 2002] and operator spaces [Effros and Ruan, 2000; Pisier, 2003]. Additional inspiration can be found in the fields of experimental psychology and artificial intelligence. These, however, remain of interest for future study. To our knowledge, no book yet exists that explains the theoretical background of general dissimilarity measures and studies learning problems from such a perspective (although a general study on pattern theory in this direction by Grenander is available [Grenander, 1976, 1978, 1981]). Therefore, this chapter is meant to fill this gap. It not only introduces spaces with their basic properties, but it also shows the relations between them.
Consequently, the concepts are presented from a mathematical point of view and supported, if possible, by examples from pattern recognition. The purpose of this chapter is to bring together and present a basic theory of spaces in the context of general dissimilarities, both metric and non-metric. The spaces described here will serve as interpretation frameworks for dissimilarity data. The connections will become clear in Chapters 3 and 4. We will start by recalling basic notions from set theory.
2.1 Preliminaries
Throughout this book, a set X is a collection of objects of any kind, both real and abstract, such as real-world objects, digital images, binary strings, points in a plane, real numbers or functions. These are called elements of X. In some cases, a set can be determined by means of a property of its elements, such as the set of convex pentagons, non-decreasing functions or scanned handwritten digits of '1'. The set of natural numbers is N. The sets of real, positive real and nonnegative real numbers are denoted by R, R+ and R0+, respectively. The set of complex numbers is denoted by C. If X and Y are two sets, then X ∪ Y is their union, X ∩ Y is their intersection, X\Y is their difference and X Δ Y = (X\Y) ∪ (Y\X) is their symmetric difference. X × Y = {(x, y) : x ∈ X ∧ y ∈ Y} denotes a Cartesian product. P(X) is a power set, which is the collection of all subsets of X. An index set I defines a correspondence between i ∈ I and either an element a_i of a set A or a subset A_i of A. A family of sets in A is denoted by A = {A_i : i ∈ I}. The union, intersection and Cartesian product can be extended to a family of sets as ∪_{i∈I} A_i, ∩_{i∈I} A_i and ∏_{i∈I} A_i, respectively.
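As a small illustration, the following Python sketch mirrors the set operations just introduced; the concrete example sets are assumptions made only for illustration.

```python
# Basic set operations used throughout the chapter, on small example sets.
from itertools import product, chain, combinations

X, Y = {1, 2, 3}, {3, 4}

union = X | Y                      # X ∪ Y
intersection = X & Y               # X ∩ Y
difference = X - Y                 # X \ Y
sym_difference = X ^ Y             # X Δ Y = (X\Y) ∪ (Y\X)
cartesian = set(product(X, Y))     # X × Y
power_set = set(chain.from_iterable(combinations(sorted(X), r)
                                    for r in range(len(X) + 1)))  # P(X) as tuples

print(union, intersection, difference, sym_difference)
print(cartesian)
print(power_set)
```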
Definition 2.1 (Mapping, function) Let X and Y be two sets. If with each element x ∈ X we associate a subset F(x) of Y, then the correspondence x ↦ F(x) is a mapping of X into Y or a function from X to Y. If the set F(x) consists of a single element, then the mapping is single-valued, and multi-valued otherwise. Mapping, function or transformation will be used interchangeably.
Definition 2.2 (Basic facts on functions)
- Let f : X → Y be a function from X to Y. X is the domain of f and Y is the codomain of f. The range of f is R_f = {y ∈ Y : ∃ x ∈ X, y = f(x)}. The inverse function of f is a mapping f⁻¹ : Y → X that satisfies f⁻¹(f(x)) = x and f(f⁻¹(y)) = y for all x ∈ X and y ∈ Y.
- The image of x is f(x). The preimage of y consists of all x ∈ X whose image is y, i.e. f⁻¹(y) = {x ∈ X : f(x) = y}. The image of A ⊆ X is the set f(A) ⊆ Y consisting of all elements of Y which equal f(a) for some a ∈ A. The preimage (inverse image) of B ⊆ Y is the set f⁻¹(B) ⊆ X consisting of all elements x ∈ X such that f(x) ∈ B.
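A short executable illustration of image and preimage for a finite function; the particular function is an assumption chosen for illustration.

```python
# Image and preimage of sets under a function f, for a small finite example.
f = {1: 'a', 2: 'a', 3: 'b'}          # f : {1, 2, 3} -> {'a', 'b'}

def image(A):
    return {f[x] for x in A}

def preimage(B):
    return {x for x in f if f[x] in B}

print(image({1, 3}))        # {'a', 'b'}
print(preimage({'a'}))      # {1, 2}
```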
Definition 2.3 (Composition) Let f : X → Y and g : Y → Z be functions. Then g ∘ f : X → Z is a composition of mappings such that (g ∘ f)(x) = g(f(x)).
Definition 2.4 (Injection, surjection, bijection)
- A function f : X → Y is injective or one-to-one if it maps distinct arguments to distinct images, i.e. x1 ≠ x2 ⇒ f(x1) ≠ f(x2) holds for all x1, x2 ∈ X.
- A function f : X → Y is surjective if it maps to all images, i.e. if for every y ∈ Y, there exists x ∈ X such that f(x) = y. In other words, f is surjective if its range is equal to its codomain.
- A function f : X → Y is bijective if it is both injective and surjective, i.e. if for every y ∈ Y, there exists exactly one x ∈ X such that f(x) = y.
The composition of two injections (surjections, bijections) is again an injection (surjection, bijection).
Definition 2.5 (Binary relation) Let X and Y be two sets. A binary relation R is a subset of the Cartesian product X × Y, i.e. R ⊆ X × Y. A subset of X × X is a binary relation on X. One writes xRy to indicate that x is in relation with y.
Definition 2.6 (Equivalence relation) An equivalence relation on X is a binary relation ~ which is
(1) reflexive: x ~ x for all x ∈ X,
(2) symmetric: (x ~ y) ⇒ (y ~ x) for all x, y ∈ X,
(3) transitive: (x ~ y ∧ y ~ z) ⇒ (x ~ z) for all x, y, z ∈ X.
-
- -
-
-
-
The set of all elements of X equivalent to z is an equivalen,ce class of n: and denoted hy This means that [z] = {y : y t X A y z}.
[XI.
Definition 2.7 (Partially ordered set) A partially ordered set is a pair ( X ,5 ) ;where X is a set arid 5 is a partial order on X , which is: (1) reflexive: x 5 z or all z t X. (2) antisyrnrnetric: (x 5 y A y 5 x) + x = y for all Z, EX. ( 3 ) transitive: (z 5 p A y 5 z)+ (z 5 z) for all z, y, Z E X .
Spaces
27
Definition 2.8 (Upper bound, lower bound ) Let ( X ,5)be a partially ordered set and Y c X.Y is partially ordered under Y is bounded from above (from below) if there exists z E X for which y 5 z (x 5 y) holds for all y E Y.:2: is an 'upper (lo,wer) bound of Y .
s.
Definition 2.9 (Directed set) A partially ordered set X is a directed set if for any z, y E X , there exists z E X such that x 5 z and y 5 z . X is inversely directed if for any x,y E X , there exists z E X such that z 5 x and
z ,{ a ,b, .>I. NB(0)= { { u , b : c } } .
NB(c) = { { c } ,{a?b; c}, { c , d. e } } . N B ( d ) = {{c.d,e}}.
N B ( e ) = { { e } , { c ,d , e } } Extension of the above neighborhood relations to a set of integers is the Khalimsky line: used to define a digital topology [Khalimsky, 1987; Khalimsky et al.: 19901. (2) Let p : X x X + 'w? be a general dissimilarity measure as in Def. 2.45, such that p ( z , x) = 0. Then B~(z) = {y t X : p(x,y) < S } is a neighborhood of z for a given 6> 0. The neighborhood basis is then defined as
NB(2)= {&(z): &>0}.
Spaces
35
( 3 ) Let X be a set. A hierarcliical clustering (see Sec. 7.1) can be seen as a successive top-down decomposition of X, represented by a tree. The root describes the complete set and it is the largest cluster. Its children nodes point to a decomposition of X into a faniily of pairwise disjoint, clusters. Each cluster can be further deconiposed into smaller clusters, represented by nodes in t,he tree, until the single elements in the leaves. In this way, sequences of nested clusters are created. A neighborhood of II: is a cluster ch at the level h in the subtree containing the leaf 2 . Then &(x) = {CtL: z E Ch,}. Notc that the requiremerit of disjoint clusters at each level is riot essential for the definition of N B ( Z ) .
Definition 2.14 (Neighborhood of a set) Let ( X ; N )be a pretopological space and let Y C X . Then N is a neighborhood of 1’ iff N contains a neighborhood Nu of each E Y . The neighborliood system for I’ is then n/(g). See also Fig. 2.4(c). given by N ( Y )= Definition 2.15 (Open and closed sets via neighborhoods) Let X be a set. A 2 X is an open set if it is a neighborhood of each of its elements, i.e. V l z E ~A E N ( z ) . A is a closed set if (X\A) is open.
A neighborhood function N defines a generalized topology on the set X , as presented in Def. 2.12. Neighborhoods can be used to define genersliaed interior and closure operators, which may further define open arid closed sets, the basic concepts in a topological space. Since the properties of t,hc neighborhood, closure and interior functions can be translated into each other, t,hey are equivalent constructions on X . For instance, a generalized closure can be considered as a principal concept to define other operators on sets [Gastl and Hammer, 1967; Stadler et al., 2001; Stadler arid Statiler, 20021. Definition 2.16 (Generalized closure) Let P ( X ) be a powcr set of X. A genegrulized closure is a function P ( X ) + P(X)which for each A c X assigns A - c X such that 0- = 0 arid A c A-.
The generalized closure is not idempotent. This means that for A c X . the condition A - - = A- does not necessarily hold, as required for the topological closure. The interior function and neighborhood system N can be now defined by the generalized closure. Definition 2.17 (Generalized interior) Let P ( X )be a powrr set of X . A generalized znterzor is a function P ( X ) 4 P ( X ) which for each subset A
T h e dissimilarity representation for pattern recognition
36
Table 2.1 Equivalent axioms for the neighborhood system and the generalized closure operator. X is a set and A , B , N , M represent any of its subsets. Axioms (1)-(3) describe neighborhood spaces, axioms (1)-(4) define pretopological spaces and axioms (1)--(5) define topological spaces. Closure A -
Propcrties
(5) Idempotent
of X assigns a subset A" of X such that A" one can write that A- = X\(X\A)".
=
X\(X\A)-.
Equivalently,
Definition 2.18 (Neighborhood system) The neighborhood N : X + P ( P ( X ) )is a function which for each Z E X assigns the collection of neighborhoods defined as N(z) = { N E P ( X ) : II: $ ( X \ N ) - } . Equivalently, one can write that Z E N (X\N)$N(z).
*
Definition 2.19 (Generalized topology via closure) Let P ( X ) be the power set of X . Consider a generalized closure - : P ( X ) + P ( X ) with the following properties: (1) 0- = 0. ( 2 ) Expansive: VACX A C: A-. (3) Monotonic: V A , n-c x A C B jA(4) Sublinear: VA.BCX ( AU B ) - C A(5) Idempotent: VACX A-- = A-.
B-
u B-
If axioms (1) (3) are fulfilled, then ( X . - ) is a neighborhood spacc. If axionis (1) (4) hold, then ( X ,-) is a pretopological space. If all conditions are satisfied, (X. -) defines a topological space; see also Table 2.1. -
~
Corollary 2.1 Axioms given in Table 2.1 are eyuzvalen,t. Proof. Let X be a set and let N . M be any subsets of X . Then the basic fact in set theory is that following equivalence N C:M @ ( X \ M ) C ( X \ N ) holds. In the proof, we will make use of this and Def. 2.18, in which the generalized closure is defined by the neighborhood system. The latter means that Z E N - H ( X \ N ) @N(z). The proof follows.
Spaces
37
(1) Assume @ = 0- holds for every zEX. From Def. 2.18, 'dZEx~9'0-H ' d z a z @ ( X \ W @ Y 7 x x X EN(Z). (2)
Assume that the generalized closure is expansive. Let
:1: E
X arid
N E N ( ~ By ) . Def. 2.18, this is equivalent to z @ ( X \ N ) - . Making use of the expansive property of the closure, on has (X\N) C: (X\N)-. I t follows that X\(X\N)C ( X \ ( X \ N ) ) = N. For any z E X the following equivalence z @ (X\N)- H z E X\(X\N)holds. Since X\(X\N)- C (X\(X\N)) = N, then z E X\(X\N)+ :I: E N . Therefore, z; @ (X\N)z E N . Hence, we have proved that N E N(z) X€N.
*
x E X and N E N ( z ) . Assume that N E N ( x ) + z E N holds for any N C X . By Def. 2.18, for any 1c one has 1c E N + :I:# (X\N) + (X\N) @ N ( x@ ) z E N-.As z E N + z E N - consequently, N C N-.
-+== Let
( 3 ) Let z E X . Assume that N EN(^) and N C M . The latter is equivalent to (X\n/l) C ( X \ N ) . Since the generalized closure is monotonic, N C A!! @ ( ( X \ M ) C ( X \ N ) ) + ( ( X \ M - 2 ( X \ N ) - ) holds for all N , M C X . The latter set relation is equivalent to stating that x 9' (X\N)- + n: @ ( X \ M ) - , which by Def. 2.18: is equivalent to N E N(z)+ A f ~ " ( z ) . Since ( N € N ( z )A N C &I), then MEN(^).
C N - U M - hold for all N ; Af i X . Assume that N , M E N ( z ) . Replacing N by (X\N) and M by ( X \ M ) , one gets: ( ( X \ N ) U ( X \ M ) ) - 2 (X\N)- U (X\A,f)-. Herice 12: E ( ( X \ N ) U (X\M))- + ( Z E (X\N)- V z E ( X \ M - ) , which is equivalent to { z # ( X \ N ) - A z g ( X \ M ) - + z $ ( ( X \ N ) U ( X \ M ) ) - } . Since N , MEN(rc) and from de Morgan's law (X\N)U(X\M)= X \ ( N n M ) , the latter implication is equivalent to (N E N ( z ) A M E N(x:))+ ( N Ti M) € N ( z )by Def. 2.18.
(4) Let ( N U A d -
(5) Let z E X and N E N ( z ) . Assume that the generalized closure is idenipotent for all subsets of X . Therefore, one can write ( X \ N ) - = ( X \ N ) ) . BasedonDef. 2 . 1 8 , o n e h a s N E N ( x ) w z ; @ ( X \ N ) - -s (X\N)-- ++ (X\(X\N)-) ~ n / ( z )Let . M = X\(X\N)-. Then M E n/(:c) by the reasoning above. For all y, the following holds y E Ail ++ y @
( X \ W @ Y e (X\N)- @ Y @ (X\N)-- ++Y @ X\(X\(X\N)-)'++ (X\Ad)- @ M E N ( y ) , by Def. 2.18. Hence, we have shown that Y N E N ( ~3~.r=(x\(x\,v-)~,v(~) ) 'dy~ns EN^). 0 y@
T h e dissimilarity representation f o r p a t t e r n recognition
38
N(z) A EN(z)
Neighborhood
A c X A is open
Closure A-
InteriorA'
A = X\(X\A)-
A = A'
The difference between pretopological and topological spaces lies in the notion of a closiirc operator. In a topological space, the closure of any set A is closed, A-- = A - , and the interior of any set is open, (A")" = A". In a pretopological space, this is not necessarily true: so the basis neighborhoods are not open. Here, the generalized closure operator expresses the growth phenomenon, where the cornposition of several closures results in successive augmentations, i.e. A 5 A- C A-- C . . . .
Example 2.3 (Pretopological and topological spaces) (1) Let X be any set and let S : X × X → P(X) be a symmetric relation, i.e. S(x, y) = S(y, x). Assume a generalized closure of A ⊆ X defined by means of S. Then (X, ⁻) is a neighborhood space, since the generalized closure obeys conditions (1)-(3) of Def. 2.19. (2) Let X be a finite set and (X, E) be a directed graph. Let F(x) be the set of forward neighbors of x, i.e. F(x) = {y ∈ X : (x, y) ∈ E}. Let A ⊆ X. By the axioms of Def. 2.19 it is straightforward to show that the closure A⁻ = ∪_{x∈A} (F(x) ∪ {x}) defines a pretopological space (X, ⁻). (3) Let NB(x) = {{y ∈ R : |x − y| < ε} : ε > 0}. Then (R, NB) defines a topological space. (4) Let NB(x) = {(a, ∞) : a ∈ R ∧ x ∈ (a, ∞)}. Then (R, NB) defines a topological space.
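Item (2) of this example lends itself to a few lines of code; the tiny directed graph used here is an illustrative assumption, and the last two lines show that this closure is generally not idempotent, i.e. (X, ⁻) is pretopological but not topological.

```python
# Sketch of Example 2.3(2): closure A- = union over x in A of (F(x) ∪ {x})
# for a directed graph. It satisfies axioms (1)-(4) of Def. 2.19 but is
# generally not idempotent. The example graph is an assumption.
E = {('a', 'b'), ('b', 'c'), ('c', 'c')}
X = {'a', 'b', 'c'}

def forward(x):
    return {y for (u, y) in E if u == x}

def closure(A):
    return set().union(*({x} | forward(x) for x in A)) if A else set()

A = {'a'}
print(closure(A))            # {'a', 'b'}
print(closure(closure(A)))   # {'a', 'b', 'c'}  -- the closure is not idempotent here
```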
c
Corollary 2.2 (Open and closed sets) Let ( X ,- ) be n neighborhood space de,fined by t h e generulized closure, i.e. conditiom (1)-(3) of De,f. 2.19 hold. A 2 X i s an open set if A" = A. A is a closed set if A- = A; see also Table 2.2. Th,e followin,g holds:
(i)
AEN(TC H )A = X\(X\A)-.
Y Z E ~
(2) A
=
A"
A
= X\(X\A)-.
Pro0f. (1) Assiimc that Y r E ~A E N ( x ) holds. By Def. 2.18, Y r E ~A E N ( x ) @ Y l r g ~x $ ( X \ A ) - @ Y l s € ~xEX\(X\A)-. Hence A = X\(X\A)-. (2) A = A" = X\(X\A)- by Def. 2.17.
Spaces
39
Lemma 2.1 Let ( X , N ) be a neighborhood space. T h e assertions: (1) V I N E , V (3~A)4 C N ( z ) V y t &JEN(Y) ~ and (2) VN(NE N ( z ) @ No E N(2:)) are equivalent. The proof of Corollary 2.1, point (5) shows that VINE~u(a) VyEnr hf E N(y). Since M = N o by Def. 2.17, 3M=(X\(X\N)-)EN(zL.) then No E N ( x ) . 0
Proof.
A collection of open sets containing z constitutes a neighborhood basis in a topological space, which can be proved by Lemma 2.1. Equivalently, since the closure operator is dual to the interior operator, a neighborhood basis in a topological space can be built by a collection of closed sets coritaining x. Lemma 2.2 Let (X,&) be a pretopological space. If all neighborhoods oj = No} ,for all x E X , t h e n (X,NB) is a topological space.
NB are open sets, or NB(x) = {N C X : x E N A N
Corollary 2.3 (Closure on neighborhoods) Let ( X , - ) be a neighborhood space. T h e n a genm-alized closure operator i s a function, P ( X ) + P ( X ) , defined as gcl(A) = {x E X : V N t ~ ( r ) A n N # Moreover, gcl(A) = A-.
a}.
In order to prove that gcl(A) = A- holds for any A C X, we will equivalently show that (z$gcl(A)) H (x$A-) holds for all z t X .
Proof.
=+ z $gcl(A) + 3 ~ ~ , vN (n~A )= 0.By Def. 2.18, the latter is equivalent t o (x$(X\N)- A N n A = 0). Since (N n A = 0) + (A C X \ N ) , then by the monotonic property of -, A- C (X\N)- holds. Since z $ (X\N)-, then x#A-. +=
By Def. 2.18, then z@gcl(A).
(.$Ap) + ((X\A) @ N ( x )holds. )
Since (X\A) n A = 0,
0
Definition 2.20 (Limit point) Let ( X , N ) be a neighborhood space. An element y E X is a limit of A C X iff for every neighborhood N EN(^), N intersects A\{y}. The set of all limits points is called the derived set, der(A) = EX: V N c ~ ( y )(A\{y}) n N # 0} [Sierpinski, 19521. Corollary 2.4 In a neighborhood space, der(A) tains all its limit elements and conversely.
C A - . A closed set
con-
The notion of corivergencc is important in neighborhood (pretopological arid topological) spaces. Recall that a sequence x, in X is a function from N to X: hence f ( n ) = zn. The order of elements in z , is, thereby, important.
40
T h e dzssimilarity representataon f o r p a t t e r n recognatzon
The sequence x,, is different from a set { x : n } ~ ?which l, is simply indexed by W. One would say that a sequence xn converges to x E X in a neighborhood space ( X . N ) if for every neighborhood N c N ( z ) ,there exists k E N such that xn E N for all n 2 k . The problem with this definition is, however, that neighborhoods may have an uncountable number of elements arid countable sequences may not capture the idea of convergence well. In general, convergence is defined by the use of filters, which are generalization of sequcnces.
Definition 2.21 (Filter) A filter on a set X is a collection of X such that (1) (2)
YFE3
F
F of subsets
# @.
~ F" ~ C g (~F n F'). ( 3 ) b ' p ~v ~~F ( C F' + F ' E 3 . ' d p . p i e 3~ p
If 3 satisfies only the first two conditions, then
F defines a filter basis.
Note that given a filter 3 on a set X and a function f : X + Y , the set f ( 3 )= {f(A): A t 3 } forms a filter base for a filter of the function f .
Definition 2.22 (Convergence) Let ( X ,JV)be a neighborhood space. A filter 3 converges to z E X , 3 4x if V N E ~ / ( s ~) F F~ 2 FN . One may easily verify that a neighborhood system N(z) of an element X in a prctopological space ( X , A f ) ;compare t o Def. 2.12. One may, therefore, imagine a set of nested subsets (neighborhoods) of an element 2 that defines the convergence to x. If one is given a sequence of elements 2 , for n E W, then a filter basis can be defined as {PI,k EN},where Fk is a subsequence of x ,starting from the element x k , i.e. (xk,zk+l,. . .). :c is a filter on
Definition 2.23 (Hausdorff space) A neighborhood space ( X . N )is Hausdorfl or T2 if every two distinct elements of X havc disjoint neighborhoods. i.e. Yr,yEX 3 N z E ~ ( Nz )g ,E ~ ( y )Nzn & = 0. Lemma 2.3 Every convergent filter in a Hausdorfl space has u unique limit. Functions and, especially, continuous functions are basic tools in applications of various spaces. The basic intuition of a continuity is that small changcs in the input produce small changes in the corresponding function output. where 'small' is expressed by a chosen distance. In general neighborhood spaces, one can only work with sets.
41
Spaces
Definition 2.24 (Continuity by filters) Let f : ( X , N )+ ( Y . M ) be a function between two pretopological spaces. f is continuous at :I; E X if for all filters 3 on X if .F + x, then f ( F ) f (x). --f
Definition 2.25 (Continuity by neighborhoods) Let f : ( X , N ) + (Y.M ) be a function between two neighborhood spaces. f is continuous at z E X if for each neighborhood M of f ( z ) in Y , there exists a neighborhood N of z in X , whose image lies entirely in M . f is contiriiious on X if it is continuous at every x E X . Formally, f is continiloils if holds for all EX. Yin/ic,u(f(.)) ~ N E N ( . ) f ( N ) C Theorem 2.2 (On continuous functions) Let f : ( X , N )+ ( Y , M ) be a f u n c t i o n between two neigh,borhood spaces. T h e following assertions are equi.ualent [Gnilka, 1997; Munkres, 20001: 1. f i s continuous at x. 9. For all x E X , B E M ( f ( 2 ) + ) f - ' ( l ? ) ~ N ( z. ) 3. For every set A E P ( X ) ,f ( A - ) C ( f ( A ) ) -. 4. For eiie7-77 set B E P ( Y ) ,( f - l ( B ) ) -C f - l ( B - ) . 5. For every set B E P ( Y ) ,f - l ( B " )C ( f - l ( B ) ) " . Note that in topological spaces, continuity of a function translates to the fact that the preimage of an open (closed) set is an open (closed) set.
Remark 2.1 T h e composition of finitely many contin,uous mappin,gs i s a continuous mapping. Definition 2.26 (Regular space) 0
0
A neighborhood space ( X , N ) is regular if for each neighborhood N of z E X , there exists a smaller neighborhood M of z whose closure is contained in N ? i.e. Y N ~ N ( ~31\.lc~(.) ) M - C N. A topological space is regular if every neighborhood of n: contains a closed neighborhood of x. It means that the closed neighborhoods of' I(: forin a local basis at z. In fact. if the closed neighborhoods of each point in a topological space form a local basis at that point, then the space milst be regular.
Definition 2.27 (Normal space) 0
A pretopological space is normal if the separation of the closures of two sets imposes the existence of their disjoint neighborhoods, i.e. if for nonempty sets A and B , one has (A-nB-= 0) ( ~ N * , N( ~A C NA)A(BC N B ) A (NAn N B = 0)) [Cech, 1966: Stadler and Stadler, 20021.
*
42
The dissimilarity representation for p a t t e r n recognition
Table 2.3 Properties in Neighbood spaces
Regularity axioms
Separation axioms
A topological space is normal if the separation of two closed sets imposes the existence of their disjoint neighborhoods. Neighborhood and (pre)topological spaces can be classified with respect to the degree to which their points are separated, their compactness, overall size and connectedness. The separation axioms are the means to distinguish disjoint sets and distinct points. A few basic properties are presented in Table 2.3 and scheniatically illustrated in Fig. 2.5 [Cech, 1966: Stadler and Stadler, 2001b; Munkres, 20001.
Definition 2.28 (Completely within) A set A is completely within B in a neighborhood space ( X , N ) if there is a continuous function 4 : ( X . N )+ [O. 11 such that 4 ( A )C (0) and 4(X\B) C (1). Therefore, A C B. Different pretopological spaces can be distinguished by the way they 'split up into pieces'. The idea of connectedness becomes therefore useful.
Definition 2.29 (Connectedness) A space X which is a union of two disjoint non-empty open sets is disconmected, and connected, otherwise. Equivalently, a space X is connected if the only subsets of X which are both open and closed are the empty set arid X . Definition 2.30 (Cover) Let X be a set. A collection of subsets w C X is a coiier of' X if X = U w . A cover is finite if finitely many sets belong to it. If w and w' arc covers of X , then w' is a subcover if w' c w .
Spaces
REG
QN
43
--
TO
TI 1
T2
T2 t
T3
T,
t
0
Figure 2.5 A pictorial illustration of the regularity and separation properties; based on [Stadler and Stadler, 2001bl. Neighborhoods are drawn as ovals and closures are indicated as filled ovals. The regularity condition REG demands for each neighborhood N the existence of a smaller neighborhood whose closure is contained in N. The quasinormality axiom Q N requires that the separation of the closures of two sets iniposes the existence of their disjoint neighborhoods. To means that for two distinct elements, there exists a neighborhood of one of them such that it does not contain the other element. TI states that any two elements have neighborhoods with the property that the neighborhood of one element does not contain the other element. Tz imposes the existence of disjoint neighborhoods for any two elements. T' asks for the existence of neighborhoods such that their closures axe disjoint, for any two elements. T3 demands that for each neighborhood N, there is a set h!! which is completely within N.
Definition 2.31 (Compact space) A topological space X is compact if every open cover has a finite subcover. A topological space is locally compact if every element has a compact neighborhood.
Theorem 2.3 Let f : X → Y be a continuous function between topological spaces. If A is a compact subset of X, then f(A) is a compact subset of Y.
Theorem 2.4 A closed subset of a compact set is compact. A compact subset of a Hausdorff space is closed.
Definition 2.32 (Dense subset) A subset A of a topological space (X, ⁻) is dense in X if A⁻ = X. Equivalently, whenever N_x is an open neighborhood of x ∈ X, the set N_x ∩ A is non-empty.
Definition 2.33 (Size) A topological space is
- separable if it is the closure of a countable subset of itself, or in other words, if it contains a countable dense subset;
- first-countable if every element has a countable local basis; see also Def. 2.13;
- second-countable if it has a countable basis for its topology. Second-countable spaces are separable, first-countable, and every open cover has a countable subcover.
Example 2.4
1. Every topological space is dense in itself.
2. Let (ℝ, N_B) be a topological space with N_B(x) = {(x−ε, x+ε) : ε > 0}. Then, by Corollary 2.3, the set of rational numbers ℚ is dense⁵ in ℝ, i.e. ℚ⁻ = ℝ. Consequently, ℝ is separable, as ℚ is countable. More generally, ℝ^m is separable.
3. A discrete topological space (X, N_B) is a space with N_B(x) = {x}, i.e. the basis consists of single elements. This means that every subset of X is both open and closed. Every discrete space is first-countable, and it is second-countable iff it is countable.
Definition 2.34 (Topological product space) Suppose X_i, i = 1, 2, ..., n, are given sets. The set X of all n-tuples (x_1, x_2, ..., x_n), x_i ∈ X_i, is a Cartesian product X = X_1 × X_2 × ... × X_n = ∏_{i=1}^n X_i. Let (X_i, N_i), i = 1, 2, ..., n, be (pre)topological spaces. (X, N) is a (pre)topological product space if N(x) = ∏_{i=1}^n N_i(x_i) is a neighborhood basis of x ∈ X.
Remark 2.2 The definitions above can be extended to any (countable or not) family of topological spaces. The mapping π_i : x → x_i is a projection of X onto X_i. It is a continuous mapping and the topology defined on X is the weakest topology for which all the projections π_i are continuous [Köthe, 1969].
Topology (pretopology) can be introduced on a set in many ways. It can be defined by a collection of open sets, or generated by a neighborhood basis, a (generalized) closure or other operators. The way it is introduced specifies particular 'closeness relations'. One should, however, remember that new topologies can always be added to a set. Some topologies can be compared; however, not all of them are comparable.
Definition 2.35 (Weaker and stronger topologies) Let X be a set and let N, M be two neighborhood systems defined for every x ∈ X. The topology defined by N, the N-topology, is stronger (finer) than the topology defined by M, the M-topology, if for each x ∈ X every neighborhood
⁵Informally, one may think that a subset A is dense in X if the elements of A can 'approximate' the elements of X with arbitrary precision with respect to X.
M ∈ M(x) is also a neighborhood in N(x). This means that N has more neighborhoods than M. The M-topology is then weaker (coarser) than the N-topology. If neighborhood bases N_B and M_B are considered, then the N-topology is stronger than the M-topology if for each x ∈ X and every basis neighborhood M_B ∈ M_B(x), there is a basis neighborhood N_B ∈ N_B(x) such that N_B ⊆ M_B. If finitely or infinitely many topologies are defined by N_α on a set X, there is the strongest (finest) topology specified by N among the topologies on X which are weaker (coarser) than every N_α-topology. This means that every neighborhood of N(x) is a neighborhood of N_α for every α.
Definition 2.36 (Homeomorphism) A bijective function⁶ f : X → Y between two topological spaces (X, N) and (Y, M) is a homeomorphism if both f and f⁻¹ are continuous. The spaces X and Y are homeomorphic.
The homeomorphisms form an equivalence relation on the class of all topological spaces. Therefore, homeomorphic spaces are indistinguishable as topological spaces; they belong to the same equivalence class. Two homeomorphic spaces share the same topological properties, e.g. if one is compact, connected or Hausdorff, then the other is as well. This also means that a set N ∈ N(x) is open in X iff the set f(N) ∈ M(f(x)) is open in Y. Moreover, a sequence x_n converges to x iff the sequence f(x_n) converges to f(x).
Remark 2.3 The identity map I : (X, N) → (X, N), where I(x) = x, is a homeomorphism when the same topology (neighborhood system) is used over the domain and the range of the map. In general, this is not true if two different topologies are defined on X. Let N_B(x) = X and M_B(x) = {x} be the neighborhood bases for all x ∈ X. Then N consists of X and M is a power set of X (without the empty set). By Def. 2.25, I is continuous at x if for all M ∈ M(x) there exists N ∈ N(x) such that f(N) ⊆ M. As N = X for all x and there exists M = {x} such that f(X) ⊄ {x}, the map I : (X, N) → (X, M) is discontinuous at each point x.
Proposition 2.1 Let N and M be two neighborhood systems defined on a topological space X. The identity map I : (X, N) → (X, M) is continuous iff the N-topology is stronger than the M-topology.
⁶A bijective function f always has an inverse f⁻¹, but f⁻¹ is not necessarily continuous, even if f is.
An equivalence relation on a set is a binary relation between its elements, such that some of them become indistinguishable by belonging to the same class. In the study of spaces, a quotient space is the result of identifying such classes by an equivalence relation. This is usually done to construct new spaces from given ones.
Definition 2.37 (Quotient space) Let (X, N) be a topological space and let ~ be an equivalence relation on X. Denote by X/~ the set of equivalence classes of X under ~. Let π : X → X/~ be the projection map which sends each element of X to its equivalence class. The quotient topology on X/~ is the strongest topology (having the most open sets) for which π is continuous.
Remark 2.4 If X is a topological space and A ⊂ X, we denote by X/A the quotient space of the equivalence classes X/~ under the relation x ~ y iff x = y or x, y ∈ A. So for x ∉ A, {x} is an equivalence class and A is a single class.
2.4 Generalized metric spaces
A set can be augmented with a metric distance, or a structure weaker than a metric, which leads to generalized metric spaces. A metric can also be introduced to vector spaces. They are, however, discussed in the subsequent section. Most of the material presented here relies on the following books [Bialynicki-Birula, 1976; Blumenthal, 1953; Dunford and Schwarz, 1958; Köthe, 1969; Kreyszig, 1978; Willard, 1970].
Definition 2.38 (Metric space) A metric space is a pair (X, d), where X is a set and d is a distance function d : X × X → ℝ₀⁺ such that the following conditions are fulfilled for all x, y, z ∈ X:
(1) Reflexivity: d(x, x) = 0.
(2) Symmetry: d(x, y) = d(y, x).
(3) Definiteness: (d(x, y) = 0) ⇒ (x = y).
(4) Triangle inequality: d(x, y) + d(y, z) ≥ d(x, z).
For instance, X can be ℝ^m, ℤ^m, [a, b]^m, or a collection of all (bounded) subsets of [a, b]^m. If X is a finite set, e.g. X = {x_1, x_2, ..., x_n}, then d is specified by an n×n dissimilarity matrix D = (d_ij), i, j = 1, ..., n, such that d_ij = d(x_i, x_j). Consequently, the matrix D is nonnegative, symmetric and has a zero diagonal.
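As a brief, illustrative sketch (not taken from the text), the dissimilarity matrix of a finite sample can be assembled from any distance function and then checked for the three matrix properties just listed; the metric used here (Euclidean) and the sample values are assumptions for the example only.

```python
import numpy as np

def dissimilarity_matrix(X, metric):
    """Build the n x n matrix D with D[i, j] = metric(X[i], X[j])."""
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = metric(X[i], X[j])
    return D

euclidean = lambda x, y: np.sqrt(np.sum((x - y) ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])   # three points in R^2
D = dissimilarity_matrix(X, euclidean)

assert np.allclose(np.diag(D), 0.0)   # zero diagonal (reflexivity)
assert np.allclose(D, D.T)            # symmetry
assert np.all(D >= 0.0)               # nonnegativity
```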
Example 2.5 Examples of metric spaces:
1. Let X be any set. For x, y ∈ X, the discrete metric on X is given by d(x, y) = I(x ≠ y), where I is the indicator (or characteristic) function. If X is a finite set, then all the pairwise distances can be realized by points lying on an equilateral polytope (extension of an equilateral triangle and of a tetrahedron).
2. Let X be a set of all binary sequences of the length m. Given two binary strings s = s_1 s_2 ... s_m and t = t_1 t_2 ... t_m, the Hamming distance is defined as d_Ham(s, t) = Σ_{k=1}^m I(s_k ≠ t_k).
3. Metrics in a vector space ℝ^m. To emphasize that a vector x comes from a finite-dimensional vector space ℝ^m, we will mark it in bold:
(a) d_p(x, y) = (Σ_{i=1}^m |x_i − y_i|^p)^{1/p}, p ≥ 1; a general Minkowski distance.
(b) d_1(x, y) = Σ_{i=1}^m |x_i − y_i|, the city block distance.
(c) d_2(x, y) = d_E(x, y) = (Σ_{i=1}^m (x_i − y_i)²)^{1/2}, the Euclidean distance.
(d) d_∞(x, y) = d_max(x, y) = max_i |x_i − y_i|, the max-norm distance.
4. Let F(Ω) be a set of real-valued functions defined on a bounded and closed set Ω. Let M(Ω) ⊂ F(Ω) be a set of functions which are Lebesgue measurable on Ω. Then L_p = {f ∈ M(Ω) : (∫_Ω |f(x)|^p dx)^{1/p} < ∞}, p ≥ 1, is a metric space with the distance d_p(f, g) = (∫_Ω |f(x) − g(x)|^p dx)^{1/p}.
Let (X, d_X), (Y, d_Y) and (Z, d_Z) be generalized metric spaces with continuous dissimilarity measures and let f : X → Y, g : Y → Z and h : X → Z be mappings. If f and g are continuous, then the composed mapping h = g∘f, h(x) = g(f(x)), is continuous as well. Sketch of proof. The proof follows directly from considering the equivalence between the continuity and the convergence of a sequence, based on Corollary 2.5.
Direct product spaces can be used for the construction of a new space by combining two (or more) spaces. In the context of (finite) generalized metric spaces, if the measures refer to the same set of objects, a new dissimilarity measure can be created, e.g. by their summation.
Definition 2.49 (Product space) Let (X, d_X) and (Y, d_Y) be generalized metric spaces. Then a product generalized metric space X × Y with a dissimilarity d can be defined as (X × Y, d_X ∘ d_Y), where ∘ is the sum or max operator. This means that (d_X ∘ d_Y)((x_1, y_1), (x_2, y_2)) = d_X(x_1, x_2) + d_Y(y_1, y_2) or (d_X ∘ d_Y)((x_1, y_1), (x_2, y_2)) = max{d_X(x_1, x_2), d_Y(y_1, y_2)} for x_1, x_2 ∈ X and y_1, y_2 ∈ Y.
Extension of the concepts of neighborhoods, convergence and continuity to a product space is straightforward. For instance, U is a neighborhood of
the pair (x, y) if there exist a neighborhood N of x ∈ X and a neighborhood M of y ∈ Y such that N × M ⊆ U. Also, the convergence of a sequence (x_n, y_n) ∈ X × Y is equivalent to the convergence of the sequences x_n ∈ X and y_n ∈ Y.
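The combination rule of Definition 2.49 can be sketched in a few lines of code; the component dissimilarities (Euclidean on one factor, Hamming on the other) and the sample pairs below are illustrative assumptions, not part of the text.

```python
import numpy as np

def d_sum(dx, dy):
    """Combine dissimilarities on X and Y into one on X x Y via summation."""
    return lambda p1, p2: dx(p1[0], p2[0]) + dy(p1[1], p2[1])

def d_max(dx, dy):
    """Combine dissimilarities on X and Y into one on X x Y via the max operator."""
    return lambda p1, p2: max(dx(p1[0], p2[0]), dy(p1[1], p2[1]))

d_euclid = lambda x, y: float(np.linalg.norm(np.asarray(x) - np.asarray(y)))
d_hamming = lambda s, t: sum(a != b for a, b in zip(s, t))

# A pair (vector, binary string) is a point of the product space.
p1 = (np.array([0.0, 1.0]), "0110")
p2 = (np.array([1.0, 1.0]), "0011")
print(d_sum(d_euclid, d_hamming)(p1, p2))   # 1.0 + 2 = 3.0
print(d_max(d_euclid, d_hamming)(p1, p2))   # max(1.0, 2) = 2
```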
2.5 Vector spaces
Generalized topological spaces and generalized metric spaces defined on sets were described in the previous sections. The necessity, however, arises to consider sets on which meaningful binary operations are allowed. This leads to groups and fields. When the operations become the addition of elements and scalar multiplication, vector spaces can be defined. When, additionally, a topology or a metric is introduced to a vector space, its algebraic structure is enriched. The reader is referred to [Bialynicki-Birula, 1976; Dunford and Schwarz, 1958; Garrett, 2003; Greub, 1975; Köthe, 1969; Lang, 2004; Willard, 1970] for more details.
Definition 2.50 (Group) A group (G, ∘) is a nonempty set G with a binary operation ∘ : G × G → G, satisfying the group axioms:
(1) Associative law: ∀ a, b, c ∈ G: (a∘b)∘c = a∘(b∘c).
(2) Existence of a unique identity element: ∃ id ∈ G ∀ a ∈ G: a∘id = id∘a = a.
(3) Existence of an inverse element: ∀ a ∈ G ∃ a⁻ ∈ G: a∘a⁻ = a⁻∘a = id.
If additionally the commutative law, a∘b = b∘a, holds for all a, b ∈ G, then the group G is Abelian.
Definition 2.51 (Field) A field (Γ, +, *) is a nonempty set Γ together with the binary operations of addition + and multiplication * satisfying the following conditions:
(1) (Γ, +) is an Abelian group with the 0 additive identity element.
(2) (Γ\{0}, *) is an Abelian group with the unit multiplicative identity element.
(3) Distributive laws: a*(b+c) = (a*b)+(a*c) and (a+b)*c = (a*c)+(b*c) hold for all a, b, c ∈ Γ.
Example 2.9 (Fields and groups)
(1) Let ℤ be the set of integers. (ℤ, +) is a group, but (ℤ, *) is not.
(2) Let ℝ be the set of real numbers and ℂ be the set of complex numbers. (ℝ, +, *) and (ℂ, +, *) are fields.
Definition 2.52 (Vector space) A vector space (a linear space) X over the field Γ is a set of elements, called vectors, with the following algebraic structure:
(1) There is a function X × X → X, mapping (x, y) to x + y, such that (X, +) is an Abelian group with the zero additive identity.
(2) There is a function Γ × X → X, mapping (λ, x) to λx, such that the following conditions are satisfied for all x, y ∈ X and all λ, μ ∈ Γ:
(a) Associative law: (λμ)x = λ(μx).
(b) Distributive laws: λ(x + y) = λx + λy, and (λ + μ)x = λx + μx.
(c) Existence of the multiplicative identity element 1 ∈ Γ: 1x = x.
If the field Γ is not explicitly mentioned, Γ is assumed to be either ℝ or ℂ.
Definition 2.53 (Linear combination, span and independence) Let X be a vector space. The vector x is a linear combination of vectors {x_1, x_2, ..., x_n} from X if there exist {α_1, α_2, ..., α_n} ∈ Γ such that x = Σ_{j=1}^n α_j x_j. The span of {x_1, x_2, ..., x_n} is the collection of all their linear combinations. A finite set of vectors {x_j} ∈ X is linearly independent if Σ_{j=1}^n α_j x_j = 0 implies that all α_j = 0. Otherwise, the set is linearly dependent. An infinite set is linearly independent if every finite subset is linearly independent.
Definition 2.54 (Basis and dimension of a vector space) Let X be a vector space. The set B of vectors b_j ∈ X forms a Hamel basis of X if B is linearly independent and each vector x is in the span of V = {b_j} for some finite subset V of B. The dimension of X, dim X, is the cardinality of B.
Definition 2.55 (Subspace) A subspace V of a vector space X is a subset of X, closed for the operations of vector addition and scalar multiplication.
Example 2.10 Examples of vector spaces:
1. ℝ and ℂ with the usual operations of scalar addition and multiplication.
2. ℝ^m and ℂ^m, with the elements x = (x_1, x_2, ..., x_m) and the element-wise addition and multiplication by a scalar, are m-dimensional vector spaces.
3. A set of n×m matrices with the matrix addition and multiplication by a scalar.
4. The set F(Ω) of all functions defined on a closed and bounded set Ω, with the pointwise addition (f + g)(x) = f(x) + g(x) and the scalar multiplication (cf)(x) = c f(x).
5. The set P_n of all polynomials of degree less than n is a vector space and an n-dimensional subspace of F(Ω).
6. The set C(Ω) of continuous functions on Ω and the set M(Ω) of classes of functions measurable in the Lebesgue sense¹⁰ are infinite-dimensional vector spaces and subspaces of F(Ω).
7. The set L_p = {f ∈ M(Ω) : (∫_Ω |f(x)|^p dx)^{1/p} < ∞} for p ≥ 1 is an infinite-dimensional vector space and a subspace of F(Ω).
Definition 2.56 (Quotient vector space) Let X be a vector space over a field Γ and let Y be a subspace of X. Consider an equivalence relation ~ on X such that x_1 ~ x_2 if (x_1 − x_2) ∈ Y. X/Y, X mod Y, defined by the relation ~ is a quotient vector space. Let [x] denote the equivalence class of x. The addition on the equivalence classes is defined as [x_1] + [x_2] = [x_1 + x_2] for all x_1, x_2 ∈ X and the multiplication by a scalar is defined as α[x] = [αx] for all α ∈ Γ and x ∈ X.
If X is an n-dimensional space and Y is an m-dimensional space, then X/Y has the dimension n − m.
Definition 2.57 (Linear map) Let X and Y be vector spaces over the field Γ. A linear map (linear operator) from one vector space to another is a function φ : X → Y, also called a homomorphism, such that for all x_1, x_2 ∈ X and all λ ∈ Γ, the following conditions are fulfilled:
(1) Additivity: φ(x_1 + x_2) = φ(x_1) + φ(x_2).
(2) Homogeneity: φ(λx) = λφ(x).
Note that the above conditions are equivalent to stating that φ preserves linear combinations, i.e. φ(Σ_{i=1}^n λ_i x_i) = Σ_{i=1}^n λ_i φ(x_i) for all x_i ∈ X and all λ_i ∈ Γ. If Y = Γ, then φ is called a linear functional. X and Y can also be defined over different fields.
Remark 2.6 If X and Y are finite-dimensional vector spaces with chosen bases, then any linear map can be represented by a matrix. For instance, a linear transformation ℝ^k → ℝ^m is represented by an m×k matrix A such that y = Ax for x ∈ ℝ^k and y ∈ ℝ^m.
Definition 2.58 (Kernel and image) Let f : X → Y be a linear transformation between two vector spaces. The kernel or null-space of f is a subspace of X consisting of vectors whose image is 0, i.e. ker(f) = {x ∈ X : f(x) = 0}.
¹⁰Two functions are in the same equivalence class if they agree almost everywhere, i.e. if they disagree on a set of measure zero. From now on, M(Ω) refers to such classes of functions measurable in the Lebesgue sense.
The image of f is a subspace of Y consisting of images of vectors from X, i.e. im(f) = {y ∈ Y : ∃ x ∈ X, f(x) = y}.
Lemma 2.4 If X is finite-dimensional, then dim(ker(f)) + dim(im(f)) = dim(X).
If additionally Y is finite-dimensional and the bases are chosen, then the linear map is represented by the matrix F; dim(im(F)) is the rank of F and dim(ker(F)) is the nullity of F. The notion of a dual vector space is important in most applications. It is especially useful for inner product and normed spaces; see also Sec. 2.6.
Definition 2.59 (Dual space) Let X be a vector space over the field Γ (ℝ or ℂ). The dual space, also called the algebraic dual, X* = L(X, Γ) of X is the set of linear functions f : X → Γ, also called linear functionals.
Remark 2.7 The collection X* of linear functionals on X over Γ is a vector space over Γ with the pointwise addition (f + g)(x) = f(x) + g(x) and scalar multiplication (αf)(x) = αf(x) for all f, g ∈ X*, α ∈ Γ and x ∈ X. The 0-vector in X* is the linear functional that maps every vector x ∈ X to zero. The additive inverse (−f) is defined by (−f)(x) = −f(x). The associative law and distributive laws, Def. 2.52, can easily be verified by straightforward computations.
As we will later deal with finite samples, our focus is on finite-dimensional spaces. If X is finite-dimensional, then both X and X* have the same dimension. Moreover, X is isomorphic¹¹ to X*. The isomorphism depends on the basis B of X, which defines a dual basis B* of X* and a bijection B → B*. So, given a basis of X, there exists a unique corresponding dual basis.
Definition 2.60 (Dual basis) Let X be an n-dimensional vector space with a basis B = {b_1, b_2, ..., b_n}. A dual basis {f_1, f_2, ..., f_n} of X* with respect to B is a basis for X* with the property that f_j(b_i) = 1 if i = j, and 0 otherwise.
"An informal definition by Hofstadter [Hofstadter, 19791: T h e word 'isomorphism' applies when two complex structures can be mapped onto each other, in such a way that t o each part of one structure there is a corresponding part in the other structure, where 'corresponding' means that the two parts play similar roles in their respective structures. Formally, the isomorphism f is a bijective map (one-to-one and onto) such that both f and its inverse f p l are linear maps.
The linear functionals {f_i}_{i=1}^n of X* are formally defined as f_i : X → Γ by f_i(Σ_{k=1}^n x_k b_k) = x_i. Then the f_i are nonzero elements of X* and span X*. If X is infinite-dimensional, then the dimension of X* is strictly larger than that of X [Köthe, 1969]. A simple illustration of this fact is a space X of infinite real sequences (x_1, x_2, ...) with a finite number of non-zero elements. The dual space X* consists of infinite real sequences (x*_1, x*_2, ...) of any elements, hence its dimension must be larger than that of X.
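In the finite-dimensional case the dual basis can be computed explicitly: if the basis vectors are the columns of a nonsingular matrix B, the dual-basis functionals are the rows of B⁻¹, since B⁻¹B = I encodes f_j(b_i) = δ_ij. The basis matrix below is an illustrative assumption, not an example from the text.

```python
import numpy as np

B = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])      # columns b_1, b_2, b_3 form a basis of R^3
F = np.linalg.inv(B)                 # rows of F act as the dual-basis functionals

# f_j(b_i) = 1 if i = j and 0 otherwise:
assert np.allclose(F @ B, np.eye(3))

# f_i applied to x = sum_k x_k b_k recovers the coordinate x_i:
coeffs = np.array([2.0, -1.0, 0.5])
x = B @ coeffs
assert np.allclose(F @ x, coeffs)
```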
Definition 2.61 (Bilinear map) Let X, Y and Z be vector spaces over the field Γ. A bilinear map (bilinear operator) is a function f : X × Y → Z such that
(1) For any fixed x ∈ X the map y → f(x, y), f_x(y) = f(x, y), is a linear transformation from Y to Z.
(2) For any fixed y ∈ Y the map x → f(x, y), f_y(x) = f(x, y), is a linear transformation from X to Z.
If X = Y and f(x, y) = f(y, x) for all x, y ∈ X, then f is symmetric. If Γ = ℂ and f(x, y) = f†(y, x) for all x, y ∈ X, where † denotes complex conjugation¹², then f is Hermitian.
Definition 2.62 (Bilinear form) Let X be a vector space over the field Γ. A bilinear form is a bilinear transformation f : X × X → Γ.
Note that any real n×m matrix A can be regarded as a matrix of a bilinear form X × Y → ℝ such that X = ℝ^n and Y = ℝ^m and f(x, y) = Σ_{i=1}^n Σ_{j=1}^m A_{ij} x_i y_j.
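A short numerical check (an illustrative sketch, with a randomly drawn matrix assumed for the example) that f(x, y) = x^T A y is linear in each argument separately:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))          # matrix of a bilinear form R^3 x R^4 -> R
f = lambda x, y: float(x @ A @ y)

x1, x2 = rng.normal(size=3), rng.normal(size=3)
y = rng.normal(size=4)
a, b = 2.0, -0.5

# Linearity in the first argument for a fixed y (and symmetrically in the second).
assert np.isclose(f(a * x1 + b * x2, y), a * f(x1, y) + b * f(x2, y))
```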
Definition 2.63 (Non-degenerate bilinear form) Let f : X × X → Γ be a bilinear form over the vector space X. f is non-degenerate when the following conditions hold:
(1) If f(x_1, x_2) = 0 for all x_1 ∈ X, then x_2 = 0.
(2) If f(x_1, x_2) = 0 for all x_2 ∈ X, then x_1 = 0.
The spaces X and X* are dual with respect to a bilinear function X* × X → Γ, called a scalar product or inner product, and denoted as (., .), such that (f, x) = f(x) for x ∈ X and f ∈ X*. For instance, if X = ℝ^m = X* (ℝ^m is self-dual), then (x*, x) = Σ_{i=1}^m x*_i x_i for x* ∈ X* and x ∈ X. This scalar product is linear in both arguments and its properties are analogous
zE@ such that
t=
a+bz, then
z t = a-bi, i2 = -1.
Recall that IzI = z z t = a2+b2.
to the properties of the well-known scalar product studied in analytical geometry (that is, given two vectors x and y, their scalar product is computed by multiplying their lengths and the cosine of the angle between them). Note, however, a subtle difference. In general, the arguments of (f, x) belong to different spaces and they cannot be exchanged, which means that this inner product is not symmetric. One, however, uses the same notion of inner product to strengthen the analogy to the traditional geometric inner product. Formal definitions of inner products will follow in Sec. 2.6.
Definition 2.64 (Evaluation functional) Let X* = L(X, Γ) be a space of linear functionals. An evaluation functional δ_x evaluates each function f ∈ X* at a point x ∈ X as δ_x[f] = f(x). One can, therefore, write that δ_x[f] = (f, x) for f ∈ X* and x ∈ X.
Any isomorphism φ : X → X* defines a unique non-degenerate bilinear function on a finite-dimensional vector space X by (x, y) = φ(x)(y) for x, y ∈ X, such that for the fixed x, φ(x) : X → Γ. Recall that X* consists of linear functionals X → Γ and Y* consists of linear functionals Y → Γ. Note that a bilinear map f : X × Y → Γ can be characterized by the left linear map f_L ∈ L(X, Y*), i.e. f_L : x → f_x, and the right linear map f_R ∈ L(Y, X*), i.e. f_R : y → f_y, such that f_L(x)(y) = f_x(y) = f(x, y) and f_R(y)(x) = f_y(x) = f(x, y) for all x ∈ X and y ∈ Y.
Theorem 2.14 (Dual map)
Consider a homomorphism ψ : X → Y over the field Γ. Then there exists an associated (unique) dual map ψ* : Y* → X* between the dual spaces Y* and X* such that ψ*(g)(x) = g(ψ(x)) for all g ∈ Y* and x ∈ X.
The dual map is linear, hence (ψ + φ)* = ψ* + φ* and (αψ)* = αψ* for the linear maps ψ and φ, and α ∈ Γ. Additionally, (ψ ∘ φ)* = φ* ∘ ψ*.
Definition 2.65 (Second dual space) Let X be a vector space over the field Γ. The second dual space X** of X is the dual of its dual space. This means that the elements of X** are linear functionals f : X* → Γ. There exists a vector space homomorphism η : X → X** defined by η(x)(f) = f(x) for all x ∈ X and f ∈ X*. If X is finite-dimensional, then η is an isomorphism, called the canonical isomorphism, and dim X = dim X* = dim X**.
Definition 2.66 (Quadratic form) Let X be a vector space. A mapping q : X → ℝ is a quadratic form if for all x_1, x_2 ∈ X and α ∈ ℝ, the following conditions hold:
(1) f(x_1, x_2) = q(x_1 + x_2) − q(x_1) − q(x_2) is bilinear in x_1 and x_2.
(2) q(αx) = α² q(x).
Note that f is a symmetric bilinear form.
Definition 2.67 (Continuous dual space) The continuous dual L_c(X, Γ) of a topological vector space X is the subspace of the dual space X* = L(X, Γ) consisting of all continuous linear functionals¹³.
Definition 2.68 (Topological vector space) A vector space X over the field Γ is a topological vector space if there exists a neighborhood system N such that (X, N) is a topological space and the vector space operations of addition (x, y) → x + y of X × X → X and multiplication by a scalar (λ, x) → λx of Γ × X → X are continuous in the topology.
Note that in a topological vector space, the topology is determined by the neighborhoods of 0. The neighborhood base N_B(0) is defined by open sets containing 0 such that every neighborhood of 0 contains a base neighborhood from N_B(0). All the neighborhoods of other points are unions of translated base neighborhoods: ∪_{α,β}(x_α + B_β), x_α ∈ X and B_β ∈ N_B(0).
Definition 2.69 (Convex set) Let X be a set in a real vector space. X is convex if αx + (1 − α)y ∈ X for all x, y ∈ X and all α ∈ [0, 1].
Definition 2.70 (Locally convex topological vector space) A topological vector space X is locally convex if every point has a local base consisting of convex sets. X is locally compact if every point has a local base consisting of compact neighborhoods. These definitions can also be simplified to consider only a local base of 0.
2.6 Normed and inner product spaces
Metric spaces are already richer in structure than topological spaces, yet still more structure can be introduced; see Fig. 2.1 and Fig. 2.2. Normed and inner product spaces are special cases of metric vector spaces, where the metric is defined either by a norm or an inner product. The algebraic and geometric structures of such spaces are richer than those of metric spaces alone. Inner product spaces are important, since there exists a well-developed mathematical theory which places the pattern description and
learning in their context. Details can be found in [Dunford and Schwarz, 1958; Garrett, 2003; Greub, 1975; Kreyszig, 1978; Köthe, 1969; Pryce, 1973; Sadovnichij, 1991].
¹³For any finite-dimensional normed vector space (to be defined in Sec. 2.6) or any topological vector space, such as a Euclidean space, the continuous dual and the algebraic dual coincide. L_c(X) is then a normed vector space, where the norm ||f|| of a continuous linear functional f on X is defined as ||f|| = sup{|f(x)| : ||x|| ≤ 1}.
Definition 2.71 (Normed space) Let X be a vector space over the field Γ. A norm on X is a function ||·|| : X → ℝ₀⁺ satisfying for all x, y ∈ X and all α ∈ Γ the following conditions:
(1) Nonnegative definiteness: ||x|| ≥ 0.
(2) Non-degeneration: ||x|| = 0 iff x is a zero vector.
(3) Homogeneity: ||αx|| = |α| ||x||.
(4) Triangle inequality: ||x + y|| ≤ ||x|| + ||y||.
A vector space with a norm, (X, ||·||), is called a normed space. If only the axioms (1), (3) and (4) are satisfied, then ||·|| becomes a seminorm and (X, ||·||) a seminormed space.
Example 2.11 Examples of (semi)normed spaces:
1. (F([−1, 1]), ||·||) with ||f|| = |f(0)| is a seminormed space.
2. (ℝ^m, ||·||_p) with p ≥ 1, where ||x||_p = (Σ_{i=1}^m |x_i|^p)^{1/p}, is a normed space.
3. (ℝ^m, ||·||_∞), where ||x||_∞ = max_{i=1,...,m} |x_i|, is a normed space.
4. Let C(Ω) be a set of continuous functions on a closed and bounded set Ω ⊂ ℝ^m. (C(Ω), ||·||_p), where ||f||_p = (∫_Ω |f(x)|^p dx)^{1/p} and p ≥ 1, is a normed space.
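The finite-dimensional norms of Example 2.11 are easy to evaluate directly; the sketch below (with illustrative vectors assumed for the example) also checks the triangle inequality of Definition 2.71 numerically.

```python
import numpy as np

def norm_p(x, p):
    """The l_p norm of Example 2.11, p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0, 1.0])
print(norm_p(x, 1))          # city-block norm: 8.0
print(norm_p(x, 2))          # Euclidean norm: ~5.099
print(np.max(np.abs(x)))     # max norm: 4.0

# The triangle inequality, checked numerically for the Euclidean norm:
y = np.array([-1.0, 2.0, 0.5])
assert norm_p(x + y, 2) <= norm_p(x, 2) + norm_p(y, 2)
```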
Theorem 2.15 (Seminormed spaces are topological) A seminormed space is a topological vector space, where the open ball neighborhoods are defined as B_ε(x) = {y ∈ X : ||x − y|| < ε}, ε > 0.
Remark 2.8 In a seminormed space with the topology induced by a seminorm, all neighborhood systems can be constructed by the translation of the neighborhood system for 0, i.e. N(x) = N(0) + x. This is also true when the topology is defined by a translation-invariant (semi)metric.
Remark 2.9 A normed space is a locally convex topological vector space, since a norm is a convex function, Example A.1. Therefore, the open ball neighborhoods B_ε(x) are convex sets.
Lemma 2.5 (On seminormed spaces)
1. The (semi)norm is a continuous function.
A pseudo-Euclidean space ε = ℝ^(p,q) is a real vector space equipped with a non-degenerate, indefinite inner product (., .)_ε [Greub, 1975]. ε admits a direct orthogonal decomposition¹⁸ ε = ε₊ ⊕ ε₋, where ε₊ = ℝ^p and ε₋ = ℝ^q, and the inner product is positive definite on ε₊ and negative definite on ε₋. The space ε is, therefore, characterized by the signature (p, q) [Goldfarb, 1984].
¹⁸A direct sum V = X ⊕ Y ⊕ Z means that every v ∈ V can be uniquely decomposed into x ∈ X, y ∈ Y and z ∈ Z such that v = x + y + z and X ∩ Y = {0}, Y ∩ Z = {0} and X ∩ Z = {0}. An orthogonal sum of X, Y and Z is their direct sum such that they are pairwise orthogonal. Here, a direct orthogonal decomposition V = V₊ ⊕ V₋ ⊕ V₀ means that V₋ = V₊^⊥ and V₀ = (V₊ ⊕ V₋)^⊥, i.e. V₀ consists of neutral vectors orthogonal to all other vectors in V.
Definition 2.90 (Orthonormal basis) Let ε = ℝ^(p,q) be a pseudo-Euclidean space. An orthonormal basis {e_1, e_2, ..., e_{p+q}} in ε is defined by
(e_i, e_j)_ε = 1 for i = j = 1, 2, ..., p; −1 for i = j = p+1, ..., p+q; and 0 for i ≠ j.
The inner product between two vectors x and y in ℝ^(p,q) can be expressed by the standard inner product (., .) in a Euclidean space.
Lemma 2.7 (Pseudo-Euclidean inner product via the standard inner product) Let ε = ℝ^(p,q) be a pseudo-Euclidean space. Then (., .)_ε can be expressed by the traditional (., .) in a Euclidean space ℝ^{p+q} as
(x, y)_ε = x^T J_pq y = (x, J_pq y),        (2.1)
where J_pq = diag(I_{p×p}, −I_{q×q}) and I_{p×p} and I_{q×q} are the identity matrices.
If x₊ and x₋ stand for the orthogonal projections of x onto ℝ^p and ℝ^q, respectively, then (x, y)_ε = (x₊, y₊) − (x₋, y₋). The indefinite 'norm' of a non-zero vector x becomes ||x||²_ε = (x, x)_ε = x^T J_pq x, which can have any sign. Based on the inner product, the pseudo-Euclidean distance is defined analogously to the Euclidean case.
Definition 2.91 (Pseudo-Euclidean square distance) Let ε = ℝ^(p,q) be a pseudo-Euclidean space. Then
d²(x, y) = ||x − y||²_ε = (x − y, x − y)_ε = (x − y)^T J_pq (x − y)        (2.2)
is a pseudo-Euclidean square distance. It can be positive, negative or zero.
The distance d is either real or purely imaginary, i.e. of the form i·a with real a, where i² = −1. Note that the square distance between distinct vectors x and y may equal zero. Note that an orthonormal basis of the pseudo-Euclidean space is chosen as it is convenient for the representation. The reason is that J_pq has a simple form and it is both symmetric and orthogonal in the Euclidean space ℝ^{p+q} and in the pseudo-Euclidean space ℝ^(p,q) (this will be explained
75
Figure 2.8 Left: a pseudo-Euclidean space & = R ( l > l ) = R1 x iR1 with d2(x,y) = (~-y)~J11(x-y).Orthogonal vectors are mirrored versus the lines 5 2 = 2 1 or xz = -1, for instance ( O A , O C ) E= 0. Vector v defines the plane 0 = ( v , x )=~ vTJ1lx. Note that the vector w = J ~ I va,‘flipped’ version of v, describes the plane as if in a Euclidean space 8’. Therefore, in any pseudo-Euclidean space, the inner product can be interpreted as a Euclidean operation, where one vector is ‘flipped’ by Jpg. The square distances can have any sign, e.g. d 2 ( A , C )= 0, d 2 ( A , B )= 1, d 2 ( B , C )= -1, d 2 ( D , A ) = -8, d 2 ( F ,E ) = -24 and d 2 ( E ,D ) = 32. Right: A pseudo-sphere 11x11; = x: - x; = 0. From the Euclidean point of view, this is an open set between two conjugated hyperbolas. Consequently, the rotation of a point is carried out along them.
later on). One may, however, consider another basis. Let V = IWTL be a vector space and let { v ~ } ? =be~ any basis. Consider two vectors of V ; x = C:k,zivi and y = Cr=lyivi, as expressed with respect the basis vectors. Let 4 : V X V + IR be a symmetric bilinear form in V. Then 4(x,y ) = C:=‘=, Cy=lziyi 4(vi,vj) = xTMy, where A,f = M ( 4 ) such that Mij = 4(vi, vj) for all i ,j = I, . . . , n is a matrix of the form 4 with respect to the basis { V ~ } F = ~ .Assume that 4 is non-degenerate, which means that the rank of M is n. If M is positivc (negative) definite, i.e. if 4(x,x) > 0 ( d ( x , x )< 0) for all x E V, then qh (4) defines a traditional inner product in V. If A f is indefinite, i.e. $ ( x , x ) is either positive or negative for x E V, then 4 defines an indefinite inner product in V. We will denote it as (x,y)bf = xTMy. If M is chosen to be J P q , then { V ~ } Y = ~ is an orthonormal basis in R(”q), p g = n. This means that any symmetric non-degeneratc bilinear form 4 defines a specific pseudo-Euclidean space. Any other such form $ will define either the same or different pseudo-Euclidean space, depending on the signature, i.e. the number of positive and negative eigenvalues of M ( $ ) . If the signatures of M(qh) and AT($) are identical, then the same pseudo-Euclidean space is obtained.
+
76
T h e dissimilarity representation f o r p a t t e r n recognition
Note that if the basis of ℝ^n is changed, then the matrix of the bilinear form changes as well. If T is a transformation matrix from the basis {v_i}_{i=1}^n to the basis {w_i}_{i=1}^n, then M^w(φ) = T^T M^v(φ) T is the matrix of φ with respect to the new basis. This directly follows by substituting x by (Tx) and y by (Ty) in (x, y)_M. By introducing algebraic structures to a vector space V = ℝ^n, specific vector spaces are obtained, depending on a form of a bilinear map or of a metric. One may introduce both an inner product (., .) and an indefinite inner product (., .)_ε to the same vector space. Such inner products are naturally associated with the (indefinite) norm and the (indefinite) distance. Additional metrics or norms can also be introduced. In this way, a vector space may be explored more fully by equipping it with various structures.
A pseudo-Euclidean space ℝ^(p,q) can also be represented as a Cartesian product ℝ^p × iℝ^q. It is, thereby, a (p+q)-dimensional real subspace of the (p+q)-dimensional complex space ℂ^{p+q}, obtained by taking the real parts of the first p coordinates and the imaginary parts of the remaining q coordinates. This justifies Eqs. (2.1) and (2.2), and allows one to express the square distance as d²(x, y) = d²_{ℝ^p}(x, y) − d²_{ℝ^q}(x, y), where the distances on the right side are square Euclidean. A Euclidean space is a special case of the pseudo-Euclidean space as ℝ^p = ℝ^(p,0).
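A small sketch of the signature argument above: the numbers of positive and negative eigenvalues of a symmetric form matrix M determine the pseudo-Euclidean space it defines, and they are unchanged under a change of basis M' = T^T M T. The matrices below are illustrative assumptions, not examples from the text.

```python
import numpy as np

def signature(M, tol=1e-10):
    """Numbers (p, q) of positive and negative eigenvalues of a symmetric M."""
    w = np.linalg.eigvalsh(M)
    return int(np.sum(w > tol)), int(np.sum(w < -tol))

M = np.array([[2.0, 0.0, 1.0],
              [0.0, -1.0, 0.0],
              [1.0, 0.0, 1.0]])
print(signature(M))              # (2, 1): the form defines R^(2,1)

# The signature is unchanged by a change of basis M' = T^T M T, T nonsingular.
T = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])
print(signature(T.T @ M @ T))    # same signature (2, 1)
```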
Definition 2.92 (Isometry between pseudo-Euclidean spaces) Let (X, (., .)_{ε₁}) and (Y, (., .)_{ε₂}) be pseudo-Euclidean spaces. A mapping φ : X → Y is an isometry if (φ(x), φ(y))_{ε₂} = (x, y)_{ε₁}.
The notions of symmetric and orthogonal matrices should now be properly redefined. Since the matrix J_pq plays a key role in the definitions below, we will denote them as J-symmetric and J-orthogonal matrices to make a distinction between matrices in indefinite and traditional inner product spaces.
Definition 2.93 (J-symmetric, J-orthogonal matrices) Let A be an n×n matrix in ℝ^(p,q), n = p + q. Then
1. A is J-symmetric or J-self-adjoint if J_pq A^T J_pq = A.
2. A is J-orthogonal if J_pq A^T J_pq A = I.
A J-symmetric or J-orthogonal matrix in a pseudo-Euclidean sense is neither symmetric nor orthogonal in the Euclidean sense. If, however, ℝ^(p,q) coincides with a Euclidean space, i.e. q = 0, then the above definitions simplify to the traditional ones, as J_pq becomes the identity operator I. For instance, by straightforward operations one can check whether a given 2×2 matrix is J-symmetric or J-orthogonal in ℝ^(1,1).
Definition 2.94 (Kreh and Pontryagin spaces) a vector space K over C such that
A K r e k space is
(1) There exists a Hermitian form, an indefinite inner product (.; . ) X on K , such that the following holds for all z, y; Z E K and Q, [ ~ E C :
(a) Hermitian symmetry: (z, y ) =~ (y, x ) i , (b) Linearity over IC and sesquilinearity over a!i (z, .)X
+ P'
@I:
(ax+ /j y 3z ) =~
(y, Z ) X .
(2) Ic admits a direct orthogonal decomposition K: = K+ @ K - such that ( K + , (., .)) and ( L-(.,.)) , are Hilbert spaces'' and ( Z + , L ) ~ = 0 for any z+ E IC+ and 2- E Ic-. The space IC- is also called an antispace with respect to (., .). If K is a vector space over also Def. 2.61.
R,then (., -)K is a
symmetric bilinear form; see
It follows that K admits a fundamental decomposition with a positive subspace Ic+ and a negative subspace K - . Therefore, Ic+ = ( K - ) l . Let dimIc+ = K,+ and dimIC- = 6- be the ranks of positivity and negativity, respectively. Krein spaces with a finite rank of negativity are called Pontrgagin spaces (in other sources, e.g. [Bognar, 19741, the rank of positivity is assumed to be finite). A Pontryagin space with a finite K - is denoted by II,. Note that if (.; .)Ic is positive definite or zero for zero vectors only. then K is a Hilbert space. Example 2.17 (Pseudo-Euclidean, Kre'in and Pontryagin spaces) Let V be a vector space of real sequences ( u l , v 2 , . . .) satisfying C,"=,I ~ i jv,l j 2 < 00. Then ( 2 ,y ) = ~ Czl ~i z i yi defines an inner product. If ~1 = 1 and ~j = -1 for all j> 1, then the inner product is given ans 03 ( xy)v ~ = z1y1 zi yyi and V becomes a Pontryagin space. If ~ 2 >j 0
xi=*
IgAll Hilbert spaces discussed here are assumed to be separable, i.e. they admit countable bases.
The dissimilarity representation for pattern recognitaon
78
and & z J - 1 < O for all j,then V equipped with (z,.y)vdefines a Kre'in space. If V is a vector space of finite sequences (v1, vz, . . . , v,) and all E~ # 0, then V with (2, y)v = ~i zi yi is a pseudo-Euclidean space.
c;:,
Definition 2.95 (Fundamental projections and fundamental symmetry) Let Ic = Ic+ @ Ic-. The orthogonal projections P+ and P- onto Ic+ and Ic- , respectively, are called f u n d a m e n t a l projections. Therefore, any x E K can be represented as x = P+ x + -'J x where I K = P+ P- is the identity operator in K . The linear operator 3 = P+ - P- is called t>he fundam,ental s y m m e t r y .
+
Corollary 2.9 (Indefinite inner product by the traditional one)
(x,w)lC
=
(x,3 , ) .
In Hilbert spaces, the cla.sses of symmetric, self-adjoint, isometric and unitary operators are well known [Dunford and Schwarz, 19581. Linear operators. carrying the same names can also be defined in Krein spaces. The dcfinitions are analogous and many results from Hilbert spaces can be generalized to KreYn spaces. However, due to indefiniteness of the inner product, the classes of special properties with respect to the inner product are larger. We will only present the most important results; see [Bognjr, 1974; Iohvidov e t al., 1982; Pyatkov, 2002; Goldfarb, 1984. 19851 for details. Definition 2.96 (H-scalar product, H-norm) Let z, y t IC. The H scalar product is defined as [x,y] = (2, J ~ ) Kand the H - n o r m is 11x1i~ = [x;:x] 4.
+
Let x E Ic be represented as x = x+ x-, where x+ E Ic+ and x- E Ic-. Since [x,y] = (2, Jy)x, we can write [z, y] = (z+, y + ) K - (z-, y - ) ~= (:x+,yj+) - (-(z-,y-)) = ( z , ~ ) This . means that [x,g]is equivalent to the traditional (Hilbert) inner product and Ic+ and Ic- are orthogonal with rcspect to [z, y]. Moreover, the associated Hilbert space IFI is then such that IFI = IIcI = Ic+ @ IIc-1, where 1Ic-I stands for (K-,(., .)). Formally, there is a close 'bound' between a Krein space arid its associated Hilbert space: Lemma 2.8 A decomposable, non-degenerate i n n e r product space Ic i s u K r e k space ifl f o r every f u n d a m e n t a l s y m m e t r y J ,t h e H-scalar product t u r n s it into a Hilbert space [Bognur, 19741. H-scalar product is a Hilbert inner product, therefore Ic can be regarded as a complete Hilbert space (Banach space) with the H-scalar product (Hnorm). As a result, the (strong) topology of K: is the norm topology of the associated Banach space, i.e. the H-norm topology. This topology is
79
Spaces
simply defined by the nornis in the associated Hilbert space 1KI. does not depend on the choice of fundamental symmetry2'. Therefore. continuity, convergence and other notions can be defined for K with respect to the H-norm.
Definition 2.97 (Convergence, Cauchy sequence) (1) The sequence x, in K: converges to x E K with respect to thc H-norm iff lim7L-m(z7h,y)K= (x,y)~ for all y E K: and limn400(xfl,,.x,)K: = (x,Z ) K . (2) The sequence x ,in K is Cauchy with respect to the H-norm iff (xT1x,,xn - x,)~ 40 and ( z T L , y form ) ~ a Cauchy sequence for y € K . Corollary 2.10 Since (x,y ) =~ [x+,y+] - [ L , g-1, then (z, y ) is~ continUOTLS with respect t o the H-norm in both x aad y. Theorem 2.23 (Schwarz inequality) /1x11~ l l v l l ~holds for all r c , y E K . Proof.
(1 .+1 2
2
The inequality l(z, y ) ~ l5
l [ ~ + ? ~ + 1 F [ ~ - ~ Y2 -5l l(1 .+1 I / ~ + I / + I I ~ ~ - ll lIY -l l )2 + 11~-/12)(llY+l/2 + IlY-112) = I l ~ l l ~ l l Y l l ~ . I(.,V)Kcl
5
Definition 2.98 (Orthonormal basis in a KreYn space) Krein space. If K+ and K - are separable Hilbert spaces, then a countable orthonormal basis {ei}g1 in K such that any z uniquely written as IC = a i e i for some E r and means that
czl
(ei,e,j)lc =
{
I
Let K be a there exists E K can be lat12. This
c,"=,
1, if i = j and P+ei is an orthonormal vcctor in K+, -1, if i = j and P-e, is an orthonormal vector in IC-. 0, otherwise.
Theorem 2.24 (Orthogonal expansions) I f K + and Ic- ure sepurable Hilbert spaces, then there exists a countable orthonmrmal basis in K . For x,y E IC, one has [Boyna'r, 19741:
{e,}zl
(1)
CF1 Ib,+ I 2
< 0.
(21 -Cz(e,.e,)KC=-l I(X, 4 K 1 2 5 (3)
(Z.?/)K: 5
c : ,
(Z.Z)K
5
Ci(e,,e,)K=l I(x7 +I3.
(ez, e i ) K ( x ,e i ) K ( e t ,Y ) K .
201na Krein space, there are infinitely many fundamental decompositions, hence fundamental symmetries and, consequently, infinitely many associated Hilbert spaces. However, the decompositions yield the same ranks of positivity and negativity, the same H-norm topologies; simply, they are isomorphic.
80
T h e dissimilarity representation f o r p a t t e r n recognition
Definition 2.99 (Adjoint operator) Let C,(IC, G ) be a space of continuous linear operators from the Krein space K onto the Krein space G. If G is K:, then C,(K) will be used. Note that C,(K:) is a dual space of K .
1. A* E C,(G,K) is a unique 3 - a d j o i n t of A E C,-(K,G) if ( A x , y ) g = ( x , A * y ) ~for : all z t K and all ~ E G . 2. A E &(K) is 3-self-adjoint ( 3 - s y m m e t r i c ) if A* = A , ( A z , y ) ~= (x,Ay)x for all rc, y E K . Definition 2.100 (Isometric and unitary operators) [Alpay et al., 1997; BognBr, 19741 Let A E L,(K,G) be a continuous linear operator K: + 4. A is 3 - i s o m e t r i c if A*A = I , and 3-coisometric if AA* = I,. A E L c ( K ) is 3 - u n i t a r y if (Arc, Ay), = (rc, y ) for ~ all rc, y E K , or in other words. if it is both 3-isometric and J-coisometric. Remark 2.12 Th,e fundamental s y m m e t r y 3 fulfills 3 Hence, 3 is J-symm,etric and J-unitary.
=
J* =
3-l.
Theorem 2.25 (Factorization) [Bogncir, 19'74l Every 3 - s y m m e t r i c operator A E C,(K:) can be expressed as A = T T * , where T E C,(V, K ) f o r some K r e k space V and ker(T) = 0 . Since a Krein space is inherently connected t o its associated Hilbert space, both the 3-adjoint and 3-unitary operators can be expressed through operators in this Hilbert space. Hence, the condition ( A x ,y)g = (z, A * ~ ) Kis: equivalent to stating that ( A x ,J g ) = (z,JA*y). This is further equivalent to ( J A z ,g) = (z, JA*g), since J is self-adjoint (symmetric) with respect t o (., .) in the associated Hilbert space \ K \ . This means that in IKl, the adjoint of ( J A ) is (JA*). Let A X be a Hilbert adjoint of A. (This means that A X = AT or A X = At, depending whether the Hilbert space is over W or C.)Then (JA)X = A X J = JA* and finally
A* = J A X J . For a 3-unitary operator in a Krein space K , we have (Arc,Ay)~:= (rc, y ) ~which , is equivalent to stating that (Ax,JAy) = (x,Jy) in the associated Hilbert space. Since J is self-adjoint in 1x1, then ( ( J A ) z ,(JA)y) = ( x ?y). So, ( J A ) is a unitary operator in IKI, which means that ( J A ) " = ( J A ) - l . Then Apl = J A X J ,which is equivalent to A-l = A*. Formally, we have: Theorem 2.26 Let A E & ( K , G ) , then A E Cc,(lKl,IGl) f o r the associated If A X is a Halbert adjoint of A, t h e n A* = Hilbert spaces 1x1 and IS/. Jx A XJ G , where JK:an,d JG are the fundamental symmetries. Moreover,
IIA*IIH = / I A X / I H= IlAIIH.
Spaces
81
Definition 2.101 (Krein regular subspace) Let K be a Krein space. A Krein regular subspace of K is a subspace X which is a Krein space in the inner product of K , i.e. (x,y ) x = ( x ,y ) for ~ all 2 , Y E X . Definition 2.102 (Positive, uniformly positive subspaces) A closed or non-closed subspace V E K is positive if ( x ,x ) >~0 for all x E V and V is ,unifownly positzue if it is positive and (x,Z)K >a(/xil&for a positive cv depending on X and the associated H-norm. Similar definitions can be made for negative, uniformly negative, nonnegative etc. subspaces. The term maximal, if added, stands for a subspace which is not properly contained in another subspace with the same property. Every maximal positive (negative) subspace of a Krein space is closed. If K = K+ 6? K - is the fundamental decomposition, then the subspaces K+ and K - are maximal uniformly positive or negative, respectively. Any maximal unifornily positive or negative subspace arises in this way [Bognk, 19741.
Definition 2.103 (Positive definite operator) A J-self-adjoint operator A E C , ( K ) is positive definite ( 3 - p d ) in a Kreiri space if (x,A ~ )>K0 for all x E K . The negative definiteness ( 3 - n d ) or semi-definiteness is defined accordingly. The above condition is equivalent to 0 < (2, AZ)K = (x,JAz). This means that A is J-pd if ( J A ) is pd in the associated Hilbert space 1 K 1. For instance, the fundamental symmetry J is J-pd, since it is J-symmetric and JJ = I.
Theorem 2.27 (Projection theorem) Let V be a closed, non-degenerate subspace of a K r e k space K . Th,en f o r every z E K , there exist unique zvE V and x 1 E V' such that z = x, 21, where 2 , = Px and P is the orthogonal projection of x onto V [Bogncir, 1974; Iohvadov et al., 19821. P h,us th,e following properties:
+
1. 2.
P2
=P.
(Px, y ) ~= : (x,Py)x. (J-self-adjoint)
c?,
( P z ,( I K
4.
z = Pz
~
P)Z ) K
+ (I;c
~
= 0.
P ) z and PI( I x - P ) .
Only the first two conditions are required f o r P t o be u projection. Definition 2.104 (Gram and cross-Gram operators) Let V be a linear subspace of K: spanned by linearly independent vectors { u l , v 2 , . The Gram operator, or the inner product operator, is defined as G,,,, =
82
T h e dassimilarity representataon for pat t er n recognition
( ( ~ ~ , v , ~ ) x ),..., i ,., j = 1 Assume further that a subspace ZA C IC, spanned by ( 7 ~ 1 , I L Z , . . . , u t } . is given. Then G,[,, = ( ( u i , u j ) ~ ) i , l , . _t :_j =_l ,..., is the cross- G r a m operator.
Theorem 2.28 (Projection onto a subspace) Let V be a linear subspace of a Krein space IC spanned by the vectors (w1,112, . . . , un}. Hence, V = [ ~ q + * u 2. ., .,ti,] is the basis of V . If the Gmrn operator G,, = ( ( v i , v j ) )i,,i=l....,n ~ is nonsingular, then the orthogonal projection of x E I C on,to V is unique and given by 5,
= V G:,
g,,
(2.3)
where g , is a n n x 1 vector of the elements (x,w i ) ~ ,i = 1 , 2 , . . . , ri,. If the G r a m operator G,, is singular, then either the projection does not exist or x ,= Vz, w h ~ r ez i s a solution t o the linear system, G,, z = g,. Proof. Let J be the fundamental symmetry of K . Let x, be the projection of x onto V . Based on Theorem 2.27, X can be uniquely decomposed as z = x,, ZL, such that z, E V and x i E VL. Moreover. ( x l ~ u z=) 0.~ Hence, ( : c , ’ u , ) ~= ( x t , > t i i ) xwhich > are the elements of g , . Since the vectors ( 7 i i } are linearly independent (as the span of V ) , then there exists a , siich t,hat x,,= C/4,a p ~= , V a , where a is a colurnn vector. The elements of g, become (x,,wi)x = (Va,’ui)x,‘i = 1,.. . , 72. This gives rise to g , = V t J V a = G,, a. If G,, is nonsingular. then a can be determined uniquely as G :; g , , hence x,, = V Gzf g , . If G,, is singular then either there is no solution to the cquation g , = G,,,, a or there are many solutions. 0
+
Remark 2.13 T h e same formulation as Eq. ( 2 . 3 ) holds for a projection onto a subspace in a Hilbert space, provided that the indefinite i r i n w product (.>.)x %sreplaced b y the usual inner product (., .). In a Hilbert space, the singularity of the Gram operator G,, means that { . u ~ } are ~ = linearly ~ dependent. In the case of a Krein space, this means that V contains an isotropic vector, i.e. there exists a linear combination of {ZI~}:=~which is orthogonal to every vector in V . In other words, to avoid the singularity of the Gram operator, the subspace V niust be nondegenerate.
Remark 2.14 Since ( x , ~ i i ) x= ( x > J u i )= d J w i , then by th,e use of the Hilbert operations only, we can write that g, = V i J x and also G,, = V t J V . As a result, x, = V(VtJ’V)plVtJ’xand the projection operator P onto the subspace V i s expressed as P = V ( V t J V ) - l V t J .
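A brief sketch of the projection formula of Theorem 2.28 and Remark 2.14 in a finite-dimensional real Krein (pseudo-Euclidean) space; the fundamental matrix J, the subspace basis V and the vector x below are illustrative assumptions.

```python
import numpy as np

J = np.diag([1.0, 1.0, -1.0])           # fundamental matrix of R^(2,1)

V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 2.0]])               # columns span a non-degenerate subspace
x = np.array([2.0, -1.0, 0.5])

G = V.T @ J @ V                          # Gram operator G_vv (indefinite, nonsingular)
g = V.T @ J @ x                          # vector of inner products (x, v_i)_K
z = np.linalg.solve(G, g)                # solution of G z = g
x_v = V @ z                              # orthogonal projection of x onto V

# The residual is K-orthogonal to every basis vector of V (Theorem 2.27/2.28).
print(V.T @ J @ (x - x_v))               # approximately [0, 0]
```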
Spaces
83
Corollary 2.11 Let V = span(v1, va,. . . , v,} and Li = s p a n { u l ,u ~. ... , u,} be linear subspaces of K . Assume the Gram operator G,,, and the crossGra.m operator G,,,, = ( ( u i , u,)lc)i=l..t.j=l..,. If G,, is nonsingular, th,en by Theorem 2.28 the orthogonal projections of the elements f r o m K: onto V are given by Qv = G,, G,: V. Theorem 2.29 (Indefinite least-square problem from a Hilbert perspective)'l. Let V be a linear non-degenerate subspace of a Krefn space K spanned by the vectors { u l , v2,.. . , vn}. T h e n for the basis V = [2/1.v2,. . . ,un] of V and f o r u E K , the function, F ( x ) = IIu - Vxli$ reuch,es its m i n i m u m iff G,,, = V t J V is positive de,fin,ite in, a Hilbert sense in the uicinity of xS2'. x, is the sought solution such that 5 , = G;Jg, and g , = VtJu. Otherwise, n o solution exists.
+
2 x t V t J u z t V tJVz . From mathematProof. Jju- Vxllg = utJu ical analysis [Birkholc, 1986; Fichtenholz, 19971, .z', is a stationary point of F ( x ) if the gradient VFlz=zs = 0. By a straightforward differentiation of F , one gets 2VtJVx - 2 V t J u = 0, hence VtJVx, = V t J u . Since V is non-degenerate, then G :; exists. Therefore, by Remark 2.14, the potential solution is given as 2, = (VtJV)plVtJu = Gzfg,. Traditionally, the stationary point x, is a unique minimum iff the nxn>Hessian is positive definite in a Hilbcrt sense. The Hessian H = a2F H = 2 Vt JV = 2 Gu,. Since the matrix of indefinite inner products equals G,, is generally not positive definite, H 2 G,, is also not. Consequently, 2 , cannot be a global minimum. However, H is positive definite at the point 5 , . Observe that zLHz, = u ~ J V ( V ~ J V ) - ~ = V u~tJJ P Uu , where P is the projection matrix onto the space spanned by the column vectors of V ;see Remark 2.14. By Theorem 2.27, P is J-self-adjoint, hence JP is ~
(-)&=lz5
21For comparison, an equivalent formulation is given for the Hilbert case: (Least-square problem in a Hilbert space) Let V = span(v1, vz,. . . , w,,} be a linear subspace of a Hilbert space 'h and let V = [v]vz . . . v,]. Then for u E 7-1, the norm F ( z ) = IIu V t ~ 1 1is~minimized for z such t h a t , V z = u,, i.e. z is the orthogonal projection of 1~ onto V . The unique solution is zs = Gr:gu, where G,, if the Gram matrix (in a Hilbert space) and g, is a nxl vector of the elements ( u , v z ) , for i = 1 , 2 , . . . , n. roof. 1 ( u- V Z 1'~ = 1 / u - un u, - Vzl1' = ( /u- uu1 j 2 I (u, - V Z /1;' since (u - u v , u o V z ) = 0. From Theorem 2.28, we know that the projection of u onto V is unique and it is given by uz,= VGL: g,. F ( z ) is then minimized for llu?,- V z / I 2 = IlVGr; g , - V s / J 2being equal t o zero, if the sought solution is x g = G,;,'g,. 22From a Hilbert point of view, the minimum of F cannot be found for a n arbitrary indefinite space. Assume, for instance a K r e h space K = W(',') with the indefinite norm = z: - 'c;. Then for a particular z = [l z2], the minimum of 110 - ziig = 1 - xg is reached at --oo ~
+
~
Ilzllk
+
84
T h e dissimilarity representation f o r pattern recognition
positive definite in the Hilbert space lKI. Therefore, x i H x , = u t J P u holds for any U E K ,which means that H is positive definite at 2 , .
>0 0
Below, we present an interpretation of the indefinite least-square problem, but from the indefinite point of view. The solution does not change, however, the interpretation does:
Proposition 2.2 (Minimum in the Kre'in sense) L e t K be a K r e h space over t h e field F and let f (x)= Ilb - Axil: be a, quadratic f u n c t i o n in K . T h e minimum o,f f in K i s a special saddle p o i n t xs in t h e associated Hllbert space JKl. T h i s space i s specified by t h e indefiniteness of J. T h i s m e a n s t h a t f Ix+ takes t h e minimum a t x , ~ , + a n d f 1 ~ - takes t h e m a x i m u m a t x,,_, iihere x,,+ and x,,_ are t h e f u n d a m e n t a l projections of x, E K o n t o K+ and K - , respectively.
+
+
Proof. Givcn that J = P+ (-P-), we have: f ( x ) = f + ( x ) f - ( x ) , whcre f+(x) = ztAtP+Ax - 2xtAtP+b b t P + b and f _ ( x ) = -(xtAtP_Ax 2xtAtP-b btP_b) are the restrictions of .f to Ic+ and K - , respectively. As f+ and f - are defined in the complementary subspaces (L is the orthogonal coniplement of K + ) , the minimum of f is realized by determining x,,+ for which f+ reaches its minimurn and finding :I;,,- for which f - reaches its maximum. The final solution is then 2, = z,,+ x , ~ , -(this is due to K being the direct orthogonal sum of K+ and K - ) . The critical points are the ones for which the gradients of f+ and f - are zero. This leads to x,,+ = (iltP+A)_lAtP+b and x;,,_ = (AtP-A)-lAtP_b. The Hessian matrices become, correspondingly, H+ = 2AtP+A and H - = -2AtP-A. Thanks to the properties of projection operators, P+ = Pip+ and P_ = Pip-, Theorem 2.27, one has H+ = 2 (P+A)t(P+A),which is positive definite by the construction, and H - = -2 (P_A)t(P-A), which is negative definite. Hence, f+ reaches its 0 rnininium for x,>+ and f- reaches its maximum for zs,-. ~
+
+
+
Theorem 2.30 (Indefinite least-square problem) L e t V be a linear non-degenerate su.bspace of a K r e i n space K spanned by th,e vectors ,vn] as t h e basis of V . Th,en for 712, (711, 112,. . . , u , , ~ } .D e n o t e V = [UI, the f u n c t i o n F ( x ) = liu, - Vxllg i s m i n i m i z e d in t h e Kreiiz sense for x, being t h e orthogonal projection o f 'u on,to V . T h e u n i q u e solution i s f o u n d 0,s x, = G;Jgu.
Proof. Similarly as in the proof above, we have: IIu - Vxllc = u + J u2 ztVtJu+xtVtJVz. x, is a stationary point of F ( z ) if the V7Fl,=,3 = 0. This leads to the equation VtJVz, = VtJu. By Remark 2.14, the solution
Spaces
85
is then given as x, = G,;i,lg,. We require that the Hessian, equal to 2 V t J V , is indefinite with the indefiniteness specified by J.This holds as VtP+V is positive definite in a Hilbert space K+, hence z , , ~ +yields a niiriimurn there and -VtP-V is negative definite in a Hilbert space IK-1, hence x S , ~ 0 yields a maximum there; see Proposition 2 . 2 . Remark 2.15 Note that the system of linear eguation,s V t J V x = V t J u solved in an inde5nite least-square problem can be expressed as Q'Qz = Q*u, where Q = V and Q* = V t J . This can be interpreted as a system of normal equations i n a Krefn space. Consequently, G;JVtJ is a pseudoinverse o f V in this space. 2.7.1
Reproducing kernel Krez'n spaces
Reproducing kernel Krein spaces (RKKS) are natural extensions of rcproducing kernel Hilbert spaces (RKHS). The basic intuition here relies on the fact that a Krein space is composed as a direct orthogonal siini of two Hilbcrt spaces. hence the reproducing property of the Hilbert kernels can he extended to a KreYn space, basically by constructing two reproducing Hilbert kernels and combining them in a usual way. We will present facts on reproducing kernel Pontryagin spaces (RKPS), which are Krein spaces with a finite rank of negativity (in other sources, e.g. [BognAr, 19741: a rank of positivity is assumed to be finite). Here, we will only present the most important issues, for details and proofs, see [Alpay et al., 19971 and also the articles [Constantinescu and Gheondea, 2001; Dritschel and Rovnyak, 1996; Rovnyak, 19991. All Hilbert spaces associated to Krein spaces are considered to be separable. Definition 2.105 (Hermitian kernel) Let X be a Krein space. A function K defined on X x X 4 CC of coritinuous linear operators in a Krein space X , is called a Hermitian kernel if K ( z ,y) = K ( z ,y)* for all 2 , y E X . K(z,y) has K negative squares, where K is a nonnegative integer, if every matrix { K ( x i , ~ ~ ) }based ? ~ =on ~ { Q , x ~ ., . . , xn} E X and n = 1 , 2 , . . . has at most K negative eigenvalues and there exists at least one such a matrix that has exactly K negative eigenvalues. Lemma 2.9 Let IIK be a Pontryagin space and let 2 1 , 2 2 , . . . 2, E IT,. The Gram operator G = ( ( ~ i , x j ) n J & = cwn ~ have n o more than K negative eigenvalues. Every total set in ITK contains a finite subset whose Gram matrix has exactly K negative eigenvalues [Alpav ct al., 19971.
86
T h e dissimilarity representation f o r p a t t e r n recognition
Lemma 2.10 Let 5 1 , 5 2 , ,zn,belong t o a n inner product space ( K . (.. . ) K c ) . T h e n the n,umber of negative eigenvalues of the G r a m operator G = ((xi,z j ) ~ ) ~ j coincides ,l with the dimension of the maximal negative subspace of s p a n ( x 1 , . . . , z T L[Alpay } et al., 19971. Definition 2.106 (Reproducing kernel Kreln space) Let X be a KreTn space arid let C X be a space of functions f : X + CC. Assunie K K c C X is a Kreiri space of continuous linear functionals on X . A Hermitian fiinctiori K : X X X + C is a reproducing kernel K K if (1) K ( X ; ) E K K for all Z E X and (2) K ( z . .) is the representer of evaluation at z in K K : ( f , K ( z , . ) ) K ,for all ~ E K and K all (fixed) Z E X .
,f(x) =
K K equipped with K is a reproducing kernel Krein space (RKKS). If K K is a Poritryagin space, then the resulting space of functions is called a reproducin,g kxrnel Poritryagin space (RKPS).
Corollary 2.12 Let K = Lc(X,CC) be a K r e f n space of continuous linear f u n c t i o n d s defined over the dom,ain X. If the eiJaluation functional 6,, 6 , r [ f ]= f (z) is defined and continuous for every z E X , t h e n K is a RKKS. Hence, there exists K ( z , . )E K such that 6, : z K ( z ; . )or 6 J f ] = ,f (x) = ( K ( x ,.), ,f ( . ) ) K . Therefore, the reproducing kernel is unique arid can he written as K ( z :y ) = 6, 6;, where 6; EL,-(@,K K ) is the 3-adjoint of the evaluation mapping E ( x ) for any fixed z E X . Similarly to the Hilbert case, m e has ( K ( z ,.), K(y, . ) ) K ~= K ( z ,y ) . In the case of the Pontryagin space, K ( z , y ) has at most /c. negative squares, Def. 2.105, where m is the rank of negativity. --f
Theorem 2.31 (On reproducing kernels) [Rovnyak, 19991 Let K ( z ,y) be a Hermitian kernel X X X+ C. T h e following assertions are equivalent:
1. K ( x ;y ) is a reproducing kernel for some K r e f n space K K consisting of functions over the domaisn X . 2. K ( z ,y) has a nonnegative majorane3 L ( s ,y) o n X X X . 3. K ( z .y) = K+(x,y ) - lip(.,y) for some nonnegative definite kernels K+ and lipo n X X X .
If the above holds, then for a given nonnegative majorant L ( z , y ) for K ( x ,y), there exists a K r e f n space K K with a reproducing kernel K ( x ,y ) , 23A nonnegative majorant L for K is a nonnegative definite kernel L such that L - K and &K are nonnegative definite kernels in the ‘Hilbert’sense, i.e. according to Def. 2.82.
Spaces
87
which as continuously contained in the Hi,lbert space XL with, the reprodli~cin.9 kernel L ( x . y).
+
Note that L(z,y) can be chosen as K+(s,y) K-(z.y). Note also that the consequence of this theorem is that the decomposition K ( x ,y ) = K + ( z ,y)-K-(x, y) can be realized such that K+ is a reproducing (Hilbert) kernel for ( K K ) + and K- is a reproducing (Hilbert) kernel for I(ICI
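On a finite sample, a decomposition of this kind can be realized by an eigendecomposition of the kernel matrix, splitting the positive and negative eigenvalues; the sketch below, with an illustrative indefinite kernel matrix assumed for the example, follows that idea rather than any particular construction from the text.

```python
import numpy as np

def split_kernel(K):
    """Split a symmetric kernel matrix into K = K_plus - K_minus, both PSD."""
    w, U = np.linalg.eigh(K)
    K_plus = (U * np.clip(w, 0, None)) @ U.T
    K_minus = (U * np.clip(-w, 0, None)) @ U.T
    return K_plus, K_minus

# An indefinite kernel matrix on a sample of three objects (illustrative values).
K = np.array([[1.0, 0.9, -0.4],
              [0.9, 1.0, 0.2],
              [-0.4, 0.2, 1.0]])
K_plus, K_minus = split_kernel(K)

assert np.allclose(K, K_plus - K_minus)
assert np.all(np.linalg.eigvalsh(K_plus) >= -1e-12)    # nonnegative definite
assert np.all(np.linalg.eigvalsh(K_minus) >= -1e-12)   # nonnegative definite
```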