DATA HANDLING IN SCIENCE AND TECHNOLOGY
- VOLUME 9
Multivariate Pattern Recognition in Chemometrics
DATA HANDLING IN SCIENCE AND TECHNOLOGY
Advisory Editors: B.G.M. Vandeginste and O.M. Kvalheim

Other volumes in this series:
Volume 1 Microprocessor Programming and Applications for Scientists and Engineers by R.R. Smardzewski
Volume 2 Chemometrics: A Textbook by D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman
Volume 3 Experimental Design: A Chemometric Approach by S.N. Deming and S.L. Morgan
Volume 4 Advanced Scientific Computing in BASIC with Applications in Chemistry, Biology and Pharmacology by P. Valko and S. Vajda
Volume 5 PCs for Chemists, edited by J. Zupan
Volume 6 Scientific Computing and Automation (Europe) 1990, Proceedings of the Scientific Computing and Automation (Europe) Conference, 12-15 June, 1990, Maastricht, The Netherlands, edited by E.J. Karjalainen
Volume 7 Receptor Modeling for Air Quality Management, edited by P.K. Hopke
Volume 8 Design and Optimization in Organic Synthesis by R. Carlson
Volume 9 Multivariate Pattern Recognition in Chemometrics, illustrated by case studies, edited by R.G. Brereton
DATA HANDLING IN SCIENCE AND TECHNOLOGY - VOLUME 9 Advisory Editors: B.G.M. Vandeginste and O.M. Kvalheim
Multivariate pattern recognition in chemometrics, illustrated by case studies, edited by RICHARD G. BRERETON, School of Chemistry, University of Bristol, Cantock's Close, Bristol BS8 1TS, U.K.
ELSEVIER Amsterdam
- London - New York - Tokyo
1992
ELSEVIER SCIENCE PUBLISHERS B.V. Sara Burgerhartstraat 25 P.O. Box 211,1000 AE Amsterdam, The Netherlands
ISBN: 0-444-89783-6 (hardbound), 0-444-89784-4 (paperback), 0-444-89785-2 (software supplement), 0-444-89786-0 (5-pack paperback + software supplement)
© 1992 Elsevier Science Publishers B.V. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher, Elsevier Science Publishers B.V., Copyright & Permissions Department, P.O. Box 521,1000 AM Amsterdam, The Netherlands. Special regulations for readers in the USA - This publication has been registered with the Copyright Clearance Center Inc. (CCC), Salem, Massachusetts. Information can be obtained from the CCC about conditions under which photocopies of parts of this publication may be made in the USA. All other copyright questions, including photocopying outside of the USA, should be referred to the copyright owner, Elsevier Science Publishers B.V., unless otherwise specified. No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. This book is printed on acid-free paper. Printed in The Netherlands
CONTENTS

ACKNOWLEDGEMENTS
CONTRIBUTORS
INTRODUCTION (R.G. Brereton)
References

CHAPTER 1 INTRODUCTION TO MULTIVARIATE SPACE (P.J. Lewi)
1 Introduction
2 Matrices
3 Multivariate space
4 Dimension and rank
5 Matrix product
6 Vectors as one-dimensional matrices
7 Unit matrix as a frame of multivariate space
8 Product of a matrix with a vector. Projection of points upon a single axis
9 Multiple linear regression (MLR) as a projection of points upon an axis
10 Linear discriminant analysis (LDA) as a projection of points on an axis
11 Product of a matrix with a two-column matrix. Projection of points upon a plane
12 Product of two matrices as a rotation of points in multivariate space
13 Factor rotation
14 Factor data analysis
15 References
ANSWERS

CHAPTER 2 MULTIVARIATE DATA DISPLAY (P.J. Lewi)
1 Introduction
2 Basic methods of factor data analysis
3 Choice of a particular display method
4 SPECTRAMAP program
5 The neuroleptics case
6 Principal components analysis (PCA) with standardization
7 Principal components analysis (PCA) with logarithms
8 Correspondence factor analysis (CFA)
9 Spectral map analysis (SMA)
10 References
ANSWERS

CHAPTER 3 VECTORS AND MATRICES: BASIC MATRIX ALGEBRA (N. Bratchell)
1 Introduction
2 The data matrix
3 Vector representation
4 Vector manipulation
5 Matrices
6 Statistical equivalents
7 References
ANSWERS

CHAPTER 4 THE MATHEMATICS OF PATTERN RECOGNITION (N. Bratchell)
1 Introduction
2 Rotation and projection
3 Dimensionality
4 Expressing the information in the data
5 Decomposition of data
6 Final comments
7 References
ANSWERS

CHAPTER 5 DATA REDUCTION USING PRINCIPAL COMPONENTS ANALYSIS (J.M. Deane)
1 Introduction
2 Principal components analysis
3 Data reduction by dimensionality reduction
4 Data reduction by variable reduction
5 Conclusions
6 References
ANSWERS

CHAPTER 6 CLUSTER ANALYSIS (N. Bratchell)
1 Introduction
2 Two problems
3 Visual inspection
4 Measurement of distance and similarity
5 Hierarchical methods
6 Optimization-partitioning methods
7 Conclusions
8 References
ANSWERS

CHAPTER 7 SIMCA - CLASSIFICATION BY MEANS OF DISJOINT CROSS VALIDATED PRINCIPAL COMPONENTS MODELS (O.M. Kvalheim and T.V. Karstang)
1 Introduction
2 Distance, variance and covariance
3 The principal component model
4 Unsupervised principal component modelling
5 Supervised principal component modelling using cross validation
6 Cross validated principal component models
7 The SIMCA model
8 Classification of new samples to a class model
9 Communality and modelling power
10 Discriminatory ability of variables
11 Separation between classes
12 Detection of outliers
13 Data reduction by means of relevance
14 Conclusion
15 Acknowledgements
16 References
ANSWERS

CHAPTER 8 HARD MODELLING IN SUPERVISED PATTERN RECOGNITION (D. Coomans and D.L. Massart)
1 Introduction
2 The data set
3 Geometric representation
4 Classification rule
5 Deterministic pattern recognition
6 Probabilistic pattern recognition
7 Final remarks
8 References
ANSWERS

SOFTWARE APPENDICES
SPECTRAMAP (P.J. Lewi)
1 Installation of the program
2 Execution of the program
3 Tutorial cases

SIRIUS (O.M. Kvalheim and T.V. Karstang)
1 Introduction
2 Starting SIRIUS
3 The data table
4 Defining, selecting and storing a class
5 Principal component modelling
6 Variance decomposition plots and other graphic representations
7 Summary
8 Acknowledgements
9 References

INDEX
ACKNOWLEDGEMENTS
The events leading to the publication of this text have benefited from the support and collaboration of many colleagues and many organisations. Sara Bladon has worked in Bristol on the final phases of layout and desktop publishing. Much of the material in this book was first presented at courses organized by the University of Bristol. Support for these courses has come from various sources including the Health and Safety Executive, IBM, COMETT and Bristol's Continuing Education Department. The development of good tutorial material requires resources and substantial commitment throughout a protracted period, and we are grateful for continued interest in and support for these courses. Richard G. Brereton
Bristol (March 1992)
CONTRIBUTORS
Nicholas Bratchell, Pfizer Central Research, Sandwich, Kent, U.K.
Olav M. Kvalheim, Department of Chemistry, University of Bergen, N-5007 Bergen, Norway
Richard G. Brereton, School of Chemistry, University of Bristol, Cantock's Close, Bristol BS8 1TS, U.K.
Terje V. Karstang, Department of Chemistry, University of Bergen, N-5007 Bergen, Norway
Danny Coomans, Department of Mathematics and Statistics, James Cook University of North Queensland, Townsville, Q4811, Australia
John M. Deane, AFRC Institute of Food Research, Bristol Laboratory, Langford, Bristol BS18 7DY, U.K.
Paul J. Lewi, Janssen Research Foundation, B-2340 Beerse, Belgium
D. Luc Massart, Farmaceutisch Instituut, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium
INTRODUCTION
Richard G. Brereton, School of Chemistry, University of Bristol, Cantock's Close, Bristol BS8 1TS, U.K.
When analyzing projects, one of the first steps should be to examine the factors that influence the decision to start the project, and the directions in which the project progresses. Most of the advances in modern science arise not from single individuals sitting in rooms having new ideas, but from economic, managerial, social and even political influences. So it is worth charting the history of this text in order to understand better why this work has evolved in the way it has done. Much has been said about the history of chemometrics, and no single individual or group of individuals can present a totally unbiased view. But my feeling is that chemometrics has gone through two phases. The first phase, from the early 1970s to the mid 1980s, was a time when a small community of enthusiasts developed the foundations of the subject. The application of techniques such as principal components analysis, supervised and unsupervised pattern recognition and so on was relatively difficult because of the limitations in user-friendly software and because information was relatively hard to come by for the non-expert. The number of people calling themselves chemometricians was small, although the number of people working on approaches now well recognised as chemometric was much larger. Not everyone in this field called themselves a chemometrician in these early days, and one of the confusions when speaking about the origins of chemometrics is whether we are talking about the origin of the name or of the discipline. The first people to label themselves chemometricians worked mainly in Scandinavia and in the U.S. In the Benelux countries, this term was slow to take off, yet two of the earliest texts [1,2] to encompass what we would now call chemometrics were written by groups in the Benelux. In the U.K., the term has been even slower to take off, yet the breadth and depth of this activity has been for many years very strong. The mid 1980s marked events that were to catalyze a widening of the chemometric community and a general acceptance of the term - the subject entered a new, and
second phase. Probably the first major event to bring people together was the NATO Advanced Study Institute in Cosenza, Italy, in 1983 [3]. Several key events followed rapidly from then, namely the founding of the journals Chemometrics and Intelligent Laboratory Systems (Elsevier) and Journal of Chemometrics (Wiley), the publishing of two comprehensive texts on the subject [4,5] and the development of national chemometrics groups such as the UK Chemometrics Discussion Group. I was fortunate enough to have an involvement in some of these events and it is through the rapid developments during this new phase that the foundation stones for this book were laid. As an initiative in 1987, the University of Bristol organized the first Spring School in Chemometrics, and both N. Bratchell and J. Deane, at the time working at the AFRC Institute of Food Research at Langford in Bristol, and myself, among others, contributed to this course. A feature was the large volume of course notes. In 1988 Bristol organized the first COMETT School of Chemometrics. Links with D.L. Massart via the journal Chemometrics and Intelligent Laboratory Systems allowed us to enhance our program with outside speakers, and both P.J. Lewi and D.L. Massart contributed to this course. P.J. Lewi also introduced the package SPECTRAMAP into the tutorial part of the course. The European School in Chemometrics, organized by the University of Bristol, has continued as an annual event, has allowed new tutors to contribute, and has also been a fertile testing ground for new tutorial software. At around this time, also, great interest focussed on developments in the journal Chemometrics and Intelligent Laboratory Systems, especially the tutorial section [6]. There was seen to be a need to extend these ideas beyond the publishing of straight articles, to the publishing of open learning material, backed up, when appropriate, with software. Hence discussions took place over the possibility of publishing some of the material in the form of a text. At around this stage O.M. Kvalheim, whose contacts with myself and Elsevier began in earnest after the Ulvik workshop, whose proceedings were published in Chemometrics and Intelligent Laboratory Systems [7], and who was developing tutorial material around the SIRIUS package [8], became involved in this project. His material was presented in the 1989 course, and widely tested out in courses both in Scandinavia and the U.K.
So the team emerged through a common interest in developing new and front-line tutorial material in chemometrics. Perhaps it is necessary to ask why we should publish new material in this area. This book is not a collection of review articles, and so serves a different purpose from journals. It is not a proceedings. The book is one of reference and training for the professional chemometrician. We will see a much greater need for the chemometrician in the future. Industry, in particular, has an increasing need for
people who have a good, in-depth understanding of this area. Chemometrics is a little like computing. A team will consist of one or two people (systems managers or programmers) with good, in-depth knowledge of the subject, another group with some technical knowledge, and a final group that simply uses packaged software. This hierarchy of knowledge is important in the construction of modern interdisciplinary teams. Existing texts aim either at the middle [4,5] or low [9] ground - this one aims at the high ground. Despite this depth of treatment, the book is tutorial in nature, and breaks new ground in the chemometrics community. The book is structured around questions and answers, involving simple worked examples. Also some of the chapters are supported by tutorial software packages, and these are available as a supplement to the text. Users without this software should be able to gain most information from the book, but it is hard to deal with large multivariate data sets without some software. This book, for reasons of length, is restricted to pattern recognition. The first two chapters provide a general, largely geometric, view of pattern recognition. Chapters 3 and 4 introduce mathematical formalism. Chapter 5 explores further the topic of principal components analysis. Chapters 6 to 8 discuss ways to classify groups of objects, namely unsupervised approaches (cluster analysis), soft modelling (exemplified by SIMCA) and hard modelling. Related areas of calibration and factor analysis have not been included, although brief introductions to these topics are included in Chapter 1. The reader is referred to two recent books in these areas [10,11] for a more in-depth treatment. One feature of this book is that different authors use different notation. Although this may, at first, be confusing, surveys from courses suggest that people actually find different approaches to similar problems complementary, and I have resisted the temptation to edit out this diversity. It is essential that the user of methods for pattern recognition recognises that there are a large number of methods and approaches to choose from, and there is no one "correct" way that is automatically superior to others. He or she must be willing to consider different viewpoints, and to read papers written by people from different groups and with different backgrounds. What is necessary is to be able to translate from one approach to another, and to understand the underlying similarity between different methods. Therefore, much attention has been paid, in this book, to cross-referencing between different sections. Important concepts such as singular value decomposition or data pretreatment are introduced time and time again by different authors, often under different guises and for a diversity of purposes. It is hoped that this extensive effort is worthwhile.
One important difference between chemometrics and many more traditional areas is that people come into the subject with different levels of prior knowledge and different expectations. A good text must attempt to encompass this variety of readers. In this book, it is not necessary to read chapters in a linear fashion, beginning from Chapter 1 and finishing at Chapter 8. It is possible to start anywhere in the book and refer back (or forward) to other topics if required. A final and important point is that the team of authors has a substantial collective
writing and editing experience. R.G. Brereton is the author of one text in chemometrics [9]. D.L. Massart has been an author of three books in areas covered by this book [1,5,12]. P.J. Lewi has written a book on multivariate analysis in industrial practice [13], and D. Coomans on potential pattern recognition [14]. Most authors have written major reviews or tutorial papers in this general area [15-19]. Editorial duties include D.L. Massart (Chemometrics and Intelligent Laboratory Systems), O.M. Kvalheim (Data Handling in Science and Technology), P.J. Lewi (Computational Data Analysis) and R.G. Brereton (Chemometrics and Intelligent Laboratory Systems). Finally, innumerable conference proceedings, newsletters, sections of journals etc. have been edited by various members of the team. This strong group experience has done much to influence the final product. It is hoped that this book will be of service to the chemometrics community.
References
1. D.L. Massart, A. Dijkstra and L. Kaufman, Evaluation and Optimization of Laboratory Methods and Analytical Procedures, Elsevier, Amsterdam, (1978).
2. G. Kateman and F.J. Pijpers, Quality Control in Analytical Chemistry, Wiley, New York, (1981).
3. B.R. Kowalski (editor), Chemometrics: Mathematics and Statistics in Chemistry, Reidel, Dordrecht, (1984).
4. M.A. Sharaf, D.L. Illman and B.R. Kowalski, Chemometrics, Wiley, New York, (1986).
5. D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman, Chemometrics: a Textbook, Elsevier, Amsterdam, (1988).
6. D.L. Massart, R.G. Brereton, R.E. Dessy, P.K. Hopke, C.H. Spiegelman and W. Wegscheider, Chemometrics Tutorials, Elsevier, Amsterdam, (1990).
7. Chemometrics and Intelligent Laboratory Systems, 2 (1987), 1-244.
8. O.M. Kvalheim and T.V. Karstang, A General-Purpose Program for Multivariate Data Analysis, Chemometrics and Intelligent Laboratory Systems, 2 (1987), 235-237.
9. R.G. Brereton, Chemometrics: Applications of Mathematics and Statistics to Laboratory Systems, Ellis Horwood, Chichester, (1990).
10. E. Malinowski, Factor Analysis in Chemistry: Second Edition, Wiley, Chichester, (1991).
11. H. Martens and T. Næs, Multivariate Calibration, Wiley, Chichester, (1989).
12. D.L. Massart and L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, Wiley, New York, (1983).
13. P.J. Lewi, Multivariate Analysis in Industrial Practice, Research Studies Press / Wiley, Chichester, (1982).
14. D. Coomans and I. Broeckaert, Potential Pattern Recognition, Research Studies Press / Wiley, Chichester, (1986).
15. R.G. Brereton, Chemometrics in Analytical Chemistry: a Review, Analyst, 112 (1987), 1635-1657.
16. N. Bratchell, Cluster Analysis, Chemometrics and Intelligent Laboratory Systems, 6 (1989), 105-125.
17. P.J. Lewi, Spectral Map Analysis: Factorial Analysis of Contrasts, especially from Log Ratios, Chemometrics and Intelligent Laboratory Systems, 5 (1989), 105-116.
18. A. Thielemans, P.J. Lewi and D.L. Massart, Similarities and Differences among Multivariate Display Techniques illustrated by Belgian Cancer Mortality Distribution Data, Chemometrics and Intelligent Laboratory Systems, 3 (1988), 277-300.
19. O.M. Kvalheim and T.V. Karstang, Interpretation of Latent Variable Regression Models, Chemometrics and Intelligent Laboratory Systems, 7 (1989), 39-51.
CHAPTER 1
Introduction to Multivariate Space
Paul J. Lewi, Janssen Research Foundation, B-2340 Beerse, Belgium
1 Introduction
The concept of multivariate space is important for the understanding of data analysis and pattern recognition. We intend to cover this topic from a geometrical rather than an algebraic point of view. We assume that some aspect of reality can be described numerically in the form of a matrix, e.g. by means of measurements or observations made on a collection of objects or subjects. (For convenience we will only use the terms measurement and object. But these have to be understood in a broad sense.) The same aspect of reality can also be described spatially in the form of a pattern of points in multivariate space. We intend to show the analogy between operations on matrices (especially multiplication) and the corresponding operations in multivariate space (especially projection and rotation of the pattern of points). Some of these operations have widespread and important applications in regression, discrimination and correlation, in classification, in reduction of dimensionality and in the search for meaningful structure in data.

2 Matrices
According to the dictionary, a matrix is a medium into which something else is embedded or developed. (The Latin word matrix refers to the womb of an animal.) In pharmaceutical technology it refers to the inert substance in which active ingredients are dispersed. In our context, a matrix is a rectangular arrangement of numbers divided horizontally and vertically into rows and columns. A row can be visualized as a horizontal line-up of people. A column has obvious connotations with a vertical pillar. The number of rows and the number of columns are the dimensions of the matrix. Their product equals the number of cells in the matrix.
We might have used the more colloquial name table instead of the mathematically-sounding term matrix. But this leads us into semantic problems when discussing operations on matrices. Tabulation, however, is the proper word for collecting numbers into a matrix. It refers to the practice of 13th-century accountants who used to chalk their numbers on the tables at which they were seated. Modern accountants now prefer the term spreadsheet, electronic or otherwise, for their arrangement of numbers into matrices.
In relational data base terminology, a matrix is referred to as a relation. This makes good sense, as the numbers in the cells of a matrix define the relation between its rows and columns. As will be apparent later on, the relation between rows and columns defines a pattern of points in multivariate space. The numbers in the cells of a matrix can be generated in several ways. In the laboratory situation one may deal with objects and measurements made on them, e.g. absorption of light at different wavelengths. In this case, all measurements are expressed in the same unit. In other applications, the measurements may be expressed in different units, e.g. physico-chemical characteristics, such as specific gravity, melting point, heat capacity, etc. It is important to make the distinction between these two types of data in view of the operations that can be carried out on a matrix. In the former case, one can compute the average absorption over the various wavelengths. In the latter case one cannot meaningfully add or average measurements expressed with different units. We distinguish between data expressed with the same unit and those expressed with different units. A special type of data arises in surveys where people are classified according to two different categories, e.g. geographical locations and classes of occupation. In this case, the resulting matrix is also called a contingency table, a term which appears commonly in statistical treatises. In a contingency table, all numbers represent counts that can be added meaningfully into totals by rows and by columns, e.g. totals by location and totals by class of occupation. These are called the marginal totals, since they are usually appended at the right side and at the bottom of the matrix. Contingency tables are also referred to as two-way tables or cross-tabulations. Any given quantity can be cross-tabulated according to two categories, e.g. a country's gross national production in billion dollars according to geographical locations and branches of industry. The results are presented in the form of a two-way table or cross-tabulation. In this chapter, we assume that the data in the matrix have been generated from several measurements on a variety of objects.
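As a small numerical illustration of the marginal totals of a contingency table, the sketch below (not from the book; the counts and categories are invented and the NumPy library is assumed) sums a cross-tabulation by rows and by columns:

```python
import numpy as np

# Invented cross-tabulation: counts of people for 3 locations (rows)
# and 2 classes of occupation (columns).
counts = np.array([[120,  80],
                   [ 60,  40],
                   [200, 100]])

row_totals = counts.sum(axis=1)    # marginal totals by location
col_totals = counts.sum(axis=0)    # marginal totals by class of occupation
grand_total = counts.sum()

print(row_totals)     # [200 100 300]
print(col_totals)     # [380 220]
print(grand_total)    # 600
```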
Q.1 By way of illustration we consider the determination of the contents of 20 common amino acids in 50 different proteins. Here, we have 50 objects or rows (proteins) and 20 measurements or columns (amino acid contents). In this table all data are expressed in the same unit (percentage of total protein mass). Interpret the sums of the rows and the sums of the columns of this matrix.
Nowhere in our discussion have we assigned a distinct role to the rows and the columns of a matrix. It is only by convention that rows are traditionally assigned to objects and columns to measurements. One reason for this convention is that before the advent of automated measuring and recording equipment good measurements were hard to come by. So most matrices contained many more objects than measurements. In this case, the assignment of objects to rows of the matrix is most convenient for printing and reading. But in theory, one can assign objects and measurements arbitrarily to rows or columns of the matrix. For practical reasons, however, we adhere to the traditional convention. Each row of a matrix can thus be seen as the description of an object in terms of the measurements made upon it. Using our illustration, a given protein is described by the matrix in terms of its composition of 20 amino acids. (Of course, a protein possesses many other attributes. But these are left out of consideration here.) Complementarily, each measurement in the matrix is defined by its results from the various objects. Indeed, each column of a matrix can be seen as the description of a measurement in terms of the objects subjected to it. In our illustration, a given amino acid is described by the matrix in terms of its contributions to the total content of the 50 proteins. (Evidently, amino acid determinations have been made on more than these 50 proteins. But again, this is not taken into consideration here.) As we have pointed out above, the roles assigned to rows and columns in a matrix are defined arbitrarily. They can be interchanged depending on our particular point of view. In the practical situation one's interest may be the study of differences between proteins. In the more theoretical context one may focus on differences between amino acids. We consider both points of view to be complementary. The interchange of the roles of rows and columns of a matrix is called transposition of the matrix. It is an important concept for our understanding of multivariate space.
One may distinguish between the row of a matrix (which is a horizontal arrangement of numbers) and the object referred to by it, which is called a row-item. Similarly one can distinguish a column of a matrix (which is a vertical arrangement of numbers) and the measurement associated with it, which is the corresponding column-item. For convenience, however, we use the terms row and column to denote both the horizontal and vertical subdivisions of a matrix and the items they refer to.
3 Multivariate space
The relationship between algebra and geometry was first reported by René Descartes [1]. His objective was to translate geometrical problems about curves into algebraic ones, such that he could study them more conveniently. A plane curve can be described by means of the projection of successive points upon two appropriately chosen axes. These axes span the plane of the curve. The distances of the projections on these axes from an arbitrarily chosen origin are called the coordinates of the points. Hence the names coordinate axes and coordinate geometry. The operation of projecting points on coordinate axes generates a two-column matrix in which the rows represent the points along the curve and in which the columns correspond with the coordinate axes. Each row in such a table contains the coordinates of the corresponding point. This way, we have translated a pattern in two-dimensional space into a two-column matrix. It should be noted at this stage that the translation can be achieved in an infinite number of ways, depending on the choice of the coordinate axes. The concept can also be readily extended to three or more axes and to matrices with three or more columns, both representing patterns in multidimensional space. For the sake of simplicity it is assumed that the so-called Cartesian coordinate axes are perpendicular to each other (although Descartes would have chosen his axes to best suit his particular problem). The reverse problem of translating a matrix of numbers into a spatial pattern of points was of less concern to Descartes. In fact, it took more than 150 years before the Cartesian coordinates were used for the production of line and point charts such as we know them today. In 1786, William Playfair, a publicist who spent many years in France, charted the trade relationships of Great Britain with all the important nations of his time [2]. On copper plates he engraved exports and imports of Great Britain against consecutive years, ranging from 1700 to 1780. Playfair's innovation (for which he was granted patents in France) was to disrupt the obvious association of coordinate axes with physical dimensions. (It appeared that John Playfair, professor of mathematics at Edinburgh, taught his younger brother
William to chart thermometer readings on successive days in the form of a Cartesian diagram.)
In our previous illustration of a matrix which describes the composition of 20 amino acids in 50 proteins, we have to visualize a pattern of points representing 50 proteins in a 20-dimensional space spanned by 20 axes, each representing an amino acid content. Two proteins are seen to be near to each other in this 20-dimensional space when there is little difference in their amino acid composition. But does such a non-physical space really exist or is it only an abstract algebraic construction? (The same remark applies to Playfair's charts.) We do not attempt to answer this philosophical point which is bound to lead us into a scholastic dispute about the reality of multivariate space. Few people, however, will deny the usefulness of abstract spaces such as phase space (relating the state of an object to its position and velocity) or of the Hertzsprung-Russell diagram (representing luminosity and temperature of stars, and serving as a model of cosmological evolution). None of these possesses the physical reality of an architect's drawing or of Mercator's geographical projection. Yet they are powerful guides and roadmaps for understanding our physical, chemical and biological world. Perhaps the human brain has acquired through evolution a special faculty for processing information in a spatial way (recognition of faces, maps, patterns, etc.) which is distinct from rational and abstract thinking. Interpretation of geometrical patterns is often an alternative to the manipulation of algebraic symbols. The duality of cognitive processes has taken many forms, among which are lateral thinking versus vertical thinking (E. de Bono), or the paradigm of the right and left brain. Returning to our illustration of the matrix of 50 proteins and 20 amino acids, we also have to consider the complementary view of a pattern of 20 points representing amino acids in a 50-dimensional space spanned by 50 axes, each representing one of the proteins. Here, two amino acids are seen as close together when their contributions to the masses of the proteins are much the same. This space is called the complementary space with respect to the first one. In general, the space spanned by the columns of a matrix is called column-space and is denoted here by Sc. The complementary space spanned by the rows of the same matrix is called row-space and is represented here by Sr (with S for space, c for columns and r for rows). Two spaces are the complements of each other when they result from transposition of the roles of rows and columns in the corresponding matrix. (Some authors refer to dual spaces, but this term has already acquired a precise and different meaning in mathematics.) Any matrix can be seen as a generator of two complementary multivariate spaces.
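The two complementary spaces can be made concrete with a short sketch (not from the book; NumPy is assumed and random numbers stand in for the protein data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 20))   # 50 proteins (rows) x 20 amino acids (columns)

# Column-space Sc: each of the 50 rows is a point in a 20-dimensional space.
print(X.shape)      # (50, 20)

# Row-space Sr: transposition interchanges the roles, so each of the
# 20 columns becomes a point in a 50-dimensional space.
print(X.T.shape)    # (20, 50)
```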
It is more appropriate to speak of multivariate space rather than of multidimensional space. Multivariate space is defined by coordinate axes that in general have little to do with physical dimensions such as length, height and depth. In Playfair's time a bivariate space spanned by events of trade and time, rather than by physical dimensions, was not widely understood, although he went to great lengths to explain the concept. Today, we are perhaps in the same situation with regard to multivariate space.
Q.2 Plot the following matrix as a three-dimensional projection in (a) row-space and (b) column-space. Use circles to represent rows and squares to represent columns. So the first row is represented by a circle with coordinates [4 1 4] in the space defined by the 3 columns.
4 Dimension and rank
A multivariate space generated by a matrix implicitly refers to the two complementary spaces, unless we explicitly refer to row- or column-space. Likewise, a pattern of points representing the data in the matrix implicitly refers to the two complementary patterns, unless it is explicitly stated to be represented in row- or column-space. Both are the expression of one and the same aspect of reality expressed by the matrix. They cannot exist independently from one another. Properties expressed in one pattern are automatically reflected in the other. As a consequence, the number of dimensions of the pattern of points in row-space is equal to that of the pattern in column-space. If all points in row-space are collinear, then all points in column-space must be collinear. If all points in row-space happen to be coplanar, then we also find that all points in column-space are coplanar, etc.
We have to distinguish between the dimensions of a matrix and the dimension of the pattern of points represented in multivariate space. The dimensions of a matrix are its number of rows and its number of columns. These two numbers also define the dimensions of the row- and column-spaces generated by the matrix. But the dimension of the pattern of points can at most be equal to the smallest dimension of the matrix. In our example of 50 proteins and 20 amino acids, the dimension of the patterns of points in row- or column-space is at most equal to 20. Often the dimension of the pattern of points is less, because of linear dependencies between the rows or columns in the table. (A variable is linearly dependent on other variables if the former can be expressed as a sum of the latter multiplied by appropriate coefficients.) We define the rank of a matrix as the number of dimensions of the pattern of points generated by it in multivariate space. As a rule, the rank is at most equal to the smallest dimension of the matrix (or its associated multivariate space).
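The rule relating rank, dimensions and linear dependencies can be checked numerically, as in the minimal sketch below (not from the book; NumPy is assumed and the data are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((50, 20))            # 50 objects, 20 measurements
print(np.linalg.matrix_rank(X))     # 20: at most the smallest dimension

# Make the last column a linear combination of the first two columns:
# one linear dependency removes one dimension from the pattern of points.
X[:, 19] = 2.0 * X[:, 0] + 0.5 * X[:, 1]
print(np.linalg.matrix_rank(X))     # 19
```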
Q.3 If 1 out of the 20 amino acids in a series of proteins can be expressed as a linear combination of the other 19 amino acids, then (a) what is the dimension of the pattern representing the 50 proteins and (b) what is the rank of the corresponding matrix?
A more mathematical discussion of the rank of a matrix is given in Chapter 3.
5 Matrix product
The matrix product is the basic operation performed on matrices. It is different from the element-by-element product of two matrices. In the latter case each element is the product of the corresponding elements in the left- and right-hand matrices. It requires that the number of rows and the number of columns of the right- and left-hand matrices are equal. The matrix product is more complex in that it involves the multiplication of rows of the left-hand matrix with columns of the right-hand matrix. (For this reason it is sometimes called the rows-by-columns product.) Each element of the matrix product is the sum of the products of the elements of the corresponding row in the left-hand matrix with the elements of the corresponding column in the right-hand matrix. In the general case, two rectangular matrices can be combined by means of the matrix product if their inner dimensions are conforming (hence the synonym inner product). (In a rectangular matrix, unlike in a square one, the number of rows is different from the number of columns.) The inner dimensions are said to be conforming when the number of columns of the left-hand matrix is equal to the number of rows of the right-hand matrix.
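As a concrete check of the two kinds of product, the short sketch below (not from the book; the NumPy library is assumed and the numbers are arbitrary) computes both for a pair of 2 x 2 matrices:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Element-by-element product: requires matrices of identical dimensions.
print(A * B)
# [[ 5 12]
#  [21 32]]

# Matrix (rows-by-columns) product: each element is the sum of products of
# a row of the left-hand matrix with a column of the right-hand matrix.
print(A @ B)
# [[19 22]
#  [43 50]]
```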
Q.4 What is the product of the matrices
[2  3]     [1  4]
[5 -1] and [3 -2] ?
Fig. 1 Beer's law expressed in matrix terms: the matrix of pure component spectra (XWP, wavelengths by pure components) multiplied by the matrix of mixture compositions (YPM, pure components by mixtures) gives the matrix of mixture spectra (ZWM, wavelengths by mixtures), i.e. XWP.YPM = ZWM, where each element ZWMwm is the sum over the pure components p of XWPwp x YPMpm, for all w and m.
As an example of the matrix product we consider Beer's law of the additivity of absorption spectra. Consider a number of mixtures (M) with various compositions of a number of pure components (P). For each pure component we can obtain the absorption spectrum at a number of fixed wavelengths (W). This results in the wavelength-pure component matrix (XWP). The element at row w and column p of the matrix XWP gives us the absorption of a specific pure component at a specified wavelength. The compositions of the samples are defined by means of the pure component-mixture matrix (YPM). The element at row p and column m of the matrix YPM tells how much of a specific pure component is present in a given mixture. Beer's law states that the absorption of mixtures of pure components can be expressed as the product of XWP with YPM, producing the wavelength-mixture matrix (ZWM). Each element of the matrix ZWM represents the absorption of a given mixture m at a specified wavelength w.
In this text we (optionally) specify the dimensions of a matrix by left-hand subscripts, so that a matrix XRC with N rows and M columns can be denoted by N,MXRC as discussed in Chapter 3, Section 2.1. These subscripts are for information only, and are used in some chapters for clarity. In this chapter, matrices have names that are three letters long. The last two letters are names for the rows and columns of the matrix X, e.g. XWM means that the rows of X refer to the wavelengths (W) and that the columns refer to the mixtures (M).
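The following minimal sketch (an illustration only, not the book's software; NumPy is assumed and all spectra and compositions are invented) writes Beer's law as a matrix product in the XWP/YPM/ZWM notation of this chapter:

```python
import numpy as np

# XWP: absorptions of 2 pure components (columns) at 4 wavelengths (rows).
XWP = np.array([[0.10, 0.90],
                [0.40, 0.60],
                [0.80, 0.20],
                [0.30, 0.05]])

# YPM: compositions of 3 mixtures (columns) in terms of the 2 pure
# components (rows).
YPM = np.array([[0.25, 0.50, 0.75],
                [0.75, 0.50, 0.25]])

# ZWM: absorptions of the mixtures, by Beer's law of additivity.
ZWM = XWP @ YPM
print(ZWM.shape)   # (4, 3): wavelengths x mixtures
print(ZWM)
```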
Multivariate calibration is a major application in chemometrics which makes use of Beer's law. Here, the aim is to determine the composition of an unknown mixture from either the spectra of pure components (XWP) or the spectra of reference mixtures (ZWM) with known composition of pure components. The two alternatives are referred to as the direct and indirect methods of calibration, respectively. The section on multiple linear regression below presents an approach to the problem of multivariate calibration. In what follows, the product of two matrices will implicitly refer to the matrix product and not to the element-by-element product, unless stated otherwise. The matrix product is represented conventionally by means of a dot (hence the synonym dot product). It follows from our definition that a rectangular matrix can be multiplied with a one-column matrix to yield another one-column matrix (provided that the inner dimensions are conforming). Similarly, a one-row matrix can be multiplied with a rectangular matrix to yield another one-row matrix (provided again that the inner dimensions are conforming). We adopt a special notation for matrices which facilitates the reading of products. For example, in Beer's law XWP.YPM = ZWM, we have made explicit by means of suffices that the inner dimensions (pure components P) vanish in the product to yield a matrix which is shaped by the outer dimensions (wavelengths W and mixtures M). In the more general case, we represent the data X which are arranged in a matrix with rows (R) and columns (C) as XRC. The transpose of a matrix is obtained, as we have seen above, by reversing the roles of rows and columns. Hence, the transpose of XRC is denoted by XCR.
In many chemometric applications the number of rows (or objects) is denoted by N and the number of columns (or measurements) by M. In this and the next chapter we retain the notation of R for rows and C for columns in the original data matrix for ease of understanding, but the alternative notation is used in other chapters of the text.
An important theorem of matrix algebra states that the transpose of a product is the product of the transposed terms in reversed order. In other words, the transpose of the product ZWM = XWP.YPM equals ZMW = YMP.XPW, where ZMW, YMP and XPW are the transposes of ZWM, YPM and XWP, respectively. The advantage of our notation is that the suffices allow us to check the consistency of the dimensions in expressions. An additional advantage is that the linearized formulas can be easily recoded into a computer notation by which operations on matrices can be suitably expressed. (One such notation is APL, A Programming Language, designed by Iverson [3].) A drawback of our notation is that it lengthens mathematical expressions. For this reason, we solely propose it for didactic and not for general use.
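The theorem can be verified numerically, as in this small sketch (not from the book; NumPy is assumed and the matrices are random):

```python
import numpy as np

rng = np.random.default_rng(2)
XWP = rng.random((4, 2))   # wavelengths x pure components
YPM = rng.random((2, 3))   # pure components x mixtures

ZWM = XWP @ YPM            # wavelengths x mixtures
ZMW = YPM.T @ XWP.T        # product of the transposed terms in reversed order

print(np.allclose(ZWM.T, ZMW))   # True: the transpose of the product
```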
6 Vectors as one-dimensional matrices
A vector is a linear arrangement of numbers. We can visualize a vector as a one-row matrix or as a one-column matrix. These are one-dimensional matrices. For the sake of brevity we refer to them as row-vectors or column-vectors. The geometrical analogy of a row-vector (say YFC) is a single point in a multivariate space Sc. Likewise, the geometrical counterpart of a column-vector (say YCF) is that of multiple points on a single axis which constitutes a univariate space Sf.
In most chapters of this text vectors are denoted by lower case bold letters, so y would indicate a vector. There is, however, no universal notation as to the distinction between row and column vectors. In Chapter 3, Section 2.2 vectors are introduced as columns or rows of a parent matrix. However, for the purpose of this chapter the notation introduced above provides an unambiguous introduction to vectors.
Q.5 Sketch the representation of (a) a row-vector [2 1 4] in multivariate space Sc and (b) a column-vector in univariate space Sf.
Our definition of the matrix product can be extended to include vectors by thinking of them as one-dimensional matrices. Hence, the product of a rectangular matrix XRC with a column-vector YCF yields another column-vector ZRF = XRC.YCF (provided that the inner dimensions conform). This product can be transposed in the manner described above to yield ZFR = YFC.XCR where ZFR and YFC represent row-vectors.
Q.6 Illustrate the matrix products that give ZRF and ZFR in diagrammatic form similar to the diagrams used above.
7 Unit matrix as a frame of multivariate space
A unit matrix is a square matrix. Unlike a rectangular matrix, it possesses an identical number of rows and columns. It has 1's on its principal diagonal (the one that runs from top left to bottom right) and 0's on the off-diagonal positions. (Hence the name unit matrix.) In our notation we will identify unit matrices by means of the symbol I. To each multivariate space, one can associate a unit matrix. (The unit matrix has the same number of dimensions as the space it is associated to.) We associate the unit matrix ICC to the data space Sc and the unit matrix IRR to the complementary space Sr. Geometrically speaking, a unit matrix defines an orthonormal basis of the space, i.e. a set of mutually perpendicular vectors of unit length. This can be readily derived from the geometrical analogy between a unit matrix (ICC say) and the space spanned by its columns (Sc) (Fig. 2). A unit matrix defines a space that contains no other points than the endpoints of the unit vectors that span the space. These constitute the frame of the multivariate space. Multiplication of a data matrix with a unit matrix of conforming dimensions
is a null operation: it reproduces the original matrix. Note that the inner (conforming) dimensions vanish in the products XRC.ICC = XRC and IRR.XRC = XRC.
Fig. 2 Unit matrices: IRR.XRC = XRC.ICC = XRC.
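The null operation performed by a conforming unit matrix can be confirmed with a small sketch (not from the book; NumPy is assumed and the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
XRC = rng.random((5, 3))    # 5 rows, 3 columns
ICC = np.eye(3)             # unit matrix associated with Sc
IRR = np.eye(5)             # unit matrix associated with Sr

print(np.allclose(XRC @ ICC, XRC))   # True: XRC.ICC = XRC
print(np.allclose(IRR @ XRC, XRC))   # True: IRR.XRC = XRC
```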
8 Product of a matrix with a vector. Projection of points upon a single axis
The key issue of this chapter is the interpretation of the matrix product as a geometrical projection. We have already discussed the analogy between a data matrix XRC and a pattern of points in a multivariate space Sc (i.e. points represented by the rows of the matrix in a space spanned by the columns).
In addition, we conceive of a vector UCF which represents, as we have seen, the axis of a univariate space Sf. This vector is defined here as a one-column matrix. We assume that UCF is normalized, which means that the product UFC.UCF equals 1. (In other words, the length of the vector equals 1.) The U in front of a vector or matrix indicates that the size is 1, and should be distinguished from the I in front of the identity matrix which imposes the additional constraint that the diagonal elements are all equal to 1 and the off-diagonal elements equal to 0. The concept of the length of a vector is also discussed in Chapter 3, Section 3.2. There are an infinite number of one-dimensional vectors that have length 1.
Next we project the points of the pattern XRC perpendicularly upon the unit vector UCF. The result XRC.UCF is a compression of the multivariate pattern XRC into a univariate pattern XRF. The dimensionality of the original pattern is hereby reduced to one.
Q.7 Demonstrate the projection of the 5 points indicated by circles onto the axis UCF both in three-dimensional space (Sc) and as a one-dimensional projection (Sf).
These products may be represented diagrammatically as in Fig. 3. The product XRC.UCF = XRF now represents the perpendicular projections of the points in Sc upon a univariate space Sf. The univariate space Sf is uniquely defined by UCF, whose elements define this axis in Sc. If the pattern generated by XRC has the form of an ellipsoid, we can choose the axis UCF to coincide with the principal axis of symmetry. This way, we obtain a projection XRF which has maximal variance. (The total variance of XRC is defined as the mean squared deviation of the elements in XRC from their mean value. It can be regarded as a measure of the information content of a matrix. If all elements in XRC are identical, then the total variance is zero and no information is conveyed in this case.) No other axis will produce a higher variance than the principal axis of symmetry, so projection upon this axis is the best possible choice if the problem is to capture the largest possible fraction of the variance. When the pattern of XRC is not ellipsoidal we will choose UCF to coincide with the principal axis of inertia. When the axis has been chosen this way, we call it a factor.
Fig. 3 Product of a matrix with a vector: XRC.UCF = XRF, where UFC.UCF = 1.
Variance is also discussed in Chapter 4, Section 4.1.
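The claim that the principal axis captures the largest possible variance can be illustrated numerically. In the sketch below (not from the book) the data are simulated, NumPy is assumed, and the principal axis is obtained from a singular value decomposition, a tool treated formally only in later chapters:

```python
import numpy as np

rng = np.random.default_rng(4)
# Simulate an elongated, roughly ellipsoidal cloud of 200 points in 3 dimensions.
XRC = rng.normal(size=(200, 3)) @ np.diag([5.0, 1.0, 0.3])
XRC -= XRC.mean(axis=0)                  # centre the pattern

# An arbitrary normalized axis UCF (a one-column matrix of unit length).
UCF = np.array([[1.0], [1.0], [1.0]]) / np.sqrt(3.0)
XRF = XRC @ UCF                          # perpendicular projections onto the axis
print(XRF.var())                         # variance captured by this arbitrary axis

# The principal axis of inertia: the first right singular vector of XRC.
UCF_principal = np.linalg.svd(XRC, full_matrices=False)[2][0].reshape(3, 1)
print((XRC @ UCF_principal).var())       # larger: no other axis does better
```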
9 Multiple linear regression (MLR) as a projection of points upon an axis
In multiple linear regression (MLR) we are given a matrix XRC which describes objects in terms of so-called independent variables (measurements). This matrix XRC, as we have seen, corresponds with a pattern of points in a multivariate space Sc. In addition, we are given a dependent variable which represents measurements on the same objects and which we would like to predict from the independent variables. The dependent variable is contained in ZRF which defines a univariate space Sf. The objective of the analysis is to find some combination of the independent variables in XRC which is able to reproduce as well as possible the dependent variable in ZRF. This combination can be associated with a vector, say
YCF. In general, the vector YCF is not a unit vector, because we have to accommodate for differences in the units of measurement in XRC and ZRF. When we project the pattern of points represented by XRC on the axis YCF, we obtain the coordinates of the projected points by means of the product XRC.YCF.
Q.8
a) Draw the axis YCF through space Sc in the diagram below: this axis coincides with UCF but is not necessarily of unit length.
[Diagram: the pattern of points XRC plotted on the independent variables]
b) The dependent variable ZRF is given below. This variable might, for example, be the concentration of a compound in a mixture, whereas XRC may represent spectroscopic intensities. We assume that both XRC and ZRF are centred, which is performed by parallel translation of the patterns in the two data spaces so that their centres of mass coincide with the spatial origins.
[Diagram: the dependent variable ZRF]
Show, graphically, that the projection of the four points in space Sc onto the axis YCF is approximately equal to ZRF.
A popular criterion for selecting the axis YCF is the least squares criterion. In this case, we require the axis YCF to be oriented such that the sum of squared differences between the projection XRC.YCF and the dependent variable ZRF is minimal. Using our notation for matrices and remembering the convention for
matrix transposition, we require that the product (ZFR - YFC.XCR).(ZRF - XRC.YCF) be minimal. (Note that this expression is evaluated as a single number, which can also be considered as a matrix with one row and one column.) In fact, we have collapsed a multivariate space into a univariate one with a desirable property. The new space gives a view on the objects of XRC which reproduces as best as possible the dependent variable ZRF. It is evident that the data when projected onto YCF do not necessarily exactly match the data when measured along ZRF, as illustrated in A.8.
In other chapters of this text we refer to the residual sum of squares (RSS), or residual standard deviation. The RSS is identical to the expression above: ZRF is the actual vector of the dependent variable for each sample and XRC.YCF is the estimated vector of the dependent variable for each sample. Multiplying a vector by its transpose gives the square of its length. This concept will be used again later in the text.
Fig. 4 Summary of multiple linear regression.
Multiple linear regression can be applied to the spectroscopic resolution of a mixture of chemical components. This involves a matrix of independent data XWP which contains measurements for every pure component (p) in the mixture and for every specified wavelength (w). The dependent variable ZWM is defined by the measurements of an unknown mixture (m) at the specified wavelengths (w). The problem now is to determine an axis YPM, such that the projection XWP.YPM is as close as possible to the dependent variable ZWM in the sense of the least squares criterion. The elements of the computed YPM are estimates for the composition of the mixture (m) in terms of the pure components (p). Given the spectra of pure components (p) and given the spectrum of an unknown mixture (m), we can compute the composition of the sample by MLR. We will not go into the details of the computations of MLR here as our aim is to understand multivariate space rather than computational algorithms.
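A numerical sketch of this direct calibration by MLR is given below; the pure-component spectra, the mixture composition and the noise level are all invented, and NumPy's general least squares routine stands in for the computational details that the chapter deliberately leaves out:

```python
import numpy as np

rng = np.random.default_rng(5)
XWP = rng.random((50, 3))                   # spectra of 3 pure components at 50 wavelengths
true_YPM = np.array([[0.2], [0.5], [0.3]])  # true (unknown) composition of the mixture

# Measured spectrum of the mixture: Beer's law plus a little noise.
ZWM = XWP @ true_YPM + 0.01 * rng.normal(size=(50, 1))

# Least squares estimate of YPM: minimizes the sum of squared differences
# between the projection XWP.YPM and the dependent variable ZWM.
YPM_hat, residuals, rank, sv = np.linalg.lstsq(XWP, ZWM, rcond=None)
print(YPM_hat.ravel())   # close to [0.2, 0.5, 0.3]
```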
10 Linear discriminant analysis (LDA) as a projection of points upon an axis
Assume that the pattern of XRC is composed of two distinct classes of objects. This pattern may be represented by means of two distinct ellipsoids.
Q.9 Assume that the ellipsoids are similarly oriented in the data space Sc and that they are similarly shaped (up to a scale constant), as illustrated. In this discussion we assume that the number of objects in Class I is greater than in Class II. Represent the class membership data ZRF in the form of a histogram in which the lengths of the two bars represent the number of objects in each of the two classes. Use a value of -1 to indicate membership of class I, and a value of +1 for class II. (The values -1 and +1 are chosen arbitrarily. Any pair of distinct values will do.)
Chapters 7 and 8 elaborate on class membership functions. In particular, Chapter 8, Section 5.2 discusses FLD (Fisher linear discriminant analysis), which is equivalent to the LDA introduced in this chapter. Soft modelling (SIMCA) is not discussed in this chapter, but has many similarities with conventional hard modelling (or linear discriminant analysis or canonical variates analysis), as indicated in Chapters 7 and 8.
Discriminant scores are introduced mathematically in Chapter 8, Section 5.2.1.
The problem of linear discriminant analysis (LDA) is somewhat analogous to that of multiple linear regression (MLR). We search for an axis YCF in S, such that the projections XRC.YCF satisfy the criterion of separation between classes ZRF. Once again, we collapse a multivariate pattern into a univariate one, in such a way that a desirable property is realized. We view the pattern of XRC in such a way that their projections upon the axis YCF are most closely approximated by their class membership in ZRF. 2.10
Q.10 Draw a frequency diagram of the projection of the objects in the two classes I and II onto the axis YCF as given below.
[Diagram: the two classes of points, I and II, in Sc and the discriminant axis YCF]
The method of LDA is used for the identification of unknown items from their chemical, physical or physicochemical properties. This requires a set of items whose properties and class membership are known. This set is called the training set. An unknown item is then projected upon the axis that results from LDA, and a prediction is made into either of the classes I or II. Linear discriminant analysis can be extended to discriminate between more than two classes. (In that case it is possible to find several axes that discriminate between the different classes in the sense of the criterion mentioned above.)
The concept of training sets, used to set up class models, is also discussed in Chapter 7, Section 7 and Chapter 8, Section 2.
It will become clear to the reader by now that multivariate data analysis is, among other things, concerned with finding interesting points of view in multivariate space from which a particular property of the data can be perceived most clearly. Usually, there are several interesting properties to be looked at. It is not surprising, therefore, that there are many methods of multivariate data analysis. We have only considered two of them so far (MLR and LDA).
11 Product of a matrix with a two-column matrix. Projection of points upon a plane
We consider once again a pattern of points XRC which is represented as an ellipsoid in Sc. Additionally, we define in Sc two perpendicular vectors of unit length. The coordinates of the end-points of these vectors are contained in the two-column matrix UCF. Each column of UCF defines one of two perpendicular unit vectors that span a bivariate space Sf. This is equivalent to the assumption that UCF is orthonormalized, which means that the product UFC.UCF equals the unit matrix IFF of Sf.
Q.11 Draw the projection of the data XRC onto the plane UCF.
In the general case when the pattern of points is not ellipsoidal, the two dominant axes of inertia define a plane which accounts for a maximum of the variance in the data. When the axes have been chosen this way, we call them factors.
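The projection onto such a plane can be sketched numerically as follows (simulated data, NumPy assumed; the two dominant axes are taken here from a singular value decomposition of the centred data):

```python
import numpy as np

rng = np.random.default_rng(7)
XRC = rng.normal(size=(100, 5)) @ np.diag([4.0, 2.0, 0.5, 0.3, 0.1])
XRC -= XRC.mean(axis=0)                     # centre the pattern

# UCF: a two-column orthonormal matrix whose columns span the plane of the
# two dominant axes (the first two right singular vectors).
Vh = np.linalg.svd(XRC, full_matrices=False)[2]
UCF = Vh[:2].T                              # 5 x 2

print(np.allclose(UCF.T @ UCF, np.eye(2)))  # True: UFC.UCF = IFF
XRF = XRC @ UCF                             # coordinates of the points in the plane
print(XRF.shape)                            # (100, 2)
print(XRF.var(axis=0))                      # variance captured along the two factors
```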
Fig. 5 Summary of Section 11: projection of XRC onto the plane spanned by the two columns of UCF, where UFC.UCF = IFF.
12 Product of two matrices as a rotation of points in multivariate space
We have seen in previous sections that a multivariate pattern of points XRC in Sc can be projected into a lower-variate space Sf (an axis or a plane) by means of the product of XRC with a projection matrix UCF. Generally, the projection results in loss of information as the original pattern is collapsed into one with a smaller number of dimensions.
12.1 Rotation in column-space
The notion of projection can be extended to spaces which are spanned by three or more axes. As before, we define the new space Sf in Sc by means of an orthonormal projection matrix UCF. The number of columns in UCF defines the number of axes that span the new space, where UFC.UCF = IFF, the unit matrix of Sf. In particular, it is possible to define the number of axes in the new space Sf such that the pattern of points is not altered by the projection. In this case, we refer to a rotation of the pattern of points XRC by means of the rotation matrix UCF. Distances between points are conserved under an orthonormal rotation (this term means that the axes are at right angles to each other and of unit length). Such a rotation can also be seen as a change of the frame of the multivariate space. Prior to the rotation we have a description XRC of the rows (R) in terms of the columns (C).
After the rotation XRC.UCF of Sc into Sf we obtain a description of the same rows (R) in terms of the new axes (F). Usually, the new axes of Sf are chosen such as to reflect an interesting property of the data. One such property is reproduction of the variance in the data. But there are other properties, such as discrimination between classes and correlation with another description of the same objects. (Such axes are also referred to as canonical axes or canonical variables.) The rotation of XRC by means of the rotation matrix UCF produces the coordinates of the objects XRC.UCF = XRF in Sf.
Q.12 Illustrate, diagrammatically, the rotation of XRC using axes UCF to give the new data matrix XRF.
Orthonormal rotation can be reversed. Suppose that we have the rotation XRC.UCF from Sc to Sf, which produces the factor coordinates XRF. These can be rotated back from Sf into Sc by means of a backrotation XRF.UFC = XRC. Substitution of XRF in the former expression leads to the original description of the data
XRC.UCF.UFC = XRC.ICC = XRC    (1)
In the latter expression, ICC represents the unit matrix of Sc when the rank (or number of dimensions of the pattern of points) of XRC is equal to the number of columns in XRC. (Remember that the rank will be less than the number of columns when there are linear dependencies among the columns of XRC. In the case of linear dependencies, ICC will not be a unit matrix. But the backrotation still holds.) Evidently, one can think of an infinity of orthogonal rotations. Each of them will produce a different description of the rows in Sf.
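A short numerical check of the rotation and back-rotation argument, under the assumption of a full-rank pattern; the data and matrix names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
XRC = rng.normal(size=(20, 3))

# A full orthonormal rotation matrix UCF (as many factors as columns).
UCF, _ = np.linalg.qr(rng.normal(size=(3, 3)))

XRF = XRC @ UCF                 # rotation of the pattern into factor space
back = XRF @ UCF.T              # back-rotation XRF.UFC

print(np.allclose(back, XRC))   # the original description is recovered
# distances between points are conserved under the orthonormal rotation
d_before = np.linalg.norm(XRC[:, None] - XRC[None, :], axis=-1)
d_after = np.linalg.norm(XRF[:, None] - XRF[None, :], axis=-1)
print(np.allclose(d_before, d_after))
```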
Fig. 6 Matrix operation of rotation in column-space onto three-dimensional factor space.
12.2 Rotation in row-space
So far, we have discussed rotations UCF from Sc to Sf. We can apply similar arguments to a rotation of Sr into Sf by means of the orthonormal rotation matrix URF. The transposed matrix XCR describes the columns (C) as a pattern of points in the complementary space Sr spanned by the rows (R). The rotation produces a description of the columns (C) in a space Sf spanned by the factors (F). Algebraically, the factor coordinates of the rotated pattern of points are defined by XCR.URF = XCF. Here the condition for orthonormality is UFR.URF = IFF, the unit matrix of the rotated space Sf. The columns of URF define the endpoints of the unit vectors of Sf in Sr. Alternatively, the rows of URF define the endpoints of the unit vectors that span Sr in Sf. The rotation can also be reversed, since
XCF.UFR = XCR.URF.UFR = XCR.IRR = XCR    (2)
where XCR is the transpose of the original data matrix XRC. (Note that IRR represents the unit matrix of Sr, provided that there are no linear dependencies
among the rows of XRC. In the case of linear dependencies, IRR is not a unit matrix. But the backrotation still holds.)
Fig. 7 Rotation in row-space.
Hence, one can rotate a pattern from Sr into Sf and back from Sf into Sr without any loss of information (i.e. without changing the distances between points in the pattern). Of course, the condition is that the rotation matrix URF is orthonormal.
13 Factor rotation
In the previous section we have discussed the rotation of XRC into XRF by means of the orthonormal rotation matrix UCF. The condition for orthonormality has been defined as UFC.UCF = IFF, the unit matrix of the rotated space. We now discuss the special case where UCF produces projections XRF that account for a maximum of the variance in XRC. It can be shown that such a rotation always exists and that it is unique. We refer to such a rotation as factor rotation. The axes of the rotated space are called factors and the coordinates of the points in factor space are referred to as factor coordinates. By definition, factor coordinates are independent. Independence means here that the projection of a pattern upon one factor axis is uncorrelated with the projection upon another factor axis. Algebraically, this means that the product XFR.XRF = VFF is a diagonal matrix. Diagonal matrices are also discussed in Chapter 3, Section 5.2. In the notation of Chapter 4, Section 5.3 and Chapter 7, Section 3, the matrix G is similar to VFF. The matrix G^(1/2) contains the square roots of the elements of G and will be introduced below as SFF. The discussion in Chapter 4, Section 5.3, is an alternative way of deriving similar results to this section.
The idea of a factor in this chapter differs from some terminology. Here all factors are strictly orthogonal to each other and so the overall geometry of the multidimensional space remains unchanged; the factors are said to be uncorrelated. In some areas, non-orthogonal factors are encountered. This includes canonical variates analysis (CVA), discussed in Chapter 8, Section 5.3, which is equivalent to LDA for more than one class; also, factors extracted by approaches such as target transform factor analysis (TTFA) are not usually at right angles to each other.
We introduce the prefix V to indicate that we have computed variances. (For the sake of brevity we omit to divide by a constant, which in this case is equal to the number of rows in the table XRC.) The elements of the principal diagonal of the square matrix VFF represent the variances of the factors. The elements at the off-diagonal positions are the covariances between the factors. Covariance is a measure
of association between two variables. In this case, all covariances are zero since we have defined the projections XRF to be independent (or uncorrelated). The sum of the elements on the principal diagonal (or trace) of VFF represents the total variance of the data in XRC. Orthonormal rotations conserve the total variance. Distances between points are not altered under an orthonormal rotation.
The elements of the principal diagonal represent parts of the total variance in the data XRC that are explained by each factor. Usually, factors are arranged in decreasing order of their contribution to the total variance. The first few factors that account for a large part of the total variance are called the dominant factors.
Similar arguments can be developed for a factor rotation of the complementary space Sr into the factor space Sf. In this case, we define an orthonormal matrix URF which rotates the transposed pattern XCR into the rotated pattern XCR.URF = XCF, with the condition for orthonormality UFR.URF = IFF, the unit matrix of Sf. In order to be a factor rotation, we must obtain that the projections XCF are independent, i.e. uncorrelated. Algebraically, this translates again into the requirement that XFC.XCF = VFF must be a diagonal matrix.
The rotated matrix XRF provides a description of the rows (R) in terms of orthogonal (i.e. independent or uncorrelated) factors (F). The rotated matrix XCF provides a description of the columns (C) in terms of the same orthogonal factors (F). The duality between rows and columns in the original data matrix XRC can be resolved by means of the two rotated matrices XRF and XCF.
Q.13 Show, algebraically, that VFF, derived from a factor rotation in Sr, is identical to the one obtained from a factor rotation in Sc.
Similar arguments to those of A.13 can show that the variance-covariance matrix VCC = XCR.XRC is diagonalized into VFF by the factor matrix UCF. For this reason, orthogonal factor rotation can be regarded as the geometrical interpretation of the diagonalization of a variance-covariance matrix. This can be summarized by means of
UFC.VCC.UCF = UFR.VRR.URF = VFF    (3)
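Equation (3) can be verified numerically as follows. The data are random and, as in the text, the constant divisor of the variance-covariance matrix is omitted; the variable names mirror the notation of this chapter.

```python
import numpy as np

rng = np.random.default_rng(3)
XRC = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated columns
XRC = XRC - XRC.mean(axis=0)                                # column-centred data

VCC = XRC.T @ XRC                      # variance-covariance matrix (divisor omitted)
evals, UCF = np.linalg.eigh(VCC)
order = np.argsort(evals)[::-1]        # arrange factors by decreasing variance
UCF = UCF[:, order]

VFF = UCF.T @ VCC @ UCF                # UFC.VCC.UCF = VFF
print(np.allclose(VFF, np.diag(np.diag(VFF)), atol=1e-8))   # diagonal: factors uncorrelated
print(np.isclose(np.trace(VFF), np.trace(VCC)))             # total variance is conserved
```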
Note that VRR and VCC are generally not diagonal matrices. The elements on the principal diagonals of these matrices refer to the variances of the rows and of the columns in XRC, respectively. The off-diagonal elements in VRR and VCC represent the covariances between rows and between columns, respectively. The fact that these covariances are often different from zero indicates redundancy in the description of the rows in terms of the columns, and vice versa. The matrices VRR and VCC are often called covariance (or variance-covariance) matrices. Often they are divided by the number of rows (or columns) minus 1: this latter definition is used in Chapter 4, Section 4.2, and such matrices have important statistical properties.
There is no redundancy in the description of the rows and the columns in terms of the factors (since factors are, by definition, uncorrelated using the terminology employed in this chapter). This way, it is often possible to achieve a reduction of dimensionality by means of factor rotation. In the case of an ellipsoidal data structure one can think of factor rotation as a rotation of the original coordinate axes toward the axes of symmetry of the ellipsoid. In the more general case, the rotation is toward the principal axes of inertia of the data structure. Each of these principal axes explains independently a maximum of the variance in the data (in the present context we use the terms factor and principal component synonymously).
Calculation of factors can be done iteratively as follows (a code sketch is given below):
(a) Look for the dominant axis which contributes most to the total variance.
(b) Project the pattern of points upon a subspace which is orthogonal to this axis.
(c) Extract a new axis which accounts for a maximum of the residual variance of the data. By construction, this second axis is independent of the first one.
(d) Project once again the pattern upon a subspace which is orthogonal to the first and second axes.
(e) Return to step (c) until all dimensions of the pattern of points have been exhausted. At that stage, the total variance in the data matrix is completely accounted for by the computed factors.
Iterative factor extraction produces factors in decreasing order of importance. The latter is measured by the fraction of the total variance in the data that is accounted for by each factor. Iterative factor extraction is the method of choice when the number of columns in the data matrix is moderate (less than 20) and when only the first two or three factors are required. This approach is recommended when use is made of an interpretative computer language, such as Iverson's notation APL [3].
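A sketch of the iterative scheme (a)-(e) in Python rather than APL, assuming that the number of requested factors does not exceed the rank of the data; the function name and the fixed number of power iterations are arbitrary choices.

```python
import numpy as np

def iterative_factors(X, n_factors, n_iter=200):
    """Extract factors one at a time: find the dominant axis of the residual,
    record its variance, then deflate (remove that axis) and repeat."""
    X = X - X.mean(axis=0)              # column-centred data
    residual = X.copy()
    axes, variances = [], []
    for _ in range(n_factors):
        u = np.random.default_rng(0).normal(size=X.shape[1])
        for _ in range(n_iter):         # power iteration on the residual
            u = residual.T @ (residual @ u)
            u = u / np.linalg.norm(u)
        score = residual @ u
        axes.append(u)
        variances.append(score @ score)
        residual = residual - np.outer(score, u)   # deflation: residual variance only
    return np.column_stack(axes), np.array(variances)

# Example with arbitrary data: the factor variances come out in decreasing order.
U, v = iterative_factors(np.random.default_rng(4).normal(size=(30, 5)), 3)
print(np.round(v, 2))
```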
For larger-size problems one can choose from more powerful algorithms that yield all factors at the same time. Orthogonal factors UCF and URF that are computed from the variance-covariance matrix VCC or VRR are often called eigenvectors. The associated factor variances are the elements of the principal diagonal of the diagonalized matrix VFF where
VFF = UFC.VCC.UCF = UFR.VRR.URF    (4)
They are called eigenvalues and are associated with the eigenvectors. The above notation is also referred to as eigenvector decomposition (EVD) of the variance-covariance matrices VRR and VCC. An alternative approach is singular value decomposition (SVD) of the data matrix, which can be defined as
SFF = UFR.XRC.UCF    (5)
where SFF contains the square roots of the elements of VFF. Equivalently,
XRC = URF.SFF.UFC    (6)
Singular vectors are identical to the eigenvectors discussed above. Singular values are the square roots of the corresponding eigenvalues.
Q.14 In the notation of Chapters 4 and 7, UFR is the transpose of U, XRC is X, UCF is P and SFF is G^(1/2). Show that Eq. (5) is equivalent to Eq. (28) of Chapter 4, i.e.
X = U G^(1/2) P'
Sometimes, U G1’2are called the scores of the objects and are often denoted by the symbol T; P’are referred to as loadings and are associated with the variables. In the notation of this chapter we see, therefore, that the scores are given by URF.SFF and the loadings by UCF. It should be noted that the current definition of factor is somewhat different from the original meaning given to it by the pioneer of factor analysis, Thurstone [4]. Originally, factors were computed such as to reproduce the correlations among the measurements in a low-dimensional space with a specified number of dimensions.
In our context, factors are computed such as to reproduce the total variance of the measurements in a low-dimensional space. The term principal component, which has been introduced by Hotelling [5], is used synonymously with the term factor as it has been interpreted here. The more general term latent variable, as used by Kvalheim [6], encompasses both factors and principal components.
14 Factor data analysis
In the practice of data analysis one has to make a trade-off between the number of factors retained in the result (which should be minimal) and the variance accounted for by these factors (which should be maximal). Ideally, one would like to have a high degree of reproduction (e.g. more than 80% of the variance in the data by two dominant factors). Should this be the case, then we obtain a Cartesian diagram in which rows are represented as points in a plane spanned by the two dominant factors (Fig. 8). The coordinates of these points are defined by the columns in the projected matrix XRF = XRC.UCF. Each point in the diagram represents a row of XRC and each axis represents a dominant factor. Similarly, we can construct a Cartesian diagram representing the projection XCF = XCR.URF. In this diagram each point represents a column of XRC, and each of the two axes represents a dominant factor (Fig. 8).
It is essential to understand that the factors involved in XRF and XCF are the same. Hence, we can construct a joint diagram in which both the rows and the columns of XRC are represented in a plane spanned by two common and dominant factors. (The concept can be extended from two to three and more factors.) This joint representation of XRF and XCF is called a biplot. The term biplot refers to the two entities, rows and columns, that are jointly displayed in a space spanned by common factors.
At the outset of this chapter we have referred to the complementarity of row- and column-spaces (Sr and Sc). Rows can be described by columns and vice versa. In the course of this chapter we have introduced a third entity, factors. Rows and columns can be described in a space spanned by common factors. This is a major achievement in the analysis of data, especially when it allows the representation of the relationship between rows and columns in a space of low dimensionality. This always succeeds when the rows or columns in the data table are intercorrelated. Correlation means that some kind of redundancy is apparent in the data. This redundancy causes the data to be described by more dimensions than is strictly necessary. By making a trade-off between a minimal number of factors and maximal variance accounted for, one can often reduce substantially the dimensionality of the original data.
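A small sketch of the trade-off described above: the fraction of the total variance accounted for by each factor, computed from the singular values. The data and the function name are hypothetical.

```python
import numpy as np

def variance_accounted(X):
    """Fraction of the total variance accounted for by each factor,
    in decreasing order (constant divisors are omitted, as in the text)."""
    X = X - X.mean(axis=0)
    s = np.linalg.svd(X, compute_uv=False)
    var = s**2
    return var / var.sum()

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 6)) @ rng.normal(size=(6, 6))   # intercorrelated columns
frac = variance_accounted(X)
print(np.round(frac, 3))
print("first two factors account for", round(100 * frac[:2].sum(), 1), "% of the variance")
```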
Fig. 8 Factor rotation and biplot.
Many techniques of multivariate data analysis make use of factor extraction. They differ by the particular objective of the analysis. Three main objectives of factor data analysis are data reduction, discrimination between known groups of objects and correlation between different descriptions of the same set of objects. Even in the
field of data reduction one finds many methods that differ by the transformations operated on the data prior to the extraction of factors. A few of these methods of factorial data analysis will be covered in the section on data display techniques. Familiarity with multivariate space and related concepts can be of great help in the correct application of the various methods. On the other hand, frequent application of these techniques to practical problems also fosters familiarity with multivariate space. Ideally, theoretical insight and practical expertise should go hand-in-hand. A wealth of information remains hidden beyond the surface of data tables. Without an appropriate instrument one will never 'see' what lies hidden in the data. An analogy with microscopes and telescopes is compelling. For this reason, we propose the term datascope for such an instrument, which is composed of a (personal) computer and software for the analysis of multivariate data. (The software package SPECTRAMAP will be discussed in some detail in Chapter 2 and the Appendix.) The datascope provides for the practical introduction to multivariate space.
15 References
References 7 to 10 are general reading in multivariate statistics.
1. R. Descartes, Discours de la Méthode, Jan Maire, Leyden, (1637).
2. W. Playfair, The Commercial and Political Atlas, T. Burton, London, (1786).
3. K.E. Iverson, Notation as a tool of thought (1979 ACM Turing award lecture), Communications of the ACM, 23 (1980), 444-465.
4. L.L. Thurstone, The Vectors of Mind, Univ. Chicago Press, Chicago, Illinois, (1935).
5. H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, 24 (1933), 417-441.
6. O.M. Kvalheim, Latent-structure decompositions (projections) of multivariate data, Chemometrics and Intelligent Laboratory Systems, 2 (1987), 283-290.
7. H. Bryant and W.R. Atchley, Multivariate Statistical Methods, Dowden, Hutchinson and Ross, Stroudsburg, Pennsylvania, (1975).
8. P.J. Lewi, Multivariate Analysis in Industrial Practice, Research Studies Press, Wiley, Chichester, (1982).
9. B.F.J. Manly, Multivariate Statistical Methods, Chapman and Hall, New York, (1982).
10. M.S. Srivastava and E.M. Carter, An Introduction to Applied Multivariate Statistics, North-Holland, Amsterdam, (1983).
ANSWERS
A.1
The sums of the rows yield the total contents of the 20 amino acids which, in this case, should be close to unity. The sums of the columns are proportional to the average contents of each of the amino acids over the 50 proteins.
A.2
[Diagram: the data matrix, with its rows forming a pattern of points in column-space and its columns forming a pattern of points in row-space.]
A.3
(a) The maximum dimension is 20 - 1 = 19. This is because one of the amino acids can be expressed as a linear combination of the others, so the amount of this amino acid is uniquely determined from the amounts of the other 19 amino acids.
(b) The corresponding rank is 19. If there were further linear dependencies, the dimension and rank of the matrix would be reduced further. The problem of collinearity, which appears in regression analysis, is the result of linear dependencies among the columns of the matrix. The effect of collinearity is that the corresponding pattern of points is defined with a number of dimensions which is smaller than the number of columns in the matrix.
A.4
[Diagram: Left-hand term . Right-hand term = Product.]
A.5
[Diagram: the row-vector YFC and the column-vector YCF represented in the corresponding spaces.]
Note that the row-vector is represented by a single point, whereas the column-vector is represented by a straight line.
A.6
[Diagrams for A.6 and A.7: projection of the pattern of points XRC, giving XRC.UCF = XRF.]
A.8
(a) The projection axis is similar to that in A.7, except that it can be of any length.
[Diagram: the projection axis and the pattern XRC in Sc.]
(b) The projection onto the axis YCF is not exactly equivalent to ZRF. Your answer should be approximately as follows:
[Diagram: the projections XRC.YCF compared with the class memberships ZRF (classes I and II).]
A.10
[Diagram: frequency distributions of the projections XRC.YCF for Class I and Class II along the discriminant axis.]
XRC.YCF ≈ ZRF. Note that the two objects are not in the centre of the two classes. The distribution is likely to be approximately normal or Gaussian. This assumption is part of most statistical approaches for discriminating between classes.
A.11
The projection matrix UCF defines a plane upon which we can project the pattern of XRC. The projection XRC.UCF = XRF reproduces some property of the multivariate pattern of XRC in the plane defined by UCF. In particular, we could have chosen our two axes to coincide with the major and second major axes of symmetry of the ellipsoid. This way, we capture the largest part of the total variance in the data that can be accounted for in a plane spanned by two perpendicular axes. The first of these accounts for a maximum of the variance in XRC. The second one reproduces a maximum of the variance in XRC that is not accounted for already by
the first axis. The variance not accounted for by the first axis is also the variance of that part of XRC which is independent of the first axis. We state that the second axis accounts for a maximum of the (residual) variance in the subspace which is perpendicular to the first axis.
[Diagram: XRC.UCF = XRF.]
A.13
Working out the factor variance matrix, we find that the variance-covariance matrix XRC.XCR = VRR is diagonalized into the same VFF by the factor matrix URF:
VFF = XFC.XCF = (UFR.XRC).(XCR.URF) = UFR.(XRC.XCR).URF = UFR.VRR.URF
A.14
Eq. (5) in this chapter can be expressed, in the notation of Chapter 4, as
G^(1/2) = U' X P
Therefore, premultiplying both sides by U and postmultiplying by P' we have
U G^(1/2) P' = X
(remembering that both P and U are orthonormal matrices).
CHAPTER 2
Multivariate Data Display
Paul J. Lewi
Janssen Research Foundation, B-2340 Beerse, Belgium
1 Introduction
From the many methods of multivariate data display that are available today we will consider only those that are based on the extraction of factors. Factor data analysis focuses on the reduction of dimensionality of the data. It capitalizes on the often present correlations between measurements and on frequently occurring similarities between objects (or subjects). These correlations and similarities in the data suggest that there is a certain amount of redundancy in the data. When this is the case, it is possible to reproduce a large fraction of the information (variance) in the data by means of a small number of computed factors. This has been explained in Chapter 1.
The paradigm of colour vision is a pertinent one. Millions of hues of colour can be reproduced by three primary colours. Furthermore, if intensity is held constant, all visible colours can be represented in a plane diagram (the so-called chromaticity diagram) in which pure spectral colours form a horse-shoe curve. Each of these primary colours can be thought of as a factor of colour vision. The biological interpretation of the three factors relates to three different pigments in the cone cells of the retina of the eye.
In practical situations, our task is not only to extract and display the factors from the data. We also have to provide, if possible, an interpretation of the underlying factors. The emphasis in this chapter is on qualitative interpretation of factors. In Chapter 5 we discuss cases where quantitative interpretation of factors is the main aim.
An essential characteristic of factor data analysis is that its results are graphical. This is a mixed blessing, however, because most often those for whom the display is intended do not understand the graphical semantics of multivariate diagrams. One of the major stumbling blocks is familiarity with the Cartesian diagram, which represents objects as points in a plane spanned by two measurements. In a multivariate display, people see objects in a plane which seems to be spanned by a multitude of measurements. Of course, often people fail to understand that the plane
is spanned by dominant factors of the data. Each of the multiple measurements is the combined effect of a few independent factors, much as each individual colour can be reproduced as the combination of three primary colours (e.g. as on a colour TV screen). In this chapter we will assume that the reader has familiarized himself with the concepts explained in Chapter 1, especially the two complementary spaces, projection, rotation and biplot.
2 Basic methods of factor data analysis
We will discuss three basic methods of factorial data analysis. Principal component analysis (PCA) is one of the older methods of data reduction [1]. It is most closely related to traditional factor analysis as devised by Thurstone [2,3], which attempted to reproduce the correlations between psychometric measurements in a low-dimensional space. Thurstone's aim was to detect simple structures in the data. The objective of PCA is to reproduce the variance of the data rather than the correlations between the measurements. A modern variant of PCA is the principal component biplot [4], which shows the relationship between the rows and the columns of the table in a low-dimensional space of factors (ideally a plane). This approach allows reproduction of the data in the table as accurately as possible by means of perpendicular projections upon axes, such as is customary in a Cartesian diagram. PCA emphasizes absolute or quantitative aspects of the data.
Correspondence factor analysis (CFA) is a modern approach devised by J.-P. Benzécri and a French group of statisticians [5,6]. CFA makes use of the biplot technique as it represents both the rows and the columns of the table in a space of low dimensionality (ideally a plane). Its emphasis, however, is on relative or qualitative aspects of the data. The relationship (or correspondence) between rows and columns is one of specificity. The display is meant to show which rows correspond most specifically with which columns. In CFA, data are assumed to be in the form of a contingency table, such as is produced in a survey. Subjects that score similarly on the various items of a survey appear close together on the display. Those that reveal widely different profiles are far away. Test items of a survey that correlate highly also appear close. Those that are uncorrelated or that correlate negatively are at a distance from one another. Distances in CFA are measured in a so-called metric of chi-square, which is the counterpart of variance in PCA.
If the observed parameter exactly equals the expected, then chi-squared is equal to 0. The larger the value, the greater the deviation from the expected value of the parameter. The interpretation of this distribution is discussed in most basic texts on statistics.
In CFA each element of the matrix is evaluated relative to its expected value. The latter can be computed from the totals in the margins of the table. Rows that contain data which are much higher or lower than their expected values will be displayed close to the border of the display and at a distance from the centre. Likewise, columns that contain data that are different from their expectations will also be at a distance from the centre and closer to the border. Such rows and columns can be said to possess high contrasts, where the term contrast refers to difference from expectation. The position of a contrasting row with respect to a certain column on the display reveals the correspondence between them. When they are in the same direction, as seen from the centre, they possess a positive contrast: the row scores abnormally high for the column and vice versa. When they are antipodes, the contrast is negative: the row scores abnormally low for the column and vice versa. In CFA, data must be recorded in the same unit, such as is the case with contingency tables and cross-tabulations.
Q.1 As a simple example, calculate the values of chi-squared for the following matrix

 1   7   5
 6  12   9
 2   3   3
10  16  11
 3   8   4

as follows:
(a) Calculate the totals for each row (ri) and each column (cj), and the grand total (t).
(b) Calculate the matrix of expected values, so that eij = ri cj / t for each value.
(c) Calculate the resultant chi-squared matrix.
(d) Which row would you expect to be closest to the centre of a diagram based on chi-squared, and which would be farthest from the centre?
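A sketch of steps (a)-(c) in numpy, assuming the 5-row by 3-column layout of the example matrix shown above (the layout is an editorial reconstruction of the printed table); it is intended as a guide to the computation, not as the worked answer.

```python
import numpy as np

# The example matrix of Q.1 (assumed here to be 5 rows x 3 columns).
X = np.array([[ 1,  7,  5],
              [ 6, 12,  9],
              [ 2,  3,  3],
              [10, 16, 11],
              [ 3,  8,  4]], dtype=float)

r = X.sum(axis=1)                 # (a) row totals r_i
c = X.sum(axis=0)                 #     column totals c_j
t = X.sum()                       #     grand total t

E = np.outer(r, c) / t            # (b) expected values e_ij = r_i c_j / t
chi2 = (X - E) ** 2 / E           # (c) chi-squared contribution of each element

print(np.round(E, 2))
print(np.round(chi2, 3))
print("row chi-squared totals:", np.round(chi2.sum(axis=1), 3))
```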
Spectral map analysis (SMA) is an approach which has been extensively used in the author's laboratory [7,8]. The method of SMA is similar to CFA as it focuses on relative or qualitative aspects of the data. It also emphasizes contrasts in the data. But the interpretation of contrasts is in terms of ratios. Rows and columns are drawn toward the border and away from the centre of the display when they carry high (positive or negative) contrasts. Data in SMA can be recorded in the same unit or in different units, but must have positive values. SMA is a method which allows detection of characteristic ratios in the data. It can be regarded as a method for multiple ratio analysis. In the particular case of a matrix with three columns, one can construct an equilateral triangular diagram which displays the three possible ratios that can be formed from the three columns of the table, namely the ratios between columns 1 & 2, 1 & 3 and 2 & 3.
3 Choice of a particular display method
The schematic of Fig. 1 may serve as a guide to the selection of the most appropriate method for the analysis of a given data set. The main distinction is between absolute and relative aspects of the data. An absolute aspect of the data is its size or importance. This may be the amount of a component in a mixture, the strength of a signal from a detector, etc. The absolute or quantitative aspect tells us how big or important an object is with respect to a given scale of measurement. The relative aspect of the data does away with the size or importance of the objects. In chemical composition data, this aspect refers to the relative content of a component with respect to the total size or volume of a mixture. The relative or qualitative aspect often tells how specific an object is with respect to a given measurement, independently of its size or importance. Two samples may have identical relative contents of a certain component, although they can be of largely different size. Principal component analysis (PCA) is used when one is concerned with the absolute or quantitative aspect of the data. Often, one finds a strong size component in the data. This occurs when all measurements are positively correlated. In this case, if an object scores high according to one measurement, then it is expected to score high in most of the other measurements. If it scores low on a particular measurement then it is also expected to score low on most of the other ones. If one wishes to express this aspect in the display, then PCA is the method of choice. The dominant factor of PCA will account for most of the size component in the data. The second dominant factor of PCA will express some of the relative aspects (or contrasts) of the data together with a fraction of the size component. Higher order factors of PCA are usually components of contrast. It is important to realize that the
most dominant factor is generally not a pure component of size, nor is the second component generally a pure component of contrast.
Fig. 1 Methods for factor data analysis discussed in this chapter.
[Schematic: factorial data analysis is divided into methods for the absolute aspects of the data (column-standardization, or logarithms with column-centring for positive data; PCA) and methods for the relative aspects (double-closure with marginal weights for non-negative data, CFA; logarithms with double-centring for positive data, SMA).]
The mathematics of PCA is discussed in various chapters. In Chapter 1, PCA is introduced as a method of dimensionality reduction, and the mathematical derivation of principal components is given in Section 13. A more formal derivation called singular value decomposition is given in Chapter 4, Section 5.3. Principal components analysis in soft modelling is discussed in Chapter 7. In this chapter we are principally concerned with qualitative uses of PCA as a means to explore the major trends in data. In Chapter 5, PCA as a quantitative tool is discussed.
The characteristic of PCA is that it involves column-centring of the data. Column-centring corrects for differences in size or importance of the measurements. (By convention, we assume that measurements are represented by the columns of the data table.) Algebraically, this involves the calculation of mean values column-wise and the subtraction of the corresponding mean from each element in the table. Geometrically, it corresponds to a parallel translation of the barycentre of a pattern of points toward the origin of column-space. Column-centring does not alter the distances between the objects. It does alter, however, the distances between the measurements. If there are N rows and M columns in a data set, and the rows are numbered using the index i and the columns using the index j, then column-centring involves the transformation

yij = xij - (1/N) Σ(i=1..N) xij
In some cases, it may also be necessary to correct for differences between the variances of the measurements, especially when these are expressed in different units. To this effect one has to column-centre and subsequently to normalize the data to unit column-variances. Algebraically, this requires the calculation of standard deviations (square roots of the variances) column-wise and the division of each column-centred element in the table by its corresponding standard deviation. Geometrically, the standard deviation of a measurement is proportional to the distance of its representation from the origin of multivariate space. The combined effect of centring and normalization is called standardization. The geometrical effect of standardization is to sphericize the representations of the measurements, as they now all appear at the same distance from the origin of multivariate space. Standardization (or autoscaling) can be expressed as follows:

zij = yij / sj    with    sj = ( (1/N) Σ(i=1..N) yij² )^(1/2)

where yij is the column-centred value defined above. Note that in data analysis we can divide by N rather than by N-1 in the expression of the standard deviation.
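Column-centring and standardization can be written compactly as follows, dividing by N as noted above; the matrix used here is the (assumed) example of Q.1 and the function names are arbitrary.

```python
import numpy as np

def column_centre(X):
    """Subtract the column means: y_ij = x_ij - (1/N) sum_i x_ij."""
    return X - X.mean(axis=0)

def standardize(X):
    """Column-centre and divide each column by its standard deviation
    (dividing by N, not N-1, as noted in the text)."""
    Y = column_centre(X)
    s = np.sqrt((Y ** 2).mean(axis=0))
    return Y / s

X = np.array([[ 1,  7,  5],
              [ 6, 12,  9],
              [ 2,  3,  3],
              [10, 16, 11],
              [ 3,  8,  4]], dtype=float)
print(np.round(standardize(X), 3))   # each column now has zero mean and unit variance
```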
Q.2 Calculate the column-standardized matrix from the data in Q.1.
In a sense, one loses some part of the information in the data after standardization. But sometimes the differences in variance of the measurements may be prohibitive. For example, one measurement may outweigh all the others and may become by itself the only dominant factor of the analysis. Another approach which tends to correct for differences in the variances is logarithmic transformation of the data. Of course, this can only be applied when the data are strictly positive. Logarithmic transformation followed by column-centring tends to level out extreme differences between variances in the measurements. It is to be preferred over column-standardization, whenever it is applicable. This transformation may be written as follows:

yij = log xij - (1/N) Σ(i=1..N) log xij
Q.3 Calculate the column-centred matrix derived from the logarithmically transformed matrix obtained from the raw data in Q.1. Use logarithms to the base 10.
Relative or qualitative aspects of the data are displayed by means of correspondence factor analysis (CFA) and spectral map analysis (SMA). CFA is only applicable to data that are defined with the same unit, i.e. data that can be added meaningfully both row- and column-wise. The data must be non-negative (although a limited number of small negative values can be tolerated). In CFA, each element of the data table is divided by the product of the corresponding row- and column-totals. This can be seen as a closure [9] of the data both row- and column-wise: data are said to be closed if they sum to unity. For example, compositional data in chemistry are closed as they sum to one or to 100%. Compositional data are common in chemistry - the proportion of solvents in an HPLC system or the constituents of a fuel are good examples. The product of the row- and column-totals also produces the expected values of the corresponding elements. (We assume that the gross total of the data is equal to unity. If not, we first divide each element of the table by the gross total.) Hence, division by the product of marginal totals can be interpreted either as a division by expected value or as a closure of the data both row- and column-wise:
yij = xij / ( ( Σ(j=1..M) xij ) ( Σ(i=1..N) xij ) )
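A sketch of this double-closure, again applied to the assumed example matrix of Q.1; the function name is arbitrary.

```python
import numpy as np

def double_closure(X):
    """Divide each element by the product of its row and column totals,
    after scaling the table so that the grand total equals one (the CFA
    re-expression described above)."""
    X = X / X.sum()                       # grand total set to unity
    r = X.sum(axis=1, keepdims=True)      # row totals (marginal weights)
    c = X.sum(axis=0, keepdims=True)      # column totals
    return X / (r @ c)                    # element-wise division by expectation

X = np.array([[ 1,  7,  5],
              [ 6, 12,  9],
              [ 2,  3,  3],
              [10, 16, 11],
              [ 3,  8,  4]], dtype=float)
print(np.round(double_closure(X), 3))   # values near 1 indicate data close to expectation
```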
In CFA one assigns variable weights to the rows and to the columns of the table according to their marginal totals. This way, a larger influence in the analysis is given to those rows and columns that possess the largest marginal totals. (Note that the marginal totals represent a measure of the size component in the data.) The reason why this is done is as follows. Closure eliminates the size component. After closure of the data they are on an equal footing, whether they come from small or large values in the original table. Marginal weighting restores the balance somewhat, by preventing relatively small values in the original table from having too large an influence on the result.
CFA shows correspondences, i.e. specificities, between rows and columns in the data. The display reveals which rows are most prominent in which columns of the table. Rows that have similar profiles (irrespective of their size as expressed in the marginal totals) will appear close together. Those whose profiles do not match will appear distant from one another on the display. The same applies to columns, as the method is completely symmetrical with respect to rows and columns. Distances between rows and between columns are expressed in a so-called metric of chi-square, which is a measure of variance for closed data. Hence, CFA can be considered as a special method of principal component analysis in which the data have been previously closed both row- and column-wise. For this reason, we can refer to the transformation of CFA as a double-closure of the data.
An alternative method for analyzing the relative aspects of the data is spectral map analysis (SMA). This approach has been developed originally for the analysis and display of biological activity spectra. This method is closer to principal component analysis than CFA. The main difference with PCA is in the operation of double-centring rather than column-centring of the data. Double-centring involves subtraction of both the corresponding row- and column-means from each element in the data table. (Here we assume that the gross total of the table is zero. If not, we first subtract the grand mean, the gross total divided by the number of elements, from each element of the table before double-centring.) The effect of double-centring is similar (although not identical) to that of double-closure. Both approaches remove the size component from the data prior to their factorial analysis.
Double-centring can be expressed as follows:

yij = xij - (1/M) Σ(j=1..M) xij - (1/N) Σ(i=1..N) xij
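A sketch of double-centring, and of the logarithmic double-centring used in SMA. The grand mean is added back explicitly, which is equivalent to first removing it as described above; the example matrix and the function names are assumptions made for illustration.

```python
import numpy as np

def double_centre(X):
    """Subtract both the row means and the column means from each element,
    adding back the grand mean (handles tables whose gross total is not zero)."""
    return X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + X.mean()

def sma_transform(X):
    """Logarithmic re-expression followed by double-centring, as used in
    spectral map analysis; the data must be strictly positive."""
    return double_centre(np.log10(X))

X = np.array([[ 1,  7,  5],
              [ 6, 12,  9],
              [ 2,  3,  3],
              [10, 16, 11],
              [ 3,  8,  4]], dtype=float)
print(np.round(sma_transform(X), 3))
```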
Q.4 Double-centre the matrix of Q.1.
If the data are positive one applies logarithmic transformation to the data before double-centring. This entails that a display of SMA can be interpreted in terms of ratios. (A limited number of non-positive data can be replaced by small positive values without distorting the result too much.) Similar to CFA, SMA shows which rows are most influenced by which columns. Rows that have similar profiles (irrespective of their size) will be close. Those that are dissimilar will be distant. The same argument also holds for columns, as this method too is symmetrical with respect to rows and columns. The result of logarithmic double-centring (SMA) is not very different from that of double-closure (CFA) when all the data are close to their expected values. This, however, occurs only in trivial cases, where the rows and columns of the table possess little or no contrast. Each of the above methods of data display can be produced by the program SPECTRAMAP, which will be briefly discussed in the following section.
4 SPECTRAMAP program
SPECTRAMAP is a program which has been developed in our laboratory for factorial data analysis and for the production of high-quality graphic displays. At the present time it is only available for IBM PC and fully compatible computers. The program is designed hierarchically by means of consecutive panels in which the various options of the program are exercised. There are seven modules in the program, which we describe briefly below:
(a) Data Input allows the user to input and edit data tables from the keyboard.
(b) The Analysis module specifies the various types of factor analysis and the format of the display.
(c) Plot File Library maintains a file of plots which can be reproduced one-by-one, in batched mode, on the screen or on various printer/plotter devices.
(d) Data File Library maintains a file of tables which can be called for analysis or for printing.
(e) Data Transfer is used to exchange data files to and from the program in the form of ASCII or DIF files. The latter allows compatibility with spreadsheet programs such as Lotus and Symphony.
(f) Plot Attributes fixes various parameters of the printed and plotted output, such as format, size of characters, colours etc.
(g) The Peripherals module finally assigns the ports of the computer to a specific printer or plotter. A range of printers and plotters can be specified to this effect.
SPECTRAMAP can be operated in two modes as follows.
(a) One is driven entirely by special procedures which can be called in the Analysis module for specific types of analysis, including PCA, CFA and SMA.
(b) The other is a general procedure which makes all options of the Analysis module available to the user. At the beginning, all options are defined to a default value. A special key combination (Control-Page Down) goes straight through the analysis and displays the result which has been specified by the current setting of the options. This 'express' key combination can be used at any stage in the analysis.
The Analysis module provides three types of printed output: a table of the data as used in the analysis, a table of factor coordinates, and a list of the options chosen in the analysis. The start-up procedure of the program and the setting of options in the Plot Attributes and Peripherals modules are covered in a brief installation manual which accompanies the tutorial software package. In this chapter we focus only on the Analysis module and seven selected panels that are pertinent to our illustration.
The Selection of input panel defines the name of the table (e.g. NEUROLEP), the procedure name (e.g. GENERAL), a serial number to start with (e.g. 1) and an initial serial letter (e.g. A). The serial letter is automatically incremented each time a display is saved to the Plot File Library. The Re-expression panel defines the type of transformation on the data, such as Logarithms (PCA, SMA), Division by expectation (CFA) and Column-standardization (PCA).
Weighting of rows is either Constant (PCA, SMA) or proportional to the marginal totals (CFA). Weighting of columns is either Constant (SMA), Marginal (CFA), or Constant with bias on the origin of multivariate space (PCA).
Factor rescaling is mostly for Contrasts (SMA) or for Variances of contrasts (SMA, CFA). Both options provide symmetrical forms of scaling. The latter specifies that the variances of the factor coordinates are equal to the computed factor variances (eigenvalues). The former makes the variances of the factor coordinates equal to the square roots of the computed factor variances (singular values). Scaling for contrasts allows one to construct axes through the squares on a biplot and to project the circles perpendicularly upon them. Absolute values (in PCA) or relative values (in SMA) can thus be read off from a biplot on these axes, which can be provided with appropriate tick marks.
Areas of circles are chosen such as to make visible a component of size. In a table which is defined with the same unit one can choose areas to be proportional to the marginal totals of the table. In other tables one may select a particular column to represent a component of size by means of the variable areas of the circles. The alternative is to use constant areas (PCA, CFA). The maximal diameter of the circles and the squares can be specified in 0.1 mm. Areas of squares can also be used to express a component of size in the columns of the table. They are made proportional to their corresponding marginal totals when the data are recorded with the same unit (SMA); otherwise they are defined as constant (PCA, CFA).
Line segments between squares are specified for the construction of axes on the display. This option allows one to read off the absolute (PCA) or relative (SMA) aspects of the data. The former yields the actual values in the corresponding columns of the data table. The latter displays ratios between measurements if Re-expression has been set for logarithms. If no transformation of the data has been defined, then algebraic differences will be displayed along the axes in SMA. A bipolar axis (SMA) is specified by means of the letters that correspond with the two columns that define the axis. A unipolar axis (PCA) is specified by means of the centre of the display (+) and the letter of the column that defines the axis. Extensions can be specified at either or both ends of line segments. Calibrations (tick marks on line segments) produce ratios (SMA) or actual values (PCA) of the columns involved. Up to six line segments can be specified simultaneously.
The Axes panel assigns factors 1, 2 and 3 to the horizontal, vertical or depth axes of the display. The depth axis is made visible on the display by the variable contours of the circles and squares. A thick contour codes for a circle or square that is above the plane of the display. A thin contour indicates that the circle or square lies below the plane. If the contribution of the depth factor is small, differences among contours may not be perceived (except for a possible outlier). Each of the three axes can be reflected by changing the sign of the corresponding factor coordinates.
Finally, the plot can be rotated about the centre of the map, clockwise or anticlockwise (negative values). There are many more panels and options in the Analysis module, but those mentioned above will allow the user to produce the three basic types of factor data analysis (PCA, CFA, SMA). Further information about SPECTRAMAP is given in the software appendix.
5 The neuroleptics case
Neuroleptics are a class of chemical compounds that are used in psychiatry for the control of psychotic states. Chlorpromazine (Largactil or Nozinan) was the first neuroleptic of clinical importance. Its antipsychotic properties were discovered in France by Delay and Deniker [10]. It belongs to the chemical class of phenothiazines. Thereafter, many analogues of this prototype drug have been synthesized. In 1961 Janssen discovered Haloperidol, a new type of neuroleptic belonging to the novel chemical class of butyrophenones [11]. The introduction of neuroleptic treatment has revolutionized the management of severe psychotic disorders, such as schizophrenia and mania. It has done away with shackles and straitjackets. Nowadays, patients remain hospitalized for only a fraction of the time that was required before the discovery of neuroleptics. Many of them can return to family life and work.
In the meantime, much has been learned about the mechanism of action of neuroleptics. They all exert their activity in the central nervous system, where they attach to very specialized proteins that mediate in the transmission of brain signals. These so-called receptors are embedded in the membranes of nerve cells. Normally, these receptors are activated in a delicately balanced way by so-called neurotransmitter substances, primarily dopamine, norepinephrine (noradrenaline) and serotonin. An excess of dopamine is known to cause mania, delusions and other characteristics of psychosis. Abnormal stimulation by norepinephrine and related compounds is the cause of anxiety and agitation. Serotonin seems to play a harmonizing and regulating function and influences sleep patterns. All known neuroleptics attach to the dopamine receptor, and thus block its interaction with natural dopamine. Hence, neuroleptics protect the brain from exposure to an excess of dopamine and thus prevent delusions and manic states. Neuroleptics also attach to various extents to other receptors, among which those sensitive to norepinephrine and serotonin. Each neuroleptic is known to possess a
typical spectrum of activity. Some are predominantly dopamine-blockers; others have additional serotonin- or norepinephrine-blocking properties. Some interfere with all three types of receptor at the same time. In the laboratory, one can mimic excess stimulation of the receptors in animals by administration of a fixed dose of apomorphine (a dopamine agonist), tryptamine (a serotonin agonist) and norepinephrine itself. These compounds are called agonists because they stimulate the receptors. In rats, apomorphine causes stereotyped behaviour and agitation (apo-agitation and apo-stereotypy), tryptamine causes seizures and tremors (try-seizures), and a high dose of norepinephrine is lethal (nep-mortality). These effects are very reproducible in rats, unless they have been pretreated with a protective dose of a neuroleptic drug.
Table 1. Neuroleptic profiles - in vivo pharmacology.

                      Apo-       Apo-        Try-      Nep-
                      Agitation  Stereotypy  Seizures  Mortality    Total
 1 Chlorpromazine       3.846      3.333       1.111     1.923      10.219
 2 Promazine            0.323      0.213       0.108     1.429       2.073
 3 Trifluperazine      27.027     17.857       0.562     0.140      45.586
 4 Fluphenazine        17.857     15.385       1.695     1.075      36.012
 5 Perphenazine        27.027     27.027       1.961     2.083      58.098
 6 Thioridazine         0.244      0.185       0.093     1.333       1.855
 7 Pifluthixol        142.857    142.857      20.408   163.934     470.056
 8 Thiothixene          4.348      4.348       0.047     0.345       9.088
 9 Chlorprothixene      5.882      2.941       4.545     4.167      17.535
10 Spiperone           62.500     47.619      11.765     0.847     122.731
11 Haloperidol         52.632     62.500       1.282     0.568     116.982
12 Azaperone            2.941      1.282       2.222     3.030       9.475
13 Pipamperone          0.327      0.187       1.724     0.397       2.635
14 Pimozide            20.408     20.408       0.107     0.025      40.948
15 Metitepine          15.385     10.204      10.204    27.027      62.820
16 Clozapine            0.161      0.093       0.327     0.323       0.904
17 Perlapine            0.323      0.323       0.370     0.067       1.083
18 Sulpiride            0.047      0.047       0.003     0.001       0.098
19 Butaclamol          10.204      9.091       1.471     0.025      20.791
20 Molindone            7.692      7.692       0.140     0.006      15.530

Total                 402.031    373.592      60.145   208.745    1044.513
Neuroleptics are called antagonists as they inhibit or block the receptors. Some neuroleptics will typically inhibit the effects of dopamine; others will additionally block the effects of serotonin, norepinephrine, or both. Some neuroleptics will protect the animals against all three agonists. The antagonistic property of a neuroleptic, say in blocking the dopamine receptors, is measured in the laboratory by finding the dose of a neuroleptic compound that protects half of the animals from the effect of an agonist. This is called the median effective dose or ED50, which is usually expressed as mg substance per kg body weight (mg/kg). Each compound produces a characteristic spectrum of ED50 values in the battery of the different tests on rats.
The problem with ED50 values is that a lower value indicates a more potent compound (as a lower dose is required to inhibit a given effect, such as stereotyped behaviour induced by apomorphine). For this reason we have taken reciprocals of ED50 values (Table 1). A higher value now indicates a higher blocking ability. In the table we have reproduced the spectra of reciprocal ED50 values for 20 well-known neuroleptic compounds. By definition, the table is expressed in the same unit (reciprocal of mg/kg). It is meaningful to compute sums and averages both row-wise and column-wise. One immediately notices a strong size component in these data. Indeed, a neuroleptic that scores high in a given test (say Pifluthixol on row 7) also scores high in all four tests. Those that score low in a given test (say Sulpiride on row 18) also score low on all the other tests. The numbers in the marginal column are an indication of the average potency of each compound. The range of potencies is about 5000-fold.
In the following sections we will apply three basic methods of factor data analysis: principal components analysis (PCA), correspondence factor analysis (CFA) and spectral map analysis (SMA) to the table of neuroleptic activity spectra.
6 Principal components analysis (PCA) with standardization
PCA is a method which emphasizes the absolute or quantitative aspects of the data. By this we mean the numbers in the columns of the table and the correlations between them (see also the schematic diagram of Fig. 1). We produce a standardized PCA as follows (a code sketch of these steps is given after Q.5):
(a) Column-standardization of the data (column-centring followed by transforming the columns to unit variance).
(b) Factorization of the variance-covariance matrix of the standardized data.
(c) Scaling of the factor coordinates such that their variances are equal to the factor variances (eigenvalues).
(d) Joint plot (biplot) of the rows and the columns in the plane of the two dominant factors.
Q.5 Obtain the standardized PCA plot of Fig. 2 using SPECTRAMAP.
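The following numpy function outlines steps (a)-(d); it is only a sketch of the computation, not of SPECTRAMAP itself, whose scaling and plotting conventions are described in the text, and the function and variable names are invented for illustration.

```python
import numpy as np

def standardized_pca(X):
    """Standardize the columns, factorize the variance-covariance matrix and
    return scores (scaled so that their variances equal the eigenvalues),
    loadings and the percentage contribution of each factor."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # (a) column-standardization
    V = Z.T @ Z / len(Z)                          # (b) variance-covariance matrix
    evals, UCF = np.linalg.eigh(V)
    order = np.argsort(evals)[::-1]
    evals, UCF = evals[order], UCF[:, order]
    scores = Z @ UCF                              # (c) factor coordinates of the rows
    contrib = 100 * evals / evals.sum()           # % of variance per factor
    return scores[:, :2], UCF[:, :2], contrib     # (d) plane of the two dominant factors

# e.g. scores, loadings, contrib = standardized_pca(table1_values)
# where table1_values would hold the 20 x 4 array of reciprocal ED50 values.
```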
In all our illustrations we assume that the first dominant factor is oriented along the horizontal direction of the diagram. The second factor is along the vertical direction. A third and less dominant factor can be thought of as oriented along a depth direction perpendicular to the plane of the diagram. Actual factor coordinates of rows and columns are not shown here. They can be obtained in the form of a printed list from the SPECTRAMAP program.
Fig. 2 shows the result of a standardized PCA applied to the data of Table 1. Looking at Fig. 2 we see at the bottom of the diagram that the contributions (c) of the first three factors account for 89, 6 and 4% of the total variance in the data after column-standardization. Note that 95% of the variance of the data is retained by the first two factors. We gain by trading 5% of variance for a reduction of dimensionality from the original four columns to two factors. The variance of a data set is discussed in various chapters of this book. In principal component analysis, the higher the sum of squares or variance of a principal component, the more significant the component. Chapter 5 discusses a number of methods for assessing how many significant principal components adequately describe a data set.
Q.6 In Fig. 2 the rows are represented by circles, the columns by squares and the origin by a cross.
(a) Why are the lengths of the lines from the origin to the squares approximately equal, so that the squares are roughly on the circumference of a circle?
(b) Why are the distances of the squares from the origin not exactly equal?
(c) How do you interpret the angles between the lines from the origin to each square?
It appears that inhibition of stereotypy and inhibition of agitation after administration of apomorphine (apo) are tightly correlated. Compounds that effectively counteract stereotypy are also seen to block agitation. There also appears
Fig. 2 Standardized principal components analysis (PCA) of the neuroleptic profiles (in vivo pharmacology), produced with SPECTRAMAP; contributions of the factors: C = 89+6+5 = 100.
a correlation between the effects on tryptamine (try)-induced seizures and those on mortality due to norepinephrine (nep). But the latter two do not correlate well with the previously mentioned effects on agitation and stereotypy produced by apomorphine (apo). This result seems to indicate the existence of three types of interactions of neuroleptic compounds in the brain. These can be related to antagonism of apomorphine, tryptamine and norepinephrine. We know from other studies that these effects correlate with the blocking of the receptors of three distinct neurotransmitter substances in the brain: dopamine, serotonin and norepinephrine, respectively.
The PCA display of Fig. 2 is typical for a table with a strong size component. The latter is evidenced by the grouping of the squares in one part of the display. This grouping is the result of positive correlations among the tests. We also find that the pattern produced by the circles is centred about the origin of the diagram. This is the result of column-centring, which translates the barycentre of the compounds to the origin of the data space. (Note that column-centring is included within column-standardization.) Two compounds are displayed closely together when their results in the four tests are numerically close. For example, the numerical profiles of Pimozide, Perphenazine, Fluphenazine and Trifluperazine are very much in agreement, as well as those of Promazine, Thioridazine and Clozapine. Evidently, the clustering of compounds that can be formed in this way follows from their absolute or quantitative properties. The latter are expressed by the actual numbers in the columns of the table of biological activity in the various tests (Table 1). Note the outlying position of Pifluthixol, whose mean activity is about 10 to 100 times larger than the mean activity of most of the other compounds. Logarithmic transformation may correct for such abnormalities in the distribution of the data, as will be seen in the following section.
7 Principal component analysis (PCA) with logarithms
Taking logarithms is often a good alternative to column-standardization. It can effectively reduce differences in the variances of the columns of the table. Logarithmic re-expression requires, however, that the data be strictly positive. A limited number of negative or zero values can be replaced by small positive values without grossly disturbing the result. We recommend to assign a smaller weight to those rows that contain non-positive values.
60 ~
~~
~~
4.7 Obtain Fig. 3 using SPECTRAMAP. The result in Fig. 3 is not much different from that of Fig. 2, except for the disappearance of the outlying position of Pifluthixol. Note that the conmbutions of the two dominant factors to the total variance in the transformed data are now 70 and 25 % (95 % in total). The distances from the centre of the display (+) to the squares represent the square roots of the column-variances of the logarithmic data. The norepinephrine (nep) test appears to possess the largest variance. The tryptamine (try) test produces the smallest variance. The scaling of the factor coordinates is such that their variances are equal to the square root of the factor variances (and not to the factor variances themselves as in standardized PCA). The consequence of this type of factor scaling is important. In the biplot of Fig. 3 we can project circles perpendicularly upon a particular axis and read-off the values in the corresponding column of the table. The readings will be approximate, however, since 5% of the variance in the data is not represented in the plane of the map. This type of scaling of the factor coordinates distorts the angular distances between the line segments. As a consequence, correlation coefficients are only approximately equal to the cosines of the angular separation in Fig. 3. But, we can still judge with certainty whether two tests are strongly correlated or not (as it is the case with stereotypy and agitation). Furthermore, we have varied the areas of the circles and squares such as to represent their average biological activity (as expressed by the marginal totals of Table 1). Pifluthixol is still seen to be the most potent compound. The more potent compounds appear on the right upper side of the display, while the least potent ones are found in the left lower corner. This is partly due to the important size component in the data. A compound that scores heavily in one test can be expected to score heavily in most of the others. Sometimes, however, it is stated erroneously that the first (horizontal) factor of PCA entirely accounts for the size compound in the data (provided that there is one). This is clearly not m e , as can be observed in this example, where the size component correlates most strongly with the tests of stereotypy and agitation. In this case, we find that the size component runs diagonally on the display between the first and second factors. Hence, in this application, each of the two dominant factors accounts for a fraction of the variance produced by the size component.
0PIMOZIDE
MOLINDONE
WUTACLAWL
TRIFLWERAZINE A
* PERLAPIK
C= 70+25+5=1 00
A R M 6 1TA TION
63
8 Correspondence factor analysis (CFA) Q.8 Obtain Fig. 4 using SPECTRAMAP. Correspondence factor analysis focuses on the relative or qualitative aspects of the data (see schematic of Fig.1). Note that both circles and squares are arranged around the centre of the display.
Q.9 (a) Why has the size component (as reflected in the areas of the squares and circles) disappeared? (b) How do you interpret the position of the compounds on the map in relation to the tests? (c) Which substance is the most specific tryptamine blocker? The three types of tests exert a polarizing effect on the compounds. We also find compounds that show a mixed-type spectrum of activity (e.g. Pifluthixol which is half-way between the apomorphine (apo) and norepinephrine (nep) poles of the map, Metitepine which is in-between the tryptamine (try) and norepinephrine (nep) poles). Compounds that are near the origin of the display (+) have little specificity for either of the three poles (e.g. Chlorpromazine). Those that are positioned further toward the border of the map have increasingly more specificity. The direction of the specificity corresponds with the tests that are nearest to it (this justifies the term Correspondence factor analysis). Compounds that have similar spectra of activity, even if they are different in mean biological activity (or size), appear close on the map. Tests that correlate with one another appear close together (e.g. agitation and stereotypy). The two dominant factors in the display of Fig. 4 account for 76 and 22% of the variance in the data after re-expression (i.e. 98% of the total chi-square value). Note that there is no true origin for the scales of measurements (tests) in this diagram, because the absolute or quantitative aspect has been divided out in the transformation of the data. Division by expectation closes the data with regard to both the rows and the columns of the table. As we have seen above, closure is the operation which consists of dividing each number in a row or column by its corresponding marginal total (as is done with chemical composition data).
64
Distances between compounds and between tests are expressed in a secalled metric of chi-square. This metric displays the similarity of compounds and tests. independently of their potency. Hence, relative or qualitative aspects of the data are revealed by CFA (unlike the results obtained by PCA). But distance of chi-square is not readily computable from the original data in the table. A more intuitive display of relative or qualitative aspects of the data can be rendered in terms of ratios by means of specual map analysis (SMA).
9 Spectral map analysis (SMA) This method is related to CFA as it is also geared toward the analysis of relative or qualitative aspects of the data. But with respect to operations performed on the data it is closest to logarithmic PCA. It has a number of distinct properties, however, as shown in the schematic diagram of Fig. 1. Formally this procedure differs only from that of logarithmic PCA in the weighting of columns. In the weighting of columns of PCA we specified a large bias to be placed on the origin, such as to be represented exactly in the plane of the map. This bias in not applied in SMA. The absence of a bias on the origin of column-space effectively produces double-centred data. Q.10 Obtain Fig. 5 using SPECTRAMAP. The display of Fig. 5 accounts for 100% of the variance in the data after logarithmic re-expression and double-centring. Here, we only consider relative or qualitative aspects of the data, specifically ratios, which are unrelated to the aspect of size. Ratios are dimensionless quantities. Note that potent and less potent compounds are dispersed in a rather unsystematic way over the plane of Fig. 5. A second difference between SMA and PCA is that all line segments in SMA represent bipolar axes, rather than unipolar axes such as in PCA. Indeed, as there is no true origin for the scales of measurements, we only construct axes through pairs of squares. Q.11 (a) Interpret the axes in Fig. 5. (b) Interpret the projections of the circles upon these axes. (c) What is the main difference between SMA and CFA?
I
In
65
0 0
II
w l
0 + +
(n P
1
9p
c,
66
The reading rules of the spectral map of Fig. 5 can be explained in terms of attraction and repulsion. Compounds that are highly specific for a given test are attracted by it. Those that are less specific in the test are repelled. The same holds true for tests, because of the mutuality of attraction and repulsion. Tests that are highly specific for a given compound are attracted by it. Those that show less specificity for the compound are repelled. Sometimes repulsion can be very strong, for example when a compound shows no effect at all in a given test. This may drive a compound toward the border of the map. As shown in the schematic of Fig. 1, SMA can be used with data that are recorded
in the same unit as well as those that have different units. The only requirement is that the data be positive, although a small number of nonpositive data can be tolerated without compromising the result of the analysis.
10 References 1 H.Hotelling, Analysis of a Complex of Statistical Variables into Principal Components, Journal of Educational Psychology, 24 (1933), 417-441. 2 L.L.Thurstone. The Vectors of Mind, Univ. Chicago Press, Chicago, Illinois, (1935). 3 L.L.Thurstone, Multiple Factor-Analysis, A Development and Expansion of the Vectors of Mind, Univ. Chicago Press, Chicago, Illinois. (1947). 4 K.R.Gabrie1, The Biplot Graphic Display of Matrices with Applications to Principal Components Analysis, Biometrika, 58 (1971). 453-467. 5 J-P. Benzkri, L'Analyse des Donnies. Vol.11. L'Analyse des Correspondances, Dunod, Paris. (1973). 6 J-P. Benzkri, Histoire et Pre'histoire de IXnalyse des Donnies, Dunod, Paris, (1982). 7 P.J.Lewi, Spectral mapping, a technique for classifying biological activity profiles of chemical compounds, Arzneimittel Forschung (Drug Research), 26 (1976). 1295-1300.
8 P.J.Lewi, Spectral Map Analysis, Analysis of Contrasts, especially from Log-Ratios, Chemometrics and Intelligent Laboratory Systems, 5 (1989). 105- 116. 9 J. Aitchison, The Statistical Analysis of Compositional Data. Chapman and Hall, London, (1986). 10 J. Delay J. and P.Denike. Mtthodes Chim'othkrapiquesen Psychiatrie, Masson, Pan's, (1961). 11 P.A.J.Janssen, C.J.E.Niemegeers and K.H.L.Schellekens, Is it possible to predict the Clinical Effects of Neuroleptic Drugs (major Tranquillizers) from Animal Data?, Arzneimittel Forschung (Drug Research), 15 (1965), 104-117.
67 ANSWERS
a) The total for each row is given by the column-vector
[j]
md for each column by the row-vector [22 46 321. The grand total is 100.
. b) The expected values are
c) The chi-squared matrix
2.86
5.98
4.16
5.94
12.42
8.64
1.76
3.68
2.56
8.14
17.02
11.84
3.30
6.90
4.80
1.21
0.17
0.17
0.00
0.01
0.02
0.03
0.13
0.08
0.43
0.06
0.06
-
-
-
0.03 0.18 0.13 d) The row that would be closest to the centre is the second row, whereas the row farthest from the centre is the first row.
’
-1.04
-0.49
-0.46
0.49
0.63
0.85
-0.74
-1.39
-1.11
1.72
1.53
1.50
-0.43
-0.27
-0.78
iote that the columns are divided by 5 (not 4) to give this answer: this is the usual vay of calculating population rather than sample standard deviations.
68
4.3 '
*
-0.51
-0.06
-0.06
0.27
0.18
0.20
-0.21
-0.42
-0.28
0.49
0.30
0.29
-0.03
0.00
-0.15
-1.07
0.13
0.93
-
-
A.4
- 1.07
-0.07 0.27
0.47
-0.73
A.3
The setting of the options in SPECTRAMAP for PCA are : Re-expression : column-standardization; Weights of rows : constant; Weights of columns : constant with bias (on the origin of column-space, such as to effectively produce column-centreed data); Factor rescaling : variances; Areas of circles and squares : Iconstant; Line segments between squares : from centre (+) to columns A, B, C and ID.All other options are defined by default. A.6 (a) The reason why the squares are all roughly equidistant from the origin is because the data have been standardized, so the variance of each column is identical. (b) The reason why the distances are not exactly equal is because the two principal components describe only 95% of the overall variance and therefore do not provide a complete picture of the data, but a good approximation. An even better approximation would be obtained if three principal components are used. In such a case the squares would be lifted either above or below the plane of Fig. 2. (c) The cosine of the angles between the lines are almost equal (subject to the 2 principal component approximation)to the correlation coefficients between the correspondingcolumns. This is discussed in Chapter 3, Section 6.
69
I
A.7 The steps used in this procedure are specified below: Re-expression : logarithmic; Weighting of rows : constant; Weighting of columns : constant with bias on the origin of space; Factorization of the variance-covariance matrix of the transfomed data; Factor rescaling : contrasts; Areas of circles and squares : proportional to marginal totals; Line segments between squares : from the centre of the display (+) to the squares labeled A, C and D, extended at both sides and calibrated. A. 8 The steps to be used are as follows: Re-expression : division by expected values; Weighting of rows and columns : marginal totals; Factorizationof the variancecovariancematrix (which is effectively a matrix of chi-square values); Factor rescaling : variances; Areas of circles and squares : constant; Line segments : through A and C, C and D, D and A, extended at both sides.
A.9 (a) The size component has disappeared because of division by expected values. (b) The position of each compound on the map is the result of its specificity for one or more tests. (d) Pipamperone appears as the most specific tryptamine blocker. A.10 The steps of the procedure are detailed below: Re-expression : logarithmic; Weightingof rows and columns : constant; Factorization of the variance-covariance matrix of the transformed data; Factor rescaling : contrasts; Areas of circles and squares : proportional to marginal totals; Line segments between squares : between A and C, C and D, D and A, calibrated and extended at both sides.
A.ll (a) Each calibrated axis represents a logarithmic ratio between two measurements because of the previous logarithmic transformation of the data. Note that the difference between two logarithms is the logarithm of a ratio. The three axes represent the logarithmic ratios of apomorphine/ norepinephrine, apomorphine / tryptamine and tryptamine / norepinephrine. Note that the latter ratio can be derived from the previous two ratios. (b) Vertical projections of the circles upon an axis allows us to read off the corresponding ratios, in this case of well-reproduced data, almost exactly. Note that the order of the projections is very similar to those obtained in CFA.
70
(c) The difference between SMA and CFA lies in the ability of the former to interpret the diagram in terms of ratios. For this reason SMA can be regarded as multiple ratio analysis.
71
CHAPTER 3
Vectors and Matrices: Basic Matrix Algebra N . Bratchell', AFRC Institute of Food Research, Shinfield, Reading, Berkshire RG2 9AT, U.K.
1 Introduction Matrix algebra, or more precisely, linear algebra, has a particular importance in statistics. It provides a concise means of algebraic, abstract, manipulation of arrays of data, data matrices, and it permits graphical representation of those data: almost every matrix and operation has a graphical interpretation. Thus it provides a means of communication for the mathematician; and for the layman results can be interpreted graphically by analogy with the simple three-dimensional (Euclidean) space in which we live. This simplicity of manipulation and representation makes mamx algebra particularly suited to multivariate problems. Since multivariate statistics is concerned with manipulation of data it is important to understand how a particular representation is obtained to gain a full interpretation. This chapter is concerned with the definition of structures and basic mamx operations and will assemble these into forms that underlie the pattern recognition and relational methods presented in this volume.
2 The data matrix Since we are concerned primarily with manipulating a set of data to obtain an interpretable result the obvious place to begin is the so-called data marrix. We have in an "experiment" a set of objects or samples which can be characterized severally and individually by measurements of a set of atmbutes or variables. Examples of these are found throughout this book; other examples are various strains of bacteria identified by a series of binary, presence-absence, tests and physico-chemical tests; or a set of meat samples characterized by their proportions of moisture, fat and ash or (alternatively) by their near infra-red spectra. In each case each object has associated with it a set of measured attributes. present address: Pfizer Central Research, Sandwich, Kent, U.K.
72
The data may be collected in an array. Here we have four objects or samples each characterized by three variables and represented as a table or array as in Table 1 .
Table 1. Example data consisting of four samples or objects and three variables. Variable 1
Variable2
Variabk3
5.5 8.3 8.9 0.5
9.8 0.5 3.1 15.6
6.1 6.1 9.2
Sample 1 Sample 2 Sample 3 Sample 4
3.3
The m a y of numbers, without the row and column headings, is the data matrix. The convention that will be used in this chapter is that a row represents the set of measurements on a particular object; and a column represents the observations on a particular variable for all objects in the data set. This convention is not unique, and many authors reverse the rows and columns. This is ultimately unimportant, but it may cause confusion!
2.1 Man-ices The collection of data is represented by a mamx. In general data may be presented in an (N x M ) matrix with N rows, for the N objects, and M columns, for the M variables. There are various shorthand ways of denoting mamces:
XI1
XI2
...
x21
x22
a * *
... ... xN1
XN2
XlM x23
... ... * * *
XNM
-
Here we have denoted each element by a scalar xu where i denotes the row number and j the column. In general scalars will be denoted by italics. The subscripts of N,MXdenote the dimensions of the matrix: N rows and M columns.
73
Bold upper-case roman typeface is typically used to denote a matrix, with or without subscripted dimensions. Throughout this chapter subscripts will be used.
Q.1 Write the data in Table 1 in the same form as the matrices in Eq.(1).
2.2 Vectors The data maaix can be represented in yet other ways which emphasize the collection of rows and columns. These have particular significance:
Each element xi,, i = 1 ... N, is a row vector whose elements x i , = [ x i l
xi2
...
.iM]
(3)
are the set of observations on the ith object. Similarly, each element xj, j = 1 ... M,is a column vector whose elements
are the set of N observations for the j t h variable.
74
1
14.2 Write down (a) the second row-vector, and (b) the third column-vector of the mamx of Q. 1.
I
Typically vectors are denoted by bold lower-case roman letters; sometimes bold typeface is replaced by underlined characters. Here bold typeface will be used. In general the orientation of a vector is not denoted and can be determined from the context if necessary. Unless otherwise stated or implied a vector will be a columnvector.
3 Vector representation Before proceding further we shall examine the graphical representation of vectors. This generalizes naturally to mamces and is important as it underlies many of the concepts presented in earlier chapters.
0
2
Column 1
Fig. 1 Graphical representation of a vector.
4
75
3.1 Graphical representation: vector-spaces Suppose we have a single object and we measure two variables:
1.
[
=
2.5
3.8
3
We can represent this graphically. The vector consists of two columns or variables, and so defines a two-dimensionalspace: each variable defines an axis or dimension of the space. The elements of the vector are the coordinatesof a point relative to the axes; the vector is the line joining the origin of the space, i.e. of the axes, to the point. This plot represents the position of the point in the space defined by the two variables. If we draw only the point, ignoring the vector joining the point to the origin of the space, we have a typical scatter-plotrepresentation of the datum.
4.3 A rnamx, as a collection of row vectors, may be represented by a collection of points. Draw the scatter plot of the following matrix:
3,2x
=
[ ::: :::] 3.0
2.9
The graphical representation of the matrix in 4.3 is often referred to as plotting the points in the column- or variable-space. The space is defined by the two columns of the data mamx. The matrix in 4.3 has only two columns, i.e. variables, and the objects lie in a twodimensional space defined by the variables. The rows and columns of the mamx are interchangeableand we can plot the two variables in a three-dimensional row- or object-space.
4.4 Draw the column vectors of the mamx in 4.3 in the object-space. This duality is an important property, although it is not immediately obvious how to interpret the variable plot. The first plot represents the objects in the space of the
76
variables, i.e. their positions relative to the variable axes, but the second plot presents the variables in the space defined by the objects. The next few sections develop some properties of the vectors from their graphical representation. A more geometric view of vectors in multidimensional space is given in Chapter 1, Section 3 where the concepts of row- and column-space are introduced.
3.2 Length of a vector In the answer to Q.3, the convention of drawing the vector from the origin to the point has been dropped. But from the figures in this section, it is clear that a vector has length. The length of vector xl, of matrix in 4 . 3 is defined as
This is familiar from geometry as the hypotenuse of a triangle. In general, the length of a vector is the square root of the sum of squares of its elements.
3.3 Distance between vectors Closely associated with the concept of length is that of distance between the points. The distance between two vectors of dimension 2 is calculated as
In effect this is the length of a new vector obtained by shifting the origin to a new point. Again this generalizes readily to higher dimensions. Also notice that the distance is symmetric: the distance measured from vector 1 to 2 is identical to that from vector 2 to 1. Also notice that distance does not have a direction. The idea of distance is implicit in our perception of the scatter of points in 4.3. 3.4 Angle between vectors
The concepts of length and distance are very useful when we wish to interpret the scatter plots. However, sometimes, as for Q.4, it is more useful to think in terms of the orientation of vectors. Thus we can define the angle between two vectors x1 and x2 of dimension 3 as cos 8 =
x 1 1 x 2 1 + x12 x 2 2
1x11 1x21
I 1 3 x23
(7)
77
where 8 is the angle between the vectors. This property is often useful in interpreting plots of the variables in the object-space (4.4). We can also calculate the angle between a vector and the coordinate axes. From Section 3.2 and simple mgonometry, the cosine of the angle between a vector and a particular axis is simply the coordinate along that axis divided by the length of the vector. The vector of such direction cosines is a normed or normufized vector. A point to note is that if the cosine of an angle is 0 then the angle is 90". In twoand three-dimensional space this means that the vectors are perpendicular to each other; in multidimensional space the vectors are said to be orthogonal.
Q.5 Referring to the mamx in Q.3, calculate the following: (a) the lengths of row 1 and column 1. (b) the nomed vector of column 1. (c) the distance between the column vectors. (d) the angle between the two column vectors.
35 Multidimensionalspace In the previous example we represented three objects as points in two-dimensional space defined by two variables; by the duality of rows and columns we were able to represent the variables as points in three-dimensional space defined by the objects. In the introduction we had M variables and N objects. Thus we can, in principle, represent the N objects as points in M-dimensional space, and vice versa. The only fact that prevents such representation is the impossibility of adequately drawing a space of more than three dimensions on a two-dimensional surface. In all other respects multidimensional space is identical to that of one, two or three dimensions. However, we can represent one-, two- or three-dimensional subspaces of the full space defined by the variables. In the two-dimensional example above the two variables may be part of a much larger set with the space defined by the two variables as a two-dimensional subspace of the full space. The concept of representing the full space by a subspace is one that recurs frequently in multivariate statistics in such techniques as principal components analysis, principal coordinates
78
statistics in such techniques as principal components analysis, principal coordinates analysis, canonical variates analysis, SIMCA pattern recognition, partial least squares, canonical correlations analysis, and so on. Applications of principal components analysis are discussed in several chapters, including Chapters 1 , 2 and 5. SIMCA is described in greater detail in Chapter 7. Canonical variates (linear discriminant analysis) are discussed in Chapter 8 and in Chapter 1, Section 10.
4 Vector manipulation The basic properties of vectors were defined algebraically above. They can be defined more concisely by the algebraic operations explained in this section. 4.1 Multiplication by a scalar
This is the simplest operation. We may multiply a vector by any scalar constant a to obtain a new vector
Division follows naturally. One important use of division by a scalar is to produce the normed vector in which each element is divided by the vector's length. For example, if the length of x is given by x , then X
normed(x) = = - = 1x1 x
x
x
(9)
ormalization of a vector should not be confused with normalizing rows of a atrix. In the latter case the rows are divided by the sum of the elements of the ws rather than by the vector length of the rows.
4.2 Addition of vectors
Provided vectors have the same dimensions they can be added to form a new vector
x+y
=
[::
+ Y y1 2
1
-
This is implicit in the definition of distance between vectors given in Section 3.5.
4.3 Transposition Transposition of a vector or mamx is a very useful computational device. In this text, the transpose of a vector is denoted by prime ', but some other authors prefer a superscript T. The transpose of a column-vector is a row-vector with rhe elements in the same order, e.g.
and vice versa. Similar comments apply to mamces. Transposition involves exchanging rows and columns. 4.4 Vector multiplication
Two types of vector multiplication are defined. The inner or scalar product of two row vectors requires that the vectors have the same dimensions and is given by
where N is the length of each vector, which is simply the sum of products of corresponding elements (sum of cross-products) and is a scalar. If two vectors are normalized, Eq. (9), their inner product is identical to the cosine of the angle between them. The outer or cross product of two row vectors x and y of dimensions N and M respectively is given by
80
x'xy =
[
'11
'12
'21
'22
*.*
'1N '2M
... ... ... ... 'N1
'N2
]
*.. 'NM
It is the mamx of all possible cross-products of elements, and places no constraint on the dimensions of the vectors.
4 5 Mean-centred vectors Mean-centring is a way of conveniently re-locating the origin of the space. In the data shown in Q.3 the means of the column-vectors are 2.27 and 3.60. Meancentring gives
3,2x
=
0.73
-0.7
whose column means (or totals) are zero. The origin now coincides with the mean or centroid of the points. This transformation preserves the distances between the points, but it does alter the lengths and angles of the vectors as these are measured relative to a new origin.
Q.6 Write in the form of summations (using C sign) (a) the length of a vector x (defining the scalar x), (b) the distance between vectors x and y, (c) the cosine of the angles between vectors x and y. 4.6 Linear independence Linear independence of vectors is an important concept that recurs frequently and is heavily exploited in multivariate analysis.
81
Suppose we have a set of p vectors, then they are said to be linearly independent, if the equation alxl+a2x2+
.....+agxp = 0
(15)
(where 0 denotes a vector of zeros) can be solved only when all of the scalars aj are zero. If some scalars exist which are not zero, the vectors are said to be linearly dependent.
0.0
1.0
0.5
Fig. 2 Graphical representation of linearly dependent and independent vectors.
Consider 3 two-dimensional vectors
x1=
9
-0.5
x3=
1
1 .o lS0
which are plotted in Fig. 2. Vectors x1 and x2 are linearly independent because the only solution to Eq. (15) is
82
ox1+ox2 = 0
But for x1 and x3 we may have the solution 2 x 1 - x3 = 0
For the set of three vectors we have 2 x * + o x 2 - x3 = 0
demonstrating linear dependence among them.
4.7
[ ;:;I
Referring to Fig. 2, a fourth vector is
x4=
Test for linear dependence between vectors x2 and x4. Considering only vectors x,- and xA,mean-centre the vectors and re-test their linear dependence.
5 Matrices
As noted above, matrices can be regarded as collections of vectors. As such they can be represented graphically, and many of the vector operations generalize readily to matrices; in particular, multiplication by a scalar, transposition and addition of matrices, provided that their dimensions are identical. There are, however, several special types of matrices and operations.
5.1 Matrix multiplication Multiplication of matrices follows the same basic rules of vector scalar product multiplication. In particular the dimensions must be correct. For example, suppose
83
1 r
3
A =
1 1
Z:
3.8
1.3
3.0
2.9
and
2B
=
[ i-$ ]
The elements of a new product matrix are obtained by carrying out multiplication of row vectors by column vectors, and so the number of columns of the first matrix must equal the number of rows of the second matrix. For example, we may have r-
L =
[
16.48
20.90
31.60
33.99
1.3
]
The bold element in the product is obtained as the inner product of the bold row and column vectors on the left of the equation, and the other elements are formed as the other inner products. Note that the product of the two matrices has dimensions given by the 'outer' subscripts of the two mamces. In this case the product is a square (2 x 2) mamx, but it can be rectangular depending on the dimensions of the matrices. One consequenceof this method of multiplication is that in general
AB # B A
(17)
and that if one side of a matrix equation is pre-multiplied by a matrix, the other side must also be pre-multiplied. Q.8 Which of the following are not valid? (a) 33'4 5,4B (b) 3,sA 4,5c (c) 3,5A 5,4D 4,4E ( 4 4 9 5,4B 4,4F = 4,4F 4,s E 5,4B Where appropriate, give the result or explanation.
84
ha-
discussed in C h s , Section 5.
5.2 Diagonal matrices A diagonal matrix is one whose (leading) diagonal elements are non-zero elements while all others are zero. An example of a diagonal mamx is given in Eq. (18).
A particular and important type of diagonal mamx is the identity mamx whose diagonal elements are all unity. An important property of the identity matrix is that any mamx may be pre- or post-multiplied by the identity mamx (subject to dimensionality) without being altered. For example,
where, typically, the identity mamx is denoted by I. L
L
e
c
t
i
o
n 7.
5.3 Symmetric and triangular matrices A symmetric mamx is one whose lower mangular region is the mirror image of its upper triangular region when reflected in the leading diagonal. For example
In other words, S = S'. A triangular matrix is one whose upper or lower triangular region consists solely of zeros. An example of a lower triangular mamx is
85
3,3T
=
[ ; ;]
5.4 Orthogonal matrices Orthogonal matrices occur frequently in multivariate statistics because of their special properties. An orthogonal matrix is one whose column vectors are orthogonal, i.e. have zero cross-product. An orthonormal matrix is an orthogonal matrix whose column vectors have unit length. The most important result concerning an orthogonal matrix is that its inverse is equal to its transpose. If L is a square orthogonal matrix, then
-',
The inverse of a matrix, denoted by the superscript will be more fully discussed in Section 5.9 below. For now it is sufficient to state that the product of any matrix multiplied by its inverse is the identity matrix. Here, for simplicity, we can assume that L is a square matrix. But the identity Eq. (22) holds for rectangular orthogonal matrices.
I: 1
Q.9 Which of the following are orthogonal or orthonormal?
-
-
28
49
55
25
98
57
18
20
4
65
70
85
78.31
(b)
4.63
-7.52
112.08 -30.25
-0.05
23.20
-1.28
14.15
125.51
24.36
2.12
86
0.39
0.61
0.69
0.61
0.39 -0.69
5.5 The trace of a square matrix The trace of a square mamx A, denoted by trace(A), is simply the sum of its diagonal elements.
5.6 The rank of a matrix The rank of a matrix, denoted rank(.), is given by the minimum of the number of linearly independent row or column vectors it contains. For example, suppose A is a (5 x 3) matrix containing three linearly independent column vectors, then the rank of A is 3, and A is said to havefull rank. However, if it contains only two linearly independent vectors (the third is said to be linearly dependent on the others), the rank of A is 2, and A does not have full rank and is said to be singular. The geomemc interpretation of rank and linear independence will be more fully discussed in the next chapter. The rank of a mamx has particular importance in determining the invertibility of a matrix: the condition for invertibility of A is that it is non-singular. The concept of rank of a mamx and its relation to chemical information is also
5.7 The determinant of a square matrix The determinant of a square mamx A is denoted by IAl or det(A). It is frequently defined in terms of a recursive formula which will not be given here for the general case. The simpler case for a diagonal mamx is given by
n N
if D = I d i i ) , then det(D) =
i = l
dii
87
capital ll sign denotes a product as opposed to a sum, which is norm a capital C sign. Thus, the determinant of a diagonal matrix is simply the roduct of its diagonal elements.
Q.10 Calculate the determinantsof the following:
4
0
0
0
1
0
0
0
2
2
0
0
0
0
5
0
0
0
0
0
0
0
0
0
4
The determinant can be (loosely) regarded as a measure of its volume, with the important proviso that if the mamx is not of full rank the determinant is zero. This will be seen geomemcally in Chapter 4. In the case of a diagonal matrix, it is clear that the determinant is z e r o if any of the diagonal elements is zero.
5.8 The inverse matrix If a square mamx A is multiplied by another square mamx B to give the identity mamx, then B is defined to be the inverse of A. This gives the identity
where A-1 denotes the inverse of A. If the mamx A is rectangular or singular the inverse does not exist. A particularly simple type of matrix to invert is a diagonal mamx. Provided that
none of the diagonal elements are zero, its inverse is simply the diagonal mamx of inverse elements; that is if D = ( d i i } , then
D-1 = (I/dii}
(25)
To understand the existence or non-existence of the inverse of a square mamx, we can show that it is proportional to the inverse of the determinant. Hence, it follows
88
that if A is not of full rank, that its determinant is zero and its inverse does not exist. In such cases we can define a generalized inverse.
5.9 Generalized inverse
If the inverse of a matrix X does not exist or it is a rectangular matrix, we can define a generalized inverse, denoted X-,such that
xx-x
=
x
(26)
A particular type of generalized inverse is the Moore-Penrose generalized
inverse which has the further properties that
x-x x- = x-; x x- =
(X
x-)'; x-x
=
(x-X)'
(27)
Note that the true inverse also obeys these conditions. We can obtain the Moore-Penrose generalized inverse via the singular value decomposition of X. Suppose X is an (Nx M)matrix, which may or may not be square and may or may not be of full rank, then
89
The singular value decomposition expresses X as the product of three components: U and P are orthonormal and S is a diagonal mamx; it is one of the decompositions that will be considered more fully in the next chapter. The rank of X is denoted by R, and if X is not of full rank, some of the diagonal elements of S are zero. If these zero elements are removed from S and the corresponding columns of U and P are also removed, the generalized inverse of X may then be written
which always exists and fulfils both sets of conditions Eqs (26) and (27). 'Singular value decomposition is elaborated on in Chapter 4, Section 5.3 and Chapter 1, Section 13. It is also discussed in the context of principal components analysis and SIMCA in
6 Statistical equivalents The concepts of length, inner product and angle have important parallels in statistical theory. These are obtained by considering the mean-centredobservations on a variable as a column-vector. Mean-centring entails subtracting the mean of a variable from the observed values. The graphical effect of this transformation is to shift the origin of the space to the mean or centroid of the data. In statistics the analogue of length is the standard deviation, and the analogue of a normed vector is an autoscaled or standardized variable. The difference is that the standard deviation of a vector of observations is equal to the length of a meancentred vector of observations divided by d(N-1)where N is the number of observations. The concept of angle was introduced as a means of comparing two vectors. The d a r product of two mean-centred vectors of observations divided by (N-1) is the covariance of two variables and so the analogue of the angle between vectors is the correlation coefficient. Thus if we represent the variables in the sample space it follows that if the variables are uncorrelated, with a correlation of 0,then their vectors will be perpendicular or orthogonal to each other. Moreover, if two variables are autoscaled (or standardized) their covariance is identical to the correlation of the unscaled variables, in parallel with the inner product of normed vectors.
90
2.12 4 mamx = 28
49
55
25
98
57
18
20
4
65
70
85
:an be decomposed into the following three matrices by a singular value lecomposition
0.42
0.12
-0.47
0.60 -0.77
0.00
0.12 -0.03
0.88
0.67
0.62
0.13
0.39
0.61
0.69
0.61
0.39 -0.69
1
(b)
187
0
0
0
39
0
0
0
16
dentify the left, right, and singular matrices, and hence invert X.
We now return to our original (N x M ) data matrix of N observations on M variables. Combining the rules of matrix multiplication and the parallels listed above, if X is mean-centred we have
91
where C is called the symmemc variance-covariancematrix of X.The M diagonal elements of C are the variances of the M variables and the off-diagonal elements are the covariances. If the columns of X are autoscaled then C is identical to the correlurion matrix of X with unit elements on the diagonal and correlations as the off-diagonal elements. If the M variables are independent or uncorrelated all the off-diagonalelements will be zero for both the covariance and correlation mamces. [Variance and covariance are discussed in Chapter 4, Section 4.1.
7 References All text books on multivariate analysis contain a section on matrix algebra. A brief selection is presented here. Chatfield and Collins [ l ] is an introductory text; Kendall [2] and Mardia er al. [3] are more advanced. Both Mardia er al. [3] and Rao [4] provide much more rigorous statements of properties, and Rao [4] in particular develops multivariate, and other, analyses in a more rigorous way. C. Chatfield and A. J. Collins. Introduction to Multivariate Analysis, Chapman and Hall, London, (1986). M.G. Kendall, Multivariate Analysis, Griffin, London, (1980). K.V. Mardia, J.T. Kent and J. Bibby, Multivariale Analysis, Academic Press, London, (1979). C.R. Rao, Linear Statistical Inference and its Applications, Wiley, New York,(1965).
1
92
ANSWERS A. 1
The data m a y be represented as 9.8 6.1
5.5
8 . 3 0.5 6 . 1 8 . 9 3.1 9 . 2
= 4,3x =
x
0 . 5 15.6 3 . 3 A.2 The second row and third column column vectors from matrix in Q. 1 are (a)
x2, =
[
8.3
6.1 6.1
(b)
x
,3
=
9.2
-
3.3
0.5
6.1
1
93
A. 3
The mamx consists of three rows and two columns: three objects measured on twi variables. The two variables define two axes, hence a two-dimensional space. Th three points can be represented as: e4
- 4
3
*
* *
2
0
I
2
I
4
Column 1 The "vectors"have been omitted to give a typical scatter-plotof one variable agains
94
~~
4.4 3 e three-dimensional plot of the columns of the matrix in 4 . 3 is Row 3
Row 1
\
Row 2
/
rhe "vectors"have been included to facilitate vision. 4.5 :a)
3) :c) ld)
Row 1: 4.55 ; Column 1 :4.12
[19i]
3.09 24.90
4.6 Suppose the two column vectors x and y are each of dimension N. Then [a)
(Length of x ) ~=
J?
= Ix12 = xx' =
&: c(xi-
i = 1
N
(b)
(Distance x - Y ) ~= dxy = (x - y)(x -y)' =
2
yi)
i = 1
95
4.7 inear dependence can be tested as follows. We need to solve the condition
-Ience, we can rewrite this as simultaneousequations
solving for a2 gives a2= 0 and hence a1= 0. They are linearly independent.
[ z::]
To mean-centre we need to calculate the mean vector il = 0.5 (x2 +x4) =
Ience we have the mean-centred vectors
0.125 0.375
1
:o test for linear independence
-0.125al+0.125a2 = 0 -0.3750(1+ 0 . 3 7 5 ~ 2= 0
'he equations are solved for any values which satisfy a,= a*.The mean-centred 'ectors are linearly dependent. Graphically we have a new pair of vectors whose irigin is at the centroid of the original vectors. But we can draw a single axis lassing through the space which preserves the relative positions, angles and lengths If the vectors.
96
A.8
(a) and (c) are valid; (b) and (d) are invalid. (a) This is valid and results in a (3 x 4) matrix. (b) This is not valid because the dimensions do not match.& needs to be transposed to give a valid multiplication. (c) This is valid and results in a (3 x 4) mamx. (d) This is not generally valid. Although both sides of the equation give a (4 x 4) matrix, the left side of the equation is post-multiplied by 4,4Fand the right side is pre-multiplied. The two sides will generally be different.
A.9
(a) Neither orthogonal nor orthonormal. (b) Orthogonal. (c) Orthonormal. These can be verified simply by calculating Z = X'X where X is the mamx. If is diagonal, X is orthogonal; if Z is diagonal with elements of 1.0, X nrthnnnrmal
A. 10 (a) 8. (b) 0. We can, however, remove rows and columns without altering the fundamental properties of the matrix and we find the determinant of the new mamx is 40.
IA.11 0.00
0.00
0.00 0.00
0.50
0.25
97
:b) The inverse does not exist. But, as in A.9, we can remove row 3 and column 3 Hence
0.5 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.5
:c) This is an orthononnal mamx. Hence its inverse is identical to its transpose. 0.39
0.69
0.61
0.61 -0.61
0.39
0.61
0.22 -0.69
1.12 i4aaix (a) corresponds to U, matrix (b) to S and matrix (c) to P'. fie inverse is given by -0.015
-0.012
0.033
0.017
-0.012
0.016
0.022
-0.006
0.022
-0.002
-0.038
0.000
This Page Intentionally Left Blank
99
CHAPTER 4
The Mathematics of Pattern Recognition N . Bratchell', AFRC Institute of Food Research, Shinjield, Reading, Berkshire RG2 9AT, U.K.
1 Introduction In Chapter 3, a set of data was presented as vectors and matrices and displayed graphically. The graphical representations were used to derive several quantities which provide information about the data and several rules of manipulation of vectors and matrices were listed. This chapter draws on Chapter 3 to develop the principles underlying multivariate analyses. Initially the chapter will examine data in the context of multivariate space. Various aspects of the structure will be exploited to simplify interpretation. This is a primary aim of all analyses as it allows attention to be focused on the important features of the data. The differences between analyses arise from the features of the data which are exploited. 2 Rotation and projection In Chapter 3 it was shown that data may be considered to lie in a multidimensional space whose coordinate axes represent the observed variables. This section examines the nature of the space and the axes which define it. Underlying many methods is the aim of locating a set of new axes which also define this space. Ultimately this permits simplification of the space and, hence, of interpretation. But first it is necessary to define the operations of rotation (of axes) and projection (onto axes). They will be illustrated graphically and subsequently defined mathematically. 2.1 Graphical illustration Two sets of perpendicular coordinate axes, A and B, (which define the same twodimensional space) are illustrated below. Throughout the following discussion it present address: Pfizer Central Research, Sandwich, Kent, U.K.
100
should be remembered that coordinate axes are simply a convenient peg on which to hang measurements and have no physical presence in the space. Suppose that axes A represent two variables which were measured and provide the initial definition of the space and, therefore, of the structure of the points within that space. Axes B represent two hypothetical variables which also define the space and provide the same scatter of points, seen from a different perspective. If information is loosely defined here as the scatter of points in the space, axes B retain the same total information. This is illustrated in Fig. 2.
A2 B2
b
.. .
A1
b
b
Fig. I Two sets of perpendicular axes, A and B , defining the same two-dimensional space.
Fig. 2 Data plotted against the two sets of axes given in Fig. 1.
The coordinates of the points relative to axes A were obtained from observed variables but can be determined from the graph by projecting perpendicularly from
101
the points onto the axes. The projections from a point onto the axes are illustrated below. Coordinates relative to the new axes B can be obtained in a similar manner.
Fig. 3 Perpendicular projection from a point onto the coordinate axes.
2.2 Defining the rotation The angles between the axes A and B above are illustrated below. Moving clockwise we have angles of 3150 and 450 between A, and B , and between A2 and B , respectively; and angles of 450 and I350 for B2. The cosines of these angles are 0.707, 0.707, 0.707 and -0.707.
Fig. 4 Angles between axes.
102
This rotation can be represented as a matrix
r
=
1
0.707
0.707
0.707
-0.707
1
1
The two column vectors are the cosines of the angles between A and B , and A and B2 respectively. m
e
c
t
o
r
1
s are defined -n i
As noted above, axes are merely convenient abstract constructions. The column
vectors of B are vectors of unit length relative to axes A. However, B has special properties. Its column vectors are normed, so that their elements are direction cosines relative to the axes (Chapter 3, Section 3.4) and they are orthogonal (Chapter 3, Sections 3.4 and 5.4). Hence, B is an orthogonal matrix which has the property
B'B = I
(2)
where I is the identity matrix.
Q.1 (a) Verify that Eq. (2) holds for B. (b) What does this imply about the relationship between an orthogonal mamx and its transpose?
2.3 Projection onto new axes Matrix B was introduced as defining the angles between an original set of axes and a new set. Subsequently it was shown to define a set of unit vectors which coincide with these axes rather than the axes themselves. Projection of the points onto the new axes takes the form of projection onto these unit vectors.
103
Suppose the observations plotted in the space of axes A are
x =
-1.8
0.7
-0.9
-1.4
-0.4
1.2
0.6
0.2
1.4
-1.1
We can project them onto the axes B by the operation
Y = X B
(3)
4.2 Perform the multiplication in Eq.(3) and verify the projection onto axes B.
The inverse rotation or projection is also readily defined. The inverse of B,B', defines the rotation from axes B to A. Hence we have the inverse projection
X =YB'
(4)
3 Dimensionality Dimensionality is a fundamental property of data and was introduced in Chapter 3 in several ways. Initially data were shown to lie in multidimensional space whose number of dimensions is given by the number of variables or column vectors of the data; subsequently, the variables were shown to lie in a row-space whose dimensionality is defined by the number of objects or row vectors. The equivalence between row and column spaces was emphasized although the equivalence of their dimensionalities was not considered. More formal aspects of dimensionality were presented in Sections 4.6 and 5.1 in terms of linear independence of vectors and the rank of a matrix. Dimensionality is also introduced, in a more geomemc way, in Chapter 1, Section 4. In this section a distinction will be drawn between the dimensionalityof a space and that of the data. The former is given by the number of row or column vectors of the matrix. The dimensionality of the data can be defined more pragmatically as the
104
minimum number of dimensions needed to represent fully the vectors, that is, the distances and angles among the vectors. These two definitions are not necessarily the same and the latter is determined by the number of linearly independent row or, equivalently, column vectors. A set of objects may occupy a subspace of lower dimensionality whose axes do not coincide with those of the original space. This definition will later be relaxed to permit an approximate representation of the data. Dimensionality is a broad topic which can be discussed in several ways. In the following sections examples illustrate various aspects and implications of dimensionality. They will be used to develop a mathematical definition of dimensionality and subsequently a means of exploiting the structure of the data for reduction of (apparent) dimensionality.
3.1 Fewer variables than objects Discussion of dimensionality cannot be made without reference to the number of objects and variables as they provide the first limiting factor to the dimensionality of the data. For a given set of data, the dimensionality is at most the minimum of the number of objects or variables. This can be illustrated very simply by the the following data:
x =
2.5
3.8
1.3
4.1
3.0
2.9
These data, consisting of three objects and two variables, can be represented graphically as three points in the two-dimensional column-spaceor,equally, as two points in the three-dimensional row-space, as in Fig. 5. The three points lie in a two-dimensional space defined by the columns of the mamx. From this perspective the dimensionality is two as both axes are needed to plot the three row vectors, that is, the positions of the points and the origin of the axes. We also have a threedimensional space defined by the rows of the mamx. However, inspection of the right-hand diagram indicates that the two vectors, that is, the points and the origin of the space, can be represented fully in only two dimensions without altering either the angle or the distance between the vectors.
105
Fig.5 Two graphical representationsof a 3 x 2 matrix.
Q.3 Verify that the dimensionality of the data is two by plotting the two column vectors in a two-dimensional space. Comment on the significance of this result.
3.2 More variables than objects The previous example had more rows than columns in the matrix, like most examples in these chapters. A similar approach can be taken when there are more variables than objects, and this section will demonstrate how a set of objects can occupy a lower-dimensional space than is at first apparent. This will be developed further in the following sections. Consider the following data
-0.04 0.78 -0.06 0.28 -0.56
-0.50 -0.32 -0.45 0.67 -0.03 -0.47 0.41 -0.47 -0.45 0.43
1
The row vectors are linearly independent and the dimensionality of the data is three. The five column vectors can be plotted in the three-dimensional row-space. And, although the row vectors are defined within five dimensions, we can find three new axes which fully represent the vectors. That is, we can show that the objects occupy a three-dimensional subspace.
106
This observation holds for all sets of similar data, but this set contains an extra detail. The mamx above consists of three mutually orthogonal row vectors each of unit length; if plotted in the five-dimensional space, the three vectors are mutually perpendicular. Following Section 2.2 above we may define a rotation of the original five axes to a set of new axes by the rotation mamx
-0.47 -0.04 0.23 -0.50 0.69 0.41
0.78 0 . 3 5 -0.32 -0.02
-0.47 -0.06 0.23 -0.45 -0.72
R =
-0.45 ,
0.67
0.02
0.71 -0.03
0.00
0.52
0.28
0.43 -0.56
Projecting the three row vectors onto the new axes gives new coordinates
Y =
0
1
0
0
0
0
0
0
1
0
1
0
0
0
0
1
Each of the vectors lies an equal length along one of the new axes. But it can also be seen that none of the vectors projects along four of the axes.
4.4 Compare the original data matrix with the rotation matrix. Give the rotation mamx which defines the three-dimensional subspace needed to represent the data. What is their product? The previous example was conmved to illustrate that data may lie in a subspace and, following Section 2, to show how the subspace is defined relative to the full space. We may now generalize the example. In reality, it is unlikely that the row vectors of observations are mutually orthogonal. We could have, for example, three linearly independent vectors whose angles with each other are 450. These too are three-dimensional. In the previous discussion, the vectors in the new space each coincided with one of the axes. In the present example, the new vectors are no longer coincident with the axes, but they do occupy the three dimensions. To verify that they are inherently three-dimensional, the vectors can be considered to form a pyramid with one apex at the origin.
107
3.3 Data lying in a subspace The examples of Section 3.2 hide a deceptively important point. The dimensionality of the data is determined by the linear dependences among the vectors (Chapter 3, Section 4.6). In both of the examples in that section, three linearly independent row vectors lying originally in a five-dimensional space were found to be inherently three-dimensional. In this section the concept of data lying in a subspace is developed in more generality and the influence of linear dependence is examined. Consider the following data:
x =
0.5
0.5
1 .o
1.9
0.7
2.6
2.0
2.0
4.0
0.3
1.8
2.1
1.9
1.7
3.6
1.2
0.2
1.4
1.9
0.9
2.8
Here we have seven objects and three variables. We can plot the objects in threedimensional space (Fig. 6). The points, although they project along each of the three axes, do in fact lie on a plane within this space. That is, the data occupy a two-dimensional subspace. To understand this we need to examine the relationships among the column vectors. They are linearly dependent and follow Eq. (5).
Only two of the column vectors are linearly independent: given two of them the values of the third are determined by Eq. (5). Thus, although the maximum dimensionality is three, the rank of the mamx (Chapter 3, Section 5.6, and Chapter 1, Section 4) is two and so the dimensionality of the data is two. We could find a pair of new axes which allow us to plot the data in a plane, for example, two axes defining the plane above.
108
/I
I
I /. /
'a
I
Fig.6 Seven objecis ploiied in ihree-dimensional space.
3.4 Data lying near a subspace Both of the previous sections gave examples of data which occupy a subspace. In the former, the data were constrained by the number of objects. In the latter, the constraint arose from the relationship between the column vectors. Another situation arises when data lie near a subspace. Consider the data of the previous section. The column vectors followed a linear relationship. This will now be perturbed by adding noise to the variables. Consider the new relationship
+ (x2 +ez)
x3 + e 3 = (xl + e l )
(6)
Eq. (5) no longer holds precisely; in other words, there is no longer the strict linear dependence among the vectors, although this is the dominant relationship. If the data are plotted in three-dimensional space, the points do not now lie on a plane, but are scattered slightly about it. That is, they lie near the subspace. Projecting onto this plane gives an approximate two-dimensionalview of the dominant structure in the data. ksimple e
x
a
m
p
l
e
1
109
3 5 Further constraintson dimensionality Sections 3.1 to 3.4 listed several constraints on the dimensionality of data, namely the number of objects or variables and the linear relationships among the vectors. Arithmetic operations on the data can also constrain the dimensionality, although they can operate in unexpected ways. The operations of row- and column-cenmng are introduced as mean-centred vectors
3.5.1 Row- and column-centring Both row- and column-centring may each reduce the dimensionality of the data by one under certain circumstances. To understand this consider a matrix with three linearly independent row vectors. The mean row-vector is calculated as
The mean-centred row vectors are then calculated as
zi = xi - i;i = 1 ... 3
(8)
They now have the linear relationship
The operation of mean-centring has introduced linear dependence into the set of vectors, irrespective of other relationships already present. Similarly, mean-centring the row vectors will also introduce linear dependence. However, whether this will affect the dimensionality of the data is a complex problem and two examples will serve to illustrate the effects. Consider the vectors of Eq. (5). They define a plane passing through the origin, which was illustrated above. For any set of vectors which follow this relationship and for which none of the vectors is constant, the dimensionality of the points is two. Neither row- nor column-centringhas the effect of reducing the dimensionality further. The dimensionality cannot be reduced because the origin of the space lies within the subspace of the data, Mean-centring the rows and columns can have an effect only when the smallest subspace does not pass through the origin of the original space. This is illustrated by the following set of data
X =
 1    6    7
 2    7    9
 3    8   11
 4    9   13
 5   10   15

Here we have two linear relationships. The first,

x_3 = x_1 + x_2

is one of linear dependence; the second,

x_2 = x_1 + 5
is not. The plots of the data before (a) and after (b) mean-centring the columns are shown in Fig. 7.
Fig. 7 One-dimensional data discussed above before and after mean-centring.

In both cases the points are inherently one-dimensional. However, the space needed to represent the original vectors is two-dimensional because the line of the points does not pass through the origin of the space, even though the values on one of these (hypothetical) axes are constant. After mean-centring, the origin of the space coincides with the centroid of the points. Now only a single dimension is needed. The operation of mean-centring has altered the second linear relationship to one of linear dependence. This is a subtle point which is generally avoided by the standard practice of working with mean-centred data.

Q.5
(a) What is another mathematical way in which the dimensionality of the last set of data could be reduced? (b) Consider the data of Section 3.2, how could their dimensionality be reduced from three to two?
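The effect of column-centring on the dimensionality of this particular data set can be checked numerically. The short numpy sketch below (added for illustration; it is not part of the original text) uses the 5 × 3 matrix given above and compares the rank before and after mean-centring the columns. The rank drops from two to one because the centroid, rather than the origin, now lies on the line of the points.

```python
import numpy as np

# The 5 x 3 matrix discussed above: x3 = x1 + x2 (linear dependence)
# and x2 = x1 + 5 (a linear relationship that is not a dependence).
X = np.array([[1, 6, 7],
              [2, 7, 9],
              [3, 8, 11],
              [4, 9, 13],
              [5, 10, 15]], dtype=float)

print(np.linalg.matrix_rank(X))          # 2: the line of points misses the origin
Xc = X - X.mean(axis=0)                  # column (mean-)centring
print(np.linalg.matrix_rank(Xc))         # 1: origin now at the centroid
```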
3.5.2 Closure
Closure is a special case of constrained dimensionality. In this case the variables (columns) have the relationship

x_1 + x_2 + ... + x_M = 1    (10)

where 1 denotes a vector of 1s. For the case of three closed variables the space in which the values are constrained to lie is a plane triangle with apexes at the points (1, 0, 0), (0, 1, 0) and (0, 0, 1). Any point outside this triangle does not obey the relationship of Eq. (10). This space is illustrated in Fig. 8. This space is sometimes referred to in chemometrics as a mixture space, and measurements that lie in such a closed space will be discussed in Chapter 5.
In general, the observations occupy a simplex. A simplex is an N-dimensional shape or object with N+1 apexes. For example, a triangle is a two-dimensional simplex and a tetrahedron is a three-dimensional simplex. The problem of dimensionality is similar to the last example above. Without mean-centring, for instance, the space needed to represent the vectors is three-dimensional; a two-dimensional representation would require the plane of the simplex to pass through the origin of the space.
Fig. 8 The closed data space.
4 Expressing the information in the data
Sections 2 and 3 examined the structure of data within multivariate space. This section examines how that structure can be expressed in another form which leads more readily to exploration and exploitation. The following discussion draws closer links between multidimensional space, vectors, data and statistics.

4.1 Variance and covariance
Variance and covariance, like the mean, are fundamental properties of data. The mean measures the location (in space) of the data; variance measures the dispersion or spread of the data about their mean; and covariance measures the extent to which two variables vary together.
In traditional notation the mean of the jth variable in a data set is calculated as

\bar{x}_j = (1/N) \sum_{i=1}^{N} x_{ij}    (11)

where N denotes the number of objects (observations on the jth variable). In vector notation this is replaced by

\bar{x}_j = (1/N) x_j' 1    (12)

where x_j is a column-vector and 1 is a vector of 1s. The corresponding form for covariance is

v_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)    (13)

where v_{jk} denotes the covariance between variables j and k: when j = k, Eq. (13) becomes the variance. This expression, without the divisor (N-1), is the sum of cross-products; when j = k the expression is the sum of squares. Throughout this section data will be assumed to be mean-centred. This simplifies the notation and avoids some of the dimensionality problems outlined in Section 3.5. Eq. (13) now simplifies to

v_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} x_{ij} x_{ik}    (14)

which can also be written in vector notation

v_{jk} = \frac{1}{N-1} x_j' x_k    (15)

Correlation is closely related to covariance

r_{jk} = \frac{v_{jk}}{v_j v_k}    (16)

where v_j is the standard deviation, i.e. the square root of the variance (v_j = \sqrt{v_{jj}}). Note that the (N-1)s cancel out of Eq. (16) and correlation can be expressed in terms of the sums of squares and cross-products. Note also that Eq. (16) can be rearranged to show the correlation as the covariance between two variables which have been standardized to unit variance.
The variance is used in various chapters of this text, particularly Chapter 5. The use of x'x as a measure of the square of the length of a vector is computationally convenient and particularly useful in areas such as SIMCA (Chapter 7) where many of the proponents have a background in the development of computational algorithms.
An alternative, geometric, view of variance is outlined in Chapter 1, Section 8. In the expressions above there is a clear correspondence between the sums of squares and cross-products and the inner products of vectors. Indeed the standard deviation is, apart from a constant factor, the length of a vector. Moreover, once the (N-1)s are cancelled from Eq. (16), the correlation coefficient emerges as the cosine of the angle between two mean-centred vectors.
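This geometric reading of the correlation coefficient is easy to verify numerically. The following small numpy sketch (an illustration added here, using arbitrary simulated vectors rather than data from the text) compares the cosine of the angle between two mean-centred vectors with the usual correlation coefficient.

```python
import numpy as np

# For mean-centred column vectors the correlation coefficient equals the
# cosine of the angle between them: sums of squares and cross-products
# are simply inner products.
rng = np.random.default_rng(1)
xj = rng.normal(size=50)
xk = 0.5 * xj + rng.normal(size=50)

xj_c = xj - xj.mean()
xk_c = xk - xk.mean()

cos_angle = xj_c @ xk_c / (np.linalg.norm(xj_c) * np.linalg.norm(xk_c))
corr = np.corrcoef(xj, xk)[0, 1]
print(cos_angle, corr)        # identical up to rounding
```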
Q.6
What does this treatment of correlation imply about the relationship between two uncorrelated column vectors in multidimensional space?

4.2 The covariance matrix
In multivariate analysis we generally deal with a number of variables. It is convenient to collect their variances and covariances together in a single matrix
Z = \frac{1}{N-1} X'X    (17)

where X is the mean-centred data matrix.

Q.7
(a) What are the dimensions of Z? (b) Interpret the diagonal and off-diagonal elements of Z.

The matrix Z is often called the variance-covariance (or covariance) matrix and contains information about the scatter of points in multidimensional space. In fact it describes the elliptical covariance structure of the data, as illustrated below.
In Fig. 9 the two variables are correlated, i.e. have non-zero covariance.
Fig. 9 Covariance structure of two variables.
Q.8 Several special cases of covariance structure can be identified. Draw the covariance ellipses of the following matrices and comment on them.
(Four 2 × 2 covariance matrices: the first with non-zero off-diagonal elements; the second and third diagonal, the third being the identity matrix; and the fourth with one zero diagonal element.)
The covariance matrix expresses the covariance structure of the data relative to the original axes of the space, and the ellipses it defines are oriented relative to these axes. Using the principles of Section 2 we can find a new set of axes which present a different orientation of the ellipses and hence a different, simpler, expression of the covariance structure. This can be achieved by a latent root and vector decomposition of the covariance or sums of squares and products matrix, also termed an eigenvalue-eigenvector decomposition.

5 Decomposition of data
In matrix algebra there are a number of decompositions. The latent root and vector decomposition is one of them; an equivalent decomposition is the singular value decomposition. Although they operate in different ways, they both result in the same decomposition of the data into its components and will be examined here.

5.1 Principal axes
The covariance matrix expresses the elliptical covariance structure. Q.8 gives an indication of how this structure can be exploited to find a set of new axes which simplify the covariance structure. Consider the elliptical covariance structure illustrated above: we can locate two new axes aligned with the ellipse, as shown in Fig. 10. These are the principal axes of the ellipse.
Fig. 10 Principal axes of an ellipse.
The rotation matrix for the new axes can be defined in matrix form following Section 2 and the points projected onto these axes to form two new variables. The new variables are typically called the principal components. They allow the covariance to be expressed in the simpler, diagonal, form

G = diag(g_1, g_2, ..., g_M)    (18)

where g_a is the variance along the ath principal axis and the new variables are uncorrelated. The matrix G contains the eigenvalues of the principal components along the main diagonal. It is referenced in several other chapters of this text. In Chapter 7, Section 3 this matrix is used for calculation of principal components. Chapter 5, Section 3 discusses how the values of g_a can be employed to estimate the number of significant components. These concepts are introduced slightly differently in Chapter 1, Section 13. The matrix V_AA is identical to the matrix G described in this section.

5.2 Latent roots and vectors
The example above is trivial and the new axes and rotation matrix can be found graphically. In practice the problem must be solved analytically by a latent root and vector decomposition of the covariance matrix, also called an eigenvalue-eigenvector decomposition. The problem can be structured in several ways which lead to the same result, and are discussed elsewhere. Here we will concentrate on the results. The latent root and vector decomposition is defined by two equations

|Z - g_a I| = 0    (19)

(Z - g_a I) p_a = 0    (20)

where Z is an (M × M) variance-covariance matrix (sometimes data can be scaled so that each variable is standardised to equal variance down the columns, in which case the matrix becomes the correlation matrix); the matrix I is the unit matrix, and 0 is a matrix of zeroes. Eq. (19) is a constrained maximization in which g is called the Lagrange multiplier; the g_a are the M latent roots and are obtained as the roots of the polynomial equation of order M defined by the determinant. Eq. (20) defines the corresponding latent vectors p_a of dimension M. Two constraints are placed on the values of the latent vectors: they have unit length and are mutually orthogonal. From the properties of the covariance matrix, the latent roots all have a value greater than or equal to zero. Conventionally and conveniently we may consider them in order of decreasing magnitude. The latent vectors may be correspondingly ordered and collected into a matrix
P = [ p_1  p_2  ...  p_M ]    (21)

Since the columns of P have unit length and are orthogonal, P is an orthogonal matrix (Section 2.2). It defines the rotation from the original axes to the principal axes. Projection follows from this as

T = XP    (22)
where the columns of T can be called the principal component scores. It is sometimes convenient to note that the principal components are linear combinations (sums) of the original variables, and the latent vectors pa are typically called the loadings. A number of results now arise. The covariance matrix of T is the diagonal matrix of latent roots
G = \frac{1}{N-1} T'T    (23)

The latent roots are the variances of the principal components. G is the canonical form of the original covariance matrix, and it can be shown that

trace(G) = trace(Z)    (24)
The sum of the variances of the original variables equals that of the variances of the new variables. The rotation preserves the distances between the points within the M-dimensional space. This equality holds even if some of the latent roots, that is some of the principal components, are zero. A second important result concerns dimensionality. Suppose that the data matrix X of size (N × M) has rank A < M; it follows that the covariance matrix also has rank A. That is, the data lie in an A-dimensional subspace. The last example of Q.8 gives the covariance structure of such a set of one-dimensional data lying in a two-dimensional subspace. The points lie only along the first principal axis of the ellipse; the second, therefore, has a zero latent root. In general, if a matrix has rank A, (M-A) latent roots are zero. In terms of the column vectors, a rank of A implies that there are (M-A) linear dependences among the vectors. The latent vectors corresponding to zero latent roots describe the linear dependences (Chapter 3, Section 4.6). Thus, the latent root and vector decomposition locates the subspace in which the data lie. Projection into this subspace is accomplished by collecting only the first A latent vectors with non-zero latent root in the matrix P of Eqs (21) and (22). The method is termed a decomposition as it decomposes the data into several components. First we can define the inverse projection from the principal axes

X = TP'    (25)
We can expand this to write the data matrix as the sum of independent or orthogonal components

X = t_1 p_1' + t_2 p_2' + ... + t_A p_A'    (26)

It is important to compare this to Eq. (5) of Chapter 7, Section 3. In Eq. (26) of this chapter the matrix X has been column centred. In this section, the principal components completely describe the data. If there are M variables, then, using the treatment in this chapter, A will only be less than M if there is exact linear dependence in the data. Therefore Eq. (26) completely describes the data, and principal components analysis is used to simplify the data and not reduce the dimensionality. In other cases principal components analysis is used for dimensionality reduction, and A might be substantially smaller than M. Under such situations, the equality in Eq. (26) will not exactly hold and there will be error terms. Choice of A under such circumstances is of substantial interest to chemometricians and is discussed in detail in Chapter 5. When the equality of Eq. (26) does not exactly hold we are performing regression analysis, as introduced in Chapter 1, Section 9 and used constantly throughout this text. The estimated value of X will not exactly equal the observed value of X, the difference between the two normally being expressed as a residual sum of squares.
Hence we have A component matrices of X

X = X_1 + X_2 + ... + X_A    (27)
Each of the component matrices is formed from orthogonal information. If only the first A components have non-zero latent roots we can ignore the last (M-A) matrices as they contribute nothing to X. In graphical terms, each component is based on the information from a different dimension of the space.
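The chain of results in this section can be traced numerically. The following numpy sketch (added for illustration; the simulated data and variable names are assumptions, not taken from the text) performs the latent root and vector decomposition of a covariance matrix and checks Eqs (22) to (25).

```python
import numpy as np

# Sketch of the latent root and vector decomposition of Section 5.2.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))
X = X - X.mean(axis=0)                      # mean-centred data matrix (N x M)
N = X.shape[0]

Z = X.T @ X / (N - 1)                       # covariance matrix, Eq. (17)
g, P = np.linalg.eigh(Z)                    # latent roots and vectors
order = np.argsort(g)[::-1]                 # order by decreasing latent root
g, P = g[order], P[:, order]

T = X @ P                                   # scores, Eq. (22)
G = T.T @ T / (N - 1)                       # diagonal matrix of latent roots, Eq. (23)

print(np.allclose(np.diag(G), g))           # latent roots are the score variances
print(np.isclose(np.trace(G), np.trace(Z))) # trace is preserved, Eq. (24)
print(np.allclose(T @ P.T, X))              # inverse projection X = TP', Eq. (25)
```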
5.3 Singular value decomposition
Another way of achieving the previous result is the singular value decomposition, which operates directly on the data matrix, X. X is decomposed into three matrices

X = U G^{1/2} P'    (28)
or using the dimensionality notation introduced in Chapter 3, Section 2.1 for clarity
(N,M)X = (N,A)U (A,A)G^{1/2} (A,M)P'

These are related to the latent roots and vectors of the previous section as follows. G^{1/2} is the matrix whose diagonal elements are the square roots of the eigenvalues, or the square roots of the diagonal elements of G. The matrix UG^{1/2} is identical to the matrix of scores, T.
Eq. (28) is often used by chemometricians, as in Chapter 7, Section 3.
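As a numerical illustration (added here; the random data are an assumption, not taken from the text), the numpy singular value decomposition returns exactly the pieces of Eq. (28), and the product of U with the singular values reproduces the scores T of Eq. (22).

```python
import numpy as np

# Sketch relating Eq. (28) to the scores and latent roots: X = U G^(1/2) P'.
rng = np.random.default_rng(3)
X = rng.normal(size=(8, 3))
X = X - X.mean(axis=0)                       # column-centred data

U, s, Pt = np.linalg.svd(X, full_matrices=False)

T = U * s                                    # U G^(1/2) is the matrix of scores
print(np.allclose(U @ np.diag(s) @ Pt, X))   # the decomposition reproduces X
print(np.allclose(X @ Pt.T, T))              # and T = XP as before

# The squared singular values are the latent roots of X'X;
# dividing by (N - 1) gives the latent roots of the covariance matrix.
print(s**2 / (X.shape[0] - 1))
```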
Singular value decomposition and eigenvalue decomposition are also discussed in Chapter 1, Section 13, but from a slightly different perspective. The matrix S_AA is the same as G^{1/2}, being the matrix whose diagonal elements are the square roots of the eigenvalues. The matrix U_RA is equivalent to U and the matrix U_AC to P'. Note that the notation of Chapter 1 requires both U and P to have columns of unit length. In most situations it is common to quote P as a normalised matrix, as the loadings are usually scaled appropriately, but the scores are normally quoted as T, which equals UG^{1/2} in our notation here, or U_RA·S_AA in the notation of Chapter 1. The other results follow naturally.

6 Final comments
Chapters 3 and 4 have given a brief introduction to the mathematics underlying multivariate methods. Many fascinating topics have not been included and the aim has been to describe the most common manipulations of the vectors and matrices and to illustrate the structure of multidimensional space and data. These underlie a wide variety of methods which exploit the multidimensional structure of data for simplification.
It was not the aim of these chapters to present the mathematics of the methods themselves, although the description of the latent root and vector decomposition is indistinguishable from that of principal components analysis. This is not intentional, simply a reflection on the simplicity of that technique. The latent root and vector decomposition underlies many other techniques: correspondence analysis, canonical variates analysis, canonical correlations analysis and PLS are examples of methods which rely on the principles of this decomposition.
The methods listed above differ from each other in two respects. Principal components analysis, correspondence analysis and canonical variates analysis seek simplification of a single data space; they differ in the way they pre-treat the data. Canonical correlations analysis and PLS seek to explore and simplify the relationships between different data spaces, each defined by a different set of variables. This aspect of multidimensional space has not been explored here, although the principle of projecting into a different space is the same as that given in Section 2 for projection within the same space. Again these two methods differ in the way the data are treated. But in all of these techniques the relationships within or between sets of data are explored and simplified.
7 References
The reader is referred to the references at the end of Chapter 3.
ANSWERS

A.1
(a) To verify Eq. (2), expand the equation and carry out the multiplication to find the identity matrix as the product. (b) From the definition of the inverse matrix, the inverse of an orthogonal matrix is simply its transpose.
A.2
The coordinates of the points relative to axes B are

Y =
-0.8  -1.8
-1.6   0.4
 0.6  -1.1
 0.6   0.3
 0.2   1.8
The distance and the cosine of the angle between the vectors are 4.85 and 0.907. Provided these are preserved there is no constraint on the orientation of the axes.

A.3
Without listing the details of how the subspace is located, the important point to note is that there is equivalence between the dimensionality of the row vectors in column-space and the dimensionality of the column vectors in the row-space.

A.4
Three of the column vectors of the rotation matrix are identical to the original row vectors. The matrix consisting of only these three column vectors defines the rotation to the subspace, giving the identity matrix as their product.

A.5
(a) The dimensionality of the data could be reduced by subtracting 5 from x2. (b) Mean-centring the column vectors of Section 3.2 would reduce the dimensionality to two, since the centroid of the points lies on a plane with the three points. It is worth noting that in this case mean-centring the rows would not have an effect since the dimensionality is already constrained.
A.6
A correlation coefficient of zero translates into perpendicularity in multidimensional space. The vectors are said to be independent or orthogonal.

A.7
(a) If X has M columns, Z is a symmetric (M × M) matrix. (b) The diagonal elements are the variances and the off-diagonal elements are the covariances.

A.8
The covariance structures are as follows: matrices 2 and 3 describe uncorrelated variables; matrix 3, in particular, arises from two standardized uncorrelated variables. Matrix 4 is an example of the covariance matrix of data lying in a (one-dimensional) subspace. The data lie along only one axis; the values on the other are constant.
CHAPTER 5
Data Reduction Using Principal Components Analysis
John M. Deane, AFRC Institute of Food Research, Bristol Laboratory, Langford, Bristol BS18 7DY, U.K.

1 Introduction
Principal Components Analysis (PCA) [1,2] is probably the multivariate statistical technique most widely used by chemometricians today. It is a technique in which a set of correlated variables is transformed into a set of uncorrelated variables (principal components) such that the first few components explain most of the variation in the data. PCA is applied to high-dimensional data sets to identify/display their variation structure, for sample classification, for outlier detection and for data reduction. PCA also forms the basis for the SIMCA classification technique (see Chapter 7), and the partial least squares (PLS) regression technique [3,4] evolved from the NIPALS algorithm [3,5] for performing PCA.
In this chapter we will explore how PCA can be used for data reduction. All real data contains experimental/random noise; PCA will extract some of this error, which will usually be represented by the principal components with smallest size or variance; removal of these components is therefore one form of data reduction. Consideration will be given to both those methods based on knowledge of the experimental error in the data and those requiring no knowledge of the experimental error in the data. While these approaches provide data reduction, by dimensionality reduction, in many applications it is desirable not only to reduce the dimensionality of the data space but also to reduce the physical number of variables which have to be considered or measured in future. This chapter will therefore also discuss and
compare the different variable reduction methods based on the PC model fitted to a data set.
2 Principal components analysis
2.1 Introduction
PCA is introduced qualitatively in Chapter 2. The basis of the matrix algebra of PCA is discussed in Chapter 4, Section 5 and Chapter 1, Section 13. PCA is introduced as a chemical tool in Chapter 7, Section 2. In this text we use the following standard notation. The number of samples or objects is given by N. The number of variables or measurements on each sample is M. Finally, principal components are numbered from 1 to A, where A is, ideally, the true dimensionality of the data.
In order to illustrate the technique of PCA and to demonstrate how important it can be to identify accurately the true dimensionality of a data set (A), let us consider a simple model data set. This data set will also be used in later sections to demonstrate the data reduction techniques discussed. Say we have three pure chemical compounds (X1, X2, X3), which each produce a spectrum of six peaks when analysed by NMR spectroscopy. Say we also produce all possible binary and ternary blends, with equal proportions of each constituent; this results in a total data set of seven samples. These data may be represented in a mixture space. For a b-component mixture this consists of a (b-1)-dimensional figure, often called a simplex, which is the simplest possible geometric figure in that space (a line in 1 dimension, a triangle in 2 dimensions, a tetrahedron in 3 dimensions and so on). The corners of the figure represent pure components; each sample lies somewhere in the mixture space, the geometric distance from the corners indicating the proportion of components in the mixture. For the example discussed here, the mixture space is a triangle as illustrated in Fig. 1, and the seven samples form a regular pattern (often called a simplex lattice) over this mixture space. Table 1 gives the NMR spectra peak heights of these mixtures. The data set is a theoretical one which contains no blending or NMR spectroscopy measurement errors. Closure has important consequences in chemometrics, and is a consequence of data that sums to a constant total. It is discussed in various chapters, including Chapter 3, Section 3.2.1.

Fig. 1 Mixture triangle showing the proportions of the constituents for the seven blends in the theoretical data set.
Table 1. NMR spectra peak heights for the theoretical mixture data set discussed in this section.

            Constituent Proportions      NMR Spectra Peak Heights
Mixture      X1     X2     X3
1            1      0      0       3.00   5.00   6.00   2.00   4.00   2.00
2            0      1      0       2.00   3.00   7.00   9.00   8.00   1.00
3            0      0      1       4.00   8.00   4.00   6.00   9.00   3.00
4            0.5    0.5    0       2.50   4.00   6.50   5.50   6.00   1.50
5            0.5    0      0.5     3.50   6.50   5.00   4.00   6.50   2.50
6            0      0.5    0.5     3.00   5.50   5.50   7.50   8.50   2.00
7            0.33   0.33   0.33    3.00   5.33   5.67   5.67   7.00   2.00
Q.1
The data set presented in Table 1 contains seven rows (N) and six columns (M), so a maximum of six dimensions can be extracted from the data. It is assumed throughout this chapter that data has been collected such that N > M. Given the above information on the construction of the data set, how many real dimensions will be extracted from the mean-centred data and what shape will the reduced data form?
The results from a PCA of the mean-centred data in Table 1, with no scaling or transformations, are presented in Table 2. The reader with access to a PCA package such as SIRIUS should be able to reproduce these results. The eigenvalues (g_a), percent variance accounted for, and PC scores (t_a) all show, as discussed above, that all the variation in the data is accounted for by the first two dimensions. The loadings (p_a) for the third to sixth dimensions define the transformed axes of the mixture space but the scores show that they express none of the variation in the data. The reader is referred to Chapter 4, Section 5 and Chapter 7, Section 2, for more detailed discussion of the notation used in this text for principal components analysis. The eigenvalues are related to the size of each component (a). Each sample (e.g. mixture) has an associated score for each component and each variable (e.g. NMR peak) has an associated loading.
Table 2. Principal components analysis of the theoretical mixture data set presented in Table 1.

                              Principal Component
                      1        2        3        4        5        6
x̄_k                3.000    5.334    5.667    5.667    7.000    2.000
g_a                45.14    29.86     0.00     0.00     0.00     0.00
% Variance         60.18    39.82     0.00     0.00     0.00     0.00
Cumulative %       60.18   100.00   100.00   100.00   100.00   100.00

Loadings
p1                -0.0978  -0.2632  -0.5550  -0.2850  -0.7071   0.1786
p2                -0.1827  -0.6926  -0.0041   0.3226  -0.0000  -0.8187
p3                 0.0849   0.4295  -0.6033   0.6458  -0.0000  -0.1651
p4                 0.8264   0.0130  -0.1178  -0.3383  -0.0000  -0.4342
p5                 0.5073  -0.4440   0.0773   0.4494   0.0000   0.5811
p6                -0.0978  -0.2632  -0.5550  -0.2850   0.7071   0.1786

Scores
t1                -4.4633   1.6578   0.0000   0.0000   0.0000   0.0000
t2                 3.9966   2.3139   0.0000   0.0000   0.0000   0.0000
t3                 0.4652  -3.9733   0.0000   0.0000   0.0000   0.0000
t4                -0.2334   1.9859   0.0000   0.0000   0.0000   0.0000
t5                -1.9991  -1.1578   0.0000   0.0000   0.0000   0.0000
t6                 2.3309  -0.8297   0.0000   0.0000   0.0000   0.0000
t7                 0.0031   0.0032   0.0000   0.0000   0.0000   0.0000
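The two-dimensional structure of Table 1 is easy to confirm with a few lines of numpy. The sketch below (added for illustration; using the SVD is an implementation choice, not the procedure of the text) mean-centres the Table 1 peak heights and computes the eigenvalues as sums of squares of the scores, the convention used in Table 2. Because the entries for mixture 7 are rounded to two decimal places, the last four eigenvalues come out as very small numbers rather than exact zeros.

```python
import numpy as np

# PCA of the mean-centred Table 1 data: eigenvalues as sums of squares
# of the scores, splitting the variance roughly 60% / 40% over two components.
X = np.array([[3.00, 5.00, 6.00, 2.00, 4.00, 2.00],
              [2.00, 3.00, 7.00, 9.00, 8.00, 1.00],
              [4.00, 8.00, 4.00, 6.00, 9.00, 3.00],
              [2.50, 4.00, 6.50, 5.50, 6.00, 1.50],
              [3.50, 6.50, 5.00, 4.00, 6.50, 2.50],
              [3.00, 5.50, 5.50, 7.50, 8.50, 2.00],
              [3.00, 5.33, 5.67, 5.67, 7.00, 2.00]])

Xc = X - X.mean(axis=0)
g = np.linalg.svd(Xc, compute_uv=False) ** 2   # eigenvalues of Xc'Xc
print(np.round(g, 2))                          # approx. 45.14, 29.86, ~0, ...
print(np.round(100 * g / g.sum(), 2))          # approx. 60.18, 39.82, 0, ...
```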
Q.2
Using the results presented in Table 2 produce a PC scores plot of the first two dimensions extracted from the data to verify your answer to Q.1.

2.2 The effect of random experimental error within the data
As discussed above all 'real' data sets contain some random/experimental error and it is the level of this error which can mask the identification of the true dimensionality of a data set. Malinowski [6,7] showed that the error associated with a data set can be split into two sources: imbedded error and extracted error. Extracted error is error which is contained within the minor PC dimensions ((A+1)th, (A+2)th, ..., Mth) and can therefore be removed, or extracted, from the data by retaining only the first A dimensions. Imbedded error is error which mixes into the factor scheme and is contained within the first A dimensions: this error can never be completely removed from a data set but may be scaled to a minimum [8]; so even a data set reproduced from the true number of PCs (A) contains some error. Therefore, the level of imbedded error within a data set will affect the reproduction of the data space.
Q.3
How will levels of imbedded error in a mixture data set affect the reproduction of a mixture space? Discuss the answer referring to a mixture of b constituents.

While imbedded error affects the quality of data reproduction, what is the effect of over- or underestimating the level of extracted error in a data set? If A is underestimated then important information on the data structure has been erroneously removed as extracted error; but if A is overestimated experimental error is included in the PC model which could, depending on its level, cloud one's understanding of the structure contained within the data.

Q.4
What are the effects of under- and overestimating A in our example of a mixture data set?
The importance of accurately estimating A depends largely on the aims of the statistical analysis and the level of error in the data. If the aim is a quick analysis to gain a feel for the data then the effect of including additional dimensions will depend solely on the number of additional dimensions included and the level of error associated with them. However, in a situation where PCA is applied to, say,
mixture data where one may want to use the PC model to identify the constituents, and estimate their proportions in the mixtures [7,9,35], then an accurate estimate of A is important.
Table 3. NMR spectra peak heights for the theoretical mixture data set presented in Table 1 with the addition of random noise.

            Constituent Proportions      NMR Spectra Peak Heights
Mixture      X1     X2     X3
1            1      0      0       2.70   4.30   5.70   2.30   4.60   1.40
2            0      1      0       2.60   3.70   7.60   9.10   7.40   1.80
3            0      0      1       4.30   8.10   4.20   5.70   8.40   2.40
4            0.5    0.5    0       2.50   3.50   6.50   5.40   5.60   1.50
5            0.5    0      0.5     4.00   6.20   5.40   3.70   7.40   3.20
6            0      0.5    0.5     3.10   5.30   6.30   8.40   8.90   2.40
7            0.33   0.33   0.33    3.20   5.03   6.27   5.27   7.80   1.70
Let us return to our example of Table 1 to see the effect that the inclusion of random noise has on the PC model formed from the data. Table 3 presents the mixture data from Table 1 with the addition of an artificial error matrix with a "true root mean square error" (see Section 3.1.2), i.e. the difference between the raw data and the pure data, equal to 0.4918. In mixture experiments the noise could be made up of mixture blending errors, variation in sample size, the presence of a contaminant or measurement errors associated with the instrument and other sources of imprecision.
The results from PCA of the mean-centred data in Table 3, with no scaling or transformations, are presented in Table 4.
Q.5
Using the results presented in Table 4 produce a PC scores plot of the first two dimensions extracted from the data to verify your answer to Q.3.
A quick comparison of Tables 2 and 4 shows that the inclusion of experimental error into the data set has resulted in the extraction of four error dimensions. It can be seen from the tables that 4.07% of the data in Table 3 is extractable error by reproducing the data using only the first two PCs. The level of imbedded error in the data cannot be estimated but its effect can be seen by comparing the plots produced in response to Qs 2 and 5.
Table 4. Principal components analysis of the mixture data set presented in Table 3.

                              Principal Component
                      1        2        3        4        5        6
x̄_k                3.200    5.161    5.996    5.696    7.157    2.057
g_a                43.09    30.05     2.02     0.89     0.17     0.02
% Variance         56.52    39.41     2.65     1.17     0.22     0.03
Cumulative %       56.52    95.93    98.58    99.75    99.97   100.00

Loadings
p1                 0.0319   0.3053   0.1089   0.1884  -0.5085   0.7745
p2                 0.0449   0.7075  -0.3227   0.0625  -0.3760  -0.4976
p3                -0.1987  -0.3844   0.4628  -0.0386  -0.6896  -0.3487
p4                -0.8904  -0.0806  -0.3638   0.2529   0.0104   0.0648
p5                -0.4048   0.4541   0.5102  -0.5670   0.2149   0.0449
p6                -0.0275   0.2129   0.5266   0.7574   0.2798  -0.1574

Scores
t1                 4.0809  -1.6762  -0.3285  -0.0433   0.0135  -0.0872
t2                -3.5261  -2.0530  -0.1014   0.2622  -0.2361  -0.0248
t3                 0.0076   3.7431  -0.8464   0.0164  -0.0629   0.0181
t4                 0.7118  -2.3854  -0.2869   0.1309   0.1393   0.1072
t5                 1.8379   1.7227   0.9281   0.4618  -0.0353   0.0122
t6                -3.1803   0.5971   0.1710  -0.0665   0.2874  -0.0528
t7                 0.0682   0.0517   0.4641  -0.7615  -0.1060   0.0272
3 Data reduction by dimensionality reduction
PCA was presented above as a method in which a model is fitted to a set of data points x_ik. Data reduction can therefore be achieved by reducing the dimensionality of the data space by removing the error (e_ik) associated with the data (i.e. retaining only the first A PCs); this error is a mixture of experimental error and random deviation from the fitted model. Various techniques have been developed to identify the true dimensionality of the data; these techniques can be classified into two categories as follows:
(a) Methods based upon a knowledge of the experimental error of the data.
(b) Approximate methods requiring no knowledge of the experimental error of the data.
The majority of the second group of methods are empirical functions, but in the last decade attempts have been made to develop methods based on formal statistical tests.
3.1 Methods based on knowledge of the experimental error of the data
If information is known about the experimental error associated with a data set then the techniques presented in this section are to be preferred to those requiring no knowledge of the error structure.

3.1.1 Residual standard deviation
The residual standard deviation (RSD) of a matrix [7] can be used to identify the true dimensionality of a data set. RSD is a measure of the lack of fit of a PC model to a data set and is calculated by

RSD = \sqrt{ \sum_{a=A+1}^{M} g_a / (N(M-A)) }    (1)

if the PCA is performed via the covariance matrix, where g_a is the eigenvalue associated with the ath PC dimension, N is the number of samples, M is the number of variables and A is the PC dimension being scrutinized. If PCA is performed on a data set via its correlation matrix then the RSD is calculated as:
RSD = \sqrt{ \sum_{a=A+1}^{M} g_a / (NM(M-A)) }    (2)
Principal components analysis is discussed in other chapters. The difference between using the correlation and the covariance matrix is as follows. Before using the correlation matrix the data are scaled by dividing the columns by their standard deviations (it is assumed that the columns are already mean-centred). Using the covariance matrix, the columns are not scaled.

The true dimensionality of a data set (A) is the number of dimensions required to reduce the RSD to be approximately equal to the estimated experimental error of the data.
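The RSD is a simple function of the eigenvalues. The sketch below (added for illustration, assuming the covariance-matrix form of Eq. (1) as reconstructed above; the function name is arbitrary) computes it for successive trial dimensionalities from a list of eigenvalues such as the g_a row of Table 4.

```python
import numpy as np

# RSD of Eq. (1): pooled error eigenvalues divided by N(M - A), square-rooted.
def rsd(g, n_samples, A):
    g = np.sort(np.asarray(g, dtype=float))[::-1]   # eigenvalues, largest first
    M = g.size
    return np.sqrt(g[A:].sum() / (n_samples * (M - A)))

g = [43.09, 30.05, 2.02, 0.89, 0.17, 0.02]          # Table 4 eigenvalues
for A in range(1, 6):
    print(A, round(rsd(g, 7, A), 3))
```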
Q.6
Use the residual standard deviation (RSD) method to estimate the true dimensionality of the mixture data presented in Table 3 using the PC results presented in Table 4.
3.1.2 Root mean square error
The root mean square (RMS) error of a data matrix [7] is defined by

RMS = \sqrt{ \sum_{i=1}^{N} \sum_{j=1}^{M} (x_{ij} - \hat{x}_{ij})^2 / (NM) }    (3)

where

\hat{x}_{ij} = \bar{x}_j + \sum_{a=1}^{A} t_{ia} p_{aj}    (4)

and where scores are denoted by t and loadings by p.

The notation for principal components analysis and the use of scores and loadings is described in various chapters. In Chapter 4, Section 5 PCA is discussed in terms of singular value decomposition. PCA as a method for model building is discussed in Chapter 7, Section 3. Both sections use the same notation as here.
An alternative way of expressing Eq. (3) is as follows:

RMS = \sqrt{ \sum_{a=A+1}^{M} g_a / (NM) }    (5)

Like the RSD, the true dimensionality of a data set (A) is the number of dimensions required to reduce the RMS to be approximately equal to the experimental error of the data. Comparing Eqs (1) and (5) and simplifying yields:

RMS = RSD \sqrt{ (M-A)/M }    (6)
Although related, the RMS and RSD of a data set measure different sources of error. The RMS measures the difference between the raw data and reproduced data using A PC dimensions. The RSD however measures the difference between the raw and the pure data containing no experimental error. Given the different basis of the two methods the RSD is to be preferred.
Q.7
Use the root mean square (RMS) error method to estimate the true dimensionality of the mixture data presented in Table 3 using the PC results presented in Table 4.

3.1.3 Average error criterion
An alternative method for identifying the true dimensionality of a data matrix is the average error criterion [7]. The average error is simply the average of the absolute values of the differences between the raw and reproduced data

average error = \sum_{i=1}^{N} \sum_{j=1}^{M} | x_{ij} - \hat{x}_{ij} | / (NM)    (7)
The true dimensionality of a data matrix (A) is the number of dimensions required to reduce the average error to be approximately equal to the estimated average error of the data.
Q.8
Table 5 gives the average error for the different degrees of PC model used to reproduce the mixture data in Table 3; use this information and the average error criterion to estimate the true dimensionality of the mixture data.
Table 5. Average error of the principal components extracted from the mixture data in Table 3.

PC Dimension (A)    Average Error
1                   0.6844
2                   0.2157
3                   0.1077
4                   0.1395
5                   0.1289
3.1.4 Chi-squared criterion
For data sets where the standard deviation varies from one data point to another and is not constant, Bartlett [10] proposed a chi-squared (χ²) criterion. This method takes into account the variability of the error from one data point to the next, but has the major disadvantage that one must have a reasonably accurate error estimate for each data point. This is defined as

χ² = \sum_{i=1}^{N} \sum_{j=1}^{M} (x_{ij} - \hat{x}_{ij})^2 / σ_{ij}^2    (8)

where σ_{ij} is the standard deviation associated with the measurable x_{ij} and \hat{x}_{ij} is the reproduced data using A PC dimensions. The criterion is applied in an iterative manner (A = 1, 2, ..., M) and the true dimensionality of the data is the first value of A at which χ² < (N - A)(M - A), as χ²(expected) = (N - A)(M - A).

The χ² distribution is discussed elsewhere in this text (e.g. Chapter 2, Section 2). Often a target value of χ² is chosen, after which the particular test is regarded as significant.
3.1.5 Distribution of misfit
Another method for determining the dimensionality of a data space involves studying the number of misfits between the observed and reproduced data sets as a function of the number of PCs employed. A reproduced data point is classified as a misfit if its deviation from the observed value is three or more times greater than the standard deviation, σ, estimated from experimental information. The true dimensionality of the data is therefore the number required for data reproduction so that no data points, or only a user-specified proportion, are designated misfits. For the mixture data set, with the addition of an artificial error matrix (Table 3), when one component is employed to reproduce the data three points have misfits > 3σ. With two or more components employed no misfits were greater than 3σ. Hence we conclude that there are two real components (three constituents) within the data. This is in accord with the known facts of how the mixture data was constructed.
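A minimal sketch of this misfit count is given below (added for illustration; the use of the SVD for the A-component reproduction and the function name are assumptions). It reproduces the data with A components, as in Eq. (4), and counts residuals larger than three times a supplied error standard deviation.

```python
import numpy as np

def count_misfits(X, A, sigma):
    # Reproduce X with an A-component PC model of the mean-centred data,
    # then count residuals larger than 3*sigma.
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    U, s, Pt = np.linalg.svd(X - mean, full_matrices=False)
    X_hat = mean + (U[:, :A] * s[:A]) @ Pt[:A, :]
    return int(np.sum(np.abs(X - X_hat) > 3 * sigma))
```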
3.2 Methods based on no knowledge of the experimental error of the data
If no knowledge of the experimental error associated with the data is available then one of the following techniques has to be applied to approximate the true dimensionality of the data, although the results obtained from these could be used to approximate the size of the error contained in the data [7]. Most of the techniques presented here are empirical functions, but in the last decade attempts have been made to develop methods based on formal statistical tests.

3.2.1 Cumulative percent variance
The cumulative percent variance is a measure of the total variance which is accounted for by data reproduction using A PC dimensions, and is defined as

cumulative percent variance = 100 ( 1 - \sum_{i=1}^{N} \sum_{j=1}^{M} (x_{ij} - \hat{x}_{ij})^2 / \sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij}^2 )    (9)

where \hat{x}_{ij} is the reproduced data point using A PC dimensions and x_{ij} is the raw experimental data point. However it can be shown that the cumulative percent variance can be expressed in terms of the eigenvalues of the data matrix

cumulative percent variance = 100 \sum_{a=1}^{A} g_a / \sum_{a=1}^{M} g_a    (10)

For this reason the method is also known as the percent variance in the eigenvalue. The cumulative percent variance criterion is to accept the set of largest eigenvalues (PC dimensions) which account for a specified proportion of the variance. The question arises as to how much variance in the data must be accounted for. Arbitrary specifications such as 90%, 95% or 99% variance have been suggested but do not provide reliable estimates for judging the true dimensionality of the data. This method can only be used to identify the true dimensionality of a data set if an accurate estimate of the true variance in the data exists, and in practice this cannot be achieved without a knowledge of the errors in the data.
Q.9
Use the cumulative percent variance method to estimate the true dimensionality of the mixture data presented in Table 3 using the PC results presented in Table 4.

3.2.2 Average eigenvalue
The average eigenvalue criterion [7,11] is based upon retaining only those PC dimensions whose eigenvalues are above the average eigenvalue. If the PCA is performed via its correlation matrix the average eigenvalue will be unity as the variance of each variable is unity. Therefore only those dimensions whose eigenvalue > 1 should be retained. For this reason this method is also known as the eigenvalue-one criterion.

PCA has been discussed in several chapters of this text. Using a correlation matrix involves standardizing each variable prior to PCA, i.e. dividing the columns by their standard deviations. Scaling is discussed in Chapters 2 and 7, and a simple example is given in Chapter 2, Section 3.

Q.10
Use the eigenvalue criterion to estimate the true dimensionality of the mixture data presented in Table 3 using the PC results presented in Table 4.
3.2.3 Scree test
The scree test [7,12] for identifying the true dimensionality of a data set is based on the observation that the residual variance should level off before those dimensions containing random error are included in the data reproduction. The residual variance associated with a reproduced data set is defined as

residual variance = \sum_{i=1}^{N} \sum_{j=1}^{M} (x_{ij} - \hat{x}_{ij})^2 / (NM)    (11)

which is equal to the square of the RMS error. The residual variance can be presented as a percentage as

residual percent variance = 100 \sum_{i=1}^{N} \sum_{j=1}^{M} (x_{ij} - \hat{x}_{ij})^2 / \sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij}^2    (12)

In terms of the eigenvalues of the data matrix, this expression can be converted to

residual percent variance = 100 \sum_{a=A+1}^{M} g_a / \sum_{a=1}^{M} g_a    (13)
When the residual percent variance is plotted against the number of PC dimensions used in the data reproduction, the curve should drop rapidly and level off at some point. The point where the curve begins to level off, or where a discontinuity appears, is taken to be the dimensionality of the data space. This is explained by the fact that successive eigenvalues (PC dimensions) explain less variance in the data, and hence the residual percent variance drops continually. However, the error eigenvalues will be equal if the experimental error associated with the data is truly random, and hence the successive drops in the residual percent variance will be equal. A discontinuity appears in situations where the errors are not random; in such situations PCA exaggerates the non-uniformity in the data as it aims to explain the variation in the data.
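The residual percent variance is a one-line function of the eigenvalues. The following sketch (an added illustration, not from the text) computes it from the Table 4 eigenvalues; plotting the values against A gives the scree plot described above.

```python
import numpy as np

# Residual percent variance from the eigenvalues (eigenvalue form given above).
def residual_percent_variance(g, A):
    g = np.sort(np.asarray(g, dtype=float))[::-1]
    return 100.0 * g[A:].sum() / g.sum()

g = [43.09, 30.05, 2.02, 0.89, 0.17, 0.02]       # Table 4 eigenvalues
print([round(residual_percent_variance(g, A), 2) for A in range(1, 6)])
```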
Q.11
(a) Use the scree test to estimate the true dimensionality of the mixture data presented in Table 3 using the PC results presented in Table 4. (b) Produce a graph of residual percentage variance against the number of PCs used in data reproduction (this is commonly called a scree plot).
3.2.4 Exner function
Kindsvater et al. [13] suggested the Exner psi function (ψ) as a method for identifying the true dimensionality of a data set. This function is defined as

ψ = \sqrt{ \sum_{i=1}^{N} \sum_{j=1}^{M} (x_{ij} - \hat{x}_{ij})^2 / \sum_{i=1}^{N} \sum_{j=1}^{M} (x_{ij} - \bar{x})^2 }    (14)

Here \bar{x} represents the overall mean of the data and \hat{x}_{ij} is the reproduced data using the first A PCs. The function can vary from zero to infinity, with the best fit approaching zero. A ψ equal to 1.0 is the upper limit for significance as this means the data reproduction using A PC dimensions is no better than saying each point is equal to the overall data mean. Exner proposed that 0.5 be considered the largest acceptable value, because this means the fit is twice as good as guessing the overall mean for each data point. Using this reasoning, ψ = 0.3 can be considered a fair correlation, ψ = 0.2 can be considered a good correlation and ψ = 0.1 an excellent correlation.
Q.12
Table 6 gives the Exner function (ψ) for different degrees of PC model used to reproduce the mixture data in Table 3; use this information and the Exner function criterion to estimate the true dimensionality of the mixture data.
Table 6. Exner function (ψ) of the principal components extracted from the mixture data in Table 3.

PC Dimension (A)    Exner Function (ψ)
1                   0.4101
2                   0.1271
3                   0.0760
4                   0.0940
3.2.5 Imbedded error function
The imbedded error (IE) function [6,7,14] is an empirical function developed to identify those PC dimensions containing error without relying upon an estimate of the error associated with the data matrix. The imbedded error is a function of the error eigenvalues and takes the following form:

IE = \sqrt{ A \sum_{a=A+1}^{M} g_a / (NM(M-A)) }    (15)
Examination of the behaviour of the IE function, as A varies from 1 to M, can be used to deduce the true dimensionality of the data. The IE function should decrease as the true dimensions are used in the data reproduction. However, when the true dimensions are exhausted, and the error dimensions are included in the reproduction, the IE should increase. This should occur because the error eigenvalues are the sums of the squares of the projections of the error points on the error axes. If the errors are uniformly distributed, then their projections onto the error dimensions should be approximately equal, i.e. g_{A+1} ≈ g_{A+2} ≈ ... ≈ g_M. Therefore Eq. (15) reduces to

IE = C \sqrt{A}    (16)

for A > true A, where C is given by

C = \sqrt{ g^0 / (NM) }    (17)

in which g^0 denotes the common value of the error eigenvalues.
These equations only apply when an excessive number of dimensions have been used to reproduce the data. Eq. (16) shows that the IE will actually increase once the true number of dimensions is exceeded. However, a steady increase in the IE function, once the true dimensionality is passed, is rarely observed as PCA exaggerates any non-uniformity that exists in the error distribution; this results in non-equal error eigenvalues. As the IE function is based on detecting the error eigenvalues, the minimum in the function will not be clearly observed if: (a) the errors are not fairly uniform throughout the data; (b) the errors are not truly random; and (c) systematic errors exist.
In such cases a number of local minima may be observed.
Q.13
Use the imbedded error (IE) function to estimate the true dimensionality of the mixture data presented in Table 3 using the PC results presented in Table 4.
3.2.6 Factor indicator function
The factor indicator (IND) function [6,7,14] is an empirical function which appears more sensitive than the IE function for identifying the true dimensionality of a data matrix. The function is composed of the same components as the IE function, and is defined as

IND = RSD / (M - A)^2    (18)

where RSD is the residual standard deviation defined in Eq. (1). This function, like the IE function, reaches a minimum when the correct number of PC dimensions have been employed in the data reproduction. However, it has been seen that the minimum is more pronounced and can often occur in situations where the IE function exhibits no minimum.

Q.14
Use the factor indicator (IND) function to estimate the true dimensionality of the mixture data presented in Table 3 using the PC results presented in Table 4.
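Both empirical functions depend only on the eigenvalues, N, M and the trial dimensionality A. The sketch below (added for illustration, assuming the forms of Eqs (1), (15) and (18) as given above; the function names are arbitrary) evaluates them for the Table 4 eigenvalues.

```python
import numpy as np

def ie(g, n_samples, A):
    # Imbedded error, Eq. (15)
    g = np.sort(np.asarray(g, dtype=float))[::-1]
    M = g.size
    return np.sqrt(A * g[A:].sum() / (n_samples * M * (M - A)))

def ind(g, n_samples, A):
    # Factor indicator function, Eq. (18): RSD of Eq. (1) divided by (M - A)^2
    g = np.sort(np.asarray(g, dtype=float))[::-1]
    M = g.size
    rsd = np.sqrt(g[A:].sum() / (n_samples * (M - A)))
    return rsd / (M - A) ** 2

g = [43.09, 30.05, 2.02, 0.89, 0.17, 0.02]   # Table 4 eigenvalues
for A in range(1, 6):
    print(A, round(ie(g, 7, A), 4), round(ind(g, 7, A), 4))
```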
3.2.7 Malinowski F-test
As a progression from the IND and IE functions, and to formulate a test with a statistical basis, Malinowski [15] developed a test for determining the true dimensionality of a data set based on the Fisher variance ratio test (F-test). The F-test is a quotient of two variances obtained from two independent pools of samples that have normal distributions. As the eigenvalues obtained from a PCA are orthogonal, the condition of independence is satisfied. It is common to assume that the residual errors in the data e_{ij} have a normal distribution; if this is true, then the variance expressed by the error eigenvalues should also follow a normal distribution. This will not be the case if the errors in the data are not uniform or if systematic errors exist. The pooled variance of the error eigenvectors, V_A, is obtained by dividing the sum of the error eigenvalues by the number of pooled vectors, M - A.
V_A = \sum_{a=A+1}^{M} g_a / (M - A)    (19)

V_A measures the variance in the error vectors A+1 to M and not the variance in the residual data points. The true eigenvectors contain structural information as well as some experimental error (imbedded error) and therefore have eigenvalues which are statistically greater than the pooled variance of the error eigenvalues. Hence the following variance ratio can be applied to test whether or not the Ath eigenvector, associated with the next smallest eigenvalue, belongs to the set of error vectors comprised of the small eigenvalues

F(1, M-A) = g_A / V_A    (20)
In order to determine the true dimensionality of the data (A), Eq. (20) is employed as follows. First set the error eigenvector set equal to the dimension with the smallest eigenvalue, g_M. The F-test is then applied to test the next smallest eigenvalue (g_{M-1}) for significance by comparing its variance to the variance of the error set. If the calculated F is less than the tabulated F, at a user-chosen level of significance, the eigenvalue under test (g_{M-1}) is added to the error eigenvector set and the process is repeated with the next smallest eigenvalue (g_{M-2}). This process of testing is repeated until the variance ratio of the Ath eigenvalue exceeds the tabulated F-value, marking the division between the true and error eigenvectors.

The F-test is often used by chemometricians. Other examples of its use are given in Chapter 7, Section 6 and in Chapter 8. The ratio of two variances is compared for significance. Most standard statistical texts provide tables of significance. The larger the ratio, the more significant one variance is in relation to another, and these tables provide probabilities that one variance really is significant in relation to another variance. The number of degrees of freedom for each variance needs to be taken into account.

An improvement to this approach has also been given by Malinowski [15,16] by considering the statistical distribution of the error eigenvalues (i.e. those that represent pure noise). We do not have the space to discuss the theory of this approach here, but the formula of Eq. (20) is altered to give
F(1, M-A) = \frac{ g_A / [(N-A+1)(M-A+1)] }{ \sum_{a=A+1}^{M} g_a / \sum_{a=A+1}^{M} (N-a+1)(M-a+1) }    (21)
The overall advantage of using F-tests over the more empirical approaches of Sections 3.2.1 to 3.2.6 is that they do take into account models for the noise or errors in the data, although these models usually assume that the noise follows a normal distribution. The early approaches were proposed because, in practice, they worked, but they were not able to incorporate detailed knowledge of the sources of error.
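The iterative use of the variance ratio can be sketched in a few lines. The example below (added for illustration, not the author's code; it implements the simple ratio of Eq. (20) rather than the corrected form of Eq. (21), and uses scipy only to look up tabulated F-values) adds eigenvalues to the error pool from the smallest upwards until one is found significant.

```python
import numpy as np
from scipy import stats

# F-test of Eq. (20): test the Ath eigenvalue against the pooled variance
# of the eigenvalues already assigned to the error set, with (1, M - A) df.
def malinowski_f_test(g, alpha=0.05):
    g = np.sort(np.asarray(g, dtype=float))[::-1]
    M = g.size
    for A in range(M - 1, 0, -1):            # start with the smallest eigenvalues
        V_error = g[A:].sum() / (M - A)      # pooled error variance, Eq. (19)
        F = g[A - 1] / V_error               # Eq. (20)
        if F > stats.f.ppf(1 - alpha, 1, M - A):
            return A                         # Ath eigenvalue is significant
    return 0
```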
Q.15
Use the Malinowski F-test (Eq. (21)) to estimate the true dimensionality of the mixture data presented in Table 3 using the PC results presented in Table 4. A table of F-values is required to answer this question; alternatively such values may easily be obtained from many modern numerical software packages.

3.2.8 Cross-validation
Cross-validation as part of the SIMCA strategy is discussed in Chapter 7, Section 6. There are a large number of possible approaches to cross-validation, many not as yet in common use in chemometrics, which are discussed in this section. The approach of Wold (discussed in Chapter 7) is the most popular method used in chemometrics. Cross-validation was first proposed by Stone [17] as a method to identify the best model for statistical prediction. Wold [18] adopted the technique for PCA to identify those PC dimensions with best predictive ability for sample prediction. Cross-validation for PCA has since been implemented by Eastment and Krzanowski [19,20] and Osten [21]. However, the aim of the PC model ('true' dimensionality of the data set) identified by cross-validation is different to that identified by the techniques described above. The IND, IE, scree test etc. all aim to identify the break point between those dimensions containing sample structure/information, with some imbedded error (first A dimensions), and those dimensions which contain purely experimental/random errors ((A+1)th, ..., Mth dimension). Cross-validation, however, aims to identify those dimensions (first A) with best predictive ability, to use in the formulation of a PC model for samples from a known class; the PC model is then to be used to classify unknown samples to the class.
The basic method of cross-validation is to divide a data matrix X into a number of groups (preferably as small as possible). Each group is deleted in turn from the data and a PCA performed on the reduced data set. The deleted values are then predicted using the PC model parameters obtained. Some suitable criterion of goodness of fit relating the actual (x_{ij}) and predicted (\hat{x}_{ij}) values, summed over all groups, is then used to identify the optimal PC model for the data. The different implementations of cross-validation [18,19,21] vary in the proportion of data in each predictor group and the criterion of goodness of fit used for model selection.
3.2.8.1 The Wold implementation of cross-validation
Wold's implementation of cross-validation is based on the NIPALS [3,5] algorithm, which is an iterative method for calculating PCs; this makes it ideal for microcomputer implementation. Wold divides the data matrix into N_g groups (3 ≤ N_g ≤ 10) which consist of diagonal elements of the data matrix. The criterion of goodness of fit is to scrutinize the ratio between the total PRESS (predicted residual error sum of squares) after fitting the Ath PC dimension,

PRESS(A) = \sum_{i} \sum_{j} (x_{ij} - \hat{x}_{ij})^2    (22)

summed over the deleted elements, and the RSS (residual sum of squares) before fitting that PC dimension, and to compare this ratio with unity.
The Wold implementation of cross-validation is also discussed in Chapter 7, Section 6, in the context of SIMCA. In this chapter we remain with the notation introduced above. An alternative, matrix based, notation is employed in Chapter 7.
It is important to realise that if 25% of the data is deleted, so that N_g = 4 (see above), then there will be four separate calculations to give the total predicted residual sum of squares of Eq. (22), as follows:
(a) Delete 25% of the data.
(b) Calculate the principal component in the absence of the 25% of the data.
(c) Predict the values of x_{ik} for this 25% of the data.
(d) Leave out another 25% of the data, repeating from step (a) until this has been done 4 times, then calculate PRESS.
The residual sum of squares is defined as follows:

RSS(A-1) = \sum_{i=1}^{N} \sum_{j=1}^{M} (x_{ij} - \hat{x}_{ij})^2    (23)

where \hat{x}_{ij} is the data obtained using (A - 1) PCs (extracted from the complete data matrix). A PC component (A) is considered significant and containing 'systematic' information if
PRESS(A)/RSS(A-1) < 1
Fig. 2 Deletion pattern for the first group in the Wold implementation of cross-validation. The second group contains elements one pseudo-diagonal along, etc.
As the number of dimensions approaches the true number of significant
components the sample prediction should improve with the inclusion of additional dimensions and hence PRESS(A) will be < RSS(A - 1) and the ratio less than 1. However, as the true significant number of dimensions is passed, noise is included in the PC model which will lead to a decrease in the goodness of fit of the sample predictions, and hence PRESS(A) will be > RSS(A - 1) and thus the ratio greater than 1. In some texts [22] on cross-validation the criterion of fit used is to scrutinize PRESS(A) against PRESS(A - 1), instead of against the actual RSS(A - 1), and compare this ratio with unity.
To allow for the decreasing number of degrees of freedom with increasing model dimensionality, Wold and Sjöström [23] recommended that the ratio PRESS(A)/RSS(A - 1) be compared with an empirical function Q rather than unity.
Q = \sqrt{ \frac{(M-A)(N-A-1)}{(M-A-1)(N-A)} }

where N is the number of samples, M is the number of variables and A is the PC dimension being scrutinized. This 'improved' statistic is more conservative: in situations where the improvement in sample prediction is not worth the inclusion of additional terms in the PC model, PRESS(A) may be less than RSS(A - 1) but with Q < PRESS(A)/RSS(A - 1) < 1; in such situations the dimension will be identified as non-systematic.

Q.16
Table 7 gives the cross-validation results (Wold implementation) for the different degrees of PC model used to reproduce the mean-centred mixture data in Table 3; use this information and the cross-validation criterion (Wold) to estimate the true dimensionality of the mixture data. These data may be obtained using the SIRIUS package as described elsewhere in this text.
Table 7. Cross-validation (Wold implementation) results for the principal components extracted for the mixture data in Table 3.

PC Dimension (A)    PRESS(A)/RSS(A-1)    Q
1                   1.11                 0.8571
2                   0.83                 0.8333
3                   1.31                 0.8000
4                   1.38                 0.7500
5                   1.89                 0.6667
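A compact way to apply the Wold criterion is sketched below (added for illustration; the function name is arbitrary). Each component is flagged as 'systematic' when its PRESS(A)/RSS(A - 1) ratio falls below the chosen limit, either unity or the more conservative Q of Table 7.

```python
# Sketch of the Wold decision rule: flag a component as systematic when
# PRESS(A)/RSS(A-1) falls below the chosen limit (unity, or the Q function).
def systematic_flags(press_rss_ratios, limits):
    return [ratio < limit for ratio, limit in zip(press_rss_ratios, limits)]

ratios = [1.11, 0.83, 1.31, 1.38, 1.89]            # PRESS(A)/RSS(A-1), Table 7
q = [0.8571, 0.8333, 0.8000, 0.7500, 0.6667]       # Q values, Table 7
print(systematic_flags(ratios, [1.0] * 5))         # compared with unity
print(systematic_flags(ratios, q))                 # compared with Q
```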
3.2.8.2 The Eastment and Krzanowski implementation of cross-validation
The Eastment and Krzanowski implementation [19] of cross-validation is to use a function of the difference between successive PRESS values, which is analogous to testing the effect of including additional terms in stepwise regression analysis

W_A = \frac{ [PRESS(A-1) - PRESS(A)] / D_A }{ PRESS(A) / D'_A }
where D_A is the number of degrees of freedom required to fit the Ath component and D'_A is the number of degrees of freedom remaining after fitting the Ath component. Consideration of the number of parameters to be estimated, together with all the constraints on the eigenvalues at each stage, yields D_A = N + M - 2A. D'_A can be obtained by successive subtraction, given (N - 1)M degrees of freedom in the mean-centred matrix X. A simple numerical example will best illustrate the calculation of these numbers. If the data consist of 10 samples and 20 variables, and we have fitted two principal components, then D_A = 26 and D'_A = 180 - 28 - 26 = 126. W_A greater than unity indicates that the increase in predictive information gained by the inclusion of the Ath PC component is greater than the average information in each of the remaining components, and therefore the Ath component is assumed to contain systematic information. To allow for sampling variability, Krzanowski [23] recommended the optimal value of A as the last value of A at which W_A is greater than 0.9 rather than unity. The Eastment and Krzanowski implementation also differs from that proposed by Wold in terms of the predictor group size and deletion pattern; utilization of updating algorithms for principal component analysis of a data matrix makes it feasible to make each individual data point (x_{ij}) a separate predictor group. For further details the reader is referred to reference [24]. There are several other approaches to the calculation of predictor groups, outside the scope of this text.
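The degrees-of-freedom bookkeeping is easily checked in code. The short sketch below (an added illustration, with an assumed function name) reproduces the worked example of 10 samples, 20 variables and two fitted components.

```python
# D_A = N + M - 2A to fit the Ath component; D'_A by successive subtraction
# from the (N - 1)M degrees of freedom of the mean-centred matrix.
def degrees_of_freedom(N, M, A):
    d_fit = N + M - 2 * A
    d_remaining = (N - 1) * M - sum(N + M - 2 * a for a in range(1, A + 1))
    return d_fit, d_remaining

print(degrees_of_freedom(10, 20, 2))   # (26, 126), as in the worked example above
```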
Fig. 3 Scree test results for the aphid data set (x-axis: number of PCs used in data reproduction).
3.3 Estimation of the true dimensionality of a data set from the literature
We will now examine the performance of many of the dimensionality-reducing techniques, those where no knowledge is available on the experimental error of the data, by applying them to a data set from the literature. Jeffers [25] described in detail two multivariate case studies, one of which concerned 19 variables measured on 40 winged aphids (Alate adelges) that had been caught in a light trap. This data set was also used by Krzanowski to present properties of a cross-validated data set [20] and to present a variable reduction technique using Procrustes rotation [26]
Table 8. Aphid data of Jeffers [25].
1 21.2 20.2 20.2 22.5 20.6 19.1 20.8 15.5 16.7 19.7 10.6 9.2 9.6 8.5 11.0 18.1 17.6 19.2 15.4
2 11.0 10.0 10.0 8.8 11.0 9.2 11.4 8.2 8.8 9.9 5.2 4.5 4.5 4.0 4.7 8.2 8.3 6.6 7.6
3 7.5 7.5 7.0 7.4 8.0 7.0 7.7 6.3 6.4 8.2 3.9 3.7 3.6 3.8 4.2 5.9 6.0 6.2 7.1
4 4.8 5.0 4.6 4.7 4.8 4.5 4.9 4.9 4.5 4.7 2.3 2.2 2.3 2.2 2.3 3.5 3.8 3.4 3.4
5 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0 5.0 5.0
Physical measurements of the winged aphids (alateuaizlges) 6 7 8 9 10 11 12 13 14 15 16 17 18 0 2.0 2.0 2.8 2.8 3.3 3 4.4 4.5 3.6 7.0 4.0 8 2.3 2.1 3.0 3.0 3.2 5 4.2 4.5 3.5 7.6 4.2 8 0 0 1.9 2.1 3.0 2.5 3.3 1 4.2 4.4 3.3 7.0 4.0 6 2.4 2.1 3.0 2.7 3.5 5 4.2 4.4 3.6 6.8 4.1 6 0 0 2.4 2.0 2.9 2.7 3.0 4 4.2 4.7 3.5 6.7 4.0 6 1.8 1.9 2.8 3.0 3.2 5 4.1 4.3 3.3 5.7 3.8 8 0 2.5 2.1 3.1 3.1 3.2 4 4.2 4.7 3.6 6.6 4.0 8 0 0 2.0 2.0 2.9 2.4 3.0 3 3.7 3.8 2.9 6.7 3.5 6 2.1 1.9 2.8 2.7 3.1 3 3.7 3.8 2.8 6.1 3.7 8 0 4.1 4.3 3.3 6.0 3.8 8 0 2.2 2.0 3.0 3.0 3.1 0 1.2 1.0 2.0 2.0 2.2 6 2.5 2.5 2.0 4.5 2.7 4 1 2.4 2.3 1.8 4.1 2.4 4 1 1.3 1.2 2.0 1.6 2.1 5 1.3 1.0 1.9 1.7 2.2 4 2.4 2.3 1.7 4.0 2.3 4 1 2.4 2.4 1.9 4.4 2.3 4 1 1.3 1.1 1.9 2.0 2.1 5 1.2 1.0 1.9 2.0 2.2 4 2.5 2.5 2.0 4.5 2.6 4 1 1.9 1.9 1.9 2.7 2.8 4 3.5 3.8 2.9 6.0 4.5 9 1 2.0 1.9 2.0 2.2 2.9 3 3.5 3.6 2.8 5.7 4.3 10 1 2.0 1.8 2.2 2.3 2.8 4 3.5 3.4 2.5 5.3 3.8 10 1 2.0 1.9 2.5 2.5 2.9 4 3.3 3.6 2.7 6.0 4.2 8 1
P 00
19 3.0 3.0 3.0 3.0 3.0 3.5 3.0 3.5 3.0 3.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 3.0
15.1 7.3
6.2 3.8
5.0 2.0
1.8 2.1 2.4
16.1 7.9 19.1 8.8
5.8 6.4
3.7 3.9
2.1 2.2
1.9 2.3 2.0 2.3
15.3 6.4
5.3
3.3
5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
1.7
1.6 2.0 2.2
2.2
2.0 2.2
1.9 1.7
14.8 8.1 6.2 3.7 16.2 7.7 13.4 6.9 12.9 5.8
3.6 3.6 2.7 3.8 4.0 3.0
6.0 4.5 6.5 4.5
0
1
1
3.4
5.4 4.0
0 0 0 0 0 9
2.0 2.5
1
2.0
1 1 1
2.0 2.5 2.0
1
3.0
1 1 1 1
2.0 2.0
3.4 8 3.7 8
1 1
3.7
2.0 2.0 2.0
2.6 2.9 2.4 2.9
3.7
2.8
3.0 3.1
1.9 2.5 2.3 1.6 1.8 2.5
2.8 2.4
5 5
3.3 3.5 2.6 5.4 4.3 8 3.6 3.7 2.8 5.8 4.1 0 3.4 3.6 2.7 6.0 4.0 0 2.7 2.9 2.2 5.3 3.6 8
1.4 2.0 2.7 1.9 1.7 2.5
6 5
2.8 2.7
2.5 2.5
1.8 2.2 2.4 1.8 1.8 2.5
4 4
2.8
2.6 2.0
5.1
8
0
2.7 2.7
2.7 2.5
2.1 2.0
1 1
2.7
2.7
2.0
5.0 3.6 8 5.0 3.2 6 4.2 3.7 6
2.7
2.6 2.0 5.0 3.5
1.9 2.4 5
2.6
2.5
1.9 4.6 3.4
5.0 1.6 1.4 1.7 1.9 2.3 5
2.3
2.5
12.0 6.5 5.3 14.1 7.0 5.5
3.2 3.6
16.7 7.2 14.1 5.4
5.7 5.0
3.5 5.0 3.0 5.0
10.0 11.4 12.5 13.0
4.2 2.5 4.4 2.7 2.3 2.3
2.5
3.7
2.5 2.5
3.7 3.4 2.6
4.7 4.7
1
4
5 4 5 5 4 4 5 5 5
6.9 5.7 4.8
6.0 4.5 5.5 5.3
6.4 4.3 10
2.5
2.5
2.4 3.2
2.0 1.8 2.3 2.4 2.8 2.0 1.8 2.8 2.0 2.6 1.6 1.5 1.9 2.1 2.6 1.9 1.9 2.3 2.2 2.0 2.3
5.0 1.6 1.4 5.0 1.8 1.5 5.0 1.8 1.4 5.0 1.6 1.4
12.4 5.2 4.4 12.0 5.4 4.9 10.7 5.6 4.5
2.6 5.0 3.0 5.0 2.8 5.0
1.6 1.7 1.8
1.4 1.8 2.2 2.2 5 1.5 1.7 1.9 2.4 5 1.4 1.8 2.2 2.4 4
11.1 5.5
4.3
2.6 5.0
1.7
1.5
12.8 5.7
4.8
2.8
1.8
Note : physical description of each measurement given in Table 9.
3.4 2.6
3.5 3.7 2.7 6.0 4.1 3.8 3.7 2.7 5.7 4.2 3.6 3.6 2.6 5.5 3.9 2.8
3.0
2.2
5.1
1.8 4.8 1.9 4.7
3.6
2.5 2.0
2.0
1
2.0 2.0
8
1
2.0
8
1
2.0
1.9 5.0 3.1 8
1
2.0 P bD
(see Section 4.3). The data set is presented in Table 8 and full details of the data can be obtained from reference [25]. Of the 19 variables, 14 are length or width measurements, four are counts, and one is a presence/absence variable scored 0 or 1. Table 9 presents the results of a PCA of the data; the analysis was performed via its correlation matrix due to the differing variable types, i.e. each variable was standardized prior to PCA. The results from the application of a number of techniques for identifying the true dimensionality of a data set are presented in Table 10 and Fig. 3.
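Performing PCA via the correlation matrix, as noted above, simply means standardizing each variable before the analysis. A minimal Python sketch of this preprocessing is given below; it is a generic illustration and is not the SIRIUS implementation used to produce the tables in this chapter.

```python
import numpy as np

# Minimal sketch: PCA via the correlation matrix, i.e. standardize each
# variable (mean 0, unit standard deviation) and take the eigenvalues and
# eigenvectors of the correlation matrix.  X is any N x M data matrix.
def pca_correlation(X):
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized data
    R = np.corrcoef(X, rowvar=False)                   # M x M correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # ascending order
    order = np.argsort(eigvals)[::-1]                  # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                               # PC scores
    percent_variance = 100.0 * eigvals / eigvals.sum()
    return eigvals, percent_variance, eigvecs, scores
```

For a correlation-matrix PCA the eigenvalues sum to the number of variables, which is why the 19 eigenvalues in Table 9 sum to 19.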
Q.17 Using the results presented in Table 10 and Fig. 3, conclude what is the true dimensionality of the aphid data set.
4 Data reduction by variable reduction
While the dimensionality-reducing techniques described in Section 3 provide data reduction, a major deficiency of such approaches is that, while the dimensionality of the space may be reduced from M to A, all M original variables are still needed to define the A new variables (PCs). However, in many applications it is desirable not only to reduce the dimensionality of the data space but also to reduce the physical number of variables which have to be considered or measured in future. This area has been insufficiently stressed in chemometrics. Often the chemist measures a large number of variables (e.g. spectroscopic and chromatographic intensities of a variety of compounds). Whereas each variable does, indeed, yield new information, each variable also takes time to record, and often subsequently to analyze. For example, performing PCA on 100 variables is substantially more time-consuming than performing PCA on 10 variables. While many methods for the identification of redundant variables, and selection of subsets of variables, exist in the area of multiple regression and discriminant analysis [27, 28], only three studies have been made in the area of principal component analysis [26, 29-31]. Jolliffe [29, 30] discussed a number of methods for selecting a subset of Q variables which preserve most of the variation of the original variables.
Table 9. Principal component analysis of Jeffers' aphid data.

EIGENVALUES (g_A)
PC:    1         2        3        4        5        6        7        8        9        10
g_A:   13.83866  2.36791  0.74811  0.50558  0.27824  0.26360  0.18283  0.15924  0.14479  0.13339
PC:    11        12       13       14       15       16       17       18       19
g_A:   0.09227   0.07783  0.07250  0.04364  0.03190  0.02388  0.01865  0.01310  0.00388

PERCENTAGE VARIANCE
PC:    1         2        3        4        5        6        7        8        9        10
%:     72.8351   12.4627  3.9374   2.6609   1.4644   1.3874   0.9623   0.8381   0.7621   0.7021
PC:    11        12       13       14       15       16       17       18       19
%:     0.4856    0.4096   0.3816   0.2297   0.1679   0.1257   0.0981   0.0689   0.0204

LOADINGS
Variables: 1 Body length; 2 Body width; 3 Fore-wing length; 4 Hind-wing length; 5 No. spiracles; 6 Length antennal I; 7 Length antennal II; 8 Length antennal III; 9 Length antennal IV; 10 Length antennal V; 11 No. antennal spines; 12 Length tarsus; 13 Length tibia; 14 Length femur; 15 Rostrum length; 16 Ovipositor length; 17 No. ovipositor spines; 18 Anal fold; 19 No. hind-wing hooks.
1
2
3
4
5
6
7
8
9
10
0.2511 0.2582 0.2602 0.2595 0.1620 0.2398 0.2537 0.23 16 0.2381 0.2488 -0.1306 0.2616 0.2637 0.2612 0.2522 0.2014 0.1097 -0.1876 0.2004
0.0140 0.0678 0.0330 0.0891 -0.4051 -0.1760 -0.1601 0.2375 0.0447 -0.0259 -0.2049 0.0119 0.0301 0.0670 -0.0097 -0.3953 -0.5455 -0.3524 0.2842
-0.0011 0.0108 -0.0515 0.0310 -0.1878 0.0411 0.0057 0.0575 0.1665 0.1083 0.9287 0.0328 0.0836 0.1146 0.0764 -0.0243 -0.1464 0.0389 0.0590
0.1277 0.0965 0.0687 -0.0012 -0.6177 -0.0194 0.0181 0.1017 0.0080 -0.0257 -0.1605 0.1752 0.1905 0.1917 0.0297 0.0497 0.0359 0.4898 -0.4497
0.1689 0.1177 -0.0266 -0.0508 0.0755 0.1640 -0.1165 -0.3218 0.4268 -0.0322 -0.0168 -0.0497 -0.0548 0.0264 -0.0515 0.0348 -0.2348 -0.4231 -0.6140
0.2302 0.0947 0.1242 -0.0882 -0.0400 -0.4694 -0.2426 -0.3328 0.4869 -0.2771 -0.0128 -0.0077 0.0518 0.0808 0.0131 0.0442 0.2525 0.1281 0.3430
0.4701 0.1032 0.0378 0.0315 0.1509 -0.1428 -0.0346 -0.3063 -0.5430 -0.1386 0.0681 -0.0019 0.0221 0.1087 0.3301 0.1830 -0.3865 0.0698 0.0056
-0.1750 -0.0173 -0.1717 -0.0837 -0.0033 -0.3326
-0.3297 -0.0936 0.1162 0.0738 0.0787 0.2799 0.1438 0.0039 0.3218 -0.3842 -0.0980 -0.2282 -0.0369 -0.0474 0.3217 0.1776 -0.4209 0.3431 0.1092
0.2893 0.0190 0.4235 -0.0097 0.1436 0.2367 0.0155 -0.0295 0.0137 0.0739 0.0365 -0.1033 0.0136 -0.0238 -0.6930 -0.0886 -0.2367 0.2880 0.1150
0.2005
-0.2562 0.1556 0.6903 -0.1355 -0.0289 -0.0082 O.ooo6
-0.0454 0.2492 -0.3138 0.1689 0.0702
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Body length Bodywidth FOE-Wing length Hind-wing length No. spiracles Length antennal I LenathanteMalII Length a n t e d III Length a n t e d IV Length a n t e d V No. a n t e d spines Length tarsus Length tibia Length Femur R o s m length Ovipositor length No. ovips spines Anal fold No. hind-wing hooks
11
12
13
0.0566 -0.4204 0.0012 -0.2830 0.0955 -0.2507 -0.0187 0.4241 0.0360
-0.4773 0.4471 0.1306 -0.0765 -0.1733 0.1483 -02288 -0.2504 -02162 -0.1 163 0.0301 0.2237 0.1478 0.0225 -0.2202 0.3892 -0.0259 -0.1693 0.1398
-0.2192 0.2637 -0.1534 0.4807 0.3862 -0.4071 0.1253 0.1083 0.0171 -0.2168 0.0560 0.2067 0.1442 0.0801 -0.2132 -0.1760 4.1178 0.1418 -0.2338
-0.2099 0.0198 0.1843 0.0996 0.1501 -0.2122 0.5197 -0.1628 -0.1706 -0.0704
14
0.0844 -0.4727 -0.2 159 0.5560 -0.2545 0.1664 0.1326 -0.3666
0.0044 -0.1099 -0.0370 0.2546 -0.0519 0.0105 -0.1910 0.1353 0.0293 -0.0988 0.1511
15
16
17
18
19
0.23 19 0.3648 4.5373 -0.0230 -0.1604 0.0662 0.3261 0.1327 0.0193 -0.1829 4.0250 4.4561
-0.0008 0.0647
0.1708 0.1357 -0.2952 0.2427 0.0872 0.1439 -0.6428 0.2770 0.1081 0.1607 -0.0374 0.1227 -0.2675 -0.2297 0.0280 0.2302 -0.0691 0.2213 0.0481
0.1507 0.1792 -0.04 11 -0.2665 -0.0538 -0.0069 0.3722 0.0032 0.0934 -0.1383 0.0376 0.5536 -0.2139 -0.5758 0.0249 -0.0578 -0.1016 0.0250 0.0557
0.0553 -0.1378 -0.0407 0.0471 0.0225 0.0075 -0.1381 -0.03 16 O.Oo60 0.0436 -0.0066 4.1 126 0.7954 -0.5492 0.0501 -0.0065 -0.0247 4.0225 -0.0290
0.0566 0.0057 -0.2275 0.2184 0.045 1
-0.0609 0.1375
0.4664 0.3755 -0.2148 -0.3055 0.1087 0.1326 -0.0480 0.0383 0.0950 -0.3158 -0.2647 -0.3742 0.0293 0.2951
0.0904 -0.1680 -0.1465
Table 10. Results from the application of techniques to identify the true dimensionality of the aDhid data set. PRESS(A Cumulative Residual Wt,t IE* IND** Ill u2 Ftotest %at RSS(A %ga %ga %ga
I)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
13.84 2.37 0.75 0.50 0.28 0.26 0.18 0.16 0.15 0.13 0.09 0.08 0.07 0.05 0.03 0.02 0.02 0.01 0.01
* In units of 10^-3
72.84 12.47 3.95 2.63 1.47 1.37 0.95 0.84 0.79 0.69 0.47 0.42 0.37 0.26 0.16 0.11 0.11 0.05 0.05
72.84 27.16 85.31 14.68 89.26 10.74 91.89 8.11 93.36 6.63 94.73 5.26 95.68 4.32 96.52 3.47 97.31 2.69 98.00 2.00 98.47 1.53 98.89 1.11 99.26 0.74 99.52 0.47 99.68 0.32 99.79 0.21 99.90 0.11 99.95 0.05 100.00 0.00
19.42 2.61 2.22 20.78 2.2 1 22.43 23.25 2.25 2.42 24.33 24.64 2.59 25.09 2.83 25.13 3.20 24.58 3.57 4.01 23.57 22.91 4.70 21.76 5.59 19.97 6.7 1 8.49 18.21 17.21 12.10 16.75 20.29 14.96 39.53 15.39 158.11 -
-
** In units of 10^-4
† Percent significance level
†† Taken from reference [20]
1 1 1 1 1 1 1 1 1 1 1 1 1
18 17 16 15 14 13 12 11 10 9 8 7 6
1
5
1 4 1 3 1 2 1 1
- -
20.12 6.05 2.48 2.06 1.33 1.45 1.14 1.16 1.28 1.36 1.10 1.20 1.36 1.27 0.92 0.70 0.94 0.48 -
0.03 2.49 13.51 17.15 26.90 25.02 30.74 30.49 28.31 27.39 32.42 31.04 28.83 31.11 39.11 46.42 43.36 61.48 -
0.31 0.68 6.43 0.99 1.40 1.07 1.34 1.27 1.13 1.05 1.05 0.97 0.83 0.87 0.92 0.83 0.70 0.35
0.96 0.96 0.96 0.96 0.95 0.95 0.95 0.95 0.94 0.94 0.93 0.93 0.92 0.91 0.90 0.88 0.85 0.80
-
-
26.22 5.95 0.19 1.03 -0.39 0.57 0.1 1 0.12 0.23 0.44
0.04 0.04 0.29 0.15 0.06 0.05 0.03 0.0 1
Discriminant analysis (FLD or LDA) is also discussed in Sections 9 and 10.
McCabe [31] worked on the concept of a "principal variable", which is a variable containing as much sample information as possible, and showed how the selection of such principal variables can be made on the basis of various optimality criteria. Krzanowski [26] presented the idea of identifying the best subset of Q variables which reproduces as closely as possible the structure of the complete data; this involves a direct comparison between the individual points of the subset configuration and the corresponding points of the complete data configuration. The different ideas and concepts behind these three approaches will now be discussed.

4.1 Jolliffe's methods for variable reduction
Jolliffe [1, 29, 30] presented a number of methods for selecting Q variables which best preserve most of the variation in the original variables of X. These data reduction methods were based around the techniques of multiple correlation, PCA and cluster analysis. Four methods using PCA were examined, but only two methods with satisfactory performance will be discussed here; these methods were as follows: (a) Associate one variable with each of the last (M - Q) PCs, namely the variable, not already chosen, with the highest loading, in absolute value, in each successive PC (working backwards from the Mth PC), and delete those (M - Q) variables. PCA can either be performed once only, or iteratively. In the latter case delete the variable associated with the Mth PC and then repeat the analysis on the reduced data set and delete the variable associated with the last PC; repeat this process until only Q variables remain. By adopting this criterion those variables which best explain the minor (error) PC dimensions are removed. (b) Associate one variable with each of the first Q PCs, namely the variable, not already chosen, with the highest loading, in absolute value, in each successive PC. These Q variables are retained, and the remaining (M - Q) are deleted. The arguments for this approach are twofold. First, it is an obvious complementary approach to (a), i.e. retaining those variables which best explain the major dimensions, and, second, in cases where there are groups of highly-correlated variables it is designed to select just one variable from each group. This will happen as there will be one high-variance PC associated with each group. This approach is plausible, since a single variable from each group should preserve
most of the information given by that group, when all variables in that group are highly correlated. Jolliffe [29] applied his different approaches, but only the non-iterative version of (a), to a number of simulated data sets. The results showed that methods (a) and (b), known as B2 and B4 respectively in the reference, retained the 'best' subsets, as opposed to 'good' or 'moderate' subsets, more frequently than the other methods. Method (b) was the most extreme in this respect; it selected 'best' and 'bad' subsets more frequently than any other method, and 'moderate' or 'good' subsets less frequently. Similarly, for various real data sets Jolliffe [30] found that none of the variable selection methods was uniformly best, but several of them, including (a) and (b), found reasonable subsets in most cases. While these methods identify which variables are the 'best' to remove or retain from a data set, they do not identify how many physical variables (Q) are required to retain the structure within the data. If A PCs are required to explain the structure within a data set then a minimum of Q = A variables are required so that there is a possibility of recovering this structure.

4.2 McCabe's principal variables
McCabe [31] proposed the concept of a "principal variable", which is a variable containing (in some sense) as much sample information as possible, and showed how the selection of a subset of such principal variables can be made on the basis of various optimality criteria. McCabe proposed four criteria which retained the desirable characteristics of PCs while simultaneously reducing the number of variables to be considered. The most computationally feasible criterion (which we will call method (a)) was found to be to minimise the geometric product of the eigenvalues (g_a) of the matrix of the M - Q variables that are to be deleted. So, for example, if the data set consists of five variables, then if we want to set Q = 2 (i.e. retain only two variables), we calculate the principal components for each subset of three variables. A second method (b) involves minimising the sum of the eigenvalues rather than their product.
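A sketch of criterion (a) as just described: every possible set of M - Q variables to delete is examined, and the retained subset is the complement of the deleted set whose covariance matrix has the smallest product of eigenvalues (equivalently, the smallest determinant). The data matrix X below is an arbitrary illustration, not one of the chapter's data sets.

```python
import numpy as np
from itertools import combinations

# Sketch of McCabe's method (a) as described in the text: delete the set of
# M - Q variables whose covariance matrix has the smallest product of
# eigenvalues (i.e. the smallest determinant), and retain the other Q
# variables as the "principal variables".
def mccabe_principal_variables(X, q):
    n, m = X.shape
    S = np.cov(X, rowvar=False)                  # M x M covariance matrix
    best_retained, best_crit = None, np.inf
    for deleted in combinations(range(m), m - q):
        sub = S[np.ix_(deleted, deleted)]        # covariance of deleted variables
        crit = np.linalg.det(sub)                # product of its eigenvalues
        if crit < best_crit:
            best_crit = crit
            best_retained = tuple(i for i in range(m) if i not in deleted)
    return best_retained, best_crit

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))                     # arbitrary 40 x 5 example
print(mccabe_principal_variables(X, q=2))        # all subsets of 3 deleted variables are tested
```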
Q.18 How many combinations of variables need to be tested when trying to reduce five variables to two principal variables by McCabe's method?
Like Jolliffe's methods this technique provides no mechanism to identify how many physical variables (Q) are required to retain the structure within the data. Again, if A PCs explain the structure within the data then a minimum of Q = A variables are required so that there is a possibility of recovering this structure.

4.3 The Krzanowski method for variable reduction using Procrustes rotation
The Krzanowski approach [26] to variable reduction aims to identify the best subset of Q variables which preserve the sample structure of the data. This involves a direct comparison between the individual points of the subset configuration and the corresponding points in the complete data configuration. Since the same number of samples are involved in both the complete data and the subset data, this can be achieved using a criterion based on Procrustes rotation [32, 33].
The notation of Chapter 3, Section 2.1, is used to represent the dimensions of the matrices, where the numbers of rows and columns are written before the matrix symbol.
Let N,MX denote the data matrix consisting of M variables measured on each of N samples, and A denote the essential dimensionality of the data, to be used in any comparison. Krzanowski states that A can be decided either formally, using cross-validation or any of the other techniques presented in Section 3, or informally from consideration of convenience (e.g. because two dimensions are ideal for graphical display of the data). However, he also states that in the latter case one must ensure that sufficient data variability has been accounted for in the chosen A. Let N,AT be the transformed data matrix of PC scores, yielding the best A-dimensional approximation to the original data configuration X. The matrix T is discussed in several other chapters, so mathematical details are not given here. The reader is referred to the following chapters. In Chapter 4, Section 5.2, this matrix is derived in Eq. (22), although the matrix in that Section has M rows (i.e. one row for each variable). In this section the matrix T contains the first A (most significant) principal components only; each column contains the scores on one of the first A principal components. This matrix is discussed again in Chapter 7, Section 3. The matrix of scores is UG^(1/2), where U is scaled so that it is of unit size.
This is also derived in Chapter 1.
We now aim to identify Q of the original M variables which best retain the configuration, where Q < M (so that selection does take place) and Q ≥ N (so that there is a possibility of recovering the true structure with the Q selected variables); note that the method in this section only works where there are more variables than objects. To further establish notation, let us use N,QX_Q to denote the reduced data matrix formed from the Q selected variables and N,AT_Q to denote the matrix of PC scores of this reduced data matrix; note that two-letter names are used occasionally in this text for clarity. N,AT_Q is therefore the best A-dimensional representation of the Q-dimensional configuration defined by the subset of Q variables. If the true dimensionality of the data is indeed A, then T can be viewed as the "true" configuration and T_Q is the corresponding approximate configuration based on only Q variables. The notation of this section is summarized, for clarity, in Table 11.
Table 11. Notation of Section 4.3.

Objects
    Number observed                  N
Variables
    Number observed                  M
    Number in reduced data set       Q
Principal components
    Number used                      A
The error between the true and approximate matrices can be obtained by a technique called Procrustes rotation. This method is discussed in the references to this section, but is essentially a method for comparing the similarity between two data sets. If the data sets A and B are represented in Q dimensions (this corresponds to the number of principal components in the case discussed here) then the steps are as follows: (a) Mean-centre both data sets. (b) Rotate, translate, and reflect data set B so that the residual sum of squares between A and B is minimised.
(c) This sum of squares represents the closeness of fit between the two data sets. The smaller it is, the more similar the information between these data sets. We will not discuss the mathematical details of Procrustes analysis here. A simple one variable example is provided by Brereton [34]. Mathematically, the sum of squares is given by

S^2 = trace(T T' + T_Q T_Q') - trace(2 G_Q^(1/2))          (27)
The matrix G^(1/2) is defined in Chapter 3, Section 5.5. The matrix G_Q^(1/2) is analogous to the matrix G^(1/2), but is calculated for the reduced data set.
The steps in this technique are illustrated in Fig. 4. Select a subset of Q variables, less than the total number of variables and greater than or equal to the number of PCs (Q < M and Q ≥ A).

Fig. 4 Steps in the Krzanowski variable reduction technique: the raw data matrix N,MX (all variables) and the reduced data matrix N,QX_Q (Q variables) are each subjected to PCA, and selecting the first A dimensions of each gives the score matrices N,AT and N,AT_Q respectively.
The quantity S^2 can be calculated for any subset of the variables in X, and it represents the closeness of the subset configuration to the true configuration. The "best" subset of Q variables for minimal structural loss is therefore the one which yields the smallest value of S^2 among all tested subsets. On the importance of the size of S^2, Krzanowski [26] poses the question "how large must the Procrustes sum of squares be before the variable is deemed to be important?". Although one may have identified the "best" variable(s) to remove for minimal loss of structure, is that variable(s) critical for defining the sample structure? One possible way to compare the Procrustes sums of squared residuals (S^2) obtained is to put them on some sort of common basis. Deane and MacFie [8] suggested a comparison between the structure lost (M^2) for each subset and the total sample structure contained within the true configuration (trace TT'). This is then a measure of the proportion of systematic structure lost by the removal of each variable(s) from the data.
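A minimal sketch of this comparison, assuming the score matrices T and T_Q have already been computed: the term trace(2 G_Q^(1/2)) in Eq. (27) is evaluated here as twice the sum of the singular values of T'T_Q, which is the standard Procrustes result; this interpretation is an assumption rather than a quotation from the text.

```python
import numpy as np

# Sketch of the Procrustes comparison in Eq. (27).  T and TQ are the N x A
# score matrices from PCA of the full data and of the Q-variable subset.
def procrustes_s2(T, TQ):
    singular_values = np.linalg.svd(T.T @ TQ, compute_uv=False)
    return np.trace(T.T @ T) + np.trace(TQ.T @ TQ) - 2.0 * singular_values.sum()

# The smaller S^2 is, the more closely the subset configuration reproduces
# the configuration of the complete data.
```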
Fig. 5 Scores plot of the first two principal components from all nineteen variables in the aphid data set (2nd principal component against 1st principal component).
4.4 Investigation of a data set from the literature for the presence of redundant variables
We will now examine the performance of the different variable reduction techniques, presented in Section 4, by applying them to the aphid data set (Table 8). These results are taken from reference [26]. Using cross-validation, four principal
components were chosen (see Q.17) to represent the variation between the aphid samples. Fig. 5 presents the scores plot for the first two dimensions (85.3% of total variance, see Table 9) extracted from a PCA of all 19 variables in the aphid data set, via its correlation matrix. The plot shows four distinct taxa of aphids; this reinforces the case for retaining a minimum of four variables to have any possibility of reproducing the sample structure within the data. Table 12 presents the best selection from the application of the different variable reduction techniques to the aphid data, along with the Krzanowski residual sum of squares (S^2) between the true and reduced variable configurations. S^2 was obtained by comparing the (40 x 4) matrix of PC scores, accounting for 91.9% of the sample variance (Table 9: add up the cumulative percentage variance), obtained from the PCA of the complete aphid data set and the (40 x 4) matrix of PC scores obtained from the PCA of each four-variable subset.
Table 12. Best subsets of variables to retain, to reproduce the group structure of the aphid data set, as identified by the different variable reduction techniques.
Method          Variables retained     M^2
Jolliffe (1)    5, 8, 11, 14           239
Jolliffe (2)    5, 11, 13, 17          258
McCabe (a)      9, 11, 17, 19          267
McCabe (b)      5, 9, 11, 18           258
McCabe (c)      6, 11, 17, 19          266
McCabe (d)      5, 8, 11, 18           261
Krzanowski      5, 12, 14, 18          221
Q.19 Using the results presented in Table 12 and Fig. 5, discuss which subset of variables, and why, best reproduces the sample structure of the example aphid data set. To assist you, Figs 6, 7 and 8 present the optimum two-dimensional representation of the aphid data set using the Jolliffe method (a), McCabe and Krzanowski identified best subsets of four variables respectively.
Fig. 6 Scores plot of the first two principal components extracted from variables 5, 8, 11 and 14 (Jolliffe (a) best selection) in the aphid data set. This plot accounts for 82.3% of the variation contained within these four variables.
Fig. 7 Scores plot of the first two principal components extracted from variables 5, 9, 11 and 18 (McCabe (b) best selection) in the aphid data set. This plot accounts for 78.2% of the variation contained within these four variables.
Fig. 8 Scores plot of the first two principal components extracted from variables 5, 12, 14 and 18 (Krzanowski best selection) in the aphid data set. This plot accounts for 90.4% of the variation contained within these four variables.
4.5 The effect that the chosen dimensionality of a data set can have on the variables selected
While Jolliffe, McCabe and Krzanowski all developed techniques for identifying optimal subsets of Q variables which best retain the structure contained within a data set (first A components), none of them discussed how the actual variables selected depend on the dimensionality (A) chosen to define that structure. This is due to the fact that the number of PCs (A) employed to describe the true configuration of the data defines the importance (total variation) of each variable in defining that configuration and therefore affects the best selection of Q variables required to maintain it. Deane and MacFie [8] investigated this situation when cross-validation gave unsatisfactory results in defining the structure contained within a data set, when applying the Krzanowski variable reduction technique to test for the presence of redundant tests (variables) in a subset of the product quality control test criteria for aviation turbine fuel. They applied the Krzanowski variable reduction criterion, as opposed to the Jolliffe or McCabe criterion, as it maintains sample structure (preserves inter-sample distances); hence, outliers (out-of-specification fuels) in M-dimensional space will remain so in Q-dimensional space (Q < M).
One of the data sets they considered consisted of 12 physical properties measured on 117 samples of fuel; due to differences in variable type a cross-validated PCA was performed on the data correlation matrix. The cross-validation results (Wold implementation) are presented in Table 13, and show evidence of three systematic dimensions within the data. Inspection of the PC loadings (not presented) showed the fourth and fifth dimensions to be dominated by two tests (smoke point and hydrogen). The data correlation matrix (not presented) showed these two tests to have low correlation to the other tests. This implies that these tests must be critical to the quality control test criteria as they measure fuel properties not covered by the other tests. As these tests provide unique information on sample variation the information they provide will not be predictable from the other tests, and therefore cross-validation will not identify these tests (dimensions) as systematic.
Table 13. Cross-validation (Wold implementation) results for the aviation turbine fuel data of Deane and MacFie [8].

PC Dimension (A)   Percent Variance   PRESS(A)/RSS(A-1)    Q
 1                     41.62               0.70           0.96
 2                     18.58               0.83           0.95
 3                     11.95               0.92           0.95
 4                      7.74               5.28           0.95
 5                      5.49               1.05           0.94
 6                      4.83               0.89           0.93
 7                      3.38               0.93           0.92
 8                      2.06               1.00           0.91
 9                      1.76               0.75           0.89
10                      1.37               0.59           0.86
11                      0.99               0.30           0.81
Due to their uncertainty as to whether cross-validation had correctly identified the true dimensionality of the data, and to test the effect of varying the number of PC dimensions used to define the true configuration, Deane and MacFie applied the Krzanowski variable reduction technique to estimate the loss of sample information caused by the removal of each individual variable, using 1, 2, ..., (M - 1) dimensions to define the true configuration. The results from this analysis, with S^2 presented as a proportion of the total sample variance, are given in Fig. 9. The plot is not presented to show the actual percent information lost by the removal of each individual test but to present the overall pattern of how the level of information loss for the tests is affected by the dimensionality (A) used to define the structure.
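A sketch of this variable-by-variable screening, assuming a standardized data matrix X (an illustration of the procedure, not the authors' code): for each trial dimensionality A, each variable is deleted in turn, the Procrustes residual S^2 between the full and reduced score configurations is computed, and the result is expressed as a proportion of trace(T T').

```python
import numpy as np

# Sketch of the Deane and MacFie screening described above.  X is a generic
# standardized N x M data matrix; scores are obtained via the SVD.
def scores(X, a):
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    return u[:, :a] * s[:a]                       # N x a matrix of PC scores

def proportion_lost(X, a):
    T = scores(X, a)
    total = np.trace(T.T @ T)                     # total sample structure
    lost = {}
    for j in range(X.shape[1]):                   # delete each variable in turn
        TQ = scores(np.delete(X, j, axis=1), a)
        sv = np.linalg.svd(T.T @ TQ, compute_uv=False)
        s2 = np.trace(T.T @ T) + np.trace(TQ.T @ TQ) - 2.0 * sv.sum()
        lost[j] = s2 / total                      # proportion of structure lost
    return lost
```

Repeating proportion_lost for A = 1, 2, ..., M - 1 generates the family of curves plotted in Fig. 9.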
Fig. 9 Graph of the proportion of sample structure lost from the aviation turbine fuel data by removing each test from the test criteria, plotted against the number of principal components used to define the true configuration. (Tests: Hydrogen, Aromatics, Naphthalenes, Smoke Point, Density, Dis10, Dis50, Dis90, AGP, Viscosity, Flash Point, Freeze Point.)
Q.20 Use the pattern of information loss as presented in Fig. 9 to discuss the effect that varying the true configuration (A) of the data has on the best single variable (test) to remove.
5 Conclusions
Data reduction, using principal components analysis, can be achieved in two ways: dimensionality reduction and variable reduction. Dimensionality reduction is applied to remove the experimental error contained within the data, and the available techniques fall into two categories: those that require a knowledge of the experimental error of the data and those that do not. The latter type tend to be empirical in nature and vary in their effectiveness; it is therefore recommended that more than one technique be applied and the results obtained compared with one's knowledge of the data under analysis before deciding on the true dimensionality of the data. Differences also exist in the optimality criteria of the techniques; therefore consideration of the aims of the analysis has to be taken into account when deciding on which technique(s) to apply.
In some cases, as well as achieving dimensionality reduction, it is often desirable to reduce the physical number of variables within the data set. Three approaches to identify the best selection of Q variables to retain have been presented. Each technique has a different optimality criterion and therefore the most applicable to use depends on the aims of the statistical analysis. The importance of correctly identifying the true dimensionality (A) of the data, prior to performing variable reduction, has also been shown to have a critical effect on the importance of each variable in defining that configuration and therefore on the best selection of Q variables required to maintain it.

6 References
1. I.T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, (1986).
2. S. Wold, K. Esbensen and P. Geladi, Principal Component Analysis, Chemometrics and Intelligent Laboratory Systems, 2 (1987), 37-52.
3. S. Wold, C. Albano, W.J. Dunn III, K. Esbensen, S. Hellberg, E. Johansson and M. Sjöström, Pattern Recognition: Finding and Using Regularities in Multivariate Data, in Proceedings of the IUFOST Conference, Food Research and Data Analysis (H. Martens and H. Russwurm, Jr, Eds), Applied Science Publishers, London, (1983), pp. 147-188.
4. P. Geladi, Notes on the History and Nature of Partial Least Squares (PLS) Modelling, Journal of Chemometrics, 2 (1988), 231-246.
5. H. Wold and E. Lyttkens, Nonlinear Iterative Partial Least Squares (NIPALS) Estimation Procedures, Bulletin of the International Statistical Institute Proceedings, 37th Session, London, (1969), pp. 1-15.
6. E.R. Malinowski, Theory of Error in Factor Analysis, Analytical Chemistry, 49 (1977), 606-612.
7. E.R. Malinowski, Factor Analysis in Chemistry, Second Edition, Wiley, New York, (1991).
8. J.M. Deane and H.J.H. MacFie, Testing for Redundancy in Product Quality Control Test Criteria: An Application to Aviation Turbine Fuel, Journal of Chemometrics, 3 (1989), 471-491.
9. J.M. Deane, H.J.H. MacFie and A.G. King, Use of Replication and Signal-to-Noise Ratios in Identification and Estimation of the Composition of Lubricant Basestock Mixtures using 13C Nuclear Magnetic Resonance Spectroscopy and Projection into Principal Component / Canonical Variate Space, Journal of Chemometrics, 3 (1989), 477-491.
10. M.S. Bartlett, Tests of Significance in Factor Analysis, British Journal of Psychology (Statistical Section), 3 (1950), 77-85.
11. H.F. Kaiser, Educational and Psychological Measurement, 20 (1960), 141.
12. R.B. Cattell, The Scree Test for the Number of Factors, Multivariate Behavioral Research, 1 (1966), 245-276.
13. J.H. Kindsvater, P.H. Weiner and T.J. Klingen, Correlation of Retention Volumes of Substituted Carboranes with Molecular Properties in High Pressure Liquid Chromatography using Factor Analysis, Analytical Chemistry, 46 (1974), 982-988.
14. E.R. Malinowski, Determination of the Number of Factors and the Experimental Error in a Data Matrix, Analytical Chemistry, 49 (1977), 612-617.
15. E.R. Malinowski, Statistical F-tests for Abstract Factor Analysis and Target Testing, Journal of Chemometrics, 3 (1988), 49-60.
16. E.R. Malinowski, Theory of the Distribution of Error Eigenvalues resulting from Principal Components Analysis with Applications to Spectroscopic Data, Journal of Chemometrics, 1 (1987), 33-40.
17. M.J. Stone, Cross-validatory Choice and Assessment of Statistical Prediction, Journal of the Royal Statistical Society, B36 (1974), 111-147.
18. S. Wold, Cross-validatory Estimation of the Number of Components in Factor and Principal Components Models, Technometrics, 20 (1978), 397-405.
19. H.T. Eastment and W.J. Krzanowski, Cross-validatory Choice of the Number of Components from a Principal Component Analysis, Technometrics, 24 (1982), 73-77.
20. W.J. Krzanowski, Cross-validation in Principal Components Analysis, Biometrics, 43 (1987), 575-584.
21. D.W. Osten, Selection of Optimal Regression Models via Cross-Validation, Journal of Chemometrics, 2 (1988), 39-48.
22. M.A. Sharaf, D.L. Illman and B.R. Kowalski, Chemometrics, Wiley, New York, (1986).
23. S. Wold and M. Sjöström, SIMCA: A Method for Analyzing Chemical Data in Terms of Similarity and Analogy, in Chemometrics: Theory and Application (B. Kowalski, Ed.), American Chemical Society Symposium Series, No. 52 (1977), pp. 243-282.
24. J.R. Bunch, C.P. Nielsen and D.C. Sorensen, Rank-one Modification of the Symmetric Eigenproblem, Numerische Mathematik, 31 (1978), 31-48.
25. J.N.R. Jeffers, Two Case Studies in the Application of Principal Component Analysis, Applied Statistics, 16 (1967), 225-236.
26. W.J. Krzanowski, Selection of Variables to Preserve Multivariate Data Structure using Principal Components Analysis, Applied Statistics, 36 (1987), 22-33.
27. R.J. McKay and N.A. Campbell, Variable Selection Techniques in Discriminant Analysis. I. Description, British Journal of Mathematical and Statistical Psychology, 35 (1982), 1-29.
28. R.J. McKay and N.A. Campbell, Variable Selection Techniques in Discriminant Analysis. II. Allocation, British Journal of Mathematical and Statistical Psychology, 35 (1982), 30-41.
29. I.T. Jolliffe, Discarding Variables in a Principal Component Analysis. I: Artificial Data, Applied Statistics, 21 (1972), 160-173.
30. I.T. Jolliffe, Discarding Variables in a Principal Component Analysis. II: Real Data, Applied Statistics, 22 (1973), 21-31.
31. G.P. McCabe, Principal Variables, Technometrics, 26 (1984), 137-144.
32. J.C. Gower, Statistical Methods of Comparing Different Multivariate Analyses of the Same Data, in Mathematics in the Archaeological and Historical Sciences (F.R. Hodson, D.G. Kendall and P. Tautu, Eds), Edinburgh University Press, (1971), pp. 138-149.
33. R. Sibson, Studies in the Robustness of Multidimensional Scaling: Procrustes Statistics, Journal of the Royal Statistical Society, B40 (1978), 234-238.
34. R.G. Brereton, Chemometrics: Applications of Mathematics and Statistics to Laboratory Systems, Chapter 4, Ellis Horwood, Chichester, (1990).
35. R.A. Hearmon, J.H. Scrivens, K.R. Jennings and M.J. Farncombe, Isolation of Component Spectra in the Analysis of Mixtures by Mass Spectroscopy and 13C Nuclear Magnetic Resonance Spectroscopy: The Utility of Abstract Factor Analysis, Chemometrics and Intelligent Laboratory Systems, 1 (1987), 167-176.
ANSWERS
A.1
As the data is constructed from three constituents then a PCA of the mean-centred data will form a three-dimensional structure (triangle) in a two-dimensional space.
A.2
(Scores plot of the mixture data; x-axis: principal component 1.) The data lie on a triangle in the principal component space.
A.3
Blends of b constituents form a regular b-dimensional structure in b-dimensional space ((b - 1)-dimensional space if the data is mean-centred). PCA of a pure mixture data set will accurately reproduce this regular structure (see A.2). Imbedded error in a data set, due to blending errors, variation in sample size, the presence of a contaminant or measurement errors associated with the instrument or other sources of imprecision, will affect the reproduction of this regular structure via PCA. The b-dimensional structure will still be reproduced in b-dimensional space but will no longer be truly regular (see Q.5).
A.4 If A is underestimated then the PC model will be constructed using too few components and one will underestimate the number of constituents in the mixtures and receive a false impression of the proportions of these constituents in the blends. If A is overestimated then the PC model will be constructed using too many components and one may overestimate the number of constituents in the mixtures and will therefore once again receive a false impression of the proportions of the constituents in the blends. If A= 2, 3 or 4 then the number of constituents in a mixture (true dimensionality of the data) can be checked visually as the mixture space will be a regular straight line, triangle or tetrahedron respectively, subject to a degree of error.
A.5
(Scores plot of the mixture data; x-axis: principal component 1.) Note that the seven points no longer lie exactly on a triangle.
A.6 The table below presents the RSD of the mixture data in Table 3. Notice that the RSD closest to 0.4918, the true RMS error, is 0.3327 corresponding to A = 2. Hence we conclude that there are two real components (three constituents - as data was mean-centred) within the data. This is in accord with the known facts of how the mixture data was constructed.
RSD of the principal components extracted from the mixture data in Table 3.

PC Dimension (A)   g_A     Σ g_a (a = A+1 to M)   N(M - A)   RSD
1                  43.09         33.15               35      0.9732
2                  30.05          3.10               28      0.3327
3                   2.02          1.08               21      0.2268
4                   0.89          0.19               14      0.1165
5                   0.17          0.02                7      0.0535
6                   0.02           -                  -        -
A.7
The table below presents the RMS of the mixture data in Table 3. Notice that the RMS closest to 0.4918, the true RMS error, is 0.2716 corresponding to A = 2. Hence we conclude that there are two real components (three constituents) within the data. This is in accord with the known facts of how the mixture data was constructed.
RMS of the principal components extracted from the mixture data in Table 3.

PC Dimension (A)   (M - A)/M   sqrt((M - A)/M)   RSD*     RMS
1                   0.8333         0.9129        0.9732   0.8884
2                   0.6666         0.8165        0.3327   0.2716
3                   0.5000         0.7071        0.2268   0.1604
4                   0.3333         0.5774        0.1165   0.0673
5                   0.1666         0.4082        0.0535   0.0218

* See Q.6 for calculation
A.8 From Table 5 it can be seen that the average error closest to 0.4918, the true RMS error, is 0.2157 corresponding to A = 2. Hence we conclude that there are two real components (three constituents) within the data. This is in accord with the known facts of how the mixture data was constructed.
A.9
The table below presents the cumulative percent variance for the mixture data presented in Table 3. If a cut-off of 90% or 95% variance explained is required then A = 2 components need to be employed to reproduce the data, and if a cut-off of 99% variance accounted for is required then A = 4 components need to be employed. A comparison of Tables 2 and 4 shows that 4.07% of the data was random/experimental error. Hence we conclude that there are two real components (three constituents) within the data. This is in accord with the known facts of how the mixture data was constructed. However, these results also demonstrate that a correct decision could only be made because a knowledge of the random/experimental error in the data was available.
Cumulative percent variance of the principal components extracted from the mixture data in Table 3.

PC Dimension (A)   g_A     Percent Variance   Cumulative % Variance
1                  43.09        56.52               56.52
2                  30.05        39.41               95.93
3                   2.02         2.65               98.58
4                   0.89         1.17               99.75
5                   0.17         0.22               99.97
6                   0.02         0.03              100.00
A.10 The average eigenvalue extracted from this mixture data set is 12.71, calculated from the PC results given in Table 4. The first eigenvalue lower than 12.71 is 2.02, corresponding to A = 2. Hence we conclude that there are two real components (three constituents) within the data. This is in accord with the known facts of how the mixture data was constructed.
A.11
(a) The table below presents the residual percent variance for the mixture data; these results are also plotted to form a scree plot. The scree plot shows that the residual percent variance drops rapidly with the inclusion of up to two components and then starts to level off. Hence we conclude that there are two real components (three constituents) within the data. This is in accord with the known facts of how the mixture data was constructed.
Residual variance of the principal components extracted from the mixture data in Table 3.

PC Dimension (A)   g_A     Residual Variance   Residual % Variance
1                  43.09        33.15               43.48
2                  30.05         3.10                4.07
3                   2.02         1.08                1.42
4                   0.89         0.19                0.25
5                   0.17         0.02                0.03
6                   0.02         0.00                0.00
(b) The scree graph is as follows.
(Scree plot: residual percent variance against the number of PC dimensions used in data reproduction, 0 to 6.)
A.12
From Table 6 it can be seen that for an A = 1 dimensional PC model ψ = 0.4101, indicating that the fit is only twice as good as estimating all data points to be equal to the overall mean of the data. For an A = 2 dimensional PC model ψ = 0.1271, indicating an excellent correlation between the raw and reproduced data. For an A = 3 dimensional PC model ψ = 0.0760, indicating an even better correlation between the raw and reproduced data. For A > 3 no improvement in ψ is achieved; ψ oscillates for these dimensions, indicating that the error in the data is not uniformly distributed. One could therefore conclude the mixture data to be two or three dimensional. If one takes the first occurrence of an excellent correlation between the raw and reproduced data it would be concluded that there are two real components (three constituents) within the data. This is in accord with the known facts of how the mixture data was constructed. Due to the oscillating nature of ψ after A = 2 dimensions, one could conclude the improved fit with the inclusion of the third PC dimension to be an effect of the non-uniformity of the error in the data.
Imbedded error function for the PC dimensions extracted from tht mixture data in Table 3. PC Dimension (A)
8A
1 2 3 4 5 6
43.09 30.05 2.02 0.89 0.17
33.15 3.10 1.08 0.19 0.02
0.02
-
NM(M - A )
IE
210 168 126 84 42
0.3973 0.1921 0.1604 0.095 1 0.0488
a=A+I
-
-
A.14
The table below presents the factor indicator (IND) function for the PC dimensions extracted from the mixture data. The function reaches a minimum with the inclusion of the second dimension; hence we conclude that there are two real components (three constituents) within the data. This is in accord with the known facts of how the mixture data was constructed. However, a local minimum also exists after the inclusion of the 4th dimension; this indicates that the errors in the data may not be uniformly distributed. This is probably why the IE function failed to reach a minimum.
Factor indicator (IND) function for the PC dimensions extracted from the mixture data in Table 3.

PC Dimension (A)   RSD*     (M - A)^2   IND
1                  0.9732      25       0.0389
2                  0.3327      16       0.0208
3                  0.2268       9       0.0252
4                  0.1165       4       0.0291
5                  0.0535       1       0.0535

* See Q.6 for calculation
A.15
The table below presents the Malinowski F-test (Eq. (21)) for the principal components extracted from the mixture data. The last column in the table presents the percent significance levels, %α, for the fitted F values. Testing for significance, starting with the last dimension, we see that the first dimension for which the null hypothesis (the dimension under test is equal in variance to the error dimensions) is rejected is A = 2, because the 2.29% significance level is below the critical 5% level. Hence we conclude that there are two real components (three constituents) within the data. This is in accord with the known facts of how the mixture data was constructed.
Malinowski F-test for the principal components extracted from the mixture data in Table 3. The sums run over the error components a = A+1 to M.

A   (N-A+1)(M-A+1)   Σ (N-a+1)(M-a+1)   g_A     Σ g_a    υ1   υ2   F      %α
1         42                70          43.09   33.15     1    5    2.17  20.10
2         30                40          30.05    3.10     1    4   12.93   2.29
3         20                20           2.02    1.08     1    3    1.87  26.90
4         12                 8           0.89    0.19     1    2    3.13  21.92
5          6                 2           0.17    0.02     1    1    2.83  34.13

Notes
1. α is the percentage significance.
2. υ refers to degrees of freedom. The first degree of freedom refers to the eigenvalue, which is always 1, and the second to the sum of the eigenvalues, which changes according to the principal component tested.
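The tabulated F values can be reproduced from the eigenvalues using the reduced-eigenvalue form of the test; the expression below is inferred from the table's columns and is a sketch, with Eq. (21) remaining the formal definition.

```python
import numpy as np

# Sketch of the Malinowski F-test as tabulated above: the weighted eigenvalue
# of component A is compared with the pooled weighted error eigenvalues.
g = np.array([43.09, 30.05, 2.02, 0.89, 0.17, 0.02])
N, M = 7, 6

for A in range(1, M):
    idx = np.arange(A + 1, M + 1)                      # error components a = A+1..M
    weight_A   = (N - A + 1) * (M - A + 1)
    weight_err = ((N - idx + 1) * (M - idx + 1)).sum()
    F = (weight_err / weight_A) * (g[A - 1] / g[A:].sum())
    print(f"A={A}: F(1, {M - A}) = {F:.2f}")
```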
A.16 From Table 7 it can be seen, comparing the cross-validation ratio with either unity or Q, that the first dimension in the data is not significant. This implies that the best PC model which can be fitted to the data is the mean value for each variable. However, the second dimension was indicated to be significant, with all the lower dimensions indicated as non-significant. Let us consider the objectives of the cross-validation technique and the nature of the data set under analysis. Cross-validation was developed by Wold to identify those PC dimensions with best predictive information, as opposed to identifying the division between the real and error dimensions. The identified 'systematic' dimensions were then used to formulate a PC model for a known class of samples, the model then being used to help classify unknown samples. The mixture data under analysis, however, was formulated using a structured design and therefore contains no class (group) structure. It can be seen from the scores plot obtained in A.5 that the PC dimensions reflect the structured nature of the data, the first dimension expressing the proportions of two of the constituents in the samples and the second dimension expressing the proportion of the third constituent. It is not surprising, therefore, that a non-systematic result was obtained for a one-dimensional model, as such a model contains no information to estimate those samples containing a large proportion of the third constituent. However, a systematic result was obtained for a two-dimensional model; this implies that there are two real dimensions (three constituents) within the data. This is in accord with the known facts of how the mixture data was constructed. This result helps illustrate the differences that exist between the criteria of the different dimensionality-reducing techniques presented in this chapter, and demonstrates that
one has to consider the different criteria of the techniques and the nature of the data under analysis when testing to identify the true dimensionality of the data.
A.17
The results for the cumulative percent variance indicate the data to be 4-, 7- or 13-dimensional depending on whether one wishes to retain 90%, 95% or 99% of the variance of the data, but the results for the average eigenvalue indicate the data to be only two-dimensional. The result for the scree test is not clear cut, but one could conclude that the residual percent variance starts to level off at the 4th or 5th dimension. The IE function fails to reach a minimum but the IND function indicates the data to be three-dimensional. The Malinowski F-test indicates the data to be one-dimensional if one tests for significance at the 0.1% and 1% levels and two-dimensional if one tests for significance at the 5% level. Finally, both implementations of cross-validation indicate the data to be two-dimensional. However, Krzanowski [20] argues that the data is four-dimensional, as the very small W value for the third dimension is sandwiched between the "significant" values of 5.95 and 1.03 for components 2 and 4. Deane and MacFie [8] showed that the non-significant result for the third dimension could be due to a disparity between the cross-validation technique and the data. Cross-validation aims to identify those dimensions with best sample predictive ability; the third dimension, however, was dominated by variable 11 (see Table 9), and this was shown to be the variable with the lowest correlation to all other variables; hence it would not be predictable from the other variables during cross-validation. Looking at the results as a whole it can be seen that there exists wide variation, in the identified dimensionality of the data, between the different techniques. Similar results have been presented in the literature [35]. While some differences can be attributed to the specific nature of the test criteria, it is clear that the decision of what to take as the true dimensionality of the data is not easy. Before applying such methods one has to consider the aims of the analysis being performed; is it to identify those dimensions with best predictive ability, is one only performing exploratory analysis and therefore the requirement is for a rough estimate of the data dimensionality, or does one want to accurately define the data dimensionality as a precursor to variable reduction? The type of analysis to be performed will therefore reduce the number of techniques available to apply; but it is recommended that more than one technique be applied, and the results obtained compared with one's knowledge of the data set before making a firm decision.
A.18 Ten combinations need to be tested, namely variable combinations 123, 124, 125, 134, 135, 145, 234, 235, 245, 345.
A.19
Table 12 shows that no two identified 'best' subsets are identical, though certain variables are common to a number of subsets; this is due to the different optimality criteria of the techniques. The size of S^2 for each method indicates the Krzanowski technique as identifying the 'best' subset of four variables to reproduce the sample structure of the aphid data set. A comparison of Figs 6, 7 and 8 with Fig. 5 shows varying levels of performance of the different variable reduction techniques in relation to reproducing the group structure of the aphid samples. All three subsets retain the group of aphids in the top left corner of Fig. 5, albeit more loosely in the Jolliffe and McCabe subsets. However, the other three groups evident in Fig. 5 are merged in the Jolliffe and McCabe subsets, but they are retained in the Krzanowski subset. In fact, the compactness of the aphid classes has been improved by the Krzanowski selected subset; this is due to the removal of other variables which contribute 'noise' to the group structure of the aphid samples. These results indicate the Krzanowski variable reduction technique to be the best for this application of identifying the subset of four variables which best retain the aphid class structure. It is not surprising that the Krzanowski selected subset best retained the group structure of the example data set; this technique aims to maintain sample structure (distance between points) while the Jolliffe and McCabe techniques aim to recover the data variance-covariance and correlation matrix respectively. This example demonstrates that the most appropriate variable reduction technique to apply depends on the required properties of the selected subset.

A.20
A two-stage pattern is evident in the figure; inclusion of up to seven dimensions to define the true configuration results in wide variation in the importance of each test, but the inclusion of the last five dimensions results in only slight variation between the test results. The first seven dimensions therefore appear to contain important information on fuel composition, as they define the test deletion order, while the last five dimensions contain random noise, as they only vary the proportion of systematic information lost about a well defined level. This would suggest that the data is seven-dimensional. One could therefore conclude that the subset of the test criteria studied could be reduced from 12 to 7 tests while still retaining the sample
structure. This example shows how critical it is to accurately define the true dimensionality of the data prior to performing variable reduction. If a decision had been made using 3 dimensions to define the true configuration, as identified by cross-validation (Table 13), tests could have been removed (smoke point and aromatics) which, with the inclusion of additional dimensions, were shown to be critical for defining sample variation, as they provide unique sample information.
CHAPTER 6
Cluster Analysis
N. Bratchell¹, AFRC Institute of Food Research, Shinfield, Reading, Berkshire RG2 9AT, U.K.
1 Introduction
Consider the following situation. Multivariate measurements, for instance chromatograms or spectra, are taken for a set of samples, and the objective is to investigate any groupings among the samples. This is a common problem of which there are many examples in the literature: identification of groups of chemicals on the basis of structure and activity measurements [1]; demonstration of differences between strains of bacteria characterized by biochemical tests [2]; grouping of surface waters on the basis of physicochemical measurements [3]; testing taxonomic grouping of collagen from different organisms using their amino acid compositions [4]; identifying samples for the design of near infra-red calibration [5]. Even a brief search of the literature will reveal many other examples. These problems are all addressed by cluster analysis. The term cluster analysis covers a class of techniques which seek to divide a set of samples or objects into several groups or clusters. The primary criterion for this division is that the clusters are homogeneous and objects within the same group are more similar to each other than are objects in different groups. Five basic types of clustering methods have been identified [6]: hierarchical, optimization-partitioning, clumping, density-seeking and other methods. The choice between them depends primarily on the aim of the investigation, but may be moderated by the type of data and the availability of the method. This chapter will concentrate on the hierarchical and optimization-partitioning methods as they are the most popular. In the hierarchical methods objects are grouped first into small clusters of two or three objects; these are then grouped into larger clusters, and so on. The optimization-partitioning methods seek to divide the set of objects into a number of groups so that some criterion, for instance total within-group distances, is optimized.
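As a generic illustration of the hierarchical idea (not taken from this chapter, and using the SciPy library as an assumed tool), a small data set can be merged into progressively larger clusters and the hierarchy then cut at a chosen number of groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Generic illustration of hierarchical clustering: objects are merged into
# progressively larger clusters and the hierarchy is cut to give a chosen
# number of groups.  X stands for any N x M data matrix, e.g. samples by
# spectral intensities; the two Gaussian blobs below are purely synthetic.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])

Z = linkage(X, method='ward')                      # build the hierarchy
labels = fcluster(Z, t=2, criterion='maxclust')    # cut it into two clusters
print(labels)
```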
¹ Present address: Pfizer Central Research, Sandwich, Kent, U.K.
Cluster analysis is widely used in many fields and consequently appears under such different names as unsupervised pattern recognition and numerical and mathematical taxonomy. In addition to the examples cited above, there are many more in the references given throughout this chapter and numerous others which are not listed here. Three useful texts devoted to cluster analysis are by Everitt [6], Sneath and Sokal [7] and Massart and Kaufman [8]. The former [6] provides a thorough review of the subject, while the latter [7, 8] are aimed towards biologists and chemists respectively and contain many interesting examples. More concise descriptions of cluster analysis are by Bratchell [9] and those contained in the chemometrics textbooks by Massart et al. [10], Sharaf et al. [11] and Brereton [12].

2 Two problems
This section will introduce two data sets to be used later. They will highlight the fundamental problems underlying cluster analysis.

2.1 Elastin data
The amino acid composition of elastins using eight amino acids was obtained from 32 higher vertebrate species and the aims were to (a) elicit the relationships between them and (b) determine whether these relationships follow the classical taxonomic evolution of the organisms. The plot of the first two principal components of the data is shown in Fig. 1.
Fig. 1 Principal components plot of the amino acid compositions of elastins of higher vertebrate species (elastin data).
Principal components analysis is discussed in various chapters. A qualitative introduction is given in Chapter 2. In Chapter 7, Section 3 principal components analysis is introduced in terms of singular value decomposition. Singular value decomposition, which is the mathematical basis of PCA, is discussed in Chapter 3.
2.2 Near infra-red data

Near infra-red spectra comprising 10 wavelengths were obtained for 120 meat samples and their first two principal components are plotted in Fig. 2. The objective was to examine the spectra and choose a few representative samples to form a calibration set [5] for indirect measurement of moisture and fat contents of meat.
Fig. 2 Principal components plot of the near infra-red spectra of meat samples (NIR data).
2.3 Identifying clusters

As with many other forms of multivariate analysis it is convenient to consider data as points in multivariate space. The principal components plots in Sections 2.1 and 2.2 provide an important visualisation of the multidimensional scatter of the samples.
Q.1 Identify the clusters in the two figures given in Sections 2.1 and 2.2.
No simple definition of a cluster exists. In general it depends on the nature of the data and on the aim of the analysis. With some sets of data the groups are clearly distinct: they may be similar-sized, circular, spherical, elongated or curved arrangements of objects. In other cases they may be observable only as an increased density of points against a general background scatter. More frequently however we have an intermediate situation with clusters of different shapes and densities. The clusters identified in Q.1 were based on intuitive notions of a cluster and the context in which these notions were applied. This leaves the rather vague definition implied in the introduction that a cluster is simply a collection of objects that are more similar within clusters than between clusters. In the next section the concept of similarity will be defined more precisely and used to develop several methods of cluster analysis.

2.4 The influence of variables
The spatial arrangements of the objects examined in Section 2.3 are a consequence of the observations made on the objects, i.e. of the variables. Another problem, therefore, is one of variable selection [6-8, 13, 14]. Chatfield and Collins [14] note that data may often be clustered in different ways according to the purpose of the investigation, and give an example of a pack of cards which can be sorted by suits or numbers depending on the game. The two variables are suits and numbers, each producing a different classification, and the analysis should be interpreted in view of this choice. Variable selection is discussed in greater detail in Chapter 5, Section 4. Also, in the context of supervised pattern recognition, this topic is covered in Chapter Sections 10 and 11 and Chapter 8, Section 5.1.2.3. Preprocessing data before cluster analysis is essential. Often an investigator measures several variables without knowing their qualities. Some may be discarded because they are invariant over the data set. For example, microbiological tests often give binary observations (these are observations that are either positive or negative, often represented as either 1 or 0). Some of these will be constant for the data set and can be ignored because they add no useful information about differences between groups. Continuous variables rarely exhibit strict invariance, but will show some level of noise. Noise, while adding little useful information, will tend to reduce the distinctions between groups and should be removed. Other variables may be inherently unsuitable for a particular problem. For example, when
attempting to cluster clinical symptoms the sex of patients will almost certainly show the clearest grouping in the data. This is likely to obscure or confound other groupings of interest and such a variable should be omitted from the analysis, or the data can be divided according to this variable and separate analyses carried out. Preprocessing is discussed at length in Chapter 2, where various approaches to scaling are shown to make a major difference to the resultant interpretation of multivariate data display techniques.
3 Visual inspection

In Section 2, we indicated that the identification of clusters depends on both the spatial and informational contexts of the objects and that intuitive notions play an important part. These cannot be coded mathematically and we must appeal to the earlier notion of similarity but, as will be seen below, even a loose mathematical definition does provide useful results. However these must be verified in some way. The simplest approach is, of course, visual inspection of the data. Multidimensional scaling techniques such as principal components analysis, principal coordinates analysis and non-linear mapping [14-17] provide an essential visual impression of the data. This is often a sufficient check of the results, although it must be remembered that these plots are a low-dimensional graphical summary of the data and other groupings and relationships may appear only on the lower dimensions. A more rigorous verification of the solution can be obtained by applying a supervised pattern recognition method such as canonical variates analysis [14-16] or SIMCA [17, 18] to the groups recovered, as discussed elsewhere in this text. Further refinement of the solution may be permitted by these methods.
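As a minimal sketch of how such a visual inspection might be carried out (assuming Python with NumPy and matplotlib; the two simulated groups stand in for real data and are not taken from this chapter):

# A principal components display for visually inspecting possible groupings.
# The data matrix X (rows = samples, columns = variables) is simulated here.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 8)),   # one simulated group of samples
               rng.normal(3.0, 1.0, (15, 8))])  # a second, displaced group

Xc = X - X.mean(axis=0)                          # column-centre the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                                   # principal component scores

plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.show()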
See also Chapter 1, Section 10.

4 Measurement of distance and similarity

The simplest and most enduring methods of cluster analysis are based on the measurement of distance or similarity between objects. The alternative, that of considering the spatial patterns, is less tractable.
4.1 Distance and similarity
Similarity, dissimilarity and distance are three closely related terms used to discuss the comparability of objects. Distance is the most precisely defined measure and is governed by three conditions:

(i) D_ij ≥ 0, and D_ij = 0 if x_i = x_j;
(ii) D_ij = D_ji;
(iii) D_ik + D_jk ≥ D_ij;

where D_ij denotes the distance between two objects i and j. Conditions (i) and (ii) ensure that the measure is positive and symmetric. The symmetry condition implies that the distance from point i to point j is equal to that from j to i. There are examples, especially in the social sciences, where dissimilarity is measured subjectively and does not obey this condition. Condition (iii) is called the metric inequality and distinguishes between metric distances (expressed as numerical variables) and non-metric dissimilarities (e.g. colour, result of a positive / negative bacterial test). It ensures that the direct distance between two points is less than or equal to that measured via a third point. A distinction between metric distances and non-metric distances can be understood by the following example. The distances between three objects are 2.3, 1.7 and 2.9. Instead of presenting these metric distances, we could consider the non-metric ranks of the distances, namely 2, 1 and 3. The ranks of the distances do not (generally) obey condition (iii). Throughout the following discussion we will be concerned with metric distances and their corresponding similarities.
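For illustration (not part of the original text), the three conditions can be checked numerically for Euclidean distances between a few arbitrary points; the Python sketch below assumes NumPy is available.

# Check conditions (i)-(iii) for Euclidean distances between three arbitrary objects.
import numpy as np
from itertools import permutations

x = np.array([[2.0, 1.0], [0.5, 3.0], [4.0, 4.0]])          # three objects, two variables
D = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2))

assert np.all(D >= 0) and np.allclose(np.diag(D), 0)         # condition (i)
assert np.allclose(D, D.T)                                   # condition (ii): symmetry
for i, j, k in permutations(range(3), 3):                    # condition (iii): metric inequality
    assert D[i, k] + D[j, k] >= D[i, j] - 1e-12
print("all three conditions satisfied")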
4.2 Similarity between objects
As noted in Section 2, multivariate data are conveniently regarded as points in a multidimensional space. Consequently our perception of similarity between objects is satisfied by considering the spatial distances between points: objects with similar measurements are located close together in the space. Measurements of distance and similarity in this section derive naturally from this way of thinking. Typically three types of variable - categorical, ordinal and continuous - may be used to characterize a set of objects. The definition or choice of a distance measure depends on both the type of variable and its scale of measurement. The latter can have a profound effect on the representation of distances and consequently on the results of cluster analysis. While autoscaling or standardising to constant variance is
available and often used, range-scaling can be more effectively incorporated within the distance measures with the additional benefit of permitting variables of different types to be assembled into an overall similarity measure. In the following sections the distance and similarity along only a single variable are presented and subsequently combined in an overall similarity value.

4.2.1 Binary variables
Binary variables are simply categorical variables which denote a division into two categories. They are conveniently given the values 0 and 1 and frequently arise as the result of binary either / or tests. For example, a variable may record the presence or absence of a particular chromatographic peak. Two types of similarity measure are useful.
The simplest is the simple matching coefficient, which for the kth variable is given by

s_ijk = 1 if x_ik = x_jk
s_ijk = 0 if x_ik ≠ x_jk     (2)
This implies a distance of 1 between objects with the values 0 and 1 respectively and 0 otherwise.
A useful alternative to the simple matching coefficient is the Jaccard coefficient, which is used to distinguish between 1-1 and 0-0 matches:

s_ijk = 1 if x_ik = x_jk = 1
s_ijk = 0 if x_ik ≠ x_jk     (3)
s_ijk ignored if x_ik = x_jk = 0

The joint presence of the peak may represent a positive similarity, but its joint absence may not infer any degree of similarity. The latter is not considered to be a real similarity between the objects and does not enter the calculation.

4.2.2 Categorical variables
Categorical variables with more than two categories can be treated in the same manner as binary variables. That is, to consider a similarity of 1 between objects in the same category and 0 otherwise. Both the simple matching and Jaccard coefficients are appropriate.
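A small Python sketch of the two binary coefficients just described (the function names are illustrative, not from the text; the same functions serve for categorical variables scored as equal/unequal):

# Simple matching (Eq. (2)) and Jaccard (Eq. (3)) similarities for one binary variable k.
def simple_matching(xik, xjk):
    return 1 if xik == xjk else 0

def jaccard(xik, xjk):
    if xik == xjk == 1:
        return 1        # joint presence counts as a similarity
    if xik != xjk:
        return 0        # a mismatch counts as complete dissimilarity
    return None         # joint absence (0-0) is ignored in any overall average

print(simple_matching(0, 0), jaccard(0, 0))   # 1 and None: the coefficients differ only for 0-0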
4.2.3 Ordinal variables
Ordinal variables are categorical variables in which the categories, conveniently labelled 0, 1, 2, 3, etc., have a logical progression or order. The Manhattan or city-block distance is particularly appropriate for these variables as it defines degrees of similarity: objects at extreme ends of the scale have the lowest similarity, progressing to the highest similarity for objects in the same category. For the kth variable the city-block distance is given by

d_ijk = |x_ik - x_jk|     (4)

and implies the similarity measure

s_ijk = 1 - d_ijk     (5)

The arbitrary choice of category labels can be accounted for by range-scaling,

d_ijk = |x_ik - x_jk| / r_k     (6)

where r_k denotes the range of the kth variable (the range is the difference between the maximum and minimum values of the variable). This gives distances and similarities in the interval [0, 1] in parallel with those for binary variables.
4.2.4 Continuous variables
Continuous variables are the most frequently encountered type in chemistry; for example, chromatographic peak areas or chemical concentrations are both continuous variables. Euclidean distance is the best-known and most-used distance measure and is most appropriate for continuous variables. Again this can be range-scaled,

d_ijk² = ((x_ik - x_jk) / r_k)²     (7)

to give distances in the interval [0, 1], with the associated similarity given by

s_ijk = 1 - d_ijk     (8)
4.2.5 Overall similarity
In the sections above the similarity along only a single variable was defined. These can now be combined into an overall similarity between objects,

S_ij = (1/M) Σ_{k=1}^{M} s_ijk     (9)

where M is the number of variables. From Eq. (9) it can be seen that scaling or standardization of variables has important consequences. First, we may have a set of continuous variables with different scales of measurement or variances. Variables with large variances will lead naturally to large distances between objects, and these will tend to swamp or mask distances calculated from other variables with small variance. This is undesirable as it distorts the analysis by giving greater weight or influence to the variables with large variance. Similarly, variables are often measured in arbitrarily different units, again leading to unbalanced analyses. Either autoscaling to constant variance or range-scaling may be applied to the data to reduce the problem. Examples of column standardization are given in Chapter 2, Section 3.
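The way Eqs (2)-(9) combine mixed variable types might be sketched as follows (assuming Python; the two objects, variable types and ranges below are invented for illustration and are not the Q.2 data):

# Overall similarity (Eq. (9)) from range-scaled per-variable similarities.
def overall_similarity(obj1, obj2, kinds, ranges):
    total = 0.0
    for a, b, kind, r in zip(obj1, obj2, kinds, ranges):
        if kind == "binary":
            sk = 1.0 if a == b else 0.0           # simple matching, Eq. (2)
        elif kind == "ordinal":
            sk = 1.0 - abs(a - b) / r             # range-scaled city-block, Eqs (4)-(6)
        else:                                     # continuous
            sk = 1.0 - ((a - b) / r) ** 2         # range-scaled squared Euclidean, Eqs (7)-(8)
        total += sk
    return total / len(obj1)                      # average over the M variables, Eq. (9)

# One continuous, one binary and one ordinal variable for two hypothetical objects.
print(overall_similarity([1.3, 1, 2], [2.7, 0, 0],
                         ["continuous", "binary", "ordinal"],
                         [4.1, None, 3]))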
Two further problems are presented by categorical variables. Ordinal variables may have category labels 0, 1, 2, etc., which are arbitrary and bear no relation to other scales of measurement. And there may be several categorical or ordinal variables which have (arbitrarily) different numbers of categories. Furthermore, variance, used in autoscaling, has no simple interpretation and range-scaling is preferred as it measures distance relative to the number of categories, with a distance of 1 between the most extreme categories. Range-scaling follows naturally through to continuous variables, as presented, to give comparable similarity measures.

Q.2 (a) Show graphically the city-block and Euclidean distances between two points for the following bivariate data, given by the graph below.
(b) Calculate the (i) unscaled, (ii) scaled distances and (iii) scaled similarities between the two objects as defined below.
Variable    C1     C2     B    O
Object 1    1.3    23.5   1    20
Object 2    2.7    72.8   0    10

C1 and C2 are continuous variables with ranges 4.1 and 108.7, and B and O are ordinal variables with 2 and 4 categories. Show both the individual and overall similarities or distances as appropriate.
4.3 The similarity matrix
A convenient way of presenting the similarities or distances between objects is as a similarity or distance matrix. Table 1 presents the data for five objects. Table 2 gives their similarity matrix. Only the lower portion is shown; the matrix is symmetric with the upper portion identical to the lower, a consequence of the symmetry in the similarity metric.
Table 1. Observations on five objects.

            V1      V2     V3
Object 1    0.31    17.8   3
Object 2    0.10     9.3   3
Object 3    0.11    21.5   1
Object 4    0.58    22.0   2
Object 5    0.50    16.0   1
Table 2. Similarity matrix (of observations in Table 1).

            1      2      3      4      5
Object 1    1.00
Object 2    0.79   1.00
Object 3    0.58   0.36   1.00
Object 4    0.69   0.17   0.51   1.00
Object 5    0.61   0.34   0.72   0.75   1.00
The similarity matrix is interpreted in the following way. To find the similarity between objects 2 and 4, locate column 2 and row 4; their similarity is given at the intersection, namely 0.17.
Q.3 Give the complete similarity matrix for objects 1, 3 and 4 only, including elements above the diagonal of the matrix.
4.4 Similarity between variables
Although variables may also be represented as points in multidimensional space, it is generally more natural to consider their similarity in terms of their correlation coefficient.
C_ij = Σ_{k=1}^{N} (x_ki - x̄_i)(x_kj - x̄_j) / [Σ_{k=1}^{N} (x_ki - x̄_i)² Σ_{k=1}^{N} (x_kj - x̄_j)²]^{1/2}     (10)

where C_ij denotes the correlation between variables i and j and x_ki denotes the kth observation (or experiment or object) on variable i. In the previous section we examined the similarity of measurements on different objects. When clustering variables we are more interested in the similarity of their information. The correlation coefficient measures their linear association. This too can be applied to ordinal variables by correlating their ranks,

C_ij = Σ_{k=1}^{N} (r_ki - r̄_i)(r_kj - r̄_j) / [Σ_{k=1}^{N} (r_ki - r̄_i)² Σ_{k=1}^{N} (r_kj - r̄_j)²]^{1/2}     (11)

where r_ki denotes the rank of the kth observation on variable i. The rank of a variable is different to the rank of a matrix. Each variable is sorted in ascending order for all N observations (or experiments). The lowest value is given rank 1, and the highest rank N. Thus if experiment 1 yields the fifth lowest value of variable 3, then r_13 = 5.
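As a sketch (assuming NumPy and SciPy are available), the correlation (Eq. (10)) and rank correlation (Eq. (11)) between the three variables of Table 1 might be computed as:

# Similarity between variables as Pearson and Spearman (rank) correlations.
import numpy as np
from scipy import stats

X = np.array([[0.31, 17.8, 3],
              [0.10,  9.3, 3],
              [0.11, 21.5, 1],
              [0.58, 22.0, 2],
              [0.50, 16.0, 1]])      # the five objects of Table 1

C = np.corrcoef(X, rowvar=False)     # Eq. (10): correlations between the three columns
rho, _ = stats.spearmanr(X)          # Eq. (11): the same computed on ranks
print(np.round(C, 2))
print(np.round(rho, 2))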
4.5 Choice of similarity measure

It is not only the type of data under analysis which influences the choice of distance or similarity measure, but also the interpretation of the data and the investigation. This was demonstrated for the simple matching and Jaccard coefficients, where the absence of a particular chromatographic peak may or may not denote a real similarity. In some cases clustering of variables is useful, and consequently the correlation coefficient is more appropriate than the distances considered above. But Massart and Kaufman [8] contrast the Euclidean distance and the correlation coefficient by application to chromatographic retention indices. The Euclidean distance provides a measure of "absolute" distance between samples while the correlation coefficient provides a "relative" distance. This, however, is based less on the distinction between a distance and a correlation coefficient, but more on regarding them as transformations of the data.

Q.4
Consider how the choice of distance metric can influence the choice of multidimensional scaling technique used to inspect the data.
4.6 Inter-cluster distance

Cluster analysis requires two concepts of distance. The first concerns the appropriate measurement between two (arbitrary) points, which was the focus of this section. But it also requires some definition of inter- and intra-cluster distances. In the following sections the various methods of cluster analysis will be distinguished by these definitions.
5 Hierarchical methods

Hierarchical techniques [6-11] are the most popular and widely-used clustering methods because of their simplicity. They are simple and rapid to compute and interpretation is straightforward and intuitive. There are several forms but in all of them clusters are formed hierarchically on the basis of proximity of objects: objects or clusters which are most similar in some sense are merged into larger clusters. The distinctions between the techniques lie in the ways of defining most similar objects.
Here we will concentrate on the nearest neighbour or single link, average link and furthest neighbour or complete link methods [6-11]. The average link and furthest neighbour methods will be illustrated by application to the elastin data (Section 2.1).
5.1 Inter-cluster distances

The definitions of most similar objects are satisfied by using a distance-based similarity measure (Section 4). Ignoring the trivial case of a single-object cluster, consider a cluster as a group of objects in a multidimensional space. Such clusters are shown in Fig. 3. Also shown are the (a) nearest neighbour, (b) average link and (c) furthest neighbour distances. The nearest neighbour distance is the shortest distance between clusters and is measured between the pair of objects which are closest together. The furthest neighbour is the largest inter-cluster distance, measured between the pair of objects which are furthest apart. The average link distance is measured between cluster centroids; there are several variations on this theme, for instance between the cluster modes.
Fig. 3 Distances used in hierarchical cluster analysis: (a) nearest neighbour, (b) average link and (c) furthest neighbour.
Geometric distances between two vectors are discussed in Chapter 3, Section 3.3. The points in the above figures could be represented by vectors in two, or in more complex cases, multidimensional space and the formulae for the distances calculated accordingly.
5.2 Clustering algorithms

For all three methods, clustering proceeds in a very simple way. The first step is to compute the similarity matrix. The algorithm then identifies the pair of objects with the highest similarity (or smallest distance) and merges them into a cluster. The matrix is updated to give the appropriate distances between the new cluster and the
remaining objects. The process is repeated, continually merging objects or clusters with the greatest similarity and updating the similarity matrix until there is a single cluster. To illustrate the method, consider the data and similarity matrix in Tables 1 and 2. We will apply the nearest neighbour method.
Step 1 is to identify the largest similarity in Table 2. This is a value of 0.79 between objects 1 and 2. These are merged into a cluster, denoted (1,2).
Step 2 updates the similarity matrix. The similarities between the new cluster (1,2) and objects 3, 4 and 5 are identified. In this case the method uses the nearest neighbour distances, hence the largest similarities. For example the similarity between cluster (1,2) and object 3 is the larger of the values 0.58 (between objects 1 and 3) and 0.36 (between objects 2 and 3). Hence, the nearest neighbour similarity between cluster (1,2) and object 3 is 0.58. The updated similarity matrix is given in Table 3.

Table 3. Updated nearest neighbour similarity matrix (from Table 2).

               (1,2)   3      4      5
Cluster (1,2)  1.00
Object 3       0.58    1.00
Object 4       0.69    0.51   1.00
Object 5       0.61    0.72   0.75   1.00
Step 3. Step 1 is repeated on the updated similarity matrix (Table 3). The largest similarity now is 0.75 between objects 4 and 5. These form a cluster (4,5).
Step 4 repeats step 2 and the similarity matrix is updated to give the similarities between the new clusters, Table 4.

Table 4. Updated nearest neighbour similarity matrix (from Table 3).

               (1,2)   3      (4,5)
Cluster (1,2)  1.00
Object 3       0.58    1.00
Cluster (4,5)  0.69    0.72   1.00
Step 5. Object 3 is merged with cluster (4,5), since this (0.72) is now the largest similarity, to form the cluster (3,4,5).
Step 6. The two clusters (1,2) and (3,4,5) are finally merged together at a similarity of 0.69.
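A sketch of the same calculation using SciPy (an assumption; the text itself does not use this software), converting the similarities of Table 2 into distances d = 1 - s:

# Nearest neighbour (single link) clustering of the Table 2 similarities with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

S = np.array([[1.00, 0.79, 0.58, 0.69, 0.61],
              [0.79, 1.00, 0.36, 0.17, 0.34],
              [0.58, 0.36, 1.00, 0.51, 0.72],
              [0.69, 0.17, 0.51, 1.00, 0.75],
              [0.61, 0.34, 0.72, 0.75, 1.00]])

D = 1.0 - S                                          # similarity -> distance
Z = linkage(squareform(D, checks=False), method="single")
print(Z)   # merges at d = 0.21, 0.25, 0.28, 0.31, i.e. at similarities 0.79, 0.75, 0.72, 0.69
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree described in Section 5.3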
~
Q.5 Apply furthest neighbour cluster analysis to the data of Table 1.

5.3 The dendrogram

The simplest and most informative method of presenting a hierarchical cluster analysis is a dendrogram. This is a tree-like structure which shows the series of merges between objects and clusters and the similarities at which they occur; it displays the clusters hierarchically. The dendrogram for the nearest neighbour analysis of Section 5.2 is given in Fig. 4.
Fig. 4 Dendrogram obtained from a nearest neighbour cluster analysis (Section 5.2).
The dendrogram shows instantly the hierarchical structure of the method. It also shows the similarities at which clusters merge. For example, objects 4 and 5 merge at a similarity of 0.75, and these are later merged with object 3 at a similarity of 0.72, and finally with the cluster (1,2) at 0.69.
Q.6 Draw the dendrogram for the furthest neighbour analysis of Q.5.
Fig. 5 Dendrograms of elastin data obtained from (a) average linkage and (b) furthest neighbour analyses.
5.4 Elastin example
In the example, the amino acid composition of elastins was obtained from 32 higher vertebrate species [10] and hierarchical cluster analysis was applied to (a) elucidate the relationships between them and (b) determine whether these relationships follow the classical taxonomic evolution of the organisms. The data were analysed by both the average linkage and furthest neighbour analyses and the results are presented in Fig. 5.
Fig. 6 Main groups obtained from hierarchical cluster analyses.
Four main clusters were identified for each method and found to be identical; they are also shown in the form of an x-y plot of the first two principal components (Fig. 6). The differences between the two dendrograms are due to the different ways of determining inter-cluster distances. This is readily reflected in the ways in which the four labelled clusters are merged. In the average linkage analysis groups 1 and 2 are merged to form one cluster while groups 3 and 4 form a second cluster; these two clusters are merged and finally joined by the outlier in the top right-hand corner of the plot. This contrasts with the furthest neighbour analysis in which the outlier becomes part of group 1 and there is a sequence of merges in which groups 3 and 4 are merged, joined by group 2 and finally by group 1. Closer inspection reveals congruence with the classical taxonomy of the organisms.
5.5 Problems with hierarchical analyses

The hierarchical relationship has ensured the popularity of these methods in the biological sciences because such an arrangement of groups often has a meaningful taxonomic relationship. In many other applications such an interpretation does not readily exist, and their popularity is ensured by their simplicity.
Fig. 7 Example of a chained cluster.
However, there are, or can be, several problems in their application. The nearest neighbour method, despite being the best-known and only mathematically acceptable method, is subject to "chaining", in which an elongated cluster can be formed which is composed of a string of objects passing through the data. An example of a chained cluster is given in Fig. 7. The objects in the large cluster are optimally connected in the sense that each object within the cluster is close to its immediate neighbours, but the cluster is inhomogeneous in that objects at either end of the cluster are clearly very different from each other while being more similar (and closer) to objects placed in surrounding clusters. Moreover, it is clear that
there are two distinct clusters connected by a bridge consisting of closely spaced objects. The method, having formed a connection, cannot break it and form other groups. If the clusters are clearly distinct, chaining is not generally a problem; but if the data are more scattered chaining can lead to misinterpretation unless backed up by plots of the data. In some cases it is necessary or desirable to divide data into several distinct clusters. The hierarchical analyses do not naturally recover distinct clusters, but this can be accomplished by cutting the dendrogram at an appropriate point. Indeed this was used to identify the four groups in the previous example (Section 5.4), although in this case the four groups are not especially distinct, as can be confirmed by the x-y plot. Inspection of the dendrogram from the average linkage analysis does not provide a clear choice between a two cluster solution and a four cluster solution, and the furthest neighbour analysis suggests a two cluster solution.
6 Optimization-partitioning methods

The optimization-partitioning or k-means methods address more directly the problem of dividing a set of data into several mutually-distinct, homogeneous groups, often termed dissection, and do not lead to hierarchical relationships. As noted in Section 5.5 such dissection can be achieved by cutting a dendrogram, but the homogeneity of such clusters is questionable. The techniques described in this section are an attempt to provide a statistically-sound allocation of objects into homogeneous groups; other techniques which fall into this category have a more pragmatic basis but are nevertheless similar.
The concept of a cluster centroid is discussed in Chapter 7, Section 2 and in Chapter 8, where it is introduced as the mean measurement of a group. It is also sometimes called the centre of gravity of a cluster.
6.1 The method of optimization-partitioning

These techniques all follow a common pattern or algorithm. Initially the number of groups to be identified is decided in advance and a criterion to measure the success of clustering is chosen. The objective is to optimize this criterion. The data are divided into the appropriate number of groups, each cluster is represented by a centrotype and the criterion is evaluated. The centrotype may be the group centroid or the most typical member, and the criterion might be to reduce the scatter of objects about their respective centrotypes. At each iteration of the algorithm objects are transferred and swapped between groups to improve the criterion value and the
centrotype and criterion value are updated. This procedure continues until any further changes reduce the optimality of the solution.
6.2 Clustering criteria

There are various clustering criteria and algorithms to choose from, and Everitt [6] and Massart and Kaufman [8] provide extensive reviews, but the distinction between criterion and algorithm should be noted: various algorithms can be used to optimize a single criterion and vice versa, and many apparently different criteria can be in fact identical. For example, the three aims of maximizing the distances between cluster centroids, minimizing the distances of objects to their centroids and minimizing the scatter about the centroids are essentially the same. To understand the method of optimization-partitioning, consider the criterion of minimizing the mean within-group distances. The general algorithm can be applied to other criteria, which will be examined subsequently.
6.2.1 Minimizing the within-group distances

It is convenient, again, to consider a cluster as a group of points in a multidimensional space, as shown below. For each cluster there is a cluster centrotype. This may be a particular object, or it may be the cluster centroid. Initially we will consider the centroid, although the discussion applies to both cases. The mean within-group distance is calculated as the mean of the distances between the individual objects in a cluster and their centrotype. The basic steps of the algorithm used to minimize the within-group distances are outlined below. A more complete implementation of this is Forgy's algorithm [8].
Fig. 8 Initial allocation of data to two clusters.
To start, the number of clusters to be identified is chosen. Here we will seek two clusters. And the objects are allocated to one of the groups. This may be done at random as in Fig. 8, or by cutting a dendrogram at an appropriate point (Section 5.5). The latter method has the advantage of starting closer to an optimal allocation and so leads more rapidly to a solution.
The algorithm now consists of repeatedly swapping and transferring objects between the clusters. At each step the centrotype and distances are computed to identify the best allocation. Each object in turn is temporarily transferred to another group and the new centroid and within-cluster distances are calculated. The transfer which produces the smallest within-group distances is made permanent. Alternatively, a swap is simply two simultaneous transfers and the same procedure is carried out. This process is repeated until any further swaps or transfers increase the within-group distances. At this point the criterion has been optimized. However, as the algorithm is a numerical procedure, different initial allocations may lead to different local optima. In other words another allocation may produce smaller within-group distances. The data illustrated in Fig. 8 can be clustered from two initial allocations. Solution (a) is more optimal in that it has smaller within-group distances than solution (b), although moving a single object increases the within-group distance in both cases.
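A hedged sketch of the same idea using scikit-learn's k-means (an assumption, and not Forgy's transfer-and-swap algorithm, but it optimizes the same kind of within-group criterion); the data are simulated, and repeated random starts guard against the local optima discussed above:

# Optimization-partitioning via k-means: minimize within-group squared distances to centroids.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)),
               rng.normal(4.0, 1.0, (30, 2))])      # two simulated groups of points

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.inertia_)            # the optimized within-group criterion value
print(km.cluster_centers_)    # the two cluster centroids (centrotypes)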
Q.7 Using only pencil and ruler, verify that solution (a) in Fig. 9 is a more optimal solution than solution (b), and verify that both solutions are locally optimal.

This description of the optimization-partitioning algorithm was based on using the cluster centroids as the centrotype. Massart and Kaufman [8] describe the MASLOC algorithm which seeks clusters centred about particular objects, also known as k-median methods. It seeks k objects which are the centrotypes for each group, and minimizes the total distance of cluster members from these objects rather than from the centroids.
Fig. 9 Two locally optimal allocations of the same data. Solution (a) is more optimal than solution (b).
6.2.2 Sums of squares and products

Minimum within-group distance is a natural criterion on which to base clusters. However, the problem can be approached from the point of view of minimising the variances of objects about their centroids. Consider the two groups of points found by solution (a) in Fig. 9; there are overall and cluster centroids. Along any axis we can define three levels of variation: overall, between-groups and within-groups. The overall variation in the data can be divided into two components - within-group variation and between-group variation - which obey the relationship
T = B + W     (12)
where T, B and W are the total, between and within group sums of squares and products matrices. T is the variation of objects about the overall mean of the data: the diagonal elements are proportional to the variances of the variables and the off-diagonal elements are proportional to their covariances (see Section 4.1, Chapter 4 for further details). B is the variation of cluster centroids about the overall mean and W is the variation of individual objects about their cluster centroids. Since T is fixed, if we minimise the variation within clusters we simultaneously maximise the variation between clusters. In terms of the distances between objects, minimising the within-cluster variation is the same as minimising the distances between the objects and their cluster centroids, which maximises the distances between the cluster centroids. There are two ways in which we can minimise the within-cluster variation W, as discussed in the following two sections.
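The decomposition of Eq. (12) is easy to verify numerically; the following sketch (assuming NumPy, with a simulated data set and an arbitrary two-group allocation) computes T, B and W directly from their definitions:

# Verify T = B + W (Eq. (12)) for an arbitrary allocation of objects to two groups.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
labels = np.array([0] * 10 + [1] * 10)

xbar = X.mean(axis=0)
T = (X - xbar).T @ (X - xbar)                         # total sums of squares and products

B = np.zeros((3, 3))
W = np.zeros((3, 3))
for g in np.unique(labels):
    Xg = X[labels == g]
    cg = Xg.mean(axis=0)
    W += (Xg - cg).T @ (Xg - cg)                      # within-group scatter
    B += len(Xg) * np.outer(cg - xbar, cg - xbar)     # between-group scatter

print(np.allclose(T, B + W))                          # True: the decomposition holds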
6.2.3 Minimum within-group sums of squares

The diagonal elements of W in Eq. (12) are, apart from a constant, equivalent to the within-group variances of the variables. A natural objective is to form clusters with minimum variances. This is achieved by minimising the sum of the diagonal elements of W, termed trace(W). This criterion minimises the distances between objects and cluster centroids along each of the axes.
The trace of a matrix is discussed in Chapter 3, Section 5.5.
Q.8 Consider the two solutions (a) and (b) given in Fig. 9.
(a) Which cluster solution is optimal according to the minimum trace criterion?
(b) What is the effect of working with the pooled within-group sum of squares?
(c) What is the effect of scale of measurement?
6.2.4 Minimum generalized variance

The minimum trace criterion takes no account of the correlations between variables. The trace(W) is a measure of the size of clusters. Another such measure is its determinant |W|, which effectively measures the volume of the clusters. It is also called the generalized variance [15]. It has two advantages: it is invariant under transformation of the variables, and highly correlated variables are not given excessive weight.
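The invariance claim can be illustrated with a small numerical sketch (assuming NumPy; the data and grouping are simulated): rescaling one variable changes trace(W) drastically, whereas every candidate |W| is multiplied by the same constant, so the ranking of alternative partitions by the determinant is unaffected.

# Contrast the trace and determinant criteria under a change of variable scale.
import numpy as np

def within_scatter(X, labels):
    W = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(labels):
        Xg = X[labels == g]
        W += (Xg - Xg.mean(axis=0)).T @ (Xg - Xg.mean(axis=0))
    return W

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (15, 2)), rng.normal(3.0, 1.0, (15, 2))])
labels = np.array([0] * 15 + [1] * 15)

W = within_scatter(X, labels)
W_scaled = within_scatter(X * [1.0, 100.0], labels)     # second variable rescaled by 100

print(np.trace(W), np.trace(W_scaled))                  # trace changes by orders of magnitude
print(np.linalg.det(W), np.linalg.det(W_scaled))        # |W| changes only by the factor 100**2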
Q.9 Explain the advantages of this approach.
6.3 Near infra-red example

Naes [5] describes an elegant solution to the problem of choosing a set of representative objects on which to base calibration equations. The solution is to analyse all of the samples by the indirect method (near infra-red reflectance spectrometry in that case) and to apply cluster analysis to group the data into smaller, more homogeneous groups from which a few representative samples can be drawn. Although Naes uses hierarchical cluster analysis, this is essentially a dissection problem for which an optimization-partitioning technique is appropriate as it is more likely to produce homogeneous clusters.

Q.10 Explain why the approach to object selection is a good strategy.
In this example near infra-red spectra comprising 10 wavelengths were obtained for 120 samples and the plot of their first two principal components is shown in Section 2.2. Although there are few clear clusters it is necessary to divide the data into several smaller groups. The minimum trace and determinant criteria were applied. In practice the analysis would seek 15 to 20 clusters. For simplicity only the five-group solutions are presented. Tables 5 and 6 give the cluster centroids for the two methods.

Q.11
Inspect Tables 5 and 6 and comment on the results. Explain how to choose representative samples and why this would be a good calibration set. What other method could be used?
6.4 Problems with optimization-partitioning methods

Despite the basic aim of partitioning a set of data into several more homogeneous groups, the optimization-partitioning methods do not readily allow the correct number of clusters to be decided. Several methods have been proposed, such as plotting the criterion value against the number of groups, but none of them are very useful. The exception is to plot g²|W| against g [15], where g is the number of groups and the criterion is the determinant of the within-groups matrix. A value appreciably lower than neighbouring values indicates an optimal division into groups. This was found by Everitt [6] to be a generally reliable guide.
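A sketch of this rule (assuming NumPy and scikit-learn's k-means as a stand-in partitioning method; the three-group data are simulated): the value of g²|W| is evaluated over a range of g, and a value clearly below its neighbours would suggest that number of groups.

# Evaluate the g^2 |W| criterion over candidate numbers of groups.
import numpy as np
from sklearn.cluster import KMeans

def within_scatter(X, labels):
    W = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(labels):
        Xg = X[labels == g]
        W += (Xg - Xg.mean(axis=0)).T @ (Xg - Xg.mean(axis=0))
    return W

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 0.5, (40, 3)) for m in (0.0, 3.0, 6.0)])   # three simulated groups

for g in range(2, 9):
    labels = KMeans(n_clusters=g, n_init=10, random_state=0).fit_predict(X)
    print(g, g ** 2 * np.linalg.det(within_scatter(X, labels)))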
Table 5. Cluster centroids obtained by the trace criterion.

Group   Wavelength
        1       2       3       4       5       6       7       8       9       10
1       606.4   488.1   456.0   537.8   634.7   162.9   236.3   601.8   482.0   546.3
2       494.8   350.5   315.4   392.7   526.6    56.4   129.8   482.9   332.0   416.9
3       568.4   456.9   425.6   508.2   613.7   136.9   200.9   565.5   450.6   510.0
4       516.5   400.2   368.7   450.5   570.3    94.1   156.1   512.6   387.5   454.4
5       537.8   430.7   400.2   482.6   594.6   117.2   175.8   536.2   421.8   481.4
Table 6. Cluster centroids obtained by the determinant criterion.

Group   Wavelength
        1       2       3       4       5       6       7       8       9       10
1       583.6   431.6   396.1   476.0   596.2   119.9   209.4   571.4   421.4   505.6
2       550.5   425.4   391.8   471.3   581.3   112.2   181.6   543.4   411.2   484.0
3       542.2   437.1   407.5   493.3   604.8   121.6   175.8   542.8   428.0   486.8
4       520.5   409.2   378.7   461.2   584.8   103.2   165.3   517.2   404.0   461.2
5       550.2   465.6   436.5   516.4   604.9   142.3   191.6   552.6   456.5   506.0
A second difficulty lies in knowing the optimality of a particular solution. As described in Section 6.2.1, these methods require iterative numerical algorithms which may or may not converge on a global optimum. In general it is essential to repeat the analysis from different starting allocations. Convergence on the same (local) optimum indicates that a reasonable solution has been obtained. This can be enhanced by starting near the optimum, for example by starting with an allocation
obtained from a hierarchical analysis. As ever, visual inspection of the data and solutions is essential (Section 3).
7 Conclusions

This chapter has given a brief description of cluster analysis. Several components were identified: statement of the problem to be solved, visual inspection of the data and consideration of the informational content of the data, choice and formation of a distance or similarity measure, choice and formation of a clustering criterion. The various choices and their reasons have been reviewed and illustrated.

8 References
1. B. Chen, C. Horvath and J.R. Bertino, Multivariate Analysis and Quantitative Structure-Activity Relationships. Inhibition of Dihydrofolate Reductase and Thymidylate Synthetase by Quinazolines, Journal of Medicinal Chemistry, 22 (1979), 483-491.
2. G.C. Mead, A.P. Norris and N. Bratchell, Differentiation of Staphylococcus Aureus from Freshly-Slaughtered Poultry and Strains 'endemic' to Processing Plants by Biochemical and Physiological Tests, Journal of Applied Bacteriology, 66 (1989), 153-159.
3. J.H.M. Bartels, A.H.M. Janse and F.W. Pijpers, Classification of the Quality of Surface Waters by means of Pattern Recognition, Analytica Chimica Acta, 177 (1985), 35-45.
4. H.J.H. MacFie, N.D. Light and A.J. Bailey, Natural Taxonomy of Collagen based on Amino Acid Composition, Journal of Theoretical Biology, 131 (1988), 401-418.
5. T. Naes, The Design of Calibration in Near Infra-Red Reflectance Analysis by Clustering, Journal of Chemometrics, 1 (1987), 121-134.
6. B.S. Everitt, Cluster Analysis, Heinemann, London, (1981).
7. P.H.A. Sneath and R.R. Sokal, Numerical Taxonomy: The Principles and Practice of Numerical Classification, Freeman, San Francisco, (1973).
8. D.L. Massart and L. Kaufman, Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, Wiley, New York, (1983).
9. N. Bratchell, Cluster Analysis, Chemometrics and Intelligent Laboratory Systems, 6 (1989), 105-125.
10. D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman, Chemometrics: a Textbook, Elsevier, Amsterdam, (1988).
11. M.A. Sharaf, D.L. Illman and B.R. Kowalski, Chemometrics, Wiley, New York, (1986).
12. R.G. Brereton, Chemometrics: Applications of Mathematics and Statistics to Laboratory Systems, Ellis Horwood, Chichester, (1990).
13. M.G. Kendall, Multivariate Analysis, Griffin, London, (1975).
14. C. Chatfield and A.J. Collins, Introduction to Multivariate Analysis, Chapman and Hall, London, (1986).
15. F.H.C. Marriott, The Interpretation of Multiple Observations, Academic Press, London, (1974).
16. K.V. Mardia, J.T. Kent and J.M. Bibby, Multivariate Analysis, Academic Press, London, (1979).
17. I.T. Jolliffe, Principal Components Analysis, Springer-Verlag, New York, (1986).
18. S. Wold, Pattern Recognition by means of Disjoint Principal Component Models, Pattern Recognition, 8 (1976), 127-139.
ANSWERS

A.1 There are no correct solutions to the problem of identifying clusters. In general each person asked to identify the clusters will recognize different groups. For the elastin data there are several clearly-defined clusters but also a number of loosely grouped and outlying samples. For the NIR data there are no clear groups and the objective is simply to partition the data into several homogeneous groups. Each of these groups then provides a representative sample to form the calibration set.
A.2 (a) A sketch showing the Euclidean distance and the city-block distance between the two points.
(b) The distances for the continuous variables may be calculated by using the Euclidean distance (Eq. (7)), omitting the divisor to obtain the unscaled distances. For the binary variables either the simple matching coefficient (Eq. (2)) or the Jaccard coefficient (Eq. (3)) could be used. The city-block distance (Eq. (4)) should be used for the ordinal variable. The overall distance or dissimilarity may be obtained from these individual values using Eq. (9). The individual and overall values are given below (note: scaled similarity and distance add up to 1).

Variable             C1     C2     B    O      Overall
Unscaled distance    1.96   2430   1    10     610.7
Scaled distance      0.12   0.21   1    0.25   0.39
Scaled similarity    0.88   0.79   0    0.75   0.61

A.3 The similarity matrix for objects 1, 3 and 4 is given below. Note the symmetry about the diagonal elements.
            1      3      4
Object 1    1.00   0.58   0.69
Object 3    0.58   1.00   0.51
Object 4    0.69   0.51   1.00
A.4
Principal components analysis and principal coordinates analysis are based on different theories but if principal coordinates analysis is applied to a distance matrix of Euclidean distances they give the same result. However, if a non-Euclidean distance measure, e.g. ranks, is chosen then either principal coordinates analysis or multidimensional scaling can be applied to the distance or similarity matrix to give a more "direct" view of the data to be analysed by cluster analysis.
A.5 The steps of the analysis are the same as those for the nearest neighbour analysis above. However, this time the inter-cluster similarities are not the largest but the smallest.
Step 1 is the same as above. That is, objects 1 and 2 are merged into a cluster. But the updated similarity matrix is different. The similarity between cluster (1,2) and object 3 is the smaller of the values 0.58 (between objects 1 and 3) and 0.36 (between objects 2 and 3). Hence, the furthest neighbour similarity between cluster (1,2) and object 3 is 0.36. The similarity matrix is updated.

               (1,2)   3      4      5
Cluster (1,2)  1.00
Object 3       0.36    1.00
Object 4       0.17    0.51   1.00
Object 5       0.34    0.72   0.75   1.00
As above, the next step identifies the largest similarity in the matrix and merges the objects or clusters. The similarity matrix is updated.
               (1,2)   3      (4,5)
Cluster (1,2)  1.00
Object 3       0.36    1.00
Cluster (4,5)  0.17    0.51   1.00
Notice that in this case the similarities have changed, but as with the nearest neighbour method the next cluster to be formed is (3,4,5), but at a similarity of 0.51. Finally the clusters are merged at a similarity of 0.17, the furthest neighbour similarity between the two clusters.
A.6 The furthest neighbour dendrogram reflects the different similarities at which clusters are merged.
A.7 Find the geometric centre of each cluster and then draw a straight line between this centre and each object in the cluster. The sum of these distances is greater in solution (b) than in solution (a). However, changing the allocation of a single object in either solution results in an increase in the sum of overall distances for both solutions, demonstrating that we have found local optima in both cases.
(b) The effect of the criterion is to minimize the variation in all directions. It should be noted that this is equivalent to minimizing the within-group Euclidean distances. (c) Working with the pooled sums of squares will generally produce clusters of similar size. The individual sums of squares will tend towards the mean. But it is clear from Eqs (1 2)-(15) that variables with large variances will have large sums of squares. And from Eq. (17) such a variable will dominate the criterion and the analysis will be distorted.
208
A.9 Invariance to transformation avoids the problem of scale of variables referred to in A.8. Similarly, if two variables having high correlation are analysed, the trace criterion, by adding their sums of squares, effectively counts the same piece of information twice. Again this would distort the analysis.
A.10 The data have a particular structure within their multivariate space. In calibration it is important to represent this structure. Therefore the samples should be chosen to span this structure or space. The cluster analysis first divides the scatter of objects into smaller groups from which samples can be drawn.
A.11 Inspection of the cluster centroids indicates that the two solutions are different, and consequently the representative samples found by the two methods are also different. These differences are a result of the different theories underlying the criteria. The calibration set is best formed by choosing objects close to the centroids; indeed a method of partitioning which seeks clusters centred around particular objects, such as MASLOC [8], could be used here.
CHAPTER 7
SIMCA - Classification by Means of Disjoint Cross Validated Principal Components Models
Olav M. Kvalheim and Terje V. Karstang, Department of Chemistry, University of Bergen, N-5007 Bergen, Norway
1 Introduction
Paralleling the increased amount of data acquired on computerized instruments, classification and data reduction methods have become increasingly important to such diverse areas as chemistry, biochemistry, environmental chemistry, geochemistry, geology, biology, medicine and food science. The question often arises whether samples are similar in some respect; for instance, in structural, toxic or carcinogenic properties. Similarity between samples is assessed by means of variables expected to characterize the samples with respect to the properties investigated. The variables can be used to compare samples pairwise, or, more frequently, as a basis for empirical modelling of groups (classes) of samples. In the latter instance, goodness-of-fit criteria can be used to assess a sample's similarity with the modelled classes. If the samples can be allocated to distinct classes an unbiased classification can be obtained. Evaluation of similarity (correlation) and dissimilarity in information content between the different variables represents another crucial task in a classification problem. The periodic system of elements as established by Mendeleev and coworkers in the 1860s is a classical example of the power of a successful empirical classification. By using similarities in chemical and physical properties, Mendeleev was able to group known elements into classes, each class being similar in chemical properties. In itself, this was a great simplification of the available data. As a further step, new elements could be postulated and compared with already established classes, or assigned to new classes of elements with other chemical properties than those already discovered. Mendeleev's multivariate classification made a profound impact on the development of chemistry. The explanation of the chemical classification in terms of a theoretical model had to wait for several decades, not being possible until the development and growth of quantum physics.
Whether we look at chemical reaction patterns for similar elements or spectra of chemical constituents we recognize that chemical data generally are redundant, i.e. the chemical variables tend to tell the same story. For instance, the proton decoupled C-13 nuclear magnetic resonance spectrum of a hydrocarbon consists of as many peaks as there are carbons in different electronic surroundings. The intensities of the peaks are all proportional to the concentration of the constituent, so if one peak increases the others will automatically increase correspondingly, i.e. the peaks are said to correlate. This is the same as saying that they carry the same information. Since we must expect chemical data to possess a high degree of collinearity, the methods used for assessing similarity must be able to handle data with such characteristics. The classification method SIMCA (Soft Independent Modelling of Class Analogies) developed by Wold and coworkers [1-3] is such a method, each class of samples being described by its own principal component model. Thus, in principle, any degree of data collinearity can be accommodated by the models. Starting from a discussion of the important role played by correlation when assessing similarity, the reader will be introduced to the properties of principal component modelling of relevance to a classification problem. All basic concepts and steps in the SIMCA approach to supervised modelling will be thoroughly explored using chemical data obtained in an environmental study [4].

2 Distance, variance and covariance

Definition of distance is central in all classification procedures [5]. Euclidean distance in variable space is the most commonly used for measuring similarity between samples. This measure is illustrated in two-dimensional space in Fig. 1. The generalization to M-dimensional space is obvious and is expressed through the formula

d_kl² = (x_k - x_l)'(x_k - x_l) = |x_k - x_l|² = Σ_{j=1}^{M} (x_kj - x_lj)²     (1)
In this chapter we assume that all vectors are column vectors. The prime implies transposition into a row-vector. The symbol | | means the length of a vector. Thus, Eq. (1) is the scalar product of the vector expressing the difference between samples k and l in M-dimensional space.
This equation should be compared to Eq. (3) of Chapter 8. The algebraic concept of distance between two vectors is discussed in Chapter 3, Section 3.3. The transpose of a column-vector is a row-vector as described in Chapter 1, Section 6 and Chapter 3, Section 2.2. Multiplying a row-vector by a column-vector gives a scalar or single number. M is the overall number of variables measured. Lengths of vectors are discussed in Chapter 3, Section 3.2.
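A minimal numerical sketch of Eq. (1) (assuming NumPy; the two five-variable samples are taken from the mussel data of Table 1 later in this chapter):

# Squared Euclidean distance as the scalar product of a difference vector, Eq. (1).
import numpy as np

xk = np.array([55, 558, 80, 60, 344], dtype=float)     # sample 1-1
xl = np.array([240, 467, 225, 56, 410], dtype=float)   # sample 1-9

diff = xk - xl
d2 = diff @ diff                                       # (xk - xl)'(xk - xl)
print(d2, np.sum((xk - xl) ** 2))                      # the two forms agree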
Fig. 1 Two samples plotted in two-dimensional variable space. The difference between the sample vectors defines the Euclidean distance between the samples, d_kl.
The fundamental role of distance in multivariate classification follows from the assumption that proximity in multivariate space is indicative of similarity between samples. Thus, samples that are near in variable space are considered as having the same characteristics, whilst large separations are suggestive of different characteristics. An example of a data structure suggestive of two classes of samples is also shown in Fig. 2. The group centroids define the models, so that each sample is described by the group centroid x̄(g) plus a residual distance e_i, i.e.

x_i = x̄(g) + e_i(g)     (2)
The idea of centroids of clusters is also introduced in Chapter 6, Section 6 in the context of cluster analysis. It is also examined in detail in Chapter 8. The centre of gravity of a class, g, is the mean vector of the samples in the class, with elements x̄_j(g), the mean of variable j over the samples in class g.
In row-vector notation, Eq. (2) becomes x_i' = x̄'(g) + e_i'(g). This equation should be compared to Eq. (16) of Chapter 8. The difference between the measure of distance from class centroids in this section and that in Chapter 8, Section 5.2 is that, in the latter case, the ...
Fig. 2 Sample structure suggestive of two classes of samples. The sample vectors are expressed as the centre of gravity (the model) and a residual vector. The class membership of individual samples can then be decided by comparing their residual distance (Eq. (2)) to each centre of gravity with the average residual distance (Eq. (3)) for the samples used to model the class.
The average residual distance from the centre of gravity for each class can be used as a scale for assessing the similarity of new samples to the two models. This scale is obtained from the squared residual distances as

s₀²(g) = Σ_{i=1}^{N} Σ_{j=1}^{M} (x_ij - x̄_j(g))² / N     (3)
where x̄_j(g) is the value of the jth variable for the centroid of group g. In Eq. (3), N is the number of samples in the data set (it is interesting to note that the squared average residual distance differs from the variance of the samples
around the mean only by the factor N/(N-1). Thus, the variance of a class is proportional to the expectation value of the squared residual distance proving the equivalency in information content between the statistical concept variance and the geometric concept distance). A different type of data structure is illustrated in Fig. 3, namely an example of
samples collected at two different locations with one type of sample indicated by a black triangle and the other by a white triangle. It is not possible to classify the samples into either one of two classes using any of the single variables.
F i g 3 Samplesfrom two direrent locations characterized by IWO variables.
Principal components analysis is discussed elsewhere in this text. In Chapter 1, Section 13, it is introduced as a geometric projection onto a straight line. The principal components obtained for the data structure can be calculated and are illustrated in Fig. 4. We note that now the combination of the two variables as given by the second principal component is able to separate the samples into two clusters according to sample locations: the projection on the second principal component unambiguously classifies the samples into one of two groups.
Fig. 4 Principal components of the two variables illustrated above.
Taking into account our original information about sample locations and fitting, in a least-squares sense, a straight line through each group separately, a further simplification of the data, as illustrated in Fig. 5, may be obtained. The average distance from the sample of one group to the corresponding straight line defines a distance measure that can be used for classification. If we extend the concept of residual distance discussed above, Eq. (3) can be generalized to define a similarity scale covering sample structures like the ones illustrated in this section. With this extension, a clustering into two groups can be obtained by working out which line each sample is closest to. (The distance measure defined by Eq. (3) differs from the one actually used in the SIMCA algorithm only through a proportionality constant necessary to account for the loss of degrees of freedom due to the fitting of the principal components, as discussed in later sections). The discussion so far illustrates the two basic principles of the SIMCA classification procedure. (a) The disjoint principal component modelling of each group of samples: each group (the samples from each of the two different locations) is independently fitted to a principal component (the principal component of class 1 is calculated independently of that of class 2 in our example (Fig. 5)).
(b) The use of the residual variance to obtain a scale for assessing the similarity between a sample and a model. This division of data into structure (or model) and noise (or residuals) represents the heart of the SIMCA philosophy, as illustrated in Fig. 6.
Fig. 5 Each class is modelled separately by a one component model which provides residual distances that unambiguously classify the samples.
Fig. 6 Principles of SIMCA philosophy.
The discussion above shows why SIMCA succeeds when the approach of using purely geometric distances to classify or cluster samples often fails. The SIMCA approach can deal with more complicated data structure resulting from correlation between variables (similarity in information content). This (partial) correlation is expressed in terms of principal components, allowing a wider range of data structure to be described. Our example shows a data structure where both classes
are influenced by the same (general) factor, thus masking the information separating the classes. This situation is frequently encountered when analyzing real data, for instance when there is a dominating size factor [6], or some other general factor [7]. A further advantage of the SIMCA approach compared to purely distance-based cluster approaches is that the modelling into principal components efficiently separates structure from noise. Amongst other things, this provides a basis for the rejection of irrelevant variables and outlying samples. The importance of handling correlation properly in multivariate classification has been highlighted by Christie in his very didactic article in the proceedings from the Ulvik workshop [8]. We will illustrate the principles of SIMCA using the data set of Table 1. For readers who have access to the tutorial version of SIRIUS this data set is provided on disc and the answers to the questions in this chapter can be obtained in the form of output from this package.
Table 1. Data set used in this chapter.

Sample        Variables
            1     2     3    4    5
1-1        55   558    80   60  344
1-2        56   439   115   52  632
1-3        56   536    97   50  512
1-4        74   580   122   60  720
1-5        67   670    98   78  514
1-6        70   632   129   53  816
1-7        52   618    80   85  377
1-8        69   684   111   74  576
1-9       240   467   225   56  410
2-1       328   548   209   40  548
2-2       163   343   154   32  304
2-3       300   436   167   35  352
2-4       277   450   244   59  442
2-5       251   415   164   34  335
2-6       190   340   116   21  235
2-7       377   296   178   32  230
2-8       265   500   255   54  440
Table 1 shows data collected in an environmental study aimed at exploring the possibility of using mussels as pollution indicators [4]. The data set (MUSSEL) contains the raw chromatographic areas of five constituents extracted from tissue samples of blue mussels. The mussels were sampled at 17 different sites. The first nine samples in the data table are from sites presumed to be unpolluted, while the next eight were sampled in harbours and other sites expected to be polluted to varying degrees.
Q.1 (a) Plot the samples of Table 1 along variables 3 and 4. (b) How do the Euclidean distances comply with a division according to the two types of sampling sites? Assume that the polluted samples can be approximated by a spherical model and the unpolluted samples by a one-component principal component model. Visual inspection of the residual distances can then be used to obtain the grouping into two classes. (c) How does this interpretation comply with a division into two classes? (d) To what class would you assign sample 1-9?
3 The principal component model

The mathematical notation used in principal component analysis is discussed in Chapter 3, Section 5. We assume, here, that there are M measurements or variables and N samples or objects. Each principal component can be represented by a column-vector of scores and a row-vector of loadings. Each sample has a score associated with it for each principal component, and the matrix of scores for the N samples is referred to by T in this text. The corresponding loadings for the M measurements are referred to by P. It is always useful to consider the eigenvalues for the principal components: these are proportional to the sum of squares of the elements of the principal component score vector - the larger the eigenvalue, the more significant the component. More about the significance of the size of eigenvalues is discussed in Chapter 5. The eigenvalue of the ath component is referred to as g_a. The diagonal matrix in which the elements are the eigenvalues of X'X is denoted by G, as discussed in Chapter 4, Section 5.2. In some cases, the scores for each component are divided by the square root of the eigenvalue so that a new matrix U is defined, such that U G^(1/2) = T.
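The relation between the scores T, the loadings P, the eigenvalues g_a and the scaled scores U can be made concrete with a small numerical sketch. The snippet below (a sketch only; it uses NumPy's SVD on a random matrix rather than any data from this chapter) verifies that T = U G^(1/2) and that each eigenvalue equals the sum of squares of the corresponding score vector.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))            # N = 10 objects, M = 4 variables
Xc = X - X.mean(axis=0)                 # column-centred data matrix

# Singular value decomposition of the centred data: Xc = U * diag(s) * P'
U, s, Pt = np.linalg.svd(Xc, full_matrices=False)

G = np.diag(s**2)                       # eigenvalues of Xc'Xc on the diagonal
T = U @ np.diag(s)                      # scores, T = U G^(1/2)
P = Pt.T                                # loadings (columns p_a)

# The eigenvalue g_a equals the sum of squares of the a-th score vector.
print(np.allclose(np.diag(G), (T**2).sum(axis=0)))    # True
# The centred data are reconstructed exactly from scores and loadings.
print(np.allclose(Xc, T @ P.T))                        # True
```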
Each sample, after pretreatment, is represented by a 1 x M row object vector containing the M measurements on the sample. The row vectors (x_i) for N samples together constitute the N x M data matrix X. By use of a maximum-variance criterion the matrix X can be decomposed into principal components. Each principal component is characterized by a vector u_a, which is proportional to the score for each object, and a loading vector p_a, with N and M elements, respectively. The score vectors are at right angles to each other, as are the loading vectors. In addition, an eigenvalue is calculated, the eigenvalue being proportional to the variance accounted for by the principal component. In this text we define the number of principal components in a model by A. This must be less than or equal to min(N, M), i.e. the number of objects or variables, depending on which is smallest. By collecting the score vectors, loading vectors and eigenvalues into matrices, the principal component model can be written
X = 1 x̄'(g) + U G^(1/2) P' + E(g)    (4a)

Using subscripts to denote dimensions for clarity (as defined in Chapter 3, Section 2.2), this equation may be expressed as follows:

X(N×M) = 1(N×1) x̄'(g)(1×M) + U(N×A) G^(1/2)(A×A) P'(A×M) + E(g)(N×M)    (4b)
The reader should compare this equation to Eq. (28) of Chapter 4, Section 5.3 and Eq. (6) of Chapter 1. Note that in this chapter mean-centring and noise are included explicitly in the model.
The vector 1 is of dimension N x 1: all the elements are 1. The vector x̄(g) is of dimension M x 1 and defines the centroid for each variable, as discussed in Section 2. Thus, X - 1 x̄'(g) defines the column-centred data matrix. The matrix E(g) contains the residuals, i.e. the difference between the data and the model for class g.
In SIMCA classification, cross-validation is used to determine the dimension A (number of principal components), which gives the model optimal prediction properties.
There are many methods of cross-validation used to determine the principal components that adequately describe a model. In Section 6 of this chapter we discuss two approaches used in SIMCA. In Chapter 5, Section 3.2.8, further methods for cross-validation are described. In Section 2, we showed how the variation in the data matrix can be divided into a model (covariance structure) and a residual matrix (unique variance / noise). As we shall see, the residual matrix can be used to define a scale for assessing the similarity of objects to the model. Each sample can be expressed in terms of the principal component model and a residual vector.
Thus, every sample has its variation divided into a common sample structure and a unique sample structure. Comparison of the unique sample structure with the average unique sample structure for the class provides a basis for the assessment of the similarity of a sample to the sample structure defined by the class model. In the SIMCA literature [1-3,9] the decomposition into principal components is usually expressed as a product of only two matrices, an orthogonal score matrix T and the orthonormal loading matrix P
X = 1 x̄'(g) + T P' + E(g)    (5)
Comparison of Eqs (4a) and (5) shows that T = U G^(1/2), i.e. the score vectors u_a and t_a differ only by scale factors equal to the square roots of the eigenvalues of the principal components. If we extract all the principal components so that the residual matrix is a zero matrix, it follows directly from Eq. (4) that trace[(X - 1 x̄'(g))'(X - 1 x̄'(g))] = trace(G). Thus, the matrix G is a diagonal matrix with elements showing the variation accounted for by the principal components. Division by N-1 gives the variance accounted for by each principal component. The sum of these variances is the variance accounted for by the model

s²_model = Σ_{a=1}^{A} t_a' t_a / (N - 1)    (6)
The variance defined by Eq. (6) is a direct measure of the amount of covariation carried by a principal component model. Similarly, the larger the eigenvalue g_a, the more crucial is the contribution of that principal component to the model. The trace of a matrix, trace(X), is simply the sum of the diagonal elements of the matrix, as discussed in Chapter 3, Section 5.5. The interpretation of the matrix G/(N-1) as a diagonal matrix of variances shows that the principal components are uncorrelated new variables describing the variation pattern in a data matrix X. Indeed, the principal components can be defined as orthonormal bases in variable- and object-space [10] with the property of spanning the directions of maximum variation in the data. The scores show the relations between samples (variable-space), while the loadings describe relations between variables (object-space). The extraction of the principal components is often performed by an iterative procedure called NIPALS [9]. The method decomposes the data matrix directly without the need for calculating a covariance or correlation matrix. This direct approach was introduced into factor analysis by Saunders [11]. The NIPALS algorithm has been published in several chemometric papers, see, e.g., Wold et al. [9], and will not be repeated here. The method is superior to most methods for decomposition in three ways, as follows:
(a) The principal components are computed directly from the data matrix, i.e. no covariance or correlation matrix is needed.
(b) Missing data can be handled.
(c) The principal components are extracted successively in order of decreasing importance. Thus, only the first few principal components need to be calculated.
The residual variation is contained in the matrix E of dimension N x M.
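For readers who want to experiment, a minimal NIPALS sketch is given below. It assumes a complete data matrix (the missing-data handling that is one of the strengths of the published algorithm [9] is omitted) and simply deflates the centred matrix after each component; the convergence tolerance and starting vector are arbitrary choices, not part of any published implementation.

```python
import numpy as np

def nipals(X, n_components, tol=1e-10, max_iter=500):
    """Minimal NIPALS: extract principal components one at a time from the
    column-centred matrix, deflating the residual after each component.
    (No missing-data handling; a complete data matrix is assumed.)"""
    E = X - X.mean(axis=0)                          # start from the centred data
    scores, loadings = [], []
    for _ in range(n_components):
        t = E[:, np.argmax(E.var(axis=0))].copy()   # starting score vector
        for _ in range(max_iter):
            p = E.T @ t / (t @ t)                   # project residual onto the score
            p /= np.linalg.norm(p)                  # normalize the loading vector
            t_new = E @ p                           # new score vector
            if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                t = t_new
                break
            t = t_new
        scores.append(t)
        loadings.append(p)
        E = E - np.outer(t, p)                      # deflate: remove the component
    return np.column_stack(scores), np.column_stack(loadings), E

# Quick check against an SVD-based decomposition on random data.
rng = np.random.default_rng(2)
X = rng.normal(size=(12, 5))
T, P, E = nipals(X, 2)
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(np.abs(T[:, 0]), np.abs(U[:, 0] * s[0]), atol=1e-6))  # True
```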
4 Unsupervised principal component modelling

Principal component modelling plays two different roles in the classification of multivariate data.
(a) It is a tool for data reduction to obtain low-dimensional orthogonal representations of the multivariate variable- and object-space in which object and variable relationships can be explored [10]. (b) It is used in the SIMCA method to separate the data into a model and a residual matrix from which a scale can be obtained for later classification of samples. Usually, but not necessarily, SIMCA classification is also preceded by an unsupervised principal component modelling of the whole data set. As shown in previous chapters, similar samples tend to group together in the reduced variable space spanned by the major principal components.
Table 2. First and second principal components for the data in Table 1 after normalizing the rows to 100 and standardizing the columns to unit variance.

SCORES
Object   Component 1   Component 2
1-1         2.2112        1.3719
1-2         1.2586       -1.7441
1-3         1.6102       -0.5522
1-4         1.4149       -1.4036
1-5         2.2334        0.7098
1-6         1.3113       -1.8851
1-7         2.8346        2.0032
1-8         1.9706        0.1973
1-9        -1.2364        0.3937
2-1        -1.3377       -0.6520
2-2        -1.2521       -0.0093
2-3        -1.6629        0.1126
2-4        -1.5375        0.2118
2-5        -1.5057        0.1237
2-6        -1.4606        0.2460
2-7        -3.3256        0.6362
2-8        -1.5263        0.2403

LOADINGS
Variable   Comp. 1   Comp. 2
1          -0.5071    0.1915
2           0.4570    0.4194
3          -0.4917    0.1003
4           0.4068    0.5150
5           0.3560   -0.7157
Q.2 (a) Obtain the information in Table 2 as a printout from the SIRIUS package. (b) Show that the two loading vectors are orthonormal to each other, using simple pencil and paper calculations. (c) Show that the two score vectors are orthogonal to each other. (d) Plot the scores using SIRIUS or by hand. (e) Enclose the two classes observed in the score plot. Is there any indication of strange samples (outliers) in the data? (f) Calculate the variance for the two components by hand. (g) What percentage of the original variance is accounted for by the model?

5 Supervised principal component modelling using cross-validation
By utilizing the a priori information about the samples in Table 1, new possibilities emerge. (a) Cross-validation [12] can be used to obtain models with maximum predictive ability for each class of samples separately. Furthermore, separate modelling enables interpretation of the classes in terms of their covariance patterns independently of the other classes. (b) Samples can be classified as similar or dissimilar to the separate principal component models. Thus, outliers in a class can be detected and new samples with unknown membership can be classified as belonging to one, some or none of the classes at some chosen probability level. (c) The ability of each variable to discriminate between classes can be calculated and used to perform data reduction by the removal of irrelevant variables. (d) With several classes the separation between classes can be assessed, and measures of overlap between classes can be obtained. In the next sections we shall illustrate these features of supervised principal component modelling on the data of unpolluted (class 1) and polluted (class 2) samples in Table 1.
6 Cross-validated principal component models

The goal of the supervised approach to principal component modelling is to obtain the best models with respect to the classification of future samples with unknown class-belongings. Thus, the predictive ability of the principal component models needs to be optimized, a task that can be managed by use of the statistical technique known as cross-validation [12], which is discussed in greater detail in Chapter 5.
A much used cross-validation technique, which is well suited for small sample sets, is the leave-out-one-sample-at-a-time procedure given below.

Calculate PRESS_0 = Σ_i (x_i - x̄(g))'(x_i - x̄(g)) / (N-1)
For a = 1, 2, ..., B (where B is less than or equal to rank(X) - 1): set PRESS_a = 0
For i = 1, 2, ..., N
  (i) Calculate the principal component model for the matrix X_-i, i.e. X_-i = 1_-i x̄'(g)_-i + T_-i P_-i'. The notation indicates that sample i (corresponding to row x_i) is left out of the data matrix X. Accordingly, the score matrix T_-i will be of dimension (N-1) x B.
  (ii) Define e_i,0' = x_i' - x̄'(g)_-i
  (iii) For a = 1, 2, ..., B
        Calculate the score t_i,a = e_i,a-1' p_-i,a
        Calculate e_i,a' = e_i,a-1' - t_i,a p_-i,a'
        Calculate PRESS_a = PRESS_a + e_i,a' e_i,a
For a = 1, 2, ..., B
  Calculate PRESS_a / (N-a-1)
Recalculate the principal component model with all samples included and with the number of principal components A which provides the smallest prediction error, PRESS_A.

Algorithm for calculating predictive error.
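The sketch below implements the leave-out-one-sample-at-a-time scheme above with NumPy (using an SVD instead of NIPALS for the reduced models, which is an implementation choice, not part of the algorithm); the degrees-of-freedom corrections N-1 and N-a-1 follow the text, and the test data are synthetic.

```python
import numpy as np

def press_loo(X, max_components):
    """Leave-one-sample-out PRESS curve: for each left-out sample, fit a PCA
    model to the remaining rows and accumulate the squared prediction error
    of the left-out row for 0, 1, 2, ... components."""
    N, M = X.shape
    press = np.zeros(max_components + 1)             # index a = 0 .. max
    for i in range(N):
        Xi = np.delete(X, i, axis=0)                 # data without sample i
        centre = Xi.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xi - centre, full_matrices=False)
        e = X[i] - centre                            # residual with 0 components
        press[0] += e @ e
        for a in range(1, max_components + 1):
            p = Vt[a - 1]                            # a-th loading of the reduced model
            e = e - (e @ p) * p                      # remove the a-th component
            press[a] += e @ e
    # Degrees-of-freedom correction: N-1 for the mean, N-a-1 for a components.
    dof = np.array([N - 1] + [N - a - 1 for a in range(1, max_components + 1)])
    return press / dof

rng = np.random.default_rng(3)
structure = rng.normal(size=(15, 1)) @ rng.normal(size=(1, 5))   # rank-one structure
X = structure + 0.05 * rng.normal(size=(15, 5))                  # plus noise
corrected = press_loo(X, 3)
print("corrected PRESS:", np.round(corrected, 4))
print("suggested A:", int(np.argmin(corrected)))    # the minimum indicates A
```

For a near rank-one matrix such as this, the minimum of the corrected PRESS is usually found at one component.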
In cross-validation, models are calculated leaving one or more measurements out, and the error between the experimental and predicted data using A principal components in the model is calculated. Once all the measurements have been left out once, the total predicted sum of squared errors is calculated.
In this procedure, one row at a time is deleted from the data matrix X and the principal components calculated. The N principal component models thus obtained (the same as the number of samples) are used to predict the left-out sample with zero, one, two, etc. components. The deviation between the actual values and the predicted values is used to estimate an overall prediction error. The model
providing the minimum prediction error is finally calculated with all samples included. By this procedure, all samples are utilized both for calculating and for validating the model. The algorithm for calculating the predictive error is given above. As observed, the PRESS values have to be corrected for the loss in degrees of freedom. For zero components only the mean is calculated, so the correction factor is N-1. For A components, the correction factor is N-A-1. The leave-out-one-sample-at-a-time procedure is excellent for small data sets. However, when the number of samples increases, the calculation becomes uneconomical in terms of computing time. A modification that can be used is to divide the data matrix randomly into smaller groups. Principal component models are calculated by leaving out one block of samples at a time. The kept-out samples can then be fitted to the respective principal component models and the number of components giving the overall minimum predictive error determined as described above [13]. The cross-validation technique used in SIRIUS [14] and most implementations of the SIMCA method was developed by Wold [12] and can be termed the leave-out-one-group-of-elements-at-a-time procedure. Thus, in Wold's approach a fraction of the elements is left out for each sample, the left-out elements differing from sample to sample. The algorithm for finding the optimal number of principal components in the leave-out-one-group-of-elements-at-a-time approach is illustrated below.
Fig. 7 Deletion pattern for the unpolluted class of samples using the leave-out-one-group-of-elements-at-a-time cross-validation procedure of Wold [12].
Perform column-centring of the data matrix X. Calculate the residual matrix E_0 for the column-centred matrix, i.e. E_0 = X - 1 x̄'(g). Partition the elements of the column-centred data matrix (randomly) into N_g smaller groups.
For a = 1, 2, ..., A+1
  Calculate RSD_a-1 = Σ_i e_i,a-1' e_i,a-1 / (M-a+1)(N-a). The vector e_i,b is the distance (as defined in Eq. (2)) of object i to the centroid of the entire data set when the model has been defined using b principal components.
  For g = 1, 2, ..., N_g
    (i) Calculate the score and loading vectors t_-g,a and p_-g,a for the matrix E_-g, i.e. the residual matrix leaving out the elements of group g.
    (ii) For the left-out elements, calculate the difference between the predicted residuals, i.e. t_-g,ia p_-g,aj, and the actual residuals e_ij,a-1 to obtain PRESS_a,g:

        PRESS_a,g = Σ_(i,j)∈g (t_-g,ia p_-g,aj - e_ij,a-1)²

        where the symbol ∈ stands for "is a member of".
  Calculate PRESS_a = Σ_{g=1}^{N_g} PRESS_a,g / (M-a)(N-a-1).
  Compare PRESS_a with RSD_a-1. If PRESS_a is less than or equal to RSD_a-1, then recalculate the principal component with all elements included, and update the residual matrix before the next iteration: E_a = E_a-1 - t_a p_a'. If PRESS_a is greater than RSD_a-1, stop the iteration and use the model with A = a-1 principal components.

Algorithm for calculating predictive error by leave-out-one-group-of-elements-at-a-time [12].
Q.3 Show that the deletion patterns illustrated in Fig. 7 can be obtained by dividing the eight unpolluted samples of Table 1 into three groups.
The key to Wold's method of cross-validation lies in the ability of the NIPALS method to handle samples with missing data. An underlying assumption is that the left-out variables are described through their correlation with the included variables. Thus, the stronger the correlation between variables, the better the results. If, on the other hand, the correlation between the left-out variables and the retained variables for a sample is weak on one or several principal components, the estimates of the scores may be biased, producing bad predictions and an underfitted model. There are several remedies against this pitfall: (a) increase the number of groups used in the cross-validation, (b) try different partition patterns for the deletion of elements, and (c) block the variables according to their correlation pattern and choose a division which avoids all variables in one block being deleted for any of the samples. In the algorithm above all variables and samples contribute to all the principal components. This may pose problems when outliers are present in the data. Thus, outliers contaminate the models and thereby induce biased predictions. Fortunately, strong outliers are easily detected through visual inspection of score plots or through the use of statistical tests like Dixon's Q-test (see Section 12). Weak outliers are detected by redoing the cross-validation with the suspected samples left out and estimating their goodness-of-fit to the new model.
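One simple way to generate a cyclic deletion pattern of the kind shown in Fig. 7 (and varied under remedy (b) above) is sketched below; the row-by-row cycling is an assumption about how the groups are formed, consistent with A.3 but not necessarily identical to the SIRIUS implementation.

```python
import numpy as np

def deletion_groups(n_samples, n_variables, n_groups):
    """Assign every element of an (n_samples x n_variables) matrix to one of
    n_groups deletion groups by cycling through the elements row by row,
    so that the left-out elements differ from sample to sample."""
    idx = np.arange(n_samples * n_variables) % n_groups
    return idx.reshape(n_samples, n_variables)

# Deletion pattern for the eight unpolluted samples and five variables, three groups:
pattern = deletion_groups(8, 5, 3)
print(pattern)           # group label (0, 1 or 2) of each element
print(pattern == 0)      # mask of the elements left out in the first round
```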
Q.4 (a) Exclude outliers identified from the scores plot obtained in Q.2 and model the groups of polluted and unpolluted samples in Table 1 as two separate classes. Use the normalized data and standardize the variables to unit variance for each class. (b) How many components are required in order to obtain the best possible predictive models?
7 The SIMCA model
After having decided the dimension of the class model, we can determine the distance s_i from a sample to the model

s_i² = Σ_{j=1}^{M} e_ij² / (M - A)    (7)
This equation should be compared to the Mahalanobis distance, introduced in Eq. (15) of Chapter 8 in the context of hard modelling. In the latter case, the Mahalanobis distance is scaled by the pooled variance-covariance matrix, as described. In this chapter the data are already reduced by PCA before calculating the distance from the class centroid. Both measures of error take into account correlations between variables. The division by M-A provides a distance measure which is independent of the number of variables and corrected for the loss of degrees of freedom due to the fitting of A principal components. In the SIMCA terminology, s_i is termed the residual standard deviation (RSD) of sample i. The RSDs for the samples can be collected in a distance vector s of dimension N x 1. The squared distance vector s defines the mean residual standard deviation of the class

s_0² = Σ_{i=1}^{N} Σ_{j=1}^{M} e_ij² / (M - A)(N - A - 1)    (8)
The division by N-A-1 gives a scale which is independent of the number of samples and corrected for the loss in degrees of freedom due to column-centring and fitting of A principal components.
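A sketch of these residual standard deviation calculations is given below; the coded formulas are the reconstructed Eqs (7) and (8), i.e. division of the residual sums of squares by M-A and by (M-A)(N-A-1), and the PCA step uses an SVD rather than NIPALS. The data are random and purely illustrative.

```python
import numpy as np

def class_rsd(X, n_components):
    """Fit an A-component PC model to one class and return the residual
    standard deviation (RSD) of each sample and the pooled class RSD,
    using the M-A and (M-A)(N-A-1) degrees-of-freedom corrections."""
    N, M = X.shape
    A = n_components
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:A].T                           # loadings of the class model
    E = Xc - Xc @ P @ P.T                  # residual matrix
    s_i = np.sqrt((E**2).sum(axis=1) / (M - A))             # per-sample RSD, Eq. (7)
    s_0 = np.sqrt((E**2).sum() / ((M - A) * (N - A - 1)))   # class RSD, Eq. (8)
    return s_i, s_0

rng = np.random.default_rng(4)
X = rng.normal(size=(9, 5))
s_i, s_0 = class_rsd(X, 1)
print(np.round(s_i, 3), round(float(s_0), 3))
```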
Fig. 8 Illustration of the effect of different probability levels for drawing the borderline between a class and the rest of variable space. By decreasing the probability level from p = 0.05 to p = 0.01 a larger part of variable space is embraced by the model. This decreases the risk of false negatives but increases the risk of false positives.
Comparison of the RSD for sample i with the mean RSD for the class gives a direct measure of its similarity to the class model. In order to provide a quantitative basis for this test, Wold introduced the F-test statistic for the comparison of s_i² and s_0². The degrees of freedom used to obtain the critical F-value are (M-A) and (M-A)(N-A-1), respectively, for s_i² and s_0² (see, however, the last paragraph of this section). The F-test is a statistical test for comparing two variances. It is described in most basic texts on statistics. It provides an estimate of the significance of one variance relative to another.
The F-test can be used to calculate an upper limit for the RSD of samples that belong to the class

s_max² = F_crit s_0²    (9)
F_crit is usually determined at the probability level p = 0.05 or p = 0.01. In the former case one out of every 20 samples actually belonging to the class falls, on average, outside the class, while the number of false negatives will be one out of a hundred for p = 0.01. Thus, a narrower acceptance level results with increased p. On the other hand, p = 0.01 increases the risk of false positives compared to p = 0.05. This is illustrated for a one-component class model in Fig. 8. By using the F-test to calculate upper limits for acceptable residual distances, we have actually closed the SIMCA model around the principal components. However, the model is still open along each principal component. Wold used the extreme scores t_min,a and t_max,a and their spread s_t,a along each component to define lower and upper limits for the scores
where s_t,a² = t_a' t_a / N = g_a / N
With the introduction of upper and lower limits for acceptable scores, the SIMCA model is closed in all directions. This point is illustrated for a one-component model in Fig. 9.
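The F-test limit can be computed directly, as sketched below with SciPy; the form s_max² = F_crit s_0² and the degrees of freedom (M-A) and (M-A)(N-A-1) are taken from the text, while the numbers in the example (8 samples, i.e. the unpolluted class with sample 1-9 excluded, and a class RSD of 0.61) are intended to reproduce approximately the rejection criterion of 1.02 quoted in Table 3.

```python
from scipy.stats import f

def s_max(s0, n_samples, n_variables, n_components, p=0.05):
    """Upper limit for the residual standard deviation of class members:
    s_max^2 = F_crit * s_0^2, with F_crit at probability level p and degrees
    of freedom (M-A) and (M-A)(N-A-1), as quoted in the text."""
    df1 = n_variables - n_components
    df2 = (n_variables - n_components) * (n_samples - n_components - 1)
    f_crit = f.ppf(1.0 - p, df1, df2)
    return s0 * f_crit**0.5

# Unpolluted mussel class: 8 samples, 5 variables, one-component model, s0 = 0.61
print(round(s_max(0.61, 8, 5, 1), 2))   # approximately the 1.02 of Table 3
```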
Fig. 9 The closed SIMCA model. By using the spread of the sample scores along each principal component, upper and lower limits can be defined to obtain a closed model in variable space.
As has been pointed out by van der Voet and Doornbos [15,16], Wold and coworkers provide no theoretical foundation for their choice of upper and lower limits for the scores. However, our experience is that this choice gives reasonable results, implying that it represents a useful alternative to the more sophisticated probabilistic approach advocated by van der Voet and Doornbos [15,16]. The use of the F-test to determine upper limits for the residual distance is not without problems. As pointed out by Droge et al. [17,18], the F-test statistics seem to give too narrow confidence bands when the number of variables is large compared to the number of samples. Actually, it appears that the problem arises even in cases where the number of samples is rather large, as long as there is a strong correlation between the variables [13]. One solution to this problem is to use the leave-out-one-block-of-samples procedure in the validation of the model [13]. This procedure gives a broader confidence band around the model. Another solution, advocated by Wold, is to calculate F_crit using a smaller number of degrees of freedom [19]. Whatever remedy is chosen, there seems to be room for more work in the area of correlation and F-tests. Table 3 shows the samples of Table 1 fitted to the classes of polluted and unpolluted samples.
Q.5 (a) Obtain the information of Table 3 using SIRIUS together with overall statistics for each of the two classes. (b) What is your conclusion with respect to sample 9? (c) What is your conclusion with respect to the other samples?
Table 3. Samples fitted to the class models of unpolluted (class 1) and polluted (class 2) mussels.

Sample   Name    RSD (Class 1)   RSD (Class 2)
1        1-1        0.8096          5.1417
2        1-2        0.5974          5.1416
3        1-3        0.2600          4.5327
4        1-4        0.6598          4.8242
5        1-5        0.2574          4.8832
6        1-6        0.7546          5.1252
7        1-7        0.9112          5.7577
8        1-8        0.0764          4.6633
9        1-9       22.6459          0.9586
10       2-1       26.1163          1.0782
11       2-2       21.2077          0.8215
12       2-3       32.2928          0.9520
13       2-4       25.4766          1.2351
14       2-5       28.4785          0.4647
15       2-6       28.5749          1.3003
16       2-7       50.6217          3.5525
17       2-8       23.3773          0.9078

Mean standard deviations of classes (RSD)           0.6110   1.0000
Rejection criterion for outliers (p = 0.05) RSD >   1.02     1.59
8 Classification of new samples to a class model

The classification of a new sample to a class consists of three steps: first, fit the sample to the class; second, estimate the residual vector of the sample; and third, compare the residual of the sample to the maximum allowed residual of the class.

8.1 Fitting the sample to the class
In this step we want to determine the scores t_i,a for the sample i using class g. This is obtained by calculating the expansion coefficients of the sample vector x_i in the basis of the loading vectors {p_a}, i.e.

x_i' = x̄'(g) + Σ_{a=1}^{A} t_ia p_a' + e_i'(g)
The algorithm for this fitting is as follows:
Thus for each principal component the score of the fitted sample is estimated as the scalar product of the residual vector and the loading vector. The division by √(p_a'p_a) is necessary in order to account for possible missing data in the fitted sample. Even with this correction the algorithm needs further modification in order to handle properly the non-orthogonalities between loading vectors with missing elements. However, if the number of missing values is small compared to the total number of variables, the algorithm above provides good estimates of the scores just by leaving out the missing variables in the calculation [3]. The value t_i,a is analogous to the discriminant scores discussed in Chapter 8, Section 5.2.1. Eq. (12) of that chapter has similarities to Eq. (12) of this chapter, although these two concepts are not identical. As a summation, the expression p_a'p_a can be written

p_a' p_a = Σ_{j=1}^{M} (p_ja)²

where p_ja is the loading for the jth variable on the ath principal component.
The simple procedure above can, with a small modification, be used for orthogonal decompositions other than principal components, for instance partial-least-squares decomposition [20]. The minor modification needed is to calculate the sample scores by means of the orthogonal coordinate vectors (w_a) [21], i.e.
8.2 Calculating the squared residual distance of the sample to the class

This is done using the following equation:

s_k² = Σ_{j=1}^{M} e_kj² / (M - A)    (14)
The squared residual distance is corrected for the loss of degrees of freedom accompanying the fitting of the sample to the principal component model. It is assumed that the class is modelled using A principal components. The Mahalanobis distance (introduced in Eq. (15) of Chapter 8) used in hard modelling is not the same as Eq. (14) above.
8.3 Comparing the residual vector of the new sample to the maximally allowed residual s_max for the class

If s_k is less than or equal to s_max (as defined by Eq. (9)) then the sample is classified as belonging to the class. If, on the other hand, s_k exceeds s_max then the sample is classified as outside the class. In cases where samples used to obtain a class model are checked for their class belonging, their squared residual has to be increased by multiplication with the correction factor (N-1)/(N-A-1). This correction is done automatically in most computer implementations of SIMCA [14]. The use of distances to classify samples into classes is also discussed in Chapter 8, Sections 5.2.2.2 and 5.2.2.3, in the context of hard modelling. In Chapter 8, it is normal simply to classify a sample into one class according to which class distance is smallest. However, it is not too difficult to modify the FLD (Fisher Linear Discriminant analysis) (or LDA (Linear Discriminant Analysis)) criteria to allow for outliers (samples with distances far from either class centre), or even samples that belong to more than one class.
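The three steps of Sections 8.1-8.3 can be strung together as in the sketch below (complete data are assumed, so the division by √(p_a'p_a) for missing values is omitted; the class model and the test samples are random, invented data, and the s_max form is the reconstructed Eq. (9)).

```python
import numpy as np
from scipy.stats import f

def classify_sample(x_new, X_class, n_components, p=0.05):
    """Fit a new sample to one class model and test whether it belongs to the
    class (steps 8.1-8.3).  Returns the sample RSD, s_max and the verdict."""
    N, M = X_class.shape
    A = n_components
    centre = X_class.mean(axis=0)
    Xc = X_class - centre
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:A].T                                    # class loadings

    # 8.1  fit: scores are scalar products of the residual with the loadings
    e = x_new - centre
    t = e @ P
    # 8.2  squared residual distance, corrected for the A fitted components
    e = e - P @ t
    s_k = np.sqrt(e @ e / (M - A))
    # 8.3  compare with the maximum allowed residual for the class
    E = Xc - Xc @ P @ P.T
    s_0 = np.sqrt((E**2).sum() / ((M - A) * (N - A - 1)))
    s_max = s_0 * np.sqrt(f.ppf(1 - p, M - A, (M - A) * (N - A - 1)))
    return s_k, s_max, s_k <= s_max

rng = np.random.default_rng(5)
X_class = rng.normal(size=(10, 5))
member = classify_sample(X_class[0] + 0.01 * rng.normal(size=5), X_class, 2)
outsider = classify_sample(X_class[0] + 5.0, X_class, 2)
print(member[2], outsider[2])     # a close sample is accepted, a distant one rejected
```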
Q.6 An extra sample obtained in the environmental study [4] gave the following chromatographic areas for the five variables: 36, 452, 75, 39, 403. (a) Classify this sample with respect to the classes of unpolluted and polluted mussels. What is your conclusion? (b) Unsupervised principal component modelling of the normalized and standardized data, including the extra sample mentioned above, gave the score plot shown below. Explain this result in view of the result obtained in the SIMCA classification of the added sample. (RSD of 1.62 and 4.62 for the unpolluted and polluted classes, respectively.)
Graph of scores (component 2 versus component 1) for the mussel data including the extra sample, produced using the SIRIUS package.
9 Communality and modelling power

Even with the most painstaking variable selection when planning an investigation, some of the variables used to model a class inevitably turn out to contain little or no relevant information on the properties investigated. Thus, they will have variation patterns distinct from the major part of the variables. A direct measure of the contribution of a variable to a principal component model of a class g is the part of
variance it shares with the principal components. We can express the division of the variance of variable j into shared and unique parts as follows:

s_j² = s_j,model² + s_j,unique²    (15)
The communality c_j of variable j is defined as the ratio between the modelled and the original variance

c_j = Σ_{a=1}^{A} g_a p_ja² / (x_j' x_j)    (16)
The scalar product in Eq. (16) is the mean-centred variable vector j multiplied by itself. From Eqs (15) and (16) it follows that the communality is a number between 0 and 1. If the communality is 1 or close to 1, variable j shares its variation pattern with the principal component model. If, on the other hand, the communality is close to 0, the variable has a variation pattern distinct from the principal components and, accordingly, its covariance with the other variables is small. Such a variable is a candidate for deletion from the class model if its ability to discriminate samples from other classes is also comparatively low. Wold and coworkers [1-3] proposed modelling power as an equivalent measure of the relevance of a variable for modelling a class. Modelling power is defined as

MPOW_j = 1 - s_j,residual / s_j    (17)

where s_j,residual² = Σ_i e_ij² / (N - A - 1) and s_j² = Σ_i (x_ij - x̄_j(g))² / (N - 1).
Similarly to communality, MPOW has 1 as its upper limit. The interpretation is also the same in this case, since MPOW = 1 means that the residual variable vector contains only zeros, i.e. the variable is completely accounted for by the model. However, because of the correction factor resulting from the loss of degrees of freedom, MPOW may become negative if the variation pattern of variable j is distinct from the model. This is easily confirmed by substituting e_j with x_j(g) -
x̄_j(g) in Eq. (17). In some implementations of the SIMCA method negative MPOWs are automatically set equal to zero [14].
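The sketch below computes communality and modelling power for each variable from the residual matrix, following the reconstructed Eqs (16) and (17); setting negative MPOW values to zero mirrors the behaviour of some SIMCA implementations [14]. The synthetic data (three perfectly correlated variables plus two noise variables) are invented for illustration.

```python
import numpy as np

def communality_and_mpow(X, n_components):
    """Per-variable communality (modelled / original variance) and modelling
    power MPOW = 1 - s_j,residual / s_j, with the degrees-of-freedom
    corrections N-A-1 and N-1 described in the text."""
    N, M = X.shape
    A = n_components
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:A].T
    E = Xc - Xc @ P @ P.T                              # residual matrix
    total = (Xc**2).sum(axis=0)                        # x_j' x_j (mean-centred)
    modelled = total - (E**2).sum(axis=0)              # variance shared with the PCs
    communality = modelled / total
    mpow = 1 - np.sqrt(((E**2).sum(axis=0) / (N - A - 1))
                       / (total / (N - 1)))
    return communality, np.maximum(mpow, 0.0)          # negative MPOW set to zero [14]

rng = np.random.default_rng(6)
t = rng.normal(size=(12, 1))
X = np.hstack([t @ rng.normal(size=(1, 3)),            # three correlated variables
               rng.normal(size=(12, 2))])              # two "irrelevant" variables
comm, mpow = communality_and_mpow(X, 1)
print(np.round(comm, 2))   # the correlated variables show clearly higher communality
print(np.round(mpow, 2))
```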
10 Discriminatory ability of variables

When more than one class has been modelled, the variables can be ranked according to their ability to discriminate between the different classes. There are many ways to do such an evaluation. The discrimination measure used in most SIMCA applications is due to Wold and coworkers and is termed discrimination power [1-3].
The notation in Eq. (18) is adopted from Albano et al. [3]. The indices r and q refer to two different classes of samples, such as, for example, the two classes of mussels discussed above. The definition of s is given in Eq. (14). Eq. (18) has many similarities with Eq. (8) of Chapter 8, which defines the variance weights of a variable when two classes are included in the model. Eq. (18) expresses a ratio between the squared residuals obtained when fitting the samples of the two classes to each other's classes and the squared residuals obtained when fitting the samples to their own classes. Thus,
where N_r is the number of samples in class r. In Eqs (19a) and (19b), the variable residual vectors are the ones obtained when fitting class r to class q and to itself, respectively. Thus, Eq. (18) expresses a ratio between inter- and intra-class variances. From Eqs (18) and (19) it follows that discrimination power is a positive number equal to or greater than 1. A value close to 1 indicates that the variable has no ability to distinguish the two classes. According to Wold et al. [3] a value above 3 for d_j(r,q) indicates good separation. This depends, however, on the orientation and shape of the classes, and a robust measure of discrimination is obtained by using zero principal components in Eq. (18) for both classes. The display method suggested by Vogt [22] represents another solution to locating the most discriminating variables. Each principal component loading vector from
one class is compared pairwise with all the principal component loading vectors obtained for a second class. Discriminating variables are located as deviations from a straight line. As long as the number of principal components is small, the method works nicely. Visual judgement of the discrimination ability of variables becomes cumbersome when many plots need to be examined or when variables correlate with many components. The method may lead to erroneous conclusions about the discriminatory ability of variables which are little accounted for by the models. The latter situation may arise when some of the variables have comparatively small spread around their class mean. A third alternative for locating discriminating variables is to model two classes jointly by means of partial least squares (PLS) discriminant analysis [20] or by means of canonical variate analysis (CVA) [23]. Both approaches utilize a priori information about sample locations when decomposing a data structure into latent variables. Furthermore, both methods can be used to find discriminating variables for more than two classes modelled jointly. Canonical variate analysis provides the combination of the variables producing maximal separation into preconceived groups, but in its classical form the method cannot be used when the number of variables exceeds the number of samples. Canonical variates analysis is introduced in Chapter 8, Section 5.2.3 and Chapter
11 Separation between classes

An important question in all classification problems is: how good is the classification actually? Is the separation between the classes good enough to justify a division into distinct classes? Wold and coworkers [1-3] have provided three measures for quantifying the separation between classes. (a) A class distance, which is closely related to the discrimination power of the individual variables. (b) A worst ratio between the residual distances of a sample to another class and to its own class. (c) The Coomans plot, to display the residual distances of samples to different classes. Class distance is defined as [3]
Eq. (20a) can be rewritten in terms of the residual distances of each sample to its own class and the other class. Thus, class distance measures the mean inter- to intra-class distance for the samples.
Analogous to discrimination power, a class distance above 3 suggests well separated classes.
Fig. 10 Illustration of sample distance and worst ratio (sample k). The figure shows that class distance is a measure of the mean inter- to intra-class residual distance.
Class distance is sensitive to differences in shape (number and relative magnitude of principal components) and relative orientation of the classes. If the classes are dissimilar in shape and/or orientation, the ratio between the inter- and intra-class residual distance for individual samples can be calculated to check for overlap between classes. If this number is close to or drops below one for individual samples, the classes are overlapping, a situation that may be cured by deletion of
variables with low discriminatory ability. In the example pictured in Fig. 10, sample k has the worst inter- to intra-class distance ratio.
Table 4. Evaluation of classes. Modelling and discrimination power of the five variables, and class distance.

Variable   MPOW1     MPOW2    Discrim. Power
1          0.03588   0.0000   37.66559
2          0.82219   0.0000    3.76219
3          0.46100   0.0000    7.68286
4          0.53586   0.0000    2.57849
5          0.83741   0.0000    4.19999

Distance between classes: 17.4
The Coomans plot [3] represents a good alternative to the calculation of the worst ratio. In this visual display, the residual distances of samples to two classes are plotted together with s_max for the two classes. A multiclass plot for visual display of the separation between classes has been devised by Kvalheim and Telnaes [24].
Q.7 Table 4 contains information that can be obtained from the classification module in SIRIUS. (a) Explain why the modelling power is zero for all the variables in class 2. (b) Which variable contributes least to the intra-class structure of class 1? (c) Which variable discriminates most between the two classes? (d) Which variables would you delete if polishing the classes? (e) Is there any overlap between the classes?
12 Detection of outliers

The implicit assumption of similarity between the samples used to model a class is not always fulfilled. For instance, we have observed in the environmental data (Table 1) that one sample assumed to be unpolluted turned out to be polluted when compared to the two classes. Such a sample, which deviates in its variation pattern
from the other samples in a class, is called an outlier. Detection of outliers represents an important task in the development of models. The detection of outliers in class modelling can be performed both at the unsupervised and at the supervised stage. The simplest way to detect outliers is through the visual inspection of score plots of the separate classes. Strong outliers represent no serious problem as long as they are few, since such samples tend to plot as isolated points in score plots of the major principal components. The identification of weak outliers represents a more subtle problem for the data analyst, since such outliers are not revealed in score plots until the extraction of the minor principal components. Statistical tests represent a numeric approach to outlier testing. Among the numerous tests devised for outlier detection, Dixon's Q-test [25] is among the simplest. On the outlier dimension, the absolute value of the difference between the score of a suspect and its nearest neighbour is calculated. The ratio between this difference and the total interval spanned by the scores on the outlier dimension defines the Q-value, i.e.

Q = |t_suspect - t_nearest| / (t_max - t_min)
If Q exceeds a critical value, the sample is rejected. The Q-test assumes normally distributed scores along each principal component, a reasonable assumption within a single class of samples. The number of degrees of freedom equals the number of samples in this test [25]. With the assumption that the samples are normally distributed in a class, the residual distances can be used for detecting weak outliers. This is done by comparing the residuals of the suspected outliers with s_max obtained for the class. Strong outliers usually pass this test, since their large influence (leverage) on the model provides them with very small residuals. If there is a single weak outlier, the leave-out-one-sample-at-a-time approach for model validation can be used for detection and confirmation. With multiple weak outliers the situation becomes more complicated, the best approach probably being to model a block of samples with similar variable profiles and to use F-tests and cross-validation iteratively to decide on the incorporation or exclusion of the other samples.
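A Dixon Q-value can be computed directly from the scores, as sketched below; the example uses the component-1 scores of the unpolluted samples from Table 2, for which the suspect sample 1-9 gives a Q-value of about 0.61, well above the critical value for nine observations at the 95% level (roughly 0.44 according to standard Q-tables [25]).

```python
import numpy as np

def dixon_q(scores, suspect_index):
    """Q-value of a suspected outlier on one principal component:
    |score of suspect - score of its nearest neighbour| divided by the
    total range spanned by the scores on that component."""
    s = np.asarray(scores, dtype=float)
    suspect = s[suspect_index]
    others = np.delete(s, suspect_index)
    gap = np.abs(others - suspect).min()
    return gap / (s.max() - s.min())

# Component-1 scores of the unpolluted samples from Table 2 (sample 1-9 last):
t1 = [2.2112, 1.2586, 1.6102, 1.4149, 2.2334, 1.3113, 2.8346, 1.9706, -1.2364]
print(round(dixon_q(t1, 8), 3))   # the large Q-value flags sample 1-9 as suspect
```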
13 Data reduction by means of measures of relevance

The process of detecting and deleting outliers represents one side of the process termed polishing of classes. The other side of this process is the removal of irrelevant variables. Variables are irrelevant for the problem at hand if they have a distinct variation pattern different from the class models and at the same time do not discriminate between classes of samples. Variables with a distinct variation pattern are identified through their low covariation with the other variables. Thus, communality or modelling power approaching zero identifies such variables. It is difficult to draw a sharp borderline between high and low communality (or modelling power). This is not only because of subjective factors, but because what is high or low depends on the average variance accounted for by the model. As a rule of thumb, variables with communality less than half of the average communality are classified as carrying little information about the intra-class structure. Thus, if a model explains 80% of the total original variance, the borderline between high and low communality is 0.4. Depending on the number of variables and principal components, this corresponds to a modelling power of approximately 0.5-0.6. Variable reduction is discussed in several other chapters of this text. These include Chapter 8, Sections 5.1.2.3 and 5.2.2.4, and Chapter 5, Section 4. There is a large battery of techniques for this, many as yet not commonly exploited by the chemometrician.
In cases where we have only one class, variables are removed from the model based upon the criterion of low communality / modelling power alone. In multiple-class problems, discriminatory ability is used as a supplementary deletion criterion. Thus, only variables with both low communality / modelling power and low discriminatory ability are deleted. Variables with high communality / modelling power stabilize the covariance pattern in a class and make the models robust to missing data. Thus, such variables are retained irrespective of their discriminatory ability. Variables with high discriminatory ability and low communality / modelling power are retained because of their ability to separate the classes. The combination of high discrimination and low communality / modelling power suggests a large inter-class variance compared to the variance within each class for that variable. The iterative process of SIMCA modelling outlined above is summarized in Fig. 11. A few steps in this flow diagram need to be clarified.
Fig. 11 Flow chart for SIMCA modelling (unsupervised PCA → select class → variable weighting → cross-validated PC model → evaluation of classes → fit new samples).
(a) Pretreatment of data embraces normalization of samples and various types of data processing such as maximum-entropy methods for reduction of profiles.
(b) Outlier checking is done by looking at score plots, residual distances, and cross-validation ratios for samples and variables, and also from unsupervised PCA. If no components seem statistically significant, the major principal components are calculated anyway for the purpose of inspecting score and loading plots for outliers and noisy variables. (c) Evaluation of classes embraces calculation of modelling and discrimination power, as well as class distances.

14 Conclusion
SIMCA is a pattern recognition classification method with great applicability in chemistry, geochemistry, geology, medicine and related areas. Fifteen years after the method was launched in an international journal [1], it plays the role of a reference method against which new chemometric classification methods have to be evaluated [26]. The strong position of SIMCA is the result of several factors, among which the crucial ones appear to be as follows: (a) The simple geometric basis: all concepts can be interpreted in terms of vectors and distances in multivariate space. (b) The ability to model large data sets with more variables than samples. (c) The concept of (almost) distribution-free closed models obtained independently of each other, as opposed to probabilistic methods. (d) The possibility of adjusting the method to fit one's personal taste.
15 Acknowledgements

Several discussions with Prof. Svante Wold during the last ten years are highly appreciated. The authors are grateful for suggestions made by participants at several chemometric courses. In particular, we would like to thank Dr. Markku Hamalainen and Dr. Goran Urdén for serious testing and numerous valuable suggestions for improvements in the SIRIUS program. Dr. Richard Brereton is thanked for editorial work.

16 References
1. S. Wold, Pattern Recognition by means of Disjoint Principal Components Models, Pattern Recognition, 8 (1976), 127-139.
2. C. Albano, W. Dunn III, U. Edlund, E. Johansson, B. Nordén, M. Sjöström and S. Wold, Four Levels of Pattern Recognition, Analytica Chimica Acta, 103 (1978), 429-443.
3. C. Albano, G. Blomqvist, D. Coomans, W.J. Dunn III, U. Edlund, B. Eliasson, S. Hellberg, E. Johansson, B. Nordén, D. Johnels, M. Sjöström, B. Söderström, H. Wold and S. Wold, Pattern Recognition by means of Disjoint Principal Components Models (SIMCA). Philosophy and Methods, Proceedings of the Symposium on Applied Statistics, Copenhagen, (1981).
4. O.M. Kvalheim, K. Øygard and O. Grahl-Nielsen, SIMCA Multivariate Data Analysis of Blue Mussel Components in Environmental Pollution Studies, Analytica Chimica Acta, 150 (1983), 145-152.
5. D.L. Massart and L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, Wiley, New York, (1983).
6. M.O. Eide, O.M. Kvalheim and N. Telnæs, Routine Analysis of Crude Oil Fractions by Principal Component Modelling of Gas Chromatographic Profiles, Analytica Chimica Acta, 191 (1986), 433-437.
7. O.M. Kvalheim, Oil-source Correlation by the Combined Use of Principal Component Modelling, Analysis of Variance and a Coefficient of Congruence, Chemometrics and Intelligent Laboratory Systems, 2 (1987), 127-136.
8. O.H.J. Christie, Some Fundamental Criteria for Multivariate Correlation Methodologies, Chemometrics and Intelligent Laboratory Systems, 2 (1987), 53-59.
9. S. Wold, K. Esbensen and P. Geladi, Principal Component Analysis, Chemometrics and Intelligent Laboratory Systems, 2 (1987), 37-52.
10. O.M. Kvalheim, Interpretation of Direct Latent-Variable Projection Methods and their Aims and Use in the Analysis of Multicomponent Spectroscopic and Chromatographic Data, Chemometrics and Intelligent Laboratory Systems, 4 (1988), 11-25.
11. D.R. Saunders, Practical Methods in the Direct Factor Analysis of Psychological Score Matrices, Ph.D. Thesis, University of Illinois, Urbana, (1950).
12. S. Wold, Cross-validatory Estimation of the Number of Components in Factor and Principal Component Models, Technometrics, 20 (1978), 397-405.
13. E. Sletten, O.M. Kvalheim, S. Kruse, M. Farstad and O. Søreide, Detection of Malignant Tumours by Multivariate Analysis of Proton Magnetic Resonance Spectra of Serum, European Journal of Cancer, 26 (1990), 615-618.
14. O.M. Kvalheim and T.V. Karstang, A General-Purpose Program for Multivariate Data Analysis, Chemometrics and Intelligent Laboratory Systems, 2 (1987), 235-237.
15. H. van der Voet and D.A. Doornbos, The Improvement of SIMCA Classification by using Kernel Density Estimation. Part 1. A New Probabilistic Approach Classification Technique and how to evaluate such a Technique, Analytica Chimica Acta, 161 (1984), 115-123.
16. H. van der Voet and D.A. Doornbos, The Improvement of SIMCA Classification by using Kernel Density Estimation. Part 2. Practical Evaluation of SIMCA, ALLOC and CLASSY on three Data Sets, Analytica Chimica Acta, 161 (1984), 125-134.
17. J.B.M. Droge and H.A. van't Klooster, An Evaluation of SIMCA. Part 1 - The Reliability of the SIMCA Pattern Recognition Method for a Varying Number of Objects and Features, Journal of Chemometrics, 1 (1987), 221-230.
18. J.B.M. Droge and H.A. van't Klooster, An Evaluation of SIMCA. Part 2 - Classification of Pyrolysis Mass Spectra of Pseudomonas and Serratia Bacteria by Pattern Recognition using the SIMCA Classifier, Journal of Chemometrics, 1 (1987), 231-241.
19. S. Wold and M. Sjöström, Comments on a Recent Evaluation of the SIMCA Method, Journal of Chemometrics, 1 (1987), 243-245.
20. S. Wold, C. Albano, W.J. Dunn III, U. Edlund, K. Esbensen, P. Geladi, S. Hellberg, E. Johansson, W. Lindberg and M. Sjöström, in Chemometrics - Mathematics and Statistics in Chemistry, (ed. B.R. Kowalski), Reidel, Dordrecht, (1984).
21. O.M. Kvalheim, Latent-structure Decomposition (Projections) of Multivariate Data, Chemometrics and Intelligent Laboratory Systems, 2 (1987), 283-290.
22. N.B. Vogt, Principal Component Variable Discriminant Plot: a Novel Approach for Interpretation and Analysis of Multi-class Data, Journal of Chemometrics, 2 (1988), 81-84.
23. H.J.B. Birks, Multivariate Analysis in Geology and Geochemistry: an Introduction, Chemometrics and Intelligent Laboratory Systems, 2 (1987), 15-28.
24. O.M. Kvalheim and N. Telnæs, Visualizing Information in Multivariate Data: Applications to Petroleum Geochemistry. Part 2. Interpretation and Correlation of North Sea Oils by using three different Biomarker Fractions, Analytica Chimica Acta, 191 (1986), 97-110.
25. J.C. Miller and J.N. Miller, Statistics for Analytical Chemistry, Wiley, New York, (1984), Ch. 3, pp. 59-62.
26. I.E. Frank and J.H. Friedman, Classification: Oldtimers and Newcomers, Journal of Chemometrics, 3 (1989), 463-475.
ANSWERS

A.1
(a)
Graph of variable 4 against variable 3 produced using the SIRIUS package.
(b) Visual inspection reveals that the Euclidean distances will not provide a division into two classes according to sampling sites. For instance, the Euclidean distance between samples 1-7 and 1-6 is larger than the distance between 1-6 and 2-2. (c) By defining the centre of gravity as the class model for the unpolluted samples, all samples from the unpolluted sites except 1-9 are closer to this than to the model defined by a straight line through the polluted samples. Similarly, all samples from the polluted sites are closer to the model of polluted samples than to the class of unpolluted samples. (d) Sample 1-9 appears to belong to the class of polluted samples.
A.2
(a) The printout can be obtained from SIRIUS. (b) In order to show that the loadings are orthonormal, calculate the scalar product:

p1'p2 = (-0.5071)(0.1915) + (0.4570)(0.4194) + (-0.4917)(0.1003) + (0.4068)(0.5150) + (0.3560)(-0.7157) = 0.
(c) The orthogonality between the score vectors is confirmed in the same way, i.e. by showing that the scalar product t1't2 is zero.
(d) Graph of scores (component 2 versus component 1) produced using the SIRIUS package.
(e) From the graph of the scores, sample 1-9 appears to belong to the class of polluted samples rather than the class of unpolluted samples. Furthermore, sample 2-7 seems not to belong to either class. (f) The variance of each component is obtained by use of Eq. (6). Thus for component 1 the variance is t1't1/16 = 3.59. For component 2 the variance is 1.03. (g) Because of the standardization of each variable to unit variance, the total variance of the data is 5.0. According to Eq. (6) the variance accounted for by the model is the sum of the variances accounted for by each principal component. Thus, the variance accounted for by the model is 4.62 and the percentage of the original variance accounted for is 92.4.

A.3 Arrange the eight samples into a vector with 40 elements. In order to obtain the deletion pattern of Fig. 7, delete every third element of this vector, starting with element 1 for group 1, element 2 for group 2 and element 3 for group 3. Then rearrange the vectors with holes into matrices.

A.4 (a) Question 4 requires the use of SIRIUS or other programs for decomposing multivariate data into principal components. Samples 1-9 and 2-7 should be excluded before modelling the classes (see A.2). (b) For the unpolluted class a one-component model is obtained. For the polluted class a zero-component model results, implying a spherical intra-class structure.
A.5 (a) Keep out sample 1-9 from class 1 and sample 2-7 from class 2. (b) Comparison of the residual distances for sample 1-9 to the two models shows that this sample belongs to the class of polluted samples. (c) The other samples classify according to their sampling characteristics (polluted or unpolluted), with the sole exception of 2-7, which falls outside both classes. Thus sample 2-7 is an outlier.

A.6 The extra sample has a residual distance of 1.62 and 4.77 to the class of unpolluted and polluted samples, respectively (see Table 3). The maximum allowed residual distances for the unpolluted and polluted class are 1.02 and 1.59, respectively, at a probability level of p = 0.05. Thus, the extra sample is outside both classes at this probability level.
The scores plot is suggestive of an interpretation of the extra sample as belonging to the class of polluted samples. This is contradicted by the fitting of the sample to the class of unpolluted samples. This result shows the pitfalls involved in the interpretation of score plots. The residual variation unaccounted for by the model should always be inspected before drawing conclusions about the class belonging of samples or about variable correlations when exploring the variation patterns in multivariate data.

A.7 (a) Modelling power is related to the variance accounted for by a principal component model. For class 2 (polluted samples) a zero-component model was obtained (see Table 4). Thus, no variance is explained and the modelling power is zero for every variable. (b) Variable 1 has the lowest modelling power in class 1 and thus this variable contributes least to the covariance structure of class 1. (c) Variable 1 is the most discriminating variable. Thus, this variable is an example of a sharp discriminating variable with little covariance with the other variables. (d) Variable 4 is the only candidate for deletion. Low modelling power in both classes together with low discrimination power makes it a candidate for omission. (e) Inter- to intra-class ratios above 5 for all samples conclusively prove non-overlapping classes (run SIRIUS).
CHAPTER 8
Hard Modelling in Supervised Pattern Recognition
D. Coomans1 and D.L. Massart2
1 Department of Mathematics and Statistics, James Cook University of North Queensland, Townsville, Q4811, Australia
2 Farmaceutisch Instituut, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium
1 Introduction

The data dealt with in chemistry often aim at a classification into given groups or classes [1]. In this way, we enter the field of supervised pattern recognition. Supervised pattern recognition techniques are used to develop objective rules for the classification of objects (e.g. samples for which chemical measurements are available) into a priori known classes. In this chapter, the different steps to arrive at such objective rules for classification, together with the necessary data analytical tools, are illustrated using worked examples. To follow the examples, only a scientific pocket calculator is needed. For didactic reasons we use a small data set, which of course is not representative of the complexity of problems that can be tackled by computer-oriented classification methods. Some of the algorithms treated below have been implemented in PARVUS [2], a package of programs for data exploration, classification and correlation. This package can be used for the larger data sets one encounters in practice.
2 The data set

In our example, the problem is to identify milk samples on the basis of their fatty acid spectrum. Data were taken from a study on "the differentiation of pure milk
from different species and mixtures" [3,4]. The data set we will be using here is only a small selection from the original study and consists of 17 milk samples for which three quantitative measurements (M = 3 variables; j = 1, ..., M) are considered. The three variables are concentrations (in %) of three fatty acids of the fat fraction of milk obtained by gas-liquid chromatography. As shown in Table 1, the data matrix is partitioned into two known classes of milk (R = 2 training classes C_p; p = 1, 2). There are 14 milk samples of known identification (N = 14 training objects; i = 1, ..., N) divided into two classes: seven samples identified as pure cow milk (N1 = 7 for class C1) and seven samples identified as adulterated samples of 80% cow with 20% goat milk (N2 = 7 for class C2). There are also three test samples which are assumed to belong to one of the two classes (N0 = 3 test objects, class C0).
Table 1. Percent of three acids FA1, FA2, FA3 in the fat fraction of milk samples.

Sample i   Class p   FA1    FA2    FA3
1          1         3.5   12.8    8.9
2          1         2.5    8.3   11.0
3          1         2.7    6.0   11.5
4          1         4.1    6.8   12.5
5          1         3.8    8.9    9.8
6          1         1.5    9.6    8.8
7          1         6.5   11.2    9.0
8          2         4.3    7.8   11.0
9          2         4.8    5.4   10.4
10         2         5.4    5.7   12.0
11         2         4.1    5.0   11.3
12         2         4.5    5.8   11.7
13         2         5.0    6.5    9.4
14         2         4.9    4.5   12.6
15         0         1.0    2.5   14.5
16         0         2.9    7.8   12.2
17         0         3.8    5.5   13.2
These test samples must be classified on the basis of a classification rule which is derived from the training set measurements.
The concept of test and training sets is also discussed in Chapter 7. The SIMCA approach is one of soft modelling. It differs from the approach discussed in this chapter in that the objects do not have to be unambiguously a member of any given class. In Chapter 7, we discussed the classification of object 1-9 in some detail, despite the fact that it was originally part of class 1 when the model was set up. In this chapter all the training set objects are assumed to be members of a given class, and we can only comment on the class membership of new (test) objects.
Our example is artificial since we consider the possibility of adulteration by goat milk only, and only at one level. However, extensions towards more realistic situations are straightforward. The data table is a matrix X that can be partitioned into three submatrices, i.e. a matrix X1 for training class C1, a matrix X2 for training class C2 and a matrix X0 for the test group. The data pattern of each milk sample i can also be represented by a row-vector x_i belonging to X. Thus milk sample i = 3 is characterized by

x_3 = [2.7, 6.0, 11.5]

or its transpose x_3^T, the column-vector

[2.7, 6.0, 11.5]^T
Since, in a statistical context, the term "sample" means a group of data vectors, we will avoid using this terminology for a single vector; thus, from here on, we will call a milk sample i an object i. A training class can be considered as a sample of objects.
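For readers who prefer to follow the worked examples on a computer rather than with a pocket calculator, the data of Table 1 can be entered as a matrix and partitioned as described above. The sketch below uses Python with NumPy; the variable names X, X1, X2 and X0 simply mirror the notation of this chapter.

```python
import numpy as np

# Data of Table 1: columns are FA1, FA2, FA3 (percent of the fat fraction).
X = np.array([
    [3.5, 12.8,  8.9], [2.5,  8.3, 11.0], [2.7,  6.0, 11.5], [4.1,  6.8, 12.5],
    [3.8,  8.9,  9.8], [1.5,  9.6,  8.8], [6.5, 11.2,  9.0],              # class 1
    [4.3,  7.8, 11.0], [4.8,  5.4, 10.4], [5.4,  5.7, 12.0], [4.1,  5.0, 11.3],
    [4.5,  5.8, 11.7], [5.0,  6.5,  9.4], [4.9,  4.5, 12.6],              # class 2
    [1.0,  2.5, 14.5], [2.9,  7.8, 12.2], [3.8,  5.5, 13.2],              # test objects
])

X1 = X[:7]     # training class C1 (pure cow milk), objects 1-7
X2 = X[7:14]   # training class C2 (adulterated milk), objects 8-14
X0 = X[14:]    # test objects 15-17, class unknown

# Row-vector of object i = 3 (Python indexing starts at 0):
x3 = X[2]      # array([ 2.7,  6. , 11.5])
```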
3 Geometrical representation
The data pattern x_i of an object i can be positioned in a three-dimensional pattern space; each axis of the space corresponds to a variable, namely the percent of one of the fatty acids. If the data patterns of both training classes contain differential information about the classes, then the object points in the three-dimensional pattern space belonging to the same class tend to lie closer to each other than to the objects belonging to the other class. In order to get a first impression of the separability of the classes, it is advisable to start the analysis of the data with an exploration of the data set X in two-dimensional pattern spaces, i.e. the construction of scatter diagrams FA1 versus FA2 and FA1 versus FA3 for the objects of the data matrix X.
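Such scatter diagrams can be generated with a few lines of plotting code. The sketch below (using matplotlib and the arrays X1, X2 and X0 from the earlier sketch) is merely one way to reproduce a diagram in the spirit of Fig. 1.

```python
import matplotlib.pyplot as plt

# FA1 against FA2, in the spirit of Fig. 1 (axes may be swapped if preferred).
plt.scatter(X1[:, 0], X1[:, 1], marker='*', label='class C1 (X1)')
plt.scatter(X2[:, 0], X2[:, 1], marker='x', label='class C2 (X2)')
plt.scatter(X0[:, 0], X0[:, 1], marker='s', label='test objects (X0)')
plt.xlabel('%FA1')
plt.ylabel('%FA2')
plt.legend()
plt.show()
```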
Fig. 1 Scatter plot of the milk data from Table 1; abscissa is FA1 and ordinate FA2. The symbols *, x and I indicate objects from matrices X1, X2 and X0, respectively.
Fig. 2 Scatter plot of the milk data from Table 1; abscissa is FA1 and ordinate FA3. For symbols see Fig. 1.
Fig. 3 Linear classification boundaries (A and B) between classes C1 and C2 on the scatter plot of Fig. 1.
The scatter diagram of Fig. 1 immediately reveals a cluster of points belonging to class C2 and a looser aggregate of objects belonging to C1. The visual impression is
that both classes are more or less separable. A similar picture for FA1 versus FA3 is shown in Fig. 2. Geometrically, the formulation of a classification rule corresponds to an explicit or implicit construction of a boundary between the training classes, so that the classes become separated as well as possible. An explicit construction of a boundary is, for instance, the drawing of a straight line in the diagram of Fig. 1 (see Fig. 3). It is indeed easy to draw a line that discriminates the two classes C1 and C2 relatively well. However, line A is not able to perform a complete differentiation of the classes; the class C1 object i = 4 lies on the wrong side of the line. A comparably well-separating line is B, but here too an object is misclassified, namely the class C1 object i = 8. Whether we choose A or B has important implications for the classification of future objects of unknown class label, i.e. the test objects. With line B, the three test objects i = 15, 16 and 17 are all allocated to class C1, but according to line A, the test objects 15 and 17 belong to class C2. The construction of a classification rule is an optimization process and will depend on the optimization criteria that are considered. The more these criteria are able to reflect the geometrical characteristics of the data, the better the ultimate classification rule will be.
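To make the idea of an explicit linear boundary concrete, the sketch below assigns an object to a class according to the side of a straight line a·FA1 + b·FA2 + c = 0 on which it falls. The coefficients a, b and c used here are purely illustrative; they are not the lines A or B of Fig. 3, and the sign convention is arbitrary.

```python
def side_of_line(fa1, fa2, a, b, c):
    """Classify a point by the sign of a*FA1 + b*FA2 + c.
    Which side corresponds to which class is a convention chosen by the analyst."""
    return 1 if a * fa1 + b * fa2 + c > 0 else 2

# Illustrative coefficients only -- not the boundaries A or B of Fig. 3.
a, b, c = -1.0, 0.5, 0.0
print(side_of_line(4.1, 6.8, a, b, c))   # object i = 4 relative to this arbitrary line
```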
Q.1 What is the dimensionality of an explicit boundary between two classes in a five-dimensional pattern space?

4 Classification rule
We will focus further on pattern recognition methods that are primarily intended to describe mathematically the classification boundary between training classes.
No attempt is made to localize the class fully in the pattern space; only its boundary with another class is considered. In the chemometrics literature, this approach is called "hard modelling". In this context, the classification rule for a test object, based on the classification boundary between two classes, can be expressed as follows:

Classify test object i into class C1 if

f(x_i | C1) > f(x_i | C2)     (1)

otherwise classify test object i into class C2. Here, f(x_i | C1) and f(x_i | C2) are the criterion functions that define the classification rule. The classification boundary itself is determined by

f(x_i | C1) = f(x_i | C2)
The mathematical notation may appear, at first, rather complex, but it is really quite simple. For the ith object a function f is defined for each of classes 1 and 2. The value of this function relates to how closely the object fits the class model. A similar concept was introduced in Chapter 7, Section 12, in which the residual distance of an object to the centroid of a class is also used as a type of criterion function. These functions define for x_i a kind of likelihood of belonging to the particular class, or a kind of similarity with the objects of that class. Thus for each class it is determined how likely it is to find x_i in that class. The test object i is classified in the class with the highest value of f(x_i | C).
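In code, the rule of Eq. (1) amounts to evaluating one criterion function per class and taking the class with the largest value. The sketch below is generic; the two lambda functions are hypothetical placeholders standing in for whatever criterion a particular method supplies (e.g. the kNN proximities of Section 5.1), and ties are left unresolved.

```python
def classify(x, criterion_functions):
    """criterion_functions: dict mapping a class label to its criterion f(x | C).
    The object x is assigned to the class with the highest criterion value;
    ties are not handled here."""
    return max(criterion_functions, key=lambda c: criterion_functions[c](x))

# Hypothetical placeholder criteria, for illustration only:
rule = {1: lambda x: 0.9, 2: lambda x: 0.4}
print(classify(None, rule))   # -> 1, since f(x | C1) > f(x | C2)
```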
Q.2 Suppose we want to classify object i = 17 from our data set in Table 1 and we find f(x_17 | C1) = 13 and f(x_17 | C2) = 8. We want to do the same for i = 16 and we obtain f(x_16 | C1) = f(x_16 | C2). What are the classification results?
In hard modelling we consider two types of criterion functions, namely: (a) deterministic ones and (b) probabilistic ones.
5 Deterministic pattern recognition
Deterministic means that the classification never remains doubtful; each test object will always be allocated to one and only one training class. Even though test object i = 17 is very close to both C1 objects and C2 objects, the test object will be assigned strictly to one of the two classes. Two algorithms will be considered in more detail: (a) the k nearest neighbour classification method (kNN) and
(b) Fisher’s linear discriminant method (FLD), also called linear discriminant analysis (LDA).
Fig. 4 1NN classification of test object 15 using the two variables FA1 and FA2.
5.1 Classification using the kNN method

5.1.1 Basic principles
The simplest case is the 1NN classification. A test object is classified in the class of the nearest training object, using one or other distance metric. Suppose we could use only the variables FA1 and FA2 from the milk data matrix; then the 1NN procedure can be displayed on the scatter plot of Fig. 4. The nearest neighbour of test object 15 is training object 3, belonging to class C1. Therefore test object 15 is classified in class C1, i.e. the class of pure cow milk.
Q.3 What is the classification outcome for test objects 16 and 17 using the fatty acid fractions FA1 and FA2?

With a data set of three or more variables, one cannot rely on a diagram, but the classification can be performed mathematically as follows.
First, calculate the Euclidean distance d_ik(p) between test object i and each training object k of class C_p:

d_ik(p) = sqrt( Σ_j (x_ij − x_kj)² ),   j = 1, ..., M
Euclidean distances are discussed in various chapters of this text, including Chapter 3, Section 3.3 (in the context of algebraic definitions), Chapter 6, Section 5 (relating to a measure of dissimilarity in cluster analysis) and Chapter 7, Section 2 (in the context of SIMCA soft modelling).
Q.4 Calculate the Euclidean distance between test object 17 and training object 11 (C2).

Table 2 collects the distances d calculated for test object 15 with respect to all training objects. Although it is not really necessary, we transform the distances further into proximities (similarities) in order to be able to use the results of Table 2 immediately with the classification rule of Eq. (1). The inverse of the distance is suitable for that purpose:

prox_ik(p) = 1 / d_ik(p)
The proximity values for object 15 are also displayed in Table 2. Proximity is a measure of similarity, whereas distance is a measure of dissimilarity. Several other measures of (dis)similarity could be used. The choice of distance metric is discussed in greater detail in Chapter 6, Section 2. The rank of an observation should not be confused with the rank of a matrix; see Chapter 6, Section 4.4 for another example and an explanation. Second, the general classification rule given before (Eq. (1)) is valid for the 1NN method provided that f(x_i | C_p) is defined to be the maximum value of prox_ik(p) over class p. In other words, for each class C_p one training object is selected as the reference object for class C_p.
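The distances and proximities of Table 2 are easily reproduced. This sketch assumes the arrays X1, X2 and X0 from the earlier sketch and reports, for test object 15, the largest proximity within each training class, i.e. the 1NN criterion functions f(x_15 | C1) and f(x_15 | C2).

```python
import numpy as np

def euclidean(x, y):
    # d_ik = sqrt( sum_j (x_ij - x_kj)^2 )
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

x15 = X0[0]                                        # test object 15 = [1.0, 2.5, 14.5]
prox1 = [1.0 / euclidean(x15, xk) for xk in X1]    # proximities to class C1 objects
prox2 = [1.0 / euclidean(x15, xk) for xk in X2]    # proximities to class C2 objects

# f(x15 | Cp) = maximum proximity within class p (the 1NN criterion function)
f1, f2 = max(prox1), max(prox2)
print(round(f1, 3), round(f2, 3))   # about 0.204 and 0.209 (cf. Table 2)
```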
Q.5 Use the foregoing classification rule to classify test object 15.
Table 2. Euclidean distances and proximities of test object 15 with respect to each training object of the data set in Table 1, using the three variables FA1, FA2 and FA3. The smallest rank number corresponds to the largest proximity value.
C_p    k    d_15,k    prox_15,k    rank
 1     1    11.99      0.083        14
 1     2     6.94      0.144         8
 1     3     4.91      0.204         2
 1     4     5.67      0.176         5
 1     5     8.42      0.119        11
 1     6     9.12      0.110        12
 1     7    11.67      0.086        13
 2     8     7.16      0.140         9
 2     9     6.30      0.159         7
 2    10     5.99      0.167         6
 2    11     5.11      0.196         3
 2    12     5.57      0.180         4
 2    13     7.62      0.131        10
 2    14     4.78      0.209         1
Alternatively, the same rule can be applied by ranking the proximities in decreasing magnitude and classifying the test object in the same class as the training object with rank 1. The corresponding rank numbers for test object 15 are given in Table 2. Training object 14 of class C2 has rank 1. Consequently, test object 15 belongs to C2. The 1NN method can be extended to more (k) neighbours. However, k should be kept small compared to the number of objects in each class. The test object is usually assigned to a class according to the so-called majority vote procedure, i.e. to the class which is most represented in the set of the k nearest training objects [5].

Q.6 To what class is test object 15 allocated when applying the 3 nearest neighbour rule according to the rank procedure?
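The kNN rule with majority voting can be sketched compactly. The function below is our own illustration (it is not the PARVUS implementation) and reuses the arrays X1, X2 and X0 from the earlier sketch; distances are computed on the raw (unscaled) data.

```python
import numpy as np
from collections import Counter

def knn_classify(x, training_sets, k=1):
    """training_sets: dict mapping a class label -> 2-D array of training objects.
    Returns the majority class among the k nearest training objects."""
    dist_label = [(np.sqrt(np.sum((x - xk) ** 2)), label)
                  for label, Xp in training_sets.items() for xk in Xp]
    dist_label.sort(key=lambda pair: pair[0])        # rank by increasing distance
    votes = Counter(label for _, label in dist_label[:k])
    return votes.most_common(1)[0][0]

training = {1: X1, 2: X2}
print(knn_classify(X0[0], training, k=1))   # test object 15, 1NN on raw data (cf. Table 4)
print(knn_classify(X0[0], training, k=3))   # test object 15, 3NN on raw data (cf. Table 4)
```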
5.1.2 Additional aspects of the kNN method

5.1.2.1 Scaling of the data. In the above examples, the classification was performed using the raw data displayed in the data matrix of Table 1. This means that the variable with the largest amount of scatter among the data, in other words the largest variation, will contribute to the largest extent to the Euclidean distance. It does not mean, however, that a large variation coincides with a good ability to separate classes. If the numerical value of the variation were important, then we could improve the classification by changing the scale of the variables, for instance by changing percentage into promille values. Of course, this makes no sense. The scale of raw data is quite often arbitrary and artificial but will affect the Euclidean distance. One could say that raw data put artificial weights on each variable with respect to the Euclidean distance. To eliminate this effect, variables are scaled in such a way that they all contribute to the discrimination according to their real performance, without the introduction of an artificial weighting. It is advisable to carry out an autoscaling procedure before the distances are calculated. Scaling or preprocessing of data prior to analysis is discussed in several chapters of this text. A detailed discussion of scaling and its consequences is given in Chapter 2 in the context of multivariate data display. Standardization (or autoscaling) is introduced in Chapter 2, Section 3, but is referred to in several other contexts. In practice, a variable is autoscaled (or standardized) over the whole training set as follows:

z_ij = (x_ij − x̄_j) / s_j

with x̄_j and s_j respectively the mean value and standard deviation of variable j in the training set, i.e.

x̄_j = (1/N) Σ_{i=1}^{N} x_ij    and    s_j = sqrt[ (1/(N−1)) Σ_{i=1}^{N} (x_ij − x̄_j)² ]
Q.7 What is the autoscaled value of FA1 for training object 1?

The test objects are autoscaled in the same way, but using the statistical parameters x̄_j and s_j of the training set (so in our example only the first 14 samples are used to calculate the mean and standard deviation of each variable, although it is possible subsequently to calculate autoscaled values for objects 15 to 17). The autoscaled data for the whole data set are given in Table 3.
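Autoscaling with the training-set statistics (only the 14 training objects, and an n − 1 denominator in the standard deviation, which reproduces the values quoted with Table 3) might be coded as follows, reusing X, X1 and X2 from the earlier sketch.

```python
import numpy as np

Xtrain = np.vstack([X1, X2])          # the 14 training objects only
mean = Xtrain.mean(axis=0)            # xbar_j -> approx. [4.114, 7.450, 10.707]
std = Xtrain.std(axis=0, ddof=1)      # s_j    -> approx. [1.281, 2.455, 1.332]

Z = (X - mean) / std                  # autoscale the whole data set (cf. Table 3)
print(np.round(Z[0], 3))              # training object 1 -> [-0.479  2.179 -1.357]
```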
Q.8 Does the choice of k and the use of autoscaling of the data affect the classification outcomes for the test objects of the milk data set?
Table 3. Autoscaled matrix of the data of Table 1.
x̄1 = 4.114, s1 = 1.281;  x̄2 = 7.450, s2 = 2.455;  x̄3 = 10.707, s3 = 1.332

 p    i     z_i,1     z_i,2     z_i,3
 1    1    -0.479     2.179    -1.357
 1    2    -1.261     0.346     0.220
 1    3    -1.104    -0.591     0.595
 1    4    -0.011    -0.265     1.346
 1    5    -0.245     0.591    -0.681
 1    6    -2.042     0.876    -1.432
 1    7     1.863     1.527    -1.282
 2    8     0.145     0.143     0.220
 2    9     0.535    -0.835    -0.231
 2   10     1.004    -0.713     0.971
 2   11    -0.011    -0.998     0.445
 2   12     0.301    -0.672     0.746
 2   13     0.692    -0.387    -0.982
 2   14     0.614    -1.201     1.422
 0   15    -2.432    -2.016     2.848
 0   16    -0.948     0.143     1.121
 0   17    -0.245    -0.794     1.872
Table 4. 1NN and 3NN classification results for the test objects of the milk example using the three variables FA1, FA2 and FA3.

Test object: 15
kNN                 prox     i    p    Rank    Classif.
unscaled data
  k = 1             0.209   14    2    1       2
  k = 3             0.209   14    2    1       2
                    0.204    3    1    2
                    0.196   11    2    3
autoscaled data
  k = 1             0.336    3    1    1       1
  k = 3             0.336    3    1    1       1
                    0.299    4    1    2
                    0.289   14    2    3

Test object: 16
kNN                 prox     i    p    Rank    Classif.
unscaled data
  k = 1             0.735    2    1    1       1
  k = 3             0.735    2    1    1       1
                    0.629    4    1    2
                    0.543    8    2    3
autoscaled data
  k = 1             1.087    3    1    1       1
  k = 3             1.087    3    1    1       1
                    1.020    2    1    2
                    0.952    4    1    3

Test object: 17
kNN                 prox     i    p    Rank    Classif.
unscaled data
  k = 1             0.663    4    1    1       1
  k = 3             0.663    4    1    1       2
                    0.625   14    2    2
                    0.595   12    2    3
autoscaled data
  k = 1             1.780    4    1    1       1
  k = 3             1.780    4    1    1       2
                    0.952   14    2    2
                    0.794   12    2    3

Notes: The value of prox measures proximity, as discussed above; i is the corresponding training object; p is the class that object belongs to; Classif. is the resultant classification.
Table 4 compares the 1NN and 3NN methods for the test objects 15, 16 and 17; unscaled as well as autoscaled data were used. All the calculations are straightforward, as demonstrated before.
5.1.2.2 Performance of the classification rule. It is impossible to determine from the test objects which classification procedure performs best. The efficiency of the classification rules can, however, be compared on the basis of the training set, for which the class membership of the objects is known. This can be done by applying the classification rule to each of the training objects, using the remaining training set objects to develop the rule. Thus each training object in turn is considered as a test object and classified. This procedure is known as the leave-one-out method. The degree of congruence between the classification outcomes and the true class membership of the training set objects can be used as a measure of the discriminating ability or efficiency of the classification rule. By using each of the training objects in turn as a test object in the leave-one-out procedure, the predictive efficiency of the method is estimated, i.e. the efficiency for future test object classifications. A measure of the efficiency is the number of correctly classified training objects (NCC), which can also be expressed as a % correct classification rate (%CCR). For a training class C_p we obtain NCC(p) and %CCR(p). Table 5 summarizes the predictive efficiency for the classification rules considered in Table 4.
Table 5. Predictive efficiency of the classification methods in Table 4, based on the leave-one-out procedure.

             Unscaled, k = 1     Unscaled, k = 3     Autoscaled, k = 1    Autoscaled, k = 3
             NCC(p)  %CCR(p)     NCC(p)  %CCR(p)     NCC(p)  %CCR(p)      NCC(p)  %CCR(p)
C1             4       57          5       71          4       57           5       71
C2             6       86          6       86          7      100           7      100
C1 + C2       10       71         11       79         11       79          12       86
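The leave-one-out estimates of Table 5 can be reproduced along the following lines. The sketch reuses knn_classify and the arrays from the earlier sketches and, for simplicity, covers only the unscaled case; for the autoscaled columns the scaling parameters would have to be handled appropriately as well.

```python
import numpy as np

def leave_one_out_ccr(X1, X2, k=1):
    """Leave-one-out estimate of the correct classification rate for the kNN rule
    on unscaled data. Returns NCC per class and %CCR per class and overall."""
    data = [(x, 1) for x in X1] + [(x, 2) for x in X2]
    ncc = {1: 0, 2: 0}
    for i, (x, true_label) in enumerate(data):
        # rebuild the training set without object i
        rest = {1: np.array([y for j, (y, lab) in enumerate(data) if j != i and lab == 1]),
                2: np.array([y for j, (y, lab) in enumerate(data) if j != i and lab == 2])}
        if knn_classify(np.asarray(x), rest, k=k) == true_label:
            ncc[true_label] += 1
    total = ncc[1] + ncc[2]
    ccr = {1: 100 * ncc[1] / len(X1), 2: 100 * ncc[2] / len(X2),
           'all': 100 * total / (len(X1) + len(X2))}
    return ncc, ccr

print(leave_one_out_ccr(X1, X2, k=1))   # should reproduce the unscaled k = 1 column of Table 5
```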
Q.9 (a) What conclusions can be drawn from Table 5 concerning the performance of the kNN rules with unscaled and autoscaled data? (b) What can we decide about the allocation of the test objects 15, 16 and 17 in the light of the results in Table 5?
Because the data set is small, we have to be cautious about these results. The larger the data set, the more certain one can be that the conclusions are reliable.
Q.10 What variable could have caused the difference in performance when using unscaled or autoscaled data?
5.1.2.3 Variable selection. Not only appropriate scaling of the variables but also the choice of informative variables may improve the separability of the classes. It is not necessarily a good idea to use all the variables at hand in a classification procedure. Noisy variables can cause more harm than benefit [6]. Some variables may be highly correlated with others, so that there is no need to use all of them; indeed, no information is added when correlations are large. In other words, variable selection can improve the classification performance of the kNN classifier because: (a) uninformative and noisy variables are detected and removed, and (b) the ratio of the number of variables to the number of training objects is reduced. In variable selection, the variables can be ranked according to a measure that quantifies the importance of each variable in separating classes. Several measures are available in the literature. The one we use here is called the variance weight [7].
Chapter 5, Section 4. For two classes, C1 and C2, the variance weight is given by
where iij