DATA HANDLING IN SCIENCE AND TECHNOLOGY — VOLUME 20B
Handbook of Chemometrics and Qualimetrics: Part B
DATA HANDLING IN SCIENCE AND TECHNOLOGY
Advisory Editors: B.G.M. Vandeginste and S.C. Rutan

Other volumes in this series:
Volume 1 Microprocessor Programming and Applications for Scientists and Engineers, by R.R. Smardzewski
Volume 2 Chemometrics: A Textbook, by D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte and L. Kaufman
Volume 3 Experimental Design: A Chemometric Approach, by S.N. Deming and S.L. Morgan
Volume 4 Advanced Scientific Computing in BASIC with Applications in Chemistry, Biology and Pharmacology, by P. Valko and S. Vajda
Volume 5 PCs for Chemists, edited by J. Zupan
Volume 6 Scientific Computing and Automation (Europe) 1990, Proceedings of the Scientific Computing and Automation (Europe) Conference, 12-15 June 1990, Maastricht, The Netherlands, edited by E.J. Karjalainen
Volume 7 Receptor Modeling for Air Quality Management, edited by P.K. Hopke
Volume 8 Design and Optimization in Organic Synthesis, by R. Carlson
Volume 9 Multivariate Pattern Recognition in Chemometrics, illustrated by case studies, edited by R.G. Brereton
Volume 10 Sampling of Heterogeneous and Dynamic Material Systems: Theories of Heterogeneity, Sampling and Homogenizing, by P.M. Gy
Volume 11 Experimental Design: A Chemometric Approach (Second, Revised and Expanded Edition), by S.N. Deming and S.L. Morgan
Volume 12 Methods for Experimental Design: Principles and Applications for Physicists and Chemists, by J.L. Goupy
Volume 13 Intelligent Software for Chemical Analysis, edited by L.M.C. Buydens and P.J. Schoenmakers
Volume 14 The Data Analysis Handbook, by I.E. Frank and R. Todeschini
Volume 15 Adaption of Simulated Annealing to Chemical Optimization Problems, edited by J. Kalivas
Volume 16 Multivariate Analysis of Data in Sensory Science, edited by T. Naes and E. Risvik
Volume 17 Data Analysis for Hyphenated Techniques, by E.J. Karjalainen and U.P. Karjalainen
Volume 18 Signal Treatment and Signal Analysis in NMR, edited by D.N. Rutledge
Volume 19 Robustness of Analytical Chemical Methods and Pharmaceutical Technological Products, edited by M.W.B. Hendriks, J.H. de Boer and A.K. Smilde
Volume 20A Handbook of Chemometrics and Qualimetrics: Part A, by D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. De Jong, P.J. Lewi and J. Smeyers-Verbeke
Volume 20B Handbook of Chemometrics and Qualimetrics: Part B, by B.G.M. Vandeginste, D.L. Massart, L.M.C. Buydens, S. De Jong, P.J. Lewi and J. Smeyers-Verbeke
DATA HANDLING IN SCIENCE AND TECHNOLOGY — VOLUME 20B Advisory Editors: B.G.M. Vandeginste and S.C. Rutan
Handbook of Chemometrics and Qualimetrics: Part B B.G.M. VANDEGINSTE Unilever Research Laboratorium, Vlaardingen, The Netherlands
D.L. MASSART Farmaceutisch Instituut, Dienst Farmaceutische en Biomedische Analyse, Vrije Universiteit Brussel, Brussels, Belgium
L.M.C. BUYDENS Vakgroep Analytische Chemie, Katholieke Universiteit Nijmegen, Faculteit Natuurwetenschappen, Nijmegen, The Netherlands
S. DE JONG Unilever Research Laboratorium, Vlaardingen, The Netherlands
P.J. LEWI Janssen Research Foundation, Center for Molecular Design, Vosselaar, Belgium
J. SMEYERS-VERBEKE Farmaceutisch Instituut, Dienst Farmaceutische en Biomedische Analyse, Vrije Universiteit Brussel, Brussels, Belgium
ELSEVIER Amsterdam - Boston - London - New York - Oxford - Paris San Diego - San Francisco - Singapore - Sydney - Tokyo
ELSEVIER SCIENCE B.V. Sara Burgerhartstraat 25 P.O. Box 211, 1000 AE Amsterdam, The Netherlands © 1998 Elsevier Science B.V. All rights reserved. This work is protected under copyright by Elsevier Science, and the following terms and conditions apply to its use: Photocopying Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use. Permissions may be sought directly from Elsevier Science via their homepage (http://www.elsevier.com) by selecting 'Customer support' and then 'Permissions'. Alternatively you can send an e-mail to:
[email protected], or fax to: (+44) 1865 853333. In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 207 631 5555; fax: (+44) 207 631 5500. Other countries may have a local reprographic rights agency for payments. Derivative Works: Tables of contents may be reproduced for internal circulation, but permission of Elsevier Science is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations. Electronic Storage or Usage: Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter. Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier Science Global Rights Department, at the fax and e-mail addresses noted above. Notice: No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made. First edition 1998. Second impression 2003. Library of Congress Cataloging-in-Publication Data: Handbook of chemometrics and qualimetrics / B.G.M. Vandeginste ... [et al.]. p. cm. — (Data handling in science and technology ; v. 20B) Includes index. ISBN 0-444-82853-2 (pt. 20B : acid-free paper) 1. Chemistry, Analytic—Statistical methods. 2. Chemistry, Analytic—Mathematics. 3. Chemistry, Analytic—Data processing. I. Vandeginste, B.G.M. II. Series. QD75.4.S8H36 1998
98-42544 CIP
British Library Cataloguing in Publication Data: A catalogue record from the British Library has been applied for.

ISBN: 0-444-82853-2 (Vol. 20B)
ISBN: 0-444-82854-0 (set)
The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper). Printed in The Netherlands.
Preface
In 1991 two of us, Luc Massart and Bernard Vandeginste, discussed, during one of our many meetings, the possibility and necessity of updating the book Chemometrics: a textbook. Some of the newer techniques, such as partial least squares and expert systems, were not included in that book which was written some 15 years ago. Initially, we thought that we could bring it up to date with relatively minor revision. We could not have been more wrong. Even during the planning of the book we witnessed a rapid development in the application of natural computing methods, multivariate calibration, method validation, etc. When approaching colleagues to join the team of authors, it was clear from the outset that the book would not be an overhaul of the previous one, but an almost completely new book.

When forming the team, we were particularly happy to be joined by two industrial chemometricians: Dr. Paul Lewi from Janssen Pharmaceutica and Dr. Sijmen de Jong from Unilever Research Laboratorium Vlaardingen, each having a wealth of practical experience. We are grateful to Janssen Pharmaceutica and Unilever Research Vlaardingen for allowing Paul, Sijmen and Bernard to spend some of their time on this project. The three other authors belong to the Vrije Universiteit Brussel (Prof. An Smeyers-Verbeke and Prof. D. Luc Massart) and the Katholieke Universiteit Nijmegen (Professor Lutgarde Buydens), thus creating a team in which university and industry are equally well represented. We hope that this has led to an equally good mix of theory and application in the new book.

Much of the material presented in this book is based on the direct experience of the authors. This would not have been possible without the hard work and input of our colleagues, students and post-doctoral fellows. We sincerely want to acknowledge each of them for their good research and contributions without which we would not have been able to treat such a broad range of subjects. Some of them read chapters or helped in other ways. We also owe thanks to the chemometrics community and at the same time we have to offer apologies. We have had the opportunity of collaborating with many colleagues and we have profited from the research and publications of many others. Their ideas and work have made this book possible and necessary. The size of the book shows that they have been very productive. Even so, we have cited only a fraction of the literature and we have not included the more sophisticated work. Our wish was to consolidate and therefore to explain those methods that have become more or less accepted, also to
newcomers to chemometrics. Our apologies, therefore, to those we did not cite, or did not cite extensively: it is not a reflection on the quality of their work.

Each chapter saw many versions which needed to be entered and re-entered in the computer. Without the help of our secretaries, we would not have been able to complete this work successfully. All versions were read and commented on by all authors in a long series of team meetings. We will certainly retain special memories of many of our two-day meetings, for instance the one organized by Paul in the famous abbey of the regular canons of Prémontré at Tongerlo, where we could work in peace and quiet as so many before us have done. Much of this work also had to be done at home, which took away precious time from our families. Their love, understanding, patience and support were indispensable for us to carry on with the seemingly endless series of chapters to be drafted, read or revised.
Contents

Preface
Chapter 28
Introduction to Part B
References
Chapter 29
Vectors, Matrices and Operations on Matrices
29.1 Vector space
29.2 Geometrical properties of vectors
29.3 Matrices
29.4 Matrix product
29.5 Dimension and rank
29.6 Eigenvectors and eigenvalues
29.7 Statistical interpretation of matrices
29.8 Geometrical interpretation of matrix products
References
Chapter 30
Cluster Analysis
30.1 Clusters
30.2 Measures of (dis)similarity
30.2.1 Similarity and distance
30.2.2 Measures of (dis)similarity for continuous variables
30.2.2.1 Distances
30.2.2.2 Correlation coefficient
30.2.2.3 Scaling
30.2.3 Measures of (dis)similarity for other variables
30.2.3.1 Binary variables
30.2.3.2 Ordinal variables
30.2.3.3 Mixed variables
30.2.4 Similarity matrix
30.3 Clustering algorithms
30.3.1 Hierarchical methods
30.3.2 Non-hierarchical methods
30.3.3 Other methods
30.3.4 Selecting clusters
30.3.4.1 Measures for clustering tendency
30.3.4.2 How many clusters?
30.3.5 Conclusion
References
Chapter 31
Analysis of Measurement Tables
Introduction
31.1 Principal components analysis
31.1.1 Singular vectors and singular values
31.1.2 Eigenvectors and eigenvalues
31.1.3 Latent vectors and latent values
31.1.4 Scores and loadings
31.1.5 Principal components
31.1.6 Transition formulae
31.1.7 Reconstructions
31.2 Geometrical interpretation
31.2.1 Line of closest fit
31.2.2 Distances
31.2.3 Unipolar axes
31.2.4 Bipolar axes
31.3 Preprocessing
31.3.1 No transformation
31.3.2 Column-centering
31.3.3 Column-standardization
31.3.4 Log column-centering
31.3.5 Log double-centering
31.3.6 Double-closure
31.4 Algorithms
31.4.1 Singular value decomposition
31.4.2 Eigenvalue decomposition
31.5 Validation
31.5.1 Scree-plot
31.5.2 Malinowski's F-test
31.5.3 Cross-validation
31.6 Principal coordinates analysis
31.6.1 Distances defined from data
31.6.2 Distances derived from comparisons of pairs
31.6.3 Eigenvalue decomposition
31.7 Non-linear principal components analysis
31.7.1 Extensions of the data by higher order terms
31.7.2 Non-linear transformations of the data
31.7.3 Non-linear PCA biplot
31.8 Three-way principal components analysis
31.8.1 Unfolding
31.8.2 The Tucker3 model
31.8.3 The PARAFAC model
31.9 PCA and cluster analysis
References

Chapter 32

Analysis of Contingency Tables
32.1 Contingency table
32.2 Chi-square statistic
32.3 Closure
32.3.1 Row-closure
32.3.2 Column-closure
32.3.3 Double-closure
32.4 Weighted metric
32.5 Distance of chi-square
32.5.1 Row-closure
32.5.2 Column-closure
32.5.3 Double-closure
32.6 Correspondence factor analysis
32.6.1 Historical background
32.6.2 Generalized singular value decomposition
32.6.3 Biplots
32.6.4 Application
32.7 Log-linear model
32.7.1 Historical introduction
32.7.2 Algorithm
32.7.3 Application
References
Chapter 33
Supervised Pattern Recognition
33.1 Supervised and unsupervised pattern recognition
33.2 Derivation of classification rules
33.2.1 Types of classification rules
33.2.2 Canonical variates and linear discriminant analysis
33.2.3 Quadratic discriminant analysis and related methods
33.2.4 The k-nearest neighbour method
33.2.5 Density methods
33.2.6 Classification trees
33.2.7 UNEQ, SIMCA and related methods
33.2.8 Partial least squares
33.2.9 Neural networks
33.3 Feature selection and reduction
33.4 Validation of classification rules
References
Chapter 34
Curve and Mixture Resolution by Factor Analysis and Related Techniques
34.1 Abstract and true factors
34.2 Full-rank methods
34.2.1 A qualitative approach
34.2.2 Factor rotations
34.2.3 The Varimax rotation
34.2.4 Factor rotation by target transformation factor analysis (TTFA)
34.2.5 Curve resolution based methods
34.2.5.1 Curve resolution of two-factor systems
34.2.5.2 Curve resolution of three-factor systems
34.2.6 Factor rotation by iterative target transformation factor analysis (ITTFA)
34.3 Evolutionary and local rank methods
34.3.1 Evolving factor analysis (EFA)
34.3.2 Fixed-size window evolving factor analysis (FSWEFA)
34.3.3 Heuristic evolving latent projections (HELP)
34.4 Pure column (or row) techniques
34.4.1 The variance diagram (VARDIA) technique
34.4.2 Simplisma
34.4.3 Orthogonal projection approach (OPA)
34.5 Quantitative methods for factor analysis
34.5.1 Generalized rank annihilation factor analysis (GRAFA)
34.5.2 Residual bilinearization (RBL)
34.5.3 Discussion
34.6 Application of factor analysis for peak purity check in HPLC
34.7 Guidance for the selection of a factor analysis method
References
Chapter 35
Relations between Measurement Tables
35.1 Introduction
35.2 Procrustes analysis
35.2.1 Introduction
35.2.2 Algorithm
35.2.3 Discussion
35.3 Canonical correlation analysis
35.3.1 Introduction
35.3.2 Algorithm
35.3.3 Discussion
35.4 Multivariate least squares regression
35.4.1 Introduction
35.4.2 Algorithm
35.4.3 Discussion
35.5 Reduced rank regression
35.5.1 Introduction
35.5.2 Algorithm
35.5.3 Discussion
35.5.4 Example
35.6 Principal components regression
35.6.1 Introduction
35.6.2 Algorithm
35.6.3 Discussion
35.7 Partial least squares regression
35.7.2 NIPALS-PLS algorithm
35.7.3 Discussion
35.7.4 Alternative PLS algorithms
35.8 Continuum regression methods
35.9 Concluding remarks
References
Chapter 36
Multivariate Calibration
36.1 Introduction
36.2 Calibration methods
36.2.1 Classical least squares
36.2.2 Inverse least squares
36.2.3 Principal components regression
36.2.4 Partial least squares regression
36.2.5 Other linear methods
36.3 Validation
36.4 Other aspects
36.4.1 Calibration design
36.4.2 Data pretreatment
36.4.3 Outliers
36.5 New developments
36.5.1 Feature selection
36.5.2 Transfer of calibration models
36.5.3 Non-linear methods
References
Chapter 37
Quantitative Structure-Activity Relationships (QSAR)
37.1 Extrathermodynamic methods
37.1.1 Hansch analysis
37.1.2 Free-Wilson analysis
37.2 Principal components models
37.2.1 Principal components analysis
37.2.2 Spectral map analysis
37.2.3 Correspondence factor analysis
37.3 Canonical variate models
37.3.1 Linear discriminant analysis
37.3.2 Canonical correlation analysis
37.4 Partial least squares models
37.4.1 PLS regression and CoMFA
37.4.2 Two-block PLS and indirect QSAR
37.5 Other approaches
References
Chapter 38
Analysis of Sensory Data
38.1 Introduction
38.2 Difference tests
38.2.1 Triangle test
38.2.2 Duo-trio test
38.2.3 Paired comparisons
38.3 Multidimensional scaling
38.4 The analysis of Quantitative Descriptive Analysis profile data
38.5 Comparison of two or more sensory data sets
38.6 Linking sensory data to instrumental data
38.7 Temporal aspects of perception
38.8 Product formulation
References
Chapter 39
Pharmacokinetic Models
Introduction
39.1 Compartmental analysis
39.1.1 One-compartment open model for intravenous administration
39.1.2 Two-compartment catenary model for extravascular administration
39.1.3 Two-compartment catenary model for extravascular administration with incomplete absorption
39.1.4 One-compartment open model for continuous intravenous infusion
39.1.5 One-compartment open model for repeated intravenous administration
39.1.6 Two-compartment mammillary model for intravenous administration using Laplace transform
39.1.7 Multi-compartment models
39.1.7.1 The convolution method
39.1.7.2 The Y-method
39.2 Non-compartmental analysis
39.3 Compartment models versus non-compartmental analysis
39.4 Linearization of non-linear models
References
Chapter 40
Signal Processing
40.1 Signal domains
40.2 Types of signal processing
40.3 The Fourier transform
40.3.1 Time and frequency domain
40.3.2 The Fourier transform of a continuous signal
40.3.3 Derivation of the Fourier transform of a sine
40.3.4 The discrete Fourier transformation
40.3.5 Frequency range and resolution
40.3.6 Sampling
40.3.7 Zero filling and resolution
40.3.8 Periodicity and symmetry
40.3.9 Shift and phase
40.3.10 Distributivity and scaling
40.3.11 The fast Fourier transform
40.4 Convolution
40.5 Signal processing
40.5.1 Characterization of noise
40.5.2 Signal enhancement in the time domain
40.5.2.1 Time averaging
40.5.2.2 Smoothing by moving average
40.5.2.3 Polynomial smoothing
40.5.2.4 Exponential smoothing
40.5.3 Signal enhancement in the frequency domain
40.5.4 Smoothing and filtering: a comparison
40.5.5 The derivative of a signal
40.5.6 Data compression by a Fourier transform
40.6 Deconvolution by Fourier transform
40.7 Other deconvolution methods
40.7.1 Maximum likelihood
40.7.2 Maximum entropy
40.8 Other transforms
40.8.1 The Hadamard transform
40.8.2 The time-frequency Fourier transform
40.8.3 The wavelet transform
References
Chapter 41
Kalman Filtering
41.1 Introduction
41.2 Recursive regression of a straight line
41.3 Recursive multicomponent analysis
41.4 System equations
41.4.1 System equation for a kinetics experiment
41.4.2 System equation of a calibration line with drift
41.5 The Kalman filter
41.5.1 Theory
41.5.2 Kalman filter of a kinetics model
41.5.3 Kalman filtering of a calibration line with drift
41.6 Adaptive Kalman filtering
41.6.1 Evaluation of the innovation
41.6.2 The adaptive Kalman filter model
41.7 Applications
References
Chapter 42
Applications of Operations Research
42.1 An overview
42.2 Linear programming
42.3 Queuing problems
42.3.1 Queuing and waiting
42.3.2 Application in analytical laboratory management
42.4 Discrete event simulation
42.5 A shortest path problem
References
Chapter 43
Artificial Intelligence: Expert and Knowledge Based Systems
43.1 Artificial intelligence and expert systems
43.2 Expert systems
43.3 Structure of expert systems
43.4 Knowledge representation
43.4.1 Rule-based knowledge representation
43.4.2 Frame-based knowledge representation
43.5 The inference engine
43.5.1 Rule-based inferencing
43.5.2 Frame-based inferencing
43.5.2.1 Inheritance
43.5.2.2 Object-oriented programming techniques
43.5.3 Reasoning with uncertainty
43.6 The interaction module
43.7 Tools
43.8 Development of an expert system
43.8.1 Analysis of the application area
43.8.2 Definition of knowledge domain, sources and tools
43.8.3 Knowledge acquisition
43.8.4 Implementation
43.8.5 Testing, validation and evaluation
43.8.6 Maintenance
43.9 Conclusion
References
Chapter 44
Artificial Neural Networks
44.1 Introduction
44.2 Historical overview
44.3 The basic unit — the neuron
44.4 The linear learning machine and the perceptron network
44.4.1 Principle
44.4.2 Learning strategy
44.4.3 Limitations
44.5 Multilayer feed forward (MLF) networks
44.5.1 Introduction
44.5.2 Structure
44.5.3 Signal propagation
44.5.4 The transfer function
44.5.4.1 Role of the transfer function
44.5.4.2 Transfer function of the output units
44.5.4.3 Transfer function in the hidden units
44.5.5 Learning rule
44.5.6 Learning rate and momentum term
44.5.7 Training and testing an MLF network
44.5.7.1 Network performance
44.5.7.2 Local minima
44.5.8 Determining the number of hidden units
44.5.9 Data preprocessing
44.5.9.1 Scaling
44.5.9.2 Variable selection and reduction
44.5.10 Validation of MLF networks
44.5.11 Aspects of use
44.5.12 Chemical applications
44.6 Radial basis function networks
44.6.1 Structure
44.6.2 Training
44.6.3 An example
44.6.4 Applications
44.7 Kohonen networks
44.7.1 Structure
44.7.2 Training
44.7.3 Interpretation of the Kohonen map
44.7.4 Applications
44.8 Adaptive resonance theory networks
44.8.1 Introduction
44.8.2 Structure
44.8.3 Training
44.8.4 Application
References

Index
Chapter 28
Introduction to Part B
In the introduction to Part A we discussed the "arch of knowledge" [1] (see Fig. 28.1), which represents the cycle of acquiring new knowledge by experimentation and the processing of the data obtained from the experiments. Part A focused mainly on the first step of the arch: a proper design of the experiment based on the hypothesis to be tested, evaluation and optimization of the experiments, with the accent on univariate techniques. In Part B we concentrate on the second and third steps of the arch, the transformation of data and results into information and the combination of information into knowledge, with the emphasis on multivariate techniques.

In order to obtain information from a series of experiments, we need to interpret the data. Very often the first step in understanding the data is to visualise them in a plot or a graph. This is particularly important when the data are complex in nature. These plots help in discovering any structure that might be present in the data and which can be related to a property of the objects studied. Because plots have to be represented on paper or on a flat computer screen, data need to be projected and compressed.

Analytical results are often represented in a data table, e.g., a table of the fatty acid compositions of a set of olive oils. Such a table is called a two-way multivariate data table. Because some olive oils may originate from the same region and others from a different one, the complete table has to be studied as a whole instead of as a collection of individual samples, i.e., the results of each sample are interpreted in the context of the results obtained for the other samples. For example, one may ask for natural groupings of the samples in clusters with a common property, namely a similar fatty acid composition. This is the objective of cluster analysis (Chapter 30), which is one of the techniques of unsupervised pattern recognition. The results of the clustering do not depend on the way the results have been arranged in the table, i.e., the order of the objects (rows) or the order of the fatty acids (columns). In fact, the order of the variables or objects has no particular meaning. In another experiment we might be interested in the monthly evolution of some constituents present in the olive oil. Therefore, we decide to measure the total amount of free fatty acids and the triacylglycerol composition in a set of olive oil
[Figure: an arch connecting the nodes Experiment, Information, Knowledge and Hypothesis, with the links labelled data, induction (analysis), intelligence, creativity, deduction (synthesis) and design.] Fig. 28.1. The arch of knowledge.
samples under fixed storage conditions. Each month a two-way table is obtained — six in total after six months. We could decide to analyse all six tables individually. However, this would not provide information on the effect of the storage time and its relation to the origin of the oil. It is more informative to consider all tables together. They form a so-called three-way table. The analysis of such a table is discussed in Chapter 31. If, in addition, all olive samples are split into portions which are stored under different conditions, e.g., open and closed bottles, darkness and daylight, we obtain several three-way tables or in general a multi-way data table.

Some analytical instruments produce a table of raw data which need to be processed into the analytical result. Hyphenated measurement devices, such as HPLC linked to a diode array detector (DAD), form an important class of such instruments. In the particular case of HPLC-DAD, data tables are obtained consisting of spectra measured at several elution times. The rows represent the spectra and the columns are chromatograms detected at a particular wavelength. Consequently, rows and columns of the data table have a physical meaning. Because the data table X can be considered to be a product of a matrix C containing the concentration profiles and a matrix S containing the pure (but often unknown) spectra, we call such a table bilinear. The order of the rows in this data table corresponds to the order of the elution of the compounds from the analytical column. Each row corresponds to a particular elution time. Such bilinear data tables are therefore called ordered data tables. Trilinear data tables are obtained from LC-detectors which produce a matrix of data at any instant during the
elution, e.g., an excitation-emission spectrum as a function of time. Bilinear and trilinear data tables are also measured when a chemical reaction is monitored over time, e.g., the biomass in a fermenter by near-infrared spectroscopy. An example of a non-bilinear two-way table is a table of MS-MS spectra or a 2D-NMR spectrum. These tables cannot be represented as a product of row and column spectra.

So far, we have discussed the structure of individual data tables: two-way to multiway, which may be bilinear to multilinear. In some cases two or more tables are obtained for a set of samples. The simplest situation is a two-way data table associated with a vector of properties, e.g., a table with the fatty acid composition of olive oils and a vector with the coded region of the oils. In this case we not only want to display the data table, but also to derive a classification rule, which classifies the collected oils in the right category. With this classification rule, oils of unknown origin can be classified. This is the area of supervised pattern recognition, which classically is based on multivariate statistics (Chapter 33) but more recently neural nets have been introduced in this field (Chapter 44). The technique of neural networks belongs to the area of so-called natural computation methods. Genetic algorithms — another technique belonging to the family of natural computation methods — were discussed in Part A. They are called natural because the principle of the algorithms to some extent mimics the way biological systems function. Originally, neural networks were considered to be a model for the functioning of the brain.

In the example above, the property is a discrete class, region of origin, healthy or ill person, which is not necessarily a quantitative value. However, in many cases the property may be a numerical value, e.g., a concentration or a pharmacological activity (Chapter 37). Modelling a vector of properties to a table of measurements is the area of multivariate calibration, e.g., by principal components regression or by partial least squares, which are described in Chapter 36. Here the degree of complexity of the data sets is almost unlimited: several data tables may be predictors for a table of properties, the relationship between the tables may be non-linear and some or all tables may be multiway.

As indicated before, the columns and the rows of a bilinear or trilinear dataset have a particular meaning, e.g., a spectrum and a chromatogram or the concentration profiles of reactants and the reaction products in an equilibrium or kinetic study. The resulting data table is made up of the product of the tables of these pure factors, e.g., the table of the elution profiles of the pure compounds and the table of the spectra of these compounds. One of the aims of a study of such a table is the decomposition of the table into its pure spectra and pure elution profiles. This is done by factor analysis (Chapter 34). A special type of table is the contingency table, which has been introduced in Chapter 16 of Part A. In Part B the 2x2 contingency table is extended to the general case (Chapter 32) which can be analyzed in a multivariate way. The above
examples illustrate that the complexity of data and operations discussed in Part B require advanced chemometric techniques. Although Part B can be studied independently from Part A, we will implicitly assume a chemometrics background equivalent to Part A, where a more informal treatment of some of the same topics can be found. For instance, vectors and matrices are introduced for the first time in Chapter 9 of Part A, but are treated in more depth in Chapter 29 of Part B. Principal components analysis, which was introduced in Chapter 17, is discussed in more detail in Chapter 31. These two chapters provide the basis for more advanced techniques such as Procrustes analysis, canonical correlation analysis and partial least squares discussed in Chapter 35. In Part B we also concentrate on a number of important application areas of multivariate statistics: multivariate calibration (Chapter 36), quantitative structure-activity relationships (QSAR, Chapter 37), sensory analysis (Chapter 38) and pharmacokinetics (Chapter 39).

The success of data analysis depends on the quality of the data. Noise and other instrumental factors may hide the information in the data. In some instances it is possible to improve the quality of the data by a suitable preprocessing technique such as signal filtering and signal restoration by deconvolution (Chapter 40). Prior to signal enhancement and restoration it may be necessary to transform the data by, e.g., a Fourier transform or wavelet transform. Both are discussed in Chapter 40. A special type of filter is the Kalman filter which is particularly applicable to the real-time modelling of systems of which the model parameters are time dependent. For instance, the slope and intercept of a calibration line may be subject to a drift. At each new measurement of a calibration standard, the Kalman filter updates the calibration factors, taking into account the uncertainty in the calibration factors and in the data. Because the Kalman filter is driven by the difference between a new measurement and the predicted value of that measurement, the filter ignores outlying measurements, caused by stochastic or systematic errors.

The last step of the arch of knowledge is the transformation of information into knowledge. Based on this knowledge one is able to make a decision. For instance, values of temperatures and pressures in different parts of a process have to be interpreted by the operator in the control room of the plant, who may take a series of actions. Which action to take is not always obvious and some guidance is usually found in a manual. In analytical method development the same situation is encountered. Guidance is required for instance to select a suitable stationary phase in HPLC or to select the solvents that will make up the mobile phase. This type of knowledge may be available in the form of "If ... Then" rules. Such rules can be combined in a rule-based knowledge base, which is consulted by an expert system (Chapter 43). Questions such as "What if?" can also be answered by models developed in Operations Research (Chapter 42). For instance, the average time a sample has to wait in the sample queue can be predicted by queuing theory for various priority strategies.
After completion of one cycle of the arch of knowledge, we are back at the starting point of the arch, where we should accept or reject the hypothesis. At this stage a new cycle can be started based on the knowledge gained from all previous ones. The chemometric techniques described in Parts A and B aim to support the scientist in running through these cycles in an efficient way.
References
1. D. Oldroyd, The Arch of Knowledge. Methuen, New York (1986).
Chapter 29
Vectors, Matrices and Operations on Matrices

This chapter is an extension and generalization of the material presented in Chapter 9. Here we deal with the calculus of vectors and matrices from the point of view of the analysis of a two-way multivariate data table, as defined in Chapter 28. Such data arise when several measurements are made simultaneously on each object in a set [1]. Usually these raw data are collected in tables in which the rows refer to the objects and the columns to the measurements. For example, one may obtain physicochemical properties such as lipophilicity, electronegativity, molecular volume, etc., on a number of chemical compounds. The resulting table is called a measurement table. Note that the assignment of objects to rows and of measurements to columns is only conventional. It arises from the fact that often there are more objects than measurements, and that printing of such a table is more convenient with the smallest number of columns. In a cross-tabulation each element of the table represents a count, a mean value or some other summary statistic for the various combinations of the categories of the two selected measurements. In the above example, one may cross the categories of lipophilicity with the categories of electronegativity (using appropriate intervals of the measurement scales). When each cell of such a cross-tabulation contains the number of objects that belong to the combined categories, this results in a contingency table or frequency table which is discussed extensively in Chapter 32. In a more general cross-tabulation, each cell of the table may refer, for example, to the average molecular volume that has been observed for the combined categories of lipophilicity and electronegativity. One of the aims of multivariate analysis is to reveal patterns in the data, whether they are in the form of a measurement table or in that of a contingency table. In this chapter we will refer to both of them by the more algebraic term 'matrix'. In what follows we describe the basic properties of matrices and of operations that can be applied to them. In many cases we will not provide proofs of the theorems that underlie these properties, as these proofs can be found in textbooks on matrix algebra (e.g. Gantmacher [2]). The algebraic part of this section is also treated more extensively in textbooks on multivariate analysis (e.g. Dillon and Goldstein [1], Giri [3], Cliff [4], Harris [5], Chatfield and Collins [6], Srivastava and Carter [7], Anderson [8]).
29.1 Vector space

In accordance with Section 9.1, we represent a vector z as an ordered vertical arrangement of numbers. The transpose z^T then represents an ordered horizontal arrangement of the same numbers. The dimension of a vector is equal to the number of its elements, and a vector with dimension n will be referred to as an n-vector. A set of p vectors (z_1 ... z_p) with the same dimension n is linearly independent if the expression:

$$\sum_{i=1}^{p} c_i \mathbf{z}_i = \mathbf{0} \qquad (29.1)$$
holds only when all p coefficients c_i are zero. Otherwise, the p vectors are linearly dependent (see also Section 9.2.8). The following three vectors z_1, z_2 and z_3 with dimension four are linearly independent:

$$\mathbf{z}_1 = \begin{bmatrix} -1 \\ 4 \\ 0 \\ 2 \end{bmatrix} \qquad \mathbf{z}_2 = \begin{bmatrix} 2 \\ 2 \\ 3 \\ -1 \end{bmatrix} \qquad \mathbf{z}_3 = \begin{bmatrix} -5 \\ 1 \\ -6 \\ 4 \end{bmatrix}$$

as it is not possible to find a set of coefficients c_1, c_2, c_3 which are not all equal to zero, and which satisfy the system of four equations:

$$\begin{aligned} -c_1 + 2c_2 - 5c_3 &= 0 \\ 4c_1 + 2c_2 + 1c_3 &= 0 \\ 0c_1 + 3c_2 - 6c_3 &= 0 \\ 2c_1 - 1c_2 + 4c_3 &= 0 \end{aligned}$$
In the case of linearly dependent vectors, each of them can be expressed as a linear combination of the others. For example, the last of the three vectors below can be expressed in the form z_3 = z_1 - 2z_2:

$$\mathbf{z}_1 = \begin{bmatrix} -1 \\ 4 \\ 0 \\ 2 \end{bmatrix} \qquad \mathbf{z}_2 = \begin{bmatrix} 2 \\ 2 \\ 3 \\ -1 \end{bmatrix} \qquad \mathbf{z}_3 = \begin{bmatrix} -5 \\ 0 \\ -6 \\ 4 \end{bmatrix}$$
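Such (in)dependence is easy to verify numerically. The following minimal sketch (an illustration added here, not part of the original text; it assumes the NumPy library) uses the rank of the matrix whose columns are the vectors: the rank equals the number of linearly independent vectors.

```python
import numpy as np

# First set of vectors: linearly independent.
Z_indep = np.column_stack(([-1, 4, 0, 2], [2, 2, 3, -1], [-5, 1, -6, 4]))
print(np.linalg.matrix_rank(Z_indep))   # 3: as many as there are vectors

# Second set: z3 = z1 - 2*z2, hence linearly dependent.
z1, z2 = np.array([-1, 4, 0, 2]), np.array([2, 2, 3, -1])
z3 = z1 - 2 * z2                         # equals [-5, 0, -6, 4]
Z_dep = np.column_stack((z1, z2, z3))
print(np.linalg.matrix_rank(Z_dep))      # 2: only two independent vectors
```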
A vector space spanned by a set of p vectors (z_1 ... z_p) with the same dimension n is the set of all vectors that are linear combinations of the p vectors that span the space [3]. A vector space satisfies the three requirements of an algebraic set which follow hereafter. (1) Any vector obtained by vector addition and scalar multiplication of the vectors that span the space also belongs to this space. This includes the null vector whose elements are all equal to zero. (2) Addition of the null vector to any vector of the space reproduces the vector. (3) For every vector that belongs to the space, another one can be found, such that vector addition of these two vectors produces the null vector.

A set of n vectors of dimension n which are linearly independent is called a basis of an n-dimensional vector space. There can be several bases of the same vector space. The set of unit vectors of dimension n defines an n-dimensional rectangular (or Cartesian) coordinate space S^n. Such a coordinate space S^n can be thought of as being constructed from n base vectors of unit length which originate from a common point and which are mutually perpendicular. Hence, a coordinate space is a vector space which is used as a reference frame for representing other vector spaces. It is not uncommon that the dimension of a coordinate space (i.e. the number of mutually perpendicular base vectors of unit length) exceeds the dimension of the vector space that is embedded in it. In that case the latter is said to be a subspace of the former. For example, the basis of S^4 is:

$$\mathbf{u}_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \qquad \mathbf{u}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} \qquad \mathbf{u}_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \qquad \mathbf{u}_4 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$$
Any vector x in S^n can be uniquely expressed as a linear combination of the n basis vectors u_i:

$$\mathbf{x} = \sum_{i=1}^{n} x_i \mathbf{u}_i \qquad (29.2)$$

where the n vectors (u_1 ... u_n) form a basis of S^n. The n coefficients (x_1 ... x_n) are called the coordinates of the vector x in the basis (u_1 ... u_n). For example, a vector x in S^4 may be expressed in the previously defined usual basis as the vector sum:

$$\mathbf{x} = 3\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} + (-4)\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} + 0\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} + 2\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$$
where the coefficients 3, -4, 0 and 2 are the coordinates of x in the usual basis. Another basis of S^4 can be defined by means of the set of vectors:

$$\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \qquad \begin{bmatrix} 1 \\ -1 \\ -1 \\ 1 \end{bmatrix} \qquad \begin{bmatrix} -1 \\ 1 \\ -1 \\ 1 \end{bmatrix} \qquad \begin{bmatrix} -1 \\ -1 \\ 1 \\ 1 \end{bmatrix}$$

and the same vector x which we have defined above can be expressed in this particular basis by means of the vector sum:

$$\mathbf{x} = 0.25\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} + 2.25\begin{bmatrix} 1 \\ -1 \\ -1 \\ 1 \end{bmatrix} + (-1.25)\begin{bmatrix} -1 \\ 1 \\ -1 \\ 1 \end{bmatrix} + 0.75\begin{bmatrix} -1 \\ -1 \\ 1 \\ 1 \end{bmatrix}$$
where the coefficients 0.25, 2.25, -1.25 and 0.75 now represent the coordinates of x in this particular basis. From the above it follows that a vector can be expressed by different coordinates according to the particular basis that has been chosen. In multivariate data analysis one often changes the basis in order to highlight particular properties of the vectors that are represented in it. This automatically causes a change of their coordinates. A change of basis and its effect on the coordinates can be defined algebraically, as is shown in Chapters 31 and 32.
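The coordinates of x in a new basis can be obtained by solving a linear system whose coefficient matrix holds the basis vectors as columns. A minimal sketch (added for illustration; it assumes NumPy) reproduces the coefficients 0.25, 2.25, -1.25 and 0.75 from the example above:

```python
import numpy as np

x = np.array([3, -4, 0, 2])
# Columns of B are the four basis vectors given above.
B = np.column_stack(([1, 1, 1, 1], [1, -1, -1, 1],
                     [-1, 1, -1, 1], [-1, -1, 1, 1]))
coords = np.linalg.solve(B, x)  # coordinates of x in the basis B
print(coords)                   # [ 0.25  2.25 -1.25  0.75]
```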
29.2 Geometrical properties of vectors

Every n-vector can be represented as a point in an n-dimensional coordinate space. The n elements of the vector are the coordinates along n basis vectors, such as defined in the previous section. The null vector 0 defines the origin of the coordinate space. Note that the origin together with an endpoint define a directed line segment or axis, which also represents a vector. Hence, there is an equivalence between points and axes, which can both be thought of as geometrical representations of vectors in coordinate space. (The concepts discussed here are extensions of those covered previously in Sections 9.2.4 to 9.2.5.) In this and subsequent sections we will make frequent use of the scalar product (also called inner product) between two vectors x and y with the same dimension n, which is defined by:

$$\mathbf{x}^{\mathsf{T}}\mathbf{y} = \sum_{i=1}^{n} x_i y_i \qquad (29.3)$$
By way of example, if x = [2, 1, 3, -1]^T and y = [4, 2, -1, 5]^T then we obtain that the scalar product of x with y equals:

x^T y = 2×4 + 1×2 + 3×(-1) + (-1)×5 = 2

Note that in Section 9.2.2.3 the dot product x·y is used as an equivalent notation for the scalar product x^T y. In Euclidean space we define the squared distance from the origin of a point x by means of the scalar product of x with itself:

$$\mathbf{x}^{\mathsf{T}}\mathbf{x} = \sum_{i=1}^{n} x_i^2 = \|\mathbf{x}\|^2 \qquad (29.4)$$

where ||x|| is to be read as the norm or length of vector x. Likewise, the squared distance between two points x and y is given by the expression:

$$(\mathbf{x}-\mathbf{y})^{\mathsf{T}}(\mathbf{x}-\mathbf{y}) = \sum_{i=1}^{n} (x_i - y_i)^2 = \|\mathbf{x}-\mathbf{y}\|^2 \qquad (29.5)$$

where ||x - y|| is the norm or length of the vector x - y. Note that the expression of distance from the origin in eq. (29.4) can be derived from that of distance between two points in eq. (29.5) by replacing the vector y by the null vector 0:

x^T x = (x - 0)^T (x - 0) = ||x - 0||^2 = ||x||^2

Given the vectors x = [2, 1, 3, -1]^T and y = [4, 2, -1, 5]^T, we derive the norms:

||x||^2 = 2^2 + 1^2 + 3^2 + (-1)^2 = 15
||y||^2 = 4^2 + 2^2 + (-1)^2 + 5^2 = 46
||x - y||^2 = (2-4)^2 + (1-2)^2 + (3-(-1))^2 + (-1-5)^2 = 57

It may be noted that some authors define the norm of a vector x as the square of ||x|| rather than ||x|| itself, e.g. Gantmacher [2]. Angular distance or angle between two points x and y, as seen from the origin of space, is derived from the definition of the scalar product in terms of the norms of the vectors:

$$\mathbf{x}^{\mathsf{T}}\mathbf{y} = \sum_{i=1}^{n} x_i y_i = \|\mathbf{x}\|\,\|\mathbf{y}\|\cos\vartheta \qquad (29.6)$$

where ϑ represents the angular distance between the vectors x and y. The geometrical interpretation of the scalar product of the vectors x and y is that of an arithmetic product of the length of y, i.e. ||y||, with the projection of x upon y, i.e. ||x|| cos ϑ (Fig. 29.1). From the expression in eq. (29.6) we derive that cos ϑ
Fig. 29.1. Geometrical interpretation of the scalar product x^T y as the projection of the vector x upon the vector y. The lengths of x and y are denoted by ||x|| and ||y||, respectively, and their angular separation is denoted by ϑ.
equals the scalar product of the normalized vectors x and y:

$$\cos\vartheta = \frac{\mathbf{x}^{\mathsf{T}}\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|} = (\mathbf{x}/\|\mathbf{x}\|)^{\mathsf{T}}(\mathbf{y}/\|\mathbf{y}\|) \qquad (29.7)$$

Note that normalization of an arbitrary vector x is obtained by dividing each of its elements by the norm ||x|| of the vector. The geometric properties of vectors can be combined into the triangle relationship, also called the cosine rule, which states that:

$$\|\mathbf{x}-\mathbf{y}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - 2\|\mathbf{x}\|\,\|\mathbf{y}\|\cos\vartheta \qquad (29.8)$$

This relationship is of importance in multivariate data analysis as it relates distance between endpoints of two vectors to distances and angular distance from the origin of space. A geometrical interpretation is shown in Fig. 29.2. Using the vectors x and y from our previous illustration, we derive that:

cos ϑ = 2 / √(15 × 46) = 0.0761

or:

ϑ = 85.64 degrees

One can define three special configurations of two vectors, namely parallel in the same direction, parallel in opposite directions, and orthogonal (or perpendicular). The three special configurations depend on the angular distances between the two vectors, being 0, 180 and 90 degrees respectively (Fig. 29.3). More generally, two vectors x and y are orthogonal when their scalar product is zero:
Fig. 29.2. Distance 11 x - yl I between two vectors x and y of length 11 xl I and 11 yl I, separated by an angle i^.
f^
^
x y = 11x11 llyll
180° V
0 0
V
>
x y = -11x11 llyll
X
/^y
tki
T
X y=0
Fig. 29.3. Three special configurations of two vectors x and y and their corresponding scalar product x^y. Angular separations of x and y are 0, 180 and 90 degrees, respectively.
x^y = 0
(29.9)
It can be shown that the vectors x = [2, - 1 , 8, 0]''' and y = [10, 4, -2, 3]^ are orthogonal, since: x'ry = [2,-l,8,0][10,4,-2,3]T = 2 x l 0 + (-l)x4 + 8x(-2) + 0 x 3 = 0
14
Hence cos iJ equals 0, or equivalently ^ equals 90 degrees. Two orthogonal vectors are orthonormal when, in addition to orthogonality, the norms of these vectors are equal to one: llxll = llyll = l or equivalently: x^x = y V = l
(29.10)
The n basis vectors which define the basis of a coordinate space 5" are n mutually orthogonal and normalized vectors. Together they form a frame of reference axes for that space. If we represent by x and y the arithmetic means of the elements of the vectors x and y: 1 "
^ ' 1 " y = -Yuyi ^
(29.11)
i
then we can relate the norms of the vectors (x - x) and (y - y) to the standard deviations s^ and s^ of the elements in x and y (Section 2.1.4): 1 "
1
^x =-Y,{x,-xy
=-\\x-x\\^
n ,
n
(29.12)
Note that in data analysis we divide by n in the definition of standard deviation rather than by the factor n - 1 which is customary in statistical inference. Likewise we can relate the product-moment (or Pearson) coefficient of correlation r (Section 8.3.1) to the scalar product of the vectors (x - x) and {y -y): n
-
fn
'
n
X(x,-J)2 SCy.-y)'
V'^
- ( ' ' - • ^ ) ^ y - ^ = cos(p IIX-JII lly-yil
(29.13)
15
where cp is the angular distance between the vectors (x - x) and (y - y).
29.3 Matrices A matrix is defined as an ordered rectangular arrangement of scalars into horizontal rows and vertical columns (Section 9.3). On the one hand, one can consider a matrix X with n rows and/? columns as an ordered array ofp vectors of dimension n, each of the form:
with J = 1, ...,/7
^./ =
On the other hand, one can also regard the same matrix X as an ordered array of n vectors of dimension p, each of the form: with / = 1, ..., n
X • — [X-1 , . . . , X;„
In our notation, x^j represents the element of matrix X at the crossing of row / and column j . The vector Xy defines a vector which contains the n elements of the jth column of X. The vector x, refers to a vector which comprises the/? elements of the /th row of X. In the matrix X of the following example:
X=
3 2 0 -1
2 1 -4 2
-2 0 -3 4
we denote the second column by means of the vector Xy^2*
X,=2=i
2 1 -4 2
16
1 1
P
X
T r
''ij nI
''j p
Fig. 29.4. Schematic representation of a matrix X as a stack of horizontal rows x,- , and as an assembly of vertical columns x,.
and the third row by means of the vector x^^-^'-
In the illustration of Fig. 29.4 we regard the matrix X as either built up from n horizontal rows x^ of dimension p, or as built up from p vertical columns x^ of dimension n. This exemplifies the duality of the interpretation of a matrix [9]. From a geometrical point of view, and according to the concept of duality, we can interpret a matrix with n rows and p columns either as a pattern of n points in a /7-dimensional space, or as a pattern of p points in an n-dimensional space. The former defines a row-pattern P"" in column-space SP, while the latter defines a column-pattern P^ in row-space 5^". The two patterns and spaces are called dual (or conjugate). The term dual space also possesses a specific meaning in another
17
1 1
1
p
X
T -^1
Xij
n
Fig. 29.5. Geometrical interpretation of an nxp matrix X as either a row-pattern of n points P" in /7-dimensional column-space S^ (left panel) or as a column-pattern of p points P^ in n-dimensional row-space S" (right panel). The/? vectors u, form a basis of 5^ and the n vectors v, form a basis of 5".
mathematical context which is distinct from the one which is implied here. The occasion for confusion, however, is slight. In Fig. 29.5, the column-space S^ is represented as a/^-dimensional coordinate space in which each row x^ of X defines a point with coordinates {xn,.., x^p.., x^^) in an orthonormal basis (u^,.., Uy,.., u^ such that:
with:
rr 0
'0'
'
" y
=
0
1
and
u,
0
Each element x^j can thus be reconstructed from the scalar product: Xij=xJ
u . = u j X.
(29.14)
In the same Fig. 29.5, the row-space S"^ is shown as an n-dimensional coordinate space in which each column Xj of X defines a point with coordinates (x^j,.., x-j,.., x^p in an orthonormal basis (Vj,.., v,,.., v„) such that: Xj=X,j
V, -\-,., + X,j V,- + . . . + X„^. V„
with:
"o"
T 0
'
V,. =
0
1
and
0
Each element x^j can also be reconstructed from the scalar product: X^j — \ • Xj
— Xj
\ •
(29.15)
The dimension of a matrix X with n rows and p columns is nxp (pronounced n by p). Here X is referred to as an nxp matrix or as a matrix of dimension nxp. A matrix is called square if the number of rows is equal to the number of columns. Such a matrix can be referred to as a pxp matrix or as a square matrix of dimension p. The transpose of a matrix X is obtained by interchanging its rows and columns and is denoted by X^. If X is an nxp matrix then X^ is apxn matrix. In particular, we have that: T\T (X^)'
_
(29.16)
19
The transpose of the matrix X in the previous example is given below: 0 3 2 2 1 -4
-1 2
0
4
•2
-3
A square matrix A is called symmetric if: (29.17)
AT = A
In a square nxn matrix A, the main diagonal or principal diagonal consists of the elements a^ for all / ranging from 1 to n. The latter are called the diagonal elements; all other elements are off-diagonal. A diagonal matrix D is a square matrix in which all off-diagonal elements are zero, i.e.: J, = 0
if
with iandj = 1, ..., n.
i^j
An identity matrix I is a diagonal matrix in which all diagonal elements are equal to unity and all off-diagonal elements are zero, i.e.: and
'//-!
with j^f = 1, ..., n.
ijj, = 0
The following 3x3 matrices illustrate a square, a diagonal and an identity matrix: A-
3 2
2
-2
0
o"
1 -4
3 0 D= 0
-3
0
0"
'l
0
1= 0
0 -3_
0
1
0 0" 1 0 0
1_
Special matrices are the null matrix 0 in which all elements are zero, and the sum matrix 1 in which all elements are unity. In the case of 3x3 matrices we obtain: "0 0 0"
"1 1 r
0= 0 0 0 0 0 0
1= 1 1 1 1 1 1
29.4 Matrix product Chapter 9 dealt with the basic operations of addition of two matrices with the same dimensions, of scalar multiplication of a matrix with a constant, and of arithmetic multiplication element-by-element of two matrices with the same
20
dimensions. Here, we formalize the properties of the matrix product that have already been introduced in Section 9.3.2.3. If X is of dimension nxp and Y is of dimension pxq, then the product Z = XY is an nxq matrix, the elements of which are defined by: ^ik ~
with / = 1, ..., n and /: = 1,..., q
Aj^ij yjk
(29.18)
Note that the inner dimensions of X and Y must be equal. For this reason the operation is also called inner product, as the inner dimensions of the two terms vanish in the product. Any element of the product, say z,^, can also be thought of as being the sum of the products of the corresponding elements of row / of X with those of column k of Y. Hence the descriptive name of rows-by-columns product. In terms of the scalar product (Section 29.2) we can write: (29.19)
= X
Throughout the book, matrices are often subscripted with their corresponding dimensions in order to provide a check on the conformity of the inner dimensions of matrix products. For example, when a 4×3 matrix X is multiplied with a 3×2 matrix Y rows-by-columns, we obtain a 4×2 matrix Z:

$$\mathbf{X} = \begin{bmatrix} 3 & 2 & -1 \\ 0 & -5 & 4 \\ 2 & 1 & -2 \\ 4 & 3 & 0 \end{bmatrix} \qquad \mathbf{Y} = \begin{bmatrix} 3 & 8 \\ -2 & 4 \\ 4 & -6 \end{bmatrix} \qquad \mathbf{Z} = \mathbf{XY} = \begin{bmatrix} 1 & 38 \\ 26 & -44 \\ -4 & 32 \\ 6 & 44 \end{bmatrix}$$
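The rows-by-columns product of eqs. (29.18) and (29.19) maps directly onto the matrix multiplication operator, as in this brief sketch (an added illustration assuming NumPy):

```python
import numpy as np

X = np.array([[3, 2, -1], [0, -5, 4], [2, 1, -2], [4, 3, 0]])
Y = np.array([[3, 8], [-2, 4], [4, -6]])
Z = X @ Y              # rows-by-columns product, eq. (29.18)
print(Z)               # [[ 1  38] [ 26 -44] [ -4  32] [  6  44]]
print(X[1] @ Y[:, 1])  # z_22 = -44 as a scalar product of row 2 and column 2, eq. (29.19)
```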
In this theoretical chapter, however, we do not follow this convention of subscripted matrices for the sake of conciseness of notation. Instead, we will take care to indicate the dimensions of matrices in the accompanying text whenever this is appropriate. The operation of matrix multiplication can be shown to be associative, meaning that X(YZ) = (XY)Z. But it is not commutative, as in general we will have that XY ≠ YX. Matrix multiplication is distributive with respect to matrix addition, which implies that (X + Y)Z = XZ + YZ. When this expression is read from right to left, the process is called factoring-out [4]. Multiplication of an n×p matrix X with an identity matrix leaves the original matrix unchanged:

$$\mathbf{X}\mathbf{I}_p = \mathbf{I}_n\mathbf{X} = \mathbf{X} \qquad (29.20)$$

where I_p is the identity matrix with dimension p, and I_n is the identity matrix with dimension n. For example, by working out the rows-by-columns product one can easily verify that:

$$\begin{bmatrix} 3 & 4 \\ 2 & -1 \\ 0 & 3 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} 3 & 4 \\ 2 & -1 \\ 0 & 3 \end{bmatrix} = \begin{bmatrix} 3 & 4 \\ 2 & -1 \\ 0 & 3 \end{bmatrix}$$
A matrix is orthogonal if the product with its own transpose produces a diagonal matrix. An orthogonal matrix of dimension n×p satisfies one or both of the following relationships:

$$\mathbf{X}\mathbf{X}^{\mathsf{T}} = \mathbf{D}_n \qquad \text{or} \qquad \mathbf{X}^{\mathsf{T}}\mathbf{X} = \mathbf{D}_p \qquad (29.21)$$

where D_n is a diagonal matrix of dimension n and D_p is a diagonal matrix of dimension p. In an orthogonal matrix we find that all rows or all columns of the matrix are mutually orthogonal as defined above in Section 29.2. In the former case we state that X is row-orthogonal, while in the latter case X is said to be column-orthogonal. The following 3×3 matrix X can be shown to be column-orthogonal:

$$\mathbf{X} = \begin{bmatrix} 2.351 & 0.060 & 0.686 \\ 4.726 & 1.603 & -0.301 \\ -1.840 & 4.196 & 0.105 \end{bmatrix}$$

as can be seen by working out the matrix products:

$$\mathbf{X}^{\mathsf{T}}\mathbf{X} = \begin{bmatrix} 31.248 & 0 & 0 \\ 0 & 20.180 & 0 \\ 0 & 0 & 0.572 \end{bmatrix}$$
A matrix is called orthonormal if additionally we obtain that:

$$\mathbf{X}\mathbf{X}^{\mathsf{T}} = \mathbf{I}_n \qquad \text{or} \qquad \mathbf{X}^{\mathsf{T}}\mathbf{X} = \mathbf{I}_p \qquad (29.22)$$

where I_n and I_p have been defined before. In an orthonormal matrix X, all row-vectors or all column-vectors are mutually orthonormal. In the former case, X is row-orthonormal, while in the latter case we state that X is column-orthonormal. A square matrix U is orthonormal if we can write that:

$$\mathbf{U}\mathbf{U}^{\mathsf{T}} = \mathbf{U}^{\mathsf{T}}\mathbf{U} = \mathbf{I} \qquad (29.23)$$

where I is the identity matrix with the same dimension as U (or U^T). The 3×3 matrix U shown below is both row- and column-orthonormal:

$$\mathbf{U} = \begin{bmatrix} 0.4205 & 0.0134 & 0.9072 \\ 0.8455 & 0.3569 & -0.3972 \\ -0.3291 & 0.9340 & 0.1388 \end{bmatrix}$$

as can be seen by working out the matrix products:

$$\mathbf{U}\mathbf{U}^{\mathsf{T}} = \mathbf{U}^{\mathsf{T}}\mathbf{U} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
An important property of the matrix product is that the transpose of a product is equal to the product of the transposed terms in reverse order:

$$(\mathbf{XY})^{\mathsf{T}} = \mathbf{Y}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}} \qquad (29.24)$$

This property can be readily verified by means of an example:

$$\left(\begin{bmatrix} 3 & 2 & -1 \\ 0 & -5 & 4 \\ 2 & 1 & -2 \\ 4 & 3 & 0 \end{bmatrix}\begin{bmatrix} 3 & 8 \\ -2 & 4 \\ 4 & -6 \end{bmatrix}\right)^{\mathsf{T}} = \begin{bmatrix} 1 & 38 \\ 26 & -44 \\ -4 & 32 \\ 6 & 44 \end{bmatrix}^{\mathsf{T}} = \begin{bmatrix} 1 & 26 & -4 & 6 \\ 38 & -44 & 32 & 44 \end{bmatrix}$$

and

$$\begin{bmatrix} 3 & -2 & 4 \\ 8 & 4 & -6 \end{bmatrix}\begin{bmatrix} 3 & 0 & 2 & 4 \\ 2 & -5 & 1 & 3 \\ -1 & 4 & -2 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 26 & -4 & 6 \\ 38 & -44 & 32 & 44 \end{bmatrix}$$
The trace of a square matrix A of dimension n is equal to the sum of the n elements on the main diagonal:

$$\mathrm{tr}(\mathbf{A}) = \sum_{i=1}^{n} a_{ii} \qquad (29.25)$$

For example:

$$\mathrm{tr}\begin{bmatrix} 4 & 2 & -1 \\ 2 & 8 & -4 \\ 1 & -4 & 3 \end{bmatrix} = 4 + 8 + 3 = 15$$

If X is of dimension n×p and if Y is of dimension p×n, then we can show that:
$$\mathrm{tr}(\mathbf{XY}) = \mathrm{tr}(\mathbf{YX}) \qquad (29.26)$$

In particular, we can prove that:

$$\mathrm{tr}(\mathbf{X}^{\mathsf{T}}\mathbf{X}) = \mathrm{tr}(\mathbf{X}\mathbf{X}^{\mathsf{T}}) = \sum_{i=1}^{n}\sum_{j=1}^{p} x_{ij}^2 = \sum_{i=1}^{n}\mathbf{x}_i^{\mathsf{T}}\mathbf{x}_i = \sum_{j=1}^{p}\mathbf{x}_j^{\mathsf{T}}\mathbf{x}_j \qquad (29.27)$$

where x_i represents the ith row and x_j denotes the jth column of the n×p matrix X. The proof follows from working out the products in the manner described above. This relationship is important in the case when Y equals X^T.

Matrix multiplication can be applied to vectors, if the latter are regarded as one-column matrices. This way, we can distinguish between four types of special matrix products, which are explained below and which are represented schematically in Fig. 29.6.

(1) In the matrix-by-vector product one postmultiplies an n×p matrix X with a p-vector y which results in an n-vector z:
Xy = z      (29.28)

For example:

[3  2 -1]  [ 3]   [ 1]
[0 -5  4]  [-2] = [26]
[2  1 -2]  [ 4]   [-4]
[4  3  0]         [ 6]
(2) The vector-by-matrix product involves an n vector x^T which premultiplies an nxp matrix Y to yield a p vector z^T:

x^T Y = z^T      (29.29)

For example:
Fig. 29.6. Schematic illustration of four types of special matrix products: the matrix-by-vector product, the vector-by-matrix product, the outer product and the scalar product between vectors, respectively from top to bottom.
          [ 3  2]
[3 2 -1]  [-2  1] = [1  5]
          [ 4  3]
(3) The outer product results from premultiplying an n vector x with a p vector y^T, yielding an nxp matrix Z:

x y^T = Z      (29.30)

The outer product of two vectors can be thought of as the matrix product of a single-column matrix with a single-row matrix:
[ 2]              [ 2x3      2x(-2)     2x4  ]   [  6  -4   8]
[-5] [3 -2 4]  =  [(-5)x3   (-5)x(-2)  (-5)x4] = [-15  10 -20]
[ 1]              [ 1x3      1x(-2)     1x4  ]   [  3  -2   4]
[ 3]              [ 3x3      3x(-2)     3x4  ]   [  9  -6  12]
For the purpose of completeness, we also mention the vector product which is extensively used in physics and which is defined as:

x × y = z      (29.31)

(read as x cross y) where x, y and z have the same dimension n. Geometrically, we can regard x and y as two vectors drawn from the origin of S^n and forming an angle (x, y) between them. The resulting vector product z is perpendicular to the plane formed by x and y, and has a length defined by:

||z|| = ||x|| ||y|| sin(x, y)      (29.32)
(4) In the scalar product, which we described in Section 29.2, one multiplies a vector x^T with another vector y of the same dimension, which produces a scalar z:

x^T y = z      (29.33)

For example:

             [-1]
[3 0 2 -1]   [ 4] = 3x(-1) + 0x4 + 2x(-2) + (-1)x0 = -7
             [-2]
             [ 0]
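The four special products of Fig. 29.6 translate directly into code. A minimal sketch (ours, assuming NumPy), using the vectors and matrices of the examples above:

```python
import numpy as np

X = np.array([[3,  2, -1],
              [0, -5,  4],
              [2,  1, -2],
              [4,  3,  0]])
y = np.array([3, -2, 4])
x = np.array([3, 2, -1])
Y = np.array([[ 3, 2],
              [-2, 1],
              [ 4, 3]])

print(X @ y)                        # matrix-by-vector (29.28): [ 1 26 -4  6]
print(x @ Y)                        # vector-by-matrix (29.29): [1 5]
print(np.outer([2, -5, 1, 3], y))   # outer product (29.30), a 4x3 matrix
print(np.dot([3, 0, 2, -1], [-1, 4, -2, 0]))  # scalar product (29.33): -7
```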
The product of a matrix with a diagonal matrix is used to multiply the rows or the columns of a matrix with given constants. If X is an nxp matrix and if D_n is a diagonal matrix of dimension n, we obtain a product Y in which the ith row y_i equals the ith row of X, i.e. x_i, multiplied by the ith element on the main diagonal of D_n:

Y = D_n X      (29.34)

and in particular for the ith row:

y_i = d_ii x_i   with   i = 1, ..., n

The matrix D_n effectively scales the rows of X. For example:

[-2 0 0 0]  [3  2 -1]   [-6 -4  2]
[ 0 1 0 0]  [0 -5  4] = [ 0 -5  4]
[ 0 0 3 0]  [2  1 -2]   [ 6  3 -6]
[ 0 0 0 2]  [4  3  0]   [ 8  6  0]
Likewise, if D_p is a diagonal matrix of dimension p, we obtain a product Y in which the jth column y_j equals the jth column of X, i.e. x_j, multiplied by the jth element on the main diagonal of D_p:

Y = X D_p      (29.35)

and in particular for the jth column:

y_j = d_jj x_j   with   j = 1, ..., p

The matrix D_p effectively scales the columns of X. For example:

[3  2 -1]  [2  0 0]   [6 -2 -3]
[0 -5  4]  [0 -1 0] = [0  5 12]
[2  1 -2]  [0  0 3]   [4 -1 -6]
[4  3  0]             [8 -3  0]
Pre- or postmultiplication with a diagonal matrix is useful in data analysis for scaling rows or columns of a matrix, e.g. such that after scaling all rows or all columns possess equal sums of squares.
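For instance, scaling all rows or all columns to unit sums of squares amounts to pre- or postmultiplication by a suitable diagonal matrix. A brief sketch of ours (assuming NumPy):

```python
import numpy as np

X = np.array([[3.,  2., -1.],
              [0., -5.,  4.],
              [2.,  1., -2.],
              [4.,  3.,  0.]])

# Premultiply by D_n = diag(1/||x_i||): every row obtains unit sum of squares.
row_norms = np.sqrt((X**2).sum(axis=1))
Y = np.diag(1.0 / row_norms) @ X
print((Y**2).sum(axis=1))   # [1. 1. 1. 1.]

# Postmultiply by D_p = diag(1/||x_j||) to scale the columns instead.
col_norms = np.sqrt((X**2).sum(axis=0))
Z = X @ np.diag(1.0 / col_norms)
print((Z**2).sum(axis=0))   # [1. 1. 1.]
```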
29.5 Dimension and rank

It has been shown that the p columns of an nxp matrix X generate a pattern of p points in S^n which we call P^p. The dimension of this pattern is called rank and is indicated by r(P^p). It is equal to the number of linearly independent vectors from which all p columns of X can be constructed as linear combinations. Hence, the rank of P^p can be at most equal to p. Geometrically, the rank of P^p can be seen as the minimum number of dimensions that is required to represent the p points in the pattern together with the origin of space. Linear dependences among the p columns of X will cause coplanarity of some of the p vectors and hence reduce the minimum number of dimensions. The same matrix X also generates a pattern of n points in S^p which we call P^n and which is generated by the n rows of X. The rank of P^n is denoted as r(P^n) and equals the number of linearly independent vectors from which all n rows of X can be produced as linear combinations. Hence, the rank of P^n can be at most equal to n. Using the same geometrical arguments as above, one can regard the rank of P^n as the minimum number of dimensions that is required to represent the n points in the pattern together with the origin of space. Linear dependences between the n rows of X will also reduce this minimum number of dimensions because of coplanarity of some of the corresponding n vectors. It can be shown that the rank of P^n must be equal to that of P^p and, hence, that the rank of X is at most equal to the smaller of n and p [3]:

r(P^n) = r(P^p) = r(X) ≤ min(n, p)      (29.36)

where r(X) is the rank of the matrix X (see also Section 9.3.5). For example, the rank of a 4x3 matrix can be at most equal to 3. In the case when there are linear dependences among the rows or columns of the matrix, the rank can be even smaller. An nxp matrix X with n > p is called singular if linear dependences exist between the columns of X; otherwise the matrix is called non-singular. In this case the rank of X equals p minus the number of linear dependences among the columns of X. If n < p, then X is singular if linear dependences exist between the rows of X; otherwise X is non-singular. In that case, the rank of X equals n minus the number of linear dependences among the rows of X. A matrix is said to be of full rank when X is non-singular or, alternatively, when r(X) equals the smaller of n or p. Dimensions and rank of a matrix are distinct concepts. A matrix can have relatively large dimensions, say 100x50, but its rank can be small in comparison with its dimensions. This point can be made more clearly in geometrical terms. In a 100-dimensional row-space S^100, it is possible to represent the 50 columns of the matrix as 50 points, the coordinates of which are defined by the 100 elements in each of them. These 50 points form a pattern which we represent by P^50. It is clear
that the true dimension of this pattern of 50 points must be less than the number of coordinate axes of the space S^100 in which they are represented. In fact, it cannot be larger than 50. The true dimension of the pattern P^50 defines its rank. In an extreme case when all 50 points are located at the origin of S^100, the rank is zero. In another extreme situation we may obtain that all 50 points are collinear on a line through the origin, in which case the rank is one. All 50 columns may be coplanar in a plane that comprises the origin, which results in a rank of 2. In practical situations, we often find that the points form patterns of low rank, when the data are sufficiently filtered to eliminate random variation and artifacts. Multivariate data analysis capitalizes on this point, and in a subsequent section on eigenvectors we will deal with the algebra which allows us to find the true number of dimensions or rank of a pattern in space. We can now define the rank of the column-pattern P^50 as the number of linearly independent columns or rank of X. If all 50 points are coplanar, then we can reconstruct each of the 50 columns by means of linear combinations of two independent ones. For example, if x_{j=1} and x_{j=2} are linearly independent then we must have 48 linear dependences among the 50 columns of X:

x_{j=3} = c_13 x_{j=1} + c_23 x_{j=2}
x_{j=4} = c_14 x_{j=1} + c_24 x_{j=2}
...
x_{j=50} = c_{1,50} x_{j=1} + c_{2,50} x_{j=2}

where c_13, c_23, ... are the coefficients of the linear combinations. The rank of X and hence the rank of the column-pattern P^50 is thus equal to 50 − 48 = 2. In this case it appears that 48 of the 50 columns of X are redundant, and that a judicious choice of two of them could lead to a substantial reduction of the dimensionality of the data. The algebraic approach to this problem is explained in Section 29.6 on eigenvectors. A similar argument can be developed for the dual representation of X, i.e. as a row-pattern of 100 points P^100 in a 50-dimensional column-space S^50. Here again, it is evident that the rank of P^100 can at most be equal to 50 (as it is embedded in a 50-dimensional space). This implies that in our illustration of a 100x50 matrix, we must of necessity have at least 100 − 50 = 50 linear dependences among the rows of X. In other words, we can eliminate 50 of the 100 points without affecting the rank of the row-pattern of points in S^50. Let us assume that this is obtained by reducing the 100x50 matrix X into a 50x50 matrix X'. The resulting pattern of the rows in X' now comprises only 50 points instead of 100 and is denoted here by the symbol P^{100−50}, which has the same rank as P^100. Since we previously assumed 48 linear
dependences among the columns of X, we must necessarily also have 48 additional linear dependences among the rows of X'. Hence the rank of P^{100−50} in S^50 is also equal to 2. Summarizing the results obtained in the dual spaces we can write that:

r(P^100) = r(P^50) = r(X) = 2

We will attempt to clarify this difficult concept by means of an example. In the 4x3 matrix X we have an obvious linear dependence among the columns:

X = [3  5  8]
    [7  2  9]
    [1  2  3]
    [2 -1  1]

since x_{j=3} = x_{j=1} + x_{j=2}. By simple algebra we can derive that there is also a linear dependence among the rows of X, namely:

x_{i=4} = -(11/29) x_{i=1} + (13/29) x_{i=2}

Hence we can remove the fourth row of X without affecting its rank, which results in the 3x3 matrix X':

X' = [3  5  8]
     [7  2  9]
     [1  2  3]

Note that the linear dependence among the columns still persists. We now show that there remains a second linear dependence among the rows of X':

x_{i=3} = (12/29) x_{i=1} - (1/29) x_{i=2}

Thus we have illustrated that the number of independent rows, the number of independent columns and the rank of the matrix are all identical. Hence, from geometrical considerations, we conclude that the ranks of the patterns in row- and column-space must also be equal. The above illustration is also rendered geometrically in Fig. 29.7. The rank of a product of two matrices X and Y is equal to the smallest of the ranks
Fig. 29.7. Illustration of a pattern of points with rank of 2. The pattern is represented by a matrix X with dimensions 5x4, and a linear dependence between the columns of X is assumed. The rank is shown to be the smallest number of dimensions required to represent the pattern in column-space and in row-space.
of X and Y:

r(XY) = min(r(X), r(Y))      (29.37)

This follows from the fact that the columns of XY are linear combinations of the columns of X and that the rows of XY are linear combinations of the rows of Y [3]. From the above property, it follows readily that:

r(X X^T) = r(X^T X) = r(X)      (29.38)

where the products X X^T and X^T X are of special interest in data analysis as will be explained in Section 29.7.
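In practice the rank is determined numerically. A small sketch of ours (assuming NumPy), applied to the 4x3 example above:

```python
import numpy as np

# The 4x3 matrix X of the example: column 3 = column 1 + column 2,
# so only two columns are linearly independent.
X = np.array([[3,  5, 8],
              [7,  2, 9],
              [1,  2, 3],
              [2, -1, 1]])

print(np.linalg.matrix_rank(X))        # 2
print(np.linalg.matrix_rank(X.T @ X))  # 2, in agreement with eq. (29.38)
print(np.linalg.matrix_rank(X @ X.T))  # 2
```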
29.6 Eigenvectors and eigenvalues

A square matrix A of dimension p is said to be positive definite if:

x^T A x > 0      (29.39)

for all non-trivial p vectors x (i.e. vectors that are distinct from 0). The matrix A is said to be positive semi-definite if:

x^T A x ≥ 0   for all x ≠ 0      (29.40)

It can be shown that all symmetric matrices of the form X^T X and X X^T are positive semi-definite [2]. These cross-product matrices include the widely used dispersion matrices, which can take the form of a variance-covariance or correlation matrix, among others (see Section 29.7). An eigenvalue or characteristic root of a symmetric matrix A of dimension p is a root λ² of the characteristic equation:

|A − λ²I| = 0      (29.41)

where |A − λ²I| means the determinant of the matrix A − λ²I [2]. The determinant in this equation can be developed into a polynomial of degree p of which all p roots λ² are real. Additionally, if A is positive semi-definite then all roots are non-negative. Furthermore, it can be shown that the sum of the eigenvalues is equal to the trace of the symmetric matrix A:

Σ_k λ²_k = tr(A)      (29.42)

and that the product of the eigenvalues is equal to the determinant of the symmetric matrix A:

Π_k λ²_k = |A|      (29.43)
By way of example we construct a positive semi-definite matrix A of dimensions 2x2 from which we propose to determine the characteristic roots. The square matrix A is derived as the product of a rectangular matrix X with its transpose in order to ensure symmetry and positive semi-definiteness:

                    [ 2 -1]
A = X^T X, with X = [-5  4]
                    [ 1 -2]
                    [ 3  0]

which yields:

A = [ 39 -24]
    [-24  21]

from which follows the characteristic equation:

|A − λ²I| = | 39−λ²    −24  |
            |  −24    21−λ² | = 0

The determinant in this characteristic equation can be developed according to the methods described in Section 9.3.4:

|A − λ²I| = (39 − λ²)(21 − λ²) − (−24)(−24) = 0

which leads to a quadratic equation in λ²:

(λ²)² − (21 + 39)λ² + 39x21 − 24x24 = 0

or

(λ²)² − 60λ² + 243 = 0
From the form of this equation we deduce that the characteristic equation has two positive roots:

λ²_{1,2} = 30 ± (30² − 243)^{1/2} = 30 ± 25.632

or

λ²_1 = 55.632   and   λ²_2 = 4.368

It can be easily verified that:

Σ_k λ²_k = tr(A) = 60   and   Π_k λ²_k = |A| = 243
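The same characteristic roots are obtained numerically. A short check of ours (assuming NumPy):

```python
import numpy as np

A = np.array([[ 39., -24.],
              [-24.,  21.]])

# eigh is the appropriate routine for symmetric matrices;
# it returns the eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(A)
print(eigenvalues)         # [ 4.368 55.632] (approximately)
print(eigenvalues.sum())   # 60.0 = tr(A)
print(eigenvalues.prod())  # 243.0 = |A|
```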
If A is a symmetric positive definite matrix then we obtain that all eigenvalues are positive. As we have seen, this occurs when all columns (or rows) of the matrix A are linearly independent. Conversely, a linear dependence in the columns (or rows) of A will produce a zero eigenvalue. More generally, if A is symmetric and positive semi-definite of rank r < p, then A possesses r positive eigenvalues and (p − r) zero eigenvalues [6]. In the previous section we have seen that when A has the form of the product of a matrix X with its transpose, then the rank of A is the same as the rank of X. This can be easily demonstrated by means of a simplified illustration:

                    [ 1 -2]
A = X^T X, with X = [-3  6]
                    [-1  2]
                    [ 2 -4]

which yields:

A = [ 15 -30]
    [-30  60]

Note that there is a linear dependence in X which is transmitted to the matrix of cross-products A:

x_{j=2} = -2 x_{j=1}

The singularity of A can also be ascertained by inspection of the determinant |A| which in this case equals zero. As a result, the characteristic equation has the form of a degenerated quadratic:

(λ²)² − (15 + 60)λ² + 15x60 − 30x30 = 0

The last term of the characteristic equation is always equal to the determinant of A, which in this case equals zero. Hence we obtain:

(λ²)² − 75λ² = 0

which leads to:

λ²_1 = 75   and   λ²_2 = 0

with

Σ_k λ²_k = tr(A) = 75   and   Π_k λ²_k = |A| = 0
An eigenvector or characteristic vector is a non-trivial normalized vector v (distinct from 0) which satisfies the eigenvector relation:

(A − λ²I)v = 0   or   Av = λ²v      (29.44)

from which follows that:

v^T A v = λ²      (29.45)

because of the orthonormality condition:

v^T v = 1

We have seen above that a symmetric non-singular matrix of dimensions pxp has p positive eigenvalues which are roots of the characteristic equation. To each of these p eigenvalues λ²_k one can associate an eigenvector v_k. The p eigenvectors are normalized and mutually orthogonal. This leads us to the eigenvalue decomposition (EVD) of a symmetric non-singular matrix A:

V^T A V = Λ²      (29.46)

with the orthonormality condition:

V^T V = V V^T = I_p

where I_p is the pxp identity matrix and where Λ² is a pxp diagonal matrix in which the elements of the main diagonal are the p eigenvalues associated to the eigenvectors (columns) in the pxp matrix V. Because Λ² is a diagonal matrix, the decomposition is also called diagonalization of A. Algorithms for eigenvalue decomposition are discussed in Section 31.4. These are routinely used in the multivariate analysis of measurement tables and contingency tables (Chapters 31 and 32). Because of the orthonormality condition we can rearrange the terms of the decomposition in eq. (29.46) into the expression:

A = V Λ² V^T      (29.47)

which is known as the spectral decomposition of A. The latter can also be expanded in the form:

A = λ²_1 v_1 v_1^T + ... + λ²_k v_k v_k^T + ... + λ²_p v_p v_p^T      (29.48)
where v_k represents the kth column of V and where λ²_k is the eigenvalue associated to v_k. By way of example, we propose to extract the eigenvectors from the symmetric matrix A of which the eigenvalues have been derived at the beginning of this section:

A = [ 39 -24]
    [-24  21]

and which has been found to be positive definite with eigenvalues:

λ²_1 = 55.632   and   λ²_2 = 4.368

The eigenvector relation for the first eigenvector can then be written as:

(A − λ²_1 I)v_1 = [39 − 55.632      −24     ] [v_11]
                  [   −24       21 − 55.632 ] [v_21] = 0

where v_11 and v_21 are the unknown elements of the eigenvector v_1 associated to λ²_1. The determinant of the corresponding system of homogeneous linear equations equals zero:

|A − λ²_1 I| = (−16.632)(−34.632) − (−24)(−24) = 0

within the limits of precision of our calculation. Hence, we can solve for the unknowns v′_11 and v′_21, which are the non-normalized elements v_11 and v_21:
−16.632 v′_11 − 24 v′_21 = 0
−24 v′_11 − 34.632 v′_21 = 0

This leads to the solution:

v′_11 = 1
v′_21 = −16.632/24 = −24/34.632 = −0.693

Since the norm of v′_1 is defined as:

||v′_1|| = (v′²_11 + v′²_21)^{1/2} = (1 + 0.693²)^{1/2} = 1.217

we derive the elements of the normalized eigenvector v_1 from:

v_11 = v′_11/||v′_1|| = 1/1.217 = 0.822
v_21 = v′_21/||v′_1|| = −0.693/1.217 = −0.569

or

v_1 = [0.822  −0.569]^T

In order to compute the second eigenvector, we make use of the spectral decomposition of the matrix A:

A − λ²_1 v_1 v_1^T = λ²_2 v_2 v_2^T

where the left-hand member is called the residual matrix (or deflated matrix) of A after extraction of the first eigenvector. This reduces the problem to that of finding the eigenvector v_2 associated to λ²_2 from the residual matrix:

A − λ²_1 v_1 v_1^T = [ 39 -24]            [ 0.822]
                     [-24  21] − 55.632 x [-0.569] [0.822  -0.569]

                   = [1.417  2.045]
                     [2.045  2.951]
We now have to solve the eigenvector relation:

(A − λ²_1 v_1 v_1^T − λ²_2 I)v_2 = [1.417 − 4.368      2.045     ] [v_12]
                                   [    2.045      2.951 − 4.368 ] [v_22] = 0

where v_12 and v_22 are the unknown elements of the eigenvector v_2 associated to λ²_2. The determinant of this system of homogeneous linear equations is also zero:

|A − λ²_1 v_1 v_1^T − λ²_2 I| = (−2.951)(−1.417) − 2.045x2.045 = 0

within the limits of precision of our calculation. Now, we can solve for the unknowns v′_12 and v′_22, which are the non-normalized elements of v_12 and v_22:

−2.951 v′_12 + 2.045 v′_22 = 0
 2.045 v′_12 − 1.417 v′_22 = 0

from which we derive that:

v′_12 = 1
v′_22 = 2.951/2.045 = 2.045/1.417 = 1.443

After normalization we obtain the elements v_12 and v_22 of the eigenvector v_2 associated to λ²_2:

v_12 = v′_12/||v′_2|| = 1/1.756 = 0.569
v_22 = v′_22/||v′_2|| = 1.443/1.756 = 0.822

where ||v′_2|| = (1 + 1.443²)^{1/2} = 1.756. It can be shown that the second eigenvector v_2 can also be computed directly from the original matrix A, rather than from the residual matrix A − λ²_1 v_1 v_1^T, by solving the relation:

(A − λ²_2 I)v_2 = 0

This follows from the orthogonality of the eigenvectors v_1 and v_2. We have preferred the residual matrix because this approach is used in iterative algorithms for the calculation of eigenvectors, as is explained in Section 31.4. Finally, we arrange the eigenvectors column-wise into the matrix V, and the eigenvalues into the diagonal matrix Λ²:
V = [ 0.822  0.569]      Λ² = [55.632    0   ]
    [-0.569  0.822]           [  0     4.368 ]

From V and Λ² one can reconstruct the original matrix A by working out the consecutive matrix products:

V Λ² V^T = [ 0.822  0.569] [55.632    0   ] [ 0.822 -0.569]
           [-0.569  0.822] [  0     4.368 ] [ 0.569  0.822]

         = [ 39 -24] = A
           [-24  21]
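The deflation step used above is the core of simple iterative eigenvector algorithms. The following power-iteration sketch is our own illustration (assuming NumPy); it is not the algorithm of Section 31.4 itself:

```python
import numpy as np

def dominant_eigenpair(A, n_iter=200):
    """Power iteration: returns the largest eigenvalue and its eigenvector."""
    v = np.ones(A.shape[0])
    for _ in range(n_iter):
        v = A @ v
        v = v / np.linalg.norm(v)
    return v @ A @ v, v          # eq. (29.45): v^T A v = lambda^2

A = np.array([[ 39., -24.],
              [-24.,  21.]])

lam1, v1 = dominant_eigenpair(A)
print(lam1, v1)                  # ~55.632, ~[ 0.822 -0.569]

# Deflate: subtract lambda^2_1 v1 v1^T and extract the next eigenvector.
A_res = A - lam1 * np.outer(v1, v1)
lam2, v2 = dominant_eigenpair(A_res)
print(lam2, v2)                  # ~4.368, ~[0.569 0.822] (up to sign)
```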
The 'paper-and-pencil' method of eigenvector decomposition can only be performed on small matrices, such as illustrated above. For matrices with larger dimensions one needs a computer for which efficient algorithms have been designed (Section 31.4). Thus far we have considered the eigenvalue decomposition of a symmetric matrix which is of full rank, i.e. which is positive definite. In the more general case of a symmetric positive semi-definite pxp matrix A we will obtain r positive eigenvalues, where r < p. In the case of an nxp non-singular matrix X with n > p, we have:

S = XV   with   V V^T = V^T V = I_p      (29.85)

where V is a pxp orthonormal rotation matrix and where S denotes the rotated nxp matrix X. In the case of an nxp non-singular matrix X with n < p, we obtain:

L = X^T U   with   U U^T = U^T U = I_n      (29.86)
where U is an nxn orthonormal rotation matrix and where L stands for the rotated pxn matrix X^T. Orthogonal rotation produces a new orthogonal frame of reference axes which are defined by the column-vectors of U and V. The structural properties of the pattern of points, such as distances and angles, are conserved by an orthogonal rotation as can be shown by working out the matrices of cross-products:

S S^T = X V V^T X^T = X X^T      (29.87)

or

L L^T = X^T U U^T X = X^T X      (29.88)

where use is made of the orthogonality of rotation matrices. After an orthogonal rotation one can also perform a backrotation toward the original frame of reference axes:

S V^T = X V V^T = X      (29.89)

or

L U^T = X^T U U^T = X^T      (29.90)

where V and U are orthonormal rotation matrices and where use is made of the same property of orthogonality as stated above.

References

1. W.R. Dillon and M. Goldstein, Multivariate Analysis, Methods and Applications. Wiley, New York, 1984.
2. F.R. Gantmacher, The Theory of Matrices. Vols. 1 and 2. Chelsea Publ., New York, 1977.
3. N.C. Giri, Multivariate Statistical Inference. Academic Press, New York, 1972.
4. N. Cliff, Analyzing Multivariate Data. Academic Press, San Diego, CA, 1987.
5. R.J. Harris, A Primer on Multivariate Statistics. Academic Press, New York, 1975.
6. C. Chatfield and A.J. Collins, Introduction to Multivariate Analysis. Chapman and Hall, London, 1980.
7. M.S. Srivastava and E.M. Carter, An Introduction to Applied Multivariate Statistics. North Holland, New York, 1983.
8. T.W. Anderson, An Introduction to Multivariate Statistical Analysis. Wiley, New York, 1984.
9. O.M. Kvalheim, Interpretation of direct latent-variable projection methods and their aims and use in the analysis of multicomponent spectroscopic and chromatographic data. Chemom. Intell. Lab. Syst., 4 (1988) 11-25.
10. P.E. Green and J.D. Carroll, Mathematical Tools for Applied Multivariate Analysis. Academic Press, New York, 1976.
11. A. Gifi, Non-linear Multivariate Analysis. Wiley, Chichester, UK, 1990.
12. K. Beebe and B.R. Kowalski, An introduction to multivariate calibration and analysis. Anal. Chem., 59 (1987) 1007A-1009A.
Chapter 30
Cluster analysis 30.1 Clusters Clustering or cluster analysis is used to classify objects, characterized by the values of a set of variables, into groups. It is therefore an alternative to principal component analysis for describing the structure of a data table. Let us consider an example. About 600 iron meteorites have been found on earth. They have been analysed for 13 inorganic elements, such as Ir, Ni, Ga, Ge, etc. One wonders if certain meteorites have similar inorganic composition patterns. In other words, one would like to classify the iron meteorites according to these inorganic composition patterns. One can view the meteorites as the 600 objects of a data table, each object being characterized by the concentration of 13 elements, the variables. This means that one views the 600 objects as points (or vectors) in a 13-dimensional space. To find groups one could obtain, as we learned in Chapter 17, a principal component plot and consider those meteorites that are found close together as similar and try to distinguish in this way clusters or groups of meteorites. Instead of proceeding in this visual way, one can try to use more formal and therefore more objective methods. Let us first attempt such a classification by using only two variables (for instance, Ge and Ni). Fictitious concentrations of these two metals for a number of meteorites (called A, B, ..., J) are shown in Fig. 30.1. A classification of these meteorites permits one to distinguish first two clusters, namely ABDFG and CEHIJ. On closer observation, one notes that the first such group can be divided into two sub-groups, namely ABF and DG, and that in the second group one can also discern two sub-groups, namely CEIJ and H. There are two ways of representing these data by clustering. The first is depicted by the tree, also called a dendrogram, of Fig. 30.2 and consists in the elaboration of a hierarchical classification of meteorites. It is hierarchical because large groups are divided into smaller ones (for instance, the group ABDFG splits into ABF and DG). These are then split up again until eventually each group consists of only one meteorite. This type of classification is very often used in many areas of science. Figure 30.3 shows a very small part of the classification of plants. Individual species are
Fig. 30.1. Concentration of Ni and Ge for ten meteorites A to J.
Fig. 30.2. Hierarchical clustering for the meteorites described in Fig. 30.1.
grouped in genera, genera in families, etc. This classification was obtained historically by determining characteristics such as the number of cotyledons, the flower formula, etc. More recently, botanists and scientists from other areas where classification is needed, such as bacteriology, have reviewed the classifications in their respective fields by numerical taxonomy [1,2]. This consists of considering the species as objects, characterized by certain variables (number of cotyledons, etc.). The data table thus obtained is then subjected to clustering. Numerical taxonomy has inspired other experimental scientists, such as chemists, to apply clustering techniques in their own field. The other main possibility of representing clustered data is to make a table containing different clusterings. A clustering is a partition into clusters. For the example of Fig. 30.1, this could yield Table 30.1. Such a table does not necessarily yield a complete hierarchy (e.g. in going from 6 to 7 clusters, objects J and I, separated for clustering 6, are joined again for 7). Therefore, the presentation is called non-hierarchical. Classical books in the field are the already cited book by Sneath and Sokal [1] and that by Everitt [3]. A more recent book has been written by Kaufman and
Fig. 30.3. Taxonomy of some plants.
TABLE 30.1
A list of clusterings derived from Fig. 30.1 by non-hierarchical clustering

No. of clusters    Composition of the clusters
 1                 A B C D E F G H I J
 2                 A B F D G - C E H I J
 3                 A B F - D G - C E H I J
 4                 A B F - D G - C E I J - H
 6                 A - B F - D G - C E J - I - H
 7                 A - B F - D - G - C E - J I - H
10                 A - B - C - D - E - F - G - H - I - J
Rousseeuw [4]. Massart and Kaufman [5] and Bratchell [6] wrote specifically for chemometricians. Massart and Kaufman's book contains many examples, relevant to chemometrics, including the meteorite example [7]. More recent examples concern classification, for instance according to structural descriptions for toxicity testing [8] or in connection with combinatorial chemistry [9], according to chemical
composition for aerosol particles [10], Chinese teas [11] or mint species [12] and according to physicochemical parameters for solvents [13]. The selection of representative samples from a larger group for multivariate calibration (Chapter 36) by clustering was described by Naes [14].
30.2 Measures of (dis)similarity

30.2.1 Similarity and distance

To be able to cluster objects, one must measure their similarity. From our introduction, it is clear that "distance" may be such a measure. However, many types of similarity coefficients may be applied. While the terms similarity or dissimilarity have no unique definitions, the definition of distance is much clearer [6]. A dissimilarity between two objects i and i' is a distance if

D_ii' ≥ 0, where D_ii' = 0 if x_i = x_i'      (30.1)

(where x_i and x_i' are the row-vectors of the data table X with the measurements describing objects i and i')

D_ii' = D_i'i      (30.2)

D_ia + D_ai' ≥ D_ii'      (30.3)

Equation (30.1) shows that distances are zero or positive; eq. (30.2) that they are symmetric. Equation (30.3), where a is another object, is called the metric inequality. It states that the sum of the distances from any object a to objects i and i' can never be smaller than the distance between i and i'.

30.2.2 Measures of (dis)similarity for continuous variables

30.2.2.1 Distances

The equation for the Euclidean distance between objects i and i' is
D_ii' = sqrt( Σ_{j=1}^{m} (x_ij − x_i'j)² )      (30.4)

where m is the number of variables. The concept of Euclidean distance was introduced in Section 9.2.3. In vector notation this can be written as:

D²_ii' = (x_i − x_i')^T (x_i − x_i')
In some cases, one wants to give larger weights to some variables. This leads to the weighted Euclidean distance:

D_ii' = sqrt( Σ_{j=1}^{m} w_j (x_ij − x_i'j)² )   with   Σ_j w_j = 1      (30.5)

The standardized Euclidean distance is given by:

D_ii' = sqrt( Σ_{j=1}^{m} [(x_ij − x_i'j)/s_j]² )      (30.6)

where s_j is the standard deviation of the values in the jth column of X:

s_j = sqrt( (1/n) Σ_{i=1}^{n} (x_ij − x̄_j)² )      (30.7)
It can be shown that the standardized Euclidean distance is the Euclidean distance of the autoscaled values of X (see further Section 30.2.2.3). One should also note that in this context the standard deviation is obtained by dividing by n, instead of by (n − 1). The Mahalanobis distance [15] is given by:

D²_ii' = (x_i − x_i')^T C^-1 (x_i − x_i')      (30.8)

where C is the variance-covariance matrix of a cluster represented by x_i' (e.g. x_i' is the centroid of the cluster). It is therefore a distance between a group of objects and a single object i. The distance is corrected for correlation. Consider Fig. 30.4a; the distance between the centre C of the cluster and the objects A and B is the same in Euclidean distances but, since B is part of the group of objects outlined by the ellipse, while A is not, one would like a distance measure such that CA is larger than CB. The point B is "closer" to C than A because it is situated in the direction of the major axis of the ellipse while A is not: the objects situated within the ellipse have values of x1 and x2 that are strongly correlated. For A this will not be the case. It follows that the distance measure should take correlation (or covariance) into account. In the same way, in Fig. 30.4b, clusters G1 and G2 are closer together than G3 and G4 although the Euclidean distances between the centres are the same. All groups have the same shape and volume, but G1 and G2 overlap, while G3 and G4 do not. G1 and G2 are therefore more similar than G3 and G4 are. Generalized distance summarizes eqs. (30.4) to (30.8). It is a weighted distance of the general form:
D²_ii' = (x_i − x_i')^T W (x_i − x_i')      (30.9)

Fig. 30.4. Mahalanobis distance: (a) object B is closer to centroid C of cluster G1 than object A; (b) the distance between clusters G1 and G2 is smaller than between G3 and G4.
where W represents an mxm weighting matrix. Four particular cases of the generalized distance are mentioned below:

W = I defines ordinary Euclidean distances;
W = diag(w) produces weighted Euclidean distances;
W = diag(1/d²), where d represents the vector of column-standard deviations of X, yields standardized Euclidean distances;
W = C^-1, where C represents the variance-covariance matrix as defined in eq. (30.8), defines Mahalanobis distances.
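All four cases of eq. (30.9) can be computed with the same few lines by substituting the appropriate W. A minimal sketch (ours, not from the original text, assuming NumPy):

```python
import numpy as np

def generalized_distance(xi, xii, W):
    """Squared generalized distance, eq. (30.9)."""
    d = xi - xii
    return d @ W @ d

# Data of Table 30.3 (objects A-E, metals a-d); we compare objects A and B.
X = np.array([[100., 80., 70., 60.],
              [ 80., 60., 50., 40.],
              [ 80., 70., 40., 50.],
              [ 40., 20., 20., 10.],
              [ 50., 10., 20., 10.]])

W = np.eye(4)                            # ordinary Euclidean distance
print(np.sqrt(generalized_distance(X[0], X[1], W)))   # 40.0

s = X.std(axis=0)                        # column standard deviations (n-divisor)
W = np.diag(1.0 / s**2)                  # standardized Euclidean distance
print(np.sqrt(generalized_distance(X[0], X[1], W)))
```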
Euclidean distances (ordinary or standardized) are used very often for clustering purposes. This is not the case for the Mahalanobis distance. An application of Mahalanobis distances can be found in Ref. [16].

30.2.2.2 Correlation coefficient

Another way of measuring similarity between i and i' is to measure the correlation coefficient between the two row-vectors x_i and x_i'. The difference between using Euclidean distance and correlation is explained with the help of Fig. 30.5 and Table 30.2. In Chapter 9 it was shown that r equals the cosine of the angle between vectors. Consider the objects i, i' and i''. The Euclidean distance D_ii' in Fig. 30.5 is the same as D_ii''. However, the angle between x_i and x_i' is much smaller than between x_i and x_i'', and therefore the correlation coefficient is larger. How to choose between the two is not evident and requires chemical considerations. This is shown with the example of Table 30.2, which gives the retention indices of five substances on three gas chromatographic stationary phases (SFs). The question is which of these phases should be considered similar. The similarity measure to be chosen depends on the point of view of the analyst. One point of view might be that
Fig. 30.5. The point i is equidistant to i' and i'' according to the Euclidean distances (D_ii' and D_ii''), but much closer to i' (cos θ_ii') than to i'' (cos θ_ii'') when a correlation-based similarity measure is applied.
those SFs that have more or less the same over-all retention, i.e., the same polarity towards a variety of substances, are considered to be similar. In that case, SF3 is very dissimilar from both SF1 and SF2, while SF1 and SF2 are quite similar. The best way to express this is the Euclidean distance: D_13 and D_23 are then much higher than D_12. On the other hand, the analyst might not be interested in global retention indices. Indeed, by increasing the temperature for SF3, he would obtain similar retention indices as for the other two. He will then observe that the relative retention times, i.e. the retention times of the substances compared with each other, are the same for SF1 and SF3 and different from SF2. Chemically, this means that SF3 has a different polarity from SF1, but the same specific interactions. This is best expressed by using the correlation coefficient as the similarity measure. Indeed, r_13 = 1, indicating complete similarity, while r_12 and r_23 are much lower. Since both r = 1 and r = −1 are considered to indicate absolute similarity and if, as with Euclidean distance, one would like the numerical value of the similarity measure to increase with increasing dissimilarity, one should use, for instance, 1 − |r|.

TABLE 30.2
Retention indices of five substances on three stationary phases in GLC

Stationary phase (SF)    Substance
                         1     2     3     4     5
1                        100   130   150   160   170
2                        120   110   170   150   145
3                        200   260   300   320   340
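The contrast between the two similarity measures is easily verified on the data of Table 30.2 (a sketch of ours, assuming NumPy):

```python
import numpy as np

SF1 = np.array([100, 130, 150, 160, 170])
SF2 = np.array([120, 110, 170, 150, 145])
SF3 = np.array([200, 260, 300, 320, 340])   # = 2 x SF1

print(np.linalg.norm(SF1 - SF2))    # ~44: SF1 and SF2 close in Euclidean terms
print(np.linalg.norm(SF1 - SF3))    # ~322: SF3 far away
print(np.corrcoef(SF1, SF3)[0, 1])  # 1.0: perfectly correlated
print(np.corrcoef(SF1, SF2)[0, 1])  # ~0.66: considerably lower
```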
30.2.2.3 Scaling

In the meteorite example, the concentration of Ni is of the order of 50000 ppm and the Ga content of the order of 50 ppm. Small relative changes in the Ni content then have, of course, a much higher effect on the Euclidean distance than equally high relative changes of the Ga content. One might also consider two metals M and N, one ranging in concentration from 900 to 1100 ppm, the other from 500 to 1500 ppm. Concentration changes from one end of the range to the other in N would then be more important in the Euclidean distance than the same kind of change in M. It is probable that the person carrying out the classification will not agree with these numerical consequences and consider them as artefacts. Both problems can be solved by scaling the variables. The most usual way of doing this is using the z-transform, also called autoscaling (see also Chapter 3.3). One then determines

z_ij = (x_ij − x̄_j)/s_j      (30.10)

where x_ij is the value for object i of variable j, x̄_j is the mean for variable j, and s_j is the standard deviation for variable j. One then uses z in eq. (30.4), which is equivalent to applying the standardized Euclidean distance (eq. (30.6)) to the x-values. Other possibilities are range scaling and logarithmic transformation. In range scaling one does not divide by s_j as in eq. (30.6), but by the range r_j of variable j:

z_ij = x_ij / r_j      (30.11)

If one wants z_ij expressed on a 0-1 scale, this becomes:

z_ij = (x_ij − x_j,min) / r_j      (30.12)

where x_j,min is the lowest value of x_j. The logarithmic transform, too, reduces variation between variables. Its effect is not to make absolute variation equal but to make variation comparable in the following sense. Suppose that variable 1 has a mean value of 100 and variable 2 a mean value of 10. Variable 1 varies between 50 and 150. If the variation is proportional to the mean values of the variables, then one expects variable 2 to vary between more or less 5 and 15. In absolute values the variation in variable 1 is therefore much larger. When one transforms variables 1 and 2 by taking their logarithms, the variation in the two transformed variables becomes comparable. Log-transformation to correct for heteroscedasticity in a regression context is described in Section 8.2.3.1. It also has the advantage that the scaling does not change when data are added. This is not so for eqs. (30.10) and (30.11), since one must recompute x̄_j, r_j or s_j.
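The three transformations are easily sketched in code (our illustration, assuming NumPy; the small data table is made up):

```python
import numpy as np

# A fictitious data table: rows = objects, columns = variables.
X = np.array([[50000., 50.,  950.],
              [52000., 40., 1100.],
              [48000., 60.,  900.]])

# Autoscaling (z-transform), eq. (30.10); the n-divisor matches eq. (30.7).
Z_auto = (X - X.mean(axis=0)) / X.std(axis=0)

# Range scaling on a 0-1 scale, eq. (30.12).
Z_range = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Logarithmic transformation.
Z_log = np.log10(X)

print(Z_auto.std(axis=0))   # every column now has unit standard deviation
```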
Scaling is a very important operation in multivariate data analysis and we will treat the issues of scaling and normalisation in much more detail in Chapter 31. It should be noted that scaling has no impact (except when the log transform is used) on the correlation coefficient, and that the Mahalanobis distance is also scale-invariant because the C matrix contains covariances (related to correlation) and variances (related to standard deviation).
30.2.3 Measures of (dis)similarity for other variables

30.2.3.1 Binary variables

Binary variables usually have values of 0 (for attribute absent) or 1 (for attribute present). The simplest type of similarity measure is the matching coefficient. For two objects i and i' and attribute j:

s_ii'j = 1   if x_ij = x_i'j
s_ii'j = 0   if x_ij ≠ x_i'j

The matching coefficient is the mean of the s-values for all m attributes:

S_ii' = (1/m) Σ_{j=1}^{m} s_ii'j      (30.13)

This means that one counts the number of attributes for which i and i' have the same value and divides this by the number of attributes. The Jaccard similarity coefficient is slightly more complex. It considers that the simultaneous presence of an attribute in objects i and i' indicates similarity, but that the absence of the attribute has no meaning. Therefore:

s_ii'j = 1   if x_ij = x_i'j = 1
s_ii'j = 0   if x_ij ≠ x_i'j
(the case x_ij = x_i'j = 0 is ignored)

The Jaccard similarity coefficient is then computed with eq. (30.13), where m is now the number of attributes for which at least one of the two objects has a value of 1. This similarity measure is sometimes called the Tanimoto similarity. The Tanimoto similarity has been used in combinatorial chemistry to describe the similarity of compounds, e.g. based on the functional groups they have in common [9]. Unfortunately, the names of similarity coefficients are not standard, so that it can happen that the same name is given to different similarity measures or more than one name is given to a certain similarity measure. This is the case for the Tanimoto coefficient (see further).
The Hamming distance is given by:

d_ii'j = 1   if x_ij ≠ x_i'j
d_ii'j = 0   if x_ij = x_i'j

and

D_ii' = Σ_{j=1}^{m} d_ii'j

It can be shown [5] that the Hamming distance is a binary version of the city block distance (Section 30.2.3.2). Some authors use the Hamming distance as the equivalent of the Euclidean distance of binary data. In that case:

D_ii' = sqrt( Σ_{j=1}^{m} d_ii'j )

The literature also mentions a normalized Hamming distance, which is then equal to either:

D_ii' = (1/m) Σ_{j=1}^{m} d_ii'j   or   D_ii' = sqrt( (1/m) Σ_{j=1}^{m} d_ii'j )

The first of these two is also called the Tanimoto coefficient by some authors. It can be verified that, since distance = 1 − similarity, this is equal to the simple matching coefficient. Clearly, confusion is possible and authors using a certain distance or similarity measure should always define it unambiguously.
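The binary coefficients are simple to sketch in code (our illustration, assuming NumPy, with two made-up attribute vectors):

```python
import numpy as np

xi  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
xii = np.array([1, 1, 1, 0, 0, 0, 1, 1])

# Simple matching coefficient, eq. (30.13): agreements / m.
matching = np.mean(xi == xii)            # 5/8 = 0.625

# Jaccard (Tanimoto): joint absences are ignored.
both_present = np.sum((xi == 1) & (xii == 1))
either_present = np.sum((xi == 1) | (xii == 1))
jaccard = both_present / either_present  # 3/6 = 0.5

# Hamming distance: number of disagreements.
hamming = np.sum(xi != xii)              # 3

print(matching, jaccard, hamming)
```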
30.2.3.2 Ordinal variables

For those variables that are measured on a scale of integer values consisting of more than two levels, one uses the Manhattan or city-block distance. This is also referred to as the L1-norm. It is given by:

D_ii' = Σ_{j=1}^{m} d_ii'j   with   d_ii'j = |x_ij − x_i'j|      (30.14)
Fig. 30.6. D_ii' is the Euclidean distance between i and i'; d_ii'1 and d_ii'2 are the city-block distances between i and i' for variables x1 and x2, respectively. The city-block distance is d_ii'1 + d_ii'2.
Here, too, scaling can be required when the ranges of the variables are dissimilar. In this case, one divides the distances by the range r_j for variable j:

d_ii'j = |x_ij − x_i'j| / r_j

In this way one obtains d-values from 0 to 1. Then s_ii'j = 1 − d_ii'j. Manhattan distances can be used also for continuous variables, but this is rarely done, because one prefers Euclidean distances in that case. Figure 30.6 compares the Euclidean and Manhattan distances for two variables. While the Euclidean distance between i and i' is measured along a straight line connecting the two points, the Manhattan distance is the sum of the distances parallel to the axes. The equations for both types of distances are very similar in appearance. In fact, they both belong to the Minkowski distances given by:

D_ii' = ( Σ_{j=1}^{m} |x_ij − x_i'j|^r )^{1/r}      (30.15)
The Manhattan distance is obtained for r = 1 and the Euclidean distance for r = 2. In this context the Euclidean distance is also referred to as the L2-norm.

30.2.3.3 Mixed variables

In some cases, one needs to combine variables of mixed types (binary, ordinal or continuous). The usual way to do this is to eliminate the effect of varying ranges by scaling. All variables are transformed, so that they take values from 0 to 1 using range scaling for the continuous variables or the procedure for scaling described
for ordinal variables in Section 30.2.3.2, while binary variables are expressed naturally on a 0-1 scale. The range scaled similarity for variables on an interval scale is obtained as

s_ii'j = 1 − |z_ij − z_i'j|

with z_ij and z_i'j as defined in eq. (30.12). Then one can determine the similarities of the objects i and i' by summing the range scaled similarities for all variables j. A distance measure can be obtained by computing

D_ii' = Σ_{j=1}^{m} (1 − s_ii'j)

where s_ii'j is the range scaled similarity between objects i and i' for variable j.

30.2.4 Similarity matrix

The similarities between all pairs of objects are measured using one of the measures described earlier. This yields the similarity matrix or, if the distance is used as measure of (dis)similarity, the distance matrix. It is a symmetrical nxn matrix containing the similarities between each pair of objects. Let us suppose, for example, that the meteorites A, B, C, D, and E in Table 30.3 have to be classified and that the distance measure selected is Euclidean distance. Using eq. (30.4), one obtains the similarity matrix in Table 30.4. Because the matrix is symmetrical, only half of this matrix needs to be used.
TABLE 30.3
Example of a data matrix

System   Concentrations (arbitrary units)
         Metal a   Metal b   Metal c   Metal d
A        100       80        70        60
B        80        60        50        40
C        80        70        40        50
D        40        20        20        10
E        50        10        20        10
30.3 Clustering algorithms

30.3.1 Hierarchical methods

There is a wide variety of hierarchical algorithms available and it is impossible to discuss all of them here. Therefore, we shall only explain the most typical ones, namely the single linkage, the complete linkage and the average linkage methods. In the similarity matrix, one seeks the two most similar objects, i.e., the objects for which S_ii' is largest. When using distance as the similarity measure, this means that one looks for the smallest D_ii' value. Let us suppose that it is D_qp, which means that of all the objects to be classified, q and p are the most similar. They are considered to form a new combined object p*. The similarity matrix is thereby reduced to (n − 1) x (n − 1). In average linkage, the similarities between the new object and the others are obtained by averaging the similarities of q and p with these other objects. For example, D_ip* = (D_iq + D_ip)/2. In single linkage, D_ip* is the distance between the object i and the nearest of the linked objects, i.e., it is set equal to the smallest of the two distances D_iq and D_ip: D_ip* = min(D_iq, D_ip). Complete linkage follows the opposite approach: D_ip* is the distance between i and the furthest object q or p. In other words, D_ip* = max(D_iq, D_ip). At the same time, one starts constructing the dendrogram by linking together q and p at the level D_qp. This process is repeated until all objects are linked in one hierarchical classification system, which is represented by a dendrogram. This procedure can now be illustrated using the data of Tables 30.3 and 30.4. The smallest D is 14.1 (between D and E). D and E are combined first and yield the combined object D*. The successive reduced matrices obtained by average linkage are given in Table 30.5, those obtained by single linkage in Table 30.6 and those obtained by complete linkage in Table 30.7. The dendrograms are shown in Fig. 30.7. Clusters are then obtained by cutting the highest link(s). For instance, by breaking the highest link in Fig. 30.7a, one obtains the clusters (ABC) and (DE). Cutting the second highest links leads to the clustering (A) (BC) (DE). How many links to cut is not always evident (see also Section 30.3.4.2).

TABLE 30.4
Similarity matrix (based on Euclidean distance) for the objects from Table 30.3

      A       B      C      D      E
A     0
B     40.0    0
C     38.7    17.3   0
D     110.4   70.7   78.1   0
E     111.4   72.1   80.6   14.1   0
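Table 30.4 can be reproduced in a few lines (our sketch, assuming NumPy and SciPy):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Table 30.3: rows A-E, columns metals a-d.
X = np.array([[100, 80, 70, 60],
              [ 80, 60, 50, 40],
              [ 80, 70, 40, 50],
              [ 40, 20, 20, 10],
              [ 50, 10, 20, 10]])

D = squareform(pdist(X, metric='euclidean'))
print(np.round(D, 1))   # reproduces Table 30.4
```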
Fig. 30.7. Dendrograms for the data of Tables 30.3-30.7: (a) average linkage; (b) single linkage; (c) complete linkage.

TABLE 30.5
Successive reduced matrices for the data of Table 30.4 obtained by average linkage

(a)
      A       B      C      D*
A     0
B     40.0    0
C     38.7    17.3   0
D*    110.9   71.4   79.3   0
D* is the object resulting from the combination of D and E.

(b)
      A       B*     D*
A     0
B*    39.3    0
D*    110.9   75.3   0
B* is the object resulting from the combination of B and C.

(c)
      A*     D*
A*    0
D*    93.1   0
A* is the object resulting from the combination of A and B*.

(d) The last step consists in the junction of A* and D*. The resulting dendrogram is given in Fig. 30.7(a).
TABLE 30.6
Successive reduced matrices for the data of Table 30.4 obtained by single linkage

(a)
      A       B      C      D*
A     0
B     40.0    0
C     38.7    17.3   0
D*    110.4   70.7   78.1   0

(b)
      A       B*     D*
A     0
B*    38.7    0
D*    110.4   70.7   0

(c)
      A*     D*
A*    0
D*    70.7   0

(d) The last step consists in the junction of A* and D*. The resulting dendrogram is given in Fig. 30.7(b).
We observe that, in this particular instance, the only noteworthy difference between the algorithms is the distance at which the last link is made (from 111.4 for complete linkage to 70.7 for single linkage). When larger data sets are studied, the differences may become more pronounced. In general, average linkage is preferred. In the average linkage mode, one may introduce a weighting of the objects when clusters of unequal size are linked. Both weighted and unweighted methods exist. Another method which gives good results (i.e., has been shown to give meaningful clusters) is known as Ward's method [17]. It is based on a heterogeneity criterion, defined as the sum of the squared distances of each member of a cluster to the centroid of that cluster. Elements or clusters are joined with the criterion that the sum of heterogeneities of all clusters should increase as little as possible. Single linkage methods have a tendency to chain together ill-defined clusters (see Fig. 30.8). This eventually leads to the inclusion of rather different objects (A to X of Fig. 30.8) in the same long drawn-out cluster. For that reason one sometimes
TABLE 30.7
Successive reduced matrices for the data of Table 30.4 obtained by complete linkage

(a)
      A       B      C      D*
A     0
B     40.0    0
C     38.7    17.3   0
D*    111.4   72.1   80.6   0

(b)
      A       B*     D*
A     0
B*    40.0    0
D*    111.4   80.6   0

(c)
      A*      D*
A*    0
D*    111.4   0

(d) The last step consists in the junction of A* and D*. The resulting dendrogram is given in Fig. 30.7(c).
Fig. 30.8. Dissimilar objects A and X are chained together in a cluster obtained by single linkage.
says that the single linkage method is space contracting. The complete linkage method leads to small, tight clusters and is space dilating. Average linkage and Ward's method are space conserving and seem, in general, to give the better results.
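The three linkage strategies are available in standard software. A sketch of ours (assuming SciPy; not part of the original text), applied to the data of Table 30.3:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[100, 80, 70, 60],
              [ 80, 60, 50, 40],
              [ 80, 70, 40, 50],
              [ 40, 20, 20, 10],
              [ 50, 10, 20, 10]])   # objects A-E of Table 30.3

d = pdist(X)                        # condensed Euclidean distance matrix
for method in ('average', 'single', 'complete'):
    Z = linkage(d, method=method)
    # Cutting the dendrogram into two clusters separates (A,B,C) from (D,E),
    # e.g. [1 1 1 2 2].
    print(method, fcluster(Z, t=2, criterion='maxclust'))
```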
TABLE 30.8
Values characterizing the objects of Fig. 30.9

     x1    x2
A    45    24
B    24    43
C    14    23
D    64    52
E    36    121
F    56    140
G    20    148

TABLE 30.9
Euclidean distances between the points in Fig. 30.9 (from Ref. [18])

     A     B     C     D     E     F    G
A    0
B    28    0
C    32    23    0
D    35    40    60    0
E    100   80    103   76    0
F    119   104   128   90    29    0
G    127   105   126   105   30    35   0
Single linkage has the advantage of mathematical simplicity, particularly when it is calculated using an operational research technique called the minimal spanning tree [18]. Although the computations seem to be very different from those in Table 30.6, exactly the same results are obtained. To explain the method we need a matrix with some more objects. The data matrix is given in Table 30.8 and the resulting similarity matrix (Euclidean distances) in Table 30.9. We may think of these objects as towns, the distances between which are given in the table, and suppose that the seven towns must be connected to each other by highways (or a production unit serving six clients using a pipeline). This must be done in such a way that the total length of the highway is minimal. Two possible configurations are given in Fig. 30.9. Clearly, (a) is a better solution than (b). Both (a) and (b) are graphs that are part of the complete graph containing all possible links and both are connected graphs (all of the nodes are linked directly or indirectly to each other). These graphs are called trees and the tree for which the sum of the values of the links is minimal is called the minimal spanning tree. This
Fig. 30.9. Examples of trees in a graph; (a) is the minimal spanning tree [18].
minimal spanning tree is also the optimal solution for the highway problem. The terminology used in this chapter comes from graph theory. Graph theory is described in Chapter 42. Several algorithms can be used to find the minimal spanning tree. One of these is Kruskal's algorithm [19], which can be stated as follows: add to the tree the edge with the smallest value which does not form a cycle with the edges already part of the tree. According to this algorithm, one selects first the smallest value in Table 30.9 (link BC, value 23). The next smallest value is 28 (link AB). The next smallest values are 29 and 30 (links EF and EG). The next smallest value in the table is 32 (link AC). This would, however, close the cycle ABC and is therefore eliminated. Instead, the next link that satisfies the conditions of Kruskal's algorithm is AD and the last one is DE. The minimal spanning tree obtained in this way is that given in Fig. 30.9(a). After careful inspection of this figure, one notes that two clusters can be obtained in a formal way by breaking the longest edge (DE). When a more detailed classification is needed, one breaks the second longest edge, and so on until the desired number of classes is obtained (see Section 30.3.4.2). In the same way, clusters were obtained from Fig. 30.7 by breaking first the lowest link, i.e. the one with the highest distance.
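Kruskal's algorithm is short enough to sketch in full (our own illustration in Python with NumPy, using the distances of Table 30.9):

```python
import numpy as np

def kruskal_mst(D, labels):
    """Minimal spanning tree by Kruskal's algorithm on a distance matrix D."""
    n = len(labels)
    parent = list(range(n))          # union-find forest to detect cycles

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((D[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # adding the edge creates no cycle
            parent[ri] = rj
            tree.append((labels[i], labels[j], int(d)))
    return tree

labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
D = np.array([[  0,  28,  32,  35, 100, 119, 127],
              [ 28,   0,  23,  40,  80, 104, 105],
              [ 32,  23,   0,  60, 103, 128, 126],
              [ 35,  40,  60,   0,  76,  90, 105],
              [100,  80, 103,  76,   0,  29,  30],
              [119, 104, 128,  90,  29,   0,  35],
              [127, 105, 126, 105,  30,  35,   0]])   # Table 30.9

print(kruskal_mst(D, labels))
# [('B','C',23), ('A','B',28), ('E','F',29), ('E','G',30), ('A','D',35), ('D','E',76)]
```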
Fig. 30.10. Hierarchical agglomerative classification of solvents according to solvent-solute and solvent-solvent interactions [13].

An example of an application is shown in Fig. 30.10. This concerns the classification of 42 solvents based on three solvatochromic parameters (parameters that describe the interaction of the solvents with solutes) [13]. Different methods were applied, among which was the average linkage method, the result of which is shown in the figure. According to the method applied, several clusterings can be found. For instance, the first cluster to split off from the majority of solvents consists of solvents 36, 37, 38, 39, 40, 41, 42 (t-butanol, isopropanol, n-butanol,
ethanol, methanol, ethyleneglycol and water). This solvent class consists of amphiprotic solvents (alcohols and water). This is then split further into the monoalcohols on the one hand and ethyleneglycol and water, which have higher association ability, on the other. In this way, one can develop a detailed classification of the solvents. Another use of such a classification is to select different, representative objects. Snyder [20] used this to select a few solvents that would be different and representative of certain types of solvent-solute interactions. These solvents were then used in a successful strategy for the optimization of mobile phases for liquid chromatographic separation. The hierarchical methods so far discussed are called agglomerative. Good results can also be obtained with hierarchical divisive methods, i.e., methods that first divide the set of all objects in two so that two clusters result. Then each cluster is again divided in two, etc., until all objects are separated. These methods also lead to a hierarchy. They present certain computational advantages [21,22]. Hierarchical methods are preferred when a visual representation of the clustering is wanted. When the number of objects is not too large, one may even compute a clustering by hand using the minimum spanning tree. One of the problems in the approaches described above and, in fact, also in those described in the next sections, is that when objects are added to the data set, the
Fig. 30.11. The three-distance clustering method [23]. The new object A has to be classified. In node V_a it must be decided whether it fits better in the group of nodes represented by V1, the group of nodes represented by V2, or does not fit in any of the nodes already represented by V_a.
whole clustering must be carried out again. A hierarchical procedure which avoids this problem has been proposed by Zupan [23,24] and is called the three-distance clustering method (3-DM). Let us suppose that a hierarchical clustering has already been obtained and define V_a as a node in the dendrogram, representing all the objects above it. Thus

v_a = (1/n_a) Σ_{i=1}^{n_a} x_i

where v_a is the mean vector of the n_a vectors representing the n_a objects i = 1, ..., n_a above it, and m is the number of variables. For instance, in Fig. 30.11, V1 is the mean of three objects O1, O2, O3. Suppose now also that in an earlier stage one has decided that the new object A belongs rather to the group of objects above V_a than to the group represented by V_{a-1}. We will now take one of three possible decisions: (a) A belongs to V1; (b) A belongs to V2; (c) A does not belong to V1 or V2. This depends on the similarity or distance between A, V1 and V2. One determines the similarities S_AV1, S_AV2, and S_AVa. If S_AV1 is highest, then A belongs to V1 and the same process as for V_a is repeated for V1. If S_AV2 is highest, then A belongs to the group represented by V2, and if S_AVa is highest, then A belongs to V_a but not to V1 or V2. A new branch is started for A between V_a and V_{a-1}.

30.3.2 Non-hierarchical methods

Let us now cluster the objects of Table 30.8 with a non-hierarchical algorithm. Instead of clustering by joining objects successively, one wants to determine
Fig. 30.12. Forgy's non-hierarchical classification method. A, ..., G are objects to be classified; *1, ..., *4 are successive centroids of clusters.
directly a K-clustering, by which is meant a classification into K clusters. We will apply this for 2 clusters. Of course, one is able to see that the correct 2-clustering is (A,B,C,D) (E,F,G). In general, one uses m-dimensional data and it is then not possible to visually observe clusters. In this section, we will also suppose that we are not able to do this. To obtain 2 clusters, one selects 2 seed points among the objects and classifies each of the objects with the nearest seed point. In this way, an initial clustering is obtained. For the objects of Table 30.8, A and B are selected as first seed points. In Fig. 30.12 it can be seen that this is not a good choice (A and E would have been better), but it should be remembered that one is supposed to be unable to observe this. D is nearest to A, and C, E, F and G are nearest to B (Table 30.9). The initial clustering is therefore (A,D) (B,C,E,F,G). For each of these clusters, one determines the centroid (the point with mean values of variables x1 and x2 for each cluster). For cluster (A,D), the centroid (*1) is characterized by

x1 = (45 + 64)/2 = 54.5
x2 = (24 + 52)/2 = 38

and for cluster (B,C,E,F,G) the centroid (*2) is given by

x1 = (24 + 14 + 36 + 56 + 20)/5 = 30
x2 = (43 + 23 + 121 + 140 + 148)/5 = 95

The two centroids are shown in Fig. 30.12. In the next step, one reclassifies each object according to whether it is nearest to *1 or *2. This now leads to the clustering (A,B,C,D) (E,F,G). The whole procedure is then repeated: new centroids are computed for the clusters (A,B,C,D) and (E,F,G). These new centroids are
situated in *3 and *4. Reclassification of the objects leads again to (A,B,C,D) (E,F,G). Since the new clustering is the same as the preceding one, this clustering is considered definitive. The method used here is called Forgy's method [25]. This is one of the K-center or K-centroid methods, another well-known variant of which is MacQueen's K-means method [26]. Forgy's method involves the following steps (a minimal sketch is given below):
(1) Select an initial clustering.
(2) Determine the centroids of the clusters and the distance of each object to these centroids.
(3) Locate each object in the cluster with the nearest centroid.
(4) Compute new cluster centroids and go to step (3).
One continues to do this until convergence occurs (i.e., until the same clustering is found in two successive assignment steps). Instead of using centroids as the points around which the clusters are constructed, one can select some of the objects themselves. These are then called centrotypes. Such a method might be preferred if one wants to select representative objects: the centrotype object will be considered to be representative for the cluster around it. Returning to the simple example of Fig. 30.12, suppose that one selects objects A and E as centrotypes. Thus, B, C and D would be classified with A since they are nearer to it than to E, and F and G would be clustered with E. This method is based on an operations research model (Chapter 42), the so-called location model. The points A-G might then be cities in which some central facilities must be located. The criterion to select A and E as centrotypes is that the sum of the distances from each town to the nearest facility is minimal when the facilities are located in A and E. This means that A, B, C and D will then be served by the facility located in A, and E, F and G by the one located in E. An algorithm that allows one to do this was described in the clustering literature under the name MASLOC [27]. Numerical algorithms such as genetic algorithms or simulated annealing can also be applied (e.g. Ref. [11]).
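A minimal sketch of Forgy's procedure (ours, in Python with NumPy, starting from the seed points A and B used above):

```python
import numpy as np

X = np.array([[45, 24], [24, 43], [14, 23], [64, 52],
              [36, 121], [56, 140], [20, 148]], dtype=float)  # A-G, Table 30.8

centroids = X[[0, 1]]            # step (1): seed points A and B
assignment = None
while True:
    # steps (2)-(3): assign each object to the nearest centroid
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    new_assignment = dist.argmin(axis=1)
    if np.array_equal(new_assignment, assignment):
        break                    # convergence: same clustering twice in a row
    assignment = new_assignment
    # step (4): recompute the centroids
    centroids = np.array([X[assignment == k].mean(axis=0) for k in range(2)])

print(assignment)                # [0 0 0 0 1 1 1] -> (A,B,C,D) (E,F,G)
```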
79
^2*
Fig. 30.13. Agglomerative methods will first link A and B, so that meaningless clusters may result. The non-hierarchical K=2 clustering will yield clusters I and II.
where T is the total sums of squares and products matrix, related to the variancecovariance matrix, B is the same matrix for the centroids and W is obtained by pooling the sums of squares and product matrices for the clusters. One can also write tr(T) = tr(B) + tr(W) Since T and therefore also tr(T) is constant, minimizing tr(W) is equivalent to maximizing tr(B). It can be shown that tr(B) is the sum of squared Euclidean distances between the group centroids. An advantage of non-hierarchical methods compared to hierarchical methods is that one is not bound by earlier decisions. A simple example of how disastrous this can be is given in Fig. 30.13 where an agglomerative hierarchical method would start by linking A and B. On the other hand, the agglomerative methods allow better visualization, although some visualization methods (e.g. Ref. [28]) have been proposed for non-hierarchical methods. 30.3.3 Other methods A group of methods quite often used is based on the idea of describing high local densities of points. This can be done in different ways. One such way, mode analysis [29], is described by Bratchell [6]. A graph theoretical method (see also Chapter 42) by Jardine and Sibson [30] starts by considering each object as a node and by linking those objects which are more similar than a certain threshold. If a Euclidean distance is used, this means that only those nodes are linked for which the distance is smaller thap a chosen threshold distance. When this is done, one determines the so-called maximal complete subgraphs. A complete subgraph is a
80
Fig. 30.14. Step 1 of the Jardine and Sibson method [30]. Objects less distant than Dj are linked.
set of nodes for which all the nodes are connected to each other. Maximal complete subgraphs are then the largest (i.e. those containing most objects) of these complete subgraphs. In Fig. 30.14, they are (B, G, F, D, E, C), (A, B, C, F, G), (H, I, J, K), etc. Each of these is considered as the kernel of a cluster. One now joins those kernels that overlap to a large degree with, as criterion, the fact that they should have at least a prespecified number of nodes in common (for instance, 3). Since only the first two kernels satisfy this requirement, one considers A, B, C, D, E, F, G as one cluster and H, I, J, K as another. Another technique, originally derived from the potential methods described for supervised pattern recognition (Chapter 33), was described by Coomans and Massart [31]. A kernel or potential density function is constructed around each object. In Fig. 30.15 A, B, C and D are objects around which a triangular potential field is constructed (solid lines). The potentials in each point are summed (broken line). One selects the point with highest summed potential (B) as cluster center and measures the summed potential in the closest point. All points, such as A and C, that can be reached from B along a potential path, which decreases continuously, belong to the same summed potential hill and such objects are considered to be part of the cluster. When the potential is higher again, in a certain object, or there is a point mid-way between two points which has lower potential, then this means that one has started to climb a new hill and therefore the object is part of another cluster. The method has the advantage that the form of the cluster is not important, while most other methods select spherical or ellipsoid clusters. The disadvantage is that the width of the potential field must be optimized. All these methods and the methods of the preceding section have one characteristic in common: an object may be part of only one cluster. Fuzzy clustering applies other principles. It permits objects to be part of more than one cluster. This leads to results such as those illustrated by Fig. 30.16. Each object / is given a value
81
potential
B' \ \ \ \ \ AgC
D
distance
Fig. 30.15. One-dimensional example of the potential method [31].
Xjt
II
/B
^1 Fig. 30.16. Fuzzy clustering. Two fuzzy clusters (I and II) are obtained. For example/M = 1 ,/AII = 0,/BI = 0 , / B I I = 1 , / C I = 0.47,/CII = 0.53.
f-i^ for a membership function (see Chapter 19) in cluster k. The following relationships are defined for all / and k:
and K
82
where K is the number of clusters. When/j^ = 1, then it means that / unambiguously belongs to cluster k, otherwise the larger/j^ is, the more / belongs to cluster k. The assignment of the membership values is done by an optimization procedure. Of the many criteria that have been described, probably the best known is that of Ruspini [32]:
where 5 is an empirical constant and d^^ a distance. The criterion is minimized and therefore requires membership values of / and i to be similar when their distance is small. Fuzzy clustering has been applied only to a very limited extent in chemometrics. A good example concerning the classification of seeds from images is found in Ref. [33]. As described in the Introduction to this volume (Chapter 28), neural networks can be used to carry out certain tasks of supervised or unsupervised learning. In particular, Kohonen mapping is related to clustering. It will be explained in more detail in Chapter 44. 30.3.4 Selecting clusters 30.3.4.1 Measures for clustering tendency Instead of carrying out the actual cluster analysis to find out whether there is structure in the data, one might wonder if it is useful to do so and try to measure clustering tendency. Hopkin's statistic and modifications of it have been described in the literature [34,35]. The original procedure is based on the fact that if there is a clustering tendency, distances between points and their nearest neighbour will tend to be smaller than distances between randomly selected artificial points in the same experimental domain and their nearest neighbour. The method consists of the following steps (see also Fig. 30.17): - select at random a small number (for example 5%) of the real data points; - compute the distance d^ to the nearest data points for each selected data point i\ - generate at random an equal number of artificial points in the area studied; - compute the distance Uj to the nearest real data point for each artificial point; - determine// = S«y/(Xwy +X Y.d^ and H will be higher than 0.5.
83
Fig. 30.17. Square symbols are the actual objects and circled squares are the marked objects. Open circles are artificial points (adapted from Ref. [34]).
30.3.4.2 How many clusters? In hierarchical clustering one can obtain any number of clusters K,\.2 I J = 0
withik=l,...,r
(31.7)
or
ic,-x^,i,i=o The determinants can be developed into a polynomial equation of degree r of which the r positive roots are the eigenvalues X\, where r^ i ^ O ^ ^ ^ ^ - * ^
/
•1>^
^"'^^'^^'^
5
/
# rHtoONP^^^^*''^^^ \ I^
0
^'^'^N
/
A
^''•>
10
1
*"*,
lfi«A2-V02^k
M 0M0O-Ph«
\^^^/
CH2Cl2j3^
5 20 22
/^^ /
N ^ 24
WoO-Ate<J^
X
34 »
Fig. 31.6. Biplot of chromatographic retention times in Table 31.2, after column-centering of the data. Two unipolar axes and one bipolar axis have been drawn through the representations of the methods DMSO and methylenedichloride (CH2CI2). The projections of three selected compounds are indicated by dashed lines. The values read off from the unipolar axes reproduce the retention times in the corresponding columns. The values on the bipolar axis reproduce the differences between retention times.
hand, and between dioxane, THF and methylenedichloride, on the other hand. The correlation between the three alcohols cannot be established accurately on this plot because of their proximity to the origin. There are two outstanding poles on this biplot. DMSO and dimethylchloride are at a large distance from the origin and from one another. These poles are the most likely candidates for the construction of unipolar axes. As has been explained in the previous section, perpendicular projections of points (representing compounds) upon a unipolar axis (representing a method) leads to a reproduction of the data in Table 31.3. In this case we have to substitute the untransformed value x^^ in eq. (31.35) by z^. of eq. (31.42):
s7i 7
~'^ij
-^ij
^j
(31.43)
Since m^ is a constant for the given unipolar axis through the yth method, we obtain that the projections on this axis are equal to x^j minus a constant. The unipolar axis through the origin and DMSO reproduces rather well the data in the corresponding column of Table 31.2. By perpendicular projection of the
122
center of the circles we can read off the approximate retention times on this unipolar axis. For example, the compound with substituents F-NO2 projects at about a value of 28 which compares well with the tabulated value of 27.62. The DMSO axis separates compounds with NO2 and methoxy substituents from the others. The unipolar axis through methylenedichloride also reproduces the data in the corresponding column of Table 31.2. This additive seems to selectively delay the elution of compounds with dimethylamine and some methoxy substituents. Finally, we have constructed a bipolar axis through DMSO and methylenedichloride. By perpendicular projection of the centers of the circles upon this bipolar axis we obtain the differences in retention times obtained respectively with DMSO and with methylenedichloride. Using a similar reasoning as developed above for unipolar axes we can perform a substitution of eq. (31.42) in eq. (31.38) which leads to: sj{\j
-lj.)
= Zij -Zij'=Xij -x.y-{mj
-my)
(31.44)
This shows that the bipolar axis reproduces the difference x^j - x- between method7 a n d / of the table minus a constant (nij - m^). This bipolar axis defines a clear contrast between the N02-substituted compounds and the others. For example, projection of the compound with substituents dimethylamine-N02 defines a DMSO-methylenedichloride contrast of about 16 (Fig. 31.6), where the real difference is 42.38 - 26.07 or 16.31 (Table 31.2). Graphical estimation of the same contrast for the MeO-MeO compound yields a value o f - 1 , which compares well with the difference between tabulated data of 21.27 - 22.19 or -0.92. 31.3.3 Column-standardization Column-standardization is the most widely used transformation. It is performed by division of each element of a column-centered table by its corresponding column-standard deviation (i.e. the square root of the column-variance): (Xij -JHj)
Zij = -^
—
with / = 1, ..., n and; = 1,..., p
(31.45)
with
d]=^l(^,-mjr In the context of data analysis we divide by n rather than by (n - 1) in the calculation of the variance. This procedure is also called autoscaling. It can be verified in Table 31.5 how these transformed data are derived from those of Table 31.4.
123 TABLE 31.5 Atmospheric data from Table 31.1, after column-standardization. Na 0 90 180 270
1.681 -0.411 -0.949 -0.321
m^ d.
0 1
CI 1.683 -0.399 -0.947 -0.337 0 1
Si
in„
d.
-0.413 -1.228 0.122 1.519
0.984 -0.679 -0.591 0.287
1.394 0.782 0.777 0.917
0 1
0
—
—
1
In the corresponding column-standardized biplot of Fig. 31.7 we find all representations of the eight chromatographic methods more or less at the same distance from the origin of space. The circle is distorted because of the large difference between the contributions of the first and second latent variables (95 and 4%) and the choice of a = (3 = 0.5 which has been made at the outset. The combined effect is an apparent dilation of the vertical axis. The distances between compounds in Fig. 31.7 are not notably affected by the transformation in comparison with the previous Fig. 31.6. This biplot allows more easily to perceive the correlations between measurements. Three clusters are now put in evidence, namely (1) DMSO and DMF, (2) ethanol and propanol, (3) octanol, dioxane, THF and methylenedichloride. The line segments drawn from the origin have been added to emphasize these groupings. Unipolar axes could have been defined here in the same way as in Fig. 31.6. Bipolar axes on the column-standardized biplot, however, cannot be interpreted directly in terms of the original data in X. 31.3.4 Log column-centering The transformation by log column-centering consists of taking logarithms followed by column-centering. The choice of the base of the logarithms has no effect on the interpretation of the result, but decimal logs will be used throughout. y^j = log^.) z.. = y..~m^ with 1 ""
with / = 1,..., n andj = 1,...,/?
(31.46)
124
4%
# a-N02
• B-r
• a-LPr
• ' ^ '^
B-Hm B-Et
Fig. 31.7. Biplot of chromatographic retention times in Table 31.2, after column-standardization of the data. Unipolar semi-axes have been drawn through all points representing methods. The particular arrangement of the methods is indicative for the presence of a strong size component in the data.
In this case it is required that the original data in X are strictly positive. The effect of the transformation appears from Table 31.6. Column-means are zero, while column-standard deviations tend to be more homogeneous than in the case of simple column-centering in Table 31.4 as can be seen by inspecting the corresponding values for Na and CI. In the log column-centered biplot of Fig. 31.8 one observes that the centroid of the compounds coincides with the origin. Also, the chromatographic methods are at a more uniform distance from the origin than it is the case with simple column-centering in Fig. 31.6. The effect of log column-centering is to reduce the heterogeneity of the variances. The result is close to that of column-standardization in Fig. 31.7. While the logarithmic function reduces the effect of large values it enhances that of the smaller ones. In the chromatographic application this is an advantage, as appears from the widening of the cluster of compounds with the substituents CF3, F, H, methyl, ethyl, /-propyl, r-butyl on the left side of Fig. 31.7. With log column-centering we obtain unipolar axes by substituting eq. (31.46) in eq. (31.35): ^j=^ij
'•ytj ~^j
=^^s(Xij)-^j
(31.47)
125 TABLE 31.6
Atmospheric data from Table 31.1, after log column-centering. Na
CI
Si
m^
d.
0 90 180 270
0.4183 -0.0507 -0.3517 -0.0159
0.4326 -0.0445 -0.3690 -0.0191
-0.0297 -0.1181 0.0200 0.1278
0.2738 -0.0711 -0.2336 0.0309
0.3479 0.0785 0.2945 0.0751
m^
0 0.2746
0 0.2853
0 0.0888
^
0 —
0.2343
where rrij is the logarithm of the geometric mean of theyth column in Table 31.2, which is a constant for this unipolar axis. Hence, the unipolar axis through DMSO reproduces the logarithms of the data in the corresponding column of the table. Note that the unipolar axis in Fig. 31.8 has been constructed with logarithmical subdivisions. A similar construction applies to the unipolar axis through methylenedichloride. The bipolar axis through DMSO and methylenedichloride defines logarithms of the ratio of the data in the corresponding columns. By substitution of eq. (31.46) in eq. (31.38) and after rearrangement we obtain that: s^ ( • . / - • ' . / ) = ^ ( / - ^ ( / ' =yij -yty - ( ^ . / ~ ^ / )
(xA = log
(31-48) -(mj-my)
Consequently, the bipolar axis represents the (log) ratio of columns j and / in the original data table up to the term nij - rrif which is a constant for the given bipolar axis. Note that the axis which represents the ratio of retention times with DMSO/methylenedichloride is also divided according to a logarithmic scale. In Fig. 31.8 we determine the DMSO/methylenedichloride ratio of the dimethylamine-N02 substituted compound approximately at 1.6, where the computed ratio from Table 31.2 is 42.38/26.07 or 1.63. 31.3.5 Log double-centering Preprocessing by log double-centering consists of first taking logarithms, and then to center the data both by rows and by columns: y^j = log(x^j) ^ =yij -rrii
with / = 1, ..., n andy = 1,...,/? -rrij^m
(31.49)
126
\3% r 4 \ ^
• B-CF3
^^v^
H-W02
F-JI02
9r-F
DMSO DMFW^LS
j
i%
ETBANOL^ PROPAITOL"
8
\^^ ^
^
4
2
l^-^-"^^
OCTAWOIQ
f CHF #toO-Ph«^v^^3| DJOXAN
•^'•-PhB
M«0-Ha ^^,y^
^^MAO-K02
NO2-Mm
^ ^ • f-ite
R-tBu
^y^
N02
^ • H-H
^
2.s^^
r
2
NMtt^-KO^'^
CB2C12*f^
%H»-MQO
97%*^ Fig. 31.8. Biplot of chromatographic retention times in Table 31.2, after log column-centering of the data. The values on the bipolar axis reproduce the (log) ratios between retention times in the two corresponding columns.
with 1
and
n
p
m=— ^ ^
y..
It is assumed that the original data in X are strictly positive. As is evident from Table 31.7 both the row-means m„ and the column-means m^ of the transformed table Z are equal to zero. The biplot of Fig. 31.9 shows that both the centroids of the compounds and of the methods coincide with the origin (the small cross in the middle of the plot). The first two latent variables account for 83 and 14% of the inertia, respectively. Three percent of the inertia is carried by higher order latent variables. In this biplot we can only make interpretations of the bipolar axes directly in terms of the original data in X. Three prominent poles appear on this biplot: DMSO, methylenedichloride and ethylalcohol. They are called poles because they are at a large distance from the origin and from one another. They are also representative for the three clusters that have been identified already on the column-standardized biplot in Fig. 31.7.
127
TABLE 31.7 Atmospheric data from Table 31.1, after log double-centering Na
CI
Si
m„
dn
0 90 180 270
0.1446 0.0204 -0.1181 -0.0468
0.1589 0.0266 -0.1354 -0.0500
-0.3034 -0.0470 0.2536 0.0969
0 0 0 0
0.2146 0.0333 0.1794 0.0685
m^
0 0.0968
0 0.1082
0 0.2049
—
d.
0 0.1450
A bipolar axis through columns7 a n d / can be interpreted in the same way as in the log column-centered case (eq. (31.48)) since the terms rrij and m^/ cancel out. The first (close to horizontal) axis between DMSO and ethanol represents the (log)ratios of the corresponding retention times. They can be read off by vertical projection of the compounds on this scale. Note that the scale is divided logarithmically. In the same way, one can read off the (log)ratios of methylenedichloride and ethanol from the second (close to vertical) axis on Fig. 31.9. Graphical estimation of these contrasts for the dimethylamine-N02 substituted chalcone produces 9.5 on the DMSO/ethanol axis and 6.2 on the methylenedichloride/ ethanol axis of Fig. 31.9. The exact ratios from Table 31.2 are 10.00 and 6.14, respectively. The first bipolar axis (DMSO/ethanol) accounts for the contrast between compounds with NO2 substitutions and those without. Compounds with a NO2 substituent systematically obtain higher scores on this bipolar axis than others. The second bipolar axis (methylenedichloride/ethanol) seems to produce an order of the substituents according to their electronic properties. To emphasize this point we have reproduced the log double-centered biplot again in Fig. 31.10. The dashed line near the middle separates the class of NO2 substituted chalcones from the other compounds. Further, we have joined substituents by line segments according to the sequence CF3, F, H, methyl, ethyl, /-propyl, ^butyl, methoxy, phenyl and dimethylamine. The electronic properties of these substituents vary progressively from electron acceptors to electron donors [11] in accordance with their scores on the second bipolar axis. The size component which may be strongly present (as in this chromatographic application) is eliminated by the operation of double-centering. Hence, doublecentered latent variables only express contrasts. In column-centered biplots one may find that one latent variable expresses mainly size and the others mainly contrasts. In general, none of the latter is a pure component of size or of contrasts. If we want to see size and some contrasts represented in a biplot, column-centering
128
114% # MmO-Phm
. — • 194m2-N02
€.5
—
Mm-Phmm^ CH2Cl2i 3 tf
DIOXAJ p i ^ j ^ Mm -MmO • • MmO-Mm V ^ •
•r-M.O
\ \ aoczAMOL
a-tBu •
0M.O-WO2 -^ \ «02-B% 4.
^02-T \ ^ ^ ^y^^^io
^yp, and with column-closure when n.A= 0.322 j 1.423
Epilepsy
4 Fig. 32.6. (a) Generalized score plot derived by correspondence factor analysis (CFA) from Table 32.4. The figure shows the distance of Triazolam from the origin, and the distance between Triazolam and Lorazepam. (b) Generalized loading plot derived by CFA from Table 32.4. The figure shows the distance of epilepsy from the origin, and the distance between epilepsy and anxiety.
SL^ = A A « A P B T = A A B T
=Z
(32.50)
provided that a + |3 = 1. The most common choices for the exponents in CFA are a = (3 = 1 and a = (3 = 0.5. The choices a = 1, p = 0 and a = 0, p = 1 will produce biplots that are
192
H X.2= 0.188 Triazolam Sleep Clonazepam O -1
Lorazepam O Anx?etv
\= DiazepaS
0.322 1 D Epilepsy
Fig. 32.7. CFA biplot resulting from the superposition of the score and loading plots of Figs. 32.6a and b. The coordinates of the products and the disorders are contained in Table 32.9.
multidimensional extensions of the triangular (also called barycentric) diagrams that are used for representing ternary mixtures. The reconstruction Z* of the transformed contingency table Z in a reduced space of latent vectors follows from: S*L*'r = Z*
(32.51)
where S* and L* are obtained from the first r* columns of S and L, with r* < r. It is assumed that the columns of S and L are arranged in decreasing order of their corresponding singular values. The r*-dimensional subspace of dominant latent vectors can be interpreted as a (hyper)plane of closest fit to the pattern of points represented in the r-dimensional space of latent vectors. This is analogous to the interpretation of the results of principal components analysis (PCA) which has been discussed in Chapters 17 and 31. The goodness of the fit can be judged from the relative contribution y of the first r* latent vectors to the global interaction, which can be expressed in the form:
y=
lK'lK
(32.52)
CFA can also be defined as an expansion of a contingency table X using the generalized latent vectors in A, B and the singular values in A:
193
with / = 1,..., n and J = 1,..., p
1 + Z ^ i t ^ / ^ ^ . jk
(32.53)
J
as can be derived from eqs. (32.28) and (32.38).
32.6A Application CFA is applied to the contingency Table 32.10 which has been adapted from a retrospective study of US doctorates in chemistry and in other fields awarded to men and women [11]. The rows of this table refer to consecutive time intervals from 1920 up to 1989. The columns indicate non-overlapping categories of doctorates. The marginal sums provide the totals by time interval and by category of doctorate. These will be used for the assignment of weights to the rows and columns of the table. Note that the data from 1920 up to 1959 have been grouped into intervals of 10 or 5 years in order to level out statistical fluctuations due to small counts, especially in the category of women doctorates in chemistry. This pooling of rows does not affect the subsequent analysis by CFA. Indeed, rows (or columns) that are similar can be grouped together, provided that the corresponding weights are also added together. This is called the principle of distributional equivalence of CFA [2]. Casual inspection of the table reveals that the total number of doctorates has reached a peak value of 33727 in 1973 which has only been surpassed in the last year of the time series. Peak values are also observed in the 1973 category of men in other fields and in the 1970 category of men in chemistry. In both categories related to women, no such peak values are observed as the annual counts seem to increase more or less steadily over the whole period. A more comprehensive analysis is obtained by CFA, the results of which are summarized in Tables 32.11 and 32.12. The first column of these tables contains the diagonal elements of the metric matrices W^ and W^. These normalized weight coefficients are proportional to the marginal sums of the original contingency Table 32.10 and sum to unity. The following three columns form the matrix of generalized scores S and the matrix of generalized loadings L. These matrices satisfy the relations (eq. (39.41)): S = AA
and
L = BA
which follows from the generalized SVD (eq. (32.50)): Z = AART where:
with
A^W^A = B^W^B = I,
194 TABLE 32.10 Doctorates awarded in the US between 1920 and 1989 [11] Year
W o m e n in chemistry
1989
142 256 221 222 228 45 66 66 66 93 94 94 120 139 146 180 172 202 179 174 192 189 180 196 220 255 235 271 297 320 362 396 406 429 497
Total
7350
1920-1929 1930-1939 1940-1949 1950-1954 1955-1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988
M e n in chemistry
W o m e n in other fields
M e n in other fields
Total
1782
1674
8291
11889
3707
3507
18116
25586
5105
3871
21358
30555
4950
3370
30102
38644
4840
4388
34714
44170
1060
1045
7850
10000
1162
1185
9692
12105
1163
1185
9692
12106
1163
1186
9691
12106
1427
1702
13865
17087
1702
13865
17088
1427 1427
1702
13865
17088
1644
2295
16236
20295
1643
2761
18291
22834
1801
3225
20562
25734
2043
3764
23449
29436
2032
4403
25165
31772
1809
5080
25910
33001
1670
5903
25975
33727
1618
6241
24967
33000
1570
7001
24150
32913
1434
7487
23813
32923
1390
7665
22437
31672
1349
8117
21188
30850
1347
8701
20932
31200
1283
9132
20312
30982
1376
9637
20071
31319
1407
9786
19584
31048
1462
10188
19243
31190
1445
10340
19148
31253
1474
10337
19028
31201
1507
10848
19019
31770
1568
10964
19340
32278
1589
11361
20077
33456
1474
12013
20335
34319
65148
203766
680333
956597
195 TABLE32.il Weights, scores, distances, contributions and precisions from CFA applied to Table 32.10 /
w.
Sl
1920-1929
0.012
-0.96
0.95
0.11
1930-1939
0.027
-0.98
0.85
0.05
1940-1949
0.032
-1.18
1.10
1950-1954
0.040
-1.35
1955-1959
0.046
1960
0.010
1961 1962
S2
K
In
T^n
1.35
0.023
0.999
1.29
0.045
0.999
-0.11
1.62
0.084
0.995
0.37
-0.03
1.40
0.079
1.000
-1.17
0.15
-0.03
1.18
0.064
1.000
-1.11
0.11
-0.06
1.12
0.013
0.998
0.013
-1.12
-0.05
0.02
1.12
0.016
1.000
0.013
-1.12
-0.05
0.02
1.12
0.016
1.000 1.000 0.998
S3
1963
0.013
-1.12
-0.05
0.02
1.12
0.016
1964
0.018
-1.05
-0.23
0.04
1.07
0.021
1965
0.018
-1.05
-0.23
0.04
1.07
0.021
0.998
1966
0.018
-1.05
-0.23
0.04
1.07
0.021
0.998
1967
0.021
-0.92
-0.21
0.05
0.94
0.019
0.998 0.995
1968
0.024
-0.81
-0.31
0.06
0.87
0.018
1969
0.027
-0.77
-0.33
0.04
0.83
0.019
0.998
1970
0.031
-0.74
-0.32
0.06
0.81
0.020
0.995
1971
0.033
-0.63
-0.36
0.02
0.73
0.018
0.999
1972
0.034
-0.45
-0.43
0.05
0.63
0.014
0.994
1973
0.035
-0.25
-0.44
-0.00
0.51
0.009
1.000
1974
0.034
-0.13
-0.39
-0.03
0.41
0.006
0.996
1975
0.034
0.08
-0.32
-0.03
0.33
0.004
0.990
1976
0.034
0.22
-0.32
-0.05
0.39
0.005
0.983
-0.08
0.44
0.006
0.971
-0.08
0.56
0.010
0.980
1977
0.033
0.34
-0.26
1978
0.032
0.53
-0.18
1979
0.033
0.67
-0.12
-0.07
0.68
0.015
0.990
1980
0.032
0.82
-0.07
-0.04
0.82
0.022
0.998
1981
0.033
0.91
0.01
-0.10
0.92
0.028
0.990
1982
0.032
0.98
0.07
-0.06
0.98
0.031
0.996
1983
0.033
1.07
0.14
-0.05
1.08
0.038
0.998
1.13
0.041
1.000
1984
0.033
1.12
1985
0.033
1986
0.033
1987
0.15
-0.02
1.12
0.18
0.04
1.14
0.042
0.999
1.21
0.23
0.06
1.24
0.051
0.998
0.034
1.19
0.24
0.06
1.22
0.050
0.998
1988
0.035
1.20
0.23
0.07
1.22
0.052
0.996
1989
0.036
1.32
0.22
0.14
1.34
0.065
0.990
196 TABLE 32.12 Weights, loadings, distances, contributions and precisions from CFA applied to Table 32.10
Women in chemistry Men in chemistry Women in other fields Men in other fields
0.008 0.068 0.213 0.711
1 ( X,i X^
^
5
1,
1.
1.
0.99 -1.46 1.69 -0.38
0.72 1.20 0.18 -0.18
0.66 -0.03 -0.02 0.00
^P
1.39 1.89 1.70 0.42
with / = 1, ..., n andy = 1, ...,p
0.015 0.244 0.617 0.124
0.770 1.000 1.000 1.000
(32.54)
\^i-\- ^+j
and where r is the rank of Z. The rank r is at most equal to the smaller ofn-l and p- 1. Hence, in the present application to a 35x4 contingency table we can at most obtain three distinct latent vectors. For practical reasons we have divided the transformed data (eq. (32.7)) by the global distance of chi-square 6 (eq. (32.24)). This type of normalization ensures that the global interaction in the data equals unity:
The three latent vectors account for respectively 86, 13 and 1% of the interaction. The next two columns in Tables 32.11 and 32.12 show the distances 5„ and 6^ of rows and columns from the origin of space and their contributions y^ and y^ to the interaction:
S?=I4
and
y,. =w, 5?
with / = 1, ..., n (32.55)
s]=i:^
and
y^ =Wj 5]
withy = 1, ...,p
The final column in Tables 32.11 and 32.12 lists the precisions n^ and n^ with which the rows and columns are represented in the plane spanned by the first two latent vectors:
197
M&n Chemistry
m
®^40-49 j
\.®
1
20-29
® 30-39 ||IlVoi77e;3
Chemistry
^50-54
1
nf
Men Other
Fields
j^p' 0^^-9^O
Women Other
Fields
^^^^^^m^^^^^^i"®^^^^
Fig. 32.8. CFA biplot computed from the data in Table 32.10. Circles represent years and squares identify the four educational categories. The centre of the plot is represented by a small cross. The coordinates of the years and the categories are contained in Tables 32.11 and 32.12. Factor scaling coefficients were defined as a = P = 1.
with / = 1, ..., n k
(32.56) with7= 1, ...,/7 Figure 32.8 shows the biplot constructed from the first two columns of the scores matrix S and from the loadings matrix L (Table 32.11). This biplot corresponds with the exponents a = 1 and p = 1 in the definition of scores and loadings (eq. (39.41)). It is meant to reconstruct distances between rows and between columns. The rows and columns are represented by circles and squares respectively. Circles are connected in the order of the consecutive time intervals. The horizontal and vertical axes of this biplot are in the direction of the first and second latent vectors which account respectively for 86 and 13% of the interaction between rows and columns. Only 1% of the interaction is in the direction perpendicular to the plane of the plot. The origin of the frame of coordinates is indicated
198
by a small cross (+) near the centre of the biplot. As can be seen from the precisions K^ in Table 32.11, all years are well-represented in the plane of the biplot. The precisions n^ in Table 32.12, however, show that the category of women in chemistry is not so well-represented in the biplot as its precision amounts only to 0.770. It is readily evident that the most dominant (horizontal) latent variable reflects a contrast between women and men. From 1966 onwards there appears a sustained increase of the proportion of women doctorates. The increase is most prominent during the seventies as evidenced by the relatively large distances between adjacent time intervals from 1971 to 1980. The second (vertical) latent variable is defined by a contrast between chemistry and the other fields. Contrast is to be understood here as a difference between profiles. Initially there is a progressive decrease in the share of chemistry doctorates when compared to those in other fields. Around 1973, however, the trend reverses and the proportion of chemical degrees rises slowly, but steadily. The correspondences between rows and columns are evidenced by the positions of the circles and squares with respect to the origin of the plot. Those points that are closest to the origin are most similar to the average profile. Those that are further away show specific differences with respect to the average profile. Circles and squares that moved toward the border of the biplot, and in the same direction, possess a positive correspondence. They seem to attract each other. Those that moved in the opposite direction demonstrate a negative correspondence. They seem to repel each other. For example, the category of chemistry doctorates awarded to men exhibits a positive correspondence with the early years, and at the same time a negative correspondence with the more recent years. A mechanical analogy of forces of attraction and repulsion between a circle and a square is appropriate here. One should refrain, however, from judging distances between circles and squares. As stated already before, their closeness is not a measure of their correspondence, only the distances from the origin and their angular distance matter. In the case when one of the two measurements of the contingency table is divided in ordered categories, one can construct a so-called thermometer plot. On this plot we represent the ordered measurement along the horizontal axis and the scores of the dominant latent vectors along the vertical axis. The solid line in Fig. 32.9 displays the prominent features of the first latent vector which, in the context of our illustration, is called the women/men factor. It clearly indicates a sustained progress of the share of women doctorates from 1966 onwards. The dashed line corresponds with the second latent vector which can be labelled as the chemistry/ other fields factor. This line shows initially a decline of the share of chemistry and a slow but steady recovery from 1973 onwards. The successive decline and rise are responsible for the horseshoe-like appearance of the pattern of points representing
199 1.5 n Score ^ Chemistry / Other Fields 0.5 H
-0.5
1973 Women / Men
1966
-1.5 20
30
40
50
60
70
80
90 Years
Fig. 32.9. Thermometer plot representing the scores of the first and second component of a CFA applied to Table 32.10. The solid line denotes the first component which accounts for the women/men contrast in the data. The broken line corresponds with the second component which reveals a contrast between chemistry and other fields.
the rows in the biplot of Fig. 32.8. This phenomenon is also called the Guttman effect [2]. The effect is due to the fact that a linearly varying series of scores or loadings on factor 1 (e.g. 1, 2, 3) can be uncorrelated or orthogonal to a parabolic series on factor 2 (e.g. -30, 0, 10). For the purpose of comparison, we also discuss briefly the biplot constructed from the CFA using the exponents a = 0.5 and (3 = 0.5 (Fig. 32.10). Such a display is meant to reconstruct the values in the transformed contingency table Z by projections of points representing rows upon axes representing columns (or vice versa): cf. eq. (32.46) where: S = AA 1/2
and
L = BA 1/2
cf. eq. (32.41)
This type of biplot does not reproduce distances or angles accurately, especially when A,j and A-2 are very distinct. It can be shown that the distortion of distances that occurs in the vertical direction is proportional to the square root of \IX2. The advantage of this type of biplot is that it allows us to construct bipolar axes for any
200
Fig. 32.10. CFA biplot computed from the data in Table 32.10. Factor scaling coefficients were defined as a = p = 0.5. This definition allows us to draw bipolar axes through the four educational categories, showing the contrast between women and men (horizontally) and between chemistry and other fields (vertically).
pair of categories. For example, the more horizontal axes represent the contrasts between women and men. The more vertical axes display contrasts between chemistry and the other fields. By perpendicular projection of the circles upon any bipolar axis one obtains a relative ordering of the time intervals according to a particular contrast. In the case of the women/men contrast in other fields (the slightly positively sloping axis) we first note a somewhat stable situation until 1966, when the contrast evolves rapidly in favour of women doctorates. Similarly, the chemistry/other fields contrast in women (the more vertical axis at the right) shows a strong regression of chemistry until about 1975. In later years the situation seems to have stabilized. Both types of symmetric displays exhibited in Figs. 32.9 and 32.10 have their merits. They are called symmetric because they produce equal variances in the scores and in the loadings. In the case when a = p = 1, we obtain that the variances along the horizontal and vertical axes are equal to the eigenvalues'h?associated to the dominant latent vectors. In the other case when a = (3 = 0.5, the variances are found to be equal to the singular values X.
201
32.7 Log-linear model 32J.1 Historical introduction The log-linear model (LLM) is closely related to correspondence factor analysis (CFA). Both methods pursue the same objective, i.e. the analysis of the association (or correspondence) between the rows and columns of a contingency table. In CFA this can be obtained by means of double-closure of the data; in LLM this is achieved by means of double-centring of the logarithmic data. According to Andersen [12] early applications of LLM are attributed to the Danish sociologist Rasch in 1963 and to Andersen himself. Later on, the approach has been described under many different names, such as spectral map analysis [13,14] in studies of drug specificity, as logarithmic analysis in the French statistical literature [15] and as the saturated RC association model [16]. The term log-bilinear model has been used by Escoufier and Junca [17]. In Chapter 31 on the analysis of measurement tables we have described the method under the name of log double-centred principal components analysis. Here we develop the method specifically from the point of view of contingency tables and within the context of weighted metrics. We will show that LLM differs only from CFA in the type of preprocessing that is applied to the contingency table. The results of both approaches are often similar when there are no extreme contrasts in the data. 32.7.2 Algorithm We assume that X is a contingency table in which all elements are positive. We are also given a vector of weights (or masses) w^ for the rows and another set of weights w^ for the columns of the table. For convenience, we assume that all weight coefficients are normalized to unit sum. These weights can be defined proportionally to the marginal sums of the table, although in LLM this is not a strict requirement. One may assign the weights according to the objective of the analysis, i.e. by giving a more prominent weight to the more relevant rows or columns than to the others, or by assigning identical weights to rows, to columns or to both rows and columns. The weight vectors define the weighted metrics of the row- and column-spaces: W^ = diag(wJ
and
W^ = diag(w^)
Preprocessing of the contingency table involves the taking of (natural) logarithms: y^j = logiXij)
with / = 1,..., n
followed by double centring:
and
j = 1,..., p
202
Zij =yij
-yi.-y.j
+x.
where y^, yj andy represent the row-, column- and global means in the weighted metrics W„ and W^: y,. = YW^ 1 , = Yw^
or
yy,-., = S w , y,j yij
(32.57)
j n
y..=YT^W,l,=Y^w,
y. J = S
"^j yij
i
n
= 1^ W YW 1 =w^ Yw
or
y..
p
yij
where the sum vectors 1„ and 1^ have been defined before in eqs. (32.15) and (32.16). From this point on, the analysis is identical to that of CFA. Briefly, this involves a generalized singular vector decomposition (SVD) of Z using the metrics W„ and W^, such that: Z = AAB'^
cf. eq. (32.50)
where A is the rxr diagonal matrix of singular values, A is the nxr matrix of generalized latent vectors for rows, B is the pxr matrix of generalized latent vectors for columns, and where r is the rank of Z. The generalized latent vectors A and B are orthogonal in the weighted metrics W^ and W^, which implies that: A^W^A = B^W^B = I, For the same reason as for double-closure, double-centring always reduces the rank of the data matrix by one, as a result of the introduction of a linear dependence among the rows and columns of the data table. Biplots can be constructed in the same way as in the case of CFA, by defining scores S and loadings L, cf. eq. (32.41): S = AA« L = BAP The choice of a = (3 = 1 will reproduce distances between rows and between columns of Z. The choice a = p = 0.5 allows one to reconstruct the values and contrasts of Z by perpendicular projections of points, representing rows, upon axes representing columns (or vice versa). With these two choices for a and (3, the analysis is symmetrical with respect to rows and columns.
203
The conventions of the LLM biplot are similar to those defined for the CFA biplot. Circles represent the row-categories and squares denote the columncategories of the contingency table. When the areas of circles and squares are made proportional to the marginal sums of the table, then the areas confer a notion of size or importance of the row- and column-categories that are represented. The positions of the circles and squares from the origin (which appears as a small cross near the centre) express the degree of difference or contrast of the corresponding rows and columns with respect to their average profiles. Rows or columns that are represented close to the origin show little contrast with their average profiles. Those that are further away will have greater contrasts in their profiles. The direction in which circles and squares move from the origin, in the same or in opposite directions, is an expression of their positive or negative correspondence. The way interaction is expressed in the LLM biplot is identical to that of the CFA biplot. We can also define bipolar axes through two selected row-categories or through two selected column-categories which form the poles of bipolar axes. Additionally, these axes can be calibrated according to the ratios of the two categories that define the axes. Perpendicular projection of the centres of the circles upon any bipolar axis produces an (approximate) reading of the corresponding ratios, as has been explained in Chapter 31. The approximation depends on the precision with which the points are represented in the plane of the plot and is limited by the visual interpolation on logarithmic scales. LLM can also be defined as an expansion of a contingency table X using the generalized latent vectors in A and B and the singular values in A by analogy with eq. (32.53): ^/.
x^j =
^.j
with/= 1, ...,/2
- e x p ^^k^ikt^.
and
7 = 1,...,/?
(32.58)
J
where x^, x j and x are the geometrical row-, column- and global means of X. If the contrasts in the data are small, then the exponents in the above expression will be small also. In that case we can further expand the exponential function, which produces:
k
and which is similar to the expansion in eq. (32.53) obtained with CFA.
204
32.7.3 Application The LLM approach described above has been applied to the 35x4 contingency Table 32.10 after some modification. In this case, we have replaced the pooled data in the first five rows by the corresponding average annual values. Further, columns 1 and 2 have been combined with columns 3 and 4 in order to produce the categories of women in all fields and of men in all fields. The reason for the change is our objective to produce an analysis of the ratios between chemistry and all fields rather than between chemistry and the other fields. In this analysis, weight coefficients for rows and for columns have been defined as constants. They could have been made proportional to the marginal sums of Table 32.10, but this would weight down the influence of the earlier years, which we wished to avoid in this application. As with CFA, this analysis yields three latent vectors which contribute respectively 89,10 and 1% to the interaction in the data. The numerical results of this analysis are very similar to those in Table 32.11 and, therefore, are not reproduced here. The only notable discrepancies are in the precision of the representation of the early vears up to 1972, which is less than in the previous application, and in the precision of the representation of the category of women chemists which is better than in the previous analysis by CFA (0.960 vs 0.770). The overall interpretation of the LLM biplot of Fig. 32.11 is the same as obtained with the CFA biplot of Fig. 32.10. The first (horizontal) latent variable seems to be associated primarily with the women/men contrast, while the second (vertical) latent variable is mostly associated with the chemistry/all fields contrast. Thermometer plots, which represent the scores of the various time intervals as a function of time, are similar to those in Fig. 32.9. They are not reproduced here as they point to the same remarkable events, i.e. the sustained rise of the proportion of women since 1966 and the recovery of the share of chemistry in 1973. In Fig. 32.11 the women/men ratio in 1980 is estimated visually to be 0.190 in chemistry and 0.430 in all fields. The exact ratios as computed from Table 32.10 are 0.198 and 0.435 respectively. The chemistry/all fields ratio in 1970 is visually estimated at 0.090 for men and at 0.041 for women. The exact values from Table 32.10 are 0.080 and 0.046 respectively. The question may be asked why there should be different methods when they provide more or less the same kind of information. One answer is that not all contingency tables will yield results that are agree as well as in the case we have studied here. When there are very large contrasts, the results of CFA and LLM tend to disagree. This occurs when there are several zeroes in the table, which in LLM have to be replaced by small positive values. In that case it may pay off to analyse the same contingency table with different methods, each of which may reveal different aspects of the multivariate patterns [18]. Furthermore, LLM can be
205
Fig. 32.11. Log-linear model (LLM) biplot computed from the data in Table 32.10. Conventions are the same as in Fig. 32.10. The areas of circles (representing years) and of squares (representing categories) are made proportional to the row- and column-totals in Table 32.10.
applied to tables of heterogeneous data, i.e. data that have been recorded with different units, while CFA can only be applied to tables of homogeneous data, unless the heterogeneous scales of measurement are subdivided into discrete categories. References 1. 2. 3. 4. 5. 6. 7.
B.S. Everitt, The Analysis of Contingency Tables. Chapman and Hall, London, 1977. M.J. Greenacre, Theory and Applications of Correspondence Analysis. Academic Press, London, 1984. A. Agresti, Categorical Data Analysis. Wiley, New York, 1990. M.G. Kendall and A. Stuart, The Advanced Theory of Statistics. Vol. II. Ch. Griffin, London, 1960. M.O. Hill, Reciprocal averaging: an eigenvector method of ordination. J. EcoL, 61 (1973) 237-224. J.-P. Benzecri, L'analyse des Donnees. Vol. II, L'analyse des Correspondances. Dunod, Paris, 1973. J.-P. Benzecri, Histoire et Prehistoire de 1'Analyse des Donnees. Dunod, Paris, 1982.
206 8. 9. 10.
11. 12. 13. 14. 15. 16.
17. 18.
M.O. Hill, Correspondence analysis: a neglected multivariate method. Appl. Statist., 23 (1974) 340-355. K.R. Gabriel, The biplot graphic display of matrices with application to principal components analysis. Biometrika, 58 (1971) 453-446. O.M. Kvalheim, Interpretation of direct latent-variable projection methods and their aims and use in the analysis of multicomponent spectroscopic and chromatographic data. Chemom. Intell. Lab. Syst., 4 (1988) 11-25. K.G. Everett and W.S. Deloach, Chemistry doctorates awarded to women in the United States, A historical perspective. J. Chem. Educ, 68 (1991) 545-547. E.B. Andersen, Discussion of paper by L.A. Goodman. Int. Stat. Review, 54 (1986) 271-272. P.J. Lewi, Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneim. Forsch. (Drug Res.), 26 (1976) 1295-1300. P.J. Lewi, Spectral Map Analysis, Analysis of contrasts especially from log-ratios. Chemom. Intell. Lab. Syst., 5 (1989) 105-116. J.B. Kasmierczak, Analyse logarithmique, Deux exemples d'applicarion. Rev. Statist. Appliquee, 33 (1985) 13-24. L.A. Goodman, Some useful extensions of the usual correspondence analysis approach and the usual log-linear models approach in the analysis of contingency table. Int. Stat. Review, 54 (1986)243-309. Y. Escoufier and S. Junca, Least squares approximation of frequencies or their logarithms. Int. Stat. Rev., 54 (1986) 279-283. A. Thielemans, P.J. Lewi and D.L. Massart, Similarities and differences among multivariate display techniques by Belgian Cancer Mortality Distribution data. Chemom. Intell. Lab. Syst., 3(1988)277-300.
207
Chapter 33
Supervised Pattern Recognition 33.1 Supervised and unsupervised pattern recognition In Section 17.9 a method, called linear discriminant analysis, was introduced and applied to derive a rule which would discriminate between wines from three origins. To start with, about 100 wine samples of known origin were analyzed: 8 variables (or features) were determined for each. One wanted to know if, in some way, these results could be used to derive a procedure to determine the origin of new samples. In pattern recognition terminology, this question was rephrased as: use the learning or training objects (i.e. those with known origin) to derive a classification rule which allows to classify new objects with unknown origin in one of three known classes, based on the values of the features of the new object. This is called supervised pattern recognition or supervised learning. Mathematically, this means that one needs to assign portions of an 8-dimensional space to the three classes. A new sample is then assigned to the class which occupies the portion of space in which the sample is located. Supervised pattern recognition is distinct from unsupervised pattern recognition. In the latter one applies essentially clustering methods (Chapter 30) to classify objects into classes that are not known beforehand. In supervised pattern recognition, one knows the classes and has to decide in which of those an object should be classified. Supervised pattern recognition techniques essentially consist of the following steps. 1. 2.
3. 4.
Selection of a training or learning set. This consists of objects of known classification for which a certain number of variables are measured. Feature selection, i.e. the selection of variables that are meaningful for the classification and elimination of those that have no discriminating (or, for certain techniques, no modelling power). This step is discussed further in Section 33.3. Derivation of a classification rule, using the training set. This is the subject of Section 33.2. Validation of the classification rule, using an independent test set. This is described in more detail in Section 33.4.
208
Many books have been published about pattern recognition; one of these is directed towards chemometrics [1].
33.2 Derivation of classification rules 33,2.1 Types of classification rules There are many types of pattern recognition which essentially differ in the way they define classification rules. In this section, we will describe some of the approaches, which we will then develop further in the following sections. We will not try to develop a classification of pattern recognition methods but merely indicate some characteristics of the methods, that are found most often in the chemometric literature and some differences between those methods. A first distinction which is often made is that between methods focusing on discrimination and those that are directed towards modelling classes. Most methods explicitly or implicitly try to find a boundary between classes. Some methods such as linear discriminant analysis (LDA, Sections 33.2.2 and 33.2.3) are designed to find explicit boundaries between classes while the k-nearest neighbours (A;-NN, Section 33.2.4) method does this implicitly. Methods such as SIMCA (Section 33.2.7) put the emphasis more on similarity within a class than on discrimination between classes. Such methods are sometimes called disjoint class modelling methods. While the discrimination oriented methods build models based on all the classes concerned in the discrimination, the disjoint class modelling methods model each class separately. LDA and several other supervised methods focus on finding optimal boundaries between classes: their first goal is to discriminate. In Section 17.9 we explained also with an example from clinical chemistry that canonical variates can be determined and plotted to discriminate between classes and that this is a way of performing LDA. The example concerned the thyroid gland. People whose thyroid gland functions normally are called euthyroid (EU) and patients whose thyroid gland is too active or not active enough are called, respectively, hyperthyroid (HYPER) or hypothyroid (KYPO). Clinicians want to make a distinction between the three classes. This can be done using five chemical determinations among which e.g. serum thyroxine or thyroid-stimulating hormone. This is clearly a multivariate (five-dimensional) situation and LDA is applied. Allocation of patients to one of the classes can be achieved in a more formal way by determining the centroid of each group and drawing a boundary half-way between the two centroids [2]. This is shown in Fig. 33.1. Figure 33.1 is typical of many situations in clinical chemistry: it shows a tight normal group (the EU group) and spreading out from it much more disperse
209
a
.2 a
a
Discriminant function 1 Fig. 33.1. Canonical variate plot for three classes with different thyroid status. The boundaries are obtained by linear discriminant analysis [2].
abnormal groups (the HYPO and HYPER groups). In fact, this kind of picture is also found in many non-clinical situations too (see, for instance, the air pollution situation of Fig. 17.10). If one now determines boundaries as described in the preceding paragraph, namely by situating linear boundaries half-way between the centroids of adjacent classes, some patients of the more disperse abnormal classes are classified as members of the more condensed class. This classification problem can then be solved better by developing more suitable boundaries. For instance, using so-called quadratic discriminant analysis (QDA) (Section 33.2.3) or density methods (Section 33.2.5) leads to the boundaries of Fig. 33.2 and Fig. 33.3, respectively [3,4]. Other procedures that develop irregular boundaries are the nearest neighbour methods (Section 33.2.4) and neural nets (Section 33.2.9). One of the problems with discrimination-oriented methods is that we need to classify each object in one of the given classes. It is, however, quite possible that an
210
o o c
•4—•
ac a c
> 3
Q 3 , 3 ^ 3 3
. - - i .
.^r' 1 1 t' I 1 1 1 '* ,'3111 I1 1 1 I \ 31 1 1 I I I I I I n 1 \ :^^ nil 11 11111 i ^'^^ 1 111 111 1 1 1 1 I 1 1 V' 1 2 3 ^ s l l l l l 2 x ' 2 2 2 2 1 "->>^.J-'2 2 2 2
\
3
2 2
2 2
\
3
2
2 2 2
3
2
2
•I
2
\
' 2 2
3
Discriminant function 1 Fig. 33.2. As Fig. 33.1 but with boundaries obtained by quadratic discriminant analysis [3].
object should not be classified in any of these classes. Returning to the wine example, where a classification between wines from three given origins was wanted, we must realize that the sample submitted for classification may, in reality, belong to none of the given origins, but to a fourth one. However, using the discrimination-oriented methods, we will classify it necessarily in one of the three given regions. A different approach to supervised pattern recognition can then be useful. This consists of making a separate model of each class. Objects which fit the model for a class are considered part of it and objects which do not fit are classified as non-members. In discrimination terms, we could say that the class model discriminates between membership and non-membership of a certain class. In statistical terms, we can state that these methods perform outlier tests. The conceptually simplest model, which for reasons explained later is called UNEQ, is based on the multivariate normal distribution. Suppose we have carried
211
c a c
3
X 1
\ 3| / 1 1 I 1/ 1 11 1 ;3111 II \ 31 M 11111 iiim ' il 1111 ^ " 1' 111 111112' 3N ••••• ^/ V 11 1 1 1 1 1 1 11 1 1 1 1 1 / 1 2 ^\ 1 11 1 1 2 ^ ^ 2 2 2 2 1 \ 1 , ''2 2 2 \ 1 / 2 2
/2
3 3 3 3
\>
2
2 2
2 2 2
Discriminant function 1 Fig. 33.3. As Fig. 33.1 but with boundaries obtained by a density method [4].
out two tests, jCj and X2, with which we want to describe the health of a patient. Only healthy patients are investigated. In Fig. 33.4, we could say that the ellipse describing the 95% confidence limit for a bivariate normal distribution can be considered as a model of the class of healthy patients. Those within its limits are considered healthy and those outside would be considered non-members of the healthy class. The bivariate normal distribution is therefore a model of the healthy class. In three dimensions (Fig. 33.5), the model takes the shape of an ellipsoid and in m dimensions, we must imagine an m-dimensional hyperellipsoid. In the figure, two classes are considered and we now observe that four situations can be encountered when classifying an object, namely: (a) the object is part of class K, (b) the object is part of class L, (c) the object is not a member of class K or L: it is an outlier, and
Fig. 33.4. Ninety-five percent confidence limit for a bivariate distribution as class envelope.
Fig. 33.5. Class envelopes in three dimensions as derived from the three-variate normal distribution.
(d) if K and L overlap, the object can be a member of both classes K and L: it is situated in a region of doubt.
UNEQ is applied only when the number of variables is relatively low. For more variables, one does not work with the original variables, but rather with latent variables. A latent variable model is built for each class separately. The best known such method is SIMCA.
We also make a distinction between parametric and non-parametric techniques. In the parametric techniques, such as linear discriminant analysis, UNEQ and SIMCA, statistical parameters of the distribution of the objects are used in the derivation of the decision function (almost always a multivariate normal distribution is assumed). The most important disadvantage of the parametric methods is that certain statistical requirements must be fulfilled to apply them correctly. The non-parametric methods, such as nearest neighbours (Section 33.2.4), density methods (Section 33.2.5) and neural networks (Section 33.2.9 and Chapter 44), are not explicitly based on distribution statistics. The most important advantage of the parametric methods is that probabilities of correct classification can be estimated more easily than with most non-parametric methods.

33.2.2 Canonical variates and linear discriminant analysis

LDA is the best studied method of pattern recognition. It was originally proposed by Fisher [5] and is applied very often in chemometrics. Applications can be found, for instance, in the classification of Eucalyptus oils based on gas-chromatographic data [6], the automatic recognition of substance classes from GC/MS [7], the recognition of tablets and capsules with different dosages from NIR spectra [8] and in the already cited clinical chemical example (see Section 33.1). It appears that there are several ways of deriving essentially the same methodology. This may be confusing and, following a short article by Fearn [9], we will try to explain the different approaches. A detailed overview is found in the book by McLachlan [10].
Let us first consider two classes K and L in a bivariate space (x1, x2). Figure 33.6a shows the objects in this space. In Fig. 33.6b bivariate probability ellipses are drawn, representing the normal (bivariate) probability distributions to which the objects belong. Since there are two classes, there are two such ellipses. Basically, an object will be classified in the class for which it has the highest probability. In Fig. 33.6b, object A is classified in class K because it has a (much) higher probability in K than in L.
In Fig. 33.6c, an additional ellipse is drawn for each class. These ellipses both represent the same probability level in their respective classes; they touch in point O, half-way between the two class centres. Line a is the tangent to the two ellipses in point O. Any point to the left of it has a higher probability of belonging to K; for any point to the right of it, it is more probable that it belongs to L. Line a can therefore be used as a boundary separating K from L. In practice, we prefer an algebraic way to define the boundary. For this purpose, we define line d, perpendicular to a. One can project any object or point on that line; in Fig. 33.6c this is done for point A. The location of A on d is given by its score on d:

D = w0 + w1 x1 + w2 x2    (33.1)

When working with standardized data, w0 = 0. The coefficients w1 and w2 are derived in a way described later, such that D = 0 in point O, D > 0 for objects belonging to L and D < 0 for objects of K. This then is the classification rule.
Fig. 33.6. (a) Two classes K and L to be discriminated; (b) confidence limits around the centroids of K and L; (c) the iso-probability confidence limits touch in O; a is a line tangential to both ellipses; d is the optimal discriminating direction; A is an object.
It is observed that D as defined by eq. (33.1) is a latent variable, in the same way as a principal component. We can therefore consider LDA, as was the case for principal components analysis, as a feature reduction method. Let us again consider the two-dimensional space of Fig. 33.6. For feature reduction, we need to determine a one-dimensional space (a line) on which the points are projected from the higher, here two-dimensional, space. However, while principal components analysis selects a direction which retains maximal structure among the data in a lower dimension, LDA selects a direction which achieves maximum separation among the given classes. The latent variable obtained in this way is a linear combination of the original variables, called the canonical variate. When there are k classes, one can determine k - 1 canonical variates. In Fig. 33.1, two canonical variates are plotted against one another for three overlapping classes. A new sample can be allocated by determining its location in the figure.
Fig. 33.7. A univariate classification problem.
The second way of introducing LDA, due to Fisher, is to rotate a line through O until the optimal discriminating direction is found (d in Fig. 33.6c). This rotation is determined by the values of w1 and w2 in eq. (33.1). These weights depend on several characteristics of the data. To understand which ones, let us first consider the univariate case (Fig. 33.7). Two classes, K and L, have to be distinguished using a single variable, x1. It is clear that the discrimination will be better when the distance between x̄K and x̄L (i.e. the mean values, or centroids, of x1 for classes K and L) is large and the width of the distributions is small or, in other words, when the ratio of the squared difference between the means to the variance of the distributions is large. Analytical chemists would be tempted to say that the resolution should be as large as possible. When we consider the multivariate situation, it is again evident that the discriminating power of the combined variables will be good when the centroids of the two sets of objects are sufficiently distant from each other and when the clusters are tight or dense. In mathematical terms, this means that the between-class variance is large compared with the within-class variances. In the method of linear discriminant analysis, one therefore seeks a linear function of the variables, D, which maximizes the ratio between these two variances. Geometrically, this means that we look for a line through the cloud of points, such that the projections of the points of the two groups on this line are separated as much as possible. The approach is comparable to principal components analysis, where one seeks the line that best explains the variation in the data (see Chapter 17). The principal component line and the discriminant function often more or less coincide (as is the case in Fig. 33.8a), but this is not necessarily so, as shown in Fig. 33.8b.
Generalizing eq. (33.1) to m variables, we can write:

D = w^T x + w0    (33.2)

where it can be shown [10] that the weights w are determined for a two-class discrimination as

w^T = (x̄1 - x̄2)^T S^-1    (33.3)
Fig. 33.8. Situation where the principal component (PC) and the linear discriminant function (DF) are essentially the same (a) and very different (b).
and

w0 = -(1/2) (x̄1 - x̄2)^T S^-1 (x̄1 + x̄2)    (33.4)

In eqs. (33.3) and (33.4), x̄1 and x̄2 are the sample mean vectors that describe the location of the centroids in m-dimensional space, and S is the pooled sample variance-covariance matrix of the training sets of the two classes. The use of a pooled variance-covariance matrix implies that the variance-covariance matrices of both populations are assumed to be the same. The consequences of this are discussed in Section 33.2.3.
Example: a simple two-dimensional example concerns the data of Table 33.1 and Fig. 33.9. The pooled variance-covariance matrix is obtained as [K^T K + L^T L]/(n1 + n2 - 2), where K and L are the centred data matrices of the two classes, i.e. by first computing for each class the centred sums of squares (for the diagonal elements) and the cross-products between variables (for the off-diagonal elements).
Fig. 33.9. LDA applied to the data of Table 33.1; n is a new object to be classified.
The two matrices are then summed and each element is divided by n1 + n2 - k (here 10 + 10 - 2 = 18). As an example, the computation of the cross-term s12 (which is equal to s21) is given in Table 33.2. In the same way we can compute the diagonal elements, yielding

S = ( 2.78   3.78 )
    ( 3.78  10.56 )

and

S^-1 = (  0.70  -0.25 )
       ( -0.25   0.18 )

Since

(x̄1 - x̄2) = ( 6 - 11 ) = ( -5 )
             ( 9 - 7  )   (  2 )
TABLE 33.1
Example data set for linear discriminant analysis

Class 1                        Class 2
Object    x1    x2             Object    x1    x2
1          8    15             11        11    11
2          7    12             12         9     5
3          8    11             13        11     8
4          5    11             14        12     6
5          7     9             15        13    10
6          4     8             16        14    12
7          6     8             17        10     7
8          4     5             18        12     4
9          5     6             19        10     5
10         6     5             20         8     2
Mean       6     9             Mean      11     7
TABLE 33.2
Computation of the cross-product term in the pooled variance-covariance matrix for the data of Table 33.1

Class 1                          Class 2
(8 - 6)(15 - 9) = 12             (11 - 11)(11 - 7) = 0
(7 - 6)(12 - 9) = 3              (9 - 11)(5 - 7) = 4
(8 - 6)(11 - 9) = 4              (11 - 11)(8 - 7) = 0
(5 - 6)(11 - 9) = -2             (12 - 11)(6 - 7) = -1
(7 - 6)(9 - 9) = 0               (13 - 11)(10 - 7) = 6
(4 - 6)(8 - 9) = 2               (14 - 11)(12 - 7) = 15
(6 - 6)(8 - 9) = 0               (10 - 11)(7 - 7) = 0
(4 - 6)(5 - 9) = 8               (12 - 11)(4 - 7) = -3
(5 - 6)(6 - 9) = 3               (10 - 11)(5 - 7) = 2
(6 - 6)(5 - 9) = 0               (8 - 11)(2 - 7) = 15
Σ class 1 = 30                   Σ class 2 = 38

Degrees of freedom: 10 + 10 - 2 = 18. Cross-product term: (30 + 38)/18 = 3.78.
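The arithmetic of this example is easily verified by machine. The following sketch (in Python with NumPy; the code and variable names are ours and not part of the original treatment) reproduces the pooled variance-covariance matrix of Table 33.2 and evaluates eqs. (33.3) and (33.4):

```python
import numpy as np

# Data of Table 33.1; columns are x1 and x2.
class1 = np.array([[8, 15], [7, 12], [8, 11], [5, 11], [7, 9],
                   [4, 8], [6, 8], [4, 5], [5, 6], [6, 5]], dtype=float)
class2 = np.array([[11, 11], [9, 5], [11, 8], [12, 6], [13, 10],
                   [14, 12], [10, 7], [12, 4], [10, 5], [8, 2]], dtype=float)

m1, m2 = class1.mean(axis=0), class2.mean(axis=0)   # centroids (6, 9) and (11, 7)
K, L = class1 - m1, class2 - m2                     # centred class matrices
S = (K.T @ K + L.T @ L) / (len(class1) + len(class2) - 2)   # pooled var-covar matrix
S_inv = np.linalg.inv(S)

w = S_inv @ (m1 - m2)                               # eq. (33.3); S is symmetric
w0 = -0.5 * (m1 - m2) @ S_inv @ (m1 + m2)           # eq. (33.4)

x_new = np.array([9.0, 13.0])                       # the new object n
D = w0 + w @ x_new                                  # eq. (33.1); D > 0 means class 1
print(S)            # [[2.78 3.78] [3.78 10.56]] after rounding
print(w, w0, D)     # approx. [-4.00 1.61], 21.1, 6.1
```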
One obtains

w1 = -4.00, w2 = 1.62, w0 = 21.08

and

D = 21.08 - 4.00 x1 + 1.62 x2

We can now classify a new object n. Consider an object with x1 = 9 and x2 = 13. For this object

D = 21.08 - 4.00 × 9 + 1.62 × 13 = 6.14

Since D > 0, it is classified as belonging to class 1.
For two classes, Fisher arrived at similar results for the equations given above by considering LDA as a regression problem. A response variable y, indicating class membership, is introduced: y = -1 for all objects belonging to class K and y = +1 for all objects belonging to class L. We then obtain the regression equation y = f(x1, x2). It can be shown that, when there is an equal number of objects in K and L, the same w values are obtained. If the numbers are not equal, then w1 and w2 are still the same but w0 changes.
So far, we have described only situations with two classes. The method can also be applied to k classes; it is then sometimes called descriptive linear discriminant analysis. In this case the weight vectors can be shown to be the eigenvectors of the matrix

A = W^-1 B    (33.5)
where W is the within-group sum of squares and cross-products matrix and B is the between-group sum of squares and cross-products matrix. As described in [11], this leads to non-symmetric eigenvalue problems.

33.2.3 Quadratic discriminant analysis and related methods

There is still another approach to explaining LDA, namely by considering the Mahalanobis distance (see Chapter 30) to a class; all these approaches lead to the same result. The Mahalanobis distance is the distance to the centre of a class, taking correlation into account, and is the same for all points on the same probability ellipse. For equally probable classes, i.e. classes with the same number of training objects, a smaller Mahalanobis distance to class K than to class L means that the probability that the object belongs to class K is larger than the probability that it belongs to L.
The Mahalanobis distance representation helps us to take a more general look at discriminant analysis. The multivariate normal distribution for m variables in class K can be described by

f(x) = (2π)^(-m/2) |Γ_K|^(-1/2) exp(-D²_M,K / 2)    (33.6)

with D²_M,K the squared Mahalanobis distance to class K:

D²_M,K = (x - μ_K)^T Γ_K^-1 (x - μ_K)    (33.7)

where μ_K and Γ_K are the population mean vector and variance-covariance matrix of K, respectively. They can be estimated by the sample parameters x̄_K and C_K. From these equations, one can derive the following classification rule: classify object u of unknown class in the class K for which D²_MK,u is minimal, given

D²_MK,u = (x_u - x̄_K)^T C_K^-1 (x_u - x̄_K)    (33.8)

where x_u is the vector of x values describing object u. This equation is applied when the a priori probability of the classes is the same; when this is not so, an additional term has to be added. When all C_K are considered equal, they can be replaced by S, the pooled variance-covariance matrix, which is the case for linear discriminant analysis. The discrimination boundaries are then linear and D²_MK,u is given by

D²_MK,u = (x_u - x̄_K)^T S^-1 (x_u - x̄_K)    (33.9)

Friedman [12] introduced a Bayesian approach; the Bayes equation is given in Chapter 16. In the present context, a Bayesian approach can be described as finding a classification rule that minimizes the risk of misclassification, given the prior probabilities of belonging to a given class. These prior probabilities are estimated from the fraction of each class in the pooled sample:

π_K = n_K / N

where π_K is the prior probability that an object belongs to class K, n_K is the number of objects in the training set for class K and N is the total number of objects in the training set. One then computes D²_MK,u as

D²_MK,u = (x_u - x̄_K)^T C_K^-1 (x_u - x̄_K) + ln|C_K| - 2 ln π_K    (33.10)

and classifies u in the class for which this value is smallest.
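As an illustration, a minimal sketch of the classification rule of eq. (33.10) could look as follows (our code, not part of the original text; `training` is assumed to be a mapping from class labels to arrays of training objects):

```python
import numpy as np

def qda_score(x_u, X_class, n_total):
    """Discriminant score of eq. (33.10) for one class; smaller is better."""
    mean = X_class.mean(axis=0)
    C = np.cov(X_class, rowvar=False)        # class variance-covariance matrix C_K
    prior = len(X_class) / n_total           # pi_K estimated as n_K / N
    d = x_u - mean
    return (d @ np.linalg.inv(C) @ d         # squared Mahalanobis distance, eq. (33.8)
            + np.log(np.linalg.det(C))       # ln |C_K|
            - 2.0 * np.log(prior))           # -2 ln pi_K

# Classify u in the class with the smallest score, e.g.:
# best = min(training, key=lambda K: qda_score(x_u, training[K], N))
```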
Fig. 33.10. Situations with unequal variance-covariance: (a) unequal variance, (b) unequal covariance.
Equation (33.10) is applied in what is called quadratic discriminant analysis (QDA). The equations can be shown to describe a quadratic boundary separating the regions where D²_MK,u is minimal for the classes considered.
As stated earlier, LDA requires that the variance-covariance matrices of the classes being considered can be pooled. This is only so when these matrices can be considered equal, in the same way that variances can only be pooled when they are considered equal (see Section 2.1.4.4). Equal variance-covariance means that the 95% confidence ellipsoids have an equal volume (variance) and orientation in space (covariance). Figure 33.10 illustrates situations of unequal variance or covariance. Clearly, Fig. 33.1 displays unequal variance-covariance, so that one must expect QDA to give a better classification, as is indeed the case (Fig. 33.2).
When the number of objects is smaller than the number of variables m, the variance-covariance matrix is singular. Clearly, this problem is more severe for QDA (which requires m < n_K for each class) than for LDA, where the variance-covariance matrix is pooled and the number of objects N is therefore the sum of all objects over all classes. It follows that both QDA and LDA have advantages: QDA is less subject to constraints on the distribution of objects in space, while LDA requires fewer objects than QDA. Friedman [12] has shown that regularized discriminant analysis (RDA), a form of discriminant analysis intermediate between QDA and LDA, has advantages compared to both: it is less subject to constraints without requiring more objects. The method has been used in chemometrics, e.g. for the classification of seagrass [13] or pharmaceutical preparations [14].

33.2.4 The k-nearest neighbour method

A mathematically very simple classification procedure is the nearest neighbour method. In this method one computes the distance between an unknown object u and each of the objects of the training set. Usually one employs the Euclidean distance D (see Section 30.2.2.1), but for strongly correlated variables correlation-based measures are to be preferred (Section 30.2.2.2). If the training set consists of n objects, then n distances are calculated and the lowest of these is selected. If this is D_ul, where u represents the unknown and l an object from learning class L, then one classifies u in group L. A three-dimensional example is given in Fig. 33.11: object u is closest to an object of class L and is therefore considered to be a member of that class.
In a more sophisticated version of this technique, called the k-nearest neighbour method (k-NN method), one selects the k nearest objects to u and applies a majority rule: u is classified in the group to which the majority of the k objects belong. Figure 33.12 gives an example of a 3-NN method. One selects the three nearest neighbours (A, B and C) to the unknown u.
Fig. 33.11. 1-NN classification of the unknown u.
Fig. 33.12. 3-NN classification of the unknown u.
Since A and B belong to L, one classifies u in category L. The choice of k is determined by optimization: one determines the prediction ability with different values of k. Usually it is found that small values of k (3 or 5) are to be preferred.
The method has several advantages, the first being its mathematical simplicity, which does not prevent it from yielding classification results as good as, and often better than, the much more complex methods discussed in other sections of this chapter. Moreover, it is free from statistical assumptions, such as normality of the distribution of the variables. This does not mean that the method is not subject to any problem. One such problem is that the method is sensitive to gross inequalities in the number of objects in each class. Figure 33.13 gives an example: the unknown is classified into the class with the largest membership, because in the zone of overlap between the classes more of its members are present. In fact, the unknown is closer to the centre of the other class, so that its classification is at least doubtful. This can be overcome by replacing the simple majority criterion with an alternative one, such as: "classify the object in the larger class K if for k = 10 at least 9 neighbours (out of 10) belong to K; otherwise classify the test object in the smaller class L". The selection of k and of the alternative criterion value should be determined by optimization [15].
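A minimal implementation of the k-NN majority rule described above (our sketch, covering the simple majority criterion only) could be:

```python
import numpy as np

def knn_classify(x_u, X_train, y_train, k=3):
    """Classify x_u by majority vote among its k nearest neighbours."""
    d = np.linalg.norm(X_train - x_u, axis=1)   # Euclidean distances to all objects
    nearest = np.argsort(d)[:k]                 # indices of the k nearest neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]            # the majority rule
```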
Fig. 33.13. A situation which necessitates classification of the unknown u by alternative k-NN criteria.
The nearest neighbour method is often applied and, in view of its simplicity, yields surprisingly good results. An example where k-NN performs well in a comparison with neural networks and SIMCA (see below) can be found in [16].
33.2.5 Density methods

In density or kernel methods one imagines a potential field around the objects of the learning set; for this reason these methods have also been called potential methods. A variant for clustering was described in Section 30.3.3. One starts with the selection of a potential function. Many functions can be used for this purpose, but for practical reasons it is recommended to select a simple one, such as a triangular or a Gaussian function. The function is characterized by its width, which is important for its smoothing behaviour (see below). Figure 33.14 shows a Gaussian function for a class K in a one-dimensional space. The cumulative potential function is determined by adding the heights of the individual potential functions in each position along the x axis. The figure shows that the cumulative function constitutes a continuous line which is never zero within a class. This is done separately for each class.
Fig. 33.14. Density estimate for a test set using normal potential functions (univariate case).
Fig. 33.15. Classification of an unknown object u. f(K) and f(L) indicate the potential functions for classes K and L.
By dividing the cumulative potential function of a class by the number of samples contributing to it, one obtains the (mean) potential function of the class. In this way, the potential function assumes a probabilistic character and, therefore, the density method permits probabilistic classification. The classification of a new object u into one of the given classes is determined by the value of the potential function of each class in u: it is classified into the class which has the largest value there. A one-dimensional example is given in Fig. 33.15. Object u is considered to belong to K, because at the location of u the potential value of K is larger than that of L. The boundary between two classes is given by those positions where the potentials caused by the two classes have the same value. Such boundaries can assume irregular forms, as shown in Fig. 33.3.
One of the disadvantages of the method is that one must determine the smoothing parameter by optimization. When the smoothing parameter is too small (Fig. 33.16a), many potential functions of a learning class do not overlap with each other, so that the continuous surface of Fig. 33.15 is not obtained. A new object u may then have a low membership value for a class (here class K) although it clearly belongs to that class. An excessive smoothing parameter leads to a surface that is too flat (Fig. 33.16b), so that discrimination becomes less clear. The major task of the learning procedure is therefore to select the most suitable value of the smoothing parameter.

Fig. 33.16. Influence of the smoothing parameter on the potential surfaces of the classes: (a) too small and (b) too large.
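A sketch of the classification step with Gaussian potential functions might look as follows (our illustration; `width` plays the role of the smoothing parameter, and `training` is an assumed mapping from class labels to arrays of objects):

```python
import numpy as np

def mean_potential(x_u, X_class, width):
    """Mean Gaussian potential exerted by the members of one class at x_u;
    'width' is the smoothing parameter discussed in the text."""
    d2 = np.sum((X_class - x_u) ** 2, axis=1)
    return np.mean(np.exp(-d2 / (2.0 * width ** 2)))

# u is assigned to the class with the largest mean potential at x_u, e.g.:
# best = max(training, key=lambda K: mean_potential(x_u, training[K], width=1.0))
```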
Advantages of these methods are that no a priori assumptions about distributions are necessary and that probabilistic decisions can be taken more easily than with k-NN. In chemometrics, the method was introduced under the name ALLOC [17,18]. The methodology is described in detail in a book by Coomans and Broeckaert [19]. The method was developed further by Forina and coworkers [20,21].

33.2.6 Classification trees

In Section 18.4 we explained that inductive expert systems can be applied for classification purposes, and we refer to that section for further information and example references. It should be pointed out that the method is essentially univariate: one selects a splitting point on one of the variables, such that it achieves the "best" discrimination, the "best" being determined by, e.g., an entropy function. Several references are given in Chapter 18; a comparison with other methods can be found, for instance, in an article by Mulholland et al. [22]. Additionally, Breiman et al. [23] developed a methodology known as classification and regression trees (CART), in which the data set is split repeatedly and a binary tree is grown. The way the tree is built leads to the selection of boundaries parallel to certain variable axes. With highly correlated data this is not necessarily the best solution, and non-linear methods or methods based on latent variables have been proposed to perform the splitting. A combination of PLS (as a feature reduction method; see Sections 33.2.8 and 33.3) and CART was described by Yeh and Spiegelman [24]. Very good results were also obtained by using simple neural networks of the type described in Section 33.2.9 to derive a decision rule at each branching of the tree [25]. Classification trees have been used relatively rarely in chemometrics, but it seems that in general [26] their performance is comparable to that of the best pattern recognition methods.

33.2.7 UNEQ, SIMCA and related methods

As explained in Section 33.2.1, one can prefer to consider each class separately and to perform outlier tests to decide whether a new object belongs to a certain class or not. The earliest such approaches introduced in chemometrics were called SIMCA (soft independent modelling of class analogy) [27] and UNEQ [28]. UNEQ can be applied when only a few variables must be considered. It is based on the Mahalanobis distance from the centroid of the class: when this distance exceeds a critical distance, the object is an outlier and therefore not part of the class. Since each class is described by its own variance-covariance matrix, UNEQ is somewhat related to QDA (Section 33.2.3). The situation described here is very similar to that discussed for multivariate quality control in Chapter 20. In eq. (20.10) the original variables are used; this equation can therefore also be used for UNEQ. For convenience it is repeated here:

D²_i = (x_i - x̄_K)^T C^-1 (x_i - x̄_K)    (33.11)
where D²_i is the squared Mahalanobis distance between object i and the centroid x̄_K of the class K, and C is the variance-covariance matrix of the n training objects defining class K (see also eq. (33.7)). When D²_i becomes too large for a certain object, the object is no longer considered to be part of the class. The Mahalanobis distance follows the Hotelling T²-distribution, and the critical value t²_crit is defined as:

t²_crit = [m(n - 1)(n + 1) / (n(n - m))] F(α; m, n - m)    (33.12)

UNEQ requires a multivariate normal distribution and can be applied only when the ratio of objects to variables is sufficiently high (e.g. 3). When the ratio is lower, as also explained in Chapter 20, one can measure distances in the principal component (PC) space instead of the original space, and in the pattern recognition context this is usually either necessary or preferable.
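Assuming SciPy for the F-distribution, the UNEQ membership test of eqs. (33.11) and (33.12) can be sketched as follows (our illustration, not part of the original text):

```python
import numpy as np
from scipy.stats import f as f_dist

def uneq_is_member(x_i, X_class, alpha=0.05):
    """Outlier test of eqs. (33.11)-(33.12): is x_i part of the class?"""
    n, m = X_class.shape
    mean = X_class.mean(axis=0)
    C = np.cov(X_class, rowvar=False)
    d = x_i - mean
    D2 = d @ np.linalg.inv(C) @ d                              # eq. (33.11)
    F_crit = f_dist.ppf(1.0 - alpha, m, n - m)
    t2_crit = m * (n - 1) * (n + 1) / (n * (n - m)) * F_crit   # eq. (33.12)
    return D2 <= t2_crit
```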
Fig. 33.17. SIMCA: (a) step 1 in a 1 PC model; (b) step 1 in a 2 PC model; (c) step 2 in a 1 PC model; (d) step 2 in a 2 PC model.
In SIMCA one applies latent variables instead of the original variables; it, too, can be viewed as a variant of the quadratic discriminant rule. SIMCA, the original version of which [27] was published in 1976, starts by determining the number of principal components or eigenvectors needed to describe the structure of the training class (Section 31.5). Usually cross-validation (Section 31.5.3) is preferred. Let us call the number of eigenvectors retained r*. If r* = 1, this means that (see Fig. 33.17) all data are considered to be modelled by a one-dimensional model (a line); for r* = 2, by a two-dimensional model (a plane), etc. The residuals of the training class towards such a model are assumed to follow a normal distribution with a residual standard deviation

s = sqrt[ Σ_i Σ_j e²_ij / ((r - r*)(n - r* - 1)) ]    (33.13)

The residuals from the model can be computed from the scores on the non-retained eigenvectors, i.e. the scores t_ij on the eigenvectors r* + 1 to r, with r = min(n - 1, m). Then

s = sqrt[ Σ_i Σ_(j=r*+1..r) t²_ij / ((r - r*)(n - r* - 1)) ]    (33.14)
If care is not taken about the way s is obtained, SIMCA has a tendency to exclude more objects from the training class than necessary. The s-value should therefore be determined by cross-validation: each object in the training set is predicted using the r*-dimensional PCA model obtained for the other (n - 1) training set objects, and the (residual) scores obtained in this way for each object are used in eq. (33.14) [30]. A confidence limit is obtained by defining a critical value of the (Euclidean) distance towards the model. This is given by

s_crit = sqrt(F_crit) s    (33.15)

where F_crit is the tabulated one-sided F-value for (r - r*) and (r - r*)(n - r* - 1) degrees of freedom. The value of s_crit is used to determine the boundary (the cylinder) around the PC1 line in Fig. 33.17c and the planes around the PC1, PC2 plane in Fig. 33.17d. Objects with s < s_crit belong to class K; otherwise they do not. To predict whether a new object x_new belongs to class K, one verifies whether it falls within the cylinder (for a one-dimensional model), between the limiting planes (for a two-dimensional model), etc. Suppose the following r*-dimensional PC model was obtained:

X_K = T_K L_K^T + E_K    (33.16)

with X_K the centred X-matrix for class K, T_K the (un-normed) score matrix (n × r*) (T_K = U_K Λ_K, where U_K is the normed score matrix and Λ_K is the singular value matrix), L_K the loading matrix (m × r*) and E_K the matrix of residuals (n × m). For a new object x_new one first determines the scores:

t_new^T = (x_new - x̄_K)^T L_K    (33.17)

The Euclidean distance from the model is then obtained, similarly to eq. (33.14), as

s_new = sqrt[ Σ_(j=r*+1..r) t²_new,j / (r - r*) ]    (33.18)
If s_new < s_crit, the new object belongs to class K; otherwise it does not. A discussion concerning the number of degrees of freedom can be found in [31]; this article also compares SIMCA with several other methods.
A useful tool in the interpretation of SIMCA is the so-called Coomans plot [32]. It is applied to the discrimination of two classes (Fig. 33.18). The distance from the model of class 1 is plotted against that from the model of class 2, and on both axes one indicates the critical distances. In this way, one defines four zones: class 1, class 2, overlap of classes 1 and 2, and neither class 1 nor class 2. By plotting objects in this plot, their classification is immediately clear, and it is also easy to visualize how certain a classification is. In Fig. 33.18, object a is very clearly within class 1, object b is on the border of that class but is not close to class 2, and object c clearly belongs to neither class.
The first versions of SIMCA stop here: it is considered that a new object belongs to the class if it fits the r*-dimensional PC model.
Fig. 33.18. The Coomans plot. Object a belongs to class 1, object b is a borderline class 1 object, object c is an outlier towards the two classes.
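The distance-to-model computation of eqs. (33.16)-(33.18) can be sketched as follows (a simplified illustration of ours; in practice s and hence s_crit should be obtained by cross-validation, as explained above):

```python
import numpy as np

def simca_distance(x_new, X_class, r_star):
    """Euclidean distance of a new object to the r*-dimensional PC model
    of one class, following eqs. (33.16)-(33.18)."""
    mean = X_class.mean(axis=0)
    U, sval, Vt = np.linalg.svd(X_class - mean, full_matrices=False)
    t = (x_new - mean) @ Vt.T          # scores on all PCs, cf. eq. (33.17)
    r = len(sval)                      # here min(n, m); the text uses min(n - 1, m)
    return np.sqrt(np.sum(t[r_star:] ** 2) / (r - r_star))   # eq. (33.18)

# Membership: the object belongs to class K if simca_distance(...) < s_crit,
# with s_crit taken from eq. (33.15).
```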
However, one can also consider that objects that fit the PC model but, in that model, are far from the members of the training class are also outliers. Therefore, a second step was added to the original version of SIMCA by closing the tube or the box (Fig. 33.17). This was originally done by treating each PC in a univariate way. The limits were situated on each of the r* PCs:

t_max = max(t_K) + 0.5 s_t

and

t_min = min(t_K) - 0.5 s_t    (33.19)
where max(t_K) is the largest and min(t_K) the smallest among the scores of the training objects of class K on the PC considered, and s_t is the standard deviation of the scores along that PC.
SIMCA has inspired several related methods, such as DASCO [33] and CLASSY [34,35]. The latter has elements of the potential methods and of SIMCA, while the former starts with the extraction of principal components, as in SIMCA, but then follows a quadratic discriminant rule. SIMCA itself has been applied very often and with much success in chemometrics. Examples are food authentication [36] and pharmaceutical identifications, such as the recognition of excipients from their near-infrared spectra [37] or of blister-packed tablets using near-infrared spectra [38]. Environmental applications have been published by many authors [39,40], for instance by Kvalheim et al. [41,42]. Lavine et al. [43] applied it to fuel spills from high-speed gas chromatograms and compared SIMCA with DASCO and RDA.
Chemometricians often consider SIMCA the preferred supervised pattern recognition method for all situations. However, this is not evident. When the accent is on discrimination, discrimination-oriented methods should be used. The testing procedure underlying SIMCA and other outlier tests has the disadvantage that one has to set a confidence level, α: if the data are normally distributed, α% (e.g. 5%) of the objects belonging to the class will be considered as not belonging to it. When applying discriminating methods, such as LDA, to a discrimination-oriented classification, this misclassification problem can be avoided. As explained already, SIMCA can be applied as an outlier test, similarly to the multivariate QC tests referred to earlier. Fearn et al. [44] have described certain properties of SIMCA in this respect and compared it with some alternatives.

33.2.8 Partial least squares

In Section 33.2.2 we showed how LDA classification can be described as a regression problem with class variables. As a regression model, LDA is subject to the problems described in Chapter 10; for instance, the number of variables should not exceed the number of objects. One solution is to apply feature selection or reduction (see Section 33.3); another is to apply methods such as partial least squares (PLS; see Chapter 35) [45]. When there are two classes to be discriminated, PLS1 is applied, which means that there is a single response variable y, which for each object has the value 0 or 1. When there are more classes, PLS2 is applied. The response then becomes a vector of class variables, one for each class, with a value of 1 for the class to which the object belongs and zeros for all other classes. Suppose that there are 4 classes and that a certain object belongs to class 2; then for that object y^T = [0 1 0 0]. One might be tempted to use PLS1 with a response variable that can take the values 1, 2, 3 and 4. However, this would imply an ordered relationship between the four classes, such that the distance between classes 3 and 1 is twice that between classes 2 and 1.

33.2.9 Neural networks

A more recently introduced technique, at least in the field of chemometrics, is the use of neural networks. The methodology will be described in detail in Chapter 44; in this chapter we only give a short, introductory description in order to contrast the technique with the others described earlier. A typical artificial neuron is shown in Fig. 33.19. The isolated neuron of this figure performs a two-stage process to transform a set of inputs into a response or output. In a pattern recognition context, these inputs would be the values of the variables (in this example limited to only two, x1 and x2) and the response would be a class variable, for instance y = 1 for class K and y = 0 for class L. The inputs x1 and x2 are linked to the neuron by weights. These weights are determined by training the neuron with a set of training objects, but we will consider in this chapter that this has already been done. In the first stage a weighted sum of the x-values is made: Σ = w1 x1 + w2 x2.
Fig. 33.19. An artificial neuron. The inputs are weighted and summed according to Σ = w1 x1 + w2 x2; Σ is transformed by comparison with the threshold T and leads to a 0/1 value for y.
Fig. 33.20. Output of the artificial neuron with values w1 = 1, w2 = 2, T = 1.
In the second stage, Σ is transformed with the aid of a transfer function. For instance, it can be compared to a threshold value: if Σ > T then y = 1, and otherwise y = 0. For the neuron with w1 = 1, w2 = 2 and T = 1 (Fig. 33.20), all combinations of x1 and x2 on and above the boundary line yield Σ > T and therefore lead to an output y1 = 1 (i.e. the object belongs to class K); all combinations below it lead to y1 = 0. The procedure described here is equivalent to a method called the linear learning machine, which was one of the first supervised pattern recognition methods to be applied in chemometrics. It is further explained, including the training phase, in Chapter 44.
Neurons are not used alone, but in networks in which they constitute layers. In Fig. 33.21 a two-layer network is shown. In the first layer two neurons are each linked to the two inputs, x1 and x2. The upper one is the neuron we already described; the lower one has w1 = 2, w2 = 1 and also T = 1. It is easy to see that for this neuron the output y2 is 1 on and above line b in Fig. 33.22a and 0 below it. The outputs of these neurons now serve as inputs to a third neuron, constituting a second layer. Both connections have weight 0.5, and T for this neuron is 0.75. The output y_final of this neuron is 1 if Σ = 0.5 y1 + 0.5 y2 > 0.75 and 0 otherwise. Since y1 and y2 have 0 and 1 as possible values, the condition Σ > 0.75 is fulfilled only when both are equal to 1, i.e. in the dashed area of Fig. 33.22b. The boundary obtained is now no longer a single straight line, but consists of two pieces. This network is only a simple demonstration network: real networks have many more nodes, and the transfer functions are usually non-linear. It will be intuitively clear that boundaries of a very complex nature can be developed in this way. How to do this, and applications of supervised pattern recognition, are described in detail in Chapter 44, but it should be stated here that excellent results can be obtained.
Fig. 33.21. A two-layer neural network.

Fig. 33.22. (a) Intermediate (y1 and y2) outputs of the neural network of Fig. 33.21; (b) final output of the neural network.
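The demonstration network of Fig. 33.21 is small enough to write out explicitly (our sketch of the network described in the text):

```python
import numpy as np

def neuron(x, w, T):
    """Threshold transfer function: 1 if the weighted sum exceeds T, else 0."""
    return 1 if np.dot(w, x) > T else 0

def network(x1, x2):
    """The two-layer demonstration network of Fig. 33.21."""
    y1 = neuron([x1, x2], [1, 2], T=1)             # upper neuron (line a, Fig. 33.22a)
    y2 = neuron([x1, x2], [2, 1], T=1)             # lower neuron (line b)
    return neuron([y1, y2], [0.5, 0.5], T=0.75)    # output 1 only if y1 = y2 = 1

print(network(1.0, 1.0))   # 1: inside the dashed area of Fig. 33.22b
print(network(0.1, 0.2))   # 0
```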
The similarity in approach to LDA (Section 33.2.2) and PLS (Section 33.2.8) should be pointed out. Neural classification networks are related to neural regression networks in the same way that PLS can be applied both for regression and classification and that LDA can be described as a regression application. This can be generalized: all regression methods can be applied in pattern recognition. One must expect, for instance, that methods such as ACE and MARS (see Chapter 11) will be used for this purpose in chemometrics.
33.3 Feature selection and reduction

One can, and sometimes must, reduce the number of features. One way is to combine the original variables into a smaller number of latent variables, such as principal components or PLS functions. This is called feature reduction. The combination of PCA and LDA is often applied, in particular for ill-posed data (data where the number of variables exceeds the number of objects), e.g. Ref. [46]. One first extracts a certain number of principal components, deleting the higher-order ones and thereby reducing the noise to some degree, and then carries out the LDA. One should, however, be careful not to eliminate too many PCs, since in this way information important for the discrimination might be lost. A method in which both steps are merged into one, and which sometimes yields better results than the two-step procedure, is reflected discriminant analysis. The Fourier transform is also sometimes used [14], as is the wavelet transform (see Chapter 40) [13,16]. In those cases, the information is contained in the first few Fourier coefficients or in a restricted number of wavelet coefficients.
In feature selection one selects from the m variables a subset that seems to be the most discriminating. Feature selection therefore constitutes a means of choosing sets of optimally discriminating variables and, if these variables are the results of analytical tests, this consists, in fact, of selecting an optimal combination of analytical tests or procedures. One way of selecting discriminating features is to compare the means and the variances of the different variables. Variables with widely different means for the classes and small intraclass variances should be of value, and for a binary discrimination one therefore selects those variables j for which the expression

(x̄_jK - x̄_jL)² / (s²_jK + s²_jL)    (33.20)

is maximal. It should be noted that, in this way, we select the individually best variables. As the correlation between variables is not taken into account, this does not necessarily yield the best combination of variables. Most supervised pattern recognition procedures permit stepwise selection, i.e. the selection first of the most important feature, then of the second most important, and so on. One way to do this is by prediction using e.g. cross-validation (see next section): we first select the variable that best classifies objects of known classification that are not part of the training set, then the variable that most improves the classification already obtained with the first selected variable, etc.
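Returning to the criterion of eq. (33.20), a sketch evaluating it for all variables at once (our illustration) could be:

```python
import numpy as np

def selection_criterion(XK, XL):
    """Criterion of eq. (33.20) for every variable (column) at once."""
    num = (XK.mean(axis=0) - XL.mean(axis=0)) ** 2
    den = XK.var(axis=0, ddof=1) + XL.var(axis=0, ddof=1)
    return num / den

# ranking = np.argsort(selection_criterion(XK, XL))[::-1]  # best single variables first
```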
For the linear discriminant analysis of the EU/HYPER classification of Section 33.2.1, the result is that with all 5 or with 4 variables a selectivity of 91.4% is obtained, and with 3 or 2 variables 88.6% [2]. Selectivity is used here as a measure of classification success; it is applied in the sense of Chapter 16 because of the medical context: selectivity is the number of true negatives (the EU cases that are classified as such) divided by the sum of the true negatives and the false positives (the EU cases that are wrongly classified as HYPER). Of course, one should also consider the sensitivity (Chapter 16), which was done in the article but, for simplicity's sake, will not be discussed here. For the HYPER/EU discrimination, the successive elimination of two or more variables leads to the expected result that, since there is less information, the classification is less successful. On the other hand, for the HYPO/EU discrimination a less evident result is obtained: five variables yield a selectivity of 80.0%, four of 83.0%, three of 86.7% and two of 96.7%. A smaller number of tests leads to an improvement of the classification results. One concludes that the variables eliminated either have no relevance to the discrimination considered, and therefore only add noise, or else that the information present in the eliminated variables was redundant, i.e. correlated with the retained variables.
Another approach requires the use of Wilks' lambda. This is a measure of the quality of the separation, computed as the determinant of the pooled within-class covariance matrix divided by the determinant of the covariance matrix of the whole set of samples. The smaller this ratio, the better the separation, and one selects variables in a stepwise way by including those that achieve the largest decrease of the criterion.
In SIMCA, we can determine the modelling power of the variables, i.e. we measure the importance of the variables in modelling the class. Moreover, it is possible to determine the discriminating power, i.e. which variables are important for discriminating two classes. The variables with both low discriminating and low modelling power are deleted. This is more a variable elimination procedure than a selection procedure: we do not try to select the minimum number of features that will lead to the best classification (or prediction rate), but rather eliminate those that carry no information at all.
It should be stressed here that feature selection is not only a data manipulation operation, but may have economic consequences. For instance, one could decide on the basis of the results described above to reduce the number of different tests for an EU/HYPO discrimination problem to only two. A less straightforward problem with which the decision maker is confronted is to decide how many tests to carry out for an EU/HYPER discrimination: one loses some 3% in selectivity by eliminating one test. The decision maker must then compare the economic benefit of carrying out one test less with the loss entailed by a somewhat smaller diagnostic success. In fact, he carries out a cost-benefit analysis. This is only one of the many instances where an analytical (or clinical) chemist may be confronted with such a situation.
33.4 Validation of classification rules

In the training or learning step, one develops a decision model (a rule) which allows the classification of unknown samples. The decision model of Fig. 33.6 consists of line a, and the classification rule is that objects to the right of it are assigned to class L and objects to the left to class K. Once a decision rule has been obtained, it is still necessary to demonstrate that it is a good one. This can be done by observing how successful it is at classifying known samples (a test set). One distinguishes between recognition and prediction ability. The recognition (or classification) ability is characterized by the percentage of the members of the training set that are correctly classified; the prediction ability is determined by the percentage of the members of the test set correctly classified using the decision functions or classification rules developed during the training step. When one determines only the recognition ability, there is a risk of being deceived into an overoptimistic view of the classification result; it is therefore also necessary to verify the prediction ability. Both recognition and prediction ability are usually expressed as (correct) classification rates, although other measures exist (see Section 33.3).
The situation is very similar to that in regression (see Section 10.3.4), where we validated the regression model by looking at how well it modelled the objects included in the calibration set (goodness- or lack-of-fit) and new samples (prediction performance). This analogy should not surprise, since regression and classification are both modelling methods. Validation in pattern recognition is therefore similar to validation in multivariate calibration, and the reader should refer to Chapter 36 for more details.
The ideal situation is when enough samples are available to create separate (independent) training and test sets. When this is not possible, an artifice is necessary: the prediction ability is determined by developing the decision model on only part of the training set and using the other part as a mock test set. Often this is repeated a few times until all training samples have been used as test samples. If several objects at a time are considered as test samples, this is called a resampling or (internal) cross-validation method, k-fold cross-validation or jackknife method; when only one sample at a time is removed from the training set, it is called a leave-one-out procedure. If the training set consists of 20 objects, such a procedure could be carried out as follows. We first delete objects 1-6 from the training set and develop the classification rules with the remaining objects 7-20. Then we consider objects 1-6 as the test set, classify them with the rules obtained on objects 7-20 and note how many objects were classified correctly. The whole procedure is then repeated, first replacing objects 1-6 in the training set and deleting objects 7-12; the latter are classified using the classification rule developed on a training set consisting of objects 1-6 and 13-20. Finally, a training set consisting of objects 1-12 is used and objects 13-20 serve as the test set and are classified. The percentage of successes over the three runs together is then called the prediction ability.
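The segmented validation just described can be sketched as follows (our illustration; `train_and_classify` is a hypothetical placeholder standing for any classification rule that is built on the training part and applied to the test part):

```python
import numpy as np

def prediction_ability(X, y, train_and_classify, n_segments=3):
    """Internal cross-validation: each segment serves once as a mock test set;
    the fraction of correct classifications over all runs is returned."""
    idx = np.arange(len(y))
    correct = 0
    for seg in np.array_split(idx, n_segments):
        train = np.setdiff1d(idx, seg)
        y_pred = train_and_classify(X[train], y[train], X[seg])
        correct += np.sum(y_pred == y[seg])
    return correct / len(y)
```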
In general, it is found that the prediction ability is somewhat less good than the recognition ability. If the prediction and recognition abilities are substantially different, this means that the decision rules depend too much on the actual objects in the training set: the solution obtained is not stable and should therefore not be trusted.
Many other aspects are important for achieving successful pattern recognition. To name only two, one should investigate to what extent outliers are present, because these can have a profound influence on the quality of a model, and to what extent clusters occur within a class (e.g. using the index of clustering tendency of Section 30.4.1). When clusters occur, we must wonder whether we should not consider two (or more) classes instead of a single class. These problems also affect multivariate calibration (Chapter 36) and we have discussed them to a somewhat greater extent in that chapter.

References
3.
4. 5. 6. 7.
8. 9. 10. 11. 12.
R.G. Brereton, ed., Multivariate Pattern Recognition in Chemometrics. Elsevier, Amsterdam, 1992. D. Coomans, M. Jonckheer, D.L. Massart, I. Broeckaert and P. Blockx, The application of linear discriminant analysis in the diagnosis of thyroid diseases. Anal. Chim. Acta, 103 (1978) 409-415. D. Coomans, I. Broeckaert, M. Jonckheer and D.L. Massart, Comparison of multivariate discrimination techniques for clinical data — Application to the thyroid functional state. Meth. Inform. Med., 22 (1983) 93-101. D. Coomans, I. Broeckaert and D.L. Massart, Potential methods in pattern recognition. Part 4, Anal. Chim. Acta 132 (1981) 69-74. R. Fisher, The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7 (1936) 179-188. P.J. Dunlop, CM. Bignell, J.F. Jackson, D.B. Hibbert, Chemometric analysis of gas chromatographic data of oils from Eucalyptus species. Chemom. Intell. Lab. Systems 30(1995) 59-67. K. Varmuza, F. Stangl, H. Lohninger and W. Werther, Automatic recognition of substance classes from data obtained by gas chromatography, mass spectrometry. Lab. Automation Inf. Manage., 31 (1996)221-224. A. Candolfi, W. Wu, S. Heuerding and D.L. Massart, Comparison of classification approaches applied to NIR-spectra of clinical study lots. J. Pharm. Biomed. Anal., 16 (1998) 1329-1347. T. Fearn, Discriminant analysis. NIR News, 4 (5) (1993) 4-5. G. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York, 1992. M.G. Kendall and A.S. Stuart, The Advanced Theory of Statistics, Vol. 3. Ch. Griffin, London, 1968. J.H. Friedman, Regularized discriminant analysis. J. Am. Stat. Assoc. 84 (1989) 165-175.
240 13. 14.
15.
16.
17. 18. 19. 20. 21. 22.
23. 24. 25. 26. 27. 28. 29. 30. 31.
32.
33.
Y. Mallet, D. Coomans and O. de Vel, Recent developments in discriminant analysis on high dimensional spectral data. Chemom. Intell. Lab. Systems, 35 (1996) 157-173. W. Wu, Y. Mallet, B. Walczak, W. Penninckx, D.L. Massart, S. Heuerding and F. Erni, Comparison of regularized discriminant analysis, linear discriminant analysis and quadratic discriminant analysis, applied to NIR data. Anal. Chim. Acta, 329 (1996) 257-265. D. Coomans and D.L. Massart, Alternative K-nearest neighbour rules in supervised pattern recognition. Part 2. Probabilistic classification on the basis of the kNN method modified for direct density estimation. Anal. Chim. Acta, 138 (1982) 153-165. E.R. Collantes, R. Duta, W.J. Welsh, W.L. Zielinski and J. Brower, Reprocessing of HPLC trace impurity patterns by wavelet packets for pharmaceutical finger printing using artificial neural networks. Anal. Chem. 69 (1997) 1392-1397. J.D.F. Habbema, Some useful extensions of the standard model for probabilistic supervised pattern recognition. Anal. Chim. Acta, 150 (1983) 1-10. D. Coomans, M.P. Derde, I. Broeckaert and D.L. Massart, Potential methods in pattern recognition. Anal. Chim. Acta, 133 (1981) 241-250. D. Coomans and I. Broeckaert, Potential Pattern Recognition in Chemical and Medical Decision Making. Wiley, Chichester, 1986. M. Forina, C. Armanino, R. Leardi and G. Drava, A class-modelling technique based on potential functions. J. Chemom. 5 (1991) 435^53. M. Forina, S. Lanteri and C. Armanino, Chemometrics in Food Chemistry. Topics Curr. Chem., 141 (1987)93-143. M. Mulholland, D.B. Hibbert, P.R. Haddad and P. Parslov, A comparison of classification in artificial intelligence, induction versus a self-organising neural networks. Chemom. Intell. Lab. Systems, 30 (1995) 117-128. L.J. Breiman, R. Freidman, R. Olsen and C. Stone, Classification and Regression Trees. Wadsworth, Pacific Grove, CA, 1984. C.H. Yeh and C.H. Spiegelman, Partial least squares and classification and regression trees. Chemom. and Intell. Lab. Systems, 22 (1994) 17-23. A. Sankar and R. Mammone, A fast learning algorithm for tree neural networks. In: Proc. 1990 Conf on Information Sciences and Systems, Princeton, NJ, 1990, pp. 638-642. D.H. Coomans and O. Y. de Vel, Pattern analysis and classification, in J. Einax (ed). The Handbook of Environmental Chemistry, Vol. 2, Part G. Springer, 1995, pp. 279-324. S. Wold, Pattern recognition by means of disjoint principal components models. Pattern Recogn.,8(1976) 127-139. M.P. Derde and D.L. Massart, UNEQ: a disjoint modelling technique for pattern recognition based on normal distribution. Anal. Chim. Acta, 184 (1986) 33-51. I. Frank and J.M. Friedman, Classification: oldtimers and newcomers. J. Chemom. 3 (1989) 463-475. R. De Maesschalck, A. Candolfi, D.L. Massart and S. Heuerding, Decision criteria for SIMCA applied to Near Infrared data. Chemom. Intell. Lab. Syst., in prep. H. Van der Voet and P.M. Coenegracht, The evaluation of probabilistic classification methods, Part 2. Comparison of SIMCA, ALLOC, CLASSY and LDA. Anal. Chim. Acta, 209 (1988) 1-27. D. Coomans, I. Broeckaert, M.P. Derde, A. Tassin, D.L. Massart and S. Wold, Use of a microcomputer for the definition of multivariate confidence regions in medical diagnosis based on clinical laboratory profiles. Comp. Biomed. Res., 17 (1984) 1-14. I. Frank, DASCO: a new classification method. Chemom. Intell. Lab. Syst., 4 (1988) 215-222.
241 34.
35.
36. 37.
38.
39.
40.
41. 42.
43.
44. 45. 46.
H. Van Der Voet, P.M.J. Coenegracht and J.B. Hemel, New probabilistic versions of the Simca and Classy classification methods. Part 1. Theoretical description. Anal. Chim. Acta, 192 (1987) 63-75. H. van der Voet and D.A. Doornbos, The improvement of SIMCA classification by using kernel density estimation. Part 1. Anal. Chim. Acta, 161 (1984), 115-123; Part 2. Anal. Chim. Acta, 161 (1984) 125-134. M. Forina, G. Drava and G. Contarini, Feature selection and validation of SIMCA models: a case study with a typical Italian cheese. Analusis 21 (1993) 133-147. A. Candolfi, R. De Maesschalk, D.L. Massart, P.A. Hailey and A.C.E. Harrington, Identification of pharmaceutical excipients using NIR spectroscopy and SIMCA. J. Pharm. Biomed. Anal., in prep. M.A. Dempster, B.F. MacDonald, P.J. GemperHne and N.R. Boyer, A near-infrared reflectance analysis method for the non invasive identification of film-coated and non-film-coated, blister-packed tablets. Anal. Chim. Acta, 310 (1995) 43-51. D. Scott, W. Dunn and S. Emery, Pattern recognition classification and identification of trace organic pollutants in ambient air from mass spectra. J. Res. Natl. Bur. Stand., 93 (1988) 281-283. E. Saaksjarvi, M. Khaligi and P. Minkkinen, Waste water pollution modeling in the southern area of Lake Saimaa, Finland, by the simca pattern recognition method. Chemom. Intell. Lab. Systems, 7 (1989) 171-180. O.M. Kvalheim, K. 0ygard and O. Grahl-Nielsen, SIMCA multivariate data analysis of blue mussel components in environmental pollution studies. Anal. Chim. Acta, 150 (1983) 145-152. B.G.J. Massart, O.M. Kvalheim, F.O. Libnau, K.I. Ugland, K. Tjessem and K. Bryne, Projective ordination by SIMCA: a dynamic strategy for cost-efficient environmental monitoring around offshore installations. Aquatic Sci., 58 (1996) 121-138. B.K. Lavine, H. Mayfield, P.R. Kromann and A. Faruque, Source identification of underground fuel spills by pattern recognition analysis of high-speed gas chromatograms. Anal. Chem., 67 (1995) 3846-3852. B. Mertens, M. Thompson and T. Fearn, Principal component outlier detection and SIMCA: a synthesis. Analyst 119 (1994) 2777-2784. L Stable and S. Wold, Partial least square analysis with cross-vahdation for the two-class problem: a Monte Carlo study. J. Chemometrics, 1 (1987) 185-196. A.F.M. Nierop, A.C. Tas, J. Van der Greef, Reflected discriminant analysis. Chemom. Intell. Lab. Syst., 25 (1994) 249-263.
Chapter 34
Curve and Mixture Resolution by Factor Analysis and Related Techniques

34.1 Abstract and true factors

In Chapter 31 we stated that any data matrix can be decomposed into a product of two other matrices, the score and loading matrix. In some instances another decomposition is possible, e.g. into a product of a concentration matrix and a spectrum matrix. These two matrices have a physical meaning. In this chapter we explain how a loading or a score matrix can be transformed into matrices to which a physical meaning can be attributed. We introduce the subject with an example from environmental chemistry and one from liquid chromatography.
Let us suppose that dust particles have been collected in the air above a city and that the amounts of p constituents, e.g. Si, Al, Ca, ..., Pb, have been determined in these samples. The elemental compositions obtained for n (e.g. 100) samples, taken over a grid of sampling points, can be arranged in a data matrix X (Fig. 34.1). Each row of the table represents the elemental composition of one of the samples; a column represents the amount of one of the elements found in the sample set. Let us further suppose that there are two main sources of dust in the neighbourhood of the sampled area, and that the particles originating from each source have a specific concentration pattern for the elements Si to Pb. These concentration patterns are described by the vectors s1 and s2. For instance, the dust in the air may originate from a power station and from an incinerator, each having a specific concentration pattern s_k^T = [Si_k, Al_k, Ca_k, ..., Pb_k] with k = 1, 2. Obviously, each sample in the sampled area contains particles from each source, but in varying proportions. Some samples mainly contain particles from the power station and fewer from the incinerator; other samples may contain an equal amount of particles from each source. In general, one can say that the composition x_i of any sample i of dust is a linear combination of the two source patterns s1 and s2, given by: x_i = c_i1 s1 + c_i2 s2. In this expression c_i1 gives the contribution of the first source and c_i2 the contribution of the second dust source in sample i. For all n samples these contributions can be arranged in an n×2 matrix C, giving X = C S^T, where S is the p×2 matrix of the source patterns.
Fig. 34.1. The principle of factor analysis.
elements (Si to Pb) of the two dust sources are known, the relative contribution C of each source in the samples is estimated by solving the equation X = C S^T by multiple linear regression (see Chapter 10), provided that the concentration patterns are sufficiently different and that the number of sources nc is less than or equal to the number of measured elements p: C = X S (S^T S)^-1. Matters become more complex when the concentration patterns of the sources are not known. It becomes even more complicated when the number and the origin of the potential sources of the dust are unknown. In this case the number of sources and the concentration patterns of each source have to be estimated from the measured data table X. This operation is called factor analysis. In the terminology of factor analysis, the two sources of dust in our example are called factors, and the concentration patterns of the compounds in each source are called factor loadings. In Chapters 17 and 31 we explained that a matrix X can be decomposed by SVD into a product of three matrices: the two matrices of singular vectors U and V and a diagonal matrix of singular values Λ, such that:

X = U Λ V^T
(n×p) = (n×p)(p×p)(p×p)    (34.1)
where n is the number of samples (rows), and p is the number of variables (columns). Often the decomposition is given as a product of only two matrices:

X = T V^T
(n×p) = (n×p)(p×p)    (34.2)
where T (= UΛ) is the matrix of scores, proportional to their 'size', and V is the loading matrix. The columns of V are colloquially denoted as the principal components of X. Each row of the original data matrix X can be represented as a linear combination of all PCs in the loading matrix. The multiplicative factors (or regressors) of that linear combination for a particular row x_i of X are given by the corresponding row t_i in the score matrix: x_i = t_i V^T. X can be reconstructed within the measurement error E by taking the first nc PCs:

X = T* V*^T + E
(n×p) = (n×nc)(nc×p) + (n×p)
The symbol * means that the first nc significant columns of V and T are retained. The number of significant principal components, nc, which is the pseudo-rank of X, is usually unknown. Methods for estimating the number of components are discussed in Section 31.5. In the case of city air pollution caused by two sources of dust, one expects to find two significant PCs: PC1 and PC2, which are the two first rows of V^T. One could conclude that this decomposition yielded the requested information. The question arises then whether these profiles represent the true concentration profiles of the sources. Unfortunately, the answer is no! In Chapter 29, we explained that PC1 and PC2 are obtained under some specific constraints, i.e., PC1 is calculated under the constraint that it describes the maximum variance of X. The second loading vector PC2 also describes the maximum variance in X, but now in the direction orthogonal to PC1. These constraints are not necessarily valid for the true factors. It is very improbable that the true source profiles are orthogonal. Therefore the PCs are called abstract factors. The aim of factor analysis is to transform these abstract factors into real factors. PCA is a purely mathematical operation, using no other information than that the rows in X can be described by a linear combination of a number of linearly independent vectors. Factor analysis, however, usually requires the formulation of additional constraints to find a solution. These constraints are defined by the characteristics of the system being investigated. In this example, a constraint could be that all concentrations should be non-negative. Before going into more detail, a second example is given. Suppose that during the elution of two compounds from an HPLC column, one measures n (= 15) UV-visible
Fig. 34.2. UV-visible spectra of mixtures of fluoranthene and chrysene (see Fig. 34.3 for the pure spectra).
spectra at p (= 20) wavelengths. Because of the Lambert-Beer law, all measured spectra are linear combinations of the two pure spectra. Together they form a 15×20 data matrix. For example, the UV-visible spectra of mixtures of two polycyclic aromatic hydrocarbons (PAHs) given in Fig. 34.2 are linear combinations of the pure spectra shown in Fig. 34.3. These mixture spectra define a data matrix X, which can be written as the product of a 15×2 concentration matrix C with the 2×20 matrix S^T of the pure spectra:

X = C S^T
(n×p) = (n×nc)(nc×p)    (34.3)
where n (= 15) is the number of mixture spectra, nc (= 2) is the number of species, and p (= 20) is the number of wavelengths. The rows of X are mixture spectra and the columns are chromatograms at the p = 20 wavelengths. Here, columns as well as rows are linear combinations of pure factors: in this example pure row factors, being the pure spectra, and pure column factors, being the pure elution profiles. Any data matrix can be considered in two spaces: the column or variable space (here, wavelength space), in which a row (here, spectrum) is a vector in the multidimensional space defined by the column variables (here, wavelengths), and the row space (here, retention time space), in which a column (here, chromatogram) is a vector in the multidimensional space defined by the row variables (here, elution times). This duality of the multivariate spaces has been discussed in more detail in Chapter 29. Depending on the chosen space, the PCs of the data matrix
Fig. 34.3. UV-visible spectra of two polyaromatic hydrocarbons (PAHs), fluoranthene and chrysene.
have a different meaning. In wavelength space the eigenvectors of X^T X represent abstract spectra, and in retention time space the eigenvectors of X X^T are abstract chromatograms. Irrespective of the chosen space, by decomposing matrix X with a PCA as many significant principal components should be found as there are chemical species in the mixtures. The decomposition in the wavelength space, for a system with two compounds, is given by:

X = T* V*^T + E
(n×p) = (n×2)(2×p) + (n×p)    (34.4)
By decomposing the HPLC data matrix of spectra shown in Fig. 34.2 according to eq. (34.4), a matrix V* is obtained containing the two significant columns of V. Evidently the loading plots shown in Fig. 34.4 do not represent the two pure spectra, though each mixture spectrum can be represented as a linear combination of these two PCs. Therefore, these two PCs are called abstract spectra. Equations (34.3) and (34.4) show a decomposition of the data matrix X in two ways: the first is a decomposition in real factors, a product of a matrix S^T of the spectra with a matrix C of concentration profiles, and the second is a decomposition in abstract factors T* and V*^T. By factor analysis one transforms V*^T in eq. (34.4) into S^T in eq. (34.3). The score matrix T* gives the location of the spectra in the space defined by the two principal components. Figure 34.5 shows a scores plot thus obtained, with a clear structure (curve). The cause of this structure is explained in Section 34.2.1.
Fig. 34.4. The two first principal components of the data matrix of the spectra given in Fig. 34.2.
Fig. 34.5. Score plot (PC1 score vs PC2 score) of the mixture spectra given in Fig. 34.2.
The decomposition into elution profiles and spectra may also be represented as:

X^T = S C^T
(p×n) = (p×2)(2×n)

The corresponding decomposition by a principal components analysis gives:

X^T = P* Q*^T + E
(p×n) = (p×nc)(nc×n) + (p×n)
where the rows in Q*^T now represent abstract elution profiles. It should be noted that the score matrix P* has another meaning than T* in eq. (34.4): it represents here the location of the chromatograms in the factor space Q*. It should also be noted that Q is equivalent to U in eq. (34.1) and that P is equivalent to VΛ in eq. (34.1). Here, too, one wants to transform abstract elution profiles Q*^T into real elution profiles C^T by factor analysis. The result of a PCA carried out in this retention time space is given in Fig. 34.6. The two first PCs clearly have the appearance of elution profiles, but are not the true elution profiles. Because elution profiles have a much smoother appearance than spectra, which may have a very irregular form, abstract elution profiles are sometimes easier to interpret than abstract spectra. For instance, one can easily derive the positions of the peak maxima and also distinguish significant PCs from those which represent noise. The scores plot (Fig. 34.7) in which the chromatograms are plotted in the space defined by the wavelengths is less easily interpreted than the corresponding scores plot of the spectra shown in Fig. 34.5. The plot in Fig. 34.7 does not reveal any structure, because the consecutive chromatograms in X^T follow the irregular pattern of the absorptivity coefficients as a function of the wavelength. Therefore, if the aim of factor analysis is to transform PCs into real factors (by one of the methods explained in this chapter) one prefers the retention time space, because this yields loadings which are the easiest to interpret. On the other hand, if the aim of the analysis is to detect structure in the scores plot, the wavelength space is preferred. In this space regions of pure spectra (selective parts of the chromatograms), or regions where only binary mixtures are present, are more easily detected. The above considerations form the basis of the HELP procedure explained in Section 34.3.3. In some cases a principal components analysis of a spectroscopic-chromatographic data set detects only one significant PC. This indicates that only one chemical species is present and that the chromatographic peak is pure. However, in the presence of noise and artifacts, such as a drifting baseline or a non-linear response, conclusions on peak purity may be wrong. Because the peak purity assessment is the first step in the detection and identification of an impurity by factor analysis, we give some attention to this subject in this chapter.
Fig. 34.6. The two first principal components of an LC-DAD data set in the retention time space.
Fig. 34.7. Score plot (PC1 score vs PC2 score) of chromatograms in the retention time space.
Basically, we make a distinction between methods which are carried out in the space defined by the original variables (Section 34.4) or in the space defined by the principal components. A second distinction we can make is between full-rank methods (Section 34.2), which consider the whole matrix X, and evolutionary methods (Section 34.3) which analyse successive sub-matrices of X, taking into account the fact that the rows of X follow a certain order. A third distinction we make is between general methods of factor analysis which are applicable to any data matrix X, and specific methods which make use of specific properties of the pure factors.
34.2 Full-rank methods

34.2.1 A qualitative approach

Before going into detail about various specific methods to estimate pure factors, we qualitatively describe how pure factors can be derived from the principal components. This is illustrated with a data matrix of the two-component HPLC example discussed in the previous section. When discussing the scores plot (Fig. 34.5) we mentioned that the scores showed some structure. From the origin, two straight lines depart, which are connected by a curved line. In Chapter 17 we explained that these straight lines coincide with the pure spectra present in the pure elution time zones. The distance from the origin is a measure for the 'size' of the spectrum. The curved part represents the zone where two compounds co-elute. Going through the curve starting from the origin we find pure spectra of compound 1 in increasing concentrations, then mixtures of compounds 1 and 2, followed by the spectra of compound 2 in decreasing concentration, back to the origin. The angle between the two lines is a measure for the correlation between the two pure spectra. If the spectra are uncorrelated (i.e. very dissimilar) the two lines are orthogonal. At high correlations the angle becomes very small. The two lines define the directions of the pure factors in the PC1-PC2 space. In this simplified situation no factor analysis is needed to find the pure factors. From the scores plot we also observe that the pure factors make angles α1 and α2 with PC1 and PC2, respectively. Conversely, one could also find the pure factors by rotating PC1 over an angle α1 and PC2 over an angle α2. Finding these angles is the purpose of factor analysis. When pure spectra are present in the data set, finding these angles is quite straightforward. Therefore, several factor analysis methods aim at finding the purest rows (spectra) or purest columns (wavelengths) in the data set, which is discussed in Section 34.4. When no pure row or column is available for one of the factors, we cannot directly derive the rotation angles from the scores plot, because the straight line segments are missing. In this case we need to make some
assumptions about the pure factors in order to estimate the rotation angles. Obvious assumptions in chromatography are the non-negativity of the absorbances and the concentration profiles. If no constraints can be formulated to estimate the rotation angles, one must rely on abstract rotation procedures. An example of this type of rotation is the Varimax method of Kaiser [1], which is explained in Section 34.2.3.

34.2.2 Factor rotations

A row of a data matrix can be interpreted as a point in the space defined by its column variables:

X = | x_1  y_1 |
    | x_2  y_2 |
    | ...  ... |
    | x_n  y_n |

For instance, the first row of the matrix X defines a point with the coordinates (x_1, y_1) in the space defined by the two orthogonal axes x^T = (1 0) and y^T = (0 1). Factor rotation means that one rotates the original axes x^T = (1 0) and y^T = (0 1) over a certain angle θ. With orthogonal rotation in two-dimensional space both axes are rotated over the same angle. The distance between the points remains unchanged.
Fig. 34.8. Orthogonal rotation of the (x,y) axes into (k,l) axes.
Rotated axes are characterized by their position in the original space, given by the vectors k^T = [cosθ  -sinθ] and l^T = [sinθ  cosθ] (see Fig. 34.8). In PCA or FA, these axes fulfil specific constraints (see Chapter 17). For instance, in PCA the direction of k is the direction of the maximum variance of all points projected on this axis. A possible constraint in FA is maximum simplicity of k, which is explained in Section 34.2.3. The new axes (k,l) define another basis of the same space. The position of the vector [x_i y_i] is now [k_i l_i] relative to these axes. Factor rotation involves the calculation of [k_i l_i] from [x_i y_i], given a rotation angle with respect to the original axes. Suppose that after rotation the matrix X is transformed into a matrix F:

F = | k_1  l_1 |
    | k_2  l_2 |
    | ...  ... |
    | k_n  l_n |

Then the following relationship exists between [k_i l_i] and [x_i y_i]:

k_i = x_i cosθ - y_i sinθ
l_i = x_i sinθ + y_i cosθ

or, in matrix notation (for all vectors i):

| k_1  l_1 |   | x_1  y_1 |
| k_2  l_2 | = | x_2  y_2 | |  cosθ  sinθ |
| ...  ... |   | ...  ... | | -sinθ  cosθ |
| k_n  l_n |   | x_n  y_n |

which gives:

F = X R    (34.4)

with

R = |  cosθ  sinθ |
    | -sinθ  cosθ |

Columns and rows of R are orthogonal with a norm equal to one. Therefore, R defines a rotation, for which R R^T = R^T R = I.
The aim of factor analysis is to calculate a rotation matrix R which rotates the abstract factors (V) (principal components) into interpretable factors. The various algorithms for factor analysis differ in the criterion to calculate the rotation matrix R. Two classes of rotation methods can be distinguished: (i) rotation procedures based on general criteria which are not specific for the domain of the data, and (ii) rotation procedures which use specific properties of the factors (e.g. non-negativity).

34.2.3 The Varimax rotation

In the previous section we have seen that axes defined by the column variables can be rotated. It is also possible to rotate the principal components. Instead of rotating the axes which define the column space of X, we rotate here the significant PCs in the sub-space defined by V*^T:

F = V*^T R
(nc×p) = (nc×p)(p×p)    (34.5)
The columns of V* are the abstract factors of X which should be rotated into real factors. The matrix V*^T is rotated by means of an orthogonal rotation matrix R, so that the resulting matrix F = V*^T R fulfils a given criterion. The criterion in Varimax rotation is that the rows of F obtain maximal simplicity, which is usually denoted as the requirement that F has maximum row simplicity. The idea behind this criterion is that 'real' factors should be easily interpretable, which is the case when the loadings of the factor are grouped over only a few variables. For instance, the vector f_1^T = [0 0 0 0.5 0.8 0.33] may be easier to interpret than the vector f_2^T = [0.1 0.3 0.1 0.4 0.4 0.75]. It is more likely that the simple vector is a pure factor than the less simple one. Returning to the air pollution example, the simple vector f_1^T may represent the concentration profile of one of the pollution sources, which mainly contains the three last constituents. A measure for the simplicity of a vector f_i is the variance of the squares of its p elements, which should be maximized [2]:

Simp = var(f_i^2) = (1/p) Σ_{j=1..p} (f_ij^2 - m_i)^2

where m_i is the mean of the squared elements of f_i.
To illustrate the concept of simplicity of a vector, we calculate the simplicity of the vectors f_1 and f_2:

f_1     f_2     f_1^2   f_2^2
0       0.1     0       0.01
0       0.3     0       0.09
0       0.1     0       0.01
0.5     0.4     0.25    0.16
0.8     0.4     0.64    0.16
0.33    0.75    0.11    0.56

var(f_1^2) = 0.053      var(f_2^2) = 0.035
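These values are easily verified in a few lines; a sketch assuming NumPy:

    import numpy as np

    def simplicity(f):
        # variance of the squared elements of a loading vector (see above)
        f2 = np.asarray(f) ** 2
        return np.mean((f2 - f2.mean()) ** 2)

    f1 = [0, 0, 0, 0.5, 0.8, 0.33]
    f2 = [0.1, 0.3, 0.1, 0.4, 0.4, 0.75]
    print(round(simplicity(f1), 3), round(simplicity(f2), 3))   # 0.053 0.035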
The simplicity of a matrix is the sum of the simplicities of its rows. Thus the simplicity of a matrix F with the rows f_1^T and f_2^T as given before is 0.088. f_1 and f_2 are called varivectors. By a varimax rotation the matrix V*^T is rotated by means of an orthogonal rotation matrix R, so that the simplicity of the resulting matrix F = V*^T R is maximal. Several algorithms have been proposed [2] for the calculation of R. Let us suppose that a PCA of X with four variables yields the following two significant principal components: PC1: [0.102 -0.767 0.153 0.614] and PC2: [0.438 0.501 0.626 0.407]. The simplicity of the matrix with the rows PC1 and PC2 is equal to 0.0676. A varimax rotation of this matrix yields the varivectors f_1^T = [-0.167 -0.916 -0.233 0.269] and f_2^T = [0.417 -0.030 0.601 0.685] at an angle of 35 degrees with the PCs, and a corresponding simplicity equal to 0.1486. One can see that varivector f_1 is mainly directed along variable 2, which is probably a pure factor. The other variables (1, 3 and 4) load high on varivector f_2. Therefore, they belong to the second pure factor. In any case, because the varivectors are simpler than the original PCs, one can safely conclude that they resemble the pure factors more closely. Several applications of varimax rotation in analytical chemistry have been reported. As an example, the varimax rotation is applied to the HPLC data table of PAHs introduced in Section 34.1.
Fig. 34.9. Varimax rotated principal components given in Fig. 34.6.
A PCA applied to the transpose of this data matrix yields abstract 'chromatograms' which are not the pure elution profiles. These PCs are not simple, as they show several minima and/or maxima coinciding with the positions of the pure elution profiles (see Fig. 34.6). By a varimax rotation it is possible to transform these PCs into vectors with a larger simplicity (grouped variables and other variables near to zero). When the chromatographic resolution is fairly good, these simple vectors coincide with the pure factors, here the elution profiles of the species in the mixture (see Fig. 34.9). Several variants of the varimax rotation, which differ in the way the rotated vectors are normalized, have been reviewed by Forina et al. [2].

34.2.4 Factor rotation by target transformation factor analysis (TTFA)

In the previous section, factors were found by rotating the abstract factors into pure factors, obeying a number of constraints. In some cases, however, one may have a collection of candidate pure factors, e.g. a set of UV-visible or mass spectra of chemical compounds. Having measured a data matrix of mixture spectra, one could investigate whether compounds present in the mixture match compounds available in the data base of pure spectra. In that situation one could first estimate the pure spectra from the mixture spectra and thereafter compare the obtained spectra with the spectra in the data base. Alternatively, one could identify the pure spectra by solving the equation X = C S^T for C, where S is the set of candidate spectra to be tested and X contains the mixture spectra. All non-zero rows of C indicate the presence of the spectrum in the corresponding row of S. This method fails when S does not contain all pure spectra present in the mixtures. Moreover, this procedure becomes impractical when the number of candidate spectra is very large and the whole data base has to be checked. Furthermore, when the spectra to be checked are quite similar, the calculation of (S^T S)^-1 becomes unstable, leading to large errors in C and the indication of wrong spectra [3]. By target transformation factor analysis (TTFA), each candidate spectrum, called a target, is tested individually on its presence in the mixtures [4,5]. Here, targets are tested in the space defined by the significant principal components of the data matrix. Therefore, TTFA begins with a PCA of the data matrix X of the measured spectra. The principle of TTFA can be explained in an algebraic as well as a geometrical way. We start with the algebraic approach. Because, according to eq. (34.4), any row of X can be written as
x_i = t_i V*^T + e_i
(1×p) = (1×nc)(nc×p) + (1×p)

each mixture spectrum is a linear combination of the nc significant eigenvectors. Equally, the pure spectra are linear combinations of the first nc PCs. A target
spectrum taken from the library can be tested on this property. If the test passes, the spectrum or target may be one of the pure factors. How this is done is explained below. The first step is to calculate the scores t_in of the target spectrum in to be tested, by solving the equation:

in = t_in V*^T + e
(1×p) = (1×nc)(nc×p) + (1×p)

giving

t_in = in V* (V*^T V*)^-1 = in V*

These scores give the linear combination of the PCs that provides the best estimation (in a least squares sense) of the target spectrum. How good that estimation is can be evaluated by calculating the sum of squares of the residuals between the re-estimated target or output target (obtained from its scores) and the input target. The output target out is equal to t_in V*^T. The overall expression for TTFA therefore becomes:

out = in V* V*^T    (34.6)
If the difference between out and in (||out - in||) can be explained by the variance of the noise, the test passes and the target is possibly one of the pure factors.
Fig. 34.10. Simulated LC-DAD data set of the separation of three PAHs (spectra 4, 5 and 6 in Fig. 34.11) (individual profiles and the sum).
Fig. 34.40. Normalized concentration profiles of a minor and main compound for a system with 0.2% of prednisone; chromatographic resolution is 0.8.
In summary, the selection procedure consists of three steps: (1) compare each spectrum in X with all spectra already selected, by applying eq. (34.14); initially, when no spectrum has been selected, the spectra are compared with the average spectrum of matrix X; (2) plot the dissimilarity values as a function of the retention time (dissimilarity plot); and (3) select the spectrum with the highest dissimilarity value by including it as a reference in the matrix of selected spectra. The selection of the spectra is finished when the dissimilarity plot shows a random pattern. It is considered that there are as many compounds as there are selected spectra. Once the purest spectra are available, the data matrix X can be resolved into its spectra and elution profiles by alternating regression, explained in Section 34.3.1. By way of illustration, let us consider the separation of 0.2% prednisone in hydrocortisone, eluting with a chromatographic resolution equal to 0.8 [30] (Fig. 34.40). The dissimilarity of each spectrum with respect to the mean spectrum is plotted in Fig. 34.41a. Two clearly differentiated peaks with maxima around times 46 and 63 indicate the presence of at least two compounds.
Fig. 34.41. Dissimilarity of each spectrum with respect to (a) the mean spectrum, (b) the spectrum at time 46 and (c) the spectra at times 46 and 63, for the system of Fig. 34.40.
In this case, the dissimilarity of the spectrum at time 46 is slightly higher than that of the spectrum at time 63, and it is the first spectrum selected. Each spectrum is then compared with the spectrum at time 46 and the dissimilarity is plotted versus time (Fig. 34.41b). The spectrum at time 63 has the highest dissimilarity value and, therefore, it is the second spectrum selected. The procedure is continued by calculating the dissimilarity of all remaining spectra with respect to the two already selected spectra, which is plotted in Fig. 34.41c. As one can see, the dissimilarity values are about 1000 times smaller than the smallest value obtained so far. Moreover, no peak is observed in the plot. This leads to the conclusion that no third component is present in the data. A comparison of the performance of FSWEFA, SIMPLISMA and OPA on a real data set of LC-FTIR spectra containing three complex clusters of co-eluting compounds is given in Ref. [31]. An alternative method, key-set factor analysis, which looks for a set of purest rows, called the key set, has been developed by Malinowski [32].
34.5 Quantitative methods for factor analysis

The aim of all the foregoing methods of factor analysis is to decompose a data set into physically meaningful factors, for instance pure spectra from an HPLC-DAD data set. After those factors have been obtained, quantitation should be possible by calculating the contribution of each factor in the rows of the data matrix. By ITTFA (see Section 34.2.6), for example, one estimates the elution profiles of each individual compound. However, for quantitation the peak areas have to be correlated to the concentration by a calibration step. This is particularly important when using a diode array detector, because the response factors (absorptivity) may vary considerably with the compound considered. Some methods of factor analysis require the presence of a pure variable for each factor. In that case quantitation becomes straightforward and does not need a multivariate approach, because full selectivity is available. In this section we focus on methods for the quantitation of a compound in the presence of an unknown interference, without the requirement that this interference should be identified first or that its spectrum should be estimated. Hyphenated methods are the main application domain. The methods we discuss are the generalized rank annihilation method (GRAM) and residual bilinearization (RBL).

34.5.1 Generalized rank annihilation factor analysis (GRAFA)

In 1978, Ho et al. [33] published an algorithm for rank annihilation factor analysis (RAFA). The procedure requires two bilinear data sets, a calibration standard set X_c and a sample set X_u. The calibration set is obtained by measuring a standard mixture which contains known amounts of the analytes of interest. The sample set contains the measurements of the sample in which the analytes have to be quantified. Let us assume that we are only interested in one analyte. By a PCA we obtain the rank R_u of the data matrix X_u, which is theoretically equal to 1 + n_i, where n_i is the number of interfering compounds. Because the calibration set contains only one compound, its rank R_c is equal to one. In the next step, the rank is calculated of the difference matrix X_d = X_u - k X_c. For any value of k, the rank of X_d is equal to 1 + n_i, except for the case where k is exactly equal to the contribution of the analyte to the signal. In that case the rank of X_d is R_u - 1. Thus the concentration of the analyte in the unknown sample can be found by determining the k-value for which the rank of X_d is equal to R_u - 1. The amount of the analyte in the sample is then equal to k c_s, where c_s is the concentration of the analyte in the standard solution. In order to find this k-value, Ho et al. proposed an iterative procedure which plots the eigenvalue of the least significant PC of X_d as a function of k. This eigenvalue becomes minimal when k exactly compensates the signal of the analyte in the sample. For other k-values the signal is under- or overcompensated, which results in a higher value of the eigenvalue. An example of such a plot is given in Fig. 34.42. When several analytes have to be determined, this procedure needs to be repeated for each analyte. Because this algorithm requires that a PCA is calculated for each considered value of k, RAFA is computationally intensive. Sanchez and Kowalski [34] introduced generalized rank annihilation factor analysis (GRAFA).

Fig. 34.42. RAFA plot of the least significant eigenvalue as a function of k (see text for an explanation of k).
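The k-scan is straightforward to sketch. The following assumes NumPy and a noise-free simulation in which the analyte contributes a factor 0.4 of the standard signal; all values are illustrative:

    import numpy as np

    rng = np.random.default_rng(2)
    c = rng.uniform(0, 1, (30, 2))     # elution profiles: analyte, interferent
    s = rng.uniform(0, 1, (20, 2))     # spectra: analyte, interferent
    Xc = np.outer(c[:, 0], s[:, 0])    # standard: analyte only
    Xu = 0.4 * np.outer(c[:, 0], s[:, 0]) + 0.7 * np.outer(c[:, 1], s[:, 1])

    ks = np.linspace(0, 1, 101)
    # second singular value = least significant component of this rank-2 system
    ev = [np.linalg.svd(Xu - k * Xc, compute_uv=False)[1] for k in ks]
    print(ks[int(np.argmin(ev))])      # minimum at k = 0.4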
More than one analyte can be quantified simultaneously in the presence of interfering compounds. The required measurements are identical to RAFA: a data matrix X_u of the unknown sample and a calibration matrix X_c with the analytes.

34.5.2 Residual bilinearization (RBL)

In order to apply residual bilinearization [35] at least two data sets are needed: X_u, which is the data set measured for the unknown sample, and X_c, which is the data matrix of a calibration standard containing the analyte of interest. In the absence of interferences these two data matrices are related to each other as follows:

X_u = b X_c + R    (34.15)

b is a coefficient which relates the concentration of the analyte in the unknown sample to the concentration in the calibration standard, where c_u = b c_c. R is a residual matrix which contains the measurement error. Its rows represent null spectra. However, in the presence of other (interfering) compounds, the residual matrix R is not random, but contains structure. Therefore the rank of R is greater than zero. A PCA of R, after retaining the significant PCs, gives:

R = T* V*^T + E    (34.16)

By combining eqs. (34.15) and (34.16) we obtain:

X_u = b X_c + T* V*^T + E    (34.17)
By RBL the regression coefficient b is calculated by minimizing the sum of squares of the elements in E. Because the rank of R in eq. (34.16) is unknown, the estimation of b from eq. (34.17) should be repeated for an increasing number of principal components included in V*^T. Schematically the procedure proceeds as follows (a sketch in code is given after this list):
(1) Start with an initial estimate b_0 of b;
(2) Calculate R = X_u - b_0 X_c;
(3) Determine the rank of R and decompose R into T* V*^T + E;
(4) Obtain a new estimate b_1 of b by solving X_u = b X_c + T* V*^T for b, in which T* V*^T is the result of step 3. This yields: b_1 = (X_c^T X_c)^-1 X_c^T (X_u - T* V*^T);
(5) Repeat steps (2) to (4) after substituting b_0 with b_1.
The iteration process is stopped after b has converged to a constant value.
If na analytes are quantified simultaneously, data matrices of standard samples are measured for each analyte separately. These matrices X_s1, X_s2, ..., X_s,na are collected in a three-way data matrix X_s of size n×p×na, where n is the number of spectra in X_s1, ..., X_s,na, p is the number of wavelengths and na is the number of analytes. The basic equation for this multicomponent system is given by:

X_u = X_s b + R    (34.18)
where X_s is the three-way matrix of calibration data and b is a vector of regression coefficients related to the unknown concentrations by c_u = c_s b. How to perform matrix operations on a three-way table is discussed in Section 31.17. The procedure is then continued in a similar way as for the one-component case: eq. (34.18) is solved for b iteratively by substituting R = T* V*^T + E, as explained before. Because the concentrations c_s are known, the three-way data matrix X_s measured for the standard samples can be directly resolved into its elution profiles and spectra by Parafac [36], explained in Section 31.8.3. References to other methods for the decomposition of three-way multicomponent profiles are included in the list of additional recommended reading.

34.5.3 Discussion

In order to apply RBL or GRAFA successfully, some attention has to be paid to the quality of the data. Like any other multivariate technique, the results obtained by RBL and GRAFA are affected by non-linearity of the data and heteroscedasticity of the noise. By both phenomena the rank of the data matrix is higher than the number of species present in the sample. This has been demonstrated on the PCA results obtained for an anthracene standard solution eluted and detected by three different brands of diode array detectors [37]. In all three cases significant second eigenvalues were obtained and structure is seen in the second principal component. A particular problem with GRAFA and RBL is the reproducibility of the retention data. The retention time axes should be perfectly synchronized. Small shifts of one time interval (so that the ith spectrum in X_u corresponds with the (i+1)th spectrum in X_c) already introduce major errors (> 5%) when the chromatographic resolution is less than 0.6. The results of an extensive study on the influence of these factors on the accuracy of the results obtained by GRAFA and RBL have been reported in Ref. [37]. Although some practical applications have been reported [38,39], the lack of robustness of RBL and GRAFA due to the artifacts mentioned above has limited their widespread application in chromatography.

34.6 Application of factor analysis for peak purity check in HPLC

In pharmaceutical analysis the detection of impurities under a chromatographic peak is a major issue. An important step forward in the assessment of peak purity was the introduction of hyphenated techniques. When selecting a method to perform a purity check, one has the choice between a global method, which considers a whole peak cluster (from the start to the end of the peak), and evolutionary methods, which consider a window of the peak cluster, which is
usually moved over the cluster. All global methods, except PCA, usually apply a stepwise approach, e.g. SIMPLISMA, OPA and HELP. HELP is a very versatile tool for a visual inspection and exploration of the data. Several complications can be present, such as heteroscedastic noise, a sloping baseline, a large scan time and non-linear absorbance [40]. These may lead to an overestimation of the number of compounds present. Heteroscedastic noise and non-linearities have an important effect on all PCA-based methods, such as EFA and FSWEFA. Non-zero and sloping baselines have a critical effect in SIMPLISMA, HELP and FSWEFA. In any case it is better to correct for the baseline prior to the application of any multivariate technique. Baseline correction can be done by subtracting a linear interpolation of the noise spectra before and after the peak, or by row-centring the data [40]. Most analytical instruments have a restricted linear range, and outside that range Beer's law no longer holds. Non-linear absorbance indicates the presence of more compounds in all the approaches discussed in this chapter. In some cases it is possible to detect a characteristic profile indicating the presence of non-linearities. In any case the best remedy is to keep the signal within the linear range. A non-linearity may also be introduced because the DAD needs about 10-50 ms to measure a whole spectrum. During that time the concentration of the eluting compound(s) may change significantly. The most sensitive methods for the detection of small amounts of impurities eluting at low chromatographic resolutions, OPA and HELP, are also the ones most affected by these non-linearities. If the scan time is known, a partial correction is possible. EFA, FSWEFA and ETA, which belong to the family of evolutionary methods, perform somewhat less well for purity checking. They may also flag impurities due to the heteroscedasticity of the noise and non-linearity of the signal. For a more detailed discussion we refer to Ref. [40].
34.7 Guidance for the selection of a factor analysis method

The first step in analysing a data table is to determine how many pure factors have to be estimated. Basically, there are two approaches which we recommend: one starts either with a PCA, or with OPA or SIMPLISMA. PCA yields the number of factors and the significant principal components, which are abstract factors. OPA yields the number of factors and the purest rows (or columns) in the data table, which are factors. If we suspect a certain order in the spectra, we preferentially apply evolutionary techniques such as FSWEFA or HELP to detect pure zones, or zones with two or more components. Depending on the way the analysis was started, either the abstract factors found by a PCA or the purest rows found by OPA should be transformed into pure factors. If no constraints can be formulated on the pure factors, the purest rows
(spectra) found by OPA cannot be improved. On the contrary, a PCA can either be followed by a varimax rotation or by constructing a variance diagram, which yields factors with the greatest simplicity. If constraints can be formulated on the pure factors, a PCA can be followed by a curve resolution, under the condition that only two compounds are present. OPA (or SIMPLISMA or FSWEFA) can be followed by alternating regression to iteratively estimate the pure row-factors (spectra) and pure column-factors (elution profiles). In a similar way, the varimax and vardia factors can be improved by alternating regression. The success of ITTFA in finding pure factors depends on its convergence to a pure factor by a stepwise application of constraints on the solution, which has been demonstrated on elution profiles. However, it then requires a PCA in the retention time space. Although the decomposition of a data table yields the elution profiles of the individual compounds, a calibration step is still required to transform peak areas into concentrations. Essentially we can follow two approaches. The first one is to start with a decomposition of the peak cluster by one of the techniques described before, followed by the integration of the peak of the analyte. By comparing the peak area with those obtained for a number of standards we obtain the amount. One should realize that the decomposition step is necessary because the interfering compound is unknown. The second approach is to directly calibrate the method by RAFA, RBL or GRAFA, or to decompose the three-way table by Parafac. A serious problem with these methods is that the data sets measured for the sample and for the standard solution should be perfectly synchronized.

References

1. H.F. Kaiser, The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23 (1958) 187-200.
2. M. Forina, C. Armanino, S. Lanteri and R. Leardi, Methods of varimax rotation in factor analysis with applications in clinical and food chemistry. J. Chemom., 3 (1988) 115-125.
3. J.K. Strasters, H.A.H. Billiet, L. de Galan, B.G.M. Vandeginste and G. Kateman, Evaluation of peak-recognition techniques in liquid chromatography with photodiode array detection. J. Chromatog., 385 (1987) 181-200.
4. E.R. Malinowski and D. Howery, Factor Analysis in Chemistry. Wiley, New York, 1980.
5. P.K. Hopke, Target transformation factor analysis. Chemom. Intell. Lab. Syst., 6 (1989) 7-19.
6. J.K. Strasters, H.A.H. Billiet, L. de Galan, B.G.M. Vandeginste and G. Kateman, Reliability of iterative target transformation factor analysis when using multiwavelength detection for peak tracking in liquid-chromatographic separation. Anal. Chem., 60 (1988) 2745-2751.
7. W.H. Lawton and E.A. Sylvestre, Self modeling curve resolution. Technometrics, 13 (1971) 617-633.
8. B.G.M. Vandeginste, R. Essers, T. Bosman, J. Reijnen and G. Kateman, Three-component curve resolution in HPLC with multiwavelength diode array detection. Anal. Chem., 57 (1985) 971-985.
9. O.S. Borgen and B.R. Kowalski, An extension of the multivariate component-resolution method to three components. Anal. Chim. Acta, 174 (1985) 1-26.
10. A. Meister, Estimation of component spectra by the principal components method. Anal. Chim. Acta, 161 (1984) 149-161.
11. B.G.M. Vandeginste, F. Leyten, M. Gerritsen, J.W. Noor, G. Kateman and J. Frank, Evaluation of curve resolution and iterative target transformation factor analysis in quantitative analysis by liquid chromatography. J. Chemom., 1 (1987) 57-71.
12. P.K. Hopke, D.J. Alpert and B.A. Roscoe, FANTASIA — A program for target transformation factor analysis to apportion sources in environmental samples. Comput. Chem., 7 (1983) 149-155.
13. P.J. Gemperline, A priori estimates of the elution profiles of the pure components in overlapped liquid chromatography peaks using target factor analysis. J. Chem. Inf. Comput. Sci., 24 (1984) 206-212.
14. P.J. Gemperline, Target transformation factor analysis with linear inequality constraints applied to spectroscopic-chromatographic data. Anal. Chem., 58 (1986) 2656-2663.
15. B.G.M. Vandeginste, W. Derks and G. Kateman, Multicomponent self modelling curve resolution in high performance liquid chromatography by iterative target transformation analysis. Anal. Chim. Acta, 173 (1985) 253-264.
16. A. de Juan, B. van den Bogaert, F. Cuesta Sanchez and D.L. Massart, Application of the needle algorithm for exploratory analysis and resolution of HPLC-DAD data. Chemom. Intell. Lab. Syst., 33 (1996) 133-145.
17. M. Maeder, Evolving factor analysis for the resolution of overlapping chromatographic peaks. Anal. Chem., 59 (1987) 527-530.
18. H. Gampp, M. Maeder, C.J. Meyer and A.D. Zuberbuhler, Calculation of equilibrium constants from multiwavelength spectroscopic data. III. Model-free analysis of spectrophotometric and ESR titrations. Talanta, 32 (1985) 1133-1139.
19. M. Maeder and A.D. Zuberbuhler, The resolution of overlapping chromatographic peaks by evolving factor analysis. Anal. Chim. Acta, 181 (1986) 287-291.
20. R. Tauler and E. Casassas, Application of principal component analysis to the study of multiple equilibria systems — Study of copper(II) salicylate monoethanolamine, diethanolamine and triethanolamine systems. Anal. Chim. Acta, 223 (1989) 257-268.
21. E.J. Karjalainen, Spectrum reconstruction in GC/MS. The robustness of the solution found with alternating regression, in: E.J. Karjalainen (Ed.), Scientific Computing and Automation (Europe). Elsevier, Amsterdam, 1990, pp. 477-488.
22. H.R. Keller and D.L. Massart, Peak purity control in liquid chromatography with photodiode array detection by fixed size moving window evolving factor analysis. Anal. Chim. Acta, 246 (1991) 379-390.
23. J. Toft and O.M. Kvalheim, Eigenstructure tracking analysis for revealing noise patterns and local rank in instrumental profiles: application to transmittance and absorbance IR spectroscopy. Chemom. Intell. Lab. Syst., 19 (1993) 65-73.
24. O.M. Kvalheim and Y.-Z. Liang, Heuristic evolving latent projections — resolving 2-way multicomponent data. 1. Selectivity, latent projective graph, datascope, local rank and unique resolution. Anal. Chem., 64 (1992) 936-946.
25. M.J.P. Gerritsen, H. Tanis, B.G.M. Vandeginste and G. Kateman, Generalized rank annihilation factor analysis, iterative target transformation factor analysis and residual bilinearization for the quantitative analysis of data from liquid chromatography with photodiode array detection. Anal. Chem., 64 (1992) 2042-2056.
26. H.R. Keller and D.L. Massart, Artifacts in evolving factor analysis-based methods for peak purity control in liquid chromatography with diode array detection. Anal. Chim. Acta, 263 (1992) 21-28.
27. W. Windig and H.L.C. Meuzelaar, Nonsupervised numerical component extraction from pyrolysis mass spectra of complex mixtures. Anal. Chem., 56 (1984) 2297-2303.
28. W. Windig and J. Guilment, Interactive self-modeling mixture analysis. Anal. Chem., 63 (1991) 1425-1432.
29. W. Windig, C.E. Heckler, F.A. Agblevor and R.J. Evans, Self-modeling mixture analysis of categorized pyrolysis mass-spectral data with the Simplisma approach. Chemom. Intell. Lab. Syst., 14 (1992) 195-207.
30. F.C. Sanchez, J. Toft, B. van den Bogaert and D.L. Massart, Orthogonal projection approach applied to peak purity assessment. Anal. Chem., 68 (1996) 79-85.
31. F.C. Sanchez, T. Hancewicz, B.G.M. Vandeginste and D.L. Massart, Resolution of complex liquid chromatography Fourier transform infrared spectroscopy data. Anal. Chem., 69 (1997) 1477-1484.
32. E.R. Malinowski, Obtaining the key set of typical vectors by factor analysis and subsequent isolation of component spectra. Anal. Chim. Acta, 134 (1982) 129-137.
33. C.N. Ho, G.D. Christian and E.R. Davidson, Application of the method of rank annihilation to quantitative analysis of multicomponent fluorescence data from the video fluorometer. Anal. Chem., 52 (1980) 1108-1113.
34. E. Sanchez and B.R. Kowalski, Generalized rank annihilation factor analysis. Anal. Chem., 58 (1986) 496-499.
35. J. Ohman, P. Geladi and S. Wold, Residual bilinearization. Part I: Theory and algorithms. J. Chemom., 4 (1990) 79-90.
36. A.K. Smilde, Three-way analysis. Problems and prospects. Chemom. Intell. Lab. Syst., 15 (1992) 143-157.
37. M.J.P. Gerritsen, N.M. Faber, M. van Rijn, B.G.M. Vandeginste and G. Kateman, Realistic simulations of high-performance liquid-chromatographic ultraviolet data for the evaluation of multivariate techniques. Chemom. Intell. Lab. Syst., 2 (1992) 257-268.
38. E. Sanchez, L.S. Ramos and B.R. Kowalski, Generalized rank annihilation method. I. Application to liquid chromatography-diode array ultraviolet detection data. J. Chromatog., 385 (1987) 151-164.
39. L.S. Ramos, E. Sanchez and B.R. Kowalski, Generalized rank annihilation method. II. Analysis of bimodal chromatographic data. J. Chromatog., 385 (1987) 165-180.
40. F.C. Sanchez, B. van den Bogaert, S.C. Rutan and D.L. Massart, Multivariate peak purity approaches. Chemom. Intell. Lab. Syst., 34 (1996) 139-171.
Additional recommended reading

Books

E.R. Malinowski and D.G. Howery, Factor Analysis in Chemistry, 2nd Edn. Wiley, New York, 1992.
R. Coppi and S. Bolasco (Eds.), Multiway Data Analysis. North-Holland, Amsterdam, 1989.
Articles

Target transformation factor analysis:
P.K. Hopke, Tutorial: Target transformation factor analysis. Chemom. Intell. Lab. Syst., 6 (1989) 7-19.
Rank annihilation factor analysis:
C.N. Ho, G.D. Christian and E.R. Davidson, Application of the method of rank annihilation to fluorescent multicomponent mixtures of polynuclear aromatic hydrocarbons. Anal. Chem., 52 (1980) 1071-1079.
J. Ohman, P. Geladi and S. Wold, Residual bilinearization, part 2: Application to HPLC-diode array data and comparison with rank annihilation factor analysis. J. Chemom., 4 (1990) 135-146.

Evolutionary methods:
H.R. Keller and D.L. Massart, Evolving factor analysis. Chemom. Intell. Lab. Syst., 12 (1992) 209-224.
F. Cuesta Sanchez, M.S. Khots, D.L. Massart and J.O. De Beer, Algorithm for the assessment of peak purity in liquid chromatography with photodiode-array detection. Anal. Chim. Acta, 285 (1994) 181-192.
J. Toft, Tutorial: Evolutionary rank analysis applied to multidetectional chromatographic structures. Chemom. Intell. Lab. Syst., 29 (1995) 189-212.

Three-way methods:
B. Grung and O.M. Kvalheim, Detection and quantitation of embedded minor analytes in three-way multicomponent profiles by evolving projections and internal rank annihilation. Chemom. Intell. Lab. Syst., 29 (1995) 213-221.
B. Grung and O.M. Kvalheim, Rank mapping of three-way multicomponent profiles. Chemom. Intell. Lab. Syst., 29 (1995) 223-232.
R. Tauler, A.K. Smilde and B.R. Kowalski, Selectivity, local rank, three-way data analysis and ambiguity in multivariate curve resolution. J. Chemom., 9 (1995) 31-58.

Simplisma:
W. Windig and J. Guilment, Interactive self-modeling mixture analysis. Anal. Chem., 63 (1991) 1425-1432.
W. Windig and D.A. Stephenson, Self-modeling mixture analysis of second-derivative near-infrared spectral data using the Simplisma approach. Anal. Chem., 64 (1992) 2735-2742.

Alternating least squares method:
R. Tauler, A.K. Smilde, J.M. Henshaw, L.W. Burgess and B.R. Kowalski, Multicomponent determination of chlorinated hydrocarbons using a reaction-based chemical sensor. 2. Chemical speciation using multivariate curve resolution. Anal. Chem., 66 (1994) 3337-3344.
R. Tauler, A. Izquierdo-Ridorsa, R. Gargallo and E. Casassas, Application of a new multivariate curve-resolution procedure to the simultaneous analysis of several spectroscopic titrations of the copper(II)-polyinosinic acid system. Chemom. Intell. Lab. Syst., 27 (1995) 163-174.
S. Lacorte, D. Barcelo and R. Tauler, Determination of traces of herbicide mixtures in water by on-line solid-phase extraction followed by liquid chromatography with diode-array detection and multivariate self-modelling curve resolution. J. Chromatog. A, 697 (1995) 345-355.
Chapter 35
Relations between measurement tables

35.1 Introduction

Studying the relationship between two or more sets of variables is one of the main activities in data analysis. This chapter mainly deals with modelling the linear relationship between two sets of multivariate data. One set holds the dependent variables (or responses), the other set holds the independent variables (or predictors). However, we will also consider cases where such a distinction cannot be made and the two data sets have the same status. Each set is in the usual objects × measurements format. There is a choice of techniques for estimating the model, all closely related to multiple linear regression (see Chapter 10). Roughly, the model found can be used in two ways. One usage is for a better understanding of the system under investigation by an interpretation of the model results. The other usage is for the future prediction of the dependent variables from new measurements on the predictor variables. Examples of problems amenable to such multivariate modelling are legion, e.g. relating chemical composition to spectroscopic or chromatographic measurements in analytical chemistry, studying the effect of structural properties of chemical compounds, e.g. drug molecules, on functional behaviour in pharmacology or molecular biology, linking flavour composition and sensory properties in food research, or modelling the relation between process conditions and product properties in manufacturing. The present chapter provides an overview of the wide range of techniques that are available to tackle the problem of relating two sets of multivariate data. Different techniques meet specific objectives: simply identifying strong correlations, matching two multi-dimensional point configurations, analyzing the effects of experimental factors on a set of responses, multivariate calibration, predictive modelling, etc. It is important to distinguish the properties of these techniques in order to make a balanced choice. As an example consider the data presented in Tables 35.1-35.4. These tables are extracted from a much larger data base obtained in an international cooperative study on the sensory aspects of olive oils [1]. Table 35.1 gives the mean scores for 16 samples of olive oil with respect to six appearance attributes given by a Dutch sensory panel. Table 35.2 gives similar scores for the same samples as judged by a
TABLE 35.1
Olive oils: mean scores for appearance attributes from Dutch sensory panel

Sample  ID   Yellow  Green  Brown  Glossy  Transp  Syrup
1       G11  21.4    73.4   10.1   79.7    75.2    50.3
2       G12  23.4    66.3   9.8    77.8    68.7    51.7
3       G12  32.7    53.5   8.7    82.3    83.2    45.4
4       G13  30.2    58.3   12.2   81.1    77.1    47.8
5       G22  51.8    32.5   8.0    72.4    65.3    46.5
6       I31  40.7    42.9   20.1   67.7    63.5    52.2
7       I32  53.8    30.4   11.5   77.8    77.3    45.2
8       I32  26.4    66.5   14.2   78.7    74.6    51.8
9       I33  65.7    12.1   10.3   81.6    79.6    48.3
10      I42  45.0    31.9   28.4   75.7    72.9    52.8
11      S51  70.9    12.2   10.8   87.7    88.1    44.5
12      S52  73.5    9.7    8.3    89.9    89.7    42.3
13      S53  68.1    12.0   10.8   78.4    75.1    46.4
14      S61  67.6    13.9   11.9   84.6    83.8    48.5
15      S62  71.4    10.6   10.8   88.1    88.5    46.7
16      S63  71.4    10.0   11.4   89.5    88.5    47.2

Mean         50.9    33.5   12.3   80.8    78.2    48.0
Std. dev.    19.5    23.5   5.1    6.2     8.3     3.1
British panel. Note that the sensory attributes are to some extent different. Table 35.3 gives some information on the country of origin and the state of ripeness of the olives. Finally, Table 35.4 gives some physico-chemical data on the same samples that are related to the quality indices of olive oils: acid and peroxide level, UV absorbance at 232 nm and 270 nm, and the difference in absorbance at wavelength 270 nm and the average absorbance at 266 nm and 274 nm. Given these tables of multivariate data one might be interested in various relationships. For example, do the two panels have a similar perception of the different olive oils (Tables 35.1 and 35.2)? Are the oils more or less similarly scattered in the two multidimensional spaces formed by the Dutch and by the British attributes? How are the two sets of sensory attributes related? Does the
TABLE 35.2
Olive oils: mean scores of appearance attributes from British sensory panel

Sample  ID   Bright  Depth  Yellow  Brown  Green
1       G11  33.2    76.8   24.4    50.9   56.8
2       G12  40.9    76.7   28.3    39.4   61.4
3       G12  44.1    70.0   33.6    35.9   52.4
4       G13  51.4    65.0   37.1    28.3   52.1
5       G22  63.6    47.2   58.1    17.9   36.9
6       I31  42.4    67.3   41.6    41.1   34.7
7       I32  60.6    51.1   58.0    20.3   33.5
8       I32  71.7    42.7   69.9    17.7   21.6
9       I33  41.7    74.7   28.1    42.8   51.9
10      I42  48.3    68.7   44.7    57.4   16.4
11      S51  78.6    34.3   82.5    9.4    18.7
12      S52  84.8    25.0   85.9    3.1    16.2
13      S53  85.3    26.3   86.7    2.3    17.9
14      S61  81.4    34.5   80.2    8.3    18.2
15      S62  88.4    27.7   87.4    4.7    14.7
16      S63  88.4    29.7   86.8    3.4    16.3

Mean         62.8    51.1   58.3    23.9   32.5
Std. dev.    19.8    19.9   24.4    18.5   17.2
country of origin or the state of ripeness affect the sensory characteristics (Tables 35.1 and 35.3)? Can we possibly predict the sensory properties from the physicochemical measurements (Tables 35.1 and 35.4)? An important aspect of all methods to be discussed concerns the choice of the model complexity, i.e., choosing the right number of factors. This is especially relevant if the relations are developed for predictive purposes. Building validated predictive models for quantitative relations based on multiple predictors is known as multivariate calibration. The latter subject is of such importance in chemometrics that it will be treated separately in the next chapter (Chapter 36). The techniques considered in this chapter comprise Procrustes analysis (Section 35.2), canonical correlation analysis (Section 35.3), multivariate linear regression
TABLE 35.3
Country of origin and state of ripeness of the 16 olive oils. The last 4 columns contain the same information in the form of a coded design matrix

Sample  ID   Country  Ripeness  Greece  Spain  Unripe  Overripe
1       G11  Greece   unripe    1       0      1       0
2       G12  Greece   normal    1       0      0       0
3       G12  Greece   normal    1       0      0       0
4       G13  Greece   overripe  1       0      0       1
5       G22  Greece   normal    1       0      0       0
6       I31  Italy    unripe    0       0      1       0
7       I32  Italy    normal    0       0      0       0
8       I32  Italy    normal    0       0      0       0
9       I33  Italy    overripe  0       0      0       1
10      I42  Italy    normal    0       0      0       0
11      S51  Spain    unripe    0       1      1       0
12      S52  Spain    normal    0       1      0       0
13      S53  Spain    overripe  0       1      0       1
14      S61  Spain    unripe    0       1      1       0
15      S62  Spain    normal    0       1      0       0
16      S63  Spain    overripe  0       1      0       1
(Section 35.4), reduced rank regression (Section 35.5), principal components regression (Section 35.6), partial least squares regression (Section 35.7) and continuum regression methods (Section 35.8).
35.2 Procrustes analysis

35.2.1 Introduction

Procrustes analysis is a method for relating two sets of multivariate observations, say X and Y. For example, one may wish to compare the results in Table 35.1 and Table 35.2 in order to find out to what extent the results from both panels agree, e.g., regarding the similarity of certain olive oils and the dissimilarity of others. Procrustes analysis has a strong geometric interpretation. The
TABLE 35.4
Physico-chemical quality parameters of the 16 olive oils

Sample  ID   Acidity  Peroxide  K232   K270   DK
1       G11  0.73     12.70     1.900  0.139  0.003
2       G12  0.19     12.30     1.678  0.116  -0.004
3       G12  0.26     10.30     1.629  0.116  -0.005
4       G13  0.67     13.70     1.701  0.168  -0.002
5       G22  0.52     11.20     1.539  0.119  -0.001
6       I31  0.26     18.70     2.117  0.142  0.001
7       I32  0.24     15.30     1.891  0.116  0.000
8       I32  0.30     18.50     1.908  0.125  0.001
9       I33  0.35     15.60     1.824  0.104  0.000
10      I42  0.19     19.40     2.222  0.158  -0.003
11      S51  0.15     10.50     1.522  0.116  -0.004
12      S52  0.16     8.14      1.527  0.103  -0.002
13      S53  0.27     12.50     1.555  0.096  -0.002
14      S61  0.16     11.00     1.573  0.094  -0.003
15      S62  0.24     10.80     1.331  0.085  -0.003
16      S63  0.30     11.40     1.415  0.093  -0.004

Mean         0.31     13.25     1.709  0.118  -0.002
Std. dev.    0.18     3.35      0.249  0.024  0.002
observations (objects) are envisioned as points in a high-dimensional variable space. The objective is to find the transformation such that the configuration of points in X-space best matches the corresponding point configuration in Y-space. Not all transformations are allowed: the internal configuration of the objects should be preserved. Procrustes analysis treats the two data sets symmetrically: there is no essential difference between either transforming Y to match X or applying the reverse transformation to X so that it best matches Y. One may also apply a transformation to each so that they meet halfway. In the sequel we consider the transformation of X to the target Y. We will assume that X and Y have the same number of variables. If this condition is not met one is at liberty to add the required number of columns, with zeros as entries, to the smaller data set (so-called "zero padding").
Fig. 35.1. The stages of Procrustes analysis, illustrated with the Great Bear and Little Bear constellations: (a) original configurations; (b) translation; (c) reflection; (d) rotation; (e) overall rotation (PCA); (f) averaging to a consensus configuration.
We will explain the mechanics of Procrustes analysis by optimal matching of the two stellar configurations Great Bear and Little Bear. For ease of presentation we work with the 2D-configuration as we see it from the earth (Fig. 35.1a) and we ignore that the actual configuration is 3-dimensional. First X and Y are column mean-centred, so that their centroids m_X = X^T 1/n and m_Y = Y^T 1/n are moved to the origin (Fig. 35.1b, translation step). This column centering is an admissible transformation since it does not alter the distances between objects within each
data set. The next step is a reflection (Fig. 35.1c). Again this is a transformation which leaves distances between objects unaltered. The following step is a rotation, which changes the orientation, but not the internal structure of the configuration (Fig. 35.1d). When this best match is found one is at liberty to rotate all configurations equally. This will not affect the match but it may yield an overall orientation that is more appealing (Fig. 35.1e). Finally, by taking the mean position for each star, one obtains an average configuration, often called the consensus, that is representative for the two separate configurations (Fig. 35.1f). The major problem is to find the rotation/reflection which gives the best match between the two centered configurations. Mathematically, rotations and reflections are both described by orthogonal transformations (see Section 29.8). These are linear transformations with an orthonormal matrix (see Section 29.4), i.e. a square matrix R satisfying R^T R = R R^T = I, or R^T = R^(-1). When its determinant is positive R represents a pure rotation; when the determinant is negative R also involves a reflection. The best match is defined as the one which minimizes the sum of squared distances between the transformed X-objects and the corresponding objects in the target configuration given by Y. The Procrustes problem then is equivalent to minimizing the sum of squares of the deviations matrix E = Y - XR, assuming both X and Y have been column mean-centered. This looks like a straightforward least-squares regression problem, Y = XR + E, but it is not, since R is restricted to be an orthogonal rotation/reflection matrix. Using a shorthand notation for a matrix sum of squares, ||E||^2 = Σ_i Σ_j e_ij^2 = tr(E^T E), we may state the Procrustes optimization problem as:

min_R ||Y - XR||^2   subject to   R^T R = R R^T = I     (35.1)
Using elementary properties of the trace of a matrix (viz. tr(A + B) = tr(A) + tr(B) and tr(AB) = tr(BA), see Section 29.4) we may write:

||Y - XR||^2 = tr((Y - XR)^T (Y - XR)) = tr(Y^T Y - R^T X^T Y - Y^T X R + R^T X^T X R)
             = tr(Y^T Y) - 2 tr(Y^T X R) + tr(R^T X^T X R)     (35.2)
The first term on the right-hand side represents the total sum of squares of Y, which obviously does not depend on R. Likewise, the last term represents the total sum of squares of the transformed X-configuration, viz. XR. Since the rotation/reflection given by R does not affect the distance of an object from the origin, the total sum of squares is invariant under the orthogonal transformation R. (This also follows from tr(R^T X^T X R) = tr(X^T X R R^T) = tr(X^T X I) = tr(X^T X).) The only term in eq. (35.2) that depends on R, then, is tr(Y^T X R), which we must seek to maximize.
Let the SVD of Y^T X be given by Y^T X = Q D W^T, with D being the diagonal matrix of singular values. The properties of singular value decomposition (SVD, Section 29.6) tell us that, among all possible orthonormal matrices, Q and W are the ones that maximize tr(Q^T Y^T X W). Since tr(Q^T Y^T X W) = tr(Y^T X W Q^T), it follows that R = W Q^T is the rotation/reflection which maximizes tr(Y^T X R), and hence minimizes the squared distance ||Y - XR||^2 (eq. 35.2) between X and Y. Given this optimal Procrustes rotation applied to X, one may compute an average configuration Z as (Y + XR)/2. Usually, this is followed by a principal component analysis (Section 31.1) of the average Z. The rotation matrix V, obtained as the matrix of eigenvectors of Z^T Z, is then applied to each of Y, XR and Z. It must be emphasized that Procrustes analysis is not a regression technique. It only involves the allowed operations of translation, rotation and reflection, which preserve distances between objects. Regression allows any linear transformation; there is no normality or orthogonality restriction on the columns of the matrix B transforming X. Because such restrictions are released in a regression setting, Y = XB will fit Y more closely than the Procrustes match Y = XR (see Section 35.3).

35.2.2 Algorithm

Summarizing, the Procrustes matching problem for two configurations X and Y can be solved with the following algorithm:

Column-centering:   X ← X - 1 m_X^T,  Y ← Y - 1 m_Y^T
SVD:                Y^T X = Q D W^T
Rotate X:           R = W Q^T, giving XR
Average:            Z = (Y + XR)/2
PCA of Z:           V = eigenvectors of Z^T Z
Final rotation:     YV, XRV, ZV
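To make the algorithm concrete, here is a minimal NumPy sketch (our own illustration, not from the text), assuming two configurations with the same number of columns, zero-padded if necessary:

import numpy as np

def procrustes_match(X, Y):
    # Translation step: column-centre both configurations
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # SVD of the cross-product matrix: Y^T X = Q D W^T
    Q, D, Wt = np.linalg.svd(Yc.T @ Xc)
    R = Wt.T @ Q.T                 # optimal rotation/reflection R = W Q^T
    XR = Xc @ R                    # rotate X towards the target Y
    Z = (Yc + XR) / 2.0            # average (consensus) configuration
    V = np.linalg.svd(Z, full_matrices=False)[2].T   # PCA of the consensus
    return XR @ V, Yc @ V, Z @ V   # final overall rotation

The residual sum of squares ||Y - XR||^2, computed as np.sum((Yc - XR)**2), measures how well the two matched configurations agree.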
35.3 Canonical correlation analysis

35.3.2 Algorithm

The canonical scores, weights and loadings can be computed from three singular value decompositions:

X = U_X S_X V_X^T                    <SVD of X>                    (35.5a)
Y = U_Y S_Y V_Y^T                    <SVD of Y>                    (35.5b)
R = U_X^T U_Y                        <correlation matrix of PCs>   (35.6)
R = W* D Q*^T                        <SVD of R>
T = (n-1)^(1/2) U_X W*               <X canonical scores>          (35.7a)
U = (n-1)^(1/2) U_Y Q*               <Y canonical scores>          (35.7b)
W = (n-1)^(1/2) V_X S_X^(-1) W*      <X canonical weights>         (35.8a)
Q = (n-1)^(1/2) V_Y S_Y^(-1) Q*      <Y canonical weights>         (35.8b)
T = XW                               <X canonical scores>          (35.9a)
U = YQ                               <Y canonical scores>          (35.9b)
P = X^T T (T^T T)^(-1)               <X canonical structure>       (35.10a)
C = Y^T U (U^T U)^(-1)               <Y canonical structure>       (35.10b)
Equations (35.5a) and (35.5b) represent the singular value decomposition of the original data tables, giving the new sets, U_X and U_Y, of unit-length orthogonal (orthonormal) variables. From these the matrix R is calculated as U_X^T U_Y (eq. 35.6). R is the correlation matrix between the principal components of X and those of Y, because of the equivalence of PCs and (left) singular vectors. Singular value decomposition of R yields the canonical weight vectors W* and Q* applicable to U_X and U_Y, respectively. The singular values obtained are equal to the canonical correlations ρ_k. Instead of a single SVD of R one may apply a spectral decomposition (Section 29.6) of R R^T, giving the eigenvectors W*, and a spectral decomposition of R^T R, giving the eigenvectors Q*, the eigenvalues corresponding to the squared canonical correlations. The canonical variables are now obtained as in eqs. (35.7a,b). The factor (n-1)^(1/2) is included to ensure that the canonical variables have unit variance. Back-transformation to the centred X- and Y-variables yields the sets of canonical weights collected in the matrices W and Q, respectively (eqs. 35.8a,b). Applying these weights to the original variables again yields the canonical variables (eqs. 35.9a,b). Regressing the X-variables and Y-variables on their corresponding canonical variables gives the loading matrices P and C (eqs. 35.10a,b), which appear in the canonical decomposition: X = TP^T = T(T^T T)^(-1) T^T X and Y = UC^T = U(U^T U)^(-1) U^T Y. The loadings, defining the original mean-centred variables in terms of the orthogonal canonical variables, are better suited for interpretation than the weights. Each row of P (or C) corresponds to a variable and tells how much each canonical variable contributes to (or "loads" on) this variable. In case the X-variables and the Y-variables are also scaled to unit variance, P and C contain the intra-set correlations between the original variables and the canonical variables (so-called structure correlations; see Table 35.6). It should be appreciated that canonical correlation analysis, as the name implies, is about correlation, not about variance. The first step in the algorithm is to move from the original data matrices, X and Y, to their singular vectors, U_X and U_Y, respectively. The singular values, i.e. the variances of the PCs of X and Y, play no role.
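The algorithm translates directly into NumPy. The sketch below (our own; the function name is not from the text) assumes column-centred X and Y with more objects than variables, so that S_X and S_Y are invertible:

import numpy as np

def cca_svd(X, Y):
    n = X.shape[0]
    Ux, Sx, Vxt = np.linalg.svd(X, full_matrices=False)    # (35.5a)
    Uy, Sy, Vyt = np.linalg.svd(Y, full_matrices=False)    # (35.5b)
    R = Ux.T @ Uy                                          # (35.6)
    Wstar, rho, Qstar_t = np.linalg.svd(R)                 # canonical correlations rho
    f = np.sqrt(n - 1)
    T = f * Ux @ Wstar                                     # (35.7a)
    U = f * Uy @ Qstar_t.T                                 # (35.7b)
    W = f * Vxt.T @ np.diag(1.0 / Sx) @ Wstar              # (35.8a)
    Q = f * Vyt.T @ np.diag(1.0 / Sy) @ Qstar_t.T          # (35.8b)
    P = X.T @ T @ np.linalg.inv(T.T @ T)                   # (35.10a)
    C = Y.T @ U @ np.linalg.inv(U.T @ U)                   # (35.10b)
    return T, U, W, Q, P, C, rho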
35.3.3 Discussion

Let us take a closer look at the analysis of the data of Table 35.5. In Table 35.6 we summarize the correlations of the canonical variates and also their correlations with the original variables. The high value of the first canonical correlation (ρ_1 = 0.95) suggests a strong relationship between the two data sets. However, the canonical variables t_1 and u_1 are only strongly related to each other, not to the original variables (Table 35.6). On the other hand, the second pair of canonical variates, t_2 and u_2, are strongly related to their original variables, but not to each other (ρ_2 = 0.55). Thus, the analysis yields a pair of strongly linked, but uninteresting, factors and a pair of more interesting factors which, however, are only weakly related. A major limitation of CCA has thus already become apparent in the example shown. There is no guarantee that the most important canonical variable t_1 (or u_1) is highly correlated to any of the individual variables of X (or Y). It is possible then for the first canonical variable t_1 of X to be strongly correlated with u_1, yet to have very little predictive value for Y. In terms of principal components
TABLE 35.6
Canonical structure: correlations between the original variables (x, y) and their canonical variates (t, u)

         t_1       t_2       u_1       u_2
x_1      0.0476    0.9989    0.0451    0.5522
x_2      0.5477    0.8367    0.5187    0.4625
y_1     -0.0042   -0.5528   -0.0045   -1.0000
y_2      0.3532    0.5129    0.3729    0.9279
t_1      1.0000    0.0000    0.9470    0.0000
t_2      0.0000    1.0000    0.0000    0.5528
u_1      0.9470    0.0000    1.0000    0.0000
u_2      0.0000    0.5528    0.0000    1.0000
(Chapter 17): only the minor principal components of X and Y happen to be highly correlated. It is questionable whether such a high correlation is then of much interest. This dilemma of choosing between high correlation and large variance presents a major problem when analyzing the relation between measurement tables. The regression techniques treated further on in this chapter address this dilemma in different ways. A second limitation of CCA is that it cannot deal in a meaningful way with data tables in 'landscape mode', i.e. wide data tables having more variables than objects. This severely limits the importance of CCA as a general tool for multivariate data analysis in chemometrics, e.g. when X represents spectral data. As the name implies, CCA analyses correlations. It is therefore insensitive to any rescaling of the original variables. This advantage is not shared by most other techniques discussed in this chapter. As in Procrustes analysis, X and Y play entirely equivalent roles in canonical correlation analysis: there is no distinction in terms of dependent variables (or responses) versus independent variables (or predictors, regressors, etc.). This situation is fairly uncommon. Usually, the X and Y data are of a different nature and one is interested in understanding one set of data, say Y, in the light of the information contained in the other data set, X. Rather than exploring correlations in a symmetric X-Y relation, one is searching for an asymmetric regression relation X → Y explaining the dependent Y-variables from the predictor X-variables. Thus, the symmetrical nature of CCA limits its practical importance. In the following sections we will discuss various asymmetric regression methods where the goal is to fit the matrix of dependent variables Y by linear combination(s) of the predictor variables X.
35.4 Multivariate least squares regression

35.4.1 Introduction

In this section we will distinguish multivariate regression from multiple regression. The former deals with a multivariate response (Y), the latter with the use of multiple predictors (X). When studying the relation between two multivariate data sets via regression analysis we are therefore dealing with multivariate multiple regression. Perhaps the simplest approach to studying the relation between two multivariate data sets X and Y is to perform for each individual univariate variable y_k (k = 1, ..., m) a separate multiple (i.e. two or more predictor variables) regression on the X-variables. The obvious advantage is that the whole analysis can be done with standard multiple regression programs. A drawback of this approach of many isolated regressions is that it does not exploit the multivariate nature of Y, viz. the interdependence of the Y-variables. Genuine multivariate analysis of a data table Y in relation to a data table X should be more than just a collection of univariate analyses of the individual columns of Y! One might suspect that fitting all Y-variables simultaneously, i.e. in one overall multivariate regression, might make a difference for the regression model. This is not the case, however. To see this, let us state the multivariate (i.e. two or more dependent variables) regression model as:

[y_1, y_2, ..., y_m] = X [b_1, b_2, ..., b_m] + [e_1, e_2, ..., e_m]     (35.11)

which shows explicitly the various responses, y_k (k = 1, 2, ..., m), as well as the vector of regression coefficients b_k and the residual vector e_k corresponding to each response. This model may be written more compactly as:

Y = XB + E     (35.12)

where Y is the n×m data set of responses, X the n×p data set of regressors, B the p×m matrix of regression coefficients and E the n×m error matrix. Each column of Y, B and E corresponds to one of the m responses, each column of X and each row of B to one of the p predictor variables, and each row of Y, X and E to one of the n observations. The total residual sum of squares, taken over all elements of E, achieves its minimum when each column e_k separately has minimum sum of squares. The latter occurs if each (univariate) column of Y is fitted by X in the least-squares way. Consequently, the least-squares minimization of E is obtained if each separate dependent variable is fitted by multiple regression on X. In other words: the multivariate regression analysis is essentially identical to a set of univariate regressions. Thus, from a methodological point of view nothing new is added and we may refer to Chapter 10 for a more thorough discussion of theory and application of multiple regression.
35.4.2 Algorithm

The solution for the regression parameters can be adapted in a straightforward manner from eq. (10.6), viz. b = (X^T X)^(-1) X^T y, giving:

B = (X^T X)^(-1) X^T Y     (35.13)

Ŷ = XB = X (X^T X)^(-1) X^T Y     (35.14)
In eqs. (35.13) and (35.14) X may include a column of ones, when an intercept has to be fitted for each response, giving a ((p+1)×m) matrix B. Otherwise, X and Y are supposed to be mean-centered, and the (p×m) matrix B does not contain a row of intercepts. The geometric meaning of eq. (35.14) is that the best fit is obtained by projecting all responses orthogonally onto the space defined by the columns of X, using the orthogonal projection matrix X(X^T X)^(-1) X^T (see Section 29.8).

35.4.3 Discussion

A major drawback of the approach is felt when the number of dependent variables, m, is large. In that case there is an equally large number of separate analyses and the combined results may be hard to summarize. When the number of predictor variables, p, is very large, e.g. when X represents spectral intensities at many wavelengths, there is also a problem. In that case X^T X cannot be inverted and there is no unique solution for B. Both in the case of large m and in the case of large p some kind of dimension reduction is called for. We will therefore not discuss the multivariate regression approach further, since this chapter focuses on truly multivariate methods, taking the joint variation of variables into account. All other methods discussed in this chapter provide such a dimension reduction. They search for the most "interesting" directions in Y-space and/or "interesting" directions in X-space that are linearly related. They differ in the optimizing criterion that is used to discover those interesting directions.
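As a minimal illustration (our own sketch), the multivariate solution of eq. (35.13) is computed in one call and is, column for column, identical to the m separate multiple regressions:

import numpy as np

def mlr_multivariate(X, Y):
    # B = (X^T X)^(-1) X^T Y (eq. 35.13); assumes mean-centred X of
    # full column rank (n x p) and Y (n x m)
    B = np.linalg.solve(X.T @ X, X.T @ Y)
    Yfit = X @ B            # eq. (35.14)
    return B, Yfit

# Each column of B equals the corresponding univariate solution:
# np.allclose(B[:, k], np.linalg.solve(X.T @ X, X.T @ Y[:, k])) is True.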
35.5 Reduced rank regression

35.5.1 Introduction

Reduced rank regression (RRR), also known as redundancy analysis (or PCA on Instrumental Variables), is the combination of multivariate least squares regression and dimension reduction [7]. The idea is that more often than not the dependent Y-variables will be correlated. A principal component analysis of Y might indicate that A (A << m) PCs may explain Y adequately. Thus, a full set of m
separate multiple regressions, as in unconstrained multivariate regression (Section 35.4), contains a fair amount of redundancy. To illustrate this we may look for A particular linear combinations of X-variables that explain most of the total variation contained in Y. For simplicity let us start with A = 1. When the Y-variables have equal variance, this boils down to finding a single component in X-space, say t_1 = Xw_1, that maximizes the average R^2. This average R^2 is the mean of the individual R^2-values resulting from all regressions of the individual Y-variables on the X-component t_1. Since now all Y-variables are estimated by the same regressor t_1, all fitted Y-variables are proportional to this predictor and, consequently, they are all perfectly correlated. In other words, the rank of the fitted Y-matrix is, of necessity, 1. Hence the name reduced rank regression. Of course, this rank-1 restriction may severely affect the quality of the fit when the effective dimensionality A of Y is larger, 1 < A < m. Thus, we may look for a second linear combination of X-variables, t_2 = Xw_2, orthogonal to t_1, such that the multivariate regression of Y on t_1 and t_2 further maximizes the amount of variance explained. This process may be continued until Y can be sufficiently well approximated by regression on a limited set of X-components, T = [t_1, t_2, ..., t_A]. Since each Y-variable is fitted by a linear combination of the A X-components, each X-component itself being a linear combination of the predictor variables, the Y-variables can finally be expressed as a linear combination of the X-variables. It should be noted that when the same number of X-components is used as there are Y-variables, i.e. A = m, we can no longer speak of reduced rank regression. The solution then becomes entirely equivalent to unconstrained multivariate regression. The question of how many components to include in the final model forms a rather general problem that also occurs with the other techniques discussed in this chapter. We will discuss this important issue in the chapter on multivariate calibration. An alternative and illuminating explanation of reduced rank regression is through a principal component analysis of Ŷ, the set of fitted Y-variables resulting from an unrestricted multivariate multiple regression. This interpretation reveals the two least-squares approximations involved: projection (regression) of Y onto X, followed by a further projection (PCA) onto a lower-dimensional subspace.

35.5.2 Algorithm

The interpretation also suggests the following simple computational implementation of reduced rank regression.

Step 1. Multivariate least squares regression of Y on X (compare Section 35.4):

Ŷ = X (X^T X)^(-1) X^T Y     (35.15)
Equation (35.15) represents the projection of each Y-variable onto the space spanned by the X-variables, i.e. each Y-variable is replaced by its fit from multiple regression on X.

Step 2. Next one applies an SVD (or PCA) to the centered fit Ŷ, denoted as Ŷ* (= Ŷ - 1 m_Ŷ^T):

Ŷ* = U S V^T     (35.16)

Step 3. Dimension (rank) reduction by only retaining the A major components to approximate Ŷ*. This gives the RRR fit:

Ŷ*_[A] = U_[A] S_[A] V_[A]^T     (35.17)

Step 4. The RRR model coefficients are then found by a multivariate linear regression of the RRR fit, Ŷ_[A] (= Ŷ*_[A] + 1 m_Ŷ^T), on the original X, which should include a column of ones:

B_RRR = (X^T X)^(-1) X^T (Ŷ*_[A] + 1 m_Ŷ^T)     (35.18)
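The four steps translate into a short NumPy sketch (our own illustration), assuming X already carries a leading column of ones:

import numpy as np

def reduced_rank_regression(X, Y, A):
    # Step 1: unrestricted multivariate least-squares fit (35.15)
    Yhat = X @ np.linalg.solve(X.T @ X, X.T @ Y)
    # Step 2: SVD of the column-centred fit (35.16)
    my = Yhat.mean(axis=0)
    U, S, Vt = np.linalg.svd(Yhat - my, full_matrices=False)
    # Step 3: retain only the A major components (35.17)
    Yrrr = U[:, :A] @ np.diag(S[:A]) @ Vt[:A, :] + my
    # Step 4: regress the rank-A fit back onto X (35.18)
    B = np.linalg.solve(X.T @ X, X.T @ Yrrr)
    return B, Yrrr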
35.5.3 Discussion

A major difference between reduced rank regression and canonical correlation analysis or Procrustes analysis is that RRR is a regression technique, with different roles for Y and X. It is an appropriate method for the simultaneous prediction of many correlated Y-variables from a common set of X-variables through a few X-components. Since reduced rank regression involves a PCA of Ŷ, its solution depends on the choice of scale for the Y-variables. It does not depend on the scaling of the X-variables. The reduction to a few factors may help to prevent overfitting and in this manner it stabilizes the estimation of the regression coefficients. However, the most important factor determining the robustness of any regression solution is the design of the regressor data. When the X-variables are highly correlated we still have no guarantee that unstable minor X-factors are avoided in the regression. In that case, and certainly when X is not of full rank, one may consider basing the regression on all but the smallest principal components of X. The ill-conditioning problem does not occur in the following example.

35.5.4 Example

Let us try to relate the (standardized) sensory data in Table 35.1 to the explanatory variables in Table 35.3. Essentially, this is an analysis-of-variance problem. We try to explain the effects of two qualitative factors, viz. Country and Ripeness, on the sensory responses. Each factor has three levels: Country = {Greece, Italy,
Spain} and Ripeness = {Unripe, Normal, Overripe}. Since not all combinations of the complete 3×3 block design are duplicated, there is some unbalance, making the design only nearly orthogonal. We treat this multivariate ANOVA problem as a regression problem, coding the regressors as indicated in Table 35.3 and omitting the Italy column and Normal column to avoid ill-conditioning of X. Some of the results are collected in Table 35.7. Table 35.7a shows that some sensory attributes can be fitted rather well by the RRR model, especially 'yellow' and 'green' (R^2 ≈ 0.75), whereas for instance 'brown' and 'syrup' do much worse (R^2 ≈ 0.40). These fits are based on the first two PCs of the least-squares fit (Ŷ). The PCA on the OLS predictions showed the 2-dimensional approximation to be very good, accounting for 99.2% of the total variation of Ŷ. The table shows the PC weights of the (fitted) sensory variables. Particularly the attribute 'brown', and to a lesser extent 'syrup', stands out as being different and as the main contributor to the second dimension.

TABLE 35.7
(a) Basic results of the reduced rank regression analysis. The columns PC1 and PC2 give the weights of the PCA model of Ŷ (the OLS fitted Y). The columns R^2 (in %) show how well Ŷ and Y are fitted by the first two principal components of Ŷ.

Y-variable     PC1(Ŷ)   PC2(Ŷ)   R^2(Ŷ) %   R^2(Y) %
Yellow          0.50    -0.35     99.9       77
Transp          0.44    +0.06     99.2       53
Glossy          0.44    +0.24     99.9       58
Green          -0.47    +0.40     99.6       73
Brown          -0.16    -0.73     98.2       41
Syrup          -0.34    -0.35     97.3       41
Overall R^2:   80.8     18.4      99.2       57
(b) The columns PC1 and PC2 give the X-weights of the PCA model of Ŷ (the OLS fitted Y). The columns R^2 (in %) show how well the X-variables are fitted by the first two principal components of Ŷ.

X-variable    PC1      PC2      R^2 %
Intercept    -0.97    -0.86     -
Greece       -0.17    +1.91     99
Spain        +3.23    +1.03     94
Unripe       -0.88    -0.35     4
Overripe     +0.14    -0.13     7
The two principal axes can also be defined as linear combinations of the explanatory variables. This is given in Table 35.7b. The larger coefficients for the Country variables when regressing the PCs on the four predictor variables show that the country of origin is strongly related to the most predictive principal dimensions and that the state of ripeness is not. This also appears from the fact that the Country variables can be fitted very well (high R^2) by the first two PCs, in contrast to the low R^2 values for the Ripeness variables. In other words, the country of origin is the dominant factor affecting the appearance of olive oils, whereas the state of ripeness has little effect. The first predictive dimension mainly represents a contrast between the olive oils of Spanish versus non-Spanish origin and, to a much lesser extent, a contrast between unripe versus the (over)ripe olives. The second predictive factor, which is mostly used to fit the 'brown' sensory attribute, represents a contrast between Italy and Greece, with Spain in the middle. Fig. 35.5 summarizes the relationships between samples (objects), predictor variables and dependent variables. The objects are plotted as standardized scores (first two columns of (n-1)^(1/2) U), the variables as loading vectors, taken from X^T U and Y^T U, respectively, scaled to fit on the graph. For a thorough treatment of biplotting the results of reduced rank multivariate regression models, see Ref. [8]. By combining the coefficients in the two parts of the table one can express each sensory attribute in terms of the explanatory factors. Note that the above regression
Fig. 35.5. Biplot of reduced rank regression model showing objects, predictors and responses.
model is defined in terms of binary regressor variables, which indicate the presence or absence of a condition. Italian olive oils, for example, are defined as not Greek and not Spanish, and the variables indicating the country of origin, 'Greece' and 'Spain', are both set to 0. For example:

Yellow = 0.50*PC1 - 0.35*PC2
       = 0.50*(-0.97 - 0.17*'Greece' + ...) - 0.35*(-0.86 + 1.91*'Greece' + ...)
       = -0.22 - 0.74*'Greece' + 1.25*'Spain' - 0.26*'Unripe' + 0.19*'Overripe'

For an unripe Spanish olive oil this works out as: Yellow = -0.22 - 0.74*0 + 1.25*1 - 0.26*1 + 0.19*0 = 0.77. Since the sensory data were standardized, one needs to multiply by the standard deviation (19.5) and to add the average (50.9) to arrive at a prediction in original units, viz. 50.9 + 0.77*19.5 ≈ 66.
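For readers who want to check the arithmetic, a few lines of Python reproduce the worked example (the coefficients are those derived above; the variable names are ours):

# Reduced rank model for the standardized 'yellow' attribute
coefs = {"const": -0.22, "Greece": -0.74, "Spain": 1.25,
         "Unripe": -0.26, "Overripe": 0.19}
sample = {"Greece": 0, "Spain": 1, "Unripe": 1, "Overripe": 0}  # unripe Spanish oil
z = coefs["const"] + sum(coefs[k] * sample[k] for k in sample)
print(round(z, 2))             # 0.77, in standardized units
print(round(50.9 + z * 19.5))  # about 66, in the original units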
35.6 Principal components regression

35.6.1 Introduction

In principal components regression (PCR) first a principal component analysis (Chapters 17 and 31) is performed on X, then the Y-variables are regressed on these PCs of X. PCR thus also combines the two steps of regression and dimension reduction. Compared with reduced rank regression, the order of these two basic steps is reversed. The major difference, however, is that the dimension reduction pertains to the predictor set X, and not to the dependent variables. In PCR, therefore, the definition of the X-components is determined prior to the regression analysis, the Y-variables not playing a role at this stage. As in the other approaches, PCR modelling proceeds factor by factor, the number of factors A to be determined by some model validation procedure (Chapter 36 on Multivariate Calibration).

35.6.2 Algorithm

The computational implementation of principal components regression is very straightforward.

Step 1. First, carry out an SVD (or PCA) on centered X:

X = U S V^T

Step 2. Multivariate least squares regression of Y on the major A principal components, using either the unit-norm singular vectors U_[A] or the principal components T_[A] = X V_[A] = U_[A] S_[A]:
Ŷ = U_[A] U_[A]^T Y = T_[A] (T_[A]^T T_[A])^(-1) T_[A]^T Y

The equation represents the projection of each Y-variable onto the space spanned by the first A PCs of X.

Step 3. The PCR model coefficient matrix, the p×m matrix B_PCR, can be obtained in a variety of equivalent ways:

B_PCR = V_[A] S_[A]^(-1) U_[A]^T Y = V_[A] (T_[A]^T T_[A])^(-1) T_[A]^T Y

The vector of intercepts is obtained as: b_0 = m_Y - B_PCR^T m_X.

35.6.3 Discussion

The PCR approach has many attractive features. First of all there is the aspect of a prior dimension reduction of the data set (measurement table) X. Using PCA this is done in such a way as to maintain the maximum amount of information. The neglected minor components are supposed to contain noise that is in no way relevant for the relation with Y. Another advantage is that the principal components are orthogonal (uncorrelated). This greatly simplifies the multiple regression of the Y-variables, allowing the effect of the individual principal components to be assessed independently. The chief advantage is that the major principal components have, by definition, large variance. This leads to a stable regression, as the variance of an estimated regression coefficient is inversely proportional to the variance of the regressor (s_b^2 = s_e^2 / Σ(x_i - x̄)^2; see Section 8.2.4.1). The orthogonality of the principal components has the advantage that the effects of the various PCs are estimated independently: multiple regression becomes equivalent to a sequence of separate regressions of the response(s) on the individual PCs. The fact that the X-components are chosen on the basis of representing X rather than Y does not only have advantages. It also gives rise to a major concern. What if a minor component happens to be important for the regression? And what is the use of a major principal component if it is not related to Y? The answer to the latter question is simple: it is of little use, but it does no harm either. The problem of discarding minor X-components that are possibly highly correlated to Y is more severe. One way to address this problem is to include the minor components in the regression if they are really needed. That is, one should go on adding principal components to the regression model until Y is fitted well, provided such a model also passes the (cross-)validation procedure (Section 36.10). Another strategy that is gaining popularity is to enter the principal components in a different order than the standard order of descending variance (PC1, PC2, ...). Rather than this top-down procedure, one may apply variable selection: one starts with the principal component that is most highly correlated with the
responses, then moves to the PC with the second-highest correlation, etc. The only thing that is needed is to compute the correlation coefficients of the PCs with each response. For a univariate y the PCs may then be ranked according to their descending (squared) correlation coefficient. By applying this forward selection procedure one ensures that highly correlating PCs are not overlooked. For multivariate response data Y one should compute for each PC an average index of its importance for all Y-variables together, e.g. the average squared correlation or the total variance explained by that PC (||Ŷ_a||^2 = ||u_a^T Y||^2). Comparative studies have shown that the latter method of PCR frequently performs better, i.e. it gives good predictive models with fewer components [9]. Since principal components regression starts with a PCA of X, its solution depends on the particular scaling chosen for the X-variables. It does not depend on the scaling of the Y-variables. When the maximum number of factors is used, the regression model becomes equivalent to multivariate regression. There is no special multivariate version of principal components regression: each Y-variable is separately regressed on the set of X-components. One might also consider regressing the major PCs of Y or of Ŷ (eq. 35.14) on the PCs of X.
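A compact NumPy sketch of both variants (our own illustration, assuming column-centred X and Y):

import numpy as np

def pcr(X, Y, A):
    U, S, Vt = np.linalg.svd(X, full_matrices=False)     # step 1: PCA of X
    Ua = U[:, :A]
    Yfit = Ua @ (Ua.T @ Y)                               # step 2: project Y onto the first A PCs
    B = Vt[:A, :].T @ np.diag(1.0 / S[:A]) @ (Ua.T @ Y)  # step 3: coefficients
    return B, Yfit

# Correlation-PCR variant for a univariate y: since the columns of U are
# centred and of unit length, the squared correlations are proportional
# to (U.T @ y)**2; enter the PCs in that (descending) order instead of
# by descending variance.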
35.7 Partial least squares regression

The purpose of partial least squares (PLS) regression is to find a small number A of relevant factors that (i) are predictive for Y and (ii) utilize X efficiently. The method effectively achieves a canonical decomposition of X into a set of orthogonal factors which are used for fitting Y. In this respect PLS is comparable with CCA, RRR and PCR, the difference being that the factors are chosen according to yet another criterion. We have seen that PCR and RRR form two extremes, with CCA somewhere in between. RRR emphasizes the fit of Y (criterion i). Thus, in RRR the X-components t_a preferably should correlate highly with the original Y-variables. Whether X itself can be reconstructed ('back-fitted') from such components t_a is of no concern in RRR. With standard PCR, i.e. top-down PCR, the emphasis is initially more on the X-side (criterion ii) than on the Y-side. CCA emphasizes the importance of correlation; whether the canonical variates t and u account for much variance in each respective data set is immaterial. Ideally, of course, one would like to have the best of all three worlds, i.e. when the major principal components of X (as in PCR) and the major principal components of Y (as in RRR) happen to be very similar to the major canonical variables (as in CCA). Is there a way to combine these three desiderata (summary of X, summary of Y and a strong link between the two) into a single criterion and to use this as a basis for a compromise method? The PLS method attempts to do just that.
PLS has been introduced in the chemometrics literature as an algorithm with the claim that it finds simultaneously important and related components of X and of Y. Hence the alternative explanation of the acronym PLS: Projection to Latent Structure. The PLS factors can loosely be seen as modified principal components. The deviation from the PCA factors is needed to improve the correlation at the cost of some decrease in the variance of the factors. The PLS algorithm effectively mixes two PCA computations, one for X and one for Y, using the NIPALS algorithm. It is assumed that X and Y have been column-centred as usual. The basic NIPALS algorithm can best be demonstrated as an easy way to calculate the singular vectors of a matrix, viz. via the simple iterative sequence (see Section 31.4.1):

t = Xw     (35.19)
w ∝ X^T t     (35.20)

for X, and

u = Yq     (35.21)
q ∝ Y^T u     (35.22)

for Y. The ∝-symbol is used here to imply that the resultant vector has to be normalized, i.e. w^T w = q^T q = 1. In eq. (35.19) t represents the regression coefficients of the rows of X regressed on w. Likewise, w in eq. (35.20) is proportional to the vector of regression coefficients obtained by regressing each column (variable) of X on the score vector t. This iterative process of criss-cross regressions is graphically illustrated in Fig. 35.6. Iterating eq. (35.19) and eq. (35.20) leads w to converge to the first eigenvector of X^T X. One may easily verify this by substituting eq. (35.19) into eq. (35.20), which yields w ∝ X^T t ∝ X^T X w, the defining relation for an eigenvector. Similarly, t is proportional to an eigenvector of X X^T. It can be shown that the eigenvectors w and t are the dominant eigenvectors, i.e. the ones corresponding to the largest eigenvalue. Thus, w and (normalized) t form the first pair of singular vectors of X. Likewise, q and (normalized) u are the dominant eigenvectors of Y^T Y and Y Y^T,

Fig. 35.6. Principle of the SVD/NIPALS algorithm.
Fig. 35.7. Principle of the PLS/NIPALS algorithm.
respectively, or the first pair of singular vectors of Y. Once this first pair of singular vectors is determined, one extracts this dimension by fitting t to X (or u to Y) and proceeding with the matrix E_1 (or F_1) of residuals. Using the residual matrix E_1 (or F_1) and the basic NIPALS algorithm one may find the pair of dominant singular vectors, which in fact is the second pair of singular vectors of the starting matrix X (or Y). The process is repeated until the starting matrix is fully depleted. Instead of separately calculating the principal components for each data set, the two iterative sequences are interspersed in the PLS-NIPALS algorithm (see Fig. 35.7):

w ∝ X^T u     (35.23)
t = Xw     (35.19)
q ∝ Y^T t     (35.24)
u = Yq     (35.21)
One starts the iterative process by picking some column of Y for u and then repeating the above steps cyclically until convergence. Upon convergence we have w ∝ X^T u ∝ X^T Y q ∝ X^T Y Y^T t ∝ X^T Y Y^T X w. Thus, w is an eigenvector of X^T Y Y^T X and, similarly, q is an eigenvector of Y^T X X^T Y [10]. These matrices are the two symmetric matrix products, viz. (X^T Y)(X^T Y)^T and (X^T Y)^T (X^T Y), based on the same cross-product matrix X^T Y. Apart from a factor (n - 1), the latter matrix is equal to the matrix of inter-set covariances. Another interpretation of the weight vectors w and q in PLS is therefore as the first pair of singular vectors of the covariance matrix X^T Y. As we found in Chapter 29, this first pair of singular vectors forms the unique pair of normalized weight vectors that maximizes the expression w^T (X^T Y) q = (Xw)^T (Yq) = t^T u. Up to a factor (n - 1), the latter inner product equals the covariance of the two score vectors t = Xw and u = Yq. This then leads to the following important interpretation of the PLS factors: t = Xw and u = Yq are chosen so as to maximize their covariance [10,11].
Let us take a closer look at this covariance criterion. A covariance involves three terms (see Section 8.3):

cov(t, u) = s_t s_u r_tu     (35.25)

or, taking the square,

cov(t, u)^2 = var(t) var(u) r_tu^2     (35.26)
Thus, the PLS covariance criterion capitalizes on precisely the three links that connect two sets of data via their latent factors: (i) the X-factor t should have appreciable variance, var(t); (ii) similarly, the Y-factor u should have a large variance, var(u); and (iii) the two factors t and u should also be strongly related (high r_tu). Of the three aspects inherent in the covariance criterion (35.26), CCA just considers the so-called inner relation between t and u as expressed by r_tu, RRR entirely neglects the var(t) aspect, whereas PCR emphasizes this var(t) component. One might maintain that PLS forms a well-balanced compromise between the methods treated thus far. PLS neither emphasizes one aspect of the X-Y relation unduly, nor does it completely neglect any. The covariance criterion as such suggests a symmetrical situation, X and Y playing equivalent roles. In fact, up to here, there is little difference with Procrustes analysis, which also utilizes the singular vectors of the covariance matrix (Section 35.2). The difference is that in PLS the first X-factor, say t_1, is now used as a regressor to fit both the X-block and the Y-block:

X (= E_0) = t_1 p_1^T + E_1     (35.27a)

Y (= F_0) = t_1 c_1^T + F_1     (35.27b)
Here, the loading vector p_1 contains the coefficients of the separate univariate regressions of the individual X-variables on t_1. The jth element of p_1, p_1j, represents the regression coefficient of x_j regressed on t_1: p_1j = x_j^T t_1 / t_1^T t_1. The full vector of loadings becomes p_1 = E_0^T t_1 / t_1^T t_1. Similarly, c_1 contains the regression coefficients relating t_1 to the Y-variables: c_1 = F_0^T t_1 / t_1^T t_1. The residuals of these regressions are collected in the residual matrices E_1 and F_1:

E_1 = E_0 - t_1 p_1^T     (35.28a)

F_1 = F_0 - t_1 c_1^T     (35.28b)
A second PLS factor t_2 is extracted in a similar way, maximizing the covariance of linear combinations of the residual matrices E_1 and F_1. Subsequently, E_1 and F_1 are regressed on t_2, yielding new residual matrices E_2 and F_2, from which a third PLS
factor t_3 is computed, and so on. If one does not limit the number of factors, the process automatically stops when the X-matrix has been fully depleted. This occurs when the number of factors A equals the rank of X, i.e. A = min(n - 1, p). As for PCR, such a full-rank PLS model is entirely equivalent to multivariate regression on the original X-variables. The PLS algorithm is relatively fast because it only involves simple matrix multiplications. Eigenvalue/eigenvector analysis or matrix inversions are not needed. The determination of how many factors to take is a major decision. Just as for the other methods, the 'right' number of components can be determined by assessing the predictive ability of models of increasing dimensionality. This is more fully discussed in Section 36.5 on validation. Let us now consider a new set of values measured for the various X-variables, collected in a supplementary row vector x*. From this we want to derive a row vector ŷ* of expected Y-values using the predictive PLS model. To do this, the same sequence of operations is followed, transforming x* into a set of factor scores {t*_1, t*_2, ..., t*_A} pertaining to this new observation. From these t*-scores ŷ* can be estimated using the loadings C. Prediction starts by equating ŷ*_0 to the mean (m_Y) for the training data and removing the mean m_X from x*, giving e*_0:

ŷ*_0 = m_Y
e*_0 = x* - m_X

Then we compute the score of the new observation x* on the first PLS dimension, from that we calculate an updated prediction (ŷ*_1), and we remove the first dimension from e*_0, giving e*_1:

t*_1 = e*_0^T w_1
ŷ*_1 = ŷ*_0 + t*_1 c_1
e*_1 = e*_0 - t*_1 p_1

This sequence is repeated for dimension 2:

t*_2 = e*_1^T w_2
ŷ*_2 = ŷ*_1 + t*_2 c_2
e*_2 = e*_1 - t*_2 p_2

and so on.
Alternatively, one may obtain predicted values directly as ŷ* = m_Y + (x* - m_X)^T B_PLS, using the matrix of regression coefficients B_PLS as estimated by the PLS method. It may be shown that a closed expression for these coefficients can be obtained from the weights and loadings matrices [12]: B_PLS = W (P^T W)^(-1) C^T.

35.7.2 NIPALS-PLS Algorithm

Here we summarize the steps needed to compute the PLS model:
1. Centre the data: E_0 = X - 1 m_X^T, F_0 = Y - 1 m_Y^T.
2. For each factor a = 1, ..., A: iterate eqs. (35.23), (35.19), (35.24) and (35.21) until convergence, giving w_a, t_a, q_a and u_a.
3. Compute the loadings p_a = E_(a-1)^T t_a / t_a^T t_a and c_a = F_(a-1)^T t_a / t_a^T t_a.
4. Deflate: E_a = E_(a-1) - t_a p_a^T and F_a = F_(a-1) - t_a c_a^T (eqs. 35.28a,b).
5. Collect w_a, p_a and c_a as the columns of W, P and C.
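A minimal NumPy sketch for a single response y (PLS1), where the NIPALS iteration collapses to a single pass per factor because u stays fixed at the current y-residual (function and variable names are ours):

import numpy as np

def pls1_nipals(X, y, A):
    # X (n x p) and y (n,) are assumed column-centred
    E, f = X.copy(), y.copy()
    W, P, c = [], [], []
    for _ in range(A):
        w = E.T @ f                 # (35.23) with u = f for a single y
        w /= np.linalg.norm(w)
        t = E @ w                   # (35.19)
        tt = t @ t
        p = E.T @ t / tt            # X-loadings
        c_a = (f @ t) / tt          # y-loading
        E = E - np.outer(t, p)      # (35.28a): deflate X
        f = f - t * c_a             # (35.28b): deflate y
        W.append(w); P.append(p); c.append(c_a)
    W, P, c = np.array(W).T, np.array(P).T, np.array(c)
    b = W @ np.linalg.solve(P.T @ W, c)   # B_PLS = W (P^T W)^(-1) C^T [12]
    return b, W, P, c

# Prediction for a new, similarly measured spectrum x:
# yhat = y_mean + (x - x_mean) @ b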
Fig. 36.6. Scatter plot of dependent variable (X-content) versus the first four principal components.
As a model becomes more complex, e.g. using more PCs in PCR calibration, the noise term starts to dominate the added systematic contribution. Procedures to establish the optimum complexity, i.e. choosing the best-predictive calibration model from a range of increasingly complex models, are given in Section 36.3. Thus far we have discussed the traditional method of applying PCR, where the PCs are chosen in order of decreasing variance (top-down approach). Another approach is to calculate the correlation of each PC with y and to enter the PCs in order of their decreasing squared correlation. Since the PC scores are uncorrelated, this is equivalent to applying multiple regression with variable selection on the set of PCs using forward selection (cf. Section 10.3.3). Several studies, e.g. Refs. [7,8], have shown that, with respect to prediction accuracy, this correlation-PCR approach performs no worse and often better than the traditional top-down approach. Correlation-PCR often gives simpler models with a smaller number of PCs than top-down PCR. The added advantage of such parsimonious models is that they are easier to interpret.
Fig. 36.7. Percentage variance of X-content explained by the principal components from spectral data. Individual percentages (bars) are shown as well as cumulative percentages (circles).
Fig. 36.8. Regression coefficients obtained from PCR model.
Fig. 36.9. Final calibration showing fitted versus measured values.
36.2.4 Partial least squares regression

The procedure for PLS calibration is very akin to PCR calibration. It also proceeds via a small set of orthogonal factors constructed from the predictor variables. The main difference with PCR lies in the way the factors are determined. This has been discussed in Chapter 35. In PLS regression the factor is not solely determined by the spread (variance) of the predictor data; the correlation of the PLS factor with the variable to be predicted also plays a role. A high-variance factor that is not at all correlated to the dependent property is given less weight by PLS from the outset. In PCR, such a factor at first seems important on account of its high variance. However, it is downweighted in the final model through the small regression coefficient q_a for that factor (see, for example, the low coefficient for PC1 in the PCR regression model of eq. (36.25)). Since in PLS regression such factors are avoided, PLS models often have fewer factors than PCR models built on the same data. This aspect of parsimony is sometimes seen as an advantage of PLS over traditional PCR. Correlation-PCR behaves like PLS in this respect. Another difference between PLS and PCR is that the loadings P, even when normalized to unit length, differ from the weights W. Qualitatively, however, e.g. with regard to the sign pattern, the difference often is not large. Usually one plots the loadings since they have a simpler interpretation. In Chapter 35 we showed that the loadings P are the regression coefficients obtained from regressing the predictors (spectral signals at each wavelength) on the PLS factor scores. One may consider the loadings (columns of P) as abstract spectra from which the measured spectra can be reconstructed. A high loading p_jk implies that the spectral region around the jth wavelength has a strong contribution from the kth factor. PLS calibration of a multicomponent system can be performed in two different ways. One may do a separate regression for each analyte. Such univariate (in y) regressions are called PLS1 regressions. One may also model the various analytes collectively in one and the same multivariate PLS2 regression model. Which of these two approaches should be chosen? The use of PLS2 regression has a few advantages. Firstly, there is one common set of PLS factors T for all analytes. This simplifies interpretation and enables a simultaneous graphical inspection. Secondly, when the analyte concentrations are strongly correlated one may expect on theoretical grounds that the PLS2 model is more robust than separate PLS1 models. This is especially true when the Y matrix is closed, as for compositional data. Finally, when the number of analytes is large, the development of a single PLS2 model is much quicker than the development of many separate PLS1 models. Practical experience, however, indicates that PLS1 calibration usually performs equally well or better in terms of predictive accuracy. Thus, when the ultimate requirement of the calibration study is to obtain the best possible predictions, a separate PLS1 regression for each analyte is advised.
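In practice one rarely codes PLS from scratch. The sketch below illustrates the PLS1/PLS2 distinction with scikit-learn's PLSRegression (assuming that library is available; the data shapes are purely illustrative):

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(58, 200))    # e.g. 58 spectra at 200 wavelengths
Y = rng.normal(size=(58, 3))      # e.g. concentrations of 3 analytes

# PLS2: one joint model with a common set of factors for all analytes
pls2 = PLSRegression(n_components=5).fit(X, Y)
Y_pred = pls2.predict(X)

# PLS1: a separate model per analyte, often the better predictor
pls1 = [PLSRegression(n_components=5).fit(X, Y[:, k]) for k in range(Y.shape[1])]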
36.2.5 Other linear methods

There are many more methods that are well suited for calibration of collinear data. Latent root regression [9,10], principal covariates regression [11], total least squares [12], and reduced rank regression [13] are related to PCR and PLS in that the calibration model is based on orthogonal factors derived from the predictors (spectra). However, a comparative evaluation of these techniques has not yet been published. Continuum regression [14,15] provides a continuum of models which can be viewed as interpolations between three 'anchor' models, viz. PCR, PLS (more specifically SIMPLS, Section 35.7.4), and MLR (or reduced rank regression). It still has to be established whether there is a need for a continuum of models filling the gaps between the three anchors. Ridge regression is a technique that has been specially devised for dealing with multicollinear data (cf. Section 10.6). It replaces the OLS regression estimator (X^T X)^(-1) X^T y by (X^T X + kI)^(-1) X^T y. The addition of the constant k, the ridge parameter, to the diagonal of X^T X has the effect of artificially decreasing the dependence among the X-predictors. As a result it leads to regression coefficients that are slightly biased towards the low side (shrinkage property). This small systematic error is, however, more than compensated by the beneficial effect of the greatly reduced variance of the estimates. The value of k can be identified from the graph of the regression parameters as a function of k, namely as the value where the regression parameters start to stabilize. In a comparative simulation study it has been shown that ridge regression performs as well as PCR or PLS, all of them outperforming MLR with forward variable selection [16]. Other comparative studies have also shown that the results obtained with various prediction methods are often very similar. When each method is applied carefully it turns out that there is no overall superior technique. Thus, it is better to get well acquainted with one or two methods and to apply them in a professional way, rather than applying different techniques which are known only superficially. Even multiple linear regression (MLR, Chapter 10) has regained some of its lost respectability in the field of multivariate calibration. In combination with modern wavelength selection procedures (Section 36.5.1) and a safeguard against overfitting, well-performing models can be derived that are based on a small number of wavelengths.

Generalized standard addition method

In Section 8.2.8 we have discussed the standard addition method as a means to quantitate an analyte in the presence of unknown matrix effects (cf. Section 13.9). While the matrix effect is corrected for, the presence of other analytes may still interfere with the analysis. The method can be generalized, however, to the simultaneous analysis of p analytes. Multiple standard additions are applied in order to determine the analytes of interest using many (q > p) analytical sensors. It
is assumed that the responses are a linear function of the analyte concentrations. The calibration model reads

R = (1 c_0^T + ΔC) K + E     (36.26)

where R (n×q) is the set of responses measured on the calibration samples according to the 'design' matrix ΔC (n×p), containing the added concentrations. The unknown concentration is given by the vector c_0 (p×1) and K (p×q) is the unknown matrix of sensitivities of the q responses with respect to the p different analytes. One may eliminate the unknown concentration vector c_0 by subtracting the response vector r_0 corresponding to the original sample, without standard additions, from each row of the response matrix R, giving ΔR, and at the same time subtracting the unknown c_0 from the matrix (1 c_0^T + ΔC). The sensitivities K can be estimated from this reduced system of equations using multiple linear regression of the corrected responses (ΔR) on the multiple standard additions (ΔC). Given the estimate K̂ = ((ΔC)^T (ΔC))^(-1) (ΔC)^T ΔR one may estimate the unknown concentration as:

ĉ_0 = (K̂ K̂^T)^(-1) K̂ r_0     (36.27)
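A minimal sketch of this estimation scheme (our own illustration; the shapes follow eq. 36.26):

import numpy as np

def gsam(R, dC, r0):
    # R (n x q): responses after the standard additions dC (n x p);
    # r0 (q,): response of the original, unspiked sample
    dR = R - r0                                 # corrected responses
    K = np.linalg.solve(dC.T @ dC, dC.T @ dR)   # sensitivities, p x q
    c0 = np.linalg.solve(K @ K.T, K @ r0)       # eq. (36.27)
    return c0, K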
More efficient estimation methods exist than the simple method described here [17]. The generalized standard addition method (GSAM) shares the strong points (e.g. correction for interferences) and the weak points (e.g. error amplification because of the extrapolation involved) of the simple standard addition method [18].

36.3 Validation

Many choices have to be made before a calibration model can be developed. First one has to choose between the methods discussed in Section 36.2. Each of these methods involves the choice of the dimensionality A, a 'meta-parameter' that may greatly affect the predictive performance of the calibration model. In ILS calibration one has the problem of selecting a limited number of predictive wavelengths. If one chooses the method of Brown, Denham and Spiegelman for wavelength selection (cf. Section 36.5.1) one still has to choose a confidence level α which governs the number of wavelength channels. A way to make such choices is to build the models from the calibration or training set and see which of the models gives the best predictions on a new set of data, the test set. The obvious criterion for assessment is the average size of the prediction error, as expressed by the prediction error sum of squares, PRESS ([19], Section 10.3.4), or the root-mean-square (rms) value of the prediction error, RMSPE:
PRESS = Σ_i (ĉ_i - c_i)^2     (36.28)

RMSPE = (PRESS / n_t)^(1/2)     (36.29)
The summation in eq. (36.28) extends over all n_t samples in the test set. Among the various model options one chooses the one having minimum PRESS. It is still common usage to regard the model thus chosen as a validated model and RMSPE as a proper estimate of future prediction errors. However, the procedure described above only helped to choose a final model, and the prediction error RMSPE estimated in this manner will be optimistically biased. The so-called test set played the role of a second training set for estimating the meta-parameter(s). It is therefore better to refer to this second dataset as the monitoring set, used to fine-tune the meta-parameter(s). Given the chosen model one should establish its performance by testing the model on a new independent set of data. The RMSPE value obtained for this new set of data, truly the test set, gives a fair impression of the predictive capability of the model, provided that the training data, monitoring data and test data are randomly sampled from the same population. Should the test result be disappointing and lead to amendments of the model, this new calibration model should be tested against wholly new data, etc. In practice the strategy of employing separate sets of calibration (training) and validation (test) data often cannot be applied as it requires a large number of samples. It is quite common that the number of samples is limited and resort is taken to resampling methods or internal cross-validation (cf. Sections 10.3.4 and 33.4). Here, one sets aside part of the data and builds a model with the rest of the data. The data not included are then used for assessing the PRESS. This process is repeated for various splits of the original data such that all samples have been left out once. The PRESS values are accumulated over the various splits. At the end one chooses the model having minimum overall PRESS. Perhaps the most common approach is to perform n calibration steps leaving out one observation at a time (leave-one-out procedure, LOO). As an example, Fig. 36.10 shows the cross-validation RMSPE result for (traditional) PCR and PLS models of the X-content. For the PLS model the minimum RMSPE (1.8%) occurs at a dimensionality A = 5. For PCR the minimum lies at A = 10 (RMSPE = 2.0%). This minimum is very shallow and one might trade in a few factors for a simpler, and probably more robust, model with about the same prediction error (A = 8, RMSPE = 2.2%). In all, the model choice in this example is fairly clear-cut: PLS regression with 5 factors. If one is willing to accept a somewhat larger prediction error (RMSPE = 2.5%) a parsimonious 3-factor PLS model suffices. Since one knows that the minimum RMSPE value is optimistically biased it is good advice to prefer simpler models with slightly higher RMSPE values. The one-factor model for PCR performs no better than a zero-factor model, i.e. using no spectral information at all.
Fig. 36.10. Prediction error (RMSPE) as a function of model complexity (number of factors) obtained from leave-one-out cross-validation using PCR (o) and PLS (*) regression.
Apparently, the first PC-factor captures an important source of spectral variation that has no predictive value. Leaving out one object at a time represents only a small perturbation of the data when the number (n) of observations is not too low. The popular LOO procedure has a tendency to lead to overfitting, giving models that have too many factors and an RMSPE that is optimistically biased. Another approach is k-fold cross-validation, where one applies k calibration steps (5 < k < 15), each time setting a different subset of (approximately) n/k samples aside. For example, with a total of 58 samples one may form 8 subsets (2 subsets of 8 samples and 6 of 7), each subset tested with a model derived from the remaining 49 or 50 samples. In principle, one may repeat this k-fold cross-validation a number of times using a different splitting [20]. Van der Voet [21] advocates the use of a randomization test (cf. Section 12.3) to choose among different models. Under the hypothesis of equivalent prediction performance of two models, A and B, the errors obtained with these two models come from one and the same distribution. It is then allowed to exchange the observed errors, e_iA and e_iB, for the ith sample that are associated with the two models. In the randomization test this is actually done in half of the cases. For each object i the two residuals are swapped or not, each with a probability of 0.5. Thus, for about half of the objects in the calibration set the original residuals are retained; for the other half they are exchanged. One now computes the error sum of squares for each of the two sets of residuals, and from that the ratio F = SSE_A / SSE_B. Repeating the process some 100-200 times yields a distribution of such F-ratios, which serves as a reference distribution for the actually observed F-ratio. When, for instance, the observed ratio lies in the extreme upper tail of the simulated distribution one may
be confident that model B is significantly better than model A. This randomization test is generally applicable to choosing among different models; hence it can be used to determine, e.g., the lowest complexity of PCR or PLS models that is significantly superior to simpler models. Another approach to model validation is the application of the bootstrap. Here, we only give a short and qualitative description. For more details about the bootstrap methodology in general, see [22,23]; the latter is a recent application to multivariate calibration. The analogy with cross-validation is that the observed data are resampled many times. The difference is that one does not split the data into two subsets, but applies resampling with replacement. As an example, suppose there are 100 objects in the calibration set. This is regarded as a population. One may draw 100 samples from this population with replacement, i.e. some objects will not be selected at all, many only once, some twice or more. With this artificial data set one builds a model. This process is repeated many times, hence it is computer intensive. For each model one computes a quantity of interest, e.g. a prediction error. From the distribution of these prediction errors one may derive an average value as well as a measure of its uncertainty. By monitoring the average prediction error as a function of the number of PCR or PLS factors one may determine an optimal model complexity. This average prediction error can be used to compute confidence limits around future predictions.
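As a sketch of the randomization test described above (our own implementation of the recipe as summarized here; names are illustrative):

import numpy as np

def randomization_test(eA, eB, n_perm=200, seed=0):
    # eA, eB: prediction errors of models A and B for the same samples
    rng = np.random.default_rng(seed)
    F_obs = np.sum(eA**2) / np.sum(eB**2)
    F_ref = np.empty(n_perm)
    for i in range(n_perm):
        swap = rng.random(eA.size) < 0.5        # swap each pair with p = 0.5
        a = np.where(swap, eB, eA)
        b = np.where(swap, eA, eB)
        F_ref[i] = np.sum(a**2) / np.sum(b**2)
    # An observed ratio in the extreme upper tail of the reference
    # distribution indicates that model B predicts significantly better.
    return F_obs, np.mean(F_ref >= F_obs)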
36.4 Other aspects

36.4.1 Calibration design

It should be appreciated that the classical theory of design of experiments (Chapters 21-26), based on the linear model estimated by least squares regression, cannot be applied directly to the problem of multivariate calibration. One reason is that one may not know precisely which (interfering) factors are at play, let alone be able to control them. Another reason is that one may regard the spectral space as the space in which to position the chemical samples. The number of dimensions (wavelengths) is generally much larger than the number of experimental units (chemical samples), so that the linear model cannot be estimated by ordinary least squares regression. There are two points of view to take into account when setting up a training set for developing a predictive multivariate calibration model. One viewpoint is that the calibration set should be representative of the population for which future predictions are to be made. This will generally lead to a distribution of objects in experimental space that has a higher density towards the center, tailing out to the boundaries. Another consideration is that it is better to spread the samples more or less evenly over the experimental region. One way to generate such a space-filling design is by applying the Kennard-Stone algorithm (cf. Section 24.4.2), sketched below. Generally, this will put less weight on the center of the experimental region and more on the extremes, which should give more precise estimates. Fearn [24] gives an interesting discussion on the two choices, the main message being that when the number of calibration objects is limited the latter choice may be preferable.
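The Kennard-Stone selection can be sketched as follows; this is a straightforward, non-optimized rendering under the assumption that X holds the candidate samples as rows, and all names are illustrative.

import numpy as np

def kennard_stone(X, k):
    """Select k rows of X that spread evenly over the experimental region."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    selected = list(np.unravel_index(np.argmax(d), d.shape))   # two most distant samples
    while len(selected) < k:
        remaining = [i for i in range(len(X)) if i not in selected]
        # add the candidate whose nearest selected sample is farthest away
        selected.append(max(remaining, key=lambda i: d[i, selected].min()))
    return selected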
One is not always in a position to choose the samples. In that case it is never wise to discard samples from a given calibration set for the sole sake of making the distribution of calibration objects in the region of interest more uniform. When a linear model does not hold over the full range of calibration samples there are two options. One may apply a non-linear regression method to the complete set of data (cf. Section 36.5.3), or one may split the experimental region into smaller subregions and estimate a separate linear model for each. Naes and Isaksson [25] use fuzzy clustering for splitting the training set into smaller subsets with improved linearity in each group.

36.4.2 Data pretreatment

Data pretreatment is an important issue. Proper preprocessing of the data can be very instrumental in developing better predictive models. Although it is true that the modelling process in multivariate calibration may accommodate interferences and irrelevant artefacts, careful data preprocessing often turns out to be more effective [26]. General guidelines on how to preprocess data are hard to give, since this depends very much on the specific application at hand (e.g. which type of data? spectroscopic? which technique?) and on the nature of the samples in question. It goes without saying that any pretreatment of the data has to be applied in an identical manner to the calibration data, the test data and future new data. A very basic form of pretreatment is (column) mean centering. It corresponds to modelling the variation around the mean, i.e. the deviation from the mean response is directly related to deviations from the mean for the predictors. Mean centering is so common that it is often not even considered as a form of data pretreatment. Without further scaling, however, variables with a high variance may gain an undue influence on the model, except with MLR, which is not sensitive to such scaling. Autoscaling is a form of pretreatment that is recommendable when the predictor variables are of a different nature and not measured on the same scale. Standardizing all variables to the same variance can be seen as a democratic manoeuvre, giving all variables an equal chance to influence the model. With spectroscopic data a popular form of pretreatment is to correct for varying baseline slopes by regressing each spectrum against wavelength number and continuing the calibration with the residuals. Quadratic regression has also been used as a means for detrending spectra.
Pre-smoothing may be applied to the spectral data to get rid of uncontrolled random noise. Various options are available to implement such smoothing (cf. Chapter 40): moving window (box car) averaging, Fourier filtering, and Savitzky-Golay smoothing. A side effect of smoothing is that spectral resolution may be lost. A very common pretreatment of spectral data is to convert the spectra to first- (or second-) derivative form [27]. This has the effect of removing any offset and constant slope (or curvature). Applying second derivatives has the advantage of sharpening peaks and resolving overlapping bands to some extent, although it also introduces spurious satellite peaks. Derivatization in general amplifies the noise in the data (Section 40.5.5). As a remedy to the latter drawback one may carry out the derivatization in combination with some degree of smoothing, for example by employing Savitzky-Golay filtering (Section 40.5.2.3). An effective preprocessing method is the use of standard normal variates (SNV). This type of standardization boils down to considering each spectrum x_i as a set of q observations and calculating their z-scores:

z_i = (x_i - x̄)/s   (36.30)
It has the effect of removing an overall offset by subtracting the mean spectral reading x̄, and it corrects for differences affecting the overall variation. In various settings SNV has been found to be an effective preprocessing method. Another popular form of data preprocessing with near-infrared data is the application of multiplicative scatter correction (MSC, [28]). It is well known that the particle size distribution of non-homogeneous powders has an overall effect on the spectrum, raising all intensities as the average particle size increases. Each individual spectrum x_i is approximated by a general offset plus a multiple of a reference spectrum z:

x_i = a_i + b_i z + e_i   (36.31)
The offset a_i and the multiplication constant b_i are estimated by simple linear regression of the ith individual spectrum on the reference spectrum z. For the latter one may take the average of all spectra. The deviation e_i from this fit carries the unique information. This deviation, after division by the multiplication constant, is used in the subsequent multivariate calibration. For the above correction it is not mandatory to use the entire spectral region. In fact, it is better to compute the offset and the slope from those parts of the wavelength range that contain no relevant chemical information. However, this requires spectroscopic knowledge that is not always available.
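Both corrections are easily expressed for a spectral data matrix. The sketch below is a minimal illustration of SNV (eq. (36.30)) and MSC (eq. (36.31)); following the text, the mean spectrum is taken as the MSC reference, and all names are assumptions of this sketch. (A Savitzky-Golay smoothing or derivative step, when wanted, is available as scipy.signal.savgol_filter.)

import numpy as np

def snv(X):
    """Standard normal variate: z-score each spectrum (row) individually."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

def msc(X, reference=None):
    """Multiplicative scatter correction against a reference spectrum z."""
    z = X.mean(axis=0) if reference is None else reference
    X_corr = np.empty_like(X)
    for i, x in enumerate(X):
        b, a = np.polyfit(z, x, deg=1)   # fit x_i = a_i + b_i z + e_i
        X_corr[i] = (x - a) / b          # residual signal, scaled by the slope
    return X_corr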
A special type of data pretreatment is the transformation of the data into a smaller number of new variables. Principal components analysis is a natural example, and we have treated it in Section 36.2.3 in the form of PCR. Another way to summarize a spectrum in a few terms is through Fourier analysis. McClure [29] has shown how a NIR spectrum recorded at 1700 channels can be well approximated by a Fourier series comprising only 100 terms. The error amounted to less than 0.01%, yet the spectra can be compressed and stored in 6% of the original space. A calibration model based on some 10 selected Fourier terms proved to give results that were superior to using the full raw data.

36.4.3 Outliers

The presence of outliers may have a detrimental effect on the quality of a calibration model. Therefore, the identification of outliers is an important part of the modelling process. Outliers can come in different guises. One speaks of high-leverage observations when the predictor data for a calibration object deviate strongly from the rest. Such outliers in X-space may fit well to the model ('good' outliers) or not ('bad' outliers). When the predictor data are not abnormal for an object, but the object fits poorly to the model, then one speaks of a high-residual observation (an outlier in the y-direction). Another class is formed by the influential observations. These are observations that have a demonstrably large impact on the model estimates: when such observations are discarded from the calibration set, a significantly different model with different predictions is obtained. How do we identify outliers and how should we treat them once found? Basically there are two approaches: either apply diagnostics for detecting outliers or use robust estimation methods. Which diagnostics to use depends on the regression method employed, but some tools are universally applicable. It is always useful to make residual plots. These may reveal strongly deviating observations or some remaining structure that should not appear in a good residual plot. A common way of inspecting the data is to draw a PCA score plot, which may show samples deviating from the bulk of the samples. Formally, one can compute the Mahalanobis distance or the leverage (cf. Sections 8.2.6 and 20.7) and use that as an indication of outlying behaviour. Similarly, a loading plot may reveal wavelengths which show deviating behaviour. One may also use the PLS scores and loadings during the modelling stage. With MLR one can use such measures as Cook's distance (cf. Section 10.9) for influential observations or the variance inflation factor (cf. Section 10.5) for variables. A complicating factor in the diagnosis of outliers is the phenomenon of masking. This refers to the fact that an individual observation may not be recognized as outlying because it is part of a cluster of outlying objects. Only when the complete cluster of deviating outliers is removed from the calibration set does one recognize their severe influence on the calibration model [30]. Identification of outliers is thus not a straightforward process. Even when observations have been diagnosed as outlying, one should not automatically discard them, certainly not when the evidence is not overwhelming. Ideally, one should
always try to find additional physical or chemical evidence that something is 'wrong' with such samples before deciding to remove them. An alternative to the detection and removal of outliers is to employ robust regression methods (cf. Sections 12.1.5 and 12.3). With such methods outlying objects are automatically identified and downweighted in the regression modelling. Robust modifications of popular multivariate regression procedures have been reported for PCR [31] and PLS [32]. Walczak [33] has described a method, the evolution program (EP), in which a clean subset of the data is generated and the remaining observations are tested relative to this clean subset. As the name implies, the method is based on the idea of natural evolution, similar to the better-known genetic algorithms. The EP approach allows one to build robust models in the presence of multiple multivariate outliers and can be usefully applied in combination with PCR and PLS regression. Other robust methods for identifying multivariate outliers are given in, e.g., Refs. [34,35].
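As an illustration of the score-based diagnostics described earlier in this section, the following sketch flags samples with a large Mahalanobis distance in a truncated PCA score space. The chi-square cut-off, the number of components and all names are illustrative assumptions, not a prescription of the text.

import numpy as np
from scipy import stats

def score_outliers(X, n_pc=3, alpha=0.01):
    """Indices of samples with extreme Mahalanobis distance in PC space."""
    U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    T = U[:, :n_pc] * s[:n_pc]                             # PCA scores
    d2 = np.sum((T / T.std(axis=0, ddof=1)) ** 2, axis=1)  # squared Mahalanobis distance
    return np.where(d2 > stats.chi2.ppf(1 - alpha, df=n_pc))[0]

Samples flagged in this way should, as argued above, be examined rather than discarded automatically.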
36.5 New developments

36.5.1 Feature selection

Brown et al. [36] published an interesting, simple method for selecting wavelengths in NIR calibration. The method boils down to ranking the wavelengths in order of diminishing squared correlation (R²) with the analyte concentration (see the sketch below). The model is then built using the first m wavelengths, j = 1, ..., m, from this ordered list, as a weighted average of the m simple regression models corresponding to these wavelengths. The idea is to minimize the confidence interval for future predictions. As m increases, so does the total signal-to-noise ratio (the sum of the F-ratios, or squared t-values, of the m separate simple regressions), but so does the model complexity. The optimum number of wavelengths m is established by minimizing the ratio of χ²(m;α), the critical value of a chi-square distribution with m degrees of freedom at the chosen level of confidence, to this total signal-to-noise ratio. Having found the selection of wavelengths one may proceed using any of the aforementioned regression methods, e.g. ILS or GLS. Many other new developments are under way in this area of variable selection. Wavelength selection can also be done by forward selection using covariance rather than correlation as a criterion (intermediate least squares [37,38]). One may see this as the PLS analogue of forward selection in MLR. Recently, genetic algorithms (cf. Chapter 27) have also been applied to the problem of finding small sets of predictive wavelengths among the legion of candidate wavelengths [39]. The challenge with the application of such methods is not to fall into the trap of overfitting.
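The ranking step of this method is easily sketched; the weighted-average model building and the chi-square stopping rule are omitted here for brevity, and the names are illustrative.

import numpy as np

def rank_wavelengths(X, y):
    """Wavelength indices ordered by decreasing squared correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
        (Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.argsort(r ** 2)[::-1]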
Another major problem is that of chance correlations: with the large set of predictors encountered in spectral data it is not at all unlikely to select wavelengths which have no real predictive power but happen to contribute to the overall correlation with the response in the calibration set. An alternative approach to variable selection is the elimination of so-called uninformative variables. These are variables that have no better predictive power than artificial random variables added to the data [40].

36.5.2 Transfer of calibration models

The development of a calibration model is a time-consuming process. Not only do the samples have to be prepared and measured, but the modelling itself, including data preprocessing, outlier detection, estimation and validation, is not an automated procedure. Once the model is there, changes may occur in the instrumentation or in other conditions (temperature, humidity) that require recalibration. Another situation is where a model has been set up for one instrument in a central location and one would like to distribute this model to other instruments within the organization without having to repeat the entire calibration process for all these individual instruments. One wonders whether it is possible to translate the model from one instrument (old, parent or master, A) to the others (new, children or slaves, B). Several approaches have been investigated recently to achieve this multivariate calibration transfer. All of these require that a small set of transfer samples is measured on all instruments involved. Usually this is a small subset of the larger calibration set that has been measured on the parent instrument A. Let Z indicate the set of spectra for the transfer set, X the full set of spectra measured on the parent instrument, and a suffix A or B the instrument on which the spectra were obtained. The oldest approach to the calibration transfer problem is to apply the calibration model b_A, developed for the parent instrument A using a large calibration set (X_A), to the spectra of the transfer set obtained on each instrument, i.e. Z_A and Z_B. One then regresses the predictions y_A (= Z_A b_A) obtained for the parent instrument on those for the child instrument, y_B (= Z_B b_A), giving

y_A = a + b y_B + e   (36.32)
This yields an estimate of the bias (intercept) a and the slope b needed to correct predictions y_B from the new (child) instrument that are based on the old (parent) calibration model b_A. The virtue of this approach is its simplicity: one does not need to investigate in any detail how the two sets of spectra compare; only the two sets of predictions obtained from them are related. The assumption is that the same type of correction applies to all future prediction samples. Variations in conditions that may have a different effect on different samples cannot be corrected for in this manner.
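The slope/bias correction of eq. (36.32) amounts to a univariate regression of the parent predictions on the child predictions for the transfer samples. A minimal sketch, with illustrative names, assuming y_parent (= Z_A b_A) and y_child (= Z_B b_A) have already been computed:

import numpy as np

def slope_bias_correction(y_parent, y_child):
    """Return a function that corrects predictions from the child instrument."""
    b, a = np.polyfit(y_child, y_parent, deg=1)   # y_A = a + b * y_B
    return lambda y_new: a + b * y_new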
All other approaches try to relate the child spectra to the parent spectra. In the patented method of Shenk and Westerhaus [41], in its simplest form, one first applies a wavelength correction and then a correction for the absorbance. Each wavelength channel i of the parent instrument is linked to a nearby wavelength channel j(i) in the child instrument, namely the one to which it is maximally correlated. Then, for each pair of wavelengths, i for the parent and j(i) for the child, a simple linear regression is carried out, linking the pair of measured absorbances
z_A,i = a_i + b_i z_B,j(i)   (36.33)
In this way the child spectrum is transformed into a spectrum as if measured on the parent instrument. In a more refined implementation one establishes the maximally correlating wavelength channel through quadratic interpolation and, subsequently, the corresponding intensity at this non-observed channel through linear interpolation. In this way a complete spectrum measured on the child instrument can be transformed into an estimate of the spectrum as if it were measured on the parent instrument. The calibration model developed for the parent instrument may be applied without further ado to this spectrum. The drawback of this approach is that it is essentially univariate: it cannot deal with complex differences between dissimilar instruments. In the direct standardization introduced by Wang et al. [42] one finds the transformation needed to transfer spectra from the child instrument to the parent instrument using a multivariate calibration model for the transformation matrix: Z_A = Z_B F. The transformation matrix F (q×q) translates spectra Z_B that are actually measured on the child instrument B into spectra that appear as if they were measured on instrument A. Predictions are then obtained by applying the old calibration model b_A to these simulated spectra:

y_A = Z_B F b_A   (36.34)
giving

b_B = F b_A   (36.35)
as the transferred calibration model that applies directly to spectra measured on instrument B. Either PCR or PLS2 regression has been used for establishing the proper transformation F. Notice that for each channel of the estimated spectrum the full spectrum of instrument B is used. In piecewise direct standardization (PDS) one uses for each frequency (column of F) only the local information of
neighbouring wavelengths in the transfer spectra Z_B, employing a window of wavelengths (columns of Z_B) centered around the wavelength (column of Z_A) of current interest. In mathematical terms, one imposes a band structure on the transformation matrix F. The span of the neighbourhood region and the number of PCs have to be optimized via cross-validation. Applications of this PDS technique have proved successful [43]. The choice of the transfer subset is critical to the success of calibration transfer. The transfer samples should span the region of interest and can be chosen on the basis of extreme PC or PLS factor scores. Improved results can be obtained by a better coverage of the calibration range, for example by using a formal design algorithm such as that of Kennard and Stone (Section 24.4.2) for the selection of the transfer set. Forina [44] applies PLS regression both for estimating the calibration model on the one instrument and for modelling the relation between the two sets of spectra. Alternative methods and suggestions for improving existing methods continue to be reported. Good reviews on the theory and practice of the transfer of calibration models are found in Refs. [45,46].

36.5.3 Non-linear methods

In recent years there has been much activity to devise methods for multivariate calibration that take non-linearities into account. Artificial neural networks (Chapter 44) are well suited for modelling non-linear behaviour and they have been applied with success in the field of multivariate calibration [47,48]. A drawback of neural net models is that interpretation and visualization of the model are difficult. Several non-linear variants of PCR and PLS regression have been proposed. Conceptually, the simplest approach towards introducing non-linearity into the regression model is to augment the set of predictor variables (x1, x2, ...) with their respective squared terms (x1², x2², ...) and, optionally, their possible cross-product terms (x1x2, ...). Since the number of predictors grows appreciably, PCR or PLS regression is called for. A non-linear variant of PLS employing splines for the inner relation between y and the t-scores has been proposed that has some analogy with neural nets. However, in multivariate calibration this splines-PLS approach [49] has not yet met with success. In fact, using a quadratic regression model with factor scores from PCA can be just as effective [50]. One may also employ linear PLS regression as a first step and then proceed with these PLS scores in a quadratically extended regression model (LQ-PLS, [51]). Locally weighted regression (LWR), an approach combining elements of PCA or PLS, weighted regression and local modelling, has been more successful [52]. In this approach one starts with a transformation of the spectra into a few PC scores. The spectrum of any new sample is transformed to the same PC space and a small set of similar spectra from the calibration set is determined, using the Mahalanobis distance as a criterion for
similarity. Multiple linear regression is then used to relate the response y to the PC scores of this small local set, and this local model is used as an interpolating model to estimate the unknown response for the new sample. The number of PC dimensions and the number of neighbours are determined through cross-validation. More elaborate extensions of this approach take not only spectral similarity but also the estimated chemical similarity into account [53]. An interesting study comparing a variety of modern non-linear methods is given in Ref. [54].
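The LWR prediction step described above may be sketched as follows. This is an illustration, not the algorithm of Ref. [52]: a standardized Euclidean distance on the PCA scores stands in for the Mahalanobis criterion, and the shapes of X (samples by wavelengths), y and x_new are assumptions of the sketch.

import numpy as np

def lwr_predict(X, y, x_new, n_pc=4, n_local=15):
    x_mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    P = Vt[:n_pc].T                          # loadings
    T = (X - x_mean) @ P                     # calibration scores
    t_new = (x_new - x_mean) @ P             # score of the new spectrum
    scale = T.std(axis=0, ddof=1)
    d = np.linalg.norm((T - t_new) / scale, axis=1)
    local = np.argsort(d)[:n_local]          # most similar calibration samples
    A = np.column_stack([np.ones(n_local), T[local]])
    coef, *_ = np.linalg.lstsq(A, y[local], rcond=None)
    return coef[0] + t_new @ coef[1:]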
References

1. K.I. Hildrum, T. Isaksson, T. Naes and A. Tandberg, Near Infra-red Spectroscopy. Bridging the Gap between Data Analysis and NIR Applications. Ellis Horwood, New York, 1992.
2. H. van de Waterbeemd, ed., QSAR: Chemometric Methods in Molecular Design. VCH, Weinheim, 1995.
3. T. Naes and E. Risvik (Editors), Multivariate Analysis of Data in Sensory Science, Data Handling in Science and Technology Series. Elsevier, Amsterdam, 1996.
4. P.K. Hopke and X.-H. Song, The chemical mass balance as a multivariate calibration problem. Chemom. Intell. Lab. Syst., 37 (1997) 5-14.
5. P.J. Brown, Measurement, Calibration and Regression. Clarendon Press, Oxford, 1993.
6. T. Naes, Progress in multivariate calibration, pp. 52-60 in Ref. [1].
7. J.M. Sutter, J.H. Kalivas and P.M. Lang, Which principal components to utilize for principal component regression. J. Chemometr., 6 (1992) 217-225.
8. Y.L. Xie and J.H. Kalivas, Evaluation of principal component selection methods to form a global prediction model by principal component regression. Anal. Chim. Acta, 348 (1997) 19-27.
9. R.F. Gunst and R.L. Mason, Regression Analysis and its Application: A Data-Oriented Approach. Marcel Dekker, New York, 1980.
10. E. Vigneau, D. Bertrand and E.M. Qannari, Application of latent root regression for calibration in near-infrared spectroscopy. Comparison with principal component regression and partial least squares. Chemom. Intell. Lab. Syst., 35 (1996) 231-238.
11. S. de Jong and H.A.L. Kiers, Principal covariates regression. Chemom. Intell. Lab. Syst., 14 (1992) 155-164.
12. S. Van Huffel and J. Vandewalle, The Total Least Squares Problem: Computational Aspects and Analysis. SIAM, Philadelphia, PA, 1991.
13. P.T. Davies and M.K.S. Tso, Procedures for reduced-rank regression. Appl. Stat., 31 (1982) 244-255.
14. M. Stone and R.J. Brooks, Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares, and principal component regression. J. Roy. Stat. Soc., B52 (1990) 237-269.
15. R.J. Brooks and M. Stone, Joint continuum regression for multiple predictands. J. Am. Stat. Assoc., 89 (1994) 1374-1377.
16. I.E. Frank and J.H. Friedman, A statistical view of some chemometrics regression tools. Technometrics, 35 (1993) 109-135.
17. R. Sundberg, Interplay between chemistry and statistics, with special reference to calibration and the generalized standard addition method. Chemom. Intell. Lab. Syst., 4 (1988) 299-305.
18. M.A. Sharaf, D.L. Illman and B.R. Kowalski, Chemometrics. Wiley, New York, 1986.
19. D.M. Allen, The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16 (1974) 125-127.
20. M. Forina, G. Drava, R. Boggia, S. Lanteri and P. Conti, Validation procedures in near-infrared spectrometry. Anal. Chim. Acta, 295 (1994) 109-118.
21. H. van der Voet, Comparing the predictive accuracy of models using a simple randomization test. Chemom. Intell. Lab. Syst., 25 (1994) 313-323.
22. B. Efron and R. Tibshirani, An Introduction to the Bootstrap. Wiley, New York, 1993.
23. R. Wehrens and W. van der Linden, Bootstrapping principal component regression models. J. Chemom., 11 (1997) 157-172.
24. T. Fearn, Flat or natural? A note on the choice of calibration samples, pp. 61-66 in Ref. [1].
25. T. Naes and T. Isaksson, Splitting of calibration data by cluster-analysis. J. Chemometr., 5 (1991) 49-65.
26. O.E. de Noord, The influence of data preprocessing on the robustness and parsimony of multivariate calibration models. Chemom. Intell. Lab. Syst., 23 (1994) 65-70.
27. W.R. Hruschka, Data analysis: wavelength selection methods, pp. 35-55 in: P.C. Williams and K. Norris, eds., Near-infrared Reflectance Spectroscopy. Am. Cereal Assoc., St. Paul, MN, 1987.
28. P. Geladi, D. McDougall and H. Martens, Linearization and scatter-correction for near-infrared reflectance spectra of meat. Appl. Spectrosc., 39 (1985) 491-500.
29. F. McClure, Analysis using Fourier transforms, in: D.A. Burns and E.W. Ciurczak, eds., Handbook of Near-Infrared Analysis. Dekker, New York, 1992, pp. 181-224.
30. A.C. Atkinson, Masking unmasked. Biometrika, 73 (1986) 533-541.
31. B. Walczak and D.L. Massart, Robust PCR as outliers detection tool. Chemom. Intell. Lab. Syst., 27 (1995) 41-54.
32. I.N. Wakeling and H.J.H. MacFie, A robust PLS procedure. J. Chemom., 6 (1992) 189-198.
33. B. Walczak, Outlier detection in bilinear calibration. Chemom. Intell. Lab. Syst., 29 (1995) 63-73.
34. A. Singh, Outliers and robust procedures in some chemometric applications. Chemom. Intell. Lab. Syst., 33 (1996) 75-100.
35. A.S. Hadi, A modification of a method for the detection of outliers in multivariate samples. J. Roy. Stat. Soc., B56 (1994) 393-396.
36. P.J. Brown, C.H. Spiegelman and M.C. Denham, Chemometrics and spectral frequency selection. Phil. Trans. R. Soc. Ser. A, 337 (1991) 311-322.
37. I.E. Frank, Intermediate least squares regression method. Chemom. Intell. Lab. Syst., 1 (1987) 232-242.
38. A. Hoskuldsson, The H-principle in modelling with applications to chemometrics. Chemom. Intell. Lab. Syst., 14 (1992) 139-153.
39. D. Jouan-Rimbaud, D.L. Massart, R. Leardi et al., Genetic algorithms as a tool for wavelength selection in multivariate calibration. Anal. Chem., 67 (1995) 4295-4301.
40. V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.G.M. Vandeginste and C. Sterna, Elimination of uninformative variables for multivariate calibration. Anal. Chem., 68 (1996) 3851-3858.
41. J.S. Shenk and M.O. Westerhaus, US Patent No. 4866644, Sept. 12, 1989.
42. Y.D. Wang, D.J. Veltkamp and B.R. Kowalski, Multivariate instrument standardization. Anal. Chem., 63 (1991) 2750-2756.
43. Z.Y. Wang, T. Dean and B.R. Kowalski, Additive background correction in multivariate instrument standardization. Anal. Chem., 67 (1995) 2379-2385.
44. M. Forina, G. Drava, C. Armanino et al., Transfer of calibration function in near-infrared spectroscopy. Chemom. Intell. Lab. Syst., 27 (1995) 189-203.
45. O.E. de Noord, Multivariate calibration standardization. Chemom. Intell. Lab. Syst., 25 (1995) 85-97.
46. E. Bouveresse and D.L. Massart, Standardisation of near-infrared spectrometric instruments. Vib. Spectrosc., 11 (1996) 3-15.
47. B.J. Wythoff, Backpropagation neural networks — a tutorial. Chemom. Intell. Lab. Syst., 18 (1993) 115-155.
48. C. Borggaard and H.H. Thodberg, Optimal minimal neural interpretation of spectra. Anal. Chem., 64 (1992) 545-551.
49. S. Wold, Non-linear partial least squares modelling. II. Spline inner relation. Chemom. Intell. Lab. Syst., 14 (1992) 71-84.
50. S.D. Oman, T. Naes and A. Zube, Detecting and adjusting for nonlinearities in calibration of near-infrared data using principal components. J. Chemom., 7 (1993) 195-212.
51. S. Wold, N. Kettaneh-Wold and B. Skagerberg, Nonlinear PLS modeling. Chemom. Intell. Lab. Syst., 7 (1989) 53-65.
52. T. Naes, T. Isaksson and B. Kowalski, Locally weighted regression in NIR analysis. Anal. Chem., 62 (1990) 664-673.
53. Z.Y. Wang, T. Isaksson and B.R. Kowalski, New approach for distance measurement in locally weighted regression. Anal. Chem., 66 (1994) 249-260.
54. S. Sekulic, B.R. Kowalski, Z.Y. Wang et al., Nonlinear multivariate calibration methods in analytical chemistry. Anal. Chem., 65 (1993) 835A-845A.
Additional recommended reading

Books

H. Martens and T. Naes, Multivariate Calibration. Wiley, Chichester, 1989.
J.H. Kalivas and P.M. Lang, Mathematical Analysis of Spectral Orthogonality. Dekker, New York, 1994.
Articles

K.R. Beebe and B.R. Kowalski, An introduction to multivariate calibration and analysis. Anal. Chem., 59 (1987) 1007A-1017A.
B.R. Kowalski and M.B. Seasholtz, Recent developments in multivariate calibration. J. Chemometrics, 5 (1990) 129-145.
H. Martens and T. Naes, Multivariate Calibration by Data Compression, Chapter 4 in Ref. [1].
P.J. Brown, Multivariate calibration. J. Roy. Stat. Soc., B44 (1982) 287-321.
K. Faber and B.R. Kowalski, Propagation of measurement errors for the validation of predictions obtained by principal component regression and partial least squares. J. Chemom., 11 (1997) 181-238.
D.M. Haaland, Multivariate Calibration Methods Applied to the Quantitative Analysis of Infrared Spectra, Chapter 1 in: P.C. Jurs, ed., Computer-Enhanced Analytical Spectroscopy, Volume 3. Plenum Press, New York, 1992.
Chapter 37
Quantitative Structure-Activity Relationships (QSAR)

37.1 Extrathermodynamic methods

Today it is a well-accepted view that the biological and therapeutic activity of a drug depends upon its physicochemical and conformational (or steric) properties. The former include the ability of a molecule to cross cell membranes or to be taken up by fatty material, the distribution of electric charges within the molecule, its capacity to form hydrogen bonds with other molecules, etc. The latter relate to the nature of the atoms that make up the molecule, the distances between atoms and the angles formed by the chemical bonds between them. The search for quantitative relations between chemical structure and biological activity is the subject of Quantitative Structure-Activity Relationships (QSAR), the purpose of which is to explain why a given drug produces its particular effect, and ultimately to predict the effect of newly synthesized chemical compounds [1,2]. Quantitative structure-property relationships have been known for a long time in organic chemistry, such as the regular increase of the boiling temperature in homologous series of alkanes (e.g. methane, ethane, propane, etc.) as a function of the number of carbon atoms. It was natural, therefore, to suspect that similar regularities exist between chemical structure and biological activity. The first observation of a quantitative structure-activity relationship was made independently by Meyer and Overton around 1900 [3,4]. They found that the potency of anaesthetics depended upon their lipophilicity, i.e. their tendency to dissolve in oil rather than in water. This discovery is generally regarded as the starting point of QSAR methodology. It pointed toward a physicochemical interaction between a drug and the biological materials of the organism in which the drug is to exert its desired effect. A crucial element of QSAR is the concept of the biological receptor, which emerged around the same time that Meyer and Overton made their historical observation. The concept states that a drug molecule must interact in a highly specific way with certain proteins that play a key role in the production of the desired effect. For example, if a pathological condition is due to an abnormally high activity of a certain enzyme or receptor protein, then one may attempt to block this enzyme or receptor by means of a chemical compound that binds specifically to it. This is the case with naloxone, which is a specific antagonist
of the opioid receptor and which is used as an antidote for poisoning by morphine and related narcotics. The stereospecificity of the drug-receptor interaction was discovered in 1894 by Fischer in his study of the cleavage of glycosides by yeast [5]. He devised the still famous 'lock and key' paradigm. A drug acts as a key which can turn the receptor on or off, and which must satisfy narrow constraints (although the key may be flexible and the lock may appear to be somewhat plastic). The actual term receptor was coined in 1913 by Ehrlich, the discoverer of salvarsan (arsphenamine) which was regarded as the 'magic bullet' for the treatment of syphilis [6]. The part of the drug molecule that interacts with the receptor is called the pharmacophore. Compounds that share the same pharmacophore are considered to be similar with respect to the biological activity that is triggered by their receptor. In practice, however, the therapeutic benefit may vary considerably within a series of similar compounds due to differences in absorption, secretion, metabolization, uptake by fatty tissues, transport through membranes, toxicity, etc. The design of drugs is both an art and a science, but rational approaches, such as QSAR, play an ever increasing role in the generation of new lead compounds and the optimization of existing leads. The foundation of QSAR as a practical tool of drug design took place around 1964 with the introduction of two so-called extrathermodynamic methods. One of these methods is based on so-called linear free energy relationships (LFER) between biological activities of structurally related (congeneric) drugs and the physicochemical properties of the chemical substituents on a common parent molecule. For example, benzoic acid may be regarded as a parent compound which can be substituted at the ortho, meta and para positions by chlorine, bromine, amine, hydroxyl, acetyl, etc. The latter are referred to as the substituent groups. The ortho, meta and para locations with respect to the carboxyl group in the parent benzoic acid molecule are called the substituent positions. This approach is also referred to as Hansch analysis. Another extrathermodynamic method is based on the additivity of the contributions to the biological activity by various substituent groups at multiple substituent positions, and is called Free-Wilson analysis. Both the Hansch and Free-Wilson methods make use of multiple linear regression. These approaches allow us to determine the combination of substituents that provide maximal activity in a series of structurally related molecules. These developments have been extensively reviewed by Martin [7] and Kubinyi [8]. Multivariate chemometric techniques have subsequently broadened the arsenal of tools that can be applied in QSAR. These include, among others, Multivariate ANOVA [9], Simplex optimization (Section 26.2.2), cluster analysis (Chapter 30) and various factor analytic methods such as principal components analysis (Chapter 31), discriminant analysis (Section 33.2.2) and canonical correlation analysis (Section 35.3). An advantage of multivariate methods is that they can be applied in
supervised and unsupervised modes. They can handle series of structurally unrelated compounds and some of them can be used in the case where multiple biological activities are obtained. As we will see later on, a drug may act on a wide variety of receptors and hence may produce a typical spectrum of biological activities. Spectral map analysis is a factor analytic method which makes use of this property (Section 31.3.5). Partial least squares (PLS) regression (Section 35.7) is one of the more recent advances in QSAR which has led to the now widely accepted method of Comparative Molecular Field Analysis (CoMFA). This method makes use of local physicochemical properties such as charge, potential and steric fields that can be determined on a three-dimensional grid that is laid over the chemical structures. The determination of steric conformation, by means of X-ray crystallography or NMR spectroscopy, and the quantum mechanical calculation of charge and potential fields are now performed routinely on medium-sized molecules [10]. Modern optimization and prediction techniques such as neural networks (Chapter 44) have also found their way into QSAR. Much attention is devoted today to the automatic searching of large libraries of chemical compounds in order to find interesting chemical structures that possess some similarity to known lead compounds. Current emphasis is on the searching of three-dimensional structures within libraries of hundreds of thousands of compounds [11]. Finally, our knowledge of proteins, specifically of enzymes and receptors, has greatly increased by the use of techniques from biotechnology. It is now possible in many cases to isolate and clone these macromolecules in pure form and to determine their primary amino acid sequence, as well as their exact three-dimensional structure. This has led to so-called rational drug design, by which drugs are designed by means of computer modelling rather than by the empirical (serendipitous) method of trial-and-error. In the light of these considerations one may regard QSAR as a multidisciplinary field of chemometrics. Statistics, optimization, pattern recognition and information technology converge in QSAR with chemistry, physicochemistry, biology, biochemistry and biotechnology. The start of QSAR coincides with the development of linear free energy relationships (LFER) between biological activities and the physicochemical properties of molecules. This development is generally referred to as the extrathermodynamic approach in QSAR, for reasons that will become apparent later on in this section. A vast amount of literature has appeared on the subject since 1964. The content of this section has been largely inspired by a review of the subject by Kubinyi [8]. As we have remarked above, biological activity is the result of the binding of a drug D onto a receptor protein (or enzyme) P, which results in the drug-receptor complex DP. The strength of the binding and, hence, the magnitude of the effect can be expressed by means of the change in Gibbs free energy ΔG between the
free D and P on the one hand and the bound DP on the other. According to classical thermodynamics, the free energy release ΔG is made up of an enthalpic and an entropic term [12]:

ΔG = ΔH − T ΔS   (37.1)
where ΔH is the change in enthalpy, ΔS is the change in entropy and T represents the absolute temperature. The above relation shows that the strength of the binding between drug and receptor increases with the release of enthalpy and with the gain in entropy (i.e. the decrease of order) in the drug-receptor complex. The free energy ΔG can also be related to the equilibrium constant K between the free and the bound state of drug and receptor:

ΔG = −2.303 RT log K   (37.2)
where R represents the gas constant. The relation shows that if the reaction between drug and receptor proceeds in the direction of the formation of the drug-receptor complex (K > 1), then a decrease of free energy will occur (ΔG < 0). Interactions of a drug with a receptor (or enzyme) are of a reversible, non-covalent nature. The free energy released by drug-receptor interactions is smaller by one to two orders of magnitude than that of a covalent bond. The interactions take place mainly by means of electrostatic, hydrophobic and dispersion forces. Electrostatic interactions between charged groups are enthalpic. Hydrogen bonding is a weak electrostatic interaction which takes place between electron-donating and electron-accepting groups, such as amine, ketone and hydroxyl groups. Hydrophobic interactions are mostly entropic. They can displace water molecules that are bound to the drug molecule or to the receptor surface. This diminishes the ordering of the water molecules and, hence, increases the entropy, which in turn lowers the free energy of the drug-receptor complex. Dispersion forces are attractive when a drug molecule closely approaches the receptor surface. They are repulsive when the van der Waals radii of drug and receptor molecules tend to overlap. The nature of dispersion forces is mainly enthalpic. The question that now arises is how to define the biological (or therapeutic) activity of a drug. There are good reasons for defining biological activity as the logarithm of the reciprocal of the effective dose (or concentration) that is needed in order to produce a well-defined biological effect. This is in accordance with the Weber-Fechner psychophysical law, which states that a biological response is an approximately linear function of the logarithm of the physical stimulus [13]. Furthermore, experimental observations also show that the doses or concentrations that are tolerated by living organisms follow a log-normal distribution. The reciprocal of the effective dose is required, since the more active or potent drugs require a lower
dose in order to produce the desired effect. Usually, an effective dose is defined as the dose which produces the required effect in half of the samples or subjects that have been tested. In classical pharmacology on living subjects this produces the so-called median effective dose (ED50). In biochemical pharmacology on isolated receptors, results are often reported in the form of median inhibitory concentrations (IC50). If we combine the previous observations with eq. (37.2), we obtain an expression which relates the biological activity (log 1/C) of a drug to the equilibrium constant (K) of the drug-receptor interaction:

log 1/C = a log K   (37.3)
where a denotes a proportionality constant. The next step in the development of the extrathermodynamic approach was to find a suitable expression for the equilibrium constant in terms of the physicochemical and conformational (steric) properties of the drug. Use was made of a physicochemical interpretation of the dissociation constants of substituted aromatic acids in terms of the electronic properties of the substituents. This approach had already been introduced by Hammett in 1940 [14]. The Hammett equation relates the dissociation constant K of a substituted benzoic acid (e.g. meta-chlorobenzoic acid) to the so-called Hammett electronic parameter σ:

log K = log K0 + ρσ   (37.4)
where K0 represents the dissociation constant of unsubstituted benzoic acid and ρ is a proportionality constant. A distinction is made between σ_m and σ_p, which take different values depending on the meta (m) or para (p) position of the substituent on the parent benzoic acid molecule. On the analogy of this physicochemical relation, one was led to define a biological Hammett equation which relates the equilibrium constant of the drug-receptor complex to the electronic σ parameters of the substituents (e.g. chlorine, bromine, methyl, ethyl, hydroxyl, carboxyl, acetyl, etc.) of the drug molecule. Since the equilibrium constant of a drug-receptor complex is reflected by the biological activity, this led to the first extrathermodynamic relationship in QSAR:

log 1/C = b0 + b1 σ   (37.5)
where b0 and b1 are coefficients that can be derived by means of linear regression (Chapter 8). For the historical reasons discussed above, the relation is also referred to as a linear free energy relationship (LFER). A fundamental assumption of the approach is that the contributions to σ from several substituent groups on the same parent compound are additive. The additivity assumption also holds for the more general Hansch model that will be discussed below.
37.1.1 Hansch analysis

It soon became apparent that the biological Hammett equation produced an unsatisfactory fit between the biological activity of a set of chemical analogs and the experimentally obtained electronic σ values of the substituent groups. Hansch [15,16] first proposed to extend the equation with a term which accounts for the hydrophobic interaction. The latter can be characterized by determining the partition of the drug between oil (octanol) and water. The ratio of the concentrations of the drug in the two phases at equilibrium is called the partition coefficient P, and the decimal logarithm of P is referred to as the lipophilicity. Lipophilicity measures the tendency of a drug to dissolve into fatty material. It is not exactly the same as hydrophobicity, which measures the ability of a drug to displace or expel water molecules at a binding site. Lipophilicity, however, is often a good approximation to hydrophobicity. This led to the Hansch equation, which reads in its simplest form:

log 1/C = b0 + b1 σ + b2 log P   (37.6)
or, when electrostatic interactions can be neglected:

log 1/C = b0 + b1 log P   (37.7)
where b0, b1 and b2 are coefficients that can be determined by means of multiple linear regression (see Chapter 10). The Hansch model can be expressed in the general form:

y = Xb + e
(37.8)
where X represents the matrix of independent parameters (extended by a unit column-vector in order to account for the constant term), y is the vector of observed biological activities, e contains the residuals between observed and computed activities, and b is the vector of regression coefficients. The above Hansch equations are also generally referred to as linear free energy relationships (LFER), as they are derived from the free energy concept of the drug-receptor complex. They also assume that biological activity is linearly related to the electronic and lipophilic contributions of the various substituents on the parent molecule. A typical Hansch analysis has been applied to the 50% inhibitory concentrations (IC50) of oxidative phosphorylation of 11 doubly substituted salicylanilides (Table 37.1), as reported by Williamson and Metcalf [17]. Multiple linear regression leads to the following model:

log 1/IC50 = −2.190 + 0.708 σ + 0.348 log P
TABLE 37.1
Lipophilicity (log P), Hammett electronic parameter (σ) and inhibitory concentration (IC50) for oxidative phosphorylation of 11 doubly substituted salicylanilides [17]. The two substitution positions are labeled A and B. (The parent salicylanilide structure drawn in the original is omitted here.)

#    A                      B                log P    σ        IC50
1    5-Cl-3-(4-Cl-C6H4)     4'-Cl            4.68     0.463    2.512
2    5-Cl                   2'-Cl-4'-NO2     2.12     1.724    2.042
3    5-Cl-3-C6H5            3',4'-Cl2        4.79     0.836    1.549
4    5-Cl-3-C6H5            2',5'-Cl2        4.55     0.836    1.445
5    5-Cl-3-C6H5            2',4',5'-Cl3     5.48     1.063    0.646
6    5-Cl-3-C6H5            3'-Cl-4'-NO2     4.36     1.879    0.617
7    5-Cl-3-(4-Cl-C6H4)     2'-Cl-5'-NO2     4.98     1.173    0.263
8    5-Cl-3-(4-Cl-C6H4)     2'-Cl-4'-CN      4.58     1.091    0.219
9    5-Cl-3-(4-Cl-C6H4)     2'-Cl-5'-CF3     5.93     0.878    0.200
10   5-Cl-3-(4-Cl-C6H4)     2',4',5'-Cl3     6.41     1.063    0.173
11   5-Cl-3-t-But           2'-Cl-4'-NO2     4.00     1.527    0.170
The residual standard deviation of the regression (s_e) equals 0.346, the coefficient of determination (R²) is 0.548 and the F-statistic amounts to 4.840 with 2 and 8 degrees of freedom (df), which is significant at the 0.05 level of significance (p). All coefficients of the regression are significant at the 0.05 level of probability, which lends evidence to the importance of both the electrostatic (σ) and the hydrophobic (log P) interactions. The coefficient of determination (R²) is small, however, and the error term is relatively large. This suggests that predicted IC50 values may deviate by more than a factor of two from the observed values. Inspection of the residuals indicates that there may be two outliers (compounds 8 and 11) in this data set that do not fit well to the proposed Hansch model. Frequently, the relationship between biological activity and log P is curved and shows a maximum [18]. In that case, quadratic and non-linear Hansch models have been proposed [19]. The parabolic model is defined as:

log 1/C = b0 + b1 log P + b2 (log P)²   (37.9)

which reaches a maximum at (log P)0 = −b1/(2 b2).
The bilinear model takes the form:

log 1/C = b0 + b1 log P + b2 log(b3 P + 1)   (37.10)
or, equivalently:

log 1/C = b0 + b1 log P + b2 log(b3 · 10^(log P) + 1)

which obtains a maximum at log P0 = log(−b1/(b3(b1 + b2))). Note that the lipophilicity parameter log P is defined as a decimal logarithm. The parabolic equation is only non-linear in the variable log P, but it is linear in the coefficients. Hence, it can be solved by multiple linear regression (see Section 10.8). The bilinear equation, however, is non-linear in both the variable P and the coefficients, and can only be solved by means of non-linear regression techniques (see Chapter 11). It is approximately linear with a positive slope (b1) for small values of log P, while it is also approximately linear with a negative slope (b1 + b2) for large values of log P. The term bilinear is used in this context to indicate that the QSAR model can be resolved into two linear relations, for small and for large values of P, respectively. This definition differs from the one which has been introduced in the context of principal components analysis in Chapter 17. A non-linear Hansch model has been applied to the bactericidal concentrations (C) of 17 doubly substituted phenols (Table 37.2), which have been reported by Klarmann et al. [20]. By means of multiple linear regression we obtain the parabolic Hansch model of eq. (37.9):

log 1/C = −3.474 + 2.298 log P − 0.225 (log P)²

with s_e = 0.178, R² = 0.933 and F = 97.1 with df = 2 and 14, p < 0.0001.
391 TABLE 37.2 Lipophilicity (log P) and bactericidal concentrations (O of 17 doubly substituted phenols [20] R
#
R
R'
logP
log 1/C
1
H
CI
2.39
0.81
2
Methyl
CI
2.89
1.34
3
Ethyl
CI
3.39
1.73
4
Propyl
CI
3.89
2.26
5
Butyl
CI
4.39
2.52
6
Amyl
CI
4.89
2.63
7
sec-Amyl
CI
4.69
2.23
8
Cyclohexyl
CI
4.90
2.25
9
Heptyl
CI
5.89
2.51
10
Octyl
CI
6.39
1.83
11
CI
H
2.15
0.50
12
CI
Methyl
2.65
0.91
13
CI
Ethyl
3.15
1.35
14
CI
Propyl
3.65
1.86
15
CI
Butyl
4.15
2.20
16
CI
Amyl
4.65
2.23
17
CI
tert-Amyl
4.33
2.00
Hansch analysis marked the breakthrough of QSAR. The method was soon extended with additional parameters with the aim of improving the fit between biological and physicochemical data and of predicting drugs with optimal activity. Measures of lipophilicity are mostly derived from partition coefficients P and from chromatographic retention times [21]. Rekker [22] derived a method for computing lipophilicity 'de novo', i.e. from first principles as opposed to experimentally, which improved on the experimental method used by Hansch, especially in the case of aliphatic compounds. Electronic parameters include Hammett's σ_m and σ_p, which we have already discussed, the (inductive) field and resonance parameters F and R, which have been derived from spectroscopic measurements by Swain and Lupton [23], the dipole moment, and various quantum mechanical properties which relate to the distribution of electrons in the molecule. The Swain and Lupton parameters F and R have been shown to be linear combinations of Hammett's σ derived from meta and para substituents of benzoic acid.
Fig. 37.1. Quadratic Hansch model fitted to the bactericidal activities (log 1/C) of 10 doubly substituted phenols in Table 37.2 as a function of lipophilicity (log P) [20].
It is claimed that the field and resonance parameters are less correlated than Hammett's electronic parameters. Steric parameters account for the geometric properties of the compounds. Among these we only cite the steric bulk factor E_s of Taft [24], the molecular weight MW, and the five STERIMOL variables which were proposed by Verloop [25] to describe the size and shape of molecules. Molar refractivity MR appears to be a parameter which correlates with lipophilicity and steric parameters. Connectivity indices have been introduced by Kier and Hall [26] for the description of the topological (graph) structure of molecules. A connectivity index is a single number which expresses how the atoms are arranged in a molecule. There are several different indices, each of which expresses a particular feature of the molecule, such as its degree of branchedness, the presence of double and triple bonds, of non-carbon atoms (apart from hydrogen), of aromatic rings, etc. Finally, indicator variables can be included in the Hansch equation in order to account for the presence or absence of specific chemical functions. The use of indicator variables will be discussed more amply in the following section, which deals with Free-Wilson analysis. Extensive data bases are now available which list lipophilicity, molar refractivity, electronic and steric values for a wide collection of substituents [27,28]. Starting from a particular parent compound, one computes the value of a physicochemical parameter (e.g. lipophilicity) for a given drug by adding the contributions
of all the substituents that have been introduced on the parent compound. Hence, for a particular analog derived from an unsubstituted parent compound we obtain:

log P = Σ_i (log P)_i   (37.11)

where the (log P)_i represent the tabulated lipophilicity contributions of the individual substituents.
Fig. 37.6. PLS biplot obtained from the pharmacological data in Table 37.9, after log double-centering and analysis by two-block PLS [56]. Circles represent 17 reference neuroleptic compounds, squares denote tests. Areas of circles and squares are proportional to the potencies of the compounds and the sensitivities of the tests, respectively. (In the original figure the compounds are spread between dopamine and norepinephrine poles.) Reproduced with permission of E.J. Karjalainen.
Fig. 37.7. PLS biplot derived from the biochemical binding data in Table 37.10, after log double-centering and analysis by two-block PLS [56]. The reading rules are the same as those indicated with Fig. 37.6. (In the original figure the poles are labelled dopamine, serotonin and norepinephrine.) Reproduced with permission of E.J. Karjalainen.
correlating structural properties with positions on the maps. It appears that compounds attracted by the dopamine pole are chiefly of the butyrophenone and diphenylbutyl type, while those attracted by the adrenergic pole belong to the phenothiazine class. In a similar way as with PLS regression, one may determine the optimal number of components that must be included for obtaining the most reliable predictions, using cross-validation and minimization of PRESS (Section 36.3). Two-block PLS can be extended to multi-block PLS for the prediction of drug activities from several independent predictor blocks [58]. For example, one may attempt to predict clinical effects (the ultimate goal of drug design) from all available information (pharmacological, biochemical, pharmacokinetic, toxicological, etc.). Multi-block PLS has been used to predict clinical scores for various antischizophrenic effects [59] from in-vivo and in-vitro data [56]. The analysis showed that only a single component of clinical effects could be reliably predicted from animal and receptor studies, while all higher order clinical and biological components showed only weak correlations.
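A sketch of how such a two-block analysis might be set up is given below, using scikit-learn's PLS2 implementation as a stand-in for the algorithm of Ref. [56]. A and B are assumed to be strictly positive compound-by-test activity tables, such as Tables 37.9 and 37.10 (not reproduced here); all names and the component count are illustrative.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def log_double_center(A):
    L = np.log10(A)
    return L - L.mean(axis=1, keepdims=True) - L.mean(axis=0) + L.mean()

def two_block_pls(A, B, n_components=2):
    pls = PLSRegression(n_components=n_components, scale=False)
    pls.fit(log_double_center(A), log_double_center(B))
    return pls.x_scores_, pls.y_scores_   # compound coordinates for the two biplots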
37.5 Other approaches

In this chapter we have only addressed a selected number of topics, and for lack of space we have left out many others. Cluster analysis has played a larger role in QSAR than appears from our overview. This technique is an established QSAR tool for the recognition or classification of known patterns [38,60] as well as for the cognition or detection of novel patterns [61]. Neural networks have been introduced in QSAR for non-linear Hansch analyses. The Perceptron, which is generally considered as a forerunner of neural networks, was developed by the Russian school of Rastrigin and coworkers [62] within the context of QSAR. The learning machine is another prototype of the neural network, which was introduced in QSAR by Jurs et al. [63] for the discrimination between different types of compounds on the basis of their properties. Decision trees for optimal drug design strategies have been proposed by Topliss [64] and by Purcell et al. [65]. The determination of the three-dimensional conformation of molecules is an important aspect of QSAR; it can be obtained from X-ray crystallography [66], NMR spectroscopy or, in the case of small molecular fragments, from quantum-mechanical calculations [67,68]. A topic of current interest is the study of receptor proteins and enzymes, for which data bases with crystallographic information are now becoming available. Computer modelling of the active sites of receptors and enzymes is an important tool in rational drug design. Principal components and cluster analysis can be applied to the primary
sequences of these proteins in order to derive classifications and phylogenetic relationships [69].

Of great importance is the searching of very large data bases (with several thousands of compounds) for molecular structures that bear similarity to a known lead compound. The standard approach is to start from connectivity tables, which describe the two-dimensional (graph) structure of the molecules in an unambiguous way. In the case of a molecule with n atoms, the n×n connectivity table defines the type of each atom (on the main diagonal) and the type of bond between each pair of atoms (on the off-diagonal positions). Similarity indices between compounds can be computed by various techniques, ranging from global connectivity indices [70] to atom-to-atom mapping [11]. Alternatively, molecular data bases may be searched for compounds that are as widely dissimilar as possible, in order to form standard panels of screening compounds. Principal components analysis has been used to reduce a large number of structural indices to a small set of independent descriptors that can be more easily submitted to pattern recognition techniques [71].

As we have stated in the introduction to this chapter, and as appears from this overview, a wide variety of chemometric methods converges in QSAR, which plays a key role in the design of novel and improved drugs.
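By way of illustration, such a similarity search can be sketched as follows. This is a minimal sketch which assumes, hypothetically, that each molecule has already been encoded as a binary fingerprint derived from its connectivity table; the Tanimoto (Jaccard) index used here is one common choice of similarity measure.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    both = np.logical_and(a, b).sum()     # bits set in both molecules
    either = np.logical_or(a, b).sum()    # bits set in at least one molecule
    return both / either

rng = np.random.default_rng(1)
lead = rng.integers(0, 2, 256)                 # fingerprint of the lead compound (simulated)
library = rng.integers(0, 2, (1000, 256))      # data base of candidate compounds (simulated)

scores = [tanimoto(lead, m) for m in library]
print(int(np.argmax(scores)))                  # index of the most similar library compound
```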
References

1. A. Burger (Ed.), Medicinal Chemistry. Wiley, New York, 1970.
2. E.J. Ariens (Ed.), Drug Design, Vols. I-X. Academic Press, New York, 1971-1980.
3. H. Meyer, Zur Theorie der Alkoholnarkose I. Welche Eigenschaft der Anesthetica bedingt ihre narkotische Wirkung? Arch. Exp. Pathol. Physiol., 42 (1899) 109-118.
4. E. Overton, Osmotic properties of cells in their bearing on toxicology and pharmacy. Zeitschr. Physik. Chem., 22 (1897) 189-209.
5. E. Fischer, Einfluss der Configuration auf die Wirkung der Enzyme. Ber. Dtsch. Chem. Ges., 27 (1894) 2985-2993.
6. Dorland's Illustrated Medical Dictionary, 26th edn. Saunders, Philadelphia, PA, 1981.
7. Y.C. Martin, Quantitative Drug Design. A Critical Introduction. Marcel Dekker, New York, 1978.
8. H. Kubinyi, QSAR: Hansch Analysis and Related Approaches. VCH, Weinheim, 1993.
9. P.P. Mager, The Masca model of pharmacochemistry: II. Rational empiricisms in the multivariate analysis of opioids. In: Drug Design (E.J. Ariens, Ed.), Vol. X. Academic Press, New York, 1980, pp. 343-401.
10. J.W. McFarland, Comparative Molecular Field Analysis (CoMFA) of anticoccidial triazines. J. Med. Chem., 35 (1992) 2543-2550.
11. C. Pepperrell, Three-Dimensional Chemical Similarity Searching. Research Studies Press (J. Wiley), Taunton, UK, 1994.
12. A. Miklavc, D. Kocjan, J. Mavri, J. Koller and D. Hadzi, On the fundamental difference in thermodynamics of agonist and antagonist interactions with β-adrenergic receptors and the mechanism of entropy-driven binding. Biochem. Pharmacol., 40 (1990) 663-669.
13. G.T. Fechner, Elemente der Psychophysik, 1907. Reprinted in: Elements of Psychophysics (D.H. Davis, Ed.). Holt, Rinehart and Winston, New York, 1966.
14. L.P. Hammett, Physical Organic Chemistry. McGraw-Hill, New York, 1940.
15. C. Hansch and T. Fujita, ρ-σ-π Analysis. A method for the correlation of biological activity and chemical structure. J. Am. Chem. Soc., 86 (1964) 1616-1626.
16. C. Hansch and W.J. Dunn III, Linear relationships between lipophilic character and biological activity of drugs. J. Pharmaceut. Sci., 61 (1972) 1-19.
17. R.L. Williamson and R.L. Metcalf, Salicylanilides: a new group of active uncouplers of oxidative phosphorylation. Science, 158 (1967) 1694-1695.
18. C. Hansch and J.M. Clayton, Lipophilic character and biological activity of drugs II: The parabolic case. J. Pharmaceut. Sci., 62 (1973) 1-21.
19. J.W. McFarland, On the parabolic relationship between drug potency and hydrophobicity. J. Med. Chem., 13 (1970) 1192-1196.
20. E. Klarmann, V.A. Shternov and L.W. Gates, The alkyl derivatives of halogen phenols and their bactericidal action. I. Chlorophenols. J. Am. Chem. Soc., 55 (1933) 2576-2589.
21. L. Buydens, D.L. Massart and P. Geerlings, Gas chromatographic behaviour and pharmacological activity of neuroleptica. Anal. Chim. Acta, 174 (1985) 237-244.
22. R.F. Rekker and R. Mannhold, Calculation of Lipophilicity. The Hydrophobic Fragmental Constant Approach. VCH, Weinheim, Germany, 1992.
23. C.G. Swain and E.C. Lupton, Field and resonance components of substituent effects. J. Am. Chem. Soc., 90 (1968) 4328-4337.
24. R.W. Taft, Separation of polar, steric and resonance effects in reactivity. In: Steric Effects in Organic Chemistry (M.S. Newman, Ed.). Wiley, New York, 1956, pp. 556-575.
25. A. Verloop, The STERIMOL Approach to Drug Design. Marcel Dekker, New York, 1987.
26. L.B. Kier and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis. Wiley, New York, 1986.
27. C. Hansch and A. Leo, Substituent Constants for Correlation Analysis in Chemistry and Biology. Wiley, New York, 1979.
28. G. Grassy, R. Lahana and Oxford Molecular Ltd., TSAR, Tools for Structure-Activity Relationships, User's Guide, Issue 3. Proprietary software developed by Oxford Molecular Ltd., Oxford, UK, 1993.
29. S.M. Free and J.W. Wilson, A mathematical contribution to structure-activity relationships. J. Med. Chem., 7 (1964) 395-399.
30. T. Fujita and T. Ban, Structure-activity study of phenethylamines as substrates of biosynthetic enzymes of sympathetic transmitters. J. Med. Chem., 14 (1971) 148-152.
31. J.L. Spencer, J.J. Hlavka, J. Petisi, H.M. Krazinski and J.H. Boothe, 6-Deoxytetracyclines. V. 7,9-Disubstituted products. J. Med. Chem., 6 (1963) 405-407.
32. P.N. Craig, Comparison of the Hansch and Free-Wilson approaches to structure-activity correlation. In: Biological Correlations — The Hansch Approach (R.F. Gould, Ed.). Advances in Chemistry Series, No. 114. American Chemical Society, Washington, DC, 1972, pp. 115-129.
33. P.N. Craig, Interdependence between physical parameters and selection of substituent groups for correlation studies. J. Med. Chem., 14 (1971) 680-684.
34. F. Darvas, Application of the sequential simplex method in designing drug analogs. J. Med. Chem., 17 (1974) 799-804.
35. B.R. Kowalski and C.F. Bender, The application of pattern recognition to screening prospective anticancer drugs. J. Am. Chem. Soc., 96 (1974) 916-918.
36. R. Franke, S. Dove and R. Kuehne, Hydrophobicity and hydrophobic interactions, 1. On the physical nature of aromatic hydrophobic substituent constants. Eur. J. Med. Chem., 14 (1979) 363-374.
37. P.J. Lewi, Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneim.-Forsch. (Drug Res.), 26 (1976) 1295-1300.
38. B.R. Kowalski and C.F. Bender, Pattern recognition. A powerful approach to interpreting chemical data. J. Am. Chem. Soc., 94 (1972) 5632-5639.
39. C. Hansch, Quantitative approaches to pharmacological structure-activity relationships. In: Structure-Activity Relationships (C.J. Cavallito, Ed.), Vol. 1. Pergamon, Oxford, 1973, pp. 75-165.
40. P.J. Lewi, Multivariate data analysis in structure-activity relationships. In: Drug Design (E.J. Ariens, Ed.), Vol. X. Academic Press, New York, 1980.
41. J. Schmutz, Neuroleptic piperazinyl-dibenzo-azepines. Arzneim.-Forsch. (Drug Res.), 25 (1975) 712-719.
42. K.R. Gabriel, The biplot graphical display of matrices with application to principal component analysis. Biometrika, 58 (1971) 453-467.
43. P.J. Lewi, Multidimensional data representation in medicinal chemistry. In: Chemometrics. Mathematics and Statistics in Chemistry (B.R. Kowalski, Ed.). Reidel, Dordrecht, 1984, pp. 351-376.
44. P.J. Lewi, Spectral mapping of drug-test specificities. In: Advanced Computer-Assisted Techniques in Drug Discovery (H. van de Waterbeemd, Ed.). VCH, Weinheim, Germany, 1994, pp. 219-253.
45. P.L. Wood, S.E. Charleson, D. Lane and R.L. Hudgin, Multiple opiate receptors: differential binding of mu, kappa and delta agonists. Neuropharmacology, 20 (1981) 1215-1220.
46. G. Calomme and P.J. Lewi, Multivariate analysis of structure-activity data. Spectral map of opioid narcotics in receptor binding. Actual. Chim. Therap., S.11 (1984) 121-126.
47. J.-P. Benzecri, Analyse des Données. Analyse des Correspondances. Dunod, Paris, 1973.
48. C.J.E. Niemegeers and P.A.J. Janssen, A systematic study of the pharmacology of DA-antagonists. Life Sci., 24 (1979) 2201-2216.
49. M.J. Greenacre, Correspondence Analysis in Practice. Academic Press, London, 1993.
50. R. Franke, Optimierungs-Methoden in der Wirkstoff-Forschung. Quantitative Struktur-Wirkungs-Analyse. Akademie-Verlag, Berlin, 1980.
51. T.W. Anderson, Introduction to Multivariate Statistical Analysis. Wiley, New York, 1984.
52. S. Wold, C. Albano, W.J. Dunn III, K. Esbensen, S. Hellberg, E. Johansson and M. Sjöström, Pattern recognition: finding and using patterns in multivariate data. In: Food Research and Data Analysis (H. Martens and H. Russwurm Jr., Eds.). Applied Science, London, 1983, p. 147.
53. S. de Jong, PLS fits closer than PCR. J. Chemom., 7 (1993) 551-557.
54. R.D. Cramer III, D.E. Patterson and J.D. Bunce, Comparative Molecular Field Analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc., 110 (1988) 5959-5967.
55. L.M. Kauvar, D.L. Higgins, H.O. Villar, J.R. Sportman, A. Engqvist-Goldstein, R. Bukar, K.E. Bauer, H. Dilley and D.M. Rocke, Predicting ligand binding to proteins by affinity fingerprinting. Chem. Biol., 2 (1995) 107-118.
56. P.J. Lewi, B. Vekemans and L.M. Gypen, Partial least squares (PLS) for the prediction of real-life performance from laboratory results. In: Scientific Computing and Automation (Europe) (E.J. Karjalainen, Ed.). Elsevier, Amsterdam, 1990, pp. 199-209.
57. J.E. Leysen, Review of neuroleptic receptors: specificity and multiplicity of in-vitro binding related to pharmacological activity. In: Clinical Pharmacology in Psychiatry (E. Usdin, S. Dahl, L.F. Gram and O. Lingjaerde, Eds.). MacMillan, London, 1982, pp. 35-62.
58. L.E. Wangen and B.R. Kowalski, A multiblock partial least squares algorithm for investigating complex chemical systems. J. Chemom., 3 (1988) 3-10.
59. J. Bobon, D.P. Bobon, A. Pinchard, J. Collard, T.A. Ban, R. De Buck, H. Hippius, P.A. Lambert and O.A. Vinar, A new comparative physiognomy of neuroleptics: a collaborative clinical report. Acta Psych. Belg., 72 (1972) 542-554.
60. C. Hansch, S.H. Unger and A.B. Forsythe, Strategy in drug design. Cluster analysis as an aid in the selection of substituents. J. Med. Chem., 16 (1973) 1212-1222.
61. D.L. Massart and L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis. Wiley, New York, 1983.
62. S.A. Hiller, V.E. Golender, A.B. Rosenblit, L.A. Rastrigin and A.B. Glaz, Cybernetic methods of drug design. 1. Statement of the problem — The Perceptron approach. Comp. Biomed. Res., 6 (1972) 411-421.
63. P.C. Jurs, B.R. Kowalski and T.L. Isenhour, Computerized learning machines applied to chemical problems. Molecular formula determination from low resolution mass spectrometry. Anal. Chem., 41 (1969) 21-27.
64. J.G. Topliss, Application of operational schemes for analog synthesis in drug design. J. Med. Chem., 15 (1972) 1001-1011.
65. W.P. Purcell, G.E. Bass and J.M. Clayton, Strategy of Drug Design: A Guide to Biological Activity. Wiley, New York, 1973.
66. J.P. Tollenaere, H. Moereels and L.A. Raymaekers, Atlas of Three-Dimensional Structure of Drugs. Elsevier, Amsterdam, 1979.
67. L.B. Kier, Molecular Orbital Theory in Drug Research. Academic Press, New York, 1971.
68. P.S. Portoghese, In: Molecular and Quantum Pharmacology (E.D. Bergman and B. Pullman, Eds.). Reidel, Dordrecht, 1974, pp. 352-353.
69. P.J. Lewi and H. Moereels, Receptor mapping and phylogenetic clustering. In: Advanced Computer-Assisted Techniques in Drug Discovery (H. van de Waterbeemd, Ed.). VCH, Weinheim, Germany, 1994, pp. 131-162.
70. L.B. Kier and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis. Wiley, New York, 1986.
71. S.C. Basak, V.R. Magnuson, G.J. Niemi and R.R. Regal, Determining structural similarity of chemicals using graph-theoretic indices. Discr. Appl. Math., 19 (1988) 17-44.
Chapter 38
Analysis of Sensory Data

38.1 Introduction

The determination and analysis of sensory properties plays an important role in the development of new consumer products. Particularly in the food industry sensory analysis has become an indispensable tool in research, development, marketing and quality control. The discipline of sensory analysis covers a wide spectrum of subjects: physiology of sensory perception, psychology of human behaviour, flavour chemistry, physics of emulsion break-up and flavour release, testing methodology, consumer research and statistical data analysis. Not all of these aspects are of direct interest for the chemometrician. In this chapter we will cover a few topics in the analysis of sensory data. General introductory books are, e.g., Refs. [1-3].

There are four main types of data that frequently occur in sensory analysis: pair-wise differences, attribute profiling, time-intensity recordings and preference data. We will discuss in what situations such data arise and how they can be analyzed. Especially the analysis of profiling data and the comparison of such data with chemical information call for a multivariate approach. Here we can apply some of the techniques treated before, particularly those of Chapters 35 and 36.

38.2 Difference tests

38.2.1 Triangle test

Let us consider a product developer who is trying to improve the taste of an existing product. The first question one could ask (and should ask before continuing) with the new product is: does the new product taste different from the old product? If trained panellists cannot establish a significant difference, it is hardly justifiable to do consumer tests, let alone launch the product on the market. A standard overall difference test is the triangle test (Fig. 38.1). In such a test one presents three samples, in no particular order, which should be tasted. Two of the three samples are identical (e.g. the existing product, as a control) and the task is to identify the odd sample (the new product).
Fig. 38.1. Triangle test: two similar products and one different product are presented; the assessor has to indicate the product that is different.
If enough panellists correctly recognize the dissimilar sample, one knows that a sensory difference exists. Table 38.1 gives the critical values for different panel sizes and different significance levels. For example, with 27 panellists (n = 27) one concludes that there is a difference (at the α = 5% significance level) when at least 14 correct responses are obtained. One should be aware that the conclusion of 'no significant difference detected' may not be interpreted as 'no difference present'. If the panel is small the triangle test has low power, i.e. there is a high probability that real differences go unnoticed (see Sections 4.7 and 5.3). Therefore, a sizeable panel of at least 25 assessors is recommended. When the assessors have been selected and trained for their task the panel size may be somewhat smaller (at least 20). In general, economizing on the number of assessors for sensory tests may be a false economy: choosing a small inexpensive panel may result in great losses due to missing real opportunities for introducing improved products with increased market share.

38.2.2 Duo-trio test

In the duo-trio test one presents to each panellist (or 'subject') an identified reference sample, followed by two coded samples, one of which matches the reference sample (Fig. 38.2). The subjects are asked to indicate which of the two coded samples matches the reference. If enough correct replies are obtained the two coded samples are perceived as different. Table 38.2 gives the critical values
Fig. 38.2. Duo-trio test: two different products are presented; the assessor has to indicate which of these two is similar to a third product.
TABLE 38.1

Table of critical values for the triangle test for differences. For each panel size n (n = 1, ..., 100) the table lists the minimum number of correct responses needed to conclude a difference at the significance levels α = 10%, 5%, 1% and 0.1% (e.g. for n = 27 the critical value at α = 5% is 14).
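Critical values such as those of Tables 38.1 and 38.2 follow from the binomial distribution of the number of correct responses under pure guessing (guessing probability p = 1/3 for the triangle test, p = 1/2 for the duo-trio test). A minimal sketch using scipy (the function name is ours):

```python
from scipy.stats import binom

def critical_value(n, p, alpha):
    """Smallest x such that P(X >= x) <= alpha when all n panellists
    are guessing, i.e. X ~ binomial(n, p)."""
    x = 0
    while binom.sf(x - 1, n, p) > alpha:   # sf(x-1) = P(X >= x)
        x += 1
    return x

print(critical_value(27, 1/3, 0.05))   # 14, as quoted for the triangle test
print(critical_value(30, 1/2, 0.05))   # 20, as quoted for the duo-trio test
```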
TABLE 38.2

Table of critical values for the duo-trio test. For each panel size n (n = 1, ..., 100) the table lists the minimum number of correct responses needed to conclude a difference at the significance levels α = 10%, 5%, 1% and 0.1% (e.g. for n = 30 the critical value at α = 5% is 20).
for various panel sizes and significance levels. For example, with a panel of n = 30 panellists one would need at least 20 correct answers to conclude that there is a significant (α = 5%) difference.

38.2.3 Paired comparisons

In paired comparison tests two different samples are presented and one asks which of the two samples has 'most' of the sensory property of interest, e.g. which of two products has the sweetest taste (Fig. 38.3). The pairs are presented in random order to each assessor and preferably tested twice, reversing the presentation order in the second tasting session. Fairly large numbers (>30) of test subjects are required. If there are more than two samples to be tested, one may compare all possible pairs ('round robin'). Since the number of possible pairs grows rapidly with the number of different products, this is only practical for sets of three to six products. By combining the information of all paired comparisons for all panellists one may determine a rank order of the products and determine significant differences.

For example, in a paired comparison one compares three food products: (A) the usual freeze-dried form, (B) a new freeze-dried product, (C) the new product, not freeze-dried. Each of the three pairs is tested twice by 13 panellists, in two different presentation orders: A-B, B-A, A-C, C-A, B-C, C-B. The results are given in Table 38.3. The results of such multiple paired comparison tests are usually analyzed with Friedman's rank sum test [4] or with more sophisticated methods, e.g. the one using the Bradley-Terry model [5]. A good introduction to the theory and applications of paired comparison tests is David [6]. Since Friedman's rank sum test is based on less restrictive, ordering assumptions, it is a robust alternative to two-way analysis of variance, which rests upon the normality assumption.

For each panellist (and presentation) the three products are scored, i.e. a product gets a score 1, 2 or 3 when it is preferred twice, once or not at all, respectively. The rank scores are summed for each product i. One then tests the hypothesis that this result could be obtained under the null hypothesis that there is no difference between the three products and that the ranks were assigned randomly. Friedman's test statistic for this reads

χ² = [12 / (nk(k + 1))] Σ_i R_i² - 3n(k + 1)

where n is the number of rankings, k the number of products and R_i the rank sum of product i.
Fig. 38.3. Paired comparison test: two different products are presented and the assessor has to indicate the one that has most of a specified attribute.
TABLE 38.3

Results of pairwise comparison of three products in two presentation orders by 13 panellists. For each possible outcome pattern of the three paired comparisons the table lists the observed frequency and the corresponding rank scores of the products A, B and C (e.g. the pattern A>B, A>C, B>C was observed 7 times and gives ranks 1, 2 and 3 to A, B and C, respectively).
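Friedman's test is readily available in statistical software. The sketch below uses scipy's implementation on invented rank scores for three products by 13 panellists; the numbers are illustrative only and do not reproduce Table 38.3.

```python
from scipy.stats import friedmanchisquare

# Rank scores (1 = best) per panellist for three products, invented data.
A = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1]
B = [2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 3]
C = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2]

stat, p = friedmanchisquare(A, B, C)   # Friedman rank sum test
print(round(stat, 2), round(p, 4))     # reject equality of products if p is small
```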
Fig. 38.9. Consensus plot showing the relative position of 7 cheese products (A-G) as assessed by a panel in 3 sessions. Triangles for each product indicate the three sessions. Differences between the three sessions are much smaller than differences between products.
6-dimensional space. In this case a two-dimensional projection onto the first two principal components of the GPA average configuration is a good enough approximation, accounting for 83% of the total variation. The triangles in the plot around each product indicate the average positions obtained at the three sessions. Clearly, the differences among the products exceed the within-product, between-session variability. Therefore, an interpretation of the results with regard to the products can be meaningful. Conclusions which can be drawn from the graph are, for example: A takes a somewhat isolated position, E and F are close, so are C and G, B is an 'average' product, the lower right area is empty, and so on. For the product developer who has background knowledge about the various products, such a graphical summary of the sensory properties can be a useful aid in his work.

For an interpretation of the principal axes one may draw a correlation plot. This is a plot of the loadings (correlations) of the individual attributes with the principal axis scores. Figure 38.10 shows an example of an international collaborative study involving panels from five different institutes [11]. The aim was to assess the degree of cross-cultural differences in the sensory perception of coffee. The panels characterized eight brands of coffee, each with an independently developed list of attributes. The correlation plot reveals that attributes with the same or a similar name are in general positioned close together. This lends credit to the 'objectivity' of the QDA technique. Attributes which are close to the circle of radius 1 are well represented by the 2-dimensional space of the first two principal axes. Thus, the correlation plot and the reference to the common configuration are helpful in judging the relations between the various attributes within and between different panels. One may also try to label each principal axis with a name that is
Fig. 38.10. Correlations of attributes with consensus principal axes.
suggested by attributes which are highly (positively or negatively) correlated with that axis. This is not always an easy task. Sometimes it is easier to distinguish main factors that are rotated with respect to the principal axes.

One may also analyze which of the individual sets (i.e. panellists) are close to the mean and which are more deviant. For this analysis one determines the residuals for all products and attributes between the mean configuration and the individual data sets after the optimal Procrustes transformation. One then strings out each individual residual data set in the form of a long row vector. These row vectors are collected into a matrix where each row is associated with a panellist. A PCA of this matrix shows, in a score plot, the relative position of each panellist as a deviation from the mean. Figure 38.11 shows this plot for the 8 panellists of the cheese study. It reveals panellist 8 as being furthest removed from the rest. This panellist perhaps needs additional training.

It is not strictly required to use the same attributes in each data set. This allows the comparison of independent QDA results obtained by different laboratories or development departments in collaborative studies. Also within a single panel, individual panellists may work with 'personal' lists of attributes. When the sensory attributes are chosen freely by the individual panellists one speaks of Free Choice Profiling. When each panellist uses such a personal list of attributes, it is likely that
Fig. 38.11. Deviations of the 8 panellists from the consensus, based on a PCA of the residuals × panellist table, where the residuals comprise all products, sessions and attributes.
the number of variables differs from panellist to panellist. In that case it is convenient to add dummy columns filled with zeros so that all panellists have data sets of the same, maximum, size. This so-called zero-padding does not affect the analysis.

So far, the nature of the variables was the same for all data sets, viz. sensory attributes. This is not strictly required. One may also analyze sets of data referring to different types of variables (processing conditions, composition, instrumental measurements, sensory variables). However, regression-type methods are better suited for linking such diverse data sets, as explained in the next section.
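The Procrustes matching that underlies the above analyses can be sketched in a few lines. This is a minimal sketch of a single orthogonal Procrustes rotation (the full GPA alternates such fits with updates of the consensus), using the standard SVD solution and simulated configurations.

```python
import numpy as np

def procrustes_rotate(Y, X):
    """Rotate (or reflect) configuration Y to best match X in the
    least-squares sense; both matrices are assumed column-centred."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    R = U @ Vt                  # optimal orthogonal transformation
    return Y @ R

rng = np.random.default_rng(2)
X = rng.normal(size=(7, 2))                                  # consensus: 7 products, 2 dims
Y = X @ np.array([[0.0, -1.0], [1.0, 0.0]]) + 0.05 * rng.normal(size=(7, 2))  # rotated, noisy copy

print(np.sum((procrustes_rotate(Y, X) - X) ** 2))            # small residual after matching
```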
38.6 Linking sensory data to instrumental data

The objective of relating sensory measurements to instrumental measurements is twofold. A first objective is that it may lead to a better understanding of the sensory attributes. One should realize that such a goal can usually be met only partly, since sensory perception is a highly complex process. The instrumental measurements too may be the result of complex processes. For example, the force recorded with an Instron instrument when compressing a food sample depends in an intricate way on its flow behaviour and breaking properties, which themselves are determined by the sample's internal structure. A second goal of relating the two types of measurements is that instrumental measurements may eventually replace the sensory panel. The driving force behind this second objective is that instrumental measurements are cheaper. Not much success has been achieved in this area, due to the complexity of human sensory perception.
When relating instrumental measurements to sensory data one should focus on QDA-type data. Hedonic (or 'liking') scores and preference data are generally not well suited for comparison with instrumental measurements, since there usually will not be a linear relationship. A simple example is the saltiness of, say, a soup. A QDA panel can be used to 'measure' saltiness as a function of salt concentration. Over a small concentration range the response may be approximately linear. At higher concentrations the response may flatten off, and in analogy with an analytical instrument one may consider that the panel is then performing outside its linear range. With preference testing the nature of the non-linearity is quite different. One does not measure the saltiness per se, but the condition that is best liked. Liking scores will show an optimum at some intermediate level of saltiness, such that the salty taste is neither too weak nor too strong.

A table of correlations between the variables from the instrumental set and the variables from the sensory set may reveal some strong one-to-one relations. However, with a battery of sensory attributes on the one hand and a set of instrumental variables on the other, it is better to adopt a multivariate approach, i.e. to look at many variables at the same time, taking their intercorrelations into account. An intermediate approach is to develop separate multiple regression models for each sensory attribute as a linear function of the physical/chemical predictor variables.

Example

Beilken et al. [12] have applied a number of instrumental measuring methods to assess the mechanical strength of 12 different meat patties. In all, 20 different physical/chemical properties were measured. The products were tasted twice by 12 panellists divided over 4 sessions in which 6 products were evaluated for 9 textural attributes (rubberiness, chewiness, juiciness, etc.). Beilken et al. [12] subjected the two sets of data, viz. the instrumental data and the sensory data, to separate principal component analyses. The relation between the two data sets, mechanical measurements versus sensory attributes, was studied via their intercorrelations. Although useful information can be derived from such bivariate indicators, a truly multivariate regression analysis may give a simpler overall picture of the relation. In recent years the application of techniques such as PLS regression to link the block of sensory variables to the block of predictor variables has become popular. PLS regression is well suited to data sets with relatively few objects and many highly correlated variables. It provides an analysis in terms of a few latent variables that often allows a meaningful interpretation and an effective graphical summary. When we analyze the data of Beilken et al. with PLS2 regression (see Section 35.7), a two-dimensional model is found to account for 65-90% of the variance of the sensory attributes, with the exception of the attributes juicy and greasy, which cannot be modelled well with this set of explanatory variables.
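A minimal sketch of such a PLS2 analysis is given below, with simulated data standing in for the meat-patty study; scikit-learn's PLSRegression handles multiple Y-variables, i.e. PLS2, and the dimensions mirror the example (12 products, 20 instrumental variables, 9 sensory attributes).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
T = rng.normal(size=(12, 2))                                   # two underlying latent variables
X = T @ rng.normal(size=(2, 20)) + 0.1 * rng.normal(size=(12, 20))  # instrumental block
Y = T @ rng.normal(size=(2, 9)) + 0.1 * rng.normal(size=(12, 9))    # sensory block

pls = PLSRegression(n_components=2).fit(X, Y)
print(pls.score(X, Y))         # fraction of Y-variance explained by the 2-dim model
scores = pls.x_scores_         # product coordinates (cf. Fig. 38.12)
loadings_Y = pls.y_loadings_   # sensory-attribute loadings (cf. Fig. 38.14)
```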
Fig. 38.12. Scores of products (meat patties) on the first two PLS dimensions.
Figure 38.12 shows the position of the twelve meat patties in the space of the first two PLS dimensions. Such plots reveal the similarity of certain products (e.g. C and D, or E and G) or the extreme position of some products (e.g. A, I or L). Figure 38.13 shows the loadings of the instrumental variables on these PLS factors and Fig. 38.14 the loadings of the sensory attributes. The plot of the products in the
Fig. 38.13. Loadings of predictor (instrumental) variables on the first two PLS dimensions.
Fig. 38.14. Loadings of dependent (sensory) variables on the first two PLS dimensions.
space of the PLS dimensions has a fair resemblance to the PCA scores plot only for the sensory variables. As a consequence, the PLS loading plot of the sensory variables (Fig. 38.14) gives a similar picture as a PCA loading plot would give. The associations between the variables within each set are immediately apparent from Fig. 38.13 or Fig. 38.14. For example, "hardness", "hrdxchv", "R-punch" and "CP-peak" are all highly correlated and indicate the firmness of the product. As another example, "tensile" strength is positively correlated with the amount of protein and negatively correlated with "moisture" and "fat". It would require an exhaustive inspection of the 20×9 correlation table to obtain conclusions about the relationship between variables of the two sets similar to those derived from simply comparing or overlaying Figs. 38.13 and 38.14.
38.7 Temporal aspects of perception

In the foregoing we loosely talked about the intensity of a sensory attribute for a given sample, as if the assessors perceive a single (scalar) response. In reality, perception is a dynamic process, and a very complex one. For example, when a food product is taken in the mouth, the product disintegrates, emulsions are broken, flavours are released and transported from the mouth to the olfactory (smell) receptors in the nose. The measurement of these processes, analyzing and interpreting the results and, eventually, their control is of importance to the food
Fig. 38.15. Example of time-intensity (TI) curves.
manufacturer. There are many ways to study the temporal aspects of sensory perception. Experimentally, many methods have been developed to measure so-called time-intensity curves or TI-curves. Currently popular methodology is to use a slide-wire potentiometer or a computer mouse and to feed the data directly into the computer. The sort of curves that are obtained is shown in Fig. 38.15. Typically, one may characterize such a curve by a number of parameters, such as time-to-maximum-intensity, maximum intensity, time of decay and total area. The way to average such curves over panellists in order to derive a panel-average TI-curve is not trivial. Geometrical averaging in both the intensity and the time direction may help to best preserve shape. Separate analysis of variance of the characteristic parameters of the average curves can be used to assess the differences between the products. One may also try to fit a parametric function to each individual curve, for example a combination of two exponential functions (see Chapters 11 and 39). The curves are then characterized by their best-fitting parameters, and these are compared in a subsequent analysis. Another method is to leave the curves as they are and to analyze the whole set of curves by PCA. It would seem natural to consider each curve as a multivariate observation and the intensities at equidistant points in time as the variables. Since there is a natural zero point (t = 0, I = 0) in these measurements it makes sense in this case not to center each curve around its mean intensity, but to analyze the raw intensity data. Also, since the different maximum intensities may be related to the concentration of the bitter component, it would be imprudent to scale the curves to a common standard (e.g. maximum, mean intensity, area). However, one might consider a log transformation allowing for the nature (ratio scale) of the intensity scale. Figures 38.16 to 38.18 show the result of such an uncentered PCA applied to a set of TI-curves obtained by 9 panellists [13]. The perceived intensity of bitterness
Fig. 38.16. Loading curves (PC1, non-centered PCA) for 4 bitter solutions.
of four solutions, caffeine and tetrahop at two concentration levels, was recorded as a function of time. The analysis is applied separately to each of the four bitter solutions. The loading plots for PC1 and PC2 are shown in Figs. 38.16 and 38.17. Notice that PC1 (Fig. 38.16) has little structure: it represents an equal weighting over most of the time axis. In fact the PC1 loading plot very much resembles the average curve for each product. This is a common outcome with non-centered PCA. The loading plot for PC2 (Fig. 38.17) has a more distinct structure. Since it has a negative part it does not represent a particular type of intensity curve. PC2 affects the shape of the curve. One notices in Fig. 38.17 a distinct difference in shape between the two tetrahop solutions and the two caffeine solutions. This interpretation of PC1 as a size component and PC2 as a contrast component is a familiar phenomenon in principal component analysis of data (see Chapter 31). A different interpretation is obtained if one considers a rotation of the PCs. Rotation of PC1 and PC2 does give more interpretable curves: PC1 + PC2 gives a curve that rises steeply and decays rapidly, representing a fast perception, whereas PC1 - PC2 gives a curve that starts to rise much more slowly, reaching its maximum much later, with a longer lasting perception. The score plot of the panellists in the space of PC1 and PC2 is shown in Fig. 38.18, for one of the four products. A high score along PC1, e.g. panellist 6, implies that the panellist gave overall high intensity scores. A high score on PC2
Fig. 38.17. Loading curves (PC2, non-centered PCA) for 4 bitter solutions.
Fig. 38.18. Score plot (PC2 v. PC1) based on non-centered PCA of TI-curves from 9 panellists for a bitter (caffeine) solution.
(e.g. panellist 9) implies a TI-curve with a relatively fast rise and early peak, in contrast to a low (negative) score on PC2 (e.g. panellist 1), implying a TI-curve with a relatively slow rise and late peak.

There have also been attempts to describe the temporal aspects of perception from first principles, the model including the effects of adaptation and integration of perceived stimuli. The parameters in the specific analytical model derived were estimated using non-linear regression [14]. Another recent development is to describe each individual TI-curve, f_i(t), i = 1, 2, ..., n, as derived from a prototype curve S(t). Each individual TI-curve can be obtained from the prototype curve by shrinking or stretching the (horizontal) time axis and the (vertical) intensity axis, i.e. f_i(t) = a_i S(b_i t). The least squares fit is found in an iterative procedure, alternately adapting the parameter sets {a_i, b_i} for i = 1, 2, ..., n and the shape of the prototype curve [15].
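The non-centred PCA of TI-curves described above can be sketched as follows; a minimal illustration with simulated curves standing in for the bitterness recordings of the 9 panellists.

```python
import numpy as np

# Simulated TI-curves: rows are panellists, columns are intensities at
# equidistant time points; each curve rises and then decays.
rng = np.random.default_rng(4)
t = np.linspace(0, 90, 91)
panel = np.array([a * (np.exp(-t / 30) - np.exp(-t / (5 * b)))
                  for a, b in rng.uniform(0.8, 1.2, size=(9, 2))])

# Non-centred PCA via SVD of the raw data (no column-mean subtraction).
U, s, Vt = np.linalg.svd(panel, full_matrices=False)
loadings = Vt[:2]            # PC1 resembles the average curve; PC2 is a shape contrast
scores = U[:, :2] * s[:2]    # panellist coordinates (cf. Fig. 38.18)
print(scores.round(2))
```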
38.8 Product formulation

An important task of the food technologist is to optimize the ingredients (composition) or the processing conditions of a food product in order to achieve maximum acceptability. In practice this has often to be done under constraints of cost restriction and limited ranges of composition or processing conditions. Techniques such as response surface methodology (Chapter 24) and mixture designs (Chapter 25) are effective in formulation optimization. It is very often the case that the sensory perception of a product is not a simple linear function of the ingredients involved. A logarithmic function (Weber-Fechner law) or a power law (Stevens function) often describes the relation between perceived intensity (I) and concentration (c):

I = a log(c)    Weber-Fechner law    (38.4)

I = a c^n    Stevens' law    (38.5)

The acceptance is generally a non-linear function of perceived intensity. A simple example is the salt level in a soup, which clearly has a level of maximum acceptability between too weak and too salty a taste. The experimental designs discussed in Chapters 24-26 for optimization can also be used for finding the product composition or processing condition that is optimal in terms of sensory properties. In particular, central composite designs and mixture designs are much used. The analysis of the sensory response is usually in the form of a fully quadratic function of the experimental factors. The sensory response itself may be the mean score of a panel of trained panellists. One may consider such a trained panel as a sensitive instrument to measure perceived intensity, useful in describing the sensory characteristics of a food product.
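Fitting such a fully quadratic response surface is a straightforward least-squares problem. A minimal sketch with two simulated formulation factors (the coefficients below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
x1, x2 = rng.uniform(-1, 1, (2, 20))    # coded levels of two formulation factors

# Simulated sensory response generated from a known quadratic surface plus noise.
y = 50 + 4*x1 + 2*x2 - 3*x1**2 - 2*x2**2 + x1*x2 + rng.normal(0, 0.5, 20)

# Design matrix of the fully quadratic model: 1, x1, x2, x1^2, x2^2, x1*x2.
D = np.column_stack([np.ones(20), x1, x2, x1**2, x2**2, x1*x2])
b, *_ = np.linalg.lstsq(D, y, rcond=None)
print(b.round(2))    # estimated coefficients; contour plots follow from the fitted surface
```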
Example

Figure 38.19 shows the contour plots of the foaming behaviour, uniformity of air cells and the sweetness of a whipped topping based on peanut milk with varying corn syrup and fat concentrations [16]. Clearly, fat is the most important variable determining foam (Fig. 38.19A), whereas corn syrup concentration determines sweetness (Fig. 38.19C). It is rather the rule than the exception that more than one sensory attribute is needed to describe the sensory characteristics of a product. An effective way to make a final choice is to overlay the contour plots associated with the response surfaces for the various attributes. If one indicates in each contour plot which regions are preferred, then in the overlay a window region of products with acceptable properties is left (see Fig. 38.19D and Sections 24.5 and 26.4). In the
Fig. 38.19. Contour plots of foam (A), uniformity of air cells (B) and sweetness (C) as a (fully quadratic) function of the levels of fat and corn syrup. An overlay plot (D) shows the region of overall acceptability.
case of this example products with >135 g fat and
Fig. 39.14. (a) Catenary compartmental model representing a reservoir (r), absorption (a) and plasma (p) compartments and the elimination (e) pool. The contents X_r, X_a, X_p and X_e are functions of time t. (b) The same catenary model represented in the form of a flow diagram using the Laplace transforms x_r, x_a and x_p in the s-domain. The nodes of the flow diagram represent the compartments, the boxes contain the transfer functions between compartments [1]. (c) Flow diagram of the lumped system consisting of the reservoir (r), and the absorption (a) and plasma (p) compartments. The lumped transfer function is the product of all the transfer functions in the individual links.
transfer function from the reservoir to the absorption compartment is 1/(s + k_ap), from the absorption to the plasma compartment k_ap/(s + k_pe), and from the plasma to the elimination pool k_pe/s. Note that the transfer constant from an emitting input node and to a receiving output node appears systematically in the numerator and the denominator of the connecting transfer function, respectively. This makes it rather easy to model linear systems of the catenary, mammillary and mixed types. Parts of the model in the Laplace domain can be lumped by multiplying the transfer functions that appear between the input and output nodes. In Fig. 39.14c we have lumped the absorption and plasma compartments. The resulting Laplace transform of the output x_p(s) is then related to the input x_r(s) by means of the transfer function g(s):

x_p(s) = g(s) x_r(s)    (39.74)

where

g(s) = k_ap / ((s + k_ap)(s + k_pe))    (39.75)
This model can now be solved for various inputs to the absorption compartment. In the case of rapid administration of a dose D to the absorption compartment (such as the gut, skin, muscle, etc.), the Laplace transform of the reservoir function is given by:

x_r(s) = D    (39.76)
In this case we obtain the simplest possible expression for the plasma function in the s-domain:

x_p(s) = g(s) D = D k_ap / ((s + k_ap)(s + k_pe))    (39.77)
The inverse transform X_p(t) in the time domain can be obtained by means of the method of indeterminate coefficients, which was presented above in Section 39.1.6. In this case the solution is the same as the one which was derived by conventional methods in Section 39.1.2 (eq. (39.16)). The solution of the two-compartment model in the Laplace domain (eq. (39.77)) can now be used in the analysis of more complex systems, as will be shown below.

When the administration is continuous, for example by oral infusion at a constant rate k_ra, the input function is given by:

x_r(s) = k_ra / s    (39.78)

and the resulting plasma function becomes:

x_p(s) = g(s) k_ra / s = k_ra k_ap / (s(s + k_ap)(s + k_pe))    (39.79)

The inverse Laplace transform can be obtained again by means of the method of indeterminate coefficients. In this case the coefficients A, B and C must be solved by equating the corresponding terms in the numerators of the left- and right-hand parts of the expression:

k_ra k_ap / (s(s + k_ap)(s + k_pe)) = A/s + B/(s + k_ap) + C/(s + k_pe)    (39.80)

The inverse transform of the plasma function is then given by:

X_p(t) = A + B e^(-k_ap t) + C e^(-k_pe t)    (39.81)

After substitution of the values of A, B and C we finally obtain the plasma concentration function X_p(t) for the two-compartment open system with continuous oral administration:

X_p(t) = (k_ra k_ap / (k_ap - k_pe)) [ (1 - e^(-k_pe t)) / k_pe - (1 - e^(-k_ap t)) / k_ap ]    (39.82)

From the above solution we can now easily determine the steady-state plasma content X_p∞ after a sufficiently long time t:

X_p∞ = k_ra / k_pe    (39.83)
If X_i(t) and X_o(t) are the input and output functions in the time domain (for example, the contents in the reservoir and in the plasma compartment), then X_o(t) is the convolution of X_i(t) with G(t), the inverse Laplace transform of the transfer function between input and output:

X_o(t) = ∫_0^t G(τ) X_i(t - τ) dτ = ∫_0^t G(t - τ) X_i(τ) dτ = G(t) * X_i(t)    (39.84)

where the symbol * means convolution, and where τ denotes the integration variable.
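Numerical convolution amounts to a discrete approximation of the integral in eq. (39.84). A minimal sketch with an illustrative transfer function and input (the numerical values are assumptions, not from the text):

```python
import numpy as np

dt = 0.1
t = np.arange(0, 24, dt)
G = 0.8 * np.exp(-0.8 * t)             # illustrative transfer function G(t)
X_i = np.where(t < 8.0, 1.0, 0.0)      # illustrative input: constant infusion for 8 h

# Discrete convolution, scaled by dt to approximate the integral of eq. (39.84).
X_o = np.convolve(G, X_i)[:len(t)] * dt
print(X_o.max())                        # peak of the output function
```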
In the s-domain, convolution is simply the product of the Laplace transforms:

x_o(s) = g(s) x_i(s)    (39.85)
By means of numerical convolution one can obtain X_o(t) directly from sampled values of G(t) and X_i(t) at regular intervals of time t. Similarly, numerical deconvolution yields X_i(t) from sampled values of G(t) and X_o(t). The numerical method of convolution and deconvolution has been worked out in detail by Rescigno and Segre [1]. These procedures are discussed more generally in Chapter 40 on signal processing in the context of the Fourier transform.

39.1.7.2 The γ-method

Thus far we have only considered relatively simple linear pharmacokinetic models. A general solution for the case of n compartments can be derived from the matrix K of coefficients of the linear differential equations:

K = |  k_11   -k_12   ...   -k_1n |
    | -k_21    k_22   ...   -k_2n |
    |  ...     ...           ...  |
    | -k_n1   -k_n2   ...    k_nn |    (39.86)
in which an off-diagonal element k_ij represents the transfer constant from compartment i to compartment j, and in which a diagonal element k_ii represents the sum of the transfer constants from compartment i to all n-1 others, including the elimination pool. If compartment i is not connected to, say, compartment j, then the corresponding element k_ij is zero. The index 1 is reserved here to denote the plasma compartment.

In the previous section we found that the hybrid transfer constants of a two-compartment model are eigenvalues of the transfer constant matrix K. This can be generalized to the multi-compartment model. Hence the characteristic equation can be written by means of the determinant Δ:

Δ = |K - γI| = 0    (39.87)
where I is the n×n identity matrix. In the general case there will be n roots γ_j, which are the eigenvalues of the transfer matrix K. Each of the eigenvalues defines a particular phase of the time course of the contents in the n compartments of the model. The eigenvalues are the hybrid transfer constants which appear in the exponents of the exponential functions. For example, for the ith compartment we obtain the general solution:
X_i(t) = Σ_j G_ij e^(-γ_j t)    (39.88)
where G_ij is the coefficient of the jth phase in the ith compartment. The coefficient G_ij can be determined from the minors of the determinant Δ, as shown by Rescigno and Segre [1]:

G_ij = x_1(0) [ (-1)^(1+i) Δ_1i / Δ' ]_(γ=γ_j)    (39.89)
where Δ_1i is the minor of Δ, which is obtained by crossing out row 1 and column i, and where Δ' denotes the derivative (dΔ/dγ) of Δ with respect to γ. This general approach for solving linear pharmacokinetic problems is referred to as the γ-method. It is a generalization of the approach by means of the Laplace transform, which was applied in the previous Section 39.1.6 to the case of a two-compartment model. The theoretical solution from the γ-method allows us to study the behaviour of the model, provided that the transfer constants in K are known. In the reverse problem, one must estimate the transfer constants in K from an observed plasma concentration curve C_p(t). In this case, we may determine the hybrid transfer constants γ_j and the associated intercepts G_1j of the plasma concentration curve by means of graphical curve peeling or by non-linear regression techniques. Using these experimental results we must then seek to compute the transfer constants from systems of equations which relate the hybrid rate constants γ_j and the associated intercepts G_1j to the transfer constants in K. We may also insert the general solution into the differential equations, their derivatives (at time 0) and their integrals (between 0 and infinity) in order to obtain useful relationships between the hybrid transfer constants γ_j, the intercepts G_1j and the model parameters in K. The calculations can be done by computer programs such as PROC MODEL in SAS [8]. Not all linear pharmacokinetic models are computable, however, and criteria for computability have been described [9]. Some models may be indeterminate, yielding an infinity of solutions, while others may have no solution.

By way of illustration, we apply the γ-method to the two-compartment mammillary model for intravenous administration which we have already seen in Section 39.1.6. The matrix K of transfer constants for this case is defined by means of:

K = |  k_pb + k_pe   -k_bp |
    | -k_pb           k_bp |    (39.90)
and the corresponding characteristic equation can be written in the form:

Δ = |K - γI| = | k_pe + k_pb - γ    -k_bp     |
               | -k_pb               k_bp - γ | = 0    (39.91)
The eigenvalues γ of the model are the roots α and β of the quadratic equation:

γ² - γ(k_pb + k_bp + k_pe) + k_bp k_pe = 0    (39.92)
the sum and product of which are easily derived:

γ_1 + γ_2 = α + β = k_pb + k_bp + k_pe
γ_1 γ_2 = αβ = k_bp k_pe    (39.93)
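The relations (39.90)-(39.93) are easily verified numerically; a minimal sketch with illustrative (assumed) transfer constants:

```python
import numpy as np

# Illustrative transfer constants (1/h); these values are assumptions.
k_pb, k_bp, k_pe = 0.5, 0.3, 0.2
K = np.array([[k_pe + k_pb, -k_bp],
              [-k_pb,        k_bp]])       # eq. (39.90)

gamma = np.linalg.eigvals(K)               # hybrid transfer constants alpha and beta
print(sorted(gamma))                       # roots of the quadratic (39.92)
print(gamma.sum(),  k_pb + k_bp + k_pe)    # eq. (39.93): sum of eigenvalues
print(gamma.prod(), k_bp * k_pe)           # eq. (39.93): product of eigenvalues
```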
The general solution for the plasma compartment is now expressed as follows:

C_p(t) = G_11 e^(-γ_1 t) + G_12 e^(-γ_2 t) = A_p e^(-αt) + B_p e^(-βt)    (39.94)

with

A_p = D [ (-1) Δ_11 / Δ' ]_(γ=γ_1=α)    (39.95)
and

B_p = D [ (-1) Δ_11 / Δ' ]_(γ=γ_2=β)    (39.96)

f(kΔt) with k = -N to +N. The value k = -N corresponds to -T_m and the value k = +N to +T_m. This digitization procedure is schematically shown in Figs. 40.7a and 40.7b. The new variable k varies from -N to +N,
Fig. 40.7. (a) A continuous signal f(t) measured from t_0 = 0 to t_0 + 2T_m = 2 seconds. (b) Same signal after digitization, f(kΔt) as a function of k.
The expressions for the forward and backward Fourier transforms of a data array of 2N + 1 data points with the origin in the centre point are [3]:

Forward:

F(n) = [1/(2N + 1)] Σ_(k=-N)^(+N) f(kΔt) e^(-j2πnk/(2N+1))    (40.7)
for n = -n_max to +n_max, where n_max corresponds to the maximum frequency which is present in the data. The value of n_max is derived in the next section.

Backward:

f(kΔt) = Σ_(n=-N)^(+N) F(n) e^(+j2πnk/(2N+1))    (40.8)
for k = -N, ..., +N.

The frequency associated with F(n) is ν_n. This frequency should be equal to n times the basis frequency, which is equal to 1/(2T_m) (2T_m being the period of a sine or cosine which exactly fits in the measurement time). Thus ν_n = n/(2T_m) = n/(2NΔt). It should be noted that in the literature one may find other conventions for the normalization factor used in front of the integral and summation signs.

40.3.5 Frequency range and resolution

As mentioned before, the smallest observable frequency (ν_min) in a continuous signal is the reciprocal of the measurement time (1/(2T_m)). Because only those frequencies are considered which exactly fit in the measurement time, all frequencies should be a multiple of ν_min, namely n/(2T_m) with n = -∞ to +∞. As a result the Fourier transform of a continuous signal is discrete in the frequency domain,
with an interval (n + 1)/(2T_m) - n/(2T_m) = 1/(2T_m). This frequency interval is called the resolution. Thus the measurement time defines the lowest observable frequency and the resolution; both are equal to 1/(2T_m). In the case of a continuous signal there is no upper limit for the observable frequencies. For an infinite measurement time the Fourier transform also becomes a continuous signal, because 1/(2T_m) → 0.

On the contrary, when transforming a discrete signal, the observable frequencies are limited. The reason lies in the fact that a sine or cosine function should be sampled at least twice per period in order to be uniquely defined. This sampling frequency is called the Nyquist frequency. This is illustrated in Fig. 40.8, showing a signal measured in one second and digitized in 4 data points. Clearly, these 4 data points can be fitted in two ways: with a sine with a period equal to 1 s or with a sine with a period equal to 1/3 s. However, the Fourier transform always yields the lowest frequencies with which the data can be fitted, in this case the sine with a period equal to 1 s. The Nyquist frequency is the upper limit (ν_max = 1/(2Δt)) of the frequencies that can be observed. An increase of the maximally observable frequency in the FT can only be achieved by increasing the sampling rate (smaller Δt in the digitization process). From the values of ν_min and ν_max one can derive that for a signal digitized in (2N + 1) data points, N frequencies are observed, namely:

(ν_max - ν_min)/Δν + 1 = (1/(2Δt) - 1/(2T_m))(2T_m) + 1 = 2T_m/(2Δt) - 1 + 1 = N

The same N frequencies are observed in the negative frequency domain as well. In summary, the Fourier transform of a continuous signal digitized in 2N + 1 data points returns N real Fourier coefficients, N imaginary Fourier coefficients and the average signal, also called the DC term, i.e. in total 2N + 1 points. The relationship between the scales in both domains is shown in Fig. 40.9.
Fig. 40.8. (a) Sine function sampled at the Nyquist frequency (2 points per period). (b) An under-sampled sine function.
Fig. 40.9. Relationship between measurement time (2T_m), digitization interval and the maximum and minimal observable frequencies in the Fourier domain.
TABLE 40.1

Signal measured at five time points with Δt = 0.5 s and 2T_m = 2 s

Original time scale (s)    Time scale with origin shifted to the centre    k     f(kΔt)
t1 = 0                     t1 = -1                                        -2    f(-1) = 2
t2 = 0.5                   t2 = -0.5                                      -1    f(-0.5) = 3
t3 = 1                     t3 = 0                                          0    f(0) = 4
t4 = 1.5                   t4 = 0.5                                        1    f(0.5) = 4
t5 = 2                     t5 = 1                                          2    f(1) = 2
By way of illustration we calculate the FT of the discrete signal listed in Table 40.1. The origin of the time domain has been placed in the centre of the data. The forward transform is calculated as follows (eq. (40.7)).

The mean value is 3; N = 2; 2N + 1 = 5; ν_min = 1/(2T_m) = 0.5 Hz; Δν = 1/(2T_m) = 0.5 Hz; ν_max = 1/(2Δt) = 1/1 = 1 Hz.

1. n = 0: ν_0 = 0 Hz

F(0) = (1/5) Σ_(k=-2)^(+2) f(kΔt) = (1/5)(2 + 3 + 4 + 4 + 2) = 3

2. n = 1: ν_1 = 1/(2T_m) = 0.5 Hz

F(1) = (1/5) Σ_(k=-2)^(+2) f(kΔt) e^(-j2πk/5)
= (1/5)(2e^(j4π/5) + 3e^(j2π/5) + 4e^0 + 4e^(-j2π/5) + 2e^(-j4π/5))
= (1/5)(2cos(4π/5) + 2j sin(4π/5) + 3cos(2π/5) + 3j sin(2π/5) + 4 + 4cos(2π/5) - 4j sin(2π/5) + 2cos(4π/5) - 2j sin(4π/5))
= (1/5)(-1.618 + 1.176j + 0.927 + 2.853j + 4 + 1.236 - 3.804j - 1.618 - 1.176j)
= (1/5)(2.93 - 0.95j) = 0.58 - 0.19j

3. n = 2: ν_2 = ν_1 + Δν = 1 Hz = 1/(2Δt) = ν_max

F(2) = (1/5) Σ_(k=-2)^(+2) f(kΔt) e^(-j4πk/5)
= (1/5)(2e^(j8π/5) + 3e^(j4π/5) + 4e^0 + 4e^(-j4π/5) + 2e^(-j8π/5))
= (1/5)(2cos(8π/5) + 2j sin(8π/5) + 3cos(4π/5) + 3j sin(4π/5) + 4 + 4cos(4π/5) - 4j sin(4π/5) + 2cos(8π/5) - 2j sin(8π/5))
= (1/5)(0.618 - 1.902j - 2.427 + 1.763j + 4 - 3.236 - 2.351j + 0.618 + 1.902j)
= (1/5)(-0.427 - 0.588j) = -0.085 - 0.12j

At n = 2 the maximally observable frequency is reached, and the calculation can be stopped. The results are summarized in Fig. 40.10.
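The worked example can be verified with a few lines of code implementing eq. (40.7) directly; a minimal sketch (numpy's FFT routines would give the same coefficients up to ordering and normalization conventions):

```python
import numpy as np

# Signal of Table 40.1, indexed k = -2, ..., +2 (origin in the centre).
f = np.array([2.0, 3.0, 4.0, 4.0, 2.0])
k = np.arange(-2, 3)
M = len(f)                     # M = 2N + 1 = 5

# Forward transform of eq. (40.7): F(n) = (1/M) sum_k f(k) exp(-j 2 pi n k / M)
for n in range(0, 3):
    Fn = np.sum(f * np.exp(-2j * np.pi * n * k / M)) / M
    print(n, np.round(Fn, 3))
# prints: 0 (3+0j); 1 (0.585-0.19j); 2 (-0.085-0.118j)
```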
Fig. 40.10. Fourier transform of the data points listed in Table 40.1.
40.3.6 Sampling

In the previous section we have seen that the highest observable frequency present in a discrete signal depends on the sampling interval (Δt) and is equal to 1/(2Δt), the Nyquist frequency. This has two important consequences. First, the minimally required digitizing rate (sampling points per second) in order to retain all information in a continuous signal is defined by its maximum frequency. Secondly, the Fourier transform is disturbed if the continuous signal contains frequencies higher than the Nyquist frequency. This disturbance is called aliasing or folding. The principle of folding is illustrated in Fig. 40.11. In this figure two sine functions A and B are shown, which have been sampled at the same sampling rate of 16 Hz. Signal A (8 Hz) is sampled at a rate which exactly fits the Nyquist frequency (ν_max), namely 2 data points per period. Signal B (11 Hz) is under-sampled, as it requires a sampling rate of minimally 22 Hz. The frequency of signal B is 3 Hz (= δν) higher than the maximally observable frequency. This does not mean that for signal B no frequencies are observed in the frequency domain. Indeed, from Fig. 40.11 one can see that the data points of signal B can also be fitted with a 5 Hz sine, which is 3 Hz (= δν) lower than ν_max (= 8 Hz). As a consequence, if a signal contains a frequency which is δν higher than the Nyquist frequency, a false frequency is observed at a frequency δν lower than the Nyquist frequency.
Fig. 40.11. Aliasing or folding. (a) Sine of 8 Hz sampled at 16 Hz (Nyquist frequency). (b) Sine of 11 Hz sampled at 16 Hz (under-sampled). (c) A sine of 5 Hz fitted through the data points of signal (b).
One should always be aware of the possible presence of 'false' frequencies by aliasing. This can easily be checked by changing the sampling rate: true frequencies remain unaffected, whereas 'aliased' frequencies shift to other values. The Nyquist frequency defines the minimally required sampling rate of analytical signals. As an example we take a chromatographic peak with a Gaussian shape given by y(x) = a exp(−x²/(2s²)) (a Gaussian function with a standard deviation equal to s, located in the origin, x = 0). The FT of this function is [5]:

y(ν) = a s √(2π) exp(−ν²/(2(1/s)²))

From this equation it follows that the amplitudes of the frequencies present in a Gaussian peak are normally distributed about a frequency equal to zero, with a standard deviation inversely proportional to the standard deviation s of the Gaussian peak. As a consequence, the maximal frequency present in a Gaussian peak is approximately equal to 0 + 3(1/s). In order to be able to observe that frequency, a sampling rate of at least 6/s is required, or 6 data points per standard deviation of the original signal. As the width of the base of a Gaussian peak is approximately 6 times the standard deviation, the sampling rate should be at least 36 points over the whole peak. In practice higher sampling rates are applied in order to avoid aliasing of high noise frequencies and to allow signal processing (see Section 40.5).
Fig. 40.12. (a) FT (real coefficients) of a Gaussian peak located in the origin of the measurements (256 data points). Solid line: w½ = 20; dashed line: w½ = 5, and corresponding maximal frequencies. (b) FT (real and imaginary coefficients) of the same peak shifted by 50 data points.
Figure 40.12a gives the real Fourier coefficients of two Gaussian peaks centred in the origin of the data, with half-height widths respectively equal to 20 and 5 data points. As one can read from the figure, the maximum frequency in the narrow peak (dashed line in Fig. 40.12) is about 4 times higher than the maximum frequency in the wider peak (solid line in Fig. 40.12).

40.3.7 Zero filling and resolution

In Section 40.3.5 we concluded that the resolution (Δν) in the frequency spectrum is equal to the reciprocal of the measurement time. The longer the measurement time in the time domain, the better the resolution in the frequency domain. The opposite is also true: the longer the measurement time in the frequency domain (e.g. in FTIR or FT NMR), the better the separation of the peaks in the spectrum after the back-transform to the wavelength or chemical shift domain.
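This relationship is easily checked numerically; the sketch below (Python with NumPy; the sampling interval of 0.01 s is an arbitrary illustrative choice) shows that the spacing of the frequency grid returned by an FFT routine equals the reciprocal of the total measurement time, so doubling the number of measured points halves Δν.

    import numpy as np

    dt = 0.01                        # sampling interval in seconds
    for n_points in (256, 512):      # doubling the measurement time n_points*dt
        freqs = np.fft.rfftfreq(n_points, d=dt)
        print(n_points * dt, freqs[1] - freqs[0])   # measurement time, resolution
    # 2.56 s -> 0.39 Hz; 5.12 s -> 0.195 Hz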
Fig. 40.13. Exponentially decaying pulse NMR signal.
In FTIR and FT NMR the amplitude of the measured signal is an exponentially decaying function (Fig. 40.13). One could conclude that in this case continuing the measurement makes little sense. However, stopping the acquisition early shortens the measurement time and would, therefore, limit the resolution of the spectrum after the back-transform. It is, therefore, common practice to artificially extend the measurement time by adding zeros behind the measured signal, and to consider these zeros as measurements. This is called zero filling. The effect of zero filling is illustrated on the inverse Fourier transform of an exponentially decaying FT NMR signal before (Fig. 40.14a) and after zero filling (Fig. 40.14b). One should take into account that if the last measured data point is not close to zero, zero filling introduces false frequencies because of the introduction of a stepwise change of the signal. This is avoided by extrapolating the signal to zero following an exponential function, a procedure called apodization.

40.3.8 Periodicity and symmetry

In Section 40.3.4 we have shown that the FT of a discrete signal consisting of 2N + 1 data points comprises N real and N imaginary Fourier coefficients (positive frequencies) and the average value (zero frequency). We also indicated that N real and N imaginary Fourier coefficients can be defined in the negative frequency domain. In Section 40.3.1 we explained that the FT of signals which are symmetrical about t = 0 in the time domain contains only real Fourier coefficients, whereas signals which are antisymmetric about the t = 0 point in the time domain contain only imaginary Fourier coefficients (sines).
Fig. 40.14. Effect of zero filling on the back transform of the pulse NMR signal given in Fig. 40.13. (a) Before zero filling. (b) After zero filling.
This property of symmetry may be applied to obtain a transform which contains only real Fourier coefficients. For example, when a spectrum encoded in 2N + 1 = 512 + 1 data points is artificially mirrored into the negative wavelength domain, the spectrum is symmetrical about the origin. The FT of this spectrum consists of 512 real and 512 imaginary Fourier coefficients. However, because the signal is symmetric, all imaginary coefficients are zero, reducing the FT to 512 real coefficients (plus one for the mean), of which 256 coefficients are at negative frequencies and 256 at positive frequencies.

40.3.9 Shift and phase

A shift or translation of f(t) by t0 results in a modulation of the Fourier coefficients by exp(−jωt0). Without shift, f(t) is transformed into F(ω). After a shift by t0, f(t − t0) is transformed into exp(−jωt0)F(ω), which results in a modulation (by a cosine or sine wave) of the Fourier coefficients. The frequency ωt0 of the
modulation depends on t0, the magnitude of the shift. The back transform has the same property: F(n − n0) is back-transformed to f(t)exp(j2πn0t/(2N)). A shift by N data points, therefore, results in f(kΔt)exp(j2πNk/(2N)) = f(kΔt)exp(jπk) = f(kΔt)(−1)^k. This property is often applied to shift the origin of the Fourier domain to the centre of its 2N + 1 data array when the software by default places its origin at the first data point. In Section 40.3.6 we mentioned that the Fourier transform of a Gaussian peak positioned in the origin is also a Gaussian function. In practice, however, peaks are located at some distance from the origin. The real output, therefore, is a damped cosine wave and the imaginary output is a damped sine wave (see Fig. 40.12b). The frequency of the damping is proportional to the distance of the peak from the origin. The functional form of the damped wave is defined by the peak shape, and is a Gaussian for a Gaussian peak. Inversely, the frequency of the oscillations of the Fourier coefficients contains information on the peak position. The phase spectrum Φ(n) is defined as Φ(n) = arctan(A(n)/B(n)). One can prove that for a symmetrical peak the ratio of the real and imaginary coefficients is constant, which means that all cosine and sine functions are in phase. It is important to note that the Fourier coefficients A(n) and B(n) can be regenerated from the power spectrum P(n) using the phase information. Phase information can be applied to distinguish frequencies corresponding to the signal from those of the noise, because the phases of the noise frequencies oscillate randomly.

40.3.10 Distributivity and scaling

The Fourier transform is distributive over summation, which means that the Fourier transform of the sum of two signals is equal to the sum of the Fourier transforms of the two individual signals: F[f1(t) + f2(t)] = F[f1(t)] + F[f2(t)]. The enhancement of the signal-to-noise ratio (or filtering) in the Fourier domain is based on that property. If one assumes that the noise n(t) is additive to the signal s(t), the measured signal m(t) is equal to s(t) + n(t). Therefore, F[m(t)] = F[s(t)] + F[n(t)], or

M(ν) = S(ν) + N(ν)

Assuming that the Fourier transformed spectra S(ν) and N(ν) contribute at specific frequencies, the true signal s(t) can be recovered from M(ν) after elimination of N(ν). This is called filtering (see further Section 40.5.3). The Fourier transform is not distributive over multiplication:
F[f1(t)f2(t)] ≠ F[f1(t)]F[f2(t)]

It is also easy to show that for a scalar a: F[af(t)] = aF[f(t)].
40.3.11 The fast Fourier transform

As explained before, the FT can be calculated by fitting the signal with all allowed sine and cosine functions. This is a laborious operation, as it requires the calculation of two parameters (the amplitudes of the sine and the cosine function) for each considered frequency. For a discrete signal of 1024 data points, this requires the calculation of 1024 parameters by linear regression and the calculation of the inverse of a 1024 by 1024 matrix. The FT could also be calculated by directly solving eq. (40.7). For each frequency (2N + 1 values) we have to add 2N + 1 (the number of data points) values, each of which is the result of a multiplication of a complex number with a real value. The number of complex multiplications and additions is, therefore, proportional to (2N + 1)². Even for fast computers this is a considerable task. Therefore, so-called fast Fourier transform (FFT) algorithms have been developed, originally by Cooley and Tukey [6], which are available in many software packages. The number of operations in the FFT is proportional to (2N + 1)log2(2N + 1), permitting considerable savings of calculation time. The calculation of a signal digitized over 1024 points now requires about 10^4 operations instead of 10^6, which is about 100 times faster. A condition for applying the FFT algorithm is that the number of data points is a power of 2. The principle of the FFT algorithm can be found in many textbooks (see additional recommended reading). Because the FFT algorithm requires the number of data points to be a power of 2, the signal in the time domain has to be extrapolated (e.g. by zero filling) or cut off to meet that requirement. This has consequences for the resolution in the frequency domain, as this virtually expands or shortens the measurement time.
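By way of illustration, the sketch below (Python with NumPy; the signal and its length are arbitrary) zero fills a 1000-point signal to the next power of two before the FFT is taken. Note that modern mixed-radix FFT implementations, such as the one in NumPy, also accept lengths that are not powers of two; the padding merely mimics the classical power-of-two requirement discussed above.

    import numpy as np

    signal = np.random.randn(1000)              # 1000 points: not a power of 2
    n = 1 << (len(signal) - 1).bit_length()     # next power of 2, here 1024
    padded = np.concatenate([signal, np.zeros(n - len(signal))])  # zero filling

    spectrum = np.fft.fft(padded)               # FFT: about n*log2(n) operations
    print(len(padded), spectrum.shape)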
40.4 Convolution

As a rule, a measurement is an imperfect representation of reality. Noise and other blur sources may degrade the signal. In the particular case of spectrometry a major source of degradation is peak broadening caused by the limited bandwidth of a monochromator. When a spectrophotometer is tuned at a wavelength λ0, neighbouring wavelengths also reach the detector, each with a certain intensity. The profile of these intensities as a function of the wavelength is called the slit function, h(λ). An example of a slit function is given in Fig. 40.15. This slit function is also called a convolution function. Under certain conditions, the shape of the slit function is a triangle symmetrical about λ0. The width at half-height is called the spectral band-width. When measuring a 'true' absorbance peak with a half-height width not very much larger than the spectral band-width, the observed peak shape is disturbed. The mechanism behind this disturbance is called convolution.
Fig. 40.15. (a) Slit function (point-spread function) h(λ) for a spectrometer tuned at 304 nm. (b) f(λ) is the true absorbance spectrum of the sample.
Convolution also occurs when filtering a signal with an electronic filter or with a digital filter, as explained in Section 40.5.2. By way of illustration the spectrometry example is worked out. Two functions are involved in the process, the signal f(λ) and the convolution function h(λ). Both functions should be measured in the same domain and should be digitized with the same interval and at the same values (in spectrometry: λ-values). Let us furthermore assume that the spectrum f(λ) and the convolution function h(λ) have a simple triangular shape, but with a different half-height width. Let us suppose that the true spectrum f(λ) (absorbance × 100) is known (Fig. 40.15b):

λ1 = 301 nm → f(301) = 0
λ2 = 302 nm → f(302) = 0
λ3 = 303 nm → f(303) = 0
λ4 = 304 nm → f(304) = 5
λ5 = 305 nm → f(305) = 10
λ6 = 306 nm → f(306) = 5
λ7 = 307 nm → f(307) = 0

For all other wavelengths the absorbance is equal to zero.
Let us also consider a convolution function h(λ), also called point-spread function. This function represents the profile of the intensity of the light reaching the detector when the spectrometer is tuned at wavelength λ0. If we assume that the slit function has the triangular form given in Fig. 40.15a and that the spectrometer has been tuned at a wavelength equal to 304 nm, then for a particular slit width the radiation reaching the detector is composed of the following contributions:

25% comes from λ = 304, or a relative intensity of 0.25
18.8% comes from λ = 303 and 305: 0.188
12.5% comes from λ = 302 and 306: 0.125
6.2% comes from λ = 301 and 307: 0.062

The relative intensities sum up to 1. None of the other wavelengths reaches the detector. When tuning the spectrometer at another wavelength, the centre of the convolution function is moved to that wavelength. If we encode the convolution function relative to the set point h(0), then we obtain the following discrete values (normalized to a sum = 1):

h(0) = 0.25
h(−1) = h(+1) = 0.188
h(−2) = h(+2) = 0.125
h(−3) = h(+3) = 0.062
h(−4) = h(+4) = 0

All values from h(−4) to h(−∞), and from h(+4) to h(+∞), are zero. Let us now calculate the signal g(304) which is measured when the spectrometer is tuned at λ = 304 nm:

g(304) = 0.25f(304) + 0.188f(305) + 0.188f(303) + 0.125f(306) + 0.125f(302) + 0.062f(307) + 0.062f(301)
       = h(0)f(304) + h(1)f(305) + h(−1)f(303) + h(2)f(306) + h(−2)f(302) + h(3)f(307) + h(−3)f(301)

Because h(0) = h(304 − 304), h(1) = h(305 − 304), h(2) = h(306 − 304), h(−1) = h(303 − 304) and so on, the general expression for g(x) can be written in the following compact notation:

g(x) = Σ f(y)h(y − x)    (40.9)

for all x and y for which f(y) and h(y − x) are defined.
Fig. 40.16. Measured absorbance spectrum for the system shown in Fig. 40.15.
A shorthand notation for eq. (40.9) is:

g(x) = f(x) * h(x)    (40.10)
where * is the symbol for convolution. Extension of the convolution to the wavelengths 301 to 307 nm yields the measured spectrum g(x) shown in Fig. 40.16. The broadening of the signal is clearly visible. One should note that signals measured in the frequency domain may also be a convolution of two signals. For instance, the periodic exponentially decaying signal shown in Fig. 40.13 is a convolution of a sine function with an exponential function. An important aspect of convolution is its translation into the frequency domain and vice versa. This translation is known as the convolution theorem [7], which states that:
- Convolution in the time domain is equivalent to a multiplication in the frequency domain: g(x) = f(x) * h(x) <=> G(ν) = F(ν)H(ν), and
- Convolution in the frequency domain is equivalent to a multiplication in the time domain: G(ν) = F(ν) * H(ν) <=> g(x) = f(x)h(x).
From the convolution theorem it follows that the convolution of the two triangles in our example can also be calculated in the Fourier domain, according to the following scheme:
(1) Calculate F(ν) of the signal f(t).
(2) Calculate H(ν) of the point-spread function h(t).
(3) Calculate G(ν) = F(ν)H(ν): the real (Re) and imaginary (Im) transform coefficients are multiplied according to the multiplication rule of two complex numbers:
Fig. 40.17. Convolution in the time domain of f(t) with h(t) carried out as a multiplication in the Fourier domain. (a) A triangular signal (w½ = 3 data points) and its FT. (b) A triangular slit function h(t) (w½ = 5 data points) and its FT. (c) Multiplication of the FT of (a) with that of (b). (d) The inverse FT of (c).
Re(G(ν)) = Re(F(ν))Re(H(ν)) − Im(F(ν))Im(H(ν))
Im(G(ν)) = Re(F(ν))Im(H(ν)) + Im(F(ν))Re(H(ν))

(4) Back-transform G(ν) to g(t).

These four steps are illustrated in Fig. 40.17, where two triangles (arrays of 32 data points) are convoluted via the Fourier domain. Because one should multiply Fourier coefficients at corresponding frequencies, the signal and the point-spread function should be digitized with the same time interval. Special precautions are needed to avoid numerical errors, the discussion of which is beyond the scope of this text. However, one should know that when f(t) and h(t) are digitized into sampled arrays of size A and B respectively, both f(t) and h(t) should be extended with zeros to a size of at least A + B. If (A + B) is not a power of two, more zeros should be appended in order to use the fast Fourier transform.
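The slit-function example of this section can be reproduced in a few lines of code (a sketch in Python with NumPy; the wavelength grid is padded with zero-absorbance points, which is an assumption of convenience). Both the direct evaluation of eq. (40.9) and the route via the Fourier domain give g(304) = 3.755.

    import numpy as np

    wavelengths = np.arange(298, 311)            # 298 ... 310 nm
    f = np.zeros(wavelengths.size)
    f[wavelengths == 304] = 5                    # true spectrum of Fig. 40.15b
    f[wavelengths == 305] = 10
    f[wavelengths == 306] = 5
    h = np.array([0.062, 0.125, 0.188, 0.25, 0.188, 0.125, 0.062])  # slit function

    # direct convolution, eq. (40.9); h is symmetric about its centre
    g = np.convolve(f, h, mode='same')
    print(g[wavelengths == 304])                 # [3.755]

    # the same result via the Fourier domain (convolution theorem),
    # after zero-extending both arrays to at least A + B points:
    n = f.size + h.size
    G = np.fft.fft(f, n) * np.fft.fft(h, n)
    g_ft = np.real(np.fft.ifft(G))[3:3 + f.size] # index 3 compensates h's centre
    print(np.round(g_ft[wavelengths == 304], 3)) # [3.755]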
40.5 Signal processing

40.5.1 Characterization of noise

As said before, there are two main applications of Fourier transforms: the enhancement of signals and the restoration of the deterministic part of a signal. Signal enhancement is an operation for the reduction of the noise, leading to an improved signal-to-noise ratio. By signal restoration, deformations of the signal introduced by imperfections in the measurement device are corrected. These two operations can be executed in both domains, the time and the frequency domain. Ideally, any procedure for signal enhancement should be preceded by a characterization of the noise and the deterministic part of the signal. Spectrum (a) in Fig. 40.18 is the power spectrum of 'white noise', which contains all frequencies with approximately the same power. Examples of white noise are shot noise in photomultiplier tubes and thermal noise occurring in resistors. In spectrum (b), the power (and thus the magnitude of the Fourier coefficients) is inversely proportional to the frequency (amplitude ∝ 1/ν). This type of noise is often called 1/f noise (f is the frequency) and is caused by slow fluctuations of ambient conditions (temperature and humidity), power supply, vibrations, quality of chemicals, etc. This type of noise is very common in analytical equipment.
Fig. 40.18. Noise characterisation in the frequency domain. The power spectrum |F(ν)| of three types of noise. (a) White noise. (b) Flicker or 1/f noise. (c) Interference noise.
Fig. 40.19. Baseline noise of a UV-Vis detector in HPLC.
An example is the power spectrum (Fig. 40.20) of the baseline noise of a UV-Vis detector in HPLC (Fig. 40.19). In some cases the power spectrum may have peaks at some specific frequencies (see Fig. 40.18c). A very common source of this type of noise is the 50 Hz periodic interference of the power line. In reality most noise is a combination of the noise types described. Spectral analysis is a useful tool in assessing the frequency characteristics of these types of signal. We first discuss signal enhancement in the time domain, which does not require a transform to the frequency domain. It is noted that all discrete signals should be sampled at uniform intervals.

40.5.2 Signal enhancement in the time domain

In many instances the quality of the signal has to be improved before the chemical information can be derived from it. One of the possible improvements is the reduction of the noise. In principle there are two options: the enhancement of the analog signal by electronic devices (hardware), e.g. an electronic filter, and the manipulation of the signal after digitization by computer, a so-called digital filter.
Fig. 40.20. The power spectrum of the baseline noise given in Fig. 40.19.
Analytical equipment usually contains hardware to obtain a satisfactory signal-to-noise ratio. For example, in AAS the radiation of the light source is modulated by a light chopper to remove the noise introduced by the flame. The frequency of this chopper is locked into a lock-in amplifier, which passes only signals with a frequency equal to the frequency of the chopper. As a result, noise with frequencies other than the chopper frequency is eliminated, including the 1/f noise. An apparent advantage of digital filters over analog filters is their greater flexibility. When the original data points are stored in computer memory, the enhancement operation can be repeated under different conditions without the need to remeasure the signal. Finally, for the processing of a given data point, data points measured later are also available, which is intrinsically impossible when processing an analog signal. This advantage becomes clear in Section 40.5.2.3. Many instrument manufacturers apply digital devices in their equipment, or supply software to be operated from the PC for post-run data processing of the signal by the user, e.g. a digital smoothing step. To ensure a correct application, the principles and limitations of smoothing and filtering techniques are explained in the following sections.
Fig. 40.21. Averaging of 100 scans of a Gaussian peak.
40.5.2.1 Time averaging

Some instruments scan the measurement range very rapidly and with a great stability. NMR is such a technique. In this case the signal-to-noise ratio can be improved by repeating the scans (e.g. N times) and adding the corresponding data points. As a result the magnitude of the deterministic part of the signal is multiplied by N, and the standard deviation of the noise by a factor of √N (see Chapter 3). Consequently, the signal-to-noise ratio improves by a factor √N. There are clear limitations to this technique. First of all, the repeated scans must be sampled at exactly the same time values. Furthermore, one should be aware that in order to obtain an improvement of the signal-to-noise ratio by a factor √2, a doubling of the measurement time is required. The limiting factor then becomes the stability of the signal. The effect of averaging 100 scans of a Gaussian peak (standard deviation of the noise is 10% of the peak maximum) is demonstrated in Fig. 40.21. The signal-to-noise ratio is improved without deformation of the signal.
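A numerical sketch of time averaging (Python with NumPy; the Gaussian peak and the 10% noise level mirror the situation of Fig. 40.21, the random seed is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(-5, 5, 200)
    peak = np.exp(-t**2 / 2)                    # noise-free Gaussian peak

    scans = peak + 0.1 * rng.standard_normal((100, t.size))  # 100 noisy scans
    averaged = scans.mean(axis=0)

    print(np.std(scans[0] - peak))   # about 0.1 : noise of a single scan
    print(np.std(averaged - peak))   # about 0.01: improved by sqrt(100) = 10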
40.5.2.2 Smoothing by moving average

In the accumulation process explained in the previous section, data points collected during several scans and measured at corresponding time values are added. One could also consider accumulating the values of a number of data points in a small segment or window in the same scan. This is the principle of smoothing, which is explained in more detail below.
The simplest form of smoothing consists of using a moving average, which has been introduced in Chapter 7 for quality control. A window with an odd number of data points is defined. The values in that window are averaged, and the central point is replaced by that value. Thereafter, the window is shifted one data point by dropping the last one in the window and including the next one (un-smoothed measurement). The averaging process is repeated until all data points have been averaged. For the smoothing of a given data point, data points are used which were measured after this data point, which is not possible for analog filters. The expression for the moving average of point i is:

g_i = (1/(2m + 1)) Σ_{j=−m}^{m} f_{i+j}

where g_i is the smoothed data point i, f_{i+j} is the original data point i + j, and (2m + 1) is the size of the smoothing window. The process of moving averaging — and, we will see later, the process of smoothing in general — can be represented as a convolution. Consider, therefore, the data points f(1), f(2), ..., f(n). The moving average of point f(5) using a smoothing window of 5 data points is calculated as follows:

                f(1)  f(2)  f(3)  f(4)  f(5)  f(6)  f(7)  f(8) ...
multiply by       0     0     1     1     1     1     1     0  ...
add and divide by 5: [0 + 0 + f(3) + f(4) + f(5) + f(6) + f(7) + 0 + ...]/5
If we consider the zeros and ones to form a function h(t), and if we position the origin of that function, h(0), in the centre of the defined window, then the above process becomes:
f(t):   f(1)   f(2)   f(3)   f(4)   f(5)   f(6)   f(7)   f(8) ...
h(t):   h(−4)  h(−3)  h(−2)  h(−1)  h(0)   h(+1)  h(+2)  h(+3) ...
        0      0      1      1      1      1      1      0   ...
t:      −4     −3     −2     −1     0      1      2      3   ...
giving g(5) = (1/5) Σ f(m)h(m − 5), or in general:

g(t) = Σ f(m)h(m − t)/NORM    (40.11)
for all m for which f(m) and h(m-t) are defined. NORM is a factor to keep the integral of the signal constant.
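In code, the moving average is indeed nothing but a convolution with a block of ones divided by NORM, as the following sketch shows (Python with NumPy; the data are arbitrary illustrative values):

    import numpy as np

    f = np.array([1.0, 2.0, 4.0, 7.0, 6.0, 5.0, 3.0, 2.0, 1.0])
    h = np.ones(5)                              # block convolution function
    NORM = h.sum()                              # 2m + 1 = 5

    g = np.convolve(f, h, mode='same') / NORM   # eq. (40.11) as a convolution
    print(np.round(g, 2))

    # the interior points agree with a direct 5-point moving average:
    print(f[0:5].mean(), g[2])                  # both 4.0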
Equation (40.11) corresponds exactly to the expression of a convolution (eq. (40.9)) introduced earlier, demonstrating that the mechanism of smoothing indeed is equivalent to that of convolution. Thus eq. (40.11) can be rewritten as:
g(t) = f(t) * h(t)    (40.12)
As a consequence, a moving average in the time domain is a multiplication in the Fourier domain, namely:

G(ν) = F(ν)H(ν)    (40.13)
where H(ν) is the Fourier transform of the smoothing function. This operation is called filtering, and H(ν) is the filter function. Filtering is further discussed in Section 40.5.3. For the moment it is important to realize that smoothing in the time domain has its complementary operation in the frequency domain and vice versa. Besides the improvement of the signal-to-noise ratio, smoothing also introduces two less desired effects, which are illustrated on a Gaussian peak. In Fig. 40.22a we show the effect of the smoothing of this peak with increasingly larger smoothing windows.
Fig. 40.22. Distortion (h/h0) of a Gaussian peak for various window sizes (indicated within parentheses). (a) Moving average. (b) Polynomial smoothing.
As one can see, the peak becomes lower and broader, but remains symmetrical. The peak remains symmetric because of the symmetry of the filter about its central point. Analog filters are by definition asymmetric, because they can only include data points at the left side of the data point being smoothed (see also Section 40.5.2.4 on exponential smoothing). The applicability of moving averaging, and of smoothing in general, depends on the degree of deformation associated with a certain improvement of the signal-to-noise ratio. Figure 40.22a shows the effect of a moving average applied on a Gaussian peak to which white random noise is added, as a function of the window size. Clearly, for window sizes larger than half the half-height peak width, the peak is broadened and the intensity drops. As a result adjacent peaks may be less resolved. Peak areas, however, remain unaffected. From Fig. 40.22a one can derive the maximally applicable window size (number of data points) to avoid peak deformation, given the scan speed and the digitization rate. Suppose, for instance, that the half-height width of the narrowest peak in an IR spectrum is about 10 cm⁻¹, the digitization rate is 10 Hz and the scan rate is 2 cm⁻¹ per second. Hence, the half-height width is digitized in 50 data points. The largest smoothing window introducing only minor disturbances is, therefore, 25 data points. If the noise is white, the signal-to-noise ratio is improved by a factor of 5. Another unwanted effect of smoothing is the alteration of the frequency characteristics of the noise. This calls for caution. Because low frequencies present in the noise are not removed, the improvement of the signal-to-noise ratio may be limited. This is illustrated in Fig. 40.23, where one can see that after smoothing with a 25-point window low-frequency noise is left. In Section 40.5.3, filtering methods in the frequency domain are discussed, which are capable of removing specific frequencies from the noise.
Fig. 40.23. Polynomial smoothing (noise = N(0,3%)): 5-point; 17-point; 25-point smoothing window and the noise left after smoothing.
40.5.2.3 Polynomial smoothing

The convolution or smoothing function h(t) used in moving averaging is a simple block function. However, one could try to derive somewhat more complex convolution functions giving a better signal-to-noise ratio with less deformation of the underlying deterministic signal. Let us consider a smoothing window with 5 data points. Polynomial smoothing then consists of fitting a polynomial model through the 5 data points by linear regression. In this case a polynomial with a degree equal to zero (horizontal line) up to 4 (no degrees of freedom left) can be chosen. In the latter case the model exactly fits the data points (no residuals) and has no effect on the noise and signal. The fit of a model with a degree equal to zero is equivalent to the moving average. Clearly, one should try to find an optimal value for the degree of the polynomial, given the size of the window, the shape of the deterministic signal and the characteristics of the noise. Unfortunately no hard rules are available for that purpose. Therefore, several polynomial models and window sizes should be tried out. The smoothing procedure consists of replacing the central data point in the window by the value obtained from the model, and repeating the fit procedure by shifting the window one data point until the whole signal has been scanned. For a signal digitized over 1000 data points, 996 regressions over 5 points would have to be calculated. This would be very impractical and computing intensive. Savitzky and Golay [8] derived convolutes h(t) for each combination of degree of the polynomial and size of the window. The effect of a convolution of a signal with these convolutes is the same as fitting the signal with the corresponding polynomial in a moving window. For instance, for a quadratic model and a 5-point window the convolutes are (see Table 40.2):

h(−2) = −3; h(−1) = 12; h(0) = 17; h(+1) = 12; h(+2) = −3; else h(t) = 0

In order to keep the average signal amplitude unaffected, a scaling factor NORM is introduced, which is the sum of all convolutes, here 35. The smoothing procedure is now:
g(t) = Σ f(m)h(m − t) / Σ h(m)
for all m for which f(m) is defined and h(m − t) ≠ 0. The effect of a 5-point, 17-point and 25-point quadratic smoothing of a Gaussian peak with 0.3% noise is shown in Fig. 40.22b.
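The 5-point quadratic convolutes can be applied by direct convolution, as sketched below (Python; assuming SciPy is available, scipy.signal.savgol_filter gives the same result away from the edges, where it uses a different boundary treatment):

    import numpy as np
    from scipy.signal import savgol_filter

    h = np.array([-3.0, 12.0, 17.0, 12.0, -3.0])   # Table 40.2, 5-point quadratic
    NORM = h.sum()                                  # 35

    t = np.linspace(-5, 5, 101)
    rng = np.random.default_rng(1)
    noisy = np.exp(-t**2 / 2) + 0.03 * rng.standard_normal(t.size)

    smoothed = np.convolve(noisy, h / NORM, mode='same')
    smoothed2 = savgol_filter(noisy, window_length=5, polyorder=2)
    print(np.allclose(smoothed[2:-2], smoothed2[2:-2]))   # True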
TABLE 40.2
Convolutes for quadratic and cubic smoothing (adapted from Refs. [8,10])

Points    25    23    21    19    17    15    13    11     9     7     5
-12     -253
-11     -138   -42
-10      -33   -21  -171
-09       62    -2   -76  -136
-08      147    15     9   -51   -21
-07      222    30    84    24    -6   -78
-06      287    43   149    89     7   -13   -11
-05      343    54   204   144    18    42     0   -36
-04      387    63   249   189    27    87     9     9   -21
-03      422    70   284   224    34   122    16    44    14    -2
-02      447    75   309   249    39   147    21    69    39     3    -3
-01      462    78   324   264    42   162    24    84    54     6    12
 00      467    79   329   269    43   167    25    89    59     7    17
 01      462    78   324   264    42   162    24    84    54     6    12
 02      447    75   309   249    39   147    21    69    39     3    -3
 03      422    70   284   224    34   122    16    44    14    -2
 04      387    63   249   189    27    87     9     9   -21
 05      343    54   204   144    18    42     0   -36
 06      287    43   149    89     7   -13   -11
 07      222    30    84    24    -6   -78
 08      147    15     9   -51   -21
 09       62    -2   -76  -136
 10      -33   -21  -171
 11     -138   -42
 12     -253

NORM    5175   805  3059  2261   323  1105   143   429   231    21    35
Peaks are distorted as well, but to a lesser extent than for moving averaging. For a Gaussian peak the window size of a quadratic polynomial smoothing should now be less than 1.5 times the half-height width, compared to the value of 0.5 found for moving averaging. The effect on the noise reduction and the frequencies left in the noise is comparable to the moving average filter. We refer to the work of Enke [9] for a detailed discussion of peak deformation versus signal-to-noise improvement under different circumstances.
Fig. 40.24. Polynomial smoothing: window of 7 data points fitted with polynomials of degrees 0, 1, 2, 3 and 4.
Generally, polynomial smoothing is preferred over moving averaging, because larger windows are allowed before the signal is deformed. The convolutes h(t), adapted according to Steinier [10], are tabulated in Table 40.2 for several degrees of the smoothing polynomial and window sizes. Polynomial models of an even (2n) and odd (2n + 1) degree have the same value for the central point (Fig. 40.24). Therefore, the same convolutes and the same smoothing results are found for a quadratic and a cubic polynomial fit. In the same way as for moving averaging, polynomial smoothing can be represented in the frequency domain as a multiplication (eq. (40.13)). This aspect is further discussed in Section 40.5.4.

40.5.2.4 Exponential smoothing

The principle of exponential averaging has been introduced in Chapters 7 and 20, and is given by the following equation:

x̄_i = (1 − λ)x_i + λx̄_{i−1}    (0 ≤ λ < 1)

Expanding this recursion shows that all foregoing data points contribute to the smoothed value with exponentially decreasing weights, e.g.:

x̄_4 = (1 − λ)x_4 + λx̄_3 = (1 − λ)x_4 + λ(1 − λ)x_3 + λ²(1 − λ)x_2 + λ³(1 − λ)x_1
x̄_5 = (1 − λ)x_5 + λx̄_4 = (1 − λ)x_5 + λ(1 − λ)x_4 + λ²(1 − λ)x_3 + λ³(1 − λ)x_2 + λ⁴(1 − λ)x_1
Fig. 40.25. Effect of exponential smoothing on the data points listed in Table 40.3 (solid line: original data; dotted line: smoothed data).
Fig. 40.26. Effect of exponential smoothing (λ = 0.6) on a Gaussian peak (w½ = 6 data points) (solid line: original data; bold line: smoothed data).
Figure 40.25 shows the effect of exponential smoothing on the data points listed in Table 40.3. As one can see, the filter introduces a slower response to stepwise changes of the signal, as if it were measured with an instrument with a large response time. Because fluctuations are smoothed, the standard deviation of the signal is decreased, in this example from 2.58 to 1.95. A Gaussian peak is broadened and becomes asymmetric by exponential smoothing (Fig. 40.26).
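A minimal sketch of the recursion (Python with NumPy; the initialization of the first smoothed point as the first measurement is an assumption, and the data are arbitrary):

    import numpy as np

    def exponential_smoothing(x, lam):
        """Each smoothed point is (1 - lam)*x[i] + lam*(previous smoothed point)."""
        out = np.empty(len(x))
        out[0] = x[0]                 # assumed initialization
        for i in range(1, len(x)):
            out[i] = (1 - lam) * x[i] + lam * out[i - 1]
        return out

    x = np.array([8.0, 9.0, 6.0, 9.0, 7.0, 2.0, 4.0, 3.0, 1.0, 2.0])
    smoothed = exponential_smoothing(x, lam=0.6)
    print(x.std(), smoothed.std())    # the spread of the signal is reduced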
40.5.3 Signal enhancement in the frequency domain

Instead of smoothing the data directly in the domain in which they were acquired, the signal-to-noise ratio can also be improved by transforming the signal to the frequency domain and eliminating noise frequencies present in the measurements, after which one returns to the original domain. For instance, the power spectrum of the noise of a flame ionization detector (Fig. 40.27) reveals the presence of two dominant frequencies, namely at 2 and at 10 Hz. By substituting all Fourier coefficients at frequencies higher than 5 Hz by zero, all high-frequency noise is eliminated after back-transforming to the time domain. This operation is called filtering, and because in this particular case low frequencies are retained, this filter is called a low-pass filter. Equally, one could define a high-pass filter by setting all low-frequency values equal to zero. Mathematically this operation can be described by the same equation (eq. (40.13)) as derived for polynomial smoothing, namely G(ν) = F(ν)H(ν), where H(ν) is the filter function, which is now defined in the frequency domain. Often used filter functions are:

Low-pass filter: H(ν) = 1 for all ν < ν0, else H(ν) = 0
High-pass filter: H(ν) = 1 for all ν > ν0, else H(ν) = 0
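A minimal sketch of such a cut-off filter (Python with NumPy; the test signal, the 100 Hz sampling rate and the 5 Hz cut-off are illustrative assumptions):

    import numpy as np

    def low_pass(signal, dt, v0):
        """Zero all Fourier coefficients above the cut-off frequency v0 (Hz)."""
        spectrum = np.fft.rfft(signal)
        freqs = np.fft.rfftfreq(len(signal), d=dt)
        spectrum[freqs > v0] = 0.0          # H(v) = 1 for v <= v0, else 0
        return np.fft.irfft(spectrum, n=len(signal))

    dt = 0.01                               # 100 Hz sampling rate
    t = np.arange(0, 2, dt)
    signal = np.sin(2*np.pi*2*t) + 0.3*np.sin(2*np.pi*30*t)
    clean = low_pass(signal, dt, v0=5.0)    # keeps only the 2 Hz component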
Fig. 40.27. Power spectrum of the noise of a flame ionization detector.
ν0 is called the cut-off frequency. H(ν) is referred to in this context as a filter transfer function. Many other filter functions can be designed, e.g. an exponential or a trapezoidal function, or a band-pass filter. As a rule exponential and trapezoidal filters perform better than cut-off filters, because an abrupt truncation of the Fourier coefficients may introduce artifacts, such as the annoying appearance of periodicities in the signal. The problem of choosing filter shapes is discussed in more detail by Lam and Isenhour [11], with references to a more thorough mathematical treatment of the subject. The expression for a band-pass filter is: H(ν) = 1 for ν_min < ν < ν_max, else H(ν) = 0. This filter is particularly useful for removing periodic disturbances of the signal. The effect of a low-pass filter applied on a Gaussian peak is shown in Fig. 40.28 for two cut-off frequencies. The lower the cut-off frequency of the filter, the more noise is removed. However, this increasingly affects the high frequencies present in the signal itself, causing a deformation. On the other hand, the higher the cut-off frequency, the more high-frequency noise is left in the signal. Thus the choice of the cut-off frequency is often a compromise between the noise one wants to eliminate and the deformation of the signal one can accept. The more the frequencies of noise and signal are similar, the more difficult it becomes to improve the signal-to-noise ratio. For this reason it is very difficult to eliminate 1/f noise, because its power increases with lower frequencies.

Fig. 40.28. Effect of a low-pass filter. (a) Original Gaussian signal. (b) FT of (a). (c) Signal (a) filtered with ν0 = 10. (d) Signal (a) filtered with ν0 = 20.

40.5.4 Smoothing and filtering: a comparison

Filtering and smoothing are related and are in fact complementary. Filtering is more complicated because it involves a forward and a backward Fourier transform. However, in the frequency domain the noise and signal frequencies are distinguished, allowing the design of a filter that is tailor-made for these frequency characteristics. Polynomial smoothing is more or less a trial and error operation. It gives an improvement of the signal-to-noise ratio, but the best smoothing function has to be found empirically and there are no hard rules to do so. However, because of its computational simplicity, polynomial smoothing is the preferred method of many instrument manufacturers. By calculating the Fourier transform of the smoothing convolutes derived by Savitzky and Golay one can see that polynomial smoothing is equivalent to low-pass filtering. Figure 40.29 shows the Fourier transforms of the 5-point, 9-point, 17-point and 25-point second-order convolutes given in Table 40.2 (the frequency scale is arbitrarily based on 1024 data points sampled at 1 Hz). Another feature of polynomial smoothing is that smoothing and differentiation (first and second derivative) can be combined in a single step, which is explained in Section 40.5.5.

Fig. 40.29. Fourier spectrum of second-order Savitzky-Golay convolutes. (a) 5-point. (b) 9-point. (c) 17-point. (d) 25-point (arrows indicate cut-off frequencies).

40.5.5 The derivative of a signal

Signals are differentiated for several purposes. Many software packages for chromatography and spectrometry offer routines for determining the peak position and for finding the up-slope and down-slope integration limits of a peak. These algorithms are based on the calculation of the first or second derivative. In NIRA small differences between spectra are magnified by taking the first or second derivative of the spectra. Baseline drifts are eliminated as well. The simplest procedure to calculate a derivative is to take the difference between two successive data points. However, by this procedure the noise is magnified by several orders of magnitude, leading to unacceptable results. Therefore, the calculation of a derivative is usually linked to a smoothing procedure. In principle one could smooth the data first. This requires a double sweep through the data, the first one to smooth the data and the second one to calculate the derivative. However, the smoothing and differentiation can be combined into a single step. To explain this, we recall the way Savitzky and Golay derived the smoothing convolutes by moving a window over the data and fitting a polynomial through the data in the window. The central point in the window is replaced by the value of the polynomial. Instead, one may replace it by the value of the first or second derivative of that polynomial in that point. Savitzky and Golay [8] published convolutes (corrected later on by Steinier [10]) for that operation (see Table 40.5). This procedure is the recommended method for the calculation of derivatives. Fig. 40.30 gives the second derivative of two noisy overlapping Gaussian peaks, obtained with a quadratic 7-point smoothed derivative. The two negative regions (shaded areas in the figure) reveal the presence of two peaks.

Fig. 40.30. Smoothed second derivative (window: 7 data points, second order) according to Savitzky-Golay.
TABLE 40.5
Convolutes for the calculation of the smoothed second derivative (adapted from Ref. [8])

Points    25    23    21    19    17    15    13    11     9     7     5
-12       92
-11       69    77
-10       48    56   190
-09       29    37   133    51
-08       12    20    82    34    40
-07       -3     5    37    19    25    91
-06      -16    -8    -2     6    12    52    22
-05      -27   -19   -35    -5     1    19    11    15
-04      -36   -28   -62   -14    -8    -8     2     6    28
-03      -43   -35   -83   -21   -15   -29    -5    -1     7     5
-02      -48   -40   -98   -26   -20   -48   -10    -6    -8     0     2
-01      -51   -43  -107   -29   -23   -53   -13    -9   -17    -3    -1
 00      -52   -44  -110   -30   -24   -56   -14   -10   -20    -4    -2
 01      -51   -43  -107   -29   -23   -53   -13    -9   -17    -3    -1
 02      -48   -40   -98   -26   -20   -48   -10    -6    -8     0     2
 03      -43   -35   -83   -21   -15   -29    -5    -1     7     5
 04      -36   -28   -62   -14    -8    -8     2     6    28
 05      -27   -19   -35    -5     1    19    11    15
 06      -16    -8    -2     6    12    52    22
 07       -3     5    37    19    25    91
 08       12    20    82    34    40
 09       29    37   133    51
 10       48    56   190
 11       69    77
 12       92

NORM   26910 17710 33649  6783  3876  6188  1001   429   462    42     7
40.5.6 Data compression by a Fourier transform

Sets of spectroscopic data (IR, MS, NMR, UV-Vis) or other data are often subjected to one of the multivariate methods discussed in this book. One of the issues in this type of calculations is the reduction of the number of variables by selecting a set of variables to be included in the data analysis. The opinion is gaining support that a selection of variables prior to the data analysis improves the results. For instance, variables which are little or not correlated to the property to be modeled are disregarded. Another approach is to compress all variables into a few features, e.g. by a principal components analysis (see Section 31.1). This is called feature reduction. Data may also be compressed by a Fourier transform or by one of the transforms discussed later in this chapter. This compression consists of taking the FT of the data and retaining the first n relevant Fourier coefficients. If the data are symmetrically mirrored about the first data point, the FT only consists of real coefficients, which facilitates the calculations (see Section 40.3.8).
Figure 40.31 shows a spectrum of 512 data points, which is reconstructed from respectively the first 2, 4, 8, ..., 256 Fourier coefficients. The effect is more or less comparable to wavelength selection by deleting data points at regular intervals. As a consequence one loses high-frequency information. When the rows of a data table are replaced by the first n relevant Fourier coefficients, the properties of the data table are retained. For instance, the Fourier coefficients of the rows of a two-way data table of mixture spectra remain additive (distributivity property). Similarities and dissimilarities between rows (the objects) are retained as well, allowing the application of pattern recognition [12] and other multivariate operations [13,14].
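A minimal sketch of this compression scheme (Python with NumPy; the test spectrum is an arbitrary sum of two Gaussian bands, and for simplicity the real-valued FFT is used instead of the mirroring of Section 40.3.8):

    import numpy as np

    def ft_compress(spectrum, n_coeff):
        """Keep only the first n_coeff (lowest-frequency) Fourier coefficients."""
        F = np.fft.rfft(spectrum)
        F[n_coeff:] = 0.0
        return np.fft.irfft(F, n=len(spectrum))

    x = np.linspace(0, 1, 512)
    spectrum = np.exp(-(x - 0.3)**2 / 0.002) + 0.5*np.exp(-(x - 0.7)**2 / 0.01)

    for n in (4, 16, 64):
        err = np.linalg.norm(spectrum - ft_compress(spectrum, n))
        print(n, round(float(err), 3))      # reconstruction improves with n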
Fig. 40.31. Data compression by a Fourier transform. (a) A spectrum measured at 512 wavelengths. (b) Spectrum after reconstruction with 2, 4, ..., 256 Fourier coefficients.
40.6 Deconvolution by Fourier transform

In Section 40.4 we mentioned that the distortion introduced by instruments can be modeled by a convolution. Moreover, we demonstrated that noise filtering by either an analog or a digital filter is a convolution process. In some cases the distortion introduced by the measuring device may damage the signal so much that the desired analytical information cannot be derived from it. For instance, in chromatography the peak broadening introduced during the elution process may cause peak overlap and hamper an accurate determination of the peak area. Therefore, one may want to remove the damage mathematically. This process of signal restoration is known as deconvolution or inverse filtering. Deconvolution is the inverse operation of convolution. While convolution is mathematically straightforward, deconvolution is more complicated. It requires an operation in the Fourier domain and a careful design of the inverse filter. The basic deconvolution algorithm follows directly from eq. (40.13):
F(ν) = G(ν)/H(ν)
where G(ν) is the Fourier transform of the damaged signal, F(ν) is the FT of the recovered signal and H(ν) is the FT of the point-spread function. The back transform of F(ν) gives f(x). Thus a deconvolution requires the following three steps:
(1) Calculate the FT of the measured signal and of the point-spread function to obtain respectively G(ν) and H(ν).
(2) Divide G(ν) by H(ν) at corresponding frequency values (according to the rules for the division of two complex numbers), which gives F(ν).
(3) Back-transform F(ν), by which the undamaged signal f(x) is estimated.
The effect of deconvolution applied on a noise-free Gaussian peak is shown in Fig. 40.32a. Unfortunately, as can be seen in Fig. 40.32c, a deconvolution carried out in the presence of noise (s = 1% of the signal maximum) leads to no results at all. This is caused by the fact that two different kinds of damage are present
Fig. 40.32. Deconvolution (result in solid line) of a Gaussian peak (dashed line) for peak broadening ((w½)psf/(w½)G = 1). (a) Without noise. (b) With coloured noise (N(0, 1%), τ_x = 1.5): inverse filter in combination with a low-pass filter. (c) With coloured noise (N(0, 1%), τ_x = 1.5): inverse filter without low-pass filter.
simultaneously, namely signal broadening and noise. The model for the damaged signal, therefore, needs to be expanded to the following expression:

g(x) = f(x) * h(x) + n(x)
The Fourier transform G(ν) of g(x) is given by G(ν) = F(ν)H(ν) + N(ν). Thus the result of applying the inverse filter is:

G(ν)/H(ν) = F(ν) + N(ν)/H(ν)
The unacceptable results of Fig. 40.32c are caused by the term N(ν)/H(ν), which is large for high values of ν (high frequencies). Indeed, H(ν) approaches zero for high frequencies whereas the value of N(ν) does not. The influence of the noise can be limited by combining the inverse filter with a low-pass noise filter, which removes all frequencies larger than a threshold value ν0. In this way one can avoid that the term N(ν)/H(ν) inflates to large values (Fig. 40.32b). We observe that the overall procedure consists of two contradictory operations: one which sharpens the signal by removing the broadening effect of the measuring device, and one which increases broadening because noise has to be removed. Consequently, the broadening effect of the measuring device can only be partially removed. In Section 40.7 we discuss other approaches, such as the Maximum Entropy method and the Maximum Likelihood method, which are less sensitive to noise. An essential condition for performing signal restoration by deconvolution is knowledge of the point-spread function (psf) h(x). In some instances h(x) can be postulated or be determined experimentally by measuring a narrow signal having a bandwidth which is at least 10 times narrower than the width of the point-spread function. The effect of deconvolution is very well demonstrated by the recovery of two overlapping peaks from a composite profile (see Fig. 40.33). The half-height width of the psf was 1.25 times the peak width for the peak systems (a) and (b). For the peak system (c) the half-height widths of the psf and signal were equal. It is still possible to enhance the resolution when the point-spread function is unknown. For instance, the resolution is improved by subtracting the second derivative g''(x) from the measured signal g(x). Thus the signal is restored by ag(x) − (1 − a)g''(x) with 0 < a < 1. This algorithm is called pseudo-deconvolution. Because the second derivative of any bell-shaped peak is negative between the two inflection points (where the second derivative is zero) and positive elsewhere, the subtraction makes the top higher and narrows the wings, which results in a better resolution (see Fig. 40.30). Pseudo-deconvolution methods can correct for symmetric point-spread functions. However, for asymmetric point-spread functions, e.g. the broadening introduced by a slow detector response, pseudo-deconvolution is not applicable.
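The basic procedure of this section, the inverse filter combined with a hard low-pass filter, can be sketched as follows (Python with NumPy; the peak, the point-spread function, the noise level and the cut-off frequency are all illustrative assumptions; the psf is stored with its maximum at the first point so that the restored peak is not shifted):

    import numpy as np

    def deconvolve(g, h, v0):
        """Inverse filter F(v) = G(v)/H(v), zeroing all frequencies above v0
        (in cycles per data point) to keep N(v)/H(v) from blowing up."""
        n = len(g)
        G, H = np.fft.rfft(g), np.fft.rfft(h)
        keep = np.fft.rfftfreq(n) <= v0
        F = np.zeros_like(G)
        F[keep] = G[keep] / H[keep]
        return np.fft.irfft(F, n=n)

    n = 128
    x = np.arange(n)
    f_true = np.exp(-(x - 64)**2 / 18.0)                 # narrow "true" peak
    h = np.exp(-np.minimum(x, n - x)**2 / 8.0)           # psf centred at point 0
    h /= h.sum()
    g = np.fft.irfft(np.fft.rfft(f_true) * np.fft.rfft(h), n=n)   # broadened
    g += 0.001 * np.random.default_rng(3).standard_normal(n)      # plus noise

    restored = deconvolve(g, h, v0=0.1)   # sharper than g, at the cost of ripple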
Fig. 40.33. Restoration of two overlapping peaks by deconvolution. Dashed line: measured data. Solid line: after restoration. Dotted line: difference between true and restored signals.
40.7 Other deconvolution methods

In previous sections a signal was enhanced or restored by applying a processing technique to the signal in order to remove damage from the data. Examples of damage are noise and peak broadening. We have also seen that removing noise and restoring peak broadening are two opposite operations: filtering introduces broadening, whereas peak sharpening introduces noise. In Section 40.6 we mentioned
that the calculation of f(x) from the measured spectrum g(x) by solving g(x) = f(x) * h(x) by deconvolution is a compromise between restoration and filtering. Because of a lack of hard rules, the selection of the noise filter introduces some arbitrariness into the procedure. Another class of methods, such as Maximum Entropy, Maximum Likelihood and Least Squares estimation, do not attempt to undo damage which is already in the data. The data themselves remain untouched. Instead, information in the data is reconstructed by repeatedly taking revised trial data f̂(x) (e.g. a spectrum or chromatogram), which are damaged as they would have been measured by the original instrument. This requires that the damaging process which causes the broadening of the measured peaks is known. Thus an estimate ĝ(x) is calculated from a trial spectrum f̂(x) which is convoluted with a supposedly known point-spread function h(x). The residuals e(x) = g(x) − ĝ(x) are inspected and compared with the noise n(x). Criteria to evaluate these residuals are Maximum Entropy (see Section 40.7.2) and Maximum Likelihood (Section 40.7.1).

40.7.1 Maximum Likelihood

The principle of Maximum Likelihood is that the spectrum f̂(x) is calculated with the highest probability to yield the observed spectrum g(x) after convolution with h(x). Therefore, assumptions about the noise n(x) are made. For instance, the noise n_i in each data point i is random and additive with a normal or any other distribution (e.g. Poisson, skewed, exponential, ...) and a standard deviation s_i. In the case of a normal distribution, the residual e_i = g_i − ĝ_i = g_i − (f̂*h)_i in each data point should be normally distributed with a standard deviation s_i. The probability that (f̂*h)_i represents the measurement g_i is then given by the conditional probability density function P(g_i|f̂):
P(g_i|f̂) = (1/(s_i√(2π))) exp(−(g_i − (f̂*h)_i)² / (2s_i²))
Under the assumption that the noise in point i is uncorrelated with the noise in point j, the likelihood that (f̂*h)_i, for all measurements i, represents the measured set g1, g2, ..., g_n is the product of all probabilities:

L = Π_{i=1}^{n} (1/(s_i√(2π))) exp(−(g_i − (f̂*h)_i)² / (2s_i²))    (40.15)
This likelihood function has to be maximized for the parameters in f. The maximization is to be done under a set of constraints. An important constraint is the knowledge of the peak-shapes. We assume that f is composed of many individual
Fig. 40.34. Signal restoration by a Maximum Likelihood approach.
peaks of known shape. However, we make no assumption about the number and position of the peaks. Because f̂ is non-linear and contains many parameters to be estimated, the solution of eq. (40.15) is not straightforward and should be calculated in an iterative way by a sequential optimization strategy. Figure 40.34 shows the kind of resolution improvement one obtains. Under the normality assumption the Maximum Likelihood and the least squares criteria are equivalent. Thus one can also minimize Σe_i² by a sequential optimization strategy [15].

40.7.2 Maximum Entropy

Before going into detail about the meaning of entropy and maximum entropy, the effect of applying this principle for signal enhancement is shown in Fig. 40.35. As can be seen, the effect is a drastic improvement of the signal-to-noise ratio and an enhancement of the resolution. This effect is thus comparable to what is achieved by the Maximum Likelihood procedure and by inverse filtering. However, the Maximum Entropy technique apparently improves the resolution of the signal without increasing the noise. In physical chemistry, entropy has been introduced as a measure of disorder or lack of structure. For instance, the entropy of a solid is lower than that of a fluid, because the molecules are more ordered in a solid than in a fluid. In terms of probability it also means that in solids the probability distribution of finding a molecule at a given position is narrower than for fluids. This illustrates that entropy has to do with probability distributions and thus with uncertainty. One of the earliest definitions of entropy is the Shannon entropy, which is equivalent to the definition of Shannon's uncertainty (see Chapter 18). By way of illustration we consider two histograms of 20 analytical results, one obtained with a precise method and one obtained with a less precise method (see Table 40.6).
Fig. 40.35. Signal restoration by the Maximum Entropy approach.
On average the method yields a result equal to 100, in the range of 85 to 115. According to Shannon, the uncertainty of the two methods can be expressed by means of:

H = −Σ_i p_i log2 p_i
where p_i is the probability to find a value in class i.
TABLE 40.6
Shannon's uncertainty for two probability distributions (a broad and a narrow distribution)

Intervals          85-90   90-95   95-100  100-105  105-110  110-115
Distribution 1     2       3       5       5        3        2
Probability p_i    0.10    0.15    0.25    0.25     0.15     0.10
−log2 p_i          3.32    2.737   2.00    2.00     2.737    3.32
−p_i log2 p_i      0.332   0.411   0.50    0.50     0.411    0.332     H = 2.486

Distribution 2     0       2       8       8        2        0
Probability p_i    0       0.10    0.40    0.40     0.10     0
−log2 p_i          —       3.323   1.322   1.322    3.323    —
−p_i log2 p_i      0       0.332   0.529   0.529    0.332    0         H = 1.722
Application of this equation to the probability distributions given in Table 40.6 shows that H for the less precise method is larger than for the more precise method. Uniform distributions represent the highest form of uncertainty and disorder. Therefore, they have the largest entropy. We now apply the same principle to calculate the entropy of a spectrum (or any other signal). The entropy S of a spectrum given by the vector y is defined as:

S = −Σ p_i log2 p_i   with p_i = |y_i| / Σ|y_i|
The entropy of a noise spectrum, with equal probability of measuring a certain amplitude at each wavelength, is maximal. When structure is added to the spectrum the entropy decreases. Noise is associated with disorder, whereas structure means more order. In order to get a feeling for the meaning of entropy, we calculated the entropy of some typical spectra: a noise spectrum, the same noisy spectrum to which we added a spike, a noise-free spectrum with one spike, and one with two spikes (Table 40.7). As one can see, noise has the highest entropy, whereas a single spike has no entropy at all. It represents the highest degree of order. As indicated before, the maximum entropy approach does not process the measurements themselves. Instead, it reconstructs the data by repeatedly taking revised trial data (e.g. a spectrum or chromatogram), which are artificially corrupted with measurement noise and blur. This corrupted trial spectrum is thereafter compared with the measured spectrum by a χ²-test. From all accepted spectra the maximum entropy approach selects the spectrum f̂ with minimal structure (which is equivalent to maximum entropy). The maximum entropy approach applied for noise elimination consists of the following steps:
TABLE 40.7
Entropy of noise, noise plus a spike at i = 5, a single spike at i = 5, and two spikes at i = 5 and 6

      Noise                          Noise + spike                  Single spike             Two spikes
i     y      p_i    −p_i log2 p_i   y      p_i    −p_i log2 p_i   y    p_i  −p_i log2 p_i   y   p_i  −p_i log2 p_i
1     0.43   0.141  0.398           0.43   0.095  0.324           0    0    0               0   0    0
2    −0.41   0.135  0.418          −0.41   0.091  0.317           0    0    0               0   0    0
3     0.34   0.112  0.361           0.34   0.075  0.280           0    0    0               0   0    0
4     0.49   0.161  0.424           0.49   0.108  0.347           0    0    0               0   0    0
5     0.01   0.003  0.025           1.50   0.331  0.531           1.5  1    0               1   0.5  0.5
6     0.25   0.082  0.296           0.25   0.055  0.230           0    0    0               1   0.5  0.5
7    −0.42   0.138  0.394          −0.42   0.093  0.320           0    0    0               0   0    0
8    −0.04   0.013  0.081          −0.04   0.009  0.061           0    0    0               0   0    0
9    −0.28   0.092  0.317          −0.28   0.062  0.250           0    0    0               0   0    0
10   −0.37   0.122  0.370          −0.37   0.082  0.297           0    0    0               0   0    0

      H = 3.084                     H = 2.956                     H = 0                    H = 1.0
(1) Start the procedure with a trial spectrum f̂1. If no prior knowledge is available on the spectrum, one starts the iteration process with a structureless noise spectrum. Indeed, in that case there is no evidence to assume a particular structure beforehand. However, prior knowledge may justify the introduction of some extra structure.
(2) Calculate the variance of the differences d between the measurements and the trial spectrum f̂1.
(3) Test whether the variance of these differences (s_d²) is significantly different from the variance of the measurement noise (s_n²) by a χ²-test: χ² = (n − 1)s_d²/s_n² (n data points). For large n, χ²_crit ≈ n.
(4) If the trial spectrum is significantly different from the measured spectrum, the trial spectrum is adapted into f̂2 = f̂1 + Δf, whereafter the cycle is repeated from step (2) (see e.g. [16] for the derivation of Δf), until the spectrum meets the χ² criterion.
(5) By repeating steps (1) to (4) with several 'noise' spectra, a set of spectra is obtained which meet the χ² criterion. All these spectra are marked as 'feasible' spectra.
(6) Finally, from the set of 'feasible' spectra the spectrum with the maximum entropy is selected.
562
The maximum entropy method thus consists of maximizing the entropy under the y^ constraint. An algorithm to maximize entropy is the so-called Cambridge algorithm [16]. When the maximum entropy approach is used for signal restoration a step has to be included between steps (1) and (2) in which the trial spectrum is first convoluted (see Section 40.4) with the point-spread function before calculating and testing the differences with the measured spectrum. The entropy of the trial spectrum before convolution is evaluated as usual.
40.8 Other transforms 40.8,1 The Hadamard transform In Section 40.3.2 we mentioned that the Fourier coefficients A^ and B^ can be calculated by fitting eq. (40.1) to the signaly(0 by a least squares regression. This fit is represented in a matrix notation as given in Fig. 40.36. The vector X represents the measurements, whereas A and B are vectors with respectively the real and imaginary Fourier coefficients. The columns of the two matrices are the sine and cosine functions with increasing frequency. These sines and cosines constitute a base of orthogonal functions. This representation also shows the resemblance of a FT with PC A. The measurement vector which initially contains A^ features (e.g. wavelengths) is reduced to a vector with n < N features by a projection on a smaller orthogonal sub-space defined by the n columns in the transform matrix. In PCA these n columns are the n principal components and in FT these columns are sines and cosines. Depending on the properties of these columns, the scores have a specific meaning, which in the FT are the Fourier coefficients. In theory, any base of orthogonal functions can be selected to transform the data. A base which is related to the cosine and sine functions is a series of orthogonal block signals with increasing frequency (Fig. 40.37). Any signal can be decomposed in a series of block functions, which is called the 1 2 B,
B„ Nl Fig. 40.36. Matrix representation of a Fourier transform.
563
6 H
~i
I
12
~T
16
1
1
20
r
— I —
24
Fig. 40.37. A base of block signals. FHT
J
L
16
32
"^~>^V
64
128
i\
256
Fig. 40.38. Spectrum given in Fig. 40.31 reconstructed with 2, 3,..., 256 Hadamard coefficients.
564
Hadamard transform [17]. For example the IR spectrum (512 data points) shown in Fig. 40.31a is reconstructed by the first 2, 4, 8, ... 256 Hadamard coefficients (Fig. 40.38). In analogy to spectrometers which directly measure in the Fourier domain, there are also spectrometers which directly measure in the Hadamard domain. Fourier and Hadamard spectrometers are called non-dispersive. The advantage of these spectrometers is that all radiation reaches the detector whereas in dispersive instruments (using a monochromator) radiation of a certain wavelength (and thus with a lower intensity) sequentially reaches the detector. 40.8,2 The time-frequency Fourier transform A common feature of the Fourier and Hadamard transform is that they describe an overall property of the signal in the measurement range, ^ = 0 to ^ = 7^. However, one may be interested in local features of the signal. For instance, it may well be that at the beginning of the signal the frequencies are much higher than at the end of the signal as shows Fig. 40.39. This is certainly true when the signal contains noise and peaks with different peak widths. In Fig. 40.39 there are regions with a high, low and intermediate frequency. One way to detect these local features is by calculating the FT in a moving window of size T^, and to observe the
Fig. 40.39. Signal with local frequency features.
565
i+1,w i+1,w Fig. 40.40. The moving window FT principle.
evolution of the Fourier coefficients as a function of the position of the moving window. In this example when the centre of the window coincides with the position of one of the peaks, the low frequency components are dominant, whereas in an area of noise the high frequencies become dominant. This means that peaks are detected by monitoring the Fourier coefficients as a function of the position of the moving window. The procedure of the moving FT is schematically shown in Fig. 40.40. At each position / of the window (size = T^) a filter function h(i) is defined by which the signal/(0 is multiplied before the FT is calculated. In general, this is expressed as follows: F{v,a) =
Fmh{a)]
where F is the symbol for the FT and a refers to the filter transfer function h(a). For each a, n Fourier coefficients are obtained, which can be arranged in matrix of A^ and B, coefficients: ^1,0 • • • ^ l , n ^ l , 0
"•^\,n
^ 2 , 0 '"^2,n^2,0
"^2,n
•^a,n^a,0
"^a,n
^a,0
The columns of this matrix contain the time information (amplitudes at a specific frequency as a function of time) and the rows the frequency information.
566
40.8.3 The wavelet transform Another transform which provides time and frequency information is the wavelet transform (WT). By analogy with the Fourier transform, the WT decomposes a signal into a set of basis functions, called a wavelet basis. In FT the basis functions are the cosine and sine function. The wavelet basis is also a function, called the analyzing wavelet. Frequently applied analyzing wavelets are the Morlet and Daubechies wavelets [18], of which the Haar wavelet is a specific member (Fig. 40.41). A series of wavelets is generated by stretching and shifting the wavelet over the data. The shift b is called a translation and the stretching or widening of the basis wavelet with a factor a is called a dilation. Suppose for instance that the analyzing wavelet is a function h(t). A series of wavelets h^u(t) is then generated by introducing a translation b and a dilation a, according to ^Ja
K a
A series of Morlet wavelets for various dilation values is shown in Fig. 40.42. In a similar way as for the FT, where only frequencies are considered which fit an exact number of times in the measurement time, here, only dilation values are considered which are stretched by a factor of two. A wavelet transform consists of fitting the measurements with a basis of wavelets as shown in Fig. 40.42, which are generated by stretching and shifting a mother wavelet h(t). The narrowest wavelet (level a^) is shifted in small steps, whereas a broad wavelet is shifted in bigger steps. The shift b is usually a multiple of the dilation value. The fitting of these basis of wavelets on the data yields the wavelet transform coefficients. Coefficients associated with narrow wavelets describe the local features in a signal, whereas the broad wavelets describe the smooth features in the signal. a)
c)
A
Fig. 40.41. The Haar wavelet (a) and three Daubechies wavelets (b-d).
567
Fig. 40.42. A family of Morlet wavelets with various dilation values.
To transform measurements available in a discrete form a discrete wavelet transform (DWT) is applied. Condition is that the number of data is equal to 2". In the discrete wavelet transform the analyzing wavelet is represented by a number of coefficients, called wavelet filter coefficients. For instance, the first member (smallest dilation a and shift b = 0)of the Haar family of wavelets is characterized by two coefficients Cj = 1 and C2 = 1. The next one with a dilation 2a is characterized by four coefficients: Cj = 1, C2 = 1, C3 = 1 and C4 = 1. Generally, the wavelet member n is characterized by l"" coefficients. The widest wavelet considered is the one for which 2" is equal to N, the number of measurements. The value of n defines the level of the wavelet. For instance, forn = 2 the level 2 wavelet is obtained. For each level, a transform matrix is defined in which the wavelet filter coefficients are arranged in a specific way. For a signal containing eight data points (arranged in a 8x1 column vector) and level 1, the transform matrix has the following form: 0
0
0
C2
0 0
0
0
0
0
0
Cl
C2
0
0
0
0
0
0
c,
C-,
C)
C2
0
0
0 0
0
Cl
0
0
0
Multiplication of this 4x8 transformation matrix with the 8x1 column vector of the signal results in 4 wavelet transform coefficients or N/2 coefficients for a data vector of length A^. For Cj = C2 = C3 = C4 = 1, these wavelet transform coefficients are equivalent to the moving average of the signal over 4 data points. Consequently,
568
these wavelet filter coefficients define a low-pass filter (see Section 40.5.3), and the resulting wavelet transform coefficients contain the 'smooth' information of the signal. For this reason this set of wavelet filter coefficients is called the approximation coefficients and the resulting transform coefficients are the a-components. The transform matrix containing the approximation coefficients is denoted as the G-matrix. In the above example with 8 data points, the highest possible transform level is level 3 (8 non-zero coefficients). The result of this transform is the average of the signal. The level zero (1 non-zero coefficient) returns the signal itself Besides this first set of coefficients, a second set of filter coefficients is defined which is the equivalent of a high-pass filter (see Section 40.5.3) and describes the detail in the signal. The high-pass filter uses the same set of wavelet coefficients, but with alternating signs and in reversed order. These coefficients are arranged in the H-matrix. The H-matrix for the level two transform of a signal with length 8 is: Cl
-C\
0
0
0
0
C2
0
0
0
0
-C\
0 0
0 0
0 0
0 0
0
0
Cl
-C|
0
0
0
0
0
0
C-,
-c
The coefficients in the H-matrix are the detail coefficients. The output of the H-matrix are the ^-components. With Nil detail components and Nil approximation components, we are able to reconstruct a signal of length N. The discrete wavelet transform can be represented in a vector-matrix notation as: a = W'^f
(40.16)
where a contains N wavelet transform coefficients, W is an A^xA^ orthogonal matrix consisting of the approximation and detail coefficients associated to a particular wavelet and f is a vector with the data. The action of this matrix is to perform two related convolutions, one with a low-pass filter G and one with a high-pass filter H. The output of G is referred to as the smooth information and the output of H may be regarded as the detail information. By way of illustration we consider a sequence of a discrete sample of 16 points, taken from Walczak [19], F = [0 0.2079 0.4067 0.5878 0.7431 0.8660 0.9511 0.9945 0.9945 0.9511 0.8660 0.7431 0.5878 0.4067 0.2079 0.0] which is fitted with a Haar wavelet at level a\ We first define the 16x16 matrix W of the wavelet filter coefficients equal to:
569 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 - 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 .
Rows 1-8 are the approximation filter coefficients and rows 9-16 represent the detail filter coefficients. At each next row the two coefficients are moved two positions (shift b equal to 2). This procedure is schematically shown in Fig. 40.43 for a signal consisting of 8 data points. Once W has been defined, the a^ wavelet transform coefficients are found by solving eq. (40.16), which gives: 1 0 0 0 0 0 0 0 /1/2 1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 1-10 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 - 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -
0 ro.oooo" 0 0 0 0.5878 0 0 0 0.7431 0 0.8660 0 0.9511 1 0.9945 0 0.9945 0 0.9511 0 0.8660 0 0 0.7431 0 0 0.5878 0 0.4067 0 0.2079 1 [o.oooo_ 0.2079 0.4067
" 0.1470" 0.7032 1.1378 1.3757 1.3757 1.1378 0.7032 0.1470 -0.1470 -0.1281 -0.0869 -0.0307 0.0307 ' 0.0869 0.1281 . 0.1470 J
The factor •\ll/ 2 is introduced to keep the intensity of the signal unchanged. The 8 first wavelet transform coefficients are the a or smooth components. The last eight coefficients are the d or detail components. In the next step, the level 2 components are calculated by applying the transformation matrix, corresponding to the a^ level on the original signal. This a^ transformation matrix contains 4 wavelet filter
570
original data detail
approximation rn
m
Ti
rrm
iTTTi
m
1 ,
[,
\,
J
\
1
m
1 iTi
IJ 11FT]
• iJ Fig. 40.43. Waveforms for the discrete wavelet transform using the Haar wavelet for an 8-points long signal with the scheme of Mallat' s pyramid algorithm for calculating the wavelet transform coefficients.
coefficients, which is a doubUng of the width of the wavelet. This wavelet is shifted four positions instead of two at the previous level. In our example this leads to a transform matrix with four approximation rows and four detail rows with 16 elements. Multiplication of this matrix with the 16-points data vector results in a vector with four a and four d components. The level 2 coefficients a^ are equal to: ro.oooo" 02079 O4067
/1/4
1 1 0 0 0 0 0 0 i -1
1 1 0 0 0 0 0 0 1 -1
0 0 0
0 0 0
0 0 0
0 0 0
0 1 0 0
0 1 0 0
0 0 1 -1 0 0 0 0
0 1 0 0 0
0 1 0 0
0 1 -1 0 0 0 0
0 0 1 0
0 0 1 0
0 0 0 0 1 -1 0 0
0 0 1 0
0 0 1 0
0 0 0 0 1 -1 0 0
0 0 0 1
o]
05878 07431 08660 0.9511
0 0 0 1
0 0 0 1
0
0
0
0
09945
0 0 0 0 I -1
0 0 1
0 0
0.9511 O8660 07431
0 0 1 09945
-ij
05878 0.4067 02079
LoooooJ
O6012 1.7773 1.7773 O6012 -0.3922 -0.1682 01682 03933
571
The same result is obtained by multiplying the vector of a coefficients obtained in the previous step with an 8x8 a' level transform matrix:
ViTI
' 0.6012"
0
0
0 0
1 1 0 0 0 0 1 1
0
0 [0.1470' 0 0.7032
1.7773
0
0
0
0
0 1
1.1378
0
0 1
1.3757
0.6012
1 - 1 0 0 0 1 0 0 0 0 0 0
0
0
0
0
0
1.3757
-0.3933
0 0 1 0 0 1
0
1.1378 0.7032
-0.1682
1
1 0
0 0 0
0
0
- 1 0 0 0
1 0
0
- 1 L0.1470_
1.7773
0.1682 _ 0.3933
This is the principle of the pyramidal algorithm developed by Mallat [20], which is computationally more efficient. Continuing the calculations according to this algorithm, the four a components are input to a 4x4 a^ level transformation matrix, giving the level-3 components:
^n/2
1 1 0 0 ] [0.6012" 0 0 1 1 1.7773 1-10
0
1.7773
0 0 1 -ij [0.6012
1.6819" 1.6818 -0.8316 0.8316J
and finally the level-4 coefficients (a^) are calculated according to:
^fU2
1.6819
2.3785
1 - 1 1.6818
0.0000
1
1
Having a closer look at the pyramid algorithm in Fig. 40.43, we observe that it sequentially analyses the approximation coefficients. When we do analyze the detail coefficients in the same way as the approximations, a second branch of decompositions is opened. This generalization of the discrete wavelet transform is called the wavelet packet transform (WPT). Further explanation of the wavelet packet transform and its comparison with the DWT can be found in [19] and [21]. The final results of the DWT applied on the 16 data points are presented in Fig. 40.44. The difference with the FT is very well demonstrated in Fig. 40.45 where we see that wavelet a^ describes the locally fast fluctuations in the signal and wavelet a^ the slow fluctuations. An obvious application of WT is to denoise spectra. By replacing specific WT coefficients by zero, we can selectively remove
572
approximations
details
J ^ ^
2^^.
Fig. 40.44. Wavelet decomposition of a 16-point signal (see text for the explanation).
50
100
150
X'O
250
300
350
400
460
500
QI—•^'^—^^A^-*^ 50
100 150 200 250
50
100 150 200 250
0.2
V
-0.2
123
Hri 128
Fig. 40.45. Wavelet decomposition of a signal with local features.
noise from distinct areas in the signal without disturbing other areas [22]. Mittermayr et al. [23] compared the wavelet filters to Fourier filters and to polynomial smoothers such as the Savitzky-Golay filters. Wavelets have been applied to analyze signals arising from several areas, as acoustics [24], image processing [22], seismics [25] and analytical signals [23,26, 27]. Another obvious application is to use wavelets to detect peaks in a noisy signal. Each sudden change of the signal by the appearance of a peak results in a
573
wavelet coefficient at that position [27]. Recently, it has been shown that signals can be compressed to a fairly small number of coefficients without much loss of information. Bos et al. [26] applied this property to compress IR spectra by a factor of 20 prior to a classification by a neural net. Feature reduction by wavelet transform for multivariate calibration has been studied by Jouan-Rimbaud et al. [28]. References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19.
F.C. Strong III, How the Fourier transform infrared spectrophotometer works. J. Chem. Educ, 56(1979)681-684. R. Brereton, Tutorial: Fourier transforms. Use, theory and applications to spectroscopic and related data. Chem. Intell. Lab. Syst., 1 (1986) 17-31. R.N. Bracewell, The Fourier Transform and its Applications. 2nd rev. ed., McGraw-Hill, New York, 1986. G. Doetsch, Anleitung zum praktischen gebrauch der Laplace-transformation und der Ztransformation. R. Oldenbourg, Munchen 1989. K. Schmidt-Rohr and H.W. Spiess, Multidimensional Solid-state NMR and Polymers. Academic Press, London 1994, pp. 141. J.W. Cooley and J.W. Tukey, An algorithm for the machine calculation of complex Fourier series. Math. Comput., 19 (1965) 297-301. E.G. Brigham, The Fast Fourier Transform. Prentice-Hall, Englewood Cliffs NJ, 1974. A. Savitzky and M.J.E. Golay, Smoothing and differentiating of data by simplified leastsquares procedures. Anal. Chem., 36 (1964) 1627-1639. C.G. Enke and T.A. Nieman, Signal-to-noise ratio enhancement by least-squares polynomial smoothing. Anal. Chem., 48 (1976) 705A-712A. J. Steinier, Y. Termonia and J. Deltour, Comments on smoothing and differentiation of data by simplified least squares procedure. Anal. Chem., 44 (1972) 1906-1909. B. Lam and T.L. Isenhour, Equivalent width criterion for determining frequency domain cutoffs in Fourier transform smoothing. Anal. Chem., 53 (1981) 1179-1182. W. Wu, B. Walczak, W. Pennincks and D.L, Massart, Feature reduction by Fourier transform in pattern recognition of NIR data. Anal. Chim. Acta, 331 (1996) 75-83. L. Pasti, D. Jouan-Rimbaud, D.L. Massart and O.E. de Noord, Application of Fourier transform to multivariate caUbration of Near Infrared Data. Anal. Chim. Acta, 364 (1998) 253-263. W.F. McClure, A. Hamid, E.G. Giesbrecht and W.W. Weeks, Fourier analysis enhances NIR diffuse reflectance spectroscopy. Appl. Spectrosc, 38 (1988) 322-329. L.K. DeNoyer and J.G. Dodd, Maximum Likelihood deconvolution for spectroscopy and chromatography. Am. Lab., 23 (1991) D24-H24. E.D. Laue, M.R. Mayger, J. Skilling and J. Staunton, Reconstruction of phase-sensitive twodimensional NMR-spectra by maximum entropy. J. Magn. Reson., 68 (1986) 14-29. S.A. Dyer, Tutorial: Hadamard transform spectrometry. Chemom. Intell. Lab. Syst., 12 (1991) 101-115. I. Daubechies, Orthonormal bases of compactly supported wavelets. Comm. Pure Appl. Math., 41(1988) 909-996. B. Walczak and D.L.Massart, Tutorial: Noise suppression and signal compression using the wavelet packet transform. Chemom. Intell. Lab. Syst., 36 (1997) 81-94.
574 20. 21. 22. 23. 24. 25. 26. 27. 28.
I. Daubechies, S. Mallat and A.S. Willsky, Special issue on wavelet transforms and multiresolution signal analysis. IEEE Trans. Info Theory, 38 (1992) 529-531. B. Walczak, B. van den Bogaert and D.L. Massart, Application of wavelet packet transform in pattern recognition of near-IR data. Anal. Chem., 68 (1996) 1742-1747. S.G. Nikolov, H. Hutter and M. Grasserbauer, De-noising of SIMS images via wavelet shrinkage. Chemom. Intell. Lab. Syst., 34 (1996) 263-273. C.R. Mittermayr, S.G. Nikolov, H. Hutter and M. Grasserbauer, Wavelet denoising of Gaussian peaks: a comparative study. Chemom. Intell. Lab. Syst., 34 (1996) 187-202. R. Kronland-Martinet, J. Morlet and A. Grossmann, Analysis of sound patterns through wavelet transforms. Int. J. Pattern Recogn. Artif. Intell., 1 (1987) 273-302. P. Goupillaud, A. Grossmann and J. Morlet, Cycle-octave and related transforms in seismic signal analysis. Geoexploration, 23 (1984) 85-102. M. Bos and J. A.M. Vrielink, The wavelet transform for pre-processing IR spectra in the identification of mono- and di-substituted benzenes. Chemom. Intell. Lab. Syst., 23 (1994) 115-122. M. Bos and E. Hoogendam, Wavelet transform for the evaluation of peak intensities in flowinjection analysis. Anal. Chim. Acta, 267 (1992) 73-80. D. Juan-Rimbaud, B. Walczak, R.J. Poppi, O.E. de Noord and D.L. Massart, Application of wavelet transform to extract the relevant component from spectral data for multivariate calibration. Anal. Chem., 69 (1997) 4317-4323.
Additional recommended reading C.K. Chui, Introduction to Wavelets. Academic Press, Boston, 1991. D.N. Rutledge (Ed.), Signal Treatment and Signal Analysis in NMR. Elsevier, Amsterdam, 1996. B.K. Alsberg, A.M. Woodward and D.B. Kell, An introduction to wavelet transforms for chemometricians: a time-frequency approach. Chemom. Intell. Lab. Syst., 37 (1997) 215-239. F. Dondi, A. Betti, L. Pasti, M.C. Pietrogrande and A. FeHnger, Fourier analysis of multicomponent chromatograms — application to experimental chromatograms. Anal. Chem., 65 (1993) 2209-2222.
575
Chapter 41
Kalman Filtering 41.1 Introduction Linear regression and non-linear regression methods to fit a model through a number of data points have been discussed in Chapters 8, 10 and 11. In these regression methods all data points are collected first followed by the estimation of the parameters of a postulated model. The validity of the model is checked by a statistical evaluation of the residuals. Generally, the same weight is attributed to all data points, unless a weighing procedure is applied on the residuals, e.g., according to the inverse of the variance of the experimental error (see Section 8.2.3.2). In this chapter we discuss an alternative method of estimating the parameters of a model. There are two main differences from the regression methods discussed so far. First, the parameters are estimated during the collection of the data. Each time a new data point is measured, the parameters of the model are updated. This procedure is called recursive regression. Because at the beginning of the measurement-estimation process, the model is based on a few observations only, the estimated model parameters are imprecise. However, they are improved as the process of data collection and updating proceeds. During the progress of the measurement-estimation sequence, more data points are measured leading to more precise estimates. The second difference is that the values of the model parameters are allowed to vary during the measurement-estimation process. An example is the change of the concentrations during a kinetics experiment, which is monitored by UV-Vis spectrometry. Multicomponent analysis is traditionally carried out by measuring the absorbance of a sample at a number of wavelengths (at least equal to the number of components which are analyzed) and calculating the unknown concentrations (see Chapter 10). These concentrations are the parameters of the model. The basic assumption is that during the measurement of the absorbances of the sample at the selected wavelengths, the concentrations of the compounds in the mixture do not vary. However, if the measurements are carried out during a kinetics experiment, the concentrations may vary in time, and as a result the composition of the sample varies during the measurements. In this case, we cannot simply estimate the unknown concentrations by multiple linear regression as explained in Chapter 10. In order to estimate the concentrations of the
576
compounds as a function of time during the data acquisition, two models are needed, the Lambert-Beer model and the kinetics model. The Lambert-Beer model relates the measurements (absorbances) to the concentrations. This model is called the measurement model {A =J{a,c)). The kinetics model describes the way the concentrations vary as a function of time and is called the system model (c = J{k,t)). In this particular instance, the system model is an exponential function in which the reaction rate k is the regression parameter. The terms * systems' and 'states' are associated with the Kalman fdter. The system here is the chemical reaction and its state at a given time is the set of concentrations of the compounds at that time. The output of the system model is the state of the system. The measurement and system models fully describe the behaviour of the system and are referred to as the state-space model. Thus the system and measurement models are connected. The parameters of the measurement model (in our example, the concentrations of the reactants and the reaction product) are the dependent variables of the system model, in which the reaction rate is the regression parameter and time is the independent variable. Later on we explain that system models may also contain a stochastic part. It should be noted that this dual definition implies that two sets of parameters are estimated simultaneously, the parameters of the measurement model and the parameters of the systems model. In this chapter we introduce the Kalman filter, with which it is possible to estimate the actual values of the parameters of a state-space model, e.g., the rate of a chemical reaction from the evolution of the concentrations of a reaction product and the reactants as a function of time which in turn are estimated from the measured absorbances. Let us consider an experiment where a flow-through cell is connected to a reaction vessel, in which the reaction takes place. During the reaction, which may be fast with respect to the scan speed of the spectrometer, a spectrum is measured. At r = 0 the measurement of the spectrum is started, e.g., at 320 nm in steps of 2 nm per second. Every second (2 nm) estimates of the concentrations of the compounds are updated by the Kalman filter. When arrived at the end of the spectral range (e.g. 600 nm) we may continue the measurement process as long as the reaction takes place, by reversing the scan of the spectrometer. As mentioned before, the confidence in the estimates of the model parameters improves during the measurement-estimation process. Therefore, we want to prevent a new but deviating measurement to influence the estimated parameters too much. In the Kalman filter this is implemented by attributing a lower weight to new measurements. An important property of a Kalman filter is that during the measurement and estimation process, regions of the measurement range can be identified where the model is invalid. This allows us to take steps to avoid these measurements affecting the accuracy of the estimated parameters. Such a filter is called the adaptive Kalman fdter. An increasing number of applications of the Kalman filter
577
has been published, taking advantage of the formulation of a systems model which describes the dynamic behaviour of the parameters in the measurement model and exploiting the adaptive properties of the filter. In this chapter we discuss the principles of the Kalman filter with reference to a few examples from analytical chemistry. The discussion is divided into three parts. First, recursive regression is applied to estimate the parameters of a measurement equation without considering a systems equation. In the second part a systems equation is introduced making it necessary to extend the recursive regression to a Kalman filter, and finally the adaptive Kalman filter is discussed. In the concluding section, the features of the Kalman filter are demonstrated on a few applications.
41.2. Recursive regression of a straight line Before we introduce the Kalman filter, we reformulate the least-squares algorithm discussed in Chapter 8 in a recursive way. By way of illustration, we consider a simple straight line model which is estimated by recursive regression. Firstly, the measurement model has to be specified, which describes the relationship between the independent variable x, e.g., the concentrations of a series of standard solutions, and the dependent variable, 3;, the measured response. If we assume a straight line model, any response y^ is described by: yi = ^0 + ^1 ^i + ^i
bg and b^ are the estimates of the true regression parameters % and Pj, calculated by linear regression and e^ is the contribution of measurement noise to the response. In matrix notation the model becomes: y. = xj h + e^ where x- is a [2x1] column vector 1
and b is [2x1] column vector
h A. A recursive algorithm which estimates PQ and pj has the following general form: New estimate = Previous estimate + Correction
578
After each new observation, the estimates of the model parameters are updated (= new estimate of the parameters). In all equations below we treat the general case of a measurement model with p parameters. For the straight line model /? = 2. An estimate of the parameters b based ony - 1 measurements is indicated by b(/ - 1). Let us assume that the parameters are recursively estimated and that an estimate b(/ - 1) of the model parameters is available fromj - 1 measurements. The next measurement y(j) is then performed at x(/), followed by the updating of the model parameters to b(/). The first step of the algorithm is to calculate the innovation (I), which is the difference between measured y(j) and predicted response y(j) at x(/). Therefore, the last estimate b(/ - 1) of the parameters is substituted in the measurement model in order to forecast the response y(j), which is measured at x(j):
y(j) = bo(j-l) + b,(j-l)x(j)or >;(/•) = xT(/*)b(/'-l)
(41.1)
The innovation /(/) (not to be confused with the identity matrix I) is the difference between measured and predicted response at x(j). Thus /(/) = y(j) - yij)The value of the innovation is used to update the estimates of the parameters of the model as follows: b(7) = b ( 7 - l ) + k ( ; ) / ( j ) px\
px\
pxl
(41.2)
1x1
Equation (41.2) is the first equation of the recursive algorithm. k(/) is a/7xl vector, called the gain vector. Looking in more detail at this gain vector, we see that it weights the innovation. For k equal to the null vector, b(/) = b(/ - 1), leaving the model parameters unadapted. The larger the k, the more weight is attributed to the innovation and as a consequence to the last observation. Therefore, one may intuitively understand that the gain vector depends on the confidence we have in the estimated parameters. As usual this confidence is expressed in the variancecovariance matrix of the parameters. Because this confidence varies during the estimation process, it is indicated by P(/) which is given by:
nj)-One expects that during the measurement-prediction cycle the confidence in the parameters improves. Thus, the variance-covariance matrix needs also to be updated in each measurement-prediction cycle. This is done as follows [1]: P(7)=P(;-l)-k(;)xT(7)P(7-l) pxp
pxp
pxl
Ixp
pxp
(41.3)
579
Equation (41.3) is the second equation of the recursive filter, which expresses the fact that the propagation of the measurement error depends on the design (X) of the observations. Once the fitting process is complete the square root of the diagonal elements of P give the standard deviations of the parameter estimates. Key factor in the recursive algorithm is the gain vector k, which controls the updating of the parameters as well as the updating of the variance-covariance matrix P. The gain vector k is the most affected by the variance of the experimental error r(j) of the new observation y(j) and the uncertainty PO - 1) in the parameters b(/ - 1). When the elements of PO* - 1) are much larger than r(/), the gain factor is large, otherwise it is small. After each new measurement j the gain vector is updated according to eq. (41.4) k(7) = P ( y - l ) x ( y ) ( x T ( y ) P ( y - l ) x ( ; ) + r ( ; ) ) - i pxl
pxp
pxl
\xp
pxp
px\
(41.4)
1x1
The expression x^(/)P(/ - l)x(/) in eq. (41.4) represents the variance of the predictions, y(j), at the value x(j) of the independent variable, given the uncertainty in the regression parameters P(/). This expression is equivalent to eq. (10.9) for ordinary least squares regression. The term r(j) is the variance of the experimental error in the response y(j). How to select the value of r(j) and its influence on the final result are discussed later. The expression between parentheses is a scalar. Therefore, the recursive least squares method does not require the inversion of a matrix. When inspecting eqs. (41.3) and (41.4), we can see that the variancecovariance matrix only depends on the design of the experiments given by x and on the variance of the experimental error given by r, which is in accordance with the ordinary least-squares procedure. Typically, a recursive algorithm needs initial estimates for b(0) and P(0) to start the iteration process and an estimate of r(j) for all j during the measurementestimation process. When no prior information on the regression parameters is available b(0) is usually set equal to zero. In many textbooks [1], it is recommended to choose for P(0) a diagonal matrix with very large diagonal elements, expressing the large uncertainty in the chosen starting values b(0). As explained before, r(j) represents the variance of the experimental error in observation y(j). Although the choice of P(0) and r(j) introduces some arbitrariness in the calculation, we show in an example later on that the gain vector (k) which fully controls the updating of the estimates of the parameters is fairly insensitive for the chosen values r(/) and P(0). However, the obtained final value of P depends on a correct estimation of r(j). Only when the value of r(j) is equal to the variance of the experimental error, does P converge to the variance-covariance matrix of the estimated parameters. Otherwise, no meaning can be attributed to P. In summary, after each new measurement a cycle of the algorithm starts with the calculation of the new gain vector (eq. (41.4)). With this gain vector the variance-
580
Iteration step
Fig. 41.1. Evolution of the innovation during the recursive estimation process (iteration steps) (see Table 41.1).
covariance matrix (eq. (41.3)) is updated and the estimates of the parameters are updated (eq. (41.2)). By monitoring the innovation sequence, the stability of the iteration process can be followed. Initially, the innovation shows large fluctuations, which gradually fade out to the level expected from the measurement noise (Fig. 41.1). An innovation which fails to converge to the level of the experimental error indicates that the estimation process is not completed and more observations are required. However, P(0) also influences the convergence rate of the estimation process as we show with the calibration example discussed below. By way of illustration, the regression parameters of a straight line with slope = 1 and intercept = 0 are recursively estimated. The results are presented in Table 41.1. For each step of the estimation cycle, we included the values of the innovation, variance-covariance matrix, gain vector and estimated parameters. The variance of the experimental error of all observations y is 25 10"^ absorbance units, which corresponds to r = 25 10"^ au for a l l / The recursive estimation is started with a high value (10^) on the diagonal elements of P and a low value (1) on its off-diagonal elements. The sequence of the innovation, gain vector, variance-covariance matrix and estimated parameters of the calibration lines is shown in Figs. 41.1^1.4. We can clearly see that after four measurements the innovation is stabilized at the measurement error, which is 0.005 absorbance units. The gain vector decreases monotonously and the estimates of the two parameters stabilize after four measurements. It should be remarked that the design of the measurements fully defines the variance-covariance matrix and the gain vector in eqs. (41.3) and (41.4), as is the case in ordinary regression. Thus, once the design of the experiments is chosen
581 TABLE 41.1 Recursive estimation of the parameters BQ and Z?^ of a straight line (see text for symbols; y are the measurements). InitiaUsation: bT(0) = [0 0]; r = 25 10-^; P^^(0) = P2,2(^)= 1000; Fi 2CO) = PiA^) = 1 J
x(/')
y(j)
y(j)
I(j)
h(j)
1
m
1 0.1
0.101
0
0.101
0.09999 0.01010
0.99000 0.09998
2
1 0.2
0.196
0.102
0.094
0.0060 0.9500
3
1 0.3
0.302
0.291
0.011
4
1 0.4
0.403
0.401
5
1 0.5
0.499
6
1 0.6
0.608
7
1 0.7
0.703
0.706
8
1 0.8
0.801
9
1 0.9
10 1 1.0
P(/") 9.8990 -98.99
-98.99 989.9
-1.0000 9.99995
1.37 10"" -8.09 lO""
-8.16 10-" 5.249 10"'
-0.0024 1.0072
-0.7307 5.2029
-6.0 10"' -2.61 10-"
-2.62 10-" 1.303 10"'
-0.002
-0.0032 1.0138
-0.5254 3.0754
3.73 10-' -1.26 10-"
-1.26 10-" 5.06 10^
0.504
-0.005
-0.0012 1.0042
-0.4085 2.0236
2.68 10-' -7.4 10"'
-7.41 10"' 2.49 10""
0.601
0.007
-0.0035 1.0138
-0.334 1.433
2.10 10-' -4.88 10'
-4.89 10"'
-0.003
-0.0025 1.0104
-0.2838 1.0694
1.7 10"' -3.5 10'
-3.5 10"' 8.78 10"'
0.806
-0.005
-0.0014 1.0064
-0.2468 0.8290
1.5 10' -2.6 10'
-2.6 10"' 5.84 10'
0.897
0.904
-0.007
-0.0002 1.0015
-0.2185 0.6618
1.27 10' -2.02 10'
-2.02 10"' 4.08 10"'
1.010
1.002
0.0082 -0.0014 1.0060
-0.1962 0.5408
1.13 10' -1.62 10'
-1.62 10' 2.97 10"'
1.41 10^
(x(j),j = 1, ..., n), one can predict how the variance-covariance matrix behaves during the iterative process and decide on the number of experiments or decide on the design itself. This is further explained in Section 41.3. The relative insensitivity of the estimation process for the initial value of P is illustrated by repeating the calculation with the diagonal elements P(l,l) and P(2,2) of P set equal to 100 instead of equal to 1000 (see Table 41.2). As can be seen from Table 41.2 the gain vector, the variance-covariance matrix and the estimated regression parameters rapidly converge to the same values. Also using unrealistically high values for the experimental error (e.g. r(j) = 1) does not affect the convergence of the gain factors too much as long as the diagonal elements of P remain high. However, we also
582
Iteration number
Fig. 41.2. Gain factor during the recursive estimation process (see Table 41.1). lt+4 1E+2
b 1E+0 \ \
P(2,2)
1E-2
\pair'"---~.^___^ 1E-4
1
8
9
10
Iteration number
Fig. 41.3. Evolution of the diagonal elements of the variance-covariance matrix (P) during the estimation process (see Table 41.1).
observe that P no longer represents the variance-covariance matrix of the regression parameters. If we start with a high r(j) value with respect to the diagonal elements of P(0) (e.g. 1:100), assuming a large experimental error compared to the confidence in the model parameters, the convergence is slow. This is indicated by comparing the innovation sequence for the ratio r(j) to P(0) equal to 1:1000 and 1:100 in Table 41.2. In recursive regression, new observations steadily receive a lower weight, even when the variance of the experimental error is constant (homoscedastic). Consequently, the estimated regression parameters are generally not exactly equal to the values obtained by ordinary least squares (OLS).
583 Q. o'-i) where z(j) is the predicted absorbance at the 7th measurement, using the latest estimated concentrations x(j - 1) obtained after the (/* - l)th measurement. h^O) contains the absorptivities of CI2 and Br2 at the wavelength chosen for the jth measurement. Step 1. Initialisation (/ = 0) P(0) =
"1000
1
1
1000
r(j) = 1 for ally
x(0) =
587
Step 2. Update of the Kalman gain vector k(l) and variance-covariance matrix P(l) k(l) = P(0)h(l)(hT(l)P(0)h(l) + i r ' P(l) = P(0)-k(l)h'^(l)P(0) This gives for design A:
k(l) =
fiooo L 1
0.0045 1 [0.0045" 1000 1 +1 |[([0.0045 0.168] 1 1000 0.168 1000 [ 0.168
r
[0.1596" ' L 5-744 _
P(l) =
"1000 1
1 ~
1000 1 "0.1596" [0.0045-0.168] 1 1000 _ 5.74^t_
K)00^
999.2 -25.8 -25.8 34.88
Step 3. Predict the value of the first measurement z(l) z(l) = [0.0045 0.168]
To' 0
=0
Step 4. Obtain the first measurement: z(l) = 0.0341. Step 5. Update the predicted concentrations: x(l) = x(0) + k(l) (z(l) - 2(1)) x(l) =
"0" 0
+
"0.16'
[0.0341-0] =
_5.7_
0.0054 0.196
Step 6. Return to step 2. These steps are summarized in Tables 41.5 and 41.6. The concentration estimates should be compared with the true values 0.1 and 0.2 respectively. For design B the results listed in Table 41.7 are obtained. From both designs a number of interesting conclusions follow. (1) The set of selected wavelengths (i.e. the experimental design) affects the variance-covariance matrix, and thus the precision of the results. For example, the set 22, 24 and 26 (Table 41.5) gives a less precise result than the set 22, 32 and 24 (Table 41.7). The best set of wavelengths can be derived in the same way as for multiple linear regression, i.e. the determinant of the dispersion matrix (h^h) which contains the absorptivities, should be maximized.
588 TABLE 41.5 Calculated gain vector k and variance covariance matrix P\ f 11(0) = P2.2(^) = 1000; P^j^O) = ^2.1(0) = 1 > ''= 1 Step
Wavelength
k(«)
P(n)
A:l 1
22
2 3
kl
^2,2
^1.2-^2,1
34.88
-25.88
0.16
5.74
999.2
24
1.16
2.82
995.8
14.73
-34.12
26
9.37
1.061
859.7
12.98
-49.53
4
28
13.17
245.0
11.4
-18.11
5
30
7.11
-0.52
71.4
10.5
-5.6
6
32
3.71
-0.25
52.6
10.4
-4.3
-0.67
TABLE 41.6 Estimated concentrations (see Table 41.5 for the starting conditions) Step
Wavelength
x2 CL
Br.
1
22
0.0054
0.196
2
24
0.0072
0.2004
3
26
0.023
0.2022
4
28
0.0802
0.1993
5
30
0.0947
0.1982
6
32
0.0959
0.198
TABLE 41.7 Calculated gain vector, variance covariance matrix and estimated concentrations (see Table 41.5 for the starting conditions) Step
Wavelength
k{n) k\
x(«)
P(A2)
kl
Pu
^2.2
^1,2-^2,1
JCl
x2
999.2
34.88
-25.88
0.0054
0.196
166.2
34.43
-6.42
0.0825
0.194
166.2
13.81
-6.54
0.0825
0.198
62.6
13.68
-2.86
0.0938
0.197
1
22
0.16
2
32
11.76
3
24
0.016
4
30
6.24
5
26
0.59
1.560
62.1
10.4
-4.1
0.0941
0.198
6
28
2.82
0.069
52.2
10.4
-4.3
0.0955
0.198
5.744 -0.27 2.859 -0.22
589 TABLE 41.8 Concentration estimation with the optimal set of wavelengths (see Table 41.5 for the starting conditions) Step
Wavelength
k(n) k\
x(n)
P(^^) k2
Pli
P22
^1,2-^2,1
x\
x2
1
22
0.16
999.2
34.88
-25.88
0.0054
0.196
2
32
11.76
-0.27
166.2
34.43
-6.42
0.0825
0.1942
3
30
6.244
-0.18
62.6
34.34
-3.42
0.0940
0.1939
5.744
(2) From the evolution of P in design B (Table 41.7), one can conclude that the measurement at wavenumber 24 10^ cm~^ does not really improve the estimates already available after the sequence 22, 32. Equally the measurement at 26 10^ cm~^ does not improve the estimate already available after the sequence 22,32,24 and 30 10^ cm~^. This means that these wavelengths do not contain new information. Therefore, a possibly optimal set of wavenumbers is 22,32 and 30. Inclusion of a fourth wavelength namely at 28 10^ cm~^ probably does not improve the estimates of the concentrations already available, since the value of P converged to a stable value. To confirm this conclusion, the recursive regression was repeated for the set of wavelengths 22, 32 and 30 10^ cm"^ (see Table 41.8). Thyssen et al. [3] developed an algorithm for the selection of the best set of m wavelengths out of n. Instead of having to calculate 10^^ determinants to find the best set of six wavelengths out of 300, the recursive approach only needs to evaluate a rather straightforward equation 420 times. The influence of the measurement sequence on the speed of convergence is well demonstrated for the four-component analysis (Fig. 41.5) of a mixture of aniline, azobenzene, nitrobenzene and azoxybenzene [3]. In the forward scan mode a quick convergence is attained, whereas in the backward scan mode, convergence is slower. Using an optimized sequence, convergence is complete after less than seven measurements (Fig. 41.6). Other methods for wavelength selection are discussed in Chapters 10 and 27.
41.4 System equations When discussing the calibration and multicomponent analysis examples in previous sections, we mentioned that the parameters to be estimated are not necessarily constant but may vary in time. This variation is taken into account by
590
a
1 "
K
M
1 .6
•
"
K
X K M
M
K
X
«""
1.2 •".
X
K *
X
X Z
X
r>
0.8
*-
rzi
xa^ie"" -4
_
AA 1
1
14
1
1
28
1
1
42
1
1
56
1-
>
k
78
—> k
0.8 +
0.4 \
Fig. 41.5. Multicomponent analysis (aniline (jci), azobenzene fe), nitrobenzene fe) and azoxybenzene (JC4)) by recursive estimation (a) forward run of the monochromator (b) backward run (k indicates the sequence number of the estimates; solid lines are the concentration estimates; dotted lines are the measurements z).
591
l.B f
1.2
e.4 I ps..
X2K10 XltUB
M
28
42
56
70
—> k
Fig. 41.6. Multicomponent analysis (see Fig. 41.5) with an optimized wavelength sequence.
the system equation. As explained before, the system equation describes how the system changes in time. In the kinetics example the system equation describes the change of the concentrations as a function of time. This is a deterministic system equation. The random fluctuation of the slope and intercept of a straight line can be described by a stochastic model, e.g., an autoregressive model (see Chapter 20). Any unmodelled system fluctuations left are included in the system noise, w(j). The system equation is usually expressed in the following way: x(/*) = F ( / j - l ) x ( / - l ) + w(/')
(41.9)
where F(/j - 1) is the system transition matrix, which describes how the system state changes from time ti_^ to time /^. The vector w(/) consists of the noise contributions to each of the system states. These are system fluctuations which are not modelled by the system transition matrix. The parameters of the measurement equation, the h-vector and system transition matrix for the kinetics and calibration model are defined in Table 41.9. In the next two sections we derive the system equations for a kinetics and a calibration experiment. System state equations are not easy to derive. Their form depends on the particular system under consideration and no general guidance can be given for their derivation.
592 TABLE 41.9 Definition of the state parameter (x), h-vector and transition matrix (F) for two systems State parameters (x)
h-vector
1. Calibration
Slope and intercept of [ 1 c] where c is the concentthe calibration line at ration of the calibration time t standard measured at time t
2. 1 st order kinetics monitored by Uv-Vis A -> B
Concentrations of A and B at time t
Absorbance coefficients of A and B at the wavelength of the reading at time t
Transition matrix (F) time constant of the the variations of slope and intercept 1st order reaction rate
41.4.1 System equation for a kinetics experiment Let us assume that a kinetics experiment is carried out and we want to follow the concentrations of component B which is formed from A by the reaction: A -> B. For a first-order reaction, the concentrations of A (= jCj) and B (= x^^) as a function of time are described by two differential equations: dxj Idt = -k^x^ djC2/dr = fcjjCj
which can be rewritten in the following recursive form: jc,(r + 1) = (1 - A:i)jci(0 + w,(0
(41.10)
x^{t + 1) = k^x^{t) + JC2(0 + ^2(0 w indicates the error due to the discretization of the differential equations. These two equations describe the concentrations of A and B as a function of time, or in other words, they describe the state of the system, which in vector notation becomes: x ( r + l ) = Fxft) + w(0 with x\t^\)^[x,{t^\)x^{t+\)] w^(r+l) = [vi;i(r+1)^2(^+1)] and the transition matrix equal to: "l-A:,
0"
F
Jr —
k^
1
(41.11)
593
41.4.2 System equation of a calibration line with drift In this section we derive a system equation which describes a drifting calibration line. Let us suppose that the intercept x,(/ + 1) at a time; + 1 is equal to x,(/) at a time j augmented by a value a(/'), which is the drift. By adding a non-zero system noise w„ to the drift, we express the fact that the drift itself is also time dependent. This leads to the following equations [5,6]: x,(7+l) = x,0') + aO') a(/ + l) = a(/) + w„(/ + l) which is transformed into the following matrix notation:
.a(7 + l)
0 1 1 •^i(y) + 0 1 a(7) WaCi+l)
(41.12)
A similar equation can be derived for the slope, where P is the drift parameter: X2(;+l)
.P0"+1).
0
1 ip2(;) .0 iJIpCy).
(41.13)
H ' p ( ; + l)
Equations (41.12) and (41.13) can now be combined in a single system model which describes the drift in both parameters: \x,{i+\)
x^ii+i) a(/+l)
.PO'+l).
10
1 0 " \x\{j)^
0
0 1 U2(;) + 0 0 1 0 «(;•) H'aCy + i) >vp(; + l) 0 0 0 1 ^PO).
0
10
or x(/ + 1) = Fx(/) + w(7' + 1) with
F=
10 10" 0 10 1 0 0 10 0 0 0 1
Xi
andx =
a
.P.
F describes how the system state changes from time tj to tj^^.
(41.14)
594
For the time invariant calibration model discussed in Section 41.2, eq. (41.14) reduces to: "1
OTA:, •^iO')1
0
lJ[jC2lU)}
where x^ = intercept and X2 = slope.
41.5 The Kalman filter 41.5.1 Theory In Sections 41.2 and 41.3 we applied a recursive procedure to estimate the model parameters of time-invariant systems. After each new measurement, the model parameters were updated. The updating procedure for time-variant systems consists of two steps. In the first step the system state x(/ - 1) at time ti_^ is extrapolated to the state x(j) at time tj by applying the system equation (eq. (41.15)) in Table 41.10). At time tj a new measurement is carried out and the result is used to TABLE41.10 Kalman filter algorithm equations Initialisation: x(0), P(0) State extrapolation: system equation x(/!/- 1) = F(/)xO- 11/'- 1) + wO)
(41.15)
Covariance extrapolation P(/V- 1) = F(/-)P(/- II7- 1)F''(/) + Q 0 - 1)
(41.16)
New measurement: z(j) Measurement equation z(J) = h'(j)x(j\j - 1) + v(/') forj = 0„ 1 „ 2,... Covariance update P(/'iy) = P(/"iy - 1) - mh'ijWiiy
-1)
(41.17)
Kalman gain update k(/) = POV- l)h(/)(h''(/-)POV- l)hO) + rO))-' State update x(J\J) = ^(J\J-^) + mKz(j)-h'(j)x(j\j-
D)
(41.18)
(41.19)
595
update the state x(j) (eq. (41.19) in Table 41.10). In order to make a distinction between state extrapolations by the system equation and state updates when making a new observation, a double index (j\j - 1) is used. The index (j\j - 1) indicates the best estimates at time tj, based on measurements obtained up to and including those obtained at point tj^^. Equations (41.15) and (41.19) for the extrapolation and update of system states form the so-called state-space model. The solution of the state-space model has been derived by Kalman and is known as the Kalman filter. Assumptions are that the measurement noise v(j) and the system noise w(;) are random and independent, normally distributed, white and uncorrected. This leads to the general formulation of a Kalman filter given in Table 41.10. Equations (41.15) and (41.19) account for the time dependence of the system. Eq. (41.15) is the system equation which tells us how the system behaves in time (here inj units). Equation (41.16) expresses how the uncertainty in the system state grows as a function of time (here inj units) if no observations would be made. Q(/ - 1) is the variance-covariance matrix of the system noise which contains the variance of w. The algorithm is initialized in the same way as for a time-invariant system. The sequence of the estimations is as follows: Cycle 1 1. Obtain initial values for x(OIO), k(0) and P(OIO) 2. Extrapolate the system state (eq. (41.15)) to x(llO) 3. Calculate the associated uncertainty (eq. (41.16)) P( 110) 4. Perform measurement no. 1 5. Update the gain vector k(l) (eq. (41.18)) using P(IIO) 6. Update the estimate x(llO) to x(lll) (eq. (41.19)) and the associated uncertainty P(lll) (eq. (41.17)) Cycle 2 1. Extrapolate the estimate of the system state to x(2ll) and the associated uncertainty P(2I1) 2. Peform measurement no. 2 3. Update the gain vector k(2) using P(2I1) 4. Update (filter) the estimate of the system state to x(2l2) and the associated uncertainty P(2I2) and so on. In the next section, this cycle is demonstrated on the kinetics example introduced in Sections 41.1 and 41.4. Time-invariant systems can also be solved by the equations given in Table 41.10. In that case, F in eq. (41.15) is substituted by the identity matrix. The system state, x(/), of time-invariant systems converges to a constant value after a few cycles of the filter, as was observed in the calibration example. The system state,
596
x(/), of time-variant systems is obtained as a function of y, for example the concentrations of the reactants and reaction products in a kinetic experiment monitored by a spectrometric multicomponent analysis. 41.5.2 Kalman filter of a kinetics model Equation (41.11) represents the (deterministic) system equation which describes how the concentrations vary in time. In order to estimate the concentrations of the two compounds as a function of time during the reaction, the absorbance of the mixture is measured as a function of wavelength and time. Let us suppose that the pure spectra (absorptivities) of the compounds A and B are known and that at a time t the spectrometer is set at a wavelength giving the absorptivities h^(0- The system and measurement equations can now be solved by the Kalman filter given in Table 41.10. By way of illustration we work out a simplified example of a reaction with a true reaction rate constant equal to k^ =0.1 min"^ and an initial concentration jCi(O) = 1. The concentrations are spectrophotometrically measured every 5 minutes and at the start of the reaction after 1 minute. Each time a new measurement is performed, the last estimate of the concentration A is updated. By substituting that concentration in the system equation x^(t) = x^(0)txp(-k^t) we obtain an update of the reaction rate k. With this new value the concentration of A is extrapolated to the point in time that a new measurement is made. The results for three cycles of the Kalman filter are given in Table 41.11 and in Fig. 41.7. The "c ik (0 I i •G
1?
(Q
g>
1
O C
o (Q
08
C
8
0.6
oC o 0.4 0.2 o
0
_l
I
I
l_
_J
I
\
o
o
o
L_
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 • time
Fig. 41.7. Concentrations of the reactant A (reaction A ^ B) as a function of time (dotted line) (CA = 1, CB = 0); • state updates (after a new measurement), O state extrapolations to the next measurement (see Table 41.11 for Kalman filter settings).
597 TABLE41.il The prediction of concentrations and reaction rate constant by a Kalman filter; (JCJCO) =l,k^=0.l Time
Concentrations
Continuous A
B
Wavelengthy
Discrete A
Absorptivities
Absorbance
A
z
B
z
Estimate of the concentration of A State update State extra(41.19)
B
min-^
1 ^-
polation (41.15)
1)
x(OIO) 1.00
0
1
0
1
0
1
0.90
0.10
0.90
0.10
2
0.82
0.18
0.81
0.19
0.82
->1
10
1
9.1
10
jc(lll)0.91
;c(llO)
0.90
3
0.74
0.28
0.73
0.27
0.74
4
0.67
0.33
0.66
0.34
0.67
5
0.61
0.39
0.59
0.41
6
0.55
0.45
0.53
0.47
0.57
7
0.50
0.50
0.48
0.52
0.52
8
0.45
0.55
0.43
0.57
0.48 0.43
->2
5
2
3.8
3.87
jc(2l2) 0.63
4211)
0.62
9
0.41
0.59
0.39
0.61
10
0.37
0.63
0.35
0.65
11
0.33
0.67
0.31
0.69
0.34
12
0.29
0.71
0.28
0.72
0.31
13
0.27
0.73
0.25
0.75
0.28
14
0.25
0.75
0.23
0.77
15
0.22
0.78
0.21
0.79
16
0.20
0.80
0.18
0.82
0.15
17
0.18
0.82
0.17
0.83
0.13
18
0.16
0.84
0.15
0.85
0.12
19
0.15
0.85
0.14
0.86
0.11
^ 3
2
5
3.97
3.81
A:(3I3) 0.38
jc(3l2)
0.40
0.26 ->4
7
2
3.10
3.15
;c(4l4)0.16
x(4l3)
0.23
0.10
')State extrapolations in italic are obtained by substituting the last estimate of k^ equal to -\n{x(j\j))/t in the system equation (41.15).
dashed line in Fig. 41.7 shows the true evolution of the concentration of the reactant A. For a matter of simplicity we have not included the covariance extrapolation (eq. (41.16)) in the calculations. Although more cycles are needed for a convincing demonstration that a Kalman filter can follow time varying states, the example clearly shows its principle.
598
41.5.3 Kalman filtering of a calibration line with drift The measurement model of the time-invariant calibration system (eq. (41.5)) should now be expanded in the following way: z(j) = h'iDxij - 1) + v(j) for; = 0, 1, 2,...
(41.20)
where h^(/) = [lc(/)0 0] c(j) is the concentration of the analyte in theyth calibration sample
x'{j-l)
= [b, b, a p ]
The model contains four parameters, the slope and intercept of the calibration line and two drift parameters a and p. All four parameters are estimated by applying the algorithm given in Table 41.10. Details of this procedure are given in Ref. [5].
41.6 Adaptive Kalman filtering In previous sections we demonstrated that a major advantage of Kalman filtering over ordinary least squares (OLS) procedures is that it can handle timevarying systems, e.g., a drifting calibration parameter and drifting background. In this section another feature of Kalman filters is demonstrated, namely its capability of handling incomplete measurement models or model errors. An example of an incomplete measurement model is multicomponent analysis in the presence of an unknown interfering compound. If the identity of this interference is unknown, one cannot select wavenumbers where only the analytes of interest absorb. Therefore, solution by OLS may lead to large errors in the estimated concentrations. The occurrence of such errors may be detected by inspecting the difference between the measured absorbance for the sample and the absorbance estimated from the predicted concentrations (see Chapter 10). However, inspection of PRESS does not provide information on which wavelengths are not selective. One finds that the result is wrong without an indication on how to correct the error. Another type of model error is a curving calibration line which is modelled by a straight line model. The size and pattern of the residuals usually indicate that there is a model error (see Chapter 8), after which the calculations may be repeated with another model, e.g., a quadratic curve. The recursive property of the Kalman filter allows the detection of such model deviations, and offers the possibility of disregarding the measurements in the region where the model is invalid. This filter is the so-called adaptive Kalman filter.
41.6.1 Evaluation of the innovation

Before we can apply an adaptive filter, we should define a criterion to judge the validity of the model used to describe the measurements. Such a criterion can be based on the innovation defined in Section 41.2. The concept of innovation, i, has been introduced as a measure of how well the filter predicts new observations:

i(j) = z(j) - h^T(j) x(j-1) = z(j) - ẑ(j)

where z(j) is the jth measurement, x(j-1) is the estimate of the model parameters after j-1 observations, and h^T(j) is the design vector. Thus i(j) is a measure of the predictive ability of the model. For the calibration example discussed in Section 41.2, x(j-1) contains the slope and intercept of the straight line, and h^T(j) is equal to [1 c(j)], with c(j) the concentration of the calibration standard for the jth calibration measurement. For the multicomponent analysis (MCA), x(j-1) contains the estimated concentrations of the analytes after j-1 observations, and h^T(j) contains the absorptivities of the analytes at wavelength j. It can be shown [4] that the innovations of a correct filter model applied to data with Gaussian noise follow a Gaussian distribution with a mean value equal to zero and a standard deviation equal to the experimental error. A model error means that the design vector h in the measurement equation is not adequate. If, for instance, in the calibration example the model were quadratic, h^T(j) should be [1 c(j) c(j)^2] instead of [1 c(j)]. In the MCA example h^T(j) is wrong if the absorptivities of some absorbing species are not included. Any error in the design vector h^T appears as a non-zero mean of the innovation [4]. One also expects the sequence of the innovations to be random and uncorrelated. This can be checked by an investigation of the autocorrelation function (see Section 20.3) of the innovation.

41.6.2 The adaptive Kalman filter model

The principle of adaptive filtering is based on evaluating the innovation at each new observation and comparing this value to the theoretically expected value. If the (absolute) value of the innovation is larger than a given criterion, the observation is disregarded and not used to update the estimates of the parameters. Alternatively, one could eliminate the influence of such an observation by artificially increasing its measurement variance r(j), which effectively attributes a low weight to the observation. For a time-invariant system, the expected standard deviation of the innovation consists of two parts: one due to the measurement variance (r(j)) and one due to the uncertainty in the parameters (P(j)), given by [4]:
σ_i(j) = [r(j) + h^T(j) P(j-1) h(j)]^(1/2)
(41.21)
As explained before, the second term in the above equation is the variance of the response, ẑ(j), predicted by the model at the value h(j) of the independent variable, given the uncertainty in the regression parameters P(j-1) obtained so far. This equation reflects the fact that the fluctuations of the innovation are larger at the beginning of the filtering procedure, when the states are not well known, and converge to the standard deviation of the experimental error when the confidence in the estimated parameters becomes high (small P). Therefore, it is more difficult to detect model errors at the beginning of the estimation sequence than later on. Rejection or acceptance of a new measurement is then based on the following criterion:

if |i(j)| > 3σ_i(j): reject
otherwise: accept

Adaptation of the Kalman filter may then simply consist of ignoring the rejected observations, leaving the parameter estimates and covariances unaffected. When using eq. (41.21) a complication arises at the beginning of the recursive estimation, because the value of P depends on the initially chosen values P(0) and is thus not a good measure of the uncertainty in the parameters. When large values are chosen for the diagonal terms of P in order to speed up the convergence (high P means a large gain factor k and thus a large influence of the last observation), eq. (41.21) overestimates the variance of the innovation until P becomes independent of P(0). For short runs one can evaluate the sequence of the innovations and look for regions with significantly larger values, or compare the innovation with r(j). By way of illustration we apply the latter procedure to solve the multicomponent system discussed in Section 41.3, after adding an interfering compound which augments the absorbance at 26·10³ cm⁻¹ with 0.01 au and at 28·10³ cm⁻¹ with 0.015 au. First we apply the non-adaptive Kalman filter to all measurements. The estimation then proceeds as shown in Table 41.12. This example illustrates the self-adaptive capacity of the Kalman filter. The large interferences introduced at the wavelengths 26 and 28·10³ cm⁻¹ have not really influenced the end result. At 26 and 28·10³ cm⁻¹ the innovation is large due to the interferent. At 30·10³ cm⁻¹ the innovation is high because the concentration estimates obtained in the foregoing step are poor. However, the observation at 30·10³ cm⁻¹ is itself unaffected by the interferent, by which the concentration estimates are restored towards the true values. In contrast, the OLS estimates obtained for the above example are inaccurate (x1 = 0.148 and x2 = 0.217), demonstrating the sensitivity of OLS to model errors.
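The rejection rule translates directly into a recursive loop that skips suspect observations. The following is a minimal sketch, with invented absorptivities and an artificial interferent added to one channel (these are not the Table 41.4 data):

```python
import numpy as np

def adaptive_kalman(Z, H, r, x0, P0, crit=3.0):
    """Skip observations whose innovation exceeds crit times its
    expected standard deviation (eq. (41.21))."""
    x, P = x0.copy(), P0.copy()
    for z, h in zip(Z, H):
        innovation = z - h @ x
        s = h @ P @ h + r               # expected innovation variance
        if abs(innovation) > crit * np.sqrt(s):
            continue                    # model invalid here: reject, no update
        k = (P @ h) / s
        x = x + k * innovation
        P = P - np.outer(k, h @ P)
    return x, P

# rows of H: absorptivities of the two analytes at successive wavelengths
H = np.array([[0.10, 0.15], [0.20, 0.17], [0.15, 0.20], [0.05, 0.18]])
x_true = np.array([0.10, 0.20])
Z = H @ x_true
Z[2] += 0.01                            # unknown interferent on one channel
x, P = adaptive_kalman(Z, H, r=1e-6, x0=np.zeros(2), P0=np.eye(2) * 100.0)
```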
TABLE 41.12
Non-adaptive Kalman filter with interference at 26 and 28·10³ cm⁻¹ (see Table 41.5 for the starting conditions)

Step j  Wavelength (10³ cm⁻¹)  x1 (Cl2)  x2 (Br2)  Measured absorbance1)  Estimated absorbance2)  Innovation
0       -                      0         0         -                      -                       -
1       22                     0.054     0.196     0.0341                 0                       0.0341
2       24                     0.072     0.2004    0.0429                 0.0414                  0.0015
3       26                     0.169     0.211     0.0435                 0.0331                  0.0104
4       28                     0.313     0.204     0.0267                 0.0158                  0.0110
5       30                     0.099     0.215     0.0110                 0.0409                  -0.030
6       32                     0.098     0.215     0.0080                 0.0082                  -0.0002

1) Values taken from Table 41.4. 2) Calculated with the absorptivity coefficients from Table 41.4.
The estimation sequence when the Kalman filter is adapted is given in Table 41.13. This illustrates well the adaptive procedure that has been followed. At 26·10³ cm⁻¹ the new measurement is 0.0104 absorbance units higher than expected from the last available concentration estimates (x1 = 0.072 and x2 = 0.2004). This deviation is clearly larger than the value 0.005 expected from the measurement noise. Therefore, the observation is disregarded, and the next measurement, at 28·10³ cm⁻¹, is predicted with the last accepted concentration estimates. Again, the difference between the predicted and measured absorbance (the innovation) cannot be explained by the noise, and this observation is disregarded as well. At 30·10³ cm⁻¹ the predicted absorbance, using the concentration estimates from the second step, is back within the expectations; P and k can be updated, leading to new concentration estimates x1 = 0.098 and x2 = 0.1995. Thereafter, the estimation process is continued in the normal way. The effect of this whole procedure is that the two measurements corrupted by the presence of an interferent have been eliminated, after which the measurement-filtering process is continued.
41.7 Applications

One of the earliest applications of the Kalman filter in analytical chemistry was multicomponent analysis by UV-Vis spectrometry of time- and wavelength-independent concentrations, which was discussed by several authors [7-10]. Initially, the spectral range was scanned in the upward and downward mode, but later on
TABLE 41.13
Adaptive Kalman filter with interference at 26 and 28·10³ cm⁻¹ (see Table 41.5 for the starting conditions)

Step j  Wavelength (10³ cm⁻¹)  x1 (Cl2)  x2 (Br2)  Measured absorbance1)  Estimated absorbance2)  Innovation
0       -                      0         0         -                      -                       -
1       22                     0.054     0.196     0.0341                 0                       0.0341
2       24                     0.072     0.2004    0.0429                 0.0414                  0.0015
3       26                     -         -         0.0435                 0.0331                  0.0104
4       28                     -         -         0.0267                 0.0166                  0.0100
5       30                     0.0980    0.1995    0.0110                 0.00814                 -0.002
6       32                     0.0979    0.1995    0.0080                 0.00801                 -0.00001

1) Values taken from Table 41.4. 2) Calculated with the absorptivity coefficients from Table 41.4.
optimal sequences were derived for faster convergence to the result [3]. The measurement model can be adapted to include contributions from linear drift or background [6,11]. This requires an accurate model for the background or drift; if the background response is not precisely described by the model, the Kalman filter fails to estimate accurate concentrations. Rutan [12] applied an adaptive Kalman filter with success in these instances. In HPLC with diode array detection, three-dimensional data are available. The processing of such data by multivariate statistics has been the subject of many chemometric studies, which are discussed in Chapter 34. Under the restriction that the spectra of the analytes should be available, accurate concentrations can be obtained by Kalman filtering in the presence of unknown interferences [13]. One of the earliest reports of a Kalman filter which includes a system equation is due to Seelig and Blount [14], in the area of anodic stripping voltammetry. Five system states (potential, concentration, potential sweep rate, and the slope and intercept of the non-Faradaic current) were predicted from a measurement model based on the measured current. Later on the same approach was applied in polarography. Similarly to spectroscopic and chromatographic applications, overlapping voltammograms can be resolved by a Kalman filter [15]. A vast number of applications of Kalman filters in kinetic analysis has been reported [16,17], and the performance has been compared with that of conventional non-linear regression. In most cases the accuracy and precision of the results obtained from the two methods were comparable. The Kalman filter is specifically superior for detecting and correcting model errors.
The Kalman filter is particularly well-suited to monitor the dynamic behaviour of processes. The measurement procedure itself can be considered to be a system which is observed through the measurement of check samples. One can set up system equations, e.g., a system equation which describes the fluctuations of the calibration factors. Only a few applications exploiting this capability of a Kalman filter have been reported. One of the difficulties is a lack of system models which describe the dynamic behaviour of an analytical system. Thijssen and coworkers [18] demonstrated the potential of this approach by designing a Kalman filter for the monitoring of the calibration factors. They assembled a so-called self-calibrating Flow Injection Analyzer for the determination of chloride in water. The software of the instrument included a system model by which the uncertainty of the calibration factors was evaluated during the measurement of the unknown samples. When this uncertainty exceeded a certain threshold, the instrument decided to update the calibration factors by remeasuring one of the calibration standards. Thijssen [19] also designed an automatic titrator which controlled the addition of the titrant by a Kalman filter. After each addition the equivalence point (the state of the system) was estimated during the titration.
References

1. D. Graupe, Identification of Systems. Krieger, New York, NY, 1976.
2. Landolt-Börnstein, Zahlenwerte und Funktionen. Teil 3: Atom- und Molekularphysik. Springer, Berlin, 1951.
3. P.C. Thijssen, L.J.P. Vogels, H.C. Smit and G. Kateman, Optimal selection of wavelengths in spectrophotometric multicomponent analysis using recursive least squares. Z. Anal. Chem., 320 (1985) 531-540.
4. A. Gelb (Ed.), Applied Optimal Estimation. MIT Press, Cambridge, MA, 1974.
5. G. Kateman and L. Buydens, Quality Control in Analytical Chemistry, 2nd Edn. Wiley, New York, 1993.
6. P.C. Thijssen, S.M. Wolfrum, G. Kateman and H.C. Smit, A Kalman filter for calibration, evaluation of unknown samples and quality control in drifting systems: Part 1. Theory and simulations. Anal. Chim. Acta, 156 (1984) 87-101.
7. H.N.J. Poulisse, Multicomponent analysis computations based on Kalman filtering. Anal. Chim. Acta, 112 (1979) 361-374.
8. C.B.M. Didden and H.N.J. Poulisse, On the determination of the number of components from simulated spectra using Kalman filtering. Anal. Lett., 13 (1980) 921-935.
9. T.F. Brown and S.D. Brown, Resolution of overlapped electrochemical peaks with the use of the Kalman filter. Anal. Chem., 53 (1981) 1410-1417.
10. S.C. Rutan and S.D. Brown, Pulsed photoacoustic spectroscopy and spectral deconvolution with the Kalman filter for determination of metal complexation parameters. Anal. Chem., 55 (1983) 1707-1710.
11. P.C. Thijssen, A Kalman filter for calibration, evaluation of unknown samples and quality control in drifting systems: Part 2. Optimal designs. Anal. Chim. Acta, 162 (1984) 253-262.
12. S.C. Rutan, E. Bouveresse, K.N. Andrew, P.J. Worsfold and D.L. Massart, Correction for drift in multivariate systems using the Kalman filter. Chemom. Intell. Lab. Syst., 35 (1996) 199-211.
13. J. Chen and S.C. Rutan, Identification and quantification of overlapped peaks in liquid chromatography with UV diode array detection using an adaptive Kalman filter. Anal. Chim. Acta, 335 (1996) 1-10.
14. P. Seelig and H. Blount, Kalman filter applied to anodic stripping voltammetry: theory. Anal. Chem., 48 (1976) 252-258.
15. C.A. Scolari and S.D. Brown, Multicomponent determination in flow-injection systems with square-wave voltammetric detection using the Kalman filter. Anal. Chim. Acta, 178 (1985) 239-246.
16. B.M. Quencer, Multicomponent kinetic determinations with the extended Kalman filter. Diss. Abstr. Int. B, 54 (1994) 5121-5122.
17. M. Gui and S.C. Rutan, Determination of initial concentration of analyte by kinetic detection of the intermediate product in consecutive first-order reactions using an extended Kalman filter. Anal. Chem., 66 (1994) 1513-1519.
18. P.C. Thijssen, L.T.M. Prop, G. Kateman and H.C. Smit, A Kalman filter for calibration, evaluation of unknown samples and quality control in drifting systems: Part 4. Flow injection analysis. Anal. Chim. Acta, 174 (1985) 27-40.
19. P.C. Thijssen, N.J.M.L. Janssen, G. Kateman and H.C. Smit, Kalman filter applied to setpoint control in continuous titrations. Anal. Chim. Acta, 177 (1985) 57-69.
Recommended additional reading

S.C. Rutan, Recursive parameter estimation. J. Chemom., 4 (1990) 103-121.
S.C. Rutan, Adaptive Kalman filtering. Anal. Chem., 63 (1991) 1103A-1109A.
S.C. Rutan, Fast on-line digital filtering. Chemom. Intell. Lab. Syst., 6 (1989) 191-201.
D. Wienke, T. Vijn and L. Buydens, Quality self-monitoring of intelligent analyzers and sensor based on an extended Kalman filter: an application to graphite furnace atomic absorption spectroscopy. Anal. Chem., 66 (1994) 841-849.
S.D. Brown, Rapid parameter estimation with incomplete chemical calibration models. Chemom. Intell. Lab. Syst., 10 (1991) 87-105.
Chapter 42
Applications of Operations Research

42.1 An overview

Ackoff and Sasieni [1] defined operations research (OR) as "the application of scientific method by interdisciplinary teams to problems involving the control of organized (man-machine) systems so as to provide solutions which best serve the purposes of the organization as a whole". Operations research consists of a collection of mathematical techniques, some of which are linear programming, integer programming, queuing theory, dynamic programming, graph theory, game theory, multicriteria decision making, and simulation. They are often optimization techniques and are characterized by their combinatorial character: their aim is to find an optimal combination. Typical problems that can be solved are: (1) allocation, (2) inventory, (3) replacement, (4) queuing, (5) sequencing and combination, (6) routing, (7) competition, and (8) search. Several, but not all, of these mathematical methods (e.g. multicriteria decision making, Chapter 26) or problems (e.g. the non-hierarchical clustering methods of Chapter 30, which can be treated as allocation models) have been treated earlier. In this chapter, we briefly discuss the methods that are relevant to chemometricians and have not yet been treated in earlier chapters.
42.2 Linear programming

Suppose that a manufacturer prepares a food product by adding two oils (A and B) from different sources to other ingredients. His purpose is to optimize the quality of the product and at the same time minimize cost. The quality parameters are the amount of vitamin A (y1) and the amount of polyunsaturated fatty acids (y2). The cost of a unit amount of oil A is 40, that of B is 25. In this introductory example, we will suppose that the cost of the other ingredients is negligible and does not have to be taken into account. Moreover, we suppose that the volume remains constant by adaptation of the other ingredients. The cost or objective function to be minimized is therefore

z = 40x1 + 25x2
(42.1)
where x1 is the amount in grams of oil A per litre of product and x2 the amount of oil B. An optimal product contains at least 65 vitamin A units and 40 polyunsaturated units per litre (all numbers in this section have been chosen for mathematical convenience and are not related to real values). More vitamin A or polyunsaturated fatty acids are not considered to have added benefit. Suppose now that oil A contains 30 units of vitamin A and 10 units of polyunsaturated fatty acids, and oil B contains 15 units of vitamin A and 25 units of polyunsaturated fatty acids. One can then write the following set of constraints:

y1 = 30x1 + 15x2 ≥ 65
y2 = 10x1 + 25x2 ≥ 40   (42.2)
x1 ≥ 0, x2 ≥ 0
The line y1 = 30x1 + 15x2 = 65 is shown in Fig. 42.1. All points on that line or above it satisfy the constraint 30x1 + 15x2 ≥ 65. Similarly, all points lying above the line y2 = 10x1 + 25x2 = 40 satisfy the second constraint of eq. (42.2), while the last constraints limit the acceptable solutions to positive or zero values for x1 and x2. The acceptable region is the shaded area of Fig. 42.1. We can now determine which pairs of (x1, x2) values yield a particular z. In Fig. 42.1 line z1 shows all values for which z = 50. These (x1, x2) do not belong to the acceptable area. However, we can draw parallel lines until we meet the acceptable area. This happens in point B with line z2. The coordinates of this point are obtained by solving the set of simultaneous equations

30x1 + 15x2 = 65
10x1 + 25x2 = 40

This yields x1 = 41/24 ≈ 1.71, x2 = 11/12 ≈ 0.92 and z = 91.2.
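This result is easy to verify numerically. Below is a minimal sketch using SciPy's linprog, in which the two ≥ constraints are negated into the ≤ form that linprog expects:

```python
from scipy.optimize import linprog

# minimize z = 40*x1 + 25*x2
# subject to 30*x1 + 15*x2 >= 65, 10*x1 + 25*x2 >= 40 and x1, x2 >= 0
res = linprog(c=[40, 25],
              A_ub=[[-30, -15], [-10, -25]],   # negated >= constraints
              b_ub=[-65, -40],
              bounds=[(0, None), (0, None)])
print(res.x, res.fun)                          # approx. [1.708, 0.917], 91.25
```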
Fig. 42.1. Linear programming: oil example.
Let us look at another example (Fig. 42.2) [2]. A laboratory must carry out routine determinations of a certain substance and uses two methods, A and B, to do this. With method A, one technician can carry out 10 determinations per day, with method B 20 determinations per day. There are only 3 instruments available for method B and there are 5 technicians in the laboratory. The first method needs no sophisticated instruments and is cheaper: it costs 100 units per determination, while method B costs 300 units per determination. The available daily budget is 14000 units. How should the technicians be divided over the two available methods, so that as many determinations as possible are carried out? Let the number of technicians working with method A be x1 and with method B x2, and the total number of determinations z; then the objective function to be maximized is given by:

z = 10x1 + 20x2
(42.3)
Fig. 42.2. Linear programming: laboratory technicians example.
The constraints are:

y1 = x2 ≤ 3
y2 = x1 + x2 ≤ 5
y3 = (10 × 100)x1 + (20 × 300)x2 ≤ 14000   (42.4)
x1 ≥ 0, x2 ≥ 0

The optimal result obtained in this way is x1 = 3.2, x2 = 1.8, z = 68. We observe that in both cases:
- the set of acceptable solutions is convex, i.e. whatever two points one chooses from the set, the line connecting them lies completely within the domain defined by the set;
- the optimal solution is one of the corner points of the convex set.
It can be shown that this can be generalized to the case of more than two variables. The standard solution of a linear programming problem is then to define the corner points of the convex set and to select the one that yields the best value for the objective function. This is called the Simplex method. The second example illustrates a difficulty that can occur, namely that the optimal solution requires 1.8 technicians working with method B, while one needs an integer number. This can be solved by letting one technician work full time and another four days out of five with this instrument. When this is not practical, the
solution is not feasible, and one should then apply a related method, called integer programming, in which only integer values are allowed for the variables. When only binary values are allowed, this is called binary programming. Problems resembling the first example, but much more complex, are often studied in industry. For instance, in the agro-food industry linear programming is a current tool to optimize the blending of raw materials (e.g. oils) in order to obtain the desired composition (amounts of saturated, monounsaturated and polyunsaturated fatty acids) or property of the final product at the best possible price. Here linear programming is applied repeatedly, each time the prices of the raw materials are adapted by changing markets. Integer programming has been applied by De Vries [3] (a short English-language description can be found in [2]) for the determination of the optimal configuration of equipment in a clinical laboratory, and by De Clercq et al. [4] for the selection of optimal probes for GLC. From a data set with retention indices for 68 substances on 25 columns, sets of p probes (substances) (p = 1, 2,..., 20) were selected, such that the probes allow one to obtain the best characterization of the columns. This type of application would nowadays probably be carried out with genetic algorithms (see Chapter 27). The fact that only linear objective functions are possible limits the applicability of the methodology in chemometrics. Quadratic or non-linear programming is possible, however. The former has been applied in the agro-food industry for the determination of the composition of an unknown fat blend from its fatty acid profile and the fatty acid profiles of all possible pure oils [5]. The solution of this problem is searched under constraints on the number of oils allowed in the solution, a minimal or maximal content, or a content range. This problem can be solved by quadratic programming. The objective function is to minimize the squared differences between the calculated and actual fatty acid composition of the oil blend. An attractive feature of the programming approach to solving this type of problem is that it provides several solutions in decreasing order of value of the objective function. All these methods together are also known as mathematical programming.
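By way of illustration, the fat-blend problem can be sketched as a small constrained least-squares program. The fatty acid profiles below are invented, and the content-range constraints are reduced to simple bounds:

```python
import numpy as np
from scipy.optimize import minimize

# columns: candidate pure oils; rows: fatty acid fractions (invented data)
F = np.array([[0.10, 0.45, 0.06],
              [0.25, 0.40, 0.10],
              [0.65, 0.15, 0.84]])
g = np.array([0.20, 0.28, 0.52])      # measured profile of the unknown blend

def objective(w):                     # squared difference between calculated
    r = F @ w - g                     # and actual fatty acid composition
    return r @ r

cons = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]   # fractions sum to 1
res = minimize(objective, np.full(3, 1 / 3), method="SLSQP",
               bounds=[(0.0, 1.0)] * 3, constraints=cons)
print(res.x)                          # estimated composition of the blend
```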
42.3 Queuing problems

In several chapters we discussed how the quality of the analytical result defines the amount of information which is obtained on a sampled system. Obvious quality criteria are accuracy and precision. An equally important criterion is the analysis time. This is particularly true when dynamic systems are analyzed. For instance, a relationship exists between the measurability and the sampling rate, analysis time and precision (see Chapter 20). The monitoring of environmental and chemical processes is a typical example where the management of the analysis time is
important. In this chapter we will focus on the analysis time. The time between the arrival of the sample and the reporting of the analytical result is usually substantially longer than the net analysis time. Delays may be caused by congestion of the laboratory or analysis station, or by managerial policies, e.g. priorities between samples and waiting until a batch of a certain size is available for analysis. A branch of Operations Research is the study of queues and the influence of scheduling policies on the formation of queues. Queues in waiting rooms, the occupation of beds in hospitals, and telephone and computer networks have been extensively studied by queuing theory. In contrast, only a few studies have been conducted on queues and delays in analytical laboratories [6-10]. Despite the fact that several operational parameters can be registered with modern Laboratory Information Management Systems (LIMS), laboratory activities are apparently too complex to be described by models from queuing theory. For the same reason the alternative approach by simulation (see Section 42.4) is not yet a real management tool for decision support in analytical laboratory management. On the other hand, simulation techniques have proved to be a useful tool for the scheduling of robots. In this section waiting and queues are discussed in order to provide some basic understanding of general queuing behaviour, in particular in analytical laboratories. This should allow a qualitative forecast of the effect of managerial decisions.
42.3.1 Queuing and waiting

No queues would be formed if no new samples were submitted during the time that the analyst is busy with the analysis of the previous sample. If all analysis times were equally long and if each new sample arrived exactly after the previous analysis is finished, the analytical facility could be utilized up to 100%. On the other hand, if samples always arrive before the analysis of the previous sample is completed, more samples arrive than can be analyzed, causing the queue to grow indefinitely long. Mathematically this means that:

n_q = 0, w = 0 when λ/μ < 1
n_q = ∞, w = ∞ when λ/μ > 1

with n_q the number of samples waiting in the queue, w the waiting time in the queue, λ the number of samples submitted per unit of time (e.g. a day), and μ the number of samples which can be analyzed per unit of time.
Because λ = 1/IAT, where IAT is the mean interarrival time, and μ = 1/AT, where AT is the mean analysis time,

λ/μ = AT/IAT = ρ

ρ is called the utilization factor of the facility or service station. In reality, the queue size (n_q) and the waiting time (w) do not behave as a zero-infinity step function at ρ = 1: queues are formed at lower utilization factors (ρ < 1) as well. This queuing is caused by the fact that, when analysis times and arrival times are distributed around a mean value, a new sample may incidentally arrive before the previous analysis is finished. Moreover, the queue length behaves as a time series which fluctuates about a mean value with a certain standard deviation. For instance, the average lengths of the queues formed in a particular laboratory for spectroscopic analysis by IR, ¹H NMR, MS and ¹³C NMR are respectively 12, 39, 14 and 17 samples, and the sample queues are Gaussian distributed (see Fig. 42.3). This is caused by the fluctuations in both the arrivals of the samples and the analysis times. According to queuing theory the average waiting time (w) grows steeply with increasing utilization factor (ρ) and asymptotically approaches infinity when ρ goes to 100% (see Fig. 42.4). Figure 42.4 shows the waiting time for the simplest queuing system, consisting of one server, independent analysis and arrival processes, Poisson distributed (see Section 15.3) arrivals (number of samples per day) and exponentially distributed analysis times. In the jargon of queuing theory such a system is denoted M/M/1, where the two Ms indicate the arrival and analysis processes, respectively (M = Markov process), and '1' is the number of servers. The number of arrivals follows a Poisson distribution when samples are submitted independently of each other, which is generally valid when the samples are submitted by several customers. The probability of n arrivals in a time interval t is given by:

P_n(t) = (λt)^n e^(−λt) / n!
(42.5)
where λt is the average number of arrivals during the interval t. For instance, the numbers of samples submitted per day to the spectroscopic department mentioned earlier [9] can be modelled by Poisson distributions with means 2.8, 7.7, 2.06 and 2.5 samples per day (Fig. 42.5).
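Equation (42.5) is easily evaluated. A small check, using the mean arrival rate of 2.8 samples per day quoted above for the IR queue:

```python
from scipy.stats import poisson

lam_t = 2.8                        # average number of arrivals per day
for n in range(6):                 # P_n(t) = (lam_t)**n * exp(-lam_t) / n!
    print(n, poisson.pmf(n, lam_t))
```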
Fig. 42.3. Time series of the observed queue lengths (n) in a department for structural analysis, with their corresponding histograms fitted with a Gaussian distribution.
Fig. 42.4. The ratio between the average waiting time (w) and the average analysis time (AT) as a function of the utilization factor (ρ), for a system with exponentially distributed interarrival times and analysis times (M/M/1 system).
Fig. 42.5. The distribution of the probability that n samples arrive per day, observed in a department for structural analysis: (|) observed; (•) Poisson distribution with mean μ.
The following relationships fully describe an M/M/1 system:
- the average queue length (n_q), i.e. the number of samples in the queue, excluding the one which is being analysed:

n_q = ρ²/(1 − ρ)

- the average waiting time (w) in the queue (excluding the analysis time of the sample itself):

w = AT·ρ/(1 − ρ)

Queuing models also describe the distribution of the waiting times, though only for relatively simple queuing systems. Waiting times in an M/M/1 system are exponentially distributed. The probability of a waiting time shorter than a given w_max is given by (see Fig. 42.6):

P(w < w_max) = 1 − ρ e^(−(1−ρ)w_max/AT)
It means that a large part of the samples (65% for ρ = 0.7) waits less than the mean waiting time (taking w_max = w). On the other hand, there is a significant probability (35% for ρ = 0.7) that a sample has to wait longer than this average. It also means that, when the laboratory management wants to guarantee a certain maximal turnaround time (e.g. 95% of the samples within w_max), the mean waiting time should be only 27% of w_max (for ρ = 0.7). Figure 42.7 shows the waiting time distributions of the samples in the IR, ¹H NMR, ¹³C NMR and MS departments mentioned before. Queuing theory also gives the probability of finding k samples in the queue.
Fig. 42.6. Probability that the waiting time is smaller than t (t given in units relative to the average analysis time).
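These relationships make the percentages quoted above easy to reproduce. A small sketch for ρ = 0.7, taking the mean analysis time as the time unit:

```python
import math

rho, AT = 0.7, 1.0

n_q   = rho ** 2 / (1 - rho)            # average queue length
w_bar = AT * rho / (1 - rho)            # average waiting time

def p_wait_less(t):                     # P(w < t) for an M/M/1 system
    return 1 - rho * math.exp(-(1 - rho) * t / AT)

print(p_wait_less(w_bar))               # about 0.65: 65% wait less than the mean
w_max = AT * math.log(rho / 0.05) / (1 - rho)   # 95% of samples within w_max
print(w_bar / w_max)                    # about 0.27: mean must be 27% of w_max
```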
The fact that queues are never empty does not necessarily indicate an oversaturated system.

42.3.2 Application in analytical laboratory management

The overview given in Table 42.1 demonstrates that queues and waiting can be studied by queuing theory in only a limited number of cases. Specifically, the queuing systems that are of interest to the analytical chemist are too complex. However, the behaviour of simple queuing systems provides a good qualitative insight into the queuing processes occurring in the laboratory. An alternative approach, which has been applied extensively in other fields, is to simulate queuing systems and to support the decision making by simulating the effect of the decision. This is the subject of Section 42.4. Before this, we summarize a number of rules of thumb relevant to the laboratory manager, who may control the queuing process by controlling the input, the analytical process and the resources.
(1) Input control: maximum delays may be controlled by monitoring the work load (n_q·AT). When n_q·AT > w_max, or when n_q exceeds a critical value (n_crit), customers are requested to refrain from submitting samples. Too high a frequency of such warnings indicates that the resources are insufficient to achieve the desired w_max.
(2) Priorities do not influence the overall average delay, because w = αw1 + (1 − α)w2, where α is the fraction of samples with a high priority. The values of w1 and w2 depend on α, the kind of priority (see Table 42.2) and ρ [10].
(3) The effect of collecting batches depends on the shortening of the analysis time by batch analysis and on the time needed to collect the batches.
(4) By automation one can remove the variation of the analysis time or shorten the analysis time. Although the variation of the analysis time causes half of the delay, a reduction of the analysis time is more important. This is also true if, by reducing the analysis time, the utilization factor (and thus n_q) would remain the same because more samples are submitted. Since ρ = AT/IAT, any measure to shorten the analysis time has a quadratic effect on the absolute delay (because w = AT²/(IAT − AT)). As a consequence the benefit of duplicate analyses (detection of gross errors) and of frequent recalibration should be balanced against their negative effect on the delay.
(5) By preference, overhead activities should be scheduled in regular blocks, e.g. at the end of the day [10].
(6) For fixed resources (costs), the sampling rate in process control can be increased to maximal utilization (ρ = 1) of the available resources without penalty only if samples are taken at regular time intervals (no variation) and there is no variation in the analysis time (automated measuring device). In other situations an optimal sampling rate will be found where the measurability is maximal. Recalling the fertilizer example discussed in Chapter 20, we can derive the optimal sampling rate for the N determination by an ion-selective electrode by substituting the delay for the analysis time (10 min) in the measurability equation. However, considering that in process control it is always preferable to analyze the last submitted sample (eventually skipping the analysis of the waiting samples, because they do not contain additional information), it is obvious that a last-in-first-out (LIFO) strategy should be chosen.

42.4 Discrete event simulation

In Section 42.3 we discussed that queuing theory may provide a good qualitative picture of the behaviour of queues in an analytical laboratory, but the analytical process is too complex to obtain good quantitative predictions. As this was also true for queuing problems in other fields, another branch of Operations Research, called Discrete Event Simulation, emerged. The basic principle of discrete event simulation is to generate sample arrivals. Each sample is characterized by a number of descriptors; one of those descriptors is, for example, the analysis time. In the jargon of simulation software, a sample is an object, with a number of attributes (e.g. analysis time) and associated values (e.g. 30 min). Other objects are e.g. instruments and analysts. A possible attribute is a list of the analytical
procedures which can be carried out by the analyst or instrument. The more one wants to describe reality in detail, the more attribute-value pairs are added to the objects. For example, if one wants to include down times of the instrument or absence due to illness, such attributes have to be added to the object. An event takes place when the state of the laboratory changes. Examples of events are:
- a sample arrival: this introduces a state change because the sample joins the queue;
- an analysis is ready: this introduces a state change because the instrument and analyst become idle.
With each event a number of actions is associated. For example, when a sample arrives, the following actions are taken:
- if instrument and analyst are idle and if all other conditions are met (batch size), start the analysis. This implies that the next event 'analysis is ready' is generated and that the status of the instrument and analyst is switched to 'busy'. Generate the time of the next arrival.
- otherwise: register the arrival time in the queue and augment the queue size by one. Generate the time of the next arrival.
As events generate other events, the simulation keeps going from event to event until some terminating condition is met (e.g. the end of the simulation time, or the maximum number of samples has been generated); a minimal sketch of such an event loop is given below. As one can see, a specific programming environment, called object-oriented programming [17], is required to develop a simulation model, consisting of object-attribute-value (O-A-V) triplets and rules (see also Section 43.4.2). The little research that has been conducted on the simulation of laboratory systems [9,10,13-15] was primarily focused on demonstrating that it is possible to develop a validated simulation model that exhibits the same behaviour, in terms of queues and delays, as reality. Next, such a validated model is interrogated with the question "What if?". For instance, what if:
- priorities are changed?
- resources are modified?
- the minimal batch size is increased?
In Fig. 42.9 we show the simulation results obtained by Janse [8] for a municipal laboratory for the quality assurance of drinking water. Simulated delays are in good agreement with the real delays in the laboratory. Unfortunately, the development of this simulation model took several man-years, which is prohibitive for widespread application. Therefore one needs a simulator (or empty shell) with predefined objects and rules, by which a laboratory manager would be capable of developing a specific model of his laboratory. Ideally such a simulator should be linked to or integrated with the laboratory information management system in order to extract the attribute values directly.
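The event mechanism described above can be sketched with an event list ordered by time. The following minimal simulation of a single analysis station assumes exponentially distributed interarrival and analysis times, so that its average waiting time can be checked against the M/M/1 result of Section 42.3:

```python
import heapq, random

random.seed(1)
IAT, AT, HORIZON = 1.0, 0.7, 10_000.0        # hypothetical mean times, in days

events = [(random.expovariate(1 / IAT), "arrival")]
queue, busy, waits = [], False, []

while events:
    t, kind = heapq.heappop(events)
    if t > HORIZON:
        break
    if kind == "arrival":
        heapq.heappush(events, (t + random.expovariate(1 / IAT), "arrival"))
        if busy:
            queue.append(t)                  # register arrival time in the queue
        else:                                # idle: start the analysis at once
            busy = True
            waits.append(0.0)
            heapq.heappush(events, (t + random.expovariate(1 / AT), "ready"))
    else:                                    # event 'analysis is ready'
        if queue:
            waits.append(t - queue.pop(0))   # first-in-first-out service
            heapq.heappush(events, (t + random.expovariate(1 / AT), "ready"))
        else:
            busy = False

print(sum(waits) / len(waits))               # compare with AT*rho/(1-rho) = 1.63
```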
Class A if: NET(x_i) > 0
Class B if: NET(x_i) < 0

The classification of objects is based on the threshold value, θ, of NET(x), also called the bias. The procedure can be described by means of a transfer function, F (Fig. 44.4a). The weighted sum of the input values of x is transmitted through a
Fig. 44.4. (a) Schematic representation of the LLM. F represents the threshold transfer function: F = sign(NET(x)). (b) On the left: the threshold function for NET(x) = w1x1 + w2x2; on the right: for NET(x) = w1x1 + w2x2 − θ. See text for explanation.
transfer function, a threshold step function, also called a hard delimiter function. Figure 44.4b shows the threshold function. The classification is based on the output value of the threshold function (+1 for class A; −1 for class B). When the input to the threshold function (i.e. NET(x_i)) exceeds the threshold, the output value, y, of the function is +1, otherwise it is −1. Instead of such a hard delimiter function it is possible to use other transfer functions, such as the threshold logic, also called the semi-linear function (Fig. 44.5a). In this function there is a region where the output value of the transfer function is linearly related to the input value with a slope a. The first point, A, of this region is x1w1 + x2w2 = NET(x1,x2) = θ. The endpoint, B, of the linear region is reached when a(x1w1 + x2w2 − θ) = 1, i.e. x1w1 + x2w2 = NET(x1,x2) = θ + 1/a. The width of the interval between A and B is thus 1/a. It is, moreover, possible to use non-linear functions. The sigmoidal transfer function (Fig. 44.5b) is the most widely used transfer function in the more advanced MLF networks. It will be discussed in more detail in Section 44.5.
Fig. 44.5. (a) The semi-linear function, F = max(0, min(1, a(NET(x) − θ))), with NET(x) = w1x1 + w2x2, and (b) the sigmoid function, F = 1/(1 + e^(−NET(x))).
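For illustration, the three transfer functions mentioned so far can be written as follows (the function and parameter names are ours):

```python
import numpy as np

def hard_delimiter(net):                     # threshold step function
    return np.where(net >= 0, 1.0, -1.0)

def semi_linear(net, a=1.0, theta=0.0):      # linear between theta and theta + 1/a
    return np.clip(a * (net - theta), 0.0, 1.0)

def sigmoid(net):                            # sigmoidal transfer function
    return 1.0 / (1.0 + np.exp(-net))

net = np.linspace(-5, 5, 11)
print(hard_delimiter(net), semi_linear(net), sigmoid(net), sep="\n")
```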
44.4.2 Learning strategy

The weights, as described in the previous section, determine the position of the boundary that the LLM draws between the classes. The strategy used to find these weights is at the heart of the LLM. This procedure is called the learning rule [7,8]. It is a supervised strategy and is based on the learning rule that Hebb suggested for biological neurons [2]. Initially the weights are set randomly. A training set with a number of objects of known classification is presented to the classifier. At each presentation the weights are updated by an amount Δw, determined by the learning rule. The most important variant, used in networks, is the delta rule
developed by Widrow and Hoff (eq. (44.4)) [4]. In this learning strategy the actual output of the neuron is adapted with a term based on δ, the error, i.e. the difference between the desired (target) output and the actual output of the neuron for a specific object (eq. (44.4)):

Δw ∝ x_i δ_i,   with δ_i = d_i − a_i
(44.4)
where d_i is the desired output and a_i is the actual output. In the linear learning machine this rule is applied as follows:
1. present an object i with input vector x_i and apply eq. (44.2);
2. if the classification is correct, the weights are left unchanged;
3. if the classification is wrong, the weights are updated according to eq. (44.4); δ is taken such that the updated weights yield the current output but with an opposite sign, thus yielding the correct classification for object i (see eq. (44.5));
4. go to step 1.
In the LLM the (scalar) desired output value (x_i^T w_new) is defined as the negative of the actual (wrong) output value:

x_i^T w_new = −x_i^T w_old
(44.5)
The trivial solution (w_new = −w_old) is not interesting, since it defines the same boundary line. A non-trivial solution is found by the following procedure: Δw ∝ δ_i x_i, or

Δw = η δ_i x_i / (x_i^T x_i)   (44.6)

(5) NET(x) > D: the response is maximal (approximates 1).

The region from A to D is called the dynamic range. The regions 2 and 4 constitute the most important difference with the hard delimiter transfer function in perceptron networks. These regions, rather than the near-linear region 3, are most important, since they assure the non-linear response properties of the network. It may
Fig. 44.12 continued (b).
seem surprising at first sight that such a simple non-linear unit suffices to model complex non-linear relationships successfully. One must not forget, however, that it is the proper combination (weight setting) of several of those sigmoid functions that assures the overall modelling power. In Fig. 44.12a an example of a combination of the sigmoidal functions of two hidden units is given, so that a Gaussian-like relationship between input and output is modelled. Changing the weights yields Fig. 44.12b. It is easy to imagine that in such a way complex relationships can be modelled with more sigmoids. In this view an MLF network can be seen as a kind of automatic spline fitter (Chapter 11). It is automatic because the combination of basis functions and the knot setting are done automatically during the weight and bias setting in the training phase by means of the learning rule (Section 44.5.5).
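The construction of Fig. 44.12a can be imitated in a few lines: the difference of two shifted sigmoids, combined by the output unit, yields a Gaussian-like bump. The weights below are chosen for illustration only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 201)
hidden1 = sigmoid(1.5 * x + 3.0)          # first hidden unit rises early
hidden2 = sigmoid(1.5 * x - 3.0)          # second hidden unit rises late
y = hidden1 - hidden2                     # output weights +1 and -1
print(float(y.max()), float(y[0]), float(y[-1]))   # bump centre, ~0 at edges
```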
Fig. 44.12 continued (c).
When the MLF is used for classification, its non-linear properties are also important. In Fig. 44.12c the contour map of the output of a neural network with two hidden units is shown. It shows clearly that non-linear boundaries are obtained. Totally different boundaries are obtained by varying the weights, as shown in Fig. 44.12d. For modelling as well as for classification tasks, the appropriate number of transfer functions (i.e. the number of hidden units) thus depends essentially on the complexity of the relationship to be modelled and must be determined empirically for each problem. Other functions, such as the tangens hyperbolicus (Fig. 44.13a), are also sometimes used. In Ref. [19] the authors came to the conclusion that in most cases a sigmoidal function describes non-linearities sufficiently well. Only in the presence of periodicities in the data
Fig. 44.12 continued (d).
is this type of function inadequate; in that case a sinusoidal transfer function (Fig. 44.13b) is necessary. In a special kind of MLF network the so-called radial basis function is used as transfer function. We describe these networks in Section 44.6.

44.5.5 Learning rule

In MLF networks the proper setting of the weights is optimized by supervised training. The weights are optimized by means of a number of example input patterns together with their associated desired output patterns. During the training session the weights are adapted according to the learning rule.
Fig. 44.13. (a) The tangens hyperbolicus and (b) the sine function.
The most common learning algorithm is the back-propagation learning rule [7,8]. The weight updates are, as in the delta rule, based on the difference between the actual and the desired output of the network. The weight updating can be done after each training example that is offered to the network, or after all training examples have been seen once. The two procedures are essentially equivalent. Since the former is applied most often, it is explained here. It can be summarized as follows:
1. Initialize the weights with small random values in a range around 0 (e.g. −0.3 to +0.3).
2. For each output unit, j, calculate the output value with the current weight setting and the error, E_j, based on the difference between this value and the target or desired output value:
E_j = ½ (d_j − o_j)²
(44.9)
E_j is determined by the weights (through o_j, which is a function of NET_j; see eq. (44.8)). Note that this error is in fact the same as the error term used in a usual least squares procedure.
3. Carry out the weight adaptation of the output neurons:

Δw_hj = η δ_j o_h
(44.10)
where w_hj is the weight between the hth hidden unit and the jth output unit; η is the learning rate, a positive constant between 0 and 1; o_h is the output of the hth hidden unit, and δ_j is a term based on the error. In the back-propagation strategy the weight adaptation is made in the direction that minimizes the error. The back-propagation algorithm is thus essentially a gradient-based optimization method. Therefore, the gradient of the error as a function of the weights must be calculated. In Fig. 44.14 this is shown for one of the weights. For the output units this gradient minimization yields eq. (44.11); this also explains why the transfer function must have a derivative over the whole response domain:

δ_j = (d_j − o_j) d(sf(NET_j))/d(NET_j)   (44.11)
    = (d_j − o_j) sf(NET_j) [1 − sf(NET_j)]

where sf(NET_j) represents the value of the sigmoidal function for NET_j.
Fig. 44.14. An example of an error surface of an MLF network as a function of one weight value. The gradient determines the change of the weight in the next iteration.
4. Calculate the adaptation of the weights to the hidden layer. Here the desired output, and thus the error, is not known directly; in the back-propagation strategy it is calculated from the accumulated errors of the output neurons. It can be shown (e.g. Refs. [14] and [15]) that this yields eq. (44.12), in which k runs over the q units in the output layer. All weight adaptations can be calculated using eqs. (44.10) and (44.12). The error is said to be back-propagated through the network.

δ_j = d(sf(NET_j))/d(NET_j) Σ_{k=1}^{q} δ_k w_jk   (44.12)
    = sf(NET_j) [1 − sf(NET_j)] Σ_{k=1}^{q} δ_k w_jk
5. Repeat this process for all input patterns. One iteration or epoch is defined as one weight correction for all examples of the training set.
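Steps 1-5 can be condensed into a short program. The following is a minimal sketch of online back-propagation with momentum (eqs. (44.9)-(44.13)) for one hidden layer; the data are invented and bias terms are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.uniform(-1, 1, (50, 2))                   # 50 patterns, 2 input units
D = (X[:, :1] ** 2 + X[:, 1:] ** 2 < 0.5) * 1.0   # 1 output unit (0/1 target)

h, eta, alpha = 3, 0.5, 0.7                       # hidden units, learning rate, momentum
W1 = rng.uniform(-0.3, 0.3, (h, 2))               # input -> hidden weights
W2 = rng.uniform(-0.3, 0.3, (1, h))               # hidden -> output weights
dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)

for epoch in range(500):
    for i in rng.permutation(len(X)):             # randomize the pattern order
        o_h = sigmoid(W1 @ X[i])                  # hidden outputs
        o_j = sigmoid(W2 @ o_h)                   # network outputs
        delta_j = (D[i] - o_j) * o_j * (1 - o_j)          # eq. (44.11)
        delta_h = o_h * (1 - o_h) * (W2.T @ delta_j)      # eq. (44.12)
        dW2 = eta * np.outer(delta_j, o_h) + alpha * dW2  # eq. (44.13)
        dW1 = eta * np.outer(delta_h, X[i]) + alpha * dW1
        W2 += dW2
        W1 += dW1
```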
44.5.6 Learning rate and momentum term

The learning rate, η, in eq. (44.10) is important for the performance of the training procedure. If it is small, the convergence of the weights to an optimum may be slow and there is a danger of getting stuck in a local optimum. If the learning rate is high, the system may oscillate. An appropriate value of the learning rate depends on the transfer function of the output units. For a sigmoidal transfer function, η values between 0.5 and 1 are used; for a linear transfer function much smaller values (0.001 < η < 0.1) are applied. To limit the danger of oscillation, eq. (44.10) can be adapted as follows:

Δw_hj(n) = η δ_j o_h + α Δw_hj(n − 1)
(44.13)
In eq. (44.13) the term Δw_hj(n) is the current weight change and Δw_hj(n − 1) is the weight change from the previous learning cycle; α is called the momentum. It represents the proportion of the weight change of the previous learning cycle that is taken into account to determine the step size of the weight in the present learning cycle. Its value is usually set between 0.6 and 0.8. The influence of the momentum term is less important than that of the learning rate. The optimal values of α and η depend on the problem under study and can be found by means of an optimization procedure, although this is usually not done and only a few values of η and α are tried.
44.5.7 Training and testing an MLF network

During a supervised training session the weights of the network are optimized, using a training set. Since the weight adaptations are done by means of the training set, it should contain a sufficient number of relevant examples for the relationship to be modelled. The more hidden units, the more examples are necessary to train the network. The number of weights to be optimized indeed increases drastically with the number of units in the network:

number of weights = ph + qh = h(p + q)

where p is the number of input units, h the number of hidden units and q the number of output units. As described in Section 44.5.5, the weights are adapted along the gradient that minimizes the error in the training set, using the back-propagation strategy. One iteration is not sufficient to reach the minimum in the error surface. Care must be taken that the sequence of input patterns is randomized at each iteration, otherwise bias can be introduced. Several (50 to 5000) iterations are typically required to reach the minimum.

44.5.7.1 Network performance

During the training session the performance of the network must be monitored. Different performance criteria are possible, but usually the normalized standard error, NSE, is used:
NSE = (1/nq) Σ_{i=1}^{n} Σ_{j=1}^{q} (d_ij − a_ij)²
(44.14)
where q is the number of output units, n is the number of examples in the training set, d is the desired output value and a is the actual output value. The NSE is calculated after each iteration of the training session. The NSE is not always the best performance criterion. One badly predicted pattern (e.g. an outlier) may strongly influence this performance criterion. It also gives no indication of which or how many patterns are well or badly predicted: small errors, close to 0, for almost all training examples and a few badly predicted examples will yield the same NSE as average errors for all examples. For classification purposes the NSE is not the ideal criterion, since it is more important that the output values are above or below a certain threshold (e.g. 0.9 or 0.1) than that they are exactly 0 or 1. The NSE does give, however, a good general idea of the performance of the network and is therefore commonly used during the training session. In Fig. 44.15a a performance curve is shown. The NSE decreases monotonically with the number
Fig. 44.15. (a) Performance curve of an MLF network for a training set. (b) Performance curve of an MLF network with too high a learning rate.
of iterations. Training with too high a learning rate, η, yields the performance behaviour of Fig. 44.15b. When the network must be used to predict future unknown examples, it is not good practice, however, to judge the performance of the network based on the training set only. Together with the NSE of the training set, the NSE of an independent set, the monitoring set, must always be calculated. The NSE value for the monitoring set is usually somewhat larger than the NSE for the training set. In Fig. 44.16a an ideal performance behaviour of a network is shown. A typical performance behaviour is shown in Fig. 44.16b. The increase of the NSE for the monitoring set is a phenomenon that is called overtraining. This phenomenon can be compared to fitting a curve with a polynomial of too high an order, or with a PCR or PLS model with too many latent variables. It is caused by the fact that after a certain number of iterations the noise present in the training set is modelled by the network. The network then acts as a memory, able to recall
Fig. 44.16. (a) Ideal performance curve of an MLF network. The solid line represents the training set; the dashed line represents the monitoring set. (b) Overtraining of the network begins at point A. (c) Paralysed network. (d) Performance curve of an MLF network for which the training and monitoring samples represent different models.
exactly the training examples but losing its ability to predict. This is harmful for the predictive performance of the network. When the NSE of the monitoring set is acceptable, it suffices to stop the training at point A. Overtraining occurs sooner when the number of hidden units is too large for the problem complexity, or when the number of training examples is too small to train all the weights adequately. The remedy then is to use a network with a smaller number of units. In Fig. 44.16c the performance behaviour of a paralysed network is shown. The normal decrease of the NSE stops too soon and the NSE remains too high to be acceptable. Paralysis of the network occurs when the weighted input value,
NET_j, is situated in the response regions 1 or 5 of the transfer function for too many units (Fig. 44.11). As long as NET_j is smaller than A or larger than D, the response does not change when NET_j changes (i.e. when the weights change). A remedy is to increase the learning rate, η, or to explicitly decrease the slope, β, of the transfer function, so that the input values fall again inside the dynamic region. Retraining the network with a different weight initialization is another option. This performance behaviour is also observed when too few hidden units are used: the network is then not able to model the relationship properly, and increasing the number of hidden units is the remedy in this case. The performance behaviour shown in Fig. 44.16d is caused by the fact that the monitoring set and the training set represent different relationships, or by outliers that are present in the monitoring set but not in the training set. When not enough examples are available to make an independent monitoring set, the cross-validation procedure can be applied (see Chapter 10). The data set is split into C different parts and each part is used once as monitoring set. The network is trained and tested C times. The results of the C test sessions give an indication of the performance of the network. It is strongly advised to validate the network that has been trained by the above procedure with a second independent test set (see Section 44.5.10).

44.5.7.2 Local minima

The back-propagation strategy is a steepest gradient method, a local optimization technique. Therefore, it also suffers from the major drawback of these methods, namely that it can become locked in a local optimum. Many variants have been developed to overcome this drawback [20-24]. None of these does, however, really solve the problem. A way to obtain an idea of the robustness of the obtained solution is to retrain the network with different weight initializations. The results of the different training sessions can be used to define a range around the performance curve, as shown in Fig. 44.17. This procedure can also be used to compare different networks [20].

44.5.8 Determining the number of hidden units

From the previous sections it is clear that it is important to use a network with a suitable number of hidden units. When too few hidden units are used, the relationship cannot be modelled properly and the network shows poor performance. Too large a number of hidden units causes severe overtraining. The suitable number of hidden units depends on the problem complexity and on the number of training examples that are available. It must be determined empirically. There are basically three approaches for this:
Fig. 44.17. Determining the number of hidden units for an MLF network. The whiskers represent the range of the error with different random weight initializations or by cross-validation.
- Train and test a network with a certain number of hidden units, based on an educated guess. When the NSE of the network is acceptable and no severe overtraining occurs, the network is suitable.
- A second approach is to train different networks with different numbers of hidden units. Preferably, each network is trained several times with a different weight initialization. From a plot as in Fig. 44.17 it is then straightforward to select a suitable number of hidden units. This approach is certainly the best, but it involves the training and testing of many networks and is thus a time-consuming procedure.
- Another approach that has been used is the so-called pruning procedure [26,27]. One starts with a network with a large number of hidden units. During the training phase the weight changes of all units are monitored. Those units or connections whose weights remain low are removed and the training is continued. This procedure has not been very successful for MLF networks and is not often applied.
44.5.9 Data preprocessing

44.5.9.1 Scaling

So far, we have not considered the nature of the data in the X or Y matrix. However, as with all other data handling techniques, preprocessing of the data may be desirable in some cases. The reasons for scaling are essentially the same as described in earlier chapters. Methods that are often used for continuous values are autoscaling and range scaling. The data can be scaled by columns or by rows. Scaling is often applied if the variables have different ranges of values, since variables with large values can dominate the model. Another reason for scaling in neural networks is that large input values for a certain unit, especially in combination with large weights, may cause the NET input of that unit to fall in the tail of the sigmoid transfer function. In this region the derivative is small and, as explained in Section 44.5.5, weight corrections according to the back-propagation rule are proportional to the derivative of the transfer function (see eqs. (44.10) and (44.11)). This may then lead to paralysis of the network. Scaling of the input brings the net input within the appropriate range of the sigmoid transfer function.
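As an illustration (a minimal sketch, not code from the original text), column-wise autoscaling and range scaling of a data matrix X with the objects as rows can be written as:

```python
# Illustrative sketch: column-wise autoscaling and range scaling of a data
# matrix X (rows = objects, columns = variables).
import numpy as np

def autoscale(X):
    # Subtract the column mean and divide by the column standard deviation,
    # so that every variable gets zero mean and unit variance.
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def range_scale(X, low=0.1, high=0.9):
    # Map every column linearly into [low, high]; values well inside the
    # dynamic region of the sigmoid help avoid paralysis of the network.
    span = X.max(axis=0) - X.min(axis=0)
    return low + (high - low) * (X - X.min(axis=0)) / span
```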
44.5.9.2 Variable selection and reduction

The number of variables in the X and Y matrix determines the number of input and output units, respectively. The total number of units determines the total number of weights that have to be optimized during the training session. When there is a large number of input variables, variable selection and/or reduction may be necessary. This is for example the case when the input consists of a spectrum of, e.g., 1000 wavelengths. Preprocessing the data with PCA seems attractive, but it is not always the appropriate approach. Neural networks, unlike some other techniques, do not suffer from correlations in the data; on the contrary, the network can extract relevant information from these correlations, which is not possible with other techniques. Other methods have been proposed to perform the variable selection (e.g. genetic algorithms and simulated annealing, see Chapter 27).

44.5.10 Validation of MLF networks

Validation of neural networks is usually based on an independent test set. Note that this test set should be different from the monitoring set as described in Section 44.5.7. If sufficient samples are not available, cross-validation is applied [28]. The presence of local minima causes an additional problem, for which retraining the network with different weight initializations is an important diagnostic tool. It must be noted, however, that retraining the network with different initial weights yields different final weight settings, and thus different networks. The next problem is to select the best network among those. One may decide to use the network that yields the lowest NSE for the test set. As mentioned before, however, the NSE does not cover all performance aspects. It is good practice to consider also other performance criteria at this stage of the development. The distribution of the errors is certainly important. Another important performance criterion is the robustness to noise on the input values. Derks et al. [29] proposed an empirical procedure to test this robustness, based on the addition of gradually increasing amounts of noise to the input data. Networks whose performance is less sensitive to this noise injection are to be preferred over sensitive networks.
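A minimal sketch of such a noise-injection test is given below. It illustrates the idea only and is not the exact procedure of Ref. [29]; trained_net (any fitted model with a predict method), X_test and y_test are assumed to exist.

```python
# Illustrative sketch of a noise-injection robustness test: Gaussian noise of
# gradually increasing standard deviation is added to the (scaled) inputs and
# the degradation of the prediction error is recorded.
import numpy as np

def noise_robustness(trained_net, X_test, y_test,
                     noise_levels=(0.0, 0.01, 0.02, 0.05, 0.1), n_repeat=20):
    rng = np.random.default_rng(0)
    curve = []
    for sigma in noise_levels:
        errs = []
        for _ in range(n_repeat):  # average over random noise realizations
            X_noisy = X_test + rng.normal(0.0, sigma, size=X_test.shape)
            resid = y_test - trained_net.predict(X_noisy)
            errs.append(np.sqrt(np.mean(resid ** 2)))
        curve.append((sigma, float(np.mean(errs))))
    return curve  # a flat curve indicates a robust network
```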
Gemperline studied the robustness of MLF networks in comparison with PLS [30]. The more performance criteria are used to validate the model, the higher the probability of obtaining a good impression of the network's predictive value for samples that lie within the range of the training set. Since no theoretical knowledge about the model is available, it is dangerous to extrapolate with MLF networks. Before using the network for prediction, it should be checked whether the input falls within the range of the training set.

44.5.11 Aspects of use

MLF networks are very powerful but should be applied with caution. They are powerful because practical experience has shown that they are able to model difficult non-linear relationships in an easy way, much better and faster than alternative techniques. There are, however, many pitfalls in the use of MLF networks.
- Since they are used for complex relationships, about which little prior knowledge is available, it is very hard to be sure that the training, monitoring and test examples are representative of the data.
- Usually many local minima are present in the error surface. It is impossible to guarantee that the overall optimum is reached.
- The validation of the obtained result is another problem. It is as yet not possible to obtain a confidence interval around the obtained output; research on this topic is ongoing [21,28].
- It is not possible to extract theoretical information on the model in a direct way. The network is to be considered as an empirical model builder, such as a spline or polynomial fitting procedure.

44.5.12 Chemical applications

Especially in the last few years, the number of applications of neural networks has grown exponentially. One reason for this is undoubtedly the fact that in many applications neural networks outperform the traditional (linear) techniques. The large number of samples needed to train neural networks certainly remains a serious bottleneck. The validation of the results is a further issue of concern.
TABLE 44.3
Examples of chemical applications

- Pattern recognition [31-38]
- Spectrum interpretation [39-45]
- Quality control / process control [46-49]
- Structure elucidation [50,51]
- Non-linear calibration and modelling [52-58]
- Quantitative structure-activity relationships [11,59]
- Signal processing [60,61]
Even more than with traditional techniques, care must be taken not to extrapolate. When the training data are not evenly distributed, even interpolation can cause problems [20]. The applications concern classification as well as quantification problems. In Table 44.3 some examples are given in both problem domains.

44.6 Radial basis function networks

44.6.1 Structure

Radial basis function (RBF) networks are a variant of three-layer feed-forward networks (see Fig. 44.18). They contain a pass-through input layer, a hidden layer and an output layer, but a different approach for modelling the data is used. The transfer function in the hidden layer of RBF networks is called the kernel or basis function; for a detailed description the reader is referred to references [62,63]. Each node in the hidden layer thus contains such a kernel function. The main difference between the transfer function in MLF and the kernel function in RBF is that the latter (usually a Gaussian function) defines an ellipsoid in the input space. Whereas the MLF network basically divides the input space into regions via hyperplanes (see e.g. Figs. 44.12c and d), RBF networks divide the input space into hyperspheres by means of the kernel function with specified widths and centres. This can be compared with the density or potential methods in pattern recognition (see Section 33.2.5). The output of a hidden unit, in the case of a Gaussian kernel function, is defined as:

o_i = O_i(x) = exp(-||x - c_i||^2 / b_i^2)    (44.13)

||x - c_i|| is the Euclidean distance (other distance measures are also possible) between the input vector, x, and c_i, the centroid of the Gaussian kernel function.
Fig. 44.18. An example of an RBF network.
The parameter b_i represents the width of the Gaussian function. The centroids, c_i, and the widths, b_i, of all the hidden units together define the so-called activation space, in which the Gaussian function has a value larger than a given threshold value. Nodes containing a kernel whose centroid is close (in comparison with the width, b_i) to the input pattern are predominantly activated, in contrast to distant kernels, which yield low output values of the hidden units. The activation space can thus be regulated by the width factor and the centroid position, providing a means to adapt the local behaviour of the network. The output of these hidden nodes, o_i, is then forwarded to all output nodes through weighted connections. The output y_j of these nodes consists of a linear combination of the kernel functions:

y_j = Σ_i w_ji o_i(x)
where w_ji represents the weight of the connection between hidden unit i and output unit j.
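A minimal sketch of this forward pass (an illustration only; the centres, widths and weight matrix are assumed to be given as NumPy arrays):

```python
# Illustrative sketch of the forward pass of an RBF network with Gaussian
# kernels: hidden outputs by eq. (44.13), outputs as their weighted sum.
import numpy as np

def rbf_forward(x, centres, widths, W):
    # x: input vector (p,); centres: (n_hidden, p); widths: (n_hidden,)
    # W: output weights (n_out, n_hidden)
    dist2 = np.sum((centres - x) ** 2, axis=1)   # squared Euclidean distances
    o = np.exp(-dist2 / widths ** 2)             # Gaussian kernel outputs o_i
    return W @ o                                 # y_j = sum_i w_ji * o_i(x)
```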
44.6.2 Training

The centroid positions, c_i, of the kernel functions and their widths, b_i, determine the output of the hidden units. The centroid positions are fixed at the initialization of the network; the widths and the weights are obtained by supervised training. Initializing and training these positions and widths is a critical issue in constructing RBF networks. The supervised network training is generally performed by error back-propagation by means of the generalized delta learning rule (Section 44.5.5). The Gaussian kernels are continuously differentiable functions and are therefore suited for this algorithm.
Approaches for the initialization and training of the positions of the centres, the widths and the weights are summarized below:
1. For the distribution of the positions of the centres the following methods, or combinations of methods, are used [29]:
- random distribution within the range of interest;
- random selection of input patterns;
- maximal coverage of the range of interest, e.g. with the Kennard and Stone algorithm (see Chapter 24);
- selection of representative input patterns, defined by means of the output of a Kohonen network applied to the input patterns (see Section 44.7);
- selection of representative input patterns, defined by means of clustering methods (e.g. k-means [65] (see Section 30.3.2), genetic algorithms or simulated annealing [64] (see Chapter 27)).
2. The widths, b_i, of the kernel functions are obtained by:
- the supervised training procedure (error back-propagation);
- fixing them at the initialization phase, based on expertise and prior knowledge of the data structure.
3. The weights, w_ji, are trained with the usual error back-propagation strategy. It is generally advised to use an abundant number of kernels on random positions, which are then successively pruned during training.

44.6.3 An example

In this section the local modelling capability of RBF networks is demonstrated on the logical operators AND and OR. The examples illustrate the structure and the physical meaning of the weights of the RBF network. Two datasets of four training objects each have been created according to the logical operators AND and OR. In Fig. 44.19 the datasets and the network topology are depicted graphically. The AND operator yields a positive output on inputs of identical sign, whereas the OR operator responds positively to any positive input. The centroids represent the positions of the Gaussian kernels and are in this example positioned in the same place as the objects in the input space. The width factors do not change during training; the weights of each kernel function are obtained by training. In Fig. 44.20a a grid of data within the range [-1...1] has been propagated through the OR network model to depict the activation space. It can be seen that the OR function has been trained properly, i.e. the output is high or low at the correct positions. Where no input data were available the network yields zero as output value. In Fig. 44.20b the same grid of data has been propagated through an OR network model obtained with a broader width parameter.
Fig. 44.19. The logical operators AND and OR and their RBF implementation.
It can be seen that the network still yields high and low values at the positions of the input patterns, but it interpolates in a smoother way. It follows from the introduction of this section that the concept of RBF networks is based on local approximation of the data by means of kernel functions, whereas MLF networks try to model the data by constructing non-linear hyperplanes in the input space. In Fig. 44.21 the predictions of an AND model obtained by MLF and by RBF are shown. It can clearly be seen that the output of both networks is the same in the neighbourhood of the input objects. However, the output differs for the positions in between the input objects: the RBF network yields zero output for the interpolated grid positions, while the output of the MLF network is different from zero. The results can be influenced by the width parameters of the kernels. Which of the two is better should be evaluated by means of independent test sets.
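The AND example can be reproduced with a short script. The sketch below is an illustration under stated assumptions: the centres are fixed at the four training points, the width factor is fixed, and the output weights are obtained by linear least squares rather than by back-propagation (once centres and widths are fixed, the weights enter linearly).

```python
# Illustrative sketch of the logical AND example: Gaussian kernels centred on
# the four training points, a fixed width, and output weights fitted by least
# squares.
import numpy as np

X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])  # training inputs
y = np.array([1., -1., -1., 1.])   # AND: positive for inputs of identical sign
b = 0.5                            # fixed width factor (an arbitrary choice)

def hidden_outputs(points):
    # o_i(x) = exp(-||x - c_i||^2 / b^2) for every kernel centre c_i (rows of X)
    d2 = ((points[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / b ** 2)

w, *_ = np.linalg.lstsq(hidden_outputs(X), y, rcond=None)  # one weight per kernel

# Propagate a 40x40 grid through the model, as in Fig. 44.21b.
g = np.linspace(-1, 1, 40)
grid = np.array([[u, v] for u in g for v in g])
predictions = hidden_outputs(grid) @ w  # near zero away from the four centres
```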
Fig. 44.20. (a) The trained logical OR. The contour lines in the x1,x2-plane around the centres denote the 20, 40, 60 and 80 percent confidence limits of the Gaussian kernels. (b) The trained logical OR with increased width factors.
Fig. 44.21. Grid (40x40) prediction of the logical AND for (a) the MLF model and (b) the RBF model.
44.6.4 Applications

In chemical practice, problems are far more complex than the example problems described in the previous section. To apply RBF networks properly, it is useful to have a considerable amount of prior knowledge about the data, especially for the distribution of the centroid positions and widths [29]. Although this might be considered a disadvantage, at least the weights have a physical or chemical meaning, allowing a better understanding of the model. RBF networks have been applied in process control [66,67]. A combination of RBF and PLS as a flexible non-linear regression technique has also been proposed [68].
44.7 Kohonen networks

44.7.1 Structure

Kohonen networks belong to the class of self-organizing maps. In contrast to MLF and RBF networks, they are designed for unsupervised pattern recognition tasks. The Kohonen network consists of one layer of neurons, ordered in a low-dimensional map such as a one-dimensional array or a two-dimensional matrix (see Fig. 44.22). Each neuron or unit contains a weight vector of the same dimension as the input patterns. To train the network, the unsupervised Kohonen learning rule is applied (see Section 44.7.2). After training, the individual weight vectors are oriented in such a way that the structure of the input space (the topology) is represented as well as possible in the resulting map (see further). Each object or input pattern is assigned to (or mapped on) the neuron with the most similar weight vector.
Fig. 44.22. Three commonly used Kohonen network structures. (a) One-dimensional array; (b) two-dimensional rectangular network (each unit, apart from the borderline units, has 8 neighbours) and (c) two-dimensional hexagonal network (each unit, apart from the borderline units, has 6 neighbours). (Reprinted with permission from Ref. [70]).
The goal of the Kohonen network is to map similar objects on the same or neighbouring neurons. The interpretation of the Kohonen map is described in Section 44.7.3. The Kohonen mapping is primarily used for classification purposes.

44.7.2 Training

The training process of a Kohonen network consists of a competitive learning procedure and can be summarized as follows (a sketch in code is given after this list):
- Initialize the weight vectors of all units with random values.
- Calculate a specified similarity measure, D, between each weight vector and a randomly chosen input pattern, x.
- The unit in the Kohonen map that is most similar to the input vector is declared the winning unit and is activated (i.e. its output is set to 1). The output of a Kohonen unit is typically 0 (not activated) or 1 (activated).
- The weights of the winning unit are adapted according to:

w(t+1) = w(t) + η (x - w(t))

where η is the learning rate and w(t) is the weight vector.
- The weights of the units in the close vicinity of the winning unit are also adapted, according to:

w(t+1) = w(t) + η N(t,r) (x - w(t))

where N(t,r) is a predefined neighbourhood function in which r represents the distance in the map between the considered unit and the winning unit. Different neighbourhood functions can be used; the principle is always that units closer to the winning unit are adapted most. Some common functions are shown in Fig. 44.23. Due to this aspect of the learning procedure, the connectivity (i.e. the number of neighbours of each unit) of the network has an influence on its performance. Networks with a different number of neighbours for the units are shown in Fig. 44.22.

The above steps are repeated for all input patterns to complete one iteration or cycle; a training procedure consists of several such cycles. During the training process the neighbourhood definition is usually narrowed down, allowing convergence of the network (see Fig. 44.24). As a result of the training procedure, the weight vectors of the winning unit and its neighbours move gradually towards the applied input vector, as shown in Fig. 44.25. In this way, similar input vectors will map in the same region of the Kohonen map. Due to the unsupervised nature of the Kohonen algorithm, it cannot be checked which units are associated with specific clusters in the input patterns. However, inspection of all weight vectors provides information concerning topological aspects of the input pattern space, i.e. the position of the input patterns relative to each other.
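A minimal sketch of this training procedure (an illustration only: a rectangular map, Euclidean distance as the similarity measure D, and a Gaussian neighbourhood function that narrows during training; the map size and rate schedule are arbitrary choices):

```python
# Illustrative sketch of Kohonen training on a rectangular map.
import numpy as np

def train_kohonen(X, map_shape=(6, 6), n_cycles=100, eta=0.5):
    rng = np.random.default_rng(0)
    rows, cols = map_shape
    W = rng.uniform(X.min(), X.max(), size=(rows, cols, X.shape[1]))
    # grid coordinates of every unit, used for the neighbourhood function
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing='ij'), axis=-1)
    for t in range(n_cycles):
        radius = max(rows, cols) / 2 * (1 - t / n_cycles) + 0.5  # shrinking
        for x in X[rng.permutation(len(X))]:       # random presentation order
            d = np.linalg.norm(W - x, axis=2)      # D between x and all units
            winner = np.unravel_index(np.argmin(d), d.shape)
            r = np.linalg.norm(grid - np.array(winner), axis=2)  # map distance
            N = np.exp(-(r ** 2) / (2 * radius ** 2))            # neighbourhood
            W += eta * N[..., None] * (x - W)      # w(t+1) = w(t) + eta*N*(x-w)
        eta *= 0.99                                # slowly decreasing learning rate
    return W
```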
Fig. 44.23. Some common neighbourhood functions used in Kohonen networks: (a) a block function, (b) a triangular function, (c) a Gaussian-bell function and (d) a Mexican-hat shaped function. In each of the diagrams the winning unit is situated at the centre of the abscissa. The horizontal axis represents the distance, r, to the winning unit; the vertical axis represents the value of the neighbourhood function. (Reprinted with permission from [70]).
Fig. 44.24. Example of the gradual narrowing of the neighbourhood function during the training process. Light grey: definition of the neighbourhood at stage 1; middle grey: at stage 2; dark grey: at stage 3. O is a unit and • is the winning unit. (Adapted from Ref. [70]).
Fig. 44.25. Example of the weight update of the winning unit in a Kohonen network. (Reprinted with permission from Ref. [70]).
44.7.3 Interpretation of the Kohonen map

After the training procedure, the weight vectors of the units are fixed and the map is ready to be interpreted. There are different possibilities to interpret the map, depending on the purpose for which the network is used. In this section some possibilities are described.

The output-activity map. A trained Kohonen network yields for a given input object, x_i, one winning unit, whose weight vector, w_j, is closest (as defined by the criterion used in the learning procedure) to x_i. However, x_i may be close to the weight vectors of other units as well. The output y_j of the units of the map can therefore also be defined as:

y_j = D(w_j, x_i)
where D is the similarity measure used in the training procedure. This results in a map as in Fig. 44.26a, which allows the inspection of regions (neighbouring neurons) whose weight vectors are similar to a given input x_i. Note that each input x_i yields a different output-activity map.

The counting map (Fig. 44.26b). This map is obtained by counting, for each unit of the Kohonen network, the number of training objects for which that unit is the winning one. This map provides insight into the number of clusters that are present in the dataset. When, for example, all input patterns are assigned to two distinct regions (sets of neighbouring neurons) in the map, it can be concluded that there are two clusters in the dataset.
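A counting map is straightforward to compute once the map has been trained. The sketch below is an illustration, assuming a trained weight array W of shape (rows, cols, p), such as the one returned by the training sketch in Section 44.7.2:

```python
# Illustrative sketch: build a counting map by finding the winning unit for
# every training object and counting how often each unit wins.
import numpy as np

def counting_map(W, X):
    counts = np.zeros(W.shape[:2], dtype=int)
    for x in X:
        d = np.linalg.norm(W - x, axis=2)              # distance to all units
        counts[np.unravel_index(np.argmin(d), d.shape)] += 1
    return counts  # dark regions in Fig. 44.26b = units with high counts
```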
Fig. 44.26. Some possibilities for analysing a two-dimensional Kohonen map (Reprinted with permission from Ref. [70]). (a) Grey-encoded output-activity map for a given training example. Dark areas in the map indicate a high similarity between the weight vector of the unit and the input object. (b) A counting map: dark areas indicate a large number of training examples that map on the unit. Units on which no training examples map are indicated white. (c) Feature map indicating units on which training examples map with (+) or without (-) a certain feature. (X) indicates that different labels are assigned to the unit. (d) Feature map with class identifications A and B as labels. (X) indicates multiple class labels.
The feature map. This map can be obtained when labels can be assigned to the training objects. Labels may consist of a known classification, or of the presence versus absence of certain features. For each training object, x_i, the winning unit is determined and the label of x_i is assigned to this unit. When this is completed for all training objects, each unit in the map carries zero, one or more labels (see Figs. 44.26c and 44.26d). When one unit is labelled with more than one label, an overlap is present; in this respect the method can be compared with fuzzy clustering techniques (see Chapter 30). A detailed description can be found in references [69,70].

44.7.4 Applications

Due to the Kohonen learning algorithm, the individual weight vectors in the Kohonen map are arranged and oriented in such a way that the structure of the input space, i.e. the topology, is preserved as well as possible in the resulting
low-dimensional map. Therefore, the Kohonen map is said to be a topology-preserving mapping technique. It is primarily used for the examination of data for which little or no prior knowledge is available. The choice of the size and dimensionality of the network depends on the type of problem. When too few units are used for classification purposes, different classes will fall together on the same unit(s). Too many units will not result in clustering of the data: all training objects will tend to map on different units. A rule of thumb is to take twice the number of expected classes. Kohonen networks have been applied, e.g., for mapping IR spectra [71], for mapping surface molecular properties of organic molecules [72] and for classification [73] (e.g. of SIMS images [74] or plant seeds [75]). Another interesting application is in the design of RBF networks (Section 44.6). The Kohonen map can best be compared with techniques such as non-linear mapping or multi-dimensional scaling (see Chapter 38). Reference [76] reviews the use of Kohonen neural networks in analytical chemistry.
44.8 Adaptive resonance theory networks

44.8.1 Introduction

Adaptive resonance theory (ART) networks are also designed for unsupervised pattern recognition. There exists a variant (ARTMAP) that is meant for supervised pattern recognition, but we consider here the more common unsupervised variant. For a detailed description see Ref. [14]. Grossberg designed the adaptive resonance theory to overcome a general drawback of classifiers (e.g. MLF networks), namely the fact that once the training procedure (supervised or unsupervised) is completed, the weights are fixed. The network or classifier is designed for a certain well-defined classification task and is trained by means of a set of representative training examples. This is an acceptable procedure only when the classification problem is well defined and only as long as it remains stable. Unfortunately this is often not the case in real situations. It may happen that a new class of objects appears that was not represented in the training set; the only remedy in that case is to retrain the network, including training objects of the new class. It is also possible that existing classes change in time, e.g. by drifting away. Here too, the proper action is to retrain the network with new representative training objects. This is what Grossberg called the stability-plasticity dilemma: how can a system be adaptive to relevant new input and at the same time be stable to irrelevant input changes? In response to this, he and others developed the adaptive resonance theory. The essence of ART is pattern matching. When a familiar pattern is presented to the network (i.e. one that satisfies the limiting conditions of the previous examples), the network recognizes the input and also incorporates
the new information of the new input by adapting the weights. When, on the other hand, a novel input is presented, i.e. one that does not satisfy the limiting conditions of the previous examples, the structure is adapted and the novel input is identified as the first representative of a new class. In this way ART is stable enough to preserve past learning but remains adaptable enough to incorporate new information when it appears.

44.8.2 Structure

ART networks consist of units that contain a weight vector of the same dimension as the input patterns. Each unit is meant to represent one class or cluster in the input patterns. The structure of the ART network is such that the number of units is larger than the expected number of classes. The units in excess are dummy units that can be taken into use when a new input pattern shows up that does not belong to any of the classes learned so far. There exist many different types of ART. The variant ART1 is the original Grossberg algorithm; it allows only binary input vectors. ART2 also allows continuous input, and it is this basic variant that we describe here. The structure of ART networks is hard to visualize; ART is in fact a theory that can better be explained by means of the sequence of steps that follows the strategy.

44.8.3 Training

Training of an ART network can be summarized as follows (a sketch in code is given after this list):
- Initialize the weights of all units (collected in a (p x c) matrix W) with a fixed value. The parameter p is the length of the weight vector and c is the total number of units. Usually the fixed value 1/√p is used for the initial weights, so that the length of each weight vector is scaled to unity.
- The input patterns are scaled to unit length.
- The first input vector is copied into the weight vector of the first unit, which now becomes an active unit.
- For the next input vector, x_i, the similarity, ρ_ik, with the weight vector, w_k, of each active unit k is calculated:

ρ_ik = x_i^T w_k
Because x_i as well as w_k are normalized, ρ_ik represents the cosine or correlation coefficient between the two vectors. In a variant of ART, fuzzy ART, a fuzzy similarity measure is used instead of the cosine similarity measure [14].
- The active unit with the highest similarity is declared the winning unit.
- The similarity between x_i and the winning unit is compared with a threshold value, ρ*, in the range from zero to one. When ρ_ik < ρ*, the input pattern, x_i, is not considered to fall into the existing class: a so-called novelty is detected and the input vector is copied into one of the unused dummy units. Otherwise the input pattern, x_i, is considered to fall into the existing class (to resonate with it). A large ρ* will result in many novelties, and thus in many small clusters; a small ρ* results in few novelties, and thus in a few large clusters.
- When the resonance step succeeds, the weight vector of the winning unit is changed. It adapts itself a little towards the new input pattern, x_i, belonging to the same class, according to:

w(t+1) = η x_i + (1 - η) w(t)
w(t+1) = w(t+1) / ||w(t+1)||

Usually the learning rate, η, is chosen between 0 and 1. In this step the network incorporates the new information present in the input object by moving the centroid of the class a little towards the new input pattern x_i. This step is intended to keep the network flexible when clusters are changing in time.
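The steps above can be sketched as follows. This is an illustration of the described strategy only (closer to a simple leader-type clustering than to the full ART2 algorithm); ρ* (rho) and η (eta) are free parameters:

```python
# Illustrative sketch of the unsupervised ART-like training loop described
# above: cosine similarity on unit-length vectors, a vigilance threshold rho,
# and a convex weight update followed by renormalization.
import numpy as np

def train_art(X, rho=0.9, eta=0.3):
    units = []                                   # weight vectors of active units
    for x in X:
        x = x / np.linalg.norm(x)                # scale input to unit length
        if not units:
            units.append(x.copy())               # first input starts first class
            continue
        sims = [w @ x for w in units]            # cosine similarity rho_ik
        k = int(np.argmax(sims))                 # winning unit
        if sims[k] < rho:
            units.append(x.copy())               # novelty: start a new class
        else:                                    # resonance: adapt the winner
            w = eta * x + (1 - eta) * units[k]
            units[k] = w / np.linalg.norm(w)
    return units
```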
44.8.4 Applications

From the previous discussion it is clear that ART networks are more suited to pattern recognition than to quantitative applications. Fig. 44.27 shows how the ART network functions for different values of ρ*.

Fig. 44.27. The influence of different values of ρ* on the classification performance of the ART network: (a) large ρ* value; (b) small ρ* value.

ART networks have been applied to
classify process control data [77], and for the classification of UV/VIS/NIR spectra and of airborne particles [78]. They have also been applied in QSAR [79]. There is, however, not yet much experience with the performance of these networks in real situations. The general experience is that the networks are very sensitive to noise, and that it is difficult to select a proper threshold value for the similarity measure. Moreover, experience shows that within one network different threshold values are in fact necessary for different classes. Strategies that also use an adaptive threshold (e.g. based on the different class properties) may be more successful.
References

1. W.S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys., 5 (1943) 115-133.
2. D.O. Hebb, The Organization of Behavior. Wiley, New York, 1949.
3. F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev., 65 (1958) 386-408.
4. B. Widrow and M.E. Hoff, Adaptive switching circuits. In: IRE WESCON Convention Record, New York, 1960, pp. 96-104.
5. N. Nilsson, Learning Machines. McGraw-Hill, New York, 1965.
6. M.L. Minsky and S.A. Papert, Perceptrons. MIT Press, Cambridge, MA, 1969.
7. D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning internal representations by error propagation. In: Parallel Distributed Processing, Explorations in the Microstructure of Cognition, Vol. 1: Foundations, D.E. Rumelhart and J.L. McClelland (eds.), MIT Press, Cambridge, MA, 1986, pp. 318-362.
8. P. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD Thesis, Harvard University, Cambridge, MA, 1974.
9. B. Wythoff, Backpropagation neural networks. Chemom. Intell. Lab. Syst., 18 (1993) 115-155.
10. J. Zupan and J. Gasteiger, Neural Networks for Chemists. An Introduction. VCH Verlagsgesellschaft, Weinheim, 1993.
11. J. Devillers (ed.), Neural Networks in QSAR and Drug Design. Academic Press, London, 1996.
12. R. Beale and T. Jackson, Neural Computing: An Introduction. Institute of Physics Publishing, Bristol, 1992.
13. J.L. McClelland and D.E. Rumelhart, Parallel Distributed Processing, Vol. 1. MIT Press, London, 1988.
14. J.A. Freeman and D.M. Skapura, Neural Networks, Algorithms, Applications and Programming Techniques. Addison-Wesley, Reading, MA, 1991.
15. R. Hecht-Nielsen, Neurocomputing. Addison-Wesley, Reading, MA, 1991.
16. P.D. Wasserman, Neural Computing: Theory and Practice. Van Nostrand Reinhold, New York, 1989.
17. J. Smits, W.J. Melssen, L.M.C. Buydens and G. Kateman, Using artificial neural networks for solving chemical problems. Chemom. Intell. Lab. Syst., 22 (1994) 165-189.
18. D. Svozil, Introduction to multi-layer feed-forward neural networks. Chemom. Intell. Lab. Syst., 39 (1997) 43-62.
19. W.J. Melssen and L.M.C. Buydens, Aspects of multi-layer feed-forward neural networks influencing the quality of the fit of univariate non-linear relationships. Anal. Proc., 32 (1995) 53-56.
20. E.P.P.A. Derks, Aspects of artificial neural networks and experimental noise. PhD thesis, University of Nijmegen, the Netherlands, Chapter 2, 1997.
21. R. Tibshirani, A comparison of some error estimates for neural network models. Neural Computation, 8 (1995) 152-163.
22. B. Walczak, Neural networks with robust backpropagation learning algorithm. Anal. Chim. Acta, 322 (1996) 21-30.
23. P. Deveka and L. Achenie, On the use of quasi-Newton based training of a feedforward neural network for time series forecasting. J. Intell. Fuzzy Syst., 3 (1995) 287-294.
24. M. Norgaard, Neural network based system identification toolbox. Technical report, Institute of Automation, Technical University of Denmark, 1995.
25. E.P.P.A. Derks, Aspects of artificial neural networks and experimental noise. PhD thesis, University of Nijmegen, the Netherlands, Chapter 3, 1997.
26. G. Castellano, A.M. Fanelli and M. Pelillo, An iterative pruning algorithm for feedforward neural networks. IEEE Trans. Neural Networks, 8 (1997) 519-531.
27. J. Zhang, J.-H. Jiang, P. Liu, Y.-Z. Liang and R.-Q. Yu, Multivariate nonlinear modelling of fluorescence data by neural network with hidden node pruning algorithm. Anal. Chim. Acta, 344 (1997) 29-40.
28. E.P.P.A. Derks, M.L.M. Beckers, W.J. Melssen and L.M.C. Buydens, A parallel cross-validation procedure for artificial neural networks. Computers Chem., 20 (1995) 439-448.
29. E.P.P.A. Derks, M.S. Sanchez Pastor and L.M.C. Buydens, A robustness analysis for MLF and RBF neural network models. Chemom. Intell. Lab. Syst., 34 (1996) 299-301.
30. P.J. Gemperline, Rugged spectroscopic calibration. Chemom. Intell. Lab. Syst., 39 (1997) 29-42.
31. G.J. Salter, M. Lazzari, L. Giansante, R. Goodacre, A. Jones, G. Surrichio, D.B. Kell and B. Bianchi, Determination of the geographical origin of Italian extra-virgin olive oil using pyrolysis-mass spectrometry and neural networks. J. Anal. Appl. Pyrol., 40 (1997) 159-170.
32. J.R.M. Smits, L.W. Breedveld, M.W.J. Derksen, G. Kateman, H.W. Balfoort, J. Snoek and J.W. Hofstraat, Pattern classification with artificial neural networks: classification of algae, based upon flow cytometry data. Anal. Chim. Acta, 258 (1992) 11-25.
33. H. Chan, A. Butler, D.M. Falk and M.S. Freund, Artificial neural network processing of stripping analysis responses for identifying and quantifying heavy metals in the presence of intermetallic compound formation. Anal. Chem., 69 (1997) 2373-2378.
34. C.W. McCarrick, D.T. Ohmer, L.A. Gilliland, P.A. Edwards and H.T. Mayfield, Fuel identification by neural network analysis of the responses of vapour-sensitive sensor arrays. Anal. Chem., 68 (1996) 4264-4269.
35. M. Glick and G.M. Hieftje, Classification of alloys with an artificial neural network and multivariate calibration of glow-discharge emission spectra. Appl. Spectrosc., 45 (1991) 1706-1716.
36. R. Goodacre, D.B. Kell and G. Bianchi, Rapid identification of species using pyrolysis mass spectrometry and artificial neural networks of Propionibacterium acnes isolated from dogs. J. Appl. Bacteriol., 76 (1994) 124-134.
37. W. Werther, H. Lohninger, F. Stancl and K. Varmuza, Classification of mass spectra. A comparison of yes/no classification methods for the recognition of simple structural properties. Chemom. Intell. Lab. Syst., 22 (1994) 63-67.
38. J.M. Andrews and S.H. Lieberman, Neural network approach to qualitative identification of fuels from laser induced fluorescence spectra. Anal. Chim. Acta, 285 (1994) 237-246.
39. B.J. Hare and J.H. Prestegard, Application of neural networks to automated assignment of NMR spectra of proteins. J. Biomolec. NMR, 4 (1994) 35-46.
40. T. Visser, H.J. Luinge and J.H. van der Maas, Recognition of visual characteristics of infrared spectra by artificial neural networks and partial least squares regression. Anal. Chim. Acta, 296 (1994).
41. H.J. Luinge, E.D. Leussink and T. Visser, Trace-level identity confirmation from infrared spectra by library searching and artificial neural networks. Anal. Chim. Acta, 345 (1997) 173-184.
42. C. Affolter and J.T. Clerc, Prediction of infrared spectra from chemical structures of organic compounds using neural networks. Chemom. Intell. Lab. Syst., 21 (1993) 151-157.
43. J.R.M. Smits, P. Schoenmakers, A. Stehmann, F. Sijstermans and G. Kateman, Interpretation of infrared spectra with modular neural-network systems. Chemom. Intell. Lab. Syst., 18 (1993) 27-39.
44. M.E. Munk, M.S. Madison and E.W. Robb, Neural network models for infrared spectrum interpretation. Microchim. Acta, 2 (1991) 505-524.
45. W. Wu and D.L. Massart, Artificial neural networks in classification of NIR spectral data: selection of the input. Chemom. Intell. Lab. Syst., 35 (1996) 127-135.
46. C. Hoskins and D.M. Himmelblau, Process control via artificial neural networks and reinforcement learning. Computers Chem. Eng., 16 (1992) 241-251.
47. C. Hoskins and D.M. Himmelblau, Fault diagnosis in complex chemical plants using artificial neural networks. AIChE J., 37 (1991) 137-141.
48. M. Bhat and T.J. McAvoy, Use of neural nets for dynamic modelling and control of chemical process systems. Computers Chem. Eng., 15 (1990) 573-578.
49. C. Puebla, Industrial process control of chemical reactions using spectroscopic data and neural networks. Chemom. Intell. Lab. Syst., 26 (1994) 27-35.
50. N. Qian and T.J. Sejnowski, Predicting the secondary structure of globular proteins using neural network models. J. Molec. Biol., 202 (1988) 568-584.
51. M. Vieth, A. Kolinski, J. Skolnick and A. Sikorski, Prediction of protein secondary structure by neural networks, encoding short and long range patterns of amino acid packing. Acta Biochim. Pol., 39 (1992) 369-392.
52. R. Wehrens and W.E. van der Linden, Calibration of an array of voltammetric microelectrodes. Anal. Chim. Acta, 334 (1996) 93-100.
53. J.R.M. Smits, W.J. Melssen, G.J. Daalmans and G. Kateman, Using molecular representations in combination with neural networks. A case study: prediction of the HPLC retention index. Computers Chem., 18 (1994) 157-172.
54. A. Bos, M. Bos and W.E. van der Linden, Artificial neural networks as a multivariate calibration tool: modeling the iron-chromium-nickel system in X-ray fluorescence spectra. Anal. Chim. Acta, 277 (1993) 289-295.
55. M.N. Taib and R. Narayanaswamy, Multichannel calibration technique for optical-fibre chemical sensor using artificial neural network. Sensors Actuators, B39 (1997) 365-370.
56. H.M. Wei, L.S. Wang, B.G. Zhang, C.J. Liu and J.X. Feng, An application of artificial neural networks. Simultaneous determination of the concentration of sulfur dioxide and relative humidity with a single coated piezoelectric crystal. Anal. Chem., 69 (1997) 699-702.
57. H.J. Miao, M.H. Yu and S.X. Hu, Artificial neural networks aided deconvolving overlapped peaks in chromatograms. J. Chromatogr. A, 749 (1996) 5-11.
58. R.M. Lopes Marques, P.J. Schoenmakers, C.B. Lucasius and L.M.C. Buydens, Modelling chromatographic behaviour as a function of pH and solvent composition in RPLC. Chromatographia, 36 (1993) 83-95.
59. A.P. de Weijer, L.M.C. Buydens, G. Kateman and H.M. Heuvel, Neural networks used as a soft modelling technique for quantitative description of the inner relation between physical properties and mechanical properties of poly(ethylene terephthalate) yarns. Chemom. Intell. Lab. Syst., 16 (1992) 77-82.
60. E.P.P.A. Derks, B.A. Pauly, J. Jonkers, E.A.H. Timmermans and L.M.C. Buydens, Adaptive noise cancellation on inductively coupled plasma spectroscopy. Chemom. Intell. Lab. Syst., (1998) in press.
61. A. Cichocki and R. Unbehauen, Neural Networks for Optimization and Signal Processing. Wiley, New York, 1993.
62. J. Park and I.W. Sandberg, Universal approximation using radial basis function networks. Neural Computation, 3 (1991) 246-257.
63. J. Moody and C.J. Darken, Fast learning in networks of locally-tuned processing units. Neural Computation, 1 (1989) 281-294.
64. B. Carse and T.C. Fogarty, Fast evolutionary learning of minimal radial basis function neural networks using a genetic algorithm. Lecture Notes in Computer Science, 1143 (1996) 1-22.
65. L. Kiernan, J.D. Mason and K. Warwick, Robust initialisation of Gaussian radial basis function networks using partitioned k-means clustering. Electron. Lett., 32 (1996) 671-672.
66. W. Luo, M.N. Karim, A.J. Morris and E.B. Martin, Control relevant identification of a pH waste water neutralisation process using adaptive radial basis function networks. Computers Chem. Eng., 20 (1996) S1017.
67. B. Walczak and D.L. Massart, Application of radial basis functions — partial least squares to non-linear pattern recognition problems: diagnosis of process faults. Anal. Chim. Acta, 331 (1996) 187-193.
68. B. Walczak and D.L. Massart, The radial basis functions — partial least squares approach as a flexible non-linear regression technique. Anal. Chim. Acta, 331 (1996) 177-185.
69. T. Kohonen, Self-Organization and Associative Memory. Springer-Verlag, Heidelberg, 1989.
70. W.J. Melssen, J.R.M. Smits, L.M.C. Buydens and G. Kateman, Using artificial neural networks for solving chemical problems. II. Kohonen self-organizing feature maps and Hopfield networks. Chemom. Intell. Lab. Syst., 23 (1994) 267-291.
71. W.J. Melssen, J.R.M. Smits, G.H. Rolf and G. Kateman, Two-dimensional mapping of IR spectra using a parallel implemented self-organising feature map. Chemom. Intell. Lab. Syst., 18 (1993) 195-204.
72. S. Anzali, G. Barnickel, M. Krug, J. Sadowski, M. Wagener, J. Gasteiger and J. Polanski, The comparison of geometric and electronic properties of molecular surfaces by neural networks: application to the analysis of corticosteroid-binding globulin activity of steroids. J. Computer-Aided Molec. Design, 10 (1996) 521-534.
73. X.H. Song and P.K. Hopke, Kohonen neural network as a pattern-recognition method based on weight interpretation. Anal. Chim. Acta, 334 (1996) 57-66.
74. M. Walkenstein, H. Hutter, C. Mittermayr, W. Schiesser and M. Grasserbauer, Classification of SIMS images using a Kohonen network. Anal. Chem., 69 (1997) 777-782.
75. R. Goodacre, J. Pygall and D.B. Kell, Plant seed classification using pyrolysis mass spectrometry with unsupervised learning: the application of auto-associative and Kohonen artificial neural networks. Chemom. Intell. Lab. Syst., 33 (1996) 69-83.
76. J. Zupan, M. Novic and I. Ruisanchez, Kohonen and counterpropagation networks in analytical chemistry. Chemom. Intell. Lab. Syst., 38 (1997) 1-23.
77. D. Wienke and L.M.C. Buydens, Adaptive resonance theory neural networks — the ART of real-time pattern recognition in chemical process monitoring? Trends Anal. Chem., 99 (1995) 1-8.
78. Y. Xie, P.K. Hopke and D. Wienke, Airborne particle classification with a combination of chemical composition and shape index utilizing an adaptive resonance artificial neural network. Environ. Sci. Technol., 28 (1994) 1399-1407.
79. D. Domine, D. Wienke, J. Devillers and L.M.C. Buydens, A new nonlinear neural mapping technique for visual exploration of QSAR data. In: Neural Networks in QSAR and Drug Design, J. Devillers (ed.), Academic Press, London, 1996, pp. 223-253.
80. S.H. Barondes, Geestesziekten en Moleculen [Molecules and Mental Illness]. De Wetenschappelijke Bibliotheek van Natuur & Techniek, 1993.
Subject Index a priori probability, 221 absorption, 449 abstract chromatogram, 247, 266 abstract factor, 243, 245 abstract spectrum, 247 ACE, 235 activation space, 682 ADALINE, 650 adaptive Kalman filter, 576, 598, 599 adaptive resonance theory network, 692 additivity, 393, 397 administration, way of, 449 aerosol particles, 60 agglomerative hierarchical classification, 75, 79 aliasing, 524 algebraic set, 9 ALLOC, 227 alpha-phase, 469, 481 alternating least squares, 278, 296 alternating regression, 278, 296 analog filter, 537 analytical laboratory management, 617 analyzing wavelet, 566 angular distance, 11, 47 ANOVA, 129,384,395 antibiotics, 452 anti-symmetric function, 511 apodization, 527 approximation coefficient, 568 area under the curve, 457,493 area under the moment curve, 495 area under the second moment, 497 ART network, 649 artefact, 156 artificial intelligence, 627 artificial neural network, 649 artificial neuron, 233 autoscaling, 64, 122
average linkage, 69, 70, 72 axis of inertia, 106 axis of symmetry, 104
back-propagation learning rule, 662, 671 back-propagation network, 662 backrotation, 56 backtracking, 636 backward chaining, 634 backward Fourier transform, 516 bacteriostatic activity, 394 basis function, 681 basis vector, 9, 14, 91 Bayes equation, 221 Bayesian approach, 221 Bayesian probability theory, 640 beta-phase, 463,481,494 between-class variance, 216 bias, 654 bidiagonal matrix, 136 bilinear model, 390 binary programming, 609 binary variables, 65 binding affinity, 402 binding assay, 411 binding capacity, 453 bioavailability, 457, 466,484 bio-equivalence, 467 biological activity, 149, 383 biotransformation, 449 biplot, 112,153,183,187, 190, 203,405,408, 413 bipolar axis, 113, 188 bivariate distribution, 212 bivariate normal distribution, 211 bootstrap, 371 busy time, 616
702 calibration, 115 canonical analysis, 320 canonical axis, 409 canonical correlation, 319 canonical correlation analysis, 409 canonical variable, 319 canonical variate, 213, 319 canonical variate model, 408 canonical weight, 319 Cartesian coordinate space, 9 Cartesian diagram, 112 catenary model, 452, 487 centered matrix, 45 centering, 43 centre of mass, 42 centroid, 77, 116, 120 centrotypes, 78 certainty factor, 640 chalcone, 116 characteristic equation, 31, 490 characteristic root, 31 characteristic vector, 33 child frame, 637 Chinese teas, 60 chi-square distance, 147 chi-square statistic, 166, 182 chromatography, 118, 622 circulating protein, 456 city-block distance, 66, 149 city-block metric, 152 class envelope, 212 class membership, 408 classical least squares, 352, 353 classical metric scaling, 149 classification, 57, 669 — agglomerative hierarchical, 75, 79 — divisive hierarchical, 75 — fuzzy clustering, 80-82 — hierarchical, 57, 69 — /:-nearest neighbour method, 208, 223 — non-hierarchical, 58 — non-hierarchical, Forgy's method, 77 — and regression trees, 227 classification rule, 207, 208
classification tree, 227 CLASSY, 232 clearance, 452,459,481, 484 clinical chemistry, 208, 213 clinical laboratory, 609 closure, 87, 167 cluster analysis, 57, 156, 384, 397,416 clustering, 57, 207 — algorithm, 69 — average linkage, 69,70, 72 — complete linkage, 69, 70, 72 — density method, 209, 211, 213, 225 — non-hierarchical, 59, 83 — seed points, 77 — single linkage, 398 — tendency, 82 — Ward's method, 71,72 column space, 246 column-centered biplot, 120 column-centered matrix, 43 column-closure, 168 column-eigenvector, 92 column-mean, 42 column-orthogonal, 21 column-orthonormal, 21 column-pattern, 16, 104 column-principal component, 97 column-profile, 168 column-singular vector, 91 column-standard deviation, 46 column-standardization, 122 column-standardized biplot, 123 column-standardized matrix, 47 column-sum, 165 column-variable, 87 combinatorial chemistry, 59, 65 commutative, 20 comparative molecular field analysis, 385, 410 compartment, 451 compartmental analysis, 451 competitive inhibition, 503 complete linkage, 69, 70, 72 compositional data, 130
703 compositional table, 87 concentration-time curve, 457 confirmatory data analysis, 46 conflict resolution, 634, 636 conformation, 387 conjugated biplot, 409 connected graphs, 73 connection pattern, 652 connectivity indices, 392 consensus, 313 contingency table, 3, 7, 161 continuous signal, 513 continuum power-PLS, 345 continuum regression, 367 contrast, 113, 115, 127, 180, 181, 203, 204, 405 controlled calibration, 351 convolution, 489, 530, 533 convolution function, 530 convolution theorem, 533 Cook's distance, 374 Coomans plot, 231 coordinate space, 9 core matrix, 154 correlation, 63, 401 correlation coefficient, 14, 62, 65, 111 correlation matrix, 49 correlation-PCR, 364 correspondence, 182 correspondence factor analysis, 130, 174, 182, 405 cosine rule, 12, 111 cost-benefit analysis, 237 counting map, 690 Craig plot, 397 credibility function, 640 cross-product matrice, 48 cross-tabulation, 7, 88, 147 cross-validation, 144, 229, 236, 369, 416 — internal, 238 — k-Md, 238 curve peeling, 465, 491, 501 curve resolution, 243, 260, 267 curvilinear trajectories, 152 cut-off frequency, 548
dance step diagram, 135 DASCO, 232 data analysis, 14, 30, 40 data compression, 550 data-scope, 281 Daubechies wavelet, 566 de novo approach, 391, 393 decision-making process, 628 decision tree, 416, 629 deconvolution, 490, 510, 553, 556 deflated cross-product matrix, 138 deflated matrix, 35 deflation, 138 degeneracy, 40 delta rule, 656 dendrogram, 57, 69,70-72, 83 density method, 209, 211, 213, 225 derivative of a signal, 550 descriptive linear discriminant analysis, 220 detail coefficient, 568 determinant, 31, 34, 39, 490 determinant, minor of, 491 deterministic chaos, 451 deviations — of column-profiles, 169 — of double-closed data, 169 — of row-closed profile, 168 diagonal matrix, 19 diagonalization, 34, 93, 139, 140 differential equation, 462 differential equations, system of, 477 digital filter, 537 dilation, 566 dimension, 8 dimension reduction, 107 direct standardization, 377 directed graphs, 623 discrete event simulation, 618 discrete Fourier transform, 519 discrete wavelet transform, 567 distributivity, 529 discriminant function, 216, 217 discriminating power, 237 discrimination, 208
704 disjoint class modelling, 208 dispersion matrix, 31 dissimilarity plot, 295 dissociation constant, 387 distance, 43, 45, 47, 60, 108, 176-178 — between two points, 11 — of chi-square, 133, 175 — city block, 66, 149 — Cook, 374 — Euclidean,60, 62,63, 67, 108, 146,230, 231 — from the origin, 11 — function, 152 — generalized, 61 — Hamming, 66, 147 — Mahalanobis, 61, 62, 65, 147, 220, 221, 228, 274 — Manhattan, 67, 147 — Minkowski, 67 — matrix, 68 — metrics, 150 — Pythagorean, 146 — standardized Euclidean, 61, 62, 64 — taxi, 147 distribution, 449 — volume of, 456 distributional equivalence, 193 D-optimality, 40 double-closure, 130, 169 drift, 593, 598 drug-receptor complex, 385 drug-receptor interaction, 402 drug-test specificity, 397 dual representation, 28 dual space, 174 duality, 16 duo-trio test, 422 dynamic programming, 605 earning machine, 416 edges, 621 eigenfunction, 486 eigenstructure tracking, 280 eigenvalue, 31, 186, 486, 490, 492
eigenvalue decomposition, 33, 92, 148, 183 eigenvector, 33, 228, 230 eigenvector projection, 55 electron acceptor, 127 electron donor, 127 electrostatic interaction, 386 elimination pool, 455 ellipsoid, 39 enthalpy, 386 entropy, 386 environmental applications, 232 enzyme, 383, 411 epoch, 673 equilibrium constant, 386 equiprobability envelope, 104, 107 error eigenvector, 143 Euclidean distance, 60, 62, 63, 67, 108, 146, 230,231 euroleptic effect, 400 evolution program, 375 evolutionary method, 251 evolutionary rank analysis, 274 evolving factor analysis, 274 evolving principal components analysis, 274 exclusive or (XOR) problem, 659 excretion, 449, 452 exhaustion, 136 expected value, 147 experimental design, 40 expert system, 4, 627 — if...then... rules, 631 — inductive, 227 — inference engine, 629, 633 — inheritance, 637 — shell, 630, 641 explanation facility, 640 exploratory analysis, 46, 167 exponential function, 203, 462 exponential smoothing, 544 extended matrix, 155 extrathermodynamic method, 384 extravascular administration, 461,469 factor, 244, 319
705 factor analysis, 243, 244, 302, 397 — evolving, 274 — fixed-size window evolving, 278 — full-rank method, 251 — iterative target transformation, 268 — key-set, 297 — local rank methods, 274 — residual bilinearization, 300 — target transformation, 256 factor loading, 244 factor rotation, 252 factor scaling coefficient, 95, 150, 188 factor space, 95, 99 factorization, 95 fast Fourier transform, 530 feature map, 691 feature reduction, 215, 236 feature selection, 207, 236, 375 field and resonance parameters, 392 filter function, 540, 547 filter transfer function, 548 filtering, 529, 540, 547, 549 — cut-off frequency, 548 — low-pass, 547, 549 — high-pass, 547 — inverse, 553, 555 finite progression, 474 fixed-size window evolving factor analysis, 278 folding, 524 food authentication, 232 Forgy's method, 78 Forgy's non-hierarchical classification method, 77 forward chaining, 634 forward chaining inference engine, 636 forward Fourier transform, 516 Fourier coefficient, 236, 515 — imaginery, 516 — real, 516 Fourier filtering, 373 Fourier transform, 236, 478, 507, 510, 513, 550 — discrete, 519 — inverse, 516 — pair, 517
— time-frequency, 564 fractal geometry, 451 frame, 14, 633 free choice profiling, 436 Free-Wilson analysis, 384, 393 frequency domain, 509, 510, 547 fuel spill, 232 full-rank method, 251 fuzzy clustering, 80, 81, 82 fuzzy-set theory, 640 gain vector, 578, 579 game theory, 605 gamma-method, 490,491 generalized column-eigenvectors, 186 generalized distance, 61 generalized eigenvalue decomposition, 185 generalized inverse, 38 generalized least squares, 356 generalized loading, 188 generalized Procrustes analysis, 317, 434 generalized rank annihilation factor analysis, 298 generalized row-eigenvectors, 185 generalized score, 188 generalized singular value decomposition, 183 generalized standard addition method, 367, 368 genetic algorithm, 78, 625 geometrical structure, 46 Gibbs' free energy, 385 global interaction, 133 global sum, 165 global sum of squares, 94 global weighted mean, 178 Golub-Reinsch, 134 goodness of fit, 182 Cower's similarity index, 148 GRAFA, 298 graph, 73, 621 graph theory, 79, 605 graphical determination, 480 graphical solution, 463 GSAM, 367, 368
706 Guttman effect, 199 Haar wavelet, 566 Hadamard transform, 562, 564 half-life time, 456 Hammett electronic parameter, 387 Hammett equation, 387 Hamming distance, 66, 147 hard delimiter, 655 Hansch analysis, 384 Hansch equation, 388 Hansch model, 410 HELP, 280 heterogeneity, 124 heterogeneous data, 133, 205 heterogeneous table, 182 heteroscedasticity, 64, 503 heuristic evolving latent projection, 280 heuristics, 628 hierarchical classification, 57 hierarchical methods, 69 high-pass filter, 547 H-matrix, 568 HOMALS program, 150 homogeneity, 150 homogeneous data, 133, 205 homogeneous linear equation, 34 homogeneous table, 87, 182 Hopkin's statistic, 82 Hotelling T^-distribution, 228 Householder-QR algorithm, 134 Householder transformation, 136 hybrid constants, 486 hybrid transport constant, 479 hydrogen bonding, 386, 395 hydrophobic interaction, 386, 388 hydrophobicity, 388 ICso value, 412 identity matrix, 19 ill-conditioned problem, 469 image, 52 image analysis, 153 imaginary Fourier coefficient, 516
incomplete absorption, 469 indicator table, 161 indicator variable, 392 indirect QSAR, 412 inertia, 107, 113 inference engine, 629, 633 inheritance, 637 inhibition constant, 403 integer programming, 605, 609 integration by parts, 498 interaction, 128, 175, 179, 182, 196 interaction module, 629, 640 interferogram, 509 internal cross-validation method, 238 intravenous administration, 455,476,491 — repeated, 473 inverse calibration, 352 inverse filtering, 553, 555 inverse Fourier transform, 516 inverse Laplace transform, 478, 489 iterative target transformation factor analysis, 268 ITTFA, 268 Jaccard similarity coefficient, 65 jackknife method, 238 Jacobi algorithm, 134 Kalman filter, 575, 576, 594, 596 — innovation, 578, 599 K-center, 78 K-centroid methods, 78 K-clustering, 77, 83, 84 Kennard-Stone algorithm, 372, 378 kernel, 80 kernel function, 681 kernel method, 225 key-set factor analysis, 297 kinetics, 592, 596 kinetics model, 576 k-nearest neighbour method, 208, 223 knowledge acquisition, 643 knowledge base, 629 knowledge engineer, 644
707 knowledge environment, 642 knowledge representation, 630 Kohonen learning rule, 687 Kohonen mapping, 82 Kohonen network, 649, 687 Kronecker's delta, 152 Kruskal's algorithm, 74 LI-norm, 66, 152 L2-norm, 67 laboratory management, 610 laboratory simulator, 621 Lagrange multipliers, 93 Laplace domain, 488 Laplace transform, 477,491 — inverse, 478, 489 latent value, 95 latent variable, 50, 212, 215, 228, 319 latent vector, 95, 182, 183 lead compounds, 384 learning object, 207 learning rate, 673 learning rule, 652, 656, 670 least-squares criterion, 53 least-squares linear regression, 503 leave-one-out procedure, 238 leverage, 374 limiting solution, 463 line of closest fit, 104, 106 linear algebra, 134 linear combination, 91, 96 linear discriminant analysis, 84, 208, 209, 212, 213,236,408 linear free energy relationship, 384 linear independence, 27, 32 linear learning machine, 234, 653 linear programming, 605 linear regression, 387, 388, 390, 460, 468 linear transfer function, 666 linearity, 500 linearization, 502 linearly independent, 8 Lineweaver-Burk form, 502 lipophilicity, 388
loading, 55 loading matrix, 155 loading plot, 99 locally weighted regression, 378 location, 174 location model, 78 lock and key paradigm, 384 log column-centered biplot, 124 log column-centering, 123 log double-centering, 125 log ratio, 404 logarithmic analysis, 129 logarithmic function, 124 logarithmic transformation, 64 logarithms, 133 log-bilinear model, 129, 201 log-linear model, 201 lumping, 451 MacQueen's K-means method, 78 Mahalanobis distance, 61, 62, 65, 147, 220, 221,228,374 majority rule, 223 Malinowski's F-ratio, 144 mammillary model, 452 Manhattan distance, 67,147 manifest variable, 50 marginal sum, 131, 165, 166, 405 MARS, 235 MASLOC, 78, 83 mass, 174 mass balance differential equation, 451 mass balance equation, 470, 476 mass weight, 106 matching coefficient, 65, 66 mathematical programming, 609 matrix, 7, 15 — inner product, 10, 20 — outer product, 25, 43 matrix addition, 43 matrix multiplication, 20 matrix-by-vector product, 23 maximal complete subgraph, 79 maximum entropy method, 555, 558
maximum likelihood method, 555, 557
mean absorption time, 496
measure of belief, 640
measure of disbelief, 640
measure of (dis)similarity, 60, 65
measurement equation, 577
measurement model, 576
measurement table, 7, 87
mechanical analogy, 128
membrane, 456
metabolism, 452
metabolite, 450
meta-knowledge, 629, 631
meteorites, 57
metric matrix, 171
metric MDS, 428
Michaelis constant, 453
Michaelis-Menten equation, 502
Michaelis-Menten kinetics, 453
minimal path, 622
minimal spanning tree, 73, 74
Minkowski distance, 67
mint species, 60
mixed variables, 67
mixture design, 444
mode analysis, 79
modelling power, 237
molar refractivity, 392
molecular field, 410
momentum, 673
monitoring set, 675
Moore-Penrose inverse, 38
Morlet wavelet, 566
moving average, 538
multicompartment model, 487
multicomponent analysis, 575
multicriteria decision making, 426, 605
multidimensional scaling, 149, 427
multidimensional space, 16
multinormal, 104
multiple inheritance, 637
multiple linear regression, 53
multiplicative scatter correction, 373
multivariate bioassay, 397
multivariate calibration, 60, 239, 349
multivariate normal distribution, 40, 212, 221, 228
multivariate quality control, 232
multivariate regression, 323
multivariate statistics, 4
multi-way contingency table, 165
multi-way data table, 2
MYCIN, 640
natural calibration, 352
near infrared spectra, 232
nearest neighbour method, 209, 223
nearest neighbours, 213
needle search, 271
network, 621
neural network, 82, 149, 209, 213, 225, 233, 378, 385, 416, 649
— application of, 680
— ART, 649
— back-propagation learning rule, 662, 671
— backtracking, 636
— backward chaining, 634
— hidden layer, 660, 662
— hidden unit, 660, 677
— input layer, 662
— multi-layer-feed-forward network, 649
— output layer, 662
— performance behaviour of, 674, 677
— performance criterion, 680
— radial basis function, 649, 670, 681
— validation of, 679
neuron, 650
NIPALS, 134, 139, 332
NIPALS algorithm, 40, 107
NIR spectra, 213
node, 621
noise, 143, 535
nominal scale, 161
non-compartmental analysis, 493, 500
non-hierarchical classification, 58
non-hierarchical clustering, 59, 83
non-hierarchical methods, 76
non-linear Hansch model, 389
non-linear mapping, 430
non-linear model, 502
non-linear PCA, 150
non-linear PLS, 378
non-linear programming, 609
non-linear regression, 390, 486, 491
non-linear transformation, 149
non-metric MDS, 429
non-parametric regression, 505
non-singular, 27
norm, 11, 111
normal probability distribution, 213
normalization, 12, 138, 167
normalized Hamming distance, 66
normalized standard error, 674
null matrix, 19
null vector, 9
numerical integration, 494
numerical taxonomy, 58
Nyquist frequency, 521, 524
object-attribute-value triplets, 619, 632
object-oriented programming technique, 638
one-compartment open model, 455, 473, 474, 495
operations research, 4, 605
optimization, 622
ordinal variable, 66
orthogonal decomposition, 138
orthogonal projection, 55, 295
orthogonal projection operator, 54
orthogonal rotation, 55, 108, 252
orthogonal rotation matrix, 255
orthogonal vector, 12
orthonormal vector, 14
outlier, 239, 374
outlier test, 210, 232
output-activity map, 690
overtraining, 675
paired comparison test, 425
parabolic model, 389
PARAFAC model, 156, 301
paralysed network, 676
parent frame, 637
partial least squares, 232, 331
partial least squares analysis, 408
partial least squares model, 409
partial least squares regression, 366
partition coefficient, 388
partitioning-optimization techniques, 78
pattern recognition, 397
— supervised, 207, 652
— unsupervised, 207, 687
peak concentration, 467
peak purity, 301
perceptron, 650
periodicity, 527
pharmaceutical identification, 232
pharmaceutical preparations, 223
pharmacodynamics, 450
pharmacokinetic effect, 411
pharmacokinetics, 449
pharmacological activity, 412
pharmacophore, 384, 396
phase, 528
phase spectrum, 529
phylogenetic application, 157
physical dimension, 169
physicochemical parameter, 393
physicochemical property, 387
piecewise direct standardization, 377
plane of closest fit, 106
PLS
— continuum power, 345
— two-block, 411
PLS1, 233
PLS2, 233
point-spread function, 531, 532, 554
poles of specificity, 404
polynomial smoothing, 542
pooled variance-covariance matrix, 217, 219, 221
positive definite, 30
positive semi-definite, 30
potential density function, 80
potential function, 225, 226
potential method, 80, 81, 225
powering algorithm, 134, 138
power spectrum, 517, 536
precisions, 196
prediction ability, 238, 239
preprocessing, 48, 112, 115, 201
PRESS, 368
principal axis, 104
principal component, 215, 228
principal components analysis, 57, 88, 192, 201, 384, 398
principal components regression, 329, 358
principal coordinates analysis, 146, 428
principal covariates regression, 367
principal diagonal, 19
prior probabilities, 221
probability, a priori, 221
probability density function, 495
Procrustes analysis, 310, 411
projection, 51, 52, 151
projection to latent structure, 332
pruning, 678
pseudo-deconvolution, 555
pseudodiagonals, 145
pure variable, 286
purest column, 251
purest row, 251
purity spectrum, 292
pyramidal algorithm, 571
Pythagorean distance, 146
QDA, 228
Q-mode, 88
QR transformation, 136, 140
quadratic discriminant analysis, 209, 210, 220, 228
quadratic programming, 609
quantitation limit, 460
quantitative descriptive analysis, 431
quantitative structure-activity relationships, 383
queue length, 614
queueing
— idle time, 616
— M/M/1 system, 611
— utilization factor, 611
queuing theory, 605, 609, 610
radial basis function network, 649, 670, 681
RAFA, 298
random calibration, 352
random noise, 485
random variation, 28
range scaling, 64, 67
rank, 27, 170, 196, 202
rank annihilation factor analysis, 298
rate-limited exchange, 453
rational drug design, 385
RBL, 300
real Fourier coefficient, 516
receptor, 383, 396, 402, 411
reciprocal averaging, 182
recognition ability, 238, 239
reconstructed data matrix, 102
reconstruction, 100
recursive multicomponent analysis, 585
recursive regression, 575, 577
reduced rank regression, 324, 367
redundancy analysis, 324
reflected discriminant analysis, 236
regression, 64, 220, 238
regression line, 106
regularised discriminant analysis, 223
relative contribution, 103, 192
repeated intravenous administration, 473
representative objects, 78
representative samples, 60
repulsion, 128
resampling, 238
reservoir, 472
residual data matrix, 102, 136
residual error, 155
residual error sum of squares, 145
residual matrix, 35
residual variance, 142
resolution, 520, 521, 526
response surface, 397
response surface methodology, 444
retention time, 116
ridge regression, 367
right-singular vector, 91
R-mode, 88
RMSPE, 368
robust clusters, 83
robust regression, 504
rotation, 55
rotation angle, 251, 253
rotation matrix, 254, 286
rotation, Varimax, 254
row simplicity, 254
row space, 246
row-centered matrix, 43
row-closure, 168
row-eigenvector, 92
row-mean, 43
row-orthogonal, 21
row-orthonormal, 21
row-pattern, 16, 104
row-principal component, 96
rows-by-columns product, 20
row-singular vector, 91
row-standardized matrix, 47
row-sum, 165
row-variable, 87
rule-based scheme, 631
rule set, 632
sampling, 500, 524
saturated RC association model, 129, 201
scalar multiplication, 9
scalar product, 10, 11, 51
scaling, 64, 67, 529
score, 55, 213
score plot, 97, 157
Scree-plot, 142
selecting clusters, 82
selectivity, 237
self-organizing map, 687
semi-axis, 40
semilogarithmic diagram, 481
semilogarithmic plot, 457, 468, 501
sensitivity, 237, 413
sensory analysis, 421
Shannon entropy, 558
shift, 528
shortest path problem, 621
sigmoidal transfer function, 655, 662
signal, derivative of, 550
signal domains, 507
signal enhancement, 510, 535, 536, 547
signal processing, 507, 509, 535
signal restoration, 510, 535
signal-to-noise ratio, 509, 535
SIMCA, 212, 225, 228, 237
similarity, 60, 148
similarity coefficient, 60
similarity index, Gower's, 148
similarity matrix, 68
simple inheritance, 637
simple neural network, 228
Simplex method, 608
Simplex optimization, 384, 397
Simplisma, 292
SIMPLS, 340, 367
SIMULA, 621
simulated annealing, 78
simulation, 605, 610
simulation model, 619
single linkage, 69, 70, 71, 72
singular value, 186
singular value decomposition, 40, 89, 183
singular vector decomposition, 202
sinusoidal transfer function, 670
size component, 119
slit function, 530
smoothing, 538, 549
— polynomial, 542
— Savitzky-Golay, 373
smoothing parameter, 226, 227
soft independent modelling of class analogy, 228
solubility, 449
solvent, 60, 74, 75
specificity, 413
spectral decomposition, 34, 92
spectral map, 129, 404
spectral map analysis, 129, 201, 402
speed of release, 449
spread, 174
stability-plasticity dilemma, 692
stacked histogram, 168
standard deviation of the retention time, 498
standard deviation spectrum, 292
standard normal variates, 373
standardized Euclidean distance, 61, 62, 64
state vector, 585
state-space model, 576, 595
statistical moment, 498
statistical moment theory, 493, 501
steady-state plasma concentration, 471
steady-state solution, 475
steady-state volume of distribution, 496
stepwise variable selection, 236
steric bulk, 392
steric conformation, 385
STERIMOL variable, 392
stochastic distribution, 498
STRESS, 429
stretched coordinate axis, 171
structural eigenvector, 143
structural features, 48
structural information, 140
structure correlation, 321
subspace, 9, 136, 192
substituent parameter, 398
substitution constant, 393
sum matrix, 19
sum vector, 42
supervised learning, 207, 652
supervised pattern recognition, 207
Swain and Lupton parameter, 391
symmetric function, 511
symmetry, 527
system equation, 589, 592, 593
system model, 576
system noise, 591
system transition matrix, 591
systematic information, 140
systems equation, 577
Tanimoto coefficient, 65, 66
Tanimoto similarity, 65
target, 256
target transformation factor analysis, 256
taxi-distance, 147
taxonomy, 157
test set, 238, 239
therapeutic activity, 383
thermometer plot, 198
three-dimensional conformation, 416
three-distance clustering method, 76
three-factor system, 267
three-way analysis, 153
three-way data, 153
three-way data table, 2
three-way matrix, 301
threshold logic, 655
threshold step function, 655
time averaging, 538
time domain, 507, 509, 510, 536
time of appearance of the maximum, 467
time-frequency Fourier transform, 564
time-intensity curves, 441
time-invariant system, 585, 595
tolerance, 135, 139
top-down PCR, 364
topology preserving mapping technique, 692
total least squares, 367
toxicity testing, 59
trace, 22, 49
trace elements, 89
training object, 233
training objects, 207
trajectory, 153
transfer constant, 451
transfer function, 234, 651, 655
transfer function, role of, 665
transfer of calibration models, 376
transform kernel, 517
transformation, 43
transition formula, 100, 185
translation, 43, 45, 566
transport, 450, 451
triangle relationship, 12
triangle test, 421
tridiagonal matrix, 140
trilinear model, 156
true factors, 243
TTFA, 256
Tucker3 model, 154
two-block PLS, 411
two-compartment catenary model, 461, 469
two-compartment mammillary model, 476, 491
two-component analysis, 586
two-factor system, 260
two-layer network, 234
two-way table, 1, 7, 88, 153
uncertainty, 639
UNEQ, 210, 212, 228
unfolding, 153
unipolar axis, 112, 150, 188
unit vector, 9
unsupervised learning, 397, 652
unsupervised pattern recognition, 207, 687
utilization factor, 611
validation, 141, 157, 207, 238
— of neural networks, 679
van der Waals radius, 386
VARDIA, 286
variable space, 246
variance diagram, 286, 289
variance inflation factor, 374
variance of the residence time, 497
variance-covariance matrix, 49, 578
Varimax rotation, 254
varivector, 255
vector, 8
vector addition, 9
vector product, 25
vector space, 8
vector-by-matrix product, 23
volume of distribution, 456
Ward's method, 71, 72
wavelength domain, 507
wavelength selection, 589
wavelet
— Daubechies, 566
— Haar, 566
— Morlet, 566
wavelet basis, 566
wavelet coefficient, 236
wavelet filter coefficient, 567
wavelet packet transform, 571
wavelet transform, 566
— coefficient, 566
— discrete, 567
Weber-Fechner's psychophysical law, 386
weight, 652
weight coefficient, 131, 173, 201
weighted cross-products matrix, 186
weighted distance, 116, 171
weighted Euclidean distance, 61, 62
weighted Euclidean metric, 170
weighted linear regression, 503
weighted mean, 173
weighted norm, 171
weighted scalar product, 170
weighted sum, 131
weighted sum of squares, 131, 173
weighted variance, 173
weighting function, 495
white noise, 535
Wilks' lambda, 237
wind direction, 89
within-class variance, 216
within-variance, 150
xenobiotics, 450
Young-Householder factorization, 38
z-transform, 64
zero filling, 526, 527