Laplace distribution (: double exponential distribution) Continuous distribution with location parameter a and scale parameter b.
F(x) = 0.5 exp[(x − a)/b] if x ≤ a
F(x) = 1 − 0.5 exp[−(x − a)/b] if x ≥ a
f(x) = exp[−|x − a|/b] / (2b)
Range: −∞ < x < ∞    b > 0
Mean: a    Variance: 2b²
logistic distribution Continuous distribution with location parameter a and scale parameter b.
F(x) = 1 − {1 + exp[(x − a)/b]}⁻¹ = {1 + exp[−(x − a)/b]}⁻¹
f(x) = exp[−(x − a)/b] b⁻¹ {1 + exp[−(x − a)/b]}⁻² = exp[(x − a)/b] b⁻¹ {1 + exp[(x − a)/b]}⁻²
S(x) = {1 + exp[(x − a)/b]}⁻¹
h(x) = {b(1 + exp[−(x − a)/b])}⁻¹
Range: −∞ < x < ∞    b > 0
Mean: a    Variance: (πb)²/3

power function distribution Continuous distribution with scale parameter b and shape parameter c.
Mean: bc/(c + 1)    Variance: b²c/[(c + 2)(c + 1)²]
Rayleigh distribution Continuous distribution with scale parameter b.
F(x) = 1 − exp[−x²/(2b²)]
f(x) = (x/b²) exp[−x²/(2b²)]
h(x) = x/b²
Range: 0 ≤ x < ∞    b > 0
Mean: b √(π/2)    Variance: (2 − π/2) b²
rectangular distribution : uniform distribution
Student's t distribution : t distribution
t distribution (: Student's t distribution) Continuous distribution with shape parameter ν, the degrees of freedom.
f(x) = Γ[(ν + 1)/2] {Γ(ν/2) √(νπ)}⁻¹ (1 + x²/ν)^−(ν+1)/2
where Γ is the gamma function.
Range: −∞ < x < ∞    ν > 0
Mean = Mode: 0 for ν > 1
Variance: ν/(ν − 2) for ν > 2
triangular distribution Continuous distribution with location parameters a, b and shape parameter c.
F(x) = (x − a)² / [(b − a)(c − a)]    for a ≤ x ≤ c
F(x) = 1 − (b − x)² / [(b − a)(b − c)]    for c ≤ x ≤ b
f(x) = 2(x − a) / [(b − a)(c − a)]    for a ≤ x ≤ c
f(x) = 2(b − x) / [(b − a)(b − c)]    for c ≤ x ≤ b
Range: a ≤ x ≤ b
Mean: (a + b + c)/3    Variance: (a² + b² + c² − ab − ac − bc)/18
uniform distribution (: rectangular distribution) Continuous distribution with location parameters a and b.
F(x) = (x − a)/(b − a)
f(x) = 1/(b − a)
h(x) = 1/(b − x)
Range: a ≤ x ≤ b    Mean: (a + b)/2    Variance: (b − a)²/12
Discrete distribution with parameter n.
f(x) = 1/(n + 1)
S(x) = (n − x)/(n + 1)
h(x) = 1/(n − x)
Range: 0 ≤ x ≤ n    Mean: n/2    Variance: n(n + 2)/12
variance ratio distribution : F distribution
Weibull distribution Continuous distribution with scale parameter b, shape parameter c.
F(x) = 1 − exp[−(x/b)^c]
f(x) = (c x^(c−1)/b^c) exp[−(x/b)^c]
Range: 0 ≤ x < ∞    b > 0    c > 0
Mean: b Γ[(c + 1)/c]    Variance: b²(Γ[(c + 2)/c] − {Γ[(c + 1)/c]}²)
distribution function (df) [PROB] → random variable
distribution-free estimator [ESTIM] → estimator
distribution-free test [TEST] → hypothesis testing
divergence coefficient [GEOM] → distance (□ quantitative data)
divisive clustering [CLUS] → hierarchical clustering
Dodge-Romig tables [QUAL] Tables of standards for acceptance sampling plans for attributes. They are either designed for lot tolerance percent defective protection or provide a specified average outgoing quality limit. In both cases there are tables for single sampling and double sampling. These tables apply only in the case of rectifying inspection, i.e. if the rejected lots are submitted to 100% inspection.
Doehlert design [EXDE] → design
dot plot [GRAPH] Graphical display of the frequency distribution of a continuous variable on a one-dimensional scatter plot. The data points are plotted along a straight line parallel to and above the horizontal axis. Stacking or jitter can be applied to better display overlapping points.
[Figure: dot plot of a sample on a horizontal axis running from 2 to 16, drawn once with jitter and once with stacking.]
Stacking means that data points of equal value are plotted above each other on a vertical line orthogonal to the main line of the plot. Jitter is a technique used to separate overlapping points on a scatter plot. Instead of positioning the overlapping points exactly on the same location, their coordinate values are slightly perturbed so that each point appears separately on the plot. On a one-dimensional scatter plot, rather than plotting the points on a horizontal line, they are displayed on a strip above the axis. The width of the strip is kept small compared to the range of the horizontal axis. The vertical position of a point within the strip is random. Although a scatter plot with jitter loses accuracy, its information content is enhanced.
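A minimal numerical sketch of stacking and jitter (not part of the original entry; the simulated data, seed and use of numpy/matplotlib are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.round(rng.normal(9, 3, 80)).clip(2, 16)   # values roughly on a 2..16 axis

# Stacking: points of equal value are drawn one above the other.
y_stack = np.zeros_like(x)
counts = {}
for i, v in enumerate(x):
    counts[v] = counts.get(v, 0) + 1
    y_stack[i] = counts[v]                        # 1, 2, 3, ... for repeated values

# Jitter: each point gets a small random vertical offset inside a narrow strip.
y_jitter = rng.uniform(0.0, 1.0, size=x.size)

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(x, y_stack, "k.");  ax1.set_title("stacking")
ax2.plot(x, y_jitter, "k."); ax2.set_title("jitter")
plt.show()
```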
double cross-validation (dcv) [FACT] → rank analysis
double exponential distribution [PROB] → distribution
double sampling [QUAL] → acceptance sampling
double tail test [TEST] → hypothesis testing
draftsman's plot [GRAPH] (: scatterplot matrix, matrix plot) Graphical representation of multivariate data. Several two-dimensional scatter plots arranged in such a way that adjacent plots share a common axis. A p-dimensional data set is represented by p(p − 1)/2 scatter plots.
[Figure: draftsman's plot of four variables x1, x2, x3, x4, shown as an array of pairwise scatter plots.]
In this array of pairwise scatter plots the plots of the same row have a common vertical axis, while the plots of the same column have a common horizontal axis. Corresponding data points can be connected by highlighting or coloring.
dummy variable [PREP] → variable
Duncan's test [TEST] → hypothesis test
Dunnett's test [TEST] → hypothesis test
Dunn's partition coefficient [CLUS] → fuzzy clustering
Dunn's test [TEST] → hypothesis test
Durbin-Watson's test [TEST] → hypothesis test
Dwass-Steel's test [TEST] → hypothesis test
dynamic linear model [TIME] : state-space model
E-M algorithm [ESTIM] Iterative procedure, first introduced in the context of dealing with missing data, but also widely applicable in other maximum likelihood estimation problems. In missing data estimation it is based on the realization that if the missing data had been observed, simple sufficient statistics for the parameters could be used for straightforward maximum likelihood estimation; on the other hand, if the parameters of the model were known, then the missing data could be estimated given the observed data. The algorithm alternates between two procedures. In the E step (for Expectation) the current parameter values are used to estimate the missing values. In the M step (for Maximization) new maximum likelihood parameter estimates are obtained from the observed data and the current estimate of the missing data. This sequence of alternating steps converges to a local maximum of the likelihood function.
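A minimal sketch of the alternation described above, for a univariate normal sample with missing values (the toy data, names and stopping rule are assumptions, not from the dictionary):

```python
import numpy as np

x = np.array([4.1, 5.3, np.nan, 4.8, np.nan, 5.9, 4.4])   # NaN marks missing values
obs = ~np.isnan(x)
mu, var = x[obs].mean(), x[obs].var()                      # starting values from observed data
n = x.size

for _ in range(100):
    # E step: expected sufficient statistics of the missing values given the
    # current parameters (E[x] = mu, E[x^2] = mu^2 + var).
    s1 = x[obs].sum() + (~obs).sum() * mu
    s2 = (x[obs] ** 2).sum() + (~obs).sum() * (mu ** 2 + var)
    # M step: new maximum likelihood estimates from the completed statistics.
    mu_new = s1 / n
    var_new = s2 / n - mu_new ** 2
    if abs(mu_new - mu) < 1e-10 and abs(var_new - var) < 1e-10:
        break
    mu, var = mu_new, var_new

print(mu, var)
```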
E-optimal design [EXDE] → design (□ optimal design)
edge [MISC] → graph theory
Edmonston coefficient [GEOM] → distance (□ binary data)
Edwards and Cavalli-Sforza clustering [CLUS] → hierarchical clustering (□ divisive clustering)
effect [ANOVA] → term in ANOVA
effect variable [PREP] → variable
efficiency [ESTIM] Measure of the relative goodness of an estimator, based on its variance. This measure provides a basis for comparing several potential estimators of the same parameter. The ratio of the variances of two estimators of the same quantity, in which the smaller variance is in the numerator, is called the relative efficiency; it is the relative efficiency of the estimator with the larger variance. It also gives the ratio of the sample sizes required for the two statistics to do the same job. For example, the sample median has variance of approximately (π/2)σ²/n and the sample mean has variance σ²/n, therefore the median has relative efficiency 2/π = 0.64. In other words the median estimates the location of a sample of 100 objects about as well as the mean estimates it for a sample of 64. The efficiency of an estimator as the sample size tends to infinity is called asymptotic efficiency.
efficient estimator [ESTIM] → estimator
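The 2/π relative efficiency quoted above can be checked by simulation; the following sketch (sample size, number of replicates and seed are arbitrary assumptions) compares the variances of the sample mean and sample median for normal data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 20000
samples = rng.normal(0.0, 1.0, size=(reps, n))

var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()

print("relative efficiency of the median:", var_mean / var_median)  # about 2/pi = 0.64
```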
eigenanalysis [ALGE] Analysis of a square matrix X(p, p) in terms of its eigenvalues and eigenvectors. The result is the solution of the system of equations:
X vj = λj vj        j = 1, p
The vector v is called the eigenvector (or characteristic vector, or latent vector); the scalar λ is called the eigenvalue (or characteristic root, or latent root). Each eigenvector defines a one-dimensional subspace that is invariant to premultiplication by X. The eigenvalues are the roots of the characteristic polynomial defined as:
P(λ) = |X − λ I|
The p roots λ are also called the spectrum of the matrix X and are often arranged in a diagonal matrix Λ. The decomposition
X = V Λ Vᵀ = Σj λj vj vjᵀ
is called the eigendecomposition or spectral decomposition. Usually one faces a symmetric eigenproblem, i.e. finding the eigenvalues and eigenvectors of a symmetric square matrix; most of the time the matrix is also positive semi-definite. This case has substantial computational advantages. The asymmetric eigenproblem is more general and more involved computationally. The following algorithms are the most common for solving an eigenproblem.
Jacobi method An old, less commonly used method for diagonalizing a symmetric matrix X. Let S(X) denote the sum of squares of the off-diagonal elements of X, so that X is diagonal if S(X) = 0. For any orthogonal matrix Q, QᵀXQ has the same eigenvalues as X. In each step the Jacobi method finds a matrix Q for which
S(QᵀXQ) < S(X)

linear equation [ALGE] A system of n linear equations in p unknowns, Xa = y, is overdetermined if n > p and underdetermined if n < p. The solution has the following properties:
- if n = p and X is nonsingular, the unique solution is a = X⁻¹y;
- the equation is consistent (i.e. admits at least one solution) if rank(X) = rank(X, y);
- for y = 0, there exists a nontrivial solution (i.e. a ≠ 0) if rank(X) < p;
- the equation XᵀXa = Xᵀy is always consistent.
Linear equations are most commonly solved by orthogonal matrix transformation or Gaussian elimination.
linear estimator [ESTIM] → estimator
linear learning machine (LLM) [CLAS] Binary classification method, similar to Fisher's discriminant analysis, that separates two classes in the p-dimensional measurement space by a (p − 1)-dimensional hyperplane. The iterative procedure starts with an arbitrary hyperplane, defined by a weight vector w orthogonal to the plane, through a specified origin w0. A training object x is classified by calculating a linear discriminant function
s = w0 + wᵀx
which gives positive values for objects on one side and negative values for objects on the other side of the hyperplane. The position of the plane is changed by reflection about a misclassified observation as:
w0(new) = w0(old) + c        w(new) = w(old) + c x
where c = −2s/(1 + xᵀx), so that the corrected plane gives the misclassified object a discriminant score of opposite sign. This method, which is rarely used nowadays, has many disadvantages: non-unique solution, slow or no convergence, too simple class boundary, unbounded misclassification risk.
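A small illustrative sketch of the LLM training loop, assuming the reflection correction c = −2s/(1 + x·x) given above; the simulated two-class data and the 100-pass limit are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)               # class labels

w0, w = rng.normal(), rng.normal(size=2)         # arbitrary starting hyperplane
for _ in range(100):                             # passes over the training set
    errors = 0
    for xi, yi in zip(X, y):
        s = w0 + w @ xi                          # linear discriminant function
        if np.sign(s) != yi:                     # misclassified object
            c = -2 * s / (1 + xi @ xi)           # reflection correction
            w0, w = w0 + c, w + c * xi
            errors += 1
    if errors == 0:                              # separable: every object classified correctly
        break

print(w0, w)
```

Note that, as the entry says, convergence is not guaranteed; the pass limit simply stops the loop for non-separable data.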
linear least squares regression [REGR] : ordinary least squares regression
linear programming [OPTIM] → optimization
linear regression model [REGR] → regression model
linear search optimization [OPTIM] Direct search optimization for minimizing a function of p parameters f(p). The step taken in the ith iteration is
p(i+1) = p(i) + si di
where si is the step size and di is the step direction. There are two basic types of linear search optimization. The first type uses a scheme for systematically reducing the length of the known interval that contains the optimal step size, based on a comparison of function values. Fibonacci search, golden section, and bisection belong to this group. The second type approximates the function f(p) around the minimum with a simpler function (e.g. a second- or third-order polynomial) for which a minimum is easily obtained. These methods are known as quadratic, cubic, etc. interpolations.
linear structural relationship (LISREL) [MULT] Latent variable model solved by maximum likelihood estimation, implemented in a software package. The model consists of two parts: the measurement model specifies how the latent variables are related to the manifest variables, while the structural model specifies the relationship among latent variables. It is assumed that both manifest and latent variables are continuous with zero expected values, and that the manifest variables have a joint normal distribution. The latent variables are of two types: endogenous (η) and exogenous (ξ), and are related by the linear structural model:
η = B η + Γ ξ + ζ
B and Γ are regression coefficients representing direct causal effects among η–η and η–ξ. The error term ζ is uncorrelated with ξ. There are two sets of manifest variables y and x corresponding to the two sets of latent variables. The linear measurement models are:
y = Λy η + ε        x = Λx ξ + δ
idempotent matrix Square matrix X in which
X² = X
For example, the hat matrix is an idempotent matrix.
identity matrix Diagonal matrix, often denoted as I, in which all diagonal elements are 1:
xij = 1 if i = j    and    xij = 0 if i ≠ j
nilpotent matrix Square matrix X in which
Xʳ = 0 for some r
nonsingular matrix Square matrix in which, in contrast to a singular matrix, the determinant is not zero. Only nonsingular matrices can be inverted.
null matrix (: zero matrix) Square matrix in which all elements are zero:
xij = 0 for all i and j
orthogonal matrix Square matrix in which
XᵀX = XXᵀ = I
Examples are: Givens transformation matrix, Householder transformation matrix, and matrix of eigenvectors.
positive definite matrix Square matrix in which
zᵀXz > 0 for all z ≠ 0
positive matrix Square matrix in which all elements are positive:
xij > 0 for all i and j
positive semi-definite matrix Square matrix in which
zᵀXz ≥ 0 for all z ≠ 0
singular matrix Square matrix in which the determinant is zero. In a singular matrix, at least two rows or columns are linearly dependent.
square matrix Matrix in which the number of rows equals the number of columns: n = p
symmetric matrix Square matrix in which row and column indices are interchangeable:
xij = xji
triangular matrix Square matrix in which all elements either below or above the diagonal are zero. In an upper triangular matrix
xij = 0 if i > j
In a lower triangular matrix
xij = 0 if i < j
tridiagonal matrix Square matrix in which only the diagonal elements and the elements adjacent to the diagonal (the superdiagonal and subdiagonal elements) are nonzero. This matrix is both an upper and a lower Hessenberg matrix.
unit matrix Square matrix in which all elements are one:
xij = 1 for all i and j
zero matrix : null matrix
matrix algebra [ALGE] : linear algebra
matrix condition [ALGE] (: condition of a matrix) Characteristic of a matrix reflecting the sensitivity of the quantities computed from the matrix to small changes in the elements of the matrix. A matrix is called an ill-conditioned matrix, with respect to a problem, if the computed quantities are very sensitive to small changes such as numerical precision error. If this is not the case, the matrix is called a well-conditioned matrix. There are various measures of the matrix condition. The most common one is the condition number:
cond(X) = ‖X‖ ‖X⁻¹‖
When ‖·‖ indicates the two-norm, the condition number is the ratio of the largest to the smallest nonzero eigenvalue of the matrix. A large condition number indicates an ill-conditioned matrix.
matrix decomposition [ALGE] The expression of a matrix as a product of two or three other matrices that have more structure than the original matrix. Rank-one-update is often used to make the calculation more efficient. A list of the most important methods follows.
bidiagonalization
B = U X Vᵀ
where B is an upper bidiagonal matrix, and U and V are orthogonal Householder matrices. This decomposition is accomplished by a sequence of Householder transformations. First the subdiagonal elements of the first column of X are set to zero by U1; next the p − 2 elements above the superdiagonal of the first row are set to zero by V1, resulting in U1 X V1. Similar Householder operations are applied to the rest of the columns j via Uj and to the rest of the rows i via Vi:
U = Up Up−1 … U1    and    V = Vp Vp−1 … V1
Bidiagonalization is used, for example, as a first step in singular value decomposition.
Cholesky factorization
X = L Lᵀ
where X is a symmetric, positive definite matrix, and L is a lower triangular matrix. The Cholesky decomposition is widely used in solving linear equation systems of the form X b = y.
diagonalization : singular value decomposition
eigendecomposition : spectral decomposition
L-R decomposition : L-U decomposition
L-U decomposition (: L-R decomposition, triangular factorization)
X = L U
where L is a lower triangular matrix and U is an upper triangular matrix. The most common method to calculate the L-U decomposition is Gaussian elimination.
Q-R decomposition
X = Q R
where Q is an orthogonal matrix and R is an upper triangular matrix. This decomposition is equivalent to an orthogonal transformation of X into an upper triangular form, used in solving linear equations and in eigenanalysis.
singular value decomposition (SVD) (: diagonalization)
X = U Λ Vᵀ
where U and V are orthogonal matrices, and Λ is a diagonal matrix. The values of the diagonal elements of Λ are called the singular values of X; the columns of U and V are called left and right singular vectors, respectively. Λ is calculated by first transforming X into a bidiagonal matrix and then eliminating the superdiagonal elements. When n = p and X is a symmetric positive definite matrix the singular values coincide with the eigenvalues of X. Furthermore U = V, which contains the eigenvectors of X.
spectral decomposition (: eigendecomposition)
X = V Λ Vᵀ
where X is a symmetric matrix, Λ is a diagonal matrix of the eigenvalues of X, and V is an orthogonal matrix containing the eigenvectors.
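The decompositions listed above can be checked numerically; a brief sketch (the example matrix, seed and use of numpy are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
X = A @ A.T                                      # symmetric, positive definite example

U, s, Vt = np.linalg.svd(X)                      # singular value decomposition
evals, V = np.linalg.eigh(X)                     # spectral decomposition X = V diag(evals) V^T
L = np.linalg.cholesky(X)                        # Cholesky factor, X = L L^T
Q, R = np.linalg.qr(X)                           # Q-R decomposition

print(np.allclose(X, U @ np.diag(s) @ Vt))       # True: X reconstructed from its SVD
print(np.allclose(np.sort(s), np.sort(evals)))   # here singular values = eigenvalues
print(np.allclose(X, L @ L.T))                   # True
print(np.allclose(X, Q @ R))                     # True
```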
triangular factorization : L-U decomposition
tridiagonalization
T = U X Uᵀ
where X is a symmetric matrix, T is a tridiagonal matrix, and U is an orthogonal matrix. This decomposition, similar to bidiagonalization, is achieved by a sequence of Householder transformations: U = U1 … Up−1. It is used mainly in eigenanalysis.
matrix norm [ALGE] → norm
matrix operation [ALGE] The following are the most common operations on a single matrix X, or on two matrices X and Y with matching orders.
addition of matrices
Z(n, p) = X(n, p) + Y(n, p)        zij = xij + yij
Addition can be performed only if X and Y have the same dimensions.
determinant of a matrix Scalar defined for a square matrix X(p, p) as:
|X| = Σa |a| x1a(1) x2a(2) … xpa(p)
where the summation is taken over all permutations a of (1, …, p), and |a| equals +1 or −1, depending on whether a can be written as the product of an even or odd number of transpositions. For example, for p = 2
|X| = x11 x22 − x12 x21
The determinant of a diagonal matrix X is equal to the product of the diagonal elements:
|X| = Πi xii
If |X| ≠ 0, then X is a nonsingular matrix. Important properties of the determinant are:
a) |XY| = |X| |Y|
b) |Xᵀ| = |X|
c) |cX| = c^p |X|
inverse of a matrix The inverse of a square matrix X is the unique matrix X⁻¹ satisfying:
X X⁻¹ = X⁻¹ X = I
The inverse exists only if X is a nonsingular matrix, i.e. if |X| ≠ 0. The most efficient way to calculate the inverse is via Cholesky decomposition. The matrix X⁻ is called a generalized inverse matrix of a nonsquare matrix X if it satisfies:
X X⁻ X = X
The generalized inverse always exists, although usually it is not unique. If the following three equations are also satisfied, it is called a Moore-Penrose inverse (or pseudo inverse) and denoted as X⁺:
X⁺ X X⁺ = X⁺        (X X⁺)ᵀ = X X⁺        (X⁺ X)ᵀ = X⁺ X
For example, the Moore-Penrose inverse matrix is calculated in ordinary least squares regression as:
X⁺ = (XᵀX)⁻¹ Xᵀ
The generalized inverse is best obtained by first performing a singular value decomposition of X:
X = U Λ Vᵀ        then        X⁺ = V Λ⁻¹ Uᵀ
where the reciprocals are taken only of the nonzero elements of Λ.
multiplication of matrices
Z(n, m) = X(n, p) Y(p, m)        zij = Σk xik ykj
Multiplication can be performed only if the number of columns in X equals the number of rows in Y.
partitioning of a matrix
scalar multiplication of a matrix
Z(n, p) = c X(n, p)        zij = c xij
subtraction of matrices
Z(n, p) = X(n, p) − Y(n, p)        zij = xij − yij
Subtraction can be performed only if X and Y have the same dimensions.
trace of a matrix The sum of the diagonal elements of X, denoted as tr(X):
tr(X) = Σi xii
transpose of a matrix Interchanging the row and column indices of a matrix:
Z(p, n) = Xᵀ(n, p)        zji = xij
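A short numerical illustration of several of the operations listed above (determinant, inverse, Moore-Penrose inverse, trace and transpose); the example matrices are arbitrary assumptions:

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 3.0]])
Y = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])                  # nonsquare: only a generalized inverse exists

print(np.linalg.det(X))                          # |X| = 2*3 - 1*1 = 5
print(np.linalg.inv(X))                          # X^-1, exists because |X| != 0
print(np.trace(X))                               # tr(X) = 2 + 3 = 5
print(X.T)                                       # transpose

Yplus = np.linalg.pinv(Y)                        # Moore-Penrose inverse, computed via SVD
print(np.allclose(Y @ Yplus @ Y, Y))             # defining property X X- X = X
```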
matrix plot [GRAPH] → draftsman's plot
matrix rank [ALGE] (: rank of a matrix) The maximum number of linearly independent vectors (rows or columns) in a matrix X(n, p), denoted as r(X). Consequently, linearly dependent rows or columns reduce the rank of a matrix. A matrix X is called a rank deficient matrix when r(X) < p. The rank has the following properties:
- 0 ≤ r(X) ≤ min(n, p)
- r(X) = r(Xᵀ)
- r(X + Z) ≤ r(X) + r(Z)
- r(XZ) ≤ min[r(X), r(Z)]
- r(XᵀX) = r(XXᵀ) = r(X)
matrix transformation [ALGE] Numerical method that transforms a real matrix X(n, p) or a real symmetric matrix X(p, p) into some desirable form by pre- or post-multiplying it by a chosen set of nonsingular matrices. The method is called orthogonal matrix transformation when the multiplier matrices are orthogonal and an orthogonal basis for the column space of X is calculated; otherwise the transformation is nonorthogonal, and is usually performed by elimination methods. Matrix transformations are the basis of several numerical algorithms for solving linear equations and for eigenanalysis.
maximum likelihood (ML) [PROB] → likelihood
maximum likelihood clustering [CLUS] Clustering method that estimates the partition on the basis of the maximum likelihood. Given a data matrix X of n objects described by p variables, a parameter vector θ = (π1 … πG; μ1 … μG; Σ1 … ΣG) and another parameter vector γ = (n1 … nG), the likelihood is:
L(θ, γ | X) = Πg Πi∈sg πg f(xi; μg, Σg)        g = 1, G
where sg is the set of objects xi belonging to the gth group and ng is the number of objects in sg. Parameters πg, μg, and Σg are the prior cluster probabilities, the cluster centroids, and the within-cluster covariance matrices, respectively. The ML estimates of these quantities are:
πg:   pg = ng / n
μg:   cg = Σsg xi / ng
Σg:   Sg = Σsg (xi − cg)ᵀ(xi − cg) / ng        g = 1, G
maximum likelihood estimator [ESTIM] → estimator
maximum likelihood factor analysis [FACT] Factor extraction method based on maximizing the likelihood function for the normal distribution
L(X | μ, Σ)
The population covariance matrix Σ can be expressed in terms of common and unique factors as
Σ = Λ Λᵀ + U
and fitted to the sample covariance matrix S. The assumption is that both common factors and unique factors are independent and normally distributed with zero means and unit variances; consequently the variables are drawn from a multivariate normal distribution. The common factors are assumed to be orthogonal, and their number M is prespecified. The factor loadings Λ and the unique factors U are calculated in an iterative procedure based on maximizing the likelihood function, which is equivalent to minimizing:
min  tr[Σ⁻¹S] − ln|Σ⁻¹S| − p
The minimum is found by nonlinear optimization. Convergence is not always obtained, the optimization algorithm often ends in a local minimum, and sometimes one must face the Heywood case problem. The solution is not unique; two solutions can differ by a rotation.
maximum likelihood latent structure analysis [MULT] → latent class model
maximum linkage [CLUS] → hierarchical clustering (□ agglomerative clustering)
maximum scaling [PREP] → standardization
maxplane rotation [FACT] → factor rotation
McCulloh-Meeter plot [GRAPH] → scatter plot (□ residual plot)
McNemar's test [TEST] → hypothesis test
McQuitty's similarity analysis [CLUS] → hierarchical clustering (□ agglomerative clustering)
mean [DESC] → location
mean absolute deviation (MAD) [DESC] → dispersion
mean character difference [GEOM] → distance (□ quantitative data)
mean deviation [DESC] → dispersion
mean function [TIME] → autocovariance function
mean group covariance matrix [DESC] → covariance matrix
mean square error (MSE) [ESTIM] Mean squared difference between the true value and the estimated value of a parameter. It has two components: variance and squared bias.
MSE(θ̂) = E[θ̂ − θ]² = V[θ̂] + B²[θ̂] = E[θ̂ − E[θ̂]]² + (E[θ̂] − θ)²
Estimators that calculate estimates with zero bias are called unbiased estimators. Biased estimators increase the bias and decrease the variance component of the MSE, trying to find its minimum at an optimal complexity.
[Figure: MSE and its bias and variance components plotted against model complexity; the minimum of the MSE curve marks the optimal complexity.]
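The decomposition MSE = variance + squared bias can be verified by simulation; a minimal sketch (the deliberately biased "shrunken mean" estimator, sample size and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 2.0, 25, 50000
estimates = 0.8 * rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)  # biased estimator

mse = np.mean((estimates - theta) ** 2)
variance = estimates.var()
bias2 = (estimates.mean() - theta) ** 2

print(mse, variance + bias2)                     # the two numbers agree closely
```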
mean square error (MSE) [MODEL] → goodness of fit
mean squares in ANOVA (MS) [ANOVA] Column in the analysis of variance table containing the ratio of the sum of squares to the number of degrees of freedom of the corresponding term. It estimates the variance component of the term and is used to test the significance of the term. The mean square of the error term is an unbiased estimate of the error variance.
mean trigonometric deviation [DESC] → dispersion
measure of distortion [CLUS] → assessment of clustering
median [DESC] → location
median absolute deviation around the median (MADM) [DESC] → dispersion
median linkage [CLUS] → hierarchical clustering (□ agglomerative clustering)
median test [TEST] → hypothesis test
membership function [MISC] → fuzzy set theory
metameric transformation [PREP] → transformation
metameter [PREP] → transformation (□ metameric transformation)
metric data [PREP] → data
metric distance [GEOM] → distance
metric multidimensional scaling [MULT] → multidimensional scaling
metric scale [PREP] → scale
midrange [DESC] → location
military standard table (MIL-STD) [QUAL] Military standard for acceptance sampling that is widely used in industry. It is a collection of sampling plans. The most popular one is MIL-STD 105D, which contains standards for single, double and multiple sampling for attributes. The primary focus is the acceptable quality level. There are three inspection levels: normal, tightened and reduced. The sample size is determined by the lot size and the inspection level. MIL-STD 414 contains sampling plans for variable sampling. These also control the acceptable quality level. There are five inspection levels. It is assumed that the quality characteristic is normally distributed.
minimal spanning tree (MST) [MISC] → graph theory
minimal spanning tree clustering [CLUS] → non-hierarchical clustering (□ graph theoretical clustering)
minimax strategy [MISC] → game theory
minimum linkage [CLUS] → hierarchical clustering (□ agglomerative clustering)
Minkowski distance [GEOM] → distance (□ quantitative data)
minres [FACT] → principal factor analysis
misclassification [CLAS] → classification
misclassification matrix [CLAS] → classification
misclassification risk (MR) [CLAS] → classification
missing value [PREP] Absent element in the data matrix. It is important to distinguish between a real missing value (a value potentially available but not measured), a don't care value
(the measured value is not relevant), and a meaningless value (the measurement is not possible or not allowed). There are several techniques for dealing with missing values. The simplest solution, applicable only in the case of few missing values, is to delete the object (row) or the variable (column) containing the missing value. A more reasonable procedure is to fill in the missing value by some estimated value obtained from the same variable. This can be: the variable mean calculated from all objects with nonmissing values; the variable mean calculated from a subset of objects, e.g. belonging to the same class; or a random value from the normal distribution within the extremes of the variable. While the variable mean produces a flattening of the information content of the variable, the random value introduces spurious noise. Missing values can also be estimated using multivariate models. Principal component, K nearest neighbor and regression models are the most popular ones.
mixed data [PREP] → data
mixed effect model [ANOVA] → term in ANOVA
mixture design [EXDE] → design
moat [CLUS] Measure of external isolation of a cluster defined for hierarchical agglomerative clustering. It is the difference between the similarity level at which the cluster was formed and the similarity level at which the cluster was agglomerated into a larger cluster. Clusters in complete linkage usually have larger moats than clusters in single linkage.
mode [DESC] → location
mode analysis [CLUS] → non-hierarchical clustering (□ density clustering)
model [MODEL] Mathematical equation describing a causal relationship between several variables. The responses (or dependent variables) in a model may be either quantitative or qualitative. Examples of the first group are the regression model, the factor analysis model, and the ANOVA model, while classification models, for example, belong to the second group. The value of the highest power of a predictor variable is called
the order of a model. A subset of predictor variables that contains almost the same amount of information about the response as the complete set is called an adequate subset. The number of independent pieces of information needed to estimate a model is called the model degrees of freedom. A model consists of two parts: a systematic part described by the equation, and the model error. This division is also reflected in the division of the total sum of squares. The calculation of the optimal parameters of a model from data is called model fitting. Besides fitting the data, a model is also used for prediction. Once a model is obtained it must be evaluated on the basis of goodness of fit and goodness of prediction criteria. This step is called model validation. Often the problem is to find the optimal model among several potential models. Such a procedure is called model selection.
additive model Model in which the predictors have an additive effect on the response.
biased model Statistical model in which the parameters are calculated by biased estimators. The goal is to minimize the model error via the bias-variance trade-off. In these models the bias is not zero, but in exchange, the variance component of the squared model error is smaller than in a corresponding unbiased model. PCR, PLS, and ridge regression are examples of biased regression models; RDA, DASCO and SIMCA are biased classification models.
causal model Model concerned with the estimation of the parameters in a system of simultaneous equations relating dependent and independent variables. The independent variables are viewed as causes and the dependent variables as effects. The dependent variables may affect each other and be affected by the independent variables; the latter are not, however, affected by the dependent variables. The cause-effect relationship cannot be proved by statistical methods; it is postulated outside of statistics. Causal models can be solved by LISREL or by path analysis and are frequently described by path diagrams.
deterministic model Model that, in contrast to a stochastic model, does not contain random elements.
hierarchical model : nested model
nested model (: hierarchical model) Set of models in which each one is a special case of a more general model obtained by dropping some terms (usually by setting a parameter to zero). For example, the model
y = b0 + b1 x
is nested within
y = b0 + b1 x + b2 x²
There are two extreme approaches to finding the best nested model:
- the top-down approach starts with the largest model of the family (also called the saturated model) and then drops terms by backward elimination;
- the bottom-up approach starts with the simplest model and then includes terms one at a time from a list of the possible terms, by forward elimination.
It is important to take account not only of the best fit but also of the trade-off between fit and complexity of the model. Hierarchical models can be compared on the basis of criteria such as adjusted R², Mallows' Cp, PRESS, the likelihood ratio and information criteria (e.g. the Akaike information criterion). In contrast to nested models, nonnested models are a group of heterogeneous models. For example, y = b0 + b1 ln(x) + b2(1/x) is nonnested within the above models. Maximum likelihood estimators and information criteria are particularly useful when comparing nonnested models.
parsimonious model Model with the fewest parameters among several satisfactory models. soft model A term used to denote the soft part of the modeling approach, characterized by the use of latent variables, principal components or factors calculated from the data set and therefore directly describing the structure of the data. This is opposite to a hard model, in which a priori ideas of functional connections (physical, chemical, biological, etc. mathematical equations) are used and models are constructed from these.
stochastic model Model that contains random elements.
model I [ANOVA] → term in ANOVA
model II [ANOVA] → term in ANOVA
model degrees of freedom (df) [MODEL] The number of independent pieces of information necessary to estimate the parameters of a statistical model. For example, in a linear regression model it is calculated as the trace of the hat matrix: tr[H]. In OLS with p parameters: tr[H] = p. The total degrees of freedom (the number of objects n) minus the model degrees of freedom is called the error degrees of freedom or residual degrees of freedom. For example, in a linear regression model it is calculated as n − tr[H]. The error degrees of freedom of an OLS model with p parameters is n − p.
model error [MODEL] (: error term) Part of the variance that cannot be described by a model, denoted e or ei. It is calculated as the difference between the observed and the calculated response:
e = y − ŷ        or        ei = yi − ŷi
The standard deviation of ei, denoted σ, is also often called the model error. The estimated model error s is an important characteristic of a model and the basis of several goodness of fit statistics.
model fitting [MODEL] Procedure of calculating the optimal parameter values of a model from observed data. When the functional form of the model is specified in analytical form (e.g. linear or polynomial), the procedure is called curve fitting. In contrast, the form of the model is sometimes defined only by weak constraints (e.g. smoothness) and the model obtained is stored in digital form. Overfitting means increasing the complexity of a model, i.e. fitting model terms that make little or no contribution. Increasing the number of parameters and the model degrees of freedom beyond the level supported by the available data causes high variance in the estimated quantities. In the case of a small data set, in particular, one should be careful about fitting excessively complex models (e.g. nonlinear models). Underfitting is the opposite phenomenon. Underspecified models (e.g. linear instead of nonlinear, or with terms excluded) result in bias in the estimates.
model selection [MODEL] Selection of the optimal model from a set of candidate models. The criterion for optimality has to be prespecified. One is usually interested in finding the best predictive model, so model selection is often based on a goodness of prediction criterion. Important examples are biased regression and classification. The range of candidate models can be defined by the number of predictor variables included in the model (e.g. variable subset selection, stepwise linear discriminant analysis), by the number of components (e.g. PCR, PLS, SIMCA), or by the value of a shrinkage parameter (e.g. ridge regression, RDA).
model sum of squares (MSS) [MODEL] → goodness of fit
model validation [MODEL] (: validation) Statistical procedure for validating a statistical model with respect to a prespecified criterion, most often to assess the goodness of fit and goodness of prediction of a model. A list of various model validation techniques follows.
bootstrap Computer intensive model validation technique that gives a nonparametric estimate of the statistical error of a model in terms of its bias and variance. This procedure mimics the process of drawing many samples of equal size from a population in order to calculate a confidence interval for the estimates. The data set of n observations is not considered as a sample from a population, but as the entire population itself, from which samples of size n, called bootstrap samples, are drawn with replacement. This is achieved by assigning a number to each observation of the data set and then generating the random samples by matching a string of random numbers to the numbers that correspond to the observations. The estimate calculated from the entire data set is denoted t, while the estimate calculated from the bth bootstrap sample is denoted tb. The mean value of tb is
t̄ = Σb tb / B
where B is the number of bootstrap samples. The bootstrap estimates of the bias B(t) and the variance V(t) of the statistic t are
B(t) = t̄ − t
V(t) = Σb (tb − t̄)² / (B − 1)
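A minimal sketch of these bootstrap estimates applied to the sample median (the data, the number of bootstrap samples B and the seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(1.0, size=50)
t = np.median(x)                                 # statistic from the whole data set

B = 2000
tb = np.array([np.median(rng.choice(x, size=x.size, replace=True))
               for _ in range(B)])               # bootstrap replicates

t_bar = tb.mean()
bias = t_bar - t                                 # B(t) = tbar - t
variance = ((tb - t_bar) ** 2).sum() / (B - 1)   # V(t)
print(bias, variance)
```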
cross-validation (cv) Model validation technique used to estimate the predictive power of a statistical model. This resampling procedure predicts an observation from a model calculated without that observation, so that the predicted response is independent of the observed response. The predictive residual sum of squares, PRESS, is one of the best goodness of prediction measures. In linear estimators, e.g. OLS, PRESS can be calculated from ordinary residuals, while in nonlinear estimators, e.g. PLS, the predicted observations must literally be left out of the model calculation. Cross-validation which repeats the model calculation n times, each time leaving out a different observation and predicting it from a model fitted to the other n − 1 observations, is called leave-one-out cross-validation (LOO). Cross-validation in which n' = n/G observations are left out is called G-fold cross-validation. In this case the model calculation is repeated G times, each time leaving out n' different observations. Because the perturbation of the model is greater than in the leave-one-out procedure, the G-fold goodness of prediction estimate is usually less optimistic than the LOO estimate.
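A minimal sketch of leave-one-out cross-validation and PRESS for an OLS model (the simulated data, model form and seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 30
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])             # design matrix with intercept

press = 0.0
for i in range(n):
    keep = np.arange(n) != i                     # leave observation i out
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    press += (y[i] - X[i] @ b) ** 2              # predictive residual

print("PRESS =", press)
```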
jackknife Model validation technique that gives a nonparametric estimate of the statistical error of an estimator in terms of its bias and variance. The procedure is based on sequentially deleting observations from the sample and recomputing the statistic; t(i) denotes the statistic t calculated without the ith observation. The mean of these statistics calculated from the truncated data sets is
t(.) = Σi t(i) / n
The jackknife estimates of the bias B(t) and the variance V(t) of the statistic t are:
B(t) = (n − 1)(t(.) − t)
V(t) = [(n − 1)/n] Σi [t(i) − t(.)]²
For example, the jackknife shows that the mean is an unbiased estimator:
B(x̄) = (n − 1)(x̄(.) − x̄) = 0
V(x̄) = Σi (xi − x̄)² / [n(n − 1)]
Similarly, the bias and variance of the variance can be estimated in the same way. The bias and variance of more complex estimators, like regression parameters, eigenvectors, discriminant scores, etc. can be calculated in a similar fashion, obtaining a numerical estimate of the statistical error. The jackknife is similar to cross-validation in that in both procedures observations are omitted one at a time and estimators are calculated repeatedly on truncated data sets. However, the goal of the two techniques is different.
resampling (: sample reuse) Model validation procedure that repeatedly calculates an estimator from a reweighted version of the original data set in order to estimate the statistical error associated with the estimator. With this technique the goodness of prediction of a model can be calculated from a single data set. Instead of splitting the available data into a training set, to calculate the model, and an evaluation set, to evaluate it, the evaluation set is created by repeated subsampling. Cross-validation, bootstrap and jackknife belong to this group.
resubstitution Model validation based on the same observations used for calculating the model. Such validation methods can be used to calculate goodness of fit measures, but not goodness of prediction measures. For example, in regression models RSS or R², and in classification the error rate or the misclassification risk, are calculated via resubstitution.
sample reuse : resampling
training-evaluation set split In the case of many observations, the calculated model can be evaluated without reusing the observations. The whole data set is split into two parts. One part, called the training set, is used to calculate the model, and the other part, called the evaluation set, is used to evaluate the model. A key point in this model validation method is how to partition the data set. The sizes of the two sets are often selected to be equal. A random split is usually a good choice, except if there is some trend or regularity in the observations (e.g. time series).
modified control chart [QUAL] → control chart
moment [PROB] The expected value of a power of a variate x with probability distribution function f(x):
μr = ∫ xʳ f(x) dx
The central moment is the moment calculated about the mean:
μ'r = ∫ (x − μ)ʳ f(x) dx
The absolute moment is defined as
∫ |x|ʳ f(x) dx
The most important moments are the first moment, which is the mean
μ = ∫ x f(x) dx
and the second central moment, which is the variance of the variate
σ² = ∫ (x − μ)² f(x) dx
All integrals are taken over −∞ < x < ∞. The skewness of a distribution is μ3/σ³ and the kurtosis of a distribution is μ4/σ⁴.
momentum term [MISC] → neural network
Monte Carlo sampling [PROB] → sampling
Monte Carlo simulation [PROB] : simulation
Mood-Brown's test [TEST] → hypothesis test
Mood's test [TEST] → hypothesis test
Moore-Penrose inverse [ALGE] → matrix operation (□ inverse of a matrix)
monothetic clustering [CLUS] → cluster analysis
monotone admissibility [CLUS] → assessment of clustering (□ admissibility properties)
Moses' test [TEST] → hypothesis test
most powerful test [TEST] → hypothesis testing
moving average (MA) [TIME] The series of values
mq = Σi=1,k wi xq+i / Σi=1,k wi        q = 0, n − k
where x1, …, xn are a set of values from a time series, k is a span parameter (k < n) and w1, …, wk are a set of weights.
If all the weights are equal to 1/k, the moving average is called simple and can be constructed by dividing the moving total (the sum of k consecutive elements) by k:
Σi=1,k xi / k,    Σi=2,k+1 xi / k,    Σi=3,k+2 xi / k,    etc.
The set of moving averages constitutes the moving average model.
moving average chart [QUAL] → control chart
moving average model (MA) [TIME] → time series model
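A minimal sketch of the moving average defined above, both simple (equal weights 1/k) and weighted; the series, span and weights are illustrative assumptions:

```python
import numpy as np

x = np.array([3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 10.0])
k = 3

moving_total = np.convolve(x, np.ones(k), mode="valid")     # sums of k consecutive values
simple_ma = moving_total / k                                # simple moving average

w = np.array([1.0, 2.0, 1.0])                               # a weighted version
weighted_ma = np.convolve(x, w[::-1], mode="valid") / w.sum()

print(simple_ma)
print(weighted_ma)
```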
moving total [TIME] → moving average
multiblock PLS [REGR] → partial least squares regression
multicollinearity [MULT] : collinearity
multicriteria decision making (MCDM) [QUAL] Techniques for finding the best settings of process variables that optimize more than one criterion. These techniques can be divided into two groups: the generating techniques use no prior information to determine the relative importance of the various criteria, while in the preference techniques the relative importance of the criteria is given a priori. The most commonly used MCDM techniques follow.
desirability function Several responses are merged into one final criterion to be optimized. The first step is to define the desired values for each response. Each measured response is transformed into a measure of desirability dr, where 0 ≤ dr ≤ 1. The overall desirability D is calculated as the geometric mean:
D = (d1 d2 … dR)^(1/R)
The following scale was suggested for desirability:

Value          Quality
1.00-0.80      excellent
0.80-0.63      good
0.63-0.40      poor
0.40-0.30      borderline
0.30-0.00      unacceptable
When the response is categorical, or should be in a specified interval or above a specified level to be acceptable, only dr = 1 and dr = 0 are assigned. When the response Yr is continuous, it is transformed into dr by a one-sided transformation:
dr = exp[−exp(−Yr)]
or by an analogous two-sided transformation.
outranking method Ranks all possible parameter value combinations. Methods belonging to this group are: ELECTRE, ORESTE, and PROMETHEE. PROMETHEE ranks the parameter combinations using a preference function that gives a binary output comparing two parameter combinations.
overlay plot Projects several bivariate contour plots of response surfaces onto one single plot. Each contour plot represents a different criterion. Minimum and maximum boundaries for an acceptable criterion in each contour plot can be compared visually on the aggregate plot and the best process variables selected.
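A minimal sketch of the desirability function described above, combining three responses with the one-sided transformation and a geometric mean (the standardized response values are illustrative assumptions):

```python
import numpy as np

responses = np.array([1.2, 0.4, 2.1])            # standardized continuous responses Y_r

d = np.exp(-np.exp(-responses))                  # one-sided transformation to [0, 1]
D = d.prod() ** (1.0 / d.size)                   # overall desirability (geometric mean)

print(d, D)                                      # e.g. D near 0.8 would rate as "good"
```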
Pareto optimality criterion Selects a set of noninferior, so-called Pareto-optimal points, that are superior to the other points in the process variable space. The nonselected points are inferior to the Pareto-optimal points in at least one criterion. The position of the Pareto-optimal points is often investigated on principal component plots and biplots.
utility function An overall optimality criterion is calculated as a linear combination of the k = 1, K different criteria fk:
U = Σk wk fk
The optimum of such a utility function, given certain weights, is always in the set of Pareto-optimal points.
multidimensional scaling (MDS) [MULT] Mapping method to construct a configuration of n points in a p'-dimensional space from a matrix containing interpoint distances or similarities measured with error in the p-dimensional space. Starting from a distance matrix D, the objective is to find the locations of the data points x1, …, xn in the p'-dimensional space such that their interpoint distances d ∈ D are similar in some sense to the distances d' in the projected space. Usually p' is restricted to 1, 2, or 3. Multidimensional scaling is a dimension reduction technique that provides a one-, two-, or three-dimensional picture that conserves most of the structure of the original configuration. There are several solutions to this problem, all indeterminate with respect to translation, rotation and reflection. The solution is a non-metric multidimensional scaling (NMDS) if it is based only on the rank order of the pairwise distances, otherwise it is called metric multidimensional scaling. The most popular solution minimizes the stress function
S = {Σi<j (dij − d'ij)² / Σi<j dij²}^(1/2)
The minimum indicating the optimal configuration is usually found by the steepest descent method.
multigraph [MISC] → graph theory
multinomial distribution [PROB] → distribution
multinomial variable [PREP] → variable
multiple comparison [ANOVA] → contrast
multiple correlation [DESC] → correlation
multiple correlation coefficient [MODEL] → goodness of fit
multiple least squares regression [REGR] : ordinary least squares regression
multiple regression model [REGR] → regression model
multiple sampling [QUAL] → acceptance sampling
multiplication of matrices [ALGE] → matrix operation
multistate variable [PREP] → variable
multivariate [MULT] This term refers to objects described by several variables (multivariate data), to statistical parameters and distributions of such data, and to models and methods applied to such data. In contrast, the term univariate refers to objects described by only one variable (univariate data), to statistical parameters and distributions of such a variable, and to models and methods applied to one single variable. One-variable-at-a-time (OVAT) is a term which refers to a statistical method that tries to find the optimal solution for a multivariate problem by considering the variation in only one variable at a time, while keeping all other variables at a fixed level. Such a limited approach loses information about variable interaction, synergism and correlation. Despite this drawback, such methods are used due to their simplicity, ease of control and interpretability.
multivariate [PROB] → random variable
multivariate adaptive regression splines (MARS) [REGR] Nonparametric, nonlinear regression model based on splines. The MARS model has the form:
ŷ = b0 + Σ fj(xj) + Σ fjk(xj, xk) + Σ fjkl(xj, xk, xl) + …
The first term contains all functions that involve a single variable, the second term contains all functions that involve two variables, the third term has functions with three variables, etc. A univariate spline has the form
yi = b0 + Σm bm Bm(xi)        i = 1, n
where the Bm are called spline basis functions and have the form:
Bm(xi) = [sm(xi − tm)]+^q        m = 1, M
Values tm are called knot locations that define the predictor subregions, and sm equals either +1 or −1. The + sign indicates that the function is evaluated only for positive values. The index q is the power of the fitted spline; q = 1 indicates a linear spline, while q = 3 indicates a cubic spline. MARS is a multivariate extension of the above univariate model:
yi = b0 + Σm bm Bm(xi)        Bm(xi) = Πj Bjm(xij)
Each multivariate basis function is a product of j = 1, J univariate basis functions. The parameter J can be different for each multivariate basis function. Once the basis functions are calculated the regression coefficients bm are estimated by the least squares procedure. The above multivariate splines are adaptive splines in MARS. This means that the knot locations are not fixed, but optimized on the basis of the training set. The advantage of the MARS model is its capability to depict different relationships in the various predictor subregions via local variable subset selection, and to include both additive and interactive terms. As in many nonlinear and biased regression methods, a crucial point is to determine the optimal complexity of the model. In MARS the complexity is determined by q, the degree of the polynomials fitted, by M, the number of terms, and by J, the order of the multivariate basis functions. These parameters are optimized by cross-validation to obtain the best predictive model.
multivariate analysis [MULT] Statistical analysis performed on multivariate data, i.e. on objects described by more than one variable. Multivariate data can be collected in one or more data matrices. Multivariate methods can be distinguished according to the number of data matrices they deal with. Display methods, factor analysis, cluster analysis and principal coordinate analysis explore the structure of one single matrix. Canonical correlation analysis and Procrustes analysis examine the relationship between two matrices. Latent variable models connect two or more matrices. Multivariate analysis that examines the relationship among n objects described by p variables is called Q-analysis. The opposite is R-analysis when the relationship
among p variables determined by n observations is of interest. For example, clustering objects or extracting latent variables is Q-analysis, while clustering variables or calculating correlation among variables is R-analysis.
multivariate analysis of variance (MANOVA) [ANOVA] Multivariate generalization of the analysis of variance, studying group differences in location in a multidimensional measurement space. The response is a vector that is assumed to arise from multivariate normal distributions with different means but with the same covariance matrix. The goal is to verify the differences among the population locations and to estimate the treatment effects. The one-way MANOVA model with I levels and K observations per level partitions the total scatter matrix T into the between-scatter matrix B and the within-scatter matrix W:
T = B + W
where
T = Σi Σk (xik − x̄)(xik − x̄)ᵀ
B = K Σi (x̄i − x̄)(x̄i − x̄)ᵀ
W = Σi Σk (xik − x̄i)(xik − x̄i)ᵀ
The test for location differences involves generalized variances. The null hypothesis of equal group locations is rejected if the ratio (Wilks' Λ)
Λ = |W| / |T|
is too small.
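A minimal sketch of the one-way MANOVA decomposition and Wilks' Λ above (the simulated groups, the shift applied to one group, and the sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
I, K, p = 3, 10, 2
X = rng.normal(size=(I, K, p))
X[1] += 1.0                                      # shift the second group's location

grand = X.reshape(-1, p).mean(axis=0)            # grand centroid
means = X.mean(axis=1)                           # group centroids

T = sum(np.outer(x - grand, x - grand) for x in X.reshape(-1, p))
B = K * sum(np.outer(m - grand, m - grand) for m in means)
W = T - B                                        # T = B + W

wilks = np.linalg.det(W) / np.linalg.det(T)      # small values indicate location differences
print(wilks)
```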
multivariate calibration [REGR] → calibration
multivariate data [PREP] → data
multivariate dispersion [DESC] The dispersion of multivariate data about its location is measured, for example, by the covariance matrix. However, sometimes it is convenient to assess the multivariate dispersion by a single number. Two common measures are: the generalized variance, which is the determinant of the covariance matrix, |S|, and the total variation, which is the trace of the covariance matrix, tr[S]. In both cases, a large value indicates wide dispersion and a low value represents a tight concentration of the data about the location. For example, these quantities are optimized in non-hierarchical cluster analysis. The generalized variance plays an important role in maximum likelihood estimation; the total variation is calculated in principal component analysis.
multivariate distribution [PROB] → random variable
multivariate image analysis (MIA) [MULT] → image analysis
multivariate least squares regression (MLS) [REGR] : ordinary least squares regression
multivariate normal distribution [PROB] → distribution
multivariate regression model [REGR] → regression model
mutual information [MISC] → information theory
np-chart [QUAL] → control chart (□ attribute control chart)
narrow-band process [TIME] → stochastic process
nearest centroid sorting [CLUS] → non-hierarchical clustering (□ optimization clustering)
nearest centrotype sorting [CLUS] → non-hierarchical clustering (□ optimization clustering)
nearest means classification (NMC) [CLAS] : centroid classification
nearest neighbor density estimator [ESTIM] → estimator (□ density estimator)
nearest neighbor linkage [CLUS] → hierarchical clustering (□ agglomerative clustering)
negative binomial distribution [PROB] → distribution
negative exponential distribution [PROB] → distribution
nested design [EXDE] → design
nested effect term [ANOVA] → term in ANOVA
nested model [ANOVA] → analysis of variance
nested model [MODEL] → model
nested sampling [PROB] → sampling
network [MISC] → graph theory
neural network (NN) [MISC] (: artificial neural network) New, rapidly developing field of pattern recognition based on mimicking the function of the biological neural system. It is a parallel machine that uses a large number of simple processors, called neurons, with a high degree of connectivity. A NN is a nonlinear model, and an adaptive system that learns by adjusting the strength of the connections between neurons.
[Figure: a three-layer feedforward neural network; the input layer feeds a hidden layer, which feeds the output layer producing the outputs y1 and y2.]
A simple NN model consists of three layers, each composed of neurons: the input layer distributes the input to the processing layer, the output layer produces the output, and the processing or hidden layer (sometimes more than one) does the calculation. The neurons in the input layer correspond to the predictor variables, while the neurons in the output layer represent the response variables. Each neuron in a layer is fully connected to the neurons in the next layer. Connections between neurons within the same layer are prohibited. The hidden layer has no connection to the outside world; it calculates only intermediate values. The node-node connection having a weight associated with it is called a synapse. These weights are automatically adapted in the training process to obtain an optimal setting. Each neuron has a transfer function that translates input to output. There are different types of networks. In the feedforward network the signals are propagated from the input layer to the output layer. The input xi of a neuron is the weighted output from the connected neurons in the previous layer. The output yi of a neuron i depends on its input xi and on its transfer function f(xi). A popular transfer function is the sigmoid:
f(xi) = {1 + exp[−(xi − ti)]}⁻¹
where ti is a shift parameter. Good performance of a neural network depends on how well it was trained. The training is done by supervised learning, i.e. iteratively, using a training set with known output values. The most popular technique is called a backpropagation network. In the nth iteration the weights are changed as:
Δwij(n) = η δj yi + α Δwij(n − 1)
where η is the learning rate or step size parameter and α is the momentum term. If the learning rate is too low, the convergence of the weights to an optimal value is too slow. On the other hand, if the learning rate is too high, the system may oscillate. The momentum term is used to damp the oscillation. Function δj is the error for node j. If node j is in the output layer and has the desired output dj, then the error is:
δj = yj (1 − yj)(dj − yj)
If node j is in the hidden layer, then
δj = yj (1 − yj) Σk δk wjk
where k indexes the neurons in the next layer.
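A minimal sketch of one backpropagation update for a network with one hidden layer and sigmoid transfer functions, using the error terms δj defined above (layer sizes, learning rate and data are illustrative assumptions; the momentum term is omitted):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(size=3)                            # input layer (3 neurons)
d = np.array([1.0, 0.0])                          # desired outputs (2 output neurons)

W1 = rng.normal(scale=0.5, size=(4, 3))           # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(2, 4))           # hidden -> output weights
eta = 0.5                                         # learning rate

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# forward pass
h = sigmoid(W1 @ x)                               # hidden layer outputs
y = sigmoid(W2 @ h)                               # output layer outputs

# error terms
delta_out = y * (1 - y) * (d - y)                 # output layer
delta_hid = h * (1 - h) * (W2.T @ delta_out)      # hidden layer

# weight changes
W2 += eta * np.outer(delta_out, h)
W1 += eta * np.outer(delta_hid, x)
```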
neuron [MISC] + neural network b
b Newman-Keuls' test [TEST] -+ hypothesis test
b Newton-Raphson optimization [OPTIM] Gradient optimization using both the gradient vector (first derivative) and the Hessian matrix (second derivative). The scalar valued function f(p), where p is the vector of parameters, is approximated at p_0 by the Taylor expansion:
f(p) ≈ z(p) = f(p_0) + (p - p_0)^T g(p_0) + 0.5 (p - p_0)^T H(p_0)(p - p_0)
where g is the gradient vector and H is the Hessian matrix evaluated at p_0. In each step i of the iterative procedure the minimum of f(p) is approximated by the minimum of z(p):
p_{i+1} = p_i - s_i H^{-1}(p_i) g(p_i)
where the direction is the ratio of the first and second derivatives H^{-1} g and the step size s_i is determined by a linear search optimization. This procedure converges rapidly when p_i is close to the minimum. Its disadvantage is that the Hessian matrix and its inverse must be evaluated in each step. Also, if the parameter vector is not close to the minimum, H may not be positive definite and convergence cannot be reached. When the function f(p) is of a special quadratic form, as in the nonlinear least squares regression problem, the Gauss-Newton optimization offers a simplified computation.
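A minimal numerical sketch of the iteration p_{i+1} = p_i - s_i H^{-1}(p_i) g(p_i) is given below, with a fixed step size s_i = 1 and a simple quadratic test function; both are assumptions made for illustration, and the linear search for s_i is omitted.

```python
import numpy as np

def newton_raphson(grad, hess, p0, tol=1e-8, max_iter=50):
    p = np.asarray(p0, dtype=float)
    for _ in range(max_iter):
        g = grad(p)
        H = hess(p)
        step = np.linalg.solve(H, g)      # H^-1 g without forming the inverse explicitly
        p = p - step                      # fixed step size s_i = 1
        if np.linalg.norm(step) < tol:    # convergence criterion on the step length
            break
    return p

# usage: minimize f(p) = (p1 - 1)^2 + 2 (p2 + 3)^2
grad = lambda p: np.array([2 * (p[0] - 1), 4 * (p[1] + 3)])
hess = lambda p: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newton_raphson(grad, hess, p0=[10.0, 10.0]))   # -> approximately [1, -3]
```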
b nilpotent matrix [ALGE] → matrix
b node [MISC] → graph theory
b noise variable [PREP] → variable
b nominal scale [PREP] → scale
b no-model error rate [CLAS] → classification (o error rate)
b noncentral composite design [EXDE] → design (o composite design)
b nonconforming item [QUAL] → lot
b nonconformity [QUAL] → lot
b non-error rate (NER) [CLAS] → classification (o error rate)
b non-hierarchical clustering [CLUS] (: partitioning clustering) Clustering that produces a division of objects into a certain number of clusters. This clustering results in one single partition, as opposed to hierarchical clustering, which produces a hierarchy of partitions. The number of clusters is either given a priori or determined by the clustering method. The clusters obtained are often represented by their centrotypes or centroids. Non-hierarchical clustering methods can be grouped as: density clustering, graph theoretical clustering, and optimization clustering.
density clustering Non-hierarchical clustering that seeks regions of high density (modes) in the data to define clusters. In contrast to most optimization clustering methods, density clustering produces clusters of a wide range of shapes, not only spherical ones. One of the most popular methods is Wishart's mode analysis, which is closely related to single linkage clustering. For each object mode analysis calculates its distances from all other objects, then selects and averages the distances from its K nearest neighbors. Such an average distance for an object from a dense area is small, while outliers that have few close neighbors have a large average distance.
Another method based on the nearest neighbor density estimation is the Jarvis-Patrick clustering. This is based on an estimate of the density around an object by counting the number of neighbors K that are within a preset radius R of an object. Two objects are assigned to the same cluster if they are on each other's nearest neighbor list (length fixed a priori) and if they have at least a certain number (fixed a priori) of common nearest neighbors. The Coomans-Massart clustering method, also called CLUPOT clustering, uses a multivariate Gaussian kernel in estimating the potential function. The object with the highest cumulative potential value is selected as the center of the first cluster. Next, all members of this first cluster are selected by exploring the neighborhood of the cluster center. After the first cluster and its members are defined and set aside, a new object with the highest cumulative potential is selected to be the center of the next cluster and members of this cluster are selected. The procedure continues until all objects are clustered.
graph theoretical clustering Non-hierarchical clustering that views objects (variables) as nodes of a graph and applies graph theory to obtain clusters of those nodes. The best known such method is the minimal spanning tree clustering. The tree is built stepwise such that, in each step, the link with the smallest distance is added that does not form a cycle in the path. This process is the same as the single linkage algorithm. The final partition with a given number of clusters is obtained by minimizing the distances associated with the links, i.e. cutting the longest links.
optimization clustering Non-hierarchical clustering that seeks a partition of objects into G clusters optimizing a predefined criterion. These methods, also called hill climbing clustering, are based on a relocation algorithm. They differ from each other in the optimization criterion. For a given partition the total scatter matrix T can be partitioned into a within-group scatter matrix W and a between-group scatter matrix B:
T = W + B
where W is the sum of all within-group scatter matrices: W = Σ_g W_g. The eigenvalues of B W^{-1} are λ_j, j = 1, M, where M = min[p, G]. For univariate data the optimal partition minimizes W and maximizes B. For multivariate data a similar optimization is achieved by minimizing or maximizing one of the following criteria:
o error sum of squares
The most popular criterion to minimize is the error sum of squares, i.e. the sum of squared distances between each object and its own cluster centroid or centrotype. This criterion is equivalent to minimizing the trace of W:
min[tr(W)]
Methods belonging to this group differ from each other in their cluster representation, in how they find the initial partition and in how they reach the optimum. The final partition is obtained in an iterative procedure that relocates an object, i.e. puts it into another cluster if that cluster's centrotype or centroid is closer. Methods in which clusters are represented by centrotypes are called nearest centrotype sorting methods, e.g. K-median clustering and MASLOC. G objects are selected from a data set to be the centrotypes of G clusters such that the sum of distances between an object and its centrotype is a minimum. Leader clustering selects the centrotypes iteratively as objects which lie at a distance greater than some preset threshold value from the existing centrotypes. Its severe drawback is the strong dependence on the order of the objects. Methods representing clusters by their centroids are called nearest centroid sorting methods. K-means clustering (also called MacQueen clustering) selects K centroids (randomly or defined by the first K objects) and recalculates them each time after an object is relocated. Forgy clustering is similar to the K-means method, except that it updates the cluster centroids only after all objects have been checked for potential relocation (a small relocation sketch in this style is given after the list of criteria below). Jancey clustering is similar to Forgy clustering, except that the centroids are updated by reflecting the old centroids through the new centroids of the clusters. K-Weber clustering is another variation of the K-means method. Ball and Hall clustering takes the overall centroid as the first cluster centroid. In each step the objects that lie at a distance greater than some preset threshold value from the existing centroids are selected as additional initial centroids. Once G cluster centroids have been obtained, the objects are assigned to the cluster of the closest centroid. ISODATA clustering is the most elaborate of the nearest centroid clustering methods. It obtains the final partition not only by relocation of objects, but also by lumping and splitting clusters according to several user-defined threshold parameters.
o cluster radius
K-center clustering is a nearest centrotype sorting method that minimizes the cluster radii, i.e. minimizes the maximum distance between an object and its centrotype. In the covering clustering method the maximum cluster radius is fixed and the goal is to minimize the number of clusters under the radius constraint.
o cluster diameter
Hansen-Delattre clustering minimizes the maximum cluster diameter
o intra-cluster distances
Schrader clustering minimizes the sum of intra-cluster distances
o determinant of W
Friedman-Rubin clustering minimizes the determinant of the within-group scatter matrix
min[|W|]
It is equivalent to maximizing |T|/|W|, or to minimizing the Wilks' lambda statistic
min[|W|/|T|]
o trace of B W^{-1}
Another criterion suggested by Friedman and Rubin is to maximize Hotelling's trace
max[tr(B W^{-1})] = max[Σ_j λ_j]
o largest eigenvalue of B W^{-1}
Maximize Roy's greatest root criterion
max[λ_1]
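The following minimal sketch illustrates nearest centroid sorting in the Forgy style (centroids updated only after a full pass over the objects) and reports the error sum of squares criterion tr(W). The initialization from the first G objects and all names are assumptions made for illustration.

```python
import numpy as np

def forgy_clustering(X, G, max_iter=100):
    X = np.asarray(X, dtype=float)
    centroids = X[:G].copy()                 # initial centroids: the first G objects
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # assign every object to the cluster of the closest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Forgy update: recompute all centroids only after the full pass;
        # K-means (MacQueen) would update after every single relocation instead
        new_centroids = np.array([X[labels == g].mean(axis=0) if np.any(labels == g)
                                  else centroids[g] for g in range(G)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # error sum of squares: the criterion min[tr(W)]
    ess = sum(((X[labels == g] - centroids[g]) ** 2).sum() for g in range(G))
    return labels, centroids, ess

# usage: labels, centroids, ess = forgy_clustering(data_matrix, G=3)
```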
b nonlinear estimator [ESTIM] → estimator
b nonlinear iterative partial least squares (NIPALS) [ALGE] → eigenanalysis (o power method)
b nonlinear mapping (NLM) [MULT] Mapping method, similar to multidimensional scaling, that calculates a two- or three-dimensional configuration of high-dimensional objects. It tries to preserve relative distances between points in the low-dimensional display space so that they are as similar as possible to the distances in the original high-dimensional space. After calculating the distance matrix in the original space, an initial (usually random) configuration is chosen in the display space. A mapping error is calculated from the two distance matrices (calculated in the original and in the display spaces). Coordinates of objects in the display space are iteratively modified so that the mapping error is minimized. Any monotone distance measure can be used for which the derivative of the mapping error exists, e.g. Euclidean, Manhattan, or Minkowski distances. In order to avoid local optima it has been suggested that the optimization be started from several random configurations. Principal components projection can also be used as a starting configuration. There are several mapping error formulas in the literature; the most popular one is:
E = [1 / Σ_{i<j} d_ij] Σ_{i<j} (d_ij - d'_ij)^2 / d_ij
where d'_ij are distances in the display space, while d_ij are distances in the original space. The various algorithms also differ in the minimization procedure; the most popular is the steepest descent method. There are some drawbacks to NLM: new test points cannot be placed on a previously calculated map; computation is time consuming; the solution is often a local minimum.
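A minimal sketch of nonlinear mapping with the mapping error defined above, minimized by steepest descent on the display coordinates, is shown below. The random starting configuration, the step size and the iteration count are assumptions made for illustration.

```python
import numpy as np

def nlm(X, n_dim=2, n_iter=500, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    iu = np.triu_indices(n, k=1)
    dfull = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # original distances d_ij
    np.fill_diagonal(dfull, 1.0)                                    # avoid division by zero
    d = dfull[iu]
    c = d.sum()
    Y = rng.normal(size=(n, n_dim))                                 # random starting configuration
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        dp = np.linalg.norm(diff, axis=2)                           # display distances d'_ij
        np.fill_diagonal(dp, 1.0)
        # gradient of E = (1/sum d) * sum (d - d')^2 / d  with respect to the display coordinates
        w = (dfull - dp) / (dfull * dp)
        np.fill_diagonal(w, 0.0)
        grad = -2.0 / c * (w[:, :, None] * diff).sum(axis=1)
        Y -= step * grad                                            # steepest descent step
    dp = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)[iu]
    E = ((d - dp) ** 2 / d).sum() / c                               # final mapping error
    return Y, E
```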
b nonlinear partial least squares regression (NLPLS) [REGR] → partial least squares regression
b nonlinear programming [OPTIM] → optimization
b nonlinear regression model [REGR] → regression model
b non-metric data [PREP] → data
b non-metric multidimensional scaling (NMDS) [MULT] → multidimensional scaling
b non-metric scale [PREP] → scale
b nonparametric classification [CLAS] → parametric classification
b nonparametric estimator [ESTIM] → estimator
b nonprobabilistic classification [CLAS] → probabilistic classification
b nonsingular matrix [ALGE] → matrix
b norm [ALGE] Measure of the size of a vector or a matrix, denoted ||·||. It is a function that assigns a scalar to a vector or a matrix.
matrix norm Measure of the size of a matrix X. The most frequently used matrix norms are:
o Frobenius norm
||X||_F = [Σ_i Σ_j x_ij^2]^(1/2)
o p-norm
||X||_p = sup_{v≠0} ||X v||_p / ||v||_p
where v is a nonzero vector. When p = 2, this norm is called the two-norm, and is equal to the largest singular value of X.
vector norm Measure of the size of a vector x. The most frequently used vector norm is the p-norm:
||x||_p = [Σ_j |x_j|^p]^(1/p)
When p = 2 this norm is called the Euclidean norm; when p = ∞ the norm reduces to
||x||_∞ = max_j |x_j|
b normal distribution [PROB] → distribution
b normal equation [ALGE] A set of simultaneous equations obtained in calculating a least squares estimator. For example, the normal equations for the coefficient and intercept of a single regression y = a + b x are:
Σ_i y_i = n a + b Σ_i x_i
Σ_i x_i y_i = a Σ_i x_i + b Σ_i x_i^2
which give
b = [n Σ_i x_i y_i - (Σ_i x_i)(Σ_i y_i)] / [n Σ_i x_i^2 - (Σ_i x_i)^2]
a = [Σ_i y_i - b Σ_i x_i] / n
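A small numeric check of these expressions (with invented data) is shown below; the matrix form (X^T X) b = X^T y gives the same solution.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# solve the two normal equations for a and b
b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
a = (y.sum() - b * x.sum()) / n

# equivalent matrix form: (X^T X) beta = X^T y
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose([a, b], beta)
```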
b normal kernel [ESTIM] → kernel
b normal probability plot [GRAPH] → quantile plot
b normal process [TIME] → stochastic process
b normal residual plot [GRAPH] → quantile plot
b normality assumption [PROB] The assumption underlying many statistical models and tests, that requires a variable, e.g. the model error, to follow a normal distribution. For example, in ordinary least squares regression and discriminant analysis the error is assumed to be normally distributed with zero mean and equal variance, and to be uncorrelated. The most often used techniques for verifying the normality assumption are D'Agostino's test and the normal probability plot.
b normalized Hamming coefficient [GEOM] → distance (o binary data)
b notched box plot [GRAPH] → box plot
b null hypothesis [TEST] → hypothesis testing
b null matrix [ALGE] → matrix
b number of factors [FACT] → rank analysis
b numerical data [PREP] → data
b numerical error [OPTIM] Error due to numerical approximations in calculating a number from mathematical functions or in the representation of the number itself. When iterative procedures are used, error propagation can also occur, i.e. at a given stage of the calculation part of the error derives from the error at a previous stage. The propagation of experimental and numerical errors can be examined by repeating the calculation with slightly perturbed data. Several types of numerical errors can limit the accuracy of the results:
discretization error Error due to substituting integrals with finite sums.
roundoff error Error due to the finite representation of the digits of a number x. Digital computers store the digits of a number as a finite sequence of discrete physical quantities. For example, in floating-point notation, the number is represented by a finite number t of nonzero digits before the decimal point (mantissa) and a finite number e which indicates the position of the decimal point with respect to the mantissa (exponent):
x = t · 10^e
The numbers t and e (together with the basis of the exponent, 10 or 2) determine the subset M of real numbers R, called machine numbers, M ⊂ R, which can be represented exactly in a given machine. Thus, any number x ∉ M is approximated by a machine number rd(x) ∈ M satisfying
|x - rd(x)| ≤ |x - r|   for all   r ∈ M
The quantity |x - rd(x)| is the round-off error. This type of error can obviously occur not only in representing the final results, but also during intermediate calculations.
truncation error Error due to substituting infinite series with a finite number of terms of the series.
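A small illustration (not from the book) of round-off and truncation error in Python:

```python
import math

# round-off: 0.1 has no exact binary floating-point representation
print(0.1 + 0.2 == 0.3)            # False
print(abs((0.1 + 0.2) - 0.3))      # approx. 5.6e-17, the round-off error

# truncation: approximate exp(1) = sum 1/m! by the first k terms of the series
def exp1_truncated(k):
    return sum(1.0 / math.factorial(m) for m in range(k))

for k in (3, 6, 10):
    print(k, abs(math.e - exp1_truncated(k)))   # the truncation error shrinks with k
```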
b numerical optimization [OPTIM] : optimization
b numerical taxonomy [CLUS] : cluster analysis
0 b object [PREP] (; data point, item, observation, unit) The basic unit in data analysis, e.g. an individual, an animal, a molecule, a food sample. Each object is described by one or more measurements, called data. A collection of objects constitutes a data set. Each object corresponds to a row of the data matrix. The ratio between the number of objects n and the number of variables p is called the object-variable ratio n/p. If this ratio is smaller than one the data set is an underdetermined system, while a data set with a ratio higher than one is a well-determined system. In the case of an underdetermined system one must apply biased methods to mitigate the high variance of estimates calculated from such a data set.
b objective function [OPTIM] → optimization
b object-variable interaction [FACT] → correspondence factor analysis
b object-variable ratio [PREP] → object
b oblimax rotation [FACT] → factor rotation
b oblimin rotation [FACT] → factor rotation
b oblique rotation [FACT] → factor rotation
b observation [PREP] : object
b off-diagonal element [ALGE] → matrix
b off-line quality control [QUAL] → Taguchi method
b offset [REGR] → regression coefficient
b one-sided test [TEST] + hypothesis testing
one-variable-at-a-time (OVAT) [MULT] + multivariate b
b one-way analysis of variance [ANOVA] The simplest analysis of variance model, containing one single effect. The total variance has two components: variance due to the effect, and a random component. The assumption is that there are I normal populations (where I is the number of levels the effect can assume) with different means but common variances, from which samples of size K are drawn. In this model the total variance of the data is partitioned into two parts: the sum of squares of differences between treatment means and the grand mean (often called the between sum of squares); and the sum of squares of differences between observations and their own treatment means (often called the within sum of squares or error sum of squares):
Σ_i Σ_k (y_ik - ȳ..)^2 = K Σ_i (ȳ_i. - ȳ..)^2 + Σ_i Σ_k (y_ik - ȳ_i.)^2      i = 1, I    k = 1, K    n = I K
The first term measures the effect of treatments, the second term is due to random error. The total sum of squares has n - 1 degrees of freedom, the between sum of squares has I - 1 degrees of freedom and the within sum of squares has n - I degrees of freedom. The one-way ANOVA model is used, for example, in regression diagnostics to partition the total variance into variance due to the regression model and variance due to random error. In this case the number of levels is substituted by the number of parameters of the regression model.
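A minimal sketch of this decomposition for invented data with I = 3 levels and K = 4 replicates:

```python
import numpy as np

y = np.array([[5.1, 4.9, 5.3, 5.0],     # level 1
              [6.2, 6.0, 6.4, 6.1],     # level 2
              [4.2, 4.4, 4.1, 4.5]])    # level 3
I, K = y.shape
n = I * K
grand_mean = y.mean()
treatment_means = y.mean(axis=1)

ss_between = K * ((treatment_means - grand_mean) ** 2).sum()   # I - 1 degrees of freedom
ss_within = ((y - treatment_means[:, None]) ** 2).sum()        # n - I degrees of freedom
ss_total = ((y - grand_mean) ** 2).sum()                       # n - 1 degrees of freedom
assert np.isclose(ss_total, ss_between + ss_within)

F = (ss_between / (I - 1)) / (ss_within / (n - I))             # F statistic for the effect
```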
on-line quality control [QUAL] + Taguchi method b
b open data [PREP] + data (o closed data)
operating characteristic [TEST] + hypothesis testing b
b operating characteristic curve (OCC) [QUAL] Plot of the type II error against the process quality. It indicates the probability of accepting a lot if it contains a certain percentage of defective items estimated from a sample. The probability of accepting the lot with no defective items is one. As the number of defective items increases the probability of acceptance decreases asymptotically to zero. The shape of the OC curve depends on the sampling plan and on the distribution assumed.
b operations research (OR) [MISC] Application of mathematical tools based on objective and quantitative criteria, where necessary with given constraints, to system organization, decision making and system control in order to provide the best solutions. Techniques most commonly used in OR are: graph theory, game theory, linear programming, optimization, simulation.
optimal design [EXDE] + design b
b optimization [OPTIM] (: numerical optimization) Finding the optimal value (minimum or maximum) of a numerical function f, called the objective function, with respect to a set of parameters p, f(p_1, ..., p_p). If the values that the parameters can take on are constrained, the procedure is called a constrained optimization.
(Figure: a one-dimensional function of p with a global maximum, a local maximum, a local minimum and a global minimum.)
At stationary points of a function, i.e. at minimum (global or local), maximum (global or local) or saddle points the first (partial) derivatives are zero:
∂f/∂p_1 = ... = ∂f/∂p_p = 0
At a minimum the second derivative matrix (the Hessian matrix) is positive definite: x^T H(p) x > 0 for any nonzero vector x.
Optimization procedures are iterative, i.e. in each step i the solution, the estimated values of the parameters, represents an improved approximation of the true parameter. Whether a procedure converges to a global or local optimum depends on the initial guess about the parameter values. Various convergence criteria can be used to decide when a procedure reaches the optimum. In each iteration step i a new set of parameters is calculated as
p_{i+1} = p_i + s_i d_i
where s_i is the step size and d_i (a p-dimensional vector) specifies the step direction to be taken in moving from p_i to p_{i+1}. The optimization methods differ from each other in the way they determine d_i and s_i. In direct search optimization the choice of the step direction and step size depends only on the values of the function, while in gradient optimization they are calculated from derivatives. The problem of optimization is often referred to as mathematical programming. If both the objective function and the constraints are linear functions of the unknown parameters, the optimization is called linear programming, otherwise it is called nonlinear programming. An important problem in optimization, one which must be dealt with, is numerical error.
b optimization clustering [CLUS] → non-hierarchical clustering
b order of a matrix [ALGE] → matrix
b order of a model [MODEL] → model
b order statistic [DESC] → rank
b ordinal scale [PREP]
→ scale
b ordinary least squares regression (OLS) [REGR] (: least squares regression, linear least squares regression, multiple least squares regression, multivariate least squares regression) The most popular regression method in which the model is linear in both the parameters and the predictors
y = X b + e
and the regression coefficients b are calculated with the least squares estimator solving the normal equations:
b̂ = (X^T X)^{-1} X^T y
These coefficients are estimated by minimizing the residual sum of squares or, equivalently, by maximizing the correlation between the linear combination of the predictors and the response:
max_b [corr^2(y, X b)]
OLS is BLUE, i.e. it is an unbiased linear estimator that has the smallest variance among all unbiased linear estimators if the normality assumption on the residuals holds and if the true relationship is linear. If the normality assumption is violated the least squares estimator is not necessarily the optimal unbiased one. Generalized least squares regression can be applied to deal with heteroscedasticity and correlated errors. The bias and variance of the estimated regression coefficients are:
bias(b̂) = 0        var(b̂) = s^2 Σ_m 1/λ_m
where s is the residual standard deviation and λ_m is the mth eigenvalue of (X^T X). This method calculates an unbiased model; however, it is suitable only for well-determined systems and is very sensitive to outliers. The estimated response and any statistic related to it is scale invariant. It is a linear estimator, therefore the cross-validated residuals can be calculated as:
e_i(cv) = e_i / (1 - h_ii)
where h_ii are the diagonal elements of the hat matrix.
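A minimal sketch of OLS through the normal equations, including the hat matrix and the cross-validated residuals e_i/(1 - h_ii); the simulated data are an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # column of ones for the intercept
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)        # least squares coefficients
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
e = y - X @ b                                # ordinary residuals
h = np.diag(H)
e_cv = e / (1 - h)                           # cross-validated (leave-one-out) residuals
press = (e_cv ** 2).sum()                    # predictive residual sum of squares
```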
b ordinary residual [REGR] → residual
b ordinary residual plot [GRAPH] → scatter plot (o residual plot)
b orthoblique rotation [FACT] → factor rotation
b orthogonal design [EXDE] → design
b orthogonal matrix [ALGE] → matrix
b orthogonal matrix transformation [ALGE] (: orthogonalization) Transformation of a square matrix X into an upper triangular matrix R via multiplication with an orthogonal matrix Q:
R = Q^T X
The above transformation can also be viewed as a Q-R decomposition of X:
X = Q R
Each method expresses the columns of X in terms of a new orthonormal basis, contained in the columns of Q. Orthogonal matrix transformations, for example, afford an easier solution to the least squares regression problem and provide a numerically more stable solution. The orthogonal matrix Q can be calculated in several ways.
Givens transformation Orthogonal matrix transformation that brings X to an upper triangular form through a series of orthogonal transformations, i.e. the transformation matrix Q is a product of orthogonal matrices G_ij. In contrast to the Householder transformation, each step sets to zero only one sub-diagonal element, i.e. G_ij operates on column j of X, annihilating element i in that column. The matrix G_ij differs from the identity matrix in only four elements:
G_ii = G_jj = c        G_ij = -G_ji = s
with c = cos θ and s = sin θ chosen so that element i of column j becomes zero. Although the Householder transformation is preferred for general use, because fewer computations are required and slightly better results are obtained, the Givens transformation is recommended for large, very sparse (many elements are zero) X matrices.
Gram-Schmidt transformation Orthogonal matrix transformation that brings X to an upper triangular form. In contrast to the Householder transformation, which does not calculate Q explicitly, but only the matrices H_j, here in each step one calculates a new column of Q, orthogonal to the previous ones.
Householder transformation The most important orthogonal matrix transformation that brings X to an upper triangular form through a series of orthogonal transformations, i.e. the transformation matrix Q is a product of p orthogonal matrices H_j. The matrix H_j is symmetric, orthogonal and operates on column j of X, annihilating all sub-diagonal elements:
H_j = I - 2 v v^T / (v^T v)
This transformation is used in many algorithms, e.g. in bidiagonalization or tridiagonalization of a matrix before performing an eigenanalysis.
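A minimal sketch of a Householder Q-R decomposition built from the reflectors H_j = I - 2vv^T/(v^T v); pivoting and rank checks are omitted, and the test matrix is an assumption made for illustration.

```python
import numpy as np

def householder_qr(X):
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    R = X.copy()
    Q = np.eye(n)
    for j in range(min(n - 1, p)):
        v = R[j:, j].copy()
        v[0] += np.sign(v[0] if v[0] != 0 else 1.0) * np.linalg.norm(v)
        if np.allclose(v, 0):
            continue
        Hj = np.eye(n)
        Hj[j:, j:] -= 2.0 * np.outer(v, v) / (v @ v)
        R = Hj @ R            # zero the sub-diagonal part of column j
        Q = Q @ Hj            # accumulate the orthogonal factor
    return Q, R

# usage: check Q R = X and the orthogonality of Q
X = np.random.default_rng(1).normal(size=(5, 3))
Q, R = householder_qr(X)
assert np.allclose(Q @ R, X) and np.allclose(Q.T @ Q, np.eye(5))
```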
b orthogonal rotation [FACT] → factor rotation
b orthogonal vector transformation [ALGE] Let G be an orthogonal matrix of rank p and n_j an orthonormal basis of p vectors. Then
gj = Gnj
forms a new orthonormal basis. Any vector x can be represented in terms of both the old and the new basis as
x = Σ_j x_j n_j = Σ_j x'_j g_j
where x_j indicates the old coordinates and x'_j the new transformed coordinates.
b orthogonal vectors [ALGE] → vector
orthogonalization [ALGE]
: orthogonal matrix transformation
orthomax rotation [FACT] + factor rotation b
b orthonormal basis [ALGE] A set of p vectors that are orthogonal vectors and have their norm equal to one:
n_j^T n_k = 0   if   j ≠ k    and    ||n_j|| = 1    j = 1, p
Without imposing the orthogonality and normality restrictions, any set of linearly independent vectors n_1, ..., n_p, such that each vector in the space can be written as a linear combination of n_1, ..., n_p, is simply called the basis. The minimum number of vectors needed to form a basis is equal to the dimension of the vector space. A p-dimensional vector x can be represented as
x = Σ_j x_j n_j
where x_j is called the jth coordinate. The orthonormal basis is used, for example, in orthogonal vector transformation.
b outlier [DESC] Observation that is separated from the bulk of the data. Outliers in the sample may distort statistics calculated from such a sample. Outliers are often observations from a population different from the sampled one. Their effect on an estimator is measured by the breakdown point and the influence curve. Outliers must be detected and tested to determine whether they should be discarded before modeling. Influence analysis detects outliers in the predictor space, called leverage points. Robust estimators are often used to mitigate the distorting effect of potential outliers.
b output layer [MISC] → neural network
b outranking method [QUAL] → multicriteria decision making
b overfitting [MODEL] → model fitting
b overlay plot [QUAL] → multicriteria decision making
b p-chart [QUAL] → control chart (o attribute control chart)
b p-norm [ALGE] → norm
parameter design [QUAL] Design for investigating the response of a process which is independent of or at least insensitive to different sources of variation. This design is different from the experimental design, in which different sources of variation are investigated. The goal of the parameter design is to model a response by making it independent from different sources of variation rather than controlling those sources. b
b parameter estimation [ESTIM] → estimation
b parametric classification [CLAS] Classification where the classification rule is expressed as a function of a relatively small number of parameters of the class density functions, which are usually assumed to be normal (e.g. DASCO, LDA, QDA, RDA, SIMCA, UNEQ). Its opposite is nonparametric classification, where no particular form of class density function is assumed and it is estimated directly from the training set (e.g. ALLOC, CART, KNN).
b Pareto analysis [QUAL] → quality cost
b Pareto chart [QUAL] → quality cost
b Pareto distribution [PROB] → distribution
b Pareto optimality criterion [QUAL] → multicriteria decision making
b Parks distance [GEOM] → distance (o mixed type data)
b parsimonious model [MODEL] → model
b partial confounding [EXDE] → confounding
b partial correlation [DESC] → correlation
b partial factorial design [EXDE] → design
b partial least squares regression (PLS) [REGR] Biased regression method that relates a set of predictor variables X to a set of response variables Y. A least squares regression is performed on a set of uncorrelated, so called latent variables T that are standardized linear combinations of the original predictor variables. PLS models the response(s) as
Y = T B Q^T + E
where B is a diagonal matrix containing the least squares coefficients of the latent variables, Q contains the weights of the responses and E is the error matrix.
The latent variables are calculated one at a time, maximizing their covariance:
max [corr^2(u_m, t_m) var(u_m) var(t_m)]
where
t_m = X w_m    and    u_m = Y q_m        m = 1, M
The connections among the various matrices of the PLS algorithm are shown in the following scheme. A PLS component m is calculated as:
1. Initialize:           u_m = y_1 (a column of Y)
2. X weight:             w_m = X^T u_m        scaled to length 1
3. X latent variable:    t_m = X w_m
4. Y weight:             q_m = Y^T t_m        scaled to length 1
5. U latent variable:    u_m* = Y q_m
6. Check convergence:    if ||u_m* - u_m|| ≤ ε then 7., else u_m = u_m* and 2.
7. Inner relation:       b_m = u_m^T t_m / (t_m^T t_m)
8. X loadings:           p_m = X^T t_m / (t_m^T t_m)
9. X residuals:          X = X - t_m p_m^T
10. Y residuals:         Y = Y - b_m t_m q_m^T
The bias-variance trade-off is controlled by the number of latent variables M: the more latent variables that are included, the larger is the variance and the smaller is the bias of the estimated regression coefficients. As the number of latent variables increases (max[M] = p), the PLS model converges to the OLS model. The optimal number of components must be determined by a model selection criterion, e.g. cross-validation.
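A minimal sketch of the scheme above for a single response (PLS1) is given below; with one response the inner iteration (steps 2-6) converges in a single pass and the Y weight coincides with the inner relation coefficient. The simulated data and all names are assumptions made for illustration.

```python
import numpy as np

def pls1(X, y, M):
    X = np.asarray(X, dtype=float).copy()
    y = np.asarray(y, dtype=float).copy()
    W, T, P, Q = [], [], [], []
    for _ in range(M):
        w = X.T @ y                      # 2. X weight from u = y
        w /= np.linalg.norm(w)           #    scaled to length 1
        t = X @ w                        # 3. X latent variable
        q = (y @ t) / (t @ t)            # 4./7. Y weight = inner relation coefficient here
        p = X.T @ t / (t @ t)            # 8. X loadings
        X -= np.outer(t, p)              # 9. X residuals
        y -= q * t                       # 10. Y residuals
        W.append(w); T.append(t); P.append(p); Q.append(q)
    return np.array(W), np.array(T), np.array(P), np.array(Q)

# usage: the optimal number of components M would be chosen by cross-validation
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, 0.5, 0.0, -1.0, 2.0]) + rng.normal(scale=0.1, size=20)
W, T, P, Q = pls1(X, y, M=2)
```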
A modification of the PLS regression that permits the calculation of the latent variables as a linear combination of a subset of the predictor variables is called intermediate least squares regression (ILS). A special case of ILS is the forward selection method, i.e. when a latent variable consists of only one predictor variable. An extension of the two-block PLS is the multiblock PLS model in which several predictor and response blocks are connected by regressing their latent variables on each other according to a prespecified path. There are several nonlinear extensions of the linear PLS model nonlinearizing the inner relationship, i.e. assuming a nonlinear function among the latent variables:
u_m = f(t_m) + e
The quadratic partial least squares regression (QPLS) fits a second-order polynomial: f(t_m) = a_0 + a_1 t_m + a_2 t_m^2. The nonlinear partial least squares regression (NLPLS) applies the smoother to estimate the above nonlinear functions. The spline partial least squares regression (SPLS) uses a nonadaptive bivariate regression spline to approximate f. PLS is used extensively in the context of experimental design. The generating optimal linear PLS estimation (GOLPE) method performs a variable selection on the basis of D-optimal design in the loading space to obtain the best predictive model. The computer aided response surface optimization (CARSO) method uses PLS to fit a linear or quadratic response surface.
b partial pivoting [ALGE] + Gaussian elimination
b
partitioning clustering [CLUS]
: non-hierarchical clustering b partitioning of a matrix [ALGE] + matrix operation b Pascal distribution [PROB] + distribution
b path [MISC] + graph theory
b path analysis [MULT] Method for solving a causal model, i.e. studying patterns of causation among a set of variables, in which certain variables are viewed as a cause, others as an effect.
This is a special case of LISREL. The analysis is based on the following assumptions: - relationships are linear and additive; - all error terms are uncorrelated with each other; - models are recursive, only one-way causal flow is allowed; - endogenous variables are measured on an interval or ratio scale; - manifest variables are measured without error; - the model is properly specified, with all causal relationships included. Each equation is treated separately, i.e. errors in estimating the parameters of one equation do not affect the estimation of parameters of other equations. The parameters of the model are usually estimated by least squares. Path coefficients, calculated from the regression coefficients, indicate the effect of a causal variable on an effect variable. A path diagram is often used to enhance the analysis graphically. b path diagram [MULT] Graphical representation of the relationship among variables in a causal model. The diagram shown below corresponds to the following equations:
η_1 = β_1 η_2 + γ_1 ξ_1 + ζ_1
η_2 = β_2 η_1 + ζ_2
x_1 = λ_1 ξ_1 + δ_1
x_2 = λ_2 ξ_1 + δ_2
y_1 = λ_3 η_1 + ε_1
y_2 = λ_4 η_2 + ε_2
y_3 = λ_5 η_2 + ε_3
(Figure: the corresponding path diagram linking ξ_1, η_1 and η_2 to the manifest variables x_1, x_2, y_1, y_2, y_3.)
The conventional notation is the following:
- x: manifest exogenous variable
- y: manifest endogenous variable
- ξ: latent exogenous variable
- η: latent endogenous variable
- β: effect of endogenous variable on endogenous variable
- γ: effect of exogenous variable on endogenous variable
b quadratic form [ALGE] A function Q(x) = x^T A x of a vector x, where A is a symmetric matrix; it is called the positive definite quadratic form if Q(x) > 0, and the semi-positive definite quadratic form if Q(x) ≥ 0. For example, the Mahalanobis distance is a quadratic form, where A is often the covariance matrix and x is the vector of the coordinate differences. Canonical analysis converts a quadratic form into a weighted sum of squares form without cross-product terms, called the canonical form. The procedure consists of centering and of rotating the axes in order to decorrelate them.
b quadratic partial least squares regression (QPLS) [REGR] → partial least squares regression
quadratic regression model [REGR] + regression model b
b quadratic score [CLAS] → classification
b qualitative data [PREP] → data
b qualitative variable [PREP] → variable
b quality characteristic [QUAL] → lot
b quality control (QC) [QUAL] (: statistical quality control) Statistical analysis of data collected to measure the quality characteristics of the product, compare them with established standards and take action to ensure the required product quality. The primary goal is the systematic reduction of variability in the quality characteristics. A process can be characterized by the operating characteristic curve and the process capability ratio, monitored by a control chart, investigated by parameter design. The quality of a lot can be estimated by acceptance sampling; the type I error is measured by the producer's risk. Defects of different severity are dealt with in a demerit system. Multicriteria decision making optimizes more than one quality criterion. The cost of producing conforming products is quantified by the quality cost. The Taguchi method is a novel approach to quality control.
b quality control chart [QUAL] : control chart
b quality cost [QUAL] Cost associated with producing, identifying, avoiding, or repairing products that do not conform with the standards. There are four categories of quality cost. The prevention cost is associated with efforts in design and manufacturing to prevent nonconformance. The appraisal cost is associated with measuring and evaluating products, components, and materials to ensure conformance. The internal failure cost is associated with nonconforming products, components, and materials discovered prior to their delivery to the customers. The external failure cost is associated with nonconforming products delivered to customers. Pareto analysis is a quality-cost analysis aiming at cost reduction by identifying quality costs by category, by product, or by type of defect. A Pareto distribution of percentage defective items is calculated and plotted as a bar plot, called a Pareto chart, e.g. by department, by machine, by operator, etc.
Conventionally, the bars are arranged in decreasing order. Usually, a few items on the left represent a substantial amount of the total.
quantal variable [PREP] + variable b
b quantile [DESC] (: fractile) A value within the range of a variable which divides the data set into two groups, such that the fraction of the observations specified by the quantile falls below and the complement fraction falls above. For example, the quantile Q(0.8) indicates a value of a variable for which 80% of the observations lie below and 20% lie above. The most often used quantiles are:
quartiles divide the range into quarters: Q(0.25) Q(0.5) Q(0.75). The jth quartile is the j(n + 1)/4th ordered value.
deciles divide the range into tenths: Q(0.1) Q(0.2) Q(0.3) Q(0.4) Q(0.5) Q(0.6) Q(0.7) Q(0.8) Q(0.9). The jth decile is the j(n + 1)/10th ordered value.
percentiles divide the range into hundredths: Q(0.01) Q(0.02) Q(0.03) ... Q(0.97) Q(0.98) Q(0.99). The jth percentile is the j(n + 1)/100th ordered value.
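A small helper (illustrative, not from the book) implementing the j(n + 1)-type rule quoted above, with linear interpolation between neighbouring ordered values:

```python
def quantile(data, f):
    """Quantile Q(f), 0 < f < 1, using the (n + 1) f ordered-value rule."""
    x = sorted(data)
    n = len(x)
    h = f * (n + 1)                      # position among the ordered values (1-based)
    k = int(h)
    if k < 1:
        return x[0]
    if k >= n:
        return x[-1]
    return x[k - 1] + (h - k) * (x[k] - x[k - 1])

x = [4.1, 2.3, 5.0, 3.3, 7.2, 6.8, 1.9, 4.4, 5.5]
q1, med, q3 = quantile(x, 0.25), quantile(x, 0.5), quantile(x, 0.75)
iqr = q3 - q1                            # interquartile range
```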
(Figure: a one-dimensional dot display marking the lower quartile Q1, the median Q2, the upper quartile Q3 and the interquartile range.)
The three quartiles are important in measuring scale and location:
- the first quartile Q1 = Q(0.25) is called the lower quartile
- the second quartile Q2 = Q(0.5) is the median, a robust measure of location
- the third quartile Q3 = Q(0.75) is called the upper quartile
The distance between the upper and the lower quartiles is called the interquartile range (Q3 - Q1), which indicates the spread of the bulk of the data. The half-interquartile range 0.5 (Q3 - Q1), also called the quartile deviation, is a robust measure of dispersion. Quantiles are used, for example, in quantile plots and in robust estimators.
b quantile function [PROB] → random variable
b quantile plot [GRAPH] Plot of the quantiles of a distribution against the corresponding fractions. The scale of the horizontal axis of the fraction ranges between 0 and 1. The scale of the vertical axis of the quantile is the scale of the variable. Except for the labeling of the horizontal axis, this plot is the same as x_i vs. i, when the x_i values are ordered.
(Figure: a quantile plot; the horizontal axis shows the fractions of data from 0.0 to 1.0.)
The plot of the quantiles of two distributions against each other is called the quantile-quantile plot. If both quantiles are from an empirical distribution, it is called an empirical quantile-quantile plot. If the number of observations is equal for the two empirical distributions, the empirical quantile-quantile plot is simply a plot of one sorted variable against the other. Data points lying along a straight line indicate distributions of similar shape. The intercept of the line indicates the difference in location, the slope shows the difference in scale. When the quantiles on the horizontal axis come from a theoretical distribution the plot is called a theoretical quantile-quantile plot, or more commonly, a probability plot. When the normal distribution is used, this plot is called a normal probability plot. This plot is very effective for testing the assumption of normality on a variable. Departure from a straight line indicates nonnormality; an opposite curvature at the ends indicates long or short tails; a convex or concave curvature is related to asymmetry. The normal residual plot is a theoretical quantile-quantile plot of the quantiles of residuals (obtained, for example, from a regression model and usually standardized or Studentized) against the corresponding quantiles from a normal distribution. Deviation from the normal straight line can result either from nonnormal residuals or from model misspecification, e.g. a nonlinear relationship described by a linear model. b
b quantile-quantile plot [GRAPH] → quantile plot
b quantitative data [PREP] → data
b quantitative variable [PREP] → variable
b quartile [DESC] → quantile
b quartile coefficient of skewness [DESC] → skewness
b quartile deviation [DESC] → dispersion
b quartimax rotation [FACT] → factor rotation
b quartimin rotation [FACT] → factor rotation
b quasi-Newton optimization [OPTIM] : variable metric optimization
b Quenouille's test [TEST] → hypothesis test
R
b R2 [MODEL] → goodness of fit
b R-analysis [MULT] → multivariate analysis
b R-chart [QUAL] → control chart (o variable control chart)
b R estimator [ESTIM] → estimator
b Rajski's coherence coefficient [GEOM] → distance (o ranked data)
b Rajski's distance [GEOM] → distance (o ranked data)
b random effect [ANOVA] → term in ANOVA
b random effect model [ANOVA] → term in ANOVA
b random factor [EXDE] → factor
b random number [PROB] → random variable
b random process [TIME] : stochastic process
b random sample [PROB] → population
b random sampling [PROB] → sampling
b random series [PROB] → simulation
b random variable [PREP] → variable
b random variable [PROB] Function that maps events into a set of values. The variate, denoted by X, is a generalization of the random variable. It also has probabilistic properties, but is defined without reference to a particular type of probabilistic experiment. A variate is the set of all random variables that follow a particular probabilistic law. A multivariate is a vector of variates. A random number x associated with a given variate is a number indicating a realization of a random variable belonging to that variate, e.g. X = x. The probability of a variate taking on a value less than or equal to x is denoted P[X ≤ x]. The set of all random numbers that a variate can take is called the range of the variate. The set of all values that P[...] can take is called the probability domain of the variate. There are several functions associated with a variate. The distribution function or cumulative distribution function (cdf) of a variate, denoted F(x), maps the range of the variate into the probability domain of the variate:
F(x) = P[X ≤ x] = α      0 ≤ α ≤ 1
F(x) is the probability that the variate takes a value less than or equal to x. The function F(x) is nondecreasing in x and takes the value one at the maximum of x. The inverse distribution function or quantile function of a variate, denoted G(α), maps the probability domain into the range of the variate:
G(α) = x = G(F(x))
G(α) is the random value x, such that the probability that the variate takes a value less than or equal to x is α.
The survival function or reliability function of a variate, denoted S(x), is the complement function to F(x), i.e. it is the probability that the variate takes a value greater than x:
S(x) = P[X > x] = 1 - α = 1 - F(x)
The inverse survival function of a variate, denoted Z(α), is the random number that is exceeded by the variate with probability α:
Z(α) = x = Z(S(x)) = G(1 - α)
The probability density function (pdf) or density function or frequency function of a variate, denoted f(x), is the first derivative of the distribution function F(x) with respect to x:
f(x) = dF(x)/dx
Its integral over the range x_L to x_U is equal to the probability that the variate takes a value in that range:
∫ from x_L to x_U of f(x) dx = P[x_L < X ≤ x_U]
If the variate is discrete, the function f(x) is called the probability function (pf). It is the probability that the variate takes on the value x:
f(x) = P[X = x]
The hazard function or failure function of a variate, denoted h(x), is the ratio of the probability density function to the survival function at x:
h(x) = f(x) / S(x)
Moments are also important functions of a variate. The following terminology may characterize a distribution:
asymmetric distribution Distribution without a central value μ such that f(x - μ) = f(μ - x).
asymptotic distribution The form of a distribution as the sample size tends to infinity. The form of the distribution is called asymptotic normal distribution when it approaches the normal distribution as the sample size tends to infinity.
bell-shaped distribution Distribution in which the density function has a shape that is similar to the contour of a bell. Bell-shaped distributions are symmetric. For example, the normal distribution is a bell-shaped distribution.
(Figure: a bell-shaped density curve.)
bimodal distribution Distribution with a density function having two modes, i.e. two maximums. It is often the result of mixing two unimodal distributions.
conditional distribution Distribution of a set of variates at fixed values of another set of variates. contaminated distribution Mixture of normal distributions with identical locations but different dispersions. Contamination, i.e. observations from a normal distribution with a larger dispersion than the base distribution, causes a heavy (long) tail in the base distribution. continuous distribution Distribution that is a function of one or more variates measured on a proportional or ratio scale. discrete distribution Distribution that is a function of one or more variates measured on a nominal or ordinal scale. empirical distribution Distribution that is a function of a variate describing a sample. J-shaped distribution Distribution in which the density function has a shape that is similar to a letter J or a reversed J. J-shaped distributions are skewed.
(Figure: a J-shaped density curve.)
joint distribution Distribution of two or more variates (especially used for two variates). marginal distribution Unconditional distribution of a single variate in a multivariate distribution. It does not depend on the values taken by the other variates.
multivariate distribution Joint distribution of several variates.
symmetric distribution Distribution with a central value μ, such that f(x - μ) = f(μ - x).
theoretical distribution Distribution that is a function of a variate describing a population.
U-shaped distribution Distribution in which the density function has a shape that is similar to a letter U.
(Figure: a U-shaped density curve.)
univariate distribution Distribution of a single variate. b random walk [TIME] Stochastic process defined as:
x(t) = x(t - 1) + a(t)
where a(t) is an i.i.d. random variable with zero mean and variance σ_a². The mean function of x(t) is μ(t) = 0. Because its autocovariance function
γ(t) = t σ_a²
increases linearly with time, the random walk is nonstationary. Its autocorrelation function is
ρ(t, t + k) = [t / (t + k)]^(1/2)
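A small simulation (not from the book) showing the nonstationarity: the variance of x(t) across replicates grows roughly linearly with t, as γ(t) = t σ_a² predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep, n_t, sigma_a = 2000, 100, 1.0
a = rng.normal(scale=sigma_a, size=(n_rep, n_t))   # i.i.d. innovations a(t)
x = a.cumsum(axis=1)                               # x(t) = x(t-1) + a(t), starting from x(0) = 0

for t in (10, 50, 100):
    print(t, x[:, t - 1].var())                    # close to t * sigma_a^2
```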
b randomization [EXDE] Random assignment of runs to treatments and random ordering of the runs. Randomization is either complete, i.e. runs are randomized along the whole design matrix, or runs are randomized within a block. Randomization is performed to eliminate unforeseen bias and to cancel correlation between adjacent runs. Randomization helps to ensure that the experimental error is an independently distributed random variable.
b randomized block design [ANOVA] Analysis of variance model in which the total sum of squares is composed of the sum of squares due to effects, the sum of squares due to blocking and the sum of squares due to random error. The simplest randomized block design model has one effect A with I levels and one blocking variable B with J levels:
y_ij = μ + A_i + B_j + e_ij      i = 1, I    j = 1, J    n = I J
There is one observation per treatment in each block. Because the only randomization of treatments is within blocks, the blocks represent a restriction on randomization. This model is additive, i.e. there is no interaction between the effects and blocks. The treatment and block effects are defined as deviations from the grand mean, so that
Σ_i A_i = 0    and    Σ_j B_j = 0
When the effects and blocks are fixed, the expected mean squares are:
EMS_A = σ² + J Σ_i A_i² / (I - 1)
EMS_B = σ² + I Σ_j B_j² / (J - 1)
The model with one effect A and two blocking variables B and C is called a Latin square:
y_ijk = μ + A_i + B_j + C_k + e_ijk
This model is also completely additive, i.e. there is no interaction term between the effect and the two blocking variables.
This model is also completely additive, i.e. there is no interaction term between the effect and the two blocking variables. b randomized block design [EXDE] + design b
randomized design [EXDE]
+ design
b range [DESC] → dispersion
b range chart [QUAL] → control chart (o variable control chart)
b range scaling [PREP] → standardization
b rank [DESC] Ordinal number indicating the position of an object with respect to other objects when they are ordered according to some criterion, such as values assumed by a variable. When ranking objects it may happen that some of them have indistinguishable positions, i.e. they have a tied rank. In this case, the usual solution is to assign equal rank to all objects in the tied group, calculated as the average of ranks assigned ignoring the tie. For example:
x:        3.5  5.0  1.3  3.5  4.0  1.2  3.5  6.7  3.5
rank(x):  4.5  8    2    4.5  7    1    4.5  9    4.5
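A small helper (illustrative, not from the book) that assigns average ranks to tied observations and reproduces the example above:

```python
def ranks_with_ties(x):
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1                          # extend the group of tied values
        avg = (i + j) / 2 + 1               # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

x = [3.5, 5.0, 1.3, 3.5, 4.0, 1.2, 3.5, 6.7, 3.5]
print(ranks_with_ties(x))   # [4.5, 8.0, 2.0, 4.5, 7.0, 1.0, 4.5, 9.0, 4.5]
```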
A statistic calculated on ordered data, i.e. on data with observations arranged in ascending order, is called a rank order statistic or order statistic. Examples are: median, interquartile range, rank correlation. b rank analysis [FACT] Collection of techniques for determining the number of factors, i.e. the complexity of a data matrix.
average eigenvalue criterion A factor is significant if its eigenvalue is greater than the average eigenvalue. When variables are autoscaled, i.e. the correlation matrix is factored, this criterion is the eigenvalue-one criterion.
double cross-validation (dcv) Determines M based on an optimization of the predictive ability. It calculates the ratio
PRESS(M) / RSS(M - 1)
with
RSS(M - 1) = Σ_i Σ_j (x_ij - x̂_ij)²        PRESS(M) = Σ_i Σ_j (x_ij - x̂_ij\ij)²
where x̂_ij is reproduced from an (M - 1)-component model calculated from the complete data matrix, while x̂_ij\ij is reproduced from an M-component model calculated from a data matrix in which elements were deleted diagonally, including the ijth element. The above ratio is compared with either Q = 1 or with a more conservative empirical function
Q = [(p - M)(n - M - 1)] / [(p - M - 1)(n - M)]
a ratio less than Q indicates an optimal M.
eigenvalue-one criterion A factor is significant if its eigenvalue is greater than one. It is the average eigenvalue criterion on autoscaled data.
eigenvalue threshold criterion A factor is significant if its eigenvalue is greater than a specified threshold value. It is a generalization of the average eigenvalue criterion.
fixed percentage of explained variance The number of factors M is determined such that the factor model explains a prespecified fixed percentage of the total variance:
[Σ_{m=1}^{M} λ_m / Σ_{m=1}^{p} λ_m] · 100 ≥ fixed percentage
Horn's method A modification of the average eigenvalue criterion suggesting that the threshold value should not be uniformly the average eigenvalue, but should decrease with increasing rank of factors. The individual thresholds are calculated as eigenvalues from a data matrix of the same size as the matrix analyzed, but filled with random values from a normal distribution with the same variance as the original variables.
Hotelling-Lawley trace test Determines M based on testing the distribution of
A_HL = Σ_{m=1}^{M} λ_m
imbedded error function Under the assumption that the error is random and identically distributed in the data matrix, the eigenvalues associated with the residual error should be approximately equal, i.e.
λ_{M+1} = λ_{M+2} = ... = λ_p
In this case the imbedded error can be written as
IE = [M Σ_{m=M+1}^{p} λ_m / (n p (p - M))]^(1/2)
If the above assumption holds, this function decreases for m = 1, M and increases for m = M + 1, p. The optimal number of factors M is indicated by the minimum of IE. In practice, when the error is not truly random or identically distributed, IE keeps decreasing even at m > M, but at a much slower rate, so the curve of IE vs. m flattens at M.
indicator function : Malinowski’s indicator function likelihood ratio criterion Determines M based on testing the distribution of
Malinowski's indicator function (MIF) (: indicator function) An empirical function of the real error RE or residual standard error RSD:
MIF = RE / (p - M)² = [Σ_{m=M+1}^{p} λ_m / (n (p - M))]^(1/2) / (p - M)²
It shows a minimum when the true number of factors M is reached. This indicator function appears to be more sensitive than the imbedded error function.
Pillai's trace test Determines M based on testing the distribution of
A_P = Σ_{m=1}^{M} λ_m / (1 + λ_m)
Roy's greatest root test Determines M based on testing the distribution of A_R = λ_1.
scree test Determines M based on the phenomenon that the residual variance levels off when the proper number of common factors is obtained. This leveling off is assessed visually on the scree plot, which is the residual variance, or simply the eigenvalues plotted against the number of common factors.
Wilks' Λ test Determines M based on testing the distribution of
A_W = Π_{m=1}^{M} 1 / (1 + λ_m)
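A minimal sketch applying two of the criteria above (the eigenvalue-one criterion and a fixed percentage of explained variance) to the eigenvalues of an autoscaled data matrix; the simulated two-factor data are an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 6
scores = rng.normal(size=(n, 2))
loads = rng.normal(size=(2, p))
X = scores @ loads + 0.1 * rng.normal(size=(n, p))       # approximately two factors plus noise

Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)        # autoscaling
eig = np.sort(np.linalg.eigvalsh(np.cov(Xs, rowvar=False)))[::-1]   # eigenvalues of the correlation matrix

M_eigenvalue_one = int((eig > 1.0).sum())                 # eigenvalue-one criterion
explained = eig.cumsum() / eig.sum()
M_90_percent = int(np.searchsorted(explained, 0.90) + 1)  # at least 90% explained variance
print(M_eigenvalue_one, M_90_percent)                     # both should point to about 2 factors
```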
b rank correlation [DESC] → correlation
b rank deficient matrix [ALGE] → matrix rank
b rank distance [GEOM] → distance (o ranked data)
b rank of a matrix [ALGE] : matrix rank
b rank order statistic [DESC] → rank
b rank test [TEST] → hypothesis testing
b ranking variable [PREP] → variable
b rankit transformation [PREP] → transformation
b rank-one update [ALGE] Matrix decomposition of X′ calculated from the matrix decomposition of X, where the matrix X′ can be obtained from X by:
- adding a rank one matrix to X;
- appending a row or column to X;
- deleting a row or column from X.
In such situations it is more efficient to update the decomposition of X instead of starting the calculation from the beginning. The most straightforward updating is that of the Q-R decomposition.
ratio scale [PREP] -+ scale b
Rayleigh distribution [PROB] + distribution b
b Rayleigh's test [TEST] → hypothesis test
b real error (RE) [MODEL] → goodness of fit
b real error (RE) [FACT] → error terms in factor analysis
b reciprocal regression model [REGR] → regression model
b reciprocal transformation [PREP] → transformation
b rectangular distribution [PROB] → distribution
b rectangular kernel [ESTIM] → kernel
b rectifying inspection [QUAL] → lot
b recursive residual [REGR] → residual
b reduced correlation matrix [FACT] → principal factor analysis
b reduced model [ANOVA] → analysis of variance
b reduced variable [PREP] → variable
b reflected design [EXDE] → design
b regression analysis [REGR] Collection of statistical methods using a mathematical equation to model the relationship among measured or observed quantities. The goal of this analysis is
twofold: modeling and predicting. The relationship is described in algebraic form as y=f(x)+e
where x denotes the predictor variable(s), y the response variable(s), f ( x ) is the systematic part, and e is the random element of the response, also called the model error or residual. To calculate a regression model one must select the structural form f (x) (the most common is the linear regression model), the probabilistic model for the error term (the most common is to assume normality), and the estimator for the regression coefficients (the most common is least squares). Regression analysis is not merely the estimation of model parameters. It also includes the calculation of goodness of fit and goodness of prediction statistics, regression diagnostics, i.e. influence analysis and residual analysis. Besides the well-known ordinary least squares regression model, biased regression, nonlinear regression, and robust regression models are also important. Calibration and the standard addition method are two special applications of regression.
b regression coefficient [REGR] (: regression parameter) Coefficient of a predictor or a function of predictors in a regression model. In a linear regression model it is the partial derivative of the response variable with respect to a predictor variable:
b_j = ∂y/∂x_j
It indicates the importance of the corresponding predictor variable in the model. Although the least squares estimator is the most popular method for calculating the regression coefficients, generalized least squares, biased estimators, and robust estimators are also often applied. The variance inflation factor measures the effect of an ill-conditioned predictor matrix on the coefficient estimates. The standardized regression coefficient is the regression coefficient divided by the ratio of the standard deviation of the response to the standard deviation of the corresponding predictor variable, i.e. it is the regression coefficient for autoscaled variables. The constant term in a regression model is called the intercept or offset. It can be considered as the regression coefficient of a dummy predictor variable with all elements being set to one. b regression curve [REGR] Graphical representation of a regression model. For a univariate regression it can be drawn on a plane of the predictor and response variables. If the relationship is linear the regression curve is called a regression line. For multiple regression the model is represented as a regression surface, also called the response surface.
b regression diagnostics [REGR] (: diagnostics) Collection of techniques used to detect and assess the quality and reliability of a regression model. The goal of diagnostics is twofold: recognition of important phenomena due to outliers rather than the bulk of the data, and suggestion of appropriate remedies to find a better regression model. Regression diagnostics is performed to narrow the gap between theoretical assumptions and observed data. In contrast to robust regression, which solves this problem by dampening the effect of outliers, regression diagnostics identifies the outliers and deals with them directly. It looks for model misspecification, departure from the normality assumption and from homoscedasticity of the residuals, collinearity in the predictor variables and influential observations. Residual analysis, comprising numerical and graphical analysis of the ordinary and various derived residuals, is one of the most important parts of regression diagnostics. A collection of statistics, known as influence analysis, helps to reveal the effect of individual observations on the regression model. Assessment of collinearity and its harmful effect on the regression estimates is another task of regression diagnostics.
b regression equation [REGR] : regression model
b regression function [REGR] : regression model
b regression line [REGR] + regression curve
b regression model [REGR] (: regression equation, regression function) Mathematical (usually algebraic) equation, also called structural model, to describe the relationship among predictor and response variables. The graphical representation of a regression model is called the regression curve. A list of regression models of particular interest follows.
exponential regression model Intrinsically linear regression model of the form
y = exp[b^T x] e
Taking the natural logarithm of both sides gives a linear model:
ln(y) = b^T x + ln(e)
first-order regression model Regression model in which the exponents of all variables are 1, e.g. y = b0 + b1 x1 + b2 x2. In such a model the following relationship holds:
∂y/∂x_j = b_j (a constant)
Geometrically this model is a p-dimensional hyperplane, e.g. a straight line if p = 1, a plane if p = 2, etc.
generalized linear model (GLM) Regression model describing the relationship between a response variable and a set of predictor variables. The probability distribution of the response, specified in terms of predictors, must belong to an exponential family, e.g. normal, Poisson, etc. This restriction rules out discretization and various mathematical transformations of the response. The predictor variables are connected to the response only through a linear combination:
η = b0 + b1 x1 + b2 x2 + ... + bp xp
The expected value of each response y can be expressed as some known function of this linear combination:
E[y] = g(η)
where g is called the link function. Examples are: ANOVA (response of normal distribution and identity link function), OLS (response of normal distribution and identity link function), log-linear models (response of Poisson distribution and the inverse of the link function is the natural logarithm), logit and probit analysis (response of binomial distribution and logit or probit function).
Gompertz growth model Growth model that assumes a linear relationship between the relative growth rate and time. It has a double exponential form:
y = a exp[-b exp(-k x)]
This model has an S-shaped curve which is not symmetrical about its inflection point. The parameter a is the limiting growth; when x = 0, y = a e^(-b).
growth model Nonlinear regression model which describes growth as a function of an increasing independent variable, usually time. In general, growth models are mechanistic, i.e. the model is defined by solving differential equations that represent an assumption about the type of growth. Examples are: logistic, Gompertz, Richards and Weibull growth models. intrinsically linear regression model Nonlinear regression model that can be transformed into linear form; e.g. logistic model, reciprocal model.
intrinsically nonlinear regression model Nonlinear regression model that cannot be transformed into linear form.
linear regression model Regression model in which the response variable is a linear function of the regression coefficients, i.e. ∂y/∂b_j is not a function of b_j. Examples of linear regression are ordinary least squares regression, ridge regression, stepwise regression, principal components regression, partial least squares regression.
logistic growth model Growth model assuming that the growth rate is proportional to the product of the present size and the future amount of growth:
y = a / (1 + b exp[-k x])
When x = 0 the starting growth value is y = a/(1 + b). The parameter a is called the limiting growth. It is the value that y approaches as x tends to infinity. The values b and k are always positive. The plot of y vs. x has an S shape; the slope of the curve is always positive.
logistic regression model Nonlinear regression model which describes the probability P of a binary response y in the form:
P = 1 / (1 + exp[-(b0 + b1 x)])
This function has an S shape that approaches one asymptotically. It can be linearized by applying the logit function:
ln[P/(1 - P)] = b0 + b1 x
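As an illustration of the model above, the coefficients can also be fitted directly by maximum likelihood. The following is a minimal sketch assuming NumPy and SciPy are available; the simulated data and the coefficient values 0.5 and 1.5 are hypothetical.

import numpy as np
from scipy.optimize import minimize

def fit_logistic(x, y):
    # Maximize the binomial likelihood of P = 1/(1 + exp[-(b0 + b1*x)]).
    def neg_log_lik(b):
        p = 1.0 / (1.0 + np.exp(-(b[0] + b[1] * x)))
        eps = 1e-12                                   # guard against log(0)
        return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return minimize(neg_log_lik, x0=np.zeros(2), method="BFGS").x

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x))))
print(fit_logistic(x, y))                             # estimates of (b0, b1)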
multiple regression model : multivariate regression model
multivariate regression model (: multiple regression model) Regression model in which the response is a function of more than one predictor variable. The simplest one is of linear form:
y = b0 + x^T b + e
nonlinear regression model Regression model in which the response variable is a nonlinear function of the regression coefficients. In contrast, a regression in which the response variable is a linear function of the parameters but contains powers or cross-products of the predictor variables is called a polynomial regression model, or second-, third-
(etc.) order regression. Nonlinear regression models can be divided into two groups: parametric and nonparametric. The first group consists of models that have a specific parameterized form arising from the scientific field of the data; these models are suggested by the theory of the subject, e.g. growth models. The second group contains more flexible models in which the form of nonlinearity is not prespecified but estimated from the data. Examples are: ACE, SMART, MARS, nonlinear PLS. In nonparametric methods smoothers and splines are used extensively for the approximation of functions. The least squares estimator is the most popular one for calculating the parameters and functions. Because the parameters enter nonlinearly into the criterion to be minimized, nonlinear optimization techniques must be employed, e.g. the Gauss-Newton method.
quadratic regression model : second-order regression model
reciprocal regression model Intrinsically linear regression model of the form:
y = 1 / (b^T x + e)
Taking reciprocals on both sides gives a linear model:
1/y = b^T x + e
Richards growth model Variation of the logistic growth model including an additional parameter:
y = a / (1 + b exp[-k x])^(1/d)
second-order regression model (: quadratic regression model) Model in which at least one regression coefficient is a linear function of the corresponding predictor, i.e. the partial derivative ∂y/∂x_j depends linearly on x_j. This model contains predictors on second power (x_j^2) and/or product terms x_j x_k.
single regression model : univariate regression model
univariate regression model (: single regression model) Regression model in which the response is a function of only one predictor variable. The simplest is the linear form:
y = b0 + b1 x + e
Weibull growth model Growth model of the form:
y = a - b exp(-c x^d)
The starting growth value at x = 0 is y = a - b. The limiting growth parameter is a.
zero-order regression model Regression model containing only a constant, i.e. it is independent of the predictors:
y = b0 + e
b regression parameter [REGR] : regression coefficient
b regression surface [REGR] + regression curve
b regressor [PREP] + variable
b regular simplex [OPTIM] + simplex
b regularized discriminant analysis (RDA) [CLAS] Parametric classification method, like SIMCA and DASCO, that is an extension of quadratic discriminant analysis. It seeks a biased estimate of the class covariance matrices in order to reduce their variance. RDA has two biasing schemes: the class covariance matrices are pulled towards a common covariance matrix:
S_g(λ) = (1 - λ) S_g + λ S
and shrunk towards a multiple (the average of the eigenvalues) of the identity matrix I:
S_g(λ, γ) = (1 - γ) S_g(λ) + γ [tr S_g(λ)/p] I
The first biasing is controlled by the parameter λ and the second is regulated by the shrinkage parameter γ. Both range between 0 and 1; their values are chosen based on cross-validated misclassification risk. If λ = 0 and γ = 0, RDA is equal to quadratic discriminant analysis, whereas if λ = 1 and γ = 0, RDA is equal to linear discriminant analysis. The case of λ = 1 and γ = 1 corresponds to nearest means classification and the case of λ = 0 and γ = 1 gives weighted nearest means classification.
[Figure: the (λ, γ) plane of RDA models, with QDA (λ = 0, γ = 0) and LDA (λ = 1, γ = 0) along the Mahalanobis-distance edge, nearest means (NMC) and weighted nearest means (WNMC) classification along the Euclidean-distance edge (γ = 1), and ridge-like intermediate models in between.]
Holding γ = 0 and varying λ produces models lying between QDA and LDA; holding λ = 0 and increasing γ, RDA attempts to unbias the sample-based eigenvalue estimates; holding λ = 1 and increasing γ gives rise to a ridge regression analog for LDA.
b rejectable quality level [QUAL] + producer's risk
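The following is a minimal sketch of the two biasing steps, assuming NumPy is available; S_g is a class covariance matrix and S_pooled a common (pooled) covariance matrix, both hypothetical inputs.

import numpy as np

def rda_covariance(S_g, S_pooled, lam, gamma):
    # First biasing: pull the class covariance towards the common covariance.
    S_lam = (1.0 - lam) * S_g + lam * S_pooled
    # Second biasing: shrink towards (average eigenvalue) * identity,
    # i.e. tr(S_lam)/p times the identity matrix.
    p = S_lam.shape[0]
    return (1.0 - gamma) * S_lam + gamma * (np.trace(S_lam) / p) * np.eye(p)

With lam = gamma = 0 this returns the class covariance used by QDA; lam = 1, gamma = 0 gives the pooled covariance used by LDA.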
b rejection line [QUAL] + lot
b rejection number [QUAL] + lot
b rejection region [TEST] + hypothesis testing
b relative efficiency [ESTIM] + efficiency
b relative frequency [DESC] + frequency
b relative standard deviation [DESC] + dispersion
b reliability function [PROB] + random variable
b reliability score [CLAS] + classification
b relocation algorithm [CLUS] Basic algorithm of non-hierarchical optimization clustering, consisting of the following steps:
1. Set an initial partition of n objects into G clusters.
2. Select a criterion to optimize.
3. Calculate the centroid (centrotype) of each cluster.
For i = 1, n:
4. Assign object i to another cluster if that improves the optimization criterion.
5. If object i was relocated, recalculate the centroid (centrotype) of its old and new cluster.
End
6. If no relocation occurred, stop; otherwise go to step 3.
A variation of the above algorithm is obtained when step 5 is omitted, i.e. the cluster centroids are recalculated only after a full cycle.
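A minimal sketch of the relocation algorithm, assuming NumPy is available and using the total within-cluster sum of squares as the optimization criterion (nearest-centroid reassignment as the improvement step); the function and parameter names are illustrative only.

import numpy as np

def relocation_clustering(X, G, max_cycles=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, G, size=len(X))      # step 1: initial partition
    for _ in range(max_cycles):
        # step 3: centroid of each cluster (re-seed an empty cluster randomly)
        centroids = np.array([X[labels == g].mean(axis=0) if np.any(labels == g)
                              else X[rng.integers(len(X))] for g in range(G)])
        moved = False
        for i in range(len(X)):
            # step 4: relocate object i if the nearest centroid belongs to another cluster
            d = ((centroids - X[i]) ** 2).sum(axis=1)
            g_new, g_old = int(d.argmin()), labels[i]
            if g_new != g_old:
                labels[i] = g_new
                moved = True
                # step 5: recalculate the centroids of the old and new cluster
                for g in (g_old, g_new):
                    if np.any(labels == g):
                        centroids[g] = X[labels == g].mean(axis=0)
        if not moved:                              # step 6: stop if no relocation
            break
    return labels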
b renewal process [TIME] + stochastic process (o counting process)
b replication [EXDE] Repetition of runs with the same treatment. Replicates are identical rows in the design matrix. They are assumed to be independent observations. Replication makes it possible to estimate the experimental error (precision) and to obtain a more precise estimate of the effect of a factor.
b reproduced correlation matrix [FACT] + factor loading
b resampling [MODEL] + model validation
b residual [REGR] Quantity remaining after some other quantity has been subtracted; for example the part of a variable unexplained by a statistical model. In regression, the part of the
response variable not described by the regression model. This part is due to random variation or model misspecification, as opposed to the systematic part described by the model. The residual is calculated as the difference between the observed and the predicted value of the response variable: e = y - ŷ. In ordinary least squares the residuals are assumed to be uncorrelated and to follow a normal distribution with zero mean and equal variance. Residuals, investigated in residual analysis, play an important part in regression diagnostics. A list of various residuals follows.
adjusted residual Residual adjusted for the effect of the predictor values, assuring equal variance:
e_i' = e_i / sqrt(1 - h_ii)    i = 1, n
where h_ii is the ith diagonal element of the hat matrix.
Anscombe residual Transformed residual, informative in the case of a response of nonnormal distribution:
e_i' = f(y_i) - f(ŷ_i)    i = 1, n
where f is a function that transforms y into a normally distributed variable.
cross-validated residual (: deletion residual, predictive residual) Residual of a model fitted to data with the ith observation excluded:
e_i\i = y_i - ŷ_i\i    i = 1, n
where ŷ_i\i denotes the predicted response value of the ith observation from a model calculated without the ith observation. Measures of goodness of prediction are calculated on the basis of this residual. In the case of linear estimators (e.g. OLS or ridge regression) the cross-validated residuals can be calculated from the ordinary residuals as:
e_i\i = e_i / (1 - h_ii)
where h_ii is the ith diagonal element of the hat matrix. In the case of nonlinear estimators (e.g. PLS) the cross-validated residuals must be calculated by the leave-one-out technique.
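A minimal sketch of this shortcut for OLS, assuming NumPy is available; the relation e_i\i = e_i/(1 - h_ii) makes explicit leave-one-out refitting unnecessary for linear estimators.

import numpy as np

def ols_residuals(X, y):
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T                 # hat matrix
    e = y - H @ y                         # ordinary residuals
    e_cv = e / (1.0 - np.diag(H))         # cross-validated (deletion) residuals
    return e, e_cv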
deletion residual : cross-validated residual
externally Studentized residual : Studentized residual
internally Studentized residual : standardized residual
jackknifed residual : Studentized residual
ordinary residual Residual of a model fitted to all the observations:
e_i = y_i - ŷ_i    i = 1, n
Measures of goodness of fit are calculated on the basis of this residual.
predictive residual : cross-validated residual
o recursive residual Residual of a time series model in which the coefficient estimate b_{i-1} is calculated using only the first i - 1 observations:
e_i = y_i - x_i^T b_{i-1}
o standardized residual (: internally Studentized residual) Residual of a model fitted to all the observations, scaled by its standard error estimated from all the observations:
r_i = e_i / [s sqrt(1 - h_ii)]    i = 1, n
with s the residual standard deviation estimated from all n observations. This residual is scale invariant and has a t-like distribution. This scaling assures homoscedasticity and unit variance.
Studentized residual (: externally Studentized residual, jackknifed residual) Residual standardized so as to have the same precision along the observations. This standardization eliminates the effect of the location of the objects:
t_i = e_i / [s_\i sqrt(1 - h_ii)]    i = 1, n
where s_\i is the standard error estimated without the ith observation. It can be calculated as:
s_\i = sqrt{ [(n - p) s^2 - e_i^2/(1 - h_ii)] / (n - p - 1) }
This residual has zero mean and unit variance, it is scale invariant, and is more appropriate for detecting violations of assumptions in a regression model.
b residual analysis [REGR] Part of regression diagnostics, based on examining residuals from a regression model via numerical and/or graphical techniques. The goal is to infer any incorrect assumptions concerning the model error (e.g. homoscedasticity or the assumption of normality). The most popular plots are residual scatter plots and residual quantile plots.
b residual correlation matrix [FACT] + factor loading
b residual degrees of freedom [MODEL] + model degrees of freedom
b residual mean square (RMS) [MODEL] + goodness of fit
b residual plot [GRAPH] + scatter plot
b residual standard deviation (RSD) [MODEL] + goodness of fit
b residual standard error (RSE) [MODEL] + goodness of fit
b residual sum of squares (RSS) [MODEL] + goodness of fit
b residual variance [MODEL] + goodness of fit
b resistant estimator [ESTIM] + estimator
b resolution [EXDE] + confounding
b response curve [GRAPH] + scatter plot
b response surface [EXDE] Mathematical function that describes the response as a function of the factors. It can be visualized and plotted only if there are no more than two factors. Response surfaces are often described by polynomial approximations.
If only terms up to degree one are included (i.e. only main effects), the function is called a first-order response surface, which is a hyperplane. The corresponding design is a first-order design. A second-order response surface also includes interactions and squared factors. This surface is curved possibly with minima, maxima, ridge, and saddle points. The corresponding design is called a second-order design.
Response surface methodology (RSM) is a collection of statistical techniques that, by design and analysis of an experiment, seeks to relate the response (output) of a system to factors (input) that have an effect on the response. RSM is used for predicting responses at given factor levels, choosing factor levels to obtain specified response; finding factor levels to obtain optimal response. b
b response surface methodology (RSM) [EXDE] + response surface
b response variable [PREP] + variable
b restricted model [ANOVA] + analysis of variance
b resubstitution [MODEL] + model validation
b Richards growth model [REGR] + regression model
b ridge regression (RR) [REGR] Biased regression method based on the assumption that large regression coefficients are likely to be spurious; therefore it shrinks them toward zero by adding a small constant γ to each diagonal element of the correlation matrix. The constant γ is called the shrinkage parameter. Increasing γ increases the bias in the regression coefficient estimates but it also decreases their variance. The selection of the optimal γ on the basis of plotting the regression coefficients as a function of γ is called the ridge trace. The value of γ is increased until stability is indicated in all regression coefficients. Stability does not mean convergence, but rather a lack of rapid change of the coefficients as a function of γ. A better strategy for estimating the optimal γ is based on a goodness of prediction criterion. The regression coefficient estimates are
b = (X^T X + γ I)^-1 X^T y
They are calculated by minimizing
(y - X b)^T (y - X b) + γ b^T b
or equivalently by maximizing
corr^2(y, X b) var(X b) / [var(X b) + γ]    subject to b^T b = 1
The bias and variance of the estimated regression coefficients are:
B^2(b) = γ^2 β^T (X^T X + γ I)^-2 β
V(b) = s^2 tr[(X^T X + γ I)^-1 X^T X (X^T X + γ I)^-1] = s^2 Σ_j λ_j/(λ_j + γ)^2
where β is the true regression coefficient vector, s is the error standard deviation and λ_j are the eigenvalues of the correlation matrix. As the ridge estimator is a linear estimator the cross-validated residuals can easily be calculated as
e_i\i = e_i / (1 - h_ii(γ))
where h_ii(γ) is a diagonal element of the ridge hat matrix
H(γ) = X (X^T X + γ I)^-1 X^T
with X having centered predictor variables. The degrees of freedom of a ridge regression model is given by the trace of H(γ).
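A minimal sketch of the ridge estimates, hat matrix, cross-validated residuals and degrees of freedom, assuming NumPy is available and the predictors are centered as described above.

import numpy as np

def ridge_fit(X, y, gamma):
    Xc = X - X.mean(axis=0)                           # centered predictors
    yc = y - y.mean()
    A = np.linalg.inv(Xc.T @ Xc + gamma * np.eye(Xc.shape[1]))
    b = A @ Xc.T @ yc                                 # b = (X'X + gamma I)^-1 X'y
    H = Xc @ A @ Xc.T                                 # ridge hat matrix H(gamma)
    e = yc - Xc @ b
    e_cv = e / (1.0 - np.diag(H))                     # cross-validated residuals
    return b, e_cv, np.trace(H)                       # trace(H) = degrees of freedom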
b ridge trace [REGR] + ridge regression
b risk function [MISC] + decision theory
b robust estimator [ESTIM] + estimator
b robust locally weighted regression [REGR] + robust regression
b robust regression [REGR] Regression method that is insensitive to small deviations from the distributional assumptions. It weights down the influence of observations with large residuals, hence safeguarding against outliers in the response. The least squares estimator can be made robust by iteratively reweighting the residuals in inverse proportion to their magnitude. For example, the biweight can be applied as an iterative procedure in which the weights are calculated as a function of the residuals. The indicator function offers another solution to mitigate the effect of outliers. Robust regression estimators can also be defined by minimizing other functions than the sum of squares of the residuals. The following robust methods are the most popular.
bounded influence regression (: GM estimator) Robust regression that limits the influence of outliers in a regression model by applying some weighting function. The derivative of the minimization criterion is
Σ_i w(x_i) ψ(e_i/s) x_i = 0
where s is an estimate of the error standard deviation, w denotes some weight function of the predictor vectors and ψ is an indicator function. Unfortunately, the breakdown point of bounded influence regression decreases as the number of predictor variables increases.
GM estimator : bounded influence regression
iteratively reweighted least squares regression (IRWLS) Robust regression that estimates the regression coefficients by minimizing the criterion:
min_b [ Σ_i w_i r_i^2 ]
where w_i weights the squared standardized residuals r_i^2 according to their magnitude. The weights w_i are calculated simultaneously with the estimates of the error standard deviation in an iterative fashion.
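A minimal sketch of IRWLS, assuming NumPy is available, using Tukey's biweight as the weight function and the MAD as the robust scale estimate; the tuning constant 4.685 is a conventional choice and an assumption here.

import numpy as np

def irwls_biweight(X, y, c=4.685, max_iter=50):
    b = np.linalg.lstsq(X, y, rcond=None)[0]               # start from OLS
    for _ in range(max_iter):
        e = y - X @ b
        s = 1.4826 * np.median(np.abs(e - np.median(e)))   # robust scale (MAD)
        if s == 0:
            break
        u = e / (c * s)
        w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)   # biweight weights
        b_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        if np.allclose(b, b_new, atol=1e-8):
            break
        b = b_new
    return b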
L1 regression : least absolute residual regression
least absolute residual regression (: L1 regression) Robust regression that estimates the regression coefficients by minimizing the sum of absolute residuals, not the sum of squared residuals:
min_b Σ_i |e_i|
Although this method is thought to be more robust than the least squares method, its breakdown point is still no better than 0%.
least median squares regression (LMS) Robust regression that estimates the regression coefficients by minimizing the median instead of the sum of the squared residuals:
min_b med_i e_i^2
This method is very robust with respect to outliers in the response; its breakdown point is 50%.
least trimmed squares regression (LTS) Robust regression that estimates the regression coefficients by minimizing the criterion:
min_b Σ_{i=1}^{n-m} (e^2)_{i:n}
where (e^2)_{i:n} are the ordered squared residuals. In the summation the m largest squared residuals are omitted. When m is around n/2 the breakdown point of this estimator is 50%.
locally weighted scatter plot smoother (LOWESS) regression : robust locally weighted regression
robust locally weighted regression (: locally weighted scatter plot smoother) Robust regression in which, at each observation x_i, a weight vector w_k(x_i) is calculated and a weighted least squares criterion is minimized. This weight vector is centered at x_i and scaled such that it becomes zero at points further away than a specified nearest neighbor. The size of the neighborhood, i.e. the fraction of the observations with nonzero weights, is a parameter to be optimized. After the initial model and residuals have been calculated the weight vectors are modified on the basis of the size of the residuals: observations with large residuals receive small weights and observations with small residuals receive large weights. This correction is performed by using the biweight.
b robustness [ESTIM] + estimator (o robust estimator)
b Rogers-Tanimoto coefficient [GEOM] + distance (o binary data)
b root mean square (RMS) [DESC] + dispersion
b root mean square deviation (RMSD) [DESC] + dispersion
b root mean square error (RMSE) [MODEL] + goodness of fit
b rootogram [GRAPH] + histogram
b Roquemore design [EXDE] + design
b Rosenbaum's test [TEST] + hypothesis test
b rotatable design [EXDE] + design
b rotation [FACT] : factor rotation
b rotation matrix [ALGE] (: transformation matrix) An orthogonal matrix M that brings a matrix X into another matrix X', preserving the length of its vectors:
X' = X M
Due to the orthogonality of M, the following properties hold:
M^T = M^-1    M^T M = I    |M| = ±1
b round-off error [OPTIM] + numerical error
b row vector [ALGE] + vector
b Roy's greatest root test [FACT] + rank analysis
b run [EXDE] + design matrix
b Russel-Rao coefficient [GEOM] + distance (o binary data)

S

b S-chart [QUAL] + control chart (o variable control chart)
b S statistic [MODEL] + goodness of fit
b sample [PROB] + population
b sample reuse [MODEL] + model validation
b sample size [PROB] + population
b sampling [PROB] The process of drawing a subset, called a sample, from a population in order to estimate the properties of the population. Mathematically, simulation performs the sample drawing process. There are several sampling strategies to choose from:
cluster sampling (: nested sampling) Sampling from selected, restricted parts of the population. Examples are subsampling and two-stage sampling where essentially more than one sample is collected from selected parts of the population.
Monte Carlo sampling Sampling on the basis of mathematical experiments, involving random numbers, in which mathematical constraints (e.g. distribution parameters) are imposed.
nested sampling : cluster sampling
random sampling Sampling by dividing the population into equal parts and selecting from them using a random procedure. Such a process ensures that an unbiased sample is obtained.
stratified sampling Sampling from a population that is heterogeneous but composed of k clearly distinguishable homogeneous sub-populations (strata) with known relative frequencies. In such a case, k individuals are selected, one from each stratum.
systematic sampling Sampling at regular intervals. It is appropriate only for a homogeneous population. On the other hand, systematic sampling results in biased estimates if the investigated property changes regularly with the sampling interval.
b sampling plan [QUAL] + acceptance sampling
b saturated design [EXDE] + design
b saturated model [MODEL] + model (o nested model)
b scalar [ALGE] Quantity that, in contrast to a vector, has only magnitude, but no direction. A scalar has the same value in each coordinate system, i.e. a scalar is scale invariant.
b scalar multiplication of a matrix [ALGE] + matrix operation
b scale [DESC] : dispersion
b scale [PREP] Variables can be characterized according to the scale on which they are defined. A qualitative variable is measured on a nominal or ordinal scale, also called a non-metric scale, while a quantitative variable is defined on a proportional or ratio scale, also called a metric scale. The scale of measurement greatly affects the choice of model and estimator.
arithmetic scale : ratio scale
interval scale : proportional scale
nominal scale The mathematically weakest scale for qualitative variables, where the coded information is only a name (category or label) without any order relation. The number of possible categories of a nominal variable is usually finite, although it is possible to have a countably infinite number. On a nominal scale the only admitted operations among the categories are = and ≠. For example, color, taste, or chemical categories are measured on a nominal scale. When a variable has only two categories (presence/absence of a property, male/female, no/yes) it is called a binary variable and is usually coded as 0/1, -/+ or 1/2. Nominal variables with several categories are often coded in several binary variables, each corresponding to a category.
ordinal scale Stronger than a nominal scale, for qualitative variables where the categories can be arranged in order. The number of possible categories of an ordinal variable is usually finite, although it is possible to have a countably infinite number. On an ordinal scale total ordering (<, >) or partial ordering (≤, ≥) operations are defined among the categories. Thus, an ordinal variable indicates an ordering or ranking of measurements, with relative rather than quantitative differences. Variables that are originally on a proportional or ratio scale can be reduced to ranks measured on an ordinal scale.
proportional scale (: interval scale) Scale for quantitative variables; discrete or continuous. The starting point of the scale is not well-defined, so only the difference between two values is meaningful, not the ratio. Besides the operations =, ≠, <, >, ≤, ≥ of the weaker ordinal scale, - and + are also defined. For example, temperature is measured on a proportional scale.
ratio scale (: arithmetic scale) Stronger than a proportional scale, for quantitative variables where the starting point of the scale is unambiguously defined. All operations are exactly defined; both the difference and the ratio between two values are meaningful. For example, length, weight, or volume are measured on a ratio scale. A ratio scale usually has a unit associated with it. Unitless ratio scales are the frequency count scale and the percentage scale. The former measures counts, the latter percentages of the total.
b scale invariance [ESTIM] Characteristic of an estimator or a model stating that it does not change with a change in the scale of the data. Examples are: ordinary least squares regression, discriminant analysis. Many estimators, however, result in different estimates depending on the scale of the data, i.e. they are scale variant. Examples are: PCR, PLS, SIMCA, RDA, KNN, most of the cluster analysis models.
b scaling [PREP] + standardization
b scatter [DESC] : dispersion
b scatter diagram [GRAPH] : scatter plot
b scatter matrix [DESC] (: dispersion matrix) Square matrix T that describes the dispersion of multivariate data around the mean. Its elements are:
t_jk = Σ_i (x_ij - x̄_j)(x_ik - x̄_k)
In the case of centered variables the scatter matrix equals the information matrix, defined as XT X. The scatter matrix is related to the covariance matrix B as:
B = T/n
or
B = T/(n- 1)
the latter being for unbiased estimates. The scatter matrix T can be decomposed into the within-group scatter matrix W and the between-group scatter matrix B:
T = W + B
Several multivariate methods are based on optimizing or testing such a decomposition, e.g. MANOVA, discriminant analysis, non-hierarchical cluster analysis.
b scatter plot [GRAPH] (: scatter diagram, plot) Cartesian plot in which the coordinates are either original variables or quantities (statistics) derived from the data. In contrast to the quantile-quantile plot, the
variables are paired, i.e. the corresponding values are measured on the same object. The scatter plot is an efficient way to represent the relationship between two or three variables. Two-dimensional scatter plots are the most common. Additional variables can be represented on a static two-dimensional plot by adding color, size or shape to the data points. A more efficient way to create a three-dimensional scatter plot is by using interactive computer graphics.
biplot Graphical display of a data matrix by means of markers for its rows and for its columns such that inner products of these markers represent the corresponding matrix elements. The most popular biplot is the principal component biplot, in which the row markers are principal component scores and the column markers are the eigenvectors scaled with the corresponding eigenvalues.
[Figure: principal component biplot displaying row and column markers in the plane of the first two principal components (PC1, PC2).]
It is a joint display of rows and columns, in contrast to most other scatter plots, which display only rows or only columns of a data matrix. The distances of the column points from the origin are roughly proportional to the standard deviations of the variables. Cosines of angles between two vectors drawn from the origin to two column points represent the correlation between the two variables. Distances between row points represent their dissimilarities. While distance between a row and a column point (i.e. between an object and a variable) is not directly interpretable, the relative positions of objects with respect to variables, and vice versa, are the essence of the biplot.
contour plot (: density map) Two-dimensional graphical representation of a response surface, i.e. the projection of a response variable onto a plane of the predictor variables or factors, by means of isoresponse curves.
[Figure: contour plot of a response surface in the plane of two factors x1 and x2.]
The levels of the two factors or of two linear combinations of factors are indicated on the horizontal and vertical axes, whereas the values of the response are indicated by contour lines drawn in the plane of the two factors. These contour lines are projections of the outlines of cross-sections of the response surface parallel to the factor plane. Coomans’ plot Graphical display of the goodness of class separation, often used in SIMCA. Residuals of objects fitted to one class are plotted against residuals of the same objects fitted to another class. On the two-dimensional plot the separability of the classes can be assessed only pairwise.
[Figure: Coomans' plot; residuals from the model of class A are plotted against residuals from the model of class B, defining regions of objects classified as A, as B, as both A and B, or as neither A nor B.]
density map : contour plot
discriminant plot Two- (or three-) dimensional scatter plot of discriminant scores, usually of the first two (or three) linear discriminant functions. This plot gives information about class separability.
latent variable plot Two-dimensional scatter plot in which the predictor latent variable is plotted against the response latent variable, calculated in the same component. This plot is similar to the principal component plot, except that the axes are not simply eigenvectors of a matrix but eigenvectors of complex aggregates of the predictor and the response matrices. This plot serves as a diagnostic tool in PLS modeling: it reveals outlier and high leverage observations, and any nonlinear relationship between predictors and responses.
loading plot Display of variables on a two- (or three-) dimensional scatter plot as their projections on new axes calculated by a linear combination of the original variables. The most common loading plot projects the variables on the principal component axes.
[Figure: loading plot of variables x1-x7 in the PC1-PC2 plane, indicating variables with high loadings in PC1 only, high loadings in PC2 only, and variables unimportant in both components.]
map Two- (or three-) dimensional scatter plot of multidimensional objects in which the coordinates are nonlinear combinations of the original variables. The two- (or three-) dimensional configuration is calculated by some mapping technique, like multidimensional scaling, nonlinear mapping, or projection pursuit. principal component plot ( : score plot) Scatter plot of the scores (projections) of the observations, usually on the first and second principal component axes. This plot is a linear projection of the observations onto a two-dimensional subspace that conserves most of the variance.
[Figure: principal component plot (PC1 vs. PC2 scores) showing objects from three groups marked with different symbols.]
If the first two components explain a high percentage of the variance, then the distribution of the observations in the two dimensions is a fair approximation of the distribution in the original measurement space. This plot is one of the most popular graphical tools in exploratory data analysis. Clusters, outliers and trends can be discovered by visualization of the multivariate distribution.
residual plot Two-dimensional scatter plot of residuals from a regression model. This plot plays an important role in residual analysis. It is used to verify the normality and homoscedasticity assumptions on the residual, to identify outliers and to check the linearity of the relationship. The ideal plot shows a horizontal band of points with constant vertical scatter from left to right.
[Figure: residual plot with residuals plotted against observation ID, showing a horizontal band of points.]
Depending on which residual is assigned to the vertical axis (e ordinary, e_cv cross-validated, r standardized, t Studentized) and what is plotted on the horizontal axis (x predictor, y response, ŷ estimated response, h hat diagonal, ID observation id), one can obtain several different residual plots:
o McCulloh-Meeter plot: ln[h/(n(1 - h))] vs. ln[e^2]
o ordinary residual plot: ID vs. e; x vs. e; y vs. e; ŷ vs. e
o predictive residual plot: ID vs. e_cv; y vs. e_cv; e vs. e_cv
o Pregibon plot: h vs. e^2
o standardized residual plot: ID vs. r; y vs. r
o Studentized residual plot: ID vs. t; y vs. t
o Williams plot: h vs. e_cv
response curve System response plotted as a function of one factor. It is a one-dimensional response surface.
score plot : principal component plot
time series plot Graphical representation of a time series: x(t) plotted against the corresponding time intervals t. The successive points are often connected for better visualization.
b scatter plot matrix [GRAPH] : draftsman's plot
b Scheffe's test [TEST] + hypothesis test
b Schrader clustering [CLUS] + non-hierarchical clustering (o optimization clustering)
b score [FACT] + common factor
b score plot [GRAPH] + scatter plot
b scree plot [GRAPH] + rank analysis (o scree test)
b scree test [FACT] + rank analysis
b second kind model [ANOVA] + term in ANOVA
b second-order design [EXDE] + design
b second-order regression model [REGR] + regression model
b seed point [CLUS] + cluster
b semi-interquartile range [DESC] + dispersion
b sensitivity curve (SC) [ESTIM] + influence curve
b sensitivity of classification [CLAS] + classification
b sequential design [EXDE] + design
b sequential sampling [QUAL] + acceptance sampling
b sequential variable selection [REGR] + variable subset selection
b Shannon entropy [MISC] + information theory
b Shannon-Wiener index [MISC] + information theory
b shape difference coefficient [GEOM] + distance (o quantitative data)
b Shapiro-Wilk's test [TEST] + hypothesis test
b sharpness of classification [CLAS] + classification
b Shewhart chart [QUAL] + control chart
b shot-noise process [TIME] + stochastic process
b shrinkage parameter [REGR] + ridge regression
b Siegel-Tukey's test [TEST] + hypothesis test
b sign test [TEST] + hypothesis testing
b significance level [TEST] + hypothesis testing
b similarity index [GEOM] Measure associated with a pair of objects that depicts the extent to which the two objects are similar. The more similar the two objects are, the larger is the similarity index. A similarity index s_st calculated for objects s and t has the following properties:
0 ≤ s_st ≤ 1    s_ss = 1    s_st = s_ts
Pairwise similarity indices calculated from a data matrix X(n,p) can be arranged in a symmetric matrix S(n,n), called the similarity matrix. Rows and columns represent objects; the diagonal elements s_ss are equal to one. A similarity index s_st can be calculated from the dissimilarity index d_st as:
s_st = 1/(1 + d_st)
or
s_st = 1 - d_st/d_max
Pairwise dissimilarity indices calculated from a data matrix X(n,p) can be arranged in a symmetric matrix D(n,n), called the dissimilarity matrix. Rows and columns represent objects; the diagonal elements d_ss are equal to zero. The most often used dissimilarity indices are distances.
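A minimal sketch of the two conversions above, assuming NumPy is available; D is any dissimilarity (e.g. distance) matrix.

import numpy as np

def similarity_matrix(D, method="reciprocal"):
    D = np.asarray(D, dtype=float)
    S = 1.0 / (1.0 + D) if method == "reciprocal" else 1.0 - D / D.max()
    np.fill_diagonal(S, 1.0)              # s_ss = 1
    return S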
b similarity matrix [GEOM] + similarity index
b similarity transformation [ALGE] Multiplication by a nonsingular matrix Z such that for two matrices X and Y the following equations hold:
X Z = Z Y    Y = Z^-1 X Z
The matrices X and Y are called similar and their eigenvalues are the same.
b simple matching coefficient [GEOM] + distance (o binary data)
b simplex [OPTIM] Geometric figure formed by a set of p + 1 points in the p-dimensional space. In two dimensions the simplex is a triangle, in three dimensions it is a tetrahedron, etc. When the points are equidistant the simplex is called a regular simplex. A simplex is used, for example, in simplex optimization to find the minimum or maximum of a function.
b simplex centroid design [EXDE] + design
b simplex lattice design [EXDE] + design
b simplex optimization [OPTIM] Direct search optimization based on comparing the values of a function at the p + 1 vertices of a simplex and moving the simplex towards the optimum during an iterative procedure. The technique as originally proposed uses a regular simplex; however, allowing the simplex to be nonregular (points not equidistant) increases the power and efficiency of the optimization. The simplex is moved toward the optimum via three basic operations: reflection, contraction, and expansion. The process starts with a set of p + 1 initial parameter vectors and evaluates the function at each vertex.
In minimization the simplex is moved away from the vertex with the largest function value p_C by reflecting this vertex in the opposite face of the simplex. The new reflected vertex p_R is on the line joining the centroid of the points p_P and the vertex to be eliminated p_C:
p_R = p_P + α (p_P - p_C)
where α is the reflection coefficient. Expansion (p_R+) and contraction (p_R-) help to move the simplex in a valley, where the function value could be the same at the old and the new vertices.
[Figure: reflection of a simplex vertex through the opposite face.]
The convergence criterion generally used in simplex optimization requires the standard deviation of the function values at the p + 1 vertices to be less than a prespecified small value.
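A minimal usage sketch relying on SciPy's Nelder-Mead implementation of the simplex search; the objective function and starting point below are hypothetical.

import numpy as np
from scipy.optimize import minimize

def f(p):
    # simple two-parameter objective with minimum at (3, -2)
    return (p[0] - 3.0) ** 2 + (p[1] + 2.0) ** 2

res = minimize(f, x0=np.array([0.0, 0.0]), method="Nelder-Mead",
               options={"xatol": 1e-6, "fatol": 1e-6})
print(res.x, res.fun)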
b simulated annealing (SA) [OPTIM] Direct search optimization searching for the most probable configuration of parameters on the basis of simulating the evolution of a substance to thermal equilibrium. The distribution of configurations s is expressed by the Boltzmann distribution:
P(s) = exp[-C(s)/c] / Σ_w exp[-C(w)/c]
where C(s) is the function to be minimized, s and w are configurations and c is a control parameter. Starting with an initial configuration s of the parameters, another configuration r in the neighborhood of s is produced by modifying one randomly selected parameter. The new configuration is accepted with probability 1 if ΔC(r, s) ≤ 0, otherwise with probability
P = exp[-ΔC(r, s)/c]
This probability is compared to a random number from the uniform distribution [0,1] and the new configuration is accepted if P is larger than the random number. The iteration continues until convergence is reached. Then the control parameter c is lowered and the optimization continues with the new parameter. Generalized simulated annealing (GSA) and variable step size generalized simulated annealing (VSGSA) are improvements on SA.
b simulation [PROB] (: Monte Carlo simulation) Technique for imitating the random process of drawing samples from a predefined population in order to obtain estimates of the population parameters. Given a mathematical formula that cannot easily be evaluated by analytical reduction or by standard procedures of numerical analysis, it is often possible to find a stochastic process for generating statistical variables with frequency distributions that can be related to the mathematical formula. The simulation actually generates a sample, determines its empirical distribution and uses it in a numerical evaluation of the formula. A random series is a series of numbers drawn randomly from a distribution. It has an important role in simulation studies. Simulation is used to evaluate the behavior of a statistical method, to compare several similar statistical methods, or to solve mathematical problems arising from stochastic processes. The advantage of using simulation instead of real data sets is that in the former case the distribution of the underlying population is known.
b single linkage [CLUS] + hierarchical clustering (o agglomerative clustering)
b single regression model [REGR] + regression model
b single sampling [QUAL] + acceptance sampling
b single tail test [TEST] + hypothesis testing
b singular matrix [ALGE] + matrix
b singular value [ALGE] + matrix decomposition (o singular value decomposition)
b singular value decomposition (SVD) [ALGE] + matrix decomposition
b singular vector [ALGE] + matrix decomposition (o singular value decomposition)
b size difference coefficient [GEOM] + distance (o quantitative data)
b size of a test [TEST] + hypothesis testing
b skewness [DESC] Measure of asymmetry of a distribution around its location. For a symmetric distribution the mean, median and mode are equal, i.e. there is no skewness. If there is a longer tail on the right, i.e. the mean is larger than the median and the median is larger than the mode, there is positive skewness. If there is a longer tail on the left, i.e. the mean is smaller than the median and the median is smaller than the mode, there is negative skewness.
[Figure: density curves illustrating positive and negative skewness.]
A list of the most common skewness measures follows.
Bonferroni index A very sensitive measure, defined in terms of the deviations from the mean, weighted by the frequency distribution, with
s_i = f_i (x_i - x̄)    i = 1, n
B = 0 if the distribution is perfectly symmetric, and approaches 1 as the distribution becomes increasingly asymmetric. This index is based on the following relationship, which holds for symmetric ranked data:
x_i + x_{n-i+1} = constant = 2 x̄
coefficient of skewness Scaled difference between the arithmetic mean and the mode:
b_1j = (x̄_j - mode[x_j]) / s_j
where s_j is the standard deviation. In a p-dimensional space, the multivariate measure of skewness is defined as
b_{1,p} = (1/n^2) Σ_s Σ_t d_st^3    s, t = 1, n
where d_st is the square root of the Mahalanobis distance between objects s and t:
d_st = (x_s - x̄)^T S^-1 (x_t - x̄)
where S^-1 is the inverse covariance matrix. For p = 1, b_{1,p} = b_1.
Pearson's first index Scaled sum of the cubed differences from the arithmetic mean:
γ_1j = Σ_i (x_ij - x̄_j)^3 / (n s_j^3)
γ_1 = 0 for a symmetric distribution; γ_1 > 0 for a right-tailed distribution and γ_1 < 0 for a left-tailed distribution.
Pearson's second index Scaled difference between the arithmetic mean and the median:
γ_2j = 3 (x̄_j - median[x_j]) / s_j
quartile coefficient of skewness Measure based on the three quartiles:
(Q3 - 2 Q2 + Q1) / (Q3 - Q1)
where Q3 is the upper quartile, Q1 is the lower quartile and Q2 is the median.
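A minimal sketch computing the moment-based index, Pearson's second index and the quartile coefficient for a univariate sample, assuming NumPy is available.

import numpy as np

def skewness_measures(x):
    x = np.asarray(x, dtype=float)
    n, mean, med, s = x.size, x.mean(), np.median(x), x.std(ddof=0)
    gamma1 = np.sum((x - mean) ** 3) / (n * s ** 3)   # scaled sum of cubed deviations
    pearson2 = 3.0 * (mean - med) / s                 # Pearson's second index
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    quartile = (q3 - 2.0 * q2 + q1) / (q3 - q1)       # quartile coefficient
    return gamma1, pearson2, quartile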
b skip-lot sampling [QUAL] + acceptance sampling
b slicing [GRAPH] + interactive computer graphics (o animation)
b smooth multiple additive regression technique (SMART) [REGR] (o projection pursuit regression) Nonparametric multiple response nonlinear regression model that describes each response (usually) as a different linear combination of the predictor functions f_m. Each predictor function is taken as a smooth but otherwise unrestricted function (usually) of a different linear combination of the predictor variables. The model is:
y_k = ȳ_k + Σ_{m=1}^{M} q_mk f_m( Σ_j w_mj x_j ) + e_k
where q_mk and w_mj are linear coefficients for the predictor functions and for the predictor variables, respectively. The least squares solution is obtained by simultaneously estimating, in each component m = 1, M, the linear coefficients q and w and the nonlinear function f. The coefficients q_mk are estimated by univariate least squares regression, the coefficients w_mj by a Gauss-Newton step, and the function f_m by a smoother. The optimal number of components M is estimated by cross-validation.
b smoother [REGR] Function estimator which calculates the conditional expectation
f(x) = E[y | x]
There are two basic kinds of smoothers: the kernel smoother and the window smoother. The kernel smoother estimates the above conditional expectation at x_i by assigning weights to the points, fitting a weighted polynomial to all the points and taking the fitted response value at x_i. The largest weight is put at x_i and the rest of the weights decrease symmetrically as the points become further from x_i. A popular robust kernel smoother is the locally weighted scatter plot smoother. The window smoother can be considered as a special case of the kernel smoother in which all points within a certain interval (window) N_i around x_i have weight 1 and all points outside the interval have weight 0. According to the degree of the polynomial the smoother can be local averaging (degree zero), local linear fit (degree one), local quadratic fit (degree two), etc. The local averaging window smoother calculates the fitted value f̂_i as the average of the y values for those points with x values in an interval N_i around x_i:
f̂_i = ave{ y_j : x_j ∈ N_i }
The local averaging, although a commonly used technique, has some serious shortcomings. It does not reproduce a straight line if the x values are not equispaced and it has bad behavior at the boundaries. The local linear fit alleviates both problems. It calculates the smooth value f̂_i by fitting a straight line (usually by least squares) to the points in the interval N_i and taking the fitted response value at x_i. Higher degree polynomials can be fitted in a similar fashion.
[Figures: scatter plots of y vs. x (0 to 6) with fitted smooths; one panel is labelled span = 0.1.]
In window smoothers the key point is how to select the size of the interval, also called the span parameter. This parameter controls the trade-off between bias and variance of the estimated smooth. Increasing the span value, i.e. the size of the interval, increases the bias and decreases the variance. A larger span value makes the smooth less wiggly. Ideally, the optimal span value should be estimated via cross-validation. In contrast to the fixed span smoother, the adaptive smoother uses a span parameter that varies over the range of x. This smoother is preferable if the error variance or the second derivative of the underlying function changes over the range of x.
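A minimal sketch of a fixed-span window smoother (local averaging for degree zero, local polynomial fit otherwise), assuming NumPy is available; the span is interpreted here as a fraction of the x range.

import numpy as np

def window_smooth(x, y, span=0.3, degree=1):
    x, y = np.asarray(x, float), np.asarray(y, float)
    half = 0.5 * span * (x.max() - x.min())           # half-width of the window
    fitted = np.empty_like(y)
    for i, xi in enumerate(x):
        in_win = np.abs(x - xi) <= half               # points inside N_i
        if degree == 0 or in_win.sum() <= degree:
            fitted[i] = y[in_win].mean()              # local averaging
        else:
            coef = np.polyfit(x[in_win], y[in_win], degree)
            fitted[i] = np.polyval(coef, xi)          # local polynomial fit
    return fitted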
b soft independent modeling of class analogy (SIMCA) [CLAS] Parametric classification method designed to deal with a low object-variable ratio. Each class is represented by a principal component model, usually of fewer components than the original dimensionality, and the classification rule is based on the distances of objects from these class models. These object-class distances are calculated as squared residuals:
e_ig^2 = || x_i - c_g - Σ_{m=1}^{M} t_img l_mg ||^2    i = 1, n    g = 1, G
where i is the object, g is the class and m is the component index, c_g is the class centroid, t_img denotes the principal component scores and l_mg the corresponding loadings, in the mth component and gth class. The optimal number of components, M, is determined for each class separately by double cross-validation. This procedure results in principal component models which are optimal for describing the within-class similarities but not necessarily optimal for discriminating among classes.
[Figure: SIMCA class models (class boxes) in a three-dimensional variable space.]
These class models define unbounded M-dimensional subspaces in the p-dimensional pattern space. In order to delimit the models, i.e. to create class boxes, normal ranges are defined using the class residual standard deviations s_g.
SIMCA calculates both modeling and classification power for each variable based on the residuals. Similar to RDA and DASCO, SIMCA can be viewed as a modification of quadratic discriminant analysis where the class covariance matrices are estimated by a truncated principal component representation.
b soft model [MODEL] + model
b Sokal-Michener coefficient [GEOM] + distance (o binary data)
b Sokal-Sneath coefficient [GEOM] + distance (o binary data)
b Sorenson coefficient [GEOM] + distance (o binary data)
b spanning tree [MISC] + graph theory
b Spearman's ρ coefficient [DESC] + correlation
b specific factor [FACT] : unique factor
b specific variance [FACT] + factor analysis
b specification limits [QUAL] + lot
b specificity of classification [CLAS] + classification
b spectral decomposition [ALGE] + matrix decomposition
b spectral density function [TIME] (: spectrum) Function of a stationary time series x(t), t = 1, n defined as:
f(ω) = γ(0) + 2 Σ_t γ(t) cos(t ω)
or in normalized form:
f(ω)/γ(0) = 1 + 2 Σ_t ρ(t) cos(t ω)
where γ(t) is the autocovariance function and ρ(t) is the autocorrelation function, and 0 ≤ ω ≤ π. Its integrated form is called the spectral function:
F(ω) = ∫_0^ω f(θ) dθ
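A minimal sketch evaluating f(ω) from the sample autocovariances of a series, assuming NumPy is available; truncating the sum at a maximum lag is an assumption made here for stability.

import numpy as np

def spectral_density(x, omegas, max_lag=None):
    x = np.asarray(x, float) - np.mean(x)
    n = x.size
    if max_lag is None:
        max_lag = n - 1
    # sample autocovariances gamma(t), t = 0..max_lag
    gamma = np.array([np.sum(x[:n - t] * x[t:]) / n for t in range(max_lag + 1)])
    t = np.arange(1, max_lag + 1)
    return np.array([gamma[0] + 2.0 * np.sum(gamma[1:] * np.cos(t * w))
                     for w in np.atleast_1d(omegas)])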
spectral function [TIME] + spectral density function b
b spectral map analysis (SMA) [MULT] Dimension reduction and display technique related to biplot and correspondence factor analysis. It was developed for the graphical analysis of drug contrasts. Contrast is defined here as the logarithm of an activity ratio (specificity) in proportion to its mean. The word spectra here refers to the activity spectra of drugs, i.e. the logarithm of activities in various tests. The map of compounds described by their activity spectra is obtained after special scaling.
b spectrum [ALGE] + eigenanalysis
b spectrum [TIME] : spectral density function
b spherical data [PREP] + data
b spline [REGR] Function estimate obtained by fitting piecewise polynomials. The x range is split into intervals separated by so-called knot locations. A polynomial is fitted in each interval, with the constraint that the function be continuous at the knot locations. The integral and derivative of a spline is also a spline, of one degree higher or lower, often also with a continuity constraint. The degree of a spline can range from zero to very high; however, the first-, second-, and third-degree splines are of more practical use.
[Figure: piecewise polynomial (spline) fit to scatter data over 0 ≤ x ≤ 6, with three knot locations marked.]
A spline is defined by its degree, by the number of knot locations, by the position of the knots and by the coefficients of the polynomial fitted in each interval. A spline of degree m with N knot locations (t_k, k = 1, N) can be written in a general form as:
y_i = Σ_{j=0}^{m} b_0j x_i^j + Σ_{k=1}^{N} Σ_{j=0}^{m} b_kj (x_i - t_k)_+^j + e_i
where x_i^j and (x_i - t_k)_+^j are called basis functions. The notation (.)_+ means the positive part, i.e.
(x_i - t_k)_+^j = (x_i - t_k)^j    if x_i > t_k
(x_i - t_k)_+^j = 0                if x_i ≤ t_k
This representation casts the spline as an ordinary regression equation. The coefficients b_0j and b_kj are estimated by minimizing the least squares criterion. Depending on the continuity requirements at the various knot locations, not all of the above basis functions are present in a spline, i.e. some of the b_kj coefficients are zero. A frequently used spline of degree m with N knots and with continuity constraints on the function and on its derivatives up to degree m - 1 has the form:
y_i = Σ_{j=0}^{m} b_j x_i^j + Σ_{k=1}^{N} b_k (x_i - t_k)_+^m + e_i
The number of coefficients in such a spline is m + N + 1. There are several equivalent basis function representations of the same spline. Another form of the above spline is:
y_i = b_0 + Σ_{k=1}^{N} b_k [s_k (x_i - t_k)]_+^m + e_i
where s_k is either +1 or -1. In fitting a spline one must select the degree m, and the number and location of the knots. The degree of the spline is sometimes fixed a priori. The number and location of the knots are either fixed or variable. Splines with variable knot locations are called adaptive splines. Adaptive splines offer a more flexible function approximation than splines with fixed knot locations. The bias-variance trade-off is controlled by the degree of the fitted polynomial and the number of knots. Increasing the degree and the number of knots increases the variance and decreases the bias of the spline. Ideally, one should estimate the optimal degree m, the optimal number of knots N and the optimal knot locations by cross-validation to obtain the best predictive spline.
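A minimal sketch of fitting the truncated power basis spline above by ordinary least squares, assuming NumPy is available; the degree and knot locations are taken as fixed inputs.

import numpy as np

def spline_basis(x, knots, degree=3):
    # basis functions: 1, x, ..., x^m and (x - t_k)_+^m for each knot t_k
    x = np.asarray(x, float)
    cols = [x ** j for j in range(degree + 1)]
    cols += [np.clip(x - t, 0.0, None) ** degree for t in knots]
    return np.column_stack(cols)                      # m + N + 1 columns

def fit_spline(x, y, knots, degree=3):
    B = spline_basis(x, knots, degree)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)      # least squares coefficients
    return lambda x_new: spline_basis(x_new, knots, degree) @ coef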
b spline partial least squares regression (SPLS) [REGR] + partial least squares regression
b split-plot design [EXDE] + design
b spurious correlation [DESC] + correlation
b square matrix [ALGE] + matrix
b square root transformation [PREP] + transformation
b square transformation [PREP] + transformation
b stability of clustering [CLUS] + assessment of clustering
b stacking [GRAPH] + dot plot
b stagewise regression [REGR] + variable subset selection
b standard addition method (SAM) [REGR] Calibration procedure used in chemistry to correct for matrix effects. The chemical sample is divided into several equal-volume aliquots and increasing amounts of standards are added to all but one aliquot. Each aliquot is diluted to the same volume, a response y_i is measured and plotted as a function of x_i, the amount of standard added. The regression model is:
y_i = b_1 (θ + x_i) + e_i    i = 1, n
where θ denotes the unknown amount of the analyte. The intercept is the response for the aliquot without standard addition:
b_0 = b_1 θ
therefore the unknown amount of analyte is given by b_0/b_1. The key assumption is that the linearity of the model holds over the range of the calibration including zero response. SAM cannot be used when spectral interferences are present. The generalized standard addition method (GSAM) is the multivariate extension of SAM used to correct for spectral interferences and matrix effects simultaneously. The key equations are
r_0 = K^T c_0    ΔR = ΔC K
where ro is the response vector of p sensors, co is the concentration vector of n analytes, K is the n x p calibration matrix, and AR and AC are changes in response and concentration, respectively, for m standard additions.
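A minimal numerical sketch of univariate SAM, assuming NumPy is available; the added amounts and responses below are hypothetical and only illustrate the estimate θ = b_0/b_1.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])    # amounts of standard added (hypothetical)
y = np.array([2.1, 3.0, 4.2, 5.1, 6.0])    # measured responses (hypothetical)

b1, b0 = np.polyfit(x, y, 1)               # fit y = b0 + b1*x
theta = b0 / b1                            # estimated unknown amount of analyte
print(theta)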
b standard deviation [DESC] + dispersion
b standard deviation chart [QUAL] + control chart (o variable control chart)
b standard deviation of error of calculation (SDEC) [MODEL] + goodness of fit
b standard deviation of error of prediction (SDEP) [MODEL] + goodness of prediction
b standard error [MODEL] + goodness of fit
b standard error of estimate [ESTIM] Standard deviation of an estimated value. For example, the standard error of the mean (SEM) calculated from n observations is s/√n, where s is the standard deviation of the n observations. The standard errors of the estimated regression coefficients b and of the estimated response ŷ_i in OLS are
SE(b_j) = s sqrt{ [(X^T X)^-1]_jj }
SE(ŷ_i) = s sqrt{ x_i^T (X^T X)^-1 x_i }
where s is the residual standard deviation.
b standard error of the mean (SEM) [ESTIM] + standard error of estimate
b standard order of runs [EXDE] : Yates order
b standard score [PREP] + standardization (o autoscaling)
b standardization [PREP] Simple transformation of the elements of a data matrix. It can be performed columnwise (called variable standardization), rowwise (called object standardization), both ways (called double standardization), or elementwise (called global standardization). Variable standardization results in variables which are independent of the unit of measurement. Scale variant estimators are greatly influenced by the previously performed standardization. Object standardization often results in closed data. The most common standardization procedures follow.
o autoscaling One of the most common column standardizations, composed of a column centering and a column scaling:
x'_ij = (x_ij - x̄_j) / s_j
The mean of an autoscaled variable is 0 and the variance is 1. An autoscaled variable is often simply called a standardized variable, its value is called the z-score or standard score.
o centering Scale shift (translation) by subtracting a constant (the mean), resulting in zero mean of the standardized elements. Centering can be:
- row centering:    x'_ij = x_ij - x̄_i        x̄_i = Σ_j x_ij / p
- column centering: x'_ij = x_ij - x̄_j        x̄_j = Σ_i x_ij / n
- double centering: x'_ij = x_ij - x̄_i - x̄_j + x̄
- global centering: x'_ij = x_ij - x̄          x̄ = Σ_i Σ_j x_ij / (n p)
logarithmic scaling Scale shift based on a logarithmic transformation followed by column centering to mitigate extreme differences between variances:
x'_ij = log(x_ij) - Σ_i log(x_ij) / n
maximum scaling Column standardization where each value is divided by the maximum value of its column:
x'_ij = x_ij / max_i(x_ij)
All the values in a maximum scaled variable have an upper limit of one.
o profile Standardization that results in unit sum or unit sum of squares of the standardized elements. The profiles can be

row profile:  x'_ij = x_ij / Σ_j x_ij
normalized row profile:  x'_ij = x_ij / √(Σ_j x_ij²)
column profile:  x'_ij = x_ij / Σ_i x_ij
normalized column profile:  x'_ij = x_ij / √(Σ_i x_ij²)
global profile:  x'_ij = x_ij / Σ_i Σ_j x_ij
normalized global profile:  x'_ij = x_ij / √(Σ_i Σ_j x_ij²)
o range scaling Column standardization where each value in the column is centered by the minimum value of the column L_j and divided by the range of the column U_j − L_j:

x'_ij = (x_ij − L_j) / (U_j − L_j)

In a range-scaled variable all values lie between 0 and 1. Range scaling where the values of the variable are expanded or compressed between prespecified limits A_j (lower) and B_j (upper) is called generalized range scaling:

x'_ij = A_j + (B_j − A_j)(x_ij − L_j) / (U_j − L_j)
o scaling Scale expansion (contraction) by dividing by a constant (the standard deviation) that results in unit variance of the standardized elements. Scaling can be

row scaling:  x'_ij = x_ij / s_i
column scaling:  x'_ij = x_ij / s_j   where  s_j = √[Σ_i (x_ij − x̄_j)² / n]
global scaling:  x'_ij = x_ij / s   where  s = √[Σ_i Σ_j (x_ij − x̄)² / (n p)]
► standardized linear combination (SLC) [ALGE] Linear combination in which the sum of the squared coefficients is equal to one, i.e. the coefficient vector has unit length. For example, principal components calculated from a correlation matrix are standardized linear combinations.
► standardized regression coefficient [REGR] → regression coefficient
► standardized residual [REGR] → residual
► standardized residual plot [GRAPH] → scatter plot (o residual plot)
► standardized variable [PREP] → variable
► star point [EXDE] (o composite design) → design
► star symbol [GRAPH] → graphical symbol
► state-space model [TIME] (: Bayesian forecast, dynamic linear model, Kalman filter) Linear model in which the parameters are not constant, but change in time. The linear equation, called the observational equation, is:

y(t) = b(t)^T x(t) + e(t)

The response y(t) is a quantity observed in time, x(t) is the known predictor vector, and e(t), the error term, is a white noise process. The parameter vector b(t), called the state vector, is a time series described by the state equation:

b(t) = G b(t − 1) + c a(t)

where a(t) is a white noise process independent of e(t), and G and c are coefficients.
► stationarity [TIME] The phenomenon that the probabilistic structure of a time series x(t) does not change with time. In practice it implies that the mean and the variance of the series are independent of time, and that the covariance depends only on the separation in time. Stationarity allows replication within a time series, thus making formal inference possible.
► stationary process [TIME] → stochastic process
► stationary time series [TIME] → stochastic process (o stationary process)
► statistic [DESC] Numerical summary of a set of observations; a particular value of an estimator. If the observations are regarded as a sample from a population, then the calculated statistic is taken to be an estimate of the population parameter. For example, the arithmetic mean of a set of observations can be used as an estimate of the population mean.
► statistical distribution [PROB] : distribution
► statistical process control (SPC) [QUAL] → control chart
► statistical quality control [QUAL] : quality control
► statistics [DESC] A branch of mathematics concerned with collecting, organizing, analyzing and interpreting data. There are two major problems in statistics: estimation and hypothesis testing. In the case of inference statistics the data set under consideration is a sample from a population, the calculated statistic is taken as an estimate of the population parameter, and the conclusions about the properties of the data set are translated to the underlying population. In contrast, the goal of descriptive statistics is simply to analyze, model or summarize the available data without further inference. A data set can be described by frequency, location, dispersion, skewness, kurtosis, quantiles. The relationship between two variables can be described by association, correlation, covariance. Scatter matrix, correlation matrix, covariance matrix and multivariate dispersion are characteristics of multivariate data. Statistics may also be divided into theoretical statistics and data analysis.
b steepest ascent optimization [OPTIM] + steepest descent optimization
steepest descent optimization [OPTIM] Gradient oDtimization that minimizes a function f by estimating the optimal parameter values following the negative gradient direction. The iterative procedure starts with an initial guess for the p parameter values PO. In each iteration i one calculates the gradient, i.e. the partial first derivatives of the function with respect to the parameters: b
gi = (af/apli, a f / a p ~ i*, - - af/appi) 9
and a new set of parameter values is obtained as:
where the step size Si is determined by a linear search optimization. Steepest descent is a gradient optimization method where Ai is the identity matrix 1. Moving along the negative gradient direction ensures that the function value decreases at the fastest rate. However, this is only a local property, so frequent changes of direction are often necessary, making the convergence very slow, hence the optimization is quite inefficient. The method is sensitive to small perturbations in direction and step size, so these must be computed to high precision. The main problem with steepest descent is that the second derivatives describing the curvature of the function near the minimum are not taken into account. Because of its drawbacks this optimization is seldom used nowadays. The opposite procedure, which maximizes a function by searching for the optimal parameter values along the positive gradient, is called the steepest ascent optimization. b stem-and-leaf diagram [GRAPH] Part graphical, part numerical display of a univariate distribution. As in the histogram, the range of the data is partitioned into intervals. These intervals are established by first writing all possible leading digits in the range of the data to the left of a vertical line. Each object in the interval then is represented by its trailing digit, written to the right of the vertical line. The leading digits on the left form the stem and the trailing digits on the right are called the leaves.
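As a minimal illustration of the procedure described above (added here, not from the original text), the following Python sketch minimizes an invented quadratic function by steepest descent, with a fixed step size standing in for the line search:

import numpy as np

def f(p):
    # Invented test function: an elliptical bowl with minimum at (3, -2).
    return (p[0] - 3.0) ** 2 + 5.0 * (p[1] + 2.0) ** 2

def grad_f(p):
    # Partial first derivatives of f with respect to the parameters.
    return np.array([2.0 * (p[0] - 3.0), 10.0 * (p[1] + 2.0)])

p = np.zeros(2)          # initial guess p0
step = 0.05              # fixed step size (a line search would tune this per iteration)
for _ in range(500):
    g = grad_f(p)
    if np.linalg.norm(g) < 1e-8:   # stop when the gradient is numerically zero
        break
    p = p - step * g               # move along the negative gradient direction

print(p)   # approaches the minimum at [3, -2]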
► stem-and-leaf diagram [GRAPH] Part graphical, part numerical display of a univariate distribution. As in the histogram, the range of the data is partitioned into intervals. These intervals are established by first writing all possible leading digits in the range of the data to the left of a vertical line. Each object in the interval is then represented by its trailing digit, written to the right of the vertical line. The leading digits on the left form the stem and the trailing digits on the right are called the leaves.

DATA: 100 102 111 115 117 120 129 131 133 133 141 143 144 144 144 145 152 158 158 159 160 163 164 172 181 186 195

stem | leaf
 10  | 0 2
 11  | 1 5 7
 12  | 0 9
 13  | 1 3 3
 14  | 1 3 4 4 4 5
 15  | 2 8 8 9
 16  | 0 3 4
 17  | 2
 18  | 1 6
 19  | 5

This is a compact way to record the data, while also giving visual information about the shape of the distribution. The length of each row represents the density of objects in the corresponding interval. It is often necessary to change measurement units or to ignore some digits to the right.
► step direction [OPTIM] → optimization
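A short Python sketch (illustrative only, using the data listed above) that groups trailing digits under their leading digits to build the display:

from collections import defaultdict

data = [100, 102, 111, 115, 117, 120, 129, 131, 133, 133, 141, 143, 144, 144,
        144, 145, 152, 158, 158, 159, 160, 163, 164, 172, 181, 186, 195]

# Group trailing digits (leaves) under their leading digits (stems).
stems = defaultdict(list)
for value in sorted(data):
    stems[value // 10].append(value % 10)

for stem in sorted(stems):
    print(f"{stem:3d} | {' '.join(str(leaf) for leaf in stems[stem])}")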
► step size [OPTIM] → optimization
► stepwise linear discriminant analysis (SWLDA) [CLAS] → discriminant analysis
► stepwise regression (SWR) [REGR] → variable subset selection
► stochastic model [MODEL] → model
► stochastic process [TIME] (: random process) Random phenomenon that can be described by at least one random variable x(t), where t is a parameter belonging to an index set T. Usually t is interpreted as time, but it can also refer to a distribution in space. The process can be either continuous or discontinuous. Random walk and white noise are special stochastic processes. A list of the most important stochastic processes follows.

counting process Integer-valued, continuous stochastic process N(t) of a series of events, in which N(t) represents the total number of occurrences of the event in the time interval (0, t). If the time intervals between successive occurrences (interarrival times) are i.i.d. random variables, the process is called a renewal process. If these time intervals follow an exponential distribution, the process is called a Poisson process. If the series of occurrences are repeated trials with two outcomes (e.g. success or failure), the process is called a Bernoulli process.

ergodic process Stochastic process in which the time average of a single record x(t) is approximately equal to the ensemble average of x(t). The ergodic property of a stochastic process is commonly assumed to be true in engineering and the physical sciences; therefore parameters may be estimated from the analysis of a single record.

independent increment process Stochastic process in which the quantities x(t+1) − x(t) are statistically independent.

Markov process Stochastic process in which the conditional probability distribution at any point x(t) depends only on the immediate past value x(t − 1), but is independent of the history of the process prior to t − 1. A Markov process having discrete states is called a Markov chain, while a Markov process with continuous states is called a diffusion process.

narrow-band process Stationary stochastic process continuous in time and state

x(t) = A(t) cos[c t + φ(t)]

where c is a constant, A(t) is the amplitude and φ(t) is the phase of the process. A stochastic process that does not satisfy this condition is called a wide-band process.

normal process Stochastic process in which at any given time t the random variable x(t) is normally distributed.

shot noise process Stochastic process induced by a sequence of impulses applied to a system at random time points t_n:

x(t) = Σ_{n=1}^{N(t)} A_n w(t, t_n)

where w(t, t_n) is the response of the system at time t resulting from an impulse A_n at time t_n, and N(t) is a counting process with interarrival times t_n.

stationary process Stochastic process with stationarity. A process that does not satisfy stationarity is called an evolutionary process. A time series that is a stationary process is called a stationary time series.

Wiener-Levy process Stationary independent increment process in which every independent increment is normally distributed, the average value of x(t) is zero, and x(0) = 0. The most common Wiener-Levy process is the Brownian motion process. It is also widely used in other fields such as quantum mechanics and electric circuits.
► stochastic variable [PREP] → variable
► strategy [MISC] → game theory
► stratified sampling [PROB] → sampling
► Studentized residual [REGR] → residual
► Studentized residual plot [GRAPH] → scatter plot (o residual plot)
► Student's t distribution [PROB] → distribution
► subdiagonal element [ALGE] → matrix
► submatrix [ALGE] → matrix operation (o partitioning of a matrix)
► subsampling [PROB] → sampling (o cluster sampling)
► subtraction of matrices [ALGE] → matrix operation
► sufficient estimator [ESTIM] → estimator
► sum of squares in ANOVA (SS) [ANOVA] Column in the analysis of variance table containing the squared deviations of the observations from the grand mean or from an effect mean, summed over the observations. It is customary to indicate summation over an index by replacing that index with a dot. For example, in a one-way ANOVA model the effect of treatment A_i is calculated as:

A_i = ȳ_i· = y_i· / K = Σ_k y_ik / K

where K is the number of observations at level i, and the sum of squares associated with the effect A is:

SS_A = K Σ_i (ȳ_i· − ȳ··)²

► sum of squares linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
► superdiagonal element [ALGE] → matrix
► supervised learning [MULT] → pattern recognition
► supervised pattern recognition [MULT] → pattern recognition
► survival function [PROB] → random variable
► sweeping [ALGE] → Gaussian elimination
► symmetric design [EXDE] → design
► symmetric distribution [PROB] → random variable
► symmetric matrix [ALGE] → matrix
► symmetric test [TEST] → hypothesis testing
► synapse [MISC] → neural network
► systematic distortion [ESTIM] : bias
► systematic sampling [PROB] → sampling
T

► t distribution [PROB] → distribution
► t test [TEST] → hypothesis test
► Taguchi method [QUAL] Quality control approach suggesting that statistical testing of a product should be carried out at the design level, called off-line quality control, in order to make the product robust against variations in manufacturing. This proposal is different from traditional on-line quality control, such as acceptance sampling and statistical process control. The Taguchi method is based on minimizing the variability of the product or process, either by keeping the quality on a target value or by optimizing the output. The quality is measured by statistical variability, for example mean squared error or standard deviation, rather than by percentage of defects or other criteria based on control limits. Taguchi makes a distinction between variables that can be controlled and noise variables. He suggests systematically including the noise variables in the parameter design. The variables in the parameter design can be classified into two groups: the ones that affect the mean response and the ones that affect the variability of the response.
► Tanimoto coefficient [GEOM] → distance (o binary data)
► target rotation [FACT] → factor rotation
► target transformation factor analysis (TTFA) [FACT] → factor rotation
► taxi distance [GEOM] → distance (o quantitative data)
► tensor [ALGE] Mathematical object, a generalization of the vector relative to a local Euclidean space, that possesses a specified system of components for every coordinate system that changes under a transformation of coordinates. The simplest tensors are building blocks of linear algebra: zero-order tensors are scalars, first-order tensors are vectors, and second-order tensors are matrices.
► term in ANOVA [ANOVA] Categorical predictor variable in the analysis of variance model. There are two kinds of terms: a main effect term and an interaction term. A main effect term, also called an effect, is a measurable or observable quantity that affects the outcome of the observations. It is measured on a nominal scale, i.e. it assumes categorical values, called levels. The effects are commonly denoted by consecutive upper case letters (A, B, C, etc.), the levels are indicated by a lower case index, and the number of levels by the corresponding upper case letter:

A_i   i = 1, ..., I        B_j   j = 1, ..., J

An interaction term consists of more than one effect, and takes on values corresponding to the possible combinations of the levels of the effects, also called a treatment. There are two ways to combine effects. In a crossed effect term all combinations of levels are possible. For example, if effect A_i has two levels and effect B_j has four levels, then the term AB_ij has eight levels: (1,1) (1,2) (1,3) (1,4) (2,1) (2,2) (2,3) (2,4). In a nested effect term the level combinations are restricted. For example, the same B effect nested in A results in the term B(A)_j(i) with four levels: (1,1) (1,2) (2,3) (2,4). An effect that has a fixed number of levels I is called a fixed effect. The corresponding main effect term in the ANOVA model is not considered to be a random variable. An interaction term that contains only fixed effects is also fixed, i.e. it is not random. A model containing only fixed effects is called a fixed effect model, or a first kind model or model I. In this model the treatment effects A_i, B_j, AB_ij, etc. are defined as deviations from the grand mean, therefore:

Σ_i A_i = 0        Σ_j B_j = 0        Σ_i Σ_j AB_ij = 0

In this model conclusions from testing H0: A_i = 0 apply only to the I levels included in the model. An effect that has a large number of possible levels from which I levels have been randomly selected is called a random effect. The corresponding main effect term in the ANOVA model is a random variable. All interaction terms that contain random effects are also considered to be random. A model containing only random effects is called a random effect model, or a second kind model or model II. The variance of a random effect term (and of the error term) is called a variance component or component of variance, denoted as σ²_A, σ²_B, σ²_AB, etc. In a random effect model conclusions from testing H0: σ²_A = 0 apply beyond the effect levels in the model, i.e. an inference is drawn about the entire population of effect levels. A model containing both fixed effects and random effects is called a mixed effect model.
► terminal node [MISC] → graph theory (o digraph)
► test [TEST] : hypothesis test
► test set [PREP] → data set
► test statistic [TEST] → hypothesis testing
► theoretical distribution [PROB] → random variable
► theoretical variable [PREP] → variable
► three-dimensional motion graphics [GRAPH] → interactive computer graphics
► tied rank [DESC] → rank
► time series [TIME] Set of observations x(t), ordered in time, where t indicates the time when x(t) was taken. The observations are often equally spaced in time. In the case of multivariate observations the scalar x(t) is replaced by a vector x(t). It is assumed that the observations are realizations of a stochastic process, x(t) is a random variable, and the observations made at different time points are statistically dependent. The multivariate joint distribution is described by the mean function, autocovariance function, cross-covariance function, and spectral density function. A time series can be written as a sum of four components: trend, fluctuation about the trend, seasonal component, and random component. Transforming one time series x(t) into another time series y(t) is called filtering. The simplest is the linear filter:

y(t) = a x(t)

► time series analysis (TSA) [TIME] Analysis of time series, i.e. of series of data collected as a function of time. A time series model is a mathematical description of such data. The goal of the analysis is twofold: modeling the stochastic mechanism that gives rise to an observed series and forecasting (predicting) future values of the series on the basis of its history. Another common objective is to monitor the series and to detect changes in a trend.
► time series model [TIME] Mathematical description of a time series. It is composed of two parts: one contains past values of the time series

x(t − j)        j = 0, ..., p

and the other one contains terms of a white noise process

a(t − i)        i = 0, ..., q

The parameters p and q define the complexity of the model; they are indicated in parentheses after the model name. The most commonly used models are the following.

autoregressive integrated moving average model (ARIMA) Model that can be used when a time series is nonstationary, i.e. μ(t) is not constant in time. The ARIMA(p, 1, q) model written in difference equation form is:

x(t) − x(t − 1) = Σ_j b_j [x(t − j) − x(t − j − 1)] + a(t) − Σ_i c_i a(t − i)

The simplest ARIMA models are the IMA(1,1) model:

x(t) − x(t − 1) = a(t) − c a(t − 1)

and the ARI(1,1) model:

x(t) − x(t − 1) = b [x(t − 1) − x(t − 2)] + a(t)

The ARIMA model can also be written for differences between other than neighboring points, denoted ARIMA(p, d, q). For example, ARIMA(p, 2, q) is a model for x(t) − x(t − 2). The dth difference of ARIMA(p, d, q) is a stationary ARMA(p, q) model.

autoregressive model (AR) Model in which each point is represented as a linear combination of the p most recent past values of itself plus a white noise term which is independent of all x(t − j) values:

x(t) = Σ_j b_j x(t − j) + a(t)

The simplest AR(p) is AR(1), a first-order model:

x(t) = b x(t − 1) + a(t)

with autocovariance and autocorrelation functions

γ(0) = σ_a² / (1 − b²)        γ(t) = b γ(t − 1)        ρ(t) = b^t

For a higher-order model, assuming stationarity and zero means, the autocorrelation and autocovariance functions are defined by the Yule-Walker equations:

γ(t) = Σ_j b_j γ(t − j)        ρ(t) = Σ_j b_j ρ(t − j)

autoregressive moving average model (ARMA) The ARMA(p, q) model is a mixture of AR(p) and MA(q) models:

x(t) = Σ_j b_j x(t − j) + a(t) − Σ_i c_i a(t − i)

Each point is represented as a linear combination of past values of itself, and of past and present terms of a white noise process. The simplest ARMA model is the ARMA(1,1):

x(t) = b x(t − 1) + a(t) − c a(t − 1)

with autocovariance and autocorrelation functions

γ(0) = (1 − 2bc + c²) σ_a² / (1 − b²)        γ(1) = b γ(0) − c σ_a²        γ(t) = b γ(t − 1)
ρ(t) = (1 − bc)(b − c) b^(t−1) / (1 − 2bc + c²)

Box-Jenkins model : autoregressive moving average model

moving average model (MA) Model, based on moving averages, in which each point is represented as a weighted linear combination of present and past terms of a white noise process:

x(t) = a(t) − Σ_i c_i a(t − i)

The simplest MA(q) is MA(1), a first-order model:

x(t) = a(t) − c a(t − 1)

with autocovariance and autocorrelation functions

γ(0) = σ_a² (1 + c²)        γ(1) = −c σ_a²        ρ(1) = −c / (1 + c²)        γ(t) = ρ(t) = 0  for t ≥ 2
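For illustration (not from the original text), a Python sketch that simulates an AR(1) series with an invented coefficient and checks that the sample lag-1 autocorrelation is close to the theoretical value ρ(1) = b:

import numpy as np

rng = np.random.default_rng(1)
n, b = 2000, 0.7                      # series length and AR(1) coefficient
a = rng.normal(size=n)                # white noise a(t)

# Simulate the AR(1) model x(t) = b*x(t-1) + a(t)
x = np.zeros(n)
for t in range(1, n):
    x[t] = b * x[t - 1] + a[t]

# Sample lag-1 autocorrelation; for AR(1) the theoretical value is rho(1) = b
x0 = x - x.mean()
rho1 = (x0[1:] @ x0[:-1]) / (x0 @ x0)
print(f"sample rho(1) = {rho1:.3f}, theoretical = {b}")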
► time series plot [GRAPH] → scatter plot
► tolerance limits [QUAL] → lot
► top-down induction decision tree method (TDIDT) [CLAS] : classification tree method
► total calibration [REGR] → calibration
► total sum of squares (TSS) [MODEL] Sum of squared differences between the observed values and their mean:

TSS = Σ_i (y_i − ȳ)²

TSS can be partitioned into two components: the model sum of squares, MSS, and the residual sum of squares, RSS:

TSS = MSS + RSS = Σ_i (ŷ_i − ȳ)² + Σ_i (y_i − ŷ_i)²

TSS is an estimate of the variability of y_i without any model.
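A small numerical illustration in Python (invented data, added here) showing that the decomposition TSS = MSS + RSS holds for an OLS fit with an intercept:

import numpy as np

x = np.arange(10.0)
y = 2.0 + 0.5 * x + np.random.default_rng(2).normal(scale=0.3, size=10)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)
mss = np.sum((y_hat - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)
print(tss, mss + rss)   # the two numbers agree up to rounding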
► total variation [DESC] → multivariate dispersion
► trace of a matrix [ALGE] → matrix operation
► training - evaluation set split [MODEL] → model validation
► training set [PREP] → data set
► transfer function [MISC] → neural network
► transformation [PREP] Mathematical function for transforming the original values of a variable x to new values x':

x' = f(x)

Transformations are used to:
- stabilize the variance;
- linearize the relationship among variables;
- normalize the distribution;
- represent results on a more convenient scale;
- mitigate the influence of outliers.
A transformation can be considered as a correction for undesired characteristics of a variable, such as heteroscedasticity, nonnormality, nonadditivity, or a nonlinear relationship with other variables. The following are the most common transformations.

angular-linear transformation Transformation that converts an ordinary (linear) variable x into an angular variable t:

t = 360° x / k

where k is the number of units in a full cycle. For example, the time 06:00 can be transformed to 90° (one-fourth of a cycle) with k = 24.

linear transformation Transformation that can be interpreted as a roto-translation of the original variable:

x' = a + b x

where a is the translation term (or intercept) and b is the rotation term (or slope). There are two subcases: pure rotation (a = 0) and simple translation of the origin (b = 1).

logarithmic transformation Transformation that changes multiplicative behavior into additive behavior. For example, in regression a nonlinear relationship can be changed into a linear one:

x' = log_a(k + x)

where the logarithmic base a is usually 10 or e, and k is an additive constant (often 1) to remove zero values. It is appropriate when the standard deviation of the original values is proportional to their mean (i.e. the coefficient of variation is constant).

logit transformation Transformation of percentages, proportions or count ratios (variables with values between zero and 1) in order to obtain a logit scale, where values range between about −3 (for p ≈ 0.05) and +3 (for p ≈ 0.95):

p' = ln[p / (1 − p)]

metameric transformation Transformation of the values of a dose or response variable into a dimensionless scale −1, 0 and +1. The transformed values are called metameters. It is often used to simplify the analysis of the dose-response relationship.

probit transformation Transformation (abbreviation of probability unit) to make negative values very rare in a standard normally distributed variable by adding 5 to each original value:

x' = x + 5.0

rankit transformation Transformation for rank ordering of a quantitative variable:

x' = rank(x)

reciprocal transformation

x' = 1/x

square root transformation Transformation especially applied to variables from the Poisson distribution, such as count data. In this case the variance is proportional to the mean and can be stabilized by using:

x' = √(x + 0.5)    or    x' = √(x + 3/8)

square transformation Transformation particularly useful for correcting a skewed distribution:

x' = x²

trigonometric transformation Transformations using trigonometric functions:

x' = sin(x)    x' = cos(x)    x' = tan(x)
x' = arcsin(x)    x' = arccos(x)    x' = arctan(x)

The arcsine transformation is often used to stabilize the variance, i.e. to make it close to constant for different populations from which x has been drawn. It is particularly used on percentages or proportions p to change their (usually) binomial distribution into a nearly normal distribution:

p' = arcsin(√p)

The behavior of this transformation is not optimal at the extreme values; this can be improved by using a modified form of the transformation.
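For illustration, a short Python sketch (invented values, not from the original text) applying some of the transformations listed above:

import numpy as np

p = np.array([0.05, 0.20, 0.50, 0.80, 0.95])   # invented proportions
x = np.array([0, 1, 4, 9, 16], dtype=float)    # invented counts

log_x    = np.log10(x + 1.0)          # logarithmic transformation with k = 1
logit_p  = np.log(p / (1.0 - p))      # logit transformation
arcsin_p = np.arcsin(np.sqrt(p))      # arcsine transformation of proportions
sqrt_x   = np.sqrt(x + 0.5)           # square root transformation for count data

print(logit_p.round(2))   # roughly -2.94 ... +2.94, i.e. about -3 to +3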
► transformation matrix [ALGE] : rotation matrix
► transpose of a matrix [ALGE] → matrix operation
► treatment [EXDE] → factor
► treatment structure [EXDE] → design
► tree [MISC] → graph theory
► tree symbol [GRAPH] → graphical symbol
► trend [TIME] Long-term movement in a time series, a smooth, relatively slowly changing component. It is a nonrandom function of the time series: μ(t) = E[x(t)]. The trend can be various functions of time, for example:

- linear:  μ(t) = b0 + b1 t
- quadratic:  μ(t) = b0 + b1 t + b2 t²
- cyclical:  μ(t) = μ(t + 12)
- cosine:  μ(t) = b0 + b1 cos(2π f t) + b2 sin(2π f t)
► triangle inequality [GEOM] → distance
► triangular diagram [GRAPH] Ternary coordinate system, in the form of an equilateral triangle, in which the values of the three variables sum to one. The quantities represented on the three coordinates are proportions rather than absolute values. This diagram is particularly useful for studying mixtures of three components.

[figure: triangular diagram; each of the three axes (e.g. x3) runs from 0 to 1]

► triangular distribution [PROB] → distribution
► triangular factorization [ALGE] → matrix decomposition
► triangular kernel [ESTIM] → kernel
► triangular matrix [ALGE] → matrix
► tridiagonal matrix [ALGE] → matrix
► tridiagonalization [ALGE] → matrix decomposition
► trigonometric transformation [PREP] → transformation
► trimmed estimator [ESTIM] → estimator
► trimmed mean [DESC] → location
► trimmed variance [DESC] → dispersion
► true error rate [CLAS] → classification (o error rate)
► truncation error [OPTIM] → numerical error
► Tukey's quick test [TEST] → hypothesis test
► Tukey's test [TEST] → hypothesis test
► two-level factorial design [EXDE] → design
► two-norm [ALGE] → norm (o matrix norm)
► two-sided test [TEST] → hypothesis testing
► two-stage nested analysis of variance [ANOVA] → analysis of variance
► two-stage sampling [PROB] → sampling (o cluster sampling)
► two-way analysis of variance [ANOVA] → analysis of variance
► two-way clustering [CLUS] → cluster analysis
► type I error [TEST] → hypothesis testing
► type II error [TEST] → hypothesis testing

U

► u-chart [QUAL] → control chart (o attribute control chart)
► U-shaped distribution [PROB] → random variable
► ultrametric distance [GEOM] → distance
► ultrametric inequality [GEOM] → distance
► unbalanced factorial design [EXDE] → design
► unbiased estimator [ESTIM] → estimator
► unconditional error rate [CLAS] → classification (o error rate)
► uncorrelated vectors [ALGE] → vector
► underdetermined system [PREP] → object
► underfitting [MODEL] → model fitting
► undirected graph [MISC] (o digraph) → graph theory
► unequal covariance matrix classification (UNEQ) [CLAS] Parametric classification method that is a variation of quadratic discriminant analysis. Each class g is represented by its centroid c_g and its covariance matrix S_g. Object x_i is classified according to its Mahalanobis distance from the class centroids. The metric is the inverse of the corresponding class covariance matrix S_g:

d²_ig = (x_i − c_g)^T S_g^-1 (x_i − c_g)

As the above distance metric follows the chi square distribution, the probability of an object belonging to a class can be calculated from this distribution. UNEQ, similar to SIMCA, simplifies the QDA classification rule by omitting the logarithm of the determinant of the class covariance matrices, which means that the class density functions are not properly scaled. In the presence of significant scale differences this usually causes inferior performance.
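A minimal Python sketch of the UNEQ classification rule (added for illustration; the two classes and the new object are invented), assigning an object to the class with the smallest squared Mahalanobis distance:

import numpy as np

def mahalanobis_sq(x, centroid, cov):
    """Squared Mahalanobis distance of object x from a class centroid."""
    diff = x - centroid
    return float(diff @ np.linalg.inv(cov) @ diff)

# Invented two-class training data (rows = objects, columns = variables).
rng = np.random.default_rng(3)
class_a = rng.normal([0.0, 0.0], 0.5, size=(30, 2))
class_b = rng.normal([3.0, 2.0], 1.0, size=(30, 2))

classes = {"A": class_a, "B": class_b}
params = {g: (X.mean(axis=0), np.cov(X, rowvar=False)) for g, X in classes.items()}

x_new = np.array([2.5, 1.5])
d2 = {g: mahalanobis_sq(x_new, c, S) for g, (c, S) in params.items()}
print(d2, "->", min(d2, key=d2.get))   # assign to the class with the smallest distance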
► uniform distribution [PROB] → distribution
► uniform shell design [EXDE] → design
► unique factor [FACT] (: specific factor) Term in the factor analysis model that accounts for the variance not described by the common factors. Its role is similar to the role of the error term in a regression model. There is one unique factor for each variable. The unique factors are uncorrelated, both among themselves and with the common factors. Principal component analysis assumes zero unique factors.
► unique variance [FACT] → factor analysis
► uniqueness [FACT] → communality
► unit [PREP] : object
► unit matrix [ALGE] → matrix
► univariate [MULT] → multivariate
► univariate calibration [REGR] → calibration
► univariate data [PREP] → data
► univariate distribution [PROB] → random variable
► univariate regression model [REGR] → regression model
► unsupervised learning [MULT] → pattern recognition
► unsupervised pattern recognition [MULT] → pattern recognition
► unweighted average linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
► upper control limit (UCL) [QUAL] → control chart
► upper quartile [DESC] → quantile
► utility function [QUAL] → multicriteria decision making
V

► V mask [QUAL] → control chart (o cusum chart)
► validation [MODEL] : model validation
► variable [PREP] (: attribute, characteristic, descriptor, feature) Characteristic of an object that may take on any value from a specified set. Data for n objects described by p variables are collected in a data matrix X(n, p), where the element x_ij is the jth measurement taken on the ith object. There are several types of variables, listed below, depending on their scale of measurement, on the set of values they can take, and on their relationships with other variables.
angular variable (: circular variable) Variable that takes on values expressed in terms of angles.

autoscaled variable : standardized variable

binary variable Dichotomous variable that takes values of 0 or 1. For example, quantal and dummy variables are binary variables.

blocking variable Categorical variable in the design matrix that groups runs into blocks. The levels of the blocking variable are defined by the blocking generator.

categorical variable : qualitative variable

cause variable Variable that is the cause of change in other (effect) variables.

circular variable : angular variable

concomitant variable : covariate

conditionally-present variable Variable that exists or is meaningful only for some of the objects. The values taken by such a variable may exist (or be meaningful) for an object depending on the value of a dichotomous variable of the present/absent type.

continuous variable Variable that can take on any numerical value. For any two values there is another one that such a variable can assume.

control variable Variable measured in monitoring a process. Its values are collected at fixed time intervals, often recorded on control charts or registered automatically.

covariate (: concomitant variable) Predictor variable in an ANCOVA model, measured on a ratio scale. Its effect on the response cannot be controlled, only observed.

cyclical variable Variable in a time series that takes on values that depend on the cycle (period) during which it is measured.

dependent variable (: response variable) Variable exhibiting statistical dependence on one or more other variables, called independent variables.

dichotomous variable Discrete variable that can take on only two values. A binary variable is a special dichotomous variable that takes on values of 0 or 1.

discrete variable In contrast to a continuous variable, a variable that takes on only a finite number of values. These values are usually integer numbers, but some discrete variables also take on ratios of integer numbers. For example, variables measured on a frequency count scale are discrete variables.

dummy variable (: indicator variable) Binary variable created by converting a qualitative variable into binary ones. Each level of the qualitative variable is represented by one dummy variable set to 1 or 0, indicating whether the qualitative variable assumed the corresponding level or not.

effect variable Variable in which change is caused by other (cause) variables.

endogenous variable Variable, mainly used in econometrics, that is measured within a system and affected both by variables in the system and by variables outside the system (exogenous variables). It is possible for a variable to be endogenous in one system and exogenous in another.

exogenous variable Variable, mainly used in econometrics, that is measured outside the system. Exogenous variables can affect the behavior of the system described by the endogenous variables, but are not affected by the fluctuations in the system.

experimental variable Variable measured or set during an experiment, in contrast to a theoretical variable, which is calculated from a mathematical model. In experimental design the experimental variables are more commonly called factors.

explanatory variable : predictor variable

inadmissible variable Variable that must not be included in a model because it is constant or perfectly correlated with other variables. A variable containing a large number of missing values, measured with too much noise, or highly correlated with other variables is often also considered inadmissible.

independent variable : predictor variable

indicator variable : dummy variable

latent variable Non-observable and non-measurable hypothetical variable, a crucial element of a latent variable model. A certain part of its effect is manifested in measurable manifest variables. Mainly used in sociology, economics and psychology. An example is the common factor in factor analysis.

lurking variable Variable that affects the response, but may not be measured or may not even be known to exist. The effect of such variables is gathered in the error term. They may cause significant correlation between two measured variables, without providing evidence that these two variables are necessarily causally related.

manifest variable Observable or measurable variable, as opposed to a latent variable, which is not observable or measurable.

multinomial variable : qualitative variable

multistate variable : qualitative variable

noise variable Variable that cannot be controlled during the experiment or the process.

predictor variable (: explanatory variable, independent variable, regressor) Variable in a regression model as a function of which the response variable is modeled.

process variable Variable controlled during the experiment or process.

qualitative variable (: categorical variable, multinomial variable, multistate variable) Variable in which differences between values cannot be interpreted in a quantitative sense and for which only non-arithmetic operations are valid. It can be measured on nominal or ordinal scales. Examples are: label, color, type.

quantal variable Binary response variable measuring the presence or absence of response to a stimulus.

quantitative variable Variable, measured on a proportional or ratio scale, for which arithmetic operations are valid.

random variable (: variate, stochastic variable) Variable that may take on values of a specified set with a defined frequency or probability; variable with values associated with an element of chance or probability.

ranking variable Variable defined by the ranks of the values of another variable. Rank order statistics are calculated from ranking variables that replace the ranked variable.

reduced variable : standardized variable

regressor : predictor variable

response variable : dependent variable

standardized variable (: autoscaled variable, reduced variable) Variable standardized by autoscaling, i.e. by subtracting its mean and dividing by its standard deviation. Such a variable has zero mean and unit variance.

stochastic variable : random variable

theoretical variable Variable taking values according to a mathematical model, in contrast to an experimental variable, which is measured or set in an experiment.

variate : random variable
► variable control chart [QUAL] → control chart
► variable metric optimization [OPTIM] (: quasi-Newton optimization) Gradient optimization that tries to overcome the problems in the Newton-Raphson optimization, namely that the Hessian matrix H may become negative definite. It finds a search direction of the form H^-1(p_i) g(p_i), where g is the gradient vector and H is a positive definite symmetric matrix, updated at each iteration, that converges to the Hessian matrix. The best known variable metric optimization is the Davidon-Fletcher-Powell optimization (DFP). It begins as steepest descent and changes over to the Newton-Raphson optimization during the iterations by continuously updating an approximation to the inverse of the matrix of second derivatives.
► variable reduction [MULT] → data reduction
► variable sampling [QUAL] → acceptance sampling
► variable step size generalized simulated annealing (VSGSA) [OPTIM] → simulated annealing
► variable subset selection [REGR] (: best subset regression) Collection of regression methods that model the response variable as a function of a selected subset of the predictor variables only. These techniques are biased regression methods in which biasing is based on the assumption that not all the predictor variables are relevant in the regression problem. There are various strategies for finding the optimal subset of predictors. The forward selection procedure starts with an initial model containing only a constant term and inserts one predictor at a time until a prespecified goodness of prediction criterion is satisfied. The order of insertion is determined by the partial correlation coefficients between the response and the predictors not yet in the model. In other words, at each step the predictor is included in the model that produces the largest increase in R². The partial correlations are calculated in each step as correlations between the residuals from the models y = f(x_1, x_2, ..., x_{k−1}) and x_j = f(x_1, x_2, ..., x_{k−1}), with j ≠ 1, ..., k − 1. The contribution of predictor k to the variance described by the regression model is assessed by the partial F value:

F_k = (RSS_{k−1} − RSS_k) / [RSS_k / (n − k − 1)]

where RSS_k denotes the residual sum of squares of the model containing k predictors. The forward selection procedure stops when the partial F value does not exceed a preselected F_in threshold, i.e. when the contribution of the selected predictor to the variance described by the model is no longer significant. Although this procedure improves the regression model in each step, it does not consider the effect of the newly inserted variable on the role of the predictors already in the model.
The backward elimination is the opposite strategy; it starts with the full model containing all the predictors. In each step the predictor is eliminated from the model which has the smallest partial correlation with the response (or equivalently the smallest partial F value), i.e. which results in the smallest decrease in R². The elimination procedure stops when the smallest partial F value is greater than a preselected F_out threshold, i.e. when all predictors in the model contribute significantly to the variance described by the model. This procedure, however, cannot be used when the full predictor matrix is underdetermined. The stepwise regression (SWR) method is a combination of the above two strategies. Both variable selection and variable elimination are attempted in each step. The procedure starts with only the constant term in the model. In each step the predictor with the largest partial F value is inserted if F > F_in, and the predictor with the smallest partial F value is eliminated if F < F_out. The procedure stops when all predictors in the model have F > F_out and all candidate predictors not in the model have F < F_in. This is the most frequently recommended method, although its success depends on the preselected F_in and F_out. The above procedures, called sequential variable selection, were developed with computational efficiency in mind, so that only a relatively small number of subsets are actually calculated and compared. As computation is getting cheaper, it is no longer prohibitive to calculate the all possible subsets regression. This examines all possible 2^p combinations of the predictors and chooses the best model on the basis of a goodness of prediction criterion. This method proves to be superior, especially in the case of collinearity. The above methods calculate a least squares estimate for the variables included in the model. The candidate predictors are decorrelated from the predictors already in the model. Stagewise regression considers predictors on the basis of their correlation with the response, as opposed to their partial correlation, i.e. the candidate predictors are not decorrelated from the model. In each step the correlations between the residual from the existing model and the potential predictors are calculated, and the variable with the largest correlation is inserted into the model. This method does not give the least squares regression coefficients for the variables in the final model and yields a larger mean square error than that of the least squares estimate. Its advantage is, however, that highly correlated variables are allowed to enter into the model if they are also highly correlated with the response.
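A minimal Python sketch of the forward selection strategy (added for illustration, with invented data); it greedily inserts the predictor giving the largest increase in R² and uses a fixed number of steps instead of a partial-F stopping rule:

import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (X already contains an intercept column)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 6))                   # invented predictor matrix
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=50)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):                             # fixed number of insertion steps
    scores = {}
    for j in remaining:
        cols = selected + [j]
        design = np.column_stack([np.ones(len(y)), X[:, cols]])
        scores[j] = r_squared(design, y)
    best = max(scores, key=scores.get)         # predictor giving the largest R^2 increase
    selected.append(best)
    remaining.remove(best)

print("selected predictors:", selected)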
► variance [DESC] → dispersion
► variance analysis [ANOVA] : analysis of variance
► variance component [ANOVA] → term in ANOVA
► variance-covariance matrix [DESC] : covariance matrix
► variance inflation factor (VIF) [REGR] Measure of the effect of an ill-conditioned predictor matrix X on the estimated regression coefficients:

VIF_j = 1 / (1 − R_j²)

where R_j is the multiple correlation coefficient obtained by regressing predictor x_j on all the other predictors x_k, k ≠ j. VIF is a simple measure for detecting collinearity. In the ideal case, when x_j is totally uncorrelated with the other predictors, VIF_j = 1. As R_j tends to 1, i.e. collinearity increases, VIF_j tends to infinity.
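For illustration, a short Python sketch (invented data, added here) computing VIF_j by regressing each predictor on the others:

import numpy as np

def vif(X):
    """Variance inflation factor of each column of the predictor matrix X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ b
        r2 = 1.0 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
print(vif(np.column_stack([x1, x2, x3])))   # large VIF for x1 and x2, near 1 for x3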
► variance ratio distribution [PROB] → distribution
► variance ratio test [TEST] → hypothesis test
► variate [PREP] → variable
► variate [PROB] → random variable
► variation [DESC] : dispersion
► varimax rotation [FACT] → factor rotation
► variogram [TIME] → autocovariance function
► vector [ALGE] Row or column of numbers. In contrast to a scalar, a vector has both size and direction. The size of a vector is measured by its norm. Conventionally a vector x is assumed to be a column vector; a row vector is indicated as x^T (the transpose of x). For example, multivariate measurements are usually represented by a row vector, while measurements of the same quantity on various individuals or material samples are often collected in a column vector.
Vectors x and y are orthogonal vectors if

x^T y = 0

A set of p such vectors forms an orthonormal basis of the p space. Vectors x and y are uncorrelated vectors if:

(x − x̄)^T (y − ȳ) = 0

where x̄ and ȳ denote the corresponding mean vectors. Vectors x and y are linearly independent vectors if

b_x x + b_y y = 0

holds only for b_x = 0 and b_y = 0. Vectors x and y are conjugate vectors if

x^T Z y = 0

where Z is a symmetric positive definite matrix.
► vector norm [ALGE] → norm
W

► Wald-Wolfowitz's test [TEST] → hypothesis test
► walk [MISC] → graph theory
► Walsh's test [TEST] → hypothesis test
► Ward linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
► warning limit [QUAL] → control chart
► Watson's test [TEST] → hypothesis test
► Weibull distribution [PROB] → distribution
► Weibull growth model [REGR] → regression model
► weight [PREP] Numerical coefficient associated with objects, variables or classes indicating their relative importance in a model. Most statistical models can incorporate observation weights, e.g. weighted mean, weighted variance, weighted least squares. Observation weights play an important role in regression, for example in generalized least squares regression, robust regression, the biweight, and smoothers.
► weighted average linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
► weighted centroid linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
► weighted graph [MISC] → graph theory
► weighted least squares regression (WLS) [REGR] → generalized least squares regression
► weighted mean [DESC] → location
► weighted nearest means classification (WNMC) [CLAS] → centroid classification
► weighted variance [DESC] → dispersion
► well-conditioned matrix [ALGE] → matrix condition
► well-determined system [PREP] → object
► well-structured admissibility [CLUS] → assessment of clustering (o admissibility properties)
► Westenberg's test [TEST] → hypothesis test
► Westlake design [EXDE] → design
► white noise [TIME] Stationary stochastic process defined as a sequence of i.i.d. random variables a(t). This process is stationary with mean function

μ(t) = E[a(t)]

and with autocovariance and autocorrelation functions

γ(t, s) = var[a(t)]        ρ(t, s) = 1

if t = s; otherwise both are zero.
► wide-band process [TIME] → stochastic process (o narrow-band process)
► Wiener-Levy process [TIME] → stochastic process
► Wilcoxon-Mann-Whitney's test [TEST] → hypothesis test
► Wilcoxon's test [TEST] → hypothesis test
► Wilks' Λ test [FACT] → rank analysis
► Wilk's test [TEST] → hypothesis test
► Williams-Lambert clustering [CLUS] → hierarchical clustering (o divisive clustering)
► Williams plot [GRAPH] → scatter plot (o residual plot)
► window smoother [REGR] → smoother
► Winsorized estimator [ESTIM] → estimator
► within-group covariance matrix [DESC] → covariance matrix

X

► x̄-chart [QUAL] → control chart (o variable control chart)
Y

► Yates algorithm [EXDE] Algorithm for calculating estimates of the effects of factors and of their interactions in a two-level factorial design. The outcomes of the experimental runs of a K-factor factorial design are first written in a column in Yates order. Another column is calculated by first adding together the consecutive pairs of numbers from the first column, then subtracting the top number from the bottom number of each pair. With the same technique a total of K columns are generated; the entries of a new column are sums or differences of pairs of numbers from the previous column. The last column contains the contrast totals: the first value corresponds to the mean and the remaining values to the factors and their interactions, in Yates order. To obtain the estimate of the mean, the corresponding total is divided by 2^K; the factor and interaction effects are estimated by dividing the corresponding totals by 2^(K−1).
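A minimal Python sketch of the algorithm (added for illustration; the responses are invented and listed in Yates order for a 2³ design):

def yates(y, k):
    """Yates algorithm for a 2**k full factorial with responses y in Yates order."""
    col = list(y)
    for _ in range(k):
        sums  = [col[i] + col[i + 1] for i in range(0, len(col), 2)]
        diffs = [col[i + 1] - col[i] for i in range(0, len(col), 2)]
        col = sums + diffs                 # top half: sums, bottom half: differences
    mean = col[0] / 2 ** k                 # grand total divided by 2**k
    effects = [c / 2 ** (k - 1) for c in col[1:]]   # contrasts divided by 2**(k-1)
    return mean, effects

# Invented responses of a 2**3 design, listed in Yates (standard) order.
responses = [60, 72, 54, 68, 52, 83, 45, 80]
print(yates(responses, 3))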
► Yates chi squared coefficient [GEOM] → distance (o binary data)
► Yates order [EXDE] (: standard order of runs) The most often used order of runs in a two-level factorial design. The first column of the design matrix consists of successive minus and plus signs, the second column of successive pairs of minus and plus signs, the third column of four minus signs followed by four plus signs, and so forth. In general, the kth column consists of alternating groups of 2^(k−1) minus signs followed by 2^(k−1) plus signs. For example, the design matrix for a 2³ design is:

- - -
+ - -
- + -
+ + -
- - +
+ - +
- + +
+ + +

► Youden square design [EXDE] → design
► Yule coefficient [GEOM] → distance (o binary data)
► Yule-Walker equations [TIME] → time series model (o autoregressive model)
Z

► z-chart [GRAPH] Plot of time series data that contains three lines forming a Z shape. The lower line is the plot of the original time series, the center line is a cumulative total and the upper line is a moving total.
► z-score [PREP] → standardization (o autoscaling)
► zero matrix [ALGE] → matrix
► zero-order regression model [REGR] → regression model
► zooming [GRAPH] → interactive computer graphics
References

[ALGE] LINEAR ALGEBRA

G.H. Golub and C.F. van Loan, Matrix Computations. Johns Hopkins University Press, Baltimore, MD (USA), 1983
A.S. Householder, The Theory of Matrices in Numerical Analysis. Dover Publications, New York, NY (USA), 1974
A. Jennings, Matrix Computation for Engineers and Scientists. Wiley, New York, NY (USA), 1977
B.N. Parlett, The Symmetric Eigenvalue Problem. Prentice-Hall, Englewood Cliffs, NJ (USA), 1980
[ANOVA] ANALYSIS OF VARIANCE

O.J. Dunn and V.A. Clark, Applied Statistics: Analysis of Variance and Regression. Wiley, New York, NY (USA), 1974
L. Fisher, Fixed Effects Analysis of Variance. Academic Press, New York, NY (USA), 1978
D.J. Hand, Multivariate Analysis of Variance and Repeated Measures. Chapman and Hall, London (UK), 1987
A. Huitson, The Analysis of Variance. Griffin, London (UK), 1966
S.R. Searle, Variance Components. Wiley, New York, NY (USA), 1992
G.O. Wesolowsky, Multiple Regression and Analysis of Variance. Wiley, New York, NY (USA), 1976
[CLAS] CLASSIFICATION

H.H. Bock, Automatische Klassifikation. Vandenhoeck & Ruprecht, Gottingen (GER), 1974
I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning. Sigma Press, Wilmslow (UK), 1987
L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. Wadsworth, Belmont, CA (USA), 1984
H.T. Clifford and W. Stephenson, An Introduction to Numerical Classification. Academic Press, New York, NY (USA), 1975
D. Coomans, D.L. Massart, I. Broeckaert, and A. Tassin, Potential methods in pattern recognition. Anal. Chim. Acta, 133, 215 (1981)
D. Coomans, D.L. Massart, and I. Broeckaert, Potential methods in pattern recognition. Part 4. A combination of ALLOC and statistical linear discriminant analysis. Anal. Chim. Acta, 133, 215 (1981)
T.M. Cover and P.E. Hart, Nearest neighbor pattern classification. IEEE Trans., 21 (1967)
M.P. Derde and D.L. Massart, UNEQ: a disjoint modelling technique for pattern recognition based on normal distribution. Anal. Chim. Acta, 184, 33 (1986)
M.P. Derde and D.L. Massart, Comparison of the performance of the class modelling techniques UNEQ, SIMCA and PRIMA. Chemolab, 4, 65 (1988)
R.A. Eisenbeis, Discriminant Analysis and Classification Procedures: Theory and Applications. Lexington, 1972
M. Forina, C. Armanino, R. Leardi, and G. Drava, A class-modelling technique based on potential functions. J. Chemometrics, 5, 435 (1991)
I.E. Frank, DASCO - A new classification method. Chemolab, 4, 215 (1988)
I.E. Frank and J.H. Friedman, Classification: oldtimers and newcomers. J. Chemometrics, 3, 463 (1989)
I.E. Frank and S. Lanteri, Classification models: discriminant analysis, SIMCA, CART. Chemolab, 5, 247 (1989)
J.H. Friedman, Regularized Discriminant Analysis. J. Am. Statist. Assoc., 165 (1989)
M. Goldstein, Discrete Discriminant Analysis. Wiley, New York, NY (USA), 1978
L. Gordon and R.A. Olsen, Asymptotically efficient solutions to the classification problem. Ann. Statist., 6, 515 (1978)
D.J. Hand, Kernel discriminant analysis. Research Studies Press, Letchworth (UK), 1982
D.J. Hand, Discrimination and Classification. Wiley, Chichester (UK), 1981
M. James, Classification Algorithms. Collins, London (UK), 1985
I. Juricskay and G.E. Veress, PRIMA: a new pattern recognition method. Anal. Chim. Acta, 171, 61 (1985)
W.R. Klecka, Discriminant Analysis. Sage Publications, Beverly Hills, CA (USA), 1980
B.R. Kowalski and C.F. Bender, The k-nearest neighbour classification rule (pattern recognition) applied to nuclear magnetic resonance spectral interpretation. Anal. Chem., 44, 1405 (1972)
P.A. Lachenbruch, Discriminant Analysis. Hafner Press, New York, NY (USA), 1975
G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York, NY (USA), 1992
N.J. Nilson, Learning Machines. McGraw-Hill, New York, NY (USA), 1965
R. Todeschini and E. Marengo, Linear Discriminant Classification Tree (LDCT): a user-driven multicriteria classification method. Chemolab, 16, 25 (1992)
G.T. Toussaint, Bibliography on estimation of misclassification. Information Theory, 472 (1974)
H. van der Voet and D.A. Doornbos, The improvement of SIMCA classification by using kernel density estimation. Part 1. A new probabilistic approach classification technique and how to evaluate such a technique. Anal. Chim. Acta, 161, 115 (1984)
H. van der Voet and D.A. Doornbos, The improvement of SIMCA classification by using kernel density estimation. Part 2. Practical evaluation of SIMCA, ALLOC and CLASSY on three data sets. Anal. Chim. Acta, 161, 125 (1984)
H. van der Voet, P.M.J. Coenegracht, and J.B. Hemel, The evaluation of probabilistic classification methods. Part 1. A Monte Carlo study with ALLOC. Anal. Chim. Acta, 191, 47 (1986)
H. van der Voet, J.B. Hemel, and P.M.J. Coenegracht, New probabilistic version of the SIMCA and CLASSY classification methods. Part 2. Practical evaluation. Anal. Chim. Acta, 191, 63 (1986)
S. Wold, Pattern recognition by means of disjoint principal components models. Pattern Recognition, 8, 127 (1976)
S. Wold, The analysis of multivariate chemical data using SIMCA and MACUP. Kem. Kemi, 3, 401 (1982)
S. Wold and M. Sjöström, Comments on a recent evaluation of the SIMCA method. J. Chemometrics, 1, 243 (1987)
[CLUS] CLUSTER ANALYSIS rn L.A. Abbott, EA. Bisby, and D.J. Rogers, lsxonomic Analysis in Biology. Columbia Univ. Press, New York, NY (USA), 1985 m M.S. Aldenderfer and R.K. Blashfield, Cluster Analysis. Sage Publications, Beverly Hills, CA (USA), 1984 rn M.R. Anderberg, Cluster Analysis for Applications. Academic Press, New York, NY (USA), 1973 0 G.H. Ball and D.J. Hall, A clustering technique for summarizing multivariate data. Behav. Sci., 2, 153 (1967) rn E.J. Bijnen, Cluster Analysis. Tilburg University Press, 1973 rn A.J. Cole, Numerical lsxonomy. Academic Press, New York, NY (USA), 1969 0 D. Coomans and D.L. Massart, Potential methods In pattern recognition. Part 2. CLUPOT, an unsupervised pattern recognition technique. Anal. Chim. Acta, 133,225 (1981) rn B.S. Duran, Cluster Analysis. Springer-Verlag, Berlin (GER), 1974 A.W.F. Edwards and L.L. Cavalli-Sforza, A method for cluster analysis. Biometrics, 21,362 (1965) B.S. Everitt, Cluster Analysis. Heineman Educational Books, London (UK), 1980 0 E.B. Fowlkes, R. Gnanadesikan, and J.R. Kettenring, Variable selection In clustering. J. Classif., 5 205 (1988) 0 H.P. Friedman and J. Rubin, On some invariant criteria for grouping data. J. Am. Statist. Assoc., @, 1159 (1967) A.D. Gordon, Classification Methods for the Exploratory Analysis of Multivariate Data. Chapman & Hall, London (UK), 1981 0 J.C. Gower, Maximal predictive classification. Biometrics, 643 (1974) 0 P. Hansen and M. Delattre, Complete-link analysis by graph coloring. J. Am. Statist. Assoc., -3 7 397 (1978) rn J. Hartigan, ClusteringAlgorithms. Wiley, New York, NY (USA), 1975 rn A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ (USA), 1988 rn M. Jambu, Cluster Analysis and Data Analysis. North-Holland, Amsterdam (The Netherlands), 1983 rn N. Jardine and R. Sibson, Mathematical lsxonomy. Wiley, London (UK), 1971 0 R.A. Jarvis and E.A. Patrick, Clustering using a similarity measure based on shared nearest neighbours. IEEE 'Itans. Comput., 1025 (1973) rn L. Kaufman and P.J. Rousseeuw, Finding Groups in Data An Introduction to Cluster Analysis. Wiley, New York, NY (USA), 1990 R.G. Lawson and P.C. Jurs, New index for clustering tendency and its application to chemical problems. J. Chem. Inf. Comput. Sci., -03 36 (1990) 0 E. Marengo and R. Todeschini, Linear discriminant hierarchical clustering: a modeling and crossvalidable clustering method. Chemolab, 19,43 (1993) 0 F.H.C. Marriott, Optimization methods of cluster analysis. Biometrika, 417 (1982) rn D.L. Massart and L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis. Wiley, New York, NY (USA), 1983 D.L. Massart, F. Plastria, and L. Kaufman, Non-hierarchical clustering with MASLOC. Pattern Recognition, &, 507 (1983) rn P.M. Mather, Cluster Analysis. Computer Applications, Nottingham (UK), 1969 0 G.W. Milligan and P.D. Isaac, The validation of four ultrametric clustering algorithms. Pattern Recognition, -02 41 (1980) 0 W.M. Rand, Objective criteria for the evaluation of clustering methods. J. Am. Statist. Assoc., 66, 846 (1971)
H.C. Romesburg, Cluster Analysis for Researchers. Lifetime Learning Publications, Belmont, CA (USA), 1984
A.J. Scott and H.J. Symons, Clustering methods based on likelihood ratio criteria. Biometrics, 27, 387 (1971)
P.H.A. Sneath and R.R. Sokal, Numerical Taxonomy. Freeman, San Francisco, CA (USA), 1973
H. Spath, Cluster Analysis Algorithms. Wiley, New York, NY (USA), 1980
M.J. Symons, Clustering criteria and multivariate normal mixtures. Biometrics, 37, 35 (1981)
R.C. Tryon, Cluster Analysis. McGraw-Hill, New York, NY (USA), 1970
J.W. Van Ness, Admissible clustering procedures. Biometrika, 60, 422 (1973)
J. Van Ryzin (Ed.), Classification and Clustering. Academic Press, New York, NY (USA), 1977
W. Vogt, D. Nagel, and H. Sator, Cluster Analysis in Clinical Chemistry: A Model. Wiley, Chichester (UK), 1987
J.H. Ward, Hierarchical grouping to optimize an objective function. J. Am. Statist. Assoc., 58, 236 (1963)
P. Willett, Clustering tendency in chemical classification. J. Chem. Inf. Comput. Sci., 25, 78 (1985)
P. Willett, Similarity and Clustering in Chemical Information Systems. Research Studies Press, Letchworth (UK), 1987
W.T. Williams and J.M. Lambert, Multivariate methods in plant ecology. 2. The use of an electronic computer for association analysis. J. Ecology, 48, 689 (1960)
D. Wishart, Mode Analysis: a generalization of nearest neighbour which reduces chaining effects. In Numerical Taxonomy, Ed. A.J. Cole, Academic Press, New York, NY, 1969, p. 282
J. Zupan, A new approach to binary tree-based heuristics. Anal. Chim. Acta, 122, 337 (1980)
J. Zupan, Hierarchical clustering of infrared spectra. Anal. Chim. Acta, 139, 143 (1982)
J. Zupan, Clustering of Large Data Sets. Research Studies Press, Chichester (UK), 1982
[ESTIM] ESTIMATION
Y. Bard, Nonlinear Parameter Estimation. Academic Press, New York, NY (USA), 1974
P.J. Huber, Robust Statistics. Wiley, New York, NY (USA), 1981
J.S. Maritz, Distribution-free Statistical Methods. Chapman & Hall, London (UK), 1981
O. Richter, Parameter Estimation in Ecology. VCH Publishers, Weinheim (GER), 1990
P.J. Rousseeuw, Tutorial to robust statistics. J. Chemometrics, 5, 1 (1991)
B.W. Silverman, Density Estimation for Statistics and Data Analysis. Research Studies Press, Letchworth (UK), 1986
[EXDE] EXPERIMENTAL DESIGN
K.M. Abdelbasit and R.L. Plackett, Experimental Design for Binary Data. J. Am. Statist. Assoc., 78, 90 (1983)
D.F. Andrews and A.M. Herzberg, The Robustness and Optimality of Response Surface Designs. J. Statist. Plan. Infer., 2, 249 (1979)
T.B. Barker, Quality by Experimental Design. Dekker, New York, NY (USA), 1985
G.E.P. Box and D.W. Behnken, Some new three level designs for the study of quantitative variables. Technometrics, 2, 455 (1960)
G.E.P. Box and N.R. Draper, Evolutionary Operation. Wiley, New York, NY (USA), 1969
G.E.P. Box and N.R. Draper, Empirical Model-Building and Response Surfaces. Wiley, New York, NY (USA), 1987
G.E.P. Box, W.G. Hunter, and J.S. Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley, New York, NY (USA), 1978
N. Bratchell, Multivariate response surface modelling by principal components analysis. J. Chemometrics, 3, 579 (1989)
R. Carlson, Design and Optimization in Organic Synthesis. Elsevier, Amsterdam (NL), 1992
S. Clementi, G. Cruciani, C. Curti and B. Skagerberg, PLS response surface optimization: the CARSO procedure. J. Chemometrics, 3, 499 (1989)
J.A. Cornell, Experiments with Mixtures. Wiley, New York, NY (USA), 1990 (2nd ed.)
J.A. Cornell, Experiments with Mixtures: A Review. Technometrics, 15, 437 (1973)
J.A. Cornell, Experiments with Mixtures: An Update and Bibliography. Technometrics, 21, 95 (1979)
S.N. Deming and S.L. Morgan, Experimental Design: A Chemometric Approach. Elsevier, Amsterdam (NL), 1987
C.A.A. Duinveld, A.K. Smilde and D.A. Doornbos, Comparison of experimental designs combining process and mixture variables. Part I. Design construction and theoretical evaluation. Chemolab, 19, 295 (1993)
C.A.A. Duinveld, A.K. Smilde and D.A. Doornbos, Comparison of experimental designs combining process and mixture variables. Part II. Design evaluation on measured data. Chemolab, 19, 309 (1993)
V.V. Fedorov, Theory of Optimal Experiments. Academic Press, New York, NY (USA), 1972
P.D. Haaland, Experimental Design in Biotechnology. Marcel Dekker, New York, NY (USA), 1989
J.S. Hunter, Statistical Design Applied to Product Design. J. Qual. Control, 17, 210 (1985)
W.G. Hunter and J.R. Kittrell, Evolutionary Operation: A Review. Technometrics, 389 (1966)
A.I. Khuri and J.A. Cornell, Response Surfaces: Designs and Analysis. Marcel Dekker, New York, NY (USA), 1987
R.E. Kirk, Experimental Design. Wadsworth, Belmont, CA (USA), 1982
E. Marengo and R. Todeschini, A new algorithm for optimal, distance based, experimental design. Chemolab, 2, 117 (1991)
R.L. Mason, R.F. Gunst, and J.L. Hess, Statistical Design and Analysis of Experiments with Applications to Engineering and Science. Wiley, New York, NY (USA), 1989
R. Mead, The Design of Experiments. Cambridge Univ. Press, Cambridge (UK), 1988
R. Mead and D.J. Pike, A review of response surface methodology from a biometric viewpoint. Biometrics, 31, 803 (1975)
D.C. Montgomery, Design and Analysis of Experiments. Wiley, New York, NY (USA), 1984
E. Morgan, Chemometrics: Experimental Design. Wiley, Chichester (UK), 1991
R.L. Plackett and J.P. Burman, The Design of Optimum Multifactorial Experiments. Biometrika, 33, 305 (1946)
D.M. Steinberg and W.G. Hunter, Experimental Design: Review and Comment. Technometrics, 441 (1984)
G. Taguchi and Y. Wu, Introduction to Off-Line Quality Control. Central Japan Quality Control Association (JPN), 1979
[FACT] FACTOR ANALYSIS
J.P. Benzecri, L'Analyse des Correspondances. Dunod, Paris (FR), 2 vols., 1980 (3rd ed.)
M. Feinberg, The utility of correspondence factor analysis for making decisions from chemical data. Anal. Chim. Acta, 191, 75 (1986)
B. Flury, Common Principal Components and Related Multivariate Models. Wiley, New York, NY (USA), 1988
H. Gampp, M. Maeder, C.J. Meyer, and A.D. Zuberbuehler, Evolving Factor Analysis. Comments Inorg. Chem., 6, 41 (1987)
R.L. Gorsuch, Factor Analysis. Saunders, Philadelphia, PA (USA), 1974
M.J. Greenacre, Theory and Applications of Correspondence Analysis. Academic Press, London (UK), 1984
H.H. Harman, Modern Factor Analysis. Chicago University Press, Chicago, IL (USA), 1976
P.K. Hopke, Target transformation factor analysis. Chemolab, 6, 7 (1989)
J.E. Jackson, A User's Guide to Principal Components. Wiley, New York, NY (USA), 1991
I.T. Jolliffe, Principal Component Analysis. Springer-Verlag, New York, NY (USA), 1986
H.R. Keller and D.L. Massart, Evolving factor analysis. Chemolab, 12, 209 (1992)
J. Kim and C.W. Mueller, Factor Analysis. Sage Publications, Beverly Hills, CA (USA), 1978
D.N. Lawley and A.E. Maxwell, Factor Analysis as a Statistical Method. Macmillan, New York, NY (USA) / Butterworths, London (UK), 1971 (2nd ed.)
M. Maeder, Evolving factor analysis for the resolution of overlapping chromatographic peaks. Anal. Chem., 527 (1987)
M. Maeder and A. Zilian, Evolving Factor Analysis, a New Technique in Chromatography. Chemolab, 3, 205 (1988)
E.R. Malinowski and D.G. Howery, Factor Analysis in Chemistry. Wiley, New York, NY (USA), 1980-1991 (2nd ed.)
S.A. Mulaik, The Foundations of Factor Analysis. McGraw-Hill, New York, NY (USA), 1972
R.J. Rummel, Applied Factor Analysis. Northwestern University Press, Evanston, IL (USA), 1970
C.J.F. Ter Braak, Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology, 67, 1167 (1986)
L.L. Thurstone, Multiple Factor Analysis: A development and expansion of The Vectors of Mind. Chicago University Press, Chicago, IL (USA), 1947
[GEOM] GEOMETRICAL CONCEPTS
C.M. Cuadras, Distancias Estadisticas. Estadística Española, 30, 295 (1988)
J.C. Gower, A general coefficient of similarity and some of its properties. Biometrics, 27, 857 (1971)
[GRAPH] GRAPHICAL DATA ANALYSIS
D.F. Andrews, Plots of high-dimensional data. Biometrics, 28, 125 (1972)
H.P. Andrews, R.D. Snee, and M.H. Sarner, Graphical display of means. The Am. Statist., 195 (1980)
F.J. Anscombe, Graphs in statistical analysis. The Am. Statist., 27, 17 (1973)
J.M. Chambers, W.S. Cleveland, B. Kleiner, and P.A. Tukey, Graphical Methods for Data Analysis. Wadsworth & Brooks, Pacific Grove, CA (USA), 1983
H. Chernoff, The use of faces to represent points in k-dimensional space graphically. J. Am. Statist. Assoc., 68, 361 (1973)
B. Everitt, Graphical Techniques for Multivariate Data. Heinemann Educational Books, London (UK), 1978
S.E. Fienberg, Graphical methods in statistics. The Am. Statist., 33, 165 (1979)
J.H. Friedman and L.C. Rafsky, Graphics for the multivariate two-sample problem. J. Am. Statist. Assoc., 76, 277 (1981)
K.R. Gabriel, The biplot graphic display of matrices with applications to principal components analysis. Biometrika, 453 (1971)
B. Kleiner and J.A. Hartigan, Representing points in many dimensions by trees and castles. J. Am. Statist. Assoc., 76, 260 (1981)
R. Leardi, E. Marengo, and R. Todeschini, A new procedure for the visual inspection of multivariate data of different geographic origins. Chemolab, 12, 181 (1991)
R. McGill, J.W. Tukey, and W.A. Larsen, Variation of box plots. The Am. Statist., 12 (1978)
D.W. Scott, On optimal and data-based histograms. Biometrika, 66, 605 (1979)
H. Wainer and D. Thissen, Graphical Data Analysis. Ann. Rev. Psychol., 22, 191 (1981)
K. Wakimoto and M. Taguri, Constellation graphical method for representing multidimensional data. Ann. Inst. Statist. Mathem., 97 (1978)
P.C.C. Wang (Ed.), Graphical Representation of Multivariate Data. Academic Press, New York, NY (USA), 1978
[MISC] MISCELLANEOUS
D. Bonchev, Information Theoretic Indices for Characterization of Chemical Structure. Research Studies Press, Letchworth (UK), 1983
L. Davis, Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York, NY (USA), 1991
K. Eckschlager and V. Stepanek, Analytical Measurement and Information: Advances in the Information Theoretic Approach to Chemical Analysis. Research Studies Press, Letchworth (UK), 1985
D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA (USA), 1989
R.W. Hamming, Coding and Information Theory. Prentice-Hall, Englewood Cliffs, NJ (USA), 1980-1986 (2nd ed.)
J. Hertz, A. Krogh and R.G. Palmer, Introduction to the Theory of Neural Computation. Addison-Wesley, New York, NY (USA), 1991
D.B. Hibbert, Genetic algorithms in chemistry. Chemolab, 19, 277 (1993)
Z. Hippe, Artificial Intelligence in Chemistry. Elsevier, Warszawa (POL), 1991
J.H. Holland, Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI (USA), 1975
G.J. Klir and T.A. Folger, Fuzzy Sets, Uncertainty, and Information. Prentice-Hall, Englewood Cliffs, NJ (USA), 1988
R. Leardi, R. Boggia and M. Terrile, Genetic algorithms as a strategy for feature selection. J. Chemometrics, 6, 267 (1992)
C.B. Lucasius and G. Kateman, Understanding and using genetic algorithms. Part 1. Concepts, properties and context. Chemolab, 19, 1 (1993)
N.J. Nilsson, Principles of Artificial Intelligence. Springer-Verlag, Berlin (GER), 1982
A.P. de Weijer, C.B. Lucasius, L. Buydens, G. Kateman, and H.M. Heuvel, Using genetic algorithms for an artificial neural network model inversion. Chemolab, 45 (1993)
B.J. Wythoff, Backpropagation neural networks. A tutorial. Chemolab, 18, 115 (1993)
J. Zupan and J. Gasteiger, Neural networks: a new method for solving chemical problems or just a passing phase? Anal. Chim. Acta, 248, 1 (1991)
[MODEL] MODELING
B. Efron and R. Tibshirani, An Introduction to the Bootstrap. Chapman & Hall, New York, NY (USA), 1993
B. Efron and R. Tibshirani, Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science, 1, 54 (1986)
B. Efron, Estimating the error rate of a prediction rule: improvements on cross-validation. J. Am. Statist. Assoc., 316 (1983)
G.H. Golub, M. Heath, and G. Wahba, Generalized Cross-validation as a Method for Choosing a Good Ridge Parameter. Technometrics, 21, 215 (1979)
S. Lanteri, Full validation for feature selection in classification and regression problems. Chemolab, 159 (1992)
A.M. Law and W.D. Kelton, Simulation Modeling and Analysis. McGraw-Hill, New York, NY (USA), 1991
D.W. Osten, Selection of optimal regression models via cross-validation. J. Chemometrics, 2, 39 (1988)
M. Stone, Cross-validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society, Ser. B, 36, 111 (1974)
[MULT] MULTIVARIATE ANALYSIS
A.A. Afifi and S.P. Azen, Statistical Analysis: A Computer Oriented Approach. Academic Press, New York, NY (USA), 1979
A.A. Afifi and V. Clark, Computer-Aided Multivariate Analysis. Wadsworth, Belmont, CA (USA), 1984
I.H. Bernstein, Applied Multivariate Analysis. Springer-Verlag, New York, NY (USA), 1988
H. Bozdogan and A.K. Gupta, Multivariate Statistical Modeling and Data Analysis. Reidel Publishers, Dordrecht (NL), 1987
R.G. Brereton, Multivariate pattern recognition in chemometrics, illustrated by case studies. Elsevier, Amsterdam (NL), 1992
H. Bryant and W.R. Atchley, Multivariate Statistical Methods. Dowden, Hutchinson & Ross, Stroudsberg, PA (USA), 1975
N.B. Chapman and J. Shorter, Correlation Analysis in Chemistry. Plenum Press, New York, NY (USA), 1978
C. Chatfield and A.J. Collins, Introduction to Multivariate Analysis. Chapman & Hall, London (UK), 1986
W.W. Cooley and P.R. Lohnes, Multivariate Data Analysis. Wiley, New York, NY (USA), 1971
D. Coomans and I. Broeckaert, Potential Pattern Recognition in Chemical and Medical Decision Making. Research Studies Press, Letchworth (UK), 1986
M.L. Davison, Multidimensional Scaling. Wiley, New York, NY (USA), 1983
W.R. Dillon and M. Goldstein, Multivariate Analysis: Methods and Applications. Wiley, New York, NY (USA), 1984
R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis. Wiley, New York, NY (USA), 1973
M.L. Eaton, Multivariate Statistics. Wiley, New York, NY (USA), 1983
K. Esbensen and P. Geladi, Strategy of multivariate image analysis (MIA). Chemolab, 67 (1989)
B.S. Everitt, An Introduction to Latent Variable Models. Chapman & Hall, London (UK), 1984
R. Gittins, Canonical Analysis: A Review with Applications in Ecology. Biomathematics 12, Springer-Verlag, Berlin (GER), 1985
R. Gnanadesikan, Methods for Statistical Data Analysis of Multivariate Observations. Wiley, New York, NY (USA), 1977
P. Green, Mathematical Tools for Applied Multivariate Analysis. Academic Press, San Diego, CA (USA), 1978
I.I. Joffe, Application of Pattern Recognition to Catalytic Research. Research Studies Press, Letchworth (UK), 1988
R.A. Johnson and D.W. Wichern, Applied Multivariate Statistical Analysis. Prentice-Hall, London (UK), 1982-1992 (3rd ed.)
P.C. Jurs and T.L. Isenhour, Chemical Applications of Pattern Recognition. Wiley-Interscience, New York, NY (USA), 1975
M. Kendall, Multivariate Analysis. Griffin, London (UK), 1980
P.R. Krishnaiah (Ed.), Multivariate Analysis. Academic Press, New York, NY (USA), 1966
W.J. Krzanowski, Principles of Multivariate Analysis. Oxford Science Publishers, Clarendon (UK), 1988
A.N. Kshirsagar, Multivariate Analysis. Dekker, New York, NY (USA), 1978
M.S. Levine, Canonical Analysis and Factor Comparisons. Sage Publications, Beverly Hills, CA (USA), 1977
P.J. Lewi, Multivariate Data Analysis in Industrial Practice. Research Studies Press, Letchworth (UK), 1982
B.F.J. Manly, Multivariate Statistical Methods: A Primer. Chapman & Hall, Bristol (UK), 1986
W.S. Meisel, Computer-oriented approach to pattern recognition. Academic Press, New York, NY (USA), 1972
K.V. Mardia, J.T. Kent, and J.M. Bibby, Multivariate Analysis. Academic Press, London (UK), 1979-1988 (6th ed.)
D.F. Morrison, Multivariate Statistical Methods. McGraw-Hill, New York, NY (USA), 1976
S.S. Schiffman, M.L. Reynolds and F.W. Young, Introduction to Multidimensional Scaling. Academic Press, Orlando, FL (USA), 1981
G.A.F. Seber, Multivariate Observations. Wiley, New York, NY (USA), 1984
M.S. Srivastava and E.M. Carter, An Introduction to Applied Multivariate Statistics. North-Holland, Amsterdam (NL), 1983
O. Strouf, Chemical Pattern Recognition. Research Studies Press, Letchworth (UK), 1986
R.M. Thorndike, Correlational Procedures for Research. Gardner, New York, NY (USA), 1978
J.T. Tou and R.C. Gonzales, Pattern Recognition Principles. Addison-Wesley, Reading, MA (USA), 1974
J.P. Van der Geer, Introduction to Linear Multivariate Data Analysis. DSWO Press, Leiden (NL), 1986
E. Van der Burg, Nonlinear Canonical Correlation and Some Related Techniques. DSWO Press, Leiden (NL), 1988
K. Varmuza, Pattern Recognition in Chemistry. Springer-Verlag, Berlin (GER), 1980
D.D. Wolf and M.L. Pearson, Pattern Recognition Approach to Data Interpretation. Plenum Press, New York, NY (USA), 1983
[OPTIM] OPTIMIZATION
K.W.C. Burton and G. Nickless, Optimization via simplex. Part 1. Background, definitions and simple applications. Chemolab, 1, 135 (1987)
B.S. Everitt, Introduction to Optimization Methods and their Application in Statistics. Chapman & Hall, London (UK), 1987
R. Fletcher, Practical Methods of Optimization. Wiley, New York, NY (USA), 1987
J.H. Kalivas, Optimization using variations of simulated annealing. Chemolab, 15, 1 (1992)
L. Mackley (Ed.), Introduction to Optimization. Wiley, New York, NY (USA), 1988
J.A. Nelder and R. Mead, A simplex method for function minimization. Computer J., 7, 308 (1965)
A.C. Norris, Computational Chemistry: An Introduction to Numerical Methods. Wiley, Chichester (UK), 1986
C.H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Englewood Cliffs, NJ (USA), 1982
P.J.M. van Laarhoven and E.H.L. Aarts, Simulated Annealing: Theory and Applications. Reidel, Dordrecht (NL), 1987
[PROB] PROBABILITY
M. Evans, N. Hastings, B. Peacock, Statistical Distributions. Wiley, New York, NY (USA), 1993
M.A. Goldberg, An Introduction to Probability Theory with Statistical Applications. Plenum Press, New York, NY (USA), 1984
H.J. Larson, Introduction to Probability Theory and Statistical Inference. Wiley, New York, NY (USA), 1969
P.L. Meyer, Introductory Probability and Statistical Applications. Addison-Wesley, Reading, MA (USA), 1970
F. Mosteller, Probability with Statistical Applications. Addison-Wesley, Reading, MA (USA), 1970
[QUAL] QUALITY CONTROL
G.A. Barnard, G.E.P. Box, D. Cox, A.H. Seheult, and B.W. Silverman (Eds.), Industrial Quality and Productivity with Statistical Methods. The Royal Society, London (UK), 1989
D.H. Besterfield, Quality Control. Prentice-Hall, London (UK), 1979
M.M.W.B. Hendriks, J.H. de Boer, A.K. Smilde and D.A. Doornbos, Multicriteria decision making. Chemolab, 175 (1992)
G. Kateman and F.W. Pijpers, Quality Control in Analytical Chemistry. Wiley, New York, NY (USA), 1981
D.C. Montgomery, Introduction to Statistical Quality Control. Wiley, New York, NY (USA), 1985
D.J. Wheeler and D.S. Chambers, Understanding Statistical Process Control. Addison-Wesley, Avon (UK), 1990
[REGR] REGRESSION ANALYSIS
F.J. Anscombe and J.W. Tukey, The examination and analysis of residuals. Technometrics, 141 (1963)
A.C. Atkinson, Plots, Transformations, and Regression. Oxford Univ. Press, Oxford (UK), 1985
D.M. Bates and D.G. Watts, Nonlinear Regression Analysis. Wiley, New York, NY (USA), 1988
D.A. Belsey, E. Kuh, and R.E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York, NY (USA), 1980
L. Breiman and J.H. Friedman, Estimating Optimal Transformations for Multiple Regression and Correlation. J. Am. Statist. Assoc., 80, 580 (1985)
S. Chatterjee and B. Price, Regression Analysis by Example. Wiley, New York, NY (USA), 1977
J. Cohen and P. Cohen, Applied Multiple Regression-Correlation Analysis for the Behavioral Sciences. Halsted, New York, NY (USA), 1975
R.D. Cook and S. Weisberg, Residuals and Influence in Regression. Chapman & Hall, New York, NY (USA), 1982
C. Daniel and F.S. Wood, Fitting Equations to Data. Wiley, New York, NY (USA), 1980 (2nd ed.)
N. Draper and H. Smith, Applied Regression Analysis. Wiley, New York, NY (USA), 1966-1981 (2nd ed.)
I.E. Frank, Intermediate least squares regression method. Chemolab, 1, 233 (1987)
I.E. Frank, A nonlinear PLS model. Chemolab, 8, 109 (1990)
I.E. Frank and J.H. Friedman, A statistical view of some chemometrics regression tools. Technometrics, 35, 109 (1993)
J.H. Friedman, Multivariate adaptive regression splines. The Annals of Statistics, 19, 1 (1991)
J.H. Friedman and W. Stuetzle, Projection pursuit regression. J. Am. Statist. Assoc., 76, 817 (1981)
T. Gasser and M. Rosenblatt (Eds.), Smoothing Techniques for Curve Estimation. Springer-Verlag, Berlin (GER), 1979
M.H.J. Gruber, Regression Estimators: a comparative study. Academic Press, San Diego, CA (USA), 1990
R.F. Gunst and R.L. Mason, Regression Analysis and Its Application: A Data-Oriented Approach. Marcel Dekker, New York, NY (USA), 1980
F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel, Robust Statistics: The Approach Based on Influence Functions. Wiley, New York, NY (USA), 1986
D.M. Hawkins, On the Investigation of Alternative Regression by Principal Components Analysis. Applied Statistics, 22, 275 (1973)
R.R. Hocking, Developments in linear regression methodology: 1959-1982. Technometrics, 25, 219 (1983)
R.R. Hocking, The analysis and selection of variables in linear regression. Biometrics, 32, 1 (1976)
A.E. Hoerl and R.W. Kennard, Ridge Regression: Biased estimation for non-orthogonal problems. Technometrics, 12, 55 (1970)
A. Hoskuldsson, PLS Regression Methods. J. Chemometrics, 2, 211 (1988)
D.G. Kleinbaum and L.L. Kupper, Applied Regression Analysis and Other Multivariable Methods. Duxbury Press, North Scituate, MA (USA), 1978
K.G. Kowalski, On the predictive performance of biased regression methods and multiple linear regression. Chemolab, 177 (1990)
O. Kvalheim, The Latent Variable. Chemolab, 14, 1 (1992)
O. Kvalheim and T.V. Karstang, Interpretation of latent-variable regression models. Chemolab, 7, 39 (1989)
A. Lorber, L.E. Wangen, and B.R. Kowalski, A Theoretical Foundation for the PLS Algorithm. J. Chemometrics, 1, 19 (1987)
D.W. Marquardt, Generalized Inverses, Ridge Regression, Biased Linear Estimation, and Nonlinear Estimation. Technometrics, 12, 591 (1970)
H. Martens and T. Næs, Multivariate Calibration. Wiley, New York, NY (USA), 1989
A.J. Miller, Subset Selection in Regression. Chapman & Hall, London (UK), 1990
F. Mosteller and J.W. Tukey, Data Analysis and Regression. Addison-Wesley, Reading, MA (USA), 1977
R.H. Myers, Classical and Modern Regression with Applications. Duxbury Press, Boston, MA (USA), 1986
T. Næs, C. Irgens, and H. Martens, Comparison of Linear Statistical Methods for Calibration of NIR Instruments. Applied Statistics, 35, 195 (1986)
J. Neter and W. Wasserman, Applied Linear Statistical Models. Irwin, Homewood, IL (USA), 1974
C.R. Rao, Linear Statistical Inference and Its Applications. Wiley, New York, NY (USA), 1973 (2nd ed.)
P.J. Rousseeuw and A.M. Leroy, Robust Regression and Outlier Detection. Wiley, New York, NY (USA), 1987
G.A.F. Seber, Linear Regression Analysis. Wiley, New York, NY (USA), 1977
S. Sekulic and B.R. Kowalski, MARS: a tutorial. J. Chemometrics, 6, 199 (1992)
J. Shorter, Correlation Analysis of Organic Reactivity: With Particular Reference to Multiple Regression. Research Studies Press, Chichester (UK), 1982
G. Wahba, Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia, PA (USA), 1990
J.T. Webster, R.F. Gunst, and R.L. Mason, Latent Root Regression Analysis. Technometrics, 16, 513 (1974)
S. Weisberg, Applied Linear Regression. Wiley, New York, NY (USA), 1980
G.B. Wetherill, Regression Analysis with Applications. Chapman & Hall, London (UK), 1986
S. Wold, N. Kettaneh-Wold and B. Skagerberg, Nonlinear PLS modeling. Chemolab, 7, 53 (1989)
S. Wold, P. Geladi, K. Esbensen, and J. Oehman, Multi-way Principal Components and PLS Analysis. J. Chemometrics, 1, 41 (1987)
C. Yale and A.B. Forsythe, Winsorized regression. Technometrics, 18, 291 (1976)
M.S. Younger, Handbook for Linear Regression. Duxbury Press, North Scituate, MA (USA), 1979
[TEST] HYPOTHESIS TESTING
J.V. Bradley, Distribution-free Statistical Tests. Prentice-Hall, Englewood Cliffs, NJ (USA), 1968
L.N.H. Bunt, Probability and Hypothesis Testing. Harrap, London (UK), 1968
E. Caulcott, Significance Tests. Routledge and Kegan Paul, London (UK), 1973
E.S. Edgington, Randomization Tests. Marcel Dekker, New York, NY (USA), 1980
K.R. Koch, Parameter Estimation and Hypothesis Testing in Linear Models. Springer-Verlag, Berlin (GER), 1988
E.L. Lehmann, Testing Statistical Hypotheses. Wiley, New York, NY (USA), 1986
[TIME] TIME SERIES
B.L. Bowerman, Forecasting and Time Series. Duxbury Press, Belmont, CA (USA), 1993
G.E.P. Box and G.M. Jenkins, Time Series Analysis. Holden-Day, San Francisco, CA (USA), 1976
C. Chatfield, The Analysis of Time Series: An Introduction. Chapman & Hall, London (UK), 1984
J.D. Cryer, Time Series Analysis. Duxbury Press, Boston, MA (USA), 1986
P.J. Diggle, Time Series: A Biostatistical Introduction. Clarendon Press, Oxford (UK), 1990
E.J. Hannan, Time Series Analysis. Methuen, London (UK), 1960
A.C. Harvey, Time Series Models. Wiley, New York, NY (USA), 1981
M. Kendall and J.K. Ord, Time Series. Edward Arnold, London (UK), 1990
D.C. Montgomery, L.A. Johnson, and J.S. Gardiner, Forecasting and Time Series Analysis. McGraw-Hill, New York, NY (USA), 1990
R.H. Shumway, Applied Statistical Time Series Analysis. Prentice-Hall, Englewood Cliffs, NJ (USA), 1988
GENERAL STATISTICS AND CHEMOMETRICS
J. Aitchison, The Statistical Analysis of Compositional Data. Chapman & Hall, London (UK), 1986
J. Aitchison, Statistics for Geoscientists. Pergamon Press, Oxford (UK), 1987
S.F. Arnold, Mathematical Statistics. Prentice-Hall, Englewood Cliffs, NJ (USA), 1990
V. Barnett and T. Lewis, Outliers in Statistical Data. Wiley, New York, NY (USA), 1978
J.J. Breen and P.E. Robinson, Environmental Applications of Chemometrics. ACS Symposium Series, vol. 292, Am. Chem. Soc., Washington, D.C. (USA), 1985
R.G. Brereton, Chemometrics: Applications of mathematics and statistics to laboratory systems. Ellis Horwood, Chichester (UK), 1990
D.T. Chapman and A.H. El-Shaarawi, Statistical Methods for the Assessment of Point Source Pollution. Kluwer Academic Publishers, Dordrecht (NL), 1989
D.J. Finney, Statistical Methods in Biological Assay. Griffin, Oxford (UK), 1978
M. Forina, Introduzione alla Chimica Analitica con elementi di Chemiometria. ECIG, Genova (IT), 1993
D.M. Hawkins, Identification of Outliers. Chapman & Hall, London (UK), 1980
D.C. Hoaglin, F. Mosteller, and J.W. Tukey, Understanding Robust and Exploratory Data Analysis. Wiley, New York, NY (USA), 1983
W.J. Kennedy and J.E. Gentle, Statistical Computing. Marcel Dekker, New York, NY (USA), 1980
B.R. Kowalski (Ed.), Chemometrics: Theory and Application. ACS Symposium Series, vol. 52, Am. Chem. Soc., Washington, D.C. (USA), 1977
B.R. Kowalski (Ed.), Chemometrics, Mathematics, and Statistics in Chemistry. Proceedings of the NATO ASI, Cosenza 1983. Reidel Publishers, Dordrecht (NL), 1984
D.L. Massart, R.G. Brereton, R.E. Dessy, P.K. Hopke, C.H. Spiegelman, and W. Wegscheider (Eds.), Chemometrics Tutorial. Collected from Chemolab, Vol. 1-5, Elsevier, Amsterdam (NL), 1990
D.L. Massart, A. Dijkstra, and L. Kaufman, Evaluation and Optimization of Laboratory Methods and Analytical Procedures. Elsevier, Amsterdam (NL), 1978-1984 (3rd ed.)
D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte, and L. Kaufman, Chemometrics: A Textbook. Elsevier, Amsterdam (NL), 1988
M. Meloun, J. Militky and M. Forina, Chemometrics for Analytical Chemistry. Volume 1: PC-Aided Statistical Data Analysis. Ellis Horwood, New York, NY (USA), 1992
M.A. Sharaf, D.A. Illman, and B.R. Kowalski, Chemometrics. Wiley, New York, NY (USA), 1986
J.W. Tukey, Exploratory Data Analysis. Addison-Wesley, Reading, MA (USA), 1977
J.H. Zar, Biostatistical Analysis. Prentice-Hall, Englewood Cliffs, NJ (USA), 1984 (2nd ed.)