Laplace distribution (: double exponential distribution) Continuous distribution with location parameter a and scale parameter b.
F(x) = 0.5 exp[(x − a)/b] if x ≤ a
F(x) = 1 − 0.5 exp[−(x − a)/b] if x ≥ a
f(x) = exp[−|x − a|/b] / (2b)
Range: −∞ < x < ∞    b > 0
Mean: a    Variance: 2b²
logistic distribution Continuous distribution with location parameter a and scale parameter b.
F(x) = 1 − {1 + exp[(x − a)/b]}⁻¹ = {1 + exp[−(x − a)/b]}⁻¹
f(x) = exp[−(x − a)/b] b⁻¹ {1 + exp[−(x − a)/b]}⁻² = exp[(x − a)/b] b⁻¹ {1 + exp[(x − a)/b]}⁻²
S(x) = {1 + exp[(x − a)/b]}⁻¹
h(x) = {b(1 + exp[−(x − a)/b])}⁻¹
Range: −∞ < x < ∞    b > 0
Mean: a    Variance: (πb)²/3

power function distribution Continuous distribution with scale parameter b and shape parameter c.
Mean: bc/(c + 1)    Variance: b²c/[(c + 2)(c + 1)²]
Rayleigh distribution Continuous distribution with scale parameter b.
F(x) = 1 − exp[−x²/(2b²)]
f(x) = (x/b²) exp[−x²/(2b²)]
h(x) = x/b²
Range: 0 ≤ x < ∞    b > 0
Mean: b √(π/2)    Variance: (2 − π/2) b²
rectangular distribution : uniform distribution
Student's t distribution : t distribution
t distribution (: Student's t distribution) Continuous distribution with shape parameter ν, the degrees of freedom.
f(x) = Γ[(ν + 1)/2] {Γ(ν/2) √(νπ)}⁻¹ (1 + x²/ν)^−(ν+1)/2
where Γ is the gamma function.
Range: −∞ < x < ∞    ν > 0
Mean = Mode: 0 for ν > 1
Variance: ν/(ν − 2) for ν > 2
triangular distribution Continuous distribution with location parameters a, b and shape parameter c.
F(x) = (x − a)² / [(b − a)(c − a)]    for a ≤ x ≤ c
F(x) = 1 − (b − x)² / [(b − a)(b − c)]    for c ≤ x ≤ b
f(x) = 2(x − a) / [(b − a)(c − a)]    for a ≤ x ≤ c
f(x) = 2(b − x) / [(b − a)(b − c)]    for c ≤ x ≤ b
Range: a ≤ x ≤ b
Mean: (a + b + c)/3    Variance: (a² + b² + c² − ab − ac − bc)/18
uniform distribution (: rectangular distribution) Continuous distribution with location parameters a and b.
F(x) = (x − a)/(b − a)
f(x) = 1/(b − a)
h(x) = 1/(b − x)
Range: a ≤ x ≤ b    Mean: (a + b)/2    Variance: (b − a)²/12
Discrete distribution with parameter n.
f(x) = 1/(n + 1)
S(x) = (n − x)/(n + 1)
h(x) = 1/(n − x)
Range: 0 ≤ x ≤ n    Mean: n/2    Variance: n(n + 2)/12
variance ratio distribution : F distribution
Weibull distribution Continuous distribution with scale parameter b, shape parameter c.
F(x) = 1 − exp[−(x/b)^c]
f(x) = (c x^(c−1)/b^c) exp[−(x/b)^c]
Range: 0 ≤ x < ∞    b > 0    c > 0
Mean: b Γ[(c + 1)/c]    Variance: b²(Γ[(c + 2)/c] − {Γ[(c + 1)/c]}²)
distribution function (df) [PROB] → random variable
distribution-free estimator [ESTIM] → estimator
distribution-free test [TEST] → hypothesis testing
divergence coefficient [GEOM] → distance (□ quantitative data)
divisive clustering [CLUS] → hierarchical clustering
Dodge-Romig tables [QUAL] Tables of standards for acceptance sampling plans for attributes. They are either designed for lot tolerance percent defective protection or provide a specified average outgoing quality limit. In both cases there are tables for single sampling and double sampling. These tables apply only in the case of rectifying inspection, i.e. if the rejected lots are submitted to 100% inspection.
Doehlert design [EXDE] → design
dot plot [GRAPH] Graphical display of the frequency distribution of a continuous variable on a one-dimensional scatter plot. The data points are plotted along a straight line parallel to and above the horizontal axis. Stacking or jitter can be applied to better display overlapping points.
[Figure: dot plot of a sample on a horizontal axis running from 2 to 16, drawn once with jitter and once with stacking.]
Stacking means that data points of equal value are plotted above each other on a vertical line orthogonal to the main line of the plot. Jitter is a technique used to separate overlapping points on a scatter plot. Instead of positioning the overlapping points exactly on the same location, their coordinate values are slightly perturbed so that each point appears separately on the plot. On a one-dimensional scatter plot, rather than plotting the points on a horizontal line, they are displayed on a strip above the axis. The width of the strip is kept small compared to the range of the horizontal axis. The vertical position of a point within the strip is random. Although a scatter plot with jitter loses accuracy, its information content is enhanced.
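A minimal numerical sketch of stacking and jitter (not part of the original entry; the simulated data, seed and use of numpy/matplotlib are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.round(rng.normal(9, 3, 80)).clip(2, 16)   # values roughly on a 2..16 axis

# Stacking: points of equal value are drawn one above the other.
y_stack = np.zeros_like(x)
counts = {}
for i, v in enumerate(x):
    counts[v] = counts.get(v, 0) + 1
    y_stack[i] = counts[v]                        # 1, 2, 3, ... for repeated values

# Jitter: each point gets a small random vertical offset inside a narrow strip.
y_jitter = rng.uniform(0.0, 1.0, size=x.size)

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(x, y_stack, "k.");  ax1.set_title("stacking")
ax2.plot(x, y_jitter, "k."); ax2.set_title("jitter")
plt.show()
```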
double cross-validation (dcv) [FACT] → rank analysis
double exponential distribution [PROB] → distribution
double sampling [QUAL] → acceptance sampling
double tail test [TEST] → hypothesis testing
draftsman's plot [GRAPH] (: scatterplot matrix, matrix plot) Graphical representation of multivariate data. Several two-dimensional scatter plots arranged in such a way that adjacent plots share a common axis. A p-dimensional data set is represented by p(p − 1)/2 scatter plots.
[Figure: draftsman's plot of four variables x1, x2, x3, x4, shown as an array of pairwise scatter plots.]
In this array of pairwise scatter plots the plots of the same row have a common vertical axis, while the plots of the same column have a common horizontal axis. Corresponding data points can be connected by highlighting or coloring.
dummy variable [PREP] → variable
Duncan's test [TEST] → hypothesis test
Dunnett's test [TEST] → hypothesis test
Dunn's partition coefficient [CLUS] → fuzzy clustering
Dunn's test [TEST] → hypothesis test
Durbin-Watson's test [TEST] → hypothesis test
Dwass-Steel's test [TEST] → hypothesis test
dynamic linear model [TIME] : state-space model
E-M algorithm [ESTIM] Iterative procedure, first introduced in the context of dealing with missing data, but also widely applicable in other maximum likelihood estimation problems. In missing data estimation it is based on the realization that if the missing data had been observed, simple sufficient statistics for the parameters could be used for straightforward maximum likelihood estimation; on the other hand, if the parameters of the model were known, then the missing data could be estimated given the observed data. The algorithm alternates between two procedures. In the E step (for Expectation) the current parameter values are used to estimate the missing values. In the M step (for Maximization) new maximum likelihood parameter estimates are obtained from the observed data and the current estimate of the missing data. This sequence of alternating steps converges to a local maximum of the likelihood function.
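A minimal sketch of the alternation described above, for a univariate normal sample with missing values (the toy data, names and stopping rule are assumptions, not from the dictionary):

```python
import numpy as np

x = np.array([4.1, 5.3, np.nan, 4.8, np.nan, 5.9, 4.4])   # NaN marks missing values
obs = ~np.isnan(x)
mu, var = x[obs].mean(), x[obs].var()                      # starting values from observed data
n = x.size

for _ in range(100):
    # E step: expected sufficient statistics of the missing values given the
    # current parameters (E[x] = mu, E[x^2] = mu^2 + var).
    s1 = x[obs].sum() + (~obs).sum() * mu
    s2 = (x[obs] ** 2).sum() + (~obs).sum() * (mu ** 2 + var)
    # M step: new maximum likelihood estimates from the completed statistics.
    mu_new = s1 / n
    var_new = s2 / n - mu_new ** 2
    if abs(mu_new - mu) < 1e-10 and abs(var_new - var) < 1e-10:
        break
    mu, var = mu_new, var_new

print(mu, var)
```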
E-optimal design [EXDE] → design (□ optimal design)
edge [MISC] → graph theory
Edmonston coefficient [GEOM] → distance (□ binary data)
Edwards and Cavalli-Sforza clustering [CLUS] → hierarchical clustering (□ divisive clustering)
effect [ANOVA] → term in ANOVA
effect variable [PREP] → variable
efficiency [ESTIM] Measure of the relative goodness of an estimator, based on its variance. This measure provides a basis for comparing several potential estimators of the same parameter. The ratio of the variances of two estimators of the same quantity, in which the smaller variance is in the numerator, is called the relative efficiency; it is the relative efficiency of the estimator with the larger variance. It also gives the ratio of the sample sizes required for the two statistics to do the same job. For example, the sample median has variance of approximately (π/2)σ²/n and the sample mean has variance σ²/n, therefore the median has relative efficiency 2/π = 0.64. In other words the median estimates the location of a sample of 100 objects about as well as the mean estimates it for a sample of 64. The efficiency of an estimator as the sample size tends to infinity is called asymptotic efficiency.
efficient estimator [ESTIM] → estimator
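The 2/π relative efficiency quoted above can be checked by simulation; the following sketch (sample size, number of replicates and seed are arbitrary assumptions) compares the variances of the sample mean and sample median for normal data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 20000
samples = rng.normal(0.0, 1.0, size=(reps, n))

var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()

print("relative efficiency of the median:", var_mean / var_median)  # about 2/pi = 0.64
```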
eigenanalysis [ALGE] Analysis of a square matrix X(p, p) in terms of its eigenvalues and eigenvectors. The result is the solution of the system of equations:
X vj = λj vj        j = 1, p
The vector v is called the eigenvector (or characteristic vector, or latent vector); the scalar λ is called the eigenvalue (or characteristic root, or latent root). Each eigenvector defines a one-dimensional subspace that is invariant to premultiplication by X. The eigenvalues are the roots of the characteristic polynomial defined as:
P(λ) = |X − λ I|
The p roots λ are also called the spectrum of the matrix X and are often arranged in a diagonal matrix Λ. The decomposition
X = V Λ Vᵀ = Σj λj vj vjᵀ
is called the eigendecomposition or spectral decomposition. Usually one faces a symmetric eigenproblem, i.e. finding the eigenvalues and eigenvectors of a symmetric square matrix; most of the time the matrix is also positive semi-definite. This case has substantial computational advantages. The asymmetric eigenproblem is more general and more involved computationally. The following algorithms are the most common for solving an eigenproblem.
Jacobi method An old, less commonly used method for diagonalizing a symmetric matrix X. Let S(X) denote the sum of squares of the off-diagonal elements of X, so that X is diagonal if S(X) = 0. For any orthogonal matrix Q, QᵀXQ has the same eigenvalues as X. In each step the Jacobi method finds a matrix Q for which
S(QᵀXQ) < S(X)

linear equation [ALGE] A system of n linear equations in p unknowns, Xa = y, is overdetermined if n > p and underdetermined if n < p. The solution has the following properties:
- if n = p and X is nonsingular, the unique solution is a = X⁻¹y;
- the equation is consistent (i.e. admits at least one solution) if rank(X) = rank(X, y);
- for y = 0, there exists a nontrivial solution (i.e. a ≠ 0) if rank(X) < p;
- the equation XᵀXa = Xᵀy is always consistent.
Linear equations are most commonly solved by orthogonal matrix transformation or Gaussian elimination.
linear estimator [ESTIM] → estimator
linear learning machine (LLM) [CLAS] Binary classification method, similar to Fisher's discriminant analysis, that separates two classes in the p-dimensional measurement space by a (p − 1)-dimensional hyperplane. The iterative procedure starts with an arbitrary hyperplane, defined by a weight vector w orthogonal to the plane, through a specified origin w0. A training object x is classified by calculating a linear discriminant function
s = w0 + wᵀx
which gives positive values for objects on one side and negative values for objects on the other side of the hyperplane. The position of the plane is changed by reflection about a misclassified observation as:
w0(new) = w0(old) + c        w(new) = w(old) + c x
where c = −2s/(1 + xᵀx), so that the corrected plane gives the misclassified object a discriminant score of opposite sign. This method, which is rarely used nowadays, has many disadvantages: non-unique solution, slow or no convergence, too simple class boundary, unbounded misclassification risk.
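A small illustrative sketch of the LLM training loop, assuming the reflection correction c = −2s/(1 + x·x) given above; the simulated two-class data and the 100-pass limit are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)               # class labels

w0, w = rng.normal(), rng.normal(size=2)         # arbitrary starting hyperplane
for _ in range(100):                             # passes over the training set
    errors = 0
    for xi, yi in zip(X, y):
        s = w0 + w @ xi                          # linear discriminant function
        if np.sign(s) != yi:                     # misclassified object
            c = -2 * s / (1 + xi @ xi)           # reflection correction
            w0, w = w0 + c, w + c * xi
            errors += 1
    if errors == 0:                              # separable: every object classified correctly
        break

print(w0, w)
```

Note that, as the entry says, convergence is not guaranteed; the pass limit simply stops the loop for non-separable data.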
linear least squares regression [REGR] : ordinary least squares regression
linear programming [OPTIM] → optimization
linear regression model [REGR] → regression model
linear search optimization [OPTIM] Direct search optimization for minimizing a function of p parameters f(p). The step taken in the ith iteration is
p(i+1) = p(i) + si di
where si is the step size and di is the step direction. There are two basic types of linear search optimization. The first type uses a scheme for systematically reducing the length of the known interval that contains the optimal step size, based on a comparison of function values. Fibonacci search, golden section, and bisection belong to this group. The second type approximates the function f(p) around the minimum with a simpler function (e.g. a second- or third-order polynomial) for which a minimum is easily obtained. These methods are known as quadratic, cubic, etc. interpolations.
linear structural relationship (LISREL) [MULT] Latent variable model solved by maximum likelihood estimation, implemented in a software package. The model consists of two parts: the measurement model specifies how the latent variables are related to the manifest variables, while the structural model specifies the relationship among latent variables. It is assumed that both manifest and latent variables are continuous with zero expected values, and that the manifest variables have a joint normal distribution. The latent variables are of two types: endogenous (η) and exogenous (ξ), and are related by the linear structural model:
η = B η + Γ ξ + ζ
B and Γ are regression coefficients representing direct causal effects among η–η and η–ξ. The error term ζ is uncorrelated with ξ. There are two sets of manifest variables y and x corresponding to the two sets of latent variables. The linear measurement models are:
y = Λy η + ε        x = Λx ξ + δ
idempotent matrix Square matrix X in which
X² = X
For example, the hat matrix is an idempotent matrix.
identity matrix Diagonal matrix, often denoted as I, in which all diagonal elements are 1:
xij = 1 if i = j    and    xij = 0 if i ≠ j
nilpotent matrix Square matrix X in which
Xʳ = 0 for some r
nonsingular matrix Square matrix in which, in contrast to a singular matrix, the determinant is not zero. Only nonsingular matrices can be inverted.
null matrix (: zero matrix) Square matrix in which all elements are zero:
xij = 0 for all i and j
orthogonal matrix Square matrix in which
XᵀX = XXᵀ = I
Examples are: Givens transformation matrix, Householder transformation matrix, and matrix of eigenvectors.
positive definite matrix Square matrix in which
zᵀXz > 0 for all z ≠ 0
positive matrix Square matrix in which all elements are positive:
xij > 0 for all i and j
positive semi-definite matrix Square matrix in which
zᵀXz ≥ 0 for all z ≠ 0
singular matrix Square matrix in which the determinant is zero. In a singular matrix, at least two rows or columns are linearly dependent.
square matrix Matrix in which the number of rows equals the number of columns: n = p
symmetric matrix Square matrix in which row and column indices are interchangeable:
xij = xji
triangular matrix Square matrix in which all elements either below or above the diagonal are zero. In an upper triangular matrix
xij = 0 if i > j
In a lower triangular matrix
xij = 0 if i < j
tridiagonal matrix Square matrix in which only the diagonal elements and the elements adjacent to the diagonal (the superdiagonal and subdiagonal elements) are nonzero. This matrix is both an upper and a lower Hessenberg matrix.
unit matrix Square matrix in which all elements are one:
xij = 1 for all i and j
zero matrix : null matrix
matrix algebra [ALGE] : linear algebra
matrix condition [ALGE] (: condition of a matrix) Characteristic of a matrix reflecting the sensitivity of the quantities computed from the matrix to small changes in the elements of the matrix. A matrix is called an ill-conditioned matrix, with respect to a problem, if the computed quantities are very sensitive to small changes such as numerical precision error. If this is not the case, the matrix is called a well-conditioned matrix. There are various measures of the matrix condition. The most common one is the condition number:
cond(X) = ‖X‖ ‖X⁻¹‖
When ‖·‖ indicates the two-norm, the condition number is the ratio of the largest to the smallest nonzero eigenvalue of the matrix. A large condition number indicates an ill-conditioned matrix.
matrix decomposition [ALGE] The expression of a matrix as a product of two or three other matrices that have more structure than the original matrix. Rank-one-update is often used to make the calculation more efficient. A list of the most important methods follows.
bidiagonalization
B = U X Vᵀ
where B is an upper bidiagonal matrix, and U and V are orthogonal Householder matrices. This decomposition is accomplished by a sequence of Householder transformations. First the subdiagonal elements of the first column of X are set to zero by U1; next the p − 2 elements above the superdiagonal of the first row are set to zero by V1, resulting in U1 X V1. Similar Householder operations are applied to the rest of the columns j via Uj and to the rest of the rows i via Vi:
U = Up Up−1 … U1    and    V = Vp Vp−1 … V1
Bidiagonalization is used, for example, as a first step in singular value decomposition.
Cholesky factorization
X = L Lᵀ
where X is a symmetric, positive definite matrix, and L is a lower triangular matrix. The Cholesky decomposition is widely used in solving linear equation systems of the form X b = y.
diagonalization : singular value decomposition
eigendecomposition : spectral decomposition
L-R decomposition : L-U decomposition
L-U decomposition (: L-R decomposition, triangular factorization)
X = L U
where L is a lower triangular matrix and U is an upper triangular matrix. The most common method to calculate the L-U decomposition is Gaussian elimination.
Q-R decomposition
X = Q R
where Q is an orthogonal matrix and R is an upper triangular matrix. This decomposition is equivalent to an orthogonal transformation of X into an upper triangular form, used in solving linear equations and in eigenanalysis.
singular value decomposition (SVD) (: diagonalization)
X = U Λ Vᵀ
where U and V are orthogonal matrices, and Λ is a diagonal matrix. The values of the diagonal elements of Λ are called the singular values of X; the columns of U and V are called left and right singular vectors, respectively. Λ is calculated by first transforming X into a bidiagonal matrix and then eliminating the superdiagonal elements. When n = p and X is a symmetric positive definite matrix the singular values coincide with the eigenvalues of X. Furthermore U = V, which contains the eigenvectors of X.
spectral decomposition (: eigendecomposition)
X = V Λ Vᵀ
where X is a symmetric matrix, Λ is a diagonal matrix of the eigenvalues of X, and V is an orthogonal matrix containing the eigenvectors.
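The decompositions listed above can be checked numerically; a brief sketch (the example matrix, seed and use of numpy are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
X = A @ A.T                                      # symmetric, positive definite example

U, s, Vt = np.linalg.svd(X)                      # singular value decomposition
evals, V = np.linalg.eigh(X)                     # spectral decomposition X = V diag(evals) V^T
L = np.linalg.cholesky(X)                        # Cholesky factor, X = L L^T
Q, R = np.linalg.qr(X)                           # Q-R decomposition

print(np.allclose(X, U @ np.diag(s) @ Vt))       # True: X reconstructed from its SVD
print(np.allclose(np.sort(s), np.sort(evals)))   # here singular values = eigenvalues
print(np.allclose(X, L @ L.T))                   # True
print(np.allclose(X, Q @ R))                     # True
```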
triangular factorization : L-U decomposition
tridiagonalization
T = U X Uᵀ
where X is a symmetric matrix, T is a tridiagonal matrix, and U is an orthogonal matrix. This decomposition, similar to bidiagonalization, is achieved by a sequence of Householder transformations: U = U1 … Up−1. It is used mainly in eigenanalysis.
matrix norm [ALGE] → norm
matrix operation [ALGE] The following are the most common operations on a single matrix X, or on two matrices X and Y with matching orders.
addition of matrices
Z(n, p) = X(n, p) + Y(n, p)        zij = xij + yij
Addition can be performed only if X and Y have the same dimensions.
determinant of a matrix Scalar defined for a square matrix X(p, p) as:
|X| = Σa |a| x1a(1) x2a(2) … xpa(p)
where the summation is taken over all permutations a of (1, …, p), and |a| equals +1 or −1, depending on whether a can be written as the product of an even or odd number of transpositions. For example, for p = 2
|X| = x11 x22 − x12 x21
The determinant of a diagonal matrix X is equal to the product of the diagonal elements:
|X| = Πi xii
If |X| ≠ 0, then X is a nonsingular matrix. Important properties of the determinant are:
a) |XY| = |X| |Y|
b) |Xᵀ| = |X|
c) |cX| = c^p |X|
inverse of a matrix The inverse of a square matrix X is the unique matrix X⁻¹ satisfying:
X X⁻¹ = X⁻¹ X = I
The inverse exists only if X is a nonsingular matrix, i.e. if |X| ≠ 0. The most efficient way to calculate the inverse is via Cholesky decomposition. The matrix X⁻ is called a generalized inverse matrix of a nonsquare matrix X if it satisfies:
X X⁻ X = X
The generalized inverse always exists, although usually it is not unique. If the following three equations are also satisfied, it is called a Moore-Penrose inverse (or pseudo inverse) and denoted as X⁺:
X⁺ X X⁺ = X⁺        (X X⁺)ᵀ = X X⁺        (X⁺ X)ᵀ = X⁺ X
For example, the Moore-Penrose inverse matrix is calculated in ordinary least squares regression as:
X⁺ = (XᵀX)⁻¹ Xᵀ
The generalized inverse is best obtained by first performing a singular value decomposition of X:
X = U Λ Vᵀ        then        X⁺ = V Λ⁻¹ Uᵀ
where the reciprocals are taken only of the nonzero elements of Λ.
multiplication of matrices
Z(n, m) = X(n, p) Y(p, m)        zij = Σk xik ykj
Multiplication can be performed only if the number of columns in X equals the number of rows in Y.
partitioning of a matrix
scalar multiplication of a matrix
Z(n, p) = c X(n, p)        zij = c xij
subtraction of matrices
Z(n, p) = X(n, p) − Y(n, p)        zij = xij − yij
Subtraction can be performed only if X and Y have the same dimensions.
trace of a matrix The sum of the diagonal elements of X, denoted as tr(X):
tr(X) = Σi xii
transpose of a matrix Interchanging the row and column indices of a matrix:
Z(p, n) = Xᵀ(n, p)        zji = xij
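A short numerical illustration of several of the operations listed above (determinant, inverse, Moore-Penrose inverse, trace and transpose); the example matrices are arbitrary assumptions:

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 3.0]])
Y = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])                  # nonsquare: only a generalized inverse exists

print(np.linalg.det(X))                          # |X| = 2*3 - 1*1 = 5
print(np.linalg.inv(X))                          # X^-1, exists because |X| != 0
print(np.trace(X))                               # tr(X) = 2 + 3 = 5
print(X.T)                                       # transpose

Yplus = np.linalg.pinv(Y)                        # Moore-Penrose inverse, computed via SVD
print(np.allclose(Y @ Yplus @ Y, Y))             # defining property X X- X = X
```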
matrix plot [GRAPH] → draftsman's plot
matrix rank [ALGE] (: rank of a matrix) The maximum number of linearly independent vectors (rows or columns) in a matrix X(n, p), denoted as r(X). Consequently, linearly dependent rows or columns reduce the rank of a matrix. A matrix X is called a rank deficient matrix when r(X) < p. The rank has the following properties:
- 0 ≤ r(X) ≤ min(n, p)
- r(X) = r(Xᵀ)
- r(X + Z) ≤ r(X) + r(Z)
- r(XZ) ≤ min[r(X), r(Z)]
- r(XᵀX) = r(XXᵀ) = r(X)
matrix transformation [ALGE] Numerical method that transforms a real matrix X(n, p) or a real symmetric matrix X(p, p) into some desirable form by pre- or post-multiplying it by a chosen set of nonsingular matrices. The method is called orthogonal matrix transformation when the multiplier matrices are orthogonal and an orthogonal basis for the column space of X is calculated; otherwise the transformation is nonorthogonal, and is usually performed by elimination methods. Matrix transformations are the basis of several numerical algorithms for solving linear equations and for eigenanalysis.
maximum likelihood (ML) [PROB] → likelihood
maximum likelihood clustering [CLUS] Clustering method that estimates the partition on the basis of the maximum likelihood. Given a data matrix X of n objects described by p variables, a parameter vector θ = (π1 … πG; μ1 … μG; Σ1 … ΣG) and another parameter vector γ = (n1 … nG), the likelihood is:
L(θ, γ | X) = Πg Πi∈sg πg f(xi; μg, Σg)        g = 1, G
where sg is the set of objects xi belonging to the gth group and ng is the number of objects in sg. Parameters πg, μg, and Σg are the prior cluster probabilities, the cluster centroids, and the within-cluster covariance matrices, respectively. The ML estimates of these quantities are:
πg:   pg = ng / n
μg:   cg = Σsg xi / ng
Σg:   Sg = Σsg (xi − cg)ᵀ(xi − cg) / ng        g = 1, G
maximum likelihood estimator [ESTIM] → estimator
maximum likelihood factor analysis [FACT] Factor extraction method based on maximizing the likelihood function for the normal distribution
L(X | μ, Σ)
The population covariance matrix Σ can be expressed in terms of common and unique factors as
Σ = Λ Λᵀ + U
and fitted to the sample covariance matrix S. The assumption is that both common factors and unique factors are independent and normally distributed with zero means and unit variances; consequently the variables are drawn from a multivariate normal distribution. The common factors are assumed to be orthogonal, and their number M is prespecified. The factor loadings Λ and the unique factors U are calculated in an iterative procedure based on maximizing the likelihood function, which is equivalent to minimizing:
min  tr[Σ⁻¹S] − ln|Σ⁻¹S| − p
The minimum is found by nonlinear optimization. Convergence is not always obtained, the optimization algorithm often ends in a local minimum, and sometimes one must face the Heywood case problem. The solution is not unique; two solutions can differ by a rotation.
maximum likelihood latent structure analysis [MULT] → latent class model
maximum linkage [CLUS] → hierarchical clustering (□ agglomerative clustering)
maximum scaling [PREP] → standardization
maxplane rotation [FACT] → factor rotation
McCulloh-Meeter plot [GRAPH] → scatter plot (□ residual plot)
McNemar's test [TEST] → hypothesis test
McQuitty's similarity analysis [CLUS] → hierarchical clustering (□ agglomerative clustering)
mean [DESC] → location
mean absolute deviation (MAD) [DESC] → dispersion
mean character difference [GEOM] → distance (□ quantitative data)
mean deviation [DESC] → dispersion
mean function [TIME] → autocovariance function
mean group covariance matrix [DESC] → covariance matrix
mean square error (MSE) [ESTIM] Mean squared difference between the true value and the estimated value of a parameter. It has two components: variance and squared bias.
MSE(θ̂) = E[θ̂ − θ]² = V[θ̂] + B²[θ̂] = E[θ̂ − E[θ̂]]² + (E[θ̂] − θ)²
Estimators that calculate estimates with zero bias are called unbiased estimators. Biased estimators increase the bias and decrease the variance component of the MSE, trying to find its minimum at an optimal complexity.
[Figure: MSE and its bias and variance components plotted against model complexity; the minimum of the MSE curve marks the optimal complexity.]
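The decomposition MSE = variance + squared bias can be verified by simulation; a minimal sketch (the deliberately biased "shrunken mean" estimator, sample size and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 2.0, 25, 50000
estimates = 0.8 * rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)  # biased estimator

mse = np.mean((estimates - theta) ** 2)
variance = estimates.var()
bias2 = (estimates.mean() - theta) ** 2

print(mse, variance + bias2)                     # the two numbers agree closely
```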
mean square error (MSE) [MODEL] → goodness of fit
mean squares in ANOVA (MS) [ANOVA] Column in the analysis of variance table containing the ratio of the sum of squares to the number of degrees of freedom of the corresponding term. It estimates the variance component of the term and is used to test the significance of the term. The mean square of the error term is an unbiased estimate of the error variance.
mean trigonometric deviation [DESC] → dispersion
measure of distortion [CLUS] → assessment of clustering
median [DESC] → location
median absolute deviation around the median (MADM) [DESC] → dispersion
median linkage [CLUS] → hierarchical clustering (□ agglomerative clustering)
median test [TEST] → hypothesis test
membership function [MISC] → fuzzy set theory
metameric transformation [PREP] → transformation
metameter [PREP] → transformation (□ metameric transformation)
metric data [PREP] → data
metric distance [GEOM] → distance
metric multidimensional scaling [MULT] → multidimensional scaling
metric scale [PREP] → scale
midrange [DESC] → location
military standard table (MIL-STD) [QUAL] Military standard for acceptance sampling that is widely used in industry. It is a collection of sampling plans. The most popular one is MIL-STD 105D, which contains standards for single, double and multiple sampling for attributes. The primary focus is the acceptable quality level. There are three inspection levels: normal, tightened and reduced. The sample size is determined by the lot size and the inspection level. MIL-STD 414 contains sampling plans for variable sampling. These also control the acceptable quality level. There are five inspection levels. It is assumed that the quality characteristic is normally distributed.
minimal spanning tree (MST) [MISC] → graph theory
minimal spanning tree clustering [CLUS] → non-hierarchical clustering (□ graph theoretical clustering)
minimax strategy [MISC] → game theory
minimum linkage [CLUS] → hierarchical clustering (□ agglomerative clustering)
Minkowski distance [GEOM] → distance (□ quantitative data)
minres [FACT] → principal factor analysis
misclassification [CLAS] → classification
misclassification matrix [CLAS] → classification
misclassification risk (MR) [CLAS] → classification
missing value [PREP] Absent element in the data matrix. It is important to distinguish between a real missing value (a value potentially available but not measured), a don't care value
(the measured value is not relevant), and a meaningless value (the measurement is not possible or not allowed). There are several techniques for dealing with missing values. The simplest solution, applicable only in the case of few missing values, is to delete the object (row) or the variable (column) containing the missing value. A more reasonable procedure is to fill in the missing value by some estimated value obtained from the same variable. This can be: the variable mean calculated from all objects with nonmissing values; the variable mean calculated from a subset of objects, e.g. belonging to the same class; or a random value from the normal distribution within the extremes of the variable. While the variable mean produces a flattening of the information content of the variable, the random value introduces spurious noise. Missing values can also be estimated using multivariate models. Principal component, K nearest neighbor and regression models are the most popular ones.
mixed data [PREP] → data
mixed effect model [ANOVA] → term in ANOVA
mixture design [EXDE] → design
moat [CLUS] Measure of external isolation of a cluster defined for hierarchical agglomerative clustering. It is the difference between the similarity level at which the cluster was formed and the similarity level at which the cluster was agglomerated into a larger cluster. Clusters in complete linkage usually have larger moats than clusters in single linkage.
mode [DESC] → location
mode analysis [CLUS] → non-hierarchical clustering (□ density clustering)
model [MODEL] Mathematical equation describing a causal relationship between several variables. The responses (or dependent variables) in a model may be either quantitative or qualitative. Examples of the first group are the regression model, the factor analysis model, and the ANOVA model, while classification models, for example, belong to the second group. The value of the highest power of a predictor variable is called
the order of a model. A subset of predictor variables that contains almost the same amount of information about the response as the complete set is called an adequate subset. The number of independent pieces of information needed to estimate a model is called the model degrees of freedom. A model consists of two parts: a systematic part described by the equation, and the model error. This division is also reflected in the division of the total sum of squares. The calculation of the optimal parameters of a model from data is called model fitting. Besides fitting the data, a model is also used for prediction. Once a model is obtained it must be evaluated on the basis of goodness of fit and goodness of prediction criteria. This step is called model validation. Often the problem is to find the optimal model among several potential models. Such a procedure is called model selection.
additive model Model in which the predictors have an additive effect on the response.
biased model Statistical model in which the parameters are calculated by biased estimators. The goal is to minimize the model error via the bias-variance trade-off. In these models the bias is not zero, but in exchange, the variance component of the squared model error is smaller than in a corresponding unbiased model. PCR, PLS, and ridge regression are examples of biased regression models; RDA, DASCO and SIMCA are biased classification models.
causal model Model concerned with the estimation of the parameters in a system of simultaneous equations relating dependent and independent variables. The independent variables are viewed as causes and the dependent variables as effects. The dependent variables may affect each other and be affected by the independent variables; the latter are not, however, affected by the dependent variables. The cause-effect relationship cannot be proved by statistical methods; it is postulated outside of statistics. Causal models can be solved by LISREL or by path analysis and are frequently described by path diagrams.
deterministic model Model that, in contrast to a stochastic model, does not contain random elements.
hierarchical model : nested model
nested model (: hierarchical model) Set of models in which each one is a special case of a more general model obtained by dropping some terms (usually by setting a parameter to zero). For example, the model
y = b0 + b1 x
is nested within
y = b0 + b1 x + b2 x²
There are two extreme approaches to finding the best nested model:
- the top-down approach starts with the largest model of the family (also called the saturated model) and then drops terms by backward elimination;
- the bottom-up approach starts with the simplest model and then includes terms one at a time from a list of the possible terms, by forward elimination.
It is important to take account not only of the best fit but also of the trade-off between fit and complexity of the model. Hierarchical models can be compared on the basis of criteria such as adjusted R², Mallows' Cp, PRESS, the likelihood ratio and information criteria (e.g. the Akaike information criterion). In contrast to nested models, nonnested models are a group of heterogeneous models. For example, y = b0 + b1 ln(x) + b2(1/x) is nonnested within the above models. Maximum likelihood estimators and information criteria are particularly useful when comparing nonnested models.
parsimonious model Model with the fewest parameters among several satisfactory models. soft model A term used to denote the soft part of the modeling approach, characterized by the use of latent variables, principal components or factors calculated from the data set and therefore directly describing the structure of the data. This is opposite to a hard model, in which a priori ideas of functional connections (physical, chemical, biological, etc. mathematical equations) are used and models are constructed from these.
stochastic model Model that contains random elements.
model I [ANOVA] → term in ANOVA
model II [ANOVA] → term in ANOVA
model degrees of freedom (df) [MODEL] The number of independent pieces of information necessary to estimate the parameters of a statistical model. For example, in a linear regression model it is calculated as the trace of the hat matrix: tr[H]. In OLS with p parameters: tr[H] = p. The total degrees of freedom (the number of objects n) minus the model degrees of freedom is called the error degrees of freedom or residual degrees of freedom. For example, in a linear regression model it is calculated as n − tr[H]. The error degrees of freedom of an OLS model with p parameters is n − p.
model error [MODEL] (: error term) Part of the variance that cannot be described by a model, denoted e or ei. It is calculated as the difference between the observed and the calculated response:
e = y − ŷ        or        ei = yi − ŷi
The standard deviation of ei, denoted σ, is also often called the model error. The estimated model error s is an important characteristic of a model and the basis of several goodness of fit statistics.
model fitting [MODEL] Procedure of calculating the optimal parameter values of a model from observed data. When the functional form of the model is specified in analytical form (e.g. linear or polynomial), the procedure is called curve fitting. In contrast, the form of the model is sometimes defined only by weak constraints (e.g. smoothness) and the model obtained is stored in digital form. Overfitting means increasing the complexity of a model, i.e. fitting model terms that make little or no contribution. Increasing the number of parameters and the model degrees of freedom beyond the level supported by the available data causes high variance in the estimated quantities. In the case of a small data set, in particular, one should be careful about fitting excessively complex models (e.g. nonlinear models). Underfitting is the opposite phenomenon. Underspecified models (e.g. linear instead of nonlinear, or with terms excluded) result in bias in the estimates.
model selection [MODEL] Selection of the optimal model from a set of candidate models. The criterion for optimality has to be prespecified. One is usually interested in finding the best predictive model, so model selection is often based on a goodness of prediction criterion. Important examples are biased regression and classification. The range of candidate models can be defined by the number of predictor variables included in the model (e.g. variable subset selection, stepwise linear discriminant analysis), by the number of components (e.g. PCR, PLS, SIMCA), or by the value of a shrinkage parameter (e.g. ridge regression, RDA).
model sum of squares (MSS) [MODEL] → goodness of fit
model validation [MODEL] (: validation) Statistical procedure for validating a statistical model with respect to a prespecified criterion, most often to assess the goodness of fit and goodness of prediction of a model. A list of various model validation techniques follows.
bootstrap Computer intensive model validation technique that gives a nonparametric estimate of the statistical error of a model in terms of its bias and variance. This procedure mimics the process of drawing many samples of equal size from a population in order to calculate a confidence interval for the estimates. The data set of n observations is not considered as a sample from a population, but as the entire population itself, from which samples of size n, called bootstrap samples, are drawn with replacement. This is achieved by assigning a number to each observation of the data set and then generating the random samples by matching a string of random numbers to the numbers that correspond to the observations. The estimate calculated from the entire data set is denoted t, while the estimate calculated from the bth bootstrap sample is denoted tb. The mean value of tb is
t̄ = Σb tb / B
where B is the number of bootstrap samples. The bootstrap estimates of the bias B(t) and the variance V(t) of the statistic t are
B(t) = t̄ − t
V(t) = Σb (tb − t̄)² / (B − 1)
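A minimal sketch of these bootstrap estimates applied to the sample median (the data, the number of bootstrap samples B and the seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(1.0, size=50)
t = np.median(x)                                 # statistic from the whole data set

B = 2000
tb = np.array([np.median(rng.choice(x, size=x.size, replace=True))
               for _ in range(B)])               # bootstrap replicates

t_bar = tb.mean()
bias = t_bar - t                                 # B(t) = tbar - t
variance = ((tb - t_bar) ** 2).sum() / (B - 1)   # V(t)
print(bias, variance)
```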
cross-validation (cv) Model validation technique used to estimate the predictive power of a statistical model. This resampling procedure predicts an observation from a model calculated without that observation, so that the predicted response is independent of the observed response. The predictive residual sum of squares, PRESS, is one of the best goodness of prediction measures. In linear estimators, e.g. OLS, PRESS can be calculated from ordinary residuals, while in nonlinear estimators, e.g. PLS, the predicted observations must literally be left out of the model calculation. Cross-validation which repeats the model calculation n times, each time leaving out a different observation and predicting it from a model fitted to the other n − 1 observations, is called leave-one-out cross-validation (LOO). Cross-validation in which n' = n/G observations are left out is called G-fold cross-validation. In this case the model calculation is repeated G times, each time leaving out n' different observations. Because the perturbation of the model is greater than in the leave-one-out procedure, the G-fold goodness of prediction estimate is usually less optimistic than the LOO estimate.
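A minimal sketch of leave-one-out cross-validation and PRESS for an OLS model (the simulated data, model form and seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 30
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])             # design matrix with intercept

press = 0.0
for i in range(n):
    keep = np.arange(n) != i                     # leave observation i out
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    press += (y[i] - X[i] @ b) ** 2              # predictive residual

print("PRESS =", press)
```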
jackknife Model validation technique that gives a nonparametric estimate of the statistical error of an estimator in terms of its bias and variance. The procedure is based on sequentially deleting observations from the sample and recomputing the statistic; t(i) denotes the statistic t calculated without the ith observation. The mean of these statistics calculated from the truncated data sets is
t(.) = Σi t(i) / n
The jackknife estimates of the bias B(t) and the variance V(t) of the statistic t are:
B(t) = (n − 1)(t(.) − t)
V(t) = [(n − 1)/n] Σi [t(i) − t(.)]²
For example, the jackknife shows that the mean is an unbiased estimator:
B(x̄) = (n − 1)(x̄(.) − x̄) = 0
V(x̄) = Σi (xi − x̄)² / [n(n − 1)]
Similarly, the bias and variance of the variance can be estimated in the same way. The bias and variance of more complex estimators, like regression parameters, eigenvectors, discriminant scores, etc. can be calculated in a similar fashion, obtaining a numerical estimate of the statistical error. The jackknife is similar to cross-validation in that in both procedures observations are omitted one at a time and estimators are calculated repeatedly on truncated data sets. However, the goal of the two techniques is different.
resampling (: sample reuse) Model validation procedure that repeatedly calculates an estimator from a reweighted version of the original data set in order to estimate the statistical error associated with the estimator. With this technique the goodness of prediction of a model can be calculated from a single data set. Instead of splitting the available data into a training set, to calculate the model, and an evaluation set, to evaluate it, the evaluation set is created by repeated subsampling. Cross-validation, bootstrap and jackknife belong to this group.
resubstitution Model validation based on the same observations used for calculating the model. Such validation methods can be used to calculate goodness of fit measures, but not goodness of prediction measures. For example, in regression models RSS or R², and in classification the error rate or the misclassification risk, are calculated via resubstitution.
sample reuse : resampling
training-evaluation set split In the case of many observations, the calculated model can be evaluated without reusing the observations. The whole data set is split into two parts. One part, called the training set, is used to calculate the model, and the other part, called the evaluation set, is used to evaluate the model. A key point in this model validation method is how to partition the data set. The sizes of the two sets are often selected to be equal. A random split is usually a good choice, except if there is some trend or regularity in the observations (e.g. time series).
modified control chart [QUAL] → control chart
moment [PROB] The expected value of a power of a variate x with probability distribution function f(x):
μr = ∫ xʳ f(x) dx
The central moment is the moment calculated about the mean:
μ'r = ∫ (x − μ)ʳ f(x) dx
The absolute moment is defined as
∫ |x|ʳ f(x) dx
The most important moments are the first moment, which is the mean
μ = ∫ x f(x) dx
and the second central moment, which is the variance of the variate
σ² = ∫ (x − μ)² f(x) dx
All integrals are taken over −∞ < x < ∞. The skewness of a distribution is μ3/σ³ and the kurtosis of a distribution is μ4/σ⁴.
momentum term [MISC] → neural network
Monte Carlo sampling [PROB] → sampling
Monte Carlo simulation [PROB] : simulation
Mood-Brown's test [TEST] → hypothesis test
Mood's test [TEST] → hypothesis test
Moore-Penrose inverse [ALGE] → matrix operation (□ inverse of a matrix)
monothetic clustering [CLUS] → cluster analysis
monotone admissibility [CLUS] → assessment of clustering (□ admissibility properties)
Moses' test [TEST] → hypothesis test
most powerful test [TEST] → hypothesis testing
moving average (MA) [TIME] The series of values
mq = Σi=1,k wi xq+i / Σi=1,k wi        q = 0, n − k
where x1, …, xn are a set of values from a time series, k is a span parameter (k < n) and w1, …, wk are a set of weights.
If all the weights are equal to 1/k, the moving average is called simple and can be constructed by dividing the moving total (the sum of k consecutive elements) by k:
Σi=1,k xi / k,    Σi=2,k+1 xi / k,    Σi=3,k+2 xi / k,    etc.
The set of moving averages constitutes the moving average model.
moving average chart [QUAL] → control chart
moving average model (MA) [TIME] → time series model
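A minimal sketch of the moving average defined above, both simple (equal weights 1/k) and weighted; the series, span and weights are illustrative assumptions:

```python
import numpy as np

x = np.array([3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 10.0])
k = 3

moving_total = np.convolve(x, np.ones(k), mode="valid")     # sums of k consecutive values
simple_ma = moving_total / k                                # simple moving average

w = np.array([1.0, 2.0, 1.0])                               # a weighted version
weighted_ma = np.convolve(x, w[::-1], mode="valid") / w.sum()

print(simple_ma)
print(weighted_ma)
```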
moving total [TIME] → moving average
multiblock PLS [REGR] → partial least squares regression
multicollinearity [MULT] : collinearity
multicriteria decision making (MCDM) [QUAL] Techniques for finding the best settings of process variables that optimize more than one criterion. These techniques can be divided into two groups: the generating techniques use no prior information to determine the relative importance of the various criteria, while in the preference techniques the relative importance of the criteria is given a priori. The most commonly used MCDM techniques follow.
desirability function Several responses are merged into one final criterion to be optimized. The first step is to define the desired values for each response. Each measured response is transformed into a measure of desirability dr, where 0 ≤ dr ≤ 1. The overall desirability D is calculated as the geometric mean:
D = (d1 d2 … dR)^(1/R)
The following scale was suggested for desirability:

Value          Quality
1.00-0.80      excellent
0.80-0.63      good
0.63-0.40      poor
0.40-0.30      borderline
0.30-0.00      unacceptable
When the response is categorical, or should be in a specified interval or above a specified level to be acceptable, only dr = 1 and dr = 0 are assigned. When the response Yr is continuous, it is transformed into dr by a one-sided transformation:
dr = exp[−exp(−Yr)]
or by an analogous two-sided transformation.
outranking method Ranks all possible parameter value combinations. Methods belonging to this group are: ELECTRE, ORESTE, and PROMETHEE. PROMETHEE ranks the parameter combinations using a preference function that gives a binary output comparing two parameter combinations.
overlay plot Projects several bivariate contour plots of response surfaces onto one single plot. Each contour plot represents a different criterion. Minimum and maximum boundaries for an acceptable criterion in each contour plot can be compared visually on the aggregate plot and the best process variables selected.
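A minimal sketch of the desirability function described above, combining three responses with the one-sided transformation and a geometric mean (the standardized response values are illustrative assumptions):

```python
import numpy as np

responses = np.array([1.2, 0.4, 2.1])            # standardized continuous responses Y_r

d = np.exp(-np.exp(-responses))                  # one-sided transformation to [0, 1]
D = d.prod() ** (1.0 / d.size)                   # overall desirability (geometric mean)

print(d, D)                                      # e.g. D near 0.8 would rate as "good"
```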
Pareto optimality criterion Selects a set of noninferior, so-called Pareto-optimal points, that are superior to the other points in the process variable space. The nonselected points are inferior to the Pareto-optimal points in at least one criterion. The position of the Pareto-optimal points is often investigated on principal component plots and biplots.
utility function An overall optimality criterion is calculated as a linear combination of the k = 1, K different criteria fk:
U = Σk wk fk
The optimum of such a utility function, given certain weights, is always in the set of Pareto-optimal points.
multidimensional scaling (MDS) [MULT] Mapping method to construct a configuration of n points in a p'-dimensional space from a matrix containing interpoint distances or similarities measured with error in the p-dimensional space. Starting from a distance matrix D, the objective is to find the locations of the data points x1, …, xn in the p'-dimensional space such that their interpoint distances d ∈ D are similar in some sense to the distances d' in the projected space. Usually p' is restricted to 1, 2, or 3. Multidimensional scaling is a dimension reduction technique that provides a one-, two-, or three-dimensional picture that conserves most of the structure of the original configuration. There are several solutions to this problem, all indeterminate with respect to translation, rotation and reflection. The solution is a non-metric multidimensional scaling (NMDS) if it is based only on the rank order of the pairwise distances, otherwise it is called metric multidimensional scaling. The most popular solution minimizes the stress function
S = {Σi<j (dij − d'ij)² / Σi<j dij²}^(1/2)
The minimum indicating the optimal configuration is usually found by the steepest descent method.
multigraph [MISC] → graph theory
multinomial distribution [PROB] → distribution
multinomial variable [PREP] → variable
multiple comparison [ANOVA] → contrast
multiple correlation [DESC] → correlation
multiple correlation coefficient [MODEL] → goodness of fit
multiple least squares regression [REGR] : ordinary least squares regression
multiple regression model [REGR] → regression model
multiple sampling [QUAL] → acceptance sampling
multiplication of matrices [ALGE] → matrix operation
multistate variable [PREP] → variable
multivariate [MULT] This term refers to objects described by several variables (multivariate data), to statistical parameters and distributions of such data, and to models and methods applied to such data. In contrast, the term univariate refers to objects described by only one variable (univariate data), to statistical parameters and distributions of such a variable, and to models and methods applied to one single variable. One-variable-at-a-time (OVAT) is a term which refers to a statistical method that tries to find the optimal solution for a multivariate problem by considering the variation in only one variable at a time, while keeping all other variables at a fixed level. Such a limited approach loses information about variable interaction, synergism and correlation. Despite this drawback, such methods are used due to their simplicity, ease of control and interpretability.
multivariate [PROB] → random variable
multivariate adaptive regression splines (MARS) [REGR] Nonparametric, nonlinear regression model based on splines. The MARS model has the form:
ŷ = b0 + Σ fj(xj) + Σ fjk(xj, xk) + Σ fjkl(xj, xk, xl) + …
The first term contains all functions that involve a single variable, the second term contains all functions that involve two variables, the third term has functions with three variables, etc. A univariate spline has the form
yi = b0 + Σm bm Bm(xi)        i = 1, n
where the Bm are called spline basis functions and have the form:
Bm(xi) = [sm(xi − tm)]+^q        m = 1, M
Values tm are called knot locations that define the predictor subregions, and sm equals either +1 or −1. The + sign indicates that the function is evaluated only for positive values. The index q is the power of the fitted spline; q = 1 indicates a linear spline, while q = 3 indicates a cubic spline. MARS is a multivariate extension of the above univariate model:
yi = b0 + Σm bm Bm(xi)        Bm(xi) = Πj Bjm(xij)
Each multivariate basis function is a product of j = 1, J univariate basis functions. The parameter J can be different for each multivariate basis function. Once the basis functions are calculated the regression coefficients bm are estimated by the least squares procedure. The above multivariate splines are adaptive splines in MARS. This means that the knot locations are not fixed, but optimized on the basis of the training set. The advantage of the MARS model is its capability to depict different relationships in the various predictor subregions via local variable subset selection, and to include both additive and interactive terms. As in many nonlinear and biased regression methods, a crucial point is to determine the optimal complexity of the model. In MARS the complexity is determined by q, the degree of the polynomials fitted, by M, the number of terms, and by J, the order of the multivariate basis functions. These parameters are optimized by cross-validation to obtain the best predictive model.
multivariate analysis [MULT] Statistical analysis performed on multivariate data, i.e. on objects described by more than one variable. Multivariate data can be collected in one or more data matrices. Multivariate methods can be distinguished according to the number of data matrices they deal with. Display methods, factor analysis, cluster analysis and principal coordinate analysis explore the structure of one single matrix. Canonical correlation analysis and Procrustes analysis examine the relationship between two matrices. Latent variable models connect two or more matrices. Multivariate analysis that examines the relationship among n objects described by p variables is called Q-analysis. The opposite is R-analysis when the relationship
among p variables determined by n observations is of interest. For example, clustering objects or extracting latent variables is Q-analysis, while clustering variables or calculating correlation among variables is R-analysis.
multivariate analysis of variance (MANOVA) [ANOVA] Multivariate generalization of the analysis of variance, studying group differences in location in a multidimensional measurement space. The response is a vector that is assumed to arise from multivariate normal distributions with different means but with the same covariance matrix. The goal is to verify the differences among the population locations and to estimate the treatment effects. The one-way MANOVA model with I levels and K observations per level partitions the total scatter matrix T into the between-scatter matrix B and the within-scatter matrix W:
T = B + W
where
T = Σi Σk (xik − x̄)(xik − x̄)ᵀ
B = K Σi (x̄i − x̄)(x̄i − x̄)ᵀ
W = Σi Σk (xik − x̄i)(xik − x̄i)ᵀ
The test for location differences involves generalized variances. The null hypothesis of equal group locations is rejected if the ratio (Wilks' Λ)
Λ = |W| / |T|
is too small.
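A minimal sketch of the one-way MANOVA decomposition and Wilks' Λ above (the simulated groups, the shift applied to one group, and the sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
I, K, p = 3, 10, 2
X = rng.normal(size=(I, K, p))
X[1] += 1.0                                      # shift the second group's location

grand = X.reshape(-1, p).mean(axis=0)            # grand centroid
means = X.mean(axis=1)                           # group centroids

T = sum(np.outer(x - grand, x - grand) for x in X.reshape(-1, p))
B = K * sum(np.outer(m - grand, m - grand) for m in means)
W = T - B                                        # T = B + W

wilks = np.linalg.det(W) / np.linalg.det(T)      # small values indicate location differences
print(wilks)
```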
multivariate calibration [REGR] → calibration
multivariate data [PREP] → data
multivariate dispersion [DESC] The dispersion of multivariate data about its location is measured, for example, by the covariance matrix. However, sometimes it is convenient to assess the multivariate dispersion by a single number. Two common measures are: the generalized variance, which is the determinant of the covariance matrix, |S|, and the total variation, which is the trace of the covariance matrix, tr[S]. In both cases, a large value indicates wide dispersion and a low value represents a tight concentration of the data about the location. For example, these quantities are optimized in non-hierarchical cluster analysis. The generalized variance plays an important role in maximum likelihood estimation; the total variation is calculated in principal component analysis.
multivariate distribution [PROB] → random variable
multivariate image analysis (MIA) [MULT] → image analysis
multivariate least squares regression (MLS) [REGR] : ordinary least squares regression
multivariate normal distribution [PROB] → distribution
multivariate regression model [REGR] → regression model
mutual information [MISC] → information theory
np-chart [QUAL] → control chart (□ attribute control chart)
narrow-band process [TIME] → stochastic process
nearest centroid sorting [CLUS] → non-hierarchical clustering (□ optimization clustering)
nearest centrotype sorting [CLUS] → non-hierarchical clustering (□ optimization clustering)
nearest means classification (NMC) [CLAS] : centroid classification
nearest neighbor density estimator [ESTIM] → estimator (□ density estimator)
nearest neighbor linkage [CLUS] → hierarchical clustering (□ agglomerative clustering)
negative binomial distribution [PROB] → distribution
negative exponential distribution [PROB] → distribution
nested design [EXDE] → design
nested effect term [ANOVA] → term in ANOVA
nested model [ANOVA] → analysis of variance
nested model [MODEL] → model
nested sampling [PROB] → sampling
network [MISC] → graph theory
neural network (NN) [MISC] (: artificial neural network) New, rapidly developing field of pattern recognition based on mimicking the function of the biological neural system. It is a parallel machine that uses a large number of simple processors, called neurons, with a high degree of connectivity. A NN is a nonlinear model, and an adaptive system that learns by adjusting the strength of the connections between neurons.
[Figure: a three-layer feedforward neural network; the input layer feeds a hidden layer, which feeds the output layer producing the outputs y1 and y2.]
A simple NN model consists of three layers, each composed of neurons: the input layer distributes the input to the processing layer, the output layer produces the output, and the processing or hidden layer (sometimes more than one) does the calculation. The neurons in the input layer correspond to the predictor variables, while the neurons in the output layer represent the response variables. Each neuron in a layer is fully connected to the neurons in the next layer. Connections between neurons within the same layer are prohibited. The hidden layer has no connection to the outside world; it calculates only intermediate values. The node-node connection having a weight associated with it is called a synapse. These weights are automatically adapted in the training process to obtain an optimal setting. Each neuron has a transfer function that translates input to output. There are different types of networks. In the feedforward network the signals are propagated from the input layer to the output layer. The input xi of a neuron is the weighted output from the connected neurons in the previous layer. The output yi of a neuron i depends on its input xi and on its transfer function f(xi). A popular transfer function is the sigmoid:
f(xi) = {1 + exp[−(xi − ti)]}⁻¹
where ti is a shift parameter. Good performance of a neural network depends on how well it was trained. The training is done by supervised learning, i.e. iteratively, using a training set with known output values. The most popular technique is called a backpropagation network. In the nth iteration the weights are changed as:
Δwij(n) = η δj yi + α Δwij(n − 1)
where η is the learning rate or step size parameter and α is the momentum term. If the learning rate is too low, the convergence of the weights to an optimal value is too slow. On the other hand, if the learning rate is too high, the system may oscillate. The momentum term is used to damp the oscillation. Function δj is the error for node j. If node j is in the output layer and has the desired output dj, then the error is:
δj = yj (1 − yj)(dj − yj)
If node j is in the hidden layer, then
δj = yj (1 − yj) Σk δk wjk
where k indexes the neurons in the next layer.
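A minimal sketch of one backpropagation update for a network with one hidden layer and sigmoid transfer functions, using the error terms δj defined above (layer sizes, learning rate and data are illustrative assumptions; the momentum term is omitted):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(size=3)                            # input layer (3 neurons)
d = np.array([1.0, 0.0])                          # desired outputs (2 output neurons)

W1 = rng.normal(scale=0.5, size=(4, 3))           # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(2, 4))           # hidden -> output weights
eta = 0.5                                         # learning rate

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# forward pass
h = sigmoid(W1 @ x)                               # hidden layer outputs
y = sigmoid(W2 @ h)                               # output layer outputs

# error terms
delta_out = y * (1 - y) * (d - y)                 # output layer
delta_hid = h * (1 - h) * (W2.T @ delta_out)      # hidden layer

# weight changes
W2 += eta * np.outer(delta_out, h)
W1 += eta * np.outer(delta_hid, x)
```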
neuron [MISC] + neural network b
b Newman-Keuls' test [TEST] -+ hypothesis test
b Newton-Raphson optimization [OPTIM] Gradient optimization using both the gradient vector (first derivative) and the Hessian matrix (second derivative). The scalar valued function f(p), where p is the vector of parameters, is approximated at p_0 by the Taylor expansion:
f(p) ≈ z(p) = f(p_0) + (p - p_0)^T g(p_0) + 0.5 (p - p_0)^T H(p_0)(p - p_0)
where g is the gradient vector and H is the Hessian matrix evaluated at p_0. In each step i of the iterative procedure the minimum of f(p) is approximated by the minimum of z(p):
p_{i+1} = p_i - s_i H^{-1}(p_i) g(p_i)
where the direction is the ratio of the first and second derivatives H^{-1} g and the step size s_i is determined by a linear search optimization. This procedure converges rapidly when p_i is close to the minimum. Its disadvantage is that the Hessian matrix and its inverse must be evaluated in each step. Also, if the parameter vector is not close to the minimum, H may not be positive definite and convergence cannot be reached. When the function f(p) is of a special quadratic form, as in the nonlinear least squares regression problem, the Gauss-Newton optimization offers a simplified computation.
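A minimal numerical sketch of the iteration p_{i+1} = p_i - s_i H^{-1}(p_i) g(p_i) is given below, with a fixed step size s_i = 1 and a simple quadratic test function; both are assumptions made for illustration, and the linear search for s_i is omitted.

```python
import numpy as np

def newton_raphson(grad, hess, p0, tol=1e-8, max_iter=50):
    p = np.asarray(p0, dtype=float)
    for _ in range(max_iter):
        g = grad(p)
        H = hess(p)
        step = np.linalg.solve(H, g)      # H^-1 g without forming the inverse explicitly
        p = p - step                      # fixed step size s_i = 1
        if np.linalg.norm(step) < tol:    # convergence criterion on the step length
            break
    return p

# usage: minimize f(p) = (p1 - 1)^2 + 2 (p2 + 3)^2
grad = lambda p: np.array([2 * (p[0] - 1), 4 * (p[1] + 3)])
hess = lambda p: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newton_raphson(grad, hess, p0=[10.0, 10.0]))   # -> approximately [1, -3]
```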
b nilpotent matrix [ALGE] → matrix
b node [MISC] → graph theory
b noise variable [PREP] → variable
b nominal scale [PREP] → scale
b no-model error rate [CLAS] → classification (o error rate)
b noncentral composite design [EXDE] → design (o composite design)
b nonconforming item [QUAL] → lot
b nonconformity [QUAL] → lot
b non-error rate (NER) [CLAS] → classification (o error rate)
b non-hierarchical clustering [CLUS] (: partitioning clustering) Clustering that produces a division of objects into a certain number of clusters. This clustering results in one single partition, as opposed to hierarchical clustering, which produces a hierarchy of partitions. The number of clusters is either given a priori or determined by the clustering method. The clusters obtained are often represented by their centrotypes or centroids. Non-hierarchical clustering methods can be grouped as: density clustering, graph theoretical clustering, and optimization clustering.
density clustering Non-hierarchical clustering that seeks regions of high density (modes) in the data to define clusters. In contrast to most optimization clustering methods, density clustering produces clusters of a wide range of shapes, not only spherical ones. One of the most popular methods is Wishart's mode analysis, which is closely related to single linkage clustering. For each object mode analysis calculates its distances from all other objects, then selects and averages the distances from its K nearest neighbors. Such an average distance for an object from a dense area is small, while outliers that have few close neighbors have a large average distance.
Another method based on the nearest neighbor density estimation is the Jarvis-Patrick clustering. This is based on an estimate of the density around an object by counting the number of neighbors K that are within a preset radius R of an object. Two objects are assigned to the same cluster if they are on each other's nearest neighbor list (length fixed a priori) and if they have at least a certain number (fixed a priori) of common nearest neighbors. The Coomans-Massart clustering method, also called CLUPOT clustering, uses a multivariate Gaussian kernel in estimating the potential function. The object with the highest cumulative potential value is selected as the center of the first cluster. Next, all members of this first cluster are selected by exploring the neighborhood of the cluster center. After the first cluster and its members are defined and set aside, a new object with the highest cumulative potential is selected to be the center of the next cluster and members of this cluster are selected. The procedure continues until all objects are clustered.
graph theoretical clustering Non-hierarchical clustering that views objects (variables) as nodes of a graph and applies graph theory to obtain clusters of those nodes. The best known such method is the minimal spanning tree clustering. The tree is built stepwise such that, in each step, the link with the smallest distance is added that does not form a cycle in the path. This process is the same as the single linkage algorithm. The final partition with a given number of clusters is obtained by minimizing the distances associated with the links, i.e. cutting the longest links.
optimization clustering Non-hierarchical clustering that seeks a partition of objects into G clusters optimizing a predefined criterion. These methods, also called hill climbing clustering, are based on a relocation algorithm. They differ from each other in the optimization criterion. For a given partition the total scatter matrix T can be partitioned into a within-group scatter matrix W and a between-group scatter matrix B:
T = W + B
where W is the sum of all within-group scatter matrices: W = Σ_g W_g. The eigenvalues of B W^{-1} are λ_j, j = 1, M, where M = min[p, G]. For univariate data the optimal partition minimizes W and maximizes B. For multivariate data a similar optimization is achieved by minimizing or maximizing one of the following criteria:
o error sum of squares
The most popular criterion to minimize is the error sum of squares, i.e. the sum of squared distances between each object and its own cluster centroid or centrotype. This criterion is equivalent to minimizing the trace of W:
min[tr(W)]
Methods belonging to this group differ from each other in their cluster representation, in how they find the initial partition and in how they reach the optimum. The final partition is obtained in an iterative procedure that relocates an object, i.e. puts it into another cluster if that cluster's centrotype or centroid is closer. Methods in which clusters are represented by centrotypes are called nearest centrotype sorting methods, e.g. K-median clustering and MASLOC. G objects are selected from a data set to be the centrotypes of G clusters such that the sum of distances between an object and its centrotype is a minimum. Leader clustering selects the centrotypes iteratively as objects which lie at a distance greater than some preset threshold value from the existing centrotypes. Its severe drawback is the strong dependence on the order of the objects. Methods representing clusters by their centroids are called nearest centroid sorting methods. K-means clustering (also called MacQueen clustering) selects K centroids (randomly or defined by the first K objects) and recalculates them each time after an object is relocated. Forgy clustering is similar to the K-means method, except that it updates the cluster centroids only after all objects have been checked for potential relocation (a small relocation sketch in this style is given after the list of criteria below). Jancey clustering is similar to Forgy clustering, except that the centroids are updated by reflecting the old centroids through the new centroids of the clusters. K-Weber clustering is another variation of the K-means method. Ball and Hall clustering takes the overall centroid as the first cluster centroid. In each step the objects that lie at a distance greater than some preset threshold value from the existing centroids are selected as additional initial centroids. Once G cluster centroids have been obtained, the objects are assigned to the cluster of the closest centroid. ISODATA clustering is the most elaborate of the nearest centroid clustering methods. It obtains the final partition not only by relocation of objects, but also by lumping and splitting clusters according to several user-defined threshold parameters.
o cluster radius
K-center clustering is a nearest centrotype sorting method that minimizes the cluster radii, i.e. minimizes the maximum distance between an object and its centrotype. In the covering clustering method the maximum cluster radius is fixed and the goal is to minimize the number of clusters under the radius constraint.
o cluster diameter
Hansen-Delattre clustering minimizes the maximum cluster diameter
o intra-cluster distances
Schrader clustering minimizes the sum of intra-cluster distances
o determinant of W
Friedman-Rubin clustering minimizes the determinant of the within-group scatter matrix
min[|W|]
It is equivalent to maximizing |T|/|W|, or to minimizing the Wilks' lambda statistic
min[|W|/|T|]
o trace of B W^{-1}
Another criterion suggested by Friedman and Rubin is to maximize Hotelling's trace
max[tr(B W^{-1})] = max[Σ_j λ_j]
o largest eigenvalue of B W^{-1}
Maximize Roy's greatest root criterion
max[λ_1]
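The following minimal sketch illustrates nearest centroid sorting in the Forgy style (centroids updated only after a full pass over the objects) and reports the error sum of squares criterion tr(W). The initialization from the first G objects and all names are assumptions made for illustration.

```python
import numpy as np

def forgy_clustering(X, G, max_iter=100):
    X = np.asarray(X, dtype=float)
    centroids = X[:G].copy()                 # initial centroids: the first G objects
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # assign every object to the cluster of the closest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Forgy update: recompute all centroids only after the full pass;
        # K-means (MacQueen) would update after every single relocation instead
        new_centroids = np.array([X[labels == g].mean(axis=0) if np.any(labels == g)
                                  else centroids[g] for g in range(G)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # error sum of squares: the criterion min[tr(W)]
    ess = sum(((X[labels == g] - centroids[g]) ** 2).sum() for g in range(G))
    return labels, centroids, ess

# usage: labels, centroids, ess = forgy_clustering(data_matrix, G=3)
```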
b nonlinear estimator [ESTIM] → estimator
b nonlinear iterative partial least squares (NIPALS) [ALGE] → eigenanalysis (o power method)
b nonlinear mapping (NLM) [MULT] Mapping method, similar to multidimensional scaling, that calculates a two- or three-dimensional configuration of high-dimensional objects. It tries to preserve relative distances between points in the low-dimensional display space so that they are as similar as possible to the distances in the original high-dimensional space. After calculating the distance matrix in the original space, an initial (usually random) configuration is chosen in the display space. A mapping error is calculated from the two distance matrices (calculated in the original and in the display spaces). Coordinates of objects in the display space are iteratively modified so that the mapping error is minimized. Any monotone distance measure can be used for which the derivative of the mapping error exists, e.g. Euclidean, Manhattan, or Minkowski distances. In order to avoid local optima it has been suggested that the optimization be started from several random configurations. Principal components projection can also be used as a starting configuration. There are several mapping error formulas in the literature; the most popular one is:
E = [1 / Σ_{i<j} d_ij] Σ_{i<j} (d_ij - d'_ij)^2 / d_ij
where d'_ij are distances in the display space, while d_ij are distances in the original space. The various algorithms also differ in the minimization procedure; the most popular is the steepest descent method. There are some drawbacks to NLM: new test points cannot be placed on a previously calculated map; computation is time consuming; the solution is often a local minimum.
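A minimal sketch of nonlinear mapping with the mapping error defined above, minimized by steepest descent on the display coordinates, is shown below. The random starting configuration, the step size and the iteration count are assumptions made for illustration.

```python
import numpy as np

def nlm(X, n_dim=2, n_iter=500, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    iu = np.triu_indices(n, k=1)
    dfull = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # original distances d_ij
    np.fill_diagonal(dfull, 1.0)                                    # avoid division by zero
    d = dfull[iu]
    c = d.sum()
    Y = rng.normal(size=(n, n_dim))                                 # random starting configuration
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        dp = np.linalg.norm(diff, axis=2)                           # display distances d'_ij
        np.fill_diagonal(dp, 1.0)
        # gradient of E = (1/sum d) * sum (d - d')^2 / d  with respect to the display coordinates
        w = (dfull - dp) / (dfull * dp)
        np.fill_diagonal(w, 0.0)
        grad = -2.0 / c * (w[:, :, None] * diff).sum(axis=1)
        Y -= step * grad                                            # steepest descent step
    dp = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)[iu]
    E = ((d - dp) ** 2 / d).sum() / c                               # final mapping error
    return Y, E
```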
b nonlinear partial least squares regression (NLPLS) [REGR] → partial least squares regression
b nonlinear programming [OPTIM] → optimization
b nonlinear regression model [REGR] → regression model
b non-metric data [PREP] → data
b non-metric multidimensional scaling (NMDS) [MULT] → multidimensional scaling
b non-metric scale [PREP] → scale
b nonparametric classification [CLAS] → parametric classification
b nonparametric estimator [ESTIM] → estimator
b nonprobabilistic classification [CLAS] → probabilistic classification
b nonsingular matrix [ALGE] → matrix
b norm [ALGE] Measure of the size of a vector or a matrix, denoted ||·||. It is a function that assigns a scalar to a vector or a matrix.
matrix norm Measure of the size of a matrix X. The most frequently used matrix norms are:
o Frobenius norm
||X||_F = [Σ_i Σ_j x_ij^2]^(1/2)
o p-norm
||X||_p = sup_{v≠0} ||X v||_p / ||v||_p
where v is a nonzero vector. When p = 2, this norm is called the two-norm, and is equal to the largest singular value of X.
vector norm Measure of the size of a vector x. The most frequently used vector norm is the p-norm:
||x||_p = [Σ_j |x_j|^p]^(1/p)
When p = 2 this norm is called the Euclidean norm; when p = ∞ the norm reduces to
||x||_∞ = max_j |x_j|
b normal distribution [PROB] → distribution
b normal equation [ALGE] A set of simultaneous equations obtained in calculating a least squares estimator. For example, the normal equations for the coefficient and intercept of a single regression y = a + b x are:
Σ_i y_i = n a + b Σ_i x_i
Σ_i x_i y_i = a Σ_i x_i + b Σ_i x_i^2
which give
b = [n Σ_i x_i y_i - (Σ_i x_i)(Σ_i y_i)] / [n Σ_i x_i^2 - (Σ_i x_i)^2]
a = [Σ_i y_i - b Σ_i x_i] / n
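A small numeric check of these expressions (with invented data) is shown below; the matrix form (X^T X) b = X^T y gives the same solution.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# solve the two normal equations for a and b
b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
a = (y.sum() - b * x.sum()) / n

# equivalent matrix form: (X^T X) beta = X^T y
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose([a, b], beta)
```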
b normal kernel [ESTIM] → kernel
b normal probability plot [GRAPH] → quantile plot
b normal process [TIME] → stochastic process
b normal residual plot [GRAPH] → quantile plot
b normality assumption [PROB] The assumption underlying many statistical models and tests, that requires a variable, e.g. the model error, to follow a normal distribution. For example, in ordinary least squares regression and discriminant analysis the error is assumed to be normally distributed with zero mean and equal variance, and to be uncorrelated. The most often used techniques for verifying the normality assumption are D'Agostino's test and the normal probability plot.
b normalized Hamming coefficient [GEOM] → distance (o binary data)
b notched box plot [GRAPH] → box plot
b null hypothesis [TEST] → hypothesis testing
b null matrix [ALGE] → matrix
b number of factors [FACT] → rank analysis
b numerical data [PREP] → data
b numerical error [OPTIM] Error due to numerical approximations in calculating a number from mathematical functions or in the representation of the number itself. When iterative procedures are used, error propagation can also occur, i.e. at a given stage of the calculation part of the error derives from the error at a previous stage. The propagation of experimental and numerical errors can be examined by repeating the calculation with slightly perturbed data. Several types of numerical errors can limit the accuracy of the results:
discretization error Error due to substituting integrals with finite sums.
roundoff error Error due to the finite representation of the digits of a number x. Digital computers store the digits of a number as a finite sequence of discrete physical quantities. For example, in floating-point notation, the number is represented by a finite number t of nonzero digits before the decimal point (mantissa) and a finite number e which indicates the position of the decimal point with respect to the mantissa (exponent):
x = t · 10^e
The numbers t and e (together with the basis of the exponent, 10 or 2) determine the subset M of real numbers R, called machine numbers, M ⊂ R, which can be represented exactly in a given machine. Thus, any number x ∉ M is approximated by a machine number rd(x) ∈ M satisfying
|x - rd(x)| ≤ |x - r|   for all   r ∈ M
The quantity |x - rd(x)| is the round-off error. This type of error can obviously occur not only in representing the final results, but also during intermediate calculations.
truncation error Error due to substituting infinite series with a finite number of terms of the series.
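A small illustration (not from the book) of round-off and truncation error in Python:

```python
import math

# round-off: 0.1 has no exact binary floating-point representation
print(0.1 + 0.2 == 0.3)            # False
print(abs((0.1 + 0.2) - 0.3))      # approx. 5.6e-17, the round-off error

# truncation: approximate exp(1) = sum 1/m! by the first k terms of the series
def exp1_truncated(k):
    return sum(1.0 / math.factorial(m) for m in range(k))

for k in (3, 6, 10):
    print(k, abs(math.e - exp1_truncated(k)))   # the truncation error shrinks with k
```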
b numerical optimization [OPTIM] : optimization
b numerical taxonomy [CLUS] : cluster analysis
0 b object [PREP] (; data point, item, observation, unit) The basic unit in data analysis, e.g. an individual, an animal, a molecule, a food sample. Each object is described by one or more measurements, called data. A collection of objects constitutes a data set. Each object corresponds to a row of the data matrix. The ratio between the number of objects n and the number of variables p is called the object-variable ratio n/p. If this ratio is smaller than one the data set is an underdetermined system, while a data set with a ratio higher than one is a well-determined system. In the case of an underdetermined system one must apply biased methods to mitigate the high variance of estimates calculated from such a data set.
b objective function [OPTIM] → optimization
b object-variable interaction [FACT] → correspondence factor analysis
b object-variable ratio [PREP] → object
b oblimax rotation [FACT] → factor rotation
b oblimin rotation [FACT] → factor rotation
b oblique rotation [FACT] → factor rotation
b observation [PREP] : object
b off-diagonal element [ALGE] → matrix
b off-line quality control [QUAL] → Taguchi method
b offset [REGR] → regression coefficient
b one-sided test [TEST] + hypothesis testing
one-variable-at-a-time (OVAT) [MULT] + multivariate b
b one-way analysis of variance [ANOVA] The simplest analysis of variance model, containing one single effect. The total variance has two components: variance due to the effect, and a random component. The assumption is that there are I normal populations (where I is the number of levels the effect can assume) with different means but common variances, from which samples of size K are drawn. In this model the total variance of the data is partitioned into two parts: the sum of squares of differences between treatment means and the grand mean (often called the between sum of squares); and the sum of squares of differences between observations and their own treatment means (often called the within sum of squares or error sum of squares):
Σ_i Σ_k (y_ik - ȳ..)^2 = K Σ_i (ȳ_i. - ȳ..)^2 + Σ_i Σ_k (y_ik - ȳ_i.)^2      i = 1, I    k = 1, K    n = I K
The first term measures the effect of treatments, the second term is due to random error. The total sum of squares has n - 1 degrees of freedom, the between sum of squares has I - 1 degrees of freedom and the within sum of squares has n - I degrees of freedom. The one-way ANOVA model is used, for example, in regression diagnostics to partition the total variance into variance due to the regression model and variance due to random error. In this case the number of levels is substituted by the number of parameters of the regression model.
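A minimal sketch of this decomposition for invented data with I = 3 levels and K = 4 replicates:

```python
import numpy as np

y = np.array([[5.1, 4.9, 5.3, 5.0],     # level 1
              [6.2, 6.0, 6.4, 6.1],     # level 2
              [4.2, 4.4, 4.1, 4.5]])    # level 3
I, K = y.shape
n = I * K
grand_mean = y.mean()
treatment_means = y.mean(axis=1)

ss_between = K * ((treatment_means - grand_mean) ** 2).sum()   # I - 1 degrees of freedom
ss_within = ((y - treatment_means[:, None]) ** 2).sum()        # n - I degrees of freedom
ss_total = ((y - grand_mean) ** 2).sum()                       # n - 1 degrees of freedom
assert np.isclose(ss_total, ss_between + ss_within)

F = (ss_between / (I - 1)) / (ss_within / (n - I))             # F statistic for the effect
```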
on-line quality control [QUAL] + Taguchi method b
b open data [PREP] + data (o closed data)
operating characteristic [TEST] + hypothesis testing b
b operating characteristic curve (OCC) [QUAL] Plot of the type II error against the process quality. It indicates the probability of accepting a lot if it contains a certain percentage of defective items estimated from a sample. The probability of accepting the lot with no defective items is one. As the number of defective items increases the probability of acceptance decreases asymptotically to zero. The shape of the OC curve depends on the sampling plan and on the distribution assumed.
b operations research (OR) [MISC] Application of mathematical tools based on objective and quantitative criteria, where necessary with given constraints, to system organization, decision making and system control in order to provide the best solutions. Techniques most commonly used in OR are: graph theory, game theory, linear programming, optimization, simulation.
optimal design [EXDE] + design b
b optimization [OPTIM] (: numerical optimization) Finding the optimal value (minimum or maximum) of a numerical function f, called the objective function, with respect to a set of parameters p, f(p_1, ..., p_p). If the values that the parameters can take on are constrained, the procedure is called a constrained optimization.
(Figure: a one-dimensional function of p with a global maximum, a local maximum, a local minimum and a global minimum.)
At stationary points of a function, i.e. at minimum (global or local), maximum (global or local) or saddle points the first (partial) derivatives are zero:
∂f/∂p_1 = ... = ∂f/∂p_p = 0
At a minimum the second derivative matrix (the Hessian matrix) is positive definite: x^T H(p) x > 0 for any nonzero vector x.
Optimization procedures are iterative, i.e. in each step i the solution, the estimated values of the parameters, represents an improved approximation of the true parameter. Whether a procedure converges to a global or local optimum depends on the initial guess about the parameter values. Various convergence criteria can be used to decide when a procedure reaches the optimum. In each iteration step i a new set of parameters is calculated as
p_{i+1} = p_i + s_i d_i
where s_i is the step size and d_i (a p-dimensional vector) specifies the step direction to be taken in moving from p_i to p_{i+1}. The optimization methods differ from each other in the way they determine d_i and s_i. In direct search optimization the choice of the step direction and step size depends only on the values of the function, while in gradient optimization they are calculated from derivatives. The problem of optimization is often referred to as mathematical programming. If both the objective function and the constraints are linear functions of the unknown parameters, the optimization is called linear programming, otherwise it is called nonlinear programming. An important problem in optimization, one which must be dealt with, is numerical error.
b optimization clustering [CLUS] → non-hierarchical clustering
b order of a matrix [ALGE] → matrix
b order of a model [MODEL] → model
b order statistic [DESC] → rank
b ordinal scale [PREP]
→ scale
b ordinary least squares regression (OLS) [REGR] (: least squares regression, linear least squares regression, multiple least squares regression, multivariate least squares regression) The most popular regression method in which the model is linear in both the parameters and the predictors
y = X b + e
and the regression coefficients b are calculated with the least squares estimator solving the normal equations:
b̂ = (X^T X)^{-1} X^T y
These coefficients are estimated by minimizing the residual sum of squares or, equivalently, by maximizing the correlation between the linear combination of the predictors and the response:
max_b [corr^2(y, X b)]
OLS is BLUE, i.e. it is an unbiased linear estimator that has the smallest variance among all unbiased linear estimators if the normality assumption on the residuals holds and if the true relationship is linear. If the normality assumption is violated the least squares estimator is not necessarily the optimal unbiased one. Generalized least squares regression can be applied to deal with heteroscedasticity and correlated errors. The bias and variance of the estimated regression coefficients are:
bias(b̂) = 0        var(b̂) = s^2 Σ_m 1/λ_m
where s is the residual standard deviation and λ_m is the mth eigenvalue of (X^T X). This method calculates an unbiased model; however, it is suitable only for well-determined systems and is very sensitive to outliers. The estimated response and any statistic related to it is scale invariant. It is a linear estimator, therefore the cross-validated residuals can be calculated as:
e_i(cv) = e_i / (1 - h_ii)
where h_ii are the diagonal elements of the hat matrix.
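A minimal sketch of OLS through the normal equations, including the hat matrix and the cross-validated residuals e_i/(1 - h_ii); the simulated data are an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # column of ones for the intercept
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)        # least squares coefficients
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
e = y - X @ b                                # ordinary residuals
h = np.diag(H)
e_cv = e / (1 - h)                           # cross-validated (leave-one-out) residuals
press = (e_cv ** 2).sum()                    # predictive residual sum of squares
```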
b ordinary residual [REGR] → residual
b ordinary residual plot [GRAPH] → scatter plot (o residual plot)
b orthoblique rotation [FACT] → factor rotation
b orthogonal design [EXDE] → design
b orthogonal matrix [ALGE] → matrix
b orthogonal matrix transformation [ALGE] (: orthogonalization) Transformation of a square matrix X into an upper triangular matrix R via multiplication with an orthogonal matrix Q:
R = Q^T X
The above transformation can also be viewed as a Q-R decomposition of X:
X = Q R
Each method expresses the columns of X in terms of a new orthonormal basis, contained in the columns of Q. Orthogonal matrix transformations, for example, afford an easier solution to the least squares regression problem and provide a numerically more stable solution. The orthogonal matrix Q can be calculated in several ways.
Givens transformation Orthogonal matrix transformation that brings X to an upper triangular form through a series of orthogonal transformations, i.e. the transformation matrix Q is a product of orthogonal matrices G_ij. In contrast to the Householder transformation, each step sets to zero only one sub-diagonal element, i.e. G_ij operates on column j of X, annihilating element i in that column. The matrix G_ij differs from the identity matrix in only four elements:
G_ii = G_jj = c        G_ij = -G_ji = s
with c = cos θ and s = sin θ chosen so that element i of column j becomes zero. Although the Householder transformation is preferred for general use, because fewer computations are required and slightly better results are obtained, the Givens transformation is recommended for large, very sparse (many elements are zero) X matrices.
Gram-Schmidt transformation Orthogonal matrix transformation that brings X to an upper triangular form. In contrast to the Householder transformation, which does not calculate Q explicitly, but only the matrices H_j, here in each step one calculates a new column of Q, orthogonal to the previous ones.
Householder transformation The most important orthogonal matrix transformation that brings X to an upper triangular form through a series of orthogonal transformations, i.e. the transformation matrix Q is a product of p orthogonal matrices H_j. The matrix H_j is symmetric, orthogonal and operates on column j of X, annihilating all sub-diagonal elements:
H_j = I - 2 v v^T / (v^T v)
This transformation is used in many algorithms, e.g. in bidiagonalization or tridiagonalization of a matrix before performing an eigenanalysis.
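A minimal sketch of a Householder Q-R decomposition built from the reflectors H_j = I - 2vv^T/(v^T v); pivoting and rank checks are omitted, and the test matrix is an assumption made for illustration.

```python
import numpy as np

def householder_qr(X):
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    R = X.copy()
    Q = np.eye(n)
    for j in range(min(n - 1, p)):
        v = R[j:, j].copy()
        v[0] += np.sign(v[0] if v[0] != 0 else 1.0) * np.linalg.norm(v)
        if np.allclose(v, 0):
            continue
        Hj = np.eye(n)
        Hj[j:, j:] -= 2.0 * np.outer(v, v) / (v @ v)
        R = Hj @ R            # zero the sub-diagonal part of column j
        Q = Q @ Hj            # accumulate the orthogonal factor
    return Q, R

# usage: check Q R = X and the orthogonality of Q
X = np.random.default_rng(1).normal(size=(5, 3))
Q, R = householder_qr(X)
assert np.allclose(Q @ R, X) and np.allclose(Q.T @ Q, np.eye(5))
```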
b orthogonal rotation [FACT] → factor rotation
b orthogonal vector transformation [ALGE] Let G be an orthogonal matrix of rank p and n_j an orthonormal basis of p vectors. Then
gj = Gnj
forms a new orthonormal basis. Any vector x can be represented in terms of both the old and the new basis as
x = Σ_j x_j n_j = Σ_j x'_j g_j
where x_j indicates the old coordinates and x'_j the new transformed coordinates.
b orthogonal vectors [ALGE] → vector
orthogonalization [ALGE]
: orthogonal matrix transformation
orthomax rotation [FACT] + factor rotation b
b orthonormal basis [ALGE] A set of p vectors that are orthogonal vectors and have their norm equal to one:
n_j^T n_k = 0   if   j ≠ k    and    ||n_j|| = 1    j = 1, p
Without imposing the orthogonality and normality restrictions, any set of linearly independent vectors n_1, ..., n_p, such that each vector in the space can be written as a linear combination of n_1, ..., n_p, is simply called the basis. The minimum number of vectors needed to form a basis is equal to the dimension of the vector space. A p-dimensional vector x can be represented as
x = Σ_j x_j n_j
where x_j is called the jth coordinate. The orthonormal basis is used, for example, in orthogonal vector transformation.
b outlier [DESC] Observation that is separated from the bulk of the data. Outliers in the sample may distort statistics calculated from such a sample. Outliers are often observations from a population different from the sampled one. Their effect on an estimator is measured by the breakdown point and the influence curve. Outliers must be detected and tested to determine whether they should be discarded before modeling. Influence analysis detects outliers in the predictor space, called leverage points. Robust estimators are often used to mitigate the distorting effect of potential outliers.
b output layer [MISC] → neural network
b outranking method [QUAL] → multicriteria decision making
b overfitting [MODEL] → model fitting
b overlay plot [QUAL] → multicriteria decision making
b p-chart [QUAL] → control chart (o attribute control chart)
b p-norm [ALGE] → norm
parameter design [QUAL] Design for investigating the response of a process which is independent of or at least insensitive to different sources of variation. This design is different from the experimental design, in which different sources of variation are investigated. The goal of the parameter design is to model a response by making it independent from different sources of variation rather than controlling those sources. b
b parameter estimation [ESTIM] → estimation
b parametric classification [CLAS] Classification where the classification rule is expressed as a function of a relatively small number of parameters of the class density functions, which are usually assumed to be normal (e.g. DASCO, LDA, QDA, RDA, SIMCA, UNEQ). Its opposite is nonparametric classification, where no particular form of class density function is assumed and it is estimated directly from the training set (e.g. ALLOC, CART, KNN).
b Pareto analysis [QUAL] → quality cost
b Pareto chart [QUAL] → quality cost
b Pareto distribution [PROB] → distribution
b Pareto optimality criterion [QUAL] → multicriteria decision making
b Parks distance [GEOM] → distance (o mixed type data)
b parsimonious model [MODEL] → model
b partial confounding [EXDE] → confounding
b partial correlation [DESC] → correlation
b partial factorial design [EXDE] → design
b partial least squares regression (PLS) [REGR] Biased regression method that relates a set of predictor variables X to a set of response variables Y. A least squares regression is performed on a set of uncorrelated, so called latent variables T that are standardized linear combinations of the original predictor variables. PLS models the response(s) as
Y = T B Q^T + E
where B is a diagonal matrix containing the least squares coefficients of the latent variables, Q contains the weights of the responses and E is the error matrix.
The latent variables are calculated one at a time, maximizing their covariance:
max [corr^2(u_m, t_m) var(u_m) var(t_m)]
where
t_m = X w_m    and    u_m = Y q_m        m = 1, M
The connections among the various matrices of the PLS algorithm are shown in the following scheme. A PLS component m is calculated as:
1. Initialize:           u_m = y_1 (a column of Y)
2. X weight:             w_m = X^T u_m        scaled to length 1
3. X latent variable:    t_m = X w_m
4. Y weight:             q_m = Y^T t_m        scaled to length 1
5. U latent variable:    u_m* = Y q_m
6. Check convergence:    if ||u_m* - u_m|| ≤ ε then 7., else u_m = u_m* and 2.
7. Inner relation:       b_m = u_m^T t_m / (t_m^T t_m)
8. X loadings:           p_m = X^T t_m / (t_m^T t_m)
9. X residuals:          X = X - t_m p_m^T
10. Y residuals:         Y = Y - b_m t_m q_m^T
The bias-variance trade-off is controlled by the number of latent variables M: the more latent variables that are included, the larger is the variance and the smaller is the bias of the estimated regression coefficients. As the number of latent variables increases (max[M] = p), the PLS model converges to the OLS model. The optimal number of components must be determined by a model selection criterion, e.g. cross-validation.
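A minimal sketch of the scheme above for a single response (PLS1) is given below; with one response the inner iteration (steps 2-6) converges in a single pass and the Y weight coincides with the inner relation coefficient. The simulated data and all names are assumptions made for illustration.

```python
import numpy as np

def pls1(X, y, M):
    X = np.asarray(X, dtype=float).copy()
    y = np.asarray(y, dtype=float).copy()
    W, T, P, Q = [], [], [], []
    for _ in range(M):
        w = X.T @ y                      # 2. X weight from u = y
        w /= np.linalg.norm(w)           #    scaled to length 1
        t = X @ w                        # 3. X latent variable
        q = (y @ t) / (t @ t)            # 4./7. Y weight = inner relation coefficient here
        p = X.T @ t / (t @ t)            # 8. X loadings
        X -= np.outer(t, p)              # 9. X residuals
        y -= q * t                       # 10. Y residuals
        W.append(w); T.append(t); P.append(p); Q.append(q)
    return np.array(W), np.array(T), np.array(P), np.array(Q)

# usage: the optimal number of components M would be chosen by cross-validation
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, 0.5, 0.0, -1.0, 2.0]) + rng.normal(scale=0.1, size=20)
W, T, P, Q = pls1(X, y, M=2)
```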
A modification of the PLS regression that permits the calculation of the latent variables as a linear combination of a subset of the predictor variables is called intermediate least squares regression (ILS). A special case of ILS is the forward selection method, i.e. when a latent variable consists of only one predictor variable. An extension of the two-block PLS is the multiblock PLS model in which several predictor and response blocks are connected by regressing their latent variables on each other according to a prespecified path. There are several nonlinear extensions of the linear PLS model nonlinearizing the inner relationship, i.e. assuming a nonlinear function among the latent variables:
u_m = f(t_m) + e
The quadratic partial least squares regression (QPLS) fits a second-order polynomial: f(t_m) = a_0 + a_1 t_m + a_2 t_m^2. The nonlinear partial least squares regression (NLPLS) applies the smoother to estimate the above nonlinear functions. The spline partial least squares regression (SPLS) uses a nonadaptive bivariate regression spline to approximate f. PLS is used extensively in the context of experimental design. The generating optimal linear PLS estimation (GOLPE) method performs a variable selection on the basis of D-optimal design in the loading space to obtain the best predictive model. The computer aided response surface optimization (CARSO) method uses PLS to fit a linear or quadratic response surface.
b partial pivoting [ALGE] + Gaussian elimination
b
partitioning clustering [CLUS]
: non-hierarchical clustering b partitioning of a matrix [ALGE] + matrix operation b Pascal distribution [PROB] + distribution
b path [MISC] + graph theory
b path analysis [MULT] Method for solving a causal model, i.e. studying patterns of causation among a set of variables, in which certain variables are viewed as a cause, others as an effect.
This is a special case of LISREL. The analysis is based on the following assumptions: - relationships are linear and additive; - all error terms are uncorrelated with each other; - models are recursive, only one-way causal flow is allowed; - endogenous variables are measured on an interval or ratio scale; - manifest variables are measured without error; - the model is properly specified, with all causal relationships included. Each equation is treated separately, i.e. errors in estimating the parameters of one equation do not affect the estimation of parameters of other equations. The parameters of the model are usually estimated by least squares. Path coefficients, calculated from the regression coefficients, indicate the effect of a causal variable on an effect variable. A path diagram is often used to enhance the analysis graphically. b path diagram [MULT] Graphical representation of the relationship among variables in a causal model. The diagram shown below corresponds to the following equations:
η_1 = β_1 η_2 + γ_1 ξ_1 + ζ_1
η_2 = β_2 η_1 + ζ_2
x_1 = λ_1 ξ_1 + δ_1
x_2 = λ_2 ξ_1 + δ_2
y_1 = λ_3 η_1 + ε_1
y_2 = λ_4 η_2 + ε_2
y_3 = λ_5 η_2 + ε_3
(Figure: the corresponding path diagram linking ξ_1, η_1 and η_2 to the manifest variables x_1, x_2, y_1, y_2, y_3.)
The conventional notation is the following:
- x: manifest exogenous variable
- y: manifest endogenous variable
- ξ: latent exogenous variable
- η: latent endogenous variable
- β: effect of endogenous variable on endogenous variable
- γ: effect of exogenous variable on endogenous variable
b quadratic form [ALGE] A function Q(x) = x^T A x of a vector x, where A is a symmetric matrix; it is called the positive definite quadratic form if Q(x) > 0, and the semi-positive definite quadratic form if Q(x) ≥ 0. For example, the Mahalanobis distance is a quadratic form, where A is often the covariance matrix and x is the vector of the coordinate differences. Canonical analysis converts a quadratic form into a weighted sum of squares form without cross-product terms, called the canonical form. The procedure consists of centering and of rotating the axes in order to decorrelate them.
b quadratic partial least squares regression (QPLS) [REGR] → partial least squares regression
quadratic regression model [REGR] + regression model b
b quadratic score [CLAS] → classification
b qualitative data [PREP] → data
b qualitative variable [PREP] → variable
b quality characteristic [QUAL] → lot
b quality control (QC) [QUAL] (: statistical quality control) Statistical analysis of data collected to measure the quality characteristics of the product, compare them with established standards and take action to ensure the required product quality. The primary goal is the systematic reduction of variability in the quality characteristics. A process can be characterized by the operating characteristic curve and the process capability ratio, monitored by a control chart, investigated by parameter design. The quality of a lot can be estimated by acceptance sampling; the type I error is measured by the producer's risk. Defects of different severity are dealt with in a demerit system. Multicriteria decision making optimizes more than one quality criterion. The cost of producing conforming products is quantified by the quality cost. The Taguchi method is a novel approach to quality control.
b quality control chart [QUAL] : control chart
b quality cost [QUAL] Cost associated with producing, identifying, avoiding, or repairing products that do not conform with the standards. There are four categories of quality cost. The prevention cost is associated with efforts in design and manufacturing to prevent nonconformance. The appraisal cost is associated with measuring and evaluating products, components, and materials to ensure conformance. The internal failure cost is associated with nonconforming products, components, and materials discovered prior to their delivery to the customers. The external failure cost is associated with nonconforming products delivered to customers. Pareto analysis is a quality-cost analysis aiming at cost reduction by identifying quality costs by category, by product, or by type of defect. A Pareto distribution of percentage defective items is calculated and plotted as a bar plot, called a Pareto chart, e.g. by department, by machine, by operator, etc.
Conventionally, the bars are arranged in decreasing order. Usually, a few items on the left represent a substantial amount of the total.
quantal variable [PREP] + variable b
b quantile [DESC] (: fractile) A value within the range of a variable which divides the data set into two groups, such that the fraction of the observations specified by the quantile falls below and the complement fraction falls above. For example, the quantile Q(0.8) indicates a value of a variable for which 80% of the observations lie below and 20% lie above. The most often used quantiles are:
quartiles divide the range into quarters: Q(0.25) Q(0.5) Q(0.75). The jth quartile is the j(n + 1)/4th ordered value.
deciles divide the range into tenths: Q(0.1) Q(0.2) Q(0.3) Q(0.4) Q(0.5) Q(0.6) Q(0.7) Q(0.8) Q(0.9). The jth decile is the j(n + 1)/10th ordered value.
percentiles divide the range into hundredths: Q(0.01) Q(0.02) Q(0.03) ... Q(0.97) Q(0.98) Q(0.99). The jth percentile is the j(n + 1)/100th ordered value.
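A small helper (illustrative, not from the book) implementing the j(n + 1)-type rule quoted above, with linear interpolation between neighbouring ordered values:

```python
def quantile(data, f):
    """Quantile Q(f), 0 < f < 1, using the (n + 1) f ordered-value rule."""
    x = sorted(data)
    n = len(x)
    h = f * (n + 1)                      # position among the ordered values (1-based)
    k = int(h)
    if k < 1:
        return x[0]
    if k >= n:
        return x[-1]
    return x[k - 1] + (h - k) * (x[k] - x[k - 1])

x = [4.1, 2.3, 5.0, 3.3, 7.2, 6.8, 1.9, 4.4, 5.5]
q1, med, q3 = quantile(x, 0.25), quantile(x, 0.5), quantile(x, 0.75)
iqr = q3 - q1                            # interquartile range
```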
(Figure: a one-dimensional dot display marking the lower quartile Q1, the median Q2, the upper quartile Q3 and the interquartile range.)
The three quartiles are important in measuring scale and location:
- the first quartile Q1 = Q(0.25) is called the lower quartile
- the second quartile Q2 = Q(0.5) is the median, a robust measure of location
- the third quartile Q3 = Q(0.75) is called the upper quartile
The distance between the upper and the lower quartiles is called the interquartile range (Q3 - Q1), which indicates the spread of the bulk of the data. The half-interquartile range 0.5 (Q3 - Q1), also called the quartile deviation, is a robust measure of dispersion. Quantiles are used, for example, in quantile plots and in robust estimators.
b quantile function [PROB] → random variable
b quantile plot [GRAPH] Plot of the quantiles of a distribution against the corresponding fractions. The scale of the horizontal axis of the fraction ranges between 0 and 1. The scale of the vertical axis of the quantile is the scale of the variable. Except for the labeling of the horizontal axis, this plot is the same as x_i vs. i, when the x_i values are ordered.
(Figure: a quantile plot; the horizontal axis shows the fractions of data from 0.0 to 1.0.)
The plot of the quantiles of two distributions against each other is called the quantile-quantile plot. If both quantiles are from an empirical distribution, it is called an empirical quantile-quantile plot. If the number of observations is equal for the two empirical distributions, the empirical quantile-quantile plot is simply a plot of one sorted variable against the other. Data points lying along a straight line indicate distributions of similar shape. The intercept of the line indicates the difference in location, the slope shows the difference in scale. When the quantiles on the horizontal axis come from a theoretical distribution the plot is called a theoretical quantile-quantile plot, or more commonly, a probability plot. When the normal distribution is used, this plot is called a normal probability plot. This plot is very effective for testing the assumption of normality on a variable. Departure from a straight line indicates nonnormality; an opposite curvature at the ends indicates long or short tails; a convex or concave curvature is related to asymmetry. The normal residual plot is a theoretical quantile-quantile plot of the quantiles of residuals (obtained, for example, from a regression model and usually standardized or Studentized) against the corresponding quantiles from a normal distribution. Deviation from the normal straight line can result either from nonnormal residuals or from model misspecification, e.g. a nonlinear relationship described by a linear model. b
b quantile-quantile plot [GRAPH] → quantile plot
b quantitative data [PREP] → data
b quantitative variable [PREP] → variable
b quartile [DESC] → quantile
b quartile coefficient of skewness [DESC] → skewness
b quartile deviation [DESC] → dispersion
b quartimax rotation [FACT] → factor rotation
b quartimin rotation [FACT] → factor rotation
b quasi-Newton optimization [OPTIM] : variable metric optimization
b Quenouille's test [TEST] → hypothesis test
R
b R2 [MODEL] → goodness of fit
b R-analysis [MULT] → multivariate analysis
b R-chart [QUAL] → control chart (o variable control chart)
b R estimator [ESTIM] → estimator
b Rajski's coherence coefficient [GEOM] → distance (o ranked data)
b Rajski's distance [GEOM] → distance (o ranked data)
b random effect [ANOVA] → term in ANOVA
b random effect model [ANOVA] → term in ANOVA
b random factor [EXDE] → factor
b random number [PROB] → random variable
b random process [TIME] : stochastic process
b random sample [PROB] → population
b random sampling [PROB] → sampling
b random series [PROB] → simulation
b random variable [PREP] → variable
b random variable [PROB] Function that maps events into a set of values. The variate, denoted by X, is a generalization of the random variable. It also has probabilistic properties, but is defined without reference to a particular type of probabilistic experiment. A variate is the set of all random variables that follow a particular probabilistic law. A multivariate is a vector of variates. A random number x associated with a given variate is a number indicating a realization of a random variable belonging to that variate, e.g. X = x. The probability of a variate taking on a value less than or equal to x is denoted P[X ≤ x]. The set of all random numbers that a variate can take is called the range of the variate. The set of all values that P[...] can take is called the probability domain of the variate. There are several functions associated with a variate. The distribution function or cumulative distribution function (cdf) of a variate, denoted F(x), maps the range of the variate into the probability domain of the variate:
F(x) = P[X ≤ x] = α      0 ≤ α ≤ 1
F(x) is the probability that the variate takes a value less than or equal to x. The function F(x) is nondecreasing in x and takes the value one at the maximum of x. The inverse distribution function or quantile function of a variate, denoted G(α), maps the probability domain into the range of the variate:
G(α) = x = G(F(x))
G(α) is the random value x, such that the probability that the variate takes a value less than or equal to x is α.
The survival function or reliability function of a variate, denoted S(x), is the complement function to F(x), i.e. it is the probability that the variate takes a value greater than x:
S(x) = P[X > x] = 1 - α = 1 - F(x)
The inverse survival function of a variate, denoted Z(α), is the random number that is exceeded by the variate with probability α:
Z(α) = x = Z(S(x)) = G(1 - α)
The probability density function (pdf) or density function or frequency function of a variate, denoted f(x), is the first derivative of the distribution function F(x) with respect to x:
f(x) = dF(x)/dx
Its integral over the range x_L to x_U is equal to the probability that the variate takes a value in that range:
∫ from x_L to x_U of f(x) dx = P[x_L < X ≤ x_U]
If the variate is discrete, the function f(x) is called the probability function (pf). It is the probability that the variate takes on the value x:
f(x) = P[X = x]
The hazard function or failure function of a variate, denoted h(x), is the ratio of the probability density function to the survival function at x:
h(x) = f(x) / S(x)
Moments are also important functions of a variate. The following terminology may characterize a distribution:
asymmetric distribution Distribution without a central value μ such that f(x - μ) = f(μ - x).
asymptotic distribution The form of a distribution as the sample size tends to infinity. The form of the distribution is called asymptotic normal distribution when it approaches the normal distribution as the sample size tends to infinity.
bell-shaped distribution Distribution in which the density function has a shape that is similar to the contour of a bell. Bell-shaped distributions are symmetric. For example, the normal distribution is a bell-shaped distribution.
(Figure: a bell-shaped density curve.)
bimodal distribution Distribution with a density function having two modes, i.e. two maximums. It is often the result of mixing two unimodal distributions.
conditional distribution Distribution of a set of variates at fixed values of another set of variates. contaminated distribution Mixture of normal distributions with identical locations but different dispersions. Contamination, i.e. observations from a normal distribution with a larger dispersion than the base distribution, causes a heavy (long) tail in the base distribution. continuous distribution Distribution that is a function of one or more variates measured on a proportional or ratio scale. discrete distribution Distribution that is a function of one or more variates measured on a nominal or ordinal scale. empirical distribution Distribution that is a function of a variate describing a sample. J-shaped distribution Distribution in which the density function has a shape that is similar to a letter J or a reversed J. J-shaped distributions are skewed.
(Figure: a J-shaped density curve.)
joint distribution Distribution of two or more variates (especially used for two variates). marginal distribution Unconditional distribution of a single variate in a multivariate distribution. It does not depend on the values taken by the other variates.
multivariate distribution Joint distribution of several variates.
symmetric distribution Distribution with a central value μ, such that f(x - μ) = f(μ - x).
theoretical distribution Distribution that is a function of a variate describing a population.
U-shaped distribution Distribution in which the density function has a shape that is similar to a letter U.
(Figure: a U-shaped density curve.)
univariate distribution Distribution of a single variate. b random walk [TIME] Stochastic process defined as:
x(t) = x(t - 1) + a(t)
where a(t) is an i.i.d. random variable with zero mean and variance σ_a². The mean function of x(t) is μ(t) = 0. Because its autocovariance function
γ(t) = t σ_a²
increases linearly with time, the random walk is nonstationary. Its autocorrelation function is
ρ(t, t + k) = [t / (t + k)]^(1/2)
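A small simulation (not from the book) showing the nonstationarity: the variance of x(t) across replicates grows roughly linearly with t, as γ(t) = t σ_a² predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep, n_t, sigma_a = 2000, 100, 1.0
a = rng.normal(scale=sigma_a, size=(n_rep, n_t))   # i.i.d. innovations a(t)
x = a.cumsum(axis=1)                               # x(t) = x(t-1) + a(t), starting from x(0) = 0

for t in (10, 50, 100):
    print(t, x[:, t - 1].var())                    # close to t * sigma_a^2
```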
b randomization [EXDE] Random assignment of runs to treatments and random ordering of the runs. Randomization is either complete, i.e. runs are randomized along the whole design matrix, or runs are randomized within a block. Randomization is performed to eliminate unforeseen bias and to cancel correlation between adjacent runs. Randomization helps to ensure that the experimental error is an independently distributed random variable.
b randomized block design [ANOVA] Analysis of variance model in which the total sum of squares is composed of the sum of squares due to effects, the sum of squares due to blocking and the sum of squares due to random error. The simplest randomized block design model has one effect A with I levels and one blocking variable B with J levels:
y_ij = μ + A_i + B_j + e_ij      i = 1, I    j = 1, J    n = I J
There is one observation per treatment in each block. Because the only randomization of treatments is within blocks, the blocks represent a restriction on randomization. This model is additive, i.e. there is no interaction between the effects and blocks. The treatment and block effects are defined as deviations from the grand mean, so that
Σ_i A_i = 0    and    Σ_j B_j = 0
When the effects and blocks are fixed, the expected mean squares are:
EMS_A = σ² + J Σ_i A_i² / (I - 1)
EMS_B = σ² + I Σ_j B_j² / (J - 1)
The model with one effect A and two blocking variables B and C is called a Latin square:
y_ijk = μ + A_i + B_j + C_k + e_ijk
This model is also completely additive, i.e. there is no interaction term between the effect and the two blocking variables.
This model is also completely additive, i.e. there is no interaction term between the effect and the two blocking variables. b randomized block design [EXDE] + design b
randomized design [EXDE]
+ design
b range [DESC] → dispersion
b range chart [QUAL] → control chart (o variable control chart)
b range scaling [PREP] → standardization
b rank [DESC] Ordinal number indicating the position of an object with respect to other objects when they are ordered according to some criterion, such as values assumed by a variable. When ranking objects it may happen that some of them have indistinguishable positions, i.e. they have a tied rank. In this case, the usual solution is to assign equal rank to all objects in the tied group, calculated as the average of ranks assigned ignoring the tie. For example:
x:        3.5  5.0  1.3  3.5  4.0  1.2  3.5  6.7  3.5
rank(x):  4.5  8    2    4.5  7    1    4.5  9    4.5
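A small helper (illustrative, not from the book) that assigns average ranks to tied observations and reproduces the example above:

```python
def ranks_with_ties(x):
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1                          # extend the group of tied values
        avg = (i + j) / 2 + 1               # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

x = [3.5, 5.0, 1.3, 3.5, 4.0, 1.2, 3.5, 6.7, 3.5]
print(ranks_with_ties(x))   # [4.5, 8.0, 2.0, 4.5, 7.0, 1.0, 4.5, 9.0, 4.5]
```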
A statistic calculated on ordered data, i.e. on data with observations arranged in ascending order, is called a rank order statistic or order statistic. Examples are: median, interquartile range, rank correlation. b rank analysis [FACT] Collection of techniques for determining the number of factors, i.e. the complexity of a data matrix.
average eigenvalue criterion A factor is significant if its eigenvalue is greater than the average eigenvalue. When variables are autoscaled, i.e. the correlation matrix is factored, this criterion is the eigenvalue-one criterion.
double cross-validation (dcv) Determines M based on an optimization of the predictive ability. It calculates the ratio
PRESS(M) / RSS(M - 1)
with
RSS(M - 1) = Σ_i Σ_j (x_ij - x̂_ij)²        PRESS(M) = Σ_i Σ_j (x_ij - x̂_ij\ij)²
where x̂_ij is reproduced from an (M - 1)-component model calculated from the complete data matrix, while x̂_ij\ij is reproduced from an M-component model calculated from a data matrix in which elements were deleted diagonally, including the ijth element. The above ratio is compared with either Q = 1 or with a more conservative empirical function
Q = [(p - M)(n - M - 1)] / [(p - M - 1)(n - M)]
a ratio less than Q indicates an optimal M.
eigenvalue-one criterion A factor is significant if its eigenvalue is greater than one. It is the average eigenvalue criterion on autoscaled data.
eigenvalue threshold criterion A factor is significant if its eigenvalue is greater than a specified threshold value. It is a generalization of the average eigenvalue criterion.
fixed percentage of explained variance The number of factors M is determined such that the factor model explains a prespecified fixed percentage of the total variance:
[Σ_{m=1}^{M} λ_m / Σ_{m=1}^{p} λ_m] · 100 ≥ fixed percentage
Horn's method A modification of the average eigenvalue criterion suggesting that the threshold value should not be uniformly the average eigenvalue, but should decrease with increasing rank of factors. The individual thresholds are calculated as eigenvalues from a data matrix of the same size as the matrix analyzed, but filled with random values from a normal distribution with the same variance as the original variables.
Hotelling-Lawley trace test Determines M based on testing the distribution of
A_HL = Σ_{m=1}^{M} λ_m
imbedded error function Under the assumption that the error is random and identically distributed in the data matrix, the eigenvalues associated with the residual error should be approximately equal, i.e.
λ_{M+1} = λ_{M+2} = ... = λ_p
In this case the imbedded error can be written as
IE = [M Σ_{m=M+1}^{p} λ_m / (n p (p - M))]^(1/2)
If the above assumption holds, this function decreases for m = 1, M and increases for m = M + 1, p. The optimal number of factors M is indicated by the minimum of IE. In practice, when the error is not truly random or identically distributed, IE keeps decreasing even at m > M, but at a much slower rate, so the curve of IE vs. m flattens at M.
indicator function : Malinowski’s indicator function likelihood ratio criterion Determines M based on testing the distribution of
Malinowski's indicator function (MIF) (: indicator function) An empirical function of the real error RE or residual standard error RSD:
MIF = RE / (p - M)² = [Σ_{m=M+1}^{p} λ_m / (n (p - M))]^(1/2) / (p - M)²
It shows a minimum when the true number of factors M is reached. This indicator function appears to be more sensitive than the imbedded error function.
Pillai's trace test Determines M based on testing the distribution of
A_P = Σ_{m=1}^{M} λ_m / (1 + λ_m)
Roy's greatest root test Determines M based on testing the distribution of A_R = λ_1.
scree test Determines M based on the phenomenon that the residual variance levels off when the proper number of common factors is obtained. This leveling off is assessed visually on the scree plot, which is the residual variance, or simply the eigenvalues plotted against the number of common factors.
Wilks' Λ test Determines M based on testing the distribution of
A_W = Π_{m=1}^{M} 1 / (1 + λ_m)
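A minimal sketch applying two of the criteria above (the eigenvalue-one criterion and a fixed percentage of explained variance) to the eigenvalues of an autoscaled data matrix; the simulated two-factor data are an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 6
scores = rng.normal(size=(n, 2))
loads = rng.normal(size=(2, p))
X = scores @ loads + 0.1 * rng.normal(size=(n, p))       # approximately two factors plus noise

Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)        # autoscaling
eig = np.sort(np.linalg.eigvalsh(np.cov(Xs, rowvar=False)))[::-1]   # eigenvalues of the correlation matrix

M_eigenvalue_one = int((eig > 1.0).sum())                 # eigenvalue-one criterion
explained = eig.cumsum() / eig.sum()
M_90_percent = int(np.searchsorted(explained, 0.90) + 1)  # at least 90% explained variance
print(M_eigenvalue_one, M_90_percent)                     # both should point to about 2 factors
```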
b rank correlation [DESC] → correlation
b rank deficient matrix [ALGE] → matrix rank
b rank distance [GEOM] → distance (o ranked data)
b rank of a matrix [ALGE] : matrix rank
b rank order statistic [DESC] → rank
b rank test [TEST] → hypothesis testing
b ranking variable [PREP] → variable
b rankit transformation [PREP] → transformation
b rank-one update [ALGE] Matrix decomposition of X′ calculated from the matrix decomposition of X, where the matrix X′ can be obtained from X by:
- adding a rank one matrix to X;
- appending a row or column to X;
- deleting a row or column from X.
In such situations it is more efficient to update the decomposition of X instead of starting the calculation from the beginning. The most straightforward updating is that of the Q-R decomposition.
ratio scale [PREP] -+ scale b
Rayleigh distribution [PROB] + distribution b
b Rayleigh's test [TEST] → hypothesis test
b real error (RE) [MODEL] → goodness of fit
b real error (RE) [FACT] → error terms in factor analysis
b reciprocal regression model [REGR] → regression model
b reciprocal transformation [PREP] → transformation
b rectangular distribution [PROB] → distribution
b rectangular kernel [ESTIM] → kernel
b rectifying inspection [QUAL] → lot
b recursive residual [REGR] → residual
b reduced correlation matrix [FACT] → principal factor analysis
b reduced model [ANOVA] → analysis of variance
b reduced variable [PREP] → variable
b reflected design [EXDE] → design
b regression analysis [REGR] Collection of statistical methods using a mathematical equation to model the relationship among measured or observed quantities. The goal of this analysis is
twofold: modeling and predicting. The relationship is described in algebraic form as y=f(x)+e
where x denotes the predictor variable(s), y the response variable(s), f ( x ) is the systematic part, and e is the random element of the response, also called the model error or residual. To calculate a regression model one must select the structural form f (x) (the most common is the linear regression model), the probabilistic model for the error term (the most common is to assume normality), and the estimator for the regression coefficients (the most common is least squares). Regression analysis is not merely the estimation of model parameters. It also includes the calculation of goodness of fit and goodness of prediction statistics, regression diagnostics, i.e. influence analysis and residual analysis. Besides the well-known ordinary least squares regression model, biased regression, nonlinear regression, and robust regression models are also important. Calibration and the standard addition method are two special applications of regression.
b regression coefficient [REGR] (: regression parameter) Coefficient of a predictor or a function of predictors in a regression model. In a linear regression model it is the partial derivative of the response variable with respect to a predictor variable:
b_j = ∂y/∂x_j
It indicates the importance of the corresponding predictor variable in the model. Although the least squares estimator is the most popular method for calculating the regression coefficients, generalized least squares, biased estimators, and robust estimators are also often applied. The variance inflation factor measures the effect of an ill-conditioned predictor matrix on the coefficient estimates. The standardized regression coefficient is the regression coefficient divided by the ratio of the standard deviation of the response to the standard deviation of the corresponding predictor variable, i.e. it is the regression coefficient for autoscaled variables. The constant term in a regression model is called the intercept or offset. It can be considered as the regression coefficient of a dummy predictor variable with all elements being set to one. b regression curve [REGR] Graphical representation of a regression model. For a univariate regression it can be drawn on a plane of the predictor and response variables. If the relationship is linear the regression curve is called a regression line. For multiple regression the model is represented as a regression surface, also called the response surface.
b regression diagnostics [REGR] (: diagnostics) Collection of techniques used to detect and assess the quality and reliability of a regression model. The goal of diagnostics is twofold: recognition of important phenomena due to outliers rather than the bulk of the data, and suggestion of appropriate remedies to find a better regression model. Regression diagnostics is performed to narrow the gap between theoretical assumptions and observed data. In contrast to robust regression, which solves this problem by dampening the effect of outliers, regression diagnostics identifies the outliers and deals with them directly. It looks for model misspecification, departure from the normality assumption and from homoscedasticity of the residuals, collinearity in the predictor variables and influential observations. Residual analysis, comprising numerical and graphical analysis of the ordinary and various derived residuals, is one of the most important parts of regression diagnostics. A collection of statistics, known as influence analysis, helps to reveal the effect of individual observations on the regression model. Assessment of collinearity and its harmful effect on the regression estimates is another task of regression diagnostics.
b regression equation [REGR] : regression model
b regression function [REGR] : regression model
b regression line [REGR] + regression curve
b regression model [REGR] (: regression equation, regression function) Mathematical (usually algebraic) equation, also called structural model, to describe the relationship among predictor and response variables. The graphical representation of a regression model is called the regression curve. A list of regression models of particular interest follows.
exponential regression model Intrinsically linear regression model of the form
y = exp[b^T x] e
Taking the natural logarithm of both sides gives a linear model:
ln(y) = b^T x + ln(e)
first-order regression model Regression model in which the exponents of all variables are 1, e.g. y = b0 + b1 x1 + b2 x2. In such a model the following relationship holds:
∂y/∂x_j = b_j (a constant)
Geometrically this model is a p-dimensional hyperplane, e.g. a straight line if p = 1, a plane if p = 2, etc.
generalized linear model (GLM) Regression model describing the relationship between a response variable and a set of predictor variables. The probability distribution of the response, specified in terms of predictors, must belong to an exponential family, e.g. normal, Poisson, etc. This restriction rules out discretization and various mathematical transformations of the response. The predictor variables are connected to the response only through a linear combination:
η = b0 + b1 x1 + b2 x2 + ... + bp xp
The expected value of each response y can be expressed as some known function of this linear combination:
E[y] = g(η)
where g is called the link function. Examples are: ANOVA (response of normal distribution and identity link function), OLS (response of normal distribution and identity link function), log-linear models (response of Poisson distribution and the inverse of the link function is the natural logarithm), logit and probit analysis (response of binomial distribution and logit or probit function).
Gompertz growth model Growth model that assumes a linear relationship between the relative growth rate and time. It has a double exponential form:
y = a exp[-b exp(-k x)]
This model has an S-shaped curve which is not symmetrical about its inflection point. The parameter a is the limiting growth; when x = 0, y = a e^(-b).
growth model Nonlinear regression model which describes growth as a function of an increasing independent variable, usually time. In general, growth models are mechanistic, i.e. the model is defined by solving differential equations that represent an assumption about the type of growth. Examples are: logistic, Gompertz, Richards and Weibull growth models. intrinsically linear regression model Nonlinear regression model that can be transformed into linear form; e.g. logistic model, reciprocal model.
intrinsically nonlinear regression model Nonlinear regression model that cannot be transformed into linear form.
linear regression model Regression model in which the response variable is a linear function of the regression coefficients, i.e. ∂y/∂b_j is not a function of b_j. Examples of linear regression are ordinary least squares regression, ridge regression, stepwise regression, principal components regression, partial least squares regression.
logistic growth model Growth model assuming that the growth rate is proportional to the product of the present size and the future amount of growth:
y = a / (1 + b exp[-k x])
When x = 0 the starting growth value is y = a/(1 + b). The parameter a is called the limiting growth. It is the value that y approaches as x tends to infinity. The values b and k are always positive. The plot of y vs. x has an S shape; the slope of the curve is always positive.
logistic regression model Nonlinear regression model which describes the probability P of a binary response y in the form:
P = 1 / (1 + exp[-(b0 + b1 x)])
This function has an S shape that approaches one asymptotically. It can be linearized by applying the logit function:
ln[P/(1 - P)] = b0 + b1 x
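As an illustration of the model above, the coefficients can also be fitted directly by maximum likelihood. The following is a minimal sketch assuming NumPy and SciPy are available; the simulated data and the coefficient values 0.5 and 1.5 are hypothetical.

import numpy as np
from scipy.optimize import minimize

def fit_logistic(x, y):
    # Maximize the binomial likelihood of P = 1/(1 + exp[-(b0 + b1*x)]).
    def neg_log_lik(b):
        p = 1.0 / (1.0 + np.exp(-(b[0] + b[1] * x)))
        eps = 1e-12                                   # guard against log(0)
        return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return minimize(neg_log_lik, x0=np.zeros(2), method="BFGS").x

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x))))
print(fit_logistic(x, y))                             # estimates of (b0, b1)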
multiple regression model : multivariate regression model
multivariate regression model (: multiple regression model) Regression model in which the response is a function of more than one predictor variable. The simplest one is of linear form:
y = b0 + x^T b + e
nonlinear regression model Regression model in which the response variable is a nonlinear function of the regression coefficients. In contrast, a regression in which the response variable is a linear function of the parameters but contains powers or cross-products of the predictor variables is called a polynomial regression model, or second-, third-
(etc.) order regression. Nonlinear regression models can be divided into two groups: parametric and nonparametric. The first group consists of models that have a specific parameterized form arising from the scientific field of the data; these models are suggested by the theory of the subject, e.g. growth models. The second group contains more flexible models in which the form of nonlinearity is not prespecified but estimated from the data. Examples are: ACE, SMART, MARS, nonlinear PLS. In nonparametric methods smoothers and splines are used extensively for the approximation of functions. The least squares estimator is the most popular one for calculating the parameters and functions. Because the parameters enter nonlinearly into the criterion to be minimized, nonlinear optimization techniques must be employed, e.g. the Gauss-Newton method.
quadratic regression model : second-order regression model
reciprocal regression model Intrinsically linear regression model of the form:
y = 1 / (b^T x + e)
Taking reciprocals on both sides gives a linear model:
1/y = b^T x + e
Richards growth model Variation of the logistic growth model including an additional parameter:
y = a / (1 + b exp[-k x])^(1/d)
second-order regression model (: quadratic regression model) Model in which at least one regression coefficient is a linear function of the corresponding predictor, i.e. the partial derivative ∂y/∂x_j depends linearly on x_j. This model contains predictors on second power (x_j^2) and/or product terms x_j x_k.
single regression model : univariate regression model
univariate regression model (: single regression model) Regression model in which the response is a function of only one predictor variable. The simplest is the linear form:
y = b0 + b1 x + e
Weibull growth model Growth model of the form:
y = a - b exp(-c x^d)
The starting growth value at x = 0 is y = a - b. The limiting growth parameter is a.
zero-order regression model Regression model containing only a constant, i.e. it is independent of the predictors:
y = b0 + e
b regression parameter [REGR] : regression coefficient
b regression surface [REGR] + regression curve
b regressor [PREP] + variable
b regular simplex [OPTIM] + simplex
b regularized discriminant analysis (RDA) [CLAS] Parametric classification method, like SIMCA and DASCO, that is an extension of quadratic discriminant analysis. It seeks a biased estimate of the class covariance matrices in order to reduce their variance. RDA has two biasing schemes: the class covariance matrices are pulled towards a common covariance matrix:
S_g(λ) = (1 - λ) S_g + λ S
and shrunk towards a multiple (the average of the eigenvalues) of the identity matrix I:
S_g(λ, γ) = (1 - γ) S_g(λ) + γ [tr S_g(λ)/p] I
The first biasing is controlled by the parameter λ and the second is regulated by the shrinkage parameter γ. Both range between 0 and 1; their values are chosen based on cross-validated misclassification risk. If λ = 0 and γ = 0, RDA is equal to quadratic discriminant analysis, whereas if λ = 1 and γ = 0, RDA is equal to linear discriminant analysis. The case of λ = 1 and γ = 1 corresponds to nearest means classification and the case of λ = 0 and γ = 1 gives weighted nearest means classification.
[Figure: the (λ, γ) plane of RDA models, with QDA (λ = 0, γ = 0) and LDA (λ = 1, γ = 0) along the Mahalanobis-distance edge, nearest means (NMC) and weighted nearest means (WNMC) classification along the Euclidean-distance edge (γ = 1), and ridge-like intermediate models in between.]
Holding γ = 0 and varying λ produces models lying between QDA and LDA; holding λ = 0 and increasing γ, RDA attempts to unbias the sample-based eigenvalue estimates; holding λ = 1 and increasing γ gives rise to a ridge regression analog for LDA.
b rejectable quality level [QUAL] + producer's risk
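The following is a minimal sketch of the two biasing steps, assuming NumPy is available; S_g is a class covariance matrix and S_pooled a common (pooled) covariance matrix, both hypothetical inputs.

import numpy as np

def rda_covariance(S_g, S_pooled, lam, gamma):
    # First biasing: pull the class covariance towards the common covariance.
    S_lam = (1.0 - lam) * S_g + lam * S_pooled
    # Second biasing: shrink towards (average eigenvalue) * identity,
    # i.e. tr(S_lam)/p times the identity matrix.
    p = S_lam.shape[0]
    return (1.0 - gamma) * S_lam + gamma * (np.trace(S_lam) / p) * np.eye(p)

With lam = gamma = 0 this returns the class covariance used by QDA; lam = 1, gamma = 0 gives the pooled covariance used by LDA.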
b rejection line [QUAL] + lot
b rejection number [QUAL] + lot
b rejection region [TEST] + hypothesis testing
b relative efficiency [ESTIM] + efficiency
b relative frequency [DESC] + frequency
b relative standard deviation [DESC] + dispersion
b reliability function [PROB] + random variable
b reliability score [CLAS] + classification
b relocation algorithm [CLUS] Basic algorithm of non-hierarchical optimization clustering, consisting of the following steps:
1. Set an initial partition of n objects into G clusters.
2. Select a criterion to optimize.
3. Calculate the centroid (centrotype) of each cluster.
For i = 1, n:
4. Assign object i to another cluster if that improves the optimization criterion.
5. If object i was relocated, recalculate the centroid (centrotype) of its old and new cluster.
End
6. If no relocation occurred, stop; otherwise go to step 3.
A variation of the above algorithm is obtained when step 5 is omitted, i.e. the cluster centroids are recalculated only after a full cycle.
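A minimal sketch of the relocation algorithm, assuming NumPy is available and using the total within-cluster sum of squares as the optimization criterion (nearest-centroid reassignment as the improvement step); the function and parameter names are illustrative only.

import numpy as np

def relocation_clustering(X, G, max_cycles=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, G, size=len(X))      # step 1: initial partition
    for _ in range(max_cycles):
        # step 3: centroid of each cluster (re-seed an empty cluster randomly)
        centroids = np.array([X[labels == g].mean(axis=0) if np.any(labels == g)
                              else X[rng.integers(len(X))] for g in range(G)])
        moved = False
        for i in range(len(X)):
            # step 4: relocate object i if the nearest centroid belongs to another cluster
            d = ((centroids - X[i]) ** 2).sum(axis=1)
            g_new, g_old = int(d.argmin()), labels[i]
            if g_new != g_old:
                labels[i] = g_new
                moved = True
                # step 5: recalculate the centroids of the old and new cluster
                for g in (g_old, g_new):
                    if np.any(labels == g):
                        centroids[g] = X[labels == g].mean(axis=0)
        if not moved:                              # step 6: stop if no relocation
            break
    return labels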
b renewal process [TIME] + stochastic process (o counting process)
b replication [EXDE] Repetition of runs with the same treatment. Replicates are identical rows in the design matrix. They are assumed to be independent observations. Replication makes it possible to estimate the experimental error (precision) and to obtain a more precise estimate of the effect of a factor.
b reproduced correlation matrix [FACT] + factor loading
b resampling [MODEL] + model validation
b residual [REGR] Quantity remaining after some other quantity has been subtracted; for example the part of a variable unexplained by a statistical model. In regression, the part of the
response variable not described by the regression model. This part is due to random variation or model misspecification, as opposed to the systematic part described by the model. The residual is calculated as the difference between the observed and the predicted value of the response variable: e = y - ŷ. In ordinary least squares the residuals are assumed to be uncorrelated and to follow a normal distribution with zero mean and equal variance. Residuals, investigated in residual analysis, play an important part in regression diagnostics. A list of various residuals follows.
adjusted residual Residual adjusted for the effect of the predictor values, assuring equal variance:
e_i' = e_i / sqrt(1 - h_ii)    i = 1, n
where h_ii is the ith diagonal element of the hat matrix.
Anscombe residual Transformed residual, informative in the case of a response of nonnormal distribution:
e_i' = f(y_i) - f(ŷ_i)    i = 1, n
where f is a function that transforms y into a normally distributed variable.
cross-validated residual (: deletion residual, predictive residual) Residual of a model fitted to data with the ith observation excluded:
e_i\i = y_i - ŷ_i\i    i = 1, n
where ŷ_i\i denotes the predicted response value of the ith observation from a model calculated without the ith observation. Measures of goodness of prediction are calculated on the basis of this residual. In the case of linear estimators (e.g. OLS or ridge regression) the cross-validated residuals can be calculated from the ordinary residuals as:
e_i\i = e_i / (1 - h_ii)
where h_ii is the ith diagonal element of the hat matrix. In the case of nonlinear estimators (e.g. PLS) the cross-validated residuals must be calculated by the leave-one-out technique.
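A minimal sketch of this shortcut for OLS, assuming NumPy is available; the relation e_i\i = e_i/(1 - h_ii) makes explicit leave-one-out refitting unnecessary for linear estimators.

import numpy as np

def ols_residuals(X, y):
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T                 # hat matrix
    e = y - H @ y                         # ordinary residuals
    e_cv = e / (1.0 - np.diag(H))         # cross-validated (deletion) residuals
    return e, e_cv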
deletion residual : cross-validated residual
externally Studentized residual : Studentized residual
internally Studentized residual : standardized residual
jackknifed residual : Studentized residual
ordinary residual Residual of a model fitted to all the observations:
e_i = y_i - ŷ_i    i = 1, n
Measures of goodness of fit are calculated on the basis of this residual.
predictive residual : cross-validated residual
o recursive residual Residual of a time series model in which the coefficient estimate b_{i-1} is calculated using only the first i - 1 observations:
e_i = y_i - x_i^T b_{i-1}
o standardized residual (: internally Studentized residual) Residual of a model fitted to all the observations, scaled by its standard error estimated from all the observations:
r_i = e_i / [s sqrt(1 - h_ii)]    i = 1, n
with s the residual standard deviation estimated from all n observations. This residual is scale invariant and has a t-like distribution. This scaling assures homoscedasticity and unit variance.
Studentized residual (: externally Studentized residual, jackknifed residual) Residual standardized so as to have the same precision along the observations. This standardization eliminates the effect of the location of the objects:
t_i = e_i / [s_\i sqrt(1 - h_ii)]    i = 1, n
where s_\i is the standard error estimated without the ith observation. It can be calculated as:
s_\i = sqrt{ [(n - p) s^2 - e_i^2/(1 - h_ii)] / (n - p - 1) }
This residual has zero mean and unit variance, it is scale invariant, and is more appropriate for detecting violations of assumptions in a regression model.
b residual analysis [REGR] Part of regression diagnostics, based on examining residuals from a regression model via numerical and/or graphical techniques. The goal is to infer any incorrect assumptions concerning the model error (e.g. homoscedasticity or the assumption of normality). The most popular plots are residual scatter plots and residual quantile plots.
b residual correlation matrix [FACT] + factor loading
b residual degrees of freedom [MODEL] + model degrees of freedom
b residual mean square (RMS) [MODEL] + goodness of fit
b residual plot [GRAPH] + scatter plot
b residual standard deviation (RSD) [MODEL] + goodness of fit
b residual standard error (RSE) [MODEL] + goodness of fit
b residual sum of squares (RSS) [MODEL] + goodness of fit
b residual variance [MODEL] + goodness of fit
b resistant estimator [ESTIM] + estimator
b resolution [EXDE] + confounding
b response curve [GRAPH] + scatter plot
b response surface [EXDE] Mathematical function that describes the response as a function of the factors. It can be visualized and plotted only if there are no more than two factors. Response surfaces are often described by polynomial approximations.
If only terms up to degree one are included (i.e. only main effects), the function is called a first-order response surface, which is a hyperplane. The corresponding design is a first-order design. A second-order response surface also includes interactions and squared factors. This surface is curved possibly with minima, maxima, ridge, and saddle points. The corresponding design is called a second-order design.
Response surface methodology (RSM) is a collection of statistical techniques that, by design and analysis of an experiment, seeks to relate the response (output) of a system to factors (input) that have an effect on the response. RSM is used for predicting responses at given factor levels, choosing factor levels to obtain specified response; finding factor levels to obtain optimal response. b
b response surface methodology (RSM) [EXDE] + response surface
b response variable [PREP] + variable
b restricted model [ANOVA] + analysis of variance
b resubstitution [MODEL] + model validation
b Richards growth model [REGR] + regression model
b ridge regression (RR) [REGR] Biased regression method based on the assumption that large regression coefficients are likely to be spurious; therefore it shrinks them toward zero by adding a small constant γ to each diagonal element of the correlation matrix. The constant γ is called the shrinkage parameter. Increasing γ increases the bias in the regression coefficient estimates but it also decreases their variance. The selection of the optimal γ on the basis of plotting the regression coefficients as a function of γ is called the ridge trace. The value of γ is increased until stability is indicated in all regression coefficients. Stability does not mean convergence, but rather a lack of rapid change of the coefficients as a function of γ. A better strategy for estimating the optimal γ is based on a goodness of prediction criterion. The regression coefficient estimates are
b = (X^T X + γ I)^-1 X^T y
They are calculated by minimizing
(y - X b)^T (y - X b) + γ b^T b
or equivalently by maximizing
corr^2(y, X b) var(X b) / [var(X b) + γ]    subject to b^T b = 1
The bias and variance of the estimated regression coefficients are:
B^2(b) = γ^2 β^T (X^T X + γ I)^-2 β
V(b) = s^2 tr[(X^T X + γ I)^-1 X^T X (X^T X + γ I)^-1] = s^2 Σ_j λ_j/(λ_j + γ)^2
where β is the true regression coefficient vector, s is the error standard deviation and λ_j are the eigenvalues of the correlation matrix. As the ridge estimator is a linear estimator the cross-validated residuals can easily be calculated as
e_i\i = e_i / (1 - h_ii(γ))
where h_ii(γ) is a diagonal element of the ridge hat matrix
H(γ) = X (X^T X + γ I)^-1 X^T
with X having centered predictor variables. The degrees of freedom of a ridge regression model is given by the trace of H(γ).
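A minimal sketch of the ridge estimates, hat matrix, cross-validated residuals and degrees of freedom, assuming NumPy is available and the predictors are centered as described above.

import numpy as np

def ridge_fit(X, y, gamma):
    Xc = X - X.mean(axis=0)                           # centered predictors
    yc = y - y.mean()
    A = np.linalg.inv(Xc.T @ Xc + gamma * np.eye(Xc.shape[1]))
    b = A @ Xc.T @ yc                                 # b = (X'X + gamma I)^-1 X'y
    H = Xc @ A @ Xc.T                                 # ridge hat matrix H(gamma)
    e = yc - Xc @ b
    e_cv = e / (1.0 - np.diag(H))                     # cross-validated residuals
    return b, e_cv, np.trace(H)                       # trace(H) = degrees of freedom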
b ridge trace [REGR] + ridge regression
b risk function [MISC] + decision theory
b robust estimator [ESTIM] + estimator
b robust locally weighted regression [REGR] + robust regression
b robust regression [REGR] Regression method that is insensitive to small deviations from the distributional assumptions. It weights down the influence of observations with large residuals, hence safeguarding against outliers in the response. The least squares estimator can be made robust by iteratively reweighting the residuals in inverse proportion to their magnitude. For example, the biweight can be applied as an iterative procedure in which the weights are calculated as a function of the residuals. The indicator function offers another solution to mitigate the effect of outliers. Robust regression estimators can also be defined by minimizing other functions than the sum of squares of the residuals. The following robust methods are the most popular.
bounded influence regression (: GM estimator) Robust regression that limits the influence of outliers in a regression model by applying some weighting function. The derivative of the minimization criterion is
Σ_i w(x_i) ψ(e_i/s) x_i = 0
where s is an estimate of the error standard deviation, w denotes some weight function of the predictor vectors and ψ is an indicator function. Unfortunately, the breakdown point of bounded influence regression decreases as the number of predictor variables increases.
GM estimator : bounded influence regression
iteratively reweighted least squares regression (IRWLS) Robust regression that estimates the regression coefficients by minimizing the criterion:
min_b [ Σ_i w_i r_i^2 ]
where w_i weights the squared standardized residuals r_i^2 according to their magnitude. The weights w_i are calculated simultaneously with the estimates of the error standard deviation in an iterative fashion.
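A minimal sketch of IRWLS, assuming NumPy is available, using Tukey's biweight as the weight function and the MAD as the robust scale estimate; the tuning constant 4.685 is a conventional choice and an assumption here.

import numpy as np

def irwls_biweight(X, y, c=4.685, max_iter=50):
    b = np.linalg.lstsq(X, y, rcond=None)[0]               # start from OLS
    for _ in range(max_iter):
        e = y - X @ b
        s = 1.4826 * np.median(np.abs(e - np.median(e)))   # robust scale (MAD)
        if s == 0:
            break
        u = e / (c * s)
        w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)   # biweight weights
        b_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        if np.allclose(b, b_new, atol=1e-8):
            break
        b = b_new
    return b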
L1 regression : least absolute residual regression
least absolute residual regression (: L1 regression) Robust regression that estimates the regression coefficients by minimizing the sum of absolute residuals, not the sum of squared residuals:
min_b Σ_i |e_i|
Although this method is thought to be more robust than the least squares method, its breakdown point is still no better than 0%.
least median squares regression (LMS) Robust regression that estimates the regression coefficients by minimizing the median instead of the sum of the squared residuals:
min_b med_i e_i^2
This method is very robust with respect to outliers in the response; its breakdown point is 50%.
least trimmed squares regression (LTS) Robust regression that estimates the regression coefficients by minimizing the criterion:
min_b Σ_{i=1}^{n-m} (e^2)_{i:n}
where (e^2)_{i:n} are the ordered squared residuals. In the summation the m largest squared residuals are omitted. When m is around n/2 the breakdown point of this estimator is 50%.
locally weighted scatter plot smoother (LOWESS) regression : robust locally weighted regression
robust locally weighted regression (: locally weighted scatter plot smoother) Robust regression in which, at each observation x_i, a weight vector w_k(x_i) is calculated and a weighted least squares criterion is minimized. This weight vector is centered at x_i and scaled such that it becomes zero at points further away than a specified nearest neighbor. The size of the neighborhood, i.e. the fraction of the observations with nonzero weights, is a parameter to be optimized. After the initial model and residuals have been calculated the weight vectors are modified on the basis of the size of the residuals: observations with large residuals receive small weights and observations with small residuals receive large weights. This correction is performed by using the biweight.
b robustness [ESTIM] + estimator (o robust estimator)
b Rogers-Tanimoto coefficient [GEOM] + distance (o binary data)
b root mean square (RMS) [DESC] + dispersion
b root mean square deviation (RMSD) [DESC] + dispersion
b root mean square error (RMSE) [MODEL] + goodness of fit
b rootogram [GRAPH] + histogram
b Roquemore design [EXDE] + design
b Rosenbaum's test [TEST] + hypothesis test
b rotatable design [EXDE] + design
b rotation [FACT] : factor rotation
b rotation matrix [ALGE] (: transformation matrix) An orthogonal matrix M that brings a matrix X into another matrix X', preserving the length of its vectors:
X' = X M
Due to the orthogonality of M, the following properties hold:
M^T = M^-1    M^T M = I    |M| = ±1
b round-off error [OPTIM] + numerical error
b row vector [ALGE] + vector
b Roy's greatest root test [FACT] + rank analysis
b run [EXDE] + design matrix
b Russel-Rao coefficient [GEOM] + distance (o binary data)

S

b S-chart [QUAL] + control chart (o variable control chart)
b S statistic [MODEL] + goodness of fit
b sample [PROB] + population
b sample reuse [MODEL] + model validation
b sample size [PROB] + population
b sampling [PROB] The process of drawing a subset, called a sample, from a population in order to estimate the properties of the population. Mathematically, simulation performs the sample drawing process. There are several sampling strategies to choose from:
cluster sampling (: nested sampling) Sampling from selected, restricted parts of the population. Examples are subsampling and two-stage sampling where essentially more than one sample is collected from selected parts of the population.
Monte Carlo sampling Sampling on the basis of mathematical experiments, involving random numbers, in which mathematical constraints (e.g. distribution parameters) are imposed.
nested sampling : cluster sampling
random sampling Sampling by dividing the population into equal parts and selecting from them using a random procedure. Such a process ensures that an unbiased sample is obtained.
stratified sampling Sampling from a population that is heterogeneous but composed of k clearly distinguishable homogeneous sub-populations (strata) with known relative frequencies. In such a case, k individuals are selected, one from each stratum.
systematic sampling Sampling at regular intervals. It is appropriate only for a homogeneous population. On the other hand, systematic sampling results in biased estimates if the investigated property changes regularly with the sampling interval.
b sampling plan [QUAL] + acceptance sampling
b saturated design [EXDE] + design
b saturated model [MODEL] + model (o nested model)
b scalar [ALGE] Quantity that, in contrast to a vector, has only magnitude, but no direction. A scalar has the same value in each coordinate system, i.e. a scalar is scale invariant.
b scalar multiplication of a matrix [ALGE] + matrix operation
b scale [DESC] : dispersion
b scale [PREP] Variables can be characterized according to the scale on which they are defined. A qualitative variable is measured on a nominal or ordinal scale, also called a non-metric scale, while a quantitative variable is defined on a proportional or ratio scale, also called a metric scale. The scale of measurement greatly affects the choice of model and estimator.
arithmetic scale : ratio scale
interval scale : proportional scale
nominal scale The mathematically weakest scale for qualitative variables, where the coded information is only a name (category or label) without any order relation. The number of possible categories of a nominal variable is usually finite, although it is possible to have a countably infinite number. On a nominal scale the only admitted operations among the categories are = and ≠. For example, color, taste, or chemical categories are measured on a nominal scale. When a variable has only two categories (presence/absence of a property, male/female, no/yes) it is called a binary variable and is usually coded as 0/1, -/+ or 1/2. Nominal variables with several categories are often coded in several binary variables, each corresponding to a category.
ordinal scale Stronger than a nominal scale, for qualitative variables where the categories can be arranged in order. The number of possible categories of an ordinal variable is usually finite, although it is possible to have a countably infinite number. On an ordinal scale total ordering (<, >) or partial ordering (≤, ≥) operations are defined among the categories. Thus, an ordinal variable indicates an ordering or ranking of measurements, with relative rather than quantitative differences. Variables that are originally on a proportional or ratio scale can be reduced to ranks measured on an ordinal scale.
proportional scale (: interval scale) Scale for quantitative variables; discrete or continuous. The starting point of the scale is not well-defined, so only the difference between two values is meaningful, not the ratio. Besides the operations =, ≠, <, >, ≤, ≥ of the weaker ordinal scale, - and + are also defined. For example, temperature is measured on a proportional scale.
ratio scale (: arithmetic scale) Stronger than a proportional scale, for quantitative variables where the starting point of the scale is unambiguously defined. All operations are exactly defined; both the difference and the ratio between two values are meaningful. For example, length, weight, or volume are measured on a ratio scale. A ratio scale usually has a unit associated with it. Unitless ratio scales are the frequency count scale and the percentage scale. The former measures counts, the latter percentages of the total.
b scale invariance [ESTIM] Characteristic of an estimator or a model stating that it does not change with a change in the scale of the data. Examples are: ordinary least squares regression, discriminant analysis. Many estimators, however, result in different estimates depending on the scale of the data, i.e. they are scale variant. Examples are: PCR, PLS, SIMCA, RDA, KNN, most of the cluster analysis models.
b scaling [PREP] + standardization
b scatter [DESC] : dispersion
b scatter diagram [GRAPH] : scatter plot
b scatter matrix [DESC] (: dispersion matrix) Square matrix T that describes the dispersion of multivariate data around the mean. Its elements are:
t_jk = Σ_i (x_ij - x̄_j)(x_ik - x̄_k)
In the case of centered variables the scatter matrix equals the information matrix, defined as XT X. The scatter matrix is related to the covariance matrix B as:
B = T/n
or
B = T/(n- 1)
the latter being for unbiased estimates. The scatter matrix T can be decomposed into the within-group scatter matrix W and the between-group scatter matrix B:
T = W + B
Several multivariate methods are based on optimizing or testing such a decomposition, e.g. MANOVA, discriminant analysis, non-hierarchical cluster analysis.
b scatter plot [GRAPH] (: scatter diagram, plot) Cartesian plot in which the coordinates are either original variables or quantities (statistics) derived from the data. In contrast to the quantile-quantile plot, the
variables are paired, i.e. the corresponding values are measured on the same object. The scatter plot is an efficient way to represent the relationship between two or three variables. Two-dimensional scatter plots are the most common. Additional variables can be represented on a static two-dimensional plot by adding color, size or shape to the data points. A more efficient way to create a three-dimensional scatter plot is by using interactive computer graphics.
biplot Graphical display of a data matrix by means of markers for its rows and for its columns such that inner products of these markers represent the corresponding matrix elements. The most popular biplot is the principal component biplot, in which the row markers are principal component scores and the column markers are the eigenvectors scaled with the corresponding eigenvalues.
[Figure: principal component biplot displaying row and column markers in the plane of the first two principal components (PC1, PC2).]
It is a joint display of rows and columns, in contrast to most other scatter plots, which display only rows or only columns of a data matrix. The distances of the column points from the origin are roughly proportional to the standard deviations of the variables. Cosines of angles between two vectors drawn from the origin to two column points represent the correlation between the two variables. Distances between row points represent their dissimilarities. While distance between a row and a column point (i.e. between an object and a variable) is not directly interpretable, the relative positions of objects with respect to variables, and vice versa, are the essence of the biplot.
contour plot (: density map) Two-dimensional graphical representation of a response surface, i.e. the projection of a response variable onto a plane of the predictor variables or factors, by means of isoresponse curves.
[Figure: contour plot of a response surface in the plane of two factors x1 and x2.]
The levels of the two factors or of two linear combinations of factors are indicated on the horizontal and vertical axes, whereas the values of the response are indicated by contour lines drawn in the plane of the two factors. These contour lines are projections of the outlines of cross-sections of the response surface parallel to the factor plane. Coomans’ plot Graphical display of the goodness of class separation, often used in SIMCA. Residuals of objects fitted to one class are plotted against residuals of the same objects fitted to another class. On the two-dimensional plot the separability of the classes can be assessed only pairwise.
[Figure: Coomans' plot; residuals from the model of class A are plotted against residuals from the model of class B, defining regions of objects classified as A, as B, as both A and B, or as neither A nor B.]
density map : contour plot
discriminant plot Two- (or three-) dimensional scatter plot of discriminant scores, usually of the first two (or three) linear discriminant functions. This plot gives information about class separability.
latent variable plot Two-dimensional scatter plot in which the predictor latent variable is plotted against the response latent variable, calculated in the same component. This plot is similar to the principal component plot, except that the axes are not simply eigenvectors of a matrix but eigenvectors of complex aggregates of the predictor and the response matrices. This plot serves as a diagnostic tool in PLS modeling: it reveals outlier and high leverage observations, and any nonlinear relationship between predictors and responses.
loading plot Display of variables on a two- (or three-) dimensional scatter plot as their projections on new axes calculated by a linear combination of the original variables. The most common loading plot projects the variables on the principal component axes.
[Figure: loading plot of variables x1-x7 in the PC1-PC2 plane, indicating variables with high loadings in PC1 only, high loadings in PC2 only, and variables unimportant in both components.]
map Two- (or three-) dimensional scatter plot of multidimensional objects in which the coordinates are nonlinear combinations of the original variables. The two- (or three-) dimensional configuration is calculated by some mapping technique, like multidimensional scaling, nonlinear mapping, or projection pursuit. principal component plot ( : score plot) Scatter plot of the scores (projections) of the observations, usually on the first and second principal component axes. This plot is a linear projection of the observations onto a two-dimensional subspace that conserves most of the variance.
[Figure: principal component plot (PC1 vs. PC2 scores) showing objects from three groups marked with different symbols.]
If the first two components explain a high percentage of the variance, then the distribution of the observations in the two dimensions is a fair approximation of the distribution in the original measurement space. This plot is one of the most popular graphical tools in exploratory data analysis. Clusters, outliers and trends can be discovered by visualization of the multivariate distribution.
residual plot Two-dimensional scatter plot of residuals from a regression model. This plot plays an important role in residual analysis. It is used to verify the normality and homoscedasticity assumptions on the residual, to identify outliers and to check the linearity of the relationship. The ideal plot shows a horizontal band of points with constant vertical scatter from left to right.
[Figure: residual plot with residuals plotted against observation ID, showing a horizontal band of points.]
Depending on which residual is assigned to the vertical axis (e ordinary, e_cv cross-validated, r standardized, t Studentized) and what is plotted on the horizontal axis (x predictor, y response, ŷ estimated response, h hat diagonal, ID observation id), one can obtain several different residual plots:
o McCulloh-Meeter plot: ln[h/(n(1 - h))] vs. ln[e^2]
o ordinary residual plot: ID vs. e; x vs. e; y vs. e; ŷ vs. e
o predictive residual plot: ID vs. e_cv; y vs. e_cv; e vs. e_cv
o Pregibon plot: h vs. e^2
o standardized residual plot: ID vs. r; y vs. r
o Studentized residual plot: ID vs. t; y vs. t
o Williams plot: h vs. e_cv
response curve System response plotted as a function of one factor. It is a one-dimensional response surface.
score plot : principal component plot
time series plot Graphical representation of a time series: x(t) plotted against the corresponding time intervals t. The successive points are often connected for better visualization.
b scatter plot matrix [GRAPH] : draftsman's plot
b Scheffe's test [TEST] + hypothesis test
b Schrader clustering [CLUS] + non-hierarchical clustering (o optimization clustering)
b score [FACT] + common factor
b score plot [GRAPH] + scatter plot
b scree plot [GRAPH] + rank analysis (o scree test)
b scree test [FACT] + rank analysis
b second kind model [ANOVA] + term in ANOVA
b second-order design [EXDE] + design
b second-order regression model [REGR] + regression model
b seed point [CLUS] + cluster
b semi-interquartile range [DESC] + dispersion
b sensitivity curve (SC) [ESTIM] + influence curve
b sensitivity of classification [CLAS] + classification
b sequential design [EXDE] + design
b sequential sampling [QUAL] + acceptance sampling
b sequential variable selection [REGR] + variable subset selection
b Shannon entropy [MISC] + information theory
b Shannon-Wiener index [MISC] + information theory
b shape difference coefficient [GEOM] + distance (o quantitative data)
b Shapiro-Wilk's test [TEST] + hypothesis test
b sharpness of classification [CLAS] + classification
b Shewhart chart [QUAL] + control chart
b shot-noise process [TIME] + stochastic process
b shrinkage parameter [REGR] + ridge regression
b Siegel-Tukey's test [TEST] + hypothesis test
b sign test [TEST] + hypothesis testing
b significance level [TEST] + hypothesis testing
b similarity index [GEOM] Measure associated with a pair of objects that depicts the extent to which the two objects are similar. The more similar the two objects are, the larger is the similarity index. A similarity index s_st calculated for objects s and t has the following properties:
0 ≤ s_st ≤ 1    s_ss = 1    s_st = s_ts
Pairwise similarity indices calculated from a data matrix X(n,p) can be arranged in a symmetric matrix S(n,n), called the similarity matrix. Rows and columns represent objects; the diagonal elements s_ss are equal to one. A similarity index s_st can be calculated from the dissimilarity index d_st as:
s_st = 1/(1 + d_st)
or
s_st = 1 - d_st/d_max
Pairwise dissimilarity indices calculated from a data matrix X(n,p) can be arranged in a symmetric matrix D(n,n), called the dissimilarity matrix. Rows and columns represent objects; the diagonal elements d_ss are equal to zero. The most often used dissimilarity indices are distances.
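A minimal sketch of the two conversions above, assuming NumPy is available; D is any dissimilarity (e.g. distance) matrix.

import numpy as np

def similarity_matrix(D, method="reciprocal"):
    D = np.asarray(D, dtype=float)
    S = 1.0 / (1.0 + D) if method == "reciprocal" else 1.0 - D / D.max()
    np.fill_diagonal(S, 1.0)              # s_ss = 1
    return S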
b similarity matrix [GEOM] + similarity index
b similarity transformation [ALGE] Multiplication by a nonsingular matrix Z such that for two matrices X and Y the following equations hold:
X Z = Z Y    Y = Z^-1 X Z
The matrices X and Y are called similar and their eigenvalues are the same.
b simple matching coefficient [GEOM] + distance (o binary data)
b simplex [OPTIM] Geometric figure formed by a set of p + 1 points in the p-dimensional space. In two dimensions the simplex is a triangle, in three dimensions it is a tetrahedron, etc. When the points are equidistant the simplex is called a regular simplex. A simplex is used, for example, in simplex optimization to find the minimum or maximum of a function.
b simplex centroid design [EXDE] + design
b simplex lattice design [EXDE] + design
b simplex optimization [OPTIM] Direct search optimization based on comparing the values of a function at the p + 1 vertices of a simplex and moving the simplex towards the optimum during an iterative procedure. The technique as originally proposed uses a regular simplex; however, allowing the simplex to be nonregular (points not equidistant) increases the power and efficiency of the optimization. The simplex is moved toward the optimum via three basic operations: reflection, contraction, and expansion. The process starts with a set of p + 1 initial parameter vectors and evaluates the function at each vertex.
In minimization the simplex is moved away from the vertex with the largest function value p_C by reflecting this vertex in the opposite face of the simplex. The new reflected vertex p_R is on the line joining the centroid of the points p_P and the vertex to be eliminated p_C:
p_R = p_P + α (p_P - p_C)
where α is the reflection coefficient. Expansion (p_R+) and contraction (p_R-) help to move the simplex in a valley, where the function value could be the same at the old and the new vertices.
[Figure: reflection of a simplex vertex through the opposite face.]
The convergence criterion generally used in simplex optimization requires the standard deviation of the function values at the p + 1 vertices to be less than a prespecified small value.
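A minimal usage sketch relying on SciPy's Nelder-Mead implementation of the simplex search; the objective function and starting point below are hypothetical.

import numpy as np
from scipy.optimize import minimize

def f(p):
    # simple two-parameter objective with minimum at (3, -2)
    return (p[0] - 3.0) ** 2 + (p[1] + 2.0) ** 2

res = minimize(f, x0=np.array([0.0, 0.0]), method="Nelder-Mead",
               options={"xatol": 1e-6, "fatol": 1e-6})
print(res.x, res.fun)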
b simulated annealing (SA) [OPTIM] Direct search optimization searching for the most probable configuration of parameters on the basis of simulating the evolution of a substance to thermal equilibrium. The distribution of configurations s is expressed by the Boltzmann distribution:
P(s) = exp[-C(s)/c] / Σ_w exp[-C(w)/c]
where C(s) is the function to be minimized, s and w are configurations and c is a control parameter. Starting with an initial configuration s of the parameters, another configuration r in the neighborhood of s is produced by modifying one randomly selected parameter. The new configuration is accepted with probability 1 if ΔC(r, s) ≤ 0, otherwise with probability
P = exp[-ΔC(r, s)/c]
This probability is compared to a random number from the uniform distribution [0,1] and the new configuration is accepted if P is larger than the random number. The iteration continues until convergence is reached. Then the control parameter c is lowered and the optimization continues with the new parameter. Generalized simulated annealing (GSA) and variable step size generalized simulated annealing (VSGSA) are improvements on SA.
b simulation [PROB] (: Monte Carlo simulation) Technique for imitating the random process of drawing samples from a predefined population in order to obtain estimates of the population parameters. Given a mathematical formula that cannot easily be evaluated by analytical reduction or by standard procedures of numerical analysis, it is often possible to find a stochastic process for generating statistical variables with frequency distributions that can be related to the mathematical formula. The simulation actually generates a sample, determines its empirical distribution and uses it in a numerical evaluation of the formula. A random series is a series of numbers drawn randomly from a distribution. It has an important role in simulation studies. Simulation is used to evaluate the behavior of a statistical method, to compare several similar statistical methods, or to solve mathematical problems arising from stochastic processes. The advantage of using simulation instead of real data sets is that in the former case the distribution of the underlying population is known.
b single linkage [CLUS] + hierarchical clustering (o agglomerative clustering)
b single regression model [REGR] + regression model
b single sampling [QUAL] + acceptance sampling
b single tail test [TEST] + hypothesis testing
b singular matrix [ALGE] + matrix
b singular value [ALGE] + matrix decomposition (o singular value decomposition)
b singular value decomposition (SVD) [ALGE] + matrix decomposition
b singular vector [ALGE] + matrix decomposition (o singular value decomposition)
b size difference coefficient [GEOM] + distance (o quantitative data)
b size of a test [TEST] + hypothesis testing
b skewness [DESC] Measure of asymmetry of a distribution around its location. For a symmetric distribution the mean, median and mode are equal, i.e. there is no skewness. If there is a longer tail on the right, i.e. the mean is larger than the median and the median is larger than the mode, there is positive skewness. If there is a longer tail on the left, i.e. the mean is smaller than the median and the median is smaller than the mode, there is negative skewness.
[Figure: density curves illustrating positive and negative skewness.]
A list of the most common skewness measures follows.
Bonferroni index A very sensitive measure, defined in terms of the deviations from the mean, weighted by the frequency distribution, with
s_i = f_i (x_i - x̄)    i = 1, n
B = 0 if the distribution is perfectly symmetric, and approaches 1 as the distribution becomes increasingly asymmetric. This index is based on the following relationship, which holds for symmetric ranked data:
x_i + x_{n-i+1} = constant = 2 x̄
coefficient of skewness Scaled difference between the arithmetic mean and the mode:
b_1j = (x̄_j - mode[x_j]) / s_j
where s_j is the standard deviation. In a p-dimensional space, the multivariate measure of skewness is defined as
b_{1,p} = (1/n^2) Σ_s Σ_t d_st^3    s, t = 1, n
where d_st is the square root of the Mahalanobis distance between objects s and t:
d_st = (x_s - x̄)^T S^-1 (x_t - x̄)
where S^-1 is the inverse covariance matrix. For p = 1, b_{1,p} = b_1.
Pearson's first index Scaled sum of the cubed differences from the arithmetic mean:
γ_1j = Σ_i (x_ij - x̄_j)^3 / (n s_j^3)
γ_1 = 0 for a symmetric distribution; γ_1 > 0 for a right-tailed distribution and γ_1 < 0 for a left-tailed distribution.
Pearson's second index Scaled difference between the arithmetic mean and the median:
γ_2j = 3 (x̄_j - median[x_j]) / s_j
quartile coefficient of skewness Measure based on the three quartiles:
(Q3 - 2 Q2 + Q1) / (Q3 - Q1)
where Q3 is the upper quartile, Q1 is the lower quartile and Q2 is the median.
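A minimal sketch computing the moment-based index, Pearson's second index and the quartile coefficient for a univariate sample, assuming NumPy is available.

import numpy as np

def skewness_measures(x):
    x = np.asarray(x, dtype=float)
    n, mean, med, s = x.size, x.mean(), np.median(x), x.std(ddof=0)
    gamma1 = np.sum((x - mean) ** 3) / (n * s ** 3)   # scaled sum of cubed deviations
    pearson2 = 3.0 * (mean - med) / s                 # Pearson's second index
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    quartile = (q3 - 2.0 * q2 + q1) / (q3 - q1)       # quartile coefficient
    return gamma1, pearson2, quartile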
b skip-lot sampling [QUAL] + acceptance sampling
b slicing [GRAPH] + interactive computer graphics (o animation)
b smooth multiple additive regression technique (SMART) [REGR] (o projection pursuit regression) Nonparametric multiple response nonlinear regression model that describes each response (usually) as a different linear combination of the predictor functions f_m. Each predictor function is taken as a smooth but otherwise unrestricted function (usually) of a different linear combination of the predictor variables. The model is:
y_k = ȳ_k + Σ_{m=1}^{M} q_mk f_m( Σ_j w_mj x_j ) + e_k
where q_mk and w_mj are linear coefficients for the predictor functions and for the predictor variables, respectively. The least squares solution is obtained by simultaneously estimating, in each component m = 1, M, the linear coefficients q and w and the nonlinear function f. The coefficients q_mk are estimated by univariate least squares regression, the coefficients w_mj by a Gauss-Newton step, and the function f_m by a smoother. The optimal number of components M is estimated by cross-validation.
b smoother [REGR] Function estimator which calculates the conditional expectation
f(x) = E[y | x]
There are two basic kinds of smoothers: the kernel smoother and the window smoother. The kernel smoother estimates the above conditional expectation at x_i by assigning weights to the points, fitting a weighted polynomial to all the points and taking the fitted response value at x_i. The largest weight is put at x_i and the rest of the weights decrease symmetrically as the points become further from x_i. A popular robust kernel smoother is the locally weighted scatter plot smoother. The window smoother can be considered as a special case of the kernel smoother in which all points within a certain interval (window) N_i around x_i have weight 1 and all points outside the interval have weight 0. According to the degree of the polynomial the smoother can be local averaging (degree zero), local linear fit (degree one), local quadratic fit (degree two), etc. The local averaging window smoother calculates the fitted value f̂_i as the average of the y values for those points with x values in an interval N_i around x_i:
f̂_i = ave{ y_j : x_j ∈ N_i }
The local averaging, although a commonly used technique, has some serious shortcomings. It does not reproduce a straight line if the x values are not equispaced and it has bad behavior at the boundaries. The local linear fit alleviates both problems. It calculates the smooth value f̂_i by fitting a straight line (usually by least squares) to the points in the interval N_i and taking the fitted response value at x_i. Higher degree polynomials can be fitted in a similar fashion.
[Figures: scatter plots of y vs. x (0 to 6) with fitted smooths; one panel is labelled span = 0.1.]
In window smoothers the key point is how to select the size of the interval, also called the span parameter. This parameter controls the trade-off between bias and variance of the estimated smooth. Increasing the span value, i.e. the size of the interval, increases the bias and decreases the variance. A larger span value makes the smooth less wiggly. Ideally, the optimal span value should be estimated via cross-validation. In contrast to the fixed span smoother, the adaptive smoother uses a span parameter that varies over the range of x. This smoother is preferable if the error variance or the second derivative of the underlying function changes over the range of x.
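A minimal sketch of a fixed-span window smoother (local averaging for degree zero, local polynomial fit otherwise), assuming NumPy is available; the span is interpreted here as a fraction of the x range.

import numpy as np

def window_smooth(x, y, span=0.3, degree=1):
    x, y = np.asarray(x, float), np.asarray(y, float)
    half = 0.5 * span * (x.max() - x.min())           # half-width of the window
    fitted = np.empty_like(y)
    for i, xi in enumerate(x):
        in_win = np.abs(x - xi) <= half               # points inside N_i
        if degree == 0 or in_win.sum() <= degree:
            fitted[i] = y[in_win].mean()              # local averaging
        else:
            coef = np.polyfit(x[in_win], y[in_win], degree)
            fitted[i] = np.polyval(coef, xi)          # local polynomial fit
    return fitted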
b soft independent modeling of class analogy (SIMCA) [CLAS] Parametric classification method designed to deal with a low object-variable ratio. Each class is represented by a principal component model, usually of fewer components than the original dimensionality, and the classification rule is based on the distances of objects from these class models. These object-class distances are calculated as squared residuals:
e_ig^2 = || x_i - c_g - Σ_{m=1}^{M} t_img l_mg ||^2    i = 1, n    g = 1, G
where i is the object, g is the class and m is the component index, c_g is the class centroid, t_img denotes the principal component scores and l_mg the corresponding loadings, in the mth component and gth class. The optimal number of components, M, is determined for each class separately by double cross-validation. This procedure results in principal component models which are optimal for describing the within-class similarities but not necessarily optimal for discriminating among classes.
[Figure: SIMCA class models (class boxes) in a three-dimensional variable space.]
These class models define unbounded M-dimensional subspaces in the p-dimensional pattern space. In order to delimit the models, i.e. to create class boxes, normal ranges are defined using the class residual standard deviations s_g.
SIMCA calculates both modeling and classification power for each variable based on the residuals. Similar to RDA and DASCO, SIMCA can be viewed as a modification of quadratic discriminant analysis where the class covariance matrices are estimated by a truncated principal component representation.
b soft model [MODEL] + model
b Sokal-Michener coefficient [GEOM] + distance (o binary data)
b Sokal-Sneath coefficient [GEOM] + distance (o binary data)
b Sorenson coefficient [GEOM] + distance (o binary data)
b spanning tree [MISC] + graph theory
b Spearman's ρ coefficient [DESC] + correlation
b specific factor [FACT] : unique factor
b specific variance [FACT] + factor analysis
b specification limits [QUAL] + lot
b specificity of classification [CLAS] + classification
b spectral decomposition [ALGE] + matrix decomposition
b spectral density function [TIME] (: spectrum) Function of a stationary time series x(t), t = 1, n defined as:
f(ω) = γ(0) + 2 Σ_t γ(t) cos(t ω)
or in normalized form:
f(ω)/γ(0) = 1 + 2 Σ_t ρ(t) cos(t ω)
where γ(t) is the autocovariance function and ρ(t) is the autocorrelation function, and 0 ≤ ω ≤ π. Its integrated form is called the spectral function:
F(ω) = ∫_0^ω f(θ) dθ
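A minimal sketch evaluating f(ω) from the sample autocovariances of a series, assuming NumPy is available; truncating the sum at a maximum lag is an assumption made here for stability.

import numpy as np

def spectral_density(x, omegas, max_lag=None):
    x = np.asarray(x, float) - np.mean(x)
    n = x.size
    if max_lag is None:
        max_lag = n - 1
    # sample autocovariances gamma(t), t = 0..max_lag
    gamma = np.array([np.sum(x[:n - t] * x[t:]) / n for t in range(max_lag + 1)])
    t = np.arange(1, max_lag + 1)
    return np.array([gamma[0] + 2.0 * np.sum(gamma[1:] * np.cos(t * w))
                     for w in np.atleast_1d(omegas)])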
spectral function [TIME] + spectral density function b
b spectral map analysis (SMA) [MULT] Dimension reduction and display technique related to biplot and correspondence factor analysis. It was developed for the graphical analysis of drug contrasts. Contrast is defined here as the logarithm of an activity ratio (specificity) in proportion to its mean. The word spectra here refers to the activity spectra of drugs, i.e. the logarithm of activities in various tests. The map of compounds described by their activity spectra is obtained after special scaling.
b spectrum [ALGE] + eigenanalysis
b spectrum [TIME] : spectral density function
b spherical data [PREP] + data
b spline [REGR] Function estimate obtained by fitting piecewise polynomials. The x range is split into intervals separated by so-called knot locations. A polynomial is fitted in each interval, with the constraint that the function be continuous at the knot locations. The integral and derivative of a spline is also a spline, of one degree higher or lower, often also with a continuity constraint. The degree of a spline can range from zero to very high; however, the first-, second-, and third-degree splines are of more practical use.
[Figure: piecewise polynomial (spline) fit to scatter data over 0 ≤ x ≤ 6, with three knot locations marked.]
A spline is defined by its degree, by the number of knot locations, by the position of the knots and by the coefficients of the polynomial fitted in each interval. A spline of degree m with N knot locations (t_k, k = 1, N) can be written in a general form as:
y_i = Σ_{j=0}^{m} b_0j x_i^j + Σ_{k=1}^{N} Σ_{j=0}^{m} b_kj (x_i - t_k)_+^j + e_i
where x_i^j and (x_i - t_k)_+^j are called basis functions. The notation (.)_+ means the positive part, i.e.
(x_i - t_k)_+^j = (x_i - t_k)^j    if x_i > t_k
(x_i - t_k)_+^j = 0                if x_i ≤ t_k
This representation casts the spline as an ordinary regression equation. The coefficients b_0j and b_kj are estimated by minimizing the least squares criterion. Depending on the continuity requirements at the various knot locations, not all of the above basis functions are present in a spline, i.e. some of the b_kj coefficients are zero. A frequently used spline of degree m with N knots and with continuity constraints on the function and on its derivatives up to degree m - 1 has the form:
y_i = Σ_{j=0}^{m} b_j x_i^j + Σ_{k=1}^{N} b_k (x_i - t_k)_+^m + e_i
The number of coefficients in such a spline is m + N + 1. There are several equivalent basis function representations of the same spline. Another form of the above spline is:
y_i = b_0 + Σ_{k=1}^{N} b_k [s_k (x_i - t_k)]_+^m + e_i
where s_k is either +1 or -1. In fitting a spline one must select the degree m, and the number and location of the knots. The degree of the spline is sometimes fixed a priori. The number and location of the knots are either fixed or variable. Splines with variable knot locations are called adaptive splines. Adaptive splines offer a more flexible function approximation than splines with fixed knot locations. The bias-variance trade-off is controlled by the degree of the fitted polynomial and the number of knots. Increasing the degree and the number of knots increases the variance and decreases the bias of the spline. Ideally, one should estimate the optimal degree m, the optimal number of knots N and the optimal knot locations by cross-validation to obtain the best predictive spline.
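A minimal sketch of fitting the truncated power basis spline above by ordinary least squares, assuming NumPy is available; the degree and knot locations are taken as fixed inputs.

import numpy as np

def spline_basis(x, knots, degree=3):
    # basis functions: 1, x, ..., x^m and (x - t_k)_+^m for each knot t_k
    x = np.asarray(x, float)
    cols = [x ** j for j in range(degree + 1)]
    cols += [np.clip(x - t, 0.0, None) ** degree for t in knots]
    return np.column_stack(cols)                      # m + N + 1 columns

def fit_spline(x, y, knots, degree=3):
    B = spline_basis(x, knots, degree)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)      # least squares coefficients
    return lambda x_new: spline_basis(x_new, knots, degree) @ coef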
b spline partial least squares regression (SPLS) [REGR] + partial least squares regression
b split-plot design [EXDE] + design
b spurious correlation [DESC] + correlation
b square matrix [ALGE] + matrix
b square root transformation [PREP] + transformation
b square transformation [PREP] + transformation
b stability of clustering [CLUS] + assessment of clustering
b stacking [GRAPH] + dot plot
b stagewise regression [REGR] + variable subset selection
b standard addition method (SAM) [REGR] Calibration procedure used in chemistry to correct for matrix effects. The chemical sample is divided into several equal-volume aliquots and increasing amounts of standards are added to all but one aliquot. Each aliquot is diluted to the same volume, a response y_i is measured and plotted as a function of x_i, the amount of standard added. The regression model is:
y_i = b_1 (θ + x_i) + e_i    i = 1, n
where θ denotes the unknown amount of the analyte. The intercept is the response for the aliquot without standard addition:
b_0 = b_1 θ
therefore the unknown amount of analyte is given by b_0/b_1. The key assumption is that the linearity of the model holds over the range of the calibration including zero response. SAM cannot be used when spectral interferences are present. The generalized standard addition method (GSAM) is the multivariate extension of SAM used to correct for spectral interferences and matrix effects simultaneously. The key equations are
r_0 = K^T c_0    ΔR = ΔC K
where ro is the response vector of p sensors, co is the concentration vector of n analytes, K is the n x p calibration matrix, and AR and AC are changes in response and concentration, respectively, for m standard additions.
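A minimal numerical sketch of univariate SAM, assuming NumPy is available; the added amounts and responses below are hypothetical and only illustrate the estimate θ = b_0/b_1.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])    # amounts of standard added (hypothetical)
y = np.array([2.1, 3.0, 4.2, 5.1, 6.0])    # measured responses (hypothetical)

b1, b0 = np.polyfit(x, y, 1)               # fit y = b0 + b1*x
theta = b0 / b1                            # estimated unknown amount of analyte
print(theta)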
b standard deviation [DESC] + dispersion
b standard deviation chart [QUAL] + control chart (o variable control chart)
b standard deviation of error of calculation (SDEC) [MODEL] + goodness of fit
b standard deviation of error of prediction (SDEP) [MODEL] + goodness of prediction
b standard error [MODEL] + goodness of fit
b standard error of estimate [ESTIM] Standard deviation of an estimated value. For example, the standard error of the mean (SEM) calculated from n observations is s/√n, where s is the standard deviation of the n observations. The standard errors of the estimated regression coefficients b and of the estimated response ŷ_i in OLS are
SE(b_j) = s sqrt{ [(X^T X)^-1]_jj }
SE(ŷ_i) = s sqrt{ x_i^T (X^T X)^-1 x_i }
where s is the residual standard deviation.
b standard error of the mean (SEM) [ESTIM] + standard error of estimate
b standard order of runs [EXDE] : Yates order
b standard score [PREP] + standardization (o autoscaling)
b standardization [PREP] Simple transformation of the elements of a data matrix. It can be performed columnwise (called variable standardization), rowwise (called object standardization), both ways (called double standardization), or elementwise (called global standardization). Variable standardization results in variables which are independent of the unit of measurement. Scale variant estimators are greatly influenced by the previously performed standardization. Object standardization often results in closed data. The most common standardization procedures follow.
o autoscaling One of the most common column standardizations, composed of a column centering and a column scaling:
x'_ij = (x_ij - x̄_j) / s_j
The mean of an autoscaled variable is 0 and the variance is 1. An autoscaled variable is often simply called a standardized variable, its value is called the z-score or standard score.
o centering Scale shift (translation) by subtracting a constant (the mean), resulting in zero mean of the standardized elements. Centering can be:
- row centering:    x'_ij = x_ij - x̄_i        x̄_i = Σ_j x_ij / p
- column centering: x'_ij = x_ij - x̄_j        x̄_j = Σ_i x_ij / n
- double centering: x'_ij = x_ij - x̄_i - x̄_j + x̄
- global centering: x'_ij = x_ij - x̄          x̄ = Σ_i Σ_j x_ij / (n p)
logarithmic scaling Scale shift based on a logarithmic transformation followed by column centering to mitigate extreme differences between variances:
x'_ij = log(x_ij) - Σ_i log(x_ij) / n
maximum scaling Column standardization where each value is divided by the maximum value of its column:
x'_ij = x_ij / max_i(x_ij)
All the values in a maximum scaled variable have an upper limit of one.
o profile Standardization that results in unit sum or unit sum of squares of the standardized elements. The profiles can be

row profile:  x'_ij = x_ij / Σ_j x_ij
normalized row profile:  x'_ij = x_ij / √(Σ_j x_ij²)
column profile:  x'_ij = x_ij / Σ_i x_ij
normalized column profile:  x'_ij = x_ij / √(Σ_i x_ij²)
global profile:  x'_ij = x_ij / Σ_i Σ_j x_ij
normalized global profile:  x'_ij = x_ij / √(Σ_i Σ_j x_ij²)
o range scaling Column standardization where each value in the column is centered by the minimum value of the column L_j and divided by the range of the column U_j − L_j:

x'_ij = (x_ij − L_j) / (U_j − L_j)

In a range-scaled variable all values lie between 0 and 1. Range scaling where the values of the variable are expanded or compressed between prespecified limits A_j (lower) and B_j (upper) is called generalized range scaling:

x'_ij = A_j + (B_j − A_j)(x_ij − L_j) / (U_j − L_j)
o scaling Scale expansion (contraction) by dividing by a constant (the standard deviation) that results in unit variance of the standardized elements. Scaling can be

row scaling:  x'_ij = x_ij / s_i
column scaling:  x'_ij = x_ij / s_j   where  s_j = √[Σ_i (x_ij − x̄_j)² / n]
global scaling:  x'_ij = x_ij / s   where  s = √[Σ_i Σ_j (x_ij − x̄)² / (n p)]
► standardized linear combination (SLC) [ALGE] Linear combination in which the sum of the squared coefficients is equal to one, i.e. the coefficient vector has unit length. For example, principal components calculated from a correlation matrix are standardized linear combinations.
► standardized regression coefficient [REGR] → regression coefficient
► standardized residual [REGR] → residual
► standardized residual plot [GRAPH] → scatter plot (o residual plot)
► standardized variable [PREP] → variable
► star point [EXDE] (o composite design) → design
► star symbol [GRAPH] → graphical symbol
► state-space model [TIME] (: Bayesian forecast, dynamic linear model, Kalman filter) Linear model in which the parameters are not constant, but change in time. The linear equation, called the observational equation, is:

y(t) = b(t)^T x(t) + e(t)

The response y(t) is a quantity observed in time, x(t) is the known predictor vector, and e(t), the error term, is a white noise process. The parameter vector b(t), called the state vector, is a time series described by the state equation:

b(t) = G b(t − 1) + c a(t)

where a(t) is a white noise process independent of e(t), and G and c are coefficients.
► stationarity [TIME] The phenomenon that the probabilistic structure of a time series x(t) does not change with time. In practice it implies that the mean and the variance of the series are independent of time, and that the covariance depends only on the separation in time. Stationarity allows replication within a time series, thus making formal inference possible.
► stationary process [TIME] → stochastic process
► stationary time series [TIME] → stochastic process (o stationary process)
► statistic [DESC] Numerical summary of a set of observations; a particular value of an estimator. If the observations are regarded as a sample from a population, then the calculated statistic is taken to be an estimate of the population parameter. For example, the arithmetic mean of a set of observations can be used as an estimate of the population mean.
► statistical distribution [PROB] : distribution
► statistical process control (SPC) [QUAL] → control chart
► statistical quality control [QUAL] : quality control
► statistics [DESC] A branch of mathematics concerned with collecting, organizing, analyzing and interpreting data. There are two major problems in statistics: estimation and hypothesis testing. In the case of inference statistics the data set under consideration is a sample from a population, the calculated statistic is taken as an estimate of the population parameter, and the conclusions about the properties of the data set are translated to the underlying population. In contrast, the goal of descriptive statistics is simply to analyze, model or summarize the available data without further inference. A data set can be described by frequency, location, dispersion, skewness, kurtosis, quantiles. The relationship between two variables can be described by association, correlation, covariance. Scatter matrix, correlation matrix, covariance matrix and multivariate dispersion are characteristics of multivariate data. Statistics may also be divided into theoretical statistics and data analysis.
b steepest ascent optimization [OPTIM] + steepest descent optimization
steepest descent optimization [OPTIM] Gradient oDtimization that minimizes a function f by estimating the optimal parameter values following the negative gradient direction. The iterative procedure starts with an initial guess for the p parameter values PO. In each iteration i one calculates the gradient, i.e. the partial first derivatives of the function with respect to the parameters: b
gi = (af/apli, a f / a p ~ i*, - - af/appi) 9
and a new set of parameter values is obtained as:
where the step size Si is determined by a linear search optimization. Steepest descent is a gradient optimization method where Ai is the identity matrix 1. Moving along the negative gradient direction ensures that the function value decreases at the fastest rate. However, this is only a local property, so frequent changes of direction are often necessary, making the convergence very slow, hence the optimization is quite inefficient. The method is sensitive to small perturbations in direction and step size, so these must be computed to high precision. The main problem with steepest descent is that the second derivatives describing the curvature of the function near the minimum are not taken into account. Because of its drawbacks this optimization is seldom used nowadays. The opposite procedure, which maximizes a function by searching for the optimal parameter values along the positive gradient, is called the steepest ascent optimization. b stem-and-leaf diagram [GRAPH] Part graphical, part numerical display of a univariate distribution. As in the histogram, the range of the data is partitioned into intervals. These intervals are established by first writing all possible leading digits in the range of the data to the left of a vertical line. Each object in the interval then is represented by its trailing digit, written to the right of the vertical line. The leading digits on the left form the stem and the trailing digits on the right are called the leaves.
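As a minimal illustration of the procedure described above (added here, not from the original text), the following Python sketch minimizes an invented quadratic function by steepest descent, with a fixed step size standing in for the line search:

import numpy as np

def f(p):
    # Invented test function: an elliptical bowl with minimum at (3, -2).
    return (p[0] - 3.0) ** 2 + 5.0 * (p[1] + 2.0) ** 2

def grad_f(p):
    # Partial first derivatives of f with respect to the parameters.
    return np.array([2.0 * (p[0] - 3.0), 10.0 * (p[1] + 2.0)])

p = np.zeros(2)          # initial guess p0
step = 0.05              # fixed step size (a line search would tune this per iteration)
for _ in range(500):
    g = grad_f(p)
    if np.linalg.norm(g) < 1e-8:   # stop when the gradient is numerically zero
        break
    p = p - step * g               # move along the negative gradient direction

print(p)   # approaches the minimum at [3, -2]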
► stem-and-leaf diagram [GRAPH] Part graphical, part numerical display of a univariate distribution. As in the histogram, the range of the data is partitioned into intervals. These intervals are established by first writing all possible leading digits in the range of the data to the left of a vertical line. Each object in the interval is then represented by its trailing digit, written to the right of the vertical line. The leading digits on the left form the stem and the trailing digits on the right are called the leaves.

DATA: 100 102 111 115 117 120 129 131 133 133 141 143 144 144 144 145 152 158 158 159 160 163 164 172 181 186 195

stem | leaf
 10  | 0 2
 11  | 1 5 7
 12  | 0 9
 13  | 1 3 3
 14  | 1 3 4 4 4 5
 15  | 2 8 8 9
 16  | 0 3 4
 17  | 2
 18  | 1 6
 19  | 5

This is a compact way to record the data, while also giving visual information about the shape of the distribution. The length of each row represents the density of objects in the corresponding interval. It is often necessary to change measurement units or to ignore some digits to the right.
► step direction [OPTIM] → optimization
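A short Python sketch (illustrative only, using the data listed above) that groups trailing digits under their leading digits to build the display:

from collections import defaultdict

data = [100, 102, 111, 115, 117, 120, 129, 131, 133, 133, 141, 143, 144, 144,
        144, 145, 152, 158, 158, 159, 160, 163, 164, 172, 181, 186, 195]

# Group trailing digits (leaves) under their leading digits (stems).
stems = defaultdict(list)
for value in sorted(data):
    stems[value // 10].append(value % 10)

for stem in sorted(stems):
    print(f"{stem:3d} | {' '.join(str(leaf) for leaf in stems[stem])}")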
► step size [OPTIM] → optimization
► stepwise linear discriminant analysis (SWLDA) [CLAS] → discriminant analysis
► stepwise regression (SWR) [REGR] → variable subset selection
► stochastic model [MODEL] → model
► stochastic process [TIME] (: random process) Random phenomenon that can be described by at least one random variable x(t), where t is a parameter belonging to an index set T. Usually t is interpreted as time, but it can also refer to a distribution in space. The process can be either continuous or discontinuous. Random walk and white noise are special stochastic processes. A list of the most important stochastic processes follows.

counting process Integer-valued, continuous stochastic process N(t) of a series of events, in which N(t) represents the total number of occurrences of the event in the time interval (0, t). If the time intervals between successive occurrences (interarrival times) are i.i.d. random variables, the process is called a renewal process. If these time intervals follow an exponential distribution, the process is called a Poisson process. If the series of occurrences are repeated trials with two outcomes (e.g. success or failure), the process is called a Bernoulli process.

ergodic process Stochastic process in which the time average of a single record x(t) is approximately equal to the ensemble average of x(t). The ergodic property of a stochastic process is commonly assumed to be true in engineering and the physical sciences; therefore parameters may be estimated from the analysis of a single record.

independent increment process Stochastic process in which the quantities x(t+1) − x(t) are statistically independent.

Markov process Stochastic process in which the conditional probability distribution at any point x(t) depends only on the immediate past value x(t − 1), but is independent of the history of the process prior to t − 1. A Markov process having discrete states is called a Markov chain, while a Markov process with continuous states is called a diffusion process.

narrow-band process Stationary stochastic process continuous in time and state

x(t) = A(t) cos[c t + φ(t)]

where c is a constant, A(t) is the amplitude and φ(t) is the phase of the process. A stochastic process that does not satisfy this condition is called a wide-band process.

normal process Stochastic process in which at any given time t the random variable x(t) is normally distributed.

shot noise process Stochastic process induced by a sequence of impulses applied to a system at random time points t_n:

x(t) = Σ_{n=1}^{N(t)} A_n w(t, t_n)

where w(t, t_n) is the response of the system at time t resulting from an impulse A_n at time t_n, and N(t) is a counting process with interarrival times t_n.

stationary process Stochastic process with stationarity. A process that does not satisfy stationarity is called an evolutionary process. A time series that is a stationary process is called a stationary time series.

Wiener-Levy process Stationary independent increment process in which every independent increment is normally distributed, the average value of x(t) is zero, and x(0) = 0. The most common Wiener-Levy process is the Brownian motion process. It is also widely used in other fields such as quantum mechanics and electric circuits.
► stochastic variable [PREP] → variable
► strategy [MISC] → game theory
► stratified sampling [PROB] → sampling
► Studentized residual [REGR] → residual
► Studentized residual plot [GRAPH] → scatter plot (o residual plot)
► Student's t distribution [PROB] → distribution
► subdiagonal element [ALGE] → matrix
► submatrix [ALGE] → matrix operation (o partitioning of a matrix)
► subsampling [PROB] → sampling (o cluster sampling)
► subtraction of matrices [ALGE] → matrix operation
► sufficient estimator [ESTIM] → estimator
► sum of squares in ANOVA (SS) [ANOVA] Column in the analysis of variance table containing the squared deviations of the observations from the grand mean or from an effect mean, summed over the observations. It is customary to indicate summation over an index by replacing that index with a dot. For example, in a one-way ANOVA model the effect of treatment A_i is calculated as:

A_i = ȳ_i· = y_i· / K = Σ_k y_ik / K

where K is the number of observations at level i, and the sum of squares associated with the effect A is:

SS_A = K Σ_i (ȳ_i· − ȳ··)²

► sum of squares linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
► superdiagonal element [ALGE] → matrix
► supervised learning [MULT] → pattern recognition
► supervised pattern recognition [MULT] → pattern recognition
► survival function [PROB] → random variable
► sweeping [ALGE] → Gaussian elimination
► symmetric design [EXDE] → design
► symmetric distribution [PROB] → random variable
► symmetric matrix [ALGE] → matrix
► symmetric test [TEST] → hypothesis testing
► synapse [MISC] → neural network
► systematic distortion [ESTIM] : bias
► systematic sampling [PROB] → sampling
T

► t distribution [PROB] → distribution
► t test [TEST] → hypothesis test
► Taguchi method [QUAL] Quality control approach suggesting that statistical testing of a product should be carried out at the design level, called off-line quality control, in order to make the product robust against variations in manufacturing. This proposal is different from traditional on-line quality control, such as acceptance sampling and statistical process control. The Taguchi method is based on minimizing the variability of the product or process, either by keeping the quality on a target value or by optimizing the output. The quality is measured by statistical variability, for example mean squared error or standard deviation, rather than by percentage of defects or other criteria based on control limits. Taguchi makes a distinction between variables that can be controlled and noise variables. He suggests systematically including the noise variables in the parameter design. The variables in the parameter design can be classified into two groups: the ones that affect the mean response and the ones that affect the variability of the response.
► Tanimoto coefficient [GEOM] → distance (o binary data)
► target rotation [FACT] → factor rotation
► target transformation factor analysis (TTFA) [FACT] → factor rotation
► taxi distance [GEOM] → distance (o quantitative data)
► tensor [ALGE] Mathematical object, a generalization of the vector relative to a local Euclidean space, that possesses a specified system of components for every coordinate system that changes under a transformation of coordinates. The simplest tensors are building blocks of linear algebra: zero-order tensors are scalars, first-order tensors are vectors, and second-order tensors are matrices.
► term in ANOVA [ANOVA] Categorical predictor variable in the analysis of variance model. There are two kinds of terms: a main effect term and an interaction term. A main effect term, also called an effect, is a measurable or observable quantity that affects the outcome of the observations. It is measured on a nominal scale, i.e. it assumes categorical values, called levels. The effects are commonly denoted by consecutive upper case letters (A, B, C, etc.), the levels are indicated by a lower case index, and the number of levels by the corresponding upper case letter:

A_i   i = 1, ..., I        B_j   j = 1, ..., J

An interaction term consists of more than one effect, and takes on values corresponding to the possible combinations of the levels of the effects, also called a treatment. There are two ways to combine effects. In a crossed effect term all combinations of levels are possible. For example, if effect A_i has two levels and effect B_j has four levels, then the term AB_ij has eight levels: (1,1) (1,2) (1,3) (1,4) (2,1) (2,2) (2,3) (2,4). In a nested effect term the level combinations are restricted. For example, the same B effect nested in A results in the term B(A)_j(i) with four levels: (1,1) (1,2) (2,3) (2,4). An effect that has a fixed number of levels I is called a fixed effect. The corresponding main effect term in the ANOVA model is not considered to be a random variable. An interaction term that contains only fixed effects is also fixed, i.e. it is not random. A model containing only fixed effects is called a fixed effect model, or a first kind model or model I. In this model the treatment effects A_i, B_j, AB_ij, etc. are defined as deviations from the grand mean, therefore:

Σ_i A_i = 0        Σ_j B_j = 0        Σ_i Σ_j AB_ij = 0

In this model conclusions from testing H0: A_i = 0 apply only to the I levels included in the model. An effect that has a large number of possible levels from which I levels have been randomly selected is called a random effect. The corresponding main effect term in the ANOVA model is a random variable. All interaction terms that contain random effects are also considered to be random. A model containing only random effects is called a random effect model, or a second kind model or model II. The variance of a random effect term (and of the error term) is called a variance component or component of variance, denoted as σ²_A, σ²_B, σ²_AB, etc. In a random effect model conclusions from testing H0: σ²_A = 0 apply beyond the effect levels in the model, i.e. an inference is drawn about the entire population of effect levels. A model containing both fixed effects and random effects is called a mixed effect model.
► terminal node [MISC] → graph theory (o digraph)
► test [TEST] : hypothesis test
► test set [PREP] → data set
► test statistic [TEST] → hypothesis testing
► theoretical distribution [PROB] → random variable
► theoretical variable [PREP] → variable
► three-dimensional motion graphics [GRAPH] → interactive computer graphics
► tied rank [DESC] → rank
► time series [TIME] Set of observations x(t), ordered in time, where t indicates the time when x(t) was taken. The observations are often equally spaced in time. In the case of multivariate observations the scalar x(t) is replaced by a vector x(t). It is assumed that the observations are realizations of a stochastic process, x(t) is a random variable, and the observations made at different time points are statistically dependent. The multivariate joint distribution is described by the mean function, autocovariance function, cross-covariance function, and spectral density function. A time series can be written as a sum of four components: trend, fluctuation about the trend, seasonal component, and random component. Transforming one time series x(t) into another time series y(t) is called filtering. The simplest is the linear filter:

y(t) = a x(t)

► time series analysis (TSA) [TIME] Analysis of time series, i.e. of series of data collected as a function of time. A time series model is a mathematical description of such data. The goal of the analysis is twofold: modeling the stochastic mechanism that gives rise to an observed series and forecasting (predicting) future values of the series on the basis of its history. Another common objective is to monitor the series and to detect changes in a trend.
► time series model [TIME] Mathematical description of a time series. It is composed of two parts: one contains past values of the time series

x(t − j)        j = 0, ..., p

and the other one contains terms of a white noise process

a(t − i)        i = 0, ..., q

The parameters p and q define the complexity of the model; they are indicated in parentheses after the model name. The most commonly used models are the following.

autoregressive integrated moving average model (ARIMA) Model that can be used when a time series is nonstationary, i.e. μ(t) is not constant in time. The ARIMA(p, 1, q) model written in difference equation form is:

x(t) − x(t − 1) = Σ_j b_j [x(t − j) − x(t − j − 1)] + a(t) − Σ_i c_i a(t − i)

The simplest ARIMA models are the IMA(1,1) model:

x(t) − x(t − 1) = a(t) − c a(t − 1)

and the ARI(1,1) model:

x(t) − x(t − 1) = b [x(t − 1) − x(t − 2)] + a(t)

The ARIMA model can also be written for differences between other than neighboring points, denoted ARIMA(p, d, q). For example, ARIMA(p, 2, q) is a model for x(t) − x(t − 2). The dth difference of ARIMA(p, d, q) is a stationary ARMA(p, q) model.

autoregressive model (AR) Model in which each point is represented as a linear combination of the p most recent past values of itself plus a white noise term which is independent of all x(t − j) values:

x(t) = Σ_j b_j x(t − j) + a(t)

The simplest AR(p) is AR(1), a first-order model:

x(t) = b x(t − 1) + a(t)

with autocovariance and autocorrelation functions

γ(0) = σ_a² / (1 − b²)        γ(t) = b γ(t − 1)        ρ(t) = b^t

For a higher-order model, assuming stationarity and zero means, the autocorrelation and autocovariance functions are defined by the Yule-Walker equations:

γ(t) = Σ_j b_j γ(t − j)        ρ(t) = Σ_j b_j ρ(t − j)

autoregressive moving average model (ARMA) The ARMA(p, q) model is a mixture of AR(p) and MA(q) models:

x(t) = Σ_j b_j x(t − j) + a(t) − Σ_i c_i a(t − i)

Each point is represented as a linear combination of past values of itself, and of past and present terms of a white noise process. The simplest ARMA model is the ARMA(1,1):

x(t) = b x(t − 1) + a(t) − c a(t − 1)

with autocovariance and autocorrelation functions

γ(0) = (1 − 2bc + c²) σ_a² / (1 − b²)        γ(1) = b γ(0) − c σ_a²        γ(t) = b γ(t − 1)
ρ(t) = (1 − bc)(b − c) b^(t−1) / (1 − 2bc + c²)

Box-Jenkins model : autoregressive moving average model

moving average model (MA) Model, based on moving averages, in which each point is represented as a weighted linear combination of present and past terms of a white noise process:

x(t) = a(t) − Σ_i c_i a(t − i)

The simplest MA(q) is MA(1), a first-order model:

x(t) = a(t) − c a(t − 1)

with autocovariance and autocorrelation functions

γ(0) = σ_a² (1 + c²)        γ(1) = −c σ_a²        ρ(1) = −c / (1 + c²)        γ(t) = ρ(t) = 0  for t ≥ 2
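For illustration (not from the original text), a Python sketch that simulates an AR(1) series with an invented coefficient and checks that the sample lag-1 autocorrelation is close to the theoretical value ρ(1) = b:

import numpy as np

rng = np.random.default_rng(1)
n, b = 2000, 0.7                      # series length and AR(1) coefficient
a = rng.normal(size=n)                # white noise a(t)

# Simulate the AR(1) model x(t) = b*x(t-1) + a(t)
x = np.zeros(n)
for t in range(1, n):
    x[t] = b * x[t - 1] + a[t]

# Sample lag-1 autocorrelation; for AR(1) the theoretical value is rho(1) = b
x0 = x - x.mean()
rho1 = (x0[1:] @ x0[:-1]) / (x0 @ x0)
print(f"sample rho(1) = {rho1:.3f}, theoretical = {b}")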
► time series plot [GRAPH] → scatter plot
► tolerance limits [QUAL] → lot
► top-down induction decision tree method (TDIDT) [CLAS] : classification tree method
► total calibration [REGR] → calibration
► total sum of squares (TSS) [MODEL] Sum of squared differences between the observed values and their mean:

TSS = Σ_i (y_i − ȳ)²

TSS can be partitioned into two components: the model sum of squares, MSS, and the residual sum of squares, RSS:

TSS = MSS + RSS = Σ_i (ŷ_i − ȳ)² + Σ_i (y_i − ŷ_i)²

TSS is an estimate of the variability of y_i without any model.
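A small numerical illustration in Python (invented data, added here) showing that the decomposition TSS = MSS + RSS holds for an OLS fit with an intercept:

import numpy as np

x = np.arange(10.0)
y = 2.0 + 0.5 * x + np.random.default_rng(2).normal(scale=0.3, size=10)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)
mss = np.sum((y_hat - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)
print(tss, mss + rss)   # the two numbers agree up to rounding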
► total variation [DESC] → multivariate dispersion
► trace of a matrix [ALGE] → matrix operation
► training - evaluation set split [MODEL] → model validation
► training set [PREP] → data set
► transfer function [MISC] → neural network
► transformation [PREP] Mathematical function for transforming the original values of a variable x to new values x':

x' = f(x)

Transformations are used to:
- stabilize the variance;
- linearize the relationship among variables;
- normalize the distribution;
- represent results on a more convenient scale;
- mitigate the influence of outliers.
A transformation can be considered as a correction for undesired characteristics of a variable, such as heteroscedasticity, nonnormality, nonadditivity, or a nonlinear relationship with other variables. The following are the most common transformations.

angular-linear transformation Transformation that converts an ordinary (linear) variable x into an angular variable t:

t = 360° x / k

where k is the number of units in a full cycle. For example, the time 06:00 can be transformed to 90° (one-fourth of a cycle) with k = 24.

linear transformation Transformation that can be interpreted as a roto-translation of the original variable:

x' = a + b x

where a is the translation term (or intercept) and b is the rotation term (or slope). There are two subcases: pure rotation (a = 0) and simple translation of the origin (b = 1).

logarithmic transformation Transformation that changes multiplicative behavior into additive behavior. For example, in regression a nonlinear relationship can be changed into a linear one:

x' = log_a(k + x)

where the logarithmic base a is usually 10 or e, and k is an additive constant (often 1) to remove zero values. It is appropriate when the standard deviation of the original values is proportional to their mean (i.e. the coefficient of variation is constant).

logit transformation Transformation of percentages, proportions or count ratios (variables with values between zero and 1) in order to obtain a logit scale, where values range between about −3 (for p ≈ 0.05) and +3 (for p ≈ 0.95):

p' = ln[p / (1 − p)]

metameric transformation Transformation of the values of a dose or response variable into a dimensionless scale −1, 0 and +1. The transformed values are called metameters. It is often used to simplify the analysis of the dose-response relationship.

probit transformation Transformation (abbreviation of probability unit) to make negative values very rare in a standard normally distributed variable by adding 5 to each original value:

x' = x + 5.0

rankit transformation Transformation for rank ordering of a quantitative variable:

x' = rank(x)

reciprocal transformation

x' = 1/x

square root transformation Transformation especially applied to variables from the Poisson distribution, such as count data. In this case the variance is proportional to the mean and can be stabilized by using:

x' = √(x + 0.5)    or    x' = √(x + 3/8)

square transformation Transformation particularly useful for correcting a skewed distribution:

x' = x²

trigonometric transformation Transformations using trigonometric functions:

x' = sin(x)    x' = cos(x)    x' = tan(x)
x' = arcsin(x)    x' = arccos(x)    x' = arctan(x)

The arcsine transformation is often used to stabilize the variance, i.e. to make it close to constant for different populations from which x has been drawn. It is particularly used on percentages or proportions p to change their (usually) binomial distribution into a nearly normal distribution:

p' = arcsin(√p)

The behavior of this transformation is not optimal at the extreme values; this can be improved by using a modified form of the transformation.
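For illustration, a short Python sketch (invented values, not from the original text) applying some of the transformations listed above:

import numpy as np

p = np.array([0.05, 0.20, 0.50, 0.80, 0.95])   # invented proportions
x = np.array([0, 1, 4, 9, 16], dtype=float)    # invented counts

log_x    = np.log10(x + 1.0)          # logarithmic transformation with k = 1
logit_p  = np.log(p / (1.0 - p))      # logit transformation
arcsin_p = np.arcsin(np.sqrt(p))      # arcsine transformation of proportions
sqrt_x   = np.sqrt(x + 0.5)           # square root transformation for count data

print(logit_p.round(2))   # roughly -2.94 ... +2.94, i.e. about -3 to +3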
► transformation matrix [ALGE] : rotation matrix
► transpose of a matrix [ALGE] → matrix operation
► treatment [EXDE] → factor
► treatment structure [EXDE] → design
► tree [MISC] → graph theory
► tree symbol [GRAPH] → graphical symbol
► trend [TIME] Long-term movement in a time series, a smooth, relatively slowly changing component. It is a nonrandom function of the time series: μ(t) = E[x(t)]. The trend can be various functions of time, for example:

- linear:  μ(t) = b0 + b1 t
- quadratic:  μ(t) = b0 + b1 t + b2 t²
- cyclical:  μ(t) = μ(t + 12)
- cosine:  μ(t) = b0 + b1 cos(2π f t) + b2 sin(2π f t)
► triangle inequality [GEOM] → distance
► triangular diagram [GRAPH] Ternary coordinate system, in the form of an equilateral triangle, in which the values of the three variables sum to one. The quantities represented on the three coordinates are proportions rather than absolute values. This diagram is particularly useful for studying mixtures of three components.

[figure: triangular diagram; each of the three axes (e.g. x3) runs from 0 to 1]

► triangular distribution [PROB] → distribution
► triangular factorization [ALGE] → matrix decomposition
► triangular kernel [ESTIM] → kernel
► triangular matrix [ALGE] → matrix
► tridiagonal matrix [ALGE] → matrix
► tridiagonalization [ALGE] → matrix decomposition
► trigonometric transformation [PREP] → transformation
► trimmed estimator [ESTIM] → estimator
► trimmed mean [DESC] → location
► trimmed variance [DESC] → dispersion
► true error rate [CLAS] → classification (o error rate)
► truncation error [OPTIM] → numerical error
► Tukey's quick test [TEST] → hypothesis test
► Tukey's test [TEST] → hypothesis test
► two-level factorial design [EXDE] → design
► two-norm [ALGE] → norm (o matrix norm)
► two-sided test [TEST] → hypothesis testing
► two-stage nested analysis of variance [ANOVA] → analysis of variance
► two-stage sampling [PROB] → sampling (o cluster sampling)
► two-way analysis of variance [ANOVA] → analysis of variance
► two-way clustering [CLUS] → cluster analysis
► type I error [TEST] → hypothesis testing
► type II error [TEST] → hypothesis testing

U

► u-chart [QUAL] → control chart (o attribute control chart)
► U-shaped distribution [PROB] → random variable
► ultrametric distance [GEOM] → distance
► ultrametric inequality [GEOM] → distance
► unbalanced factorial design [EXDE] → design
► unbiased estimator [ESTIM] → estimator
► unconditional error rate [CLAS] → classification (o error rate)
► uncorrelated vectors [ALGE] → vector
► underdetermined system [PREP] → object
► underfitting [MODEL] → model fitting
► undirected graph [MISC] (o digraph) → graph theory
► unequal covariance matrix classification (UNEQ) [CLAS] Parametric classification method that is a variation of quadratic discriminant analysis. Each class g is represented by its centroid c_g and its covariance matrix S_g. Object x_i is classified according to its Mahalanobis distance from the class centroids. The metric is the inverse of the corresponding class covariance matrix S_g:

d²_ig = (x_i − c_g)^T S_g^-1 (x_i − c_g)

As the above distance metric follows the chi square distribution, the probability of an object belonging to a class can be calculated from this distribution. UNEQ, similar to SIMCA, simplifies the QDA classification rule by omitting the logarithm of the determinant of the class covariance matrices, which means that the class density functions are not properly scaled. In the presence of significant scale differences this usually causes inferior performance.
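A minimal Python sketch of the UNEQ classification rule (added for illustration; the two classes and the new object are invented), assigning an object to the class with the smallest squared Mahalanobis distance:

import numpy as np

def mahalanobis_sq(x, centroid, cov):
    """Squared Mahalanobis distance of object x from a class centroid."""
    diff = x - centroid
    return float(diff @ np.linalg.inv(cov) @ diff)

# Invented two-class training data (rows = objects, columns = variables).
rng = np.random.default_rng(3)
class_a = rng.normal([0.0, 0.0], 0.5, size=(30, 2))
class_b = rng.normal([3.0, 2.0], 1.0, size=(30, 2))

classes = {"A": class_a, "B": class_b}
params = {g: (X.mean(axis=0), np.cov(X, rowvar=False)) for g, X in classes.items()}

x_new = np.array([2.5, 1.5])
d2 = {g: mahalanobis_sq(x_new, c, S) for g, (c, S) in params.items()}
print(d2, "->", min(d2, key=d2.get))   # assign to the class with the smallest distance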
► uniform distribution [PROB] → distribution
► uniform shell design [EXDE] → design
► unique factor [FACT] (: specific factor) Term in the factor analysis model that accounts for the variance not described by the common factors. Its role is similar to the role of the error term in a regression model. There is one unique factor for each variable. The unique factors are uncorrelated, both among themselves and with the common factors. Principal component analysis assumes zero unique factors.
► unique variance [FACT] → factor analysis
► uniqueness [FACT] → communality
► unit [PREP] : object
► unit matrix [ALGE] → matrix
► univariate [MULT] → multivariate
► univariate calibration [REGR] → calibration
► univariate data [PREP] → data
► univariate distribution [PROB] → random variable
► univariate regression model [REGR] → regression model
► unsupervised learning [MULT] → pattern recognition
► unsupervised pattern recognition [MULT] → pattern recognition
► unweighted average linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
► upper control limit (UCL) [QUAL] → control chart
► upper quartile [DESC] → quantile
► utility function [QUAL] → multicriteria decision making
V

► V mask [QUAL] → control chart (o cusum chart)
► validation [MODEL] : model validation
► variable [PREP] (: attribute, characteristic, descriptor, feature) Characteristic of an object that may take on any value from a specified set. Data for n objects described by p variables are collected in a data matrix X(n, p), where the element x_ij is the jth measurement taken on the ith object. There are several types of variables, listed below, depending on their scale of measurement, on the set of values they can take, and on their relationships with other variables.
angular variable (: circular variable) Variable that takes on values expressed in terms of angles.

autoscaled variable : standardized variable

binary variable Dichotomous variable that takes values of 0 or 1. For example, quantal and dummy variables are binary variables.

blocking variable Categorical variable in the design matrix that groups runs into blocks. The levels of the blocking variable are defined by the blocking generator.

categorical variable : qualitative variable

cause variable Variable that is the cause of change in other (effect) variables.

circular variable : angular variable

concomitant variable : covariate

conditionally-present variable Variable that exists or is meaningful only for some of the objects. The values taken by such a variable may exist (or be meaningful) for an object depending on the value of a dichotomous variable of the present/absent type.

continuous variable Variable that can take on any numerical value. For any two values there is another one that such a variable can assume.

control variable Variable measured in monitoring a process. Its values are collected at fixed time intervals, often recorded on control charts or registered automatically.

covariate (: concomitant variable) Predictor variable in an ANCOVA model, measured on a ratio scale. Its effect on the response cannot be controlled, only observed.

cyclical variable Variable in a time series that takes on values that depend on the cycle (period) during which it is measured.

dependent variable (: response variable) Variable exhibiting statistical dependence on one or more other variables, called independent variables.

dichotomous variable Discrete variable that can take on only two values. A binary variable is a special dichotomous variable that takes on values of 0 or 1.

discrete variable In contrast to a continuous variable, a variable that takes on only a finite number of values. These values are usually integer numbers, but some discrete variables also take on ratios of integer numbers. For example, variables measured on a frequency count scale are discrete variables.

dummy variable (: indicator variable) Binary variable created by converting a qualitative variable into binary ones. Each level of the qualitative variable is represented by one dummy variable set to 1 or 0, indicating whether the qualitative variable assumed the corresponding level or not.

effect variable Variable in which change is caused by other (cause) variables.

endogenous variable Variable, mainly used in econometrics, that is measured within a system and affected both by variables in the system and by variables outside the system (exogenous variables). It is possible for a variable to be endogenous in one system and exogenous in another.

exogenous variable Variable, mainly used in econometrics, that is measured outside the system. Exogenous variables can affect the behavior of the system described by the endogenous variables, but are not affected by the fluctuations in the system.

experimental variable Variable measured or set during an experiment, in contrast to a theoretical variable, which is calculated from a mathematical model. In experimental design the experimental variables are more commonly called factors.

explanatory variable : predictor variable

inadmissible variable Variable that must not be included in a model because it is constant or perfectly correlated with other variables. A variable containing a large number of missing values, measured with too much noise, or highly correlated with other variables is often also considered inadmissible.

independent variable : predictor variable

indicator variable : dummy variable

latent variable Non-observable and non-measurable hypothetical variable, a crucial element of a latent variable model. A certain part of its effect is manifested in measurable manifest variables. Mainly used in sociology, economics and psychology. An example is the common factor in factor analysis.

lurking variable Variable that affects the response, but may not be measured or may not even be known to exist. The effect of such variables is gathered in the error term. They may cause significant correlation between two measured variables, without providing evidence that these two variables are necessarily causally related.

manifest variable Observable or measurable variable, as opposed to a latent variable, which is not observable or measurable.

multinomial variable : qualitative variable

multistate variable : qualitative variable

noise variable Variable that cannot be controlled during the experiment or the process.

predictor variable (: explanatory variable, independent variable, regressor) Variable in a regression model as a function of which the response variable is modeled.

process variable Variable controlled during the experiment or process.

qualitative variable (: categorical variable, multinomial variable, multistate variable) Variable in which differences between values cannot be interpreted in a quantitative sense and for which only non-arithmetic operations are valid. It can be measured on nominal or ordinal scales. Examples are: label, color, type.

quantal variable Binary response variable measuring the presence or absence of response to a stimulus.

quantitative variable Variable, measured on a proportional or ratio scale, for which arithmetic operations are valid.

random variable (: variate, stochastic variable) Variable that may take on values of a specified set with a defined frequency or probability; variable with values associated with an element of chance or probability.

ranking variable Variable defined by the ranks of the values of another variable. Rank order statistics are calculated from ranking variables that replace the ranked variable.

reduced variable : standardized variable

regressor : predictor variable

response variable : dependent variable

standardized variable (: autoscaled variable, reduced variable) Variable standardized by autoscaling, i.e. by subtracting its mean and dividing by its standard deviation. Such a variable has zero mean and unit variance.

stochastic variable : random variable

theoretical variable Variable taking values according to a mathematical model, in contrast to an experimental variable, which is measured or set in an experiment.

variate : random variable
► variable control chart [QUAL] → control chart
► variable metric optimization [OPTIM] (: quasi-Newton optimization) Gradient optimization that tries to overcome the problems in the Newton-Raphson optimization, namely that the Hessian matrix H may become negative definite. It finds a search direction of the form H^-1(p_i) g(p_i), where g is the gradient vector and H is a positive definite symmetric matrix, updated at each iteration, that converges to the Hessian matrix. The best known variable metric optimization is the Davidon-Fletcher-Powell optimization (DFP). It begins as steepest descent and changes over to the Newton-Raphson optimization during the iterations by continuously updating an approximation to the inverse of the matrix of second derivatives.
► variable reduction [MULT] → data reduction
► variable sampling [QUAL] → acceptance sampling
► variable step size generalized simulated annealing (VSGSA) [OPTIM] → simulated annealing
► variable subset selection [REGR] (: best subset regression) Collection of regression methods that model the response variable as a function of a selected subset of the predictor variables only. These techniques are biased regression methods in which biasing is based on the assumption that not all the predictor variables are relevant in the regression problem. There are various strategies for finding the optimal subset of predictors. The forward selection procedure starts with an initial model containing only a constant term and inserts one predictor at a time until a prespecified goodness of prediction criterion is satisfied. The order of insertion is determined by the partial correlation coefficients between the response and the predictors not yet in the model. In other words, at each step the predictor is included in the model that produces the largest increase in R². The partial correlations are calculated in each step as correlations between the residuals from the models y = f(x_1, x_2, ..., x_{k−1}) and x_j = f(x_1, x_2, ..., x_{k−1}), with j ≠ 1, ..., k − 1. The contribution of predictor k to the variance described by the regression model is assessed by the partial F value:

F_k = (RSS_{k−1} − RSS_k) / [RSS_k / (n − k − 1)]

where RSS_k denotes the residual sum of squares of the model containing k predictors. The forward selection procedure stops when the partial F value does not exceed a preselected F_in threshold, i.e. when the contribution of the selected predictor to the variance described by the model is no longer significant. Although this procedure improves the regression model in each step, it does not consider the effect of the newly inserted variable on the role of the predictors already in the model.
The backward elimination is the opposite strategy; it starts with the full model containing all the predictors. In each step the predictor is eliminated from the model which has the smallest partial correlation with the response (or equivalently the smallest partial F value), i.e. which results in the smallest decrease in R². The elimination procedure stops when the smallest partial F value is greater than a preselected F_out threshold, i.e. when all predictors in the model contribute significantly to the variance described by the model. This procedure, however, cannot be used when the full predictor matrix is underdetermined. The stepwise regression (SWR) method is a combination of the above two strategies. Both variable selection and variable elimination are attempted in each step. The procedure starts with only the constant term in the model. In each step the predictor with the largest partial F value is inserted if F > F_in, and the predictor with the smallest partial F value is eliminated if F < F_out. The procedure stops when all predictors in the model have F > F_out and all candidate predictors not in the model have F < F_in. This is the most frequently recommended method, although its success depends on the preselected F_in and F_out. The above procedures, called sequential variable selection, were developed with computational efficiency in mind, so that only a relatively small number of subsets are actually calculated and compared. As computation is getting cheaper, it is no longer prohibitive to calculate the all possible subsets regression. This examines all possible 2^p combinations of the predictors and chooses the best model on the basis of a goodness of prediction criterion. This method proves to be superior, especially in the case of collinearity. The above methods calculate a least squares estimate for the variables included in the model. The candidate predictors are decorrelated from the predictors already in the model. Stagewise regression considers predictors on the basis of their correlation with the response, as opposed to their partial correlation, i.e. the candidate predictors are not decorrelated from the model. In each step the correlations between the residual from the existing model and the potential predictors are calculated, and the variable with the largest correlation is inserted into the model. This method does not give the least squares regression coefficients for the variables in the final model and yields a larger mean square error than that of the least squares estimate. Its advantage is, however, that highly correlated variables are allowed to enter into the model if they are also highly correlated with the response.
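A minimal Python sketch of the forward selection strategy (added for illustration, with invented data); it greedily inserts the predictor giving the largest increase in R² and uses a fixed number of steps instead of a partial-F stopping rule:

import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (X already contains an intercept column)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 6))                   # invented predictor matrix
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=50)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):                             # fixed number of insertion steps
    scores = {}
    for j in remaining:
        cols = selected + [j]
        design = np.column_stack([np.ones(len(y)), X[:, cols]])
        scores[j] = r_squared(design, y)
    best = max(scores, key=scores.get)         # predictor giving the largest R^2 increase
    selected.append(best)
    remaining.remove(best)

print("selected predictors:", selected)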
► variance [DESC] → dispersion
► variance analysis [ANOVA] : analysis of variance
► variance component [ANOVA] → term in ANOVA
► variance-covariance matrix [DESC] : covariance matrix
► variance inflation factor (VIF) [REGR] Measure of the effect of an ill-conditioned predictor matrix X on the estimated regression coefficients:

VIF_j = 1 / (1 − R_j²)

where R_j is the multiple correlation coefficient obtained by regressing predictor x_j on all the other predictors x_k, k ≠ j. VIF is a simple measure for detecting collinearity. In the ideal case, when x_j is totally uncorrelated with the other predictors, VIF_j = 1. As R_j tends to 1, i.e. collinearity increases, VIF_j tends to infinity.
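For illustration, a short Python sketch (invented data, added here) computing VIF_j by regressing each predictor on the others:

import numpy as np

def vif(X):
    """Variance inflation factor of each column of the predictor matrix X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ b
        r2 = 1.0 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
print(vif(np.column_stack([x1, x2, x3])))   # large VIF for x1 and x2, near 1 for x3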
► variance ratio distribution [PROB] → distribution
► variance ratio test [TEST] → hypothesis test
► variate [PREP] → variable
► variate [PROB] → random variable
► variation [DESC] : dispersion
► varimax rotation [FACT] → factor rotation
► variogram [TIME] → autocovariance function
► vector [ALGE] Row or column of numbers. In contrast to a scalar, a vector has both size and direction. The size of a vector is measured by its norm. Conventionally a vector x is assumed to be a column vector; a row vector is indicated as x^T (the transpose of x). For example, multivariate measurements are usually represented by a row vector, while measurements of the same quantity on various individuals or material samples are often collected in a column vector.
Vectors x and y are orthogonal vectors if

x^T y = 0

A set of p such vectors forms an orthonormal basis of the p space. Vectors x and y are uncorrelated vectors if:

(x − x̄)^T (y − ȳ) = 0

where x̄ and ȳ denote the corresponding mean vectors. Vectors x and y are linearly independent vectors if

b_x x + b_y y = 0

holds only for b_x = 0 and b_y = 0. Vectors x and y are conjugate vectors if

x^T Z y = 0

where Z is a symmetric positive definite matrix.
► vector norm [ALGE] → norm
W

► Wald-Wolfowitz's test [TEST] → hypothesis test
► walk [MISC] → graph theory
► Walsh's test [TEST] → hypothesis test
► Ward linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
► warning limit [QUAL] → control chart
► Watson's test [TEST] → hypothesis test
► Weibull distribution [PROB] → distribution
► Weibull growth model [REGR] → regression model
► weight [PREP] Numerical coefficient associated with objects, variables or classes indicating their relative importance in a model. Most statistical models can incorporate observation weights, e.g. weighted mean, weighted variance, weighted least squares. Observation weights play an important role in regression, for example in generalized least squares regression, robust regression, the biweight, and smoothers.
► weighted average linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
► weighted centroid linkage [CLUS] → hierarchical clustering (o agglomerative clustering)
► weighted graph [MISC] → graph theory
► weighted least squares regression (WLS) [REGR] → generalized least squares regression
► weighted mean [DESC] → location
► weighted nearest means classification (WNMC) [CLAS] → centroid classification
► weighted variance [DESC] → dispersion
► well-conditioned matrix [ALGE] → matrix condition
► well-determined system [PREP] → object
► well-structured admissibility [CLUS] → assessment of clustering (o admissibility properties)
► Westenberg's test [TEST] → hypothesis test
► Westlake design [EXDE] → design
► white noise [TIME] Stationary stochastic process defined as a sequence of i.i.d. random variables a(t). This process is stationary with mean function

μ(t) = E[a(t)]

and with autocovariance and autocorrelation functions

γ(t, s) = var[a(t)]        ρ(t, s) = 1

if t = s; otherwise both are zero.
► wide-band process [TIME] → stochastic process (o narrow-band process)
► Wiener-Levy process [TIME] → stochastic process
► Wilcoxon-Mann-Whitney's test [TEST] → hypothesis test
► Wilcoxon's test [TEST] → hypothesis test
► Wilks' Λ test [FACT] → rank analysis
► Wilk's test [TEST] → hypothesis test
► Williams-Lambert clustering [CLUS] → hierarchical clustering (o divisive clustering)
► Williams plot [GRAPH] → scatter plot (o residual plot)
► window smoother [REGR] → smoother
► Winsorized estimator [ESTIM] → estimator
► within-group covariance matrix [DESC] → covariance matrix

X

► x̄-chart [QUAL] → control chart (o variable control chart)
Y

► Yates algorithm [EXDE] Algorithm for calculating estimates of the effects of factors and of their interactions in a two-level factorial design. The outcomes of the experimental runs of a K-factor factorial design are first written in a column in Yates order. Another column is calculated by first adding together the consecutive pairs of numbers from the first column, then subtracting the top number from the bottom number of each pair. With the same technique a total of K columns are generated; the entries of a new column are sums or differences of pairs of numbers from the previous column. The last column contains the contrast totals: the first value corresponds to the mean and the remaining values to the factors and their interactions, in Yates order. To obtain the estimate of the mean, the corresponding total is divided by 2^K; the factor and interaction effects are estimated by dividing the corresponding totals by 2^(K−1).
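A minimal Python sketch of the algorithm (added for illustration; the responses are invented and listed in Yates order for a 2³ design):

def yates(y, k):
    """Yates algorithm for a 2**k full factorial with responses y in Yates order."""
    col = list(y)
    for _ in range(k):
        sums  = [col[i] + col[i + 1] for i in range(0, len(col), 2)]
        diffs = [col[i + 1] - col[i] for i in range(0, len(col), 2)]
        col = sums + diffs                 # top half: sums, bottom half: differences
    mean = col[0] / 2 ** k                 # grand total divided by 2**k
    effects = [c / 2 ** (k - 1) for c in col[1:]]   # contrasts divided by 2**(k-1)
    return mean, effects

# Invented responses of a 2**3 design, listed in Yates (standard) order.
responses = [60, 72, 54, 68, 52, 83, 45, 80]
print(yates(responses, 3))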
► Yates chi squared coefficient [GEOM] → distance (o binary data)
► Yates order [EXDE] (: standard order of runs) The most often used order of runs in a two-level factorial design. The first column of the design matrix consists of successive minus and plus signs, the second column of successive pairs of minus and plus signs, the third column of four minus signs followed by four plus signs, and so forth. In general, the kth column consists of alternating groups of 2^(k−1) minus signs followed by 2^(k−1) plus signs. For example, the design matrix for a 2³ design is:

- - -
+ - -
- + -
+ + -
- - +
+ - +
- + +
+ + +

► Youden square design [EXDE] → design
► Yule coefficient [GEOM] → distance (o binary data)
► Yule-Walker equations [TIME] → time series model (o autoregressive model)
Z

► z-chart [GRAPH] Plot of time series data that contains three lines forming a Z shape. The lower line is the plot of the original time series, the center line is a cumulative total and the upper line is a moving total.
► z-score [PREP] → standardization (o autoscaling)
► zero matrix [ALGE] → matrix
► zero-order regression model [REGR] → regression model
► zooming [GRAPH] → interactive computer graphics
References

[ALGE] LINEAR ALGEBRA

G.H. Golub and C.F. van Loan, Matrix Computations. Johns Hopkins University Press, Baltimore, MD (USA), 1983
A.S. Householder, The Theory of Matrices in Numerical Analysis. Dover Publications, New York, NY (USA), 1974
A. Jennings, Matrix Computation for Engineers and Scientists. Wiley, New York, NY (USA), 1977
B.N. Parlett, The Symmetric Eigenvalue Problem. Prentice-Hall, Englewood Cliffs, NJ (USA), 1980
[ANOVA] ANALYSIS OF VARIANCE

O.J. Dunn and V.A. Clark, Applied Statistics: Analysis of Variance and Regression. Wiley, New York, NY (USA), 1974
L. Fisher, Fixed Effects Analysis of Variance. Academic Press, New York, NY (USA), 1978
D.J. Hand, Multivariate Analysis of Variance and Repeated Measures. Chapman and Hall, London (UK), 1987
A. Huitson, The Analysis of Variance. Griffin, London (UK), 1966
S.R. Searle, Variance Components. Wiley, New York, NY (USA), 1992
G.O. Wesolowsky, Multiple Regression and Analysis of Variance. Wiley, New York, NY (USA), 1976
[CLAS] CLASSIFICATION

H.H. Bock, Automatische Klassifikation. Vandenhoeck & Ruprecht, Gottingen (GER), 1974
I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning. Sigma Press, Wilmslow (UK), 1987
L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. Wadsworth, Belmont, CA (USA), 1984
H.T. Clifford and W. Stephenson, An Introduction to Numerical Classification. Academic Press, New York, NY (USA), 1975
D. Coomans, D.L. Massart, I. Broeckaert, and A. Tassin, Potential methods in pattern recognition. Anal. Chim. Acta, 133, 215 (1981)
D. Coomans, D.L. Massart, and I. Broeckaert, Potential methods in pattern recognition. Part 4. A combination of ALLOC and statistical linear discriminant analysis. Anal. Chim. Acta, 133, 215 (1981)
T.M. Cover and P.E. Hart, Nearest neighbor pattern classification. IEEE Trans., 21 (1967)
M.P. Derde and D.L. Massart, UNEQ: a disjoint modelling technique for pattern recognition based on normal distribution. Anal. Chim. Acta, 184, 33 (1986)
M.P. Derde and D.L. Massart, Comparison of the performance of the class modelling techniques UNEQ, SIMCA and PRIMA. Chemolab, 4, 65 (1988)
R.A. Eisenbeis, Discriminant Analysis and Classification Procedures: Theory and Applications. Lexington, 1972
M. Forina, C. Armanino, R. Leardi, and G. Drava, A class-modelling technique based on potential functions. J. Chemometrics, 5, 435 (1991)
I.E. Frank, DASCO - A new classification method. Chemolab, 4, 215 (1988)
I.E. Frank and J.H. Friedman, Classification: oldtimers and newcomers. J. Chemometrics, 3, 463 (1989)
I.E. Frank and S. Lanteri, Classification models: discriminant analysis, SIMCA, CART. Chemolab, 5, 247 (1989)
J.H. Friedman, Regularized Discriminant Analysis. J. Am. Statist. Assoc., 165 (1989)
M. Goldstein, Discrete Discriminant Analysis. Wiley, New York, NY (USA), 1978
L. Gordon and R.A. Olsen, Asymptotically efficient solutions to the classification problem. Ann. Statist., 6, 515 (1978)
D.J. Hand, Kernel discriminant analysis. Research Studies Press, Letchworth (UK), 1982
D.J. Hand, Discrimination and Classification. Wiley, Chichester (UK), 1981
M. James, Classification Algorithms. Collins, London (UK), 1985
I. Juricskay and G.E. Veress, PRIMA: a new pattern recognition method. Anal. Chim. Acta, 171, 61 (1985)
W.R. Klecka, Discriminant Analysis. Sage Publications, Beverly Hills, CA (USA), 1980
B.R. Kowalski and C.F. Bender, The k-nearest neighbour classification rule (pattern recognition) applied to nuclear magnetic resonance spectral interpretation. Anal. Chem., 44, 1405 (1972)
P.A. Lachenbruch, Discriminant Analysis. Hafner Press, New York, NY (USA), 1975
G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York, NY (USA), 1992
N.J. Nilson, Learning Machines. McGraw-Hill, New York, NY (USA), 1965
R. Todeschini and E. Marengo, Linear Discriminant Classification Tree (LDCT): a user-driven multicriteria classification method. Chemolab, 16, 25 (1992)
G.T. Toussaint, Bibliography on estimation of misclassification. Information Theory, 472 (1974)
H. van der Voet and D.A. Doornbos, The improvement of SIMCA classification by using kernel density estimation. Part 1. A new probabilistic approach classification technique and how to evaluate such a technique. Anal. Chim. Acta, 161, 115 (1984)
H. van der Voet and D.A. Doornbos, The improvement of SIMCA classification by using kernel density estimation. Part 2. Practical evaluation of SIMCA, ALLOC and CLASSY on three data sets. Anal. Chim. Acta, 161, 125 (1984)
H. van der Voet, P.M.J. Coenegracht, and J.B. Hemel, The evaluation of probabilistic classification methods. Part 1. A Monte Carlo study with ALLOC. Anal. Chim. Acta, 191, 47 (1986)
H. van der Voet, J.B. Hemel, and P.M.J. Coenegracht, New probabilistic version of the SIMCA and CLASSY classification methods. Part 2. Practical evaluation. Anal. Chim. Acta, 191, 63 (1986)
S. Wold, Pattern recognition by means of disjoint principal components models. Pattern Recognition, 8, 127 (1976)
S. Wold, The analysis of multivariate chemical data using SIMCA and MACUP. Kem. Kemi, 3, 401 (1982)
S. Wold and M. Sjöström, Comments on a recent evaluation of the SIMCA method. J. Chemometrics, 1, 243 (1987)
[CLUS] CLUSTER ANALYSIS rn L.A. Abbott, EA. Bisby, and D.J. Rogers, lsxonomic Analysis in Biology. Columbia Univ. Press, New York, NY (USA), 1985 m M.S. Aldenderfer and R.K. Blashfield, Cluster Analysis. Sage Publications, Beverly Hills, CA (USA), 1984 rn M.R. Anderberg, Cluster Analysis for Applications. Academic Press, New York, NY (USA), 1973 0 G.H. Ball and D.J. Hall, A clustering technique for summarizing multivariate data. Behav. Sci., 2, 153 (1967) rn E.J. Bijnen, Cluster Analysis. Tilburg University Press, 1973 rn A.J. Cole, Numerical lsxonomy. Academic Press, New York, NY (USA), 1969 0 D. Coomans and D.L. Massart, Potential methods In pattern recognition. Part 2. CLUPOT, an unsupervised pattern recognition technique. Anal. Chim. Acta, 133,225 (1981) rn B.S. Duran, Cluster Analysis. Springer-Verlag, Berlin (GER), 1974 A.W.F. Edwards and L.L. Cavalli-Sforza, A method for cluster analysis. Biometrics, 21,362 (1965) B.S. Everitt, Cluster Analysis. Heineman Educational Books, London (UK), 1980 0 E.B. Fowlkes, R. Gnanadesikan, and J.R. Kettenring, Variable selection In clustering. J. Classif., 5 205 (1988) 0 H.P. Friedman and J. Rubin, On some invariant criteria for grouping data. J. Am. Statist. Assoc., @, 1159 (1967) A.D. Gordon, Classification Methods for the Exploratory Analysis of Multivariate Data. Chapman & Hall, London (UK), 1981 0 J.C. Gower, Maximal predictive classification. Biometrics, 643 (1974) 0 P. Hansen and M. Delattre, Complete-link analysis by graph coloring. J. Am. Statist. Assoc., -3 7 397 (1978) rn J. Hartigan, ClusteringAlgorithms. Wiley, New York, NY (USA), 1975 rn A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ (USA), 1988 rn M. Jambu, Cluster Analysis and Data Analysis. North-Holland, Amsterdam (The Netherlands), 1983 rn N. Jardine and R. Sibson, Mathematical lsxonomy. Wiley, London (UK), 1971 0 R.A. Jarvis and E.A. Patrick, Clustering using a similarity measure based on shared nearest neighbours. IEEE 'Itans. Comput., 1025 (1973) rn L. Kaufman and P.J. Rousseeuw, Finding Groups in Data An Introduction to Cluster Analysis. Wiley, New York, NY (USA), 1990 R.G. Lawson and P.C. Jurs, New index for clustering tendency and its application to chemical problems. J. Chem. Inf. Comput. Sci., -03 36 (1990) 0 E. Marengo and R. Todeschini, Linear discriminant hierarchical clustering: a modeling and crossvalidable clustering method. Chemolab, 19,43 (1993) 0 F.H.C. Marriott, Optimization methods of cluster analysis. Biometrika, 417 (1982) rn D.L. Massart and L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis. Wiley, New York, NY (USA), 1983 D.L. Massart, F. Plastria, and L. Kaufman, Non-hierarchical clustering with MASLOC. Pattern Recognition, &, 507 (1983) rn P.M. Mather, Cluster Analysis. Computer Applications, Nottingham (UK), 1969 0 G.W. Milligan and P.D. Isaac, The validation of four ultrametric clustering algorithms. Pattern Recognition, -02 41 (1980) 0 W.M. Rand, Objective criteria for the evaluation of clustering methods. J. Am. Statist. Assoc., 66, 846 (1971)
H.C. Romesburg, Cluster Analysis for Researchers. Lifetime Learning Publications, Belmont, CA (USA), 1984
A.J. Scott and H.J. Symons, Clustering methods based on likelihood ratio criteria. Biometrics, 27, 387 (1971)
P.H.A. Sneath and R.R. Sokal, Numerical Taxonomy. Freeman, San Francisco, CA (USA), 1973
H. Spath, Cluster Analysis Algorithms. Wiley, New York, NY (USA), 1980
M.J. Symons, Clustering criteria and multivariate normal mixtures. Biometrics, 37, 35 (1981)
R.C. Tryon, Cluster Analysis. McGraw-Hill, New York, NY (USA), 1970
J.W. Van Ness, Admissible clustering procedures. Biometrika, 60, 422 (1973)
J. Van Ryzin (Ed.), Classification and Clustering. Academic Press, New York, NY (USA), 1977
W. Vogt, D. Nagel, and H. Sator, Cluster Analysis in Clinical Chemistry: A Model. Wiley, Chichester (UK), 1987
J.H. Ward, Hierarchical grouping to optimize an objective function. J. Am. Statist. Assoc., 58, 236 (1963)
P. Willett, Clustering tendency in chemical classification. J. Chem. Inf. Comput. Sci., 25, 78 (1985)
P. Willett, Similarity and Clustering in Chemical Information Systems. Research Studies Press, Letchworth (UK), 1987
W.T. Williams and J.M. Lambert, Multivariate methods in plant ecology. 2. The use of an electronic computer for association analysis. J. Ecology, 48, 689 (1960)
D. Wishart, Mode Analysis: a generalization of nearest neighbour which reduces chaining effects. In Numerical Taxonomy, Ed. A.J. Cole, Academic Press, New York, NY, 1969, p. 282
J. Zupan, A new approach to binary tree-based heuristics. Anal. Chim. Acta, 122, 337 (1980)
J. Zupan, Hierarchical clustering of infrared spectra. Anal. Chim. Acta, 139, 143 (1982)
J. Zupan, Clustering of Large Data Sets. Research Studies Press, Chichester (UK), 1982
[ESTIM] ESTIMATION
Y. Bard, Nonlinear Parameter Estimation. Academic Press, New York, NY (USA), 1974
P.J. Huber, Robust Statistics. Wiley, New York, NY (USA), 1981
J.S. Maritz, Distribution-free Statistical Methods. Chapman & Hall, London (UK), 1981
O. Richter, Parameter Estimation in Ecology. VCH Publishers, Weinheim (GER), 1990
P.J. Rousseeuw, Tutorial to robust statistics. J. Chemometrics, 5, 1 (1991)
B.W. Silverman, Density Estimation for Statistics and Data Analysis. Research Studies Press, Letchworth (UK), 1986
[EXDE] EXPERIMENTAL DESIGN
K.M. Abdelbasit and R.L. Plackett, Experimental Design for Binary Data. J. Am. Statist. Assoc., 78, 90 (1983)
D.F. Andrews and A.M. Herzberg, The Robustness and Optimality of Response Surface Designs. J. Statist. Plan. Infer., 2, 249 (1979)
T.B. Barker, Quality by Experimental Design. Dekker, New York, NY (USA), 1985
G.E.P. Box and D.W. Behnken, Some new three level designs for the study of quantitative variables. Technometrics, 2, 455 (1960)
G.E.P. Box and N.R. Draper, Evolutionary Operation. Wiley, New York, NY (USA), 1969
G.E.P. Box and N.R. Draper, Empirical Model-Building and Response Surfaces. Wiley, New York, NY (USA), 1987
G.E.P. Box, W.G. Hunter, and J.S. Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley, New York, NY (USA), 1978
N. Bratchell, Multivariate response surface modelling by principal components analysis. J. Chemometrics, 3, 579 (1989)
R. Carlson, Design and Optimization in Organic Synthesis. Elsevier, Amsterdam (NL), 1992
S. Clementi, G. Cruciani, C. Curti and B. Skagerberg, PLS response surface optimization: the CARSO procedure. J. Chemometrics, 3, 499 (1989)
J.A. Cornell, Experiments with Mixtures. Wiley, New York, NY (USA), 1990 (2nd ed.)
J.A. Cornell, Experiments with Mixtures: A Review. Technometrics, 15, 437 (1973)
J.A. Cornell, Experiments with Mixtures: An Update and Bibliography. Technometrics, 21, 95 (1979)
S.N. Deming and S.L. Morgan, Experimental Design: A Chemometric Approach. Elsevier, Amsterdam (NL), 1987
C.A.A. Duinveld, A.K. Smilde and D.A. Doornbos, Comparison of experimental designs combining process and mixture variables. Part I. Design construction and theoretical evaluation. Chemolab, 19, 295 (1993)
C.A.A. Duinveld, A.K. Smilde and D.A. Doornbos, Comparison of experimental designs combining process and mixture variables. Part II. Design evaluation on measured data. Chemolab, 19, 309 (1993)
V.V. Fedorov, Theory of Optimal Experiments. Academic Press, New York, NY (USA), 1972
P.D. Haaland, Experimental Design in Biotechnology. Marcel Dekker, New York, NY (USA), 1989
J.S. Hunter, Statistical Design Applied to Product Design. J. Qual. Control, 17, 210 (1985)
W.G. Hunter and J.R. Kittrell, Evolutionary Operation: A Review. Technometrics, 389 (1966)
A.I. Khuri and J.A. Cornell, Response Surfaces: Designs and Analysis. Marcel Dekker, New York, NY (USA), 1987
R.E. Kirk, Experimental Design. Wadsworth, Belmont, CA (USA), 1982
E. Marengo and R. Todeschini, A new algorithm for optimal, distance based, experimental design. Chemolab, 2, 117 (1991)
R.L. Mason, R.F. Gunst, and J.L. Hess, Statistical Design and Analysis of Experiments with Applications to Engineering and Science. Wiley, New York, NY (USA), 1989
R. Mead, The Design of Experiments. Cambridge Univ. Press, Cambridge (UK), 1988
R. Mead and D.J. Pike, A review of response surface methodology from a biometric viewpoint. Biometrics, 31, 803 (1975)
D.C. Montgomery, Design and Analysis of Experiments. Wiley, New York, NY (USA), 1984
E. Morgan, Chemometrics: Experimental Design. Wiley, Chichester (UK), 1991
R.L. Plackett and J.P. Burman, The Design of Optimum Multifactorial Experiments. Biometrika, 33, 305 (1946)
D.M. Steinberg and W.G. Hunter, Experimental Design: Review and Comment. Technometrics, 441 (1984)
G. Taguchi and Y. Wu, Introduction to Off-Line Quality Control. Central Japan Quality Control Association (JPN), 1979
[FACT] FACTOR ANALYSIS
J.P. Benzecri, L'Analyse des Correspondances. Dunod, Paris (FR), 2 vols., 1980 (3rd ed.)
M. Feinberg, The utility of correspondence factor analysis for making decisions from chemical data. Anal. Chim. Acta, 191, 75 (1986)
B. Flury, Common Principal Components and Related Multivariate Models. Wiley, New York, NY (USA), 1988
H. Gampp, M. Maeder, C.J. Meyer, and A.D. Zuberbuehler, Evolving Factor Analysis. Comments Inorg. Chem., 6, 41 (1987)
R.L. Gorsuch, Factor Analysis. Saunders, Philadelphia, PA (USA), 1974
M.J. Greenacre, Theory and Applications of Correspondence Analysis. Academic Press, London (UK), 1984
H.H. Harman, Modern Factor Analysis. Chicago University Press, Chicago, IL (USA), 1976
P.K. Hopke, Target transformation factor analysis. Chemolab, 6, 7 (1989)
J.E. Jackson, A User's Guide to Principal Components. Wiley, New York, NY (USA), 1991
I.T. Jolliffe, Principal Component Analysis. Springer-Verlag, New York, NY (USA), 1986
H.R. Keller and D.L. Massart, Evolving factor analysis. Chemolab, 12, 209 (1992)
J. Kim and C.W. Mueller, Factor Analysis. Sage Publications, Beverly Hills, CA (USA), 1978
D.N. Lawley and A.E. Maxwell, Factor Analysis as a Statistical Method. Macmillan, New York, NY (USA) / Butterworths, London (UK), 1971 (2nd ed.)
M. Maeder, Evolving factor analysis for the resolution of overlapping chromatographic peaks. Anal. Chem., 527 (1987)
M. Maeder and A. Zilian, Evolving Factor Analysis, a New Technique in Chromatography. Chemolab, 3, 205 (1988)
E.R. Malinowski and D.G. Howery, Factor Analysis in Chemistry. Wiley, New York, NY (USA), 1980-1991 (2nd ed.)
S.A. Mulaik, The Foundations of Factor Analysis. McGraw-Hill, New York, NY (USA), 1972
R.J. Rummel, Applied Factor Analysis. Northwestern University Press, Evanston, IL (USA), 1970
C.J.F. Ter Braak, Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology, 67, 1167 (1986)
L.L. Thurstone, Multiple Factor Analysis: A development and expansion of The Vectors of Mind. Chicago University Press, Chicago, IL (USA), 1947
[GEOM] GEOMETRICAL CONCEPTS
C.M. Cuadras, Distancias Estadisticas. Estadística Española, 30, 295 (1988)
J.C. Gower, A general coefficient of similarity and some of its properties. Biometrics, 27, 857 (1971)
[GRAPH] GRAPHICAL DATA ANALYSIS
D.F. Andrews, Plots of high-dimensional data. Biometrics, 28, 125 (1972)
H.P. Andrews, R.D. Snee, and M.H. Sarner, Graphical display of means. The Am. Statist., 195 (1980)
F.J. Anscombe, Graphs in statistical analysis. The Am. Statist., 27, 17 (1973)
J.M. Chambers, W.S. Cleveland, B. Kleiner, and P.A. Tukey, Graphical Methods for Data Analysis. Wadsworth & Brooks, Pacific Grove, CA (USA), 1983
H. Chernoff, The use of faces to represent points in k-dimensional space graphically. J. Am. Statist. Assoc., 68, 361 (1973)
B. Everitt, Graphical Techniques for Multivariate Data. Heinemann Educational Books, London (UK), 1978
S.E. Fienberg, Graphical methods in statistics. The Am. Statist., 33, 165 (1979)
J.H. Friedman and L.C. Rafsky, Graphics for the multivariate two-sample problem. J. Am. Statist. Assoc., 76, 277 (1981)
K.R. Gabriel, The biplot graphic display of matrices with applications to principal components analysis. Biometrika, 453 (1971)
B. Kleiner and J.A. Hartigan, Representing points in many dimensions by trees and castles. J. Am. Statist. Assoc., 76, 260 (1981)
R. Leardi, E. Marengo, and R. Todeschini, A new procedure for the visual inspection of multivariate data of different geographic origins. Chemolab, 12, 181 (1991)
R. McGill, J.W. Tukey, and W.A. Larsen, Variation of box plots. The Am. Statist., 12 (1978)
D.W. Scott, On optimal and data-based histograms. Biometrika, 66, 605 (1979)
H. Wainer and D. Thissen, Graphical Data Analysis. Ann. Rev. Psychol., 22, 191 (1981)
K. Wakimoto and M. Taguri, Constellation graphical method for representing multidimensional data. Ann. Inst. Statist. Mathem., 97 (1978)
P.C.C. Wang (Ed.), Graphical Representation of Multivariate Data. Academic Press, New York, NY (USA), 1978
[MISC] MISCELLANEOUS
D. Bonchev, Information Theoretic Indices for Characterization of Chemical Structure. Research Studies Press, Letchworth (UK), 1983
L. Davis, Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York, NY (USA), 1991
K. Eckschlager and V. Stepanek, Analytical Measurement and Information: Advances in the Information Theoretic Approach to Chemical Analysis. Research Studies Press, Letchworth (UK), 1985
D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA (USA), 1989
R.W. Hamming, Coding and Information Theory. Prentice-Hall, Englewood Cliffs, NJ (USA), 1980-1986 (2nd ed.)
J. Hertz, A. Krogh and R.G. Palmer, Introduction to the Theory of Neural Computation. Addison-Wesley, New York, NY (USA), 1991
D.B. Hibbert, Genetic algorithms in chemistry. Chemolab, 19, 277 (1993)
Z. Hippe, Artificial Intelligence in Chemistry. Elsevier, Warszawa (POL), 1991
J.H. Holland, Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI (USA), 1975
G.J. Klir and T.A. Folger, Fuzzy Sets, Uncertainty, and Information. Prentice-Hall, Englewood Cliffs, NJ (USA), 1988
R. Leardi, R. Boggia and M. Terrile, Genetic algorithms as a strategy for feature selection. J. Chemometrics, 6, 267 (1992)
C.B. Lucasius and G. Kateman, Understanding and using genetic algorithms. Part 1. Concepts, properties and context. Chemolab, 19, 1 (1993)
N.J. Nilsson, Principles of Artificial Intelligence. Springer-Verlag, Berlin (GER), 1982
A.P. de Weijer, C.B. Lucasius, L. Buydens, G. Kateman, and H.M. Heuvel, Using genetic algorithms for an artificial neural network model inversion. Chemolab, 45 (1993)
B.J. Wythoff, Backpropagation neural networks. A tutorial. Chemolab, 18, 115 (1993)
J. Zupan and J. Gasteiger, Neural networks: a new method for solving chemical problems or just a passing phase? Anal. Chim. Acta, 248, 1 (1991)
[MODEL] MODELING
B. Efron and R. Tibshirani, An Introduction to the Bootstrap. Chapman & Hall, New York, NY (USA), 1993
B. Efron and R. Tibshirani, Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science, 1, 54 (1986)
B. Efron, Estimating the error rate of a prediction rule: improvements on cross-validation. J. Am. Statist. Assoc., 316 (1983)
G.H. Golub, M. Heath, and G. Wahba, Generalized Cross-validation as a Method for Choosing a Good Ridge Parameter. Technometrics, 21, 215 (1979)
S. Lanteri, Full validation for feature selection in classification and regression problems. Chemolab, 159 (1992)
A.M. Law and W.D. Kelton, Simulation Modeling and Analysis. McGraw-Hill, New York, NY (USA), 1991
D.W. Osten, Selection of optimal regression models via cross-validation. J. Chemometrics, 2, 39 (1988)
M. Stone, Cross-validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society, Ser. B, 36, 111 (1974)
[MULT] MULTIVARIATE ANALYSIS
A.A. Afifi and S.P. Azen, Statistical Analysis: A Computer Oriented Approach. Academic Press, New York, NY (USA), 1979
A.A. Afifi and V. Clark, Computer-Aided Multivariate Analysis. Wadsworth, Belmont, CA (USA), 1984
I.H. Bernstein, Applied Multivariate Analysis. Springer-Verlag, New York, NY (USA), 1988
H. Bozdogan and A.K. Gupta, Multivariate Statistical Modeling and Data Analysis. Reidel Publishers, Dordrecht (NL), 1987
R.G. Brereton, Multivariate pattern recognition in chemometrics, illustrated by case studies. Elsevier, Amsterdam (NL), 1992
H. Bryant and W.R. Atchley, Multivariate Statistical Methods. Dowden, Hutchinson & Ross, Stroudsberg, PA (USA), 1975
N.B. Chapman and J. Shorter, Correlation Analysis in Chemistry. Plenum Press, New York, NY (USA), 1978
C. Chatfield and A.J. Collins, Introduction to Multivariate Analysis. Chapman & Hall, London (UK), 1986
W.W. Cooley and P.R. Lohnes, Multivariate Data Analysis. Wiley, New York, NY (USA), 1971
D. Coomans and I. Broeckaert, Potential Pattern Recognition in Chemical and Medical Decision Making. Research Studies Press, Letchworth (UK), 1986
M.L. Davison, Multidimensional Scaling. Wiley, New York, NY (USA), 1983
W.R. Dillon and M. Goldstein, Multivariate Analysis: Methods and Applications. Wiley, New York, NY (USA), 1984
R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis. Wiley, New York, NY (USA), 1973
M.L. Eaton, Multivariate Statistics. Wiley, New York, NY (USA), 1983
K. Esbensen and P. Geladi, Strategy of multivariate image analysis (MIA). Chemolab, 67 (1989)
B.S. Everitt, An Introduction to Latent Variable Models. Chapman & Hall, London (UK), 1984
R. Gittins, Canonical Analysis: A Review with Applications in Ecology. Biomathematics 12, Springer-Verlag, Berlin (GER), 1985
R. Gnanadesikan, Methods for Statistical Data Analysis of Multivariate Observations. Wiley, New York, NY (USA), 1977
P. Green, Mathematical Tools for Applied Multivariate Analysis. Academic Press, San Diego, CA (USA), 1978
I.I. Joffe, Application of Pattern Recognition to Catalytic Research. Research Studies Press, Letchworth (UK), 1988
R.A. Johnson and D.W. Wichern, Applied Multivariate Statistical Analysis. Prentice-Hall, London (UK), 1982-1992 (3rd ed.)
P.C. Jurs and T.L. Isenhour, Chemical Applications of Pattern Recognition. Wiley-Interscience, New York, NY (USA), 1975
M. Kendall, Multivariate Analysis. Griffin, London (UK), 1980
P.R. Krishnaiah (Ed.), Multivariate Analysis. Academic Press, New York, NY (USA), 1966
W.J. Krzanowski, Principles of Multivariate Analysis. Oxford Science Publishers, Clarendon (UK), 1988
A.N. Kshirsagar, Multivariate Analysis. Dekker, New York, NY (USA), 1978
M.S. Levine, Canonical Analysis and Factor Comparisons. Sage Publications, Beverly Hills, CA (USA), 1977
P.J. Lewi, Multivariate Data Analysis in Industrial Practice. Research Studies Press, Letchworth (UK), 1982
B.F.J. Manly, Multivariate Statistical Methods: A Primer. Chapman & Hall, Bristol (UK), 1986
W.S. Meisel, Computer-oriented approach to pattern recognition. Academic Press, New York, NY (USA), 1972
K.V. Mardia, J.T. Kent, and J.M. Bibby, Multivariate Analysis. Academic Press, London (UK), 1979-1988 (6th ed.)
D.F. Morrison, Multivariate Statistical Methods. McGraw-Hill, New York, NY (USA), 1976
S.S. Schiffman, M.L. Reynolds and F.W. Young, Introduction to Multidimensional Scaling. Academic Press, Orlando, FL (USA), 1981
G.A.F. Seber, Multivariate Observations. Wiley, New York, NY (USA), 1984
M.S. Srivastava and E.M. Carter, An Introduction to Applied Multivariate Statistics. North-Holland, Amsterdam (NL), 1983
O. Strouf, Chemical Pattern Recognition. Research Studies Press, Letchworth (UK), 1986
R.M. Thorndike, Correlational Procedures for Research. Gardner, New York, NY (USA), 1978
J.T. Tou and R.C. Gonzales, Pattern Recognition Principles. Addison-Wesley, Reading, MA (USA), 1974
J.P. Van der Geer, Introduction to Linear Multivariate Data Analysis. DSWO Press, Leiden (NL), 1986
E. Van der Burg, Nonlinear Canonical Correlation and Some Related Techniques. DSWO Press, Leiden (NL), 1988
K. Varmuza, Pattern Recognition in Chemistry. Springer-Verlag, Berlin (GER), 1980
D.D. Wolf and M.L. Pearson, Pattern Recognition Approach to Data Interpretation. Plenum Press, New York, NY (USA), 1983
[OPTIM] OPTIMIZATION
K.W.C. Burton and G. Nickless, Optimization via simplex. Part 1. Background, definitions and simple applications. Chemolab, 1, 135 (1987)
B.S. Everitt, Introduction to Optimization Methods and their Application in Statistics. Chapman & Hall, London (UK), 1987
R. Fletcher, Practical Methods of Optimization. Wiley, New York, NY (USA), 1987
J.H. Kalivas, Optimization using variations of simulated annealing. Chemolab, 15, 1 (1992)
L. Mackley (Ed.), Introduction to Optimization. Wiley, New York, NY (USA), 1988
J.A. Nelder and R. Mead, A simplex method for function minimization. Computer J., 7, 308 (1965)
A.C. Norris, Computational Chemistry: An Introduction to Numerical Methods. Wiley, Chichester (UK), 1986
C.H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Englewood Cliffs, NJ (USA), 1982
P.J.M. van Laarhoven and E.H.L. Aarts, Simulated Annealing: Theory and Applications. Reidel, Dordrecht (NL), 1987
[PROB] PROBABILITY
M. Evans, N. Hastings, B. Peacock, Statistical Distributions. Wiley, New York, NY (USA), 1993
M.A. Goldberg, An Introduction to Probability Theory with Statistical Applications. Plenum Press, New York, NY (USA), 1984
H.J. Larson, Introduction to Probability Theory and Statistical Inference. Wiley, New York, NY (USA), 1969
P.L. Meyer, Introductory Probability and Statistical Applications. Addison-Wesley, Reading, MA (USA), 1970
F. Mosteller, Probability with Statistical Applications. Addison-Wesley, Reading, MA (USA), 1970
[QUAL] QUALITY CONTROL
G.A. Barnard, G.E.P. Box, D. Cox, A.H. Seheult, and B.W. Silverman (Eds.), Industrial Quality and Productivity with Statistical Methods. The Royal Society, London (UK), 1989
D.H. Besterfield, Quality Control. Prentice-Hall, London (UK), 1979
M.M.W.B. Hendriks, J.H. de Boer, A.K. Smilde and D.A. Doornbos, Multicriteria decision making. Chemolab, 175 (1992)
G. Kateman and F.W. Pijpers, Quality Control in Analytical Chemistry. Wiley, New York, NY (USA), 1981
D.C. Montgomery, Introduction to Statistical Quality Control. Wiley, New York, NY (USA), 1985
D.J. Wheeler and D.S. Chambers, Understanding Statistical Process Control. Addison-Wesley, Avon (UK), 1990
[REGR] REGRESSION ANALYSIS
F.J. Anscombe and J.W. Tukey, The examination and analysis of residuals. Technometrics, 141 (1963)
A.C. Atkinson, Plots, Transformations, and Regression. Oxford Univ. Press, Oxford (UK), 1985
D.M. Bates and D.G. Watts, Nonlinear Regression Analysis. Wiley, New York, NY (USA), 1988
D.A. Belsey, E. Kuh, and R.E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York, NY (USA), 1980
L. Breiman and J.H. Friedman, Estimating Optimal Transformations for Multiple Regression and Correlation. J. Am. Statist. Assoc., 80, 580 (1985)
S. Chatterjee and B. Price, Regression Analysis by Example. Wiley, New York, NY (USA), 1977
J. Cohen and P. Cohen, Applied Multiple Regression-Correlation Analysis for the Behavioral Sciences. Halsted, New York, NY (USA), 1975
R.D. Cook and S. Weisberg, Residuals and Influence in Regression. Chapman & Hall, New York, NY (USA), 1982
C. Daniel and F.S. Wood, Fitting Equations to Data. Wiley, New York, NY (USA), 1980 (2nd ed.)
N. Draper and H. Smith, Applied Regression Analysis. Wiley, New York, NY (USA), 1966-1981 (2nd ed.)
I.E. Frank, Intermediate least squares regression method. Chemolab, 1, 233 (1987)
I.E. Frank, A nonlinear PLS model. Chemolab, 8, 109 (1990)
I.E. Frank and J.H. Friedman, A statistical view of some chemometrics regression tools. Technometrics, 35, 109 (1993)
J.H. Friedman, Multivariate adaptive regression splines. The Annals of Statistics, 19, 1 (1991)
J.H. Friedman and W. Stuetzle, Projection pursuit regression. J. Am. Statist. Assoc., 76, 817 (1981)
T. Gasser and M. Rosenblatt (Eds.), Smoothing Techniques for Curve Estimation. Springer-Verlag, Berlin (GER), 1979
M.H.J. Gruber, Regression Estimators: a comparative study. Academic Press, San Diego, CA (USA), 1990
R.F. Gunst and R.L. Mason, Regression Analysis and Its Application: A Data-Oriented Approach. Marcel Dekker, New York, NY (USA), 1980
F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel, Robust Statistics: The Approach Based on Influence Functions. Wiley, New York, NY (USA), 1986
D.M. Hawkins, On the Investigation of Alternative Regression by Principal Components Analysis. Applied Statistics, 22, 275 (1973)
R.R. Hocking, Developments in linear regression methodology: 1959-1982. Technometrics, 25, 219 (1983)
R.R. Hocking, The analysis and selection of variables in linear regression. Biometrics, 32, 1 (1976)
A.E. Hoerl and R.W. Kennard, Ridge Regression: Biased estimation for non-orthogonal problems. Technometrics, 12, 55 (1970)
A. Hoskuldsson, PLS Regression Methods. J. Chemometrics, 2, 211 (1988)
D.G. Kleinbaum and L.L. Kupper, Applied Regression Analysis and Other Multivariable Methods. Duxbury Press, North Scituate, MA (USA), 1978
K.G. Kowalski, On the predictive performance of biased regression methods and multiple linear regression. Chemolab, 177 (1990)
O. Kvalheim, The Latent Variable. Chemolab, 14, 1 (1992)
O. Kvalheim and T.V. Karstang, Interpretation of latent-variable regression models. Chemolab, 7, 39 (1989)
A. Lorber, L.E. Wangen, and B.R. Kowalski, A Theoretical Foundation for the PLS Algorithm. J. Chemometrics, 1, 19 (1987)
D.W. Marquardt, Generalized Inverses, Ridge Regression, Biased Linear Estimation, and Nonlinear Estimation. Technometrics, 12, 591 (1970)
H. Martens and T. Næs, Multivariate Calibration. Wiley, New York, NY (USA), 1989
A.J. Miller, Subset Selection in Regression. Chapman & Hall, London (UK), 1990
F. Mosteller and J.W. Tukey, Data Analysis and Regression. Addison-Wesley, Reading, MA (USA), 1977
R.H. Myers, Classical and Modern Regression with Applications. Duxbury Press, Boston, MA (USA), 1986
T. Næs, C. Irgens, and H. Martens, Comparison of Linear Statistical Methods for Calibration of NIR Instruments. Applied Statistics, 35, 195 (1986)
J. Neter and W. Wasserman, Applied Linear Statistical Models. Irwin, Homewood, IL (USA), 1974
C.R. Rao, Linear Statistical Inference and Its Applications. Wiley, New York, NY (USA), 1973 (2nd ed.)
P.J. Rousseeuw and A.M. Leroy, Robust Regression and Outlier Detection. Wiley, New York, NY (USA), 1987
G.A.F. Seber, Linear Regression Analysis. Wiley, New York, NY (USA), 1977
S. Sekulic and B.R. Kowalski, MARS: a tutorial. J. Chemometrics, 6, 199 (1992)
J. Shorter, Correlation Analysis of Organic Reactivity: With Particular Reference to Multiple Regression. Research Studies Press, Chichester (UK), 1982
G. Wahba, Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia, PA (USA), 1990
J.T. Webster, R.F. Gunst, and R.L. Mason, Latent Root Regression Analysis. Technometrics, 16, 513 (1974)
S. Weisberg, Applied Linear Regression. Wiley, New York, NY (USA), 1980
G.B. Wetherill, Regression Analysis with Applications. Chapman & Hall, London (UK), 1986
S. Wold, N. Kettaneh-Wold and B. Skagerberg, Nonlinear PLS modeling. Chemolab, 7, 53 (1989)
S. Wold, P. Geladi, K. Esbensen, and J. Oehman, Multi-way Principal Components and PLS Analysis. J. Chemometrics, 1, 41 (1987)
C. Yale and A.B. Forsythe, Winsorized regression. Technometrics, 18, 291 (1976)
M.S. Younger, Handbook for Linear Regression. Duxbury Press, North Scituate, MA (USA), 1979
[TEST] HYPOTHESIS TESTING
J.V. Bradley, Distribution-free Statistical Tests. Prentice-Hall, Englewood Cliffs, NJ (USA), 1968
L.N.H. Bunt, Probability and Hypothesis Testing. Harrap, London (UK), 1968
E. Caulcott, Significance Tests. Routledge and Kegan Paul, London (UK), 1973
E.S. Edgington, Randomization Tests. Marcel Dekker, New York, NY (USA), 1980
K.R. Koch, Parameter Estimation and Hypothesis Testing in Linear Models. Springer-Verlag, Berlin (GER), 1988
E.L. Lehmann, Testing Statistical Hypotheses. Wiley, New York, NY (USA), 1986
[TIME] TIME SERIES
B.L. Bowerman, Forecasting and Time Series. Duxbury Press, Belmont, CA (USA), 1993
G.E.P. Box and G.M. Jenkins, Time Series Analysis. Holden-Day, San Francisco, CA (USA), 1976
C. Chatfield, The Analysis of Time Series: An Introduction. Chapman & Hall, London (UK), 1984
J.D. Cryer, Time Series Analysis. Duxbury Press, Boston, MA (USA), 1986
P.J. Diggle, Time Series: A Biostatistical Introduction. Clarendon Press, Oxford (UK), 1990
E.J. Hannan, Time Series Analysis. Methuen, London (UK), 1960
A.C. Harvey, Time Series Models. Wiley, New York, NY (USA), 1981
M. Kendall and J.K. Ord, Time Series. Edward Arnold, London (UK), 1990
D.C. Montgomery, L.A. Johnson, and J.S. Gardiner, Forecasting and Time Series Analysis. McGraw-Hill, New York, NY (USA), 1990
R.H. Shumway, Applied Statistical Time Series Analysis. Prentice-Hall, Englewood Cliffs, NJ (USA), 1988
GENERAL STATISTICS AND CHEMOMETRICS
J. Aitchison, The Statistical Analysis of Compositional Data. Chapman & Hall, London (UK), 1986
J. Aitchison, Statistics for Geoscientists. Pergamon Press, Oxford (UK), 1987
S.F. Arnold, Mathematical Statistics. Prentice-Hall, Englewood Cliffs, NJ (USA), 1990
V. Barnett and T. Lewis, Outliers in Statistical Data. Wiley, New York, NY (USA), 1978
J.J. Breen and P.E. Robinson, Environmental Applications of Chemometrics. ACS Symposium Series, vol. 292, Am. Chem. Soc., Washington, D.C. (USA), 1985
R.G. Brereton, Chemometrics: Applications of mathematics and statistics to laboratory systems. Ellis Horwood, Chichester (UK), 1990
D.T. Chapman and A.H. El-Shaarawi, Statistical Methods for the Assessment of Point Source Pollution. Kluwer Academic Publishers, Dordrecht (NL), 1989
D.J. Finney, Statistical Methods in Biological Assay. Griffin, Oxford (UK), 1978
M. Forina, Introduzione alla Chimica Analitica con elementi di Chemiometria. ECIG, Genova (IT), 1993
D.M. Hawkins, Identification of Outliers. Chapman & Hall, London (UK), 1980
D.C. Hoaglin, F. Mosteller, and J.W. Tukey, Understanding Robust and Exploratory Data Analysis. Wiley, New York, NY (USA), 1983
W.J. Kennedy and J.E. Gentle, Statistical Computing. Marcel Dekker, New York, NY (USA), 1980
B.R. Kowalski (Ed.), Chemometrics: Theory and Application. ACS Symposium Series, vol. 52, Am. Chem. Soc., Washington, D.C. (USA), 1977
B.R. Kowalski (Ed.), Chemometrics, Mathematics, and Statistics in Chemistry. Proceedings of the NATO ASI, Cosenza 1983. Reidel Publishers, Dordrecht (NL), 1984
D.L. Massart, R.G. Brereton, R.E. Dessy, P.K. Hopke, C.H. Spiegelman, and W. Wegscheider (Eds.), Chemometrics Tutorial. Collected from Chemolab, Vol. 1-5, Elsevier, Amsterdam (NL), 1990
D.L. Massart, A. Dijkstra, and L. Kaufman, Evaluation and Optimization of Laboratory Methods and Analytical Procedures. Elsevier, Amsterdam (NL), 1978-1984 (3rd ed.)
D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte, and L. Kaufman, Chemometrics: A Textbook. Elsevier, Amsterdam (NL), 1988
M. Meloun, J. Militky and M. Forina, Chemometrics for Analytical Chemistry. Volume 1: PC-Aided Statistical Data Analysis. Ellis Horwood, New York, NY (USA), 1992
M.A. Sharaf, D.A. Illman, and B.R. Kowalski, Chemometrics. Wiley, New York, NY (USA), 1986
J.W. Tukey, Exploratory Data Analysis. Addison-Wesley, Reading, MA (USA), 1977
J.H. Zar, Biostatistical Analysis. Prentice-Hall, Englewood Cliffs, NJ (USA), 1984 (2nd ed.)