Linear and Nonlinear Models: Fixed Effects, Random Effects, and Mixed Models
Erik W. Grafarend
Walter de Gruyter · Berlin · New York
Author: Erik W. Grafarend, em. Prof. Dr.-Ing. habil. Dr. tech. h.c. mult. Dr.-Ing. E.h. mult., Geodätisches Institut, Universität Stuttgart, Geschwister-Scholl-Str. 24/D, 70174 Stuttgart, Germany. E-Mail: [email protected]
Printed on acid-free paper which falls within the guidelines of the ANSI to ensure permanence and durability.
Library of Congress Cataloging-in-Publication Data: Grafarend, Erik W. Linear and nonlinear models: fixed effects, random effects, and mixed models / by Erik W. Grafarend. p. cm. Includes bibliographical references and index. ISBN-13: 978-3-11-016216-5 (hardcover: acid-free paper). ISBN-10: 3-11-016216-4 (hardcover: acid-free paper). 1. Regression analysis. 2. Mathematical models. I. Title. QA278.2.G726 2006 519.5136–dc22 2005037386
Bibliographic information published by Die Deutsche Bibliothek: Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet at <http://dnb.ddb.de>.
ISBN-13: 978-3-11-016216-5. ISBN-10: 3-11-016216-4. © Copyright 2006 by Walter de Gruyter GmbH & Co. KG, 10785 Berlin. All rights reserved, including those of translation into foreign languages. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Printed in Germany. Cover design: Rudolf Hübler, Berlin. Typeset using the author's word files: M. Pfizenmaier, Berlin. Printing and binding: Hubert & Co. GmbH & Co. KG, Göttingen.
Preface

"All exact science is dominated by the idea of approximation." B. Russell

"You must always invert." C. G. J. Jacobi

"Well, Mr. Jacobi; here it is: all the generalized inversion of two generations of inventors who knowingly or unknowingly subscribed and extended your dictum. Please, forgive us if we have over-inverted, or if we have not always inverted in the natural and sensible way. Some of us have inverted with labor and pain by using hints from a dean or a tenure and promotion committee that 'you better invert more, or else you would be inverted.'" M. Z. Nashed, L. B. Rall

There is a certain intention in reviewing linear and nonlinear models from the point of view of fixed effects, random effects and mixed models. First, we want to portray the different models from the algebraic point of view – for instance a minimum norm, least squares solution (MINOLESS) – versus the stochastic point of view – for instance a minimum bias, minimum variance "best" solution (BLIMBE). We are especially interested in the question under which assumptions the algebraic solution coincides with the stochastic solution, for instance when MINOLESS is identical to BLIMBE. The stochastic approach is richer with respect to modeling. Besides the first order moments, the expectation of a random variable, we also need a design for the central second order moments, the variance-covariance matrix of the random variable, as long as we deal with second order statistics. Second, we therefore set up a unified approach to estimate (predict) the first order moments, for instance by BLUUE (BLUUP), and the central second order moments, for instance by BIQUUE, if they exist. In short, BLUUE (BLUUP) stands for "Best Linear Uniformly Unbiased Estimation" (Prediction) and BIQUUE alternatively for "Best Invariant Quadratic Uniformly Unbiased Estimation". A third criterion is the decision whether the observation vector is inconsistent or random, whether the unknown parameter vector is random or not, whether the "first design matrix" within a linear model is random or not and, finally, whether the "mixed model" E{y} = Aξ + CE{z} + E{X}γ has to be applied, if we restrict ourselves to linear models. How to handle a nonlinear model where we have a priori information about approximate values will be outlined in detail. As a special case we also deal with "condition equations with unknowns" BE{y} − c = Aξ where the matrices/vector {A, B, c} are given and the observation vector y is again a random variable.
A fourth problem is related to the question of what is happening when we take observations not over ℝⁿ (the n-dimensional real linear space) but over Sⁿ (circle S¹, sphere S², ..., hypersphere Sⁿ), over Eⁿ (ellipse E¹, ellipsoid E², ..., hyperellipsoid Eⁿ), in short over a curved manifold. We show in particular that the circular variables are elements of a von Mises distribution or that the spherical variables are elements of a von Mises-Fisher distribution. A more detailed discussion is in front of you.

The first problem of algebraic regression in Chapter one is constituted by a consistent system of linear observational equations of type underdetermined system of linear equations. So we may say "more unknowns than equations". We solve the corresponding system of linear equations by an optimization problem which we call the minimum norm solution (MINOS). We discuss the semi-norm solution of Special Relativity and General Relativity and alternative norms of type ℓ_p. For "MINOS" we identify the typical generalized inverse and the eigenvalue decomposition for G_x-MINOS. For our Front Page Example we compute canonical MINOS. Special examples are Fourier series and Fourier-Legendre series, namely circular harmonic and spherical harmonic regression. Special nonlinear models include Taylor polynomials and generalized Newton iteration, for the case of a planar triangular network as an example whose nodal points are a priori coordinated. The representation of the proper objective function of type "MINOS" is finally given for a defective network (P-diagram, E-diagram). The transformation groups for observed coordinate differences (translation groups T(2), T(3), ..., T(n)), for observed distances (groups of motion T(2) ⋊ SO(2), T(3) ⋊ SO(3), ..., T(n) ⋊ SO(n)), for observed angles or distance ratios (conformal groups C₄(2), C₇(3), ..., C_{(n+1)(n+2)/2}(n)) and for observed cross-ratios of area elements in the projective plane (projective group) are reviewed with their datum parameters.

Alternatively, the first problem of probabilistic regression – the special Gauss-Markov model with datum defect – namely the setup of the linear uniformly minimum bias estimator of type LUMBE for fixed effects is introduced in Chapter two. We define the first moment equations Aξ = E{y} and the second central moment equations Σ_y = D{y} and estimate the fixed effects by the homogeneous linear setup ξ̂ = Ly of type S-LUMBE by the additional postulate of minimum bias ‖B‖²_S = ‖I_m − LA‖²_S, where B := I_m − LA is the bias matrix. When are G_x-MINOS and S-LUMBE equivalent? The necessary and sufficient condition is G_x = S⁻¹ or G_x⁻¹ = S, a key result. We give at the end an extensive example.

The second problem of algebraic regression in Chapter three treats an inconsistent system of linear observational equations of type overdetermined system of linear equations. Or we may say "more observations than unknowns". We solve the corresponding system of linear equations by an optimization problem which we call the least squares solution (LESS). We discuss the signature of the observation space when dealing with Special Relativity and alternative norms of type ℓ_p, namely ℓ₂, ..., ℓ_p, ..., ℓ_∞. For extensive applications we discuss various objective functions like (i) optimal choice of the weight matrix G_y: second order design SOD, (ii) optimal choice of the weight matrix G_y by means of condition equations, and (iii) robustifying objective functions.
In all detail we introduce the second order design SOD by an optimal choice of a criterion matrix of weights.
What is the proper choice of an ideal weight matrix G_x? Here we propose the Taylor-Karman matrix borrowed from the Theory of Turbulence which generates a homogeneous and isotropic weight matrix G_x (ideal). Based upon the fundamental work of G. Kampmann, R. Jurisch and B. Krause we robustify G_y-LESS and identify outliers. In particular we identify Grassmann-Plücker coordinates which span the normal space R(A)⊥. We pay a short tribute to Fuzzy Sets. In some detail we identify G_y-LESS and its generalized inverse. Canonical LESS is based on the eigenvalue decomposition of G_y-LESS, illustrated by an extensive example. As a case study we pay attention to partial redundancies, latent conditions, high leverage points versus break points, direct and inverse Grassmann coordinates, Plücker coordinates, the "hat" matrix, right eigenspace analysis, multilinear algebra, "join" and "meet", the Hodge star operator, dual Grassmann coordinates, dual Plücker coordinates and the Gauss-Jacobi Combinatorial Algorithm, concluding with a historical note on C. F. Gauss, A. M. Legendre and the invention of Least Squares and its generalization.

Alternatively, the second problem of probabilistic regression in Chapter four – the special Gauss-Markov model without datum defect – namely the setup of the best linear uniformly unbiased estimator for the first order moments of type BLUUE and of the best invariant quadratic uniformly unbiased estimator for the central second order moments of type BIQUUE for random observations is introduced. First, we define ξ̂ of type Σ_y-BLUUE by two lemmas and a theorem. Alternatively, second, we set up by four definitions and by six corollaries, five lemmas and two theorems IQE ("invariant quadratic estimation") and best IQUUE ("best invariant quadratic uniformly unbiased estimator"). Alternative estimators of type MALE ("maximum likelihood") are reviewed. Special attention is paid to the "IQUUE" of variance-covariance components of Helmert type called "HIQUUE" and "MIQE". For the case of one variance component, we are able to find necessary and sufficient conditions when LESS agrees with BLUUE, namely G_y = Σ_y⁻¹ or G_y⁻¹ = Σ_y, a key result.

The third problem of algebraic regression in Chapter five – the inconsistent system of linear observational equations with datum defect: overdetermined-underdetermined system – presents us with three topics. First, by one definition and five lemmas we document the minimum norm, least squares solution ("MINOLESS"). Second, we review the general eigenspace analysis versus the general eigenspace synthesis. Third, special estimators of type "α-hybrid approximation solution" ("α-HAPS") and "Tykhonov-Phillips regularization" round up the alternative estimators.

Alternatively, the third problem of probabilistic regression in Chapter six – the special Gauss-Markov model with datum problem – namely the setup of estimators of type "BLIMBE" and "BLE" for the moments of first order and of type "BIQUUE" and "BIQE" for the central moments of second order, is reviewed. First, we define ξ̂ as homogeneous Σ_y, S-BLUMBE ("Σ_y, S – best linear uniformly minimum bias estimator") and compute via two lemmas and three theorems "hom Σ_y, S-BLUMBE", E{y}, D{Aξ̂}, D{e_y} as well as "σ̂²-BIQUUE" and "σ̂²-BIQE" of σ². Second, by three definitions and one lemma and three theorems we work on "hom BLE", "hom S-BLE", "hom α-BLE". Extensive examples are given. For the case of one variance component we are able to find
necessary and sufficient conditions when MINOLESS agrees with BLIMBE, namely G_x = S⁻¹, G_y = Σ_y⁻¹ or G_x⁻¹ = S, G_y⁻¹ = Σ_y, a key result.

As a spherical problem of algebraic representation we treat an incomplete system of directional observational equations, namely an overdetermined system of nonlinear equations on curved manifolds (circle, sphere, hypersphere S^p). We define what we mean by minimum geodesic distance on S¹ and S² and present two lemmas on S¹ and two lemmas on S² of type minimum geodesic distances. In particular, we take reference to the von Mises distribution on the circle, to the Fisher spherical distribution on the sphere and, in general, to the Langevin sphere S^p ⊂ ℝ^{p+1}. The minimal geodesic distance ("MINGEODISC") is computed for Λ_g and (Λ_g, Φ_g). We solve the corresponding nonlinear normal equations. In conclusion, we present a historical note on the von Mises distribution and generalize to the two-dimensional generalized Fisher distribution by an oblique map projection. At the end, we summarize the notion of angular metric and give an extensive case study.

The fourth problem of probabilistic regression in Chapter eight, a special Gauss-Markov model with random effects, is described as "BLIP" and "VIP" for the moments of first order. Definitions are given for hom BLIP ("homogeneous best linear Mean Square Predictor"), S-hom BLIP ("homogeneous linear minimum S-modified Mean Square Predictor") and hom α-VIP ("homogeneous linear minimum variance-minimum bias in the sense of a weighted hybrid norm solution"). One lemma and three theorems collect the results for (i) hom BLIP, (ii) hom S-BLIP and (iii) hom α-VIP. In all detail, we compute the predicted solution for the random effects, its bias vector and the Mean Square Prediction Error MSPE. Three cases of nonlinear error propagation with random effects are discussed.

In Chapter nine we specialize towards the fifth problem of algebraic regression, namely the system of conditional equations of homogeneous and inhomogeneous type. We follow two definitions, one theorem and three lemmas of type G_y-LESS before we present an example from angular observations.

In Chapter ten we treat the fifth problem of probabilistic regression, the Gauss-Markov model with mixed effects, in setting up BLUUE estimators for the moments of first order, a special case of Kolmogorov-Wiener prediction. After defining Σ_y-BLUUE of ξ and E{z}, where z is a random variable, we present two lemmas and one theorem on how to construct estimators ξ̂, Ê{z} on the basis of Σ_y-BLUUE of ξ and E{z}. By a separate theorem we fix a homogeneous quadratic setup of the variance component σ̂² within the first model of fixed effects and random effects superimposed. As an example we present "collocation", enriched by a set of comments about A. N. Kolmogorov – N. Wiener prediction, the so-called "yellow devil".

Chapter eleven leads us to the "sixth problem of probabilistic regression", the celebrated random effect model "errors-in-variables". We outline the model and sum up the theory of normal equations. Our example is the linear equation E{y} = E{X}γ where the first order design matrix is random. An alternative name is "straight line fit by total least squares". Finally we give a detailed example and a literature list.
C. F. Gauss and F. R. Helmert introduced the sixth problem of generalized algebraic regression, the system of conditional equations with unknowns, which we proudly present in Chapter twelve. First, we define W-LESS of the model Ax + Bi = By where i is an inconsistency parameter. In two lemmas we solve its normal equations and discuss the condition on the matrices A and B. Two alternative solutions, based on R, W-MINOLESS (two lemmas, one definition) and R, W-HAPS (one lemma, one definition), are given separately. An example is reviewed as a height network. For shifted models of type Ax + Bi = By − c similar results are summarized.

For the special nonlinear problem of the 3d datum transformation in Chapter thirteen we review the famous Procrustes Algorithm. With the algorithm we consider the coupled unknowns of type dilatation, also called scale factor, translation and rotation for random variables of 3d coordinates in an "old system" and in a "new system". After the definition of the conformal group C₇(3) in a three-dimensional network with 7 unknown parameters we present four corollaries and one theorem: first, we reduce the translation parameters, second the scale parameters and, last but not least, third the rotation parameters, bound together in a theorem. A special result is the computation of the variance-covariance matrix of the observation array E := Y₁ − Y₂X₃'x₁ − 1x₂' as a function of Σ_{vec Y₁'}, Σ_{vec Y₂'}, Σ_{vec Y₁', vec Y₂'} and (I_n ⊗ x₁X₃). A detailed example of type I-LESS is given, including a discussion about ‖E_l‖ and |||E_l|||, precisely defined. Here we conclude with a reference list.
Chapter fourteen, as our sixth problem of type generalized algebraic regression "revisited", deals with "The Grand Linear Model", namely the split level of conditional equations with unknowns (general Gauss-Helmert model). The linear model consists of 3 components: (i) B₁i = B₁y − c₁, (ii) A₂x + B₂i = B₂y − c₂, c₂ ∈ R(B₂), and (iii) A₃x − c₃ = 0 or A₃x + c₃ = 0, c₃ ∈ R(A₃). The first equation contains only conditions on the observation vector. In contrast, the second equation balances condition equations between the unknown parameters in the form of A₂x and the conditions B₂y − c₂. Finally, the third equation is a condition exclusively on the unknown parameter vector. For our model Lemma 14.1 presents the W-LESS solution, Lemma 14.2 the R, W-MINOLESS solution and Lemma 14.3 the R, W-HAPS solution. As an example we treat a planar triangle whose coordinates are determined from three measured distances under a datum condition.

Chapter fifteen is concerned with three topics. First, we generalize the univariate Gauss-Markov model to the multivariate Gauss-Markov model with and without constraints. We present two definitions, one lemma about multivariate LESS, one lemma about its counterpart of type multivariate Gauss-Markov modeling and one theorem of type multivariate Gauss-Markov modeling with constraints. Second, by means of a MINOLESS solution we present the celebrated "n-way classification model" to answer the question of how to compute a basis of unbiased estimable quantities. Third, we take into account the fact that in addition to observational models we also have dynamical system equations. In some detail, we review the Kalman Filter (Kalman-Bucy Filter) and models of type ARMA and ARIMA. We illustrate the notions of "steerability" and "observability" by two examples. The state differential equation as well as the observational equation are simultaneously solved by Laplace transformation. At the end we focus on
the modern theory of dynamic nonlinear models and comment on the theory of chaotic behavior as its up-to-date counterpart.

In the appendices we specialize on specific topics. Appendix A is a review of matrix algebra, namely special matrices, scalar measures and inverse matrices, eigenvalues and eigenvectors, and generalized inverses. The counterpart is matrix analysis which we outline in Appendix B. We begin with derivations of scalar-valued and vector-valued vector functions, followed by a chapter on derivations of trace forms and determinantal forms. A specialty is the derivation of a vector/matrix function of a vector/matrix. We learn how to derive the Kronecker-Zehfuß product and matrix-valued symmetric or antisymmetric matrix functions. Finally we show how to compute higher order derivatives. Appendix C is an elegant review of Lagrange multipliers. The lengthy Appendix D introduces sampling distributions and their use: confidence intervals and confidence regions. As peculiar vehicles we show how to transform random variables. A first confidence interval of Gauss-Laplace normally distributed observations is computed for the case μ, σ² known, for example the Three Sigma Rule. A second confidence interval, for sampling from the Gauss-Laplace normal distribution, is built for the mean on the assumption that the variance is known. The alternative sampling from the Gauss-Laplace normal distribution leads to the third confidence interval for the mean, variance unknown, based on the Student sampling distribution. The fourth confidence interval, for the variance, is based on the analogous sampling for the variance based on the χ²-Helmert distribution. For both intervals of confidence, namely the one based on the Student sampling distribution for the mean, variance unknown, and the one based on the χ²-Helmert distribution for the sample variance, we compute the corresponding Uncertainty Principle. The case of a multidimensional Gauss-Laplace normal distribution is outlined for the computation of confidence regions for fixed parameters in the linear Gauss-Markov model. Key statistical notions like moments of a probability distribution, the Gauss-Laplace normal distribution (quasi-normal distribution), error propagation as well as the important notions of identifiability and unbiasedness are reviewed. We close with bibliographical indices.

Here we are not solving rank-deficient or ill-posed problems using UTV or QR factorization techniques. Instead we refer to A. Björck (1996), P. Businger and G. H. Golub (1965), T. F. Chan and P. C. Hansen (1991, 1992), S. Chandrasekaran and I. C. Ipsen (1995), R. D. Fierro (1998), R. D. Fierro and J. R. Bunch (1995), R. D. Fierro and P. C. Hansen (1995, 1997), L. V. Foster (2003), G. Golub and C. F. van Loan (1996), P. C. Hansen (1990a, b, 1992, 1994, 1995, 1998), Y. Hosada (1999), C. L. Lawson and R. J. Hanson (1974), R. Mathias and G. W. Stewart (1993), A. Neumaier (1998), H. Ren (1996), G. W. Stewart (1992, 1992, 1998), L. N. Trefethen and D. Bau (1997).

My special thanks for numerous discussions go to J. Awange (Kyoto/Japan), A. Bjerhammar (Stockholm/Sweden), F. Brunner (Graz/Austria), J. Cai (Stuttgart/Germany), A. Dermanis (Thessaloniki/Greece), W. Freeden (Kaiserslautern/Germany), R. Jurisch (Dessau/Germany), J. Kakkuri (Helsinki/Finland), G. Kampmann (Dessau/Germany), K. R. Koch (Bonn/Germany), F. Krumm (Stuttgart/Germany), O. Lelgemann (Berlin/Germany), H. Moritz (Graz/Austria), F. Sanso (Milano/Italy), B. Schaffrin (Columbus/Ohio/USA), L. Sjoeberg (Stockholm/Sweden),
N. Sneeuw (Calgary/Canada), L. Svensson (Gävle/Sweden), P. Vanicek (Fredericton/New Brunswick/Canada). For the book production I want to thank in particular J. Cai, F. Krumm, A. Vollmer, M. Paweletz, T. Fuchs, A. Britchi, and D. Wilhelm (all from Stuttgart/Germany). At the end my sincere thanks go to the Walter de Gruyter Publishing Company for including my book into their Geoscience Series, in particular to Dr. Manfred Karbe and Dr. Robert Plato for all their support.
Stuttgart, December 2005
Erik W. Grafarend
Contents

1 The first problem of algebraic regression – consistent system of linear observational equations – underdetermined system of linear equations: {Ax = y | A ∈ ℝ^{n×m}, y ∈ R(A), rk A = n, n = dim Y}
1-1 Introduction
1-11 The front page example
1-12 The front page example in matrix algebra
1-13 Minimum norm solution of the front page example by means of horizontal rank partitioning
1-14 The range R(f) and the kernel N(A)
1-15 Interpretation of "MINOS" by three partitionings
1-2 The minimum norm solution: "MINOS"
1-21 A discussion of the metric of the parameter space X
1-22 Alternative choice of the metric of the parameter space X
1-23 G_x-MINOS and its generalized inverse
1-24 Eigenvalue decomposition of G_x-MINOS: canonical MINOS
1-3 Case study: Orthogonal functions, Fourier series versus Fourier-Legendre series, circular harmonic versus spherical harmonic regression
1-31 Fourier series
1-32 Fourier-Legendre series
1-4 Special nonlinear models
1-41 Taylor polynomials, generalized Newton iteration
1-42 Linearized models with datum defect
1-5 Notes
2 The first problem of probabilistic regression – special Gauss-Markov model with datum defect – Setup of the linear uniformly minimum bias estimator of type LUMBE for fixed effects
2-1 Setup of the linear uniformly minimum bias estimator of type LUMBE
2-2 The Equivalence Theorem of G_x-MINOS and S-LUMBE
2-3 Examples
3 The second problem of algebraic regression – inconsistent system of linear observational equations – overdetermined system of linear equations: {Ax + i = y | A ∈ ℝ^{n×m}, y ∉ R(A), rk A = m, m = dim X}
3-1 Introduction
3-11 The front page example
3-12 The front page example in matrix algebra
3-13 Least squares solution of the front page example by means of vertical rank partitioning
3-14 The range R(f) and the kernel N(f), interpretation of the least squares solution by three partitionings
3-2 The least squares solution: "LESS"
3-21 A discussion of the metric of the parameter space X
3-22 Alternative choices of the metric of the observation space Y
3-221 Optimal choice of the weight matrix: SOD
3-222 The Taylor-Karman criterion matrix
3-223 Optimal choice of the weight matrix: the spaces R(A) and R(A)⊥
3-224 Fuzzy sets
3-23 G_x-LESS and its generalized inverse
3-24 Eigenvalue decomposition of G_y-LESS: canonical LESS
3-3 Case study: Partial redundancies, latent conditions, high leverage points versus break points, direct and inverse Grassmann coordinates, Plücker coordinates
3-31 Canonical analysis of the hat matrix, partial redundancies, high leverage points
3-32 Multilinear algebra, "join" and "meet", the Hodge star operator
3-33 From A to B: latent restrictions, Grassmann coordinates, Plücker coordinates
3-34 From B to A: latent parametric equations, dual Grassmann coordinates, dual Plücker coordinates
3-35 Break points
3-4 Special linear and nonlinear models: A family of means for direct observations
3-5 A historical note on C. F. Gauss, A.-M. Legendre and the invention of Least Squares and its generalization
4 The second problem of probabilistic regression – special Gauss-Markov model without datum defect – Setup of BLUUE for the moments of first order and of BIQUUE for the central moment of second order
4-1 Introduction
4-11 The front page example
4-12 Estimators of type BLUUE and BIQUUE of the front page example
4-13 BLUUE and BIQUUE of the front page example, sample median, median absolute deviation
4-14 Alternative estimation Maximum Likelihood (MALE)
4-2 Setup of the best linear uniformly unbiased estimators of type BLUUE for the moments of first order
4-21 The best linear uniformly unbiased estimation ξ̂ of ξ: Σ_y-BLUUE
4-22 The Equivalence Theorem of G_y-LESS and Σ_y-BLUUE
4-3 Setup of the best invariant quadratic uniformly unbiased estimator of type BIQUUE for the central moments of second order
4-31 Block partitioning of the dispersion matrix and linear space generated by variance-covariance components
4-32 Invariant quadratic estimation of variance-covariance components of type IQE
4-33 Invariant quadratic uniformly unbiased estimations of variance-covariance components of type IQUUE
4-34 Invariant quadratic uniformly unbiased estimations of one variance component (IQUUE) from Σ_y-BLUUE: HIQUUE
4-35 Invariant quadratic uniformly unbiased estimators of variance-covariance components of Helmert type: HIQUUE versus HIQE
4-36 Best quadratic uniformly unbiased estimations of one variance component: BIQUUE
5 The third problem of algebraic regression – inconsistent system of linear observational equations with datum defect – overdetermined-underdetermined system of linear equations: {Ax + i = y | A ∈ ℝ^{n×m}, y ∉ R(A), rk A < min{m, n}}
5-1 Introduction
5-11 The front page example
5-12 The front page example in matrix algebra
5-13 Minimum norm - least squares solution of the front page example by means of additive rank partitioning
5-14 Minimum norm - least squares solution of the front page example by means of multiplicative rank partitioning
5-15 The range R(f) and the kernel N(f), interpretation of "MINOLESS" by three partitionings
5-2 MINOLESS and related solutions like weighted minimum norm - weighted least squares solutions
5-21 The minimum norm-least squares solution: "MINOLESS"
5-22 (G_x, G_y)-MINOS and its generalized inverse
5-23 Eigenvalue decomposition of (G_x, G_y)-MINOLESS
5-24 Notes
5-3 The hybrid approximation solution: α-HAPS and Tykhonov-Phillips regularization
6 The third problem of probabilistic regression – special Gauss-Markov model with datum problem – Setup of BLUMBE and BLE for the moments of first order and of BIQUUE and BIQE for the central moment of second order
6-1 Setup of the best linear minimum bias estimator of type BLUMBE
6-11 Definitions, lemmas and theorems
6-12 The first example: BLUMBE versus BLE, BIQUUE versus BIQE, triangular leveling network
6-121 The first example: I₃, I₃-BLUMBE
6-122 The first example: V, S-BLUMBE
6-123 The first example: I₃, I₃-BLE
6-124 The first example: V, S-BLE
6-2 Setup of the best linear estimators of type hom BLE, hom S-BLE and hom α-BLE for fixed effects
7 A spherical problem of algebraic representation – Inconsistent system of directional observational equations – overdetermined system of nonlinear equations on curved manifolds
7-1 Introduction
7-2 Minimal geodesic distance: MINGEODISC
7-3 Special models: from the circular normal distribution to the oblique normal distribution
7-31 A historical note of the von Mises distribution
7-32 Oblique map projection
7-33 A note on the angular metric
7-4 Case study
8 The fourth problem of probabilistic regression – special Gauss-Markov model with random effects – Setup of BLIP and VIP for the moments of first order
8-1 The random effect model
8-2 Examples
9 The fifth problem of algebraic regression – the system of conditional equations: homogeneous and inhomogeneous equations {By = Bi versus c + By = Bi}
9-1 G_y-LESS of a system of inconsistent homogeneous conditional equations
9-2 Solving a system of inconsistent inhomogeneous conditional equations
9-3 Examples
10 The fifth problem of probabilistic regression – general Gauss-Markov model with mixed effects – Setup of BLUUE for the moments of first order (Kolmogorov-Wiener prediction)
10-1 Inhomogeneous general linear Gauss-Markov model (fixed effects and random effects)
10-2 Explicit representations of errors in the general Gauss-Markov model with mixed effects
10-3 An example for collocation
10-4 Comments
11 The sixth problem of probabilistic regression – the random effect model – "errors-in-variables"
11-1 Solving the nonlinear system of the model "errors-in-variables"
11-2 Example: The straight line fit
11-3 References
12 The sixth problem of generalized algebraic regression – the system of conditional equations with unknowns (Gauss-Helmert model)
12-1 Solving the system of homogeneous condition equations with unknowns
12-11 W-LESS
12-12 R, W-MINOLESS
12-13 R, W-HAPS
12-14 R, W-MINOLESS against R, W-HAPS
12-2 Examples for the generalized algebraic regression problem: homogeneous conditional equations with unknowns
12-21 The first case: I-LESS
12-22 The second case: I, I-MINOLESS
12-23 The third case: I, I-HAPS
12-24 The fourth case: R, W-MINOLESS, R positive semidefinite, W positive semidefinite
12-3 Solving the system of inhomogeneous condition equations with unknowns
12-31 W-LESS
12-32 R, W-MINOLESS
12-33 R, W-HAPS
12-34 R, W-MINOLESS against R, W-HAPS
12-4 Conditional equations with unknowns: from the algebraic approach to the stochastic one
12-41 Shift to the center
12-42 The condition of unbiased estimators
12-43 The first step: unbiased estimation of ξ̂ and Ê{ξ}
12-44 The second step: unbiased estimation of N₁ and N₂
13 The nonlinear problem of the 3d datum transformation and the Procrustes Algorithm
13-1 The 3d datum transformation and the Procrustes Algorithm
13-2 The variance-covariance matrix of the error matrix E
13-3 Case studies: The 3d datum transformation and the Procrustes Algorithm
13-4 References
14 The seventh problem of generalized algebraic regression revisited: The Grand Linear Model: The split level model of conditional equations with unknowns (general Gauss-Helmert model)
14-1 Solutions of type W-LESS
14-2 Solutions of type R, W-MINOLESS
14-3 Solutions of type R, W-HAPS
14-4 Review of the various models: the sixth problem
15 Special problems of algebraic regression and stochastic estimation: multivariate Gauss-Markov model, the n-way classification model, dynamical systems
15-1 The multivariate Gauss-Markov model – a special problem of probabilistic regression
15-2 n-way classification models
15-21 A first example: 1-way classification
15-22 A second example: 2-way classification without interaction
15-23 A third example: 2-way classification with interaction
15-24 Higher classifications with interaction
15-3 Dynamical Systems
Appendix A: Matrix Algebra
A1 Matrix Algebra
A2 Special Matrices
A3 Scalar Measures and Inverse Matrices
A4 Vector-valued Matrix Forms
A5 Eigenvalues and Eigenvectors
A6 Generalized Inverses
Appendix B: Matrix Analysis
B1 Derivations of Scalar-valued and Vector-valued Vector Functions
B2 Derivations of Trace Forms
B3 Derivations of Determinantal Forms
B4 Derivations of a Vector/Matrix Function of a Vector/Matrix
B5 Derivations of the Kronecker-Zehfuß product
B6 Matrix-valued Derivatives of Symmetric or Antisymmetric Matrix Functions
B7 Higher order derivatives
Appendix C: Lagrange Multipliers
C1 A first way to solve the problem
Appendix D: Sampling distributions and their use: Confidence Intervals and Confidence Regions
D1 A first vehicle: Transformation of random variables
D2 A second vehicle: Transformation of random variables
D3 A first confidence interval of Gauss-Laplace normally distributed observations: μ, σ² known, the Three Sigma Rule
D31 The forward computation of a first confidence interval of Gauss-Laplace normally distributed observations: μ, σ² known
D32 The backward computation of a first confidence interval of Gauss-Laplace normally distributed observations: μ, σ² known
D4 Sampling from the Gauss-Laplace normal distribution: a second confidence interval for the mean, variance known
D41 Sampling distributions of the sample mean μ̂, σ² known, and of the sample variance σ̂²
D42 The confidence interval for the sample mean, variance known
D5 Sampling from the Gauss-Laplace normal distribution: a third confidence interval for the mean, variance unknown
D51 Student's sampling distribution of the random variable (μ̂ − μ)/σ̂
D52 The confidence interval for the sample mean, variance unknown
D53 The Uncertainty Principle
D6 Sampling from the Gauss-Laplace normal distribution: a fourth confidence interval for the variance
D61 The confidence interval for the variance
D62 The Uncertainty Principle
D7 Sampling from the multidimensional Gauss-Laplace normal distribution: the confidence region for the fixed parameters in the linear Gauss-Markov model
Appendix E: Statistical Notions
E1 Moments of a probability distribution, the Gauss-Laplace normal distribution and the quasi-normal distribution
E2 Error propagation
E3 Useful identities
E4 The notions of identifiability and unbiasedness
Appendix F: Bibliographic Indexes
References
Index
1 The first problem of algebraic regression – consistent system of linear observational equations – underdetermined system of linear equations: {Ax = y | A ∈ ℝ^{n×m}, y ∈ R(A), rk A = n, n = dim Y}

Fast track reading: Read only Lemma 1.3.
"The guideline of chapter one: definitions, lemmas and corollary"
Definition 1.1 (x_m, G_x-MINOS of x), Lemma 1.2 (x_m, G_x-MINOS of x), Lemma 1.3 (x_m, G_x-MINOS of x), Lemma 1.4 (characterization of G_x-MINOS), Definition 1.5 (adjoint operator A^#), Lemma 1.6 (adjoint operator A^#), Lemma 1.7 (eigenspace analysis versus eigenspace synthesis), Corollary 1.8 (symmetric pair of eigensystems), Lemma 1.9 (canonical MINOS).
The minimum norm solution of a system of consistent linear equations Ax = y subject to A ∈ ℝ^{n×m}, rk A = n, n < m, is presented by Definition 1.1, Lemma 1.2 and Lemma 1.3. Lemma 1.4 characterizes the solution of the quadratic optimization problem in terms of the (1,2,4)-generalized inverse, in particular the right inverse. The system of consistent nonlinear equations Y = F(X) is solved by means of two examples. Both examples are based on distance measurements in a planar network, namely a planar triangle. In the first example Y = F(X) is linearized at the point x, which is given by prior information, and solved by means of Newton iteration. The minimum norm solution is applied to the consistent system of linear equations Δy = AΔx and interpreted by means of first and second moments of the nodal points. In contrast, the second example aims at solving the consistent system of nonlinear equations Y = F(X) in closed form. Since distance measurements as Euclidean distance functions are left equivariant under the action of the translation group as well as the rotation group – they are invariant under translation and rotation of the Cartesian coordinate system – at first a TR-basis (translation-rotation basis) is established. Namely the origin and the axes of the coordinate system are fixed. With respect to the TR-basis (a set of free parameters has been fixed) the bounded parameters are analytically fixed. Since no prior information is built in, we prove that two solutions of the consistent system of nonlinear equations Y = F(X) exist. In the chosen TR-basis the solution vector X is not of minimum norm. Accordingly, we apply a datum transformation X ↦ x of type group of motion (decomposed into the translation group and the rotation group). The parameters of the group of motion (2 for translation, 1 for rotation) are determined under the condition of minimum norm of the unknown vector x, namely by means of a special Procrustes algorithm. As soon as the optimal datum parameters are determined we are able to compute the unknown vector x which is of minimum norm. Finally, the Notes are an attempt to explain the origin of the injectivity rank deficiency, namely the dimension of the null space N(A), m − rk A, of the consistent system of linear equations Ax = y subject to A ∈ ℝ^{n×m} and rk A = n, n < m, as well as of the consistent system of nonlinear equations F(X) = Y subject to a Jacobi matrix J ∈ ℝ^{n×m} and rk J = n, n < m = dim X. The fundamental relation to the datum transformation, also called transformation groups (conformal group, dilatation group /scale/, translation group, rotation group and projective group), as well as to the "soft" Implicit Function Theorem is outlined.

By means of a certain algebraic objective function which geometrically is called a minimum distance function we solve the first inverse problem of linear and nonlinear equations, in particular of algebraic type, which relate observations to parameters. The system of linear or nonlinear equations we are solving here is classified as underdetermined. The observations, also called measurements, are elements of a certain observation space Y of integer dimension, dim Y = n,
which may be metrical, especially Euclidean, pseudo–Euclidean, in general a differentiable manifold. In contrast, the parameter space X of integer dimension, dim X = m, is metrical as well, especially Euclidean, pseudo–Euclidean, in general a differentiable manifold, but its metric is unknown. A typical feature of algebraic regression is the fact that the unknown metric of the parameter space X is induced by the functional relation between observations and parameters. We shall outline three aspects of any discrete inverse problem: (i) set-theoretic (fibering), (ii) algebraic (rank partitioning, “IPM”, the Implicit Function Theorem) and (iii) geometrical (slicing) Here we treat the first problem of algebraic regression: A consistent system of linear observational equations: Ax = y , A R n× m , rk A = n, n < m , also called “underdetermined system of linear equations”, in short “more unknowns than equations” is solved by means of an optimization problem. The Introduction presents us a front page example of two inhomogeneous linear equations with unknowns. In terms of five boxes and five figures we review the minimum norm solution of such a consistent system of linear equations which is based upon the trinity
1-1 Introduction

With the introductory paragraph we explain the fundamental concepts and basic notions of this section. For you, the analyst, who has the difficult task to deal with measurements, observational data, modeling and modeling equations, we present numerical examples and graphical illustrations of all abstract notions. The elementary introduction is written not for a mathematician, but for you, the analyst, with limited remote control of the notions given hereafter. May we gain your interest.

Assume an n-dimensional observation space Y, here a linear space parameterized by n observations (finite, discrete) as coordinates y = [y₁, ..., y_n]' ∈ ℝⁿ in which an r-dimensional model manifold is embedded (immersed). The model
manifold is described as the range of a linear operator f from an m-dimensional parameter space X into the observation space Y. The mapping f is established by the mathematical equations which relate all observables to the unknown parameters. Here the parameter space X, the domain of the linear operator f, will be restricted also to a linear space which is parameterized by coordinates x = [x₁, ..., x_m]' ∈ ℝ^m. In this way the linear operator f can be understood as a coordinate mapping A : x ↦ y = Ax. The linear mapping f : X → Y is geometrically characterized by its range R(f), namely R(A), defined by R(f) := {y ∈ Y | y = f(x) for all x ∈ X}, which in general is a linear subspace of Y, and by its null space defined by N(f) := {x ∈ X | f(x) = 0}. Here we restrict the range R(f), namely R(A), to coincide with the n = r-dimensional observation space Y such that y ∈ R(f), namely y ∈ R(A).

Example 1.1 will therefore demonstrate the range space R(f), namely R(A), which here coincides with the observation space Y (f is surjective or "onto"), as well as the null space N(f), namely N(A), which is not empty (f is not injective or one-to-one). Box 1.1 will introduce the special linear model of interest. By means of Box 1.2 it will be interpreted as a polynomial of degree two based upon two observations and three unknowns, namely as an underdetermined system of consistent linear equations. Box 1.3 reviews the formal procedure in solving such a system of linear equations by means of "horizontal" rank partitioning and the postulate of the minimum norm solution of the unknown vector. In order to identify the range space R(A), the null space N(A) and its orthogonal complement N(A)⊥, Box 1.4 by means of algebraic partitioning ("horizontal" rank partitioning) outlines the general solution of a system of homogeneous linear equations approaching zero. With this background, Box 1.5 presents the diagnostic algorithm for solving an underdetermined system of linear equations. In contrast, Box 1.6 is a geometric interpretation of a special solution of a consistent system of inhomogeneous linear equations of type "minimum norm" (MINOS). The g-inverse A⁻_m of type "MINOS" is finally characterized by three conditions collected in Box 1.7. Figure 1.1 demonstrates the range space R(A), while Figures 1.2 and 1.3 demonstrate the null space N(A) as well as its orthogonal complement N(A)⊥. Figure 1.4 illustrates the orthogonal projection of an element of the null space N(A) onto the range space R(A⁻), where A⁻ is a generalized inverse. In terms of fibering the set of points of the parameter space as well as of the observation space, Figure 1.5 introduces the related Venn diagrams.

1-11
The front page example
Example 1.1
(polynomial of degree two, consistent system of linear equations Ax = y, x ∈ X = ℝ^m, dim X = m, y ∈ Y = ℝⁿ, dim Y = n, r = rk A = dim Y):
First, the introductory example solves the front page consistent system of linear equations,

x₁ + x₂ + x₃ = 2
x₁ + 2x₂ + 4x₃ = 3,

obviously in general dealing with the linear space X = ℝ^m ∋ x, dim X = m, here m = 3, called the parameter space, and the linear space Y = ℝⁿ ∋ y, dim Y = n, here n = 2, called the observation space.

1-12
The front page example in matrix algebra
Second, by means of Box 1.1 and according to A. Cayley's doctrine let us specify the consistent system of linear equations in terms of matrix algebra.

Box 1.1: Special linear model: polynomial of degree two, two observations, three unknowns

y = [y₁; y₂] = [a₁₁ a₁₂ a₁₃; a₂₁ a₂₂ a₂₃] [x₁; x₂; x₃]

y = Ax : [2; 3] = [1 1 1; 1 2 4] [x₁; x₂; x₃]

x' = [x₁, x₂, x₃], y' = [y₁, y₂] = [2, 3], x ∈ ℝ^{3×1}, y ∈ ℤ₊^{2×1} ⊂ ℝ^{2×1}

A := [1 1 1; 1 2 4] ∈ ℤ₊^{2×3} ⊂ ℝ^{2×3}, r = rk A = dim Y = n = 2.

The matrix A ∈ ℝ^{n×m} is an element of ℝ^{n×m}, the set of n×m arrays of real numbers. dim X = m defines the number of unknowns (here: m = 3), dim Y = n the number of observations (here: n = 2). A mapping f is called linear if f(x₁ + x₂) = f(x₁) + f(x₂) and f(λx) = λf(x) holds. Beside the range R(f), the range space R(A), the linear mapping is characterized by the kernel N(f) := {x ∈ ℝ^m | f(x) = 0}, the null space N(A) := {x ∈ ℝ^m | Ax = 0}, to be specified later on.

Why is the front page system of linear equations called "underdetermined"?
Just observe that we are left with only two linear equations for three unknowns (x₁, x₂, x₃). Indeed the system of inhomogeneous linear equations is "underdetermined". Without any additional postulate we shall be unable to invert those equations for (x₁, x₂, x₃). In particular we shall outline how to find such an additional postulate. Beforehand we have to introduce some special notions from the theory of operators. Within matrix algebra the index of the linear operator A is the rank r = rk A, here r = 2, which coincides with the dimension of the observation space, here n = dim Y = 2. A system of linear equations is called consistent if rk A = dim Y. Alternatively we say that the mapping f : x ↦ y = f(x) ∈ R(f) or A : x ↦ Ax = y ∈ R(A) takes an element x ∈ X into the range R(f) or the range space R(A), also called the column space of the matrix A:

f : x ↦ y = f(x), y ∈ R(f)
A : x ↦ Ax = y, y ∈ R(A).

Here the column space is spanned by the first column c₁ and the second column c₂ of the matrix A, the 2×3 array, namely

R(A) = span{[1; 1], [1; 2]}.
Let us continue with operator theory. The right complementary index of the linear operator A ∈ ℝ^{n×m} accounts for the injectivity defect, given by d = m − rk A (here d = m − rk A = 1). "Injectivity" relates to the kernel N(f), or "the null space", which we shall constructively introduce later on. How can such a linear model of interest, namely a system of consistent linear equations, be generated? Let us assume that we have observed a dynamical system y(t) which is represented by a polynomial of degree two with respect to time t ∈ ℝ, namely y(t) = x₁ + x₂t + x₃t². Due to ÿ(t) = 2x₃ it is a dynamical system with constant acceleration or constant second derivative with respect to time t. The unknown polynomial coefficients are collected in the column array x = [x₁, x₂, x₃]', x ∈ X = ℝ³, dim X = 3, and constitute the coordinates of the three-dimensional parameter space X. If the dynamical system y(t) is observed at two instants, say y(t₁) = y₁ = 2 and y(t₂) = y₂ = 3, at t₁ = 1 and t₂ = 2, respectively, and if we collect the observations in the column array y = [y₁, y₂]' = [2, 3]', y ∈ Y = ℝ², dim Y = 2, they constitute the coordinates of the two-dimensional observation space Y. Thus we are left with a special linear model, interpreted in Box 1.2 below. We use "~" as the symbol for "equivalence".
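The construction just described can be made concrete with a minimal numerical sketch (plain NumPy, added here for illustration and not part of the original text): the first design matrix of the quadratic polynomial model is built from the two sampling instants t₁ = 1, t₂ = 2 and reproduces the front page matrix A.

```python
import numpy as np

# Observation instants and values of the dynamical system y(t) = x1 + x2*t + x3*t^2
t = np.array([1.0, 2.0])      # t1 = 1, t2 = 2
y = np.array([2.0, 3.0])      # y(t1) = 2, y(t2) = 3

# Each row of the design matrix is [1, t, t^2] evaluated at one observation instant
A = np.vander(t, N=3, increasing=True)
print(A)                                 # [[1. 1. 1.]
                                         #  [1. 2. 4.]]
print(np.linalg.matrix_rank(A))          # 2 = n = dim Y < m = 3: consistent but underdetermined
```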
Box 1.2: Special linear model: polynomial of degree two, two observations, three unknowns

y = [y₁; y₂] = [1 t₁ t₁²; 1 t₂ t₂²] [x₁; x₂; x₃]

{t₁ = 1, y₁ = 2; t₂ = 2, y₂ = 3} : [2; 3] = [1 1 1; 1 2 4] [x₁; x₂; x₃] ~ y = Ax, r = rk A = dim Y = n = 2.

Third, let us begin with a more detailed analysis of the linear mapping f : Ax = y, namely of the linear operator A ∈ ℝ^{n×m}, r = rk A = dim Y = n. We shall pay special attention to the three fundamental partitionings, namely
(i) algebraic partitioning called rank partitioning of the matrix A,
(ii) geometric partitioning called slicing of the linear space X,
(iii) set-theoretical partitioning called fibering of the domain D(f).

1-13
Minimum norm solution of the front page example by means of horizontal rank partitioning
Let us go back to the front page consistent system of linear equations, namely the problem to determine three unknown polynomial coefficients from two sampling points which we classified as an underdetermined one. Nevertheless we are able to compute a unique solution of the underdetermined system of inhomogeneous linear equations Ax = y, y ∈ R(A) or rk A = dim Y, here A ∈ ℝ^{2×3}, x ∈ ℝ^{3×1}, y ∈ ℝ^{2×1}, if we determine the coordinates of the unknown vector x of minimum norm (minimal Euclidean length, ℓ₂-norm), here ‖x‖²_I = x'x = x₁² + x₂² + x₃² = min. Box 1.3 outlines the solution of the related optimization problem.

Box 1.3: Minimum norm solution of the consistent system of inhomogeneous linear equations, horizontal rank partitioning

The solution of the optimization problem {‖x‖²_I = min_x | Ax = y, rk A = dim Y}
is based upon the horizontal rank partitioning of the linear mapping
f : x ↦ y = Ax, rk A = dim Y, which we already introduced. As soon as we decompose x₁ = −A₁⁻¹A₂x₂ + A₁⁻¹y and implement it in the norm ‖x‖²_I, we are prepared to compute the first derivatives of the unconstrained Lagrangean

L(x₁, x₂) := ‖x‖²_I = x₁² + x₂² + x₃²
= (y − A₂x₂)'(A₁A₁')⁻¹(y − A₂x₂) + x₂'x₂
= y'(A₁A₁')⁻¹y − 2x₂'A₂'(A₁A₁')⁻¹y + x₂'A₂'(A₁A₁')⁻¹A₂x₂ + x₂'x₂ = min over x₂,

∂L/∂x₂ (x_2m) = 0
⇔ −A₂'(A₁A₁')⁻¹y + [A₂'(A₁A₁')⁻¹A₂ + I]x_2m = 0
⇔ x_2m = [A₂'(A₁A₁')⁻¹A₂ + I]⁻¹A₂'(A₁A₁')⁻¹y,

which constitute the necessary conditions. (The theory of vector derivatives is presented in Appendix B.) Following Appendix A devoted to matrix algebra, namely

(I + AB)⁻¹A = A(I + BA)⁻¹, (BA)⁻¹ = A⁻¹B⁻¹,

for appropriate dimensions of the involved matrices, such that the identities hold

x_2m = [A₂'(A₁A₁')⁻¹A₂ + I]⁻¹A₂'(A₁A₁')⁻¹y
= A₂'(A₁A₁')⁻¹[A₂A₂'(A₁A₁')⁻¹ + I]⁻¹y
= A₂'[(A₁A₁')⁻¹A₂A₂' + I]⁻¹(A₁A₁')⁻¹y,

we finally derive

x_2m = A₂'(A₁A₁' + A₂A₂')⁻¹y.

The second derivatives

∂²L/∂x₂∂x₂' (x_2m) = 2[A₂'(A₁A₁')⁻¹A₂ + I] > 0,

due to positive-definiteness of the matrix A₂'(A₁A₁')⁻¹A₂ + I, generate the sufficiency condition for obtaining the minimum of the unconstrained Lagrangean. Finally let us backward transform x_2m ↦ x_1m = −A₁⁻¹A₂x_2m + A₁⁻¹y:

x_1m = −A₁⁻¹A₂A₂'(A₁A₁' + A₂A₂')⁻¹y + A₁⁻¹y.

Let us right multiply the identity A₁A₁' = −A₂A₂' + (A₁A₁' + A₂A₂') by (A₁A₁' + A₂A₂')⁻¹ such that
A₁A₁'(A₁A₁' + A₂A₂')⁻¹ = −A₂A₂'(A₁A₁' + A₂A₂')⁻¹ + I

holds, and left multiply by A₁⁻¹, namely

A₁'(A₁A₁' + A₂A₂')⁻¹ = −A₁⁻¹A₂A₂'(A₁A₁' + A₂A₂')⁻¹ + A₁⁻¹.

Obviously we have generated the linear form

x_1m = A₁'(A₁A₁' + A₂A₂')⁻¹y
x_2m = A₂'(A₁A₁' + A₂A₂')⁻¹y

or

[x_1m; x_2m] = [A₁'; A₂'](A₁A₁' + A₂A₂')⁻¹y

or

x_m = A'(AA')⁻¹y.

A numerical computation with respect to the introductory example is

A₁A₁' + A₂A₂' = [3 7; 7 21], (A₁A₁' + A₂A₂')⁻¹ = (1/14)[21 −7; −7 3],
A₁'(A₁A₁' + A₂A₂')⁻¹ = (1/14)[14 −4; 7 −1],
A₂'(A₁A₁' + A₂A₂')⁻¹ = (1/14)[−7, 5],

x_1m = [8/7; 11/14], x_2m = 1/14, ‖x_m‖_I = (3/14)√42,

y(t) = 8/7 + (11/14)t + (1/14)t²,

∂²L/∂x₂∂x₂' (x_2m) = 2[A₂'(A₁A₁')⁻¹A₂ + I] = 28 > 0.
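The numbers of Box 1.3 can be reproduced with a few lines of NumPy; the following is a sketch added for illustration, not part of the original text.

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 4.0]])
y = np.array([2.0, 3.0])
A1, A2 = A[:, :2], A[:, 2:]              # horizontal rank partitioning A = [A1, A2]

# MINOS: x_m = A'(AA')^(-1) y
x_m = A.T @ np.linalg.solve(A @ A.T, y)
print(x_m)                               # [8/7, 11/14, 1/14] ≈ [1.1429, 0.7857, 0.0714]
print(np.linalg.norm(x_m))               # (3/14)*sqrt(42) ≈ 1.3887
print(A @ x_m)                           # [2., 3.]: the observations are reproduced

# Sufficiency condition of Box 1.3: 2*[A2'(A1 A1')^(-1) A2 + I] = 28 > 0
print(2.0 * (A2.T @ np.linalg.solve(A1 @ A1.T, A2) + np.eye(1)))   # [[28.]]
```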
1-14 The range R(f) and the kernel N(f)
Fourth, let us go into the detailed analysis of R(f), N(f), N(f)⊥ with respect to the front page example. How can we actually identify the range space R(A), the null space N(A) or its orthogonal complement N(A)⊥? The range space R(A) := {y ∈ ℝⁿ | Ax = y, x ∈ ℝ^m} is conveniently described by the first column c₁ = [1, 1]' and the second column c₂ = [1, 2]' of the matrix A, namely by the 2-leg
{e₁ + e₂, e₁ + 2e₂ | O} or {e_{c₁}, e_{c₂} | O}, with respect to the orthogonal base vectors e₁ and e₂, respectively, attached to the origin O. Symbolically we write R(A) = span{e₁ + e₂, e₁ + 2e₂ | O}. As a linear space, R(A) ⊆ Y is illustrated by Figure 1.1.

Figure 1.1: Range R(f), range space R(A), y ∈ R(A)

By means of Box 1.4 we identify N(f) or "the null space N(A)" and give its illustration by Figure 1.2. Such a result has paved the way to the diagnostic algorithm for solving an underdetermined system of linear equations by means of rank partitioning presented in Box 1.5.

Box 1.4: The general solution of the system of homogeneous linear equations Ax = 0, "horizontal" rank partitioning

The matrix A is called "horizontally rank partitioned" if

A ∈ ℝ^{n×m}, A = [A₁, A₂], A₁ ∈ ℝ^{n×r}, A₂ ∈ ℝ^{n×d}, r = rk A = rk A₁ = n, d = d(A) = m − rk A

holds. (In the introductory example A ∈ ℝ^{2×3}, A₁ ∈ ℝ^{2×2}, A₂ ∈ ℝ^{2×1}, rk A = 2, d(A) = 1 applies.) A consistent system of linear equations Ax = y, rk A = dim Y, is "horizontally rank partitioned" if

Ax = y, rk A = dim Y ⇔ A₁x₁ + A₂x₂ = y
for a partitioned unknown vector

{x ∈ ℝ^m, x = [x₁; x₂] | x₁ ∈ ℝ^{r×1}, x₂ ∈ ℝ^{d×1}}

applies. The "horizontal" rank partitioning of the matrix A as well as the "horizontally rank partitioned" consistent system of linear equations Ax = y, rk A = dim Y, of the introductory example is

A = [1 1 1; 1 2 4], A₁ = [1 1; 1 2], A₂ = [1; 4],

Ax = y, rk A = dim Y ⇔ A₁x₁ + A₂x₂ = y, x₁ = [x₁, x₂]' ∈ ℝ^{2×1}, x₂ = [x₃] ∈ ℝ,

[1 1; 1 2][x₁; x₂] + [1; 4]x₃ = y.

By means of the horizontal rank partitioning of the system of homogeneous linear equations, an identification of the null space N(A), namely N(A) = {x ∈ ℝ^m | Ax = A₁x₁ + A₂x₂ = 0}, is

A₁x₁ + A₂x₂ = 0 ⇔ x₁ = −A₁⁻¹A₂x₂,

particularly in the introductory example

[x₁; x₂] = −[2 −1; −1 1][1; 4]x₃,
x₁ = 2x₃ = 2τ, x₂ = −3x₃ = −3τ, x₃ = τ.

Here the two equations Ax = 0 for any x ∈ X = ℝ³ constitute the linear space N(A), dim N(A) = 1, a one-dimensional subspace of X = ℝ³. For instance, if we introduce the parameter x₃ = τ, the other coordinates of the parameter space X = ℝ³ amount to x₁ = 2τ, x₂ = −3τ. In geometric language the linear space N(A) is a parameterized straight line L¹₀ through the origin, illustrated by Figure 1.2. The parameter space X = ℝ^m (here m = 3) is sliced by the subspace, the linear space N(A), also called a linear manifold, dim N(A) = d(A) = d, here a straight line L¹₀ through the origin O.
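As a cross-check of Box 1.4, the following NumPy fragment (an illustration added here, not from the book) computes the null space direction via the horizontal rank partitioning x₁-block = −A₁⁻¹A₂x₃ and confirms the parameterization x₁ = 2τ, x₂ = −3τ, x₃ = τ.

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 4.0]])
A1, A2 = A[:, :2], A[:, 2:]

# Null space from the rank partitioning: the x1-block equals -A1^(-1) A2 * x3
direction = np.concatenate([-np.linalg.solve(A1, A2).ravel(), [1.0]])
print(direction)        # [ 2. -3.  1.]  ->  N(A) = {tau * (2, -3, 1)'}
print(A @ direction)    # [0. 0.]        ->  the direction indeed lies in N(A)
```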
1-15
Interpretation of “MINOS” by three partitionings: (i) algebraic (rank partitioning) (ii) geometric (slicing) (iii) set-theoretical (fibering)
Figure 1.2: The parameter space X = ℝ³ (x₃ is not displayed) sliced by the null space, the linear manifold N(A) = L¹₀ ⊂ ℝ²
The diagnostic algorithm for solving an underdetermined system of linear equations y = Ax, rk A = dim Y = n, n < m = dim X, y ∈ R(A), by means of rank partitioning is presented to you by Box 1.5.

Box 1.5: The diagnostic algorithm for solving an underdetermined system of linear equations y = Ax, rk A = dim Y, y ∈ R(A), by means of rank partitioning

Determine the rank of the matrix A: rk A = dim Y = n.

Compute the "horizontal rank partitioning": A = [A₁, A₂], A₁ ∈ ℝ^{r×r} = ℝ^{n×n}, A₂ ∈ ℝ^{n×(m−r)} = ℝ^{n×(m−n)}. ("m − r = m − n = d is called the right complementary index." "A as a linear operator is not injective, but surjective.")
Compute the null space N(A): N(A) := {x ∈ ℝ^m | Ax = 0} = {x ∈ ℝ^m | x₁ + A₁⁻¹A₂x₂ = 0}.

Compute the unknown parameter vector of type MINOS (minimum norm solution x_m): x_m = A'(AA')⁻¹y.
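Box 1.5 can be turned into a small reusable routine. The sketch below is a hypothetical helper added for illustration (not from the book); it assumes, as in the front page example, that the first n columns of A are linearly independent, otherwise a column permutation would be needed first.

```python
import numpy as np

def minos_diagnostic(A, y):
    """Follow the steps of Box 1.5 for a consistent underdetermined system y = Ax, rk A = n < m."""
    n, m = A.shape
    r = np.linalg.matrix_rank(A)
    assert r == n < m, "expects a surjective but not injective operator A"
    A1, A2 = A[:, :r], A[:, r:]                     # horizontal rank partitioning A = [A1, A2]
    d = m - r                                       # right complementary index
    null_basis = np.vstack([-np.linalg.solve(A1, A2), np.eye(d)])   # basis of N(A)
    x_m = A.T @ np.linalg.solve(A @ A.T, y)         # MINOS: x_m = A'(AA')^(-1) y
    return d, null_basis, x_m

d, null_basis, x_m = minos_diagnostic(np.array([[1., 1., 1.], [1., 2., 4.]]),
                                      np.array([2., 3.]))
print(d, null_basis.ravel(), x_m)   # 1 [ 2. -3.  1.] [1.1429 0.7857 0.0714]
```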
While we have characterized the general solution of the system of homogeneous linear equations Ax = 0, we are left with the problem of solving the consistent system of inhomogeneous linear equations. Again we take advantage of the rank partitioning of the matrix A summarized in Box 1.4.

Box 1.6: A special solution of a consistent system of inhomogeneous linear equations Ax = y, "horizontal" rank partitioning

Ax = y, rk A = dim Y, y ∈ R(A) ⇔ A₁x₁ + A₂x₂ = y.

Since the matrix A₁ is of full rank it can be regularly inverted (Cayley inverse). In particular, we solve for

x₁ = −A₁⁻¹A₂x₂ + A₁⁻¹y,

or

[x₁; x₂] = −[2 −1; −1 1][1; 4]x₃ + [2 −1; −1 1][y₁; y₂],
x₁ = 2x₃ + 2y₁ − y₂, x₂ = −3x₃ − y₁ + y₂.

For instance, if we introduce the parameter x₃ = τ, the other coordinates of the parameter space X = ℝ³ amount to x₁ = 2τ + 2y₁ − y₂, x₂ = −3τ − y₁ + y₂. In geometric language the admissible parameter space is a family of one-dimensional linear spaces, a family of one-dimensional parallel straight lines dependent on y = [y₁, y₂]', here [2, 3]', in particular
14
1 The first problem of algebraic regression
L1( y1 , y2 ) := { x R 3 | x1 = 2 x3 + 2 y1 y2 , x2 = 3x3 y1 + y2 }, including the null space L1(0, 0) = N ( A). Figure 1.3 illustrates (i)
the admissible parameter space L1( y1 , y2 ) ,
(ii)
the line L1A which is orthogonal to the null space called N ( A) A ,
(iii) the intersection L1( y1 , y2 ) N ( A ) A , generating the solution point xm as will be proven now.
x2 1 A
~ N (A)A
x1
N (A) ~ L1(0,0)
L1( 2 , 3 )
Figure 1.3: The range space R ( A ) (the admissible parameter space) parallel straight lines L1( y , y ) , namely L1(2, 3) : 1
2
L ( y1 , y2 ) := { x R | x1 = 2 x3 + 2 y1 y2 , x2 = 3x3 y1 + y2 } . 1
3
The geometric interpretation of the minimum norm solution & x & I = min is the following: With reference to Figure 1.4 we decompose the vector x = x N (A) + x N (A)
A
where x N ( A ) is an element of the null space N ( A) (here: the straight line L1(0, 0) ) and x N ( A ) is an element of the orthogonal complement N ( A) A of the null space N ( A) (here: the straight line L1(0, 0) , while the inconcistency parameter i N ( A ) = i m is an element of the range space R ( A ) (here: the straight line L1( y , y ) , namely L1(2, 3) ) of the generalized inverse matrix A of type MINOS (“minimum norm solution”). & x &2I =& x N ( A ) + x N ( A ) &2 2 2 =& x N ( A ) & +2 < x | i > + & i & is minimal if and only if the inner product ¢ x N ( A ) | i ² = 0 , x N ( A ) and i m = i N ( A ) are orthogonal. The solution point x m is the orthogonal projection of the null point onto R ( A ) : A
1
2
A
A
15
1-1 Introduction
PR ( A ) = A Ax = A y for all x D ( A). Alternatively, if the vector x m of minimal length is orthogonal to the null space N ( A ) , being an element of N ( A) A (here: the line L1(0, 0) ) we may say that N ( A) A intersects R ( A ) in the solution point x m . Or the normal space NL10 with respect to the tangent space TL10 – which is in linear models identical to L10 , the null space N ( A ) – intersects the tangent space TL1y , the range space R ( A ) in the solution point x m . In summary, x m N ( A ) A R ( A ).
Figure 1.4: Orthogonal projection of an element of N ( A) onto the range space R ( A ) Let the algebraic partitioning and the geometric partitioning be merged to interpret the minimum norm solution of the consistent system of linear equations of type “underdetermined” MINOS. As a summary of such a merger we take reference to Box 1.7. The first condition AA A = A Let us depart from MINOS of y = Ax, x X = R m , y Y = R n , r = rk A = n, namely x m = A m y = A c( AA c) 1 y. Ax m = AA m y = AA m Ax m Ax m = AA m Ax m AA A. The second condition A AA = A
16
1 The first problem of algebraic regression
x m = A c( AA c) 1 y = A m y = A m Ax m x m = A m y = A m AA m y
A m y = A m AA m y A AA = A . rk A m = rk A is interpreted as follows: the g-inverse of type MINOS is the generalized inverse of maximal rank since in general rk A d rk A holds The third condition AA = PR ( A )
x m = A m y = A m Ax m
A A = PR ( A ) .
Obviously A m A is an orthogonal projection onto R ( A ) , but i m = I A m A onto its orthogonal complemert R ( A ) A . If the linear mapping f : x 6 y = f (x), y R (f ) is given we are aiming at a generalized inverse (linear) mapping y 6 x = g(y ) such that y = f (x) = = f ( g (y ) = f ( g ( f ( x))) or f = f D g D f as a first condition is fulfilled. Alternatively we are going to construct a generalized inverse A : y 6 A y = x such that the first condition y = Ax = AA Ay or AA A = A holds. Though the linear mapping f : x 6 y = f (x) R (f ), or the system of linear equations Ax = y , rk A = dim Y , is consistent, it suffers from the (injectivity) deficiency of the linear mapping f(x) or of the matrix A. Indeed it recovers from the (injectivity) deficiency if we introduce the projection x 6 g ( f (x)) = q R (g ) or x 6 A Ax = q R (A ) as the second condition. Note that the projection matrix A A is idempotent which follows from P 2 = P or ( A A)( A A) = A AA A = A A. Box 1.7: The general solution of a consistent system of linear equations; f : x 6 y = Ax, x X = R m (parameter space), y Y = R n (observation space) r = rk A = dim Y , A generalized inverse of MINOS type Condition #1 f (x) = f ( g (y )) f = f DgD f . Condition #2 (reflexive g-inverse mapping)
Condition #1 Ax = AA Ax AA A = A. Condition #2 (reflexive g-inverse)
17
1-2 The minimum norm solution: “MINOS”
x = g (y ) =
x = A y = A AA y
= g ( f (x)).
A AA = A .
Condition #3
Condition #3
g ( f (x)) = x R ( A g D f = projR ( A
)
A A = x R (A )
)
A A = projR (A ) .
The set-theoretical partitioning, the fibering of the set system of points which constitute the parameters space X, the domain D(f), will be finally outlined. Since the set system X (the parameters space) is R r , the fibering is called “trivial”. Non-trivial fibering is reserved for nonlinear models in which case we are dealing with a parameters space X which is a differentiable manifold. Here the fibering D( f ) = N ( f ) N ( f )A produces the trivial fibers N ( f ) and N ( f ) A where the trivial fibers N ( f ) A is the quotient set R n /N ( f ) . By means of a Venn diagram (John Venn 18341928) also called Euler circles (Leonhard Euler 1707–1783) Figure 1.5 illustrates the trivial fibers of the set system X = R m generated by N ( f ) and N ( f ) A . The set system of points which constitute the observation space Y is not subject to fibering since all points of the set system D(f) are mapped into the range R(f).
Figure 1.5: Venn diagram, trivial fibering of the domain D(f), trivial fibers N(f) and N ( f ) A , f : R m = X o Y = R n , Y = R (f ) , X set system of the parameter space, Y set system of the observation space.
1-2 The minimum norm solution: “MINOS” The system of consistent linear equations Ax = y subject to A R n× m , rk A = n < m , allows certain solutions which we introduce by means of Definition 1.1
18
1 The first problem of algebraic regression
as a solution of a certain optimization problem. Lemma 1.2 contains the normal equations of the optimization problem. The solution of such a system of normal equations is presented in Lemma 1.3 as the minimum norm solution with respect to the G x -seminorm. Finally we discuss the metric of the parameter space and alternative choices of its metric before we identify by Lemma 1.4 the solution of the quadratic optimisation problem in terms of the (1,2,4)-generalized inverse. Definition 1.1 (minimum norm solution with respect to the G x -seminorm): A vector xm is called G x -MINOS (Minimum Norm Solution with respect to the G x -seminorm) of the consistent system of linear equations rk A = rk( A, y ) = n < m ° n Ax = y, y Y { R , ® A R n×m ° y R ( A), ¯
(1.1)
if both Ax m = y
(1.2)
and in comparison to all other vectors of solution x X { R m , the inequality
|| x m ||G2 x := xcmG x x m d xcG x x =:|| x ||G2 x
(1.3)
holds. The system of inverse linear equations A y + i = x, rk A z m or x R ( A ) is inconsistent. By Definition 1.1 we characterized G x -MINOS of the consistent system of linear equations Ax = y subject to A R n× m , rk A = n < m (algebraic condition) or y R ( A) (geometric condition). Loosely speaking we are confronted with a system of linear equations with more unknowns m than equations n, namely n < m . G x -MINOS will enable us to find a solution of this underdetermined problem. By means of Lemma 1.2 we shall write down the “normal equations” of G x -MINOS. Lemma 1.2 (minimum norm solution with respect to the G x -(semi)norm) : A vector x m X { R m is G x -MINOS of (1.1) if and only if the system of normal equations ªG x «A ¬
A cº ª x m º ª 0 º = 0 »¼ «¬ Ȝ m »¼ «¬ y »¼
(1.4)
19
1-2 The minimum norm solution: “MINOS”
with the vector Ȝ m R n×1 of “Lagrange multipliers” is fulfilled. x m exists always and is in particular unique, if rk[G x , A c] = m
(1.5)
holds or equivalently if the matrix G x + AcA is regular. : Proof : G x -MINOS is based on the constraint Lagrangean L(x, Ȝ ) := xcG x x + 2Ȝ c( Ax y ) = min x, O
such that the first derivatives 1 2 1 2
wL ½ (x m , Ȝ m ) = G x x m + A cȜ m = 0 ° ° wx ¾ wL ° (x m , Ȝ m ) = Ax m y = 0 wO ¿° ªG A c º ª x m º ª 0 º « x »« » = « » ¬ A 0 ¼ ¬Ȝ m ¼ ¬y ¼
constitute the necessary conditions. The second derivatives 1 w 2L (x m , Ȝ m ) = G x t 0 2 wxwxc due to the positive semidefiniteness of the matrix G x generate the sufficiency condition for obtaining the minimum of the constrained Lagrangean. Due to the assumption rk A = rk [ A, y ] = n or y R ( A) the existence of G x -MINOS x m is guaranteed. In order to prove uniqueness of G x -MINOS x m we have to consider case (i) G x positive definite and case (ii) G x positive semidefinite. Case (i) : G x positive definite Due to rk G x = m , G x z 0 , the partitioned system of normal equations ªG A c º ª x m º ª 0 º G x z 0, « x »« » = « » ¬ A 0 ¼ ¬Ȝ m ¼ ¬y ¼ is uniquely solved. The theory of inverse partitioned matrices (IPM) is presented in Appendix A. Case (ii) : G x positive semidefinite Follow these algorithmic steps: Multiply the second normal equation by Ac in order to produce A cAx Acy = 0 or A cAx = Acy and add the result to the first normal equation which generates
20
1 The first problem of algebraic regression
G x x m + A cAx m + A cȜ m = A cy or (G x + A cA)x m + A cȜ m = A cy . The augmented first normal equation and the original second normal equation build up the equivalent system of normal equations ªG + A cA A cº ª x m º ª A cº G x + A cA z 0, « x » « » = « » y, 0 ¼ ¬ Ȝ m ¼ ¬I n ¼ ¬ A which is uniquely solved due to rk (G x + A cA ) = m , G x + A cA z 0 .
ƅ
The solution of the system of normal equations leads to the linear form x m = Ly which is G x -MINOS of (1.1) and can be represented as following. Lemma 1.3
(minimum norm solution with respect to the G x -(semi-) norm):
x m = Ly is G x -MINOS of the consistent system of linear equations (1.1) Ax = y , rk A = rk( A, y ) = n (or y R ( A) ), if and only if L R m × n is represented by Case (i): G x = I m L = A R = Ac( AAc) 1
(right inverse)
(1.6)
x m = A y = Ac( AAc) y x = xm + i m , R
1
(1.7) (1.8)
is an orthogonal decomposition of the unknown vector x X { R m into the I-MINOS vector x m Ln and the I-MINOS vector of inconsistency i m Ld subject to (1.9) x m = A c( AA c) 1 Ax , i m = x x m =[I A c( AA c) 1 A]x . m
(1.10)
(Due to x m = A c( AA c) 1 Ax , I-MINOS has the reproducing property. As projection matrices A c( AAc) 1 A , rk A c( AA c) 1 A = rk A = n and [I m A c( AA c) 1 A] , rk[I m A c( AA c) 1 A] = n rk A = d , are independent). Their corresponding norms are positive semidefinite, namely & x m ||I2 = y c( AA c) 1 y = xcA c( AA c) 1 Ax = xcG m x
(1.11)
|| i m ||I2 = xc[I m A c( AA c) 1 A]x.
(1.12)
21
1-2 The minimum norm solution: “MINOS”
Case (ii): G x positive definite L = G x 1 A c( AG x 1 A c) 1 (weighted right inverse)
(1.13)
x m = G A c( AG A c) y x = xm + i m
(1.14)
1 x
1 x
1
(1.15)
is an orthogonal decomposition of the unknown vector x X { R m into the G x -MINOS vector x m Ln and the G x -MINOS vector of inconsistency i m Ld subject to x m = G x 1 A c( AG x 1 A c) 1 Ax ,
(1.16)
i m = x x m = [I m G x 1 A c( AG x 1 A c) 1 A]x .
(1.17)
(Due to x m = G x 1 A c( AG x 1 A c) 1Ax G x -MINOS has the reproducing property. As projection matrices G x 1 A c( AG x 1 A c) 1A , rk G x 1 A c (A G x 1A c) 1A =n , and [I m G x 1 A c( AG x 1 A c) 1 A] , rk[I G x 1A c( A G x 1A c) 1 A ] = n rk A = d , are independent.) The corresponding norms are positive semidefinite, namely || x m ||G2 = y c( AG x A c) 1 y = xcA c( AG x A c) 1 Ax = xcG m x
(1.18)
|| i m ||G2 = xc[G x A c( AG x1A c) 1 A]x .
(1.19)
x
x
Case (iii): G x positive semidefinite L = (G x + A cA) 1 A c [ A(G x + A cA) 1 A c]1
(1.20)
x m = (G x + A cA) 1 A c [ A(G x + A cA) 1 A c]1 y
(1.21)
x = xm + i m
(1.22)
is an orthogonal decomposition of the unknown vector x X = Ln into the ( G x + A cA )-MINOS vector x m Ln and the G x + AA c MINOS vector of inconsistency i m Ld subject to x m = (G x + A cA) 1 A c [ A(G x + A cA) 1 A c]1 Ax i m = {I m ( G x + A cA ) A c[ A ( G x + A cA ) 1
1
A c]1 A}x .
Due to x m = (G x + A cA) 1 A c [ A(G x + A cA) 1 A c]1 Ax (G x + AcA) -MINOS has the reproducing property. As projection matrices (G x + A cA) 1 A c[ A(G x + A cA ) 1 A c]1 A, rk (G x + A cA ) 1 A c[ A(G x + A cA ) 1 A c]1 A = rk A = n,
(1.23) (1.24)
22
1 The first problem of algebraic regression
and {I m = (G x + A cA ) 1 A c[ A (G x + A cA ) 1 A c]1 A}, rk{I m (G x + A cA) 1 A c[ A(G x + A cA) 1 A c]1 A} = n rk A = d , are independent. The corresponding norms are positive semidefinite, namely || x m ||G2
x
+ AcA
= y c[ A(G x + A cA) 1 A c]1 y = xcA c[ A(G x + A cA) 1 A c]1 Ax = xcG m x,
|| i m ||G2
x
+ AcA
= xc{I m (G x + A cA ) 1 A c[ A (G x + A cA ) 1 A c]1 A}x .
(1.25) (1.26)
: Proof : A basis of the proof could be C. R. Rao´s Pandora Box, the theory of inverse partitioned matrices (Appendix A: Fact: Inverse Partitioned Matrix /IPM/ of a symmetric matrix). Due to the rank identity rk A = rk ( AG x 1 A c) = n < m the normal equations of the case (i) as well as case (ii) can be faster directly solved by Gauss elimination. ªG x A c º ª x m º ª 0 º « » « »=« » ¬ A 0 ¼ ¬Om ¼ ¬ y ¼ G x x m + A cOm = 0 Ax m = y. Multiply the first normal equation by AG x 1 and subtract the second normal equation from the modified first one. Ax m + AG x 1A cOm = 0 Ax m = y
Om = ( AG A c) 1 y. 1 x
The forward reduction step is followed by the backward reduction step. Implement Om into the first normal equation and solve for x m . G x x m A c( AG x 1A c) 1 y = 0 x m = G x1A c( AG x1A c) 1 y Thus G x -MINOS x m and Om are represented by x m = G x 1 A c( AG x 1 A c) 1 y , Ȝ m = ( AG x 1 A c) 1 y.
1-2 The minimum norm solution: “MINOS”
23
For the Case (iii), to the first normal equation we add the term AA cx m = Acy and write the modified normal equation ªG x + A cA A cº ª x m º ª A cº « » « » = « » y. 0 ¼ ¬Om ¼ ¬ I n ¼ ¬ A Due to the rank identity rk A = rk[ A c(G x + AA c) 1 A] = n < m the modified normal equations of the case (i) as well as case (ii) are directly solved by Gauss elimination. (G x + A cA)x m + A cOm = A cy Ax m = y. Multiply the first modified normal equation by AG x 1 and subtract the second normal equation from the modified first one. Ax m + A(G x + A cA) 1 A cOm = A (G x + A cA ) 1 A cy Ax m = y A(G x + A cA) 1 A cOm = [ A(G x + A cA) 1 A c I n ]y
Om = [ A(G x + A cA) 1 A c]1[ A(G x + A cA) 1 A c I n ]y Om = [I n ( A(G x + A cA) 1 A c) 1 ]y. The forward reduction step is followed by the backward reduction step. Implement Om into the first modified normal equation and solve for x m . (G x + A cA )x m Ac[ A(G x + A cA ) 1 A c]1 y + A cy = A cy (G x + A cA )x m Ac[ A(G x + A cA ) 1 A c]1 y = 0 x m = (G x + A cA) 1 Ac[ A(G x + A cA ) 1 A c]1 y. Thus G x -MINOS of (1.1) in terms of particular generalized inverse is obtained as x m = (G x + A cA) 1 Ac[ A(G x + A cA ) 1 A c]1 y ,
Om = [I n ( A(G x + A cA) 1 A c) 1 ]y .
ƅ 1-21
A discussion of the metric of the parameter space X
With the completion of the proof we have to discuss the basic results of Lemma 1.3 in more detail. At first we have to observe that the matrix G x of the metric of the parameter space X has to be given a priori. We classified MINOS according to (i) G x = I m , (ii) G x positive definite and (iii) G x positive semidefinite. But how do we know the metric of the parameter space? Obviously we need prior information about the geometry of the parameter space X , namely from
24
1 The first problem of algebraic regression
the empirical sciences like physics, chemistry, biology, geosciences, social sciences. If the parameter space X R m is equipped with an inner product ¢ x1 | x 2 ² = x1cG x x 2 , x1 X, x 2 X where the matrix G x of the metric & x &2 = xcG x x is positive definite, we refer to the metric space X R m as Euclidean E m . In contrast, if the parameter space X R m is restricted to a metric space with a matrix G x of the metric which is positive semidefinite, we call the parameter space semi Euclidean E m , m . m1 is the number of positive eigenvalues, m2 the number of zero eigenvalues of the positive semidefinite matrix G x of the metric (m = m1 + m2 ). In various applications, namely in the adjustment of observations which refer to Special Relativity or General Relativity we have to generalize the metric structure of the parameter space X : If the matrix G x of the pseudometric & x &2 = xcG x x is built on m1 positive eigenvalues (signature +), m2 zero eigenvalues and m3 negative eigenvalues (signature –), we call the pseudometric parameter space pseudo Euclidean E m , m , m , m = m1 + m2 + m3 . For such a parameter space MINOS has to be generalized to & x &2G = extr , for instance "maximum norm solution" . 1
2
1
2
3
x
1-22
Alternative choice of the metric of the parameter space X
Another problem associated with the parameter space X is the norm choice problem. Here we have used the A 2 -norm, for instance A 2 -norm:
& x & 2 := xcx = x12 + x22 + ... + xm2 1 + xm2 ,
A p -norm:
& x & p :=
p
x1 + x2
p
p
p
p
+ ... + xm 1 + xm ,
as A p -norms (1 d p < f ) are alternative norms of choice. Beside the choice of the matrix G x of the metric within the A 2 -norm and of the norm A p itself we like to discuss the result of the MINOS matrix G m of the metric. Indeed we have constructed MINOS from an a priori choice of the metric G called G x and were led to the a posteriori choice of the metric G m of type (1.27), (1.28) and (1.29). The matrices (i) G m = A c( AA c) 1 A
(1.27)
(ii) G m = A c( AG Ac) A 1 x
1
(1.28)
(iii) G m = A c[ A(G x + A cA) A c] A 1
1
(1.29)
are (i) idempotent, (ii) G x idempotent and (iii) [ A(G x + AcA) 1 Ac]1 idempotent, 2 namely projection matrices. Similarly, the norms i m of the type (1.30), (1.31) and (1.32) measure the distance of the solution point x m X from the null space N ( A ) . The matrices (i) I m A c( AA c) 1 A 1 x
(1.30)
1
(ii) G x A c( AG A c) A
(1.31)
(iii) I m (G x + A cA ) A c[ A(G x + A cA ) A c] A 1
1
1
(1.32)
25
1-2 The minimum norm solution: “MINOS”
are (i) idempotent, (ii) G x 1 idempotent and (iii) (G x + A cA) 1 A idempotent, namely projection matrices. 1-23
G x -MINOS and its generalized inverse
A more formal version of the generalized inverse which is characteristic for G x -MINOS is presented by Lemma 1.4 (characterization of G x -MINOS): x m = Ly is I – MINOS of the consistent system of linear equations (1.1) Ax = y , rk A = rk ( A, y ) (or y R ( A) ) if and only if the matrix L R m × n fulfils ALA = A ½ ¾. LA = (LA)c¿
(1.33)
The reflexive matrix L is the A1,2,4 generalized inverse. x m = Ly is G x -MINOS of the consistent system of linear equations (1.1) Ax = y , rk A = rk( A, y ) (or y R ( A) ) if and only if the matrix L R m × n fulfils the two conditions ALA = A ½ ¾. c G x LA = (G x LA ) ¿
(1.34)
The reflexive matrix L is the G x -weighted A1,2,4 generalized inverse. : Proof : According to the theory of the general solution of a system of linear equations which is presented in Appendix A, the conditions ALA = L or L = A guarantee the solution x = Ly of (1.1) , rk A = rk( A, y ) . The general solution x = x m + (I LA)z with an arbitrary vector z R m×1 leads to the appropriate representation of the G x -seminorm by means of || x m ||G2 = || Ly || G2 d || x ||G2 = || x m + (I LA )z ||G2 x
x
x
x
=|| x m ||G2 +2xcmG x (I LA )z + || (I LA)z ||G2 x
=|| Ly ||
2 Gx
x
+2y cLcG x (I LA )z + || (I LA)z ||G2
x
= y cLcG x Ly + 2y cLcG x (I LA)z + z c(I A cLc)G x (I LA)z where the arbitrary vectors y Y { R n holds if and only if y cLcG x (I LA)z = 0 for all z R m×1 or A cLcG x (I LA) = 0 or A cLcG xc = A cLcG x LA . The right
26
1 The first problem of algebraic regression
hand side is a symmetric matrix. Accordingly the left hand side must have this property, too, namely G x LA = (G x LA)c , which had to be shown. Reflexivity of the matrix L originates from the consistency condition, namely (I AL)y = 0 for all y R m×1 or AL = I . The reflexive condition of the G x weighted, minimum norm generalized inverse, (1.17) G x LAL = G x L , is a direct consequence. Consistency of the normal equations (1.4) or equivalently the uniqueness of G x x m follows from G x L1y = A cL1cG x L1y = G x L1 AL1 y = G x L1 AL 2 y = A cL1c A × ×Lc2 G x L 2 y = A cL 2 G x L 2 y = G x L 2 y for arbitrary matrices L1 R m × n and L 2 R m × n which satisfy (1.16).
ƅ
1-24
Eigenvalue decomposition of G x -MINOS: canonical MINOS
In the empirical sciences we meet quite often the inverse problem to determine the infinite set of coefficients of a series expansion of a function of a functional (Taylor polynomials) from a finite set of observations. First example: Determine the Fourier coefficients (discrete Fourier transform, trigonometric polynomials) of a harmonic function with circular support from observations in a one-dimensional lattice. Second example: Determine the spherical harmonic coefficients (discrete Fourier-Legendre transform) of a harmonic function with spherical support from observations n a twodimensional lattice. Both the examples will be dealt with lateron in a case study. Typically such expansions generate an infinite dimensional linear model based upon orthogonal (orthonormal) functions. Naturally such a linear model is underdetermined since a finite set of observations is confronted with an infinite set of unknown parameters. In order to make such an infinite dimensional linear model accessible to the computer, the expansion into orthogonal (orthonormal) functions is truncated or band-limited. Observables y Y , dim Y = n , are related to parameters x X , dim X = = m n = dim Y , namely the unknown coefficients, by a linear operator A \ n× m which is given in the form of an eigenvalue decomposition. We are confronted with the problem to construct “canonical MINOS”, also called the eigenvalue decomposition of G x -MINOS. First, we intend to canonically represent the parameter space X as well as the observation space Y . Here, we shall assume that both spaces are Euclidean
27
1-2 The minimum norm solution: “MINOS”
equipped with a symmetric, positive definite matrix of the metric G x and G y , respectively. Enjoy the diagonalization procedure of both matrices reviewed in Box 1.19. The inner products aac and bbc , respectively, constitute the matrix of the metric G x and G y , respectively. The base vectors {a1 ,..., am | O} span the parameter space X , dim X = m , the base vectors {b1 ,..., bm | O} the observation space, dim Y = n . Note the rank identities rk G x = m , rk G y = n , respectively. The left norm || x ||G2 = xcG x x is taken with respect to the left matrix of the metric G x . In contrast, the right norm || y ||G2 = y cG y y refers to the right matrix of the metric G y . In order to diagonalize the left quadratic form as well as the right quadratic form we transform G x 6 G *x = Diag(Ȝ 1x ,..., Ȝ mx ) = 9 cG x 9 - (1.35), (1.37), (1.39) - as well as G y 6 G *y = Diag(Ȝ 1y ,..., Ȝ ny ) = 8 cG y 8 - (1.36), (1.38), (1.40) - into the canonical form by means of the left orthonormal matrix V and by means of the right orthonormal matrix U . Such a procedure is called “eigenspace analysis of the matrix G x ” as well as “eigenspace analysis of the matrix G y ”. ȁ x constitutes the diagonal matrix of the left positive eigenvalues (Ȝ 1x ,..., Ȝ mx ) , the right positive eigenvalues (Ȝ 1y ,..., Ȝ ny ) the n-dimensional right spectrum. The inverse transformation G *x = ȁ x 6 G x - (1.39) - as well as G *y = ȁ y 6 G y - (1.40) - is denoted by “left eigenspace synthesis” as well as “right eigenspace synthesis”. x
y
Box 1.8: Canonical representation of the matrix of the metric parameter space versus observation space “parameter space X ”
“observation space”
span{a1 ,..., am } = X
Y = span{b1 ,..., bn }
aj |aj 1
2
= gj ,j 1
2
1
2
= g i ,i 1 2
aac = G x
bbc = G y
j1 , j2 {1,..., m}
i1 , i2 {1,..., n}
rk G x = m
rk G y = n
“left norms”
“right norms”
|| x ||G2 = xcG x x = (x* )cx*
(y * )cy * = y cG y y =|| y ||G2
“eigenspace analysis of the matrix G x ”
“eigenspace analysis of the matrix G y ”
x
(1.35)
ai | ai
G *x = V cG x V =
G *y = U cG y U =
= Diag(Ȝ 1x ,..., Ȝ mx ) =: ȁ x
= Diag(Ȝ 1y ,..., Ȝ ny ) =: ȁ y
y
(1.36)
28
1 The first problem of algebraic regression
subject to
subject to
(1.37)
VV c = V cV = I m
UU c = U cU = I n
(1.38)
(1.39)
(G x Ȝ xj I m ) v j = 0
(G y Ȝ iy I n )u i = 0
(1.40)
“eigenspace synthesis of the matrix G x ”
“eigenspace synthesis of the matrix G y ”
(1.41) G x = VG *x V c = Vȁ x V c
Uȁ y U c = UG *y U c = G y . (1.42)
Second, we study the impact of the left diagonalization of the metric of the metric G x as well as right diagonalization of the matrix of the metric G y on the coordinates x X and y Y , the parameter systems of the left Euclidean space X , dim X = m , and of the right Euclidian space Y . Enjoy the way how we have established the canonical coordinates x* := [ x1* ,..., xm* ]c of X as well as the canonical coordinates y * := [ y1* ,..., yn* ] called the left and right star coordinates of X and Y , respectively, in Box 1.9. In terms of those star coordinates (1.45) as well as (1.46) the left norm || x* ||2 of the type (1.41) as well as the right norm || y * ||2 of type (1.42) take the canonical left and right quadratic form. The transformations x 6 x* as well as y 6 y * of type (1.45) and (1.46) are special versions of the left and right polar decomposition: A rotation constituted by the matrices {U, V} is followed by a stretch constituted by the matrices {ȁ x , ȁ y } as diagonal matrices. The forward transformations (1.45), (1.46), x 6 x* and y 6 y * are computed by the backward transformations x* 6 x and y * 6 y . ȁ x and ȁ y , respectively, denote those diagonal matrices which are generated by the positive roots of the left and right eigenvalues, respectively. (1.49) – (1.52) are corresponding direct and inverse matrix identities. We conclude with the proof that the ansatz (1.45), (1.46) indeed leads to the canonical representation (1.43), (1.44) of the left and right norms. 1 2
1 2
1 2
1 2
Box 1.9: Canonical coordinates x* X and y * Y , parameter space versus observation space “canonical coordinates “canonical coordinates of the parameter space” of the observation space” (1.43)
|| x* ||2 = (x* )cx* =
|| y * ||2 = (y * )cy * =
= xcG x x =|| x ||G2
= y cG y y =|| y ||G2
x
(1.44)
y
ansatz (1.45)
1 2
x* = V cȁ x x
1 2
y * = U cȁ y y
(1.46)
29
1-2 The minimum norm solution: “MINOS”
versus
versus - 12 x
x = ȁ Vx
(1.47) 1 2
(1.49) ȁ x := Diag
(
- 12
y = ȁ y Uy *
*
O1x ,..., Omx
)
Diag
§ 1 1 (1.51) ȁ x := Diag ¨ ,..., x ¨ O Omx © 1 1 2
· ¸ ¸ ¹
(
)
§ 1 1 Diag ¨ ,..., y ¨ O Ony © 1
· ¸ =: ȁ -y (1.52) ¸ ¹
“the ansatz proof” G y = Uȁ y Uc
|| x ||G2 = xcG x x =
|| y ||G2 = y cG y y =
1 2
1 2
= xcVȁ x ȁ x V cx = 1 2
1 2
y
x
- 12
1 2
O1y ,..., Ony := ȁ y (1.50)
“the ansatz proof” G x = Vȁ x V c
1 2
(1.48)
1 2
= y cUȁ y ȁ y U cy = - 12
= (x* )cȁ x V cVȁ x ȁ x V cVȁ x x* =
= (y * )cȁ y U cUȁ y ȁ y U cUȁ y y * =
= (x* )cx* =|| x* ||2
= (y * )cy * =|| y * ||2 .
- 12
1 2
- 12
Third, let us discuss the dual operations of coordinate transformations x 6 x* , y 6 y * , namely the behavior of canonical bases, also called orthonormal bases e x , e y , or Cartan frames of reference e1x ,..., emx | 0 spanning the parameter space X as well as e1y ,..., eny | 0 spanning the observation space Y , here a 6 e x , b 6 e y . In terms of orthonormal bases e x and e y as outlined in Box 1.10, the matrix of the metric e x e xc = I m and e yce y = I n takes the canonical form (“modular”). Compare (1.53) with (1.55) and (1.54) with (1.56) are achieved by the changes of bases (“CBS”) of type left e x 6 a , a 6 ex - (1.57), (1.59) - and of type right e y 6 b , b 6 e y - (1.58), (1.60). Indeed these transformations x 6 x* , a 6 e x - (1.45), (1.57) - and y 6 y * , b 6 e y - (1.46), (1.58) - are dual or inverse.
{
}
{
}
Box 1.10: General bases versus orthonormal bases spanning the parameter space X as well as the observation space Y “left” “parameter space X ” “general left base”
“right” “observation space” “general right base”
span {a1 ,..., am } = X
Y = span {b1 ,..., bn }
30
1 The first problem of algebraic regression
: matrix of the metric : (1.54) bbc = G y
: matrix of the metric : (1.53) aac = G x “orthonormal left base”
{
x 1
span e ,..., e
x m
“orthonormal right base”
}=X
{
Y = span e1y ,..., eny
: matrix of the metric : (1.56) e y ecy = I n
: matrix of the metric : (1.55) e x ecx = I m
(1.57)
(1.59)
}
“base transformation”
“base transformation”
1 2
a = ȁ x Ve x
b = ȁ y Ue y
versus
versus
1 2
- 12
- 12
e y = Ucȁ y b
e x = V cȁ x a
{
(1.58)
}
{
span e1x ,..., emx = X
(1.60)
}
Y = span e1y ,..., eny .
Fourth, let us begin the eigenspace analysis versus eigenspace synthesis of the rectangular matrix A \ n× m , r := rk A = n , n < m . Indeed the eigenspace of the rectangular matrix looks differently when compared to the eigenspace of the quadratic, symmetric, positive definite matrix G x \ m × m , rk G x = m and G y \ n×n , rk G y = n of the left and right metric. At first we have to generalize the transpose of a rectangular matrix by introducing the adjoint operator A # which takes into account the matrices {G x , G y } of the left, right metric. Definition 1.5 of the adjoint operator A # is followed by its representation, namely Lemma 1.6. Definition 1.5 (adjoint operator A # ): The adjoint operator A # \ m× n of the matrix A \ n× m is defined by the inner product identity y | Ax G = x | A # y , (1.61) y
Gx
where the left inner product operates on the symmetric, full rank matrix G y of the observation space Y , while the right inner product is taken with respect to the symmetric full rank matrix G x of the parameter space X . Lemma 1.6 (adjoint operator A # ): A representation of the adjoint operator A # \ m × n of the matrix A \ n× m is A # = G -1x A cG y . (1.62)
31
1-2 The minimum norm solution: “MINOS”
For the proof we take advantage of the symmetry of the left inner product, namely y | Ax
Gy
= y cG y Ax
x | A#y
versus
Gx
= xcG x A # y
y cG y Ax = xcA cG y y = xcG x A # y A cG y = G x A # G x1A cG y = A # .
ƅ Five, we solve the underdetermined system of linear equations
{y = Ax | A \
n× m
}
, rk A = n, n < m
by introducing
• •
the eigenspace of the rectangular matrix A \ n× m of rank r := rk A = n , n < m : A 6 A* the left and right canonical coordinates: x o x* , y o y *
as supported by Box 1.11. The transformations (1.63) x 6 x* , (1.64) y 6 y * from the original coordinates ( x1 ,..., xm ) , the parameters of the parameter space X , to the canonical coordinates x1* ,..., xm* , the left star coordinates, as well as from the original coordinates ( y1 ,..., yn ) , the parameters of the observation space Y , to the canonical coordinates y1* ,..., yn* , the right star coordinates are polar decompositions: a rotation {U, V} is followed by a general stretch G y , G x . The matrices G y as well as G x are product decompositions of type G y = S y S yc and G x = S xcS x . If we substitute S y = G y or S x = G x symbolically, we are led to the methods of general stretches G y and G x respectively. Let us substitute the inverse transformations (1.65) x* 6 x = G x Vx* and (1.66) * * y 6 y = G y Uy into our system of linear equations (1.67) y = Ax or its dual (1.68) y * = A* x* . Such an operation leads us to (1.69) y * = f x* as well as (1.70) y = f ( x ) . Subject to the orthonormality conditions (1.71) U cU = I n and (1.72) V cV = I m we have generated the matrix A* of left–right eigenspace analysis (1.73)
(
)
(
{
1 2
1 2
}
1 2
)
1 2
1 2
1 2
1 2
1 2
1 2
1 2
( )
A* = [ ȁ, 0] subject to the horizontal rank partitioning of the matrix V = [ V1 , V2 ] . Alternatively, the left-right eigenspace synthesis (1.74) ªV c º A = G y U [ ȁ, 0 ] « 1 » G x «V c » ¬ 2¼ 1 2
1 2
- 12
is based upon the left matrix (1.75) L := G y U and the right matrix (1.76) R := G x V . Indeed the left matrix L by means of (1.77) LLc = G -1y reconstructs the inverse matrix of the metric of the observation space Y . Similarly, the right 1 2
32
1 The first problem of algebraic regression
matrix R by means of (1.78) RR c = G -1x generates the inverse matrix of the metric of the parameter space X . In terms of “L, R” we have summarized the eigenvalue decompositions (1.79)-(1.84). Such an eigenvalue decomposition helps us to canonically invert y * = A* x* by means of (1.85), (1.86), namely the rank partitioning of the canonical unknown vector x* into x*1 \ r and x*2 \ m r to determine x*1 = ȁ -1 y * , but leaving x*2 underdetermined. Next we shall proof that x*2 = 0 if x* is MINOS. A
X x
y Y
1 2
1 2
U cG y
V cG x
X x*
y* Y
A*
Figure 1.6: Commutative diagram of coordinate transformations Consult the commutative diagram for a short hand summary of the introduced transformations of coordinates, both of the parameter space X as well as the observation space Y . Box 1.11: Canonical representation, underdetermined system of linear equations “parameter space X ” versus “observation space Y ” (1.63) y * = U cG y y (1.64) x* = V cG x x and and 1 2
1 2
- 12
- 12
y = G y Uy *
x = G x Vx*
(1.65)
(1.66)
“underdetermined system of linear equations” y = Ax | A \ n× m , rk A = n, n < m
{
}
y = Ax
(1.67) - 12
- 12
G y Uy * = AG x Vx*
(
1 2
y * = A * x*
versus
- 12
)
(1.69) y * = U cG y AG x V x*
1 2
(1.68) 1 2
U cG y y = A* V cG x x
(
- 12
1 2
)
y = G y UA* V cG x x (1.70)
33
1-2 The minimum norm solution: “MINOS”
subject to U cU = UUc = I n
(1.71)
V cV = VV c = I m
versus
(1.72)
“left and right eigenspace” “left-right eigenspace “left-right eigenspace analysis” synthesis” A* = U cG y AG x [ V1 , V2 ] 1 2
(1.73)
1 2
ªV c º A = G y U [ ȁ, 0] « 1 » G x (1.74) «V c » ¬ 2¼ 1 2
= [ ȁ, 0]
1 2
“dimension identities” ȁ\
r×r
, 0 \ r × ( m r ) , r := rk A = n, n < m
V1 \ m × r , V2 \ m × ( m r ) , U \ r × r “left eigenspace” - 12 y
“right eigenspace” - 12
1 2
1 2
R := G x V R -1 = V cG x (1.76)
(1.75) L := G U L = U cG y -1
- 12
- 12
R 1 := G x V1 , R 2 := G x V2 1 2
1 2
R 1- := V1cG x , R -2 := V2cG x (1.77) LLc = G -1y (L-1 )cL-1 = G y (1.79)
A = LA* R -1 1
RR c = G -1x (R -1 )cR -1 = G x (1.78) versus
A* = L-1 AR A = [ ȁ, 0] =
(1.80)
*
ªR º (1.81) A = L [ ȁ, 0] « - » ¬« R 2 ¼»
versus
AA # L = Lȁ 2
versus
(1.83)
= L-1 A [ R 1 , R 2 ] ª A # AR 1 = R 1 ȁ 2 « # «¬ A AR 2 = 0
(1.82)
(1.84)
“underdetermined system of linear equations solved in canonical coordinates” (1.85)
ª x* º x* \ r ×1 y * = A* x* = [ ȁ, 0] « 1* » = ȁx*1 , * 1 ( m r )×1 x2 \ «¬ x 2 »¼ ª x*1 º ª ȁ -1 y * º « *» = « * » ¬« x 2 ¼» ¬ x 2 ¼
( )
“if x* is MINOS, then x*2 = 0 : x1*
(1.86)
m
= ȁ -1 y * .”
34
1 The first problem of algebraic regression
Six, we prepare ourselves for MINOS of the underdetermined system of linear equations
{y = Ax | A \
n× m
}
, rk A = n, n < m, || x ||G2 = min x
by introducing Lemma 1.7, namely the eigenvalue - eigencolumn equations of the matrices A # A and AA # , respectively, as well as Lemma 1.9, our basic result on “canonical MINOS”, subsequently completed by proofs. (eigenspace analysis versus eigenspace synthesis of the matrix A \ n× m , r := rkA = n < m ) The pair of matrices {L, R} for the eigenspace analysis and the eigenspace synthesis of the rectangular matrix A \ n× m of rank r := rkA = n < m , namely versus A* = L-1 AR A = LA* R -1 Lemma 1.7
{
}
or
or
A = [ ȁ, 0 ] = L A [ R 1 , R 2 ] *
-1
versus
ª R -1 º A = L [ ȁ, 0] « 1-1 » , ¬« R 2 ¼»
are determined by the eigenvalue – eigencolumn equations (eigenspace equations) of the matrices A # A and AA # , respectively, namely versus A # AR 1 = R 1 ȁ 2 AA # L = Lȁ 2 subject to ªO12 … 0 º « » ȁ 2 = « # % # » , ȁ = Diag + O12 ,..., + Or2 . « 0 " Or2 » ¬ ¼
)
(
Let us prove first AA # L = Lȁ 2 , second A # AR 1 = R 1 ȁ 2 . (i) AA # L = Lȁ 2 AA # L = AG -1x A cG y L = ªV c º ªȁº = L [ ȁ, 0] « 1 » G x G -1x (G x )c [ V1 , V2 ] « » U c(G y )cG y G y U, c 0 «V c » ¬ ¼ ¬ 2¼ 1 2
1 2
ª V cV AA # L = L [ ȁ, 0] « 1 1 «V cV ¬ 2 1 ªI AA # L = L [ ȁ, 0] « r ¬0
1 2
1 2
V1c V2 º ª ȁ º » « », V2c V2 »¼ ¬ 0c ¼ 0 º ªȁº . I m -r »¼ «¬ 0c »¼
ƅ
35
1-2 The minimum norm solution: “MINOS”
(ii) A # AR 1 = R 1 ȁ 2 A # AR = G -1x AcG y AR = ªȁº = G -1xG x V « » U c(G y )cG y G y U [ ȁ, 0] V cG x G x V, ¬ 0c ¼ ª ȁ 2 0º ªȁº A # AR = G x V « » [ ȁ, 0] = G x [ V1 , V2 ] « », ¬ 0c ¼ ¬ 0 0¼ 1 2
1 2
1 2
1 2
1 2
1 2
1 2
A # A [ R 1 , R 2 ] = G x ª¬ V1 ȁ 2 , 0 º¼ 1 2
A # AR 1 = R 1 ȁ 2 .
ƅ
{
}
The pair of eigensystems AA # L = Lȁ 2 , A # AR 1 = R 1 ȁ 2 is unfortunately based upon non-symmetric matrices AA # = AG -1x A cG y and A # A = G -1x A cG y A which make the left and right eigenspace analysis numerically more complex. It appears that we are forced to use the Arnoldi method rather than the more efficient Lanczos method used for symmetric matrices. In this situation we look out for an alternative. Indeed when we substitute
{L, R}
{
- 12
}
- 12
by G y U, G x V
- 12
into the pair of eigensystems and consequently left multiply AA # L by G x , we achieve a pair of eigensystems identified in Corollary 1.8 relying on symmetric matrices. In addition, such a symmetric pair of eigensystems produces the canonical base, namely orthonormal eigencolumns. Corollary 1.8 (symmetric pair of eigensystems): The pair of eigensystems (1.87)
1 2
1 2
G y AG -1x A c(G y )cU = ȁ 2 U versus 1 2
- 12
- 12
- 12
(G x )cA cG y AG x V1 = V1 ȁ 2 (1.88) - 12
- 12
(1.89) G y AG -1x Ac(G y )c Ȝ i2 I r = 0 versus (G x )cA cG y AG x Ȝ 2j I m = 0 (1.90) is based upon symmetric matrices. The left and right eigencolumns are orthogonal. Such a procedure requires two factorizations, 1 2
1 2
- 12
- 12
G x = (G x )cG x , G -1x = G x (G x )c
and
1 2
- 12
- 12
G y = (G y )cG y , G -1y = G y (G y )c
via Cholesky factorization or eigenvalue decomposition of the matrices G x and Gy .
36
1 The first problem of algebraic regression
Lemma 1.9 (canonical MINOS): Let y * = A* x* be a canonical representation of the underdetermined system of linear equations
{y = Ax | A \
n× m
}
, r := rkA = n, n < m .
Then the rank partitioning of x*m ª x* º ª ȁ -1 y * º x1* = ȁ -1 y * * or , x1 \ r ×1 , x*2 \ ( m r )×1 x*m = « *1 » = « » * x2 = 0 ¬x2 ¼ ¬ 0 ¼
(1.91)
is G x -MINOS. In terms of the original coordinates [ x1 ,..., xm ]c of the parameter space X a canonical representation of G x -MINOS is ª ȁ -1 º xm = G x [ V1 , V2 ] « » U cG y y , ¬ 0c ¼ 1 2
1 2
- 12
1 2
xm = G x V1 ȁ -1 U cG y = 5 1 ȁ -1 /-1 y. The G x -MINOS solution xm = A m- y - 12
1 2
A m- = G x V1 ȁ -1 U cG y is built on the canonical ( G x , G y ) weighted reflexive inverse of A . For the proof we depart from G x -MINOS (1.14) and replace the matrix A \ n× m by its canonical representation, namely eigenspace synthesis.
(
xm = G -1x Ac AG -1x Ac
)
-1
y
ªV c º A = G y U [ ȁ, 0 ] « 1 » G x «V c » ¬ 2¼ 1 2
1 2
ªVc º ªȁº AG -1x Ac = G y U [ ȁ, 0] « 1 » G x G -1x (G x )c [ V1 , V2 ] « » Uc(G y )c «V c » ¬0¼ ¬ 2¼ 1 2
1 2
- 12
- 12
1 2
(
AG -1x Ac = G y Uȁ 2 Uc(G y )c, AG -1x Ac
1 2
)
-1
( )
c = G y Uȁ -2 UcG y 1 2
( )c [V , V ] «¬ªȁ0 »¼º Uc (G )c (G )c Uȁ 1 2
xm = G -1x G x
1
2
- 12 y
1 2
y
1 2
-2
1 2
U cG y y
37
1-2 The minimum norm solution: “MINOS”
ª ȁ -1 º xm = G x [ V1 , V2 ] « » U cG y y ¬ 0 ¼ 1 2
1 2
- 12
1 2
xm = G x V1 ȁ -1 U cG y y = A m- y - 12
1 2
A m- = G x V1 ȁ -1 U cG y A1,2,4 G x
( G x weighted reflexive inverse of A ) ª x* º ª ȁ -1 º ª ȁ -1 º ª ȁ -1 y * º ƅ x*m = « *1 » = V cG x xm = « » U cG y y = « » y * = « ». ¬ 0 ¼ ¬ 0 ¼ ¬ 0 ¼ ¬x2 ¼ The important result of x*m based on the canonical G x -MINOS of {y * = A* x* | A* \ n× m , rkA* = rkA = n, n < m} needs a short comment. The rank partitioning of the canonical unknown vector x* , namely x*1 \ r , x*2 \ m r again paved the way for an interpretation. First, we acknowledge the “direct inversion” 1 2
1 2
(
)
x*1 = ȁ -1 y * , ȁ = Diag + O12 ,..., + Or2 , for instance [ x1* ,..., xr* ]c = [O11 y1 ,..., Or1 yr ]c . Second, x*2 = 0 , for instance [ xr*+1 ,..., xm* ]c = [0,..., 0]c introduces a fixed datum for the canonical coordinates ( xr +1 ,..., xm ) . Finally, enjoy the commutative diagram of Figure 1.7 illustrating our previously introduced transformations of type MINOS and canonical MINOS, by means of A m and ( A* )m . A m xm X Y y
1 2
1 2
UcG y
Y y*
V cG x
(A ) *
x*m X
m
Figure 1.7: Commutative diagram of inverse coordinate transformations Finally, let us compute canonical MINOS for the Front Page Example, specialized by G x = I 3 , G y = I 2 .
38
1 The first problem of algebraic regression
ª x1 º ª 2 º ª1 1 1 º « » y = Ax : « » = « » « x2 » , r := rk A = 2 ¬ 3 ¼ ¬1 2 4 ¼ « » ¬ x3 ¼ left eigenspace AA U = AAcU = Uȁ #
right eigenspace A # AV1 = A cAV1 = V1 ȁ 2
2
A # AV2 = A cAV2 = 0 ª2 3 5 º « 3 5 9 » = A cA « » «¬ 5 9 17 »¼
ª3 7 º AA c = « » ¬7 21¼ eigenvalues AA c Oi2 I 2 = 0
A cA O j2 I 3 = 0
O12 = 12 + 130, O22 = 12 130, O32 = 0 left eigencolumns 2 1
ª3 O (1st) « ¬ 7
7 º ª u11 º »« » = 0 21 O12 ¼ ¬u21 ¼
right eigencolumns ª 2 O12 « (1st) « 3 « 5 ¬
3 5 º ª v11 º » 2 5 O1 9 » «« v 21 »» = 0 9 17 O12 »¼ «¬ v31 »¼
subject to
subject to
2 u112 + u21 =1
2 v112 + v 221 + v31 =1
(3 O12 )u11 + 7u21 = 0
versus
ª(2 O12 )v11 + 3v 21 + 5v31 = 0 « 2 ¬3v11 + (5 O1 )v 21 + 9v31 = 0
49 49 ª 2 « u11 = 49 + (3 O 2 ) 2 = 260 + 18 130 1 « 2 2 « 2 (3 O1 ) 211 + 18 130 = «u21 = 2 2 O 49 + (3 ) 260 + 18 130 ¬« 1 2 ª v11 º 1 « 2» « v 21 » = (2 + 5O 2 ) 2 + (3 9O 2 ) 2 + (1 + 7O 2 O 4 ) 2 1 1 1 1 2 » « v31 ¬ ¼
ª (2 + 5O12 ) 2 º « » 2 2 « (3 9O1 ) » « (1 7O12 + O14 ) 2 » ¬ ¼
39
1-2 The minimum norm solution: “MINOS”
(
)
ª 62 + 5 130 2 º « » ªv º « 2» 1 « » « 105 9 130 » «v » = » « v » 102700 + 9004 130 « ¬ ¼ « 191 + 17 130 2 » ¬« ¼» 2 11 2 21 2 31
ª3 O22 (2nd) « ¬ 7
( (
ª 2 O22 7 º ª u12 º « = 0 (2nd) « 3 » 2»« 21 O2 ¼ ¬u22 ¼ « 5 ¬
) )
3 5 º ª v12 º » 2 5 O2 9 » «« v 22 »» = 0 9 17 O22 »¼ «¬ v32 »¼
subject to
subject to
u +u =1
2 v + v 222 + v32 =1
2 12
2 22
2 12
(3 O22 )u12 + 7u22 = 0
versus
ª (2 O22 )v12 + 3v 22 + 5v32 = 0 « 2 ¬ 3v12 + (5 O2 )v 22 + 9v32 = 0
49 49 ª 2 « u12 = 49 + (3 O 2 ) 2 = 260 18 130 2 « 2 2 « 2 (3 O2 ) 211 18 130 = «u22 = 2 2 + 49 (3 O ) «¬ 260 18 130 2 2 ª v12 º 1 « 2 » « v 22 » = (2 + 5O 2 ) 2 + (3 9O 2 ) 2 + (1 + 7O 2 O 4 ) 2 2 2 2 2 2 » « v32 ¬ ¼
(
ª (2 + 5O22 ) 2 º « » 2 2 « (3 9O2 ) » « (1 7O22 + O24 ) 2 » ¬ ¼
)
ª 62 5 130 2 º 2 « » ª v12 º « 2» 1 « 2 » « 105 + 9 130 » « v 22 » = 102700 9004 130 « » 2 » « v32 ¬ ¼ « 191 17 130 2 » «¬ »¼
( (
ª 2 3 5 º ª v13 º (3rd) «« 3 5 9 »» «« v 23 »» = 0 «¬ 5 9 17 »¼ «¬ v33 »¼
subject to
) )
2 v132 + v 223 + v33 =1
2v13 + 3v 23 + 5v33 = 0 3v13 + 5v 23 + 9v33 = 0 ª v13 º ª 2 3º ª v13 º ª 5º ª 5 3º ª 5º « 3 5» « v » = « 9» v33 « v » = « 3 2 » «9» v33 ¬ ¼ ¬ 23 ¼ ¬ ¼ ¬ ¼¬ ¼ ¬ 23 ¼ v13 = 2v33 , v 23 = 3v33
40
1 The first problem of algebraic regression
v132 =
2 9 1 2 2 , v 23 = , v33 = . 7 14 14
There are four combinatorial solutions to generate square roots. 2 ª u11 u12 º ª ± u11 « = «u » 2 ¬ 21 u22 ¼ «¬ ± u21
ª v11 «v « 21 «¬ v31
v12 v 22 v32
2 ª v13 º « ± v11 v 23 »» = « ± v 221 « v33 »¼ « ± v 2 31 ¬
± u122 º » 2 » ± u22 ¼ 2 ± v12
± v 222 2 ± v32
2 º ± v13 » ± v 223 » . » 2 » ± v33 ¼
Here we have chosen the one with the positive sign exclusively. In summary, the eigenspace analysis gave the result as follows. ȁ = Diag
( 12 +
130 , 12 130
7 ª « « 260 + 18 130 U=« « 211 + 18 130 « ¬ 260 + 18 130 ª 62 + 5 130 « « 102700 + 9004 130 « 105 + 9 130 « V=« « 102700 + 9004 130 « 191 + 17 130 « «« 102700 + 9004 130 ¬
)
7
º » 260 18 130 » » 211 + 18 130 » » 260 18 130 ¼
62 5 130 102700 9004 130 105 9 130 102700 9004 130 191 + 17 130 102700 9004 130
º 2 » » 14 » 3 » » = [ V1 , V2 ] . 14 » 1 » » 14 » »¼
1-3 Case study: Orthogonal functions, Fourier series versus Fourier-Legendre series, circular harmonic versus spherical harmonic regression In empirical sciences, we continuously meet the problems of underdetermined linear equations. Typically we develop a characteristic field variable into orthogonal series, for instance into circular harmonic functions (discrete Fourier transform) or into spherical harmonics (discrete Fourier-Legendre transform) with respect to a reference sphere. We are left with the problem of algebraic regression to determine the values of the function at sample points, an infinite set of coefficients of the series expansion from a finite set of observations. An infi-
41
1-3 Case study
nite set of coefficients, the coordinates in an infinite-dimensional Hilbert space, cannot be determined by finite computer manipulations. Instead, band-limited functions are introduced. Only a finite set of coefficients of a circular harmonic expansion or of a spherical harmonic expansion can be determined. It is the art of the analyst to fix the degree / order of the expansion properly. In a peculiar way the choice of the highest degree / order of the expansion is related to the Uncertainty Principle, namely to the width of lattice of the sampling points. Another aspect of any series expansion is the choice of the function space. For instance, if we develop scalar-valued, vector-valued or tensor-valued functions into scalar-valued, vector-valued or tensor-valued circular or spherical harmonics, we generate orthogonal functions with respect to a special inner product, also called “scalar product” on the circle or spherical harmonics are eigenfunctions of the circular or spherical Laplace-Beltrami operator. Under the postulate of the Sturm-Liouville boundary conditions the spectrum (“eigenvalues”) of the Laplace-Beltrami operator is positive and integer. The eigenvalues of the circular Laplace-Beltrami operator are l 2 for integer values l {0,1,..., f} , of the spherical Laplace-Beltrami operator k (k + 1), l 2 for integer values k {0,1,..., f} , l {k , k + 1,..., 1, 0,1,..., k 1, k} . Thanks to such a structure of the infinite-dimensional eigenspace of the Laplace-Beltrami operator we discuss the solutions of the underdetermined regression problem (linear algebraic regression) in the context of “canonical MINOS”. We solve the system of linear equations
{
}
{Ax = y | A \ n× m , rk A = n, n m} by singular value decomposition as shortly outlined in Appendix A. 1-31
Fourier series
? What are Fourier series ? Fourier series (1.92) represent the periodic behavior of a function x(O ) on a circle S1 . They are also called trigonometric series since trigonometric functions {1,sin O , cos O ,sin 2O , cos 2O ,...,sin AO , cos AO} represent such a periodic signal. Here we have chosen the parameter “longitude O ” to locate a point on S1 . Instead we could exchange the parameter O by time t , if clock readings would substitute longitude, a conventional technique in classical navigation. In such a setting, 2S O = Zt = t = 2SQ t , T t AO = AZt = 2S A = 2S AQ t T
42
1 The first problem of algebraic regression
longitude O would be exchanged by 2S , the product of ground period T and time t or by 2S , the product of ground frequency Q . In contrast, AO for all A {0,1,..., L} would be substituted by 2S the product of overtones A / T or AQ and time t . According to classical navigation, Z would represent the rotational speed of the Earth. Notice that A is integer, A Z . Box 1.12: Fourier series x(O ) = x1 + (sin O ) x2 + (cos O ) x3 +
(1.92)
+(sin 2O ) x4 + (cos 2O ) x5 + O3 (sin AO , cos AO ) x(O ) = lim
L of
+L
¦ e (O ) x
A = L
A
(1.93)
A
ª cos AO A > 0 « eA (O ) := « 1 A = 0 «¬sin A O A < 0. Example
(1.94)
(approximation of order three):
x (O ) = e0 x1 + e 1 x2 + e +1 x3 + e 2 x4 + e +2 x5 + O3 .
(1.95)
Fourier series (1.92), (1.93) can be understood as an infinite-dimensional vector space (linear space, Hilbert space) since the base functions (1.94) eA (O ) generate a complete orthogonal (orthonormal) system based on trigonometric functions. The countable base, namely the base functions eA (O ) or {1,sin O , cos O , sin 2O , cos 2O , ..., sin AO , cos AO} span the Fourier space L2 [0, 2S [ . According to the ordering by means of positive and negative indices { L, L + 1,..., 1, 0, +1, ..., L 1, L} (1.95) x (O ) is an approximation of the function x(O ) up to order three, also denoted by x L . Let us refer to Box 1.12 as a summary of the Fourier representation of a function x(O ), O S1 . Box 1.13: The Fourier space “The base functions eA (O ), A { L, L + 1,..., 1, 0, +1,..., L 1, L} , span the Fourier space L2 [ 0, 2S ] : they generate a complete orthogonal (orthonormal) system of trigonometric functions.” “inner product” : x FOURIER and y FOURIER : f
x y :=
1 1 ds * x( s*) y ( s*) = ³ s0 2S
2S
³ d O x(O ) 0
y (O )
(1.96)
43
1-3 Case study
“normalization” < eA (O ) | eA (O ) >:= 1
2
1 2S
2S
³ dO e
A1
(O ) eA (O ) = OA G A A 2
1
1 2
(1.97)
0
ª OA = 1 A1 = 0 subject to « 1 «¬ OA = 2 A1 z 0 1
1
“norms, convergence” || x ||
2
=
1 2S
2S 2 ³ d O x (O ) = lim
Lof
0
lim || x x L ||2 = 0
+L
¦Ox A
2 A
= OA 2S
(1.101)
³ dO e (O ) x (O ) A
0
< x | eA >
(1.102)
A
“canonical basis of the Hilbert space FOURIER” ª 2 sin AO A > 0 « e := « 1 A = 0 « ¬ 2 cos AO A < 0
A
(1.103)
(orthonormal basis) (1.104) (1.106)
1
eA
versus
eA = OA e*A
xA* = OA xA
versus
xA =
e*A =
OA
x = lim
L of
+L
¦e
A = L
* A
< x | e*A >
1
OA
xA*
(1.105) (1.107)
(1.108)
“orthonormality” < e*A (x) | e*A (x) >= G A A 1
2
1 2
(1.109)
44
1 The first problem of algebraic regression
Fourier space Lof FOURIER = span{e L , e L +1 ,..., e 1 , e0 , e1 ,..., e L 1 , e L } dim FOURIER = lim(2 L + 1) = f L of
“ FOURIER = HARM L ( S ) ”. 2
1
? What is an infinite dimensional vector space ? ? What is a Hilbert space ? ? What makes up the Fourier space ? An infinite dimensional vector space (linear space) is similar to a finite dimensional vector space: As in an Euclidean space an inner product and a norm is defined. While the inner product and the norm in a finite dimensional vector space required summation of their components, the inner product (1.96), (1.97) and the norm (1.98) in an infinite-dimensional vector space force us to integration. Indeed the inner products (scalar products) (1.96), (1.97) are integrals over the line element of S1r applied to the vectors x(O ) , y (O ) or eA , eA , respectively. Those integrals are divided by the length s of a total arc of S1r . Alternative representations of < x | y > and < eA | eA > (Dirac’s notation of brackets, decomposed into “bra” and “ket”) based upon ds = rd O , s = 2S r , lead us directly to the integration over S1 , the unit circle. 1
1
2
2
A comment has to be made to the normalization (1.97). Thanks to < eA (O ) | eA (O ) >= 0 for all A1 z A 2 , 1
2
for instance < e1 (O ) | e1 (O ) > = 0 , the base functions eA (O ) are called orthogonal. But according to < eA (O ) | eA (O ) > = 12 , for instance < e1 (O ) | e1 (O ) > = || e1 (O ) ||2 = 12 , < e 2 (O ) | e 2 (O ) > = || e 2 (O ) ||2 = 12 , they are not normalized to 1. A canonical basis of the Hilbert space FOURIER has been introduced by (1.103) e*A . Indeed the base functions e*A (O ) fulfil the condition (1.109) of orthonormality. The crucial point of an infinite dimensional vector space is convergency. When we write (1.93) x(O ) as an identity of infinite series we must be sure that the series converge. In infinite dimensional vector space no pointwise convergency is required. In contrast, (1.99) “convergence in the mean” is postulated. The norm (1.98) || x ||2 equals the limes of the infinite sum of the OA weighted, squared coordinate xA , the coefficient in the trigonometric function (1.92),
45
1-3 Case study
|| x ||2 = lim
L of
+L
¦Ox
A = L
2 A A
< f,
which must be finite. As soon as “convergence in the mean” is guaranteed, we move from a pre-Fourier space of trigonometric functions to a Fourier space we shall define more precisely lateron. Fourier analysis as well as Fourier synthesis, represented by (1.100) versus (1.101), is meanwhile well prepared. First, given the Fourier coefficients x A we are able to systematically represent the vector x FOURIER in the orthogonal base eA (O ) . Second, the projection of the vector x FOURIER onto the base vectors eA (O ) agrees analytically to the Fourier coefficients as soon as we take into account the proper matrix of the metric of the Fourier space. Note the reproducing representation (1.37) “from x to x ”. The transformation from the orthogonal base eA (O ) to the orthonormal base e*A , also called canonical or modular as well as its inverse is summarized by (1.104) as well as (1.105). The dual transformations from Fourier coefficients x A to canonical Fourier coefficients x*A as well as its inverse is highlighted by (1.106) as well as (1.107). Note the canonical reproducing representation (1.108) “from x to x ”. The space ª FOURIER = span {e L , e L +1 ,..., e L 1 , e L }º L of « » « » L + 1) = f » «¬ dim FOURIER = Llim(2 of ¼ has the dimension of hyperreal number f . As already mentioned in the introduction FOURIER = HARM L ( S ) 2
1
is identical with the Hilbert space L2 (S1 ) of harmonic functions on the circle S1 . ? What is a harmonic function which has the unit circle S1 as a support ? A harmonic function “on the unit circle S1 ” is a function x(O ) , O S1 , which fulfils (i) the one-dimensional Laplace equation (the differential equation of a harmonic oscillator) and (ii) a special Sturm-Liouville boundary condition. (1st) '1 x(O ) = 0 (
d2 + Z 2 ) x (O ) = 0 dO2
46
1 The first problem of algebraic regression
x(0) = x(2S ) ª « (2nd) «[ d x(O )](0) = [ d x(O )](2S ). «¬ d O dO The special Sturm-Liouville equations force the frequency to be integer, shortly proven now. ansatz: x(O ) = cZ cos ZO + sZ sin ZO x(0) = x(2S ) cZ = cZ cos 2SZ + sZ sin 2SZ [
d d x(O )](0) = [ x(2S )](2S ) dO dO
sZZ = cZZ sin 2SZ + sZZ cos 2SZ
cos 2SZ = 0 º Z = A A {0,1,..., L 1, L} . sin 2SZ = 0 »¼
Indeed, Z = A , A {0,1,..., L 1, L} concludes the proof. Box 1.14: Fourier analysis as an underdetermined linear model “The observation space Y ” ª y1 º ª x(O1 ) º « y » « x (O ) » 2 » « 2 » « « # » := « # » = [ x(Oi ) ] i {1,.., I }, O [ 0, 2S ] « » « » « yn 1 » « x(On 1 ) » «¬ yn »¼ «¬ x(On ) »¼
(1.110)
dim Y = n I “equidistant lattice on S1 ”
Oi = (i 1)
2S i {1,..., I } I
(1.111)
Example ( I = 2) : O1 = 0, O2 = S 180° Example ( I = 3) : O1 = 0, O2 = Example ( I = 4) : O1 = 0, O2 =
2S 4
6S 5
120°, O3 =
4S 3
240°
90°, O3 = S 180°, O4 =
Example ( I = 5) : O1 = 0, O2 =
O4 =
2S 3
2S 5
216°, O5 =
72°, O3 = 8S 5
288°
4S 5
3S 2
144°,
270°
47
1-3 Case study
“The parameter space X ” x1 = x0 , x2 = x1 , x3 = x+1 , x4 = x2 , x5 = x+2 ,..., xm 1 = x L , xm = xL (1.112) dim X = m 2 L + 1 “The underdetermined linear model” n < m : I < 2L + 1 cos O1 ª y1 º ª1 sin O1 « y » «1 sin O cos O2 2 « 2 » « y := « ... » = « « » « « yn 1 » «1 sin On 1 cos On 1 «¬ yn »¼ «¬1 sin On cos On
... ...
sin LO1 sin LO2
... sin LOn 1 ... sin LOn
cos LO1 º ª x1 º cos LO2 »» «« x2 »» » « ... » . (1.113) »« » cos LOn 1 » « xm 1 » cos LOn »¼ «¬ xm »¼
? How can we setup a linear model for Fourier analysis ? The linear model of Fourier analysis which relates the elements x X of the parameter space X to the elements y Y of the observation space Y is setup in Box 1.14. Here we shall assume that the observed data have been made available on an equidistant angular grid, in short “equidistant lattice” of the unit circle S1 parameterized by ( O1 ,..., On ) . For the optimal design of the Fourier linear model it has been proven that the equidistant lattice 2S i {1,..., I } Oi = (i 1) I is “D-optimal”. Box 1.14 contains three examples for such a lattice. In summary, the finite dimensional observation space Y , dim Y = n , n = I , has integer dimension I . I =2 0° 180° 360° level L = 0
I =3 0°
level L = 1
120°
240°
360°
level L = 2 level L = 3
I =4 0°
90° 180° 270° 360°
I =5 0° 72° 144° 216° 288° 360° Figure 1.8: Fourier series, Pascal triangular graph, weights of the graph: unknown coefficients of Fourier series
Figure 1.9: Equidistant lattice on S1 I = 2 or 3 or 4 or 5
48
1 The first problem of algebraic regression
In contrast, the parameter space X , dim X = f , is infinite dimensional. The unknown Fourier coefficients, conventionally collected in a Pascal triangular graph of Figure 1.8, are vectorized by (1.112) in a peculiar order. X = span{x0 , x1 , x+1 ,..., x L , x+ L } L of
dim X = m = f . Indeed, the linear model (1.113) contains m = 2 L + 1 , L o f , m o f , unknowns, a hyperreal number. The linear operator A : X o Y is generated by the base functions of lattice points. yi = y (Oi ) = lim
L of
L
¦ e (O ) x
A = L
A
i
A
i {1,..., n}
is a representation of the linear observational equations (1.113) in Ricci calculus which is characteristic for Fourier analysis. number of observed data at lattice points
versus
number of unknown Fourier coefficients
n=I
m = 2L + 1 o f
(finite)
(infinite)
Such a portray of Fourier analysis summarizes its peculiarities effectively. A finite number of observations is confronted with an infinite number of observations. Such a linear model of type “underdetermined of power 2” cannot be solved in finite computer time. Instead one has to truncate the Fourier series, a technique or approximation to make up Fourier series “finite” or “bandlimited”. We have to consider three cases. n>m
n=m
n<m
overdetermined case
regular case
underdetermined case
First, we can truncate the infinite Fourier series such that n > m holds. In this case of an overdetermined problem , we have more observations than equations. Second, we alternatively balance the number of unknown Fourier coefficients such that n = m holds. Such a model choice assures a regular linear system. Both linear Fourier models which are tuned to the number of observations suffer from a typical uncertainty. What is the effect of the forgotten unknown Fourier coefficients m > n ? Indeed a significance test has to decide upon any truncation to be admissible. We are in need of an objective criterion to decide upon the degree m of bandlimit. Third, in order to be as objective as possible we follow the third case of “less observations than unknowns” such that
49
1-3 Case study
n < m holds. Such a Fourier linear model which generates an underdetermined system of linear equations will consequently be considered. The first example (Box 1.15: n m = 1 ) and the second example (Box 1.16: n m = 2 ) demonstrate “MINOS” of the Fourier linear model. Box 1.15: The first example: Fourier analysis as an underdetermined linear model: n rk A = n m = 1, L = 1 “ dim Y = n = 2, dim X = m = 3 ” ªx º cos O1 º « 1 » x y = Ax cos O2 »¼ « 2 » «¬ x3 »¼
ª y1 º ª1 sin O1 « y » = «1 sin O ¬ 2¼ ¬ 2
Example ( I = 2) : O1 = 0°, O2 = 180° sin O1 = 0, cos O1 = 1,sin O2 = 0, cos O2 = 1 ª1 0 1 º ª1 sin O1 A := « »=« ¬1 0 1¼ ¬1 sin O2
cos O1 º \ 2× 3 cos O2 »¼
AA c = 2I 2 ( AA c) 1 = 12 I 2 2 1 + sin O1 sin O2 + cos O1 cos O2 º ª AA c = « » 2 ¬1 + sin O2 sin O1 + cos O2 cos O1 ¼ if Oi = (i O )
2S , then I
1 + 2sin O1 sin O2 + 2 cos O1 cos O2 = 0 or L = 1:
+L
¦ e (O A
A = L
L = 1:
i1
+L
¦ e (O
A = L
A
i1
)eA (Oi ) = 0 i1 z i2 2
)eA (Oi ) = L + 1 i1 = i2 2
ª x1 º ª1 1 º ª y1 + y2 º 1« 1« « » » 1 x A = « x2 » = A c( AA c) y = « 0 0 » y = « 0 »» 2 2 «¬ x3 »¼ A «¬1 1»¼ «¬ y1 y2 »¼ || x A ||2 = 12 y cy .
50
1 The first problem of algebraic regression
Box 1.16: The second example: Fourier analysis as an underdetermined linear model: n rk A = n m = 2, L = 2 “ dim Y = n = 3, dim X = m = 5 ”
ª y1 º ª1 sin O1 « y » = «1 sin O 2 « 2» « «¬ y3 »¼ «¬1 sin O3
cos O1 cos O2
sin 2O1 sin 2O2
cos O3
sin 2O3
ª x1 º cos 2O1 º «« x2 »» cos 2O2 »» « x3 » « » cos 2O3 »¼ « x4 » «¬ x5 »¼
Example ( I = 3) : O1 = 0°, O2 = 120° , O3 = 240° sin O1 = 0,sin O2 =
1 2
3,sin O3 = 12 3
cos O1 = 1, cos O2 = 12 , cos O3 = 12 sin 2O1 = 0,sin 2O2 = 12 3,sin 2O3 =
1 2
3
cos 2O1 = 1, cos 2O2 = 12 , cos 2O3 = 12 0 1 ª1 « 1 A := «1 2 3 12 « 1 1 ¬1 2 3 2
0 1 2
1 2
1º » 3 12 » » 3 12 ¼
AA c = 3I 3 ( AAc) 1 = 13 I 3 AA c = ª « « «1 + sin O « + sin 2O «1 + sin O « + sin 2O ¬
1 + sin O1 sin O2 + cos O1 cos O2 + + sin 2O1 sin 2O2 + cos 2O1 cos 2O 2
1 + sin O1 sin O3 + cos O1 cos O3 + + sin 2O1 sin 2O3 + cos 2O1 cos 2O3
sin O1 + cos O2 cos O1 + sin 2O1 + cos 2O2 cos 2O1
3
1 + sin O2 sin O3 + cos O2 cos O3 + + sin 2O2 sin 2O3 + cos 2O 2 cos 2O3
sin O1 + cos O3 cos O1 + sin 2O1 + cos 2O3 cos 2O1
1 + sin O3 sin O2 + cos O3 cos O 2 + + sin 2O3 sin 2O2 + cos 2O3 cos 2O2
3
3
2
2
3
3
if Oi = (i 1)
2S , then I
1 + sin O1 sin O2 + cos O1 cos O2 + sin 2O1 sin 2O2 + cos 2O1 cos 2O2 = = 1 12 12 = 0
º » » » » » » ¼
51
1-3 Case study
1 + sin O1 sin O3 + cos O1 cos O3 + sin 2O1 sin 2O3 + cos 2O1 cos 2O3 = = 1 12 12 = 0 1 + sin O2 sin O3 + cos O2 cos O3 + sin 2O2 sin 2O3 + cos 2O2 cos 2O3 = = 1 34 14 14 + 14 = 0 L = 2:
+L
¦ e (O
A = L
L = 2:
A
i1
+L
¦ e (O
A = L
A
i1
)eA (Oi ) = 0 i1 z i2 2
)eA (Oi ) = L + 1 i1 = i2 2
1 1 º ª1 « » 1 1 «0 2 3 2 3 » ª y1 º 1 12 12 » «« y2 »» x A = Ac( AAc) 1 y = «1 « » 3 «0 12 3 12 3 » «¬ y3 »¼ « » 12 12 »¼ «¬1 ª y1 + y2 + y3 º ª x1 º « 1 » «x » 1 « 2 3 y2 2 3 y3 » « 2» 1 x A = « x3 » = « y1 12 y2 12 y3 » , » 3« « » « 12 3 y2 + 12 3 y3 » « x4 » « » «¬ x5 »¼ «¬ y1 12 y2 12 y3 »¼ A
1 || x ||2 = y cy . 3
Lemma 1.10 (Fourier analysis): If finite Fourier series ª x1 º « x2 » « x3 » yi = y (Oi ) = [1,sin Oi , cos Oi ,..., cos( L 1)Oi ,sin LOi , cos LOi ] « # » (1.114) « xm 2 » «x » « xm 1 » ¬ m ¼ or y = Ax, A \ n× m , rk A = n, I = n < m = 2 L + 1 A O ( n ) := {A \ n × m | AA c = ( L + 1)I n }
(1.115)
are sampled at observations points Oi S1 on an equidistance lattice (equiangular lattice)
52
1 The first problem of algebraic regression
Oi = (i 1)
2S I
i, i1 , i2 {1,..., I } ,
(1.116)
then discrete orthogonality AA c = ( L + 1)I n
+L
¦ e (O A
A=L
i1
ª 0 i1 z i2 )eA (Oi ) = « ¬ L + 1 i1 = i2 2
(1.117)
applies. A is an element of the orthogonal group O(n) . MINOS of the underdetermined system of linear equations (1.95) is xm = 1-32
1 A cy, L +1
xm
2
=
1 y cy. L +1
(1.118)
Fourier-Legendre series ? What are Fourier-Legendre series ?
Fourier-Legendre series (1.119) represent the periodic behavior of a function x(O , I ) on a sphere S 2 . They are called spherical harmonic functions since {1, P11 (sin I ) sin O , P10 (sin I ), P11 (sin I ) cos I ,..., Pkk (sin I ) cos k O} represent such a periodic signal. Indeed they are a pelicular combination of Fourier’s trigonometric polynomials {sin AO , cos AO} and Ferrer’s associated Legendre polynomials Pk A (sin I ) . Here we have chosen the parameters “longitude O and latitude I ” to locate a point on S 2 . Instead we could exchange the parameter O by time t , if clock readings would submit longitude, a conventional technique in classical navigation. In such a setting,
O = Zt =
2S t = 2SQ t , T
t = 2S AQ t , T longitude O would be exchanged by 2S the product of ground period T and time t or by 2S the product of ground frequency Q . In contrast, AO for all A { k , k + 1,..., 1, 0,1,..., k 1, k} would be substituted by 2S the product of overtones A / T or AQ and time t . According to classical navigation, Z would represent the rotational speed of the Earth. Notice that both k , A are integer, k, A Z . AO = AZt = 2S A
Box 1.17: Fourier-Legendre series x (O , I ) =
(1.119)
P00 (sin I ) x1 + P11 (sin I ) sin O x2 + P10 (sin I ) x3 + P1+1 (sin I ) cos O x4 + + P2 2 (sin I ) sin 2O x5 + P21 (sin I ) sin O x6 + P20 (sin I ) x7 + P21 (sin I ) cos O x8 +
53
1-3 Case study
+ P22 (sin I ) cos 2O x9 + O3 ( Pk A (sin I )(cos AO ,sin AO )) K
+k
x(O , I ) = lim ¦ ¦ e k A (O , I ) xk A
(1.120)
Pk A (sin I ) cos AO A > 0 ° ek A (O , I ) := ® Pk 0 (sin I ) A = 0 ° P (sin I ) sin | A | O A < 0 ¯ kA
(1.121)
K of
K
k = 0 A = k
k
x(O , I ) = lim ¦ ¦ Pk A (sin I )(ck A cos AO + sk A sin AO ) K of
(1.122)
k = 0 A = k
“Legendre polynomials of the first kind” recurrence relation k Pk (t ) = (2k 1) t Pk 1 (t ) (k 1) Pk 2 (t ) º » initial data : P0 (t ) = 1, P1 (t ) = t ¼ Example: 2 P2 (t ) = 3tP1 (t ) P0 (t ) = 3t 2 1 P2 (t ) = 32 t 2 12 “if t = sin I , then P2 (sin I ) = 32 sin 2 I 12 ” “Ferrer’s associates Legendre polynomials of the first kind” Pk A (t ) := (1 t 2 )
l
2
d A Pk (t ) dt A
Example: P11 (t ) = 1 t 2
d P1 (t ) dt
P11 (t ) = 1 t 2 “if t = sin I , then P11 (sin I ) = cos I ” Example: P21 (t ) = 1 t 2 P21 (t ) = 1 t 2
d P2 (t ) dt
d 3 2 1 ( t 2) dt 2
P21 (t ) = 3t 1 t 2
(1.123)
54
1 The first problem of algebraic regression
“if t = sin I , then P21 (sin I ) = 3sin I cos I ” Example: P22 (t ) = (1 t 2 )
d2 P2 (t ) dt 2
P22 (t ) = 3(1 t 2 ) “if t = sin I , then P22 (sin I ) = 3cos 2 I ” Example (approximation of order three): x (O , I ) = e00 x1 + e11 x2 + e10 x3 + e11 x4 +
(1.124)
+e 2 2 x5 + e 21 x6 + e 20 x7 + e 21 x8 + e 22 x9 + O3 recurrence relations vertical recurrence relation Pk A (sin I ) = sin I Pk 1, A (sin I ) cos I ¬ª Pk 1, A +1 (sin I ) Pk 1, A 1 (sin I ) ¼º initial data: P0 0 (sin I ) = 1, Pk A = Pk A A < 0 k 1, A 1
k 1, A
k 1, A + 1
k,A Fourier-Legendre series (1.119) can be understood as an infinite-dimensional vector space (linear space, Hilbert space) since the base functions (1.120) e k A (O , I ) generate a complete orthogonal (orthonormal) system based on surface spherical functions. The countable base, namely the base functions e k A (O , I ) or {1, cos I sin O ,sin I , cos I cos O ,..., Pk A (sin I ) sin AO} span the Fourier-Legendre space L2 {[0, 2S [ × ] S / 2, +S / 2[} . According to our order xˆ(O , I ) is an approximation of the function x(O , I ) up to order Pk A (sin I ) {cos AO ,sin A O } for all A > 0, A = 0 and A < 0, respectively. Let us refer to Box 1.17 as a summary of the Fourier-Legendre representation of a function x(O , I ), O [0, 2S [, I ] S/2, +S/2[.
55
1-3 Case study
Box 1.18: The Fourier-Legendre space “The base functions e k A (O , I ) , k {1,..., K } , A { K , K + 1,..., 1, 0,1,..., K 1, K } span the Fourier-Legendre space L2 {[0, 2S ] ×] S 2, + S 2[} : they generate a complete orthogonal (orthonormal) system of surface spherical functions.” “inner product” : x FOURIER LEGENDRE and y FOURIER LEGENDRE : 1 dS x(O , I )T y (O , I ) = S³
< x | y >= =
1 4S
(1.125)
+S 2
2S
³ dO 0
dI cosI x(O , I )y (O , I )
³
S 2
“normalization” (O , I ) > = 1 1
2 2
1 4S
³ dO 0
= Ok A G k k G A A 1 1
1 2
1 1
³
dI cos I e k A (O , I )e k A (O , I ) (1.126) 1 1
2 2
S 2
1 2
Ok A =
+S 2
2S
k1 , k2 {0,..., K } A 1 , A 2 { k ,..., + k}
1 (k1 A1 )! 2k1 + 1 (k1 + A1 )!
(1.127)
“norms, convergence” +S 2 2S 1 O || x ||2 = d dI cos I x 2 (O , I ) = ³ 4S ³0 S 2 K
(1.128)
+k
= lim ¦ ¦ Ok A x k2A < f K of
k = 0 A = k
lim || x x K ||2 = 0 (convergence in the mean)
K of
(1.129)
“synthesis versus analysis” K
+k
(1.130) x = lim ¦ ¦ e k A xk A K of
xk A =
versus
k = 0 A = k
:=
1 4SOk A
2S
³ dO 0
1 < e k A | x >:= O
(1.131)
+S 2
³
S 2
dI cos I e k A (O , I )x(O , I )
56
1 The first problem of algebraic regression +k
K
1 ekA < x | ekA > K of k = 0 A = k Ok A
x = lim ¦ ¦
(1.132)
“canonical basis of the Hilbert space FOURIER-LEGENDRE” ª 2 cos AO A > 0 « 1 e (O , I ) := 2k + 1 Pk A (sin I ) « A = 0 (k A )! « ¬ 2 sin A O A < 0 (k + A )!
* kA
(1.133)
(orthonormal basis) 1
(1.134)
e*k A =
ek A
versus
e k A = Ok A e*k A
(1.136)
xk*A = Ok A xk A
versus
xk A =
Ok A
K
1
Ok A
xk*A
(1.137)
+k
x = lim ¦ ¦ e*k A < x | e*k A > K of
(1.135)
(1.138)
k = 0 A = k
“orthonormality” < e*k A (O , I ) | e*k A (O , I ) > = G k k G A A 1 2
(1.139)
1 2
Fourier-Legendre space K of FOURIER LEGENDRE = span {e K , L , e K , L +1 ,..., e K , 1 , e K ,0 , e K ,1 ,..., e K , L 1 , e K , L } dim FOURIER LEGENDRE = lim ( K + 1) 2 = f K of
“ FOURIER LEGENDRE = HARM L ( S ) ”. 2
2
An infinite-dimensional vector space (linear space) is similar to a finitedimensional vector space: As in an Euclidean space an inner product and a norm is defined. While the inner product and the norm in a finite-dimensional vector space required summation of their components, the inner product (1.125), (1.126) and the norm (1.128) in an infinite-dimensional vector space force us to integration. Indeed the inner products (scalar products) (1.125) (1.126) are integrals over the surface element of S 2r applied to the vectors x(O , I ), y (O , I ) or e k A , e k A respectively. 1 1
2 2
Those integrals are divided by the size of the surface element 4S of S 2r . Alternative representations of < x, y > and <e k A , e k A > (Dirac’s notation of a bracket 1 1
2 2
57
1-3 Case study
decomposed into “bra” and “txt” ) based upon dS = rd O dI cos I , S = 4S r 2 , lead us directly to the integration over S 2r , the unit sphere. Next we adopt the definitions of Fourier-Legendre analysis as well as FourierLegendre synthesis following (1.125) - (1.139). Here we concentrate on the key problem: ?What is a harmonic function which has the sphere S 2 as a support? A harmonic function “on the unit sphere S 2 ” is a function x(O, I), (O, I) [0, 2S[ × ] S / 2, +S / 2[ which fulfils (i)
the two-dimensional Laplace equation (the differential equation of a two-dimensional harmonic osculator) and
(ii)
a special Sturm-Liouville boundary condition (1st) ' k A x(O , I ) = 0 (
d2 + Z ) x(O ) = 0 dO2
plus the harmonicity condition for ' k ª x(0) = x(2S ) d (2nd) « d «¬[ d O x(O )](0) = [ d O x(O )](2S ). The special Strum-Liouville equation force the frequency to be integer! Box 1.19: Fourier-Legendre analysis as an underdetermined linear model - the observation space Y “equidistant lattice on S 2 ” (equiangular) S S O [0, 2S [, I ] , + [ 2 2 2S ª O = (i 1) i {1,..., I } I = 2J : « i I « j {1,..., I } Ij ¬ S S J ª k {1,..., } « Ik = J + (k 1) J 2 J even: « «Ik = S (k 1) S k { J + 2 ,..., J } «¬ J J 2 S J +1 ª « Ik = (k 1) J + 1 k {1,..., 2 } J odd: « J +3 «Ik = (k 1) S k { ,..., J } J +1 2 ¬«
58
1 The first problem of algebraic regression
longitudinal interval: 'O := Oi +1 Oi =
2S I
S ª « J even : 'I := I j +1 I j = J lateral interval: « « J odd : 'I := I j +1 I j = S «¬ J +1 “initiation: choose J , derive I = 2 J ” 'I J ª k {1,..., } « Ik = 2 + (k 1)'I 2 J even: « ' I J + 2 «Ik = (k 1)'I k { ,..., J } 2 2 ¬ J +1 ª k {1,..., } « Ik = (k 1)'I 2 J odd: « «Ik = (k 1)'I k { J + 3 ,..., J } ¬ 2 Oi = (i 1)'O i {1,..., I } and I = 2 J “multivariate setup of the observation space X ” yij = x(Oi , I j ) “vectorizations of the matrix of observations” Example ( J = 1, I = 2) : Sample points Observation vector y (O1 , I1 ) ª x(O1 , I1 ) º 2×1 « x (O , I ) » = y \ (O2 , I1 ) ¬ 2 1 ¼ Example ( J = 2, I = 4) : Sample points Observation vector y (O1 , I1 ) ª x(O1 , I1 ) º « x (O , I ) » (O2 , I1 ) « 2 1 » (O3 , I1 ) « x(O3 , I1 ) » « » (O4 , I1 ) « x(O4 , I1 ) » = y \8×1 (O1 , I2 ) « x(O1 , I2 ) » « » (O2 , I2 ) « x(O2 , I2 ) » « x (O , I ) » (O3 , I2 ) « 3 2 » (O4 , I2 ) ¬« x(O4 , I2 ) ¼» Number of observations: n = IJ = 2 J 2 Example: J = 1 n = 2, J = 3 n = 18 J = 2 n = 8, J = 4 n = 32.
59
1-3 Case study
?How can we setup a linear model for Fourier-Legendre analysis? The linear model of Fourier-Legendre analysis which relates the elements of the parameter space X to the elements y Y of the observations space Y is again setup in Box 1.19. Here we shall assume that the observed data have been made available on a special grid which extents to O [0, 2S[, I ] S / 2, +S / 2[ 2S ª O = (i 1) , i {1,..., I } I = 2J : « i I « Ij, j {1,..., I }! ¬ longitudinal interval: 2S 'O =: O i +1 O i = I lateral interval: S J even: 'I =: I j + i I j = J S J odd: 'I =: I j +1 I j = . J +1 In addition, we shall review the data sets fix J even as well as for J odd. Examples are given for (i) J = 1, I = 2 and (ii) J = 2, I = 4 . The number of observations which correspond to these data sets have been (i) n = 18 and (ii) n = 32 . For the optimal design of the Fourier-Legendre linear model it has been shown that the equidistant lattice ª J even: 2S O i = (i 1) , I j = « I ¬ J odd: ª S S J½ «Ik = J + (k 1) J , k ®1,..., 2 ¾ ¯ ¿ J even: « « S S J +2 ½ ,..., J ¾ «Ik = (k 1) , k ® J J ¯ 2 ¿ ¬ ª S J + 1½ «Ik = (k 1) J + 1 , k ®1,..., 2 ¾ ¯ ¿ J odd: « « S J +3 ½ , k ® ,..., J ¾ «Ik = (k 1) J + 1 2 ¯ ¿ ¬ is “D-optimal”. Table 1.1 as well Table 1.2 are samples of an equidistant lattice on S 2 especially in a lateral and a longitudinal lattice.
60
1 The first problem of algebraic regression
Table 1.1: Equidistant lattice on S 2 - the lateral lattice J 1 2 3 4 5 6 7 8 9 10
'I
1
2
3
lateral grid 5 6
4
0° +45° -45° 90° 0° +45° -45° 45° 45° +22,5° +67.5° -22.5° -67.5° 0° +30° +60° -30° 30° 15° +45° +75° -15° 30° 0° +22.5° +45° +67.5° 22.5° 22.5° +11.25° +33.75° +56.25° +78.75° 0° +18° +36° +54° 18° +90° +27° +45° +63° 18°
7
8
9
10
-60° -45°
-75°
-22.5°
-45°
-67.5°
-11.25° -33.75° -56.25° -78.75° +72°
-18°
-36°
-54°
-72°
+81°
-9°
-27°
-45°
-63°
-81°
Longitudinal grid 5 6 7
8
9
10
288°
324°
Table 1.2: Equidistant lattice on S 2 - the longitudinal lattice J I = 2 J 'O 1 2 3 4 5
2 4 6 8 10
180° 90° 60° 45° 36°
1
2
3
4
0°
180°
0°
90°
180°
270°
0°
60°
120°
180°
240°
300°
0°
45°
90°
135°
180°
225°
270° 315°
0°
36°
72°
108°
144°
180°
216° 252°
In summary, the finite-dimensional observation space Y, dimY = IJ , I = 2J has integer dimension. As samples, we have computed via Figure 1.10 various horizontal and vertical sections of spherical lattices for instants (i) J = 1, I = 2, (ii) J = 2, I = 4, (iii) J = 3, I = 6, (iv) J = 4, I = 8 and (v) J = 5, I = 10 . By means of Figure 1.11 we have added the corresponding Platt-Carré Maps. Figure 1.10: Spherical lattice left: vertical section, trace of parallel circles right: horizontal section, trace of meridians vertical section meridian section
horizontal section perspective of parallel circles
61
1-3 Case study
+
I
J = 1, I = 2
J = 2, I = 4
J = 2, I = 4
J = 3, I = 6
J = 3, I = 6
J = 4, I = 8
J = 4, I = 8
J = 5, I = 10
J = 5, I = 10
S 2
0°
J = 1, I = 2
S 2
+
S
2S
I
O
S 2
S 2
Figure 1.11 a: Platt-Carré Map of S 2 longitude-latitude lattice, case: J = 1, I = 2, n = 2
62
1 The first problem of algebraic regression +
I
S 2
0°
S 2
+
S 2
I
0°
S 2
+
S 2
I
0°
S 2
+
S 2
I
0°
S 2
+
S
2S
O
S
2S
O
S
2S
I
S 2
+
S 2
I
S 2
+
S 2
I
S 2
+
S 2
O
S
2S
I
O
S 2
S 2
Figure 1.11 b: Platt-Carré Map of S 2 longitude-latitude lattice, case: J = 2, I = 4, n = 8
Figure 1.11 c: Platt-Carré Map of S 2 longitude-latitude lattice, case: J = 3, I = 6, n = 18
Figure 1.11 d: Platt-Carré Map of S 2 longitude-latitude lattice, case: J = 4, I = 8, n = 32
Figure 1.11 e: Platt-Carré Map of S 2 longitude-latitude lattice, case: J = 5, I = 10, n = 50.
In contrast, the parameter space X, dimX = f is infinite-dimensional. The unknown Fourier-Legendre coefficients, collected in a Pascal triangular graph of Figure 1.10 are vectorized by X = span{x00 , x11 , x10 , x11 ,..., xk A } k of k =0ok | A |= 0 o k dim X = m = f .
63
1-3 Case study
Indeed the linear model (1.138) contains m = IJ = 2 J 2 , m o f, unknows, a hyperreal number. The linear operator A : X o Y is generated by the base functions of lattice points K
+k
jij = y ( xij ) = lim ¦ ¦ e k A (Oi , I j )x k A K of
k = 0 A = k
i, j {1,..., n} is a representation of the linear observational equations (1.138) in Ricci calculus which is characteristic for Fourier-Legendre analysis. number of observed data at lattice points
number of unknown Fourier-Legendre coefficients K
n = IJ = 2 J 2
versus
+k
m = lim ¦ ¦ e k A K of
(finite)
k = 0 A = k
(infinite)
Such a portray of Fourier-Legendre analysis effectivly summarizes its peculiarrities. A finite number of observations is confronted with an infinite number of observations. Such a linear model of type “underdetermined of power 2” cannot be solved in finite computer time. Instead one has to truncate the FourierLegendre series, leaving the serier “bandlimited”. We consider three cases. n>m
n=m
n<m
overdetermined case
regular cases
underdetermined case
First, we have to truncate the infinite Fourier-Legendre series that n > m hold. In this case of an overdetermined problem, we have more observations than equations. Second, we alternativly balance the number of unknown FourierLegendre coefficients such that n = m holds. Such a model choice assures a regular linear system. Both linear Fourier-Legendre models which are tuned to the number of observations suffer from a typical uncertainty. What is the effect of the forgotten unknown Fourier-Legendre coefficients m > n ? Indeed a significance test has to decide upon any truncation to be admissible. We need an objective criterion to decide upon the degree m of bandlimit. Third, in order to be as obiective as possible we again follow the third case of “less observations than unknows” such that n < m holds. Such a Fourier-Legendre linear model generating an underdetermined system of linear equations will consequently be considered.
64
1 The first problem of algebraic regression
A first example presented in Box 1.20 demonstrates “MINOS” of the FourierLegendre linear model for n = IJ = 2 J 2 = 2 and k = 1, m = (k + 1) 2 = 4 as unknowns and observations. Solving the system of linear equations Z and four unknows [x1 , x2 , x3 , x4 ](MINOS) = ª y1 + y2 º ª1 1 º « » «0 0 » 1 » = 1 « 0 ». = « 2 «0 0 » 2 « 0 » « » « » ¬1 1¼ ¬ y1 y2 ¼ The second example presented in Box 1.21 refers to “MINOS” of the FourierrLegendre linear model for n = 8 and m = 9 . We have computed the design matrix A .
Box 1.20 The first example: Fourier-Legendre analysis as an underdetermined linear model: m rk A = m n = 2 dim Y = n = 2 versus dim X = m = 4 J = 1, I = 2 J = 2 n = IJ = 2 J 2 = 2 versus K = 1 m = ( k + 1) 2 = 4 ª x1 º « » ª º ª y1 º 1 P11 ( sin I1 ) sin O1 P10 ( sin I1 ) P11 ( sin I1 ) cos O1 « x2 » = « » « y » 1 P sin I sin O P sin I P sin I cos O « x » ¬ 2 ¼ «¬ 11 ( 2) 2 10 ( 2 ) 11 ( 2) 2» ¼« 3» ¬ x4 ¼ subject to
( O1 , I1 ) = (0D , 0D ) and ( O2 , I2 ) = (180D , 0D ) {y = Ax A \ 2×4 , rkA = n = 2, m = 4, n = m = 2} ª1 0 0 1 º ª1 P11 ( sin I1 ) sin O1 P10 ( sin I1 ) P11 ( sin I1 ) cos O1 º A := « » »=« ¬1 0 0 1¼ «¬1 P11 ( sin I2 ) sin O2 P10 ( sin I2 ) P11 ( sin I2 ) cos O2 »¼ P11 ( sin I ) = cos I , P10 ( sin I ) = sin I P11 ( sin I1 ) = P11 (sin I2 ) = 1 , P10 ( sin I1 ) = P10 (sin I2 ) = 0 sin O1 = sin O2 = 0, cos O1 = 1, cos O2 = 1
65
1-3 Case study
AAc = 1 + P11 ( sin I1 ) P11 (sin I2 ) sin O1 sin O2 + º » + P10 (sin I1 ) P10 (sin I2 ) + » » + P11 ( sin I1 ) P11 (sin I2 ) cos O1 cos O2 » » » » 1 + P112 ( sin I2 ) + P102 (sin I2 ) » ¼»
ª « 2 2 « 1 + P11 ( sin I1 ) + P10 (sin I1 ) « « «1 + P ( sin I ) P (sin I ) sin O sin O + 11 2 11 1 2 1 « « + P10 ( sin I2 ) P10 (sin I2 ) + « ¬« + P11 ( sin I2 ) P11 (sin I1 ) cos O2 cos O1
1 AAc = 2I 2 (AAc)-1 = I 2 2 K =1
¦
+ k1 , + k 2
e k A (Oi , Ii ) e k A (Oi , Ii ) = 0, i1 z i2
¦
k1 , k 2 = 0 A1 =-k1 , A 2 = k 2 K =1
¦
+ k1 , + k 2
¦
1 1
1
1
2 2
2
2
e k A (Oi , Ii ) e k A (Oi , Ii ) = 2, i1 = i2
k1 , k 2 = 0 A1 =-k1 , A 2 = k 2
1 1
1
1
2 2
2
2
ª x1 º ª c00 º «x » «s » 1 2» « xA = ( MINOS ) = « 11 » ( MINOS ) = A cy = « x3 » « c10 » 2 « » « » ¬ x4 ¼ ¬ c11 ¼ ª y1 + y2 º ª1 1 º « » «0 0 » y 1 » ª 1º = 1 « 0 ». = « « » 2 « 0 0 » ¬ y2 ¼ 2 « 0 » « » « » ¬1 1¼ ¬ y1 y2 ¼
Box 1.21 The second example: Fourier-Legendre analysis as an underdetermined linear model: m rk A = m n = 1 dim Y = n = 8 versus dim X 1 = m = 9 J = 2, I = 2 J = 4 n = IJ = 2 J 2 = 8 versus k = 2 m = (k + 1) 2 = 9
66
1 The first problem of algebraic regression
ª y1 º ª1 P11 ( sin I1 ) sin O1 P10 ( sin I1 ) P11 ( sin I1 ) cos O1 « ... » = «" … … … « » « ¬ y2 ¼ «¬ 1 P11 ( sin I8 ) sin O8 P10 ( sin I8 ) P11 ( sin I8 ) cos O8 P22 ( sin I1 ) sin 2O1 P21 ( sin I1 ) sin O1 P20 ( sin I1 ) … … … P22 ( sin I8 ) sin 2O8 P21 ( sin I8 ) sin O8 P20 ( sin I8 ) P21 ( sin I1 ) cos O1 P22 ( sin I1 ) cos 2O1 º ª x1 º »« » … … » «... » P21 ( sin I8 ) cos O8 P22 ( sin I8 ) cos 2O8 »¼ «¬ x9 »¼ “equidistant lattice, longitudinal width 'O , lateral width 'I ” 'O = 90D , 'I = 90D (O1 , I1 ) = (0D , +45D ), (O2 , I2 ) = (90D , +45D ), (O3 , I3 ) = (180D , +45D ), (O4 , I4 ) = (270D , +45D ), (O5 , I5 ) = (0D , 45D ), (O6 , I6 ) = (90D , 45D ), (O7 , I7 ) = (180D , 45D ), (O8 , I8 ) = (270D , 45D ) P11 (sin I ) = cos I , P10 (sin I ) = sin I P11 (sin I1 ) = P11 (sin I2 ) = P11 (sin I3 ) = P11 (sin I4 ) = cos 45D = 0,5 2 P11 (sin I5 ) = P11 (sin I6 ) = P11 (sin I7 ) = P11 (sin I8 ) = cos( 45D ) = 0,5 2 P10 (sin I1 ) = P10 (sin I2 ) = P10 (sin I3 ) = P10 (sin I4 ) = sin 45D = 0,5 2 P10 (sin I5 ) = P10 (sin I6 ) = P10 (sin I7 ) = P10 (sin I8 ) = sin( 45D ) = 0,5 2 P22 (sin I ) = 3cos 2 I , P21 (sin I ) = 3sin I cos I , P20 (sin I ) = (3 / 2) sin 2 I (1/ 2) P22 (sin I1 ) = P22 (sin I2 ) = P22 (sin I3 ) = P22 (sin I4 ) = 3cos 2 45D = 3 / 2 P22 (sin I5 ) = P22 (sin I6 ) = P22 (sin I7 ) = P22 (sin I8 ) = 3cos 2 ( 45D ) = 3 / 2 P21 (sin I1 ) = P21 (sin I2 ) = P21 (sin I3 ) = P21 (sin I4 ) = 3sin 45D cos 45D = 3 / 2 P21 (sin I5 ) = P21 (sin I6 ) = P21 (sin I7 ) = P21 (sin I8 ) = 3sin( 45D ) cos( 45D ) = 3 / 2 P20 (sin I1 ) = P20 (sin I2 ) = P20 (sin I3 ) = P20 (sin I4 ) = (3 / 2) sin 2 45D (1/ 2) = 1/ 4 P20 (sin I5 ) = P20 (sin I6 ) = P20 (sin I7 ) = P20 (sin I8 ) = (3 / 2) sin 2 ( 45D ) (1/ 2) = 1/ 4 sin O1 = sin O3 = sin O5 = sin O7 = 0 sin O2 = sin O6 = +1, sin O4 = sin O8 = 1 cos O1 = cos O5 = +1, cos O2 = cos O4 = cos O6 = cos O8 = 0
67
1-3 Case study
cos O3 = cos O7 = 1 sin 2O1 = sin 2O2 = sin 2O3 = sin 2O4 = sin 2O5 = sin 2O6 = sin 2O7 = sin 2O8 = 0 cos 2O1 = cos 2O3 = cos 2O5 = cos 2O7 = +1 cos 2O2 = cos 2O4 = cos 2O6 = cos 2O8 = 1 A \8×9 ª1 0 «1 2/2 « 0 «1 1 2/2 « A=« 1 0 « 2/2 «1 1 0 « «¬1 2 / 2
0 0 1/ 4 3 / 2 3 / 2 º 0 3 / 2 1/ 4 0 3 / 2 » » 0 0 1/ 4 3 / 2 3 / 2 » 0 3 / 2 1/ 4 0 3 / 2 » . 0 0 1/ 4 3 / 2 3 / 2 » » 0 3 / 2 1/ 4 0 3 / 2 » 0 0 1/ 4 3 / 2 3 / 2 » 0 3 / 2 1/ 4 0 3 / 2 »¼ rkA < min{n, m} < 8.
2/2 2/2 2/2 0 2/2 2/2 2/2 0 2/2 2/2 2/2 0 2/2 2/2 2/2 0
Here “the little example” ends, since the matrix A is a rank smaller than 8! In practice, one is taking advantage of • Gauss elimination or • weighting functions in order to directly compute the Fourier-Legendre series. In order to understand the technology of “weighting function” better, we begin with rewriting the spherical harmonic basic equations. Let us denote the letters f k A :=
1 4S
+S / 2
³
S / 2
2S
dI cos I ³ d O Z (I )e k A (O , I ) f (O , I ) , 0
the spherical harmonic expansion f k A of a spherical function f (O , I ) weighted by Z (I ) , a function of latitude. A band limited representation could be specified by +S / 2 2S K 1 f k A := d I I d O Z I e k A (O , I )e k A (O , I ) f kA cos ( ) ¦ ³0 4S S³/ 2 k ,A 1 1
1
fkA =
K
¦
f k,A 1
1 4S
1
k1 , A1
=
S /2
³
S / 2
2S
dI cos I ³ d O w(I )e k A (O , I )e k A ( O , I ) = 1 1
0
K
¦ f ¦g e k1 , A1
k1 , A1
ij
kA
( Oi , I j )e k A ( Oi , I j ) = 1 1
i, j
K
= ¦ gij ¦ gij ekA (Oi ,I j )ek A (Oi ,I j ) = i, j
1 1
1
k1 ,A1
1 1
68
1 The first problem of algebraic regression
= ¦ gij f ( Oi , I j )e k A ( Oi , I j ) . i, j
As a summary, we design the weighted representation of Fourier-Legendre synthesis J
f = ¦ g j Pk*A (sin I j ) f A (I j ) kA
j =1
J
1st: Fourier f (I j ) = ¦ g j eA (O ) f (Oi , I j ) A
i =1
J
2nd: Legendre f k,A ( I , J ) = ¦ g j Pk*A (sin I j ) f A (I j ) j =1
lattice: (Oi , I j ) .
1-4 Special nonlinear models As an example of a consistent system of linearized observational equations Ax = y , rk A = rk( A, y ) where the matrix A R n× m is the Jacobi matrix (Jacobi map) of the nonlinear model, we present a planar triangle whose nodal points have to be coordinated from three distance measurements and the minimum norm solution of type I -MINOS. 1-41
Taylor polynomials, generalized Newton iteration
In addition we review the invariance properties of the observational equations with respect to a particular transformation group which makes the a priori indeterminism of the consistent system of linearized observational equations plausible. The observation vector Y Y { R n is an element of the column space Y R ( A) . The geometry of the planar triangle is illustrated in Figure 1.12. The point of departure for the linearization process of nonlinear observational equations is the nonlinear mapping X 6 F ( X) = Y . The B. Taylor expansion Y = F( X) = F(x) + J (x)( X x) + H( x)( X x)
( X x) + + O [( X x)
( X x)
( X x)], which is truncated to the order O [( X x)
( X x)
( X x)], J ( x), H( x) , respectively, represents the Jacobi matrix of the first partial derivatives, while H , the Hesse matrix of second derivatives, respectively, of the vectorvalued function F ( X) with respect to the coordinates of the vector X , both taken at the evaluation point x . A linearized nonlinear model is generated by truncating the vector-valued function F(x) to the order O [( X x)
( X x)] , namely 'y := F( X) F(x) = J (x)( X x) + O [( X x)
( X x)]. A generalized Newton iteration process for solving the nonlinear observational equations by solving a sequence of linear equations of (injectivity) defect by means of the right inverse of type G x -MINOS is the following algorithm.
1-4 Special nonlinear models
69
Newton iteration Level 0: x 0 = x 0 , 'y 0 = F( X) F(x 0 ) 'x1 = [ J (x 0 ) ]R 'y 0
Level 1:
x1 = x 0 + 'x1 , 'y1 = F (x) F (x1 ) 'x 2 = [ J (x1 ) ]R 'y1
Level i:
xi = xi 1 + 'xi , 'y i = F(x) F(xi ) 'xi +1 = [ J (xi ) ]R 'y i
Level n: 'x n +1 = 'x n (reproducing point in the computer arithmetric) I -MINOS, rk A = rk( A, y ) The planar triangle PD PE PJ is approximately an equilateral triangle pD pE pJ whose nodal points are a priori coordinated by Table 1.3. Table 1.3: Barycentric rectangular coordinates of the equilateral triangle pD pE pJ of Figure 1.12 1 1 ª ª ª xJ = 0 « xD = 2 « xE = 2 pD = « , pE = « , pJ = « «y = 1 3 1 1 «y = «y = 3 3 «¬ J 3 D E «¬ «¬ 6 6 Obviously the approximate coordinates of the three nodal points are barycentric, namely characterized by Box 1.22: Their sum as well as their product sum vanish.
Figure 1.12: Barycentric rectangular coordinates of the nodal points, namely of the equilateral triangle
70
1 The first problem of algebraic regression
Box 1.22: First and second moments of nodal points, approximate coordinates x B + x C + x H = 0, yB + y C + y H = 0 J xy = xD yD + xE yE + xJ yJ = 0 J xx = ( yD2 + yE2 + yJ2 ) = 12 , J yy = ( xD2 + xE2 + xJ2 ) = 12 : J xx = J yy ª xD + xE + xJ º ª0 º »=« » ¬« yD + yE + yJ ¼» ¬0 ¼
ªI º
[ Ii ] = « I x » = « ¬ y¼
for all i {1, 2} 2 2 2 xD yD + yE xE + xJ yJ º ª I xx I xy º ª ( yD + yE + yJ ) ª¬ I ij º¼ = « »= »=« 2 2 2 «¬ I xy I yy »¼ «¬ xD yD + yE xE + xJ yJ ( xD + xE + xJ ) »¼ ª 1 0 º =« 2 = 12 I 2 1» 0 ¬ 2¼
for all i, j {1, 2} .
Box 1.23: First and second moments of nodal points, inertia tensors 2
I1 = ¦ ei I i = e1 I1 + e 2 I 2 i =1
for all i, j {1, 2} : I i =
+f
³
f
I2 =
2
¦e
i , j =1
i
+f
dx ³ dy U ( x, y ) xi f
e j I ij = e1
e1 I11 + e1
e 2 I12 + e 2
e1 I 21 + e 2
e 2 I 22
for all i, j {1, 2} : I ij =
+f
³
f
+f
dx ³ dy U ( x, y )( xi x j r 2G ij ) f
subject to r 2 = x 2 + y 2
U ( x, y ) = G ( x, y, xD yD ) + G ( x, y, xE yE ) + G ( x, y , xJ yJ ) . The product sum of the approximate coordinates of the nodal points constitute the rectangular coordinates of the inertia tensor I=
2
¦e
i , j =1
I ij =
+f
i
e j I ij
+f
³ dx ³ dy U ( x, y)( x x i
f
f
j
r 2G ij )
1-4 Special nonlinear models
71
for all i, j {1, 2} , r 2 = x 2 + y 2 ,
U ( x, y ) = G ( x, y, xD yD ) + G ( x, y, xE yE ) + G ( x, y , xJ yJ ) . The mass density distribution U ( x, y ) directly generates the coordinates I xy , I xx , I yy of the inertia tensor in Box 1.22. ( G (., .) denotes the Dirac generalized function.). The nonlinear observational equations of distance measurements are generated by the Pythagoras representation presented in Box 1.24:
Nonlinear observational equations of distance measurements in the plane, (i) geometric notation versus (ii) algebraic notation 2 Y1 = F1 ( X) = SDE = ( X E X D ) 2 + (YE YD ) 2
Y2 = F2 ( X) = S EJ2 = ( X J X E ) 2 + (YJ YE ) 2 Y3 = F3 ( X) = SJD2 = ( X D X J ) 2 + (YD YJ ) 2 .
sB. Taylor expansion of the nonlinear distance observational equationss 2 Y c := ª¬ SDE , S EJ2 , SJD2 º¼ , Xc := ª¬ X D , YD , X E , YE , X J , YJ º¼
xc = ª¬ xD , yD , xE , yE , xJ , yJ º¼ = ª¬ 12 , 16 3, 12 , 16 3, 0, 13 3 º¼ sJacobi maps ª w F1 « « wX D «wF J (x) := « 2 « wX D « « w F3 « wX D ¬
w F1 wYD
w F1 wX E
w F1 wYE
w F1 wX J
w F2 wYD
w F2 wX E
w F2 wYE
w F2 wX J
w F3 wYD
w F3 wX E
w F3 wYE
w F3 wX J
ª 2( xE xD ) 2( y E yD ) 2( xE xD ) 2( y E yD ) « 0 0 2( xJ xE ) 2( yJ y E ) « « 2( xD xJ ) 2( yD yJ ) 0 0 ¬ 0 2 0 0 ª 2 « =«0 0 1 3 1 « 0 1 ¬ 1 3 0 Let us analyze sobserved minus computed s
w F1 º » wYJ » w F2 » » ( x) = wYJ » » w F3 » wYJ »¼ º 0 0 » 2( xJ xE ) 2( yJ y E ) » = 2( xD xJ ) 2( yD yJ ) »¼ 0º » 3» » 3¼
72
1 The first problem of algebraic regression
'y := F( X) F(x) = J (x)( X x) + O [ ( X x)
( X x) ] = = J'x + O [ ( X x)
( X x) ] ,
here specialized to Box 1.25: Linearized observational equations of distance measurements in the plane, I -MINOS, rkA = dimY sObserved minus computeds 'y := F( X) F(x) = J (x)( X x) + O [ ( X x)
( X x) ] = = J'x + O [ ( X x)
( X x) ] ,
2 2 2 ª 'sDE º ª SDE º ª1.1 1 º ª 1 º sDE 10 « 2 » « 2 » « « 1» » 2 « 'sEJ » = « S EJ sEJ » = «0.9 1» = « 10 » « 2 » « 2 2 » « » « 1» «¬ 'sJD »¼ «¬ SJD sJD »¼ ¬1.2 1 ¼ ¬ 5 ¼
2 ª 'sDE º ª aDE bDE aDE bDE « 2 » « 0 aEJ bEJ « 'sEJ » = « 0 « 2 » « a bJD 0 0 ¬« 'sJD ¼» ¬ JD
ª 'xD º « » 'yD » 0 0 º« » « 'xE » » aEJ bEJ » « 'yE » « aJD bJD »¼ « » « 'xJ » « 'y » ¬ J¼
slinearized observational equationss y = Ax, y R 3 , x R 6 , rkA = 3 0 2 0 0 ª 2 « A=«0 0 1 3 1 « 0 1 ¬ 1 3 0 ª 9 « « 3 1 «« 9 A c( AA c) 1 = 36 « 3 « « 0 « 2 3 ¬
0º » 3» » 3¼
3 º » 3 5 3 » 3 3 » » 5 3 3 » » 6 6 » 4 3 4 3 »¼ 3
sminimum norm solutions
1-4 Special nonlinear models
73
ª ª 'xD º « « » « « 'yD » « 'xE » 1 « « »= xm = « « 'yE » 36 « « « » « « 'xJ » « « 'y » ¬ J¼ ¬ 2
9 y1 + 3 y2 3 y3
º » 3 y1 + 3 y2 5 3 y3 » » 9 y1 + 3 y2 3 y3 » 3 y1 5 3 y2 + 3 y3 » » 6 y2 + 6 y3 » » 3 y1 + 4 3 y2 + 4 3 y3 ¼
1 ª º xcm = 180 ¬ 9, 5 3, 0, 4 3, +9, 3 ¼
(x + 'x)c = ª¬ xD + 'xD , yD + 'yD , xE + 'xE , yE + 'yE , xJ + 'xJ , yJ + 'yJ º¼ = 1 ª º = 180 ¬ 99, 35 3, +90, 26 3, +9, +61 3 ¼ .
The sum of the final coordinates is zero, but due to the non-symmetric displacement field ['xD , 'yD , 'xE , 'yE , 'xJ , 'yJ ]c the coordinate J xy of the inertia tensor does not vanish. These results are collected in Box 1.26. Box 1.26: First and second moments of nodal points, final coordinates yD + 'yD + yE + 'yE + yJ + 'yJ = yD + yE + yJ + 'yD + 'yE + 'yJ = 0 J xy = I xy + 'I xy = = ( xD + 'xD )( yD + 'yD ) + ( xE + 'xE )( yE + 'yE ) + ( xJ + 'xJ )( yJ + 'yJ ) = = xD yD + xE yE + xJ yJ + xD 'yD + yD 'xD + xE 'yE + yE 'xE + xJ 'yJ + yJ 'xJ + + O ('xD 'yD , 'xE 'yE , 'xJ 'yJ ) = 3 /15 J xx = I xx + 'I xx = = ( yD + 'yD ) 2 ( yE + 'yE ) 2 ('yJ yJ ) 2 = = ( yD2 + yE2 + yJ2 ) 2 yD 'yD 2 yE 'yE 2 yJ 'yJ O ('yD2 , 'yE2 , 'yJ2 ) = = 7 /12 J yy = I yy + 'I yy = = ( xD + 'xD ) 2 ( xE + 'xE ) 2 ('xJ xJ ) 2 = = ( xD2 + xE2 + xJ2 ) 2 xD 'xD 2 xE 'xE 2 xJ 'xJ O ('xD2 , 'xE2 , 'xJ2 ) = 11/ 20 .
ƅ
74 1-42
1 The first problem of algebraic regression
Linearized models with datum defect
More insight into the structure of a consistent system of observational equations with datum defect is gained in the case of a nonlinear model. Such a nonlinear model may be written Y = F ( X) subject to Y R n , X R m , or {Yi = Fi ( X j ) | i {1, ..., n}, j {1, ..., m}}. A classification of such a nonlinear function can be based upon the "soft" Implicit Function Theorem which is a substitute for the theory of algebraic partioning, namely rank partitioning. (The “soft” Implicit Function Theorem is reviewed in Appendix C.) Let us compute the matrix of first derivatives [
wFi ] R n× m , wX j
a rectangular matrix of dimension n × m. The set of n independent columns builds up the Jacobi matrix ª wF1 « wX « 1 « wF2 « A := « wX 1 «" « « wFn « wX ¬ 1
wF1 wX 2 wF2 wX 2 wFn wX 2
wF1 º wX n » » wF2 » " » wX n » , r = rk A = n, » » wFn » " wX n »¼ "
the rectangular matrix of first derivatives A := [ A1 , A 2 ] = [J, K ] subject to A R n× m , A1 = J R n× n = R r × r , A 2 = K R n× ( m n ) = R n× ( m r ) . m-rk A is called the datum defect of the consistent system of nonlinear equations Y = F ( X) which is a priori known. By means of such a rank partitioning we have decomposed the vector of unknowns Xc = [ X1c , Xc2 ] into “bounded parameters” X1 and “free parameters” X 2 subject to X1 R n = R r , and X 2 R m n = R m r . Let us apply the “soft” Implicit Function Theorem to the nonlinear observational equations of distance measurements in the plane which we already have intro-
1-4 Special nonlinear models
75
duced in the previous example. Box 1.27 outlines the nonlinear observational 2 equations for Y1 = SDE , Y2 = S EJ2 , Y3 = SJD2 . The columns c1 , c 2 , c3 of the matrix [wFi / wX j ] are linearly independent and accordingly build up the Jacobi matrix J of full rank. Let us partition the unknown vector Xc = [ X1c , Xc2 ] , namely into the "free parameters" [ X D , YD , YE ]c and the "bounded parameters" [ X E , X J , YJ ]c. Here we have made the following choice for the "free parameters": We have fixed the origin of the coordinate system by ( X D = 0, YD = 0). Obviously the point PD is this origin. The orientation of the X-axis is given by YE = 0. In consequence the "bounded parameters" are now derived by solving a quadratic equation, indeed a very simple one: Due to the datum choice we find 2 (1st) X E = ± SDE = ± Y1 2 (2 nd) X J = ± ( SDE S EJ2 + SJD2 ) /(2SDE ) = ±(Y1 Y2 + Y3 ) /(2 Y1 ) 2 2 (3rd) YJ = ± SJD2 ( SDE S EJ2 + SJD2 ) 2 /(4 SDE ) = ± Y3 (Y1 Y2 + Y3 ) 2 /(4Y1 ) .
Indeed we meet the characteristic problem of nonlinear observational equations. There are two solutions which we indicated by "± " . Only prior information can tell us what the realized one in our experiment is. Such prior information has been built into by “approximate coordinates” in the previous example, a prior information we lack now. For special reasons here we have chosen the "+" solution which is in agreement with Table 1.3. An intermediate summary of our first solution of a set of nonlinear observational equations is as following: By the choice of the datum parameters (here: choice of origin and orientation of the coordinate system) as "free parameters" we were able to compute the "bounded parameters" by solving a quadratic equation. The solution space which could be constructed in a closed form was non-unique. Uniqueness was only achieved by prior information. The closed form solution X = [ X 1 , X 2 , X 3 , X 4 , X 5 , X 6 ]c = [ X D , YD , X E , YE , X J , YJ ]c has another deficiency. X is not MINOS: It is for this reason that we apply the datum transformation ( X , Y ) 6 ( x, y ) outlined in Box 1.28 subject to & x &2 = min, namely I-MINOS. Since we have assumed distance observations, the datum transformation is described as rotation (rotation group SO(2) and a translation (translation group T(2) ) in toto with three parameters (1 rotation parameter called I and two translational parameters called t x , t y ). A pointwise transformation ( X D , YD ) 6 ( xD , yD ), ( X E , YE ) 6 ( xE , yE ) and ( X J , YJ ) 6 ( xJ , yJ ) is presented in Box 1.26. The datum parameters ( I , t x , t y ) will be determined by IMINOS, in particular by a special Procrustes algorithm contained in Box 1.28. There are various representations of the Lagrangean of type MINOS outlined in Box 1.27. For instance, we could use the representation
76
1 The first problem of algebraic regression
2 of & x &2 in terms of observations ( Y1 = SDE , Y2 = S EJ2 , Y3 = SJD2 ) which transforms 2 2 (i) & x & into (ii) & x & (Y1 , Y2 , Y3 ) . Finally (iii) & x &2 is equivalent to minimizing the product sums of Cartesian coordinates.
Box 1.27: nonlinear observational equations of distance measurements in the plane (i) geometric notation versus (ii) algebraic notation "geometric notation" 2 SDE = ( X E X D ) 2 + (YE YD ) 2
S EJ2 = ( X J X E ) 2 + (YJ YE ) 2 SJD2 = ( X D X J ) 2 + (YD YJ ) 2 "algebraic notation" 2 Y1 = F1 ( X) = SDE = ( X E X D ) 2 + (YE YD ) 2
Y2 = F2 ( X) = S EJ2 = ( X J X E ) 2 + (YJ YE ) 2 Y3 = F3 ( X) = SJD2 = ( X D X J ) 2 + (YD YJ ) 2 2 Y c := [Y1 , Y2 , Y3 ] = [ SDE , S EJ2 , SJD2 ]
Xc := [ X 1 , X 2 , X 3 , X 4 , X 5 , X 6 ] = [ X D , YD , X E , YE , X J , YJ ] "Jacobi matrix" [
wFi ]= wX j
0 0 ª ( X 3 X 1 ) ( X 4 X 2 ) ( X 3 X 1 ) ( X 4 X 2 ) º « =2 0 0 ( X 5 X 3 ) ( X 6 X 4 ) ( X 5 X 3 ) ( X 6 X 4 ) » « » «¬ ( X 1 X 5 ) ( X 2 X 6 ) 0 0 ( X 1 X 5 ) ( X 2 X 6 ) ¼» wF wF rk[ i ] = 3, dim[ i ] = 3 × 6 wX j wX j ª ( X 3 X 1 ) ( X 4 X 2 ) ( X 3 X 1 ) º J = «« 0 0 ( X 5 X 3 ) »» , rk J = 3 «¬ ( X 1 X 5 ) »¼ (X2 X6) 0 0 0 ª (X4 X2) º K = «« ( X 6 X 4 ) ( X 5 X 3 ) ( X 6 X 4 ) »» . «¬ 0 ( X 1 X 5 ) ( X 2 X 6 ) »¼
1-4 Special nonlinear models
77
"free parameters"
"bounded parameters"
X1 = X D = 0
X 3 = X E = + SDE
X 2 = YD = 0
2 X 5 = X J = + SDE S EJ2 + SJD2 = + Y32 Y22 + Y12
X 4 = YE = 0
2 X 6 = YJ = + S EJ2 SDE = + Y22 Y12
( )
( )
()
( )
( )
Box 1.28: Datum transformation of Cartesian coordinates ª xº ª X º ªtx º « y » = R « Y » + «t » ¬ ¼ ¬ ¼ ¬ y¼ R SO(2):={R R 2× 2 | R cR = I 2 and det R = +1} Reference: Facts :(representation of a 2×2 orthonormal matrix) of Appendix A: ª cos I R=« ¬ sin I
sin I º cos I »¼
xD = X D cos I + YD sin I t x yD = X D sin I + YD cos I t y xE = X E cos I + YE sin I t x yE = X E sin I + YE cos I t y xJ = X J cos I + YJ sin I t x yJ = X J sin I + YJ cos I t y . Box 1.29: Various forms of MINOS (i ) & x &2 = xD2 + yD2 + xE2 + yE2 + xJ2 + yJ2 = min I ,tx ,t y
2 (ii ) & x &2 = 12 ( SDE + S EJ2 + SJD2 ) + xD xE + xE xJ + xJ xD + yD yE + yE yJ + yJ yD = min
I ,tx ,t y
(iii ) & x &2 = min
ª xD xE + xE xJ + xJ xD = min « y y + y y + y y = min . E J J D ¬ D E
The representation of the objective function of type MINOS in term of the obser2 vations Y1 = SDE , Y2 = S EJ2 , Y3 = SJD2 can be proven as follows:
78
1 The first problem of algebraic regression
Proof: 2 SDE = ( xE xD ) 2 + ( yE yD ) 2
= xD2 + yD2 + xE2 + yE2 2( xD xE + yD yE )
1 2
2 SDE + xD xE + yD yE = 12 ( xD2 + yD2 + xE2 + yE2 )
& x &2 = xD2 + yD2 + xE2 + yE2 + xJ2 + yJ2 = 2 = 12 ( SDE + S EJ2 + SJD2 ) + xD xE + xE xJ + xJ xD + yD yE + yE yJ + yJ yD
& x &2 = 12 (Y1 + Y2 + Y3 ) + xD xE + xE xJ + xJ xD + yD yE + yE yJ + yJ yD .
Figure1.13: Commutative diagram (P-diagram) P0 : centre of polyhedron (triangle PD PE PJ ) action of the translation group
ƅ
Figure1.14:
Commutative diagram (E-diagram) P0 : centre of polyhedron (triangle PD PE PJ orthonormal 2-legs {E1 , E1 | P0 } and {e1 , e1 | P0 } ) at P0 action of the translation group
As soon as we substitute the datum transformation of Box 1.28 which we illustrated by Figure 1.9 and Figure 1.10 into the Lagrangean L (t x , t y , I ) of type MINOS ( & x &2 = min ) we arrive at the quadratic objective function of Box 1.30. In the first forward step of the special Procrustes algorithm we obtain the minimal solution for the translation parameters (tˆx , tˆy ) . The second forward step of the special Procrustes algorithm is built on (i) the substitution of (tˆx , tˆy ) in the original Lagrangean which leads to the reduced Lagrangean of Box 1.29 and (ii) the minimization of the reduced Lagrangean L (I ) with respect to the rotation parameter I . In an intermediate phase we introduce "centralized coordinates" ('X , 'Y ) , namely coordinate differences with respect to the centre Po = ( X o , Yo ) of the polyhedron, namely the triangle PD , PE , PJ . In this way we are able to generate the simple (standard form) tan 2I of the solution I the argument of L1 = L1 (I ) = min or L2 = L2 (I ) .
1-4 Special nonlinear models
79
Box 1.30: Minimum norm solution, special Procrustes algorithm, 1st forward step & x &2 := := xD2 + yD2 + xE2 + yE2 + xJ2 + yJ2 = min
tx ,t y ,I
"Lagrangean "
L (t x , t y , I ) := := ( X D cos I + YD sin I t x ) 2 + ( X D sin I + YD cos I t y ) 2 + ( X E cos I + YE sin I t x ) 2 + ( X E sin I + YE cos I t y ) 2 + ( X J cos I + YJ sin I t x ) 2 + ( X J sin I + YJ cos I t y ) 2 1st forward step 1 wL (t x ) = ( X D + X E + X J ) cos I + (YD + YE + YJ ) sin I 3t x = 0 2 wt x 1 wL (t y ) = ( X D + X E + X J ) sin I + (YD + YE + YJ ) cos I 3t y = 0 2 wt y t x = + 13 {( X D + X E + X J ) cos I + (YD + YE + YJ ) sin I} t y = + 13 {( X D + X E + X J ) sin I + (YD + YE + YJ ) cos I} (t x , t y ) = arg{L (t x , t y , I ) = min} . Box 1.31: Minimum norm solution, special Procrustes algorithm, 2nd forward step "solution t x , t y in Lagrangean: reduced Lagrangean"
L (I ) := := { X D cos I + YD sin I [( X D + X E + X J ) cos I + (YD + YE + YJ ) sin I ]}2 + 1 3
+ { X E cos I + YE sin I 13 [( X D + X E + X J ) cos I + (YD + YE + YJ ) sin I ]}2 + + { X J cos I + YJ sin I 13 [( X D + X E + X J ) cos I + (YD + YE + YJ ) sin I ]}2 + + { X D sin I + YD cos I 13 [( X D + X E + X J ) sin I + (YD + YE + YJ ) cos I ]}2 + + { X E sin I + YE cos I 13 [( X D + X E + X J ) sin I + (YD + YE + YJ ) cos I ]}2 + + { X J sin I + YJ cos I 13 [( X D + X E + X J ) sin I + (YD + YE + YJ ) cos I ]}2 = min I
80
1 The first problem of algebraic regression
L (I ) = = {[ X D ( X D + X E + X J )]cos I + [YD 13 (YD + YE + YJ )]sin I }2 + 1 3
+ {[ X E 13 ( X D + X E + X J )]cos I + [YE 13 (YD + YE + YJ )]sin I }2 + + {[ X J 13 ( X D + X E + X J )]cos I + [YJ 13 (YD + YE + YJ )]sin I }2 + + {[ X D 13 ( X D + X E + X J )]sin I + [YD 13 (YD + YE + YJ )]cos I }2 + + {[ X E 13 ( X D + X E + X J )]sin I + [YE 13 (YD + YE + YJ )]cos I }2 + + {[ X J 13 ( X D + X E + X J )]sin I + [YJ 13 (YD + YE + YJ )]cos I }2 "centralized coordinate" 'X := X D 13 ( X D + X E + X J ) = 13 (2 X D X E X J ) 'Y := YD 13 (YD + YE + YJ ) = 13 (2YD YE YJ ) "reduced Lagrangean"
L1 (I ) = ('X D cos I + 'YD sin I ) 2 + + ('X E cos I + 'YE sin I ) 2 + + ('X J cos I + 'YJ sin I ) 2
L2 (I ) = ('X D sin I + 'YD cos I ) 2 + + ('X E sin I + 'YE cos I ) 2 + + ('X J sin I + 'YJ cos I ) 2 1 wL (I ) = 0 2 wI ('X D cos I + 'YD sin I )('X D sin I + 'YD cos I ) + + ('X E cos I + 'YE sin I ) 2 ('X E sin I + 'YE cos I ) + + ('X J cos I + 'YJ sin I ) 2 ('X J sin I + 'YJ cos I ) = 0 ('X D2 + 'X E2 + 'X J2 ) sin I cos I + + ('X D 'YD + 'X E 'YE + 'X J 'YJ ) cos 2 I ('X D 'YD + 'X E 'YE + 'X J 'YJ ) sin 2 I + ('YD2 + 'YE2 + 'YJ2 ) sin I cos I = 0 [('X D2 + 'X E2 + 'X J2 ) ('YD2 + 'YE2 + 'YJ2 )]sin 2I = = 2['X D 'YD + 'X E 'YE + 'X J 'Y ]cos 2I
1-4 Special nonlinear models tan 2I = 2
81 'X D 'YD + 'X E 'YE + 'X J 'Y
('X + 'X E2 + 'X J2 ) ('YD2 + 'YE2 + 'YJ2 ) 2 D
"Orientation parameter in terms of Gauss brackets" tan 2I =
2['X'Y] ['X 2 ] ['Y 2 ]
I = arg{L1 (I ) = min} = arg{L2 (I ) = min}. The special Procrustes algorithm is completed by the backforward steps outlined in Box 1.32: At first we convert tan 2I to (cos I ,sin I ) . Secondly we substitute (cos I ,sin I ) into the translation formula (t x , t y ) . Thirdly we substitute (t x , t y , cos I ,sin I ) into the Lagrangean L (t x , t x , I ) , thus generating the optimal objective function & x &2 at (t x , t y , I ) . Finally as step four we succeed to compute the centric coordinates ª xD xE xJ º «y » ¬ D yE yJ ¼ with respect to the orthonormal 2-leg {e1 , e 2 | Po } at Po from the given coordinates ª XD X E XJ º «Y » ¬ D YE YJ ¼ with respect to the orthonormal 2-leg {E1 , E2 | o} at o , and the optimal datum parameters t x , t y , cos I ,sin I . Box 1.32: Special Procrustes algorithm backward steps step one tan 2I =
2['X'Y] ['X 2 ] ['Y 2 ]
ª cos I « ¬ sin I step two
t x = 13 ([ X]cos I + [Y]sin I ) t y = 13 ([ X]sin I + [Y]cos I ) step three
82
1 The first problem of algebraic regression
& x &2 = L (t x , t y , I ) step four ª xD «y ¬ D
xE yE
xJ º ª cos I = yJ »¼ «¬ sin I
sin I º ª X D »« cos I ¼ ¬ YD
XE YE
X J º ªt x º « » 1c . YJ »¼ «¬t y »¼
We leave the proof for [x] = xD + xE + xJ = 0, [y ] = yD + yE + yJ = 0, [xy ] = xD yD + xE yE + xJ yJ z 0 to the reader as an exercise. A numerical example is SDE = 1.1, S EJ = 0.9, SJD = 1.2, Y1 = 1.21, Y2 = 0.81, Y3 = 1.44, X D = 0, X E = 1.10, X J = 0.84, YD = 0 , YE = 0
, YJ = 0.86,
'X D = 0.647, 'X E = +0.453, 'X J = +0.193, 'YD = 0.287 , 'YE = 0.287, 'YJ = +0.573, test: ['X] = 0, ['Y] = 0, ['X'Y] = 0.166, ['X 2 ] = 0.661, ['Y 2 ] = 0.493, tan 2I = 1.979, I = 31D.598,828, 457,
I = 31D 35c 55cc.782, cos I = 0.851, 738, sin I = 0.523,968, t x = 0.701, t y = 0.095, ª xD = 0.701, xE = +0.236, xJ = +0.465, « ¬ yD = +0.095, yE = 0.481, yJ = +0.387, test: [x] = xD + xE + xJ = 0, [y ] = yD + yE + yJ = 0, [xy ] = +0.019 z 0 . ƅ
1-5 Notes What is the origin of the rank deficiency three of the linearized observational equations, namely the three distance functions observed in a planar triangular network we presented in paragraph three?
1-5 Notes
83
In geometric terms the a priori indeterminancy of relating observed distances to absolute coordinates placing points in the plane can be interpreted easily: The observational equation of distances in the plane P 2 is invariant with respect to a translation and a rotation of the coordinate system. The structure group of the twodimensional Euclidean space E 2 is the group of motion decomposed into the translation group (two parameters) and the rotation group (one parameter). Under the action of the group of motion (three parameters) Euclidean distance functions are left equivariant. The three parameters of the group of motion cannot be determined from distance measurements: They produce a rank deficiency of three in the linearized observational equations. A detailed analysis of the relation between the transformation groups and the observational equations has been presented by E. Grafarend and B. Schaffrin (1974, 1976). More generally the structure group of a threedimensional Weitzenboeck space W 3 is the conformal group C7 (3) which is decomposed into the translation group T3 (3 parameters), the special orthogonal group SO(3) (3 parameters) and the dilatation group ("scale", 1 parameter). Under the action of the conformal group C7 (3) – in total 7 parameters – distance ratios and angles are left equivariant. The conformal group C7 (3) generates a transformation of Cartesian coordinates covering R 3 which is called similarity transformation or datum transformation. Any choice of an origin of the coordinate system, of the axes orientation and of the scale constitutes an S-base following W. Baarda (1962,1967,1973,1979,1995), J. Bossler (1973), M. Berger (1994), A. Dermanis (1998), A. Dermanis and E. Grafarend (1993), A. Fotiou and D. Rossikopoulis (1993), E. Grafarend (1973,1979,1983), E. Grafarend, E. H. Knickmeyer and B. Schaffrin (1982), E. Grafarend and G. Kampmann (1996), G. Heindl (1986), M. Molenaar (1981), H. Quee (1983), P. J. G. Teunissen (1960, 1985) and H. Wolf (1990). In projective networks (image processing, photogrammetry, robot vision) the projective group is active. The projective group generates a perspective transformation which is outlined in E. Grafarend and J. Shan (1997). Under the action of the projective group cross-ratios of areal elements in the projective plane are left equivariant. For more details let us refer to M. Berger (1994), M. H. Brill and E. B. Barrett (1983), R. O. Duda and P.E. Heart (1973), E. Grafarend and J. Shan (1997), F. Gronwald and F. W. Hehl (1996), M. R. Haralick (1980), R. J. Holt and A. N. Netrawalli (1995), R. L. Mohr, L. Morin and E. Grosso (1992), J. L. Mundy and A. Zisserman (1992a, b), R. F. Riesenfeldt (1981), J. A. Schonten (1954). In electromagnetism (Maxwell equations) the conformal group C16 (3,1) is active. The conformal group C16 (3,1) generates a transformation of "space-time" by means of 16 parameters (6 rotational parameters – three for rotation, three for "hyperbolic rotation", 4 translational parameters, 5 "involutary" parameters, 1 dilatation – scale – parameter) which leaves the Maxwell equations in vacuum as
84
1 The first problem of algebraic regression
well as pseudo – distance ratios and angles equivariant. Sample references are A. O. Barut (1972), H. Bateman (1910), F. Bayen (1976), J. Beckers, J. Harnard, M, Perroud and M. Winternitz (1976), D. G. Boulware, L. S. Brown, R. D. Peccei (1970), P. Carruthers (1971), E. Cunningham (1910), T. tom Dieck (1967), N. Euler and W. H. Steeb (1992), P. G. O. Freund (1974), T. Fulton, F. Rohrlich and L. Witten (1962), J. Haantjes (1937), H. A. Kastrup (1962,1966), R. Kotecky and J. Niederle (1975), K. H. Marivalla (1971), D. H. Mayer (1975), J. A. Schouten and J. Haantjes (1936), D. E. Soper (1976) and J. Wess (1990). Box 1.33: Observables and transformation groups observed quantities
transformation group
datum parameters
coordinate differences in R 2 coordinate differences in R 3 coordinate differences in R n Distances in R 2 Distances in R 3 Distances in R n angles, distance ratios in R 2 angles, distance ratios in R 3 angles, distance ratios in R n
translation group T(2) translation group T(3) translation group T( n ) group of motion T(2) , SO(2) group of motion T(3) , SO(3) group of motion T(n) , SO(n) conformal group C 4 (2) conformal group C7 (3) conformal group C( n +1)( n + 2) / 2 (n)
2
cross-ratios of area elements in the projective plane
projective group
3 n 3 3+3=6 n+(n+1)/2 4 7 (n+1)(n+2)/2 8
Box 1.33 contains a list of observables in R n , equipped with a metric, and their corresponding transformation groups. The number of the datum parameters coincides with the injectivity rank deficiency in a consistent system of linear (linearized) observational equations Ax = y subject to A R n× m , rk A = n < m, d ( A) = m rk A .
2
The first problem of probabilistic regression – special Gauss-Markov model with datum defect – Setup of the linear uniformly minimum bias estimator of type LUMBE for fixed effects.
In the first chapter we have solved a special algebraic regression problem, namely the inversion of a system of consistent linear equations classified as “underdetermined”. By means of the postulate of a minimum norm solution || x ||2 = min we were able to determine m unknowns ( m > n , say m = 106 ) from n observations (more unknowns m than equations n, say n = 10 ). Indeed such a mathematical solution may surprise the analyst: In the example “MINOS” produces one million unknowns from ten observations. Though “MINOS” generates a rigorous solution, we are left with some doubts. How can we interpret such an “unbelievable solution”? The key for an evaluation of “MINOS” is handed over to us by treating the special algebraic regression problem by means of a special probabilistic regression problem, namely as a special Gauss-Markov model with datum defect. The bias generated by any solution of an underdetermined or ill-posed problem will be introduced as a decisive criterion for evaluating “MINOS”, now in the context of a probabilistic regression problem. In particular, a special form of “LUMBE”, the linear uniformly minimum bias estimator || LA - I ||2 = min , leads us to a solution which is equivalent to “MINOS”. Alternatively we may say that in the various classes of solving an underdetermined problem “LUMBE” generates a solution of minimal bias. ? What is a probabilistic regression problem? By means of a certain statistical objective function, here of type “minimum bias”, we solve the inverse problem of linear and nonlinear equations with “fixed effects” which relates stochastic observations to parameters. According to the Measurement Axiom observations are elements of a probability space. In terms of second order statistics the observation space Y of integer dimension, dim Y = n , is characterized by the first moment E {y} , the expectation of y Y , and the central second moment D {y} , the dispersion matrix or variancecovariance matrix Ȉ y . In the case of “fixed effects” we consider the parameter space X of integer dimension, dim X = m , to be metrical. Its metric is induced from the probabilistic measure of the metric, the variance-covariance matrix Ȉ y of the observations y Y . In particular, its variance-covariance-matrix is pulled-back from the variance-covariance-matrix Ȉ y . In the special probabilistic regression model “fixed effects” ȟ Ȅ (elements of the parameter space) are estimated. Fast track reading: Consult Box 2.2 and read only Theorem 2.3
86
2 The first problem of probabilistic regression
Please pay attention to the guideline of Chapter two, say definitions, lemma and theorems.
Lemma 2.2 ˆȟ hom S -LUMBE of ȟ Definition 2.1 ˆȟ hom S -LUMBE of ȟ
Theorem 2.3 ˆȟ hom S -LUMBE of ȟ Theorem 2.4 equivalence of G x -MINOS and S -LUMBE
“The guideline of chapter two: definition, lemma and theorems”
2-1 Setup of the linear uniformly minimum bias estimator of type LUMBE Let us introduce the special consistent linear Gauss-Markov model specified in Box 2.1, which is given for the first order moments again in the form of a consistent system of linear equations relating the first non-stochastic (“fixed”), realvalued vector ȟ of unknowns to the expectation E{y} of the stochastic, realvalued vector y of observations, Aȟ = E{y}. Here, the rank of the matrix A , rkA equals the number n of observations, y \ n . In addition, the second order central moments, the regular variance-covariance matrix Ȉ y of the observations, also called dispersion matrix D{y} , constitute the second matrix Ȉ y \ n×n as unknowns to be specified as a linear model further on, but postponed to the fourth chapter. Box 2.1: Special consistent linear Gauss-Markov model {y = Aȟ | A \ n× m , rk A = n, n < m} 1st moments Aȟ = E{y}
(2.1)
2nd moments Ȉ y = D{y} \
n× n
, Ȉ y positive definite, rk Ȉ y = n
ȟ unknown Ȉ y unknown or known from prior information.
(2.2)
2-1 Setup of the linear uniformly minimum bias estimator
87
Since we deal with a linear model, it is “a natural choice” to setup a homogeneous linear form to estimate the parameters ȟ of fixed effects, at first, namely ȟˆ = Ly ,
(2.3)
where L \ m × n is a matrix-valued fixed unknown. In order to determine the real-valued m × n matrix L , the homogeneous linear estimation ȟˆ of the vector ȟ of foxed effects has to fulfil a certain optimality condition we shall outline. Second, we are trying to analyze the bias in solving an underdetermined system of linear equations. Take reference to Box 2.2 where we systematically introduce (i) the bias vector ȕ , (ii) the bias matrix, (iii) the S -modified bias matrix norm as a weighted Frobenius norm. In detail, let us discuss the bias terminology: For a homogeneous linear estimation ȟˆ = Ly the vector-valued bias ȕ := E{ȟˆ ȟ} = E{ȟˆ} ȟ takes over the special form ȕ := E{ȟˆ} ȟ = [I LA] ȟ ,
(2.4)
which led us to the definition of the bias matrix ( I - LA )c . The norm of the bias vector ȕ , namely || ȕ ||2 := ȕcȕ , coincides with the ȟȟ c weighted Frobenius norm 2 of the bias matrix B , namely || B ||ȟȟ c . Here, we meet the central problem that the c c weight matrix ȟȟ , rk ȟȟ = 1, has rank one. In addition, ȟȟ c is not accessible since ȟ is unknown. In this problematic case we replace the matrix ȟȟ c by a fixed positive-definite m × m matrix S , rk S = m , C. R. Rao’s substitute matrix and define the S -weighted Frobenius matrix norm || B ||S2 := trBcSB = tr(I m LA)S(I m LA )c .
(2.5)
Indeed, the substitute matrix S constitutes the matrix of the metric of the bias space. Box 2.2: Bias vector, bias matrix Vector and matrix bias norms Special consistent linear Gauss-Markov model of fixed effects A \ n×m , rk A = n, n < m E{y} = Aȟ, D{y} = Ȉ y “ansatz” ȟˆ = Ly
(2.6)
“bias vector” ȕ := E{ȟˆ ȟ} = E{ȟˆ} ȟ z 0 ȟ \ m
(2.7)
ȕ = LE{y} ȟ = [I m LA]ȟ = 0 ȟ \ m
(2.8)
88
2 The first problem of probabilistic regression
“bias matrix” Bc = I m LA
(2.9)
“bias norms” || ȕ ||2 = ȕcȕ = ȟ c[I m LA ]c [I m LA ]ȟ
(2.10)
2 || ȕ ||2 = tr ȕȕc = tr[I m LA]ȟȟ c[I m LA]c = || B ||[[ c
(2.11)
|| ȕ ||S2 := tr[I m LA]S[I m LA]c =:|| B ||S2 .
(2.12)
Being prepared for optimality criteria we give a precise definition of ȟˆ of type hom S-LUMBE. Definition 2.1 ( ȟˆ hom S -LUMBE of ȟ ): An m × 1 vector ȟˆ is called hom S-LUMBE (homogeneous Linear Uniformly Minimum Bias Estimation) of ȟ in the special consistent linear Gauss-Markov model of fixed effects of Box 2.1, if (1st)
ȟˆ is a homogeneous linear form ȟˆ = Ly ,
(2nd)
(2.13)
in comparison to all other linear estimation ȟˆ has the minimum bias in the sense of || B ||S2 := || (I m LA)c ||S2 .
(2.14)
The estimations ȟˆ of type hom S-LUMBE can be characterized by Lemma 2.2 ( ȟˆ hom S -LUMBE of ȟ ): An m × 1 vector ȟˆ is hom S-LUMBE of ȟ in the special consistent linear Gauss-Markov model with fixed effects of Box 2.1, if and only if the matrix Lˆ fulfils the normal equations ASA cLˆ c = AS .
(2.15)
: Proof : The S -weighted Frobenius norm || ( I m LA )c ||S2 establishes the Lagrangean
L (L) := tr ( I m LA ) S ( I m LA )c = min L
(2.16)
2-1 Setup of the linear uniformly minimum bias estimator
89
for S -LUMBE. The necessary conditions for the minimum of the quadratic Lagrangean L (L) are wL ˆ c L := 2 ª¬ ASA cLˆ c AS º¼ = 0. wL
( )
(2.17)
The theory of matrix derivatives is reviewed in Appendix B “Facts: derivative of a scalar-valued function of a matrix: trace”. The second derivatives w2L Lˆ > 0 w (vec L)w (vec L)c
( )
(2.18)
at the “point” Lˆ constitute the sufficiency conditions. In order to compute such a mn × mn matrix of second derivatives we have to vectorize the matrix normal equation by wL ˆ ˆ c SAcº L = 2 ª¬LASA ¼ wL
(2.19)
wL ˆ c SAcº Lˆ = vec 2 ª¬LASA ¼ w (vec L)
(2.20)
wL Lˆ = 2 [ ASAc
I m ] vec Lˆ 2 vec ( SAc ) . w (vec L)
(2.21)
( )
( )
( )
The Kronecker-Zehfuß poduct A
B of two arbitrary matrices as well as ( A + B )
C = A
C + B
C of three arbitrary matrices subject to the dimension condition dim A = dim B is introduced in Appendix A, “Definition of Matrix Algebra: multiplication of matrices of the same dimension (internal relation) and Laws”. The vec operation (vectorization of an array) is reviewed in Appendix A as well, namely “Definition, Facts: vec AB = (Bc
I Ac ) vec A for suitable matrices A and B ”. No we are prepared to compute w2L ( Lc ) = 2( ASAc)
Im > 0 w (vec L)w (vec L)c
(2.22)
as a positive-definite matrix. The useful theory of matrix derivatives which applies here is reviewed in Appendix B, “Facts: derivative of a matrix-valued function of a matrix namely w (vec X) / w (vec X)c ”. The normal equations of hom S-LUMBE wL / wL(Lˆ ) = 0 agree to (2.15).
ƅ For an explicit representation of ȟˆ as hom LUMBE in the special consistent linear Gauss-Markov model of fixed effects of Box 2.1, we solve the normal equations (2.15) for Lˆ = arg{L (L) = min} . L
90
2 The first problem of probabilistic regression
Beside the explicit representation of ȟˆ of type hom LUMBE we compute the related dispersion matrix D{ȟˆ} in Theorem 2.3 ( ȟˆ hom LUMBE of ȟ ): Let ȟˆ = Ly be hom LUMBE in the special consistent linear Gauss-Markov model of fixed effects of Box 2.1. Then the solution of the normal equation is ȟˆ = SA c( ASA c) 1 y
(2.23)
completed by the dispersion matrix D{ȟˆ} = SAc( ASAc) 1 Ȉ y ( ASAc) AS
(2.24)
and by the bias vector ȕ := E{ȟˆ} ȟ = = ª¬I m SAc( ASAc) 1 A º¼ ȟ
(2.25)
for all ȟ \ m . The proof of Theorem 2.3 is straight forward. At this point we have to comment what Theorem 2.3 is actually telling us. hom LUMBE has generated the estimation ȟˆ of type (2.23), the dispersion matrix D{ȟˆ} of type (2.24) and the bias vector of type (2.25) which all depend on C. R. Rao’s substitute matrix S , rk S = m . Indeed we can associate any element of the solution vector, the dispersion matrix as well as the bias vector with a particular weight which can be “designed” by the analyst.
2-2 The Equivalence Theorem of Gx -MINOS and S -LUMBE We have included the second chapter on hom S -LUMBE in order to interpret G x -MINOS of the first chapter. The key question is open: ? When are hom S -LUMBE and G x -MINOS equivalent ? The answer will be given by Theorem 2.4 (equivalence of G x -MINOS and S -LUMBE) With respect to the special consistent linear Gauss-Markov model (2.1), (2.2) ȟˆ = Ly is hom S -LUMBE for a positive-definite matrix S , if ȟ m = Ly is G x -MINOS of the underdetermined system of linear equations (1.1) for G x = S -1 G -1x = S .
(2.26)
91
2-2 The Equivalence Theorem of G x -MINOS and S-LUMBE
The proof is straight forward if we directly compare the solution (1.14) of G x MINOS and (2.23) of hom S -LUMBE. Obviously the inverse matrix of the metric of the parameter space X is equivalent to the matrix of the metric of the bias space B . Or conversely, the inverse matrix of the metric of the bias space B determines the matrix of the metric of the parameter space X . In particular, the bias vector ȕ of type (2.25) depends on the vector ȟ which is inaccessible. The situation is similar to the one in hypothesis testing. We can produce only an estimation ȕˆ of the bias vector ȕ if we identify ȟ by the hypothesis ȟ 0 = ȟˆ . A similar argument applies to the second central moment D{y} Ȉ y of the “random effect” y , the observation vector. Such a dispersion matrix D{y} Ȉ y has to be known a priori in order to be able to compute the dispersion matrix D{ȟˆ} Ȉȟˆ . Again we have to apply the argument that we are only able to conˆ and to setup a hypothesis about Ȉ . struct an estimate Ȉ y y
2-3 Examples Due to the Equivalence Theorem G x -MINOS ~ S -LUMBE the only new items of the first problem of probabilistic regression are the dispersion matrix D{ȟˆ | hom LUMBE} and the bias matrix B{ȟˆ | hom LUMBE} . Accordingly the first example outlines the simple model of the variance-covariance matrix D{ȟˆ} =: Ȉȟˆ and its associated Frobenius matrix bias norm || B ||2 . New territory is taken if we compute the variance-covariance matrix D{ȟˆ * } =: Ȉȟˆ and its related Frobenius matrix bias norm || B* ||2 for the canonical unknown vector ȟ* of star coordinates [ȟˆ1* ,..., ȟˆ *m ]c , lateron rank partitioned. *
Example 2.1 (simple variance-covariance matrix D{ȟˆ | hom LUMBE} , Frobenius norm of the bias matrix || B(hom LUMBE) || ): The dispersion matrix Ȉ := D{ȟˆ} of ȟˆ (hom LUMBE) is called ȟˆ
Simple, if S = I m and Ȉ y := D{y} = I n ı y2 . Such a model is abbreviated “i.i.d.”
and
“u.s.”
or
or
independent identically distributed observations (one variance component)
unity substituted (unity substitute matrix).
Such a simple dispersion matrix is represented by Ȉȟˆ = A c( AA c) 2 Aı 2y .
(2.27)
The Frobenius norm of the bias matrix for such a simple invironment is derived by
92
2 The first problem of probabilistic regression
|| B ||2 = tr[I m A c( AA c) 1 A]
(2.28)
|| B ||2 = d = m n = m rk A,
(2.29)
since I m A c(AA c)1 A and A c( AAc) 1 A are idempotent. According to Appendix A, notice the fact “ tr A = rk A if A is idempotent”. Indeed the Frobenius norm of the u.s. bias matrix B ( hom LUMBE ) equalizes the square root m n = d of the right complementary index of the matrix A . Table 2.1 summarizes those data of the front page examples of the first chapter relating to D{ȟˆ | hom LUMBE}
and
|| B(hom BLUMBE) || .
Table 2.1: Simple variance-covariance matrix (i.i.d. and u.s.) Frobenius norm of the simple bias matrix Front page example 1.1 A \ 2×3 , n = 2, m = 3 1 ª 21 7 º ª1 1 1 º ª3 7 º A := « , AAc = « , ( AAc) 1 = « » » 14 ¬ 7 3 »¼ ¬1 2 4 ¼ ¬7 21¼ rk A = 2 A c( AA c) 1 =
( AA c) 2 =
ª14 4 º ª10 6 2º 1 « 1 7 1» , A c( AA c) 1 A = « 6 5 3 » 14 « 7 5 » 14 « 2 3 13 » ¬ ¼ ¬ ¼
ª106 51 59 º 1 ª 245 84 º 1 « 2 c c , A ( AA ) A = 51 25 27 » 98 ¬« 84 29 ¼» 98 « 59 27 37 » ¬ ¼
ª106 51 59 º 1 « Ȉȟˆ = A c( AA c) AV = 51 25 27 » V y2 98 « 59 27 37 » ¬ ¼ 2
2 y
|| B ||2 = tr ¬ªI m A c( AA c) 1 A º¼ = tr I 3 tr A c( AA c) 1 A || B ||2 = 3 141 (10 + 5 + 13) = 3 2 = 1 = d || B || = 1 = d .
93
2-3 Examples
Example 2.2 (canonically simple variance-covariance matrix D{ȟˆ * | hom LUMBE} , Frobenius norm of the canonical bias matrix || B* (hom LUMBE) || ): The dispersion matrix Ȉȟˆ := D{ȟˆ * } of the rank partitioned vector of canonical coordinates ȟˆ * = 9 cS ȟˆ of type hom LUMBE is called *
1 2
canonically simple, if S = I m and Ȉ y := D{y * } = I nV y2 . In short, we denote such a model by *
*
“i.i.d.”
and
“u.s.”
or
Or
independent identically distributed observations (one variance component)
unity substituted (unity substitute matrix).
Such a canonically simple dispersion matrix is represented by ° ª ȟ* º ½° ª ȁ-2Vy2 D{ȟˆ* } = D ® « *1 » ¾ = « °¯ ¬«ȟ2 ¼» °¿ ¬« 0
*
0º » 0 ¼»
(2.30)
or ª1 1 º var ȟˆ 1* = Diag « 2 ,..., 2 » V y2 , ȟ1* \ r ×1 , Or ¼ ¬ O1 *
(
)
cov ȟˆ 1* , ȟˆ *2 = 0, var ȟˆ *2 = 0, ȟ*2 \ ( m r )×1 . If the right complementary index d := m rk A = m n is interpreted as a datum defect, we may say that the variances of the “free parameters” ȟˆ *2 \ d are zero. Let us specialize the canonical bias vector ȕ* as well as the canonical bias matrix B* which relates to ȟˆ * = L* y * of type “canonical hom LUMBE” as follows. Box 2.3: Canonical bias vector, canonical bias matrix “ansatz” ȟˆ * = L* y * E{y * } = A*ȟ* , D{y * } = Ȉ y “bias vector” ȕ := E{ȟ* } ȟ * ȟ * \ m *
*
94
2 The first problem of probabilistic regression
ȕ* = L* E{y * } ȟ* ȟ* \ m ȕ* = (I m L* A* )ȟ* ȟ* \ m ª ȕ* º ªI ȕ* (hom LUMBE) = « 1* » = ( « r ¬0 ¬ȕ 2 ¼
0 º ª ȁ 1 º ª ȟ1* º ȁ , 0 ) ] « *» « »[ I d »¼ ¬ 0 ¼ ¬ȟ 2 ¼
(2.31)
for all ȟ*1 \ r , ȟ*2 \ d ª ȕ* º ª0 0 º ª ȟ*1 º ȕ* (hom LUMBE) = « *1 » = « »« *» ¬ 0 I d ¼ ¬ȟ 2 ¼ ¬ȕ 2 ¼
(2.32)
ª ȕ* º ª0º ȕ* (hom LUMBE) = « *1 » = « * » ȟ *2 \ d ¬ȟ 2 ¼ ¬ȕ 2 ¼
(2.33)
“bias matrix” (B* )c = I m L* A* ªI ª¬B* (hom LUMBE) º¼c = « r ¬0
0 º ª ȁ 1 º « » [ ȁ, 0 ] I d »¼ ¬ 0 ¼
ª0 0 º ª¬B* (hom LUMBE) º¼c = « » ¬0 I d ¼
(2.34)
(2.35)
“Frobenius norm of the canonical bias matrix” ª0 0 º || B* (hom LUMBE) ||2 = tr « » ¬0 I d ¼
(2.36)
|| B* (hom LUMBE) || = d = m n .
(2.37)
d = = m n = m rk A of Box 2.3 agrees to the value of the Frobenius norm of the ordinary bias matrix.
It is no surprise that the Frobenius norm of the canonical bias matrix
3
The second problem of algebraic regression – inconsistent system of linear observational equations – overdetermined system of linear equations: {Ax + i = y | A \ n×m , y R ( A ) rk A = m, m = dim X} :Fast track reading: Read only Lemma 3.7.
Lemma 3.2 x A G y -LESS of x
Lemma 3.3 x A G y -LESS of x
Lemma 3.4 x A G y -LESS of x constrained Lagrangean Lemma 3.5 x A G y -LESS of x constrained Lagrangean
Theorem 3.6 bilinear form
Lemma 3.7 Characterization of G y -LESS
“The guideline of chapter three: theorem and lemmas”
96
3 The second problem of algebraic regression
By means of a certain algebraic objective function which geometrically is called a minimum distance function, we solve the second inverse problem of linear and nonlinear equations, in particular of algebraic type, which relate observations to parameters. The system of linear or nonlinear equations we are solving here is classified as overdetermined. The observations, also called measurements, are elements of a certain observation space Y of integer dimension, dim Y = n, which may be metrical, especially Euclidean, pseudo–Euclidean, in general a differentiable manifold. In contrast, the parameter space X of integer dimension, dim X = m, is metrical as well, especially Euclidean, pseudo–Euclidean, in general a differentiable manifold, but its metric is unknown. A typical feature of algebraic regression is the fact that the unknown metric of the parameter space X is induced by the functional relation between observations and parameters. We shall outline three aspects of any discrete inverse problem: (i) set-theoretic (fibering), (ii) algebraic (rank partitioning, “IPM”, the Implicit Function Theorem) and (iii) geometrical (slicing). Here we treat the second problem of algebraic regression: A inconsistent system of linear observational equations: Ax + i = y , A R n× m , rk A = m, n > m, also called “overdetermined system of linear equations”, in short “more observations than unknowns” is solved by means of an optimization problem. The introduction presents us a front page example of three inhomogeneous equations with two unknowns. In terms of 31 boxes and 12 figures we review the least-squares solution of such a inconsistent system of linear equations which is based upon the trinity.
3-1 Introduction
97
3-1 Introduction With the introductory paragraph we explain the fundamental concepts and basic notions of section. For you, the analyst, who has the difficult task to deal with measurements, observational data, modeling and modeling equations we present numerical examples and graphical illustrations of all abstract notions. The elementary introduction is written not for a mathematician, but for you, the analyst, with limited remote control of the notions given hereafter. May we gain your interest. Assume an n-dimensional observation space Y, here a linear space parameterized by n observations (finite, discrete) as coordinates y = [ y1 ," , yn ]c R n in which an m-dimensional model manifold is embedded (immersed). The model manifold is described as the range of a linear operator f from an m-dimensional parameter space X into the observation space Y. The mapping f is established by the mathematical equations which relate all observables to the unknown parameters. Here the parameter space X , the domain of the linear operator f, will also be restricted to a linear space which is parameterized by coordinates x = [ x1 ," , xm ]c R m . In this way the linear operator f can be understood as a coordinate mapping A : x 6 y = Ax. The linear mapping f : X o Y is geometrically characterized by its range R(f), namely R(A), defined by R(f):= {y Y | y = f (x) for all x X} which in general is a linear subspace of Y and its kernel N(f), namely N(f), defined by N ( f ) := {x X | f (x) = 0}. Here the range R(f), namely R(A), does not coincide with the n-dimensional observation space Y such that y R (f ) , namely y R (A) . In contrast, we shall assume that the null space element N(f) = 0 “is empty”: it contains only the element x = 0 . Example 3.1 will therefore demonstrate the range space R(f), namely the range space R(A), which dose not coincide with the observation space Y, (f is not surjective or “onto”) as well as the null space N(f), namely N(A), which is empty. f is not surjective, but injective. Box 3.20 will introduce the special linear model of interest. By means of Box 3.21 it will be interpreted. 3-11 The front page example
Example 3.1 (polynomial of degree two, inconsistent system of linear equations Ax + i = y, x X = R m , dim X = m, y Y = R n , r = rk A = dim X = m, y N ( A ) ):
98
3 The second problem of algebraic regression
First, the introductory example solves the front page inconsistent system of linear equations, x1 + x2 1 x1 + 2 x2 3
x1 + x2 + i1 = 1 x1 + 2 x2 + i2 = 3
or
x1 + 3 x2 4
x1 + 3x2 + i3 = 4,
obviously in general dealing with the linear space X = R m x, dim X = m, here m=2, called the parameter space, and the linear space Y = R n y , dim Y = n, here n =3 , called the observation space. 3-12 The front page example in matrix algebra Second, by means of Box 3.1 and according to A. Cayley’s doctrine let us specify the inconsistent system of linear equations in terms of matrix algebra. Box 3.1: Special linear model: polynomial of degree one, three observations, two unknowns ª y1 º ª a11 y = «« y2 »» = «« a21 «¬ y3 »¼ «¬ a31
a12 º ªx º a22 »» « 1 » x a32 »¼ ¬ 2 ¼
ª1 º ª1 1 º ª i1 º ª1 1 º ª x1 º « » « » « » y = Ax + i : « 2 » = «1 2 » « » + «i2 » A = ««1 2 »» x «¬ 4 »¼ «¬1 3 »¼ ¬ 2 ¼ «¬ i3 »¼ «¬1 3 »¼ xc = [ x1 , x2 ], y c = [ y1 , y2 , y3 ] = [1, 2, 3], i c = [i1 , i2 , i3 ] , A Z +3× 2 R 3× 2 , x R 2×1 , y Z +3×1 R 3×1 r = rk A = dim X = m = 2 . As a linear mapping f : x 6 y Ax can be classified as following: f is injective, but not surjective. (A mapping f is called linear if f ( x1 + x2 ) = f ( x1 ) + f ( x2 ) holds.) Denote the set of all x X by the domain D(f) or the domain space D($). Under the mapping f we generate a particular set called the range R(f) or the range space R(A). Since the set of all y Y is not in the range R(f) or the range space R(A), namely y R (f ) or y R (A) , the mapping f is not surjective. Beside the range R(f), the range space R(A), the linear mapping is characterized by the kernel N ( f ) := {x R m | f (x) = 0} or the null space N ( A) := {x R m | Ax = 0} . Since the inverse mapping
99
3-1 Introduction
g : R ( f ) y / 6 x D( f ) is one-to-one, the mapping f is injective. Alternatively we may identify the kernel N(f), or the null space N ( A ) with {0} . ? Why is the front page system of linear equations called inconsistent ? For instance, let us solve the first two equations, namely x1 = 0, x2 = 1. As soon as we substitute this solution in the third one, the inconsistency 3 z 4 is met. Obviously such a system of linear equations needs general inconsistency parameters (i1 , i2 , i3 ) in order to avoid contradiction. Since the right-hand side of the equations, namely the inhomogeneity of the system of linear equations, has been measured as well as the model (the model equations) have been fixed, we have no alternative but inconsistency. Within matrix algebra the index of the linear operator A is the rank r = rk A, here r = 2, which coincides with the dimension of the parameter space X, dim X = m, namely r = rk A = dim X = m, here r=m=2. In the terminology of the linear mapping f, f is not “onto” (surjective), but “one-to-one” (injective). The left complementary index of the linear operator A R n× m , which account for the surjectivity defect is given by d s = n rkA, also called “degree of freedom” (here d s = n rkA = 1 ). While “surjectivity” related to the range R(f) or “the range space R(A)” and “injectivity” to the kernel N(f) or “the null space N(A)” we shall constructively introduce the notion of range R (f ) range space R (A)
versus
kernel N ( f ) null space N ( f )
by consequently solving the inconsistent system of linear equations. But beforehand let us ask: How can such a linear model of interest, namely a system of inconsistent linear equations, be generated ? With reference to Box 3.2 let us assume that we have observed a dynamical system y(t) which is represented by a polynomial of degree one with respect to time t R + , namely y(t ) = x1 + x2t. (Due to y• (t ) = x2 it is a dynamical system with constant velocity or constant first derivative with respect to time t.) The unknown polynomial coefficients are collected in the column array x = [ x1 , x2 ]c, x X = R 2 , dim X = 2, and constitute the coordinates of the two-dimensional parameter space X. If the dynamical system y(t) is observed at three instants, say y(t1) = y1 = 1, y(t2) = y2 = 2, y(t3) = y3 = 4, and if we collect the observations in the column array y = [ y1 , y2 , y3 ]c = [1, 2, 4]c, y Y = R 3 , dim Y = 3, they constitute the coordinates of the three-dimensional observation space Y. Thus we are left with the
100
3 The second problem of algebraic regression
problem to compute two unknown polynomial coefficients from three measurements. Box 3.2: Special linear model: polynomial of degree one, three observations, two unknowns ª y1 º ª1 t1 º ª i1 º ª x1 º « » « » « » y = « y2 » = «1 t2 » « » + «i2 » x «¬ y3 »¼ «¬1 t3 »¼ ¬ 2 ¼ «¬ i3 »¼ ª t1 = 1, y1 = 1 ª 1 º ª1 1 º ª i1 º ª x1 º « » « « » « » «t2 = 2, y2 = 2 : « 2 » = «1 2 » « » + «i2 » ~ x «¬ t3 = 3, y3 = 4 «¬ 4 »¼ «¬1 3 »¼ ¬ 2 ¼ «¬ i3 »¼ ~ y = Ax + i, r = rk A = dim X = m = 2 . Thirdly, let us begin with a more detailed analysis of the linear mapping f : Ax y or Ax + i = y , namely of the linear operator A R n× m , r = rk A = dim X = m. We shall pay special attention to the three fundamental partitionings, namely (i)
algebraic partitioning called rank partitioning of the matrix A,
(ii) geometric partitioning called slicing of the linear space Y (observation space), (iii) set-theoretical partitioning called fibering of the set Y of observations. 3-13 Least squares solution of the front page example by means of vertical rank partitioning Let us go back to the front page inconsistent system of linear equations, namely the problem to determine two unknown polynomial coefficients from three sampling points which we classified as an overdetermined one. Nevertheless we are able to compute a unique solution of the overdetermined system of inhomogeneous linear equations Ax + i = y , y R ( A) or rk A = dim X , here A R 3× 2 x R 2×1 , y R 3×1 if we determine the coordinates of the unknown vector x as well as the vector i of the inconsistency by least squares (minimal Euclidean length, A2-norm), here & i &2I = i12 + i22 + i32 = min . Box 3.3 outlines the solution of the related optimization problem.
101
3-1 Introduction
Box 3.3: Least squares solution of the inconsistent system of inhomogeneous linear equations, vertical rank partitioning The solution of the optimization problem {& i &2I = min | Ax + i = y , rk A = dim X} x
is based upon the vertical rank partitioning of the linear mapping f : x 6 y = Ax + i, rk A = dim X , which we already introduced. As soon as ª y1 º ª A1 º ª i1 º r ×r « y » = « A » x + « i » subject to A1 R ¬ 2¼ ¬ 2¼ ¬ 2¼ 1 1 x = A1 i1 + A1 y1 y 2 = A 2 A11i1 + i 2 + A 2 A11y1
i 2 = A 2 A11i1 A 2 A11 y1 + y 2
is implemented in the norm & i &2I we are prepared to compute the first derivatives of the unconstrained Lagrangean
L (i1 , i 2 ) := & i &2I = i1ci1 + i c2 i 2 = = i1ci1 + i1c A1c1A c2 A 2 A11i1 2i1c A1c1A c2 ( A 2 A11y1 y 2 ) + +( A 2 A11y1 y 2 )c( A 2 A11y1 y 2 ) = = min i1
wL (i1l ) = 0 wi1 A1c1A c2 ( A 2 A11y1 y 2 ) + [ A1c1Ac2 A 2 A11 + I ] i1l = 0 i1l = [I + A1c1Ac2 A 2 A11 ]1 A1c1 A c2 ( A 2 A11y1 y 2 ) which constitute the necessary conditions. The theory of vector derivatives is presented in Appendix B. Following Appendix A, “Facts: Cayley inverse: sum of two matrices , namely (s9), (s10) for appropriate dimensions of the involved matrices”, we are led to the following identities:
102
3 The second problem of algebraic regression
1st term (I + A1c1A c2 A 2 A11 ) 1 A1c1A c2 A 2 A11y1 = ( A1c + A c2 A 2 A11 ) 1 A c2 A 2 A11y1 = A1 ( A1c A1 + A c2 A 2 ) 1 A c2 A 2 A11y1 = A1 ( A1c A1 + A c2 A 2 ) 1 A1cy1 + + A1 ( A1c A1 + A c2 A 2 ) 1 ( A c2 A 2 A11 + A1c )y1 = A1 ( A1c A1 + A c2 A 2 ) 1 A1cy1 + +( A1c A1 + A c2 A 2 ) 1 ( A c2 A 2 + A1c A1 )y1 = A1 ( A1c A1 + A c2 A 2 ) 1 A1cy1 + y1 2nd term (I + A1c1A c2 A 2 A11 ) 1 A1c1A 2 y 2 = ( A1c + A c2 A 2 A11 ) 1 A 2 y 2 = = A1 ( A1c A1 + A c2 A 2 ) 1 A c2 y 2 i1l = A1 ( A1c A1 + A c2 A 2 ) 1 ( A1cy1 + A c2 y 2 ) + y1 . The second derivatives w2L (i1l ) = 2[( A 2 A11 )c( A 2 A11 ) + I] > 0 wi1wi1c due to positive-definiteness of the matrix ( A 2 A11 )c( A 2 A11 ) + I generate the sufficiency condition for obtaining the minimum of the unconstrained Lagrangean. Finally let us backward transform i1l 6 i 2 l = A 2 A11i1l A 2 A11 y1 + y 2 . i 2 l = A 2 ( A1c A1 + Ac2 A 2 ) 1 ( A1cy1 + Ac2 y 2 ) + y 2 . Obviously we have generated the linear form i1l = A1 ( A1c A1 + A c2 A 2 ) 1 ( A1cy1 + Ac2 y 2 ) + y1 i 2l = A 2 ( A1c A1 + Ac2 A 2 ) 1 ( A1cy1 + Ac2 y 2 ) + y 2 or ª i1l º ª A1 º ª y1 º ª y1 º 1 «i » = « A » ( A1c A1 + A c2 A 2 ) [ A1c , A c] « y » + « y » ¬ 2¼ ¬ 2¼ ¬ 2¼ ¬ 2l ¼ or i l = A( A cA) 1 y + y. Finally we are left with the backward step to compute the unknown vector of parameters x X : xl = A11i1l + A11 y1 xl = ( A1c A1 + A c2 A 2 ) 1 ( A1cy1 + A c2 y 2 ) or xl = ( A cA) 1 Acy.
103
3-1 Introduction
A numerical computation with respect to the introductory example is ª3 6 º ª14 6º A1c A1 + A c2 A 2 = « , ( A1c A1 + A c2 A 2 ) 1 = 16 « » », ¬6 14 ¼ ¬ 6 3 ¼ ª 8 3 º A1 ( A1c A1 + A c2 A 2 ) 1 = 16 « », ¬2 0 ¼ A 2 ( A1c A1 + A c2 A 2 ) 1 = 16 [4, 3], ª8º ª1 º A1cy1 + Ac2 y 2 = « » , y1 = « » , y 2 = 4 ¬19 ¼ ¬ 3¼ ª 1º i1l = 16 « » , i 2 l = 16 , & i l & I = 16 6, ¬2¼ ª 2 º xl = 16 « » , & xl &= 16 85, ¬9¼ y(t ) = 13 + 32 t ª 2 2 º 1 w 2L (x 2m ) = [( A 2 A11 )c( A 2 A11 ) + I] = « » > 0, 2 wx 2 wxc2 ¬ 2 5 ¼ § ª 2 2 º · § ª 2 2 º · " first eigenvalue O1 ¨ « ¸ = 6", " second eigenvalue O2 ¨ « » » ¸ = 1". © ¬ 2 5 ¼ ¹ © ¬ 2 5 ¼ ¹ The diagnostic algorithm for solving an overdetermined system of linear equations y = Ax, rk A = dim X = m, m < n = dim Y, y Y by means of rank partitioning is presented to you by Box 3.4. 3-14
The range R(f) and the kernel N(f), interpretation of “LESS” by three partitionings: (i) algebraic (rank partitioning) (ii) geometric (slicing) (iii) set-theoretical (fibering)
Fourthly, let us go into the detailed analysis of R(f), R ( f ) A , N(f), with respect to the front page example. Beforehand we begin with a comment. We want to emphasize the two step procedure of the least squares solution (LESS) once more: The first step of LESS maps the observation vector y onto the range space R(f) while in the second step the LESS point y R ( A) is uniquely mapped to the point xl X , an element of the parameter space. Of
104
3 The second problem of algebraic regression
course, we directly produce xl = ( A cA) 1 Acy just by substituting the inconsistency vector i = y – Ax into the l2 norm & i &2I = (y Ax)c(y Ax) = min . Such a direct procedure which is common practice in LESS does not give any insight into the geometric structure of LESS. But how to identify the range R(f), namely the range space R(A), or the kernel N(f), namely the null space N(A) in the front page example? By means of Box 3.4 we identify R(f) or “the null space R(A)” and give its illustration by Figure 3.1. Such a result has paved the way to the diagnostic algorithm for solving an overdetermined system of linear equations by means of rank partitioning presented in Box 3.5. The kernel N(f) or “the null space” is immediately identified as {0} = N ( A ) = {x R m | Ax) = 0} = {x R m | A1 x = 0} by means of rank partitioning ( A1x = 0 x = 0} . Box 3.4: The range space of the system of inconsistent linear equations Ax + i = y, “vertical” rank partitioning The matrix A is called “vertically rank partitioned”, if r = rk A = rk A1 = m, ªA º {A R n×m A = « 1 » A1 R r ×r , A 2 R d ×r } ¬ A 2 ¼ d = d ( A) = m rk A holds. (In the introductory example A R 3× 2 , A1 R 2× 2 , A 2 R1× 2 , rk A = 2, d ( A) = 1 applies.) An inconsistent system of linear equations Ax = y, rk A = dim X = m, is “vertically rank partitioned” if ªA º ªi º Ax = y , rk A = dim X y = « 1 » x + « 1 » ¬A2 ¼ ¬i 2 ¼ ª y = A1x + i1 « 1 ¬y 2 = A 2x + i 2 for a partitioned observation vector ªy º {y R n y = « 1 » | y1 R r ×1 , y 2 R d ×1 } ¬y 2 ¼ and a partitioned inconsistency vector ªi º {i R n i = « 1 » | i1 R r ×1 , i 2 R d ×1 }, ¬i 2 ¼ respectively, applies. (The “vertical” rank partitioning of the
105
3-1 Introduction
matrix A as well as the “vertically rank partitioned” inconsistent system of linear equations Ax + i = y , rk A = dim X = m , of the introductory example is ª1 1 º ª1 1 º «1 2 » = A = ª A1 º = «1 2» , « » « « » » ¬ A 2 ¼ «1 3» «¬1 3 »¼ ¬ ¼
ª1 º ª y1 º « » 2 ×1 « » = «3 » , y1 R , y 2 R . y ¬ 2 ¼ «4» ¬ ¼ By means of the vertical rank partitioning of the inconsistent system of inhomogeneous linear equations an identification of the range space R(A), namely R ( A) = {y R n | y 2 A 2 A11 y1 = 0} is based upon y1 = A1x + i1 x1 = A11 (y1 i1 ) y 2 = A 2 x + i 2 x 2 = A 2 A11 (y1 i1 ) + i 2 y 2 A 2 A11 y1 = i 2 A 2 A11i1 which leads to the range space R(A) for inconsistency zero, particularly in the introductory example 1
ª1 1 º ª y1 º y3 [1, 3] « » « » = 0. ¬1 2 ¼ ¬ y2 ¼ For instance, if we introduce the coordinates y1 = u , y2 = v, the other coordinate y3 of the range space R(A) Y = R 3 amounts to ª 2 1º ªu º ªu º y3 = [1, 3] « » « v » = [ 1, 2] « v » 1 1 ¬ ¼¬ ¼ ¬ ¼ y3 = u + 2v. In geometric language the linear space R(A) is a parameterized plane 2 P 0 through the origin illustrated by Figure 3.1. The observation space Y = R n (here n = 3) is sliced by the subspace, the linear space (linear manifold) R(A), dim R ( A) = rk( A) = r , namely a straight line, a plane (here), a higher dimensional plane through the origin O.
106
3 The second problem of algebraic regression
y 2 0
e3
ec1 e2 e1
Figure 3.1: Range R(f), range space R(A), y R(A), observation space Y = R 3 , slice by R ( A) = P02 R 3 , y = e1u + e 2 v+ e3 (u + 2 v) R ( A) Box 3.5: Algorithm Diagnostic algorithm for solving an overdetermined system of linear equations y = Ax + i, rk A = dim X , y R ( A) by means of rank partitioning Determine the rank of the matrix A rk A = dimX = m
107
3-1 Introduction
Compute the “vertical rank partitioning” ªA º A = « 1 » , A1 R r × r = R m × n , A 2 R ( n r )× r = R ( n m )× m ¬A2 ¼ “n – r = n – m = ds is called left complementary index” “A as a linear operator is not surjective, but injective”
Compute the range space R(A) R ( A) := {y R n | y 2 A 2 A11 y1 = 0}
Compute the inconsistency vector of type LESS i l = A( AcA) 1 y + y test : A ci l = 0
Compute the unknown parameter vector of type LESS xl = ( A cA) 1 Acy .
h What is the geometric interpretation of the least-squares solution & i &2I = min ? With reference to Figure 3.2 we additively decompose the observation vector accordingly to y = y R(A) + y R(A) , A
where y R ( A ) R ( A) is an element of the range space R ( A) , but the inconsistency vector i l = i R ( A ) R ( A) A an element of its orthogonal complement, the normal space R ( A) A . Here R ( A) is the central plane P02 , y R ( A ) P02 , but A
108
3 The second problem of algebraic regression
R ( A) A the straight line L1 , i l R ( A) A . & i &2I = & y y R ( A ) &2 = min can be understood as the minimum distance mapping of the observation point y Y onto the range space R ( A) . Such a mapping is minimal, if and only if the inner product ¢ y R ( A ) | i R ( A ) ² = 0 approaches zero, we say A
A
" y R ( A ) and i R ( A ) are orthogonal". A
The solution point y R ( A ) is the orthogonal projection of the observation point y Y onto the range space R ( A), an m-dimensional linear manifold, also called a Grassmann manifold G n , m .
Figure 3.2: Orthogonal projection of the observation vector an y Y onto the range space R ( A), R ( A) := {y R n | y 2 A 2 A11 y1 = 0} , i l R ( A) A , here: y R ( A ) P02 (central plane), y L1 (straight line ), representation of y R ( A ) (LESS) : y = e1u + e 2 v + e3 (u + 2v) R 3 , R ( A) = span{eu , e v } ª eu := Du y R ( A ) / & Du y R ( A ) &= (e1 e3 ) / 2 « Dv y R ( A ) < Dv y R ( A ) | eu > eu « Gram - Schmidt : «ev := = (e1 + e 2 + e3 ) / 3 & Dv y R ( A ) < Dv y R ( A ) | eu > eu & « « < eu | ev > = 0, Dv y R ( A ) = e 2 + 2e3 ¬ As an “intermezzo” let us consider for a moment the nonlinear model by means of the nonlinear mapping " X x 6 f (x) = y R ( A ) , y Y ".
109
3-1 Introduction
In general, the observation space Y as well as the parameter space X may be considered as differentiable manifolds, for instance “curved surfaces”. The range R(f) may be interpreted as the differentiable manifolds. X embedded or more generally immersed, in the observation space Y = R n , for instance: X Y. The parameters [ x1 ,… , xm ] constitute a chart of the differentiable manifolds X = M m M n = Y. Let us assume that a point p R ( f ) is given and we are going to attach the tangent space Tp M m locally. Such a tangent space Tp M m at p R ( f ) may be constructed by means of the Jacobi map, parameterized by the Jacobi matrix J, rk J = m, a standard procedure in Differential Geometry. An observation point y Y = R n is orthogonally projected onto the tangent space Tp M m at p R ( f ) , namely by LESS as a minimum distance mapping. In a second step – in common use is the equidistant mapping – we bring the point q Tp M m which is located in the tangent space Tp M m at p R ( f ) back to the differentiable manifold, namely y R ( f ). The inverse map " R ( f ) y 6 g ( y ) = xl X " maps the point y R ( f ) to the point xl of the chosen chart of the parameter space X as a differentiable manifold. Examples follow lateron. Let us continue with the geometric interpretation of the linear model of this paragraph. The range space R(A), dim R ( A) = rk( A) = m is a linear space of dimension m, here m = rk A, which slices R n . In contrast, the subspace R ( A) A corresponds to a n rk A = d s dimensional linear space Ln r , here n - rk A = n – m, r = rk A= m. Let the algebraic partitioning and the geometric partitioning be merged to interpret the least squares solution of the inconsistent system of linear equations as a generalized inverse (g-inverse) of type LESS. As a summary of such a merger we take reference to Box 3.6. The first condition: AA A = A Let us depart from LESS of y = Ax + i, namely xl = A l y = ( AcA) 1 Acy, i l = (I AA l ) y = [I A( AcA) 1 Ac]y. º Axl = AA l y = AA l ( Axl + i l ) » 1 1 A ci l = Ac[I A( AcA) Ac]y = 0 A l i l = ( A cA) Aci l = 0 ¼ Axl = AA l Axl AA A = A . The second condition A AA = A
110
3 The second problem of algebraic regression
xl = ( A cA) 1 A cy = A l y = A l ( Axl + i l ) º » A l i l = 0 ¼ xl = A l y = A l AA l y
A l y = A l AA l y A AA = A . rk A l = rk A is interpreted as following: the g-inverse of type LESS is the generalized inverse of maximal rank since in general rk A d rk A holds. The third condition AA = PR ( A )
y = Axl + i l = AA l + (I AA l )y y = Axl + i l = A( A cA) 1 A cy + [I A( A cA) 1 A c]y º » y = y R(A) + i R(A) »¼ A
A A = PR ( A ) , (I AA ) = PR ( A
A
)
.
Obviously AA l is an orthogonal projection onto R ( A) , but I AA l onto its orthogonal complement R ( A) A . Box 3.6: The three condition of the generalized inverse mapping (generalized inverse matrix) LESS type Condition #1 f (x) = f ( g (y )) f = f DgD f Condition #2 (reflexive g-inverse mapping)
Condition #1 Ax = AA Ax AA A = A Condition #2 (reflexive g-inverse)
x = g (y ) =
x = A y = A AA y
= g ( f (x))
A AA = A
Condition #3
Condition #3
f ( g (y )) = y R ( A )
A Ay = y R (A)
f D g = projR (f)
A A = PR (A) .
3-2 The least squares solution: “LESS”
111
The set-theoretical partitioning, the fibering of the set system of points which constitute the observation space Y, the range R(f), will be finally outlined. Since the set system Y (the observation space) is R n , the fibering is called “trivial”. Non-trivial fibering is reserved for nonlinear models in which case we are dealing with a observation space as well as an range space which is a differentiable manifold. Here the fibering Y = R( f ) R( f )A produces the trivial fibers R ( f ) and R ( f ) A where the trivial fibers R ( f ) A is the quotient set R n /R ( f ) . By means of a Venn diagram (John Venn 1834-1928) also called Euler circles (Leonhard Euler 1707-1783) Figure 3.3 illustrates the trivial fibers of the set system Y = R n generated by R ( f ) and R ( f ) A . The set system of points which constitute the parameter space X is not subject to fibering since all points of the set system R(f) are mapped into the domain D(f).
Figure 3.3: Venn diagram, trivial fibering of the observation space Y, trivial fibers R ( f ) and R ( f ) A , f : R m = X o Y = R ( f ) R ( f ) A , X set system of the parameter space, Y set system of the observation space.
3-2 The least squares solution: “LESS” The system of inconsistent linear equations Ax + i = y subject to A R n×m , rk A = m < n , allows certain solutions which we introduce by means of Definition 3.1 as a solution of a certain optimization problem. Lemma 3.2 contains the normal equations of the optimization problem. The solution of such a system of normal equations is presented in Lemma 3.3 as the least squares solution with respect to the G y - norm . Alternatively Lemma 3.4 shows the least squares solution generated by a constrained Lagrangean. Its normal equations are solved for (i) the Lagrange multiplier, (ii) the unknown vector of inconsistencies by Lemma 3.5. The unconstrained Lagrangean where the system of linear equations has been implemented as well as the constrained Lagrangean lead to the identical solution for (i) the vector of inconsistencies and (ii) the vector of unknown parameters. Finally we discuss the metric of the observation space and alternative choices of its metric before we identify the solution of the quadratic optimization problem by Lemma 3.7 in terms of the (1, 2, 3)-generalized inverse.
112
3 The second problem of algebraic regression
Definition 3.1 ( least squares solution w.r.t. the G y -seminorm): A vector xl X = R m is called G y - LESS (LEast Squares Solution with respect to the G y -seminorm) of the inconsistent system of linear equations ª rk A = dim X = m Ax + i = y , y Y { R , «« or «¬ y R ( A) n
(3.1)
(the system of inverse linear equations A y = x, rk A = dim X = m or x R ( A ) , is consistent) if in comparison to all other vectors x X { R m the inequality & y Axl &G2 = (y Axl )cG y (y Axl ) d y
d (y Ax)cG y (y Ax) = & y Ax &G2
(3.2)
y
holds, in particular if the vector of inconsistency i l := y Axl has the least G y -seminorm. The solution of type G y -LESS can be computed as following Lemma 3.2 (least squares solution with respect to the G y -seminorm) : A vector xl X { R m is G y -LESS of (3.1) if and only if the system of normal equations A cG y Axl = AcG y y
(3.3)
is fulfilled. xl always exists and is in particular unique, if A cG y A is regular. : Proof : G y -LESS is constructed by means of the Lagrangean b ± b 2 4ac 2a = xcA cG y Ax 2y cG y Ax + y cG y y = min
L(x) := & i &2G = & y Ax &2G = y
y
x
such that the first derivatives w i cG y i wL (xl ) = (xl ) = 2 A cG y ( Axl y ) = 0 wx wx constitute the necessary conditions. The theory of vector derivative is presented in Appendix B. The second derivatives
3-2 The least squares solution: “LESS”
113
w 2 i cG y i w2L (xl ) = (xl ) = 2 A cG y A t 0 wx wxc wx wxc due to the positive semidefiniteness of the matrix A cG y A generate the sufficiency condition for obtaining the minimum of the unconstrained Lagrangean. Because of the R ( A cG y A) = R ( A cG y ) there always exists a solution xl whose uniqueness is guaranteed by means of the regularity of the matrix A cG y A .
ƅ
It is obvious that the matrix A cG y A is in particular regular, if rk A = dim X = m , but on the other side the matrix G y is positive definite, namely & i &2G is a G y norm. The linear form xl = Ly which for arbitrary observation vectors y Y { R n leads to G y -LESS of (3.1) can be represented as following. y
Lemma 3.3 (least squares solution with respect to the G y - norm, rk A = dim X = m or ( x R ( A ) ): xl = Ly is G y -LESS of the inconsistent system of linear equations (3.1) Ax + i = y , restricted to rk ( A cG y A) = rk A = dim X (or R ( A cG y ) = R ( A c) and x R ( A ) ) if and only if L R m × n is represented by Case (i) : G y = I Lˆ = A L = ( AcA) 1 Ac
(left inverse)
(3.4)
xl = A L y = ( A cA) 1 Acy. y = yl + il
(3.5) (3.6)
is an orthogonal decomposition of the observation vector y Y { R n into the I -LESS vector y l Y = R n and the I LESS vector of inconsistency i l Y = R n subject to (3.7) y l = Axl = A( A cA) 1 A cy i l = y y l =[I n A( A cA) 1 Ac] y.
(3.8)
Due to y l = A( AcA) 1 Acy , I-LESS has the reproducing property. As projection matrices A( A cA) 1 A c and [I n A( AcA) 1 Ac] are independent. The “goodness of fit” of I-LESS is & y Axl &2I =& i l &2I = y c[I n A( A cA) 1 A c]y .
(3.9)
Case (ii) : G y positive definite, rk ( A cG y A) = rk A Lˆ = ( A cG y A ) 1 A cG y (weighted left inverse)
(3.10)
xl = ( A cG y A) AcG y y.
(3.11)
1
114
3 The second problem of algebraic regression
(3.12) y = y l + il is an orthogonal decomposition of the observation vector y Y { R n into the G y -LESS vector y l Y = R n and the G y LESS vector of inconsistency i l Y = R n subject to y l = Axl = A( A cG y A) 1 AcG y y , (3.13) i l = y Axl =[I n A( AcG y A) 1 AcG y ] y .
(3.14)
Due to y l = A( A cG y A) 1 A cG y y G y -LESS has the reproducing property. As projection matrices A( A cG y A) 1 A cG y and [I n A ( A cG y A ) 1 A cG y ] are independent. The “goodness of fit” of G y -LESS is & y Axl &2G =& i l &2G = y c[I n A ( A cG y A ) 1 A cG y ]y . y
y
(3.15)
The third case G y positive semidefinite will be treated independently. The proof of Lemma 3.1 is straightforward. The result that LESS generates the left inverse, G y -LESS the weighted left inverse will be proved later. An alternative way of producing the least squares solution with respect to the G y - seminorm of the linear model is based upon the constrained Lagrangean (3.16), namely L(i, x, Ȝ ) . Indeed L(i, x, Ȝ ) integrates the linear model (3.1) by a vector valued Lagrange multiplyer to the objective function of type “least squares”, namely the distance function in a finite dimensional Hilbert space. Such an approach will be useful when we apply “total least squares” to the mixed linear model (error-in-variable model). Lemma 3.4 (least squares solution with respect to the G y - norm, rk A = dim X , constrained Lagrangean): G y -LESS is assumed to be defined with respect to the constrained Lagrangean L(i, x, Ȝ ) := i cG y i + 2Ȝ c( Ax + i y ) = min . i , x, Ȝ
(3.16)
A vector [i cl , xcl , Ȝ cl ]c R ( n + m + n )×1 is G y -LESS of (3.1) in the sense of the constrained Lagrangean L(i, x, Ȝ ) = min if and only if the system of normal equations ªG y 0 I n º ª i l º ª 0 º « 0 0 A c» « x » = « 0 » (3.17) « »« l» « » «¬ I n A 0 »¼ «¬ Ȝ l »¼ «¬ y »¼ with the vector Ȝ l R n×1 of “Lagrange multiplyer” is fulfilled. (i l , xl , Ȝ l ) exists and is in particular unique, if G y is positive semidefinite. There holds (i l , xl , Ȝ l ) = arg{L(i, x, Ȝ ) = min} .
(3.18)
3-2 The least squares solution: “LESS”
115
: Proof : G y -LESS is based on the constrained Lagrangean L(i, x, Ȝ ) := i cG y i + 2Ȝ c( $x + i y ) = min i , x, Ȝ
such that the first derivatives wL (i l , xl , Ȝ l ) = 2(G y i l + Ȝ l ) = 0 wi wL (i l , xl , Ȝ l ) = 2$ cȜ l = 0 wx wL (i l , xl , Ȝ l ) = 2( $xl + i l y ) = 0 wȜ or ªG y « 0 « «¬ I n
0 0
I n º ª il º ª 0 º A c»» «« xl »» = «« 0 »» A 0 »¼ «¬ Ȝ l »¼ «¬ y »¼
constitute the necessary conditions. (The theory of vector derivative is presented in Appendix B.) The second derivatives 1 w2L ( xl ) = G y t 0 2 w i w ic due to the positive semidefiniteness of the matrix G y generate the sufficiency condition for obtaining the minimum of the constrained Lagrangean.
ƅ Lemma 3.5 (least squares solution with respect to the G y - norm, rk A = dim X , constrained Lagrangean): If G y -LESS of the linear equations (3.1) is generated by the constrained Lagrangean (3.16) with respect to a positive definite weight matrix G y , rk G y = n, then the normal equations (3.17) are uniquely solved by xl = ( AcG y A) 1 AcG y y,
(3.19)
i l =[I n A( A cG y A) 1 A cG y ] y,
(3.20)
Ȝ l =[G y A( A cG y A) 1 A c I n ] G y y.
(3.21)
:Proof : A basis of the proof could be C. R. Rao´s Pandora Box, the theory of inverse partitioned matrices (Appendix A: Fact: Inverse Partitioned Matrix /IPM/ of a
116
3 The second problem of algebraic regression
symmetric matrix). Due to the rank identities rk G y = n, rk A = rk ( A cG y A) = m < n, the normal equations can be solved faster directly by Gauss elimination. G y il + Ȝ l = 0 A cȜ l = 0 Axl + i l y = 0. Multiply the third normal equation by A cG y , multiply the first normal equation by Ac and substitute A c Ȝ l from the second normal equation in the modified first one. A cG y Axl + AcG y i l A cG y y = 0 º » A cG y i l + A cȜ l = 0 » »¼ A cȜ l = 0
A cG y Axl + AcG y i l A cG y y = 0 º » A cG y i l = 0 ¼ A cG y Axl A cG y y = 0,
xl = ( A cG y A) 1 AcG y y. Let us subtract the third normal equation and solve for i l . i l = y Axl , i l =[I n A( AcG y A) 1 AcG y ] y. Finally we determine the Lagrange multiplier: substitute i l in the first normal equation in order to find Ȝ l = G y i l Ȝ l =[G y A( AcG y A) 1 A cG y G y ] y.
ƅ Of course the G y -LESS of type (3.2) and the G y -LESS solution of type constrained Lagrangean (3.16) are equivalent, namely (3.11) ~ (3.19) and (3.14) ~ (3.20). In order to analyze the finite dimensional linear space Y called “the observation space”, namely the case of a singular matrix of its metric, in more detail, let us take reference to the following.
3-2 The least squares solution: “LESS”
117
Theorem 3.6 (bilinear form) : Suppose that the bracket i i or g (i,i) : Y × Y o \ is a bilinear form or a finite dimensional linear space Y , dim Y = n , for instance a vector space over the field of real numbers. There exists a basis {e1 ,..., en } such that ei e j = 0 or g (ei , e j ) = 0 for i z j
(i)
(
)
ei ei = +1 or g ei , ei = +1 for 1 d i1 d p ° ° ® ei ei = 1 or g ei , ei = 1 for p + 1 d i2 d p + q = r ° ei ei = 0 or g ei , ei = 0 for r + 1 d i3 d n . °¯ 1
(ii)
2
1
(
2
3
3
2
1
)
2
(
1
3
3
)
The numbers r and p are determined exclusively by the bilinear form. r is called the rank, r p = q is called the relative index and the ordered pair (p,q) the signature. The theorem states that any two spaces of the same dimension with bilinear forms of the same signature are isometrically isomorphic. A scalar product (“inner product”) in this context is a nondegenerate bilinear form, for instance a form with rank equal to the dimension of Y . When dealing with low dimensional spaces as we do, we will often indicate the signature with a series of plus and minus signs when appropriate. For instance the signature of \ 14 may be written (+ + + ) instead of (3,1). Such an observation space Y is met when we are dealing with observations in Special Relativity. For instance, let us summarize the peculiar LESS features if the matrix G y \ n×n of the observation space is semidefinite, rk G y := ry < n . By means of Box 3.7 we have collected the essential items of the eigenspace analysis as well as the eigenspace synthesis G *y versus G y of the metric. ȁ y = = Diag(O1 ,..., Or ) denotes the matrix of non-vanishing eigenvalues {O1 ,..., Or } . Note the norm identity y
y
|| i ||G2 = || i ||2U ȁ U c , y
1
y
(3.22)
1
which leads to the U1 ȁ y U1c -LESS normal equations A cU1 ȁ y U1c x A = A cU1 ȁ y U1c y. Box 3.7: Canonical representation of the rank deficient matrix of the matrix of the observation space Y rk G y =: ry , ȁ y := Diag(O1 ,..., Or ) . y
(3.23)
118
3 The second problem of algebraic regression
“eigenspace analysis”
“eigenspace synthesis”
ªU c º (3.24) G *y = « 1 » G y [ U1 , U 2 ] = «U c » ¬ 2¼ ªȁ =« y ¬ 02
ªȁ G y = [ U1 , U 2 ] « y ¬ 02
01 º ª U1c º « » (3.25) 03 »¼ « U c » ¬ 2¼
\ n× n
01 º \ n× n 03 »¼ subject to
{
}
U SO(n(n 1) / 2) := U \ n× n | UcU = I n , U = +1 U1 \ 01 \
n× ry
ry × n
, U2 \
, 02 \
n×( n ry )
( n ry )× ry
, ȁy \
, 03 \
ry ×ry
( n ry )×( n ry )
“norms” (3.26)
|| i ||G2 = || i ||2U ȁ U c y
1
y
i cG y i = i cU1 ȁ y U1ci
~
1
(3.27)
LESS: || i ||G2 = min || i ||2U ȁ U c = min y
x
1
y
1
x
A cU1 ȁ y U1c xA = A cU1 ȁ y U1c y . Another example relates to an observation space Y = \ 12 k
( k {1,..., K })
of even dimension, but one negative eigenvalue. In such a pseudo-Euclidean space of signature (+ + ) the determinant of the matrix of metric G y is negative, namely det G y = O1 ...O2 K 1 O2 K . Accordingly x max = arg{|| i ||G2 = max | y = Ax + i, rk A = m} y
is G y -MORE (Maximal ObseRvational inconsistEncy solution), but not G y LESS. Indeed, the structure of the observational space, either pseudo-Euclidean or Euclidean, decides upon MORE or LESS. 3-21 A discussion of the metric of the parameter space X With the completion of the proof we have to discuss the basic results of Lemma 3.3 in more detail. At first we have to observe that the matrix G y of the met-
3-2 The least squares solution: “LESS”
119
ric of the observation space Y has to be given a priori. We classified LESS according to (i) G y = I n , (ii) G y positive definite and (iii) G y positive semidefinite. But how do we know the metric of the observation space Y? Obviously we need prior information about the geometry of the observation space Y, namely from the empirical sciences like physics, chemistry, biology, geosciences, social sciences. If the observation space Y R n is equipped with an inner product ¢ y1 | y 2 ² = y1cG y y 2 , y1 Y, y 2 Y where the matrix G y of the metric & y &2 = y cG y y is positive definite, we refer to the metric space Y R n as Euclidean E n . In contrast, if the observation space is positive semidefinite we call the observation space semi Euclidean E n , n . n1 is the number of positive eigenvalues, n2 the number of zero eigenvalues of the positive semidefinite matrix G y of the metric (n = n1 + n2 ). In various applications, namely in the adjustment of observations which refer to Special Relativity or General Relativity we have to generalize the metric structure of the observation space Y: If the matrix G y of the pseudometric & y &2 = y cG y y is built on n1 positive eigenvalues (signature +), n2 zero eigenvalues and n3 negative eigenvalues (signature -), we call the pseudometric parameter space pseudo Euclidean E n , n , n , n = n1 + n2 + n3 . For such an observation space LESS has to be generalized to & y Ax &2G = extr , for instance "maximum norm solution" . 1
2
1
2
3
y
3-22 Alternative choices of the metric of the observation space Y Another problem associated with the observation space Y is the norm choice problem. Up to now we have used the A 2 -norm, for instance A 2 -norm: & y Ax & 2 := ( y Ax)( y Ax) = i c i = = i12 + i22 + " + in21 + in2 , A p -norm: & y Ax & p :=
p
p
p
p
p
i1 + i2 + " + in 1 + in ,
1< p < f A f -norm: & i & f := max | ii | 1di d n
are alternative norms of choice. Beside the choice of the matrix G y of the metric within the weighted A 2 -norm we like to discuss the result of the LESS matrix G l of the metric. Indeed we have constructed LESS from an a priori choice of the metric G called G y and were led to the a posteriori choice of the metric G l of type (3.9) and (3.15). The matrices (i) G l = I n A( A cA) 1 Ac (ii) G l = I n A( A cG y A) A cG y 1
are (i) idempotent and (ii) G y1 idempotent, in addition.
(3.9) (3.15)
120
3 The second problem of algebraic regression
There are various alternative scales or objective functions for projection matrices for substituting Euclidean metrics termed robustifying. In special cases those objective functions operate on (3.11) xl = Hy subject to H x = ( AcG y A) 1 AG y , (3.13) y A = H y y subject to H y = A( A cG y A) 1 AG y , (3.14) i A = H A y subject to H A = ª¬I n A( A cG y A) 1 AG y º¼ y , where {H x , H y , H A } are called “hat matrices”. In other cases analysts have to accept that the observation space is non-Euclidean. For instance, direction observations in R p locate points on the hypersphere S p 1 . Accordingly we have to accept an objective function of von Mises-Fisher type which measures the spherical distance along a great circle between the measurement points on S p 1 and the mean direction. Such an alternative choice of a metric of a non- Euclidean space Y will be presented in chapter 7. Here we discuss in some detail alternative objective functions, namely
• • •
optimal choice of the weight matrix G y : second order design SOD optimal choice of the weight matrix G y by means of condition equations robustifying objective functions
3-221 Optimal choice of weight matrix: SOD The optimal choice of the weight matrix , also called second order design (SOD), is a traditional topic in the design of geodetic networks. Let us refer to the review papers by A. A. Seemkooei (2001), W. Baarda (1968, 1973), P. Cross (1985), P. Cross and K. Thapa (1979), E. Grafarend (1970, 1972, 1974, 1975), E. Grafarend and B. Schaffrin (1979), B. Schaffrin (1981, 1983, 1985), F. Krumm (1985), S. L. Kuang (1991), P. Vanicek, K. Thapa and D. Schröder (1981), B. Schaffrin, E. Grafarend and G. Schmitt (1977), B. Schaffrin, F. Krumm and D. Fritsch (1980), J. van Mierlo (1981), G. Schmitt (1980, 1985), C. C. Wang (1970), P. Whittle (1954, 1963), H. Wimmer (1982) and the textbooks by E. Grafarend, H. Heister, R. Kelm, H. Knopff and B. Schaffrin (1979) and E. Grafarend and F. Sanso (1985, editors). What is an optimal choice of the weight matrix G y , what is “a second order design problem”? Let us begin with Fisher’s Information Matrix which agrees to the half of the Hesse matrix, the matrix of second derivatives of the Lagrangean L(x):=|| i ||G2 = || y Ax ||G2 , namely y
y
3-2 The least squares solution: “LESS” G x = A c(x)G y A(x) =
121 1 2
w2L =: FISHER wx A wxcA
at the “point“ x A of type LESS. The first order design problem aims at determining those points x within the Jacobi matrix A by means of a properly chosen risk operating on “FISHER”. Here, “FISHER” relates the weight matrix of the observations G y , previously called the matrix of the metric of the observation space, to the weight matrix G x of the unknown parameters, previously called the matrix of the metric of the parameter space. Gx
Gy
weight matrix of
weight matrix of
the unknown parameters
the observations
or
or
matrix of the metric of the parameter space X
matrix of the metric of the observation space Y .
Being properly prepared, we are able to outline the optimal choice of the weight matrix G y or X , also called the second order design problem, from a criterion matrix Y , an ideal weight matrix G x (ideal) of the unknown parameters, We hope that the translation of G x and G y “from metric to weight” does not cause any confusion. Box 3.8 elegantly outlines SOD. Box 3.8: Second order design SOD, optimal fit to a criterion matrix of weights “weight matrix of the parameter space“ (3.28)
Y :=
1 2
w2L wx A wx Ac
3-21 “weight matrix of the observation space”
(
)
X := G y = Diag g1y ,..., g ny (3.29)
= Gx
x := ª¬ g1y ,..., g ny º¼c
“inconsistent matrix equation of the second order design problem“ A cXA + A = Y
(3.30)
“optimal fit” || ǻ ||2 = tr ǻcǻ = (vec ǻ)c(vec ǻ) = min
(3.31)
x S := arg{|| ǻ ||2 = min | A cXA + ǻ = Y, X = Diag x}
(3.32)
X
122
3 The second problem of algebraic regression
vec ǻ = = vec Y vec( A cXA) = vec Y ( A c
A c) vec X
(3.33)
vec ǻ = vec Y ( A c
A c)x
(3.34)
x \ n , vec Y \ n ×1 , vech Y \ n ( n +1) / 2×1 2
vec ǻ \ n ×1 , vec X \ n ×1 , ( A c
A c) \ n ×n , A c : A c \ n ×n 2
2
2
2
x S = [ ( A c : A c)c( A c : A c) ] ( Ac : Ac) vec Y . 1
2
(3.35)
In general, the matrix equation A cXA + ǻ = Y is inconsistent. Such a matrix inconsistency we have called ǻ \ m × m : For a given ideal weight matrix G x (ideal ) , A cG y A is only an approximation. The unknown weight matrix of the observations G y , here called X \ n× n , can only be designed in its diagonal form. A general weight matrix G y does not make any sense since “oblique weights” cannot be associated to experiments. A natural restriction is therefore X = Diag g1y ,..., g ny . The “diagonal weights” are collected in the unknown vector of weights
(
)
x := ª¬ g1y ,..., g ny º¼c \ n . The optimal fit “ A cXA to Y “ is achieved by the Lagrangean || ǻ ||2 = min , the optimum of the Frobenius norm of the inconsistency matrix ǻ . The vectorized form of the inconsistency matrix, vec ǻ , leads us first to the matrix ( A c
A c) , the Zehfuss product of Ac , second to the Kronecker matrix ( A c : A c) , the Khatri- Rao product of Ac , as soon as we implement the diagonal matrix X . For a definition of the Kronecker- Zehfuss product as well as of the Khatri- Rao product and related laws we refer to Appendix A. The unknown weight vector x is LESS, if x S = [ ( A c : A c)c( A c : A c) ] ( A c : Ac)c vec Y . 1
Unfortunately, the weights x S may come out negative. Accordingly we have to build in extra condition, X = Diag( x1 ,..., xm ) to be positive definite. The given references address this problem as well as the datum problem inherent in G x (ideal ) . Example 3.2 (Second order design):
3-2 The least squares solution: “LESS”
123 PȖ
y3 = 6.94 km
Pį y1 = 13.58 km
y2 = 9.15 km
PĮ
Pȕ
Figure 3.4: Directed graph of a trilateration network, known points {PD , PE , PJ } , unknown point PG , distance observations [ y1 , y2 , y3 ]c Y The introductory example we outline here may serve as a firsthand insight into the observational weight design, also known as second order design. According to Figure 3.4 we present you with the graph of a two-dimensional planar network. From three given points {PD , PE , PJ } we measure distances to the unknown point PG , a typical problem in densifying a geodetic network. For the weight matrix G x Y of the unknown point we postulate I 2 , unity. In contrast, we aim at an observational weight design characterized by a weight matrix G x X = Diag( x1 , x2 , x3 ) . The second order design equation A c Diag( x1 , x2 , x3 ) A + ǻ = I 2 is supposed to supply us with a circular weight matrix G y of the Cartesian coordinates ( xG , yG ) of PG . The observational equations for distances ( sDG , sEG , sJG ) = (13.58 km, 9.15 km, 6.94 km) have already been derived in chapter 1-4. Here we just take advantage of the first design matrix A as given in Box 3.9 together with all further matrix operations. A peculiar situation for the matrix equation A cXA + ǻ = I 2 is met: In the special configuration of the trilateration network the characteristic equation of the second order design problem is consistent. Accordingly we have no problem to get the weights
124
3 The second problem of algebraic regression
0 0 º ª0.511 « 0.974 0 »» , Gy = « 0 «¬ 0 0 0.515»¼ which lead us to the weight G x = I 2 a posteriori. Note that the weights came out positive. Box 3.9: Example for a second order design problem, trilateration network ª 0.454 0.891º A = «« 0.809 +0.588»» , X = Diag( x1 , x2 , x3 ), Y = I 2 «¬ +0.707 +0.707 »¼ A c Diag( x1 , x2 , x3 ) A = I 2 ª0.206 x1 + 0.654 x2 + 0.5 x3 « ¬ 0.404 x1 0.476 x2 + 0.5 x3
0.404 x1 0.476 x2 + 0.5 x3 º = 0.794 x1 + 0.346 x2 + 0.5 x3 »¼
ª1 0 º =« » ¬0 1 ¼ “inconsistency ǻ = 0 ” (1st) 0.206 x1 + 0.654 x2 + 0.5 x3 = 1 (2nd) 0.404 x1 0.476 x2 + 0.5 x3 = 0 (3rd) 0.794 x1 + 0.346 x2 + 0.5 x3 = 1 x1 = 0.511, x2 = 0.974, x3 = 0.515. 3-222 The Taylor Karman criterion matrix ? What is a proper choice of the ideal weight matrix G x ? There has been made a great variety of proposals. First, G x (ideal ) has been chosen simple: A weight matrix G x is called ideally simple if G x (ideal ) = I m . For such a simple weight matrix of the unknown parameters Example 3.2 is an illustration of SOD for a densification problem in a trilateration network. Second, nearly all geodetic networks have been SOD optimized by a criterion matrix G x (ideal ) which is homogeneous and isotropic in a two-dimensional or
3-2 The least squares solution: “LESS”
125
three-dimensional Euclidean space. In particular, the Taylor-Karman structure of a homogeneous and isotropic weight matrix G x (ideal ) has taken over the SOD network design. Box 3.10 summarizes the TK- G x (ideal ) of a two-dimensional, planar network. Worth to be mentioned, TK- G x (ideal ) has been developed in the Theory of Turbulence, namely in analyzing the two-point correlation function of the velocity field in a turbulent medium. (G. I. Taylor 1935, 1936, T. Karman (1937), T. Karman and L. Howarth (1936), C. C. Wang (1970), P. Whittle (1954, 1963)). Box 3.10: Taylor-Karman structure of a homogeneous and isotropic tensor- valued, two-point function, two-dimensional, planar network ª gx x « « gy x Gx = « « gx x «g «¬ y x
1 1
1 1
2 1
2 1
gx y gy y
1 1
gx x gy x
gx y gy y
gx x gy x
1 1
2 1
2 1
gx y º » gy y » » G x (xD , x E ) gx y » g y y »» ¼
1 2
1 2
1 2
1 2
2 2
2 2
2 2
2 2
PD (xD , yD )
“Euclidean distance function of points PE (x E , y E ) ”
and
sDE :=|| xD x E ||= ( xD xE ) 2 + ( yD yE ) 2 “decomposition of the tensor-valued, two-point weight function G x (xD , x E ) into the longitudinal weight function f A and the transversal weight function f m ” G x (xD , x E ) = ª¬ g j j (xD , x E ) º¼ = 1 2
ª x j ( PD ) x j ( PE ) º¼ ª¬ x j ( PD ) x j ( PE ) º¼ = f m ( sDE )G j j + ª¬ f A ( sDE ) f m ( sDE ) º¼ ¬ (3.36) s2 1
1
2
2
1 2
DE
j1 , j2 {1, 2} , ( xD , yD ) = ( x1 , y1 ), ( xE , yE ) = ( x2 , y2 ). 3-223 Optimal choice of the weight matrix: The space R ( A ) and R ( A) A In the introductory paragraph we already outlined the additive basic decomposition of the observation vector into y = y R (A) + y R
( A )A
y R ( A ) = PR ( A ) y , y R where PR( A ) and PR
( A )A
= y A + iA ,
( A )A
= PR
( A )A
y,
are projectors as well as
126
3 The second problem of algebraic regression
y A R ( A ) is an element of the range space R ( A ) , in general the tangent space Tx M of the mapping f (x)
i A R ( A ) is an element of its orthogonal complement in general the normal A space R ( A ) . A
versus
G y -orthogonality y A i A
Gy
= 0 is proven in Box 3.11.
Box 3.11 G y -Orthogonality of y A = y ( LESS ) and i A = i ( LESS ) “ G y -orthogonality” yA iA yA iA
GA
Gy
=0
(3.37)
= y c ¬ªG y A( AcG y A) 1 A c¼º G y ¬ª I n A( AcG y A) 1 A cG y ¼º y =
= y cG y A( A cG y A) 1 A cG y y cG y A ( A cG y A ) 1 A cG y A( A cG y A) 1 A cG y y = = 0. There is an alternative interpretation of the equations of G y -orthogonality i A y A G = i AcG y y A = 0 of i A and y A . First, replace iA = PR A y where PR A is ( ) ( ) a characteristic projection matrix. Second, substitute y A = Ax A where x A is G y LESS of x . As outlined in Box 3.12, G y -orthogonality i AcG y y A of the vectors i A and y A is transformed into the G y -orthogonality of the matrices A and B . The columns of the matrices A and B are G y -orthogonal. Indeed we have derived the basic equations for transforming +
y
parametric adjustment
into
y A = Ax A ,
adjustment of conditional equations BcG y y A = 0,
by means of BcG y A = 0. Box 3.12 G y -orthogonality of A and B i A R ( A) A , dim R ( A) A = n rk A = n m y A R ( A ) , dim R ( A ) = rk A = m
+
3-2 The least squares solution: “LESS” iA yA
Gy
127
= 0 ª¬I n A( AcG y A) 1 A cG y º¼c G y A = 0
rk ª¬ I n A( A cG y A) 1 A cG y º¼ = n rk A = n m
(3.38) (3.39)
“horizontal rank partioning” ª¬I n A( A cG y A) 1 A cG y º¼ = [ B, C]
(3.40)
B \ n× ( n m ) , C \ n× m , rk B = n m iA yA
Gy
= 0 BcG y A = 0 .
(3.41)
Example 3.3 finally illustrates G y -orthogonality of the matrices A und B . Example 3.3 (gravimetric leveling, G y -orthogonality of A and B ). Let us consider a triangular leveling network {PD , PE , PJ } which consists of three observations of height differences ( hDE , hEJ , hJD ) . These height differences are considered holonomic, determined from gravity potential differences, known as gravimetric leveling. Due to hDE := hE hD , hEJ := hJ hE , hJD := hD hJ the holonomity condition
³9 dh = 0
or
hDE + hEJ + hJD = 0
applies. In terms of a linear model the observational equations can accordingly be established by ª hDE º ª 1 0 º ªiDE º ª hDE º « » « » « » « hEJ » = « 0 1 » « h » + « iEJ » « hJD » «¬ 1 1»¼ ¬ EJ ¼ « iJD » ¬ ¼ ¬ ¼ ª hDE º ª1 0º ª hDE º « » y := « hEJ » , A := «« 0 1 »» , x := « » ¬ hEJ ¼ « hJD » «¬ 1 1»¼ ¬ ¼ y \ 3×1 , A \ 3× 2 , rk A = 2, x \ 2×1 . First, let us compute ( x A , y A , i A ,|| i A ||) I -LESS of ( x, y , i,|| i ||) . A. Bjerhammar’s left inverse supplies us with
128
3 The second problem of algebraic regression
ª y1 º ª 2 1 1º « » x A = A A y = ( AcA) 1 Acy = 13 « » « y2 » ¬ 1 2 1¼ « » ¬ y3 ¼ ª hDE º ª 2 y y2 y3 º xA = « » = 13 « 1 » h ¬ y1 + 2 y2 y3 ¼ ¬ EJ ¼ A ª 2 1 1º 1 1« c c y A = AxA = AA y = A( A A) A y = 3 « 1 2 1»» y «¬ 1 1 2 »¼ A
ª 2 y1 y2 y3 º y A = «« y1 + 2 y2 y3 »» «¬ y1 y2 + 2 y3 »¼ 1 3
(
)
i A = y Ax A = I n AA A y = ª¬I n A ( A cA) 1 A cº¼ y ª1 1 1º i A = ««1 1 1»» y = «¬1 1 1»¼ 1 3
1 3
ª y1 + y2 + y3 º «y + y + y » 2 3» « 1 «¬ y1 + y2 + y3 »¼
|| i A ||2 = y c(I n AA A )y = y c ª¬I n A( AcA) 1 A cº¼ y ª1 1 1º ª y1 º || i A ||2 = [ y1 , y2 , y3 ] 13 ««1 1 1»» «« y2 »» «¬1 1 1»¼ «¬ y3 »¼ || i A ||2 = 13 ( y12 + y22 + y32 + 2 y1 y2 + 2 y2 y3 + 2 y3 y1 ) . Second, we identify the orthogonality of A and B . A is given, finding B is the problem of horizontal rank partitioning of the projection matrix. ª1 1 1º G A := I n H y = I n AA = I n A ( A cA ) A c = ««1 1 1»» \ 3×3 , «¬1 1 1»¼ A
1
1 3
with special reference to the “hat matrix H y := A( A cA) 1 Ac ”. The diagonal elements of G A are of special interest for robust approximation. They amount to the uniform values hii = 13 (2, 2, 2), ( gii )A = (1 hii ) = 13 (1,1,1).
3-2 The least squares solution: “LESS”
129
Note
(
)
det G A = det I n AA A = 0, rk(I n AA A ) = n m = 1 ª1 1 1º « » G A = ª¬I 3 AA º¼ = [ B, C] = «1 1 1» «¬1 1 1»¼ A
1 3
B \ 3×1 , C \ 3× 2 . The holonomity condition hDE + hEJ + hJD = 0 is reestablished by the orthogonality of BcA = 0 . ª1 0º BcA = 0 [1,1,1] «« 0 1 »» = [ 0, 0] . «¬ 1 1»¼ 1 3
ƅ The G y -orthogonality condition of the matrices A and B has been successfully used by G. Kampmann (1992, 1994, 1997), G. Kampmann and B. Krause (1996, 1997), R. Jurisch, G. Kampmann and B. Krause (1997), R. Jurisch and G. Kampmann (1997, 1998, 2001 a, b, 2002), G. Kampmann and B. Renner (1999), R. Jurisch, G. Kampmann and J. Linke (1999 a, b, c, 2000) in order to balance the observational weights, to robustify G y -LESS and to identify outliers. The A Grassmann- Plücker coordinates which span the normal space R ( A ) will be discussed in Chapter 10 when we introduce condition equations. 3-224 Fuzzy sets While so far we have used geometry to classify the objective functions as well as the observation space Y, an alternative concept considers observations as elements of the set Y = [ y1 ," , yn ] . The elements of the set get certain attributes which make them fuzzy sets. In short, we supply some references on “fuzzy sets”, namely G. Alefeld and J. Herzberger (1983), B. F. Arnold and P. Stahlecker (1999), A. Chaturvedi and A. T. K. Wan (1999), S. M. Guu, Y. Y. Lur and C. T. Pang (2001), H. Jshibuchi, K. Nozaki and H. Tanaka (1992), H. Jshibuchi, K. Nozaki, N. Yamamoto and H. Tanaka (1995), B. Kosko (1992), H. Kutterer (1994, 1999), V. Ravi, P. J. Reddy and H. J. Zimmermann (2000), V. Ravi and H. J. Zimmermann (2000), S. Wang, T. Shi and C. Wu (2001), L. Zadch (1965), H. J. Zimmermann (1991). 3-23 G x -LESS and its generalized inverse A more formal version of the generalized inverse which is characteristic for G y LESS is presented by
130
3 The second problem of algebraic regression
Lemma 3.7 (characterization of G y -LESS): x A = Ly is I-LESS of the inconsistent system of linear equations (3.1) Ax + i = y , rk A = m , (or y R ( A) ) if and only if the matrix L \ m× n fulfils ª ALA = A « AL = ( AL)c. ¬
(3.42)
The matrix L is the unique A1,2,3 generalized inverse, also called left inverse A L . x A = Ly is G y -LESS of the inconsistent system of linear equations (3.1) Ax + i = y , rk A = m (or y R ( A) ) if and only if the matrix L fulfils ª G y ALA = G y A «G AL = (G AL)c. y ¬ y
(3.43)
The matrix L is the G y weighted A1,2,3 generalized inverse, in short A A , also called weighted left inverse. : Proof : According to the theory of the generalized inverse presented in Appendix A x A = Ly is G y -LESS of (3.1) if and only if A cG y AL = AcG y is fulfilled. Indeed A cG y AL = AcG y is equivalent to the two conditions G y ALA = G y A and G y AL = (G y AL)c . For a proof of such a statement multiply A cG y AL = AcG y left by Lc and receive LcA cG y AL = LcAcG y . The left-hand side of such a matrix identity is a symmetric matrix. In consequence, the right-hand side has to be symmetric, too. When applying the central symmetry condition to A cG y AL = AcG y
or
G y A = LcAcG y A ,
we are led to G y AL = LcA cG y AL = (G y AL)c , what had to be proven. ? How to prove uniqueness of A1,2,3 = A A ? Let us fulfil G y Ax A by G y AL1 y = G y AL1 AL1 y = L1c AcG y AL1 y = L1c AcL1c AcG y y =
3-2 The least squares solution: “LESS”
131
= L1c A cLc2 A cG y y = L1c A cG y L 2 y = G y AL1 AL 2 y = G y AL 2 y , in particular by two arbitrary matrices L1 and L 2 , respectively, which fulfil (i) G y ALA = G y A as well as (ii) G y AL = (G y AL)c . Indeed we have derived one result irrespective of L1 or L 2 .
ƅ If the matrix of the metric G y of the observation space is positive definite, we can prove the following duality Theorem 3.8 (duality): Let the matrix of the metric G x of the observation space be positive definite. Then x A = Ly is G y -LESS of the linear model (3.1) for any observation vector y \ n , if x ~m = Lcy ~ is G y1 -MINOS of the linear model y ~ = A cx ~ for all m × 1 columns y ~ R ( A c) . : Proof : If G y is positive definite, there exists the inverse matrix G y1 . (3.43) can be transformed into the equivalent condition A c = A cLcA
and
G y1LcAc = (G y1LcAc)c ,
which is equivalent to (1.33). 3-24 Eigenvalue decomposition of G y -LESS: canonical LESS For the system analysis of an inverse problem the eigenspace analysis and eigenspace synthesis of x A G y -LESS of x is very useful and gives some peculiar insight into a dynamical system. Accordingly we are confronted with the problem to construct “canonical LESS”, also called the eigenvalue decomposition of G y -LESS. First, we refer to the canonical representation of the parameter space X as well as the observation space introduced to you in the first Chapter, Box 1.8 and Box 1.9. But here we add by means of Box 3.13 the comparison of the general bases versus the orthonormal bases spanning the parameter space X as well as the observation space Y . In addition, we refer to Definition 1.5 and Lemma 1.6 where the adjoint operator A # has been introduced and represented. Box 3.13: General bases versus orthonormal bases spanning the parameter space X as well as the observation space Y
132
3 The second problem of algebraic regression
“left”
“right”
“parameter space”
“observation space”
“general left base”
“general right base”
span{a1 ,..., am } = X
Y = span{b1 ,..., bn }
: matrix of the metric :
: matrix of the metric : bbc = G y
aac = G x
(3.44)
(3.45)
“orthonormal left base”
“orthonormal right base”
span{e1x ,..., emx } = X
Y = span{e1y ,..., eny }
: matrix of the metric :
: matrix of the metric :
e x ecx = I m
e y ecy = I n
“base transformation”
“base transformation”
a = ȁ x 9e x
1 2
b = ȁ y Ue y
versus
versus
(3.46)
(3.48)
(3.47)
1 2
- 12
(3.49)
- 12
e x = V cȁ x a
e y = Ucȁ y b
span{e1x ,..., e xm } = X
Y = span{e1y ,..., e yn } .
(3.50)
(3.51)
Second, we are going to solve the overdetermined system of {y = Ax | A \ n× m , rk A = n, n > m} by introducing
• •
the eigenspace of the rectangular matrix A \ n× m of rank r := rk A = m , n > m : A 6 A* the left and right canonical coordinates: x o x* , y o y *
as supported by Box 3.14. The transformations x 6 x* (3.52), y 6 y * (3.53) from the original coordinates ( x1 ,..., xm ) to the canonical coordinates ( x1* ,..., xm* ) , the left star coordinates, as well as from the original coordinates ( y1 ,..., yn ) to the canonical coordinates ( y1* ,..., yn* ) , the right star coordinates, are polar decompositions: a rotation {U, V} is followed by a general stretch {G y , G x } . Those root matrices are generated by product decompositions of type G y = (G y )cG y as well as G x = (G x )cG x . Let us substitute the inverse transformations (3.54) x* 6 x = G x Vx* and (3.55) y * 6 y = G y Uy * into the system of linear equa1 2
1 2
1 2
1 2
1 2
1 2
1 2
1 2
3-2 The least squares solution: “LESS”
133
tions (3.1) y = Ax + i or its dual (3.57) y * = A* x* + i* . Such an operation leads us to (3.58) y * = f x* as well as (3.59) y = f ( x ) . Subject to the orthonormality conditions (3.60) U cU = I n and (3.61) V cV = I m we have generated the left– right eigenspace analysis (3.62)
( )
ªȁº ȁ* = « » ¬0¼ subject to the horizontal rank partitioning of the matrix U = [ U1 , U 2 ] . Alternatively, the left–right eigenspace synthesis (3.63) ªȁº A = G y [ U1 , U 2 ] « » V cG x ¬0¼ 1 2
1 2
- 12
is based upon the left matrix (3.64) L := G y U and the right matrix (3.65) R := G x V . Indeed the left matrix L by means of (3.66) LLc = G -1y reconstructs the inverse matrix of the metric of the observation space Y . Similarly, the right matrix R by means of (3.67) RR c = G -1x generates the inverse matrix of the metric of the parameter space X . In terms of “ L , R ” we have summarized the eigenvalue decompositions (3.68)-(3.73). Such an eigenvalue decomposition helps us to canonically invert y * = A* x* + i* by means of (3.74), (3.75), namely the rank partitioning of the canonical observation vector y * into y1* \ r×1 and y *2 \ ( n r )×1 to determine x*A = ȁ -1 y1* leaving y *2 “unrecognized”. Next we shall proof i1* = 0 if i1* is LESS. 1 2
Box 3.14: Canonical representation, overdetermined system of linear equations “parameter space X ” (3.52) x* = V cG x x
“observation space Y ” y * = U cG y y (3.53)
versus
1 2
1 2
and - 12 x
and
x = G Vx
(3.54)
- 12
y = G y Uy *
*
(3.55)
“overdetermined system of linear equations” {y = Ax + i | A \ n× m , rk A = m, n > m} y = Ax + i
(3.56) - 12
- 12
- 12
G y Uy * = AG x Vx* + G y Ui*
(
1 2
y * = A * x* + i *
versus
- 12
)
(3.58) y * = UcG y AG x V x* + i*
1 2
1 2
(3.57) 1 2
U cG y y = A* V cG x x + U cG y i
(
1 2
1 2
)
y = G y UA* V cG x x + i (3.59)
134
3 The second problem of algebraic regression
subject to
subject to
U cU = UUc = I n
(3.60)
V cV = VV c = I m
versus
(3.61)
“left and right eigenspace” “left-right eigenspace analysis”
“left-right eigenspace synthesis”
ª Uc º ªȁº ªȁº A = G y [ U1 , U 2 ] « » V cG x (3.63) (3.62) A* = « 1 » G y AG x V = « » ¬0¼ ¬0¼ ¬ U c2 ¼ “dimension identities” 1 2
1 2
1 2
1 2
ȁ \ r × r , U1 \ n × r 0 \ ( n r )× r , U 2 \ n × ( n r ) , V \ r × r r := rk A = m, n > m “right eigenspace”
“left eigenspace” - 12
1 2
(3.64) L := G y U L-1 = U cG y 12
- 12
1 2
versus R := G x V R -1 = V cG x (3.65)
12
L1 := G y U1 , L 2 := G y U 2 -1 -1 -1 (3.66) LLc = G -y1 (L-1 )cL-1 = G y versus RR c = G x (R )cR = G x (3.67)
ª L º ª Uc º L1 = « 1 » G y =: « 1 » ¬ U c2 ¼ ¬L 2 ¼ 1 2
(3.68)
A = LA* R -1
(3.70) A = [ L1 , L 2 ] A # R 1
(3.72)
A* = L-1 AR
versus
ª A # AL1 = L1 ȁ 2 « # «¬ A AL 2 = 0
versus
ª ȁ º ª L º A* = « » = « 1 » AR (3.71) ¬ 0 ¼ ¬L 2 ¼
versus
AA # R = Rȁ 2
“overdetermined system of linear equations solved in canonical coordinates” (3.74)
(3.69)
ªi* º ª y * º ªȁº y * = A* x* + i* = « » x* + « 1* » = « *1 » ¬0¼ ¬«i 2 ¼» ¬ y 2 ¼ “dimension identities”
(3.73)
3-2 The least squares solution: “LESS”
135
y *1 \ r ×1 , y *2 \ ( n r )×1 , i*1 \ r ×1 , i*2 \ ( n r )×1 y *1 = ȁx* + i*1 x* = ȁ 1 (y *1 i*1 )
(3.75)
“if i* is LESS, then x*A = ȁ 1 y *1 , i*1 = 0 ”. Consult the commutative diagram of Figure 3.5 for a shorthand summary of the newly introduced transformations of coordinates, both of the parameter space X as well as the observation space Y . Third, we prepare ourselves for LESS of the overdetermined system of linear equations {y = Ax + i | A \ n×m , rk A = m, n > m,|| i ||G2 = min} y
by introducing Lemma 3.9, namely the eigenvalue-eigencolumn equations of the matrices A # A and AA # , respectively, as well as Lemma 3.11, our basic result of “canonical LESS”, subsequently completed by proofs. Throughout we refer to the adjoint operator which has been introduced by Definition 1.5 and Lemma 1.6. X x
A
y R(A) Y
1 2
1 2
U cG y
V cG x
X x*
*
y* R(A* ) Y
A Figure 3.5:Commutative diagram of coordinate transformations Lemma 3.9
(eigenspace analysis versus eigenspace synthesis of the matrix {A \ n× m , r := rk A = m < n} )
The pair of matrices {L, R} for the eigenspace analysis and the eigenspace synthesis of the rectangular matrix A \ n× m of rank r := rk A = m < n , namely A* = L-1 AR or ª ȁ º ª L º A* = « » = « 1 » AR ¬ 0 ¼ ¬L 2 ¼
versus
A = LA* R -1 or
versus
ªȁº A = [ L1 , L 2 ] « » R ¬0¼
136
3 The second problem of algebraic regression
are determined by the eigenvalue–eigencolumn equations (eigenspace equations) of the matrices A # A and AA # , respectively, namely A # AR = Rȁ 2
ª AA # L1 = L1 ȁ 2 « # ¬ AA L 2 = 0
versus subject to
ªO12 … 0 º « » ȁ 2 = « # % # » , ȁ = Diag + O12 ,..., + Or2 . « 0 " Or2 » ¬ ¼
)
(
Let us prove first A # AR = Rȁ 2 , second A # AL1 = L1 ȁ 2 , AA # L 2 = 0 . (i) A # AR = Rȁ 2 A # AR = G -1x AcG y AR = ª Uc º ªȁº = G -1xG x V [ ȁ, 0c] « 1 » (G y )cG y G y [ U1 , U 2 ] « » V cG x G x V ¬0 ¼ ¬ U c2 ¼ ªȁº A # AR = G x V [ ȁ, 0c] « » = G x Vȁ 2 0 ¬ ¼ 1 2
1 2
1 2
1 2
1 2
1 2
1 2
A # AR = Rȁ 2 .
ƅ
(ii) AA # L1 = L1 ȁ 2 , AA # L 2 = 0 AA # L = AG -1x A cG y L = ª Uc º ªȁº = G y [ U1 , U 2 ] « » V cG x G -1x G x V [ ȁ, 0c] « 1 » (G y )cG y G y [ U1 , U 2 ] c ¬0¼ ¬U2 ¼ ª U c U U1c U 2 º ªȁº AA # L = [ L1 , L 2 ] « » [ ȁ, 0c] « 1 1 » ¬0¼ ¬ U c2 U1 U c2 U 2 ¼ 1 2
1 2
1 2
ªȁ2 AA # L = [ L1 , L 2 ] « ¬0
1 2
0c º ª I r »« 0¼¬0
1 2
0 º I n-r »¼
AA # [ L1 , L 2 ] = ª¬ L1 ȁ 2 , 0 º¼ , AA # L1 = L1 ȁ 2 , AA # L 2 = 0.
ƅ
The pair of eigensystems {A # AR = Rȁ 2 , AA # [L1 , L 2 ] = ª¬L1 ȁ 2 ,0º¼} is unfortunately based upon non-symmetric matrices AA # = AG -1x A cG y and A # A = G -1x AcG y A which make the left and right eigenspace analysis numerically more complex. It appears that we are forced to use the Arnoldi method rather than the more efficient Lanczos method used for symmetric matrices.
3-2 The least squares solution: “LESS”
137
In this situation we look out for an alternative. Actually as soon as we substitute - 12
- 12
{L, R} by {G y U, G x V} - 12
into the pair of eigensystems and consequently multiply AA # L by G x , we achieve a pair of eigensystems identified in Corollary 3.10 relying on symmetric matrices. In addition, such a pair of eigensystems produces the canonical base, namely orthonormal eigencolumns. Corollary 3.10 (symmetric pair of eigensystems): The pair of eigensystems 1 2
1 2
- 12
- 12
(3.76) G y AG -1x A c(G y )cU1 = ȁ 2 U1 versus (G x )cA cG y AG x V = Vȁ 2 (3.77) 1 2
- 12 y
(3.78) | G y AG Ac(G )c Ȝ I |= 0 versus -1 x
2 i n
- 12
- 12
| (G x )cA cG y AG x Ȝ 2j I r |= 0 (3.79)
is based upon symmetric matrices. The left and right eigencolumns are orthogonal. Such a procedure requires two factorizations, 1 2
1 2
- 12
- 12
G x = (G x )cG x , G -1x = G x (G x )c
and
1 2
1 2
- 12
- 12
G y = (G y )cG y , G -1y = G y (G y )c
via Choleskifactorization or eigenvalue decomposition of the matrices G x and Gy . Lemma 3.11 (canonical LESS): Let y * = A* x* + i* be a canonical representation of the overdetermined system of linear equations {y = Ax + i | A \ n× m , r := rk A = m, n > m} . Then the rank partitioning of y * = ª¬(y *1 )c, (y *2 )cº¼c leads to the canonical unknown vector (3.80)
ª y* º ª y* º y * \ r ×1 x*A = ª¬ ȁ -1 , 0 º¼ « *1 » = ȁ -1 y *1 , y * = « *1 » , * 1 ( n r )×1 ¬y 2 ¼ ¬y 2 ¼ y 2 \ and to the canonical vector of inconsistency
(3.82)
ª i* º ª y* º ª ȁ º i* = 0 i*A = « *1 » := « *1 » « » ȁ -1 y *1 or *1 * i2 = y2 ¬i 2 ¼ A ¬ y 2 ¼ A ¬ 0 ¼
(3.81)
138
3 The second problem of algebraic regression
of type G y -LESS. In terms if the original coordinates x X a canonical representation of G y -LESS is ª Uc º x A = G x V ª¬ ȁ -1 , 0 º¼ « 1 » G y y ¬ U c2 ¼ 1 2
1 2
- 12
(3.83)
1 2
x A = G x Vȁ -1 U1c G y y = Rȁ -1 L-1 y.
(3.84)
x A = A A y is built on the canonical (G x , G y ) weighted right inverse. For the proof we depart from G y -LESS (3.11) and replace the matrix A \ n× m by its canonical representation, namely by eigenspace synthesis. -1 x A = ( A cG y A ) A cG y y º » » ªȁº A = G y [ U1 , U 2 ] « » V cG x » ¬0¼ ¼» 1 2
1 2
ª Uc º ªȁº A cG y A = (G x )cV [ ȁ, 0] « 1 » (G y )cG y G y [ U1 , U 2 ] « » V cG x ¬0¼ ¬ U c2 ¼ 1 2
1 2
1 2
1 2
A cG y A = (G x )cVȁ 2 V cG x , ( AcG y A ) = G x Vȁ -2 V c(G x )c 1 2
-1
1 2
- 12
- 12
ª Uc º x A = G x Vȁ 2 V c(G x )c(G x )cV [ ȁ, 0] « 1 » (G y )cG y y ¬ U c2 ¼ 1 2
1 2
1 2
1 2
ª Uc º x A = G x V ª¬ ȁ -1 , 0 º¼ « 1 » G y y ¬ U c2 ¼ 1 2
1 2
- 12
1 2
x A = G x Vȁ -1 U1c G y y = A -A y - 12
1 2
A A- = G x Vȁ -1 U1c G y A1,2,3 G y
( G y weighted reflexive inverse) º ª Uc º x*A = V cG x x A = ȁ -1 U1c G y y = ª¬ ȁ -1 , 0 º¼ « 1 » G y y » ¬ U c2 ¼ » » * ª y º ª Uc º » y * = « *1 » = « 1 » G y y c U »¼ y ¬ 2¼ ¬ 2¼ 1 2
1 2
1 2
1 2
ª y* º x*A = ª¬ ȁ -1 , 0 º¼ « *1 » = ȁ -1 y 1* . ¬y 2 ¼
3-2 The least squares solution: “LESS”
139
Thus we have proven the canonical inversion formula. The proof for the canonical representation of the vector of inconsistency is a consequence of the rank partitioning ª i* º ª y* º ª ȁ º i* , y * \ r ×1 i*l = « 1* » := « 1* » « » x*A , * 1 * 1 ( n r )×1 , i2 , y2 \ ¬i 2 ¼ A ¬ y 2 ¼ ¬ 0 ¼ ª i* º ª y * º ª ȁ º ª0º i*A = « 1* » = « 1* » « » ȁ -1 y1* = « * » . ¬y 2 ¼ ¬i 2 ¼ A ¬ y 2 ¼ ¬ 0 ¼
ƅ The important result of x*A based on the canonical G y -LESS of {y * = A* x* + i* | A* \ n× m , rk A* = rk A = m, n > m} needs a comment. The rank partitioning of the canonical observation vector y * , namely y1* \ r , y *2 \ n r again paved the way for an interpretation. First, we appreciate the simple “direct inversion” x*A = ȁ -1 y1* , ȁ = Diag + O12 ,..., + Or2 , for instance
)
(
ª x º ª Ȝ1-1y1* º « » « » « ... » = « ... » . « x*m » « Ȝ -1r y *r » ¬ ¼A ¬ ¼ Second, i1* = 0 , eliminates all elements of the vector of canonical inconsistencies, for instance ª¬i1* ,..., ir* º¼ c = 0 , while i*2 = y *2 identifies the deficient elements of the A vector of canonical inconsistencies with the vector of canonical observations for * * instance ª¬ir +1 ,..., in º¼ c = ª¬ yr*+1 ,..., yn* º¼ c . Finally, enjoy the commutative diagram A A of Figure 3.6 illustrating our previously introduced transformations of type LESS and canonical LESS, by means of A A and A* , respectively. * 1
Y y
AA
1 2
A
xA X
1 2
UcG y
Y y*
( )
V cG x
( A* )A
x*A X
Figure 3.6: Commutative diagram of inverse coordinate transformations A first example is canonical LESS of the Front Page Example by G y = I 3 , Gx = I2 .
140
3 The second problem of algebraic regression
ª1 º ª1 1 º ª i1 º ªx º y = Ax + i : «« 2 »» = ««1 2 »» « 1 » + ««i2 »» , r := rk A = 2 x «¬ 4 »¼ «¬1 3 »¼ ¬ 2 ¼ «¬ i3 »¼ left eigenspace
right eigenspace
AA # U1 = AAcU1 = U1 ȁ 2
A # AV = A cAV = Vȁ 2
AA U 2 = AAcU 2 = 0 #
ª2 3 4 º AA c = «« 3 5 7 »» «¬ 4 7 10 »¼
ª3 6 º «6 14 » = A cA ¬ ¼ eigenvalues
| AAc Oi2 I 3 |= 0
| A cA O j2 I 2 |= 0
i {1, 2,3}
j {1, 2}
O12 =
17 1 17 1 + 265, O22 = 265, O32 = 0 2 2 2 2
left eigencolumns ª 2 O12 « (1st) « 3 « 4 ¬
right eigencolumns
3 4 º ª u11 º » 2 5 O1 7 » ««u21 »» = 0 7 10 O12 »¼ «¬u31 »¼
ª3 O12 6 º ª v11 º (1st) « »« » = 0 14 O12 ¼ ¬ v 21 ¼ ¬ 6
subject to
subject to
2 u112 + u21 + u312 = 1
v112 + v 221 = 1
ª(2 O12 )u11 + 3u21 + 4u31 = 0 « 2 ¬ 3u11 + (5 O1 )u21 + 7u31 = 0
versus
(3 O12 ) v11 + 6 v 21 = 0
36 72 ª 2 « v11 = 36 + (3 O 2 ) 2 = 265 + 11 265 1 « 2 « 2 (3 O1 ) 2 193 + 11 265 = « v 21 = 2 2 36 + (3 O1 ) «¬ 265 + 11 265 ª u112 º 1 « 2» «u21 » = (1 + 4O 2 ) 2 + (2 7O 2 ) 2 + (1 7O 2 + O 4 ) 2 1 1 1 1 « 2» ¬u31 ¼
ª (1 + 4O12 ) 2 º « » 2 2 « (2 7O1 ) » 2 4 2» « ¬ (1 7O1 + O1 ) ¼
3-2 The least squares solution: “LESS”
141
(
)
(
)
ª 35 + 2 265 2 º « » ª u112 º « 2» 2 « 2» «§ 115 + 7 265 · » ¨ ¸ » «u21 » = 43725 + 2685 265 «© 2 2 ¹ 2 » «u31 « » ¬ ¼ 2 « 80 + 5 265 » »¼ ¬« ª3 O22 (2nd) « ¬ 7
ª 2 O22 « (2nd) « 3 « 5 ¬
7 º ª u12 º »« » = 0 21 O22 ¼ ¬u22 ¼
subject to 2 u122 + u22 + u322 = 1
3 5 º ª v12 º » 2 5 O2 9 » «« v 22 »» = 0 9 17 O22 »¼ «¬ v32 »¼ subject to 2 v122 + v 22 =1
ª(2 O22 )u12 + 3u22 + 4u32 = 0 « 2 ¬ 3u12 + (5 O2 )u22 + 7u32 = 0
versus
(3 O22 ) v12 + 6 v 22 = 0
36 72 ª 2 « v12 = 36 + (3 O 2 ) 2 = 265 11 265 2 « 2 2 « 2 (3 O2 ) 193 11 265 = « v 22 = 2 2 36 + (3 ) O 265 11 265 ¬« 2 ª u122 º 1 « 2» «u22 » = (1 + 4O 2 ) 2 + (2 7O 2 ) 2 + (1 7O 2 + O 4 ) 2 2 2 2 2 2 » «u32 ¬ ¼
ª (1 + 4O22 ) 2 º « » 2 2 « (2 7O2 ) » « (1 7O22 + O24 ) 2 » ¬ ¼
ª (35 2 265) 2 º ª u122 º « » 2 115 7 « 2» 2» « ( 265) «u22 » = » 43725 2685 265 « 2 2 2 » «u32 « » 2 ¬ ¼ ¬« (80 5 265) »¼ ª 2 3 4 º ª u13 º (3rd) «« 3 5 7 »» ««u23 »» = 0 «¬ 4 7 10 »¼ «¬u33 »¼
subject to
2 u132 + u23 + u332 = 1
2u13 + 3u23 + 4u33 = 0 3u13 + 5u23 + 7u33 = 0 ª u13 º ª 2 3º ª u13 º ª 4 º ª 5 3º ª 4º « 3 5» «u » = « 7 » u33 «u » = « 3 2 » « 7 » u33 ¬ ¼ ¬ 23 ¼ ¬ ¼ ¬ ¼¬ ¼ ¬ 23 ¼
142
3 The second problem of algebraic regression
u13 = +u33 , u23 = 2u33 1 2 1 2 u132 = , u23 = , u332 = . 6 3 6 There are four combinatorial solutions to generate square roots. ª u11 u12 «u « 21 u22 «¬u31 u32 ª v11 «v ¬ 21
2 ª u13 º « ± u11 2 u23 »» = « ± u21 « u33 »¼ « ± u 2 31 ¬ 2 v12 º ª ± v11 « = v 22 »¼ « ± v 2 21 ¬
± u122 2 ± u22 2 ± u32
± u132 º » 2 » ± u23 » 2 » ± u33 ¼
2 º ± v12 ». ± v 222 »¼
Here we have chosen the one with the positive sign exclusively. In summary, the eigenspace analysis gave the result as follows. § 17 + 265 17 265 ȁ = Diag ¨ , ¨ 2 2 © ª « « « « U=« « « « « ¬
2 2 2 2
35 + 2 265 43725 + 2685 265 115 + 7 265 43725 + 2685 265 80 + 5 265 43725 + 2685 265
· ¸ ¸ ¹
35 2 265
2
43725 2685 265 2 2
115 7 265 43725 2685 265 80 5 265
2
43725 2685 265
72 ª « « 265 + 11 265 V=« « 193 + 11 265 « ¬ 265 + 11 265
72
º » 265 11 265 » ». 193 11 265 » » 265 11 265 ¼
º 1 » 6» 6 » 1 » 6 = [ U1 , U 2 ] 3 » » 1 » 6 6 » » ¼
3-3 Case study
143
3-3 Case study: Partial redundancies, latent conditions, high leverage points versus break points, direct and inverse Grassmann coordinates, Plücker coordinates This case study has various targets. First we aim at a canonical analysis of the hat matrices Hx and Hy for a simple linear model with a leverage point. The impact of a high leverage point is studied in all detail. Partial redundancies are introduced and interpreted in their peculiar role of weighting observations. Second, preparatory in nature, we briefly introduce multilinear algebra, the operations "join and meet", namely the Hodge star operator. Third, we go "from A to B": Given the columns space R ( A) = G m , n ( A) , identified as the Grassmann space G m , n R n of the matrix A R n× m , n > m, rk A = m , we construct the column space R (B) = R A ( A) = G n m , n R n of the matrix B which agrees to the orthogonal column space R A ( A) of the matrix A. R A ( A) is identified as Grassmann space G n m , n R n and is covered by Grassmann coordinates, also called Plücker coordinates pij. The matrix B, alternatively the Grassmann coordinates (Plücker coordinates), constitute the latent restrictions, also called latent condition equations, which control parameter adjustment and lead to a proper choice of observational weights. Fourth, we reverse our path: we go “from B to A”: Given the column space R (B) of the matrix of restrictions B RA × n , A < n , rk B = A we construct the column space R A (B) = R ( A) R n , the orthogonal column space of the matrix B which is apex to the column space R (A) of the matrix A. The matrix A, alternatively the Grassmann coordinates (Plücker coordinates) of the matrix B constitute the latent parametric equations which are “behind a conditional adjustment”. Fifth, we break-up the linear model into pieces, and introduce the notion of break points and their determination. The present analysis of partial redundancies and latent restrictions has been pioneered by G. Kampmann (1992), R. Jurisch, G. Kampmann and J. Linke (1999a, b) as well as R. Jurisch and G. Kampmann (2002 a, b). Additional useful references are D. W. Behmken and N. R. Draper (1972), S. Chatterjee and A. S. Hadi (1988), R. D. Cook and S. Weisberg (1982). Multilinear algebra, the operations “join and meet” and the Hodge star operator are reviewed in W. Hodge and D. Pedoe (1968), C. Macinnes (1999), S. Morgera (1992), W. Neutsch (1995), B. F. Doolin and C. F. Martin (1990). A sample reference for break point synthesis is C. H. Mueller (1998), N. M. Neykov and C. H. Mueller (2003) and D. Tasche (2003). 3-31 Canonical analysis of the hat matrix, partial redundancies, high leverage points A beautiful example for the power of eigenspace synthesis is the least squares fit of a straight line to a set of observation: Let us assume that we have observed a dynamical system y(t) which is represented by a polynomial of degree one with respect to time t.
144
3 The second problem of algebraic regression
y (ti ) = 1i x1 + ti x2 i {1," , n} . Due to y • (t ) = x2 it is a dynamical system with constant velocity or constant first derivative with result to time t0. The unknown polynomial coefficients are collected in the column array x = [ x1 , x2 ]c, x X = R 2 , dim X = 2 and constitute the coordinates of the two-dimensional parameter space X . For this example we choose n = 4 observations, namely y = [ y (t1 ), y (t2 ), y (t3 ), y (t4 )]c , y Y = R 4 , dim Y = 4 . The samples of the polynomial are taken at t1 = 1, t2 = 2, t3 = 3 and t4 = a. With such a choice of t4 we aim at modeling the behavior of high leverage points, e.g. a >> (t1 , t2 , t3 ) or a o f , illustrated by Figure 3.7. y4
*
y3
*
y (t ) y2 y1
* t1 = 1
* t2 = 2
t3 = 3
t4 = a
t Figure 3.7: Graph of the function y(t), high leverage point t4=a Box 3.15 summarizes the right eigenspace analysis of the hat matrix H y : =A(AcA)- Ac . First, we have computed the spectrum of A cA and ( A cA) 1 for the given matrix A R 4× 2 , namely the eigenvalues squared 2 O1,2 = 59 ± 3261 . Note the leverage point t4 = a = 10. Second, we computed the right eigencolumns v1 and v2 which constitute the orthonormal matrix V SO(2) . The angular representation of the orthonormal matrix V SO(2) follows: Third, we take advantage of the sine-cosine representation (3.85) V SO(2) , the special orthonormal group over R2. Indeed, we find the angular parameter J = 81o53ƍ25.4Ǝ. Fourth, we are going to represent the hat matrix Hy in terms of the angular parameter namely (3.86) – (3.89). In this way, the general representation (3.90) is obtained, illustrated by four cases. (3.86) is a special case of the general angular representation (3.90) of the hat matrix Hy. Five, we sum up the canonical representation AV cȁ 2 V cA c (3.91), of the hat matrix Hy, also called right eigenspace synthesis. Note the rank of the hat matrix, namely rk H y = rk A = m = 2 , as well as the peculiar fourth adjusted observation 1 yˆ 4 = y4 (I LESS) = ( 11 y1 + y2 + 13 y3 + 97 y4 ) , 100 which highlights the weight of the leverage point t4: This analysis will be more pronounced if we go through the same type of right eigenspace synthesis for the leverage point t4 = a, ao f , outlined in Box 3.18.
145
3-3 Case study
Box 3.15 Right eigenspace analysis of a linear model of an univariate polynomial of degree one - high leverage point a =10 “Hat matrix H y = A( A cA) 1 A = AVȁ 2 V cAc ” ª A cAV = Vȁ 2 « right eigenspace analysis: «subject to « VV c = I 2 ¬ ª1 1 º «1 2 » » , A cA = ª 4 16 º , ( AA) 1 = 1 ª 57 8º A := « « » «1 3 » 100 ¬« 8 2 ¼» ¬16 114¼ « » ¬1 10 ¼ spec( A cA) = {O12 , O 22 } : A cA O 2j I 2 = 0, j {1, 2} 4 O2 16 = 0 O 4 118O 2 + 200 = 0 2 16 114 O 2 O1,2 = 59 ± 3281 = 59 ± 57.26 = 0
spec( A cA ) = {O12 , O 22 } = {116.28, 1.72} versus spec( A cA) 1 = {
1 1 , } = {8.60 *103 , 0.58} O12 O 22
ª( A cA O 2j I 2 )V = 0 « right eigencolumn analysis: «subject to « VV c = I 2 ¬ ªv º 2 =1 (1st) ( A cA O12 I ) « 11 » = 0 subject to v112 + v21 v ¬ 21 ¼ (4 O12 )v11 + 16v21 = 0 º » 2 v112 + v21 =1 »¼
146
3 The second problem of algebraic regression
16
v11 = + v112 =
2 v21 = + v21 =
256 + (4 O12 ) 2 4 O12 256 + (4 O12 ) 2
= 0.141
= 0.990
ªv º 2 (2nd) ( A cA O 22 I 2 ) « 12 » = 0 subject to v122 + v22 =1 v ¬ 22 ¼ (4 O 22 )v12 + 16v22 = 0 º » 2 v122 + v22 =1 ¼ v12 = + v122 =
2 v22 = + v22 =
16 256 + (4 O 22 ) 2 4 O 22 256 + (4 O 22 ) 2
= 0.990
= 0.141
spec( A cA) = {116.28, 1.72} right eigenspace: spec( A cA) 1 = {8.60 *103 , 0.58} ªv V = « 11 ¬ v21
v12 º ª 0.141 0.990 º = SO(2) v22 »¼ «¬ 0.990 0.141»¼
V SO(2) := {V R 2×2 VV c = I 2 , V = 1} “Angular representation of V SO(2) ” ª cos J sin J º ª 0.141 0.990º V=« »=« » ¬ sin J cos J ¼ ¬ 0.990 0.141¼
(3.85)
sin J = 0.990, cos J = 0.141, tan J = 7.021 J=81o.890,386 = 81o53’25.4” hat matrix H y = A( A cA) 1 Ac = AVȁ 2 V cAc 1 1 1 ª 1 º 2 2 « O 2 cos J + O 2 sin J ( O 2 + O 2 ) sin J cos J » 2 1 2 » (3.86) ( A cA) 1 = V/ 2 V = « 1 « 1 » 1 1 1 2 2 sin J + 2 cos J » «( 2 + 2 ) sin J cos J 2 O1 O2 ¬ O1 O 2 ¼
147
3-3 Case study
( A cA) j 1j = 1 2
m=2
1
¦O j3 =1
cos J j j cos J j
2 j3
1 3
(3.87)
2 j3
subject to m=2
VV c = I 2 ~
¦ cos J j3 =1
j1 j3
cos J j
2 j3
= Gj j
(3.88)
1 2
case 1: j1=1, j2=1:
case 2: j1=1, j2=2:
cos 2 J11 + cos 2 J12 = 1
cos J11 cos J 21 + cos J12 cos J 22 = 0
(cos 2 J + sin 2 J = 1)
( cos J sin J + sin J cos J = 0)
case 3: j1=2, j2=1:
case 4: j1=2, j2=2:
cos J 21 cos J11 + cos J 22 cos J12 = 0
cos 2 J 21 + cos 2 J 22 = 1
( sin J cos J + cos J sin J = 0)
(sin 2 J + cos 2 J = 1)
( A cA) 1 = ª O12 cos 2 J 11 + O22 cos 2 J 12 « 2 2 ¬O1 cos J 21 cos J 11 + O2 cos J 22 cos J 12 H y = AVȁ 2 V cA c ~ hi i = 12
O12 cos J 11 cos J 21 + O22 cos J 12 cos J 22 º » O12 cos 2 J 21 + O22 cos 2 J 22 ¼ (3.89)
m=2
¦
j1 , j2 , j3 =1
ai j ai 1 1
2 j2
1 cos J j j cos J j O j2 1 3
2 j3
H y = A( A cA) 1 Ac = AVȁ 2 V cAc ª 0.849 « 1.839 A ~ := AV = « « 2.829 « ¬ 9.759
1.131 º 1.272 »» 2 , ȁ = Diag(8.60 × 103 , 0.58) 1.413 » » 2.400 ¼
ª 43 37 31 11º « » 1 « 37 33 29 1 » H y = A ~ ȁ 2 ( A ~ )c = 100 « 31 29 27 13 » « » ¬« 11 1 13 97 ¼» rk H y = rk A = m = 2 yˆ 4 = y4 (I -LESS) =
(3.90)
3
1 ( 11 y1 + y2 + 13 y3 + 97 y4 ) . 100
(3.91)
148
3 The second problem of algebraic regression
By means of Box 3.16 we repeat the right eigenspace analysis for one leverage point t4 = a, lateron a o f , for both the hat matrix H x : = ( A cA) 1 A c and H y : = A( A cA) 1 Ac . First, Hx is the linear operator producing xˆ = x A (I -LESS) . Second, Hy as linear operator generates yˆ = y A (I -LESS) . Third, the complementary operator I 4 H y =: R as the matrix of partial redundancies leads us to the inconsistency vector ˆi = i A (I -LESS) . The structure of the redundancy matrix R, rk R = n – m, is most remarkable. Its diagonal elements will be interpreted soonest. Fourth, we have computed the length of the inconsistency vector || ˆi ||2 , the quadratic form y cRy . The highlight of the analysis of hat matrices is set by computing 1st : H x (a o f) versus 2nd : H y (a o f) 3rd : R (a o f) versus 4th : || ˆi ( a o f) ||2 for “highest leverage point” a o f , in detail reviewed Box 3.17. Please, notice the two unknowns xˆ1 and xˆ2 as best approximations of type I-LESS. xˆ1 resulted in the arithmetic mean of the first three measurements. The point y4 , t4 = a o f , had no influence at all. Here, xˆ2 = 0 was found. The hat matrix H y (a o f) has produced partial hats h11 = h22 = h33 = 1/ 3 , but h44 = 1 if a o f . The best approximation of the I LESS observations were yˆ1 = yˆ 2 = yˆ 3 as the arithmetic mean of the first three observations but yˆ 4 = y4 has been a reproduction of the fourth observation. Similarly the redundancy matrix R (a o f) produced the weighted means iˆ1 , iˆ2 and iˆ3 . The partial redundancies r11 = r22 = r33 = 2 / 3, r44 = 0 , sum up to r11 + r22 + r33 + r44 = n m = 2 . Notice the value iˆ4 = 4 : The observation indexed four is left uncontrolled. Box 3.16 The linear model of a univariate polynomial of degree one - one high leverage point ª y1 º ª1 « y » «1 y = Ax + i ~ « 2 » = « « y3 » «1 « » « ¬« y4 ¼» ¬«1
1º ª i1 º » 2 » ª x1 º ««i2 »» + 3 » «¬ x2 »¼ « i3 » « » » a ¼» ¬«i4 ¼»
x R 2 , y R 4 , A R 4× 2 , rk A = m = 2 dim X = m = 2 versus dim Y = n = 4 (1st) xˆ = xA (I -LESS) = ( A cA) 1 A cy = H x y
(3.92)
149
3-3 Case study
Hx =
1 18 12a + 3a 2
ª8 a + a « ¬ 2 a
2
2 2a + a 2a
2
4 3a + a 6a
2
ª y1 º « » 14 6a º « y2 » » 6 + 3a ¼ « y3 » « » «¬ y4 »¼
(2nd ) yˆ = y A (I -LESS) = A c( A cA) 1 A cy = H y y
(3.93)
“hat matrix”: H y = A c( A cA) 1 A c, rk H y = m = 2 ª 6 2a + a 2 « 2 1 « 4 3a + a Hy = 18 18a + 3a 2 « 2 4a + a 2 « ¬« 8 3a
º 4 3a + a 2 2 4a + a 2 8 3a » 2 2 6 4a + a 8 5a + a 2 » 6 5a + a 2 14 6a + a 2 4 + 3a » » 2 4 + 3a 14 12a + 3a 2 ¼»
(3rd) ˆi = i A (I -LESS) = (I 4 A( A cA) 1 A c) y = Ry “redundancy matrix”: R = I 4 A( AcA) 1 Ac, rk R = n m = 2 “redundancy”: n – rk A = n – m = 2 ª12 10a + 2a 2 4 + 3a a 2 « 2 12 6a + 2a 2 1 « 4 + 3a a R= 18 12a + 3a 2 « 2 + 4a a 2 8 + 5a a 2 « 2 «¬ 8 + 3a
2 + 4 a a 2 8 + 5a a 2 4 6a + 2a 2 4 3a
8 + 3a º » 2 » 4 3a » » 4 »¼
(4th) || ˆi ||2 =|| i A (I -LESS) ||2 = y cRy . At this end we shall compute the LESS fit lim || iˆ(a ) ||2 ,
a of
which turns out to be independent of the fourth observation. Box 3.17 The linear model of a univariate polynomial of degree one - extreme leverage point a o f (1st ) H x (a o f) 2 2 4 3 14 6 º ª8 1 « a2 a + 1 + a2 a + 1 a2 a + 1 a2 a » 1 Hx = « » 18 12 2 1 6 1 6 3» « 2 1 + 3 + 2 + 2 2+ «¬ a 2 a a2 a a a a »¼ a a a
150
3 The second problem of algebraic regression
1 ª1 1 1 0 º lim H x = « aof 3 ¬0 0 0 0 »¼ 1 xˆ1 = ( y1 + y2 + y3 ), xˆ2 = 0 3 (2nd ) H y (a o f) 4 3 ª6 2 « a2 a + 1 a2 a + 1 « « 4 3 +1 6 4 +1 « a2 a 1 a2 a Hy = « 18 12 2 4 8 5 + 3 « 2 +1 2 +1 a2 a «a a a a « 8 3 2 « 2 a a2 ¬ a ª1 « 1 1 lim H y = « a of 3 «1 « ¬«0
1 1 1 0
1 1 1 0
2 4 8 3 º +1 2 a a a2 a » » 8 5 2 » +1 2 2 » a a a » 14 6 4 3 » +1 2 + a » a2 a a 4 3 14 12 » 2+ + 3» a a2 a a ¼
0º 0 »» , lim h44 = 1 0 » a of » 3 ¼»
1 yˆ1 = yˆ 2 = yˆ 3 = ( y1 + y2 + y3 ), yˆ 4 = y4 3 (3rd ) R (a o f) 4 3 2 4 8 3º ª 10 10 « a2 a + 2 a2 + a 1 a2 + a 1 a2 + a » « » 2 » « 4 + 3 1 12 8 + 2 8 + 5 1 2 « a2 a 1 a2 a a2 a a » R= « » 18 12 2 4 8 5 4 6 4 3 » « + 3 + 1 + 1 + 2 a2 a « a2 a a2 a a2 a a2 a » « 8 3 2 4 3 4 » « 2+ » 2 a a a2 a a2 ¼ ¬ a ª 2 1 1 « 1 1 2 1 lim R (a ) = « a of 3 « 1 1 2 « ¬0 0 0
0º 0 »» . 0» » 0¼
151
3-3 Case study
1 1 1 iˆ1 = (2 y1 y2 y3 ), iˆ2 = ( y1 + 2 y2 y3 ), iˆ3 = ( y1 y2 + 2 y3 ), iˆ4 = 0 3 3 3 (4th ) LESS fit : || iˆ ||2 ª 2 1 1 « 1 2 1 1 lim || iˆ(a ) ||2 = y c « a of 3 « 1 1 2 « «¬ 0 0 0
0º 0 »» y 0» » 0 »¼
1 lim || iˆ(a ) ||2 = (2 y12 + 2 y22 + 2 y32 2 y1 y2 2 y2 y3 2 y3 y1 ) . 3
aof
A fascinating result is achieved upon analyzing (the right eigenspace of the hat matrix H y (a o f) . First, we computed the spectrum of the matrices A cA and ( A cA) 1 . Second, we proved O1 (a o f) = f , O2 (a o f) = 3 or O11 (a o f) = 0 , O21 (a o f) = 1/ 3 . Box 3.18 Right eigenspace analysis of a linear model of a univariate polynomial of degree one - extreme leverage point a o f “Hat matrix H y = A c( A cA) 1 A c ” ª A cAV = Vȁ 2 « right eigenspace analysis: «subject to « VV c = I mc ¬ spec( A cA) = {O12 , O22 } : A cA O j2 I = 0 j {1, 2} 4 O2 6+a = 0 O 4 O 2 (18 + a 2 ) + 20 12a + 3a 2 = 0 2 2 6 + a 14 + a O 2 O1,2 =
1 tr ( A cA) ± (tr A cA) 2 4 det A cA 2
tr A cA = 18 + a 2 , det A cA = 20 12a + 3a 3 (tr A cA) 2 4 det AcA = 244 + 46a + 25a 2 + a 4
152
3 The second problem of algebraic regression
a2 a4 ± 61 + 12a + 6a 2 + 2 4 2 2 spec( A cA) = {O1 , O2 } =
2 O1,2 = 9+
° a 2 a4 a2 a4 = ®9 + + 61 + 12a + 6a 2 + , 9 + 61 + 12a + 6a 2 + 2 4 2 4 °¯ “inverse spectrum ” 1 1 spec( A cA) = {O12 , O22 } spec( A cA) 1 = { 2 , 2 } O1 O2 1 = O22
1 = O12
9+
9+
½° ¾ °¿
9 1 61 12 6 1 a2 a4 + 4+ 3+ 2+ 61 + 12a + 6a 2 + 2 2 a a a 4 2 4 =a 20 12 20 12a + 3a 2 +3 a2 a 1 lim =0 a of O 2 1 9 1 61 12 6 1 a2 a4 + + + + + + 61 + 12a + 6a 2 + 2 2 a 4 a3 a 2 4 2 4 = a 20 12 20 12a + 3a 2 +3 a2 a 1 1 lim = a of O 2 3 2
1 lim spec( A cA)(a) = {f,3} lim spec( A cA) 1 = {0, } aof 3 2 A cAV = Vȁ º 1 2 2 » A cA = Vȁ V c ( A cA ) = Vȁ V c VV c = I m ¼ aof
“Hat matrix H y = AVȁ 2 V cA c ”. 3-32 Multilinear algebra, “join” and ”meet”, the Hodge star operator Before we can analyze the matrices “hat Hy” and “red R” in more detail, we have to listen to an “intermezzo” entitled multilinear algebra, “join” and “meet” as well as the Hodge star operator. The Hodge star operator will lay down the foundation of “latent restrictions” within our linear model and of Grassmann coordinates, also referred to as Plücker coordinates. Box 3.19 summarizes the definitions of multilinear algebra, the relations “join and meet”, denoted by “ ” and “*”, respectively. In terms of orthonormal base vectors ei , " , ei , we introduce by (3.94) the exterior product ei " ei also known as “join”, “skew product” or 1st Grassmann relation. Indeed, such an exterior product is antisymmetric as defined by (3.95), (3.96), (3.97) and (3.98). 1
k
1
m
153
3-3 Case study
The examples show e1 e 2 = - e 2 e1 and e1 e1 = 0 , e 2 e 2 = 0 . Though the operations “join”, namely the exterior product, can be digested without too much of an effort, the operation ”meet”, namely the Hodge star operator, needs much more attention. Loosely speaking the Hodge star operator or 2nd Grassmann relation is a generalization of the conventional “cross product” symbolized by “ × ”. Let there be given an exterior form of degree k as an element of /k(Rn) over the field of real numbers Rn . Then the “Hodge *” transforms the input exterior form of degree m to the output exterior form of degree n – m, namely an element of /n-k(R n). Input: X/ /m(R n) o Output: *X/ /n-m. Applying the summation convention over repeated indices, (3.100) introduces the input operation “join”, while (3.101) provides the output operation “meet”. We say that X , (3.101) is a representation of the adjoint form based on the original form X , (3.100). The Hodge dualizer is a complicated exterior form (3.101) which is based upon Levi-Civita’s symbol of antisymmetry (3.102) which is illustrated by 3 examples. H k1"kA is also known as the permutation operator. Unfortunately, we have no space and time to go deeper into “join and meet“. Instead we refer to those excellent textbooks on exterior algebra and exterior analysis, differential topology, in short exterior calculus. Box 3.19 “join and meet” Hodge star operator “ , ” I := {i1 ," , ik , ik +1 ," , in } {1," , n} “join”: exterior product, skew product, 1st Grassmann relation ei "i := ei " e j e j +1 " ei
(3.94)
“antisymmetry”: ei ...ij ...i = ei ... ji...i i z j
(3.95)
1
m
m
1
1
m
1
m
ei ... e j e j ... ei = ei ... e j e j ... ei
(3.96)
ei "i i "i = 0 i = j
(3.97)
ei " ei e j " ei = 0 i = j
(3.98)
1
k +1
k
m
1
1
i j
k +1
1
k
m
m
m
Example: e1 e 2 = e 2 e1 or e i e j = e j e i i z j Example: e1 e1 = 0, e 2 e 2 = 0 or e i e j = 0 “meet”: Hodge star operator, Hodge dualizer 2nd Grassmann relation
i = j
154
3 The second problem of algebraic regression
: ȁ m ( R n ) o n m ȁ ( R n )
(3.99)
“a m degree exterior form X ȁ m ( R n ) over R n is related to a n-m degree exterior form *X called the adjoint form” :summation convention: “sum up over repeated indices” input: “join” X=
1 e i " e i X i "i m! 1
(3.100)
m
m
1
output: “meet” 1 g e j " e j H i "i j " j Xi "i m !(n m)! antisymmetry operator ( “Eddington’s epsilons” ):
*X :=
H k "k 1
A
1
nm
1
1
m 1
(3.101)
m
nm
ª +1 for an even permutation of the indices k1 " kA := «« 1 for an oded permutation of the indices k1 " kA «¬ 0 otherwise (for a repetition of the indices).
(3.102)
Example: H123 = H 231 = H 312 = +1 Example: H 213 = H 321 = H132 = 1 Example: H112 = H 223 = H 331 = 0. For our purposes two examples on “Hodge’s star” will be sufficient for the following analysis of latent restrictions in our linear model. In all detail, Box 3.20 illustrates “join and meet” for
: ȁ 2 ( R 3 ) o ȁ 1 ( R 3 ) . Given the exterior product a b of two vectors a and b in R 3 with ai 1 = col1 A, ai 2 = col 2 A 1
2
as their coordinates, the columns of the matrix A with respect to the orthonormal frame of reference {e1 , e 2 , e 3 |0} at the origin 0. ab =
n =3
¦e
i1 ,i2 =1
i1
ei ai 1ai 2 ȁ 2 (R 3 ) 2
1
2
is the representation of the exterior form a b =: X in the multibasis ei i = ei ei . By cyclic ordering, (3.105) is an explicit write-up of a b R ( A) . Please, notice that there are 12
1
2
155
3-3 Case study
§n · §3· ¨ ¸=¨ ¸=3 © m¹ © 2¹ subdeterminants of A . If the determinant of the matrix G = I 4 , g = 1 , then according to (3.106), (3.107)
det G = 1
(a b) R ( A) A = G1,3 represent the exterior form *X , which is an element of R ( A) called Grassmann space G1,3 . Notice that (a b) is a vector whose Grassmann coordinate (Plücker coordinate) are §n · §3· ¨ ¸=¨ ¸=3 © m¹ © 2¹ subdeterminants of the matrix A, namely a21a32 a31a22 , a31a12 a11a32 , a11a23 a21a12 . Finally, (3.108) (e 2 e 3 ) = e 2 × e 3 = e1 for instance demonstrates the relation between " , " called “join, meet” and the “cross product”. Box 3.20 The first example: “join and meet”
: ȁ 2 (R 3 ) o ȁ1 (R 3 ) Input: “join” n =3
n =3
a = ¦ ei ai 1 , i =1
1
b =¦ ei ai
1
2
i =1
2
2
(3.103)
ai 1 = col1 A; ai 2 = col 2 A 1
ab =
2
1 n =3 ¦ ei ei 2! i ,i =1 1
2
ai 1ai 2 ȁ 2 (R 3 ) 1
2
(3.104)
1 2
“cyclic order ab =
1 e 2 e3 (a21a32 a31a22 ) + 2! 1 + e3 e1 (a31a12 a11a32 ) + 2! 1 + e1 e 2 (a11a23 a21a12 ) R ( A ) = G 2,3 . 2!
(3.105)
156
3 The second problem of algebraic regression
Output: “meet” ( g = 1, G y = I 3 , m = 2, n = 3, n m = 1)
(a b) =
n=2
1 e j H i ,i , j ai 1ai i ,i , j =1 2!
¦
1 2
1
2
2
(3.106)
1 2
1 e1 ( a21a32 a31a22 ) + 2! 1 + e 2 (a31a12 a11a32 ) + 2! 1 + e3 ( a11a23 a21a12 ) R A ( A ) = G1,3 2!
*(a b) =
(3.107)
§n · §3· ¨ ¸ = ¨ ¸ subdeterminant of A © m¹ © 2¹ Grassmann coordinates (Plücker coordinates)
(e 2 e3 ) = e1 , (e3 e1 ) = e 2 , (e1 e 2 ) = e3 .
(3.108)
Alternatively, Box 3.21 illustrates “join and meet” for selfduality
: ȁ 2 ( R 4 ) o ȁ 2 ( R 4 ) . Given the exterior product a b of two vectors a R 4 and b R 4 , namely the two column vectors of the matrix A R 4× 2 , ai 1 = col1 A, ai 2 = col 2 A 1
2
as their coordinates with respect to the orthonormal frame of reference {e1 , e 2 , e 3 , e 4 | 0 } at the origin 0. ab =
n=4
¦e
i1 ,i2 =1
i1
ei
2
ai 1ai 2 ȁ 2 (R 4 ) 1
2
is the representation of the exterior form a b := X in the multibasis ei i = ei ei . By lexicographic ordering, (3.111) is an explicit write-up of a b ( R ( A)) . Notice that these are 12
1
2
§ n · § 4· ¨ ¸=¨ ¸=6 © m¹ © 2¹ subdeterminants of A . If the determinant of the matrix G of the metric is one G = I 4 , det G = g = 1 , then according to (3.112), (3.113)
(a b) R ( A) A =: G 2,4
157
3-3 Case study
represents the exterior form X , an element of R ( A) A , called Grassmann space G 2,4 . Notice that (a b) is an exterior 2-form which has been generated by an exterior 2-form, too. Such a relation is called “selfdual”. Its Grassmann coordinates (Plücker coordinates) are
§ n · § 4· ¨ ¸=¨ ¸=6 © m¹ © 2¹ subdeterminants of the matrix A, namely a11a12 a21a12 , a11a32 a31a22 , a11a42 a41a12 , a21a32 a31a22 , a21a42 a41a22 , a31a41 a41a32 . Finally, (3.113), for instance (e1 e 2 ) = e3 e 4 , demonstrates the operation " , " called “join and meet”, indeed quite a generalization of the “cross product”. Box 3.21 The second example “join and meet”
: / 2 ( R 4 ) o / 2 ( R 4 ) “selfdual” Input : “join” n=4
n=4
a = ¦ ei ai 1 , b = ¦ ei ai i1 =1
1
1
i2 =1
2
2
2
(3.109)
(ai 1 = col1 ( A), ai 2 = col 2 ( A)) 1
ab =
2
1 n=4 ¦ ei ei ai 1ai 2 ȁ 2 (R 4 ) 2! i ,i =1 1
2
1
2
(3.110)
1 2
“lexicographical order” 1 e1 e 2 ( a11a22 a21a12 ) + 2! 1 + e1 e 3 ( a11a32 a31a22 ) + 2! 1 + e1 e 4 (a11a42 a41a12 ) + 2!
ab =
(3.111)
158
3 The second problem of algebraic regression
1 e 2 e3 (a21a32 a31a22 ) + 2! 1 + e 2 e 4 (a21a42 a41a22 ) + 2! 1 + e3 e 4 (a31a42 a41a32 ) R ( A) A = G 2,4 2! +
§ n · § 4· ¨ ¸ = ¨ ¸ subdeterminants of A: © m¹ © 2¹ Grassmann coordinates ( Plücker coordinates). Output: “meet” g = 1, G y = I 4 , m = 2, n = 4, n m = 2
(a b) =
1 n=4 ¦ 2! i ,i , j , j 1 2
1
2
1 e j e j Hi i =1 2! 1
2
1 2 j1 j2
ai 1ai 1
2
2
1 e3 e 4 (a11a22 a21a12 ) + 4 1 + e 2 e 4 (a11a32 a31a22 ) + 4 1 + e3 e 2 (a11a42 a41a12 ) + 4 1 + e 4 e1 (a21a32 a31a22 ) + 4 1 + e3 e1 (a21a22 a41a22 ) + 4 1 + e1 e 2 (a31a42 a41a32 ) R ( A) A = G 2,4 4 =
(3.112)
§ n · § 4· ¨ ¸ = ¨ ¸ subdeterminants of A : © m¹ © 2¹ Grassmann coordinates (Plücker coordinates).
(e1 e 2 ) = e3 e 4 , (e1 e3 ) = e 2 e 4 , (e1 e 4 ) = e3 e 2 ,
(e 2 e3 ) = e 4 e1 , (e 2 e 4 ) = e3 e1 ,
(3.113)
(e3 e 4 ) = e1 e 2 . 3-33
From A to B: latent restrictions, Grassmann coordinates, Plücker coordinates.
Before we return to the matrix A R 4× 2 of our case study, let us analyze the matrix A R 2×3 of Box 3.22 for simplicity. In the perspective of the example of
159
3-3 Case study
our case study we may say that we have eliminated the third observation, but kept the leverage point. First, let us go through the routine to compute the hat matrices H x = ( A c A) 1 A c and H y = A( A c A) 1 A c , to be identified by (3.115) and (3.116). The corresponding estimations xˆ = x A (I -LESS) , (3.116), and y = y A (I -LESS) , (3.118), prove the different weights of the observations ( y1 , y2 , y3 ) influencing xˆ1 and xˆ2 as well as ( yˆ1 , yˆ 2 , yˆ3 ) . Notice the great weight of the leverage point t3 = 10 on yˆ 3 . Second, let us interpret the redundancy matrix R = I 3 A( AcA) 1 Ac , in particular the diagonal elements. r11 =
A cA (1)
=
A cA (2) A cA (3) 64 81 1 = = , r22 = , r33 = , 146 det AcA 146 det AcA 146
det A cA n =3 1 tr R = ¦ (AcA)(i ) = n rk A = n m = 1, det A cA i =1
the degrees of freedom of the I 3 -LESS problem. There, for the first time, we meet the subdeterminants ( A cA )( i ) which are generated in a two step procedure. “First step” eliminate the ith row from A as well as the ith column of A.
“Second step” compute the determinant A c( i ) A ( i ) .
Example : ( A cA)1 1 1 1 2 1 10
A c(1) A (1) 1 1
1
1 2 10
2
( A cA )(1) = det A c(1) A (1) = 64 det A cA = 146 12
12 104 Example: ( AcA) 2
A c( 2) A ( 2)
1 1 1 2 1 10
1 1
2
1
1 2 10
11
11 101
( A cA )(2) = det A c(2) A (2) = 81 det A cA = 146
160
3 The second problem of algebraic regression
Example: ( AcA)3
A c(3) A (3) 1 1
1 1 1 2 1 10
1
2 3
1 2 10
3 5
( A cA )(3) = det A c(3) A (3) = 1 det A cA = 146
Obviously, the partial redundancies (r11 , r22 , r33 ) are associated with the influence of the observation y1, y2 or y3 on the total degree of freedom. Here the observation y1 and y2 had the greatest contribution, the observation y3 at a leverage point a very small influence. The redundancy matrix R, properly analyzed, will lead us to the latent restrictions or “from A to B”. Third, we introduce the rank partitioning R = [ B, C] , rk R = rk B = n m = 1, (3.120), of the matrix R of spatial redundancies. Here, b R 3×1 , (3.121), is normalized to generate b = b / || b || 2 , (3.122). Note, C R 3× 2 is a dimension identity. We already introduced the orthogonality condition bcA = 0 or bcAxA = bcyˆ = 0 (b )cA = 0
or
(b )cAxA = (b )cyˆ = 0,
which establishes the latent restrictions (3.127) 8 yˆ1 9 yˆ 2 + yˆ 3 = 0. We shall geometrically interpret this essential result as soon as possible. Fourth, we aim at identifying R ( A) and R ( A) A for the linear model {Ax + i = y, A R n ×m , rk A = m = 2} ª1º wy y y y « » t1 := = [e1 , e 2 , e3 ] «1» , wx1 «¬1»¼ ª1 º wy y y y « t2 := = [e1 , e 2 , e3 ] « 2 »» , wx 2 «¬10 »¼ as derivatives of the observation functional y = f (x1 , x 2 ) establish the tangent vectors which span a linear manifold called Grassmann space. G 2,3 = span{t1 , t2 } R 3 ,
161
3-3 Case study
in short GRASSMANN (A). Such a notation becomes more obvious if we compute ª a11 x1 + a12 x2 º n =3 m = 2 y y y « y = [e1 , e 2 , e3 ] « a21 x1 + a22 x2 »» = ¦ ¦ eiy aij x j , «¬ a31 x1 + a32 x2 »¼ i =1 j =1 ª a11 º n =3 wy y y y « (x1 , x 2 ) = [e1 , e 2 , e3 ] « a21 »» = ¦ eiy ai1 wx1 «¬ a31 »¼ i =1 ª a12 º n =3 wy y y y « (x1 , x 2 ) = [e1 , e 2 , e3 ] « a22 »» = ¦ eiy ai2 . wx 2 «¬ a32 »¼ i =1 Indeed, the columns of the matrix A lay the foundation of GRASSMANN (A). Five, let us turn to GRASSMANN (B) which is based on the normal space R ( A) A . The normal vector n = t1 × t 2 = (t1 t 2 ) which spans GRASSMANN (B) is defined by the “cross product” identified by " , " , the skew product symbol as well as the Hodge star symbol. Alternatively, we are able to represent the normal vector n, (3.130), (3.132), (3.133), constituted by the columns {col1A, col2A} of the matrix, in terms of the Grassmann coordinates (Plücker coordinates). a a22 a a32 a a12 p23 = 21 = 8, p31 = 31 = 9, p12 = 11 = 1, a31 a32 a11 a12 a21 a22 identified as the subdeterminants of the matrix A, generated by n =3
¦ (e
i1 ,i2 =1
i1
ei )ai 1ai 2 . 2
1
2
If we normalize the vector b to b = b / || b ||2 and the vector n to n = n / || n ||2 , we are led to the first corollary b = n . The space spanned by the normal vector n, namely the linear manifold G1,3 R 3 defines GRASSMANN (B). In exterior calculus, the vector built on Grassmann coordinates (Plücker coordinates) is called Grassmann vector g or normalized Grassmann vector g*, here ª p23 º ª 8 º ª 8º g 1 « » « »
« » g := p31 = 9 , g := = 9 . « » « » & g & 2 146 « » «¬ p12 »¼ «¬ 1 »¼ «¬ 1 »¼ The second corollary identifies b = n = g .
162
3 The second problem of algebraic regression
“The vector b which constitutes the latent restriction (latent condition equation) coincides with the normalized normal vector n R ( A) A , an element of the space R ( A) A , which is normal to the column space R ( A) of the matrix A. The vector b is built on the Grassmann coordinates (Plücker coordinates), [ p23 , p31 , p12 ]c , subdeterminant of vector g in agreement with b .” Box 3.22 Latent restrictions Grassmann coordinates (Plücker coordinates) the second example ª y1 º ª1 1 º ª1 1 º « y » = «1 2 » ª x1 º A = «1 2 » , rk A = 2 « 2» « » «x » « » «¬ y3 »¼ «¬1 10 »¼ ¬ 2 ¼ «¬1 10 »¼ (1st) H x = ( A cA ) 1 A c 1 ª 92 79 25º 146 «¬ 10 7 17 »¼
(3.115)
1 ª 92 y1 + 79 y2 25 y3 º 146 «¬ 10 y1 7 y2 + 17 y3 »¼
(3.116)
H x = ( AcA) 1 Ac = xˆ = x A (I LESS) =
(2nd) H y = A( A cA) 1 A c ª 82 72 8 º 1 « H y = ( A cA) Ac = 72 65 9 »» , rk H y = rk A = 2 146 « «¬ 8 9 145»¼
(3.117)
ª82 y1 + 72 y2 8 y3 º 1 « yˆ = y A (I LESS) = 72 y1 + 65 y2 + 3 y3 »» 146 « «¬ 8 y1 + 9 y2 + 145 y3 »¼
(3.118)
1
yˆ 3 =
1 (8 y1 + 9 y2 + 145 y3 ) 146
(3rd) R = I 3 A( A cA ) 1 Ac
(3.119)
163
3-3 Case study
R = I 3 A( A cA) 1 Ac =
r11 =
ª 64 72 8 º 1 « 72 81 9 »» « 146 «¬ 8 9 1 »¼
(3.120)
A cA (1) A cA (2) A cA (3) 64 81 1 = , r22 = = , r33 = = 146 det A cA 146 det A cA 146 det A cA tr R =
n =3 1 ( A cA)(i ) = n rk A = n m = 1 ¦ det A cA i =1
latent restriction 8º ª 64 72 1 « R = [B, C] = 72 81 9 » , rk R = 1 « » 146 «¬ 8 9 1»¼ b :=
ª 64 º ª 0.438 º 1 « 72 »» = «« 0.493»» « 146 «¬ 8 »¼ «¬ 0.053 »¼
(3.120)
(3.121)
ª 8º ª 0.662 º b 1 « » « b := = 9 » = « 0.745 »» « &b& 146 «¬ 1 »¼ «¬ 0.083 »¼
(3.122)
(3.123)
bcA = 0 ( b )cA = 0
(3.124)
(3.125)
bcyˆ = 0 (b )cyˆ = 0
(3.126)
8 yˆ1 9 yˆ 2 + yˆ 3 = 0
(3.127)
" R (A) and R ( A) A : tangent space Tx M 2 versus normalspace N x M 2 , 3 Grassmann manifold G m2,3 R 3 versus Grassmann manifold G1,3 nm R "
ª1º wy y y y « » = [e1 , e 2 , e 3 ] 1 “the first tangent vector”: t1 := « » wx1 «¬1»¼
(3.128)
164
3 The second problem of algebraic regression
"the second tangent vector": t_2 := \frac{\partial y}{\partial x_2} = [e_1^y, e_2^y, e_3^y] \begin{bmatrix} 1 \\ 2 \\ 10 \end{bmatrix}     (3.129)

"G^{m,n}": G^{2,3} = span\{t_1, t_2\} ⊂ R^3 : Grassmann (A)

"the normal vector": n := t_1 × t_2 = *(t_1 ∧ t_2)     (3.130)

t_1 = \sum_{i=1}^{n=3} e_i a_{i1} \quad \text{and} \quad t_2 = \sum_{i=1}^{n=3} e_i a_{i2}     (3.131)

n = * \sum_{i_1, i_2 = 1}^{n=3} (e_{i_1} ∧ e_{i_2}) \, a_{i_1 1} a_{i_2 2}     (3.132)
versus
n = \sum_{i_1, i_2 = 1}^{n=3} *(e_{i_1} ∧ e_{i_2}) \, a_{i_1 1} a_{i_2 2} , \quad i, i_1, i_2 ∈ \{1, …, n = 3\}     (3.133)

n = *(e_2 ∧ e_3)(a_{21}a_{32} - a_{31}a_{22}) + *(e_3 ∧ e_1)(a_{31}a_{12} - a_{11}a_{32}) + *(e_1 ∧ e_2)(a_{11}a_{22} - a_{21}a_{12})
  = (e_2 × e_3)(a_{21}a_{32} - a_{31}a_{22}) + (e_3 × e_1)(a_{31}a_{12} - a_{11}a_{32}) + (e_1 × e_2)(a_{11}a_{22} - a_{21}a_{12})     (3.134)

Hodge star operator *:
*(e_2 ∧ e_3) = e_2 × e_3 = e_1 , \quad *(e_3 ∧ e_1) = e_3 × e_1 = e_2 , \quad *(e_1 ∧ e_2) = e_1 × e_2 = e_3     (3.135)

n = t_1 × t_2 = *(t_1 ∧ t_2) = [e_1^y, e_2^y, e_3^y] \begin{bmatrix} 8 \\ -9 \\ 1 \end{bmatrix}     (3.136)

n^* := \frac{n}{\|n\|} = \frac{1}{\sqrt{146}} [e_1^y, e_2^y, e_3^y] \begin{bmatrix} 8 \\ -9 \\ 1 \end{bmatrix}     (3.137)

Corollary: b^* = n^*   ("Grassmann manifold G^{n-m,n}")
G^{1,3} = span\{n\} ⊂ R^3 : Grassmann (B)

Grassmann coordinates (Plücker coordinates)

A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 10 \end{bmatrix} , \quad g(A) := \left\{ \begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix}, \begin{vmatrix} a_{31} & a_{32} \\ a_{11} & a_{12} \end{vmatrix}, \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} \right\} = \left\{ \begin{vmatrix} 1 & 2 \\ 1 & 10 \end{vmatrix}, \begin{vmatrix} 1 & 10 \\ 1 & 1 \end{vmatrix}, \begin{vmatrix} 1 & 1 \\ 1 & 2 \end{vmatrix} \right\} = \{8, -9, 1\}     (3.138)

(cyclic order) g(A) = \{p_{23}, p_{31}, p_{12}\} , \quad p_{23} = 8 , \ p_{31} = -9 , \ p_{12} = 1

Grassmann vector: g := \begin{bmatrix} p_{23} \\ p_{31} \\ p_{12} \end{bmatrix} = \begin{bmatrix} 8 \\ -9 \\ 1 \end{bmatrix}     (3.139)

normalized Grassmann vector: g^* := \frac{g}{\|g\|} = \frac{1}{\sqrt{146}} \begin{bmatrix} 8 \\ -9 \\ 1 \end{bmatrix}     (3.140)

Corollary: b^* = n^* = g^* .     (3.141)
Now we are prepared to analyze the matrix A ∈ R^{4×2} of our case study. Box 3.23 first outlines the redundancy matrix R ∈ R^{4×4} (3.142), used in particular for computing the inconsistency coordinate î_4 = i_4(I-LESS). Again it is proven that the leverage point t_4 = 10 has little influence on this fourth coordinate of the inconsistency vector. The diagonal elements (r_{11}, r_{22}, r_{33}, r_{44}) of the redundancy matrix are of focal interest. As partial redundancy numbers (3.148), (3.149), (3.150) and (3.151),

r_{11} = \frac{(A'A)_{(1)}}{\det A'A} = \frac{57}{100} , \quad r_{22} = \frac{(A'A)_{(2)}}{\det A'A} = \frac{67}{100} , \quad r_{33} = \frac{(A'A)_{(3)}}{\det A'A} = \frac{73}{100} , \quad r_{44} = \frac{(A'A)_{(4)}}{\det A'A} = \frac{3}{100} ,

they sum up to

\operatorname{tr} R = \frac{1}{\det A'A} \sum_{i=1}^{n=4} (A'A)_{(i)} = n - \operatorname{rk} A = n - m = 2 ,

the degree of freedom of the I_4-LESS problem. Here, for the second time, we meet the subdeterminants (A'A)_{(i)}, which are generated in a two-step procedure.
"First step": eliminate the i-th row from A as well as the i-th column from A'.
"Second step": compute the determinant of A'_{(i)} A_{(i)}.
Box 3.23
Redundancy matrix of a linear model of a univariate polynomial of degree one - light leverage point a = 10

"Redundancy matrix R = I_4 - A(A'A)^{-1}A'"

R = I_4 - A(A'A)^{-1}A' = \frac{1}{100} \begin{bmatrix} 57 & -37 & -31 & 11 \\ -37 & 67 & -29 & -1 \\ -31 & -29 & 73 & -13 \\ 11 & -1 & -13 & 3 \end{bmatrix}     (3.142)

\hat{i} = i_\ell(\text{I-LESS}) = R y     (3.143)

\hat{i}_4 = i_4(\text{I-LESS}) = \frac{1}{100}(11 y_1 - y_2 - 13 y_3 + 3 y_4)     (3.144)

r_{11} = \frac{57}{100} , \ r_{22} = \frac{67}{100} , \ r_{33} = \frac{73}{100} , \ r_{44} = \frac{3}{100}     (3.145)

"rank partitioning"

R ∈ R^{4×4} , \ \operatorname{rk} R = n - \operatorname{rk} A = n - m = 2 , \quad B ∈ R^{4×2} , \ C ∈ R^{4×2} , \quad R = I_4 - A(A'A)^{-1}A' = [B, C]     (3.146)

"if B := \frac{1}{100} \begin{bmatrix} 57 & -37 \\ -37 & 67 \\ -31 & -29 \\ 11 & -1 \end{bmatrix} , then B'A = 0"     (3.147)

r_{11} = \frac{(A'A)_{(1)}}{\det A'A}  (3.148) , \quad r_{22} = \frac{(A'A)_{(2)}}{\det A'A}  (3.149) , \quad r_{33} = \frac{(A'A)_{(3)}}{\det A'A}  (3.150) , \quad r_{44} = \frac{(A'A)_{(4)}}{\det A'A}  (3.151)

\operatorname{tr} R = \frac{1}{\det A'A} \sum_{i=1}^{n=4} (A'A)_{(i)} = n - \operatorname{rk} A = n - m = 2     (3.152)
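The numbers of Box 3.23 can be reproduced with a few lines of numpy (a sketch with our own variable names, not part of the book): the diagonal of R gives the partial redundancy numbers, and the same values follow from the subdeterminants (A'A)_(i) of the two-step procedure.

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0, 10.0])          # abscissae, leverage point at t4 = 10
A = np.column_stack([np.ones(4), t])         # univariate polynomial of degree one

N = A.T @ A                                  # det N = 200
R = np.eye(4) - A @ np.linalg.solve(N, A.T)  # redundancy matrix (3.142)
print(np.round(100 * R))                     # 100*R as displayed in Box 3.23
print(np.diag(R))                            # [0.57 0.67 0.73 0.03], trace = 2

# the same partial redundancies from the subdeterminants (A'A)_(i)
detN = np.linalg.det(N)
for i in range(4):
    Ai = np.delete(A, i, axis=0)             # "first step": drop the i-th observation
    ri = np.linalg.det(Ai.T @ Ai) / detN     # "second step": det(A'_(i) A_(i)) / det(A'A)
    print(i + 1, round(ri, 2))               # 0.57, 0.67, 0.73, 0.03
```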
Example (A'A)_{(1)}:
A'_{(1)} A_{(1)} = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 3 & 10 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 1 & 3 \\ 1 & 10 \end{bmatrix} = \begin{bmatrix} 3 & 15 \\ 15 & 113 \end{bmatrix} , \quad (A'A)_{(1)} = \det(A'_{(1)} A_{(1)}) = 114 , \ \det A'A = 200

Example (A'A)_{(2)}:
A'_{(2)} A_{(2)} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 3 & 10 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 3 \\ 1 & 10 \end{bmatrix} = \begin{bmatrix} 3 & 14 \\ 14 & 110 \end{bmatrix} , \quad (A'A)_{(2)} = \det(A'_{(2)} A_{(2)}) = 134 , \ \det A'A = 200

Example (A'A)_{(3)}:
A'_{(3)} A_{(3)} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 10 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 10 \end{bmatrix} = \begin{bmatrix} 3 & 13 \\ 13 & 105 \end{bmatrix} , \quad (A'A)_{(3)} = \det(A'_{(3)} A_{(3)}) = 146 , \ \det A'A = 200

Example (A'A)_{(4)}:
A'_{(4)} A_{(4)} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} , \quad (A'A)_{(4)} = \det(A'_{(4)} A_{(4)}) = 6 , \ \det A'A = 200
Again, the partial redundancies (r_{11}, …, r_{44}) are associated with the influence of the observations y_1, y_2, y_3 and y_4 on the total degree of freedom. Here the observations y_1, y_2 and y_3 have the greatest influence; in contrast, the observation y_4 at the leverage point has a very small impact. The redundancy matrix R will now be properly analyzed in order to supply us with the latent restrictions, or the details of "from A to B". The rank partitioning R = [B, C], rk R = rk B = n - m = 2, leads us to (3.22) for the matrix R of partial redundancies. Here B ∈ R^{4×2} with two column vectors is established; note that C ∈ R^{4×2} merely fixes the dimensions. We already introduced the orthogonality conditions in (3.22), B'A = 0 or B'A x_\ell = B' y_\ell = 0, which establish the two latent conditions

\frac{57}{100}\hat{y}_1 - \frac{37}{100}\hat{y}_2 - \frac{31}{100}\hat{y}_3 + \frac{11}{100}\hat{y}_4 = 0 ,
-\frac{37}{100}\hat{y}_1 + \frac{67}{100}\hat{y}_2 - \frac{29}{100}\hat{y}_3 - \frac{1}{100}\hat{y}_4 = 0 .

Let us identify in the context of this paragraph R(A) and R(A)^⊥ for the linear model \{Ax + i := y, \ A ∈ R^{n×m}, \ \operatorname{rk} A = m = 2\}. The derivatives

t_1 := \frac{\partial y}{\partial x_1} = [e_1^y, e_2^y, e_3^y, e_4^y] \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} , \qquad t_2 := \frac{\partial y}{\partial x_2} = [e_1^y, e_2^y, e_3^y, e_4^y] \begin{bmatrix} 1 \\ 2 \\ 3 \\ 10 \end{bmatrix}

of the observational functional y = f(x_1, x_2) generate the tangent vectors which span a linear manifold called the Grassmann space G^{2,4} = span\{t_1, t_2\} ⊂ R^4, in short GRASSMANN (A). An illustration of such a linear manifold is

y = [e_1^y, e_2^y, e_3^y, e_4^y] \begin{bmatrix} a_{11}x_1 + a_{12}x_2 \\ a_{21}x_1 + a_{22}x_2 \\ a_{31}x_1 + a_{32}x_2 \\ a_{41}x_1 + a_{42}x_2 \end{bmatrix} = \sum_{i=1}^{n=4}\sum_{j=1}^{m=2} e_i^y a_{ij} x_j ,
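The two latent conditions can be verified numerically; the sketch below (Python/numpy, assuming the same 4×2 design matrix as above) checks that B'A = 0 and that both conditions annihilate the fitted vector ŷ for an arbitrary observation vector.

```python
import numpy as np

A = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 10.0]])
R = np.eye(4) - A @ np.linalg.solve(A.T @ A, A.T)
B = R[:, :2]                        # rank partitioning R = [B, C], rk B = 2

print(np.allclose(B.T @ A, 0.0))    # True: B'A = 0

y = np.array([1.0, 2.0, 2.0, 3.0])  # any observation vector
y_hat = A @ np.linalg.solve(A.T @ A, A.T @ y)
print(np.round(B.T @ y_hat, 12))    # [0. 0.]: both latent conditions hold
```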
\frac{\partial y}{\partial x_1}(x_1, x_2) = [e_1^y, e_2^y, e_3^y, e_4^y] \begin{bmatrix} a_{11} \\ a_{21} \\ a_{31} \\ a_{41} \end{bmatrix} = \sum_{i=1}^{n=4} e_i^y a_{i1} , \qquad
\frac{\partial y}{\partial x_2}(x_1, x_2) = [e_1^y, e_2^y, e_3^y, e_4^y] \begin{bmatrix} a_{12} \\ a_{22} \\ a_{32} \\ a_{42} \end{bmatrix} = \sum_{i=1}^{n=4} e_i^y a_{i2} .

Box 3.24
Latent restrictions: Grassmann coordinates (Plücker coordinates), the first example

B'A = 0     (3.153)
B'\hat{y} = 0     (3.154)

A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 10 \end{bmatrix}     (3.155) , \qquad B = \frac{1}{100} \begin{bmatrix} 57 & -37 \\ -37 & 67 \\ -31 & -29 \\ 11 & -1 \end{bmatrix}     (3.156)

"latent restrictions"

57\hat{y}_1 - 37\hat{y}_2 - 31\hat{y}_3 + 11\hat{y}_4 = 0     (3.157)
-37\hat{y}_1 + 67\hat{y}_2 - 29\hat{y}_3 - \hat{y}_4 = 0     (3.158)

"R(A): the tangent space T_x M^2, the Grassmann manifold G^{2,4}"

"the first tangent vector": t_1 := \frac{\partial y}{\partial x_1} = [e_1^y, e_2^y, e_3^y, e_4^y] \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}     (3.159)

"the second tangent vector": t_2 := \frac{\partial y}{\partial x_2} = [e_1^y, e_2^y, e_3^y, e_4^y] \begin{bmatrix} 1 \\ 2 \\ 3 \\ 10 \end{bmatrix}     (3.160)
G^{2,4} = span\{t_1, t_2\} ⊂ R^4 : Grassmann (A)

"the first normal vector": n_1 := \frac{b_1}{\|b_1\|}     (3.161)

\|b_1\|^2 = 10^{-4}(57^2 + 37^2 + 31^2 + 11^2) = 57 \cdot 10^{-2}     (3.162)

n_1 = [e_1^y, e_2^y, e_3^y, e_4^y] \begin{bmatrix} 0.755 \\ -0.490 \\ -0.411 \\ 0.146 \end{bmatrix}     (3.163)

"the second normal vector": n_2 := \frac{b_2}{\|b_2\|}     (3.164)

\|b_2\|^2 = 10^{-4}(37^2 + 67^2 + 29^2 + 1^2) = 67 \cdot 10^{-2}     (3.165)

n_2 = [e_1^y, e_2^y, e_3^y, e_4^y] \begin{bmatrix} -0.452 \\ 0.819 \\ -0.354 \\ -0.012 \end{bmatrix}     (3.166)

Grassmann coordinates (Plücker coordinates)

A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 10 \end{bmatrix} , \quad g(A) := \left\{ \begin{vmatrix} 1 & 1 \\ 1 & 2 \end{vmatrix}, \begin{vmatrix} 1 & 1 \\ 1 & 3 \end{vmatrix}, \begin{vmatrix} 1 & 1 \\ 1 & 10 \end{vmatrix}, \begin{vmatrix} 1 & 2 \\ 1 & 3 \end{vmatrix}, \begin{vmatrix} 1 & 2 \\ 1 & 10 \end{vmatrix}, \begin{vmatrix} 1 & 3 \\ 1 & 10 \end{vmatrix} \right\} = \{p_{12}, p_{13}, p_{14}, p_{23}, p_{24}, p_{34}\}     (3.167)

p_{12} = 1 , \ p_{13} = 2 , \ p_{14} = 9 , \ p_{23} = 1 , \ p_{24} = 8 , \ p_{34} = 7 .

Again, the columns of the matrix A lay the foundation of GRASSMANN (A). Next we turn to GRASSMANN (B), to be identified as the normal space R(A)^⊥. The normal vectors

n_1 = [e_1^y, e_2^y, e_3^y, e_4^y] \begin{bmatrix} b_{11} \\ b_{21} \\ b_{31} \\ b_{41} \end{bmatrix} \div \|\operatorname{col}_1 B\| , \qquad n_2 = [e_1^y, e_2^y, e_3^y, e_4^y] \begin{bmatrix} b_{12} \\ b_{22} \\ b_{32} \\ b_{42} \end{bmatrix} \div \|\operatorname{col}_2 B\|

are computed from the normalized column vectors of the matrix B = [b_1, b_2].
The normal vectors \{n_1, n_2\} span the normal space R(A)^⊥, also called GRASSMANN (B). Alternatively, we may substitute the normal vectors n_1 and n_2 by the Grassmann coordinates (Plücker coordinates) of the matrix A, namely by the Grassmann column vector:

p_{12} = \begin{vmatrix} 1 & 1 \\ 1 & 2 \end{vmatrix} = 1 , \ p_{13} = \begin{vmatrix} 1 & 1 \\ 1 & 3 \end{vmatrix} = 2 , \ p_{14} = \begin{vmatrix} 1 & 1 \\ 1 & 10 \end{vmatrix} = 9 ,
p_{23} = \begin{vmatrix} 1 & 2 \\ 1 & 3 \end{vmatrix} = 1 , \ p_{24} = \begin{vmatrix} 1 & 2 \\ 1 & 10 \end{vmatrix} = 8 , \ p_{34} = \begin{vmatrix} 1 & 3 \\ 1 & 10 \end{vmatrix} = 7 ,

n = 4 , \ m = 2 , \ n - m = 2 ,

\sum_{i_1, i_2 = 1}^{n=4} *(e_{i_1} ∧ e_{i_2}) \, a_{i_1 1} a_{i_2 2} = \frac{1}{2!} \sum_{i_1, i_2, j_1, j_2 = 1}^{n=4} e_{j_1} ∧ e_{j_2} \, \varepsilon_{i_1 i_2 j_1 j_2} \, a_{i_1 1} a_{i_2 2} ,

g := \begin{bmatrix} p_{12} \\ p_{13} \\ p_{14} \\ p_{23} \\ p_{24} \\ p_{34} \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 9 \\ 1 \\ 8 \\ 7 \end{bmatrix} ∈ R^{6×1} .

? How do the vectors \{b_1, b_2\}, \{n_1, n_2\} and g relate to each other ?

Earlier we already normalized \{b_1, b_2\} to \{b_1^*, b_2^*\} when we constructed \{n_1, n_2\}. Then we are left with the question how to relate \{b_1^*, b_2^*\} and \{n_1, n_2\} to the Grassmann column vector g. The elements of the Grassmann column vector g(A) associated with the matrix A are the Grassmann coordinates (Plücker coordinates) \{p_{12}, p_{13}, p_{14}, p_{23}, p_{24}, p_{34}\} in lexicographical order. They originate from the dual exterior form *\alpha_m = \beta_{n-m}, where \alpha_m is the original m-exterior form associated with the matrix A. n = 4, n - m = 2:
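A short sketch (Python, standard library plus numpy; our own helper names) reproduces the six lexicographically ordered Plücker coordinates of the 4×2 design matrix as 2×2 minors:

```python
import numpy as np
from itertools import combinations

A = np.column_stack([np.ones(4), [1.0, 2.0, 3.0, 10.0]])

# p_{ij} = determinant of the 2x2 minor built from rows i and j (lexicographic order)
g = [np.linalg.det(A[[i, j], :]) for i, j in combinations(range(4), 2)]
print(np.round(g))     # [1. 2. 9. 1. 8. 7.]  =  (p12, p13, p14, p23, p24, p34)
```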
\alpha_2 := \frac{1}{2!} \sum_{i_1, i_2 = 1}^{n=4} e_{i_1} ∧ e_{i_2} \, a_{i_1 1} a_{i_2 2} =

= \frac{1}{2!} e_1 ∧ e_2 (a_{11}a_{22} - a_{21}a_{12}) + \frac{1}{2!} e_1 ∧ e_3 (a_{11}a_{32} - a_{31}a_{12}) + \frac{1}{2!} e_1 ∧ e_4 (a_{11}a_{42} - a_{41}a_{12}) +
+ \frac{1}{2!} e_2 ∧ e_3 (a_{21}a_{32} - a_{31}a_{22}) + \frac{1}{2!} e_2 ∧ e_4 (a_{21}a_{42} - a_{41}a_{22}) + \frac{1}{2!} e_3 ∧ e_4 (a_{31}a_{42} - a_{41}a_{32})
\beta_2 := *\alpha_2(R^4) = \frac{1}{4} \sum_{i_1, i_2, j_1, j_2 = 1}^{n=4} e_{j_1} ∧ e_{j_2} \, \varepsilon_{i_1 i_2 j_1 j_2} \, a_{i_1 1} a_{i_2 2} =

= \frac{1}{4} e_3 ∧ e_4 \, p_{12} + \frac{1}{4} e_2 ∧ e_4 \, p_{13} + \frac{1}{4} e_3 ∧ e_2 \, p_{14} + \frac{1}{4} e_4 ∧ e_1 \, p_{23} + \frac{1}{4} e_3 ∧ e_1 \, p_{24} + \frac{1}{4} e_1 ∧ e_2 \, p_{34} .

The Grassmann coordinates (Plücker coordinates) \{p_{12}, p_{13}, p_{14}, p_{23}, p_{24}, p_{34}\} refer to the basis \{e_3 ∧ e_4, e_2 ∧ e_4, e_3 ∧ e_2, e_4 ∧ e_1, e_3 ∧ e_1, e_1 ∧ e_2\}. Indeed, the Grassmann space G^{2,4} spanned by such a basis can alternatively be covered by the chart generated by the column vectors of the matrix B,

\gamma_2 := \sum_{j_1, j_2 = 1}^{n=4} e_{j_1} ∧ e_{j_2} \, b_{j_1 1} b_{j_2 2} ∈ \text{GRASSMANN (B)} ,

a result which is independent of the normalization of \{b_{j_1 1}, b_{j_2 2}\}.
As a summary of the result of the two examples, (i) A ∈ R^{3×2} and (ii) A ∈ R^{4×2}, a statement for a general rectangular matrix A ∈ R^{n×m}, n > m, rk A = m is needed.

"The matrix B constitutes the latent restrictions, also called latent condition equations. The column space R(B) of the matrix B coincides with the complementary column space R(A)^⊥ orthogonal to the column space R(A) of the matrix A. The elements of the matrix B are the Grassmann coordinates, also called Plücker coordinates, special subdeterminants of the matrix A = [a_{i_1}, …, a_{i_m}],

p_{j_1 j_2} := \sum_{i_1, …, i_m = 1}^{n} \varepsilon_{i_1 \cdots i_m j_1 \cdots j_{n-m}} \, a_{i_1 1} \cdots a_{i_m m} .

The latent restrictions control the parameter adjustment in the sense of identifying outliers or blunders in observational data."

3-34 From B to A: latent parametric equations, dual Grassmann coordinates, dual Plücker coordinates

While in the previous paragraph we started from a given matrix A ∈ R^{n×m}, n > m, rk A = m, representing a special inconsistent system of linear equations y = Ax + i, namely in order to construct the orthogonal complement R(A)^⊥ of R(A), we now reverse the problem. Let us assume that a matrix B ∈ R^{n×ℓ}, ℓ < n, rk B = ℓ, is given which represents a special inconsistent system of linear homogeneous condition equations B'y = B'i. How can we construct the orthogonal complement R(B)^⊥ of R(B), and how can we relate the elements of R(B)^⊥ to the matrix A of the parametric adjustment?
First, let us depart from the orthogonality condition B'A = 0, or A'B = 0, which we already introduced and discussed at length. Such an orthogonality condition had been the result of the orthogonality of the vectors y_\ell = \hat{y}(LESS) and i_\ell = \hat{i}(LESS). We recall the general solution of the homogeneous matrix equation:

B'A = 0 \ \Rightarrow \ A = [I_n - B(B'B)^{-1}B'] Z ,

which is, of course, not unique, since the matrix Z is left undetermined. Such a result is typical for orthogonality conditions. Second, let us construct the Grassmann space G^{ℓ,n}, in short GRASSMANN (B), as well as the Grassmann space G^{n-ℓ,n}, in short GRASSMANN (A), representing R(B) and R(B)^⊥, respectively.

\gamma_\ell = \frac{1}{\ell!} \sum_{j_1, …, j_\ell = 1}^{n} e_{j_1} ∧ \cdots ∧ e_{j_\ell} \, b_{j_1 1} \cdots b_{j_\ell \ell}     (3.168)

\delta_{n-\ell} := *\gamma_\ell = \frac{1}{(n-\ell)!} \sum_{i_1, …, i_{n-\ell}, \, j_1, …, j_\ell = 1}^{n} \frac{1}{\ell!} \, e_{i_1} ∧ \cdots ∧ e_{i_{n-\ell}} \, \varepsilon_{i_1 \cdots i_{n-\ell} j_1 \cdots j_\ell} \, b_{j_1 1} \cdots b_{j_\ell \ell} .

The exterior form \gamma_\ell, which is built on the column vectors \{b_{j_1}, …, b_{j_\ell}\} of the matrix B, is an element of the column space R(B). Its dual exterior form \delta_{n-\ell} = *\gamma_\ell, in contrast, is an element of the orthogonal complement R(B)^⊥.

q_{i_1 \cdots i_{n-\ell}} := \varepsilon_{i_1 \cdots i_{n-\ell} j_1 \cdots j_\ell} \, b_{j_1 1} \cdots b_{j_\ell \ell}     (3.169)

denote the Grassmann coordinates (Plücker coordinates) which are dual to the Grassmann coordinates (Plücker coordinates) p_{j_1 \cdots j_{n-m}}; q := [q_{i_1} \cdots q_{i_{n-\ell}}] is constituted by subdeterminants of the matrix B, while p := [p_{j_1} \cdots p_{j_{n-m}}] is constituted by subdeterminants of the matrix A.

The (\alpha, \beta, \gamma, \delta)-diagram of Figure 3.8 is commutative. If R(B) = R(A)^⊥, then R(B)^⊥ = R(A). Identify ℓ = n - m in order to convince yourself that the (\alpha, \beta, \gamma, \delta)-diagram is commutative.

Figure 3.8: Commutative diagram: \delta_{n-\ell} = *\gamma_\ell \leftrightarrow \alpha_m and \gamma_\ell \leftrightarrow \beta_{n-m} = *\alpha_m, identified by ℓ = n - m;
\alpha_m \to *\alpha_m = \beta_{n-m} = \gamma_{n-m} \to *\gamma_{n-m} = **\alpha_m = (-1)^{m(n-m)} \alpha_m
Third, let us specialize R(A) = R(B)^⊥ and R(A)^⊥ = R(B) by ℓ = n - m:

\alpha_m \to *\alpha_m = \beta_{n-m} = \gamma_{n-m} \to *\gamma_{n-m} = **\alpha_m = (-1)^{m(n-m)} \alpha_m .     (3.170)
The first and second example will be our candidates for test computations of the commutativity of the diagram of Figure 3.8. Box 3.25 reviews direct and inverse Grassmann coordinates (Plücker coordinates) for A ∈ R^{3×2}, B ∈ R^{3×1}, Box 3.26 for A ∈ R^{4×2}, B ∈ R^{4×2}.

Box 3.25
Direct and inverse Grassmann coordinates (Plücker coordinates): first example

The forward computation:

A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 10 \end{bmatrix} ∈ R^{3×2} : \quad a_1 = \sum_{i_1=1}^{n=3} e_{i_1} a_{i_1 1} \ \text{and} \ a_2 = \sum_{i_2=1}^{n=3} e_{i_2} a_{i_2 2}

\alpha_2 := \frac{1}{2!} \sum_{i_1, i_2 = 1}^{n=3} e_{i_1} ∧ e_{i_2} \, a_{i_1 1} a_{i_2 2} ∈ \Lambda^2(R^3) ⊂ \Lambda^m(R^n)

\beta_1 := *\alpha_2 := \sum_{i_1, i_2, j_1 = 1}^{n=3} \frac{1}{2!} \, e_{j_1} \, \varepsilon_{i_1 i_2 j_1} \, a_{i_1 1} a_{i_2 2} ∈ \Lambda^1(R^3) ⊂ \Lambda^{n-m}(R^n)

Grassmann coordinates (Plücker coordinates):

\beta_1 = \frac{1}{2} e_1 p_{23} + \frac{1}{2} e_2 p_{31} + \frac{1}{2} e_3 p_{12} ,
p_{23} = a_{21}a_{32} - a_{31}a_{22} = 8 , \quad p_{31} = a_{31}a_{12} - a_{11}a_{32} = -9 , \quad p_{12} = a_{11}a_{22} - a_{21}a_{12} = 1 .

The backward computation:

\gamma_1 := \sum_{i_1, i_2, j_1 = 1}^{n=3} \frac{1}{1!} \, e_{j_1} \, \varepsilon_{i_1 i_2 j_1} \, a_{i_1 1} a_{i_2 2} = e_1 p_{23} + e_2 p_{31} + e_3 p_{12} ∈ \Lambda^1(R^3)

\delta_2 := *\gamma_1 := \frac{1}{2!} \sum_{i_1, i_2, j_1, j_2, j_3 = 1}^{n=3} e_{i_1} ∧ e_{i_2} \, \varepsilon_{i_1 i_2 j_1} \varepsilon_{j_2 j_3 j_1} \, a_{j_2 1} a_{j_3 2}

\delta_2 = \frac{1}{2!} \sum_{i_1, i_2, j_1, j_2, j_3 = 1}^{n=3} e_{i_1} ∧ e_{i_2} \, (\delta_{i_1 j_2}\delta_{i_2 j_3} - \delta_{i_1 j_3}\delta_{i_2 j_2}) \, a_{j_2 1} a_{j_3 2}

\delta_2 = \sum_{i_1, i_2 = 1}^{n=3} e_{i_1} ∧ e_{i_2} \, a_{i_1 1} a_{i_2 2} = \alpha_2 ∈ \Lambda^2(R^3) ⊂ \Lambda^m(R^n)

inverse Grassmann coordinates (dual Grassmann coordinates, dual Plücker coordinates):

\delta_2 = \alpha_2 = \frac{1}{2} e_2 ∧ e_3 (a_{21}a_{32} - a_{31}a_{22}) + \frac{1}{2} e_3 ∧ e_1 (a_{31}a_{12} - a_{11}a_{32}) + \frac{1}{2} e_1 ∧ e_2 (a_{11}a_{22} - a_{21}a_{12})

\delta_2 = \alpha_2 = \frac{1}{2} e_2 ∧ e_3 \, q_{23} + \frac{1}{2} e_3 ∧ e_1 \, q_{31} + \frac{1}{2} e_1 ∧ e_2 \, q_{12} ∈ \Lambda^2(R^3) .
Box 3.26
Direct and inverse Grassmann coordinates (Plücker coordinates): second example

The forward computation:

A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 10 \end{bmatrix} ∈ R^{4×2} : \quad a_1 = \sum_{i_1=1}^{n=4} e_{i_1} a_{i_1 1} \ \text{and} \ a_2 = \sum_{i_2=1}^{n=4} e_{i_2} a_{i_2 2}

\alpha_2 := \frac{1}{2!} \sum_{i_1, i_2 = 1}^{n=4} e_{i_1} ∧ e_{i_2} \, a_{i_1 1} a_{i_2 2} ∈ \Lambda^2(R^4) ⊂ \Lambda^m(R^n)

\beta_2 := *\alpha_2 := \frac{1}{2!} \sum_{i_1, i_2, j_1, j_2 = 1}^{n=4} \frac{1}{2!} \, e_{j_1} ∧ e_{j_2} \, \varepsilon_{i_1 i_2 j_1 j_2} \, a_{i_1 1} a_{i_2 2} ∈ \Lambda^2(R^4) ⊂ \Lambda^{n-m}(R^n)

\beta_2 = \frac{1}{4} e_3 ∧ e_4 \, p_{12} + \frac{1}{4} e_2 ∧ e_4 \, p_{13} + \frac{1}{4} e_3 ∧ e_2 \, p_{14} + \frac{1}{4} e_4 ∧ e_1 \, p_{23} + \frac{1}{4} e_3 ∧ e_1 \, p_{24} + \frac{1}{4} e_1 ∧ e_2 \, p_{34}

p_{12} = 1 , \ p_{13} = 2 , \ p_{14} = 9 , \ p_{23} = 1 , \ p_{24} = 8 , \ p_{34} = 7 .

The backward computation:

\gamma_2 := \frac{1}{2!} \sum_{i_1, i_2, j_1, j_2 = 1}^{n=4} e_{j_1} ∧ e_{j_2} \, \varepsilon_{i_1 i_2 j_1 j_2} \, a_{i_1 1} a_{i_2 2} ∈ \Lambda^2(R^4) ⊂ \Lambda^{n-m}(R^n)

\delta_2 := *\gamma_2 := \frac{1}{2!} \sum_{i_1, i_2, j_1, j_2, j_3, j_4 = 1}^{n=4} \frac{1}{2!} \, e_{i_1} ∧ e_{i_2} \, \varepsilon_{i_1 i_2 j_1 j_2} \varepsilon_{j_1 j_2 j_3 j_4} \, a_{j_3 1} a_{j_4 2} = \alpha_2 ∈ \Lambda^2(R^4) ⊂ \Lambda^m(R^n)

\delta_2 = \alpha_2 = \frac{1}{2} \sum_{i_1, i_2 = 1}^{n=4} e_{i_1} ∧ e_{i_2} \, a_{i_1 1} a_{i_2 2}

\delta_2 = \alpha_2 = \frac{1}{4} e_3 ∧ e_4 \, q_{12} + \frac{1}{4} e_2 ∧ e_4 \, q_{13} + \frac{1}{4} e_3 ∧ e_2 \, q_{14} + \frac{1}{4} e_4 ∧ e_1 \, q_{23} + \frac{1}{4} e_3 ∧ e_1 \, q_{24} + \frac{1}{4} e_1 ∧ e_2 \, q_{34}

q_{12} = p_{12} , \ q_{13} = p_{13} , \ q_{14} = p_{14} , \ q_{23} = p_{23} , \ q_{24} = p_{24} , \ q_{34} = p_{34} .
3-35 Break points
Throughout the analysis of high leverage points and outliers within the observational data we assumed a fixed linear model. In reality, such an assumption does not always apply: the functional model may change with time, as Figure 3.9 indicates. Indeed, we have to break up the linear model into pieces. Break points have to be introduced as those points at which the linear model changes. Of course, a hypothesis test has to decide whether a break point exists with a certain probability. Here we only highlight the notion of break points in the context of leverage points. For localizing break points we apply the Gauss-Jacobi Combinatorial Algorithm, following J. L. Awange (2002), A. T. Hornoch (1950) and S. Wellisch (1910).

Figure 3.9: Graph of the function y(t), two break points
Figure 3.10: Gauss-Jacobi Combinatorial Algorithm, piecewise linear model, 1st cluster: (t_i, t_j)
Figure 3.11: Gauss-Jacobi Combinatorial Algorithm, 2nd cluster: (t_i, t_j)
Figure 3.12: Gauss-Jacobi Combinatorial Algorithm, 3rd cluster: (t_i, t_j)
Table 3.1: Test "break points": observations for a piecewise linear model

observation:   y1   y2   y3   y4   y5   y6   y7    y8   y9   y10
value y:       1    2    2    3    2    1    0.5   2    4    4.5
epoch:         t1   t2   t3   t4   t5   t6   t7    t8   t9   t10
value t:       1    2    3    4    5    6    7     8    9    10
Table 3.1 summarises a set of observations y_i with n = 10 elements. Those measurements have been taken at the time instants {t_1, …, t_10}. Figure 3.9 illustrates the graph of the corresponding function y(t). By means of the celebrated Gauss-Jacobi Combinatorial Algorithm we aim at localizing break points. First, outlined in Box 3.27, we determine all the combinations of two points which allow the fit of a straight line without any approximation error. For the determined linear model y = Ax, A ∈ R^{2×2}, rk A = 2, namely x = A^{-1}y, we calculate x_1 (3.172) and x_2 (3.173) in closed form. For instance, the pair of observations (y_1, y_2), in short (1, 2), at (t_1, t_2) = (1, 2) determines (x_1, x_2) = (0, 1). Alternatively, the pair of observations (y_1, y_3), in short (1, 3), at (t_1, t_3) = (1, 3) leads us to (x_1, x_2) = (0.5, 0.5). Table 3.2 contains the 45 possible combinations which determine (x_1, x_2) from (y_1, …, y_10). Those solutions are plotted in Figures 3.10, 3.11 and 3.12.

Box 3.27
Piecewise linear model: Gauss-Jacobi combinatorial algorithm, 1st step

y = \begin{bmatrix} y(t_i) \\ y(t_j) \end{bmatrix} = \begin{bmatrix} 1 & t_i \\ 1 & t_j \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = Ax , \quad i < j ∈ \{1, …, n\}     (3.171)

y ∈ R^2 , \ A ∈ R^{2×2} , \ \operatorname{rk} A = 2 , \ x ∈ R^2

x = A^{-1} y \ \Leftrightarrow \ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \frac{1}{t_j - t_i} \begin{bmatrix} t_j & -t_i \\ -1 & 1 \end{bmatrix} \begin{bmatrix} y(t_i) \\ y(t_j) \end{bmatrix}

x_1 = \frac{t_j y_i - t_i y_j}{t_j - t_i}     (3.172)   and   x_2 = \frac{y_j - y_i}{t_j - t_i} .     (3.173)
Example: t_i = t_1 = 1, t_j = t_2 = 2; y(t_1) = y_1 = 1, y(t_2) = y_2 = 2; hence x_1 = 0, x_2 = 1.
Example: t_i = t_1 = 1, t_j = t_3 = 3; y(t_1) = y_1 = 1, y(t_3) = y_3 = 2; hence x_1 = 0.5 and x_2 = 0.5.

Table 3.2 (all 45 pairwise solutions (x_1, x_2)^{(i,j)} together with (vech G_x)')
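The 45 combinatorial solutions of Table 3.2 follow directly from (3.172)-(3.173). A small sketch (Python/numpy, our own variable names) computes all pairs and reproduces the two worked examples:

```python
import numpy as np
from itertools import combinations

t = np.arange(1.0, 11.0)                        # t_1, ..., t_10
y = np.array([1, 2, 2, 3, 2, 1, 0.5, 2, 4, 4.5])

solutions = {}
for i, j in combinations(range(10), 2):         # 45 pairs, i < j
    x1 = (t[j] * y[i] - t[i] * y[j]) / (t[j] - t[i])   # (3.172)
    x2 = (y[j] - y[i]) / (t[j] - t[i])                 # (3.173)
    solutions[(i + 1, j + 1)] = (x1, x2)

print(len(solutions))          # 45
print(solutions[(1, 2)])       # (0.0, 1.0)
print(solutions[(1, 3)])       # (0.5, 0.5)
```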
Second, we introduce the pullback operation G_y → G_x. The matrix of the metric G_y of the observation space Y is pulled back by (3.174) to generate the matrix of the metric G_x of the parameter space X for the "determined linear model" y = Ax, A ∈ R^{2×2}, rk A = 2, namely G_x = A'G_y A. If the observation space Y = span{e_1^y, e_2^y} is spanned by two orthonormal vectors e_1^y, e_2^y relating to a pair of observations (y_i, y_j), i < j, i, j ∈ {1, …, 10}, then the matrix of the metric G_y = I_2 of the observation space is the unit matrix. In such an experimental situation, (3.175) G_x = A'A is derived. For the first example (t_i, t_j) = (1, 2) we are led to vech G_x = [2, 3, 5]'. "Vech half" shortens the matrix of the metric G_x ∈ R^{2×2} of the parameter space X(x_1, x_2) by stacking the columns of the lower triangle of the symmetric matrix G_x. Similarly, for the second example (t_i, t_j) = (1, 3) we produce vech G_x = [2, 4, 10]', and so on for all 45 combinations of observations (y_i, y_j). In its last column, Table 3.2 contains the necessary information on the matrix of the metric G_x of the parameter space X in the form (vech G_x)'. Indeed, such a notation is quite economical.

Box 3.28
Piecewise linear model: Gauss-Jacobi combinatorial algorithm, 2nd step
pullback of the matrix of the metric G_x from G_y

G_x = A' G_y A     (3.174)

"if G_y = I_2, then G_x = A'A"

G_x = A'A = \begin{bmatrix} 2 & t_i + t_j \\ t_i + t_j & t_i^2 + t_j^2 \end{bmatrix} , \quad i < j ∈ \{1, …, n\} .     (3.175)

Example: t_i = t_1 = 1, t_j = t_2 = 2:
G_x = \begin{bmatrix} 2 & 3 \\ 3 & 5 \end{bmatrix} , \ \operatorname{vech} G_x = [2, 3, 5]' .

Example: t_i = t_1 = 1, t_j = t_3 = 3:
G_x = \begin{bmatrix} 2 & 4 \\ 4 & 10 \end{bmatrix} , \ \operatorname{vech} G_x = [2, 4, 10]' .

Third, we are left with the problem of identifying the break points. C. F. Gauss (1828) and C. G. J. Jacobi (1841) have proposed to take the weighted arithmetic mean of the combinatorial solutions (x_1, x_2)^{(1,2)}, (x_1, x_2)^{(1,3)}, in general (x_1, x_2)^{(i,j)}, i < j, which are considered as pseudo-observations.
Box 3.29
Piecewise linear model: Gauss-Jacobi combinatorial algorithm, 3rd step
pseudo-observations

Example:

\begin{bmatrix} x_1^{(1,2)} \\ x_2^{(1,2)} \\ x_1^{(1,3)} \\ x_2^{(1,3)} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} i_1 \\ i_2 \\ i_3 \\ i_4 \end{bmatrix} ∈ R^{4×1}     (3.176)

G_x-LESS:

x_\ell := \hat{x} = \begin{bmatrix} \hat{x}_1 \\ \hat{x}_2 \end{bmatrix} = \left[ G_x^{(1,2)} + G_x^{(1,3)} \right]^{-1} \left[ G_x^{(1,2)}, G_x^{(1,3)} \right] \begin{bmatrix} x_1^{(1,2)} \\ x_2^{(1,2)} \\ x_1^{(1,3)} \\ x_2^{(1,3)} \end{bmatrix} ∈ R^{2×1}     (3.177)

\operatorname{vech} G_x^{(1,2)} = [2, 3, 5]' , \quad \operatorname{vech} G_x^{(1,3)} = [2, 4, 10]'

G_x^{(1,2)} = \begin{bmatrix} 2 & 3 \\ 3 & 5 \end{bmatrix} , \quad G_x^{(1,3)} = \begin{bmatrix} 2 & 4 \\ 4 & 10 \end{bmatrix}

\begin{bmatrix} x_1^{(1,2)} \\ x_2^{(1,2)} \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} , \quad \begin{bmatrix} x_1^{(1,3)} \\ x_2^{(1,3)} \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}

G_x^{-1} = \left[ G_x^{(1,2)} + G_x^{(1,3)} \right]^{-1} = \begin{bmatrix} 4 & 7 \\ 7 & 15 \end{bmatrix}^{-1} = \frac{1}{11} \begin{bmatrix} 15 & -7 \\ -7 & 4 \end{bmatrix}

x_\ell := \begin{bmatrix} \hat{x}_1 \\ \hat{x}_2 \end{bmatrix} = \frac{1}{11} \begin{bmatrix} 6 \\ 6 \end{bmatrix} , \quad \hat{x}_1 = \hat{x}_2 = \frac{6}{11} = 0.545454… .

Box 3.29 provides us with an example for establishing the third step of the Gauss-Jacobi Combinatorial Algorithm. We outline G_x-LESS for the set of pseudo-observations (3.176), (x_1, x_2)^{(1,2)} and (x_1, x_2)^{(1,3)}, solved by (3.177), x_\ell = (\hat{x}_1, \hat{x}_2). The matrices G_x^{(1,2)} and G_x^{(1,3)}, representing the metric of the parameter space X derived from (x_1, x_2)^{(1,2)} and (x_1, x_2)^{(1,3)}, are additively composed and inverted, a result which is motivated by the special design matrix A = [I_2, I_2]' of "direct" pseudo-observations. As soon as we implement the weight matrices G_x^{(1,2)} and G_x^{(1,3)} from Table 3.2 as well as (x_1, x_2)^{(1,2)} and (x_1, x_2)^{(1,3)}, we are led to the weighted arithmetic mean \hat{x}_1 = \hat{x}_2 = 6/11. Such a result has to be compared with the componentwise median, x_1(median) = 1/4 and x_2(median) = 3/4.
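The weighted arithmetic mean of Box 3.29 is easy to reproduce; the following sketch (Python/numpy, our own names) pulls back the metric for the two pairs (1,2) and (1,3) and forms the G_x-weighted mean, giving x̂_1 = x̂_2 = 6/11 ≈ 0.5455.

```python
import numpy as np

def pair_fit(ti, tj, yi, yj):
    """Exact two-point fit (3.172)-(3.173) and pulled-back metric G_x = A'A (3.175)."""
    A = np.array([[1.0, ti], [1.0, tj]])
    x = np.linalg.solve(A, np.array([yi, yj]))
    return x, A.T @ A

x12, G12 = pair_fit(1, 2, 1.0, 2.0)      # (0, 1),    G = [[2,3],[3,5]]
x13, G13 = pair_fit(1, 3, 1.0, 2.0)      # (0.5, 0.5), G = [[2,4],[4,10]]

# third step: G_x-weighted arithmetic mean of the pseudo-observations (3.177)
x_hat = np.linalg.solve(G12 + G13, G12 @ x12 + G13 @ x13)
print(x_hat)                             # [0.54545455 0.54545455] = 6/11
```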
combination (1,2), (1,3), G_x-LESS:   \hat{x}_1 = 0.545454… , \ \hat{x}_2 = 0.545454…
combination (1,2), (1,3), median:     x_1(\text{median}) = 0.250 , \ x_2(\text{median}) = 0.750

Here the arithmetic mean of x_1^{(1,2)}, x_1^{(1,3)} and of x_2^{(1,2)}, x_2^{(1,3)} coincides with the median, neglecting the weights of the pseudo-observations.

Box 3.30
Piecewise linear models and two break points: "Example"

\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 1_{n_1} & t_{n_1} & 0 & 0 & 0 & 0 \\ 0 & 0 & 1_{n_2} & t_{n_2} & 0 & 0 \\ 0 & 0 & 0 & 0 & 1_{n_3} & t_{n_3} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix} + \begin{bmatrix} i_{y_1} \\ i_{y_2} \\ i_{y_3} \end{bmatrix}     (3.178)

I_{n_1}-LESS, I_{n_2}-LESS, I_{n_3}-LESS:

\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_\ell = \frac{1}{n_1 t_{n_1}' t_{n_1} - (1_{n_1}' t_{n_1})^2} \begin{bmatrix} (t_{n_1}' t_{n_1})(1_{n_1}' y_1) - (1_{n_1}' t_{n_1})(t_{n_1}' y_1) \\ -(1_{n_1}' t_{n_1})(1_{n_1}' y_1) + n_1 t_{n_1}' y_1 \end{bmatrix}     (3.179)

\begin{bmatrix} x_3 \\ x_4 \end{bmatrix}_\ell = \frac{1}{n_2 t_{n_2}' t_{n_2} - (1_{n_2}' t_{n_2})^2} \begin{bmatrix} (t_{n_2}' t_{n_2})(1_{n_2}' y_2) - (1_{n_2}' t_{n_2})(t_{n_2}' y_2) \\ -(1_{n_2}' t_{n_2})(1_{n_2}' y_2) + n_2 t_{n_2}' y_2 \end{bmatrix}     (3.180)

\begin{bmatrix} x_5 \\ x_6 \end{bmatrix}_\ell = \frac{1}{n_3 t_{n_3}' t_{n_3} - (1_{n_3}' t_{n_3})^2} \begin{bmatrix} (t_{n_3}' t_{n_3})(1_{n_3}' y_3) - (1_{n_3}' t_{n_3})(t_{n_3}' y_3) \\ -(1_{n_3}' t_{n_3})(1_{n_3}' y_3) + n_3 t_{n_3}' y_3 \end{bmatrix}     (3.181)

Box 3.31
Piecewise linear models and two break points: "Example"

1st interval: n = 4, m = 2, t ∈ \{t_1, t_2, t_3, t_4\}

y_1 = \begin{bmatrix} 1 & t_1 \\ 1 & t_2 \\ 1 & t_3 \\ 1 & t_4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + i_{y_1} = 1_{n_1} x_1 + t_{n_1} x_2 + i_{y_1}     (3.182)
2nd interval: n = 4, m = 2, t ∈ \{t_4, t_5, t_6, t_7\}

y_2 = \begin{bmatrix} 1 & t_4 \\ 1 & t_5 \\ 1 & t_6 \\ 1 & t_7 \end{bmatrix} \begin{bmatrix} x_3 \\ x_4 \end{bmatrix} + i_{y_2} = 1_{n_2} x_3 + t_{n_2} x_4 + i_{y_2}     (3.183)

3rd interval: n = 4, m = 2, t ∈ \{t_7, t_8, t_9, t_{10}\}

y_3 = \begin{bmatrix} 1 & t_7 \\ 1 & t_8 \\ 1 & t_9 \\ 1 & t_{10} \end{bmatrix} \begin{bmatrix} x_5 \\ x_6 \end{bmatrix} + i_{y_3} = 1_{n_3} x_5 + t_{n_3} x_6 + i_{y_3} .
(3.184)
Figures 3.10, 3.11 and 3.12 have illustrated the three clusters of combinatorial solutions referring to the first, second and third straight line. As outlined in Box 3.30 and Box 3.31, namely by (3.178) to (3.181) with n_1 = n_2 = n_3 = 4, we have computed (x_1, x_2)_\ell for the first segment, (x_3, x_4)_\ell for the second segment and (x_5, x_6)_\ell for the third segment of the least squares fit of the straight line. Table 3.3 contains the results explicitly. Similarly, by means of the Gauss-Jacobi Combinatorial Algorithm of Table 3.4 we have computed the identical solutions (x_1, x_2)_\ell, (x_3, x_4)_\ell and (x_5, x_6)_\ell as "weighted arithmetic means", numerically presented only for the first segment.

Table 3.3: I_n-LESS solutions for the three segments of a straight line, two break points

I_4-LESS: [x_1, x_2]_\ell' = [0.5, 0.6]' :   y(t) = 0.5 + 0.6 t
I_4-LESS: [x_3, x_4]_\ell' = (1/20)[126, -17]' :   y(t) = 6.30 - 0.85 t
I_4-LESS: [x_5, x_6]_\ell' = (1/20)[-183, 28]' :   y(t) = -9.15 + 1.40 t
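The three segment fits of Table 3.3 follow from ordinary least squares on the observation windows {t1,…,t4}, {t4,…,t7} and {t7,…,t10} of Table 3.1; a brief sketch (Python/numpy, our own names):

```python
import numpy as np

t = np.arange(1.0, 11.0)
y = np.array([1, 2, 2, 3, 2, 1, 0.5, 2, 4, 4.5])

for rows in (slice(0, 4), slice(3, 7), slice(6, 10)):   # the three segments
    A = np.column_stack([np.ones(4), t[rows]])
    x = np.linalg.solve(A.T @ A, A.T @ y[rows])         # I4-LESS per segment
    print(np.round(x, 2))
# [ 0.5   0.6 ]   -> y(t) =  0.50 + 0.60 t
# [ 6.3  -0.85]   -> y(t) =  6.30 - 0.85 t
# [-9.15  1.4 ]   -> y(t) = -9.15 + 1.40 t
```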
Table 3.4: Gauss-Jacobi Combinatorial Algorithm for the first segment of a straight line

\begin{bmatrix} \hat{x}_1 \\ \hat{x}_2 \end{bmatrix} = \left[ G_x^{(1,2)} + G_x^{(1,3)} + G_x^{(1,4)} + G_x^{(2,3)} + G_x^{(2,4)} + G_x^{(3,4)} \right]^{-1} \left[ G_x^{(1,2)}, …, G_x^{(3,4)} \right] \begin{bmatrix} x^{(1,2)} \\ \vdots \\ x^{(3,4)} \end{bmatrix}

\left[ G_x^{(1,2)} + G_x^{(1,3)} + G_x^{(1,4)} + G_x^{(2,3)} + G_x^{(2,4)} + G_x^{(3,4)} \right]^{-1} = \frac{1}{30} \begin{bmatrix} 15 & -5 \\ -5 & 2 \end{bmatrix}

G_x^{(1,2)} x^{(1,2)} = \begin{bmatrix} 2 & 3 \\ 3 & 5 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 5 \end{bmatrix} , \quad
G_x^{(1,3)} x^{(1,3)} = \begin{bmatrix} 2 & 4 \\ 4 & 10 \end{bmatrix} \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 3 \\ 7 \end{bmatrix} ,

G_x^{(1,4)} x^{(1,4)} = \begin{bmatrix} 2 & 5 \\ 5 & 17 \end{bmatrix} \begin{bmatrix} 0.333 \\ 0.666 \end{bmatrix} = \begin{bmatrix} 4 \\ 13 \end{bmatrix} , \quad
G_x^{(2,3)} x^{(2,3)} = \begin{bmatrix} 2 & 5 \\ 5 & 13 \end{bmatrix} \begin{bmatrix} 2 \\ 0 \end{bmatrix} = \begin{bmatrix} 4 \\ 10 \end{bmatrix} ,

G_x^{(2,4)} x^{(2,4)} = \begin{bmatrix} 2 & 6 \\ 6 & 20 \end{bmatrix} \begin{bmatrix} 1 \\ 0.5 \end{bmatrix} = \begin{bmatrix} 5 \\ 16 \end{bmatrix} , \quad
G_x^{(3,4)} x^{(3,4)} = \begin{bmatrix} 2 & 7 \\ 7 & 25 \end{bmatrix} \begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} 5 \\ 18 \end{bmatrix} ,

G_x^{(1,2)} x^{(1,2)} + \cdots + G_x^{(3,4)} x^{(3,4)} = \begin{bmatrix} 24 \\ 69 \end{bmatrix}

\begin{bmatrix} \hat{x}_1 \\ \hat{x}_2 \end{bmatrix} = \frac{1}{30} \begin{bmatrix} 15 & -5 \\ -5 & 2 \end{bmatrix} \begin{bmatrix} 24 \\ 69 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.6 \end{bmatrix} .     (3.185)
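That the combinatorial "weighted arithmetic mean" reproduces the direct I_4-LESS solution of the first segment can be confirmed numerically; a sketch under the same assumptions as above:

```python
import numpy as np
from itertools import combinations

t = np.array([1.0, 2.0, 3.0, 4.0])     # first segment of Table 3.1
y = np.array([1.0, 2.0, 2.0, 3.0])

G_sum = np.zeros((2, 2))
rhs = np.zeros(2)
for i, j in combinations(range(4), 2):             # all 6 point pairs
    A = np.array([[1.0, t[i]], [1.0, t[j]]])
    x_pair = np.linalg.solve(A, y[[i, j]])         # exact two-point solution
    G = A.T @ A                                    # pulled-back metric
    G_sum += G
    rhs += G @ x_pair

x_gj = np.linalg.solve(G_sum, rhs)                 # Gauss-Jacobi weighted mean (3.185)

A_full = np.column_stack([np.ones(4), t])
x_ls = np.linalg.solve(A_full.T @ A_full, A_full.T @ y)   # direct I4-LESS

print(np.round(x_gj, 2), np.round(x_ls, 2))        # [0.5 0.6] [0.5 0.6]
print(np.allclose(x_gj, x_ls))                     # True
```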
3-4 Special linear and nonlinear models: A family of means for direct observations

In case of direct observations, LESS of the inconsistent linear model y = 1_n x + i has led to

x_\ell = \arg\{ y = 1_n x + i \ | \ \|i\|^2 = \min \} , \qquad x_\ell = (1'1)^{-1} 1'y = \frac{1}{n}(y_1 + \cdots + y_n) .

Such a mean has been the starting point of many alternatives, which we present to you in Table 3.5, based upon S. R. Wassel's (2002) review.

Table 3.5: A family of means

arithmetic mean:
x_A = \frac{1}{n}(y_1 + \cdots + y_n) ;   n = 2 : x_A = \frac{1}{2}(y_1 + y_2)

weighted arithmetic mean:
x_A = (1' G_y 1)^{-1} 1' G_y y ;   if G_y = \operatorname{Diag}(g_1, …, g_n) then x_A = \frac{g_1 y_1 + \cdots + g_n y_n}{g_1 + \cdots + g_n}

geometric mean:
x_g = \sqrt[n]{y_1 \cdots y_n} = \left( \prod_{i=1}^{n} y_i \right)^{1/n} ;   n = 2 : x_g = \sqrt{y_1 y_2}

logarithmic mean:
x_{\log} = \frac{1}{n}(\ln y_1 + \cdots + \ln y_n) = \ln x_g

median:
y_{(1)} < \cdots < y_{(n)} ordered set of observations;
\operatorname{med} y = y_{(k+1)} if n = 2k+1 ("odd"),   \operatorname{med} y = [y_{(k)} + y_{(k+1)}]/2 if n = 2k ("even")
Wassel´s family of means
185 xp =
( y1 ) p +1 + " + ( yn ) p +1 ( y1 ) p + " + ( yn ) p n
x p = ¦ ( yi ) p +1 i =1
n
¦(y )
p
i
i =1
Case p=0: x p = xA Case p= 1/ 2 , n=2: x p = xA Hellenic mean
Case p= –1: n=2: 1
§ y 1 + y21 · 2 y1 y2 H = H ( y1 , y2 ) = ¨ 1 . ¸ = 2 y1 + y2 © ¹
3-5 A historical note on C.F. Gauss, A.M. Legendre and the inventions of Least Squares and its generalization The historian S.M. Stigler (1999, pp 320, 330-331) made the following comments on the history of Least Squares. “The method of least squares is the automobile of modern statistical analysis: despite its limitations, occasional accidents, and incidental pollution, this method and its numerous variations, extensions, and related conveyances carry the bulk of statistical analyses, and are known and valued by nearly all. But there has been some dispute, historically, as to who is the Henry Ford of statistics. Adrian Marie Legendre published the method in 1805, an American, Robert Adrian, published the method in late 1808 or early 1809, and Carl Fiedrich Gauss published the method in 1809. Legendre appears to have discovered the method in early 1805, and Robert Adrain may have “discovered” it in Legendre’s 1805 book (Stigler 1977c, 1978c), but in 1809 Gauss had the temerity to claim that he had been using the method since 1795, and one of the most famous priority disputes in the history of science was off and running. It is unnecessary to repeat the details of the dispute – R.L. Plackett (1972) has done a masterly job of presenting and summarizing the evidence in the case. Let us grant, then, that Gauss’s later accurate were substantially accunate, and that he did device the method of least squares between 1794 and 1799, independently of Legendre or any other discoverer. There still remains the question, what importance did he attach to the discovery? Here the answer must be that while Gauss himself may have felt the method useful, he was unsuccessful in communicating its importance to other before 1805. He may indeed have mentioned the method to Olbers, Lindemau, or von Zach before 1805, but in the total lack of applications by others, despite ample opportunity, suggests the message was not understood. The fault may have been more in the listener than in the teller, but in this case its failure serves only to enhance our admiration for Legendre’s 1805 success. For Legendre’s description of the method had an immediate and
3 The second problem of algebraic regression widespread effect – as we have seen, it even caught the eye and understanding of at least one of those astronomers (Lindemau) who had been deaf to Gauss’s message, and perhaps it also had an influence upon the form and emphasis of Gauss’s exposition of the method. When Gauss did publish on least squares, he went far beyond Legendre in both conceptual and technical development, linking the method to probability and providing algorithms for the computation of estimates. His work has been discussed often, including by H.L. Seal (1967), L. Eisenhart (1968), H. Goldsteine (1977, §§ 4.9, 4.10), D.A. Sprott (1978),O.S. Sheynin (1979, 1993, 1994), S.M. Stigler (1986), J.L. Chabert (1989), W.C. Waterhouse (1990), G.W. Stewart (1995), and J. Dutka (1996). Gauss’s development had to wait a long time before finding an appreciative audience, and much was intertwined with other’s work, notably Laplace’s. Gauss was the first among mathematicians of the age, but it was Legendre who crystallized the idea in a form that caught the mathematical public’s eye. Just as the automobile was not the product of one man of genius, so too the method of least squares is due to many, including at least two independent discoverers. Gauss may well have been the first of these, but he was no Henry Ford of statistics. If these was any single scientist who first put the method within the reach of the common man, it was Legendre.”
Indeed, these is not much to be added. G.W. Stewart (1995) recently translated the original Gauss text “Theoria Combinationis Observationum Erroribus Minimis Obmaxial, Pars Prior. Pars Posterior “ from the Latin origin into English. F. Pukelsheim (1998) critically reviewed the sources, the reset Latin text and the quality of the translation. Since the English translation appeared in the SIAM series “ Classics in Applied Mathematics”, he concluded: “ Opera Gaussii contra SIAM defensa”. “Calculus probilitatis contra La Place defenses.” This is Gauss’s famous diary entry of 17 June 1798 that he later quoted to defend priority on the Method of Least Squares (Werke, Band X, 1, p.533). C.F. Gauss goes Internet With the Internet Address http://gallica.bnf.fr you may reach the catalogues of digital texts of Bibliotheque Nationale de France. Fill the window “Auteur” by “Carl Friedrich Gauss” and you reach “Types de documents”. Continue with “Touts les documents” and click “Rechercher” where you find 35 documents numbered 1 to 35. In total, 12732 “Gauss pages” are available. Only the GaussGerling correspondence is missing. The origin of all texts are the resources of the Library of the Ecole Polytechnique. Meanwhile Gauss’s Werke are also available under http://www.sub.unigoettingen.de/. A CD-Rom is available from “Niedersächsische Staats- und Universitätsbibliothek.” For the early impact of the Method of Least Squares on Geodesy, namely W. Jordan, we refer to the documentary by S. Nobre and M. Teixeira (2000).
4
The second problem of probabilistic regression – special Gauss-Markov model without datum defect – Setup of BLUUE for the moments of first order and of BIQUUE for the central moment of second order : Fast track reading : Read only Theorem 4.3 and Theorem 4.13.
Lemma 4.2 ȟˆ : Ȉ y -BLUUE of ȟ
Definition 4.1 ˆȟ : Ȉ -BLUUE of ȟ y
Theorem 4.3 ȟˆ : Ȉ y -BLUUE of ȟ
Lemma 4.4 E{yˆ }, Ȉ y -BLUUE of E{y} e y , D{e y }, D{y}
“The first guideline of chapter four: definition, lemmas and theorem” In 1823, supplemented in 1828, C. F. Gauss put forward a new substantial generalization of “least squares” pointing out that an integral measure of loss, more definitely the principle of minimum variance, was preferable to least squares and to maximum likelihood. He abandoned both his previous postulates and set high store by the formula Vˆ 2 which provided an unbiased estimate of variance V 2 . C. F. Gauss’s contributions to the treatment of erroneous observations, lateron extended by F. R. Helmert, defined the state of the classical theory of errors. To the analyst C. F. Gauss’s preference to estimators of type BLUUE (Best Linear Uniformly Unbiased Estimator) for the moments of first order as well as of type BIQUUE (Best Invariant Quadratic Uniformly Unbiased Estimator) for the moments of second order is completely unknown. Extended by A. A. Markov who added correlated observations to the Gauss unbiased minimum variance
188
4 The second problem of probabilistic regression
estimator we present to you BLUUE of fixed effects and Ȉ y -BIQUUE of the variance component. “The second guideline of chapter four: definitions, lemmas, corollaries and theorems”
Theorem 4.5 equivalence of Ȉ y -BLUUE and G y -LESS Corollary 4.6 multinomial inverse
Definition 4.7 invariant quadratic estimation Vˆ 2 of V 2 : IQE
Lemma 4.8 invariant quadratic estimation Vˆ 2 of V 2 : IQE
Definition 4.9 variance-covariance components model Vˆ k IQE of V k
Lemma 4.10 invariant quadratic estimation Vˆ k of V k : IQE eigenspace
Definition 4.11 invariant quadratic unformly unbiased estimation: IQUUE
Lemma 4.12 invariant quadratic unformly unbiased estimation: IQUUE Lemma 4.13 var-cov components: IQUUE
Corollary 4.14 translational invariance
Corollary 4.15 IQUUE of Helmert type: HIQUUE Corollary 4.16 Helmert equation det H z 0 Corollary 4.17 Helmert equation det H = 0
Definition 4.18 best IQUUE
Corollary 4.19 Gauss normal IQE
Lemma 4.20 Best IQUUE
Theorem 4.21 2 Vˆ BIQUUE of V
In the third chapter we have solved a special algebraic regression problem, namely the inversion of a system of inconsistent linear equations of full column rank classified as “overdetermined”. By means of the postulate of a least squares solution || i ||2 =|| y Ax ||2 = min we were able to determine m unknowns from n observations ( n > m : more equations n than unknowns m). Though “LESS” generated a unique solution to the “overdetermined” system of linear equations with full column rank, we are unable to classify “LESS”. There are two key questions we were not able to answer so far: In view of “MINOS” versus “LUMBE” we want to know whether “LESS” produces an unbiased estimation or not. How can we attach to an objective accuracy measure “LESS”?
The key for evaluating “LESS” is handed over to us by treating the special algebraic regression problem by means of a special probabilistic regression problem, namely a special Gauss-Markov model without datum defect. We shall prove that uniformly unbiased estimations of the unknown parameters of type “fixed effects” exist. “LESS” is replaced by “BLUUE” (Best Linear Uniformly Unbiased Estimation). The fixed effects constitute the moments of first order of the underlying probability distributions of the observations to be specified. In contrast, its central moments of second order, known as the variance-covariance matrix or dispersion matrix, open the door to associate to the estimated fixed effects an objective accuracy measure. ? What is a probabilistic problem ? By means of certain statistical objective function, here of type “best linear uniformly unbiased estimation” (BLUUE) for moments of first order
“best quadratic invariant uniformly unbiased estimation” (BIQUUE) for the central moments of second order
we solve the inverse problem of linear, lateron nonlinear equations with fixed effects which relates stochastic observations to parameters. According to the Measurement Axiom, observations are elements of a probability space. In terms of second order statistics the observation space Y of integer dimension, dim Y = n , is characterized by the first moment E{y} , the expectation of y {Y, pdf }
and
the central second moment D{y} the dispersion matrix or variance-covariance matrix Ȉ y .
In the case of “fixed effects” we consider the parameter space Ȅ , dim Ȅ = m , to be metrical. Its metric is induced from the probabilistic measure of the metric, the variance-covariance matrix Ȉ y of the observations y {Y, pdf } . In particular, its variance-covariance matrix is pulled-back from the variancecovariance matrix Ȉ y . In the special probabilistic regression model with unknown “fixed effects” ȟ Ȅ (elements of the parameter space) are estimated while the random variables like y E{y} are predicted.
4-1 Introduction Our introduction has four targets. First, we want to introduce Pˆ , a linear estimation of the mean value of “direct” observations, and Vˆ 2 , a quadratic estimation of their variance component. For such a simple linear model we outline the postulates of uniform unbiasedness and of minimum variance. We shall pay special
4-1 Introduction
attention to the key role of the invariant quadratic estimation (“IQE”) Vˆ 2 of V 2 . Second, we intend to analyse two data sets, the second one containing an outlier, by comparing the arithmetic mean and the median as well as the “root mean square error” (r.m.s.) of type BIQUUE and the “median absolute deviation” (m.a.d.). By proper choice of the bias term we succeed to prove identity of the weighted arithmetic mean and the median for the data set corrupted by an obvious outlier. Third, we discuss the competitive estimator “MALE”, namely Maximum Likelihood Estimation which does not produce an unbiased estimation Vˆ 2 of V 2 , in general. Fourth, in order to develop the best quadratic uniformly unbiased estimation Vˆ 2 of V 2 , we have to highlight the need for fourth order statistic. “IQE” as well as “IQUUE” depend on the central moments of fourth order which are reduced to central moments of second order if we assume “quasi-normal distributed” observations. 4-11
The front page example
By means of Table 4.1 let us introduce a set of “direct” measurements yi , i {1, 2, 3, 4, 5} of length data. We shall outline how we can compute the arithmetic mean 13.0 as well as the standard deviation of 1.6. Table 4.1: “direct” observations, comparison of mean and median (mean y = 13, med y = 13, [n / 2] = 2, [n / 2]+1 = 3, med y = y (3) , mad y = med| y ( i ) med y | = 1, r.m.s. (I-BIQUUE) = 1.6) number of observation
observation
1 2 3 4 5
15 12 14 11 13
yi
difference of difference of observation observation and mean and median
+2 -1 +1 -2 0
+2 -1 +1 -2 0
ordered set ordered set of ordered set of of observa- | y ( i ) med y | tions y( i ) y( i ) mean y
11 12 13 14 15
0 1 1 2 2
+2 -1 +1 -1 0
In contrast, Table 4.2 presents an augmented observation vector: The observations six is an outlier. Again we have computed the new arithmetic mean 30.16 as well as the standard deviation 42.1. In addition, for both examples we have calculated the sample median and the sample absolute deviation for comparison. All definitions will be given in the context as well as a careful analysis of the two data sets. Table 4.2: “direct” observations, effect of one outlier (mean y = 30.16 , med y = (13+14) / 2 = 13.5, r.m.s. (I-BLUUE) = 42.1, med y ( i ) med y = mad y = 1.5)
number of
observation
observation
yi
1 2 3 4 5 6 4-12
difference of difference of
ordered set ordered set of
observation
observation
of observa-
and mean
and median
tions
15.16 18.16 16.16 19.16 17.16 +85.83
+1.5 -1.5 +0.5 -2.5 -0.5 +102.5
15 12 14 11 13 116
ordered set
y( i ) med y
y( i )
11 12 13 14 15 116
of y( i ) mean y
0.5 0.5 1.5 1.5 2.5 +102.5
15.16 16.16 17.16 18.16 19.16 +85.83
Estimators of type BLUUE and BIQUUE of the front page example
In terms of a special Gauss-Markov model our data set can be described as following. The statistical moment of first order, namely the expectation E{y} = 1P of the observation vector y R n , here n = 5, and the central statistical moment of second order, namely the variance-covariance matrix Ȉ y , also called the dispersion matrix D{y} = I nV 2 , D{y} =: Ȉ y R n×n , rk Ȉ y = n, of the observation vector y R n , with the variance V 2 characterize the stochastic linear model. The mean P R of the “direct” observations and the variance factor V 2 are unknown. We shall estimate ( P , V 2 ) by means of three postulates:
•
first postulate: Pˆ : linear estimation, Vˆ 2 : quadratic estimation n
Pˆ = ¦ l p y p
or
Pˆ = l cy
or
Vˆ 2 = y cMy = (y c
y c)(vec M ) = (vecM )c(y
y )
p =1
Vˆ 2 =
n
¦m
p , q =1
•
pq
y p yq
the second postulate: uniform unbiasedness E{Pˆ } = P for all P R E{Vˆ 2 } = V 2 for all V 2 R +
•
the third postulate: minimum variance
D{Pˆ } = E{[ Pˆ E{Pˆ }]2 } = min A
and
D{Vˆ 2 } = E{[Vˆ 2 E{Vˆ 2 }]2 } = min
Pˆ = arg min D{Pˆ | Pˆ = A cy, E{Pˆ } = P} A
Vˆ 2 = arg min D{Vˆ 2 | Vˆ 2 = y cMy, E{Vˆ 2 } = V 2 } . M
M
First, we begin with the postulate that the fixed unknown parameters ( P , V 2 ) are estimated by means of a certain linear form Pˆ = A cy + N = y cA + N and by means of a certain quadratic form Vˆ 2 = y cMy + xcy + Z = (vec M )c(y
y ) + + xcy + Z of the observation vector y, subject to the symmetry condition M SYM := {M R n× n | M = M c}, namely the space of symmetric matrices. Second we demand E{Pˆ } = P , E{Vˆ 2 } = V 2 , namely unbiasedness of the estimations ( Pˆ , Vˆ 2 ) . Since the estimators ( Pˆ , Vˆ 2 ) are special forms of the observation vector y R n , an intuitive understanding of the postulate of unbiasedness is the following: If the dimension of the observation space Y y , dim Y = n , is going to infinity, we expect information about the “two values” ( P , V 2 ) , namely lim Pˆ (n) = P , lim Vˆ 2 ( n) = V 2 .
nof
nof
Let us investigate how LUUE (Linear Uniformly Unbiased Estimation) of P as well as IQUUE (Invariant Quadratic Uniformly Unbiased Estimation) operate. LUUE E{Pˆ } = E{A cy + N } = A cE{y} + N º » E{y} = 1n P ¼ E{Pˆ } = A cE{y} + N = A c1n P + N E{Pˆ } = P N = 0, (A c1n 1) P = 0 N = 0, A c1n 1 = 0 for all P R. Indeed Pˆ is LUUE if and only if N = 0 and (A c1n 1) P = 0 for all P R. The zero identity (A c1n 1) P = 0 is fulfilled by means of A c1n 1 = 0, A c1n = 1, if we restrict the solution by the quantor “ for all P R ”. P = 0 is not an admissible solution. Such a situation is described as “uniformly unbiased”. We summarize that LUUE is constrained by the zero identity A c1n 1 = 0 . Next we shall prove that Vˆ 2 is IQUUE if and only if IQUUE E{Vˆ 2 } = E{y cMy + xcy + Z } = E{(vec M )c( y
y ) + xcy + Z} = (vec M )c E{y
y} + xcE{y} + Z
E{Vˆ 2 } = E{y cMy + xcy + Z } = E{(y c
y c)(vec M )c + y cx + Z} = E{y c
y c}(vec M )c + E{y c}x + Z .
Vˆ 2 is called translational invariant with respect to y 6 y E{y} if Vˆ 2 = y cMy + xcy + Z = (y E{y})cM (y E{y}) + xc( y E{y}) + Z and uniformly unbiased if
194
4 The second problem of probabilistic regression
E{Vˆ 2 } = (vec M )c E{y
y} + xcE{y} + Z = V 2 for all V 2 R + . Finally we have to discuss the postulate of a best estimator of type BLUUE of P and BIQUUE of V 2 . We proceed sequentially, first we determine Pˆ of type BLUUE and second Vˆ 2 of type BIQUUE. At the end we shall discuss simultaneous estimation of ( Pˆ , Vˆ 2 ) . The scalar Pˆ = A cy is BLUUE of P (Best Linear Uniformly Unbiased Estimation) with respect to the linear model E{y} = 1n P , D{y} = I nV 2 , if it is uniformly unbiased in the sense of E{Pˆ } = P for all P R and in comparison of all linear, uniformly unbiased estimations possesses the smallest variance in the sense of D{Pˆ } = E{[ Pˆ E{Pˆ }]2 } =
V 2 A cA = V 2 tr A cA = V 2 || A ||2 = min . The constrained Lagrangean L (A, O ) , namely
L (A, O ) := V 2 A cA + 2O (A c1n 1) = = V 2 A cA + 2(1n A 1)O = min, A ,O
produces by means of the first derivatives 1 wL ˆ ˆ (A, O ) =V 2 Aˆ +1n Oˆ = 0 2 wA 1 wL ˆ ˆ ˆ (A, O ) = A c1n 1= 0 2 wO the normal equations for the augmented unknown vector (A, O ) , also known as the necessary conditions for obtaining an optimum. Transpose the first normal equation, right multiply by 1n , the unit column and substitute the second normal equation in order to solve for the Lagrange multiplier Oˆ . If we substitute the solution Oˆ in the first normal equation, we directly find the linear operator Aˆ .
V 2 Aˆ c + 1cn Oˆ = 0c V 2 Aˆ c1 n + 1 cn 1 n Oˆ = V
2
+ 1 cn 1 n Oˆ = 0
V2 V2 Oˆ = = n 1cn1n 2
V = 0c V 2 Aˆ +1n Oˆ =V 2 lˆ 1n n
1 1 Aˆ = 1n and Pˆ = Aˆ cy = 1cn y . n n
The second derivatives 1 w 2L ˆ ˆ ( A , O ) = V 2 I n > 0c 2 wAwA c constitute the sufficiency condition which is automatically satisfied. The theory of vector differentiation is presented in detail in Appendix B. Let us briefly summarize the first result Pˆ BLUUE of P . The scalar Pˆ = A cy is BLUUE of P with respect to the linear model E{y}= 1n P , D{y}= I nV 2 , if and only if 1 1 Aˆ c = 1cn and Pˆ = 1cn y n n is the arithmetic mean. The observation space y{Y, pdf } is decomposed into y (BLUUE):= 1n Pˆ 1 y (BLUUE) = 1n 1cn y n
versus versus
e y (BLUUE):= y y (BLUUE), 1 e y (BLUUE) =[I n (1n 1cn )]y, n
which are orthogonal in the sense of e y (BLUUE) y (BLUUE) = 0
or
1 1 [I n (1n1cn )] (1n1cn ) = 0. n n
Before we continue with the setup of the Lagrangean which guarantees BIQUUE, we study beforehand e y := y E{y} and e y (BLUUE):= y y (BLUUE) . Indeed the residual vector e y (BLUUE) is a linear form of residual vector e y . 1 e y (BLUUE) =[I n (1n1cn )] e y . n For the proof we depart from 1 e y (BLUUE):= y 1n Pˆ =[I n (1n1cn )]y n 1 =[I n (1n1cn )]( y E{y}) n 1 = I n (1n1cn ) , n where we have used the invariance property y 6 y E{y} based upon the idempotence of the matrix [I n (1n1cn ) / n] . Based upon the fundamental relation e y (BLUUE) = De y , where D:= I n (1n1cn ) / n is a projection operator onto the normal space R (1n ) A , we are able to derive an unbiased estimation of the variance component V 2 . Just compute
E{ecy ( BLUUE )e y ( BLUUE )} = = tr E{e y (BLUUE)ecy (BLUUE)} = = tr D E{e y ecy }Dc = V 2 tr D Dc = V 2 tr D tr D = tr ( I n ) tr 1n (1n1cn ) = n 1 E{ecy (BLUUE)e y (BLUUE)} = V 2 ( n 1) . Let us define the quadratic estimator Vˆ 2 of V 2 by
Vˆ 2 =
1 ecy (BLUUE)e y (BLUUE) , n 1
which is unbiased according to E{Vˆ 2 } =
1 E{ecy (BLUUE)e y (BLUUE)} = V 2 . n 1
Let us briefly summarize the first result Vˆ 2 IQUUE of V 2 . The scalar Vˆ 2 = ecy (BLUUE)e y (BLUUE) /( n 1) is IQUUE of V 2 based upon the BLUUE-residual vector e y (BLUUE) = ª¬I n 1n (1n1cn ) º¼ y . Let us highlight Vˆ 2 BIQUUE of V 2 . A scalar Vˆ 2 is BIQUUE of V 2 (Best Invariant Quadratic Uniformly Unbiased Estimation) with respect to the linear model E{y} = 1n P , D{y} = I nV 2 , if it is (i)
uniformly unbiased in the sense of E{Vˆ 2 } = V 2 for all V 2 \ + ,
(ii) quadratic in the sense of Vˆ 2 = y cMy for all M = M c , (iii) translational invariant in the sense of y cMy = (y E{y})cM ( y E{y}) = ( y 1n P )cM ( y 1n P ) , (iv) best if it possesses the smallest variance in the sense of D{Vˆ 2 } = E{[Vˆ 2 E{Vˆ 2 }]2 } = min . M
First, let us consider the most influential postulate of translational invariance of the quadratic estimation
Vˆ 2 = y cMy = (vec M )c(y
y ) = (y c
y c)(vec M) to comply with Vˆ 2 = ecy Me y = (vec M )c(e y
e y ) = (ecy
ecy )(vec M )
Translational invariance is understood as the action of transformation group y = E{y} + e y = 1n P + e y with respect to the linear model of “direct” observations. Under the action of such a transformation group the quadratic estimation Vˆ 2 of V 2 is specialized to
Vˆ 2 = y cMy = ª¬ E{y} + e y º¼c M ª¬ E{y} + e y º¼ = (1cn P + ecy )M (1n P + e y ) Vˆ 2 = P 2 1cn M1n + P 1cn Me y + P ecy M1n + ecy Me y y cMy = ecy Me y 1cn M = 0c and 1cn M c = 0c . IQE, namely 1cn M = 0c and 1cn M c = 0c has a definite consequence. It is independent of P , the first moment of the probability distribution (“pdf”). Indeed, the estimation procedure of the central second moment V 2 is decoupled from the estimation of the first moment P . Here we find the key role of the invariance principle. Another aspect is the general solution of the homogeneous equation 1cn M = 0c subject to the symmetry postulate M = M c . ªM = ªI n 1cn (1cn1n ) 11cn º Z ¬ ¼ 1cM = 0c « «¬ M = (I n 1n 1n1cn )Z , where Z equation takes an Z \ n× n
is an arbitrary matrix. The general solution of the homogeneous matrix contains the left inverse (generalized inverse (1cn 1n ) 1 1cn = 1-n ) which exceptionally simple form, here. The general form of the matrix is in no agreement with the symmetry postulate M = M c . 1cn M = 0c M = D (I n 1n 1n1cn ). M = Mc
Indeed, we made the choice Z = D I n which reduces the unknown parameter space to one dimension. Now by means of the postulate “best” under the constraint generated by “uniform inbiasedness” Vˆ 2 of V 2 we shall determine the parameter D = 1/(n 1) . The postulate IQUUE is materialized by ª E{Vˆ 2 } = V 2 º E{ecy Me y } = mij E{eiy e jy } » « + 2 2 2 Vˆ 2 = ecy Me y ¼» ¬« = mij S ij = V mij G ij = V V \ E{Vˆ 2 | Ȉ y = I nV 2 } = V 2 tr M = 1 tr M 1 = 0 .
198
4 The second problem of probabilistic regression
For the simple case of “i.i.d.” observations, namely Ȉ y = I nV 2 , E{Vˆ 2 } = V 2 for an IQE, IQUUE is equivalent to tr M = 1 or (tr M ) 1 = 0 as a condition equation.
tr M = 1
D tr(I n 1n 1n1cn ) = D (n 1) = 1 1 D= . n 1
IQUUE of the simple case invariance : (i ) 1cM = 0c and M = M cº 1 M= (I n 1n 1n1cn ) » QUUE : (ii ) tr M 1 = 0 n 1 ¼ has already solved our problem of generating the symmetric matrix M .
Vˆ 2 = y cMy =
1 y c(I n 1n 1n1cn )y IQUUE n 1
? Is there still a need to apply “best” as an optimability condition for BIQUUE ? Yes, there is! The general solution of the homogeneous equations 1cn M = 0c and M c1n = 0 generated by the postulate of translational invariance of IQE did not produce a symmetric matrix. Here we present the simple symmetrization. An alternative approach worked depart from 1 2
(M + M c) = 12 {[I n 1n (1cn1n ) 11cn ]Z + Zc[I n 1n (1cn1n ) 11cn ]} ,
leaving the general matrix Z as an unknown to be determined. Let us therefore develop BIQUUE for the linear model E{y} = 1n P , D{y} = I nV 2 D{Vˆ 2 } = E{(Vˆ 2 E{Vˆ 2 }) 2 } = E{Vˆ 4 } E{Vˆ 2 }2 . Apply the summation convention over repeated indices i, j , k , A {1,..., n}. 1st : E{Vˆ 2 }2 E{Vˆ 2 }2 = mij E{eiy e jy }mk A E{eky eAy } = mij mklS ijS k A subject to
S ij := E{e e } = V G ij and S k A := E{eky eAy } = V 2G k A y y i j
2
E{Vˆ 2 }2 = V 4 mijG ij mk AG k A = V 4 (tr M ) 2 2nd : E{Vˆ 4 } E{Vˆ 4 } = mij mk A E{eiy e jy eky eAy } = mij mk AS ijk A
S ijk A := E{eiy e jy eky eAy } i, j , k , A {1,..., n} . For a normal pdf, the fourth order moment S ijk A can be reduced to second order moments. For a more detailed presentation of “normal models” we refer to Appendix D.
S ijk A = S ijS k A + S ik S jA + S iAS jk = V 4 (G ijG k A + G ik G j A + G iAG jk ) E{Vˆ 4 } = V 4 mij mk A (G ijG k A + G ik G jA + G iAG jk ) E{Vˆ 4 } = V 4 [(tr M ) 2 + 2 tr M cM ]. Let us briefly summarize the representation of the variance D{Vˆ 2 } = E{(Vˆ 2 E{Vˆ 2 }) 2 } for normal models. Let the linear model of i.i.d. direct observations be defined by E{y | pdf } = 1n P , D{y | pdf } = I nV 2 . The variance of a normal IQE can be represented by D{Vˆ 2 } := E{(Vˆ 2 E{Vˆ 2 }) 2 } = = 2V 4 [(tr M ) 2 + tr(M 2 )]. In order to construct BIQUUE, we shall define a constrained Lagrangean which takes into account the conditions of translational invariance, uniform unbiasedness and symmetry.
L (M, O0 , O1 , O 2 ) := 2 tr M cM + 2O0 (tr M 1) + 2O11cn M1 n + 2O 2 1cn M c1 n = min . M , O0 , O1 , O2
Here we used the condition of translational invariance in the special form 1cn 12 (M + M c)1 n = 0 1cn M1 n = 0 and 1cn M c1 n = 0 , which accounts for the symmetry of the unknown matrix. We here conclude with the normal equations for BIQUUE generated from wL wL wL wL = 0, = 0, = 0, = 0. w (vec M ) wO0 wO1 wO2 ª 2(I n
I n ) vec I n I n
1 n 1 n
I n º ª vec M º ª0 º « (vec I )c 0 0 0 »» «« O0 »» ««1 »» n « = . « I n
1cn 0 0 0 » « O1 » «0 » « »« » « » 0 0 0 ¼ ¬ O2 ¼ ¬0¼ ¬ 1cn
I n
These normal equations will be solved lateron. Indeed M = (I n 1n 1 n1cn ) / ( n 1) is a solution. 1 y c(I n 1n 1n1cn )y n 1 2 D{Vˆ 2 } = V4 n 1
Vˆ 2 = BIQUUE:
Such a result is based upon (tr M ) 2 (BIQUUE) =
1 1 , (tr M 2 )(BIQUUE) = , n 1 n 1
D{Vˆ 2 | BIQUUE} = D{Vˆ 2 } = 2V 4 [(tr M ) 2 + (tr M 2 )](BIQUUE), D{Vˆ 2 } =
2 V 4. n 1
Finally, we are going to outline the simultaneous estimation of {P , V 2 } for the linear model of direct observations.
•
first postulate: inhomogeneous, multilinear (bilinear) estimation
Pˆ = N 1 + A c1y + mc1 (y
y ) Vˆ 2 = N 2 + A c2 y + (vec M 2 )c( y
y ) mc1 º ª y º ª Pˆ º ªN 1 º ª A c1 «Vˆ 2 » = «N » + « A c (vec M )c» « y
y » ¬ ¼ ¬ 2¼ ¬ 2 ¼ 2 ¼¬ m1c º ªN A c1 ª Pˆ º x = XY x := « 2 » , X = « 1 » ¬Vˆ ¼ ¬N 2 A c2 (vec M 2 )c¼ ª 1 º Y := «« y »» «¬ y
y »¼
•
second postulate: uniform unbiasedness ª Pˆ º ª P º E {x} = E{« 2 »} = « 2 » ¬Vˆ ¼ ¬V ¼
•
third postulate: minimum variance D{x} := tr E{ª¬ x E {x}º¼ ª¬ x E {x}º¼ c } = min .
BLUUE and BIQUUE of the front page example, sample median, median absolute deviation
According to Table 4.1 and Table 4.2 we presented you with two sets of observations yi Y, dim Y = n, i {1,..., n} , the second one qualifies to certain “one outlier”. We aim at a definition of the median and of the median absolute deviation which is compared to the definition of the mean (weighted mean) and of the root mean square error. First we order the observations according to y(1) < y( 2) < ... < y( n1) < y( n ) by means of the permutation matrix ª y(1) º ª y1 º « y » « y2 » « (2) » « » « ... » = P « ... » , « y( n 1) » « yn 1 » « » «¬ yn »¼ ¬« y( n ) ¼» namely data set one ª11º ª 0 «12 » « 0 « » « «13» = « 0 « » « «14 » « 0 «¬15»¼ «¬1 ª0 «0 « P5 = « 0 « «0 «¬1
0 0 1 0º ª15º 1 0 0 0 »» ««12»» 0 0 0 1 » «14» »« » 0 1 0 0 » «11» 0 0 0 0»¼ «¬13»¼ 0 0 1 0º 1 0 0 0»» 0 0 0 1» » 0 1 0 0» 0 0 0 0»¼
versus
versus
data set two 0 0 1 0 0º ª 15 º 1 0 0 0 0»» «« 12 »» 0 0 0 1 0» « 14 » »« » 0 1 0 0 0» « 11 » 0 0 0 0 0» « 13 » »« » 0 0 0 0 1 »¼ «¬116»¼ ª0 0 0 1 0 0º «0 1 0 0 0 0» « » «0 0 0 0 1 0» P6 = « ». «0 0 1 0 0 0» «1 0 0 0 0 0 » « » «¬0 0 0 0 0 1 »¼
ª 11 º ª 0 « 12 » « 0 « » « « 13 » « 0 « »=« « 14 » « 0 « 15 » «1 « » « «¬116 »¼ «¬ 0
Note PP c = I , P 1 = P c . Second, we define the sample median med y as well as the median absolute deviation mad y of y Y by means of y([ n / 2]+1) if n is an odd number ª med y := « 1 ¬ 2 ( y( n / 2) + y( n / 2+1) ) if n is an even number mad y := med | y( i ) med y | , where [n/2] denotes the largest integer (“natural number”) d n / 2 .
Table 4.3: "direct" observations, comparison of the two data sets by means of med y, mad y (I-LESS, G_y-LESS) and r.m.s. (I-BIQUUE)

data set one (n = 5, "odd")                    data set two (n = 6, "even")
n/2 = 2.5, [n/2] = 2, [n/2] + 1 = 3            n/2 = 3, n/2 + 1 = 4
med y = y₍₃₎ = 13                              med y = 13.5
mad y = 1                                      mad y = 1.5
mean y (I-LESS) = 13                           mean y (I-LESS) = 30.16
                                               weighted mean y (G_y-LESS) = 13.5,
                                               G_y = Diag(1, 1, 1, 1, 1, 24/1000)
μ̂ (I-BLUUE) = 13                               μ̂ (I-BLUUE) = 30.16
σ̂² (I-BIQUUE) = 2.5                            σ̂² (I-BIQUUE) = 1770.1
r.m.s. (I-BIQUUE) = σ̂ (I-BIQUUE) = 1.6         r.m.s. (I-BIQUUE) = σ̂ (I-BIQUUE) = 42.1
Third, we compute I-LESS, namely mean y = (1′1)⁻¹1′y = (1/n)1′y, listed in Table 4.3. Obviously, for the second observational data set the Euclidean metric of the observation space Y is not isotropic. Indeed, let us compute G_y-LESS, namely the weighted mean y = (1′G_y1)⁻¹1′G_y y. A particular choice of the matrix G_y ∈ ℝ⁶ˣ⁶ of the metric, also called "weight matrix", is G_y = Diag(1, 1, 1, 1, 1, x), such that

weighted mean y = (y₁ + y₂ + y₃ + y₄ + y₅ + y₆ x)/(5 + x),

where x is the unknown weight of the extreme value ("outlier") y₆. A special robust design of the weighted mean y is the median, namely weighted mean y = med y, such that

x = (y₁ + y₂ + y₃ + y₄ + y₅ − 5 med y)/(med y − y₆),

here x = 0.024390243... ≈ 24/1000.

Indeed, the weighted mean with respect to the weight matrix G_y = Diag(1, 1, 1, 1, 1, 24/1000) reproduces the median of the second data set: the extreme value has been down-weighted by a weight of approximately 24/1000. Fourth, with respect to the simple linear model E{y} = 1μ, D{y} = Iσ² we compute I-BLUUE of μ and I-BIQUUE of σ², namely

μ̂ = (1′1)⁻¹1′y = (1/n)1′y,
σ̂² = (1/(n−1)) y′[I − 1(1′1)⁻¹1′] y = (1/(n−1)) y′[I − (1/n)11′] y = (1/(n−1)) (y − 1μ̂)′(y − 1μ̂).
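A minimal numerical sketch (ours; NumPy assumed) of the quantities of Table 4.3 for the second data set: the I-LESS mean, the G_y-weighted mean with the down-weight x ≈ 24/1000, and I-BLUUE μ̂ together with I-BIQUUE σ̂².

:Sketch (Python):

    import numpy as np

    y = np.array([15., 12., 14., 11., 13., 116.])     # second data set
    n = y.size
    one = np.ones(n)

    # I-LESS: mean y = (1'1)^{-1} 1'y
    mean_I = one @ y / n                               # 30.166...

    # G_y-LESS: weighted mean with G_y = Diag(1,1,1,1,1,x)
    med_y = 13.5
    x = (y[:5].sum() - 5.0 * med_y) / (med_y - y[5])   # 0.02439... ~ 24/1000
    Gy = np.diag([1, 1, 1, 1, 1, x])
    mean_G = (one @ Gy @ y) / (one @ Gy @ one)         # 13.5, reproduces the median

    # I-BLUUE of mu and I-BIQUUE of sigma^2 in E{y} = 1*mu, D{y} = I*sigma^2
    mu = mean_I
    sigma2 = (y - mu) @ (y - mu) / (n - 1)             # 1770.1..., r.m.s. ~ 42.1
    print(mean_I, x, mean_G, sigma2, np.sqrt(sigma2))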
Obviously the extreme value y₆ in the second data set has spoiled the specification of the simple linear model: the r.m.s. (I-BIQUUE) = 1.6 of the first data set is increased to the r.m.s. (I-BIQUUE) = 42.1 of the second data set. Fifth, we set up the alternative linear model for the second data set, namely

E{[y₁, y₂, y₃, y₄, y₅, y₆]′} = [μ₁, μ₁, μ₁, μ₁, μ₁, μ₂]′ = [μ, μ, μ, μ, μ, μ + ν]′ = 1μ + aν

E{y} = Aξ  with  A := [1, a] ∈ ℝ⁶ˣ²,  1 := [1, 1, 1, 1, 1, 1]′ ∈ ℝ⁶ˣ¹,
ξ := [μ, ν]′ ∈ ℝ²ˣ¹,  a := [0, 0, 0, 0, 0, 1]′ ∈ ℝ⁶ˣ¹,

D{y} = I₆σ² ∈ ℝ⁶ˣ⁶,  σ² ∈ ℝ⁺,

adding to the observation y₆ the bias term ν. Still we assume the variance-covariance matrix D{y} of the observation vector y ∈ ℝ⁶ˣ¹ to be isotropic with one variance component as an unknown. (μ̂, ν̂) is I₆-BLUUE if
[ μ̂ ; ν̂ ] = (A′A)⁻¹A′y,   [ μ̂ ; ν̂ ] = [ 13 ; 103 ],

μ̂ = 13,  ν̂ = 103,  μ̂₁ = μ̂ = 13,  μ̂₂ = μ̂ + ν̂ = 116,

D{[ μ̂ ; ν̂ ]} = (A′A)⁻¹σ² = (σ²/5) [ 1  −1 ; −1  6 ],

σ²_μ̂ = σ²/5,  σ²_ν̂ = (6/5)σ²,  σ_μ̂ν̂ = −(1/5)σ².

σ̂² is I₆-BIQUUE if

σ̂² = (1/(n − rk A)) y′[ I₆ − A(A′A)⁻¹A′ ] y,

                        [  4 −1 −1 −1 −1  0 ]
                        [ −1  4 −1 −1 −1  0 ]
I₆ − A(A′A)⁻¹A′ = (1/5) [ −1 −1  4 −1 −1  0 ]
                        [ −1 −1 −1  4 −1  0 ]
                        [ −1 −1 −1 −1  4  0 ]
                        [  0  0  0  0  0  5 ]

rᵢ := [ I₆ − A(A′A)⁻¹A′ ]ᵢᵢ = (4/5, 4/5, 4/5, 4/5, 4/5, 1),  i ∈ {1, ..., 6},

are the redundancy numbers.

y′( I₆ − A(A′A)⁻¹A′ ) y = 13466,
σ̂² = 13466/4 = 3366.5,  σ̂ = 58.02,
σ²_μ̂(σ̂²) = 3366.5/5 = 673.3,  σ_μ̂(σ̂) = 26,
σ²_ν̂(σ̂²) = (6/5)·3366.5 = 4039.8,  σ_ν̂(σ̂) = 63.6.

Indeed, the r.m.s. values of the partial mean μ̂ as well as of the estimated bias ν̂ have changed the results remarkably, namely from r.m.s. (simple linear model) 42.1 to r.m.s. (linear model) 26. An r.m.s. value of the bias ν̂ of the order of 63.6 has been documented. Finally, let us compute the empirical "error vector" ê_y and its variance-covariance matrix by means of
ê_y = [ I₆ − A(A′A)⁻¹A′ ] y,   D{ê_y} = [ I₆ − A(A′A)⁻¹A′ ] σ²,
ê_y = [ 2, −1, 1, −2, 0, 116 ]′,

                  [  4 −1 −1 −1 −1  0 ]
                  [ −1  4 −1 −1 −1  0 ]
D{ê_y} = (σ²/5)   [ −1 −1  4 −1 −1  0 ]
                  [ −1 −1 −1  4 −1  0 ]
                  [ −1 −1 −1 −1  4  0 ]
                  [  0  0  0  0  0  5 ]

                        [  4 −1 −1 −1 −1  0 ]
                        [ −1  4 −1 −1 −1  0 ]
D{ê_y | σ̂²} = 673.3 ·   [ −1 −1  4 −1 −1  0 ]
                        [ −1 −1 −1  4 −1  0 ]
                        [ −1 −1 −1 −1  4  0 ]
                        [  0  0  0  0  0  5 ].
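A short numerical check of the alternative two-parameter model above (our own sketch; NumPy assumed): it reproduces μ̂ = 13, ν̂ = 103 and the dispersion factor (A′A)⁻¹ = (1/5)[1, −1; −1, 6].

:Sketch (Python):

    import numpy as np

    y = np.array([15., 12., 14., 11., 13., 116.])
    a = np.array([0., 0., 0., 0., 0., 1.])          # bias carrier for y_6
    A = np.column_stack([np.ones(6), a])            # A = [1, a]

    xi_hat = np.linalg.solve(A.T @ A, A.T @ y)      # [mu_hat, nu_hat]
    N_inv = np.linalg.inv(A.T @ A)                  # dispersion factor (A'A)^{-1}

    print(xi_hat)       # [ 13. 103.]
    print(5 * N_inv)    # [[ 1. -1.] [-1.  6.]]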
4-14 Alternative estimation: Maximum Likelihood (MALE)
Maximum Likelihood Estimation ("MALE") is a competitor to BLUUE of the first moments E{y} and to BIQUUE of the second central moments D{y} of a random variable y ∈ {Y, pdf}, which we present at least by an example.

Maximum Likelihood Estimation
linear model: E{y} = 1_n μ, D{y} = I_n σ²
"independent, identically normally distributed observations" [y₁, ..., yₙ]′
"direct observations"
unknown parameters: {μ, σ²} ∈ ℝ × ℝ⁺ =: X
"simultaneous estimation of {μ, σ²}".

Given the above linear model of independent, identically, normally distributed observations [y₁, ..., yₙ]′ = y ∈ {ℝⁿ, pdf}. The first moment μ as well as the central second moment σ² constitute the unknown parameters (μ, σ²) ∈ ℝ × ℝ⁺, where ℝ × ℝ⁺ is the admissible parameter space. The estimation of the unknown parameters (μ, σ²) is based on the following optimization problem: maximize the log-likelihood function

ln f(y₁, ..., yₙ | μ, σ²) = Σᵢ₌₁ⁿ ln f(yᵢ | μ, σ²) =
= ln{ (2π)^(−n/2) σ^(−n) exp( −(1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − μ)² ) } =
= −(n/2) ln 2π − (n/2) ln σ² − (1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − μ)²  = max over (μ, σ²)
of the independent, identically normally distributed random variables {y₁, ..., yₙ}. The log-likelihood function becomes simple if we introduce the first sample moment m₁ and the second sample moment m₂, namely

m₁ := (1/n) Σᵢ₌₁ⁿ yᵢ = (1/n) 1′y,   m₂ := (1/n) Σᵢ₌₁ⁿ yᵢ² = (1/n) y′y,

ln f(y₁, ..., yₙ | μ, σ²) = −(n/2) ln 2π − (n/2) ln σ² − (n/(2σ²)) (m₂ − 2m₁μ + μ²).

Now we are able to define the optimization problem

ℓ(μ, σ²) := ln f(y₁, ..., yₙ | μ, σ²) = max over (μ, σ²)

more precisely.

Definition (Maximum Likelihood Estimation, linear model E{y} = 1_n μ, D{y} = I_n σ², independent, identically normally distributed observations {y₁, ..., yₙ}):
A 2×1 vector [μ_ℓ, σ²_ℓ]′ is called MALE of [μ, σ²]′ (Maximum Likelihood Estimation) with respect to the linear model above if its log-likelihood function

ℓ(μ, σ²) := ln f(y₁, ..., yₙ | μ, σ²) = −(n/2) ln 2π − (n/2) ln σ² − (n/(2σ²)) (m₂ − 2m₁μ + μ²)

is maximal.

The simultaneous estimation of (μ, σ²) of type MALE can be characterized as follows.

Corollary (MALE with respect to the linear model E{y} = 1_n μ, D{y} = I_n σ², independent, identically normally distributed observations {y₁, ..., yₙ}):
The log-likelihood function ℓ(μ, σ²) with respect to the linear model E{y} = 1_n μ, D{y} = I_n σ², (μ, σ²) ∈ ℝ × ℝ⁺, of independent, identically normally distributed observations {y₁, ..., yₙ} is maximal if

μ_ℓ = m₁ = (1/n) 1′y,   σ²_ℓ = m₂ − m₁² = (1/n) (y − 1μ_ℓ)′(y − 1μ_ℓ)

is the simultaneous estimate of the mean value (first moment) μ_ℓ and of the variance (central second moment) σ²_ℓ.

:Proof:
The Lagrange function

L(μ, σ²) := −(n/2) ln σ² − (n/(2σ²)) (m₂ − 2m₁μ + μ²) = max over (μ, σ²)
leads to the necessary conditions

∂L/∂μ (μ, σ²) = n m₁/σ² − n μ/σ² = 0
∂L/∂σ² (μ, σ²) = −n/(2σ²) + (n/(2σ⁴)) (m₂ − 2μm₁ + μ²) = 0,

also called the likelihood normal equations. Their solution is

[ μ_ℓ ; σ²_ℓ ] = [ m₁ ; m₂ − m₁² ] = (1/n) [ 1′y ; y′y − (1/n)(1′y)² ].

The matrix of second derivatives, being negative definite, constitutes the sufficiency condition:

−∂²L/∂(μ, σ²)∂(μ, σ²)′ (μ_ℓ, σ²_ℓ) = n [ 1/σ²_ℓ  0 ; 0  1/(2σ⁴_ℓ) ] > 0.
□

Finally we can immediately check that ℓ(μ, σ²) → −∞ as (μ, σ²) approaches the boundary of the parameter space. If the log-likelihood function is sufficiently regular, we can expand it as

ℓ(μ, σ²) = ℓ(μ_ℓ, σ²_ℓ) + Dℓ(μ_ℓ, σ²_ℓ) [ μ − μ_ℓ ; σ² − σ²_ℓ ]
          + ½ [ μ − μ_ℓ ; σ² − σ²_ℓ ]′ D²ℓ(μ_ℓ, σ²_ℓ) [ μ − μ_ℓ ; σ² − σ²_ℓ ] + O₃.

Due to the likelihood normal equations Dℓ(μ_ℓ, σ²_ℓ) vanishes. Therefore the behaviour of ℓ(μ, σ²) near (μ_ℓ, σ²_ℓ) is largely determined by D²ℓ(μ_ℓ, σ²_ℓ), which is a measure of the local curvature of the log-likelihood function ℓ(μ, σ²). The non-negative Hessian matrix of second derivatives

I(μ_ℓ, σ²_ℓ) = −∂²ℓ/∂(μ, σ²)∂(μ, σ²)′ (μ_ℓ, σ²_ℓ) > 0

is called the observed Fisher information. It can be regarded as an index of the steepness of the log-likelihood function when moving away from (μ_ℓ, σ²_ℓ), and as an indicator of the strength of preference for the MLE point with respect to the other points of the parameter space.
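A minimal sketch (ours; NumPy assumed) of the closed-form solution of the likelihood normal equations, μ_ℓ = m₁ and σ²_ℓ = m₂ − m₁², checked on the first front-page data set.

:Sketch (Python):

    import numpy as np

    def male(y):
        """MALE for the model E{y} = 1*mu, D{y} = I*sigma^2 (direct observations)."""
        y = np.asarray(y, dtype=float)
        m1 = y.mean()                  # first sample moment
        m2 = (y * y).mean()            # second sample moment
        return m1, m2 - m1 * m1        # mu_l, sigma2_l

    print(male([15, 12, 14, 11, 13]))  # (13.0, 2.0), cf. Table 4.4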
Finally, compare by means of Table 4.4 (μ_ℓ, σ²_ℓ) MALE of (μ, σ²) for the front page example of Table 4.1 and Table 4.2.

Table 4.4: (μ_ℓ, σ²_ℓ) MALE of (μ, σ²) ∈ {ℝ, ℝ⁺}: the front page examples

                        μ_ℓ       σ²_ℓ        |σ_ℓ|
1st example (n = 5)     13        2           1.41
2nd example (n = 6)     30.16     1474.65     38.40
4-2 Setup of the best linear uniformly unbiased estimator of type BLUUE for the moments of first order

Let us introduce the special Gauss–Markov model y = Aξ + e specified in Box 4.1, which is given for the first order moments in the form of an inconsistent system of linear equations relating the non-stochastic ("fixed"), real-valued vector ξ of unknowns to the expectation E{y} of the stochastic, real-valued vector y of observations, Aξ = E{y}, since E{y} ∈ R(A) is an element of the column space R(A) of the real-valued, non-stochastic ("fixed") "first order design matrix" A ∈ ℝⁿˣᵐ. The rank of the fixed matrix A, rk A, equals the number m of unknowns, ξ ∈ ℝᵐ. In addition, the central second order moments, the regular variance-covariance matrix Σ_y, also called dispersion matrix D{y}, constitute the second matrix Σ_y ∈ ℝⁿˣⁿ of unknowns, to be specified as a linear model further on.

Box 4.1: Special Gauss–Markov model y = Aξ + e

1st moments:
Aξ = E{y},  A ∈ ℝⁿˣᵐ,  E{y} ∈ R(A),  rk A = m   (4.1)

2nd moments:
Σ_y = D{y} ∈ ℝⁿˣⁿ,  Σ_y positive definite,  rk Σ_y = n   (4.2)

ξ, E{y}, y − E{y} = e unknown;  Σ_y unknown.

4-21 The best linear uniformly unbiased estimation ξ̂ of ξ: Σ_y-BLUUE

Since we are dealing with a linear model, it is "a natural choice" to set up a linear form to estimate the parameters ξ of fixed effects, namely

ξ̂ = Ly + κ,   (4.3)
where {L ∈ ℝᵐˣⁿ, κ ∈ ℝᵐ} are fixed unknowns. In order to determine the real-valued m×n matrix L and the real-valued m×1 vector κ, independent of the variance-covariance matrix Σ_y, the inhomogeneous linear estimation ξ̂ of the vector ξ of fixed effects has to fulfil certain optimality conditions.

(1st) ξ̂ is an inhomogeneous linear unbiased estimation of ξ:

E{ξ̂} = E{Ly + κ} = ξ for all ξ ∈ ℝᵐ,   (4.4)

and (2nd) in comparison to all other linear uniformly unbiased estimations ξ̂ has minimum variance:

tr D{ξ̂} := E{(ξ̂ − ξ)′(ξ̂ − ξ)} = tr L Σ_y L′ = ||L′||²_Σ = min over L.   (4.5)

First, the condition of a linear uniformly unbiased estimation, E{ξ̂} = E{Ly + κ} = ξ for all ξ ∈ ℝᵐ, with respect to the special Gauss–Markov model (4.1), (4.2) has to be considered in more detail. As soon as we substitute the linear model (4.1) into the postulate of uniform unbiasedness (4.4) we are led to

E{ξ̂} = E{Ly + κ} = L E{y} + κ = ξ for all ξ ∈ ℝᵐ   (4.6)
and
LAξ + κ = ξ for all ξ ∈ ℝᵐ.   (4.7)

Beside κ = 0, the postulate of linear uniformly unbiased estimation with respect to the special Gauss–Markov model (4.1), (4.2) leaves us with one condition, namely

(LA − I_m)ξ = 0 for all ξ ∈ ℝᵐ   (4.8)
or
LA − I_m = 0.   (4.9)

Note that there are locally unbiased estimations such that (LA − I_m)ξ₀ = 0 for LA − I_m ≠ 0. Alternatively, B. Schaffrin (2000) has softened the constraint of unbiasedness (4.9) by replacing it by the stochastic matrix constraint A′L′ = I_m + E₀ subject to E{vec E₀} = 0, D{vec E₀} = I_m ⊗ Σ₀, Σ₀ a positive definite matrix. For Σ₀ → 0, uniform unbiasedness is restored. Estimators which fulfil the stochastic matrix constraint A′L′ = I_m + E₀ for finite Σ₀ are called "softly unbiased" or "unbiased in the mean".

Second, the choice of norm for "best" of type minimum variance has to be discussed more specifically. Under the condition of a linear uniformly unbiased estimation let us derive the specific representation of the weighted Frobenius matrix norm of L′. Indeed, let us define the dispersion matrix
D{ξ̂} := E{(ξ̂ − E{ξ̂})(ξ̂ − E{ξ̂})′} = E{(ξ̂ − ξ)(ξ̂ − ξ)′},   (4.10)

which by means of the inhomogeneous linear form ξ̂ = Ly + κ is specified to

D{ξ̂} = L D{y} L′,   (4.11)

and

Definition 4.1 (ξ̂ Σ_y-BLUUE of ξ):
An m×1 vector ξ̂ = Ly + κ is called Σ_y-BLUUE of ξ (Best Linear Uniformly Unbiased Estimation) with respect to the Σ_y-norm in (4.1) if (1st) ξ̂ is uniformly unbiased in the sense of (4.4) and (2nd) ξ̂ is of minimum variance in the sense of

tr D{ξ̂} := tr L D{y} L′ = ||L′||²_Σ_y = min.   (4.12)

Now we are prepared for

Lemma 4.2 (ξ̂ Σ_y-BLUUE of ξ):
An m×1 vector ξ̂ = Ly + κ is Σ_y-BLUUE of ξ in (4.1) if and only if

κ = 0   (4.13)

holds and the matrix L fulfils the system of normal equations

[ Σ_y  A ] [ L′ ]   [ 0   ]
[ A′   0 ] [ Λ  ] = [ I_m ]   (4.14)

or

Σ_y L′ + AΛ = 0   (4.15)
and
A′L′ = I_m   (4.16)

with the m×m matrix Λ of "Lagrange multipliers".
:Proof:
Due to the postulate of an inhomogeneous linear uniformly unbiased estimation with respect to the parameters ξ ∈ ℝᵐ of the special Gauss–Markov model we were led to κ = 0 and one conditional constraint, which makes it plausible to minimize the constrained Lagrangean

L(L, Λ) := tr L Σ_y L′ + 2 tr Λ(A′L′ − I_m) = min over (L, Λ)   (4.17)

for Σ_y-BLUUE. The necessary conditions for the minimum of the quadratic constrained Lagrangean L(L, Λ) are

∂L/∂L (L̂, Λ̂) := 2(Σ_y L̂′ + AΛ̂)′ = 0   (4.18)
∂L/∂Λ (L̂, Λ̂) := 2(A′L̂′ − I_m) = 0,   (4.19)

which agree with the normal equations (4.14). The theory of matrix derivatives is reviewed in Appendix B, namely (d3) and (d4). The second derivatives

∂²L/∂(vec L)∂(vec L)′ (L̂, Λ̂) = 2(Σ_y ⊗ I_m) > 0   (4.20)

constitute the sufficiency conditions due to the positive definiteness of the matrix Σ_y for L(L, Λ) = min. (The Kronecker–Zehfuss product A ⊗ B of two arbitrary matrices A and B is explained in Appendix A.)
□

Obviously, a homogeneous linear form ξ̂ = Ly is sufficient to generate Σ_y-BLUUE for the special Gauss–Markov model (4.1), (4.2). Explicit representations of Σ_y-BLUUE of type ξ̂ as well as of its dispersion matrix D{ξ̂ | ξ̂ Σ_y-BLUUE}, generated by solving the normal equations (4.14), are collected in
B of two arbitrary matrices A and B, is explained in Appendix A.) h Obviously, a homogeneous linear form ȟˆ = Ly is sufficient to generate Ȉ BLUUE for the special Gauss-Markov model (4.1), (4.2). Explicit representations of Ȉ - BLUUE of type ȟˆ as well as of its dispersion matrix D{ȟˆ | ȟˆ Ȉ y BLUUE} generated by solving the normal equations (4.14) are collected in Theorem 4.3 ( ȟˆ Ȉ y -BLUUE of ȟ ): Let ȟˆ = Ly be Ȉ - BLUUE of ȟ in the special linear Gauss-Markov model (4.1),(4.2). Then ȟˆ = ( A cȈ y 1 A) 1 A cȈ y1 y
(4.21)
ȟˆ = Ȉȟˆ A cȈ y1y
(4.22)
are equivalent to the representation of the solution of the normal equations (4.14) subjected to the related dispersion matrix D{ȟˆ}:= Ȉȟˆ = ( A cȈ y1A ) 1 . :Proof: We shall present two proofs of the above theorem: The first one is based upon Gauss elimination in solving the normal equations (4.14), the second one uses the power of the IPM method (Inverse Partitioned Matrix, C. R. Rao's Pandora Box). (i) forward step (Gauss elimination): Multiply the first normal equation by Ȉ y1 , multiply the reduced equation by Ac and subtract the result from the second normal equation. Solve for ȁ
Σ_y L̂′ + AΛ̂ = 0   (first equation; multiply by −A′Σ_y⁻¹)
A′L̂′ = I_m        (second equation)

−A′L̂′ − A′Σ_y⁻¹AΛ̂ = 0,  A′L̂′ = I_m
⇒ −A′Σ_y⁻¹AΛ̂ = I_m
⇒ Λ̂ = −(A′Σ_y⁻¹A)⁻¹.   (4.23)

(ii) backward step (Gauss elimination):
Substitute Λ̂ in the modified first normal equation and solve for L̂:

L̂′ + Σ_y⁻¹AΛ̂ = 0  ⇒  L̂ = −Λ̂′A′Σ_y⁻¹,  Λ̂ = −(A′Σ_y⁻¹A)⁻¹
⇒ L̂ = (A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹.   (4.24)

(iii) IPM (Inverse Partitioned Matrix):
Let us partition the symmetric matrix of the normal equations (4.14),

[ Σ_y  A ]   [ A₁₁   A₁₂ ]
[ A′   0 ] = [ A₁₂′  0   ].

According to Appendix A (Fact on Inverse Partitioned Matrix: IPM) its Cayley inverse is partitioned as well,

[ Σ_y  A ]⁻¹   [ A₁₁   A₁₂ ]⁻¹   [ B₁₁   B₁₂ ]
[ A′   0 ]   = [ A₁₂′  0   ]   = [ B₁₂′  B₂₂ ]

B₁₁ = Σ_y⁻¹ − Σ_y⁻¹A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹
B₁₂′ = (A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹
B₂₂ = −(A′Σ_y⁻¹A)⁻¹.

The normal equations are now solved by

[ L̂′ ]   [ A₁₁   A₁₂ ]⁻¹ [ 0   ]   [ B₁₁   B₁₂ ] [ 0   ]
[ Λ̂  ] = [ A₁₂′  0   ]   [ I_m ] = [ B₁₂′  B₂₂ ] [ I_m ]

L̂ = B₁₂′ = (A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹
Λ̂ = B₂₂ = −(A′Σ_y⁻¹A)⁻¹.   (4.25)
(iv) dispersion matrix:
The related dispersion matrix is computed by means of the "error propagation law":

D{ξ̂} = D{Ly | L̂ = (A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹} = L̂ D{y} L̂′
D{ξ̂} = (A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹ Σ_y Σ_y⁻¹A(A′Σ_y⁻¹A)⁻¹
D{ξ̂} = (A′Σ_y⁻¹A)⁻¹.   (4.26)

Here is my proof's end.
□

By means of Theorem 4.3 we succeeded to produce ξ̂ Σ_y-BLUUE of ξ. In consequence, we have to estimate Ê{y} as Σ_y-BLUUE of E{y} as well as the "error vector"

e_y := y − E{y}   (4.27)
ê_y := y − Ê{y} = y − Aξ̂ = (I_n − AL̂)y   (4.28)

out of

Lemma 4.4 (Ê{y} Σ_y-BLUUE of E{y}; ê_y, D{ê_y}, D{y}):
(i) Let Ê{y} be Σ_y-BLUUE of E{y} = Aξ with respect to the special Gauss–Markov model (4.1), (4.2). Then

Ê{y} = Aξ̂ = A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹y   (4.29)

leads to the singular variance-covariance matrix (dispersion matrix)

D{Aξ̂} = A(A′Σ_y⁻¹A)⁻¹A′.   (4.30)

(ii) If the error vector e_y is empirically determined, we receive for the residual vector

ê_y = [I_n − A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹]y   (4.31)

and its singular variance-covariance matrix (dispersion matrix)

D{ê_y} = Σ_y − A(A′Σ_y⁻¹A)⁻¹A′,  rk D{ê_y} = n − m.   (4.32)

(iii) The dispersion matrices of the special Gauss–Markov model (4.1), (4.2) are related by

D{y} = D{Aξ̂ + ê_y} = D{Aξ̂} + D{ê_y} = D{e_y − ê_y} + D{ê_y},   (4.33)
C{ê_y, Aξ̂} = 0,  C{ê_y, e_y − ê_y} = 0;   (4.34)

ê_y and Aξ̂ are uncorrelated.
:Proof:
(i) Ê{y} = Aξ̂ = A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹y:
As soon as we implement ξ̂ Σ_y-BLUUE of ξ, namely (4.21), into Aξ̂ we are directly led to the desired result.

(ii) D{Aξ̂} = A(A′Σ_y⁻¹A)⁻¹A′:
ξ̂ Σ_y-BLUUE of ξ, namely (4.21), implemented in

D{Aξ̂} := E{A(ξ̂ − E{ξ̂})(ξ̂ − E{ξ̂})′A′} = A E{(ξ̂ − E{ξ̂})(ξ̂ − E{ξ̂})′} A′
D{Aξ̂} = A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹ E{(y − E{y})(y − E{y})′} Σ_y⁻¹A(A′Σ_y⁻¹A)⁻¹A′
D{Aξ̂} = A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹A(A′Σ_y⁻¹A)⁻¹A′ = A(A′Σ_y⁻¹A)⁻¹A′

leads to the proclaimed result.

(iii) ê_y = [I_n − A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹]y:
Similarly, if we substitute Σ_y-BLUUE of ξ, namely (4.21), in ê_y = y − Ê{y} = y − Aξ̂ = [I_n − A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹]y, we gain what we wanted.

(iv) D{ê_y} = Σ_y − A(A′Σ_y⁻¹A)⁻¹A′:
D{ê_y} := E{(ê_y − E{ê_y})(ê_y − E{ê_y})′}. As soon as we substitute E{ê_y} = [I_n − A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹]E{y} = 0 in the definition of the dispersion matrix D{ê_y}, we are led to

D{ê_y} = [I_n − A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹] Σ_y [I_n − Σ_y⁻¹A(A′Σ_y⁻¹A)⁻¹A′]
       = [Σ_y − A(A′Σ_y⁻¹A)⁻¹A′][I_n − Σ_y⁻¹A(A′Σ_y⁻¹A)⁻¹A′]
       = Σ_y − A(A′Σ_y⁻¹A)⁻¹A′ − A(A′Σ_y⁻¹A)⁻¹A′ + A(A′Σ_y⁻¹A)⁻¹A′
       = Σ_y − A(A′Σ_y⁻¹A)⁻¹A′

and rk D{ê_y} = rk D{y} − rk A(A′Σ_y⁻¹A)⁻¹A′ = n − m.

(v) D{y} = D{Aξ̂ + ê_y} = D{Aξ̂} + D{ê_y} = D{e_y − ê_y} + D{ê_y}:

y − E{y} = y − Aξ = y − Aξ̂ + A(ξ̂ − ξ),  i.e.  y − E{y} = A(ξ̂ − ξ) + ê_y.

The additive decomposition of the residual vector y − E{y} leaves us with two terms, namely the predicted residual vector ê_y and a term which is a linear functional of ξ̂ − ξ. The corresponding product decomposition

[y − E{y}][y − E{y}]′ = A(ξ̂ − ξ)(ξ̂ − ξ)′A′ + A(ξ̂ − ξ)ê_y′ + ê_y(ξ̂ − ξ)′A′ + ê_y ê_y′,

for ξ̂ Σ_y-BLUUE of ξ, in particular E{ξ̂} = ξ, gives

D{y} = E{[y − E{y}][y − E{y}]′} = D{Aξ̂} + D{ê_y} = D{e_y − ê_y} + D{ê_y}

due to

E{A(ξ̂ − E{ξ̂})ê_y′} = E{A(ξ̂ − E{ξ̂})(y − Aξ̂)′} = 0,  E{ê_y(ξ̂ − E{ξ̂})′A′} = 0,

or C{Aξ̂, ê_y} = 0, C{ê_y, Aξ̂} = 0. These covariance identities are proven next:

C{Aξ̂, ê_y} = A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹ E{(y − E{y})(y − E{y})′} [I_n − A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹]′
           = A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹ Σ_y [I_n − Σ_y⁻¹A(A′Σ_y⁻¹A)⁻¹A′]
           = A(A′Σ_y⁻¹A)⁻¹A′ − A(A′Σ_y⁻¹A)⁻¹A′ = 0.

Here is my proof's end.
□
We recommend the following exercises.

Exercise 4.1 (translational invariance: y ↦ y − E{y}):
Prove that the error prediction of type ξ̂ Σ_y-BLUUE of ξ, namely
ê_y = [I_n − A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹]y,
is translation invariant in the sense of y ↦ y − E{y}, that is
ê_y = [I_n − A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹]e_y  subject to  e_y := y − E{y}.

Exercise 4.2 (idempotence):
Is the matrix I_n − A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹ idempotent?

Exercise 4.3 (projection matrices):
Are the matrices A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹ and I_n − A(A′Σ_y⁻¹A)⁻¹A′Σ_y⁻¹ projection matrices?

4-22 The Equivalence Theorem of G_y-LESS and Σ_y-BLUUE
We have included this fourth chapter on Σ_y-BLUUE in order to interpret G_y-LESS of the third chapter. The key question is open:

? When are Σ_y-BLUUE and G_y-LESS equivalent ?

The answer is given by

Theorem 4.5 (equivalence of Σ_y-BLUUE and G_y-LESS):
With respect to the special linear Gauss–Markov model of full column rank (4.1), (4.2), ξ̂ = Ly is Σ_y-BLUUE if ξ_ℓ = Ly is G_y-LESS of (3.1) for

G_y = Σ_y⁻¹  ⇔  G_y⁻¹ = Σ_y.   (4.35)

In such a case, ξ̂ = ξ_ℓ is the unique solution of the system of normal equations

(A′Σ_y⁻¹A) ξ̂ = A′Σ_y⁻¹y   (4.36)

attached with the regular dispersion matrix

D{ξ̂} = (A′Σ_y⁻¹A)⁻¹.   (4.37)

The proof is straightforward if we compare the solution (3.11) of G_y-LESS with (4.21) of Σ_y-BLUUE. Obviously the inverse dispersion matrix D{y} of y ∈ {Y, pdf} is equivalent to the matrix of the metric G_y of the observation space Y. Conversely, the inverse matrix of the metric of the observation space Y determines the variance-covariance matrix D{y} = Σ_y of the random variable y ∈ {Y, pdf}.
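The equivalence of Theorem 4.5 is easy to verify numerically. The following sketch is our own illustration with an assumed toy model (NumPy): it computes ξ̂ once by the Σ_y-BLUUE formula (4.21) and once as G_y-LESS with G_y = Σ_y⁻¹ via whitening, and also evaluates the residual-based empirical variance of unit weight used later in Section 4-3.

:Sketch (Python):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 8, 2
    A = np.column_stack([np.ones(n), np.arange(n, dtype=float)])   # toy design, rk A = m
    B = rng.normal(size=(n, n))
    Sigma = B @ B.T + n * np.eye(n)          # an assumed positive definite Sigma_y
    y = A @ np.array([1.0, 0.5]) + rng.normal(size=n)

    W = np.linalg.inv(Sigma)                 # G_y = Sigma_y^{-1}
    xi_blu = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)     # Sigma_y-BLUUE, (4.21)

    Lw = np.linalg.cholesky(W)               # whitening: (y-Ax)'G_y(y-Ax) = ||Lw'(y-Ax)||^2
    xi_less = np.linalg.lstsq(Lw.T @ A, Lw.T @ y, rcond=None)[0]   # G_y-LESS
    assert np.allclose(xi_blu, xi_less)      # Theorem 4.5: the two solutions coincide

    D_xi = np.linalg.inv(A.T @ W @ A)        # dispersion matrix (4.37)
    e_hat = y - A @ xi_blu                   # residual vector (4.31)
    sigma2_hat = e_hat @ W @ e_hat / (n - m) # empirical variance of unit weight
    print(xi_blu, sigma2_hat)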
4-3 Setup of the best invariant quadratic uniformly unbiased estimator of type BIQUUE for the central moments of second order The subject of variance -covariance component estimation within Mathematical Statistics has been one of the central research topics in the nineteen eighties. In a remarkable bibliography up-to-date to the year 1977 H. Sahai listed more than 1000 papers on variance-covariance component estimations, where his basic source was “Statistical Theory and Method“ abstracts (published for the International Statistical Institute by Longman Groups Limited), "Mathematical Reviews" and "Abstract Service of Quality Control and Applied Statistics". Excellent review papers and books exist on the topic of variance-covariance estimation such as C.R. Rao and J. Kleffe, R.S. Rao (1977) S. B. Searle (1978), L.R. Verdooren (1980), J. Kleffe (1980), and R. Thompson (1980). The PhD Thesis of B. Schaffrin (1983) offers a critical review of state-of-the-art of variance-covariance component estimation. In Geodetic Sciences variance components estimation originates from F. R. Helmert (1924) who used least squares residuals to estimate heterogeneous variance components. R. Kelm (1974) and E. Grafarend, A. Kleusberg and B. Schaffrin (1980) proved the relation of Ȉ0 Helmert type IQUUE balled Ȉ - HIQUUE to BIQUUE and MINQUUE invented by C. R. Rao. Most notable is the Ph. D. Thesis of M. Serbetci (1968) whose gravimetric measurements were analyzed by Ȉ 0 -HIQUUE Geodetic extensions of the Helmert method to compete variance components originate from H. Ebner (1972, 1977), W. Förstner (1979, 1980), W. Welsch (1977, 1978, 1979, 1980), K. R. Koch (1978, 1981), C. G. Persson (1981), L. Sjoeberg (1978), E. Grafarend and A. d'Hone (1978), E. Grafarend (1984) B. Schaffrin (1979, 1980, 1981). W. Förstner (1979), H. Fröhlich (1980), and K.R. Koch (1981) used the estimation of variance components for the adjustment of geodetic networks and the estimation of a length dependent variance of distances. A special field of geodetic application has been oscillation analysis based upon a fundamental paper by H. Wolf (1975), namely M. Junasevic (1977) for the estimation of signal-to-noise ratio in gyroscopic azimuth observations. The Helmert method of variance component estimation was used by E. Grafarend and A. Kleusberg (1980) and A. Kleusberg and E. Grafarend (1981) to estimate variances of signal and noise in gyrocompass observations. Alternatively K. Kubik (1967a, b, c, 1970) pioneered the method of Maximum Likelihood (MALE) for estimating weight ratios in a hybrid distance – direction network. "MALE" and "FEMALE" extensions were proposed by B. Schaffrin (1983), K. R. Koch (1986), and Z. C. Yu (1996). A typical problem with Ȉ0 -Helmert type IQUUE is that it does not produce positive variances in general. The problem of generating a positive-definite variance-covariance matrix from variance-covariance component estimation has
already been highlighted by J. R. Brook and T. Moore (1980), K.G. Brown (1977, 1978), O. Bemk and H. Wandl (1980), V. Chew (1970), Han Chien-Pai (1978), R. R. Corbeil and S. R. Searle (REML, 1976), F. J. H. Don and J. R. Magnus (1980), H. Drygas (1980), S. Gnot, W. Klonecki and R. Zmyslony (1977). H. O. Hartley and J. N. K. Rao (ML, 1967), in particular J. Hartung (1979, 1980), J. L. Hess (1979), S. D. Horn and R. A. Horn (1975), S. D. Horn, R. A. Horn and D. B. Duncan (1975), C. G. Khatri (1979), J. Kleffe (1978, 1980), ), J. Kleffe and J. Zöllner (1978), in particular L. R. Lamotte (1973, 1980), S. K. Mitra (1971), R. Pincus (1977), in particular F. Pukelsheim (1976, 1977, 1979, 1981 a, b), F. Pukelsheim and G. P. Styan (1979), C. R. Rao (1970, 1978), S. R. Searle (1979), S. R. Searle and H. V. Henderson (1979), J. S. Seely (1972, 1977), in particular W. A. Thompson (1962, 1980), L. R. Verdooren (1979), and H. White (1980). In view of available textbooks, review papers and basic contributions in scientific journals we are only able to give a short introduction. First, we outline the general model of variance-covariance components leading to a linear structure for the central second order moment, known as the variance-covariance matrix. Second, for the example of one variance component we discuss the key role of the postulate's (i) symmetry, (ii) invariance, (iii) uniform unbiasedness, and (iv) minimum variance. Third, we review variance-covariance component estimations of Helmert type. 4-31
Block partitioning of the dispersion matrix and linear space generated by variance-covariance components
The variance-covariance component model is defined by the block partitioning (4.33) of a variance-covariance matrix Ȉ y , also called dispersion matrix D{y} , which follows from a corresponding rank partitioning of the observation vector y = [ y1c,… , yAc ]c . The integer number A is the number of blocks. For instance, the variance-covariance matrix Ȉ R n× n in (4.41) is partitioned into A = 2 blocks. The various blocks consequently factorized by variance V 2j and by covariances V jk = U jk V jV k . U jk [1, +1] denotes the correlation coefficient between the blocks. For instance, D{y1 } = V11V 12 is a variance factorization, while D{y1 , y 2 } = V12V 12 = V12 U12V 1 V 2 is a covariance factorization. The matrix blocks V jj are built into the matrix C jj , while the off-diagonal blocks V jk , V jkc into the matrix C jk of the same dimensions. dim Ȉ = dim C jj = dim C jk = n × n . The collective matrices C jj and C jk enable us to develop an additive decomposition (4.36), (4.43) of the block partitioning variance-covariance matrix Ȉ y . As soon as we collect all variance-covariance components in an peculiar true order, namely ı := [V 12 , V 12 , V 22 , V 13 , V 23 , V 32 ,..., V A 1A , , V A2 ]c , we are led to a linear form of the dispersion matrix (4.37), (4.43) as well as of the
dispersion vector (4.39), (4.44). Indeed the dispersion vector d(y ) = Xı builds up a linear form where the second order design matrix X, namely X := [vec C1 ," , vec CA ( A +1) ] R n
2
× A ( A +1) / 2
,
reflects the block structure. There are A(A+1)/2 matrices C j , j{1," , A(A +1) / 2} . For instance, for A = 2 we are left with 3 block matrices {C1 , C2 , C3 } . Before we analyze the variance-covariance component model in more detail, we briefly mention the multinominal inverse Ȉ 1 of the block partitioned matrix Ȉ . For instance by “JPM” and “SCHUR” we gain the block partitioned inverse matrix Ȉ 1 with elements {U11 , U12 , U 22 } (4.51) – (4.54) derived from the block partitioned matrix Ȉ with elements {V11 , V12 , V22 } (4.47). “Sequential JPM” solves the block inverse problems for any block partitioned matrix. With reference to Box 4.2 and Box 4.3 Ȉ = C1V 1 + C2V 2 + C3V 3 Ȉ 1 = E1 (V ) + E2 (V ) + E3 (V ) is an example. Box 4.2 Partitioning of variance-covariance matrix ª V11V 12 « Vc V « 12 12 Ȉ=« # « V1cA 1V 1A 1 «¬ V1cAV 1A
V12V 12 V22V 22 # V2cA 1V 2A 1 V2cAV 2A
V1A 1V 1A 1 V2A 1V 2A 1 # " VA 1A 1V A21 " VAc1AV A 1A " "
V1AV 1A º V2AV 2A » » # »>0 VA 1AV A 1A » VAAV A2 »¼
(4.38)
"A second moments V 2 of type variance, A (A 1) / 2 second moment V jk of type covariance matrix blocks of second order design ª0 " 0º C jj := « # V jj # » « » ¬« 0 " 0 ¼» 0º ª0 «" 0 V jk " » » C jk := « Vkj «" " » «0 0 »¼ ¬
j {1," , A }
ª subject to j < k « and j , k {1," , A} ¬
A
A 1, A
j =1
j =1, k = 2, j < k
Ȉ = ¦ C jjV 2j + Ȉ=
A ( A +1) / 2
¦ j =1
¦
C jk V jk
C jV j R n× m
(4.39)
(4.40)
(4.41) (4.42)
[V 12 , V 12 , V 22 , V 13 , V 23 , V 32 ,..., V A 1A , , V A2 ]' =: V
(4.43)
"dispersion vector" D{y} := Ȉ y d {y} = vec D{y} = vec Ȉ d (y ) =
A ( A +1) / 2
¦
(vec C j )V j = XV
(4.44)
j =1
" X is called second order design matrix" X := [vec C1 ," , vec CA ( A +1) / 2 ]
(4.45)
"dimension identities" d (y ) R n ×1 , V R, X R n ×A ( A +1) / 2 . 2
2
Box 4.3 Multinomial inverse :Input: Ȉ12 º ª V11V 12 V12V 12 º ªȈ Ȉ = « 11 »= »=« ' ¬ Ȉ12 Ȉ 22 ¼ ¬ V12c V 12 V22V 22 ¼ ª0 0 º 2 ª V 0 º 2 ª 0 V12 º n× m = « 11 V1 + « V 12 + « » »V 2 R » c V 0 0 V 0 0 ¬ ¼ ¬ 12 ¼ ¬ 22 ¼ ªV C11 := C1 := « 11 ¬ 0
0º ª 0 , C12 := C2 := « » 0¼ ¬ V12c
V12 º ª0 0 º , C22 := C3 := « » » 0 ¼ ¬ 0 V22 ¼
(4.46)
(4.47)
3
Ȉ = C11V 12 + C12V 12 + C22V 22 = C1V 1 + C2V 2 + C3V 3 = ¦ C jV j
(4.48)
ªV 1 º vec 6 = ¦ (vec C j )V j =[vec C1 , vec C2 , vec C3 ] ««V 2 »» = XV j =1 «¬V 3 »¼
(4.49)
j =1
3
vec C j R n
2
×1
j {1,..., A(A + 1) / 2}
" X is called second order design matrix" X := [vec C1 ," , vec CA (A +1) / 2 ] R n
2
×A ( A +1) / 2
here: A=2 ªU Ȉ 1 = « 11 ¬ 0
:output: 0 º 2 ª 0 U12 º 1 ª 0 0 º 2 V1 + « V 12 + « »V 2 c 0 »¼ 0 »¼ ¬ U12 ¬ 0 U 22 ¼
(4.50)
subject to (4.51)
U11 = V111 + qV111 V12 SV12c V111 , U12 = Uc21 = qV111 V12 S
(4.53)
U 22 = S = (V22 qV12c V111 V12 ) 1 ; q := Ȉ 1 = E1 + E2 + E3 =
V 122 V 12V 22
(4.52) (4.54)
A ( A +1) / 2 = 3
¦
Ej
(4.55)
j =1
ªU E1 (V ) := « 11 ¬ 0
0 º 2 ª 0 V 1 , E 2 (V ) := « » c 0¼ ¬ U12
U12 º 1 ª 0 0 º 2 V 12 , E3 (V ) := « » »V 2 . 0 ¼ ¬ 0 U 22 ¼
(4.56)
The general result that inversion of a block partitioned symmetric matrix conserves the block structure is presented in Corollary 4.6 (multinomial inverse): Ȉ=
A ( A +1) / 2
¦
C j V j Ȉ 1 =
j =1
A ( A +1) / 2
¦
Ǽ j (V ) .
(4.57)
j =1
We shall take advantage of the block structured multinominal inverse when we are reviewing HIQUUE or variance-covariance estimations of Helmert type. The variance component model as well as the variance-covariance model are defined next. A variance component model is a linear model of type ª V11V 12 « 0 « Ȉ=« # « 0 « ¬ 0
0 V22V 22 # 0 0
" 0 " 0 % # " VA 1A 1V A21 " 0
0 º 0 »» # » 0 » » VAAV A2 ¼
ªV 12 º d {y} = vec Ȉ = [vec C11 ,… , vec C jj ] « " » «V 2 » ¬ A¼ + d {y} = XV V R .
(4.58)
(4.59) (4.60)
In contrast, the general model (4.49) is the variance-covariance model with a linear structure of type ª V 12 º «V » 12 (4.61) d {y} = vec Ȉ = [vec C11 , vec C12 , vec C12 ,… , vec CAA ] « V 22 » « » " « 2» «¬ V A »¼
d {y} = Xı V 2j R + , Ȉ positive definite.
(4.62)
The most popular cases of variance-covariance components are collected in the following examples.

Example 4.1 (one variance component, i.i.d. observations):
D{y} =: Σ_y = I_n σ²  subject to  Σ_y ∈ SYM(ℝⁿˣⁿ), σ² ∈ ℝ⁺.

Example 4.2 (one variance component, correlated observations):
D{y} =: Σ_y = V σ²  subject to  Σ_y ∈ SYM(ℝⁿˣⁿ), σ² ∈ ℝ⁺.

Example 4.3 (two variance components, two sets of totally uncorrelated observations, "heterogeneous observations"):

D{y} =: Σ_y = [ I_n₁ σ₁²   0 ; 0   I_n₂ σ₂² ]
subject to n = n₁ + n₂, Σ_y ∈ SYM(ℝⁿˣⁿ), σ₁² ∈ ℝ⁺, σ₂² ∈ ℝ⁺.   (4.63)

Example 4.4 (two variance components, one covariance component, two sets of correlated observations, "heterogeneous observations"):

D{y} =: Σ_y = [ V₁₁σ₁²   V₁₂σ₁₂ ; V₁₂′σ₁₂   V₂₂σ₂² ]
subject to n = n₁ + n₂, V₁₁ ∈ ℝⁿ¹ˣⁿ¹, V₂₂ ∈ ℝⁿ²ˣⁿ², V₁₂ ∈ ℝⁿ¹ˣⁿ²,
Σ_y ∈ SYM(ℝⁿˣⁿ), σ₁² ∈ ℝ⁺, σ₂² ∈ ℝ⁺, Σ_y positive definite.   (4.64)

Special case: V₁₁ = I_n₁, V₂₂ = I_n₂.   (4.65)

Example 4.5 (elementary error model, random effect model):

e_y = y − E{y | z} = Σ_{j=1}^ℓ A_j (z_j − E{z_j}) = Σ_{j=1}^ℓ A_j e_j^z   (4.66)

E{e_j^z} = 0,  E{e_j^z (e_k^z)′} = δ_jk I_q   (4.67)

D{y} =: Σ_y = Σ_{j=1}^ℓ A_j A_j′ σ_j² + Σ_{j,k=1, j<k}^ℓ (A_j A_k′ + A_k A_j′) σ_jk.   (4.68)

At this point, we should emphasize that a linear space of variance-covariance components can be built up independently of the block partitioning of the dispersion matrix D{y}. For further details and explicit examples let us refer to B. Schaffrin (1983).
4-3 Setup of BIQUUE
4-32
Invariant quadratic estimation of variance-covariance components of type IQE
By means of Definition 4.7 (one variance component) and Definition 4.9 (variance-covariance components) we introduce
Vˆ 2 IQE of V 2 and
Vˆ k IQE of V k .
Those conditions of IQE, represented in Lemma 4.7 and Lemma 4.9 enable us to separate the estimation process of first moments ȟ j (like BLUUE) from the estimation process of central second moments V k (like BIQUUE). Finally we provide you with the general solution (4.75) of the in homogeneous matrix equations M1/k 2 A = 0 (orthogonality conditions) for all k {1, " ,A(A+1)/2} where A(A+1)/2 is the number of variance-covariance components, restricted to the special Gauss–Markov model E {y} = Aȟ , d {y} = XV of "full column rank", A R n× m , rk A = m . Definition 4.7 (invariant quadratic estimation Vˆ 2 of V 2 : IQE ): The scalar Vˆ 2 is called IQE (Invariant Quadratic Estimation) of V 2 R + with respect to the special Gauss-Markov model of full column rank. E{y} = Aȟ, A R n×m , rk A = m (4.69) D{y} = VV 2 , V R n×n , rk V = n, V 2 R + , if the “variance component V 2 is V ” (i) a quadratic estimation
Vˆ 2 = y cMy = (vec M )c(y
y ) = (y c
y c)(vec M)
(4.70)
subject to M SYM := {M R n× n | M c = M}
(4.71)
(ii) transformational invariant : y o y E{y} =: ey in the sense of
Vˆ 2 = y cMy = e ycMe y or 2 Vˆ = (vec M )c(y
y ) = (vec M )c(e y
e y ) or 2 Vˆ = tr(Myy c) = tr(Me e c ) . y y
(4.72) (4.73) (4.74)
Already in the introductory paragraph we emphasized the key of "IQE". Indeed by the postulate "IQE" the estimation of the first moments E{y} = Aȟ is
224
4 The second problem of probabilistic regression
supported by the estimation of the central second moments D{y} = VV 2 or d {y} = XV . Let us present to you the fundamental result of " Vˆ 2 IQE OF V 2 ". Lemma 4.8 (invariant quadratic estimation Vˆ 2 of Vˆ 2 :IQE) : Let M = (M1/ 2 )cM1/ 2 be a multiplicative decomposition of the symmetric matrix M . The scalar Vˆ 2 is IQE of V 2 , if and only if M1/ 2 = 0
(4.75) 1/ 2
for all M
R
n× n
or A c(M1/ 2 )c = 0
(4.76)
. :Proof:
First, we substitute the transformation y = E{y} + e y subject to expectation identity E{y} = Aȟ, A R n× m , rk A = m, into y cMy. y ' My = ȟ cA cMAȟ + ȟ cA cMe y + e ycMAȟ + e ycMe y . Second, we take advantage of the multiplicative decomposition of the matrix M , namely M = (M1/ 2 )cM1/ 2 ,
(4.77)
which generates the symmetry of the matrix M SYM := {M R m×n | M c = M} y cMy = ȟ cA c(M1/ 2 )cM1/ 2 Aȟ + ȟ cA c(M1/ 2 )cM1/ 2e y + e yc (M1/ 2 )cM1/ 2 Aȟ + e ycMe y . Third, we postulate "IQE". y cMy = e ycMe y M1/ 2 A = 0 A c(M1/ 2 )c = 0. For the proof, here is my journey's end.
h
Let us extend " IQE " from a " one variance component model " to a " variancecovariance components model ". First, we define " IQE " ( 4.83 ) for variancecovariance components, second we give necessary and sufficient conditions identifying " IQE " . Definition 4.9 (variance-covariance components model Vˆ k IQE of V k ) : The dispersion vector dˆ (y ) is called IQE ("Invariant Quadratic Estimation") with respect to the special Gauss-Markov model of full column rank. ª E{y} = Aȟ, A {R n×m }; rk A = m « «¬ d {y} = Xı, D{y} ~ Ȉ y positive definite, rk Ȉ y = n,
(4.78)
225
4-3 Setup of BIQUUE
if the variance-covariance components ı := [V 12 , V 12 , V 22 , V 13 , V 23 ," , V A2 ]c
(4.79)
(i) bilinear estimations
Vˆ k = y cMy = (vec M )c(y
y ) = tr M k yy c M k R n× n× A ( A +1) / 2 ,
(4.80)
subject to M k SYM := {M k R n× n× A ( A +1) / 2 | M k = M k c },
(4.81)
(ii) translational invariant y o y E{y} =: e y
Vˆ k = y cM k y = e ycM k e y
(4.82)
Vˆ k = (vec M k )c(y
y ) = (vec M k )c(e y
e y ).
(4.83)
Note the fundamental lemma " Vˆ k IQE of V k " whose proof follows the same line as the proof of Lemma 4.7. Lemma 4.10 (invariant quadratic estimation Vˆ k of V k : IQE): Let M k = (M1/k 2 )cM1/k 2 be a multiplicative decomposition of the symmetric matrix M k . The dispersion vector Vˆ k is IQE of V k , if and only if (4.84)
M1/k 2 A = 0 or A c(M1/k 2 )c = 0
(4.85)
for all M1/k 2 R n× n× A ( A +1) / 2 . ? How can we characterize " Vˆ 2 IQE of V 2 " or " Vˆ k IQE of V k " ? The problem is left with the orthogonality conditions (4.75), (4.76) and (4.84), (4.85). Box 4.4 reviews the general solutions of the homogeneous equations (4.86) and (4.88) for our " full column rank linear model ". Box 4.4 General solutions of homogeneous matrix equations M1/k 2 = 0
M k = Z k (I n - AA )
" for all A G := {A R n× m | AA A = A} " : rk A = m
(4.86)
226
4 The second problem of probabilistic regression
A = A L = ( A cG y A) 1 A cG y
(4.87)
" for all left inverses A L {A R m× n | ( A A)c = A A} " M1/k 2 = 0 º 1 1/ 2 » M k = Z k [I n A( A cG y A) A cG y ] rk A d m ¼
(4.88)
"unknown matrices : Z k and G y . First, (4.86) is a representation of the general solutions of the inhomogeneous matrix equations (4.84) where Z k , k {1," , A(A + 1) / 2}, are arbitrary matrices. Note that k = 1, M1 describes the " one variance component model ", otherwise the general variance-covariance components model. Here we are dealing with a special Gauss-Markov model of " full column rank ", rk A = m . In this case, the generalized inverse A is specified as the " weighted left inverse " A L of type (4.71) whose weight G y is unknown. In summarizing, representations of two matrices Z k and G y to be unknown, given H1/k 2 , M k is computed by M k = (M1/k 2 )cM1/k 2 = [I n G y A(A cG y A)1 A ']Zck Z k [I n A( A cG y A) 1 A cG y ] (4.89) definitely as a symmetric matrix. 4-33
Invariant quadratic uniformly unbiased estimations of variancecovariance components of type IQUUE
Unbiased estimations have already been introduced for the first moments E{y} = Aȟ, A R n× m , rk A = m . Similarly we like to develop the theory of the one variance component V 2 and the variance-covariance unbiased estimations for the central second moments, namely components Vk , k{1,…,A(A +1)/2}, where A is the number of blocks. Definition 4.11 tells us when we use the terminology " invariant quadratic uniformly unbiased estimation " Vˆ 2 of V 2 or Vˆ k of V k , in short " IQUUE ". Lemma 4.12 identifies Vˆ 2 IQUUE of V 2 by the additional tr VM = 1 . In contrast, Lemma 4.12 focuses on Vˆ k IQUUE of V k by means of the additional conditions tr C j M k = į jk . Examples are given in the following paragraphs. Definition 4.11 (invariant quadratic uniformly unbiased estimation Vˆ 2 of V 2 and Vˆ k of V k : IQUUE) : The vector of variance-covariance components Vˆ k is called IQUUE (Invariant Quadratic Uniformly Unbiased Estimation ) of V k with respect to the special Gauss-Markov model of full column rank.
227
4-3 Setup of BIQUUE
ª E{y}= Aȟ, AR n×m , rk A = m « d {y}= Xı, XR n ×A ( A+1) / 2 , D{y}~ Ȉ positive definite, rk Ȉ y y « «¬ rk Ȉ y = n, vech D{y}= d{y}, 2
(4.90) if the variance-covariance components ı := [V 12 , V 12 , V 22 , V 13 , V 23 ," , V A2 ]
(4.91)
are (i) a bilinear estimation
Vˆ k = y cM k y = (vec M k )c(y
y ) = tr M k yy c M k R n× n× A ( A +1) / 2
(4.92)
subject to M k = (M1/k 2 )c(M1/k 2 ) SYM := {M k R n× m× A ( A +1) / 2 | M k = M k c }
(4.93)
(ii) translational invariant in the sense of y o y E{y} =: ey
Vˆ k = y cM k y = e ycM k e y
(4.94)
or Vˆ k = (vec M k )c(y
y ) = (vec M k )c(e y
e y ) or
(4.95)
Vˆ k = tr Ȃ k yy c = tr M k e y e yc ,
(4.96)
(iii) uniformly unbiased in the sense of k = 1 (one variance component) : E{Vˆ 2 } = V 2 , V 2 R + ,
(4.97)
k t 1 (variance-covariance components): E{Vˆ k } = V k , V k {R A ( A +1) / 2 | Ȉ y positive definite},
(4.98)
with A variance components and A(A-1)/2 covariance components. Note the quantor “for all V 2 R + ” within the definition of uniform unbiasedness (4.81) for one variance component. Indeed, weakly unbiased estimators exist without the quantor (B. Schaffrin 2000). A similar comment applies to the quantor “for all V k {R A ( A +1) / 2 | Ȉ y positive definite} ” within the definition of uniform unbiasedness (4.82) for variance-covariance components. Let us characterize “ Vˆ 2 IQUUE of V 2 ”.
228
4 The second problem of probabilistic regression
Lemma 4.12 ( Vˆ 2 IQUUE of V 2 ): The scalar Vˆ 2 is IQUUE of V 2 with respect to the special GaussMarkov model of full column rank. ª"first moment " : E{y} = Aȟ, A R n×m , rk A = m « + 2 2 n× n «¬"centralsecond moment " : D{y} = V , V R , rk V = n, V R , if and only if (4.99)
(i) M1/ 2 A = 0
and
(ii) tr VM = 1 .
(4.100)
:Proof: First, we compute E{Vˆ 2 } .
Vˆ 2 = tr Me y e yc E{Vˆ 2 } = tr MȈ y = tr Ȉ y M. Second, we substitute the “one variance component model” Ȉ y = VV 2 . E{Vˆ 2 } := V 2 V 2 R
tr VM = 1.
Third, we adopt the first condition of type “IQE”.
h
The conditions for “ Vˆ k IQUUE of V k ” are only slightly more complicated. Lemma 4.13 ( Vˆ k IQUUE of V 2 ): The vector Vˆ k , k {1," , A(A + 1) / 2} is IQUUE of V k with respect to the block partitioned special Gauss-Markov model of full column rank. " first moment" ª y1 º ª A1 º «y » «A » « 2 » « 2 » E{« # »} = « # » ȟ = Aȟ, A \ n n "n « » « » « y A 1 » « A A 1 » «¬ y A »¼ «¬ A A »¼ 1 2
A 1 , nA × m
n1 + n2 + " + nA 1 + nA = n " central second moment"
, rk A = m
(4.101)
229
4-3 Setup of BIQUUE 2 ª y1 º ª V11V 1 « «y » « 2 » « V12V 12 D{« # »} = « # « » « « y A 1 » « V1A 1V 1A 1 «¬ y A »¼ « V1AV 1l ¬
V12V 12 V22V 22 # V2A 1V 2A 1 V1AV 1l
V1A 1V 1A 1 V2A 1V 2A 1 # " VA 1,A 1V A21 " VA 1,AV A 1,A " "
A
A ( A +1) / 2
j =1
j , k =1 j j) to the variance-covariance matrix Ȉ y contain the variance factors V jj at {colj, rowj} while the covariance factors contain {V jkc , V jk } at {colk, rowj} and {colj, rowk}, respectively. The following proof of Lemma 4.12 is based upon the linear structure (4.88). :Proof: First, we compute E{Vˆ k } . E{Vˆ k } = tr M k Ȉ y = tr Ȉ y M k . Second, we substitute the block partitioning of the variance-covariance matrix Ȉy . A ( A +1) / 2
º A ( A +1) / 2 C jV j » tr Ȉ M = tr C j M kV j j =1 ¦ y k » j =1 » E{Vˆ k } = tr Ȉ y M k ¼
Ȉy =
¦
E{Vˆ k } = V k
A ( A +1) / 2
¦
(tr C j M k )V j = V k , V i R A ( A +1) / 2
(4.112)
j =1
tr C j M k G jk = 0 . Third, we adopt the first conditions of the type “IQE”. 4-34
Invariant quadratic uniformly unbiased estimations of one variance component (IQUUE) from Ȉ y BLUUE: HIQUUE
Here is our first example of “how to use IQUUE“. Let us adopt the residual vector e y as predicted by Ȉ y -BLUUE for a “one variance component“ dispersion model, namely D{y} = VV 2 , rk V = m . First, we prove that M1/ 2 generated by V-BLUUE fulfils both the conditions of IQUUE namely M1/ 2 A = 0 and tr VM = tr V (M1/ 2 )cM1/ 2 = 1 . As outlined in Box 4.5, the one condition of uniform unbiasedness leads to the solutions for one unknown D within the “ansatz” Z cZ = D V 1 , namely the number n-m of “degrees of freedom” or the “surjectivity defect”. Second, we follow “Helmert’s” ansatz to setup IQUUE of Helmert type, in Short “HIQUUE”. Box 4.5 IQUUE : one variance component 1st variations {E{y} = Ax, A R n× m , rk A = m, D{y} = VV 2 , rk V = m, V 2 R + } e y = [I n A ( A ' V 1A ) 1 A ' V 1 ]y
(4.31)
231
4-3 Setup of BIQUUE
1st test: IQE M1/ 2 A = 0 "if M1/ 2 = Z[I n A( A ' V 1 A ) 1 A ' V 1 ] , then M1/ 2 A = 0 " 2nd test : IQUUE "if tr VM = 1 , then tr{V[I n V 1 A( A ' V 1 A) 1 A ']Z cZ[I n A( A ' V 1 A) 1 A ' V 1 ]} = 1 ansatz : ZcZ = D V 1
(4.113)
tr VM = D tr{V[V V A( A cV A) ][I n A( AcV A) AcV ]} = 1 1
1
1
1
1
1
1
tr VM = D tr[I n A( A cV 1 A) A cV 1 ] = 1 tr I n = 0
º » tr[ A( A ' V A) A ' V ] = tr A( A ' V A) A ' V A = tr I m = m ¼ 1
1
1
1
1
tr VM = D (n m) = 1 D =
1
1 . nm
(4.114)
Let us make a statement about the translational invariance of e y predicted by Ȉ y - BLUUE and specified by the “one variance component” model Ȉ y = VV 2 . e y = e y ( Ȉ y - BLUUE) = [I n A( A ' Ȉ y1A) 1 A ' Ȉ y1 ]y .
(4.115)
Corollary 4.14 (translational invariance): e y = [I A( A ' Ȉ y1A) 1 A ' Ȉ y1 ]e y = Pe y
(4.116)
subject to P := I n A ( A ' Ȉ y1A) 1 A ' Ȉ y1 .
(4.117)
The proof is “a nice exercise”: Use e y = Py and replace y = E{y} + e y = A[ + e y . The result is our statement, which is based upon the “orthogonality condition” PA = 0 . Note that P is idempotent in the sense of P = P 2 . In order to generate “ Vˆ 2 IQUUE of V 2 ” we start from “Helmert’s ansatz”. Box 4.6 Helmert’s ansatz one variance component e cy Ȉ y1e y = ecy P ' Ȉ y1Pe y = tr PȈ y1Pe y ecy
(4.118)
E{e cy Ȉ y1e y } = tr(P ' Ȉ y1P E{e y ecy }) = tr(P ' Ȉ y1PȈ y )
(4.119)
232
4 The second problem of probabilistic regression
”one variance component“ Ȉ y = VV 2 = C1V 2 E{e cy V 1e y }= (tr P ' V 1PV )V 2
V 2 \ 2
(4.120)
tr P ' V 1 PV = tr[I n V 1 A ( A ' V 1 A ) A '] = n m
(4.121)
E{e cy V 1e y }= (n m)V 2
(4.122)
1 e cy V 1e y E{Vˆ 2 }=V 2 . nm Let us finally collect the result of “Helmert’s ansatz” in
Vˆ 2 :=
Corollary 4.15
(4.123)
( Vˆ 2 of HIQUUE of V 2 ): Helmert’s ansatz
Vˆ 2 =
1 e y ' V 1e y nm
(4.124)
is IQUUE, also called HIQUUE. 4-35
Invariant quadratic uniformly unbiased estimators of variance covariance components of Helmert type: HIQUUE versus HIQE
In the previous paragraphs we succeeded to prove that first M 1/ 2 generated by e y = e y ( Ȉ y - BLUUE) with respect to “one variance component” leads to IQUUE and second Helmert’s ansatz generated “ Vˆ 2 IQUUE of V 2 ”. Here we reverse the order. First, we prove that Helmert’s ansatz for estimating variancecovariance components may lead (or may, in general, not) lead to “ Vˆ k IQUUE of V k ”. Second, we discuss the proper choice of M1/k 2 and test whether (i) M1/k 2 A = 0 and (ii) tr H j M k = G jk is fulfilled by HIQUUE of whether M1/k 2 A = 0 is fulfilled by HIQE. Box 4.7 Helmert's ansatz variance-covariance components step one: make a sub order device of variance-covariance components:
V 0 := [V 12 , V 12 , V 2 2 , V 13 , V 12 ,..., V A 2 ]0c step two: compute Ȉ 0 := ( Ȉ y )0 = Ȉ
A ( A +1) / 2
¦ j =1
C jV j (V 0 )
(4.125)
233
4-3 Setup of BIQUUE
step three: compute e y = e y ( Ȉ 0 - BLUUE), namely 1
P (V 0 ) := (I A( A cȈ 0 A) 1 A cȈ 0 e y = P0 y = P0 e y step four: Helmert's ansatz
1
(4.126) (4.127)
e cy Ȉ 01e y = ecy P0cȈ 0-1P0e y = tr(P0 Ȉ 01P0ce y ecy )
(4.128)
E{eˆ cy Ȉ e } = tr (P0 Ȉ P c Ȉ)
(4.129)
1 0 0
-1 0 y
''variance-covariance components'' Ȉy = Ȉ
A ( A +1) / 2
¦ k =1
CkV k
(4.130)
E{e cy Ȉ -10 e cy } = tr(P0 Ȉ 01P0cCk )V k step five: multinomial inverse Ȉ=
A ( A +1) / 2
Ck V k Ȉ 1 =
¦ k =1
(4.131)
A ( A +1) / 2
¦ k =1
Ek (V j )
(4.132)
input: V 0 , Ȉ 0 , output: Ek (V 0 ). step six: Helmert's equation i, j {1," , A(A + 1) / 2} E{e cy Ei (V 0 )e y } =
A ( A +1) / 2
¦ k =1
(tr P(V 0 )Ei (V 0 )P c(V 0 )C j )V j
(4.133)
"Helmert's choice'' ecy Ei (V 0 ) e y =
A ( A +1) / 2
¦
(tr P(V 0 )Ei (V 0 )P c(V 0 )C j )V j
(4.134)
j =1
ª q := ey cEi (V 0 )ey « q = Hıˆ « H := tr P (V 0 )Ei (V 0 )P '(V 0 )C j (" Helmert ' s process ") (4.135) « 2 2 2 2 ¬ıˆ := [Vˆ1 , Vˆ12 , Vˆ 2 , Vˆ13 , Vˆ 23 , Vˆ 3 ,..., Vˆ A ] . Box 4.7 summarizes the essential steps which lead to “ Vˆ k HIQUUE of V k ” if det H = 0 , where H is the Helmert matrix. For the first step, we use some prior information V 0 = Vˆ 0 for the unknown variance-covariance components. For instance, ( Ȉ y )0 = Ȉ 0 = Diag[(V 12 ) 0 ,..., (V A2 ) 0 ] may be the available information on variance components, but leaving the covariance components with zero. Step two enforces the block partitioning of the variance-covariance matrix generating the linear space of variance-covariance components. e y = D0 e y in step three is the local generator of the Helmert ansatz in step four. Here we derive the key equation E{e y ' Ȉ -10 e y } = tr (D0 Ȉ 01D0c Ȉ) V k . Step five focuses on the multinormal inverse of the block partitioned matrix Ȉ , also called “multiple IPM”. Step six is
234
4 The second problem of probabilistic regression
taken if we replace 6 01 by the block partitioned inverse matrix, on the “Helmert’s ansatz”. The fundamental expectation equation which maps the variancecovariance components V j by means of the “Helmert traces” H to the quadratic terms q (V 0 ) . Shipping the expectation operator on the left side, we replace V j by their estimates Vˆ j . As a result we have found the aborted Helmert equation q = Hıˆ which has to be inverted. Note E{q} = Hı reproducing unbiasedness. Let us classify the solution of the Helmert equation q = Hı with respect to bias. First let us assume that the Helmert matrix is of full rank, vk H = A(A + 1) / 2 the number of unknown variance-covariance components. The inverse solution, Box 4.8, produces an update ıˆ 1 = H 1 (ıˆ 0 ) ' q(ıˆ 0 ) out of the zero order information Vˆ 0 we have implemented. For the next step, we iterate ıˆ 2 = H 1 (ıˆ 1 )q(ıˆ 1 ) up to the reproducing point Vˆ w = Vˆ w1 with in computer arithmetic when iteration ends. Indeed, we assume “Helmert is contracting”. Box 4.8 Solving Helmert's equation the fast case : rk H = A ( A + 1) / 2, det H z 0 :"iterated Helmert equation": Vˆ1 = H 1 (Vˆ 0 )q (Vˆ 0 ),..., VˆZ = HZ1 (VˆZ 1 ) q(VˆZ 1 )
(4.136)
"reproducing point" start: V 0 = Vˆ 0 Vˆ1 = H 01 q0 Vˆ 2 = H11 q1 subject to H1 := H (Vˆ1 ), q1 := q(Vˆ1 ) ... VˆZ = VˆZ 1 (computer arithmetic): end. ?Is the special Helmert variance-covariance estimator ıˆ x = H 1 q " JQUUE "? Corollary 4.16 gives a positive answer. Corollary 4.16 (Helmert equation, det H z 0); In case the Helmert matrix H is a full rank matrix, namely rk H = A ( A + 1) / 2 ıˆ = H 1q (4.137) is Ȉ f -HIQUUE at reproducing point. : Proof: q := e cy Ei e y E{ıˆ } = H E{q} = H 1 Hı = ı . 1
h
For the second case of our classification, let us assume that Helmert matrix is no longer of full rank, rk H < A(A + 1) / 2 , det H=0. Now we are left with the central question.
235
4-3 Setup of BIQUUE
? Is the special Helmert variance-covariance estimator ı = H l q = H + q of type “ MINOLESS” “ IQUUE”? n 1
Unfortunately, the MINOLESS of the rank factorized Helmert equation q = JKıˆ outlined in Box 4.9 by the weighted Moore-Penrose solution, indicates a negative answer. Instead, Corollary 4 proves Vˆ is only HIQE, but resumes also in establishing estimable variance-covariance components as “ Helmert linear combinations” of them. Box 4.9 Solving Helmert´s equation the second case: rk H < A(A + 1) / 2 , det H=0 " rank factorization" " MINOLESS" H = JK , rkH = rkF = rkG =: v
(4.138)
" dimension identities" H \ A ( A +1) / 2× A ( A +1) / 2 , J \ A ( A +1) / 2× v , G \ v × A ( A +1) / 2 H lm = H + ( weighted ) = K R ( weighted ) = J L ( weighted )
(4.139)
ıˆ lm = G ı-1K c(KG V-1K 1 )(J cG q J ) 1 G q q = HV+ , q q .
(4.140)
In case “ detH=0” Helmert´s variance-covariance components estimation is no longer unbiased, but estimable functions like Hıˆ exist: Corollary 4.17 (Helmert equation, det H=0): In case the Helmert matrix H, rkH< A(A + 1) / 2 , det H=0, is rank deficient, the Helmert equation in longer generates an unbiased IQE. An estimable parameter set is H Vˆ : Hıˆ = HH + q is Ȉ 0 HIQUUE (i) (4.141) (ii)
Vˆ is IQE . :Proof: (i) E{ıˆ } = H + E{q} = H + Hı z ı , ıˆ IQE
(ii) E{Hıˆ } = HH + E{q} = HH + Hı = Hı , Hıˆ HIQUUE.
h In summary, we lost a bit of our illusion that ı y ( Ȉ y BLUUE) now always produces IQUUE.
236
4 The second problem of probabilistic regression
“ The illusion of progress is short, but exciting” “ Solving the Helmert equations” IQUUE versus IQE
det H z 0
det H=0
ıˆ k is Ȉ 0 HIQUUE of V k
ıˆ k is only HIQE of ı k Hıˆ k is Ȉ 0 -IQUUE .
Figure 4.1 : Solving the Helmert equation for estimating variance-covariancecomponents Figure 4.1 illustrates the result of Corollary 4 and Corollary 5. Another drawback is that we have no guarantee that HIQE or HIQUUE ˆ . Such a postulate can generates a positive definite variance-covariance matrix Ȉ be enforced by means of an inequality constraint on the Helmert equation Hıˆ = q of type “ ıˆ > 0 ” or “ ıˆ > ı ” in symbolic writing. Then consult the text books on “ positive variance-covariance component estimation”. At this end, we have to give credit to B. Schaffrin (1.83, p.62) who classified Helmert´s variance-covariance components estimation for the first time correctly. 4-36
Best quadratic uniformly unbiased estimations of one variance component: BIQUUE
First, we give a definition of “best” Vˆ 2 IQUUE of V 2 within Definition 4.18 namely for a Gauss normal random variable y Y = {\ n , pdf} . Definition 4.19 presents a basic result representing “Gauss normal” BIQUUE. In particular we outline the reduction of fourth order moments to second order moments if the random variable y is Gauss normal or, more generally, quasi-normal. At same length we discuss the suitable choice of the proper constrained Lagrangean generating Vˆ 2 BIQUUE of V 2 . The highlighted is Lemma 4 where we resume the normal equations typical for BIQUUE and Theorem 4 with explicit representations of Vˆ 2 , D{Vˆ 2 } and Dˆ {Vˆ 2 } of type BIQUUE with respect to the special Gauss-Markov model with full column rank. ? What is the " best" Vˆ 2 IQUUE of V 2 ? First, let us define what is "best" IQUUE. Definition 4.18 ( Vˆ 2 best invariant quadratic uniformly unbiased estimation of V 2 : BIQUUE) Let y {\ n , pdf } be a Gauss normal random variable representing the stochastic observation vector. Its central moments up to order four
237
4-3 Setup of BIQUUE
E{eiy } = 0 , E{eiy e yj } = S ij = vijV 2
(4.142)
E{e e e } = S ijk = 0, (obliquity)
(4.143)
E{eiy e yj eky ely } = S ijkl = S ijS kl + S ik S jl + S ilS jk = = (vij vkl + vik v jl + vil v jk )V 4
(4.144)
y y y i j k
relate to the "centralized random variable" (4.145) e y := y E{y} = [eiy ] . The moment arrays are taken over the index set i, j, k, l {1000, n} when the natural number n is identified as the number of observations. n is the dimension of the observation space Y = {\ n , pdf } . The scalar Vˆ 2 is called BIQUUE of V 2 ( Best Invariant Quadratic Uniformly Unbiased Estimation) of the special Gauss-Markov model of full column rank. "first moments" : E{y} = Aȟ, A \ n× m , ȟ \ m , rk A = m (4.146) "central second moments": D{y} å y = VV 2 , V \ n×m , V 2 \ + , rk V = n m
(4.147) 2
+
where ȟ \ is the first unknown vector and V \ the second unknown " one variance component", if it is. (i)
a quadratic estimation (IQE): Vˆ 2 = y cMy = (vec M )cy
y = tr Myy c
(4.148)
subject to 1 2
1 2
M = (M )cM SYM := {M \ n×m M = M c} (ii)
translational invariant, in the sense of y o y E{y} =: e y ˆ V 2 = y cMy = ecy Mey or equivalently Vˆ 2 = (vec M )c y
y = (vec M )ce y
e y Vˆ 2 = tr Myy c = tr Me y ecy
(iii)
(4.149) (4.150) (4.151) (4.152) (4.153)
uniformly unbiased in the sense of
E{Vˆ 2 } = V 2 , V 2 \ + and (iv) of minimal variance in the sense
(4.154)
D{Vˆ 2 } := E{[Vˆ 2 E{Vˆ 2 }]2 } = min . M
(4.155)
238
4 The second problem of probabilistic regression
In order to produce "best" IQUUE we have to analyze the variance E{[Vˆ 2 E{Vˆ 2 }]1 } of the invariant quadratic estimation Vˆ 2 the "one variance component", of V 2 . In short, we present to you the result in Corollary 4.19 (the variance of Vˆ with respect to a Gauss normal IQE): If Vˆ 2 is IQE of V 2 , then for a Gauss normal observation space Y = {\ n , pdf } the variance of V 2 of type IQE is represented by E{[Vˆ 2 E{Vˆ 2 }]2 } = 2 tr M cVMV .
(4.156)
: Proof: ansatz: IQE
Vˆ = tr Me y ecy E{Vˆ 2 } = (tr MV )V 2 E{[Vˆ 2 E{Vˆ 2 }]} = E{[tr Me y ecy (tr MV)V 2 ][tr Me y ecy (tr MV)V 2 ]} = = E{(tr Me y ecy )(tr Me y ecy )} (tr MV) 2 V 4 (4.156). 2
h 2
2
With the “ansatz” Vˆ IQE of V we have achieved the first decomposition of var {Vˆ 2 } . The second decomposition of the first term will lead us to central moments of fourth order which will be decomposed into central moments of second order for a Gauss normal random variable y. The computation is easiest in “Ricci calculus“. An alternative computation of the reduction “fourth moments to second moments” in “Cayley calculus” which is a bit more advanced, is gives in Appendix D. E{(tr Me y ecy )(tr Me y ecy )} = = = =
n
n
¦
m ij m kl E{eiy e yj e ky ely } =
¦
m ij m kl ( ʌij ʌ kl + ʌik ʌ jl + ʌil ʌ jk ) =
¦
m ij m kl ( v ij v kl + v ik v jl + v il v jk )ı 4
i , j , k ,l =1 n i , j , k ,l =1 n i , j , k ,l =1
¦
i , j , k ,l =1
m ij m kl ʌijkl =
Ǽ{(tr Me y ecy )(tr 0 e y ecy )} = V 4 (tr MV ) 2 + 2V 4 tr(MV ) 2 .
(4.157)
A combination of the first and second decomposition leads to the final result. E{[Vˆ 2 E{Vˆ 2 }]} = E{(tr Me y ecy )(tr Me y ecy )} V 4 (tr MV) = = 2V 4 (tr MVMV ).
h A first choice of a constrained Lagrangean for the optimization problems “BIQUUE”, namely (4.158) of Box 4.10, is based upon the variance E{[Vˆ 2 E{Vˆ 2 }] IQE}
239
4-3 Setup of BIQUUE
constrained to “IQE” and the condition of uniform unbiasedness ( tr VM ) -1 = 0 as well as (ii) the condition of the invariant quadratic estimation A c(M1/ 2 ) = 0 . (i)
A second choice of a constrained Lagrangean generating Vˆ 2 BIQUUE of V 2 , namely (4.163) of Box 4.10, takes advantage of the general solution of the homogeneous matrix equation M1/ 2 A = 0 which we already obtained for “IQE”. (4.73) is the matrix container for M. In consequence, building into the Lagrangean the structure of the matrix M, desired by the condition of the invariance quadratic estimation Vˆ 2 IQE of V 2 reduces the first Lagrangean by the second condition. Accordingly, the second choice of the Lagrangean (4.163) includes only one condition, in particular the condition for an uniformly unbiased estimation ( tr VM )-1=0 . Still we are left with the problem to make a proper choice for the matrices ZcZ and G y . The first "ansatz" ZcZ = ĮG y produces a specific matrix M, while the second "ansatz" G y = V 1 couples the matrix of the metric of the observation space to the inverse variance factor V 1 . Those " natural specifications" reduce the second Lagrangean to a specific form (4.164), a third Lagrangean which only depends on two unknowns, D and O0 . Now we are prepared to present the basic result for Vˆ 2 BIQUUE of V 2 . Box 4.10 Choices of constrained Lagrangeans generating Vˆ 2 BIQUUE of V 2 "a first choice" L(M1/ 2 , O0 , A1 ) := 2 tr(MVMV ) + 2O0 [(tr VM ) 1] + 2 tr A1 Ac(M1/ 2 )c (4.158) M = (M
1/ 2
)cM
1/ 2
"a second choice" = [I n - G y A(A cG y A) 1 A c]Z cZ[I n A(A cG y A)-1 A cG y ] (4.159) ansatz : ZcZ = ĮG y M = ĮG y [I n A( A cG y A) 1 AcG y ]
(4.160)
VM = ĮVG y [I n A( A cG y A) 1 AcG y ]
(4.161)
ansatz : G y = V 1 VM = Į[I n A( AcV 1 A) 1 A cV 1 ]
(4.162)
L(Į, O0 ) = tr MVMV + 2O0 [( VM 1)]
(4.163)
tr MVMV = Į tr[I n A( AcV A) A cV ] = Į ( n m) 2
1
1
1
2
tr VM = Į tr[I n A( A cV 1 A) 1 A cV 1 ] = Į ( n m) L(Į, O0 ) = Į 2 (n m) + 2O0 [D (n m) 1] = min . Į , O0
(4.164)
240
4 The second problem of probabilistic regression
Lemma 4.20 ( Vˆ 2 BIQUUE of V 2 ): The scalar Vˆ 2 = y cMy is BIQUUE of V 2 with respect to special GaussMarkov model of full column rank, if and only if the matrix D together with the "Lagrange multiplier" fulfills the system of normal equations 1 º ª Dˆ º ª 0 º ª 1 «¬ n m 0 »¼ «Oˆ » = «¬1 »¼ ¬ 0¼
(4.165)
solved by 1 1 Dˆ = , O0 = . nm nm
(4.166)
: Proof: Minimizing the constrained Lagrangean L(D , O0 ) = D 2 (n m) + 2O0 [D ( n m) 1] = min D , O0
leads us to the necessary conditions 1 wL (Dˆ , Oˆ0 ) = Dˆ (n m) + Oˆ0 ( n m) = 0 2 wD 1 wL (Dˆ , Oˆ0 ) = Dˆ (n m) 1 = 0 2 wO0 or 1 º ª Dˆ º ª 0 º ª 1 «¬ n m 0 »¼ « Oˆ » = «¬1 »¼ ¬ 0¼ solved by Dˆ = Oˆ0 =
1 . nm
1 w2 L (Dˆ , Oˆ0 ) = n m × 0 2 wD 2 constitutes the necessary condition, automatically fulfilled. Such a solution for the parameter D leads us to the " BIQUUE" representation of the matrix M. M=
1 V 1 [I n A( AcV 1 A) 1 A cV 1 ] . nm
(4.167)
h 2
2
2
Explicit representations Vˆ BIQUUE of V , of the variance D{Vˆ } and its estimate D{ıˆ 2 } are highlighted by
241
4-3 Setup of BIQUUE
Theorem 4.21 ( Vˆ BIQUUE of V 2 ): Let Vˆ 2 = y cMy = (vec M )c(y
y ) = tr Myy c be BIQUUE of V 2 with reseat to the special Gauss-Markov model of full column rank. (i) Vˆ 2 BIQUUE of V 2 Explicit representations of Vˆ 2 BIQUUE of V 2
Vˆ 2 = (n m) 1 y c[V 1 V 1 A( A cV 1 A) 1 A cV 1 ]y
(4.168)
Vˆ 2 = (n m) 1 e cV 1e
(4.169)
subject to e = e ( BLUUE). (ii) D{ Vˆ 2
BIQUUE}
BIQUUE´s variance is explicitly represented by D{Vˆ 2 | BIQUUE} = E{[Vˆ 2 E{Vˆ 2 }]2 BIQUUE} = 2(n m) 1 (V 2 ) 2 . (4.170) (iii) D {Vˆ 2 } An estimate of BIQUUE´s variance is Dˆ {Vˆ 2 } = 2(n m) 1 (Vˆ 2 )
(4.171)
Dˆ {Vˆ } = 2(n m) (e cV e ) . 2
3
1
2
(4.172)
: Proof: We have already prepared the proof for (i). Therefore we continue to prove (ii) and (iii) (i) D{ıˆ 2 BIQUUE} D{Vˆ 2 } = E{[Vˆ 2 E{Vˆ 2 }]2 } = 2V 2 tr MVMV, 1 MV = [I n A( AcV 1 A) 1 AcV 1 ], nm 1 [I n A( A cV 1A) 1 A cV 1 ], MVMV = ( n m) 2 1 nm D{Vˆ 2 } = 2(n m) 1 (V 2 ) 2 . tr MVMV =
(iii) D{Vˆ 2 } Just replace within D{Vˆ 2 } the variance V 2 by the estimate Vˆ 2 . Dˆ {Vˆ 2 } = 2(n m) 1 (Vˆ 2 ) 2 .
h
242
4 The second problem of probabilistic regression
Upon writing the chapter on variance-covariance component estimation I learnt about the untimely death of J.F. Seely, Professor of Statistics at Oregon State University, on 23 February 2002. J.F. Seely, born on 11 February 1941 in the small town of Mt. Pleasant, Utah, who made various influential contributions to the theory of Gauss-Markov linear model, namely the quadratic statistics for estimation of variance components. His Ph.D. adviser G. Zyskind had elegantly characterized the situation where ordinary least squares approximation of fixed effects remains optimal for mixed models: the regression space should be invariant under multiplication by the variancecovariance matrix. J.F. Seely extended this idea to variance-covariance component estimation, introducing the notion of invariant quadratic subspaces and their relation to completeness. By characterizing the class of admissible embiased estimators of variance-covariance components. In particular, the usual ANOVA estimator in 2-variance component models is inadmissible. Among other contributions to the theory of mixed models, he succeeded in generalizing and improving on several existing procedures for tests and confidence intervals on variance-covariance components. Additional Reading Seely. J. and Lee, Y. (confidence interval for a variance: 1994), Azzam, A., Birkes, A.D. and Seely, J. (admissibility in linear models, polyhydral covariance structure: 1988), Seely, J. and Rady, E. (random effects – fixed effects, linear hypothesis: 1988), Seely, J. and Hogg, R.V. (unbiased estimation in linear models: 1982), Seely, J. (confidence intervals for positive linear combinations of variance components, 1980), Seely, J. (minimal sufficient statistics and completeness, 1977), Olsen, A., Seely, J. and Birkes, D. (invariant quadratic embiased estimators for two variance components, 1975), Seely, J. (quadratic subspaces and completeness, 1971) and Seely, J. (linear spaces and unbiased estimation, 1970).
5
The third problem of algebraic regression - inconsistent system of linear observational equations with datum defect: overdetermined- undertermined system of linear equations: {Ax + i = y | A \ n×m , y R ( A ) rk A < min{m, n}} :Fast track reading: Read only Lemma 5 (MINOS) and Lemma 5.9 (HAPS)
Lemma 5.2 G x -minimum norm, G y -least squares solution Lemma 5.3 G x -minimum norm, G y -least squares solution
Definition 5.1 G x -minimum norm, G y -least squares solution
Lemma 5.4 MINOLESS, rank factorization
Lemma 5.5 MINOLESS additive rank partitioning
Lemma 5.6 characterization of G x , G y -MINOS Lemma 5.7 eigenspace analysis versus eigenspace synthesis
244
5 The third problem of algebraic regression
Lemma 5.9 D -HAPS
Definition 5.8 D -HAPS
Lemma 5.10 D -HAPS
We shall outline three aspects of the general inverse problem given in discrete form (i) set-theoretic (fibering), (ii) algebraic (rank partitioning; “IPM”, the Implicit Function Theorem) and (iii) geometrical (slicing). Here we treat the third problem of algebraic regression, also called the general linear inverse problem: An inconsistent system of linear observational equations
{Ax + i = y | A \ n× m , rk A < min {n, m}} also called “under determined - over determined system of linear equations” is solved by means of an optimization problem. The introduction presents us with the front page example of inhomogeneous equations with unknowns. In terms of boxes and figures we review the minimum norm, least squares solution (“MINOLESS”) of such an inconsistent, rank deficient system of linear equations which is based upon the trinity
5-1 Introduction
245
5-1 Introduction With the introductory paragraph we explain the fundamental concepts and basic notions of this section. For you, the analyst, who has the difficult task to deal with measurements, observational data, modeling and modeling equations we present numerical examples and graphical illustrations of all abstract notions. The elementary introduction is written not for a mathematician, but for you, the analyst, with limited remote control of the notions given hereafter. May we gain your interest? Assume an n-dimensional observation space, here a linear space parameterized by n observations (finite, discrete) as coordinates y = [ y1 ," , yn ]c R n in which an m-dimensional model manifold is embedded (immersed). The model manifold is described as the range of a linear operator f from an m-dimensional parameter space X into the observation space Y. As a mapping f is established by the mathematical equations which relate all observables to the unknown parameters. Here the parameter space X , the domain of the linear operator f, will be also restricted to a linear space which is parameterized by coordinates x = [ x1 ," , xm ]c R m . In this way the linear operator f can be understood as a coordinate mapping A : x 6 y = Ax. The linear mapping f : X o Y is geometrically characterized by its range R(f), namely R(A), defined by R(f):= {y R n | y = f(x) for some x X} which in general is a linear subspace of Y and its kernel N(f), namely N(A), defined by N ( f ) := {x X | f (x) = 0}. Here the range R(f), namely the range space R(A), does not coincide with the ndimensional observation space Y such that y R (f ) , namely y R (A) . In addition, we shall assume here that the kernel N(f), namely null space N(A) is not trivial: Or we may write N(f) z {0}. First, Example 1.3 confronts us with an inconsistent system of linear equations with a datum defect. Second, such a system of equations is formulated as a special linear model in terms of matrix algebra. In particular we are aiming at an explanation of the terms “inconsistent” and “datum defect”. The rank of the matrix A is introduced as the index of the linear operator A. The left complementary index n – rk A is responsible for surjectivity defect, which its right complementary index m – rk A for the injectivity (datum defect). As a linear mapping f is neither “onto”, nor “one-to-one” or neither surjective, nor injective. Third, we are going to open the toolbox of partitioning. By means of additive rank partitioning (horizontal and vertical rank partitioning) we construct the minimum norm – least squares solution (MINOLESS) of the inconsistent system of linear equations with datum defect Ax + i = y , rk A d min{n, m }. Box 5.3 is an explicit solution of the MINOLESS of our front page example. Fourth, we present an alternative solution of type “MINOLESS” of the front page example by multiplicative rank partitioning. Fifth, we succeed to identify
246
5 The third problem of algebraic regression
the range space R(A) and the null space N(A) using the door opener “rank partitioning”. 5-11
The front page example Example 5.1 (inconsistent system of linear equations with datum defect: Ax + i = y, x X = R m , y Y R n A R n× m , r = rk A d min{n, m} ):
Firstly, the introductory example solves the front page inconsistent system of linear equations with datum defect, x1 + x2 1
x1 + x2 + i1 = 1
x2 + x3 1
x2 + x3 + i2 = 1
or
+ x1 x3 3
+ x1 x3 + i3 = 3
obviously in general dealing with the linear space X = R m x, dim X = m, here m=3, called the parameter space, and the linear space Y = R n y , dim Y = n, here n = 3 , called the observation space. 5-12
The front page example in matrix algebra
Secondly, by means of Box 5 and according to A. Cayley’s doctrine let us specify the inconsistent system of linear equations with datum defect in terms of matrix algebra. Box 5.1: Special linear model: three observations, three unknowns, rk A =2 ª y1 º ª a11 y = «« y2 »» = «« a21 ¬« y3 ¼» ¬« a31
a12 a22 a32
a13 º ª x1 º ª i1 º a23 »» «« x2 »» + ««i2 »» a33 ¼» ¬« x3 ¼» ¬« i3 ¼»
ª 1 º ª 1 1 0 º ª x1 º ª i1 º y = Ax + i : «« 1 »» = «« 0 1 1 »» «« x2 »» + ««i2 »» «¬ 3»¼ «¬ 1 0 1»¼ «¬ x3 »¼ «¬ i3 »¼ x c = [ x1 , x2 , x3 ], y c = [ y1 , y2 , y3 ] = [1, 1, 3], i c = [i1 , i2 , i3 ] x R 3×1 , y Z 3×1 R 3×1 ª 1 1 0 º A := «« 0 1 1 »» Z 3×3 R 3×3 «¬ 1 0 1»¼ r = rk A = 2 .
247
5-1 Introduction
The matrix A R n× m , here A R 3×3 , is an element of R n× m generating a linear mapping f : x 6 Ax. A mapping f is called linear if f (O x1 + x2 ) = O f ( x1 ) + f ( x2 ) holds. The range R(f), in geometry called “the range space R(A)”, and the kernel N(f), in geometry called “the null space N(A)” characterized the linear mapping as we shall see. ? Why is the front page system of linear equations called inconsistent ? For instance, let us solve the first two equations, namely -x1 + x3 = 2 or x1 – x3 = -2, in order to solve for x1 and x3. As soon as we compare this result to the third equation we are led to the inconsistency 2 = 3. Obviously such a system of linear equations needs general inconsistency parameters (i1 , i2 , i3 ) in order to avoid contradiction. Since the right-hand side of the equations, namely the in homogeneity of the system of linear equations, has been measured as well as the linear model (the model equations) has been fixed, we have no alternative but inconsistency. Within matrix algebra the index of the linear operator A is the rank r = rk A , here r = 2, which coincides neither with dim X = m, (“parameter space”) nor with dim Y = n (“observation space”). Indeed r = rk A < min {n, m}, here r = rk A < min{3, 3}. In the terminology of the linear mapping f, f is neither onto (“surjective”), nor one-to-one (“injective”). The left complementary index of the linear mapping f, namely the linear operator A, which accounts for the surjectivity defect, is given by d s = n rkA, also called “degree of freedom” (here d s = n rkA = 1 ). In contrast, the right complementary index of the linear mapping f, namely the linear operator A, which accounts for the injectivity defect is given by d = m rkA (here d = m rkA = 1 ). While “surjectivity” relates to the range R(f) or “the range space R(A)” and “injectivity” to the kernel N(f) or “the null space N(A)” we shall constructively introduce the notion of range R ( f ) range space R (A)
versus
kernel N ( f ) null space N ( f )
by consequently solving the inconsistent system of linear equations with datum defect. But beforehand let us ask: ? Why is the inconsistent system of linear equations called deficient with respect to the datum ? At this point we have to go back to the measurement process. Our front page numerical example has been generated from measurements with a leveling instrument: Three height differences ( yDE , yEJ , yJD ) in a triangular network have been observed. They are related to absolute height x1 = hD , x2 = hE , x3 = hJ by means of hDE = hE hD , hEJ = hJ hE , hJD = hD hJ at points {PD , PE , PJ } , outlined in more detail in Box 5.1.
248
5 The third problem of algebraic regression
Box 5.2: The measurement process of leveling and its relation to the linear model y1 = yDE = hDE + iDE = hD + hE + iDE y2 = yEJ = hEJ + iEJ = hE + hJ + iEJ y3 = yJD = hJD + iJD = hJ + hD + iJD ª y1 º ª hD + hE + iDE º ª x1 + x2 + i1 º ª 1 1 0 º ª x1 º ª i1 º « y » = « h + h + i » = « x + x + i » = « 0 1 1 » « x » + « i » . J EJ » « 2» « E « 2 3 2» « »« 2» « 2» «¬ y3 »¼ «¬ hJ + hD + iJD »¼ «¬ x3 + x1 + i3 »¼ «¬ 1 0 1»¼ «¬ x3 »¼ «¬ i3 »¼ Thirdly, let us begin with a more detailed analysis of the linear mapping f : Ax y or Ax + i = y , namely of the linear operator A R n× m , r = rk A d min{n, m}. We shall pay special attention to the three fundamental partitioning, namely
5-13
(i)
algebraic partitioning called additive and multiplicative rank partitioning of the matrix A,
(ii)
geometric partitioning called slicing of the linear space X (parameter space) as well as of the linear space Y (observation space),
(iii)
set-theoretical partitioning called fibering of the set X of parameter and the set Y of observations.
Minimum norm - least squares solution of the front page example by means of additive rank partitioning
Box 5.3 is a setup of the minimum norm – least squares solution of the inconsistent system of inhomogeneous linear equations with datum defect following the first principle “additive rank partitioning”. The term “additive” is taken from the additive decomposition y1 = A11x1 + A12 x 2 and y 2 = A 21x1 + A 22 x 2 of the observational equations subject to A11 R r × r , rk A11 d min{ n, m}. Box 5.3: Minimum norm-least squares solution of the inconsistent system of inhomogeneous linear equations with datum defect , “additive rank partitioning”. The solution of the hierarchical optimization problem (1st)
|| i ||2I = min : x
xl = arg{|| y Ax || I = min | Ax + i = y, A R n×m , rk A d min{ n, m }} 2
249
5-1 Introduction
(2nd)
|| x l ||2I = min : xl
xlm = arg{|| xl ||2I = min | AcAxl = Acy, AcA R m×m , rk AcA d m} is based upon the simultaneous horizontal and vertical rank partitioning of the matrix A, namely ªA A = « 11 ¬ A 21
A12 º , A R r × r , rk A11 = rk A =: r A 22 »¼ 11 with respect to the linear model y = Ax + i
y1 R r ×1 , x1 R r ×1 ª y1 º ª A11 A12 º ª x1 º ª i1 º + « », «y » = « A » « » y 2 R ( n r )×1 , x 2 R ( m r )×1 . ¬ 2 ¼ ¬ 21 A 22 ¼ ¬ x 2 ¼ ¬ i 2 ¼ First, as shown before, we compute the least-squares solution || i ||2I = min or ||y Ax ||2I = min which generates standard normal x x equations A cAxl = A cy c A11 + Ac21 A 21 ª A11 « Ac A + Ac A ¬ 12 11 22 21
or c A12 + Ac21 A 22 º ª x1 º ª A11 c A11 =« » « » c A12 + A c22 A 22 ¼ ¬ x 2 ¼ ¬ A12 c A12
Ac21 º ª y1 º A c22 »¼ ¬« y 2 ¼»
or ª N11 «N ¬ 21
N12 º ª x1l º ª m1 º = N 22 »¼ «¬ x 2 l »¼ «¬m 2 »¼ subject to
c A11 + A c21 A 21 , N12 := A11 c A12 + Ac21 A 22 , m1 = A11 c y1 + A c21y 2 N11 := A11 c A11 + A c22 A 21 , N 22 := A12 c A12 + A c22 A 22 , m 2 = A12 c y1 + A c22 y 2 , N 21 := A12 which are consistent linear equations with an (injectivity) defect d = m rkA . The front page example leads us to ªA A = « 11 ¬ A 21
ª 1 1 0 º A12 º « = 0 1 1 »» A 22 »¼ « «¬ 1 0 1»¼ or
ª 1 1 º ª0º A11 = « , A12 = « » » ¬ 0 1¼ ¬1 ¼ A 21 = [1 0] , A 22 = 1
250
5 The third problem of algebraic regression
ª 2 1 1º A cA = «« 1 2 1»» «¬ 1 1 2 »¼ ª 2 1º ª 1º N11 = « , N12 = « » , | N11 |= 3 z 0, » ¬ 1 2 ¼ ¬ 1¼ N 21 = [ 1 1] , N 22 = 2 ª1º ª 4 º c y1 + A c21y 2 = « » y1 = « » , m1 = A11 ¬1¼ ¬0¼ c y1 + A c22 y 2 = 4 . y 2 = 3, m 2 = A12 Second, we compute as shown before the minimum norm solution || x l ||2I = min or x1cx1 + x c2 x 2 which generates the standard normal x equations in the following way. l
L (x1 , x 2 ) = x1cx1 + xc2 x 2 = 1 1 1 1 c N11 = (xc2l N12 m1cN11 )(N11 N12 x 2l N11 m1 ) + xc2l x 2l = min
x2
“additive decomposition of the Lagrangean” L = L 0 + L1 + L 2 2 2 c N11 L 0 := m1cN11 m1 , L1:= 2xc2l N12 m1 2 c N11 L 2 := xc2l N12 N12 x 2l + xc2l x 2l
wL 1 wL1 1 wL2 (x 2lm ) = 0 (x 2lm ) + (x 2lm ) = 0 2 wx 2 2 wx 2 wx 2 2 2 c N11 c N11 N12 m1 + (I + N12 N12 )x 2lm = 0 2 2 c N11 c N11 x 2lm = (I + N12 N12 ) 1 N12 m1 ,
which constitute the necessary conditions. The theory of vector derivatives is presented in Appendix B. Following Appendix A Facts: Cayley inverse: sum of two matrices, formula (s9), (s10), namely (I + BC1 A c) 1 BC1 = B( AB + C) 1 for appropriate dimensions of the involved matrices, such that the identities holds 2 2 2 1 c N11 c N11 c ( N12 N12 c + N11 ( I + N12 N12 ) 1 N12 = N12 ) we finally find c (N12 N12 c + N112 ) 1 m1 . x 2 lm = N12 The second derivatives 1 w2L c N112 N12 + I ) > 0 (x 2 lm ) = (N12 2 wx 2 wxc2
251
5-1 Introduction 2 c N11 due to positive-definiteness of the matrix I + N12 N12 generate the sufficiency condition for obtaining the minimum of the unconstrained Lagrangean. Finally let us backward transform 1 1 x 2 l 6 x1m = N11 N12 x 2 l + N11 m1 , 1 2 1 c (N12 N12 c + N11 ) m1 + N11 x1lm = N111 N12 N12 m1 .
Let us right multiply the identity c = N11N11 c + N12 N12 c + N11N11 c N12 N12 c + N11 N11 c ) 1 such that by (N12 N12 c (N12 N12 c + N11 N11 c ) 1 = N11 N11 c (N12 N12 c + N11N11 c ) 1 + I N12 N12 holds, and left multiply by N111 , namely 1 c (N12 N12 c + N11 N11 c ) 1 = N11 c (N12 N12 c + N11 N11 c ) 1 + N11 N111 N12 N12 .
Obviously we have generated the linear form c (N12 N12 c + N11N11 c ) 1 m1 ª x1lm = N11 « c (N12 N12 c + N11N11 c ) 1 m1 ¬ x 2lm = N12 or ª x º ª Nc º c + N11N11 c ) 1 m1 xlm = « 1lm » = « 11 » (N12 N12 c ¼ ¬ x 2lm ¼ ¬ N12 or ª A c A + A c21A 21 º x lm = « 11 11
c A11 + A c22 A 21 »¼ ¬ A12 c A12 + A c21A 22 )( A12 c A11 + A c22 A 21 ) + ( A11 c A11 + A c21A 21 ) 2 ]1
[( A11 c y1 + A c21y 2 ].
[( A11 Let us compute numerically xlm for the front page example. ª 5 4 º ª1 1º c =« c =« N11N11 , N12 N12 » » ¬ 4 5 ¼ ¬1 1¼ ª 6 3 º 1 ª6 3º c + N11N11 c =« c + N11N11 c ]1 = N12 N12 , [N12 N12 » 27 «¬ 3 6 »¼ ¬ 3 6 ¼ ª 4 º 1 ª 4 º 4 4 2 m1 = « » x1lm = « » , x 2lm = , || xlm ||2I = 0 0 3 3 3 ¬ ¼ ¬ ¼
252
5 The third problem of algebraic regression
4 4 x1lm = hˆD = , x2lm = hˆE = 0, x3lm = hˆJ = 3 3 4 || xlm ||2I = 2 3 x + x + x = 0 ~ hˆ + hˆ + hˆ = 0. 1lm
2lm
D
3lm
E
J
The vector i lm of inconsistencies has to be finally computed by means of i lm = y Axlm ª1º 1 1 i lm = ««1»» , Aci l = 0, || i lm ||2I = 3. 3 3 «¬1»¼ The technique of horizontal and vertical rank partitioning has been pioneered by H. Wolf (1972,1973). h 5-14
Minimum norm - least squares solution of the front page example by means of multiplicative rank partitioning:
Box 5.4 is a setup of the minimum norm-least squares solution of the inconsistent system of inhomogeneous linear equations with datum defect following the first principle “multiplicative rank partitioning”. The term “multiplicative” is taken from the multiplicative decomposition y = Ax + i = DEy + i of the observational equations subject to A = DE, D R n×r , E R r × m , rk A = rk D = rk E d min{n, m} . Box 5.4: Minimum norm-least squares solution of the inconsistent system of inhomogeneous linear equations with datum defect multiplicative rank partitioning The solution of the hierarchical optimization problem (1st) ||i ||2I = min : x
xl = arg{|| y Ax ||2I = min | Ax + i = y , A R n×m , rk A d min{ n, m }} (2nd)
||x l ||2I = min : xl
x lm = arg{|| x l || I = min | A cAx l = A cy, AcA R m×m , rk AcA d m} 2
is based upon the rank factorization A = DE of the matrix A R n× m subject to simultaneous horizontal and vertical rank partitioning of the matrix A, namely
253
5-1 Introduction
ª D R n×r , rk D = rk A =: r d min{n, m} A = DE = « r ×m ¬E R , rk E = rk A =: r d min{n, m} with respect to the linear model y = Ax + i y = Ax + i = DEx + i
ª Ex =: z « DEx = Dz y = Dz + i . ¬
First, as shown before, we compute the least-squares solution || i ||2I = min or ||y Ax ||2I = min which generates standard normal x x equations DcDz l = Dcy z l = (DcD) 1 Dcy = Dcl y , which are consistent linear equations of rank rk D = rk DcD = rk A = r. The front page example leads us to ªA A = DE = « 11 ¬ A 21
ª 1 1 0 º ª 1 1 º A12 º « » = « 0 1 1 » , D = «« 0 1»» R 3×2 » A 22 ¼ «¬ 1 0 1»¼ «¬ 1 0 »¼ or
DcDE = DcA E = (DcD) 1 DcA ª 2 1º ª1 0 1º 1 ª2 1º 2×3 DcD = « , (DcD) 1 = « E=« » » »R 3 ¬1 2¼ ¬ 1 2 ¼ ¬ 0 1 1¼ 1 ª1 0 1º z l = (DcD) 1 Dc = « y 3 ¬0 1 1»¼ ª1º 4 ª2º y = «« 1 »» z l = « » 3 ¬1 ¼ «¬ 3»¼ 1 ª1 0 1º z l = (DcD) 1 Dc = « y 3 ¬ 0 1 1»¼ ª1º 4 ª2º y = «« 1 »» z l = « » . 3 ¬1 ¼ «¬ 3»¼ Second, as shown before, we compute the minimum norm solution || x A ||2I = min of the consistent system of linear equations with x datum defect, namely A
254
5 The third problem of algebraic regression
xlm = arg{|| xl ||2I = min | Exl = ( DcD) 1 Dcy }. xl
As outlined in Box1.3 the minimum norm solution of consistent equations with datum defect namely Exl = (DcD) 1 Dcy, rk E = rk A = r is xlm = Ec(EEc) 1 (DcD) 1 Dcy xlm = Em Dl y = A lm y = A+y ,
which is limit on the minimum norm generalized inverse. In summary, the minimum norm-least squares solution generalized inverse (MINOLESS g-inverse) also called pseudo-inverse A + or Moore-Penrose inverse is the product of the MINOS g-inverse Em (right inverse) and the LESS g-inverse Dl (left inverse). For the front page example we are led to compute ª1 0 1º ª2 1º E=« , EEc = « » » ¬1 2¼ ¬0 1 1¼ ª2 1 ª 2 1º 1« 1 (EEc) = « , Ec(EEc) = « 1 3 ¬ 1 2 »¼ 3 «¬ 1 ª 1 0 1« 1 1 xlm = Ec(EEc) (DcD) Dcy = « 1 1 3 «¬ 0 1 1
1º 2 »» 1 »¼ 1º 0 »» y 1»¼
ª1º ª 4 º ª 1º 1« » 4« » 4 « » y = « 1 » xlm = « 0 » = « 0 » , || xlm ||= 2 3 3 3 «¬ 3»¼ «¬ +4 »¼ «¬ +1»¼ ˆ ª x1lm º ª« hD º» ª 1º 4« » « » ˆ xlm = « x2lm » = « hE » = « 0 » 3 «¬ x3lm »¼ «« hˆ »» «¬ +1»¼ J ¬ ¼ 4 || xlm ||= 2 3 x1lm + x2lm + x3lm = 0 ~ hˆD + hˆE + hˆJ = 0. The vector i lm of inconsistencies has to be finally computed by means of i lm := y Axlm = [I n AAlm ]y, i lm = [I n Ec(EEc) 1 (DcD) 1 Dc]y;
255
5-1 Introduction
ª1º 1 1 3. i lm = ««1»» , Aci l = 0, || i lm ||= 3 3 «¬1»¼ h Box 5.5 summarizes the algorithmic steps for the diagnosis of the simultaneous horizontal and vertical rank partitioning to generate ( Fm Gy )-MINOS. 1
Box 5.5: algorithm The diagnostic algorithm for solving a general rank deficient system of linear equations y = Ax, A \ n× m , rk A < min{n, m} by means of simultaneous horizontal and vertical rank partioning Determine the rank of the matrix A rk A < min{n, m} .
Compute “the simultaneous horizontal and vertical rank partioning” r ×( m r ) r ×r ª A11 A12 º A11 \ , A12 \ A=« », n r ×r n r × m r ¬ A 21 A 22 ¼ A 21 \ ( ) , A 22 \ ( ) ( ) “n-r is called the left complementary index, m-r the right complementary index” “A as a linear operator is neither injective ( m r z 0 ) , nor surjective ( n r = 0 ) . ” Compute the range space R(A) and the null space N(A) of the linear operator A R(A) = span {wl1 ( A )," , wlr ( A )} N(A) = {x \ n | N11x1A + N12 x 2 A = 0} or 1 x1A = N11 N12 x 2 A .
256
5 The third problem of algebraic regression
Compute (Tm , Gy ) -MINOS ª x º ª Nc º ªy º c + N11N11 c ]1 [ A11 c G11y , A c21G12y ] « 1 » x Am = « 1 » = « 11 » = [ N12 N12 c ¼ ¬ x 2 ¼ ¬ N12 ¬y2 ¼ y y y y c G11A11 + A c21G 22 A 21 , N12 := A11 c G11A12 + A c21G 22 A 22 N11 := A11 y c , N 22 := A12 c G11y A12 + A c21G 22 N 21 := N12 A 22 .
5-15
The range R(f) and the kernel N(f) interpretation of “MINOLESS” by three partitionings (i) algebraic (rank partitioning) (ii) geometric (slicing) (iii) set-theoretical (fibering)
Here we will outline by means of Box 5.6 the range space as well as the null space of the general inconsistent system of linear equations. Box 5.6: The range space and the null space of the general inconsistent system of linear equations Ax + i = y , A \
n ×m
, rk A d min{n, m}
“additive rank partitioning”. The matrix A is called a simultaneous horizontal and vertical rank partitioning, if A12 º ªA r ×r A = « 11 » , A11 = \ , rk A11 = rk A =: r A A 22 ¼ ¬ 21 with respect to the linear model y = Ax + i, A \
n ×m
, rk A d min{n, m}
identification of the range space n
R(A) = span {¦ e i aij | j {1," , r}} i =1
“front page example”
257
5-1 Introduction
ª 1 1 0 º ª1º ª y1 º « » 3× 3 « » «¬ y2 »¼ = « 1 » , A = « 0 1 1 » \ , rk A =: r = 2 ¬ 3 ¼ ¬ 1 0 1¼ R(A)=span {e1 a11 + e 2 a21 + e3 a31 , e1 a12 + e 2 a22 + e3 a32 } \ 3 or R(A) = span {e1 + e3 , e1 e 2 } \ 3 = Y c1 = [1, 0,1], c 2 = [1, 1, 0], \ 3 = span{ e1 , e 2 , e3 }
ec1
ec2 O e3 y e1
e2
Figure 5.1 Range R (f ), range space R ( A) , (y R ( A))
identification of the null space c y1 + A c21 y 2 N11x1A + N12 x 2 A = A11 c y1 + A c22 y 2 N12 x1A + N 22 x 2 A = A12 N ( A ):= {x \ n | N11x1A + N12 x 2 A = 0} or 1 N11x1A + N12 x 2 A = 0 x1A = N11 N12 x 2 A “front page example” ª x1 º ªx3 º 1 ª 2 1 º ª 1º « x » = 3 « 1 2 » « 1» x 3A = « x » ¬ ¼¬ ¼ ¬ 3 ¼A ¬ 3 ¼A ª 2 1º ª 1º 1 ª2 1º 1 N11 = « , N11 = « , N12 = « » » » 3 ¬1 2¼ ¬ 1 2 ¼ ¬ 1¼ x1A = u, x 2A = u, x 3A = u N(A)= H 01 = G1,3 .
258
5 The third problem of algebraic regression
N(A)= L 0 1
N(A)=G1,3 \
3
x2
x1 Figure 5.2 : Kernel N( f ), null space N(A), “the null space N(A) as 1 the linear manifold L 0 (Grassmann space G1,3) slices the parameter space X = \ 3 ”, x3 is not displayed . Box 5.7 is a summary of MINOLESS of a general inconsistent system of linear equations y = Ax + i. Based on the notion of the rank r = rk A < min{n, m}, we designed the generalized inverse of MINOS type or A Am or A1,2,3,4 . Box 5.7 MINOLESS of a general inconsistent system of linear equations : f : x o y = Ax + i, x X = \ m (parameter space), y Y = \ n (observation space) r := rk A < min{n, m} A- generalized inverse of MINOS type: A1,2,3,4 or A Am Condition # 1 f(x)=f(g(y)) f = f D gD f
Condition # 1 Ax =AA-Ax AA-A=A
Condition # 2 g(y)=g(f(x)) g = gD f Dg
Condition # 2 A y=A-Ax=A-AA-y A-AA-=A-
Condition # 3 f(g(y))=yR(A)
Condition # 3 A-Ay= yR(A)
-
259
5-1 Introduction
f D g = PR ( A )
A A = PR ( A
Condition # 4 g(f(x))= y R ( A )
)
Condition # 4 AA = PR ( A )
A
g D f = PR (g) A A = PR ( A ) .
A
R(A-)
R(A ) D(A)
R(A) D(A-) D(A-) R(A) R(A)
-
A-
D(A) P R(A-)
A AAA = PR ( A ) f D g = PR ( f )
Figure 5.3 : Least squares, minimum norm generalized inverse A Am ( A1,2,3,4 or A + ) , the Moore-Penrose-inverse (Tseng inverse) A similar construction of the generalized inverse of a matrix applies to the diagrams of the mappings: (1) under the mapping A: D(A) o R(A) AA = PR ( A ) f D g = PR ( f )
(2) under the mapping A-: R (A ) o PR ( A
)
A A = PR ( A )
g D f = PR ( g ) .
In addition, we follow Figure 5.4 and 5.5 for the characteristic diagrams describing:
260
5 The third problem of algebraic regression
(i) orthogonal inverses and adjoints in reflexive dual Hilbert spaces Figure 5.4 A
X
Y
A =A
(A Gy A)
A G y A
Y
X +
A =A
( G y ) 1 = G y Gx G x = (G x ) 1
A G y A ( A G y A )
*
Gy
*
*
X
X
Y
A
A
Y
y 6 y = G y y y Y, y Y
(ii) Venn diagrams, trivial fiberings
Figure 5.5 : Venn diagram, trivial fibering of the domain D(f): Trivial fibers N ( f ) A , trivial fibering of the range R( f ): trivial fibers R ( f ) and R ( f ) A , f : \m = X o Y = \n , X set system of the parameter space, Y set system of the observation space In particularly, if Gy is rank defect we proceed as follows. Gy synthesis
ª / 0º G y = « y » ¬ 0 0¼ analysis
261
5-1 Introduction
ª Uc º G*y = UcG y U = « 1 » G y [U1 , U 2 ] ¬ Uc2 ¼ ª U1cG y U1 U1cG y U 2 º =« » ¬ Uc2G y U1c Uc2G y U 2 ¼
G y = UG *y U c = U1ȁ y U1c
ȁ y = U1cG y U1 U1ȁ y = G y U1 0 = G y Uc2 and U1cG y U 2 = 0 || y Ax ||G2 = || i ||2 = i cG y i º » G y = U1c ȁ y U1 »¼ (y Ax)cU1c ȁ y U1 (y Axc) = min y
x
U1 ( y Ax ) = U1i = i If we use simultaneous horizontal and vertical rank partitioning A12 º ª x1 º ª i1 º ª y º ªi º ªA y = « 1 » + « 1 » = « 11 » « » + «i » y i A A 22 ¼ ¬ x 2 ¼ ¬ 2 ¼ ¬ 2 ¼ ¬ 21 ¬ 2¼ subject to special dimension identities y1 \ r ×1 , y 2 \ ( n r )×1 A11 \ r × r , A12 \ r × ( m r ) A 21 \ ( n r )× r , A 22 \ ( n r )× ( m r ) , we arrive at Lemma 5.0. Lemma 5.0 ((Gx, Gy) –MINOLESS, simultaneous horizontal and vertical rank partitioning): ªy º ªi º ªA y = « 1 » + « 1 » = « 11 ¬ y 2 ¼ ¬i 2 ¼ ¬ A 21
A12 º ª x1 º ª i1 º + A 22 »¼ «¬ x 2 »¼ «¬ i 2 »¼
subject to the dimension identities y1 \ r ×1 , y 2 \ ( n r )×1 , x1 \ r ×1 , x 2 \ ( m r )×1 A11 \ r × r , A12 \ r ×( m r ) A 21 \ ( n r )× r , A 22 \ ( n r )×( m r ) is a simultaneous horizontal and vertical rank partitioning of the linear model (5.1)
(5.1)
262
5 The third problem of algebraic regression
{y = Ax + i, A \ n× m , r := rk A < min{n, m}}
(5.2)
r is the index of the linear operator A, n-r is the left complementary index and m-r is the right complementary index. x A is Gy-LESS if it fulfils the rank x Am is MINOS of A cG y Ax A = A cG y y , if x 1 c N111G11x 2G 21 (x1 )Am = N111 N12 [N12 N11 N12 + G x22 ]1 1 c N111G11x N111 2G x21 N11
(N12 )m1 + N111 m1 x c N111G11x N111 N12 2G 21 (x 2 )Am = [N12 N111 N12 + G x22 ]1 1 1 c N111G11x N11
(N12 2G x21 N11 )m1 .
(5.3)
(5.4)
The symmetric matrices (Gx, Gy) of the metric of the parameter space X as well as of the observation space Y are consequently partitioned as y ª G y G12 º G y = « 11 y y » ¬G 21 G 22 ¼
and
x x ª G11 º G12 = Gx « x x » ¬ G 21 G 22 ¼
(5.5)
subject to the dimension identities y y G11 \ r×r , G12 \ r×( n r ) y y G 21 \ ( n r )×r , G 22 \ ( n r )×( n r )
versus
x x G11 \ r×r , G12 \ r×( m r )
G x21 \ ( m r )×r , G x22 \ ( m r )×( m r )
deficient normal equations A cG y Ax A = A cG y y
(5.6)
or ª N11 «N ¬ 21
N12 º ª x1 º ª M11 = N 22 »¼ «¬ x 2 »¼ A «¬ M 21
M12 º ª y1 º ª m1 º = M 22 »¼ «¬ y 2 »¼ «¬ m2 »¼
(5.7)
subject to y y y c G11y A11 + A c21G 21 c G12 N11 := A11 A11 + A11 A 21 + A c21G 22 A 21
(5.8)
y y y c G11y A12 + A c21G 21 c G12 N12 := A11 A12 + A11 A 22 + A c21G 22 A 22 ,
(5.9)
c , N 21 = N12
(5.10)
y y y y c G11 c G12 N 22 := A12 A12 + A c22 G 21 A12 + A12 A 22 + Ac22 G 22 A 22 ,
(5.11)
y y y y c G11 c G12 M11 := A11 + A c21G 21 , M12 := A11 + A c21G 22 ,
(5.12)
263
5-2 MINOLESS and related solutions y c G11y + Ac22 G y21 , M 22 := A12 c G12 M 21 := A12 + A c22 G y22 ,
(5.13)
m1 := M11y1 + M12 y 2 , m2 := M 21y1 + M 22 y 2 .
(5.14)
5-2 MINOLESS and related solutions like weighted minimum norm-weighted least squares solutions 5-21
The minimum norm-least squares solution: "MINOLESS"
The system of the inconsistent, rank deficient linear equations Ax + i = y subject to A \ n× m , rk A < min{n, m} allows certain solutions which we introduce by means of Definition 5.1 as a solution of a certain hierarchical optimization problem. Lemma 5.2 contains the normal equations of the hierarchical optimization problems. The solution of such a system of the normal equations is presented in Lemma 5.3 for the special case (i) | G x |z 0 and case (ii) | G x + A cG y A |z 0, but | G x |= 0 . For the analyst: Lemma 5.4
Lemma 5.5
presents the toolbox of MINOLESS for multiplicative rank partitioning, known as rank factorization.
presents the toolbox of MINOLESS for additive rank partitioning.
and
Definition 5.1 ( G x -minimum norm- G y -least squares solution): A vector x Am X = \ m is called G x , G y -MINOLESS (MInimum NOrm with respect to the G x -seminorm-Least Squares Solution with respect to the G y -seminorm) of the inconsistent system of linear equations with datum defect ª rk A d min {n, m} A\ « Ax + i = y « y R ( A), N ( A) z {0} (5.15) «x X = \n , y Y = \n , «¬ if and only if first (5.16) x A = arg{|| i ||G = min | Ax + i = y, rk A d min{n, m}} , n× m
y
x
second x Am = arg{|| x ||G = min | A cG y Ax A = AcG y y} x
x
(5.17)
is G y -MINOS of the system of normal equations A cG y Ax A = A cG y y which are G x -LESS. The solutions of type G x , G y -MINOLESS can be characterized as following.
264
5 The third problem of algebraic regression
Lemma 5.2 ( G x -minimum norm, G y least squares solution): A vector x Am X = \ m is called G x , G y -MINOLESS of (5.1), if and only if the system of normal equations A cG y A º ª x Am º ª 0 º ª Gx = « A cG A 0 »¼ «¬ OAm »¼ «¬ A cG y y »¼ y ¬
(5.18)
with respect to the vector OAm of “Lagrange multipliers” is fulfilled. x Am always exists and is uniquely determined, if the augmented matrix [G x , A cG y A ] agrees to the rank identity rk[G x , A cG y A ] = m
(5.19)
or, equivalently, if the matrix G x + A cG y A is regular. :Proof: G y -MINOS of the system of normal equations A cG y Ax A = A cG y is constructed by means of the constrained Lagrangean L( x A , OA ) := xcA G x x A + 2OAc( A cG y Ax A A cG y y ) = min , x ,O
such that the first derivatives 1 wL º (x Am , OAm ) = G x x Am + A cG y AOAm = 0 » 2 wx » 1 wL (x Am , OAm ) = AcG y AxAm AcG y y = 0 » »¼ 2 wO A cG y A º ª x Am º ª 0 º ª Gx « = 0 »¼ «¬ OAm »¼ «¬ AcG y y »¼ ¬ A cG y A constitute the necessary conditions. The second derivatives 1 w 2L ( x Am , OAm ) = G x t 0 2 wxwx c
(5.20)
due to the positive semidefiniteness of the matrix G x generate the sufficiency condition for obtaining the minimum of the constrained Lagrangean. Due to the assumption A cG y y R( A cG x A ) the existence of G y -MINOS x Am is granted. In order to prove uniqueness of G y -MINOS x Am we have to consider case (i)
and
G x positive definite
case (ii) G x positive semidefinite .
case (i): G x positive definite
5-2 MINOLESS and related solutions
265
Gx A cG y A = G x A cG y AG x1A cG y A = 0. 0 A cG y A
(5.21)
First, we solve the system of normal equations which characterize x Am G x , G y MINOLESS of x for the case of a full rank matrix of the metric G x of the parametric space X, rk G x = m in particular. The system of normal equations is solved for
A cG y A º ª 0 º ª C1 C2 º ª 0 º ª x Am º ª G x = « O » = « A cG A 0 »¼ «¬ A cG y y »¼ «¬ C3 C4 »¼ «¬ A cG y y »¼ y ¬ Am ¼ ¬
(5.22)
subject to A cG y A º ª C1 C2 º ª G x A cG y A º ª G x A cG y A º ª Gx =« « A cG A » « » « » 0 ¼ ¬ C3 C4 ¼ ¬ A cG y A 0 ¼ ¬ A cG y A 0 »¼ y ¬ (5.23) as a postulate for the g-inverse of the partitioned matrix. Cayley multiplication of the three partitioned matrices leads us to four matrix identities. G x C1G x + G x C2 A cG y A + A cG y AC3G x + A cG y AC4 A cG y A = G x
(5.24)
G x C1A cG y A + A cG y AC3 A cG y A = A cG y A
(5.25)
A cG y AC1G x + A cG y AC2 A cG y A = AcG y A
(5.26)
A cG y AC1 A cG y A = 0.
(5.27)
Multiply the third identity by G x1A cG y A from the right side and substitute the fourth identity in order to solve for C2. A cG y AC2 A cG y AG x1A cG y A = A cG y AG x1A cG y A (5.28) C2 = G x1A cG y A ( A cG y AG x1A cG y A ) solves the fifth equation A cG y AG x1A cG y A ( A cG y AG x1A cG y A ) A cG y AG x1A cG y A = = A cG y AG x1A cG y A
(5.29)
by the axiom of a generalized inverse x Am = C2 A cG y y
(5.30)
x Am = G y1A cG y A ( A cG y AG x1A cG y A ) A cG y y . We leave the proof for “ G x1A cG y A ( A cG y AG x1A cG y A ) A cG y y is the weighted pseudo-inverse or Moore Penrose inverse A G+ G ” as an exercise. y
x
(5.31)
266
5 The third problem of algebraic regression
case (ii): G x positive semidefinite Second, we relax the condition rk G x = m by the alternative rk[G x , A cG y A ] = m G x positive semidefinite. Add the second normal equation to the first one in order to receive the modified system of normal equations ªG x + A cG y A A cG y A º ª x Am º ª A cG y y º = « A cG A 0 »¼ «¬ OAm »¼ «¬ A cG y y »¼ y ¬
(5.32)
rk(G x + A cG y A ) = rk[G x , A cG y A ] = m .
(5.33)
The condition rk[G x , A cG y A ] = m follows from the identity ªG G x + A cG y A = [G x , A cG y A ] « x ¬ 0
º ª Gx º »« », ( A cG y A ) ¼ ¬ A cG y A ¼ 0
(5.34)
namely G x + AcG y A z 0. The modified system of normal equations is solved for
ª x Am º ªG x + A cG y A AcG y A º ª A cG y y º = « O » = « A cG A 0 »¼ «¬ A cG y y »¼ y ¬ Am ¼ ¬ ª C C2 º ª A cG y y º ª C1A cG y y + C2 A cG y y º =« 1 »=« » »« ¬C3 C4 ¼ ¬ A cG y y ¼ ¬ C3A cG y y + C4 A cG y y ¼
(5.35)
subject to ªG x + A cG y A A cG y A º ª C1 C2 º ªG x + A cG y A A cG y A º = « A cG A 0 »¼ «¬C3 C4 »¼ «¬ A cG y A 0 »¼ y ¬ ªG x + A cG y A A cG y A º =« 0 »¼ ¬ A cG y A
(5.36)
as a postulate for the g-inverse of the partitioned matrix. Cayley multiplication of the three partitioned matrices leads us to the four matrix identities “element (1,1)” (G x + A cG y A)C1 (G x + A cG y A) + A cG y AC3 (G x + A cG y A) + +(G x + A cG y A)C2 A cG y A + A cG y AC4 A cG y A = G x + A cG y A
(5.37)
“element (1,2)” (G x + A cG y A)C1 A cG y A + A cG y AC3 = A cG y A
(5.38)
5-2 MINOLESS and related solutions
267
“element (2,1)” A cG y AC1 (G x + AcG y A) + AcG y AC2 AcG y A = AcG y A
(5.39)
“element (2,2)” A cG y AC1 A cG y A = 0.
(5.40)
First, we realize that the right sides of the matrix identities are symmetric matrices. Accordingly the left sides have to constitute symmetric matrices, too. (1,1):
(G x + A cG y A)C1c (G x + A cG y A) + (G x + A cG y A)Cc3 A cG y A + + A cG y ACc2 (G x + A cG y A) + A cG y ACc4 AcG y A = G x + A cG y A
(1,2): A cG y AC1c (G x + AcG y A) + Cc3 A cG y A = A cG y A (2,1): (G x + A cG y A )C1cA cG y A + A cG y ACc2 A cG y A = A cG y A (2,2): A cG y AC1cA cG y A = A cG y AC1A cG y A = 0 . We conclude C1 = C1c , C2 = Cc3 , C3 = Cc2 , C4 = Cc4 .
(5.41)
Second, we are going to solve for C1, C2, C3= C2 and C4. C1 = (G x + A cG y A) 1{I m A cG y A[ AcG y A(G x + A cG y A) 1 A cG y A]
A cG y A(G x + A cG y A ) 1}
(5.42)
C2 = (G x + A cG y A) 1 A cG y A[ A cG y A(G x + A cG y A) 1 A cG y A]
(5.43)
C3 = [ A cG y A(G x + A cG y A) 1 A cG y A] AcG y A(G x + A cG y A) 1
(5.44)
C4 = [ A cG y A (G x + A cG y A ) 1 A cG y A ] .
(5.45)
For the proof, we depart from (1,2) to be multiplied by A cG y A(G x + A cG y A) 1 from the left and implement (2,2) A cG y AC2 AcG y A(G x + A cG y A ) 1 A cG y A = A cG y A (G x + A cG y A ) 1 A cG y A . Obviously, C2 solves the fifth equation on the basis of the g-inverse [ A cG y A(G x + A cG y A) 1 A cG y A] or A cG y A(G x + A cG y A) 1 A cG y A[ A cG y A(G x + A cG y A) 1 A cG y A]
A cG y A(G x + A cG y A ) 1 A cG y A = A cG y A(G x + A cG y A) 1 A cG y A .
(5.46)
268
5 The third problem of algebraic regression
We leave the proof for “ (G x + A cG y A) 1 AcG y A[ A cG y A(G x + A cG y A ) 1 A cG y A] A cG y is the weighted pseudo-inverse or Moore-Penrose inverse A G+
y
( G x + AcG y A )
”
as an exercise. Similarly, C1 = (I m C2 A cG y A)(G x + A cG y A) 1
(5.47)
solves (2,2) where we again take advantage of the axiom of the g-inverse, namely A cG y AC1 AcG y A = 0
(5.48)
A cG y A(G x + A cG y A) 1 A cG y A(G x + A cG y A) 1 A cG y A A cG y A(G x + A cG y A) 1 A cG y A[ A cG y A(G x + A cG y A) 1 A cG y A]
A cG y A(G x + A cG y A) 1 A cG y A = 0 A cG y A(G x + A cG y A ) 1 A cG y A A cG y A(G x + A cG y A ) 1 A cG y A( A cG y A(G x + A cG y A ) 1 A cG y A)
A cG y A(G x + A cG y A ) 1 A cG y A = 0. For solving the system of modified normal equations, we have to compute C1 A cG y = 0 C1 = A cG y A = 0 A cG y AC1 AcG y A = 0 , a zone identity due to (2,2). In consequence, x Am = C2 A cG y y
(5.49)
has been proven. The element (1,1) holds the key to solve for C4 . As soon as we substitute C1 , C2 = Cc3 , C3 = Cc2 into (1,1) and multiply and
left by A cG y A(G x + AcG y A) 1
right by (G x + AcG y A) 1 AcG y A,
we receive 2AcGy A(Gx + AcGy A)1 AcGy A[AcGy A(Gx + AcGy A)1 AcGy A] AcGy A
(Gx + AcGy A)1 + AcGy A(Gx + AcGy A)1 AcGy AC4 AcGy A(Gx + AcGy A)1 AcGy A = = AcGy A(Gx + AcGy A)1 AcGy A. Finally, substitute C4 = [ A cG y A(G x + A cG y A) 1 A cG y A]
(5.50)
5-2 MINOLESS and related solutions
269
to conclude A cG y A(G x + A cG y A) 1 A cG y A[ A cG y A(G x + A cG y A) 1 A cG y A] A cG y A
(G x + A cG y A) 1 AcG y A = AcG y A(G x + AcG y A) 1 AcG y A , namely the axiom of the g-inverse. Obviously, C4 is a symmetric matrix such that C4 = Cc4 . Here ends my elaborate proof. The results of the constructive proof of Lemma 5.2 are collected in Lemma 5.3. Lemma
( G x -minimum
5.3
norm, G y -least
squares
solution:
MINOLESS): ˆ is G -minimum norm, G -least squares solution of (5.1) x Am = Ly y x subject to r := rk A = rk( A cG y A ) < min{n, m} rk(G x + A cG y A ) = m if and only if Lˆ = A G+
y Gx
= ( A Am ) G
y Gx
Lˆ = (G x + A cG y A ) 1 A cG y A[ A cG y A(G x + A cG y A ) 1 A cG y A ] AcG y
(5.51) (5.52)
xAm = (G x + AcG y A)1 AcG y A[ AcG y A(G x + AcG y A)1 AcG y A] AcG y y , (5.53) where A G+ G = A1,2,3,4 G G is the G y , G x -weighted Moore-Penrose inverse. If y
x
y
x
rk G x = m , then Lˆ = G x1A cG y A( A cG y AG x1A cG y A ) A cG y
(5.54)
x Am = G x1A cG y A ( A cG y AG x1A cG y A ) A cG y y
(5.55)
is an alternative unique solution of type MINOLLES. Perhaps the lengthy formulae which represent G y , G x - MINOLLES in terms of a g-inverse motivate to implement explicit representations for the analyst of the G x -minimum norm (seminorm), G y -least squares solution, if multiplication rank partitioning, also known as rank factorization, or additive rank partitioning of the first order design matrix A is available. Here, we highlight both representations of A + = A Am .
270
5 The third problem of algebraic regression
Lemma 5.4 ( G x -minimum norm, G y -least squares solution: MINOLESS, rank factorization) ˆ is G -minimum norm, G -least squares solution (MINOLLES) x Am = Ly y x of (5.1) {Ax + i = y | A \ n×m , r := rk A = rk( A cG y A ) < min{n, m}} , if it is represented by multiplicative rank partitioning or rank factorization A = DE, D \ n r , E \ r ×m as case (i): G y = I n , G x = I m
(5.56)
Lˆ = A Am = Ec( EEc) 1 ( DcD) 1 Dc
(5.57)
1 right inverse ˆ = E D ª«ER = Em = Ec(EEc) L R L 1 ¬ DL = DA = (DcD) Dc left inverse
x Am = A Am y = A + y = Ec(EEc) 1 (DcD) 1 Dcy .
(5.58) (5.59)
The unknown vector x Am has the minimum Euclidean length || x Am ||2 = xcAm x Am = y c( A + )cA + y = y c(DA )c(EEc) 1 DA y .
(5.60)
(5.61) y = y Am + i Am is an orthogonal decomposition of the observation vector y Y = \ n into Ax Am = y Am R ( A) and y AxAm = i Am R ( A) A ,
(5.62)
the vector of inconsistency. y Am = Ax Am = AA + y = = D( DcD) 1 Dcy = DDA y
and
i Am = y y Am = ( I n AA + ) y = = [I n D( DcD) 1 Dc]y = ( I n DD A ) y
AA + y = D( DcD) 1 Dcy = DDA y = y Am is the projection PR ( A ) and ( I n AA + ) y = [I n D( DcD) 1 Dc]y = ( I n DD A ) y is the projection PR ( A ) . A
i Am and y Am are orthogonal in the sense of ¢ i Am | y Am ² = 0 or ( I n AA + )cA = [I n D( DcD) 1 Dc]cD = 0 . The “goodness of fit” of MINOLESS is || y Ax Am ||2 =|| i Am ||2 = y c(I n AA + )y = = y c[I n D(DcD) 1 Dc]y = y c(I n DDA1 )y .
(5.63)
5-2 MINOLESS and related solutions
271
case (ii): G x and G y positive definite Lˆ = ( A m A ) (weighted) = G x Ec( EG x1Ec) 1 ( DcG y D) 1 DcG y
(5.64)
ª E = E weighted right inverse Lˆ = ER (weighted) D L (weighted) « m ¬ E L = EA weighted left inverse
(5.65)
R
x Am = ( A Am )G G y d A G+ G y = y
x
y
x
(5.66)
= G Ec(EG Ec) ( DcG y D) 1 DG y y. 1 x
1 x
1
The unknown vector x Am has the weighted minimum Euclidean length || x Am ||G2 = xcAm G x x Am = y c( A + )cG x A + y = x
= y cG y D(DcG y D) 1 (EG x1Ec) 1 EEc(EG x1Ec) 1 (DcG y D) 1 DcG y y c. y = y Am + i Am
(5.67) (5.68)
is an orthogonal decomposition of the observation vector y Y = \ n into Ax Am = y Am R ( A) and y AxAm =: i Am R ( A) A
(5.69)
of inconsistency. y Am = AA G+ G y y
AA G+
yGx
x
i A = ( I n AA G+ G ) y
and
y
I n AA G+
= PR ( A )
yGx
(5.70)
x
= PR ( A )
A
are G y -orthogonal ¢ i Am | y Am ² G = 0 or (I n AA + ( weighted ))cG y A = 0 . y
(5.71)
The “goodness of fit” of G x , G y -MINOLESS is || y Ax Am ||G2 =|| i Am ||G2 = y
= y c[I n AA
+ Gy Gx
y
]cG y [I n AA G+ G ]y = y
x
= y c[I n D(DcG y D) DcG y ]cG y [I n D(DcG y D) 1 DcG y ]y = 1
(5.72)
= y c[G y G y D(DcG y D) 1 DcG y ]y.
While Lemma 5.4 took advantage of rank factorization, Lemma 5.5 will alternatively focus on additive rank partitioning.
272
5 The third problem of algebraic regression
5.5 ( G x -minimum norm, G y -least MINOLESS, additive rank partitioning)
Lemma
ˆ is x Am = Ly G x -minimum (MINOLESS) of (5.1)
norm, G y -least
squares
solution:
squares
solution
{Ax + i = y | A \ n×m , r := rk A = rk( A cG y A ) < min{n, m}} , if it is represented by additive rank partitioning ªA A = « 11 ¬ A 21
A12 º A11 \ r × r , A12 \ r ×( m r ) , A 22 »¼ A 21 \ ( n r )× r , A 22 \ ( n r )×( m r )
(5.73)
subject to the rank identity rk A = rk A11 = r as
(5.74)
case (i): G y = I n , G x = I m ª Nc º c + N11 N11 c ) 1 [ A11 c , Ac21 ] Lˆ = A Am = « 11 » (N12 N12 c N ¬ 12 ¼
(5.75)
subject to c A11 + Ac21A 21 , N12 := A11 c A12 + Ac21A 22 N11 := A11 c c c A12 + A c22 A 22 N 21 := A12 A11 + A 22 A 21 , N 22 := A12 or ª Nc º ªy º c + N11 N11 c ) 1 [ A11 c , Ac21 ] « 1 » . x Am = « 11 » (N12 N12 c ¼ ¬ N12 ¬y 2 ¼
(5.76) (5.77)
(5.78)
The unknown vector xAm has the minimum Euclidean length || x Am ||2 = x cAm x Am = ªA º ªy º c + N11N11 c ) 1[ A11 c , A c21 ] « 1 » . = [ y1c , y c2 ] « 11 » ( N12 N12 ¬ A 21 ¼ ¬y2 ¼
(5.79)
y = y Am + i Am is an orthogonal decomposition of the observation vector y Y = \ n into Ax Am = y Am R ( A) and y AxAm =: i Am R ( A) A ,
(5.80)
5-2 MINOLESS and related solutions
273
the vector of inconsistency. y Am = Ax Am = AA Am y
i Am = y Ax Am =
and
= ( I n AA Am ) y are projections onto R(A) and R ( A) A , respectively. i Am and y Am are orthogonal in the sense of ¢ i Am | y Am ² = 0 or (I n AA Am )cA = 0 . The “goodness of fit” of MINOLESS is || y Ax Am ||2 =|| i Am ||2 = y c(I n AA Am )y . I n AA Am , rk( I n AA Am ) = n rk A = n r , is the rank deficient a posteriori weight matrix (G y )Am . case (ii): G x and G y positive definite )G Lˆ = ( A Am
5-22
yGx
.
(G x , G y ) -MINOS and its generalized inverse
A more formal version of the generalized inverse which is characteristic for G x MINOS, G y -LESS or (G x , G y ) -MINOS is presented by Lemma 5.6 (characterization of G x , G y -MINOS): (5.81)
rk( A cG y A) = rk A ~ R ( A cG y ) = R ( A c)
(5.82)
is assumed. x Am = L y is (G x , G y ) -MINOLESS of (5.1) for all y \ n if and only if the matrix L \ m ×n fulfils the four conditions G y ALA = G y A
(5.83)
G x LAL = G x L
(5.84)
G y AL = (G y AL)c
(5.85)
G x LA = (G x LA )c .
(5.86)
In this case G x x Am = G x L y is always unique. L, fulfilling the four conditions, is called the weighted MINOS inverse or weighted Moore-Penrose inverse. :Proof: The equivalence of (5.81) and (5.82) follows from R( A cG y ) = R( A cG y A ) .
274
5 The third problem of algebraic regression
(i) G y ALA = G y A and G y AL = (G y AL)c . Condition (i) G y ALA = G y A and (iii) G y AL = (G y AL)c are a consequence of G y -LESS. || i ||G2 =|| y Ax ||G2 = min AcG y AxA = AcG y y. y
y
x
If G x is positive definite, we can represent the four conditions (i)-(iv) of L by (G x , G y ) -MINOS inverse of A by two alternative solutions L1 and L2, namely AL1 = A( A cG y A ) A cG y AL1 = A( A cG y A ) A cL1cG y = = A ( A cG y A ) A cG y = = A ( A cG y A ) A cLc2 A cG y = A( A cG y A ) A cG y AL 2 = AL 2 and L 2 A = G x1 ( A cLc2 G x ) = G x1 ( A cLc2 A cLc2 G x ) = G x1 ( A cL1cA cLc2 G x ) = = G x1 ( A cL1cG x L 2 A ) = G x1 (G x L1AL 2 A ) = = G x1 (G x L1AL1 A ) = L1 A, L1 = G x1 (G x L1 AL1 ) = G x1 (G x L2 AL2 ) = L2 concludes our proof. The inequality || x Am ||G2 =|| L y ||G2 d|| L y ||G2 +2 y cLcG x ( I n LA) z + x
x
y
+ || ( I m LA ) z ||G2 y \ n
(5.87)
x
is fulfilled if and only if the “condition of G x -orthogonality” LcG x ( I m LA ) = 0
(5.88)
applies. An equivalence is LcG x = LcG x LA or LcG x L = LcG x LAL , which is produced by left multiplying with L. The left side of this equation is a symmetric matrix. Consequently, the right side has to be a symmetric matrix, too. G x LA = (G x LA )c . Such an identity agrees to condition (iv). As soon as we substitute in the “condition of G x -orthogonality” we are led to LcG x = LcG x LA G x L = (G x LA )cL = G x LAL , a result which agrees to condition (ii). ? How to prove uniqueness of A1,2,3,4 = A Am = A + ?
5-2 MINOLESS and related solutions
275
Uniqueness of G x x Am can be taken from Lemma 1.4 (characterization of G x MINOS). Substitute x A = Ly and multiply the left side by L. A cG y ALy = A cG y y AcG y AL = AcG y LcA cG y AL = LcA cG y G y AL = (G y AL)c = LcA cG y . The left side of the equation LcA cG y AL = LcA cG y is a symmetric matrix. Consequently the right side has to be symmetric, too. Indeed we have proven condition (iii) (G y AL)c = G y AL . Let us transplant the symmetric condition (iii) into the original normal equations in order to benefit from A cG y AL = A cG y or G y A = LcA cG y A = (G y AL)cA = G y ALA . Indeed, we have succeeded to have proven condition (i), in condition (ii) G x LAL = G x L and G x LA = (G x LA)c. Condition (ii) G y LAL = G x L and (iv) G x LA = (G x LA )c are a consequence of G x -MINOS. The general solution of the normal equations A cG y Ax A = A cG y y is x A = x Am + [I m ( A cG y A ) ( A cG y A )]z for an arbitrary vector z \ m . A cG y ALA = A cG y A implies x A = x Am + [I m ( A cG y A ) A cG y ALA ]z = = x Am + [I m LA ]z. Note 1: The following conditions are equivalent:
(1st)
ª (1) AA A = A « (2) A AA = A « « (3) ( AA )cG y = G y AA « «¬ (4) ( A A )cG x = G x A A
ª A #G y AA = A cG y « ¬ ( A )cG x A A = ( A )cG x “if G x and G y are positive definite matrices, then (2nd)
A #G y = G x A # or A # = G x1A cG y
(5.89)
276
5 The third problem of algebraic regression
are representations for the adjoint matrix” “if G x and G y are positive definite matrices, then ( A cG y A ) AA = A cG y ( A )cG x A A = ( A )cG x ” ª AA = PR ( A ) « «¬ A A = PR ( A ) .
(3rd)
The concept of a generalized inverse of an arbitrary matrix is originally due to E.H. Moore (1920) who used the 3rd definition. R. Penrose (1955), unaware of E.H. Moore´s work, defined a generalized inverse by the 1st definition to G x = I m , G y = I n of unit matrices which is the same as the Moore inverse. Y. Tseng (1949, a, b, 1956) defined a generalized inverse of a linear operator between function spaces by means of AA = PR ( A ) , A A = P
R ( A )
,
where R( A ) , R( A ) , respectively are the closure of R ( A ) , R( A ) , respectively. The Tseng inverse has been reviewed by B. Schaffrin, E. Heidenreich and E. Grafarend (1977). A. Bjerhammar (1951, 1957, 1956) initiated the notion of the least-squares generalized inverse. C.R. Rao (1967) presented the first classification of g-inverses. Note 2: Let || y ||G = ( y cG y y )1 2 and || x ||G = ( x cG x x )1 2 , where G y and G x are positive semidefinite. If there exists a matrix A which satisfies the definitions of Note 1, then it is necessary, but not sufficient that y
x
(1) G y AA A = G y A (2) G x A AA = G x A (3) ( AA )cG y = G y AA (4) ( A A )cG x = G x A A . Note 3: A g-inverse which satisfies the conditions of Note 1 is denoted by A G+ G and referred to as G y , G x -MINOLESS g-inverse of A. y
A G+ G is unique if G x is positive definite. When both G x and G y are general positive semi definite matrices, A G+ G may not be unique . If | G x + A cG y A |z 0 holds, A G+ G is unique. y
x
y
y
x
x
x
5-2 MINOLESS and related solutions
277
Note 4: If the matrices of the metric are positive definite, G x z 0, G y z 0 , then (i)
( A G+ G )G+ G = A , y
x
x
y
(ii) ( A G+ G ) # = ( A c)G+ y
5-23
x
1 1 x Gy
.
Eigenvalue decomposition of (G x , G y ) -MINOLESS
For the system analysis of an inverse problem the eigenspace analysis and eigenspace synthesis of x Am (G x , G y ) -MINOLESS of x is very useful and give some peculiar insight into a dynamical system. Accordingly we are confronted with the problem to develop “canonical MINOLESS”, also called the eigenvalue decomposition of (G x , G y ) -MINOLESS. First we refer again to the canonical representation of the parameter space X as well as the observation space Y introduced to you in the first chapter, Box 1.6 and Box 1.9. But we add here by means of Box 5.8 the forward and backward transformation of the general bases versus the orthogonal bases spanning the parameter space X as well as the observation space Y. In addition, we refer to Definition 1.5 and Lemma 1.6 where the adjoint operator A has been introduced and represented. Box 5.8 General bases versus orthogonal bases spanning the parameter space X as well as the observation space Y.
(5.90)
“left”
“right”
“parameter space”
“observation space”
“general left base”
“general right base”
span {a1 ,… , am } = X
Y=span {b1 ,… , bn }
:matrix of the metric:
:matrix of the metric:
aac = G x
bbc = G y
“orthogonal left base”
(5.92)
(5.94)
“orthogonal right base”
span {e ,… , e } = X
Y=span {e1y ,… , e ny }
:matrix of the metric:
:matrix of the metric:
e x ecx = I m
e y ecy = I n
“base transformation”
“base transformation”
x 1
x m
a = ȁ1x 2 9e x
(5.91)
b = ȁ1y 2 Ue y
(5.93)
(5.95)
278
5 The third problem of algebraic regression
versus
versus e y = Ucȁ y1 2 b
e x = V cȁ x1 2 a
(5.96)
span {e1x ,… , e xm } = X
(5.97)
Y=span {e1y ,… , e ny } .
Second, we are solving the general system of linear equations {y = Ax | A \ n ×m , rk A < min{n, m}} by introducing
•
the eigenspace of the rank deficient, rectangular matrix of rank r := rk A < min{n, m}: A 6 A
•
the left and right canonical coordinates: x 6 x , y 6 y
as supported by Box 5.9. The transformations x 6 x (5.97), y 6 y (5.98) from the original coordinates ( x1 ,… , x m ) to the canonical coordinates ( x1 ,… , x m ) , the left star coordinates, as well as from the original coordinates ( y1 ,… , y n ) to the canonical coordinates ( y1 ,… , y n ), the right star coordinates, are polar decompositions: a rotation {U, V}is followed by a general stretch {G1y 2 , G1x 2 } . Those root matrices are generated by product decompositions of type G y = (G1y 2 )cG1y 2 as well as G x = (G1x 2 )cG1x 2 . Let us substitute the inverse transformations (5.99) x 6 x = G x1 2 Vx and (5.100) y 6 y = G y1 2 Uy into the system of linear equations (5.1), (5.101) y = Ax + i, rk A < min{n, m} or its dual (5.102) y = A x + i . Such an operation leads us to (5.103) y = f( x ) as well as (5.104) y = f (x). Subject to the orthonormality condition (5.105) UcU = I n and (5.106) V cV = I m we have generated the left-right eigenspace analysis (5.107) ªȁ ȁ = « ¬ O2
O1 º O3 »¼
subject to the rank partitioning of the matrices U = [U1 , U 2 ] and V = [ V1 , V2 ] . Alternatively, the left-right eigenspace synthesis (5.118) ªȁ A = G y1 2 [U1 , U 2 ] « ¬O2
O1 º ª V1c º 1 2 G O3 »¼ «¬ V2c »¼ x
is based upon the left matrix (5.109) L := G y1 2 U decomposed into (5.111) L1 := G y1 2 U1 and L 2 := G y1 2 U 2 and the right matrix (5.100) R := G x1 2 V decomposed into R1 := G x1 2 V1 and R 2 := G x1 2 V2 . Indeed the left matrix L by means of (5.113) LLc = G y1 reconstructs the inverse matrix of the metric of the observation space Y. Similarly, the right matrix R by means of (5.114) RR c = G x1 generates
5-2 MINOLESS and related solutions
279
the inverse matrix of the metric of the parameter space X. In terms of “L, R” we have summarized the eigenvalue decompositions (5.117)-(5.122). Such an eigenvalue decomposition helps us to canonically invert y = A x + i by means of (5.123), namely the “full rank partitioning” of the system of canonical linear equations y = A x + i . The observation vector y \ n is decomposed into y1 \ r ×1 and y 2 \ ( n r )×1 while the vector x \ m of unknown parameters is decomposed into x1 \ r ×1 and x 2 \ ( m r )×1 . (x1 ) Am = ȁ 1 y1 is canonical MINOLESS leaving y 2 ”unrecognized” and x 2 = 0 as a “fixed datum”. Box 5.9: Canonical representation, the general case: overdetermined and unterdetermined system without full rank “parameter space X”
versus
x = V cG1x 2 x and
(5.98)
x = G x1 2 Vx
(5.100)
“observation space” y = UcG1y 2 y and
(5.99)
y = G y1 2 Uy
(5.101)
“overdetermined and unterdetermined system without full rank” {y = Ax + i | A \ n× m , rk A < min{n, m}} y = Ax + i
(5.102)
versus
+ UG1y 2 i
+ G y1 2 Ui versus
y = (G y1 2 UA V cG x1 2 )x + i (5.105)
subject to (5.106)
UcU = UUc = I n
(5.103)
UcG1y 2 y = A V cG x1 2 x +
G y1 2 Uy = AG x1 2 x +
(5.104) y = ( UcG1y 2 AG x1 2 V )x
y = A x + i
subject to versus
V cV = VV c = I m
(5.107)
“left and right eigenspace” “left-right eigenspace “left-right eigenspace analysis” synthesis” ª Uc º A = « 1 » G1y 2 AG x1 2 [ V1 , V2 ] = ¬ Uc2 ¼ (5.108) ªȁ G y1 2 [U1 , U 2 ] « ª ȁ O1 º =« ¬O2 » ¬ O2 O3 ¼
A= O1 º ª V1c º 1 2 (5.109) Gx O3 »¼ «¬ V2c »¼
280
5 The third problem of algebraic regression
“dimension identities” ȁ \ r × r , O1 \ r ×( m r ) , U1 \ n × r , V1 \ m × r O2 \ ( n r )× r , O3 \ ( n r )×( m r ) , U 2 \ n ×( n r ) , V2 \ m ×( m r ) “left eigenspace”
“right eigenspace”
(5.110) L := G y1 2 U L1 = UcG1y 2
R := G x1 2 V R 1 = V cG1x 2
(5.111)
(5.112) L1 := G y1 2 U1 , L 2 := G y1 2 U 2
R1 := G x1 2 V1 , R 2 := G x1 2 V2
(5.113)
(5.114) LLc = G y1 ( L1 )cL1 = G y
RR c = G x1 (R 1 )cR 1 = G x
(5.115)
ª L º ª Uc º (5.116) L1 = « 1 » G1y 2 =: « 1 » ¬ Uc2 ¼ ¬L2 ¼
ªR º ª Vcº R 1 = « 1 » G1x 2 =: « 1 » ¬ V2c ¼ ¬R 2 ¼
(5.117)
(5.118)
A = LA R 1
A = L1AR
versus
1 2
(5.120)
ªR º A = [L1 , L 2 ]A « » ¬R ¼
versus
(5.122)
AA # L1 = L1ȁ 2 º » AA # L 2 = 0 ¼
versus
ªȁ A = « ¬O2
O1 º = O3 »¼
ª L º = « 1 » A[R1 , R 2 ] ¬L2 ¼ ª A # AR1 = R1ȁ 2 « # ¬ A AR 2 = 0
(5.119)
(5.121)
(5.123)
“inconsistent system of linear equations without full rank” (5.124)
ªȁ y = A x + i = « ¬ O2
O1 º ª x1 º ª i1 º ª y1 º « »+« » = « » O3 »¼ ¬ x 2 ¼ ¬ i 2 ¼ ¬ y 2 ¼
y1 \ r ×1 , y *2 \ ( n r )×1 , i1* \ r ×1 , i*2 \ ( n r )×1 x1* \ r ×1 , x*2 \ ( m r )×1 “if ( x* , i* ) is MINOLESS, then x*2 = 0, i* = 0 : (x1* )Am = ȁ 1 y1* . ” Consult the commutative diagram of Figure 5.6 for a shortened summary of the newly introduced transformation of coordinates, both of the parameter space X as well as the observation space Y.
5-2 MINOLESS and related solutions
281 A
X x
R ( A) Y
V cG1x 2
UcG1y 2
y R ( A ) Y
X x
Figure 5.6 : Commutative diagram of coordinate transformations Third, we prepare ourselves for MINOLESS of the general system of linear equations {y = Ax + i | A \ n × m , rk A < min{n, m} , || i ||G2 = min subject to || x ||G2 = min} y
x
by introducing Lemma 5.4-5.5, namely the eigenvalue-eigencolumn equations of the matrices A#A and AA#, respectively, as well as Lemma 5.6, our basic result of “canonical MINOLESS”, subsequently completed by proofs. Throughout we refer to the adjoint operator which has been introduced by Definition 1.5 and Lemma 1.6. Lemma 5.7 (eigenspace analysis versus eigenspace synthesis A \ n × m , r := rk A < min{n, m} ) The pair of matrices {L, R} for the eigenspace analysis and the eigenspace synthesis of the rectangular matrix A \ n ×m of rank r := rk A < min{n, m} , namely A = L1 AR or ªȁ A = « ¬O2
O1 º = O3 »¼
ª L º = « 1 » A[R1 , R 2 ] ¬L2 ¼
versus
A = LA R 1 or A=
versus
ªR º = [L1 , L 2 ]A* « 1 » ¬R 2 ¼
are determined by the eigenvalue-eigencolumn equations (eigenspace equations) A # AR1 = R1ȁ 2 A # AR 2 = 0
versus
AA # L1 = L1ȁ 2 AA # L 2 = 0
282
5 The third problem of algebraic regression
subject to ªO12 " 0 º « » 2 2 « # % # » , ȁ = Diag(+ O1 ,… , + Or ) . « 0 " Or2 » ¬ ¼ 5-24
Notes
The algebra of eigensystems is treated in varying degrees by most books on linear algebra, in particular tensor algebra. Special mention should be made of R. Bellman’s classic “Introduction to matrix analysis” (1970) and Horn’s and Johnson’s two books (1985, 1991) on introductory and advanced matrix analysis. More or less systematic treatments of eigensystem are found in books on matrix computations. The classics of the field are Householder’s “Theory of matrices in numerical analysis” (1964) and Wilkinson’s “The algebraic eigenvalue problem” (1965) . G. Golub’s and Van Loan’s “Matrix computations” (1996) is the currently definite survey of the field. Trefethen’s and Bau’s “Numerical linear algebra” (1997) is a high-level, insightful treatment with a welcomed stress on geometry. G.W. Stewart’s “Matrix algorithm: eigensystems” (2001) is becoming a classic as well. The term “eigenvalue” derives from the German Eigenwert, which was introduced by D. Hilbert (1904) to denote for integral equations the reciprocal of the matrix eigenvalue. At some point Hilbert’s Eigenwert inverted themselves and became attached to matrices. Eigenvalues have been called many things in their day. The “characteristic value” is a reasonable translation of Eigenwert. However, “characteristic” has an inconveniently large number of syllables and survives only in the terms “characteristic equation” and “characteristic polynomial”. For symmetric matrices the characteristic equation and its equivalent are also called the secular equation owing to its connection with the secular perturbations in the orbits of planets. Other terms are “latent value” and “proper value” from the French “valeur propre”. Indeed the day when purists and pedants could legitimately object to “eigenvalue” as a hybrid of German and English has long since passed. The German “eigen” has become thoroughly materialized English prefix meaning having to do with eigenvalues and eigenvectors. Thus we can use “eigensystem”, “eigenspace” or “eigenexpansion” without fear of being misunderstood. The term “eigenpair” used to denote an eigenvalue and eigenvector is a recent innovation.
5-3 The hybrid approximation solution: D-HAPS and TykhonovPhillips regularization G x ,G y MINOLESS has been built on sequential approximations. First, the surjectivity defect was secured by G y LESS . The corresponding normal
5-3 The hybrid approximation solution
283
equations suffered from the effect of the injectivity defect. Accordingly, second G x LESS generated a unique solution the rank deficient normal equations. Alternatively, we may constitute a unique solution of the system of inconsistent, rank deficient equations {Ax + i = y | AG \ n× m , r := rk A < min{n, m}} by the D -weighted hybrid norm of type “LESS” and “MINOS”. Such a solution of a general algebraic regression problem is also called • Tykhonov- Phillips regularization
• •
ridge estimator D HAPS.
Indeed, D HAPS is the most popular inversion operation, namely to regularize improperly posed problems. An example is the discretized version of an integral equation of the first kind. Definition 5.8 (D-HAPS): An m × 1vector x is called weighted D HAPS (Hybrid AP proximative Solution) with respect to an D -weighted G x , G y -seminorm of (5.1), if x h = arg{|| y - Ax ||G2 +D || x ||G2 = min | Ax + i = y , y
A\
n× m
x
(5.125)
; rk A d min{n, m}}.
Note that we may apply weighted D HAPS even for the case of rank identity rkA d min{n, m} . The factor D \ + balances the least squares norm and the minimum norm of the unknown vector which is illustrated by Figure 5.7.
Figure 5.7. Balance of LESS and MINOS to general MORE Lemma 5.9 (D HAPS ) : x h is weighted D HAPS of x of the general system of inconsistent, possibly of inconsistent, possibly rank deficient system of linear equations (5.1) if and only if the system of normal equations 1 1 (5.126) (D G x + A cG y A )x h = AcG y y or (G x + A cG y A )x h = A cG y y (5.127) D D is fulfilled. xh always exists and is uniquely determined if the matrix (5.128) D G x + A'G y A is regular or rk[G x , A cG y A] = m.
284
5 The third problem of algebraic regression
: Proof : D HAPS is constructed by means of the Lagrangean L( x ) :=|| y - Ax ||G2 +D || x ||G2 = ( y - Ax )cG y ( y - Ax) + D ( xcG y x) = min , y
y
x
such that the first derivates dL ( x h ) = 2(D G x + A cG y A )x h 2A cG y y = 0 dx constitute the necessary conditions. Let us refer to the theory of vector derivatives in Appendix B. The second derivatives w2L ( x h ) = 2(D G x + A cG y A ) t 0 wxwx c generate the sufficiency conditions for obtaining the minimum of the unconstrained Lagrangean. If D G x + A ' G y A is regular of rk[G y , A cG y A ] = m , there exists a unique solution. h Lemma 5.10 (D HAPS ) : If x h is D HAPS of x of the general system of inconsistent, possibly of inconsistent, possibly rank deficient system of linear equations (5.1) fulfilling the rank identity rk[G y , A cG y A ] = m or det(D G x + A cG y A ) z 0 then x h = (D G x + A cG y A ) 1 A cG y y or 1 1 x h = (G x + A cG y A ) 1 A cG y y D D or x h = (D I + G x1A cG y A ) 1 G x1A cG y y or 1 1 1 G x A cG y A ) 1 G x1A cG y y D D are four representations of the unique solution. x h = (I +
6
The third problem of probabilistic regression – special Gauss - Markov model with datum problem – Setup of BLUMBE and BLE for the moments of first order and of BIQUUE and BIQE for the central moment of second order {y = Aȟ + c y , A \ n×m , rk A < min{n, m}} :Fast track reading: Read only Definition 6.1, Theorem 6.3, Definition 6.4-6.6, Theorem 6.8-6.11
Lemma 6.2 ȟˆ hom Ȉ y , S-BLUMBE of ȟ
Theorem 6.3 hom Ȉ y , S-BLUMBE of ȟ
Definition 6.1 ȟˆ hom Ȉ y , S-BLUMBE
Lemma 6.4 n E {y}, D{Aȟˆ}, D{e y }
Theorem 6.5 Vˆ 2 BIQUUE of Vˆ 2
Theorem 6.6 Vˆ 2 BIQE of V 2
286
6 The third problem of probabilistic regression
Definition 6.7 ȟˆ hom BLE of ȟ
Lemma 6.10 hom BLE, hom S-BLE, hom D -BLE
Theorem 6.11 ȟˆ hom BLE
Definition 6.8 ȟˆ S-BLE of ȟ
Definition 6.9 ȟˆ hom hybrid var-min bias
6
Theorem 6.12 ȟˆ hom S-BLE
Theorem 6.13 ȟˆ hom D -BLE
Definition 6.7 and Lemma 6.2, Theorem 6.3, Lemma 6.4, Theorem 6.5 and 6.6 review ȟˆ of type hom Ȉ y , S-BLUMBE, BIQE, followed by the first example. Alternatively, estimators of type best linear, namely hom BLE, hom S-BLE and hom D -BLE are presented. Definitions 6.7, 6.8 and 6.9 relate to various estimators followed by Lemma 6.10, Theorem 6.11, 6.12 and 6.13. In the fifth chapter we have solved a special algebraic regression problem, namely the inversion of a system of inconsistent linear equations with a datum defect. By means of a hierarchic postulate of a minimum norm || x ||2 = min , least squares solution || y Ax ||2 = min (“MINOLESS”) we were able to determine m unknowns from n observations through the rank of the linear operator, rk A = r < min{n, m} , was less than the number of observations or less the number of unknowns. Though “MINOLESS” generates a rigorous solution, we were left with the problem to interpret our results. The key for an evolution of “MINOLESS” is handed over to us by treating the special algebraic problem by means of a special probabilistic regression problem, namely as a special Gauss-Markov model with datum defect. The bias
6-1 Setup of the best linear minimum bias estimator of type BLUMBE
287
generated by any solution of a rank deficient system of linear equations will again be introduced as a decisive criterion for evaluating “MINOLESS”, now in the context of a probabilistic regression problem. In particular, a special form of “LUMBE” the linear uniformly minimum bias estimator || LA I ||= min , leads us to a solution which is equivalent to “MINOS”. “Best” of LUMBE in the sense of the average variance || D{ȟˆ} ||2 = tr D{ȟˆ} = min also called BLUMBE, will give us a unique solution of ȟˆ as a linear estimation of the observation vector y {Y, pdf} with respect to the linear model E{y} = Aȟ , D{y} = Ȉ y of “fixed effects” ȟ ; . Alternatively, in the fifth chapter we had solved the ill-posed problem y = Ax+i, A\ n×m , rk A < min{n,m} , by means of D -HAPS. Here with respect to a special probabilistic regression problem we succeed to compute D -BLE ( D weighted, S modified Best Linear Estimation) as an equivalence to D - HAPS of a special algebraic regression problem. Most welcomed is the analytical optimization problem to determine the regularization parameter D by means of a special form of || MSE{D} ||2 = min , the weighted Mean Square Estimation Error. Such an optimal design of the regulator D is not possible in the Tykhonov-Phillips regularization in the context of D -HAPS, but definitely in the context of D -BLE.
6-1 Setup of the best linear minimum bias estimator of type BLUMBE Box 6.1 is a definition of our special linear Gauss-Markov model with datum defect. We assume (6.1) E{y} = Aȟ, rk A < min{n, m} (1st moments) and (6.2) D{y} = Ȉ y , Ȉ y positive definite, rk Ȉ y = n (2nd moments). Box 6.2 reviews the bias vector as well as the bias matrix including the related norms. Box 6.1 Special linear Gauss-Markov model with datum defect {y = Aȟ + c y , A \ n×m , rk A < min{n, m}} 1st moments E{y} = Aȟ , rk A < min{n, m}
(6.1)
2nd moments D{y} =: Ȉ y , Ȉ y positive definite, rk Ȉ y = n, ȟ \ m , vector of “fixed effects”, unknown, Ȉ y unknown or known from prior information.
(6.2)
288
6 The third problem of probabilistic regression
Box 6.2 Bias vector, bias matrix Vector and matrix bias norms Special linear Gauss-Markov model of fixed effects subject to datum defect A \ n× m , rk A < min{n, m} E{y} = Aȟ, D{y} = Ȉ y
(6.3)
“ansatz” ȟˆ = Ly
(6.4)
bias vector ȕ := E{ȟˆ ȟ} = E{ȟˆ} ȟ z 0
(6.5)
ȕ = LE{y} ȟ = [I m LA ]ȟ z 0
(6.6)
bias matrix B := I n LA
(6.7)
“bias norms” || ȕ ||2 = ȕcȕ = ȟ c[I m LA]c[I m LA]ȟ
(6.8)
2 || ȕ ||2 = tr ȕȕc = tr[I m LA]ȟȟ c[I m LA ]c =|| B ||ȟȟ c
(6.9)
|| ȕ ||S2 := tr[I m LA]S[I m LA ]c =|| B ||S2
(6.10)
“dispersion matrix” D{ȟˆ} = LD{y}Lc = L6 y Lc
(6.11)
“dispersion norm, average variance” || D{ȟˆ} ||2 := tr LD{y}Lc = tr L6 y Lc =:|| Lc ||Ȉ
y
(6.12)
“decomposition” ȟˆ ȟ = (ȟˆ E{ȟˆ}) + ( E{ȟˆ} ȟˆ )
(6.13)
ȟˆ ȟ = L(y E{y}) [I m LA]ȟ
(6.14)
“Mean Square Estimation Error” MSE{ȟˆ} := E{(ȟˆ ȟ )(ȟˆ ȟ )c}
(6.15)
289
6-1 Setup of the best linear minimum bias estimator of type BLUMBE
MSE{ȟˆ} = LD{y}Lc + [I m LA ]ȟȟ c[I m LA ]c
(6.16)
“modified Mean Square Estimation Error” MSES {ȟˆ} := LD{y}Lc + [I m LA ]S[I m LA ]c
(6.17)
“MSE norms, average MSE” || MSE{ȟˆ} ||2 := tr E{(ȟˆ ȟ )(ȟˆ ȟ )c} || MSE{ȟˆ} ||2 =
(6.18)
= tr LD{y}Lc + tr[I m LA]ȟȟ c[I m LA ]c =
(6.19)
= || Lc ||
2 Ȉy
+ || (I m LA)c ||
2 ȟȟ c
|| MSES {ȟˆ} ||2 := := tr LD{y}Lc + tr[I m LA]S[I m LA]c = =|| Lc ||
2 Ȉy
(6.20)
+ || (I m LA)c || . 2 ȟȟ c
Definition 6.1 defines (1st) ȟˆ as a linear homogenous form, (2nd) of type “minimum bias” and (3rd) of type “smallest average variance”. Chapter 6-11 is a collection of definitions and lemmas, theorems basic for the developments in the future. 6-11
Definitions, lemmas and theorems Definition 6.1 (ȟˆ hom Ȉ , S-BLUMBE) : y
An m × 1 vector ȟˆ = Ly is called homogeneous Ȉ y , S- BLUMBE (homogeneous Best Linear Uniformly Minimum Bias Estimation) of ȟ in the special inconsistent linear Gauss Markov model of fixed effects of Box 6.1, if ȟˆ is a homogeneous linear form (1st) (6.21) ȟˆ = Ly ˆ (2nd) in comparison to all other linear estimations ȟ has the minimum bias in the sense of || B ||S2 :=|| ( I m LA )c ||S2 = min,
(6.22)
L
(3rd)
in comparison to all other minimum bias estimations ȟˆ has the smallest average variance in the sense of || D{ȟˆ} ||2 = tr LȈ y Lc =|| L ' ||2Ȉ = min . y
L
(6.23)
290
6 The third problem of probabilistic regression
The estimation ȟˆ of type hom Ȉ y , S-BLUMBE can be characterized by Lemma 6.2 (ȟˆ hom Ȉ y , S-BLUMBE of ȟ ) : An m × 1vector ȟˆ = Ly is hom Ȉ y , S-BLUMBE of ȟ in the special inconsistent linear Gauss- Markov model with fixed effects of Box 6.1, if and only if the matrix L fulfils the normal equations ASA 'º ªL 'º ª º ª Ȉy = « ASA ' 0 »¼ «¬ ȁ »¼ «¬ AS »¼ ¬
(6.24)
with the n × n matrix ȁ of “Lagrange multipliers”. : Proof : First, we minimize the S-modified bias matrix norm, second the MSE( ȟˆ ) matrix norm. All matrix norms have been chosen “Frobenius”. (i) || (I m LA) ' ||S2 = min . L
The S -weighted Frobenius matrix norm || (I m LA ) ' ||S2 establishes the Lagrangean
L (L) := tr(I m LA)S(I m LA) ' = min L
for S-BLUMBE . ª ASA ' Lˆ ' AS = 0 L (L) = min « L ¬ ( ASA ')
I m > 0, according to Theorem 2.3. ASA cº ª C1 ª Ȉy « ASA c 0 »¼ «¬C3 ¬
C2 º ª Ȉ y ASA cº ª Ȉ y ASA cº =« » « » C4 ¼ ¬ ASA c 0 ¼ ¬ ASA c 0 »¼
(6.25)
Ȉ y C1 Ȉ y + Ȉ y C2 ASA c + ASAcC3 Ȉ y + ASAcC4 ASAc = Ȉ y
(6.26)
Ȉ y C1 ASAc + ASA cC3 ASAc = ASAc
(6.27)
ASA cC1 Ȉ y + ASA cC2 ASA c = ASA c
(6.28)
ASA cC1ASA c = 0.
(6.29)
Let us multiply the third identity by Ȉ -1y ASAc = 0 from the right side and substitute the fourth identity in order to solve for C2 . ASAcC2 ASAcȈ y1 ASAc = ASAcȈ y1 ASAc
(6.30)
291
6-1 Setup of the best linear minimum bias estimator of type BLUMBE
C2 = Ȉ -1y ASA c( ASA cȈ y1 ASAc)
(6.31)
solves the fifth equation A cSAȈ-1y ASA c( ASA cȈ-y1ASA c) ASA cȈ-y1ASA c = = ASA cȈ-y1ASA c
(6.32)
by the axiom of a generalized inverse. (ii) || L ' ||2Ȉ = min . y
L
The Ȉ y -weighted Frobenius matrix norm of L subject to the condition of LUMBE generates the constrained Lagrangean
L (L, ȁ) = tr LȈ y L '+ 2 tr ȁ '( ASA ' L ' AS) = min . L,ȁ
According to the theory of matrix derivatives outlined in Appendix B wL ˆ ˆ ˆ ) ' = 0, (L, ȁ ) = 2( Ȉ y Lˆ '+ ASA ' ȁ wL wL ˆ ˆ (L, ȁ ) = 2( ASA ' L ' AS) = 0 , wȁ ˆ ) constitute the necessary conditions, while at the “point” (Lˆ , ȁ w2L ˆ ) = 2( Ȉ
I ) > 0 , (Lˆ , ȁ y m w (vec L) w (vec L ') to be a positive definite matrix, the sufficiency conditions. Indeed, the first matrix derivations have been identified as the normal equations of the sequential optimization problem.
h For an explicit representation of ȟˆ hom Ȉ y , S-BLUMBE of ȟ we solve the normal equations for Lˆ = arg{|| D(ȟˆ ) ||= min | ASA ' L ' AS = 0} . L
In addition, we compute the dispersion matrix D{ȟˆ | hom BLUMBE} as well as the mean square estimation error MSE{ȟˆ | hom BLUMBE}. Theorem 6.3 ( hom Ȉ y , S-BLUMBE of ȟ ): Let ȟˆ = Ly be hom Ȉ y , S-BLUMBE in the special GaussMarkov model of Box 6.1. Then independent of the choice of the generalized inverse ( ASA ' Ȉ y ASA ') the unique solution of the normal equations (6.24) is
292
6 The third problem of probabilistic regression
ȟˆ = SA '( ASA ' Ȉ -1y ASA ') ASA ' Ȉ-1y y ,
(6.33)
completed by the dispersion matrix D(ȟˆ ) = SA '( ASA ' Ȉ-1y ASA ') AS ,
(6.34)
the bias vector ȕ = [I m SA '( ASA ' Ȉ -1y ASA ') ASA ' Ȉ -1y A] ȟ ,
(6.35)
and the matrix MSE {ȟˆ} of mean estimation errors E{(ȟˆ ȟ )(ȟˆ ȟ ) '} = D{ȟˆ} + ȕȕ '
(6.36)
modified by E{(ȟˆ ȟ )(ȟˆ ȟ ) '} = D{ȟˆ} + [I m LA]S[I m LA]' = = D{ȟˆ} + [S SA '( ASA ' Ȉ -1 ASA ') ASA ' Ȉ -1 AS ], y
(6.37)
y
based upon the solution of ȟȟ ' by S. rk MSE{ȟˆ} = rk S
(6.38)
is the corresponding rank identity. :Proof: (i) ȟˆ hom Ȉ y , S-BLUMBE of ȟ . First, we prove that the matrix of the normal equations ASA cº ª Ȉy , « ASA c 0 »¼ ¬
ASAcº ª Ȉy =0 « ASA c 0 »¼ ¬
is singular. Ȉy ASAc =| Ȉ y | | ASAcȈ y 1 ASAc |= 0 , c 0 ASA due to rk[ ASAcȈ y1ASAc] = rk A < min{n, m} assuming rk S = m , rk Ȉ y = n . Note A11 A 21
A12 ª A \ m ×m =| A11 | | A 22 A 21 A111 A12 | if « 11 A 22 ¬ rk A11 = m1 1
1
293
6-1 Setup of the best linear minimum bias estimator of type BLUMBE
with reference to Appendix A. Thanks to the rank deficiency of the partitioned normal equation matrix, we are forced to compute secondly its generalized inverse. The system of normal equations is solved for ªLˆ cº ª Ȉ y ASA cº ª 0 º ª C1 = « »=« ˆ 0 »¼ «¬ AS »¼ «¬C3 «¬ ȁ »¼ ¬ ASA c
C2 º ª 0 º C4 »¼ «¬ AS »¼
Lˆ c = C2 AS
(6.39) (6.40)
Lˆ = SA cCc2
(6.41)
Lˆ = SA( ASA cȈ y1ASA c) ASA cȈ y1
(6.42)
such that ˆ = SA c( ASA cȈ 1ASA c) ASAcȈ 1y. ȟˆ = Ly y y
(6.43)
We leave the proof for “ SA c( ASA cȈ y1ASA c) ASA cȈ y1 is a weighted pseudo-inverse or Moore-Penrose inverse” as an exercise. (ii) Dispersion matrix D{ȟˆ} . The residual vector ȟˆ E{y} = Lˆ (y E{y})
(6.44)
leads to the variance-covariance matrix ˆ Lˆ c = D{ȟˆ} = LȈ y = SA c( ASA cȈ y1ASA c) ASAcȈ y1 ASA c( ASA cȈ y1 ASA c) AS =
(6.45)
= SAc( ASAcȈ ASAc) AS . 1 y
(iii) Bias vector E ˆ )ȟ = ȕ := E{ȟˆ ȟ} = (I m LA = [I m SA c( ASA cȈ y1ASA c) ASA cȈ y1A]ȟ .
(6.46)
Such a bias vector is not accessible to observations since ȟ is unknown. Instead it is common practice to replace ȟ by ȟˆ (BLUMBE), the estimation ȟˆ of ȟ of type BLUMBE. (iv) Mean Square Estimation Error MSE{ȟˆ}
294
6 The third problem of probabilistic regression
MSE{ȟˆ} := E{(ȟˆ ȟ )(ȟˆ ȟ )c} = D{ȟˆ} + ȕȕ c = ˆ Lˆ c + (I LA ˆ )ȟȟ c(I LA ˆ )c . = LȈ m
y
(6.47)
m
Neither D{ȟˆ | Ȉ y } , nor ȕȕ c are accessible to measurements. ȟȟ c is replaced by K.R. Rao’s substitute matrix S, Ȉ y = 9V 2 by a one variance component model V 2 by Vˆ 2 (BIQUUE) or Vˆ 2 (BIQE), for instance.
h
n Lemma 6.4 ( E {y} , D{Aȟˆ} , e y , D{ey } for ȟˆ hom Ȉ y , S of ȟ ): (i)
With respect to the model (1st) Aȟ = E{y} , E{y} \( A ), rk A =: r d m and VV 2 = D{y}, V positive definite, rkV = n under the condition dim R(SA c) = rk(SA c) = rk A = r , namely V, S-BLUMBE, is given by n E {y} = Aȟˆ = A( AcV 1 A) AcV 1 y
(6.48)
with the related singular dispersion matrix D{Aȟˆ} = V 2 A( A cV 1A) A c
(6.49)
for any choice of the generalized inverse ( AcV 1 A) . (ii)
The empirical error vector e y = y E{y} results in the residual error vector e y = y Aȟˆ of type e y = [I n A( A cV 1A) A cV 1By ]
(6.50)
with the related singular dispersion matrices D{e y } = V 2 [ V A( A cV 1A ) A c]
(6.51)
for any choice of the generalized inverse ( AcV 1 A) . (iii)
The various dispersion matrices are related by D{y} = D{Aȟˆ + e y } = D{Aȟˆ} + D{e y } = = D{e y e y } + D{e y },
(6.52)
where the dispersion matrices e y
and
Aȟˆ
(6.53)
are uncorrected, in particularly, C{e y , Aȟˆ} = C{e y , e y e y } = 0 .
(6.54).
295
6-1 Setup of the best linear minimum bias estimator of type BLUMBE
When we compute the solution by Vˆ of type BIQUUE and of type BIQE we arrive at Theorem 6.5 and Theorem 6.6. Theorem 6.5
( Vˆ 2 BIQUUE of V 2 , special Gauss-Markov model: E{y} = Aȟ , D{y} = VV 2 , A \ n× m , rk A = r d m , V \ n× m , rk V = n ):
Let Vˆ 2 = y cMy = (vec M )cy
y = y c
y c(vec M ) be BIQUUE with respect to the special Gauss-Markov model 6.1. Then
Vˆ 2 = (n - r )-1 y c[V 1 - V 1 A( A cV 1 A) A cV 1 ]y
(6.55)
Vˆ 2 = (n - r )-1 y c[V 1 - V 1 ASA c( ASA cV 1 ASA c) ASA cV 1 ]y
(6.56)
Vˆ 2 = (n - r )-1 y cV 1e y = (n - r )-1 e cy V 1e y
(6.57)
are equivalent representations of the BIQUUE variance component Vˆ 2 which are independent of the generalized inverses ( A cV 1 A)
or
( ASAcV 1 AcSA) .
The residual vector e y , namely e y (BLUMBE) = [I n A ( A cV 1A ) 1 A cV 1 ]y ,
(6.58)
is of type BLUMBE. The variance of Vˆ 2 BIQUUE of V 2 D{Vˆ 2 } = 2(n r ) 1 V 4 = 2( n r ) 1 (V 2 ) 2
(6.59)
can be substituted by the estimation D{Vˆ 2 } = 2(n r ) 1 (Vˆ 2 ) 2 = 2(n r ) 1 (e cy V 1e y ) 2 .
(6.60)
( Vˆ 2 BIQE of V 2 , special Gauss-Markov model: E{y} = Aȟ , D{y} = VV 2 , A \ n× m , rk A = r d m , V \ n× m , rk V = n ): Let Vˆ 2 = y cMy = (vec M )cy
y = y c
y c(vec M ) be BIQE with respect to the special Gauss-Markov model 6.1. Independent of the matrix S and of the generalized inverses ( A cV 1 A ) or ( ASAcV 1 AcSA) ,
Theorem 6.6
Vˆ 2 = (n r + 2) 1 y c[V 1 V 1 A( A cV 1 A) 1 A cV 1 ]y
(6.61)
Vˆ 2 = (n r + 2) 1 y c[V 1 V 1 ASA c( ASAcV 1 ASAc) 1 ASAcV 1 ]y (6.62)
296
6 The third problem of probabilistic regression
Vˆ 2 = ( n r + 2) 1 y cV 1e y = ( n r + 2) 1 e cy V 1e y
(6.63)
are equivalent representations of the BIQE variance component Vˆ 2 . The residual vector e y , namely e y (BLUMBE) = [I m A ( A cV 1A ) 1 A cV 1 ]y ,
(6.64)
is of type BLUMBE. The variance of Vˆ 2 BIQE of V2 D{Vˆ 2 } = 2(n r )(n r + 2) 2 V 4 = 2(n r )[( n r + 2) 1 V 2 ]2
(6.65)
can be substituted by the estimation Dˆ {Vˆ 2 } = 2(n r )( n r + 2) 4 (e cy V 1e y ) 2 .
(6.66)
The special bias ȕV := E{Vˆ 2 V 2 } = 2( n r + 2) 1V 2 2
(6.67)
can be substituted by the estimation ȕˆ V = Eˆ {Vˆ 2 V 2 } = 2( n r + 2) 2 e cy V 1e y . 2
(6.68)
Its MSE (Vˆ 2 ) (Mean Square Estimation Error) MSE{Vˆ 2 }:= Eˆ {(Vˆ 2 V 2 ) 2 } = D{Vˆ 2 } + (V 2 E{Vˆ 2 }) 2 = = 2( n r + 2) 1 (V 2 ) 2
(6.69)
can be substituted by the estimation n{Vˆ 2 } = Eˆ {(Vˆ 2 V 2 ) 2 } = Dˆ {Vˆ 2 } + ( Eˆ {V 2 }) 2 = MSE = 2(n r + 2) 3 (e cy V 1e y ). 6-12
(6.70)
The first example: BLUMBE versus BLE, BIQUUE versus BIQE, triangular leveling network
The first example for the special Gauss-Markov model with datum defect {E{y} = Aȟ, A \ n×m , rk A < min{n, m}, D{y} = VV 2 , V \ n×m , V 2 \ + , rk V = n} is taken from a triangular leveling network. 3 modal points are connected, by leveling measurements [ hĮȕ , hȕȖ , hȖĮ ]c , also called potential differences of absolute potential heights [hĮ , hȕ , hȖ ]c of “fixed effects”. Alternative estimations of type (i)
I, I-BLUMBE of ȟ \ m
6-1 Setup of the best linear minimum bias estimator of type BLUMBE
(ii)
V, S-BLUMBE of ȟ \ m
(iii)
I, I-BLE of ȟ \ m
(iv)
V, S-BLE of ȟ \ m
(v)
BIQUUE of V 2 \ +
(vi)
BIQE of V 2 \ +
297
will be considered. In particular, we use consecutive results of Appendix A, namely from solving linear system of equations based upon generalized inverse, in short g-inverses. For the analyst, the special Gauss-Markov model with datum defect constituted by the problem of estimating absolute heights [hĮ , hȕ , hȖ ] of points {PĮ , Pȕ , PȖ } from height differences is formulated in Box 6.3. Box 6.3 The first example ª hĮȕ º ª 1 +1 0 º ª hĮ º « » « » E{« hȕȖ »} = «« 0 1 +1»» « hȕ » « » « » ¬ hȖĮ ¼ ¬« +1 0 1¼» ¬ hȖ ¼ ª hĮȕ º « » y := « hȕȖ » , « hȖĮ » ¬ ¼
ª 1 + 1 0 º A := «« 0 1 +1»» \ 3×3 , «¬ +1 0 1»¼
ª hĮ º « » ȟ := « hȕ » « hȖ » ¬ ¼
ª hĮȕ º « » D{« hȕȖ »} = D{y} = VV 2 , V 2 \ + « hȖĮ » ¬ ¼ :dimensions: ȟ \ 3 , dim ȟ = 3, y \ 3 , dim{Y, pdf } = 3 m = 3, n = 3, rk A = 2, rk V = 3. 6-121 The first example: I3, I3-BLUMBE In the first case, we assume a dispersion matrix D{y} = I 3V 2 of i.i.d. observations [y1 , y 2 , y 3 ]c
and
a unity substitute matrix S=,3, in short u.s. .
Under such a specification ȟˆ is I3, I3-BLUMBE of ȟ in the special GaussMarkov model with datum defect.
298
6 The third problem of probabilistic regression
ȟˆ = A c( AA cAA c) AA cy ª 2 1 1º c c AA AA = 3 «« 1 2 1»» , «¬ 1 1 2 »¼
( AA cAA c) =
ª2 1 0º 1« 1 2 0 »» . « 9 «¬ 0 0 0 »¼
?How did we compute the g-inverse ( AA cAAc) ? The computation of the g-inverse ( AAcAAc) has been based upon bordering.
ª ª 6 3º 1 0 º ª 6 3 3 º ª 2 1 0º 1 «« » « » « » ( AA cAAc) = « 3 6 3» = « ¬ 3 6 ¼» 0 » = «1 2 0 » . 9 « 0 0 «¬ 3 3 6 »¼ «¬ 0 0 0 »¼ 0 »¼ ¬ Please, check by yourself the axiom of a g-inverse:
ª +6 3 3º ª +6 3 3º ª +6 3 3º ª +6 3 3º « »« » « » « » « 3 +6 3» « 3 +6 3» « 3 +6 3» = « 3 +6 3» ¬« 3 3 +6 ¼» ¬« 3 3 +6 ¼» ¬« 3 3 +6¼» ¬« 3 3 +6¼» or
ª +6 3 3º ª 2 1 0 º ª +6 3 3º ª +6 3 3º « »1« » « » « » « 3 +6 3» 9 « 1 2 0 » « 3 +6 3» = « 3 +6 3» «¬ 3 3 +6 »¼ «¬ 0 0 0 »¼ «¬ 3 3 +6 »¼ «¬ 3 3 +6»¼ ª hĮ º ª y1 + y3 º 1 « » ȟˆ = « hȕ » (I 3 , I 3 -BLUMBE) = «« y1 y2 »» 3 « hȖ » «¬ y2 y3 »¼ ¬ ¼
[ˆ1 + [ˆ2 + [ˆ3 = 0 . Dispersion matrix D{ȟˆ} of the unknown vector of “fixed effects” D{ȟˆ} = V 2 A c( AA cAAc) A ª +2 1 1º V2 « ˆ D{ȟ} = 1 + 2 1 » » 9 « ¬« 1 1 +2 ¼» “replace V 2 by Vˆ 2 (BIQUUE):
6-1 Setup of the best linear minimum bias estimator of type BLUMBE
299
Vˆ 2 = (n rk A) 1 e cy e y ” e y (I 3 , I 3 -BLUMBE) = [I 3 A ( AA c) A c]y ª1 1 1º ª1º 1 1 e y = «1 1 1» y = ( y1 + y2 + y3 ) «1» » «» 3« 3 «¬1 1 1»¼ «¬1»¼ ª1 1 1º ª1 1 1º 1 e cy e y = y c ««1 1 1»» ««1 1 1»» y 9 «¬1 1 1»¼ «¬1 1 1»¼ 1 e cy e y = ( y12 + y22 + y32 + 2 y1 y2 + 2 y2 y3 + 2 y3 y1 ) 3 1 Vˆ 2 (BIQUUE) = ( y12 + y22 + y32 + 2 y1 y2 + 2 y2 y3 + 2 y3 y1 ) 3 ª +2 1 1º 1« ˆ D{ȟ} = « 1 +2 1»» Vˆ 2 (BIQUUE) 9 «¬ 1 1 +2 »¼ “replace V 2 by Vˆ 2 (BIQE):
Vˆ 2 = ( n + 2 rk A ) 1 e cy e y ” e y (I 3 , I 3 -BLUMBE) = [I 3 A ( AA c) A c]y ª1 1 1º ª1º 1 1 e y = ««1 1 1»» y = ( y1 + y2 + y3 ) ««1»» 3 3 «¬1 1 1»¼ «¬1»¼ 1 Vˆ 2 ( BIQE ) = ( y12 + y22 + y32 + 2 y1 y2 + 2 y2 y3 + 2 y3 y1 ) 9 ª +2 1 1º 1 D{ȟˆ} = «« 1 +2 1»» Vˆ 2 (BIQE) . 9 «¬ 1 1 +2 »¼ For practice, we recommend D{ȟˆ (BLUMBE), Vˆ 2 (BIQE)} , since the dispersion matrix D{ȟˆ} is remarkably smaller when compared to D{ȟˆ (BLUMBE), Vˆ 2 (B IQUUE)} , a result which seems to be unknown!
300
6 The third problem of probabilistic regression
Bias vector ȕ(BLUMBE) of the unknown vector of “fixed effects” ȕ = [I 3 A c( AA cAA c) AA cA]ȟ , ª1 1 1º 1« ȕ = «1 1 1»» ȟ , 3 «¬1 1 1»¼ “replace ȟ which is inaccessible by ȟˆ (I 3 , I 3 -BLUMBE) ” ª1 1 1º 1« ȕ = «1 1 1»» ȟˆ (I 3 , I 3 -BLUMBE) , 3 ¬«1 1 1»¼ ȕ=0 (due to [ˆ1 + [ˆ2 + [ˆ3 = 0 ). Mean Square Estimation Error MSE{ȟˆ (I 3 , I 3 -BLUMBE)} MSE{ȟˆ} = D{ȟˆ} + [I 3 A c( AA cAA c) AA cA]V 2 , ª5 2 2º V2 « ˆ MSE{ȟ} = 2 5 2 »» . 9 « «¬ 2 2 5 »¼ “replace V 2 by Vˆ 2 (BIQUUE): Vˆ 2 = (n rk A) 1 ecy e y ” 1 Vˆ 2 (BIQUUE) = ( y12 + y22 + y32 + 2 y1 y2 + 2 y2 y3 + 2 y3 y1 ) , 3 ª5 2 2º 1« ˆ MSE{ȟ} = « 2 5 2 »» Vˆ 2 (BIQUUE) . 9 «¬ 2 2 5 »¼ “replace V 2 by Vˆ 2 (BIQE):
Vˆ 2 = ( n + 2 rk A ) 1 ecy e y ” 1 Vˆ 2 (BIQE) = ( y12 + y22 + y32 + 2 y1 y2 + 2 y2 y3 + 2 y3 y1 ) 9
6-1 Setup of the best linear minimum bias estimator of type BLUMBE
301
ª5 2 2º 1 MSE{ȟˆ} = «« 2 5 2 »» Vˆ 2 (BIQE) . 9 «¬ 2 2 5 »¼ Residual vector e y and dispersion matrix D{e y } of the “random effect” e y e y (I 3 , I 3 -BLUMBE) = [I 3 A ( A cA ) A c]y ª1 1 1º ª1º 1« 1 » e y = «1 1 1» y = ( y1 + y2 + y3 ) ««1»» 3 3 «¬1 1 1»¼ «¬1»¼ D{e y } = V 2 [I 3 A( A cA) A c] ª1 1 1º V2 « D{e y } = 1 1 1»» . 3 « «¬1 1 1»¼ “replace V 2 by Vˆ 2 (BIQUUE) or Vˆ 2 (BIQE)”: ª1 1 1º 1« D{e y } = «1 1 1»» Vˆ 2 (BIQUUE) 3 «¬1 1 1»¼ or ª1 1 1º 1« D{e y } = «1 1 1»» Vˆ 2 (BIQE) . 3 «¬1 1 1»¼ Finally note that ȟˆ (I 3 , I 3 -BLUMBE) corresponds x lm (I 3 , I 3 -MINOLESS) discussed in Chapter 5. In addition, D{e y | I 3 , I 3 -BLUUE} = D{e y | I 3 , I 3 -BLUMBE} . 6-122 The first example: V, S-BLUMBE In the second case, we assume a dispersion matrix D{y} = VV 2 of weighted observations [ y1 , y2 , y3 ]
and
a weighted substitute matrix S, in short w.s. .
302
6 The third problem of probabilistic regression
Under such a specification ȟˆ is V, S-BLUMBE of ȟ in the special GaussMarkov model with datum defect. ȟˆ = SA c( ASAcV 1ASAc) 1 ASAcV 1y . As dispersion matrix D{y} = VV 2 we choose ª2 1 1º 1« V = «1 2 1 »» , rk V = 3 = n 2 «¬1 1 2 »¼ ª 3 1 1º 1« V = « 1 3 1»» , but 2 «¬ 1 1 3 »¼ 1
S = Diag(0,1,1), rk S = 2 as the bias semi-norm. The matrix S fulfils the condition rk(SA c) = rk A = r = 2 . ?How did we compute the g-inverse ( ASAcV 1 ASA c) ? The computation of the g-inverse ( ASAcV 1 ASA c) has been based upon bordering. ª +3 1 1º 1« V = « 1 +3 1»» , S = Diag(0,1,1), rk S = 2 2 «¬ 1 1 +3»¼ 1
ȟˆ = SA c( ASA cV 1ASA c) ASA cV 1 ª 2 3 1 º ASAcV ASAc = 2 «« 3 6 3»» «¬ 1 3 2 »¼ 1
ª 2 0 1º 1« ( ASAcV ASAc) = « 0 0 3»» 6 «¬ 1 0 2 »¼ ª hĮ º 0 ª º 1« ˆȟ = « h » = « 2 y1 y2 y3 »» . « ȕ» 3 « hȖ » ¬« y1 + y2 2 y3 ¼» ¬ ¼ V ,S BLUMBE 1
6-1 Setup of the best linear minimum bias estimator of type BLUMBE
Dispersion matrix D{ȟˆ} of the unknown vector of “fixed effects” D{ȟˆ} = V 2SA c( ASA cV 1ASA c) AS ª0 0 0 º V2 « ˆ D{ȟ} = 0 2 1 »» « 6 «¬0 1 2 »¼ “replace V 2 by Vˆ 2 (BIQUUE): Vˆ 2 = (n rk A) 1 e cy e y ” e y = (V, S-BLUMBE) = [I 3 A( A cV 1A) A cV 1 ]y ª1 1 1º y + y2 + y3 1« e y = «1 1 1»» y = 1 3 3 «¬1 1 1»¼
ª1º «1» «» «¬1»¼
ª1 1 1º ª1 1 1º 1 « e cy e y = y c «1 1 1»» ««1 1 1»» y 9 «¬1 1 1»¼ «¬1 1 1»¼ 1 e cy e y = ( y12 + y22 + y32 + 2 y1 y2 + 2 y2 y3 + 2 y3 y1 ) 3 1 Vˆ 2 (BIQUUE) = ( y12 + y22 + y32 + 2 y1 y2 + 2 y2 y3 + 2 y3 y1 ) 3 D{ȟˆ} = [V + A( A cV 1A) A c]Vˆ 2 (BIQUUE) ª1 1 1º 2« ˆ D{ȟ} = «1 1 1»» Vˆ 2 (BIQUUE) . 3 «¬1 1 1»¼ “replace V 2 by Vˆ 2 (BIQE):
Vˆ 2 = (n + 2 rk A) 1 e cy e y ” e y (V , S-BLUMBE) = [I 3 A ( A cV 1A ) A cV 1 ]y
303
304
6 The third problem of probabilistic regression
ª1 1 1º y + y2 + y3 1 e y = ««1 1 1»» y = 1 3 3 «¬1 1 1»¼
ª1º «1» «» «¬1»¼
1 Vˆ 2 (BIQE) = ( y12 + y22 + y32 + 2 y1 y2 + 2 y2 y3 + 2 y3 y1 ) 9 ª +2 1 1º 1 D{ȟˆ} = «« 1 +2 1»» Vˆ 2 (BIQE) . 9 «¬ 1 1 +2 »¼ We repeat the statement that we recommend the use of D{ȟˆ (BLUMBE), Vˆ (BIQE)} since the dispersion matrix D{ȟˆ} is remarkably smaller when compared to D{ȟˆ (BLUMBE), Vˆ 2 (BIQUUE)} ! Bias vector ȕ(BLUMBE) of the unknown vector of “fixed effects” ȕ = [I 3 SA c( ASA cV 1 ASA c) ASA cV 1 A ]ȟ ª1 0 0 º ª[1 º « » ȕ = «1 0 0 » ȟ = ««[1 »» , ¬«1 0 0 »¼ ¬«[1 ¼» “replace ȟ which is inaccessible by ȟˆ (V,S-BLUMBE)” ª1º ȕ = ««1»» ȟˆ , (V , S-BLUMBE) z 0 . ¬«1¼» Mean Square Estimation Error MSE{ȟˆ (V , S-BLUMBE)} MSE{ȟˆ} = = D{ȟˆ} + [S SA c( ASA cV 1ASA c) ASA cV 1AS]V 2 ª0 0 0 º V2 « ˆ MSE{ȟ} = 0 2 1 »» = D{ȟˆ} . 6 « «¬0 1 2 »¼ “replace V 2 by Vˆ 2 (BIQUUE): Vˆ 2 = (n rk A) 1 ecy e y ”
Vˆ 2 (BIQUUE)=3V 2
6-1 Setup of the best linear minimum bias estimator of type BLUMBE
305
MSE{ȟˆ} = D{ȟˆ} . Residual vector e y and dispersion matrix D{e y } of the “random effect” e y e y (V , S-BLUMBE) = [I 3 A ( A cV 1A ) A cV 1 ]y ª1 1 1º y + y2 + y3 1« e y = «1 1 1»» y = 1 3 3 «¬1 1 1»¼
ª1º «1» «» «¬1»¼
D{e y } = V 2 [V A( A cV 1A) A c] ª1 2 2« D{e y } = V «1 3 «¬1 2 “replace V by Vˆ 2
1 1º 1 1»» . 1 1»¼ (BIQE):
Vˆ 2 = (n + 2 rk A) 1 ecy e y ” Vˆ 2 (BIQE) versus ª0 0 0 º 1« ˆ MSE{ȟ} = «0 2 1 »» V 2 ( BIQE ) . 6 «¬0 1 2 »¼ Residual vector e y and dispersion matrix D{e y } of the “random effects” e y e y (V , S-BLUMBE) = [I 3 A ( A cV 1A ) A cV 1 ]y ª1 1 1º y + y2 + y3 1« e y = «1 1 1»» y = 1 3 3 «¬1 1 1»¼
ª1º «1» «» «¬1»¼
D{e y } = V 2 [V A( A cV 1A) A c] ª1 1 1º 2 2« D{e y } = V «1 1 1»» . 3 «¬1 1 1»¼ “replace V 2 by Vˆ 2 (BIQUUE) or Vˆ 2 (BIQE)”:
306
6 The third problem of probabilistic regression
D{e y } =
ª1 1 1º 2« 1 1 1»» Vˆ 2 (BIQUUE) « 3 «¬1 1 1»¼ or
ª1 1 1º 2« D{e y } = «1 1 1»» Vˆ 2 (BIQE) . 3 «¬1 1 1»¼ 6-123 The first example: I3 , I3-BLE In the third case, we assume a dispersion matrix D{y} = I 3V 2 of i.i.d. observations [ y1 , y2 , y3 ]
and
a unity substitute matrix S=I3, in short u.s. .
Under such a specification ȟˆ is I3, I3-BLE of ȟ in the special Gauss-Markov model with datum defect. ȟˆ (BLE) = (I 3 + A cA ) 1 A cy ª +3 1 1º I 3 + A cA = «« 1 +3 1»» , «¬ 1 1 +3»¼
ª2 1 1º 1« (I 3 + AcA) = «1 2 1 »» 4 «¬1 1 2 »¼ 1º ª 1 0 ª y1 + y3 º ˆȟ (BLE) = 1 « 1 1 0 » y = 1 « + y y » 1 2» » 4« 4« «¬ 0 1 1»¼ «¬ + y2 y3 »¼ 1
[ˆ1 + [ˆ2 + [ˆ3 = 0 . Dispersion matrix D{ȟˆ | BLE} of the unknown vector of “fixed effects” D{ȟˆ | BLE} = V 2 A cA (I 3 + AcA) 2 ª +2 1 1º 2 V « 1 +2 1» . D{ȟˆ | BLE} = » 16 « «¬ 1 1 +2 »¼ Bias vector ȕ(BLE) of the unknown vector of “fixed effects”
6-1 Setup of the best linear minimum bias estimator of type BLUMBE
307
ȕ = [I 3 + A cA]1 ȟ ª2 1 1º ª 2[1 + [ 2 + [3 º 1« 1« » ȕ = « 1 2 1 » ȟ = «[1 + 2[ 2 + [3 »» . 4 4 «¬ 1 1 2 »¼ «¬[1 + [ 2 + 2[3 »¼ Mean Square Estimation Error MSE{ȟ (BLE)} MSE{ȟˆ (BLE)} = V 2 [I 3 + A cA]1 ª2 1 1º V2 « ˆ MSE{ȟ (BLE)} = 1 2 1 »» . 4 « «¬1 1 2 »¼ Residual vector e y and dispersion matrix D{e y } of the “random effect” e y e y (BLE) = [ AA c + I 3 ]1 y e y (BLE) =
ª2 1 1º ª 2 y1 + y2 + y3 º 1« 1« » 1 2 1 » y = « y1 + 2 y2 + y3 »» « 4 4 «¬1 1 2 »¼ «¬ y1 + y2 + 2 y3 »¼
D{e y (BLE)} = V 2 [I 3 + AA c]2 D{e y (BLE)} =
ª6 5 5º V2 « 5 6 5 »» . « 16 «¬5 5 6 »¼
Correlations C{e y , Aȟˆ} = V 2 [I 3 + AA c]2 AA c ª +2 1 1º V2 « ˆ C{e y , Aȟ} = 1 +2 1» . « » 16 ¬« 1 1 +2 ¼» Comparisons BLUMBE-BLE (i) ȟˆ BLUMBE ȟˆ BLE ȟˆ BLUMBE ȟˆ BLE = A c( AA cAA c) AAc( AAc + I 3 ) 1 y
308
6 The third problem of probabilistic regression
ª 1 0 1 º ª y1 + y3 º 1 1 ȟˆ BLUMBE ȟˆ BLE = «« 1 1 0 »» y = «« + y1 y2 »» . 12 12 «¬ 0 1 1»¼ «¬ + y2 y3 »¼ (ii) D{ȟˆ BLUMBE } D{ȟˆ BLE } D{ȟˆ BLUMBE } D{ȟˆ BLE } = = V 2 A c( AA cAAc) AAc( AAc + I 3 ) 1 AAc( AAcAAc) A + +V 2 A c( AA cAAc) AAc( AAc + I 3 ) 1 AAc( AAc + I 3 ) 1 AAc( AAcAAc) A ª +2 1 1º 7 2 D{ȟˆ BLUMBE } D{ȟˆ BLE } = V « 1 +2 1»» positive semidefinite. 144 « «¬ 1 1 +2 »¼ (iii) MSE{ȟˆ BLUMBE } MSE{ȟˆ BLE } MSE{ȟˆ BLUMBE } MSE{ȟˆ BLE } = = ı 2 A c( AA cAAc) AAc( AAc + I 3 ) 1 AAc( AAcAAc) A ª +2 1 1º V2 « ˆ ˆ MSE{ȟ BLUMBE } MSE{ȟ BLE } = 1 +2 1»» positive semidefinite. 36 « «¬ 1 1 +2 »¼ 6-124 The first example: V, S-BLE In the fourth case, we assume a dispersion matrix D{y} = VV 2 of weighted observations [ y1 , y2 , y3 ]
and
a weighted substitute matrix S, in short w.s. .
We choose ª2 1 1º 1« V = «1 2 1 »» positive definite, rk V = 3 = n , 2 «¬1 1 2 »¼ ª +3 1 1º 1« V = « 1 +3 1»» , 2 «¬ 1 1 +3»¼ 1
and S = Diag(0,1,1), rk S = 2 ,
6-1 Setup of the best linear minimum bias estimator of type BLUMBE
309
ȟˆ = (I 3 + SA cV 1A) 1 SA cV 1y , ª 1 0 0º ª 21 0 0 º 1 « « » 1 1 1 c c I 3 + SA V A = « 2 5 2 » , (I 3 + SA V A) = «14 5 2 »» , 21 «¬ 2 2 5 »¼ «¬14 2 5 »¼ ª hĮ º ˆȟ (V, S-BLE) = « h » = « ȕ» « hȖ » ¬ ¼ V ,S -BLE 0 0º 0 ª0 ª º 1 « 1 « » = «14 6 4 » y = «10 y1 6 y2 4 y3 »» . 21 21 «¬ 4 «¬ 4 y1 + 6 y2 10 y3 »¼ 6 10 »¼ Dispersion matrix D{ȟˆ | V, S-BLE} of the unknown vector of “fixed effects” D{ȟˆ | V, S-BLE} = V 2SA cV 1A[I 3 + SA cV 1A]1 S , ª0 0 0 º V2 « ˆ D{ȟ | V, S-BLE} = 0 76 22 »» . « 441 «¬0 22 76 »¼ Bias vector ȕ(V, S-BLE) of the unknown vector of “fixed effects” ȕ = [I 3 + SA cV 1 A]1 ȟ 21[1 ª 21 0 0 º ª º 1 « 1 « » ȕ = «14 5 2 » ȟ = «14[1 + 5[ 2 + 2[ 3 »» . 21 21 «¬14 2 5 »¼ «¬14[1 + 2[ 2 + 5[ 3 »¼ Mean Square Estimation Error MSE{ȟ | V , S-BLE} MSE{ȟ | V, S-BLE} = V 2 [I 3 + SA cVA]1 S ª0 0 0 º V2 « MSE{ȟ | V, S-BLE} = 0 5 2 »» . 21 « «¬0 2 5 »¼ Residual vector e y and dispersion matrix D{e y } of the “random effect” e y e y (V , S-BLE) = [I 3 + ASA cV 1 ]1 y
310
6 The third problem of probabilistic regression
e y {V , S-BLE} =
ª11 6 4 º ª11 y1 + 6 y2 + 4 y3 º 1 « » y = 1 « 6y + 9y + 6y » 6 9 6 1 2 3 » » 21 « 21 « «¬ 4 6 11»¼ «¬ 4 y1 + 6 y2 + 11y3 »¼
D{e y (V, S-BLE)} = V 2 [I 3 + ASA cV 1 ]2 V ª614 585 565 º V2 « D{e y (V, S-BLE)} = 585 594 585 »» . 882 « «¬565 585 614 »¼ Correlations C{e y , Aȟˆ} = V 2 (I 3 + ASAcV 1 ) 2 ASAc ª 29 9 20 º 2 V « 9 18 9 » . C{e y , Aȟˆ} = » 441 « «¬ 20 9 29 »¼ Comparisons BLUMBE-BLE (i) ȟˆ BLUMBE ȟˆ BLE ȟˆ V ,S BLUMBE ȟˆ V ,S -BLE = SA c( ASA cV 1ASA c) ASA c( ASA c + V ) 1 y ȟˆ V ,S BLUMBE ȟˆ V ,S -BLE
0º 0 ª0 0 ª º 1 « 1 « » = « 4 1 3 » y = « 4 y1 y2 3 y3 »» . 21 21 «¬ 3 1 4 »¼ «¬3 y1 + y2 4 y3 »¼
(ii) D{ȟˆ BLUMBE } D{ȟˆ BLE } D{ȟˆ V ,S -BLUMBE } D{ȟˆ V ,S -BLE } = = V SA c( ASA cV ASA c) ASA c( ASA c + V ) 1 ASA c(ASA cV 1ASA c) AV + 2
1
V 2 SA c( ASA cV 1 ASA c) ASA c( ASA c + V ) 1 ASA c(ASA c + V ) 1 ASA c(ASA cV 1ASA c) AS
0 0º ª0 2 V « D{ȟˆ V ,S -BLUMBE } D{ȟˆ V ,S -BLE } = 0 142 103»» positive semidefinite. 882 « «¬ 0 103 142 »¼ (iii) MSE{ȟˆ BLUMBE } MSE{ȟˆ BLE } MSE{ȟˆ V ,S -BLUMBE } MSE{ȟˆ V ,S -BLE } = = V 2SA c( ASAcV 1ASAc) ASAc( ASAc + V ) 1 ASAc( ASAcV 1 ASAc) AS
6-1 Setup of the best linear minimum bias estimator of type BLUMBE
MSE{ȟˆ V ,S BLUMBE } MSE{ȟˆ V ,S BLE } =
311
ª0 0 0 º V2 « 0 4 3 »» positive semidefinite. « 42 «¬0 3 4 »¼
Summarizing, let us compare I,I-BLUMBE versus I,I-BLE and V,S-BLUMBE versus V,S-BLE! ȟˆ BLUMBE ȟˆ BLE , D{ȟˆ BLUMBE } D{ȟˆ BLE } and MSE{ȟˆ BLUMBE } MSE{ȟˆ BLE } as well as ȟˆ V ,S -BLUMBE ȟˆ V,S -BLE , D{ȟˆ V ,S -BLUMBE } D{ȟˆ V ,S -BLE } and MSE{ȟˆ V ,S -BLUMBE } MSE{ȟˆ V ,S -BLE } result positive semidefinite: In consequence, for three different measures of distorsions BLE is in favor of BLIMBE: BLE produces smaller errors in comparing with BLIMBE! Finally let us compare weighted BIQUUE and weighted BIQE: (i)
Weighted BIQUUE Vˆ 2 and weighted BIQE Vˆ 2
Vˆ 2 = (n r ) 1 y cV 1e y =
versus
= (n r ) 1 e cy V 1e y
Vˆ 2 = (n r + 2)y cV 1e y = = (n r + 2)e cy V 1e y
(e y ) V ,S -BLUMBE
ª4 1 1º 1« = «1 4 1 »» y 6 «¬1 1 4 »¼
ª +3 1 1º 1« r = rk A = 2, n = 3, V = « 1 +3 1»» 2 «¬ 1 1 +3»¼ 1
(ii)
1 Vˆ 2 = ( y12 + y22 + y32 ) 2
versus
1 Vˆ 2 = ( y12 + y22 + y32 ) 6
D{Vˆ 2 | BIQUUE}
versus
D{Vˆ 2 | BIQE}
D{Vˆ 2 } = 2(n r ) 1 V 4
versus
D{Vˆ 2 } = 2(n r )(n r + 2) 1 V 4
D{Vˆ 2 } = 2V 4
versus
2 D{Vˆ 2 } = V 4 9
312
6 The third problem of probabilistic regression
Dˆ {Vˆ 2 } = 2(n r ) 1 (Vˆ 2 ) 2
versus
Eˆ {Vˆ 2 V 2 } = = 2(n r + 2) 1 e cy V 1e y
1 Dˆ {Vˆ 2 } = ( y12 + y22 + y32 ) 2
versus
1 Eˆ {Vˆ 2 V 2 } = ( y12 + y22 + y32 ) 9 Eˆ {Vˆ 2 V 2 } = 2(n r + 2) 1 (e cy V 1e y ) 2 1 Eˆ {(Vˆ 2 V 2 )} = ( y12 + y22 + y32 ) . 54
(iii)
(e y ) BLUMBE = [I n A( A cV 1A) AV 1 ](e y ) BLE (Vˆ 2 ) BIQUUE = ( n r )(e cy ) BLE [ V 1 V 1A( A cV 1A) AV 1 ](e y ) BLE 1 Vˆ 2BIQUUE Vˆ 2BIQE = ( y12 + y22 + y32 ) positive. 3
2 We repeat that the difference Vˆ 2BIQUUE Vˆ BIQE is in favor of Vˆ 2BIQE < Vˆ 2BIQUUE .
6-2 Setup of the best linear estimators of type hom BLE, hom SBLE and hom Į-BLE for fixed effects Numerical tests have documented that ȟˆ of type Ȉ - BLUUE of ȟ is not robust against outliers in the stochastic vector y observations. It is for this reason that we give up the postulate of unbiasedness, but keeping the set up of a linear estimation ȟˆ = Ly of homogeneous type. The matrix L is uniquely determined by the D weighted hybrid norm of type minimum variance || D{ȟˆ} ||2 and minimum bias || I LA ||2 . For such a homogeneous linear estimation (2.21) by means of Box 6.4 let us specify the real-valued, nonstochastic bias vector ȕ:= E{ȟˆ ȟ} = E{ȟˆ}ȟ of type (6.11), (6.12), (6.13) and the real-valued, nonstochastic bias matrix I m LA (6.74) in more detail. First, let us discuss why a setup of an inhomogeneous linear estimation is not suited to solve our problem. In the case of an unbiased estimator, the setup of an inhomogeneous linear estimation ȟˆ = Ly + ț led us to E{ȟˆ} = ȟ the postulate of unbiasedness if and only if E{ȟˆ} ȟ := LE{y} ȟ + ț = (I m LA)ȟ + ț = 0 for all ȟ R m or LA = I m and ț = 0. Indeed the postulate of unbiasedness restricted the linear operator L to be the (non-unique) left inverse L = A L as well as the vector ț of inhomogeneity to zero. In contrast the bias vector ȕ := E{ȟˆ ȟ} = E{ȟˆ} ȟ = LE{y} ȟ = (I m LA)ȟ + ț for a setup of an inhomogeneous linear estimation should approach zero if ȟ = 0 is chosen as a special case. In order to include this case in the linear biased estimation procedure we set ț = 0 .
6-2 Setup of the best linear estimators fixed effects
313
Second, we focus on the norm (2.79) namely || ȕ ||2 := E{(ȟˆ ȟ )c(ȟˆ ȟ )} of the bias vector ȕ , also called Mean Square Error MSE{ȟˆ} of ȟˆ . In terms of a setup of a homogeneous linear estimation, ȟˆ = Ly , the norm of the bias vector is represented by (I m LA)cȟȟ c(I m LA) or by the weighted Frobenius matrix norm 2 || (I m LA)c ||ȟȟ c where the weight matrix ȟȟ c, rk ȟȟ c = 1, has rank one. Obviously 2 || (I m LA)c ||ȟȟ c is only a semi-norm. In addition, ȟȟ c is not accessible since ȟ is unknown. In this problematic case we replace the matrix ȟȟ c by a fixed, positive definite m×m matrix S and define the S-weighted Frobenius matrix norm || (I m LA)c ||S2 of type (2.82) of the bias matrix I m LA . Indeed by means of the rank identity, rk S=m we have chosen a weight matrix of maximal rank. Now we are prepared to understand intuitively the following. Here we focus on best linear estimators of type hom BLE, hom S-BLE and hom Į-BLE of fixed effects ȟ, which turn out to be better than the best linear uniformly unbiased estimator of type hom BLUUE, but suffer from the effect to be biased. At first let us begin with a discussion about the bias vector and the bias matrix as well as the Mean Square Estimation Error MSE{ȟˆ} with respect to a homogeneous linear estimation ȟˆ = Ly of fixed effects ȟ based upon Box 6.4. Box 6.4 Bias vector, bias matrix Mean Square Estimation Error in the special Gauss–Markov model with fixed effects E{y} = Aȟ
(6.71)
D{y} = Ȉ y
(6.72)
“ansatz” ȟˆ = Ly
(6.73)
bias vector ȕ := E{ȟˆ ȟ} = E{ȟˆ} ȟ
(6.74)
ȕ = LE{y} ȟ = [I m LA] ȟ
(6.75)
bias matrix B := I m LA
(6.76)
decomposition ȟˆ ȟ = (ȟˆ E{ȟˆ}) + ( E{ȟˆ} ȟ )
(6.77)
ȟˆ ȟ = L(y E{y}) [I m LA] ȟ
(6.78)
314
6 The third problem of probabilistic regression
Mean Square Estimation Error MSE{ȟˆ} := E{(ȟˆ ȟ )(ȟˆ ȟ )c}
(6.79)
MSE{ȟˆ} = LD{y}Lc + [I m LA ] ȟȟ c [I m LA ]c
(6.80)
( E{ȟˆ E{ȟˆ}} = 0) modified Mean Square Estimation Error MSES {ȟˆ} := LD{y}Lc + [I m LA] S [I m LA]c
(6.81)
Frobenius matrix norms || MSE{ȟˆ} ||2 := tr E{(ȟˆ ȟ )(ȟˆ ȟ )c}
(6.82)
|| MSE{ȟˆ} ||2 = = tr LD{y}Lc + tr[I m LA] ȟȟ c [I m LA]c
(6.83)
= || Lc ||
2 6y
+ || (I m LA)c ||
2 ȟȟ '
|| MSES {ȟˆ} ||2 := := tr LD{y}Lc + tr[I m LA]S[I m LA]c
(6.84)
= || Lc ||6y + || (I m LA)c ||S 2
2
hybrid minimum variance – minimum bias norm Į-weighted L(L) := || Lc ||62 y + 1 || (I m LA)c ||S2 D
(6.85)
special model dim R (SAc) = rk SAc = rk A = m .
(6.86)
The bias vector ȕ is conventionally defined by E{ȟˆ} ȟ subject to the homogeneous estimation form ȟˆ = Ly . Accordingly the bias vector can be represented by (6.75) ȕ = [I m LA] ȟ . Since the vector ȟ of fixed effects is unknown, there has been made the proposal to use instead the matrix I m LA as a matrix-valued measure of bias. A measure of the estimation error is the Mean Square estimation error MSE{ȟˆ} of type (6.79). MSE{ȟˆ} can be decomposed into two basic parts: • the dispersion matrix D{ȟˆ} = LD{y}Lc •
the bias product ȕȕc.
Indeed the vector ȟˆ ȟ can be decomposed as well into two parts of type (6.77), (6.78), namely (i) ȟˆ E{ȟˆ} and (ii) E{ȟˆ} ȟ which may be called estimation
315
6-2 Setup of the best linear estimators fixed effects
error and bias, respectively. The double decomposition of the vector ȟˆ ȟ leads straightforward to the double representation of the matrix MSE{ȟˆ} of type (6.80). Such a representation suffers from two effects: Firstly the vector ȟ of fixed effects is unknown, secondly the matrix ȟȟ c has only rank 1. In consequence, the matrix [I m LA] ȟȟ c [I m LA]c has only rank 1, too. In this situation there has been made a proposal to modify MSE{ȟˆ} with respect to ȟȟ c by the regular matrix S. MSES {ȟˆ} has been defined by (6.81). A scalar measure of MSE{ȟˆ} as well as MSES {ȟˆ} are the Frobenius norms (6.82), (6.83), (6.84). Those scalars constitute the optimal risk in Definition 6.7 (hom BLE) and Definition 6.8 (hom S-BLE). Alternatively a homogeneous Į-weighted hybrid minimum varianceminimum bias estimation (hom Į-BLE) is presented in Definition 6.9 (hom ĮBLE) which is based upon the weighted sum of two norms of type (6.85), namely •
average variance || Lc ||62 y = tr L6 y Lc
•
average bias || (I m LA)c ||S2 = tr[I m LA] S [I m LA]c.
The very important estimator Į-BLE is balancing variance and bias by the weight factor Į which is illustrated by Figure 6.1.
min bias
balance between variance and bias
min variance
Figure 6.1 Balance of variance and bias by the weight factor Į Definition 6.7 ( ȟˆ hom BLE of ȟ ): An m×1 vector ȟˆ is called homogeneous BLE of ȟ in the special linear Gauss-Markov model with fixed effects of Box 6-3, if (1st) ȟˆ is a homogeneous linear form ȟˆ = Ly ,
(6.87)
(2nd) in comparison to all other linear estimations ȟˆ has the minimum Mean Square Estimation Error in the sense of || MSE{ȟˆ} ||2 = = tr LD{y}Lc + tr[I m LA] ȟȟ c [I m LA]c = || Lc ||6y + || (I m LA)c || 2
2 ȟȟ c
.
(6.88)
316
6 The third problem of probabilistic regression
Definition 6.8
( ȟˆ S-BLE of ȟ ):
An m×1 vector ȟˆ is called homogeneous S-BLE of ȟ in the special linear Gauss-Markov model with fixed effects of Box 6.3, if (1st) ȟˆ is a homogeneous linear form ȟˆ = Ly ,
(6.89)
(2nd) in comparison to all other linear estimations ȟˆ has the minimum S-modified Mean Square Estimation Error in the sense of || MSES {ȟˆ} ||2 := := tr LD{y}Lc + tr[I m LA]S[I m LA]c
(6.90)
= || Lc ||62 y + || (I m LA)c ||S2 = min . L
Definition 6.9 ( ȟˆ hom hybrid min var-min bias solution, Į-weighted or hom Į-BLE): An m×1 vector ȟˆ is called homogeneous Į-weighted hybrid minimum variance- minimum bias estimation (hom Į-BLE) of ȟ in the special linear Gauss-Markov model with fixed effects of Box 6.3, if (1st) ȟˆ is a homogeneous linear form ȟˆ = Ly ,
(6.91)
(2nd) in comparison to all other linear estimations ȟˆ has the minimum variance-minimum bias in the sense of the Į-weighted hybrid norm tr LD{y}Lc + 1 tr (I m LA ) S (I m LA )c D = || Lc ||62 + 1 || (I m LA)c ||S2 = min , L D
(6.92)
y
in particular with respect to the special model
D \ + , dim R (SA c) = rk SA c = rk A = m . The estimations ȟˆ hom BLE, hom S-BLE and hom Į-BLE can be characterized as following: Lemma 6.10 (hom BLE, hom S-BLE and hom Į-BLE): An m×1 vector ȟˆ is hom BLE, hom S-BLE or hom Į-BLE of ȟ in the special linear Gauss-Markov model with fixed effects of Box 6.3, if and only if the matrix Lˆ fulfils the normal equations
317
6-2 Setup of the best linear estimators fixed effects
(1st)
hom BLE: ( Ȉ y + Aȟȟ cA c)Lˆ c = Aȟȟ c
(2nd)
(3rd)
(6.93)
hom S-BLE: ˆ c = AS ( Ȉ y + ASAc)L
(6.94)
( Ȉ y + 1 ASAc)Lˆ c = 1 AS . D D
(6.95)
hom Į-BLE:
:Proof: (i) hom BLE: The hybrid norm || MSE{ȟˆ} ||2 establishes the Lagrangean
L (L) := tr L6 y Lc + tr (I m LA) ȟȟ c (I m LA)c = min L
for ȟˆ hom BLE of ȟ . The necessary conditions for the minimum of the quadratic Lagrangean L (L) are wL ˆ (L) := 2[6 y Lˆ c + Aȟȟ cA cLˆ c Aȟȟ c] = 0 wL which agree to the normal equations (6.93). (The theory of matrix derivatives is reviewed in Appendix B (Facts: derivative of a scalar-valued function of a matrix: trace). The second derivatives w2 L (Lˆ ) > 0 w (vecL)w (vecL)c at the “point” Lˆ constitute the sufficiency conditions. In order to compute such an mn×mn matrix of second derivatives we have to vectorize the matrix normal equation wL ˆ ( L ) := 2Lˆ (6 y + Aȟȟ cA c) 2ȟ ȟ cA c , wL wL ( Lˆ ) := vec[2 Lˆ (6 y + Aȟȟ cA c) 2ȟȟ cA c] . w (vecL )
(ii) hom S-BLE: The hybrid norm || MSEs {ȟˆ} ||2 establishes the Lagrangean
L (L) := tr L6 y Lc + tr (I m LA) S (I m LA)c = min L
318
6 The third problem of probabilistic regression
for ȟˆ hom S-BLE of ȟ . Following the first part of the proof we are led to the necessary conditions for the minimum of the quadratic Lagrangean L (L) wL ˆ (L) := 2[6 y Lˆ c + ASAcLˆ c AS]c = 0 wL as well as to the sufficiency conditions w2 L (Lˆ ) = 2[( Ȉ y + ASAc)
I m ] > 0 . w (vecL)w ( vecL)c The normal equations of hom S-BLE
wL wL (Lˆ ) = 0 agree to (6.92).
(iii) hom Į-BLE: The hybrid norm || Lc ||62 + 1 || ( I m - LA )c ||S2 establishes the Lagrangean D y
L (L) := tr L6 y Lc + 1 tr (I m - LA)S(I m - LA)c = min L D for ȟˆ hom Į-BLE of ȟ . Following the first part of the proof we are led to the necessary conditions for the minimum of the quadratic Lagrangean L (L) wL ˆ (L) = 2[( Ȉ y + Aȟȟ cA c)
I m ]vecLˆ 2vec(ȟȟ cA c) . wL The Kronecker-Zehfuss product A
B of two arbitrary matrices as well as ( A + B)
C = A
B + B
C of three arbitrary matrices subject to dim A = dim B is introduced in Appendix A, “Definition of Matrix Algebra: multiplication of matrices of the same dimension (internal relation) and Laws”. The vec operation (vectorization of an array) is reviewed in Appendix A, too, “Definition, Facts: vecAB = (Bc
I cA )vecA for suitable matrices A and B”. Now we are prepared to compute w2 L (Lˆ ) = 2[(6 y + Aȟȟ cA c)
I m ] > 0 w (vecL)w (vecL)c as a positive definite matrix. The theory of matrix derivatives is reviewed in Appendix B “Facts: Derive of a matrix-valued function of a matrix, namely w (vecX) w (vecX)c ”. wL ˆ ˆ c+ Ȉ L ˆ c 1 AS]cD ( L) = 2[ 1 ASA cL y D D wL as well as to the sufficiency conditions
319
6-2 Setup of the best linear estimators fixed effects
w2 L (Lˆ ) = 2[( 1 ASA c + Ȉ y )
I m ] > 0. D w (vecL)w (vecL)c The normal equations of hom Į-BLE wL wL (Lˆ ) = 0 agree to (6.93).
h For an explicit representation of ȟˆ as hom BLE, hom S-BLE and hom Į-BLE of ȟ in the special Gauss–Markov model with fixed effects of Box 6.3, we solve the normal equations (6.94), (6.95) and (6.96) for Lˆ = arg{L (L) = min} . L
Beside the explicit representation of ȟˆ of type hom BLE, hom S-BLE and hom Į-BLE we compute the related dispersion matrix D{ȟˆ} , the Mean Square Estimation Error MSE{ȟˆ}, the modified Mean Square Estimation Error MSES {ȟˆ} and MSED ,S {ȟˆ} in Theorem 6.11 ( ȟˆ hom BLE): Let ȟˆ = Ly be hom BLE of ȟ in the special linear Gauss-Markov model with fixed effects of Box 6.3. Then equivalent representations of the solutions of the normal equations (6.93) are ȟˆ = ȟȟ cA c[ Ȉ y + Aȟȟ cA c]1 y
(6.96)
(if [6 y + Aȟȟ cA c]1 exists) and completed by the dispersion matrix D{ȟˆ} = ȟȟ cA c[ Ȉ y + Aȟȟ cA c]1 Ȉ y × × [ Ȉ y + Aȟȟ cA c]1 Aȟȟ c ,
(6.97)
by the bias vector ȕ := E{ȟˆ} ȟ = [I m ȟȟ cA c( Aȟȟ cA c + Ȉ y ) 1 A] ȟ
(6.98)
and by the matrix of the Mean Square Estimation Error MSE{ȟˆ} :
MSE{ȟˆ}:= E{(ȟˆ ȟ)(ȟˆ ȟ)c} = D{ȟˆ} + ȕȕc
(6.99)
320
6 The third problem of probabilistic regression
MSE{ȟˆ} := D{ȟˆ} + [I m ȟȟ cA c( Aȟȟ cA c + Ȉ y ) 1 A] ×
(6.100)
×ȟȟ c [I m Ac( Aȟȟ cA c + Ȉ y ) Aȟȟ c]. 1
At this point we have to comment what Theorem 6.11 tells us. hom BLE has generated the estimation ȟˆ of type (6.96), the dispersion matrix D{ȟˆ} of type (6.97), the bias vector of type (6.98) and the Mean Square Estimation Error of type (6.100) which all depend on the vector ȟ and the matrix ȟȟ c , respectively. We already mentioned that ȟ and the matrix ȟȟ c are not accessible from measurements. The situation is similar to the one in hypothesis testing. As shown later in this section we can produce only an estimator ȟ and consequently can setup a hypothesis ȟ 0 of the "fixed effect" ȟ . Indeed, a similar argument applies to the second central moment D{y} ~ Ȉ y of the "random effect" y, the observation vector. Such a dispersion matrix has to be known in order to be able to compute ȟˆ , D{ȟˆ} , and MSE{ȟˆ} . Again we have to apply the argument that we are ˆ and to setup a hypothesis about only able to construct an estimate Ȉ y D{y} ~ 6 y . Theorem 6.12 ( ȟˆ hom S-BLE): Let ȟˆ = Ly be hom S-BLE of ȟ in the special linear GaussMarkov model with fixed effects of Box 6.3. Then equivalent representations of the solutions of the the normal equations (6.94) are ȟˆ = SA c( Ȉ y + ASA c) 1 y
(6.101)
ȟˆ = ( A cȈ y1A + S 1 ) 1 AcȈ y1y
(6.102)
ȟˆ = (I m + SA cȈ y1A) 1 SA c6 y1y
(6.103)
(if S 1 , Ȉ y1 exist) are completed by the dispersion matrices D{ȟˆ} = SA c( ASAc + Ȉ y ) 1 Ȉ y ( ASAc + Ȉ y ) 1 AS D{ȟˆ} = ( A cȈ A + S ) Ac6 A( A cȈ A + S ) 1 y
1 1
1 y
1 y
1 1
(6.104) (6.105)
(if S 1 , Ȉ y1 exist) by the bias vector ȕ := E{ȟˆ} ȟ = [I m SA c( ASA c + Ȉ y ) 1 A] ȟ ȕ = [I m ( A cȈ y1A + S 1 ) 1 A c6 y1A] ȟ
(6.106)
321
6-2 Setup of the best linear estimators fixed effects
(if S 1 , Ȉ y1 exist) and by the matrix of the modified Mean Square Estimation Error MSE{ȟˆ} : MSES {ȟˆ} := E{(ȟˆ ȟ )(ȟˆ ȟ )c} = D{ȟˆ} + ȕȕc
(6.107)
MSES {ȟˆ} = SA c( ASA c + Ȉ y ) 1 Ȉ y ( ASA c + Ȉ y ) 1 AS + +[I m SA c( ASA c + Ȉ y ) 1 A] ȟȟ c [I m Ac( ASAc + Ȉ y ) 1 AS] =
(6.108)
= S SA c( ASA c + Ȉ y ) AS 1
MSES {ȟˆ} = ( A cȈ y1A + S 1 ) 1 A cȈ y1A( A cȈ y1A + S 1 )1 + + [I m ( A cȈ y1A + S 1 ) 1 A cȈ y1A] ȟȟ c × × [I m A cȈ y1A( A cȈ y1A + S 1 ) 1 ]
(6.109)
= ( A cȈ y1A + S 1 ) 1 (if S 1 , Ȉ y1 exist). The interpretation of hom S-BLE is even more complex. In extension of the comments to hom BLE we have to live with another matrix-valued degree of freedom, ȟˆ of type (6.101), (6.102), (6.103) and D{ȟˆ} of type (6.104), (6.105) do no longer depend on the inaccessible matrix ȟȟ c , rk(ȟȟ c) , but on the "bias weight matrix" S, rk S = m. Indeed we can associate any element of the bias matrix with a particular weight which can be "designed" by the analyst. Again the bias vector ȕ of type (6.106) as well as the Mean Square Estimation Error of type (6.107), (6.108), (6.109) depend on the vector ȟ which is inaccessible. Beside the "bias weight matrix S" ȟˆ , D{ȟˆ} , ȕ and MSEs {ȟˆ} are vector-valued or matrix-valued functions of the dispersion matrix D{y} ~ 6 y of the stochastic observation vector which is inaccessible. By hypothesis testing we may decide y . upon the construction of D{y} ~ 6 y from an estimate 6 Theorem 6.13 ( ȟˆ hom
D -BLE):
Let ȟˆ = /y be hom D -BLE of ȟ in the special linear GaussMarkov model with fixed effects Box 6.3. Then equivalent representations of the solutions of the normal equations (6.95) are ȟˆ = 1 SA c( Ȉ y + 1 ASA c) 1 y D D
(6.110)
ȟˆ = ( A cȈ y1A + D S 1 ) 1 A cȈ y1y
(6.111)
ȟˆ = (I m + 1 SA cȈ y1A) 1 1 SA cȈ y1y D D
(6.112)
322
6 The third problem of probabilistic regression
(if S 1 , Ȉ y1 exist) are completed by the dispersion matrix D{ȟˆ} = 1 SA c( Ȉ y + 1 ASA c) 1 Ȉ y ( Ȉ y + 1 ASA c) 1 AS 1 D D D D
(6.113)
D{ȟˆ} = ( A cȈ y1A + D S 1 ) 1 A cȈ y1A( A cȈ y1A + D S 1 )1
(6.114)
(if S 1 , Ȉ y1 exist), by the bias vector ȕ := E{ȟˆ} ȟ = [I m 1 SA c( 1 ASAc + Ȉ y ) 1 A] ȟ D D ȕ = [I m ( AcȈ y1 A + D S 1 ) 1 AcȈ y1A] ȟ
(6.115)
(if S 1 , Ȉ y1 exist) and by the matrix of the Mean Square Estimation Error MSE{ȟˆ} : MSE{ȟˆ} := E{(ȟˆ ȟ )(ȟˆ ȟ )c} = D{ȟˆ} + ȕȕc
(6.116)
MSED , S {ȟˆ} = SCc( ASA c + Ȉ y ) 1 Ȉ y ( ASA c + Ȉ y ) 1 AS + + [I m - 1 SAc( 1 ASA c + Ȉ y ) 1 A] ȟȟ c × D D × [I m - A c( 1 ASA c + Ȉ y ) 1 AS 1 ] = D D 1 1 1 1 1 = S SAc( ASAc + Ȉ y ) AS D D D D
(6.117)
MSED , S {ȟˆ} = ( A cȈ y1A + D S 1 ) 1 A cȈ y1A( A cȈ y1A + D S 1 ) 1 + + [I m - ( A cȈ y1A + D S 1 ) 1 A cȈ y1A] ȟȟ c × × [I m - A cȈ y1A( A cȈ y1A + D S 1 ) 1 ]
(6.118)
= ( A cȈ y1A + D S 1 ) 1 (if S 1 , Ȉ y1 exist). The interpretation of the very important estimator hom Į-BLE ȟˆ of ȟ is as follows: ȟˆ of type (6.111), also called ridge estimator or Tykhonov-Phillips regulator, contains the Cayley inverse of the normal equation matrix which is additively decomposed into A cȈ y1A and D S 1 . The weight factor D balances the first
6-2 Setup of the best linear estimators fixed effects
323
inverse dispersion part and the second inverse bias part. While the experiment l y , the bias weight matrix informs us of the variance-covariance matrix Ȉ y , say Ȉ S and the weight factor D are at the disposal of the analyst. For instance, by the choice S = Diag( s1 ,..., sA ) we may emphasize increase or decrease of certain bias matrix elements. The choice of an equally weighted bias matrix is S = I m . In contrast the weight factor D can be determined by the A-optimal design of type •
tr D{ȟˆ} = min
•
ȕȕc = min
•
tr MSED , S {ȟˆ} = min .
D
D
D
In the first case we optimize the trace of the variance-covariance matrix D{ȟˆ} of type (6.113), (6.114). Alternatively by means of ȕȕ ' = min we optimize D the quadratic bias where the bias vector ȕ of type (6.115) is chosen, regardless of the dependence on ȟ . Finally for the third case – the most popular one – we minimize the trace of the Mean Square Estimation Error MSED , S {ȟˆ} of type (6.118), regardless of the dependence on ȟȟ c . But beforehand let us present the proof of Theorem 6.10, Theorem 6.11 and Theorem 6.8. Proof: (i) ȟˆ = ȟȟ cA c[ Ȉ y + Aȟȟ cA c]1 y If the matrix Ȉ y + Aȟȟ cA c of the normal equations of type hom BLE is of full rank, namely rk(Ȉ y + Aȟȟ cA c) = n, then a straightforward solution of (6.93) is Lˆ = ȟȟ cA c[ Ȉ y + Aȟȟ cA c]1. (ii) ȟˆ = SA c( Ȉ y + ASA c) 1 y If the matrix Ȉ y + ASAc of the normal equations of type hom S-BLE is of full rank, namely rk(Ȉ y + ASA c) = n, then a straightforward solution of (6.94) is Lˆ = SAc( Ȉ y + ASAc) 1. (iii) z = ( A cȈ y1A + S 1 ) 1 AcȈ y1y Let us apply by means of Appendix A (Fact: Cayley inverse: sum of two matrices, s(10), Duncan-Guttman matrix identity) the fundamental matrix identity SA c( Ȉ y + ASA c) 1 = ( A cȈ y1A + S 1 ) 1 A cȈ y1 , if S 1 and Ȉ y1 exist. Such a result concludes this part of the proof. (iv) ȟˆ = (I m + SA cȈ y1A) 1 SA cȈ y1y Let us apply by means of Appendix A (Fact: Cayley inverse: sum of two matrices, s(9)) the fundamental matrix identity
324
6 The third problem of probabilistic regression
SA c( Ȉ y + ASAc) 1 = (I m + SAcȈ y1 A) 1 SAcȈ y1 , if Ȉ y1 exists. Such a result concludes this part of the proof. (v) ȟˆ = 1 SA c( Ȉ y + 1 ASA c) 1 y D D If the matrix Ȉ y + D1 ASA c of the normal equations of type hom Į-BLE is of full rank, namely rk(Ȉ y + D1 ASA c) = n, then a straightforward solution of (6.95) is Lˆ = 1 SA c[ Ȉ y + 1 ASAc]1 . D D (vi) ȟˆ = ( A cȈ y1A + D S 1 ) 1 A cȈ y1y Let us apply by means of Appendix A (Fact: Cayley inverse: sum of two matrices, s(10), Duncan-Guttman matrix identity) the fundamental matrix identity 1 SAc( Ȉ + ASAc) 1 = ( AcȈ 1 A + D S 1 ) 1 AcȈ 1 y y y D if S 1 and Ȉ y1 exist. Such a result concludes this part of the proof. (vii) ȟˆ = (I m + 1 SA cȈ y1A) 1 1 SA cȈ y1y D D Let us apply by means of Appendix A (Fact: Cayley inverse: sum of two matrices, s(9), Duncan-Guttman matrix identity) the fundamental matrix identity 1 SA c( Ȉ + ASA c) 1 = (I + 1 SA cȈ 1A ) 1 1 SA cȈ 1 m y y y D D D if Ȉ y1 exist. Such a result concludes this part of the proof. (viii) hom BLE: D{ȟˆ} D{ȟˆ} := E{[ȟˆ E{ȟˆ}][ȟˆ E{ȟˆ}]c} = = ȟȟ cA c[ Ȉ y + Aȟȟ cA c]1 Ȉ y [ Ȉ y + Aȟȟ cA c]1 Aȟȟ c. By means of the definition of the dispersion matrix D{ȟˆ} and the substitution of ȟˆ of type hom BLE the proof has been straightforward. (ix) hom S-BLE: D{ȟˆ} (1st representation) D{ȟˆ} := E{[ȟˆ E{ȟˆ}][ȟˆ E{ȟˆ}]c} = = SA c( ASA c + Ȉ y ) 1 Ȉ y ( ASA c + Ȉ y ) 1 AS. By means of the definition of the dispersion matrix D{ȟˆ} and the substitution of ȟˆ of type hom S-BLE the proof of the first representation has been straightforward.
6-2 Setup of the best linear estimators fixed effects
325
(x) hom S-BLE: D{ȟˆ} (2nd representation) D{ȟˆ} := E{[ȟˆ E{ȟˆ}][ȟˆ E{ȟˆ}]c} = = ( A cȈ y1A + S 1 ) 1 Ac6 y1A( A cȈ y1A + S 1 )1 , if S 1 and Ȉ y1 exist. By means of the definition of the dispersion matrix D{ȟˆ} and the substitution of ȟˆ of type hom S-BLE the proof of the second representation has been straightforward. (xi) hom Į-BLE: D{ȟˆ} (1st representation) ˆ D{ȟ} := E{[ȟˆ E{ȟˆ}][ȟˆ E{ȟˆ}]c} = = 1 SA c( Ȉ y + 1 ASA c) 1 Ȉ y ( Ȉ y + 1 ASA c) 1 AS 1 . D D D D By means of the definition of the dispersion matrix D{ȟˆ} and the substitution of ȟˆ of type hom Į-BLE the proof of the first representation has been straightforward. (xii) hom Į-BLE: D{ȟˆ} (2nd representation) D{ȟˆ} := E{[ȟˆ E{ȟˆ}][ȟˆ E{ȟˆ}]c} = = ( A cȈ y1A + D S 1 ) 1 AcȈ y1A( AcȈ y1A + D S 1 )1 , if S 1 and Ȉ y1 exist. By means of the definition of the dispersion matrix and the D{ȟˆ} substitution of ȟˆ of type hom Į-BLE the proof of the second representation has been straightforward. (xiii) bias ȕ for hom BLE, hom S-BLE and hom Į-BLE As soon as we substitute into the bias ȕ := E{ȟˆ} ȟ = ȟ + E{ȟˆ} the various estimators ȟˆ of the type hom BLE, hom S-BLE and hom Į-BLE we are directly led to various bias representations ȕ of type hom BLE, hom S-BLE and hom ĮBLE. (xiv) MSE{ȟˆ} of type hom BLE, hom S-BLE and hom Į-BLE MSE{ȟˆ} := E{(ȟˆ ȟ )(ȟˆ ȟ )c} ȟˆ ȟ = ȟˆ E{ȟˆ} + ( E{ȟˆ} ȟ ) E{(ȟˆ ȟ )(ȟˆ ȟ )c} = E{(ȟˆ E{ȟˆ})((ȟˆ E{ȟˆ})c} +( E{ȟˆ} ȟ )( E{ȟˆ} ȟ )c MSE{ȟˆ} = D{ȟˆ} + ȕȕc .
326
6 The third problem of probabilistic regression
At first we have defined the Mean Square Estimation Error MSE{ȟˆ} of ȟˆ . Secondly we have decomposed the difference ȟˆ ȟ into the two terms • •
ȟˆ E{ȟˆ} E{ȟˆ} ȟ
in order to derive thirdly the decomposition of MSE{ȟˆ} , namely • •
the dispersion matrix of ȟˆ , namely D{ȟˆ} , the quadratic bias ȕȕc .
As soon as we substitute MSE{ȟˆ} the dispersion matrix D{ȟˆ} and the bias vector ȕ of various estimators ȟˆ of the type hom BLE, hom S-BLE and hom D -BLE we are directly led to various representations ȕ of the Mean Square Estimation Error MSE{ȟˆ} . Here is my proof’s end.
h
7
A spherical problem of algebraic representation - inconsistent system of directional observational equations - overdetermined system of nonlinear equations on curved manifolds
“Least squares regression is not appropriate when the response variable is circular, and can lead to erroneous results. The reason for this is that the squared difference is not an appropriate measure of distance on the circle.” U. Lund (1999) A typical example of a nonlinear model is the inconsistent system of nonlinear observational equations generated by directional measurements (angular observations, longitudinal data). Here the observation space Y as well as the parameter space X is the hypersphere S p R p +1 : the von Mises circle S1 , p = 2 the Fisher sphere S 2 , in general the Langevin sphere S p . For instance, assume repeated measurements of horizontal directions to one target which are distributed as polar coordinates on a unit circle clustered around a central direction. Alternatively, assume repeated measurements of horizontal and vertical directions to one target which are similarly distributed as spherical coordinates (longitude, latitude) on a unit sphere clustered around a central direction. By means of a properly chosen loss function we aim at a determination of the central direction. Let us connect all points on S1 , S 2 , or in general S p the measurement points, by a geodesic, here the great circle, to the point of the central direction. Indeed the loss function will be optimal at a point on S1 , S 2 , or in general S p , called the central point. The result for such a minimum geodesic distance mapping will be presented. Please pay attention to the guideline of Chapter 7.
Lemma 7.2 minimum geodesic distance: S1 Definition 7.1 minimum geodesic distance: S1
Lemma 7.3 minimum geodesic distance: S1
Definition 7.4 minimum geodesic distance: S 2
Lemma 7.5 minimum geodesic distance: S 2 Lemma 7.6 minimum geodesic distance: S 2
328
7 A spherical problem of algebraic representation
7-1 Introduction Directional data, also called “longitudinal data” or “angular data”, arise in several situations, notable geodesy, geophysics, geology, oceanography, atmospheric science, meteorology and others. The von Mises or circular normal distribution CN ( P , N ) with mean direction parameter P (0 d P d 2S ) and concentration parameter N (N > 0) , the reciprocal of a dispersion measure, plays the role in circular data parallel to that of the Gauss normal distribution in linear data. A natural extension of the CN distribution to the distribution on a pdimensional sphere S p R p +1 leads to the Fisher - von Mises or Langevin distribution L( P , N ) . For p=2, namely for spherical data (spherical longitude, spherical latitude), this distribution has been studied by R. A. Fisher (1953), generalizing the result of R. von Mises (1918) for p=1, and is often quoted as the Fisher distribution. Further details can be taken from K. V. Mardia (1972), K. V. Mardia and P.E. Jupp (2000), G. S. Watson (1986, 1998) and A. Sen Gupta and R. Maitra (1998). Box 7.1: Fisher - von Mises or Langevin distribution p=1 ( R. von Mises 1918) f (/ | P , N ) = [2S I 0 (N )]1 exp[N cos( / P / )]
(7.1)
f (/ | P , N ) = [2S I 0 (N )] exp N < ȝ | ; > cos < := :=< ȝ | ; >= P x X + P y Y = cos P / cos / + sin P / sin /
(7.2) (7.3)
1
cos < = cos(/ P/ )
(7.4)
ȝ = e1 cos P/ + e 2 sin P / S1
(7.5)
X = e1 cos / + e 2 sin / S1
(7.6)
p=2 (R. A. Fisher 1953) f ( /, ) | P / , P ) , N ) =
N exp[cos ) cos P) cos(/ P / ) + sin ) sin P) ] 4S sinh N N = exp N < ȝ | X > 4S sinh N cos < :=< ȝ | X >= P x X + P yY + P z Z =
= cos P) cos P / cos ) cos / + cos P) sin P / cos ) sin / + sin P) sin ) cos < = cos ) cos P) cos(/ P / ) + sin ) sin P)
(7.7)
(7.8)
329
7-1 Introduction
ȝ = e1 P x + e 2 P y + e3 P z = = e1 cos P) cos P / + e 2 cos P) sin P / + e3 sin P) S 2 X = e1 X + e 2Y + e3 Z = = e1 cos ) cos / + e 2 cos ) sin / + e3 sin ) S 2 .
(7.9) (7.10)
Box 7.1 is a review of the Fisher- von Mises or Langevin distribution. First, we setup the circular normal distribution on S1 with longitude / as the stochastic variable and ( P/ , N ) the distributional parameters called “mean direction ȝ ” and “concentration measure”, the reciprocal of a dispersion measure. Due to the normalization of the circular probability density function (“pdf”) I 0 (N ) as the zero order modified Bessel function of the first kind of N appears. The circular distance between the circular mean vector ȝ S1 and the placement vector X S1 is measured by “ cos < ”, namely the inner product < ȝ | X > , both P and X represented in polar coordinates ( P / , / ) , respectively. In summary, (7.1) is the circular normal pdf, namely an element of the exponential class. Second, we refer to the spherical normal pdf on S 2 with spherical longitude / , spherical latitude ) as the stochastic variables and ( P / , P) , N ) the distributional parameters called “longitudinal mean direction, lateral mean direction ( P/ , P) ) ” and “concentration measure N ”, the reciprocal of a dispersion measure. Here the normalization factor of the spherical pdf is N /(4S sinh N ) . The spherical distance between the spherical mean vector ȝ S 2 and the placement vector X S 2 is measured by “ cos < ”, namely the inner product < ȝ | X > , both ȝ and X represented in polar coordinates – spherical coordinates ( P / , P) , /, ) ) , respectively. In summary, (7.7) is the spherical normal pdf, namely an element of the exponential class. Box 7.2: Loss function p=1: longitudinal data n
type1:
¦ cos < i =1
i
= max ~ 1c cos Ȍ = max
n
type 2 :
¦ (1 cos < i =1 n
type 3 :
¦ sin i =1
2
i
) = min ~ 1c(1 cos Ȍ) = min
< i / 2 = min ~ (sin
Ȍ Ȍ )c (sin ) = min 2 2
(7.11) (7.12) (7.13)
transformation 1 cos < = 2sin 2 < / 2 " geodetic distance" cos< i = cos(/ i x) = cos / i cos x + sin / i sin x
(7.14)
2sin < i / 2 = 1 cos < i = 1 cos / i cos x + sin / i sin x
(7.16)
2
(7.15)
330
7 A spherical problem of algebraic representation
ª cos 0 2 dx i =1 i =1 builds up the sufficiency condition for the minimum at / g .
Lemma 7.3
(minimum geodesic distance, solution of the normal equation: S1 ):
Let the point / g S1 be at minimum geodesic distance to other points / i S1 , i {1, " , n} . Then the corresponding normal equation (7.28) is uniquely solved by tan / g = [sin /] /[cos / ] ,
(7.29)
such that the circular solution point is X g = e1 cos / g + e 2 sin / g =
1 [sin / ] + [cos / ]2 2
{e1 [cos /] + e 2 [sin /]} (7.30)
with respect to the Gauss brackets n
[sin / ]2 := (¦ sin / i ) 2
(7.31)
i=1 n
[cos / ]2 := (¦ cos / i ) 2 .
(7.32)
i=1
Next we generalize MINGEODISC ( p = 1) on S1 to MINGEODISC ( p = 2) on S 2 . Definition 7.4 (minimum geodesic distance: S 2 ): A point (/ g , ) g ) S 2 is called at minimum geodesic distance to other points (/ i , ) i ) S 2 , i {1, " , n} if the spherical distance
7-2 Minimal geodesic distance: MINGEODISC
333
function n
L(/ g , ) g ) := ¦ 2(1 cos < i ) =
(7.33)
i=1
n
= ¦ 2[1 cos ) i cos ) g cos(/ i / g ) sin ) i sin ) g ] = min
/g ,)g
i =1
is minimal. n
(/ g , ) g ) = arg {¦ 2(1 cos < i ) = min |
(7.34)
i=1
| cos < i = cos ) i cos ) g cos(/ i / g ) + sin ) i sin ) g } . Lemma 7.5 (minimum geodesic distance, normal equation: S 2 ): A point (/ g , ) g ) S 2 is called at minimum geodesic distance to other points (/ i , ) i ) S 2 , i {1, " , n} if / g = x1 , ) g = x2 fulfils the normal equations n
n
i =1
i =1
sin x2 cos x1 ¦ cos ) i cos / i sin x2 sin x1 ¦ cos ) i sin / i + (7.35)
n
+ cos x2 ¦ sin ) i = 0 i =1
n
n
i =1
i =1
cos x2 cos x1 ¦ cos ) i sin / i cos x2 sin x1 ¦ cos ) i cos / i = 0 . Proof: (/ g , ) g ) is generated by means of the Lagrangean (loss function) n
L( x1 , x2 ) := ¦ 2[1 cos ) i cos / i cos x1 cos x2 i =1
cos ) i sin / i sin x1 cos x2 sin ) i sin x2 ] = n
= 2n 2 cos x1 cos x2 ¦ cos ) i cos / i i =1
n
n
i =1
i =1
2sin x1 cos x2 ¦ cos ) i sin / i 2sin x2 ¦ sin ) i . The first derivatives n wL( x) (/ g , ) g ) = 2sin / g cos ) g ¦ cos ) i cos / i w x1 i =1 n
2 cos / g cos ) g ¦ cos ) i sin / i = 0 i =1
(7.36)
334
7 A spherical problem of algebraic representation n wL( x) (/ g , ) g ) = 2 cos / g sin ) g ¦ cos ) i cos / i + w x2 i =1 n
+ 2sin / g sin ) g ¦ cos ) i sin / i i =1
n
2 cos ) g ¦ sin ) i = 0 i =1
constitute the necessary conditions. The matrix of second derivative w 2 L( x ) (/ g , ) g ) t 0 w xw xc builds up the sufficiency condition for the minimum at (/ g , ) g ) . w 2 L( x ) (/ g , ) g ) = 2 cos / g cos ) g [cos ) cos / ] + w x12 + 2sin / g cos ) g [cos ) sin / ] w 2 L( x ) (/ g , ) g ) = 2sin / g sin ) g [cos ) cos / ] + w x1 x2 + 2 cos / g sin ) g [cos ) sin / ] w L( x ) (/ g , ) g ) = 2 cos / g cos ) g [cos ) cos / ] + w x22 2
+ 2sin / g cos ) g [cos ) sin / ] + + sin ) g [sin ) ].
.
h Lemma 7.6
(minimum geodesic distance, solution of the normal equation: S 2 ):
Let the point (/ g , ) g ) S 2 be at minimum geodesic distance to other points (/ i , ) i ) S 2 , i {1, " , n} . Then the corresponding normal equations ((7.35), (7.36)) are uniquely solved by tan / g = [cos ) sin /] /[cos ) cos /]
(7.37)
[sin )]
tan ) g =
[cos ) cos / ]2 + [cos ) sin /]2 such that the circular solution point is X g = e1 cos ) g cos / g + e 2 cos ) g sin / g + e3 sin ) g = =
1 [cos ) cos / ] + [cos ) sin /]2 + [sin )]2 2
*
*{e1[cos ) cos / ] + e 2 [cos ) sin /] + e3 [sin )]} 2
2
(7.38)
335
7-3 Special models
subject to n
[cos ) cos / ] := ¦ cos ) i cos / i
(7.39)
i=1 n
[cos ) sin / ] := ¦ cos ) i sin / i
(7.40)
i=1 n
[sin )] := ¦ sin ) i . i=1
7-3 Special models: from the circular normal distribution to the oblique normal distribution First, we present a historical note about the von Mises distribution on the circle. Second, we aim at constructing a twodimensional generalization of the Fisher circular normal distribution to its elliptic counterpart. We present 5 lemmas of different type. Third, we intend to prove that an angular metric fulfils the four axioms of a metric. 7-31
A historical note of the von Mises distribution
Let us begin with a historical note: The von Mises Distribution on the Circle In the early part of the last century, Richard von Mises (1918) considered the table of the atomic weights of elements, seven entries of which are as follows: Table 7.1 Element Atomic Weight
W
Al
Sb
Ar
As
Ba
Be
Bi
26.98
121.76
39.93
74.91
137.36
9.01
209.00
He asked the question “Does a typical element in some sense have integer atomic weight ?” A natural interpretation of the question is “Do the fractional parts of the weight cluster near 0 and 1?” The atomic weight W can be identified in a natural way with points on the unit circle, in such a way that equal fractional parts correspond to identical points. This can be done under the mapping ªcos T1 º W ox=« , T1 = 2S (W [W ]), ¬sin T1 »¼ where [u ] is the largest integer not greater than u . Von Mises’ question can now be seen to be equivalent to asking “Do this points on the circle cluster near e1 = [1 0]c ?”. Incidentally, the mapping W o x can be made in another way:
336
7 A spherical problem of algebraic representation
ªcos T 2 º 1 W ox=« , T 2 = 2S (W [W + ]) . » ¬sin T 2 ¼ 2 The two sets of angles for the two mappings are then as follows: Table 7.2 Element
Al
Sb
Ar
As
Ba
Be
Bi
Average
T1 / 2S
0.98
0.76
0.93
0.91
0.36
0.01
0.00
T1 / 2S = 0.566
T 2 / 2S
-0.02
-0.24
-0.06
-0.09
0.36
0.01
0.00
T 2 / 2S = 0.006
We note from the discrepancy between the averages in the final column that our usual ways of describing data, e.g., means and standard deviations, are likely to fail us when it comes to measurements of direction. If the points do cluster near e1 then the resultant vector 6 Nj=1x j (here N =7) we should have approximately x / || x ||= e , should point in that direction, i.e., 1 elements where x = 6 x j / N and ||x|| = (xcx)1/ 2 is the length of x . For the seven are considered we find whose weights here, D D ª 0.9617 º ª cos 344.09 º ª cos 15.91 º , x / || x ||= « = = « » « » D D» ¬ 0.2741 ¼ ¬«sin 344.09 ¼» ¬«sin 15.91 ¼»
a direction not far removed from e1 . Von Mises then asked “For what distribution of the unit circle is the unit vector Pˆ = [cos Tˆ0 sin Tˆ0 ]c = x / || x || a maximum likelihood estimator (MLE) of a direction T 0 of clustering or concentration ?” The answer is the distribution now known as the von Mises or circular normal distribution. It has density, expressed in terms of random angle T , exp{k cos(T T 0 )} dT , I 0 (k ) 2S where T 0 is the direction of concentration and the normalizing constant I 0 (k ) is a Bessel function. An alternative expression is ªcos Tˆ0 º exp(kx ' ȝ) dS , ȝ=« » , ||x|| = 1. I 0 (k ) 2S «¬sin Tˆ0 »¼ Von Mises’ question clearly has to do with the truth of the hypothesis ª1 º H 0 : T 0 = 0 or ȝ = e1 = « » . ¬0¼
337
7-3 Special models
It is worth mentioning that Fisher found the same distribution in another context (Fisher, 1956, SMSI, pp. 133-138) as the conditional distribution of x , given || x ||= 1, when x is N 2 (ȝ, k 1 I 2 ) . 7-32 Oblique map projection A special way to derive the general representation of the twodimensional generalized Fisher sphere is by forming the general map of S 2 onto \ 2 . In order to follow a systematic approach, let us denote by { A, , cos E =< y, z >, cos J =< x, z > . We wish to prove J d D + E . This result is trivial in the case cos t S , so we may assume D + E [0, S ] . The third desired inequality is equivalent to cos J t cos(D + E ) . The proof of the basic formulas relies heavily on the properties of the inverse product: < u + uc, v >=< u, v > + < uc, v > º < u, v + v c >=< u, v > + < u, v c > »» for all u,uc, v, v c \ 3 , and O \. < O u, v >= O < u, v >=< u, O v > »¼ Define xc, z c \ 3 by x = (cos D )y + xc, z = (cos E )y z c, then < xc, z c >=< x (cos D )y, z + (cos E ) y >= = < x, z > + (cos D ) < y, z > + (cos E ) < x, y > (cos D )(cos E ) < y, y >= = cos J + cos D cos E + cos D cos E cos D cos E = cos J + cos D cos E . In the same way || xc ||2 =< x, xc >= 1 cos 2 D = sin 2 D so that, since 0 d D d S , || xc ||= sin D . Similarly, || z c ||= sin E . But by Schwarz’ Inequality, < xc, z c > d || xc |||| z c || . It follows that cos J t cos D cos E sin D sin E = = cos(D + E ) and we are done!
7-4 Case study Table 7.3 collects 30 angular observations with two different theodolites. The first column contains the number of the directional observations / i , i {1,...,30}, n = 30 . The second column lists in fractions of seconds the directional data, while the third column / fourth column is a printout of cos / i / sin / i . ˆ as the arithmetic mean. Obviously, on the Table 7.4 is a comparison of / g and / ˆ are nearly the same. level of concentration of data, / g and /
342
7 A spherical problem of algebraic representation
Table 7.3 The directional observation data using two theodolites and its calculation Theodolite 1 No. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
¦
Value of observation ( /i ) 76 42 c 17.2 cc 19.5 19.2 16.5 19.6 16.4 15.5 19.9 19.2 16.8 15.0 16.9 16.6 20.4 16.3 16.7 16.0 15.5 19.1 18.8 18.7 19.2 17.5 16.7 19.0 16.8 19.3 20.0 17.4 16.2 ˆ = / D
Theodolite 2
cos / i
sin / i
0.229969
0.973198
0.229958 0.229959 0.229972 0.229957 0.229972 0.229977 0.229956 0.229959 0.229970 0.229979 0.229970 0.229971 0.229953 0.229973 0.229971 0.229974 0.229977 0.229960 0.229961 0.229962 0.229959 0.229967 0.229971 0.229960 0.229970 0.229959 0.229955 0.229968 0.229973
0.973201 0.973200 0.973197 0.973201 0.973197 0.973196 0.973201 0.973200 0.973198 0.973196 0.973198 0.973197 0.973202 0.973197 0.973197 0.973197 0.973196 0.973200 0.973200 0.973200 0.973200 0.973198 0.973197 0.973200 0.973198 0.973200 0.973201 0.973198 0.973197
6.898982
29.195958
L
D
76 42 c17.73cc sˆ = ±1.55cc
Value of observation ( /i ) D
76 42 c19.5cc 19.0 18.8 16.9 18.6 19.1 18.2 17.7 17.5 18.6 16.0 17.3 17.2 16.8 18.8 17.7 18.6 18.8 17.7 17.1 16.9 17.6 17.0 17.5 18.2 18.3 19.8 18.6 16.9 16.7 ˆ = /
cos / i
sin / i
0.229958
0.973201
0.229960 0.229961 0.229970 0.229962 0.229960 0.229964 0.229966 0.229967 0.229962 0.229974 0.229968 0.229969 0.229970 0.229961 0.229966 0.229962 0.229961 0.229966 0.229969 0.229970 0.229967 0.229970 0.229967 0.229964 0.229963 0.229956 0.229962 0.229970 0.229971
0.973200 0.973200 0.973198 0.973200 0.973200 0.973199 0.973199 0.973198 0.973200 0.973197 0.973198 0.973198 0.973198 0.973200 0.973199 0.973200 0.973200 0.973199 0.973198 0.973198 0.973198 0.973198 0.973198 0.973199 0.973199 0.973201 0.973200 0.973198 0.973197
6.898956
29.195968
L
D
76 42c17.91cc sˆ = ±0.94 cc
Table 7.4: Computation of theodolite data ˆ and / Comparison of / g Left data set versus Right data set ˆ = 76D 42'17.73'', sˆ = 1.55'' Theodolite 1: / ˆ = 76D 42'17.91'', sˆ = 0.94 '' Theodolite 2: / “The precision of the theodolite two is higher compared to the theodolite one”.
7-4 Case study
343
Alternatively, let us present a second example. Let there be given observed azimuths / i and vertical directions ) i , by Table 7.3 in detail. First, we compute the solution of the optimization problem n
n
¦ 2(1 cos < ) = ¦ 2[1 cos ) i
i =1
i =1
i
cos ) P cos(/ i / P ) sin ) i sin ) P ] =
= min
/P , )P
subject to values of the central direction n
ˆ = tan /
¦ cos ) sin / i
i =1 n
n
¦ sin )
i
ˆ = , tan )
¦ cos )i cos /i
i =1
n
i
.
n
(¦ cos ) i cos / i ) + (¦ cos ) i sin / i ) 2
i =1
i =1
2
i =1
Table 7.5: Data of type azimuth / i and vertical direction ) i /i
)i
/i
)i
124D 9
88D1
125D 0
88D 0
125D 2
88D 3
124D 9
88D 2
126D1
88D 2
124D8
88D1
125D 7
88D1
125D1
88D 0
This accounts for measurements of data on the horizontal circle and the vertical circle being Fisher normal distributed. We want to tackle two problems: ˆ ,) ˆ ) with the arithmetic mean (/, ) ) of the Problem 1: Compare (/ data set. Why do the results not coincide? ˆ ,) ˆ ) and (/, ) ) do coincide? Problem 2: In which case (/ Solving Problem 1 Let us compute ˆ ,) ˆ ) = (125D.206, 664,5, 88D.125, 050, 77) (/, ) ) = (125D.212,5, 88D.125) and (/ '/ = 0D.005,835,5 = 21cc .007,8 , ') = 0D.000, 050, 7 = 0cc.18. The results do not coincide due to the fact that the arithmetic means are obtained by adjusting direct observations with least-squares technology.
344
7 A spherical problem of algebraic representation
Solving Problem 2 The results do coincide if the following conditions are met. • All vertical directions are zero •
ˆ = / if the observations / , ) fluctuate only “a little” around / i i the constant value / 0 , ) 0
•
ˆ = / if ) = const. / i
•
ˆ = ) if the fluctuation of / around / is considerably ) i 0 smaller than the fluctuation of ) i around ) 0 .
Note the values 8
¦ cos ) sin / i
i =1
8
¦ cos ) i =1
8
i
= 0.213,866, 2; (¦ cos ) i sin / i ) 2 = 0.045,378,8 i =1
8
i
cos / i = 0.750,903, 27; (¦ cos ) i cos / i ) 2 = 0.022, 771,8 i =1
8
¦ cos )
i
= 7.995, 705,3
i=1
and / i = / 0 + G/ i versus ) i = ) 0 + G) i
G/ =
1 n 1 n G/ i versus G) = ¦ G) i ¦ n i =1 n i =1
n
n
n
i =1
i =1
i =1
¦ cos )i sin /i = n cos ) 0 sin / 0 + cos ) 0 cos / 0 ¦ G/i sin ) 0 sin / 0 ¦ G) i = = n cos ) 0 (sin / 0 + G/ cos / 0 G) tan / 0 sin / 0 ) n
¦ cos ) i =1
i
n
n
i =1
i =1
cos / i = n cos ) 0 cos / 0 cos ) 0 sin / 0 ¦ G/ i sin ) 0 sin / 0 ¦ G) i = = n cos ) 0 (cos / 0 G/ sin / 0 G) tan / 0 cos / 0 ) ˆ = sin / 0 + G/ cos / 0 G) tan ) 0 sin / 0 tan / cos / 0 G/ sin / 0 G) tan ) 0 cos / 0 tan / =
sin / 0 + G/ cos / 0 cos / 0 G/ sin / 0
7-4 Case study
345 n
n
(¦ cos ) i sin / i ) 2 + (¦ cos ) i cos / i ) 2 = i =1
i =1
2
= n 2 (cos 2 ) 0 + cos 2 G/ + sin 2 G) 2sin ) 0 cos G)) n
¦ sin ) i =1
ˆ = tan )
i
= n sin ) 0 + G) cos ) 0 n sin ) 0 + G) cos ) 0
n cos ) 0 + G/ cos 2 ) 0 + G) 2 sin 2 ) 0 2G) sin ) 0 cos ) 0 2
2
tan ) =
sin ) 0 + G) cos ) 0 . cos ) 0 G) sin ) 0
ˆ z /, ) ˆ z ) holds in general. In consequence, / At the end we will summarize to additional references like E. Batschelet (1965), T.D. Downs and A.L. Gould (1967), E.W. Grafarend (1970), E.J. Gumbel et al (1953), P. Hartmann et al (1974), and M.A. Stephens (1969). References Anderson, T.W. and M.A. Stephens (1972), Arnold, K.J. (1941), BarndorffNielsen, O. (1978), Batschelet, E. (1965), Batschelet, E. (1971), Batschelet, E. (1981), Beran, R.J. (1968), Beran, R.J. (1979), Blingham, C. (1964), Blingham, C. (1974), Chang, T. (1986) Downs, T. D. and Gould, A. L. (1967), Durand, D. and J.A. Greenwood (1957), Enkin, R. and Watson, G. S. (1996), Fisher, R. A. (1953), Fisher, N.J. (1985), Fisher, N.I. (1993), Fisher, N.I. and Lee A. J. (1983), Fisher, N.I. and Lee A. J. (1986), Fujikoshi, Y. (1980), Girko, V.L. (1985), Goldmann, J. (1976), Gordon, L. and M. Hudson (1977), Grafarend, E. W. (1970), Greenwood, J.A. and D. Durand (1955), Gumbel, E. J., Greenwood, J. A. and Durand, D. (1953), Hammersley, J.M. (1950), Hartman, P. and G. S. Watson (1974), Hetherington, T.J. (1981), Jensen, J.L. (1981), Jupp, P.E. and… (1980), Jupp, P. E. and Mardia, K. V. , Kariya, T. (1989), (1989), Kendall, D.G. (1974), Kent, J. (1976), Kent, J.T. (1982), Kent, J.T. (1983), Krumbein, W.C. (1939), Langevin, P. (1905), Laycock, P.J. (1975), Lenmitz, C. (1995), Lenth, R.V. (1981), Lord, R.D. (1948), Lund, U. (1999), Mardia, K.V. (1972), Mardia, K.V. (1975), Mardia, K. V. (1988), Mardia, K. V. and Jupp, P. E. (1999), Mardia, K.V. et al. (2000), Mhaskar, H.N., Narcowich, F.J. and J.D. Ward (2001), Muller, C. (1966), Neudecker, H. (1968), Okamoto, M. (1973), Parker, R.L. et al (1979), Pearson, K. (1905), Pearson, K. (1906), Pitman, J. and M. Yor (1981), Presnell, B., Morrison, S.P. and R.C. Littell (1998), Rayleigh, L. (1880), Rayleigh, L. (1905), Rayleigh, R. (1919), Rivest, L.P. (1982), Rivest, L.P.
346
7 A spherical problem of algebraic representation
(1988), Roberts, P.H. and H.D. Ursell (1960), Sander, B. (1930), Saw, J.G. (1978), Saw, J.G. (1981), Scheidegger, A.E. (1965), Schmidt-Koenig, K. (1972), Selby, B. (1964), Sen Gupta, A. and R. Maitra (1998), Sibuya, M. (1962), Stam, A.J. (1982), Stephens, M.A. (1963), Stephens, M.A. (1964), Stephens, M. A. (1969), Stephens, M.A. (1979), Tashiro, Y. (1977), Teicher, H. (1961), Von Mises, R. (1918), Watson, G.S. (1956a, 1956b), Watson, G.S. (1960), Watson, G.S.(1961), Watson, G.S. (1962), Watson, G.S. (1965), Watson, G.S. (1966), Watson, G.S. (1967a, 1967b), Watson, G.S. (1968), Watson, G.S. (1969), Watson, G.S. (1970), Watson, G.S. (1974), Watson, G.S. (1981a, 1981b), Watson, G.S. (1982a, 1982b, 1982c, 1982d), Watson, G.S. (1983), Watson, G.S. (1986), Watson, G.S. (1988), Watson, G.S. (1998), Watson, G.S. and E.J. Williams (1956), Watson, G.S. and E..Irving (1957), Watson, G.S. and M.R. Leadbetter (1963), Watson, G.S. and S. Wheeler (1964), Watson, G.S. and R.J. Beran (1967), Watson, G.S., R. Epp and J.W. Tukey (1971), Wellner, J. (1979), Wood, A. (1982), Xu, P.L. (1999), Xu, P.L. (2001), Xu, P. (2002), Xu, P.L. et al. (1996a, 1996b), Xu, P.L., and Shimada, S.(1997).
8
The fourth problem of probabilistic regression – special Gauss-Markov model with random effects – Setup of BLIP and VIP for the central moments of first order
: Fast track reading : Read only Theorem 8.5, 8.6 and 8.7.
Lemma 8.4 hom BLIP, hom S-BLIP and hom Į-VIP
Definition 8.1 z : hom BLIP of z
Theorem 8.5 z : hom BLIP of z
Definition 8.2 z : S-hom BLIP of z
Theorem 8.6 z : S-hom BLIP of z
Definition 8.3 z : hom Į-VIP of z
Theorem 8.7 z : hom Į-VIP of z
The general model of type “fixed effects”, “random effects” and “error-invariables” will be presented in our final chapter: Here we focus on “random effects”.
348
8 The fourth problem of probabilistic regression
Figure 8.1: Magic triangle
8-1 The random effect model Let us introduce the special Gauss-Markov model with random effects y = Cz + e y Ce z specified in Box 8.1. Such a model is governed by two identities, namely the first identity CE{z} = E{y} of moments of first order and the second identity D{y Cz} + CD{z}Cc = D{y} of central moments of second order. The first order moment identity CE{z} = E{y} relates the expectation E{z} of the stochastic, real-valued vector z of unknown random effects ( “Zufallseffekte”) to the expectation E{y} of the stochastic, real-valued vector y of observations by means of the non-stochastic (“fixed”) real-valued matrix C \ n×l of rank rk C = l. n = dim Y is the dimension of the observation space Y, l=dim Z the dimension of the parameter space Z of random effects z. The second order central moment identity Ȉ y -Cz + CȈ z Cc = Ȉ y relates the variancecovariance matrix Ȉ y -Cz of the random vector y Cz , also called dispersion matrix D{y Cz} and the variance-covariance matrix Ȉ z of the random vector z, also called dispersion matrix D{z} , to the variance-covariance matrix Ȉ y of the random vector y of the observations, also called dispersion matrix D{y} . In the simple random effect model we shall assume (i) rk Ȉ y = n and (ii) C{y, z} = 0 , namely zero correlation between the random vector y of observations and the vector z of random effects. (In the random effect model of type KolmogorovWiener we shall give up such a zero correlation.) There are three types of un-
8-1 The random effect model
349
knowns within the simple special Gauss-Markov model with random effects: (i) the vector z of random effects is unknown, (ii) the fixed vectors E{y}, E{z} of expectations of the vector y of observations and of the vector z of random effects (first moments) are unknown and (iii) the fixed matrices Ȉ y , Ȉ z of dispersion matrices D{y}, D{z} (second central moments) are unknown. Box 8.1: Special Gauss-Markov model with random effects y = Cz + e y Ce z E{y} = CE{z} \ n D{y} = D{y Cz} + CD{z}Cc \ n×n C{y , z} = 0 z, E{z}, E{y}, Ȉ y-Cz , Ȉ z unknown dim R (Cc) = rk C = l. Here we focus on best linear predictors of type hom BLIP, hom S-BLIP and hom Į-VIP of random effects z, which turn out to be better than the best linear uniformly unbiased predictor of type hom BLUUP. At first let us begin with a discussion of the bias vector and the bias matrix as well as of the Mean Square Prediction Error MSPE{z} with respect to a homogeneous linear prediction z = Ly of random effects z based upon Box 8.2. Box 8.2: Bias vector, bias matrix Mean Square Prediction Error in the special Gauss–Markov model with random effects E{y} = CE{z} D{y} = D{y Cz} + CD{z}Cc “ansatz”
(8.1) (8.2)
z = Ly bias vector
(8.3)
ȕ := E{z z} = E{z } E{z}
(8.4)
ȕ = LE{y} E{z} = [I A LC]E{z}
(8.5)
bias matrix B := I A LC decomposition
(8.6)
350
8 The fourth problem of probabilistic regression
z z = z E{z} (z E{z}) + ( E{z} E{z}) z z = L(y E{y}) (z E{z}) [I A LC]E{z}
(8.7)
(8.8)
Mean Square Prediction Error MSPE{z} := E{(z z )(z z )c}
(8.9)
MSPE{z} = LD{y}Lc + D{z} + [I A LC]E{z}E{z}c [I A LC]c
(8.10)
(C{y , z} = 0, E{z E{z}} = 0, E{z E{z}} = 0) modified Mean Square Prediction Error MSPE S {z} := LD{y}Lc + D{z} + [I A LC] S [I A LC]c
(8.11)
Frobenius matrix norms || MSPE{z} ||2 := tr E{(z z )(z z )c}
(8.12)
|| MSPE{z} || = = tr LD{y}Lc + tr D{z} + tr [I A LC]E{z}E{z}c [I A LC]c
(8.13)
2
= || Lc ||62 + || (I A LC)c ||2E{( z ) E ( z ) ' + tr E{(z E{z})(z E{z})c} y
|| MSPE S {z} ||2 := := tr LD{y}Lc + tr [I A LC]S[I A LC]c + tr D{z}
(8.14)
= || Lc ||6y + || (I A LC)c ||S + tr E{( z E{z})(z E{z})c} 2
2
hybrid minimum variance – minimum bias norm Į-weighted L(L) := || Lc ||62 y + 1 || (I A LC)c ||S2 D
(8.15)
special model dim R (SCc) = rk SCc = rk C = l .
(8.16)
The bias vector ȕ is conventionally defined by E{z} E{z} subject to the homogeneous prediction form z = Ly . Accordingly the bias vector can be represented by (8.5) ȕ = [I A LC]E{z} . Since the expectation E{z} of the vector z of random effects is unknown, there has been made the proposal to use instead the matrix I A LC as a matrix-valued measure of bias. A measure of the prediction error is the Mean Square prediction error MSPE{z} of type (8.9). MSPE{z} can be decomposed into three basic parts:
8-1 The random effect model
351
•
the dispersion matrix D{z} = LD{y}Lc
•
the dispersion matrix D{z}
•
the bias product ȕȕc .
Indeed the vector z z can also be decomposed into three parts of type (8.7), (8.8) namely (i) z E{z} , (ii) z E{z} and (iii) E{z} E{z} which may be called prediction error, random effect error and bias, respectively. The triple decomposition of the vector z z leads straightforward to the triple representation of the matrix MSPE{z} of type (8.10). Such a representation suffers from two effects: Firstly the expectation E{z} of the vector z of random effects is unknown, secondly the matrix E{z}E{z c} has only rank 1. In consequence, the matrix [I A LC]E{z}E{z}c [I A LC]c has only rank 1, too. In this situation there has made the proposal to modify MSPE{z} by the matrix E{z}E{z c} and by the regular matrix S. MSPE{z} has been defined by (8.11). A scalar measure of MSPE{z } as well as MSPE{z} are the Frobenius norms (8.12), (8.13), (8.14). Those scalars constitute the optimal risk in Definition 8.1 (hom BLIP) and Definition 8.2 (hom S-BLIP). Alternatively a homogeneous Į-weighted hybrid minimum variance- minimum bias prediction (hom VIP) is presented in Definition 8.3 (hom Į-VIP) which is based upon the weighted sum of two norms of type (8.15), namely •
average variance || Lc ||62 y = tr L6 y Lc
•
average bias || (I A LC)c ||S2 = tr[I A LC] S [I A LC]c .
The very important predictor Į-VIP is balancing variance and bias by the weight factor Į which is illustrated by Figure 8.1.
min bias
balance between variance and bias
min variance
Figure 8.1. Balance of variance and bias by the weight factor Į Definition 8.1 ( z hom BLIP of z): An l×1 vector z is called homogeneous BLIP of z in the special linear Gauss-Markov model with random effects of Box 8.1, if (1st) z is a homogeneous linear form z = Ly , (2nd) in comparison to all other linear predictions z has the
(8.17)
352
8 The fourth problem of probabilistic regression
minimum Mean Square Prediction Error in the sense of || MSPE{z} ||2 = = tr LD{y}Lc + tr D{z} + tr[I A LC]E{z}E{z}c [I A LC]c
(8.18)
= || Lc ||62 y + || (I A LC)c ||2E{( z ) E ( z ) ' + tr E{( z E{z})( z E{z})c}. Definition 8.2 ( z S-hom BLIP of z): An l×1 vector z is called homogeneous S-BLIP of z in the special linear Gauss-Markov model with random effects of Box 8.1, if (1st) z is a homogeneous linear form z = Ly ,
(8.19)
(2nd) in comparison to all other linear predictions z has the minimum S-modified Mean Square Prediction Error in the sense of || MSPE S {z} ||2 := := tr LD{y}Lc + tr[I A LC]S[I A LC]c + tr E{( z E{z})( z E{z})c} = || Lc ||62 y + || (I A LC)c ||S2 + tr E{( z E{z})( z E{z})c} = min .
(8.20)
L
Definition 8.3 ( z hom hybrid min var-min bias solution, Į-weighted or hom Į-VIP): An l×1 vector z is called homogeneous Į-weighted hybrid minimum variance- minimum bias prediction (hom Į-VIP) of z in the special linear Gauss-Markov model with random effects of Box 8.1, if (1st) z is a homogeneous linear form z = Ly ,
(8.21)
(2nd) in comparison to all other linear predictions z has the minimum variance-minimum bias in the sense of the Į-weighted hybrid norm tr LD{y}Lc + 1 tr (I A LC) S (I A LC)c D 2 1 = || Lc ||6 + || (I A LC)c ||S2 = min L D y
in particular with respect to the special model
D \ + , dim R (SCc) = rk SCc = rk C = l .
(8.22)
8-1 The random effect model
353
The predictions z hom BLIP, hom S-BLIP and hom Į-VIP can be characterized as follows: Lemma 8.4 (hom BLIP, hom S-BLIP and hom Į-VIP): An l×1 vector z is hom BLIP, hom S-BLIP or hom Į-VIP of z in the special linear Gauss-Markov model with random effects of Box 8.1, if and only if the matrix Lˆ fulfils the normal equations (1st)
hom BLIP: (6 y + CE{z}E{z}cCc)Lˆ c = CE{z}E{z}c
(2nd)
(3rd)
(8.23)
hom S-BLIP: (6 y + CSCc)Lˆ c = CS
(8.24)
(6 y + 1 CSCc)Lˆ c = 1 CS . D D
(8.25)
hom Į-VIP:
:Proof: (i) hom BLIP: 2
The hybrid norm || MSPE{z} || establishes the Lagrangean
L (L) := tr L6 y Lc + tr (I l LC) E{z}E{z}c (I l LC)c + tr 6 z = min L
for z hom BLIP of z. The necessary conditions for the minimum of the quadratic Lagrangean L (L) are wL ˆ (L) := 2[6 y Lˆ c + CE{z}E{z}cCcLˆ c CE{z}E{z}c ] = 0 , wL which agree to the normal equations (8.23). (The theory of matrix derivatives is reviewed in Appendix B (Facts: derivative of a scalar-valued function of a matrix: trace).) The second derivatives w2 L (Lˆ ) > 0 w (vecL)w (vecL)c at the “point” Lˆ constitute the sufficiency conditions. In order to compute such an ln×ln matrix of second derivatives we have to vectorize the matrix normal equation
354
8 The fourth problem of probabilistic regression
wL ˆ (L) := 2Lˆ (6 y + CE{z}E{z}cCc) 2 E{z}E{z}cCc wL wL (Lˆ ) := vec[2Lˆ (6 y + CE{z}E{z}cCc) 2 E{z}E{z}cCc]. w (vecL) (ii) hom S-BLIP: 2
The hybrid norm || MSPEs {z} || establishes the Lagrangean
L (L) := tr L6 y Lc + tr (I A LC)S(I A LC)c + tr 6 z = min L
for z hom S-BLIP of z. Following the first part of the proof we are led to the necessary conditions for the minimum of the quadratic Lagrangean L (L) wL ˆ (L) := 2[6 y Lˆ c + CSCcLˆ c CS]c = 0 wL as well as to the sufficiency conditions w2 L (Lˆ ) = 2[(6 y + CSCc)
I A ] > 0. w (vecL)w (vecL)c The normal equations of hom S-BLIP wL wL (Lˆ ) = 0 agree to (8.24). (iii) hom Į-VIP: The hybrid norm || Lc ||62 + 1 || (I A - LC)c ||S2 establishes the Lagrangean D L (L) := tr L6 y Lc + 1 tr (I A - LC)S(I A - LC)c = min L D y
for z hom Į-VIP of z. Following the first part of the proof we are led to the necessary conditions for the minimum of the quadratic Lagrangean L (L) wL ˆ (L) = 2[(6 y + CE{z}E{z}cCc)
I A ]vecLˆ 2vec(E{z}E{z}cCc). wL The Kronecker-Zehfuss Product A
B of two arbitrary matrices as well as ( A + B)
C = A
B + B
C of three arbitrary matrices subject to dim A=dim B is introduced in Appendix A. (Definition of Matrix Algebra: multiplication matrices of the same dimension (internal relation) and multiplication of matrices (internal relation) and Laws). The vec operation (vectorization of an array) is reviewed in Appendix A, too. (Definition, Facts: vecAB = (Bc
I cA )vecA for suitable matrices A and B.) Now we are prepared to compute w2 L (Lˆ ) = 2[(6 y + CE{z}E{z}Cc)
I A ] > 0 w (vecL)w (vecL)c
8-1 The random effect model
355
as a positive definite matrix. (The theory of matrix derivatives is reviewed in Appendix B (Facts: derivative of a matrix-valued function of a matrix, namely w (vecX) w (vecX)c ).) wL ˆ (L) = 2[ 1 CSCcLˆ c + 6 y Lˆ c 1 CS]cD = 0 D D wL as well as to the sufficiency conditions w2 L (Lˆ ) = 2[( 1 CSCc + 6 y )
I A ] > 0 . D w (vecL)w ( vecL)c The normal equations of hom Į-VIP wL wL (Lˆ ) = 0 agree to (8.25).
h For an explicit representation of z as hom BLIP, hom S-BLIP and hom Į-VIP of z in the special Gauss–Markov model with random effects of Box 8.1, we solve the normal equations (8.23), (8.24) and (8.25) for Lˆ = arg{L (L) = min} . L
Beside the explicit representation of z of type hom BLIP, hom S-BLIP and hom Į-VIP we compute the related dispersion matrix D{z} , the Mean Square Prediction Error MSPE{z} , the modified the Mean Square Prediction Error MSPE S {z} and MSPED ,S {z} and the covariance matrices C{z, z z} in Theorem 8.5 ( z hom BLIP): Let z = Ly be hom BLIP of z in the special linear Gauss-Markov model with random effects of Box 8.1. Then equivalent representations of the solutions of the normal equations (8.23) are z = E{z}E{z}cCc[6 y + CE{z}E{z}cCc]1 y = E{z}E{z}cCc[6 y Cz + C6 z Cc + CE{z}E{z}cCc]1 y
(8.26)
(if [6 y + CE{z}E{z}cCc]1 exists) are completed by the dispersion matrix D{z} = E{z}E{z}cCc[6 y + CE{z}E{z}cCc]1 6 y × × [6 y + CE{z}E{z}cCc]1 CE{z}E{z}c by the bias vector (8.5)
(8.27)
356
8 The fourth problem of probabilistic regression
ȕ := E{z } E{z} = [I A E{z}E{z}cCc(CE{z}E{z}cCc + 6 y ) 1 C]E{z}
(8.28)
and by the matrix of the Mean Square Prediction Error MSPE{z} : MSPE{z } := E{(z z )(z z )c} = D{z} + D{z} + ȕȕc
(8.29)
MSPE{z } := D{z} + D{z} + [I A E{z}E{z}cCc(CE{z}E{z}cCc + 6 y ) 1 C] ×
(8.30)
×E{z}E{z}c [I A Cc(CE{z}E{z}cCc + 6 y ) 1 CE{z}E{z}c ]. At this point we have to comment what Theorem 8.5 tells us. hom BLIP has generated the prediction z of type (8.26), the dispersion matrix D{z} of type (8.27), the bias vector of type (8.28) and the Mean Square Prediction Error of type (8.30) which all depend on the vector E{z} and the matrix E{z}E{z}c , respectively. We already mentioned that E{z} and E{z}E{z}c are not accessible from measurements. The situation is similar to the one in hypothesis theory. n As shown later in this section we can produce only an estimator E {z} and consequently can setup a hypothesis first moment E{z} of the "random effect" z. Indeed, a similar argument applies to the second central moment D{y} ~ 6 y of the "random effect" y, the observation vector. Such a dispersion matrix has to be known in order to be able to compute z , D{z} , and MSPE{z} . Again we have to apply the argument that we are only able to construct an estimate 6ˆ cy and to setup a hypothesis about D{y} ~ 6 y . Theorem 8.6 ( z hom S-BLIP): Let z = Ly be hom S-BLIP of z in the special linear Gauss-Markov model with random effects of Box 8.1. Then equivalent representations of the solutions of the normal equations (8.24) are z = SCc(6 y + CSCc) 1 y = = SCc(6 y Cz + C6 z Cc + CSCc) 1 y
(8.31)
z = (Cc6 y1C + S 1 ) 1 Cc6 y1y
(8.32)
z = (I A + SCc6 y1C) 1 SCc6 y1y
(8.33)
(if S 1 , 6 y1 exist) are completed by the dispersion matrices D{z} = SCc(CSCc + 6 y ) 1 6 y (CSCc + 6 y ) 1 CS
(8.34)
D{z} = (Cc6 y1C + S 1 ) 1 Cc6 y1C(Cc6 y1C + S 1 )1
(8.35)
8-1 The random effect model
357 (if S 1 , 6 y1 exist) by the bias vector (8.5) ȕ := E{z} E{z}
= [I A SCc(CSCc + 6 y ) 1 C]E{z} ȕ = [I l (Cc6 y1C + S 1 ) 1 Cc6 y1C]E{z}
(8.36)
(if S 1 , 6 y1 exist) and by the matrix of the modified Mean Square Prediction Error MSPE{z} : MSPE S {z} := E{(z z )(z z )c} = D{z} + D{z} + ȕȕc
MSPES {z} = 6z + SCc(CSCc + 6y )1 6y (CSCc + 6y )1 CS + +[I A SCc(CSCc + 6y )1 C]E{z}E{z}c [Il Cc(CSCc + 6y )1 CS]
(8.37)
(8.38)
MSPE S {z } = 6 z + (Cc6 y1C + S 1 ) 1 Cc6 y1C(Cc6 y1C + S 1 )1 CS + + [I A (Cc6 y1C + S 1 ) 1 Cc6 y1C]E{z}E{z}c ×
(8.39)
× [I A Cc6 y1C(Cc6 y1C + S 1 ) 1 ] (if S 1 , 6 y1 exist). The interpretation of hom S-BLIP is even more complex. In extension of the comments to hom BLIP we have to live with another matrix-valued degree of freedom, z of type (8.31), (8.32), (8.33) and D{z} of type (8.34), (8.35) do no longer depend on the inaccessible matrix E{z}E{z}c , rk( E{z}E{z}c ) , but on the "bias weight matrix" S, rk S = l. Indeed we can associate any element of the bias matrix with a particular weight which can be "designed" by the analyst. Again the bias vector ȕ of type (8.36) as well as the Mean Square Prediction Error of type (8.37), (8.38), (8.39) depend on the vector E{z} which is inaccessible. Beside the "bias weight matrix S" z , D{z} , ȕ and MSPE{z} are vector-valued or matrix-valued functions of the dispersion matrix D{y} ~ 6 y of the stochastic observation vector which is inaccessible. By hypothesis testing we may decide y . upon the construction of D{y} ~ 6 y from an estimate 6 Theorem 8.7 ( z hom Į-VIP): Let z = Ly be hom Į-VIP of z in the special linear Gauss-Markov model with random effects Box 8.1. Then equivalent representations of the solutions of the normal equations (8.25) are
358
8 The fourth problem of probabilistic regression
z = 1 SCc(6 y + 1 CSCc) 1 y D D = 1 SCc(6 y Cz + C6 z Cc + 1 CSCc) 1 y D D
(8.40)
z = (Cc6 y1C + D S 1 ) 1 Cc6 y1y
(8.41)
z = (I A + 1 SCc6 y1C) 1 1 SCc6 y1y D D
(8.42)
(if S 1 , 6 y1 exist) are completed by the dispersion matrix D{z} = 1 SCc(6 y + 1 CSCc) 1 6 y (6 y + 1 CSCc) 1 CS 1 D D D D
(8.43)
D{z} = (Cc6 y1C + D S 1 ) 1 Cc6 y1C(Cc6 y1C + D S 1 )1
(8.44)
(if S 1 , 6 y1 exist) by the bias vector (8.5) ȕ := E{z} E{z} = [I A 1 SCc( 1 CSCc + 6 y ) 1 C]E{z} D D ȕ = [I A (Cc6 y1C + D S 1 ) 1 Cc6 y1C]E{z}
(8.45)
(if S 1 , 6 y1 exist) and by the matrix of the Mean Square Prediction Error MSPE{z} : MSPE{z } := E{(z z )(z z )c} = D{z} + D{z} + ȕȕc
(8.46)
MSPE{z } = = 6 z + SCc(CSCc + 6 y ) 1 6 y (CSCc + 6 y ) 1 CS + +[I A - 1 SCc( 1 CSCc + 6 y ) 1 C]E{z}E{z}c × D D 1 ×[I A - Cc( CSCc + 6 y ) 1 CS 1 ] D D
(8.47)
MSPE{z} = = 6 z + (Cc6 C + D S 1 ) 1 Cc6 y1C(Cc6 y1C + D S 1 ) 1 CS + 1 y
+[I A - (Cc6 y1C + D S 1 ) 1 Cc6 y1C]E{z}E{z}'× ×[I A - Cc6 y1C(Cc6 y1C + D S 1 ) 1 ] (if S 1 , 6 y1 exist).
(8.48)
8-1 The random effect model
359
The interpretation of the very important predictors hom Į-VIP z of z is as follows: z of type (8.41), also called ridge estimator or Tykhonov-Phillips regulator, contains the Cayley inverse of the normal equation matrix which is additively decomposed into Cc6 y1C and D S 1 . The weight factor Į balances the first inverse dispersion part and the second inverse bias part. While the experiment informs y , the bias weight matrix S and us of the variance-covariance matrix 6 y , say 6 the weight factor Į are at the disposal of the analyst. For instance, by the choice S = Diag( s1 ,..., sA ) we may emphasize increase or decrease of certain bias matrix elements. The choice of an equally weighted bias matrix is S = I A . In contrast the weight factor Į can be determined by the A-optimal design of type •
tr D{z} = min
•
ȕȕc = min
•
tr MSPE{z} = min .
D
D
D
In the first case we optimize the trace of the variance-covariance matrix D{z} of type (8.43), (8.44). Alternatively by means of ȕȕ ' = min we optimize D the quadratic bias where the bias vector ȕ of type (8.45) is chosen, regardless of the dependence on E{z} . Finally for the third case – the most popular one – we minimize the trace of the Mean Square Prediction Error MSPE{z} of type (8.48), regardless of the dependence on E{z}E{z}c . But beforehand let us present the proof of Theorem 8.5, Theorem 8.6 and Theorem 8.7. Proof: (i) z = E{z}E{z}cCc[6 y + CE{z}E{z}cCc]1 y If the matrix 6 y + CE{z}E{z}cCc of the normal equations of type hom BLIP is of full rank, namely rk(6 y + CE{z}E{z}cCc) = n, then a straightforward solution of (8.23) is Lˆ = E{z}E{z}cCc[6 y + CE{z}E{z}cCc]1 . (ii) z = SCc(6 y + CSCc) 1 y If the matrix 6 y + CSC ' of the normal equations of type hom S-BLIP is of full rank, namely rk(6 y + CSC ') = n, then a straightforward solution of (8.24) is Lˆ = SCc[6 y + CSCc]1. (iii) z = (Cc6 y1C + S 1 ) 1 Cc6 y1y Let us apply by means of Appendix A (Fact: Cayley inverse: sum of two matrices, s(10), Duncan-Guttman matrix identity) the fundamental matrix identity SCc(6 y + CSCc) 1 = (Cc6 y1C + S 1 ) 1 Cc6 y1 ,
360
8 The fourth problem of probabilistic regression
if S 1 and 6 y1 exist. Such a result concludes this part of the proof. (iv) z = (I A + SCc6 y1C) 1 SCc6 y1y Let us apply by means of Appendix A (Fact: Cayley inverse: sum of two matrices, s(9)) the fundamental matrix identity SC '(6 y + CSCc) 1 = (I A + SCc6 y1C) 1 SCc6 y1 , if 6 y1 exists. Such a result concludes this part of the proof. (v) z = 1 SCc(6 y + 1 CSCc) 1 y D D If the matrix 6 y + D1 CSCc of the normal equations of type hom Į-VIP is of full rank, namely rk(6 y + D1 CSCc) = n, then a straightforward solution of (8.25) is Lˆ = 1 SCc[6 y + 1 CSCc]1. D D (vi) z = (Cc6 y1C + D S 1 ) 1 Cc6 y1y Let us apply by means of Appendix A (Fact: Cayley inverse: sum of two matrices, s(10), Duncan-Guttman matrix identity) the fundamental matrix identity 1 SCc(6 + CSCc) 1 = (Cc6 1C + D S 1 ) 1 Cc6 1 , y y y D if S 1 and 6 y1 exist. Such a result concludes this part of the proof. (vii) z = (I A + 1 SCc6 y1C) 1 1 SCc6 y1y D D Let us apply by means of Appendix A (Fact: Cayley inverse: sum of two matrices, s(9), Duncan-Guttman matrix identity) the fundamental matrix identity 1 SCc(6 + CSCc) 1 = (I + 1 SCc6 1C) 1 1 SCc6 1 , y l y y D D D if 6 y1 exist. Such a result concludes this part of the proof. (viii) hom BLIP: D{z} D{z} := E{[z E{z}][ z E{z}]c} = = E{z}E{z}' C '[6 y + CE{z}E{z}cCc]1 6 y × × [6 y + CE{z}E{z}cCc]1 CE{z}E{z}c . By means of the definition of the dispersion matrix D{z} and the substitution of z of type hom BLIP the proof has been straightforward. (ix) hom S-BLIP: D{z} (1st representation)
8-1 The random effect model
361
D{z} := E{[z E{z}][z E{z }]c} = = SCc(CSCc + 6 y ) 1 6 y (CSCc + 6 y ) 1 CS. By means of the definition of the dispersion matrix D{z} and the substitution of z of type hom S-BLIP the proof of the first representation has been straightforward. (x) hom S-BLIP: D{z} (2nd representation) D{z}:= E{[z E{z}][z E{z}]c} = = (Cc6 y1C + S 1 ) 1 Cc6 y1C(Cc6 y1C + S 1 ) 1 ,
if S 1 and 6 y1 exist. By means of the definition of the dispersion matrix D{z} and the substitution of z of type hom S-BLIP the proof the second representation has been straightforward. (xi) hom Į-VIP: D{z} (1st representation) D{z}:= E{[z E{z}][z E{z}]c} = = 1 SCc(6 y + 1 CSCc) 1 6 y (6 y + 1 CSCc) 1 CS 1 . D D D D
By means of the definition of the dispersion matrix D{z} and the substitution of z of type hom Į-VIP the proof the first representation has been straightforward. (xii) hom Į-VIP: D{z} (2nd representation) D{z } := E{[z E{z}][z E{z}]c} = (Cc6 y1C + D S 1 ) 1 Cc6 y1C(Cc6 y1C + D S 1 ) 1 , if S 1 and 6 y1 exist. By means of the definition of the dispersion matrix D{z} and the substitution of z of type hom Į-VIP the proof of the second representation has been straightforward. (xiii) bias ȕ for hom BLIP, hom S-BLIP and hom Į-VIP As soon as we substitute into the bias ȕ := E{z} E{z} = E{z} + E{z } the various predictors z of the type hom BLIP, hom S-BLIP and hom Į-VIP we are directly led to various bias representations ȕ of type hom BLIP, hom S-BLIP and hom Į-VIP. (xiv) MSPE{z} of type hom BLIP, hom S-BLIP and hom Į-VIP MSPE{z}:= E{(z z )(z z )c} z z = z E{z} (z E{z}) = z E{z} (z E{z}) ( E{z} E{z})
E{( z z )(z z )c} = E{(z E{z})((z E{z})c} + E{(z E{z})( z E{z})c} + +( E{z} E{z})( E{z} E{z})c
362
8 The fourth problem of probabilistic regression
MSPE{z} = D{z} + D{z} + ȕȕc. At first we have defined the Mean Square Prediction Error MSPE{z} of z . Secondly we have decomposed the difference z z into the three terms • z E{z} • z E{z} • E{z} E{z} in order to derive thirdly the decomposition of MSPE{z} , namely •
the dispersion matrix of z , namely D{z} ,
•
the dispersion matrix of z , namely D{z} ,
•
the quadratic bias ȕȕc .
As soon as we substitute MSPE{z} the dispersion matrix D{z} and the bias vector ȕ of various predictors z of the type hom BLIP, hom S-BLIP and hom Į-VIP we are directly led to various representations ȕ of the Mean Square Prediction Error MSPE{z} . Here is my proof’s end.
h
8-2 Examples Example 8.1 Nonlinear error propagation with random effect models Consider a function y = f ( z ) where y is a scalar valued observation and z a random effect. Three cases can be specified as follows: Case 1 ( P z assumed to be known): By Taylor series expansion we have 1 1 f c( P z )( z P z ) + f cc( P z )( z P z ) 2 + O (3) 1! 2! 1 E{ y} = E{ f ( z )} = f ( P z ) + f cc( P z ) E{( z P z ) 2 } + O (3) 2!
f ( z) = f (P z ) +
leading to (cf. E. Grafarend and B. Schaffrin 1983, p.470) 1 f cc( P z )V z2 + O (3) 2! 1 2 E{( y E{ y}) } = E{[ f c( P z )( z P z ) + f cc( P z )( z P z ) 2 + O (3) 2! 1 f cc( P z )V z2 O (3)]2 }, 2! hence E{[ y E{ y}][[ y E{ y}]} is given by E{ y} = f ( P z ) +
8-2 Examples
363
V y2 = f c2 ( P z )V z2
1 2 f cc ( P z )V z4 + f fc cc( P z ) E{( z P z )3 } + 4 1 + f cc2 E{( z P z ) 4 } + O (3). 4
Finally if z is quasi-normally distributed, we have V y2 = f c 2 ( P z )V z2 +
1 f cc 2 ( P z )V z4 + O (3). 2
Case 2 ( P z unknown, but [ 0 known as a fixed effect approximation (this model is implied in E. Grafarend and B. Schaffrin 1983, p.470, [ 0 z P z )): By Taylor series expansion we have f ( z ) = f ([ 0 ) +
1 1 f c([ 0 )( z [ 0 ) + f cc([ 0 )( z [ 0 ) 2 + O (3) 1! 2!
using
[ 0 = P z + ([ 0 P z ) z [ 0 = z P z + ( P z [ 0 ) we have 1 1 f c([ 0 )( z P z ) + f c([ 0 )( z [ 0 ) + 1! 1! 1 1 2 + f cc([ 0 )( z P z ) + f cc([ 0 )( z [ 0 ) 2 + 2! 2! + f cc([ 0 )( z P z )( z [ 0 ) + O (3)
f ( z ) = f ([ 0 ) +
and E{ y} = E{ f ( z )} = f ([ 0 ) + f c([ 0 )( P z [ 0 ) + +
1 f cc([ 0 )V z2 + 2
1 f cc([ 0 )( P z [ 0 ) 2 + O (3) 2
leading to E{[ y E{ y}][[ y E{ y}]} as
V z2 = f c2 ([ 0 )V z2 + f fc cc([ 0 ) E{( z P z )3} + 2 f fc cc([ 0 )V z2 ( P z [ 0 ) + 1 + f cc2 ([ 0 ) E{( z P z ) 4 } + f cc2 ([ 0 ) E{( z P z ) 3}( P z [ 0 ) 4 1 f cc2 ([ 0 )V z4 + f cc2 ([ 0 )V z2 ( P z [ 0 ) 2 + O (3) 4 and with z being quasi-normally distributed, we have V z2 = f c2 ([ 0 )V z2 + 2 f fc cc([ 0 )V z2 ( P z [ 0 ) +
1 2 f cc ([ 0 )V z4 + f cc2 ([ 0 )V z2 ( P z [ 0 ) 2 + O (3) , 2
with the first and third terms (on the right hand side) being the right hand sideterms of case 1 (cf. E. Grafarend and B. Schaffrin 1983, p.470).
364
8 The fourth problem of probabilistic regression
Case 3 ( P z unknown, but z0 known as a random effect approximation): By Taylor series expansion we have f ( z) = f (P z ) +
1 1 f c( P z )( z P z ) + f cc( P z )( z P z ) 2 + 1! 2! 1 + f ccc( P z )( z P z )3 + O (4) 3!
changing z P z = z0 P z = z0 E{z0 } ( P z E{z0 }) and the initial bias ( P z E{z0 }) = E{z0 } P z =: E 0 leads to z P z = z0 E{z0 } + E 0 . Consider ( z P z ) 2 = ( z0 E{z0 }) 2 + E 02 + 2( z0 E{z0 }) E 0 we have 1 1 f c( P z )( z0 E{z0 }) + f c( P z ) E 0 + 1! 1! 1 1 2 2 + f cc( P z )( z0 E{z0 }) + f cc( P z ) E 0 + f cc( P z )( z0 E{z0 }) E 0 + O (3) 2! 2! 1 1 E{ y} = f ( P z ) + f c( P z ) E 0 + f cc( P z )V z2 + f cc( P z ) E 02 + O (3) 2 2 f ( z) = f (P z ) +
0
leading to E{[ y E{ y}][[ y E{ y}]} as
V y2 = f c2 ( P z )V z2 + f fc cc( P z ) E{( z0 E{z0 })3 } + 1 2 f fc cc( P z )V z2 E 0 + f cc2 ( P z ) E{( z0 E{z0 }) 4 } + 4 f cc2 ( P z ) E{( z0 E{z0 })3 }E 0 + f cc2 ( P z )V z2 E 02 + 1 1 + f cc2 ( P z )V z4 f cc2 ( P z ) E{( z0 E{z0 }) 2 }V z2 + O (3) 4 2 0
0
0
0
0
and with z0 being quasi-normally distributed, we have
V y2 = f c2 ( P z )V z2 + 2 f fc cc( P z )V z2 E 0 + 0
0
1 2 f cc ( P z )V z4 + f cc2 ( P z )V z2 E 02 + O (3) 2 0
0
with the first and third terms (on the right-hand side) being the right-hand side terms of case 1.
8-2 Examples
365
Example 8.2
Nonlinear vector valued error propagation with random effect models
In a GeoInformation System we ask for the quality of a nearly rectangular planar surface element. Four points {P1, P2, P3, P4} of an element are assumed to have the coordinates (x1, y1), (x2, y2), (x3, y3), (x4, y4) and form a 8×8 full variancecovariance matrix (central moments of order two) and moments of higher order. The planar surface element will be computed according the Gauß trapezoidal: 4
F =¦ i =1
yi + yi +1 ( xi xi +1 ) 2
with the side condition x5= x1, y5=y1. Note that within the Error Propagation Law w2 F z0 wxwy holds. P3
P2
e2
P4
P1 e1 Figure 8.2: Surface element of a building in the map First question ? What is the structure of the variance-covariance matrix of the four points if we assume statistical homogeneity and isotropy of the network (Taylor-Karman structure)? Second question ! Approach the criterion matrix in terms of absolute coordinates. Interpolate the correlation function linear!
366
8 The fourth problem of probabilistic regression
Table 8.1: Coordinates of a four dimensional simplex Point
x
y
P1
100.00m
100.00m
P2
110.00m
117.32m
P3
101.34m
122.32m
P4
91.34m
105.00m
Table 8.2: Longitudinal and lateral correlation functions 6 m and 6 A for a Taylor-Korman structured 4 point network |x|
6 m (|x|)
6 A (|x|)
10m
0.700
0.450
20m
0.450
0.400
30m
0.415
0.238
Our example refers to the Taylor-Karman structure or the structure function introduced in Chapter 3-222. :Solution: The Gauß trapezoidal surface element has the size: F=
y + y3 y + y4 y1 + y2 y + y1 ( x1 x2 ) + 2 ( x2 x3 ) + 3 ( x3 x4 ) + 4 ( x4 x1 ). 2 2 2 2
Once we apply the “error propagation law” we have to use (E44). 1 V F2 = JȈJ c + + (vecȈ)(vecȈ)c+ c . 2 In our case, n=1 holds since we have only one function to be computed. In contrast, the variance-covariance matrix enjoys the format 8×8, while the Jacobi matrix of first derivatives is a 1×8 matrix and the Hesse matrix of second derivatives is a 1×64 matrix. (i) The structure of the homogeneous and isotropic variance-covariance matrix is such that locally 2×2 variance-covariance matrices appear as unit matrices generating local error circles of identical radius. (ii) The celebrated Taylor-Karman matrix for absolute coordinates is given by 'xi 'x j Ȉij (x p , x q ) = 6 m (| x p x q |)G ij + [6 A (| x p x q |) 6 m (| x p x q |)] |x p x q |2 subject to 'x1 := 'x = x p xq , 'x2 := 'y = y p yq , i, j {1, 2}; p, q {1, 2,3, 4}.
8-2 Examples
367
By means of a linear interpolation we have derived the Taylor-Karman matrix by Table 8.3 and Table 8.4. Table 8.3: Distances and meridian correlation function 6 m and longitudinal correlation function 6 A p-q
|x p x q |
|x p x q |2
6m
6A
xp-xq
yp-yq
1-2
20.000
399.982
0.45
0.40
-10
-17.32
1-3
22.360
499.978
0.44
0.36
-1.34
-22.32
1-4
10.000
100.000
0.70
0.45
8.66
-5
2-2
10.000
100.000
0.70
0.45
8.66
-5
2-4
22.360
499.978
0.44
0.36
18.66
12.32
3-4
20.000
399.982
0.45
0.40
10
17.32
Table 8.4: Distance function versus 6 m (x), 6 A ( x) |x|
6 m ( x)
6 A ( x)
10-20
0.95 0.025 | x |
0.5 0.005 | x |
20-30
0.52 0.0035 | x |
0.724 0.0162 | x |
Once we take care of ¦ m and ¦ A as a function of the distance for gives values of tabulated distances we arrive at the Taylor-Karman correlation values of type Table 8.5. Table 8.5: Taylor-Karman matrix for the case study
x1 y1 x2
x1
y1
x2
y2
x3
y3
x4
y4
1
0
0.438
-0.022
0.441
-0.005
0.512
0.108
1
-0.022
0.412
-0.005
0.361
0.108
0.638
1
0
0.512
0.108
0.381
-0.037
1
0.108
0.634
-0.037
0.417
1
0
0.438
-0.022
1
-0.022
0.412
1
0
y2 x3 y3 x4 y4
symmetric
1
Finally, we have computed the Jacobi matrix of first derivatives in Table 8.6 and the Hesse matrix of second derivatives in Table 8.7.
368
8 The fourth problem of probabilistic regression
Table 8.6: Table of the Jacobi matrix “Jacobi matrix” wF wF wF wF , ," , , ] wx1 wy1 wx4 wy4 J = 12 [ y2 y4 , x4 x2 , y3 y1 , x1 x3 , y4 y2 , x2 x4 , y1 y3 , x3 x1 ] J =[
J = 12 [12.32, 18.66, 22.32, 1.34, 12.32,18.66, 22.32,1.34] :Note: y y º wF = i +1 i -1 » wxi 2 » wF xi +1 xi -1 » = » wyi 2 ¼
x0 = x4 y0 = y4 x5 = x1 y5 = y1 .
Table 8.7: Table Hesse matrix “Hesse matrix” w w
F (x)= wxc wxc w wF wF wF wF =
[ , ,", , ] wxc wx1 wy1 wx4 wy4
H=
=[
w2 F w2 F w2 F w2 F w2 F w2 F , ,", , ,", , ] 2 wx1wy4 wy1wx1 wy4 wx4 wy 2 wx1 wx1wy1
=[
w wF wF w wF wF ( ,", ),", ( ,", )]. wx1 wx1 wy4 wy4 wx1 wy4
Note the detailed computation in Table 8.8. Table 8.8: Second derivatives {0, +1/2, -1/2} : “interims formulae: Hesse matrix”: w2 F w2 F = =0 wxi wx j wyi wy j
i, j = 1, 2,3, 4
w2 F w2 F = =0 wxi wyi wyi wxi
i = 1, 2,3, 4
w2 F w2 F 1 = = wxi wyi 1 wyi wxi +1 2 w2 F w2 F 1 = = wyi wxi 1 wxi wyi +1 2
i = 1, 2,3, 4 i = 1, 2,3, 4.
8-2 Examples
369 Results
At first, we list the distances {P1P2, P2P3, P3P4, P4P1} of the trapezoidal finite element by |P1P2|=20 (for instance 20m), |P2P3|=10 (for instance 10m), |P3P4|=20 (for instance 20m) and |P4P1|=10 (for instance 10m). Second, we compute V F2 ( first term) = JȈJ c by V F2 ( first term) = 1 = [12.32, 18.66, 22.32, 1.34, 12.32, 18.62, 22.32, 1.34] × 2 0 0.438 -0.022 0.442 -0.005 0.512 0.108 º ª 1 1 -0.022 0.412 -0.005 0.362 0.108 0.638 » « 0 « 0.438 -0.022 1 0 0.512 0.108 0.386 -0.037 » «-0.022 0.412 0 1 0.108 0.638 -0.037 0.418 » × ×« 0.442 -0.005 0.512 0.108 1 0 0.438 -0.022 » « -0.005 0.362 0.108 0.638 0 1 -0.022 0.412 » « 0.512 0.108 0.386 -0.037 0.438 -0.022 1 0 » « 0.108 0.638 -0.037 0.418 -0.022 0.412 0 1 »¼ ¬ 1 × [12.32, 18.66, 22.32, 1.34, 12.32, 18.62, 22.32, 1.34]c = 2 = 334.7117. 1 Third, we need to compute V F2 ( second term) = H (vecȈ)(vecȈ)cH c by 2 1 V F2 ( second term) = H (vecȈ)(vecȈ)cH c = 7.2222 × 10-35 2 where ª 0 0 0 12 0 0 0 12 º « 0 0 1 0 0 0 1 0 » 2 2 « » 1 1 « 0 2 0 0 0 2 0 0 » « 1 0 0 0 1 0 0 0 » 2 H = vec « 2 1 1 », « 0 0 01 2 0 0 01 2 » « 0 0 2 0 0 0 2 0 » « 0 1 0 0 0 1 0 0 » 2 « 1 2 » 1 ¬« 2 0 0 0 2 0 0 0 ¼» and ª 1 « 0 « 0.438 «-0.022 vec Ȉ = vec « « 0.442 « -0.005 « 0.512 ¬« 0.108
0 1 -0.022 0.412 -0.005 0.362 0.108 0.638
0.438 -0.022 1 0 0.512 0.108 0.386 -0.037
-0.022 0.412 0 1 0.108 0.638 -0.037 0.418
0.442 -0.005 0.512 0.108 1 0 0.438 -0.022
-0.005 0.362 0.108 0.638 0 1 -0.022 0.412
Finally, we get the variance of the planar surface element F
0.512 0.108 0.386 -0.037 0.438 -0.022 1 0
0.108 º 0.638 » -0.037 » 0.418 » ». -0.022 » 0.412 » 0 » 1 ¼»
370
8 The fourth problem of probabilistic regression
V F2 = 334.7117+7.2222 × 10-35 = 334.7117 i.e.
V F = ±18.2951 (m 2 ) . Example 8.3 Nonlinear vector valued error propagation with random effect models The distance element between P1 and P2 has the size: F = ( x2 x1 ) 2 + ( y2 y1 ) 2 . Once we apply the “error propagation law” we have to use (E44). 1 V F2 = JȈJ c + H(vecȈ)(vecȈ)cH c. 2 Table 8.6: Table of the Jacobi matrix “Jacobi matrix” wF wF wF wF , ," , , ] wx1 wy1 wx4 wy4 ( x2 x1 ) ( y2 y1 ) ( x2 x1 ) ( y2 y1 ) J =[ , , , , 0, 0, 0, 0] F F F F J =[
J = [0.5, 0.866, 0.5, 0.866, 0, 0, 0, 0]. Table 8.7: Table Hesse matrix “Hesse matrix” w w
F (x)= wxc wxc w wF wF wF wF =
[ , ," , , ] wxc wx1 wy1 wx4 wy4 w2 F w2 F w2 F w2 F w2 F w2 F =[ 2 , ," , , ," , , ] wx1wy4 wy1wx1 wy4 wx4 wy 2 wx1 wx1wy1 w wF wF w wF wF =[ ( ," , )," , ( ,", )]. wx1 wx1 wy4 wy4 wx1 wy4
H=
Note the detailed computation in Table 8.8. Table 8.8: Second derivatives : “interims formulae: Hesse matrix”: ( x x1 )( y 2 y1 ) 1 ( x x )2 w wF wF ( ," , ) =[ 2 31 , 2 , F wx1 wx1 wy 4 F F3 1 ( x 2 x 2 ) ( x x1 )( y 2 y1 ) , 0, 0, 0, 0], + 2 3 1 , 2 F F F3
8-2 Examples
371
( x x )( y y ) 1 ( y y ) 2 w wF wF ( ," , ) = [ 2 1 3 2 1 , 2 3 1 , F wy1 wx1 wy4 F F ( x2 x1 )( y2 y1 ) 1 ( y22 y12 ) , + , 0, 0, 0, 0] F F3 F3 w wF wF 1 ( x x ) 2 ( x x )( y y ) ( ," , ) = [ + 2 3 1 , 2 1 3 2 1 , F wx2 wx1 wy4 F F ( x x )( y y ) 1 ( x22 x12 ) , 2 1 3 2 1 , 0, 0, 0, 0], 3 F F F ( x x )( y y ) w wF wF 1 ( y y )2 ( ," , ) =[ 2 1 3 2 1 , + 2 3 1 , F wy2 wx1 wy4 F F
( x2 x1 )( y2 y1 ) 1 ( y22 y12 ) , , 0, 0, 0, 0], F F3 F3
w wF wF º ( ," , ) = [0, 0, 0, , 0, 0, 0, 0, 0] » wxi wx1 wy4 » , i = 3, 4. » w wF wF ( ," , ) = [0, 0, 0, , 0, 0, 0, 0, 0]» wyi wx1 wy4 ¼ Results At first, we list the distance {P1P2} of the distance element by |P1P2|=20 (for instance 20m). Second, we compute V F2 ( first term) = JȈJ c by
V F2 ( first term) = = [0.5, 0.866, 0.5, 0.866, 0, 0, 0, 0] × ª 1 « 0 « « 0.438 -0.022 × «« 0.442 « -0.005 « « 0.512 «¬ 0.108
0 1 -0.022 0.412 -0.005 0.362 0.108 0.638
0.438 -0.022 1 0 0.512 0.108 0.386 -0.037
-0.022 0.412 0 1 0.108 0.638 -0.037 0.418
0.442 -0.005 0.512 0.108 1 0 0.438 -0.022
-0.005 0.362 0.108 0.638 0 1 -0.022 0.412
0.512 0.108 0.386 -0.037 0.438 -0.022 1 0
0.108 º 0.638 » » -0.037 » 0.418 » × -0.022 » 0.412 »» 0 » 1 »¼
×[0.5, 0.866, 0.5, 0.866, 0, 0, 0, 0]c = = 1.2000. Third, we need to compute V F2 ( second term) =
1 H (vecȈ)(vecȈ)cH c by 2
372
8 The fourth problem of probabilistic regression
1 V F2 ( second term) = H (vecȈ)(vecȈ)cH c = 0.0015 2 where ª 0.0375 « -0.0217 « « -0.0375 0.0217 H = vec «« 0 « 0 « « 0 «¬ 0
-0.0217 0.0125 0.0217 -0.0125 0 0 0 0
-0.0375 0.0217 0.0375 -0.0217 0 0 0 0
0.0217 -0.0125 -0.0217 0.0125 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0º 0» » 0» 0» , 0» » 0» 0» 0»¼
and ª 1 « 0 « « 0.438 -0.022 vec Ȉ = vec «« 0.442 « -0.005 « « 0.512 «¬ 0.108
0 1 -0.022 0.412 -0.005 0.362 0.108 0.638
0.438 -0.022 1 0 0.512 0.108 0.386 -0.037
-0.022 0.412 0 1 0.108 0.638 -0.037 0.418
0.442 -0.005 0.512 0.108 1 0 0.438 -0.022
-0.005 0.362 0.108 0.638 0 1 -0.022 0.412
Finally, we get the variance of the distance element F
V F2 = 1.2000+0.0015 = 1.2015 i.e.
V F = ±1.0961 (m) .
0.512 0.108 0.386 -0.037 0.438 -0.022 1 0
0.108 º 0.638 » » -0.037 » 0.418 » . -0.022 » » 0.412 » 0 » 1 »¼
9
The fifth problem of algebraic regression - the system of conditional equations: homogeneous and inhomogeneous equations {By = Bi versus c + By = Bi} :Fast track reading: Read Lemma 2, 3 and 6
Lemma 9.2 “inconsistent homogeneous conditions”
Definition 9.1 “inconsistent homogeneous conditions”
Lemma 9.3 G y -norm: least squares solution
Theorem 9.4 G y -seminorm: least squares solution
Definition 9.5 “inconsistent inhomogeneous conditions”
Lemma 9.6 “inconsistent inhomogeneous conditions”
Here we shall outline two systems of poor conditional equations, namely homogeneous and inhomogeneous inconsistent equations. First, Definition 9.1 gives us G y -LESS of a system of inconsistent homogeneous conditional equations which we characterize as the least squares solution with respect to the G y seminorm ( G y -norm) by means of Lemma 9.2, Lemma 9.3 ( G y -norm) and Lemma 9.4 ( G y -seminorm). Second, Definition 9.5 specifies G y -LESS of a system of
374
9 The fifth problem of algebraic regression
inconsistent inhomogeneous conditional equations which alternatively characterize as the corresponding least squares solution with respect to the G y -seminorm by means of Lemma 9.6. Third, we come up with examples.
9-1
G y -LESS of a system of a inconsistent homogeneous conditional equations
Our point of departure is Definition 9.1 by which we define G y -LESS of a system of inconsistent homogeneous condition equations. Definition 9.1 ( G y -LESS of a system of inconsistent homogeneous condition equations): An n × 1 vector i A of inconsistency is called G y -LESS (LEast Squares Solution with respect to the G y -seminorm ) of the inconsistent system of linear condition equations Bi = By, if in comparison to all other vectors i \ n the inequality || i A ||G2 := i cA G y i A d i cG y i =:|| i ||G2 y
y
(9.1) (9.2)
holds in particular if the vector of inconsistency i A has the least G y -seminorm. Lemma 9.2 characterizes the normal equations for the least squares solution of the system of inconsistent homogeneous condition equations with respect to the G y -seminorm. Lemma 9.2 (least squares solutions of the system of inconsistent homogeneous condition equations with respect to the G y -seminorm ): An n × 1 vector i A of the system of inconsistent homogeneous condition equations (9.3) Bi = By is G y -LESS if and only if the system of normal equations ªG y «B ¬
B cº ª i A º ª 0 º =« » 0 »¼ «¬ OA »¼ ¬ B y ¼
with the q × 1 vector OA of “Lagrange multipliers” is fulfilled. :Proof: G y -LESS of Bi = By is constructed by means of the Lagrangean
(9.4)
9-1 Inconsistent homogeneous conditional equations
375
L( i, O ) := icG y i + 2O c( Bi By ) = min . i, O
The first derivatives wL ( i A , OA ) = 2(G y i A + BcOA ) = 0 wi wL ( i A , OA ) = 2( Bi A By ) = 0 wO constitute the necessary conditions. (The theory of vector-valued derivatives is presented in Appendix B.) The second derivatives wL ( i A , OA ) = 2G y t 0 wiwic build up due to the positive semidefiniteness of the matrix G y the sufficiency condition for the minimum. The normal equations (9.4) are derived from the two equations of first derivatives, namely ªG y «B ¬
B cº ª i A º ª 0 º = . 0 »¼ «¬ OA »¼ «¬ By »¼
h Lemma 9.3 is a short review of the system of inconsistent homogeneous condition equations with respect to the G y -norm, Lemma 9.4 alternatively with respect to the G y -seminorm. Lemma 9.3 (least squares solution of the system of inconsistent homogeneous condition equations with respect to the G y -norm): An n × 1 vector i A of the system of inconsistent homogeneous condition equations Bi = By is the least squares solution with respect to the G y -norm if and only if it solves the normal equations G y i A = Bc(BG y1Bc) 1 By. (9.5) The solution i A = G y1Bc( BG y1Bc) 1 By
(9.6)
is unique. The “goodness of fit” of G y -LESS is || i A ||G2 = i cA G y i A = y cBc(BG y1Bc) 1 By. y
(9.7)
:Proof: A basis of the proof could be C. R. Rao’s Pandora Box, the theory of inverse partitioned matrices (Appendix A: Fact: Inverse Partitioned Matrix /IPM/ of a symmetric matrix). Due to the rank identity rkG y = n , the normal equations (9.4) can be faster solved by Gauss elimination.
376
9 The fifth problem of algebraic regression
G y i A + BcOA = 0 Bi A = By. Multiply the first normal equation by BG y1 and substitute the second normal equation for Bi A . BG y1G y i A = Bi A = BG y1BcOA º » Bi A = By ¼ BG y1BcOA = By
OA = (BG y1Bc) 1 By. Finally we substitute the “Lagrange multiplier” OA back to the first normal equation in order to prove G y i A + BcOA = G y i A Bc(BG y1Bc) 1 By = 0 i A = G y1Bc(BG y1Bc) 1 By. h We switch immediately to Lemma 9.4. Lemma 9.4 (least squares solution of the system of inconsistent homogeneous condition equations with respect to the G y -seminorm ): An n × 1 vector i A of the system of inconsistent homogeneous condition equations Bi = By is the least squares solution with respect to the G y -seminorm if the compatibility condition R (Bc) R (G y )
(9.8)
is fulfilled, and solves the system of normal equations G y i A = Bc(BG y1Bc) 1 By ,
(9.9)
which is independent of the choice of the g-inverse G y .
9-2
Solving a system of inconsistent inhomogeneous conditional equations
The text point of departure is Definition 9.5, a definition of G y -LESS of a system of inconsistent inhomogeneous condition equations.
377
9-3 Examples
Definition 9.5 ( G y -LESS of a system of inconsistent inhomogeneous condition equations): An n × 1 vector i A of inconsistency is called G y -LESS (LEast Squares Solution with respect to the G y -seminorm ) of the inconsistent system of inhomogeneous condition equations c + By = B i (9.10) (the minus sign is conventional), if in comparison to all other vectors i R n the inequality || i A ||2 := i cA G y i A d i cG y i =:|| i ||G2
y
(9.11)
holds, in particular if the vector of inconsistency i A has the least G y -seminorm. Lemma 9.6 characterizes the normal equations for the least squares solution of the system of inconsistent inhomogeneous condition equations with respect to the G y -seminorm. Lemma 9.6 (least squares solution of the system of inconsistent inhomogeneous condition equations with respect to the G y -seminorm): An n × 1 vector i A of the system of inconsistent homogeneous condition equations (9.12) Bi = By c = B(y d) is G y -LESS if and only if the system of normal equations ªG y B c º ª i A º ª O º (9.13) « B O » «O » = «By c » , ¬ ¼¬ A¼ ¬ ¼ with the q × 1 vector O of Lagrange multipliers is fulfilled. i A exists surely if (9.14) R (Bc) R (G y ) and it solves the normal equations G y i A = B c(BG y1B c) 1 (By c) , (9.15) which is independent of the choice of the g-inverse G y . i A is unique if the matrix G y is regular and in consequence positive definite.
9-3
Examples
Our two examples relate to the triangular condition, the so-called zero misclosure, within a triangular network, and the condition that the sum within a flat triangle accounts to 180 o .
378
9 The fifth problem of algebraic regression
(i)
the first example: triplet of angular observations
We assume that three observations of height differences within the triangle PD PE PJ sum up to zero. The condition of holonomic heights says hDE + hEJ + hJD = 0 , namely ª hDE º ªiDE º B := [1, 1, 1], y := «« hEJ »» , i := «« iEJ »» . «¬ hJD »¼ «¬ iJD »¼ The normal equations of the inconsistent condition read for the case G y = I 3 : i A = Bc(BBc) 1 By , ª1 1 1º 1 Bc(BBc)B = «1 1 1» , 3 «1 1 1» ¬ ¼ 1 (iDE )A = (iEJ )A = (iJD ) A = ( hDE + hEJ + hJD ) . 3 (ii)
the second example: sum of planar triangles
Alternatively, we assume: three angles which form a planar triangle of sum to
D + E + J = 180D namely ªD º ªiD º « » B := [1, 1, 1], y := « E » , i := ««iE »» , c := 180D. «¬ J »¼ «¬ iJ »¼ The normal equations of the inconsistent condition equation read in our case G y = I3 : i A = Bc(BBc) 1 (By c) , ª1º 1 Bc(BBc) 1 = «1» , By c = D + E + J 180D , 3 «1» ¬¼ ª1º 1 (iD )A = (i E )A = (iJ )A = «1» (D + E + J 180D ) . 3 «1» ¬¼
10 The fifth problem of probabilistic regression – general Gauss-Markov model with mixed effects– Setup of BLUUE for the moments of first order (Kolmogorov-Wiener prediction) “Prediction company’s chance of success is not zero, but close to it.” Eugene Fama “The best way to predict the future is to invent it.” Alan Kay : Fast track reading : Read only Theorem 10.3 and Theorem 10.5
Lemma 10.2 Ȉ y BLUUE of ȟ and E{z} :
Definition 10.1 Ȉ y BLUUE of ȟ and E{z} :
Theorem 10.3 n ȟˆ , E {z} : Ȉ y BLUUE of ȟ and E{z} : Lemma 10.4 n n E{y}: ȟˆ , E {z} Ȉ y BLUUE of ȟ and E{z}
The inhomogeneous general linear Gauss-Markov model with fixed effects and random effects will be presented first. We review the special KolmogorovWiener model and extend it by the proper stochastic model of type BIQUUE given by Theorem 10.5.
10.1.1.1 Theorem 10.5 Homogeneous quadratic setup of Vˆ 2
380
10 The fifth problem of probabilistic regression
The extensive example for the general linear Gauss-Markov model with fixed effects and random effects concentrates on a height network observed at two epochs. At the first epoch we assume three measured height differences. In between the first and the second epoch we assume height differences which change linear in time, for instance as a result of an earthquake we have found the height difference model • hDE (W ) = hDE (0) + hDE W + O (W 2 ) .
Namely, W indicates the time interval from the first epoch to the second epoch • relative to the height difference hDE . Unknown are •
the fixed effects hDE and
•
the expected values of stochastic effects of type height difference velocities hDE
given the singular dispersion matrix of height differences. Alternative estimation and prediction producers of •
type (V + CZCc) -BLUUE for the unknown fixed parameter vector ȟ of height differences of initial epoch and
•
the expectation data E{z} of stochastic height difference velocities z, and
•
of type (V + CZCc) -BLUUE for the expectation data E{y} of height difference measurements y,
•
of type e y of the empirical error vector,
•
as well as of type (V + CZCc) -BLUUP of the stochastic vector z of height difference velocities.
For the unknown variance component V 2 of height difference observations we review estimates of type BIQUUE. At the end, we intend to generalize the concept of estimation and prediction of fixed and random effects by a short historical remark.
10-1 Inhomogeneous general linear Gauss-Markov model (fixed effects and random effects) Here we focus on the general inhomogeneous linear Gauss-Markov model including fixed effects and random effects. By means of Definition 10.1 we review Ȉ y -BLUUE of ȟ and E{z} followed by the related Lemma 10.2, Theorem 10.3 and Lemma 10.4. Box 10.1 Inhomogeneous general linear Gauss–Markov model (fixed effects and random effects)
10-1 Inhomogeneous general linear Gauss-Markov model
381
Aȟ + CE{z} + Ȗ = E{y}
(10.1)
Ȉ z := D{z}, Ȉ y := D{y} C{y , z} = 0 ȟ, E{z}, E{y}, Ȉ z , Ȉ y
unknown
Ȗ known E{y} Ȗ ([ A, C]) .
(10.2)
The n×1 stochastic vector y of observations is transformed by means of y Ȗ =: y to the new n×1 stochastic vector y of reduced observations which is characterized by second order statistics, in particular by the first moments E{y} and by the central second moments D{y}. Definition 10.1 ( Ȉ y BLUUE of ȟ and E{z} ): The partitioned vector ȗ = Ly + ț , namely ª ȟˆ º ª ȟˆ º ª L1 º ª ț1 º « » = «n» = « »y + « » ¬ț 2 ¼ ¬ Șˆ ¼ «¬ E{z}»¼ ¬ L 2 ¼ is called Ȉ y BLUUE of ȗ (Best Linear Uniformly Unbiased Estimation with respect to Ȉ y - norm) in (10.1) if ȗˆ is uniformly unbiased in the sense of (1st)
(2nd)
(10.3) E{ȗˆ} = E{Ly + ț} = ȗ for all ȗ R m+l or ˆ E{ȟ} = E{L1y + ț1} = ȟ for all ȟ R m (10.4) n {z}} = E{L 2 y + ț 2 } = Ș = E{z} for all Ș R l E{Șˆ } = E{E and in comparison to all other linear uniformly unbiased estimation ȗˆ has minimum variance. tr D{ȗˆ} := E{(ȗˆ ȗ)c(ȗˆ ȗ)} = tr LȈ y Lc =|| Lc ||2Ȉ = min y
(10.5)
L
or tr D{ȟˆ} := E{(ȟˆ - ȟ )c(ȟˆ - ȟ )} = tr L1Ȉ y L1c =|| L1c ||2Ȉ = min y
L1
n tr D {Șˆ } := tr D{E {z}} := E {( Șˆ - Ș) (Șˆ - Ș)} = n n = E{( E {z} E{z})c( E {z} E{z}{z})} = tr L 2 Ȉ y Lc2 =|| Lc ||2Ȉ = min . y
L2
(10.6)
382
10 The fifth problem of probabilistic regression
We shall specify Ȉ y -BLUUE of ȟ and E{z} by means of ț1 = 0 , ț 2 = 0 and writing the residual normal equations by means of “Lagrange multipliers”. Lemma 10.2 ( Ȉ y BLUUE of ȟ and E{z} ): n An (m+l)×1 vector [ȟˆ c, Șˆ c]c = [ȟˆ c, E {z}c]c = [L1c , Lc2 ]c y + [ț1c , ț c2 ]c is n c c c Ȉ y BLUUE of [ȟ , E{z} ] in (10.1), if and only if ț1 = 0, ț 2 = 0 hold and the matrices L1 and L 2 tions ª Ȉ y A Cº ª L1c « Ac 0 0 » « ȁ « » « 11 «¬ Cc 0 0 »¼ «¬ ȁ 21
fulfill the system of normal equaLc2 º ª 0 ȁ12 » = « I m » « ȁ 22 »¼ «¬ 0
0º 0» » I l »¼
(10.7)
or Ȉ y L1c Aȁ11 Cȁ 21 = 0, Ȉ y Lc2 Aȁ12 Cȁ 22
0
A cL1c = I m , A cLc2 = 0 CcL1c = 0, CcLc2 = I l
(10.8)
with suitable matrices ȁ11 , ȁ12 , ȁ 21 and ȁ 22 of “Lagrange multipliers”. Theorem 10.3 specifies the solution of the special normal equations by means of (10.9) relative to the specific “Schur complements” (10.10)-(10.13). n Theorem 10.3 ( ȟˆ , E {z} Ȉ y BLUUE of ȟ and E{z} ): n Let [ȟˆ c, E{z}c]c be Ȉ y BLUUE of the [ȟ c, E{z}c]c in the mixed Gauss-Markov model (10.1). Then the equivalent representations of the solution of the normal equations (10.7) ª ȟˆ º ª A cȈ-y1A A cȈ-1y Cº ª A cº ˆ ˆȗ := «ª ȟ »º := « Ȉ-y1y »=« -1 -1 » « » n c c c ¬ Șˆ ¼ «¬ E {z}»¼ ¬ C Ȉ y A C Ȉ y C ¼ ¬ C ¼
(10.9)
ȟˆ = {A cȈ-y1[I n C(CcȈ-y1C) -1 CcȈ-y1 ]A}-1 × A cȈ-y1[I n - C(CcȈ-y1C) -1 CcȈ-y1 ] y 1 n Șˆ = E {z} = {CcȈ-1y [I n - A( AcȈ-y1A )-1 AcȈ-y1 ]C} × CcȈ-y1[In - A( AcȈ-y1A)-1 AcȈ-y1 ] y
ȟˆ = S -A1sA n Șˆ := E {z} = SC-1sC n ȟˆ = ( A cȈ-y1A ) 1 A cȈ-1y ( y E {z}) n Șˆ := E {z} = (CcȈ-1C) 1 CcȈ-1 ( y - Aȟˆ ) y
y
10-1 Inhomogeneous general linear Gauss-Markov model
383
are completed by the dispersion matrices and the covariance matrices. 1 -1 -1 ˆ ° ª ȟˆ º ½° ° ª ȟ º °½ ª A cȈ y A A cȈ y Cº ˆ D ȗ := D ® « » ¾ := D ® « =: Ȉȗˆ » = n ¾ « CcȈ-y1A CcȈ-y1C » ¼ ¯° ¬ Șˆ ¼ ¿° ¯° «¬ E {z}»¼ ¿° ¬
{}
{}
1 D ȟˆ = {A cȈ-y1 [I n - C(CcȈ-y1C) -1 CcȈ-y1 ]A} =: Ȉȟˆ
n ˆ Șˆ } = C{ȟˆ , E C{ȟ {z}}} =
= {A cȈ-y1[I n - C(CcȈ-y1C) -1 CcȈ-y1 ]A} A cȈ-y1C(CcȈ-y1C) 1 1
= ( A cȈ-y1A ) -1 A cȈ-y1C {CȈ-y1[I n - A( A cȈ-1y A ) -1 AcȈ-y1 ]C}
-1
n D {Șˆ } := D{E {z}} = {CcȈ-y1 [I n - A ( A cȈ-y1A ) -1 A cȈ-y1 ]C}1 =: ȈȘˆ n D{ȟˆ} = S 1 , D{Șˆ } = D{E {z}} = S 1 A
C
C{ȟˆ , z} = 0 n C{Șˆ , z}:= C{E {z}, z} = 0 , where the “Schur complements” are defined by S A := A cȈ-y1[I n - C(CcȈ-y1C) -1 CcȈ-y1 ]A,
(10.10)
s A := A cȈ-y1[I n - C(CcȈ-y1C) -1 CcȈ-1y ]y
(10.11)
SC := CcȈ-y1[I n - A( A cȈ-y1A ) -1 A cȈ-y1 ]C
(10.12)
sC := CcȈ-y1[I n - A( A cȈ-y1A ) -1 AcȈ-y1 ]y .
(10.13)
Our final result (10.14)-(10.23) summarizes (i) the two forms (10.14) and (10.15) n n {y} and D{E {y}} as derived covariance matrices, (ii) the empirical of estimating E error vector e y and the related variance-covariance matrices (10.19)-(10.21) and (iii) the dispersion matrices D{y} by means of (10.22)-(10.23). n n Lemma 10.4 ( E {y}: ȟˆ , E {z} Ȉ y BLUUE of ȟ and E{z} ): (i) With respect to the mixed Gauss-Markov model (10.1) Ȉ y BLUUE of the E{y} = Aȟ + CE{z} is given by n n E {y} = Aȟˆ + C E {z} = = AS -A1s A + C(CcȈy1C) 1 CcȈy1 ( y AS -A1s A ) or n n E{y} = Aȟˆ + C E {z} = = A ( A cȈy1A ) 1 A cȈy1 ( y AS C-1sC ) + CS-1CsC with the corresponding dispersion matrices
(10.14)
(10.15)
384
10 The fifth problem of probabilistic regression
n n D{E {y}} = D{Aȟˆ + C E {z}} = n n n = AD{ȟˆ}A c + A cov{ȟˆ , E {z}}Cc + C cov{ȟˆ , E {z}}A c + CD{E {z}}Cc n n D{E {y}} = D{Aȟˆ + C E {z}} = = C(CcȈ-y1C) 1 Cc + [I n - C(CcȈ-y1C) -1 CcȈ-y1 ]AS A1A c[I n - Ȉ -y1C(CcȈ -y1C) -1 Cc] n n {y}} = D{Aȟˆ + C E {z}} = D{E = A( A cȈ-y1A) 1 Ac + [I n - A( AcȈ-y1A)-1 AcȈ-y1 ]CS C1Cc[I n - Ȉ-y1A( AcȈ-y1A) -1 Ac], where S A , s A , SC , sC are “Schur complements” (10.10), (10.11), (10.12) and (10.13). n The covariance matrix of E {y} and z amounts to n n cov{E {y}, z} = C{Aȟˆ + C E {z}, z} = 0.
(10.16)
(ii) If the “error vector” e y is empirically determined by means of n the residual vector e y = y E { y} we gain the various representations of type e y = [I n C(CcȈ y1C) 1 CcȈ y1 ]( y AS -A1s A ) or
(10.17)
e y = [I n A ( A cȈy1A ) 1 A cȈy1 ]( y CSC-1sC )
(10.18)
with the corresponding dispersion matrices D{e y } = Ȉ y C(CcȈ-y1C) 1 Cc [I n - C(CcȈ-y1C)-1 CcȈ-y1 ]AS A1A c[I n - Ȉ-y1C(CcȈ-y1C)-1 Cc]
(10.19)
or D{e y } = Ȉ y A( A cȈ -y1A) 1 Ac [I n - A( A cȈ -y1A) -1 A cȈ -y1 ]CS C1Cc[I n - Ȉ -y1A( A cȈ -y1A) -1 Ac],
(10.20)
where S A , s A , SC , sC are “Schur complements” (10.10), (10.11), (10.12) and (10.13). e y and z are uncorrelated because of C{e y , z} = 0.
(10.21)
(iii) The dispersion matrices of the observation vector is given by n n D{y} = D{Aȟˆ + C E {z} + e y } = D{Aȟˆ + C E {z}} + D{e y } (10.22) D{y} = D{e y e y } + D{e y }. n {y} are uncorrelated since e y and E n n C{e y , E {y}} = C{e y , Aȟˆ + C E {z}} = C{e y , e y e y } = 0 .
(10.23)
385
10-2 Explicit representations of errors
10-2 Explicit representations of errors in the general Gauss-Markov model with mixed effects A collection of explicit representations of errors in the general Gauss-Markov model with mixed effects will be presented: ȟ , E{z} , y Ȗ = y , Ȉ z , Ȉ y will be assumed to be unknown, Ȗ known. In addition, C{y, z} will be assumed to vanish. The prediction of random effects will be summarized here. Note our simple model Aȟ + CE{z} = E{y}, E{y} R ([ A, C]), rk[ A, C] = m + A < n , E{z} unknown, ZV 2 = D{z} , Z positive definite rk Z = s d A , VV 2 = D{y Cz} , V positive semidefinite rk V = t d n, rk[V, CZ] = n , C{z, y Cz} = 0 . A homogeneous-quadratic ansatz Vˆ 2 = y cMy will be specified now. Theorem 10.5 (homogeneous-quadratic setup of Vˆ 2 ): (i)
Let Vˆ 2 = y cMy = (vec M)c( y
y ) be BIQUUE of V 2 with respect to the model of the front desk. Then
Vˆ 2 = ( n m A) 1[ y c{I n ( V + CZCc) 1 A[ A c( V + CZCc) 1 A ]1 Ac} ( V + CZCc) 1 y scASC1sC ]
Vˆ 2 = ( n m A) 1 [ y cQ( V + CZCc) 1 y scA S A1sA ] Q := I n (V + CZCc) 1 C[Cc(V + CZCc) 1 C]1 Cc subject to [S A , s A ] := A c( V + CZCc) 1{I n C[Cc( V + CZCc) 1 C]1 Cc( V + CZCc) 1}[ A, y ] = = A cQ( V + CZCc) 1[ A, y ] and [SC , sC ] := Cc( V + CZCc) 1{I n A[ Ac( V + CZCc) 1 A]1 Ac( V + CZCc) 1}[C, y] , where SA and SC are “Schur complements”. Alternately, we receive the empirical data based upon
Vˆ 2 = ( n m A) 1 y c( V + CZCc) 1 e y = = ( n m A) 1 e cy ( V + CZCc) 1 e y and the related variances
386
10 The fifth problem of probabilistic regression
D{Vˆ 2 } = 2( n m A) 1V 4 = 2( n m A) 1 (V 2 ) 2 or replacing by the estimations D{Vˆ 2 } = 2(n m A) 1 (Vˆ 2 ) 2 = 2(n m A) 1[e cy (V + CZCc) 1 e y ]2 . (ii)
If the cofactor matrix V is positive definite, we will find for the simple representations of type BIQUUE of V 2 the equivalent representations ª A cV 1A A cV 1Cº ª A cº 1 Vˆ 2 = ( n m A) 1 y c{V 1 V 1[ A, C] « » A }y 1 1 » « ¬ CcV A CcV C ¼ ¬ Cc ¼
Vˆ 2 = ( n m A) 1 ( y cQV 1 y scA S A1s A ) Vˆ 2 = ( n m A) 1 y cV 1 [I n A ( A cV 1A ) 1 A cV 1 ]( y CSC1sC ) subject to the projection matrix Q = I n V 1C(CcV 1C) 1 Cc and [S A , s A ] := A cV 1 [I n C(CcV 1C) 1 CcV 1 ][ A, y ] = A cQV 1 [ A, y ] [SC , sC ] := {I A + CcV 1[I n A( A cV 1A) 1 A cV 1 ]CZ}1 × ×CcV 1[I n A( AcV 1A) 1 AcV 1 ][C, y ]. Alternatively, we receive the empirical data based upon
Vˆ 2 = ( n m A) 1 y cV 1e y = ( n m A) 1 e cy V 1e y and the related variances D{Vˆ 2 } = 2( n m A) 1V 4 = 2( n m A) 1 (V 2 ) 2 Dˆ {Vˆ 2 } = 2( n m A) 1 (Vˆ 2 ) 2 = 2( n m A) 1 (e cy V 1e y ) 2 . The proofs are straight forward.
10-3 An example for collocation Here we will focus on a special model with fixed effects and random effects, in particular with ȟ , E{z} , E{y} , Ȉ z , Ȉ y unknown, but Ȗ known. We depart in analyzing a height network observed at two epochs. At the initial epoch three height differences have been observed. From the first epochs to the second epoch we assume height differences which change linear in time, for instance caused by an Earthquake. There is a height varying model • hDE (W ) = hDE (0) + hDE (0)W + O (W 2 ),
387
10-3 An example for collocation
where W notes the time difference from the first epoch to the second epoch, related to the height difference. Unknown are the fixed height differences hDE and • the expected values of the random height difference velocities hDE . Given is the singular dispersion matrix of height difference measurements. Alternative estimation and prediction data are of type (V + CZCc) -BLUUE for the unknown parameter ȟ of height difference at initial epoch and the expected data E{z} of stochastic height difference velocities z, of type (V + CZCc) -BLUUE of the expected data E{y} of height difference observations y, of type e y of the empirical error vector of observations and of type (V + CZCc) -BLUUP for the stochastic vector z of height difference velocities. For the unknown variance component V 2 of height difference observations we use estimates of type BIQUUE. In detail, our model assumptions are epoch 1 ª hDE º ª 1 0 º ªh º E{«« hEJ »»} = « 0 1 » « DE » « » hEJ «¬ hJD »¼ ¬ 1 1 ¼ ¬ ¼ epoch 2 ª hDE º ª hDE º ª 1 0 º ª W 0 º « hEJ » » « « « » » « » E{« hEJ »} = 0 1 0 W • « »« » « E{hDE }» «¬ hJD »¼ ¬ 1 1 ¼ ¬ W W ¼ « • » ¬« E{hEJ }¼» epoch 1 and 2 ª hDE º ª 1 0 º ª0 « h » « » «0 EJ 0 1 « » « » ªh º « h « » 1 1 » DE 0 E{« JD »} = « +« kDE 1 0 » «¬ hEJ »¼ « W « « » « » «0 « k EJ » « 0 1 » « W « k » ¬ 1 1 ¼ ¬ ¬ JD ¼ ª hDE º ª1 « h » «0 EJ « » « « hJD » 1 A := « y := « » , kDE «1 « » «0 « k EJ » « 1 « k » ¬ ¬ JD ¼ ªh º ȟ := « DE » ¬ hEJ ¼
0º 0» • » 0 » ª E{hDE }º • » 0 » «« E{hEJ }¼» ¬ W» W »¼ 0º 1» » 1» , 0» 1» 1 »¼
388
10 The fifth problem of probabilistic regression
ª 0 0º « 0 0» « » 0 0» C := « , « W 0» « 0 W» « W W » ¬ ¼
• ª E{hDE }º E{z} = « • » ¬« E{hEJ }¼»
rank identities rk A=2, rk C=2, rk [A,C]=m+l=4. The singular dispersion matrix D{y} = VV 2 of the observation vector y and the singular dispersion matrix D{z} = ZV 2 are determined in the following. We separate 3 cases. (i) rk V=6, rk Z=1 V = I6 , Z =
1 ª1 1º W 2 «¬1 1»¼
(ii) rk V=5, rk Z=2 V = Diag(1, 1, 1, 1, 1, 0) , Z =
1 I 2 , rk(V +CZCc)=6 W2
(iii) rk V=4, rk Z=2 V = Diag(1, 1, 1, 1, 0, 0) , Z =
1 I 2 , rk (V +CZCc)=6 . W2
In order to be as simple as possible we use the time interval W=1. With the numerical values of matrix inversion and of “Schur-complements”, e.g. Table 1: (V +CZCc)-1 Table 2: {I n -A[Ac(V +CZCc)-1 A]-1 Ac(V +CZCc)-1 } Table 3: {I n -C[Cc(V +CZCc)-1C]-1Cc(V +CZCc)-1 } Table 4: “Schur-complements” SA, SC Table 5: vectors sA, sC 1st case:
n n ȟˆ , D{ȟˆ} , E {z} , D{E {y}}
389
10-3 An example for collocation
1 ª 2 1 1 0 0 0 º 1 ª 2 y + y2 y3 º , ȟˆ = « y= « 1 » 3 ¬1 2 1 0 0 0¼ 3 ¬ y1 + 2 y2 + y3 »¼
V 2 ª2 1º , D{ȟˆ} = 3 «¬1 2 »¼ ª 2 y1 y2 + y3 + 2 y4 + y5 y6 º ª 2 1 1 2 1 1º n , E {z} = « y=« » 1 2 1 1 2 1 ¬ ¼ ¬ y1 2 y2 y3 + y4 + 2 y5 + y6 »¼ 2
V n D{E {z}} = 3 2nd case:
ª7 5º «¬ 5 7 »¼ ,
n n ȟˆ , D{ȟˆ} , E {z} , D{E {z}}
1 ª 2 1 1 0 0 0 º 1 ª 2 y + y2 y3 º , ȟˆ = « y= « 1 » 3 ¬1 2 1 0 0 0¼ 3 ¬ y1 + 2 y2 + y3 »¼
V 2 ª2 1º , D{ȟˆ} = 3 «¬1 2 »¼ ª 4 2 2 3 3 3º n E {z} = « y, ¬ 2 4 2 3 3 3 »¼
V 2 ª13 5 º n , D{E {z}} = 6 «¬ 5 13»¼ 3rd case:
n n {z} , D{E {z}} ȟˆ , D{ȟˆ} , E
1 ª 2 1 1 0 0 0 º 1 ª 2 y + y2 y3 º , ȟˆ = « y= « 1 » 1 2 1 0 0 0 3¬ 3 ¬ y1 + 2 y2 + y3 »¼ ¼
V 2 ª2 1º , D{ȟˆ} = 3 «¬1 2 »¼ ª 2 1 1 3 0 0 º 1 ª 2 y1 y2 + y3 + 3 y4 º n , E {z} = « ¬ 1 2 1 0 3 0 »¼ 3 «¬ y1 2 y2 y3 + 3 y5 »¼ 2
V n D{E {z}} = 3
ª5 1 º «¬1 5»¼ .
Table 1: Matrix inverse (V +CZCc)-1 for a mixed Gauss-Markov model with fixed and random effects V +CZCc
(V +CZCc)-1
390
10 The fifth problem of probabilistic regression
1st case
ª1 «0 « «0 «0 «0 «0 ¬
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 2 1 0
0 0 0 1 2 0
0º 0» » 0» 0» 0» 1 »¼
ª3 «0 1 ««0 3 «0 «0 «0 ¬
0 3 0 0 0 0
0 0 0 0 3 0 0 2 0 1 0 0
0 0 0 1 2 0
0 º 0 » » 0 » 0» 0» 3»¼
0 4 0 0 0 0
0 0 0 0 4 0 0 3 0 1 0 2
0 0 0 1 3 2
0 0 0
2nd case
ª1 «0 « «0 «0 «0 «0 ¬
0 1 0 0 0 0
0 0 0 0 1 0 0 2 0 0 0 1
0 0 0 0 2 1
0 º 0 » » 0 » 1» 1» 2 »¼
ª4 «0 1 «« 0 4 «0 «0 «0 ¬
3rd case
ª1 «0 « «0 «0 «0 «0 ¬
0 1 0 0 0 0
0 0 0 0 1 0 0 1 0 0 0 1
0 0 0 0 1 1
0 º 0 » » 0 » 1» 1» 3 »¼
ª1 «0 « «0 «0 «0 «0 ¬
0 1 0 0 0 0
0 0 0 0 1 0 0 2 0 1 0 1
0 0 0 1 2 1
º » » » 2» 2 » 4 »¼
0 º 0 » » 0 » 1» 1» 1 »¼
Table 2: Matrices {I n +[ Ac( V +CZCc) -1 A]1 Ac(V +CZCc) -1} for a mixed Gauss-Markov model with fixed and random effects {I n -A[ A c(V +CZCc) -1 A]-1 Ac(V +CZCc) -1 } 1st case
ª 13 7 « 7 13 1 «« 4 4 24 « 11 7 « 7 11 « 4 4 ¬
4 4 16 4 4 8
5 1 4 19 1 4
1 5 4 1 19 4
4º 4 » » 8 » 4» 4 » 16 »¼
2nd case
6 ª 13 5 « 5 13 6 1 «« 6 6 12 6 24 « 11 5 « 5 11 6 « 6 6 12 ¬
4 4 0 20 4 0
4 4 0 4 20 0
3º 3 » » 6 » 3» 3 » 18 »¼
391
10-3 An example for collocation
3rd case
ª5 « 1 1 «« 2 8 « 3 « 1 «2 ¬
1 2 3 1 0 º 5 2 1 3 0 » » 2 4 2 2 0 » 1 2 5 1 0 » 3 2 1 5 0 » 2 4 2 2 8 »¼
Table 3: Matrices {I n C[Cc(V +CZCc)-1Cc]1 Cc(V +CZCc)-1 } for a mixed Gauss-Markov model with fixed and random effects {I n -C[Cc(V +CZCc)-1C]-1Cc(V +CZCc)-1 } 1st case
ª3 «0 1 «« 0 3 «0 «0 «0 ¬
2nd case
ª2 «0 « «0 «0 «0 «0 ¬
3rd case
ª1 «0 « «0 «0 «0 «0 ¬
0 3 0 0 0 0 0 2 0 0 0 0
0 0 0 0 0 0 3 0 0 0 1 1 0 1 1 0 0 0
0º 0» » 0» 1» 1» 0»¼
0 0 0 0º 0 0 0 0» » 2 0 0 0» 0 1 1 1 » 0 1 1 1» 0 0 0 0 »¼
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 1 1
0º 0» » 0» 0» 0» 1 »¼
Table 4: “Schur-complements” SA, SC , three cases, for a mixed Gauss-Markov model with fixed and random effects SA (10.10)
S A1
SC (10.12)
SC1
392
10 The fifth problem of probabilistic regression
1st case
ª 2 1º «¬ 1 2 »¼
1 ª2 1º 3 «¬1 2 »¼
1 ª 7 5º 8 «¬ 5 7 »¼
1 ª7 5º 3 «¬ 5 7 »¼
2nd case
ª 2 1º «¬ 1 2 »¼
1 ª2 1º 3 «¬1 2 »¼
1 ª13 5º 24 «¬ 5 13 »¼
1 ª13 5 º 6 «¬ 5 13»¼
3rd case
ª 2 1º «¬ 1 2 »¼
1 ª2 1º 3 «¬1 2 »¼
1 ª 5 1º 8 «¬ 1 5 »¼
1 ª5 1 º 3 «¬1 5»¼
Table 5: Vectors sA and sC, three cases, for a mixed Gauss-Markov model with fixed and random effects sA sC (10.11) (10.13) 1st case
ª1 0 1 0 0 0 º «0 1 1 0 0 0 » y ¬ ¼
9 3 12 º 1 ª 9 3 12 y « 8 ¬ 3 9 12 3 9 12 »¼
2nd case ª1 0 1 0 0 0 º «0 1 1 0 0 0 » y ¬ ¼
1 ª 7 1 6 4 4 9 º y 24 «¬ 1 7 6 4 4 9 »¼
3rd case
1 ª 3 1 2 5 1 0 º y 8 «¬ 1 3 2 1 5 0 »¼
ª1 0 1 0 0 0 º «0 1 1 0 0 0 » y ¬ ¼
First case: n n ȟˆ , D{ȟˆ} , E {z} , D{E {z}} Second case: n n ȟˆ , D{ȟˆ} , E {z} , D{E {z}} Third case: n n ȟˆ , D{ȟˆ} , E {z} , D{E {z}}. Here are the results of computing n n E {y} , D{E {z}}, e y and D{e y } , ordered as case 1, case 2, and case 3.
393
10-3 An example for collocation
Table 6: Numerical values, Case 1 n n {y} , D{E {y}} , e y , D{e y } 1st case: E 1
ª2 «1 « 1 n E {y} = « «0 «0 «0 ¬
1 1 0 2 1 0 1 2 0 0 0 2 0 0 1 0 0 1
ª2 «1 2 « V « 1 n D{E {y}} = 3 «0 «0 «0 ¬ ª 1 « 5 1 «« 8 e y = 12 « 11 «7 « 4 ¬
0 0º ª 2 y1 + y2 y3 º « y1 + 2 y2 + y3 » » 0 0 « » » y + 2 y2 + 2 y3 » 0 0» y=« 1 1 1» « 2 y4 + y5 y6 » « y4 + 2 y5 y6 » » 2 1 « y + y + 2y » » 1 2¼ 5 6 ¼ ¬ 4 1 1 0 2 1 0 1 2 0 0 0 5 0 0 4 0 0 1
0 0º 0 0» » 0 0» 4 1» 5 1» 1 2 »¼
5 8 5 1 4 º 1 8 1 5 4» » 8 4 4 4 8» 7 4 7 11 8 » 11 4 11 7 8» 4 8 8 8 4 »¼
ª 1 1 1 « 1 1 1 2 « V « 1 1 1 D{e y } = 3 «0 0 0 «0 0 0 «0 0 0 ¬
0 0 0 0 0 0 1 1 1 1 1 1
0º 0» » 0» . 1» 1» 1 »¼
Table 7 Numerical values, Case 2 n n {y} , D{E {y}} , e y , D{e y } 2nd case: E 1
ª4 «2 « 2 1 n E {y} = « 6« 0 «0 ¬0
2 2 4 2 2 4 0 0 0 0 0 0
0 0 0 3 3 0
0 0º 0 0» 0 0» 3 3» » 3 3» 0 6¼
394
10 The fifth problem of probabilistic regression
ª2 «1 2 « 1 V n D{E {y}} = « 3 «0 «0 ¬0
1 1 0 2 1 0 1 2 0 0 0 5 0 0 4 0 0 1
0 0º 0 0» 0 0» 4 1» » 5 1» 1 2¼
ª 2 y1 2 y2 + 2 y3 º ª 2 2 2 0 0 0 º « 2 y1 + 2 y2 2 y3 » « 2 2 2 0 0 0 » « » 1 «« 2 2 2 0 0 0 »» 2 y 2 y2 + 2 y3 » e y = y=« 1 « 3 y4 3 y5 + 3 y6 » 6 « 0 0 0 3 3 3 » « 3 y4 3 y5 + 3 y6 » « 0 0 0 3 3 3» « » «0 0 0 0 0 0» ¬ ¼ ¬ 3 y4 + 3 y5 3 y6 ¼ ª 2 2 2 0 0 « 2 2 2 0 0 2 « V « 2 2 2 0 0 D{e y } = 6 « 0 0 0 3 3 « 0 0 0 3 3 «0 0 0 0 0 ¬
0º 0» » 0» 0» 0» 0 »¼
n n {y} , D{E {y}} , e y , D{e y } 1st case: E ª2 «1 « 1 n E {y} = « «0 «0 «0 ¬
1 1 0 2 1 0 1 2 0 0 0 2 0 0 1 0 0 1
ª2 «1 2 « V « 1 n D{E {y}} = 3 «0 «0 «0 ¬ ª 1 « 5 1 «8 e y = « 12 « 11 «7 « 4 ¬
0 0º ª 2 y1 + y2 y3 º « y1 + 2 y2 + y3 » 0 0» « » » y + 2 y2 + 2 y3 » 0 0» y=« 1 1 1» « 2 y4 + y5 y6 » « y4 + 2 y5 y6 » » 2 1 « y + y + 2y » » 1 2¼ 5 6 ¼ ¬ 4 1 1 0 2 1 0 1 2 0 0 0 5 0 0 4 0 0 1
0 0º 0 0» » 0 0» 4 1» 5 1» 1 2 »¼
5 8 5 1 4 º 1 8 1 5 4» » 8 4 4 4 8» 7 4 7 11 8 » 11 4 11 7 8» 4 8 8 8 4 »¼
395
10-3 An example for collocation
ª 1 1 1 « 1 1 1 2 « V « 1 1 1 D{e y } = 3 «0 0 0 «0 0 0 «0 0 0 ¬
0 0 0 0 0 0 1 1 1 1 1 1
0º 0» » 0» . 1» 1» 1 »¼
Table 8 Numerical values, Case 3 n n {y} , D{E {y}} , e y , D{e y } 3rd case: E ª0 0 0 0 «0 0 0 0 1« 0 0 0 0 n E {y} = « 3 « 2 1 1 0 « 1 2 1 0 « 1 1 0 3 ¬ ª2 «1 2 « V « 1 n D{E {y}} = 3 «0 «0 «0 ¬ ª 1 1 1 « 1 1 1 1 « 1 1 1 e y = « 3« 0 0 0 «0 0 0 «0 0 0 ¬
0 0 0 0 0 0
1 1 0 2 1 0 1 2 0 0 0 3 0 0 0 0 0 3 0 0 0 0 0 0
0 0 0 0 3 3
0º 0» » 0» 0» 0» 0»¼ 0 0º 0 0» » 0 0» 0 3» 3 3» 3 6 »¼
0º ª y1 y2 + y3 º « y1 + 2 y2 y3 » 0» » « » 0» y y2 + y3 » y=« 1 0» 0 « » « » 0» 0 » « » 0¼ 0 ¬ ¼
ª 1 1 1 « 1 1 1 V 2 «« 1 1 1 D{e y } = 3 «0 0 0 «0 0 0 «0 0 0 ¬
0 0 0 0 0 0
0 0 0 0 0 0
0º 0» » 0» . 0» 0» 0 »¼
396
10 The fifth problem of probabilistic regression
Table 9 Data of type z , D{z} for 3 cases 1st case: 1 ª 2 1 1 2 1 1º z = « y, 3 ¬ 1 2 1 1 2 2 »¼ D{z} =
V ª7 5 º , 3 «¬ 5 7 »¼
2nd case: 1 ª 4 2 2 3 3 3º z = « y, 3 ¬ 2 4 2 3 3 3 »¼ D{z} =
V ª13 5 º , 6 «¬ 5 13»¼
3rd case: 1 ª 2 1 1 3 0 0 º z = « y, 3 ¬ 1 2 1 0 3 0 »¼ D{z} =
V ª5 1 º , 3 «¬1 5»¼
Table 10 Data of type Vˆ 2 , D{Vˆ 2 }, Dˆ {Vˆ 2 } for 3 cases 1st case: n=6, m=2, A = 2 , n m A = 2 ª7 « 1 1 « 2 Vˆ 2 = y c « 12 « 5 « 1 «4 ¬
1 7 2 1 5 4
2 2 10 4 4 8
ª7 « 1 « 1 2 Dˆ {Vˆ 2 } = {y c « 144 « 5 « 1 «4 ¬
5 1 4 7 1 2 1 7 2 1 5 4
1 5 4 1 7 2 2 2 10 4 4 8
4º 4 » » 8 » y , D{Vˆ 2 } = V 4 , 2 » 2» 10 »¼ 5 1 4 7 1 2
1 5 4 1 7 2
2nd case: n=6, m=2, A = 2 , n m A = 2
4º 4 » » 8 » 2 y} , 2 » 2» 10 »¼
397
10-4 Comments
ª 2 2 2 0 0 0 º « 2 2 2 0 0 0 » 1 « 2 2 2 0 0 0 » 2 4 2 Vˆ = y c « » y , D{Vˆ } = V , 12 « 0 0 0 3 3 3 » « 0 0 0 3 3 3» ¬ 0 0 0 3 3 3 ¼ ª2 « 2 «2 1 2 ˆ D{Vˆ } = {y c « 144 « 0 «0 ¬0 3rd case: n=6, m=2,
2 2 0 0 0 º 2 2 0 0 0 » 2 2 0 0 0 » 2 y} , 0 0 3 3 3 » » 0 0 3 3 3» 0 0 3 3 3 ¼ A = 2, nmA = 2
ª 1 1 1 « 1 1 1 « 1 1 1 1 Vˆ 2 = y c « 6 «0 0 0 «0 0 0 ¬0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0º 0» 0» y , D{Vˆ 2 } = V 4 , 0» » 0» 0¼
ª 1 1 1 « 1 1 1 « 1 1 1 1 2 Dˆ {Vˆ } = {y c « 144 « 0 0 0 «0 0 0 ¬0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0º 0» 0» 2 y} . 0» » 0» 0¼
Here is my journey’s end.
10-4 Comments (i) In their original contribution A. N. Kolmogorov (1941 a, b, c) and N. Wiener (“yellow devil”, 1939) did not depart from our general setup of fixed effects and random effects. Instead they departed from the model q
“ yPD = cPD + ¦ cPD QE yQE + E =1
q
¦c E J , =1
PD QE QJ
yQE yQJ + O ( y 3 ) ”
of nonlinear prediction, e. g. homogeneous linear prediction yP = yQ cPQ + yQ cPQ + " + yQ cPQ yP = yQ cP Q + yQ cP Q + " + yQ cP Q … yP = yQ cP Q + yQ cP Q + " + yQ cP Q 1
1
1 1
2
1
2
q
1
2
1
2 1
2
2
2
q
2
p 1
p 1 1
1
p 1
2
q
q
p 1 q
q
2
yP = yQ cP Q + yQ cP Q + " + yQ cP Q . p
1
p
1
2
p
2
q
p
q
398
10 The fifth problem of probabilistic regression
From given values of “random effects” ( yQ ," , yQ ) other values of “random effects” ( yP ," , yP ) have been predicted, namely under the assumption of “equal correlation” P of type p
1
p
1
P1
P2 = Q1
Q2“
n E { yP } = C(CȈ y1 C) 1 CȈ y1 yP P
P
n D{E { yP }} = C(CȈ y1 C) 1 Cc P
or for all yP { yP ," , yP } 1
P
|| yP yˆ P ||:= E{( yP yˆ P )}2 d min , Qq
E{( yP yˆ P ) 2 } = E{( y2i ¦ y1c j cij ) 2 } = min( KW ) Q1
for homogeneous linear prediction. “ansatz” E{ yP E{ yˆ P }} := 0 , E{( yP E{ yˆ P })( yP E{ yˆ P })} = cov( yP , yP ) 1
E{( yP yˆ P ) 2 } = cov( P, P )
Q = Qq
¦c
pj
Q = Q1
1
2
2
1
2
cov( P, Q j ) + ¦¦ c pj c pk cov(Q j , Qk ) = min Qj
Qk
“Kolmogorov-Wiener prediction” k =q
c pj ( KW ) = ¦ [ cov(Q j , Qk )]1 cov(Qk , P) k =1
q
q
E{( yP yˆ P ) 2 | KW } = cov( P, P) ¦¦ cov( P, Q j ) cov( P, Qk )[cov(Q j , Qk )]1 j =1 k =1
constrained to | Q j Qk | Qk
Qj cov(Q j , Qk ) = cov(Q j Qk )
“ yP E{ yP } is weakly translational invariant” KW prediction suffers from the effect that “apriori” we know only one realization of the random function y (Q1 )," , y (Qq ) , for instance.
10-4 Comments
399 1 n cov( W) = NG
¦
( yQ E{ yQ })( yQ E{ yQ }). j
j
k
k
|Q j Qk |
Modified versions of the KW prediction exist if we work with random fields in the plane (weak isotropy) or on the plane (rotational invariances). As a model of “random effects” we may write
¦c
PD QE
E{ yQE } = E{ yPD } CE{z} = E{ y}.
QE
(ii) The first model applies if we want to predict data of one type to predict data of the same type. Indeed, we have to generalize if we want to predict, for instance, vertical deflections, gravity gradients, gravity values,
from gravity disturbances.
The second model has to start from relating one set of heterogeneous data to another set of heterogeneous data. In the case we have to relate the various data sets to each other. An obvious alternative setup is
¦c
1 PD QE
QE
E{z1QE } + ¦ c2 PD RJ E{z2 RJ } + " = E{ yPD } . RJ
(iii) The level of collocation is reached if we include a trend model in addition to Kolmogorov-Wiener prediction, namely E{y} = Aȟ + CE{z} , the trend being represented by Aȟ . The decomposition of „trend“ and “signal” is well represented in E. Grafarend (1976), E. Groten (1970), E. Groten and H. Moritz (1964), S. Heitz (1967, 1968, 1969), S. Heitz and C.C. Tscherning (1972), R.A. Hirvonen (1956, 1962), S. K. Jordan (1972 a, b, c, 1973), W.M. Kaula (1959, 1963, 1966 a, b, c, 1971), K.R. Koch (1973 a, b), K.R. Koch and S. Lauer (1971), L. Kubackova (1973, 1974, 1975), S. Lauer (1971 a, b), S.L. Lauritzen (1972, 1973, 1975), D. Lelgemann (1972, 1974), P. Meissl (1970, 1971), in particular H. Moritz (1961, 1962, 1963 a, b, c, d, 1964, 1965, 1967, 1969, 1970 a, b, c, d, e, 1971, 1972, 1973 a, b, c, d, e, f, 1974 a, b, 1975), H. Moritz and K.P. Schwarz (1973), W. Mundt (1969), P. Naicho (1967, 1968), G. Obenson (1968, 1970), A.M. Obuchow (1947, 1954), I. Parzen (1960, 1963, 1972), L.P. Pellinen (1966, 1970), V.S. Pugachev (1962), C.R. Rao (1971, 1972, 1973 a, b), R. Rupp (1962, 1963, 1964 a, b, 1966 a, b, c, 1972, 1973 a, b, c, 1974, 1975), H. P. Robertson (1940), M. Rosenblatt (1959, 1966), R. Rummel (1975 a, b), U. Schatz (1970), I. Schoenberg (1942), W. Schwahn (1973, 1975), K.P. Schwarz (1972,
400
10 The fifth problem of probabilistic regression
1974 a, b, c, 1975 a, b, c), G. Seeber (1972), H.S. Shapiro (1970), L. Shaw et al (1969), L. Sjoeberg (1975), G. N. Smith (1974), F. Sobel (1970), G.F. Taylor (1935, 1938), C. Tscherning (1972, a, b, 1973, 1974 a, b, 1975 a, b, c, d, e), C. Tscherning and R.H Rapp (1974), V. Vyskocil (1967, 1974 a, b, c), P. Whittle (1963 a, b), N. Wiener (1958, 1964), H. Wolf (1969, 1974), E. Wong (1969), E. Wong and J.B. Thoma (1961), A.M. Yaglom (1959, 1961). (iv) An interesting comparison is the various solutions of type •
ˆ yˆ 2 = lˆ + Ly 1
•
ˆ yˆ 2 = Ly 1
(best homogeneous linear prediction)
•
ˆ yˆ 2 = Ly 1
(best homogeneous linear unbiased prediction)
(best inhomogeneous linear prediction)
dispersion identities D3 d D2 d D4 . (v) In spite of the effect that “trend components” and “KW prediction” may serve well the needs of an analyst, generalizations are obvious. For instance, in Krige’s prediction concept it is postulated that only || y p y p ||2 = : E{( y p y p ( E{ y p } E{ y p })) 2 } 1
2
1
2
1
2
is a weakly relative translational invariant stochastic process. A.N. Kolmogorov has called the weakly relation translational invariant random function “structure function” Alternatively, higher order variance-covariance functions have been proposed: || y1 , y2 , y3 ||2 = : E{( y1 2 y2 + y3 ) ( E{ y1 } 2 E{ y2 } + E{ y3 }) 2 } || y1 , y2 , y3 , y4 ||2 = : E{( y1 3 y2 + 3 y3 y4 ) ( E{ y1} 3E{ y2 } + 3E{ y3 } E{ y 4 }) 2 } etc.. (vi) Another alternative has been the construction of higher order absolute variancecovariance functions of type || ( yQ E{ yQ }) ( yQ E{ yQ }) ( yQ E{ yQ }) || 1
1
2
2
3
3
|| ( yQ E{ yQ }) (...) ( yQ E{ yQ }) || 1
1
n
n
like in E. Grafarend (1984) derived from the characteristic functional, namely a series expression of higher order variance-covariance functions.
11 The sixth problem of probabilistic regression – the random effect model – “errors-in-variables” “In difference to classical regression error-in-variables models here measurements occurs in the regressors. The naive use of regression estimators leads to severe bias in this situation. There are consistent estimators like the total least squares estimator (TLS) and the moment estimator (MME). J. Polzehl and S. Zwanzig Department of Mathematics Uppsala University, 2003 A.D.” Read only Definition 11.1 and Lemma 11.2 Please, pay attention to the guideline of Chapter 11, namely to Figure 11.1, 11.2 and 11.3, and the Chapter 11.3: References
Definition 11.1 “random effect model: errors-in variables”
Lemma 11.2 ”error-in-variables model, normal equations”
Figure 11.1: Magic triangle
402
11 The sixth problem of probabilistic regression
By means of Figure 11.1 we review the mixed model (fixed effects plus random effects), total least squares (fixed effects plus “errors-in-variables”) and a special type of the mixed model which superimposes random effects and “errors-invariables”. Here we will concentrate on the model “errors-in-variables”. In the context of the general probabilistic regression problem E{y} = Aȟ + &E{]} + E{;}Ȗ , we specialize here to the model “errors in variables”, namely E{y} = E{X}Ȗ in which y as well as X (vector y and matrix X) are unknown. A simple example is the straight line fit, abbreviated by “y = ax + b1”. (x, y) is assumed to be measured, in detail E{y} = aE{x} + b1 = ªa º = [ E{x}, 1] « » ¬b ¼ 2 E{(x E{x}) } z 0, E{(y E{y})2 } z 0 but Cov{x, y} = 0. Note Ȗ1 := a, Ȗ 2 := b E{y} = y ey , E{x} = x ex Cov{ex , ey } = 0 and ªȖ º y e y = [x e x , 1] « 1 » ¬Ȗ2 ¼ y = xȖ1 + 1Ȗ 2 e x Ȗ1 + e y .
403
11 The sixth problem of probabilistic regression
constrained by the Lagrangean
L ( Ȗ1 , Ȗ 2 , e x , e y , Ȝ ) =: = ecx Px e x + ecy Py e y + 2Ȝ c( y xȖ1 1Ȗ 2 + e x Ȗ1 e y ) = =
min
Ȗ1 , Ȗ 2 , ex , e y , Ȝ
.
The first derivatives constitute the necessary conditions, namely 1 wL 2 wȖ1 1 wL 2 wȖ 2 1 wL 2 we x 1 wL 2 we y 1 wL 2 wȜ
= x cȜ = 0, = 1cȜ = 0, = Px e x + Ȗ1Ȝ = 0, = Py e y 2 = 0, = y xȖ1 1Ȗ 2 + e x Ȗ1 e y = 0,
while the second derivatives “ z 0 ” refer the sufficiency conditions. Figure 11.2 is a geometric interpretation of the nonlinear model of type “errorsin-variables”, namely the straight line fit of total least squares.
y
• • •
( E{x}, E{ y}) •
•
ey
• •
ex
P ( x, y )
•
x
Figure11.2. The straight line fit of total least squares ( E{y} = a E{x} + 1b, E{x} = x e x , E{y} = y e y )
404
11 The sixth problem of probabilistic regression
An alternative model for total least squares is the rewrite E{y} = y e y = Ȗ1 E{x} + 1Ȗ 2 y 1Ȗ 2 = Ȗ1 E{x} + e y E{x} = x e x
x = E{x} + e x , for instance
ªe y º ª y 1Ȗ 2 º ª Ȗ1, n º « x » = « I » E{x} + «e » , ¬ ¼ ¬ n ¼ ¬ x¼ we get for Ȉ - BLUUE of E{x} ª y 1Ȗ 2 º n E {x} = ( A cȈ 1A ) 1 A cȈ 1 « » ¬ x ¼ subject to ªȖ I º ªȈ A := « 1 n » , Ȉ1 = « yy ¬ 0 ¬ In ¼
1
0 º ªP =« y » Ȉxx ¼ ¬0
0º . Px »¼
We can further solve the optimality problem using the Frobenius Ȉ - seminorms: n n ª y 1Ȗ 2 Ȗ1 E {x}º 1 ª y 1Ȗ 2 Ȗ1 E {x}º « »Ȉ « » n n «¬ »¼ «¬ »¼ xE {x} xE {x} ªeˆ º ª¬eˆ cy , eˆ cx º¼ Ȉ 1 « y » = eˆ cx Px eˆ x + eˆ cy Py eˆ y = min . Ȗ ,Ȗ ¬eˆ x ¼ 1
2
11-1 Solving the nonlinear system of the model “errors-invariables” First, we define the random effect model of type “errors-in-variables” subject to the minimum condition i ycWy i y + tr I 'X WX I X = min . Second, we form the derivations, the partial derivations i ycWy i y + tr I 'X WX I X + 2Ȝ c{ y XJ + I XJ i y } , the neccessary conditions for obtaining the minimum. Definition 11.1
(the random effect model: errors-invariables)
The nonlinear model of type “errors-in-variables” is solved by “total least squares” based on the risk function (11.1)
y = E{y} + e y
(11.3)
X = E{X} + E X
~ ~ subject to
y = y0 + iy
(11.2)
X = X 0 + IX
(11.4)
11-1 Solving the nonlinear system of the model “errors-in-variables”
405
y0 \n
(11.6)
X0 \ nxm
(11.8)
(11.5)
E{y} \ n
(11.7)
E{X} \ nxm
~ rk E{X} = m ~ rk X0 = m
(11.9)
and n t m
(11.10)
L1 := ¦ wii i2 ii1 ii2
and
L1 := icy Wi y
and
L2 :=
i1 ,i2
¦
i1 , k1 , k2
wk1k2 ii1k1 ii1k2
L2 := tr I 'X WX I X
L1 + L2 = icy Wy i y + tr I 'X WX I X = min y0 ,X0
L =: || i y ||2 W + || I X ||2 W y
X
(11.11)
subject to y X Ȗ + I X Ȗ i y = 0.
(11.12)
The result of the minimization process is given by Lemma 11.2: Lemma 11.2 (error-in-variables model, normal equations): The risk function of the model “errors-in-variables” is minimal, if and only if 1 wL = X cȜ A + I'X Ȝ A = 0 2 wȖ
(11.13)
1 wL = Wy i y Ȝ A = 0 2 wi y
(11.14)
1 wL = WX I X + ȖȜ cA 2 wI X
(11.15)
1 wL = y X y + IX Ȗ i y = 0 2 wȜ
(11.16)
and det (
w2L ) t 0. wJ i wJ j :Proof:
First, we begin with the modified risk function
(11.17)
406
11 The sixth problem of probabilistic regression
|| i y ||2 W + || I X ||2 +2 Ȝ c( y XȖ + I X Ȗ i y ) = min , y
where the minimum condition is extended over y, X, i y , I X , Ȝ , when Ȝ denotes the Lagrange parameter. icy Wy i y + tr I'X WX I X + 2Ȝ '( y XȖ + I X Ȗ i y ) = min
(11.18)
if and only if 1 wL = X cȜ A + I'X Ȝ A = 0 2 wȖ 1 wL = Wy i y Ȝ A 2 wi y 1 wL = WX I X = 0 2 w (tr I'X WX I X ) 1 wL = y XȖ + I X Ȗ i y = 0 2 wȜ A and
(11.19) (11.20)
(11.21)
(11.22)
w2L positive semidefinite. wȖ wȖ ' The first derivatives guarantee the necessity of the solution, while the second derivatives being positive semidefinite assure the sufficient condition. (11.23)
Second, we have the nonlinear equations, namely (11.24)
( X c + I'X ) Ȝ A = 0
(11.25)
Wy i y Ȝ A = 0
(11.26)
WX I X = 0
(11.27)
y XȖ + I X Ȗ i y = 0 ,
(bilinear) (linear) (bilinear) (bilinear)
which is a problem outside our orbit-of-interest. An example is given in the next chapter. Consult the literature list at the end of this chapter.
11-2 Example: The straight line fit Our example is based upon the “straight line fit” " y = ax + b1 ",
407
11-2 Example: The straight line fit
where (x, y) has been measured, in detail E{y} = a E{x} + b 1 = ªa º = [ E{x}, 1] « » ¬ b¼ or ªJ º J 1 := a, J 2 := b, xJ 1 + 1J 2 = [ x, 1] « 1 » ¬J 2 ¼ and y xJ 1 1J 2 + ecx J 1 ecy = 0. ( J 1 , J 2 ) are the two unknowns in the parameter space. It has to be noted that the term exJ 1 includes two coupled unknowns, namely e x and J 1 . Second, we formulate the modified method of least squares.
L (J 1 , J 2 , e x , e y , Ȝ ) = = icWi + 2( y c x cJ 1 1J 2 + i 'x J 1 i ' y )Ȝ = = icWi + 2Ȝ c( y xJ 1 1J 2 + i 'x J 1 i y ) or i cy Wy i y + i cx Wx i x + +2(y c xcJ 1 1cJ 2 + i cxJ 1 i cy )Ȝ. Third, we present the necessary and sufficient conditions for obtaining the minimum of the modified method of least squares. (11.28) (11.29) (11.30)
1 wL = x cȜ A + icx Ȝ A = 0 2 wJ 1 1 wL = 1cȜ A = 0 2 wJ 2 1 wL = Wy i y Ȝ A = 0 2 wi y
(11.31)
1 wL = Wx i x + Ȝ lJ 1 = 0 2 wi x
(11.32)
1 wL = y xJ 1 1J 2 + i xJ i y = 0 2 wȜ and ª w 2 L / w 2J 1 det « 2 ¬w L / wJ 1wJ 2
w 2 L / wJ 1wJ 2 w 2 L / w 2J 2
º » t 0. ¼
(11.33)
408
11 The sixth problem of probabilistic regression
Indeed, these conditions are necessary and sufficient for obtaining the minimum of the modified method of least squares. By Gauss elimination we receive the results (11.34)
(-x c + icx ) Ȝ A = 0
(11.35)
Ȝ1 + ... + Ȝ n = 0
(11.36)
Wy i y = Ȝ A
(11.37)
Wx i x = Ȝ A Ȗ1
(11.38)
Wy y = Wy xJ 1 + Wy 1J 2 Wy i xJ 1 + Wy i y or Wy y = Wy xJ 1 Wy 1J 2 ( I x J 12 ) Ȝ A = 0
(11.39)
if Wy = Wx = W and
(11.40)
xcWy xcWxJ 1 xcW1J 2 xc(I x J 12 )Ȝ A = 0
(11.41)
y cWy y cWxJ 1 y cW1J 2 y c( I x J 12 ) Ȝ A = 0
(11.42)
+ x cȜ l = + i 'x Ȝ A
(11.43)
Ȝ , +... + Ȝ n = 0 ª x cº ª y cº ª x cº ª x cº 2 « y c» Wy « x c» WxJ 1 « y c» W1J 2 « y c» ( I x J 1 ) Ȝ A = 0 ¬ ¼ ¬ ¼ ¬ ¼ ¬ ¼
(11.44)
subject to Ȝ1 + ... + Ȝ n = 0,
(11.45)
x1c Ȝ l = i cx Ȝ l .
(11.46) Let us iterate the solution. ª0 «0 « «0 « «0 «¬ x n
0 0 0
0 0 Wx
0 0 0
0 1n
0 J 1I n
Wy In
(xcn i cx ) º 1cn »» J 1I n » » I n » 0 »¼
ªJ1 º ª 0 º «J » « 0 » « 2» « » « ix » = « 0 » . « » « » « iy » « 0 » «¬ Ȝ A »¼ «¬ y n »¼
We meet again the problem that the nonlinear terms J 1 and i x appear. Our iteration is based on the initial data
409
11-2 Example: The straight line fit
(i ) Wx = Wy = W = I n , (ii) (i x ) 0 = 0, (iii) (J 1 ) n = J 1 (y (i y ) 0 = x(J 1 ) 0 + 1n (J 2 ) 0 ) in general ª0 «0 « «0 « «0 «xn ¬
0 0 0 0 1n
0 0 Wx 0 J 1(i ) I n
0 0 0 Wy In
(xc i cx ) º 1 »» J 1(i ) I n » » I n » 0 »¼
ª J 1(i +1) º ª 0 º «J » « » « 2(i +1) » « 0 » « i x (i +1) » = « 0 » « » « » « i y (i +1) » « 0 » « Ȝ l (i +1) » « y » ¬ ¼ ¬ n¼
x1 = J 1 , x2 = J 1 , x3 = i x , x4 = i y , x5 = Ȝ A . The five unknowns have led us to the example within Figure 11.3.
(1, 0.5)
(7, 5.5) (10, 5.05) (11, 9)
Figure 11.3: General linear regression Our solutions are collected as follows:
J 1 : 0.752, 6 0.866,3 0.781, 6 J 2 : 0.152, 0 -1.201,6 -0.244, 6 case : case : our general V y (i ) = 0 V X (i ) = 0 results after V X (i ) z 0
V y (i ) z 0
iteration.
11-3 References Please, contact the following references. Abatzoglou, J.T. and Mendel, J.M. (1987),,Abatzoglou, J.T. and Mendel, J.M. and Harada, G.A. (1991), Bajen, M.T., Puchal, T., Gonzales, A., Gringo, J.M., Castelao, A ., Mora, J. and Comin, M. (1997), Berry, S.M., Carroll, R.J. and Ruppert, D. (2002), Björck, A. (1996), Björck, A. (1997), Björck, A., Elfving, T. and Strakos, Z. (1998), Björck, A., Heggernes, P. and Matstoms, P. (2000), Bojanczyk, A.W., Bront, R.P., Van Dooren, P., de Hoog, F.R. (1987), Bunch, J.R., Nielsen, C.P. and Sorensen, D.C. (1978), Bunke, H. and Bunke, O. (1989), Capderou, A., Douguet, D., Simislowski, T., Aurengo, A. and Zelter, M. (1997), Carroll, R.J., Ruppert, D. and Stefanski, L.A. (1996), Carroll, R.J., Küschenhoff, H., Lombard, F. and Stefanski, L.A. (1996), Carroll, R.J. and Stefanski, L.A. (1997), Chandrasekuran, S. and Sayed, A.H. (1996), Chun, J. Kailath, T. and Lev-Ari, H. (1987), Cook, J.R. and Stefanski, L.A. (1994), Dembo, R.S., Eisenstat, S.C. and Steihaug, T. (1982), De Moor, B. (1994), Fuller, W.A. (1987), Golub, G.H. and Van Loan, C.F. (1980), Hansen, P.C. (1998), Higham, N.J. (1996), Holcomb, J. P. (1996), Humak, K.M.S. (1983), Kamm, J. and Nagy, J.G. (1998), Kailath, T. Kung, S. and More, M. (1979), Kailath, T. and Chun, J. (1994), Kailath, T. and Sayed, A.H.(1995), Kleming, J.S. and Goddard, B.A. (1974), Kung, S.Y., Arun, K.S. and Braskar Rao, D.V. (1983), Lemmerling, P., Van Huffel, S. and de Moor, B. (1997), Lemmerling, P., Dologlou, I. and Van Huffel, S. (1998), Lin, X. and Carroll, R.J. (1999), Lin, X. and Carroll, R.J. (2000), Mackens, W. and Voss, H. (1997), Mastronardi, N, Van Dooren, P. and Van Huffel, S., Mastronardi, N., Lemmerling, P. and Van Huffel, S. (2000), Nagy, J.G. (1993), Park, H. and Elden, L. (1997), Pedroso, J.J. (1996), Polzehl, J. and Zwanzig, S. (2003), Rosen, J.B., Park, H. and Glick, J. (1996), Stefanski, L.A. and Cook, J.R. (1995), Stewart, M. and Van Dooren, P. (1997), Van Huffel, S. (1991), Van Huffel, S. and Vandewalle, J. (1991), Van Huffel, S., Decanniere, C., Chen, H. and Van Hecke, P.V. (1994), Van Huffel, S., Park, H. and Rosen, J.B. (1996), Van Huffel, S., Vandewalle, J., de Rao, M.C. and Willems, J.L.,Wang, N., Lin, X., Gutierrez, R.G. and Carroll, R.J. (1998), and Yang, T. (1996).
12 The sixth problem of generalized algebraic regression – the system of conditional equations with unknowns – (Gauss-Helmert model) C.F. Gauss and F.R. Helmert introduced the generalized algebraic regression problem which can be identified as a system of conditional equations with unknowns. :Fast track reading: Read only Lemma 12.2, Lemma 12.5, Lemma 12.8
Definition 12.1: W-LESS of Ax + Bi = By
Lemma 12.2: normal equations of W-LESS: Ax + Bi = By
Lemma 12.3: relation between A and B

“The guideline of chapter twelve: first definition and lemmas”

Definition 12.4: R, W-MINOLESS of Ax + Bi = By
Lemma 12.5: R, W-MINOLESS: Ax + Bi = By
Lemma 12.6: relation between A and B

“The guideline of chapter twelve: second definition and lemmas”
Definition 12.7: R, W-HAPS of Ax + Bi = By
Lemma 12.8: normal equations of R, W-HAPS: Ax + Bi = By

“The guideline of chapter twelve: third definition and lemmas”

The inconsistent linear system Ax + Bi = By, called generalized algebraic regression with unknowns or homogeneous Gauß-Helmert model, will be characterized by certain solutions which we present in Definition 12.1, Definition 12.4 and Definition 12.7, solving special optimization problems. Because of rk B = q there holds automatically ℛ(A) ⊆ ℛ(B). Lemma 12.2, Lemma 12.5 and Lemma 12.8 contain the normal equations of these special optimization problems. Lemma 12.3 and Lemma 12.6 refer to special solutions as linear forms of the observation vector, which are characterized by products of certain generalized inverses of the coefficient matrices A and B of the condition equations. In addition, we compare R, W-MINOLESS and R, W-HAPS by a special lemma. As examples we treat a height network which is characterized by absolute and relative height difference measurements, called “leveling”, of type I-LESS, I, I-MINOLESS, I, I-HAPS and R, W-MINOLESS.

Definition 12.9: W-LESS of Ax + Bi = By − c
Lemma 12.10: normal equations of W-LESS: Ax + Bi = By − c
Lemma 12.11: relation between A and B
“The guideline of Chapter twelve: fourth definition and lemmas”
Definition 12.12: R, W-MINOLESS of Ax + Bi = By − c
Lemma 12.13: R, W-MINOLESS: Ax + Bi = By − c
Lemma 12.14: relation between A and B

“The guideline of chapter twelve: fifth definition and lemmas”
Definition 12.15: R, W-HAPS of Ax + Bi = By − c
Lemma 12.16: R, W-HAPS: Ax + Bi = By − c

“The guideline of chapter twelve: sixth definition and lemmas”

The inconsistent linear system Ax + Bi = By − c – note the constant shift – called generalized algebraic regression with unknowns or inhomogeneous Gauß-Helmert model, will be characterized by certain solutions which we present in Definition 12.9, Definition 12.12 and Definition 12.15, solving special optimization problems. Because of the rank identity rk B = q there holds automatically ℛ(A) ⊆ ℛ(B). Lemma 12.10, Lemma 12.13 and Lemma 12.16 contain the normal equations of these special optimization problems. Lemma 12.11 and Lemma 12.14 refer to special solutions as linear forms of the observation vector, which can be characterized by products of certain generalized inverses of the coefficient matrices A and B of the condition equations. In addition, we compare R, W-MINOLESS and R, W-HAPS by a special lemma.
At this point we have to mention that we are not dealing with a consistent system of homogeneous or inhomogeneous condition equations with unknowns of type Ax = By, By ∈ ℛ(A), or Ax = By − c, By − c ∈ ℛ(A). For further details we refer to E. Grafarend and B. Schaffrin (1993, pages 28-34 and 54-57). We conclude with Chapter 12-4 (condition equations with unknowns, namely “bias estimation” within an equivalent stochastic model) and Chapter 12-2 (examples for the generalized algebraic regression problem: W-LESS, R, W-MINOLESS and R, W-HAPS).
12-1 Solving the system of homogeneous condition equations with unknowns

First, we solve the problem of homogeneous condition equations by the method of minimizing the W-seminorm of least squares. We review by Definition 12.1, Lemma 12.2 and Lemma 12.3 the characteristic normal equations and the linear form which build up the solution of type W-LESS. Secondly, by Definition 12.4, Lemma 12.5 and Lemma 12.6 we present R, W-MINOLESS as MInimum NOrm LEast Squares Solution (R-seminorm, W-seminorm of type least squares). Third, we alternatively concentrate by Definition 12.7 and Lemma 12.8 on R, W-HAPS (Hybrid APproximate Solution with respect to the combined R- and W-seminorm). Fourth, we compare R, W-MINOLESS and R, W-HAPS by means of the difference x_h − x_ℓm.

12-11 W-LESS
W-LESS is built on Definition 12.1, Lemma 12.2 and Lemma 12.3.

Definition 12.1 (W-LESS, homogeneous conditions with unknowns):
An m × 1 vector x_ℓ is called W-LESS (LEast Squares Solution with respect to the W-seminorm) of the inconsistent system of linear equations
Ax + Bi = By    (12.1)
with Bi_ℓ := By − Ax_ℓ, if compared to alternative vectors x ∈ ℝ^m with Bi := By − Ax the inequality
||i_ℓ||²_W := i_ℓ′Wi_ℓ ≤ i′Wi =: ||i||²_W    (12.2)
holds, in consequence if i_ℓ has the smallest W-seminorm.

The solutions of type W-LESS are computed as follows.
Lemma 12.2 (W-LESS, homogeneous conditions with unknowns: normal equations):
An m × 1 vector x_ℓ is W-LESS of (12.1) if and only if it solves the system of normal equations
[W B′ 0; B 0 A; 0 A′ 0] [i_ℓ; λ_ℓ; x_ℓ] = [0; By; 0]    (12.3)
with the q × 1 vector λ_ℓ of “Lagrange multipliers”. x_ℓ exists in the case of
ℛ(B′) ⊆ ℛ(W)    (12.4)
and is solving the system of normal equations
A′(BW⁻B′)⁻¹Ax_ℓ = A′(BW⁻B′)⁻¹By,    (12.5)
which is independent of the choice of the g-inverse W⁻ and unique. x_ℓ is unique if and only if the matrix
A′(BW⁻B′)⁻¹A    (12.6)
is regular, or equivalently, if
rk A = m    (12.7)
holds.

:Proof:
W-LESS is constructed by means of the “Lagrange function”
ℒ(i, x, λ) := i′Wi + 2λ′(Ax + Bi − By) = min over i, x, λ.
The necessary conditions for obtaining the minimum are given by the first derivatives
½ ∂ℒ/∂i (i_ℓ, x_ℓ, λ_ℓ) = Wi_ℓ + B′λ_ℓ = 0
½ ∂ℒ/∂x (i_ℓ, x_ℓ, λ_ℓ) = A′λ_ℓ = 0
½ ∂ℒ/∂λ (i_ℓ, x_ℓ, λ_ℓ) = Ax_ℓ + Bi_ℓ − By = 0.
Details for obtaining the derivatives of vectors are given in Appendix B. The second derivatives
½ ∂²ℒ/∂i∂i′ (i_ℓ, x_ℓ, λ_ℓ) = W ≥ 0
are the sufficient conditions for the minimum due to the matrix W being positive semidefinite. Due to the condition
ℛ(B′) ⊆ ℛ(W), we have WW⁻B′ = B′. As shown in Appendix A, BW⁻B′ is invariant with respect to the choice of the generalized inverse W⁻. In fact, the matrix BW⁻B′ is uniquely invertible. Elimination of the vector i_ℓ leads us to the system of reduced normal equations
[BW⁻B′ A; A′ 0] [λ_ℓ; x_ℓ] = [By; 0]
and finally, eliminating λ_ℓ, to
A′(BW⁻B′)⁻¹Ax_ℓ = A′(BW⁻B′)⁻¹By;    (12.8)
because of BW⁻W = B there follows the existence of x_ℓ. Uniqueness is assured due to the regularity of the matrix A′(BW⁻B′)⁻¹A, which is equivalent to rk A = m.
The linear forms x_ℓ = Ly, which lead to W-LESS of arbitrary observation vectors y ∈ ℝⁿ because of (12.4), can be characterized by Lemma 12.3.

Lemma 12.3 (W-LESS, relation between A and B):
Under the condition
ℛ(B′) ⊆ ℛ(W)
x_ℓ = Ly is W-LESS of (12.1) for all y ∈ ℝⁿ if and only if the matrix L obeys the condition
L = A⁻B    (12.9)
subject to
(BW⁻B′)⁻¹AA⁻ = [(BW⁻B′)⁻¹AA⁻]′.    (12.10)
In this case, the vector Ax_ℓ = AA⁻By is always unique.

12-12 R, W – MINOLESS
R, W-MINOLESS is built on Definition 12.4, Lemma 12.5 and Lemma 12.6.

Definition 12.4 (R, W-MINOLESS, homogeneous conditions with unknowns):
An m × 1 vector x_ℓm is called R, W-MINOLESS (Minimum NOrm with respect to the R-seminorm, LEast Squares Solution with respect to the W-seminorm) of the inconsistent system of linear equations (12.1) if (12.3) is consistent and x_ℓm is R-MINOS of (12.3).
In case of ℛ(B′) ⊆ ℛ(W) we can compute the solutions of type R, W-MINOLESS as follows.

Lemma 12.5 (R, W-MINOLESS, homogeneous conditions with unknowns: normal equations):
Under the assumption ℛ(B′) ⊆ ℛ(W), an m × 1 vector x_ℓm is R, W-MINOLESS of (12.1) if and only if it solves the normal equations
[R A′(BW⁻B′)⁻¹A; A′(BW⁻B′)⁻¹A 0] [x_ℓm; λ_ℓm] = [0; A′(BW⁻B′)⁻¹By]    (12.11)
with the m × 1 vector λ_ℓm of “Lagrange multipliers”. x_ℓm always exists and is uniquely determined if
rk[R, A′] = m    (12.12)
holds, or equivalently, if the matrix
R + A′(BW⁻B′)⁻¹A    (12.13)
is regular.

The proof of Lemma 12.5 is based on applying Lemma 1.2 to the normal equations (12.5). The rest is based on the identity
R + A′(BW⁻B′)⁻¹A = [R, A′] [R⁻ 0; 0 (BW⁻B′)⁻¹] [R; A].
Obviously, the condition (12.12) is fulfilled if the matrix R is positive definite, or if R describes an R-norm. The linear forms x_ℓm = Ly, which lead to the R, W-MINOLESS solutions, are characterized as follows.

Lemma 12.6 (R, W-MINOLESS, relation between A and B):
Under the assumption ℛ(B′) ⊆ ℛ(W), x_ℓm = Ly is of type R, W-MINOLESS of (12.1) for all y ∈ ℝⁿ if and only if the matrix L obeys the conditions
L = A⁻B    (12.14)
subject to
(BW⁻B′)⁻¹AA⁻ = [(BW⁻B′)⁻¹AA⁻]′    (12.15)
RA⁻AA⁻ = RA⁻    (12.16)
RA A = ( RA A ) ' is fulfilled. In this case
(12.17)
(12.18) Rx Am = RLy is always unique. In the special case that R is positive definite, the matrix L is unique, fulfilling (12.14) - (12.17). :Proof: Earlier we have shown that the representation (BW Bc)-1 AA = [(BW Bc)-1 AA ]'
(12.19)
leads us to L = A B uniquely. The general solution of the system A c(BW Bc)-1 Ax A = A c(BW Bc)-1 By
(12.20)
x A = x Am + (I [ A c(BW B c)-1 A ] A c(BW B c)-1 A ]z
(12.21)
is given by
or x A = x Am + (I LBcA )z
(12.22)
for an arbitrary m × 1 vector z , such that the related R - seminorm follows the inequality || x Am ||2R =|| L y ||R2 d|| L y ||R2 +2y cLcR (I LB A )z + || (I LB A )z ||R2 .
(12.23)
For arbitrary y R n , we have the result that LcR (I LB A) = 0
(12.24)
must be zero! Or (12.25)
RLB R AL = RL
and
RLB R A = (RLB R A )c.
(12.26)
To prove these identities we must multiply from the right by L, namely LcR = LcRLB R A
:
LcRL = LcRLB R AL.
(12.27)
Due to the fact that the left hand side is a symmetric matrix, the right-hand side must have the same property, in detail RLB R A = (RLB R A)c q.e.d . Add L = A B and taking advantage of B R BA = A , we find RA AA = RLB R ALW Bc(BW Bc) 1 = = RLW Bc(BW Bc) 1 = RA and RA A = (RA A)c .
(12.28) (12.29)
Uniqueness of x_ℓm follows automatically. In case the matrix R is positive definite and, of course, invertible, it is guaranteed that the matrix L = A⁻B is unique!

12-13 R, W – HAPS
R, W-HAPS is alternatively built on Definition 12.7 and Lemma 12.8.

Definition 12.7 (R, W-HAPS, homogeneous conditions with unknowns):
An m × 1 vector x_h with Bi_h = By − Ax_h is called R, W-HAPS (Hybrid APproximate Solution with respect to the combined R- and W-seminorm) if, compared to all other vectors x ∈ ℝ^m of type Bi = By − Ax, the inequality
||i_h||²_W + ||x_h||²_R := i_h′Wi_h + x_h′Rx_h ≤ i′Wi + x′Rx =: ||i||²_W + ||x||²_R    (12.30)
holds, in other words if the hybrid risk function ||i||²_W + ||x||²_R is minimal.

The solutions of type R, W-HAPS can be computed by Lemma 12.8
(R, W-HAPS, homogeneous conditions with unknowns: normal equations):
An m × 1 vector x_h is R, W-HAPS of the Gauß-Helmert model of condition equations with unknowns if and only if the normal equations
[W B′ 0; B 0 A; 0 A′ R] [i_h; λ_h; x_h] = [0; By; 0]    (12.31)
with the q × 1 vector λ_h of “Lagrange multipliers” are fulfilled. x_h certainly exists in case of
ℛ(B′) ⊆ ℛ(W)    (12.32)
and solves the system of normal equations
(R + A′(BW⁻B′)⁻¹A)x_h = A′(BW⁻B′)⁻¹By,    (12.33)
which is independent of the choice of the generalized inverse W⁻ and uniquely defined. x_h is uniquely defined if and only if the matrix (R + A′(BW⁻B′)⁻¹A) is regular, equivalently if
rk[R, A′] = m    (12.34)
holds.
:Proof:
R, W-HAPS is defined with the “Lagrange function”
ℒ(i, x, λ) := i′Wi + x′Rx + 2λ′(Ax + Bi − By) = min over i, x, λ.
The first derivatives
½ ∂ℒ/∂i (i_h, x_h, λ_h) = Wi_h + B′λ_h = 0    (12.35)
½ ∂ℒ/∂x (i_h, x_h, λ_h) = A′λ_h + Rx_h = 0    (12.36)
½ ∂ℒ/∂λ (i_h, x_h, λ_h) = Ax_h + Bi_h − By = 0    (12.37)
establish the necessary conditions. The second derivatives
½ ∂²ℒ/∂i∂i′ (i_h, x_h, λ_h) = W ≥ 0    (12.38)
½ ∂²ℒ/∂x∂x′ (i_h, x_h, λ_h) = R ≥ 0    (12.39)
are, due to the positive semidefiniteness of the matrices W and R, a sufficient criterion for the minimum. If in addition ℛ(B′) ⊆ ℛ(W) holds, we are able to eliminate i_h, namely to derive the reduced system of normal equations
[BW⁻B′ A; A′ −R] [λ_h; x_h] = [By; 0]    (12.40)
and, by reducing λ_h in addition,
(R + A′(BW⁻B′)⁻¹A)x_h = A′(BW⁻B′)⁻¹By.    (12.41)
Because of the identity
R + A′(BW⁻B′)⁻¹A = [R, A′] [R⁻ 0; 0 (BW⁻B′)⁻¹] [R; A],    (12.42)
we can assure the existence of our solution x_h and, in addition, the equivalence of the regularity of the matrix (R + A′(BW⁻B′)⁻¹A) with the condition rk[R, A′] = m, the basis of the uniqueness of x_h.
12-14 R, W-MINOLESS against R, W-HAPS

Obviously, R, W-HAPS with respect to the model (12.32) is unique if and only if R, W-MINOLESS is unique, because the representations (12.34) and (12.12) are identical. Let us replace (12.11) by the equivalent system
(R + A′(BW⁻B′)⁻¹A)x_ℓm + A′(BW⁻B′)⁻¹Aλ_ℓm = A′(BW⁻B′)⁻¹By    (12.43)
and
A′(BW⁻B′)⁻¹Ax_ℓm = A′(BW⁻B′)⁻¹By    (12.44)
such that the difference
x_h − x_ℓm = (R + A′(BW⁻B′)⁻¹A)⁻¹A′(BW⁻B′)⁻¹Aλ_ℓm    (12.45)
follows automatically.
12-2 Examples for the generalized algebraic regression problem: homogeneous condition equations with unknowns

As an example of the inconsistent linear equations Ax + Bi = By we treat a height network, consisting of four points whose relative and absolute heights are derived from height difference measurements according to the network graph in Chapter 9. We shall study various optimality criteria of type I-LESS, I, I-MINOLESS, I, I-HAPS, and R, W-MINOLESS with R positive semidefinite and W positive semidefinite. We use constructive details of the theory of generalized inverses according to Appendix A. Throughout we take advantage of holonomic height difference measurements, also called “gravimetric leveling”,
{h_αβ := h_β − h_α, h_γα := h_α − h_γ, h_βδ := h_δ − h_β, h_δγ := h_γ − h_δ}
within the triangles {P_α, P_β, P_γ} and {P_β, P_δ, P_γ}. In each triangle we have the holonomity condition, namely
{h_αβ + h_βγ + h_γα = 0 (h_αβ + h_βγ = −h_γα)}, {h_γβ + h_βδ + h_δγ = 0 (h_βδ + h_δγ = −h_γβ)}.
12-21 The first case: I – LESS
In the first example we order four height difference measurements into the system of linear equations
[−1; 1] h_βγ + [1 1 0 0; 0 0 1 1] [i_αβ; i_γα; i_βδ; i_δγ] = [1 1 0 0; 0 0 1 1] [h_αβ; h_γα; h_βδ; h_δγ]
as an example of homogeneous inconsistent condition equations with unknowns:
A := [−1; 1], x := h_βγ, B := [1 1 0 0; 0 0 1 1], y := [h_αβ, h_γα, h_βδ, h_δγ]′,
n = 4, m = 1, rk A = 1, rk B = q = 2,
A′(BB′)⁻¹A = 1, A′(BB′)⁻¹B = ½[−1, −1, 1, 1],
x_ℓ = (h_βγ)_ℓ = ½(−h_αβ − h_γα + h_βδ + h_δγ).
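The first case can be checked numerically. The following Python fragment is a minimal sketch of the I-LESS computation for this leveling example; the observed height differences used below are illustrative values, not data from the book.

```python
import numpy as np

# Homogeneous Gauss-Helmert model A x + B i = B y for the leveling example;
# m = 1 unknown height difference h_beta_gamma, n = 4 observations.
A = np.array([[-1.0], [1.0]])
B = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
y = np.array([1.20, -2.35, 0.95, 0.18])      # [h_ab, h_ga, h_bd, h_dg], assumed values

BBinv = np.linalg.inv(B @ B.T)
N = A.T @ BBinv @ A                          # A'(BB')^{-1}A (a 1x1 matrix)
rhs = A.T @ BBinv @ B @ y                    # A'(BB')^{-1}By
x_l = np.linalg.solve(N, rhs)                # I-LESS solution

# Closed form of the text: x_l = (-h_ab - h_ga + h_bd + h_dg) / 2
print(x_l, (-y[0] - y[1] + y[2] + y[3]) / 2)
```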
12-22 The second case: I, I – MINOLESS

In the second example, we solve I, I-MINOLESS for the problem of four height difference measurements associated with the system of linear equations
[1 −1; −1 1] [h_β; h_γ] + [1 1 0 0; 0 0 1 1] [i_αβ; i_γα; i_βδ; i_δγ] = [1 1 0 0; 0 0 1 1] [h_αβ; h_γα; h_βδ; h_δγ]
as our second example of homogeneous inconsistent condition equations with unknowns:
A := [1 −1; −1 1], x := [h_β; h_γ], B := [1 1 0 0; 0 0 1 1], y := [h_αβ, h_γα, h_βδ, h_δγ]′,
n = 4, m = 2, rk A = 1, rk B = q = 2.
I, I-LESS solves the system of normal equations
A′(BB′)⁻¹Ax_ℓ = A′(BB′)⁻¹By,
A′(BB′)⁻¹A = [1 −1; −1 1] =: DE, D = [1; −1], E = [1, −1].
For the matrix of the normal equations A′(BB′)⁻¹A = DE we did a rank factorization: O(D) = m × r, O(E) = r × m, rk D = rk E = r = 1,
[A′(BB′)⁻¹A]⁺ = E′(EE′)⁻¹(D′D)⁻¹D′ = ¼ A′(BB′)⁻¹A = ¼[1 −1; −1 1],
A′(BB′)⁻¹B = ½[1 1 −1 −1; −1 −1 1 1].
I, I-MINOLESS, due to rk[R, A′] = 2, leads to the unique solution
x_ℓm = [h_β; h_γ]_ℓm = ¼[h_αβ + h_γα − h_βδ − h_δγ; −h_αβ − h_γα + h_βδ + h_δγ],
which leads to the centric equation (h_β)_ℓm + (h_γ)_ℓm = 0.
12-23 The third case: I, I – HAPS
Relating to the second design we compute the solution vector x h of type I, I – HAPS by the normal equations, namely I + A c(BBc) 1 A =
1 ª 5 3º ª 5 3º , [I + A c(BBc) 1 A]1 = « 4 «¬3 5»¼ ¬ 3 5 »¼
ª hE º xh = « » = 4 «¬ hJ »¼ h 12-24
ª hDE + hJE hEG hGJ º « ». «¬ hDE hJE + hEG + hGJ »¼
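The second and third cases can be sketched numerically side by side. The following Python fragment is a minimal illustration with assumed observation values (not the book's data): it computes I, I-MINOLESS through the pseudoinverse of the singular matrix A′(BB′)⁻¹A and I, I-HAPS from the regularized system (I + A′(BB′)⁻¹A)x = A′(BB′)⁻¹By, so that the two solutions can be compared directly.

```python
import numpy as np

# Second design of the leveling example: x = [h_beta, h_gamma]'.
A = np.array([[ 1.0, -1.0],
              [-1.0,  1.0]])
B = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
y = np.array([1.20, -2.35, 0.95, 0.18])       # assumed observations

BBinv = np.linalg.inv(B @ B.T)
N = A.T @ BBinv @ A                  # A'(BB')^{-1}A, singular because rk A = 1
rhs = A.T @ BBinv @ B @ y            # A'(BB')^{-1}By

x_minoless = np.linalg.pinv(N) @ rhs           # I, I-MINOLESS: minimum-norm LS solution
x_haps = np.linalg.solve(np.eye(2) + N, rhs)   # I, I-HAPS: (I + N) x = rhs

print(x_minoless, x_minoless.sum())  # centric condition: components sum to zero
print(x_haps, x_haps - x_minoless)   # HAPS solution and its difference to MINOLESS
```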
The fourth case: R, W – MINOLESS, R positive semidefinite, W positive semidefinite
This time we refer the characteristic vectors x, y and the matrices A, B to the second design. The weight matrix of inconsistency parameters will be chosen to
W = ½ [1 1 0 0; 1 1 0 0; 0 0 1 1; 0 0 1 1]
and W⁻ = W, such that ℛ(B′) ⊆ ℛ(W) holds. The positive semidefinite matrix R = Diag(0, 1) has been chosen in such a way that the rank partitioned unknown vector x = [x₁′, x₂′]′, O(x₁) = r × 1, O(x₂) = (m − 1) × 1, rk A =: r = 1, relates to the partial solution x₂ = x_γ = 0, namely
A′(BW⁻B′)⁻¹A = [1 −1; −1 1], A′(BW⁻B′)⁻¹B = ½[1 1 −1 −1; −1 −1 1 1]
and
[1 −1; −1 1] [x_β; x_γ]_ℓm = ½[1 1 −1 −1; −1 −1 1 1] y,
(x_β)_ℓm = ½(h_αβ + h_γα − h_βδ − h_δγ), (x_γ)_ℓm = 0.
12-3 Solving the system of inhomogeneous condition equations with unknowns

First, we solve the problem of inhomogeneous condition equations by the method of minimizing the W-seminorm of least squares. We review by Definition 12.9, Lemma 12.10 and Lemma 12.11 the characteristic normal equations and the linear form which build up the solution of type W-LESS. Second, we extend the method of W-LESS to R, W-MINOLESS by means of Definition 12.12, Lemma 12.13 and Lemma 12.14; R, W-MINOLESS stands for MInimum NOrm LEast Squares Solution (R-seminorm, W-seminorm of type least squares). Third, we alternatively present by Definition 12.15 and Lemma 12.16 R, W-HAPS (Hybrid APproximate Solution with respect to the combined R- and W-seminorm). Fourth, we again compare R, W-MINOLESS and R, W-HAPS by means of computing the difference vector x_h − x_ℓm.

12-31 W – LESS
W-LESS of our system of inconsistent inhomogeneous condition equations with unknowns Ax + Bi = By − c, By − c ∉ ℛ(A), is built on Definition 12.9, Lemma 12.10 and Lemma 12.11.

Definition 12.9 (W-LESS, inhomogeneous conditions with unknowns):
An m × 1 vector x_ℓ is called W-LESS (LEast Squares Solution with respect to the W-seminorm) of the inconsistent system of inhomogeneous linear equations
Ax + Bi = By − c    (12.46)
with Bi_ℓ := By − c − Ax_ℓ, if compared to alternative vectors x ∈ ℝ^m with Bi := By − c − Ax the inequality
||i_ℓ||²_W := i_ℓ′Wi_ℓ ≤ i′Wi =: ||i||²_W    (12.47)
holds. As a consequence, i_ℓ has the smallest W-seminorm. The solutions of type W-LESS are computed as follows.

Lemma 12.10 (W-LESS, inhomogeneous conditions with unknowns: normal equations):
An m × 1 vector x_ℓ is W-LESS of (12.46) if and only if it solves the system of normal equations
[W B′ 0; B 0 A; 0 A′ 0] [i_ℓ; λ_ℓ; x_ℓ] = [0; By − c; 0]    (12.48)
with the q × 1 vector λ_ℓ of “Lagrange multipliers”. x_ℓ exists indeed in case of
ℛ(B′) ⊆ ℛ(W)    (12.49)
and solves the system of normal equations
A′(BW⁻B′)⁻¹Ax_ℓ = A′(BW⁻B′)⁻¹(By − c),    (12.50)
which is independent of the choice of the g-inverse W⁻ and unique. x_ℓ is unique if and only if the matrix
A′(BW⁻B′)⁻¹A    (12.51)
is regular, or equivalently, if
rk A = m    (12.52)
holds. In this case the solution can be represented by
x_ℓ = [A′(BW⁻B′)⁻¹A]⁻¹A′(BW⁻B′)⁻¹(By − c).    (12.53)
(12.54)
is x A = Ly 1 W – LESS of (12.46) for y R n if and only if the matrix L and m × 1 vector 1 obey the conditions (12.55)
L = AB
and
1 = Ac
(12.56)
subject to (BW Bc)-1 AA = [(BW B c)-1 AA ]'.
(12.57)
In this case, the vector Ax A = AA (By c) is always unique. The proof is obvious. 12-32
R, W – MINOLESS
R, W – MINOLESS of our system of inconsistent, inhomogeneous condition equations with unknowns Ax + Bi = By c , By c + R(A) is built on Definition 12.12, Lemma 12.13 and Lemma 12.14. (R, W - MINOLESS, inhomogeneous conditions with unknowns): An m × 1 vector xAm is called R, W - MINOLESS (Minimum NOrm with respect to the R – Seminorm LEast Squares Solution with respect to the W – Seminorm) if the inconsistent system of linear equations of (12.46) is inconsistent and x Am R- MINOS of (12.46). Definition 12.12
In case of R (B') R ( W ) we can compute the solutions of type R, W – MINOLESS of (12.46) as follows. Lemma 12.13
(R, W – MINOLESS, inhomogeneous conditions with unknowns: normal equations):
Under the assumption R (Bc) R ( W)
(12.58)
is an m × 1 vector xAm R, W - MINOLESS of (12.46) if and only if it solves the normal equation ª R A c(BW Bc)-1 A º ª x Am º « A c(BW Bc)-1 A » «Ȝ » = 0 ¬ ¼ ¬ Am ¼ 0 ª º =« » -1 ¬ A c(BW Bc) (By c) ¼
(12.59)
with the m × 1 vector Ȝ Am of “Lagrange – multipliers”. x Am exists always and is uniquely determined, if rk[R, A '] = m
(12.60)
holds, or equivalently, if the matrix R + A c(BW Bc)-1 A is regular. In this case the solution can be represented by
(12.61)
x Am = [R + A c(BW Bc)-1 A c(BW Bc)-1 A ] × ×{A c(BW Bc)-1 A[R + A c(BW Bc)-1 A ]1 ×
(12.62)
×A c(BW Bc) A} A c(BW Bc) (By c) , which is independent of the choice of the generalized inverse.
-1
-1
The proof follows similar lines as in Chapter 12-5. Instead we present the linear forms x = Ly which lead to the R, W – MINOLESS solutions and which can be characterized as follows. Lemma 12.14 (R, W – MINOLESS, relation between A und B): Under the assumption R (Bc) R ( W) is x Am = Ly of type R, W – MINOLESS of (12.46) for all y R n if and only if the matrix A B follows the condition (12.63)
and 1 = Ac and (BW Bc)-1 AA = [(BW Bc)-1 AA ]c RA AA = RA RA A = (RA A)c are fulfilled. In this case is L = AB
Rx Am = R ( Ly 1)
(12.64) (12.65) (12.66) (12.67) (12.68)
always unique. In the special case that R is positive definite, the matrix L is unique, following (12.59) - (12.62). The proof is similar to Lemma 12.6 if we replace everywhere By by By c . 12-33
R, W – HAPS
R, W – HAPS is alternatively built on Definition 12.15 and Lemma 12.16 for the special case of inconsistent, inhomogeneous conditions equations with unknowns Ax + Bi = By c , By c + R(A) . Definition 12.15
(R, W - HAPS, inhomogeneous conditions with unknowns):
An m × 1 vector x h with Bi h = By c Ax h is called R, W HAPS (Hybrid APproximate Solution with respect to the combined R- and W- Seminorm if compared to all other vectors x R n of type Bi = By c Ax the inequality
|| i h ||2W + || x h ||2R =: i ch Wi h + xch Rx h d d i cWi + xcRx =:|| i ||2W + || x ||2R
(12.69)
holds, in other words if the hybrid risk function || i || + || x || is minimal. 2 W
2 R
The solution of type R, W – HAPS can be computed by Lemma 12.16
(R, W – HAPS inhomogeneous conditions with unknowns: normal equations):
An m × 1 vector x h is R, W – HAPS of the Gauß – Helmert model of inconsistent, inhomogeneous condition equations with unknowns if and only if the normal equations ª W Bc 0 º ª i h º ª 0 º « B 0 A » « Ȝ h » = « By c » « »« » « » ¬ 0 Ac R ¼ ¬ x h ¼ ¬ 0 ¼
(12.70)
with the q × 1 vector Ȝ h of “Lagrange – multpliers” are fulfilled. x A exists certainly in case of R (Bc) R ( W)
(12.71)
and is solution of the system of normal equations [R +A c(BW Bc)-1 A ]x h = A c(BW B c)-1 (By c) , (12.72) which is independent of the choice of the generalized inverse W uniquely defined. x h is uniquely defined if and only if the matrix [R +A c(BW Bc)-1 A] is regular, equivalently if rk[R, A c] = m
(12.73)
holds. In this case the solution can be represented by x h = [R +A c(BW Bc)-1 A]1 A c(BW Bc)-1 (By c).
(12.74)
The proof of Lemma 12.16 follows the lines of Lemma 12.8.

12-34 R, W-MINOLESS against R, W-HAPS

Again we note the relations between R, W-MINOLESS and R, W-HAPS: R, W-HAPS is unique because the representations (12.59) and (12.12) are identical. Let us replace (12.59) by the equivalent system
(R + A′(BW⁻B′)⁻¹A)x_ℓm + A′(BW⁻B′)⁻¹Aλ_ℓm = A′(BW⁻B′)⁻¹(By − c)    (12.75)
and
A′(BW⁻B′)⁻¹Ax_ℓm = A′(BW⁻B′)⁻¹(By − c),    (12.76)
such that the difference
x_h − x_ℓm = [R + A′(BW⁻B′)⁻¹A]⁻¹A′(BW⁻B′)⁻¹Aλ_ℓm    (12.77)
follows automatically.
12-4 Condition equations with unknowns: from the algebraic approach to the stochastic one

Let us consider the stochastic portrait of the model “condition equations with unknowns”, namely the stochastic Gauß-Helmert model. Consider the model equations
AE{x} = BE{y} − γ or Aξ = Bη − γ
subject to
O(A) = q × m, O(B) = q × n, γ = Bδ for some δ ∈ ℝⁿ, rk A = m < rk B = q < n,
E{x} = ξ, E{y} = η, E{x} = x − e_x, E{y} = y − e_y,
E{x − E{x}} = 0, E{(x − E{x})(x − E{x})′} = σ²_x Q_x
versus
E{y − E{y}} = 0, E{(y − E{y})(y − E{y})′} = σ²_y Q_y.

12-41 Shift to the centre

From the identity γ = Bδ we gain, by shifting to the centre, another identity of type
AE{x} = BE{y − δ} = B(y − δ) − Be_y = w − Be_y
such that
B(y − δ) =: w, w = Aξ + Be_y.

12-42 The condition of unbiased estimators

The unknown ξ̂ = K₁By + l₁ is uniformly unbiasedly estimable if and only if
ξ = E{ξ̂} = K₁BE{y} + l₁ = K₁(Aξ + γ) + l₁ for all ξ ∈ ℝ^m.
ξ̂ is unbiasedly estimable if and only if l₁ = −K₁γ and K₁A = I_m. In consequence, K₁ = A⁻_L must be a left generalized inverse.

The unknown Ê{y} = y − (K₂By − l₂) is uniformly unbiasedly estimable if and only if
E{y} = E{Ê{y}} = (I_n − K₂B)E{y} + l₂ = E{y} − K₂(Aξ + γ) + l₂ for all ξ ∈ ℝ^m and for all E{y} ∈ ℝⁿ.
Ê{y} is unbiasedly estimable if and only if l₂ = K₂γ and K₂A = 0.

12-43 The first step: unbiased estimation of ξ̂ and Ê{y}

The key lemma of unbiased estimation of ξ̂ and Ê{y} will be presented first: ξ̂ is unbiasedly estimable if and only if ξ̂ = L₁y + l₁ = K₁By + l₁; Ê{y} is unbiasedly estimable if and only if Ê{y} = L₂y + l₂ = (I_n − K₂B)y + l₂, or BL₂ = AL₁ = AA⁻B, since ℛ(A) ⊂ ℛ(B) = ℝ^q.

12-44 The second step: unbiased estimation of K₁ and K₂

The matrices K₁ and K₂ and the bias parameters l₁ and l₂ are given by
K₁ = [A′(BQ_yB′)⁻¹A]⁻¹A′(BQ_yB′)⁻¹, K₂ = Q_yB′(BQ_yB′)⁻¹(I_q − AK₁),
l₁ = −K₁γ, l₂ = +K₂γ,
generating BLUUE of E{x} = ξ and E{y} = η, respectively.
13 The nonlinear problem of the 3d datum transformation and the Procrustes Algorithm

A special nonlinear problem is the three-dimensional datum transformation solved by the Procrustes Algorithm. A definition of the three-dimensional datum transformation with the coupled unknowns of type dilatation (also called scale factor), translation and rotation follows afterwards.

:Fast track reading: Read Definition 13.1, Corollaries 13.2-13.4, Corollary 13.6, Theorem 13.5 and Lemma 13.7
Definition 13.1: ℂ₇(3) 3d datum transformation
Corollary 13.2: translation, partial W-LESS for x₂
Corollary 13.3: scale, partial W-LESS for x₁
Corollary 13.4: rotation, partial W-LESS for X₃
Theorem 13.5: W-LESS of (Y₁ = Y₂X₃′x₁ + 1x₂′ + E)
Corollary 13.6: I-LESS of (Y₁ = Y₂X₃′x₁ + 1x₂′ + E)

“The guideline of Chapter 13: definition, corollaries and lemma”

Let us specify the parameter space X, namely
x₁, the dilatation parameter – the scale factor – x₁ ∈ ℝ,
versus
x₂, the column vector of translation parameters, x₂ ∈ ℝ^{3×1},
versus
X₃ ∈ O⁺(3) := {X₃ ∈ ℝ^{3×3} | X₃′X₃ = I₃ and |X₃| = +1},
X₃ an orthonormal matrix, the rotation matrix of three parameters,
which is built on the scalar x₁, the vector x₂ and the matrix X₃. In addition, by the matrices
Y₁ := [x₁ x₂ ... x_{n−1} x_n; y₁ y₂ ... y_{n−1} y_n; z₁ z₂ ... z_{n−1} z_n] ∈ ℝ^{3×n}
and
Y₂ := [X₁ X₂ ... X_{n−1} X_n; Y₁ Y₂ ... Y_{n−1} Y_n; Z₁ Z₂ ... Z_{n−1} Z_n] ∈ ℝ^{3×n}
we define a left and a right three-dimensional coordinate array as an n-dimensional simplex of observed data. Our aim is to determine the parameters of the 3-dimensional datum transformation {x₁, x₂, X₃} out of a nonlinear transformation (conformal group ℂ₇(3)). x₁ ∈ ℝ stands for the dilatation parameter, also called scale factor, x₂ ∈ ℝ^{3×1} denotes the column vector of translation parameters, and X₃ ∈ O⁺(3) := {X₃ ∈ ℝ^{3×3} | X₃′X₃ = I₃, |X₃| = +1} the orthonormal matrix, also called rotation matrix of three parameters. The key problem is how to determine the parameters for the unknowns of type {x₁, x₂, X₃}, namely scalar dilatation, vector of translation and matrix of rotation, for instance by weighted least squares.

Example 1 (simplex of minimal dimension, n = 4 points: tetrahedron):
Y₁′ = [x₁ y₁ z₁; x₂ y₂ z₂; x₃ y₃ z₃; x₄ y₄ z₄], Y₂′ = [X₁ Y₁ Z₁; X₂ Y₂ Z₂; X₃ Y₃ Z₃; X₄ Y₄ Z₄] =: Y,
Y₁′ = Y₂′X₃x₁ + 1x₂′ + [e₁₁ e₁₂ e₁₃; e₂₁ e₂₂ e₂₃; e₃₁ e₃₂ e₃₃; e₄₁ e₄₂ e₄₃].
Example 2 (W – LESS) We depart from the setup of the pseudo-observation equation given in Example 1 (simplex of minimal dimension, n = 4 points, tetrahedron). For a diagonal
weight matrix W = Diag(w₁, ..., w₄) ∈ ℝ^{4×4} we compute the Frobenius error matrix W-seminorm
||E||²_W := tr(E′WE) = w₁e₁₁² + w₂e₂₁² + w₃e₃₁² + w₄e₄₁² + w₁e₁₂² + w₂e₂₂² + w₃e₃₂² + w₄e₄₂² + w₁e₁₃² + w₂e₂₃² + w₃e₃₃² + w₄e₄₃².
Obviously, the coordinate errors (e₁₁, e₁₂, e₁₃) have the same weight w₁, (e₂₁, e₂₂, e₂₃) have the same weight w₂, (e₃₁, e₃₂, e₃₃) have the same weight w₃, and finally (e₄₁, e₄₂, e₄₃) have the same weight w₄. We may also say that the error weight is pointwise isotropic, weight e₁₁ = weight e₁₂ = weight e₁₃ = w₁, etc. However, the error weight is not homogeneous since w₁ = weight e₁₁ ≠ weight e₂₁ = w₂. Of course, an ideal homogeneous and isotropic weight distribution is guaranteed by the criterion w₁ = w₂ = w₃ = w₄.
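The pointwise-isotropic structure of the weighted Frobenius seminorm is easy to verify numerically. The following Python fragment is a minimal check; the weights and the error matrix are illustrative values, not from the case study.

```python
import numpy as np

# Check ||E||_W^2 = tr(E'WE) = sum_i w_i sum_j e_ij^2 for a diagonal weight matrix.
w = np.array([2.0, 1.0, 0.5, 1.5])          # assumed point weights w_1..w_4
W = np.diag(w)
E = np.arange(1.0, 13.0).reshape(4, 3)      # an illustrative 4 x 3 error matrix

lhs = np.trace(E.T @ W @ E)                 # tr(E'WE)
rhs = np.sum(w[:, None] * E**2)             # pointwise-isotropic expansion
print(np.isclose(lhs, rhs))
```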
13-1 The 3d datum transformation and the Procrustes Algorithm

First, we present W-LESS for our nonlinear adjustment problem for the unknowns of type scalar, vector and special orthonormal matrix. Second, we review the Procrustes Algorithm for the parameters {x₁, x₂, X₃}.

Definition 13.1 (nonlinear analysis for the three-dimensional datum transformation: the conformal group ℂ₇(3)):
The parameter array {x₁ℓ, x₂ℓ, X₃ℓ} is called W-LESS (LEast Squares Solution with respect to the W-seminorm) of the inconsistent linear system of equations
Y₂X₃′x₁ + 1x₂′ + E = Y₁    (13.1)
subject to
X₃′X₃ = I₃, |X₃| = +1    (13.2)
if, in comparison with alternative parameter arrays {x₁, x₂, X₃} of the field of parameters, it fulfils the inequality
||Y₁ − Y₂X₃ℓ′x₁ℓ − 1x₂ℓ′||²_W := tr((Y₁ − Y₂X₃ℓ′x₁ℓ − 1x₂ℓ′)′W(Y₁ − Y₂X₃ℓ′x₁ℓ − 1x₂ℓ′)) ≤ tr((Y₁ − Y₂X₃′x₁ − 1x₂′)′W(Y₁ − Y₂X₃′x₁ − 1x₂′)) =: ||Y₁ − Y₂X₃′x₁ − 1x₂′||²_W,    (13.3)
in other words if
Eℓ := Y₁ − Y₂X₃ℓ′x₁ℓ − 1x₂ℓ′    (13.4)
has the least W-seminorm.

? How to compute the three unknowns {x₁, x₂, X₃} by means of W-LESS ?

Here we will outline the computation of the parameter vector by means of partial W-LESS: at first, by means of W-LESS we determine x₂ℓ, secondly x₁ℓ, and thirdly X₃ℓ. In total, we outline the Procrustes Algorithm.

Step one: x₂

Corollary 13.2 (partial W-LESS for x₂ℓ):
A 3 × 1 vector x₂ℓ is partial W-LESS of (13.1) subject to (13.2) if and only if x₂ℓ fulfils the system of normal equations
1′W1 x₂ℓ = (Y₁ − Y₂X₃′x₁)′W1.    (13.5)
x₂ℓ always exists and is represented by
x₂ℓ = (1′W1)⁻¹(Y₁ − Y₂X₃′x₁)′W1.    (13.6)
For the special case W = I_n the translational parameter vector x₂ℓ is given by
x₂ℓ = (1/n)(Y₁ − Y₂X₃′x₁)′1.    (13.7)
For the proof, we shall first minimize the risk function
(Y₁ − Y₂X₃′x₁ − 1x₂′)′(Y₁ − Y₂X₃′x₁ − 1x₂′) = min over x₂
with respect to x₂!

:Detailed Proof of Corollary 13.2:
W-LESS is constructed by the unconstrained Lagrangean
L( x1 , x 2 , X 3 ) := =
1 || E ||2W =|| Y1 Y2 X '3 x1 1x ' 2 ||2W = 2
1 tr( Y1 Y2 X '3 x1 1x '2 ) ' W( Y1 Y2 X '3 x1 1x '2 ) = 2 =
min
x1 t 0, x 2 R 3×1 , X 3 ' X 3 = I 3
wL ( x 2A ) = (1'W1)x 2 ( Y1 Y2 X '3 x1 ) ' W1 = 0 wx ,2 constitutes the first necessary condition. Basics of the vector-valued differentials are found in E. Grafarend and B. Schaffrin (1993, pp. 439-451). As soon as we backward substitute the translational parameter x 2A , we are led to the centralized Lagrangean L( x1 , X 3 ) =
1 tr{[ Y1 ( Y2 X '3 x1 + (1'W1) 111 ' W( Y1 Y2 X '3 x1 ))]' W * 2
*[Y1 ( Y2 X '3 x1 + (1'W1) 1 11 ' W( Y1 Y2 X '3 x1 ))]} L( x1 , X 3 ) =
1 tr{[(I (1'W1) 111 ') W( Y1 Y2 X '3 x1 )]' W * 2
*[( I (1'W1) 1 11 ') W( Y1 Y2 X '3 x1 )]} 1 C := I n 11' 2 being a definition of the centering matrix, namely for W = I n C := I n (1 ' W1) 1 11'W
(13.8)
being in general symmetric. Substituting the centering matrix into the reduced Lagrangean L( x1 , X 3 ) , we gain the centralized Lagrangean L( x1 , X3 ) =
1 tr{[ Y1 Y2 X '3 x1 ]' C'WC[ Y1 Y2 X '3 x1 ]}. 2
(13.9)
Step two: x₁

Corollary 13.3 (partial W-LESS for x₁ℓ):
A scalar x₁ℓ is partial W-LESS of (13.1) subject to (13.3) if and only if
x₁ℓ = tr(Y₁′C′WCY₂X₃′) / tr(Y₂′C′WCY₂)    (13.10)
holds. For the special case W = I_n the real parameter is given by
x₁ℓ = tr(Y₁′C′CY₂X₃′) / tr(Y₂′C′CY₂).    (13.11)
The general condition is subject to
C := I_n − (1′W1)⁻¹11′W.    (13.12)
:Detailed Proof of Corollary 13.3: For the proof we shall newly minimize the risk function 1 L( x1 , X 3 ) = tr{[ Y1 Y2 X '3 x1 ]' C'WC[ Y1 Y2 X '3 x1 ]} = min x 2 subject to 1
X '3 X 3 = I 3 . wL ( x1A ) = x1A tr X 3Y '2 C ' WCY2 X '3 tr Y '1 C ' WCY2 X '3 = 0 wx1 constitutes the second necessary condition. Due to tr X 3Y '2 C ' WCY2 X '3 = tr Y '2 C ' WCY2 X '3 X 3 = Y ' 2 C ' WCY2 lead us to x1A . While the forward computation of (wL / wx1 )( x1A ) = 0 enjoyed a representation of the optimal scale parameter x1A , its backward substitution into the Lagrangean L( x1 , X 3 ) amounts to L( X 3 ) = tr{[ Y1 Y2 X '3
L( X 3 ) =
tr Y '1 C ' WCY2 X '3 tr Y '1 C ' WCY2 X '3 ]C ' WC *[ Y1 Y2 X ' 3 ]} tr Y '2 C ' WCY2 tr Y '2 C ' WCY2
tr Y '1 C ' WCY2 X '3 1 tr{( Y '1 C ' WCY1 ) tr( Y '1 C ' WCY2 X '3 ) * 2 tr Y '2 C ' WCY2 tr( X 3Y '2 C'WCY1 )
tr Y '1 C ' WCY2 X '3 + tr Y '2 C ' WCY2
+ tr( X 3Y '2 C'WCY2 X '3 )
[tr Y '1 C ' WCY2 X '3 ]2 [tr Y '2 C ' WCY2 ]2
L( X 3 ) =
[tr Y '1 C ' WCY2 X '3 ]2 1 [tr Y '1 C ' WCY2 X '3 ]2 1 tr( Y '1 C ' WCY1 ) + 2 [tr Y '2 C ' WCY2 ] 2 [tr Y '2 C ' WCY2 ]
L( X3 ) =
1 1 [tr Y '1 C ' WCY2 X '3 ]2 tr( Y '1 C ' WCY1 ) = min . X ' X =I 2 2 [tr Y '2 C ' WCY2 ] 3
3
3
Third, we are left with the proof for the Corollary 13.4, namely X 3 .
(13.13)
Step three: X₃

Corollary 13.4 (partial W-LESS for X₃ℓ):
A 3 × 3 orthonormal matrix X₃ℓ is partial W-LESS of (13.1) subject to (13.3) if and only if
X₃ℓ = UV′    (13.14)
holds, where A := Y₁′C′WCY₂ = UΣ_sV′ is a singular value decomposition with respect to a left orthonormal matrix U, U′U = I₃, a right orthonormal matrix V, VV′ = I₃, and Σ_s = Diag(σ₁, σ₂, σ₃) a diagonal matrix of singular values (σ₁, σ₂, σ₃). The singular values are the canonical coordinates of the right eigenspace (A′A − Σ²_s I)V = 0. The left eigenspace is based upon U = AVΣ_s⁻¹.
max
x1 t 0, X '3 X3 = I3
.
Let A := Y '1 C ' WCY2 = UȈV ' , a singular value decomposition with respect to a left orthonormal matrix U, U'U = I 3 , a right orthonormal matrix V, VV' = I 3 and Ȉ s = Diag (V 1 , V 2 , V 3 ) a diagonal matrix of singular values (V 1 , V 2 , V 3 ). Then 3
3
i =1
i =1
tr( AX '3 ) = tr( UȈ s V ' X '3 ) = tr( Ȉ s V ' X 3U ) = ¦ V i rii d ¦ V i holds, since R = V'X '3 U = [ rij ] R 3×3
(13.15) 3
is orthonormal with || rii ||d 1 . The identity tr ( AX '3 ) = ¦ V i applies, if i =1
V'X '3 U = I3 , i.e. X '3 = VU ', X 3 = UV ', namely, if tr( AX '3 ) is maximal 3
tr( AX '3 ) = max tr AX '3 = ¦V i R = V'X '3 U = I 3 . X '3 X 3 = I 3
(13.16)
i =1
An alternative proof of Corollary 13.4 based on formal differentiation of traces and determinants has been given by P.H. Schönemann (1966) and P.H. Schönemann and R.M. Carroll (1970). Finally, we collect our sequential results in Theorem 13.5 identifying the stationary point of W – LESS specialized for W = I in Corollary 13.5. The highlight is the Procrustes Algorithm we review in Table 13.1.
Theorem 13.5 (W-LESS of Y₁ = Y₂X₃′x₁ + 1x₂′ + E):
(i) The parameter array {x₁, x₂, X₃} is W-LESS if
x₁ℓ = tr(Y₁′C′WCY₂X₃ℓ′) / tr(Y₂′C′WCY₂)    (13.17)
x₂ℓ = (1′W1)⁻¹(Y₁ − Y₂X₃ℓ′x₁ℓ)′W1    (13.18)
X₃ℓ = UV′    (13.19)
subject to the singular value decomposition of the general 3 × 3 matrix
Y₁′C′WCY₂ = U Diag(σ₁, σ₂, σ₃)V′,    (13.20)
namely
[(Y₁′C′WCY₂)′(Y₁′C′WCY₂) − σᵢ²I] vᵢ = 0,    (13.21)
V = [v₁, v₂, v₃], VV′ = I₃,    (13.22)
U = Y₁′C′WCY₂ V Diag(σ₁⁻¹, σ₂⁻¹, σ₃⁻¹), U′U = I₃,    (13.23), (13.24)
as well as the centering matrix
C := I_n − (1′W1)⁻¹11′W.    (13.25)
(ii) The empirical error matrix of type W-LESS accounts for
Eℓ = [I_n − 11′W(1′W1)⁻¹] (Y₁ − Y₂VU′ tr(Y₁′C′WCY₂VU′)/tr(Y₂′C′WCY₂))    (13.26)
with the related Frobenius matrix W-seminorm
||Eℓ||²_W = tr(Eℓ′WEℓ) = tr{(Y₁ − Y₂VU′ tr(Y₁′C′WCY₂VU′)/tr(Y₂′C′WCY₂))′ [I_n − 11′W(1′W1)⁻¹]′ W [I_n − 11′W(1′W1)⁻¹] (Y₁ − Y₂VU′ tr(Y₁′C′WCY₂VU′)/tr(Y₂′C′WCY₂))}    (13.27)
and the representative scalar measure of the error of type W-LESS
|||Eℓ|||_W = √(tr(Eℓ′WEℓ)/3n).    (13.28)
A special result is obtained if we specialize Theorem 13.5 to the case W = I_n:
Corollary 13.6 (I – LESS of Y '1 = Y2 X '3 x1 + 1x '2 + E ): (i)
The parameter array {x1 , x 2 , X 3} is Y '1 = Y2 X '3 x1 + 1x '2 + E if x1A =
I – LESS of
tr Y '1 CY2 X '3A tr Y '2 CY2
(13.29)
1 ( Y1 Y2 X '3A x1A ) ' 1 (13.30) n (13.31) X 3A = UV ' subject to the singular value decomposition of the general 3 × 3 matrix (13.32) Y1 ' CY2 = U Diag(V 1 , V 2 , V 3 )V ' namely x 2A =
[(Y '1 CY2 )'(Y '1 CY2 )-V i2 ]Iv i = 0, i {1,2,3}, V = [v1 , v 2 ,v 3 ], VV' = I 3 (13.33)
U = Y '1 CY2 V Diag(V 11 , V 21 , V 31 ), UU' = I 3 and as well as the centering matrix 1 C := I n 11'. n (ii) The empirical error matrix of type I- LESS accounts for tr Y '1 C ' Y2 VU ' 1 EA = [I n 11']( Y1 Y2 VU ' ) n tr Y '2 CY2
(13.34)
(13.35)
(13.36)
with the related Frobenius matrix W – seminorm tr Y '1 CY2 VU ' )' * tr Y '2 CY2 tr Y '1 CY2 VU ' 1 *[I n 11']( Y1 Y2 VU ' )} n tr Y '2 CY2
|| E ||I2 = tr( E 'A EA ) = tr{( Y1 Y2 VU '
(13.37)
and the representative scalar measure of the error of type I - LESS || EA ||I = tr(E 'A EA ) / 3n .
(13.38)
In the proof of Corollary 13.6 we only sketch the result that the matrix I n (1/ n)11' is idempotent: 1 1 2 1 (I n 11c)(I n 11c) = I n 11c + 2 (11') 2 n n n n 2 1 1 = I n 11c + 2 n11c = I n 11c. n n n
As a summary of the various steps of Corollaries 13.2-13.4, Corollary 13.6 and Theorem 13.5, Table 13.1 presents the celebrated Procrustes Algorithm, which is followed by a short and interesting citation about “Procrustes”.

Table 13.1: Procrustes Algorithm
Step 1: Read the coordinate arrays Y₁ = [x₁ y₁ z₁; ...; x_n y_n z_n] and Y₂ = [X₁ Y₁ Z₁; ...; X_n Y_n Z_n].
Step 2: Compute Y₁′CY₂ subject to C := I_n − (1/n)11′.
Step 3: Compute the SVD Y₁′CY₂ = U Diag(σ₁, σ₂, σ₃)V′:
  3-1: |(Y₁′CY₂)′(Y₁′CY₂) − σᵢ²I| = 0 → (σ₁, σ₂, σ₃);
  3-2: ((Y₁′CY₂)′(Y₁′CY₂) − σᵢ²I)vᵢ = 0, i ∈ {1, 2, 3}, V = [v₁, v₂, v₃] (right eigencolumns);
  3-3: U = Y₁′CY₂ V Diag(σ₁⁻¹, σ₂⁻¹, σ₃⁻¹) (left eigencolumns).
Step 4: Compute X₃ℓ = UV′ (rotation).
Step 5: Compute x₁ℓ = tr(Y₁′CY₂X₃′)/tr(Y₂′CY₂) (dilatation).
Step 6: Compute x₂ℓ = (1/n)(Y₁ − Y₂X₃′x₁)′1 (translation).
Step 7: Compute Eℓ = C(Y₁ − Y₂VU′ tr(Y₁′CY₂VU′)/tr(Y₂′CY₂)) (error matrix); ‘optional control’ Eℓ := Y₁ − (Y₂X₃′x₁ℓ + 1x₂ℓ′).
Step 8: Compute ||Eℓ||_I := √tr(Eℓ′Eℓ) (error matrix norm).
Step 9: Compute |||Eℓ|||_I := √(tr(Eℓ′Eℓ)/3n) (mean error matrix norm).
Procrustes (the subduer), son of Poseidon, kept an inn benefiting from what he claimed to be a wonderful all-fitting bed. He lopped off excessive limbage from tall guests and either flattened short guests by hammering or stretched them by racking. The victim fitted the bed perfectly but, regrettably, died. To exclude the embarrassment of an initially exact-fitting guest, variants of the legend allow Procrustes two different-sized beds. Ultimately, in a crackdown on robbers and monsters, the young Theseus fitted Procrustes to his own bed.
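The steps of Table 13.1 translate directly into code. The following Python function is a minimal sketch of the I-LESS Procrustes Algorithm (W = I_n); the function name and the synthetic test data are ours, and the n × 3 row-per-point layout of Y₁, Y₂ is an assumption of this sketch.

```python
import numpy as np

def procrustes_3d(Y1, Y2):
    """Sketch of the I-LESS Procrustes Algorithm of Table 13.1 (rows = points)."""
    n = Y1.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n            # Step 2: centering matrix
    A = Y1.T @ C @ Y2                              # 3 x 3 cross matrix Y1'C Y2
    U, s, Vt = np.linalg.svd(A)                    # Step 3: SVD, A = U diag(s) V'
    X3 = U @ Vt                                    # Step 4: rotation X3 = UV'
    x1 = np.trace(A @ X3.T) / np.trace(Y2.T @ C @ Y2)     # Step 5: dilatation
    x2 = (Y1 - x1 * Y2 @ X3.T).T @ np.ones(n) / n          # Step 6: translation
    E = Y1 - (x1 * Y2 @ X3.T + np.outer(np.ones(n), x2))   # Step 7: error matrix
    return x1, x2, X3, E

# Synthetic check: transform random points and recover the parameters.
rng = np.random.default_rng(1)
Y2 = rng.standard_normal((7, 3))
Q = np.linalg.qr(rng.standard_normal((3, 3)))[0]
R = Q if np.linalg.det(Q) > 0 else -Q                  # a proper rotation
Y1 = 1.000005 * Y2 @ R.T + np.outer(np.ones(7), [641.9, 68.7, 416.4])
x1, x2, X3, E = procrustes_3d(Y1, Y2)
print(x1, x2, np.sqrt(np.trace(E.T @ E) / (3 * 7)))    # scale, translation, mean error
```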
13-2 The variance-covariance matrix of the error matrix E

By Lemma 13.7 we review the variance-covariance matrix of vec E′, the vector-valued form of the transposed error matrix, as a function of Σ_{vec Y₁′}, Σ_{vec Y₂′} and the covariance matrix Σ_{vec Y₁′, (I_n ⊗ x₁X₃) vec Y₂′}.

Lemma 13.7 (variance-covariance “error propagation”):
Let vec E′ be the vector-valued form of the transposed error matrix E := Y₁ − Y₂X₃′x₁ − 1x₂′. Then
Σ_{vec E′} = Σ_{vec Y₁′} + (I_n ⊗ x₁X₃)Σ_{vec Y₂′}(I_n ⊗ x₁X₃)′ − 2Σ_{vec Y₁′, (I_n ⊗ x₁X₃) vec Y₂′}    (13.39)
is the exact representation of the dispersion matrix (variance-covariance matrix) Σ_{vec E′} of vec E′ in terms of the dispersion matrices (variance-covariance matrices) Σ_{vec Y₁′} and Σ_{vec Y₂′} of the two coordinate sets vec Y₁′ and vec Y₂′ as well as their covariance matrix Σ_{vec Y₁′, (I_n ⊗ x₁X₃) vec Y₂′}.

The proof follows directly from “error propagation”. Obviously the variance-covariance matrix Σ_{vec E′} can be decomposed into the variance-covariance matrix Σ_{vec Y₁′}, the product (I_n ⊗ x₁X₃)Σ_{vec Y₂′}(I_n ⊗ x₁X₃)′ using prior information on x₁ and X₃, and the covariance matrix Σ_{vec Y₁′, (I_n ⊗ x₁X₃) vec Y₂′}, again using prior information on x₁ and X₃.
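The Kronecker structure of (13.39) is easy to build numerically. The following Python fragment is a minimal sketch of the error propagation, under the additional assumption that the two coordinate sets are uncorrelated (the cross-covariance term then vanishes); all numerical values are illustrative.

```python
import numpy as np

# Dispersion of vec E' per (13.39), assuming zero cross covariance between Y1 and Y2.
rng = np.random.default_rng(2)
n = 4
x1 = 1.0000056                                       # assumed dilatation
X3 = np.linalg.qr(rng.standard_normal((3, 3)))[0]    # some orthonormal matrix
Sigma_Y1 = np.diag(rng.uniform(0.01, 0.02, 3 * n))   # dispersion of vec Y1' (assumed)
Sigma_Y2 = np.diag(rng.uniform(0.001, 0.002, 3 * n)) # dispersion of vec Y2' (assumed)

F = np.kron(np.eye(n), x1 * X3)                      # (I_n kron x1 X3)
Sigma_E = Sigma_Y1 + F @ Sigma_Y2 @ F.T              # (13.39) without the cross term
print(Sigma_E.shape)                                 # (3n) x (3n) dispersion of vec E'
```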
13-3 Case studies: The 3d datum transformation and the Procrustes Algorithm

In Table 13.2 and Table 13.3 we present two sets of coordinates, first for the local system A, second for the global system B, also called “World Geodetic System 84”. The units are in meter. The results of the I-LESS Procrustes Algorithm are listed in Table 13.4, especially ||Eℓ||_I := √tr(Eℓ′Eℓ) and |||Eℓ|||_I := √(tr(Eℓ′Eℓ)/3n), and those of the W-LESS Procrustes Algorithm in Table 13.5, especially ||Eℓ||_W := √tr(Eℓ′WEℓ) and |||Eℓ|||_W := √(tr(Eℓ′WEℓ)/3n), completed by Table 13.6 of residuals from the linearized least squares solution and by Table 13.7 listing the weight matrix.
Discussion

By means of the Procrustes Algorithm, which is based upon W-LESS with respect to the Frobenius matrix W-seminorm, we have succeeded in solving the normal equations of Corollary 13.2 to Corollary 13.5 (necessary conditions) of the matrix-valued “error equations”
vec E′ = vec Y₁′ − (I_n ⊗ x₁X₃)vec Y₂′ − vec x₂1′ subject to X₃′X₃ = I₃, |X₃| = +1.
The scalar-valued unknown x₁ ∈ ℝ represents the dilatation (scale factor), the vector-valued unknown x₂ ∈ ℝ^{3×1} the translation vector, and the matrix-valued unknown X₃ ∈ SO(3) the orthonormal matrix. The conditions of sufficiency, namely the Hesse matrix of second derivatives of the Lagrangean ℒ(x₁, x₂, X₃), are not discussed here; they are given in the Procrustes references. In order to present a proper choice of the isotropic weight matrix W, we introduced the corresponding “random regression model”
E{vec E′} = E{vec Y₁′} − (I_n ⊗ x₁X₃)E{vec Y₂′} − vec x₂1′ = 0 (first moment identity),
D{vec E′} = D{vec Y₁′} + (I_n ⊗ x₁X₃)D{vec Y₂′}(I_n ⊗ x₁X₃)′ − 2C{vec Y₁′, (I_n ⊗ x₁X₃)vec Y₂′} (second central moment identity).

Table 13.2. Coordinates for system A (local system)
Station name     X(m)           Y(m)         Z(m)           positional error sphere
Solitude         4 157 222.543  664 789.307  4 774 952.099  0.1433
Buoch Zeil       4 149 043.336  688 836.443  4 778 632.188  0.1551
Hohenneuffen     4 172 803.511  690 340.078  4 758 129.701  0.1503
Kuehlenberg      4 177 148.376  642 997.635  4 760 764.800  0.1400
Ex Mergelaec     4 137 012.190  671 808.029  4 791 128.215  0.1459
Ex Hof Asperg    4 146 292.729  666 952.887  4 783 859.856  0.1469
Ex Kaisersbach   4 138 759.902  702 670.738  4 785 552.196  0.1220
Table 13.3. Coordinates for system B (WGS 84)
Station name     X(m)           Y(m)         Z(m)           positional error sphere
Solitude         4 157 870.237  664 818.678  4 775 416.524  0.0103
Buoch Zeil       4 149 691.049  688 865.785  4 779 096.588  0.0038
Hohenneuffen     4 173 451.354  690 369.375  4 758 594.075  0.0006
Kuehlenberg      4 177 796.064  643 026.700  4 761 228.899  0.0114
Ex Mergelaec     4 137 659.549  671 837.337  4 791 592.531  0.0068
Ex Hof Asperg    4 146 940.228  666 982.151  4 784 324.099  0.0002
Ex Kaisersbach   4 139 407.506  702 700.227  4 786 016.645  0.0041
Table 13.4. Results of the I-LESS Procrustes transformation
Rotation matrix X₃ ∈ ℝ^{3×3}:
  0.999999999979023   −4.33275933098276e−6   4.81462518486797e−6
 −4.8146461589238e−6   0.999999999976693    −4.84085332591588e−6
  4.33273602401529e−6   4.84087418647916e−6   0.999999999978896
Translation x₂ ∈ ℝ^{3×1} (m): 641.8804, 68.6553, 416.3982
Scale x₁ ∈ ℝ: 1.00000558251985
Residual matrix E (m):
Site             X(m)      Y(m)      Z(m)
Solitude         0.0940    0.1351    0.1402
Buoch Zeil       0.0588   −0.0497    0.0137
Hohenneuffen    −0.0399   −0.0879   −0.0081
Kuehlenberg      0.0202   −0.0220   −0.0874
Ex Mergelaec    −0.0919    0.0139   −0.0055
Ex Hof Asperg   −0.0118    0.0065   −0.0546
Ex Kaisersbach  −0.0294    0.0041    0.0017
Error matrix norm (m): ||Eℓ||_I := √tr(Eℓ′Eℓ) = 0.2890
Mean error matrix norm (m): |||Eℓ|||_I := √(tr(Eℓ′Eℓ)/3n) = 0.0631
Table 13.5. Results of the W-LESS Procrustes transformation
Rotation matrix X₃ ∈ ℝ^{3×3}:
  0.999999999979141    4.77975830372179e−6  −4.34410139438235e−6
 −4.77977931759299e−6   0.999999999976877   −4.83729276438971e−6
  4.34407827309968e−6   4.83731352815542e−6   0.999999999978865
Translation x₂ ∈ ℝ^{3×1} (m): 641.8377, 68.4743, 416.2159
Scale x₁ ∈ ℝ: 1.00000561120732
Residual matrix E (m):
Site             X(m)      Y(m)      Z(m)
Solitude         0.0948    0.1352    0.1407
Buoch Zeil       0.0608   −0.0500    0.0143
Hohenneuffen    −0.0388   −0.0891   −0.0072
Kuehlenberg      0.0195   −0.0219   −0.0868
Ex Mergelaec    −0.0900    0.0144   −0.0052
Ex Hof Asperg   −0.0105    0.0067   −0.0542
Ex Kaisersbach  −0.0266    0.0036    0.0022
Error matrix norm (m): ||Eℓ||_W := √tr(Eℓ′WEℓ) = 0.4268
Mean error matrix norm (m): |||Eℓ|||_W := √(tr(Eℓ′WEℓ)/3n) = 0.0930
Table 13.6. Residuals from the linearized LS solution
Site             X(m)      Y(m)      Z(m)
Solitude         0.0940    0.1351    0.1402
Buoch Zeil       0.0588   −0.0497    0.0137
Hohenneuffen    −0.0399   −0.0879   −0.0081
Kuehlenberg      0.0202   −0.0220   −0.0874
Ex Mergelaec    −0.0919    0.0139   −0.0055
Ex Hof Asperg   −0.0118    0.0065   −0.0546
Ex Kaisersbach  −0.0294    0.0041   −0.0017
Table 13.7. Weight matrix
W = Diag(1.8110817, 2.1843373, 2.1145291, 1.9918578, 2.6288452, 2.1642460, 2.359370)
13-4 References Here is a list of important references: Awange LJ (1999), Awange LJ (2002), Awange LJ, Grafarend E (2001 a, b, c), Awange LJ, Grafarend E (2002), Bernhardt T (2000), Bingham C, Chang T, Richards D (1992), Borg I, Groenen P (1997), Brokken FB (1983), Chang T, Ko DJ (1995), Chu MT, Driessel R (1990), Chu MT, Trendafilov NT (1998), Crosilla F (1983a, b), Dryden IL (1998), Francesco D, Mathien PP, Senechal D (1997), Golub GH (1987), Goodall C (1991), Gower JC (1975), Grafarend E and Awange LJ (2000, 2003), Grafarend E, Schaffrin B (1993), Grafarend E, Knickmeyer EH, Schaffrin B (1982), Green B (1952), GullikssonM(1995a, b), Kent JT, Mardia KV (1997), Koch KR (2001), Krarup T (1979), Lenzmann E, Lenzmann L (2001a, b), Mardia K (1978), Mathar R (1997), Mathias R (1993), Mooijaart A, Commandeur JJF (1990), Preparata FP, Shamos MI (1985), Reinking J (2001), Schönemann PH (1966), Schönemann PH, Carroll RM (1970), Schottenloher M (1997), Ten Berge JMF (1977), Teunissen PJG (1988), Trefethen LN, Bau D (1997) and Voigt C (1998).
14 The seventh problem of generalized algebraic regression revisited: The Grand Linear Model: The split level model of conditional equations with unknowns (general Gauss-Helmert model) The reaction of one man can be forecast by no known mathematics; the reaction of a billion is something else again. Isaac Asimov :Fast track reading: Read only Lemma 14.1, Lemma 14.2, Lemma 14.3
Lemma 12.10: W-LESS — Lemma 14.1: W-LESS
Lemma 12.13: R, W-MINOLESS — Lemma 14.2: R, W-MINOLESS
Lemma 12.16: R, W-HAPS — Lemma 14.3: R, W-HAPS

“The guideline of Chapter 14: three lemmas”
The inconsistent, inhomogeneous system of linear equations Ax + Bi = By − c we treated before will be specialized for arbitrary condition equations between the observation vector y on the one side and the unknown parameter vector x on the other. We assume in addition that those condition equations which do not contain the observation vector y are consistent. The n × 1 vector i of inconsistency is specialized such that B(y − i) − c ∉ ℛ(A).

The first equation: B₁i = B₁y − c₁

The first condition equation is specialized to contain only conditions acting on the observation vector y, namely as an inconsistent equation
B₁i = B₁y − c₁.
There are many examples for such a model. As a holonomity condition it is stated, for instance, that the “true observations” fulfil an equation of type 0 = B₁E{y} − c₁.

Example
Let there be given two connected triangular networks of height difference measurements, which we already presented in Chapter 9-3, namely for c₁ := 0 and
{h_αβ + h_βγ + h_γα = 0}, {h_γβ + h_βδ + h_δγ = 0},
B₁ = [1 1 1 0 0; 0 −1 0 1 1], y := [h_αβ, h_βγ, h_γα, h_βδ, h_δγ]′.

The second equation: A₂x + B₂i = B₂y − c₂, c₂ ∈ ℛ(B₂)

The second condition equation with unknowns is assumed to be the general model which is characterized by the inconsistent, inhomogeneous system of linear equations, namely A₂x + B₂i = B₂y − c₂, c₂ ∈ ℛ(B₂). Examples have been given earlier.

The third equation: A₃x = −c₃, c₃ ∈ ℛ(A₃)

The third condition equation is specialized to contain only a restriction acting on the unknown vector x in the sense of a fixed constraint, namely A₃x = −c₃ or A₃x + c₃ = 0, c₃ ∈ ℛ(A₃). We refer to our earlier example of fixing a triangular network in the plane whose position coordinates are derived from distance measurements and fixed by a datum constraint. The other linear models of the type of Chapters 1, 3, 5, 9 and 12 can be considered as special cases. Lemma 14.1 refers to the solution of type W-LESS, Lemma 14.2 to the type R, W-MINOLESS and Lemma 14.3 to the type R, W-HAPS.
14-1 Solutions of type W-LESS

The solutions of our model equation Ax + Bi = By − c of type W-LESS can be characterized by Lemma 14.1.

Lemma 14.1 (The Grand Linear Model, W-LESS):
The m × 1 vector x_ℓ is W-LESS of Ax + Bi = By − c if and only if
[W B′ 0; B 0 A; 0 A′ 0] [i_ℓ; λ_ℓ; x_ℓ] = [0; By − c; 0]    (14.1)
with the q × 1 vector λ_ℓ of “Lagrange multipliers”. x_ℓ exists in the case of ℛ(B′) ⊆ ℛ(W) and solves the system of normal equations
[A₂′(B₂W⁻W₁W⁻B₂′)⁻¹A₂ A₃′; A₃ 0] [x_ℓ; λ₃] = [A₂′(B₂W⁻W₁W⁻B₂′)⁻¹(B₂W⁻W₁y − k₂); −c₃]    (14.2)
with
W₁ := W − B₁′(B₁W⁻B₁′)⁻¹B₁,    (14.3)
k₂ := c₂ − B₂W⁻B₁′(B₁W⁻B₁′)⁻¹c₁    (14.4)
and the q₃ × 1 vector λ₃ of “Lagrange multipliers”, which are independent of the choice of the g-inverse W⁻ and uniquely determined if Ax_ℓ is uniquely determined. x_ℓ is unique if the matrix
N := A₂′(B₂W⁻W₁W⁻B₂′)⁻¹A₂ + A₃′A₃    (14.5)
is regular, or equivalently if
rk[A₂′, A₃′] = rk A = m.    (14.6)
In this case, x_ℓ has the representation
x_ℓ = N⁻¹N₃N⁻¹A₂′(B₂W⁻W₁W⁻B₂′)⁻¹(B₂W⁻W₁y − k₂) − N⁻¹A₃′(A₃N⁻¹A₃′)⁻c₃    (14.7)
with
N₃ := N − A₃′(A₃N⁻¹A₃′)⁻A₃,    (14.8)
independent of the choice of the g-inverse (A₃N⁻¹A₃′)⁻.

:Proof:
W-LESS will be constructed by means of the “Lagrange function”
ℒ(i, x, λ) := i′Wi + 2λ′(Ax + Bi − By + c) = min over i, x, λ,
for which the first derivatives
∂ℒ/∂i (i_ℓ, x_ℓ, λ_ℓ) = 2(Wi_ℓ + B′λ_ℓ) = 0,
wL (i l , xl , Ȝ l ) = 2 A cȜ l = 0 wx wL (i l , xl , Ȝ l ) = 2( Axl + Bi l By + c) = 0 wȜ are necessary conditions. Note the theory of vector derivatives is summarized in Appendix B. The second derivatives w2L (i l , x l , Ȝ l ) = 2W t 0 wiwic constitute due to the positive-semidefiniteness of the matrix W the sufficiency condition. In addition, due to the identity WW Bc = Bc and the invariance of BW Bc with respect to the choice of the g-inverse such that with the matrices BW Bc and B1 W B1c the “Schur complements” B 2 W Bc2 B 2 W B1c (B1 W B1c ) 1 B1 W Bc2 = B 2 W W1 W Bc2 is uniquely invertible. Once if the vector ( q1 + q2 + q3 ) × 1 vector Ȝ l is partitioned with respect to Ȝ cl := [Ȝ1c , Ȝ c2 , Ȝ c3 ] with O(Ȝ i ) = q1 × 1 for all i = 1, 2, 3, then by eliminating of i l we arrive at the reduced system of normal equations ª B1W B1c B1W Bc2 0 0 º ª Ȝ1 º ª B1y c1 º « B W B c B W B c 0 A » « Ȝ » « B y c » 2 1 2 2 2» « 2 « 2»= « 2 » Ȝ c 0 0 0 A 3 « 3»« 3» « » 0 A c2 A c3 0 »¼ «¬ Ȝ l »¼ «¬ 0 »¼ ¬«
(14.9)
and by further eliminating Ȝ1 and Ȝ 2 1
ª B W Bc B1 W Bc2 º 1 1 » = A c(B 2 W W1 W Bc2 ) [B 2 W B1c (B1 W B1c ) , I] c B W B ¬ 2 1 2 2¼
[ 0, Ac2 ] «B1 W B1c
leads with c1 \ q = R (B1 ), c 2 \ q = R (B 2 ) and c3 R ( A 3 ) to the existence of x l . An equivalent system of equations is 1
2
ª N A c3 º ª xl º ª A c2 (B 2 W W1W Bc2 ) 1 (B 2 W W1y k 2 ) Ac3c3 º » «¬ A 3 0 »¼ «¬ Ȝ 3 »¼ = « c3 ¬ ¼ subject to N := A c2 ( B 2 W W1 W Bc2 ) 1 A 2 + A c3 A 3 ,
which we can solve for c3 = A 3 xl = A 3 N Nxl = A 3 N Ac2 (B 2 W W1 W Bc2 ) 1 (B 2 W W1 y k 2 ) A3 N Ac3 (c3 + Ȝ 3 ) for an arbitrary g-inverse N . In addition, we solve further for a g-inverse ( A 3 N A c3 )
A c3 (c3 + Ȝ 3 ) = A c3 ( A 3 N A c3 ) A c3 N Ac3 (c3 Ȝ 3 ) = = A c3 ( A 3 N A c3 ) A 3 N Ac2 (B 2 W W1 W Bc2 ) 1 (B 2 W W1 y k 2 ) + Ac3 ( A 3 N Ac3 ) c3 subject to Nxl = A c2 (B 2 W W1 W Bc2 ) 1 (B 2 W W1 y k 2 ) Ac3 (c3 + Ȝ 3 ) = [N A c3 ( A 3 N A c3 ) A 3 ]N A c2 (B 2 W W1 W Bc2 ) 1 (B 2 W W1 y k 2 ) A c3 ( A 3 N A c3 ) c3 . With the identity ª(B W W1W Bc2 ) 1 0 º ª A 2 º N = [ A c2 , A c3 ] « 2 0 I »¼ «¬ A 3 »¼ ¬
(14.10)
we recognize that with Nx l also Ax l = AN Nx l is independent of the g-inverse N and is always uniquely determinable. x l is unique if and only if N is regular. We summarize our results by specializing the matrices A and B and the vector c and find x l of type (14.7) and (14.8).
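The reduced representation (14.7)-(14.8) can be checked numerically against the full normal equations (14.1). The following Python fragment is a minimal sketch, assuming W = I_n (so W⁻ = I_n); the matrices B₁, B₂, A₂, A₃, the constants c₁, c₂, c₃ and the observations y are small random illustrative quantities, not from the book, and the stacked B = [B₁; B₂; 0], A = [0; A₂; A₃], c = [c₁; c₂; c₃] follow the split-level structure described above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 6, 3
q1, q2, q3 = 2, 3, 1
B1 = rng.standard_normal((q1, n)); B2 = rng.standard_normal((q2, n))
A2 = rng.standard_normal((q2, m)); A3 = rng.standard_normal((q3, m))
c1 = rng.standard_normal(q1); c2 = rng.standard_normal(q2)
c3 = A3 @ rng.standard_normal(m)            # c3 in the range of A3
y = rng.standard_normal(n)

# Reduced quantities (14.3)-(14.5) for W = I_n:
W1 = np.eye(n) - B1.T @ np.linalg.inv(B1 @ B1.T) @ B1
k2 = c2 - B2 @ B1.T @ np.linalg.inv(B1 @ B1.T) @ c1
M = B2 @ W1 @ B2.T                          # B2 W^- W1 W^- B2'
N = A2.T @ np.linalg.inv(M) @ A2 + A3.T @ A3
r = A2.T @ np.linalg.inv(M) @ (B2 @ W1 @ y - k2)

# Solution formula (14.7)-(14.8):
Ninv = np.linalg.inv(N)
S = np.linalg.inv(A3 @ Ninv @ A3.T)
N3 = N - A3.T @ S @ A3
x_reduced = Ninv @ N3 @ Ninv @ r - Ninv @ A3.T @ S @ c3

# Cross-check against the full normal equations (14.1):
q = q1 + q2 + q3
B = np.vstack([B1, B2, np.zeros((q3, n))])
A = np.vstack([np.zeros((q1, m)), A2, A3])
c = np.concatenate([c1, c2, c3])
K = np.zeros((n + q + m, n + q + m))
K[:n, :n] = np.eye(n);      K[:n, n:n + q] = B.T
K[n:n + q, :n] = B;         K[n:n + q, n + q:] = A
K[n + q:, n:n + q] = A.T
rhs = np.concatenate([np.zeros(n), B @ y - c, np.zeros(m)])
x_full = np.linalg.solve(K, rhs)[n + q:]
print(np.allclose(x_reduced, x_full))       # both routes give the same W-LESS x
```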
14-2 Solutions of type R, W-MINOLESS The solutions of our model equation Ax + Bi = By c of type R, W – MINOLESS are characterized by Lemma 14.2. Lemma 14.2 (The Grand Linear Model, R, W – MINOLESS): Under the assumption R (Bc) R ( W)
(14.11)
is the vector x lm R, W – MINOLESS of Ax + Bi = By c if and only if the system of normal equations 0 ª º ª R N º ª xlm º « 1 » c c N N A ( B W W W B ) ( B W W y k ) 2 2 1 2 2 1 2 «¬ N 0 »¼ «¬ Ȝ lm »¼ « 3 » A c3 ( A 3 N A c3 ) c3 ¬« ¼»
(14.12)
with the m × 1 vector λ_lm of "Lagrange multipliers" subject to
W₁ := W − B₁′(B₁W⁻B₁′)⁻¹B₁   (14.13)
k₂ := c₂ − B₂W⁻B₁′(B₁W⁻B₁′)⁻¹c₁   (14.14)
N := A₂′(B₂W⁻W₁W⁻B₂′)⁻¹A₂ + A₃′A₃   (14.15)
N₃ := N − A₃′(A₃N⁻A₃′)⁻A₃.   (14.16)
All definitions are independent of the choice of the g-inverses N⁻ and (A₃N⁻A₃′)⁻. x_lm always exists, and it is unique if and only if the matrix
R + N   (14.17)
is regular, or equivalently if
rk[R, A] = m   (14.18)
holds.
:Proof: The proof follows the line of Lemma 12.13 if we refer to the reduced system of normal equations (12.62). The rest is subject to the identity (12.59) ªR « 0 R + N = [R , A c] « « 0 « ¬« 0
0 0 0º » I 0 0» ªR º « ». 0 (B 2 W W1W Bc2 )1 0 » ¬ A ¼ » 0 0 I ¼»
(14.19)
It is obvious that the condition (14.18) is fulfilled if the matrix R is positive definite and consequently if it is describing a R - norm. In specifying the matrices A and B and the vector c, we receive a system of normal equations of type (14.12)-(14.18).
14-3 Solutions of type R, W-HAPS The solutions of our model equation Ax + Bi = By c of type R, W – HAPS will be characterized by Lemma 14.3. Lemma 14.3 (The Grand Linear Model, R, W - HAPS): An m × 1 vector x h is R, W – HAPS of Ax + Bi = By c if it solves the system of normal equations ª W Bc 0 º ª i h º ª 0 º « B 0 A » « Ȝ h » = « By - c » « 0 Ac R » « x » « 0 » ¬ ¼¬ h¼ ¬ ¼
(14.20)
with the q × 1 vector Ȝ A of “Lagrange multipliers”. x h exists if R (Bc) R ( W) and if it solves the system of normal equations ª R + A c2 (B 2 W W1W Bc2 ) 1 A 2 A c3 º ª x h º = « A3 0 »¼ «¬ Ȝ 3 »¼ ¬ ª A c (B W W1W Bc2 ) 1 (B 2 W W1y k 2 ) º =« 2 2 » c3 ¬ ¼
(14.21)
(14.22)
with
W₁ := W − B₁′(B₁W⁻B₁′)⁻¹B₁   (14.23)
k₂ := c₂ − B₂W⁻B₁′(B₁W⁻B₁′)⁻¹c₁   (14.24)
and the q₃ × 1 vector λ₃ of "Lagrange multipliers", which are defined independently of the choice of the g-inverse W⁻ and in such a way that both Ax_h and Rx_h are uniquely determined. x_h is unique if and only if the matrix
R+N
(14.25)
with N := A c2 ( B 2 W W1 W Bc2 ) 1 A 2 + A c3 A 3
(14.26)
being regular or equivalently, if rk[R , A ] = m.
(14.27)
In this case, x h can be represented by x h = (R + N ) 1{R + N A c3 [ A 3 (R + N) 1 A c3 ] A 3 × ×(R + N ) 1 A c2 (B 2 W W1 W Bc2 ) 1 ×
(14.28)
×(B 2 W W1 y k 2 ) ( R + N) Ac3 [ A 3 ( R + N) A c3 ) c3 ,
1
1
which is independent of the choice of the g - inverse [ A 3 ( R + N) 1 A c3 ] . :Proof: R, W – HAPS will be constructed by means of the “Lagrange function”
L ( i, x, Ȝ ) := icWi + x cRx + 2 Ȝ c( Ax + Bi By + c) = min i, x, Ȝ
for which the first derivatives wL ( i h , x h , Ȝ h ) = 2( Wi h + BcȜ h ) = 0 wi wL ( i h , x h , Ȝ h ) = 2( A cȜ h + Rx h ) = 0 wx wL ( i h , x h , Ȝ h ) = 2( Ax h + Bi h By + c ) = 0 wȜ are necessary conditions. Note the theory of vector derivatives is summarized in Appendix B. The second derivatives w 2L (i h , x h , Ȝ h ) = 2 W t 0 wiwi c w 2L ( i h , x h , Ȝ h ) = 2R t 0 wxwx c constitute due to the positive-semidefiniteness of the matrices W and R a sufficiency condition for obtaining a minimum. Because of the condition (14.21) R (Bc) R ( W) we are able to reduce first the vector i A in order to be left with the system of normal equations.
ª -B1W B1c -B1 W B c2 0 0 º ª Ȝ1 º ª B1y - c1 º «-B W Bc -B W B c 0 A » « Ȝ » «B y - c » 2 1 2 2 3» « 2 « 2»= « 2 » 0 0 0 A 3 » « Ȝ 3 » « -c3 » « 0 A c3 A c3 R ¼» «¬ x h »¼ «¬ 0 ¼» ¬«
(14.29)
is produced by partitioning the ( q1 + q2 + q3 ) × 1 vector due to Ȝ ch = [Ȝ1c , Ȝ c2 , Ȝ c3 ] and 0( Ȝ i ) = qi for i = 1, 2, 3 . Because of BW Bc and B1 W B1c with respect to the “Schur complement”, B 2 W Bc2 B 2 W B1c (B1 W B1c ) 1 B1 W Bc2 = B 2 W W1 W Bc2 is uniquely invertible leading to a further elimination of Ȝ1 and Ȝ 2 because of 1
ª B W Bc B1W Bc2 º 1 1 [0, A c2 ] « 1 1 » = A c2 (B 2 W W1 W Bc2 ) [ B 2 W B1c (B1W B1c ) , I ] c c B W B B W B ¬ 2 1 2 2¼ and ª R + N A c3 º ª x h º ª A c2 (B 2 W W1 W Bc2 )1 (B 2 W W1 y k 2 ) A c3 c3 º =« ». « A 0 »¼ «¬ Ȝ 3 »¼ ¬ c3 ¬ 3 ¼ For any g – inverse ( R + N ) there holds c3 = A 3 x h = A 3 (R + N) 1 (R + N)x h = A 3 (R + N)[ A c2 (B 2 W W1 W Bc2 )-1 (B 2 W W1 y k 2 ) A 3 (c3 + Ȝ 3 )] and for an arbitrary g – inverse [ A 3 ( R + N) A c3 ] (R + N)x h = A c2 (B 2 W W1W Bc2 )-1 (B 2 W W1y k 2 ) A c3 (c3 = Ȝ 3 ) = = {R + N A c3 [ A 3 (R + N) A c3 ] A c3 }(R + N) (B 2 W B1 W Bc2 )-1 × ×(B 2 W W1y k 2 ) A3c [ A3 ( R + N) Ac3 ] c3 A c3 (c3 = Ȝ 3 ) = A c3 [ A 3 (R + N) A 3 ](R + N) A c3 (c3 = Ȝ 3 ) = = A c3 [ A 3 ( R + N) A c3 ] A 3 (R + N) 1 A c2 (B 2 W W1 W Bc2 )-1 (B 2 W W1 y k 2 ) + + A c3 [ A 3 (R + N) A c3 ] c3 . (14.30) Thanks to the identity ªR « R + N = [R, A c2 , A c3 ] « 0 « 0 ¬ it is obvious that
0 0º ª R º » 1 (B 2 W W1W B c2 ) 0 » «« A 2 »» 0 I »¼ «¬ A 3 »¼
(i) the solution x_h always exists, and
(ii) it is unique when the matrix (R + N) is regular, which coincides with (14.28).
Under the condition R (Bc) R ( W) , R, W – HAPS is unique if R, W – MINOLESS is unique. Indeed the forms for R, W – HAPS and R, W – MINOLESS are identical in this case. A special form, namely (R + N)x Am + NȜ Am = = N 3 N A c2 (B 2 W W1 W Bc2 )-1 (B 2 W W1y k 2 ) A c3 ( A 3 N Ac3 ] c3
Nx Am = = N 3 N Ac2 (B 2 W W1W Bc2 ) (B 2 W W1y k 2 ) A c3 ( A 3 N A c3 ] c3 ,
1
(14.31)
(14.32)
leads us to the representation xh xAm = (R + N)1 NȜ Am + +(R + N)1{R + N Ac3 ( A3 (R + N)1 Ac3 ) A3 ](R + N)1 N3 N- }Ac2 (B2 W W1 W B'2 )-1 ×
(14.33)
×(B2 W W1y k 2 ) + (R + N)1 Ac3 {(A3 N1 Ac3 ) [A3 (R + N)1 Ac3 ] }c3 .
14-4 Review of the various models: the sixth problem Table 14.1 gives finally a review of the various models of type “split level”. Table 14.1 (Special cases of the general linear model of type conditional equations with unknowns (general Gauss-Helmert model)): B1i = B1y - c1 ª0 º ª B1 º ª B1 º ª c1 º A 2 x + B 2 i = B 2 y - c 2 « A 2 » x + «B 2 » i = « B 2 » y «c 2 » « » « » « » « » A 3 x = -c31 ¬ A3 ¼ ¬0¼ ¬0¼ ¬ c3 ¼ Ax = y Ax + i = y AX = By Ax + Bi = By Bi = By Ax = By - c Ax + Bi = By - c Bi = By - c y R(A) y R (A) By R(A) By R(A) By c + R(A) By c + R(A) A2 = A A3 = 0 B1 = 0 B2 = I (i = 0) c1 = 0 c2 = 0 c3 = 0
A2 = A A3 = 0 B1 = 0 B2 = I c1 = 0 c2 = 0 c3 = 0
A2 = A A3 = 0 B1 = 0 B2 = B (i = 0) c1 = 0 c2 = 0 c3 = 0
A2 = A A3 = 0 B1 = 0 B2 = B
A2 = 0 A3 = 0 B1 = 0 B2 = 0
c1 = 0 c2 = 0 c3 = 0
c1 = 0 c2 = 0 c3 = 0
A2 = A A3 = 0 B1 = 0 B2 = B (i = 0) c1 = 0 c2 = c c3 = 0
A2 = A A3 = 0 B1 = 0 B2 = B
A2 = 0 A3 = 0 B1 = B B2 = 0
c1 = 0 c2 = c c3 = 0
c1 = c c2 = 0 c3 = 0
Example 14.1 As an example of a partitioned general linear system of equations, of type Ax + Bi = By c we treat a planar triangle whose coordinates consist of three distance measurements under a datum condition. As approximate coordinates for the three points we choose xD = 3 / 2, yD = 1/ 2, xE = 3, yE = 1, xJ = 3 / 2, yJ = 3 / 2 such that the linearized observation equation can be represented as A 2 x = y (B 2 = I, c 2 = 0, y R (A 2 )) ª 3 / 2 1/ 2 « A2 = « 0 1 « 0 0 ¬
0 º » 1 ». 3 / 2 1/ 2 3 / 2 1/ 2 »¼
3/2 0
1/ 2 0
0 0
The number of three degrees of freedom of the network rotation are fixed by three conditions: ª1 0 0 0 0 A 3x = c 3 (c 3 R(A 3 ) ), A 3 = «0 1 0 0 0 « «¬0 0 0 0 1
of type translation and 0º ª 0, 01º » 0 , c 3 = «0, 02 » , » « » «¬ 0, 01»¼ 0»¼
especially with the rank and order conditions rkA 3 = m rkA 2 = 3, O(A 2 ) = n × m=3 × 6, O(A 3 ) = ( m rkA 2 ) × m=3 × 6, and the 6 × 6 matrix [ A c2 , A c3 ]c is of full rank. Choose the observation vector ªA2 º ª y º y = [104 ,5 × 104 , 4 × 104 ]c and find the solution x = « » « » , in detail: ¬ A 3 ¼ ¬ c 3 ¼ x = [0.01, 0.02, 0.00968, 0.02075, 0.01, 0.02050]c.
15 Special problems of algebraic regression and stochastic estimation: multivariate Gauss-Markov model, the n-way classification model, dynamical systems Up to now, we have only considered an “univariate Gauss-Markov model”. Its generalization towards a multivariate Gauss-Markov model will be given in Chapter 15.1. At first, we define a multivariate linear model by Definition 15.1 by giving its first and second order moments. Its algebraic counterpart via multivariate LESS is subject of Definition 15.2. Lemma 15.3 characterizes the multivariate LESS solution. Its multivariate Gauss-Markov counterpart is given by Theorem 15.4. In case we have constraints in addition, we define by Definition 15.5 what we mean by “multivariate Gauss-Markov model with constraints”. The complete solution by means of “multivariate Gauss-Markov model with constraints” is given by Theorem 15.5. In contrast, by means of a MINOLESS solution we present the celebrated “n-way classification model”. Examples are given for a 1-way classification model, for a 2-way classification model without interaction, for a 2-way classification model with interaction with all numerical details for computing the reflexive, symmetric generalized inverse ( A cA ) rs . The higher classification with interaction is finally reviewed. We especially deal with the problem how to compute a basis of unbiased estimable quantities from biased solutions. Finally, we take account of the fact that in addition to observational models, we have dynamical system equations. Additionally, we therefore review the Kalman Filter (Kalman - Bucy Filter). Two examples from tracking a satellite orbit and from statistical quality control are given. In detail, we define the stochastic process of type ARMA and ARIMA. A short introduction on “dynamical system theory” is presented. By two examples we illustrate the notions of “a steerable state” and of “observability”. A careful review of the conditions “steerability” by Lemma 15.7 and “observability” by Lemma 15.8 is presented. Traditionally the state differential equation as well as the observational equation are solved by a typical Laplace transformation which we will review shortly. At the end, we focus on the modern theory of dynamic nonlinear models and comment on the theory of chaotic behaviour as its up-to date counterpart.
15-1 The multivariate Gauss-Markov model – a special problem of probabilistic regression – Let us introduce the multivariate Gauss-Markov model as a special problem of the probabilistic regression. If for one matrix A of dimension O( A) = n × m in a Gauss-Markov model instead of one vector of observations several observation vectors y i of dimension O ( y i ) = n × p with identical variance-covariance matrix Ȉij are given and the fixed array of parameters ȟ i has to be determined, the model is referred to as a
Multivariate Gauss-Markov model. The standard Gauss-Markov model is then called a univariate Gauss-Markov model. The analysis of variance-covariance is applied afterwards to a multivariate model if the effect of factors can be referred to not only by one characteristic of the phenomenon to be observed, but by several characteristics. Indeed this is the multivariate analysis of variance-covariance. For instance, the effects of different regions on the effect of a species of animals are to be investigated, the weight of the animals can serve as one characteristic and the height of the animals as a second one. Multivariate models can also be setup, if observations are repeated at different times, in order to record temporal changes of a phenomenon. If measurements in order to detect temporal changes of manmade constructions are repeated with identical variance-covariance matrices under the same observational program, the matrix A of coefficients in the Gauss-Markov model stays the same for each repetition and each repeated measurement corresponds to one characteristic. Definition 15.1 (multivariate Gauss-Markov model): Let the matrix A of the order n × m be given, called the first order design matrix, let ȟ i denote the matrix of the order m × p of fixed unknown parameters, and let y i be the matrix of the order n × p called the matrix of observations subject to p d n . Then we speak of a “multivariate Gauss-Markov model” if (15.1)
E{y i } = Aȟ i
(15.2)
D{y i , y j } = I nG ij
ª O{y i } = n × p subject to «« O{A} = n × m «¬O{ȟ i } = m × p subject to O{G ij } = p × p and p.d.
for all i, j {1,..., p} apply for a second order statistics. Equivalent vector and matrix forms are (15.3)
E{Y} = AȄ and E{vec Y } = (I p
A) vec Ȅ
(15.5)
D{vec Y} = Ȉ
I n and d {vec Y} = d ( Ȉ
I n ) (15.6)
(15.4)
subject to O{Ȉ} = p × p, O{Ȉ
I} = np × np, O{vec Y} = np × 1, O{Y} = n × p O{D{vec Y}} = np × np, O{d (vec Y)} = np( np + 1) / 2. The matrix D{vec Y} builds up the second order design matrix as the Kronecker-Zehfuss product Ȉ and I n .
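As a quick numerical check of the vectorized forms (15.3)-(15.6), the following sketch (plain NumPy, with arbitrary illustrative dimensions that are an assumption of this sketch, not taken from the text) verifies the identity vec(AΞ) = (I_p ⊗ A) vec Ξ that underlies them; a column-wise vec convention (order='F') is assumed.

```python
import numpy as np

# illustrative dimensions (assumed): n observations, m parameters, p channels
n, m, p = 5, 3, 2
rng = np.random.default_rng(0)
A = rng.normal(size=(n, m))           # first order design matrix
Xi = rng.normal(size=(m, p))          # parameter array Xi

vec = lambda M: M.flatten(order="F")  # column-wise vectorization

lhs = vec(A @ Xi)                         # vec(A Xi)
rhs = np.kron(np.eye(p), A) @ vec(Xi)     # (I_p kron A) vec(Xi)
print(np.allclose(lhs, rhs))              # True: E{vec Y} = (I_p ⊗ A) vec Ξ
```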
In the multivariate Gauss-Markov model both the matrices ξ_i and σ_ij, or Ξ and Σ, are unknown. An algebraic equivalent of the multivariate linear model would read as given by Definition 15.2.
Definition 15.2 (multivariate linear model):
Let the matrix A of the order n × m be given, called the first order algebraic design matrix, let x_i denote the matrix of the order m × p of fixed unknown parameters, and let y_i be the matrix of order n × p called the matrix of observations subject to p ≤ n. Then we speak of an algebraic multivariate linear model if
Σ_{i=1}^{p} ||y_i − Ax_i||²_{G_y} = min over x_i   ~   ||vec Y − (I_p ⊗ A) vec X||²_{G_vec Y} = min over vec X   (15.7)
establishing a G y or G vecY -weighted least squares solution of type multivariate LESS. It is a standard solution of type multivariate LESS if ª O{X} = m × p X = ( A cG y A) A cG y Y « O{A} = n × m « ¬O{Y} = n × p ,
(15.8)
which nicely demonstrates that the multivariate LESS solution is built on a series of univariate LESS solutions. If the matrix A is regular in the sense of rk( A cG y A ) = rk A = m , our multivariate solution reads X = ( A cG y A ) 1 A cG y Y ,
(15.9)
excluding any rank deficiency caused by a datum problem. Such a result may be initiated by fixing a datum parameter of type translation (3 parameters at any epoch), rotation (3 parameters at any epoch) and scale (1 parameter at any epoch). These parameters make up the seven parameter conformal group C7(3) at any epoch in a three-dimensional Euclidean space (pseudo-Euclidean space).
Lemma 15.3 (general multivariate linear model):
A general multivariate linear model is multivariate LESS if
Σ_{i,j=1}^{p} (y_i − Ax_i)′ G_ij (y_j − Ax_j) = min over x   (15.10)
or
Σ_{α=1}^{n} Σ_{β,γ=1}^{m} Σ_{i,j=1}^{p} (y_{αi} − a_{αβ} x_{βi}) G_ij (y_{αj} − a_{αγ} x_{γj}) = min over x   (15.11)
or
(vec Y − (I_p ⊗ A) vec X)′ (I_n ⊗ G_y) (vec Y − (I_p ⊗ A) vec X) = min over vec X.   (15.12)
An array X , dim X = m × p is multivariate LESS, if
vec X̂ = [(I_p ⊗ A)′(I_n ⊗ G_Y)(I_p ⊗ A)]⁻¹ (I_p ⊗ A)′(I_n ⊗ G_Y) vec Y   (15.13)
and
rk(I_n ⊗ G_Y) = np.   (15.14)
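The claim that multivariate LESS decomposes into a series of univariate LESS problems can be checked numerically. The sketch below is an assumed NumPy illustration (not the book's own computation) in which the weight block is taken as I_p ⊗ G_y, i.e. the same weight G_y for every observation vector; it compares the Kronecker form with a column-by-column solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 6, 3, 2
A = rng.normal(size=(n, m))
Y = rng.normal(size=(n, p))
L = rng.normal(size=(n, n))
Gy = L @ L.T + n * np.eye(n)            # a positive definite weight matrix

# column-wise (univariate) weighted LESS
X_cols = np.column_stack([
    np.linalg.solve(A.T @ Gy @ A, A.T @ Gy @ Y[:, i]) for i in range(p)
])

# Kronecker (multivariate) form; weight block I_p ⊗ Gy is an assumption of this sketch
Ap = np.kron(np.eye(p), A)
Gp = np.kron(np.eye(p), Gy)
vecY = Y.flatten(order="F")
vecX = np.linalg.solve(Ap.T @ Gp @ Ap, Ap.T @ Gp @ vecY)

print(np.allclose(X_cols, vecX.reshape(m, p, order="F")))   # True
```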
Thanks to the weight matrix G_ij the multivariate least squares solution (15.13) differs from the special univariate model (15.9). The analogue to the general LESS model (15.10)-(15.12) of type multivariate BLUUE is given next.
Theorem 15.4 (multivariate Gauss-Markov model of type ξ_i, in particular (Σ, I_n)-BLUUE):
A multivariate Gauss-Markov model is (Σ, I_n)-BLUUE if the vector vec Ξ of an array Ξ, dim Ξ = m × p, dim(vec Ξ) = mp × 1, of unknowns is estimated by
vec Ξ̂ = [(I_p ⊗ A)′(Σ ⊗ I_n)⁻¹(I_p ⊗ A)]⁻¹ (I_p ⊗ A)′(Σ ⊗ I_n)⁻¹ vec Y   (15.15)
subject to
rk(Σ ⊗ I_n)⁻¹ = np.   (15.16)
Σ ~ σ_ij denotes the variance-covariance matrix of the multivariate effects y_{αi} for all α = 1,…,n and i = 1,…,p. An unbiased estimator of the variance-covariance matrix of multivariate effects is
i = j:  σ̂²_i = (y_i − Aξ̂_i)′(y_i − Aξ̂_i) / (n − q)
i ≠ j:  σ̂_ij = (y_i − Aξ̂_i)′(y_j − Aξ̂_j) / (n − q)   (15.17)
because of
E{(y_i − Aξ̂_i)′(y_j − Aξ̂_j)} = E{y_i′(I − A(A′A)⁻¹A′) y_j} = σ_ij (n − q).   (15.18)
A nice example is given in K.R. Koch (1988, pp. 281-286). For practical applications we need the incomplete multivariate models which do not allow a full rank matrix Σ ~ σ_ij. For instance, in the standard multivariate model it is assumed that the matrix A of coefficients is identical for the p vectors y_i and that the vectors y_i are completely given. If, due to a change in the observational program in the case of repeated measurements or due to a loss of measurements, these assumptions are not fulfilled, an incomplete multivariate model results. If all the matrices of coefficients are different, but the p vectors y_i of observations agree in their dimension, the variance-covariance matrix Σ and the vectors ξ_i of first order parameters can be iteratively estimated.
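A small sketch of the estimator (15.17), assuming a full column rank A so that q = rk A = m and each ξ̂_i is the ordinary least squares solution per column (this assumption, and the NumPy wording, are mine, not the book's):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, p = 50, 4, 3
A = rng.normal(size=(n, m))
Y = rng.normal(size=(n, p))

Xi_hat = np.linalg.lstsq(A, Y, rcond=None)[0]   # column-wise LS estimates xi_i
R = Y - A @ Xi_hat                              # residual matrix
q = np.linalg.matrix_rank(A)
Sigma_hat = R.T @ R / (n - q)                   # sigma_ij estimates of (15.17)
print(Sigma_hat.shape)                          # (p, p), symmetric
```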
For example, if the parameters of first order, namely ȟ i , and the parameters of second order, namely V ij , the elements of the variance-covariance matrix, are unknown, we may use the hybrid estimation of first and second order parameters of type {ȟ i , V ij } as outlined in Chapter 3, namely Helmert type simultaneous estimation of {ȟ i , V ij } (B. Schaffrin 1983, p.101). An important generalization of the standard multivariate Gauss-Markov model taking into account constraints, for instance caused by rank definitions, e.g. the datum problem at r epochs, is the multivariate Gauss-Markov model with constraints which we will treat at the end. Definition 15.5 (multivariate Gauss-Markov model with constraints): If in a multivariate model (15.1) and (15.2) the vectors ȟ i of parameters of first order are subject to constraints Hȟ i = w i , (15.19) where H denotes the r × m matrix of known coefficients with the restriction (15.20) H ( A cA) A cA = H, rk H = r d m and w i known r × 1 vectors, then E{y i } = Aȟ i ,
(15.21) (15.22)
D{y i , y j } = I nV ij subject to Hȟ i = w i (15.23) is called “the multivariate Gauss-Markov model with linear homogeneous constraints”. If the p vectors w i are collected in the r × p matrix W , dim W = r × p, the corresponding matrix model reads E{Y} = A;, D{vec Y} = Ȉ
I n , HȄ = W
(15.24)
subject to O{Ȉ} = p × p, O{Ȉ
I n } = np × np, O{vec Y} = np × 1, O{Y} = n × p O{D{vec Y}} = np × np, O{H} = r × m, O{Ȅ} = m × p, O{W} = r × p.
(15.25)
The vector forms E{vec Y} = (I p
A) vec Ȅ, '{vec Y} = Ȉ
I n , vec W = (I p
H) vec Ȅ are equivalent to the matrix forms.
A key result is Lemma 15.6 in which we solve for a given multivariate weight matrix G ij - being equivalent to ( Ȉ
I n ) 1 - a multivariate LESS problem. Theorem 15.6 (multivariate Gauss-Markov model with constraints): A multivariate Gauss-Markov model with linear homogeneous constraints is ( Ȉ, I n ) BLUUE if ˆ = (I
( AcA) Ac) vec Y + Y(I
( AcA) H c( H( AcA) H c) 1 ) vec Y vec Ȅ p p (I p
( A cA) H c) 1 H ( A cA) A c) vec Y (15.26) or ˆ = ( A cA) ( A cY + H c(H ( A cA) H c) 1 ( W H( A cA) A cY)) Ȅ (15.27) An unbiased estimation of the variance-covariance matrix Ȉ is 1 = ˆ Y)c( AȄ ˆ Y) + Ȉ {( AȄ (15.28) nm+r ˆ W)c( H( AcA) Ac) 1 ( HȄ ˆ W)}. + (HȄ
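A minimal sketch of the matrix form (15.27), assuming a full column rank A so that (A′A)⁻ may be taken as the ordinary inverse and the constraint HΞ = W is consistent; it only illustrates the algebra, not the general g-inverse case, and all names below are my own illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p, r = 20, 4, 2, 1
A = rng.normal(size=(n, m))
Y = rng.normal(size=(n, p))
H = rng.normal(size=(r, m))
W = rng.normal(size=(r, p))                     # prescribed constraint values

N_inv = np.linalg.inv(A.T @ A)                  # (A'A)^-1, full rank assumed
Xi_u = N_inv @ A.T @ Y                          # unconstrained estimate
# constrained estimate in the spirit of equation (15.27)
Xi_c = Xi_u + N_inv @ H.T @ np.linalg.solve(H @ N_inv @ H.T, W - H @ Xi_u)
print(np.allclose(H @ Xi_c, W))                 # constraints are reproduced exactly
```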
15-2 n-way classification models
Another special model is the n-way classification model. We will define it and show how to solve its basic equations. Namely, we begin with the 1-way classification and continue with the 2- and 3-way classification models. A specific feature of any classification model is the nature of the design matrix, whose coefficients are either zero or one. The methods to solve the normal equations vary: In one approach, one assumes that the unknown vector of effects is a fixed effect. The corresponding normal equations are solved by standard MINOLESS, weighted or not. Alternatively, one assumes that the parameter vector consists of random effects. Methods of variance-covariance component estimation are then applied. Here we only follow a MINOLESS approach, weighted or not. The reader interested in the alternative technique of variance-covariance component estimation is referred to our Chapter 3 or to the literature, for instance H. Ahrens and J. Laeuter (1974) or S.R. Searle (1971), my favorite.
15-21
A first example: 1-way classification
A one-way classification model is defined by (15.30)
yij = E{ yij } + eij = P + Di + eij
(15.29)
y c := [y1c y c2 ...y cp 1 y cp ], xc := [ P D1 D 2 ...D p 1 D p ]
(15.31)
where the parameters P and Di are unknown. It is characteristic for the model that the coefficients of the unknowns are either one or zero. A MINOLESS
(MInimum NOrm LEast Squares Solution) for the unknown parameters {μ, α_i} is based on
||y − Ax||²_I = min over x   and   ||x||²_I = min over x,
which we build around a numerical example.
Numerical example: 1-way classification
Here we will investigate data concerning the investment on consumer durables of people with different levels of education. Assuming that investment is measured by an index number, namely supposing that available data consist of values of this index for 7 people: Table 15.1 illustrates a very small example, but adequate for our purposes.
Table 15.1 (investment indices of seven people):
Level of education            number of people   indices      total
1 (High School incomplete)    3                  74, 68, 77   219
2 (High School graduate)      2                  76, 80       156
3 (College graduate)          2                  85, 93       178
Total                         7                               553
A suitable model for these data is yij = P + Di + eij ,
(15.32)
where yij is investment index of the jth person in the ith education level, P is a general mean, Di is the effect on investment of the ith level of education and eij is the random error term peculiar to yij . For the data of Table 15.1 there are 3 educational levels and i takes the values j = 1, 2,..., ni 1, ni where ni is the number of observations in the ith educational level, in our case n1 = 3, n2 = 2 and n3 = 2 in Table 15.1. Our model is the model for the 1-way classification. In general, the groupings such as educational levels are called classes and in our model yij as the response and levels of education as the classes, this is a model we can apply to many situations. The normal equations arise from writing the data of Table 15.1 in terms of our model equation. ª 74 º ª y11 º ª P + D1 + e11 º « 68 » « y12 » « P + D1 + e12 » «77 » « y13 » « P + D1 + e13 » « 76 » = « y21 » = « P + D 2 + e21 » , O (y ) = 7 × 1 «80 » « y22 » « P + D 2 + e22 » «85 » « y31 » « P + D 3 + e31 » «¬ 93 »¼ « y » « P + D + e » 3 32 ¼ ¬ 32 ¼ ¬
or ª 74 º ª1 « 68 » «1 « » « «77 » «1 « 76 » = y = «1 «80 » «1 «85 » «1 « » « «¬ 93 »¼ «¬1 ª1 «1 «1 A = «1 «1 «1 «¬1
1 1 1 0 0 0 0
0 0 0 1 1 0 0
1 1 1 0 0 0 0
0 0 0 1 1 0 0
0º ª e11 º «e » 0» » ª P º « 12 » 0 » « » « e13 » D 0 » « 1 » + « e21 » = Ax + e y D 0 » « 2 » « e22 » «¬D 3 »¼ « » » 1» « e31 » 1 »¼ «¬ e32 »¼ and
0º 0» ªP º 0» «D » 0 » , x = «D 1 » , O ( A) = 7 × 4, O( x) = 4 ×1 2 » 0 «¬D 3 »¼ 1» 1 »¼
with y being the vector of observations and e y the vector of corresponding error terms. As an inconsistent linear equation y e y = Ax, O{y} = 7 × 1, O{A} = 7 × 4, O{x} = 4 × 1 we pose the key question: ?What is the rank of the design matrix A? Most notable, the first column is 1n and the sum of the other three columns is also one, namely c 2 + c 3 + c 4 = 1n ! Indeed, we have a proof for a linear dependence: c1 = c 2 + c 3 + c 4 . The rank rk A = 3 is only three which differs from O{A} = 7 × 4. We have to build in this rank deficiency. For example, we could postulate the condition x4 = D 3 = 0 eliminating one component of the unknown vector. A more reasonable approach would be based on the computation of the symmetric reflexive generalized inverse such that xlm = ( A cA) rs A cy ,
(15.33)
which would guarantee a least squares minimum norm solution or a V, S-BLUMBE solution (Best Linear V-Norm Uniformly Minimum Bias S-Norm Estimation) for V = I, S = I, and
rk A = rk A′A = rk (A′A)⁻_rs = rk A⁺.   (15.34)
A′A is a symmetric matrix, hence (A′A)⁻_rs is a symmetric matrix; this is called the
:rank preserving identity:   :symmetry preserving identity:
We intend to compute x_lm for our example.
Table 15.2: 1-way classification, example: normal equations

A′A = [ 7 3 2 2 ; 3 3 0 0 ; 2 0 2 0 ; 2 0 0 2 ]

Full rank factorization A′A = DE, O{D} = 4 × 3, O{E} = 3 × 4, with D taken as the first three columns of A′A:

D = [ 7 3 2 ; 3 3 0 ; 2 0 2 ; 2 0 0 ] ,   D′D = [ 66 30 18 ; 30 18 6 ; 18 6 8 ]

D′A′A = D′DE  ⇒  E = (D′D)⁻¹D′A′A = [ 1 0 0 1 ; 0 1 0 −1 ; 0 0 1 −1 ]

(A′A)⁻_rs = E′(EE′)⁻¹(D′D)⁻¹D′ =
[ 0.0833 0 0.0417 0.0417 ; 0 0.2500 −0.1250 −0.1250 ; 0.0417 −0.1250 0.3333 −0.1667 ; 0.0417 −0.1250 −0.1667 0.3333 ]

(A′A)⁻_rs A′ =
[ 0.0833 0.0833 0.0833 0.1250 0.1250 0.1250 0.1250 ;
  0.2500 0.2500 0.2500 −0.1250 −0.1250 −0.1250 −0.1250 ;
  −0.0833 −0.0833 −0.0833 0.3750 0.3750 −0.1250 −0.1250 ;
  −0.0833 −0.0833 −0.0833 −0.1250 −0.1250 0.3750 0.3750 ]

x_lm = (A′A)⁻_rs A′y = [ 60.0 , 13.0 , 18.0 , 29.0 ]′ .
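The same x_lm can be reproduced with the Moore-Penrose inverse, which for this minimum norm least squares problem gives the same solution as the reflexive symmetric g-inverse used above; the short NumPy check below is an assumed sketch, not the book's own computation.

```python
import numpy as np

# design matrix of the 1-way classification example (7 observations, 3 classes)
A = np.array([
    [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0],
    [1, 0, 1, 0], [1, 0, 1, 0],
    [1, 0, 0, 1], [1, 0, 0, 1],
], dtype=float)
y = np.array([74, 68, 77, 76, 80, 85, 93], dtype=float)

print(np.linalg.matrix_rank(A))      # 3 = 1 + (p - 1): the rank deficiency of the model
x_lm = np.linalg.pinv(A) @ y         # minimum norm least squares solution (MINOLESS)
print(x_lm)                          # approximately [60., 13., 18., 29.]
```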
Summary The general formulation of our 1-way classification problem is generated by identifying the vector of responses as well as the vector of parameters: Table 15.3: 1-way classification y c := [y11 , y12 ...y1( n 1) y1n | y 21y 22 ...y 2( n 1
1
2
1)
y 2 n | ... | y p1y p 2 ...y p ( n 2
p
1)
y ( pn ) ] p
xc := [ P D1 D 2 ...D p 1 D p ] ª1 «" «1 «1 A := «" 1 «" «1 «" ¬« 1
1 " 1 0 " 0 " 0 " 0
0
0º "» 0» 0 "»» , O( A) = n × ( p + 1) 0 "» 1» "» 1 ¼»
0 1 1 0 0
p
n = n1 + n2 + ... + n p = ¦ ni
(15.35)
i =1
experimental design: number of rank of the number of observations parameters: design matrix: n = n1 + n2 + … + n p
1+ p
1 + ( p 1) = p
(15.36)
:MINOLESS: (15.37)
15-22
|| y - Ax || = min and || x ||2 = min
(15.38)
xlm = ( A cA) rs A cy.
(15.39)
2
x
A second example: 2-way classification without interaction
A two-way classification model without interaction is defined by “MINOLESS” yijk = E{ yijk } + eijk = P + D i + E j + eijk
(15.40)
c , y c21 ,..., y cp 1 1 , y cp1 , y12 c , y c22 ,..., y cp q 1 , y cpq ] y c := [y11 xc = [ P ,D1 ,..., D p , E1 ,..., E q ] (15.42)
|| y - Ax ||I2 = min x
and
(15.41)
|| x ||2 = min . x
(15.43)
The factor A appears in p levels and the factor B in g levels. If nij denotes the number of observations under the influence of the ith level of the factor A and the jth level of the factor B, then the results of the experiment can be condensed
in Table 15.4. If Di and E y denote the effects of the factors A and B, P the mean of all observations, we receive
P + D i + E j = E{y ijk } for all i {1,..., p}, j {1,..., q}, k {1,..., nij } (15.44) as our model equation. Table 15.4 (level of factors): level of the factor B
1
2
…
q
level of factor A
1
n11
n12
…
n1q
2 … p
n21 … np1
n22 … np2
n2q … npq
If nij = 0 for at least one pair{i, j} , then our experimental design is called incomplete. An experimental design for which nij is equal of all pairs {ij} , is said to be balanced. The data of Table 15.5 describe such a general model of y ijk observations in the ith row (brand of stove) and jth column (make of the pan), P is the mean, Di is the effect of the ith row, E j is the effect of the jth column, and eijk is the error term. Outside the context of rows and columns Di is equivalently the effect due to the ith level of the D factor and E j is the effect due to the jth level of the E factor. In general, we have p levels of the D factor with i = 1,..., p and q levels of the E factor with j = 1,..., q : in our example p = 4 and q = 3 . Table 15.5 (number of seconds beyond 3 minutes, taken to boil 2 quarts of water): Make of Pan number of A B C total mean observations Brand of Stove
X Y Z W
Total number of observations mean
18 — 3 6 27
12 — — 3 15
24 9 15 18 66
3
2
4
9
1 2
7
54 9 18 27 108
3 3 3 3
18 18 18 18
16 12
With balanced data every one of the pq cells in Table 15.5 would have one (or n) observations and n d 1 would be the only symbol needed to describe the number of observations in each cell. In our Table 15.5 some cells have zero observations and some have one. We therefore need nij as the number of observations in
466
15 Special problems of algebraic regression and stochastic estimation
the ith row and jth column. Then all nij = 0 or 1, and the number of observations are the values of q
p
p
q
ni = ¦ nij , n j = ¦ nij , n = ¦¦ nij . j =1
i =1
(15.45)
i =1 j =1
Corresponding totals and means of the observations are shown, too. For the observations in Table 15.5 the linear equations of the model are given as follows, ª18 º ª y11 º ª «12 » « y12 » « « 24 » « y13 » « « 9 » « y23 » « « 3 » = « y31 » = « «15 » « y33 » « « 6 » «y » « « 3 » « y 41 » « «18 » «« y42 »» « ¬ ¼ ¬ 43 ¼ ¬
1 1 1 1 1 1 1 1 1
1 1 1
1
1 1
1 1 1
1 1
1 1 1
º » 1» 1» » 1» » » 1 »¼
ªe º ª P º « e11 » «D1 » « e12 » «D 2 » « 13 » «D 3 » « e23 » «D 4 » + « e31 » , « E1 » « e33 » « E » « e41 » « E 2 » «e42 » ¬ 3 ¼ «e » ¬ 43 ¼
where dots represent zeros. In summary, ª ª18 º « «12 » « « 24 » « «9» «3»=y=« « «15 » « «6» « «3» « «18 » ¬ ¼ ¬ ª1 1 0 0 «1 1 0 0 «1 1 0 0 «1 0 1 0 A = «1 0 0 1 «1 0 0 1 «1 0 0 0 «1 0 0 0 «1 0 0 0 ¬
1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1
1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0
0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0
0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 1 1 1
1 0 0 0 1 0 1 0 0
0 1 0 0 0 0 0 1 0
0º 0» 1» 1» 0» 1» 0» 0» 1 »¼
ªe º ª P º « e11 » «D1 » « e12 » «D 2 » « 13 » «D 3 » « e23 » «D 4 » + « e31 » « E1 » « e33 » « E » « e41 » « E 2 » « e42 » ¬ 3 ¼ «e » ¬ 43 ¼
0º ªP º 0» « D1 » 1» «D 2 » 1» «D » 0 » , x = «D 3 » , O( A) = 9 × 8, O(x) = 8 × 1 4 1» « E1 » » 0 «E » 0» « E2 » ¬ 3¼ 1 »¼
with y being the vector of observations and e y the vector of corresponding error terms. As an inconsistent linear equation y - e y = Ax, O{y} = 9 × 1, O{A} = 9 × 8, O{x} = 8 × 1 we pose the key question: ? What is the rank of the design matrix A ? Most notable, the first column is 1n and the sum of the next 4 columns is also 1n as well as the sum of the remaining 3 columns is 1n , too, namely
c2 + c3 + c4 + c5 = 1n and c6 + c7 + c8 = 1n. The rank rkA = 1 + ( p 1) + ( q 1) = 1 + 3 + 2 = 6 is only six which differs from O{A} = 9 × 8. We have to take advantage of this rank deficiency. For example, we could postulate the condition x5 = 0 and x8 = 0 eliminating two components of the unknown vector. A more reasonable approach would be based on the computation of the symmetric reflexive generalized inverse such that xlm = ( AcA) rs Acy ,
(15.46)
which would guarantee a least square minimum norm solution or a I, I – BLUMBE solution (Best Linear I – Norm Uniformly Minimum Bias I – Norm Estimation) and rk A = rk A cA = rk( A cA) rs = rkA c
(15.47)
rs
A cA is a symmetric matrix ( A cA) is a symmetric matrix or called :rank preserving identity: :symmetry preserving identity:
Table 15.6: 2-way classification without interaction, example: normal equation
ª1 «1 «0 «0 A cA = « «0 «1 «0 «¬ 0
1 1 0 0 0 0 1 0
1 1 0 0 0 0 0 1
1 0 1 0 0 0 0 1
ª9 «3 «1 « = «2 3 «3 «2 «4 ¬
1 0 0 1 0 1 0 0
1 0 0 1 0 0 0 1
3 3 0 0 0 1 1 1
1 0 0 0 1 1 0 0
1 0 0 0 1 0 1 0
1º 0» 0» 0» » 1» 0» 0» 1 »¼
1 0 1 0 0 0 0 1
2 0 0 2 0 1 0 1
3 0 0 0 3 1 1 1
P ª «1 «1 «1 « «1 «1 «1 «1 «1 «1 «¬ 3 1 0 1 1 3 0 0
D1 D 2 D 3 D 4 p 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 n 2 1 0 0 1 0 2 0
4º 1» 1» 1» 1» 0» 0» 4 »¼
E1 E 2 E 3 pº 1 0 0» 0 1 0» 0 0 1» » 0 0 1» 1 0 0» 0 0 1» 1 0 0» 0 1 0» 0 0 1» n »¼
A cA = DE, O{D} = 8 × 6, O{E} = 6 × 8 ª9 «3 «1 D = «2 «3 «3 «¬ 2
3 3 0 0 0 1 1
1 0 1 0 0 0 0
2 0 0 2 0 1 0
3 1 0 1 1 3 0
2º 1» 0» 0» , 1» 0» 2 »¼
E to be determined
DcA cA = DcDE (DcD) 1 DcAcA = E ª133 « 45 « DcD = « 14 29 « 44 «¬ 28
45 14 29 21 4 8 4 3 3 8 3 10 15 3 11 11 2 4
44 28º 15 11 » 3 2» 11 4 » 21 8 » 8 10 »¼
compute (DcD) 1 and (DcD) 1 Dc E = (DcD) 1 DcA cA = ª1.0000 « 0.0000 « 0.0000 =« « 0.0000 « 0.0000 «¬ 0.0000
0.0000 1.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 1.0000 0.0000 0 0
ª 0.0665 -0.0360 « -0.0360 0.3112 « 0.1219 -0.2327 « 0.0166 -0.0923 =« « -0.0360 -0.0222 « 0.0222 -0.0120 « « 0.0748 -0.0822 «¬ -0.0305 0.0582 ª 0.0526 « 0.2632 « 0.0132 « 0.1535 =« «-0.0702 « 0.2675 « «-0.1491 ¬«-0.0658
0.1053 0.1930 0.0263 0.0263 -0.1404 -0.1316 0.3684 -0.1316
0.0000 0.0000 0.0000 1.0000 0.0000 0.0000
( AcA) rs 0.1219 -0.2327 0.8068 -0.2195 -0.2327 0.1240 0.1371 -0.1392
0.0000 0.3333 -0.2500 -0.0833 0.0000 -0.0833 -0.1667 0.2500
1.0000 -1.0000 -1.0000 -1.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 1.0000 0.0000
= Ec(EEc)( DcD) 1 Dc = 0.0166 -0.0360 0.0222 -0.0923 -0.0222 -0.0120 -0.2195 -0.2327 0.1240 0.4208 -0.0923 -0.0778 -0.0923 0.3112 -0.0120 -0.0778 -0.0120 0.2574 0.1020 -0.0822 -0.1417 -0.0076 0.0582 -0.0935
( AcA) rs A = 0.1579 0.1053 -0.2105 -0.1404 0.7895 0.0263 -0.2105 0.3596 -0.2105 -0.1404 0.0526 0.2018 0.0526 0.0351 0.0526 -0.1316
0.0526 -0.0702 -0.2368 0.4298 -0.0702 -0.1491 0.0175 0.1842
0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.0748 -0.0822 0.1371 0.1020 -0.0822 -0.1417 0.3758 -0.1593
0.0526 -0.0702 0.0132 -0.1535 0.2632 0.2675 -0.1491 -0.0658
1.0000 º 0.0000 » 0.0000 » » 0.0000 » -1.0000 » -1.0000 »¼
-0.0305 º 0.0582 » -0.1392 » » -0.0076 » 0.0582 » -0.0935 » » -0.1593 » 0.2223 »¼
0.1053 -0.1404 0.0263 0.0263 0.1930 -0.1316 0.3684 -0.1316
0.0000 º 0.0000» -0.2500 » » -0.0833 » 0.3333 » -0.0833 » » -0.1667 » 0.2500 ¼»
x_lm = (A′A)⁻_rs A′y = [ 5.3684 , 10.8421 , −6.1579 , −1.1579 , 1.8421 , −0.2105 , −4.2105 , 9.7895 ]′ .
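Again this vector can be checked with a Moore-Penrose based sketch; an assumption of this check is that (A′A)⁻_rs A′y and the pseudoinverse solution A⁺y coincide, which holds for the minimum norm least squares solution computed here.

```python
import numpy as np

# rows: (stove, pan) = (X,A),(X,B),(X,C),(Y,C),(Z,A),(Z,C),(W,A),(W,B),(W,C)
# columns: [1, alpha_1..alpha_4, beta_1..beta_3]
A = np.array([
    [1, 1,0,0,0, 1,0,0], [1, 1,0,0,0, 0,1,0], [1, 1,0,0,0, 0,0,1],
    [1, 0,1,0,0, 0,0,1],
    [1, 0,0,1,0, 1,0,0], [1, 0,0,1,0, 0,0,1],
    [1, 0,0,0,1, 1,0,0], [1, 0,0,0,1, 0,1,0], [1, 0,0,0,1, 0,0,1],
], dtype=float)
y = np.array([18, 12, 24, 9, 3, 15, 6, 3, 18], dtype=float)

print(np.linalg.matrix_rank(A))   # 6 = 1 + (p-1) + (q-1)
x_lm = np.linalg.pinv(A) @ y      # minimum norm least squares solution
print(np.round(x_lm, 4))          # [5.3684 10.8421 -6.1579 -1.1579 1.8421 -0.2105 -4.2105 9.7895]
```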
Summary The general formulation of our 2-way classification problem without interaction is generated by identifying the vector of responses as well as the vector of parameters. Table 15.7: 2-way classification without interaction c , ..., y cp1 , y12 c ,..., y cpq 1 y cpq ] y c := [y11 xc := [ P , D1 ,..., D p , E1 ,..., E q ] A := [1n , c2 ,..., c p , c p +1 ,..., cq ] subject to c2 + ... + c p = 1, c p +1 + ... + cq = 1 q
n_i = Σ_{j=1}^{q} n_ij ,   n_j = Σ_{i=1}^{p} n_ij ,   n = Σ_{i=1}^{p} Σ_{j=1}^{q} n_ij
experimental design:
number of observations: n = Σ_{i,j=1}^{p,q} n_ij   (15.48)
number of parameters: 1 + p + q
rank of the design matrix: 1 + (p − 1) + (q − 1) = p + q − 1   (15.49)
:MINOLESS:
||y − Ax||² = min over x   (15.50)   and   ||x||² = min over x   (15.51)
x_lm = (A′A)⁻_rs A′y.

15-23
A third example: 2-way classification with interaction
A two-way classification model with interaction is defined by “MINOLESS” yijk = E{ yijk } + eijk = P + D i + E j + (DE )ij + eijk subject to i {1,… , p}, j {1,… , q}, k {1,… , nij } c ,… , y cp 1 , y cp , y12 c y c22 ,… y cnq 1 y cpq ] y c := [ y11
(15.52)
(15.54)
xc := [ P , D1 ,… , D p , E1 ,… , E q , (DE )11 ,… , (DE ) pq ]
(15.53)
|| y Ax ||2I = min and || x ||2I = min .
(15.55)
x
x
It was been in the second example on 2-way classification without interaction that the effects of different levels of the factors A and B were additive. An alternative model is a model in which the additivity does not hold: the observations are not independent of each other. Such a model is called a model with interaction between the factors whose effect (DE )ij has to be reflected by means of ªi {1,… ,p} P + D i + E j + (DE )ij = E{ yijk } for all « j {1,… ,q} (15.56) « k {1,… ,nij } ¬ like our model equation. As an example we consider by means of Table 15.8 a plant breeder carrying out a series of experiments with three fertilizer treatments on each of four varieties of grain. For each treatment-by-variety combination when he or she plants several 4c × 4c plots. At harvest time she or he finds that many of the plots have been lost due to being wrongly ploughed up and all he or she is left with are the data of Table 15.8. Table 15.8 (weight of grain form 4c × 4c trial plots): Variety Treatment 1 2 3 4 1 8 12 7 13 11 9 30 12 18 2
y11 (n11 ) 6 12 18
3
-
Totals
48
y31 (n13 )
y41 ( n14 )
12 14 26
-
-
9 7
14 16
16 42
30 42
Totals
60
44 10 14 11 13 48 66
94 198
With four of the treatment-in-variety combinations there are no data at all, and with the others there are varying numbers of plots, ranging from 1 to 4 with a total of 18 plots in all. Table 15.8 shows the yield of each plot, the total yields, the number of plots in each total and the corresponding mean, for each treatmentvariety combination having data. Totals, numbers of observations and means are also shown for the three treatments, the four varieties and for all 18 plots. The symbols for the entries in the table, are also shown in terms of the model.
The equations of a suitable linear model for analyzing data of the nature of Table 15.8 is for yijk as the kth observation in the ith treatment and jth variety. In our top table, P is the mean, D i is the effect of the ith treatment, E j is the effect of the jth variety, (DE )ij is the interaction effect for the ith treatment and the jth variety and A ijk is the error term. With balanced data every one of pq cells of our table would have n observations. In addition there would be pq levels of the (DE ) factor, the interaction factor. However, with unbalanced data, when some cells have no observations they are only as many (DE )ij - levels in the data as there are non-empty cells. Let the number of such cells be s (s = 8 in Table 15.8). Then, if nij is the number of observations in the (i, j)th cell of type “treatment i and variety j”, s the number of cells in which nij z 0 , in all other cases nij > 0 . For these cells nij
yij = ¦ yijk , yij = yij / nij
(15.57)
k =1
is the total yield in the (i, j)th cell, and yij is the corresponding mean. Similarly, p
q
p ,q
i =1
j =1
i , j =1
y = ¦ yi = ¦ y j =
¦
p , q , nij
yij =
¦
yijk
(15.58)
i =1, j =1, k =1
is the total yield for all plots, the number of observations called “plots” therein being p
q
p,q
n = ¦ ni = ¦ n j = ¦ nij . i =1
j =1
(15.59)
i, j
We shall continue with the corresponding normal equations being derived from the observational equations. (DE )
P D1 D 2 D 3 E1 E 2 E 3 E 4 11 13 14 21 22 31 33 34 ª e111 º ª 8 º ª y111 º ª1 1 1 1 º y « » « e112 » P ª º «1 1 1 1 » «13» 112 « 9 » « y113 » «1 1 1 1 » « D1 » « e113 » «12 » « y131 » «1 1 1 » « D 2 » « e131 » « 7 » « y141 » «1 1 1 » « D 3 » « e141 » «11» « y142 » «1 1 1 » « E1 » « e142 » « 6 » « y211 » «1 1 1 1 » « E 2 » « e211 » «12 » « y212 » «1 1 1 1 » « E 3 » «e212 » «12 » « y » «1 1 1 1 » « E » « e » «14 » = « y 221 » = «1 1 1 1 » « (DE4) » + «e221 » « 9 » « y222 » «1 1 1 1 » « (DE )11 » « e222 » « 7 » « y321 » «1 1 1 1 » « (DE )13 » « e321 » 14 » « 322 » » « « » « 322 » « 14 1 1 1 1 DE ( ) y « » « e331 » 21 » 31 3 « « » « » «16 » « y332 » «1 1 1 1 » « (DE ) 22 » « e332 » «10 » « y341 » «1 1 1 1» « (DE )32 » « e341 » «14 » « y342 » «1 1 1 1» « (DE )33 » « e342 » «11» « y343 » «1 1 1 1» «¬ (DE )34 »¼ « e343 » ««¬ e344 ¼»» ¬«13¼» ««¬ y344 »»¼ ¬«1 1 1 1¼»
where the dots represent zeros. ª18 «6 «4 «8 «5 «4 «3 «6 «3 «1 «2 « «2 «2 «2 «2 ¬« 4
6 6 3 1 2 3 1 2
4 4 2 2 2 2
8 8 2 2 4 2 2 4
5 3 2 5 3 2
4 2 2 4 2 2
3 1 2 3 1 2
6 2 4 6 2 4
3 3 3 3
1 1 1 1
2 2 2 2
2 2 2 2
2 2 2 2
2 2 2 2
2 2 2 2
4 º ª P º ª y º ª198º » « D1 » « y1 » « 60 » » « D 2 » « y2 » « 44 » 4 » « D 3 » « y3 » « 94 » » « E1 » « y1 » « 48 » » « E 2 » « y2 » « 42 » » «« E 3 »» «« y3 »» « 42 » 4 » « E 4 » = « y4 » = « 66 » ~ AcAx = Acy. A » « (DE )11 » « y11 » « 30 » » « (DE )13 » « y13 » « 12 » »» « (DE )14 » « y14 » «« 18 »» » « (DE ) 21 » « y21 » « 18 » » «(DE ) 22 » « y22 » « 26 » » « (DE )32 » « y32 » « 16 » » « (DE )33 » « y33 » « 30 » 4 ¼» «¬ (DE )34 »¼ «¬ y34 »¼ ¬« 48 ¼»
Now we again pose the key question: ?What is the rank of the design matrix A? The first column is 1n and the sum of other columns is c2 + c3 + c4 = 1n and c5 + c6 + c7 + c8 = 1n . How to handle the remaining sum (DE )... of our incomplete model? Obviously, we experience rk[c9 ,… , c16 ] = 8 , namely rk[(DE )ij ] = 8 for (DE )ij {J 11 ,… , J 34 }.
(15.60)
As a summary, we have computed rk( AcA) = 8 , a surprise for our special case. A more reasonable approach would be based on the computation of the symmetric reflexive generalized inverse such that (15.61) x Am = ( A cA ) rs A cy , which would assure a minimum norm, least squares solution or a I, I – BLUMBE solution (Best Linear I – Norm Uniformly Minimum Bias I – Norm Estimation ) and rkA = rkA cA = rk(A cA ) rs = rkA +
(15.62)
rs
A cA is a symmetric matrix ( A cA ) is a symmetric matrix or called :rank preserving identity: !symmetry preserving identity! Table 15.9 summarizes all the details of 2-way classification with interaction. In general, for complete models our table lists the general number of parameters and the rank of the design matrix which differs from our incomplete design model.
Table 15.9: 2-way classification with interaction c ,… , y cp1 , y12 c ,… , y cpq 1 , y cpq ] y c := [ y11 x c := [ P , D1 ,… , D p , E1 ,… , E q , (DE )11 ,… , (DE ) pq ] A := [1n , c1 ,… , c p , c p +1 ,… , cq , c11 ,… , c pq ] subject to c2 + … + c p = 1 , c p +1 + … + cq = 1,
p,q
¦c
i , j =1
p
q
p ,q
i =1
j =1
i, j
ij
= ( p 1)(q 1)
n = ¦ ni = ¦ n j = ¦ ni , j experimental design: number of observations p,q
n = ¦ ni , j
number of parameters:
1 + ( p 1) + (q 1) + (15.63) + ( p 1)(q 1)
1 + p + q + pq
i, j
(15.64)
rank of the design matrix:
|| y Ax ||2 = min and || x ||2 = min x
(15.65)
x
x Am = ( A cA ) rs A cy .
(15.66)
For our key example we get from the symmetric normal equation A cAx A = A cy the solution x Am = ( A cA ) rs A cy given A cA and A cy O{A cA} = 16 × 16, O{P ,… , (DE )31 } = 16 × 1, O{A cy} = 16 × 1 A cA = DE, O{D} = 18 × 12, O{E} = 12 × 16 ª3 «3 «0 «0 «3 «0 «0 « D = «0 3 «0 «0 «0 «0 «0 «0 «0 ¬«
1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0
2 2 0 0 0 0 0 2 0 0 2 0 0 0 0 0
2 0 2 0 2 0 0 0 0 0 0 2 0 0 0 0
2 0 2 0 0 2 0 0 0 0 0 0 2 0 0 0
2 0 0 2 0 2 0 0 0 0 0 0 0 2 0 0
2 0 0 2 0 0 2 0 0 0 0 0 0 0 2 0
4º 0» 0» 4» 0» 0» 0» 4» 0» 0» 0» 0» 0» 0» 0» 4 »¼»
474
15 Special problems of algebraic regression and stochastic estimation
ª1.0000 «1.0000 « «1.0000 «1.0000 «1.0000 «1.0000 « «1.0000 «¬1.0000
DcA cA = DcDE ( DcD) 1 DcA cA = E E = (DcD) 1 DcA cA = 1.0000 0.0000 0.0000 1.0000 0.0000 0.0000 º 1.0000 0.0000 0.0000 0.0000 0.0000 1.0000 » » 0.0000 » 1.0000 0.0000 0.0000 0.0000 0 0.0000 1.0000 0.0000 1.0000 0.0000 0.0000 » 0 1.0000 0.0000 0 1.0000 0.0000 » » 0 0 1.0000 0 1.0000 0 » 0.0000 0.0000 1.0000 0 0.0000 1.0000 » »¼ 0 0.0000 1.0000 0 0.0000 0
x Am = ( A cA) rs A cy = = [6.4602, 1.5543, 2.4425, 2.4634, 0.6943, 1.0579, 3.3540, 1.3540, 1.2912, 0.6315, -0.3685, -0.5969, 3.0394, -1.9815, 2.7224,1.72245]c . 15-24
Higher classifications with interaction
If we generalize 1-way and 2-way classifications with interactions we arrive at a higher classification of type P + Di + E j + J k + … + +(DE )ij + (DJ )ik + (DE ) jk + … + (DEJ )ijk + … = E{y ijk …A }
(15.67)
for all i {1,… , p}, j {1,… , q}, k {1,… , r},… , A {1,… , nijk ,…}. An alternative stochastic model assumes a fully occupied variance – covariance matrix of the observations, namely D{y} E{[ y E{y}][ y E{y}]c} Ȉij . Variance – covariance estimation techniques are to be applied. In addition, a mixed model for the effects are be applied, for instance of type E{y} = Aȟ + &E{]} D{y} = CD{z}Cc.
(15.68) (15.69)
Here we conclude with a discussion of what is unbiased estimable: Example: 1–way classification If we depart from the model E{y ij } = P + D i and note rkA = rkA cA = 1 + ( p 1) for i {1,… , p}, namely rkA = p, we realize that the 1 + p parameters are not unbiased estimable: The first column results from the sum of the other columns. It is obvious the difference Di D1 is unbiased estimable. This difference produces a column matrix with full rank. Summary Di D1 quantities are unbiased estimable
475
15-2 n-way classification models
Example: 2–way classification with interaction Our first statement relates to the unbiased estimability of the terms D1 ,… , D p and E1 ,… , E q : obviously, the differences Di D1 and E j E1 for i, j < 1 are unbiased estimable. The first column is the sum of the other terms. For instance, the second column can be eliminated which is equivalent to estimating Di D1 in order to obtain a design matrix of full column rank. The same effect can be seen with the other effect E j for the properly chosen design matrix: E j E1 for all j > 1 is unbiased estimable! If we add the pq effect (DE )ij of interactions, only those interactions increase the rank of the design matrix by one respectively, which refer to the differences Di D1 and E j E1 , altogether ( p 1)( q 1) interactions. To the effect (DE )ij of the interactions are estimable, pq ( p 1)( q 1) = p + q 1 constants may be added, that is to the interactions (DE )i1 , (DE )i 2 ,… , (DE )iq with
i {1,… , p}
the constants '(DE1 ), '(DE 2 ),…, '(DE q ) and to the interactions (DE ) 2 j , (DE )3 j ,… , (DE ) pj with j {1,… , q} the constants '(DE 2 ), '(DE 3 ),… , '(DE p ). The constants '(DE1 ) need not to be added which can be interpreted by '(DE1 ) = '(DE1 ). A numerical example is p = 2, q = 2 xc = [ P , D1 , D 2, E1 , E 2 , (DE )11 , (DE )12 , (DE ) 22 ]. Summary 'D = D 2 D1 , 'E = E 2 E1 for all i {1, 2}, j {1, 2} as well as '(DE1 ), '(DE 2 ), '(DE 2 ) are unbiased estimable.
(15.70) (15.71)
476
15 Special problems of algebraic regression and stochastic estimation
At the end we review the number of parameters and the rank of the design matrix for a 3–way classification with interactions according to the following example. 3–way classification with interactions experimental design: number of number of observations parameters:
n=
p,q,r
¦
i , j , k =1
nijk
rank of the design matrix:
1 + ( p 1) + (q 1) + (r 1) 1+ p + q + r + +( p 1)(q 1) + (15.72) + pq + pr + qr + +( p 1)(r 1) + (q 1)(r 1) + + pqr ( p 1)(q 1)(r 1) = pqr
15-3 Dynamical Systems There are two essential items in the analysis of dynamical systems: First, there exists a “linear or liniarized observational equation y (t ) = Cz(t ) ” connecting a vector of stochastic observations y to a stochastic vector z of so called “state variables”. Second, the other essential is the characteristic differential equation of type “ zc(t ) = F (t , z(t )) ”, especially linearizied “ zc(t ) = Az(t ) ”, which maps the first derivative of the “state variable” to the “state variable” its off. Both, y (t ) and z(t ) are functions of a parameter, called “time t”. The second equation describes the time development of the dynamical system. An alternative formulation of the dynamical system equation is “ z(t ) = Az(t 1) ”. Due to the random nature “of the two functions “ y (t ) = Cz(t ) ” and zc = Az ” the complete equations read (15.73) E{y (t )} = CE{z} (15.76) E{zc(t )} = AE{z(t )}
and
and
V{e y (t1 ), e y (t2 )} = Ȉ y (t1 , t2 )
(15.74)
D{e y (t )} = Ȉ y (t ),
(15.75)
D{e z (t )} = Ȉ z (t ), V{e z (t1 ), e z (t2 )} = Ȉz (t1 , t2 )
(15.77) (15.78)
Here we only introduce “the time invariant system equations” characterized by A (t ) = A. zc(t ) abbreviates the functional dz(t ) / dt. There may be the case that the variance-covariance functions Ȉ y (t ) and Ȉ z (t ) do not change in time: Ȉ y (t ) = Ȉ y , Ȉ z (t ) = Ȉ z equal a constant. Various models exist for the variancecovariance functions Ȉ y (t1 , t2 ) and Ȉ z (t1 , t2 ) , e.q. linear functions as in the case of a Gauss process or a Brown process Ȉ y (t2 t1 ) , Ȉz (t2 t1 ) . The analysis of dynamic system theory was initiated by R. E. Kalman (1960) and by R. E. Kalman and R. S. Bucy (1961): “KF” stand for “Kalman filtering”. Example 1 (tracking a satellite orbit):
477
15-3 Dynamical Systems
Tracking a satellite’s orbit around the Earth might be based on the unknown state vector z(t ) being a function of the position and the speed of the satellite at time t with respect to a spherical coordinate system with origin at the mass center of the Earth. Position and speed of a satellite can be measured by GPS, for instance. If distances and accompanying angles are measured, they establish the observation y (t ) . The principles of space-time geometry, namely mapping y (t ) into z(t ) , would be incorporated in the matrix C while e y (t ) would reflect the measurement errors at the time instant t. The matrix A reflects the situation how position and speed change in time according the physical lows governing orbiting bodies, while ez would allow for deviation from the lows owing to factors as nonuniformity of the Earth gravity field. Example 2 (statistical quality control): Here the observation vector y (t ) is a simple approximately normal transformation of the number of derivatives observed in a sample obtained at time t, while y1 (t ) and y2 (t ) represent respectively the refractive index of the process and the drift of the index. We have the observation equation and the system equations z (t ) = z2 (t ) + ez1 y (t ) = z1 (t ) + ey (t1 ) and 1 z2 (t ) = z2 (t 1) + ez 2 . In vector notation, this system of equation becomes z(t ) = Az(t 1) + e z namely ª z (t ) º ª1 1º ª ez º ª 0 1º z(t ) = « 1 » , e z = « « », A = « » » ¬ 0 1¼ «¬ ez »¼ ¬ 0 1¼ ¬ z2 ( t ) ¼ 1
2
do not change in time. If we examine y (t ) y (t 1) for this model, we observe that under the assumption of constant variance, namely e y (t ) = e y and e z (t ) = e z , the autocorrelation structure of the difference is identical to that of an ARIMA (0,1,1) process. Although such a correspondence is sometimes easily discernible, we should in general not consider the two approaches to be equivalent. A stochastic process is called an ARMA process of the order ( p, q ) if z (t ) = a1z (t 1) + a2 z (t 2) + … + a p z (t p ) = = b0u(t ) + b1u(t 1) + b2u(t 2) + … + bq u(t q ) for all t {1,… , T } also called a mixed autoregressive/moving - average process.
(15.79)
478
15 Special problems of algebraic regression and stochastic estimation
In practice, most time series are non-stationary. In order to fit a stationary model, it is necessary to remove non-stationary sources of variation. If the observed time series is non-stationary in the mean, then we can use the difference of the series. Differencing is widely used in all scientific disciplines. If z (t ), t {1,… , T } , is replaced by d z (t ) , then we have a model capable of describing certain types of non-stationary signals. Such a model is called an “integrated model” because the stationary model that fits to the difference data has to the summed or “integrated” to provide a model for the original non-stationary data. Writing W (t ) = d z (t ) = (1 B ) d z (t )
(15.80)
for all t {1,… , T } the general autoregressive integrated moving-average (ARIMA) process is of the form ARIMA W (t ) = D1W (t 1) + … + D pW (t p ) + b0 u(t ) + … + bq u(t q) (15.81) or ĭ( B )W(t ) = ī( B )
(15.82)
ĭ( B)(1 B) d z (t ) = ī( B)u (t ).
(15.83)
Thus we have an ARMA ( p, q ) model for W (t ) , t {1,… , T } , while the model for W (t ) describing the dth differences for z (t ) is said to be an ARIMA process of order ( p, d , q) . For our case, ARIMA (0,1,1) means a specific process p = 1, = 1, q = 1 . The model for z (t ), t {1,… , T } , is clearly nonstationary, as the AR operators ĭ( B )(1 q) d has d roots on the unit circle since patting B = 1 makes the AR operator equal to zero. In practice, first differencing is often found to the adequate to make a series stationary, and accordingly the value of d is often taken to be one. Note the random part could be considered as an ARIMA (0,1,0) process. It is a special problem of time-series analysis that the error variances are generally not known a priori. This can be dealt with by guessing, and then updating then is an appropriate way, or, alternatively, by estimating then forming a set of data over a suitable fit period. In the state space modeling, the prime objective is to predict the signal in the presence of noise. In other words, we want to estimate the m × 1 state vector E{z(t )} which cannot usually be observed directly. The Kalman filter provides a general method for doing this. It consists of a set of equations that allow us to update the estimate of E{z(t )} when a new observation becomes available. We will outline this updating procedure with two stages, called
479
15-3 Dynamical Systems
Ɣ Ɣ
the prediction stage and the updating stage.
Suppose we have observed a univariate time series up to time (t 1) , and that E{z (t 1)} is “the best estimator” E{z (t 1)} based on information up to this time. For instance, “best” is defined as an PLUUP estimator. Note that z (t ), z (t 1) etc is a random variable. Further suppose that we have evaluated the m × m variance-covariance matrix of E{zn (t 1)} which we denote by P{t 1}. The first stage called the prediction stage is concerned with forecasting E{z (t )} from data up to the time (t 1) , and we denote the resulting estimator in a obvious notation by E{zn (t ) | z (t 1)} . Considering the state equations where D{ez (t 1)} is still unknown at time (t 1), the obvious estimator for E{z (t )} is given by E{z(t ) | zˆ (t 1)} = G (t ) E{zn (t 1)}
(15.84)
and the variance-covariance matrix V{t | t 1} = G (t ) V{t 1}G c + W {t}
(15.85)
called prediction equations. The last equations follows from the standard results on variance-covariance matrices for random vector variables. When the new observation at time t, namely when y (t ) has been observed the estimator for E{z(t )} can be modified to take account of the extra information. At time (t 1) , the best forecast of y (t ) is given by hc E{z(t ) | zˆ (t 1)} so that the prediction error is given by (15.86) eˆy (t ) = y (t ) hc(t ) E{z (t ) | z (t 1)}. This quantity can be used to update the estimate of E{z (t )} and its variancecovariance matrix. E{zˆ (t )} = E{z(t ) | zˆ (t 1)} + K (t )eˆ y
(15.87)
V{t} = V{t | t 1} K (t )hc(t ) V{t | t 1}
(15.88)
V{t} = V{t , t 1}hc(t ) /[hc(t ) V{t | t 1}h + V n2 ] .
(15.89)
V{t} is called the gain matrix. In the univariate case, K (t ) is just a vector of size ( m 1) . The previous equations constitute the second updating stage of the Kalman filter, thus they are called the updating equations. A major practical advantage of the Kalman filter is that the calculations are recursive so that, although the current estimates are based on the whole past history of measurements, there is no need for an ever expanding memory. Rather the near estimate of the signal is based solely on the previous estimate and the latest observations. A second advantage of the Kalman filter is that it converges fairly
480
15 Special problems of algebraic regression and stochastic estimation
quickly when there is a constant underlying model, but can also follow the movement of a system where the underlying model is evolving through time. For special cases, there exist much simpler equations. An example is to consider the random walk plus noise model where the state vector z(t ) consist of one state variable, the current level ȝ(t ) . It can be shown that the Kalman filter for this model in the steady state case for t o f reduces the simple recurrence relation ˆ t) , ˆ t ) = ȝ( ˆ t 1) + D e( ȝ( where the smoothing constant D is a complicated function of the signal-to-noise ratio ı 2w ı n2 . Our equation is simple exponential smoothing. When ı 2w tends to zero, ȝ(t ) is a constant and we find that D o 0 would intuitively be expected, while as ı 2w ı n2 becomes large, then D approaches unity. For a multivariate time series approach we may start from the vector-valued equation of type E{y (t )} = CE{z} ,
(15.90)
where C is a known nonsingular m × m matrix. By LESS we are able to predict E{zn (t )} = C1E{n y (t )} .
(15.91)
Once a model has been put into the state-space form, the Kalman filter can be used to provide estimates of the signal, and they in turn lead to algorithms for various other calculations, such as making prediction and handling missing values. For instance, forecasts may be obtained from the state-space model using the latest estimates of the state vector. Given data to time N, the best estimate of the state vector is written E{zn ( N )} and the h-step-ahead forecast is given by E{yn ( N )} = hc( N + h) E{zn ( N + h)} = = h( N + h)G ( N + h)G{N + h 1}… G ( N + 1) E{zn ( N )}
(15.92)
where we assume h ( N + h) and future values of G (t ) are known. Of course, if G (t ) is a constant, say G, then we get E{yn ( N | h)} = hc( N + h)G h E{zn ( N )}.
(15.93)
If future values of h(t ) or G(t ) are not known, then they must themselves be forecasted or otherwise guessed. Up to this day a lot of research has been done on nonlinear models in prediction theory relating to state-vectors and observational equations. There are excellent reviews, for instance by P. H. Frances (1988), C. W. J. Granger and P. Newbold
(1986), A. C. Harvey (1993), M. B. Priestley (1981, 1988) and H. Tong (1990). C. W. J. Granger and T. Teräsvirta (1993) is a more advanced text.
In terms of dynamical system theory we regularly meet the problem that the observational equation is not of full column rank. A state variable establishes a relation between the system input and output, in particular a statement on how the system develops in time. Very often it is reasonable to switch from the state variable in one reference system to another one with special properties. Let T this time be a similarity transformation, described by a non-singular matrix, of type
z = T z̃,  z̃ = T⁻¹ z  (15.94)
(d/dt) z̃ = T⁻¹AT z̃ + T⁻¹B u(t),  z̃(0) = T⁻¹ z₀  (15.95)
y(t) = CT z̃(t) + D u(t).  (15.96)
The key question is now whether for the characteristic state equation there exists a transformation matrix such that for specific matrices A and B there is an integer number r, 0 ≤ r < n, with
Ã = [ Ã11  Ã12 ; 0  Ã22 ],  O{Ã} = [ r×r  r×(n-r) ; (n-r)×r  (n-r)×(n-r) ],
B̃ = [ B̃1 ; 0 ],  O{B̃} = [ r×q ; (n-r)×q ].
In this case the state equation separates into two distinct parts:
(d/dt) z̃1(t) = Ã11 z̃1(t) + Ã12 z̃2(t) + B̃1 u(t),  z̃1(0) = z̃10  (15.97)
(d/dt) z̃2(t) = Ã22 z̃2(t),  z̃2(0) = z̃20.  (15.98)
The last n r elements of z cannot be influenced in its time development. Influence is restricted to the initial conditions and to the eigen dynamics of the partial system 2 (characterized by the matrix A 22 ). The state of the whole system cannot be influenced completely by the artificially given point of the state space. Accordingly, the state differential equation in terms of the matrix pair (A, B) is not steerable. Example 3: (steerable state): If we apply the dynamic matrix A and the introductory matrix of a state model of type
ª 4 « A=« 3 1 « «¬ 3
2º 3», B = ª 1 º , « 0.5» 5» ¬ ¼ » 3 »¼
we are led to the alternative matrices after using the similarity transformation ª 1 A = « ¬ 0
1º ª1º , B =« ». 2 »¼ ¬ 0¼
If the initial state is located along the z1 -axis, for instance z20 = 0 , then the state vector remains all times along this axis. It is only possible to move this axis along a straight line “up and down”.
In case there exists no such similarity transformation we call the state matrices (A, B) steerable. Steerability of a state differential equation may be tested by
Lemma 15.7 (Steerability): The pair (A, B) is steerable if and only if
rk[B, AB, …, A^{n-1}B] = rk F(A, B) = n.  (15.99)
F(A, B) is called the matrix of steerability. If its rank r < n, then there exists a transformation T such that Ã = T⁻¹AT and B̃ = T⁻¹B have the form
Ã = [ Ã11  Ã12 ; 0  Ã22 ],  (15.100)
B̃ = [ B̃1 ; 0 ],  (15.101)
and (Ã11, B̃1) is steerable.
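Numerically, the criterion (15.99) can be checked directly. The sketch below builds F(A, B) with NumPy; the 2 × 2 pair used here is an assumed illustration chosen so that the rank drops, not a quotation of Example 3.

import numpy as np

def steerability_matrix(A, B):
    # F(A, B) = [B, AB, ..., A^(n-1) B] of Lemma 15.7
    n = A.shape[0]
    return np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n)])

A = np.array([[-4.0/3.0,  2.0/3.0],
              [ 1.0/3.0, -5.0/3.0]])      # assumed stable dynamic matrix
B = np.array([[1.0],
              [0.5]])                     # assumed input matrix
F = steerability_matrix(A, B)
print(np.linalg.matrix_rank(F))           # 1 < n = 2: this pair (A, B) is not steerable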
Alternatively we could search for a transformation matrix T that transforms the dynamic matrix and the exit matrix of a state model to the form
Ã = [ Ã11  0 ; Ã21  Ã22 ],  O{Ã} = [ r×r  r×(n-r) ; (n-r)×r  (n-r)×(n-r) ],
C̃ = [C̃1, 0],  O{C̃} = [p×r, p×(n-r)].
In this case the state equation and the observational equation read
(d/dt) z̃1(t) = Ã11 z̃1(t) + B̃1 u(t),  z̃1(0) = z̃10  (15.102)
(d/dt) z̃2(t) = Ã21 z̃1(t) + Ã22 z̃2(t) + B̃2 u(t),  z̃2(0) = z̃20  (15.103)
y(t) = C̃1 z̃1(t) + D u(t).  (15.104)
The last n-r elements of the vector z̃ are not used in the exit variable y. Since they have no effect on z̃1, the exit vector y contains no information on this component of the state vector. This state moves in the (n-r)-dimensional subspace of ℝⁿ without any change in the exit variables. Our model (C, A) is in this case called non-observable.
Example 4 (observability): If the exit matrix and the dynamic matrix of a state model are characterized by the matrices
C = [4  2],  A = [ 0  1 ; -2  -3 ],
an application of the transformation matrix T leads to the matrices
C̃ = [1, 0],  Ã = [ -1  0 ; 1  -2 ].
An arbitrary motion of the state in the direction of the z̃2 axis has no influence on the exit variable. If there exists no such transformation T, we call the pair (C, A) observable. A rank study helps again!
Lemma 15.8 (Observability test): The pair (C, A) is observable if and only if
rk[ C ; CA ; … ; CA^{n-1} ] = rk G(C, A) = n.  (15.105)
G(C, A) is called the observability matrix. If its rank r < n, then there exists a transformation matrix T such that Ã = T⁻¹AT and C̃ = CT are of the form
Ã = [ Ã11  0 ; Ã21  Ã22 ],  O{Ã} = [ r×r  r×(n-r) ; (n-r)×r  (n-r)×(n-r) ],  (15.106)
C̃ = [C̃1, 0],  O{C̃} = [p×r, p×(n-r)]  (15.107)
and (C̃1, Ã11) is observable. With Lemma 15.7 and Lemma 15.8 we can only state whether a state model is steerable or observable or not, or which dimension a partial system classified as non-steerable or non-observable has. In order to determine which part of a system is non-steerable or non-observable - which eigenmotion is not
excited or not observed - we have to be able to construct proper transformation matrices T. A tool is the PBH test, which we do not analyze here. Both the state differential equation and the observational equation can easily be Laplace transformed. We only need the relations between the input, output and state variables via polynomial matrices. If the initial conditions z₀ vanish, we get the Laplace transformed characteristic equations
(sIₙ - A) z(s) = B u(s)
(15.108)
y(s) = C z(s) + D u(s).  (15.109)
For details we recommend the reference list. We only refer to solving both the state differential equation and the observational equation: eliminating the state vector z(s) leads us to the algebraic relation between u(s) and y(s):
G(s) = C (sIₙ - A)⁻¹ B + D  (15.110)
or
G(s) = [C̃1, C̃2] ( s [ I_r  0 ; 0  I_{n-r} ] - [ Ã11  Ã12 ; 0  Ã22 ] )⁻¹ [ B̃1 ; 0 ] + D
= C̃1 (sI_r - Ã11)⁻¹ B̃1 + D.  (15.111)
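The rank test of Lemma 15.8 and the transfer function (15.110) can be checked numerically. The sketch below uses a pair in the block-triangular form of (15.106)-(15.107); the numbers are illustrative assumptions in the spirit of Example 4.

import numpy as np

def observability_matrix(C, A):
    # G(C, A) = [C; CA; ...; CA^(n-1)] of Lemma 15.8
    n = A.shape[0]
    return np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])

def transfer_function(C, A, B, D, s):
    # G(s) = C (sI - A)^(-1) B + D of (15.110), evaluated at one point s
    n = A.shape[0]
    return C @ np.linalg.solve(s * np.eye(n) - A, B) + D

C = np.array([[1.0, 0.0]])                   # C~ = [C~1, 0]
A = np.array([[-1.0, 0.0],
              [ 1.0, -2.0]])                 # A~ lower block triangular
B = np.array([[1.0],
              [0.0]])
D = np.zeros((1, 1))
print(np.linalg.matrix_rank(observability_matrix(C, A)))   # 1 < n = 2: not observable
print(transfer_function(C, A, B, D, s=1.0))                 # equals C~1 (s - A~11)^(-1) B~1 = 0.5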
Recently, the topic of chaos has attracted much attention. Chaotic behavior arises from certain types of nonlinear models, and a loose definition is: apparently random behavior that is generated by a purely deterministic, nonlinear system. Refer to the contributions of K. S. Chan and H. Tong (2001), J. Gleick (1987), V. Isham (1983), H. Kantz and T. Schreiber (1997).
Appendix A: Matrix Algebra
As two-dimensional arrays we define quadratic and rectangular matrices. First, we review matrix algebra with respect to two inner relations and one external relation, namely multiplication of a matrix by a scalar, addition of matrices of the same order, and matrix multiplication of type Cayley, Kronecker-Zehfuss, Khatri-Rao and Hadamard. Second, we introduce special matrices of type symmetric, antisymmetric, diagonal, unity, null, idempotent, normal, orthogonal, orthonormal (special facts on representing a 2×2 orthonormal matrix, a general n×n orthonormal matrix, the Helmert representation of an orthonormal matrix with examples, special facts about the representation of a Hankel matrix with examples, the definition of a Vandermonde matrix), the permutation matrix, and the commutation matrix. Third, we treat scalar measures like rank, determinant, trace and norm; in detail, we review the Inverse Partitioned Matrix /IPM/ and the Cayley inverse of the sum of two matrices, and we summarize the notion of a division algebra. Fourth, a special paragraph is devoted to vector-valued matrix forms like vec, vech and veck. Fifth, we introduce the notion of eigenvalue-eigenvector decomposition (analysis versus synthesis) and the singular value decomposition. Sixth, we give details of generalized inverses, namely g-inverse, reflexive g-inverse, reflexive symmetric g-inverse, pseudoinverse, Zlobec formula, Bjerhammar formula, rank factorization, left and right inverse, projections, bordering, singular value representation and the theory of solving linear equations.
A1 Matrix-Algebra
A matrix is a rectangular or a quadratic array of numbers,
A := [a_ij] = [ a11 a12 … a1m ; a21 a22 … a2m ; … ; an1 an2 … anm ],  a_ij ∈ ℝ, [a_ij] ∈ ℝ^{n×m}.
The format or "order" of A is given by the number n of rows and the number m of columns, O(A) := n × m.
Fact: Two matrices are identical if they have identical format and if at each place (i, j) the numbers are identical, namely
A = B ⇔ a_ij = b_ij for all i ∈ {1, …, n}, j ∈ {1, …, m}.
Beside the identity of two matrices, the transpose of an n × m matrix A = [a_ij] is the m × n matrix A' = [a_ji] whose (i, j) element is a_ji.
Fact: (A')' = A.
A matrix algebra is defined by the following operations:
• multiplication of a matrix by a scalar (external relation)
• addition of two matrices of the same order (internal relation)
• multiplication of two matrices (internal relation).
Definition (matrix additions and multiplications):
(1) Multiplication by a scalar: A = [a_ij], α ∈ ℝ ⇒ αA = Aα = [α a_ij].
(2) Addition of two matrices of the same order: A = [a_ij], B = [b_ij] ⇒ A + B := [a_ij + b_ij],
A + B = B + A (commutativity)
(A + B) + C = A + (B + C) (associativity)
A - B = A + (-1)B (inverse addition).
Compatibility:
(α + β)A = αA + βA,  α(A + B) = αA + αB (distributivity)
(A + B)' = A' + B'.
(3) Multiplication of matrices
3(i) "Cayley product" ("matrix product"): A = [a_ij], O(A) = n × l; B = [b_ij], O(B) = l × m;
C := A·B = [c_ij],  c_ij := Σ_{k=1}^{l} a_ik b_kj,  O(C) = n × m.
3(ii) "Kronecker-Zehfuss product": A = [a_ij], O(A) = n × m; B = [b_ij], O(B) = k × l;
C := B ⊗ A = [c_ij],  B ⊗ A := [b_ij A],  O(C) = O(B ⊗ A) = kn × lm.
3(iii) "Khatri-Rao product" (of two rectangular matrices with the same number of columns): A = [a1, …, am], O(A) = n × m; B = [b1, …, bm], O(B) = k × m;
C := B ⊙ A := [b1 ⊗ a1, …, bm ⊗ am],  O(C) = kn × m.
3(iv) "Hadamard product" (of two rectangular matrices of the same order; elementwise product): G = [g_ij], O(G) = n × m; H = [h_ij], O(H) = n × m;
K := G ∗ H = [k_ij],  k_ij := g_ij h_ij,  O(K) = n × m.
The existence of the product A·B does not imply the existence of the product B·A. If both products exist, they are in general not equal. Two quadratic matrices A and B for which A·B = B·A holds are called commutative.
Laws
(i) (A·B)·C = A·(B·C),  A·(B + C) = A·B + A·C,  (A + B)·C = A·C + B·C,  (A·B)' = B'·A'.
(ii) (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C) = A ⊗ B ⊗ C,
(A + B) ⊗ C = (A ⊗ C) + (B ⊗ C),
A ⊗ (B + C) = (A ⊗ B) + (A ⊗ C),
(A ⊗ B)·(C ⊗ D) = (A·C) ⊗ (B·D),
(A ⊗ B)' = A' ⊗ B'.
(iii) (A ⊙ B) ⊙ C = A ⊙ (B ⊙ C) = A ⊙ B ⊙ C,
(A + B) ⊙ C = (A ⊙ C) + (B ⊙ C),
A ⊙ (B + C) = (A ⊙ B) + (A ⊙ C),
(A·C) ⊙ (B·D) = (A ⊗ B)·(C ⊙ D),
A ⊙ (B·D) = (A ⊙ B)·D, if d_ij = 0 for i ≠ j.
The transposed Khatri-Rao product generates a row-wise product which we do not follow here.
(iv) A ∗ B = B ∗ A,
(A ∗ B) ∗ C = A ∗ (B ∗ C) = A ∗ B ∗ C,
(A + B) ∗ C = (A ∗ C) + (B ∗ C),
(A1·B1·C1) ∗ (A2·B2·C2) = (A1 ⊙ A2)'·(B1 ⊗ B2)·(C1 ⊙ C2),
(D·A) ∗ (B·D) = D·(A ∗ B)·D, if d_ij = 0 for i ≠ j,
(A ∗ B)' = A' ∗ B'.
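A short NumPy sketch of the four products may help; the matrices are arbitrary assumed examples, and the column-wise helper khatri_rao is my own naming, not a library routine.

import numpy as np

def khatri_rao(B, A):
    # column-wise Kronecker product B (.) A = [b_1 (x) a_1, ..., b_m (x) a_m]
    return np.vstack([np.kron(B[:, j], A[:, j]) for j in range(A.shape[1])]).T

A = np.arange(1.0, 7.0).reshape(2, 3)       # O(A) = 2 x 3
B = np.arange(1.0, 13.0).reshape(4, 3)      # O(B) = 4 x 3
G = np.array([[1.0, 2.0], [3.0, 4.0]])
H = np.array([[5.0, 6.0], [7.0, 8.0]])

C_cayley = G @ H                            # Cayley product
C_kron = np.kron(B, A)                      # Kronecker-Zehfuss product, order 8 x 9
C_kr = khatri_rao(B, A)                     # Khatri-Rao product, order 8 x 3
C_hadamard = G * H                          # Hadamard product (elementwise)

# check one law of (ii): (A1 (x) B1)(A2 (x) B2) = (A1 A2) (x) (B1 B2)
A1, A2 = np.random.rand(2, 3), np.random.rand(3, 2)
B1, B2 = np.random.rand(4, 2), np.random.rand(2, 5)
print(np.allclose(np.kron(A1, B1) @ np.kron(A2, B2), np.kron(A1 @ A2, B1 @ B2)))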
A2 Special Matrices
We will collect special matrices of type symmetric, antisymmetric, diagonal, unity, zero, idempotent, normal, orthogonal, orthonormal, positive definite and positive semidefinite, and special orthonormal matrices, for instance of type Helmert or of type Hankel.
Definitions (special matrices): A quadratic matrix A = [a_ij] of the order O(A) = n × n is called
symmetric :⇔ a_ij = a_ji for all i, j ∈ {1, …, n}, that is A = A';
antisymmetric :⇔ a_ij = -a_ji for all i, j ∈ {1, …, n}, that is A = -A';
diagonal :⇔ a_ij = 0 for i ≠ j, A = Diag[a11, …, ann];
unity :⇔ I_{n×n} with a_ij = 0 for i ≠ j and a_ij = 1 for i = j;
zero matrix :⇔ 0_{n×n} with a_ij = 0 for all i, j ∈ {1, …, n};
triangular (upper) :⇔ a_ij = 0 for i > j, (lower) :⇔ a_ij = 0 for i < j;
idempotent if and only if A·A = A;
normal if and only if A·A' = A'·A.
Definition (orthogonal matrix): The matrix A is called orthogonal if A·A' and A'·A are diagonal matrices. (The rows and columns of A are orthogonal.)
Definition (orthonormal matrix) : The matrix A is called orthonormal if AA c = A cA = I . (The rows and columns of A are orthonormal.) Facts (representation of a 2×2 orthonormal matrix) X SO(2) : A 2×2 orthonormal matrix X SO(2) is an element of the special orthogonal group SO(2) defined by SO(2) := {X R 2×2 | XcX = I 2 and det X = +1} ªx {X = « 1 ¬ x3
(i)
x2 º R 2×2 x 4 »¼ ª cos I X=« ¬ sin I
x12 + x 22 = 1 x1 x3 + x 2 x 4 = 0 , x1 x 4 x 2 x3 = +1} x32 + x 42 = 1 sin I º R 2×2 , I [0, 2S ] cos I »¼
is a trigonometric representation of X SO(2) . (ii)
ª x X=« 2 ¬ 1 x
1 x 2 º R 2×2 , x [1, +1] » x ¼
is an algebraic representation of X SO(2) 2 2 ( x112 + x122 = 1, x11 x 21 + x12 x 22 = x 1 x 2 + x 1 x 2 = 0, x 21 + x 22 = 1) .
(iii)
ª 1 x2 2x º + « » 2 1 + x 2 » R 2×2 , x R X = « 1+ x 2 1 x » « 2 x «¬ 1 + x 2 1 + x 2 »¼ is called a stereographic projection of X (stereographic projection of SO(2) ~ S1 onto L1 ).
(iv)
ª 0 xº X = (I 2 + S)(I 2 S) 1 , S = « », ¬ x 0 ¼ where S = S c is a skew matrix (antisymmetric matrix), is called a Cayley-Lipschitz representation of X SO(2) .
(v)
X SO(2) is a commutative group (“Abel”) (Example: X1 SO(2) , X 2 SO(2) , then X1 X 2 = X 2 X1 ) ( SO( n) for n = 2 is the only commutative group, SO(n | n z 2) is not “Abel”).
Facts (representation of an n×n orthonormal matrix) X SO(n) : An n×n orthonormal matrix X SO(n) is an element of the special orthogonal group SO(n) defined by SO(n) := {X R n×n | XcX = I n and det X = +1} . As a differentiable manifold SO(n) inherits a Riemann structure from the ambin 2 ent space R n with a Euclidean metric ( vec Xc \ , dim vec Xc = n ). Any atlas of the special orthogonal group SO(n) has at least four distinct charts and there is one with exactly four charts. (“minimal atlas”: Lusternik – Schnirelmann category) 2
2
(i)
X = (I n + S)(I n S) 1 , where S = Sc is a skew matrix (antisymmetric matrix), is called a Cayley-Lipschitz representation of X SO(n) . ( n! / 2(n 2)! is the number of independent parameters/coordinates of X)
(ii)
If each of the matrices R 1 ," , R k is an n×n orthonormal matrix, then their product R1R 2 " R k 1R k SO(n) is an n×n orthonormal matrix. Facts (orthonormal matrix: Helmert representation) :
Let a' = [a1, …, an] represent any row vector whose elements are all nonzero, a_i ≠ 0 (i ∈ {1, …, n}). Suppose that we require an n×n orthonormal matrix one row of which is proportional to a'. In what follows one such matrix R is derived. Let [r1', …, rn'] represent the rows of R and take the first row r1' to be the row of R that is proportional to a'. Take the second row r2' to be proportional to the n-dimensional row vector
[a1, -a1²/a2, 0, 0, …, 0],  (H2)
the third row r3' proportional to
[a1, a2, -(a1² + a2²)/a3, 0, 0, …, 0],  (H3)
and more generally the kth row rk' proportional to
[a1, a2, …, a_{k-1}, -Σ_{i=1}^{k-1} a_i²/a_k, 0, 0, …, 0] for k ∈ {2, …, n};  (Hn-1)
convince yourself that these n-1 vectors are orthogonal to each other and to the vector a'. In order to obtain explicit expressions for r1', …, rn' it remains to normalize a' and the vectors (H2)-(Hn-1). The Euclidean norm of the kth of these vectors is
{ Σ_{i=1}^{k-1} a_i² + (Σ_{i=1}^{k-1} a_i²)²/a_k² }^{1/2} = { (Σ_{i=1}^{k-1} a_i²)(Σ_{i=1}^{k} a_i²)/a_k² }^{1/2}.
Accordingly, for the orthonormal vectors r1', …, rn' we finally find
(1st row)  r1' = [ Σ_{i=1}^{n} a_i² ]^{-1/2} (a1, …, an),
(kth row)  rk' = [ (Σ_{i=1}^{k-1} a_i²)(Σ_{i=1}^{k} a_i²)/a_k² ]^{-1/2} (a1, a2, …, a_{k-1}, -Σ_{i=1}^{k-1} a_i²/a_k, 0, 0, …, 0),
(nth row)  rn' = [ (Σ_{i=1}^{n-1} a_i²)(Σ_{i=1}^{n} a_i²)/a_n² ]^{-1/2} (a1, a2, …, a_{n-1}, -Σ_{i=1}^{n-1} a_i²/a_n).
The recipe is complicated; when a' = [1, 1, …, 1, 1], the Helmert factors in the 1st row, …, kth row, …, nth row simplify to
r1' = n^{-1/2} [1, 1, …, 1, 1] ∈ ℝⁿ,
rk' = [k(k-1)]^{-1/2} [1, 1, …, 1, 1-k, 0, 0, …, 0, 0] ∈ ℝⁿ,
rn' = [n(n-1)]^{-1/2} [1, 1, …, 1, 1-n] ∈ ℝⁿ.
The orthonormal matrix
R = [ r1' ; r2' ; … ; r_{k-1}' ; rk' ; … ; r_{n-1}' ; rn' ] ∈ SO(n)
is known as the Helmert matrix of order n. (Alternatively, the transposes of such matrices are called Helmert matrices.)
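The construction can be sketched numerically as follows; the function name helmert and the normalization conventions are my own, and with the default a' = [1, …, 1] the code reproduces the order-3 and order-4 examples below.

import numpy as np

def helmert(n, a=None):
    # rows r_1', ..., r_n' built from the row vector a' = [a_1, ..., a_n]
    a = np.ones(n) if a is None else np.asarray(a, dtype=float)
    R = np.zeros((n, n))
    R[0] = a / np.linalg.norm(a)
    for k in range(2, n + 1):                 # k-th row, k = 2, ..., n
        s = np.sum(a[:k - 1] ** 2)
        row = np.zeros(n)
        row[:k - 1] = a[:k - 1]
        row[k - 1] = -s / a[k - 1]            # the (Hk) pattern before normalization
        R[k - 1] = row / np.linalg.norm(row)
    return R

R = helmert(4)
print(np.allclose(R @ R.T, np.eye(4)))        # rows are orthonormal
print(np.isclose(abs(np.linalg.det(R)), 1.0))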
Example (Helmert matrix of order 3): ª1/ 3 « «1/ 2 « «¬1/ 6
1/ 3 º » 0 » SO(3). » 2 / 6 »¼
1/ 3 1/ 2 1/ 6
Check that the rows are orthogonal and normalized. Example (Helmert matrix of order 4): ª 1/ 2 « « 1/ 2 « « 1/ 6 «1/ 12 ¬
1/ 2
1/ 2
1/ 2
0
1/ 6
2 / 6
1/ 12
1/ 12
1/ 2 º » 0 » » SO(4). 0 » 3 / 12 »¼
Check that the rows are orthogonal and normalized. Example (Helmert matrix of order n): ª 1/ n « 1/ 2 « « 1/ 6 « « " « 1 « « « (n 1)(n 2) « 1 « n(n 1) ¬«
1/ n
1/ n "
1/ n
1/ n
1/ 2
0
0
"
0
1/ 6
2/ 6
0
"
0
1
1
(n 1)(n 2)
(n 1)(n 2)
1
1
n(n 1)
n(n 1)
" "
"
"
"
1 (n 1) (n 1)(n 2) 1 n(n 1)
1/ n º » 0 » » 0 » » » SO(n). » 0 » » 1 n » » n(n 1) ¼»
Check that the rows are orthogonal and normalized. An example is the nth row 1 n ( n 1)
+"+
2
=
n n n ( n 1)
=
1 n( n 1)
n ( n 1) n( n 1)
+
(1 n )
2
n( n 1)
=
n 1 n( n 1)
+
1 2n + n n( n 1)
= 1,
where (n-1) terms 1/[n(n-1)] have to be summed. Definition (orthogonal matrix) : A rectangular matrix A = [aij ] \ n×m is called “a Hankel matrix” if the n+m-1 distinct elements of A ,
2
=
493 ª a11 «a « 21 « " « « an 11 «¬ an1
an 2
º » » » » » " anm »¼
only appear in the first column and last row. Example: Hankel matrix of power sums Let A R n× m be a n×m rectangular matrix ( n d m ) whose entries are power sums. ª n « ¦ D i xi « i =1 « n ¦D x2 A := «« i =1 i i « # « n « D xn i i «¬ ¦ i =1
n
¦D x
2 i i
i =1 n
¦D x
3 i i
i =1
# n
n +1 i i
¦D x i =1
n
º » i =1 » n m +1 » " ¦ D i xi » i =1 » » # # » n " ¦ D i xin + m1 » »¼ i =1 "
¦D x
m i i
A is a Hankel matrix. Definition (Vandermonde matrix): Vandermonde matrix: V R n× n ª 1 « x V := « #1 «¬ x1n 1
1 " 1 º x2 " xn » # # # », n 1 n 1 » " xn ¼ x2
n
det V = ( xi x j ). i, j i> j
Example: Vandermonde matrix V R 3×3 ª1 V := «« x1 «¬ x12
1 x2 x22
1º x3 »» , det V = ( x2 x1 )( x3 x2 )( x3 x1 ). x32 »¼
Example: Submatrix of a Hankel matrix of power sums Consider the submatrix P = [a1 , a2 ," , an ] of the Hankel matrix A R n× m (n d m) whose entries are power sums. The determinant of the power sums matrix P is
det P = (∏_{i=1}^{n} α_i)(∏_{i=1}^{n} x_i)(det V)²,
where det V is the Vandermonde determinant.
Example: Submatrix P ∈ ℝ^{3×3} of a 3×4 Hankel matrix of power sums (n = 3, m = 4): A = [a_jk] with a_jk = α1 x1^{j+k-1} + α2 x2^{j+k-1} + α3 x3^{j+k-1} for j = 1, 2, 3 and k = 1, …, 4, and P = [a1, a2, a3] is the square submatrix formed by the first three columns of A.
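The determinant identity can be verified numerically; the values of α_i and x_i below are assumed sample numbers.

import numpy as np

alpha = np.array([2.0, 0.5, 1.5])        # assumed weights
x = np.array([1.2, -0.7, 0.4])           # assumed nodes

# P[j, k] = sum_i alpha_i * x_i^(j+k+1) for j, k = 0, 1, 2  (power sums, 1-based exponent j+k-1)
P = np.array([[np.sum(alpha * x ** (j + k + 1)) for k in range(3)] for j in range(3)])
V = np.vander(x, 3, increasing=True).T   # Vandermonde matrix with rows 1, x, x^2
lhs = np.linalg.det(P)
rhs = np.prod(alpha) * np.prod(x) * np.linalg.det(V) ** 2
print(np.isclose(lhs, rhs))              # det P = (prod alpha_i)(prod x_i)(det V)^2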
Definitions (positive definite and positive semidefinite matrices) A matrix A is called positive definite, if and only if xcAx > 0 x \ n , x z 0 . A matrix A is called positive semidefinite, if and only if xcAx t 0 x \ n . An example follows. Example (idempotence): All idempotent matrices are positive semidefinite, at the time BcB and BBc for an arbitrary matrix B . What are “permutation matrices” or “commutation matrices”? After their definitions we will give some applications. Definitions (permutation matrix, commutation matrix) A matrix is called a permutation matrix if and only if each column of the matrix A and each row of A has only one element 1 . All other elements are zero. There holds AA c = I . A matrix is called a commutation matrix, if and only if for a matrix of the order n 2 × n 2 there holds K = K c and K 2 = I n . 2
The commutation matrix is symmetric and orthonormal.
Example (commutation matrix)
n=2
ª1 «0 K4 = « «0 « ¬0
0 0 1 0
0 1 0 0
0º 0 »» = K c4 . 0» » 1¼
A general definition of matrices K nm of the order nm × nm with n z m are to found in J.R. Magnus and H. Neudecker (1988 p.46-48). This definition does not lead to a symmetric matrix anymore. Nevertheless is the transpose commutation matrix again a commutation matrix since we have K cnm = K nm and K nm K mn = I nm . Example (commutation matrix)
n = 2º m = 3»¼
n = 3º m = 2 »¼
K 23
ª1 «0 «0 = «0 «0 «0 «¬0
0 0 0 1 0 0 0
0 1 0 0 0 0 0
0 0 0 0 1 0 0
0 0 1 0 0 0 0
0º 0» 0» 0» 0» 0» 1 »¼
K 32
ª1 «0 « = «0 0 «0 «¬0
0 0 1 0 0 0
0 0 0 0 1 0
0 1 0 0 0 0
0 0 0 1 0 0
0º 0» 0» 0» 0» 1 »¼
K 32 K 23 = I 6 = K 23 K 32 .
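A commutation matrix of any order can be generated directly from its defining property K_nm vec(A) = vec(A'); the helper below is a sketch of my own, not a library routine, and it reproduces K_23 and K_32 above.

import numpy as np

def commutation(n, m):
    # K_nm of order nm x nm, defined by K_nm vec(A) = vec(A') for A of order n x m
    K = np.zeros((n * m, n * m))
    for i in range(n):
        for j in range(m):
            K[i * m + j, j * n + i] = 1.0
    return K

vec = lambda M: M.reshape(-1, order='F')        # column-wise vec operator
A = np.arange(1.0, 7.0).reshape(2, 3)           # O(A) = 2 x 3
K23, K32 = commutation(2, 3), commutation(3, 2)
print(np.allclose(K23 @ vec(A), vec(A.T)))      # K_nm vec A = vec A'
print(np.allclose(K32 @ K23, np.eye(6)))        # K_mn K_nm = I_nm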
A3 Scalar Measures and Inverse Matrices
We will refer to some scalar measures, also called scalar functions, of matrices. Beforehand we introduce some classical definitions of type
• linear independence
• column and row rank
• rank identities.
Definitions (linear independence, column and row rank): A set of vectors x1, …, xn is called linearly independent if an arbitrary linear combination Σ_{i=1}^{n} α_i x_i = 0 holds only if all scalars α1, …, αn vanish, that is if α1 = α2 = … = α_{n-1} = α_n = 0.
Vectors x1, …, xn admitting such a combination with coefficients not all equal to zero are called linearly dependent. Let A be a rectangular matrix of the order O(A) = n × m. The column rank of the matrix A is the largest number of linearly independent columns, while the row rank is the largest number of linearly independent rows. Actually the column rank of the matrix A is identical to its row rank; this common number is called the rank of the matrix and written rk A. Obviously, rk A ≤ min{n, m}. If rk A = n holds, we say that the matrix A has full row rank. In contrast, if rk A = m holds, we say that the matrix A has full column rank. We list the following important rank identities.
Facts (rank identities):
(i) rk A = rk A' = rk A'A = rk AA'
(ii) rk(A + B) ≤ rk A + rk B
(iii) rk(A·B) ≤ min{rk A, rk B}
(iv) rk(A·B) = rk A if B has full row rank
(v) rk(A·B) = rk B if A has full column rank
(vi) rk(A·B·C) + rk B ≥ rk(A·B) + rk(B·C)
(vii) rk(A ⊗ B) = (rk A)·(rk B).
If a rectangular matrix of the order O(A) = n × m satisfies Ax = 0 for a certain vector x ≠ 0, then rk A ≤ m - 1. Let us define what is a rank factorization, the column space, a singular matrix and, especially, what is a division algebra.
Facts (rank factorization): We call A = G·F a rank factorization if rk A = rk G = rk F holds for certain matrices G and F of the order O(G) = n × rk A and O(F) = rk A × m.
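One concrete way to obtain such a factorization is via a compact singular value decomposition; this is a sketch under that choice, not the Cholesky-type factorization discussed later for positive semidefinite matrices.

import numpy as np

def rank_factorization(A, tol=1e-12):
    # A = G F with O(G) = n x rk A, O(F) = rk A x m
    U, s, Vt = np.linalg.svd(A)
    r = int(np.sum(s > tol))
    G = U[:, :r] * s[:r]            # n x r
    F = Vt[:r, :]                   # r x m
    return G, F

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])     # rank 2
G, F = rank_factorization(A)
print(G.shape, F.shape)             # (3, 2) (2, 3)
print(np.allclose(G @ F, A))
print(np.linalg.matrix_rank(A) == G.shape[1])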
Facts A matrix A has the column space R ( A) formed by the column vectors. The dimension of such a vector space is dim R ( A) = rk A . In particular, R ( A) = R ( AA c) holds. Definition (non-singular matrix versus singular matrix) Let a quadratic matrix A of the order O( A) be given. A is called nonsingular or regular if rk A = n holds. In case rk A < n, the matrix A is called singular. Definition (division algebra): Let the matrices A, B, C be quadratic and non-singular of the order O( A) = O(B) = O(C) = n × n . In terms of the Cayley-product an inner relation can be based on A = [aij ], B = [bij ], C = [cij ], O( A) = O(B) = O(C) = n × n (i)
( A B ) C = A ( B C)
(ii)
AI = A
(identity)
(iii)
A A 1 = I
(inverse).
(associativity)
The non-singular matrix A 1 = B is called Cayley-inverse. The conditions A B = In B A = In are equivalent. The Cayley-inverse A 1 is left and right identical. The Cayleyinverse is unique. Fact: ( A 1 ) c = ( A c) 1 : A is symmetric A 1 is symmetric. Facts: (Inverse Partitional Matrix /IPM/ of a symmetric matrix): Let the symmetric matrix A be partitioned as ªA A := « 11 c ¬ A 12
A 12 º c = A 11 , A c22 = A 22 . , A 11 A 22 »¼
Then its Cayley inverse A 1 is symmetric and can be partitioned as well as ªA A 1 = « 11 c ¬ A 12
A 12 º A 22 »¼
1 1 1 ª[I + A 11 c A 11 c ]A 11 A 12 ( A 22 A 12 A 12 ) 1 A 12 « 1 1 c A 11 c A 11 ( A 22 A 12 A 12 ) 1 A 12 ¬
1
=
1 1 c A 11 A 11 A 12 ( A 22 A 12 A 12 ) 1 º », 1 c A 11 ( A 22 A 12 A 12 ) 1 ¼
1 if A 11 exists ,
A
1
ªA = « 11 c ¬ A 12
A 12 º A 22 »¼
1
=
ª º c A 221 A 12 ) 1 c A 221 A 12 ) 1 A 12 A 221 ( A 11 A 12 ( A 11 A 12 , « 1 1 1 1 1 1 1 » c ( A 11 A 12 c A 22 A 12 ) c ( A 11 A 12 A 22 A 12 c ) A 12 ]A 22 ¼ [I + A 22 A 12 ¬ A 22 A 12 if A 221 exists . 1 c A 11 c A 221 A 12 S 11 := A 22 A 12 A 12 and S 22 := A 11 A 12
are the minors determined by properly chosen rows and columns of the matrix A called “Schur complements” such that A
1
ªA = « 11 c ¬ A 12
1 1 1 ª(I + A 11 c ) A 11 A 12 S 11 A 12 « 1 1 c A 11 S 11 A 12 ¬
A 12 º A 22 »¼
1
=
1 1 º A 11 A 12 S 11 » 1 S 11 ¼
1 if A 11 exists ,
A
1
ªA = « 11 c ¬ A 12
A 12 º A 22 »¼
1
=
ª º S 221 S 221 A 12 A 221 « 1 1 1 1 1 » c S 22 [I + A 22 A 12 c S 22 A 12 ]A 22 ¼ ¬ A 22 A 12 if A 221 exists , are representations of the Cayley inverse partitioned matrix A 1 in terms of “Schur complements”.
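The two "Schur complement" representations are easy to verify numerically; the test matrix below is an assumed random symmetric positive definite matrix, so that all required inverses exist.

import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
A = M @ M.T + 5.0 * np.eye(5)                      # symmetric positive definite test matrix
A11, A12, A22 = A[:3, :3], A[:3, 3:], A[3:, 3:]

S11 = A22 - A12.T @ np.linalg.inv(A11) @ A12       # Schur complement of A11
S22 = A11 - A12 @ np.linalg.inv(A22) @ A12.T       # Schur complement of A22

Ainv = np.linalg.inv(A)
B11 = np.linalg.inv(A11) + np.linalg.inv(A11) @ A12 @ np.linalg.inv(S11) @ A12.T @ np.linalg.inv(A11)
B12 = -np.linalg.inv(A11) @ A12 @ np.linalg.inv(S11)
print(np.allclose(Ainv[:3, :3], B11))              # upper-left block
print(np.allclose(Ainv[:3, 3:], B12))              # upper-right block
print(np.allclose(Ainv[3:, 3:], np.linalg.inv(S11)))
print(np.allclose(Ainv[:3, :3], np.linalg.inv(S22)))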
The formulae S11 and S 22 were first used by J. Schur (1917). The term “Schur complements” was introduced by E. Haynsworth (1968). A. Albert (1969) replaced the Cayley inverse A 1 by the Moore-Penrose inverse A + . For a survey we recommend R. W. Cottle (1974), D.V. Oullette (1981) and D. Carlson (1986). :Proof: For the proof of the “inverse partitioned matrix” A 1 (Cayley inverse) of the partitioned matrix A of full rank we apply Gauss elimination (without pivoting). AA 1 = A 1 A = I ªA A = « 11 c ¬ A 12
A 12 º c = A 11 , A c22 = A 22 , A 11 A 22 »¼
ª A R m×m , A R m×l 12 « 11 l ×m l ×l c R , A 22 R «¬ A 12 ªB A 1 = « 11 c ¬B 12
B 12 º c = B 11 , B c22 = B 22 , B 11 B 22 »¼
ªB R m×m , B R m×l 12 « 11 l ×m l ×l c R , B 22 R «¬ B 12 AA 1 = A 1 A = I
c = B11A11 + B12 A12 c = Im ª A11B11 + A12 B12 «A B + A B = B A +B A = 0 12 22 11 12 12 22 « 11 12 c B11 + A 22 B12 c = B12 c A11 + B 22 A12 c =0 « A12 « c B12 + A 22 B 22 = B12 c A12 + B 22 A 22 = I l ¬ A12
(1) (2) (3) (4).
1 Case (i): A 11 exists
“forward step” c = I m (first left equation: A11B11 + A12 B12
º » cA ) » multiply by A12 c B11 + A 22 B12 c = 0 (second right equation) ¼» A12 1 11
1 1 º c B 11 A 12 c A 11 c = A 12 c A 11 A 12 A 12 B 12 » c B 11 + A 22 B 12 c =0 A 12 ¼
c = Im ª A B + A 12 B 12 « 11 11 1 1 c A 11 A 12 )B 12 c = A 12 c A 11 ¬( A 22 A 12 1 1 c = ( A 22 A 12 c A 11 c A 11 B 12 A 12 ) 1 A 12 1 1 c = S 11 A 12 c A 11 B 12
or
ª Im « Ac A 1 ¬ 12 11
0 º ª A11
I l »¼ «¬ A12c
A12 º
ª A11 = A 22 »¼ «¬ 0
º . A 22 A12c A11 A12 »¼ A12
1
1 c A 11 Note the “Schur complement” S 11 := A 22 A 12 A 12 .
“backward step” c = Im A 11B 11 + A 12 B12 º 1 1 1 » c = ( A 22 A 12 c A 11 A 12 ) A 12 c A 11 ¼ B12 1 1 c ) = (I m B 12 A 12 c ) A 11 B 11 = A 11 (I m A 12 B 12 1 1 1 c A 11 c ]A 11 B 11 = [I m + A 11 A 12 ( A 22 A 12 A 12 ) 1 A 12 1 1 1 1 c A 11 B 11 = A 11 + A 11 A 12 S 11 A 12
A11B12 + A12 B 22 = 0 (second left equation) 1 1 1 c A 11 B 12 = A 11 A 12 B 22 = A 11 A 12 ( A 22 A 12 A 12 ) 1
1 c A11 B 22 = ( A 22 A12 A12 ) 1 1 B 22 = S11 .
Case (ii): A 221 exists “forward step” A11B12 + A12 B 22 = 0 (third right equation) º c B12 + A 22 B 22 = I l (fourth left equation: » A12 » multiply by A12 A 221 ) »¼
A 11B 12 + A 12 B 22 = 0 º 1 1 » c B 12 A 12 B 22 = A 12 A 22 ¼ A 12 A 22 A 12
ª A c B + A 22 B 22 = I l « 12 12 1 c )B 12 = A 12 A 221 ¬( A 11 A 12 A 22 A 12
501
A3 Scalar Measures and Inverse Matrices 1 1 c ) 1 A 12 c A 22 B 12 = ( A 11 A 12 A 22 A 12 1 1 B 12 = S 22 A 12 A 22
or ªI m « ¬0
A 12 A 221 º ª A 11 »« c Il ¼ ¬ A 12
A 12 º ª A 11 A 12 A 221 A 12 c =« » A 22 ¼ ¬ c A 12
0 º ». A 22 ¼
1 c . Note the “Schur complement” S 22 := A 11 A 12 A 22 A 12
“backward step” c B12 + A 22 B 22 = I l A 12 º 1 1 1 » c ) A 12 A 22 ¼ B12 = ( A 11 A 12 A 22 A 12 1 1 c B 12 c ) = (I l B 12 c A 12 ) A 22 B 22 = A 22 (I l A 12
c ( A 11 A 12 A 221 A 12 c ) 1 A 12 ]A 221 B 22 = [I l + A 221 A 12 1 1 1 1 c S 22 A 12 A 22 B 22 = A 22 + A 22 A 12 c B 11 + A 22 B 12 c = 0 ( third left equation ) A 12 c = A 221 A 12 c B 11 = A 221 A 12 c ( A 11 A 12 A 221 A 12 c ) 1 B 12
B 11 = ( A 11 A 12 A
1 22
A 1c 2 ) 1
B 1 1 = S 2 21 .
The representations {B11, B12, B21 = B12', B22} in terms of {A11, A12, A21 = A12', A22} have been derived by T. Banachiewicz (1937). Generalizations are found in T. Ando (1979), R. A. Brualdi and H. Schneider (1963), F. Burns, D. Carlson, E. Haynsworth and T. Markham (1974), D. Carlson (1980), C. D. Meyer (1973), S. K. Mitra (1982), and C. K. Li and R. Mathias (2000). We leave the proof of the following fact as an exercise.
Fact (Inverse Partitioned Matrix /IPM/ of a quadratic matrix): Let the quadratic matrix A be partitioned as
A := [ A11  A12 ; A21  A22 ].
Then its Cayley inverse A⁻¹ can be partitioned as well as
ªA A 1 = « 11 ¬ A 21
A 12 º A 22 »¼
1 1 1 1 ª A 11 + A 11 A 12 S 11 A 21 A 11 « 1 1 S 11 A 21 A 11 ¬
1
=
1 1 º A 11 A 12 S 11 », 1 S 11 ¼
1 if A 11 exists
A
1
ª S 221 « 1 1 ¬ A 22 A 21S 22
ªA = « 11 ¬ A 21
A 221
A 12 º A 22 »¼
1
=
º S 221 A 12 A 221 , 1 1 1 » + A 22 A 21S 22 A 12 A 22 ¼
if A 221 exists and the “Schur complements” are definded by 1 S 11 := A 22 A 21 A 11 A 12 and S 22 := A 11 A 12 A 221 A 21 .
Facts (Cayley inverse: sum of two matrices):
(s1) (A + B)⁻¹ = A⁻¹ - A⁻¹(A⁻¹ + B⁻¹)⁻¹A⁻¹
(s2) (A - B)⁻¹ = A⁻¹ + A⁻¹(B⁻¹ - A⁻¹)⁻¹A⁻¹
(s3) (A + CBD)⁻¹ = A⁻¹ - A⁻¹(I + CBDA⁻¹)⁻¹CBDA⁻¹
(s4) (A + CBD)⁻¹ = A⁻¹ - A⁻¹C(I + BDA⁻¹C)⁻¹BDA⁻¹
(s5) (A + CBD)⁻¹ = A⁻¹ - A⁻¹CB(I + DA⁻¹CB)⁻¹DA⁻¹
(s6) (A + CBD)⁻¹ = A⁻¹ - A⁻¹CBD(I + A⁻¹CBD)⁻¹A⁻¹
(s7) (A + CBD)⁻¹ = A⁻¹ - A⁻¹CBDA⁻¹(I + CBDA⁻¹)⁻¹
(s8) (A + CBD)⁻¹ = A⁻¹ - A⁻¹C(B⁻¹ + DA⁻¹C)⁻¹DA⁻¹ (Sherman-Morrison-Woodbury matrix identity)
(s9) B(AB + C)⁻¹ = (I + BC⁻¹A)⁻¹BC⁻¹
(s10) BD(A + CBD)⁻¹ = (B⁻¹ + DA⁻¹C)⁻¹DA⁻¹ (Duncan-Guttman matrix identity).
W. J. Duncan (1944) calls (s8) the Sherman-Morrison-Woodbury matrix identity. If the matrix A is singular consult H. V. Henderson and G. S. Searle (1981), D. V. Ouellette (1981), W. M. Hager (1989), G. W. Stewart (1977) and K. S. Riedel
(1992). (s10) has been noted by W. J. Duncan (1944) and L. Guttman (1946); the result is directly derived from the identity
(A + CBD)(A + CBD)⁻¹ = I
A(A + CBD)⁻¹ + CBD(A + CBD)⁻¹ = I
(A + CBD)⁻¹ = A⁻¹ - A⁻¹CBD(A + CBD)⁻¹
A⁻¹ = (A + CBD)⁻¹ + A⁻¹CBD(A + CBD)⁻¹
DA⁻¹ = D(A + CBD)⁻¹ + DA⁻¹CBD(A + CBD)⁻¹
DA⁻¹ = (I + DA⁻¹CB) D(A + CBD)⁻¹
DA⁻¹ = (B⁻¹ + DA⁻¹C) BD(A + CBD)⁻¹
(B⁻¹ + DA⁻¹C)⁻¹ DA⁻¹ = BD(A + CBD)⁻¹.
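A quick numerical check of (s8) and (s10) may be sketched as follows; the matrices are assumed random ones, shifted along the diagonal so that all required inverses exist.

import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 2
A = rng.normal(size=(n, n)) + n * np.eye(n)
B = rng.normal(size=(k, k)) + k * np.eye(k)
C = rng.normal(size=(n, k))
D = rng.normal(size=(k, n))

Ai, Bi = np.linalg.inv(A), np.linalg.inv(B)
lhs = np.linalg.inv(A + C @ B @ D)
woodbury = Ai - Ai @ C @ np.linalg.inv(Bi + D @ Ai @ C) @ D @ Ai      # (s8)
duncan = np.linalg.inv(Bi + D @ Ai @ C) @ D @ Ai                      # right side of (s10)
print(np.allclose(lhs, woodbury))
print(np.allclose(B @ D @ lhs, duncan))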
h Certain results follow directly from their definitions. Facts (inverses): (i)
( A ¸ B)1 = B1 ¸ A1
(ii)
( A B)1 = B1 A1
(iii)
A positive definite A1 positive definite
(iv)
( A B)1 , ( A B)1 and (A1 B1 ) are positive definite, then (A1 B1 ) ( A B)1 is positive semidefinite as well as (A1 A ) I and I (A1 A)1 .
Facts (rank factorization): (i) If the n × n matrix is symmetric and positive semidefinite, then its rank factorization is ªG º A = « 1 » [G1c G c2 ] , ¬G 2 ¼ where G1 is a lower triangular matrix of the order O(G1 ) = rk A × rk A with rk G 2 = rk A , whereas G 2 has the format O(G 2 ) = (n rk A) × rk A. In this case we speak of a Choleski decomposition.
(ii) In case that the matrix A is positive definite, the matrix block G 2 is not needed anymore: G1 is uniquely determined. There holds A 1 = (G11 )cG11 . Beside the rank of a quadratic matrix A of the order O( A) = n × n as the first scalar measure of a matrix, is its determinant A =
¦
(1)) ( j ,..., j
n
1
n )
a i =1
perm ( j1 ,..., jn )
iji
plays a similar role as a second scalar measure. Here the summation is extended as the summation perm ( j1 ,… , jn ) over all permutations ( j1 ,..., jn ) of the set of integer numbers (1,… , n) . ) ( j1 ,… , jn ) is the number of permutations which transform (1,… , n) into ( j1 ,… , jn ) . Laws (determinant) (i)
| D A | = D n | A | for an arbitrary scalar D R
(ii)
| A B |=| A | | B |
(iii)
| A
B |=| A |m | B |n for an arbitrary m × n matrix B
(iv)
(vi)
| A c |=| A | 1 | (A + A c) |d| A | if A + A c is positive definite 2 | A 1 |=| A |1 if A 1 exists
(vii)
| A |= 0 A is singular ( A 1 does not exist)
(viii)
| A |= 0 if A is idempotent, A z I
(ix)
| A |= aii if A is diagonal and a triangular matrix
(v)
n
i =1
n
(x)
0 d| A |d aii =| A I | if A is positive definite i =1
n
(xi)
| A | | B | d | A | bii d| A B | if A and B are posii =1
tive definite
(xii)
ª A11 «A ¬ 21
1 ª det A11 det( A 22 A 21A11 A12 ) « m ×m , rk A11 = m1 A12 º « A11 R =« » 1 A 22 ¼ « det A 21 det( A11 A12 A 22 A 21 ) « A R m ×m , rkA = m . 22 22 2 ¬ 1
2
1
2
A submatrix of a rectangular matrix A is the result of a canceling procedure of certain rows and columns of the matrix A. A minor is the determinant of a quadratic submatrix of the matrix A. If the matrix A is a quadratic matrix, to any element aij there exists a minor being the determinant of a submatrix of the matrix A which is the result of reducing the i-th row and the j-th column. By multiplying with (1)i + j we gain a new element cij of a matrix C = [cij ] . The transpose matrix Cc is called the adjoint matrix of the matrix A, written adjA . Its order is the same as of the matrix A. Laws (adjoint matrix) n
(i)
| A |= ¦ aij cij , i = 1,… , n j =1
n
(ii)
| A |= ¦ a jk c jk , k = 1,… , n j =1
(iii)
A (adj A) = (adj A) A = | A | I
(iv)
adj( A B) = (adj B) (adj A)
(v)
adj( A
B) = (adj A)
(adj B)
(vi)
adj A =| A | A 1 if A is nonsingular
(vii)
adjA positive definitive A positive definite.
As a third scalar measure of a quadratic matrix A of the order O( A) = n × n we introduce the trace tr A as the sum of diagonal elements, n
tr A = ¦ aii . i =1
Laws (trace of a matrix) (i)
tr(D A) = D tr A for an arbitrary scalar D R
(ii)
tr( A + B) = tr A + tr B for an arbitrary n × n matrix B
(iii)
tr( A
B) = (tr A) (tr B) for an arbitrary m × m matrix B
iv) (v)
tr A = tr(B C) for any factorization A = B C tr A c(B C) = tr( A c Bc)C for an arbitrary n × n matrix B and C tr A c = tr A trA = rkA if A is idempotent 0 < tr A = tr ( A I) if A is positive definite
(vi) (vii) (viii) (ix)
tr( A B) d (trA) (trB) if A und % are positive semidefinite.
In correspondence to the W – weighted vector (semi) – norm. || x ||W = (xc W x)1/ 2 is the W – weighted matrix (semi) norm || A ||W = (trA cWA)1/ 2 for a given positive – (semi) definite matrix W of proper order. Laws (trace of matrices): (i) tr A cWA t 0 (ii) tr A cWA = 0 WA = 0 A = 0 if W is positive definite
A4 Vector-valued Matrix Forms If A is a rectangular matrix of the order O( A) = n × m , a j its j – th column, then vec A is an nm × 1 vector ª a1 º «a » « 2 » vec A = « … » . « » « an 1 » «¬ an »¼ In consequence, the operator “vec” of a matrix transforms a vector in such a way that the columns are stapled one after the other. Definitions ( vec, vech, veck ): ª a1 º «a » « 2 » (i) vec A = « … » . « » « an 1 » «¬ an »¼ (ii) Let A be a quadratic symmetric matrix, A = A c , of order O( A) = n × n . Then vechA (“vec - koef”) is the [n(n + 1) / 2] × 1 vector which is the result of row (column) stapels of those matrix elements which are upper and under of its diagonal.
A4 Vector-valued Matrix Forms
ª a11 º «… » « » « an1 » a A = [aij ] = [a ji ] = A c vechA := «« 22 »» . … «a » « n2 » «… » «¬ ann »¼ (iii) Let A be a quadratic, antisymmetric matrix, A = A c , of order O( A) = n × n . Then veckA (“vec - skew”) is the [n(n + 1) / 2] × 1 vector which is generated columnwise stapels of those matrix elements which are under its diagonal. ª a11 º « … » « » « an1 » « a » A = [aij ] = [a ji ] = Ac veckA := « 32 » . … « a » « n2 » « … » «¬ an, n 1 »¼ Examples (i)
(ii)
(iii)
ªa b A=« ¬d e ªa b A = «« b d ¬« c e
cº vecA = [a, d , b, e, c, f ]c f »¼ cº e »» = A c vechA = [ a, b, c, d , e, f ]c f »¼
ª 0 a b « a 0 d A=« «b d 0 « f «¬ c e
c º e »» = A c veckA = [a, b, c, d , e, f ]c . f» » 0 »¼
Useful identities, relating to scalar- and vector - valued measures of matrices will be reported finally. Facts (vec and trace forms): vec(A B Cc) = (C
A) vec B (i) (ii)
vec(A B) = (Bc
I n ) vec A = (Bc
A) vec I m = = (I1
A) vec B, A R n× m , B R m × q
(iii)
A B c = (cc
A)vecB = ( A
cc)vecB c, c R q
(iv)
tr( A c B) = (vecA)cvecB = (vecA c)vecBc = tr( A Bc)
(v)
tr(A B Cc Dc) = (vec D)c(C
A) vec B = = (vec Dc)c( A
C) vec Bc
(vi)
K nm vecA = vecA c, A R n× m
(vii)
K qn (A
B) = (B
A)K pm
(viii)
K qn (A
B)K mp = (B
A)
(ix)
K qn (A
c) = c
A
(x)
K nq (c
A) = A
c, A R n×m , B R q× p , c R q
(xi)
vec(A
B) = (I m
K pn
I q )(vecA
vecB)
(xii)
A = (a1 ,… , a m ), B := Diagb, O(B) = m × m, m
Cc = [c1 ,… , c m ] vec(A B Cc) = vec[¦ (a j b j ccj )] = j =1
m
= ¦ (c j
a j )b j = [c1
a1 ,… , c m
a m )]b = (C : A)b j =1
(xiii)
A = [aij ], C = [cij ], B := Diagb, b = [b1 ,… ,b m ] R m tr(A B Cc B) = (vec B)c vec(C B A c) = = bc(I m : I m )c ( A : C)b = bc( A C)b
(xiv)
B := I m tr( A Cc) = rmc ( A C)rm ( rm is the m ×1 summation vector: rm := [1, …,1]c R m )
(xv)
vec DiagD := (I m D)rm = [I m ( A c B C)]rm = = (I m : I m )c = [I m : ( A c B C)] vec DiagI m = = (I m : I m )c vec( A c B C) = = (I m : I m )c (Cc
A c)vecB = (C : A)cvecB when D = A c B C is factorized.
Facts (Löwner partial ordering): For any quadratic matrix A R m×m there holds the uncertainty I m ( A c A) t I m A A = I m [( A : I m )c (I m : A)] in the Löwner partial ordering that is the difference matrix I m (Ac A) I m A A is at least positive semidefinite.
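Two of the vec and trace identities above can be confirmed quickly; the matrices below are assumed small examples.

import numpy as np

vec = lambda M: M.reshape(-1, order='F')        # stack the columns

A = np.arange(1.0, 7.0).reshape(2, 3)           # 2 x 3
B = np.arange(1.0, 13.0).reshape(3, 4)          # 3 x 4
C = np.arange(1.0, 9.0).reshape(2, 4)           # 2 x 4

# fact (i): vec(A B C') = (C kron A) vec B
print(np.allclose(vec(A @ B @ C.T), np.kron(C, A) @ vec(B)))

# fact (iv): tr(A' B) = (vec A)' vec B
A2 = np.arange(2.0, 8.0).reshape(2, 3)
print(np.isclose(np.trace(A.T @ A2), vec(A) @ vec(A2)))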
A5 Eigenvalues and Eigenvectors To any quadratic matrix A of the order O( A) = m × m there exists an eigenvalue O as a scalar which makes the matrix A O I m singular. As an equivalent statement, we say that the characteristic equation O I m A = 0 has a zero value which could be multiple of degrees, if s is the dimension of the related null space N ( A O I ) . The non-vanishing element x of this null space for which Ax = O x, x z 0 holds, is called right eigenvector of A. Related vectors y for which y cA = Ȝy , y z 0 , holds, are called left eigenvectors of A and are representative of the right eigenvectors A’. Eigenvectors always belong to a certain eigenvalue and are usually normed in the sense of xcx = 1, y cy = 1 as long as they have real components. As the same time, the eigenvectors which belong to different eigenvalues are always linear independent: They obviously span a subspace of R ( A) . In general, the eigenvalues of a matrix A are complex! There is an important exception: the orthonormal matrices, also called rotation matrices whose eigenvalues are +1 or, –1 and idempotent matrices which can only be 0 or 1 as a multiple eigenvalue generally, we call a null eigenvalue a singular matrix. There is the special case of a symmetric matrix A = A c of order O( A) = m × m . It can be shown that all roots of the characteristic polynomial are real numbers and accordingly m - not necessary different - real eigenvalues exist. In addition, the different eigenvalues O and P and their corresponding eigenvectors x and y are orthogonal, that is (O P )xc y = ( xc Ac) y xc( A y ) = 0, O P z 0. In case that the eigenvalue O of degrees s appears s-times, the eigenspace N ( A O I m ) is s - dimensional: we can choose s orthonormal eigenvectors which are orthonormal to all other! In total, we can organize m orthonormal eigenvectors which span the entire R m . If we restrict ourselves to eigenvectors and to eigenvalues O , O z 0 , we receive the column space R ( A) . The rank of A coincides with the number of non-vanishing eigenvalues {O1 ,… , Or }. U := [U1 , U 2 ], O(U) = m × m, U U c = U cU = I m U1 := [u1 ,… , u r ], O(U1 ) = m × r , r = rkA U 2 := [u r +1 ,… , u m ], O(U 2 ) = m × (m r ), A U 2 = 0. With the definition of the r × r diagonal matrix O := Diag(O1 ,… Or ) of nonvanishing eigenvalues we gain ª/ 0º A U = A [U1 , U 2 ] = [U1/, 0] = [U1 , U 2 ] « ». ¬ 0 0¼
Due to the orthonormality of the matrix U := [U1 , U 2 ] we achieve the results about eigenvalue – eigenvector analysis and eigenvalues – eigenvector synthesis. Lemma (eigenvalue – eigenvector analysis: decomposition): Let A = A c be a symmetric matrix of the order O( A) = m × m . Then there exists an orthonormal matrix U in such a way that U cAU = Diag(O1 ,… Or , 0,… , 0) holds. (O1 ,… Or ) denotes the set of non – vanishing eigenvalues of A with r = rkA ordered decreasingly. Lemma (eigenvalue – eigenvectorsynthesis: decomposition): Let A = A c be a symmetric matrix of the order O( A) = m × m . Then there exists a synthetic representation of eigenvalues and eigenvectors of type A = U Diag(O1 ,… Or , 0,… , 0)U c = U1/U1c . In the class of symmetric matrices the positive (semi)definite matrices play a special role. Actually, they are just the positive (nonnegative) eigenvalues squarerooted. /1/ 2 := Diag( O1 ,… , Or ) . The matrix A is positive semidefinite if and only if there exists a quadratic m × m matrix G such that A = GG c holds, for instance, G := [u1 /1/ 2 , 0] . The quadratic matrix is positive definite if and only if the m × m matrix G is not singular. Such a representation leads to the rank fatorization A = G1 G1c with G1 := U1 /1/ 2 . In general, we have Lemma (representation of the matrix U1 ): If A is a positive semidefinite matrix of the order O( A) with non – vanishing eigenvalues {O1 ,… , Or } , then there exists an m × r matrix U1 := G1 / 1 = U1 / 1/ 2 with U1c U1 = I r , R (U1 ) = R (U1 ) = R ( A), such that U1c A U1 = (/
1/ 2
U1c ) (U1 / U1c ) (U1 / 1/ 2 ) = I r .
The synthetic relation of the matrix A is A = G1 G1c = U1 / 1 U1c . The pseudoinverse has a peculiar representation if we introduce the matrices U1 , U1 and / 1 . Definition (pseudoinverse): If we use the representation of the matrix A of type A = G1 G1c = U1 /U1c then A + := U1 U1 = U1 / 1 U1c is the representation of its pseudoinverse namely (i)
AA + A = (U1 /U1c )( U1 / 1 U1c )( U1 /U1c ) = U1 /U1c
(ii) A + AA + = (U1 / 1 U1c )( U1 /U1c )( U1 / 1 U1c ) = U1/ 1 U1c = A + (iii) AA + = (U1 /U1c )( U1 / 1 U1c ) = U1 U1c = ( AA + )c (iv) A + A = (U1 / 1 U1c )( U1 /U1c ) = U1 U1c = ( A + A)c . The pseudoinverse A + exists and is unique, even if A is singular. For a nonsingular matrix A, the matrix A + is identical with A 1 . Indeed, for the case of the pseudoinverse (or any other generalized inverse) the generalized inverse of a rectangular matrix exists. The singular value decomposition is an excellent tool which generalizes the classical eigenvalue – eigenvector decomposition of symmetric matrices. Lemma (Singular value decomposition): (i) Let A be an n × m matrix of rank r := rkA d min(n, m) . Then the matrices A cA and A cA are symmetric positive (semi) definite matrices whose nonvanishing eigenvalues {O1 ,… Or } are positive. Especially r = rk( A cA) = rk( AA c) holds. AcA contains 0 as a multiple eigenvalue of degree m r , and AAc has the multiple eigenvalue of degree n r . (ii) With the support of orthonormal eigenvalues of A cA and AA c we are able to introduce an m × m matrix V and an n × n matrix U such that UUc = UcU = I n , VV c = V cV = I m holds and U cAAcU = Diag(O12 ,… , Or 2 , 0,… , 0), V cA cAV = Diag(O12 ,… , Or 2 , 0,… , 0).
The diagonal matrices on the right side have different formats m × m and m × n . (iii)
The original n × m matrix A can be decomposed according to ª/ 0º U cAV = « » , O(UAV c) = n × m ¬ 0 0¼
with the r × r diagonal matrix / := Diag(O1 ,… , Or ) of singular values representing the positive roots of nonvanishing eigenvalues of A cA and AA c . (iv)
A synthetic form of the n × m matrix A is ª / 0º A = Uc « » V c. ¬ 0 0¼
We note here that all transformed matrices of type T1AT of a quadratic matrix have the same eigenvalues as A = ( AT)T1 being used as often as an invariance property. ?what is the relation between eigenvalues and the trace, the determinant, the rank? The answer will be given now. Lemma (relation between eigenvalues and other scalar measures): Let A be a quadratic matrix of the order O( A) = m × m with eigenvalues in decreasing order. Then we have m
m
j =1
j =1
| A |= O j , trA = ¦ O j , rkA = trA , if A is idempotent. If A = A c is a symmetric matrix with real eigenvalues, then we gain O1 t max{a jj | j = 1,… , m},
Om d min{a jj | j = 1,… , m}. At the end we compute the eigenvalues and eigenvectors which relate the variation problem xcAx = extr subject to the condition xcx = 1 , namely xcAx + O (xcx) = extr . x, O
The eigenvalue O is the Lagrange multiplicator of the optimization problem.
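As a sketch of the eigenvalue-eigenvector synthesis, the singular value decomposition and the four Penrose equations, the following lines use NumPy's eigh and pinv on assumed example matrices.

import numpy as np

A = np.array([[2.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [1.0, 1.0, 2.0]])

# eigenvalue-eigenvector synthesis A = U Diag(lambda) U' for A = A'
lam, U = np.linalg.eigh(A)
print(np.allclose(U @ np.diag(lam) @ U.T, A))
print(np.isclose(np.prod(lam), np.linalg.det(A)), np.isclose(np.sum(lam), np.trace(A)))

# singular value decomposition and the pseudoinverse of a rank-deficient matrix
M = np.array([[1.0, 2.0], [2.0, 4.0], [0.0, 0.0]])     # rank 1
U2, s, Vt = np.linalg.svd(M)
print(int(np.sum(s > 1e-12)))                          # rk M = 1
Mp = np.linalg.pinv(M)
# the four Penrose equations (i)-(iv)
print(np.allclose(M @ Mp @ M, M), np.allclose(Mp @ M @ Mp, Mp))
print(np.allclose(M @ Mp, (M @ Mp).T), np.allclose(Mp @ M, (Mp @ M).T))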
A6 Generalized Inverses Because the inversion by Cayley inversion is only possible for quadratic nonsingular matrices, we introduce a slightly more general definition in order to invert arbitrary matrices A of the order O( A) = n × m by so – called generalized inverses or for short g – inverses. An m × n matrix G is called g – inverse of the matrix A if it fulfils the equation AGA = A in the sense of Cayley multiplication. Such g – inverses always exist and are unique if and only if A is a nonsingular quadratic matrix. In this case G = A 1 if A is invertible, in other cases we use the notation G = A if A 1 does not exist. For the rank of all g – inverses the inequality r := rk A d rk A d min{n, m} holds. In reverse, for any even number d in this interval there exists a g – inverse A such that d = rkA = dim R ( A ) holds. Especially even for a singular quadratic matrix A of the order O( A) = n × n there exist g-inverses A of full rank rk A = n . In particular, such g-inverses A r are of interest which have the same rank compared to the matrix A, namely rkA r = r = rkA . Those reflexive g-inverse A r are equivalent due to the additional condition A r AA r = A r but are not necessary symmetric for symmetric matrices A. In general, A = A c and A g-inverse of A ( A )c g-inverse of A A rs := A A( A )c is reflexive symmetric g inverse of A. For constructing of A rs we only need an arbitrary g-inverse of A. On the other side, A rs does not mean unique. There exist certain matrix functions which are independent of the choice of the g-inverse. For instance,
A( A cA) A and A c( AA c) 1 A can be used to generate special g-inverses of AcA or AA c . For instance, A A := ( A cA) A c and A m := A c( AA c) have the special reproducing properties A( A cA) A cA = AA A A = A and AAc( AAc) A = AA m A = A , which can be generalized in case that W and S are positive semidefinite matrices to WA( A cWA) A cWA = WA ASAc( ASAc) AS = AS , where the matrices WA( A cWA) A cW and SAc( ASA c) AS are independent of the choice of the g-inverse ( A cWA) and ( ASA c) . A beautiful interpretation of the various g-inverses is based on the fact that the matrices ( AA )( AA ) = ( AA A) A = AA and ( A A)( A A) = A ( AA A) = A A are idempotent and can therefore be geometrically interpreted as projections. The image of AA , namely R ( AA ) = R ( A) = {Ax | x R m } R n , can be completed by the projections A A along the null space N ( A A) = N ( A) = {x | Ax = 0} R m . By the choice of the g – inverse we are able to choose the projected direction of AA and the image of the projections A A if we take advantage of the complementary spaces of the subspaces R ( A A) N ( A A) = R m and R ( AA ) N ( AA ) = R n by using the symbol " " as the sign of “direct sum” of linear spaces which only have the zero element in common. Finally we have use the corresponding dimensions dim R ( A A) = r = rkA = dim R ( AA ) ªdim N ( A A) = m rkA = m r « ¬ dim N ( AA ) = n rkA = n r
independent of the special rank of the g-inverses A which are determined by the subspaces R ( A A) and N ( AA ) , respectively.
[Figure: the complementary subspaces R(A⁻A) and N(A⁻A) in ℝᵐ, and R(AA⁻) and N(AA⁻) in ℝⁿ.]
Example (geodetic networks): In a geodetic network, the projections A A correspond to a S – transformations in the sense of W. Baarda (1973). Example ( A A and A m g-inverses): The projections AA A = A( A cA) A c guarantee that the subspaces R ( AA ) and N ( AA A ) are orthogonal to each other. The same holds for the subspaces R ( A m A) and N ( A m A) of the projections A m A = A c( AA c) A. In general, there exist more than one g-inverses which lead to identical projections AA and A A. For instance, following A. Ben – Israel, T. N. E. Greville (1974, p.59) we learn that the reflexive g-inverse which follows from A r = ( A A) A ( AA ) = A AA contains the class of all reflexive g-inverses. Therefore it is obvious that the reflexive g-inverses A r contain exact by one pair of projections AA and A A and conversely. In the special case of a symmetric matrix A , A = A c , and n = m we know due to R ( AA ) = R ( A) A N ( A c) = N ( A) = N ( A A) that the column spaces R ( AA ) are orthogonal to the null space N ( A A) illustrated by the sign ”A ”. If these complementary subspaces R ( A A) and N ( AA ) are orthogonal to each other, the postulate of a symmetric reflexive ginverse agrees to A rs := ( A A) A ( A A)c = A A( A )c , if A is a suited g-inverse.
There is no insurance that the complementary subspaces R ( A A) and N ( A A) and R ( AA ) and N ( AA ) are orthogonal. If such a result should be reached, we should use the uniquely defined pseudoinverse A + , also called Moore-Penrose inverse for which holds R ( A + A) A N ( A + A), R ( AA + ) A N ( AA + ) or equivalent +
AA = ( AA + )c, A + A = ( A + A)c. If we depart from an arbitrary g-inverse ( AA A) , the pseudoinverse A + can be build on A + := Ac( AAcA) Ac (Zlobec formula) or +
A := Ac( AAc) A( AcA) Ac (Bjerhammar formula) ,
if both the g-inverses ( AA c) and ( A cA) exist. The Moore-Penrose inverse fulfils the Penrose equations: (i) AA + A = A (g-inverse) (ii) A + AA + = A + (reflexivity) (iii) AA + = ( AA + )cº » Symmetry due to orthogonal projection . (iv) A + A = ( A + A)c »¼
Lemma (Penrose equations) Let A be a rectangular matrix A of the order O( A) be given. A ggeneralized matrix inverse which is rank preserving rk( A) = rk( A + ) fulfils the axioms of the Penrose equations (i) - (iv). For the special case of a symmetric matrix A also the pseudoinverse A + is symmetric, fulfilling R ( A + A) = R ( AA + ) A N ( AA + ) = N ( A + A) , in addition A + = A( A 2 ) A = A( A 2 ) A( A 2 ) A.
Various formulas of computing certain g-inverses, for instance by the method of rank factorization, exist. Let A be an n × m matrix A of rank r := rkA such that A = GF, O(G ) = n × r , O(F) = r × m . Due to the inequality r d rk G d min{r , n} = r only G posesses reflexive ginverses G r , because of I r × r = [(G cG ) 1 G c]G = [(G cG ) 1 G c](GG cr G ) = G r G represented by left inverses in the sense of G L G = I. In a similar way, all ginverses of F are reflexive and right inverses subject to Fr := F c(FF c) 1 . The whole class of reflexive g-inverses of A can be represented by A r := Fr G r = Fr G L . In this case we also find the pseudoinverse, namely A + := F c(FF c) 1 (G cG ) 1 G c because of +
R ( A A) = R (F c) A N (F) = N ( A + A) = N ( A) R ( AA + ) = R (G ) A N (G c) = N ( AA + ) = N ( A c). If we want to give up the orthogonality conditions, in case of a quadratic matrix A = GF , we could take advantage of the projections A r A = AA r we could postulate R ( A p A) = R ( AA r ) = R (G ) , N ( A cA r ) = N ( A r A) = N (F) . In consequence, if FG is a nonsingular matrix, we enjoy the representation A r := G (FG ) 1 F , which reduces in case that A is a symmetric matrix to the pseudoinverse A + . Dual methods of computing g-inverses A are based on the basis of the null space, both for F and G, or for A and A c . On the first side we need the matrix EF by FEcF = 0, rkEF = m r versus G cEG c = 0, rkEG c = n r on the other side. The enlarged matrix of the order (n + r r ) × (n + m r ) is automatically nonsingular and has the Cayley inverse
518
Appendix A: Matrix Algebra
ªA «E ¬ F
1
EG c º ª A+ =« + » 0 ¼ ¬ EG c
EF+ º » 0¼
with the pseudoinverse A + on the upper left side. Details can be derived from A. Ben – Israel and T. N. E. Greville (1974 p. 228). If the null spaces are always normalized in the sense of < EF | EcF >= I m r , < EcG c | EG c >= I n r because of + F
E = EcF < EF | EcF > 1 = EcF and E
+ Gc
=< EcG c | EG c > 1 EcG c = EcG c
ªA «E ¬ F
1
EG c º ªA+ = « 0 »¼ ¬ EcF
EG c º » . 0 ¼
These formulas gain a special structure if the matrix A is symmetric to the order O( A) . In this case EG c = EcF =: Ec , O(E) = (m r ) × m , rk E = m r and 1
ª A+ E c < E | Ec > 1 º ª A Ec º » «E 0 » = « 1 0 ¬ ¼ ¬ < E | Ec > E ¼ on the basis of such a relation, namely EA + = 0 there follows I m = AA + + Ec < E | Ec > 1 E = = ( A + EcE)[ A + + Ec(EEcEEc) 1 E] and with the projection (S - transformation) A + A = I m Ec < E | Ec > 1 E = ( A + EcE) 1 A and A + = ( A + EcE) 1 Ec(EEcEEc) 1 E pseudoinverse of A R ( A + A) = R ( AA + ) = R ( A) A N ( A) = R (Ec) . In case of a symmetric, reflexive g-inverse A rs there holds the orthogonality or complementary
519
A6 Generalized Inverses
R ( A rs A) A N ( AA rs ) N ( AA rs ) complementary to R ( AA rs ) , which is guaranteed by a matrix K , rk K = m r , O(K ) = (m r ) × m such that KEc is a non-singular matrix. At the same time, we take advantage of the bordering of the matrix A by K and K c , by a non-singular matrix of the order (2m r ) × (2m r ) . 1
ª A rs K R º ª A K cº = « ». «K 0 » ¬ ¼ ¬ (K R ) c 0 ¼ K R := Ec(KEc) 1 is the right inverse of A . Obviously, we gain the symmetric reflexive g-inverse A rs whose columns are orthogonal to K c : R( A rs A) A R(K c) = N ( AA rs ) KA rs = 0
I m = AA + K c(EK c) 1 E = rs
= ( A + K cK )[ A rs + Ec(EK cEK c) 1 E] and projection (S - transformation) A A = I m Ec(KEc) 1 K = ( A + K cK ) 1 A c , rs
A rs = ( A + K cK ) 1 Ec(EK cEK c) 1 E. symmetric reflexive g-inverse For the special case of a symmetric and positive semidefinite m × m matrix A the matrix set U and V are reduced to one. Based on the various matrix decompositions ª- 0 º ª U1c º A = [ U1 , U 2 ] « » « » = U1AU1c , ¬ 0 0 ¼ ¬ U c2 ¼ we find the different g - inverses listed as following. ª-1 A = [ U1 , U 2 ] « ¬ L 21
L12 º ª U1c º »« ». L 21-L12 ¼ ¬ U c2 ¼
Lemma (g-inverses of symmetric and positive semidefinite matrices): (i)
ª-1 A = [ U1 , U 2 ] « ¬ L 21
L12 º ª U1c º » « », L 22 ¼ ¬ U c2 ¼
520
Appendix A: Matrix Algebra
(ii) reflexive g-inverse L12 º ª U1c º »« » L 21-L12 ¼ ¬ U c2 ¼
ª-1 A r = [ U1 , U 2 ] « ¬ L 21
(iii) reflexive and symmetric g-inverse ª-1 L12 º ª U1c º A rs = [ U1 , U 2 ] « »« » ¬ L12 L12 -L12 ¼ ¬ U c2 ¼ (iv) pseudoinverse ª-1 A + = [ U1 , U 2 ] « ¬ 0
0 º ª U1c º 1 » « c » = U1- U1 . U 0¼ ¬ 2 ¼
We look at a representation of the Moore-Penrose inverse in terms of U 2 , the basis of the null space N ( A A) . In these terms we find E := U1
ªA ¬« U c2
1
+ U2 º = ª« A 0 ¼» ¬ U c2
U2 º , 0 »¼
by means of the fundamental relation of A + A A + A = lim( A + G I m ) 1 A = AA + = I m U 2 U c2 = U1U1c , G o0
we generate the fundamental relation of the pseudo inverse A + = ( A + U 2 U c2 ) 1 U 2 U c2 . The main target of our discussion of various g-inverses is the easy handling of representations of solutions of arbitrary linear equations and their characterizations. We depart from the solution of a consistent system of linear equations, Ax = c, O( A) = n × m,
c R ( A) x = A c
for any g-inverse A .
x = A c is the general solution of such a linear system of equations. If we want to generate a special g - inverse, we can represent the general solution by x = A c + (I m A A ) z
for all z R m ,
since the subspaces N ( A) and R (I m A A ) are identical. We test the consistency of our system by means of the identity AA c = c . c is mapped by the projection AA to itself. Similary we solve the matrix equation AXB = C by the consistency test: the existence of the solution is granted by the identity
521
A6 Generalized Inverses
AA CB B = C for any g-inverse A and B . If this condition is fulfilled, we are able to generate the general solution by X = A CB + Z A AZBB , where Z is an arbitrary matrix of suitable order. We can use an arbitrary ginverse A and B , for instance the pseudoinverse A + and B + which would be for Z = 0 coincide with two-sided orthogonal projections. How can we reduce the matrix equation AXB = C to a vector equation? The vec-operator is the door opener. AXB = C
(Bc
A) vec X = vec C .
The general solution of our matrix equation reads vec X = (Bc
A) vec C + [I (Bc
A) (Bc
A)] vec Z . Here we can use the identity ( A
B) = B
A , generated by two g-inverses of the Kronecker-Zehfuss product. At this end we solve the more general equation Ax = By of consistent type R ( A) R (B) by Lemma (consistent system of homogenous equations Ax = By ): Given the homogenous system of linear equations Ax = By for y R A constraint by By R ( A ) . Then the solution x = Ly can be given under the condition R ( A ) R (B ) . In this case the matrix L may be decomposed by L = A B for a certain g-inverse A .
Appendix B: Matrix Analysis A short version on matrix analysis is presented. Arbitrary derivations of scalarvalued, vector-valued and matrix-valued vector – and matrix functions for functionally independent variables are defined. Extensions for differenting symmetric and antisymmetric matrices are given. Special examples for functionally dependent matrix variables are reviewed.
B1 Derivatives of Scalar valued and Vector valued Vector Functions Here we present the analysis of differentiating scalar-valued and vector-valued vector functions enriched by examples. Definition: (derivative of scalar valued vector function): Let a scalar valued function f (x) of a vector x of the order O(x) = 1× m (row vector) be given, then we call Df (x) = [D1 f (x),… , Dm f ( x)] :=
wf wxc
first derivative of f (x) with respect to xc . Vector differentiation is based on the following definition. Definition: (derivative of a matrix valued matrix function): Let a n × q matrix-valued function F(X) of a m × p matrix of functional independent variables X be given. Then the nq × mp Jacobi matrix of first derivates of F is defined by J F = DF(X) :=
wvecF(X) . w (vecX)c
The definition of first derivatives of matrix-functions can be motivated as following. The matrices F = [ f ij ] R n × q and X = [ xk A ] R m × p are based on twodimensional arrays. In contrast, the array of first derivatives ª wf ij º n× q× m× p « » = ª¬ J ijk A º¼ R w x ¬ kA ¼ is four-dimensional and automatic outside the usual frame of matrix algebra of two-dimensional arrays. By means of the operations vecF and vecX we will vectorize the matrices F and X. Accordingly we will take advantage of vecF(X) of the vector vecX derived with respect to the matrix J F , a two-dimensional array.
B2 Derivatives of Trace Forms
523
Examples (i) f (x) = xcAx = a11 x12 + (a12 + a21 ) x1 x2 + a22 x22 wf = wxc = [2a11 x1 + (a12 + a21 ) x2 | (a12 + a21 ) x1 + 2a22 x2 ] = xc( A + Ac) Df (x) = [D1 f (x), D 2 f (x)] =
ªa x + a x º (ii) f ( x) = Ax = « 11 1 12 2 » ¬ a21 x1 + a22 x2 ¼ J F = Df (x) =
wf ª a11 =« wxc ¬ a21
ª x2 + x x (iii) F(X) = X 2 = « 11 12 21 ¬ x21 x11 + x22 x21
a12 º =A a22 »¼
x11 x12 + x12 x22 º 2 » x21 x12 + x22 ¼
ª x112 + x12 x21 º « » x x +x x vecF(X) = « 21 11 22 21 » «x x + x x » « 11 12 12 2 22 » ¬« x21 x12 + x22 ¼» (vecX)c = [ x11 , x21 , x12 , x22 ] ª 2 x11 « wvecF(X) « x21 J F = DF(X) = = w (vecX)c « x12 « ¬ 0
x12 x11 + x22 0 x12
x21 0 x11 + x22 x21
0 º x21 »» x12 » » 2 x22 ¼
O(J F ) = 4 × 4 .
B2 Derivatives of Trace Forms Up to now we have assumed that the vector x or the matrix X are functionally idempotent. For instance, the matrix X cannot be a symmetric matrix X = [ xij ] = [ x ji ] = Xc or an antisymmetric matrix X = [ xij ] = [ x ji ] = Xc . In case of a functional dependent variables, for instance xij = x ji or xij = x ji we can take advantage of the chain rule in order to derive the differential procedure. ª A c, if X consists of functional independent elements; w « tr( AX) = « Ac + A - Diag[a11 ,… , ann ], if the n × n matrix X is symmetric; wX «¬ A c A, if the n × n matrix X is antisymmetric.
524
Appendix B: Matrix Analysis
ª[vecAc]c, if X consists of functional independent elements; «[vec(Ac + A - Diag[a ,…, a ])]c, if the n × n matrix X is w 11 nn tr( AX) = « « symmetric; w(vecX) « ¬[vec(Ac A)]c, if the n × n matrix X is antisymmetric. for instance ªa A = « 11 ¬ a21
a12 º ªx , X = « 11 » a22 ¼ ¬ x21
x12 º . x22 »¼
Case # 1: “the matrix X consists of functional independent elements” ª w « wx w = « 11 wX « w « wx ¬ 21
w º wx12 » », w » wx22 »¼
ªa w tr( AX) = « 11 wX ¬ a12
a21 º = Ac. a22 »¼
Case # 2: “the n × n matrix X is symmetric : X = Xc “ x12 = x21 tr( AX ) = a11 x11 + ( a12 + a21 ) x21 + a22 x22 ª w « wx w = « 11 wX « w « wx ¬ 21 ª a11 w tr( AX) = « wX ¬ a12 + a21
dx21 w º ª w dx12 wx21 » « wx11 »=« w » « w wx22 »¼ «¬ wx21
w º wx21 » » w » wx22 »¼
a12 + a21 º = A c + A Diag(a11 ,… , ann ) . a22 »¼
Case # 3: “the n × n matrix X is antisymmetric : X = X c ” x11 = x22 = 0, x12 = x21 tr( AX) = (a12 a21 ) x21 ª w « wx w = « 11 wX « w « ¬ wx21
dx21 w º ª w dx12 wx21 » « wx11 »=« » « w w » « wx22 ¼ ¬ wx21
ª 0 w tr( AX) = « wX ¬ a12 a21
w º wx21 » » w » wx22 »¼
a12 + a21 º » = Ac A . 0 ¼
525
B2 Derivations of Trace Forms
Let us now assume that the matrix X of variables xij is always consisting of functionally independent elements. We note some useful identities of first derivatives. Scalar valued functions of vectors w (acx) = ac w xc
(B1)
w (xcAx) = Xc( A + Ac). w xc
(B2)
Scalar-valued function of a matrix: trace w tr(AX) = Ac ; wX
(B3)
especially: w acXb w tr(bacX) = = b c
ac ; w (vecX)c w (vecX)c w tr(XcAX) = ( A + A c) X ; wX
(B4)
especially: w tr(XcX) = 2(vecX)c . w (vecX)c w tr(XAX) = XcA c + A cXc , wX
(B5)
especially: w trX 2 = 2(vecXc)c . w (vecX)c w tr(AX 1 ) = ( X 1AX 1 ), if X is nonsingular, wX especially: 1
w tr(X ) = [vec(X 2 )c]c ; w (vecX)c w acX 1b w tr(bacX 1 ) = = bc( X 1 )c
acX 1 . c c w (vecX) w (vecX)
(B6)
526
Appendix B: Matrix Analysis
w trXD = D ( Xc)D 1 , if X is quadratic ; wX
(B7)
especially: w trX = (vecI)c . w (vecX)c
B3 Derivatives of Determinantal Forms The scalarvalued forms of matrix determinantal form will be listed now. w | AXBc |= A c(adjAXBc)cB =| AXBc | A c(BXcA c) 1 B, wX if AXBc is nonsingular ;
(B8)
especially: wacxb = bc
ac, where adj(acXb)=1 . w (vecX)c w | AXBXcC |= C(adjAXBXcC) AXB + Ac(adjAXBXcC)cCXBc ; wX
(B9)
especially: w | XBXc |= (adjXBXc)XB + (adjXB cXc) XB c ; wX w | XSXc | = 2(vecX)c(S
adjXSXc), if S is symmetric; w (vecX)c w | XXc | = 2(vecX)c(I
adjXXc) . w (vecX)c w | AXcBXC |= BXC(adjAXcBXC) A + BcXA c(adjAXcBXC)cCc ; wX especially: w | XcBX |= BX(adjXcBX) + BcX(adjXcBcX) ; wX w | XcSX | = 2(vecX)c(adjXcSX
S), if S is symmetric; w (vecX)c w | XcX | = 2(vecX)c(adjXcX
I ) . w (vecX)c
(B10)
B4 Derivatives of a Vector/Matrix Function of a Vector/Matrix w | AXBXC |= BcXcA c(adjAXBXC)cCc + A c(adjAXBXC)cCcXcBc ; wX w | XBX |= BcXc(adjXBX)c + (adjXBX)cXB c ; wX
527 (B11)
especially: 2
w|X | = (vec[Xcadj(X 2 )c + adj(X 2 )cXc])c = w (vecX)c =| X |2 (vec[X c(X c) 2 + (X c) 2 X c])c = 2 | X |2 [vec(X 1 ) c]c, if X is non-singular . w | XD |= D | X |D ( X 1 )c, D N if X is non-singular , wX
(B12)
w|X| =| X | (X 1 )c if X is non-singular; wX especially: w|X| = [vec(adjXc)]c. w (vecX)c
B4 Derivatives of a Vector/Matrix Function of a Vector/Matrix If we differentiate the vector or matrix valued function of a vector or matrix, we will find the results of type (B13) – (B20). vector-valued function of a vector or a matrix w AX = A wxc
(B13)
w w (ac
A)vecX AXa = = ac
A w (vecX)c w (vecX)c
(B14)
matrix valued function of a matrix w (vecX) = I mp for all X R m × p w (vecX)c
(B15)
w (vecXc) = K m p for all X R m × p w (vecX)c
(B16)
where K m p is a suitable commutation matrix w (vecXX c) = ( I m +K m m )(X
I m ) for all X R m × p , w (vecX )c 2
where the matrix I m +K mm is symmetric and idempotent, 2
528
Appendix B: Matrix Analysis
w (vecXcX) = (I p +K p p )(I p
Xc) for all X R m × p w (vecX)c 2
w (vecX 1 ) = ( X 1 )c if X is non-singular w (vecX)c w (vecXD ) D = ¦ (Xc)D -j
X j 1 for all D N , if X is a square matrix. w (vecX)c j =1
B5 Derivatives of the Kronecker – Zehfuss product Act a matrix-valued function of two matrices X and Y as variables be given. In particular, we assume the function F(X, Y) = X
Y for all X R m × p , Y R n × q as the Kronecker – Zehfuss product of variables X and Y well defined. Then the identities of the first differential and the first derivative follow: dF(X, Y) = (dX)
Y + X
dY, dvecF(X, Y) = vec( dX
Y) + vec(X
dY), vec( dX
Y) = (I p
K qm
I n ) (vecdX
vecY) = = (I p
K qm
I n ) (I mp
vecY) d (vecX) = = (I p
[K qm
I n ) (I m
vecY)]) d (vecX), vec(X
dY) = (I p
K qm
I n ) (vecX
vecdY) = = (I p
K qm
I n ) (vecX
I nq ) d (vecY) = = ([( I p
K qm ) (vecX
I q )]
I n ) d (vecY), w vec(X
Y) = I p
[(K qm
I n) (I m
vecY)] , w (vecX)c w vec(X
Y) = (I p
K qm ) (vecX
I q )]
I n . w (vecY)c
B6 Matrix-valued Derivatives of Symmetric or Antisymmetric Matrix Functions Many matrix functions f ( X) or F(X) force us to pay attention to dependencies within the variables. As examples we treat here first derivatives of symmetric or antisymmetric matrix functions of X.
B6 Matrix-valued Derivatives of Symmetric or Antisymmetric Matrix Functions
Definition: (derivative of a matrix-valued symmetric matrix function): Let F(X) be an n × q matrix-valued function of an m × m symmetric matrix X = X c . The nq × m( m + 1) / 2 Jacobi matrix of first derivates of F is defined by wvecF(X) . w (vechX )c
J Fs = DF(X = X c) :=
Definition: (derivative of matrix valued antisymmetric matrix function): Let F(X) be an n × q matrix-valued function of an m × m antisymmetric matrix X = X c . The nq × m( m 1) / 2 Jacobi matrix of first derivates of F is defined by J aF = DF(X = X c) :=
wvecF(X) . w (veckX )c
Examples (i) Given is a scalar-valued matrixfunction tr(AX ) of a symmetric variable matrix X = X c , for instance a A = ª« 11 a ¬ 21
a12 º x , X = ª« 11 a22 »¼ x ¬ 21
ª x11 º x12 º , vechX = «« x22 »» x22 »¼ «¬ x33 »¼
tr(AX ) = a11 x11 + (a12 + a 21 )x 21 + a22 x22 w w w w =[ , , ] w (vechX )c wx11 wx21 wx22 w tr(AX) = [a11 , a12 + a21 , a22 ] w (vechX)c w tr(AX) w tr(AX) = [vech(A c + A Diag[a11 ,… , ann ])]c=[vech ]c. c w (vechX) wX (ii) Given is scalar-valued matrix function tr(AX) of an antisymmetric variable matrix X = Xc , for instance a A = ª« 11 a ¬ 21
a12 º 0 , X = ª« a22 »¼ x ¬ 21
x21 º , veckX = x21 , 0 »¼
tr(AX) = (a12 a 21 )x 21
529
530
Appendix B: Matrix Analysis
w w w tr(AX) = , = a12 a21 , w (veckX)c wx21 w (veckX)c w tr(AX) w tr(AX) = [veck(A c A)]c=[veck ]c . w (veckX)c wX
B7 Higher order derivatives Up to now we computed only first derivatives of scalar-valued, vector-valued and matrix-valued functions. Second derivatives is our target now which will be needed for the classification of optimization problems of type minimum or maximum. Definition: (second derivatives of a scalar valued vector function): Let f (x) a scalar-valued function of the n × 1 vector x . Then the m × m matrix w2 f DDcf (x) = D( Df ( x))c := wxwxc denotes the second derivatives of f ( x ) to x and xc . Correspondingly w w D2 f (x) :=
f (x) = (vecDDc) f ( x) wxc wx denotes the 1 × m 2 vector of second derivatives. and Definition: (second derivative of a vector valued vector function): Let f (x) be an n × 1 vector-valued function of the m × 1 vector x . Then the n × m 2 matrix of second derivatives H f = D2f (x) = D(Df (x)) =:
w w w 2 f ( x)
f ( x) = wxc wx wxcwx
is the Hesse matrix of the function f (x) . and Definition: (second derivatives of a matrix valued matrix function): Let F(X) be an n × q matrix valued function of an m × p matrix of functional independent variables X . The nq × m 2 p 2 Hesse matrix of second derivatives of F is defined by H F = D2 F(X) = D(DF(X)):=
w w w 2 vecF(X)
vecF(X) = . w (vecX)c w (vecX)c w (vecX)c
w(vecX)c
531
B7 Higher order derivatives
The definition of second derivatives of matrix functions can be motivated as follows. The matrices F = [ f ij ] R n×q and X = [ xk A ] R m × p are the elements of a two-dimensional array. In contrast, the array of second derivatives [
w 2 f ij wxk A wx pq
] = [kijk Apq ] R n × q × m × p × m × p
is six-dimensional and beyond the common matrix algebra of two-dimensional arrays. The following operations map a six-dimensional array of second derivatives to a two-dimensional array. (i) vecF(X) is the vectorized form of the matrix valued function (ii) vecX is the vectorized form of the variable matrix w w
w (vecX )c w (vecX )c vectorizes the matrix of second derivatives
(iii) the Kronecker – Zehfuss product
(iv) the formal product of the 1 × m 2 p 2 row vector of second derivatives with the nq ×1 column vector vecF(X) leads to an nq × m 2 p 2 Hesse matrix of second derivatives. Again we assume the vector of variables x and the matrix of variables X consists of functional independent elements. If this is not the case we according to the chain rule must apply an alternative differential calculus similary to the first deri-vative, case studies of symmetric and antisymmetric variable matrices. Examples: (i) f (x) = xcAx = a11 x12 + (a12 + a21 ) x1 x2 + a22 x22 Df (x) =
wf = [2a11 x1 + (a12 + a21 ) x2 | (a12 + a21 ) x1 + 2a22 x2 ] wxc
D2 f (x) = D(Df (x))c =
(ii)
ª 2a11 w2 f =« wxwxc ¬ a12 + a21
a12 + a21 º = A + Ac 2a22 »¼
ªa x + a x º f (x) = Ax = « 11 1 12 2 » ¬ a21 x1 + a22 x2 ¼ Df (x) =
DDcf (x) =
wf ª a11 =« wxc ¬ a21
a12 º =A a22 »¼
ª0 0º w 2f =« , O(DDcf (x)) = 2 × 2 c wxwx ¬0 0 »¼
D2 f (x) = [0 0 0 0], O(D2 f (x)) = 1 × 4
532
Appendix B: Matrix Analysis
(iii)
ª x2 + x x F(X) = X 2 = « 11 12 21 ¬ x21 x11 + x22 x21
x11 x12 + x12 x22 º 2 » x21 x12 + x22 ¼
ª x112 + x12 x21 º « » x21 x11 + x22 x21 » « vecF(X) = , O (F ) = O ( X) = 2 × 2 « x11 x12 + x12 x22 » « » 2 «¬ x21 x12 + x22 »¼ (vecX)c = [ x11 , x21 , x12 , x22 ] ª 2 x11 « w vecF(X) « x21 = JF = w (vecX)c « x12 « «¬ 0
x12 x11 + x22
x21 0
0 x12
x11 + x22 x21
0 º x21 »» x12 » » 2 x22 »¼
O(J F ) = 4 × 4 HF =
w w w w w w
vecF(X) = [ , , , ]
JF = w (vecX)c w (vecX)c wx11 wx21 wx12 wx22
ª2 « 0 =« «0 « ¬« 0
0 1 0 0
0 0 1 0
0 0 0 0
0 1 0 0
0 0 0 0
1 0 0 1
0 1 0 0
0 0 1 0
1 0 0 1
0 0 0 0
0 0 1 0
0 0 0 0
0 1 0 0
0 0 1 0
0º » 0» 0» » 2 ¼»
O(H F ) = 4 × 16 . At the end, we want to define the derivative of order l of a matrix-valued matrix function whose structure is derived from the postulate of a suitable array. Definition ( l-th derivative of a matrix-valued matrix function): Let F(X) be an n × q matrix valued function of an m × p matrix of functional independent variables X. The nq × ml p l matrix of l-th derivative is defined by Dl F(X) := =
w w
…
vecF(X) = w (vecX)c l -times w (vecX)c
wl vecF(X) for all l N . w (vecX)c
…
(vecX)c l -times
Appendix C: Lagrange Multipliers ?How can we find extrema with side conditions? We generate solutions of such external problems first on the basis of algebraic manipulations, namely by the lemma of implicit functions, and secondly by a geometric tool box, by means of interpreting a risk function and side conditions as level surfaces (specific normal images, Lagrange multipliers).
C1 A first way to solve the problem A first way to find extreme with side conditions will be based on a risk function f ( x1 ,..., xm ) = extr
(C1)
with unknowns ( x1 ,..., xm ) \ m , which are restricted by side conditions of type
[ F1 ( x1 ,..., xm ), F2 ( x1 ,..., xm ),..., Fr ( x1 ,..., xm ) ]c = 0 rk(
wFi ) = r < m. wxm
(C2) (C3)
The side conditions Fi ( x j ) (i = 1,..., r , j = 1,..., m) are reduced by the lemma of the implicit function: solve for xm r +1 = G1 ( x1 ,..., xm r ) xm r +2 = G2 ( x1 ,..., xmr ) ... xm1 = Gr 1 ( x1 ,..., xm r )
(C4)
xm = Gr ( x1 ,..., xmr ) and replace the result within the risk function f ( x1 , x2 ,..., xm r , G1 ( x1 ,..., xm r ),..., Gr ( x1 ,..., xm1 )) = extr .
(C5)
The “free” unknowns ( x1 , x2 ,..., xm r 1 , xm r ) \ m r can be found by taking the result of the implicit function theorem as follows. Lemma C1 (“implicit function theorem”): Let ȍ be an open set of \ m = \ m r × \ r and F : ȍ o \ r with vectors x1 \ m r and x 2 \ m r . The maps
534
Appendix C: Lagrange Multipliers
ª F1 ( x1 ,..., xm r ; xm r +1 ,..., xm ) º « F2 ( x1 ,..., xm r ; xm r +1 ,..., xm ) » « » (x1 , x 2 ) 6 F(x1 , x 2 ) = « (C6) ... » « Fr 1 ( x1 ,..., xm r ; xm r +1 ,..., xm ) » «¬ Fr ( x1 ,..., xm r ; xm r +1 ,..., xm ) »¼ transform a continuously differential function with F(x1 , x 2 ) = 0 . In case of a Jacobi determinant j not zero or a Jacobi matrix J of rank r, or w ( F1 ,..., Fr ) (C7) , w ( xm r +1 ,..., xm ) there exists a surrounding U := U(x1 ) \ m r and V := UG (x 2 ) \ r such that the equation F (x1 , x 2 ) = 0 for any x1 U in V c has only one solution j := det J z 0 or rk J = r , J :=
ª xm r +1 º ª G1 ( x1 ,..., xm r ) º « xm r + 2 » « G 2 ( x1 ,..., xm r ) » « » « » x 2 = G (x1 ) or « ... » = « ... ». x G ( x ,..., x ) « m 1 » « r 1 1 mr » «¬ xm »¼ «¬ G r ( x1 ,..., xm r ) »¼
(C8)
The function G : U o V is continuously differentiable. A sample reference is any literature treating analysis, e.g. C. Blotter . Lemma C1 is based on the Implicit Function Theorem whose result we insert within the risk function (C1) in order to gain (C5) in the free variables ( x1 , ..., xm r ) \ m r . Our example C1 explains the solution technique for finding extreme with side conditions within our first approach. Lemma C1 illustrates that there exists a local inverse of the side conditions towards r unknowns ( xm r +1 , xm r + 2 ,..., xm 1 , xm ) \ r which in the case of nonlinear side conditions towards r unknowns ( xm r +1 , xm r + 2 ,..., xm 1 , xm ) \ r which in case of nonlinear side conditions is not necessary unique. :Example C1: Search for the global extremum of the function f ( x1 , x2 , x3 ) = f ( x, y , z ) = x y z subject to the side conditions ª F1 ( x1 , x2 , x3 ) = Z ( x, y, z ) := x 2 + 2 y 2 1 = 0 « F ( x , x , x ) = E ( x, y , z ) := 3x 4 z = 0 ¬ 2 1 2 3
(elliptic cylinder) (plane)
C1 A first way to solve the problem
J=(
535
wFi ª2 x 4 y 0 º )=« , rk J ( x z 0 oder y z 0) = r = 2 wx j ¬ 3 0 4 »¼
1 ª 2 «1 y = + 2 2 1 x F1 ( x1 , x2 , x3 ) = Z ( x, y, z ) = 0 « «2 y = 1 2 1 x2 «¬ 2 3 F2 ( x1 , x2 , x3 ) = E ( x, y, z ) = 0 z = x 4 1 3 2 1 x2 , ) = 1 f ( x1 , x2 , x3 ) = 1 f ( x, y , z ) = f ( x, + 2 4 x 1 = 2 1 x2 4 2 1 3 2 1 x2 , ) 2 f ( x1 , x2 , x3 ) = 2 f ( x, y , z ) = f ( x, 2 4 x 1 = + 2 1 x2 4 2 x 1 1 1 c + = 0 1 x = 2 1 f ( x) = 0 4 2 3 1 x2 x 1 1 1 + = 0 2 x = + 2 2 4 2 3 1 x 1 3 1 3 (minimum), 2 f ( ) = + ( maximum). 1 f ( ) = 3 4 3 4 2
f ( x )c = 0
At the position x = 1/ 3, y = 2 / 3, z = 1/ 4 we find a global minimum, but at the position x = +1/ 3, y = 2 / 3, z = 1/ 4 a global maximum. An alternative path to find extreme with side conditions is based on the geometric interpretation of risk function and side conditions. First, we form the conditions F1 ( x1 ,… , xm ) = 0 º wFi F2 ( x1 ,… , xm ) = 0 » )=r » rk( … wx j » Fr ( x1 ,… , xm ) = 0 »¼ by continuously differentiable real functions on an open set ȍ \ m . Then we define r equations Fi ( x1 ,..., xm ) = 0 for all i = 1,..., r with the rank conditions rk(wFi / wx j ) = r , geometrically an (m-1) dimensional surface M F ȍ which can be seen as a level surface. See as an example our Example C1 which describe as side conditions
536
Appendix C: Lagrange Multipliers
F1 ( x1 , x2 , x3 ) = Z ( x, y , z ) = x 2 + 2 y 2 1 = 0 F2 ( x1 , x2 , x3 ) = E ( x, y , z ) = 3x 4 z = 0 representing an elliptical cylinder and a plane. In this case is the (m-r) dimensional surface M F the intersection manifold of the elliptic cylinder and of the plane as the m-r =1 dimensional manifold in \ 3 , namely as “spatial curve”. Secondly, the risk function f ( x1 ,..., xm ) = extr generates an (m-1) dimensional surface M f which is a special level surface. The level parameter of the (m-1) dimensional surface M f should be external. In our Example C1 one risk function can be interpreted as the plane f ( x1 , x2 , x3 ) = f ( x, y , z ) = x y z . We summarize our result within Lemma C2. Lemma C2 (extrema with side conditions) The side conditions Fi ( x1 ,… , xm ) = 0 for all i {1,… , r} are built on continuously differentiable functions on an open set ȍ \ m which are subject to the side conditions rk(wFi / wx j ) = r generating an (m-r) dimensional level surface M f . The function f ( x1 ,… , xm ) produces certain constants, namely an (m-1) dimensional level surface M f . f ( x1 ,… , xm ) is geometrically as a point p M F conditionally extremal (stationary) if and only if the (m-1) dimensional level surface M f is in contact to the (m-r) dimensional level surface in p. That is there exist numbers O1 ,… , Or , the Lagrange multipliers, by r
grad f ( p ) = ¦ i =1 Oi grad Fi ( p ). The unnormalized surface normal vector grad f ( p ) of the (m-1) dimensional level surface M f in the normal space `M F of the level surface M F is in the unnormalized surface normal vector grad Fi ( p ) in the point p . To this equation belongs the variational problem
L ( x1 ,… , xm ; O1 ,… , Or ) = r
f ( x1 ,… , xm ) ¦ i =1 Oi Fi ( x1 ,… , xm ) = extr . :proof: First, the side conditions Fi ( x j ) = 0, rk(wFi / wx j ) = r for all i = 1,… , r ; j = 1,… , m generate an (m-r) dimensional level surface M F whose normal vectors ni ( p ) := grad Fi ( p ) ` p M F
(i = 1,… , r )
span the r dimensional normal space `M of the level surface M F ȍ . The r dimensional normal space ` p M F of the (m-r) dimensional level surface M F
537
C1 A first way to solve the problem
is orthogonal complement Tp M p to the tangent space Tp M F \ m1 of M F in the point p spanned by the m-r dimensional tangent vectors t k ( p ) :=
wx wx k
Tp M F
(k = 1,..., m r ).
x= p
:Example C2: Let the m r = 2 dimensional level surface M F of the sphere S r2 \ 3 of radius r (“level parameter r 2 ”) be given by the side condition F ( x1 , x2 , x3 ) = x12 + x2 2 + x32 r 2 = 0. :Normal space: ª 2 x1 º wF wF wF + e2 + e3 = [e1 , e 2 , e 3 ] « 2 x2 » . «2 x » wx1 wx2 wx3 ¬ 3¼p 3 The orthogonal vectors [e1 , e 2 , e 3 ] span \ . The normal space will be generated locally by a normal vector n( p ) = grad F ( p ). n( p ) = grad F ( p ) = e1
:Tangent space: The implicit representation is the characteristic element of the level surface. In order to gain an explicit representation, we take advantage of the Implicit Function Theorem according to the following equations. F ( x1 , x2 , x3 ) = 0 º » wF rk( ) = r = 1 » x3 = G ( x1 , x2 ) wx j »¼ x12 + x22 + x32 r = 0 and (
wF wF ) = [2 x1 + 2 x2 + 2 x3 ], rk( ) =1 wx j wx j
x j = G ( x1 , x2 ) = + r 2 ( x12 + x2 2 ) . The negation root leads into another domain of the sphere: here holds the do2 2 main 0 < x1 < r , 0 < x2 < r , r 2 ( x1 + x2 ) > 0. The spherical position vector x( p ) allows the representation x( p ) = e1 x1 + e 2 x2 + e 3 r 2 ( x12 + x22 ) , which is the basis to produce
538
Appendix C: Lagrange Multipliers
ª ª « « x1 wx « t1 ( p ) = ( p ) = e1 e3 = [e1 , e 2 , e3 ] « 2 2 2 wx2 « « r ( x1 + x2 ) « «¬ « ª « « x2 wx « = [e1 , e 2 , e3 ] « «t1 ( p ) = wx ( p ) = e 2 e3 2 2 2 « r ( x1 + x2 ) 2 « «¬ «¬
1 0 x1 r 2 ( x12 + x2 2
0 1 x2 r 2 ( x12 + x2 2
º » » » )» ¼ º » », » )» ¼
which span the tangent space Tp M F = \ 2 at the point p. :The general case: In the general case of an ( m r ) dimensional level surface M F , implicitly produced by r side conditions of type F1 ( x1 ,..., xm ) = 0 º F2 ( x1 ,..., xm ) = 0 » wFi » ... » rk ( wx ) = r , j Fr j ( x1 ,..., xm ) = 0 » » Fr ( x1 ,..., xm ) = 0 ¼ the explicit surface representation, produced by the Implicit Function Theorem, reads x ( p ) = e1 x1 + e 2 x2 + ... + e m r xm r + e m r +1G1 ( x1 ,..., xmr ) + ... + e mGr ( x1 ,..., xmr ). The orthogonal vectors [e1 ,..., e m ] span \ m . Secondly, the at least once conditional differentiable risk function f ( x1 ,..., xm ) for special constants describes an (m 1) dimensional level surface M F whose normal vector n f := grad f ( p ) ` p M f spans an one-dimensional normal space ` p M f of the level surface M f ȍ in the point p . The level parameter of the level surface is chosen in the extremal case that it touches the level surface M f the other level surface M F in the point p . That means that the normal vector n f ( p ) in the point p is an element of the normal space ` p M f . Or we may say the normal vector grad f ( p ) is a linear combination of the normal vectors grad Fi ( p ) in the point p, r
grad f ( p ) = ¦ i =1 Oi grad Fi ( p ) for all i = 1,..., r , where the Lagrange multipliers Oi are the coordinates of the vector grad f ( p ) in the basis grad Fi ( p ).
539
C1 A first way to solve the problem
:Example C3: Let us assume that there will be given the point X \ 3 . Unknown is the point in the m r = 2 dimensional level surface M F of type sphere S r 2 = \ 3 which is from the point X \ 3 at extremal distance, either minimal or maximal. The distance function || X x ||2 for X \ 3 and X S r 2 describes the risk function f ( x1 , x2 , x3 ) = ( X 1 x1 ) 2 + ( X 2 x2 ) 2 + ( X 3 x3 ) 2 = R 2 = extr , x1 , x2 , x3
which represents an m 1 = 2 dimensional level surface M f of type sphere S r 2 \ 3 at the origin ( X 1 , X 2 , X 3 ) and level parameter R 2 . The conditional extremal problem is solved if the sphere S R 2 touches the other sphere S r 2 . This result is expressed in the language of the normal vector. n( p ) := grad f ( p ) = e1
wf wf wf + e2 + e3 = wx1 wx2 wx3
ª 2( X 1 x1 ) º = [e1 , e 2 , e 3 ] « 2( X 2 x2 ) » N p M f « 2( X x ) » 3 3 ¼p ¬ ª 2 x1 º n( p ) := grad F ( p ) = [e1 , e 2 , e 3 ] « 2 x2 » «2 x » ¬ 3¼ is an element of the normal space N p M f . The normal equation grad f ( p ) = O grad F ( p ) leads directly to three equations xi X 0 = O xi xi (1 O ) = X i
(i = 1, 2,3) ,
which are completed by the fourth equation F ( x1 , x2 , x3 ) = x12 + x2 2 + x3 2 r 2 = 0. Lateron we solve the 4 equations. Third, we interpret the differential equations r
grad f ( p ) = ¦ i =1 Oi grad Fi ( p ) by the variational problem, by direct differentiation namely
540
Appendix C: Lagrange Multipliers
L ( x1 ,..., xm ; O1 ,..., Or ) = r
= f ( x1 ,..., xm ) ¦ i =1 Oi Fi ( x1 ,..., xm ) =
extr
x1 ,..., xm ; O1 ,..., Or
wFi wf r ª wL « wx = wx ¦ i =1 Oi wx = 0 ( j = 1,..., m) j j « i « wL (i = 1,..., r ). « wx = Fi ( x j ) = 0 k ¬ :Example C4: We continue our third example by solving the alternative system of equations.
L ( x1 , x2 , x3 ; O ) = ( X 1 x1 ) 2 + ( X 2 x2 ) 2 + ( X 3 x3 ) O ( x12 + x22 + x32 r 2 ) = extr
x1 , x2 , x3 ; O
wL º = 2( X j x j ) 2O x j = 0 » wx j » wL » 2 2 2 2 = x1 + x2 + x3 r = 0 » wO ¼ X X º x1 = 1 ; x2 = 2 » 1 O 1 O » x12 + x22 + x32 r 2 = 0 ¼ X 12 + X 2 2 + X 32 r 2 = 0 (1 O ) 2 r 2 + X 12 + X 2 2 + X 32 = 0 (1 O ) 2 (1 O ) 2 =
X 12 + X 2 2 + X 32 1 X 12 + X 2 2 + X 32 1 O1, 2 = ± r r2
O1, 2 = 1 ±
r ± X 12 + X 2 2 + X 32 1 X 12 + X 2 2 + X 32 = r r rX 1 ( x1 )1, 2 = ± , X 12 + X 2 2 + X 32 ( x2 )1, 2 = ± ( x3 )1, 2 = ±
rX 2 X 12 + X 2 2 + X 32 rX 3 X + X 2 2 + X 32 2 1
, .
The matrix of second derivatives H decides upon whether at the point ( x1 , x2 , x3 , O )1, 2 we enjoy a maximum or minimum.
541
C1 A first way to solve the problem
H=(
w 2L ) = (G jk (1 O )) = (1 O )I 3 wx j xk
H (1 O > 0) > 0 ( minimum) º ( x1 , x2 , x3 ) is the point of minimum »¼
ª H(1 O < 0) < 0 ( maximum) «( x , x , x ) is the point of maximum . ¬ 1 2 3
Our example illustrates how we can find the global optimum under side conditions by means of the technique of Lagrange multipliers. :Example C5: Search for the global extremum of the function f ( x1 , x2 , x3 ) subject to two side conditions F1 ( x1 , x2 , x3 ) and F2 ( x1 , x2 , x3 ) , namely f ( x1 , x2 , x3 ) = f ( x, y , z ) = x y z (plane) ª F1 ( x1 , x2 , x3 ) = Z ( x, y, z ) := x 2 + 2 y 2 1 = 0 « F ( x , x , x ) = E ( x, y , z ) := 3x 4 z = 0 ¬ 2 1 2 3 J=(
(elliptic cylinder) (plane)
wFi ª2 x 4 y 0 º )=« , rk J ( x z 0 oder y z 0) = r = 2 . wx j ¬ 3 0 4 »¼ :Variational Problem:
L ( x1 , x2 , x3 ; O1 , O2 ) = L ( x, y, z; O , P ) = x y z O ( x + 2 y 2 1) P (3 x 4 z ) = 2
extr
x1 , x2 , x3 ; O , P
wL º = 1 2O x 3P = 0 » wx » wL 1 = 1 4O y = 0 O = » wy 4y » » wL 1 » = 1 4 P = 0 O = wz 4 » » wL » = x2 + 2 y 2 1 = 0 wO » » wL = 3 x 4 z = 0. » wP ¼ We multiply the first equation wL / wx by 4y, the second equation wL / wy by (2 x) and the third equation wL / wz by 3 and add ! 4 y 8O xy 12P y + 2 x + 8O xy 3 y + 12P y = y + 2 x = 0 .
542
Appendix C: Lagrange Multipliers
Replace in the cylinder equation (first side condition) Z(x, y, z)= x 2 + 2 y 2 1 = 0 , that is x1,2 = ±1/ 3. From the second condition of the plane (second side condition) E ( x, y, z ) = 3 x 4 z = 0 we gain z1,2 = ±1/ 4. As a result we find x1,2 , z1,2 and finally y1,2 = B 2 / 3. The matrix of second derivatives H decides upon whether at the point O 1,2 = B 3 / 8 we find a maximum or minimum. H=(
ª 2O w 2L )=« 0 wx j xk «¬ 0
0 0º 4O 0 » 0 0 »¼
ºª ª - 34 0 0 º » « H (O2 = 3 ) = « 0 - 3 0 » d 0 »« 8 « 0 02 0 » ¬ ¼ »« » «(maximum) »« 1 2 1 3 1 ( x, y, z; O , P )1 =( 13 ,- 32 , 14 ;- 83 , 14 ) » «( x, y, z; O , P ) 2 =(- 3 , 3 ,- 4 ; 8 , 4 ) is the restricted minmal solution point.»¼ «¬is the restricted maximal solution point. 3 3 ª 4 03 0 º H (O1 = ) = «0 2 0 » t 0 8 «0 0 0» ¬ ¼ (minimum)
The geometric interpretation of the Hesse matrix follows from E. Grafarend and P. Lohle (1991). The matrix of second derivatives H decides upon whether at the point ( x1 , x2 , x3 , O )1, 2 we enjoy a maximum or minimum.
Apendix D: Sampling distributions and their use: confidence intervals and confidence regions D1
A first vehichle: Transformation of random variables
If the probability density function (p.d.f.) of a random vector y = [ y1 ,… , yn ]c is known, but we want to derive the probability density function (p.d.f.) of a random vector x = [ x1 ,… , xn ]c (p.d.f.) which is generated by an injective mapping x =g(y) or xi = g i [ y1 ,… , yn ] for all i {1," , n} we need the results of Lemma D1. Lemma D1 (transformation of p.d.f.): Let the random vector y := [ y1 ,… , yn ]c be transformed into the random vector x = [ x1 ,… , xn ]c by an injective mapping x = g(y) or xi = gi [ y1 ,… , yn ] for all i {1,… , n} which is of continuity class C1 (first derivatives are continuous). Let the Jacobi matrix J x := (wg i / wyi ) be regular ( det J x z 0 ), then the inverse transformation y = g-1(x) or yi = gi1 [ x1 ,… , xn ] is unique. Let f x ( x1 ,… , xn ) be the unknown p.d.f., but f y ( y1 ,… , yn ) the given p.d.f., then f x ( x1 ,… , xn ) = f ( g11 ( x1 ,… , xn ),… , g11 ( x1 ,… , xn )) det J y with respect to the Jacobi matrix J y := [
wyi wg 1 ]=[ i ] wx j wx j
for all i, j {1,… , n} holds. Before we sketch the proof we shall present two examples in order to make you more familiar with the notation. Example D1 (“counter example”): The vector-valued random variable (y1, y2) is transformed into the vector-valued random variable (x1, x2) by means of x1 = y1 + y2 , J x := [
ª wx / wy1 wx ]= « 1 wy c ¬wx2 / wy1
x2 = y12 + y22 wx1 / wy2 º ª 1 = wx2 / wy2 »¼ «¬ 2 y1
1 º 2 y2 »¼
x12 = y12 + 2 y1 y2 + y22 , x2 + 2 y1 y2 = y12 + 2 y1 y2 + y22 x12 = x2 + 2 y1 y2 , y2 = ( x12 x2 ) /(2 y1 ) 1 x12 x2 1 , x1 y1 = y12 + ( x12 x2 ) 2 y1 2 1 y12 x1 y1 + ( x12 x2 ) = 0 2
x1 = y1 + y2 = y1 +
544
Appendix D: Sampling distributions and their use
x2 1 1 y1± = x1 ± 1 ( x12 x2 ) 2 4 2 y2± =
1 x1 B 2
x12 1 2 ( x1 x2 ) . 4 2
At first we have computed the Jacobi matrix J x, secondly we aimed at an inversion of the direct transformation ( y1 , y2 ) 6 ( x1 , x2 ) . As the detailed inversion step proves, namely the solution of a quadratic equation, the mapping x = g(y) is not injective. Example D2: Suppose (x1, x2) is a random variable having p.d.f. ªexp( x1 x2 ), x1 t 0, x2 t 0 f x ( x1 , x2 ) = « , otherwise. ¬0 We require to find the p.d.f. of the random variable (x1+ x2, x2 / x1). The transformation y1 = x1 + x2 , y2 =
x2 x1
has the inverse x1 =
y1 yy , x2 = 1 2 . 1 + y2 1 + y2
The transformation provides a one-to-one mapping between points in the first quadrant of the (x1, x2) - plane Px2 and in the first quadrant of the (y1, y2) - plane Py2 . The absolute value of the Jacobian of the transformation for all points in the first quadrant is wx1 wy1
w ( x1 , x2 ) = wx2 w ( y1 , y2 ) wy1
wx1 wy2 wx2 wy2
=
(1 + y2 ) 1 y2 (1 + y2 ) 1
y1 (1 + y2 ) 2 = y1 (1 + y2 ) 2 [(1 + y2 ) y2 ]
= y1 (1 + y2 ) 3 + y1 y2 (1 + y2 ) 3 = Hence we have found for the p.d.f. of (y1, y2)
y1 . (1 + y2 ) 2
D1
A first vehichle: Transformation of random variables
545
y1 ª exp( y1 ) , y1 > 0, y2 > 0 (1 + y2 ) 2 f y ( y1 , y2 ) = « « «¬0 , otherwise. Incidentally it should be noted that y1 and y2 are independent random variables, namely f y ( y1 , y2 ) = f1 ( y1 ) f ( y2 ) = y1 exp( y1 )(1 + y2 ) 2 .
h
Proof: The probability that the random variables y1 ,… , yn take on values in the region : y is given by
³" ³ f
y
( y1 ,… , yn )dy1 " dyn .
:y
If the random variables of this integral are transformed by the function xi = gi ( y1 ,… , yn ) for all i {1,… , n} which map the region :y onto the regions :x , we receive
³" ³ f
y
( y1 ,… , yn )dy1 " dyn =
:y
³" ³ f
y
( g11 ( x1 ,… , xn ),… , g n1 ( x1 ,… , xn )) det J y dx1 " dxn
:x
from the standard theory of transformation of hypervolume elements, namely dy1 " dyn = | det J y | dx1 " dxn or *(dy1 " dyn ) = | det J y | * (dx1 " dxn ). Here we have taken advantage of the oriented hypervolume element dy1 " dyn (Grassmann product, skew product, wedge product) and the Hodge star operator * applied to the n - differential form dy1 " dyn / n (the exterior algebra / n ). The star * : / p o / n p in R n maps a p - differential form onto a (n-p) - differential form, in general. Here p = n, n – p = 0 applies. Finally we define f x ( x1 ,… , xn ) := f ( g11 ( x1 ,… , xn ),… , g n1 ( x1 ,… , xn )) | det J y | as a function which is certainly non-negative and integrated over :x to one. h
546
Appendix D: Sampling distributions and their use
In applying the transformation theorems of p.d.f. we meet quite often the problem that the function xi = gi ( y1 ,… , yn ) for all i {1,… , n} is given but not the inverse function yi = gi1 ( x1 ,… , xn ) for all i {1,… , n} . Then the following results are helpful. Corollary D2 (Jacobian): If the inverse Jacobian | det J x | = | det(wgi / wy j ) | is given, we are able to compute. | det J y | = | det
wgi1 ( x1 ,… , xn ) | = | det J |1 wx j
= | det
wgi ( y1 ,… , yn ) 1 | . wy j
Example D3 (Jacobian): Let us continue Example D2. The inverse map ª g 1 ( y , y ) º ª x + x º 1 º wy ª 1 =« y = « 11 1 2 » = « 1 2 » » 2 wxc ¬ x2 / x1 1/ x1 ¼ ¬« g 2 ( y1 , y2 ) ¼» ¬ x2 / x1 ¼ jy = | J y | =
wy 1 x x +x = + 22 = 1 2 2 wx c x1 x1 x1
jx = | J x | = j y1 = | J y |1 =
wx x12 x2 = = 1 wy c x1 + x2 y1
allows us to compute the Jacobian Jx from Jy. The direct map ª g ( y , y ) º ª x º ª y /(1 + y2 ) º x=« 1 1 2 »=« 1»=« 1 » «¬ g 2 ( y1 , y2 ) »¼ ¬ x2 ¼ ¬ y1 y2 /(1 + y2 ) ¼ leads us to the final version of the Jacobian. jx = | J x | =
y1 . (1 + y2 ) 2
For the special case that the Jacobi matrix is given in a partitioned form, the results of Corollary D3 are useful. Corollary D3 (Jacobian): If the Jacobi matrix Jx is given in the partitioned form | J x |= ( then
wg i ªU º )=« », wy j ¬V ¼
D2
A second vehicle: Transformation of random variables
547
det J x = | det J x J xc | = det(UU c) det[VV c (VUc)(UU c) 1 UV c] if det(UU c) z 0 det J x = | det J x J xc | = det(VV c) det[UU c UV c(VV c) 1 VU c] , if det(VV c) z 0 | det J y | = | det J x |1 . Proof: The Proof is based upon the determinantal relations of a partitioned matrix of type ªA Uº 1 det « » = det A det(D VA U ) if det A z 0 V D ¬ ¼ ªA Uº 1 det « » = det D det( A UD U) if det D z 0 V D ¬ ¼ ªA Uº det « » = D det A V (adj A )U , ¬ V D¼ which have been introduced by G. Frobenius (1908): Über Matrizen aus positiven Elementen, Sitzungsberichte der Königlich Preussischen Akademie der Wissenschaften von Berlin, 471-476, Berlin 1908 and J. Schur (1917): Über Potenzreihen, die im Innern des Einheitskreises beschränkt sind, J. reine und angew. Math 147 (1917) 205-232.
D2
A second vehicle: Transformation of random variables
Previously we analyzed the transformation of the p.d.f. under an injective map of random variables y 6 g ( y ) = x . Here we study the transformation of polar coordinates [I1 , I2 ,… , In 1 , r ] Y as parameters of an Euclidian observation space to Cartesian coordinates [ y1 ,… , yn ] Y . In addition we introduce the hypervolume element of a sphere S n 1 Y, dim Y = n . First, we give three examples. Second, we summarize the general results in Lemma D4. Example D4 (polar coordinates: “2d”): Table D1 collects characteristic elements of the transformation of polar coordinates (I1 , r ) of type “longitude, radius” to Cartesian coordinates ( y1 , y2 ), their domain and range, the planar elements dy1 , dy2 as well as the circle S1 embedded into E 2 := {R 2 , G kl } , equipped with the canonical metric I 2 = [G kl ] and its total measure of arc Z1.
548
Appendix D: Sampling distributions and their use
Table D1 Cartesian and polar coordinates of a two-dimensional observation space, total measure of the arc of the circle (I1 , r ) [0, 2S ] × ]0, f[ , ( y1 , y2 ) R 2 dy1dy2 = rdrdI1 S1 := {y R 2 | y12 + y22 = 1} 2S
Z1 =
³ dI
1
= 2S .
0
Example D5 (polar coordinates: “3d”): Table D2 is a collectors’ item for characteristic elements of the transformation of polar coordinates (I1 , I2 , r ) of type “longitude, latitude, radius” to Cartesian coordinates ( y1 , y2 , y3 ), their domain and range, the volume element dy1 , dy2 , dy3 as well as of the sphere S2 embedded into E3 := {R 3 , G kl } equipped with the canonical metric I 3 = [G kl ] and its total measure of surface Z2. Table D2 Cartesian and polar coordinates of a three-dimensional observation space, total measure of the surface of the circle y1 = r cos I2 cos I1 , y2 = r cos I2 sin I1 , y3 = r sin I2 (I1 , I2 , r ) [0, 2S ] × ]
S S , [ × ]0, r[ , ( y1 , y2 ) R 2 2 2
( y1 , y2 , y3 ), R 3 dy1dy2 dy3 = r 2 dr cos I2 dI1dI2 S 2 := { y R 3 | y12 + y22 + y32 = 1} +S / 2
2S
Z2 =
³ dI ³ 1
0
dI2 cos I2 = 4S .
S / 2
Example D6 (polar coordinates: “4d”): Table D3 is a collection of characteristic elements of the transformation of polar coordinates (I1 , I2 , I3 , r ) to Cartesian coordinates ( y1 , y2 , y3 , y4 ), their domain and range, the hypervolume element dy1 , dy2 , dy3 , dy4 as well as of the 3 - sphere S3 embedded into E 4 := {R 4 , G kl } equipped with the canonical metric I 4 = [G kl ] and its total measure of hypersurface.
D2
A second vehicle: Transformation of random variables
549
Table D3 Cartesian and polar coordinates of a four-dimensional observation space total measure of the hypersurface of the 3-sphere y1 = r cos I3 cos I2 cos I1 , y2 = r cos I3 cos I2 sin I1 , y3 = r cos I3 sin I2 , y4 = r sin I3 (I1 , I2 , I3 , r ) [0, 2S ] × ]
S S S S , [ × ] , [ × ]0, 2S [ 2 2 2 2
dy1dy2 dy3dy4 = r 3 cos2 I3 cos I2 drdI3 dI2 dI1 J y :=
w ( y1 , y2 , y3 , y4 ) = w (I1 , I2 , I3 , r )
ª r cos I3 cos I2 sin I1 r cos I3 sin I2 cos I1 r sin I3 cos I2 cos I1 cos I3 cos I2 cos I1 º « » « + r cos I3 cos I2 cos I1 r cos I3 sin I2 sin I1 r sin I3 cos I2 sin I1 cos I3 cos I2 sin I1 » « 0 + r cos I3 cos I2 r sin I3 sin I2 cos I3 cos I2 » « » 0 0 r cos I3 sin I3 ¬ ¼ | det J y |= r 3 cos 2 I3 cos I2 S3 := { y R 4 | y12 + y22 + y32 + y42 = 1}
Z3 = 2S 2 . Lemma D4 (polar coordinates, hypervolume element, hypersurface element):
Let
ª cos I cos I cos I " cos I cos I º n 1 n2 n3 2 1» « ª y1 º « » « y » « cos In 1 cos In 2 cos In 3 " cos I 2 sin I1 » 2 « » « » « y3 » « cos In 1 cos In 2 cos In 3 " sin I2 » « » « » y « 4 » « cos In 1 cos In 2 " cos I3 » « " » = r« » « » « » " y « n 3 » « » « yn 2 » « cos In 1 cos In 2 sin In 3 » « » « » cos In 1 cos In 2 « yn 1 » « » «« y »» cos I sin I « » n 1 n 2 ¬ n ¼ ««sin I »» n 1 ¬ ¼
550
Appendix D: Sampling distributions and their use
be a transformation of polar coordinates (I1 , I2 ,… , In 2 , In 1 , r ) to Cartesian coordinates ( y1 , y2 ,… , yn 1 , yn ) , their domain and range given by
S S S S S S (I1 , I2 ,…, In2 , In1 , r ) [0, 2S ] × ] , + [ ×"× ] , + [ × ] , + [ × ]0, f[, 2 2 2 2 2 2 then the local hypervolume element dy1 ...dyn = r n 1dr cos n 2 In 1 cos n 3 In 2 ...cos 2 I3 cos I2 dIn 1dIn 2 ...dI3dI2 dI1 , as well as the global hypersurface element
Z n 1 =
+S / 2
+S / 2
2S
2 S ( n 1) / 2 := ³ cos Inn12 dIn 1 " ³ cos I2 dI2 ³ dI1 , n 1 S / 2 S / 2 0 *( ) 2
where J ( X ) is the gamma function. Before we care for the proof, let us define Euler’s gamma function. Definition D5 (Euler’s gamma function): f
*( x) = ³ e t t x 1 dt
( x > 0)
0
is Euler’s gamma function which enjoys the recurrence relation *( x + 1) = x*( x) subject to *(1) = 1
or
1 *( ) = S 2
*(2) = 1!
3 1 1 1 *( ) = *( ) = S 2 2 2 2
…
…
*(n + 1) = n !
p pq pq *( ) = *( ) q q q
if n is integer, n Z
+
if
p is a rational q
number, p / q Q+ . Example D7 (Euler’s gamma function):
D2
A second vehicle: Transformation of random variables
(i)
*(1) = 1
(i)
1 *( ) = S 2
(ii)
*(2) = 1
(ii)
3 1 1 1 *( ) = *( ) = S 2 2 2 2
(iii)
*(3) = 1 2 = 2
(iii)
5 3 3 3 *( ) = *( ) = S 2 2 2 4
(iv)
*(4) = 1 2 3 = 6
(iv)
7 5 5 15 S. *( ) = *( ) = 2 2 2 8
551
Proof: Our proof of Lemma D4 will be based upon computing the image of the tangent space Ty S n 1 E n of the hypersphere S n 1 E n . Let us embed the hypersphere S n 1 parameterized by (I1 , I2 ,… , In 2 , In 1 ) in E n parameterized by ( y1 ,… , yn ) , namely y E n , y = e1r cos In 1 cos In 2 " cos I2 cos I1 + +e 2 r cos In 1 cos In 2 " cos I2 sin I1 + " + +e n 1r cos In 1 sin In 2 + +e n r sin In 1. Note that I1 is a parameter of type longitude, 0 d I1 d 2S , while I2 ,… , In1 are parameters of type latitude, S / 2 < I2 < +S / 2,… , S / 2 < In1 < +S / 2 (open intervals). The images of the tangent vectors which span the local tangent space are given in the orthonormal n- leg {e1 , e 2 ,… , e n 1 , e n | 0} by g1 := DI y = e1r cos In 1 cos In 2 " cos I2 sin I1 + 1
+e 2 r cos In 1 cos In 2 " cos I2 cos I1 g 2 := DI y = e1r cos In 1 cos In 2 "sin I2 cos I1 2
e 2 r cos In 1 cos In 2 " sin I2 sin I1 + +e3 r cos In 1 cos In 2 " cos I2 ... g n 1 := DI y = e1r sin In 1 cos In 2 " cos I2 cos I1 " n 1
e n 1r sin In 1 sin In 2 + +e n r cos In 1
552
Appendix D: Sampling distributions and their use
g n := DI y = e1r cos In 1 cos In 2 " cos I2 cos I1 + n
+ e 2 r cos In 1 cos In 2 " sin I2 sin I1 + " + +e n 1r cos In 1 cos In 2 + e n r sin In 1 = y / r. {g1 ,… , g n 1} span the image of the tangent space in E n . gn is the hypersphere normal vector, || gn|| = 1. From the inner products < gi | gj > = gij, i, j {1,… , n} , we derive the Gauss matrix of the metric G:= [ gij]. < g1 | g1 > = r 2 cos 2 In 1 cos 2 In 2 " cos 2 I3 cos 2 I2 < g 2 | g 2 > = r 2 cos 2 In 1 cos 2 In 2 " cos 2 I3 " < g n 1 | g n 1 > = r 2 , < g n | g n > = 1. The off-diagonal elements of the Gauss matrix of the metric are zero. Accordingly det G n = det G n 1 = r n 1 (cos In 1 ) n 2 (cos In 2 ) n 3 " (cos I3 ) 2 cos I2 . The square root minant
det G n1 elegantly represents the Jacobian deter-
det G n , J y :=
w ( y1 , y2 ,… , yn ) = det G n . w (I1 , I2 ,… , In 1 , r )
Accordingly we have found the local hypervolume element det G n dr dIn 1 dIn 2 " dI3 dI2 dI1 . For the global hypersurface element Z n 1 , we integrate 2S
³ dI
1
+S / 2
³
= 2S
0
cos I2 dI2 = [sin I2 ]+SS // 22 = 2
S / 2 +S / 2
1 cos 2 I3 dI3 = [cos I3 sin I3 I3 ]+SS // 22 = S / 2 2 S / 2
³
+S / 2
1 4 cos3 I4 dI4 = [cos 2 I4 sin I4 2sin I4 ]+SS // 22 = 3 3 S / 2
³
...
D3 A first confidence interval of Gauss-Laplace normally distributed observations +S / 2
³
S / 2
(cos In 1 ) n 2 dIn 1 =
553
+S / 2
1 1 [(cos In 1 ) n 3 ]+SS // 22 + (cos In 1 ) n 4 dIn 1 n2 n 3 S³/ 2
recursively. As soon as we substitute the gamma function, we arrive at Zn-1. h
D3
A first confidence interval of Gauss-Laplace normally distributed observations P ,V 2 known, the Three Sigma Rule
The first confidence interval of Gauss-Laplace normally distributed observations constrained to ( P , V 2 ) known, will be computed as an introductory example. An application is the Three Sigma Rule. In the empirical sciences, estimates of certain quantities derived from observations are often given in the form of the estimate plus or minus a certain amount. For instance, the distance between a benchmark on the Earth’s surface and a satellite orbiting the Earth may be estimated to be (20, 101, 104.132 ± 0.023) m with the idea that the first factor is very unlikely to be outside the range 20, 101, 104.155 m to 20, 101, 104.109 m. A cost accountant for a publishing company in trying to allow for all factors which enter into the cost of producing a certain book, actual production costs, proportion of plant overhead, proportion of executive salaries, may estimate the cost to be 21 ± 1,1 Euro per volume with the implication that the correct cost very probably lies between 19.9 and 22.1 Euro per volume. The Bureau of Labor Statistics may estimate the number of unemployed in a certain area to be 2.4 ± .3 million at a given time though intuitively it should be between 2.1 and 2.7 million. What we are saying is that in practice we are quite accustomed to seeing estimates in the form of intervals. In order to give precision to these ideas we shall consider a particular example. Suppose that a random sample x {R, pdf } is taken from a Gauss-Laplace normal distribution with known mean P and known variance V 2 . We ask the key question. ?What is the probability J of the random interval ( P cV , P + cV ) to cover the mean P as a quantile c of the standard deviation V ? To put this question into a mathematical form we write the probabilistic twosided interval identity.
554
Appendix D: Sampling distributions and their use
P ( x1 < X < x2 ) = P ( P cV < X < P + cV ) = J , x2 = P + cV x1 =
§ 1 · exp ¨ 2 ( x P ) 2 ¸ dx = J © 2V ¹ cV V 2S
³ P
1
with a left boundary l = x1 and a right boundary r = x2 . The length of the interval is x2 x1 = r l . The center of the interval is ( x1 + x2 ) / 2 or P . Here we have taken advantage of the Gauss-Laplace pdf in generating the cumulative probability P( x1 < X < x2 ) = F( x2 ) F( x1 ) F( x2 ) F( x1 ) = F( P + cV ) F( P + cV ). Typical values for the confidence coefficient J are J = 0.95 ( J = 95% or 1 J = 5% negative confidence), J =0.99 ( J = 1% or 1 J = 1% negative confidence) or J = 0.999 ( J = 999% or 1 J = 1% negative confidence). O
O
f(x)
P-3V P-2V P-V
P
P+V
P+2V P+3V
x Figure D1: Probability mass in a two-sided confidence interval x1 < X< x2 or P cV < X< P + cV , three cases: (i) c = 1 , (ii) c = 2 and (iii) c = 3.
D3 A first confidence interval of Gauss-Laplace normally distributed observations
555
Consult Figure D1 for a geometric interpretation. The confidence coefficient J is a measure of the probability mass between x1 = P cV and x2 = P + cV . For a given confidence coefficient J x2
³ f ( x | P ,V
2
)dx = J
x1
establishes an integral equation. To make this point of view to be better understood let us transform the integral equations to its standard form. x6z= x2 = P + cV
1 (x P) x = P + V z V +c
1
1 § 1 · § 1 · exp ¨ 2 ( x P ) 2 ¸ dx = ³ exp ¨ z 2 ¸ dz = J ³ 2S © 2V ¹ © 2 ¹ c x = P cV V 2S 1
x2
³
f ( x | P , V 2 )dx =
+c
³ f ( z | 0,1)dz = J .
c
x1
The special Helmert transformation maps x to z, now being standard GaussLaplace normal: V 1 is the dilatation factor, also called scale variation, but P the translation parameter. The Gauss-Laplace pdf is symmetric, namely f ( x ) = f ( + x ) or f ( z ) = f ( + z ) . Accordingly we can write the integral identity x2
³
x1
x2
f ( x | P , V 2 )dx = 2 ³ f ( x | P , V 2 )dx = J 0
+c
³
c
c
f ( z | 0,1)dz = 2 ³ f ( z | 0,1)dz = J . 0
The classification of integral equations tells us that z
J ( z ) = 2 ³ f ( z )dz 0
is a linear Volterra integral equation the first kind. In case of a Gauss-Laplace standard normal pdf, such an integral equation is solved by a table. In a forward computation z
F( z ) :=
³
f
z
f ( z | 0,1)dz or ) ( z ) :=
1
§ 1
³ 2S exp ¨© 2 z
f
2
· ¸ dz ¹
are tabulated in a regular grid. For a given value F ( z1 ) or F ( z2 ) , z1 or z2 are determined by interpolation. C. F. Gauss did not use such a procedure. He took
556
Appendix D: Sampling distributions and their use
advantage of the Gauss inequality which has been reviewed in this context by F. Pukelsheim (1994). There also the Vysochanskii-Petunin inequality has been discussed. We follow here a two-step procedure. First, we divide the domain z [0, f] into two intervals z [0,1] and z [1, f ] . In the first interval f ( z ) is isotonic, differentiable and convex, f cc( z ) = f ( z )( z 2 1) < 0, while in the second interval isotonic, differentiable and concave, f cc( z ) = f ( z )( z 2 1) > 0 . z = 1 is the point of inflection. Second, we setup Taylor series of f ( z ) in the interval z [0,1] at the point z = 0 , while in the interval z [1, f ] at the point z = 1 and z [2, f] at the point z = 2 . Three examples of such a forward solution of the characteristic linear Volterra integral equation of the first kind will follow. They establish: The One Sigma Rule
The Two Sigma Rule
The Three Sigma Rule Box D1
Operational calculus applied to the Gauss-Laplace probability distribution “generating differential equations” f cc( z ) + 2 f c( z ) + f ( z ) = 0 subject to +f
³
f ( z )dz = 1
f
“recursive differentiation” 1 § 1 · f ( z) = exp ¨ z 2 ¸ 2S © 2 ¹ f c( z ) = zf ( z ) =: g ( z ) f cc( z ) = g c( z ) = f ( z ) zg ( z ) = ( z 2 1) f ( z ) f ccc( z ) = 2 zf ( z ) + ( z 2 1) g ( z ) = ( z 3 + 3 z ) f ( z ) f ( 4) ( z ) = (3 z 2 + 3) f ( z ) + ( z 3 + 3 z ) g ( z ) = ( z 4 6 z 2 + 3) f ( z ) f (5) ( z ) = (4 z 3 12 z ) f ( z ) + ( z 4 6 z 2 + 3) g ( z ) = ( z 5 + 10 z 3 15 z ) f ( z ) f (6) ( z ) = (5 z 4 + 30 z 2 15) f ( z ) + ( z 5 + 10 z 3 15 z ) g ( z ) = = ( z 6 15 z 4 + 45 z 2 15) f ( z )
D3 A first confidence interval of Gauss-Laplace normally distributed observations
557
f (7) ( z ) = (6 z 5 60 z 3 + 90 z ) f ( z ) + ( z 6 15 z 4 + 45 z 2 15) g ( z ) = = ( z 7 + 21z 5 105 z 3 + 105 z ) f ( z ) f (8) ( z ) = (7 z 6 + 105 z 4 315 z 2 + 105) f ( z ) + ( z 7 + 21z 5 105 z 3 + 105 z ) f ( z ) = = ( z 8 28 z 6 + 210 z 4 420 z 2 + 105) f ( z ) f (9) ( z ) = (8 z 7 168 z 5 + 840 z 3 840 z ) f ( z ) + +( z 8 28 z 6 + 210 z 4 420 z 2 + 105) g ( z ) = = ( z 9 + 36 z 7 378 z 5 + 1260 z 3 945z ) f ( z ) f (10) ( z ) = (9 z 8 + 252 z 6 1890 z 4 + 3780 z 2 945) f ( z ) + + ( z 9 + 36 z 7 378 z 5 + 1260 z 3 945) g ( z ) = = ( z10 45 z 8 + 630 z 6 3150 z 4 + 4725 z 2 945) f ( z ) ”upper triangle representation of the matrix transforming f ( z ) o f n ( z ) ” ª f ( z) º ª 1 « » « 0 « f c( z ) » « « f cc( z ) » « 1 « » « ccc « f ( z) » « 0 « (4) » « 3 « f ( z) » « f (5) ( z ) » = f ( z ) « 0 « « » « 15 « f (6) ( z ) » « « (7) » « 0 « f ( z) » « « (8) » « 105 « f ( z) » « 0 « f (9) ( z ) » « « (10) » ¬ 945 f z ( ) ¬« ¼»
D31
0 1 0 3 0
0 0 1 0 6
0 0 0 1 0
0 0 0 0 1
0 0 0 0 0
15 0
0 45
10 0
0 15
1 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 1 0 0 105 0 105 0 21 0 1 0 0 420 0 210 0 28 0 1 945 0 1260 0 378 0 36 0 0 4725 0 3150 0 630 0 45
ª1 º 0º « » z 0 »» « » « 2» 0» « z » » 3 0» « z » « 4» 0» « z » » 0 0» « z 5 » . « » 0 0» « z 6 » »« » 0 0» « z 7 » » 0 0» « z8 » « » 1 0 » « z 9 » » 0 1 ¼ « 10 » ¬« z ¼» 0 0 0 0 0
The forward computation of a first confidence interval of GaussLaplace normally distributed observations: P , V 2 known
We can avoid solving the linear Volterra integral equation of the first kind if we push forward the integration for a fixed value z. Example D8 (Series expansion of the Gauss-Laplace integral, 1st interval): Let us solve the integral 1
J ( z = 1) := 2³ f ( z )dz 0
558
Appendix D: Sampling distributions and their use
in the first interval 0 d z d 1 by Taylor expansion with respect to the successive differentiation of f ( z ) outlined in Box D1 and the specific derivatives f n (0) given in Table D1. Based on those auxiliary results, Box D2 presents us the detailed interpretation. First, we expand exp(z 2 / 2) up to order O(14). The specific Taylor series are uniformly convergent. Accordingly, in order to compute the integral, second we integrate termwise up to order O(15). For the specific value z=1, we have computed the coefficient of confidence J (1) = 0.683 . The result P( P V < X < P + V ) = 0.683 is known as the One Sigma Rule. 68.3 per cent of the sample are in the interval ]P 1V , P + 1V [ , 0.317 per cent outside. If we make 3 experiments, one experiment is outside the 1V interval. Box D2 A specific integral “expansion of the exponential function” x x 2 x3 xn + + + " + + O (n) |x|