LINEAR MODELS AN INTEGRATED APPROACH
SERIES ON MULTIVARIATE ANALYSIS Editor: M M Rao
Published
Vol. 1: Martingales and Stochastic Analysis, J. Yeh
Vol. 2: Multidimensional Second Order Stochastic Processes, Y. Kakihara
Vol. 3: Mathematical Methods in Sample Surveys, H. G. Tucker
Vol. 4: Abstract Methods in Information Theory, Y. Kakihara
Vol. 5: Topics in Circular Statistics, S. R. Jammalamadaka and A. SenGupta
Forthcoming Convolution Structures and Stochastic Processes R. Lasser
LINEAR MODELS AN INTEGRATED APPROACH
Debasis Sengupta
Applied Statistics Unit
Indian Statistical Institute
India

Sreenivasa Rao Jammalamadaka
Department of Statistics and Applied Probability
University of California, Santa Barbara
USA
World Scientific
New Jersey London Singapore Hong Kong
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: Suite 202, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
LINEAR MODELS An Integrated Approach Copyright © 2003 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-02-4592-0
Printed in Singapore by World Scientific Printers (S) Pte Ltd
To the memory of my grandfather, Sukumar Sen
DS

To the memory of my parents, Seetharamamma and Ramamoorthy Jammalamadaka
SRJ
Preface
The theory of linear models provides the foundation for two of the most important tools in applied statistics: regression and analysis of designed experiments. Thus it is no surprise that there are many books written on this topic. These books adopt a variety of approaches in providing an understanding of linear models. Recent books generally embrace the coordinate-free approach invoking vector spaces. In this book, Linear Models: An Integrated Approach, we also use vector spaces — but the emphasis is on statistical ideas and interpretations. Why another book? Because we firmly believe that a comprehensive story of linear models can be told in simple language by appealing to the reader's statistical thinking. We develop the basic theory using essentially two simple statistical concepts, the linear zero function and the principle of covariance adjustment. Although these two ideas go back a few decades, their potential in developing the theory of linear models has not been fully exploited till recently. In this context, linear zero functions correspond to ancillary statistics. It is fascinating how they can be used to make complex expressions look obvious, particularly when the error dispersion matrix is singular. We believe that there is no easier or better way for the exposition of the general linear model. We also review the other 'unified theories' of linear unbiased estimation in the general linear model, and provide their statistical motivation which may not be available elsewhere. The syllabus of a graduate-level course on 'Linear Models' taught in many universities includes a good bit of application in regression or vii
analysis of designed experiments. This leaves less space for the underlying theory of linear models in most contemporary books on the subject. An important secondary objective of this book is to provide a more complete and up-to-date treatment of this relatively neglected area. Linear Models: An Integrated Approach aims to achieve 'integration' in more ways than one. Establishing basic principles which link the general linear model with the special case of homoscedastic errors is one aspect of it. This linkage continues well beyond the derivation of the main results. The approach based on linear zero functions is used to derive and interpret results for many modifications and extensions. These include change in the residual sum of squares due to restrictions, effects of nuisance parameters and inclusion or exclusion of observations or parameters, as well as the multivariate linear model. Results on the decomposition of sum of squares in designed experiments, recursive inference, Kalman filter, missing plot techniques, deletion diagnostics and so on for the general linear model follow in a natural way. Simultaneous tests, confidence and tolerance intervals for parameters of the general linear model are also provided. Another aspect of 'integration' is the unified treatment of linear models with partially specified dispersion matrix, which include such special cases as variance components or mixed effects models and linear models with serially or spatially correlated errors. The third aspect of 'integration' is a comprehensive discussion of the foundations of linear inference, which is developed by carefully sequencing material that had been scattered in a number of articles. This theory runs parallel to the general theory of statistical inference, and shows how the approach based on linear zero functions emanates naturally from the basic principles of inference — without any distributional assumption. 
After giving a brief introduction to the linear model in Chapter 1, we provide in Chapters 2 and 3 a summary of the algebraic and statistical results that would be used in the later chapters. In the next two chapters, we consider the linear model with uncorrelated errors of equal variance. We develop the theory of linear estimation in Chapter 4 and proceed to discuss confidence regions, testing and prediction in Chapter 5. Analysis of variance and covariance in designed experiments is considered in Chapter 6. The results of Chapters 4 and 5 are then extended to the case of the general linear model with an arbitrary dispersion matrix, in Chapter 7. In Chapter 8 we consider the case when the error dispersion matrix is misspecified or partially specified. We deal with updates in the general linear model in Chapter 9 and with multivariate response in Chapter 10. In Chapter 11 we discuss the statistical foundation for linear inference, alternative linear estimators, a geometric perspective and asymptotic results. The book contains over 300 end-of-chapter exercises. These are meant to (a) illustrate the material covered, (b) supplement some results with proofs, interpretations and extensions, (c) introduce or expand ideas that are related to the text of the chapter, and (d) give glimpses of interesting research issues. These are arranged, more or less, in the order of the corresponding sections. Solutions to selected exercises are given in the appendix, while solutions to almost all the other exercises are given at the URL http://www.isical.ac.in/~sdebasis/linmodel. The URL also contains soft copies of the data sets. Linear inference in the linear model often serves as a benchmark for other methods and models. While providing a comprehensive account of the state of the art in this area, we were unable to cover other inference procedures or other related models — for which several books are already available. Topics that are left out include nonlinear methods of inference such as Bayesian, robust and rank-based methods, resampling techniques, inference in the generalized linear model and data-analytic methods such as transformation of variables and use of missing data. The book is meant primarily for students and researchers in statistics. Engineers and scientists who need a thorough understanding of linear models for their work may also find it useful.
Even though practitioners may not find structured instructions, they will find in this book a reference for many of the tools they need and use — such as leverages, residuals, deletion diagnostics, indicators of collinearity and various plots. We also hope that researchers will find stimulus for further work from the perspective of current research discussed in the later chapters and the suggestions and leads provided in some of the exercises. Familiarity with statistical inference and linear algebra at the upper
division or first-year graduate level is a prerequisite for reading this book. Essential topics in these areas are briefly reviewed in Chapters 2 and 3. Mastery of algebra is not a prerequisite; we have simplified the proofs to the extent that algebra does not obscure the statistical content. A one-semester graduate-level introductory course in linear models can be taught by using Chapters 1-6 and selected topics from the other chapters. A follow-up course can be taught from Chapters 7-11. Some sections in Chapters 4-9 are marked with an asterisk; these may be omitted during the first reading. For students who have already had a first course in linear models/regression elsewhere, a second course may be taught by rushing through Chapters 4-6, covering Chapter 7 in detail and then teaching selected topics from Chapters 8-11. We do not make a distinction among lemmas, theorems and corollaries in this book. All of these are called propositions. The propositions, definitions, remarks and examples are numbered consecutively within a section, in a common sequence. Equations are also numbered consecutively within a section. Throughout the book, vectors are represented by lowercase and boldface letters, while uppercase and boldface letters are used to denote matrices. No notational distinction is made between a random vector and a particular realization of it. Errata for the book appear at the URL mentioned earlier. We would appreciate receiving comments and suggestions on the book sent by email to
[email protected]. The approach adopted in this book has its roots in the lecture notes of R.C. Bose (1949). It was conceived in its present form by P. Bhimasankaram of Indian Statistical Institute, to whom we are further indebted for extensive suggestions on several versions of the manuscript. Comments from Professors Bikas K. Sinha and Anis C. Mukhopadhyay of Indian Statistical Institute and Professor Thomas Mathew of University of Maryland, Baltimore County were also very useful. The first author thanks his family and friends for coping with several long periods of absence, and especially his son Shairik for putting up a brave face even as he missed his father's company.

Debasis Sengupta
Sreenivasa Rao Jammalamadaka
Contents
Preface vii
Glossary of abbreviations xix
Glossary of matrix notations xxi
Chapter 1 Introduction 1
1.1 The linear model 1
1.2 Why a linear model? 3
1.3 Description of the linear model and notations 4
1.4 Scope of the linear model 6
1.5 Related models 9
1.6 Uses of the linear model 11
1.7 A tour through the rest of the book 13
1.8 Exercises 16
Chapter 2 Review of Linear Algebra 23
2.1 Matrices and vectors 23
2.2 Inverses and generalized inverses 28
2.3 Vector space and projection 31
2.4 Column space 36
2.5 Matrix decompositions 40
2.6 Lowner order 45
2.7 Solution of linear equations 47
2.8 Optimization of quadratic forms and functions 48
2.9 Exercises 52
Chapter 3 Review of Statistical Results 55
3.1 Covariance adjustment 55
3.2 Basic distributions 58
3.3 Distribution of quadratic forms 60
3.4 Regression 61
3.5 Basic concepts of inference 66
3.6 Point estimation 70
3.7 Bayesian estimation 78
3.8 Tests of hypotheses 82
3.9 Confidence region 85
3.10 Exercises 87
Chapter 4 Estimation in the Linear Model 93
4.1 Linear estimation: some basic facts 94
4.1.1 Linear unbiased estimator and linear zero function 94
4.1.2 Estimability and identifiability 97
4.2 Least squares estimation 100
4.3 Best linear unbiased estimation 102
4.4 Maximum likelihood estimation 107
4.5 Fitted value, residual and leverage 108
4.6 Dispersions 111
4.7 Estimation of error variance and canonical decompositions 113
4.7.1 A basis set of linear zero functions 113
4.7.2 A natural estimator of error variance 116
4.7.3 A decomposition of the sum of squares* 117
4.8 Reparametrization 118
4.9 Linear restrictions 120
4.10 Nuisance parameters 126
4.11 Information matrix and Cramer-Rao bound 128
4.11.1 The case of normal distribution* 128
4.11.2 The symmetric non-normal case* 131
4.12 Collinearity in the linear model* 134
4.13 Exercises 137
Chapter 5 Further Inference in the Linear Model 147
5.1 Distribution of the estimators 147
5.2 Confidence regions 148
5.2.1 Confidence interval for a single LPF 148
5.2.2 Confidence region for a vector LPF 150
5.2.3 Simultaneous confidence intervals* 151
5.2.4 Confidence band for regression surface* 156
5.3 Tests of linear hypotheses 158
5.3.1 Testability of linear hypotheses* 158
5.3.2 Hypotheses with a single degree of freedom 162
5.3.3 Decomposing the sum of squares 164
5.3.4 Generalized likelihood ratio test and ANOVA table 166
5.3.5 Special cases 168
5.3.6 Power of the generalized likelihood ratio test* 171
5.3.7 Multiple comparisons* 172
5.3.8 Nested hypotheses 173
5.4 Prediction in the linear model 174
5.4.1 Best linear unbiased predictor 175
5.4.2 Prediction interval 176
5.4.3 Simultaneous prediction intervals* 179
5.4.4 Tolerance interval* 179
5.5 Consequences of collinearity* 182
5.6 Exercises 185
Chapter 6 Analysis of Variance in Basic Designs 191
6.1 Optimal design 193
6.2 One-way classified data 194
6.2.1 The model 194
6.2.2 Estimation of model parameters 196
6.2.3 Analysis of variance 199
6.2.4 Multiple comparisons of group means 200
6.3 Two-way classified data 202
6.3.1 Single observation per cell 203
6.3.2 Interaction in two-way classified data 207
6.3.3 Multiple observations per cell: balanced data 212
6.3.4 Unbalanced data 216
6.4 Multiple treatment/block factors 219
6.5 Nested models 220
6.6 Analysis of covariance 224
6.6.1 The model 224
6.6.2 Uses of the model 224
6.6.3 Estimation of parameters 226
6.6.4 Tests of hypotheses 228
6.6.5 ANCOVA table and adjustment for covariate* 229
6.7 Exercises 232
Chapter 7 General Linear Model 243
7.1 Why study the singular model? 244
7.2 Special considerations with singular models 246
7.2.1 Checking for model consistency* 246
7.2.2 LUE, LZF, estimability and identifiability* 248
7.3 Best linear unbiased estimation 251
7.3.1 BLUE, fitted values and residuals 251
7.3.2 Dispersions 255
7.3.3 The nonsingular case 257
7.4 Estimation of error variance 258
7.5 Maximum likelihood estimation 261
7.6 Weighted least squares estimation 263
7.7 Some recipes for obtaining the BLUE 265
7.7.1 'Unified theory' of least squares estimation* 266
7.7.2 The inverse partitioned matrix approach* 268
7.7.3 A constrained least squares approach* 271
7.8 Information matrix and Cramer-Rao bound* 272
7.9 Effect of linear restrictions 275
7.9.1 Linear restrictions in the general linear model 275
7.9.2 Improved estimation through restrictions 278
7.9.3 Stochastic restrictions* 279
7.9.4 Inequality constraints* 282
7.10 Model with nuisance parameters 284
7.11 Tests of hypotheses 287
7.12 Confidence regions 290
7.13 Prediction 291
7.13.1 Best linear unbiased predictor 291
7.13.2 Prediction and tolerance intervals 293
7.13.3 Inference through finite population sampling* 294
7.14 Exercises 297
Chapter 8 Misspecified or Unknown Dispersion 305
8.1 Misspecified dispersion matrix 306
8.1.1 When dispersion misspecification can be tolerated* 308
8.1.2 Efficiency of least squares estimators* 315
8.1.3 Effect on the estimated variance of LSEs* 322
8.2 Unknown dispersion: the general case 324
8.2.1 An estimator based on prior information* 324
8.2.2 Maximum likelihood estimator 326
8.2.3 Translation invariance and REML 328
8.2.4 A two-stage estimator 330
8.3 Mixed effects and variance components 332
8.3.1 Identifiability and estimability 333
8.3.2 ML and REML methods 336
8.3.3 ANOVA methods 342
8.3.4 Minimum norm quadratic unbiased estimator 345
8.3.5 Best quadratic unbiased estimator 351
8.3.6 Further inference in the mixed model* 352
8.4 Other special cases with correlated error 353
8.4.1 Serially correlated observations 353
8.4.2 Models for spatial data 356
8.5 Special cases with uncorrelated error 358
8.5.1 Combining experiments: meta-analysis 358
8.5.2 Systematic heteroscedasticity 361
8.6 Some problems of signal processing 362
8.7 Exercises 364
Chapter 9 Updates in the General Linear Model 371
9.1 Inclusion of observations 372
9.1.1 A simple case 372
9.1.2 General case: linear zero functions gained* 375
9.1.3 General case: update equations* 378
9.1.4 Application to model diagnostics 383
9.1.5 Design augmentation* 385
9.1.6 Recursive prediction and Kalman filter* 389
9.2 Exclusion of observations 397
9.2.1 A simple case 398
9.2.2 General case: linear zero functions lost* 400
9.2.3 General case: update equations* 402
9.2.4 Deletion diagnostics 404
9.2.5 Missing plot substitution* 407
9.3 Exclusion of explanatory variables 410
9.3.1 A simple case 412
9.3.2 General case: linear zero functions gained* 413
9.3.3 General case: update equations* 415
9.3.4 Consequences of omitted variables 417
9.3.5 Sequential linear restrictions* 417
9.4 Inclusion of explanatory variables 417
9.4.1 A simple case 418
9.4.2 General case: linear zero functions lost* 419
9.4.3 General case: update equations* 421
9.4.4 Application to regression model building 422
9.5 Data exclusion and variable inclusion* 423
9.6 Exercises 424
Chapter 10 Multivariate Linear Model 429
10.1 Description of the multivariate linear model 430
10.2 Best linear unbiased estimation 431
10.3 Unbiased estimation of error dispersion 435
10.4 Maximum likelihood estimation 439
10.4.1 Estimator of mean 439
10.4.2 Estimator of error dispersion 440
10.4.3 REML estimator of error dispersion 441
10.5 Effect of linear restrictions 442
10.5.1 Effect on estimable LPFs, LZFs and BLUEs 442
10.5.2 Change in error sum of squares and products 443
10.5.3 Change in 'BLUE' and mean squared error matrix 444
10.6 Tests of linear hypotheses 445
10.6.1 Generalized likelihood ratio test 445
10.6.2 Roy's union-intersection test 448
10.6.3 Other tests 449
10.6.4 A more general hypothesis 453
10.6.5 Multiple comparisons 454
10.6.6 Test for additional information 455
10.7 Linear prediction and confidence regions 457
10.8 Applications 460
10.8.1 One-sample problem 460
10.8.2 Two-sample problem 460
10.8.3 Multivariate ANOVA 461
10.8.4 Growth models 462
10.9 Exercises 463
Chapter 11 Linear Inference — Other Perspectives 469
11.1 Foundations of linear inference 470
11.1.1 General theory 470
11.1.2 Basis set of BLUEs 476
11.1.3 A decomposition of the response 479
11.1.4 Estimation and error spaces 483
11.2 Admissible, Bayes and minimax linear estimators 486
11.2.1 Admissible linear estimator 486
11.2.2 Bayes linear estimator 489
11.2.3 Minimax linear estimator 492
11.3 Biased estimators with smaller dispersion 500
11.3.1 Subset estimator 500
11.3.2 Principal components estimator 503
11.3.3 Ridge estimator 506
11.3.4 Shrinkage estimator 508
11.4 Other linear estimators 510
11.4.1 Best linear minimum bias estimator 510
11.4.2 'Consistent' estimator 512
11.5 A geometric view of BLUE in the Linear Model 512
11.5.1 The homoscedastic case 513
11.5.2 The effect of linear restrictions 515
11.5.3 The general linear model 516
11.6 Large sample properties of estimators 521
11.7 Exercises 525
Solutions to Odd-Numbered Exercises 533
Bibliography and Author Index 587
Index 607
Glossary of Abbreviations

(Abbreviations which are used in more than one section of the book)

Abbreviation  Full form  Described in page
ALE     admissible linear estimator  486
ANOVA   analysis of variance  167
ANCOVA  analysis of covariance  230
AR      autoregressive  10
ARMA    autoregressive moving average  10
BLE     Bayes linear estimator  489
BLP     best linear predictor  62
BLUE    best linear unbiased estimator  102
BLUP    best linear unbiased predictor  175
CRD     completely randomized design  195
GLRT    generalized likelihood ratio test  84
LPF     linear parametric function  94
LSE     least squares estimator  100
LUE     linear unbiased estimator  94
LZF     linear zero function  94
MILE    minimax linear estimator  492
MINQUE  minimum norm quadratic unbiased estimator  347
ML      maximum likelihood  74
MLE     maximum likelihood estimator  74
MSE     mean squared error  71
MSEP    mean squared error of prediction  174
NLZF    normalized linear zero function  435
RBD     randomized block design  203
REML    restricted/residual maximum likelihood  329
SSE     sum of squares due to error  115
UMVUE   uniformly minimum variance unbiased estimator  72
WLSE    weighted least squares estimator  263
Glossary of Matrix Notations

Notation  Meaning  Defined in page
a_{ij}       (i, j)th element of the matrix A  23
((a_{ij}))   the matrix whose (i, j)th element is a_{ij}  23
AB           product of the matrix A with the matrix B  24
A'           transpose of the matrix A  25
tr(A)        trace of the matrix A  25
|A|          determinant of the matrix A  44
vec(A)       vector obtained by successively concatenating the columns of the matrix A  26
ρ(A)         rank of the matrix A  27
A^{-L}       a left-inverse of the matrix A  28
A^{-R}       a right-inverse of the matrix A  28
A^{-1}       inverse of the matrix A  28
A^{-}        a g-inverse of the matrix A  28
A^{+}        Moore-Penrose inverse of the matrix A  29
C(A)         column space of the matrix A  36
C(A)^⊥       orthogonal complement of C(A)  32
dim(C(A))    dimension of C(A)  32
P_A          orthogonal projection matrix of C(A)  39
λ_max(A)     largest eigenvalue of the symmetric matrix A  42
λ_min(A)     smallest eigenvalue of the symmetric matrix A  42
||A||        largest singular value of the matrix A  51
||A||_F      Frobenius (Euclidean) norm of the matrix A  28
A ⊗ B        Kronecker product of A with B  26
I            identity matrix  25
0            matrix of zeroes  25
1            (column) vector of ones  25
Chapter 1
Introduction
It is in human nature to try and understand the physical and natural phenomena that occur around us. When observations on a phenomenon can be quantified, such an attempt at understanding often takes the form of building a mathematical model, even if it is only a simplistic attempt to capture the essentials. Either because of our ignorance or in order to keep it simple, many relevant factors may be left out. Also, models need to be validated through measurement, and such measurements often come with error. In order to account for the measurement or observational errors, as well as the factors that may have been left out, one needs a statistical model which incorporates some amount of uncertainty.
1.1 The linear model
An important question that one often tries to answer through statistical models is the following: how can an observed quantity y be explained by a number of other quantities, x_1, x_2, ..., x_p? Perhaps the simplest model that is used to answer this question is the linear model

    y = β_0 + β_1 x_1 + β_2 x_2 + ... + β_p x_p + ε,    (1.1.1)

where β_0, β_1, ..., β_p are constants and ε is an error term that accounts for uncertainties. We shall refer to y as the response variable. It is also referred to as the dependent variable, endogenous variable or criterion variable. We shall refer to x_1, x_2, ..., x_p as explanatory variables. These are also called independent variables or exogenous variables. In the context of some special cases these are called regressors, predictors or factors. The coefficients β_0, β_1, ..., β_p are the parameters of the model. Note that the right-hand side of (1.1.1), which is a linear function of the explanatory variables, can also be viewed as a linear function of the parameters.

Example 1.1.1 A well-known result of optics is Snell's law, which relates the angle of incidence (θ_1) with the angle of refraction (θ_2) when light crosses the boundary between two media. According to this law, sin θ_2 = κ sin θ_1, where κ is the ratio of the refractive indices of the two media. If the refractive index of one medium is known, the refractive index of the other can be estimated by observing θ_1 and θ_2. However, any measurement will involve some amount of error. Thus, the following special case of the model (1.1.1) can be used:

    y = β_1 x_1 + ε,

where y = sin θ_2 and x_1 = sin θ_1, θ_1 and θ_2 being the measured angles of incidence and refraction, respectively, and β_1 = κ.

Example 1.1.2 The hospital bill of a patient is likely to be bigger if the patient has to spend more days in the hospital. The bill also depends on several other factors, including the nature of treatment, whether intensive care is needed and so on. Some factors may even be unknown (like the hospital's greed!). A simple model that can be used here is

    y = β_0 + β_1 x_1 + β_2 x_2 + ε,

where y is the amount of the hospital bill, x_1 is the duration of stay in the hospital (excluding stay at the intensive care unit) and x_2 is the duration of stay in the intensive care unit. The error term ε represents all the factors that are not specifically included, such as the nature of treatments and tests and variation from one hospital to another. The above model is again a special case of (1.1.1).
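A model of the form used in Example 1.1.2 can be illustrated with a short simulation. The sketch below is not part of the book; the coefficients, sample size and noise level are all invented for illustration, and the ordinary least squares fit (anticipating the estimation theory of Chapter 4) is computed with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data in the spirit of Example 1.1.2: hospital bill (y)
# explained by days of ordinary stay (x1) and days of intensive care (x2).
n = 100
x1 = rng.integers(1, 15, size=n)           # ordinary stay, in days
x2 = rng.integers(0, 5, size=n)            # intensive-care stay, in days
beta = np.array([200.0, 350.0, 2000.0])    # invented "true" (beta0, beta1, beta2)
e = rng.normal(scale=300.0, size=n)        # error term, all omitted factors
y = beta[0] + beta[1] * x1 + beta[2] * x2 + e

# Least squares fit of y = beta0 + beta1*x1 + beta2*x2 + e
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # estimates should be close to (200, 350, 2000)
```

With more observations or less noisy errors, the estimates concentrate more tightly around the invented true coefficients.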
Example 1.1.3 The height of an adult person varies from one homogeneous ethnic group to another. It also depends on the gender of the person. A comparison of two groups in terms of height can be made on the basis of the following model, which again is a special case of (1.1.1):

    y = β_0 + β_1 x_1 + β_2 x_2 + ε,

where y is the measured height of an adult, x_1 is a binary variable representing the ethnic group and x_2 is another binary variable representing the gender. The error term (ε) represents a combination of measurement error and the variation in heights that exists among the adults of a particular gender in a given ethnic group.

Example 1.1.4 The yield of tea in an acre of tea plantation depends on various types of agricultural practices (treatments). An experiment may be planned where various plots are subjected to one out of two possible treatments over a period of time. The yield of tea before the application of treatment is also recorded. A model for post-treatment yield (y) is

    y = β_0 + β_1 x_1 + β_2 x_2 + ε,

where the binary variable x_1 represents the treatment type and the real-valued variable x_2 is the pre-treatment yield. The error term mainly consists of unaccounted factors. Inclusion of x_2 is meant to reduce the effect of unaccounted factors such as soil type or the inherent differences in tea bushes.

1.2 Why a linear model?
The model (1.1.1) is just one of many possible models that can be used to explain the response in terms of the explanatory variables. Some of the reasons why we undertake a detailed study of the linear model are as follows.

(a) Because of its simplicity, the linear model is better understood and easier to interpret than most of the other competing models, and the methods of analysis and inference are better developed. Therefore, if there is no particular reason to presuppose another model, the linear model may be used at least as a first step.
(b) The linear model formulation is useful even for certain nonlinear models which can be reduced to the form (1.1.1) by means of a transformation. Examples of such models are given in Section 1.4.
(c) Results obtained for the linear model serve as a stepping stone for the analysis of a much wider class of related models such as mixed effects model, state-space and other time series models. These are outlined in Section 1.5.
(d) Suppose that the response is modelled as a nonlinear function of the explanatory variables plus error. In many practical situations only a part of the domain of this function is of interest. For example, in a manufacturing process, one is interested in a narrow region centered around the operating point. If the above function is reasonably smooth in this region, a linear model serves as a good first order approximation to what is globally a nonlinear model.
(e) Certain probability models for the response and the explanatory variables imply that the response should be centered around a linear function of the explanatory variables. If there is any reason to believe in a probability model of this kind, the linear model is the natural choice. An important example of such a probability model is the multivariate normal distribution. Sometimes the assumption of this distribution is justified by invoking the central limit theorem, particularly when the variables are themselves aggregates or averages of a large collection of other quantities.

1.3 Description of the linear model and notations
If one uses (1.1.1) for a set of n observations of the response and explanatory variables, the explicit form of the equations would be

    y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ... + β_p x_{ip} + ε_i,    i = 1, ..., n,    (1.3.1)

where for each i, y_i is the ith observation of the response, x_{ij} is the ith observation of the jth explanatory variable (j = 1, 2, ..., p), and ε_i is the unobservable error corresponding to this observation. This set of n equations can be written in the following compact form by using matrices and vectors:

    y = Xβ + ε.    (1.3.2)

In this model,

    y = (y_1, y_2, ..., y_n)',
    X = the n × (p+1) matrix whose ith row is (1, x_{i1}, x_{i2}, ..., x_{ip}),
    β = (β_0, β_1, ..., β_p)',
    ε = (ε_1, ε_2, ..., ε_n)'.
In order to complete the description of the model, some assumptions about the nature of the errors are necessary. It is assumed that the errors have zero mean and their variances and covariances are known up to a scale factor. These assumptions are summarized in the matrix-vector form as

    E(ε) = 0,    D(ε) = σ²V,    (1.3.3)

where the notation E stands for expected value and D represents the dispersion (or variance-covariance) matrix. The vector 0 denotes a vector with zero elements (in this case n elements) and V is a known matrix of order n × n. The parameter σ² is unspecified, along with the vector parameter β. The elements of β are real-valued, while ...

    ... tr(A_{m×n} B_{n×m}) = Σ_{i=1}^{m} Σ_{j=1}^{n} a_{ij} b_{ji}.
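To make the assumptions (1.3.3) concrete, one realization of the model (1.3.2) can be simulated with a known dispersion pattern V and an unknown scale σ². This sketch is not from the book; the particular V (an exponentially decaying correlation) and all numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# The model y = X beta + e with E(e) = 0, D(e) = sigma^2 * V.
# Dimensions, coefficients and the correlation pattern are invented.
n, p = 6, 2
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])  # intercept column of ones
beta = np.array([1.0, 2.0, -0.5])

sigma2 = 0.25
# A known dispersion structure: correlation 0.5^|i-j| between errors i and j
V = 0.5 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
e = rng.multivariate_normal(np.zeros(n), sigma2 * V)  # errors with zero mean

y = X @ beta + e  # one realization of the linear model
```

The case V = I recovers the uncorrelated, equal-variance errors studied in Chapters 4-6; a general V leads to the general linear model of Chapter 7.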
A matrix with a single column is called a column vector, while a matrix with a single row is called a row vector. Throughout this book, when we simply refer to a vector it should be understood that it is a column vector. We shall denote a vector with a bold and lowercase Roman or Greek letter, such as a or α. We shall use the corresponding lowercase (non-bold) letter to represent an element of a vector, with the location specified as subscript. Thus, a_i and α_i are the ith elements (or components) of the vectors a and α, respectively. If the order of a vector is n × 1, then we shall call it a vector of order n for brevity. A matrix with a single row and a single column is a scalar, which we shall denote by a lowercase Roman or Greek letter, such as a and α. We shall use special notations for a few frequently used matrices. A matrix having all the elements equal to zero will be denoted by 0, regardless of the order. Therefore, a vector of 0s will also be denoted by 0. A non-trivial vector or matrix is one which is not identically equal to 0. The notation 1 will represent a vector of 1s (every element equal to 1). A square, diagonal matrix with all the diagonal elements equal to 1 is called an identity matrix, and is denoted by I. It can be verified that A + 0 = A, A0 = 0A = 0, and AI = IA = A
for 0 and I of appropriate order. Often a few contiguous rows and/or columns of a matrix are identified as blocks. For instance, we can partition I_{5×5} into four blocks as

    I_{5×5} = ( I_{3×3}  0_{3×2} )
              ( 0_{2×3}  I_{2×2} ).

Sometimes the blocks of a matrix can be operated with as if they were single elements. For instance, it can be easily verified that

    ( A_{m×n_1}  B_{m×n_2}  C_{m×n_3} ) ( u_{n_1×1}', v_{n_2×1}', w_{n_3×1}' )' = Au + Bv + Cw.
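The partitioned multiplication identity above can be checked numerically; the block sizes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Partitioned multiplication: (A B C)(u', v', w')' = A u + B v + C w,
# with conformable (invented) block sizes.
m, n1, n2, n3 = 4, 2, 3, 1
A = rng.standard_normal((m, n1))
B = rng.standard_normal((m, n2))
C = rng.standard_normal((m, n3))
u = rng.standard_normal((n1, 1))
v = rng.standard_normal((n2, 1))
w = rng.standard_normal((n3, 1))

# Multiply the assembled blocks, then compare with the block-wise sum
left = np.block([A, B, C]) @ np.vstack([u, v, w])
right = A @ u + B @ v + C @ w
print(np.allclose(left, right))  # True
```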
The Kronecker product of two matrices A_{m x n} and B_{p x q}, denoted by A ⊗ B = ((a_ij B)), is a partitioned mp x nq matrix with a_ij B as its (i, j)th block. This product is found to be very useful in the manipulation of matrices with special block structure. It follows from the definition of the Kronecker product that

(a) (A_1 + A_2) ⊗ B = A_1 ⊗ B + A_2 ⊗ B;
(b) A ⊗ (B_1 + B_2) = A ⊗ B_1 + A ⊗ B_2;
(c) (A_1 A_2) ⊗ (B_1 B_2) = (A_1 ⊗ B_1)(A_2 ⊗ B_2);
(d) (A ⊗ B)' = A' ⊗ B';
(e) A_{m x n} b = (b' ⊗ I_{m x m}) vec(A), where vec(A) is the vector obtained by successively concatenating the columns of A.
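These identities are easy to spot-check numerically. The sketch below (assuming NumPy is available; the matrices and the seed are arbitrary illustrations, not from the text) verifies properties (a)-(e) using np.kron:

```python
import numpy as np

# Arbitrary example matrices (not from the book), sized so all products exist.
rng = np.random.default_rng(42)
A  = rng.standard_normal((2, 3)); A2 = rng.standard_normal((2, 3))
B  = rng.standard_normal((3, 2)); B2 = rng.standard_normal((3, 2))
C  = rng.standard_normal((3, 4)); D  = rng.standard_normal((2, 5))
b  = rng.standard_normal(3)

def vec(M):
    # stack the columns of M into a single vector (column-major order)
    return M.reshape(-1, order='F')

# (a), (b): distributivity in each argument
assert np.allclose(np.kron(A + A2, B), np.kron(A, B) + np.kron(A2, B))
assert np.allclose(np.kron(A, B + B2), np.kron(A, B) + np.kron(A, B2))
# (c): mixed-product rule
assert np.allclose(np.kron(A @ C, B @ D), np.kron(A, B) @ np.kron(C, D))
# (d): transpose distributes over the Kronecker product
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
# (e): A b = (b' ⊗ I_m) vec(A)
assert np.allclose(A @ b, np.kron(b, np.eye(2)) @ vec(A))
print("all Kronecker identities verified")
```

Note that A ⊗ B here has order mp x nq = 6 x 6, matching the block description above.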
A set of vectors {v_1, ..., v_k} is called linearly dependent if there is a set of real numbers {a_1, ..., a_k}, not all zero, such that Σ_{i=1}^k a_i v_i = 0. If a set of vectors is not linearly dependent, it is called linearly independent. Thus, all the columns of a matrix A are linearly independent if there is no non-trivial vector b such that the linear combination of the columns, Ab, equals 0. All the rows of A are linearly independent if there is no non-trivial vector c such that the linear combination of the rows, c'A, equals 0.

The column rank of a matrix A is the maximum number of linearly independent columns of A. Likewise, the row rank is the maximum number of its rows that are linearly independent. If the column rank of a matrix A_{m x n} is equal to n, the matrix is called full column rank, and it is called full row rank if the row rank equals m. A matrix which is neither full row rank nor full column rank is called rank-deficient. An important result of matrix theory is that the row rank of any matrix is equal to its column rank (see, e.g., Rao and Bhimasankaram (1992, p.107) for a proof). This number is called the rank of the corresponding matrix. We denote the rank of the matrix A by ρ(A). Obviously, ρ(A) ≤ min{m, n}. A square matrix B_{n x n} is called full rank or nonsingular if ρ(B) = n. If ρ(B_{n x n}) < n, then B is called a singular matrix. Thus, a singular matrix is a square matrix which is rank-deficient.

The inner product of two vectors a and b, having the same order n, is defined as the matrix product a'b = Σ_{i=1}^n a_i b_i. This happens to be a scalar, and is identical to b'a. We define the norm of a vector a as ||a|| = (a'a)^{1/2}. If ||a|| = 1, then a is called a vector with unit norm, or simply a unit vector. For a vector v_{n x 1} and a matrix A_{n x n}, the scalar

    v'Av = Σ_{i=1}^n Σ_{j=1}^n a_ij v_i v_j
is called a quadratic form in v. Note that v'Av = v'(½(A + A'))v. Since a non-symmetric matrix A in a quadratic form can always be replaced by its symmetric version ½(A + A') without changing the value of the quadratic form, there is no loss of generality in assuming A to be symmetric. The quadratic form v'Av is characterized by the symmetric matrix A, which is called the matrix of the quadratic form. Such a matrix is called

(a) positive definite if v'Av > 0 for all v ≠ 0;
(b) negative definite if v'Av < 0 for all v ≠ 0;
(c) nonnegative definite if v'Av ≥ 0 for all v; and
(d) positive semidefinite if v'Av ≥ 0 for all v and v'Av = 0 for some v ≠ 0.
A positive definite matrix is nonsingular, while a positive semidefinite matrix is singular (see Exercise 2.3). Saying that a matrix is nonnegative definite is equivalent to saying that it is either positive definite or positive semidefinite.

If A_{n x n} is a positive definite matrix, then one can define a general inner product between the pair of order-n vectors a and b as a'Ab. The corresponding generalized norm of a would be (a'Aa)^{1/2}. The fact that A is a positive definite matrix ensures that a'Aa is always positive, unless a = 0.

A useful construct for matrices is the vector formed by stacking the consecutive columns of a matrix. We denote the vector constructed from the matrix A in this manner by vec(A). It is easy to see that tr(AB) = vec(A')'vec(B). The number ||vec(A)|| is called the Frobenius norm of the matrix A, and is denoted by ||A||_F. Note that ||A||_F² is the sum of squares of all the elements of the matrix A. The Frobenius norm is also referred to as the 'Euclidean norm' in statistical literature.
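The identity tr(AB) = vec(A')'vec(B) and the characterization of the Frobenius norm can be illustrated numerically. The following sketch assumes NumPy and uses arbitrary matrices (not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

def vec(M):
    # stack the columns of M (column-major order)
    return M.reshape(-1, order='F')

# tr(AB) = vec(A')' vec(B)
assert np.allclose(np.trace(A @ B), vec(A.T) @ vec(B))

# ||A||_F = ||vec(A)||, and ||A||_F^2 is the sum of squares of the elements
fro = np.linalg.norm(A, 'fro')
assert np.allclose(fro, np.linalg.norm(vec(A)))
assert np.allclose(fro**2, (A**2).sum())
```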
2.2 Inverses and generalized inverses
If AB = I, then B is called a right-inverse of A, and A is called a left-inverse of B. We shall denote a right-inverse of A by A^{-R}. It exists only when A is of full row rank. Likewise, we shall denote a left-inverse of B, which exists only when B is of full column rank, by B^{-L}. Even when a right-inverse or a left-inverse exists, it may not be unique. For a rectangular matrix A_{m x n}, the rank condition indicates that there cannot be a right-inverse when m > n, and there cannot be a left-inverse when m < n. As a matter of fact, both the inverses exist if and only if the matrix A is square and full rank. In such a case, A^{-L} and A^{-R} happen to be unique and equal to each other (this follows from Theorem 2.1.1 of Rao and Mitra, 1971). This special matrix is called the inverse of the nonsingular matrix A, and is denoted by A^{-1}. By definition, the inverse exists and is unique if and only if A is nonsingular, and AA^{-1} = A^{-1}A = I. If A and B are both nonsingular with the same order, then (AB)^{-1} = B^{-1}A^{-1}.

A matrix B is called a generalized inverse or g-inverse of A if ABA = A. A g-inverse of A is denoted by A^-. Obviously, if A has order m x n, then A^- must have the order n x m. Every matrix
has at least one g-inverse. Every symmetric matrix has at least one symmetric g-inverse (see Exercise 2.5). It is easy to see that if A has either a left-inverse or a right-inverse, then the same is also a g-inverse of A. In general, A^- is not uniquely defined. It is unique if and only if A is nonsingular, in which case A^- = A^{-1}. Even though A^-, A^{-L} and A^{-R} are not uniquely defined in general, we often work with these notations anyway. However, we use these notations only in those expressions where the specific choice of A^-, A^{-L} or A^{-R} does not matter.

We have just noted that the matrix A has an inverse if and only if it is square and nonsingular. Therefore a nonsingular matrix is also called an invertible matrix. Every other (non-invertible) matrix has a g-inverse that is necessarily non-unique. It can be shown that for every matrix A there is a unique matrix B having the properties

(a) ABA = A,
(b) BAB = B,
(c) AB = (AB)' and
(d) BA = (BA)'.
Property (a) indicates that B is a g-inverse of A. This special g-inverse is called the Moore-Penrose inverse of A, and is denoted by A^+. When A is invertible, A^+ = A^{-1}. When A is a square and diagonal matrix, A^+ is obtained by replacing the non-zero diagonal elements of A by their respective reciprocals.

Example 2.2.1 Let A be a 3 x 2 matrix with linearly independent columns, and let B and C be two distinct 2 x 3 matrices satisfying BA = CA = I_{2x2}. Then B and C are distinct choices of A^{-L}. Likewise, B' and C' are right-inverses of A'. At most one of the two can be the Moore-Penrose inverse of A: if B is the Moore-Penrose inverse of A, then C is not, since AC is not a symmetric matrix. □

If A is invertible and A^{-1} = A', then A is called an orthogonal matrix. If A is of full column rank and A' is a left-inverse of A, then A is said to be semi-orthogonal. If a_1, ..., a_n are the columns of a semi-orthogonal matrix, then a_i'a_j = 0 for i ≠ j, while a_i'a_i = 1. A semi-orthogonal matrix happens to be orthogonal if it is square. The following inversion formulae are useful for small-scale computations.
For a scalar a and a vector a,

    a^+ = a^{-1} if a ≠ 0,              a^+ = 0 if a = 0;
    a^+ = (a'a)^{-1}a' if ||a|| > 0,    a^+ = 0 if ||a|| = 0.

For a general matrix A,

    A^+ = lim_{δ→0} (A'A + δ²I)^{-1}A' = lim_{δ→0} A'(AA' + δ²I)^{-1};
    (A'A)^+ = A^+(A^+)'.

The third formula is proved in Albert (1972, p.19). The other formulae are proved by direct verification. Two other formulae are given in Proposition 2.5.2. See Golub and Van Loan (1996) for numerically stable methods for computing the Moore-Penrose inverse. Possible choices of the right- and left-inverses, when they exist, are

    A^{-L} = (A'A)^{-1}A';    A^{-R} = A'(AA')^{-1}.
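For a matrix of full column rank, the left-inverse (A'A)^{-1}A' coincides with the Moore-Penrose inverse, and the limit formula can be checked with a small δ. A sketch assuming NumPy, with an arbitrary example matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))            # full column rank (almost surely)

left_inv = np.linalg.inv(A.T @ A) @ A.T    # A^{-L} = (A'A)^{-1} A'
assert np.allclose(left_inv @ A, np.eye(3))            # it is a left-inverse
assert np.allclose(left_inv, np.linalg.pinv(A))        # and equals A^+

# limit formula: A^+ = lim_{delta -> 0} (A'A + delta^2 I)^{-1} A'
delta = 1e-6
approx = np.linalg.inv(A.T @ A + delta**2 * np.eye(3)) @ A.T
assert np.allclose(approx, np.linalg.pinv(A), atol=1e-8)

# (A'A)^+ = A^+ (A^+)'
assert np.allclose(np.linalg.pinv(A.T @ A),
                   np.linalg.pinv(A) @ np.linalg.pinv(A).T)
```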
In fact, these choices of left- and right-inverses are Moore-Penrose inverses. If A^- is a particular g-inverse of A, other g-inverses can be expressed as A^- + B - A^-ABAA^-, where B is an arbitrary matrix of appropriate order. This is a characterization of all g-inverses of A (see Rao, 1973c, p.25).

We conclude this section with inversion formulae for matrices with some special structure. It follows from the definition and properties of the Kronecker product of matrices (see Section 2.1) that a generalized inverse of A ⊗ B is

    (A ⊗ B)^- = A^- ⊗ B^-,
where A^- and B^- are any g-inverses of A and B, respectively. It can be verified by direct substitution that if A, C and A + BCD are nonsingular matrices, then C^{-1} + DA^{-1}B is also nonsingular and

    (A + BCD)^{-1} = A^{-1} - A^{-1}B(C^{-1} + DA^{-1}B)^{-1}DA^{-1}.
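This identity (often called the Sherman-Morrison-Woodbury formula in the numerical linear algebra literature) can be verified by direct computation; the matrices below are arbitrary, scaled so that all the required inverses exist:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 5, 2
A = rng.standard_normal((n, n)) + 10 * np.eye(n)   # comfortably nonsingular
C = rng.standard_normal((k, k)) + 5 * np.eye(k)
B = rng.standard_normal((n, k))
D = rng.standard_normal((k, n))

Ai = np.linalg.inv(A)
Ci = np.linalg.inv(C)
# right-hand side of the inversion formula
rhs = Ai - Ai @ B @ np.linalg.inv(Ci + D @ Ai @ B) @ D @ Ai
assert np.allclose(np.linalg.inv(A + B @ C @ D), rhs)
```

The practical appeal of the formula is that only a k x k matrix (here 2 x 2) needs to be inverted when A^{-1} is already known.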
Consider the square matrix

    M = ( A  B )
        ( C  D ).

If M and A are both nonsingular, then

    M^{-1} = ( A^{-1} + A^{-1}BT^{-1}CA^{-1}   -A^{-1}BT^{-1} )
             ( -T^{-1}CA^{-1}                   T^{-1}        ),

where T = D - CA^{-1}B, which must be a nonsingular matrix (see Exercise 2.7). If C = B' and M is a symmetric and nonnegative definite matrix, then a g-inverse of M is

    M^- = ( A^- + A^-BT^-B'A^-   -A^-BT^- )
          ( -T^-B'A^-             T^-     ),

where T = D - B'A^-B, which is a nonnegative definite matrix (see Exercise 2.19). The proofs of these results may be found in Rao and Bhimasankaram (1992, pp. 138, 347).

2.3 Vector space and projection
For our purpose, a vector space is a nonempty set S of vectors having a fixed number of components, such that if u ∈ S and v ∈ S, then au + bv ∈ S for any pair of real numbers a and b. If the vectors in S have order n, then S ⊆ ℝ^n.
If S_1 and S_2 are two vector spaces containing vectors of the same order, then the intersection S_1 ∩ S_2 contains all the vectors that belong to both the spaces. Every vector space contains the vector 0. If S_1 ∩ S_2 = {0}, then S_1 and S_2 are said to be virtually disjoint. It is easy to see that S_1 ∩ S_2 is itself a vector space. However, the union S_1 ∪ S_2 is not necessarily a vector space. The smallest vector space that contains the set S_1 ∪ S_2 is called the sum of the two spaces, and is denoted by S_1 + S_2. It consists of all the vectors of the form u + v where u ∈ S_1 and v ∈ S_2.

A vector u is said to be orthogonal to another vector v (having the same order) if u'v = 0. If a vector is orthogonal to all the vectors in the vector space S, then it is said to be orthogonal to S. If S_1 and S_2 are two vector spaces such that every vector in S_1 is orthogonal to S_2 (and vice versa), then the two spaces are said to be orthogonal to each other. Two vector spaces which are orthogonal to each other must be virtually disjoint, but the converse is not true. The sum of two spaces which are orthogonal to each other is called the direct sum. In order to distinguish it from the sum, the symbol '+' is replaced by '⊕.' Thus, when S_1 and S_2 are orthogonal to each other, S_1 + S_2 can be written as S_1 ⊕ S_2. If S_1 ⊕ S_2 = ℝ^n, then S_1 and S_2 are called orthogonal complements of each other. We then write S_1 = S_2^⊥ and S_2 = S_1^⊥. Clearly, (S^⊥)^⊥ = S.

A set of vectors {u_1, ..., u_k} is called a basis of the vector space S if (a) u_i ∈ S for i = 1, ..., k, (b) the set {u_1, ..., u_k} is linearly independent and (c) every member of S is a linear combination of u_1, ..., u_k. Every vector space has a basis, which is in general not unique. However, the number of vectors in any two bases of S is the same (see Exercise 2.8). Thus, the number of basis vectors is a uniquely defined attribute of any given vector space. This number is called the dimension of the vector space.
The dimension of the vector space S is denoted by dim(S). Two different vector spaces may have the same dimension. If S consists of n-component vectors, then dim(S) ≤ n (see Exercise 2.10).
Example 2.3.1 Let

    u_1 = (1, 0, 0)',  u_2 = (1, 1, 0)',  u_3 = (0, 1, 0)',  u_4 = (0, 0, 1)'.

Define

    S_1 = {u : u = a u_1 + b u_2 for any real a and b},
    S_2 = {u : u = a u_1 + b u_4 for any real a and b},
    S_3 = {u : u = a u_4 for any real a},
    S_4 = {u : u = a u_1 for any real a},
    S_5 = {u : u = a u_2 for any real a}.
It is easy to see that S_1, ..., S_5 are vector spaces. A basis of S_1 is {u_1, u_2}. An alternative basis of S_1 is {u_1, u_3}. The pair of vector spaces S_4 and S_5 constitute an example of virtually disjoint spaces which are not orthogonal to each other. The spaces S_1 and S_3 are orthogonal to each other. In fact, S_1 ⊕ S_3 = ℝ³, so that S_3 = S_1^⊥. The intersection between S_1 and S_2 consists of all the vectors which are proportional to u_1, that is, S_1 ∩ S_2 = S_4. In this case, S_1 ∪ S_2 is not a vector space. For instance, u_2 + u_4 is not a member of S_1 ∪ S_2, even though u_2 and u_4 are. The sum, S_1 + S_2, is equal to ℝ³, which is a vector space. Even so, S_1 and S_2 are not orthogonal complements of each other, because they are not orthogonal to each other in the first place.

Note that a set of pairwise orthogonal vectors is linearly independent, but the converse does not hold. If the vectors in a basis set are orthogonal to each other, this special basis set is called an orthogonal basis. If, in addition, the vectors have unit norm, then the basis is called an orthonormal basis. For instance, {u_1, u_3} is an orthonormal basis for S_1 in Example 2.3.1. Given any basis set, one can always construct an orthogonal or orthonormal basis out of it, such that the new basis spans the same vector space. Gram-Schmidt orthogonalization (see Golub and Van Loan, 1996) is a sequential method for such a conversion.

A few important results on vector spaces are given below.

Proposition 2.3.2 Suppose S_1 and S_2 are two vector spaces.

(a) dim(S_1 ∩ S_2) + dim(S_1 + S_2) = dim(S_1) + dim(S_2).
(b) (S_1 + S_2)^⊥ = S_1^⊥ ∩ S_2^⊥.
(c) If S_1 ⊆ S_2 and dim(S_1) = dim(S_2), then S_1 = S_2.
Proof. See Exercise 2.9. □

Note that part (b) of the above proposition also implies that (S_1 ∩ S_2)^⊥ = S_1^⊥ + S_2^⊥. For any vector space S containing vectors of order n, we have S ⊕ S^⊥ = ℝ^n. Hence, every vector v of order n can be decomposed as

    v = v_1 + v_2,
where v_1 ∈ S and v_2 ∈ S^⊥. Thus, the two parts belong to mutually orthogonal spaces, and are orthogonal to each other. This is called an orthogonal decomposition of the vector v. The vector v_1 is called the projection of v on S. The projection of a vector on a vector space is uniquely defined (see Exercise 2.11).

A matrix P is called a projection matrix for the vector space S if Pv = v for all v ∈ S and Pv ∈ S for all v of appropriate order. In such a case, Pv is the projection of v on S for all v. Since PPv = Pv for any v, the matrix P satisfies the property P² = P. Square matrices having this property are called idempotent matrices. Every projection matrix is necessarily an idempotent matrix. If P is an idempotent matrix, it is easy to see that I - P is also idempotent. If P is a projection matrix of the vector space S such that I - P is a projection matrix of S^⊥, then P is called the orthogonal projection matrix for S. Every vector space has a unique orthogonal projection matrix, although it may have other projection matrices (see Exercise 2.12). Henceforth we shall denote the orthogonal projection matrix of S by P_S. An orthogonal projection matrix is not only idempotent but also symmetric. Conversely, every symmetric and idempotent matrix is an orthogonal projection matrix (see Exercise 2.13).

Example 2.3.3
Consider the vector space S_1 of Example 2.3.1, and the matrices

    P_1 = ( 1  0  0 )          P_2 = ( 1  0  1 )
          ( 0  1  0 )   and          ( 0  1  1 ).
          ( 0  0  0 )                ( 0  0  0 )

Notice that P_i u_j = u_j for i = 1, 2, j = 1, 2. Therefore, P_i v = v for any v ∈ S_1 and i = 1, 2. Further, P_i = (u_1 : u_2) T_i, i = 1, 2, where

    T_1 = ( 1  -1  0 )         T_2 = ( 1  -1  0 )
          ( 0   1  0 ),              ( 0   1  1 ).

Therefore, for any v, P_i v = (u_1 : u_2)(T_i v), which is a linear combination of u_1 and u_2 and hence is in S_1. Thus, P_1 and P_2 are both projection matrices of S_1. It can be verified that both are idempotent matrices. Notice that u_4 ∈ S_3 = S_1^⊥. Further, (I - P_1)u_4 = u_4, but (I - P_2)u_4 ≠ u_4. Also, (I - P_1)v is in S_3 for all v. Therefore, P_1 is the orthogonal projection matrix for S_1, while P_2 is not. □

Proposition 2.3.4 If the vectors u_1, ..., u_k constitute an orthonormal basis of a vector space S, then P_S = Σ_{i=1}^k u_i u_i'.
Proof. Let P = Σ_{i=1}^k u_i u_i'. Since any vector v in S is of the form Σ_{i=1}^k a_i u_i, it follows that

    Pv = Σ_{i=1}^k Σ_{j=1}^k u_i u_i' u_j a_j = Σ_{i=1}^k a_i u_i = v.

On the other hand, for a general vector v, Pv = Σ_{i=1}^k (u_i'v) u_i, which is evidently in S. Therefore, P is indeed a projection matrix for S.

We now have to show that I - P is a projection matrix of S^⊥. Let v ∈ S^⊥, so that v'u_j = 0 for j = 1, ..., k. Then

    (I - P)v = v - Σ_{j=1}^k u_j(u_j'v) = v.

Since u_j'P = u_j' for j = 1, ..., k, we have for a general vector v of appropriate order,

    u_j'(I - P)v = 0,    j = 1, ..., k.

Therefore, (I - P)v is orthogonal to S, and is indeed a member of S^⊥. Combining the above results, and using the fact that the orthogonal projection matrix of any vector space is unique (see Exercise 2.12), we have P_S = P. □
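Proposition 2.3.4 can be illustrated numerically by building an orthonormal basis (here via a QR factorization, one standard way of carrying out Gram-Schmidt) and forming P = UU'. A sketch assuming NumPy, with an arbitrary subspace:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 2))   # columns span a 2-dimensional subspace S
U, _ = np.linalg.qr(X)            # orthonormal basis of S (5 x 2)

P = U @ U.T                       # P_S = sum_i u_i u_i'
assert np.allclose(P, P.T)        # symmetric
assert np.allclose(P @ P, P)      # idempotent

v = X @ rng.standard_normal(2)    # an arbitrary vector lying in S
assert np.allclose(P @ v, v)      # P acts as the identity on S

w = rng.standard_normal(5)        # a general vector
# (I - P)w is orthogonal to every basis vector of S
assert np.allclose(U.T @ (np.eye(5) - P) @ w, 0)
```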
2.4 Column space
Let the matrix A have the columns a_1, a_2, ..., a_n. If x is a vector having components x_1, x_2, ..., x_n, then the matrix-vector product

    Ax = x_1 a_1 + x_2 a_2 + ··· + x_n a_n

represents a linear combination of the columns of A. The set of all vectors that may be expressed as linear combinations of the columns of A is a vector space and is called the column space of A. We denote it by C(A). The column space C(A) is said to be spanned by the columns of A. The statement u ∈ C(A) is equivalent to saying that the vector u is of the form Ax where x is another vector. The row space of A is defined as C(A'). We now list a few results which will be useful in the subsequent chapters.

Proposition 2.4.1

(a) C(A : B) = C(A) + C(B).
(b) C(AB) ⊆ C(A).
(c) C(AA') = C(A). Consequently, ρ(AA') = ρ(A).
(d) C(C) ⊆ C(A) only if C is of the form AB for a suitable matrix B.
(e) If C(B) ⊆ C(A), then AA^-B = B, irrespective of the choice of the g-inverse. Similarly, C(B') ⊆ C(A') implies BA^-A = B.
(f) C(B') ⊆ C(A') and C(C) ⊆ C(A) if and only if BA^-C is invariant under the choice of the g-inverse.
(g) B'A = 0 if and only if C(B) ⊆ C(A)^⊥.
(h) dim(C(A)) = ρ(A).
(i) If A has n rows, then dim(C(A)^⊥) = n - ρ(A).
(j) If C(A) ⊆ C(B) and ρ(A) = ρ(B), then C(A) = C(B). In particular, C(I_{n x n}) = ℝ^n.
(k) ρ(AB) ≤ min{ρ(A), ρ(B)}.
(l) ρ(A + B) ≤ ρ(A) + ρ(B).

Proof. Parts (a) and (b) follow directly from the definition of the column space. To prove part (c), note that

    l'AA' = 0  ⇒  l'AA'l = ||A'l||² = 0  ⇒  A'l = 0  ⇒  l ∈ C(A)^⊥.
Thus C(AA')^⊥ ⊆ C(A)^⊥, and consequently, C(A) ⊆ C(AA'). The reverse inclusion follows from part (b). Equating the dimensions (see part (h), proved below), we have ρ(AA') = ρ(A).

To prove part (d), let C = (c_1 : ··· : c_k). Since C(C) ⊆ C(A), c_j ∈ C(A) for j = 1, ..., k. Therefore, for each j between 1 and k, there is a vector b_j such that c_j = Ab_j. It follows that C = AB where B = (b_1 : ··· : b_k).

If C(B) ⊆ C(A), then there is a matrix T such that B = AT. Hence, AA^-B = AA^-AT = AT = B. The other statement of part (e) is proved in a similar manner.

In order to prove part (f), let C(B') ⊆ C(A') and C(C) ⊆ C(A). There are matrices T_1 and T_2 such that B = T_1 A and C = A T_2. If A_1^- and A_2^- are two g-inverses of A, then

    BA_1^-C - BA_2^-C = T_1(AA_1^-A - AA_2^-A)T_2 = T_1(A - A)T_2 = 0.

This proves the invariance of BA^-C under the choice of the g-inverse. In order to prove the converse, consider the g-inverses A_1^- = A^+ and A_2^- = A^+ + K - A^+AKAA^+, where K is an arbitrary matrix of appropriate dimension. Then the invariance of BA^-C implies

    0 = BA_2^-C - BA_1^-C = BKC - (BA^+A)K(AA^+C)  for all K.

By choosing K = u_i v_j', where u_i is the ith column of an appropriate identity matrix and v_j' is the jth row of another identity matrix, we conclude that

    (Bu_i)(v_j'C) = (BA^+Au_i)(v_j'AA^+C)  for all i, j.

Therefore, B = αBA^+A and C = α^{-1}AA^+C for some α ≠ 0. (In fact, by using the first identity repeatedly, we can show that α = 1.) Therefore, C(B') ⊆ C(A') and C(C) ⊆ C(A).
Part (g) is proved by noting that l ∈ C(B) implies that l = Bm for some vector m, and consequently l'A = m'B'A = 0, or l ∈ C(A)^⊥.

To prove part (h), let k = ρ(A), and let a_1, ..., a_k be linearly independent columns of A. By definition, any column of A outside this list is a linear combination of these columns. Therefore, any vector in C(A) is also a linear combination of these vectors. Hence, these vectors constitute a basis set of C(A), and dim(C(A)) = k = ρ(A).

Part (i) follows from part (h) above and part (a) of Proposition 2.3.2. Part (j) is a direct consequence of part (h) above and part (c) of Proposition 2.3.2. Parts (b) and (h) imply that ρ(AB) ≤ ρ(A) and ρ(AB) = ρ(B'A') ≤ ρ(B') = ρ(B). Combining these two, we have the result of part (k). In order to prove part (l), observe that

    ρ(A + B) = dim(C(A + B)) ≤ dim(C(A : B)) = dim(C(A) + C(B)) ≤ ρ(A) + ρ(B),

since for every vector l, (A + B)l = Al + Bl, so that (A + B)l ∈ C(A : B). The result of part (a) is obtained by combining these two implications. Part (b) is a direct consequence of part (a) and the fact that

    P_{S_1 ⊕ S_2} = P_{S_1} + P_{S_2}

whenever S_1 and S_2 are mutually orthogonal vector spaces (see Exercise 2.14).
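Several parts of Proposition 2.4.1 lend themselves to quick numerical spot-checks. The following sketch assumes NumPy (using matrix_rank for ρ, and the Moore-Penrose inverse as one choice of g-inverse); the matrices are arbitrary:

```python
import numpy as np
from numpy.linalg import matrix_rank as rank

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 3))
C = rng.standard_normal((4, 6))

assert rank(A @ A.T) == rank(A)                  # part (c)
assert rank(A @ B) <= min(rank(A), rank(B))      # part (k)
assert rank(A + C) <= rank(A) + rank(C)          # part (l)

# part (e): if C(B2) ⊆ C(A) then A A^- B2 = B2, with A^- = A^+ here
B2 = A @ rng.standard_normal((6, 2))             # columns of B2 lie in C(A)
assert np.allclose(A @ np.linalg.pinv(A) @ B2, B2)
```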
2.5 Matrix decompositions
A number of decompositions of matrices are found to be useful for numerical computations as well as for theoretical developments. We mention three decompositions which will be needed later.

Any non-null matrix A_{m x n} of rank r can be written as B_{m x r}C_{r x n}, where B has full column rank and C has full row rank. This is called a rank-factorization.

Any matrix A_{m x n} can be written as UDV', where U_{m x m} and V_{n x n} are orthogonal matrices and D_{m x n} is a diagonal matrix with nonnegative diagonal elements. This is called a singular value decomposition (SVD) of the matrix A. The non-zero diagonal elements of D are referred to as the singular values of the matrix A. The columns of U and V corresponding to the singular values are called the left and right singular vectors of A, respectively. It can be seen that ρ(A) = ρ(D) (see Exercise 2.16). Therefore, the number of non-zero (positive) singular values of a matrix is equal to its rank. The diagonal elements of D can be permuted in any way, provided the columns of U and V are also permuted accordingly. A combination of such permuted versions of D, U and V would constitute another SVD of A (see Exercise 2.2). If the singular values are arranged so that the positive elements occur in the first few diagonal positions, then we can write A = Σ_{i=1}^r d_i u_i v_i', where r = ρ(A), d_1, ..., d_r are the non-zero singular values, while u_1, ..., u_r and v_1, ..., v_r are the corresponding left and right singular vectors. This is an alternative form of the SVD.
This sum can also be written as U_1 D_1 V_1', where U_1 = (u_1 : ··· : u_r), V_1 = (v_1 : ··· : v_r) and D_1 is a diagonal matrix with d_i in the (i, i)th location, i = 1, ..., r.

Example 2.5.1 Consider the matrix

    A = ( 4  5  2 )
        ( 0  3  6 )
        ( 4  5  2 )
        ( 0  3  6 ).

The rank of A is 2. An SVD of A is UDV', where

    U = ( 1/2   1/2   1/√2    0    )
        ( 1/2  -1/2    0     1/√2  )
        ( 1/2   1/2  -1/√2    0    )
        ( 1/2  -1/2    0    -1/√2  ),

    D = ( 12  0  0 )
        (  0  6  0 )
        (  0  0  0 )
        (  0  0  0 ),

    V = ( 1/3   2/3  -2/3 )
        ( 2/3   1/3   2/3 )
        ( 2/3  -2/3  -1/3 ).

This decomposition is not unique. We can have another decomposition by reversing the signs of the first columns of U and V. Yet another SVD is obtained by replacing the last two columns of U by (-1/√2 : 0 : 1/√2 : 0)' and (0 : -1/√2 : 0 : 1/√2)'. Two alternative forms of the SVD of A are given below:

    d_1 u_1 v_1' + d_2 u_2 v_2' = 12 (1/2, 1/2, 1/2, 1/2)' (1/3, 2/3, 2/3)
                                   + 6 (1/2, -1/2, 1/2, -1/2)' (2/3, 1/3, -2/3)
                                = U_1 D_1 V_1',

where

    U_1 = ( 1/2   1/2 )
          ( 1/2  -1/2 )
          ( 1/2   1/2 )
          ( 1/2  -1/2 ),

    D_1 = ( 12  0 )
          (  0  6 ),

    V_1 = ( 1/3   2/3 )
          ( 2/3   1/3 )
          ( 2/3  -2/3 ).

Two rank-factorizations of A are (U_1 D_1)(V_1') and (U_1)(D_1 V_1'). □
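The computations of Example 2.5.1 can be reproduced with a numerical SVD (assuming NumPy, and using the matrix A as reconstructed above; NumPy returns the singular values in decreasing order):

```python
import numpy as np

A = np.array([[4., 5., 2.],
              [0., 3., 6.],
              [4., 5., 2.],
              [0., 3., 6.]])

U, s, Vt = np.linalg.svd(A)
assert np.allclose(s, [12., 6., 0.])            # singular values d_1, d_2 and 0
assert np.linalg.matrix_rank(A) == 2            # rank = number of non-zero d_i

# rank-factorization A = (U1 D1)(V1') built from the leading singular triplets
U1, D1, V1t = U[:, :2], np.diag(s[:2]), Vt[:2, :]
assert np.allclose(U1 @ D1 @ V1t, A)
```

The signs of individual singular vectors may differ from the display above; as noted in the text, the SVD is not unique.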
If A is a symmetric matrix, it can be decomposed as VΛV', where V is an orthogonal matrix and Λ is a square and diagonal matrix. The diagonal elements of Λ are real, but these need not be nonnegative (see Proposition 2.5.2 below). The set of the distinct diagonal elements of Λ is called the spectrum of A. We shall refer to VΛV' as a spectral decomposition of the symmetric matrix A. The diagonal elements of Λ and the columns of V have the property

    A v_i = λ_i v_i,    i = 1, ..., n,

λ_i and v_i being the ith diagonal element of Λ and the ith column of V, respectively. Combinations of scalars and vectors satisfying this property are generally called eigenvalues and eigenvectors of A, respectively. Thus, every λ_i is an eigenvalue of A, while every v_i is an eigenvector of A. We shall denote the largest and smallest eigenvalues of the symmetric matrix A by λ_max(A) and λ_min(A), respectively.

There are several connections among the three decompositions mentioned so far. If A is a general matrix with SVD UDV', a spectral decomposition of the nonnegative definite matrix A'A is V(D'D)V'. If A is itself a nonnegative definite matrix, any SVD of A is a spectral decomposition, and vice versa. An alternative form of spectral decomposition of a symmetric matrix A is V_1 Λ_1 V_1', where V_1 is semi-orthogonal and Λ_1 is a nonsingular, diagonal matrix. If ρ(A) = r, then Λ_1 has r diagonal elements; when A is nonnegative definite, these are positive, and Λ_1 can be written as D_1², D_1 being another diagonal matrix. Thus, A can be factored as (V_1 D_1)(V_1 D_1)'. This construction shows that any nonnegative definite matrix can be rank-factorized as BB', where B has full column rank. In general we use this form of rank-factorization for nonnegative definite matrices (see Rao and Bhimasankaram, 2000, p.361 for an algorithm for this decomposition). We have already seen in Example 2.5.1 how an SVD leads to a rank-factorization of a general matrix (not necessarily square).

Although the SVD is not unique, the set of singular values of any matrix is unique. Likewise, a symmetric matrix can have many spectral decompositions, but a unique set of eigenvalues.

The rank-factorization, SVD and spectral decomposition help us better understand the concepts introduced in the preceding sections. We now present a few characterizations based on these decompositions.
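The BB' rank-factorization of a nonnegative definite matrix can be computed from a spectral decomposition exactly as described. A sketch assuming NumPy, with an arbitrary rank-2 example:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((5, 2))
A = X @ X.T                                  # nonnegative definite, rank 2

lam, V = np.linalg.eigh(A)                   # spectral decomposition A = V diag(lam) V'
assert np.allclose(V @ np.diag(lam) @ V.T, A)
assert lam.min() > -1e-10                    # nnd: all eigenvalues (numerically) >= 0

# keep only the positive eigenvalues: A = (V1 D1)(V1 D1)' with D1 = Lambda1^{1/2}
pos = lam > 1e-10
B = V[:, pos] @ np.diag(np.sqrt(lam[pos]))   # the full-column-rank factor
assert B.shape == (5, 2)
assert np.allclose(B @ B.T, A)
```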
Proposition 2.5.2 Suppose U_1 D_1 V_1' is an SVD of the matrix A, such that D_1 is full rank. Suppose, further, that B is a symmetric matrix with spectral decomposition VΛV', Λ having the same order as B.

(a) P_A = U_1 U_1'.
(b) A^+ = V_1 D_1^{-1} U_1'.
(c) P_A = AA^+.
(d) B is nonnegative (or positive) definite if and only if all the elements of Λ are nonnegative (or positive).
(e) If B is nonnegative definite, then B^+ = VΛ^+V'.
(f) B is idempotent if and only if all the elements of Λ are either 0 or 1.
(g) If CD is a rank-factorization of A, then the Moore-Penrose inverse of A is given by

    A^+ = D'(DD')^{-1}(C'C)^{-1}C'.

(h) tr(B) = tr(Λ), the sum of the eigenvalues.

Proof. Note that U_1 and V_1 are semi-orthogonal matrices. Further,

    C(A) = C(AA') = C(U_1 D_1² U_1') = C(U_1 D_1) = C(U_1).

The last equality holds because D_1 is invertible. It follows from Proposition 2.4.2 that P_A = P_{U_1} = U_1(U_1'U_1)^- U_1' = U_1 U_1'. Part (b) is proved by verifying the four conditions the Moore-Penrose inverse must satisfy (see Section 2.2). Part (c) follows directly from parts (a) and (b).

Suppose that the elements of Λ are λ_1, ..., λ_n and the columns of V are v_1, ..., v_n. Then, for any vector l of order n,

    l'Bl = l'(Σ_{i=1}^n λ_i v_i v_i')l = Σ_{i=1}^n λ_i (l'v_i)².

If λ_1, ..., λ_n are all nonnegative, l'Bl is nonnegative for all l, indicating that B is nonnegative definite. Otherwise, if λ_j < 0 for some j, we have v_j'Bv_j = λ_j < 0, and B is not nonnegative definite. Thus, B is
nonnegative definite if and only if the elements of Λ are nonnegative. If the word 'nonnegative' is replaced by 'positive,' the statement is proved in a similar manner. This proves part (d).

As mentioned on page 29, the Moore-Penrose inverse of a square and diagonal matrix is obtained by replacing the non-zero elements of the matrix by their respective reciprocals. Therefore, Λ^+ is obtained from Λ in this process. The matrix VΛ^+V' is easily seen to satisfy the four properties of a Moore-Penrose inverse given on page 29, using the fact that V is an orthogonal matrix. This proves part (e).

In order to prove part (f), note that B² = B if and only if VΛ²V' = VΛV', which is equivalent to Λ² = Λ. Since Λ is a diagonal matrix, this is possible if and only if each diagonal element of Λ is equal to its square. The statement follows.

Let F = D'(DD')^{-1}(C'C)^{-1}C', where C and D are as in part (g). The conditions AFA = A, FAF = F, AF = (AF)' and FA = (FA)' are easily verified. Hence, F must be the Moore-Penrose inverse of A.

Part (h) follows from the fact that tr(VΛV') = tr(ΛV'V) = tr(Λ). □

In view of Proposition 2.4.2, part (c) of Proposition 2.5.2 describes how a projection matrix AA^- can be made an orthogonal projection matrix by choosing the g-inverse suitably. Part (d) implies that a nonnegative definite matrix is singular if and only if it has at least one zero eigenvalue. Part (f) characterizes the eigenvalues of orthogonal projection matrices (see Exercise 2.13). Parts (f) and (h) imply that whenever B is an orthogonal projection matrix, ρ(B) = tr(B).

We define the determinant of a symmetric matrix as the product of its eigenvalues. This coincides with the conventional definition of the determinant that applies to any square matrix (see a textbook on linear algebra, such as Marcus and Minc, 1988). We denote the determinant of a symmetric matrix B by |B|. According to part (d) of Proposition 2.5.2, B is nonnegative definite only if |B| ≥ 0, positive definite only if |B| > 0 and positive semidefinite (singular) only if |B| = 0.

The decompositions described above also serve as tools to prove some theoretical results that may be stated without reference to the decompositions. We illustrate the utility of rank-factorization by proving
some more useful results on column spaces.

Proposition 2.5.3

(a) If B is nonnegative definite, then C(ABA') = C(AB), and ρ(ABA') = ρ(AB) = ρ(BA').

(b) If ( A  B )
       ( B' C )  is a nonnegative definite matrix, then C(B) ⊆ C(A) and C(B') ⊆ C(C).

Proof. Suppose that CC' is a rank-factorization of B. Then

    C(ABA') = C(ACC'A') ⊆ C(ACC') ⊆ C(AC).

However, C(AC) = C((AC)(AC)') = C(ABA'). Thus, all the above column spaces are identical. In particular, C(AB) = C(ACC') = C(ABA'). Consequently, ρ(ABA') = ρ(AB) = ρ((AB)') = ρ(BA').

In order to prove part (b), let TT' be a rank-factorization of the given matrix, and let

    T = ( T_1 )
        ( T_2 )

be a partition of T such that the block T_1 has the same number of rows as A. Then we have

    TT' = ( T_1 T_1'   T_1 T_2' )   =   ( A   B )
          ( T_2 T_1'   T_2 T_2' )       ( B'  C ).

Comparing the blocks, we have A = T_1 T_1' and B = T_1 T_2'. Further,

    C(B) = C(T_1 T_2') ⊆ C(T_1) = C(T_1 T_1') = C(A).

Repeating this argument on the transposed matrix, we have C(B') ⊆ C(C). □

2.6 Löwner order
Nonnegative definite matrices are often arranged by a partial order called the Löwner order, which is defined below.

Definition 2.6.1 If A and B are nonnegative definite matrices of the same order, then A is said to be smaller than B in the sense of the Löwner order (written as A ≤ B or B ≥ A) if the difference B - A is nonnegative definite. If the difference is positive definite, then A is said to be strictly smaller than B (written as A < B or B > A).
It is easy to see that whenever A ≤ B, every diagonal element of A is less than or equal to the corresponding diagonal element of B. Apart from the diagonal elements, several other real-valued functions of the matrix elements happen to be algebraically ordered whenever the corresponding matrices are Löwner ordered.

Proposition 2.6.2 Let A and B be symmetric and nonnegative definite matrices having the same order and let A ≤ B. Then

(a) tr(A) ≤ tr(B);
(b) the largest eigenvalue of A is less than or equal to that of B;
(c) the smallest eigenvalue of A is less than or equal to that of B;
(d) |A| ≤ |B|.

Proof. Part (a) follows from the fact that B - A is a symmetric and nonnegative definite matrix, and the sum of the eigenvalues of this matrix is tr(B) - tr(A). Parts (b) and (c) are consequences of the fact that u'Au ≤ u'Bu for every u, and that the inequality continues to hold after both sides are maximized or minimized with respect to u.

In order to prove part (d), note that the proof is non-trivial only when |A| > 0. Let A be positive definite and let CC' be a rank-factorization of A. It follows that I ≤ C^{-1}B(C')^{-1}. By part (c), the smallest eigenvalue (and hence, every eigenvalue) of C^{-1}B(C')^{-1} is greater than or equal to 1. Therefore, |C^{-1}B(C')^{-1}| ≥ 1. The stated result follows from the identities |C^{-1}B(C')^{-1}| = |C^{-1}| |B| |(C')^{-1}| = |B| |A|^{-1}. □

Note that the matrix functions considered in Proposition 2.6.2 are in fact functions of the eigenvalues. It can be shown that whenever A ≤ B, all the ordered eigenvalues of A are smaller than the corresponding eigenvalues of B (see Bellman, 1960). Thus, the Löwner order implies algebraic order of any increasing function of the ordered eigenvalues. The four parts of Proposition 2.6.2 are special cases of this stronger result.

There is a direct relation between the Löwner order and the column spaces of the corresponding matrices.
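The orderings in Proposition 2.6.2 can be spot-checked numerically. In the sketch below (assuming NumPy), B - A is nonnegative definite by construction, so A ≤ B in the Löwner order:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((4, 4)); A = X @ X.T      # positive definite (a.s.)
Y = rng.standard_normal((4, 2)); B = A + Y @ Y.T  # B - A = YY' is nnd, so A <= B

assert np.all(np.diag(A) <= np.diag(B) + 1e-12)   # diagonal elements ordered
assert np.trace(A) <= np.trace(B)                 # part (a)
eA, eB = np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)   # ascending order
assert eA[-1] <= eB[-1] + 1e-10                   # part (b): largest eigenvalues
assert eA[0]  <= eB[0]  + 1e-10                   # part (c): smallest eigenvalues
assert np.linalg.det(A) <= np.linalg.det(B) + 1e-10     # part (d)
```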
Proposition 2.6.3 Let A and B be matrices having the same number of rows.

(a) If A and B are both symmetric and nonnegative definite, then A ≤ B implies C(A) ⊆ C(B).
(b) C(A) ⊆ C(B) if and only if P_A ≤ P_B, and C(A) ⊂ C(B) if and only if P_A < P_B.
In particular, if C(A) ⊂ C(B), then C((I − P_A)B) cannot be identically zero, which leads to the strict order P_A < P_B. On the other hand, when P_A ≤ P_B, part (a) implies that C(A) ⊆ C(B). If P_A < P_B, there is a vector l such that A'l = 0 but B'l ≠ 0. Therefore, C(B)⊥ ⊂ C(A)⊥, that is, C(A) ⊂ C(B).
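Part (b) of Proposition 2.6.3 can be sketched with numpy (an illustration, not the book's code): take A's columns to be a subset of B's columns, so the inclusion C(A) ⊆ C(B) holds by construction, and verify that P_B − P_A is nonnegative definite.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 3))
A = B[:, :2]                      # C(A) is contained in C(B) by construction

def proj(X):
    # orthogonal projector onto C(X), via the Moore-Penrose inverse
    return X @ np.linalg.pinv(X)

P_A, P_B = proj(A), proj(B)
diff_eigs = np.linalg.eigvalsh(P_B - P_A)
is_nnd = diff_eigs.min() >= -1e-10   # P_A <= P_B, up to round-off
```

The eigenvalues of P_B − P_A are 0 or 1 (it is itself an orthogonal projector here), so the difference is nonnegative definite.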
2.7 Solution of linear equations
Consider a set of linear equations written in the matrix-vector form as Ax = b, where x is unknown. Proposition 2.7.1 below provides answers to the following questions: (a) When do the equations have a solution? (b) If there is a solution, when is it unique? (c) When there is a solution, how can we characterize all the solutions?

Proposition 2.7.1 Suppose A_{m×n} and b_{m×1} are known.

(a) The equations Ax = b have a solution if and only if b ∈ C(A).
(b) The equations Ax = b have a unique solution if and only if b ∈ C(A) and ρ(A) = n.
(c) If b ∈ C(A), every solution to the equations Ax = b is of the form A⁻b + (I − A⁻A)c, where A⁻ is any fixed g-inverse of A and c is an arbitrary vector.
Proof. Part (a) follows directly from parts (b) and (d) of Proposition 2.4.1. Part (b) is proved by observing that b can be expressed as a unique linear combination of the columns of A if and only if b ∈ C(A) and the columns of A are linearly independent. It is easy to see that whenever b ∈ C(A), A⁻b is a solution to Ax = b. If x₀ is another solution, then x₀ − A⁻b must be in C(A')⊥. Since Al = 0 if and only if (I − A⁻A)l = l, C(A')⊥ must be the same as C(I − A⁻A). Hence x₀ must be of the form A⁻b + (I − A⁻A)c for some c.

Remark 2.7.2 If b is a non-null vector contained in C(A), every solution of the equations Ax = b can be shown to have the form A⁻b, where A⁻ is some g-inverse of A (see Corollary 1, p. 27 of Rao and Mitra, 1971).

Since the equations Ax = b have no solution unless b ∈ C(A), this condition is often called the consistency condition. If this condition is violated, the equations have an inherent contradiction. If A is a square and nonsingular matrix, then the conditions of parts (a) and (b) are automatically satisfied. In such a case, Ax = b has a unique solution given by x = A⁻¹b. If b ∈ C(A), a general form of the solution of Ax = b is A⁺b + (I − P_{A'})c. This is obtained by choosing the g-inverse in part (c) as A⁺.
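A numerical sketch of part (c) of Proposition 2.7.1 (not from the book), with the Moore-Penrose inverse serving as one particular g-inverse:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3)) @ rng.standard_normal((3, 5))  # rank <= 3
x_true = rng.standard_normal(5)
b = A @ x_true                        # guarantees consistency: b is in C(A)

A_minus = np.linalg.pinv(A)           # one g-inverse of A
x0 = A_minus @ b                      # a particular solution
c = rng.standard_normal(5)
x1 = x0 + (np.eye(5) - A_minus @ A) @ c   # another solution, same equations

consistent = np.allclose(A @ x0, b) and np.allclose(A @ x1, b)
```

Every choice of c yields a solution, because A(I − A⁻A) = 0.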
2.8 Optimization of quadratic forms and functions
Consider a quadratic function q(x) = x'Ax + b'x + c of a vector variable x, where we assume A to be symmetric without loss of generality. In order to minimize or maximize q(x) with respect to x, we can differentiate q(x) with respect to the components of x, one at a time,
and set the derivatives equal to zero. The solution(s) of these simultaneous equations are candidates for the optimizing value of x. This algebra can be carried out in a neater way using vector calculus. Let x = (x₁, …, xₙ)'. The gradient of a function f(x) is defined as

$$\frac{\partial f(x)}{\partial x} = \begin{pmatrix} \partial f(x)/\partial x_1 \\ \vdots \\ \partial f(x)/\partial x_n \end{pmatrix}.$$

The Hessian matrix of f(x) is defined as

$$\frac{\partial^2 f(x)}{\partial x\,\partial x'} = \begin{pmatrix} \dfrac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_1\,\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f(x)}{\partial x_n\,\partial x_1} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_n^2} \end{pmatrix}.$$
The gradient and the Hessian are the higher-dimensional equivalents of the first and second derivatives, respectively. A differentiable function has a maximum or a minimum at the point x₀ only if its gradient is zero at x₀. If the gradient is zero at x₀, it is a minimum if the Hessian at x₀ is nonnegative definite, and a maximum if the negative of the Hessian is nonnegative definite.

Proposition 2.8.1 Let q(x) = x'Ax + b'x + c, where A is a symmetric matrix.

(a) ∂(x'Ax)/∂x = 2Ax, ∂(b'x)/∂x = b, ∂c/∂x = 0.
(b) ∂²(x'Ax)/∂x∂x' = 2A, ∂²(b'x)/∂x∂x' = 0, ∂²c/∂x∂x' = 0.
(c) q(x) has a maximum if and only if b ∈ C(A) and −A is nonnegative definite. In such a case, a maximizer of q(x) is of the form −½A⁻b + (I − A⁻A)x₀, where x₀ is arbitrary.
(d) q(x) has a minimum if and only if b ∈ C(A) and A is nonnegative definite. In such a case, a minimizer of q(x) is of the form −½A⁻b + (I − A⁻A)x₀, where x₀ is arbitrary.
Proof. The proposition is proved by direct verification of the expressions and the conditions stated above, coupled with part (c) of Proposition 2.7.1.

Most of the time it is easier to maximize or minimize a quadratic function by 'completing squares' rather than by using vector calculus. To see this, assume that b ∈ C(A) and rewrite q(x) as

$$q(x) = \left(x + \tfrac{1}{2}A^{-}b\right)' A \left(x + \tfrac{1}{2}A^{-}b\right) + \left(c - \tfrac{1}{4}b'A^{-}b\right).$$

If A ≥ 0, then q(x) is minimized when x = −½A⁻b + (I − A⁻A)x₀ for arbitrary x₀. If −A ≥ 0, then q(x) is maximized at this value of x.

The quadratic function q(x) may be maximized or minimized subject to the linear constraint Dx = e by using the Lagrange multiplier method. This is accomplished by adding the term 2y'(Dx − e) to q(x) and optimizing the sum with respect to x and y. Thus, the task is to optimize
$$\begin{pmatrix} x \\ y \end{pmatrix}' \begin{pmatrix} A & D' \\ D & 0 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} b \\ -2e \end{pmatrix}' \begin{pmatrix} x \\ y \end{pmatrix} + c$$

with respect to (x' : y')'. This is clearly covered by the results given in Proposition 2.8.1. The next proposition deals with maximization of a quadratic form under a quadratic constraint.

Proposition 2.8.2 Let A be a symmetric matrix, and b be a vector such that b'b = 1. Then b'Ab is maximized with respect to b when b is a unit-norm eigenvector of A corresponding to its largest eigenvalue, and the corresponding maximum value of b'Ab is equal to the largest eigenvalue.
Proof. Let VΛV' be a spectral decomposition of A, and let λ₁, …, λₙ, the diagonal elements of Λ, be in decreasing order. Suppose that a = V'b. It follows that the optimization problem at hand is equivalent to the task of maximizing Σᵢ₌₁ⁿ λᵢaᵢ² subject to the constraint Σᵢ₌₁ⁿ aᵢ² = 1. It is easy to see that the weighted sum of the λᵢ's is maximized when a₁² = 1 and aᵢ = 0 for i = 2, …, n. This choice ensures the solution stated in the proposition. □

Note that if b₀ maximizes b'Ab subject to the unit-norm condition, so does −b₀. Other solutions can be found if λ₁ = λ₂. If λ₁ > 0, then the statement of the above proposition can be strengthened by replacing the constraint b'b = 1 with the inequality constraint b'b ≤ 1. An equivalent statement to Proposition 2.8.2 is the following: the ratio b'Ab/b'b is maximized over all b ≠ 0 when b is an eigenvector of A corresponding to its largest eigenvalue. The corresponding maximum value of b'Ab/b'b is equal to the largest eigenvalue of A.

Similarly it may be noted that b'Ab is minimized with respect to b subject to the condition b'b = 1 when b is a unit-norm eigenvector of A corresponding to its smallest eigenvalue. The corresponding minimum value of b'Ab is equal to the smallest eigenvalue of A. This statement can be proved along the lines of Proposition 2.8.2. An equivalent statement in terms of the minimization of the ratio b'Ab/b'b can also be made. We shall define the norm of any rectangular matrix A as the largest value of the ratio
$$\frac{\|Ab\|}{\|b\|} = \left(\frac{b'A'Ab}{b'b}\right)^{1/2},$$

and denote it by ‖A‖. Proposition 2.8.2 and the preceding discussion imply that ‖A‖ must be equal to the square-root of the largest eigenvalue of A'A, which is equal to the largest singular value of A (see the discussion on page 42). It also follows that the vector b which maximizes the above ratio is proportional to a right singular vector of A corresponding to its largest singular value. The norm defined here is different from the Frobenius norm defined on page 28.
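A numpy sketch (not the book's code) tying Proposition 2.8.2 to the norm just defined: the constrained maximum of b'Ab is the largest eigenvalue, and the operator norm of a rectangular matrix is its largest singular value.

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                      # symmetric (not necessarily nnd)

eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
b_max = eigvecs[:, -1]                 # unit eigenvector, largest eigenvalue
rayleigh_ok = np.isclose(b_max @ A @ b_max, eigvals[-1])

B = rng.standard_normal((5, 3))        # a rectangular matrix
sigma_max = np.linalg.svd(B, compute_uv=False)[0]   # largest singular value
norm_ok = np.isclose(sigma_max, np.sqrt(np.linalg.eigvalsh(B.T @ B)[-1]))
```

The same `sigma_max` is what `np.linalg.norm(B, 2)` returns, whereas `np.linalg.norm(B, 'fro')` gives the (generally larger) Frobenius norm.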
2.9 Exercises

2.1 Find a matrix P_{n×n} such that for any matrix A_{m×n}, AP is a modification of A with the first two columns interchanged. Can you find a matrix Q of suitable order such that QA is another modification of A with the first two rows interchanged?
2.2 Obtain the inverses of the matrices P and Q of Exercise 2.1.
2.3 Let A be a matrix of order n × n.
(a) Show that if A is positive definite, then it is nonsingular.
(b) Show that if A is symmetric and positive semidefinite, then it is singular.
(c) If A is positive semidefinite but not necessarily symmetric, does it follow that it is singular?
2.4 Show that

$$\begin{pmatrix} M_{n_1\times m_1} & 0_{n_1\times m_2} \\ 0_{n_2\times m_1} & 0_{n_2\times m_2} \end{pmatrix} \quad\text{is a g-inverse of}\quad \begin{pmatrix} A_{m_1\times n_1} & 0_{m_1\times n_2} \\ 0_{m_2\times n_1} & 0_{m_2\times n_2} \end{pmatrix},$$

where M is a g-inverse of A.
2.5 Is (A⁻)' a g-inverse of A'? Show that a symmetric matrix always has a symmetric g-inverse.
2.6 If A has full column rank, show that every g-inverse of A is also a left-inverse.
2.7 If A is nonsingular, show that the matrix

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$$

is nonsingular if and only if D − CA⁻¹B is nonsingular.
2.8 Show that any two bases of a given vector space must have the same number of vectors.
2.9 Prove Proposition 2.3.2.
2.10 Prove that the dimension of a vector space is no more than the order of the vectors it contains.
2.11 Show that the projection of a vector on a vector space is uniquely defined.
2.12 Prove that the orthogonal projection matrix for a given vector space is unique.
2.13 (a) Prove that every idempotent matrix is a projection matrix.
(b) Prove that a matrix is an orthogonal projection matrix if and only if it is symmetric and idempotent.
2.14 If S₁ and S₂ are mutually orthogonal vector spaces, show that
2.15 2.16
2.17 2.18
2.19
2.20
P_{S₁⊕S₂} = P_{S₁} + P_{S₂}.

where x = (x₁ : ⋯ : xₚ)', β₀ and β are as in (3.4.2), and e = y − Ê(y|x). According to Proposition 3.4.1, e has zero mean and is uncorrelated with x. Therefore, (3.4.3) is a special case of (1.3.2)–(1.3.3) for a single observation. However, the explanatory variables in (3.4.3) are in general random and are not necessarily independent of e. Even though the model (3.4.3) applies to any y and x having the moments described in Proposition 3.4.1, the model may not always be interpretable as a conditional one (given the explanatory variables). Methods of inference which require the explanatory variables to be conditionally non-random are not applicable to (3.4.3).

The multiple correlation coefficient of y with x is the maximum value of the correlation between y and any linear function of x. If the covariance structure of x and y is as in Proposition 3.4.1, the linear
Chapter 3 : Review of Statistical Results
function of x which has the maximum correlation with y happens to be v'_{xy}V⁻_{xx}x. Therefore, the squared multiple correlation is

$$\frac{Cov\big(y,\, v_{xy}'V_{xx}^{-}x\big)^2}{Var(y)\, Var\big(v_{xy}'V_{xx}^{-}x\big)} = \frac{v_{xy}'V_{xx}^{-}v_{xy}}{v_{yy}}.$$
Thus, y has a larger correlation with the BLP of y, given in part (a) of Proposition 3.4.1, than with any other linear function of x.

We end this section with a proposition which shows how the linear model (1.3.2)–(1.3.3) follows from a multivariate normal distribution of the variables. We shall denote by vec(A) the vector obtained by successively concatenating the columns of the matrix A.

Proposition 3.4.2 Let the random vector y_{n×1} and the random matrix X_{n×(p+1)} be such that X = (1 : Z), and

$$vec(y : Z) \sim N\big(\mu_{(p+1)\times 1} \otimes 1_{n\times 1},\; \Sigma_{(p+1)\times(p+1)} \otimes V_{n\times n}\big).$$
Then

$$E(y|X) = (1 : Z)\beta, \qquad D(y|X) = \sigma^2 V,$$

where

$$\beta = \begin{pmatrix} \mu_y - \sigma_{xy}'\Sigma_{xx}^{-}\mu_x \\ \Sigma_{xx}^{-}\sigma_{xy} \end{pmatrix}, \qquad \sigma^2 = \sigma_{yy} - \sigma_{xy}'\Sigma_{xx}^{-}\sigma_{xy}.$$
Proof. Since X = (1 : Z), conditioning with respect to X is equivalent to conditioning with respect to Z. The stated formulae would follow from (3.2.1) and (3.2.2). Let x = vec(Z), and

$$\Sigma_{(p+1)\times(p+1)} = \begin{pmatrix} \sigma_{yy} & \sigma_{xy}' \\ \sigma_{xy} & \Sigma_{xx} \end{pmatrix}.$$

It follows that

$$D(y|X) = D(y|x) = \sigma_{yy}V - (\sigma_{xy}' \otimes V)(\Sigma_{xx} \otimes V)^{-}(\sigma_{xy} \otimes V)$$
$$= \sigma_{yy}V - (\sigma_{xy}' \otimes V)(\Sigma_{xx}^{-} \otimes V^{-})(\sigma_{xy} \otimes V).$$

A value θ̂ of θ satisfying f_{θ̂}(y) ≥ f_θ(y) for all θ is called the maximum likelihood estimator (MLE) of θ. Provided a number of regularity conditions hold and the derivatives exist, such a point can be obtained by methods of calculus. Since log p is a monotone increasing function of p, maximizing f_θ(y) is equivalent to maximizing log f_θ(y) with respect to θ, which is often easier (e.g., when f_θ(y) is in the exponential family). The maximum occurs at a point where the gradient vector satisfies the likelihood equation

$$\frac{\partial \log f_\theta(y)}{\partial \theta} = 0$$

and the matrix of second derivatives (Hessian matrix)

$$\frac{\partial^2 \log f_\theta(y)}{\partial \theta\,\partial \theta'}$$
is negative definite.

Example 3.6.5 Let y_{n×1} ~ N(μ1, σ²I). Here, θ = (μ : σ²)'. It can be easily seen that

$$\log f_\theta(y) = -(n/2)\log(2\pi\sigma^2) - (y - \mu 1)'(y - \mu 1)/(2\sigma^2).$$

Consequently

$$\frac{\partial \log f_\theta(y)}{\partial \theta} = \begin{pmatrix} -\dfrac{n\mu - 1'y}{\sigma^2} \\[2ex] -\dfrac{n}{2\sigma^2} + \dfrac{(y - \mu 1)'(y - \mu 1)}{2(\sigma^2)^2} \end{pmatrix}.$$
Equating this vector to zero, we obtain

$$\hat\mu = n^{-1}1'y \qquad\text{and}\qquad \hat\sigma^2 = n^{-1}(y - \hat\mu 1)'(y - \hat\mu 1)$$
as the unique solutions. Note that these are the sample mean and sample variance, respectively. These would be the respective MLEs of μ and σ² if the Hessian matrix is negative definite. Indeed,

$$\left.\frac{\partial^2 \log f_\theta(y)}{\partial\theta\,\partial\theta'}\right|_{\theta=\hat\theta} = \begin{pmatrix} -\dfrac{n}{\hat\sigma^2} & \dfrac{n\hat\mu - 1'y}{(\hat\sigma^2)^2} \\[2ex] \dfrac{n\hat\mu - 1'y}{(\hat\sigma^2)^2} & \dfrac{n}{2(\hat\sigma^2)^2} - \dfrac{\|y - \hat\mu 1\|^2}{(\hat\sigma^2)^3} \end{pmatrix} = -\begin{pmatrix} n/\hat\sigma^2 & 0 \\ 0 & n/(2(\hat\sigma^2)^2) \end{pmatrix},$$
which is obviously negative definite. □

Recall that, when a sufficient statistic t(y) is available, the density of y can be factored as

$$f_\theta(y) = g_\theta(t(y))\, h(y)$$

in view of the factorization theorem (Proposition 3.5.6). Therefore, maximizing the log-likelihood is equivalent to maximizing log g_θ(t(y)).
The value of θ which maximizes this quantity must depend on y only through t(y). Thus, the MLE is a function of every sufficient statistic. It can be shown that under some regularity conditions, the bias of the MLE goes to zero as the sample size goes to infinity. Thus, it is said to be asymptotically unbiased. On the other hand, a UMVUE is always unbiased.

We now discuss a theoretical limit on the dispersion of an unbiased estimator, regardless of the method of estimation. Let f_θ(y) be the likelihood function of the r-dimensional vector parameter θ corresponding to observation y. Let

$$I(\theta) = E\left(-\frac{\partial^2 \log f_\theta(y)}{\partial\theta\,\partial\theta'}\right),$$

assuming that the derivatives and the expectation exist. The matrix I(θ) is called the (Fisher) information matrix for θ. The information matrix can be shown to be an indicator of sensitivity of the distribution of y to changes in the value of θ (larger sensitivity implies greater potential of knowing θ from the observation y).

Proposition 3.6.6 Under the above set-up, let t(y) be an unbiased estimator of the k-dimensional vector parameter g(θ), and let G(θ) = ∂g(θ)/∂θ'. Then D(t(y)) ≥ G(θ)I⁻(θ)G'(θ) in the sense of the Löwner order, and G(θ)I⁻(θ)G'(θ) does not depend on the choice of the g-inverse.

Proof. Let s(y) = ∂log f_θ(y)/∂θ. It is easy to see that E[s(y)] = 0, and
$$E\left[\frac{\partial^2 \log f_\theta(y)}{\partial\theta_i\,\partial\theta_j}\right] = E\left[\frac{1}{f_\theta(y)}\frac{\partial^2 f_\theta(y)}{\partial\theta_i\,\partial\theta_j}\right] - E\left[\left(\frac{1}{f_\theta(y)}\frac{\partial f_\theta(y)}{\partial\theta_i}\right)\left(\frac{1}{f_\theta(y)}\frac{\partial f_\theta(y)}{\partial\theta_j}\right)\right] = 0 - Cov(s_i(y), s_j(y)),$$
that is, D(s(y)) = I(θ). Further, Cov(t(y), s(y)) = G(θ). Hence, the dispersion matrix

$$D\begin{pmatrix} t(y) \\ s(y) \end{pmatrix} = \begin{pmatrix} D(t(y)) & G(\theta) \\ G'(\theta) & I(\theta) \end{pmatrix}$$
must be nonnegative definite. The result of Exercise 2.19(a) indicates that D(t(y)) − G(θ)I⁻(θ)G'(θ) is nonnegative definite. The inequality follows. The invariance of G(θ)I⁻(θ)G'(θ) on the choice of I⁻(θ) is a consequence of Propositions 3.1.1(b) and 2.4.1(f). □

The proof of Proposition 3.6.6 reveals that the information matrix has the following alternative expressions:

$$I(\theta) = -E\left(\frac{\partial^2 \log f_\theta(y)}{\partial\theta\,\partial\theta'}\right) = E\left[\left(\frac{\partial \log f_\theta(y)}{\partial\theta}\right)\left(\frac{\partial \log f_\theta(y)}{\partial\theta}\right)'\right].$$
The lower bound on D(t(y)) given in Proposition 3.6.6 depends only on the distribution of y, and the result holds for any unbiased estimator, irrespective of the method of estimation. This result is known as the information inequality or Cramér-Rao inequality. The Cramér-Rao lower bound holds even if there is no UMVUE for g(θ). If t(y) is an unbiased estimator of θ itself, then the information inequality simplifies to D(t(y)) ≥ I⁻(θ).

Example 3.6.7 Let y_{n×1} ~ N(μ1, σ²I) and θ = (μ : σ²)'. It follows from the calculations of Example 3.6.5 that

$$I(\theta) = \begin{pmatrix} n/\sigma^2 & 0 \\ 0 & n/(2\sigma^4) \end{pmatrix}.$$

The information matrix is proportional to the sample size. The Cramér-Rao lower bounds on the variance of any unbiased estimator of μ and σ² are σ²/n and 2σ⁴/n, respectively. The bound σ²/n is achieved by the sample mean, which is the UMVUE as well as the MLE of μ. The variance of the UMVUE of σ² (given in Example 3.6.4) is 2σ⁴/(n − 1), and therefore this estimator does not achieve the Cramér-Rao lower bound. The variance of the MLE of σ² (given in Example 3.6.5) is
2(n − 1)σ⁴/n², which is smaller than the Cramér-Rao lower bound. However, the information inequality is not applicable to this estimator, as it is biased.

If t(y) is an unbiased estimator of the parameter g(θ) and b_g is the corresponding Cramér-Rao lower bound described in Proposition 3.6.6, then the ratio b_g/Var(t(y)) is called the efficiency of t(y).

The Cramér-Rao lower bound has a special significance for maximum likelihood estimators. Let θ₀ be the 'true' value of θ, and I(θ₀) be the corresponding information matrix. It can be shown under some regularity conditions that (a) the likelihood equation for θ has at least one consistent solution θ̂ (that is, for all δ > 0, the probability P[‖θ̂ − θ₀‖ > δ] goes to 0 as the sample size n goes to infinity), (b) the distribution function of n^{1/2}(θ̂ − θ₀) converges pointwise to that of N(0, I⁻(θ₀)), and (c) the consistent MLE is asymptotically unique in the sense that if θ̂₁ and θ̂₂ are distinct roots of the likelihood equation which are both consistent, then n^{1/2}(θ̂₁ − θ̂₂) goes to 0 with probability 1 as n goes to infinity. We refer the reader to Schervish (1995) for more discussion of Fisher information and other measures of information.
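Examples 3.6.5 and 3.6.7 can be illustrated together with a small Monte Carlo sketch (assumed parameter values, not from the book): for y ~ N(μ1, σ²I), the variance of the sample mean attains the Cramér-Rao bound σ²/n, while the MLE of σ² (with divisor n) is biased.

```python
import numpy as np

rng = np.random.default_rng(8)
mu, sigma2, n, reps = 1.0, 4.0, 25, 100_000

samples = mu + np.sqrt(sigma2) * rng.standard_normal((reps, n))
mu_hats = samples.mean(axis=1)       # MLE of mu in each replication
s2_hats = samples.var(axis=1)        # MLE of sigma^2 (1/n divisor)

cr_bound_mu = sigma2 / n             # Cramér-Rao bound for unbiased mu
attains = abs(mu_hats.var() - cr_bound_mu) < 0.01

# the MLE of sigma^2 is biased: its mean is (n-1)sigma^2/n, not sigma^2
bias_ok = abs(s2_hats.mean() - (n - 1) * sigma2 / n) < 0.05
```

The bias check reflects why the information inequality does not apply to the MLE of σ², as noted above.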
3.7 Bayesian estimation
Sometimes certain knowledge about the parameter θ may be available prior to one's access to the vector of observations y. Such knowledge may be subjective or based on past experience in similar experiments. Bayesian inference consists of making appropriate use of this prior knowledge. This knowledge is often expressed in terms of a prior distribution of θ, denoted by π(θ). The prior distribution is sometimes referred to simply as the prior. Once a prior is attached to θ, the 'distribution' of y mentioned in the foregoing discussion has to be interpreted as the conditional distribution of y given θ. The average risk of the estimator t(y) with respect to the prior π(θ) is

$$r(t, \pi) = \int R(\theta, t)\, d\pi(\theta),$$
where R(θ, t) is the risk function, defined in Section 3.6.

Definition 3.7.1 An estimator which minimizes the average risk (also known as the Bayes risk) r(t, π) is called the Bayes estimator of g(θ) with respect to the prior π. A Bayes estimator t of g(θ) is said to be unique if for any other Bayes estimator s, r(s, π) ≤ r(t, π) implies that P_θ(t(y) ≠ s(y)) = 0 for all θ ∈ Θ.

The comparison of estimators with respect to a risk function sometimes reveals the unsuitability of some estimators. For instance, if the risk function of one estimator is larger than that of another estimator for all values of the parameter, the former estimator should not be considered a competitor. Such an estimator is called an inadmissible estimator.

Definition 3.7.2 An estimator t belonging to a class of estimators A is called admissible for the parameter g(θ) in the class A with respect to the loss function L if there is no estimator s in A such that R(θ, s) ≤ R(θ, t) for all θ ∈ Θ, with strict inequality for at least one θ ∈ Θ. □

The above definition can be used even when the scalar function g and the scalar statistic t are replaced by vector-valued g and t. The risk continues to be defined as the expected loss, while the loss is a function of θ and t. The squared error loss function in the vector case is

A set of the form {θ : p(θ|y) ≥ c}, where the threshold c is determined by the level condition, is called the highest posterior density (HPD) credible set.
3.10 Exercises
3.1 Prove the following facts about covariance adjustment.
(a) If u and v are random vectors with known first and second order moments such that E(v) ∈ C(D(v)), and B is chosen so that u − Bv is uncorrelated with v, show that D(u − Bv) ≤ D(u) (that is, covariance adjustment reduces the dispersion of a vector in the sense of the Löwner order).
(b) Let u, v and B be as above, v = (v₁' : v₂')' and B₁ be chosen such that u − B₁v₁ is uncorrelated with v₁. Show that D(u − B₁v₁) ≥ D(u − Bv) (that is, the larger the size of v, the smaller is the dispersion of the covariance adjusted vector).
3.2 If z ~ N(0, Σ), then show that the quadratic form z'Σ⁻z almost surely does not depend on the choice of the g-inverse, and has the chi-square distribution with ρ(Σ) degrees of freedom.
3.3 Prove Proposition 3.3.2.
3.4 Prove the following converse of part (a) of Proposition 3.3.2: If y ~ N(μ, I), and y'Ay has a chi-square distribution, then A is an idempotent matrix, in which case the chi-square distribution has ρ(A) degrees of freedom and noncentrality parameter μ'Aμ. [See Rao (1973c, p. 186) for a more general result.]
3.5 Show that part (b) of Proposition 3.3.2 (for μ = 0) holds under the weaker assumption that A and B are any nonnegative definite matrices (not necessarily idempotent).
3.6 If y ~ N(0, I), and A and B be nonnegative definite matrices such that y'Ay ~ χ²_{ρ(A)} and y'(A + B)y ~ χ²_{ρ(A+B)}, then show
that y'By ~ χ²_{ρ(B)}.
3.7 Let y and x be random vectors and W(x) be a positive definite matrix for every value of x.
(a) Show that the vector function g(x) that minimizes E[(y − g(x))'W(x)(y − g(x))] is g(x) = E(y|x).
(b) Show that y − E(y|x) is uncorrelated with E(y|x). [This result is a stronger version of the result of Exercise 1.6.]
(c) What happens when W(x) is positive semidefinite?
3.8 Let x and y be random vectors with finite first and second order moments such that
$$E\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \qquad D\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} V_{xx} & V_{xy} \\ V_{yx} & V_{yy} \end{pmatrix}.$$
Then show that
(a) The best linear predictor (BLP) of y in terms of x, which minimizes (in the sense of the Löwner order) the mean squared prediction error matrix E[(y − Lx − c)(y − Lx − c)'] with respect to L and c, is unique and is given by

$$\hat E(y|x) = \mu_y + V_{yx}V_{xx}^{-}(x - \mu_x);$$

(b) y − Ê(y|x) is uncorrelated with every linear function of x;
(c) The mean squared prediction error matrix of the BLP is V_{yy} − V_{yx}V_{xx}^{-}V_{xy}.
3.9 Modify the results of Exercise 3.7 under the constraint that g(x) is of the form Lx + c.
3.10 Conditional sufficiency. Let z have the uniform distribution over the interval [1, 2]. Let y₁, …, yₙ be independently distributed as N(θ, z) for given z, θ being a real-valued parameter. The observation consists of y = (y₁ : … : yₙ)' and z. Show that the vector (n⁻¹1'y : z)' is minimal sufficient, even though z is ancillary (that is, the vector is not complete sufficient). In such a case, n⁻¹1'y is said to be conditionally sufficient given z. Verify that given z, n⁻¹1'y is indeed sufficient for θ.
3.11 Let x₁, …, xₙ be independent and identically distributed with density f_θ(x) = g(x − θ) for some function g which is free of the parameter θ. Show that the range max_i x_i − min_i x_i
θ > 0, then show that the estimator y is unbiased for θ but
inadmissible with respect to the squared error loss function. Find an admissible estimator.
3.22 Let y ~ N(μ1, I), with unspecified real parameter μ. Find the UMP test for the null hypothesis H₀ : μ = 0 against the alternative H₁ : μ > 0. Find the GLRT for this problem. Which test has greater power for a given size? What happens when the alternative is two-sided, that is, H₁ : μ ≠ 0?
3.23 Let y ~ N(μ1, I), with unspecified real parameter μ. Find the level (1 − α) UMA confidence region for μ when it is known that μ ∈ [0, ∞). Find the level (1 − α) UMAU confidence region for μ when it is known that μ ∈ (−∞, ∞).
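The best-linear-predictor formulas of Proposition 3.4.1 and Exercise 3.8 can be checked by simulation. The sketch below uses assumed values (with V_xx = I, the BLP coefficient vector v'_{xy}V⁻_{xx} reduces to β'), and compares the simulated squared multiple correlation with the formula v'_{xy}V⁻_{xx}v_{xy}/v_{yy}:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 200_000, 3
beta = np.array([1.0, -2.0, 0.5])
x = rng.standard_normal((n, p))
y = x @ beta + rng.standard_normal(n)
# Here v_xy = beta, V_xx = I, and v_yy = beta'beta + 1.

r2_theory = beta @ beta / (beta @ beta + 1)   # squared multiple correlation
blp = x @ beta                                # the BLP, up to the constant
r2_sim = np.corrcoef(y, blp)[0, 1] ** 2

close = abs(r2_sim - r2_theory) < 0.01
```

No other linear function of x achieves a higher correlation with y than the BLP, which is what the multiple correlation coefficient expresses.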
Chapter 4
Estimation in the Linear Model
Consider the homoscedastic linear model (y, Xβ, σ²I). This model is a special case of (1.3.2)–(1.3.3) where the model errors have the same variance and are uncorrelated. The unknown parameters of this model are the coefficient vector β and the error variance σ². In this chapter we deal with the problem of estimation of these parameters from the observables y and X. We assume that y is a vector of n elements, X is an n × k matrix and β is a vector of k elements.

Some of these parameters may be redundant. If one has ten observations, all of them measuring the combined weight of an apple and an orange, one cannot hope to estimate the weight of the orange alone from these measurements. In general, only some functions of the model parameters, and not all, can be estimated from the data. We discuss this issue in Section 4.1. Supposing that it is possible to estimate a given parameter, the next question is how to estimate it in an 'optimal' manner. This leads us to the theory of best linear unbiased estimation, discussed in Section 4.3. An important tool used in the development of this theory is the set of linear zero functions: linear functions of the response which have zero expectation. In Sections 4.2 and 4.4, we present the least squares method and the method of maximum likelihood (ML), the latter being considered under a distributional assumption for the errors. (Some other methods are discussed in Chapter 11.)

Subsequent sections deal with measuring the degree of fit which the
estimated parameters provide, some variations to the linear model, and issues and problems which arise in estimation.

4.1 Linear estimation: some basic facts
Much of the classical inference problems related to the linear model (y, Xβ, σ²I) concern a linear parametric function (LPF), p'β. We often estimate it by a linear function of the response, l'y. Since y itself is modelled as a linear function of the parameter β plus error, it is reasonable to expect that one may be able to estimate β by some kind of a linear transformation in the reverse direction. This is why we try to estimate LPFs by linear estimators, that is, as linear functions of y.

4.1.1 Linear unbiased estimator and linear zero function
For accurate estimation of the LPF p'β, it is desirable that the estimator is not systematically away from the 'true' value of the parameter.

Definition 4.1.1 The statistic l'y is said to be a linear unbiased estimator (LUE) of p'β if E(l'y) = p'β for all possible values of β.

Another class of linear statistics has special significance in inference in the linear model.

Definition 4.1.2 A linear function of the response, l'y, is called a linear zero function (LZF) if E(l'y) = 0 for all possible values of β.

By putting p = 0 in the definition of the LUE, we see that any LZF is a linear unbiased estimator or LUE of 0. Therefore, by adding LZFs to an LUE of p'β, we get other LUEs of p'β. A natural question seems to be: why bother about LZFs, which are after all estimators of zero? There are two important ways in which the LZFs contribute to inference. First, they contain information about the error or noise in the model and are useful in the estimation of σ², which we consider in Section 4.7. Second, since the mean and variance of the LZFs do not depend on β, they are in some sense decoupled from Xβ, the systematic part of the model (see also Remark 4.1.6). Therefore, we can use them to isolate the noise from what is useful for the estimation of the systematic part. This is precisely what we do in Section 4.3.
Example 4.1.3 (A trivial example) Suppose that an orange and an apple with (unknown) weights α₁ and α₂, respectively, are weighed separately with a crude scale. Each measurement is followed by a 'dummy' measurement with nothing on the scale, in order to get an idea about typical measurement errors. Let us assume that the measurements satisfy the linear model

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \end{pmatrix},$$
with the usual assumption of homoscedastic errors with variance σ². The observations y₂ and y₄, being direct measurements of error, may be used to estimate the error variance. These are LZFs. The other two observations carry information about the two parameters. There are several unbiased estimators of α₁, such as y₁, y₁ + y₂ and y₁ + y₄. It appears that y₁ would be a natural estimator of α₁, since it is free from the baggage of any LZF. We shall formalize this heuristic argument later.

In reality we seldom have information about the requisite LPFs and the errors as nicely segregated as in Example 4.1.3. Our aim is to achieve this segregation for any linear model, so that the task of choosing an unbiased estimator becomes easier. Before proceeding further, let us characterize the LUEs and LZFs algebraically. Recall that P_X = X(X'X)⁻X' is the orthogonal projection matrix for C(X).

Proposition 4.1.4 In the linear model (y, Xβ, σ²I), the linear statistic l'y is

(a) an LUE of the LPF p'β if and only if X'l = p;
(b) an LZF if and only if X'l = 0, that is, l is of the form (I − P_X)m for some vector m.

Proof. In order to prove part (a), note that E(l'y) = l'Xβ, so l'y is an LUE of p'β if and only if the relation l'Xβ = p'β holds as an identity for all β. This is equivalent to the condition X'l = p.
The special case of part (a) for p = 0 indicates that l'y is an LZF if and only if X'l = 0. The latter condition is equivalent to requiring l to be of the form (I − P_X)m for some vector m.

Remark 4.1.5 Proposition 4.1.4(b) implies that every LZF is a linear function of (I − P_X)y, and can be written as m'(I − P_X)y for some vector m. This fact will be used extensively hereafter. Proposition 4.1.4 will have to be modified somewhat for the more general model (y, Xβ, σ²V) (see Section 7.2.2). However, the characterization of LZFs as linear functions of (I − P_X)y continues to hold for such models.

Remark 4.1.6 Consider the explicit form of the model considered here,

y = Xβ + ε.  (1.3.2)

A consequence of Remark 4.1.5 is that any LZF can be written as l'(I − P_X)y, which is the same as l'(I − P_X)ε. Thus, the LZFs do not depend on β at all, and are functions of the model error, ε. This is why LZFs are sometimes referred to as linear error functions.
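The characterization in Remark 4.1.5 can be sketched with numpy for the design of Example 4.1.3 (hypothetical measurement values, not from the book): the residual (I − P_X)y retains exactly the 'dummy' measurements y₂ and y₄, the building blocks of every LZF.

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
P_X = X @ np.linalg.pinv(X)            # orthogonal projector onto C(X)

y = np.array([3.0, 0.2, 2.0, -0.1])    # hypothetical measurements
residual = (np.eye(4) - P_X) @ y       # every LZF is m'(I - P_X)y

keeps_dummies = np.allclose(residual, [0.0, y[1], 0.0, y[3]])
```

Here P_X works out to diag(1, 0, 1, 0), so the systematic part and the pure-error part of y separate coordinate by coordinate.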
Example 4.1.7 (Trivial example, continued) In the case of Example 4.1.3,

$$P_X = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},$$
'** 0 I 1 I . »«- D{Liy). Note that Ly and L\y are LUEs of the same LPF, so Ly can not be the BLUE of this LPF. It is clear from Proposition 4.3.2 that if an LPF has a BLUE, then any other LUE of that LPF can be expressed as the sum of two uncorrelated parts: the BLUE and an LZF. Any ordinary LUE has larger dispersion than that of the BLUE, precisely because of the added LZF component, which carries no information about the LPF. Having understood this, we should be able to improve upon any given LUE by 'trimming the fat.' To accomplish this, we have to subtract a suitable LZF from the given LUE so that the remainder is uncorrelated with every LZF. The task is simplified by the fact that every LZF is of the form m'(I — Px)y for some vector m (see Proposition 4.1.4). Therefore, all we have to do is to make the given LUE uncorrelated with (/ — Px)yThe covariance adjustment principle of Proposition 3.1.2 comes handy for this purpose. If Ly is an LUE of an LPF, then the correspondProposition 4.3.3 ing BLUE is LPxy. Proof. Note that one can write Ly = LPxy+L(I—Px)y. Since EL(I— Px)y — 0, LPxy has the same expectation as Ly. Further, LPxy is uncorrelated with any LZF of the form m'{I — Px)y. Therefore, LPxy must be the BLUE of E{Ly). D
This proposition gives a constructive proof of the existence of the BLUE of any estimable LPF. Instead of modifying a preliminary estimator, one can also construct the BLUE of a given LPF directly using the Gauss-Markov Theorem (see Proposition 4.3.9).

Remark 4.3.4 Proposition 4.3.2 is the linear analogue of Proposition 3.6.2. The BLUE of a single LPF is a linear analogue of the UMVUE, and the LZFs are linear estimators of zero.^b Indeed, when y has the normal distribution, it can be shown that the BLUE of any estimable LPF is its UMVUE and that any LZF is ancillary (Exercise 4.20). In the general case, the uncorrelatedness of the UMVUE and estimators of zero did not provide a direct method of constructing the UMVUE, because it is difficult to characterize the set of all estimators of zero. (We needed an unbiased estimator, a complete sufficient statistic and an additional result, Proposition 3.6.3, for the construction.) In the linear case however, 'zero correlation with LZFs' is an adequate characterization for constructing the BLUE, as we have just demonstrated through Proposition 4.3.3. □

We now prove that the BLUE of an estimable LPF is unique.

Proposition 4.3.5
Every estimable LPF has a unique BLUE.
Proof. Let L₁y and L₂y be distinct BLUEs of the same vector LPF. Writing L₁y as L₂y + (L₁ − L₂)y and using Proposition 4.3.2, we have D(L₁y) = D(L₂y) + D((L₁ − L₂)y). Therefore (L₁ − L₂)y has zero mean and zero dispersion. It follows that (L₁ − L₂)y must be zero with probability one and that L₁y = L₂y almost surely. □

ᵇSee Chapter 11 for a linear version of the fundamental notions of inference.

Although we have proved the existence and uniqueness of the BLUE of an estimable LPF, another point remains to be clarified. Suppose that Aβ is an estimable vector LPF. Since all the elements of Aβ are estimable, these have their respective BLUEs. If these BLUEs are arranged as a vector, would that vector be the BLUE of Aβ?

Proposition 4.3.6 Let Ly be an LUE of the estimable vector LPF Aβ. Then Ly is the BLUE of Aβ if and only if every element of Ly
is the BLUE of the corresponding element of Aβ.

Proof. The 'if' part is proved from the fact that the elements of Ly have zero correlation with every LZF. The 'only if' part follows from the fact that the Löwner order between two matrices implies algebraic order between the corresponding diagonal elements. □

Proposition 4.3.6 answers a question but gives rise to another one. Why do we bother about the Löwner order of dispersion matrices, if the BLUE of a vector LPF is nothing but the vector of BLUEs of the elements of that LPF? The reason for our interest in the Löwner order is that it implies several important algebraic orders. Let Ly be the BLUE and Ty be another LUE of Aβ. It follows from Proposition 2.6.2 that tr(D(Ly)) ≤ tr(D(Ty)), that is, the total variance of all the components of Ly is less than that of Ty. This proposition also implies that |D(Ly)| ≤ |D(Ty)|. It can be shown that the volume of an ellipsoidal confidence region of Aβ (see Section 5.2.2) centered at Ty is a monotonically increasing function of |D(Ty)|. Thus, a confidence region centered at the BLUE is the smallest. It also follows from Proposition 2.6.2 that the extreme eigenvalues of D(Ly) are smaller than those of D(Ty). The reader may ponder over the implications of these inequalities.

Example 4.3.7 (Trivial example, continued) For the linear model of Example 4.1.3 it was shown that every LZF is a linear function of y₂ and y₄ (see Example 4.1.7). Since y₁ and y₃ are both uncorrelated with y₂ and y₄, these must be the respective BLUEs of α₁ and α₂. Thus, the 'natural estimator' mentioned in Example 4.1.3 is in fact the BLUE. Another LUE of α₁ is y₁ + y₂, but it has the additional baggage of the LZF y₂, which inflates its variance. The variance of y₁ + y₂ is 2σ², while that of the corresponding BLUE (y₁) is only σ². □
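A small simulation illustrates the 'extra baggage' point of the example. The exact model is an assumption here: E(y) = (α₁, 0, α₂, 0)' with D(y) = σ²I and σ² = 1 (NumPy assumed).

```python
import numpy as np

# Monte Carlo illustration of Example 4.3.7's point.  The exact model is an
# assumption here: E(y) = (a1, 0, a2, 0)', D(y) = sigma^2 I with sigma^2 = 1.
rng = np.random.default_rng(42)
a1, a2 = 3.0, -2.0
n_rep = 200_000
y = rng.normal(loc=[a1, 0.0, a2, 0.0], scale=1.0, size=(n_rep, 4))

lue  = y[:, 0] + y[:, 1]     # y1 + y2: unbiased for a1, but carries the LZF y2
blue = y[:, 0]               # y1: the BLUE of a1

print(round(lue.mean(), 1), round(blue.mean(), 1))   # both near 3.0
print(round(lue.var(), 1), round(blue.var(), 1))     # near 2.0 and 1.0
```

Both estimators are unbiased, but the LZF component roughly doubles the variance of the naive LUE, exactly as the example asserts.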
Example 4.3.8 (Two-way classified data, continued) For the model of Example 4.1.8, C(X) is spanned by the orthogonal vectors u₁, u₂ and u₃, where u₁ is the difference between the last two columns of X and u₂ and u₃ are the second and third columns of X, respectively. It
follows that

    P_X = P_u₁ + P_u₂ + P_u₃ = (1/40) ⎡ 3·11'    11'    11'   −11' ⎤
                                      ⎢  11'   3·11'   −11'    11' ⎥
                                      ⎢  11'   −11'   3·11'    11' ⎥
                                      ⎣ −11'    11'    11'   3·11' ⎦
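This projection matrix can be verified numerically. The design below follows Example 4.1.8 (40 observations in four groups of 10), but the group ordering (β₁,τ₁), (β₂,τ₁), (β₁,τ₂), (β₂,τ₂) is an assumption of this sketch (NumPy assumed).

```python
import numpy as np

# Verification sketch for the projection matrix above.  The design follows
# Example 4.1.8 (40 observations, four groups of 10); the group ordering
# assumed here is (beta1,tau1), (beta2,tau1), (beta1,tau2), (beta2,tau2).
J, Z = np.ones(10), np.zeros(10)
X = np.column_stack([
    np.ones(40),                          # mu
    np.concatenate([J, Z, J, Z]),         # beta_1 indicator
    np.concatenate([Z, J, Z, J]),         # beta_2 indicator
    np.concatenate([J, J, Z, Z]),         # tau_1 indicator
    np.concatenate([Z, Z, J, J]),         # tau_2 indicator
])
P_X = X @ np.linalg.pinv(X)               # orthogonal projector onto C(X)

# Block structure: (1/40) times the 4 x 4 coefficient pattern shown above,
# each entry expanded into a 10 x 10 block of ones.
coef = np.array([[ 3,  1,  1, -1],
                 [ 1,  3, -1,  1],
                 [ 1, -1,  3,  1],
                 [-1,  1,  1,  3]]) / 40.0
assert np.allclose(P_X, np.kron(coef, np.ones((10, 10))))

# The first row of P_X gives the BLUE of mu + beta_1 + tau_1 as a linear
# combination of the four group means.
w = P_X[0].reshape(4, 10).sum(axis=1)
print(np.round(w, 4))                     # [ 0.75  0.25  0.25 -0.25]
```

The recovered weights (3/4, 1/4, 1/4, −1/4) on the group means match the simplified BLUE derived in the text.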
Each block in the above matrix has order 10 × 10. We have already noted that the first observation is an LUE of μ + β₁ + τ₁. The LUE can be expressed as l'y where l is the first column of the 40 × 40 identity matrix. According to Proposition 4.3.3, the BLUE of μ + β₁ + τ₁ is l'P_X y. The BLUE simplifies to ¾ȳ₁ + ¼ȳ₂ + ¼ȳ₃ − ¼ȳ₄, where the quantity ȳᵢ is the average of the observations y_{10(i−1)+1}, …, y_{10(i−1)+10}, for i = 1, 2, 3, 4. Likewise, the BLUE of μ + β₁ + τ₂ is ¼ȳ₁ − ¼ȳ₂ + ¾ȳ₃ + ¼ȳ₄. The BLUE of τ₁ − τ₂ …

… ‖y − XGy‖². Thus, β̂ is a least squares estimator.

4.4 Maximum likelihood estimation
If the errors in the linear model (y, Xβ, σ²I) are assumed to have a multivariate normal distribution, then the likelihood of the observation vector y is

    (2πσ²)^{−n/2} exp[−(1/(2σ²))(y − Xβ)'(y − Xβ)].

It is clear that a maximum likelihood estimator (MLE) of β is a minimizer of the quadratic form (y − Xβ)'(y − Xβ), which is by definition an LSE. Substituting the maximized quadratic form into the likelihood and maximizing it with respect to …

… where x₍₂₎ = x₍₁₎ + αv, α being a small number and v being a vector such that v'v = x₍₁₎'x₍₁₎ and v'x₍₁₎ = 0. It follows that
    Var(p'β̂) = σ²p'(X'X)⁻¹p = (σ²/(α²x₍₁₎'x₍₁₎)) p' ⎡ 1+α²  −1 ⎤ p.
                                                    ⎣  −1    1 ⎦

In particular, if β = (β₁ : β₂)', then by choosing p = (1 : −1)' we have
    Var(β̂₁ − β̂₂) = σ²(4 + α²)/(α²x₍₁₎'x₍₁₎),

which can be very large if α is small. If α → 0, then the variance explodes. Of course, β₁ − β₂ is no longer estimable when α = 0. On the other hand,
    Var(β̂₁ + β̂₂) = σ²/(x₍₁₎'x₍₁₎),
which does not depend on α. Thus, the variance of β̂₁ + β̂₂ is not affected by collinearity. Note that β₁ + β₂ remains estimable even if α = 0. □

The above example shows that the variance of certain BLUEs may be very high because of collinearity, while the variance of some other BLUEs may not be affected. It also shows that non-estimability of parameters is an extreme form of collinearity. Let us try to appreciate the above points in a general set-up. The presence of collinearity implies that there is a vector v of unit norm so that the linear combination of the columns of X given by Xv is very close to the zero vector. In such a case we have a small value of v'X'Xv, which can be interpreted as a dearth of information in the direction of v (see the expression of the information matrix given on page 133). We may informally refer to such a unit vector as a direction of collinearity. The unit vector v can be written as

    v = Σᵢ₌₁ᵏ (v'vᵢ)vᵢ,

where Σᵢ₌₁ᵏ λᵢvᵢvᵢ' is a spectral decomposition of X'X, the eigenvalues being in decreasing order. It follows that

    ‖Xv‖² = v'X'X Σᵢ₌₁ᵏ (v'vᵢ)vᵢ = v' Σᵢ₌₁ᵏ (v'vᵢ)λᵢvᵢ = Σᵢ₌₁ᵏ λᵢ(v'vᵢ)².

The above is the smallest when v = vₖ. Therefore, a very small value of λₖ signifies the presence of collinearity. When λₖ is very small, vₖ is a direction of collinearity. When X'X has several small eigenvalues, the corresponding eigenvectors are directions of collinearity. All unit vectors which are linear combinations of these eigenvectors are also directions of collinearity.

Let X have full column rank, so that all LPFs are estimable. Then
    Var(p'β̂) = σ²p'(X'X)⁻¹p = σ² Σᵢ₌₁ᵏ (p'vᵢ)²/λᵢ.        (4.12.1)
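Both the collinearity example above and (4.12.1) can be checked numerically. The concrete vectors below are assumed choices satisfying v'v = x₍₁₎'x₍₁₎ = 4 and v'x₍₁₎ = 0 (NumPy assumed).

```python
import numpy as np

# Numeric check of the collinearity example above and of (4.12.1).  The
# concrete vectors are assumed choices with v'v = x1'x1 = 4 and v'x1 = 0.
x1 = np.ones(4)
v = np.array([1., -1., 1., -1.])
alpha = 0.01                                   # small alpha: near-collinear columns
X = np.column_stack([x1, x1 + alpha * v])

XtX_inv = np.linalg.inv(X.T @ X)
var_diff = np.array([1., -1.]) @ XtX_inv @ np.array([1., -1.])  # Var/sigma^2 of b1 - b2
var_sum  = np.array([1.,  1.]) @ XtX_inv @ np.array([1.,  1.])  # Var/sigma^2 of b1 + b2

# Closed forms from the example: (4 + alpha^2)/(alpha^2 x1'x1) and 1/(x1'x1).
assert np.isclose(var_diff, (4 + alpha ** 2) / (alpha ** 2 * 4))
assert np.isclose(var_sum, 0.25)

# (4.12.1): the same variance through the spectral decomposition of X'X;
# the tiny eigenvalue (about 2 alpha^2) is what inflates Var of b1 - b2.
lam, V = np.linalg.eigh(X.T @ X)               # eigenvalues in ascending order
p = np.array([1., -1.])
assert np.isclose(var_diff, ((p @ V) ** 2 / lam).sum())
print(lam[0], round(var_diff))                 # tiny eigenvalue, huge variance
```

With α = .01 the variance of β̂₁ − β̂₂ is about 10000σ²/4, while that of β̂₁ + β̂₂ stays at σ²/4, mirroring the contrast drawn in the example.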
If p is proportional to vₖ, then the variance of its BLUE is σ²p'p/λₖ. The presence of collinearity would mean that λₖ is small and therefore
this variance is large. As λₖ → 0, the variance goes to infinity. (When λₖ = 0, we have a rank-deficient X matrix with vₖ ∉ C(X'), and so p'β is not estimable at all.) A similar argument can be given if p has a substantial component (p'vᵢ) along an eigenvector (vᵢ) corresponding to any small eigenvalue (λᵢ) of X'X. If p has zero component along all the eigenvectors corresponding to small eigenvalues, then the reciprocals of the smaller eigenvalues do not contribute to the right-hand side of (4.12.1), and thus Var(p'β̂) is not very large. In summary, not all 'estimable' LPFs are estimable with equal precision. Some LPFs can be estimated with greater precision than others. When there is collinearity, there are some LPFs which are estimable but the corresponding BLUEs have very little precision (that is, these have a very high variance). Non-estimable LPFs can be viewed as extreme cases of LPFs which can be linearly estimated with less precision. When an experiment is designed, one has to choose the matrix X in a way that ensures that the LPFs of interest are estimable with sufficient precision. The directions of collinearity can also be interpreted as directions of data inadequacy. To see this, write ‖Xv‖² as
    ‖Xv‖² = v'X'Xv = Σᵢ₌₁ⁿ (xᵢ'v)²,

where x₁', x₂', …, xₙ' are the rows of X. The ith component of y is an observed value of xᵢ'β (with error). If ‖Xv‖² is small, every (xᵢ'v)² is small. If this is the case, none of the observations carry much information about v'β. This explains why the BLUE of this LPF has a large variance. If (xᵢ'v)² = 0 for all i, the observations do not carry any information about v'β. In this extreme case v'β is not estimable. If one has a priori knowledge of an approximate relationship among certain columns of X, and confines estimation to linear combinations which are orthogonal to these, then collinearity would not have much effect on the inference. This is analogous to the fact, as seen in Example 4.1.8, that estimable functions can be estimated even if there are one or more exact linear relationships involving the columns of X. If collinearity arises because of a known linear constraint, its impact
on the precision of affected BLUEs can be reduced easily by incorporating this constraint into the model. Proposition 4.9.3 assures us that the restriction would reduce the variances of the BLUEs. If the cause of collinearity is not so obvious, then the eigenvectors corresponding to the small eigenvalues of X'X point to the variables which appear to be linearly related. If β is estimable, then the extent of collinearity can be measured by the variance inflation factors,
    VIFⱼ = σ⁻² Var(β̂ⱼ) ‖x₍ⱼ₎‖²,    j = 1, 2, …, k,        (4.12.2)
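For the model (y, Xβ, σ²I) with X of full column rank, Var(β̂ⱼ) = σ²[(X'X)⁻¹]ⱼⱼ, so (4.12.2) reduces to VIFⱼ = [(X'X)⁻¹]ⱼⱼ‖x₍ⱼ₎‖². The sketch below (with an arbitrary illustrative design matrix, NumPy assumed) checks two properties of this measure: every VIF is at least 1, and rescaling a column leaves the VIFs unchanged.

```python
import numpy as np

# Sketch of (4.12.2) for the model (y, X beta, sigma^2 I) with X of full
# column rank, where VIF_j reduces to [(X'X)^{-1}]_{jj} ||x_(j)||^2.
# The design matrix is an arbitrary illustration.
def vif(X):
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.diag(XtX_inv) * (X ** 2).sum(axis=0)

X = np.array([[1., 0.1],
              [1., 0.2],
              [1., 0.3],
              [1., 1.0]])
v = vif(X)

# Every variance inflation factor is at least 1.
assert np.all(v >= 1 - 1e-12)

# Rescaling a column (a change of units) leaves the VIFs unchanged.
X_scaled = X.copy()
X_scaled[:, 1] *= 100.0
assert np.allclose(vif(X_scaled), v)

print(np.round(v, 3))
```

The scale invariance is exactly the role of the ‖x₍ⱼ₎‖² factor in the definition.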
where x₍ⱼ₎ is the jth column of X. The factor σ⁻² ensures that VIFⱼ depends only on the matrix X, while the factor ‖x₍ⱼ₎‖² ensures that this measure is not altered by a change in scale of the corresponding variable (see Exercise 4.38). All the variance inflation factors are greater than or equal to 1, and a very large value indicates that the variance of the BLUE of the corresponding parameter is inflated due to collinearity. Alternative measures of collinearity can be found in Belsley (1991) and Sengupta and Bhimasankaram (1997, see also Exercise 4.40). Since collinearity tends to inflate variances (and hence the mean squared error) of certain BLUEs, one may seek to reduce the MSE by adopting a biased estimation strategy in the case of collinear data. Some of these alternative estimators are discussed in Sections 7.9.2 and 11.3.

4.13 Exercises
4.1 The linear model (y_{n×1}, Xβ, σ²I) is said to be saturated if the error degrees of freedom, n − ρ(X), is equal to zero. Show that in a saturated model, every linear unbiased estimator is the corresponding BLUE.
4.2 Show that all the components of β in the model (y, Xβ, σ²I) are estimable if and only if X has full column rank, and that in such a case, every LPF is estimable.
4.3 If there is no linear unbiased estimator of the LPF Aβ in the model (y, Xβ, σ²I), show that there is no nonlinear unbiased estimator of Aβ.
4.4 Consider the model yᵢ = β₁ + β₂ + β₃ + ⋯ + βᵢ + εᵢ, …
… ≤ t_{n−r,α}, we have

    P[ (p'β̂ − p'β)/√(σ̂²p'(X'X)⁻p) ≤ t_{n−r,α} ]
        = P[ p'β ≥ p'β̂ − t_{n−r,α}√(σ̂²p'(X'X)⁻p) ]
        = 1 − α.
This gives a 100(1 − α)% lower confidence limit for p'β; that is,
    [p'β̂ − t_{n−r,α}√(σ̂²p'(X'X)⁻p), ∞)        (5.2.1)
is a 100(1 − α)% (one-sided) confidence interval for p'β. Similar arguments lead to the other one-sided confidence interval (or upper confidence limit)

    (−∞, p'β̂ + t_{n−r,α}√(σ̂²p'(X'X)⁻p)],        (5.2.2)
and the two-sided confidence interval

    [p'β̂ − t_{n−r,α/2}√(σ̂²p'(X'X)⁻p), p'β̂ + t_{n−r,α/2}√(σ̂²p'(X'X)⁻p)].        (5.2.3)

If βⱼ, the jth component of β, is estimable, then we can obtain one- or two-sided confidence intervals for βⱼ as above, by choosing p' = (0 : ⋯ : 0 : 1 : 0 : ⋯ : 0), with 1 in the jth place. In such a case, p'(X'X)⁻p is the jth diagonal element of (X'X)⁻, which does not depend on the choice of the g-inverse (as we noted on page 111).

Example 5.2.1 (Two-way classified data, continued) Consider once again the model of Example 4.1.8. It was shown in Example 4.6.2 that the BLUEs τ̂₁ − τ̂₂ and β̂₁ − β̂₂ each have variance σ²/10. Suppose that we choose confidence coefficient .95, corresponding to α = .05. Since t_{37,.025}/√10 = 2.026/√10 = .6407, a two-sided 95% confidence interval for τ₁ − τ₂ is

    [τ̂₁ − τ̂₂ − .6407σ̂,  τ̂₁ − τ̂₂ + .6407σ̂].
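The interval (5.2.3) can be exercised in a small simulation of the setting of Example 5.2.1. The data, group ordering, cell means and σ = 1 below are assumed choices; the quantile t_{37,.025} = 2.026 is the one quoted in the example (NumPy assumed).

```python
import numpy as np

# Monte Carlo sketch of the interval (5.2.3) in the layout of Example 5.2.1.
# The data are simulated; the group ordering, cell means and sigma = 1 are
# assumed choices, and t_{37,.025} = 2.026 is the quantile quoted in the text.
rng = np.random.default_rng(7)
J, Z = np.ones(10), np.zeros(10)
X = np.column_stack([np.ones(40),
                     np.concatenate([J, Z, J, Z]), np.concatenate([Z, J, Z, J]),
                     np.concatenate([J, J, Z, Z]), np.concatenate([Z, Z, J, J])])
P_X = X @ np.linalg.pinv(X)
means = np.repeat([13., 11., 9., 7.], 10)          # additive cell means, tau1 - tau2 = 4
true_diff = 4.0

B = 2000
y = means + rng.normal(0., 1., size=(B, 40))
gbar = y.reshape(B, 4, 10).mean(axis=2)
blue = (gbar[:, 0] + gbar[:, 1]) / 2 - (gbar[:, 2] + gbar[:, 3]) / 2  # BLUE of tau1 - tau2
s2 = ((y - y @ P_X) ** 2).sum(axis=1) / 37         # sigma_hat^2, with n - r = 37
half = 2.026 * np.sqrt(s2 / 10)                    # Var(BLUE) = sigma^2 / 10

coverage = np.mean(np.abs(blue - true_diff) <= half)
print(coverage)                                    # close to the nominal 0.95
```

The empirical coverage of the interval sits near the nominal 95%, as the t-based pivot guarantees under normal errors.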
Chapter 5 : Further Inference in the Linear Model
On the other hand, t_{37,.05}/√10 = 1.687/√10 = .5335. Hence, left- and right-sided 95% confidence intervals for τ₁ − τ₂ are [τ̂₁ − τ̂₂ − .5335σ̂, ∞) and (−∞, τ̂₁ − τ̂₂ + .5335σ̂], respectively. Confidence intervals for β₁ − β₂ can be obtained similarly. □

5.2.2 Confidence region for a vector LPF
Construction of a confidence region for the vector LPF Aβ is a meaningful task only if Aβ is estimable. Note that if y ~ N(Xβ, σ²I) and Aβ is estimable, then Aβ̂ ~ N(Aβ, σ²A(X'X)⁻A'). Therefore, from Exercise 3.2,

    (Aβ̂ − Aβ)'[A(X'X)⁻A']⁻(Aβ̂ − Aβ)/σ² ~ χ²ₘ,

where m is the rank of A. Since the BLUEs are independent of the LZFs, the above quadratic form is independent of σ̂². Consequently,

    [(Aβ̂ − Aβ)'[A(X'X)⁻A']⁻(Aβ̂ − Aβ)/(mσ²)] / [(n − ρ(X))σ̂²/(σ²(n − r))]
        = (Aβ̂ − Aβ)'[A(X'X)⁻A']⁻(Aβ̂ − Aβ)/(mσ̂²) ~ F_{m,n−r},

where F_{m,n−r} represents the F-distribution with m and n − r degrees of freedom (see Definition 3.2.5). If F_{m,n−r,α} is the (1 − α) quantile of this distribution, we have

    P[(Aβ̂ − Aβ)'[A(X'X)⁻A']⁻(Aβ̂ − Aβ) ≤ mσ̂²F_{m,n−r,α}] = 1 − α.

The resulting confidence region for Aβ is an m-dimensional ellipsoid given by

    {Aβ : (Aβ̂ − Aβ)'[A(X'X)⁻A']⁻(Aβ̂ − Aβ) ≤ mσ̂²F_{m,n−r,α},
          (Aβ̂ − Aβ) ∈ C(A(X'X)⁻A')}.        (5.2.4)
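The region (5.2.4) can be checked by simulation for the vector LPF Aβ = (β₁ − β₂ : τ₁ − τ₂)' in the two-way layout used in the examples. SciPy is assumed for the F quantile; the design, group ordering and simulated parameter values are illustrative choices.

```python
import numpy as np
from scipy.stats import f as f_dist

# Monte Carlo sketch of the ellipsoidal region (5.2.4) for the vector LPF
# A beta = (beta1 - beta2 : tau1 - tau2)'.  SciPy supplies the F quantile;
# design, group ordering and parameter values are assumed choices.
rng = np.random.default_rng(11)
J, Z = np.ones(10), np.zeros(10)
X = np.column_stack([np.ones(40),
                     np.concatenate([J, Z, J, Z]), np.concatenate([Z, J, Z, J]),
                     np.concatenate([J, J, Z, Z]), np.concatenate([Z, Z, J, J])])
A = np.array([[0., 1., -1., 0., 0.],
              [0., 0., 0., 1., -1.]])
m, n, r = 2, 40, 3
Xp = np.linalg.pinv(X)
M_inv = np.linalg.inv(A @ np.linalg.pinv(X.T @ X) @ A.T)   # [A(X'X)^- A']^{-1}
crit = m * f_dist.ppf(0.95, m, n - r)                      # m F_{m,n-r,.05}

beta_true = np.array([10., 1., -1., 2., -2.])
true_lpf = A @ beta_true

B = 2000
y = X @ beta_true + rng.normal(0., 1., size=(B, 40))       # sigma^2 = 1
d = y @ Xp.T @ A.T - true_lpf                              # A beta_hat - A beta
s2 = ((y - y @ (X @ Xp)) ** 2).sum(axis=1) / (n - r)       # sigma_hat^2
inside = np.einsum('bi,ij,bj->b', d, M_inv, d) <= crit * s2

print(inside.mean())                                       # close to the nominal 0.95
```

Here A(X'X)⁻A' is nonsingular (both contrasts have variance σ²/10 and are uncorrelated in this balanced design), so the g-inverse in (5.2.4) is an ordinary inverse.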
Example 5.2.2 (Two-way classified data, continued) Consider the vector LPF (β₁ − β₂ : τ₁ − τ₂)' …

… P(E₁ ∩ ⋯ ∩ E_q) ≥ 1 − [P(Ē₁) + ⋯ + P(Ē_q)] = 1 − qα, and

    P(E₁ ∩ ⋯ ∩ E_q) ≤ P(E₁) = 1 − α.

The above inequalities can be summarized to produce the following bounds for the coverage probability of the simultaneous confidence intervals:

    1 − qα ≤ P(E₁ ∩ ⋯ ∩ E_q) ≤ 1 − α.
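A tiny numeric illustration of these bounds, for q = 2 intervals each of individual level 1 − α = .95; the independence value .95² is only a reference point added here, since the bounds themselves require no independence assumption.

```python
# Numeric illustration of the Bonferroni bounds for q = 2 intervals each of
# individual level 1 - alpha = .95.  The independence value .95^2 is only a
# reference point; the bounds themselves require no independence assumption.
q, alpha = 2, 0.05
lower, upper = 1 - q * alpha, 1 - alpha
indep = (1 - alpha) ** q

assert lower <= indep <= upper
print(lower, round(indep, 4), upper)   # 0.9 0.9025 0.95
```

So two 95% intervals hold simultaneously with probability somewhere between 90% and 95%, whatever their dependence structure.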