Optimal Design of Experiments
SIAM's Classics in Applied Mathematics series consists of books that were previously allowed to go out of print. These books are republished by SIAM as a professional service because they continue to be important resources for mathematical scientists. Editor-in-Chief Robert E. O'Malley, jr., University of Washington Editorial Board Richard A. Brualdi, University of Wisconsin-Madison Nicholas J. Higham, University of Manchester Leah Edelstein-Keshet, University of British Columbia Herbert B. Keller, California Institute of Technology Andrzej Z. Manitius, George Mason University Hilary Ockendon, University of Oxford Ingram Olkin, Stanford University Peter Olver, University of Minnesota Ferdinand Verhulst, Mathematisch Instituut, University of Utrecht Classics in Applied Mathematics C. C. Lin and L. A. Segel, Mathematics Applied to Deterministic Problems in the Natural Sciences Johan G. F. Belinfante and Bernard Kolman, A Survey of Lie Groups and Lie Algebras with Applications and Computational Methods James M. Ortega, Numerical Analysis: A Second Course Anthony V. Fiacco and Garth P. McCormick, Nonlinear Programming: Sequential Unconstrained Minimization Techniques F. H. Clarke, Optimization and Nonsmooth Analysis George F. Carrier and Carl E. Pearson, Ordinary Differential Equations Leo Breiman, Probability R. Bellman and G. M. Wing, An Introduction to Invariant Imbedding Abraham Berman and Robert J. Plemmons, Nonnegative Matrices in the Mathematical Sciences Olvi L. Mangasarian, Nonlinear Programming *Carl Friedrich Gauss, Theory of the. Combination of Observations Least Subject to Errors: Part One, Part Two, Supplement. Translated by G. W. Stewart Richard Bellman, Introduction to Matrix Analysis U. M. Ascher, R. M. M. Mattheij, and R. D. Russell, Numerical Solution of Boundary Value Problems for Ordinary Differential Equations K. E. Brenan, S. L. Campbell, and L. R. Pel2old, Numerical Solution of Initial-Value Problems in Differential-Algebraic Equations Charles L. Lawson and Richard J. Hanson, Solving Least Squares Problems J. E. Dennis, Jr. and Robert B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations Richard E. Barlow and Frank Proschan, Mathematical Theory of Reliability Cornelius Lanczos, Linear Differential Operators Richard Bellman, Introduction to Matrix Analysis, Second Edition Beresford N. Parlett, The Symmetric Eigenvalue Problem Richard Haberman, Mathematical Models: Mechanical Vibrations, Population Dynamics, and Traffic Flow *First time in print.
Classics in Applied Mathematics (continued) Peter W. M. John, Statistical Design and Analysis of Experiments Tamer Basar and Geert Jan Olsder, Dynamic Noncooperative Game Theory, Second Edition Emanuel Parzen, Stochastic Processes Petar Kokotovic, Hassan K. Khalil, and John O'Reilly, Singular Perturbation Methods in Control: Analysis and Design Jean Dickinson Gibbons, Ingram Olkin, and Milton Sobel, Selecting and Ordering Populations: A New Statistical Methodology James A. Murdock, Perturbations: Theory and Methods Ivar Ekeland and Roger Temam, Convex Analysis and Variational Problems Ivar Stakgold, Boundary Value Problems of Mathematical Physics, Volumes I and II J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables David Kinderlehrer and Guido Stampacchia, An Introduction to Variational Inequalities and Their Applications F Natterer, The Mathematics of Computerized Tomography Avinash C. Kak and Malcolm Slaney, Principles of Computerized Tomographic Imaging R. Wong, Asymptotic Approximations of Integrals O. Axelsson and V. A. Barker, Finite Element Solution of Boundary Value Problems: Theory and Computation David R. Brillinger, Time Series: Data Analysis and Theory Joel N. Franklin, Methods of Mathematical Economics: Linear and Nonlinear Programming, Fixed-Point Theorems Philip Hartman, Ordinary Differential Equations, Second Edition Michael D. Intriligator, Mathematical Optimization and Economic Theory Philippe G. Ciarlet, The Finite Element Method for Elliptic Problems Jane K. Cullum and Ralph A. Willoughby, Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Vol. I: Theory M. Vidyasagar, Nonlinear Systems Analysis, Second Edition Robert Mattheij and Jaap Molenaar, Ordinary Differential Equations in Theory and Practice Shanti S. Gupta and S. Panchapakesan, Multiple Decision Procedures: Theory and Methodology of Selecting and Ranking Populations Eugene L. Allgower and Kurt Georg, Introduction to Numerical Continuation Methods Leah Edelstein-Keshet, Mathematical Models in Biology Heinz-Otto Kreiss and Jens Lorenz, Initial-Boundary Value Problems and the Navier-Stokes Equations J. L. Hodges, Jr. and E. L. Lehmann, Basic Concepts of Probability and Statistics, Second Edition George F Carrier, Max Krook, and Carl E. Pearson, Functions of a Complex Variable: Theory and Technique Friedrich Pukelsheim, Optimal Design of Experiments
Optimal Design of Experiments Friedrich Pukelsheim University of Augsburg Augsburg, Germany
Society for Industrial and Applied Mathematics Philadelphia
Copyright © 2006 by the Society for Industrial and Applied Mathematics This SIAM edition is an unabridged republication of the work first published by John Wiley & Sons, Inc., New York, 1993.
10987654321 All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688. Library of Congress Cataloging-in-Publication Data: Pukelsheim, Friedrich, 1948Optimal design of experiments / Friedrich Pukelsheim.— Classic ed. p. cm. — (Classics in applied mathematics ; 50) Originally published: New York : J. Wiley, 1993. Includes bibliographical references and index. ISBN 0-89871-604-7 (pbk.) 1. Experimental design. 1. Title. II. Series. QA279.P85 2006 519.5'7--dc22
2005056407 Partial royalties from the sale of this book are placed in a fund to help students attend SIAM meetings and other SIAM-related activities. This fund is administered by SIAM, and qualified individuals are encouraged to write directly to SIAM for guidelines.
SIAM is a registered trademark.
Contents Preface to the Classics Edition, xvii Preface, xix List of Exhibits, xxi Interdependence of Chapters, xxiv Outline of the Book, xxv Errata, xxix 1. Experimental Designs in Linear Models 1.1. 1.2. 1.3. 1.4. 1.5. 1.6. 1.7. 1.8. 1.9. 1.10. 1.11. 1.12. 1.13. 1.14. 1.15. 1.16. 1.17. 1.18. 1.19. 1.20. 1.21. 1.22.
Deterministic Linear Models, 1 Statistical Linear Models, 2 Classical Linear Models with Moment Assumptions, 3 Classical Linear Models with Normality Assumption, 4 Two-Way Classification Models, 4 Polynomial Fit Models, 6 Euclidean Matrix Space, 7 Nonnegative Definite Matrices, 9 Geometry of the Cone of Nonnegative Definite Matrices, 10 The Loewner Ordering of Symmetric Matrices, 11 Monotonic Matrix Functions, 12 Range and Nullspace of a Matrix, 13 Transposition and Orthogonality, 14 Square Root Decompositions of a Nonnegative Definite Matrix, 15 Distributional Support of Linear Models, 15 Generalized Matrix Inversion and Projections, 16 Range Inclusion Lemma, 17 General Linear Models, 18 The Gauss-Markov Theorem, 20 The Gauss-Markov Theorem under a Range Inclusion Condition, 21 The Gauss-Markov Theorem for the Full Mean Parameter System, 22 Projectors, Residual Projectors, and Direct Sum Decomposition, 23 Vll
1.23. 1.24. 1.25. 1.26. 1.27. 1.28.
Optimal Estimators in Classical Linear Models, 24 Experimental Designs and Moment Matrices, 25 Model Matrix versus Design Matrix, 27 Geometry of the Set of All Moment Matrices, 29 Designs for Two-Way Classification Models, 30 Designs for Polynomial Fit Models, 32 Exercises, 33
2. Optimal Designs for Scalar Parameter Systems 2.1. 2.2. 2.3. 2.4. 2.5. 2.6. 2.7. 2.8. 2.9. 2.10. 2.11. 2.12. 2.13. 2.14. 2.15. 2.16. 2.17. 2.18. 2.19. 2.20. 2.21. 2.22. 2.23.
Parameter Systems of Interest and Nuisance Parameters, 35 Estimability of a One-Dimensional Subsystem, 36 Range Summation Lemma, 37 Feasibility Cones, 37 The Ice-Cream Cone, 38 Optimal Estimators under a Given Design, 41 The Design Problem for Scalar Parameter Subsystems, 41 Dimensionality of the Regression Range, 42 Elfving Sets, 43 Cylinders that Include the Elfving Set, 44 Mutual Boundedness Theorem for Scalar Optimality, 45 The Elfving Norm, 47 Supporting Hyperplanes to the Elfving Set, 49 The Elfving Theorem, 50 Projectors for Given Subspaces, 52 Equivalence Theorem for Scalar Optimality, 52 Bounds for the Optimal Variance, 54 Eigenvectors of Optimal Moment Matrices, 56 Optimal Coefficient Vectors for Given Moment Matrices, 56 Line Fit Model, 57 Parabola Fit Model, 58 Trigonometric Fit Models, 58 Convexity of the Optimality Criterion, 59 Exercises, 59
3. Information Matrices 3.1. Subsystems of Interest of the Mean Parameters, 61 3.2. Information Matrices for Full Rank Subsystems, 62 3.3. Feasibility Cones, 63
3.4. 3.5. 3.6. 3.7. 3.8. 3.9. 3.10. 3.11. 3.12. 3.13. 3.14. 3.15. 3.16. 3.17. 3.18. 3.19. 3.20. 3.21. 3.22. 3.23. 3.24. 3.25.
Estimability, 64 Gauss-Markov Estimators and Predictors, 65 Testability, 67 F-Test of a Linear Hypothesis, 67 ANOVA, 71 Identifiability, 72 Fisher Information, 72 Component Subsets, 73 Schur Complements, 75 Basic Properties of the Information Matrix Mapping, 76 Range Disjointness Lemma, 79 Rank of Information Matrices, 81 Discontinuity of the Information Matrix Mapping, 82 Joint Solvability of Two Matrix Equations, 85 Iterated Parameter Subsystems, 85 Iterated Information Matrices, 86 Rank Deficient Subsystems, 87 Generalized Information Matrices for Rank Deficient Subsystems, 88 Generalized Inverses of Generalized Information Matrices, 90 Equivalence of Information Ordering and Dispersion Ordering, 91 Properties of Generalized Information Matrices, 92 Contrast Information Matrices in Two-Way Classification Models, 93 Exercises, 96
4. Loewner Optimality 4.1. 4.2. 4.3. 4.4. 4.5. 4.6. 4.7. 4.8. 4.9.
Sets of Competing Moment Matrices, 98 Moment Matrices with Maximum Range and Rank, 99 Maximum Range in Two-Way Classification Models, 99 Loewner Optimality, 101 Dispersion Optimality and Simultaneous Scalar Optimality, 102 General Equivalence Theorem for Loewner Optimality, 103 Nonexistence of Loewner Optimal Designs, 104 Loewner Optimality in Two-Way Classification Models, 105 The Penumbra of the Set of Competing Moment Matrices, 107
4.10. 4.11. 4.12. 4.13.
Geometry of the Penumbra, 108 Existence Theorem for Scalar Optimality, 109 Supporting Hyperplanes to the Penumbra, 110 General Equivalence Theorem for Scalar Optimality, 111 Exercises, 113
5. Real Optimality Criteria 5.1. 5.2. 5.3. 5.4. 5.5. 5.6. 5.7. 5.8. 5.9. 5.10. 5.11. 5.12. 5.13. 5.14. 5.15. 5.16. 5.17.
Positive Homogeneity, 114 Superadditivity and Concavity, 115 Strict Superadditivity and Strict Concavity, 116 Nonnegativity and Monotonicity, 117 Positivity and Strict Monotonicity, 118 Real Upper Semicontinuity, 118 Semicontinuity and Regularization, 119 Information Functions, 119 Unit Level Sets, 120 Function-Set Correspondence, 122 Functional Operations, 124 Polar Information Functions and Polar Norms, 125 Polarity Theorem, 127 Compositions with the Information Matrix Mapping, 129 The General Design Problem, 131 Feasibility of Formally Optimal Moment Matrices, 132 Scalar Optimality, Revisited, 133 Exercises, 134
6. Matrix Means 6.1. 6.2. 6.3. 6.4. 6.5. 6.6. 6.7. 6.8. 6.9. 6.10. 6.11.
Classical Optimality Criteria, 135 D-Criterion, 136 A-Criterion, 137 E-Criterion, 137 T-Criterion, 138 Vector Means, 139 Matrix Means, 140 Diagonality of Symmetric Matrices, 142 Vector Majorization, 144 Inequalities for Vector Majorization, 146 The Holder Inequality, 147
6.12. 6.13. 6.14. 6.15. 6.16. 6.17.
Polar Matrix Means, 149 Matrix Means as Information Functions and Norms, 151 The General Design Problem with Matrix Means, 152 Orthogonality of Two Nonnegative Definite Matrices, 153 Polarity Equation, 154 Maximization of Information versus Minimization of Variance, 155 Exercises, 156
7. The General Equivalence Theorem 7.1. 7.2. 7.3. 7.4. 7.5. 7.6. 7.7. 7.8. 7.9. 7.10. 7.11. 7.12. 7.13. 7.14. 7.15. 7.16. 7.17. 7.18. 7.19. 7.20. 7.21. 7.22. 7.23. 7.24.
Subgradients and Subdifferentials, 158 Normal Vectors to a Convex Set, 159 Full Rank Reduction, 160 Subgradient Theorem, 162 Subgradients of Isotonic Functions, 163 A Chain Rule Motivation, 164 Decomposition of Subgradients, 165 Decomposition of Subdifferentials, 167 Subgradients of Information Functions, 168 Review of the General Design Problem, 170 Mutual Boundedness Theorem for Information Functions, 171 Duality Theorem, 172 Existence Theorem for Optimal Moment Matrices, 174 The General Equivalence Theorem, 175 General Equivalence Theorem for the Full Parameter Vector, 176 Equivalence Theorem, 176 Equivalence Theorem for the Full Parameter Vector, 177 Merits and Demerits of Equivalence Theorems, 177 General Equivalence Theorem for Matrix Means, 178 Equivalence Theorem for Matrix Means, 180 General Equivalence Theorem for E-Optimality, 180 Equivalence Theorem for E-Optimality, 181 E-Optimality, Scalar Optimality, and Eigenvalue Simplicity, 183 E-Optimality, Scalar Optimality, and Elfving Norm, 183 Exercises, 185
8. Optimal Moment Matrices and Optimal Designs 8.1. 8.2. 8.3. 8.4. 8.5. 8.6. 8.7. 8.8. 8.9. 8.10. 8.11. 8.12. 8.13. 8.14. 8.15. 8.16. 8.17. 8.18. 8.19.
From Moment Matrices to Designs, 187 Bound for the Support Size of Feasible Designs, 188 Bound for the Support Size of Optimal Designs, 190 Matrix Convexity of Outer Products, 190 Location of the Support Points of Arbitrary Designs, 191 Optimal Designs for a Linear Fit over the Unit Square, 192 Optimal Weights on Linearly Independent Regression Vectors, 195 A-Optimal Weights on Linearly Independent Regression Vectors, 197 C-Optimal Weights on Linearly Independent Regression Vectors, 197 Nonnegative Definiteness of Hadamard Products, 199 Optimal Weights on Given Support Points, 199 Bound for Determinant Optimal Weights, 201 Multiplicity of Optimal Moment Matrices, 201 Multiplicity of Optimal Moment Matrices under Matrix Means, 202 Simultaneous Optimality under Matrix Means, 203 Matrix Mean Optimality for Component Subsets, 203 Moore-Penrose Matrix Inversion, 204 Matrix Mean Optimality for Rank Deficient Subsystems, 205 Matrix Mean Optimality in Two-Way Classification Models, 206 Exercises, 209
9. D-, A-, E-, T-Optimality 9.1. 9.2. 9.3. 9.4. 9.5. 9.6. 9.7. 9.8. 9.9. 9.10. 9.11.
D-, A-, E-, T-Optimality, 210 G-Criterion, 210 Bound for Global Optimality, 211 The Kiefer-Wolfowitz Theorem, 212 D-Optimal Designs for Polynomial Fit Models, 213 Arcsin Support Designs, 217 Equivalence Theorem for A-Optimality, 221 L-Criterion, 222 A-Optimal Designs for Polynomial Fit Models, 223 Chebyshev Polynomials, 226 Lagrange Polynomials with Arcsin Support Nodes, 227
9.12. 9.13. 9.14. 9.15. 9.16. 9.17.
Scalar Optimality in Polynomial Fit Models, I, 229 E-Optimal Designs for Polynomial Fit Models, 232 Scalar Optimality in Polynomial Fit Models, II, 237 Equivalence Theorem for T-Optimality, 240 Optimal Designs for Trigonometric Fit Models, 241 Optimal Designs under Variation of the Model, 243 Exercises, 245
10. Admissibility of Moment and Information Matrices 10.1. 10.2. 10.3. 10.4. 10.5. 10.6. 10.7. 10.8. 10.9. 10.10. 10.11. 10.12. 10.13. 10.14. 10.15.
Admissible Moment Matrices, 247 Support Based Admissibility, 248 Admissibility and Completeness, 248 Positive Polynomials as Quadratic Forms, 249 Loewner Comparison in Polynomial Fit Models, 251 Geometry of the Moment Set, 252 Admissible Designs in Polynomial Fit Models, 253 Strict Monotonicity, Unique Optimality, and Admissibility, 256 E-Optimality and Admissibility, 257 T-Optimality and Admissibility, 258 Matrix Mean Optimality and Admissibility, 260 Admissible Information Matrices, 262 Loewner Comparison of Special C-Matrices, 262 Admissibility of Special C-Matrices, 264 Admissibility, Minimaxity, and Bayes Designs, 265 Exercises, 266
11. Bayes Designs and Discrimination Designs 11.1. 11.2. 11.3.
Bayes Linear Models with Moment Assumptions, 268 Bayes Estimators, 270 Bayes Linear Models with Normal-Gamma Prior Distributions, 272 11.4. Normal-Gamma Posterior Distributions, 273 11.5. The Bayes Design Problem, 275 11.6. General Equivalence Theorem for Bayes Designs, 276 11.7. Designs with Protected Runs, 277 11.8. General Equivalence Theorem for Designs with Bounded Weights, 278 11.9. Second-Degree versus Third-Degree Polynomial Fit Models, I, 280
11.10. 11.11. 11.12. 11.13. 11.14. 11.15. 11.16. 11.17. 11.18. 11.19. 11.20. 11.21. 11.22.
Mixtures of Models, 283 Mixtures of Information Functions, 285 General Equivalence Theorem for Mixtures of Models, 286 Mixtures of Models Based on Vector Means, 288 Mixtures of Criteria, 289 General Equivalence Theorem for Mixtures of Criteria, 290 Mixtures of Criteria Based on Vector Means, 290 Weightings and Scalings, 292 Second-Degree versus Third-Degree Polynomial Fit Models, II, 293 Designs with Guaranteed Efficiencies, 296 General Equivalence Theorem for Guaranteed Efficiency Designs, 297 Model Discrimination, 298 Second-Degree versus Third-Degree Polynomial Fit Models, III, 299 Exercises, 302
12. Efficient Designs for Finite Sample Sizes 12.1. 12.2. 12.3. 12.4. 12.5. 12.6. 12.7. 12.8. 12.9. 12.10. 12.11. 12.12. 12.13. 12.14. 12.15. 12.16.
Designs for Finite Sample Sizes, 304 Sample Size Monotonicity, 305 Multiplier Methods of Apportionment, 307 Efficient Rounding Procedure, 307 Efficient Design Apportionment, 308 Pairwise Efficiency Bound, 310 Optimal Efficiency Bound, 311 Uniform Efficiency Bounds, 312 Asymptotic Order O(n- l ), 314 Asymptotic Order O(n- 2 ), 315 Subgradient Efficiency Bounds, 317 Apportionment of D-Optimal Designs in Polynomial Fit Models, 320 Minimal Support and Finite Sample Size Optimality, 322 A Sufficient Condition for Completeness, 324 A Sufficient Condition for Finite Sample Size D-Optimality, 325 Finite Sample Size D-Optimal Designs in Polynomial Fit Models, 328 Exercises, 329
13. Invariant Design Problems 13.1. 13.2. 13.3. 13.4. 13.5. 13.6. 13.7. 13.8. 13.9. 13.10. 13.11. 13.12.
Design Problems with Symmetry, 331 Invariance of the Experimental Domain, 335 Induced Matrix Group on the Regression Range, 336 Congruence Transformations of Moment Matrices, 337 Congruence Transformations of Information Matrices, 338 Invariant Design Problems, 342 Invariance of Matrix Means, 343 Invariance of the D-Criterion, 344 Invariant Symmetric Matrices, 345 Subspaces of Invariant Symmetric Matrices, 346 The Balancing Operator, 348 Simultaneous Matrix Improvement, 349 Exercises, 350
14. Kiefer Optimality 14.1. 14.2. 14.3. 14.4. 14.5. 14.6. 14.7. 14.8. 14.9. 14.10.
15.5. 15.6. 15.7. 15.8.
Matrix Majorization, 352 The Kiefer Ordering of Symmetric Matrices, 354 Monotonic Matrix Functions, 357 Kiefer Optimality, 357 Heritability of Invariance, 358 Kiefer Optimality and Invariant Loewner Optimality, 360 Optimality under Invariant Information Functions, 361 Kiefer Optimality in Two-Way Classification Models, 362 Balanced Incomplete Block Designs, 366 Optimal Designs for a Linear Fit over the Unit Cube, 372 Exercises, 379
15. Rotatability and Response Surface Designs 15.1. 15.2. 15.3. 15.4.
Response Surface Methodology, 381 Response Surfaces, 382 Information Surfaces and Moment Matrices, 383 Rotatable Information Surfaces and Invariant Moment Matrices, 384 Rotatability in Multiway Polynomial Fit Models, 384 Rotatability Determining Classes of Transformations, 385 First-Degree Rotatability, 386 Rotatable First-Degree Symmetric Matrices, 387
15.9. 15.10. 15.11. 15.12. 15.13. 15.14. 15.15. 15.16. 15.17. 15.18. 15.19. 15.20. 15.21.
Rotatable First-Degree Moment Matrices, 388 Kiefer Optimal First-Degree Moment Matrices, 389 Two-Level Factorial Designs, 390 Regular Simplex Designs, 391 Kronecker Products and Vectorization Operator, 392 Second-Degree Rotatability, 394 Rotatable Second-Degree Symmetric Matrices, 396 Rotatable Second-Degree Moment Matrices, 398 Rotatable Second-Degree Information Surfaces, 400 Central Composite Designs, 402 Second-Degree Complete Classes of Designs, 403 Measures of Rotatability, 405 Empirical Model-Building, 406 Exercises, 406
Comments and References, 408
1. Experimental Designs in Linear Models, 408
2. Optimal Designs for Scalar Parameter Systems, 410
3. Information Matrices, 410
4. Loewner Optimality, 412
5. Real Optimality Criteria, 412
6. Matrix Means, 413
7. The General Equivalence Theorem, 414
8. Optimal Moment Matrices and Optimal Designs, 417
9. D-, A-, E-, T-Optimality, 418
10. Admissibility of Moment and Information Matrices, 421
11. Bayes Designs and Discrimination Designs, 422
12. Efficient Designs for Finite Sample Sizes, 424
13. Invariant Design Problems, 425
14. Kiefer Optimality, 426
15. Rotatability and Response Surface Designs, 426
Biographies, 428
1. Charles Loewner 1893-1968, 428
2. Gustav Elfving 1908-1984, 430
3. Jack Kiefer 1924-1981, 430
Bibliography, 432
Index, 448
Preface to the Classics Edition
Research into the optimality theory of the design of statistical experiments originated around 1960. The first papers concentrated on one specific optimality criterion or another. Before long, when interrelations between these criteria were observed, the need for a unified approach emerged. Invoking tools from convex optimization theory, the optimal design problem is indeed amenable to a fairly complete solution. This is the topic of Optimal Design of Experiments, and over the years the material developed here has proved comprehensive, useful, and stable. It is a pleasure to see the book reprinted in the SIAM Classics in Applied Mathematics series. Ever since the inception of optimal design theory, the determinant of the moment matrix of a design was recognized as a very specific criterion function. In fact, determinant optimality in polynomial fit models permits an analysis other than the one presented here, based on canonical moments and classical polynomials. This alternate part of the theory is developed by H. DETTE and W.J. STUDDEN in their monograph The Theory of Canonical Moments with Applications in Statistics, Probability, and Analysis, and the references listed there complement and update the bibliography given here. Since the book's initial publication in 1993, its results have been put to good use in deriving optimal designs on the circle, optimal mixture designs, or optimal designs in other linear statistical models. However, many practical design problems of applied statistics are inherently nonlinear. Even then, local linearization may open the way to apply the present results, thus aiding in identifying good, practical designs.
FRIEDRICH PUKELSHEIM
Augsburg, Germany October 2005
Preface
... dans ce meilleur des [modeles] possibles ... tout est au mieux. Candide (1759), Chapitre I, VOLTAIRE
The working title of the book was a bit long, Optimality Theory of Experimental Designs in Linear Models, but focused on two pertinent points. The setting is the linear model, the simplest statistical model, where the results are strongest. The topic is design optimality, de-emphasizing the issue of design construction. A more detailed Outline of the Book follows the Contents. The design literature is full of fancy nomenclature. In order to circumvent expert jargon I mainly speak of a design being -optimal for K 'Q in H, that is, being optimal under an information function , for a parameter system of interest K'6, in a class of competing designs. The only genuinely new notions that I introduce are Loewner optimality (because it refers to the Loewner matrix ordering) and Kiefer optimality (because it pays due homage to the man who was a prime contributor to the topic). The design problems originate from statistics, but are solved using special tools from linear algebra and convex analysis, such as the information matrix mapping of Chapter 3 and the information functions of Chapter 5. I have refrained from relegating these tools into a set of appendices, at the expense of some slowing of the development in the first half of the book. Instead, the auxiliary material is developed as needed, and it is hoped that the exposition conveys some of the fascination that grows out of merging three otherwise distinct mathematical disciplines. The result is a unified optimality theory that embraces an amazingly wide variety of design problems. My aim is not encyclopedic coverage, but rather to outline typical settings such as D-, A-, and E-optimal polynomial regression designs, Bayes designs, designs for model discrimination, balanced incomplete block designs, or rotatable response surface designs. Pulling together formerly separate entities to build a greater community will always face opponents who fear an assault on their way of thinking. On the contrary, my intention is constructive, to generate a frame for those design problems that share xix
a common goal. The goal of investigating optimal, theoretical designs is to provide a gauge for identifying efficient, practical designs. Il meglio e l'inimico del bene. Dictionnaire Philosophique (1770), Art Dramatique, VOLTAIRE
ACKNOWLEDGMENTS The writing of this book became a pleasure when I began experiencing encouragement from so many friends and colleagues, ranging from good advice of how to survive a book project, to the tedious work of weeding out wrong theorems. Above all I would like to thank my Augsburg colleague Norbert Gaffke who, with his vast knowledge of the subject, helped me several times to overcome paralyzing deadlocks. The material of the book called for a number of research projects which I could only resolve by relying on the competence and energy of my co-authors. It is a privilege to have cooperated with Norman Draper, Sabine Rieder, Jim Rosenberger, Bill Studden, and Ben Torsney, whose joint efforts helped shape Chapters 15, 12, 11, 9, 8, respectively. Over the years, the manuscript has undergone continuous mutations, as a reaction to the suggestions of those who endured the reading of the early drafts. For their constructive criticism I am grateful to Ching-Shui Cheng, Holger Dette, Berthold Heiligers, Harold Henderson, Olaf Krafft, Rudolf Mathar, Wolfgang Nather, Ingram Olkin, Andrej Pazman, Norbert Schmitz, Shayle Searle, and George Styan. The additional chores of locating typos, detecting doubly used notation, and searching for missing definitions was undertaken by Markus Abt, Wolfgang Bischoff, Kenneth Nordstrom, Ingolf Terveer, and the students of various classes I taught from the manuscript. Their labor turned a manuscript that initially was everywhere dense in error into one which I hope is finally everywhere dense in content. Adalbert Wilhelm carried out most of the calculations for the numerical examples; Inge Dotsch so cheerfully kept retyping what seemed in final form. Ingo Eichenseher and Gerhard Wilhelms contributed the public domain postscript driver dvilw to produce the exhibits. Sol Feferman, Timo Makelainen, and Dooley Kiefer kindly provided the photographs of Loewner, Elfving, and Kiefer in the Biographies. To each I owe a debt of gratitude. Finally I wish to acknowledge the support of the Volkswagen-Stiftung, Hannover, for supporting sabbatical leaves with the Departments of Statistics at Stanford University (1987) and at Penn State University (1990), and granting an Akademie-Stipendium to help finish the project. FRIEDRICH PUKELSHEIM Augsburg, Germany December 1992
List of Exhibits
1.1 1.2 1.3 1.4 1.5 1.6 1.7
The statistical linear model, 3 Convex cones in the plane R2, 11 Orthogonal decompositions induced by a linear mapping, 14 Orthogonal and oblique projections, 24 An experimental design worksheet, 28 A worksheet with run order randomized, 28 Experimental domain designs, and regression range designs, 32
2.1 2.2 2.3 2.4 2.5
The ice-cream cone, 38 Two Elfving sets, 43 Cylinders, 45 Supporting hyperplanes to the Elfving set, 50 Euclidean balls inscribed in and circumscribing the Elfving set, 55
3.1 ANOVA decomposition, 71 3.2 Regularization of the information matrix mapping, 81 3.3 Discontinuity of the information matrix mapping, 84 4.1
Penumbra, 108
5.1
Unit level sets, 121
6.1 Conjugate numbers, p + q = pq, 148 7.1 Subgradients, 159 7.2 Normal vectors to a convex set, 160 7.3 A hierarchy of equivalence theorems, 178 xxi
8.1 Support points for a linear fit over the unit square, 194 9.1 The Legendre polynomials up to degree 10, 214 9.2 Polynomial fits over [-1; 1]: -optimal designs for 0 in T, 218 9.3 Polynomial fits over [—1;!]: -optimal designs for 6 in 219 9.4 Histogram representation of the design , 220 9.5 Fifth-degree arcsin support, 220 9.6 Polynomial fits over [-1;1]: -optimal designs for 6 in T, 224 9.7 Polynomial fits over [-1;1]: -optimal designs for 6 in 225 9.8 The Chebyshev polynomials up to degree 10, 226 9.9 Lagrange polynomials up to degree 4, 228 9.10 E-optimal moment matrices, 233 9.11 Polynomial fits over [-1;1]: -optimal designs for 8 in T, 236 9.12 Arcsin support efficiencies for individual parameters 240 10.1 10.2 10.3
Cuts of a convex set, 254 Line projections and admissibility, 259 Cylinders and admissibility, 261
11.1
Discrimination between a second- and a third-degree model, 301
12.1 12.2 12.3 12.4 12.5 12.6
Quota method under growing sample size, 306 Efficient design apportionment, 310 Asymptotic order of the E-efficiency loss, 317 Asymptotic order of the D-efficiency loss, 322 Nonoptimality of the efficient design apportionment, 323 Optimality of the efficient design apportionment, 329
13.1
Eigenvalues of moment matrices of symmetric three-point designs, 334
14.1 14.2
The Kiefer ordering, 355 Some 3x6 block designs for 12 observations, 370
14.3 14.4
Uniform vertex designs, 373 Admissible eigenvalues, 375
15.1
Eigenvalues of moment matrices of central composite designs, 405
Interdependence of Chapters

1 Experimental Designs in Linear Models
2 Optimal Designs for Scalar Parameter Systems
3 Information Matrices
4 Loewner Optimality
5 Real Optimality Criteria
6 Matrix Means
7 The General Equivalence Theorem
8 Optimal Moment Matrices and Optimal Designs
12 Efficient Designs for Finite Sample Sizes
9 D-, A-, E-, T-Optimality
13 Invariant Design Problems
10 Admissibility of Moment and Information Matrices
14 Kiefer Optimality
11 Bayes Designs and Discrimination Designs
15 Rotatability and Response Surface Designs
Outline of the Book
CHAPTERS 1, 2, 3, 4: LINEAR MODELS AND INFORMATION MATRICES Chapters 1 and 3 are basic. Chapter 1 centers around the Gauss-Markov Theorem, not only because it justifies the introduction of designs and their moment matrices in Section 1.24. Equally important, it permits us to define in Section 3.2 the information matrix for a parameter system of interest K'0 in a way that best supports the general theory. The definition is extended to rank deficient coefficient matrices K in Section 3.21. Because of the dual purpose the Gauss-Markov Theorem is formulated as a general result of matrix algebra. First results on optimal designs are presented in Chapter 2, for parameter subsystems that are one-dimensional, and in Chapter 4, in the case where optimality can be achieved relative to the Loewner ordering among information matrices. (This is rare, see Section 4.7.) These results also follow from the General Equivalence Theorem in Chapter 7, whence Chapters 2 and 4 are not needed for their technical details.
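As a pointer to what Section 3.2 develops, here is a minimal numerical sketch of the information matrix of a full rank parameter subsystem K'θ under a positive definite moment matrix M. It uses the standard full rank formula C_K(M) = (K'M⁻¹K)⁻¹; the matrices chosen are illustrative values only, not examples from the text.

```python
import numpy as np

def information_matrix(M, K):
    """Information matrix C_K(M) = (K' M^{-1} K)^{-1} for a coefficient
    matrix K of full column rank and a positive definite M."""
    KtMinvK = K.T @ np.linalg.solve(M, K)   # K' M^{-1} K
    return np.linalg.inv(KtMinvK)

# toy moment matrix and coefficient matrix (illustrative values)
M = np.array([[2.0, 0.5],
              [0.5, 1.0]])
K = np.array([[1.0],
              [0.0]])                        # a single scalar subsystem
print(information_matrix(M, K))              # equals 1 / (M^{-1})_{11}
```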
CHAPTERS 5,6: INFORMATION FUNCTIONS Chapters 5 and 6 are reference chapters, developing the concavity properties of prospective optimality criteria. In Section 5.8, we introduce information functions which by definition are required to be positively homogeneous, superadditive, nonnegative, nonconstant, and upper semicontinuous. Information functions submit themselves to pleasing functional operations (Section 5.11), of which polarity (Section 5.12) is crucial for the sequel. The most important class of information functions are the matrix means with parameter They are the topic of Chapter 6, starting from the classical D-, A-, E-criterion as the special cases respectively. XXV
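To make the matrix means mentioned here concrete, the sketch below computes φ_p from the eigenvalues of an information matrix, using the standard definitions under which the D-, A-, and E-criteria arise at p = 0, p = -1, and p = -∞. This is an illustration from general knowledge of these criteria, not a transcription of Chapter 6; the test matrix is arbitrary.

```python
import numpy as np

def matrix_mean(C, p):
    """Matrix mean phi_p of a positive definite matrix C:
    p = 0   -> D-criterion (det C)^(1/k),
    p = -1  -> A-criterion k / trace(C^{-1}),
    p = -inf-> E-criterion lambda_min(C)."""
    lam = np.linalg.eigvalsh(C)
    k = len(lam)
    if p == 0:
        return float(np.prod(lam) ** (1.0 / k))     # geometric mean
    if p == -np.inf:
        return float(lam.min())
    return float(np.mean(lam ** p) ** (1.0 / p))

C = np.diag([4.0, 1.0])
for p in (1, 0, -1, -np.inf):
    print(p, matrix_mean(C, p))
```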
CHAPTERS 7, 8,12: OPTIMAL APPROXIMATE DESIGNS AND EFFICIENT DISCRETE DESIGNS The General Equivalence Theorem 7.14 is the key result of optimal design theory, offering necessary and sufficient conditions for a design's moment matrix M to be -optimal for K' in M. The generic result of this type is due to Kiefer and Wolfowitz (1960), concerning D-optimality for 6 in M . The present theorem is more general in three respects, in allowing for the competing moment matrices to form a set M which is compact and convex, rather than restricting attention to the largest possible set M of all moment matrices, in admitting parameter subsystems K' rather than concentrating on the full parameter vector 6, and in permitting as optimality criterion any information function , rather than restricting attention to the classical D-criterion. Specifying these quantitites gives rise to a number of corollaries which are discussed in the second half of Chapter 7. The first half is a self-contained exposition of arguments which lead to a proof of the General Equivalence Theorem, based on subgradients and normal vectors to a convex set. Duality theory of convex analysis might be another starting point; here we obtain a duality theorem as an intermediate step, as Theorem 7.12. Yet another approach would be based on directional derivatives; however, their calculus is quite involved when it comes to handling a composition C like the one underlying the optimal design problem. Chapter 8 deals with the practical consequences which the General Equivalence Theorem implies about the support points xi, and the weights w, of an optimal design The theory permits a weight w, to be any real number between 0 and 1, prescribing the proportion of observations to be drawn under xi. In contrast, a design for sample size n replaces wi by an integer n,-, as the replication number for xi. In Chapter 12 we propose the efficient design apportionment as a systematic and easy way to pass from wi, to ni. This discretization procedure is the most efficient one, in the sense of Theorem 12.7. For growing sample size AX, the efficiency loss relative to the optimal design stays bounded of asymptotic order n - 1 ; in the case of differentiability, the order improves to n- 2 . CHAPTERS 9,10,11: INSTANCES OF DESIGN OPTIMALITY D-, A-, and E-optimal polynomial regression designs over the interval [—1; 1] are characterized and exhibited in Chapter 9. Chapter 10 discusses admissibility of the moment matrix of a polynomial regression design, and of the contrast information matrix of a block design in a two-way classification model. Prominent as these examples may be, it is up to Chapter 11 to exploit the power of the General Equivalence Theorem to its fullest. Various sets of competing moment matrices are considered, such as Ma for Bayes designs, M(a[a;b]) for designs with bounded weights, M(m) for mix-
ture model designs, {(M,... ,M): M M} for mixture criteria designs, and for designs with guaranteed efficiencies. And they are evaluated using an information function that is a composition of a set of m information functions, together with an information function on the nonnegative orthant Rm. CHAPTERS 13,14,15: OPTIMAL INVARIANT DESIGNS As with other statistical problems, invariance considerations can be of great help in reducing the dimensionality and complexity of the general design problem, at the expense of handling some additional theoretical concepts. The foundations are laid in Chapter 13, investigating various groups and their actions as they pertain to an experimental domain design r, a regression range design a moment matrix M(£), an information matrix C/c(M), or an information function (C). The idea of "increased symmetry" or "greater balancedness" is captured by the matrix majorization ordering of Section 14.1. This concept is brought together with the Loewner matrix ordering to create the Kiefer ordering of Section 14.2: An information matrix C is at least as good as another matrix D, C > D, when relative to the Loewner ordering, C is above some intermediate matrix which is majorized by D, The concept is due to Kiefer (1975) who introduced it in a block design setting and called it universal optimality. We demonstrate its usefulness with balanced incomplete block designs (Section 14.9), optimal designs for a linear fit over the unit cube (Section 14.10), and rotatable designs for response surface methodology (Chapter 15). The final Comments and References include historical remarks and mention the relevant literature. I do not claim to have traced every detail to its first contributor and I must admit that the book makes no mention of many other important design topics, such as numerical algorithms, orthogonal arrays, mixture designs, polynomial regression designs on the cube, sequential and adaptive designs, designs for nonlinear models, robust designs, etc.
Errata Page ±Line
Text
31 32 91 156 157 169 169 203 217 222
Section 1.26 + 12 Exh. 1.7 lower right: interchange B~B, BB~ -11 -2 X = \C\ i +11 E >0, +13 GKCDCK'G' : -7 sxk -12 _7 i [in denominator] r [in numerator] -8
241
_2
270 330 347 357 361 378 390
+4 +3 -7 + 15 + 11 +9 +13,-3
Exhibit 9.4 K NND(s) a(jk) Il m
Correction Section 1.25 1/2 and 1/6 B~ B K , B k B ~ \X\ = C j E >0, GKCDCK'G' + F : s x (s — k) d s
Exhibit 9.2 Ks NND(k)
b(jk) lI 1+m
OPTIMAL DESIGN OF EXPERIMENTS
CHAPTER 1
Experimental Designs in Linear Models
This chapter provides an introduction to experimental designs for linear models. Two linear models are presented. The first is classical, having a dispersion structure in which the dispersion matrix is proportional to the identity matrix. The second model is more general, with a dispersion structure that does not impose any rank or range assumptions. The Gauss-Markov Theorem is formulated to cover the general model. The classical model provides the setting to introduce experimental designs and their moment matrices. Matrix algebra is reviewed as needed, with particular emphasis on nonnegative definite matrices, projectors, and generalized inverses. The theory is illustrated with two-way classification models, and models for a line fit, parabola fit, and polynomial fit.
1.1. DETERMINISTIC LINEAR MODELS

Many practical and theoretical problems in science treat relationships of the type

y = g(t, θ),
where the observed response or yield, y, is thought of as a particular value of a real-valued model function or response function, g, evaluated at the pair of arguments (t, θ). This decomposition reflects the distinctive role of the two arguments: The experimental conditions t can be freely chosen by the experimenter from a given experimental domain T, prior to running the experiment. The parameter system θ is assumed to lie in a parameter domain Θ, and is not known to the experimenter. This is paraphrased by saying that the experimenter controls t, whereas "nature" determines θ. The choice of the function g is central to the model-building process. One
of the simplest relationships is the deterministic linear model

y = f(t)'θ,
where f(t) = (f₁(t), ..., f_k(t))' and θ = (θ₁, ..., θ_k)' are vectors in k-dimensional Euclidean space R^k. All vectors are taken to be column vectors, a prime indicates transposition. Hence f(t)'θ is the usual Euclidean scalar product, f(t)'θ = Σ_{i≤k} f_i(t)θ_i.

The model variance σ² > 0 provides an indication of the variability inherent in the observation Y. Another way of expressing this is to say that the random error E has mean value zero and variance σ², neither of which depends on the regression vector x nor on the parameter vector θ of the mean response. The k × 1 parameter vector θ and the scalar parameter σ² comprise a total of k + 1 unknown parameter components. Clearly, for any reasonable inference, the number n of observations must be at least equal to k + 1. We consider a set of n observations,

Y_i = x_i'θ + E_i,   i = 1, ..., n,
with possibly different regression vectors x_i in experimental run i. The joint distribution of the n responses Y_i is specified by assuming that they are uncorrelated. Considerable simplicity is gained by using vector notation. Let

Y = (Y₁, ..., Y_n)',   X = (x₁, ..., x_n)',   E = (E₁, ..., E_n)'
denote the n × 1 response vector Y, the n × k model matrix X, and the n × 1 error vector E, respectively. (Henceforth the random quantities Y and E are n × 1 vectors rather than scalars!) The (i, j)th entry x_ij of the matrix X is the
same as the jth component of the regression vector x_i, that is, the regression vector x_i appears as the ith row of the model matrix X. The model equation thus becomes

Y = Xθ + E.
With I_n as the n × n identity matrix, the model is succinctly represented by the expectation vector and dispersion matrix of Y,

E_P[Y] = Xθ,   D_P[Y] = σ²I_n,
and is termed the classical linear model with moment assumptions. In other words, the mean vector E_P[Y] is given by the linear relationship Xθ between the regression vectors x₁, ..., x_n and the parameter vector θ, while the dispersion matrix D_P[Y] is in its classical, that is, simplest, form of being proportional to the identity matrix.
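To make the moment assumptions concrete, here is a small simulation sketch; the model matrix, parameter values, and variance are chosen only for illustration. It draws many realizations of Y = Xθ + E with uncorrelated errors and compares the empirical mean vector and dispersion matrix with Xθ and σ²I_n.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative model matrix (n = 4 runs, k = 2 parameters) and true values
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
theta = np.array([1.0, 0.5])
sigma2 = 0.25

reps = 20000
Y = X @ theta + np.sqrt(sigma2) * rng.standard_normal((reps, X.shape[0]))

print("empirical mean   :", np.round(Y.mean(axis=0), 3))   # approaches X @ theta
print("model mean       :", X @ theta)
print("empirical cov    :", np.round(np.cov(Y, rowvar=False), 3))
print("model dispersion :", sigma2 * np.eye(X.shape[0]))
```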
1.4. CLASSICAL LINEAR MODELS WITH NORMALITY ASSUMPTION

For purposes of hypothesis testing and interval estimation, assumptions on the first two moments do not suffice and the entire distribution of Y is required. Hence in these cases there is a need for a classical linear model with normality assumption,

Y ~ N_{Xθ; σ²I_n},
in which Y is assumed to be normally distributed with mean vector Xθ and dispersion matrix σ²I_n. If the model matrix X is known then the normal distribution P = N_{Xθ; σ²I_n} is determined by θ and σ². We display these parameters by writing E_{θ,σ²}[···] in place of E_P[···], etc. Moreover, the letter P soon signifies a projection matrix.

1.5. TWO-WAY CLASSIFICATION MODELS

The two-sample problem provides a simple introductory example. Consider two populations with mean responses α₁ and α₂. The observed responses from the two populations are taken to have a common variance σ² and to be uncorrelated. With replications j = 1, ..., n_i for populations i = 1, 2 this yields a linear model

Y_ij = α_i + E_ij.
Assembling the components into n × 1 vectors, with n = n₁ + n₂, we get

Y = Xθ + E.
Here the n × 2 model matrix X and the parameter vector θ are given by

X = (x₁, ..., x₁, x₂, ..., x₂)',   θ = (α₁, α₂)',
with regression vectors x₁ = (1, 0)' and x₂ = (0, 1)' repeated n₁ and n₂ times. The experimental design consists of the replication numbers n₁ and n₂, telling the experimenter how many responses are to be observed from which population. It is instructive to identify the quantities of this example with those of the general theory. The experimental domain T is simply the two-element set {1, 2} of population labels. The regression function takes values f(1) = (1, 0)' and f(2) = (0, 1)' in R², inducing the set X = {(1, 0)', (0, 1)'} as the regression range. The generalization from two to a populations leads to the one-way classification model. The model is still Y_ij = α_i + E_ij, but the subscript ranges turn into i = 1, ..., a and j = 1, ..., n_i. The mean parameter vector becomes θ = (α₁, ..., α_a)', and the experimental domain is T = {1, ..., a}. The regression function f maps i into the ith Euclidean unit vector e_i of R^a, with ith entry one and zeros elsewhere. Hence the regression range is X = {e₁, ..., e_a}. Further generalization is aided by a suitable terminology. Rather than speaking of different populations, i = 1, ..., a, we say that the "factor" population takes "levels" i = 1, ..., a. More factors than one occur in multiway classification models. The two-way classification model with no interaction may serve as a prototype. Suppose level i of a first factor "A" has mean effect α_i, while level j of a second factor "B" has mean effect β_j. Assuming that the two effects are additive, the model reads
with replications 1, ..., n_ij for levels i = 1, ..., a of factor A and levels j = 1, ..., b of factor B. The design problem now consists of choosing the replication numbers n_ij. An extreme, but feasible, choice is n_ij = 0, that is, no observation is made with factor A on level i and factor B on level j. The
parameter vector θ is the k × 1 vector (α₁, ..., α_a, β₁, ..., β_b)', with k = a + b. The experimental domain is the discrete rectangle T = {1, ..., a} × {1, ..., b}. The regression function f maps (i, j) into (e_i', d_j')', where e_i is the ith Euclidean unit vector of R^a and d_j is the jth Euclidean unit vector of R^b. We return to this model in Section 1.27. So far, the experimental domain has been a finite set; next it is going to be an interval of the real line R.

1.6. POLYNOMIAL FIT MODELS

Let us first look at a line fit model,

Y_ij = α + βt_i + E_ij.
Intercept α and slope β form the parameter vector θ of interest, whereas the experimental conditions t_i come from an interval T ⊆ R. For the sake of concreteness, we think of t ∈ T as a "dose level". The design problem then consists of determining how many and which dose levels t₁, ..., t_ℓ are to be observed, and how often. If the experiment calls for n_i replications of dose level t_i, the subscript ranges in the model are j = 1, ..., n_i for i = 1, ..., ℓ. Here the regression function has values f(t) = (1, t)', generating a line segment embedded in the plane R² as regression range X. The parabola fit model has mean response depending on the dose level quadratically.
This changes the regression function to f(t) = (1, t, t²)', and the regression range X turns into the segment of a parabola embedded in the space R³. These are special instances of polynomial fit models of degree d ≥ 1, the model equation becoming

Y_ij = θ₀ + θ₁t_i + ··· + θ_d t_i^d + E_ij.
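For a polynomial fit of degree d the regression function is f(t) = (1, t, ..., t^d)'. The sketch below, with made-up dose levels and replication numbers, assembles the corresponding regression vectors and stacks them into the model matrix of Section 1.3.

```python
import numpy as np

def regression_vector(t, d):
    """f(t) = (1, t, ..., t^d)' for a d-th degree polynomial fit."""
    return np.array([t ** j for j in range(d + 1)], dtype=float)

def model_matrix(levels, reps, d):
    """Stack f(t_i)' as rows, repeating each dose level n_i times."""
    rows = [regression_vector(t, d)
            for t, n in zip(levels, reps)
            for _ in range(n)]
    return np.array(rows)

# illustrative design: dose levels with replication numbers
levels = [-1.0, 0.0, 1.0]
reps = [2, 1, 2]
X = model_matrix(levels, reps, d=2)   # parabola fit, k = 3 parameters
print(X)
```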
The regression range X is a one-dimensional curve embedded in R^k, with k = d + 1. Often the description of the experiment makes it clear that the experimental condition is a single real variable t; a linear model for a line fit (parabola fit, polynomial fit of degree d) is then referred to as a first-degree model (second-degree model, dth-degree model). This generalizes to the powerful class of m-way dth-degree polynomial fit models. In these models the experimental condition t = (t₁, ..., t_m)' has m components, that is, the experimental domain T is a subset of R^m, and the model function f(t)'θ is a polynomial of degree d in the m variables t₁, ..., t_m.
For instance, a two-way third-degree model is given by
with ℓ experimental conditions t_i = (t_i1, t_i2)' ∈ T ⊆ R², and with subscript ranges j = 1, ..., n_i for i = 1, ..., ℓ. As a second example consider the three-way second-degree model
with ℓ experimental conditions t_i = (t_i1, t_i2, t_i3)' ∈ T ⊆ R³, and with subscript ranges j = 1, ..., n_i for i = 1, ..., ℓ. Both models have ten mean parameters. The two examples illustrate saturated models because they feature every possible dth-degree power or cross product of the variables t₁, ..., t_m. In general, a saturated m-way dth-degree model has

(m + d)! / (m! d!)
mean parameters. An instance of a nonsaturated two-way second-degree model is
with ℓ experimental conditions t_i = (t_i1, t_i2)' ∈ T ⊆ R², and with subscript ranges j = 1, ..., n_i for i = 1, ..., ℓ. The discussion of these examples is resumed in Section 1.27, after a proper definition of an experimental design.
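The binomial-coefficient count of mean parameters is easy to check numerically; the snippet below reproduces the value 10 for both the two-way third-degree and the three-way second-degree examples.

```python
from math import comb

def saturated_parameters(m, d):
    """Number of mean parameters of a saturated m-way d-th degree
    polynomial fit model: all monomials of degree <= d in m variables."""
    return comb(m + d, d)

print(saturated_parameters(2, 3))  # two-way third-degree model: 10
print(saturated_parameters(3, 2))  # three-way second-degree model: 10
```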
1.7. EUCLIDEAN MATRIX SPACE In a classical linear model, interest concentrates on inference for the mean parameter vector 6. The performance of appropriate statistical procedures tends to be measured by dispersion matrices, moment matrices, or information matrices. This calls for a review of matrix algebra. All matrices used here are real. First let us recall that the trace of a square matrix is the sum of its diagonal entries. Hence a square matrix and its transpose have identical traces. Another important property is that, under the trace operator, matrices commute
provided they are conformable,

trace AB = trace BA.
We often apply this rule to quadratic forms given by a symmetric matrix A, in using x'Ax = trace Axx' = trace xx'A, as is convenient in a specific context. Let R^{n×k} denote the linear space of real matrices with n rows and k columns. The Euclidean matrix scalar product

(A, B) = trace A'B
turns R^{n×k} into a Euclidean space of dimension nk. For k = 1, we recover the Euclidean scalar product for vectors in R^n. The symmetry of scalar products, trace A'B = (A, B) = (B, A) = trace B'A, reproduces the property that a square matrix and its transpose have identical traces. Commutativity under the trace operator yields (A, B) = trace A'B = trace BA' = (B', A') = (A', B'), that is, transposition preserves the scalar products between the matrix spaces of reversed numbers of rows and columns, R^{n×k} and R^{k×n}. In general, although not always, our matrices have at least as many rows as columns. Since we have to deal with extensive matrix products, this facilitates a quick check that factors properly conform. It is also in accordance with writing vectors of Euclidean space as columns. Notational conventions that are similarly helpful are to choose Greek letters for unknown parameters in a statistical model, and to use uppercase and lowercase letters to discriminate between a random variable and any one of its values, and between a matrix and any one of its entries. Because of their role as dispersion operators, our matrices often are symmetric. We denote by Sym(k) the subspace of symmetric matrices, in the space R^{k×k} of all square, that is, not necessarily symmetric, matrices. Recall from matrix algebra that a symmetric k × k matrix A permits an eigenvalue decomposition

A = Σ_{i≤k} λ_i z_i z_i' = Z'Δ_λ Z.
The real numbers λ₁, ..., λ_k are the eigenvalues of A counted with their respective multiplicities, and the vectors z₁, ..., z_k ∈ R^k form an orthonormal system of eigenvectors. In general, such a decomposition fails to be unique, since if the eigenvalue λ_i has multiplicity greater than one then many choices for the eigenvectors z_i become feasible. The second representation of an eigenvalue decomposition, A = Z'Δ_λZ, assembles the pertinent quantities in a slightly different way. We define the operator Δ_λ by requiring that it creates a diagonal matrix with the argument vector λ = (λ₁, ..., λ_k)' on the diagonal. The orthonormal vectors z₁, ..., z_k
form the k × k matrix Z' = (z₁, ..., z_k), whence Z' is an orthogonal matrix. The equality with the first representation now follows from

Z'Δ_λ Z = (z₁, ..., z_k) Δ_λ (z₁, ..., z_k)' = Σ_{i≤k} λ_i z_i z_i'.
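The two representations of the eigenvalue decomposition can be checked numerically. In the sketch below, for an arbitrary symmetric matrix, numpy's eigh returns the eigenvectors z_i as columns of its output matrix, so that matrix plays the role of Z'.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # any symmetric matrix will do

lam, V = np.linalg.eigh(A)          # columns of V are the eigenvectors z_i

# A = sum_i lambda_i z_i z_i'
outer_sum = sum(l * np.outer(v, v) for l, v in zip(lam, V.T))

# A = Z' Delta_lambda Z, with Z' = V
sandwich = V @ np.diag(lam) @ V.T

print(np.allclose(A, outer_sum), np.allclose(A, sandwich))  # True True
```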
Matrices that have matrices or vectors as entries, such as Z', are termed block matrices. They provide a convenient technical tool in many areas. The algebra of block matrices parallels the familiar algebra of matrices, and may be verified as needed. In the space Sym(k), the subsets of nonnegative definite matrices, NND(k), and of positive definite matrices, PD(k), are central to the sequel. They are defined through

NND(k) = {A ∈ Sym(k) : x'Ax ≥ 0 for all x ∈ R^k},
PD(k) = {A ∈ Sym(k) : x'Ax > 0 for all x ∈ R^k with x ≠ 0}.
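Membership in NND(k) or PD(k) can be tested from the eigenvalues, anticipating Lemma 1.8; the helper below is only an illustration of the definitions just given, with tolerance handling added for floating-point arithmetic.

```python
import numpy as np

def is_nnd(A, tol=1e-10):
    """A in NND(k): symmetric with x'Ax >= 0 for all x,
    equivalently all eigenvalues >= 0."""
    return np.allclose(A, A.T) and np.linalg.eigvalsh(A).min() >= -tol

def is_pd(A, tol=1e-10):
    """A in PD(k): symmetric with x'Ax > 0 for all nonzero x."""
    return np.allclose(A, A.T) and np.linalg.eigvalsh(A).min() > tol

print(is_nnd(np.array([[1.0, 0.0], [0.0, 0.0]])))  # True, but not PD
print(is_pd(np.array([[2.0, 1.0], [1.0, 2.0]])))   # True
```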
Of the many ways of characterizing nonnegative definiteness or positive definiteness, frequent use is made of the following.

1.8. NONNEGATIVE DEFINITE MATRICES

Lemma. Let A be a symmetric k × k matrix with smallest eigenvalue λ_min(A). Then we have

A ∈ NND(k) ⇔ λ_min(A) ≥ 0 ⇔ trace AB ≥ 0 for all B ∈ NND(k),
A ∈ PD(k) ⇔ λ_min(A) > 0 ⇔ trace AB > 0 for all B ∈ NND(k) with B ≠ 0.
Proof. Assume A ∈ NND(k), and choose an eigenvector z ∈ R^k of norm 1 corresponding to λ_min(A). Then we obtain 0 ≤ z'Az = λ_min(A)z'z = λ_min(A). Conversely, assume λ_min(A) ≥ 0, and choose an eigenvalue decomposition

A = Σ_{i≤k} λ_i z_i z_i'.
This yields trace AB = Σ_{i≤k} λ_i z_i'Bz_i ≥ 0 for all B ∈ NND(k). To complete the circle, we verify x'Ax ≥ 0 for all x ∈ R^k by choosing B = xx'. For positive definiteness the arguments follow the same lines upon observing that we have trace AB ≥ λ_min(A) trace B > 0 provided B ≠ 0. The set of all nonnegative definite matrices NND(k) has a beautiful geometrical shape, as follows.

1.9. GEOMETRY OF THE CONE OF NONNEGATIVE DEFINITE MATRICES

Lemma. The set NND(k) is a cone which is convex, pointed, and closed, and has interior PD(k) relative to the space Sym(k).

Proof. The proof is by first principles, recalling the definition of the properties involved. For A ∈ NND(k) and δ > 0 evidently δA ∈ NND(k), thus NND(k) is a cone. Next for A, B ∈ NND(k) we clearly have A + B ∈ NND(k) since

x'(A + B)x = x'Ax + x'Bx ≥ 0 for all x ∈ R^k.
Because NND(k) is a cone, we may replace A by (1 − α)A and B by αB, where α lies in the open interval (0; 1). Hence given any two matrices A and B the set NND(k) also includes the straight line (1 − α)A + αB from A to B, and this establishes convexity. If A ∈ NND(k) and also −A ∈ NND(k), then A = 0, whence the cone NND(k) is pointed. The remaining two properties, that NND(k) is closed and has PD(k) for its interior, are topological in nature. Let

B = {B ∈ Sym(k) : trace B² ≤ 1}
be the closed unit ball in Sym(k) under the Euclidean matrix scalar product. Replacing B ∈ Sym(k) by an eigenvalue decomposition Σ_j λ_j y_j y_j' yields trace B² = Σ_j λ_j²; thus B ∈ B has eigenvalues λ_j satisfying |λ_j| ≤ 1. It follows that B ∈ B fulfills x'Bx ≤ |x'Bx| ≤ Σ_j |λ_j|(x'y_j)² ≤ x'(Σ_j y_j y_j')x = x'x for all x ∈ R^k. A set is closed when its complement is open. Therefore we pick an arbitrary k × k matrix A which is symmetric but fails to be nonnegative definite. By definition, x'Ax < 0 for some vector x ∈ R^k. Define δ = −x'Ax/(2x'x) > 0. For every matrix B ∈ B, we then have

x'(A + δB)x = x'Ax + δx'Bx ≤ x'Ax + δx'x = x'Ax/2 < 0.
EXHIBIT 1.2 Convex cones in the plane R². Left: the linear subspace generated by x ∈ R² is a closed convex cone that is not pointed. Right: the open convex cone generated by x, y ∈ R², together with the null vector, forms a pointed cone that is neither open nor closed.
Thus the set A + δB is included in the complement of NND(k), and it follows that the cone NND(k) is closed. Interior points are identified similarly. Let A ∈ int NND(k), that is, A + δB ⊆ NND(k) for some δ > 0. If x ≠ 0 then the choice B = −xx'/x'x ∈ B leads to

0 ≤ x'(A + δB)x = x'Ax − δx'x, that is, x'Ax ≥ δx'x > 0.
Hence every matrix A interior to NND(k) is positive definite, int NND(k) ⊆ PD(k). It remains to establish the converse inclusion. Every matrix A ∈ PD(k) has 0 < λ_min(A) = δ, say. For B ∈ B and x ∈ R^k, we obtain x'Bx ≥ −x'x, and

x'(A + δB)x = x'Ax + δx'Bx ≥ x'Ax − δx'x ≥ λ_min(A)x'x − δx'x = 0.
Thus A + δB ⊆ NND(k) shows that A is interior to NND(k). There are, of course, convex cones that are not pointed but closed, or pointed but not closed, or neither pointed nor closed. Exhibit 1.2 illustrates two such instances in the plane R².

1.10. THE LOEWNER ORDERING OF SYMMETRIC MATRICES

True beauty shines in many ways, and order is one of them. We prefer to view the closed cone NND(k) of nonnegative definite matrices through the
partial ordering ≥, defined on Sym(k) by

A ≥ B ⇔ A − B ∈ NND(k),
which has come to be known as the Loewner ordering of symmetric matrices. The notation B ≤ A in place of A ≥ B is self-explanatory. We also define the closely related variant > by

A > B ⇔ A − B ∈ PD(k),
which is based on the open cone of positive definite matrices. The geometric properties of the set NND(fc) of being conic, convex, pointed, and closed, translate into related properties for the Loewner ordering:
The third property in this list says that the Loewner ordering is antisymmetric. In addition, it is reflexive and transitive,

A ≥ A;   A ≥ B and B ≥ C imply A ≥ C.
Hence the Loewner ordering enjoys the three properties that constitute a partial ordering. For scalars, that is, k = 1, the Loewner ordering reduces to the familiar total ordering of the real line. Or the other way around, the total ordering ≥ of the real line R is extended to the partial ordering ≥ of the matrix spaces Sym(k), with k > 1. The crucial distinction is that, in general, two matrices may not be comparable. An example is furnished by
for which neither A ≥ B nor B ≥ A holds true. Order relations always call for a study of monotonic functions.

1.11. MONOTONIC MATRIX FUNCTIONS
We consider functions that have a domain of definition and a range that are equipped with partial orderings. Such functions are called isotonic when they
are order preserving, and antitonic when they are order reversing. A function is called monotonic when it is isotonic or antitonic. Two examples may serve to illustrate these concepts. A first example is supplied by a linear form A ↦ trace AB on Sym(k), determined by a matrix B ∈ Sym(k). If this linear form is isotonic relative to the Loewner ordering, then A ≥ 0 implies trace AB ≥ 0, and Lemma 1.8 proves that the matrix B is nonnegative definite. Conversely, if B is nonnegative definite and A ≥ C, then again Lemma 1.8 yields trace(A − C)B ≥ 0, that is, trace AB ≥ trace CB. Thus a linear form A ↦ trace AB is isotonic relative to the Loewner ordering if and only if B is nonnegative definite. In particular the trace itself is isotonic, A ↦ trace A, as follows with B = I_k. It is an immediate consequence that the Euclidean matrix norm ‖A‖ = (trace A²)^{1/2} is an isotonic function from the closed cone NND(k) into the real line. For if A ≥ B ≥ 0, then we have

trace A² = trace AB + trace A(A − B) ≥ trace AB = trace B² + trace(A − B)B ≥ trace B²,

that is, ‖A‖ ≥ ‖B‖.
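The Loewner comparisons used in this section reduce to eigenvalue checks on differences. The sketch below verifies A ≥ B for one pair, exhibits a non-comparable pair (chosen here for illustration; it need not be the example the text has in mind), and confirms the monotonicity of the matrix norm as well as the antitonicity of matrix inversion discussed next.

```python
import numpy as np

def loewner_geq(A, B, tol=1e-10):
    """A >= B in the Loewner ordering iff A - B is nonnegative definite."""
    return np.linalg.eigvalsh(A - B).min() >= -tol

A = np.array([[3.0, 1.0], [1.0, 2.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0]])
print(loewner_geq(A, B))                      # True: A >= B >= 0

# an illustrative pair that is not comparable
C = np.diag([1.0, 0.0])
D = np.diag([0.0, 1.0])
print(loewner_geq(C, D), loewner_geq(D, C))   # False False

# isotonicity of the Euclidean matrix norm on NND(k)
norm = lambda M: np.sqrt(np.trace(M @ M))
print(norm(A) >= norm(B))                     # True

# antitonicity of inversion on PD(k): A >= B > 0 implies A^{-1} <= B^{-1}
print(loewner_geq(np.linalg.inv(B), np.linalg.inv(A)))  # True
```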
As a second example, matrix inversion A l is claimed to be an antitonic mapping from the open cone PD(fc) into itself. For if A > B > 0 then we get
Pre- and postmultiplication by A ] gives A~l < B l, as claimed. A minimization problem relative to the Loewner ordering is taken up in the Gauss-Markov Theorem 1.19. Before turning to this topic, we review the role of matrices when they are interpreted as linear mappings. 1.12. RANGE AND NULLSPACE OF A MATRIX A rectangular matrix A e Rnxk may be identified with a linear mapping carrying x e Rk into Ax e IR". Its range or column space, and its nullspace or kernel are
The range is a subspace of the image space R n . The nullspace is a subspace of the domain of definition Rk. The rank and nullity of A are the dimensions of the range of A and of the nullspace of A, respectively. If the matrix A is symmetric, then its rank coincides with the number of nonvanishing eigenvalues, and its nullity is the number of vanishing eigenvalues. Symmetry involves transposition, and transposition indicates the presence of a scalar product (because A' is the unique matrix B that satisfies
14
CHAPTER 1: EXPERIMENTAL DESIGNS IN LINEAR MODELS
EXHIBIT 13 Orthogonal decompositions induced by a linear mapping. Range and nullspace of a matrix A € R"** and of its transpose A' orthogonally decompose the domain of definition Kk and the image space R".
(Ax,y) = (x,By) for all x,y). In fact, Euclidean geometry provides the following vital connection that the nullspace of the transpose of a matrix is the orthogonal complement of its range. Let
denote the orthogonal complement of a subspace L of the linear space R". 1.13. TRANSPOSITION AND ORTHOGONALITY
Lemma. Let A be an n x k matrix. Then we have
Proof.
A few transcriptions establish the result:
Replacing A' by A yields nullspace A = (range A')^. Thus any n x k matrix A comes with two orthogonal decompositions, of the domain of definition R*, and of the image space R". See Exhibit 1.3.
1.15. DISTRIBUTIONAL SUPPORT OF LINEAR MODELS
15
1.14. SQUARE ROOT DECOMPOSITIONS OF A NONNEGATIVE DEFINITE MATRIX As a first application of Lemma 1.13 we investigate square root decompositions of nonnegative definite matrices. If V is a nonnegative definite n x n matrix, a representation of the form
is called a square root decomposition of V, and U is called a square root of V. Various such decompositions are easily obtained from an eigenvalue decomposition
For instance, a feasible choice is U = (± v /Alzi,...,±v / &«) e R nx ". If V has nonvanishing eigenvalues A I , . . . , A*, other choices are U = (±\f\iz\,..., ±v/AJtZjt) e R"x/c; here V = UU' is called a full rank decomposition for the reason that the square root U has full column rank. Every square root U of V has the same range as V, that is,
To prove this, we use Lemma 1.13, in that the ranges of V and U coincide if and only if the nullspaces of V and U' are the same. But U 'z = 0 clearly implies Vz = 0. Conversely, Vz = 0 entails 0 = z'Vz = z'UU'z = (U'z)'(U'z), and thus forces U'z = 0. The range formula, for every n x k matrix X,
is a direct consequence of a square root decomposition V = UU', since range V = range U implies range X'V — range X'U = range X'VX C range X'V. Another application of Lemma 1.13 is to clarify the role of mean vectors and dispersion matrices in linear models. 1.15. DISTRIBUTIONAL SUPPORT OF LINEAR MODELS Lemma. Let Y be an n x 1 random vector with mean vector //, and dispersion matrix V. Then we have with probability 1,
16
CHAPTER 1: EXPERIMENTAL DESIGNS IN LINEAR MODELS
that is, the distribution of Y is concentrated on the affine subspace that results if the linear subspace range V is shifted by the vector /A. Proof. The assertion is true if V is positive definite. Otherwise we must show that Y — p, lies in the proper subspace range V with probability 1. In view of Lemma 1.13, this is the same as Y-JJL _L nullspace V with probability 1. Here nullspace V may be replaced by any finite set {z\,. • . , Zk} of vectors spanning it. For each j = 1,..., k we obtain
thus Y — n _L Zj with probability 1. The exceptional nullsets may depend on the subscript;', but their union produces a global nullset outside of which is orthogonal to {z1,...,zk}, as claimed. In most applications the mean vector /A is a member of the range of V. Then the affine subspace p + range V equals range V and is actually a linear subspace, so that Y falls into the range of V with probability 1. In a classical linear model as expounded in Section 1.3, the mean vector fj, is of the form X6 with unknown parameter system 6. Hence the containment IJi — Xde range V holds true for all vectors 8 provided
Such range inclusion conditions deserve careful study as they arise in many places. They are best dealt with using projectors, and projectors are natural companions of generalized inverse matrices. 1.16. GENERALIZED MATRIX INVERSION AND PROJECTIONS For a rectangular matrix A € Rnxk, any matrix G e Rkxn fulfilling AGA = A is called a generalized inverse of A. The set of all generalized inverses of A,
is an affine subspace of the matrix space R* XAI , being the solution set of an inhomogeneous linear matrix equation. If a relation is invariant to the choice of members in A~, then we often replace the matrix G by the set A~, For instance, the defining property may be written as A A'A = A. A square and nonsingular matrix A has its usual inverse A~l for its unique generalized inverse, A' = {A~1}. In this sense generalized matrix inversion is a generalization of regular matrix inversion. Our explicit convention of treating A~ as a set of matrices is a bit unusual, even though it is implicit in all of the work on generalized matrix
1.17. RANGE INCLUSION LEMMA
17
inverses. Namely, often only results that are invariant to the specific choice of a generalized inverse are of interest. For example, in the following lemma, the product X'GX is the same for every generalized inverse G of V. We indicate this by inserting the set V~ in place of the matrix G. However, the central optimality result for experimental designs is of opposite type. The General Equivalence Theorem 7.14 states that a certain property holds true, not for every, but for some generalized inverse. In fact, the theorem becomes false if this point is missed. Our notation helps to alert us to this pitfall. A matrix P e R"x" is called a projector onto a subspace /C C U" when P is idempotent, that is, P2 = P, and has /C for its range. Let us verify that the following characterizing interrelation between generalized inverses and projectors holds true:
For the direct part, note first that AG is idempotent. Moreover the inclusions
show that the range of AG and the range of A coincide. For the converse part, we use that the projector AG has the same range as the matrix A. Thus every vector Ax with x G Rk has a representation AGy with y € IR", whence AGAx = AGAGy = AGy — Ax. Since x can be chosen arbitrarily, this establishes AGA = A. The intimate relation between range inclusions and projectors, alluded to in Section 1.15, can now be made more explicit. 1.17. RANGE INCLUSION LEMMA Lemma. Let X be an n x k matrix and V be an n x s matrix. Then we have
If range X C range V and V is a nonnegative definite n x n matrix, then the product
does not depend on the choice of generalized inverse for V, is nonnegative definite, and has the same range as X' and the same rank as X. Proof. The range of X is included in the range of V if and only if A' — VW for some conformable matrix W. But then VGX = VGVW = VW = X
18
CHAPTER 1: EXPERIMENTAL DESIGNS IN LINEAR MODELS
for all G e V . Conversely we may assume the slightly weaker property that VGX = X for at least one generalized inverse G of V. Clearly this is enough to make sure that the range of X is included in the range of V. Now let V be nonnegative definite, in addition to X = VW. Then the matrix X'GX = W'VGVW = W'VW is the same for all choices G e V~, and is nonnegative definite. Furthermore the ranks of X'V'X = W'VW and VW = X are equal. In particular, the ranges of X'V~X and X' have the same dimension. Since the first is included in the second, they must then coincide. We illustrate by example what can go wrong if the range inclusion condition is violated. The set of generalized inverses of
is
This is also the set of possible products X'GX with G € V if for X we choose the 2 x 2 identity matrix. Hence the product X'V'X is truely a set and not a singleton. Among the members
some are not symmetric (/3 ^ -y), some are not nonnegative definite (a < 0,/3 = y), and some do not have the same range as X' and the same rank as X (a = (3 = y = 1). Frequent use of the lemma is made with other matrices in place of X and V. The above presentation is tailored to the linear model context, which we now resume. 1.18. GENERAL LINEAR MODELS A central result in linear model theory is the Gauss-Markov Theorem 1.19. The version below is stated purely in terms of matrices, as a minimization problem relative to the Loewner ordering. However, it is best understood in the setting of a general linear model in which, by definition, the n x 1 response vector Y is assumed to have mean vector and dispersion matrix given by
1.18. GENERAL LINEAR MODELS
19
Here the n x k model matrix X and the nonnegative definite n x n matrix V are assumed known, while the mean parameter vector 9 e ® and the model variance a2 > 0 are taken to be unknown. The dispersion matrix need no longer be proportional to the identity matrix as in the classical linear model discussed in Section 1.3. Indeed, the matrix V may be rank deficient, even admitting the deterministic extreme V = 0. The theorem considers unbiased linear estimators LY for XO, that is, n x n matrices L satisfying the unbiasedness requirement
In a general linear model, it is implicitly assumed that the parameter domain © is the full space, 6 = R*. Under this assumption, LY is unbiased for X6 if and only if LX — X, that is, L is a left identity of X. There always exists a left identity, for instance, L = /„. Hence the mean vector XO always admits an unbiased linear estimator. More generally, we may wish to estimate s linear forms Cj'0,... ,cs'0 of 6, with coefficient vectors c, e R*. For a concise vector notation, we form the k x s coefficient matrix K = (ci,... ,c5). Thus interest is in the parameter subsystem K'O. A linear estimator LY for K'O is determined by an s x n matrix L. Unbiasedness holds if and only if
There are two important implications. First, K'O is called estimable when there exists an unbiased linear estimator for K'O. This happens if and only if there is some matrix L that satisfies (1). In the sequel such a specific solution is represented as L = U', with an n x s matrix U. Therefore estimability means that AT' is of the form U'X. Second, if K'O is estimable, then the set of all matrices L that satisfy (1) determines the set of all unbiased linear estimators LY for K'O. In other words, in order to study the unbiased linear estimators for K'O = U'XO,we have to run through the solutions L of the matrix equation
It is this equation (2) to which the Gauss-Markov Theorem 1.19 refers. The theorem identifies unbiased linear estimators LY for the mean vector XO which among all unbiased linear estimators LY have a smallest dispersion matrix. Thus the quantity to be minimized is a2LVL', relative to the Loewner ordering. The crucial step in the proof is the computation of the covariance matrix between the optimality candidate LY and a competitor LY,
20
CHAPTER 1: EXPERIMENTAL DESIGNS IN LINEAR MODELS
1.19. THE GAUSS-MARKOV THEOREM Theorem. Let X be an n x k matrix and V be a nonnegative definite n x n matrix. Suppose U is an n x s matrix. A solution L of the equation LX = U'X attains the minimum of LVL', relative to the Loewner ordering and over all solutions L of the equation LX = U'X,
if and only if
where R is a projector given by R = In- XG for some generalized inverse G oiX. A minimizing solution L exists; a particular choice is U '(ln-VR'HK), with any generalized inverse H otRVR'. The minimum admits the representation
and does not depend on the choice of the generalized inverses involved. Proof. For a fixed generalized inverse G of X we introduce the projectors P = XG and R = In - P. Every solution L satisfies L - L(P + R) = LXG + LR= U'P + LR. I. First the converse part is proved. Assume the matrix L solves LX = U'X and fulfills LVR' = 0, and let L be any other solution. We get
and symmetry yields (L - L)VL' = 0. Multiplying out (L - L + L)V(L L+L)' = (L-L)V(L-L)'+Q+0+LVLr, we obtain the minimizing property of L,
II. Next we tackle existence. Because of RX = 0 the matrix L = U'(In — VR'HR} solves LX = U'X - U'VR'HRX = U'X. It remains to show that LVR' = 0. To this end, we note that range RV = range RVR', by the square root discussion in Section 1.14. Lemma 1.17 says that VR'HRV = VR'(RVR'yRV, as well as RVR'(RVR')-RV = RV. This gives
1.20. THE GAUSS-MARKOV THEOREM UNDER A RANGE INCLUSION CONDITION
21
Hence L fulfills the necessary conditions from the converse part, and thus attains the minimum. Furthermore the minimum permits the representation
III. Finally the existence proof and the converse part are jointly put to use in the direct part. Since the matrix L is minimizing, any other minimizing solution L satisfies (1) with equality. This forces (L - L)V(L - L}' — 0, and further entails (L - L)V = 0. Postmultiplication by R' yields LVR' = LVR' = Q. The statistic RY is an unbiased estimator for the null vector, Ee.a2[RY] = RX6 = 0. In the context of a general linear model, the theorem therefore says that an unbiased estimator LY for U'Xd has a minimum dispersion matrix if and only if LY is uncorrelated with the unbiased null estimator RY, that is, C^LY,RY] = a-2LVR' = 0. Our original problem of estimating X6 emerges with (/ = /„. The solution matrices L of the equation LX — X are the left identities of X. A minimum variance unbiased linear estimator for X6 is given by LY, with L = In — VR'HR. The minimum dispersion matrix takes the form a2(V ~- VR'(RVRTRV). A simpler formula for the minimum dispersion matrix becomes available under a range inclusion condition as in Section 1.17. 1.20. THE GAUSS-MARKOV THEOREM UNDER A RANGE INCLUSION CONDITION Theorem. Let X be an n x k matrix and V be a nonnegative definite n x n matrix such that the range of V includes the range of X. Suppose U is an n x s matrix. Then the minimum of LVL' over all solutions L of the equation LX — U'X admits the representation
and is attained by any L = U'X(X'V~X)~X'H inverse of V.
where H is any generalized
22
CHAPTER 1: EXPERIMENTAL DESIGNS IN LINEAR MODELS
Proof. The matrix X'V'X ~ W does not depend on the choice of the generalized inverse for V and has the same range as X', by Lemma 1.17. A second application of Lemma 1.17 shows that a similar statement holds for XW~X'. Hence the optimality candidate L is well defined. It also satisfies LX = U'X since
Furthermore, it fulfills LVR' = 0, because of VH'X = VV~X = X and
By Theorem 1.19, the matrix Lisa minimizing solution. From X'V~VV~X— X'V~X,v/e now obtain the representation
The preceding two theorems investigate linear estimates LY that are unbiased for a parameter system U'XO. The third, and last version concentrates on estimating the parameter vector 0 itself. A linear estimator LY, with L € IR*X", is unbiased for 9 if and only if
This reduces to LX — Ik, that is, L is a left inverse of X. For a left inverse L of X to exist it is necessary and sufficient that X has full column rank k. 1.21. THE GAUSS-MARKOV THEOREM FOR THE FULL MEAN PARAMETER SYSTEM Theorem. Let X be an n x k matrix with full column rank k and V be a nonnegative definite n x n matrix. A left inverse L of X attains the minimum of LVL' over all left inverses L of*,
if and only if
1.22. PROJECTORS, RESIDUAL PROJECTORS, AND DIRECT SUM DECOMPOSITION
23
where R is a projector given by R = In- XG for some generalized inverse G otX. A minimizing left inverse L exists; a particular choice is G-GVR 'HR, with any generalized inverse H of RVR'. The minimum admits the representation
and does not depend on the choice of the generalized inverses involved. Moreover, if the range of V includes the range of X then the minimum is
and is attained by any matrix L — (X'V~X)~1X'H ized inverse of V'.
where H is any general-
Proof. Notice that every generalized inverse G of X is a left inverse of X, since premultiplication of XGX = X by (X'XY1X' gives GX = Ik. With U' = G, Theorem 1.19 and Theorem 1.20 establish the assertions. The estimators LY that result from the various versions of the GaussMarkov Theorem are closely related to projecting the response vector Y onto an appropriate subspace of R". Therefore we briefly digress and comment on projectors. 1.22. PROJECTORS, RESIDUAL PROJECTORS, AND DIRECT SUM DECOMPOSITION
Projectors were introduced in Section 1.16. If the matrix F e R"x" is a projector, P — P2, then it decomposes the space Rn into a direct sum consisting of the subspaces /C = range P, and £ = nullspace P. To see this, observe that the nullspace of P coincides with the range of the residual projector R = In - P. Therefore every vector x e Rn satisfies
But then the vector x lies in the intersection JC n C if and only if x ~ Px and x = Rx, or equivalently, x = Rx = RPx = 0. Hence the spaces 1C and C are disjoint except for the null vector. Symmetry of P adds orthogonality to the picture. Namely, we then have C = nullspace P = nullspace P' = (range P)1 = /C\ by Lemma 1.13.
24
CHAPTER 1: EXPERIMENTAL DESIGNS IN LINEAR MODELS
EXHIBIT 1.4 Orthogonal and oblique projections. The projection onto the first component in IR2 along the direction (^) is orthogonal. The (dashed) projection along the direction (}) is nonorthogonal relative to the Euclidean scalar product.
Thus projectors that are symmetric correspond to orthogonal sum decompositions of the space Un, and are called orthogonal projectors. This translation into geometry often provides a helpful view. In Exhibit 1.4, we sketch a simple illustration. In telling the full story we should speak of P as being the projector "onto 1C along £", that is, onto its range along its nullspace. But brevity wins over exactness. It remains the fact that projectors in IR" correspond to direct sum decompositions IR" = /C ® £, without reference to any scalar product. In Lemma 2.15, we present a method for computing projectors if the subspaces /C and £ arise as ranges of nonnegative definite matrices A and£. We mention in passing that a symmetric n x n matrix V permits yet another eigenvalue decomposition of the form V — Y^k<e ^A» where AI, ..., A/ are the distinct eigenvalues of V and PI , . . . , Pf are the orthogonal projectors onto the corresponding eigenspaces. In this form, the eigenvalue decomposition is unique up to enumeration, in contrast to the representations given in Section 1.7.
1.23. OPTIMAL ESTIMATORS IN CLASSICAL LINEAR MODELS From now on, a minimum variance unbiased linear estimator is called an optimal estimator, for short. Returning to the classical linear model,
1.24. EXPERIMENTAL DESIGNS AND MOMENT MATRICES
25
Theorem 1.20 shows that the optimal estimator for the mean vector X6 is PY and that it has dispersion matrix a2P, where P = X(X'X)~X' is the orthogonal projector onto the range of X. Therefore the estimator P Y evidently depends on the model matrix X through its range, only. Representing this subspace as the range of another matrix X still leads to the same estimator PY and the same dispersion matrix a2P. This outlook changes dramatically as soon as the parameter vector 6 itself has a definite physical meaning and is to be investigated with its given components, as is the case in most applications. From Theorem 1.21, the optimal estimator for 6 then is (X'X)~1X'Y, and has dispersion matrix o-2(X'X)~l. Hence changing the model matrix X in general affects both the optimal estimator and its dispersion matrix.
1.24. EXPERIMENTAL DESIGNS AND MOMENT MATRICES The stage is set now to introduce the notion of experimental designs. In Section 1.3, the model matrix was built up according to X = (*i,.. .,*„)', starting from the regression vectors *,-. These are at the discretion of the experimenter who can choose them so that in a classical linear model, the optimal estimator (X'XylX'Y for the mean parameter vector 6 attains a dispersion matrix cr2(X'X)~l as small as possible, relative to the Loewner ordering. Since matrix inversion is antitonic, as seen in Section 1.11, the experimenter may just as well aim to maximize the precision matrix, that is, the inverse dispersion matrix,
The sum repeats the regression vector Xj e X according to how often it occurs in *!,...,*„. Since the order of summation does not matter, we may assume that the distinct regression vectors x\,... ,JQ, say, are enumerated in the initial section, while replications are accounted for in the final section x
t+\i • • • ixn-
We introduce for / < i the frequency counts «, as the number of times the particular regression vector A:, occurs among the full list jti, ...,*„. This motivates the definitions of experimental designs and their moment matrices which are fundamental to the sequel. DEFINITION. An experimental design for sample size n is given by a finite number of regression vectors x\,..., xg in X, and nonzero integers ni,...,nf such that
26
CHAPTER 1: EXPERIMENTAL DESIGNS IN LINEAR MODELS
In other words an experimental design for sample size n, denoted by £„, specifies i < n distinct regression vectors *,, and assigns to them frequencies «, that sum to n. It tells the experimenter to observe n, responses under the experimental conditions that determine the regression vector *,. The vectors that appear in the design £, are called the support of £„, supp£, = {*i,... ,JQ}. The matrix Y^i xx' from X to Sym(/c). In Euclidean space, the convex hull of a compact set is compact, thus completing the proof. In general, the set M(H«) of moment matrices obtained from all experimental designs for sample size n fails to be convex, but it is still compact. To see compactness, note that such a design £„ is characterized by n regression vectors Jtj, ...,*„, counting multiplicities. The set of moment matrices M(En) thus consists of the continuous images M(^,) = Y^>j Q if and only if range P D range Q, for all orthogonal projectors P, Q. 1.5 Show that pa = ]CyP?/ and \Pij\ < max-{pa,Pjj} < 1, for every orthogonal projector P. 1.6 Let P = X(X'X)~1X' be the orthogonal projector originating from an n x k model matrix X of rank k, with rows */. Show that (i) /?,, = x!(X'X)-lxit (ii) EiPn/» = */», W P < XX1/\min(X'X) and max, pa < R2/'\min(X'X) = c/n, with c = R2/\min(M) > 1, where R = max, ||jc,-|| and M = (l/n)X'X, (iv) if ln e range* then P > lnlnln and mm/ Pa ^ Vn-
34
CHAPTER 1: EXPERIMENTAL DESIGNS IN LINEAR MODELS
1.7 Discuss the following three equivalent versions of the Gauss-Markov Theorem: i. LVL' > X(X'V-X)-X' for all L £ Rnxn with LX = X. ii. V >X(X'V-X}-X'. iii. trace VW > traceX(X'V-XYX'W for all W € NND(n). 1.8 Let the responses Y\,..., Yn have an exchangeable (or completely symmetric) dispersion structure V = a2In+o2p(lnlt[-In), with variance a2 > 0 and correlation p. Show that V is positive definite if and only if pe(-l/(n-l);l). 1.9 (continued) Let £, be a design for sample size n, with model matrix X, standardized moment matrix M = X'X/n = f xx'dgn/n, and standardized mean vector m — X'ln/n — /xdgn/n. Show that Iimw_00(cr2/n) x JT't/-1* = (M - mm') /(I - p), provided p e (0; 1). Discuss M -mm', as a function of £ = &/«.
CHAPTER 2
Optimal Designs for Scalar Parameter Systems
In this chapter optimal experimental designs for one-dimensional parameter systems are derived. The optimally criterion is the standardized variance of the minimum variance unbiased linear estimator. A discussion of estimability leads to the introduction of a certain cone of matrices, called the feasibility cone. The design problem is then one of minimizing the standardized variance over all moment matrices that lie in the feasibility cone. The optimal designs are characterized in a geometric way. The approach is based on the set of cylinders that include the regression range, and on the interplay of the design problem and a dual problem. The construction is illustrated with models that have two or three parameters.
2.1. PARAMETER SYSTEMS OF INTEREST AND NUISANCE PARAMETERS Our aim is to characterize and compute optimal experimental designs. Any concept of optimality calls upon the experimenter to specify the goals of the experiment; it is only relative to such goals that optimality properties of a design would make any sense. In the present chapter we presume that the experimenter's goal is point estimation in a classical linear model, as set forth in Section 1.23. There we concentrated on the full mean parameter vector. However, the full parameter system often splits into a subsystem of interest and a complementary subsystem of nuisance parameters. Nuisance parameters assist in formulating a statistical model that adequately describes the experimental reality, but the primary concern of the experiment is to learn more about the subsystem of interest. Therefore the performance of a design is evaluated relative to the subsystem of interest, only. One-dimensional subsystems of interest are treated first, in the present 35
36
CHAPTER 2: OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS
chapter. They are much simpler and help motivate the general results for multidimensional subsystems. The general discussion is taken up in Chapter 3. 2.2. ESTEVfABILFTY OF A ONE-DIMENSIONAL SUBSYSTEM As before, let 8 be the full k x 1 vector of mean parameters in a classical linear model,
Suppose the system of interest is given by c'O, where the coefficient vector c e Rk is prescribed prior to experimentation. To avoid trivialities, we assume c^O. The most important special case is an individual parameter component 0y, which is obtained from c'O with c the Euclidean unit vector ej in the space IR*. Or interest may be in the grand mean 0. = Y,j B > 0, then A = B + C with C = A -B > 0. Therefore range A = (range B) + (range C) D range B. The following description shows that every feasibility cone lies somewhere between the open cone PD(fc) and the closed cone NND(fc). The description lacks an explicit characterization of which parts of the boundary NND(&) \ PD(fc) belong to the feasibility cone, and which parts do not. 2.4. FEASIBILITY CONES Theorem. The feasibility cone A(c) for c'9 is a convex subcone of NND(fc) which includes PD(fc). Proof. By definition, A(c) is a subset of NND(&). Every positive definite k x k matrix lies in A(c). The cone property and the convexity property jointly are the same as
The first property is evident, since multiplication by a nonvanishing scalar does not change the range of a matrix. That A(c) contains the sum of any two of its members follows from Lemma 2.3.
38
CHAPTER 2: OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS
EXHIBIT 2.1 The ice-cream cone. The cone NND(2) in Sym(2) is isomorphic to the icecream cone in R3.
2.5. THE ICE-CREAM CONE Feasibility cones reappear in greater generality in Section 3.3. At this point, we visualize the situation for the smallest nontrivial order, k = 2. The space Sym(2) has typical members
and hence is of dimension 3. We claim that in Sym(2), the cone NND(2) looks like the ice-cream cone of Exhibit 2.1 which is given by
This claim is substantiated as follows. The mapping from Sym(2) into R3
2.5. THE ICE-CREAM CONE
39
which takes
into (a, \/2/3, y)' is linear and preserves scalar products. Linearity is evident. The scalar products coincide since for
we have
Thus the matrix
and the vector (a, \/2/3, y)' enjoy identical geometrical properties. For a more transparent coordinate representation, we apply a further orthogonal transformation into a new coordinate system,
Hence the matrix
is mapped into the vector
while the vector (jc,_y,z)' is mapped into the matrix
For instance, the identity matrix /2 corresponds to the vector
40
CHAPTER 2: OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS
The matrix
is nonnegative definite if and only if ay — (32 > 0 as well as a > 0 and y > 0. In the new coordinate system, this translates into z2 > x2 + y2 as well as z > y and z > —y. These three properties are equivalent to z > (x2 + y2)1/2. Therefore the cone NND(2) is isometrically isomorphic to the ice-cream cone, and our claim is proved. The interior of the ice-cream cone is characterized by strict inequality, (x2 +_y 2 ) ! / 2 < z. This corresponds to the fact that the closed cone NND(2) has the open cone PD(2) for its interior, by Lemma 1.9. This correspondence is a consequence of the isomorphism just established, but is also easily seen directly as follows. A singular 2 x 2 matrix has rank equal to 0 or 1. The null matrix is the tip of the ice-cream cone. Otherwise we have A — dd1 for some nonvanishing vector
Such a matrix
is mapped into the vector
which satisfies x2 + y2 = z2 and hence lies on the boundary of the ice-cream cone. What does this geometry mean for the feasibility cone A(c}l In the first place, feasibility cones contain all positive definite matrices, as stated in Theorem 2.4. A nonvanishing singular matrix A = dd' fulfills the defining property of the feasibility cone, c € range dd', if and only if the vectors c and d are proportional. Thus, for dimension k = 2, we have
In geometric terms, the cone A(c) consists of the interior of the ice-cream cone and the ray emanating in the direction (lab, b2 -a2, b2+a2)' for c = (£).
2.7. THE DESIGN PROBLEM FOR SCALAR PARAMETER SUBSYSTEMS
41
2.6. OPTIMAL ESTIMATORS UNDER A GIVEN DESIGN We now return to the task set in Section 2.2 of determining estimator for c'0. If estimability holds under the model matrix parameter system satisfies c'6 = u'X6 for some vector u e R". Markov Theorem 1.20 determines the optimal estimator for c'0
the optimal X, then the The Gaussto be
Whereas the estimator involves the model matrix X, its variance depends merely on the associated moment matrix M — (l/n)X'X,
Up to the common factor tr 2 /n, the optimal estimator has variance c'M (£) c. It depends on the design £ only through the moment matrix M(£). 2.7. THE DESIGN PROBLEM FOR SCALAR PARAMETER SUBSYSTEMS We recall from Section 1.24 that H is the set of all designs, and that M(H) is the set of all moment matrices. Let c ^ 0 be a given coefficient vector in Uk. The design problem for a scalar parameter system c'6 can now be stated as follows:
In short, this means minimizing the variance subject to estimability. Notice that the primary variables are matrices rather than designs! The optimal variance of this problem is, by definition,
A moment matrix M is called optimal for c'6 in M(H) when M lies in the feasibility cone A(c) and c'M'c attains the optimal variance. A design £ is called optimal for c'6 in H when its moment matrix M(£) is optimal for c'0 in M(H). In the design problem, the quantity c'M~c does not depend on the choice of generalized inverse for M, and is positive. This has nothing to do with M being a moment matrix but hinges entirely on the feasibility cone A(c).
42
CHAPTER 2: OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS
Namely, if A lies in A(c), then Lemma 1.17 entails that c'A~c is well defined and positive. Another way of verifying these properties is as follows. Every matrix A e A(c) admits a vector h eRk solving c = Ah. Therefore we obtain
saying that c 'A c is well defined. Furthermore h 'Ah = 0 if and only if Ah = 0, by our considerations on matrix square roots in Section 1.14. The assumption c ^ 0 then forces c'A'c = h'Ah > 0.
2.8. DIMENSIONALITY OF THE REGRESSION RANGE Every optimization problem poses the immediate question whether it is nonvacuous. For the design problem, this means whether the set of moment matrices Af(E) and the feasibility cone A(c) intersect. The range of matrix M(£) is the same as the subspace £(supp£) C Rk that is spanned by the support points of the design £. If £ is supported by the regression vectors x\,...,xe, say, then Lemma 2.3 and Section 1.14 entail
Therefore the intersection M(H) n A(c) is nonempty if and only if the coefficient vector c lies in the regression space C(X) C R* spanned by the regression range X C Uk. A sufficient condition is that X contains k linearly independent vectors *!,...,*£. Then the moment matrix (!/&)£, c'Nc, with equality if and only if M and N fulfill conditions (1) and (2) given below. More precisely, we have
with respective equality if and only if, for every design £ e H which has moment matrix M,
46
CHAPTER 2: OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS
Proof. Inequality (i) follows from the assumption that M is the moment matrix of a design £ e H, say. By definition of A/*, we have 1 > x 'Nx for all x e X, and integration with respect to £ leads to
Moreover the upper bound is attained if and only if x'Nx = 1, for all x in the support of £, thereby establishing condition (1). Inequality (ii) relates to the assumption that M lies in the feasibility cone A(c). The fact that c e range M opens the way to applying the Gauss-Markov Theorem 1.20 to obtain M > min L€R *x*. Lc=c LA/L' = c(c'M~c)~lc'. Since TV is nonnegative definite, the linear form A t-> trace AN is isotonic, by Section 1.11. This yields
It remains to show that in this display, equality forces condition (2), the converse being obvious. We start from two square root decompositions M — KK' and N = HH', and introduce the matrix
The definition of A does not depend on the choice of generalized inverses for M. We know this already from the expression c'M'c. Because of M e A(c), we have MM~c = c. For K'M~c, in variance then follows from K = MM~K and K'Gc = K'M'MGMM-c = K'M~c, for G e M~. Next we compute the squared norm of A:
Thus equality in (ii) implies that A has norm 0 and hence vanishes. Pre- and postmultiplication of A by K and H' give 0 = MN - c(c'M~c)~lc'N which is condition (2). Essentially the same result will be met again in Theorem 7.11 where we discuss a multidimensional parameter system. The somewhat strange sequencing of vectors, scalars, and matrices in condition (2) is such as to readily carry over to the more general case.
2.12. THE ELFVING NORM
47
The theorem suggests that the design problem is accompanied by the dual problem: Maximize subject to
The two problems bound each other in the sense that every feasible value for one problem provides a bound for the other:
Equality holds as soon as we find matrices M and N such that c'M c = c'Nc. Then M is an optimal solution of the design problem, and N is an optimal solution of the dual problem, and M and N jointly satisfy conditions (1) and (2) of the theorem. But so far nothing has been said about whether such a pair of optimal matrices actually exists. If the infimum or the supremum were not to be attained, or if they were to be separated by strict inequality (a duality gap), then the theorem would be of limited usefulness. It is at this point that scalar parameter systems permit an argument much briefer than in the general case, in that the optimal matrices submit themselves to an explicit construction. 2.12. THE ELFVING NORM The design problem for a scalar subsystem c'O is completely resolved by the Elfving Theorem 2.14. As a preparation, we take another look at the Elfving set K in the regression space C(X} C Uk that is spanned by the regression range X. For a vector z £ £>(X} Q R*, the number
is the scale factor needed to blow up or shrink the set f l so that z comes to lie on its boundary. It is a standard fact from convex analysis that, on the space C(X), the function p is a norm. In our setting, we call p the Elfving norm. Moreover, the Elfving set 71 figures as its unit ball,
Boundary points of 7£ are characterized through p(z) ~ 1, and this property of p is essentially all we need. Scalar parameter systems c'6 are peculiar in that their coefficient vector c can be embedded in the regression space. This is in contrast to multidimensional parameter systems K'Q with coefficient matrices K e IR fcxs of rank
48
CHAPTER 2: OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS
s > 1. The relation between the coefficient vector c and the Elfving set 7?. is the key to the solution of the problem. Rescaling 0 ^ c € C(X] by p(c) places c/p(c) on the boundary of 71. As a member of the convex set 72., the vector c/p(c) admits a representation
say, with e, e {±1} and jc/ € X for / = 1,... ,^, and 17 a design on the symmetrized regression range X u (-X}. The fact that c/p(c) lies on the boundary of K prevents rj from putting mass at opposite points, that is, at points Xi e X and x^ e —X with jc2 = —x\. Suppose this happens and without loss of generality, assume TI(X\) > 17 (— x\) > 0. Then the vector z from (1) has norm smaller than 1:
In the inequalities, we have first employed the triangle inequality on p, then used p(e,*,) < 1 for e/x, € H, and finally added 2r\(—x\) > 0. Hence we get 1 = p(c/p(c)) = p(z) < 1, a contradiction. The bottom line of the discussion is the following. Given a design 17 on the symmetrized regression range X U ( — X } such that 17 satisfies (1), we define a design £ on the regression range X through
We have just shown that the two terms 17(x) and rj(-x) cannot be positive
2.13. SUPPORTING HYPERPLANES TO THE ELFVING SET
49
at the same time. In other words the design £ satisfies
Thus representation (1) takes the form
with s a function on X which on the support of £ takes values ±1. These designs £ and their moment matrices M (£) will be shown to be optimal in the Elfving Theorem 2.14.
2.13. SUPPORTING HYPERPLANES TO THE ELFVING SET Convex combinations are "interior representations". They find their counterpart in "exterior representations" based on supporting hyperplanes. Namely, since c/p(c) is a boundary point of the Elfving set 7£, there exists a supporting hyperplane to the set 71 at the point c/p(c), that is, there exist a nonvanishing vector h € £>(X) C R* and a real number y such that
Examples are shown in Exhibit 2.4. In view of the geometric shape of the Elfving set 7?., we can simplify this condition as follows. Since 7£ is symmetric we may insert z as well as — z, turning the left inequality into \z'h\ < y. As a consequence, y is nonnegative. But y cannot vanish, otherwise 72. lies in a hyperplane of the regression space C(X} and has empty relative interior. This is contrary to the fact, mentioned in Section 2.9, that the interior of 72. relative to C(X} is nonempty. Hence y is positive. Subdividing by y > 0 and setting h — h/y / 0, the supporting hyperplane to 72. at c/p(c) is given by
with some vector h e £>(X], h^Q. The square of inequality (1) proves that the matrix defined by N = hh' satisfies
50
CHAPTER 2: OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS
EXHIBIT 2.4 Supporting hyperplanes to the Elfving set. The diagram applies to the line fit model. Bottom: at c = (~^ ), the unique supporting hyperplane is the one orthogonal to h = (_Pj). Top: at d = ({), the supporting hyperplanes are those orthogonal to h = ( J ^ A ) for some A 6 [0; 1].
Hence N lies in the set Af of cylinders that include 7£, introduced in Section 2.10. The equality in (1) determines the particular value
Therefore Theorem 2.11 tells us that (p(c))2 is a lower bound for the optimal variance of the design problem. In fact, this is the optimal variance. 2.14. THE ELFVING THEOREM Theorem. Assume that the regression range X C R* is compact, and that the coefficient vector c e R* lies in the regression space C(X) and has Elfving norm p(c) > 0. Then a design £ € H is optimal for c'd in H if and only if there exists a function e on X which on the support of £ takes values ±1 such that
There exists an optimal design for c'& in H, and the optimal variance is (p(c))2. Proof. As a preamble, let us review what we have already established. The Elfving set and the ElfVing norm, as introduced in the preceding sections,
2.14. THE ELFVING THEOREM
51
are
There exists a vector h e £>(X] C Rk that determines a supporting hyperplane to 71 at c/p(c). The matrix N = hh' induces a cylinder that includes 71 or X, as discussed in Section 2.13, with c'Nc = (p(c))2. Now the proof is arranged like that of the Gauss-Markov Theorem 1.19 by verifying, in turn, sufficiency, existence, and necessity. First the converse is proved. Assume there is a representation of the form c/p(c) = Xlxesupp £ £(*) e(x)x. Let M be the moment matrix of £. We have e(x)x 'h < 1 for all x e X. In view of
every support point x of £ satisfies e(x)x'h — 1. We get x'h = \je(x) = s(x), and
This shows that M lies in the feasibility cone A(c). Moreover, this yields h'Mh = c'h/p(c) = 1 and c'M'c = (p(c))2h'MM~Mh = (p(c))2. Thus the lower bound (p(c))2 is attained. Hence the bound is optimal, as are M, £, and N. Next we tackle existence. Indeed, we have argued in Section 2.12 that c/p(c) does permit a representation of the form Jx e(x)x dg. The designs £ leading to such a representation are therefore optimal. Finally this knowledge is put to use in the direct part of the proof. If a design £ is optimal, then M(£) e A(c) and c'M(g)~c — (p(c))2. Thus conditions (1) and (2) of Theorem 2.11 hold with M = M(£) and N = hh1. Condition (1) yields
Condition (2) is postmultiplied by h/h'h. Insertion of (c'M(£) c) (p(c)) and c'h — p(c) produces
=
Upon defining e(x) = x'h, the quantities e(x) have values ±1, by condition (i), while condition (ii) takes the desired form c/p(c)
52
CHAPTER 2: OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS
The theorem gives the solution in terms of designs, even though the formulation of the design problem apparently favors moment matrices. This, too, is peculiar to the case of scalar parameter systems. Next we cast the result into a form that reveals its place in the theory to be developed. To this end we need to take up once more the discussion of projections and direct sum decompositions from Section 1.22. 2.15. PROJECTORS FOR GIVEN SUBSPACES Lemma. Let A and B be nonnegative definite k x k matrices such that the ranges provide a direct sum decomposition of Rk:
Then (A + B) l is a generalized inverse of A and A(A + B)~} is the projector onto the range of A along the range of B. Proof, The matrix A + B is nonsingular, by Lemma 2.3. Upon setting G = (A + B)'\ we have Ik = AG + BG. We claim AGB = 0, which evidently follows from nullspace AG = range B. The direct inclusion holds since, if x is a vector such that AGx = 0, then x = AGx + BGx = BGx e range B. Equality of the two subspaces is a consequence of the fact that they have the same dimension, k — rank A. Namely, because of the nonsingularity of G, A:-rank A is the nullity common to AG and A. It is also the rank of B, in view of the direct sum assumption. With AGB = 0, postmultiplication of Ik = AG + BG by A produces A = AGA. This verifies G to be a generalized inverse of A for which the projector AG has nullspace equal to the range of B. The Elfving Theorem 2.14 now permits an alternative version that puts all the emphasis on moment matrices, as does the General Equivalence Theorem 7.14. 2.16. EQUIVALENCE THEOREM FOR SCALAR OPTIMALITY Theorem. A moment matrix M e M(H) is optimal for c'Q in M(H) if and only if M lies in the feasibility cone A(c] and there exists a generalized inverse G of M such that
2.16. EQUIVALENCE THEOREM FOR SCALAR OPTIMALITY
53
Proof. First the converse is proved, providing a sufficient condition for optimality. We use the inequality
for matrices A e M (H) n A(c), obtained from the Gauss-Markov Theorem 1.20 with U = Ik, X = c, V = A. This yields the second inequality in the chain
But we have c'G'c = c'Gc — c'M c, since M lies in the feasibility cone A(c). Thus we obtain c'A~c>c'M~c and M is optimal. The direct part, the necessity of the condition, needs more attention. From the proof of the Elfving Theorem 2.14, we know that there exists a vector h e Rk such that this vector and an optimal moment matrix M jointly satisfy the three conditions
This means that the equality c'M c — c'Nc holds true with N = hh'\ compare the Mutual Boundedness Theorem 2.11. The remainder of the proof, the construction of a suitable generalized inverse G of M, is entirely based on conditions (0), (1), and (2). Given any other moment matrix A e A/(H) belonging to a competing design 17 € H we have, from condition (0),
Conditions (1) and (2) combine into
Next we construct a symmetric generalized inverse G of M with the property that
From (4) we have c'h ^ 0. This means that c'(ah) = 0 forces a — 0, or formally,
54
CHAPTER 2: OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS
With Lemma 1.13, a passage to the orthogonal complement gives
But the vector c is a member of the range of M, and the left hand sum can only grow bigger if we replace range c by range M. Hence, we get
This puts us in a position where we can find a nonnegative definite k x k matrix H with a range that is included in the nullspace of h' and that is complementary to the range of M:
The first inclusion is equivalent to range h C nullspace H, that is, Hh = 0. Now we choose for M, the generalized inverse G = (M+H)~l of Lemma 2.15. Postmultiplication of Ik = GM + GH by h yields h = GMh, whence condition (5) is verified. Putting together steps (3), (5), (2), and (4) we finally obtain, for all A e Af(B),
2.17. BOUNDS FOR THE OPTIMAL VARIANCE Bounds for the optimal variance (p(c))2 with varying coefficient vector c can be obtained in terms of the Euclidean norm \\c\\ = \fcrc of the coefficient vector c. Let r and R be the radii of the Euclidean balls inscribed in and circumscribing the Elfving set ft,
The norm ||z||, as a convex function, attains its maximum over all vectors z with p(z) = 1 at a particular vector z in the generating set X ( J ( - X ) . Also the norm is invariant under sign changes, whence maximization can be restricted
2.17. BOUNDS FOR THE OPTIMAL VARIANCE
55
EXHIBIT 2.5 Euclidean balls inscribed in and circumscribing the Elfving set. For the threepoint regression range X = {(*), (*), (*)}.
to the regression range X. Therefore, we obtain the alternative representation
Exhibit 2.5 illustrates these concepts. By definition we have, for all vectors c ^ 0 in the space C(X) spanned by*,
If the upper bound is attained, ||c||/p(c) = R, then c/p(c) or -c/p(c) lie in the regression range X. In this case, the one-point design assigning mass 1 to the single point c/p(c) and having moment matrix M = cc'/(p(c))2 is optimal for c'6 in M(H). Clearly, c/p(c) is an eigenvector of the optimal moment matrix M, corresponding to the eigenvalue c'c/(p(c))2 = R2. The following corollary shows that the eigenvector property pertains to every optimal moment matrix, not just to those stemming from one-point designs. Attainment of the lower bound, ||c||/p(c) = r, does not generally lead to optimal one-point designs. Yet it still embraces a very similar result on the eigenvalue properties of optimal moment matrices.
56
CHAPTER 2: OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS
2.18. EIGENVECTORS OF OPTIMAL MOMENT MATRICES Corollary. Let the moment matrix M e A/(H) be optimal for c'0 in M(H). If 0 ^ c € C(X) and ||c||/p(c) = r, that is, c/p(c) determines the Euclidean ball inscribed in ft, then c is an eigenvector of M corresponding to the eigenvalue r2. If 0 ^ c e £(,*) and ||c||/p(c) = /?, that is, c/p(c) determines the Euclidean ball circumscribing 71, then c is an eigenvector of M corresponding to the eigenvalue R2. Proof. The hyperplane given by h = c/(p(c)r2) or h = c/(p(c)/?2) supports ft at c/p(c), since ||c||/p(c) is the radius of the ball inscribed in or circumscribing ft. The proof of the Elfving Theorem 2.14 shows that c/p(c) = A//z, that is, Me = r2c or Me = R2c. The eigenvalues of any moment matrix M are bounded from above by R2. This is an immediate consequence of the Cauchy inequality,
On the other hand, the in-ball radius r2 does not need to bound the eigenvalues from below. Theorem 7.24 embraces situations in which the smallest eigenvalue is r2, with rather powerful consequences. In general, however, nothing prevents M from becoming singular. For instance, suppose the regression range is the Euclidean unit ball of the plane, X = {* e R2 : ||*|| < 1} = ft. The ball circumscribing ft coincides with the ball inscribed in ft. The only optimal moment matrix for c'0 is the rank one matrix cc'/lkl)2. Here c is an eigenvector corresponding to the eigenvalue r2 — 1, but the smallest eigenvalue is zero. The next corollary answers the question, for a given moment matrix A/, which coefficient vectors c are such that M is optimal for c'0 in A/(E). 2.19. OPTIMAL COEFFICIENT VECTORS FOR GIVEN MOMENT MATRICES Corollary. Let M be a moment matrix in A/(E). The set of all nonvanishing coefficient vectors c e Uk such that M is optimal for c'0 in Af (E) is given by the set of vectors c = SMh where d > 0 and the vector h e Uk is such that it satisfies (x'h)2 < 1 = h'Mh for all x e X. Proof. For the direct inclusion, the formula c = p(c)Mh from the proof of the Elfving Theorem 2.14 represents c in the desired form. For the converse inclusion, let £ be a design with moment matrix M and let * be a support point. From h'Mh - 1, we find that e(x) = x'h equals ±1. With 8 > 0, we get c = SMh = S ]Cxesupp ( £(x)£(x)x' whence optimality follows.
2.21. PARABOLA FIT MODEL
57
Thus, in terms of cylinders that include the Elfving set, a moment matrix M is optimal in A/(H) for at least one scalar parameter system if and only if there exists a cylinder N e A/" of rank one such that trace M N — 1. Closely related conditions appear in the admissibility discussion of Corollary 10.10. 2.20. LINE FIT MODEL We illustrate these results with the line fit model of Section 1.6,
with a compact interval [a; b\ as experimental domain T. The vectors (j) with t € [a; b] constitute the regression range X, so that the Elfving set Tl becomes the parallelogram with vertices ±Q and ±Q. Given a coefficient vector c, it is easy to compute its Elfving norm p(c) and to depict c/p(c) as a convex combination of the four vertices of 72.. As an example, we consider the symmetric and normalized experimental domain T = [—!;!]. The intercept a has coefficient vector c — (J). A design T on T is optimal for the intercept if and only if the first design moment vanishes, /AI = J T f dr = 0. This follows from the Elfving Theorem 2.14, since
forces e(t) — 1 if r(t) > 0 and ^ = 0. The optimal variance for the intercept isl. The slope /3 has coefficient vector c = (^). The design T which assigns mass 1/2 to the points ±1 is optimal for the slope, because with e(l) = 1 and e(—1) = —1, we get
The optimal design is unique, since there are no alternative convex representations of c. The optimal variance for the slope is 1. 2.21. PARABOLA FIT MODEL In the model for a parabola fit, the regression space is three-dimensional and an illustration becomes slightly more tedious. The model equation is
58
CHAPTER 2: OPTIMAL DESIGNS FOR SCALAR PARAMETER SYSTEMS
with regression function /(/) = (1,f, t2)'. The Elfving set 7£ is the convex hull of the two parabolic arcs ±(l,r,f 2 ) with t e [a; b], as shown in Exhibit 2.2. If the experimental domain is [a;b] = [-1;1], then the radius of the ball circumscribing U is R = \/3 = |jc||, with c = (1, ±1,1)'. The radius of the ball inscribed in U is r ^ l/y/5 = |jc||/5, with c = (-1,0,2)'. The vector c/5 has Elfving norm 1 since it lies in the face with vertices /(-I), —/(O), and /(I),
Thus an optimal design for c'O in M(H) is obtained by allocating mass 1/5 at the two endpoints ±1 and putting the remaining mass 3/5 into the midpoint 0. Its moment matrix
has eigenvector c corresponding to the eigenvalue r2 = 1/5, as stated by Corollary 2.18. The other two eigenvalues are larger, 2/5 and 6/5. We return to this topic in Section 9.12.
2.22. TRIGONOMETRIC FIT MODELS An example with much smoothness is the model for a trigonometric fit of degree d,
with k = 2d + 1 mean parameters a, A, ri, • • • , A/, r say, are outside the scope of the theory that we develop. If a genuinely nonlinear problem has to be investigated, a linearization using the Taylor theorem may permit a valid analysis of the problem.
61
62
CHAPTER 3: INFORMATION MATRICES
3.2. INFORMATION MATRICES FOR FULL RANK SUBSYSTEMS What is an appropriate way of evaluating the performance of a design if the parameter system of interest is K'Ql In the first place, this depends on the underlying model which again we take to be the classical linear model,
Secondly it might conceivably be influenced by the type of inference the experimenter has in mind. However, it is our contention that point estimation, hypothesis testing, and general parametric model-building all guide us to the same central notion of an information matrix. Its definition assumes that the k x s coefficient matrix K has full column rank s. DEFINITION. For a design £ with moment matrix M the information matrix for K'B, with k x s coefficient matrix K of full column rank s, is defined to be CK(M] where the mapping CK from the cone NND(fc) into the space Sym(.s) is given by
Here the minimum is taken relative to the Loewner ordering, over all left inverses L of K. The notation L is mnemonic for a left inverse; these matrices are of order s x k and hence may have more columns than rows, deviating from the conventions set out in Section 1.7. We generally call CK(A) an information matrix for K '6, without regard to whether A is the moment matrix of a design or not. The matrix CK(A) exists as a minimum because it matches the GaussMarkov Theorem in the form of Theorem 1.21, with X replaced by K and V by A. Moreover, with residual projector R = 4 — KL for some left inverse L of K, Theorem 1.21 offers the representations
whenever the left inverse L of K satisfies LAR' = 0. Such left inverses L of K are said to be minimizing for A. They are obtained as solutions of the linear matrix equation
Occasionally, we make use of the existence (rather than of the form) of left inverses L of K that are minimizing for A.
3.3. FEASIBILITY CONES
63
We provide a detailed study of information matrices, their statistical meaning (Section 3.3 to Section 3.10) and their functional properties (Section 3.11 to Section 3.19). Finally (Section 3.20 to Section 3.25), we introduce generalized information matrices for those cases where the coefficient matrix K is rank deficient. 3.3. FEASIBILITY CONES Two instances of information matrices have already been dealt with, even though the emphasis at the time was somewhat different. The most important case occurs if the full parameter vector 9 is of interest, that is, if K — 7*. Since the unique (left) inverse L of K is then the identity matrix Ik, the information matrix for 6 reproduces the moment matrix M,
In other words, for a design £ the matrix M (£) has two meanings. It is the moment matrix of £ and it is the information matrix for 0. The distinction between these two views is better understood if moment matrices and information matrices differ rather than coincide. This happens with scalar subsystems c'0, the second special case. If the matrix M lies in the feasibility cone A(c), Theorem 1.21 provides the representation
Here the information for c'd is the scalar (c'M~c)~l, in contrast to the moment matrix M. The design problem of Section 2.7 calls for the minimization of the variance c'M~c over all feasible moment matrices M. Clearly this is the same as maximizing (c'M~c)~l. The task of maximizing information sounds reasonable. It is this view that carries over to greater generality. The notion of feasibility cones generalizes from the one-dimensional discussion of Section 2.2 to cover an arbitrary subsystem K'6, by forming the subset of nonnegative definite matrices A such that the range of K is included in the range of A. The definition does not involve the rank of the coefficient matrix K. DEFINITION. For a parameter subsystem K'O, the feasibility cone A(K) is defined by
A matrix A e Sym(A:) is called feasible for K'd when A G A(K), a design £ is called feasible for K'O when M(£) e A(K).
64
CHAPTER 3: INFORMATION MATRICES
Feasibility cones will also be used with other matrices in place of K. Their geometric properties are the same as in the one-dimensional case studied in Theorem 2.4. They are convex subcones of the closed cone NND(fc), and always include the open cone PD(fc). If the rank of K is as large as can be, k, then its feasibility cone A(K) coincides with PD(k) and is open. But if the rank of K is smaller than A:, then singular matrices A may, or may not, lie in A(K), depending on whether their range includes the range of K. Here the feasibility cone is neither open nor closed. In the scalar case, we assumed that the coefficient vector c does not vanish. More generally, suppose that the coefficient matrix K has full column rank s. Then the Gauss-Markov Theorem 1.21 provides the representation
It is in this form that information matrices appear in statistical inference. The abstract definition chosen in Section 3.2 does not exhibit its merits until we turn to their functional properties. Feasibility cones combine various inferential aspects, namely those of estimability, testability, and identifiability. The following sections elaborate on these interrelations.
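Membership in a feasibility cone is a range inclusion and can be tested by a rank comparison. A minimal sketch, with matrices and helper names of our own choosing:

import numpy as np

rng = np.random.default_rng(1)
k, s = 4, 2
K = rng.standard_normal((k, s))

def in_feasibility_cone(A, K, tol=1e-9):
    # range(K) is included in range(A) iff appending K does not increase the rank
    return np.linalg.matrix_rank(np.hstack([A, K]), tol) == np.linalg.matrix_rank(A, tol)

A_pd = np.eye(k)                        # positive definite matrices are always feasible
A_feasible = K @ K.T                    # singular, but its range equals the range of K
z = rng.standard_normal(k)
A_infeasible = np.outer(z, z)           # rank one; almost surely misses the range of K

print(in_feasibility_cone(A_pd, K))         # True
print(in_feasibility_cone(A_feasible, K))   # True
print(in_feasibility_cone(A_infeasible, K)) # False (with probability one)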
3.4. ESTIMABILITY
First we address the problem of estimating the parameters of interest, K'θ, for a model with given model matrix X. The subsystem K'θ is estimable if and only if there exists at least one n x s matrix U such that
as pointed out in Section 1.18. This entails K = X'U, or equivalently, range K ⊆ range X' = range X'X. With moment matrix M = (1/n)X'X, we obtain the estimability condition
that is, the moment matrix M must lie in the feasibility cone for K'θ. This visibly generalizes the result of Section 2.2 for scalar parameter systems. It also embraces the case of the full parameter system θ discussed in Section 1.23. There estimability forces the model matrix X and its associated moment matrix M to have full column rank k. In the present terms, this means M ∈ A(I_k) = PD(k).
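As an illustration, not taken from the text, the moment matrix of a design can be assembled from its support and weights and the estimability condition checked as above. The quadratic regression setup below is our own toy example.

import numpy as np

# Quadratic regression f(t) = (1, t, t^2)', observed at t = -1, 0, 1 with equal weights 1/3.
support = [np.array([1.0, t, t * t]) for t in (-1.0, 0.0, 1.0)]
M = sum(np.outer(x, x) / 3 for x in support)          # moment matrix M(xi) of the design

def estimable(K, M, tol=1e-9):
    # K'theta is estimable iff M lies in the feasibility cone A(K)
    return np.linalg.matrix_rank(np.hstack([M, K]), tol) == np.linalg.matrix_rank(M, tol)

K_full = np.eye(3)                                    # the full parameter vector theta
c = np.array([[0.0], [1.0], [0.0]])                   # the coefficient of the linear term
print(estimable(K_full, M), estimable(c, M))          # True True: M is positive definite

# A two-point design on t = -1, 1 only: the full theta is no longer estimable, but c'theta still is.
M2 = (np.outer(support[0], support[0]) + np.outer(support[2], support[2])) / 2
print(estimable(K_full, M2), estimable(c, M2))        # False True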
3.5. GAUSS-MARKOV ESTIMATORS AND PREDICTORS Assuming estimability, the optimal estimator γ̂ for the parameter subsystem γ = K'θ is given by the Gauss-Markov Theorem 1.20. Using U'X = K', the estimator γ̂ for γ is
Upon setting M = (1/n)X'X, the dispersion matrix of the estimator γ̂ is
The dispersion matrix becomes invertible provided the coefficient matrix K has full column rank s, leading to the precision matrix
The common factor n/σ² does not affect the comparison of these matrices in the Loewner ordering. In view of the order reversing property of matrix inversion from Section 1.11, minimization of dispersion matrices is the same as maximization of information matrices. In this sense, the problem of designing the experiment in order to obtain precise estimators for K'θ calls for a maximization of the information matrices C_K(M). A closely related task is that of prediction. Suppose we wish to predict additional responses Y_{n+j} = x'_{n+j}θ + ε_{n+j}, for j = 1,...,s. We assemble the prediction sites x_{n+j} in a k x s matrix K = (x_{n+1},...,x_{n+s}), and take the random vector γ̂ from (1) as a predictor for the random vector Z = (Y_{n+1},...,Y_{n+s})'. Clearly, γ̂ is linear in the available response vector Y = (Y_1,...,Y_n)', and unbiased for the future response vector Z,
Any other unbiased linear predictor LY for Z has mean squared-error matrix
The first term appears identically in the estimation problem, while the second term is constant. Hence γ̂ is an unbiased linear predictor with minimum mean squared-error matrix,
For the purpose of prediction, the design problem thus calls for minimization of K'M⁻K, as it does from the estimation viewpoint. The importance of the information matrix C_K(M) is also justified on computational grounds. Numerically, the optimal estimate θ̂ for θ is obtained as the solution of the normal equations
Multiplication by 1/n turns the left hand side into Mθ̂. On the right hand side, the k x 1 vector T = (1/n)X'Y is called the vector of totals. If the regression vectors x_1,...,x_ℓ are distinct and if Y_{ij} for j = 1,...,n_i are the replications under x_i, then we get
where Y,. = (1/n,-) £)/ 2 + rank X, the statistic F has expectation
The larger the noncentrality parameter, the larger the values of F we expect, and the more clearly the test detects a significant deviation from the hypothesis H_0. V. In the fifth and final step, we study how the F-test responds to a change in the model matrix X. That is, we ask how the test for K'θ = 0 compares in a grand model with a moment matrix M as before, relative to one with an alternate moment matrix A, say. We assume that the testability condition A ∈ A(K) is satisfied. The way the noncentrality parameter determines the expectation of the statistic F involves the matrices
It indicates that the F-test is uniformly better under moment matrix M than under moment matrix A provided
If K is of full column rank, this is the same as requiring C_K(M) ≥ C_K(A) in the Loewner ordering. Again we end up with the task of maximizing the information matrix C_K(M). Some details have been skipped over. It is legitimate to compare two F-tests by their noncentrality parameters only with "everything else being equal". More precisely, model variance σ² and sample size n ought to be the
EXHIBIT 3.1 ANOVA decomposition. An observed yield y decomposes into P_0 y + P_1 y + Ry. The term P_0 y is fully explained by the hypothesis. The sum of squares ||P_1 y||² is indicative of a deviation from the hypothesis, while ||Ry||² measures the model variance.
same. That the variance per observation, σ²/n, is constant is also called for in the estimation problem. Furthermore, the testing problem even requires equality of the ranks of two competing moment matrices M and A because they affect the denominator degrees of freedom of the F-distribution. Nevertheless the more important aspects seem to be captured by the matrices C_K(M) and C_K(A). Therefore we continue to concentrate on maximizing information matrices. 3.8. ANOVA The rationale underlying the F-test is often subsumed under the heading analysis of variance. The key connection is the decomposition of the identity matrix into orthogonal projectors,
These projectors correspond to three nested subspaces: i. the sample space ℝⁿ = range(P_0 + P_1 + R); ii. the mean space under the full model, range(P_0 + P_1); iii. the mean space under the hypothesis H_0, range P_0. Accordingly the response vector Y decomposes into the three terms P_0 Y, P_1 Y, and RY, as indicated in Exhibit 3.1.
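The decomposition can be checked numerically. A minimal sketch, assuming a hypothetical hypothesis that the last two coefficients of a randomly generated model matrix vanish; the names are ours, not the book's.

import numpy as np

rng = np.random.default_rng(2)
n = 12
X = rng.standard_normal((n, 4))       # model matrix of the full model
X0 = X[:, :2]                         # mean space under the hypothesis: last two coefficients zero

def proj(B):
    return B @ np.linalg.pinv(B)      # orthogonal projector onto range(B)

P_full = proj(X)
P0 = proj(X0)                         # projector onto the mean space under the hypothesis
P1 = P_full - P0                      # deviation from the hypothesis within the full mean space
R = np.eye(n) - P_full                # residual projector

y = rng.standard_normal(n)
print(np.allclose(P0 + P1 + R, np.eye(n)))                   # decomposition of the identity
print(np.allclose(P1 @ P0, 0), np.allclose(R @ P_full, 0))   # mutually orthogonal projectors
parts = [P0 @ y, P1 @ y, R @ y]
print(np.isclose(sum(p @ p for p in parts), y @ y))          # ||y||^2 splits into the three sums of squares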
The term P_0 Y is entirely explained as a possible mean vector, both under the full model and under the hypothesis. The quadratic forms
are mean sums of squares estimating the deviation of the observed mean vector from the hypothetical model, and estimating the model variance. The sample space ℝⁿ is orthogonally decomposed with respect to the Euclidean scalar product because in the classical linear model, the dispersion matrix is proportional to the identity matrix. For a general linear model with dispersion matrix V > 0, the decomposition ought to be carried out relative to the pseudo scalar product (x,y)_V = x'V⁻¹y. 3.9. IDENTIFIABILITY Estimability and testability share a common background, identifiability. We assume that the response vector Y follows a normal distribution N_{Xθ, σ²I_n}. By definition, the subsystem K'θ is called identifiable (by distribution) when all parameter vectors θ_1, θ_2 ∈ ℝ^k satisfy the implication
The premise means that the parameter vectors θ_1 and θ_2 specify one and the same underlying distribution. The conclusion demands that the parameters of interest then coincide as well. We have seen in Section 3.6 that, in terms of the moment matrix M = (1/n)X'X, we can transcribe this definition into the identifiability condition
Identifiability of the subsystem of interest is a necessary requirement for parametric model-building, regardless of the intended statistical inference. Estimability and testability are but two particular instances of intended inference. 3.10. FISHER INFORMATION A central notion in parametric modeling, not confined to the normal distribution, is the Fisher information matrix for the model parameter θ. There are two alternative definitions. It is the dispersion matrix of the first logarithmic derivative with respect to θ of the model density. Or, up to a change of sign, it is the expected matrix of the second logarithmic derivative. In a classical
linear model with normality assumption, the Fisher information matrix for the full mean parameter system turns out to be (n/σ²)M. [...] ≥ 0 for all x, hence also for −x. Thus we get x'A_12 y = 0 for all x, and therefore A_12 y = 0. It follows from Lemma 1.17 that A_22 A_22⁻ A_21 = A_21 and that A_12 A_22⁻ A_21 does not depend on the choice of the generalized inverse for A_22. Finally, nonnegative definiteness of A_11 − A_12 A_22⁻ A_21 is a consequence of the identity
For the converse part of the proof, we need only invert the last identity:
There are several lessons to be learnt from the Schur complement representation of information matrices. Firstly, we can interpret it as consisting of the term A_11, the information matrix for (θ_1,...,θ_s)' in the absence of nuisance parameters, adjusted for the loss of information A_12 A_22⁻ A_21 caused by entertaining the nuisance parameters (θ_{s+1},...,θ_k)'.
Secondly, Lemma 1.17 says that the penalty term A_12 A_22⁻ A_21 has the same range as A_12. Hence it may vanish, making an inversion of A_22 obsolete. It vanishes if and only if A_12 = 0, which may be referred to as parameter orthogonality of the parameters of interest and the nuisance parameters. For instance, in a linear model with normality assumption, the mean parameter vector θ and the model variance σ² are parameter orthogonal (see Section 3.10). Thirdly, the complexity of computing C_K(A) is determined by the inversion of the (k − s) x (k − s) matrix A_22. In general, we must add the complexity of computing a left inverse L of the k x s coefficient matrix K. For a general subsystem of interest K'θ with coefficient matrix K of full column rank s, four formulas for the information matrix C_K(A) are now available:
C_K(A) = (K'A⁻K)⁻¹ for A ∈ A(K),  (1)
C_K(A) = LAL' − LAR'(RAR')⁻RAL',  (2)
C_K(A) = LAL' for a left inverse L of K that is minimizing for A,  (3)
C_K(A) = min { LAL' : L is a left inverse of K }.  (4)
Formula (1) clarifies the role of C_K(A) in statistical inference. It is restricted to the feasibility cone A(K) but can be extended to the closed cone NND(k) by semicontinuity (see Section 3.13). Formula (2) utilizes an arbitrary left inverse L of K and the residual projector R = I_k − KL. It is of Schur complement type, focusing on the loss of information due to the presence of nuisance parameters. The left inverse L can be chosen in order to relieve the computational burden as much as possible. Formula (3) is based on a left inverse L of K that is minimizing for A, and shifts the computational task over to determining a solution of the linear matrix equation L(K, AR') = (I_s, 0). Formula (4) has been adopted as a definition and is instrumental when we next establish the functional properties of the mapping C_K. The formula provides a quasi-linear representation of C_K, in the sense that the mappings A ↦ LAL' are linear in A, and that C_K is the minimum of a collection of such linear mappings. 3.13. BASIC PROPERTIES OF THE INFORMATION MATRIX MAPPING Theorem. Let K be a k x s coefficient matrix of full column rank s, with associated information matrix mapping
Then C_K is positively homogeneous, matrix superadditive, and nonnegative, as well as matrix concave and matrix isotonic:
Moreover C_K enjoys any one of the following three equivalent properties: a. (Upper semicontinuity) The level sets
are closed, for all C ∈ Sym(s). b. (Sequential semicontinuity criterion) For all sequences (A_m)_{m≥1} in NND(k) that converge to a limit A, we have
c. (Regularization) For all A, B ∈ NND(k) we have
Proof. The first three properties are immediate from the definition of C_K(A) as the minimum over the matrices LAL'. Superadditivity and homogeneity imply matrix concavity:
Superadditivity and nonnegativity imply matrix monotonicity,
Moreover C_K enjoys property (b). Suppose the matrices A_m ≥ 0 converge
to A such that C_K(A_m) ≥ C_K(A). With a left inverse L of K that is minimizing for A we obtain
Hence C_K(A_m) converges to the limit C_K(A). It remains to establish the equivalence of (a), (b), and (c). For this, we need no more than that the mapping C_K : NND(k) → Sym(s) is matrix isotonic. Let B = {B ∈ Sym(k) : trace B² ≤ 1} be the closed unit ball in Sym(k), as encountered in Section 1.9. There we proved that every matrix B ∈ B fulfills |x'Bx| ≤ x'x for all x ∈ ℝ^k. In terms of the Loewner ordering this means −I_k ≤ B ≤ I_k, for all B ∈ B. First we prove that (a) implies (b). Let (A_m)_{m≥1} be a sequence in NND(k) converging to A, and satisfying C_K(A_m) ≥ C_K(A) for all m ≥ 1. For every ε > 0, the sequence (A_m)_{m≥1} eventually stays in the ball A + εB. This entails A_m ≤ A + εI_k and, by monotonicity of C_K, also C_K(A_m) ≤ C_K(A + εI_k) for eventually all m. Hence the sequence (C_K(A_m))_{m≥1} is bounded in Sym(s), and possesses cluster points C. From C_K(A_m) ≥ C_K(A), we obtain C ≥ C_K(A). Let (C_K(A_{m_ℓ}))_{ℓ≥1} be a subsequence converging to C. For all δ > 0, the sequence (C_K(A_{m_ℓ}))_{ℓ≥1} eventually stays in C + δB_s, where B_s denotes the closed unit ball in Sym(s). This implies C_K(A_{m_ℓ}) ≥ C − δI_s, for eventually all ℓ. In other words, A_{m_ℓ} lies in the level set {C_K ≥ C − δI_s}. This set is closed, by assumption (a), and hence also contains A. Thus we get C_K(A) ≥ C − δI_s and, with δ tending to zero, C_K(A) ≥ C. This entails C = C_K(A). Since the cluster point C was arbitrary, (b) follows. Next we show that (b) implies (c). Indeed, the sequence A_m = A + (1/m)B converges to A and, by monotonicity of C_K, fulfills C_K(A_m) ≥ C_K(A). Now (b) secures convergence, thus establishing part (c). Finally we demonstrate that (c) implies (a). For a fixed matrix C ∈ Sym(s) let (A_ℓ)_{ℓ≥1} be a sequence in the level set {C_K ≥ C} converging to a limit A. For every m ≥ 1, the sequence (A_ℓ)_{ℓ≥1} eventually stays in the ball A + (1/m)B. This implies A_ℓ ≤ A + (1/m)I_k and, by monotonicity of C_K, also C_K(A_ℓ) ≤ C_K(A + (1/m)I_k); since A_ℓ lies in the level set {C_K ≥ C}, this gives C ≤ C_K(A + (1/m)I_k). The right hand side converges to C_K(A) because of regularization. But C ≤ C_K(A) means A ∈ {C_K ≥ C}, establishing closedness of the level set {C_K ≥ C}. The main advantage of defining the mapping C_K as the minimum of the linear functions LAL' is the smoothness that it acquires on the boundary of its domain of definition, expressed through upper semicontinuity (a). Our definition requires closedness of the upper level sets {C_K ≥ C} of the matrix-valued function C_K, and conforms as it should with the usual concept of upper semicontinuity for real-valued functions, compare Lemma 5.7. Property (b)
provides a handy sequential criterion for upper semicontinuity. Part (c) ties in with regular, that is, nonsingular, matrices since A + (1/m)B is positive definite whenever B is positive definite. A prime application of regularization consists in extending the formula C_K(A) = (K'A⁻K)⁻¹ from the feasibility cone A(K) to all nonnegative definite matrices,
By homogeneity this is the same as
where the point is that the positive definite matrices
converge to A along the straight line from I_k to A. The formula remains correct with the identity matrix I_k replaced by any positive definite matrix B. In other words, the representation (K'A⁻¹K)⁻¹ permits a continuous extension from the interior cone PD(k) to the closure NND(k), as long as convergence takes place "along a straight line" (see Exhibit 3.2). Section 3.16 illustrates by example that the information matrix mapping C_K may well fail to be continuous if the convergence is not along straight lines. Next we show that the rank behavior of C_K(A) completely specifies the feasibility cone A(K). The following lemma singles out an intermediate step concerning ranges. 3.14. RANGE DISJOINTNESS LEMMA Lemma. Let the k x s coefficient matrix K have full column rank s, and let A be a nonnegative definite k x k matrix. Then the matrix
is the unique matrix with the three properties
Proof. First we show that A_K enjoys each of the three properties. Nonnegative definiteness of A_K is evident. Let L be a left inverse of K that is minimizing for A, and set R = I_k − KL. Then we have C_K(A) = LAL', and
EXHIBIT 3.2 Regularization of the information matrix mapping. On the boundary of the cone NND(k), convergence of the information matrix mapping C_K holds true provided the sequence A_1, A_2, ... tends along a straight line from inside PD(k) towards the singular matrix A, but may fail otherwise.
the Gauss-Markov Theorem 1.19 yields
This establishes the first property. The second property, the range inclusion condition, is obvious from the definition of A_K. Now we turn to the range disjointness property of A − A_K and K. For vectors u ∈ ℝ^k and v ∈ ℝ^s with (A − A_K)u = Kv, we have
In the penultimate line, the product LAR' vanishes since the left inverse L of K is minimizing for A, by the Gauss-Markov Theorem 1.19. Hence the ranges of A − A_K and K are disjoint except for the null vector. Secondly, consider any other matrix B with the three properties. Because of B ≥ 0 we can choose a square root decomposition B = UU'. From the second property, we get range U = range B ⊆ range K. Therefore we can write U = KW. Now A ≥ B entails A_K ≥ B, by way of
Thus A − B = (A − A_K) + (A_K − B) is the sum of two nonnegative definite matrices. Invoking the third property and the range summation Lemma 2.3, we obtain
A matrix with null range must be the null matrix, forcing B = A_K. The rank behavior of information matrices measures the extent of identifiability, indicating how much of the range of the coefficient matrix is captured by the range of the moment matrix.
3.15. RANK OF INFORMATION MATRICES Theorem. Let the k x s coefficient matrix K have full column rank s, and let A be a nonnegative definite k x k matrix. Then
In particular C_K(A) is positive definite if and only if A lies in the feasibility cone A(K). Proof. Define A_K = K C_K(A) K'. We know that A = (A − A_K) + A_K is the sum of two nonnegative definite matrices, by the preceding lemma. From Lemma 2.3 we obtain range A ∩ range K = (range(A − A_K) ∩ range K) + (range A_K ∩ range K) = range A_K.
Since K has full column rank this yields
In particular, C_K(A) has rank s if and only if range A ∩ range K = range K, that is, the matrix A lies in the feasibility cone A(K). The somewhat surprising conclusion is that the notion of feasibility cones is formally redundant, even though it is essential to statistical inference. The theorem actually suggests that the natural order of first checking feasibility and then calculating C_K(A) be reversed. First calculate C_K(A) from the Schur complement type formula
with any left inverse L of K and residual projector R = I_k − KL. Then check nonsingularity of C_K(A) to see whether A is feasible. Numerically it is much simpler to find out whether the nonnegative definite matrix C_K(A) is singular or not, rather than to verify a range inclusion condition. The case of the first s out of all k mean parameters is particularly transparent. As elaborated in Section 3.11, its information matrix is the ordinary Schur complement of A_22 in A. The present theorem states that the rank of the Schur complement determines feasibility of A, in that the Schur complement has rank s if and only if the range of A includes the leading s coordinate subspace. For scalar subsystems c'θ, the design problem of Section 2.7 only involves those moment matrices that lie in the feasibility cone A(c). It becomes clear now why this restriction does not affect the optimization problem. If the matrix A fails to lie in the cone A(c), then the information for c'θ vanishes, having rank equal to
The formulation of the design problem in Section 2.7 omits only such moment matrices that provide zero information for c'θ. Clearly this is legitimate. 3.16. DISCONTINUITY OF THE INFORMATION MATRIX MAPPING We now convince ourselves that matrix upper semicontinuity as in Theorem 3.13 is all we can generally hope for. If the coefficient matrix K is square and nonsingular, then we have C_K(A) = K⁻¹AK⁻¹' for A ∈ NND(k). Here the mapping C_K is actually continuous on the entire closed cone NND(k). But continuity breaks down as soon as the rank of K falls below k.
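The failure of continuity can be seen numerically; the following sketch anticipates the scalar example described in the next paragraph and also shows that regularization along a straight line does converge. It is our own illustration, not part of the text.

import numpy as np

c = np.array([1.0, 0.0])          # coefficient vector of the scalar subsystem c'theta

def info_c(A, tol=1e-9):
    # scalar information for c'theta: (c'A^-c)^{-1} if c lies in range(A), and 0 otherwise
    if np.linalg.matrix_rank(np.column_stack([A, c]), tol) > np.linalg.matrix_rank(A, tol):
        return 0.0
    return 1.0 / float(c @ np.linalg.pinv(A) @ c)

for m in (1, 10, 100, 1000):
    d = np.array([0.0, 1.0 / m])             # d_m orthogonal to c, tending to zero
    A_m = np.outer(c + d, c + d)             # never feasible for c: information stays at zero
    print(m, info_c(A_m))

print(info_c(np.outer(c, c)))                # the limit cc' carries information one
# Regularization along the straight line from I_2 towards the singular limit does converge:
for eps in (1.0, 0.1, 0.01, 0.001):
    print(eps, info_c(np.outer(c, c) + eps * np.eye(2)))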
This discontinuity is most easily recognized for scalar parameter systems c'θ. We simply choose vectors d_m ≠ 0 orthogonal to c and converging to zero. The matrices A_m = (c + d_m)(c + d_m)' then do not lie in the feasibility cone A(c), whence C_c(A_m) = 0. On the other hand, the limit cc' has information one for c'θ,
The same reasoning applies as long as the range of the coefficient matrix K is not the full space ℝ^k. To see this, let KK' = Σ [...] of KH is minimizing for A if and only if it factorizes into some left inverse L_{K,A} of K that is minimizing for A and some left inverse L_{H,C_K(A)} of H that is minimizing for C_K(A),
Proof. There exist left inverses L_{K,A} [...] B ≥ 0. For nonnegative definite matrices C ≥ D ≥ 0, the basic relation is
which holds if D⁻ is a generalized inverse of D that is nonnegative definite. For instance we may choose D⁻ = GDG' with G ∈ D⁻. 3.23. EQUIVALENCE OF INFORMATION ORDERING AND DISPERSION ORDERING Lemma. Let A and B be two matrices in the feasibility cone A(K). Then we have
Proof. For the direct part, we apply relation (1) of the preceding section to C = A_K and D = B_K, and insert for D⁻ a nonnegative definite generalized inverse B⁻ of B,
Upon premultiplication by K'A⁻ and postmultiplication by its transpose, all terms simplify using A_K A⁻K = K, because of Lemma 3.22, and since the proof of Theorem 3.15 shows that A_K and K have the same range and thus Lemma 1.17 applies. This leads to 0 ≤ K'B⁻K − K'A⁻K. The converse part of the proof is similar, in that relation (1) of the preceding section is used with C = K'B⁻K and D = K'A⁻K. Again invoking Lemma 1.17 to obtain DD⁻K' = K', we get 0 ≤ CD⁻C − C. Now using CC⁻K' = K', premultiplication by KC⁻ and postmultiplication by its transpose yield 0 ≤ KD⁻K' − KC⁻K'. But we have KD⁻K' = A_K and KC⁻K' = B_K, by Theorem 1.20.
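A rough numerical check of the equivalence, restricted for simplicity to a full column rank K and positive definite matrices (so that ordinary inverses can stand in for the generalized inverses); the setup is ours.

import numpy as np

rng = np.random.default_rng(3)
k, s = 4, 2
K = rng.standard_normal((k, s))
X = rng.standard_normal((2 * k, k))
B = X.T @ X                              # positive definite, hence in the feasibility cone A(K)
A = B + K @ K.T                          # A >= B in the Loewner ordering, also in A(K)

def is_nnd(S, tol=1e-9):
    return np.all(np.linalg.eigvalsh((S + S.T) / 2) >= -tol)

disp_A = K.T @ np.linalg.inv(A) @ K      # dispersion-type matrix K'A^-K
disp_B = K.T @ np.linalg.inv(B) @ K
info_A = np.linalg.inv(disp_A)           # C_K(A), since A lies in A(K)
info_B = np.linalg.inv(disp_B)

print(is_nnd(disp_B - disp_A))           # smaller dispersion under the Loewner-larger matrix A
print(is_nnd(info_A - info_B))           # equivalently, larger information under A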
The lemma verifies desideratum (iv) of Section 3.21 that, on the feasibility cone A(K), the transition between dispersion matrices K'A⁻K and generalized information matrices A_K is antitonic, relative to the Loewner orderings on the spaces Sym(s) and Sym(k). It remains to resolve desideratum (v) of Section 3.21, that the functional properties of Theorem 3.13 continue to hold for generalized information matrices. Before we state these properties more formally, we summarize the four formulas for the generalized information matrix A_K that parallel those of Section 3.12 for C_K(A):
Formula (1) holds on the feasibility cone A(K), by the Gauss-Markov Theorem 1.20. Formula (2) is of Schur complement type, utilizing the residual projector R = I_k − KG, G ∈ K⁻. Formula (3) involves a left identity Q of K that is minimizing for A, as provided by the Gauss-Markov Theorem 1.19. Formula (4) has been adopted as the definition, and leads to functional properties analogous to those for the mapping C_K, as follows. 3.24. PROPERTIES OF GENERALIZED INFORMATION MATRICES Theorem. Let K be a nonzero k x s coefficient matrix that may be rank deficient, with associated generalized information matrix mapping
Then the mapping A ↦ A_K is positively homogeneous, matrix superadditive, and nonnegative, as well as matrix concave and matrix isotonic. Moreover it enjoys any one of the three equivalent upper semicontinuity properties (a), (b), (c) of Theorem 3.13. Given a matrix A ∈ NND(k), its generalized information matrix A_K is uniquely characterized by the three properties
It satisfies the range formula range A_K = (range A) ∩ (range K), as well as the iteration formula A_{KH} = (A_K)_{KH}, with any other s x r coefficient matrix H.
Proof. The functional properties follow from the definition as a minimum of the linear functions A ↦ QAQ', as in the full rank case of Theorem 3.13. The three characterizing properties are from Lemma 3.14, and the proof remains unaltered except for choosing for K generalized inverses G rather than left inverses L. The range formula is established in the proof of Theorem 3.15. It remains to verify the iteration formula. Two of the defining relations are
There exist matrices Q and R attaining the respective minima. Obviously RQ satisfies RQKH = RKH = KH, and therefore the definition of A_{KH} yields the first inequality in the chain
The second inequality follows from monotonicity of the generalized information matrix mapping for (KH)'θ, applied to A_K ≤ A. This concludes the discussion of the desiderata (i)-(v), as stipulated in Section 3.21. We will return to the topic of rank deficient parameter systems in Section 8.18.
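The characterizing properties of A_K can be verified numerically for a rank deficient K. A minimal sketch, using Moore-Penrose inverses as particular generalized inverses and the Schur-complement-type formula (2); the matrices are our own toy choices.

import numpy as np

rng = np.random.default_rng(4)
k = 4
K = np.column_stack([np.ones(k), np.arange(k, dtype=float),
                     np.ones(k) + np.arange(k)])          # k x 3 coefficient matrix of rank 2
X = rng.standard_normal((2 * k, k))
A = X.T @ X                                               # positive definite

G = np.linalg.pinv(K)                                     # one generalized inverse of K
R = np.eye(k) - K @ G                                     # residual projector
A_K = A - A @ R.T @ np.linalg.pinv(R @ A @ R.T) @ R @ A   # Schur-complement-type formula (2)

def is_nnd(S, tol=1e-8):
    return np.all(np.linalg.eigvalsh((S + S.T) / 2) >= -tol)

def rank(M):
    return np.linalg.matrix_rank(M, 1e-8)

print(is_nnd(A_K), is_nnd(A - A_K))                       # 0 <= A_K <= A
print(rank(np.hstack([K, A_K])) == rank(K))               # range(A_K) is included in range(K)
# ranges of A - A_K and K intersect only in the null vector:
print(rank(np.hstack([A - A_K, K])) == rank(A - A_K) + rank(K))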
3.25. CONTRAST INFORMATION MATRICES IN TWO-WAY CLASSIFICATION MODELS A rank deficient parameter subsystem occurs with the centered contrasts K_a α of factor A in a two-way classification model, as mentioned in Section 3.20. By Section 1.5 the full parameter system is θ = (α', β')', and includes the effects β of factor B. Hence the present subsystem of interest can be written as
In Section 1.27 we have seen that an a x b block design W has moment matrix
where Δ_r and Δ_s are the diagonal matrices formed from the treatment replication vector r = W1_b and the blocksize vector s = W'1_a. We claim that the generalized information matrix for the centered treatment contrasts K_a α is
Except for vanishing subblocks, the matrix M_K coincides with the Schur complement of Δ_s in M,
The matrix (2) is called the contrast information matrix; it is often referred to as the C-matrix of the simple block design W. Therefore the definition of generalized information matrices for rank deficient subsystems is in accordance with the classical notion of C-matrices. Contrast information matrices enjoy the important property of having row sums and column sums equal to zero. This is easy to verify directly, since postmultiplication of Δ_r − WΔ_s⁻W' by the unity vector 1_a produces the null vector. We can deduce this result also from Theorem 3.24. Namely, the matrix M_K has its range included in that of K, whence the range of Δ_r − WΔ_s⁻W' is included in the range of K_a. Thus the nullspace of the matrix Δ_r − WΔ_s⁻W' includes the nullspace of K_a, forcing (Δ_r − WΔ_s⁻W')1_a to be the null vector. To prove the claim (1), we choose for K the generalized inverse K' = (K_a, 0), with residual projector
Let 1̄_a denote the a x 1 vector with all components equal to 1/a. In other words, this is the row sum vector that corresponds to the uniform distribution on the a levels of factor A. From J_a = 1_a 1̄_a' and Δ_r 1_a = r, as well as W'1_a = s, we obtain
Using
and
we further get
Hence for RMR we can choose the generalized inverse
Multiplication now yields
In the last line we have used Δ_sΔ_s⁻W' = W'. This follows from Lemma 1.17 since the submatrix Δ_s lies in the feasibility cone A(W'), by Lemma 3.12. (Alternatively we may look at the meaning of the terms involved. If the blocksize s_j is zero, then s_j s_j⁻ = 0 and the j th column of W vanishes. If s_j is positive, then s_j s_j⁻ = s_j s_j⁻¹ = 1.) In summary, we obtain
and formula (1) is verified. We return to this model in Section 4.3. The Loewner ordering fails to be a total ordering. Hence a maximization or a minimization problem relative to it may, or may not, have a solution. The Gauss-Markov Theorem 1.19 is a remarkable instance where minimization relative to the Loewner ordering can be carried out. The design problem embraces too general a situation; maximization of information matrices in the Loewner ordering is generally infeasible. However, on occasion Loewner optimality can be achieved and this is explored next.
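The C-matrix and its zero row and column sums are easy to reproduce numerically. A minimal sketch for a small hypothetical block design of our own choosing:

import numpy as np

# A hypothetical a x b block design given by its weight matrix W (weights sum to one).
W = np.array([[0.2, 0.1, 0.0],
              [0.1, 0.2, 0.1],
              [0.0, 0.1, 0.2]])
r = W.sum(axis=1)                       # treatment replication vector r = W 1_b
s = W.sum(axis=0)                       # blocksize vector s = W' 1_a

Dr = np.diag(r)
Ds_inv = np.diag([1.0 / x if x > 0 else 0.0 for x in s])   # a generalized inverse of Delta_s
C = Dr - W @ Ds_inv @ W.T               # contrast information matrix (C-matrix)

print(np.allclose(C.sum(axis=0), 0.0), np.allclose(C.sum(axis=1), 0.0))  # zero row and column sums
print(np.all(np.linalg.eigvalsh(C) >= -1e-12))                           # nonnegative definite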
EXERCISES
3.1 For the compact, convex set
and the matrix
verify
Is C_K(M) compact, convex? 3.2 Let the matrix X = (X_1, X_2) and the vector θ = (θ_1', θ_2')' be partitioned so that Xθ = X_1θ_1 + X_2θ_2. Show that the information matrix for θ_2 is X_2.1'X_2.1, where X_2.1 = (I − X_1(X_1'X_1)⁻X_1')X_2. 3.3 In the general linear model E[Y] = Kθ and D[Y] = A, find the joint moments of Y and the zero mean statistic Z = RY, where R = I − KL for some generalized inverse L of K. What are the properties of the covariance adjusted estimator Y − AR'(RAR')⁻Z of E[Y]? [Rao (1967), p. 359]. 3.4 In the linear model Y ~ N_{Xθ, σ²I_n}, show that a 1 − α confidence ellipsoid for γ = K'θ is given by
where F_{s, n−k}(1 − α) is the 1 − α quantile of the F-distribution with numerator degrees of freedom s and denominator degrees of freedom n − k. 3.5 Let K ∈ ℝ^{k x s} be of full column rank s. For all A ∈ NND(k) and z ∈ ℝ^s, show that (i) K'A⁻K ⊆ (C_K(A))⁻, (ii) z ∈ range C_K(A) if and only if Kz ∈ range A [Heiligers (1991), p. 26].
3.6 Show that K(K'K)⁻²K' ∈ (KK')⁻ and K'(KK')⁻K = I_s, for all K ∈ ℝ^{k x s} with rank K = s. 3.7 Show that (K'A⁻K)⁻¹ ≤ (K'K)⁻¹K'AK(K'K)⁻¹ for all A ∈ A(K), with equality if and only if range AK ⊆ range K. 3.8 In Section 3.16, show that the path
is tangent to NND(2) at δ = 0. 3.9 Let K ∈ ℝ^{k x s} be of arbitrary rank. For all A, B ∈ NND(k), show that A_K ≥ B_K if and only if range A_K ⊇ range B_K and c'A⁻c ≤ c'B⁻c for all c ∈ range B_K. 3.10 For all A ∈ NND(k), show that (i) (A_K)_K = A_K, (ii) if range K = range H then A_K = A_H, (iii) if range A ⊆ range K then A_K = A. 3.11
Show that (xx')_K equals xx' or 0 according as x ∈ range K or not.
3.12 Show that A_c equals c(c'A⁻c)⁻¹c' or 0 according as c ∈ range A or not. 3.13
In the two-way classification model of Section 3.25, show that L_M = (I_a, −WΔ_s⁻) is a left identity of K = (K_a, 0)' that is minimizing for
3.14
(continued) Show that the Moore-Penrose inverse of K'M⁻K is Δ_r − WΔ_s⁻W'.
3.15
(continued) Show that Δ_r⁻ + Δ_r⁻W D⁻W'Δ_r⁻ ∈ (Δ_r − WΔ_s⁻W')⁻, where D = Δ_s − W'Δ_r⁻W is the information matrix for the centered block contrasts [John (1987), p. 39].
CHAPTER 4
Loewner Optimality
Optimality of information matrices relative to the Loewner ordering is discussed. This concept coincides with dispersion optimality, and with simultaneous optimality for all scalar subsystems. An equivalence theorem is presented. A nonexistence theorem shows that the set of competing moment matrices must not be too large for Loewner optimal designs to exist. As an illustration, product designs for a two-way classification model are shown to be Loewner optimal if the treatment replication vector is fixed. The proof of the equivalence theorem for Loewner optimality requires a refinement of the equivalence theorem for scalar optimality. This refinement is presented in the final sections of the chapter. 4.1. SETS OF COMPETING MOMENT MATRICES The general problem we consider is to find an optimal design ξ in a specified subclass of the set Ξ of all designs. The optimality criteria to be discussed in this book depend on the design ξ only through the moment matrix M(ξ). In terms of moment matrices, the subclass of designs leads to a subset of the set M(Ξ) of all moment matrices. Therefore we study a subset of moment matrices,
which we call the set of competing moment matrices. Throughout this book we make the grand assumption that there exists at least one competing moment matrix that is feasible for the parameter subsystem K'θ of interest,
Otherwise there is no design under which K'θ is identifiable, and our optimization problem would be statistically meaningless.
The subset M of competing moment matrices is often convex. Simple consequences of convexity are the following. 4.2. MOMENT MATRICES WITH MAXIMUM RANGE AND RANK Lemma. Let the set M of competing moment matrices be convex. Then there exists a matrix M ∈ M which has maximum range and rank, that is, for all A ∈ M we have
Moreover the information matrix mapping C_K permits regularization along straight lines in M,
Proof. We can choose a matrix M ∈ M with rank as large as possible, rank M = max{rank A : A ∈ M}. This matrix also has a range as large as possible. To show this, let A ∈ M be any competing moment matrix. Then the set M contains B = ½A + ½M, by convexity. From Lemma 2.3, we obtain range A ⊆ range B and range M ⊆ range B. The latter inclusion holds with equality, since the rank of M is assumed maximal. Thus range A ⊆ range B = range M, and the first assertion is proved. Moreover the set M contains the path (1 − α)A + αB running within M from A to B. Positive homogeneity of the information matrix mapping C_K, established in Theorem 3.13, permits us to extract the factor 1 − α. This gives
This expression has limit C_K(A) as α tends to zero, by matrix upper semicontinuity. Often there exist positive definite competing moment matrices, and the maximum rank in M then simply equals k.
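Both assertions of the lemma are easy to observe numerically. The following sketch, with toy matrices of our own, checks the range identity for a convex combination and the regularization along the segment from A towards B for a scalar subsystem.

import numpy as np

rng = np.random.default_rng(5)
k = 4
u, v = rng.standard_normal(k), rng.standard_normal(k)
A = np.outer(u, u)                      # a rank-one competing matrix
M = np.outer(u, u) + np.outer(v, v)     # a competing matrix of larger rank

B = 0.5 * A + 0.5 * M                   # a convex combination stays in a convex set
rank = lambda S: np.linalg.matrix_rank(S, 1e-9)
print(rank(B), rank(np.hstack([A, M]))) # range(B) = range(A) + range(M): both ranks equal 2

c = u / np.linalg.norm(u)               # a feasible scalar subsystem, c in range(A)
info = lambda S: 1.0 / float(c @ np.linalg.pinv(S) @ c)
for alpha in (0.5, 0.1, 0.01, 0.001):
    print(alpha, info((1 - alpha) * A + alpha * B))   # regularization along the segment
print(info(A))                          # the limit as alpha tends to zero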
4.3. MAXIMUM RANGE IN TWO-WAY CLASSIFICATION MODELS The maximum rank of moment matrices in a two-way classification model is now shown to be a + b − 1. As in Section 1.27, we identify designs ξ ∈ Ξ with
their a x b weight matrices W ∈ T. An a x b block design W has a degenerate moment matrix,
Hence the rank is at most equal to a + b − 1. This value is actually attained for the uniform design, assigning uniform mass 1/(ab) to every point (i, j) in the experimental domain T = {1,...,a} x {1,...,b}. The uniform design has weight matrix 1̄_a 1̄_b', with all entries equal to 1/(ab). We show that
occurs only if x = δ1_a and y = −δ1_b, for some δ ∈ ℝ. But
entails x = −ȳ 1_a, where ȳ denotes the average of the components of y, while
gives y = −x̄ 1_b. Premultiplication of x = −ȳ 1_a by 1̄_a' yields x̄ = −ȳ = δ, say. Therefore the uniform design has a moment matrix of maximum rank a + b − 1. An important class of designs for the two-way classification model are the product designs. By definition, their weights are the products of row sums r_i and column sums s_j, w_ij = r_i s_j. In matrix terms, this means that the weight matrix W is of the form
with row sum vector r and column sum vector s. In measure theory terms, the joint distribution W is the product of the marginal distributions r and s. An exact design for sample size n determines frequencies n_ij that sum to n. With marginal frequencies n_i· = Σ_j [...] ((1 − α)A + αB)_K. A passage to the limit as α tends to zero yields M_K ≥ A_K. Thus M is Loewner optimal in M. Part (c) enables us to deduce an equivalence theorem for Loewner optimality that is similar in nature to the Equivalence Theorem 2.16 for scalar optimality. There we concentrated on the set M(Ξ) of all moment matrices. We now use (and prove from Section 4.9 on) that Theorem 2.16 remains true with the set M(Ξ) replaced by a set M of competing matrices that is compact and convex.
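Returning briefly to the two-way classification example of Section 4.3 above, the maximum rank a + b − 1 of the uniform design and of product designs with positive marginals is easy to confirm numerically. The block form of the moment matrix follows Section 1.27; the concrete numbers are our own.

import numpy as np

a, b = 3, 4
W_uniform = np.full((a, b), 1.0 / (a * b))      # the uniform design on {1,...,a} x {1,...,b}

def moment_matrix(W):
    # moment matrix of an a x b block design, in the block form of Section 1.27
    r, s = W.sum(axis=1), W.sum(axis=0)
    return np.block([[np.diag(r), W], [W.T, np.diag(s)]])

print(np.linalg.matrix_rank(moment_matrix(W_uniform), 1e-10))   # a + b - 1 = 6

# A product design W = r s' with positive r and s attains the same maximum rank.
r = np.array([0.5, 0.3, 0.2])
s = np.array([0.4, 0.3, 0.2, 0.1])
print(np.linalg.matrix_rank(moment_matrix(np.outer(r, s)), 1e-10))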
4.6. GENERAL EQUIVALENCE THEOREM FOR LOEWNER OPTIMALITY Theorem. Assume the set M of competing moment matrices is compact and convex. Let the competing moment matrix M ∈ M have maximum rank. Then M is Loewner optimal for K'θ in M if and only if
Proof. As a preamble, we verify that the product AGK is invariant to the choice of generalized inverse G ∈ M⁻. In view of the maximum range assumption, the range of M includes the range of K, as well as the range of every other competing moment matrix A. Thus from K = MM⁻K and A = MM⁻A, we get AGK = AM⁻MGMM⁻K = AM⁻K for all G ∈ M⁻. It follows that K'M⁻AM⁻K is well defined, as is K'M⁻K. The converse provides a sufficient condition for optimality. We present an argument that does not involve the rank of K, utilizing the generalized information matrices A_K from Section 3.21. Let G be a generalized inverse of M and introduce Q = M_K G' = K(K'M⁻K)⁻K'G'. From Lemma 1.17, we obtain QK = K. This yields
The second inequality uses the inequality from the theorem. The direct part, necessity of the condition, invokes Theorem 2.16 with the subset M in place of the full set M(Ξ), postponing a rigorous proof of this result to Theorem 4.13. Given any vector c = Kz in the range of K, the matrix M is optimal for c'θ in M, by part (c) of Theorem 4.5. Because of Theorem 2.16, there exists a generalized inverse G ∈ M⁻, possibly dependent on the vector c and hence on z, such that
As verified in the preamble, the product AGK is invariant to the choice of the generalized inverse G. Hence the left hand side becomes z'K'M⁻AM⁻Kz. Since z is arbitrary the desired matrix inequality follows. The theorem allows for arbitrary subsets of all moment matrices, M ⊆ M(Ξ), as long as they are compact and convex. We repeat that the necessity
part of the proof invokes Theorem 2.16, which covers only the case M = M(Ξ). The present theorem, with M = M(Ξ), is vacuous. 4.7. NONEXISTENCE OF LOEWNER OPTIMAL DESIGNS Corollary. No moment matrix is Loewner optimal for K'θ in M(Ξ) unless the coefficient matrix K has rank 1. Proof. Assume that ξ is a design that is Loewner optimal for K'θ in Ξ, with moment matrix M and with support points x_1,...,x_ℓ. If ℓ = 1, then ξ is a one-point design on x_1 and range K ⊆ range x_1x_1' forces K to have rank 1. Otherwise ℓ ≥ 2. Applying Theorem 4.6 to A = x_i x_i', we find
Here equality must hold since a single strict inequality leads to the contradiction
because of ℓ ≥ 2. Now Lemma 1.17 yields the assertion by way of comparing ranks, rank K = rank K'M⁻K = rank K'M⁻x_i x_i'M⁻K = 1. The destructive nature of this corollary is deceiving. Firstly, and above all, an equivalence theorem gives necessary and sufficient conditions for a design or a moment matrix to be optimal, and this is genuinely distinct from an existence statement. Indeed, the statements "If a design is optimal then it must look like this" and "If it looks like this then it must be optimal" in no way assert that an optimal design exists. If existence fails to hold, then the statements are vacuous, but logically true. Secondly, the nonexistence result is based on the Equivalence Theorem 4.6. Equivalence theorems provide an indispensable tool to study the existence problem. The present corollary ought to be taken as a manifestation of the constructive contribution that an equivalence theorem adds to the theory. In Chapter 8, we deduce from the General Equivalence Theorem insights about the number of support points of an optimal design, their location, and their weights. Thirdly, the corollary stresses the role of the subset M of competing moment matrices, pointing out that the full set M(Ξ) is too large to permit Loewner optimal moment matrices. The opposite extreme occurs if the subset consists of a single moment matrix, {M_0}. Of course, M_0 is Loewner
optimal for K'θ in {M_0}. The relevance of Loewner optimality lies somewhere between these two extremes. Subsets M of competing moment matrices tend to be of interest for the reason that they show more structure than the full set M(Ξ). The special structure of M often permits a direct derivation of Loewner optimality, circumventing the Equivalence Theorem 4.6.
4.8. LOEWNER OPTIMALITY IN TWO-WAY CLASSIFICATION MODELS This section continues the discussion of Section 4.3 for the two-way classification model. We are interested in the centered contrasts of factor A,
By Section 3.25, the contrast information matrix of a block design W is Δ_r − WΔ_s⁻W'. If a moment matrix is feasible for the centered contrasts K_a α, then W must have a positive row sum vector. For if r_i vanishes, then the i th row and column of the contrast information matrix Δ_r − WΔ_s⁻W' are zero. Then its nullity is larger than 1, and its range cannot include the range of K_a. Any product design has weight matrix W = rs', and fulfills WΔ_s⁻W' = rs'Δ_s⁻sr' = rr'. This is so since Δ_s⁻s is a vector with the j th entry equal to 1 or 0 according as s_j is positive or vanishes, entailing s'Δ_s⁻s = Σ_{j: s_j>0} s_j = 1. Therefore all product designs with row sum vector r share the same contrast information matrix Δ_r − rr'. They are feasible for the treatment contrasts if and only if r is positive. Let r be a fixed row sum vector that is positive. We claim the following. Claim. The product designs rs' with arbitrary column sum vector s are the only Loewner optimal designs for the centered contrasts of factor A in the set T(r) of block designs with row sum vector equal to r, with contrast information matrix Δ_r − rr'. Proof. There is no loss of generality in assuming the column sum vector s to be positive. This secures a maximum rank for the moment matrix M of the product design rs'. A generalized inverse G for M and the matrix GK are given by
Now we take a competing moment matrix A, with top left block A_11. In Theorem 4.6, the left hand side of the inequality turns into
The right hand side equals K'M⁻K = K_a Δ_r⁻¹ K_a. Hence the two sides coincide if A_11 = Δ_r, that is, if A lies in the subset M(T(r)). Thus Theorem 4.6 proves that the product design rs' is Loewner optimal, Δ_r − rr' ≥ Δ_r − WΔ_s⁻W' for all W ∈ T(r). But then every weight matrix W ∈ T(r) that is optimal must satisfy WΔ_s⁻W' = rr', forcing W to have rank 1 and hence to be of the form W = rs'. Thus our claim is verified. Brief contemplation opens a more direct route to this result. For an arbitrary block design W ∈ T(r) with column sum vector s, we have WΔ_s⁻s = r as well as s'Δ_s⁻s = 1. Therefore, we obtain the inequality
Equality holds if and only if W = rs'. This verifies the assertion without any reference to Theorem 4.6. We emphasize that, in contrast to the majority of the block design literature, we here study designs for infinite sample size. For a given sample size n, designs such as W = rs' need not be realizable, in that the numbers nr_i s_j may well fail to be integers. Yet, as pointed out in Section 1.24, the general theory leads to general principles which are instructive. For example, if we can choose a design for sample size n that is a proportional frequency design, then this product structure guarantees a desirable optimality property. It is noteworthy that arbitrary column sum vectors s are admitted for the product designs rs'. Specifically, all observations can be taken in the first block, s = (1,0,...,0)', whence the corresponding moment matrix has rank a rather than maximum rank a + b − 1. Therefore Loewner optimality may hold true even if the maximum rank hypothesis of Theorem 4.6 fails to hold. The class of designs with moment matrices that are feasible for the centered contrasts decomposes into the cross sections T(r) of designs with positive row sum vector r. Within one cross section, the information matrix Δ_r − rr' is Loewner optimal. Between all cross sections, Loewner optimality is ruled out by Corollary 4.7. We return to this model in Section 8.19.
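The Loewner dominance within a cross section T(r) can be observed directly. A minimal sketch, with a fixed positive row sum vector of our own choosing and randomly generated competitors:

import numpy as np

rng = np.random.default_rng(6)
a, b = 3, 4
r = np.array([0.5, 0.3, 0.2])                     # fixed positive row sum vector

def contrast_info(W):
    s = W.sum(axis=0)
    Ds_inv = np.diag([1.0 / x if x > 0 else 0.0 for x in s])
    return np.diag(W.sum(axis=1)) - W @ Ds_inv @ W.T

C_opt = contrast_info(np.outer(r, np.full(b, 1.0 / b)))   # a product design r s'
print(np.allclose(C_opt, np.diag(r) - np.outer(r, r)))    # equals Delta_r - r r'

# Any competing design with the same row sums is Loewner-dominated:
for _ in range(3):
    U = rng.random((a, b))
    W = (U / U.sum(axis=1, keepdims=True)) * r[:, None]   # random design with row sums r
    gap = C_opt - contrast_info(W)
    print(np.all(np.linalg.eigvalsh(gap) >= -1e-10))      # True for every such W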
4.9. THE PENUMBRA OF THE SET OF COMPETING MOMENT MATRICES Before pushing on into the more general theory, we pause and generalize the Equivalence Theorem 2.16 for scalar optimality, as is needed by the General Equivalence Theorem 4.6 for Loewner optimality. This makes for a self-contained exposition of the present chapter. But our main motivation is to acquire a feeling for where the general development is headed. Indeed, the present Theorem 4.13 is but a special case of the General Equivalence Theorem 7.14. As in Section 2.13, we base our argument on the supporting hyperplane theorem. However, we leave the vector space ℝ^k which includes the regression range X, and move to the matrix space Sym(k) which includes the set of competing moment matrices M. The reason is that, unlike the full set M(Ξ), a general convex subset M cannot be generated as the convex hull of a set of rank 1 matrices. Even though our reference space changes from ℝ^k to Sym(k), our argument is geometric. Whereas before, points in the reference space were column vectors, they now turn into matrices. And whereas before, we referred to the Euclidean scalar product of vectors, we now utilize the Euclidean matrix scalar product (A,B) = trace AB on Sym(k). A novel feature which we exploit instantly is the Loewner ordering A ≤ B that is available in the matrix space Sym(k). In attacking the present problem along the same lines as the Elfving Theorem 2.14, we need a convex set in the space Sym(k) which takes the place of the Elfving set R of Section 2.9. This set of matrices is given by
We call P the penumbra of M since it may be interpreted as a union of shadow lines for a light source located in M + B, where B ∈ NND(k). The point M ∈ M then casts the shadow half line {M − δB : δ ≥ 0}, generating the shadow cone M − NND(k) as B varies over NND(k). This is a translation of the cone −NND(k) so that its tip comes to lie in M. The union over M ∈ M of these shadow cones is the penumbra P (see Exhibit 4.1). There is an important alternative way of expressing that a matrix A ∈ Sym(k) belongs to the penumbra P,
This is to say that P collects all matrices A that in the Loewner ordering lie below some moment matrix M ∈ M. An immediate consequence is the following.
EXHIBIT 4.1 Penumbra. The penumbra P originates from the set M ⊆ NND(k) and recedes in all directions of −NND(k). By definition of ρ²(c), the rank 1 matrix cc'/ρ²(c) is a boundary point of P, with supporting hyperplane determined by N ≥ 0. The picture shows the equivalent geometry in the plane ℝ².
4.10. GEOMETRY OF THE PENUMBRA Lemma. Let the set M of competing moment matrices be compact and convex. Then the penumbra P is a closed convex set in the space Sym(k). Proof. Given two matrices A = M − B and Ã = M̃ − B̃ with M, M̃ ∈ M and B, B̃ ≥ 0, convexity follows with α ∈ (0; 1) from
In order to show closedness, suppose (A_m)_{m≥1} is a sequence of matrices in P converging to A in Sym(k). For appropriate matrices M_m ∈ M we have A_m ≤ M_m. Because of compactness of the set M, the sequence (M_m)_{m≥1} has a cluster point M ∈ M, say. Thus A ≤ M and A lies in P. The proof is complete. Let c ∈ ℝ^k be a nonvanishing coefficient vector of a scalar subsystem c'θ. Recall the grand assumption of Section 4.1 that there exists at least one competing moment matrix that is feasible, M ∩ A(c) ≠ ∅. The generalization of the design problem of Section 2.7 is
In other words, we wish to determine a competing moment matrix M ∈ M that is feasible for c'θ, and that leads to a variance c'M⁻c of the estimate for c'θ which is optimal compared to the variance under all other feasible competing moment matrices. In the Elfving Theorem 2.14, we identified the optimal variance as the square of the Elfving norm, (ρ(c))². Now the corresponding quantity turns out to be the nonnegative number
For δ > 0, we have (δM) − NND(k) = δP. Hence a positive value ρ²(c) > 0 provides the scale factor needed to blow up or shrink the penumbra P so that the rank one matrix cc' comes to lie on its boundary. The scale factor ρ²(c) has the following properties. 4.11. EXISTENCE THEOREM FOR SCALAR OPTIMALITY
Theorem. Let the set M of competing moment matrices be compact and convex. There exists a competing moment matrix that is feasible for c'θ, M ∈ M ∩ A(c), such that
Every such matrix is optimal for c'θ in M, and the optimal variance is ρ²(c). Proof. For δ > 0, we have cc' ∈ (δM) − NND(k) if and only if cc' ≤ δM for some matrix M ∈ M. Hence there exists a sequence of scalars δ_m > 0 and a sequence of moment matrices M_m ∈ M such that cc' ≤ δ_m M_m and ρ²(c) = lim_{m→∞} δ_m. Because of compactness, the sequence (M_m)_{m≥1} has a cluster point M ∈ M, say. Hence we have cc' ≤ ρ²(c)M. This forces ρ²(c) to be positive. Otherwise cc' ≤ 0 implies c = 0, contrary to the assumption c ≠ 0 from Section 2.7. With ρ²(c) > 0, Lemma 2.3 yields c ∈ range cc' ⊆ range M. Hence M is feasible, M ∈ A(c). The inequality c'M⁻c ≤ ρ²(c) follows from c'M⁻(cc')M⁻c ≤ ρ²(c)c'M⁻MM⁻c = ρ²(c)c'M⁻c. The converse inequality, ρ²(c) = inf{δ > 0 : cc' ∈ δP} ≤ c'M⁻c, is a consequence of Theorem 1.20, since c(c'M⁻c)⁻¹c' ≤ M means cc' ≤ (c'M⁻c)M ∈ (c'M⁻c)P. Finally we take any other competing moment matrix that is feasible, A ∈ M ∩ A(c). It satisfies cc' ≤ (c'A⁻c)A ∈ (c'A⁻c)P, and therefore ρ²(c) ≤ c'A⁻c. Hence ρ²(c) = c'M⁻c is the optimal variance, and M is optimal for c'θ in M. Geometrically, the theorem tells us that the penumbra P includes the segment {αcc' : α ≤ 1/ρ²(c)} of the line {αcc' : α ∈ ℝ}. Moreover
the matrix cc'/ρ²(c) lies on the boundary of the penumbra P. Whereas in Section 2.13 we used a hyperplane in ℝ^k supporting the Elfving set R at its boundary point c/ρ(c), we now take a hyperplane in Sym(k) supporting the penumbra P at cc'/ρ²(c). Since the set P has been defined so as to recede in all directions of −NND(k),
the supporting hyperplane permits a matrix N normal to it that is nonnegative definite. 4.12. SUPPORTING HYPERPLANES TO THE PENUMBRA Lemma. Let the set M of competing moment matrices be compact and convex. There exists a nonnegative definite k x k matrix N such that
Proof. In the first part of the proof, we make the preliminary assumption that the set M of competing moment matrices intersects the open cone PD(k). Our arguments parallel those of Section 2.13. Namely, the matrix cc'/ρ²(c) lies on the boundary of the penumbra P. Thus there exists a supporting hyperplane to the set P at the point cc'/ρ²(c), that is, there exist a nonvanishing matrix Ñ ∈ Sym(k) and a real number γ such that
The penumbra P includes −NND(k), the negative of the cone NND(k). Hence if A is a nonnegative definite matrix, then −δA lies in P for all δ > 0, giving
This entails trace AÑ ≥ 0 for all A ≥ 0. By Lemma 1.8, the matrix Ñ is nonnegative definite. Owing to the preliminary assumption, there exists a positive definite moment matrix B ∈ M. With 0 ≠ Ñ ≥ 0, we get 0 < trace BÑ ≤ γ. Dividing by γ > 0 and setting N = Ñ/γ ≠ 0, we obtain a matrix N with the desired properties. In the second part of the proof, we treat the general case that the set M intersects just the feasibility cone A(c), and not necessarily the open cone PD(k). By Lemma 4.2, we can find a moment matrix M ∈ M with maximum range, range A ⊆ range M for all A ∈ M. Let r be the rank of M and choose
an orthonormal basis u_1,...,u_r ∈ ℝ^k for its range. Then the k x r matrix U = (u_1,...,u_r) satisfies range M = range U and U'U = I_r. Thus the matrix UU' projects onto the range of M. From Lemma 1.17, we get UU'c = c and UU'A = A for all A ∈ M. Now we return to the discussion of the scale factor ρ²(c). The preceding properties imply that for all δ > 0 and all matrices M ∈ M, we have
Therefore the scale factor ρ²(c) permits the alternative representation
The set U'MU = {U'AU : A ∈ M} of reduced moment matrices contains the positive definite matrix U'MU. With c̃ = U'c, the first part of the proof supplies a nonnegative definite r x r matrix Ñ such that trace ÃÑ ≤ 1 = c̃'Ñc̃/ρ²(c) for all Ã ∈ U'MU. Since Ã is of the form U'AU, we have trace ÃÑ = trace AUÑU'. On the other side, we get c̃'Ñc̃ = c'UÑU'c. Therefore the k x k matrix N = UÑU' satisfies the assertion, and the proof is complete. The supporting hyperplane inequality in the lemma is phrased with reference to the set M, although the proof extends its validity to the penumbra P. But the extended validity is beside the point. The point is that P is instrumental in securing nonnegative definiteness of the matrix N. This done, we dismiss the penumbra P. Nonnegative definiteness of N becomes essential in the proof of the following general result. 4.13. GENERAL EQUIVALENCE THEOREM FOR SCALAR OPTIMALITY
Theorem. Assume the set M of competing moment matrices is compact and convex, and intersects the feasibility cone A(c). Then a competing moment matrix M ∈ M is optimal for c'θ in M if and only if M lies in the feasibility cone A(c) and there exists a generalized inverse G of M such that
Proof. The converse is established as in the proof of Theorem 2.16. For the direct part we choose a nonvanishing matrix N ≥ 0 from Lemma 4.12 and define the vector h = Nc/ρ(c), with ρ(c) = √(ρ²(c)). From Lemma 4.12,
we have ρ²(c) = c'Nc, and c'h = ρ(c). We claim that the vector h satisfies the three conditions
By definition of h, we get h'Ah = c'NANc/ρ²(c) = trace ANc(c'Nc)⁻¹c'N. For nonnegative definite matrices V, Theorem 1.20 provides the general inequality X(X'V⁻X)⁻X' ≤ V, provided range X ⊆ range V. Since N is nonnegative definite, the inequality applies with V = N and X = Nc, giving Nc(c'NN⁻Nc)⁻c'N ≤ N. This and the supporting hyperplane inequality yield
Thus condition (0) is established. For the optimal moment matrix M, the variance c'M⁻c coincides with the optimal value ρ²(c) from Theorem 4.11. Together with ρ²(c) = c'Nc, we obtain
The converse inequality, h'Mh ≤ 1, appears in condition (0). This proves condition (1). Now we choose a square root decomposition M = KK', and introduce the vector a = K'h − K'M⁻c(c'M⁻c)⁻¹c'h (compare the proof of Theorem 2.11). The squared norm of a vanishes,
by condition (1), and because of c'M⁻c = ρ²(c) = (c'h)². This establishes condition (2). The remainder of the proof, construction of a suitable generalized inverse G of M, is duplicated from the proof of Theorem 2.16. The theorem may also be derived as a corollary to the General Equivalence Theorem 7.14. Nevertheless, the present derivation points to a further way of proceeding. Sets M of competing moment matrices that are genuine subsets of the set M(Ξ) of all moment matrices force us to deal properly with matrices. They preclude shortcut arguments based on regression vectors that provide such an elegant approach to the Elfving Theorem 2.14.
Hence our development revolves around moment matrices and information matrices, and matrix problems in matrix space. It is then by means of information functions, to be introduced in the next chapter, that the matrices of interest are mapped into the real line. EXERCISES 4.1 What is wrong with the following argument: 4.2 In the two-way classification model, use the iteration formula to calculate the generalized information matrix for K_a α from the generalized information matrix
[Krafft (1990), p. 461]. 4.3 (continued) Is there a Loewner optimal design for K_a α in the class {W ∈ T : W'1_a = s} of designs with given blocksize vector s?
for sample size n = a + b − 1. Show that the information matrix of W/n for the centered treatment contrasts is K_a/n. Find the φ-efficiency relative to an equireplicated product design 1̄_a s'.
CHAPTER 5
Real Optimality Criteria
On the closed cone of nonnegative definite matrices, real optimality criteria are introduced as functions with such properties as are appropriate to measure largeness of information matrices. It is argued that they are positively homogeneous, superadditive, nonnegative, nonconstant, and upper semicontinuous. Such criteria are called information functions. Information functions conform with the Loewner ordering, in that they are matrix isotonic and matrix concave. The concept of polar information functions is discussed in detail, providing the basis for the subsequent duality investigations. Finally, a result is proved listing three sufficient conditions for the general design problem so that all optimal moment matrices lie in the feasibility cone for the parameter system of interest.
5.1. POSITIVE HOMOGENEITY An optimality criterion is a function φ from the closed cone of nonnegative definite s x s matrices into the real line,
with properties that capture the idea of whether an information matrix is large or small. A transformation from a high-dimensional matrix cone to the one-dimensional real line can retain only partial aspects, and the question is which ones. No criterion suits in all respects and pleases every mind. An essential aspect of an optimality criterion is the ordering that it induces among information matrices. Relative to the criterion φ, an information matrix C is at least as good as another information matrix D when φ(C) ≥ φ(D). With our understanding of information matrices it is essential that a reasonable criterion be isotonic relative to the Loewner ordering,
compare Chapter 4. We use the same sign ≥ to indicate the Loewner ordering among symmetric matrices and the usual ordering of the real line. A second property, similarly compelling, is concavity,
In other words, information cannot be increased by interpolation; otherwise the situation φ((1 − α)C + αD) < (1 − α)φ(C) + αφ(D) will occur. Rather than carrying out the experiment belonging to (1 − α)C + αD, we achieve more information through interpolation of the two experiments associated with C and D. This is absurd. A third property is positive homogeneity,
A criterion that is positively homogeneous satisfies φ((n/σ²)C) = (n/σ²)φ(C) [...] 5.2. SUPERADDITIVITY AND CONCAVITY Lemma. For a positively homogeneous function φ : NND(s) → ℝ, the following two statements are equivalent: a. (Superadditivity) φ is superadditive. b. (Concavity) φ is concave. Proof. In both directions, we make use of positive homogeneity. Assuming (a), we get
Thus superadditivity implies concavity. Conversely, (b) entails superadditivity,
An analysis of the proof shows that strict superadditivity is the same as strict concavity, in the following sense. The strict versions of these properties cannot hold if D is positively proportional to C, denoted by D ∝ C. For then we have D = δC with δ > 0, and positive homogeneity yields
Furthermore, we apply the strict versions only in cases where at least one term, C or D, is positive definite. Hence we call a function φ strictly superadditive on PD(s) when

φ(C + D) > φ(C) + φ(D)  for all C ∈ PD(s) and D ∈ NND(s) with D ̸∝ C.
A function φ is called strictly concave on PD(s) when the concavity inequality is strict for all C ∈ PD(s), D ∈ NND(s) with D ̸∝ C, and α ∈ (0;1).

Nonnegativity, monotonicity, and positivity are tied together by the next lemma (Lemma 5.4).

Lemma. For every positively homogeneous and superadditive function φ : NND(s) → ℝ, the following three statements are equivalent:

a. (Nonnegativity) φ is nonnegative.
b. (Monotonicity) φ is isotonic relative to the Loewner ordering.
c. (Positivity) φ is nonnegative and, unless it vanishes identically, positive on PD(s).

Proof. Assume (a). For C ≥ D ≥ 0, superadditivity yields φ(C) ≥ φ(D) + φ(C − D) ≥ φ(D); thus φ is isotonic and (b) holds. Next we show that (b) implies (c). Here C ≥ 0 forces φ(C) ≥ φ(0) = 0. Suppose φ does not vanish identically, so that φ(D) > 0 for some D ∈ NND(s). We need to show that φ(C) is positive for all matrices C ∈ PD(s). Since by Lemma 1.9, the open cone PD(s) is the interior of NND(s), there exists some ε > 0 such that C − εD ∈ NND(s), that is, C ≥ εD. Monotonicity and homogeneity entail φ(C) ≥ εφ(D) > 0. This proves (c). The properties called for by (c) obviously encompass (a). □

Of course, we do not want to deal with the constant function φ = 0. All other functions that are positively homogeneous, superadditive, and nonnegative are then positive on the open cone PD(s), by part (c). Often, although not always, we take φ to be standardized, φ(I_s) = 1. Every homogeneous function φ ≠ 0 can be standardized according to (1/φ(I_s))φ.

The simplest example is the trace criterion, φ(C) = trace C. This function is linear even on the full space Sym(s); hence it is positively homogeneous and superadditive. Restricted to NND(s), it is also strictly isotonic and positive. It fails to be strictly superadditive. Standardization requires a transition to trace C/s. Being linear, this criterion is also continuous on Sym(s).

5.6. REAL UPPER SEMICONTINUITY

In general, our optimality criteria
fail to be continuous on the closed cone NND(s); what we require instead is upper semicontinuity, meaning that the level sets

{φ ≥ α} = {C ≥ 0 : φ(C) ≥ α}

are closed, for all α ∈ ℝ. This definition conforms with the one for the matrix-valued information matrix mapping C_K, in part (a) of Theorem 3.13. There we met an alternative sequential criterion (b) which here takes the form lim_{m→∞} φ(C_m) = φ(C), for all sequences (C_m)_{m≥1} in NND(s) that converge to a limit C and that satisfy φ(C_m) ≥ φ(C) for all m ≥ 1. This secures a "regular" behavior at the boundary, in the same sense as in part (c) of Theorem 3.13.
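The defining properties lend themselves to a quick numerical check. As an illustration, the following minimal sketch (in Python, using NumPy; all helper names are illustrative, not from the text) samples random nonnegative definite matrices and verifies positive homogeneity, superadditivity, concavity, and isotonicity for the smallest-eigenvalue criterion φ(C) = λ_min(C), which Chapter 6 identifies as the matrix mean φ_{−∞}.

```python
# A minimal numerical sketch: spot-checking the defining properties of a candidate
# criterion on randomly generated matrices. Here phi is the smallest-eigenvalue
# criterion, phi(C) = lambda_min(C).
import numpy as np

def phi(C):
    """Smallest-eigenvalue criterion on NND(s)."""
    return np.linalg.eigvalsh(C).min()

def random_nnd(s, rng):
    """Draw a random nonnegative definite s x s matrix."""
    A = rng.standard_normal((s, s))
    return A @ A.T

rng = np.random.default_rng(0)
s = 4
for _ in range(1000):
    C, D = random_nnd(s, rng), random_nnd(s, rng)
    delta, alpha = rng.uniform(0.1, 10), rng.uniform(0, 1)
    # positive homogeneity: phi(delta*C) = delta*phi(C)
    assert np.isclose(phi(delta * C), delta * phi(C))
    # superadditivity: phi(C + D) >= phi(C) + phi(D)
    assert phi(C + D) >= phi(C) + phi(D) - 1e-10
    # concavity (equivalent to superadditivity under homogeneity)
    assert phi((1 - alpha) * C + alpha * D) >= (1 - alpha) * phi(C) + alpha * phi(D) - 1e-10
    # isotonicity: C + D >= C in the Loewner ordering, so phi(C + D) >= phi(C)
    assert phi(C + D) >= phi(C) - 1e-10
print("all sampled property checks passed")
```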
5.7. SEMICONTINUITY AND REGULARIZATION

Lemma. For every isotonic function φ : NND(s) → ℝ, the following three statements are equivalent:

a. (Upper semicontinuity) The level sets

{φ ≥ α} = {C ∈ NND(s) : φ(C) ≥ α}
are closed, for all α ∈ ℝ.

b. (Sequential semicontinuity criterion) For all sequences (C_m)_{m≥1} in NND(s) that converge to a limit C we have

lim sup_{m→∞} φ(C_m) ≤ φ(C).
c. (Regularization) For all C, D ∈ NND(s), we have

φ(C) = lim_{δ↘0} φ(C + δD).
Proof. This is a special case of the proof of Theorem 3.13, with s = 1 and C_K = φ. □

5.8. INFORMATION FUNCTIONS

Criteria that enjoy all the properties discussed so far are called information functions.

DEFINITION. An information function on NND(s) is a function φ : NND(s) → ℝ that is positively homogeneous, superadditive, nonnegative, nonconstant, and upper semicontinuous.

The most prominent information functions are the matrix means φ_p, for p ∈ [−∞; 1], to be discussed in detail in Chapter 6. They comprise the classical D-, A-, E-, and T-criteria as special cases. The list of defining properties of information functions can be rearranged, in view of the preceding lemmas and in view of the general discussion in Section 5.1, by requiring that an information function be isotonic, concave, and positively homogeneous, as well as enjoying the trivial property of being nonconstant and the more technical property of being upper semicontinuous. It is our experience that the properties as listed in the definition are more convenient to work with. Information functions enjoy many pleasant properties to which we will turn in the sequel. We characterize an information function by its unit level
set, thus visualizing it geometrically (Section 5.10). We establish that the set of all information functions is closed under appropriate functional operations, providing some kind of reassurance that we have picked a reasonable class of criteria (Section 5.11). We introduce polar information functions (Section 5.12), which provide the basis for the duality discussion of the optimal design problem. We study the composition of an information function with the information matrix mapping (Section 5.14), serving as the objective function for the optimal design problem (Section 5.15).

5.9. UNIT LEVEL SETS

There exists a bewildering multitude of information functions. We depict this multitude more visibly by associating with each information function φ its unit level set

C = {φ ≥ 1} = {C ∈ NND(s) : φ(C) ≥ 1}.
The following Theorem 5.10 singles out the characteristic properties of such sets. In general, we say that a closed convex subset C ⊆ NND(s) is bounded away from the origin when it does not contain the null matrix, and that it recedes in all directions of NND(s) when

C + NND(s) ⊆ C.
The latter property means that C + NND(s) ⊆ C for all C ∈ C, that is, if the cone NND(s) is translated so that its tip comes to lie in C ∈ C then all of the translate is included in the set C. Exhibit 5.1 shows some unit level sets C = {Φ_p ≥ 1} in ℝ²_+. Given a unit level set C ⊆ NND(s), we reconstruct the corresponding information function φ as follows. The reconstruction formula for positive definite matrices is

(1)  φ(C) = sup{δ > 0 : C ∈ δC}  for all C ∈ PD(s).
Thus φ(C) is the scale factor that pulls the set C towards the null matrix or pushes it away to infinity so that C comes to lie on its boundary. However, for rank deficient matrices C, it may happen that none of the sets δC with δ > 0 contains the matrix C, whence the supremum in (1) is over the empty set and falls down to −∞. We avoid this pitfall by the general definition

(2)  φ(C) = sup{δ ≥ 0 : C ∈ (δC) + NND(s)}  for all C ∈ NND(s).
For δ > 0, nothing has changed as compared to (1) since then (δC) + NND(s) = δ(C + (1/δ)NND(s)) = δC. But for δ = 0, condition (2) turns into C ∈ (0C) + NND(s) = NND(s), and holds for every matrix C ≥ 0. Hence the supremum in (2) is always nonnegative.

EXHIBIT 5.1 Unit level sets. Unit contour lines of the vector means Φ_p of Section 6.6 in ℝ²_+, for p = −∞, −1, 0, 1/2, 1. The corresponding unit level sets are receding in all directions of ℝ²_+, as indicated by the dashed lines.

The proof of the correspondence between information functions and unit level sets is tedious though not difficult. We ran into similar considerations while discussing the Elfving norm ρ(c) in Section 2.12, and the scale factor ρ²(c) in Section 4.10. The passage from functions to sets and back to functions must appear at some point in the development, either implicitly or explicitly. We prefer to make it explicit, now.
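As an illustration of the reconstruction formula (2), the following minimal Python sketch recovers φ(C) from a unit level set supplied only as a membership oracle; the oracle used here is the unit level set of the determinant criterion, {C ∈ NND(s) : (det C)^(1/s) ≥ 1}, and all helper names are my own illustrative choices.

```python
# Sketch of formula (2): phi(C) = sup{ delta >= 0 : C in delta*calC + NND(s) },
# recovered by bisection from a membership oracle for the unit level set calC.
import numpy as np

def in_unit_set(C):
    """Membership oracle for calC = {C in NND(s): det(C)**(1/s) >= 1}."""
    s = C.shape[0]
    eig = np.linalg.eigvalsh(C)
    return eig.min() >= 0 and np.prod(eig) ** (1.0 / s) >= 1.0

def shifted_membership(C, delta):
    """Check C in (delta*calC) + NND(s). Because calC recedes in all directions,
    this is equivalent to C/delta itself lying in calC (for delta > 0)."""
    return in_unit_set(C / delta)

def phi_from_level_set(C, hi=1e6, tol=1e-10):
    """Recover phi(C) as the supremum of admissible scale factors, by bisection."""
    lo = 0.0
    if not shifted_membership(C, tol):   # no positive delta is admissible
        return 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if shifted_membership(C, mid):
            lo = mid                      # mid is still admissible
        else:
            hi = mid
    return lo

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
C = A @ A.T
s = C.shape[0]
print(phi_from_level_set(C), np.linalg.det(C) ** (1.0 / s))  # should agree closely
```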
5.10. FUNCTION-SET CORRESPONDENCE

Theorem. The relations

C = {φ ≥ 1}   and   φ(C) = sup{δ ≥ 0 : C ∈ (δC) + NND(s)}  for all C ∈ NND(s)
define a one-to-one correspondence between the information functions φ on NND(s) and the nonempty closed convex subsets C of NND(s) that are bounded away from the origin and recede in all directions of NND(s).

Proof. We denote by Φ the set of all information functions, and by Γ the collection of subsets C of NND(s) which are nonempty, closed, convex, bounded away from the origin, and recede in all directions of NND(s). The defining properties of information functions φ ∈ Φ parallel the properties enjoyed by the sets C ∈ Γ:

An information function φ ∈ Φ is      A unit level set C ∈ Γ is
o.   finite                           0.   a subset of NND(s)
i.   positively homogeneous           I.   bounded away from the origin
ii.  superadditive                    II.  convex
iii. nonnegative                      III. receding in all directions of NND(s)
iv.  nonconstant                      IV.  nonempty
v.   upper semicontinuous             V.   closed
In the first part of the proof, we assume that an information function φ is given and take the set C to be defined through the first equation in the theorem. We establish for C the properties 0–V, and verify the second equality in the theorem. This part demonstrates that the mapping φ ↦ {φ ≥ 1} on Φ has its range included in Γ, and is injective.

0. Evidently C is a subset of NND(s).

I. Because of homogeneity φ(0) is zero, and the null matrix is not a member of C.

II. Concavity of φ entails convexity of C.

III. For C ∈ C and A ∈ NND(s), superadditivity and nonnegativity give φ(C + A) ≥ φ(C) ≥ 1; hence C recedes in all directions of NND(s).

IV. Since φ is nonconstant, Lemma 5.4 shows that φ(I_s) > 0; hence I_s/φ(I_s) lies in C and C is nonempty.

V. Closedness of C is an immediate consequence of the upper semicontinuity of φ.
In order to verify the second equality in the theorem we fix C ∈ NND(s) and set α = sup{δ ≥ 0 : C ∈ (δC) + NND(s)}. It is not hard to show that φ(C) = 0 if and only if α = 0. Otherwise φ(C) is positive, as is α. From

C = φ(C)(C/φ(C)) ∈ (φ(C)C) + NND(s),
we learn that φ(C) ≤ α. Conversely we know for δ < α that (1/δ)C ∈ C. But we have just seen that C is closed. Letting δ tend to α, we get (1/α)C ∈ C, and φ((1/α)C) ≥ 1. This yields φ(C) ≥ α. In summary, φ(C) = α and the first part of the proof is complete.

In the second part of the proof we assume that a set C with the properties 0–V is given, and take the function φ to be defined through the second equation in the theorem. We establish for φ the properties o–v, thereby demonstrating that the mapping φ ↦ {φ ≥ 1} from Φ to Γ is surjective.

0. We need to show that φ is finite. If the supremum defining φ(C) were infinite, then C/δ would lie in C for arbitrarily large δ; closedness of C then yields 0 = lim_{δ→∞} C/δ ∈ C, contrary to the assumption that C is bounded away from the origin.

i. Next comes positive homogeneity. For C ≥ 0 and δ > 0, we obtain

φ(δC) = sup{γ ≥ 0 : δC ∈ (γC) + NND(s)} = δ sup{γ/δ ≥ 0 : C ∈ ((γ/δ)C) + NND(s)} = δφ(C).
iii. Since δ = 0 is in the set over which the defining supremum is formed, we get φ(C) ≥ 0.

ii. In order to establish superadditivity of φ, we distinguish three cases. In case φ(C) = 0 = φ(D), nonnegativity gives φ(C + D) ≥ 0 = φ(C) + φ(D). In case φ(C) > 0 = φ(D), say, we choose any δ > 0 such that C ∈ (δC) + NND(s). We obtain

C + D ∈ (δC) + NND(s),
and φ(C + D) ≥ δ. A passage to the supremum gives φ(C + D) ≥ φ(C) = φ(C) + φ(D). In case φ(C) > 0 and φ(D) > 0, we choose any γ, δ > 0 such that C ∈ γC and D ∈ δC. Then convexity of C yields

(1/(γ + δ))(C + D) = (γ/(γ + δ))(C/γ) + (δ/(γ + δ))(D/δ) ∈ C.
Thus we have

C + D ∈ (γ + δ)C,
and therefore

φ(C + D) ≥ γ + δ.

A passage to the suprema over γ and δ gives φ(C + D) ≥ φ(C) + φ(D).
iv. Since C is nonempty there exists a matrix C ∈ C. For any such matrix we have φ(C) ≥ 1. By Lemma 5.4, φ is isotonic; in particular it is nonconstant.

v. It remains to establish upper semicontinuity, and the first equation of the theorem. Fix C ∈ NND(s) with φ(C) > 0, and choose scalars δ_m > 0 converging to φ(C) such that C ∈ (δ_mC) + NND(s) = δ_mC for all m ≥ 1. The matrices D_m = C/δ_m are members of C. This sequence is bounded, because of ||D_m|| = ||C||/δ_m → ||C||/φ(C). Along a convergent subsequence, we obtain C/φ(C) ∈ C by closedness. This yields

C = φ(C)(C/φ(C)) ∈ φ(C)C,
whence {φ ≥ 1} = C, and the level sets of φ are closed; thus φ is upper semicontinuous and the proof is complete. □

Level sets of this type are intersections of closed sets, and hence closed. This applies to the least upper bound of a finite family
of information functions φ_1, ..., φ_n,

(⋁_{i≤n} φ_i)(C) = inf{ψ(C) : ψ ∈ Φ, ψ ≥ φ_i for all i ≤ n},

where Φ denotes the class of all information functions. The set over which the infimum is sought is nonempty, containing for instance the sum Σ_{i≤n} φ_i.

The polar function φ^∞ of an information function φ is defined by

φ^∞(D) = inf_{C>0} ⟨C, D⟩/φ(C)  for all D ∈ NND(s),

and satisfies the Hölder inequality ⟨C, D⟩ ≥ φ(C)φ^∞(D). For C > 0 this is immediate from the definition; for general C ≥ 0, it follows through regularization provided φ is isotonic, by Lemma 5.7. The unit level set of the polar function is

(1)  C^∞ = {φ^∞ ≥ 1} = {D ∈ NND(s) : ⟨C, D⟩ ≥ 1 for all C ∈ C}.

For C ∈ C, the Hölder inequality then yields ⟨C, D⟩ ≥ φ(C)φ^∞(D) ≥ 1. The converse inclusion holds since every matrix D ≥ 0 satisfying ⟨C, D⟩ ≥ 1 for all C ∈ C fulfills φ^∞(D) = inf_{C>0} ⟨C/φ(C), D⟩ ≥ inf_{C>0: φ(C)≥1} ⟨C, D⟩ ≥ 1. Thus (1) is established. Applying formula (1) to φ^∞ and C^∞, we get

(2)  C^∞∞ = {φ^∞∞ ≥ 1} = {E ∈ NND(s) : ⟨E, D⟩ ≥ 1 for all D ∈ C^∞}.
Formulae (1) and (2) entail the direct inclusion C ⊆ C^∞∞. The converse inclusion is proved by showing that the complement of C is included in the complement of C^∞∞, based on a separating hyperplane argument. We choose a matrix E ∈ NND(s) \ C. Then there exists a matrix D̃ ∈ Sym(s) such that the linear form ⟨·, D̃⟩ strongly separates the matrix E and the set C. That is, for some γ ∈ ℝ, we have

⟨E, D̃⟩ < γ < ⟨C, D̃⟩  for all C ∈ C.
Since the set C recedes in all directions of NND(s), the inequality shows in particular that for all matrices C ∈ C and A ≥ 0 we have ⟨E, D̃⟩ < ⟨C + δA, D̃⟩ for all δ > 0. This forces ⟨A, D̃⟩ ≥ 0 for all A ≥ 0, whence Lemma 1.8 necessitates D̃ ≥ 0. By the same token, 0 ≤ ⟨E, D̃⟩ < γ. Upon setting D = D̃/γ, the strong separation of E and C takes the form

⟨E, D⟩ < 1 ≤ ⟨C, D⟩  for all C ∈ C.
Now (1) gives D ∈ C^∞, whence (2) yields E ∉ C^∞∞, and the proof is complete. □
Therefore

C^∞∞ = C,  and hence  φ^∞∞ = φ,  for every information function φ on NND(s).
Thus a function φ on NND(s) that is positively homogeneous, positive on PD(s), and upper semicontinuous is an information function if and only if it coincides with its second polar. This suggests a method of checking whether a given candidate function φ is an information function, namely, by verifying φ = φ^∞∞. The method is called quasi-linearization, since it amounts to representing φ as an infimum of linear forms,

φ(C) = inf_{D>0} ⟨C, D⟩/φ^∞(D)  for all C ≥ 0.
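As a concrete illustration of polar functions and quasi-linearization, the following minimal sketch (in Python; helper names are illustrative) takes the smallest-eigenvalue criterion φ = λ_min. For this particular criterion the polar works out to the trace, φ^∞(D) = trace D, a consequence of C ≥ λ_min(C)I_s; the code checks the Hölder inequality ⟨C, D⟩ ≥ φ(C)φ^∞(D) on random matrices and exhibits the quasi-linearization infimum numerically.

```python
# Sketch: polar pair (lambda_min, trace) and the quasi-linearization infimum.
import numpy as np

def phi(C):                      # smallest-eigenvalue criterion
    return np.linalg.eigvalsh(C).min()

def phi_polar(D):                # its polar, trace(D), stated here for illustration
    return np.trace(D)

rng = np.random.default_rng(2)
s = 4
for _ in range(500):
    A, B = rng.standard_normal((s, s)), rng.standard_normal((s, s))
    C, D = A @ A.T, B @ B.T
    # Holder inequality <C, D> >= phi(C) * phi_polar(D)
    assert np.trace(C @ D) >= phi(C) * phi_polar(D) - 1e-9

# quasi-linearization: inf_{D>0} <C,D>/trace(D) is approached by D close to the
# rank-one projector onto an eigenvector of the smallest eigenvalue of C
A = rng.standard_normal((s, s)); C = A @ A.T
w, V = np.linalg.eigh(C)
v = V[:, 0]                                   # eigenvector of lambda_min(C)
for eps in (1e-1, 1e-3, 1e-6):
    D = np.outer(v, v) + eps * np.eye(s)      # positive definite approximation
    print(eps, np.trace(C @ D) / np.trace(D), phi(C))
```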
A first use of the polar function of the composition φ ∘ C_K is made in Lemma 5.16 to discuss whether optimal moment matrices are necessarily feasible. First we define the design problem in its full generality.

5.15. THE GENERAL DESIGN PROBLEM

Let K'θ be a parameter subsystem with a coefficient matrix K of full column rank s. We recall the grand assumption of Section 4.1 that M is a set of competing moment matrices that intersects the feasibility cone A(K). Given an information function φ on NND(s), the general design problem then reads

maximize φ(C_K(M))  subject to M ∈ M.
This calls for maximizing information as measured by the information function φ, in the set M of competing moment matrices. The optimal value of this problem is, by definition,

v(φ) = sup_{M∈M} φ(C_K(M)).
A moment matrix M ∈ M is said to be formally φ-optimal for K'θ in M when φ(C_K(M)) attains the optimal value v(φ). The qualification "formally" reminds us that such a matrix need not lie in the feasibility cone A(K); Lemma 5.16 provides sufficient conditions under which it does.

In the scalar case s = 1, with coefficient vector c ∈ ℝ^k, the information matrix mapping reduces to the information number C_c(A) = (c'A⁻c)⁻¹ for A in the feasibility cone A(c), and zero outside. The Loewner ordering among information numbers reverses the ordering among the variances c'A⁻c. Hence Loewner optimality for c'θ is equivalent to the variance optimality criterion of Section 2.7.
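To make the design problem concrete, the following small sketch (an illustrative construction; the straight-line regression setup and all names are chosen only as an example) maximizes the information number C_c(M(ξ)) = (c'M(ξ)⁻c)⁻¹ of the slope in the model f(x) = (1, x)', over designs supported on the three points −1, 0, 1, by a crude grid search over the weight simplex.

```python
# Sketch of the general design problem for a scalar subsystem c'theta:
# maximize the information number of the slope in straight-line regression.
import itertools
import numpy as np

support = np.array([-1.0, 0.0, 1.0])
c = np.array([0.0, 1.0])                      # coefficient vector of the slope

def moment_matrix(weights):
    """M(xi) = sum_i w_i f(x_i) f(x_i)' for f(x) = (1, x)'."""
    M = np.zeros((2, 2))
    for w, x in zip(weights, support):
        f = np.array([1.0, x])
        M += w * np.outer(f, f)
    return M

def info_number(M, c):
    """C_c(M) = (c' M^- c)^{-1} if c lies in the range of M (feasibility), else 0."""
    Minv = np.linalg.pinv(M)
    if not np.allclose(M @ Minv @ c, c):      # c not estimable under M
        return 0.0
    return 1.0 / (c @ Minv @ c)

best_w, best_val = None, -np.inf
grid = np.linspace(0, 1, 101)
for w1, w2 in itertools.product(grid, grid):
    if w1 + w2 <= 1:
        w = (w1, w2, 1 - w1 - w2)
        val = info_number(moment_matrix(w), c)
        if val > best_val:
            best_w, best_val = w, val
print(best_w, best_val)   # weight 1/2 on each of -1 and +1, optimal value 1
```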
The concept of information functions becomes trivial, for scalar optimality. They are functions φ on the half line [0; ∞) of the form φ(γ) = γφ(1) with φ(1) > 0, by homogeneity. Thus all they achieve is to contribute a scaling by the constant φ(1). The composition φ ∘ C_c orders any pair of moment matrices in the same way as does C_c alone. Therefore C_c is the only information function on the cone NND(k) that is of interest. It is standardized if and only if c has norm 1. The polar function of C_c is (C_c)^∞(B) = c'Bc. This follows from Theorem 5.14, since the identity mapping φ(γ) = γ has polar φ^∞(δ) = inf_{γ>0} γδ/γ = δ. The function B ↦ c'Bc is the criterion function of the dual problem of Section 2.11. In summary, information functions play a role only if the dimensionality s of the parameter subsystem of interest is larger than one, s > 1. The most prominent information functions are matrix means, to be discussed next.

EXERCISES

5.1 Show that φ ≥ ψ if and only if {φ ≥ 1} ⊇ {ψ ≥ 1}, for all information functions φ, ψ.

5.2 In Section 5.11, what are the unit level sets of the functions constructed there?

5.3 Discuss the behavior of the sets δC as δ tends to zero or infinity, with C being (i) the unit level set {φ ≥ 1} of an information function φ as in Section 5.9, (ii) the penumbra M − NND(k) of a set of competing moment matrices M as in Section 4.9, (iii) the Elfving set conv(X ∪ (−X)) of a regression range X as in Section 2.9.

5.4
Show that a linear function φ : Sym(s) → ℝ is an information function on NND(s) if and only if for some D ∈ NND(s) and for all C ∈ Sym(s) one has φ(C) = trace CD.
5.5 Is φ an information function?
5.6 Show that neither of the two matrix norms is isotonic on NND(s) [Marshall and Olkin (1969), p. 170].

5.7 Which properties must a function φ possess in order to qualify as an information function?

CHAPTER 6

Matrix Means

The classical D-, A-, E-, and T-criteria are embedded into the one-parameter family of matrix means φ_p. For p in the interval [−∞; 1], the matrix mean φ_p is shown to be an information function. Its polar is proportional to the matrix mean φ_q, where the numbers p and q are conjugate in the interval [−∞; 1].

6.1. CLASSICAL OPTIMALITY CRITERIA

The ultimate purpose of any optimality criterion is to measure "largeness" of a nonnegative definite s × s matrix C. In the preceding chapter, we studied the implications of general principles that a reasonable criterion must meet. We now list specific criteria which submit themselves to these principles, and which enjoy a great popularity in practice. The most prominent criteria are the determinant criterion,

φ_0(C) = (det C)^{1/s},

the average-variance criterion,

φ_{−1}(C) = ((1/s) trace C^{−1})^{−1}  for C positive definite, and 0 otherwise,

the smallest-eigenvalue criterion,

φ_{−∞}(C) = λ_min(C),

and the trace criterion,

φ_1(C) = (1/s) trace C.
Each of these criteria reflects particular statistical aspects, to be discussed in Section 6.2 to Section 6.5. Furthermore, they form but four particular members of the one-dimensional family of matrix means φ_p, as defined in Section 6.7. In the remainder of the chapter, we convince ourselves that the matrix mean φ_p qualifies as an information function if the parameter p lies in the interval [−∞; 1].
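For computational reference, the following minimal sketch (illustrative only, not part of the formal development) evaluates the standardized matrix means φ_p from the eigenvalues of C; the parameter values p = 1, 0, −1, −∞ reproduce the T-, D-, A-, and E-criteria listed above.

```python
# Sketch: standardized matrix means phi_p computed from eigenvalues.
import numpy as np

def matrix_mean(C, p):
    """phi_p(C) = ((1/s) * trace(C^p))**(1/p) on NND(s), with the limiting cases."""
    lam = np.linalg.eigvalsh(C)
    s = len(lam)
    if p == -np.inf:
        return lam.min()                              # E-criterion: smallest eigenvalue
    if p == 0:
        return max(float(np.prod(lam)), 0.0) ** (1.0 / s)   # D-criterion: (det C)^(1/s)
    if p < 0 and lam.min() <= 0:
        return 0.0                                    # singular C: phi_p vanishes for p < 0
    return (np.mean(lam ** p)) ** (1.0 / p)           # includes A- (p=-1) and T- (p=1)

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
C = A @ A.T
for p in (1, 0, -1, -np.inf):
    print(p, matrix_mean(C, p))
# the values are nondecreasing in p, and all equal 1 at C = I_4
```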
6.2. D-CRITERION

The determinant criterion φ_0(C) = (det C)^{1/s} and its companions are discussed individually in Sections 6.2 to 6.5; they all arise from one family of means.

6.6. VECTOR MEANS

For a positive vector λ ∈ ℝ^s, λ > 0, the vector mean Φ_p is defined by

Φ_p(λ) = ((1/s) Σ_{i≤s} λ_i^p)^{1/p}  for p ≠ 0, ±∞,
Φ_0(λ) = (∏_{i≤s} λ_i)^{1/s},   Φ_{−∞}(λ) = min_{i≤s} λ_i,   Φ_∞(λ) = max_{i≤s} λ_i.
For vectors λ ∈ ℝ^s_+ with at least one component 0, continuous extension yields

Φ_p(λ) = ((1/s) Σ_{i≤s} λ_i^p)^{1/p}  for p > 0,   Φ_p(λ) = 0  for p ≤ 0.
For arbitrary vectors, λ ∈ ℝ^s, we finally define

Φ_p(λ) = Φ_p(|λ|),

where |λ| = (|λ_1|, ..., |λ_s|)' is the componentwise modulus.
The definition of Φ_p extends from positive vectors λ > 0 to all vectors λ ∈ ℝ^s in just the same way for both p ≥ 1 and for p ≤ 1. It is not the definition, but the functional properties that lead to a striking distinction. For p ≥ 1, the vector mean Φ_p is convex on all of the space ℝ^s; for p ≤ 1, it is concave but only if restricted to the cone ℝ^s_+. We find it instructive to follow up this distinction, for the vector means as well as for the matrix means, even
though for the purposes of optimal designs it suffices to investigate the means of order p ∈ [−∞; 1] only. If λ = a1_s is positively proportional to the unity vector 1_s, then the means satisfy Φ_p(a1_s) = a for all a > 0. They are standardized in such a way that Φ_p(1_s) = 1. For p ≠ ±∞, the dependence of Φ_p(λ) on a single component of λ is strictly isotonic provided the other components are fixed. With the full argument vector λ held fixed, the dependence on the parameter p ∈ [−∞; ∞] is continuous. Verification is straightforward for p tending to ±∞. For p tending to 0 it follows by applying the l'Hospital rule to log Φ_p(λ). If the vector λ has at least two distinct components then the means Φ_p(λ) are strictly increasing in p. The most prominent members of this family are the arithmetic mean Φ_1, the geometric mean Φ_0, and the harmonic mean Φ_{−1}. Our notation suggests that they correspond to the trace criterion φ_1, the determinant criterion φ_0, and the average-variance criterion φ_{−1} of Section 6.1.

6.7. MATRIX MEANS

For p ≥ 1, the matrix mean φ_p is a norm on the space Sym(s); for p ≤ 1, it is an information function on the cone NND(s). For a matrix C ∈ Sym(s), we let λ(C) = (λ_1, ..., λ_s)' be the vector consisting of the eigenvalues λ_i of C, in no particular order but repeated according to their multiplicities. The matrix mean φ_p is defined through the vector mean Φ_p,

φ_p(C) = Φ_p(λ(C)).
Since the vector means Φ_p are invariant under permutations of their argument vector λ, the order in which λ(C) assembles the eigenvalues of C does not matter. Hence φ_p(C) is well defined. An alternative representation of φ_p(C) avoids the explicit use of eigenvalues, but instead uses real powers of nonnegative definite matrices. This representation enters into the General Equivalence Theorem 7.14 for φ_p-optimality. There the conditions are stated in terms of not eigenvalues, but matrices and powers of matrices. To this end let us review the definition of the powers C^p for arbitrary real parameters p, provided C is positive definite. Using an eigenvalue decomposition C = Σ_{i≤s} λ_i z_i z_i', the definition is

C^p = Σ_{i≤s} λ_i^p z_i z_i'.
For integer values of p, the meaning of C^p is the usual one. In general, we obtain trace C^p = Σ_{i≤s} λ_i^p. For positive definite matrices, C ∈ PD(s), the matrix mean is represented by

(1)  φ_p(C) = ((1/s) trace C^p)^{1/p}  for p ≠ 0, ±∞.
For singular nonnegative definite matrices, C ∈ NND(s) with rank C < s, we have

(2)  φ_p(C) = ((1/s) trace C^p)^{1/p}  for p ∈ (0; 1],   φ_p(C) = 0  for p ∈ [−∞; 0].
For arbitrary symmetric matrices, C ∈ Sym(s), we finally get

(3)  φ_p(C) = Φ_p(λ(|C|)) = ((1/s) trace |C|^p)^{1/p},
where |C|, the modulus of C, is defined as follows. With eigenvalue decomposition C = Σ_{i≤s} λ_i z_i z_i', the positive part C_+ and the negative part C_- are given by

C_+ = Σ_{i: λ_i>0} λ_i z_i z_i',   C_- = − Σ_{i: λ_i<0} λ_i z_i z_i'.
They are nonnegative definite matrices, and fulfill C = C_+ − C_-. The modulus of C is defined by |C| = C_+ + C_-. It is nonnegative definite, and its eigenvalues are the absolute values |λ_i(C)|, thus substantiating (3). It is an immediate consequence of the definition that the matrix means φ_p on the space Sym(s) are absolutely homogeneous, nonnegative (even positive if p ∈ (0; ∞]), standardized, and continuous. This provides all the properties that constitute a norm on the space Sym(s), or an information function on the cone NND(s), except for subadditivity or superadditivity. This is where the domain of definition of φ_p must be restricted for sub- and superadditivity to hold true, from Sym(s) for p ≥ 1, to NND(s) for p ≤ 1. We base our derivation on the well-known polarity relation for the vector means Φ_p and the associated Hölder inequality in ℝ^s, using the quasi-linearization technique of Section 5.13. The key difficulty is the transition from the Euclidean vector scalar product ⟨λ, μ⟩ = λ'μ on ℝ^s, to the Euclidean matrix scalar product ⟨C, D⟩ = trace CD on Sym(s). Our approach
uses a few properties of vector majorization that are rigorously derived in Section 6.9. We begin with a lemma that provides a tool to recognize diagonal matrices.

6.8. DIAGONALITY OF SYMMETRIC MATRICES

Lemma. Let C be a symmetric s × s matrix with vector δ(C) = (c_{11}, ..., c_{ss})' of diagonal elements and vector λ(C) = (λ_1, ..., λ_s)' of eigenvalues. Then the matrix C is diagonal if and only if the vector δ(C) is a permutation of the vector λ(C).

Proof. We write c = δ(C) and λ = λ(C), for short. If C is diagonal, C = Δ_c, then the components c_{ii} are the eigenvalues of C. Hence the vector c is a permutation of the eigenvalue vector λ. This proves the direct part.

For the converse, we first show that Δ_c may be obtained from C through an averaging process. We denote by Sign(s) the set of all diagonal s × s matrices Q with entries q_{jj} ∈ {±1} for j ≤ s, and call it the sign-change group. This is a group of order 2^s. We claim that the average over the matrices QCQ with Q ∈ Sign(s) is the diagonal matrix Δ_c,

(1)  (1/2^s) Σ_{Q∈Sign(s)} QCQ = Δ_c.
To this end let e_j be the jth Euclidean unit vector in ℝ^s. We have e_i'QCQe_j = q_{ii}q_{jj}c_{ij}. The diagonal elements of the matrix average in (1) are then

(1/2^s) Σ_{Q∈Sign(s)} q_{ii}² c_{ii} = c_{ii}.
The off-diagonal elements are, with i ≠ j,

(1/2^s) Σ_{Q∈Sign(s)} q_{ii}q_{jj} c_{ij} = 0,

since the product q_{ii}q_{jj} equals +1 for exactly half of the matrices Q ∈ Sign(s) and −1 for the other half.
Hence (1) is established. Secondly, for the squared matrix norm, we verify the invariance property

(2)  ||QCQ||² = ||C||²  for all Q ∈ Sign(s).
On the one hand, we have ||QCQ||² = trace QCQQCQ = trace C² = ||C||² for every Q ∈ Sign(s); this establishes (2). By (1) and the strict convexity of the squared norm,

||Δ_c||² = ||(1/2^s) Σ_{Q∈Sign(s)} QCQ||² ≤ (1/2^s) Σ_{Q∈Sign(s)} ||QCQ||² = ||C||²,

with equality if and only if the matrices QCQ coincide for all Q ∈ Sign(s), that is, if and only if C is diagonal. On the other hand, the assumption that c is a permutation of λ yields ||Δ_c||² = Σ_{i≤s} c_{ii}² = Σ_{i≤s} λ_i² = trace C² = ||C||². Hence equality holds, and C is diagonal. This completes the proof. □

6.9. MAJORIZATION

The diagonal elements and the eigenvalues of a symmetric matrix are related through vector majorization; the properties needed in the sequel are collected in the following lemma.

Lemma. Let C be a symmetric s × s matrix with diagonal vector c = δ(C) and eigenvalue vector λ = λ(C). Then:

a. (Majorization) The vector c is majorized by the vector λ.

b. (Convex function inequality) For all strictly concave functions g : ℝ → ℝ, we have the inequality

Σ_{i≤s} g(c_{ii}) ≥ Σ_{i≤s} g(λ_i),
while for strictly convex functions g the inequality is reversed. In either case equality holds if and only if the matrix C is diagonal.

c. (Monotonicity of vector means) For parameter p ∈ (−∞; 1) the vector means