FOURTH EDITION
Applied Multivariate Statistical Analysis
RICHARD A. JOHNSON University of Wisconsin-Madison
DEAN W. WICHERN Texas A&M University
Prentice Hall, Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data
Johnson, Richard Arnold.
  Applied multivariate statistical analysis / Richard A. Johnson, Dean W. Wichern. -- 4th ed.
  p. cm.
  Includes bibliographical references and indexes.
  ISBN 0-13-834194-X
  1. Multivariate analysis.  I. Wichern, Dean W.  II. Title.
  QA278.J63  1998
  519.5'35--dc21                                   97-42907
                                                        CIP
Acquisitions Editor: ANN HEATH
Marketing Manager: MELODY MARCUS
Editorial Assistant: MINDY McCLARD
Editorial Director: TIM BOZIK
Editor-in-Chief: JEROME GRANT
Assistant Vice-President of Production and Manufacturing: DAVID W. RICCARDI
Editorial/Production Supervision: RICHARD DeLORENZO
Managing Editor: LINDA MIHATOV BEHRENS
Executive Managing Editor: KATHLEEN SCHIAPARELLI
Manufacturing Buyer: ALAN FISCHER
Manufacturing Manager: TRUDY PISCIOTTI
Marketing Assistant: PATRICK MURPHY
Director of Creative Services: PAULA MAYLAHN
Art Director: JAYNE CONTE
Cover Designer: BRUCE KENSELAAR

© 1998 by Prentice-Hall, Inc.
Simon & Schuster / A Viacom Company
Upper Saddle River, NJ 07458
All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

ISBN 0-13-834194-X
Prentice-Hall International (UK) Limited, London Prentice-Hall of Australia Pty. Limited, Sydney Prentice-Hall Canada Inc., Toronto Prentice-Hall Hispanoamericana, S.A., Mexico Prentice-Hall of India Private Limited, New Delhi Prentice-Hall of Japan, Inc., Tokyo Simon & Schuster Asia Pte. Ltd., Singapore Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
ISBN 0-13-834194-X
To the memory of my mother and my father.  R. A. J.

To Dorothy, Michael, and Andrew.  D. W. W.
Contents

PREFACE  xiii

PART I  Getting Started

1  ASPECTS OF MULTIVARIATE ANALYSIS  1
   1.1  Introduction  1
   1.2  Applications of Multivariate Techniques  3
   1.3  The Organization of Data  5
   1.4  Data Displays and Pictorial Representations  19
   1.5  Distance  28
   1.6  Final Comments  36
   Exercises  36
   References  47

2  MATRIX ALGEBRA AND RANDOM VECTORS  49
   2.1  Introduction  49
   2.2  Some Basics of Matrix and Vector Algebra  49
   2.3  Positive Definite Matrices  61
   2.4  A Square-Root Matrix  67
   2.5  Random Vectors and Matrices  68
   2.6  Mean Vectors and Covariance Matrices  69
   2.7  Matrix Inequalities and Maximization  81
   Supplement 2A  Vectors and Matrices: Basic Concepts  86
   Exercises  107
   References  115

3  SAMPLE GEOMETRY AND RANDOM SAMPLING  116
   3.1  Introduction  116
   3.2  The Geometry of the Sample  117
   3.3  Random Samples and the Expected Values of the Sample Mean and Covariance Matrix  124
   3.4  Generalized Variance  129
   3.5  Sample Mean, Covariance, and Correlation as Matrix Operations  145
   3.6  Sample Values of Linear Combinations of Variables  148
   Exercises  153
   References  156

4  THE MULTIVARIATE NORMAL DISTRIBUTION  157
   4.1  Introduction  157
   4.2  The Multivariate Normal Density and Its Properties  158
   4.3  Sampling from a Multivariate Normal Distribution and Maximum Likelihood Estimation  177
   4.4  The Sampling Distribution of X̄ and S  184
   4.5  Large-Sample Behavior of X̄ and S  185
   4.6  Assessing the Assumption of Normality  188
   4.7  Detecting Outliers and Data Cleaning  200
   4.8  Transformations to Near Normality  204
   Exercises  214
   References  222

PART II  Inferences About Multivariate Means and Linear Models

5  INFERENCES ABOUT A MEAN VECTOR  224
   5.1  Introduction  224
   5.2  The Plausibility of μ₀ as a Value for a Normal Population Mean  224
   5.3  Hotelling's T² and Likelihood Ratio Tests  231
   5.4  Confidence Regions and Simultaneous Comparisons of Component Means  235
   5.5  Large Sample Inferences about a Population Mean Vector  252
   5.6  Multivariate Quality Control Charts  257
   5.7  Inferences about Mean Vectors When Some Observations Are Missing  268
   5.8  Difficulties Due to Time Dependence in Multivariate Observations  273
   Supplement 5A  Simultaneous Confidence Intervals and Ellipses as Shadows of the p-Dimensional Ellipsoids  276
   Exercises  279
   References  288

6  COMPARISONS OF SEVERAL MULTIVARIATE MEANS  290
   6.1  Introduction  290
   6.2  Paired Comparisons and a Repeated Measures Design  291
   6.3  Comparing Mean Vectors from Two Populations  302
   6.4  Comparing Several Multivariate Population Means (One-Way MANOVA)  314
   6.5  Simultaneous Confidence Intervals for Treatment Effects  329
   6.6  Two-Way Multivariate Analysis of Variance  331
   6.7  Profile Analysis  343
   6.8  Repeated Measures Designs and Growth Curves  350
   6.9  Perspectives and a Strategy for Analyzing Multivariate Models  355
   Exercises  358
   References  375

7  MULTIVARIATE LINEAR REGRESSION MODELS  377
   7.1  Introduction  377
   7.2  The Classical Linear Regression Model  377
   7.3  Least Squares Estimation  381
   7.4  Inferences About the Regression Model  390
   7.5  Inferences from the Estimated Regression Function  400
   7.6  Model Checking and Other Aspects of Regression  404
   7.7  Multivariate Multiple Regression  410
   7.8  The Concept of Linear Regression  427
   7.9  Comparing the Two Formulations of the Regression Model  438
   7.10  Multiple Regression Models with Time Dependent Errors  441
   Supplement 7A  The Distribution of the Likelihood Ratio for the Multivariate Multiple Regression Model  446
   Exercises  448
   References  456

PART III  Analysis of Covariance Structure

8  PRINCIPAL COMPONENTS  458
   8.1  Introduction  458
   8.2  Population Principal Components  458
   8.3  Summarizing Sample Variation by Principal Components  471
   8.4  Graphing the Principal Components  484
   8.5  Large Sample Inferences  487
   8.6  Monitoring Quality with Principal Components  490
   Supplement 8A  The Geometry of the Sample Principal Component Approximation  498
   Exercises  503
   References  512

9  FACTOR ANALYSIS AND INFERENCE FOR STRUCTURED COVARIANCE MATRICES  514
   9.1  Introduction  514
   9.2  The Orthogonal Factor Model  515
   9.3  Methods of Estimation  521
   9.4  Factor Rotation  540
   9.5  Factor Scores  550
   9.6  Perspectives and a Strategy for Factor Analysis  557
   9.7  Structural Equation Models  565
   Supplement 9A  Some Computational Details for Maximum Likelihood Estimation  572
   Exercises  575
   References  585

10  CANONICAL CORRELATION ANALYSIS  587
   10.1  Introduction  587
   10.2  Canonical Variates and Canonical Correlations  587
   10.3  Interpreting the Population Canonical Variables  595
   10.4  The Sample Canonical Variates and Sample Canonical Correlations  601
   10.5  Additional Sample Descriptive Measures  610
   10.6  Large Sample Inferences  615
   Exercises  619
   References  627

PART IV  Classification and Grouping Techniques

11  DISCRIMINATION AND CLASSIFICATION  629
   11.1  Introduction  629
   11.2  Separation and Classification for Two Populations  630
   11.3  Classification with Two Multivariate Normal Populations  639
   11.4  Evaluating Classification Functions  649
   11.5  Fisher's Discriminant Function-Separation of Populations  661
   11.6  Classification with Several Populations  665
   11.7  Fisher's Method for Discriminating among Several Populations  683
   11.8  Final Comments  697
   Exercises  703
   References  723

12  CLUSTERING, DISTANCE METHODS AND ORDINATION  726
   12.1  Introduction  726
   12.2  Similarity Measures  728
   12.3  Hierarchical Clustering Methods  738
   12.4  Nonhierarchical Clustering Methods  754
   12.5  Multidimensional Scaling  760
   12.6  Correspondence Analysis  770
   12.7  Biplots for Viewing Sampling Units and Variables  779
   12.8  Procrustes Analysis: A Method for Comparing Configurations  782
   Exercises  790
   References  798

APPENDIX  800
   Table 1  Standard Normal Probabilities  801
   Table 2  Student's t-Distribution Percentage Points  802
   Table 3  χ² Distribution Percentage Points  803
   Table 4  F-Distribution Percentage Points (α = .10)  804
   Table 5  F-Distribution Percentage Points (α = .05)  806
   Table 6  F-Distribution Percentage Points (α = .01)  808

DATA INDEX  811

SUBJECT INDEX  812
Preface
INTENDED AUDIENCE
This book originally grew out of our lecture notes for an "Applied Multivariate Analysis" course offered jointly by the Statistics Department and the School of Business at the University of Wisconsin-Madison. Applied Multivariate Statistical Analysis, Fourth Edition, is concerned with statistical methods for describing and analyzing multivariate data. Data analysis, while interesting with one variable, becomes truly fascinating and challenging when several variables are involved. Researchers in the biological, physical, and social sciences frequently collect measurements on several variables. Modern computer packages readily provide the numerical results to rather complex statistical analyses. We have tried to provide readers with the supporting knowledge necessary for making proper interpretations, selecting appropriate techniques, and understanding their strengths and weaknesses. We hope our discussions will meet the needs of experimental scientists, in a wide variety of subject matter areas, as a readable introduction to the statistical analysis of multivariate observations.
LEVEL
Our aim is to present the concepts and methods of multivariate analysis at a level that is readily understandable by readers who have taken two or more statistics courses. We emphasize the applications of multivariate methods and, consequently, have attempted to make the mathematics as palatable as possible. We avoid the use of calculus. On the other hand, the concepts of a matrix and of matrix manipulations are important. We do not assume the reader is familiar with matrix algebra. Rather, we introduce matrices as they appear naturally in our discussions, and we then show how they simplify the presentation of multivariate models and techniques.
The introductory account of matrix algebra, in Chapter 2, highlights the more important matrix algebra results as they apply to multivariate analysis. The Chapter 2 supplement provides a summary of matrix algebra results for those with little or no previous exposure to the subject. This supplementary material helps make the book self-contained and is used to complete proofs. The proofs may be ignored on the first reading. In this way we hope to make the book accessible to a wide audience.
In our attempt to make the study of multivariate analysis appealing to a large audience of both practitioners and theoreticians, we have had to sacrifice a consistency of level. Some sections are harder than others. In particular, we have summarized a voluminous amount of material on regression in Chapter 7. The resulting presentation is rather succinct and difficult the first time through. We hope instructors will be able to compensate for the unevenness in level by judiciously choosing those sections, and subsections, appropriate for their students and by toning them down if necessary.
ORGANIZATION AND APPROACH
The methodological "tools" of multivariate analysis are contained in Chapters 5 through 12. These chapters represent the heart of the book, but they cannot be assimilated without much of the material in the introductory Chapters 1 through 4. Even those readers with a good knowledge of matrix algebra or those willing to accept the mathematical results on faith should, at the very least, peruse Chapter 3, Sample Geometry, and Chapter 4, Multivariate Normal Distribution.
Our approach in the methodological chapters is to keep the discussion direct and uncluttered. Typically, we start with a formulation of the population models, delineate the corresponding sample results, and liberally illustrate everything with examples. The examples are of two types: those that are simple and whose calculations can be easily done by hand, and those that rely on real-world data and computer software. These will provide an opportunity to: (1) duplicate our analyses, (2) carry out the analyses dictated by exercises, or (3) analyze the data using methods other than the ones we have used or suggested.
The division of the methodological chapters (5 through 12) into three units allows instructors some flexibility in tailoring a course to their needs. Possible sequences for a one-semester (two quarter) course are indicated schematically.
[Course sequence diagram: Getting Started (Chapters 1-4), followed by either Inferences About Means (Chapters 5-7) or Classification and Grouping (Chapters 11 and 12), and then Analysis of Covariance Structure (Chapters 8-10).]
Each instructor will undoubtedly omit certain sections from some chapters to cover a broader collection of topics than is indicated by these two choices.
For most students, we would suggest a quick pass through the first four chapters (concentrating primarily on the material in Chapter 1, Sections 2.1, 2.2, 2.3, 2.5, 2.6, and 3.6, and the "assessing normality" material in Chapter 4) followed by a selection of methodological topics. For example, one might discuss the comparison of mean vectors, principal components, factor analysis, discriminant analysis, and clustering. The discussions could feature the many "worked out" examples included in these sections of the text. Instructors may rely on diagrams and verbal descriptions to teach the corresponding theoretical developments. If the students have uniformly strong mathematical backgrounds, much of the book can successfully be covered in one term.
We have found individual data-analysis projects useful for integrating material from several of the methods chapters. Here, our rather complete treatments of MANOVA, regression analysis, factor analysis, canonical correlation, discriminant analysis, and so forth are helpful, even though they may not be specifically covered in lectures.

CHANGES TO THE FOURTH EDITION

New Material. Users of the previous editions will notice that we have added and updated some examples and exercises, and have expanded the discussions of viewing multivariate data, generalized variance, assessing normality and transformations to normality, simultaneous confidence intervals, repeated measure designs, and cluster analysis. We have also added a number of new sections, including: Detecting Outliers and Data Cleaning (Ch. 4); Multivariate Quality Control Charts, Difficulties Due to Time Dependence in Multivariate Observations (Ch. 5); Repeated Measures Designs and Growth Curves (Ch. 6); Multiple Regression Models with Time Dependent Errors (Ch. 7); Monitoring Quality with Principal Components (Ch. 8); Correspondence Analysis (Ch. 12); Biplots (Ch. 12); and Procrustes Analysis (Ch. 12). We have worked to improve the exposition throughout the text, and have expanded the t-table in the appendix.

Data Disk. Recognizing the importance of modern statistical packages in the analysis of multivariate data, we have added numerous real-data sets. The full data sets used in the book are saved as ASCII files on the Data Disk, which is packaged with each copy of the book. This format will allow easy interface with existing statistical software packages and provide more convenient hands-on data analysis opportunities.

Instructor's Solutions Manual. An Instructor's Solutions Manual (ISBN 0-13-834202-4) containing complete solutions to most of the exercises in the book is available free upon adoption from Prentice Hall.
For information on additional for-sale supplements that may be used with the book or additional titles of interest, please visit the Prentice Hall web site at www.prenhall.com.

ACKNOWLEDGMENTS
We thank our many colleagues who helped improve the applied aspect of the book by contributing their own data sets for examples and exercises. A number of individuals helped guide this revision, and we are grateful for their suggestions: Steve Coad, University of Michigan; Richard Kiltie, University of Florida; Sam Kotz, George Mason University; Shyamal Peddada, University of Virginia; K. Sivakumar, University of Illinois at Chicago; Eric Smith, Virginia Tech; and Stanley Wasserman, University of Illinois at Urbana-Champaign. We also acknowledge the feedback of the students we have taught these past 25 years in our applied multivariate analysis courses. Their comments and suggestions are largely responsible for the present iteration of this work. We would also like to give special thanks to Wai Kwong Cheang for his help with the calculations for many of the new examples.
We must thank Deborah Smith for her valuable work on the Data Disk and Solutions Manual, Steve Verrill for computing assistance throughout, and Alison Pollack for implementing a Chernoff faces program. We are indebted to Cliff Gilman for his assistance with the multidimensional scaling examples discussed in Chapter 12. Jacquelyn Forer did most of the typing of the original draft manuscript, and we appreciate her expertise and willingness to endure the cajoling of authors faced with publication deadlines. Finally, we would like to thank Ann Heath, Mindy McClard, Richard DeLorenzo, Brian Baker, Linda Behrens, Alan Fischer, and the rest of the Prentice Hall staff for their help with this project.

R. A. Johnson
D. W. Wichern
CHAPTER 1

Aspects of Multivariate Analysis

1.1 INTRODUCTION
Scientific inquiry is an iterative learning process. Objectives pertaining to the explanation of a social or physical phenomenon must be specified and then tested by gathering and analyzing data. In turn, an analysis of the data gathered by experimentation or observation will usually suggest a modified explanation of the phenomenon. Throughout this iterative learning process, variables are often added or deleted from the study. Thus, the complexities of most phenomena require an investigator to collect observations on many different variables. This book is concerned with statistical methods designed to elicit information from these kinds of data sets. Because the data include simultaneous measurements on many variables, this body of methodology is called multivariate analysis.
The need to understand the relationships between many variables makes multivariate analysis an inherently difficult subject. Often, the human mind is overwhelmed by the sheer bulk of the data. Additionally, more mathematics is required to derive multivariate statistical techniques for making inferences than in a univariate setting. We have chosen to provide explanations based upon algebraic concepts and to avoid the derivations of statistical results that require the calculus of many variables. Our objective is to introduce several useful multivariate techniques in a clear manner, making heavy use of illustrative examples and a minimum of mathematics. Nonetheless, some mathematical sophistication and a desire to think quantitatively will be required.
Most of our emphasis will be on the analysis of measurements obtained without actively controlling or manipulating any of the variables on which the measurements are made. Only in Chapters 6 and 7 shall we treat a few experimental plans (designs) for generating data that prescribe the active manipulation of important variables. Although the experimental design is ordinarily the most important part of a scientific investigation, it is frequently impossible to control the generation of appropriate data in certain disciplines. (This is true, for example, in business, economics, ecology, geology, and sociology.) You should consult [7] and [8] for detailed accounts of design principles that, fortunately, also apply to multivariate situations.
It will become increasingly clear that many multivariate methods are based upon an underlying probability model known as the multivariate normal distribution. Other methods are ad hoc in nature and are justified by logical or common-sense arguments. Regardless of their origin, multivariate techniques must, invariably, be implemented on a computer. Recent advances in computer technology have been accompanied by the development of rather sophisticated statistical software packages, making the implementation step easier.
Multivariate analysis is a "mixed bag." It is difficult to establish a classification scheme for multivariate techniques that both is widely accepted and indicates the appropriateness of the techniques. One classification distinguishes techniques designed to study interdependent relationships from those designed to study dependent relationships. Another classifies techniques according to the number of populations and the number of sets of variables being studied. Chapters in this text are divided into sections according to inference about treatment means, inference about covariance structure, and techniques for sorting or grouping. This should not, however, be considered an attempt to place each method into a slot. Rather, the choice of methods and the types of analyses employed are largely determined by the objectives of the investigation. Below, we list a smaller number of practical problems designed to illustrate the connection between the choice of a statistical method and the objectives of the study. These problems, plus the examples in the text, should provide you with an appreciation for the applicability of multivariate techniques across different fields.
The objectives of scientific investigations to which multivariate methods most naturally lend themselves include the following:
1. Data reduction or structural simplification. The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
2. Sorting and grouping. Groups of "similar" objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
3. Investigation of the dependence among variables. The nature of the relationships among variables is of interest. Are all the variables mutually independent, or are one or more variables dependent on the others? If so, how?
4. Prediction. Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables.
5. Hypothesis construction and testing. Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.
We conclude this brief overview of multivariate analysis with a quotation from F. H. C. Marriott [19], page 89. The statement was made in a discussion of cluster analysis, but we feel it is appropriate for a broader range of methods. You should keep it in mind whenever you attempt or read about a data analysis. It allows one to maintain a proper perspective and not be overwhelmed by the elegance of some of the theory:
If the results disagree with informed opinion, do not admit a simple logical interpretation, and do not show up clearly in a graphical presentation, they are probably wrong. There is no magic about numerical methods, and many ways in which they can break down. They are a valuable aid to the interpretation of data, not sausage machines automatically transforming bodies of numbers into packets of scientific fact.
APPLICATIONS OF MULTIVARIATE TECHNIQUES
The published applications of multivariate methods have increased tremendously in recent years. It is now difficult to cover the variety of real-world applications of these methods with brief discussions, as we did in earlier editions of this book. However, in order to give some indication of the usefulness of multivariate tech niques, we offer the following short descriptions of the results of studies from sev eral disciplines. These descriptions are organized according to the categories of objectives given in the previous section. Of course, many of our examples are mul tifaceted and could be placed in more than one category. Data reduction or simplification •
• •
•
Using data on several variables related to cancer patient responses to radio therapy, a simple measure of patient response to radiotherapy was con structed. (See Exercise 1.15.) Track records from many nations were used to develop an index of perfor mance for both male and female athletes. (See [10] and [21].) Multispectral image data collected by a high-altitude scanner were reduced to a form that could be viewed as images (pictures) of a shoreline in two dimensions. (See [22].) Data on several variables relating to yield and protein content were used to create an index to select parents of subsequent generations of improved bean plants. (See [14].)
Sorting and grouping •
Data on several variables related to computer use were employed to create clusters of categories of computer jobs that allow a better determination of existing (or planned) computer utilization. (See [2].)
4
Chap. 1
Aspects of M u ltivariate Analysis • •
•
Measurements of several physiological variables were used to develop a screen ing procedure that discriminates alcoholics from nonalcoholics. (See [25].) Data related to responses to visual stimuli were used to develop a rule for separating people suffering from a multiple-sclerosis-caused visual pathology from those not suffering from the disease. (See Exercise 1.14.) The U. S. Internal Revenue Service uses data collected from tax returns to sort taxpayers into two groups: those that will be audited and those that will not. (See [30].)
Investigation of the dependence among variables • •
•
•
Data on several variables were used to identify factors that were responsible for client success in hiring external consultants. (See [13].) Measurements of variables related to innovation, on the one hand, and vari ables related to the business environment and business organization, on the other hand, were used to discover why some firms are product innovators and some firms are not. (See [5].) Data on variables representing the outcomes of the 10 decathlon events in the Olympics were used to determine the physical factors responsible for suc cess in the decathlon. (See [17].) The associations between measures of risk-taking propensity and measures of socioeconomic characteristics for top-level business executives were used to assess the relation between risk-taking behavior and performance. (See [18].)
Prediction •
•
•
•
The associations between test scores and several high school performance variables and several college performance variables were used to develop pre dictors of success in college. (See [11].) Data on several variables related to the size distribution of sediments were used to develop rules for predicting different depositional environments. (See [9] and [20].) Measurements on several accounting and financial variables were used to develop a method for identifying potentially insolvent property-liability insur ers. (See [27].) Data on several variables for chickweed plants were used to develop a method for predicting the species of a new plant. (See [4].)
Hypotheses testing • Several pollution-related variables were measured to determine whether lev els for a large metropolitan area were roughly constant throughout the week, or whether there was a noticeable difference between weekdays and week ends. (See Exercise 1.6.)
Sec. 1 . 3 •
•
•
The Organ ization of Data
5
Experimental data on several variables were used to see whether the nature of the instructions makes any difference in perceived risks, as quantified by test scores. (See [26].) Data on many variables were used to investigate the differences in structure of American occupations to determine the support for one of two competing sociological theories. (See [16] and [24].) Data on several variables were used to determine whether different types of firms in newly industrialized countries exhibited different patterns of innova tion. (See [15].)
The preceding descriptions offer glimpses into the use of multivariate methods in widely diverse fields.

1.3 THE ORGANIZATION OF DATA
Throughout this text, we are going to be concerned with analyzing measurements made on several variables or characteristics. These measurements (commonly called data) must frequently be arranged and displayed in various ways. For example, graphs and tabular arrangements are important aids in data analysis. Summary numbers, which quantitatively portray certain features of the data, are also necessary to any description. We now introduce the preliminary concepts underlying these first steps of data organization.

Arrays
Multivariate data arise whenever an investigator, seeking to understand a social or physical phenomenon, selects a number p ≥ 1 of variables or characters to record. The values of these variables are all recorded for each distinct item, individual, or
experimental unit.
We will use the notation x_jk to indicate the particular value of the kth variable that is observed on the jth item, or trial. That is,
    x_jk = measurement of the kth variable on the jth item

Consequently, n measurements on p variables can be displayed as follows:

              Variable 1   Variable 2   ...   Variable k   ...   Variable p
    Item 1:     x_11         x_12       ...     x_1k       ...     x_1p
    Item 2:     x_21         x_22       ...     x_2k       ...     x_2p
      .           .            .                  .                  .
    Item j:     x_j1         x_j2       ...     x_jk       ...     x_jp
      .           .            .                  .                  .
    Item n:     x_n1         x_n2       ...     x_nk       ...     x_np

Or we can display these data as a rectangular array, called X, of n rows and p columns:

          | x_11   x_12   ...   x_1k   ...   x_1p |
          | x_21   x_22   ...   x_2k   ...   x_2p |
          |   .      .            .            .  |
    X  =  | x_j1   x_j2   ...   x_jk   ...   x_jp |
          |   .      .            .            .  |
          | x_n1   x_n2   ...   x_nk   ...   x_np |
. Xnp
The array X, then, contains the data consisting of all of the observations on all of the variables.

Example 1.1 (A data array)
A selection of four receipts from a university bookstore was obtained in order to investigate the nature of book sales. Each receipt provided, among other things, the number of books sold and the total amount of each sale. Let the first variable be total dollar sales and the second variable be number of books sold. Then we can regard the corresponding numbers on the receipts as four measurements on two variables. Suppose the data, in tabular form, are:

    Variable 1 (dollar sales):     42   52   48   58
    Variable 2 (number of books):   4    5    4    3

Using the notation just introduced, we have

    x_11 = 42   x_12 = 4
    x_21 = 52   x_22 = 5

and the data array X is

          | 42  4 |
    X  =  | 52  5 |
          | 48  4 |
          | 58  3 |

with four rows and two columns.
•
Considering data in the form of arrays facilitates the exposition of the subject matter and allows numerical calculations to be performed in an orderly and efficient manner. The efficiency is twofold, as gains are attained in both (1) describing numerical calculations as operations on arrays and (2) the implementation of the calculations on computers, which now use many languages and statistical packages to perform array operations. We consider the manipulation of arrays of numbers in Chapter 2. At this point, we are concerned only with their value as devices for displaying data.
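As a concrete illustration of the array viewpoint, the following short Python sketch (our addition, not part of the original text; the variable names are our own) stores the Example 1.1 receipts as an n-by-p array and echoes its dimensions.

    # Example 1.1 data: n = 4 receipts (rows), p = 2 variables (columns)
    # column 0 = total dollar sales, column 1 = number of books sold
    X = [
        [42, 4],
        [52, 5],
        [48, 4],
        [58, 3],
    ]

    n = len(X)       # number of items (rows)
    p = len(X[0])    # number of variables (columns)
    print(n, p)      # 4 2
    print(X[1][0])   # x_21 = 52, the dollar sales on the second receipt

Indexing X[j-1][k-1] recovers the measurement x_jk in the notation introduced above.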
Descriptive Statistics
A large data set is bulky, and its very mass poses a serious obstacle to any attempt to visually extract pertinent information. Much of the information contained in the data can be assessed by calculating certain summary numbers, known as descriptive statistics. For example, the arithmetic average, or sample mean, is a descriptive statistic that provides a measure of location, that is, a "central value" for a set of numbers. And the average of the squares of the distances of all of the numbers from the mean provides a measure of the spread, or variation, in the numbers.
We shall rely most heavily on descriptive statistics that measure location, variation, and linear association. The formal definitions of these quantities follow.
Let x_11, x_21, ..., x_n1 be n measurements on the first variable. Then the arithmetic average of these measurements is

    x̄_1 = (1/n) Σ_{j=1}^{n} x_j1

If the n measurements represent a subset of the full set of measurements that might have been observed, then x̄_1 is also called the sample mean for the first variable. We adopt this terminology because the bulk of this book is devoted to procedures designed for analyzing samples of measurements from larger collections. The sample mean can be computed from the n measurements on each of the p variables, so that, in general, there will be p sample means:

    x̄_k = (1/n) Σ_{j=1}^{n} x_jk,    k = 1, 2, ..., p        (1-1)

A measure of spread is provided by the sample variance, defined for n measurements on the first variable as

    s_1² = (1/n) Σ_{j=1}^{n} (x_j1 − x̄_1)²

where x̄_1 is the sample mean of the x_j1's. In general, for p variables, we have

    s_k² = (1/n) Σ_{j=1}^{n} (x_jk − x̄_k)²,    k = 1, 2, ..., p
Two comments are in order. First, many authors define the sample variance with a divisor of n 1 rather than n. Later we shall see that there are theoretical rea sons for doing this, and it is particularly appropriate if the number of measure ments, n, is small. The two versions of the sample variance will always be differentiated by displaying the appropriate expression. Second, although the s2 notation is traditionally used to indicate the sample variance, we shall eventually consider an array of quantities in which the sample -
8
Chap. 1
Aspects of M u ltivariate Ana lysis
variances lie along the main diagonal. In this situation, it is convenient to use dou ble subscripts on the variances in order to indicate their positions in the array. Therefore, we introduce the notation s;; to denote the same variance computed from measurements on the ith variable, and we have the notational identities k
=
1, 2, . . . , p
(1-3)
The square root of the sample variance, Vi;;, is known as the sample standard deviation. This measure of variation is in the same units as the observations. Consider n pairs of measurements on each of variables 1 and 2:
[ Xxl1 2l ] , [ XzlXzz ] , . . . , [ xx",lz ]
That is, xj 1 and xj 2 are observed on the jth experimental item (j = 1, 2, . . . , n). A measure of linear association between the measurements of variables 1 and 2 is provided by the sample covariance
or the average product of the deviations from their respective means. If large val ues for one variable are observed in conjunction with large values for the other variable, and the small values also occur together, s 1 2 will be positive. If large val ues from one variable occur with small values for the other variable, s 1 2 will be neg ative. If there is no particular association between the values for the two variables, s 1 2 will be approximately zero. The sample covariance
i
=
1, 2, . . . , p, k
=
1, 2,
0 0 0
,p
(1-4)
measures the association between the ith and kth variables. We note that the covari ance reduces to the sample variance when i = k. Moreover, s;k = sk i for all i and k. The final descriptive statistic considered here is the sample correlation coef ficient (or Pearson 's product-moment correlation coefficient; see [31). This mea sure of the linear association between two variables does not depend on the units of measurement. The sample correlation coefficient for the ith and kth variables is defined as 11
(x ; - x; ) (xj k - xk ) j=l j 2:
(1-5)
Sec. 1 . 3
The Organ ization of Data
9
1, 2, ... , p and k = 1, 2, .. . , p. Note r;k = rk; for all i and k. The sample correlation coefficient is a standardized version of the sample covariance, where the product of the square roots of the sample variances provides the standardization. Notice that r;k has the same value whether n or n - 1 is cho sen as the common divisor for s;;, skk, and s;k · The sample correlation coefficient r;k can also be viewed as a sample covari an ce . Suppose the original values xii and xik are replaced by standardized values (xi; - x;)!Vi;; and (xik - xk )/Vi;,. The standardized values are commen surable because both sets are centered at zero and expressed in stqndard deviation units. The sample correlation coefficient is just the sample covariance of the stan dardized observations. Although the signs of the sample correlation and the sample covariance are the same, the correlation is ordinarily easier to interpret because its magnitude is bounded. To summarize, the sample correlation r has the following properties:
for i
=
1. The value of r must be between −1 and +1.
2. Here r measures the strength of the linear association. If r = 0, this implies a lack of linear association between the components. Otherwise, the sign of r indicates the direction of the association: r < 0 implies a tendency for one value in the pair to be larger than its average when the other is smaller than its average; and r > 0 implies a tendency for one value of the pair to be large when the other value is large and also for both values to be small together.
3. The value of r_ik remains unchanged if the measurements of the ith variable are changed to y_ji = a x_ji + b, j = 1, 2, ..., n, and the values of the kth variable are changed to y_jk = c x_jk + d, j = 1, 2, ..., n, provided that the constants a and c have the same sign.
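The following Python sketch (our addition; the helper names are our own) mirrors the definitions of the sample covariance and sample correlation just given, using the Example 1.1 receipts.

    import math

    def sample_mean(col):
        return sum(col) / len(col)

    def sample_cov(xi, xk):
        """s_ik with divisor n, the sample covariance."""
        n = len(xi)
        mi, mk = sample_mean(xi), sample_mean(xk)
        return sum((a - mi) * (b - mk) for a, b in zip(xi, xk)) / n

    def sample_corr(xi, xk):
        """r_ik = s_ik / sqrt(s_ii * s_kk), the sample correlation coefficient."""
        return sample_cov(xi, xk) / math.sqrt(sample_cov(xi, xi) * sample_cov(xk, xk))

    dollar_sales = [42, 52, 48, 58]
    num_books = [4, 5, 4, 3]
    print(sample_cov(dollar_sales, num_books))              # -1.5
    print(round(sample_corr(dollar_sales, num_books), 2))   # -0.36

Note that dividing by n − 1 instead of n in sample_cov would leave sample_corr unchanged, in line with the remark above that r_ik does not depend on which divisor is chosen.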
The quantities s_ik and r_ik do not, in general, convey all there is to know about the association between two variables. Nonlinear associations can exist that are not revealed by these descriptive statistics. Covariance and correlation provide measures of linear association, or association along a line. Their values are less informative for other kinds of association. On the other hand, these quantities can be very sensitive to "wild" observations ("outliers") and may indicate association when, in fact, little exists. In spite of these shortcomings, covariance and correlation coefficients are routinely calculated and analyzed. They provide cogent numerical summaries of association when the data do not exhibit obvious nonlinear patterns of association and when wild observations are not present. Suspect observations must be accounted for by correcting obvious recording mistakes and by taking actions consistent with the identified causes. The values of s_ik and r_ik should be quoted both with and without these observations.
The sum of squares of the deviations from the mean and the sum of cross-product deviations are often of interest themselves. These quantities are
wkk
=
II
.L (xj k - xk ) 2
j=l
k
=
1,2, . . . ,p
(1-6)
10
Chap. 1
Aspects of M u ltivariate Analysis
and
W; k
=
ll
jL: (xji - x;) (xjk - xk ) =l
i
=
1, 2, . . . 'p, k
=
1, 2, . . . 'p
(1-7)
The descriptive statistics computed from n measurements on p variables can also be organized into arrays. ARRAYS OF BASIC DESCRIPTIVE STATISTICS
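As a computational companion to the array layout displayed next, here is a hedged Python sketch (our own helper, not from the text) that assembles the sample mean array, the covariance array S_n (divisor n), and the correlation array R from a data array X.

    def descriptive_arrays(X):
        """Return (x_bar, S_n, R): mean vector, covariance array with divisor n,
        and correlation array, for an n-by-p data array X given as nested lists."""
        n, p = len(X), len(X[0])
        xbar = [sum(row[k] for row in X) / n for k in range(p)]
        S = [[sum((row[i] - xbar[i]) * (row[k] - xbar[k]) for row in X) / n
              for k in range(p)] for i in range(p)]
        R = [[S[i][k] / (S[i][i] ** 0.5 * S[k][k] ** 0.5)
              for k in range(p)] for i in range(p)]
        return xbar, S, R

    xbar, S, R = descriptive_arrays([[42, 4], [52, 5], [48, 4], [58, 3]])
    # xbar = [50.0, 4.0]; S = [[34.0, -1.5], [-1.5, 0.5]]; R[0][1] is about -0.36

These are exactly the arrays worked out by hand in Example 1.2 below.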
Sample
means
Sample vafiahces
and covariances
Sample correlations
The sample mean array is denoted by x, the sample variance and covariance array by the capital letter S11, and the sample correlation array by R. The subscript n on the array S11 is a mnemonic device used to remind you that n is employed as a divisor for the elements s; k · The size of all of the arrays is determined by the number of variables, p. The arrays S11 and R consist of p rows and p columns. The array x is a single column with p rows. The first subscript on an entry in arrays S" and R indicates the row; the second subscript indicates the column. Since s; k = ski and r; k = rk ; for all i and k, the entries in symmetric positions about the main northwest-southeast diag onals in arrays S11 and R are the same, and the arrays are said to be symmetric. Example 1 .2 (The arrays x, S11, and R for bivariate data)
Consider the data introduced in Example 1.1. Each receipt yields a pair of measurements, total dollar sales, and number of books sold. Find the arrays x, S11, and R.
Sec. 1 . 3
11
The Organ ization of Data
Since there are four receipts, we have a total of four measurements (observations) on each variable. The sample means are
4
:X1
=
� � xj !
:X 2
=
� � xj 2
j= l 4
j= l
=
l ( 42 + 52 + 48 + 58)
=
l{ 4 + 5 + 4 + 3)
=
=
50
4
The sample variances and covariances are
sl l
4
- 4I"' £.J =
(xj ! - x-i ) 2
j =l 2 � ((42 - 50) + (52 - 50) 2 + (48 - 50) 2 + (58 - 50) 2 )
S22 - I
4
=
34
=
- 1.5
(xj 2 - X-2 ) 2 j =l
"' - 4 £.J
=
s1 2
= =
� ((4 - 4) 2 + (5 - 4) 2 + (4 - 4) 2 + (3 - 4) 2 )
=
.5
4
(x - :X ) (x - :X ) j= j i 1 j 2 2 � ((42 - 50) (4 - 4) + (52 - 50) (5 - 4) � �l
+ (48 - 50) (4 - 4) + (58 - 50) (3 - 4)) and sn =
The sample correlation is
r 12
=
[
�Ys;;
S1 2
34 - 1.5 1.5 .5 =
- 1.5
J
\134 Y.5
=
- .36
12
Chap. 1
Aspects of M u ltivariate Analysis
so _
R G raphical Techniques
[ - .361
- .36 1
]
•
Plots are important, but frequently neglected, aids in data analysis. Although it is impossible to simultaneously plot all the measurements made on several variables and study the configurations, plots of individual variables and plots of pairs of vari ables can still be very informative. Sophisticated computer programs and display equipment allow one the luxury of visually examining data in one, two, or three dimensions with relative ease. On the other hand, many valuable insights can be obtained from the data by constructing plots with paper and pencil. Simple, yet ele gant and effective, methods for displaying data are available in [28]. It is good sta tistical practice to plot pairs of variables and visually inspect the pattern of association. Consider, then, the following seven pairs of measurements on two variables: Variable 1 (x 1 ) : Variable 2 (x2 ) :
3 5
4
2
5.5
4
6
7
8
2
5
10
5
7.5
These data are plotted as seven points in two dimensions ( each axis repre senting a variable ) in Figure 1.1. The coordinates of the points are determined by the paired measurements: (3, 5), (4, 5.5), . . . , (5, 7 .5 ) . The resulting two-dimen sional plot is known as a scatter diagram or scatter plot. Xz
xz
•
10
10
• •
8
8
:.a • 0 •• 0
6
6
E �
OJl "'
•
•
• •
4
4
2
2
0
• •
•
4
2
! •
2
•
•
! 4
6
•
! 6
Dot diagram
8
! 8
I""
10
XI
A scatter plot and marginal dot diagrams.
Figure 1 . 1
Sec. 1 . 3
The O rgan ization of Data
13
Also shown in Figure 1.1 are separate plots of the observed values of variable 1 and the observed values of variable 2, respectively. These plots are called (mar ginal) dot diagrams. They can be obtained from the original observations or by pro jecting the points in the scatter diagram onto each coordinate axis. The information contained in the single-variable dot diagrams can be used to calculate the sample means.�\ and :X2 and the sample variances s 1 1 and s22 . (See Exercise 1.1.) The scatter diagram indicates the orientation of the points, and their coordinates can be used to calculate the sample covariance s 12 . In the scatter dia gram of Figure 1.1, large values of x1 occur with large values of x 2 and small val ues of x 1 with small values of x2• Hence, s1 2 will be positive. Dot diagrams and scatter plots contain different kinds of information. The information in the marginal dot diagrams is not sufficient for constructing the scat ter plot. As an illustration, suppose the data preceding Figure 1.1 had been paired differently, so that the measurements on the variables x 1 and x 2 were as follows: Variable 1 Variable 2
(x 1 ): (x2 ):
5
4
6
2
2
8
3
5
5.5
4
7
10
5
7.5
(We have simply rearranged the values of variable 1.) The scatter and dot dia grams for the "new" data are shown in Figure 1.2. Comparing Figures 1.1 and 1.2, we find that the marginal dot diagrams are the same, but that the scatter diagrams are decidedly different. In Figure 1.2, large values of x 1 are paired with small val ues of x2 and small values of x1 with large values of x 2• Consequently, the descrip tive statistics for the individual variables :X1 , :X2 , s 1 1 , and s22 remain unchanged, but the sample covariance s 1 2 , which measures the association between pairs of vari ables, will now be negative. The different orientations of the data in Figures 1.1 and 1.2 are not dis cernible from the marginal dot diagrams alone. At the same time, the fact that the
• • •
•
• •
•
Xz
Xz
10 8
•
•
6
• •
4
•
•
•
2 0
2
•
! 2
4
•
! 4
6
•
! 6
8
10
8
10
!
I
x,
�
x,
Figure 1 .2 Scatter plot and dot diagrams for rearranged data .
14
Chap. 1
Aspects of M u ltivariate Analysis
marginal dot diagrams are the same in the two cases is not immediately apparent from the scatter plots. The two types of graphical procedures complement one another; they are not competitors.
The next two examples further illustrate the information that can be conveyed by a graphic display.

Example 1.3 (The effect of unusual observations on sample correlations)
Some financial data representing jobs and productivity for the 16 largest publishing firms appeared in an article in Forbes magazine on April 30, 1990. The data for the pair of variables x_1 = employees (jobs) and x_2 = profits per employee (productivity) are graphed in Figure 1.3. We have labeled two "unusual" observations. Dun & Bradstreet is the largest firm in terms of number of employees, but is "typical" in terms of profits per employee. Time Warner has a "typical" number of employees, but comparatively small (negative) profits per employee.

[Figure 1.3  Profits per employee and number of employees for 16 publishing firms.]

The sample correlation coefficient computed from the values of x_1 and x_2 is

    r_12 = −.39   for all 16 firms
         = −.56   for all firms but Dun & Bradstreet
         = −.39   for all firms but Time Warner
         = −.50   for all firms but Dun & Bradstreet and Time Warner
It is clear that atypical observations can have a considerable effect on the sample correlation coefficient. •
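The Forbes data themselves are not reproduced here, but the following Python sketch (with small made-up numbers, not the publishing-firm data) shows the kind of recomputation behind the four values of r_12 quoted above: the correlation is simply recalculated after deleting a flagged observation.

    import math

    def corr(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / math.sqrt(sxx * syy)

    # Hypothetical (x1, x2) pairs; the last pair plays the role of an atypical point.
    x1 = [10, 12, 15, 18, 60]
    x2 = [8, 7, 6, 5, 7]
    print(round(corr(x1, x2), 2))            # 0.05  with the unusual point included
    print(round(corr(x1[:-1], x2[:-1]), 2))  # -1.0  with it removed

Even one point can flip the apparent direction and strength of the association, which is exactly the sensitivity being illustrated in this example.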
Example 1 .4 (A scatter plot for baseball data)
In a July 17, 1978, article on money in sports, Sports Illustrated magazine pro vided data on x1 = player payroll for National League East baseball teams.
[Figure 1.4  Salaries and won-lost percentage from Table 1.1.]

We have added data on x_2 = won-lost percentage for 1977. The results are given in Table 1.1.

TABLE 1.1  1977 SALARY AND FINAL RECORD FOR THE NATIONAL LEAGUE EAST

    Team                     x_1 = player payroll    x_2 = won-lost percentage
    Philadelphia Phillies         3,497,900                   .623
    Pittsburgh Pirates            2,485,475                   .593
    St. Louis Cardinals           1,782,875                   .512
    Chicago Cubs                  1,725,450                   .500
    Montreal Expos                1,645,575                   .463
    New York Mets                 1,469,800                   .395
The scatter plot in Figure 1.4 supports the claim that a championship team can be bought. Of course, this cause-effect relationship cannot be sub stantiated, because the experiment did not include a random assignment of payrolls. Thus, statistics cannot answer the question: Could the Mets have won with $4 million to spend on player salaries? • To construct the scatter plot in, for example, Figure 1 .4, we have regarded the six paired observations in Table 1.1 as the coordinates of six points in two dimensional space. The figure allows us to examine visually the grouping of teams with respect to the variables total payroll and won-lost percentage.
16
Chap. 1
Aspects of M u ltivariate Analysis
Example 1 .5
(Multiple scatter plots for paper strength measurements)
Paper is manufactured in continuous sheets several feet wide. Because of the orientation of fibers within the paper, it has a different strength when mea sured in the direction produced by the machine than when measured across, or at right angles to, the machine direction. Table 1.2 shows the measured values of x1 =
x2 =
x3 =
density (grams/cubic centimeter) strength (pounds) in the machine direction strength (pounds) in the cross direction
A novel graphic presentation of these data appears in Figure 1.5, page 18. The scatter plots are arranged as the off-diagonal elements of a covari ance array and box plots as the diagonal elements. The latter are on a differ ent scale with this software, so we use only the overall shape to provide information on symmetry and possible outliers for each individual character istic. The scatter plots can be inspected for patterns and unusual observations. In Figure 1.5, there is one unusual observation: the density of specimen 25. Some of the scatter plots have patterns suggesting that there are two separate clumps of observations. These scatter plot arrays are further pursued in our discussion of new • software graphics in the next section. In the general multiresponse situation, p variables are simultaneously recorded on n items. Scatter plots should be made for pairs of important variables and, if the task is not too great to warrant the effort, for all pairs. Limited as we are to a three-dimensional world, we cannot always picture an entire set of data. However, two further geometric representations of the data pro vide an important conceptual framework for viewing multivariable statistical meth ods. In cases where it is possible to capture the essence of the data in three dimensions, these representations can actually be graphed. n points in p dimensions (p-dimensional scatter plot) . Consider the nat ural extension of the scatter plot to p dimensions, where the p measurements
on the jth item represent the coordinates of a point in p-dimensional space. The coordinate axes are taken to correspond to the variables, so that the jth point is xi 1 units along the first axis, xi 2 units along the second, . . , xiP units along the p th axis. The resulting plot with n points not only will exhibit the overall pattern of vari ability, but also will show similarities (and differences) among the n items. Group ings of items will manifest themselves in this representation. .
TABLE 1 . 2
PAPER-QUALITY M EASU REM ENTS
Strength Specimen
Density
Machine direction
Cross direction
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
.801 .824 .841 .816 .840 .842 .820 .802 .828 .819 .826 .802 .810 .802 .832 .796 .759 .770 .759 .772 .806 .803 .845 .822
121.41 127.70 129.20 131 .80 135.10 131.50 126.70 115.10 130.80 124.60 118.31 114.20 120.30 115.70 117.51 109.81 109.10 115.10 118.31 112.60 116.20 1 18.00 131.00 125.70
70.42 72.47 78.20 74.89 71.21 78.39 69.02 73.10 79.28 76.48 70.25 72.88 68.23 68.12 71.62 53.10 50.85 51.68 50.60 53.51 56.53 70.70 74.35 68.29
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
.816 .836 .815 .822 .822 .843 .824 .788 .782 .795 .805 .836 .788 .772 .776 .758
125.80 125.50 127.80 130.50 127.90 123.90 124.10 120.80 107.40 120.70 121.91 122.31 1 10.60 103.51 110.71 113.80
70.64 76.33 76.75 80.33 75.68 78.54 71.91 68.22 54.42 70.41 73.68 74.93 53.52 48.93 53.67 52.42
25
.971
126.10
72:10
Source: Data courtesy of SONOCO Products, Inc. 17
Strength (MD)
Density Max ... 00
-� �
~
Med Min
�
1 � Cll
.
0.97
.
0.8 1
.
.
.
.
.
.
.
..
.
.
.
·· �·= .
.
.
. .
.
Med
.
.
.
.
. . .
. .
.
.
.
.
.
.
..
.
.
.
.
.
.
: : '·
121.4
.
. .
.
. .
.
.
.
--
Min
103.5
.
.
Max
. . . . · � .·...· ·
. 4 .... . . .
: : ·· · .
.·
.
.
.
. .
.
135. 1
.
.
.
,
.
T
.
. ;-
.
.
0.76 Max
. . . . . .. . . . .. . . . . . .. • . .
.
�.
.
.
.
.
� Cll
. .
.
.
l
· ·' . .
. .. . ·:. .· .: .. . . .. . t . . . ..
Strength (CD)
.
. .
.
. .
.
.
.
.
.. .
-
.
.
. .
.
.
. . .
.
. .
. .
.
. .
.
Med
..
.
.
.
80.33
70.70
.
. .
.
---
Figure 1 .5
.
T
---
-
Min
_l_
Scatter plots and boxplots of paper-quality data from Table 1 .2 .
48.93
Sec. 1 .4
Data Displays and Pictorial Representations
19
[xli]
p points i n n dimensions. The n observations o f the p variables can also be regarded as p points in n-dimensional space. Each column of X determines one of the points. The ith column,
x:, ;
X2 ;
consisting of all n measurements on the ith variable, determines the ith point. In Chapter 3, we show how the closeness of points in n dimensions can be related to measures of association between the corresponding variables. 1 .4 DATA DISPLAYS AND PICTORIAL REPRESENTATIONS
The rapid development of powerful personal computers and workstations has led to a proliferation of sophisticated statistical software for data analysis and graph ics. It is often possible, for example, to sit at one ' s desk and examine the nature of multidimensional data with clever computer-generated pictures. These pictures are valuable aids in understanding data and often prevent many false starts and sub sequent inferential problems. As we shall see in Chapters 8 and 12, there are several techniques that seek to represent p-dimensional observations in few dimensions such that the original distances ( or similarities between pairs of observations are (nearly preserved. In general, if multidimensional observations can be represented in two dimensions, then outliers, relationships, and distinguishable groupings can often be discerned by eye. We shall discuss and illustrate several methods for displaying multivariate data in two dimensions. One good source for more discussion of graphical methods is [12].
)
)
Linking Multiple Two-Dimensional Scatter Plots
One of the more exciting new graphical procedures involves electronically con necting many two-dimensional scatter plots. Example 1 .6
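The interactive linking and brushing described next require dedicated dynamic-graphics software, but a static all-pairs display of the same kind can be produced in most modern environments. The following sketch is our own addition; it assumes the Table 1.2 paper-quality columns have been read into Python lists (only the first three rows are shown here).

    import pandas as pd
    from pandas.plotting import scatter_matrix
    import matplotlib.pyplot as plt

    # density, machine, cross hold the Table 1.2 columns (41 values each in full).
    density = [0.801, 0.824, 0.841]     # ...
    machine = [121.41, 127.70, 129.20]  # ...
    cross = [70.42, 72.47, 78.20]       # ...

    data = pd.DataFrame({"x1_density": density,
                         "x2_machine": machine,
                         "x3_cross": cross})
    scatter_matrix(data, diagonal="hist")  # one panel per pair of variables
    plt.show()

This gives the static analogue of Figure 1.6; the highlighting and deletion operations discussed in Example 1.6 are features of interactive systems rather than of this plot itself.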
x1
(Linked scatter plots and brushing)
x3
To illustrate linked two dimensional scatter plots, we refer to the paper-qual ity data in Table 1 .2. These data represent measurements on the variables = density, x 2 = strength in the machine direction, and = strength in the cross direction. Figure 1.6 shows two-dimensional scatter plots for pairs of these variables organized as a 3 3 array. For example, the picture in the upper left-hand comer of the figure is a scatter plot of the pairs of observa tions That is, the values are plotted along the horizontal axis, and the values are plotted along the vertical axis. The lower right-hand comer
x3(x1 , x3 ).
x1
X
20
Chap. 1
. . ., . . . �· . . . .., . ... .,
. .... . . .
.
. .
.
Aspects of M u l tivariate Ana lysis
.
. ."' ' . .,. .'. . . ··'. · . '
. , ._
.
( x3 )
Cross
48 . 9
.
(x2)
. .
. 97 1
. 758
,
Machine
1 04
(xl )
80 . 3
1 35
.
Density
. .... .. ' . .. . , .
. "" . . . . . . .
.. . r .
.
"'
.
I
.;
.
.. . . . I.
. . ... '.. . , · . ··' . i• . . . . . .. . . . . . ...... . '
,
. ... . .
. . . . .: ' .. I I ... .
I
. � ·..'- ·... -. .
.. ·:« ..
Figure 1 .6 Scatter plots for the paper quality data of Table 1 .2.
of the figure contains a scatter plot of the observations ( x3 , x 1 ) . That is, the axes are reversed. Corresponding interpretations hold for the other scatter plots in the figure. Notice that the variables and their three-digit ranges are indicated in the boxes along the SW-NE diagonal. The operation of mark ing (selecting) the obvious outlier in the ( x 1 , x3 ) scatter plot of Figure 1 .6 cre ates Figure 1.7(a), where the outlier is labeled as specimen 25 and the same data point is highlighted in all the scatter plots. Specimen 25 also appears to be an outlier in the ( x1 , x2 ) scatter plot but not in the ( x2 , x3 ) scatter plot. The operation of deleting this specimen leads to the modified scatter plots of Figure 1 .7(b). From Figure 1.7, we notice that some points in, for example, the ( x2 , x3 ) scatter plot seem to be disconnected from the others. Selecting these points, using the (dashed) rectangle (see page 22), highlights the selected points in all of the other scatter plots and leads to the display in Figure 1 .8(a) . Further checking revealed that specimens 16-21, specimen 34, and specimens 3�1 were actually specimens from an older roll of paper that was included in order to have enough plies in the cardboard being manufactured. Deleting the out lier and the cases corresponding to the older paper and adjusting the ranges of the remaining observations leads to the scatter plots in Figure 1.8(b ). The operation of highlighting points corresponding to a selected range of one of the variables is called brushing. Brushing could begin with a rec tangle, as in Figure 1.8(a), but then the brush could be moved to provide a
Figure 1.7 Modified scatter plots for the paper-quality data with outlier (25) (a) selected and (b) deleted.
Figure 1.8 Modified scatter plots with (a) group of points selected and (b) points, including specimen 25, deleted and the scatter plots rescaled.
sequence of highlighted points. The process can be stopped at any time to provide a snapshot of the current situation. •

Scatter plots like those in Example 1.6 are extremely useful aids in data analysis. Another important new graphical technique uses software that allows the data analyst to view high-dimensional data as slices of various three-dimensional perspectives. This can be done dynamically and continuously until informative views are obtained. A comprehensive discussion of dynamic graphical methods is available in [1]. A strategy for on-line multivariate exploratory graphical analysis, motivated by the need for a routine procedure for searching for structure in multivariate data, is given in [31].
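To make the idea concrete, the following is a minimal sketch (not from the text) of how a scatter-plot matrix like Figure 1.6, together with a simple range-based "brush," might be produced with standard Python tools. The file name paper_quality.csv, the column names, and the selected range are assumptions for illustration only.

```python
# Sketch of a scatter-plot matrix and a crude "brush" for three paper-quality variables.
# Assumes a file "paper_quality.csv" with columns density, machine, cross.
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("paper_quality.csv")          # hypothetical file name
pd.plotting.scatter_matrix(
    data[["density", "machine", "cross"]],       # the three measured variables
    figsize=(6, 6), diagonal="hist")
plt.show()

# Brushing by a range of one variable: flag the selected cases so they can be
# highlighted (or listed) in every panel.
selected = data["machine"].between(104, 135)     # illustrative range of machine-direction strength
print(data[selected])
```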
Example 1.7 (Plots in three dimensions)
Four different measurements of lumber stiffness are given in Table 4.3, page 198. In Example 4.13, specimen (board) 16 and possibly specimen (board) 9 are identified as unusual observations. Figures 1.9(a), (b), and (c) contain perspectives of the stiffness data in the x1, x2, x3 space. These views were obtained by continually rotating and turning the three-dimensional coordinate axes. Spinning the coordinate axes allows one to get a better understanding of the three-dimensional aspects of the data. Figure 1.9(d) gives one picture of the stiffness data in x2, x3, x4 space. Notice that Figures 1.9(a) and (d) visually confirm specimens 9 and 16 as outliers.
Figure 1.9 Three-dimensional perspectives for the lumber stiffness data.
Specimen 9 is very large in all three coordinates. A counterclockwise-like rotation of the axes in Figure 1.9(a) produces Figure 1.9(b), and the two unusual observations are masked in this view. A further spinning of the x2, x3 axes gives Figure 1.9(c); one of the outliers (16) is now hidden. Additional insights can sometimes be gleaned from visual inspection of the slowly spinning data. It is this dynamic aspect that statisticians are just beginning to understand and exploit. •
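A minimal sketch (not from the text) of viewing the same point cloud from two rotations, in the spirit of Figure 1.9, is given below; the data generated here are placeholders standing in for three of the stiffness measurements.

```python
# Rotating a three-dimensional scatter plot to look for unusual points.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x1, x2, x3 = rng.normal(size=(3, 30))            # placeholder data for 30 boards
fig = plt.figure(figsize=(8, 4))
for i, (elev, azim) in enumerate([(20, 30), (20, 120)], start=1):
    ax = fig.add_subplot(1, 2, i, projection="3d")
    ax.scatter(x1, x2, x3)
    ax.view_init(elev=elev, azim=azim)           # a different rotation in each panel
    ax.set_xlabel("x1"); ax.set_ylabel("x2"); ax.set_zlabel("x3")
plt.show()
```

An outlier that is obvious in one orientation can be hidden in another, which is why interactive spinning, rather than a single static view, is recommended.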
Plots like those in Figure 1.9 allow one to identify readily observations that do not conform to the rest of the data and that may heavily influence inferences based on standard data-generating models.

We now turn to two popular pictorial representations of multivariate data in two dimensions: stars and Chernoff faces.

Stars
Suppose each data unit consists of nonnegative observations on p ≥ 2 variables. In two dimensions, we can construct circles of a fixed (reference) radius with p equally spaced rays emanating from the center of the circle. The lengths of the rays represent the values of the variables. The ends of the rays can be connected with straight lines to form a star. Each star represents a multivariate observation, and the stars can be grouped according to their (subjective) similarities.

It is often helpful, when constructing the stars, to standardize the observations. In this case some of the observations will be negative. The observations can then be reexpressed so that the center of the circle represents the smallest standardized observation within the entire data set.
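The following is a minimal sketch (not from the text) of drawing a single star for one standardized observation, following the construction just described; the vector z of standardized values is assumed for illustration.

```python
# One "star" for a single observation on p = 8 standardized variables.
import numpy as np
import matplotlib.pyplot as plt

z = np.array([0.5, -1.2, 0.3, 1.6, -0.4, 0.9, -1.6, 0.2])   # assumed standardized values
lengths = z - z.min()                       # smallest value sits at the circle's center
angles = np.linspace(0, 2 * np.pi, len(z), endpoint=False) + np.pi / 2

ax = plt.subplot(projection="polar")
ax.plot(np.append(angles, angles[0]),       # close the outline of the star
        np.append(lengths, lengths[0]))
ax.set_xticks(angles)                       # one ray per variable
ax.set_yticklabels([])
plt.show()
```

Drawing one such star per observation, on identical scales, gives a display like Figure 1.10.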
Example 1.8 (Utility data as stars)
Stars representing the first 5 of the 22 public utility firms in Table 12.5, page 747, are shown in Figure 1.10. There are eight variables; consequently, the stars are distorted octagons. The observations on all variables were standardized. Among the first five utilities, the smallest standardized observation for any variable was -1.6. Treating this value as zero, the variables are plotted on identical scales along eight equiangular rays originating from the center of the circle. The variables are ordered in a clockwise direction, beginning in the 12 o'clock position.

At first glance, none of these utilities appears to be similar to any other. However, because of the way the stars are constructed, each variable gets equal weight in the visual impression. If we concentrate on the variables 6 (sales in KWH use per year) and 8 (total fuel costs in cents per KWH), then Boston Edison and Consolidated Edison are similar (small variable 6, large variable 8), and Arizona Public Service, Central Louisiana Electric, and Commonwealth Edison are similar (moderate variable 6, moderate variable 8). •
Figure 1.10 Stars for the first five public utilities: Arizona Public Service (1), Boston Edison Co. (2), Central Louisiana Electric Co. (3), Commonwealth Edison Co. (4), and Consolidated Edison Co. (NY) (5).
Chernoff Faces
People react to faces. Chernoff [6] suggested representing p-dimensional observations as a two-dimensional face whose characteristics (face shape, mouth curvature, nose length, eye size, pupil position, and so forth) are determined by the measurements on the p variables.
As originally designed, Chernoff faces can handle up to 18 variables. The assignment of variables to facial features is done by the experimenter, and different choices produce different results. Some iteration is usually necessary before satisfactory representations are achieved.

Chernoff faces appear to be most useful for verifying (1) an initial grouping suggested by subject-matter knowledge and intuition or (2) final groupings produced by clustering algorithms.
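A minimal sketch (not from the text) of the Chernoff idea is given below: two standardized variables drive two facial features, face width and mouth curvature. The variable values and the particular feature assignments are assumptions chosen only to illustrate the construction.

```python
# Two standardized variables mapped to two facial features.
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse, Arc

def face(ax, z_width, z_smile):
    w = 1.0 + 0.3 * z_width                                   # face width from variable 1
    ax.add_patch(Ellipse((0, 0), 2 * w, 2.4, fill=False))
    for x in (-0.4, 0.4):                                     # eyes
        ax.add_patch(Ellipse((x, 0.4), 0.25, 0.15, fill=False))
    t1, t2 = (200, 340) if z_smile >= 0 else (20, 160)        # smile or frown from variable 2
    ax.add_patch(Arc((0, -0.5), 0.8, 0.5, theta1=t1, theta2=t2))
    ax.set_xlim(-2, 2); ax.set_ylim(-2, 2)
    ax.set_aspect("equal"); ax.axis("off")

fig, axes = plt.subplots(1, 3)
for ax, (zw, zs) in zip(axes, [(-1.5, -1.0), (0.0, 0.5), (1.5, 1.0)]):   # assumed observations
    face(ax, zw, zs)
plt.show()
```

A full Chernoff display extends the same idea to many more features (nose length, eye size, and so on), one feature per variable.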
Example 1.9 (Utility data as Chernoff faces)
From the data in Table 12.5, the 22 public utility companies were represented as Chernoff faces. We have the following correspondences:

Variable                                 Facial characteristic
X1: Fixed-charge coverage                Half-height of face
X2: Rate of return on capital            Face width
X3: Cost per KW capacity in place        Position of center of mouth
X4: Annual load factor                   Slant of eyes
X5: Peak KWH demand growth from 1974     Eccentricity (height/width) of eyes
X6: Sales (KWH use per year)             Half-length of eye
X7: Percent nuclear                      Curvature of mouth
X8: Total fuel costs (cents per KWH)     Length of nose
The Chernoff faces are shown in Figure 1.11. We have subjectively grouped "similar" faces into seven clusters. If a smaller number of clusters is desired, we might combine clusters 5, 6, and 7 and, perhaps, clusters 2 and 3 to obtain four or five clusters. For our assignment of variables to facial features, the firms group largely according to geographical location. •

Constructing Chernoff faces is a task that must be done with the aid of a computer. The data are ordinarily standardized within the computer program as part of the process for determining the locations, sizes, and orientations of the facial characteristics. With some training, we can use Chernoff faces to communicate similarities or dissimilarities, as the next example indicates.
Example 1.10 (Using Chernoff faces to show changes over time)
Figure 1.12 illustrates an additional use of Chernoff faces. (See [23].) In the figure, the faces are used to track the financial well-being of a company over time. As indicated, each facial feature represents a single financial indicator, and the longitudinal changes in these indicators are thus evident at a glance. •
Figure 1.11 Chernoff faces for 22 public utilities, grouped into seven clusters.

Figure 1.12 Chernoff faces over time (1975-1979).
Chernoff faces have also been used to display differences in multivariate observations in two dimensions. For example, the two-dimensional coordinate axes might represent latitude and longitude (geographical location), and the faces might represent multivariate measurements on several U.S. cities. Additional examples of this kind are discussed in [29].

There are several ingenious ways to picture multivariate data in two dimensions. We have described some of them. Further advances are possible and will almost certainly take advantage of improved computer graphics.

1.5 DISTANCE
Although they may at first appear formidable, most multivariate techniques are based upon the simple concept of distance. Straight-line, or Euclidean, distance should be familiar. If we consider the point P = (x1, x2) in the plane, the straight-line distance, d(O, P), from P to the origin O = (0, 0) is, according to the Pythagorean theorem,

d(O, P) = \sqrt{x_1^2 + x_2^2}     (1-9)
The situation is illustrated in Figure 1.13. In general, if the point P has p coordinates so that P = (x1, x2, ..., xp), the straight-line distance from P to the origin O = (0, 0, ..., 0) is

d(O, P) = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}     (1-10)

(See Chapter 2.) All points (x1, x2, ..., xp) that lie a constant squared distance, such as c^2, from the origin satisfy the equation

d^2(O, P) = x_1^2 + x_2^2 + \cdots + x_p^2 = c^2     (1-11)

Because this is the equation of a hypersphere (a circle if p = 2), points equidistant from the origin lie on a hypersphere.

The straight-line distance between two arbitrary points P and Q with coordinates P = (x1, x2, ..., xp) and Q = (y1, y2, ..., yp) is given by

d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}     (1-12)
Straight-line, or Euclidean, distance is unsatisfactory for most statistical purposes. This is because each coordinate contributes equally to the calculation of Euclidean distance.

Figure 1.13 Distance given by the Pythagorean theorem.
When the coordinates represent measurements that are subject to random fluctuations of differing magnitudes, it is often desirable to weight coordinates subject to a great deal of variability less heavily than those that are not highly variable. This suggests a different measure of distance.

Our purpose now is to develop a "statistical" distance that accounts for differences in variation and, in due course, the presence of correlation. Because our choice will depend upon the sample variances and covariances, at this point we use the term statistical distance to distinguish it from ordinary Euclidean distance. It is statistical distance that is fundamental to multivariate analysis.

To begin, we take as fixed the set of observations graphed as the p-dimensional scatter plot. From these, we shall construct a measure of distance from the origin to a point P = (x1, x2, ..., xp). In our arguments, the coordinates (x1, x2, ..., xp) of P can vary to produce different locations for the point. The data that determine distance will, however, remain fixed.

To illustrate, suppose we have n pairs of measurements on two variables. Call the variables x1 and x2, and assume that the x1 measurements vary independently of the x2 measurements.¹ In addition, assume that the variability in the x1 measurements is larger than the variability in the x2 measurements. A scatter plot of the data would look something like the one pictured in Figure 1.14.

Glancing at Figure 1.14, we see that values which are a given deviation from the origin in the x1 direction are not as "surprising" or "unusual" as are values equidistant from the origin in the x2 direction. This is because the inherent variability in the x1 direction is greater than the variability in the x2 direction. Consequently, large x1 coordinates (in absolute value) are not as unexpected as large x2 coordinates. It seems reasonable, then, to weight an x2 coordinate more heavily than an x1 coordinate of the same value when computing the "distance" to the origin.
Figure 1.14 A scatter plot with greater variability in the x1 direction than in the x2 direction.
¹ At this point, "independently" means that the x2 measurements cannot be predicted with any accuracy from the x1 measurements, and vice versa.
One way to proceed is to divide each coordinate by the sample standard deviation. Therefore, upon division by the standard deviations, we have the "standardized" coordinates x_1^* = x_1/\sqrt{s_{11}} and x_2^* = x_2/\sqrt{s_{22}}. The standardized coordinates are now on an equal footing with one another. After taking the differences in variability into account, we determine distance using the standard Euclidean formula.

Thus, a statistical distance of the point P = (x1, x2) from the origin O = (0, 0) can be computed from its standardized coordinates x_1^* = x_1/\sqrt{s_{11}} and x_2^* = x_2/\sqrt{s_{22}} as

d(O, P) = \sqrt{(x_1^*)^2 + (x_2^*)^2} = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}}     (1-13)

Comparing (1-13) with (1-9), we see that the difference between the two expressions is due to the weights k_1 = 1/s_{11} and k_2 = 1/s_{22} attached to x_1^2 and x_2^2 in (1-13). Note that if the sample variances are the same, k_1 = k_2, and x_1^2 and x_2^2 will receive the same weight. In cases where the weights are the same, it is convenient to ignore the common divisor and use the usual Euclidean distance formula. In other words, if the variability in the x1 direction is the same as the variability in the x2 direction, and the x1 values vary independently of the x2 values, Euclidean distance is appropriate.

Using (1-13), we see that all points which have coordinates (x1, x2) and are a constant squared distance c^2 from the origin must satisfy

\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} = c^2     (1-14)
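A minimal sketch (not from the text) of the statistical distance in (1-13) is given below: each coordinate is divided by its sample standard deviation before the ordinary Euclidean formula is applied. The sample variances used are assumed values.

```python
# Statistical distance of a point from the origin when the coordinates vary
# independently, as in (1-13).
import numpy as np

s11, s22 = 4.0, 1.0                      # assumed sample variances of x1 and x2
def statistical_distance(x1, x2):
    return np.sqrt(x1**2 / s11 + x2**2 / s22)

print(statistical_distance(2.0, 0.0))    # 1.0: the point (2, 0) is one unit from the origin
print(statistical_distance(0.0, 1.0))    # 1.0: so is (0, 1), despite being "closer" in Euclidean terms
```

Points at the same statistical distance therefore trace out an ellipse, as stated in (1-14), rather than a circle.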
Equation (1-14) is the equation of an ellipse centered at the origin whose major and minor axes coincide with the coordinate axes. That is, the statistical distance in (1-13) has an ellipse as the locus of all points a constant distance from the origin. This general case is shown in Figure 1.15.

Figure 1.15 The ellipse of constant statistical distance d^2(O, P) = x_1^2/s_{11} + x_2^2/s_{22} = c^2.

Example 1.11 (Calculating a statistical distance)
A set of paired measurements (x1, x2) on two variables yields x̄1 = x̄2 = 0, s11 = 4, and s22 = 1. Suppose the x1 measurements are unrelated to the x2 measurements; that is, measurements within a pair vary independently of one another. Since the sample variances are unequal, we measure the square of the distance of an arbitrary point P = (x1, x2) to the origin O = (0, 0) by

d^2(O, P) = \frac{x_1^2}{4} + \frac{x_2^2}{1}

All points (x1, x2) that are a constant distance 1 from the origin satisfy the equation

\frac{x_1^2}{4} + \frac{x_2^2}{1} = 1

The coordinates of some points a unit distance from the origin are presented in the following table:

Coordinates (x1, x2)     Distance: x_1^2/4 + x_2^2/1
(0, 1)                   0^2/4 + 1^2/1 = 1
(0, -1)                  0^2/4 + (-1)^2/1 = 1
(2, 0)                   2^2/4 + 0^2/1 = 1
(1, \sqrt{3}/2)          1^2/4 + (\sqrt{3}/2)^2/1 = 1
A plot of the equation x_1^2/4 + x_2^2/1 = 1 is an ellipse centered at (0, 0) whose major axis lies along the x1 coordinate axis and whose minor axis lies along the x2 coordinate axis. The half-lengths of these major and minor axes are \sqrt{4} = 2 and \sqrt{1} = 1, respectively. The ellipse of unit distance is plotted in Figure 1.16. All points on the ellipse are regarded as being the same statistical distance from the origin, in this case a distance of 1. •

Figure 1.16 Ellipse of unit distance, x_1^2/4 + x_2^2/1 = 1.
The expression in (1-13) can be generalized to accommodate the calculation of statistical distance from an arbitrary point P = (x1, x2) to any fixed point Q = (y1, y2). If we assume that the coordinate variables vary independently of one another, the distance from P to Q is given by

d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}}}     (1-15)

The extension of this statistical distance to more than two dimensions is straightforward. Let the points P and Q have p coordinates such that P = (x1, x2, ..., xp) and Q = (y1, y2, ..., yp). Suppose Q is a fixed point [it may be the origin O = (0, 0, ..., 0)] and the coordinate variables vary independently of one another. Let s11, s22, ..., spp be sample variances constructed from n measurements on x1, x2, ..., xp, respectively. Then the statistical distance from P to Q is

d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \cdots + \frac{(x_p - y_p)^2}{s_{pp}}}     (1-16)

All points P that are a constant squared distance from Q lie on a hyperellipsoid centered at Q whose major and minor axes are parallel to the coordinate axes. We note the following:

1. The distance of P to the origin O is obtained by setting y_1 = y_2 = \cdots = y_p = 0 in (1-16).
2. If s_{11} = s_{22} = \cdots = s_{pp}, the Euclidean distance formula in (1-12) is appropriate.
The distance in (1-16) still does not include most of the important cases we shall encounter, because of the assumption of independent coordinates. The scatter plot in Figure 1.17 depicts a two-dimensional situation in which the x1 measurements do not vary independently of the x2 measurements.

Figure 1.17 A scatter plot for positively correlated measurements and a rotated coordinate system.
In fact, the coordinates of the pairs (x1, x2) exhibit a tendency to be large or small together, and the sample correlation coefficient is positive. Moreover, the variability in the x2 direction is larger than the variability in the x1 direction.

What is a meaningful measure of distance when the variability in the x1 direction is different from the variability in the x2 direction and the variables x1 and x2 are correlated? Actually, we can use what we have already introduced, provided that we look at things in the right way. From Figure 1.17, we see that if we rotate the original coordinate system through the angle θ while keeping the scatter fixed and label the rotated axes x̃1 and x̃2, the scatter in terms of the new axes looks very much like that in Figure 1.14. (You may wish to turn the book to place the x̃1 and x̃2 axes in their customary positions.) This suggests that we calculate the sample variances using the x̃1 and x̃2 coordinates and measure distance as in Equation (1-13). That is, with reference to the x̃1 and x̃2 axes, we define the distance from the point P = (x̃1, x̃2) to the origin O = (0, 0) as

d(O, P) = \sqrt{\frac{\tilde{x}_1^2}{\tilde{s}_{11}} + \frac{\tilde{x}_2^2}{\tilde{s}_{22}}}     (1-17)

where s̃11 and s̃22 denote the sample variances computed with the x̃1 and x̃2 measurements.

The relation between the original coordinates (x1, x2) and the rotated coordinates (x̃1, x̃2) is provided by

\tilde{x}_1 = x_1 \cos(\theta) + x_2 \sin(\theta)
\tilde{x}_2 = -x_1 \sin(\theta) + x_2 \cos(\theta)     (1-18)
Given the relations in (1-18), we can formally substitute for x̃1 and x̃2 in (1-17) and express the distance in terms of the original coordinates.

After some straightforward algebraic manipulations, the distance from P = (x1, x2) to the origin O = (0, 0) can be written in terms of the original coordinates x1 and x2 of P as

d(O, P) = \sqrt{a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2}     (1-19)

where the a's are numbers such that the distance is nonnegative for all possible values of x1 and x2. Here a11, a12, and a22 are determined by the angle θ, and s11, s12, and s22 are calculated from the original data.² The particular forms for a11, a12, and

² Specifically,

a_{11} = \frac{\cos^2(\theta)}{\cos^2(\theta) s_{11} + 2\sin(\theta)\cos(\theta) s_{12} + \sin^2(\theta) s_{22}} + \frac{\sin^2(\theta)}{\cos^2(\theta) s_{22} - 2\sin(\theta)\cos(\theta) s_{12} + \sin^2(\theta) s_{11}}

a_{22} = \frac{\sin^2(\theta)}{\cos^2(\theta) s_{11} + 2\sin(\theta)\cos(\theta) s_{12} + \sin^2(\theta) s_{22}} + \frac{\cos^2(\theta)}{\cos^2(\theta) s_{22} - 2\sin(\theta)\cos(\theta) s_{12} + \sin^2(\theta) s_{11}}

a_{12} = \frac{\cos(\theta)\sin(\theta)}{\cos^2(\theta) s_{11} + 2\sin(\theta)\cos(\theta) s_{12} + \sin^2(\theta) s_{22}} - \frac{\sin(\theta)\cos(\theta)}{\cos^2(\theta) s_{22} - 2\sin(\theta)\cos(\theta) s_{12} + \sin^2(\theta) s_{11}}
Chap. 1
Aspects of M u ltiva riate Analysis
a22 are not important at this point. What is important is the appearance of the cross-product term 2a 12 x1 x 2 necessitated by the nonzero correlation r1 2 . Equation ( 1-1 9) can be compared with (1-13). The expression in (1-13) can be regarded as a special case of ( 1-1 9) with a l l = 1/s 1 1 , a22 = 1/szz , and a 12 = 0. In general, the statistical distance of the point P = ( x 1 , x 2 ) from the fixed point Q = (y 1 , y2 ) for situations in which the variables are correlated has the gen eral form
and can always be computed once coordinates of all points P = (x 1 , from Q satisfy
a 1 1 , a 12 , and a22 are known. In addition, the2 x 2 ) that are a constant squared distance c
a l l (xl - Y1 ) 2 + 2a 12 (xl - Y1 Hxz - Yz ) + azz (Xz - Yz ) 2
=
C 2 (1-21)
By definition, this is the equation of an ellipse centered at Q. The graph of such an equation is displayed in Figure 1.18. The major (long) and minor (short) axes are indicated. They are parallel to the x1 and x2 axes. For the choice of a l l , a 1 2 , and a22 in footnote 2, the .X1 and x2 axes are at an angle () with respect to the x 1 and x 2 axes. The generalization of the distance formulas of (1-19) and (1-20) to p dimen sions is straightforward. Let P = (x 1 , x2 , , xp ) be a point whose coordinates rep resent variables that are correlated and subject to inherent variability. Let 0 = (0, 0, . . . , 0) denote the origin, and let Q = (y 1 , y2 , , Yp ) be a specified fixed point. Then the distances from P to 0 and from P to Q have the general forms • • •
• . .
d (O, P ) (1-22) and
Xz
/
/
/
'
Figure 1 . 1 8 Ellipse of points a constant distance from the point Q.
Sec. 1 5 .
35
Distance
d (P, Q )
[a 1 1 (x 1 - y 1 ) 2 + a22 (x 2 - Y2 ) 2 + . . . + app (xp - Yp ) 2 + 2al 2 (xl - Y 1 Hxz - Yz ) + 2a t 3 (xl - Yt Hx3 - Y3 ) + . . . + 2ap - t,p (xp - t - Yp - t ) (xP - Yp )]
{1-23)
where the a ' s are numbers such that the distances are always nonnegative ? We note that the distances in (1-22) and (1-23) are completely determined by the coefficients (weights) a ik• i = 1, 2, . . . , p, k = 1, 2, . . . , p. These coefficients can be set out in the rectangular array (1-24) where the a;k ' s with i � k are displayed twice, since they are multiplied by 2 in the distance formulas. Consequently, the entries in this array specify the distance func tions. The a;k ' s cannot be arbitrary numbers; they must be such that the computed distance is nonnegative for every pair of points. (See Exercise 1 .10.) Contours of constant distances computed from (1-22) and (1-23) are hyper ellipsoids. A hyperellipsoid resembles a football when p = 3; it is impossible to visualize in more than three dimensions. The need to consider statistical rather than Euclidean distance is illustrated heuristically in Figure 1 .19. Figure 1.19 depicts a cluster of points whose center of gravity (sample mean) is indicated by the point Q. Consider the Euclidean dis tances from the point Q to the point P and the origin 0. The Euclidean distance from Q to P is larger than the Euclidean distance from Q to 0. However, P appears to be more like the points in the cluster than does the origin. If we take into account the variability of the points in the cluster and measure distance by the
• • • • • • • • • • • • • • •• • • •• • • • • • • • • • • • • Q :• • • .•••. .•� f'i' .• .• •
p� r;:..
. . . . . . :.· ·. • • . • . • •. •. •
----e......=- .----e,__::.. -+-------'Jio- x i
--
•
•
0
Figure 1 . 1 9 A cluster of points relative to a point P and the origi n .
3 The algebraic expressions for the sq uares o f the distances in (1-22) and ( 1 -23) are known as uadratic forms and, in particular, positive definite q uadratic forms. It is possible to display these qua q dratic forms in a simpler manner using matrix algebra; we shall do so in Section 2.3 of Chapter 2.
If we take into account the variability of the points in the cluster and measure distance by the statistical distance in (1-20), then Q will be closer to P than to O. This result seems reasonable given the nature of the scatter.

Other measures of distance can be advanced. (See Exercise 1.12.) At times, it is useful to consider distances that are not related to circles or ellipses. Any distance measure d(P, Q) between two points P and Q is valid provided that it satisfies the following properties, where R is any other intermediate point:

d(P, Q) = d(Q, P)
d(P, Q) > 0 if P ≠ Q
d(P, Q) = 0 if P = Q
d(P, Q) ≤ d(P, R) + d(R, Q)   (triangle inequality)     (1-25)
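To tie the pieces together, the following is a minimal sketch (not from the text) of the general statistical distance (1-20), a quadratic form in the coordinate differences. The weights a11, a12, a22 are taken as given; in practice they are obtained from the sample variances and covariance as in footnote 2.

```python
# General statistical distance between two points in the plane, as in (1-20).
import math

def statistical_distance(p, q, a11, a12, a22):
    d1, d2 = p[0] - q[0], p[1] - q[1]
    return math.sqrt(a11 * d1**2 + 2 * a12 * d1 * d2 + a22 * d2**2)

# Illustrative weights (these happen to be the values used in Exercise 1.8).
print(statistical_distance((-1, -1), (1, 0), a11=1/3, a12=1/9, a22=4/27))
```

With a12 = 0 and a11 = 1/s11, a22 = 1/s22, this reduces to the independent-coordinates distance (1-15).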
1 .6 FINAL COMMENTS
We have attempted to motivate the study of multivariate analysis and to provide you with some rudimentary, but important, methods for organizing, summarizing, and displaying data. In addition, a general concept of distance has been introduced that will be used repeatedly in later chapters.

EXERCISES
1.1. Consider the seven pairs of measurements (x1, x2) plotted in Figure 1.1:

     x1:  3    4    2    6    8    2    5
     x2:  5    5.5  4    7    10   5    7.5

     Calculate the sample means x̄1 and x̄2, the sample variances s11 and s22, and the sample covariance s12.

1.2.
A morning newspaper lists the following used-car prices for a foreign com pact with age x 1 measured in years and selling price x 2 measured in thou sands of dollars: 8 9 10 1 1 7 5 7 7 5 3 xi x 2 2.30 1.90 1.00 .70 .30 1.00 1.05 .45 .70 .30 (a)
Construct a scatter plot of the data and marginal dot diagrams.
     (b) Infer the sign of the sample covariance s12 from the scatter plot.
     (c) Compute the sample means x̄1 and x̄2 and the sample variances s11 and s22. Compute the sample covariance s12 and the sample correlation coefficient r12. Interpret these quantities.
     (d) Display the sample mean array x̄, the sample variance-covariance array Sn, and the sample correlation array R using (1-8).

1.3.
The following are five measurements on the variables x1 , x 2 , and x3:
Xz
2 8 4
9 12 3
6 6 0
5 4 2
8 10 1
Find the arrays x, S11, and R. 1.4. The 10 largest U.S. industrial corporations yield the following data:
x3 = assets x1 = sales x2 = profits (millions of dollars) (millions of dollars) (millions of dollars)
Company General Motors Ford Exxon IBM General Electric Mobil Philip Morris Chrysler Du Pont Texaco
Source: "Fortune 500," (a)
126,974 96,933 86,656 63,438 55,264 50,976 39,069 36,156 35,209 32,416 Fortune,
4,224 3,835 3,510 3,758 3,939 1,809 2,946 359 2,480 2,413
121
173,297 160,893 83,219 77,734 128,344 39,080 38,528 51,038 34,715 25,636
(April 23, 1990), 346-367.
Plot the scatter diagram and marginal dot diagrams for variables x1 and
x2• Comment on the appearance of the diagrams. (b) Compute :X1 , x2 , s 1 1 , s22 , and r1 2 . Interpret r1 2• Sw
Use the data in Exercise 1.4. (a) Plot the scatter diagrams and dot diagrams for ( x 2 , x3) and ( x1 , x3). Com ment on the patterns. (b) Compute the x, S11, and R arrays for ( x1 , x 2 , x3 ) . 1.6. The data in Table 1.3 are 42 measurements on air-pollution variables recorded at 12:00 noon in the Los Angeles area on different days. (See also the air-pollution data on the data disk.) (a) Plot the marginal dot diagrams for all the variables. (b) Construct the x , S11 , and R arrays, and interpret the entries in R. 1.7. You are given the following n = 3 observations on p = 2 variables: 1.5.
Variable 1: Variable 2:
x1 2 = 1 x22 = 2
x3 2 = 4
TABLE 1 .3 AI R-POLLUTION DATA
Wind (x 1 )
Solar radiation (x2 )
8 7 7 10 6 8 9 5 7 8 6 6 7 10 10 9 8 8 9 9 10 9 8 5 6 8 6 8 6 10 8 7 5 6 10 8 5 5 7 7 6 8
98 107 103 88 91 90 84 72 82 64 71 91 72 70 72 77 76 71 67 69 62 88 80 30 83 84 78 79 62 37 71 52 48 75 35 85 86 86 79 79 68 40
CO (x3 ) NO (x4 ) N02 (x5) 03 (x6 ) 7 4 4 5 4 5 7 6 5 5 5 4 7 4 4 4 4 5 4 3 5 4 4 3 5 3 4 2 4 3 4 4 6 4 4 4 3 7 7 5 6 4
Source: Data courtesy of Profes or G. C. Tiao.
2 3 3 2 2 2 4 4 1 2 4 2 4 2 1 1 1 3 2 3 3 2 2 3 1 2 2 1 3 1 1 1 5 1 1 1 1 2 4 2 2 3
12 9 5 8 8 12 12 21 11 13 10 12 18 11 8 9 7 16 13 9 14 7 13 5 10 7 11 7 9 7 10 12 8 10 6 9 6 13 9 8 11 6
8 5 6 15 10 12 15 14 11 9 3 7 10 7 10 10 7 4 2 5 4 6 11 2 23 6 11 10 8 2 7 8
4
24 9 10 12 18 25 6 14 5
HC (x7 ) 2 3 3 4 3 4 5 4 3 4 3 3
3
3 3 3 3 4 3 3 4 3 4 3 4 3 3 3 3 3 3 4
3
3 2 2 2 2 3 2 3 2
Plot the pairs of observations in the two-dimensional "variable space." That is, construct a two-dimensional scatter plot of the data. (b) Plot the data as two points in the three-dimensional "item space." 1.8. Evaluate the distance of the point P = ( -1, -1) to the point Q = (1, 0) using the Euclidean distance formula in (1-12) with p = 2 and using the statistical distance in (1-20) with a 1 1 = 1/3, a22 = 4/27, and a 1 2 = 1/9. Sketch the locus of points that are a constant squared statistical distance 1 from the point Q. 1.9. Consider the following eight pairs of measurements on two variables x1 and x 2 : (a)
x1:  -6   -3   -2    1    2    5    6    8
x2:  -2   -3    1   -1    2    1    5    3
Plot the data as a scatter diagram, and compute s 1 1 , s22 , and s 1 2 .
(b) Using (1-18), calculate the corresponding measurements on variables x1
and .X2 , assuming that the original coordinate axes are rotated through an angle of 0 = 26° [given cos (26°) = .899 and sin (26°) = .438]. (c) Using the x 1 and .X2 measurements from (b), compute the sample vari ances s l l and s22 . (d) Consider the new pair of measurements (x 1 , x 2 ) = (4, -2). Transform these to measurements on .X1 and .X2 using (1-18), and calculate the dis tance d(O, P) of the new point P = (.X1 , .X2 ) from the origin 0 = (0, 0) using (1-17). Note: You will need s1 1 and s22 from (c). (e) Calculate the distance from P = (4, -2) to the origin 0 = (0, 0) using (1-19) and the expressions for a l l , a22 , and a 12 in footnote 2. Note: You will need s l l , s22 , and s 1 2 from (a). Compare the distance calculated here with the distance calculated using the .X 1 and .X2 values in (d). (Within rounding error, the numbers should be the same.) 1.10. Are the following distance functions valid for distance from the origin? Explain. (a) + 4x � + x 1 x2 = (distance) 2 - 2x� = (distance ) 2 (b) 1.11. Verify that distance defined by (1-20) with a l l = 4, a22 = 1 , and a 1 2 = - 1 sat isfies the first three conditions in (1-25). (The triangle inequality is more dif ficult to verify.) 1.12. Define the distance from the point P = (x 1 , x 2 ) to the origin 0 = (0, 0) as
xi xi
(a)
d(O, P) = max ( l x1 1 , l xz l ) Compute the distance from P = (-3, 4) to the origin.
(b) Plot the locus of points whose squared distance from the origin is 1.
(c) Generalize the foregoing distance expression to points in p dimensions. 1.13.
A large city has major roads laid out in a grid pattern, as indicated in the fol lowing diagram. Streets 1 through 5 run north-south (NS), and streets A
40
Chap. 1
Aspects of M u ltiva riate Ana lysis
through E run east-west (EW). Suppose there ate retail stores located at intersections (A, 2), (E, 3), and (C, 5). Assume the distance along a street between two intersections in either the NS or EW direction is 1 unit. Define the distance between any two intersections (points) on the grid to be the "city block" distance. (For example, the distance between intersections (D, 1) and (C, 2), which we might call d((D, 1), (C, 2)), is given by d((D, 1), (C, 2)) = d ((D , 1), (D, 2)) + d ( (D, 2), (C, 2)) = 1 + 1 = 2. Also, d((D, 1), (C, 2)) = d ((D, 1), (C, 1)) + d (( C, 1), (C, 2)) = 1 + 1 = 2.)
B r---+---+---,_--� C
t---t-----1f--+- -�FP2n
D r---+---+---,_--� E
Locate a supply facility (warehouse) at an intersection such that the sum of the distances from the warehouse to the three retail stores is minimized. The following exercises contain fairly extensive data sets. A computer may be nec
essary for the required calculations.
Table 1.4 contains some of the raw data discussed in Section 1 .2. (See also the multiple-sclerosis data on the disk.) Two different visual stimuli (S1 and S2) produced responses in both the left eye (L) and the right eye (R) of sub jects in the study groups. The values recorded in the table include x 1 (sub ject ' s age); x 2 (total response of both eyes to stimulus S1, that is, S1L + S1R ); x3 (difference between responses of eyes to stimulus S1, j S1L - S1R j); and so forth. (a) Plot the two-dimensional scatter diagram for the variables x 2 and x4 for the multiple-sclerosis group. Comment on the appearance of the diagram. (b) Compute the i , Sn , and R arrays for the non-multiple-sclerosis and mul tiple-sclerosis groups separately. 1.15. Some of the 98 measurements described in Section 1.2 are listed in Table 1.5. (See also the radiotherapy data on the data disk.) The data consist of aver age ratings over the course of treatment for patients undergoing radiother apy. Variables measured include x 1 (number of symptoms, such as sore throat or nausea); x2 (amount of activity, on a 1-5 scale); x3 (amount of sleep, on a 1-5 scale); x4 (amount of food consumed, on a 1-3 scale); x5 (appetite, on a 1-5 scale); and x6 (skin reaction, on a 0-3 scale).
1.14.
TABLE 1 .4 M U LTI P LE-SCLEROSIS DATA
Non-Multiple-Sclerosis Group Data Subject number
x3
x4
152.0 138.0 144.0 143.6 148.8
1.6 .4 .0 3.2 .0
198.4 180.8 186.4 194.8 217.6
.0 1.6 .8 .0 .0
154.4 171.2 157.2 175.2 155.0
2.4 1.6 .4 5.6 1.4
205.2 210.4 204.8 235.6 204.4
6.0 .8 .0 .4 .0
x3
x4
Xs
xl
Xz
1 2 3 4 5
18 19 20 20 20
65 66 67 68 69
67 69 73 74 79
(Age) (S1L
+ S1R)
J S1L - S1R J (S2L
+ S2R )
Xs
I S2L - S2R I
Multiple-Sclerosis Group Data Subject number
xl
Xz
1 2 3 4 5
23 25 25 28 29
148.0 195.2 158.0 134.4 190.2
.8 3.2 8.0 .0 14.2
205.4 262.8 209.8 198.4 243.8
.6 .4 12.2 3.2 10.6
25 26 27 28 29
57 58 58 58 59
165.6 238.4 164.0 169.8 199.8
16.8 8.0 .8 .0 4.6
229.2 304.4 216.8 219.2 250.2
15.6 6.0 .8 1.6 1.0
Source: Data courtesy of Dr. G. G. Celesia. Construct the two-dimensional scatter plot for variables x 2 and x3 and the marginal dot diagrams (or histograms). Do there appear to be any errors in the x3 data? (b) Compute the i, Sn , and R arrays. Interpret the pairwise correlations. 1.16. At the start of a study to determine whether exercise or dietary supplements would slow bone loss in older women, an investigator measured the mineral content of bones by photon absorptiometry. Measurements were recorded (a)
TABLE 1 .5 x1
Symptoms
RADIOTH ERAPY DATA
Xz
x3
x4
Activity
Sleep
Eat
.889 2.813 1.454 .294 2.727
1.389 1.437 1.091 .941 2.545
1.555 .999 2.364 1.059 2.819
4.100 .125 6.231 3.000 .889
1.900 1.062 2.769 1.455 1.000
2.800 1.437 1.462 2.090 1.000
Xs
x6
2.222 2.312 2.455 2.000 2.727
Appetite 1.945 2.312 2.909 1.000 4.091
Skin reaction 1.000 2.000 3.000 1 .000 .000
2.000 1.875 2.385 2.273 2.000
2.600 1.563 4.000 3.272 1.000
2.000 .000 2.000 2.000 2.000
Tealesy,. RowsR. N.contValauiesninofg valuandes of lesandthanle1.s0 Source: lectionteproces arthane due1.0tDatmayo erarbeocourtrsomiinettshyeed.ofdatMra cols. Annet x2
x3
x2
x3
for three bones on the dominant and nondominant sides and are shown in Table 1 .6. (See also the mineral-content data on the data disk.) Compute the x, S11, and R arrays. Interpret the pairwise correlations. 1.17. Some of the data described in Section 1 .2 are listed in Table 1 .7. (See also the national-track-records data on the data disk.) The national track records for women in 55 countries can be examined for the relationships among the running events. Compute the x, S11 , and R arrays. Notice the magnitudes of the correlation coefficients as you go from the shorter (100-meter) to the longer (marathon) running distances. Interpret these pairwise correlations. 1.18. Convert the national track records for women in Table 1.7 to speeds mea sured in meters per second. For example, the record speed for the 100-m dash for Argentinian women is 100 m/1 1.61 sec = 8.613 m/sec. Notice that the records for the 800-m, 1500-m, 3000-m and marathon runs are measured in minutes. The marathon is 26.2 miles, or 42,195 meters, long. Compute the x , S11, and R arrays. Notice the magnitudes of the correlation coefficients as you go from the shorter (100 m) to the longer (marathon) running distances. Interpret these pairwise correlations. Compare your results with the results you obtained in Exercise 1 .17. 1.19. Create the scatter plot and box plot displays of Figure 1 .5 for (a) the mineral content data in Table 1.6 and (b) the national-track-records data in Table 1.7. 1.20. Refer to the bankruptcy data in Table 11.4, page 712, and on the data disk. Using appropriate computer software: (a) View the entire data set in x 1 , x 2 , x3 space. Rotate the coordinate axes in various directions. Check for unusual observation s.
Chap. 1
TABLE 1 .6
M I N ERAL CONTENT I N BON ES
Subject number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Dominant radius 1.103 .842 .925 .857 .795 .787 .933 .799 .945 .921 .792 .815 .755 .880 .900 .764 .733 .932 .856 .890 .688 .940 .493 .835 .915
Exercises
43
Radius
Dominant humerus
Humerus
Dominant ulna
Ulna
1.052 .859 .873 .744 .809 .779 .880 .851 .876 .906 .825 .751 .724 .866 .838 .757 .748 .898 .786 .950 .532 .850 .616 .752 .936
2.139 1.873 1.887 1.739 1 .734 1.509 1.695 1.740 1.811 1.954 1.624 2.204 1.508 1.786 1.902 1.743 1.863 2.028 1.390 2.187 1.650 2.334 1.037 1.509 1.971
2.238 1.741 1.809 1 .547 1 .715 1 .474 1.656 1.777 1 .759 2.009 1 .657 1 .846 1.458 1 .811 1.606 1 .794 1 .869 2.032 1.324 2.087 1 .378 2.225 1 .268 1.422 1 .869
.873 .590 .767 .706 .549 .782 .737 .618 .853 .823 .686 .678 .662 .810 .723 .586 .672 .836 .578 .758 .533 .757 .546 .618 .869
.872 .744 .713 .674 .654 .571 .803 .682 .777 .765 .668 .546 .595 .819 .677 .541 .752 .805 .610 .718 .482 .731 .615 .664 .868
Source: Data courtesy of Everet Smith. (b) Highlight the set of points corresponding to the bankrupt firms. Examine
various three-dimensional perspectives. Are there some orientations of three-dimensional space for which the bankrupt firms can be distin guished from the nonbankrupt firms? Are there observations in each of the two groups that are likely to have a significant impact on any rule developed to classify firms based on the sample means, variances, and covariances calculated from these data? ( See Exercise 11.24.) 1.21. Refer to the milk transportation-cost data in Table 6.8, page 366, and on the data disk. Using appropriate computer software: (a) View the entire data set in three dimensions. Rotate the coordinate axes in various directions. Check for unusual observations.
44
Chap. 1
TABLE 1 . 7
Aspects of M u ltiva riate Analysis
NATIONAL TRACK RECORDS FOR WOM EN
Country Argentina Australia Austria Belgium Bermuda Brazil Burma Canada Chile China Colombia Cook Islands Costa Rica Czechoslovakia Denmark Dominican Republic Finland France German Democratic Republic Federal Republic of Germany Great Britain and Northern Ireland Greece Guatemala Hungary India Indonesia Ireland Israel Italy Japan Kenya Korea Democratic People ' s Republic of Korea Luxembourg Malaysia Mauritius Mexico Netherlands
100 m (s)
200 m (s )
400 m (s)
800 m ( min )
1500 m (min )
11.61 11.20 11 .43 11.41 11 .46 11.31 12.14 11 .00 12.00 11.95 11 .60 12.90 11.96 11 .09 11.42 11.79 11.13 11.15
22.94 22.35 23.09 23.04 23.05 23.17 24.47 22.25 24.52 . 24.41 24.00 27.10 24.60 21.97 23.52 24.05 22.39 22.59
54.50 51.08 50.62 52.00 53.30 52.80 55.00 50.06 54.90 54.97 53.26 60.40 58.25 47.99 53.60 56.05 50.14 51.73
2.15 1.98 1 .99 2.00 2.16 2.10 2.18 2.00 2.05 2.08 2.11 2.30 2.21 1.89 2.03 2.24 2.03 2.00
4.43 4.13 4.22 4.14 4.58 4.49 4.45 4.06 4.23 4.33 4.35 4.84 4.68 4.14 4.18 4.74 4.10 4.14
9.79 9.08 9.34 8.88 9.81 9.77 9.51 8.81 9.37 9.31 9.46 11.10 10.43 8.92 8.71 9.89 8.92 8.98
178.52 152.37 159.37 157.85 169.98 168.75 191.02 149.45 171.38 168.48 165.42 233.22 171.80 158.85 151.75 203.88 154.23 155.27
10.81
21.71
48.16
1.93
3.96
8.75
157.68
11.01
22.39
49.75
1.95
4.03
8.59
148.53
11.00 11.79 11.84 1 1.45 11.95 11 .85 11.43 11.45 1 1.29 11.73 11.73 11 .96
22.13 24.08 24.54 23.06 24.28 24.24 23.51 23.57 23.00 24.00 23.88 24.49
50.46 54.93 56.09 51.50 53.60 55.34 53.24 54.90 52.01 53.73 52.70 55.70
1.98 2.07 2.28 2.01 2.10 2.22 2.05 2.10 1.96 2.09 2.00 2.15
4.03 4.35 4.86 4.14 4.32 4.61 4.11 4.25 3.98 4.35 4.15 4.42
8.62 9.87 10.54 8.98 9.98 10.02 8.89 9.37 8.63 9.20 9.20 9.62
149.72 182.20 215.08 156.37 188.03 201.28 149.38 160.48 151.82 150.50 181.05 164.65
12.25 12.03 12.23 11.76 11.89 11.25
25.78 24.96 24.21 25.08 23.62 22.81
51 .20 56.10 55.09 58.10 53.76 52.38
1.97 2.07 2.19 2.27 2.04 1 .99
4.25 4.38 4.69 4.79 4.25 4.06
9.35 9.64 10.46 10.90 9.59 9.01
179.17 174.68 182.17 261.13 158.53 152.48
3000 m Marathon ( min ) ( min )
Chap. 1
TABLE 1 . 7 (continued)
45
NATIONAL TRACK RECO RDS FOR WOM EN
Country
lOO m (s)
200 m (s)
400 m (s)
800 m (min)
1500 m (min)
New Zealand Norway Papua New Guinea Philippines Poland Portugal Rumania Singapore Spain Sweden Switzerland Taiwan Thailand Turkey U.S.A. U.S.S.R. Western Samoa
11.55 11.58 12.25 11.76 11.13 11.81 11 .44 12.30 1 1.80 11.16 11.45 11.22 11.75 11 .98 10.79 1 1.06 12.74
23.13 23.31 25.07 23.54 22.21 24.22 23.46 25.00 23.98 22.82 23.31 22.62 24.46 24.44 21.83 22.19 25.85
51.60 53.12 56.96 54.60 49.29 54.30 51 .20 55.08 53.59 51 .79 53.11 52.50 55.80 56.45 50.62 49.19 58.73
2.02 2.03 2.24 2.19 1.95 2.09 1.92 2.12 2.05 2.02 2.02 2.10 2.20 2.15 1.96 1.89 2.33
4.18 4.01 4.84 4.60 3.99 4.16 3.96 4.52 4.14 4.12 4.07 4.38 4.72 4.37 3.95 3.87 5.81
Source:
Exercises
IAAFIA TFS Track and Field Statistics Handbook for the
1 984
3000 m Marathon (min) (min) 8.76 8.53 10.69 10.16 8.97 8.84 8.53 9.94 9.02 8.84 8.77 9.63 10.28 9.38 8.50 8.45 13.04
145.48 145.48 233.00 200.37 160.82 151 .20 165.45 182.77 162.60 154.48 153.42 177.87 168.45 201.08 142.72 151 .22 306.00
Los Angeles Olympics.
(b) Highlight the set of points corresponding to gasoline trucks. Do any of
the gasoline-truck points appear to be multivariate outliers? (See Exer cise 6.17. ) Are there some orientations of x 1 , x 2 , x3 space for which the set of points representing gasoline trucks can be readily distinguished from the set of points representing diesel trucks? 1.22. Refer to the oxygen-consumption data in Table 6.10, page 369, and on the data disk. Using appropriate computer software: (a) View the entire data set in three dimensions employing various combi nations of three variables to represent the coordinate axes. Begin with the x 1 , x 2 , x3 space. (b) Check this data set for outliers. 1.23. Using the data in Table 11.9, page 724, and on the data disk, represent the cereals in each of the following ways. (a) Stars. (b) Chernoff faces. (Experiment with the assignment of variables to facial characteristics.) 1.24. Using the utility data in Table 12.5, page 747, and on the data disk, represent the public utility companies as Chernoff faces with assignments of variables to facial characteristics different from those considered in Example 1.9. Compare your faces with the faces in Figure 1.11. Are different groupings indicated?
Using the data in Table 12.5 and on the data disk, represent the 22 public util ity companies as stars. Visually group the companies into four or five clusters. 1.26. The data in Table 1.8 (see the bull data on the data disk) are the measured characteristics of 76 young (less than two years old) bulls sold at auction. Also included in the table are the selling prices (SalePr) of these bulls. The column headings (variables) are defined as follows: 1.25.
Breed
=
FtFrBody
{
1 Angus 5 Hereford 8 Simental
= Yearling height at
YrHgt
= Fat free body (pounds)
shoulder (inches)
PrctFFB
Frame
= Scale from 1(small)
BkFat
SaleHt
= Sale height at
SaleWt
to 8(large)
shoulder (inches)
= Percent fat-free body
= Back fat (inches)
= Sale weight (pounds)
Compute i , Sn , and R arrays. Interpret the pairwise correlations. Do some of these variables appear to distinguish one breed from another? (b) View the data in three dimensions using the variables Breed, Frame, and BkFat. Rotate the coordinate axes in various directions. Check for out liers. Are the breeds well separated in this coordinate system? (c) Repeat part b using Breed, FtFrBody, and SaleHt. Which three-dimen sional display appears to result in the best separation of the three breeds of bulls? (a)
TABLE 1 .8
DATA ON BU LLS
Breed
SalePr
YrHgt
FtFrBody
PrctFFB
Frame
BkFat
SaleHt
SaleWt
1 1 1 1 1
2200 2250 1625 4600 2150
51.0 51.9 49.9 53.1 51.2
1128 1108 1011 993 996
70.9 72.1 71.6 68.9 68.6
7 7 6 8 7
.25 .25 .15 .35 .25 . .10 .15 .10 .10 .15
54.8 55.3 53.1 56.4 55.0
1720 1575 1410 1595 1488
55.2 54.6 53.9 54.9 55.1
1454 1475 1375 1564 1458
'
8 8 8 8 8
1450 1200 1425 1250 1500
51.4 49.8 50.0 50.1 51.7
997 991 928 990 992
Source: Data courtesy of Mark Ellersieck.
73.4 70.8 70.8 71.0 70.6
7 6 6 6 7
REFERENCES
1. Becker, R. A., W. S. Cleveland, and A. R. Wilks. "Dynamic Graphics for Data Analysis." Statistical Science, 2, no. 4 (1987), 355-395.
2. Benjamin, Y., and M. Igbaria. "Clustering Categories for Better Prediction of Computer Resources Utilization." Applied Statistics, 40, no. 2 (1991), 295-307.
3. Bhattacharyya, G. K., and R. A. Johnson. Statistical Concepts and Methods. New York: John Wiley, 1977.
4. Bliss, C. I. Statistics in Biology: Statistical Methods for Research in the Natural Sciences, vol. 2. New York: McGraw-Hill, 1967.
5. Capon, N., J. Farley, D. Lehman, and J. Hulbert. "Profiles of Product Innovators among Large U.S. Manufacturers." Management Science, 38, no. 2 (1992), 157-169.
6. Chernoff, H. "Using Faces to Represent Points in k-Dimensional Space Graphically." Journal of the American Statistical Association, 68, no. 342 (1973), 361-368.
7. Cochran, W. G. Sampling Techniques (3d ed.). New York: John Wiley, 1977.
8. Cochran, W. G., and G. M. Cox. Experimental Designs (2d ed.). New York: John Wiley, 1957.
9. Davis, J. C. "Information Contained in Sediment Size Analysis." Mathematical Geology, 2, no. 2 (1970), 105-112.
10. Dawkins, B. "Multivariate Analysis of National Track Records." The American Statistician, 43, no. 2 (1989), 110-115.
11. Dunham, R. B., and D. J. Kravetz. "Canonical Correlation Analysis in a Predictive System." Journal of Experimental Education, 43, no. 4 (1975), 35-42.
12. Everitt, B. Graphical Techniques for Multivariate Data. New York: North-Holland, 1978.
13. Gable, G. G. "A Multidimensional Model of Client Success when Engaging External Consultants." Management Science, 42, no. 8 (1996), 1175-1198.
14. Halinar, J. C. "Principal Component Analysis in Plant Breeding." Unpublished report based on data collected by Dr. F. A. Bliss, University of Wisconsin, 1979.
15. Kim, L., and Y. Kim. "Innovation in a Newly Industrializing Country: A Multiple Discriminant Analysis." Management Science, 31, no. 3 (1985), 312-322.
16. Klatzky, S. R., and R. W. Hodge. "A Canonical Correlation Analysis of Occupational Mobility." Journal of the American Statistical Association, 66, no. 333 (1971), 16-22.
17. Linden, M. "Factor Analytic Study of Olympic Decathlon Data." Research Quarterly, 48, no. 3 (Oct. 1977), 562-568.
18. MacCrimmon, K., and D. Wehrung. "Characteristics of Risk Taking Executives." Management Science, 36, no. 4 (1990), 422-435.
19. Marriott, F. H. C. The Interpretation of Multiple Observations. London: Academic Press, 1974.
20. Mather, P. M. "Study of Factors Influencing Variation in Size Characteristics in Fluvioglacial Sediments." Mathematical Geology, 4, no. 3 (1972), 219-234.
21. Naik, D. N., and R. Khattree. "Revisiting Olympic Track Records: Some Practical Considerations in the Principal Component Analysis." The American Statistician, 50, no. 2 (1996), 140-144.
22. Nason, G. "Three-dimensional Projection Pursuit." Applied Statistics, 44, no. 4 (1995), 411-430.
23. Smith, M., and R. Taffler. "Improving the Communication Function of Published Accounting Statements." Accounting and Business Research, 14, no. 54 (1984), 139-146.
24. Spenner, K. I. "From Generation to Generation: The Transmission of Occupation." Ph.D. dissertation, University of Wisconsin, 1977.
25. Tabakoff, B., et al. "Differences in Platelet Enzyme Activity between Alcoholics and Nonalcoholics." New England Journal of Medicine, 318, no. 3 (1988), 134-139.
26. Timm, N. H. Multivariate Analysis with Applications in Education and Psychology. Monterey, CA: Brooks/Cole, 1975.
27. Trieschmann, J. S., and G. E. Pinches. "A Multivariate Model for Predicting Financially Distressed PL Insurers." Journal of Risk and Insurance, 40, no. 3 (1973), 327-338.
28. Tukey, J. W. Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.
29. Wainer, H., and D. Thissen. "Graphical Data Analysis." Annual Review of Psychology, 32 (1981), 191-241.
30. Wartzman, R. "Don't Wave a Red Flag at the IRS." The Wall Street Journal (February 24, 1993), C1, C15.
31. Weihs, C., and H. Schmidli. "OMEGA (On Line Multivariate Exploratory Graphical Analysis): Routine Searching for Structure." Statistical Science, 5, no. 2 (1990), 175-226.
CHAPTER
2
Matrix Algebra and Random Vectors 2. 1 INTRODUCTION
We saw in Chapter 1 that multivariate data can be conveniently displayed as an array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns is called a matrix of dimension n × p. The study of multivariate methods is greatly facilitated by the use of matrix algebra.

The matrix algebra results presented in this chapter will enable us to concisely state statistical models. Moreover, the formal relations expressed in matrix terms are easily programmed on computers to allow the routine calculation of important statistical quantities.

We begin by introducing some very basic concepts that are essential to both our geometrical interpretations and algebraic explanations of subsequent statistical techniques. If you have not been previously exposed to the rudiments of matrix algebra, you may prefer to follow the brief refresher in the next section by the more detailed review provided in Supplement 2A.
X
2.2 SOME BASICS OF MATRIX AND VECTOR ALGEBRA Vectors
J�l : l
An array x of n real numbers x 1 , x2 , x
or
• • •
, x11 is called a x
vector, and it is written as
'
x"
where the prime denotes the operation of transposing a column to a row. 49
50
Chap. 2
M atrix Algebra and Ra ndom Vectors
A vector x can be represented geometrically as a directed line in n dimensions with component x_1 along the first axis, x_2 along the second axis, ..., and x_n along the nth axis. This is illustrated in Figure 2.1 for n = 3.
2
,"
-----------------�
1- /
I I
/
/
I I I I I
:
, " II I I I I I
:
31 0·�------------+ l �r-� � I
· · - - - - - - - - - - - - - - - - - - �"
,"
[]
Figure 2.1
The vector x ' = [1 , 3, 2].
A vector x can be expanded or contracted by multiplying it by a constant c. In particular, we define the vector cx as cx = [c x_1, c x_2, \ldots, c x_n]'.
cx1 CXz. c�"
That is, cx is the vector obtained by multiplying each element of x by c. [See Figure 2.2(a).] Two vectors may be added. Addition of x and y is defined as x + y = [x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n]'.
/
The length of a vector x with n components is L_x = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} (2-1), and multiplication by a constant c changes the length: L_{cx} = |c| L_x (2-2). The vector cx is expanded if |c| > 1 and contracted if 0 < |c| < 1. [Recall Figure 2.2(a).] Choosing c = L_x^{-1}, we obtain the unit vector L_x^{-1} x, which has length 1 and lies in the direction of x.

A second geometrical concept is angle. Consider two vectors in a plane and the angle θ between them, as in Figure 2.4. From the figure, θ can be represented as the difference between the angles θ_1 and θ_2 formed by the two vectors and the first coordinate axis. Since, by definition, cos(θ_1) = x_1/L_x, sin(θ_1) = x_2/L_x, cos(θ_2) = y_1/L_y, and sin(θ_2) = y_2/L_y,
=
sin ( 01 )
x1
l1__
cos ( 02 )
LX
x2 LX
Ly
Y2 Ly
sin ( 02 )
and cos ( 0 )
=
cos ( 02 - 01 )
=
cos ( 02 ) cos ( 01 )
the angle 0 between the two vectors x' cos ( 02 - 01 )
=
+ sin ( 02 ) sin ( 01 )
[ x 1 , x2 ] and y '
(ll__L ) ( �Lx ) + ( LY2 ) ( �Lx )
=
[y 1 , y2 ] is specified by
x 1 y 1 + X2Yz Lx Ly
(2-3) y y We find it convenient to introduce the inner product of two vectors. For n = 2 dimensions, the inner product of x and y is cos ( O )
=
=
=
With this definition and Equation (2-3) LX
=
Wx
Since cos (90°) = cos (270°) pendicular when x ' y = 0.
=
x' y
cos ( 0 ) 0 and cos ( 0)
=
0 only if x ' y
=
0, x and y are per
2
X
The angle (J between = [x1 , x2 ] and y ' = [y1 , Y2 l ·
Figure 2.4
x'
Sec. 2.2
53
Some Basics of M atrix a n d Vector Algebra
For an arbitrary number of dimensions
x and y as
n, we define the inner product of (2-4)
The inner product is denoted by either x' y or y' x. Using the inner product, we have the natural extension of length and angle to vectors of n components: Lx = length of x = �
(2-5)
=
x' y (2-6) Lx Ly �v� x' x �v� y' y Since, again, cos ( 8 ) = 0 only if x' y = 0, we say that x and y are perpendicular when x' y = 0. x' y
cos ( 8 ) =
Example 2. 1
(Calculating lengths of vectors and the angle between them)
Given the vectors x' = [1, 3, 2] and y' = [ -2, 1, -1], find 3x and x + y. Next, determine the length of x, the length of y, and the angle between x and y. Also, check that the length of 3x is three times the length of x. First,
Next, x' x = 1 2 + 32 + 22 = 14, y' y = ( -2) 2 1 ( - 2) + 3 (1) + 2 ( - 1) = - 1. Therefore, L X = � = Vi4 = 3.742
+ 12 +
( 1 )2 -
=
6, and
x' y
=
Ly = VY'Y = V6 = 2.449
and cos ( O ) so () = 96.3°. Finally L3x = Y3 2 + 92 showing L3x = 3 Lx .
ft Lx L y
+ 62
=
-1 3.742
X 2.449
- .109
\li26 and 3 Lx = 3 Vi4 = \li26
•
54
Chap. 2
M atrix Algebra and Random Vectors
A pair of vectors x and y of the same dimension is said to be linearly depen
dent if there exist constants c1 and c2 , both not zero, such that
A set of vectors x 1 , x 2 , . . . , xk is said to be linearly dependent if there exist constants
c 1 , c2 , . . . , ck, not all zero, such that
(2-7) Linear dependence implies that at least one vector in the set can be written as a linear combination of the other vectors. Vectors of the same dimension that are not linearly dependent are said to be linearly independent. Example 2.2 (Identifying linearly independent vectors)
Consider the set of vectors
Setting
implies that
c 1 + c2 + c 3 = 0 2c1 - 2c3 = 0 c 1 - c2 + c3 = 0 with the unique solution c1 = c2 = c3 = 0. As we cannot find three constants c1 , c2 , and c3 , not all zero, such that c1 x 1 + c2 x 2 + c3x3 = 0, the vectors x 1 , • x 2 , and x 3 are linearly independent. The projection (or shadow) of a vector x on a vector y is 1 . . of x on y = (x' y) y = (x' y) y ProJectiOn -, L y Ly yy where the vector L; 1 y has unit length. The length of the projection is
(2-8)
-
Length of protection =
I �Y I y
= Lx
\ :'I \ = Lx l cos ( O ) I X
y
where 0 is the angle between x and y. (See Figure 2.5.)
(2-9)
Sec . 2 . 2
Some Basics of M atrix a n d Vector Algebra
( )
x' y y y' y
55
, y
''"""""'-- LX cos (8) -----;)�1
Figure 2.5
The projection of x on y.
Matrices
A matrix is any rectangular array of real numbers. We denote an arbitrary array of n rows and p columns by
A (n Xp)
= l :�: :�� ::: :�; J . . an I an 2
.
an p
Many of the vector concepts just introduced have direct generalizations to matrices. The transpose operation A' of a matrix changes the columns into rows, so that the first column of A becomes the first row of A', the second column becomes the second row, and so forth. Example 2.3
{The transpose of a matrix)
If
A (2 X 3 ) then
l=
A' (3 X 2 )
[� = [- � �] 2 4
•
A matrix may also be multiplied by a constant c. The product cA is the matrix that results from multiplying each element of A by c. Thus
cA (n Xp)
ca 1 1 c �2 1 ·
can l
Two matrices A and B of the same dimensions can be added. The sum A ( i, j)th entry aij + b ;j ·
+ B has
56
Chap. 2
M atrix Algebra and Ra ndom Vectors
Example 2.4 (The sum of two matrices and multiplication of a matrix by a constant)
If
A
( 2 X 3)
=
then
4A
( 2 X 3}
A + B
(2 X 3}
( 2 X 3}
[� - � � ]
= =
[
[ 04
12
B
and
4
(2 X 3}
]
=
[ � - � -� ]
and -4 4 3 2 0 + 1 1 + 2 -1 + 5
-
•
It is also possible to define the multiplication of two matrices if the dimensions of the matrices conform in the following manner: When A is (n × k) and B is (k × p), so that the number of elements in a row of A is the same as the number of elements in a column of B, we can form the matrix product AB. An element of the new matrix AB is formed by taking the inner product of each row of A with each column of B. The matrix product AB is

   A      B    =  the (n × p) matrix whose entry in the ith row and jth column
(n×k)  (k×p)      is the inner product of the ith row of A and the jth column of B

or

(i, j) entry of AB = ai1 b1j + ai2 b2j + ··· + aik bkj = Σ_{ℓ=1}^{k} aiℓ bℓj          (2-10)

When k = 4, we have four products to add for each entry in the matrix AB. Thus, the (i, j) entry of AB is formed from row i of A, [ai1  ai2  ai3  ai4], and column j of B, [b1j  b2j  b3j  b4j]', as

(i, j) entry of AB = ai1 b1j + ai2 b2j + ai3 b3j + ai4 b4j
Example 2.5 (Matrix multiplication)
If

A = [ 3  -1  2 ],    B = [ -2 ],    and    C = [ 2   0 ]
    [ 1   5  4 ]         [  7 ]                [ 1  -1 ]
                         [  9 ]

then

   A      B    =  [ 3(-2) + (-1)(7) + 2(9) ]  =  [  5 ]
(2×3)  (3×1)      [ 1(-2) +   5(7)  + 4(9) ]     [ 69 ]
                                                  (2×1)

and

   C      A    =  [ 2   0 ] [ 3  -1  2 ]  =  [ 2(3) + 0(1)   2(-1) + 0(5)   2(2) + 0(4) ]
(2×2)  (2×3)      [ 1  -1 ] [ 1   5  4 ]     [ 1(3) - 1(1)   1(-1) - 1(5)   1(2) - 1(4) ]

               =  [ 6  -2   4 ]
                  [ 2  -6  -2 ]
                    (2×3)
•
When a matrix B consists of a single column, it is customary to use the lowercase b vector notation.

Example 2.6 (Some typical products and their dimensions)

Let

A = [ 1  -2   3 ],    b = [  7 ],    c = [  5 ],    d = [ 2 ]
    [ 2   4  -1 ]         [ -3 ]         [  8 ]         [ 9 ]
                          [  6 ]         [ -4 ]

Then Ab, bc', b'c, and d'Ab are typical products.
Ab = [ 1  -2   3 ] [  7 ]  =  [ 31 ]
     [ 2   4  -1 ] [ -3 ]     [ -4 ]
                   [  6 ]

The product Ab is a vector with dimension equal to the number of rows of A.

b'c = [ 7  -3  6 ] [  5 ]  =  [ -13 ]
                   [  8 ]
                   [ -4 ]

The product b'c is a 1 × 1 vector, or a single number, here -13.

bc' = [  7 ] [ 5  8  -4 ]  =  [  35   56  -28 ]
      [ -3 ]                  [ -15  -24   12 ]
      [  6 ]                  [  30   48  -24 ]

The product bc' is a matrix whose row dimension equals the dimension of b and whose column dimension equals that of c. This product is unlike b'c, which is a single number.

d'Ab = [ 2  9 ] [ 1  -2   3 ] [  7 ]  =  [ 26 ]
                [ 2   4  -1 ] [ -3 ]
                              [  6 ]

The product d'Ab is a 1 × 1 vector, or a single number, here 26. •
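These products, and their dimensions, can be checked directly with a few lines of Python (a small sketch assuming NumPy; the variable names simply mirror the example):

import numpy as np

A = np.array([[1, -2, 3],
              [2,  4, -1]])
b = np.array([[7], [-3], [6]])
c = np.array([[5], [8], [-4]])
d = np.array([[2], [9]])

print(A @ b)          # (2 x 1) vector
print(b.T @ c)        # 1 x 1, the single number -13
print(b @ c.T)        # 3 x 3 matrix
print(d.T @ A @ b)    # 1 x 1, the single number 26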
Square matrices will be of special importance in our development of statistical methods. A square matrix is said to be symmetric if A = A' or aij = aji for all i and j.

Example 2.7 (A symmetric matrix)

The matrix

is symmetric; the matrix

is not symmetric. •
When two square matrices A and B are of the same dimension, both products AB and BA are defined, although they need not be equal. (See Supplement 2A.) If we let I denote the square matrix with ones on the diagonal and zeros elsewhere, it follows from the definition of matrix multiplication that the (i, j)th entry of AI is ai1 × 0 + ··· + ai,j-1 × 0 + aij × 1 + ai,j+1 × 0 + ··· + aik × 0 = aij, so AI = A. Similarly, IA = A, so

   I      A    =    A      I    =    A       for any    A          (2-11)
(k×k)  (k×k)     (k×k)  (k×k)     (k×k)                (k×k)
The matrix I acts like 1 in ordinary multiplication (1 · a = a · 1 = a), so it is called the identity matrix. The fundamental scalar relation about the existence of an inverse number a⁻¹ such that a⁻¹a = aa⁻¹ = 1 if a ≠ 0 has the following matrix algebra extension: If there exists a matrix B such that

   B      A    =    A      B    =    I
(k×k)  (k×k)     (k×k)  (k×k)     (k×k)

then B is called the inverse of A and is denoted by A⁻¹.
The technical condition that an inverse exists is that the k columns a1, a2, ..., ak of A are linearly independent. That is, the existence of A⁻¹ is equivalent to

c1 a1 + c2 a2 + ··· + ck ak = 0   only if   c1 = ··· = ck = 0          (2-12)

(See Result 2A.9 in Supplement 2A.)
Example 2.8 (The existence of a matrix inverse)

For

A = [ 3  2 ]
    [ 4  1 ]

you may verify that

[ -.2   .4 ] [ 3  2 ]  =  [ (-.2)3 + (.4)4    (-.2)2 + (.4)1 ]  =  [ 1  0 ]
[  .8  -.6 ] [ 4  1 ]     [  (.8)3 + (-.6)4    (.8)2 + (-.6)1 ]    [ 0  1 ]

so

[ -.2   .4 ]
[  .8  -.6 ]

is A⁻¹. We note that

c1 [ 3 ] + c2 [ 2 ]  =  [ 0 ]
   [ 4 ]      [ 1 ]     [ 0 ]

implies that c1 = c2 = 0, so the columns of A are linearly independent. This confirms the condition stated in (2-12). •
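A computation like the one in Example 2.8 is routinely delegated to software. The sketch below (Python with NumPy assumed) computes the inverse and, as the next paragraph recommends, checks the products AA⁻¹ and A⁻¹A against the identity:

import numpy as np

A = np.array([[3.0, 2.0],
              [4.0, 1.0]])

A_inv = np.linalg.inv(A)                 # [[-0.2, 0.4], [0.8, -0.6]]

# Verify both products recover the identity (up to rounding error).
print(np.allclose(A @ A_inv, np.eye(2)))
print(np.allclose(A_inv @ A, np.eye(2)))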
A method for computing an inverse, when one exists, is given in Supplement 2A. The routine, but lengthy, calculations are usually relegated to a computer, especially when the dimension is greater than three. Even so, you must be forewarned that if the column sum in (2-12) is nearly 0 for some constants c1, ..., ck, then the computer may produce incorrect inverses due to extreme errors in rounding. It is always good to check the products AA⁻¹ and A⁻¹A for equality with I when A⁻¹ is produced by a computer package. (See Exercise 2.10.)

Diagonal matrices have inverses that are easy to compute. For example,

[ a11   0    0    0    0  ]                  [ 1/a11    0      0      0      0    ]
[  0   a22   0    0    0  ]                  [   0    1/a22    0      0      0    ]
[  0    0   a33   0    0  ]   has inverse    [   0      0    1/a33    0      0    ]
[  0    0    0   a44   0  ]                  [   0      0      0    1/a44    0    ]
[  0    0    0    0   a55 ]                  [   0      0      0      0    1/a55  ]

if all the aii ≠ 0.

Another special class of square matrices with which we shall become familiar are the orthogonal matrices, characterized by

QQ' = Q'Q = I    or    Q' = Q⁻¹          (2-13)

The name derives from the property that if Q has ith row qi', then QQ' = I implies that qi'qi = 1 and qi'qj = 0 for i ≠ j, so the rows have unit length and are mutually perpendicular (orthogonal). According to the condition Q'Q = I, the columns have the same property.

We conclude our brief introduction to the elements of matrix algebra by introducing a concept fundamental to multivariate statistical analysis. A square matrix A is said to have an eigenvalue λ, with corresponding eigenvector x ≠ 0, if

Ax = λx          (2-14)

Ordinarily, we normalize x so that it has length unity; that is, 1 = x'x. It is convenient to denote normalized eigenvectors by e, and we do so in what follows. Sparing you the details of the derivation (see [1]), we state the following basic result:
Example 2.9 (Verifying eigenvalues and eigenvectors)
Let

A = [  1  -5 ]
    [ -5   1 ]

Then, since

[  1  -5 ] [  1/√2 ]  =  6 [  1/√2 ]
[ -5   1 ] [ -1/√2 ]       [ -1/√2 ]

λ1 = 6 is an eigenvalue, and

e1' = [ 1/√2,  -1/√2 ]

is its corresponding normalized eigenvector. You may wish to show that a second eigenvalue-eigenvector pair is λ2 = -4, e2' = [1/√2, 1/√2]. •
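For matrices larger than 2 × 2 this verification is normally done by machine. A minimal Python sketch (NumPy assumed; numpy.linalg.eigh is appropriate here because the matrix is symmetric):

import numpy as np

A = np.array([[ 1.0, -5.0],
              [-5.0,  1.0]])

# eigh handles symmetric matrices; eigenvalues are returned in ascending
# order, here -4 and 6, with unit-length eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(A)
print(eigenvalues)
print(eigenvectors)

# Check the defining relation Ax = lambda * x for each pair.
for lam, e in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ e, lam * e))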
A method for calculating the λ's and e's is described in Supplement 2A. It is instructive to do a few sample calculations to understand the technique. We usually rely on a computer when the dimension of the square matrix is greater than two or three.

2.3 POSITIVE DEFINITE MATRICES
The study of the variation and interrelationships in multivariate data is often based upon distances and the assumption that the data are multivariate normally distributed. Squared distances (see Chapter 1) and the multivariate normal density can be expressed in terms of matrix products called quadratic forms (see Chapter 4). Consequently, it should not be surprising that quadratic forms play a central role in multivariate analysis. In this section, we consider quadratic forms that are always nonnegative and the associated positive definite matrices.
Results involving quadratic forms and symmetric matrices are, in many cases, a direct consequence of an expansion for symmetric matrices known as the spectral decomposition. The spectral decomposition of a k × k symmetric matrix A is given by¹

A = λ1 e1 e1' + λ2 e2 e2' + ··· + λk ek ek'          (2-16)

where λ1, λ2, ..., λk are the eigenvalues of A and e1, e2, ..., ek are the associated normalized eigenvectors. (See also Result 2A.14 in Supplement 2A.) Thus, ei'ei = 1 for i = 1, 2, ..., k, and ei'ej = 0 for i ≠ j.

¹ A proof of Equation (2-16) is beyond the scope of this book. The interested reader will find a proof in [5], Chapter 8.
Example 2.10 (The spectral decomposition of a matrix)

Consider the symmetric matrix

A = [ 13  -4   2 ]
    [ -4  13  -2 ]
    [  2  -2  10 ]

The eigenvalues obtained from the characteristic equation |A - λI| = 0 are λ1 = 9, λ2 = 9, and λ3 = 18 (Definition 2A.30). The corresponding eigenvectors e1, e2, and e3 are the (normalized) solutions of the equations Aei = λi ei for i = 1, 2, 3. Thus, Ae1 = λ1 e1 gives

[ 13  -4   2 ] [ e11 ]      [ e11 ]
[ -4  13  -2 ] [ e21 ]  = 9 [ e21 ]
[  2  -2  10 ] [ e31 ]      [ e31 ]

or

13e11 -  4e21 +  2e31 = 9e11
-4e11 + 13e21 -  2e31 = 9e21
 2e11 -  2e21 + 10e31 = 9e31

Moving the terms on the right of the equals sign to the left yields three homogeneous equations in three unknowns, but two of the equations are redundant. Selecting one of the equations and arbitrarily setting e11 = 1 and e21 = 1, we find that e31 = 0. Consequently, the normalized eigenvector is e1' = [1/√(1² + 1² + 0²), 1/√(1² + 1² + 0²), 0/√(1² + 1² + 0²)] = [1/√2, 1/√2, 0], since the sum of the squares of its elements is unity. You may verify that e2' = [1/√18, -1/√18, -4/√18] is also an eigenvector for 9 = λ2, and e3' = [2/3, -2/3, 1/3] is the normalized eigenvector corresponding to the eigenvalue λ3 = 18. Moreover, ei'ej = 0 for i ≠ j.

The spectral decomposition of A is then

A = λ1 e1 e1' + λ2 e2 e2' + λ3 e3 e3'

or

[ 13  -4   2 ]       [ 1/√2 ]
[ -4  13  -2 ]  =  9 [ 1/√2 ] [ 1/√2  1/√2  0 ]
[  2  -2  10 ]       [  0   ]

                     [  1/√18 ]
                +  9 [ -1/√18 ] [ 1/√18  -1/√18  -4/√18 ]
                     [ -4/√18 ]

                      [  2/3 ]
                + 18  [ -2/3 ] [ 2/3  -2/3  1/3 ]
                      [  1/3 ]

     [ 1/2  1/2  0 ]       [  1/18  -1/18   -4/18 ]        [  4/9  -4/9   2/9 ]
  =  9 [ 1/2  1/2  0 ] + 9 [ -1/18   1/18    4/18 ]  +  18 [ -4/9   4/9  -2/9 ]
     [  0    0   0 ]       [ -4/18   4/18   16/18 ]        [  2/9  -2/9   1/9 ]

as you may readily verify. •
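A sketch of the same decomposition in Python (NumPy assumed) rebuilds A from its eigenvalue-eigenvector pairs:

import numpy as np

A = np.array([[13.0, -4.0,  2.0],
              [-4.0, 13.0, -2.0],
              [ 2.0, -2.0, 10.0]])

eigenvalues, E = np.linalg.eigh(A)       # eigenvectors are the columns of E

# The sum of lambda_i * e_i e_i' recovers A (the spectral decomposition).
A_rebuilt = sum(lam * np.outer(e, e) for lam, e in zip(eigenvalues, E.T))
print(np.allclose(A, A_rebuilt))         # True
print(eigenvalues)                       # 9, 9, and 18 in some order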
The spectral decomposition is an important analytical tool. With it, we are very easily able to demonstrate certain statistical results. The first of these is a matrix explanation of distance, which we now develop.
When a k × k symmetric matrix A is such that

0 ≤ x'Ax          (2-17)

for all x' = [x1, x2, ..., xk], A is said to be nonnegative definite. If equality holds in (2-17) only for the vector x' = [0, 0, ..., 0], then A is said to be positive definite. In other words, A is positive definite if

0 < x'Ax          (2-18)

for all vectors x ≠ 0. Because x'Ax has only squared terms xi² and product terms xi xk, it is called a quadratic form.
Example 2.11 (A positive definite quadratic form)

Show that the following quadratic form is positive definite:

3x1² + 2x2² - 2√2 x1x2

To illustrate the general approach, we first write the quadratic form in matrix notation as

[ x1  x2 ] [  3   -√2 ] [ x1 ]  =  x'Ax
           [ -√2    2 ] [ x2 ]

By Definition 2A.30, the eigenvalues of A are the solutions of the equation |A - λI| = 0, or (3 - λ)(2 - λ) - 2 = 0. The solutions are λ1 = 4 and λ2 = 1. Using the spectral decomposition in (2-16), we can write

   A    =  λ1  e1    e1'   +  λ2  e2    e2'   =  4 e1 e1' + e2 e2'
(2×2)        (2×1) (1×2)        (2×1) (1×2)

where e1 and e2 are the normalized and orthogonal eigenvectors associated with the eigenvalues λ1 = 4 and λ2 = 1, respectively. Because 4 and 1 are scalars, premultiplication and postmultiplication of A by x' and x, respectively, where x' = [x1, x2] is any nonzero vector, give

x'Ax = 4 x'e1 e1'x + x'e2 e2'x = 4y1² + y2² ≥ 0

with

y1 = x'e1 = e1'x   and   y2 = x'e2 = e2'x

We now show that y1 and y2 are not both zero and, consequently, that x'Ax = 4y1² + y2² > 0, or A is positive definite.

From the definitions of y1 and y2, we have

[ y1 ]  =  [ e1' ] [ x1 ]
[ y2 ]     [ e2' ] [ x2 ]

or

  y    =    E      x
(2×1)     (2×2)  (2×1)

Now E is an orthogonal matrix and hence has inverse E'. Thus, x = E'y. But x is a nonzero vector, and 0 ≠ x = E'y implies that y ≠ 0. •
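Equivalently, positive definiteness can be checked numerically from the eigenvalues, as the next paragraph states. A brief Python sketch (NumPy assumed):

import numpy as np

A = np.array([[3.0,           -np.sqrt(2.0)],
              [-np.sqrt(2.0),  2.0]])

eigenvalues = np.linalg.eigvalsh(A)   # symmetric matrix, so real eigenvalues
print(eigenvalues)                    # [1. 4.]
print(np.all(eigenvalues > 0))        # True, so x'Ax > 0 for all x != 0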
Using the spectral decomposition, we can easily show that a k × k symmetric matrix A is a positive definite matrix if and only if every eigenvalue of A is positive. (See Exercise 2.17.) A is a nonnegative definite matrix if and only if all of its eigenvalues are greater than or equal to zero.

Assume for the moment that the p elements x1, x2, ..., xp of a vector x are realizations of p random variables X1, X2, ..., Xp. As we pointed out in Chapter 1, we can regard these elements as the coordinates of a point in p-dimensional space, and the "distance" of the point [x1, x2, ..., xp] to the origin can, and in this case should, be interpreted in terms of standard deviation units. In this way, we can account for the inherent uncertainty (variability) in the observations. Points with the same associated "uncertainty" are regarded as being at the same distance from the origin.

If we use the distance formula introduced in Chapter 1 [see Equation (1-22)], the distance from the origin satisfies the general formula

(distance)² = a11 x1² + a22 x2² + ··· + app xp² + 2(a12 x1x2 + a13 x1x3 + ··· + a_{p-1,p} x_{p-1}xp)

provided that (distance)² > 0 for all [x1, x2, ..., xp] ≠ [0, 0, ..., 0]. Setting aij = aji, i ≠ j, i = 1, 2, ..., p, j = 1, 2, ..., p, we have

                                    [ a11  a12  ...  a1p ] [ x1 ]
0 < (distance)² = [x1, x2, ..., xp] [ a12  a22  ...  a2p ] [ x2 ]
                                    [  :    :         :  ] [  : ]
                                    [ a1p  a2p  ...  app ] [ xp ]

or

0 < (distance)² = x'Ax    for x ≠ 0          (2-19)
Comment. Let the square of the distance from the point x' = [x1, x2, ..., xp] to the origin be given by x'Ax, where A is a p × p symmetric positive definite matrix. Then the square of the distance from x to an arbitrary fixed point μ' = [μ1, μ2, ..., μp] is given by the general expression (x - μ)'A(x - μ).

Expressing distance as the square root of a positive definite quadratic form allows us to give a geometrical interpretation based on the eigenvalues and eigenvectors of the matrix A. For example, suppose p = 2. Then the points x' = [x1, x2] of constant distance c from the origin satisfy

x'Ax = a11 x1² + 2a12 x1x2 + a22 x2² = c²

By the spectral decomposition, as in Example 2.11,

A = λ1 e1 e1' + λ2 e2 e2'    so    x'Ax = λ1 (x'e1)² + λ2 (x'e2)²

Now, c² = λ1 y1² + λ2 y2² is an ellipse in y1 = x'e1 and y2 = x'e2 because λ1, λ2 > 0 when A is positive definite. (See Exercise 2.17.) We easily verify that x = c λ1^(-1/2) e1 satisfies x'Ax = λ1 (c λ1^(-1/2) e1'e1)² = c². Similarly, x = c λ2^(-1/2) e2 gives the appropriate distance in the e2 direction. Thus, the points at distance c lie on an ellipse whose axes are given by the eigenvectors of A with lengths proportional to the reciprocals of the square roots of the eigenvalues. The constant of proportionality is c. The situation is illustrated in Figure 2.6.

If p > 2, the points x' = [x1, x2, ..., xp] a constant distance c = √(x'Ax) from the origin lie on hyperellipsoids c² = λ1 (x'e1)² + ··· + λp (x'ep)², whose axes are given by the eigenvectors of A. The half-length in the direction ei is equal to c/√λi, i = 1, 2, ..., p, where λ1, λ2, ..., λp are the eigenvalues of A.
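This relationship between the eigen-decomposition and the axes of the constant-distance ellipse is easy to tabulate numerically. A small Python sketch (NumPy assumed; the matrix simply reuses the one from Example 2.11):

import numpy as np

A = np.array([[3.0,           -np.sqrt(2.0)],
              [-np.sqrt(2.0),  2.0]])
c = 1.0                                   # the chosen constant distance

eigenvalues, E = np.linalg.eigh(A)        # columns of E are e_1, e_2

# Half-length of the ellipse x'Ax = c^2 along each eigenvector direction.
half_lengths = c / np.sqrt(eigenvalues)
for lam, half in zip(eigenvalues, half_lengths):
    print(f"eigenvalue {lam:.1f}: half-length {half:.3f}")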
Figure 2.6  Points a constant distance c from the origin (p = 2, 1 ≤ λ1 < λ2).
Taking the expectation of the matrix (X(1) - μ(1))(X(2) - μ(2))', we get

E(X(1) - μ(1))(X(2) - μ(2))'  =  [ σ_{1,q+1}  σ_{1,q+2}  ...  σ_{1p} ]
                                 [ σ_{2,q+1}  σ_{2,q+2}  ...  σ_{2p} ]   =  Σ12          (2-39)
                                 [     :          :              :   ]
                                 [ σ_{q,q+1}  σ_{q,q+2}  ...  σ_{qp} ]

which gives all the covariances, σij, i = 1, 2, ..., q, j = q + 1, q + 2, ..., p, between a component of X(1) and a component of X(2). Note that the matrix Σ12 is not necessarily symmetric or even square.

Making use of the partitioning in Equation (2-38), we can easily demonstrate that

(X - μ)(X - μ)'  =  [ X(1) - μ(1) ] [ (X(1) - μ(1))'   (X(2) - μ(2))' ]
                    [ X(2) - μ(2) ]

where X(1) - μ(1) is (q × 1) and X(2) - μ(2) is ((p - q) × 1), and consequently,
... λi² > 0 = λ²_{i+1} = λ²_{i+2} = ··· = λ²_m (for m > k). Then vi = λi⁻¹ A'ui. Alternatively, the vi are the eigenvectors of A'A with the same nonzero eigenvalues λi².

The matrix expansion for the singular-value decomposition written in terms of the full dimensional matrices U, V, Λ is

   A    =    U      Λ      V'
(m×k)     (m×m)  (m×k)  (k×k)

where U has m orthogonal eigenvectors of AA' as its columns, V has k orthogonal eigenvectors of A'A as its columns, and Λ is specified in Result 2A.15.

For example, let
A = [  3  1  1 ]
    [ -1  3  1 ]

Then

AA' = [  3  1  1 ] [  3  -1 ]  =  [ 11   1 ]
      [ -1  3  1 ] [  1   3 ]     [  1  11 ]
                   [  1   1 ]

You may verify that the eigenvalues γ = λ² of AA' satisfy the equation γ² - 22γ + 120 = (γ - 12)(γ - 10) = 0, and consequently, the eigenvalues are γ1 = λ1² = 12 and γ2 = λ2² = 10. The corresponding eigenvectors are

u1' = [ 1/√2,  1/√2 ]   and   u2' = [ 1/√2,  -1/√2 ],  respectively.

Also,

A'A = [  3  -1 ] [  3  1  1 ]  =  [ 10   0  2 ]
      [  1   3 ] [ -1  3  1 ]     [  0  10  4 ]
      [  1   1 ]                  [  2   4  2 ]

so |A'A - γI| = -γ³ + 22γ² - 120γ = -γ(γ - 12)(γ - 10), and the eigenvalues are γ1 = λ1² = 12, γ2 = λ2² = 10, and γ3 = λ3² = 0. The nonzero eigenvalues are the same as those of AA'. A computer calculation gives the eigenvectors

v1' = [ 1/√6,  2/√6,  1/√6 ],   v2' = [ 2/√5,  -1/√5,  0 ],   and   v3' = [ 1/√30,  2/√30,  -5/√30 ]

Eigenvectors v1 and v2 can be verified by checking:

A'A v1 = (1/√6) [ 10   0  2 ] [ 1 ]  =  (1/√6) [ 12 ]  =  12 v1  =  λ1² v1
                [  0  10  4 ] [ 2 ]            [ 24 ]
                [  2   4  2 ] [ 1 ]            [ 12 ]

A'A v2 = (1/√5) [ 10   0  2 ] [  2 ]  =  (1/√5) [  20 ]  =  10 v2  =  λ2² v2
                [  0  10  4 ] [ -1 ]            [ -10 ]
                [  2   4  2 ] [  0 ]            [   0 ]
Chap. 2 Exercises
Taking A 1 A is
=
VU and
A = [ - 31
A2 =
1 1 3 1
]
1 07
VlO , we find that the singular-value decomposition of
2 V6
1 V6
J+
ViO
[Jz] [� -1
v2
v5
-=1. v5
o]
The equality may be checked by carrying out the operations on the right-hand side.

EXERCISES
2.1. Let x' = [5, 1, 3] and y' = [-1, 3, 1].
(a) Graph the two vectors.
(b) Find (i) the length of x, (ii) the angle between x and y, and (iii) the projection of y on x.
(c) Since x̄ = 3 and ȳ = 1, graph [5 - 3, 1 - 3, 3 - 3] = [2, -2, 0] and [-1 - 1, 3 - 1, 1 - 1] = [-2, 2, 0].
2.2. Given the matrices

A = [ -1  3 ]
    [  4  2 ]
perform the indicated multiplications.
(a) 5A  (b) BA  (c) A'B'  (d) C'B  (e) Is AB defined?
2.3.
A = [ � !].
[ 51
]
Verify the following properties of the transpose when
B=
4 2 , and C 0 3
(a) (A')' = A (b) (C')- 1 = (c - t ) ' (c) (AB)' = B' A' (d) For general A and B , (AB)' = B' A'. (m x k)
(k x e)
=
1 08
Chap. 2 Matrix Algebra and Random Vectors 2.4.
2.5.
When A⁻¹ and B⁻¹ exist, prove each of the following.
(a) (A')⁻¹ = (A⁻¹)'
(b) (AB)⁻¹ = B⁻¹A⁻¹
Hint: Part a can be proved by noting that AA⁻¹ = I, I = I', and (AA⁻¹)' = (A⁻¹)'A'. Part b follows from (B⁻¹A⁻¹)AB = B⁻¹(A⁻¹A)B = B⁻¹B = I.
[ 5 12 ]
Check that
is an orthogonal matrix. 2.6. Let
Q= �1 3 �13 A=
[ - 29 - 26 ]
(a) Is A symmetric?
(b) Show that A is positive definite.
2.7. Let A be as given in Exercise 2.6.
(a) Determine the eigenvalues and eigenvectors of A.
(b) Write the spectral decomposition of A.
(c) Find A⁻¹.
(d) Find the eigenvalues and eigenvectors of A⁻¹.
2.7.
2.8.
Given the matrix
A=
[ 21 - 22 ]
find the eigenvalues λ1 and λ2 and the associated normalized eigenvectors e1 and e2. Determine the spectral decomposition (2-16) of A.
2.9. Let A be as in Exercise 2.8.
(a) Find A⁻¹.
(b) Compute the eigenvalues and eigenvectors of A⁻¹.
(c) Write the spectral decomposition of A⁻¹, and compare it with that of A from Exercise 2.8.
2.10. Consider the matrices
A=
[ 44.001
4.001 4.002
]
and B =
4.001 4.001 4.002001
[4
J
These matrices are identical except for a small difference in the (2, 2) posi tion. Moreover, the columns of A (and B ) are nearly linearly dependent. Show that A- 1 = ( - 3 ) B- 1 . Consequently, small changes-perhaps caused by rounding-can give substantially different inverses.
Chap.
2 Exercises
1 09
Show that the determinant of the p × p diagonal matrix A = {aij} with aij = 0, i ≠ j, is given by the product of the diagonal elements; thus, |A| = a11 a22 ··· app.
Hint: By Definition 2A.24, |A| = a11 A11 + 0 + ··· + 0. Repeat for the submatrix A11 obtained by deleting the first row and first column of A.
2.12. Show that the determinant of a square symmetric p × p matrix A can be expressed as the product of its eigenvalues λ1, λ2, ..., λp; that is,
X
· · ·
· · ·
\A\
=
X
ITf= l A i .
• . •
Hint: From (2-16) and (2-20), A
= PΛP' with P'P = I. From Result 2A.11(e), |PΛP'| = |P| |ΛP'| = |P| |Λ| |P'| = |Λ| |I|, since |I| = |P'P| = |P'| |P|. Apply Exercise 2.11.
2.13. Show that |Q| = +1 or -1 if Q is a p × p orthogonal matrix.
Hint: |QQ'| = |I|. Also, from Result 2A.11, |QQ'| = |Q| |Q'| = |Q|². Thus, |Q|² = |I|. Now use Exercise 2.11.
2.14. Show that Q'AQ and A have the same eigenvalues if Q is
2A.l l(e),
2.13.
2.14.
2.15. 2.16.
2.17.
2.18.
\A\
=
X
(p X p ) (p Xp) (p X p)
(p Xp)
orthogonal.
Hint: Let λ be an eigenvalue of A. Then 0 = |A - λI|. By Exercise 2.13 and Result 2A.11(e), we can write 0 = |Q'| |A - λI| |Q| = |Q'AQ - λI|, since Q'Q = I.
2.15. A quadratic form x'Ax is said to be positive definite if the matrix A is positive definite. Is the quadratic form 3x1² + 3x2² - 2x1x2 positive definite?
2.16. Consider an arbitrary n × p matrix A. Then A'A is a symmetric p × p matrix. Show that A'A is necessarily nonnegative definite.
Hint: Set y = Ax so that y'y = x'A'Ax.
2.17. Prove that every eigenvalue of a k × k positive definite matrix A is positive.
Hint: Consider the definition of an eigenvalue, where Ae = λe. Multiply on the left by e' so that e'Ae = λe'e.
2.18. Consider the sets of points (x1, x2) whose "distances" from the origin are given by
X
X
X
c2
=
4x
i + 3x� - 2 V2 x1 x2
for c 2 = 1 and for c 2 = 4. Determine the major and minor axes of the ellipses of constant distances and their associated lengths. Sketch the ellipses of con stant distances and comment on their positions. What will happen as c 2 increases? 2.19.
Let A 1 12 = 2: \IA; e ; e; m
(m X m)
i= l
=
PA1 12 P', where PP'
=
P'P
=
I. (The A /s and
the e ; s are the eigenvalues and associated normalized eigenvectors of the matrix A.) Show Properties (1)-(4) of the square-root matrix in (2-22). 2.20. Determine the square-root matrix A 1 12 , using the matrix A in Exercise 2.3. Also, determine A - l/2 , and show that A 1 /2 A - I /2 = A -l /2 A 1 12 = I. '
110
Chap. 2 Matrix Algebra and Random Vectors 2.21.
(See Result 2A.15) Using the matrix
(a) Calculate A' A and obtain its eigenvalues and eigenvectors.
(b) Calculate AA' and obtain its eigenvalues and eigenvectors. Check that the nonzero eigenvalues are the same as those in part a.
(c) Obtain the singular-value decomposition of A. 2.22.
(See Result 2A. l 5) Using the matrix A =
[4
8
3 6
] -9 8
(a) Calculate AA' and obtain its eigenvalues and eigenvectors.
(b) Calculate A' A and obtain its eigenvalues and eigenvectors. Check that the nonzero eigenvalues are the same as those in part a.
(c) Obtain the singular-value decomposition of A.
Verify the relationships V1 12 pV1 /2 = I and p = (V1 12 ) - 1 I (V1 12 ) -\ where I is the p p population covariance matrix [Equation (2-32)] , p is the p p population correlation matrix [Equation (2-34)] , and V 1 12 is the population standard deviation matrix [Equation (2-35)]. 2.24. Let X have covariance matrix 2.23.
X
X
Find
(a) I - 1 •
2.25.
(b) The eigenvalues and eigenvectors of I . (c) The eigenvalues and eigenvectors of I - 1 .
Let X have covariance matrix
(a) Determine p and V1 12 • 2.26.
(b) Multiply your matrices to check the relation V 112 pV1 /2 = Use I as given in Exercise 2.25. (a) Find p1 3 •
(b) Find the correlation between X1 and �X2 +
� X3 •
I.
Chap. 2 Exercises
111
Derive expressions for the mean and variances of the following linear com binations in terms of the means and covariances of the random variables X1 , X2 , and X3 • (a) X1 - 2 X2 (b) - X1 + 3 X2 (c) X1 + X2 + X3 (e) X1 + 2 X2 - X3 (t) 3 X1 - 4 X2 if X1 and X2 are independent random variables. 2.28. Show that
2.27.
Cov (e1 1 X1 + e1 2 X2 + · · · + e1 P XP , e2 1 X1 + e22 X2 + · · · + e2 P XP ) = c � Ixc2 where c � [ e1 1 , e1 2 , . . . , e 1 p ] and c; = [ e2 1 , e22 , . . . , e2 P ] . This verifies the off-diagonal elements C ixC' in (2-45) or diagonal elements if c 1 c 2 . Hint: By (2-43), Z1 - E (Z1 ) e1 1 (X1 - 11-1 ) + · · · + e1 P (XP - /Lp ) and Z2 - E (Z2 ) e2 1 (X1 - �J- 1 ) + · · · + e2 P (XP - IJ-p ) . So Cov ( Z 1 , Z2 ) = E [ (Z1 - E (Z1 ) ) (Z2 - E (Z2 ) ) ] E [ (e11 (X1 - 11- 1 ) + · · · + e1p (Xp - 11-p ) ) (ez t (Xl - 11-t ) + e22 (X2 - �J- 2 ) + · · · + e2p (Xp - IJ-p ) ) ] . The product (el l (Xl - 11-t ) + e1 2 (Xz - 11- z ) + · · · + e l p (XP - ILp ) ) (ez t (XI - 11-t ) + ezz (Xz - /L z ) + · · · + ez p (Xp - ILp ) )
=
=
= ( �1 e� e (Xe p
p
11- e )
=
=
=
) c �1 ezm (Xm - 11-m ) )
= � � e l f ezm (Xt - ILe H Xm - 11- m )
f= l m = l has expected value p p � � et e ez m uem [ el l• · · · • e l p ] I [ ez t• · · · • e2 p ] ' . f=l m=l Verify the last step by the definition of matrix multiplication. The same steps hold for all elements. 2.29. Consider the arbitrary random vector [X1 , X2 , X3 , X4 , X5 ] ' with mean vector p, [�J- 1 , �J- 2 , �J- 3 , �J- 4 , 11- s ] ' . Partition into
=
=
where
X= X
X = [-i��-J
112
Chap. 2 Matrix Algebra and Random Vectors
Let I be the covariance matrix of X with general element cr;k · Partition I into the covariance matrices of X (l l and x 1, some information about the sample is lost in the process. A geometrical interpretation of I S I will help us appreciate its strengths and weaknesses as a descriptive summary. Consider the area generated within the plane by two deviation vectors d1 = y1 - x1l and d2 = y2 - x2 1. Let Ld 1 be the length of d1 and Ld2 the length of d2 . By elementary geometry, we have the diagram •
dl
- - - - - - - - - �t!J- - - - - - - - - - - - - .
cos2 ( e) + sin2 ( e) 1, we and the area of the trapezoid is jLd sin ( e) I Ld2 .canSinceexpress th' area as I
IS
2 Definition 2A.24 defines "determinant" and indicates one method for calculating the value of a determinant.
Sec.
From (3-5) and (3-7), Ld 1 Ld 2 =
and
�� (xi1 - .X1 )2 F .X2 ) 2
3 .4
Generalized Variance
=
V(n - 1) s11
=
V(n - 1) s22
1 31
cos(O) = r1 2
Therefore,

Area = (n - 1) √s11 √s22 √(1 - r12²) = (n - 1) √(s11 s22 (1 - r12²))          (3-13)

Also,

|S| = | s11             √s11 √s22 r12 |  =  s11 s22 - s11 s22 r12²  =  s11 s22 (1 - r12²)          (3-14)
      | √s11 √s22 r12   s22           |

If we compare (3-14) with (3-13), we see that

|S| = (area)² / (n - 1)²

Assuming now that |S| = (n - 1)^(-(p-1)) (volume)² holds for the volume generated in n space by the p - 1 deviation vectors d1, d2, ..., d_{p-1}, we can establish the following general result for p deviation vectors by induction (see [1], p. 260):

Generalized sample variance = |S| = (n - 1)^(-p) (volume)²          (3-15)

Equation (3-15) says that the generalized sample variance, for a fixed set of data,³ is proportional to the square of the volume generated by the p deviation vectors d1 = y1 - x̄1 1, d2 = y2 - x̄2 1, ..., dp = yp - x̄p 1. Figures 3.6(a) and (b) on page 132 show trapezoidal regions, generated by p = 3 residual vectors, corresponding to "large" and "small" generalized variances.

For a fixed sample size, it is clear from the geometry that volume, or |S|, will increase when the length of any di = yi - x̄i 1 (or √sii) is increased. In addition, volume will increase if the residual vectors of fixed length are moved until they are at right angles to one another, as in Figure 3.6(a). On the other hand, the volume, or |S|, will be small if just one of the sii is small or one of the deviation vectors lies
any
3 If generalized variance is defined in terms of the sample covariance matrix S, = [(n - 1)/n] S, then, using Result 2A. 1 1 , I S, I = l [(n - 1 )/n)IP S I = l [(n - 1)/n)Ip I l S I = [(n - 1)/nJP I S I . Conse quently, using (3-15), we can also write the following: Generalized sample variance = I S, I = n-p (volume) 2 •
Figure 3.6 (a) "Large" generalized sample variance for p = 3 . (b) "Small" gen eralized sample variance for p = 3.
nearly in the (hyper) plane formed by the others, or both. In the second case, the trapezoid has very little height above the plane. This is the situation in Figure 3.6(b ), where d3 lies nearly in the plane formed by d 1 and d2 . Generalized variance also has interpretations in the p-space scatter plot re presentation of the data. The most intuitive' interpretation concerns the spread of the scatter about the sample mean point x = [x , x2 , . . . , xP ] . Consider the mea sure of distance given1 in the comment below (2-19),1 with x playing the role of the fixed point p and s- playing the role of A. With these choices, the coordinates x ' = [ x1 , x 2 , . . . , xp] of the points a constant distance c from x satisfy (3-16) 1 (When p = 1, (x - x) ' S- (x - x) = (x1 - x1 ) 2/s1 1 is the squared distance from xl to xl in standard deviation units. ) Equation (3-16) defines a hyperellipsoid (an ellipse if p = 2) centered at x. It can be shown using integral calculus that the volume of this hyperellipsoid is related to I S 1 . In particular, Volume of {x: (x - x) ' S- 1 (x - x) ,;; c2 } = kP I S I 112cP (3-17) or (Volume of ellipsoid)2 = (constant) (generalized sample variance) where the constant kP is rather formidable.4 A large volume corresponds to a large generalized variance. 4 For those who are curious, kP uated at z.
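The generalized sample variance |S| and the associated ellipsoid volume of (3-17) are simple to compute. A short Python sketch (NumPy assumed; the covariance matrix shown is just an illustrative 2 × 2 example, not a data set from the text):

import numpy as np
from math import gamma, pi

S = np.array([[5.0, 4.0],      # an illustrative 2 x 2 sample covariance matrix
              [4.0, 5.0]])
p = S.shape[0]
c2 = 5.99                      # squared radius, e.g., the 95% chi-square value for p = 2

gen_var = np.linalg.det(S)                         # generalized sample variance |S|
k_p = 2 * pi ** (p / 2) / (p * gamma(p / 2))       # constant appearing in (3-17)
volume = k_p * np.sqrt(gen_var) * c2 ** (p / 2)    # volume of {x: (x - xbar)' S^-1 (x - xbar) <= c^2}

print(gen_var, volume)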
=
27TP12jp f (p/2), where f (z) denotes the gamma function eval
Sec.
Generalized Variance
3.4
1 33
Although the generalized variance has some intuitively pleasing geometrical interpretations, it suffers from a basic weakness as a descriptive summary of the sample covariance matrix S, as the following example shows. Example 3.8
(Interpreting the generalized variance)
Figure 3.7 gives three scatter plots with very different patterns of correlation. All three data sets have i' = [1, 2], and the covariance matrices are 4 s [� � J r = .8 s = [ � � J r = 0 S = � ] , r = -.8 7
Xz
7 •
• •
•
•
•
•
•
•
•
•
•
XI
7
• •
•
•
• ' . • • • • • I • . .. . "' . •
Xz
[-
• •
-
• •
•
5
•
• • • • • • •
• • •
•
• •
(a)
(b)
• • •
•
7 • • • •
• • •• • •• • •
• • • • ,. •
•
•
• • • •• • •
• •
• •
(c) Figure 3.7
Scatter plots with three different orientations.
• •
•
7
X
I
1 34
Chap.
3
Sample Geometry and Random Sampling
Each covariance matrix S contains the information on the variability of the component variables and also the information required to calculate the correlation coefficient. In this sense, S captures the orientation and size of pattern of scatter. The eigenvalues and eigenvectors extracted from S further describe the pattern in the scatter plot. For the eigenvalues satisfy 0 == (A(A -- 5)9) 2(A- 421) and we determine the eigenvalue-eigenvector pairs A 1 = 9, e � = [ 1 /Yl , 1 /Yl ] and A2 = 1, e� = [1/Yl, -1/Yl]. The mean-centered ellipse, with center x' = [1, 2] for all three cases, is (x i) 'S- 1 (x - x) � c2 To describe this ellipse, as in Section 2.3, with A =I s- I , we notice that if (A, e) is an eigenvalue-eigenvector pair for S, then (A - , e) is an eigenvalue-eigen vector -pair for s-1- 1. That is,- if Se -=1 Ae, then multiplying on the left by s - 1 1 gives S Se = A S e, or S te = A e. Therefore, using the eigenvalues from S, we know that the ellipse extends e VA; in the direction of e; from x. In p = 2 dimensions, the choice c2 = 5. 99 will produce an ellipse that con tains approximately 95 percent of the observations. The vectors 3 \15.99 e 1 and \15.99 e are drawn in Figure 3. 8 (a) on page 135. Notice how the directions are the natural2 axes for the ellipse, and observe that the lengths of these scaled eigenvectors are comparable to the size of the pattern in each direction. Next, for the eigenvalues satisfy 0 = (A - 3) 2 and we arbitrarily choose the eigenvectors so that A 1 = 3, e� = [1, 0] and A2 = 3, e� = [0, 1]. The vectors V3 \15.99 e 1 and V3 \15.99 e2 are drawn in Figure 3. 8 (b). Finally, for [ 5 4 ] the eigenvalues satisfy 0 = (A - 5)2 - ( )2 s = = (A - 9) (A 1) 5 ' and we determine the eigenvalue-eigenvector pairs A 1 = 9, e � = [1 /Yl, -1/Yl] and A = 1, e� = [1 /Yl , 1/Yl ]. The scaled eigen vectors 3 v5.99 e 1 and v5.992 e2 are drawn in Figure 3.8 (c). In two dimensions, we can often sketch the axes of the mean-centered ellipse by eye. However, the eigenvector approach also works for high dimen sions where the data cannot be examined visually. -
-
-4
-
-
-
4
Sec.
7
x2
7
•
•
•
I
•
"'
•
•
•
• •
•
•
•
XI
•
•
•
•
•
•
• • • • •
•
•
•
•
•
•
1 35
x2
•
•
7 •
Generalized Variance
3.4
•
•
•
•
•
•
•
• • •
•
•
• • •
•
• •
7
•
X
I
(b)
(a) • •
•
•
7
x2
• •
•
•
• •
• • • •• •
•
•
•
•
• •
7 • • • •
XI
•
(c) Figure 3.8
Axes of the mean-centered 95-percent ellipses for the scatter plots in Figure 3 . 7.
Note: Here the generalized variance I S I gives the same value, I S I = 9, for all three patterns. But generalized variance does not contain any infor mation on the orientation of the patterns. Generalized variance is easier to interpret when the two or more samples (patterns) being compared have nearly the same orientations. Notice that our three patterns of scatter appear to cover approximately the same area. The ellipses that summarize the variability (x
-
i) 'S- 1 ( x - i) � c 2
do have exactly the same area [see (3-17)], since all have l S I 9. =
•
1 36
Chap.
3
Sample Geometry and Random Sampling
As Example 3. 8 demonstrates, different correlation structures are not detected by IS 1. The situation for p > 2 can be even more obscure. Consequently, it is often desirable to provide more than the single number IS I as aA summary of S. From Exercise 2.12, IS I can be expressed as the product A 1 A2 · · · P of the eigenvalues of S. Moreover, the mean-centered ellipsoid based on s- 1 [see (3-16)] has axes whose lengths are proportional to the square roots of the A;'s (see Section 2.3). These eigenvalues then provide information on the variabil ity in all directions in the p-space representation of the data. It is useful, therefore, to report their individual values, as well as their product. We shall pursue this topic later when we discuss principal components. Situations in which the Generalized Sample Variance Is Zero
The generalized sample variance will be zero in certain situations. A generalized variance of zero is indicative of extreme degeneracy, in the sense that at least one column of the matrix of deviations,
[ ] [ - - -] -
X 1I - X , X z, - X-,
X1 1 - X--1 x 1 2 - x-2 · · · x 1P - xP X z l - XI X zz - X z . . . X zp - xP
x;,
xn l
� x'
X
� XI
Xn z
1
x'
� Xz · : : X"P � XP
(3-18) can be expressed as a linear combination of the other columns. As we have shown geometrically, this is a case where one of the deviation vectors-for instance, d; = [xl i - X; , . . . , X, ; - x;]-lies in the (hyper) plane generated by d l , . . . , d;-] , (n X p )
-
(n X I) (I X p)
d i+l , . . . , dp .
Result 3.2. The generalized variance is zero when, and only when, at least one deviation vector lies in the (hyper) plane formed by all linear combinations of the others-that is, when the columns of the matrix of deviations in (3-18) are lin early dependent. Proof. If the columns of the deviation matrix (X - IX') are linearly depen dent, there is a linear combination of the columns such that
= (X -
li' ) a for some a 0 But then, as you may verify, (n - 1)S = (X - lx' ) ' (X - IX' ) and ( n - 1)Sa = ( X - IX' ) ' (X - IX ' ) a = 0 =I=
Sec.
3 .4
Generalized Variance
1 37
so the same a corresponds to a linear dependency, a1 col1 (S) + · · · + aP colP (S) = Sa = 0, in the columns of S. So, by Result 2A.9, I S I = 0. In the other direction, if I S I = 0, then there is some linear combination Sa of the columns of S such that Sa = 0. That is, 0 = (n 1 ) Sa = (X - li' ) ' (X - li' ) a. P�emultiplying by a' yields 0 = a' (X - li' ) ' (X - li' ) a = L fx - tx')a and, for the length to equal zero, we must have (X - li' ) a = 0. Thus, the • columns of (X - li' ) are linearly dependent.
-
Example 3.9
(A case where the generalized variance is zero)
Show that l S I = 0 for
X =
(3 X 3 )
[- -
-� � [ ] � = �]
and determine the degeneracy. Here = [3, 1, 5], so 1 - 3 2 1 � X - li' = 4 3 1 - 1 = 1 -1 -1 4 - 3 0 - 1 4 - 5 The deviation (column) vectors are d; = [ -2, 1 , 1], d� = [1 , 0, - 1], and d� = [0, 1, - 1]. Since d = d 1 2d2 , there is column degeneracy. (Note that there is row degeneracy3 afso. ) +This means that one of the deviation vectors-for example, d 3 , lies in the plane generated by the other two residual vectors. Consequently, the th volume is zero. This case is illustrated in Figure 3.9 and may beree-dimensional verified algebraically by showing that I S I = We have i'
0.
3 6 5
3 4
Figure 3.9 A case where the three dimensional volume is zero ( I S I = 0).
1 38
Chap.
3
Sample Geometry and Random Sampling
s (3 X 3 )
and from Definition 2A.24,
-
-[ �
-
•
3 { 1 i) + GH - � o) + o = � - � = o When large data sets are sent and received electronically, investigators are sometimes unpleasantly surprised to find a case of zero generalized variance, so that S does not have an inverse. We have encountered several such cases, with their associated difficulties, before the situation was unmasked. A singular covariance matrix occurs when, for instance, the data are test scores and the investigator has included variables that are sums of the others. For example, an algebra score and a geometry score could be combined to give a total math score, or class midterm and final exam scores summed to give total points. Once, the total weight of a num ber of chemicals was included along with that of each component. This common practice of creating new variables that are sums of the original variables and then including them in the data set has caused enough lost time that we emphasize the consequences. =
Example 3. 1 0
(Creating new variables that lead to a zero generalized variance)
Consider the data matrix
10 16 10 12 X= 13 3 11 14 where the third column is the sum of first two columns. These data could be the number of successful phone solicitations per day by a part-time and a full time employee, respectively, so the third column is the total number of suc cessful solicitations per day. Show that the generalized variance I S I = 0, and determine the nature of the dependency in the data. 1
9 4 12 2 5 8
Sec.
Generalized Variance
3.4
1 39
We find that the mean corrected data matrix, with entries xjk - xk , is X - lx'
[
-2 -1 -3 3 2 1 1 0 -1 0 2 -2 1 1 0
The resulting covariance matrix is s =
2.5 0 2.5 0 2.5 2.5 2.5 2.5 5.0
]
We verify that, in this case, the generalized variance I s I = 2.5 2 X 5 + 0 + 0 - 2.53 - 2.5 3 - 0 = 0
In general, if the three columns of the data matrix X satisfy a linear constraint a 1 xj l + a2xj 2 + a3xj 3 = a constant for all j, then a 1 x 1 + a2 x 2 + a3 x3 = so that c,
c,
a l (xj l - x l ) + a2 (xj 2 - x2 ) + a3 (xj 3 - x3 ) = 0
for all j. That is,
(X - lx' ) a =
0
and the columns of the mean corrected data matrix are linearly dependent. Thus, the inclusion of the third variable, which is linearly related to the first two, has led to the case of a zero generalized variance. Whenever the columns of the mean corrected data matrix are lin early dependent, ( n - 1 ) Sa = (X - li' ) ' (X - li ' ) a = (X - li ' ) O = 0
and Sa = 0 establishes the linear dependency of the columns of S. Hence, l s i = o. Since Sa = 0 = Oa, we see that a is a scaled eigenvector of S associated with an eigenvalue of zero. This gives rise to an important diagnostic: If we are unaware of any extra variables that are linear combinations of the others, we can find them by calculating the eigenvectors of S and identifying the one associated with a zero eigenvalue. That is, if we were unaware of the depen dency in this example, a computer calculation would find an eigenvalue pro portional to a' = [1, 1, - 1], since
1 40
Chap.
3
[
]
Sample Geometry and Random Sampling
Sa =
2.5 0 2.5 0 2.5 2.5 2.5 2.5 5.0
The coefficients reveal that
for allj In addition, the sum of the first two variables minus the third is a constant c for all n units. Here the third variable is actually the sum of the first two vari ables, so the columns of the original data matrix satisfy a linear constraint with c = 0. Because we have the special case c = 0, the constraint establishes the fact that the columns of the data matrix are linearly dependent. Let us summarize the important equivalent conditions for a generalized vari ance to be zero that we discussed in the preceding example. Whenever a nonzero vector a satisfies one of the following three conditions, it satisfies all of them: ( 1 ) Sa = 0 (2) a' (xj - x) = 0 for allj (3) a'xj = c for allj (c = a'x) l (xj l - xd + l (xj 2 - .X2 ) + ( - l ) (xj 3 - .X3 ) = 0
•
a
ear combicorrenctateiodn a scaloerdof with Theof thleinmean eieiggisenval envect ue data, using is zero. 0.
S
a,
Thethe orlinigearinalcombi nusatiinogn of dat a , is a constant. a,
We showed that if condition (3) is satisfied-that is, if the values for one variable can be expressed in terms of the others-then the generalized variance is zero because S has a zero eigenvalue. In the other direction, if condition (1) holds, then the eigenvector a gives coefficients for the linear dependency of the mean cor rected data. In any statistical analysis, I S I = 0 means that the measurements on some variables should be removed from the study as far as the mathematical computa tions are concerned. The corresponding reduced data matrix will then lead to a covariance matrix of full rank and a nonzero generalized variance. The question of which measurements to remove in degenerate cases is not easy to answer. When there is a choice, one should retain measurements on a (presumed) causal variable instead of those on a secondary characteristic. We shall return to this subject in our discussion of principal components. At this point, we settle for delineating some simple conditions for S to be of full rank or of reduced rank. Result 3.3. If n .;;; p, that is, (sample size) .;;; (number of variables), then IS I = 0 for all samples. Proof. We must show that the rank of S is less than or equal top and then apply Result 2A.9.
Sec.
3.4
Generalized Variance
1 41
For any fixed sample, the n row vectors in (3-18) sum to the zero vector. The existence of this linear combination means that the rank of X - 1X1 is less than or equal to n - 1, which, in turn, is less than or equal to p - 1 because n :s:; p. Since (n - 1) s
(p Xp )
= (X - 1X1) 1 (X - 1X1) (n x p)
(p X n )
the kth column of S, colk (S), can be written as a linear combination of the rows of (X - l.X1) 1 • In particular, (n - 1 ) colk (S) = (X - 1X1 ) 1 colk (X - 1X 1 )
= (xl k - xd row1 ( X - 1X1 ) 1 + . . . + (xn k - xk ) rown (X - 1X1) 1 Since the row vectors of (X - 1X1 ) 1 sum to the zero vector, we can write, for example, row1 {X - 1X1 ) 1 as the negative of the sum of the remaining row vectors. After substituting for row 1 (X - 1X1 ) 1 in the peceding equation, we can express
colk (S) as a linear combination of the at most n - 1 linearly independent row vec tors row2 (X - 1X1) 1, , rown (X - 1X1 ) 1• The rank of S is therefore less than or equal to n - 1, which-as noted at the beginning of the proof-is less than or equal • to p - 1, and S is singular. This implies, from Result 2A.9, that S I = 0. • • •
I
Resu lt 3.4. Let the row vectors x 1 , x2 , , xn , where x; is the jth row of the data matrix X, be realizations of the independent random vectors X 1 , X2 , , Xn . Then • • •
• • •
c) c
1. If the linear combination a 1 Xj has positive variance for each constant vector a 'I: 0, then, provided that p < n, S has full rank with probability 1 and
c
l s i > o. 2. If, with probability 1, &1 Xj is a constant (for example,
Proof.
ity 1 , &1Xj n =
=
c
=
0.
(Part 2). If &1 Xj = a 1 Xj 1 + a 2 Xj 2 + . . . + ap Xjp = with probabil for all j, and the sample mean of this linear combination is
L ( a1 xj 1 +
j= l
for all j, then I S I
a 2xj 2 + . . . + ap xjp ) /n
-
_
[
= a 1 .X1 + a2.X2
a 1 i � � 81 X : &1 i � - a1 i
+
] [c c] c-c =
�
:
. . . + aP .XP = a1i. Then
=
0
indicating linear dependence; the conclusion follows from Result 3.2. The proof of Part (1) is difficult and can be found in [2].
•
1 42
Chap.
3
Sample Geometry and Random Sampling
Generalized Variance Determined by I R and Its Geometrical Interpretation
I
The generalized sample variance is unduly affected by the variability of measure ments on a single variable. For example, suppose some s; ; is either large or quite small. Then, geometrically, the corresponding deviation vector d; = (y; - :X;l) will be very long or very short and will therefore clearly be an important factor in determining volume. Consequently, it is sometimes useful to scale all the deviation vectors so that they have the same length. Scaling the residual vectors is equivalent to replacing each original observa tion xi k by its standardized value (xi k - xk )/ Vi;;; . The sample covariance matrix of the standardized variables is then R, the sample correlation matrix of the origi nal variables. ( See Exercise 3.13.) We define
Since the resulting vectors all have length �, the generalized sample variance of the standardized vari ables will be large when these vectors are nearly perpendicular and will be small when two or more of these vectors are in almost the same direction. Employing the argument leading to (3-7), we readily find that the cosine of the angle (Jik between - :X;l)/\/S;; and ( yk - :Xkl)/� is the sample correlation coefficient r;k · Therefore, we can make the statement that I R I is large when all the r;k are nearly zero and it is small when one or more of the r;k are nearly + 1 or - 1 . I n sum, we have the following result: Let
( Y;
(Y; - :X;l) Vi;;
Xu - X; Vi;; X2 ; - X; Vi;;
i = 1, 2, . . . , p
Xn ; - X; Vi;;
be the deviation vectors of the standardized variables. These deviation vectors lie in the direction of d; , but have a squared length of n - 1. The volume generated
Sec. 3.4 Generalized Variance
1 43
3
Figure 3.1 0 The volume generated by equal-length deviation vectors of the standardized variables.
in p-space by the deviation vectors can be related to the generalized sample vari ance. The same steps that lead to (3-15) produce
( Generalized sa�ple va�iance ) = I R I = (n of the standardized vanables
_
1 ) -P (volume) z
(3-20)
The volume generated by deviation vectors of the standardized variables is illustrated in Figure 3.10 for the deviation vectors graphed in Figure 3.6. A com parison of Figures 3.10 and 3.6 reveals that the influence of the d 2 vector (large variability in x 2 ) on the squared volume I S I is much greater than its influence on the squared volume I R 1 . The quantities I S I and I R I are connected by the relationship (3-21) so (3 -22) [The proof of (3-21) is left to the reader as Exercise 3.12.] Interpreting (3-22) in terms of volumes, we see from (3-15) and (3 -20) that the squared volume (n - 1)P I S I is proportional to the squared volume (n - 1 ) P I R 1 . The constant of proportionality is the product of the variances, which, in turn, is proportional to the product of the squares of the lengths (n - 1)s;; of the d; . Equation (3-21) shows, algebraically, how a change in the measurement scale of X1 , for example, will alter the relationship between the generalized variances. Since I R I is based on standardized measurements, it is unaffected by the change in scale. However, the relative value of I S I will be changed whenever the multi plicative factor s 1 1 changes.
144
Chap.
3
Sample Geometry and Random Sampling
Example 3 . 1 1
(Ill ustrating the relation between I S I and I R I )
3.
Let us illustrate the relationship in and I R I when p = Suppose
(3-21 ) for the generalized variances I S I
(3 sX3) Then s 1 1 = 4 , s22 = 9, and s33 = 1 . Moreover,
1 � �� ( - 1 )2 + 3 �� �� (- 1 ) 3 + 1 � � �� (-1)4
Using Definition 2A.24 , we obtain
lsl = 4
- 3 ( 3 - 2 ) + 1(6 - 9) = 14 I R I = 1 1 � ! ( - 1 )2 + ! I ! I ( - 1 )3 + ( - 1 )4 I t ! It �I = ( 1 - �) - CD G - D + 0, P [ - e < Y - J-t < e ] approaches unity as n � oo . •
Proof. See [9].
As a direct consequence of the law of large numbers, which says that each Xi converges in probability to J-t i , i = 1 , 2, . . . , p,
X converges in probability to fL
(4-26)
i = Sn ) converges in probability to I
(4-27)
Also, each sample covariance sik converges in probability to O"ik ' i, k = 1 , 2, . . . , p, and S ( or
Statement (4-27) follows from writing n (n
(�i - Xi ) (� k - Xk ) - 1 ) sik = 2: j= i II
= 2: (Xj i - J-t i + ILi - Xi) (Xj k - ILk + f.-tk - Xk ) j=i n
= 2: (�i - J-t J (Xj k - f.-tk) + n (Xi - J-t J (Xk - f.-t k ) j= i Letting Jj = (�i - J-t i) (�k - ILk ) , with E ( Y) = O"ik ' we see that the first term in sik converges to O"ik and the second term converges to zero, by applying the law
of large numbers. The practical interpretation of statements (4-26) and ( 4-27) is that, with high probability, X will be close to /L and S will be close to I whenever the sample size is large. The statement concerning X is made even more precise by a multivariate version of the central limit theorem.
Sec. 4.5 Large-Sample Behavior of X and
S
1 87
Result 4. 1 3 (The central limit theorem). Let X 1 , X 2 , . . . , Xn be indepen dent observations from any population with mean f,1, and finite covariance I. Then
Vn ( X - f,l,) has an approximate NP (O, I) distribution
for large sample sizes. Here n should also be large relative to p. Proof. See [1].
•
The approximation provided by the central limit theorem applies to discrete, as well as continuous, multivariate populations. Mathematically, the limit is exact, and the approach to norma_!!ty is often fairly rapid. Moreover, from the results in Section 4.4, we know that X is exactly normally distributed when the underlying population is normal. Thus, we would expect the central limit theorem approxima tion to be quite good for moderate n when the parent population is nearly normal. As we have seen, when n is large, S is close to I with high probability. Con sequently, replacing I by S in the approximating normal distribution for X will have a negligible effect on subsequent probability calculations. Result 4.7 can be used to show that n ( X - f,l,) ' I - 1 ( X - f,l,) has a xi distribu-
( � )
tion when X is distributed as NP f,l,, I or, equivalently, when
Vn (X - f,1,) has
an NP (O, I)jistribution. '!}1e xi distribu�n is approximately the sampling distrib ution of n ( X - f,l,) ' I - 1 ( X - f,l,) when X is approximately normally distributed. Replacing I - 1 by s-t does not seriously affect this approximation for n large and much greater than p. We summarize the major conclusions of this section as follows:
In the next three sections, we consider ways of verifying the assumption of normality and methods for transforming nonnormal observations into observations that are approximately normal.
1 88
Chap. 4 The Multivariate Normal Distribution
4.6 ASSESSING THE ASSUMPTION OF NORMALITY I
As we have pointed out, most of the statistical techniques discussed in subsequent chapters assume that each vector observation Xj comes from a multivariate normal distribution. On the other hand, in situations where the sample size is large and the techniques depend solely on the behavior of X, or distances involving X of the form n ( X - p)'S- 1 ( X - p), the assumption of normality for the individual obser vations is less crucial. But to some degree, the quality of inferences made by these methods depends on how closely the true parent population resembles the multi variate normal form. It is imperative, then, that procedures exist for detecting cases where the data exhibit moderate to extreme departures from what is expected under multivariate normality. We want to answer this question: Do the observations Xj appear to violate the assumption that they came from a normal population? Based on the properties of normal distributions, we know that all linear combinations of normal variables are normal and the contours of the multivariate normal density are ellipsoids. Therefore, we address these questions: 1. Do the marginal distributions of the elements of X appear to be normal? What about a few linear combinations of the components X;? 2. Do the scatter plots of pairs of observations on different characteristics give the elliptical appearance expected from normal populations? 3. Are there any "wild" observations that should be checked for accuracy?
It will become clear that our investigations of normality will concentrate on the behavior of the observations in one or two dimensions (for example, marginal distributions and scatter plots). As might be expected, it has proved difficult to con struct a "good" overall test of joint normality in more than two dimensions because of the large number of things that can go wrong. To some extent, we must pay a price for concentrating on univariate and bivariate examinations of normality: We can never be sure that we have not missed some feature that is revealed only in higher dimensions. (It is possible, for example, to construct a nonnormal bivariate distribution with normal marginals. [See Exercise 4.8.]) Yet many types of non normality are often reflected in the marginal distributions and scatter plots. More over, for most practical work, one-dimensional and two-dimensional investigations are ordinarily sufficient. Fortunately, pathological data sets that are normal in lower dimensional representations, but nonnormal in higher dimensions, are not frequently encountered in practice. Evaluating the Normality of the Univariate Marginal Distributions
Dot diagrams for smaller n and histograms for n > 25 or so help reveal situations where one tail of a univariate distribution is much longer than the other. If the his togram for a variable X; appears reasonably symmetric, we can check further by
Sec.
4.6
Assessing the Assumption of Normality
1 89
counting the number of observations in certain intervals. A univariate normal dis tribution assigns probability .683 to the interval ( p, ; - � , IL ; + � ) and prob ability .954 to the interval ( P, ; - 2 � , IL ; + 2 � ) . Consequently, with a large sample size n, we expect the observed proportion P; 1 of the observations lying in the interval ( .X; - YS;; , x; + Vi;; ) to be about .683. Similarly, the observed pro portion P ; z of the observations in ( .X; - 2 v'i;; , .X; + 2 � ) should be about .954. Using the normal approximation to the sampling distribution of P; (see [9]), we observe that either (.683) ( .317) n
-'----� --' -..!...
lfii l - .683 1 > 3
=
1 .396
Vn
or (4-29) would indicate departures from an assumed normal distribution for the ith char acteristic. When the observed proportions are too small, parent distributions with thicker tails than the normal are suggested. Plots are always useful devices in any data analysis. Special plots called Q-Q plots can be used to assess the assumption of normality. These plots can be made for the marginal distributions of the sample observations on each variable. They are, in effect, plots of the sample quantile versus the quantile one would expect to observe if the observations actually were normally distributed. When the points lie very nearly along a straight line, the normality assumption remains tenable. Nor mality is suspect if the points deviate from a straight line. Moreover, the pattern of the deviations can provide clues about the nature of the nonnormality. Once the reasons for the nonnormality are identified, corrective action is often possible. (See Section 4.8.) To simplify notation, let x 1 , x 2 , , x, represent n observations on any single characteristic X; . Let x( l l :s;; x 1.39
- .003293 .128831
J
- 62,309 [ 126,974 4224 - 2927
J
X 10 - 5
and this point falls outside the 50% contour. The remaining nine points have generalized distances from i of 1.20, .59, .83, 1.88, 1.01, 1 .02, 5.33, .81, and .97, respectively. Since seven of these distances are less than 1 .39, a proportion, .70, of the data falls within the 50% contour. If the observations were nor mally distributed, we would expect about half, or 5, of them to be within this contour. This large a difference in proportions would ordinarily provide evi dence for rejecting the notion of bivariate normality; however, our sample size of 10 is too small to reach this conclusion. (See also Example 4.13.) • Computing the fraction of the points within a contour and subjectively com paring it with the theoretical probability is a useful, but rather rough, procedure. A somewhat more formal method for judging the joint normality of a data set is based on the squared generalized distances - ( xj - -x ) 'S - 1 ( xj - -x ) , 1· - 1 , 2 , . . . , n dj2 (4-32)
1 96
Chap. 4 The M ultivariate Normal Distribution
where x l > x 2 , , X 11 are the sample observations. The procedure we are about to describe is not limited to the bivariate case; it can be used for all p � 2. When the parent population is multivariate normal and both n and n p are greater than 25 or 30, each of the squared distances df , di , . . . , d� should behave like a chi-square random variable. [ See Result 4.7 and Equations (4-26) and (4-27).] Although these distances are not independent or exactly chi-square distributed, it is helpful to plot them as if they were. The resulting plot is called a chi-square plot or gamma plot, because the chi-square distribution is a special case of the more general gamma distribution. (See [6].) To construct the chi-square plot:
-
. . •
1. Order the squared distances 2.
-
d{l ) :;;:;; d{z)
:;;:;;
· · · :;;:;; d{n) .
m
( 4-32) from smallest to largest as
Graph the pairs (qc, p ((j - D in), d{n ), where qc, p ( (j D i n) is the 100 (j ! ) In quantile of the chi-square distribution with p degrees of freedom. -
-
Quantiles are specified in terms of proportions, whereas percentiles are spec ified in terms of percentages. The quantiles qc, p ((j !)In) are related to the upper percentiles of a chi squared distribution. In particular, qc, p ((j - !)In) = x� ( (n - j + ! )In). The plot should resemble a straight line through the origin having slope 1. A systematic curved pattern suggests lack of normality. One or two points far above the line indicate large distances, or outlying observations, that merit further attention. Example 4. 1 3
Example 4.13 (Constructing a chi-square plot)
Let us construct a chi-square plot of the generalized distances given in Example 4.12. The ordered distances and the corresponding chi-square quantiles for p = 2 and n = 10 are listed in the following table:

 j    d_(j)²    q_{c,2}((j − ½)/10)
 1      .59          .10
 2      .81          .33
 3      .83          .58
 4      .97          .86
 5     1.01         1.20
 6     1.02         1.60
 7     1.20         2.10
 8     1.88         2.77
 9     4.34         3.79
10     5.33         5.99
Figure 4.7  A chi-square plot of the ordered distances in Example 4.13.
A graph of the pairs (q_{c,2}((j − ½)/10), d_(j)²) is shown in Figure 4.7. The points in Figure 4.7 do not lie along the line with slope 1. The smallest distances appear to be too large and the middle distances appear to be too small, relative to the distances expected from bivariate normal populations for samples of size 10. These data do not appear to be bivariate normal; however, the sample size is small, and it is difficult to reach a definitive conclusion. If further analysis of the data were required, it might be reasonable to transform them to observations more nearly bivariate normal. Appropriate transformations are discussed in Section 4.8. •
In addition to inspecting univariate plots and scatter plots, we should check multivariate normality by constructing a chi-square, or d², plot. Figure 4.8 on page 198 contains d² plots based on two computer-generated samples of 30 four-variate normal random vectors. As expected, the plots have a straight-line pattern, but the top two or three ordered squared distances are quite variable. The next example contains a real data set comparable to the simulated data set that produced the plots in Figure 4.8.
Figure 4.8  Chi-square plots for two simulated four-variate normal data sets with n = 30.
Example 4.14 (Evaluating multivariate normality for a four-variable data set)

The data in Table 4.3 were obtained by taking four different measures of stiffness, x_1, x_2, x_3, and x_4, of each of n = 30 boards. The first measurement involves sending a shock wave down the board, the second measurement is determined while vibrating the board, and the last two measurements are obtained from static tests. The squared distances d_j² = (x_j − x̄)′ S⁻¹ (x_j − x̄) are also presented in the table.
TABLE 4.3  FOUR MEASUREMENTS OF STIFFNESS

Observation no.     x1      x2      x3      x4       d²
  1                1889    1651    1561    1778      .60
  2                2403    2048    2087    2197     5.48
  3                2119    1700    1815    2222     7.62
  4                1645    1627    1110    1533     5.21
  5                1976    1916    1614    1883     1.40
  6                1712    1712    1439    1546     2.22
  7                1943    1685    1271    1671     4.99
  8                2104    1820    1717    1874     1.49
  9                2983    2794    2412    2581    12.26
 10                1745    1600    1384    1508      .77
 11                1710    1591    1518    1667     1.93
 12                2046    1907    1627    1898      .46
 13                1840    1841    1595    1741     2.70
 14                1867    1685    1493    1678      .13
 15                1859    1649    1389    1714     1.08
 16                1954    2149    1180    1281    16.85
 17                1325    1170    1002    1176     3.50
 18                1419    1371    1252    1308     3.99
 19                1828    1634    1602    1755     1.36
 20                1725    1594    1313    1646     1.46
 21                2276    2189    1547    2111     9.90
 22                1899    1614    1422    1477     5.06
 23                1633    1513    1290    1516      .80
 24                2061    1867    1646    2037     2.54
 25                1856    1493    1356    1533     4.58
 26                1727    1412    1238    1469     3.40
 27                2168    1896    1701    1834     2.38
 28                1655    1675    1414    1597     3.00
 29                2326    2301    2065    2234     6.28
 30                1490    1382    1214    1284     2.58

Source: Data courtesy of William Galligan.
Figure 4.9  A chi-square plot for the data in Example 4.14.
The marginal distributions appear quite normal (see Exercise 4.33), with the possible exception of specimen (board) 9.

To further evaluate multivariate normality, we constructed the chi-square plot shown in Figure 4.9. The two specimens with the largest squared distances are clearly removed from the straight-line pattern. Together with the next largest point or two, they make the plot appear curved at the upper end. We will return to a discussion of this plot in Example 4.15. •
We have discussed some rather simple techniques for checking the normality assumption. Specifically, we advocate calculating the d_j², j = 1, 2, ..., n [see Equation (4-32)] and comparing the results with χ² quantiles. For example, p-variate normality is indicated if:

1. Roughly half of the d_j² are less than or equal to q_{c,p}(.50).
2. A plot of the ordered squared distances d_(1)² ≤ d_(2)² ≤ ··· ≤ d_(n)² versus q_{c,p}((1 − ½)/n), q_{c,p}((2 − ½)/n), ..., q_{c,p}((n − ½)/n), respectively, is nearly a straight line having slope 1 and which passes through the origin.

(See [6] for a more complete exposition of methods for assessing normality.)
We close this section by noting that all measures of goodness of fit suffer the same serious drawback. When the sample size is small, only the most aberrant behavior will be identified as lack of fit. On the other hand, very large samples invariably produce statistically significant lack of fit. Yet the departure from the specified distribution may be very small and technically unimportant to the inferential conclusions.
4.7 DETECTING OUTLIERS AND CLEANING DATA
Most data sets contain one or a few unusual observations that do not seem to belong to the pattern of variability produced by the other observations. With data on a single characteristic, unusual observations are those that are either very large or very small relative to the others. The situation can be more complicated with multivariate data. Before we address the issue of identifying these outliers, we must emphasize that not all outliers are wrong numbers. They may, justifiably, be part of the group and may lead to a better understanding of the phenomena being studied.

Outliers are best detected visually whenever this is possible. When the number of observations n is large, dot plots are not feasible. When the number of characteristics p is large, the large number of scatter plots p(p − 1)/2 may prevent viewing them all. Even so, we suggest first visually inspecting the data whenever possible.

What should we look for? For a single random variable, the problem is one dimensional, and we look for observations that are far from the others. For instance, the dot diagram
[dot diagram of the observations on a single variable, with one point far to the right of the main cluster]
reveals a single large observation.

In the bivariate case, the situation is more complicated. Figure 4.10 on page 201 shows a situation with two unusual observations. The data point circled in the upper right corner of the figure is removed from the pattern, and its second coordinate is large relative to the rest of the x_2 measurements, as shown by the vertical dot diagram. The second outlier, also circled, is far from the elliptical pattern of the rest of the points, but, separately, each of its components has a typical value. This outlier cannot be detected by inspecting the marginal dot diagrams.

In higher dimensions, there can be outliers that cannot be detected from the univariate plots or even the bivariate scatter plots. Here a large value of (x_j − x̄)′ S⁻¹ (x_j − x̄) will suggest an unusual observation, even though it cannot be seen visually.
Figure 4.10  Two outliers; one univariate and one bivariate.
Steps for Detecting Outliers

1. Make a dot plot for each variable.
2. Make a scatter plot for each pair of variables.
3. Calculate the standardized values z_jk = (x_jk − x̄_k)/√s_kk for j = 1, 2, ..., n and each column k = 1, 2, ..., p. Examine these standardized values for large or small values.
4. Calculate the generalized squared distances (x_j − x̄)′ S⁻¹ (x_j − x̄). Examine these distances for unusually large values. In a chi-square plot, these would be the points farthest from the origin.
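A minimal sketch of steps 3 and 4 is given below, assuming the observations are the rows of an (n × p) NumPy array named X (the array name is illustrative only).

```python
# A minimal sketch of the standardized values and generalized distances used
# for outlier detection.
import numpy as np

def standardized_values(X):
    """z_jk = (x_jk - xbar_k) / sqrt(s_kk), computed column by column."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def generalized_distances(X):
    """d_j^2 = (x_j - xbar)' S^{-1} (x_j - xbar) for each observation."""
    X = np.asarray(X, dtype=float)
    dev = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum('ij,jk,ik->i', dev, S_inv, dev)

# Observations with |z_jk| larger than about 3.5, or with d_j^2 beyond an upper
# chi-square percentile, are flagged for further checking.
```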
In step 3, "large" must be interpreted relative to the sample size and number of variables. There are n × p standardized values. When n = 100 and p = 5, there are 500 values. You expect 1 or 2 of these to exceed 3 or be less than −3, even if the data came from a multivariate distribution that is exactly normal. As a guideline, 3.5 might be considered large for moderate sample sizes.

In step 4, "large" is measured by an appropriate percentile of the chi-square distribution with p degrees of freedom. If the sample size is n = 100, we would expect 5 observations to have values of d_j² that exceed the upper fifth percentile of the chi-square distribution. A more extreme percentile must serve to determine observations that do not fit the pattern of the remaining data.

The data we presented in Table 4.3 concerning lumber have already been cleaned up somewhat. Similar data sets from the same study also contained data on x_5 = tensile strength. Nine observation vectors, out of the total of 112, are given as rows in the following table, along with their standardized values.
  x1     x2     x3     x4     x5       z1      z2      z3      z4      z5
 1631   1528   1452   1559   1602     .06    −.15     .05     .28    −.12
 1770   1677   1707   1738   1785     .64     .43    1.07     .94     .60
 1376   1190    723   1285   2791   −1.01   −1.47   −2.87    −.73   (>4.5)
 1705   1577   1332   1703   1664     .37     .04    −.43     .81     .13
 1643   1535   1510   1494   1582     .11    −.12     .28     .04    −.20
 1567   1510   1301   1405   1553    −.21    −.22    −.56    −.28    −.31
 1528   1591   1714   1685   1698    −.38     .10    1.10     .75     .26
 1803   1826   1748   2746   1764     .78    1.01    1.23   (>4.5)    .52
 1587   1554   1352   1554   1551    −.13    −.05    −.35     .26    −.32
The standardized values are based on the sample mean and variance, calculated from all 112 observations. There are two extreme standardized values. Both are too large, with standardized values over 4.5. During their investigation, the researchers recorded measurements by hand in a logbook and then performed calculations that produced the values given in the table. When they checked their records regarding the values pinpointed by this analysis, errors were discovered. The value x_5 = 2791 was corrected to 1241, and x_4 = 2746 was corrected to 1670. Incorrect readings on an individual variable are quickly detected by locating a large leading digit for the standardized value.

The next example returns to the data on lumber discussed in Example 4.14.

Example 4.15 (Detecting outliers in the data on lumber)
Table 4.4 on page 203 contains the data in Table 4.3, along with the standardized observations. These data consist of four different measures of stiffness x_1, x_2, x_3, and x_4, on each of n = 30 boards. Recall that the first measurement involves sending a shock wave down the board, the second measurement is determined while vibrating the board, and the last two measurements are obtained from static tests. The standardized measurements are

    z_jk = (x_jk − x̄_k)/√s_kk,   k = 1, 2, 3, 4;  j = 1, 2, ..., 30

and the squares of the distances are d_j² = (x_j − x̄)′ S⁻¹ (x_j − x̄).
TABLE 4.4  FOUR MEASUREMENTS OF STIFFNESS WITH STANDARDIZED VALUES

Observation no.     z1      z2      z3       d²
  1                −.1     −.3      .2       .60
  2                1.5      .9     1.9      5.48
  3                 .7     −.2     1.0      7.62
  4                −.8     −.4    −1.3      5.21
  5                 .2      .5      .3      1.40
  6                −.6     −.1     −.2      2.22
  7                 .1     −.2     −.8      4.99
  8                 .6      .2      .7      1.49
  9                3.3     3.3     3.0     12.26
 10                −.5     −.5     −.4       .77
 11                −.6     −.5      .0      1.93
 12                 .4      .5      .4       .46
 13                −.2      .3      .3      2.70
 14                −.1     −.2     −.1       .13
 15                −.1     −.3     −.4      1.08
 16                 .1     1.3    −1.1     16.85
 17               −1.8    −1.8    −1.7      3.50
 18               −1.5    −1.2     −.8      3.99
 19                −.2     −.4      .3      1.36
 20                −.6     −.5     −.6      1.46
 21                1.1     1.4      .1      9.90
 22                −.0     −.4     −.3      5.06
 23                −.8     −.7     −.7       .80
 24                 .5      .4      .5      2.54
 25                −.2     −.8     −.5      4.58
 26                −.6    −1.1     −.9      3.40
 27                 .8      .5      .6      2.38
 28                −.8     −.2     −.3      3.00
 29                1.3     1.7     1.8      6.28
 30               −1.3    −1.2    −1.0      2.58

The measurements x_1, x_2, x_3, x_4 and the distances d² are those of Table 4.3.

4.8 TRANSFORMATIONS TO NEAR NORMALITY

The family of Box-Cox power transformations is indexed by a parameter λ. (See [8].) Given the observations x_1, x_2, ..., x_n, the Box-Cox solution for the choice of an appropriate power λ is the solution that maximizes the expression
    ℓ(λ) = −(n/2) ln[ (1/n) Σ_{j=1}^n (x_j^(λ) − x̄^(λ))² ] + (λ − 1) Σ_{j=1}^n ln x_j        (4-38)

where x_j^(λ) = (x_j^λ − 1)/λ and x̄^(λ) = (1/n) Σ_{j=1}^n x_j^(λ).
With multivariate observations, a power transformation is selected for each of the p measured characteristics by maximizing (4-38) one variable at a time. The jth transformed multivariate observation is then

    x_j^(λ̂) = [ (x_j1^λ̂1 − 1)/λ̂1 , (x_j2^λ̂2 − 1)/λ̂2 , ..., (x_jp^λ̂p − 1)/λ̂p ]′

where λ̂_1, λ̂_2, ..., λ̂_p are the values that individually maximize (4-38). The procedure just described is equivalent to making each marginal distribution approximately normal. Although normal marginals are not sufficient to ensure that the joint distribution is normal, in practical applications they may be good enough. If not, we could start with the values λ̂_1, λ̂_2, ..., λ̂_p obtained from the preceding transformations and iterate toward the set of values λ′ = [λ_1, λ_2, ..., λ_p], which collectively maximizes
    ℓ(λ_1, λ_2, ..., λ_p) = −(n/2) ln | S(λ) | + (λ_1 − 1) Σ_{j=1}^n ln x_j1 + (λ_2 − 1) Σ_{j=1}^n ln x_j2 + ··· + (λ_p − 1) Σ_{j=1}^n ln x_jp        (4-40)
where S(λ) is the sample covariance matrix computed from the transformed observations

    x_j^(λ) = [ (x_j1^λ1 − 1)/λ1 , (x_j2^λ2 − 1)/λ2 , ..., (x_jp^λp − 1)/λp ]′,   j = 1, 2, ..., n
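A minimal sketch of choosing a single power by maximizing (4-38) over a grid of candidate λ values is given below; the array name x and the grid range are illustrative only. Applying the function to each column of a data matrix gives the individually chosen powers, which can serve as starting values for maximizing (4-40).

```python
# A minimal sketch of the univariate Box-Cox power selection in (4-38),
# assuming x is a NumPy array of positive observations.
import numpy as np

def box_cox(x, lam):
    """Power-transformed observations x^(lambda); the log transform when lambda = 0."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x**lam - 1.0) / lam

def ell(x, lam):
    """The criterion (4-38) evaluated at lambda."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xt = box_cox(x, lam)
    return -0.5 * n * np.log(np.mean((xt - xt.mean())**2)) + (lam - 1.0) * np.log(x).sum()

def best_lambda(x, grid=np.linspace(-2.0, 2.0, 81)):
    values = [ell(x, lam) for lam in grid]
    return grid[int(np.argmax(values))]
```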
"0
Figure 5.5  The X̄ chart for x1 = legal appearances overtime hours (centerline x̄1 = 3558, LCL = 1737).
variables can make it impossible to assess the overall error rate that is implied by a large number of univariate charts.

The two most common multivariate charts are (i) the ellipse format chart and (ii) the T² chart. Two cases that arise in practice need to be treated differently:

1. Monitoring the stability of a given sample of multivariate observations
2. Setting a control region for future observations

Initially, we consider the use of multivariate control procedures for a sample of multivariate observations x_1, x_2, ..., x_n. Later, we discuss these procedures when the observations are subgroup means.

Charts for Monitoring a Sample of Individual Multivariate Observations for Stability
We assume that X_1, X_2, ..., X_n are independently distributed as N_p(μ, Σ). By Result 4.8,

    X_j − X̄ = (1 − 1/n) X_j − (1/n) X_1 − ··· − (1/n) X_{j−1} − (1/n) X_{j+1} − ··· − (1/n) X_n
has

    Cov(X_j − X̄) = (1 − 1/n)² Σ + (n − 1) n⁻² Σ = ((n − 1)/n) Σ

and X_j − X̄ is distributed as N_p(0, ((n − 1)/n) Σ). However, X_j − X̄ is not independent of the sample covariance matrix S, so we use the approximate chi-square distribution to set control limits.

Ellipse Format Chart. The ellipse format chart for a bivariate control region is the more intuitive of the charts, but its approach is limited to two variables. The two characteristics on the jth unit are plotted as a pair (x_j1, x_j2). The 95% ellipse consists of all x that satisfy

    (x − x̄)′ S⁻¹ (x − x̄) ≤ χ²₂(.05)        (5-32)
Example 5.9
(An ellipse format chart for overtime hours)
Let us refer to Example 5.8 and create a quality ellipse for the pair of overtime characteristics (legal appearances, extraordinary event) hours. A computer calculation gives

    x̄ = [3558, 1478]′   and   S = [  367,884.7    −72,093.8
                                     −72,093.8   1,399,053.1 ]

We illustrate the ellipse format chart using the 99% ellipse, which consists of all x that satisfy

    (x − x̄)′ S⁻¹ (x − x̄) ≤ χ²₂(.01)

Here p = 2, so χ²₂(.01) = 9.21, and the ellipse becomes

    (s11 s22 / (s11 s22 − s12²)) [ (x1 − x̄1)²/s11 − 2 s12 (x1 − x̄1)(x2 − x̄2)/(s11 s22) + (x2 − x̄2)²/s22 ]

      = (367,884.7 × 1,399,053.1 / (367,884.7 × 1,399,053.1 − (−72,093.8)²))
        × [ (x1 − 3558)²/367,884.7 + 2(72,093.8)(x1 − 3558)(x2 − 1478)/(367,884.7 × 1,399,053.1) + (x2 − 1478)²/1,399,053.1 ] ≤ 9.21

This ellipse format chart is graphed, along with the pairs of data, in Figure 5.6.
Figure 5.7  The X̄ chart for x2 = extraordinary event hours (centerline x̄2 = 1478, LCL = −2071).
For the jth point, we calculate the T² statistic

    T_j² = (x_j − x̄)′ S⁻¹ (x_j − x̄)        (5-33)

We then plot the T² values on a time axis. The lower control limit is zero, and we use the upper control limit UCL = χ_p²(.05) or, sometimes, χ_p²(.01). There is no centerline in the T² chart. Notice that the T² statistic is the same as the quantity d_j² used to test normality in Section 4.6.
Example 5.10 (A T² chart for overtime hours)

Using the police department data in Example 5.8, we construct a T² plot based on the two variables X1 = legal appearances hours and X2 = extraordinary event hours. T² charts with more than two variables are considered in Exercise 5.23. We take α = .01 to be consistent with the ellipse format chart in Example 5.9.

The T² chart in Figure 5.8 on page 262 reveals that the pair (legal appearances, extraordinary event) hours for period 11 is out of control. Further investigation, as in Example 5.9, confirms that this is due to the large value of extraordinary event overtime during that period. •
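A minimal sketch of the T² chart computation in (5-33) is given below, assuming X is an (n × p) NumPy array of individual multivariate observations collected while the process was judged stable (the array name is illustrative only).

```python
# A minimal sketch of the T^2 chart for individual multivariate observations.
import numpy as np
from scipy import stats

def t2_values(X):
    X = np.asarray(X, dtype=float)
    dev = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum('ij,jk,ik->i', dev, S_inv, dev)     # T_j^2 for j = 1, ..., n

def out_of_control(X, alpha=0.01):
    """Indices of points whose T_j^2 exceeds the chi-square upper control limit."""
    t2 = t2_values(X)
    ucl = stats.chi2.ppf(1 - alpha, df=X.shape[1])
    return np.nonzero(t2 > ucl)[0]
```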
Figure 5.8  The T² chart for legal appearances hours and extraordinary event hours, α = .01.
When the multivariate T² chart signals that the jth unit is out of control, it should be determined which variables are responsible. A modified region based on Bonferroni intervals is frequently chosen for this purpose. The kth variable is out of control if x_jk does not lie in the interval

    ( x̄_k − t_{n−1}(.005/p) √s_kk ,  x̄_k + t_{n−1}(.005/p) √s_kk )

where p is the total number of measured variables.

Control Regions for Future Individual Observations
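A minimal sketch of this Bonferroni check is given below, assuming X is the (n × p) data array and j is the index of the flagged observation (both names are illustrative only).

```python
# A minimal sketch of the Bonferroni intervals used to see which variables put
# an out-of-control point outside its interval.
import numpy as np
from scipy import stats

def variables_out_of_control(X, j):
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    xbar = X.mean(axis=0)
    s = X.std(axis=0, ddof=1)                              # sqrt(s_kk) for each variable
    half_width = stats.t.ppf(1 - 0.005 / p, df=n - 1) * s
    outside = np.abs(X[j] - xbar) > half_width
    return np.nonzero(outside)[0]                          # columns k whose x_jk falls outside
```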
The goal now is to use data x_1, x_2, ..., x_n, collected when a process is stable, to set a control region for a future observation x or future observations. The region in which a future observation is expected to lie is called a forecast, or prediction, region. If the process is stable, we take the observations to be independently distributed as N_p(μ, Σ). Because these regions are of more general importance than just for monitoring quality, we give the basic distribution theory as Result 5.6.

Result 5.6. Let X_1, X_2, ..., X_n be independently distributed as N_p(μ, Σ), and let X be a future observation from the same distribution. Then

    T² = (n/(n + 1)) (X − X̄)′ S⁻¹ (X − X̄)   is distributed as   ((n − 1)p/(n − p)) F_{p, n−p}
If
--
�
�
and a 100(1 − α)% p-dimensional prediction ellipsoid is given by all x satisfying

    (x − x̄)′ S⁻¹ (x − x̄) ≤ ((n² − 1)p/(n(n − p))) F_{p, n−p}(α)

Proof. We first note that X − X̄ has mean 0. Since X is a future observation, X and X̄ are independent, so

    Cov(X − X̄) = Cov(X) + Cov(X̄) = Σ + (1/n) Σ = ((n + 1)/n) Σ

and, by Result 4.8, √(n/(n + 1)) (X − X̄) is distributed as N_p(0, Σ). Now,

    √(n/(n + 1)) (X − X̄)′ S⁻¹ √(n/(n + 1)) (X − X̄)

which combines a multivariate normal, N_p(0, Σ), random vector and an independent Wishart, W_{p, n−1}(Σ), random matrix in the form

    (multivariate normal random vector)′ (Wishart random matrix / d.f.)⁻¹ (multivariate normal random vector)

has the scaled F distribution claimed, according to (5-8) and the discussion on page 226. The constant for the ellipsoid follows from (5-6). •

Note that the prediction region in Result 5.6 for a future observed value x is an ellipsoid. It is centered at the initial sample mean x̄, and its axes are determined by the eigenvectors of S. Since

    P[ (X − X̄)′ S⁻¹ (X − X̄) ≤ ((n² − 1)p/(n(n − p))) F_{p, n−p}(α) ] = 1 − α

before any new observations are taken, the probability that X will fall in the prediction ellipse is 1 − α.

Keep in mind that the current observations must be stable before they can be used to determine control regions for future observations. Based on Result 5.6, we obtain the two charts for future observations.
Figure 5.9  The 95% control ellipse for future legal appearances and extraordinary event overtime.
    UCL = ((n − 1)p/(n − p)) F_{p, n−p}(.05)

Points above the upper control limit represent potential special cause variation and suggest that the process in question should be examined to determine whether immediate corrective action is warranted.
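A minimal sketch of monitoring a future observation against the limit implied by Result 5.6 is given below, assuming X holds the n stable observations used to set the limits and x_new is the new measurement vector (both names are illustrative only).

```python
# A minimal sketch of the T^2 check for a future individual observation.
import numpy as np
from scipy import stats

def future_t2(X, x_new):
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    dev = np.asarray(x_new, dtype=float) - X.mean(axis=0)
    t2 = (n / (n + 1.0)) * dev @ np.linalg.inv(np.cov(X, rowvar=False)) @ dev
    ucl = (n - 1.0) * p / (n - p) * stats.f.ppf(0.95, p, n - p)
    return t2, ucl                      # signal special cause variation if t2 > ucl
```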
Control Charts Based on Subsample Means
It is assumed that each random vector of observations from the process is independently distributed as N_p(μ, Σ). We proceed differently when the sampling procedure specifies that m > 1 units be selected, at the same time, from the process. From the first sample, we determine its sample mean X̄_1 and covariance matrix S_1. When the population is normal, these two random quantities are independent. For a general subsample mean X̄_j, the deviation X̄_j − X̄ from the average of the subsample means has a normal distribution with mean 0 and

    Cov(X̄_j − X̄) = (1 − 1/n)² Cov(X̄_j) + ((n − 1)/n²) Cov(X̄_1) = ((n − 1)/(nm)) Σ
where
=
n
n
nm
where X̄ = (1/n) Σ_{j=1}^n X̄_j. As will be described in Section 6.4, the sample covariances from the n subsamples can be combined to give a single estimate (called S_pooled in Chapter 6) of the common covariance Σ. This pooled estimate is

    S = (1/n)(S_1 + S_2 + ··· + S_n)

Here (nm − n)S is independent of each X̄_j and, therefore, of their mean X̄. Further, (nm − n)S is distributed as a Wishart random matrix with nm − n degrees of freedom. Notice that we are estimating Σ internally from the data collected in any given period. These estimators are combined to give a single estimator with a large number of degrees of freedom. Consequently,
-
. .
(5-3 5)
is distributed as (nm - n)p (nm - n - p
+ 1) Fp, nm - n - p + l
Sec.
5.6
Multivariate Quality Control Charts
267
Ellipse Format Chart. In an analogous fashion to our discussion on individual multivariate observations, the ellipse format chart for pairs of subsample means is (5-36) ( 05) ( -X X ) S ( X X ) (nm (-nm1) -(mn -- 11)) 2 F2 although the right-hand side is usually approximated as x� (.05)/m. Subsamples corresponding to points outside of the control ellipse should be carefully checked for changes in the behavior of the quality characteristics being measured. The interested reader is referred to [9] for additional discussion. T2 Chart. To construct a T2 chart with subsample data and p characteristics, we plot the quantity -
= '
-1 -
=
-
:so;
·
nm - n - 1 •
for j = 1, 2, . . . , n, where the UCL = ((nnm--1)n (m- p- 1)p1) F2 (.05) The UCL is often2 approximated as x� (.05) when n is large. Values of � that exceed the UCL correspond to potentially out-of-control or special cause variation, which should be checked. (See [9]. ) +
·
nm - n - p + 1
Control Regions for Future Subsample Observations
Once data are collected from the stable operation of a process, they can be used to setIfcontrol future observed subsampl�means. X is a limits futureforsubsample mean, then X - X has a multivariate normal dis tribution with mean 0 and - = Cov(X - X) = Cov(X) -n1 Cov(X-1 ) Consequently, -nm (X- - X)'S - 1 (X - = X) n 1 is distributed as +
=
-
+
(nm - n)p (nm - n - p + 1) Fp, nm - n - p + 1 Control Ellipse for Future Subsample Means. The prediction ellipse for a future subsample mean for p = 2 characteristics is defined by the set of all x such that
268
Chap. 5 Inferences about a Mean Vector
+ 1) (m - 1 ) 2 F 1 (.05) (5-37) (i - x )'S - 1 (i - x ) .:::; (nm(nm - n - 1) where, again, the right-hand side is usually approximated as xi (.05)/m. Chart for Future Subsample Means. As before, we bring n/ ( n + 1 ) into the control limit and plot the quantity 2, nm _ n
_
T2
for future sample means in chronological order. The upper control limit is then UCL = ((nmn +-l n) (m- p- +1 ) p1 ) FP (.05) The UCL is often approximated as xi (.05) when n is large. Points outside of the prediction ellipse or above the UCL suggest that the current values of the quality characteristics are different in some way from those of the previous stable process. This may be good or bad, but almost certainly war rants a careful search for the reasons for the change. ·
nm - n - p + t
5.7 INFERENCES ABOUT MEAN VECTORS WHEN SOME OBSERVATIONS ARE MISSING
Often, some components of a vector observation are unavailable. This may occur because of a breakdown in the recording equipment or because of the unwillingness of a respondent to answer a particular item on a survey questionnaire. The best way to handle incomplete observations, or missing values, depends, to a large extent, on the experimental context. If the pattern of missing values is closely tied to the value of the response, such as people with extremely high incomes who refuse to respond in a survey on salaries, subsequent inferences may be seriously biased. To date, no statistical techniques have been developed for these cases. However, we are able to treat situations where data are missing at random, that is, cases in which the chance mechanism responsible for the missing values is not influenced by the value of the variables.

A general approach for computing maximum likelihood estimates from incomplete data is given by Dempster, Laird, and Rubin [5]. Their technique, called the EM algorithm, consists of an iterative calculation involving two steps. We call them the prediction and estimation steps:

1. Prediction step. Given some estimate of the unknown parameters, predict the contribution of any missing observation to the (complete-data) sufficient statistics.
2. Estimation step. Use the predicted sufficient statistics to compute a revised estimate of the parameters.
Sec. 5.7 Inferences About Mean Vectors When Some Observations are Missing
269
The calculation cycles from one step to the other, until the revised estimates do not differ appreciably from the estimate obtained in the previous iteration. When the observations X 1 , X 2 , , X n are a random sample from ap-variate normal population, the prediction-estimation algorithm is based on the complete data sufficient statistics [see (4-21)] n T1 = � Xj = nX j= l and n T2 = � xj x; (n - 1)S + n X X ' j= l In this case, the algorithm proceeds as follows: We assume that the population mean and variance-It and I, respectively-are unknown and must be estimated. Prediction step. For each vector xj with missing values, let x]I> denote the missing denote those components which are available. Thus, x � = [x 1 >' x'] Given estimates ji and I from the estimation step, use the mean of the conditional1 normal distribution of x(ll , given x(2) , to estimate the missing values. That is, x .( l l = E (X< 1 > j x · I ) = + I 1 2 I-221 (x - 11 (2)) (5-38) estimates the contribution of xp> to T1 • Next, the predicted contribution of xp> to T2 is �?> x} 1 > : = E (X} 1 >x? > ' i xJ2 > ; ji, I) = I1 1 - I12I2i i2 1 + xp > xpr (5-39) and • . .
1
1
'
1
-
•
J
1
1
II.
,, ( ! )
' ,_ ,
.-
� = E (X< 1 >X(2)' j x(2> · ,, I ) 1
1
1
J
I
.-
1
, ,_,
=
xJ( l > x(2) ' 1
The contributions in (5-38) and (5-39) are summed over all x_j with missing components. The results are combined with the sample data to yield T̃_1 and T̃_2.

Estimation step. Compute the revised maximum likelihood estimates (see Result 4.11):

    μ̃ = T̃_1 / n,    Σ̃ = (1/n) T̃_2 − μ̃ μ̃′        (5-40)

We illustrate the computational aspects of the prediction-estimation algorithm in Example 5.12.
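A minimal sketch of the prediction-estimation iteration is given below, assuming X is an (n × p) NumPy array with np.nan marking the missing entries and that the fixed number of iterations is adequate (the array name, starting values, and iteration count are illustrative choices, not part of the original algorithm statement).

```python
# A minimal sketch of the EM iteration (5-38)-(5-40) for a p-variate normal
# sample with values missing at random.
import numpy as np

def em_normal(X, n_iter=50):
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    obs = ~np.isnan(X)
    mu = np.array([X[obs[:, k], k].mean() for k in range(p)])       # start from available entries
    Xfill = np.where(obs, X, mu)
    sigma = np.cov(Xfill, rowvar=False, bias=True)
    for _ in range(n_iter):
        T1 = np.zeros(p)
        T2 = np.zeros((p, p))
        for j in range(n):
            m, o = ~obs[j], obs[j]                                   # missing / observed parts
            xj = X[j].copy()
            C = np.zeros((p, p))                                     # conditional covariance of the missing part
            if m.any():
                Soo_inv = np.linalg.inv(sigma[np.ix_(o, o)])
                xj[m] = mu[m] + sigma[np.ix_(m, o)] @ Soo_inv @ (X[j, o] - mu[o])          # (5-38)
                C[np.ix_(m, m)] = sigma[np.ix_(m, m)] - sigma[np.ix_(m, o)] @ Soo_inv @ sigma[np.ix_(o, m)]
            T1 += xj
            T2 += np.outer(xj, xj) + C                               # (5-39)
        mu = T1 / n                                                   # (5-40)
        sigma = T2 / n - np.outer(mu, mu)
    return mu, sigma
```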
1 If all the components xi are missing, set ii.i
-
=
-
ji and ii.iii.j
- -,
=
I
+
p.;;: .
270
Chap. 5 Inferences about a Mean Vector
Example 5. 1 2 (Illustrating the EM algorithm)
Estimate the normal population mean and covariance I using the incom plete data set p
Here n = 4, p = 3, and parts of observation vectors x 1 and x4 are missing. We obtain the initial sample averages 7 +-5 = 6, I-L- 2 _- 0 + 2 � - 1 , I-L3- = 3 + 6 + 2 + 5 = 4 IL- l = 2 3 4 from the available observations. Substituting these averages for any missing values, so that .X = 6, for example, we can obtain initial covariance esti mates. We shall construct these estimates using the divisor n because the algorithm eventually produces the maximum likelihood estimate I. Thus, (6 - 6) 2 + (7 - 6) 2 + (5 - 6) 2 + (6 - 6) 2 1 (]"1 1 = 4 2 1 0"22 = 2' (6 - 6) (0 - 1) + (7 - 6) (2 - 1) + (5 6) (1 - 1) + (6 - 6) (1 - 1) 4 1 4 3 0:1 3 = 1 0"2 3 = 4' The prediction step consists of using the initial estimates ji and i to predict the contributions of the missing values to the sufficient statistics T1 and T2 . [See (5-38) and (5-39).] The first component of x 1 is missing, so we partition ji and I as 11
-
�
and predict
3 x11 - J.L1 + I1 2 I22 [ XX1132 - ILIL- 32 J - 6 + b:, 1 ] [ 24' 42 ] - ' [ 03 41 ] 5 .73 + (5 .73) 2 32.99
Sec. 5.7 Inferences About Mean Vectors When Some Observations are Missing
271
-
_
- - 1 -
1
_
_
�
_
�
-
-
_
_
=
For the two missing components of x4 , we partition ji and I as and predict
[ �:: ] E( [�:: ] l x43 5; ) [ �: ] [�] + [ � J (�) - 1 (5 - 4) [ �:� ] =
=
ji , I
=
=
for the contribution to T1 . Also, from (5 -39),
[ t n [�J 10. 94, and we reject H0 : Cp. = 0 (no treatment effects). To see which of the contrasts are responsible for the rejection of H0 we construct 95% simultaneous confidence intervals for these contrasts. From, (6-18), the contrast JL 2 ) = halothane influence JL4 ) - (JL 1 c { IL = (JL 3 is estimated by the interval 18(3) F3' 1 6 ( .05) �c{ Sc1 = 209.31 10 94 �9432.32 16 19 19 = 209. 3 1 ± 73. 7 0 where c{ is the first row of C. Similarly, the remaining contrasts are estimated by _
1
+
+
+
+ � r;-;;-;;; v 1U.�I4
302
Chap.
6
Comparisons of Several Multivariate Means
C02 pressure influence =
- 60.05 ± ViD.94 � 51��.84 - 60.05 ± 54.70
(JL1 + JL3 ) - ( JL 2 + JL4 ) :
H-C02 pressure "interaction" = ( JL 1 + JL4 ) - ( JL 2 + JL3 ) :
7557.44 = - 12.79 ± 65 .97 - 12.79 ± .. � �19v 1U.�4
The first confidence interval implies that there is a halothane effect. The presence of halothane produces longer times between heartbeats. This occurs at both levels of C02 pressure, since the H - C02 pressure interaction con trast, (JL1 + JL4 ) - (J.Lz - JLJ ) , is not significantly different from zero. (See the third confidence interval.) The second confidence interval indicates that there is an effect due to C0 2 pressure: The lower C02 pressure produces longer times between heartbeats. Some caution must be exercised in our interpretation of the results because the trials with halothane must follow those without. The apparent H-effect may be due to a time trend. (Ideally, the time order of all treatments • should be determined at random.)
(6-16) (6-16).
The test in is appropriate when the covariance matrix, Cov (X) = I, cannot be assumed to have any special structure. If it is reasonable to assume that I has a particular structure, tests designed with this structure in mind have higher power than the one in (For I with the equal correlation structure see a discussion of the "randomized block" design in or
(8-14),
[10] [16] . )
6.3 COMPARING MEAN VECTORS FROM TWO POPULATIONS
A T 2 -statistic for testing the equality of vector means from two multivariate pop ulations can be developed by analogy with the univariate procedure. (See for a discussion of the univariate case.) This T 2 -statistic is appropriate for comparing responses from one set of experimental settings (population 1) with indepen dent responses from another set of experimental settings (population The com parison can be made without explicitly controlling for unit-to-unit variability, as in the paired-comparison case. If possible, the experimental units should be randomly assigned to the sets of experimental conditions. Randomization will, to some extent, mitigate the effect of unit-to-unit variability in a subsequent comparison of treatments. Although some precision is lost relative to paired comparisons, the inferences in the two-popula tion case are, ordinarily, applicable to a more general collection of experimental units simply because unit homogeneity is not required.
[7]
2).
Sec.
6.3
Comparing Mean Vectors from Two Populations
303
1
Consider a random sample of size n 1 from population and a sample of size n2 from population 2. The observations on p variables can be arranged as: Summary statistics
Sample (Population 1) Xu • X 1 2 • . . · • x l n , (Population 2) X 2 1 • X zz , · · · • X zn,
In this notation, the first subscript-1 or 2-denotes the population. We want to make inferences about (mean vector of population 1) - (mean vector of population 2) = p,1 - p, 2 • For instance, we shall want to answer the ques tion, Is p, 1 = p, 2 (or, equivalently, is p, 1 l h = 0)? Also, if p, 1 - p, 2 "# 0, which component means are different? With a few tentative assumptions, we are able to provide answers to these questions. -
Assumptions Concerning the Structure of the Data
1. The sample X u , X 1 2 , . . . , X 1 11 , , is a random sample of size n 1 from a p-vari ate population with mean vector p, 1 and covariance matrix I1 • 2. The sample X 2 1 , X 22 , . . . , X 2 11, , is a :random sample of size n2 from a p-vari ate population with mean vector p, 2 and covariance matrix I 2 • 3. Also, X u , X 1 2 , . . . , X 1 11 , are independent of X 2 1 , X 22 , . . . , X 2 11 , .
(6-19)
We shall see later that, for large samples, this structure is sufficient for mak ing inferences about the p X 1 vector p, 1 - p, 2 . However, when the sample sizes n 1 and n2 are small, more assumptions are needed. Further Assu m ptions when
n1
and
n2
Are Small
1. Both populations are multivariate normal. 2. Also, I1 = I 2 (same covariance matrix).
(6-20)
The second assumption, that I 1 = I2 , is much stronger than its univ�riate coun terpart. Here we are assuming that several pairs of variances and covariances are nearly equal.
304
Chap.
Comparisons of Several Multivariate Means
6
nl
When I 1 = I2 = I, j�= l (x 1 i - x1 ) (x 1 i - x1 )' is an estimate of (n 1 - 1)I and of (n2 - 1)I. Consequently, we can pool � (x - x 2 ) (x 2i - x 2 )' is an estimate . j = l 2i the information in both samples in order to estimate the common covariance I. We set � (x 1 - i1 ) ( x 1i - x1 )' + � (x 2i - x2 ) ( x 2i - x2 )' j= l i j= 1 s pooled = n 1 + n2 - 2 (6-21) Since j�= l (x 1 i - x1 ) (x 1 i - id' has n1 - 1 d.f. and j�= l (x2i - x 2 ) (x2i - x 2 )' has n2 - 1 d.f., the divisor (n1 - 1) + (n2 - 1) in (6-21) is obtained by combining the two component degrees of freedom. [See (4-24).] Additional support for the pool ing procedure comes from consideration of the multivariate normal likelihood. (See Exercise 6.11.) To test the hypothesis that p - p = 80 a specified vector, we consider the squared statistical distance from x11 - x 22 to 80 •, Now, E(X 1 - X 2 ) = E(X 1 ) - E( X 2 ) = p 1 - p2 Since the independence assumption in (6-19) implies that X 1 and X 2 are indepen dent and thus Cov(X 1 , X 2 ) = 0 (see Result 4. 5 ), by (3-9), it follows that Cov(X1- - -X 2 ) = Cov(X- 1 ) + Cov(X- 2 ) = -nt1 I + -n1z I = ( -n1t + -n1z ) I (6-22) Because S pooted estimates I, we see that (�1 + �J s pooled is an estimator of Cov(X 1 - X 2 ). The likelihood ratio test of Ho : IL t - IL z = Bo is based on the square of the statistical distance, T2, and is given by (see [1]), Reject H0 if 1 yz = ( x t - X z - Bo ) ' [ (� + �J spoote d r ( x t - X z - Bo ) > c z (6-23) n2
�
�
�------------------�------------------
nl
n2
t
Sec.
6.3
Comparing Mean Vectors from Two Populations
305
where the critical distance c 2 is determined from the distribution of the two-sam ple T 2-statistic. Result 6.2. If X 1 1 , X 1 2 , . . . , X 1 11 1 is a random sample of size n 1 from NP (p.. 1 , :I) and X 2 1 , X 22 , . . . , X 2 112 is an independent random sample of size n2 from Np ( f.L2 , :I) , then
is distributed as
(
P [ (x l - X2 - ( P.. t - P2 )) ' [ ( �, Consequently,
+
2)p Fp, 111 + 112 - p - 1 - p - 1)
( n l + nz n , + n2
-
1 l �J spoo cctr (X l - X 2 - (p.. 1 - p..2 ))
�
c2
]=1-a (6-24)
where
Proof. X 1 - X2
We first note that
= -n1J X 1 1 + -n1l X 1 2 + . . . + -n1, X I n
I
1
1 1 - - X 2 1 - - X 22 - . . . - - X 2n2 n n n
2
z
z
is distributed as
c 1 = c2 = . . · = C111 = 1 /n 1 and C11 1 + 1 = C111 + 2 = = C111 +n2 = (4-23), / (n1 - 1 )S1 is distributed as W111 _ 1 (I) and (n2 - 1)S 2 as W112 _ 1 (I) By assumption, the X 1 / s and the X 2 / s are independent, so ( n 1 - 1)S1 and ( n2 - 1 ) S 2 are also independent. From (4-24), (n1 - 1)S1 + (n 2 - 1 ) S 2 is then dis tributed as W111 + 112 _ 2 ( I ) . Therefore, 4.8,
with by Result to According - 1 n2 •
..·
306
Chap. 6
(
Comparisons of Severa l M u ltivariate Means
)(
) (
I = multivariate normal ' Wishart random matrix - multivariate normal random vector
d.f.
random vector
)
which is the T2 -distribution specified in for the relation to F.]
•
(5-8), with n replaced by n1 + n2 1. [See (5-5) We are primarily interested in confidence regions for p 1 - p 2 . From (6-24), we conclude that all p1 - p within squared statistical distance c 2 of x 1 - x con -
2 2 stitute the confidence region. This region is an ellipsoid centered at the observed difference x 1 - x 2 and whose axes are determined by the eigenvalues and eigen vectors of s pooled (or s;;oled ) . Example 6.3
(Constructing a confidence region for the difference of two mean vectors)
Fifty bars of soap are manufactured in each of two ways. Two characteristics, X1 = lather and X2 = mildness, are measured. The summary statistics for bars produced by methods and are
1 2 [8.3 ] 4.1 ' [10.2 ] Xz 3. 9 '
-
95%
=
Obtain a confidence region for p 1 - p 2 . We first note that S1 and S 2 are approximately equal, so that it is rea sonable to pool them. Hence, from
Also,
(6-21), 49 49 [ 2 1 Spooled = 98 S I + 98 S z = 1 5 J
[ -1..29 ] so the confidence ellipse is centered at [ -1. 9 , .2] ' . The eigenvalues and eigen vectors of s pooled are obtained from the equation 0 = ! Spooled A I J = 1 2 -1 A 5 1 A I = A 2 - 7A + 9 X I - Xz =
-
_
Sec.
6.3
Comparing Mean Vectors from Two Populations
and Consequently, = ± v' so = the corresponding eigenvectors, e 1 and e 2 , determined from
A (7
A1 5. 303
49 - 36)/2.
are e1
=
i=
[ ..299057 ]
and e z =
307
A2 = 1. 697, and
1, 2
[ .957 ] - .290
6.2, (__!_n n2__!_) = ( 501 501 ) { 98)(97)(2) F2' 97 ( • 05) -- •25 1 since F2, 9 7 (. 05) = 3.1. The confidence ellipse extends
By Result
+
cz
+
1.15
.65
units along the eigenvector e ; , or units in the e 1 direction and units in the direction. The 95% confidence ellipse is shown in Figure Clearly, = 0 is not in the ellipse, and we conclude that the two methods of 1 manufacturing soap produce different results. It appears as if the two processes produce bars of soap with about the same mildness but those • from the second process have more lather ).
e2 p. - p.2
6.1.
(X1
(X2),
2.0
- 1 .0
Figure 6 . 1
P.l - P.2 .
95% confidence ellipse for
308
Chap. 6
Com parisons of Several M u ltivariate Means
Simultaneous Confidence Intervals
It is possible to derive simultaneous confidence intervals for the components of the vector p 1 - p 2 . These confidence intervals are developed from a consideration of all possible linear combinations of the differences in the mean vectors. It is assumed that the parent multivariate populations are normal with a common covariance I . Result 6.3.
/
Let c2 = [(n1 + n2 - 2 ) p (n1 + n2 - p
With probability 1 - a,
(
-
1 ) ] Fp , n1 + nz - p - 1 ( a ) .
)
/ + l. spooled a a' ( X 1 - X2) ± c a' l. \j n t nz will cover a' (!L 1 - p2) for all a. In particular f-t 1 i - f-t z i will be covered by (Xl i - Xz J ± C
1 ( _!_ + l.nz ) sii, pooled
\j n l
for i = 1, 2, . . . , p
Proof. Consider univariate linear combinations of the observations
given by a'X 1 j = a 1 X1j1 + a 2 X1j2 + . . . + apXtjp and a'X 2j = a 1 X2 j 1 + a 2 X2j2 + � · + apXZ jp · T� se linear combinations have�ample means and covariances a'X 1 , a'S 1 a and a'X 2 , a'S 2 a, respectively, where X 1 , S1 , and X 2, S2 are the mean and covariance statistics for the two original samples. ( See Result 3.5.) When both parent populations have the same covariance, sf' a = a'S 1 a and s� a = a'S 2 a are both estimators of a' I a, the common population variance of the li�ear combina tions a'X 1 and a'X 2 • Pooling these estimators, we obtain
2 sa, pooled
(n 1 - 1) sL + (n2 - 1 ) s�. a (n 1 + n2 - 2)
(6-25) To test H0 : a' (p 1 - p 2) = a' 80 , on the basis o f the a'X 1j and a'X 2j , we can form the square of the univariate two-sample £-statistic (6-26)
Sec.
- -
6.3
Comparing Mean Vectors from Two Populations
t; :,.;;; ( X I - X 2 - ( JL I
a
= Tz
(2-50),
- JLz ) ) '
= ( X 1 - X 2 - (p1 - JL2 ) ) and
[( n1 + n1 ) Spooled ]- I ( -X I - -X z - (JLI - JLz ) )
According to the maximization lemma with d B (1/n 1 + 1/n2 ) S pooled in
=
309
1
z
= P [ T2 :,.;;; c2] = P [t; :,.;;; c2, for all a] = [ I a ' ( X I - X z ) - a ' (p i - JLz ) I :,.;;; c �a ' (�I + � J s pooled a for all a] where c 2 i s selected according to Result 6. 2 . Remark. For testing H0 : p 1 - p 2 = 0, the linear combination a' ( i 1 - i 2 ),
for all
(1 - a )
¥-
0. Thus,
p
•
with coefficient vector a ex: s;�oted ( i 1 - i 2 ), quantifies the largest population dif ference. That is, if T 2 rejects H0 , then a' ( i 1 - i 2 ) is likely to have a nonzero mean. Frequently, we try to interpret the components of this linear combination for both subject matter and statistical importance. Example 6.4 (Calculating simultaneous confidence i ntervals for the differences in mean components)
= 45
= 55
Samples of sizes n 1 and n2 were taken of Wisconsin homeowners with and without air-conditioning, respectively. (Data courtesy of Statistical Laboratory, University of Wisconsin.) Two measurements of electrical usage (in kilowatt hours) were considered. The first is a measure of total on-peak consumption (X1 ) during July, and the second is a measure of total off-peak consumption (X2 ) during July. The resulting summary statistics are
4 = [ 204. 556.6 J ' -Xz = [ 130.0 ' 355.0 J XI
13825.3 23823.4 = [ 23823. 4 73107. 4 J ' [ 8632.0 19616.7 ] s 2 = 19616.7 55964. 5 ' sl
n1
= 45
n2
= 55
(The off-peak consumption is higher than the on-peak consumption because there are more off-peak hours in a month.) Let us find 95% simultaneous confidence intervals for the differences in the mean components. Although there appears to be somewhat of a discrepancy in the sample variances, for illustrative purposes we proceed to a calculation of the pooled sample covariance matrix. Here
310
Chap.
6
Comparisons of Several Multivariate Means
S pooled =
- 1 S + n2 - 1 S + n 2 - 2 1 nl + n 2 - 2 2 -
nl n,
_
and
10963.7 [ 21505.5
21505.5 63661.3
]
= (2.02) (3.1) = 6.26 With p { - p� = [ JL 1 1 - JL 2 1 , JL 1 2 - JL 22 ] , the 95% simultaneous confidence intervals for the population differences are 1L 1 1 - JL2 1 : (204.4 - 130.0) ± V6.26 or
�( 415 + 5�) 10963.7
�( 415 + 515 ) 63661.3
21.7 ,;; IL1 1 - JL 2 1 ,;; 127.1 JL 1 2 - JL22 : (556.6 - 355.0) ± V6.26 or
74.7 ,;; JL 1 2 - JL 22 ,;; 328.5 We conclude that there is a difference in electrical consumption between those with air-conditioning and those without. This difference is evident in both on-peak and off-peak consumption. The 95% confidence ellipse for p 1 - p 2 is determined from the eigen value-eigenvector pairs .-\ 1 = 71323.5, e { = [.336, .942] and .-\ 2 = 3301 .5, e� = [.942, - .336]. Since
and
we obtain the 95% confidence ellipse for p1 - p 2 sketched in Figure 6.2 on page 311. Because the confidence ellipse for the difference in means does not cover 0 ' = [0, 0], the T2 -statistic will reject H0 : p 1 - p 2 = 0 at the 5% level.
Sec.
6.3
Comparing Mean Vectors from Two Populations
p;
-
Jt�
Figure 6.2
=
31 1
95% confidence ellipse for (p. l l - IL2 t 1 IL 1 2 - P-22 ) .
The coefficient vector for the linear combination most responsible for rejec • tion is proportional to 2 . (See Exercise 6.7.)
S�o1oled (x1 - x ) The Bonferroni 100(1 - a)% simultaneous confidence intervals for the p pop
f.Lz; : ( :Xli - Xz; ) ± t,1+,2 - z ( 2: ) �(�1 + � J sii, pooled where t 1 _ (a/2p) is the upper 100(a/2p)th percentile of a !-distribution with n 1 + n2 ,-1 + 22 d.f.2 I1 I2 When I 1 I2 , we are unable to find a "distance" measure like T 2 , whose distrib ution does not depend on the unknowns I 1 and I2 . Bartlett ' s test [3] is used to test the equality of I1 and I2 in terms of generalized variances. Unfortunately, the con
ulation mean differences are Jl-1 ; -
The Two-Sample Situation when
=F
=F
clusions can be seriously misleading when the populations are nonnormal. Non normality and unequal covariances cannot be separated with Bartlett ' s test. A method of testing the equality of two covariance matrices that is less sensitive to the assumption of multivariate normality has been proposed by Tiku and Balakr ishnan [17]. However, more practical experience is needed with this test before we can recommend it unconditionally. We suggest, without much factual support, that any discrepancy of the order = 4 a2 , ;;, or vice versa, is probably serious. This is true in the univariate case. ;; a1 , The size of the discrepancies that are critical in the multivariate situation probably depends, to a large extent, on the number of variables A transformation may improve things when the marginal variances are quite different. However, for n 1 and n2 large, we can avoid the complexities due to unequal covariance matrices.
p.
312
Chap.
6
Comparisons of Several Multivariate Means
Resu lt 6.4. Let the sample sizes be such that n1 - p and n 2 - p are large. Then, an approximate 100 (1 - a)% confidence ellipsoid for p1 - p 2 is given by all IL 1 - /L z satisfying
where x; (a) is the upper (100a) th percentile of a chi-square distribution with p d.f. Also, 100 (1 - a)% simultaneous confidence intervals for all linear combina tions a' (p1 - p 2 ) are provided by
(
)
a' (p1 - p2 ) belongs to a' (i1 - i 2 ) ± Vx; (a) /a' _!_ s 1 + _!_ S 2 a nz "V n l Proof
From
(6-22) and (3-9),
E ( X 1 - X 2 ) = P- 1 - 1L2
and
By the central limit theorem, X 1 - X 2 is nearly NP [p1 - p 2 , n 1 1 I1 + n2 1 I 2 ] . If I1 and I 2 were known, the square of the statistical distance from X 1 - X 2 to IL 1 - p 2 would be This squared distance has an approximate x;-distribution, by Result 4.7. When n1 and n2 are large, with high probability, s l will be close to I I and s 2 will be close to I 2 . Consequently, the approximation holds with S1 and S 2 in place of I1 and I 2 , respectively. The results concerning the simultaneous confidence intervals follow from • Result 5A. l . Remark.
__!_ S 1 n1
+
If n 1 = n2 = n, then (n - 1)/(n + n - 2) = 1/2, so __!_ S 2 = l_ (S 1 + S 2 ) = (n - 1) S 1 + (n - l ) S 2 l + l n n n+nn n2
2
(
)
With equal sample sizes, the large sample procedure is essentially the same as the procedure based on the pooled covariance matrix. (See Result In one dimen-
6. 2 . )
Sec.
6.3
Comparing Mean Vectors from Two Populations
sion, it is well known that the effect of unequal variances is least when n 1 greatest when n 1 is much less than n2 or vice versa.
31 3 =
n2
and
(Large sample prdcedures for inferences about the difference in means)
Example 6.5
13825.3 [ 23823.4
8632.0 ] + 551 [ 19616.7
We shall analyze the electrical-consumption data discussed in Example 6.4 using the large-sample approach. We first calculate 1 n1
81
+ n12 S z
=
-
1 45
[ 464.17 886.08
23823.4 73107.4
886.08 2642.15
[
]
]
19616.7 55964.5
J
The 95% simultaneous confidence intervals for the linear combinations
a' ( p l - P z )
[1, 0] /L ' ' f.Lz ' 1 f.L 2 f.Lzz
=
[0 1] [ f.Lf.L 11 21 -- f.Lzf.Lzz1 J
and
,
are (see Result 6.4)
74.4 ± \15.99 V464.17
f.L1 1 - f.L2 1 :
=
f.L1 1 - f.Lz 1
=
f.L1 2 - /Lzz
or (21.7, 127.1)
201.6 ± \15.99 V2642.15 or
(75.8, 327.4)
Notice that these intervals differ negligibly from the intervals in Example 6.4, where the pooling procedure was employed. The T 2-statistic for testing H0 : P 1 - P z = 0 is -' 1 1 Tz = x , - X- z l ' - s I - S z [ x- l - -X z ] n n2
[-
[
]
- 130.0 ] [ 464.17 886.08 ] [ 204.4 - 130.0 ] [ 204.4 556.6 - 355.0 886.08 2642.15 556.6 - 355.0 = [74.4 201.6] (10 - 4 ) [ 59.874 - 20.080 [ 74.4 15 · 66 10.519 J 201.6 J - 20.080 1
I
+
-
I
=
For a = .05, the critical value is xi (.05) = 5.99 and, since T 2 = 15.66 > . 2 = Xz (.05) 5.99, we reJect H0 . The most critical linear combination leading to the rejection of H0 has coefficient vector
314
Chap.
6
Comparisons of Several Multivariate Means A
3
ex
( n1
s
+ s2 1 n2 1 1
)-�(-x 1 - -x 2 ) = (l0 - 4) [ .041 [ .063 J
59 , 874 - 20.080 - 20.080 10.519
] [ 74.4 ] 201.6
The difference in off-peak electrical consumption between those with air-conditioning and those without contributes more than the corresponding difference in on-peak consumption to the rejection of H0: μ1 − μ2 = 0. •

A statistic similar to T² that is less sensitive to outlying observations for small and moderately sized samples has been developed by Tiku and Singh [18]. However, if the sample size is moderate to large, Hotelling's T² is remarkably unaffected by slight departures from normality and/or the presence of a few outliers.
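A minimal sketch of the large-sample comparison of two mean vectors (Result 6.4) is given below, assuming X1 and X2 are (n1 × p) and (n2 × p) NumPy arrays of independent samples from the two populations (the array names are illustrative only).

```python
# A minimal sketch of the large-sample two-population comparison of mean vectors.
import numpy as np
from scipy import stats

def two_sample_t2(X1, X2, alpha=0.05):
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    n1, p = X1.shape
    n2 = X2.shape[0]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    V = np.cov(X1, rowvar=False) / n1 + np.cov(X2, rowvar=False) / n2   # S1/n1 + S2/n2
    t2 = d @ np.linalg.solve(V, d)
    crit = stats.chi2.ppf(1 - alpha, df=p)      # approximate critical value for large n1 - p, n2 - p
    return t2, crit                              # reject H0: mu1 = mu2 if t2 > crit
```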
6.4
COMPARING SEVERAL MULTIVARIATE POPULATION MEANS (ONE-WAY MANOVA)
Often, more than two populations need to be compared. Random samples, col lected from each of g populations, are arranged as Population 1: X 1 1 , X 1 2 , . . . , X 1 n , Population 2: X 2 1 , X 22 , . . . , X 2 n,
(6-27)
Population g: X8 1 , X 8 2 , . . . , X 8 "" MANOVA is used first to investigate whether the population mean vectors are the same and, if not, which mean components differ significantly. Assum ptions about the Structure of the Data for One-way MANOVA
1. X c 1 , X c 2 , , X c" is a random sample of size n e from a population with mean /L c , e = 1, 2: . . . , g. The random samples from different populations are . • .
independent. 2. All populations have a common covariance matrix I. 3 . Each population is multivariate normal.
Condition 3 can be relaxed by appealing to the central limit theorem (Result 4.13) when the sample sizes ne are large. A review of the univariate analysis of variance (ANOV A) will facilitate our discussion of the multivariate assumptions and solution methods.
Sec.
6.4
Comparing Several Multivariate Population Means (One-Way Manova)
315
A Summary of Univariate ANOVA
Xe Xe ... , Xe N(JLe , f.Le f.Lz · ·· f.Lg , f.L, f.L e f.L (JLe - f.L) f.L e JL Te Tc f.Lc - f.L · Tc reparameterization Tc f.L + f.Lc ( C th population ) ( overall ) ( C th population ) (6-28)
n, is a random In the univariate situation, the assumptions are that 1 , 2 , if) population, e = 1, 2, . . . , g, and that the random samples sample from an are independent. Although the null hypothesis of equality of means could be for = = mulated as JL 1 = it is customary to regard as the sum of an over all mean component, such as and a component due to the specific population. For instance, we can write = + or = + where = Populations usually correspond to different sets of experimental conditions, and therefore, it is convenient to investigate the deviations associated with the C th population (treatment). The
(treatment) effect
mean
mean
leads to a restatement of the hypothesis of equality of means. The null hypothe sis becomes
The response gestive form
Xci ' distributed as N(JL + Tc , a2) , can be expressed in the sug Xc i =
f.L
+
(overall mean)
ec i g
Tc
ee i
( treatment ) ( random ) +
effect
(6-29)
error
N(O,
a2 ) random variables. To define uniquely the where the are independent model parameters and their least squares estimates, it is customary to impose the constraint ,L
C=l
ncrc = 0.
Motivated by the decomposition in (6-29), the analysis of variance is based upon an analogous decomposition of the observations,
(observation)
(
-
X
overall sample mean
) ( +
(xc - x)
estimated treatment effect
)
+
(xci - xc) (residual)
(6-30)
f.L, 1- = x) is an estimate of Tc , and (xei - xc ) is ecj · e (xe -
where x is an estimate of an estimate of the error
316
Cha p . 6
Comparisons o f Several M u ltiva riate Means
Example 6.6
(The sum of squares decomposition for u nivariate ANOVA)
Consider the following independent samples. Population 1: 9, 6, 9 Population 2: 0, 2 Population 3: 3, 1, 2
Since, for example, .X3 = (3 + 1 + 2)/3 = 2 and .X = (9 + 6 + 9 + 0 + 2 + 3 + 1 + 2)/8 = 4, we find that
= 4 + (2 - 4) + (3 - 2)
= 4 + ( - 2) + 1
(� � ) (: : ) ( -� -� ) (-� -� )
Repeating this operation for each observation, we obtain the arrays 9
4
+ + 3 1 2 4 4 4 -2 -2 -2 observation = mean + treatment effect +
(xcj )
( .X)
1
4
( .Xc - .X )
1 -1 0 residual
(xcj - .Xc )
The question of equality of means is answered by assessing whether the contribution of the treatment array is large relative to the residuals. (Our esti g mates rc = .Xc - .X of Tc always satisfy 2: nc rc = 0. Under H0 , each rc is an C=l
estimate of zero.) If the treatment contribution is large, H0 should be rejected. The size of an array is quantified by stringing the rows of the array out into a vector and calculating its squared length. This quantity is called the sum of squares (SS). For the observations, we construct the vector y' = [9, 6, 9, 0, 2, 3, 1, 2]. Its squared length is ssobs = 92 + 62 + 92 + 02 + 22 + 3 2 + 1 2 + 22 = 216 Similarly ss mean sstr
= 42 + 42 + 42 + 42 + 42 + 42 + 42 + 42 = 8 (42 ) = 128
= 42 + 42 + 42 + ( - 3) 2 + ( - 3) 2 + ( - 2) 2 + ( - 2) 2 + ( - 2? = 3 (42 ) + 2 ( - 3) 2 + 3 ( - 2) 2 = 78
Sec. 6.4
31 7
Comparing Several Multivariate Popu lation Means (One-Way Manova)
and the residual sum of squares is The sums of squares satisfy the same decomposition, (6-30), as the observa tions. Consequently, or 216 = 128 + 78 + 10. The breakup into sums of squares apportions vari ability in the combined samples into mean, treatment, and residual (error) components. An analysis of variance proceeds by comparing the relative sizes of SS1r and SSres . If H0 is true, variances computed from SS1r and SSres should be approximately equal. The sum of squares decomposition illustrated numerically in Example 6.6 is so basic that the algebraic equivalent will now be developed. Subtracting :X from both sides of (6-30) and squaring gives •
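A minimal sketch of this decomposition, computed directly from the three small samples of Example 6.6, is given below; it reproduces the values 216 = 128 + 78 + 10.

```python
# A minimal sketch of the univariate sum of squares decomposition in Example 6.6.
import numpy as np

samples = [np.array([9.0, 6.0, 9.0]), np.array([0.0, 2.0]), np.array([3.0, 1.0, 2.0])]
all_obs = np.concatenate(samples)
grand_mean = all_obs.mean()

ss_obs = np.sum(all_obs**2)                                           # 216
ss_mean = len(all_obs) * grand_mean**2                                # 128
ss_tr = sum(len(s) * (s.mean() - grand_mean)**2 for s in samples)     # 78
ss_res = sum(np.sum((s - s.mean())**2) for s in samples)              # 10
print(ss_obs, ss_mean + ss_tr + ss_res)                               # both 216
```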
We can sum both sides over j, note that 2: (xei - :Xe ) = 0, and obtain n,
j= l
�
2: (xei - :X ) z = ne ( :Xe - :X ) z
j=l
Next, summing both sides over e we get
�
+ 2: (xej - :Xe ) z j= l
(6-31)
(
) ( ) + ( SSres ) SS tr = total (corrected) SS between (samples) SS within (samples) SS or SScor
g
�
2: 2: xii = (n 1 + n2 f=l j= l ( SSobs ) =
+
··· +
n1)x 2
g
g
�
+ 2: ne ( :Xe - :X ) 2 + 2: 2: (xei - :Xe )2 + + (6-32) f=l
f=l j=l
318
Chap.
6
Comparisons of Several Multivariate Means
(6-32),
In the course of establishing we have verified that the arrays repre senting the mean, treatment effects, and residuals are That is, these arrays, considered as vectors, are perpendicular whatever the observation vector y ' = [x l l , . . . , Xln l • x2 1 • . . . , Xzn, • . . . , Xg n) · Consequently, we could obtain ssres by subtraction, without having to calculate the individual residuals, because ss res = SS obs - SS mean - SS tr " However, this is false economy because plots of the residu als provide checks on the assumptions of the model. The vector representations of the arrays involved in the decomposition also have geometric interpretations that provide the degrees of freedom. For an arbitrary set of observations, let [x1 1 , . . . , x 1 111 , x2 1 , . . . , x2 11, , . . . , xg n , ] = y ' . The observation vector y can lie anywhere in = + n2 + · · · + g dimensions; the mean vector .X1 = [x, . . . , x] ' must lie along the equiangular line of 1, and the treat ment effect vector
orthogonal.
(6-30)
n n1
(.XI
-
x)
1 1 0 0 0 0
}"·
+ ( xz
-
x)
0 0 1 + nz 1 0 0
}
= (:X\
-
n
+ ( xg - x )
0 0 0 0 1 n, 1
}
x ) u 1 + ( x2 - x ) u 2 + · · · + (xg - x ) ug
g
lies in the hyperplane of linear combinations of the vectors u 1 , u 2 , , ug. Since 1 = u 1 + u 2 + · · · + ug, the mean vector also lies in this hyperplane, and it is perpendicular to the treatment vector. (See Exercise Thus, the mean vector has the freedom to lie anywhere along the one-dimensional equiangular line, and the treatment vector has the freedom to lie anywhere in the other g - 1 dimen sions. The residual vector, e = y - ( .X1 ) - [ (.X1 - .X ) u 1 + · · · + (xg - .X ) ug] is perpendicular to both the mean vector and the treatment effect vector and has the = freedom to lie anywhere in the subspace of dimension that is perpendicular to their hyperplane. To summarize, we attribute d.f. to SS mean , d.f. to SS1" and n - g = ( 1 + n2 + · · · + ng ) - g d.f. to SS res · The total number of degrees of freedom is = n 1 + n 2 + · · · + ng. Alternatively, by appealing to the univariate distribution theory, we find that these are the degrees of freedom for the chi-square distributions associated with the corresponding sums of squares. The calculations of the sums of squares and the associated degrees of free dom are conveniently summarized by an ANOVA table.
always
. • .
6. 1 0. )
n n
1
n - (g - 1) - 1 n - g g-1
Sec.
6.4
Comparing Several Multivariate Population Means (One-Way Manova)
321
can be written as
The sum over j of the middle two expressions is the zero matrix, because n,
j�=l (xCi - X c) = 0. Hence, summing the cross product over C and j yields g g g + ( nc ) ) )' cj c ) ' j � � � � C= l j = I (x x xf x C=l (X. x (xe x C=l j�= l (x cj - xc) (x cj - xd
(
11(
total (corrected) sum of squares and cross products
The
) (
treatment (�etween) sum of squares and cross products
) (
�
)
residual (Within) sum of squares and cross products (6-36)
within sum of squares and cross products matrix can be expressed as W=
g
Ill
C=l� j�=l (x ej - xc Hx cj - ic)' (6-37)
where S c is the sample covariance matrix for the C th sample. This matrix is a gen eralization of the S pooled matrix encountered in the two-sample case. It plays a dominant role in testing for the presence of treatment effects. Analogous to the univariate result, the hypothesis of no treatment effects,
(n 1 + n2 - 2)
H0 : T1 = Tz = · · · = Tg = 0 is tested by considering the relative sizes of the treatment and residual sums of squares and cross products. Equivalently, we may consider the relative sizes of the
322
Chap.
Comparisons of Several Multivariate Means
6
residual and total (corrected) sum of squares and cross products. Formally, we summarize the calculations leading to the test statistic in a MANOVA table. MANOVA TABLE FOR COM PARI NG POPULATIO N M EAN VECTORS
Source of variation
Matrix of sum of squares and cross products (SSP)
= CL=g l ne ( ie - x ) ( ie - x ) ' g W = L L (x ej - ie H x ej - ie ) ' C= l j= l B
Treatment Residual (Error) Total (corrected for the mean)
ll(
g B + W = L L (x ej - i ) (x ej - x ) ' C = l j= l n,
Degrees of freedom ( d.f.)
g-1 g L ne - g C=l g L ne - 1 C=l
This table is exactly the same form, component by component, as the ANOVA table, except that squares of scalars are replaced by their vector counterparts. For example, ( .Xe - .X ) 2 becomes ( ie - x ) (i e The degrees of freedom corre spond to the univariate geometry and also to some multivariate distribution theory involving Wishart densities. (See [1].) One test of H0 : T1 T2 · · · Tg 0 involves generalized variances. We reject H0 if the ratio of generalized variances
x )'.
= = = =
A*
lwl
IB + WI
A* =
�� � (xej - ieHx ej - ie ) ' I l �l j� (xej - x ) (x ej - x ) ' l
(6-38)
I
is too small. The quantity I W I / B + W I , proposed originally by Wilks (see [20]), corresponds to the equivalent form (6-33) of the F-test of H0 : no treatment effects in the univariate case. Wilks' lambda has the virtue of being convenient and related to the likelihood ratio criterion. 2 The exact distribution of can be as
2 Wilks' lambda can also be expressed as a function of the eigenvalues of A , A , 1 2 A* =
fi ;�
(1 A)
A*
• • •
, A s of w- 1 B
1 1 + A; where s = min (p, g - 1 ) , the rank of B. Other statistics for checking the equality o f several multivari ate means, such as Pillai's statistic, the Lawley-Hotelling statistic, and Roy's largest root statistic, can also be written as particular functions of the eigenvalues of w- 1 B. For large samples, all of these sta tistics are, essentially, equivalent. (See the additional discussion on page 357.)
Sec.
6.4
Comparing Several Multivariate Population Means (One-Way Manova) D I STRI BUTION OF WI LKS' LAM BDA,
TAB LE 6.3
No. of variables
No. of groups
p=1 p=2
g ;;. 2
p ;;. 1
g
WI/I B
+ wJ
Sampling distribution for multivariate normal data
( Lne - g) C - *A * ) Fg - U:n, - g A g- 1 ( Lne - g - 1 ) e - \lA* ) F2(g - 1). 2(�n, - g - l) g- 1 y'A* ( Lne - p - 1 ) C - *A * ) Fp. ���. - p - t p A ( Lne - p - 2 ) e - \lA* ) Fzp, Z (�n, - p - 2) VA* p �
g ;;. 2
�
=2 g= 3
p ;;. 1
A* = J
323
�
�
derived for the special cases listed in Table 6.3. For other cases and large sample sizes, a modification of due to Bartlett (see [4]) can be used to test H0 • Bartlett (see [4]) has shown that if H0 is true and is large,
A*
Lne = n
- ( n - 1 - (p +2 g) ) ln A * = - ( n - 1 - (p +2 g) ) ln ( I B J w+ Jw J )
(6-39)
has approximately a chi-square distribution with p (g - 1) d.f. Consequently, for Lne = n large, we reject H0 at significance level a if - ( n - 1 - (p +2 g) ) ln ( B J w+ lW I ) (6-40)
J
where x; (g - ) (a) is the upper (100a) th percentile of a chi-square distribution with p(g - 1) d.f. l
Example 6.8 (A Manova table and Wil ks' lambda for testing the equality of three mean vectors)
Suppose an additional variable is observed along with the variable introduced in Example 6.6. The sample sizes are 3, 2, and 3. Arranging the observation pairs X ej in rows, we obtain
n1 = n2 =
n3 =
324
Chap.
6
Comparisons of Several Multivariate Means
[� ] [� ] [ � ] [�] [�] [�] [�] [�]
= [ ! ], and i = [:]
with i 1
x2 = [� ],
X3
= [ � ],
We have already expressed the observations on the first variable a s the sum of an overall mean, treatment effect, and residual in our discussion of uni variate ANOVA. We found that
(observation)
(mean)
( treatment ) effect
(residual)
and
= 128 + 78 + 10 Total SS (corrected) = SS obs - SS mean = 216 - 128 216
88
( =� =� � � -� =� ( ( + + ) ( ) ) ) =
Repeating this operation for the observations on the second variable, we have
! �
7
5
8 9 7
5 5 5
(observation)
(mean)
-1
( treatment ) effect 3
3
3
3
0
1 -1
(residual)
and
= 200 + 48 + 24 Total SS (corrected) = SS obs - SSmean = 272 - 200 272
=
72
These two single-component analyses must be augmented with the sum of entry-by-entry cross products in order to complete the entries in the
Sec.
6.4
Comparing Several Multivariate Population Means (One-Way Manova)
325
MANOVA table. Proceeding row by row in the arrays for the two variables, we obtain the cross product contributions: Mean: 4 (5) + 4 (5)
+
···
+ 4 (5) = 8 (4) (5) = 160
Treatment: 3 (4) ( - 1 ) + 2 ( - 3) ( - 3) + 3 ( - 2) (3) = - 12 Residual: 1 ( - 1 ) + ( - 2) ( - 2) + 1 (3) + ( - 1 ) (2) + Total: 9 (3) + 6 (2) + 9 (7) + 0 (4) +
···
·· ·
+ 0 (- 1) = 1
+ 2 (7) = 149
Total ( corrected ) cross product = total cross product - mean cross product = 149 - 160 = - 1 1 Thus, the MANOVA table takes the following form:
Source of variation
Matrix of sum of squares and cross products
Degrees of freedom
78 - 12 48 - 12
3 - 1 = 2
[
Treatment
[ 101 88 [ - 11
Residual Total ( corrected)
] ] 2� - 11 ] 72
] = [ - 7812
Equation (6-36) is verified by noting that
[
88 - 11 - 11 72
Using (6-38), we get A* =
IWI IB + WI
I
88 - 11 - 11 72
1
- 12 48
3 + 2 + 3 - 3 = 5 7
] + [ 101 241 ]
10 (24) - ( 1 ) 2 239 = = 0385 88 (72) - ( - 1 1 ) 2 6,215 •
326
Chap.
6
Comparisons of Several Multivariate Means
=
=
Since p 2 and g 3, Table 6.3 indicates that an exact test (assuming normality and equal group covariance matrices) of H0 : T1 T2 T3 0 (no treatment effects) versus H1 : at least one -rc # 0 is available. To carry out the test, we compare the test statistic
= = =
( 1 - VA* ) (2-nc - g - 1) = ( 1 - Y.0385) (8 - 3 - 1 ) = 8. 1 9 (g - 1) 3 - 1 VA* Y.0385 with a percentage point of an F-distribution having v1 = 2 (g - 1) = 4 and v2 = 2 ( 2-n c - g - 1) = 8 d.f. Since 8. 1 9 > F4,8 (.01) = 7.01, we reject H0 at the a = .01 level and conclude that treatment differences exist. •
When the number of variables, p, is large, the MANOVA table is usually not constructed. Still, it is good practice to have the computer print the matrices B and W so that especially large entries can be located. Also, the residual vectors
should be examined for normality and the presence of outliers using the techniques discussed in Section 4.6 and 4.7 of Chapter 4. Example 6.9 (A multivariate analysis of Wisconsin n u rsing home data)
The Wisconsin Department of Health and Social Services reimburses nursing homes in the state for the services provided. The department develops a set of formulas for rates for each facility, based on factors such as level of care, mean wage rate, and average wage rate in the state. Nursing homes can be classified on the basis of ownership (private party, nonprofit organization, and government) and certification (skilled nurs ing facility, intermediate care facility, or a combination of the two). One purpose of a recent study was to investigate the effects of owner ship or certification (or both) on costs. Four costs, computed on a per-patient day basis and measured in hours per patient day, were selected for analysis: X1 cost of nursing labor, X2 cost of dietary labor, X3 cost of plant operation and maintenance labor, and X4 cost of housekeeping and laun dry labor. A total of n 516 observations on each of the p 4 cost variables were initially separated according to ownership. Summary statistics for each of the g 3 groups are given below.
=
=
=
=
=
=
=
Sec.
6.4
Comparing Several Multivariate Population Means (One-Way Manova)
Number of observations
Group e = 1 (private) e = 2 (nonprofit) e = 3 (government)
327
[ ]' [ ]' [ ] Sample mean vectors
n1 = 271 n2 = 138
XI =
n3 = 107
2.066 .480 .082 .360
X2 =
2 167 .596 .124 .418
x3 =
2.273 .521 .125 .383
3
2: nc = 516
f=l
[ [
Sample covariance matrices
29 1
[
J J J Source: Data courtesy of State of Wisconsin Department of Health and Social Services. s,
s3 =
- .001 .011 .002 .000 .001 .010 .003 .000
O
.017 .030 .003 - .000 .004 .018 .006 .001
O
261
s2 =
561 .011 .025 .001 .004 .005 .037 .007 .002
O
Since the Sc ' s seem to be reasonably compatible,3 they were pooled [see (6-37)] to obtain VV
[ ! !��
= (n1
1
1)S1
-
8
:
]
+ ( n2 - 1 ) S 2 + ( n3 - 1 ) S 3
8.200 1.695 .633 1.484 9.581 2.428 .394 6.538
3 However, a normal-theory test of H0 : I 1 cance level because of the large sample sizes.
=
I2
=
I3 would reject H0 at any reasonable signifi
328
Chap.
6
Comparisons of Several Multivariate Means
[ l
Also,
2.136 .519 .102 .380
and
� B =
1
(-
- x) ( - x) -
e-'-'= ne Xe - Xe
'
-
[
3.475 1.111 1 .225 . .821 .453 .235 .584 .610 .230 .304
l
To test H0 : T1 = T2 = T3 (no ownership effects or, equivalently, no difference in average costs among the three types of owners-private, nonprofit, and government), we can use the result in Table 6.3 for g = 3. Computer-based calculations give
A* = and
( 2- nc a
p -
p
!WI IB + WI
.7714
2 ) ( 1 - � ) = ( 516 - 4 - 2 ) ( 1 - Y.7714 ) = 17_67 VA*
4
Y.77i4
Let = .01 , so that F2 (4) , Z ( S i o / · 0 1 ) ='= x� (.01 ) /8 = 2.51. Since 17.67 > F8 1 0 2 0 (.01 ) == 2.51,' we reject H0 at the 1% level and conclude that average costs differ, depending on type of ownership. It is informative to compare the results based on this "exact" test with those obtained using the large-sample procedure summarized in (6-39) and = = 516 is large, and H0 can be (6-40). For the present example, tested at the = .01 level by comparing
,
a
- (n - 1 -
2- ne n
(p
+ g ) /2) ln
( I :� ) = - 511.5 ln (.7714) = 132.76 IB I
with x; (g - 1 ) (.01) = xn 01) = 20.09. Since 132.76 > x� (.01 ) = 20.09, we reject H0 at the 1% level. This result is consistent with the result based on the • foregoing F-statistic.
Sec.
6.5
6.5
329
Simultaneous Confidence Intervals for Treatment Effects
SIMULTANEOUS CONFIDENCE INTERVALS FOR TREATMENT EFFECTS
When the hypothesis of equal treatment effects is rejected, those effects that led to the rejection of the hypothesis are of interest. For pairwise comparisons, the Bon ferroni approach (see Section 5.4) can be used to construct simultaneous confi dence intervals for the components of the differences Tk - Tc (or IL k - JLe ) . These intervals are shorter than those obtained for all contrasts, and they require critical values only for the univariate t-statistic. Let Tk ; be the ith component of Tk . Since Tk is estimated by 7-k = ik - i (6-41 } and 1-k ; - Tc ; = xk; - X e; is the difference between two independent sample means. The two-sample t-based confidence interval is valid with an appropriately modified a. Notice that
( nk
. 1 Var (7-k I- - 1-c I- ) = Var (Xk I- - Xc I- ) = --
-
+
1
)
- uI. I.
n(
where u;; is the ith diagonal element of I. As suggested by (6-37), Var (Xk i - Xe; ) is estimated by dividing the corresponding element of W by its degrees of freedom. That is, -
(
1 Var (Xk l. - XC l- ) = nk -
1
+ -
)
w 11 ..
-
nf n +
g
+ where W; ; is the ith diagonal element of W and n = n1 ng . It remains to apportion the error rate over the numerous confidence statements. Relation (5-28} still applies. There are p variables and g (g - 1)/2 pairwise differ ences, so each two-sample t-interval will employ the critical value t11 _ g (a/2m), where m = pg (g - 1 } /2 (6-42 }
···
is the number of simultaneous confidence statements. Result 6.5. Let n =
(1 -
a} ,
g
_L nk . For the model in (6-34), with confidence at least
k= l
Tk i - Te ; belongs to xk i - Xe;
) + ± tll - g ( pg (ga- 1 ) ) \j/__!!J _jj_ _ (_!_ _!_ n - g n k ne
for all components i = 1, . . . , p and all differences e < ith diagonal element of W.
k = 1, . . . , g. Here W; ; is the
330
Chap.
6
Comparisons of Several Multivariate Means
We shall illustrate the construction of simultaneous interval estimates for the pairwise differences in treatment means using the nursing-home data introduced in Example 6.9. Example 6. 1 0
(Simultaneous intervals for treatment differences-N u rsing Homes)
We saw in Example 6.9 that average costs for nursing homes differ, depend ing on the type of ownership. We can use Result 6.5 to estimate the magni tudes of the differences. A comparison of the variable X3 , costs of plant operation and maintenance labor, between privately owned nursing homes and government-owned nursing homes can be made by estimating r1 3 - r33 • Using (6-35) and the information in Example 6.9, we have 7-1 = ( i 1 - i ) = w
=
[
Consequently,
- .020
T3 = ( -x 3 - -x ) = �
182.962 4.408 8.200 1.695 .633 1 .484 9.581 2.428 .394
.137 .002 .023 .003
7-1 3 - 7-33 = - .020 - .023 = - .043 138 107 = 516, so that
+ + ' ( __!__ + __!__ )
and n = 271
[=:: ].
[]
�
=
1
( + _ ) 1.484 = .00614 107 516 - 3
1 ' 'J 271
'J n 1 n3 n - g = 4 and g = 3, for 95% simultaneous confidence statements we
Since p require that t5 1 3 (.05/4 (3) 2) = 2.87. (See Appendix, Table 1.) The 95% simultaneous confidence statement is
/(__!__ + __!__ )
� r1 3 - r33 belongs to 7-1 3 - 7-33 ± t5 1 3 (.00208) n3 n - g 'J n 1 = - .043 ± 2.87 (.00614)
- .043 ± .018, or ( - .061, - .025)
We conclude that the average maintenance and labor cost for government owned nursing homes is higher by .025 to .061 hour per patient day than for pri vately owned nursing homes. With the same 95% confidence, we can say that r1 3 - r2 3 belongs to the interval ( - .058, - .026)
Sec.
6.6
Two-Way Multivariate Analysis of Variance
331
and
T2 3 - T33 belongs to the interval ( - .021, .019) Thus, a difference in this cost exists between private and nonprofit nursing homes, but no difference is · observed between nonprofit and government • nursing homes. 6.6 TWO-WAY MULTIVARIATE ANALYSIS OF VARIANCE
Following our approach to the one-way MANOVA, we shall briefly review the analysis for a univariate two-way fixed-effects model and then simply generalize to the multivariate case by analogy. Univariate Two-Way Fixed-Effects Model with I nteraction
We assume that measurements are recorded at various levels of two factors. In some cases, these experimental conditions represent levels of a single treatment arranged within several blocks. The particular experimental design employed will not concern us in this book. (See [9] and [10] for discussions of experimental design.) We shall, however, assume that observations at different combinations of experimental conditions are independent of one another. Let the two sets of experimental conditions be the levels of, for instance, fac tor 1 and factor 2, respectively.4 Suppose there are g levels of factor 1 and b levels of factor 2, and that n independent observations can be observed at each of the gb combinations of levels. Denoting the rth observation at level e of factor 1 and level k of factor 2 by Xekr • we specify the univariate two-way model as
Xekr = IL
+ Te + f3k + Yek + eekr
= 1, 2, . k = 1, 2, r = 1, 2, . . e
.
.
. . .
g where :L Te f=l
.
(6-43)
,g
'b ,n
:L f3k = :L Ye k = :L Yek = 0 and the eekr are independent b
k=l
g
b
f=l
k=t
N(O, u2 ) random variables. Here p., represents an overall level, Te represents the fixed effect of factor 1, f3k represents the fixed effect of factor 2, and Yt k is the inter
action between factor 1 and factor 2. The expected response at the f th level of fac tor 1 and the kth level of factor 2 is thus 4 The use of the term factor to indicate an experimental condition is convenient. The factors dis cussed here should not be confused with the unobservable factors considered in Chapter 9 in the con text of factor analysis.
332
Chap.
6
Comparisons of Several Multivariate Means
(
) (
+
mean overall = response level
+
+
) + ( effect of ) ( effect of ) + ( fac�or 1-fa_ctor 2 ) factor 1
e = 1, 2, . . , 8,
+
k
.
factor 2
= 1, 2, .
'Ye k
mteraction
. .
,b
(6-44)
The presence of interaction, 'Yt k • implies that the factor effects are not addi tive and complicates the interpretation of the results. Figures 6.3(a) and (b) show expected responses as a function of the factor levels with and without interaction, respectively. The absense of interaction means 'Ye k 0 for all e and k. In a manner analogous to (6-44), each observation can be decomposed as
=
Level I of factor I Level 3 of factor I Level 2 of factor I
3
2
4
Level of factor 2 (a)
Level 3 of factor I Level I
of factor I
Level 2 of factor I
3
2
4
Level of factor 2 (b) Figure 6.3
i nteraction .
Curves for expected responses (a) with interaction and (b) without
Sec.
Two-Way Multivariate Analysis of Variance
6.6
333
where x is the overall average, xe . is the average for the l th level of factor 1 , x . k is the average for the kth level of factor 2, and Xe k is the average for the l th level of factor 1 an d the kth level of factor 2. Squaring and summing the devia tions (xe kr - x ) gives g
b
n
���
f=l k= l r=l
= f�= l bn (xe . - x ) 2 g
(xe kr - x ) 2
+
or SScor
g
b
��
f=l k=l
= SSfac I + SSfac
+
b
k= l
n (xe k - xe . - x. k
2 +
SSin t
- x )2
� gn (x. k
+
+
x )2
(6-46)
SS res
The corresponding degrees of freedom associated with the sums of squares in the breakup in (6-46) are gbn - 1 = (g - 1 )
+
( b - 1)
+
(g - 1 ) ( b - 1 )
+
gb ( n - 1 )
( 6-47)
The ANOVA table takes the following form: ANOVA TABLE FOR COM PARI NG EFFECTS O F TWO FACTORS AN D TH E I R I NTERACTION
Source of variation Factor 1 Factor 2 Interaction Residual (Error) Total (corrected)
Degrees of freedom ( d.f.)
Sum of squares ( SS )
= f�= l bn (xe . - x ) 2 2 ssfac 2 = � gn ( x. k - x ) k= ! SS int = � � n (xe k - xe . - x. k f=l k=l SS res = � � � (xe kr - xe k ) 2 f=l k=l r=l ssfac l
g
g- 1
b
SS cor
g
b
g
b
n
= f�= l k�= l r�= l (xe kr - x ) 2 g
b
n
b - 1 +
x )2
(g - 1 )(b - 1 ) gb (n - 1 )
gbn - 1
334
Chap.
6
Comparisons of Several Multivariate Means
- 1),
The F-ratios of the mean squares, SSrac 1 / (g - 1 ) , SSrac 2 / (b and ss int / (g - 1 ) (b - 1 ) to the mean square, ss res f (gb ( n - 1 ) ) can be used to test for the effects of factor 1 , factor 2, and factor 1-factor 2 interaction, respectively. ( See
[7] for a discussion of univariate two-way analysis of variance. )
Multivariate Two-Way Fixed-Effects Model with Interaction
Proceeding by analogy, we specify the two-way fixed-effects model for a response consisting of p components [see
(6-43)] X ek r = JL + Te + {Jk + 'Yek + e f k r e = 1, 2, . . . , g
vector
(6-48)
k = 1 , 2, . . . ' b
r = 1 , 2, . . . , n
p
b g = � 'Yek = � 'Yek = 0. The vectors are all of order 1, Te {Jk €=! k =! f = l k =l and the eu, are independent Np (O, I) random vectors. Thus, the responses consist of measurements replicated n times at each of the possible combinations of lev
g where �
b =�
X
p
els of factors 1 and 2. Following we can decompose the observation vectors
(6-45), X ek r as X ek r = i + ( ie. - i ) + ( i. k - i ) + ( iek - ie . - i. k + i ) + (x ek r - iek ) (6-49) where i is the overall average of the observation vectors, ie. is the average of the observation vectors at the fth level of factor 1, i. k is the average of the observa tion vectors at the kth level of factor 2, and iek is the average of the observation vectors at the fth level of factor 1 and the kth level of factor 2. Straightforward generalizations of (6-46) and (6-47) give the breakups of the sum of squares and cross products and degrees of freedom: n
g
f=l k =l r=l (x ek r - i ) (x ek r - i ) ' = f=l� bn ( ie. - i ) ( ie. - i ) ' + k�=l gn( i. k - i ) ( i. k - i ) ' + f�= l k�=l n( iek - ie. - i. k + i ) ( iek - ie. - i. k + i ) ' (6-50) + f=l�g k�=lb r=l� (xf.k r - iek ) (xekr - ie d ' gb n - 1 = (g - 1 ) + (b - 1 ) + (g - 1 ) (b - 1) + gb ( n - 1) (6-51) g b � � �
b
g
b
11
Sec.
6.6
Two-Way Multivariate Analysis of Variance
335
Again, the generalization from the univariate to the multivariate analysis consists simply of replacing a scalar such as with the corrresponding matrix
(xe . - xf
(xe . - x) (x e . - x)'.
The MANOVA table is the following:
MANOVA TAB LE FOR COM PARI N G FACTORS AND TH EIR I NTERACTIO N
Factor
1
Factor 2 Interaction Residual (Error)
Degrees of freedom ( d.f. )
Matrix of sum of squares and cross products ( SSP)
Source of variation
=
g
� bn( ie . - x) (xe . - x )' f=l b SSPrac z = � gn(x . k - x) (x . k - x)' k =l g b SSPi nt = � � n( iek - ie . - x . k + x) (x u - ie . - x . k + x)' f=l k =l g b = SSPres f=l� k�=l r=l� (xu, - Xek Hxek r - Xek )' SSPrac t
n
Total ( corrected)
SSPcor
=
g b
f=l� k�=! r=!� (xek r - x) (x ek r - x )' n
g-1 b-1 (g - 1) (b - 1) gb(n - 1) gbn - 1
A test ( the likelihood ratio test) 5 of
Ho : 'Yu =
1'1 2
=
... = 'Yg b = 0
( no interaction effects )
(6-52)
versus
H1 : At least one
'Yek * 0
is conducted by rejecting H0 for small values of the ratio (6-53) 5 The likelihood test procedures require that p "" gb (n inite (with probability 1).
-
1 ), so that SSP,., will be positive def
336
Chap.
6
Comparisons of Several Multivariate Means
For large samples, Wilks ' lambda, A * , can be referred to a chi-square percentile. Using Bartlett ' s multiplier (see [6]) to improve the chi-square approximation, we reject H0 : y1 1 = y1 2 = = 'Ygb = 0 at the a level if
[
- gb (n - 1 ) -
p
···
+ 1 - (g - 1 ) ( b - 1 ) ] In A * > xfg - t ) (b - t ) p (a) 2
(6-54)
where A * is given by (6-53) and x{g- ! ) (b - t ) p (a) is the upper (100a) th percentile of a chi-square distribution with (g - 1) (b - 1)p d.f. Ordinarily, the test for interaction is carried out before the tests for main fac tor effects. If interaction effects exist, the factor effects do not have a clear inter pretation. From a practical standpoint, it is not advisable to proceed with the additional multivariate tests. Instead, p univariate two-way analyses of variance (one for each variable) are often conducted to see whether the interaction appears in some responses but not others. Those responses without interaction may be interpreted in terms of additive factor 1 and 2 effects, provided that the latter effects exist. In any event, interaction plots similar to Figure 6.3, but with treat ment sample means replacing expected values, best clarify the relative magnitudes of the main and interaction effects. In the multivariate model, we test for factor 1 and factor 2 main effects as fol lows. First, consider the hypotheses H0 : T1 = T2 = · · · = Tg = 0 and H1 : at least one Te � 0. These hypotheses specify no factor 1 effects and some factor 1 effects, respectively. Let
A* =
I SSPres I I SSPfac 1 + SSP res I
(6-55)
so that small values of A * are consistent with H1 • Using Bartlett ' s correction, the likelihood ratio test is:
+ 1 - (g - 1 ) ] In A * > X(g2 - l) p (a)
Reject H0 : T1 = T2 = · · · = Tg = 0 (no factor 1 effects) at level a if
[
- gb (n - 1) -
p
2
(6-56)
where A * is given by (6-55) and x {g - i ) p (a) is the upper (lOOa) th percentile of a chi-square distribution with (g - 1)p d.f. In a similar manner, factor 2 effects are tested by considering H0 : /1 1 /12 = · · · = {Jb = 0 and H1 : at least one {Jk � 0. Small values of (6-57) are consistent with H1 • Once again, for large samples and using Bartlett ' s correc tion: Reject H0 : {11 = /12 = · · · = {Jb = 0 (no factor 2 effects) at level a if
- [ gb(n
Sec. -
p + 1 - (b - 1) ] lo A * > X (b - !) p (a)
6.6
1) -
Two-Way Multivariate Analysis of Variance 2
2
337
(6-58)
where is given by (6-57) and xfb - !) p (a) is the upper (100a) th percentile of a chi-square distribution with degrees of freedom. Simultaneous confidence intervals for contrasts in the model parameters can provide insights into the nature of the factor effects. Results comparable to Result 6.5 are available for the two-way model. When interaction effects are negligible, we may concentrate on contrasts in the factor and factor 2 main effects. The Bon ferroni approach applies to the components of the differences 'Te 7'111 of the fac tor 1 effects and the components of fJk - fJq of the factor 2 effects, respectively. The 100 (1 - a) % simultaneous confidence intervals for Te ; T111 ; are
A*
(b - 1)p
1
Te ; -
T,, ;
belongs to ( Xc;. - Xm ; . )
-
{E;;2 ± tv ( pg(ga- 1 ) ) v--:: b,;,
b (n - 1), Eu is the ith diagonal element of E = SSP and - xm i· isgthe ith component of Xe. - XIII • ' Similarly, the 100 (1 - a) % simultaneous confidence intervals for fJk i - /3q ; are ( pb (ba 1) ) v{E;;2 {3 k i - f3q ; belongs to (x .k; - x. q ; ) ± t v --:: g;; (6-60)
where v =
X e ;.
(6-59) res '
_
where v and Eu are as just defined and x.k; - x.q; is the ith component of x.k - x . q ·
Comment. We have considered the multivariate two-way model with repli cations. That is, the model allows for replications of the responses at each com bination of factor levels. This enables us to examine the "interaction" of the factors. If only one observation vector is available at each combination of factor levels, the two-way model does not allow for the possibility of a general interaction term 'Yc k · The corresponding MANOVA table includes only factor 1 , factor 2, and residual sources of variation as components of the total variation. (See Exercise
n
6.13.)
Example 6. 1 1
(A two-way multivariate analysis of variance of plastic fil m data)
The optimum conditions for extruding plastic film have been examined using a technique called Evolutionary Operation. (See [8] .) In the course of the study that was done, three responses-X1 = tear resistance, X2 = gloss, and X3 = opacity-were measured at two levels of the factors, and The measurements were repeated = 5 times at each combination of the factor levels. The data are displayed in Table 6.4.
amount of an additive.
rate of extrusion n
338
Chap.
6
Comparisons of Several Multivariate Means TABLE 6.4 PLASTIC F I LM DATA
x1 =
tear resistance,
x2 =
gloss, and
x3 =
opacity
Factor 2: Amount of additive Low (1.0%)
High (1.5%)
Xr [6.5 [6.2 Low ( - 10%) [5.8 [6.5 [6.5
x2 9.5 9.9 9.6 9.6 9.2
x3 4.4] 6.4] 3.0] 4.1] 0.8]
Xr [6.9 [7.2 [6.9 [6.1 [6.3
Xz 9.1 10.0 9.9 9.5 9.4
x3 5.7] 2.0] 3.9] 1.9] 5.7]
Xr [6.7 [6.6 [7.2 [7.1 [6.8
x2 9.1 9.3 8.3 8.4 8.5
x3 2.8] 4.1] 3.8] 1 .6] 3.4]
Xr [7. 1 [7.0 [7.2 [7.5 [7.6
Xz 9.2 8.8 9.7 10.1 9.2
x3 8.4] 5.2] 6.9] 2.7 ] 1 .9]
Factor 1: Change in rate of extrusion
-
High (10%)
-
-
The matrices of the appropriate sum of squares and cross products were calculated (see the SAS statistical software output in Panel 6.1), leading to the following MANOVA table:
Source of variation Factor 1: change of extruisni.Ornate Factor 2: amount addr"tl.veof Interaction Residual Total (corrected)
d.f. SSP [ 1.7405 -1.1.35005045 -..78395555 ] 1 . 4 205 .7605 ..66825125 1.1.79325305 ] 1 [[ .0005 .0165 4.�90055 ] .5445 3.1.49605685 1 [ 1.764(] 2..06200280 -3.-.05520700 ] 16 64.9240 [ 4.2655 -.5.70855855 -1.23959095 ] 19 74.2055
Sec. PAN EL 6.1
6.6
Two-Way Multivariate Analysis of Variance
339
SAS ANALYSIS F O R EXAM PLE 6. 1 1 U S I N G P ROC G L M .
title 'MAN OVA'; data fi l m ; i nfi le 'T6-4.dat'; i n put x 1 x2 x3 facto r1 factor2; proc g l m data = fi l m ; class factor1 facto r2; model x1 x2 x3 facto r1 factor2 facto r1 * facto r2 /ss3; m a n ova h = facto r1 facto r2 factor1 * factor2 /pri nte; means factor1 facto r2;
PROGRAM COMMANDS
=
Genera l Linear Models Proced u re Class Level I nfo rmatio n
J
Dependent ,V ariable:
Leve ls Cl ass V a l u es FACTOR 1 2 0 1 FACTO R2 2 0 1 N u m ber of o bservations i n d ata set = 20
X1 J
S o u rce Model E rror Corrected Tota l
S o u rce FACTOR1
I
I
FACTOR2 FACTOR1*f,.XCTOR2 DependentVa riable:
S o u rce Model E rror Corrected Total
OUTPUT
DF 3 16 19
S u m of Squares 2.501 50000 1 .76400000 4.26550000
Mean S q u a re 0.83383333 0 . 1 1 025000
R-Sq u a re 0.586449
4.893724
c.v.
Root M S E 0.332039
DF
Type Ill SS
Mean S q u a re
F Va l u e
1 .74050000
1 .7 4050000 0.76050000 0.00050000
1 5.79 6.90 0.00
0.00 1 1 0 .0 1 83 0.947 1
DF 3 16 19
S u m of S q u a res 2 . 45750000 2.62800000 5.08550000
Mean S q u a re 0.8 1 9 1 6667 0 . 1 6425000
F Va l u e 4.99
Pr > F 0.0 1 25
R-Sq u a re 0.483237
4.350807
c.v.
Root M S E 0 . 405278
.· 1 x2 j
o:!>oo5oooo 0.76050000
F Va l u e 7.56
Pr > F 0.0023
X1 Mean 6.78500000 Pr
>
F
X2 Mean 9 . 3 1 500000
340
Chap.
6
Comparisons of Several Multivariate Means
PANEL 6. 1
(continued)
S o u rce
OF
Type I l l SS
F Va l u e
1 .30050000 0.6 1 250000 0.54450000
7.92 3.73 3.32
0.0125 0.07 1 4 0.0874
Mean S q u a re 3.09383333 4.05775000
F Va l u e 0.76
Pr > F 0 . 53 1 5
S u m of OF S q u a res 3 9.281 50000 1 6 64.92400000 1 9 7 4.20550000
S o u rce Model E rror Co rrected Tota l
R-Sq uare 0 . 1 25078
Root MSE 2 . 0 1 4386
c.v.
51 . 1 9151
Pr
Pr
X1 X2 X3
0.42050000 4.90050000 3.96050000
e
=
i:rr()r, '$�&cP Matti��!! X2 0.02 2.628 -0.552
X1 1 .764 0.02 -3.07
0. 1 0 1 .2 1 0.98
>
X3 - 3.07 - 0 . 552 64.924
the =
Type Ill SS&CP Matrix fo r FACTO R 1 S 1 M 0.5 =
Pil l a i's Trace Hote l l i ng-Lawley Trace Roy's G reatest Root
0.61 8 1 4 1 62 1 .6 1 877 1 88 1 .6 1 877 1 88
=
7 . 5543 7.5543 7.5543
N
E =
=
6
Error SS&CP Matrix
3 3 3
14 14 14
F
0.75 1 7 0.2881 0.3379
Manova Test Criteria and Exact F Stati stics for
H
F
X 3 Mean 3.93500000
S o u rce
I
>
Mean S q u a re
0.0030 0.0030 0.0030
PANE L 6. 1
Sec.
Two-Way Multivariate Analysis of Variance
6.6
(contin ued)
I Hypothf!!��� ofno��erall FAE:TOR2
M a n ova Test Criteria a n d Exact F Statistics for the
Effect
341
I
E = E rror SS&CP M atrix H = Type I l l SS&CP Matrix fo r FACTOR2 N = 6 M = 0.5 S = 1
:Wilks' Lamqda
;Statistic
0.476965 1 0 0.91 1 9 1 832 0.91 1 9 1 832
Pi l lai's Trace H ote l l i ng-Lawley Trace R oy's G reatest Root
Hypothesis
.::: · :�}.2556
N u m OF
4.2556 4.2556 4.2556
3
3 3 3
14 14 14
0.0247 0.0247 0 .0247
M a n ova Test Criteria a n d Exact F Statistics fo r
the
of
no
Overall FACTOR 1 *FACTOR2 Effect
H = Type I l l SS&CP M atrix fo r FACTOR 1 * FACTOR2 N = 6 M = 0.5 S = 1
·
'
E = E rror SS&CP M atrix
Lariloda 0.22289424 0.2868261 4 0. 286826 1 4
Pi l l ai's Trace H otel l i n g-Lawley Trace Roy's G reatest Root Level of FACTOR 1 0
N 10 10
14 14 14
0.30 1 8 0.30 1 8 0.30 1 8
- - - - - - - - - X2 - - - - - - - - -
Mean 6. 49000000 7.08000000
Mean 9 . 57 000000 9 .06000000
SD 0.420 1 85 1 4 0.32249031
Mean 6.59000000 6.98000000
Level of FACTO R2 0
SD 0 .29832868 0 . 5758086 1
- - - - - - - - - X3 - - - - - - - - N 10 10
Mean 3.79000000 4.08000000
--------- X1 --------N 10 10
3 3 3
--------- X1 ---------
Level of FACTOR 1 0
Level of FACTO R2 0
1 .3385 1 .3385 1 .3385
SD 0. 40674863 0.47328638
SD 1 .8537949 1 2 . 1 82 1 49 8 1 - - - - - - - - - X2 - - - - - - - - Mean 9 . 1 4000000 9. 49000000
- - - - - - - - - X3 - - - - - - - - N 10 10
Mean 3.44000000 4.43000000
SD 1 .55077042 2.30 1 23 1 55
SD 0.560 1 587 1 0.42804465
342
Chap.
6
Comparisons of Several Multivariate Means
=
To test for interaction, we compute
275.7098 7771 I SSPres I 354.7906 • I SSPint + SSP I For (g - 1) (b - 1) = 1, 1 - A * ) (gb (n - 1) - p + 1)/2 F- ( ( I (g - ) ( b - 1) - p I + 1) /2 A* has an exact F-distribution with v1 I (g - 1 ) ( b - 1) - p I + 1 and v2 gb (n - 1) - p + 1 d.f. ( See [1].) For our example, 1 - .7771 ) (2 (2) (4) - 3 + 1)/2 1. 34 F ( .7771 ( 1 1 (1) - 3 1 + 1)/2 v, ( 1 1 (1) - 3 1 + 1) = 3 v2 = (2 (2) (4) - 3 + 1) = 14 and F3 1 (.05) = 3.34. Since F 1.34 < F3 , 14 (.05) 3.34, we do not reject the hypothesis H0 : y 1 1 y 12 y21 y22 0 (no interaction effects). A*
=
res
1
=
=
=
=
=
,
4
=
=
=
=
=
=
Note that the approximate chi-square statistic for this test is
- [2 (2) (4) - (3 + 1 - 1 (1))/2] ln (.7771) 3.66, from (6-54). Since xj (.05) 7.81, we would reach the same conclusion as provided by the =
=
exact F-test. To test for factor 1 and factor 2 effects (see page 336), we calculate
I SSPres I A* 1 I SSPfac 1 + SSP res I _
=
275.7098 722.0212
= .38 1 9
275.7098 527.1347
=
and
A *z For both g
-1
=
I SSPres I I SSPrac2 + SSPres I 1 and b - 1 = 1, =
_(
=
5230
Fl -
1 - A � ) (gb (n - 1) - p + 1)/2 ( I (g - 1) - p I + 1) /2 A�
Fz -
( 1 - Ai ) (gb (n - 1) - p + 1 ) /2
and _
Ai
( l (b - 1) - p l + 1)/2
Sec.
6.7
Profile Analysis
343
- 1) - p I + 1, = = + 1, v2I (g= gb(n - 1) - p v2+ 1, l (b ( 1 - .3819 ) (16 - 3 + 1)/2 = 7.55 F1 = .3819 ( 1 1 - 3 1 + 1)/2 ( 1 - .5230 ) (16 - 3 + 1)/2 4.26 Fz = .5230 ( 1 1 - 3 1 + 1) /2
+
have F-distributions with degrees of freedom 1) - p 1 and v1 = 1) - P I respectively. (See [ 1] . ) In our case,
gb(n -
v1
=
and
+1
3 Vz (16 - 3 + 1) = 14 From before, F3 14 ( .05) 3.34. We have F1 = 7.55 > F3 , 1 4 ( .05) = 3.34, , and therefore, we reject H0 : -r1 = -r2 = 0 (no factor 1 effects) at the 5% level. Similarly, F2 = 4.26 > F3 ,1 4 (.05) = 3.34, and we reject H0 : {J1 = {J2 = 0 (no factor 2 effects) at the 5% level. We conclude that both the change in rate of v1
= 11 - 31
=
=
=
extrusion
amount of additive
and the affect the responses, and they do so in an additive manner. The of the effects of factors 1 and 2 on the responses is explored in Exercise 6.15. In that exercise, simultaneous confidence intervals for con trasts in the components of Te and {Jk are considered. •
nature
6.1 PROFILE ANALYSIS
Profile analysis pertains to situations in which a battery ofp treatments (tests, ques tions, and so forth) are administered to two or more groups of subjects. All responses must be expressed in similar units. Further, it is assumed that the responses for the different groups are independent of one another. Ordinarily, we might pose the question, Are the population mean vectors the same? In profile analysis, the question of equality of mean vectors is divided into several specific possibilities. Consider the population means /L � = [ JL 1 1 , 11-1 2 , 11- 1 3 , JL 1 4 ] representing the average responses to four treatments for the first group. A plot of these means, connected by straight lines, is shown in Figure 6.4 on page 344. This broken-line graph is the for population 1. Profiles can be constructed for each popula tion (group). We shall concentrate on two groups. Let p ; = [ JL 1 1 , JL 1 2 , . . , fL i p ] and /L� = [ JL2 1 , JL22 , . . . , IL z p ] be the mean responses to p treatments for populations 1 and 2, respectively. The hypothesis H0 : p1 = p 2 implies that the treatments have the same (average) effect on the two populations. In terms of the population profiles, we can formulate the question of equality in a stepwise fashion.
profile
.
344
Chap.
6
Comparisons of Several Multivariate Means
Mean response
�
- - - - - - - - - ---- - - - - - - - - - - - - - - -
2
3
Figure 6.4 p = 4.
4
= =
The population profile
=
1. Are the profiles parallel? Equivalently: Is H0 1 : p.,1 ; - IL J i - l /L z; - p., 2 ; _ 1 , i 2, 3, . . . , p, acceptable? 2. Assuming that the profiles are parallel, are the profiles coincident? 6 Equivalently: Is H0 2 : p., 1 ; /L z; , i 1, 2, . . . , p, acceptable? 3. Assuming that the profiles are coincident, are the profiles level? That is, are all the means equal to the same constant? Equivalently: Is H0 3 : p.,1 1 p., 1 2 · · · fLi p /L2 1 /L zz · · · /L z p acceptable?
=
= = =
=
=
= =
The null hypothesis in stage 1 can be written where
[=
C is the contrast matrix c
((p - ! ) Xp)
-1
� �
1 0 0 -1 1 0 0 0 0
J !l
(6-61)
For independent samples of sizes n 1 and n 2 from the two populations, the null hypothesis can be tested by constructing the transformed observations j
=
j
= 1, 2, . . . , n2
and
1, 2 ,
. . . , n1
6 The question, 'Assuming that the profiles are parallel, are the profiles linear?' is considered in Exercise 6.12. The null hypothesis of parallel linear profiles can be written, H0 : (J.t l i + J.t2 ; ) ( J.t i i - l + J.t2 , _ 1 ) = (J.t 1 , _ 1 + J.t2, _ 1 ) - ( J.t i i - 2 + J.t2 , _ 2 ) , i = 3, . . . , p. Although this hypothesis may be of interest in a particular situation, in practice the question of whether two parallel profiles are the same (coincident ) , whatever their nature, is usually of greater interest.
Sec.
6.7
Profile Analysis
345
These have sample mean vectors C x 1 and Cx2 , respectively, and pooled covari ance matrix Since the two sets of transformed observations have Np - I ( Cp 1 , C!.C' ) and NP _ 1 ( Cp2 , C!.C' ) distributions, respectively, an application of Result 6.2 provides a test for parallel profiles.
CSpooiect C '.
Reject T2
H0 1 : {:p1
=
TEST FOR PARALLEL PROFI LES FOR TWO NORMAL POPULATIONS
Cp 2 (parallel profiles ) at level
= (it - Xz)'C' [ (�t �)cspoolectC' rt +
a
if
C ( xt - x 2 )
>
cz
(6-62)
where
When the profiles are parallel, the first is either above the second ( JL i i > JL 2 i , for all i), or vice versa. Under this condition, the profiles will be coincident only if the total heights f.Lt t JL1 2 JL 2p = 1 ' p2 are f.Lt p = 1 ' p 1 and JL 2 1 JL 22 equal. Therefore, the null hypothesis at stage 2 can be written in the equivalent form
+ + ··· +
Ho 2 : 1 ' Itt =
+ + ··· +
1 ' P2
We can then test H0 2 with the usual two-sample t-statistic based on the univariate observations 1 ' x1j, j = 1, 2, . . . , n 1 , and 1 ' x 2j , j = 1, 2, . . . , n 2 . TEST FORCOINCIDENT PROFILES, G IVEN THAT PROFI LES ARE PARALLEL
For two norinal populations: Reject H02 : at level a if
=
(
1/
�( _!_nt _!_nz ) s
-pooledl )
( -x i - -X z )
+
1
1
.
2
>
1'p 1
= 1'p2
(profiles coincident)
( ) = Fl,n1 + n2-2 ( a)
t n,2 + nz-2 � 2
(6-63)
346
Chap.
Comparisons of Several Multivariate Means
6
For coincident profiles, x 1 1 , x 1 2 , , x1 n I and x 2 1 , x 22 , . . . , x 2 n2 are all observations from the same normal population. The next step is to see whether all variables have the same mean, so that the common profile is level. When H0 1 and H0 2 are tenable, the common mean vector p. is estimated, using all 2 observations, by • . .
•
n1 + n
If the common profile is level, f.l- 1 3 can be written as
= = ··· = f.l- z
Jl-p ,
and the null hypothesis at stage
where C is given by (6-61). Consequently, we have the following test. TEST FOR LEVEL PROFILES, GIVEN. THAT PROFILES ARE COINCIDENT
Fl)r two noi"mal pop:glations: ' '' i':
, ' �' ;o,: ,
:(\
Reject H03 : . (;p. ' c 2 where c 2 = (n ln +I n2n - -2 )p(p -1 2) Fp - 2, n , + n, - p + l ( a ) + 2 + Let n 1 = 30, n2 = 30, x; = [6.4, 6.8, 7. 3 , 7.0], i � = [4.3, 4.9, 5. 3 , 5.1], and .61 .26 .07 .16 s pooled = ..2067 .17.64 .17.81 .14.03 .16 .14 . 03 . 3 1 Test for linear profiles, assuming that the profiles are parallel. Use a = .05. 6.13. (Two-way MANOVA without replications.) Consider the observations on two responses, x1 and x2 , displayed in the form of the following two-way table (note that there is a single observation vector at each combination of factor levels): 1
l!
!]
l
Level 1
Level l Factor 1 Level 2 Level 3
]
Factor 2 Level Level 2 3
Level 4
[ � ] [ : ] [ 1 � ] [� ] -�] [:] [ -:] [ [�] = [ -�] [ ;J [ - [ =:J �J
362
Chap.
6
Comparisons of Several Multivariate Means
With no replications, the two-way MANOVA model is g
� Te f=1
b
= k� fJk =1
=0
where the ee k are independent Np (O, I) random vectors. (a) Decompose the observations for each of the two variables as similar to the arrays in Example 6.8. For each response, this decomposi tion will result in several 3 4 matrices. Here :X is the overall average, :Xe. is the average for the t'th level of factor 1, and x.k is the average for the kth level of factor 2. (b) Regard the rows of the matrices in Part a as strung out in a single "long" vector, and compute the sums of squares X
sstot
= ssmean + ssfac + ssfac 2 + ssres 1
and sums of cross products SCPtot = SCPmean + SCPrac + SCPrac 2 + SCPres Consequently, obtain the matrices SSP SSP SSP 2 , and SSP with degrees of freedom gb - 1, g - c1,or >b - 1,rae and (grae- 1) (b - 1), respectively. (c) Summarize the calculations in Part b in a MANOVA table. Hint: This MANOVA table is consistent with the two-way MANOVA table for comparing factors and their interactions where n = 1. Note that, with n = 1, SSP in the general two-way MANOVA table is a zero matrix with zero degrees of freedom. The matrix of interaction sum of squares and cross products now becomes the residual sum of squares and cross products matrix. (d) Given the summary in Part c, test for factor 1 and factor 2 main effects at the a = .05 level. Hint: Use the results in (6-56) and (6-58) with gb(n - 1) replaced by (g - 1) (b - 1). Note: The tests require that p ::;:; (g - 1) (b - 1) so that SSPres will be positive definite (with probability 1). 1
1 ,
res
res
Chap. 6.14.
A replicate
6
Exercises
363
of the experiment in Exercise 6.13 yields the following data:
Factor 2 Level Level Level Level 4 3 2 1 Level l [ 1: ] [ � ] [� ] [ �� ] Factor 1 Level 2 [ � ] [ 1 � ] [ 1 � ] [ � ] Level 3 [ -� J [ - � ] [ - 1 � ] [ - � ] (a)
Use these data to decompose each of the two measurements in the obser vation vector as
where :X is the overall average, :Xe . is the average for the t'th level of fac tor 1, and x. k is the average for the kth level of factor 2. Form the cor responding arrays for each of the two responses. (b) Combine the preceding data with the data in Exercise 6.13, and carry out the necessary calculations to complete the general two-way MANOVA table. (c) Given the results in Part b, test for interactions, and if the interactions do not exist, test for factor 1 and factor 2 main effects. Use the likelihood ratio test with a = .05. (d) If main effects, but no interactions, exist, examine the nature of the main effects by constructing Bonferroni simultaneous 95% confidence intervals for differences of the components of the factor effect parameters. 6.15. Refer to Example 6.11. (a) Carry out approximate chi-square (likelihood ratio) tests for the factor 1 and factor 2 effects. Set a = .05. Compare these results with the results for the exact F-tests given in the example. Explain any differences. (b) Using (6-59), construct simultaneous 95% confidence intervals for differ ences in the factor 1 effect parameters for pairs of the three responses. Interpret these intervals. Repeat these calculations for factor 2 effect parameters.
364
Chap.
6
Comparisons of Several Multivariate Means
The following exercises may require the use of a computer. 6.16. Four measures of the response stiffness on each of 30 boards are listed in
Table 4.3 (see Example 4.14). The measures, on a given board, are repeated in the sense that they were made one after another. Assuming that the mea sures of stiffness arise from 4 treatments, test for the equality of treatments in a repeated measures design context. Set a =.05. Construct a 95% (simulta neous) confidence interval for a contrast in the mean levels representing a comparison of the dynamic measurements with the static measurements. 6.17. Jolicoeur and Mosimann [11] studied the relationship of size and shape for painted turtles. Table 6.7 contains their measurements on the carapaces of 24 female and 24 male turtles. CARAPACE M EASU REM ENTS ( I N M I LLI M ETERS) FOR PAI NTED TU RTLES
TABLE 6.7
Female Length Width Height (xz ) (x3) (xl ) 38 81 98 38 84 103 86 103 42 86 42 105 88 44 109 50 92 123 46 95 123 51 99 133 102 51 133 51 102 133 48 100 134 49 102 136 51 98 138 51 99 138 53 105 141 57 108 147 55 107 149 56 107 153 63 115 155 60 117 155 62 115 158 63 118 159 61 124 162 67 132 177
Length (xi ) 93 94 96 101 102 103 104 106 107 112 113 114 116 117 117 119 120 120 121 125 127 128 131 135
Male Width (xz ) 74 78 80 84 85 81 83 83 82 89 88 86 90 90 91 93 89 93 95 93 96 95 95 106
Height (x3) 37 35 35 39 38 37 39 39 38 40 40 40 43 41 41 41 40 44 42 45 45 45 46 47
Chap.
6
Exercises
365
Test for equality of the two population mean vectors using a = .05. If the hypothesis in Part a is rejected, find the linear combination of mean components most responsible for rejecting H0• (c) Find simultaneous confidence intervals for the component mean differ ences. Compare with the Bonferroni intervals. Hint. You may wish to consider logarithmic transformations of the observations. 6.18. In the first phase of a study of the cost of transporting milk from farms to dairy plants, a survey was taken of firms engaged in milk transportation. Cost data on X1 = fuel, X2 = repair, and X3 = capital, all measured on a per-mile basis, are presented in Table 6.8 on page 366 for n1 = 36 gasoline and n2 = 23 diesel trucks. (a) Test for differences in the mean cost vectors. Set a = . 0 1. (b) If the hypothesis of equal cost vectors is rejected in Part a, find the linear combination of mean components most responsible for the rejection. (c) Construct 99% simultaneous confidence intervals for the pairs of mean components. Which costs, if any, appear to be quite different? (d) Comment on the validity of the assumptions used in your analysis. Note in particular that observations 9 and 21 for gasoline trucks have been identified as multivariate outliers. (See Exercise 5.20 and [2]. ) Repeat Part a with these observations deleted. Comment on the results. 6.19. The tail lengths in millimeters (x 1 ) and wing lengths in millimeters (x2 ) for 45 male hook-billed kites are given in Table 6.9 on page 367. Similar measure ments for female hook-billed kites were given in Table 5.11. (a) Plot the male hook-billed kite data as a scatter diagram, and (visually) check for outliers. (Note, in particular, observation 31 with x1 = 284.) (b) Test for equality of mean vectors for the populations of male and female hook-billed kites. Set a = .05. If H0 : IL l - IL z = 0 is rejected, find the lin ear combination most responsible for the rejection of H0 (You may want to eliminate any outliers found in Part a for the male •hook-billed kite data before conducting this test. Alternatively, you may want to interpret x 1 = 284 for observation 31 as a misprint and conduct the test with x1 = 184 for this observation. Does it make any difference in this case how observation 31 for the male hook-billed kite data is treated?) (c) Determine the 95% confidence region for IL l - IL z and 95% simultaneous confidence intervals for the components of IL1 - IL z . (d) Are male or female birds generally larger? 6.20. Using Moody ' s bond ratings, samples of 20 Aa (middle-high quality) corpo rate bonds and 20 Baa (top-medium quality) corporate bonds were selected. For each of the corresponding companies, the ratios X1 = current ratio (a measure of short-term liquidity) X2 = long-term interest rate (a measure of interest coverage) X3 = debt-to-equity ratio (a measure of financial risk or leverage) X4 = rate of return on equity (a measure of profitability) (a)
(b)
366
Chap.
6
Comparisons of Several Multivariate Means
MILK TRANSPORTATI O N-COST DATA Gasoline trucks Diesel trucks
TABLE 6.8
xl
16.44 7.19 9.92 4.24 11.20 14.25 13.50 13. 32 29.11 12. 68 7. 5 1 9.90 10.25 11.11 12. 1 7 10.24 10. 1 8 8.88 12.34 8. 5 1 26.16 12.95 16. 93 14.70 10.32 8.98 9.70 12.72 9.49 8.22 13.70 8.21 15.86 9.18 12.49 17.32
Xz
12.43 2.70 1.35 5.78 5.05 5.78 10.98 14.27 15.09 7.61 5.80 3.63 5.07 6. 1 5 14.26 2.59 6.05 2.70 7.73 14.02 17.44 8.24 13.37 10.78 5.16 4.49 11.59 8.63 2. 1 6 7.95 11.22 9.85 11.42 9. 1 8 4.67 6.86
x3
11.23 3. 92 9.75 7.78 10.67 9.88 10.60 9.45 3.28 10.23 8. 1 3 9. 1 3 10. 1 7 7. 6 1 14.39 6.09 12. 1 4 12.23 11.68 12.01 16.89 7. 1 8 17.59 14.58 17.00 4.26 6.83 5.59 6.23 6.72 4.91 8. 1 7 13.06 9.49 11.94 4.44
xl
8.50 7. 42 10.28 10.16 12.79 9. 60 6.47 11.35 9. 1 5 9.70 9.77 11. 61 9.09 8. 5 3 8.29 15.90 11.94 9.54 10.43 10.87 7.13 11.88 12.03
Source: Data courtesy of M. Keaton.
Xz
12.26 5.13 3.32 14.72 4.17 12.72 8.89 9.95 2.94 5.06 17.86 11.75 13.25 10. 1 4 6.22 12. 90 5. 69 16.77 17. 65 21.52 13.22 12. 1 8 9.22
x3
9.1 1 17.15 11.23 5. 99 29.28 11.00 19.00 14.53 13.68 20.84 35. 1 8 17.00 20.66 17.45 16.38 19.09 14.77 22.66 10. 66 28.47 19.44 21.20 23.09
Chap. TABLE 6.9 x
(Tailt length) 180 186 206 184 177 177 176 200 191 193 212 181 195 187 190
6
Exercises
MALE HOOK-BI LLED KITE DATA x
X
(Tailt length) 185 195 183 202 177 177 170 186 177 178 192 204 191 178 177
z (Wing �ength) 278 277 308 290 273 284 267 281 287 271 302 254 297 281 284
X
z (Wing length) 282 285 276 308 254 268 260 274 272 266 281 276 290 265 275
x
(Tailt length) 284 176 185 191 177 197 199 190 180 189 194 186 191 187 186
Source: Data courtesy of S. Temple.
] ]
X
z (Wing length) 277 281 287 295 267 310 299 273 278 280 290 287 286 288 275
were recorded. The summary statistics are as follows: Aa bond companies: n 1 = 20, x 1 = [2. 2 87, 12.600, .347, 14. 830]', and .459 .254 - .026 - .244 . 8 1 = 254 27.465 - .5 89 - . 267 -.026 -.589 .030 .102 -.244 - .267 .102 6. 854 Baa bond companies: n 1 = 20, x 2 = [2. 404, 7. 1 55, . 5 24, 12. 840]', . 944 - .089 .002 -. 7 19 8 2 = - . 089 16.432 - .400 19. 044 .002 -.400 .024 -. 094 - .7 19 19. 044 - .094 61.854 and .701 . 083 -.012 ! .083 21. 949 - .494 9.388 s pooled = - .012 - .494 . 027 .004 -.481 9.388 .004 34.354
r r
r
-
�
]
367
368
Chap.
6
Comparisons of Several Multivariate Means
Does pooling appear reasonable here? Comment on the pooling proce dure in this case. (b) Are the financial characteristics of firms with Aa bonds different from those with Baa bonds? Using the pooled covariance matrix, test for the equality of mean vectors. Set a = .05. (c) Calculate the linear combinations of mean components most responsible for rejecting H0: p1 - p2 = 0 in Part b. (d) Bond rating companies are interested in a company's ability to satisfy its outstanding debt obligations as they mature. Does it appear as if one or more of the foregoing financial ratios might be useful in helping to clas sify a bond as "high" or "medium" quality? Explain. 6.21. Researchers interested in assessing pulmonary function in nonpathological populations asked subjects to run on a treadmill until exhaustion. Samples of air were collected at definite intervals and the gas contents analyzed. The results on 4 measures of oxygen consumption for 25 males and 25 females are given in Table 6. 1 0 on page 369. The variables were X1 = resting volume 02 (L/min) X2 = resting volume 0 2 (mL/kg/min) X3 = maximum volume 02 (L/min) X4 = maximum volume 02 (mL/kg/min) (a) Look for gender differences by testing for equality of group means. Use a = .05. If you reject H0: p1 - p2 = 0, find the linear combination most responsible. (b) Construct the 95% simultaneous confidence intervals for each - JJ- 2 ; , i = 1, 2, 3, 4. Compare with the corresponding Bonferroni intervals. (c) The data in Table 6. 1 0 were collected from graduate-student volunteers, and thus they do not represent a random sample. Comment on the pos sible implications of this information. 6.22. Construct a one-way MANOV A using the width measurements from the iris data in Table 11.5 . Construct 95% simultaneous confidence intervals for dif ferences in mean components for the two responses for each pair= of popula tions. Comment on the validity of the assumption that � 1 = �2 �3 • 6.23. Construct a one-way MANOVA of the crude-oi� data listed in Table 11. 7 . Construct 95% simultaneous confidence intervals to determine which mean components differ among the populations. (You may want to consider trans formations of the data to make them more closely conform to the usual MANOVA assumptions.) 6.24. A project was designed to investigate how consumers in Green Bay, Wis consin, would react to an electrical time-of-use pricing scheme. The cost of electricity during peak periods for some customers was set at eight times the (a)
f.L u
·
TABLE 6. 1 0 OXYG EN-CON S U M PTION DATA
Females Males X Resting 02 Resting2 02 Maximum 02 Maximum 02 Resting! 02 Resting2 02 Maximum 02 Maximum 02 (L/min) (mL/kg/min) (L/min) (mL/kg/min) (L/min) (mL/kg/min) (L/min) (mL/kg/min) 33. 85 0.34 1. 93 3.7 1 2. 87 30.87 0.29 5.04 35.82 0.39 2.5 1 5.08 43. 85 3. 38 0.28 3.95 36.40 0.48 2. 3 1 5.13 44. 5 1 0. 3 1 4.88 37. 87 0.31 1. 90 3. 95 3. 60 46.00 5. 97 0.30 38.30 0.36 2.32 4.57 5. 5 1 3.11 47.02 0.28 39.19 0.33 2.49 4.07 48.50 3.95 1.74 0.11 39. 2 1 0.43 2.12 4.77 4.39 48.75 0.25 4.66 39. 94 0.48 1.98 6.69 3. 5 0 48.86 0.26 5.28 42. 4 1 0.21 2.25 3.71 2. 82 48.92 7.32 0.39 28.97 0.32 1.7 1 4.35 48.38 3. 5 9 6.22 0. 3 7 37.80 0.54 2.76 7.89 3.47 4.20 50. 5 6 0. 3 1 31.10 0.32 2.10 5. 37 3.07 51.15 5.10 0.35 38.30 0.40 2.50 4. 95 4.43 4.46 55. 34 0.29 51.80 0.31 3. 06 4.97 5.60 3.5 6 56.67 0.33 37. 60 0.44 2. 40 6.68 2. 80 3.86 58.49 0.18 36.78 2.58 0.32 4.01 4.80 49.99 3.31 0.28 46.16 3.05 0.50 6.43 6.69 42.25 3.29 0.44 38.95 1.85 0.36 4.55 5.99 51.70 3.10 0.22 40.60 2.43 0.48 6.30 4.80 5.73 63.30 0.34 43.69 2.58 5.12 0.40 6.00 3.06 46.23 0.30 30.40 1. 97 0.42 6.04 4.77 55.08 3.85 0. 3 1 39.46 2.03 0.55 6.45 5.00 5.16 58. 80 0.27 39.34 2.32 0.50 11.05 5.55 0.66 57.46 5. 23 34.86 0.34 2. 48 4.27 5.23 4. 00 0.37 50.35 35.07 2.25 0.40 4.58 5.37 2.82 0.35 32.48 xl
x
x3
4.13
0
w
�
Source: Data courtesy of S. Rokicki.
x4
x
x3
x4
370
Chap.
6
Comparisons of Several Multivariate Means
cost of electricity during off-peak hours. Hourly consumption (in kilowatt hours) was measured on a hot summer day in July and compared, for both the test group and the control group, with baseline consumption measured on a similar day before the experimental rates began. The responses, log(current consumption) - log(baseline consumption) for the hours ending 9 11 (a peak hour), 1 and 3 (a peak hour) produced the following summary statistics: A.M.,
Test group: n1 = Control group: n2 =
and
s pooled
=
A.M.
P.M.,
P.M.
]
28, x 1 = [.153, -.231, - .322, -.339]' 58, x 2 = [.151, .180, .256, .257]' .804 .355 .228 .232 .355 .722 .233 .199 .228 .233 .592 .239 .232 .199 .239 .479
l
WiSource:sconsiDatn. a courtesy of Statistical Laboratory, University of Perform a profile analysis. Does time-of-use pricing seem to make a differ ence in electrical consumption? What is the nature of this difference, if any? Comment. (Use a significance level of a = .05 for any statistical tests. ) 6.25. As part qf the study of love and marriage in Example 6.1�, a sample of hus bands and wives were asked to respond to these questions: 1. What is the level of passionate love you feel for your partner? 2. What is the level of passionate love that your partner feels for you? What is the level of companionate love that you feel for your partner? 4. What is the lev�l of companionate love that your partner feels for you? The responses were recorded on the following 5-point scale. 3.
None at all
Very little
Some
A great deal
Tremendous amount
2
3
4
5
in Table 6.11, where X1 = Thirty husbands and 30 wives gave the responses response to a 5-point-scale response to Question 1, X2 = a 5-po�nt-scale Question 2, x3 = a 5-point-scale response to Question 3, and x4 = 5-point scale response to Question 4. (a) Plot the mean vectors for husbands and wives as sample profiles. a
Chap. TABLE 6.1 1
SPOUSE DATA
2 5 4 4 3 3 3 4 4 4 4 5 4 4 4 3 4 5 5 4 4 4 3 5 5 3 4 3 4 4
Xz
3 5 5 3 3 3 4 4 5 4 4 5 4 3 4 3 5 5 5 4 4 4 4 3 5 3 4 3 4 4
x3
5 4 5 4 5 4 4 5 5 3 5 4 4 5 5 4 4 5 4 4 4 4 5 5 3 4 4 5 3 5
x4
5 4 5 4 5 5 4 5 5 3 5 4 4 5 5 5 4 5 4 4 4 4 5 5 3 4 4 5 3 5
Exercises
371
Wife rating husband
Husband rating wife xl
6
xl
4 4 4 4 4 3 4 3 4 3 4 5 4 4 4 3 5 4 3 5 5 4 2 3 4 4 4 3 4 4
Xz
4 5 4 5 4 3 3 4 4 4 5 5 4 4 4 4 5 5 4 3 3 5 5 4 3 4 4 4 4 4
x3
5 5 5 5 5 4 5 5 5 4 5 5 5 4 5 4 5 4 4 4 4 4 5 5 5 4 5 4 5 5
x4
5 5 5 5 5 4 4 5 4 4 5 5 5 4 5 4 5 4 4 4 4 4 5 5 5 4 5 4 4 5
Source: Data courtesy of E. Hatfield. (b)
Is the husband rating wife profile parallel to the wife rating husband pro file? Test for parallel profiles with a = .05. If the profiles appear to be parallel, test for coincident profiles at the same level of significance. Finally, if the profiles are coincident, test for level profiles with a = .05. What conclusion(s) can be drawn from this analysis?
372
Chap.
6
Comparisons of Several Multivariate Means
Two species of biting flies (genus Leptoconops) are so similar morphologically, that for many years they were thought to be the same. Biological differences such as sex ratios of emerging flies and biting habits were found to exist. Do the taxonomic data listed in part in Table 6. 1 2 on page 373 and on the com puter disk indicate any difference in the two species L. carteri and L. torrens? Test for the equality of the two population mean vectors using a = .05. If the hypotheses of equal mean vectors is rejected, determine the mean components (or linear combinations of mean components) most responsible for rejecting H0. Justify your use of normal-theory methods for these data. 6.27. Using the data on bone mineral content in Table 1. 6 , investigate equality between the dominant and nondominant bones. (a) Test using a = . 0 5. (b) Construct 95% simultaneous confidence intervals for the mean differences. (c) Construct the Bonferroni 95% simultaneous intervals, and compare these with the intervals in Part b. 6.28. Table 6.13 on page 374 contains the bone mineral contents, for the first 24 subjects in Table 1.6 , 1 year after their participation in an experimental pro gram. Compare the data from both tables to determine whether there has been bone loss. (a) Test using a = .05. (b) Construct 95% simultaneous confidence intervals for the mean differences. (c) Construct the Bonferroni 95% simultaneous intervals, and compare these with the intervals in Part b. 6.29. Peanuts are an important crop in parts of the southern United States. In an effort to develop improved plants, crop scientists routinely compare varieties with respect to several variables. The data for one two-factor experiment are given in Table 6. 1 4 on page 375. Three varieties (5, 6, and 8) were crossed with two geographical locations (1, 2), and, in this case, the three variables representing yield and the two most important grade-grain characteristics were measured. The three variables are: X1 = Yield (plot weight) X2 = Sound mature kernels (weight in grams-maximum of 250 grams) X3 = Seed size (weight, in grams, of 100 seeds) There were two replications of the experiment. (a) Perform a two-factor MANOVA using the data in Table 6.14. Test for a location effect, a variety effect, and a location-variety interaction. Use a = .05. (b) Analyze the residuals from Part a. Do the usual MANOVA assumptions appear to be satisfied? Discuss. (c) Using the results in Part a, can we conclude that the location and/or vari ety effects are additive? If not, does the interaction effect show up for some variables, but not for others? Check by running three separate uni variate two-factor ANOVA's. (d) Larger numbers correspond to better yield and grade-grain characteris tics. Using location 2, can we conclude that one variety is better than the 6.26.
TABLE 6.12 BITING-FLY DATA
Variables: x1 = wing length, x2 = wing width, x3 = length of third palp, x4 = width of third palp, x5 = length of fourth palp, x6 = length of antennal segment 12, x7 = length of antennal segment 13. Within each row of values below, the first block belongs to L. torrens and the second block to L. carteri.
x1:
8587 9492 919096 9291 87 106 105 103 100 109 10495 10490 104 8694 10382 103 101 103 10099 100 1109999 10395 101 10399 10599
x2:
4138 4443 444243 4341 38 4746 4441 4445 444040 46 404819 4143 4345 414443 454442 464743 435047 47
x3:
length 323136 3235 363636 3635 3834 353436 3635 3437 37 3738 354239 4044 424340 413538 363838 4037 4039
x4, x5:
width length 141315 222725 1714 28 26 161712 262624 1411 2324 2631 141515 1413 242723 232930 141515 2230 1412 1114 2531 3325 121514 1514 252932 151618 313431 1417 3336 141516 323131 151414 322337 1614 3334
x6, x7:
segment 12 1389 1099 999 9 1010 101110 109 99 10 96 109 9119 101011 9109 108 1111 1112 7
segment 13 1389 1099 999 10 1011 101010 101010 1010 7109 998 101011 10 10109 108 1111 1011 7
Source: Data courtesy of William Atchley.
TABLE 6.13 MINERAL CONTENT IN BONES (AFTER 1 YEAR)
Subject Dominant Dominant Dominant number radius Radius humerus Humerus ulna Ulna 1 1.027 1. 051 2.268 2.246 .869 . 964 2 . 602 .689 . 857 .817 1.7 18 1.7 10 .765 .738 3 . 875 .880 1.953 1.756 .761 .698 1.443 4 .698 1.668 .873 . 5 51 . 6 19 1.661 5 . 8 11 .813 1.643 .753 . 5 15 1.378 6 .734 1.3 96 .640 .708 .787 1.686 7 . 947 .865 1. 851 .687 . 7 15 1. 8 15 .886 .806 1.742 8 .844 .656 1.776 .991 .923 1. 931 9 .869 .789 2. 1 06 . 925 1. 933 . 977 10 .654 .726 1. 651 .826 1. 609 .825 11 . 692 .526 1.980 .765 2.352 .851 12 . 670 .580 1.420 .770 13 .730 1.470 .823 .773 1.809 .875 1.846 .912 14 .746 .729 1.579 15 .826 1.842 . 905 .656 .506 1.860 .727 1. 747 .756 16 .693 .740 1.941 .764 1.923 .765 17 .883 .785 1. 997 .914 2. 1 90 .932 18 .577 .627 1.228 19 .782 1.242 .843 . 802 .769 1.999 . 906 2. 1 64 .879 20 . 540 .498 1. 3 30 1. 5 73 .537 .673 21 . 804 .779 2. 1 59 .900 2. 1 30 . 949 22 .570 .634 1.265 .637 1.041 .463 23 .585 .640 1.411 .743 1.442 .776 24
Source: Data courtesy of Everett Smith.

other two for each characteristic? Discuss your answer, using 95% Bonferroni simultaneous intervals for pairs of varieties.
6.30. Refer to Example 6.13.
(a) Plot the profiles, the components of x̄1 versus time and those of x̄2 versus time, on the same graph. Comment on the comparison.
(b) Test that linear growth is adequate. Take α = .01.
6.31. Refer to Example 6.13, but treat all 31 subjects as a single group. The maximum likelihood estimate of the (q + 1) × 1 vector β is

$$\hat{\boldsymbol{\beta}} = (\mathbf{B}'\mathbf{S}^{-1}\mathbf{B})^{-1}\mathbf{B}'\mathbf{S}^{-1}\bar{\mathbf{x}}$$

where S is the sample covariance matrix. The estimated covariances of the maximum likelihood estimators are
TABLE 6.14 PEANUT DATA

Factor 1   Factor 2   X1      X2         X3
Location   Variety    Yield   SdMatKer   SeedSize
1          5          153.1   195.3      51.4
1          5          167.7   194.3      53.7
2          5          139.5   189.7      55.5
2          5          121.1   180.4      44.4
1          6          156.8   203.0      49.8
1          6          166.0   195.9      45.8
2          6          166.1   202.7      60.4
2          6          161.8   197.6      54.1
1          8          164.5   193.5      57.8
1          8          165.1   187.0      58.6
2          8          166.8   201.5      65.0
2          8          173.8   200.0      67.2
Source: Data courtesy of Yolanda Lopez.

$$\widehat{\operatorname{Cov}}(\hat{\boldsymbol{\beta}}) = \frac{(n-1)(n-2)}{(n-1-p+q)(n-p+q)\,n}\,(\mathbf{B}'\mathbf{S}^{-1}\mathbf{B})^{-1}$$

Fit a quadratic growth curve to this single group and comment on the fit.
CHAPTER 7

Multivariate Linear Regression Models

7.1 INTRODUCTION
Regression analysis is the statistical methodology for predicting values of one or more response (dependent) variables from a collection of predictor (independent) variable values. It can also be used for assessing the effects of the predictor variables on the responses. Unfortunately, the name regression, culled from the title of the first paper on the subject by F. Galton [13], in no way reflects either the importance or breadth of application of this methodology.
In this chapter, we first discuss the multiple regression model for the prediction of a single response. This model is then generalized to handle the prediction of several dependent variables. Our treatment must be somewhat terse, as a vast literature exists on the subject. (If you are interested in pursuing regression analysis, see the following books, in ascending order of difficulty: Bowerman and O'Connell [5], Neter, Kutner, Nachtsheim, and Wasserman [16], Draper and Smith [11], Seber [19], and Goldberger [14].) Our abbreviated treatment highlights the regression assumptions and their consequences, alternative formulations of the regression model, and the general applicability of regression techniques to seemingly different situations.

7.2 THE CLASSICAL LINEAR REGRESSION MODEL
Let z1, z2, ..., zr be r predictor variables thought to be related to a response variable Y. For example, with r = 4, we might have
Y = current market value of home
and
z1 = square feet of living area
z2 = location (indicator for zone of city)
z3 = appraised value last year
z4 = quality of construction (price per square foot)
The classical linear regression model states that Y is composed of a mean, which depends in a continuous manner on the zi's, and a random error ε, which accounts for measurement error and the effects of other variables not explicitly considered in the model. The values of the predictor variables recorded from the experiment or set by the investigator are treated as fixed. The error (and hence the response) is viewed as a random variable whose behavior is characterized by a set of distributional assumptions.
Specifically, the linear regression model with a single response takes the form

Y = β0 + β1 z1 + ··· + βr zr + ε
[Response] = [mean (depending on z1, z2, ..., zr)] + [error]

The term linear refers to the fact that the mean is a linear function of the unknown parameters β0, β1, ..., βr. The predictor variables may or may not enter the model as first-order terms.
With n independent observations on Y and the associated values of the zi, the complete model becomes

Y1 = β0 + β1 z11 + β2 z12 + ··· + βr z1r + ε1
Y2 = β0 + β1 z21 + β2 z22 + ··· + βr z2r + ε2
⋮
Yn = β0 + β1 zn1 + β2 zn2 + ··· + βr znr + εn          (7-1)

where the error terms are assumed to have the following properties:
1. E(εj) = 0;
2. Var(εj) = σ² (constant); and
3. Cov(εj, εk) = 0, j ≠ k.          (7-2)

In matrix notation, (7-1) becomes

$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & z_{11} & z_{12} & \cdots & z_{1r} \\ 1 & z_{21} & z_{22} & \cdots & z_{2r} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & z_{n1} & z_{n2} & \cdots & z_{nr} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_r \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$
or

$$\underset{(n\times 1)}{\mathbf{Y}} = \underset{(n\times(r+1))}{\mathbf{Z}}\ \underset{((r+1)\times 1)}{\boldsymbol{\beta}} + \underset{(n\times 1)}{\boldsymbol{\varepsilon}}$$

and the specifications in (7-2) become:
1. E(ε) = 0; and
2. Cov(ε) = E(εε') = σ²I.
Note that a one in the first column of the design matrix Z is the multiplier of the constant term β0. It is customary to introduce the artificial variable zj0 = 1, so that
β0 + β1 zj1 + ··· + βr zjr = β0 zj0 + β1 zj1 + ··· + βr zjr
Each column of Z consists of the n values of the corresponding predictor variable, while the jth row of Z contains the values for all predictor variables on the jth trial.

CLASSICAL LINEAR REGRESSION MODEL

$$\underset{(n\times 1)}{\mathbf{Y}} = \underset{(n\times(r+1))}{\mathbf{Z}}\ \underset{((r+1)\times 1)}{\boldsymbol{\beta}} + \underset{(n\times 1)}{\boldsymbol{\varepsilon}}, \qquad E(\boldsymbol{\varepsilon}) = \mathbf{0} \ \text{ and } \ \operatorname{Cov}(\boldsymbol{\varepsilon}) = \sigma^2\,\underset{(n\times n)}{\mathbf{I}} \tag{7-3}$$
where β and σ² are unknown parameters and the design matrix Z has jth row [zj0, zj1, ..., zjr].
Although the error-term assumptions in (7-2) are very modest, we shall later need to add the assumption of joint normality for making confidence statements and testing hypotheses.
We now provide some examples of the linear regression model.

Example 7.1 (Fitting a straight-line regression model)

Determine the linear regression model for fitting a straight line

Mean response = E(Y) = β0 + β1 z1

to the data

z1: 0 1 2 3 4
y:  1 4 3 8 9
Before the responses Y = [Y1, Y2, ..., Y5]' are observed, the errors ε = [ε1, ε2, ..., ε5]' are random, and we can write Y = Zβ + ε, where

$$\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \end{bmatrix}, \quad \mathbf{Z} = \begin{bmatrix} 1 & z_{11} \\ 1 & z_{21} \\ 1 & z_{31} \\ 1 & z_{41} \\ 1 & z_{51} \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \end{bmatrix}$$

The data for this model are contained in the observed response vector y and the design matrix Z, where

$$\mathbf{y} = \begin{bmatrix} 1 \\ 4 \\ 3 \\ 8 \\ 9 \end{bmatrix}, \qquad \mathbf{Z} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}$$

Note that we can handle a quadratic expression for the mean response by introducing the term β2 z2, with z2 = z1². The linear regression model for the jth trial in this latter case is

Yj = β0 + β1 zj1 + β2 zj1² + εj
Example 7.2 (The design matrix for one-way ANOVA as a regression model)

Determine the design matrix if the linear regression model is applied to the one-way ANOVA situation in Example 6.6.
We create so-called dummy variables to handle the three population means: μ1 = μ + τ1, μ2 = μ + τ2, and μ3 = μ + τ3. We set
z1 = 1 if the observation is from population 1, and 0 otherwise;
z2 = 1 if the observation is from population 2, and 0 otherwise;
z3 = 1 if the observation is from population 3, and 0 otherwise;
and write
Yj = μ + τ1 zj1 + τ2 zj2 + τ3 zj3 + εj,   j = 1, 2, ..., 8
where we arrange the observations from the three populations in sequence. Thus, we obtain the observed response vector and design matrix

$$\underset{(8\times 1)}{\mathbf{y}} = \begin{bmatrix} 9 \\ 6 \\ 9 \\ 0 \\ 2 \\ 3 \\ 1 \\ 2 \end{bmatrix}, \qquad \underset{(8\times 4)}{\mathbf{Z}} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix}$$

The construction of dummy variables, as in Example 7.2, allows the whole of analysis of variance to be treated within the multiple linear regression framework.
7.3 LEAST SQUARES ESTIMATION
One of the objectives of regression analysis is to develop an equation that will allow the investigator to predict the response for given values of the predictor variables. Thus, it is necessary to "fit" the model in (7-3) to the observed yj corresponding to the known values 1, zj l , . . . , Zjr · That is, we must determine the values for the regression coefficients f3 and the error variance 0'2 consistent with the available data. Let b be trial values for {3. Consider the difference yj - b0 - b 1 zj l - . . · - b,zj, between the observed response yj and the value b0 + b 1 Zj 1 + · · · + b,zj, that would be expected if b were the "true" parameter vector. Typically, the differences yj - b0 - b 1 Zj - · · · - b, zj , will not be zero, because the response fluctuates (in a manner characterized by the error term assumptions) about its expected value. The method of least squares selects b so as to minimize the sum of the squares of the differences: (7-4) S (b) = 2: (yj - b0 - b 1 zj 1 - . . . - b,zj ,) 2 j= 1 = (y - Zb) (y - Zb) The coefficients b chosen by the least squares criterion are called least squares esti mates of the regression parameters {3. They will henceforth be denoted by p to emphasize their role as estimates of {3. 1
It
I
382
Chap. 7 Multivariate Linear Regression Models
The coefficients PA are consisten1 with Jhe data in theAsense that they produce estimated (fitted) mean responses, {30 + {3 z + + zjr • the sum of whose squares of the differences from the observed y1j isj 1 as small asf3rpossible. The deviations j = 1, 2, . . . , n (7-5) are called residuals. The vector of residuals e = y - zp contains the information about the remaining unknown parameter rr2• (See Result 7.2. ) n. 1 The least squares estimate of Result 7. 1 . Let Z have full rank r + 1 p in (7-3) is given by ···
A
:,;;;
p = ( Z' Z ) - 1 Z' y
Let = z = Hy denote the fitted values of y, where H = Z ( Z' Z) - 1 Z' is called "hat"y matrix.p Then the residuals e=y-
y
= [I - Z ( Z' Z ) -1 Z'] y = ( I - H ) y
satisfy Z' e = 0 and Y' e = 0. Also, the fl
A A
residual sum ofsquares = 2: ( yj - f3 o - {3 1 Zj l =I j
···
A
- f3rZjr ) 2 = e' e
= y' [I - Z ( Z' Z ) - 1 Z'] y = y' y - y' ZP Proof. p = ( Z' Z ) - 1 Z' y e = y - y = y - zp = 1 [I - Z ( Z' Z ) Z'] y. [I - Z ( Z' Z ) - 1 Z']
Let
as asserted. Then satisfies: 1. [I - Z ( Z' Z ) - 1 Z']' = [I - Z ( Z' Z ) - 1 Z'] (symmetric) ; 2. [I - Z ( Z' Z ) -1 Z'] [I - Z ( Z' Z ) - 1 Z'] = I - 2Z ( Z' Z ) - 1 Z' + Z ( Z' Z ) - 1 Z' Z ( Z' Z ) - 1 Z' = [I - Z ( Z' Z )- 1 Z'] (idempotent) ; 3.
The matrix
(7-6)
Z' [I - Z ( Z' Z ) -1 Z' ] = Z' - Z' = 0.
Consequently, Z'e = Z' (y - y) = Z' [I - Z ( Z' Z ) - 1 Z'] y = 0, so Y' e = P ' Z' e = 0. Additionally, e' e = y' [I - Z ( Z' Z ) - 1 Z'] [I - Z ( Z' Z ) -1 Z'] y = y' [I - Z ( Z' Z ) - 1 Z'] y = y' y - y' Zp . To verify the expression for p , we write cise
1
If z is not full rank, (Z' z) - 1 is replaced by (Z' Z ) - , a generalized inverse of Z' Z. (See Exer
7.6.)
Sec.
7.3
Least Squares Estimation
y - Z b = y - z p + z p - Zb = y - z p + z ( p - b )
so
S ( b ) = ( y - Zb ) ' (y - Zb)
383
+ ( P - b) ' Z' Z ( P - b )
(y - ZP ) ' (y - Z P )
+ 2 (y - ZP ) ' Z ( P - b) = (y - z p) ' ( y - z p) + ( P - b) ' Z' Z ( P - b ) since (y - zp) ' Z = e' Z = 0' . The first teqn in S (b) does not depend on b and the second is th� squared length of Z ( f1 - b) . Because Z has full rank, Z ( f1 - b ) * 0 if f1 * b , so the minimum sum of squares is unique and occurs for b = p = ( Z' Z ) - 1 Z'y. Note that ( Z' Z ) - 1 exists since Z'Z has rank + 1 n. (If Z'Z is not of full rank, Z' Za = 0 for some a * 0, but then a' Z' Za = 0 or Za = 0, which contradicts Z having full rank + 1.) Result 7.1 shows how the least squares estimates p and the residuals e can be obtained from the design matrix Z and responses y by simple matrix operations. A
r
r
Exatnple 7.3
(Calculating the least squares estimates, the residuals, and the residual sum of sq uares)
:s;;
•
Calculate the least square estimates /3, the residuals e, and the residual sum of squares for a straight-line model Yi = f3o + {31 zi l + si fit to the data 0 1 2 3 4 :� I 1 4 3 8 9 We have A
Z'
y
1 4 3 8 9
Z'Z
( Z'Z ) - 1
Z'y
384
Chap.
7
M ultivariate Linear Regression Models
Consequently, and the fitted equation is y = 1 + 2z
The vector of fitted (predicted) values is Y
so
=
A
zp
1 0 1 1 = 1 2 1 3 1 4 1
e=y-y=
The residual sum of squares is
4 3 8 9
[�] 1
3 5 7 9
1
3 5 7 9 0 1 -2 1 0
0 1 e' e = [o 1 - 2 1 o] - 2 = 02 + 1 2 + ( - 2) 2 + 1 2 + 02 = 6 1 0
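For readers who want to reproduce Example 7.3 numerically, the following is a small NumPy sketch of the same calculation (our own illustration, not part of the original text).

```python
import numpy as np

z = np.array([0., 1., 2., 3., 4.])
y = np.array([1., 4., 3., 8., 9.])

Z = np.column_stack([np.ones_like(z), z])        # design matrix with a column of ones

ZtZ_inv = np.linalg.inv(Z.T @ Z)                 # [[ .6, -.2], [-.2, .1]]
beta_hat = ZtZ_inv @ Z.T @ y                     # least squares estimates [1., 2.]

y_hat = Z @ beta_hat                             # fitted values [1, 3, 5, 7, 9]
resid = y - y_hat                                # residuals     [0, 1, -2, 1, 0]
rss = resid @ resid                              # residual sum of squares = 6.0

print(beta_hat, y_hat, resid, rss)
```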
•
Sum-of-Squares Decomposition
According to Result y' y = 2: Yl satisfies j= 1 II
7.1,
Y' e
= 0,
so the total response sum of squares
y' y = (y + y - y) ' (y + y - y) = (y + e) ' (y + e) = Y'Y + e' e (7-7) Since the first column of Z is 1, the condition Z' e = 0 includes the requirement 0 = 1' e = 2: ej = 2: yj - 2: yj , or y = y. Subtracting ny 2 = n (.Y ) 2 from both ! =1 j= 1 sides of the jdecomposition inj =(7-7), we obtain the basic decomposition of the sum n
n
of squares about the mean:
n
Sec.
y ' y - ny 2
or
(
=
7.3
Least Squares Estimation
385
n 0 are the eigenvalues of Z'Z and ep e 2 , . . . , e r+ I are the corresponding eigenvectors. If Z is of full rank, · · · ;,:
1 .! _!_ - e + e �+ (Z' Z) - 1 = A e 1 e � + _A _ e 2 e; + · · · + � A r+ I r I I z 1 Z. q; = Aj 1 12 Ze;, 1 ' 2 2 I 2 = 2 I k. I i / 1 / = k i= i 0 / e / A = e� ' A :A k k k q i q k = ' - Il' k- e i Z' Zek
Consider
ll j
of Then which is a linear combinationif of the columns That or if I
I
Sec.
Least Squares Estimation
7.3
387
That is, the r + 1 vectors qᵢ are mutually perpendicular and have unit length. Their linear combinations span the space of all linear combinations of the columns of Z. Moreover,

$$\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}' = \sum_{i=1}^{r+1} \lambda_i^{-1}\,\mathbf{Z}\mathbf{e}_i\mathbf{e}_i'\mathbf{Z}' = \sum_{i=1}^{r+1} \mathbf{q}_i\mathbf{q}_i'$$
;
L i= l
Lqq i= l
According to Result 2A.2 and Definition 2A.12, the projection of y on a linear combination of {q1, q2, ..., qr+1} is Σᵢ(qᵢ'y)qᵢ = Σᵢqᵢqᵢ'y = Z(Z'Z)⁻¹Z'y = Zβ̂. Thus, multiplication by Z(Z'Z)⁻¹Z' projects a vector onto the space spanned by the columns of Z.² Similarly, [I − Z(Z'Z)⁻¹Z'] is the matrix for the projection of y on the plane perpendicular to the plane spanned by the columns of Z.
;
t
Sampling Properties of Classical Least Squares Estimators
The least squares estimator p and the residuals e have the sampling properties detailed in the next result Result 7.2. Under the general linear regression model in (7-3), the least squares estimator {J = (Z' Z) - 1 Z' Y has E ( P ) = {3 and Cov( p ) = a.2 (Z' Z) - 1 The residuals e have the properties E ( e) = 0 and Cov( e) = 0"2 [I - Z (Z' Z)- 1 Z'] 0" 2 [I - H] Also, E (i ' e) (n 1)0'2, so defining e ' i - Y' [I - Z (Z' z) - 1 Z'] Y Y' [I - H] Y s 2 = ---n ( + 1) n 1 n 1 we have - r -
-
r
r -
r -
Moreover, {3 and i are uncorrelated. "
A1
2 If Z is not of full rank, we can use the generalized inverse (Z' z)-
;;;.
Az
;;;.
· · ·
Z (Z' Z ) - z' =
;;;.
A,, + 1
r1 +
1
L
i= l
>
0
=
A,, + 2
=
· · ·
=
A,+ 1 ,
as
described
in
=
+I
L A;- 1 e ; e ; ,
r1
i= 1
Exercise
7.6.
where Then
q;q; has rank r1 + 1 and generates the unique projection of y on the space spanned
by the linearly independent columns of Z. This is true for any choice of the generalized inverse. (See [19].)
388
Chap.
7
M u ltiva riate Linear Reg ression Models
Proof. Before the response Y = Z{J + e is observed, it is a random vector. Now, p = (Z'Z)- 1 Z'Y = (Z'Z) - 1 Z' (Z{J + e) = fJ + (Z'Z) - 1 Z'e e = [I - Z(Z'Z) - 1 Z']Y (7-10) = [I - Z(Z'Z) - 1 Z'] [Z{J + e] = [I - Z(Z'Z) - 1 Z'] e since [I - Z(Z'Z) - 1 Z']Z = Z - Z = 0. From (2-24) and (2-45), E( p ) = fJ + (Z'Z) - 1 Z' E(e) = fJ Cov( p ) = (Z'Z) - 1 Z'Cov(e)Z(Z'Z) - 1 = u2 (Z'Z) - 1 Z'Z(Z'Z) - 1 = u2 (Z'Z) - 1 E(e) = [I - Z(Z'Z) - 1 Z']E(e) = 0 Cov(e) = [I - Z(Z'Z) - 1 Z']Cov(e)[I - Z(Z'Z)- 1 Z']' = u2 [I - Z (Z' Z) - I Z'] where the last equality follows from (7-6). Also, Cov( p, e) = E[( p - {J)e'] = (Z'Z) - 1 Z' E(ee')[I - Z(Z'Z) - 1 Z'] = u2 (Z'Z) - 1 Z'[I - Z(Z'Z) - 1 Z'] = 0 because Z' [I - Z(Z'Z) - 1 Z'] = 0. From (7-10), (7-6), and Result 4. 9 , e ' e = e'[I - Z(Z'Z) - 1 Z'][I - Z(Z'Z) - 1 Z'] e = e' [I - Z(Z'Z) - 1 Z'] e = tr[e' (I - Z(Z'Z)- 1 Z') e] = tr([I - Z(Z'Z)- 1 Z'] ee') Now, for an arbitrary n n random matrix W, E(tr(W)) = E( W11 + W22 + · · · + Wnn ) = E( W11 ) + E( W22 ) + · · · + E( Wnn ) = tr[E(W)] Thus, using Result 2A.12, we obtain X
Sec. 7 . 3
Least Squares Esti mation
389
E(e' e) = tr ( [I - Z (Z' Z)- 1 Z' ] E (ee' ) ) u2 tr [I - Z (Z' Z) - 1 Z' ] u2 tr (I) - u2 tr [Z (Z' Z) - l Z' ] = u2 n - u2 tr[(Z' Z) - 1 Z' Z ] I nu 2 -
= =
u2 tr [ (r+ l) X (r+ l ) J
=
= u2 (n - r - 1) and the result for s2 = e ' e/(n r - 1) follows.
•
-
The least squares estimator [3 possesses a minimum variance property that was first established by Gauss. The following result concerns "best" estimators of linear parametric functions of the form c' f3 = c0{30 + c1 {31 + + c, {3, for any c. Result 7.3 (Gauss' 3 least squares theorem). Let Y = Z/3 + e, where E (e) = 0, Cov (e) = u2 I, and Z has full rank + 1. For any c, the estimator c' f3 = cof3o + c1 {3 1 + . . . + c,{3, of c' f3 has the smallest possible variance among all linear estimator of the form ···
r
A
A
A
A
that are unbiased for c' {3. Proof. For any fixed c, let a'Y be any unbiased estimator of c' {3. Then c' {3, whatever the value of {3. Also, by assumption, E(a'Y) E (a' Y) = E (a'Z/3 + a' e) = a'Z/3. Equating the two expected value expressions yields a'Z/3 = c' f3 or ( c' - a'Z) f3 = 0 for all {3, including the choice f3 = (c' - a'Z) ' . This implies that c' = a' Z for any unbiased estimator. Now, c' [3 = c' (Z' Z)-1 Z' Y = a*' Y with a* = Z (Z' z)- 1 c. Moreover, from Result 7.2 E ( [3 ) = {3, so c' [3 = a*' Y is an unbiased estimator of c' {3. Thus, for any a satisfying the unbiased requirement c' = a'Z, Var (a' Y) = Var (a' Z/3 + a' e) = Var (a' e) = a'Iu 2 a = =
u2 (a - a* + a*) ' (a - a* + a*) u2 [(a - a*) ' (a - a*) + a * ' a *]
3 Much later, Markov proved a less general result, which misled many writers into attaching his name to this theorem.
390
Chap.
7
M ultivariate Linear Regression Models
since (a - a*)'a* = (a - a*)'Z(Z'Z) - 1 c = O from the condition (a - a*)'Z = a'Z - a*'Z = c' - c' = O'. Because a* isfixed and (a - a*)' (a - a*) ispositi�e unless a = a*, Var(a'Y) is minimized by the choice a*'Y = c' (Z' z) - 1 Z' Y = c' {J. •
This powerful result states that substitution of [3 for {J leads to the best esti mator of c' {J for any c of interest. In statistical terminology, the estimator c' {J is called the best (minimum-variance) linear unbiased estimator (BLUE) of c' {J. 7.4 I N FERENCES ABOUT THE REGRESSION MODEL
We describe inferential procedures based on the classical linear regression model in (7-3) with the additional (tentative) assumption that the errors e have a normal distribution. Methods for checking the general adequacy of the model are consid ered in Section 7.6. I nferences concerning the Regression Parameters
Before we can assess the importance of particular variables in the regression function (7-11)
we must determine the sampling distributions of {J and the residual sum of squares, To do we shall assume that the errors e have a normal distribution. Result 7.4. Let Y = Z{J + e, where Z has full rank r + 1 and e is distrib uted as Nn (0, u2 1) . Then,.. the maximum likelihood estimator of {J is the same as the least squares estimator {J. Moreover, is distributed as and is distributed independently of the residuals = Y - zp . Further, is distributed as where a2 is the maximum likelihood estimator of u2• Proof. Given the data and the normal assumption for the errors, the likeli hood function for {J, u2 is n = jII= ! yl2;1 e - ej 2u e I e.
"
SO,
e
(]'
'/
'
=
Sec.
7.4
Inferences About the Regression Model
391
For a fixed value a2, the likelihood is maximized by minimizing (y - Z/3) ' (y - ZfJ). But this minimization yields the least squares estimate j1 = (Z' Z) - 1 Z'y, which does not depend upon a2 • Therefore, under the normal assumption, the maximum likelihood al:)d least squares approaches provide the same estimator fJ.A Next, maximizing L ( fJ, a2 ) over a2 [see (4-18)] gives A (y - ZP ) ' (y - ZP ) (7-12) L ( fJ, a�z ) - (27r /21( £T z ) " /2 - n /2 wh ere a�z n t From (7-10), we can express [1 and i as linear combinations of the normal vari ables e. Specifically, _
_
e
[-�] � [/��z\�.��ii�· �;] � [-�] k=(�(i.� ��;z; ] • � a Because is fixed, Result implies the joint normality of and Their mean vec tors and covariance matrices were obtained in Result Again, using we get eov ( [-� ]) � ov � o-+( z :)��f -1 - :: -zc�; z)-:;z.-j + Ae
+
Z
4.3
[1
7.2.
i.
(7-6),
A c ( e) A '
Since Cov ( [1, i ) = 0 for the normal random vectors p and i, these vectors are independent. (See Result 4.5.) Next, let (A, e) be any-eigenvalue-eigenvector pair for I - Z (Z' Z) - 1 Z'. 1 Then, by (7-6), [I - Z (Z' Z) 1 Z'f = [I - Z (Z' Z) - Z'] so
Ae = [I - Z (Z' Z) - 1 Z']e = [I - Z (Z' Z) - 1 Z'fe = A [I - Z (Z' Z) - 1 Z']e = A 2 e That is, A = 0 or 1. Now, tr [I - Z (Z' Z) - 1 Z'] = -n1 - r - 1 (see the proof of Result 7.2), and from Result 4.9, tr [I - Z (Z' Z) Z'] = A + A + · · · + A11, where A 1 ;;;. A z ;;;. · · · ;;;. A11 are the eigenvalues of [I - Z (Z'1 Z) - z1 Z'.] Conse quently, exactly n - r - 1 values of Ai equal one, and the rest are zero. It then
follows from the spectral decomposition that
(7-13)
where e , e , . . . , e r are the normalized eigenvectors associated with the eigen values A11 =2 A z = n - -=I An - r - I = 1. Let ···
V=
----------·
e n, - r - 1
e
392
Chap.
7
Multivariate Linear Regression Models
Then V is normal with mean vector 0 and
That is, the V; are independent N(O, u 2 ) and by (7-10),
n u 2 = e' e = e ' [l - Z ( Z' Z ) -1 Z'] e = V12 + V22 + . . .
lS. d'lStn'b ute d lT 2 Xn2 - r - 1 '
+ Vn2 r - 1 -
•
A confidence ellipsoid for P is easily constructed. It is expressed in terms of the estimated covariance matrix s 2 ( Z' Z )- 1 , where s 2 = e ' e/(n - r - 1 ) .
=
Resu lt 7.5. Let Y zp + e, where Z has full rank r + 1 and N11 (0, u21 ) . Then a 100 (1 - a)% confidence region for p is given by
e is
( fJ - P ) ' Z' Z ( /J - P ) :;;;; (r + 1 ) s 2 Fr + 1 , n - r - 1 (a)
where F,+ 1 , 11 _ ,_ 1 (a) is the upper (100a)th percentile of an F-distribution with
r + 1 and n - r - 1 d.f. Also, simultaneous 100 (1 - a)% confidence intervals for the /3; are given by � i ± VVar ( �; ) V (r + 1 ) Fr + 1, n - r- 1 (a) , i = 0, 1, . . . , r
where Ya; (�;) is the diagonal element of s 2 ( Z' Z ) -1 corresponding to � ; ·
Consip er the symmetric square-root matrix ( Z' Z ) 1 12 • [See (2-22) .] ( Z' Z ) 1 12 ( /J - /J ) and note that E (V) = 0,
Proof.
Set V
=
Cov (V)
= ( Z' Z ) 1 12Cov ( p ) ( Z' Z ) 1 12 = u2 ( Z' Z ) 1 12 ( Z' Z ) -1 ( Z' Z ) 1 12 = u2 1
and V is normally distributed, since it consists of linear combinations of the � ; s. Therefore, V' V = (p - /J ) ' ( Z' Z ) 1 12 ( Z' Z ) 1 12 (p - p) = (p - /J ) ' ( Z' Z ) (p - p) is distributed as u 2 x;+ t · By Result 7.4 (n - r - 1 ) s 2 = e' e is distributed as u 2 x�_ , _ 1 , independently of p and, hence, independently of V. Consequently, [ X;+ d (r + 1) ]/[x� - r - tf ( n - r - 1 ) ] = [ V ' V/ (r + 1) ] /s 2 has an F,+ 1 , n - r - 1 dis tribution, and the confidence ellipsoid for p follows. Projecting this ellipsoid for ( P - /J ) using Result 5A.1 with A -I = Z' Z/s 2 , c 2 = ( r + 1 ) F,+ 1 , 11 _ ,_ 1 (a), and u = [0, . . . , 0, 1, 0, . . . , O ] ' yields l /3; - �; I :;;;; V(r + 1 ) F,+ l n - r - 1 ( a) V \la;(�; ) , II where \la;(�;) is the diagonal element of s 2 ( Z' Z ) - 1 corresponding to � ; · The confidence ellipsoid is centered at the maximum likelihood estimate p, and its orientation and size are determined by the eigenvalues and eigenvectors of Z'Z. If an eigenvalue is nearly zero, the confidence ellipsoid will be very long in the direction of the corresponding eigenvector. '
A
Sec.
7.4
Inferences About the Regression Model
393
Practitioners often ignore the "simultaneous" confidence property of the interval estimates in Result 7.5. Instead, they replace ( r + 1 ) F, + I , n - r - 1 (a) with the one-at-a-time t value t11 _ , _ 1 (a/2) and use the intervals
(7-14) when searching for important predictor variables. Example 7.4 (Fitting a regression model to real-estate data)
The assessment data in Table 7.1 were gathered from 20 homes in a Milwau kee, Wisconsin, neighborhood. Fit the regression model Yi = f3o + /3 1 Zi l + /3zZjz + ei where z 1 = total dwelling size (in hundreds of square feet), z 2 = assessed value (in thousands of dollars), and Y = selling price (in thousands of dollars), to these data using the method of least squares. A computer calculation yields TABLE 7. 1
REAL-ESTATE DATA
zl Total dwelling size (100 ft 2)
15.31 15.20 16.25 14.33 14.57 17.33 14.48 14.91 15.25 13.89 15.18 14.44 14.87 18.63 15.20 25.76 19.05 15.37 18.06 16.35
Zz Assessed value
($1000) 57.3 63.8 65.4 57.0 63.8 63.2 60.2 57.7 56.4 55.6 62.6 63.4 60.2 67.2 57.1 89.6 68.6 60.1 66.3 65.8
y Selling price
($1000) 74.8 74.0 72.9 70.0 74.9 76.0 72.0 73.5 74.5 73.5 71.5 71.0 78.9 86.5 68.0 102.0 84.0 69.0 88.0 76.0
394
Chap.
7
[
Multivariate Linear Regression Models
5.1523 (Z' Z)-1 = .2544 .0512 - .1463 - .0172 .0067 and
p
= (Z' Z)- 1 Z'y =
[ ]
]
30.967 2.634 .045
Thus, the fitted equation is
y = 30.967 + 2.634z 1 + .045 z 2
(7 .88)
(.785)
(.285)
with s = 3.473. The numbers in parentheses are the estimated standard devi ations of the least squares coefficients. Also, R 2 = .834, indicating that the data exhibit a strong regression relationship. ( See Panel 7.1 on page 395, which contains the regression analysis of these data using the SAS statistical software package. ) If the residuals e pass the diagnostic checks described in Section 7.6, the fitted equation could be used to predict the selling price of another house in the neighborhood from its size and assessed value. We note that a 95% confidence interval for {3 2 [ see (7-14)] is given by
�2
±
t1 7 (. o25) 'V Var'< � 2 ) = . o45
±
2.1l o (.285)
or
(- .556, .647) Since the confidence interval includes {3 2 = 0, the variable z 2 might be
dropped from the regression model and the analysis repeated with the single predictor variable z 1 • Given dwelling size, assessed value seems to add little • to the prediction of selling price. Likelihood Ratio Tests for the Regression Parameters
Part of regression analysis is concerned with assessing the effects of particular pre dictor variables on the response variable. One null hypothesis of interest states that certain of the z; ' s do not influence the response Y. These predictors will be labeled Zq + l • Zq + 2 , . . . , z,. The statement that Zq + l • Zq + 2 , . . . , z, do not influence Y trans lates into the statistical hypothesis
f3q + 2 = . . . = {3, = 0 or H0 : fJ (2) = where /J(2) = [ f3q + l • /3q + 2 • . . . , /3, ] ' . H0 : f3q + J =
0
(7-15)
Sec. 7.4 Inferences About the Regression Model PAN EL 7. 1
395
SAS ANALYSI S FOR EXAM PLE 7.4 U S I N G P ROC REG .
title ' Regression Ana lysis'; d ata estate; i nfi l e 'D-1 .dat'; i n put z 1 z2 y; proc reg d ata estate; model y z 1 z2;
PROGRAM COMMANDS
=
=
Model: M O D E L 1 Dependent Variable: Y
OUTPUT
Ana lysis of Variance S o u rce Model E rro r C Tota l
Sum of Sq u a res 1 032 .87506 204.99494 1 237 .87000
DF 2 17 19
·.
Deep Mean c.v.
Mean Square 5 1 6. 43753 1 2.05853
F va l ue 42.828
j R-sq uare ,
�.47254 1
76.55000 4. 53630
Adj R-sq
Prob > F 0.0001
0.8 1 49
Pa ra m eter Est i mates Vari a b l e I NTERCEP z1 z2
DF
.-P -ara-:m -e..., te ..r" .,·
Estimate . �Q�9��fi66' 2;634400
·o:o4S184
Setting
z=
[
1
' 's1:a"ndard
. en'fir
T fo r HO: Para m eter 0 3.929 3.353 0. 1 58 =
·
'7�88220844 '
o:2861 a21.1 .
o17Ss59a72
Z1 ! Z n X ( q + l ) i n X ( r -z q )
]'
•.
fJ
=
l ] ((q + 1) X 1 ) fJ fJ (2 ) ((r - q)(l)x 1 )
-----------
+ e = Z 1 fJ(ll + Z 2 fJ(2) + e [-�pwJ ( 2) j
we can express the general linear model as
Y = ZfJ + e = [Z 1 j Z 2 ]
Prob > I T I 0.00 1 1 0.0038 0.8760
Under the null hypothesis H0: β(2) = 0, Y = Z1β(1) + ε. The likelihood ratio test of H0 is based on the
396
Chap.
7
Multivariate Linear Regression Models
Extra sum of squares
=
SS res (Zd
-
SS res (Z)
(7-16)
Result 7.6. Let Z have full rank r + 1 and ε be distributed as Nn(0, σ²I). The likelihood ratio test of H0: β(2) = 0 is equivalent to a test of H0 based on the extra sum of squares in (7-16) and s² = (y − Zβ̂)'(y − Zβ̂)/(n − r − 1). In particular, the likelihood ratio test rejects H0 if

$$\frac{\bigl(SS_{\text{res}}(\mathbf{Z}_1) - SS_{\text{res}}(\mathbf{Z})\bigr)/(r-q)}{s^2} > F_{r-q,\,n-r-1}(\alpha)$$
where Fr - q, n - r - l (a) is the upper (100a)th percentile of an F-distribution with r q and n - r - 1 d.f. -
Proof.
Given the data and the normal assumption, the likelihood associated with the parameters f3 and u2 is
with the maximum occurring at jJ = (Z' Z)- 1 Z' y and u 2 = (y - ZP ) ' (y - ZP )/n. Under the restriction of the null hypothesis, = Z1/3 ( 1 ) + e and
Y
max L ( /3( 1 ) • u2 )
fJ(J )• u'
where the maximum occurs at /J(l)
Rejecting H0 : /3 (2)
=
1
e ( 27T)n f2 u� ln
- n/2
= (Z� Z1 ) - 1 Z1 y. Moreover,
= 0 for small values of the likelihood ratio
Sec.
7.4
Inferences About the Regression Model
397
is equivalent to rejecting H0 for large values of ( 8r - 8 2 ) I 82 or its scaled version,
n (8r - 82 )/(r - q ) n8 2/(n - r - 1 ) The F-ratio above has an F-distribution with r Result 7.1 1 with m = 1.)
-
q and n - r - 1 d.f. (See [18] or •
Comment. The likelihood ratio test is implemented as follows. To test whether all coefficients in a subset are zero, fit the model with and without the terms corresponding to these coefficients. The improvement in the residual sum of squares (the extra sum of squares) is compared to the residual sum of squares for the full model via the F-ratio. The same procedure applies even in analysis of vari ance situations where Z is not of full rank. 4 More generally, it is possible to formulate null hypotheses concerning r - q linear combinations of fJ of the form H0 : CfJ = A0• Let the ( r - q ) X ( r + 1 ) matrix C have full rank, let A0 0, and consider =
H0 : CfJ =
0
( This null hypothesis reduces to the previous choice when C = [o ! (r - q} XI (r - q) ].)
Under the full model, cp is distributed as N, _ q (CfJ, u 2 C (Z' Z)- 1 C' ) . We reject H0 : CfJ = 0 at level a if 0 does not lie in the 100 (1 - a)% confidence ellipsoid for CfJ. Equivalently, we reject H0 : CfJ = 0 if o
(7-17)
where s 2 = (y - ZP ) ' (y - ZP )/(n - r - 1) and Fr - q , n - r - l (a) is the upper (100a)th percentile of an F-distribution with r - q and n - r - 1 d.f. The test in (7-17) is the likelihood ratio test, and the numerator in the F-ratio is the extra residual sum of squares incurred by fitting the model, subject to the restriction that CfJ = 0. (See [21 ]). 4 1n situations where Z is not of full rank, rank(Z) replaces r + 1 and rank(Z 1 ) replaces q in Result 7.6.
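As a rough illustration of the extra sum-of-squares procedure just described, the following Python sketch compares a full and a reduced design matrix (our own illustration; the function name and structure are ours). For designs that are not of full rank, the residual degrees of freedom should be based on the rank of Z rather than its number of columns.

```python
import numpy as np
from scipy import stats

def extra_ss_f_test(Z_full, Z_reduced, y):
    """F test of H0: the coefficients dropped from the full model are zero,
    based on the extra sum of squares SSres(Z1) - SSres(Z)."""
    def ss_res(Z):
        beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta_hat
        return resid @ resid

    n = len(y)
    df_full = n - Z_full.shape[1]                  # n - r - 1 when Z_full has full rank
    df_extra = Z_full.shape[1] - Z_reduced.shape[1]
    s2 = ss_res(Z_full) / df_full
    F = (ss_res(Z_reduced) - ss_res(Z_full)) / df_extra / s2
    p_value = stats.f.sf(F, df_extra, df_full)
    return F, p_value
```

Fitting the model with and without the terms in question and comparing the residual sums of squares in this way is exactly the recipe given in the Comment above.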
+ 1
398
Chap. 7 Multivariate Linear Regression Models
The next example illustrates how unbalanced experimental designs are easily handled by the general theory described just described. Example 7.5
(Testing the importance of additional predictors using the extra sum-of-squares approach)
Male and female patrons rated the service in three establishments (locations) of a large restaurant chain. The service ratings were converted into an index. Table contains the data for n = customers. Each data point in the table is categorized according to location or 3) and gender (male = and female = This categorization has the format of a two-way table with unequal numbers of observations per cell. For instance, the combination of location and male has responses, while the combination of location and female has responses. Introducing three dummy variables to account for location and two dummy variables to account for gender, we can develop a regression model linking the service index Y to location, gender, and their "interaction" using the design matrix
7.2
18
1).
1
(1, 2,
0
5
2
2
TABLE 7.2 RESTAU RANT-SERVIC E DATA
Location
Gender
Service (Y)
1 1 1 1 1 1 1 2 2 2 2 2 2 2
0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 1 1
15. 2 21.2 27.3 21.2 21.2 36.4 92.4 27.3 15.2 9.1 18.2 50.0 44.0 63.6 15.2 30.3 36.4 40. 9
3 3
3 3
Sec. 7.4 Inferences About the Regression Model
constant
Z=
1 1 1 1 1 1 1 1 1 1 1 1 1
1
1 1 1 1
location
,.-A----,
1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1
1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
interaction
gender ,----"---..
1 1 1 1 1 0 0 1 1
1 1 1 0 0 1 1 0 0
0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 1 1
1 1 1 1 1 0 0 0
p
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0
The coefficient vector can be set out as
0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
1
399
5 responses
} 2 responses
1
5 responses
} 2 responses } 2 responses } 2 responses
fJ = [ ,Bo , .B t , .Bz , ,83 , T1 ' Tz , 'Yu ' 'Yt z • 'Y2 1 ' 'Yzz , 'Y3 1 , 'Y3 2 ] ' where the ,8/ s ( i > 0) represent the effects of the locations on the determi-
nation of service, the T/ s represent the effects of gender on the service index, and the 'Y;k ' s repr�sent the location-gender interaction effects. The design matrix Z is not of full rank. (For instance, column 1 equals the sum of columns 2-4 or columns 5-6.) In fact, rank (Z) = 6. For the complete model, results from a computer program give
SSres (Z)
=
= 2977.4
and n - rank (Z) = 18 - 6 12. The model without the interaction terms has the design matrix Z 1 consisting of the first six columns of Z. We find that with n = y3 2
ssres (Z I ) = 3419.1
- rank (Z 1 ) = 18 - 4 = 14. To test H0: y1 1 = y1 2 = y2 1 = y2 2 = y3 1 = 0 (no location-gender interaction), we compute
400
Chap. 7 Multivariate Linear Regression Models
F
( Z ) )/( 6 - 4) = ( SSres ( Z I ) - SSres sz (3419.1 - 2977.4)/2 = . 89 2977.4/12
( SSres ( Z ) - ssres ( Z ) ) /2 SSres ( Z )/12 I
The F-ratio may be compared with an appropriate percentage point of an F-distribution with 2 and 12 d.f. This F-ratio is not significant for any rea sonable significance level a. Consequently, we conclude that the service index does not depend upon any location-gender interaction, and these terms can be dropped from the model. Using the extra sum-of-squares approach, we may verify that there is no difference between locations (no location effect), but that gender is signifi cant; that is, males and females do not give the same ratings to service. In analysis-of-variance situations where the cell counts are unequal, the variation in the response attributable to different predictor variables and their interactions cannot usually be separated into independent amounts. To eval uate the relative influences of the predictors on the response in this case, it is necessary to fit the model with and without the teqns in question and com • pute the appropriate F-test statistics. 7.5 I NFERENCES FROM THE ESTIMATED REGRESSION FUNCTION
Once an investigator is satisfied with the fitted regression model, it can be used to solve two prediction problems. �et,..zo [1, z0 1 , , z0,] ' be selected values for the predictor variables. Then z 0 and fJ can be psed ( 1 ) to estimate the regression function {30 + {3 1 z01 + · · · + {3,z0, at z 0 and ' ( 2) t� estimate the value of the response Y at z 0•
=
• • .
Estimating the Regression Function at z 0
Let Y0 denote the value of the response when the predictor variables have values [1, z0 1 , . . . , z0,] '. According to the model in (7-3 ) , the expected value of Y0 is
z0
=
E ( Yo l zo )
= f3o + {31 Zo1 +
Its least squares estimate is z � p .
···
+ f3,Zor
= z�{J
(7-18)
Result 7.7. For the lip.ear regression model in ( 7-3 ), z � P is the unbiased lin
ear estimator of E(Y0 I z 0 ) with minimum variance, Var (z� P ) = z � ( Z' Z ) - 1 z 0 u 2 • If the errors e are normally distributed, then a 100 ( 1 - a) % confidence interval for E ( Y0 I z 0 ) z � {J is provided by
=
Sec.
Inferences from the Estimated Regression Function
7.5
401
where t, _ 1 ( a /2 ) is the upper 100 ( a/2)th percentile of a t-distribution with n - r - 1 d.f. r-
Proof. For a fixed z 0, z� fJ is just a linear combination of the {3; ' s, so Result 7.3 applies. Also, Var (z�p ) z� Cov ( P )z 0 z� ( Z' Z ) - 1 z0 a 2 since (P 1 a 2 ( Z' Z ) by Result 7.2. Under the further assumption that e is normally distrib uted, Result 7.4 asserts that P is N,+ 1 ( fJ, a 2 ( Z' Z ) - 1 ) independently of s 2 / a 2 , which is distributed as x� _ 1 / (n - r - 1 ) . Consequently, the linear combination z�P is N (z�fJ, a2 z� ( Z' Z )- 1 z 0) and (z� [J - z�fJ) / Ya 2 z� ( Z' z )- 1 z 0 _ (z� [J - z� fJ) - Y2 s ( z� ( Z' Z ) - 1 z 0 ) yr;z;;;z
=
Cov ) =
=
r-
•
is t, _ ,_ 1 • The confidence interval follows.
Forecasting a New Observation at z0
=
Prediction of a new observation, such as Y0, at z 0 [1, z0 1 , . . . , z0,] ' is more uncer tain than estimating the expected value of Y0• According to the regression model of (7-3), or (new response Y0)
= (expected value of Y0 at z0 ) + (new error)
where s0 is distributed as N ( 0, a2 ) and is �ndependent of e and, hence, of p and s 2 • The errors e influence the estimators fJ and s 2 through the responses but s0 does not.
Y,
Result 7.8.
has the
Given the linear regression model of (7-3 ) , a new observation Y0
unbiased predictor
z�p
= ffio + ffi t Zot + · · · + ffi,zo,
The variance of the forecast error Y0 - z� {J is
Var (Y0 - z�P )
= a2 ( 1 + z� ( Z' Z ) - 1 z0 )
When the errors e have a normal distribution, a 100 ( 1 Y0 is given by
- a)% prediction interval for
402
Chap.
7
Multivariate Linear Regression Models
where tn - r - l (a/2) is the upper 100(a/2)th percentile of !-distribution with n - - 1 degrees of freedom. Proof. We forecast Y0 by z� p , which estimates E(Y0 Iz 0 ). By Result 7. 7 , z� p has E(z� P ) = z� /3 and Var(z� P ) = z�(Z'Z) - 1 z0 u2. The forecast error is then Y0 - z� p = z� /3 + e0 - z� p = e0 + z�( /3 - p ). Thus, E(Y0 - z� P ) E ( e0 ) + E (z� ( f3 - [3 ) ) = 0 so the predictor is unbiased.2 Since e0 and1 [3 2are independent, Var(Y1 0 - z� p ) = Var(e0 ) + Var(z� p ) = u + z�(Z'Z) - z0 u = 2u (1 + z�(Z'Z) - z0 ). If it is further assumed that e has a normal distribution, then [3 is normally distributed, and so is the linear combination Y0 - z� p . Con sequently, (Y0 - z� p )lVu2 (1 + z�(Z' z) - 1 z 0 ) is distributed as N (O, 1). Dividing this ratio by v;z;;;.z, which is distributed as v'x?z_,_ 1j(n - - 1), we obtain fl
r
r
•
which is tn - r - i · The prediction interval follows immediately. The prediction interval for Y0 is wider than the confidence interval for esti mating the value of the regression function E ( Y0 I z 0 ) = z�f3. The2 additional uncer tainty in forecasting Y , which is represented by the extra term s in the expression 0 s2 (1 + z�(Z'Z) - 1 z0 ), comes from the presence of the unknown error term e0 • Example 7.6 (I nterval estimates for a mean response and a future response)
Companies considering the purchase of a computer must first assess their future needs in order to determine the proper equipment. A computer scien tist collected data from seven similar company sites so that a forecast equa tion of computer-hardware requirements for inventory management could be developed. The data are given in Table 7.3 for z1 = customer orders (in thousands) z2 = add-delete item count (in thousands) Y = CPU (central processing unit) time (in hours) for the . mean CPU a time,% Construct a 95% confidence interval = [1, 130, 7.5] ' Also, find 95 (Y0 I z0 ) = {30 + {3for1 z0 1a +new{32 z0facility Eprediction 2 at 'zs0 CPU requireme�t corresponding to interval the same z 0 .
Sec.
7.5
TAB LE 7.3
(
z
l Orders ) 123.5 146.1 133.9 128.5 151.5 136.2 92.0
Inferences from the Estimated Regression Function
403
COM PUTER DATA
(
Zz Add-delete items) 2.108 9.213 1 .905 .815 1.061 8.603 1 .125
(
y
CPU time)
Source: Data taken from H. P. Artis, away, NJ: Bell Laboratories, 1979).
141.5 168.9 154.8 146.5 172.8 160.1 108.5
Forecasting Com puter Requirements: A Forecaster's Dilemma
(Piscat
A computer program provides the estimated regression function
[
y = 8.42 + 1.08z + .42z2
(Z' Z) - 1 =
1
8.17969 - .06411 .00052 .08831 - .00107
and s = 1 .204. Consequently,
z ;J1 = 8.42 + 1.08 (130) + .42 (7.5 ) = 151.97
and s Yzb (Z' Z)- 1 z0 = 1 .204 (.58928) = .71 . We have t4 (.025) = 2.776, so the 95% confidence interval for the mean CPU time at z0 is or (150.00, 153.94).� Since s Y�1 -zb,.(-. Z-'Z_)_ = (1.204) (1 .16071) = 1.40, a 95% predic tion interval for the+ CPU time_-:-1z-at0 a new facility with conditions z0 is or (148.08, 155.86).
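The interval formulas in Results 7.7 and 7.8 translate directly into a short computation. The sketch below is our own illustration (names and structure are ours, not the book's); it returns the estimated mean response at z0 together with the confidence interval for E(Y0 | z0) and the prediction interval for a new Y0, and is demonstrated on the straight-line data of Example 7.3.

```python
import numpy as np
from scipy import stats

def estimate_and_forecast(Z, y, z0, alpha=0.05):
    """Point estimate z0'beta_hat, a 100(1-alpha)% confidence interval for the mean
    response at z0, and a 100(1-alpha)% prediction interval for a new response at z0."""
    n, rp1 = Z.shape                                   # rp1 = r + 1
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    beta_hat = ZtZ_inv @ Z.T @ y
    resid = y - Z @ beta_hat
    s2 = resid @ resid / (n - rp1)                     # s^2 = residual SS / (n - r - 1)
    t = stats.t.ppf(1 - alpha / 2, df=n - rp1)
    center = z0 @ beta_hat
    quad = z0 @ ZtZ_inv @ z0                           # z0'(Z'Z)^{-1} z0
    half_mean = t * np.sqrt(s2 * quad)                 # half-width for E(Y0 | z0)
    half_new = t * np.sqrt(s2 * (1 + quad))            # half-width for a new Y0
    return center, (center - half_mean, center + half_mean), (center - half_new, center + half_new)

# Illustration with the straight-line data of Example 7.3, predicting at z = 2.5
Z = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1., 4., 3., 8., 9.])
print(estimate_and_forecast(Z, y, z0=np.array([1., 2.5])))
```

The extra "1 +" inside the prediction-interval term is the additional uncertainty contributed by the new error ε0, which is why the prediction interval is always wider than the confidence interval for the mean response.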
•
404
Chap.
7
M ultivariate Linear Regression Models
7.6 MODEL CHECKING AND OTHER ASPECTS OF REGRESSION Does the Model Fit?
Assuming that the model is "correct," we have used the estimated regression func tion to make inferences. Of course, it is imperative to examine the adequacy of the model before the estimated function becomes a permanent part of the decision making apparatus. All the sample information on lack of fit is contained in the residuals e l = Y I - ffi o - ffi l z 1 1 e z = Yz - ffi o - ffi 1 z 2 1
···
···
- /3,Zir - /3,Z z r A
A
or i
= [I - Z (Z' Z) - 1 Z'] y = [I - H] y
(7-19)
If the model is valid, each residual ej is an estimate of the error ej , which is2 assumed to be a normal random variable with mean zero and variance u • Although the residuals i have expected value 0, their covariance matrix 1 2 u [I - Z (Z' Z) Z'] = u2 [I - H] is not diagonal. Residuals have unequal vari ances and nonzero correlations. Fortunately, the correlations are often small and the variances are nearly equal. Because the residuals i have covariance matrix u2 [I - H], the variances of the ej can vary greatly if the diagonal elements of H, the leverages hjj • are sub stantially different. Consequently, many statisticians prefer graphical diagnostics based2 on studentized residuals. Using the residual mean square s 2 as an estimate of u , we have (7-20) j = 1, 2, . . . , n and the studentized residuals are (7-21) Ys 2 (1 - hjj ) ' j = 1 , 2, . . . , n We expect the studentized residuals to look, approximately, like independent draw ings from an N(O, 1) distribution. Some software packages2 go one step further and studentize e using the delete-one estimated variance s (j), which is the residual mean squarej when the jth observation is dropped from the analysis. Residuals should be plotted in various ways to detect possible anomalies. For general diagnostic purposes, the following are useful graphs:
Sec. 1.
7.6
Model Checking and Other Aspects of Regression
405
Plot tfte residuals ej against the predicted values Yj = �0 + � 1 Zj 1 + · · · + {3, zjr · Departures from the assumptions of the model are typically indi cated by two types of phenomena: (a) A dependence of the residuals on the predicted value. This is illustrated in
Figure 7.2(a). The numerical calculations are incorrect, or a {30 term has been omitted from the model. (b) The variance is not constant. The pattern of residuals may be funnel shaped, as in Figure 7.2(b) so that there is large variability for large y and small variability for small y. If this is the case, the variance of the error is not constant, and transformations or a weighted least squares approach (or both) are required. (See Exercise 7. 3 . ) In Figure 7. 2 (d), the residuals form a horizontal band. This is ideal and indicates equal variances and no dependence on y. 2. Plot the residuals ej against a predictor variable, such as z 1 , or products ofpre dictor variables, such as zi or z z • A systematic pattern in these plots sug gests the need for more terms 1in2 the model. This situation is illustrated in Figure 7.2(c). 3. Q-Q plots and histograms. Do the errors appear to be normally distributed? To answer this question, the residuals e or e/ can be examined using the tech niques discussed in Section 4.6. The Q-Qj plots, histograms, and dot diagrams help to detect the presence of unusual observations or severe departures from ,
(b )
(c)
Figure 7.2
Residual plots.
406
Chap.
7
Multivariate Linear Regression Models
normality that may require special attention in the analysis. If n is large, minor departures from normality will not greatly affect inferences about {3. 4. Plot the residuals versus time. The assumption of independence is crucial, but hard to check. If the data are naturally chronological, a plot of the residuals versus time may reveal a systematic pattern. (A plot of the positions of the residuals in space may also reveal associations among the errors.) For instance, residuals that increase over time indicate a strong positive depen dence. A statistical test of independence can be constructed from the first autocorrelation, II
2: ej ej - l
j=2
___
(7-22)
of residuals from adjacent periods. A popular test based on the statistic Σⱼ₌₂ⁿ(ε̂j − ε̂j−1)² / Σⱼ₌₁ⁿ ε̂j² ≈ 2(1 − r1) is called the Durbin-Watson test. (See [12] for a description of this test and tables of critical values.)
Example 7.7
(Residual plots)
Three residual plots for the computer data discussed in Example 7.6 are shown in Figure 7.3 on page; 407. The sample size n = 7 is really too small to allow definitive judgments however, it appears as if the regression assump tions are tenable. If several observations of the response are available for the same values of the predictor variables, then a formal test for lack of fit can be carried out. (See [11] for a discussion of the pure-error lack-of-fit test. ) •
Leverage and Influence
Although a residual analysis is useful in assessing the fit of a model, departures from the regression model are often hidden by the fitting process. For example, there may be "outliers" in either the response or explanatory variables that can have a considerable effect on the analysis, yet are not easily detected from an examination of residual plots. In fact, these outliers may determine the fit. The leverage hjj is associated with the jth data point and measures, in the space of the explanatory variables, how far the jth observation is from the other n - 1 observations. For simple linear regression with one explanatory variable z,
Sec.
Model Checking and Other Aspects of Regression
7.6
407
(b )
(a)
(c) Figure 7.3
Residual plots for the computer data of Example 7.6.
h". .
= -n1 +
)2 The average leverage is ( + 1)/n. (See Exercise 7. 8 . ) For a data point with high leverage, h approaches 1 and the prediction at z is almost solely determined by yi , the rest ofii the data having little to say about thei matter. This follows because (change in yi ) = hii (change in yi ) , provided that other y values remain fixed. Observations that significantly affect inferences drawn from the data are said to be influential. Methods for assessing influence are typically based on the change in the vector of parameter estimates, {J, when observations are deleted. Plots based upon leverage and influence statistics and their use in diagnostic check ing of regression models are described in [2], [4], and [9]. These references are rec ommended for anyone involved in an analysis of regression models. If, after the diagnostic checks, no serious violations of the assumptions are detected, we can make inferences about fJ and the future Y values with some assurance that we will not be misled. n
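A compact way to obtain the quantities used in these diagnostics, namely the leverages hjj, the studentized residuals of (7-21), and the lag-one autocorrelation r1 of (7-22), is sketched below in NumPy (an illustration of ours, not part of the original text).

```python
import numpy as np

def regression_diagnostics(Z, y):
    """Leverages, studentized residuals, and the lag-1 residual autocorrelation r1;
    the Durbin-Watson statistic is approximately 2(1 - r1)."""
    n, rp1 = Z.shape
    H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T          # hat matrix; leverages are its diagonal
    h = np.diag(H)                                # average leverage is (r + 1)/n
    resid = y - H @ y
    s2 = resid @ resid / (n - rp1)
    studentized = resid / np.sqrt(s2 * (1 - h))   # compare with N(0, 1) reference values
    r1 = np.sum(resid[1:] * resid[:-1]) / np.sum(resid ** 2)
    return h, studentized, r1

# Example with the data of Example 7.3
Z = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1., 4., 3., 8., 9.])
print(regression_diagnostics(Z, y))
```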
� (zi -
j= l
r
A
z
408
Chap.
7
Multivariate Linear Regression Models
Additional Problems in Linear Regression
We shall briefly discuss several important aspects of regression that deserve and receive extensive treatments in texts devoted to regression analysis. (See [11], [19], and [8]). Selecting predictor variables from a large set. In practice, it is often dif ficult to formulate an appropriate regression fupction immediately. Which predic tor variables should be included? What form should the regression function take? When the list of possible predictor variables is very large, not all of the variables can be included in the regression function. Techniques and computer programs designed to select the "qest" subset of predictors are now readily avail able. The good ones try all subsets: z 1 alone, z2 alone, . . . , z 1 and z , The best choice is2 decided by examining some criterion quantity like R 2• [See2 . (. 7. -• 9).] How ever, R always increases with the inclusion of additional predictor variables. Although this2 problem can be circumvented b,y using the adj�sted R 2, R2 1 - (1 - R ) (n - 1)/(n r 1), a better statistic for selecting variables seems to be Mallow's CP statistic (see [10]), (residual sum of squares for subset model with p parameters, including an intercept) (n - 2p ) (residual variance for full model) A plot of the pairs (p, C ), one for each subset of predictors, will indicate models that forecast the observedP responses well. Good models typically have (p, CP ) coordinates near the 45° line. In Figure 7.4 on page 409, we have circled the point corresponding to the "best" subset predictor variables. If the list of predictor variables is very long, cost considerations limit the num ber of models that can be examined. Another approach, called stepwise regression (see [11 ]), attempts to select important predictors without considering all the pos sibilities. The procedure can be described by listing the basic steps (algorithm) involved in the computations: Step 1. All possible simple linear regressions are considered. The predictor vari able that explains the largest significant proportion of the variation in Y (the variable that has the largest correlation with the response) is the first variable to enter the regression function. . Step 2. The next variable to enter is the one (out of those not yet included) that makes the largest significant contribution to the regression sum of squares. The significance of the contribution is determined by an F-test. (See Result 7.6.) The value ofthe F-statistic that must be exceeded before the contri bution of a variable is deemed significant is often called the F to enter. Step 3. Once an additional variable has been included in the equation, the indi vidual contributions to the regression sum of squares of the other varicp
(=
--
)-
of
=
Sec.
• col
• (3)
(2) e
7.6
Model Checking and Other Aspects of Regression
409
• (2 , 3)
• ( 1 , 3)
Numbers in parentheses correspond to predicator variables
Figure 7.4 CP plot for computer data from Exam ple 7.6 with three predictor variables (z1 = orders, z2 = add-delete count, z3 = n u m ber of items; see the example and original source).
abies already in the equation are checked for significance using F-tests. If the F-statistic is less than the one (called the F to remove) corresponding to a prescribed significance level, the variable is deleted from the regres sion function. Step 4. Steps 2 and 3 are repeated until all possible additions are nonsignificant and all possible deletions are significant. At this point the selection stops. Because of the step-by-step procedure, there is no guarantee that this approach will select, for example, the best three variables for prediction. A second drawback is that the (automatic) selection methods are not capable of indicating when transformations of variables are useful. Colinearity. If Z is not of full rank, some linear combination, such as Za, must equal 0. In this situation, the columns are said to be colinear. This implies that Z'Z does not have an inverse. For most regression analyses, it is unlikely that Za 0 exactly. Yet, if linear combinations of the columns of Z exist that are nearly 0, the calculation of (Z' Z) - 1 is numerically unstable. Typically, the diagonal entries of (Z' Z) - 1 will be large. This yields large estimated variances for the ffi /s and it is then difficult to detect the "significant" regression coefficients /3; · The problems =
410
Chap.
7
Multivariate Linear Regression Models
caused by colinearity can be overcome somewhat by (1) deleting one of a pair of predictor variables that are strongly correlated or (2) relating the response Y to the principal components of the predictor variables-that is, the rows zj of Z are treated as a sample, and the first few principal components are calculated as is sub sequently described in Section 8.3. The response Y is then regressed on these new predictor variables. Bias caused by a misspecified model. Suppose some important predictor variables are omitted from the proposed regression model. That is, suppose the true model has Z = [Z1 i Z2 ] with rank r + 1 and (7-23)
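One simple numerical check for near-colinearity, in the spirit of the discussion above, is to examine the eigenvalues of Z'Z; the sketch below is our own illustration and is not taken from the text.

```python
import numpy as np

def colinearity_check(Z):
    """Flag near-colinearity among the columns of the design matrix Z by inspecting
    the eigenvalues of Z'Z: a near-zero eigenvalue means some linear combination of
    the columns is nearly 0, so (Z'Z)^{-1} is numerically unstable."""
    eigvals = np.linalg.eigvalsh(Z.T @ Z)          # returned in increasing order
    condition_number = eigvals[-1] / eigvals[0]
    return eigvals, condition_number
```

A very large condition number (or a smallest eigenvalue close to zero) is a signal to delete one of a pair of strongly correlated predictors or to regress on principal components, as suggested above.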
= z1 {J(l l + Zz fJ(z) + e where E ( e) = 0 and Var(e) = u2 1. However, the investigator unknowingly fits a model using only the first predictors by minimizing the error sum of squares (Y - Z1 {J - Zb <m> l ( Y(m ) - Zb(m ) ) ( Y(m ) - Zb (m ) )
]
(7-30)
For any choice of coefficients B = [b(1) | b(2) | ... | b(m)] in the multivariate multiple regression setting, the error sum of squares and cross products matrix is

    (Y − ZB)'(Y − ZB), whose ith diagonal entry is (Y(i) − Zb(i))'(Y(i) − Zb(i))    (7-30)

The selection b(i) = β̂(i) minimizes the ith diagonal sum of squares (Y(i) − Zb(i))'(Y(i) − Zb(i)). Consequently, tr[(Y − ZB)'(Y − ZB)] is minimized by the choice B = β̂. Also, the generalized variance |(Y − ZB)'(Y − ZB)| is minimized by the least squares estimates β̂. (See Exercise 7.11 for an additional generalized sum of squares property.)

Using the least squares estimates β̂, we can form the matrices of

    Predicted values:  Ŷ = Zβ̂ = Z(Z'Z)⁻¹Z'Y
    Residuals:         ε̂ = Y − Ŷ = [I − Z(Z'Z)⁻¹Z']Y    (7-31)

The orthogonality conditions among the residuals, predicted values, and columns of Z, which hold in classical linear regression, hold in multivariate multiple regression. They follow from Z'[I − Z(Z'Z)⁻¹Z'] = Z' − Z' = 0. Specifically,

    Z'ε̂ = Z'[I − Z(Z'Z)⁻¹Z']Y = 0    (7-32)
so the residuals ε̂(i) are perpendicular to the columns of Z. Also,

    Ŷ'ε̂ = β̂'Z'[I − Z(Z'Z)⁻¹Z']Y = 0    (7-33)

confirming that the predicted values Ŷ(i) are perpendicular to all residual vectors ε̂(k). Because Y = Ŷ + ε̂,

    Y'Y = (Ŷ + ε̂)'(Ŷ + ε̂) = Ŷ'Ŷ + ε̂'ε̂ + 0 + 0'

or

    Y'Y = Ŷ'Ŷ + ε̂'ε̂
    (total sum of squares and cross products) = (predicted sum of squares and cross products) + (residual (error) sum of squares and cross products)    (7-34)

The residual sum of squares and cross products can also be written as

    ε̂'ε̂ = Y'Y − Ŷ'Ŷ = Y'Y − β̂'Z'Zβ̂    (7-35)
Example 7.8 (Fitting a multivariate straight-line regression model)

To illustrate the calculations of β̂, Ŷ, and ε̂, we fit a straight-line regression model (see Panel 7.2 on page 415 for SAS output)

    Y_j1 = β01 + β11 z_j1 + ε_j1
    Y_j2 = β02 + β12 z_j1 + ε_j2,    j = 1, 2, ..., 5

to two responses Y1 and Y2 using the data in Example 7.3. These data, augmented by observations on an additional response, are:

    z1:   0    1    2    3    4
    y1:   1    4    3    8    9
    y2:  -1   -1    2    3    2

The design matrix Z remains unchanged from the single-response problem. We find that

    Z' = [1 1 1 1 1; 0 1 2 3 4]

    (Z'Z)⁻¹ = [.6  −.2; −.2  .1]
and

    Z'y(2) = [1 1 1 1 1; 0 1 2 3 4] [−1, −1, 2, 3, 2]' = [5; 20]

so

    β̂(2) = (Z'Z)⁻¹Z'y(2) = [.6  −.2; −.2  .1] [5; 20] = [−1; 1]

From Example 7.3, β̂(1) = [1; 2]. Hence,

    β̂ = [β̂(1) | β̂(2)] = [1  −1; 2  1]

The fitted values are generated from ŷ1 = 1 + 2z1 and ŷ2 = −1 + z1. Collectively,

    Ŷ = Zβ̂ = [1 0; 1 1; 1 2; 1 3; 1 4] [1  −1; 2  1] = [1 −1; 3 0; 5 1; 7 2; 9 3]

and

    ε̂ = Y − Ŷ = [0 0; 1 −1; −2 1; 1 1; 0 −1]

Note that

    ε̂'Ŷ = [0 1 −2 1 0; 0 −1 1 1 −1] [1 −1; 3 0; 5 1; 7 2; 9 3] = [0 0; 0 0]
Since

    Y'Y = [171 43; 43 19],    Ŷ'Ŷ = [165 45; 45 15],    and    ε̂'ε̂ = [6 −2; −2 4]

the sum of squares and cross products decomposition

    Y'Y = Ŷ'Ŷ + ε̂'ε̂

is easily verified.
•
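The arithmetic of Example 7.8 can be reproduced with a few lines of NumPy. The sketch below is an added illustration, not part of the original text: it forms β̂ = (Z'Z)⁻¹Z'Y, the fitted values and residuals of (7-31), and checks the orthogonality condition (7-33) and the decomposition (7-34).

```python
import numpy as np

# Data from Example 7.8: one predictor, two responses, n = 5
z1 = np.array([0., 1., 2., 3., 4.])
Y  = np.array([[ 1., -1.],
               [ 4., -1.],
               [ 3.,  2.],
               [ 8.,  3.],
               [ 9.,  2.]])
Z = np.column_stack([np.ones_like(z1), z1])        # design matrix

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)       # (Z'Z)^{-1} Z'Y
Y_hat    = Z @ beta_hat                            # predicted values (7-31)
resid    = Y - Y_hat                               # residuals (7-31)

print(beta_hat)                  # columns [1, 2] and [-1, 1], as in the example
print(resid.T @ Y_hat)           # orthogonality (7-33): the 2 x 2 zero matrix
print(np.allclose(Y.T @ Y, Y_hat.T @ Y_hat + resid.T @ resid))   # (7-34) holds
```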
PANEL 7.2  SAS ANALYSIS FOR EXAMPLE 7.8 USING PROC GLM

PROGRAM COMMANDS:
    title 'Multivariate Regression Analysis';
    data mra;
      infile 'E7-8.dat';
      input y1 y2 z1;
    proc glm data = mra;
      model y1 y2 = z1 / ss3;
      manova h = z1 / printe;

OUTPUT:
General Linear Models Procedure

Dependent Variable: Y1
    Source           DF  Sum of Squares  Mean Square   F Value  Pr > F
    Model             1     40.00000000  40.00000000     20.00  0.0208
    Error             3      6.00000000   2.00000000
    Corrected Total   4     46.00000000

    R-Square 0.869565   C.V. 28.28427   Root MSE 1.414214   Y1 Mean 5.00000000

    Source  DF  Type III SS   Mean Square   F Value  Pr > F
    Z1       1  40.00000000  40.00000000     20.00  0.0208

    Parameter   Estimate      T for H0: Parameter = 0   Pr > |T|   Std Error of Estimate
    INTERCEPT   1.000000000       0.91                   0.4286      1.09544512
    Z1          2.000000000       4.47                   0.0208      0.44721360

Dependent Variable: Y2
    Source           DF  Sum of Squares  Mean Square   F Value  Pr > F
    Model             1     10.00000000  10.00000000      7.50  0.0714
    Error             3      4.00000000   1.33333333
    Corrected Total   4     14.00000000

    R-Square 0.714286   C.V. 115.4701   Root MSE 1.154701   Y2 Mean 1.00000000

    Source  DF  Type III SS   Mean Square   F Value  Pr > F
    Z1       1  10.00000000  10.00000000      7.50  0.0714

    Parameter   Estimate       T for H0: Parameter = 0   Pr > |T|   Std Error of Estimate
    INTERCEPT   -1.000000000      -1.12                   0.3450      0.89442719
    Z1           1.000000000       2.74                   0.0714      0.36514837

E = Error SS & CP Matrix
          Y1   Y2
    Y1     6   -2
    Y2    -2    4

Manova Test Criteria and Exact F Statistics for the Hypothesis of no Overall Z1 Effect
H = Type III SS&CP Matrix for Z1   E = Error SS&CP Matrix
S = 1   M = 0   N = 0

    Statistic                  Value         F        Num DF  Den DF  Pr > F
    Wilks' Lambda              0.06250000   15.0000      2       2    0.0625
    Pillai's Trace             0.93750000   15.0000      2       2    0.0625
    Hotelling-Lawley Trace    15.00000000   15.0000      2       2    0.0625
    Roy's Greatest Root       15.00000000   15.0000      2       2    0.0625
Result 7.9. For the least squares estimator β̂ = [β̂(1) | β̂(2) | ... | β̂(m)] determined under the multivariate multiple regression model (7-26) with full rank(Z) = r + 1 < n,

    E(β̂(i)) = β(i)    and    Cov(β̂(i), β̂(k)) = σik (Z'Z)⁻¹,    i, k = 1, 2, ..., m

The residuals ε̂ = [ε̂(1) | ε̂(2) | ... | ε̂(m)] satisfy E(ε̂(i)) = 0 and E(ε̂(i)'ε̂(k)) = (n − r − 1)σik, so

    E(ε̂) = 0    and    E[ε̂'ε̂ / (n − r − 1)] = Σ

Also, ε̂ and β̂ are uncorrelated.

The forecast error for the ith response at a new value z0 of the predictors is Y0i − z0'β̂(i) = ε0i − z0'(β̂(i) − β(i)), so E(Y0i − z0'β̂(i)) = 0, indicating that z0'β̂(i) is an unbiased predictor of Y0i. The forecast errors have covariances

    E[(ε0i − z0'(β̂(i) − β(i)))(ε0k − z0'(β̂(k) − β(k)))] = σik (1 + z0'(Z'Z)⁻¹z0)    (7-39)

Note that E((β̂(i) − β(i)) ε0k) = 0, since β̂(i) = (Z'Z)⁻¹Z'ε(i) + β(i) is independent of ε0. A similar result holds for E(ε0i (β̂(k) − β(k))').

Maximum likelihood estimators and their distributions can be obtained when the errors ε have a normal distribution.

Result 7.10. Let the multivariate multiple regression model in (7-26) hold with full rank(Z) = r + 1, n ≥ (r + 1) + m, and let the errors ε have a normal distribution. Then

    β̂ = (Z'Z)⁻¹Z'Y

is the maximum likelihood estimator of β, and β̂ has a normal distribution with E(β̂) = β and Cov(β̂(i), β̂(k)) = σik (Z'Z)⁻¹. Also, β̂ is independent of the maximum likelihood estimator of the positive definite Σ given by

    Σ̂ = (1/n) ε̂'ε̂ = (1/n) (Y − Zβ̂)'(Y − Zβ̂)

and nΣ̂
is distributed as W_{p, n−r−1}(Σ).

Proof. According to the regression model, the likelihood is determined from the data Y = [Y1, Y2, ..., Yn]', whose rows are independent, with Yj distributed as N_m(β'zj, Σ). We first note that Y − Zβ = [Y1 − β'z1, Y2 − β'z2, ..., Yn − β'zn]', so

    (Y − Zβ)'(Y − Zβ) = Σ_{j=1}^{n} (Yj − β'zj)(Yj − β'zj)'

and

    Σ_{j=1}^{n} (Yj − β'zj)'Σ⁻¹(Yj − β'zj) = Σ_{j=1}^{n} tr[Σ⁻¹(Yj − β'zj)(Yj − β'zj)'] = tr[Σ⁻¹(Y − Zβ)'(Y − Zβ)]    (7-40)

Another preliminary calculation will enable us to express the likelihood in a simple form. Since ε̂ = Y − Zβ̂ satisfies Z'ε̂ = 0 [see (7-32)],

    (Y − Zβ)'(Y − Zβ) = [Y − Zβ̂ + Z(β̂ − β)]'[Y − Zβ̂ + Z(β̂ − β)]
                      = (Y − Zβ̂)'(Y − Zβ̂) + (β̂ − β)'Z'Z(β̂ − β)
                      = ε̂'ε̂ + (β̂ − β)'Z'Z(β̂ − β)    (7-41)

Using (7-40) and (7-41), we obtain the likelihood

    L(β, Σ) = Π_{j=1}^{n} (2π)^{−m/2} |Σ|^{−1/2} exp[−½ (yj − β'zj)'Σ⁻¹(yj − β'zj)]
            = (2π)^{−mn/2} |Σ|^{−n/2} exp(−½ tr[Σ⁻¹(ε̂'ε̂ + (β̂ − β)'Z'Z(β̂ − β))])

The matrix Z(β̂ − β)Σ⁻¹(β̂ − β)'Z' has the form A'A, with A = Σ^{−1/2}(β̂ − β)'Z', and, from Exercise 2.16, it is nonnegative definite. Therefore, its eigenvalues are nonnegative also. Since, by Result 4.9, tr[Z(β̂ − β)Σ⁻¹(β̂ − β)'Z'] is the sum of its eigenvalues, this trace will equal its minimum value, zero, if β = β̂. This choice is unique because Z is of full rank and β(i) ≠ β̂(i) implies that Z(β̂(i) − β(i)) ≠ 0, in which case tr[Z(β̂ − β)Σ⁻¹(β̂ − β)'Z'] ≥ c'Σ⁻¹c > 0, where c' is any nonzero row of Z(β̂ − β). Applying Result 4.10 with B = ε̂'ε̂, b = n/2, and p = m, we find that β̂ and Σ̂ = n⁻¹ε̂'ε̂ are the maximum likelihood estimators of β and Σ, respectively, and

    L(β̂, Σ̂) = (2π)^{−mn/2} |Σ̂|^{−n/2} e^{−mn/2} = (2π)^{−mn/2} |ε̂'ε̂|^{−n/2} n^{mn/2} e^{−mn/2}    (7-42)
It remains to establish the distributional results. From (7-36), we know that β̂(i) and ε̂(i) are linear combinations of the elements of ε. Specifically,

    β̂(i) = (Z'Z)⁻¹Z'ε(i) + β(i)

Let β(2) be the matrix of interaction parameters for the two responses. Although the sample size n = 18 is not large, we shall illustrate the calculations involved in the test of H0: β(2) = 0 given in Result 7.11. Setting α = .05, we test H0 by referring

    −[n − r1 − 1 − ½(m − r1 + q1 + 1)] ln( |nΣ̂| / |nΣ̂ + n(Σ̂1 − Σ̂)| ) = −[18 − 5 − 1 − ½(2 − 5 + 3 + 1)] ln(.7605) = 3.28

to a chi-square percentage point with m(r1 − q1) = 2(2) = 4 d.f. Since 3.28 < χ²4(.05) = 9.49, we do not reject H0 at the 5% level. The interaction terms are not needed.  •

More generally, we could consider a null hypothesis of the form H0: Cβ = Γ0, where C is (r − q) × (r + 1) and is of full rank (r − q). For the choices C = [0 | I_(r−q)] and Γ0 = 0, this null hypothesis becomes H0: Cβ = β(2) = 0, the case considered earlier. It can be shown that the extra sum of squares and cross products generated by the hypothesis H0 is

    n(Σ̂1 − Σ̂) = (Cβ̂ − Γ0)'(C(Z'Z)⁻¹C')⁻¹(Cβ̂ − Γ0)

Under the null hypothesis, the statistic n(Σ̂1 − Σ̂) is distributed as W_{p, r−q}(Σ) independently of Σ̂. This distribution theory can be employed to develop a test of H0: Cβ = Γ0 similar to the test discussed in Result 7.11. (See, for example, [21].)
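The extra sum of squares comparison underlying Result 7.11 is straightforward to program. The sketch below is an added illustration with my own variable names, not part of the original text: it fits the full and the reduced multivariate regressions, forms Σ̂ and Σ̂1, and evaluates the large-sample statistic against a chi-square percentile.

```python
import numpy as np
from scipy.stats import chi2

def lr_test_subset(Z_full, Z_reduced, Y, alpha=0.05):
    """Likelihood ratio test of H0: the predictors dropped from Z_full are not needed.

    Z_full    : (n, r+1) full design matrix (intercept column included).
    Z_reduced : (n, q+1) reduced design matrix (intercept column included).
    Y         : (n, m) matrix of responses.
    """
    n, m = Y.shape
    r = Z_full.shape[1] - 1
    q = Z_reduced.shape[1] - 1

    def sigma_hat(Z):
        beta = np.linalg.lstsq(Z, Y, rcond=None)[0]
        resid = Y - Z @ beta
        return resid.T @ resid / n          # maximum likelihood estimate of Sigma

    S_full, S_red = sigma_hat(Z_full), sigma_hat(Z_reduced)
    stat = -(n - r - 1 - 0.5 * (m - r + q + 1)) * np.log(
        np.linalg.det(S_full) / np.linalg.det(S_red))
    df = m * (r - q)
    return stat, df, chi2.ppf(1 - alpha, df)   # reject H0 if stat exceeds the cutoff
```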
Other Multivariate Test Statistics
Tests other than the likelihood ratio test have been proposed for testing H0: β(2) = 0 in the multivariate multiple regression model. Popular computer-package programs routinely calculate four multivariate test statistics. To connect with their output, we introduce some alternative notation. Let E be the p × p error, or residual, sum of squares and cross products matrix

    E = nΣ̂

that results from fitting the full model. The p × p hypothesis, or extra, sum of squares and cross products matrix is

    H = n(Σ̂1 − Σ̂)

The statistics can be defined in terms of E and H directly, or in terms of the nonzero eigenvalues η1 ≥ η2 ≥ ... ≥ ηs of HE⁻¹, where s = min(p, r − q). Equivalently, they are the roots of |(Σ̂1 − Σ̂) − ηΣ̂| = 0. The definitions are:

    Wilks' lambda = Π_{i=1}^{s} 1/(1 + ηi) = |E| / |E + H|
    Pillai's trace = Σ_{i=1}^{s} ηi/(1 + ηi) = tr[H(H + E)⁻¹]
    Hotelling-Lawley trace = Σ_{i=1}^{s} ηi = tr[HE⁻¹]
    Roy's greatest root = η1/(1 + η1)

Roy's test selects the coefficient vector a so that the univariate F-statistic based on a'Yj has its maximum possible value. When several of the eigenvalues ηi are moderately large, Roy's test will perform poorly relative to the other three. Simulation studies suggest that its power will be best when there is only one large eigenvalue. Charts and tables of critical values are available for Roy's test. (See [17] and [15].) Wilks' lambda, Roy's greatest root, and the Hotelling-Lawley trace test are nearly equivalent for large sample sizes.

If there is a large discrepancy in the reported P-values for the four tests, the eigenvalues and vectors may lead to an interpretation. In this text, we report Wilks' lambda, which is the likelihood ratio test.
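In the notation just introduced, the four statistics are simple functions of E and H. The following sketch is an added illustration, not from the text; the E and H values shown are taken from Example 7.8 (E = ε̂'ε̂ from that example, and H computed here from its fitted values), so the output can be compared with Panel 7.2.

```python
import numpy as np

def manova_statistics(E, H):
    """Wilks' lambda, Pillai's trace, Hotelling-Lawley trace, and Roy's greatest
    root computed from the error (E) and hypothesis (H) SS&CP matrices."""
    eta = np.sort(np.linalg.eigvals(H @ np.linalg.inv(E)).real)[::-1]
    wilks     = np.linalg.det(E) / np.linalg.det(E + H)    # prod 1/(1 + eta_i)
    pillai    = np.trace(H @ np.linalg.inv(H + E))         # sum eta_i/(1 + eta_i)
    hotelling = np.trace(H @ np.linalg.inv(E))             # sum eta_i
    roy       = eta[0] / (1.0 + eta[0])                    # text's definition; SAS reports eta_1 itself
    return wilks, pillai, hotelling, roy

E = np.array([[ 6., -2.], [-2.,  4.]])
H = np.array([[40., 20.], [20., 10.]])
print(manova_statistics(E, H))    # (0.0625, 0.9375, 15.0, 0.9375)
```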
Predictions from Multivariate Multiple Regressions
Suppose the model Y = Zβ + ε, with normal errors ε, has been fit and checked for any inadequacies. If the model is adequate, it can be employed for predictive purposes.

One problem is to predict the mean responses corresponding to fixed values z0 of the predictor variables. Inferences about the mean responses can be made using the distribution theory in Result 7.10. From this result, we determine that

    β̂'z0 is distributed as N_m(β'z0, z0'(Z'Z)⁻¹z0 Σ)

and

    nΣ̂ is independently distributed as W_{n−r−1}(Σ)

The unknown value of the regression function at z0 is β'z0. So, from the discussion of the T²-statistic in Section 5.2, we can write

    T² = ( (β̂'z0 − β'z0) / √(z0'(Z'Z)⁻¹z0) )' ( (n/(n − r − 1)) Σ̂ )⁻¹ ( (β̂'z0 − β'z0) / √(z0'(Z'Z)⁻¹z0) )    (7-45)

and the 100(1 − α)% confidence ellipsoid for β'z0 is provided by the inequality

    (β̂'z0 − β'z0)' ( (n/(n − r − 1)) Σ̂ )⁻¹ (β̂'z0 − β'z0) ≤ z0'(Z'Z)⁻¹z0 [ (m(n − r − 1)/(n − r − m)) F_{m, n−r−m}(α) ]    (7-46)

where F_{m, n−r−m}(α) is the upper (100α)th percentile of an F-distribution with m and n − r − m d.f.

The 100(1 − α)% simultaneous confidence intervals for E(Yi) = z0'β(i) are

    z0'β̂(i) ± √( (m(n − r − 1)/(n − r − m)) F_{m, n−r−m}(α) ) √( z0'(Z'Z)⁻¹z0 (n/(n − r − 1)) σ̂ii ),    i = 1, 2, ..., m    (7-47)

where β̂(i) is the ith column of β̂ and σ̂ii is the ith diagonal element of Σ̂.

The second prediction problem is concerned with forecasting new responses Y0 = β'z0 + ε0 at z0. Here ε0 is independent of ε. Now,

    Y0 − β̂'z0 = (β − β̂)'z0 + ε0 is distributed as N_m(0, (1 + z0'(Z'Z)⁻¹z0)Σ)

independently of nΣ̂, so the 100(1 − α)% prediction ellipsoid for Y0 becomes

    (Y0 − β̂'z0)' ( (n/(n − r − 1)) Σ̂ )⁻¹ (Y0 − β̂'z0) ≤ (1 + z0'(Z'Z)⁻¹z0) [ (m(n − r − 1)/(n − r − m)) F_{m, n−r−m}(α) ]    (7-48)

The 100(1 − α)% simultaneous prediction intervals for the individual responses Y0i are

    z0'β̂(i) ± √( (m(n − r − 1)/(n − r − m)) F_{m, n−r−m}(α) ) √( (1 + z0'(Z'Z)⁻¹z0) (n/(n − r − 1)) σ̂ii ),    i = 1, 2, ..., m    (7-49)

where β̂(i), σ̂ii, and F_{m, n−r−m}(α) are the same quantities appearing in (7-47). Comparing (7-47) and (7-49), we see that the prediction intervals for the actual values of the response variables are wider than the corresponding intervals for the expected values. The extra width reflects the presence of the random error ε0i.
Chap.
7
Multivariate Linear Regression Models
Example 7. 1 0 (Constructing a confidence ellipse and a prediction ellipse for bivariate responses)
Y2,
A second response variable was measured for the computer-requirement prob lem discussed in Example 7.6. Measurements on the response disk input/out put capacity, corresponding to the z 1 and z 2 values in that example were
Y2 = [301.8, 396.1, 328.2, 307.4, 362.4, 369.5, 229.1] Obtain the 95% confidence ellipse for /1 1 z0 and the 95% prediction ellipse for Y0 = [ Y0 1 , Y0 2 ] 1 for a site with the configuration z0 = [1, 130, 7.5] 1 • Computer calculations provide the fitted equation y2 = 14.14 + 2.25z 1 + 5.67z2 with s = 1.812. Thus, [3(2) [14.14, 2.25, 5.67] 1 • From Example 7.6, I
=
[3( 1 ) = [8.42, 1.08, 42] 1, z � P( l ) = 151.97, and z� ( Z1 Z ) - 1 z0 = .34725
We find that
�
zP and
(2 )
=
14.14 + 2.25 (130) + 5.67 (7.5) = 349.17
151.97 ] [ 349.17
Since
n = 7, r = 2, and m
o/3(
from
[z
I
- 151.97,
I
[-�;!!!_� .]
/3 1Zo = zo /3(2) is, 5.30 ] - \ [ z �/3( 1 ) - 151.97 ] 13.13 z�/3(2) - 349.17 2 4) (.34725) [ ( � ) F2. 3 (.05) ]
2, a 95% confidence ellipse for
zo/3(2) - 349.17] (4) [ 5.80 5.30
(7-46), the set 1)
=
�
Sec.
7.8
The Concept of Linear Regression
427
with F2 3 (.05) = 9.55. This ellipse is centered at (151.97, 349.17). Its orienta tion and the lengths of the major and minor axes can be determined from the eigenvalues and eigenvectors of n i . Comparing (7-46) and (7-48), we see that the only change required for the calculation of the 95% prediction ellipse is to replace = .34725 with + = 1.34725. Thus, the 95% prediction ellipse for is also centered at (151.97, 349.17), but is larger than the con Y0 = fidence ellipse. Both ellipses are sketched in Figure 7.5. It is the prediction ellipse that is relevant to the determination of com• puter requirements for a particular site with the given
z�(Z'Z) - 1 z0
1 z�(Z'Z)-1z0 [Y0 1 , Y0 2]'
z0 •
Response 2
d
380
�
360
340
Prediction ellipse
onfidence ellipse
Figure 7.5
95% confidence and prediction ellipses for the computer data with two responses.
0
7.8 THE CONCEPT OF LINEAR REGRESSION
The classical linear regression model is concerned with the association between a single dependent variable and a collection of predictor variables z 1 , z 2 , , z,. The regression model that we have considered treats as a random variable whose mean depends upon fixed values of the z; ' s . This mean is assumed to be a linear function of the regression coefficients .•. , The linear regression model also arises in a different setting. Suppose all the variables Z1 , Z2 , , are random and have a joint distribution, not necessar ily normal, with mean vector p and covariance matrix :I . Partition-
Y
Y,
• . .
•••
Y
{30 , {31 , {3, .
Z,
(r+ l) X l
ing p and I in an obvious fashion, we write
[ ]
(r + l) X (r+ l)
I I
and
I=
I
y O" yy : Uz X i ( 1 X r)
(1 1 ) rX l):-rx��: (rX r) (;�I
428
Chap.
7
Multivariate Linear Regression Models
with (7-50)
Uz y = [uyz , Uyz , . . . , u yz ] 1 I
2
can be taken to have full rank.6 Consider the problem of predicting Y using the linear predictor = b0 + b 1 Z1 + + b,Z, = b0 + b1 Z (7-51) For a given predictor of the form of (7-51), the error in the prediction of Y is prediction error = Y - b0 - b 1 Z1 - - b,Z, = Y - b0 - b1 Z (7-52) Because this error is random, it is customary to select b0 and b to minimize the I zz
r
···
···
mean square error = E ( Y - b0 - b1 Z) 2
(7-53) Now the mean square error depends on the joint distribution of Y and Z only through the parameters p and I. It is possible to express the "optimal" linear pre
dictor in terms of these latter quantities. Result 7 1 2 The linear predictor {30 + fJ 1 Z with coefficients f3o = J.Ly - /31 JL z has minimum mean square among all linear predictors of the response Y. Its mean square error is E ( Y - {30 - /31 Z) 2 = E (Y - J.Ly - u�y izi (Z - Pz ) ) 2 = u yy - u�y i zi uz y Also, {30 + {J1 Z = J.Ly + u�y izi (Z - pz ) is the linear predictor having maxi mum correlation with Y; that is, Carr (Y, {30 + /31 Z) = max Carr ( Y, b0 + b1 Z) .
.
�
bo, b
Proof.
we get
E(Y -
Writing
I
�-j = {J1 UI zz fJ = Uz y ,.. z z Uz y yy O"yy b0 + b1 Z = b 0 + b1 Z - ( J.Ly - b1 p z ) + ( J.Ly - b1 Pz ),
b0 - b1 Z)2 = E [ Y - J.Ly - (bi Z - b1 Pz ) + ( J.Ly - b0 - b1 pz )JZ
= E(Y - J.Ly ) 2 + E (b1 (Z - Pz ) ) 2 + ( J.Ly - b0 - b1 p z ) 2 - 2E [ b1 (Z - p z ) (Y - J.Ly)] = Uyy + b1 I zz b + ( J.Ly - b0 - b1 Pzf - 2b1 Uz y
6 If I zz is not of full rank, one variable-for example, Zk---can be written as a linear combina tion of the other Z;'s and thus is redundant in forming the linear regression function Z' {J. That is, Z may be replaced by any subset of components whose nonsingular covariance matrix has the same rank as I z z ·
Sec.
7.8
The Concept of Linear Regression
429
Adding and subtracting u�yizi uz y , we obtain
=
The mean square error is minimized by taking b = I i i uz y {J, making the last term zero, and then choosing be b0 = f.L y - (I ii uz y ) 1 JL z = {30 to make the third term zero. The minimum mean square error is thus uyy - u�yizi uz y· Next, we note that Cov (b 0 + b1 Z, Y ) = Cov (b1 Z, Y ) h1 Uz y so
=
[ Corr (b0 + b 1 Z, Y ) ] 2 _- Uyy[b1(b1uzIyJzzZ b) ,
for all b0 , b
Employing the extended Cauchy-Schwartz inequality of (2-49) with B = I zz , we obtain
or
with equality for b = I i i uz y = {J. The alternative expression for the maximum correlation follows from the equation u�yizi uz y = u�y {J = u�y iziizz fJ = •
fJ 1 Izz fJ·
The correlation between Y and its best linear predictor is called the popula
tion multiple correlation coefficient
I
Uz y �...., z- 1zUz y Uyy
PY( Z) = +
(7-54)
The square of the population multiple correlation coefficient, p�(Z) , is called the pop ulation coefficient of determination. Note that, unlike other correlation coefficients, the multiple correlation coefficient is a positive square root, so ::::;; PY( Z) ::::;; 1 . The population coefficient of determination has an important interpretation. From Result 7.12, the mean square error in using {30 + {J1 Z to forecast Y is
0
(
I
)
2 y �...., z- z1 Uz y Uz -1 (Z) ) (7-55) Uyy - Uz1 y izz U U (1 U PY yy y yy yy Uz Uyy _
0,
_
If P�(Z) = there is no predictive power in Z . At the other extreme, p�(Z) = 1 implies that Y can be predicted with no error.
430
Chap.
7
Multivariate Linear Regression Models
Example 7. 1 1
(Determining the best linear predictor, its mean square error, and the multiple correlation coefficient)
[-�!'�-L��x] [ !Q_l_�- -=-�-]
Given the mean vector and covariance matrix of Y, Z1 , Z2 , and I = Uz y i Izz = 1 i 7 3 i -1 i 3 2 determine (a) the best linear predictor + {31 Z1 + {32 Z2 , (b) its mean square error, and (c) the multiple correlation{30 coefficient. Also, verify that the mean square error equals Uyy{ 1 - p�(z) ). First, -1 fJ = I i i uzv = [ � � [ -� J [ _ : : �:� ] [ - � J [ -� J J f3o = #Ly - fJ 1 ILz = 5 - [1, -2) [ � J = 3 so the best linear predictor is {30 + {J 1 Z = 3 + Z1 - 2Z2 • The mean square error is Uy y - Uz y �"'-zz- 1 Uz y = 10 - [ 1, - 1 ] [ - ..46 -1..64 J [ _ 11 J = 10 - 3 = 7 and the multiple correlation coefficient is -1 y #o = .548 Uz y �"'-zzUz P Y(Z) = Uyy Note that uyy{ 1 - Phz> ) = 10(1 - fo ) = 7 is the mean square error. • It is possible to show (see Exercise 7. 5 ) that 1 - PY2 (Z) -- p 1y y (7-56) where p Y Y is the upper left-hand corner of the inverse of the correlation matrix determined from I. The restriction to linear predictors is closely connected to the assumption of normality. Specifically, if we take __
I
I
I
I
y
Z21 to be distributed as z
z,
N, +
1 (p,, I)
Sec.
7.8
The Concept of Linear Regression
then the conditional distribution of Y with
431
z 1 , z2 , ... , z, fixed (see Result 4.6) is
N( J.Ly + u�y izi (z - J.tz ), uyy - u�y izi uz y )
The mean of this conditional distribution is the linear predictor in Result 7.12. That is, (7-57) E(Yiz 1 , z2 , , z, ) = J.Ly + u�yiz i (z - p,z ) = f3o + fJ'z and w e conclude that E(Y iz 1 , z2 , , z , ) is the best linear predictor of Y when the • . .
• • •
population is N, + 1 ( p, , I) . The conditional expectation of Y in (7-57) is called the
linear regression function .
E(Yiz 1 , z2 , , z, ) [18])
When the population is not normal, the regression function need not be of the form {30 + z . Nevertheless, it can be shown (see that (Y 1 , whatever its form, predicts Y with the smallest mean square er ror. Fortunately, this wider optimality among all estimators is possessed by the linear predictor when the population is normal.
fJ'
E iz , z2 , z, ),
• • •
• • •
Result 7. 1 3.
[-�S.Y-z !'y_f-�-: S�zzr..]
Suppose the joint distribution of Y and and S =
Z is N,+ ( p,, I) . Let 1
be the sample mean vector and sample covariance matrix, respectively, for a ran dom sample of size n from this population. Then the maximum likelihood estima tors of the coefficients in the linear predictor are
P = S z� Sz y,
fi o
= Y - s� y S z� Z = Y - P'Z
Consequently, the maximum likelihood estimator of the linear regression function is
fi o + P ' z =
Y
+ s� y Sz � (z - Z )
and the maximum likelihood estimator of the mean square error I
n-1 A O" yy.z - n (Syy - S z/ y zz Sz y --
Proof.
s-1 )
E [ Y - {30 - fJ'Z] 2 is
We use Result 4.11 and the invariance property of maximum likeli hood estimators. [See (4-20).] Since, from Result 7.12,
f3o = J.Ly - (I i i uz y ) ' p, z , and mean square error = uyy. z = uyy - u�y izi uz y
432
Chap.
7
Multivariate Linear Regression Models
[-r-J
the conclusions follow upon substitution of the maximum likelihood estimators p� -
z
for and I =
:.] [-��-Uz �y-�-��: I zz
•
It is customary to change the divisor from n to n - ( r + 1 ) in the estimator of the mean square error, cr yy. z = E ( Y - {30 - {J ' Z) 2 , in order to obtain the unbiased estimator
( n - 1 ) (S yy - S z y S zz Sz y ) ,
n-r-1
-1
n -
r -
1
(7-58)
Example 7. 1 2 (Maximum likelihood estimate of the regression functionsingle response)
For the computer data of Example 7.6, the n = 7 observations on Y (CPU time), Z 1 (orders), and Z2 (add-delete items) give the sample mean vector and sample covariance matrix: fo s
� =
[-;j � [-l��-��] [ j [_1§.?:�I-�_l_1_� :?��- -�-�2?_�·] S yy ! s � v = 418.763 : 377.200 28.034 S z y ! 8 zz 35.983 i 28.034 13.657
-- - - - + - - - - - I
I
Assuming that Y, Z1 , and Z2 are jointly normal, obtain the estimated regres sion function and the estimated mean square error. Result 7.13 gives the maximum likelihood estimates A
] [ 418.763 ] [ 1.079 ] .420 35.983 [ 130.24 150.44 - 142.019 {J ' z = 150.44 - [1.079, .420] 3.547 J
f3 - 8 -zz1 S z A
-
{30 = y
-
A
Y
-
[ - .003128 .006422
- .006422 .086404
= 8.421
and the estimated regression function
=
Sec.
7.8
The Concept of Linear Regression
433
�0 + [J 1 z = 8.42 - 1.08z 1 + .42z 2 The maximum likelihood estimate of the mean square error arising from the prediction of Y with this regression function is
(n-1 n
) 76 467 .913 - [418.763, 35.983 ] [ .003128 - .006422 ( )( ( Syy
- S 1z y s -zz1 S z y )
- .006422 .086404
= .894
][
418.763 35.983
]) •
Prediction of Several Variables
The extension of the previous results to the prediction of several responses Y1 , Y2 , , Ym is almost immediate. We present this extension for normal populations. Suppose
[flariah'c>�>e§tilii!�tee:;�: ::.ixi l:��2�1l394d2'� ·> , 1
Autocorre l ation C h eck of Resi d u a l s To Lag 6 12 18 24
Chi S q u a re 6.04 1 0.27 1 5.92 23.44
Autocorrelations DF 4 10 16 22
0.079 0. 1 44 0.0 1 3 0.0 1 8
0.0 1 2 -0.067 0. 1 06 0.004
0.022 - 0. 1 1 1 -0. 1 37 0.250
0. 1 92 - 0.056 - 0. 1 70 -0.080
- 0. 1 27 - 0.056 - 0 . 079 - 0.069
0. 1 61 - 0. 1 08 O.D 1 8 - 0.051
(continued)
Sec. PAN EL 7.3
7.1 0
M ultiple Regression Models with Time Dependent Errors
445
(continued)
Autocorrelation Plot of Resi d u a l s Lag 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Cova ria nce 228.894 1 8. 1 94945 2 .763255 5.038727 44.059835 - 29. 1 1 8892 36.90429 1 33.008858 - 1 5.4240 1 5 - 25.379057 - 1 2.890888 - 1 2 .777280 - 2 4.825623 2 .970 1 97 24. 1 50 1 68 - 3 1 . 4073 1 4
Corre l ation 1 .00000 0.07949 0.01 207 0.0220 1 0 . 1 9249 - 0 . 1 2722 0. 1 6 1 23 0 . 1 442 1 -0.06738 - 0. 1 1 088 -0.05632 -0.05582 - 0 . 1 0846 0.01 298 0 . 1 0551 - 0. 1 37 2 1
-1 9 8 7 6 5 4 3 2
0 1 2 3 4 5 6 7 8 9 1 I *********************! I **
I **** . *** I I *** I *** *I ** I *I *I ** I I ** *** I " . " m a rks two sta n d a rd errors
When modeling relationships using time-ordered data, regression models with noise structures that allow for the time dependence are often useful. Modern software packages, like SAS, allow the analyst to easily fit these expanded models.
SUPPLEMEN T lA
The Dis tribution of the Likelihood Ratio for the Multivariate Multiple Regression Model
The development in this supplement establishes 1Result 7.11. We know that1 ni = Y' (I - Z(Z'Z) - Z')Y and under H0, ni11 Y ' [I - Z1 (z;z,) - Z{]Y1 with Y = Zd J r + 1, were constructed, Z' ge = 0, so that Pgc = ge . Conse quently, these ge s are eigenvectors of P corresponding to the n - r - 1 unit eigenvalues. By the spectral decomposition (2-16), P = f =2:r+ 2 geg� and ;_;;,
;_;;, . . . ;_;;,
c
�
11
f=r+2
f = r+ 2
the e ' ge = where, because [ 1, .. , Veml' are independently distributed as I). Conse quently, by ni is distributed as (I). In the same manner, { gc e > qq ++ 11 t ge n so P1 = f =2:q + 2 geg� . We can write the extra sum of squares and cross products as 1 r+ 1 n(I 1 - I ) = e '(P1 - P) e = t =r+2:q + 2 ( e ' ge ) ( e ' gc ) ' = C =2:q + 2 Vev; where the Ve are independently distributed �s Nm (O, I);_ By (1-22), n(I 1 - I ) is distributed as Wp, r - q (I) independently of ni , since n (I 1 - I) involves a differ ent set of independent Ve ' s. The large sample distribution for - [n - r - 1 - � (m - r + q + 1 ) ] ln(!IA I/IIA 1 ! ) follows from Result 5.2, with v - v0 = m (m + 1 )/2 + m (r + 1 ) Cov ( Ve;, Vj k ) = E (g; e r1 + 1. Therefore, (Z'Z) (Aj e;)e;z' e;e;z'. Summing over i gives =
pYY
;;;;.
;;;;. · • · ;;;;.
u Y Y CT y y ·
'
1 •
=
r1
=
=
:s;;
i
=
since e;z' = 0 for i > r1 + 1.
IZ' = Z'
450
Chap.
7
M ultivariate Linear Regression Models 7.7.
Suppose the classical regression model is, with rank (Z) = r + 1, written as Y = Z1 /J(2) + fJ(l) + Z 2 (n X l) (n X (q + 1)) ((q + 1 ) X 1) (n X ( r - q)) (( r - q) X 1 ) (n X l) where rank(Z1 ) = q + 1 and rank(Z2) = r - q. If the parameters {J(2) are identified beforehand as being of primary interest, show that a 100(1 - a) % confidence region for {J(2) is given by 6
( P - {J(2) ) ' ( Z � Z 2 - Z � Z1 ( z ; z 1 ) - 1 Z ; z 2 ] ( p - fJ ) .:;;; s 2 (r - q ) Fr - q. n - r - 1 (a)
By Exercise 4.10, with 1 's and 2's interchanged, [ e 2n c 221 2 ] C 22 = [ Z 2' Z 2 - Z 2' Z 1 (Z 1' Z 1 ) - 1 Z 1' z 2] - 1 -1 = where (Z' ) z ' C 1 C Multiply by the square-root matrix ( C 22 ) - 1 12 , and conclude that 22 2 2 ( C ) - 1 1 ( p (2) - {J(2) ) / u is N(O, 1) , so that ( P - fJ ) ' ( C 22 ) - 1 ( p - fJ ) is u 2 x;- q . 7.8. Recall that the hat matrix is defined by H = Z (Z' Z ) - 1 Z' with diagonal ele ments hjj . (a) Show that H is an idempotent matrix. [See Result 7.1 and ( 7 -6 ) ] . n (b) Show that 0 < hjj < 1, j = 1, 2, . . . , n, and that � hjj = r + 1, where r j= l is the number of independent variables in the regression model. (In fact, (1/n) hjj < 1.) (c) Verify, for the simple linear regression model with one independent vari able z, that the leverage, hjj• is given by Hint:
.:;;;
(zj - zY 1 h". . = + n --'-n � (zj - z ) 2 j= l -
7.9.
-
---
Consider the following data on one predictor variable' z1 and two responses Y1 and Y2 : -2 - 1 0 1 2 2 1 4 5 3 Y1 3 2 1 3 1 Y2 Determine the least squares estimates of the parameters in the straight-line regression model Yj l = f3o1 + {3 1 1 Zj t + ej 1 lf2 = f3o2 + f31 2Zj l + ej 2 • j = 1, 2, 3, 4, 5
Chap.
7
Exercises
451 E
Also, calculate the matrices of fitted values Y and residuals with Y = [y1 l y2 ] . Verify the sum of squares and cross products decomposition �
E'E
Y' Y = Y ' Y +
Using the results from Exercise 7.9, calculate each of the following. (a) A 95% confidence interval for the mean response E ( Y01) = /30 1 + /31 1 z0 1 corresponding to z01 = 0.5. (b) A 95% prediction interval for the response Y01 corresponding to z01 = 0.5. (c) A 95% prediction region for the responses Y0 1 and Y0 2 corresponding to z0 1 = 0.5. 7.11. (Generalized least squares for multivariate multiple regression.) Let A be a positive definite matrix, so that dJ (B) = (yj - B' zj ) ' A (yj - B' z) is a squared statistical distance f.rom the jth- 1observation yj to its regression B' zj . Show that the choice B = fJ = (Z' Z) Z' Y minimizes the sum of squared statistical distances, j=Ll df (B), for any choice of positive definite A. Choices for A include I - t and I. Hint: Repeat the steps in (7-40) and (7-41) with I - 1 replaced by A. 7.U. Given the mean vector and covariance matrix of Y, Z1 , and Z2 , 7.10.
n
determine each of the following. (a) The best linear predictor {30 + {31 Z1 + {32 Z2 of Y. (b) The mean square error of the best linear predictor. (c) The population multiple correlation coefficient. (d) The partial correlation coefficient p z z 7.13. The test scores for college students described in Example 5.5 have
[ zz1 ] [ ] [ y
z =
2
:z3
=
527.74 54.69 , 25.13
s =
I
.
2
.
5691.34 600.51 126.05 217.25 23.37 23.11
]
Assume joint normality. (a) Obtain the maximum likelihood estimates of the parameters for predict ing zl from z2 and z3 (b) Evaluate the estimated multiple correlation coefficient Rz z (c) Determine the estimated partial correlation coefficient R z 7.14. Twenty-five portfolio managers were evaluated in terms of their performance. Suppose Y represents the rate of return achieved over a period of time, Z1 is the manager's attitude toward risk measured on a five-point scale 0
1 (Z2'
3) ·
1 ' z2 · z3 .
452
Chap. 7 Multivariate Linear Regression Models
from "very conservative" to "very risky," and Z is years of experience in the investment business. The observed correlation 2coefficients between pairs of variables are Zz zl .82 1 .0 - .35 R = - .35 1.0 - .60 . 82 - .60 1.0 (a) Interpret the sample correlation coefficients ryz1 = - .35 and ryz2 = - . 8 2. (b) Calculate the partial correlation coefficient ryz1 • z2 and interpret this quantity with respect to the interpretation provided for ryz1 in Part a.
[
y
]
The following exercises may require the use of a computer.
Use the real-estate data in Table 7.1 and the linear regression model in Example 7.4. (a) Verify the results in Example 7.4. (b) Analyze the residuals to check the adequacy of the model. (See Section 7.6. ) (c) Generate a 95% prediction interval for the selling price (Y0 ) corresponding to total dwelling size z 1 = 17 and assessed value z2 = 46. (d) Carry out a likelihood ratio test of H0 : {3 2 = 0 with a significance level of a = .05. Should the original model be modified? Discuss. 7.16. Calculate a CP plot corresponding to the possible linear regressions involving the real-estate data in Table 7.1. 7.17. Consider the Fortune 500 data in Exercise 1 .4. (a) Fit a linear regression model to these data using profits as the dependent variable and sales and assets as the independent variables. (b) Analyze the residuals to check the adequacy of the model. Compute the leverages associated with the data points. Does one (or more) of these companies stand out as an outlier in the set of independent variable data points? (c) Generate a 95% prediction interval for profits corresponding to sales of 40,000 (millions of dollars) and assets of 70,000 (millions of dollars). (d) Carry out a likelihood ratio test of H0 : {32 = 0 with a significance level of a = .05. Should the original model be modified? Discuss. 7.18. Calculate a CP plot corresponding to the possible regressions involving the Fortune 500 data in Exercise 1.4. 7.19. Satellite applications motivated the development of a silver-zinc battery. Table 7.4 contains failure data collected to characterize the performance of the battery during its life cycle. Use these data. (a) Find the estimated linear regression of ln(Y) on an appropriate ("best") subset of predictor variables. (b) Plot the residuals from the fitted model chosen in Part a to check the nor mal assumption. 7.15.
Chap.
7
Exercises
BATIERY-FAI LU RE DATA
TABLE 7.4
zl
z3
453 y
z4
EndZs of Depth of charge Charge Discharge discharge to (% of rated Temperature voltage Cycles rate rate (volts) failure CCC) (amps) (amps) ampere-hours) 101. 2.00 40. 60.0 .375 3.13 141. 1.99 30. 76.8 1.000 3.13 96. 2.00 20. 60.0 1.000 3.13 125. 1. 98 20. 60.0 1.000 3.13 43. 2. 0 1 10. 43.2 1.625 3.13 16. 2. 0 0 20. 60.0 1.625 3.13 188. 2. 0 2 20. 60.0 1.625 3.13 10. 2. 0 1 10. 76.8 .375 5.00 3. 1.99 10. 43.2 1.000 5.00 386. 2. 0 1 30. 43.2 1.000 5.00 45. 2.00 100.0 20. 1.000 5.00 2. 1. 99 10. 76.8 1.625 5.00 2. 0 1 76. 10. 76.8 .375 1.25 1.000 1.25 78. 43.2 1. 99 10. 1.000 1.25 76.8 30. 2.00 160. 1.000 1.25 3. 2.00 60.0 0. 216. 43.2 1.625 1.25 30. 1. 99 73 . 2.00 20. 1.625 1.25 60.0 314. .375 3.13 30. 76.8 1.99 .375 3.13 20. 60.0 170. 2.00 Zz
Source: Selected from S. Sidik, H. Leibecki, and J. Bozek,NASA Technical Memorandum 81556 Cleveland: Lewis Research Center, 1980).
Failure of Silver-Zinc Cells
with Competing Failure Modes-Preliminary Data Analysis,
(
Using the battery-failure data in Table 7.4, regress ln(Y) on the first princi pal component of the predictor variables z1 , z2 , , z (See Section 8.3.) Compare the result with the fitted model obtained in Exercise 7.19(a). 7.21. Consider the air-pollution data in Table 1. 3 . Let Y1 = N02 and Y2 = 03 be the two responses (pollutants) corresponding to the predictor variables Z1 = wind and z2 = solar radiation. (a) Perform a regression analysis using only the first response Y (i) Suggest and fit appropriate linear regression models. (ii) Analyze the residuals. (iii) Construct a 95% prediction interval for N02 corresponding to z1 = 10 and z 2 = 80.
7.20.
• • •
5
•
1
•
454
Chap.
7
Multivariate Linear Regression Models
Perform a multivariate multiple regression analysis using both responses (i) Suggest and fit appropriate linear regression models. (ii) Analyze the residuals. (iii) Construct a 95% prediction ellipse for both N02 and 03 for z 1 = 10 and z2 = 80. Compare this ellipse with the prediction interval in Part a (iii). Comment. 7.22. Using the data on bone mineral content in Table 1 .6: (a) Perform a regression analysis by fitting the response for the dominant radius bone to the measurements on the last four bones. (i) Suggest and fit appropriate linear regression models. (ii) Analyze the residuals. (b) Perform a multivariate multiple regression analysis by fitting the responses from both radius bones. 7.23. Using the data on the characteristics of bulls sold at auction in Table 1.8: (a) Perform a regression analysis using the response Y1 = SalePr and the pre dictor variables Breed, YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt. (i) Determine the "best" regression equation by retaining only those predictor variables that are individually significant. (ii) Using the best fitting model, construct a 95% prediction interval for selling price for a set of predictor variable values that are not in the original data set. (iii) Examine the residuals from the best fitting model. (b) Repeat the analysis in Part a, using the natural logarithm of the sales price as the response. That is, set Y1 = Ln(SalePr). Which analysis do you prefer? Why? 7.24. Using the data on the characteristics of bulls sold at auction in Table 1.8: (a) Perform a regression analysis, using only the response Y1 SaleHt and the predictor variables Z1 = YrHgt and Z2 = FtFrBody. (i) Fit an appropriate model and analyze the residuals. (ii) Construct a 95% prediction interval for SaleHt corresponding to z 1 = 50.5 and z 2 = 970. (b) Perform a multivariate regression analysis with the responses Y1 = SaleHt and Y2 = SaleWt and the predictors Z1 = YrHgt and Z2 = FtFrBody. (i) Fit an appropriate multivariate model and analyze the residuals. (ii) Construct a 95% prediction ellipse for both SaleHt and SaleWt for z 1 = 50.5 and z 2 = 970. Compare this ellipse with the prediction interval in Part a (ii). Comment. 7.25 Amitriptyline is prescribed by some physicians as an antidepressant. How ever, there are also conjectured side effects that seem to be related to the use of the drug: irregular heartbeat, abnormal blood pressures, and irregular waves on the electrocardiogram, among other things. Data gathered on 17 (b)
Y1 and Y2•
=
Chap.
7
Exercises
455
TABLE 7.5 AM ITRI PlYLI N E DATA
Yz AMI 3149 653 810 448 844 1450 493 941 547 392 1283 458 722 384 501 405 1520 Source: See [20].
YJ TOT 3389 1101 1131 596 896 1767 807 1111 645 628 1360 652 860 500 781 1070 1754
z GENl 1 1 0 1 1 1 1 0 1 1 1 1 1 0 0 0 1
Zz AMT 7500 1975 3600 675 750 2500 350 1500 375 1050 3000 450 1750 2000 4500 1500 3000
z3 Z4 Zs QRS PR DIAP 220 0 140 200 0 100 205 60 111 160 60 120 83 185 70 180 60 80 154 80 98 93 200 70 137 60 105 74 167 60 180 60 80 160 64 60 135 90 79 160 60 80 180 0 100 170 90 120 180 0 129
patients who were admitted to the hospital after an amitriptyline overdose are given in Table 7. 5 . The two response variables are Y1 = Total TCAD plasma level (TOT) Y2 = Amount of amitriptyline present in TCAD plasma level (AMI) The five predictor variables are Z1 = Gender: 1 if female, 0 if male (GEN) Z2 = Amount of antidepressants taken at time of overdose (AMT) Z3 = PR wave measurement (PR) Z4 = Diastolic blood pressure (DIAP) Z5 = QRS wave measurement (QRS) (a) Perform a regression analysis using only the first response Y1 • (i) Suggest and fit appropriate linear regression models. (ii) Analyze the residuals. (iii) Construct a 95% prediction interval for Total TCAD for z1 = 1, z2 = 1200, z3 = 140, z4 = 70, and z5 = 85.
456
Chap.
7
M ultivariate Linear Regression Models
(b) (c)
Repeat Part a using the second response Y2 • Perform a multivariate multiple regression analysis using both responses Y1 and Y2 • (i) Suggest and fit appropriate linear regression models. (ii) Analyze the residuals. (iii) Construct a 95% prediction ellipse for both Total TCAD and Amount of amitriptyline for z 1 = 1, z2 = 1200, z3 = 140, z4 = 70, and = 85. Compare this ellipse with the prediction intervals in Partsz5a and b. Comment.
REFERENCES
1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Atkinson, A. C. Plots, Transformations and Regression. Oxford, England: Oxford University Press, 1985.
3. Bartlett, M. S. "A Note on Multiplying Factors for Various Chi-Squared Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.
4. Belsley, D. A., E. Kuh, and R. E. Welsch. Regression Diagnostics. New York: John Wiley, 1980.
5. Bowerman, B. L., and R. T. O'Connell. Linear Statistical Models: An Applied Approach (2d ed.). Boston: PWS-Kent, 1990.
6. Box, G. E. P. "A General Distribution Theory for a Class of Likelihood Criteria." Biometrika, 36 (1949), 317-346.
7. Box, G. E. P., G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting and Control (3d ed.). Englewood Cliffs, NJ: Prentice Hall, 1994.
8. Chatterjee, S., and B. Price. Regression Analysis by Example. New York: John Wiley, 1977.
9. Cook, R. D., and S. Weisberg. Residuals and Influence in Regression. London: Chapman and Hall, 1982.
10. Daniel, C., and F. S. Wood. Fitting Equations to Data (2d ed.). New York: John Wiley, 1980.
11. Draper, N. R., and H. Smith. Applied Regression Analysis (2d ed.). New York: John Wiley, 1981.
12. Durbin, J., and G. S. Watson. "Testing for Serial Correlation in Least Squares Regression, II." Biometrika, 38 (1951), 159-178.
13. Galton, F. "Regression Toward Mediocrity in Heredity Stature." Journal of the Anthropological Institute, 15 (1885), 246-263.
14. Goldberger, A. S. Econometric Theory. New York: John Wiley, 1964.
15. Heck, D. L. "Charts of Some Upper Percentage Points of the Distribution of the Largest Characteristic Root." Annals of Mathematical Statistics, 31 (1960), 625-642.
16. Neter, J., M. Kutner, C. Nachtsheim, and W. Wasserman. Applied Linear Regression Models (3d ed.). Chicago: Richard D. Irwin, 1996.
17. Pillai, K. C. S. "Upper Percentage Points of the Largest Root of a Matrix in Multivariate Analysis." Biometrika, 54 (1967), 189-193.
18. Rao, C. R. Linear Statistical Inference and Its Applications (2d ed.). New York: John Wiley, 1973.
19. Seber, G. A. F. Linear Regression Analysis. New York: John Wiley, 1977.
20. Rudorfer, M. V. "Cardiovascular Changes and Plasma Drug Levels after Amitriptyline Overdose." Journal of Toxicology-Clinical Toxicology, 19 (1982), 67-71.
21. Timm, N. H. Multivariate Analysis with Applications in Education and Psychology. Monterey, CA: Brooks/Cole, 1975.
CHAPTER
8
Principal Components

8.1 INTRODUCTION
A principal component analysis is concerned with explaining the variance-covari ance structure of a set of variables through a few linear combinations of these vari ables. Its general objectives are (1) data reduction and (2) interpretation. Although p components are required to reproduce the total system variabil ity, often much of this variability can be accounted for by a small number k of the principal components. If so, there is (almost) as much information in the k com ponents as there is in the original p variables. The k principal components can then replace the initial p variables, and the original data set, consisting of n measure ments on p variables, is reduced to a data set consisting of n measurements on k principal components. An analysis of principal components often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result. A good example of this is provided by the stock market data discussed in Example 8.5. Analyses of principal components are more of a means to an end rather than an end in themselves, because they frequently serve as intermediate steps in much larger investigations. For example, principal components may be inputs to a mul tiple regression (see Chapter 7) or cluster analysis (see Chapter 12). Moreover, (scaled) principal components are one "factoring" of the covariance matrix for the factor analysis model considered in Chapter 9. 8.2 POPULATION PRINCIPAL COMPONENTS
Algebraically, principal components are particular linear combinations of the p random variables X1 , X2 , XP . Geometrically, these linear combinations repre sent the selection of a new coordinate system obtained by rotating the original sys. . • ,
458
Sec.
8.2
Population Principal Components
459
tem with X1, X2, ..., Xp as the coordinate axes. The new axes represent the directions with maximum variability and provide a simpler and more parsimonious description of the covariance structure.
As we shall see, principal components depend solely on the covariance matrix Σ (or the correlation matrix ρ) of X1, X2, ..., Xp. Their development does not require a multivariate normal assumption. On the other hand, principal components derived for multivariate normal populations have useful interpretations in terms of the constant-density ellipsoids. Further, inferences can be made from the sample components when the population is multivariate normal. (See Section 8.5.)
Let the random vector X' = [X1, X2, ..., Xp] have the covariance matrix Σ with eigenvalues λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0.
Consider the linear combinations

    Y1 = a1'X = a11X1 + a12X2 + ... + a1pXp
    Y2 = a2'X = a21X1 + a22X2 + ... + a2pXp
    ...
    Yp = ap'X = ap1X1 + ap2X2 + ... + appXp    (8-1)
• •
;;:;.
;;:;.
Then, using (2-45), we obtain Var(Y;) = a;Ia; i = 1 , 2, . . . , p (8-2) Cov(Y;, Yk) = a;Iak i, k = 1 ' 2, . . . ' p (8-3) The principal components are those uncorrelated linear combinations Y1 , Y2 , ••• , YP whose variances in (8-2) are as large as possible. The first principal component is the linear combination with maximum vari ance. That is, it maximizes Var(Y1 ) = a{Ia1 • It is clear that Var(Y1 ) = a{Ia1 can be increased by multiplying any a1 by some constant. To eliminate this inde terminacy, it is convenient to restrict attention to coefficient vectors of unit length. We therefore define First principal component = linear combination a{ X that maximizes Var (a{ X) subject to a{ a1 = 1 Second principal component = linear combination a{ X that maximizes Var (a� X) subject to a� a2 = 1 and Cov(a{X, a�X) = 0 At the ith step, ith principal component = linear combination a;x that maximizes Var(a;x) subject to a; a; = 1 and Cov(a;x, a�X) = 0 for k < i
460
Chap.
8
Principal Components
Result 8. 1 . Let I be the covariance matrix associated with the random vec tor X' = [X1 , X2 , , XP] . Let I have the eigenvalue-eigenvector pairs (A1 , e1 ) , A2 (A2 , e2 ) , , (Ap , ep ) where A 1 0 . Then the ith principal com AP ponent is given by Y; = e; x = ei l X1 + e;2X2 + · · · + e;P XP , i = 1, 2, . . , p (8-4) With these choices, Var(Y;) = e; Ie; = A; i = 1, 2, , p (8-5) Cov(Y;, Yk) = e; Iek = 0 i * k If some A; are equal, the choices of the corresponding coefficient vectors e;, and hence Y;, are not unique. Proof. We know from (2-51), with B = I, that a'Ia = Al (attained when a = e1 ) max a'a O * a But e{ e 1 = 1 since the eigenvectors are normalized. Thus, • . .
;a.
• • •
;a.
• • •
;a.
;a.
.
. . .
--
Similarly, using (2-52), we get = 1, 2, ... , p - 1 Forthechoicea = ek + 1 , with e� + 1 e; = O, for i = 1,2, ... , k andk = 1, 2, . , p - 1, e� + 1 Iek + t f e� + t ek + l = e� + 1 Iek + t = Var(Yk +t ) But e� + 1 (Iek + l ) = Ak + l e� + l ek + l = Ak + 1 so Var(Yk + d = Ak + t · It remains to show that e; perpendicular to ek (that is, ef ek = 0, i i= k) gives Cov(Y;, Yk) = 0. Now, the eigenvectors of I are orthogonal if all the eigenvalues A 1 , A 2 , , AP are distinct. If the eigenvalues are not all distinct, the eigenvectors corresponding to common eigenvalues may be chosen to be orthogonal. Therefore, for any two eigen vectors e; and ek , ef ek = 0, i i= k. Since Iek = Akek , premultiplication by e/ gives Cov(Y;, Yk) = ef iek = ef Akek = Ake/ ek = 0 for any i i= k, and the proof is complete. From Result 8.1, the principal components are uncorrelated and have vari ances equal to the eigenvalues of I. k
. .
. • .
•
Sec.
8.2
Population Principal Components
461
Result 8.2. Let X' = [X , X2 , , X ] have covariance matrix I, with eigen value-eigenvector pairs (A1 , e11 ) , (A2 , e2 ) ,p , ( Ap , ep ) where A 1 ;;;:.: A2 ;;;:.: AP ;;;:.: 0. Let Y1 = e{ X, Y2 = e� X, . . . , YP = e� X be the principal components. Then p p + uPP = :L Var(X;) = A 1 + A2 + + A P = :L Var(Y;) u 1 1 + u22 + i= l i= l Proof. From Definition 2A. 2 8, u + u22 + + u = tr(I). From (2-20) with A = I, we can write I = PAP'1 1 where A is the PPdiagonal matrix of eigenvalues and P = [e1 , e2 , , ep] so that PP' = P'P = I. Using Result 2A.12(c), we have tr(I) = tr(PAP') = tr(AP'P) = tr(A) = A 1 + A2 + + AP Thus, p p :L Var(X;) = tr(I) = tr(A) = :L Var(Y;) i= l i= l Result 8.2 says that Total population variance = u1 1 + u22 + + uPP (8-6) = A 1 + A2 + + AP and consequently, the proportion of total variance due to (explained by) the kth principal component is Proportion of total population variance = Ak due to kth principal A 1 + A2 + + AP k = l, 2, . . , p (8-7) component If most (for instance, 80 to 90%) of the total population variance, for large p, can be attributed to the first one, two, or three components, then these components can "replace" the original p variables without much losse of information. Each component of the coefficient vector e/ = [ i l , . . . , ei k • . . . , e;p ] also mer its inspection. The magnitude of eik measures the importance of the kth variable to the ith principal component, irrespective of the other variables. In particular, e; k is proportional to the correlation coefficient between Y; and Xk . Result 8.3. If Y = e{ X, Y2 = e� X, . . . , YP = e� X are the principal com ponents obtained from 1the covariance matrix I, then e . k \IA, i, k = 1, 2, . . . , p (8-8) P V CTkk ···
• • •
• • •
;;;:.:
···
···
···
• • •
···
•
(
···
)
Y;. X•
-
···
l l . �
···
.
462
Chap.
8
Principal Components
are the correlation coefficients between the components Y; and the variables Xk . Here (A 1 , e 1 ) , (A2 , e2 ) , , (Ap , ep ) are the eigenvalue-eigenvector pairs for I. Proof. Set a� = [0, ... , 0, 1, 0, ... , 0] so that Xk = a�X and Cov(Xk , Y;) = Cov( a� X, e; x) = a� Ie ; , according to (2-45). Since Ie ; = A;e;, Cov(Xk , Y;) a� A ; e ; = A;e;k · Then Var(Y;) = A; [see (8-5)] and Var(Xd = ukk yield Cov(Y;, Xk ) i, k = 1, 2, .. . ' p Pv,, x. _v'var(Y;) v'var(Xk ) . • •
•
Although the correlations of the variables with the principal components often help to interpret the components, they measure only the univariate contri bution of an individual X to a component Y. That is, they do not indicate the importance of an X to a component Yin the presence of the other X's. For this rea son, some statisticians (see, for example, Rencher [17]) recommend that only the coefficients ei k • and not the correlations, be used to interpret the components. Although the coefficients and the correlations can lead to different rankings as measures of the importance of the variables toc a given component, it is our expe rience that these rankings are often not appre iably different. In practice, variables with relatively large coefficients (in absolute value) tend to have relatively large correlations, so the two measures of importance, the first multivariate and the sec ond univariate, frequently give similar results. We recommend that both the coef ficients and the correlations be examined to help interpret the principal components. hypothetical example illustrates the contents of Results 8. 1 , 8.2, andThe8.following 3.
Exampl e
8. 1
(Calculating the population principal components)
Suppose the random variables X1 , X2 and X3 have the covariance matrix
It may be verified that the eigenvalue-eigenvector pairs are A1 =
5.83, A2 = 2. 0 0, A3 = 0. 1 7,
e{ =
[. 383, -.924, 0] e� = [0, 0, 1] e; = [. 924, . 3 83, 0]
Therefore, the principal components become
Sec.
8.2
Population Principal Components
463
Y1 = e{ X = .383X1 - . 924X2 The variable X is one of the principal components, because it is uncorrelated with the other 3two variables. Equation (8-5) can be demonstrated from first principles. For example, Var(Y1 ) = Var(.383X1 - . 924X2 ) = (. 383) 2 Var(X1 ) + (- . 924) 2 Var(X2 ) + 2(.383) (-. 924)Cov(X1 , X2 ) = .147(1) + . 854(5) - . 7 08( - 2) = 5. 83 = A 1 Cov ( Y1 , Y2 ) = Cov ( .383X1 - .924X2 , X3 ) = .383 Cov (X1 , X3 ) - . 924 Cov (X2 , X3 ) = . 3 83(0) - .924(0) = 0 It is also readily apparent that a11 + a22 + a33 = 1 + 5 + 2 = A 1 + A2 + A3 = 5.83 + 2.00 + .17 validating Equation (8-6) for this example. The proportion of total variance accounted for by the first principal component is A 1 /(A 1 + A 2 + A 3 ) = 5. 83/8 = .73. Further, the first two components account for a proportion (5. 83 + 2)/8 = . 98 ofthe population variance. In this case, the components Y1 and Y2 could replace the original three variables with little loss of information. Next, using (8-8), we obtain .383� = . 925 eu = Py,, x, = � 11 e 1 � = -. 924 V5.83 = 998 Py, , x, = 2vo:;; Vs · Notice here that the variable X , with coefficient -. 924, receives the greatest weight in the component Y1 .2 It also has the largest corr�lation (in absolute value) with Y1 . The correlation of X1 , with Y1 , .925, is almost as large as that for X2 , indicating that the variables are about equally important to the first principal component. The relative sizes of the coefficients of X1 ya::,
_
464
Chap.
8
Principal Components
and X2 suggest, however, that X2 contributes more to the determination of Y1 than does X1 • Since, in this case, both coefficients are reasonably large and they have opposite signs, we would argue that both variables aid in the inter pretation of Y1 • Finally, � V2 Py2· x, = p Y2. x2 = 0 and py2· x, = -�=\12 = 1 (as it should) The remaining correlations can be neglected, since the third component is • unimportant. It is informative to consider principal components derived from multivariate normal random variables. Suppose X is distributed as Np (p, I). We know from (4-7) that the density of X is constant on the p centered ellipsoids 1 2 ( x - p ) 1 I - ( x - p) = c which have axes ± c V"i:; ei, i = 1, 2, . . . , p, where the ( Ai , e i ) are the eigen value-eigenvector pairs of I. A point lying on the ith axis of the ellipsoid will have coordinates proportional to e; = [ei l , ei2 , . . . , eip ] in the coordinate system that has origin p and axes that are parallel to the original axes x1 , x2 , . . . , xP . It will be con venient to set p = 0 in the argument that follows. 1 1 From our discussion in Section 2.3 with A = I - , we can write 2 2 2 2 c = X1 I - 1 x = : ( e{ x ) + : ( e� x ) + + : ( e; x ) 2 p I where e{ x, e� x, . . . , e; x are recognized as the principal components of x. Setting y1 = e1 x , y2 = e2 x , . . . , yP = eP x , we h ave ···
1
I
I
and this equation defines an ellipsoid (since A , A 2 , , AP are positive) in a coor dinate system with axes y1 , y2 , , Yp lying in the1 directions of e 1 , e 2 , , eP , respec tively. If A1 is the largest eigenvalue, then the major axis lies in the direction e 1 • The remaining minor axes lie in the directions defined by e2 , , eP . To summarize, the principal components y1 = e{ x, y2 = e� x, . . . , Y = e; x lie in the directions of the axes of a constant density ellipsoid. Therefore, anyp point on the i th ellipsoid axis has x coordinates proportional to e; = [ei ei2 , , eip ] and, necessarily, principal component coordinates of the form [0, . . . , 0, Yi • 0, . . . , 0] . When p 0, it is the mean-centered principal component y1 = e; (x - p) that has mean 0 and lies in the direction ei. • • •
• . .
• • •
• • •
1 ,
• • •
¥-
1
This can b e done without loss o f generality because the norma! random vector X can always be translated to the normal random vector W == X - p. and E (W) = 0 . However Cov(X) = Cov(W).
Sec.
8. 2
Population Principal Components
465
Figure 8.1 The constant density ellipse x ' l: - 1 x = c2 and the principal components y1 , y2 for a bivariate normal random vector X having mean 0.
Jt = O p = .75
A constant density ellipse and the principal components for a bivariate nor mal random vector with p 0 and .75 are shown in Figure 8.1. We see that the principal components are obtained by rotating the original coordinate axes through an angle ()until they coincide with the axes of the constant density ellipse. This result holds for p > 2 dimensions as well. =
p =
Principal Components Obtained from Standard ized Variables
Principal components may also be obtained for the standardized variables 22
In matrix notation,
=
� (Xz - J.Lz ) vo:;
(8-9)
(8-10) where the diagonal standard deviation matrix V 112 is defined in (2-35). Clearly, E(Z) = 0 and
466
Chap.
8
Principal Components
by (2-37). The principal components of Z may be obtained from the eigenvectors of the correlation matrix ρ of X. All our previous results apply, with some simplifications, since the variance of each Zᵢ is unity. We shall continue to use the notation Yᵢ to refer to the ith principal component and (λᵢ, eᵢ) for the eigenvalue-eigenvector pair from either ρ or Σ. However, the (λᵢ, eᵢ) derived from Σ are, in general, not the same as the ones derived from ρ.

Result 8.4. The ith principal component of the standardized variables Z′ = [Z₁, Z₂, …, Zₚ] with Cov(Z) = ρ, is given by

Yᵢ = eᵢ′Z = eᵢ′(V^{1/2})⁻¹(X − μ),  i = 1, 2, …, p

Moreover,

Σᵢ₌₁ᵖ Var(Yᵢ) = Σᵢ₌₁ᵖ Var(Zᵢ) = p     (8-11)

and

ρ_{Yᵢ,Z_k} = e_{ik} √λᵢ,  i, k = 1, 2, …, p

In this case, (λ₁, e₁), (λ₂, e₂), …, (λₚ, eₚ) are the eigenvalue-eigenvector pairs for ρ, with λ₁ ≥ λ₂ ≥ ⋯ ≥ λₚ ≥ 0.

Proof. Result 8.4 follows from Results 8.1, 8.2, and 8.3, with Z₁, Z₂, …, Zₚ in place of X₁, X₂, …, Xₚ and ρ in place of Σ. •

We see from (8-11) that the total (standardized variables) population variance is simply p, the sum of the diagonal elements of the matrix ρ. Using (8-7) with Z in place of X, we find that the proportion of total variance explained by the kth principal component of Z is

(Proportion of (standardized) population variance due to kth principal component) = λ_k/p,  k = 1, 2, …, p     (8-12)

where the λ_k's are the eigenvalues of ρ.

Example 8.2 (Principal components obtained from covariance and correlation matrices are different)
Consider the covariance matrix

Σ = [1   4
     4 100]

and the derived correlation matrix

ρ = [ 1  .4
     .4   1]

The eigenvalue-eigenvector pairs from Σ are

λ₁ = 100.16,  e₁′ = [.040, .999]
λ₂ = .84,     e₂′ = [.999, −.040]

Similarly, the eigenvalue-eigenvector pairs from ρ are

λ₁ = 1 + ρ = 1.4,  e₁′ = [.707, .707]
λ₂ = 1 − ρ = .6,   e₂′ = [.707, −.707]

The respective principal components become

Σ:  Y₁ = .040X₁ + .999X₂
    Y₂ = .999X₁ − .040X₂

and

ρ:  Y₁ = .707Z₁ + .707Z₂ = .707(X₁ − μ₁)/1 + .707(X₂ − μ₂)/10 = .707(X₁ − μ₁) + .0707(X₂ − μ₂)
    Y₂ = .707Z₁ − .707Z₂ = .707(X₁ − μ₁) − .0707(X₂ − μ₂)

Because of its large variance, X₂ completely dominates the first principal component determined from Σ. Moreover, this first principal component explains a proportion

λ₁/(σ₁₁ + σ₂₂) = 100.16/101 = .992

of the total population variance.
When the variables X₁ and X₂ are standardized, however, the resulting variables contribute equally to the principal components determined from ρ. Using Result 8.4, we obtain

ρ_{Y₁,Z₁} = e₁₁ √λ₁ = .707 √1.4 = .837

and

ρ_{Y₁,Z₂} = e₁₂ √λ₁ = .707 √1.4 = .837

In this case, the first principal component explains a proportion

λ₁/p = 1.4/2 = .7

of the total (standardized) population variance.
Most strikingly, we see that the relative importance of the variables to, for instance, the first principal component is greatly affected by the standardization. When the first principal component obtained from ρ is expressed in terms of X₁ and X₂, the relative magnitudes of the weights .707 and .0707 are in direct opposition to those of the weights .040 and .999 attached to these variables in the principal component obtained from Σ. •

The preceding example demonstrates that the principal components derived from Σ are different from those derived from ρ. Furthermore, one set of principal components is not a simple function of the other. This suggests that the standardization is not inconsequential.
Variables should probably be standardized if they are measured on scales with widely differing ranges or if the units of measurement are not commensurate. For example, if X₁ represents annual sales in the $10,000 to $350,000 range and X₂ is the ratio (net annual income)/(total assets) that falls in the .01 to .60 range, then the total variation will be due almost exclusively to dollar sales. In this case, we would expect a single (important) principal component with a heavy weighting of X₁. Alternatively, if both variables are standardized, their subsequent magnitudes will be of the same order, and X₂ (or Z₂) will play a larger role in the construction of the components. This behavior was observed in Example 8.2.
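The eigenvalue-eigenvector pairs quoted in Example 8.2 can be reproduced with a short computation. The following NumPy sketch (not part of the original text) works directly from Σ and ρ as given above; note that the signs of the computed eigenvectors are arbitrary.

    import numpy as np

    # Covariance and correlation matrices from Example 8.2
    Sigma = np.array([[1.0, 4.0],
                      [4.0, 100.0]])
    rho = np.array([[1.0, 0.4],
                    [0.4, 1.0]])

    for name, M in [("Sigma", Sigma), ("rho", rho)]:
        vals, vecs = np.linalg.eigh(M)            # eigenvalues in ascending order
        order = np.argsort(vals)[::-1]            # sort descending
        vals, vecs = vals[order], vecs[:, order]
        print(name, "eigenvalues:", np.round(vals, 3))
        print(name, "first eigenvector:", np.round(vecs[:, 0], 3))
        print(name, "proportion explained by Y1:", round(vals[0] / vals.sum(), 3))

Running the sketch gives eigenvalues (100.16, .84) with proportion .992 for Σ, and (1.4, .6) with proportion .7 for ρ, matching the values in the example.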
Principal Components for Covariance Matrices with Special Structures
There are certain patterned covariance and correlation matrices whose principal components can be expressed in simple forms. Suppose Σ is the diagonal matrix

Σ = [σ₁₁  0  ⋯  0
      0  σ₂₂ ⋯  0
      ⋮          ⋮
      0   0  ⋯ σ_pp]     (8-13)

Setting eᵢ′ = [0, …, 0, 1, 0, …, 0], with 1 in the ith position, we observe that
Σ eᵢ = [0, …, 0, 1·σᵢᵢ, 0, …, 0]′ = σᵢᵢ eᵢ

or

Σ eᵢ = σᵢᵢ eᵢ
and we conclude that (σᵢᵢ, eᵢ) is the ith eigenvalue-eigenvector pair. Since the linear combination eᵢ′X = Xᵢ, the set of principal components is just the original set of uncorrelated random variables.
For a covariance matrix with the pattern of (8-13), nothing is gained by extracting the principal components. From another point of view, if X is distributed as N_p(μ, Σ), the contours of constant density are ellipsoids whose axes already lie in the directions of maximum variation. Consequently, there is no need to rotate the coordinate system.
Standardization does not substantially alter the situation for the Σ in (8-13). In that case, ρ = I, the p × p identity matrix. Clearly, ρeᵢ = 1eᵢ, so the eigenvalue 1 has multiplicity p and eᵢ′ = [0, …, 0, 1, 0, …, 0], i = 1, 2, …, p, are convenient choices for the eigenvectors. Consequently, the principal components determined from ρ are also the original variables Z₁, …, Zₚ. Moreover, in this case of equal eigenvalues, the multivariate normal ellipsoids of constant density are spheroids.
Another patterned covariance matrix, which often describes the correspondence among certain biological variables such as the sizes of living things, has the general form
Σ = [σ²   ρσ²  ⋯  ρσ²
     ρσ²  σ²   ⋯  ρσ²
      ⋮              ⋮
     ρσ²  ρσ²  ⋯  σ² ]     (8-14)

The resulting correlation matrix

ρ = [1  ρ  ⋯  ρ
     ρ  1  ⋯  ρ
     ⋮        ⋮
     ρ  ρ  ⋯  1]     (8-15)

is also the covariance matrix of the standardized variables. The matrix in (8-15) implies that the variables X₁, X₂, …, Xₚ are equally correlated.
It is not difficult to show (see Exercise 8.5) that the p eigenvalues of the correlation matrix (8-15) can be divided into two groups. When ρ is positive, the largest is
λ₁ = 1 + (p − 1)ρ     (8-16)

with associated eigenvector

e₁′ = [1/√p, 1/√p, …, 1/√p]

The remaining p − 1 eigenvalues are

λ₂ = λ₃ = ⋯ = λₚ = 1 − ρ     (8-17)

and one choice for their eigenvectors is

e₂′ = [1/√(1·2), −1/√(1·2), 0, …, 0]
e₃′ = [1/√(2·3), 1/√(2·3), −2/√(2·3), 0, …, 0]
  ⋮
eᵢ′ = [1/√((i−1)i), …, 1/√((i−1)i), −(i−1)/√((i−1)i), 0, …, 0]
  ⋮
eₚ′ = [1/√((p−1)p), …, 1/√((p−1)p), −(p−1)/√((p−1)p)]

The first principal component

Y₁ = e₁′X = (1/√p) Σᵢ₌₁ᵖ Xᵢ

is proportional to the sum of the p original variables. It might be regarded as an "index" with equal weights. This principal component explains a proportion

λ₁/p = (1 + (p − 1)ρ)/p = ρ + (1 − ρ)/p     (8-18)

of the total population variation. We see that λ₁/p ≈ ρ for ρ close to 1 or p large. For example, if ρ = .80 and p = 5, the first component explains 84% of the total variance. When ρ is near 1, the last p − 1 components collectively contribute very little to the total variance and can often be neglected.
If the standardized variables Z₁, Z₂, …, Zₚ have a multivariate normal distribution with a covariance matrix given by (8-15), then the ellipsoids of constant density are "cigar shaped," with the major axis proportional to the first principal component Y₁ = (1/√p)[1, 1, …, 1]X. This principal component is the projection of X on the equiangular line 1′ = [1, 1, …, 1]. The minor axes (and remaining principal components) occur in spherically symmetric directions perpendicular to the major axis (and first principal component).
8.3 SUMMARIZING SAMPLE VARIATION BY PRINCIPAL COMPONENTS
We now have the framework necessary to study the problem of summarizing the variation in n measurements on p variables with a few judiciously chosen linear combinations. Suppose the data x₁, x₂, …, xₙ represent n independent drawings from some p-dimensional population with mean vector μ and covariance matrix Σ. These data yield the sample mean vector x̄, the sample covariance matrix S, and the sample correlation matrix R.
Our objective in this section will be to construct uncorrelated linear combinations of the measured characteristics that account for much of the variation in the sample. The uncorrelated combinations with the largest variances will be called the sample principal components.
Recall that the n values of any linear combination a₁′xⱼ, j = 1, 2, …, n, have sample mean a₁′x̄ and sample variance a₁′Sa₁. Also, the pairs of values (a₁′xⱼ, a₂′xⱼ), for two linear combinations, have sample covariance a₁′Sa₂ [see (3-36)].
The sample principal components are defined as those linear combinations which have maximum sample variance. As with the population quantities, we restrict the coefficient vectors aᵢ to satisfy aᵢ′aᵢ = 1. Specifically,

First sample principal component = linear combination a₁′xⱼ that maximizes the sample variance of a₁′xⱼ subject to a₁′a₁ = 1

Second sample principal component = linear combination a₂′xⱼ that maximizes the sample variance of a₂′xⱼ subject to a₂′a₂ = 1 and zero sample covariance for the pairs (a₁′xⱼ, a₂′xⱼ)

At the ith step, we have

ith sample principal component = linear combination aᵢ′xⱼ that maximizes the sample variance of aᵢ′xⱼ subject to aᵢ′aᵢ = 1 and zero sample covariance for all pairs (aᵢ′xⱼ, a_k′xⱼ), k < i

The first principal component maximizes a₁′Sa₁ or, equivalently,

a₁′Sa₁ / a₁′a₁     (8-19)
By (2-51), the maximum is the largest eigenvalue λ̂₁, attained for the choice a₁ = eigenvector ê₁ of S. Successive choices of aᵢ maximize (8-19) subject to 0 = aᵢ′Sê_k = aᵢ′λ̂_k ê_k, or aᵢ perpendicular to ê_k. Thus, as in the proofs of Results 8.1-8.3, we obtain the following results concerning sample principal components:
If S = {s_ik} is the p × p sample covariance matrix with eigenvalue-eigenvector pairs (λ̂₁, ê₁), (λ̂₂, ê₂), …, (λ̂ₚ, êₚ), the ith sample principal component is given by

ŷᵢ = êᵢ′x = êᵢ₁x₁ + êᵢ₂x₂ + ⋯ + êᵢₚxₚ,  i = 1, 2, …, p

where λ̂₁ ≥ λ̂₂ ≥ ⋯ ≥ λ̂ₚ ≥ 0 and x is any value of the variables X₁, X₂, …, Xₚ. Also,

Sample variance(ŷ_k) = λ̂_k,  k = 1, 2, …, p
Sample covariance(ŷᵢ, ŷ_k) = 0,  i ≠ k

In addition,

Total sample variance = Σᵢ₌₁ᵖ sᵢᵢ = λ̂₁ + λ̂₂ + ⋯ + λ̂ₚ

and

r_{ŷᵢ, x_k} = êᵢ_k √λ̂ᵢ / √s_kk,  i, k = 1, 2, …, p     (8-20)
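Result (8-20) is easy to verify numerically. The sketch below is not from the original text, and the data matrix in it is made up purely for illustration; it computes the sample principal components from S and checks that their sample variances are the eigenvalues, that they are uncorrelated, and that the total sample variance is preserved.

    import numpy as np

    # Illustrative data matrix X (n x p); any data set could be substituted.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3)) @ np.array([[2.0, 0.5, 0.0],
                                             [0.0, 1.0, 0.3],
                                             [0.0, 0.0, 0.5]])

    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                  # sample covariance matrix (divisor n - 1)
    lam, e = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]
    lam, e = lam[order], e[:, order]             # (lambda-hat_i, e-hat_i), i = 1, ..., p

    y = (X - xbar) @ e                           # sample principal components e_i'(x_j - xbar)
    print(np.round(lam, 3))                      # sample variances of the components
    print(np.round(np.cov(y, rowvar=False), 3))  # diagonal = lam, off-diagonals = 0
    print(round(lam.sum(), 3), round(np.trace(S), 3))  # total sample variance preserved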
SUPPLEMENT 8A: THE GEOMETRY OF THE SAMPLE PRINCIPAL COMPONENT APPROXIMATION

For U = Û, Σⱼ₌₁ⁿ tr[ÛÛ′(xⱼ − x̄)(xⱼ − x̄)′] = (n − 1) tr[Û′SÛ], so tr[Û′SÛ] = λ̂₁ + λ̂₂ + ⋯ + λ̂ᵣ. Also,

Σⱼ₌₁ⁿ (xⱼ − x̄)′(xⱼ − x̄) = tr[ Σⱼ₌₁ⁿ (xⱼ − x̄)(xⱼ − x̄)′ ] = (n − 1) tr(S) = (n − 1)(λ̂₁ + λ̂₂ + ⋯ + λ̂ₚ)

Let U = Û in (8A-3), and the error bound follows. •
The p-Dimensional Geometrical Interpretation
The geometrical interpretations involve the determination of best approximating planes to the p-dimensional scatter plot. The plane through the origin, determined by u₁, u₂, …, u_r, consists of all points x with

x = b₁u₁ + b₂u₂ + ⋯ + b_r u_r = Ub,  for some b

This plane, translated to pass through a, becomes a + Ub for some b.
We want to select the r-dimensional plane a + Ub that minimizes the sum of squared distances Σⱼ₌₁ⁿ dⱼ² between the observations xⱼ and the plane. If xⱼ is approximated by a + Ubⱼ with Σⱼ₌₁ⁿ bⱼ = 0,⁵ then

⁵ If Σⱼ₌₁ⁿ bⱼ = nb̄ ≠ 0, use a + Ubⱼ = (a + Ub̄) + U(bⱼ − b̄) = a* + Ubⱼ*.
Σⱼ₌₁ⁿ (xⱼ − a − Ubⱼ)′(xⱼ − a − Ubⱼ)
  = Σⱼ₌₁ⁿ (xⱼ − x̄ − Ubⱼ + x̄ − a)′(xⱼ − x̄ − Ubⱼ + x̄ − a)
  = Σⱼ₌₁ⁿ (xⱼ − x̄ − Ubⱼ)′(xⱼ − x̄ − Ubⱼ) + n(x̄ − a)′(x̄ − a)
  ≥ Σⱼ₌₁ⁿ (xⱼ − x̄ − ÊᵣÊᵣ′(xⱼ − x̄))′(xⱼ − x̄ − ÊᵣÊᵣ′(xⱼ − x̄))

by Result 8A.1, since [Ub₁, …, Ubₙ] = A′ has rank(A) ≤ r. The lower bound is reached by taking a = x̄ and Ubⱼ = ÊᵣÊᵣ′(xⱼ − x̄), so the plane passes through the sample mean. This plane is determined by ê₁, ê₂, …, ê_r. The coefficients of ê_k are ê_k′(xⱼ − x̄) = ŷⱼ_k, the kth sample principal component evaluated at the jth observation.
The approximating plane interpretation of sample principal components is illustrated in Figure 8.10.
An alternative interpretation can be given. The investigator places a plane through x̄ and moves it about to obtain the largest spread among the shadows of the observations. From (8A-2), the projection of the deviation xⱼ − x̄ on the plane Ub is vⱼ = UU′(xⱼ − x̄). Now, v̄ = 0 and the sum of the squared lengths of the
projection deviations

Σⱼ₌₁ⁿ vⱼ′vⱼ = Σⱼ₌₁ⁿ (xⱼ − x̄)′UU′(xⱼ − x̄) = (n − 1) tr[U′SU]

is maximized by U = Ê.

Figure 8.10 The r = 2-dimensional plane that approximates the scatter plot by minimizing Σⱼ₌₁ⁿ dⱼ².

Also, since v̄ = 0,
(n − 1)Sᵥ = Σⱼ₌₁ⁿ (vⱼ − v̄)(vⱼ − v̄)′ = Σⱼ₌₁ⁿ vⱼvⱼ′, and this plane also maximizes the total variance.
The n-Dimensional Geometrical Interpretation
Let us now consider, by columns, the approximation of the mean-centered data matrix by A. For r = 1, the ith column [x₁ᵢ − x̄ᵢ, x₂ᵢ − x̄ᵢ, …, xₙᵢ − x̄ᵢ]′ is approximated by a multiple cᵢb of a fixed vector b = [b₁, b₂, …, bₙ]′. The square of the length of the error of approximation is

Lᵢ² = Σⱼ₌₁ⁿ (xⱼᵢ − x̄ᵢ − cᵢbⱼ)²
Considering A (n × p) to be of rank one, we conclude from Result 8A.1 that

minimizes the sum of squared lengths Σᵢ₌₁ᵖ Lᵢ². That is, the best direction is determined by the vector of values of the first principal component. This is illustrated in Figure 8.11(a) on page 503. Note that the longer deviation vectors (the larger sᵢᵢ's) have the most influence on the minimization of Σᵢ₌₁ᵖ Lᵢ².
If the variables are first standardized, the resulting vector [(x₁ᵢ − x̄ᵢ)/√sᵢᵢ, (x₂ᵢ − x̄ᵢ)/√sᵢᵢ, …, (xₙᵢ − x̄ᵢ)/√sᵢᵢ]′ has length √(n − 1) for all variables, and each vector exerts equal influence on the choice of direction. [See Figure 8.11(b).]
In either case, the vector b is moved around in n-space to minimize the sum of the squares of the distances Σᵢ₌₁ᵖ Lᵢ², where Lᵢ² is the squared distance between [x₁ᵢ − x̄ᵢ, x₂ᵢ − x̄ᵢ, …, xₙᵢ − x̄ᵢ]′ and its projection on the line determined by b. The second principal component minimizes the same quantity among all vectors perpendicular to the first choice.
Figure 8.11 The first sample principal component, ŷ₁, minimizes the sum of the squares of the distances, Lᵢ², from the deviation vectors, dᵢ = [x₁ᵢ − x̄ᵢ, x₂ᵢ − x̄ᵢ, …, xₙᵢ − x̄ᵢ]′, to a line. (a) Principal component of S. (b) Principal component of R.
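The n-dimensional interpretation above can also be checked numerically. In the sketch below the data and the helper function total_sq_dist are illustrative inventions, not material from the text; the check is that, among many candidate directions b in n-space, the vector of first sample principal component values gives the smallest total squared distance from the deviation vectors to their projections.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(15, 4)) @ rng.normal(size=(4, 4))   # illustrative data, n = 15, p = 4
    D = X - X.mean(axis=0)                  # columns of D are the deviation vectors d_i in n-space

    def total_sq_dist(b, D):
        u = b / np.linalg.norm(b)
        proj = np.outer(u, u @ D)           # projection of each d_i on the line spanned by b
        return ((D - proj) ** 2).sum()      # sum over variables of L_i^2

    lam, e = np.linalg.eigh(np.cov(X, rowvar=False))
    y1 = D @ e[:, np.argmax(lam)]           # values of the first sample principal component

    best = total_sq_dist(y1, D)
    others = [total_sq_dist(rng.normal(size=15), D) for _ in range(1000)]
    print(best <= min(others))              # True: the first component gives the best-fitting line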
EXERCISES
8.1.
Determine the population principal components Y1 and Y2 for the covari ance matrix
Also, calculate the proportion of the total population variance explained by the first principal component.
8.2. Convert the covariance matrix in Exercise 8.1 to a correlation matrix ρ.
(a) Determine the principal components Y₁ and Y₂ from ρ and compute the proportion of total population variance explained by Y₁.
(b) Compare the components calculated in Part a with those obtained in Exercise 8.1. Are they the same? Should they be?
(c) Compute the correlations ρ_{Y₁,Z₁}, ρ_{Y₁,Z₂}, and ρ_{Y₂,Z₁}.
8.3. Let
Determine the principal components Y1 , Y2 , and Y3 • What can you say about the eigenvectors (and principal components) associated with eigenvalues that are not distinct?
8.4. Find the principal components and the proportion of the total population variance explained by each when the covariance matrix is

−1/√2 < ρ < 1/√2
8.5. (a)
Find the eigenvalues of the correlation matrix
Are your results consistent with (8-16) and (8-17)?
(b) Verify the eigenvalue-eigenvector pairs for the p × p matrix ρ given in (8-15).
8.6. Data on x₁ = sales and x₂ = profits for the 10 largest U.S. industrial corporations were listed in Exercise 1.4 of Chapter 1. From Example 4.12,

x̄ = [62,309, 2,927]′,   S = [10,005.20  255.76
                              255.76     14.30] × 10⁵

(a) Determine the sample principal components and their variances for these data. (You may need the quadratic formula to solve for the eigenvalues of S.)
(b) Find the proportion of the total sample variance explained by ŷ₁.
(c) Sketch the constant density ellipse (x − x̄)′S⁻¹(x − x̄) = 1.4, and indicate the principal components ŷ₁ and ŷ₂ on your graph.
(d) Compute the correlation coefficients r_{ŷ₁,x_k}, k = 1, 2. What interpretation, if any, can you give to the first principal component?
8.7. Convert the covariance matrix S in Exercise 8.6 to a sample correlation matrix R.
(a) Find the sample principal components ŷ₁, ŷ₂ and their variances.
(b) Compute the proportion of the total sample variance explained by ŷ₁.
(c) Compute the correlation coefficients r_{ŷ₁,z_k}, k = 1, 2. Interpret ŷ₁.
(d) Compare the components obtained in Part a with those obtained in Exercise 8.6(a). Given the original data displayed in Exercise 1.4, do you feel that it is better to determine principal components from the sample covariance matrix or sample correlation matrix? Explain.
8.8. Use the results in Example 8.5.
(a) Compute the correlations r_{ŷᵢ,z_k} for i = 1, 2 and k = 1, 2, …, 5. Do these correlations reinforce the interpretations given to the first two components? Explain.
(b) Test the hypothesis

H₀: ρ = ρ₀ = [1 ρ ρ ρ ρ
              ρ 1 ρ ρ ρ
              ρ ρ 1 ρ ρ
              ρ ρ ρ 1 ρ
              ρ ρ ρ ρ 1]

versus

H₁: ρ ≠ ρ₀
at the 5% level of significance. List any assumptions required in carrying out this test.
8.9. (A test that all variables are independent.)
(a) Consider the normal theory likelihood ratio test of H₀: Σ is the diagonal matrix

Σ = [σ₁₁  0  ⋯  0
      0  σ₂₂ ⋯  0
      ⋮          ⋮
      0   0  ⋯ σ_pp],  σᵢᵢ > 0

Show that the test is: Reject H₀ if

Λ = |S|^{n/2} / Πᵢ₌₁ᵖ sᵢᵢ^{n/2} = |R|^{n/2} < c

For a large sample size, −2 ln Λ is approximately χ²_{p(p−1)/2}. Bartlett [3] suggests that the test statistic −2[1 − (2p + 11)/6n] ln Λ be used in place of −2 ln Λ. This results in an improved chi-square approximation. The large sample α critical point is χ²_{p(p−1)/2}(α). Note that testing Σ = Σ₀ is the same as testing ρ = I.
(b) Show that the likelihood ratio test of H₀: Σ = σ²I rejects H₀ if

Λ = |S|^{n/2} / [tr(S)/p]^{np/2} = [ Πᵢ₌₁ᵖ λ̂ᵢ / ( (1/p) Σᵢ₌₁ᵖ λ̂ᵢ )ᵖ ]^{n/2} = [ geometric mean λ̂ᵢ / arithmetic mean λ̂ᵢ ]^{np/2} < c

For a large sample size, Bartlett [3] suggests that −2[1 − (2p² + p + 2)/6pn] ln Λ is approximately χ²_{(p+2)(p−1)/2}. Thus, the large sample α critical point is χ²_{(p+2)(p−1)/2}(α). This test is called a sphericity test, because the constant density contours are spheres when Σ = σ²I.
Hint:
(a) max L(μ, Σ) is given by (5-10), and max L(μ, Σ₀) is the product of the univariate likelihoods, max Πᵢ₌₁ᵖ (2π)^{−n/2} σᵢᵢ^{−n/2} exp[ −Σⱼ₌₁ⁿ (xⱼᵢ − μᵢ)²/2σᵢᵢ ]. Hence, μ̂ᵢ = (1/n) Σⱼ₌₁ⁿ xⱼᵢ and σ̂ᵢᵢ = (1/n) Σⱼ₌₁ⁿ (xⱼᵢ − x̄ᵢ)². The divisor n cancels in Λ, so S may be used.
(b) Verify σ̂² = [ Σⱼ₌₁ⁿ (xⱼ₁ − x̄₁)² + ⋯ + Σⱼ₌₁ⁿ (xⱼₚ − x̄ₚ)² ]/np under H₀. Again, the divisors n cancel in the statistic, so S may be used. Use Result 5.2 to calculate the chi-square degrees of freedom.
The following exercises require the use of a computer.
8.10. The weekly rates of return for five stocks listed on the New York Stock Exchange are given in Table 8.4. (See the stock-price data on the disk.)
(a) Construct the sample covariance matrix S, and find the sample principal components in (8-20). (Note that the sample mean vector x̄ is displayed in Example 8.5.)
(b) Determine the proportion of the total sample variance explained by the first three principal components. Interpret these components.
(c) Construct Bonferroni simultaneous 90% confidence intervals for the variances λ₁, λ₂, and λ₃ of the first three population components Y₁, Y₂, and Y₃.
(d) Given the results in Parts a-c, do you feel that the stock rates-of-return data can be summarized in fewer than five dimensions? Explain.
8.11. Consider the census-tract data listed in Table 8.5 on page 508. Suppose the observations on X₅ = median value home were recorded in thousands, rather than ten thousands, of dollars; that is, multiply all the numbers listed in the sixth column of the table by 10.
(a) Construct the sample covariance matrix S for the census-tract data when X₅ = median value home is recorded in thousands of dollars. (Note that this covariance matrix can be obtained from the covariance matrix given in Example 8.3 by multiplying the off-diagonal elements in the fifth column and row by 10 and the diagonal element s₅₅ by 100. Why?)
(b) Obtain the eigenvalue-eigenvector pairs and the first two sample principal components for the covariance matrix in Part a.
(c) Compute the proportion of total variance explained by the first two principal components obtained in Part b. Calculate the correlation coefficients, r_{ŷᵢ,x_k}, and interpret these components if possible. Compare your
TABLE 8.4 STOC K-PRICE DATA (WEEKLY RATE OF RETU RN)
Union Allied Week Chemical Du Pont Carbide
Exxon
Texaco
1 2 3 4 5 6 7 8 9 10
.000000 .027027 .122807 .057031 .063670 .003521 -.045614 .058823 .000000 .006944
.000000 -.044855 .060773 .029948 -.003793 .050761 -.033007 .041719 -.019417 - .025990
.000000 -.003030 .088146 .066808 -.039788 .082873 .002551 .081425 .002353 .007042
.039473 - .014466 .086238 .013513 - .018644 .074265 - .009646 - .014610 .001647 - .041118
- .000000 .043478 .078124 .019512 - .024154 .049504 - .028301 .014563 -.028708 - .024630
91 92 93 94 95 96 97 98 99 100
-.044068 .039007 -.039457 .039568 -.031142 .000000 .021429 .045454 .050167 .019108
.020704 .038540 -.029297 .024145 -.007941 -.020080 .049180 .046375 .036380 -.033303
-.006224 .024988 -.065844 -.006608 .011080 -.006579 .006622 .074561 .004082 .008362
- .018518 - .028301 -.015837 .028423 .007537 .029925 - .002421 .014563 -.011961 .033898
.004694 .032710 -.045758 - .009661 .014634 -.004807 .028985 .018779 .009216 .004566
results with the results in Example 8.3. What can you say about the effects of this change in scale on the principal components?
8.12. Consider the air-pollution data listed in Table 1.3. Your job is to summarize these data in fewer than p = 7 dimensions if possible. Conduct a principal component analysis of the data using both the covariance matrix S and the correlation matrix R. What have you learned? Does it make any difference which matrix is chosen for analysis? Can the data be summarized in three or fewer dimensions? Can you interpret the principal components?
8.13. In the radiotherapy data listed in Table 1.5 (see also the radiotherapy data on the disk), the n = 98 observations on p = 6 variables represent patients' reactions to radiotherapy.
(a) Obtain the covariance and correlation matrices S and R for these data.
(b) Pick one of the matrices S or R (justify your choice), and determine the eigenvalues and eigenvectors. Prepare a table showing, in decreasing order of size, the percent that each eigenvalue contributes to the total sample variance.
TABLE 8.5
CENSUS-TRACT DATA
Median Total Health services Median value Total population school employment employment home Tract (thousands) years (thousands) (hundreds) ($10,000s) 1 5.935 2. 9 1 14.2 2.265 2.27 2 1. 523 13.1 .597 2. 62 .75 3 2.599 1.237 1. 72 12.7 1.11 4 4.009 15.2 3.02 1.649 .81 5 4.687 2.22 14.7 2.312 2.50 6 8.044 3. 641 2.36 4.5 1 15. 6 7 2.766 1.244 13.3 1. 97 1. 03 8 6.538 17.0 1. 85 2.618 2.39 9 6.451 2. 01 3.147 12.9 5.52 10 3. 3 14 1. 82 12.2 2.18 1.606 11 3. 777 2.83 1.80 13. 0 2.119 12 4.25 1.530 .84 .798 13.8 13 2.64 2.768 1.75 13. 6 1.336 14 3.17 6.585 14.9 1. 9 1 2.763
Note: Observations from adjacent census tracts are likely to be correlated. That is, these 14 observations may not constitute a random sample.

(c) Given the results in Part b, decide on the number of important sample principal components. Is it possible to summarize the radiotherapy data with a single reaction-index component? Explain.
(d) Prepare a table of the correlation coefficients between each principal component you decide to retain and the original variables. If possible, interpret the components.
8.14. Perform a principal component analysis using the sample covariance matrix of the sweat data given in Example 5.2. Construct a Q-Q plot for each of the important principal components. Are there any suspect observations? Explain.
8.15. The four sample standard deviations for the postbirth weights discussed in Example 8.6 are

√s₁₁ = 32.9909,  √s₂₂ = 33.5918,  √s₃₃ = 36.5534,  and  √s₄₄ = 37.3517

Use these and the correlations given in Example 8.6 to construct the sample covariance matrix S. Perform a principal component analysis using S.
8.16. Over a period of five years in the 1990s, yearly samples of fishermen on 28 lakes in Wisconsin were asked to report the time they spent fishing and how many of each type of game fish they caught. Their responses were then converted to a catch rate per hour for

x₁ = Bluegill    x₂ = Black crappie    x₃ = Smallmouth bass
x₄ = Largemouth bass    x₅ = Northern pike    x₆ = Walleye

The estimated correlation matrix (courtesy of Jodi Barnet)
x1 x4
S=
x2
x3
x5
x6
1 .4919 1 .4919 .2635 .3127 .4653 .3506 - .2277 - .1917 .0652 .2045
.2636 .4653 .3127 .3506 1 .4108 1 .4108 .0647 - .2249 .2493 .2293
- .2277 .0652 - .1917 .2045 .0647 .2493 - .2249 .2293 1 - .2144 1 - .2144
is based on a sample of about 120. (There were a few missing values.) Fish caught by the same fisherman live alongside of each other, so the data should provide some evidence on how the fish group. The first four fish belong to the centrarchids, the most plentiful family. The walleye is the most popular fish to eat. (a) Comment on the pattern of correlation within the centrarchid family x 1 through x4• Does the walleye appear to group with the other fish? (b) Perform a principal component analysis using only x 1 through x4 • Inter pret your results. (c) Perform a principal component analysis using all six variables. Interpret your results. 8.17. Using the data on bone mineral content in Table 1.6, perform a principal component analysis of 8.18. The data on national track records for women are listed in Table 1.7. (a) Obtain the sample correlation matrix R for these data, and determine its eigenvalues and eigenvectors. (b) Determine the first two principal components for the standardized vari ables. Prepare a table showing the correlations of the standardized vari ables, with the components and the cumulative percentage of the total (standardized) sample variance explained by the two components. (c) Interpret the two principal components obtained in Part b. (Note that the first component is essentially a normalized unit vector and might measure the athletic excellence of a given nation. The second component might measure the relative strength of a nation at the various running distances.) (d) Rank the nations based on their score on the first principal component. Does this ranking correspond with your inituitive notion of athletic excel lence for the various countries? 8.19. Refer to Exercise 8.18. Convert the national track records for women in Table 1.7 to speeds measured in meters per second. Notice that the records for 800 m, 1500 m, 3000 m, and the marathon are given in minutes. The
S.
marathon is 26.2 miles, or 42,195 meters, long. Perform a principal components analysis using the covariance matrix S of the speed data. Compare the results with the results in Exercise 8. 1 8. Do your interpretations of the components differ? If the nations are ranked on the basis of their score on the first principal component, does the subsequent ranking differ from that in Exercise 8.18? Which analysis do you prefer? Why? 8.20. The data on national track records for men are listed in Table 8. 6 . (See also the data on national track records for men on the disk. ) Repeat the principal component analysis outlined in Exercise 8.18 for' the men. Are the results consistent with those obtained from the women s data? TABLE 8.6
NATIONAL TRACK RECORDS FOR MEN
Country Argentina Australia Austria Belgium Bermuda Brazil Burma Canada Chile China Colombia Cook Islands Costa Rica Czechoslovakia Denmark Dominican Republic Finland France German Democratic Republic Federal Republic of Germany Great Britain and Northern Ireland Greece Guatemala Hungary
lOO m (s) 10.39 10. 3 1 10.44 10.34 10.28 10.22 10.64 10. 1 7 10.34 10. 5 1 10.43 12. 1 8 10.94 10.35 10.56 10. 1 4 10.43 10.11 10. 1 2 10. 1 6 10.11 10.22 10.98 10.26
200 m (s) 20.81 20.06 20.81 20.68 20.58 20.43 21.52 20.22 20.80 21.04 21.05 23.20 21. 90 20.65 20.52 20.65 20.69 20.38 20.33 20.37 20.21 20.71 21.82 20.62
400 m (s) 46.84 44.84 46.82 45.04 45.91 45.21 48.30 45.68 46.20 47.30 46.10 52.94 48.66 45.64 45.89 46.80 45.49 45.28 44.87 44.50 44.93 46.56 48.40 46.02
800 m (min) 1.8 1 1.74 1.79 1.73 1.80 1.73 1.80 1.76 1.79 1.81 1.82 2.02 1. 87 1.76 1.78 1. 82 1.74 1.73 1.73 1.73 1.70 1.78 1.89 1.77
1500 m (min) 3.70 3. 57 3.60 3. 60 3.75 3. 66 3. 85 3. 63 3. 7 1 3.73 3. 74 4.24 3. 84 3.58 3. 6 1 3. 82 3. 6 1 3.57 3.56 3.53 3.51 3. 64 3. 80 3. 62
5000 m (min) 14. 04 13.28 13.26 13.22 14.68 13.62 14.45 13. 5 5 13. 6 1 13. 90 13.49 16. 70 14.03 13.42 13. 5 0 14.91 13.27 13.34 13.17 13.21 13. 0 1 14.59 14.16 13.49
10,000 m (min) 29.36 27.66 27.72 27.45 30.55 28.62 30.28 28.09 29.30 29.13 27.88 35.38 28.81 28.19 28.11 31.45 27.52 27.97 27.42 27. 6 1 27. 5 1 28.45 30.11 28.44
Marathon (mins) 137.72 128.30 135.90 129.95 146.62 133. 1 3 139. 95 130.15 134.03 133.53 131.35 164.70 136.58 134.32 130.78 154. 1 2 130.87 132.30 129. 92 132.23 129. 1 3 134.60 139.33 132.58 (continued )
Chap. 8
Exercises
51 1
TABLE 8.6 (continued)
Country India Indonesia Ireland Israel Italy Japan Kenya Korea Democratic People ' s Republic of Korea Luxembourg Malaysia Mauritius Mexico Netherlands New Zealand Norway Papua New Guinea Philippines Poland Portugal Rumania Singapore Spain Sweden Switzerland Taipei Thailand Turkey USA USSR Western Samoa
lOO m (s) 10. 60 10.59 10. 6 1 10.71 10.01 10.34 10.46 10.34 10. 9 1 10.35 10.40 11.19 10.42 10.52 10. 5 1 10.55 10.96 10.78 10. 1 6 10.53 10.41 10.38 10.42 10.25 10.37 10. 5 9 10.39 10. 7 1 9.93 10.07 10.82
200 m (s) 21.42 21.49 20.96 21. 00 19.72 20.81 20.66 20.89 21. 94 20.77 20.92 22.45 21.30 20.95 20.88 21.16 21.78 21.64 20.24 21.17 20.98 21.28 20.77 20.61 20.46 21.29 21.09 21.43 19.75 20.00 21.86
400 m (s) 45.73 47.80 46.30 47.80 45.26 45.86 44.92 46.90 47.30 47.40 46.30 47.70 46. 1 0 45. 1 0 46.10 46.71 47.90 46.24 45.36 46.70 45.87 47.40 45.98 45.63 45.78 46.80 47.91 47.60 43.86 44.60 49.00
800 m (min) 1.76 1. 84 1.79 1.77 1.73 1.79 1.73 1.79 1. 85 1.82 1. 82 1.88 1. 80 1.74 1.74 1.76 1.90 1.81 1.76 1.79 1.76 1.88 1.76 1.77 1.78 1.79 1.83 1.79 1.73 1.75 2.02
1500 m (min) 3.73 3. 92 3.56 3.72 3. 60 3. 64 3.55 3.77 3.77 3.67 3. 80 3. 83 3. 65 3.62 3.54 3.62 4.01 3. 83 3.60 3. 62 3.64 3. 89 3.55 3.61 3.55 3.77 3. 84 3. 67 3.53 3.59 4.24
5000 m (min) 13.77 14.73 13.32 13.66 13.23 13.41 13.10 13. 96 14. 1 3 13.64 14.64 15.06 13.46 13. 3 6 13.2 1 13.34 14.72 14.74 13.29 13.13 13.25 15.11 13.31 13.29 13.22 14.07 15.23 13.56 13.20 13.20 16.28
10,000 m (min) 28.81 30.79 27.81 28.93 27.52 27.72 27.38 29.23 29.67 29.08 31.01 31.77 27.95 27. 6 1 27.70 27.69 31.36 30.64 27.89 27.38 27.67 31.32 27.73 27.94 27. 9 1 30.07 32.65 28.58 27.43 27.53 34.71
Marathon (mins) 131.98 148.83 132.35 137.55 131.08 128.63 129.75 136.25 130.87 141.27 154. 1 0 152.23 129.20 129.02 128.98 131.48 148.22 145.27 131.58 128.65 132.50 157.77 131.57 130.63 131.20 139.27 149. 90 131.50 128.22 130.55 161.83
SOURCE: IAAF/ATFS Track and Field Statistics Handbook for the 1984 Los Angeles Olympics. 8.21.
Refer to Exercise 8.20. Convert the national track records for men in Table 8.6 to speeds measured in meters per second. Notice that the records for 800 m, 1500 m, 500 m, 10,000 m and the marathon are given in minutes. The marathon is 26.2 miles, or 42,195 meters, long. Perform a principal compo-
512
Chap.
8
Principal Components
nent analysis using the covariance matrix S of the speed data. Compare the results with the results in Exercise 8.20. Which analysis do you prefer? Why? 8.22. Consider the data on bulls in Table 1.8. Utilizing the seven variables YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt, perform a principal component analysis using the covariance matrix S and the correlation matrix R. Your analysis should include the following: (a) Determine the appropriate number of components to effectively summa rize the sample variability. Construct a scree plot to aid your determination. (b) Interpret the sample principal components. (c) Do you think it is possible to develop a "body size" or "body configura tion" index from the data on the seven variables above? Explain. (d) Using the values for the first two principal components, plot the data in a two-dimensional space with .Yt along the vertical axis and y2 along the horizontal axis. Can you distinguish groups representing the three breeds of cattle? Are there any outliers? (e) Construct a Q-Q plot using the first principal component. Interpret the plot. 8.23. Refer to Example 8.10 and the data in Table 5.8, page 258. Add the variable x6 regular overtime hours whose values are (read across) =
6187 7679
7336 8259
6988 6964 10954 9353
8425 6291
6778 4969
5922 4825
7307 6019
and redo Example 8.10. 8.24. Refer to the police overtime hours data in Example 8.10. Construct an alter nate control chart, based on the sum of squares d�j • to monitor the unex plained variation in the original observations summarized by the additional principal components. REFERENCES
1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Anderson, T. W. "Asymptotic Theory for Principal Components Analysis." Annals of Mathematical Statistics, 34 (1963), 122-148.
3. Bartlett, M. S. "A Note on Multiplying Factors for Various Chi-Squared Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.
4. Dawkins, B. "Multivariate Analysis of National Track Records." The American Statistician, 43 (1989), 110-115.
5. Girschick, M. A. "On the Sampling Theory of Roots of Determinantal Equations." Annals of Mathematical Statistics, 10 (1939), 203-224.
6. Hotelling, H. "Analysis of a Complex of Statistical Variables into Principal Components." Journal of Educational Psychology, 24 (1933), 417-441, 498-520.
7. Hotelling, H. "The Most Predictable Criterion." Journal of Educational Psychology, 26 (1935), 139-142.
8. Hotelling, H. "Simplified Calculation of Principal Components." Psychometrika, 1 (1936), 27-35.
9. Hotelling, H. "Relations between Two Sets of Variates." Biometrika, 28 (1936), 321-377.
10. Jolicoeur, P. "The Multivariate Generalization of the Allometry Equation." Biometrics, 19 (1963), 497-499.
11. Jolicoeur, P., and J. E. Mosimann. "Size and Shape Variation in the Painted Turtle: A Principal Component Analysis." Growth, 24 (1960), 339-354.
12. King, B. "Market and Industry Factors in Stock Price Behavior." Journal of Business, 39 (1966), 139-190.
13. Kourti, T., and J. McGregor. "Multivariate SPC Methods for Process and Product Monitoring." Journal of Quality Technology, 28 (1996), 409-428.
14. Lawley, D. N. "On Testing a Set of Correlation Coefficients for Equality." Annals of Mathematical Statistics, 34 (1963), 149-151.
15. Maxwell, A. E. Multivariate Analysis in Behavioural Research. London: Chapman and Hall, 1977.
16. Rao, C. R. Linear Statistical Inference and Its Applications (2d ed.). New York: John Wiley, 1973.
17. Rencher, A. C. "Interpretation of Canonical Discriminant Functions, Canonical Variates and Principal Components." The American Statistician, 46 (1992), 217-225.
CHAPTER
9
Factor Analysis and Inference for Structured Covariance Matrices 9. 1 INTRODUCTION
Factor analysis has provoked rather turbulent controversy throughout its history. Its modern beginnings lie in the early 20th-century attempts of Karl Pearson, Charles Spearman, and others to define and measure intelligence. Because of this early association with constructs such as intelligence, factor analysis was nurtured and developed primarily by scientists interested in psychometrics. Arguments over the psychological interpretations of several early studies and the lack of powerful computing facilities impeded its initial development as a statistical method. The advent of high-speed computers has generated a renewed interest in the theoreti cal and computational aspects of factor analysis. Most of the original techniques have been abandoned and early controversies resolved in the wake of recent devel opments. It is still true, however, that each application of the technique must be examined on its own merits to determine its success. The essential purpose of factor analysis is to describe, if possible, the covari ance relationships among many variables in terms of a few underlying, but unob servable, random quantities called factors. Basically, the factor model is motivated by the following argument: Suppose variables can be grouped by their correlations. That is, suppose all variables within a particular group are highly correlated among themselves, but have relatively small correlations with variables in a different group. Then it is conceivable that each group of variables represents a single under lying construct, or factor, that is responsible for the observed correlations. For example, correlations from the group of test scores in classics, French, English, mathematics, and music collected by Spearman suggested an underlying "intelli gence" factor. A second group of variables, representing physical-fitness scores, if available, might correspond to another factor. It is this type of structure that fac tor analysis seeks to confirm. 514
Factor analysis can be considered an extension of principal component analy sis. Both can be viewed as attempts to approximate the covariance matrix I. How ever, the approximation based on the factor analysis model is more elaborate. The primary question in factor analysis is whether the data are consistent with a pre scribed structure. 9.2 THE ORTHOGONAL FACTOR MODEL
The observable random vector X, with p components, has mean μ and covariance matrix Σ. The factor model postulates that X is linearly dependent upon a few unobservable random variables F₁, F₂, …, F_m, called common factors, and p additional sources of variation ε₁, ε₂, …, εₚ, called errors or, sometimes, specific factors.¹ In particular, the factor analysis model is

X₁ − μ₁ = ℓ₁₁F₁ + ℓ₁₂F₂ + ⋯ + ℓ₁ₘF_m + ε₁
X₂ − μ₂ = ℓ₂₁F₁ + ℓ₂₂F₂ + ⋯ + ℓ₂ₘF_m + ε₂
   ⋮
Xₚ − μₚ = ℓₚ₁F₁ + ℓₚ₂F₂ + ⋯ + ℓₚₘF_m + εₚ     (9-1)

or, in matrix notation,

X − μ  =   L    F   +   ε     (9-2)
(p×1)    (p×m)(m×1)   (p×1)
The coefficient ℓᵢⱼ is called the loading of the ith variable on the jth factor, so the matrix L is the matrix of factor loadings. Note that the ith specific factor εᵢ is associated only with the ith response Xᵢ. The p deviations X₁ − μ₁, X₂ − μ₂, …, Xₚ − μₚ are expressed in terms of p + m random variables F₁, F₂, …, F_m, ε₁, ε₂, …, εₚ which are unobservable. This distinguishes the factor model of (9-2) from the multivariate regression model in (7-26), in which the independent variables [whose position is occupied by F in (9-2)] can be observed.
With so many unobservable quantities, a direct verification of the factor model from observations on X₁, X₂, …, Xₚ is hopeless. However, with some additional assumptions about the random vectors F and ε, the model in (9-2) implies certain covariance relationships, which can be checked.
¹ As Maxwell [22] points out, in many investigations the εᵢ tend to be combinations of measurement error and factors that are uniquely associated with the individual variables.
We assume that

E(F) = 0 (m×1),   Cov(F) = E[FF′] = I (m×m)

E(ε) = 0 (p×1),   Cov(ε) = E[εε′] = Ψ = [ψ₁  0  ⋯  0
                                          0  ψ₂ ⋯  0
                                          ⋮         ⋮
                                          0   0 ⋯ ψₚ]  (p×p)     (9-3)

and that F and ε are independent, so

Cov(ε, F) = E(εF′) = 0 (p×m)

These assumptions and the relation in (9-2) constitute the orthogonal factor model.²

ORTHOGONAL FACTOR MODEL WITH m COMMON FACTORS

X  =  μ  +   L    F   +   ε
(p×1) (p×1) (p×m)(m×1)   (p×1)

μᵢ = mean of variable i
εᵢ = ith specific factor
Fⱼ = jth common factor
ℓᵢⱼ = loading of the ith variable on the jth factor

The unobservable random vectors F and ε satisfy the following conditions:
F and ε are independent
E(F) = 0, Cov(F) = I
E(ε) = 0, Cov(ε) = Ψ, where Ψ is a diagonal matrix     (9-4)
Sec.
(X - �-t) (X - �-t ) '
9.2
The Orthogonal Factor Model
=
(LF + e) (LF + e) '
=
(LF + e) ( (LF) ' + e' )
=
LF (LF) ' + e (LF) ' + LFe' + ee'
51 7
so that I
=
Cov (X)
=
LE (FF' ) L' + E ( eF' ) L' + LE (Fe' ) + E ( ee' )
=
LL' + 'II
according to (9-3). Also, by the model in Cov (X, F) = E (X - �-t) F'
=
E (X - �-t) (X - �-t) '
(9-4), (X - IL ) F' =
= (LF + e) F' LE (FF' ) + E ( eF' ) = L.
=
LFF' + eF', so
COVARIANCE STRUCTURE FOR T H E ORTHOGONAL FACTOR MODEL
l. Cov (X)
=
LL' + 'II
or
2.
Cov (X, F)
or
=
L
(9-5)
The model X - /L = LF + e is linear in the common factors. If the p responses X are, in fact, related to underlying factors, but the relationship is non linear, such as in X1 - f.J- 1 = f 11 F1 F3 + t: 1 , X2 - f.J-2 = f2 1 F2F3 + t: 2 , and so forth, then the covariance structure LL' + 'II given by (9-5) may not be adequate. The very important assumption of linearity is inherent in the formulation of the traditional factor model. That portion of the variance of the ith variable contributed by the m common factors is called the ith communality. That portion of Var (X; ) = u;; due to the specific factor is often called the uniqueness, or specific variance. Denoting the ith communality by hf , we see from (9-5) that
Var(Xᵢ) = σᵢᵢ = (ℓᵢ₁² + ℓᵢ₂² + ⋯ + ℓᵢₘ²) + ψᵢ = communality + specific variance

or

hᵢ² = ℓᵢ₁² + ℓᵢ₂² + ⋯ + ℓᵢₘ²

and

σᵢᵢ = hᵢ² + ψᵢ,  i = 1, 2, …, p     (9-6)

The ith communality is the sum of squares of the loadings of the ith variable on the m common factors.

Example 9.1
(Verifying the relation Σ = LL′ + Ψ for two factors)

Consider the covariance matrix

Σ = [19 30  2 12
     30 57  5 23
      2  5 38 47
     12 23 47 68]

The equality

[19 30  2 12      [ 4 1                          [2 0 0 0
 30 57  5 23   =    7 2   [4  7 −1 1      +       0 4 0 0
  2  5 38 47       −1 6    1  2  6 8]             0 0 1 0
 12 23 47 68]       1 8]                          0 0 0 3]

or Σ = LL′ + Ψ may be verified by matrix algebra. Therefore, Σ has the structure produced by an m = 2 orthogonal factor model. Since

L = [ℓ₁₁ ℓ₁₂     [ 4 1              [ψ₁ 0 0 0     [2 0 0 0
     ℓ₂₁ ℓ₂₂  =    7 2      Ψ =      0 ψ₂ 0 0   =  0 4 0 0
     ℓ₃₁ ℓ₃₂      −1 6               0 0 ψ₃ 0      0 0 1 0
     ℓ₄₁ ℓ₄₂]      1 8],             0 0 0 ψ₄]     0 0 0 3]

the communality of X₁ is, from (9-6),

h₁² = ℓ₁₁² + ℓ₁₂² = 4² + 1² = 17

and the variance of X₁ can be decomposed as

σ₁₁ = (ℓ₁₁² + ℓ₁₂²) + ψ₁ = h₁² + ψ₁

or

19 = 17 + 2   (variance = communality + specific variance)

A similar breakdown occurs for the other variables. •
(Nonexistence of a proper solution)
[ ]
Let p = 3 and m = 1, and suppose the random variables X1 , X2 , and X3 have the positive definite covariance matrix
1 .9 .7 .9 1 .4 .7 .4 1
� Using the factor model in
(9-4), we obtain xl
-
JL1
X2 f.Lz X3 - f.L3 -
= = =
C 1 1 F1 + e 1 e2 1 F1 + 62 f3 1 F1 + e3
520
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
The covariance structure in (9-5) implies that I
or
= LL' + 'II
.90 = 1=
e 11 e2 , ei 1 + 1/Jz
The pair of equations .70 = .40 =
implies that
.70 = e 1 1 e31 .40 = e2 1 e31 1 = ej, + 1/13
e 1 ,e31 ezl e31
Substituting this result for e2 1 in the equation .90 = £1 1 £2 1 yields ef 1 = 1.575, or e,1 = ± 1.255. Since Var(F1 ) = 1 (by assumption) and Var(X ) = 1 , e1 = Cov(X , F1 ) = Corr(X1 , F1 ). Now, a correlation coef ficient 1cannot be 1greater than1 unity (in absolute value), so, from this point of view, I el l I = 1.255 is too large. Also, the equation 1 = e? 1 + 1/11 , Or 1/11 = 1 - e ? 1 gives 1/1, = 1 - 1.575 = - .575 which is unsatisfactory, since it gives a negative value for Var(e1 ) = I/J1 • Thus, for this example with m = 1, it is possible to get a unique numer ical solution to the equations I = LL' + However, the solution is not consistent with the statistical interpretation of the coefficients, so it is not a • proper solution. When > 1, there is always some inherent ambiguity associated with the factor model.= To see this, let T be any m m orthogonal matrix, so that TT' = T'T I. Then the expression in (9-2) can be written '11 .
m
X
X
-
p,
= LF + e = LTT'F + e = L * F * + e
(9-7)
Sec.
9. 3
Methods of Estimation
521
where L*
= LT and F * = T'F
Since E (F * )
= T' E (F) = 0
and Cov (F * )
= T' Cov (F) T = T'T =
I ( m x m)
it is impossible, on the basis of observations on X, to distinguish the loadings L from the loadings L * . That is, the factors F and F * = T'F have the same statisti cal properties, and even though the loadings L * are, in general, different from the loadings L, they both generate the same covariance matrix �. That is, (9-8) � = LL' + 'II = LTT'L' + 'II = (L * ) (L * ) ' + 'II This ambiguity provides the rationale for "factor rotation," since orthogonal matri ces correspond to rotations (and reflections) of the coordinate system for X. Factor Ioa:dings L the loadings '
• ·
'
'
are
both give�the �same elements ofLL' =
'
determined
only
L * = LT
'
"
up to and
' an orthog€ma1 matrix
T.
L
Thus, (9-9)
repre��ntation. The ��mmunalities, �\t�n by the diag�al (L *)(L *)', are also tmaffected by the choice
of T.
:
The analysis of the factor model proceeds by imposing conditions that allow one to uniquely estimate L and 'fl. The loading matrix is then rotated (multiplied by an orthogonal matrix), where the rotation is determined by some "ease-of-inter pretation" criterion. Once the loadings and specific variances are obtained, factors are identified, anq estimated values for the factors themselves (called factor scores ) are frequently constructed. 9.3 METHODS OF ESTIMATION
Given observations x 1 , x2 , , x n on p generally correlated variables, factor analy sis seeks to answer the question, Does the factor model of (9-4 ), with a small num ber of factors, adequately represent the data? In essence, we tackle this statistical model-building problem by trying to verify the covariance relationship in (9-5). • • •
522
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
The sample covariance matrix S is an estimator of the unknown population covariance matrix �- If the off-diagonal elements of S are small or those of the sample correlation matrix R essentially zero, the variables are not related, and a factor analysis will not prove useful. In these circumstances, the specific factors play the dominant role, whereas the major aim of factor analysis is to determine a few important common factors. If � appears to deviate significantly from a diagonal matrix, then a factor model can be entertained, and the initial problem is one of estimating the factor loadings l;i and specific variances 1/J; · We shall consider two of the most popular methods of parameter estimation, the principal component (and the related princi pal factor) method and the maximum likelihood method. The solution from either method can be rotated in order to simplify the interpretation of factors, as described in Section 9.4. It is always prudent to try more than one method of solu tion; if the factor model is appropriate for the problem at hand, the solutions should be consistent with one another. Current estimation and rotation methods require iterative calculations that must be done on a computer. Several computer programs are now available for this purpose. The Principal Component (and Principal Factor) Method
The spectral decomposition of (2-20) provides us with one factoring of the covari ance matrix �- Let � have eigenvalue-eigenvector pairs (A.; , e; ) with A 1 ;;;. A 2 ;;;. · · · ;;;. AP ;;;. 0. Then � = A 1 e 1 e; � �
+ A 2 e 2 e� + � �
[ v A 1 e 1 l v A2 e2 l
···
···
+ APePe;
� �
l
v
AP e ] p
� e� -\.!A;-;;-· -----------·
(9-10)
This fits the prescribed covariance structure for the factor analysis model having as many factors as variables (m = p) and specific variances 1/J; = 0 for all i. The load ing matrix has jth column given by � ei . That is, we can write L
L'
+ 0 = LL'
(9-11)
Apart from the scale factor � ' the factor loadings on the jth factor are the coef ficients for the jth principal component of the population. Although the factor analysis representation of � in (9-11) is exact, it is not particularly useful: It employs as many common factors as there are variables and does not allow for any variation in the specific factors e in (9-4). We prefer mod(p X p)
(p X p ) ( p X p )
(p X p )
Sec.
9. 3
Methods of Estimation
523
els that explain the covariance structure in terms of just a few common factors. One approach, when the last p - m eigenvalues are small, is to neglect the contribu tion of A m + l e m + l e�, + l + · · · + AP eP e; to I in (9-10). Neglecting this contribution, we obtain the approximation
L
L'
(p X m) (m Xp )
(9-12)
The approximate representation in (9-12) assumes that the specific factors e in (9-4) are of minor importance and can also be ignored in the factoring of I. If spe cific factors are included in the model, their variances may be taken to be the diag onal elements of I - LL', where LL' is as defined in (9-12). Allowing for specific factors, we find that the approximation becomes
I = LL' +
'}I
(9-13)
where "'i
=
(J"ii
m
- � e� for i = 1, 2, . . . ' p.
[xn ] [:XI ] [xil -:XI ]
j=l
To apply this approach to a data set x 1 , x 2 , , x n , it is customary first to center the observations by subtracting the sample mean i. The centered observations • • .
.,
-
.
�
�: r :: � ::
(xxi1 -:Xt )
j
= 1 , 2, . . . , n
(9-14)
have the same sample covariance matrix S as the original observations. In cases where the units of the variables are not commensurate, it is usually desirable to work with the standardized variables
zi
v's l l
=
( j z - Xz )
VSzz
j
= 1, 2, . . . , n
524
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
whose sample covariance matrix is the sample correlation matrix R of the obser vations x 1 , x 2 , . . . , x n . Standardization avoids the problems of having one variable with large variance unduly influencing the determination of factor loadings. The representation in (9-13), when applied to the sample covariance matrix S or the sample correlation matrix R, is known as the principal component solution. The name follows from the fact that the factor loadings are the scaled coefficients of the first few sample principal components. (See Chapter 8.) ,n:\. PRI NClPAL C@MPONENT SQLUTION OF
THfFACTOR MODEL
1il;b prifi�p!n ooitlponell.fiactor:iiklysisbi the s1fropie c��ariaric�· mattbt.s s�ecified in
tenlls
{�£;� eph;j�!here 1ll: ;;.
A 2 ,�
factors. Then the matrix of estimated • • •
:· ;!';(.,
:"�'>> '•
��
' � ' ;,:, · · ··.> -
,;}�:0
�
Ll/ ' . �Q '
.
The
.;. .62] = .43 implies that H0 would not be • rejected at any reasonable level. L11rg � sample variances and covariances for the maximum likelihood esti mates eij> 1/1 ; have been derived when these estimates have been determined from the sample covariance matrix S. (See [20].) The expressions are, in general, quite complicated. 9.4 FACTOR ROTATION
As we indicated in Section 9.2, all factor loadings obtained from the initial loadings by an orthogonal transformation have the same ability to reproduce the covariance (or correlation) matrix. [See (9-8).] From matrix algebra, we know that an orthogonal transformation corresponds to a rigid rotation (or reflection) of the coordinate axes. For this reason, an orthogonal transformation of the factor loadings, as well as the implied orthogonal transformation of the factors, is called factor rotation.
If L̂ is the p × m matrix of estimated factor loadings obtained by any method (principal component, maximum likelihood, and so forth), then

L̂* = L̂T,  where TT′ = T′T = I     (9-42)

is a p × m matrix of "rotated" loadings. Moreover, the estimated covariance (or correlation) matrix remains unchanged, since

L̂L̂′ + Ψ̂ = L̂TT′L̂′ + Ψ̂ = L̂*(L̂*)′ + Ψ̂     (9-43)
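Relation (9-43) is easy to confirm numerically. In the sketch below, the loadings, the specific variances, and the particular rotation angle are illustrative choices (not estimates from the text); any orthogonal T gives the same fitted covariance matrix and the same communalities.

    import numpy as np

    L = np.array([[0.56,  0.82],
                  [0.78, -0.53],
                  [0.65,  0.75],
                  [0.94, -0.11],
                  [0.80, -0.54]])                    # illustrative m = 2 loadings
    Psi = np.diag(1 - (L ** 2).sum(axis=1))          # specific variances on the correlation scale

    phi = np.deg2rad(20.0)
    T = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])      # any orthogonal matrix will do
    L_star = L @ T                                   # rotated loadings, as in (9-42)

    print(np.allclose(L @ L.T + Psi, L_star @ L_star.T + Psi))          # True: (9-43)
    print(np.allclose((L ** 2).sum(axis=1), (L_star ** 2).sum(axis=1))) # communalities unchanged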
Equation (9-43) indicates that the residual matrix, Sₙ − L̂L̂′ − Ψ̂ = Sₙ − L̂*(L̂*)′ − Ψ̂, remains unchanged. Moreover, the specific variances ψ̂ᵢ, and hence the communalities ĥᵢ², are unaltered. Thus, from a mathematical viewpoint, it is immaterial whether L̂ or L̂* is obtained. Since the original loadings may not be readily interpretable, it is usual practice to rotate them until a "simpler structure" is achieved. The rationale is very much akin to sharpening the focus of a microscope in order to see the detail more clearly.
Ideally, we should like to see a pattern of loadings such that each variable loads highly on a single factor and has small to moderate loadings on the remaining factors. However, it is not always possible to get this simple structure, although the rotated loadings for the decathlon data discussed in Example 9.11 provide a nearly ideal pattern.
We shall concentrate on graphical and analytical methods for determining an orthogonal rotation to a simple structure. When m = 2, or the common factors are
Sec.
9.4
Factor Rotation
541
considered two at a time, the transformation to a simple structure can frequently be determined graphically. The uncorrelated common factors are regarded as unit v�ctors along perpendicular coordinate axes. A plot of the pairs of factor loadings ( £ i 1 , £i 2 ) yields p points, each point corresponding to a variable. The coordinate axes can then be visually rotated through an angle-call it 4r -1 L be a diagonal matrix. This condition is con venient for computational purposes, but may not lead to factors that can easily be interpreted. PAN EL 9. 1
SAS ANALYSI S FOR EXAM PLE
9.9 U S I N G P ROC FACTOR.
t i t l e 'Facto r Ana lysis'; data cons u m e r(type=co rr) ; _type_ = 'CO R R'; i n put _name_ $ taste m o n ey flavor snack energy; ca rds; 1 .00 taste .02 1 .00 m o ney flavor .96 . 1 3 1 .00 snack . 42 .71 .50 1 .00 energy .01 .85 .1 1 .79 1 .00
PROGRAM COMMANDS
proc factor res data=co n s u m e r m ethod=prin nfact=2 rotate=vari m ax preplot plot; va r taste m o ney flavor snack energy;
(continued)
546
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices PAN El 9. 1 fhiti a l
(continued)
F�ctorMethod: �ri�cipaiComponents
O UTPUT
Prior Com m u n a l ity Esti m ates: O N E Eigenva l u es o f t h e Correlation Matrix: Total
Eigenva l u e Difference
Proportion
.cumulative
5 Average
=
=
1
1 2.853090 1 .046758
2 1 .806332 1 .60 1 842
3 0.204490 0 . 1 0208 1
4 0. 1 02409 0.068732
5 0.033677
0.5706
0.36 1 3
0.0409 0.9728
0.0205 0.9933
0.0067 1 .0000
0.570.6
0.93 1 9
2 facto rs w i l l be reta ined by the N FACTOR criterion. Factor Pattern
TASTE MONEY FLAVOR S NACK E N E RGY
FACTORi
0.55986
0.64534
. >Q;'7'?7�Q.
0.79821'
• ·· .
' FACTOR.2 0.8 1 610
:·�0.524,@!) 0.74795 0 1 0 492
0.939 1 1
·:
-
FiiJ 1, the condition L' l}I- 1 L = A effectively imposes m (m - 1 ) /2 con straints on the elements of L and '\}1, and the likelihood equations are solved, sub ject to these contraints, in an iterative fashion. One procedure is the following:
574
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
[16]
1. Compute initial estimates of the specific variances suggests setting
1/11 , 1/12 , , 1/Jp · Joreskog • • •
(9A-4) �·' = ( 1 - l_2 . m ) (�) s" where is the ith diagonal element of s - 1 • Given � ' compute the first m distinct eigenvalues, A 1 > A 2 > ··· > A m > 1, and corresponding eigenvectors, el' ez, ... ' em, of the "uniqueness-rescaled" covariance matrix s* = � -1/z s n � -1 ;2 (9A-5) Let E [ el i ez i .. . i em ] be the m matrix of normalized eigenvectors and A A A A = diag [ 1 . 2 , . . . , m ] be the� m m diagonal matrix of eigenvalues. From (9A-1), A = I + A and E = -1/2 LA -112. Thus, we obtain the estimates (9A-6) Substitute i obtained in (9A-6) into the likelihood function (9A-3), and min imize the result with respect to � 1 , � 2 , ... , �p · A numerical search routine must be used. The values � 1 , � 2 , ... , �P obtained from this minimization are employed at Step (2) to create a new i. Steps (2) and (3) are repeated until convergence-that is, until the differences between successive values of e and � are negligible. It often happens that the objective function in (9A-3) has a rel ative minimum corresponding to negative values for some �;· This solution is clearly inadmissible and is said to be jmproper, or a Heywood case. For most pack aged computer programs, negative 1/J ; , if they occur on a particular iteration, are changed to small positive numbers before proceeding with the next step. p = LzL: + '��z When I has the factor analysis structure I = LL ' + '11 , p can be factored as p = v -1/2 Iy - 1 /2 = ( v -1/2 L) ( v - 1 /2 L) ' + v - 1 /2\}ly-1/2 = L Z L� + '��z · The loading matrix for the standardized variables is Lz = v -t /2 L, and the corresponding specific variance matrix is '��z = v -1/2 '11v - l l2 , where v -1 /2 is the diagonal matrix with ith diagonal element u;; 1 12 . If R is substituted for S n in the objective function of (9A-3), the investigator minimizes p
s
ii
2.
p
X
X
3.
ij
i
Comment.
Maximum Likelihood Estimators of
(9A-7)
Chap . 9
575
Exercises
Introducing the diagonal matrix V 1 12, whose ith diagonal element is the square root of the ith diagonal element of 8 11 , we can write the objective function in (9A-7) as ln
( I V 1 12 I lL Z L� + :z i i V 1 12 I ) I V 1 12 I I R I I V112 I
+ tr [ (LzL � + '11, ) - 1 V - 1 12 V1 12 RV 1 12 V - 112]
= ln
( I = [ t\,"+ 1 i · i ep J and A. (2) is the diagonal matrix with elements m + 1 , . . . , AP . "Us� (sum of squared entries of A) = tr AA' and tr[P c2J A (2) A (2) lP(2) ] = tr [A c2> A ]. 9.6. Verify the following matrix identities. (a) ( I + L' � - 1 L) - 1 L' � - 1 L = I - (I + L' � - 1 L) - 1 Hint: Premultiply bothI sides by (I + L' �- 1 L ) . � - 1 L (I + L' � - 1 L) - I L' � - I (b) (LL' + � ) - 1 = � Hint: Postmultiply both sides by (LL' + � ) and use (a). - -
-
A \
�
A
.
_
Af
· · ·
A \
A
Af
p A
A
1'\
Chap. 9
Exercises
577
L' (LL' + '\]f)- 1 = (I + L''\]f- 1 L) - 1 L''\]f - 1 Hint: Postmultiply the result in 1 (b) by L, use (a), and take the trans pose, noting that (LL' + '\]f) - , '\)f - 1 , and ( I + L''\]f - 1 L) - 1 are sym metric matrices. 9.7. (The factor model parameterization need not be unique.) Let the factor model with p = 2 and m = 1 prevail. Show that 0"11 -- eA "2 + 1/11 , O"zz = f i 1 + 1/Jz and, for given 0"11 , 0"22 , and 0"1 2 , there is an infinity of choices for L and 9.8. (Unique but improper solution: Heywood case.) Consider an m = 1 factor model for the population with covariance matrix 1 .4 . 9 I = .4 1 . 7 .9 .7 1 Show that there is a unique choice of L and with I = LL' + but that I/J3 < 0, so the choice is not admissible. 9.9. In a study of liquor preference in France, Stoetzel [25] collected preference rankings of p = 9 liquor types from n = 1442 individuals. A factor analysis of the 9 9 sample correlation matrix of rank orderings gave the following estimated loadings: (c)
'\)1,
[ ]
'\)1
'\)1,
X
Variable (X1 ) Liquors Kirsch Mirabelle Rum Marc Whiskey Calvados Cognac Armagnac *
Estimated factor loadings F,
.64 .50 .46 .17 -.2299 -. -.49 -.52 -.60
Fz
. 02 -.06 -.24 .74 .66 -..2008 -.03 -. 1 7
F3
.16 -.10 -.19 .97* -.39 .09 -.04 .42 .14
valforThiuobte sofafii.ng6iur4,ngeastihsaetresoesotuihimltgatofh.eandItfactexceeds approrolxiomadithatneiogsmaxin usedmetmhumodby Stoetzel.
578
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
Given these results, Stoetzel concluded the following: The major principle of liquor preference in France is the distinction between sweet and strong liquors. The second motivating element is price, which can be understood by remembering that liquor is both an expensive commodity and an item of con spicuous consumption. Except in the case of the two most popular and least expensive items (rum and marc), this second factor plays a much smaller role in producing preference judgments. The third factor concerns the sociologi cal and primarily the regional, variability of the judgments. (See [25], p. 11.)' (a) Given what you know about the various liquors involved, does Stoetzel s interpretation seem reasonable? (b) Plot the loading pairs for the first two factors. Conduct a graphical orthogonal rotation of the factor axes. Generate approximate rotated loadings. Interpret the rotated loadings for the first two factors. Does your interpretation agree with Stoetzel's interpretation of these factors from the unrotated loadings? Explain. 9.10. The correlation matrix for chicken-bone measurements (see Example 9. 1 4) is 1.000 .505 1.000 .569 .422 1.000 .602 .467 .926 1.000 .621 .482 .877 .874 1.000 .603 .450 .878 .894 . 937 1.000 The following estimated factor loadings were extracted by the maximum like lihood procedure: Varimax Estimated rotated estimated factor loadings factor loadings Variable p1* F, Fz F* 1. Skull length . 602 .200 .484 .411 2. Skull breadth .467 .154 .375 .319 3. Femur length . 926 .143 . 603 . 7 17 4. Tibia length 1.000 .000 . 5 19 .855 5. Humerus length .874 .476 .861 .499 6. Ulna length .894 . 327 .744 .594 Using the unrotated estimated factor loadings, obtain the maximum likeli hood estimates of the following. (a) The specific variances. 2
Chap. 9
Exercises
579
The communalities. The proportion of variance explained by each factor. The residual matrix R LzL� � z · 9.11. Refer to Exercise 9. 1 0. Compute the value of the varimax criterion using both unrotated and rotated estimated factor loadings. Comment on the results. 9.U. The covariance matrix for the logarithms of turtle measurements (see Exam ple 8.4) is 11.072 s = w - 3 8. 0 19 6. 4 17 8. 1 60 6.005 6.773 The following maximum likelihood estimates of the factor loadings for an m = 1 model were obtained: (b)
(c) (d)
-
-
]
[
Estimated factor loadings
Variable 1. ln(length) 2. ln(width) 3. ln(height)
FI
.1022 .0752 .0765
Using the estimated factor loadings, obtain the maximum likelihood esti mates of each of the following. (a) Specific variances. (b) Communalities. (c) Proportion of variance exp} ajned by the factor. (d) The residual matrix S n L L ' 'II . Hint: Convert S to 811 • 9.13. Refer to Exercise 9. 1 2. Compute the test statistic in (9-39). Indicate why a test of H0 : I = LL' + 'II (with m = 1) versus H1 : I unrestricted cannot be carried out for this example. [See (9-40).] 9.14. The maximum likelihood factor loading estimates are given in (9A-6) by -
-
Verify, for this choice, that where
a A
=
A
-
I
is a diagonal matrix.
580
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
Hirschey and Wichern [15] investigate the consistency, determinants, and uses of accounting and market-value measures of profitability. As part of their study, a factor analysis of accounting profit measures and market esti mates of economic profits was conducted. The correlation matrix of account ing historical, accounting replacement, and market-value measures of profitability for a sample of firms operating in 1977 is as follows: Variable HRA HRE HRS RRA RRE RRS Q REV Historical return on assets, HRA 1.000 Historical return on equity, HRE .738 1. 000 Historical return on sales, HRS .731 .520 1.000 Replacement return on assets, RRA . 828 . 688 . 652 1.000 Replacement return on equity, RRE .681 .831 . 5 13 . 887 1. 000 Replacement return on sales, RRS .712 .543 . 826 . 867 .692 1.000 Market Q ratio, Q .625 .322 .579 . 639 .419 . 608 1.000 Market relative excess value, REV . 604 .303 . 6 17 .563 .352 . 6 10 . 937 1.000 The following rotated principal component estimates of factor loadings for an m = 3 factor model were obtained: Estimated factor loadings Variable Fz F3 Ft .433 . 6 12 .499 Historical return on assets Historical return on equity .125 . 892 .234 Historical return on sales .296 .238 . 887 .406 .708 .483 Replacement return on assets Replacement return on equity .198 .895 .283 .331 .414 .789 Replacement return on sales . 928 .160 .294 Market Q ratio . 9 10 .079 .355 Market relative excess value Cumulative proportion .287 . 628 .908 of total variance explained (a) Using the estimated factor loadings, determine the specific variances and communalities. (b) Determine the residual matrix, R izi� ir z· Given this information and the cumulative proportion of total variance explained in the preced ing table, does an m = 3 factor model appear appropriate for these data? 9.15.
-
-
Chap. 9
Exercises
581
Assuming that estimated loadings less than .4 are small, interpret the three factors. Does it appear, for example, that market-value measures provide evidence of profitability distinct from that provided by account ing measures? Can you separate accounting historical measures of prof itability from accounting replacement measures? 9.16. Verify that factor scores constructed according to (9-50) have sample mean vector 0 and zero sample covariances. 9.17. Consider the LISREL model in Example 9.16. Interchange 1 and A1 in the parameter vector AY ' and interchange A 2 and 1 in the parameter vector Ax Using the S matrix provided in the example, solve for the model parameters. Explain why the scales of the structural variables and �must be fixed. (c)
TJ
The following exercises require the use of a computer.
Refer to Exercise 5.16 concerning the numbers of fish caught. (a) Using only the measurements x 1 - x4, obtain the principal component solution for factor models with m = 1 and m = 2. (b) Using only the measurements x 1 - x4, obtain the maximum likelihood solution for factor models with m = 1 and m = 2. (c) Rotate your solutions in Parts (a) and (b). Compare the solutions and comment on them. Interpret each factor. (d) Perform a factor analysis using the measurements x1 - x6 • Determine a reasonable number of factors m, and compare the principal component and maximum likelihood solutions after rotation. Interpret the factors. 9.19. A firm is attempting to evaluate the quality of its sales staff and is trying to find an examination or series of tests that may reveal the potential for good performance in sales. The firm has selected a random sample of 50 sales people and has evaluated each on 3 measures of performance: growth of sales, profitability of sales, and new-account sales. These measures have been converted to a scale, on which 100 indicates "average" performance. Each of the 50 individuals took each of 4 tests, which purported to measure creativity, mechanical reasoning, abstract reasoning, and mathematical abil ity, respectively. The n = 50 observations on p = 7 variables are listed in Table 9.12. (a) Assume an orthogonal factor model for the standardized variables Zi = ( Xi - JL i )/� , i = 1, 2, . . . , 7. Obtain either the principal compo nent solution or the maximum likelihood solution for m = 2 and m = 3 common factors. (b) Given your solution in (a), obtain the rotated loadings for m = 2 and m = 3. Compare the two sets of rotated loadings. Interpret the m = 2 and m = 3 factor solutions. c) List the estimated communalities, specific variances, and ££ ' + 4r for the m = 2 and m = 3 solutions. Compare the results. Which choice of m do you prefer at this point? Why? 9.18.
(
582
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
TABLE 9. 1 2 SALESPEOPLE DATA
Sales Salesperson growth (x1 ) 1 93.0 2 88.8 3 95.0 4 101.3 5 102.0 6 95.8 7 95.5 8 110.8 9 102.8 10 106.8 11 103. 3 12 99.5 13 103.5 14 99.5 15 100.0 16 81.5 17 101. 3 18 103.3 19 95.3 20 99.5 88. 5 21 22 99.3 87.5 23 24 105.3 25 107.0 93.3 26 106.8 27 106.8 28 92.3 29 106.3 30 106.0 31 88.3 32 96.0 33 94.3 34 106.5 35
Index of: Sales profitability (x2 ) 96.0 91.8 100.3 103.8 107. 8 97.5 99.5 122.0 108.3 120.5 109.8 111. 8 112.5 105.5 107.0 93.5 105.3 110.8 104.3 105.3 95. 3 115. 0 92.5 114.0 121.0 102.0 118.0 120.0 90.8 121. 0 119.5 92.8 103.3 94.5 121.5
Score on: Mechanical Abstract Newaccount Creativity reasoning reasoning sales (x3 ) test (x4 ) test (x5 ) test (x6 ) 12 09 97.8 09 10 10 96.8 07 12 09 08 99.0 12 14 106.8 13 12 15 103.0 10 14 11 10 99.3 12 09 09 99.0 20 15 115.3 18 13 17 10 103.8 11 18 102.0 14 12 17 104.0 12 18 08 10 100.3 17 11 16 107.0 10 11 08 102.3 08 10 13 102.8 05 09 07 95.0 11 12 11 102.8 14 11 11 103.5 13 14 05 103.0 11 17 17 106.3 07 12 10 95.8 11 11 05 104.3 07 09 09 95.8 12 15 12 105.3 12 19 16 109.0 07 15 10 97.8 12 16 14 107.3 11 16 10 104.8 13 10 08 99.8 11 17 09 104.5 10 15 18 110.5 08 11 13 96.8 11 15 07 100.5 11 12 10 99.0 10 17 18 110.5
Mathernatics test (x7 ) 20 15 26 29 32 21 25 51 31 39 32 31 34 34 34 16 32 35 30 27 15 42 16 37 39 23 39 49 17 44 43 10 27 19 42 (continued)
Chap. 9
TABLE 9. 1 2
Salesperson 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
Exercises
583
(continued)
Sales growth (x1 ) 106.5 92.0 102.0 108.3 106.8 102.5 92.5 102.8 83.3 94.8 103.5 89.5 84.3 104.3 106.0
Index of: Sales profitability (x2 ) 115. 5 99.5 99.8 122.3 119.0 109. 3 102.5 113. 8 87.3 101. 8 112.0 96.0 89.8 109.5 118.5
Score on: Mechanical Abstract Newaccount Creativity reasoning reasoning sales (x3 ) test (x4 ) test (x5 ) test (x6 ) 14 13 08 107.0 08 16 18 103.5 14 12 13 103.3 12 19 15 108.5 12 20 14 106.8 13 17 09 103. 8 06 15 13 99.3 10 20 17 106.8 09 05 01 96.3 11 16 99.8 07 12 13 18 110.8 11 15 07 97.3 08 08 94.3 08 12 12 106.5 14 11 16 105.0 12
Mathernatics test (x7 ) 47 18 28 41 37 32 23 32 15 24 37 14 09 36 39
Conduct a test of H0 : LL' + '\fl versus H : I LL' + '\fl for both m 2 and m 3 at the .01 level. With1 these results and those in Parts b and c, which choice of m appears to be the best? (e) Suppose a new salesperson, selected at random, obtains the test scores x' = [x1 , x2 , , x7 ] [ 110, 98, 105, 15, 18, 12, 35]. Calculate the sales person ' s factor score using the weighted least squares method and the regression method. Note: The components of x must be standardized using the sample means and variances calculated from the original data. 9.20. Using the air-pollution variables X1 , X2 , X5 , and X6 given in Table 1. 3 , gen erate the sample covariance matrix. (a) Obtain the principal component solution to a factor model with m 1 and m 2. (b) Find the maximum likelihood estimates of L and '\fl for m 1 and m 2. (c) Compare the factorization obtained by the principal component and max imum likelihood methods. 9.21. Perform a varimax rotation of both m = 2 solutions in Exercise 9. 2 0. Inter pret the results. Are the principal component and maximum likelihood solu tions consistent with each other? (d)
=
=
• • •
=
a =
=I=
=
=
=
=
584
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
Refer to Exercise 9.20. (a) Calculate the factor scores from the = 2 maximum likelihood esti mates by (i) weighted least squares in (9-50) and (ii) the regression approach of (9-58). (b) Find the factor scores from the principal component solution, using (9-51). (c) Compare the three sets of factor scores. 9.23. Repeat Exercise 9. 2 0, starting from the sample correlation matrix. Interpret the factors for the = 1 and = 2 solutions. Does it make a difference if R, rather than S, is factored? Explain. 9.24. Perform a factor analysis of the census-tract data in Table 8.2. Start with R and obtain both the maximum likelihood and principal component solutions. Comment on your choice of Your analysis should include factor rotation and the computation of factor scores. 9.25. Perform a factor analysis of the "stiffness" measurements given in Table 4.3 and discussed in Example 4.14. Compute factor scores, and check for outliers in the data. Use the sample covariance matrix S. 9.26. Consider the mice-weight data in Example 8.6. Start with the sample covari ance matrix. (See Exercise 8. 1 5 for � .) (a) Obtain the principal component solution to the factor model with = 1 and = 2. (b) Find the maximum likelihood estimates of the loadings and specific vari ances for = 1 and = 2. (c) Perform a varimax rotation of the solutions in Parts a and b. 9.27. Repeat Exercise 9. 26 by factoring R instead of the sample covariance matrix S. Also, for the mouse with standardized weights [.8, - .2, - .6, 1 .5], obtain the factor scores using the maximum likelihood estimates of the loadings and Equation (9-58) . 9.28. Perform a factor analysis of the national track records for women given in Table 1 .7. Use the sample covariance matrix S and interpret the factors. Compute factor scores, and check for outliers in the data. Repeat the analy sis with the sample correlation matrix R. Does it make a difference if R, rather than S, is factored? Explain. 9.29. Refer to Exercise 9.28. Convert the national track records for women to speeds measured in meters per second. (See Exercise 8 . 19. ) Perform a factor analysis of the speed data. Use the sample covariance matrix S and interpret the factors. Compute factor scores, and check for outliers in the data. Repeat the analysis with the sample correlation matrix R. Does it make a difference if R, rather than S, is factored? Explain. Compare your results with the results in Exercises 9.28. Which analysis do you prefer? Why? 9.30. Perform a factor analysis of the national track records for men given in Table 8.6. Repeat the steps given in Exercise 9.28. Is the appropriate factor model for the men's data different from the one for the women 's data? If not, are m
9.22.
m
m
m.
m
m
m
m
Chap. 9
References
585
the interpretations of the factors roughly the same? If the models are differ ent, explain the differences. 9.31. Refer to Exercise 9.30. Convert the national track records for men to speeds measured in meters per second. (See Exercise 8.2 1.) Perform a factor analy sis of the speed data. Use the sample covariance matrix S and interpret the factors. Compute factor scores, and check for outliers in the data. Repeat the analysis with the sample correlation matrix R. Does it make a difference if R, rather than S, is factored? Explain. Compare your results with the results in Exercises 9.30. Which analysis do you prefer? Why? 9.32. Perform a factor analysis of the data on bulls given in Table 1. 8 . Use the seven variables YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt. Factor the sample covariance matrix S and interpret the factors. Compute fac tor scores, and check for outliers. Repeat the analysis with the sample corre lation matrix R. Compare the results obtained from S with the results from R. Does it make a difference if R, rather than S, is factored? Explain. REFERENCES An Introduction to Multivariate Statistical Methods (
)
1. Anders oln,ey,T.1984.W. 2d ed. . New York: John Wi 3.2. BartBarthleolt o, mew, M. (S.1937)D."TheJ., 97-104. Statistical Conception of Mental Factors." London: Grif in, 1987. 4. Barttions.le"t , M. S. "A Note on Multiplying Factors for Var(1954)ious, 296-298. Chi-Squared Approxima 5. Bentler, P. M. "Multivaria(t1e980),Anal419-456. ysis with Latent Variables: Causal Models." 6. Bielby, W.(1977)T., and, 137-161. R. M. Hauser "Structural Equation Models." 7.8. DiBolxlon,en, W.K. A.J., ed. NewBerkYork: John: UniWilveersy, 1989. el e y, CA. ity of Cal i f o r n i a Pr e s , 1979. 9. Press, Duncan,1975. New York: Academic 10. Dunn, L. C. "The Effect of Inbre(edi1928)ng ,on1-112.the Bones of the Fowl." 11. Gol(d1972)berger,, 979-1001. A. S. "Structural Equation Methods in the Social Sciences." 12.13. Hayduk, Harmon, L.H.A.H. Chicago: The UniverBalsitytiofmore:ChicTheago Johns Pres , Hop 1967. ki n s Uni v er s i t y Pres s , 1987. 14. Heise, D. R. New York: John Wiley, 1975. Latent Variable Models and Factor Analysis.
British Journal of Psy
chology,
28
Journal of the Royal Statistical Society (B) ,
16
Annual
Review of Psychology,
31
Annual Review of Soci
ology,
3
Structural Equations with Latent Variables.
BMDP Biomedical Computer Programs.
0.
Q. Introduction to Structural Equation Models.
Storrs Agricultural
Experimental Station Bulletin,
52
Econometrica,
40
Modern Factor Analysis.
Structural Equation Modeling with LISREL.
Causal Analysis.
586
Chap.
9
Factor Analysis and Inference for Structured Covariance Matrices
15. iHitarbisclihey,ty: ConsM., iandstency,D. DetW. eWirmcihern.nants"Account in" g and Market-Value Measures of Prof and Us e s. 984), o375-383. 16. Joreskog,no.K. 4G.(1"Fact r Analysis by LeaseditteSquares andleMaxiin, A.mumRalsLitokn,eliandhood.H." S.InWilf. d by K. Ens New kYork:og, K.JohnG., andWileD.y, 1975.Sorbom. 17. Jores Cambr idge,D.MA:Sorbom.Abt Books, 1979. 18. Jores k og, K. , and Chicago: Scientific Sof t w ar e I n t e rnat i o nal , 1996. 19. Kaiser, H. F. "The(1958)Vari, 18m7-200. ax Criterion for Analytic Rotation in Factor Analysis." 20. YorLawlke: y,AmerD. N.ic,anandElsA.eviE.er Maxwel ln. g Co., 1971. 2d ed. . New Publ i s h i 21. Linden,no. 3M.(1977)"A Fact, 562-568. or Analytic Study of Olympic Decathlon Data." 22. Maxwel l, A. E. London: Chapman and Hal l , 1977. 23. MiEnvil err,onment D. "St.a"ble in the Saddle" CEO Tenurno. e1 and(1991),the 34--Mat5c2.h between Organization and 24.25. MorStoetrizselon,, J.D."AF.Factor Analysis of Liquor Prefere2dnce.ed." . New York: McGraw-Hil , 1976. 26. Wr(1960) ight, S., edi7-11."Theted byInterpKempt retationhorofnMule andtivarotihaertesSys. Amestems., l"AIn: Iowa State University Pres , 1954, 11-33. Journal of Business and Economic Sta
tistics,
2,
Sta
tistical Methods for Digital Computers,
Advances in Factor Analysis and Structural Equation
Models.
LISREL
8:
User's Reference Guide,
Psy
chometrika,
23
Factor Analysis as a Statistical Method
(
)
Research Quarterly,
48,
Multivariate Analysis in Behavioral Research.
Management Science,
37,
Multivariate Statistical Methods
(
)
Journal of Advertising Research,
1
Statistics and Mathematics in
Biology,
0.
CHAPTER
10
Canonical Correlation Analysis 1 0. 1 INTRODUCTION
Canonical correlation analysis seeks to identify and quantify the associati ns between two sets of variables. H. Hotelling ([5], [6]), who initially developed Othe technique, provided the example of relating arithmetic speed and arithmetic power to reading speed and reading power. (See Exercise 10. 9 . ) Other examples include relating governmental policy variables with economic goal variables and relating college "performance" variables with precollege "achievement" variables. Canonical correlation analysis focuses on the correlation between a linear combination of the variables in one set and a linear combination of the variables in another set. The idea is first to determine the pair of linear combinations having the largest correlation. Next, we determine the pair of linear combinations hav ing the largest correlation among all pairs uncorrelated with the initially selected pair, and so on. The pairs of linear combinations are called the canonical variables, and their correlations are called canonical correlations. The canonical correlations measure the strength of association between the two sets of variables. The maximization aspect of the technique represents an attempt to concentrate a high-dimensional relationship between two sets of vari ables into a few pairs of canonical variables. 1 0.2 CANONICAL VARIATES AND CANONICAL CORRELATIONS
We shall be interested in measures of association between two groups of variables. The first group, of p variables, is represented by the (p 1) random vector x ( l> . The second group, of q variables, is represented by the (q 1) random vector x . We assume, in the theoretical development, that x < 1 > represents the smaller set, so that p q. X
X
:o::::;
587
588
Chap.
10
Canonical Correlation Analysis
For the random vectors X(l l and X (2) , let E (X (l l ) (l l; Cov cx< I l ) I ll E (x, x ) b = a' I 1 2 b We shall seek coefficient vectors a and b such that X (2)
(10-7)
is as large as possible. We define the following: The first pair of canonical variables, or first canonical variate pair, is the pair of linear combinations U1 , V1 having unit variances, which maximize the cor relation (10-7); The second pair of canonical variables, or second canonical variate pair, is the pair of linear combinations U2 , V2 having unit variances, which maximize the correlation (10-7) among all choices that are uncorrelated with the first pair of canonical variables. At the kth step: The kth pair of canonical variables, or kth canonical variate pair, is the pair of linear combinations Uk, Vk having unit variances, which maximize the cor relation (10-7) among all choices uncorrelated with the previous k 1 canon ical variable pairs. The correlation between the kth pair of canonical variables is called the kth canon ical correlation. The following result gives the necessary details for obtaining the canonical variables and their correlations. q and let the random vectors x and x have Result 1 0. 1 . Suppose p (q X l ) (p X l ) Cov ( X ( l l ) = I1 1 , Cov ( X ) = I22 and Cov ( X ( l l , X ( 2 ) ) = I1 2 where I has q X q) X ) X q) -
:o::;
(p p
(
(p
590
Chap.
10
Canonical Correlation Analysis
full rank. For coefficient vectors (p aX I ) and (q bX I) , form the linear combinations U = a' X ( I ) and V = b'X. Then max Corr ( U, V) = pj a, b attained by the linear combinations (first canonical variate pair) U1 = e{ I !l 12 X (l) and V1 = f{ I2":f/2 X '-.r--'
'-v--'
a{
b{
The kth pair of canonical variates, k = 2, 3, ... , p, � - l /2 x vk = rk· .... uk - ek' �_.., I-Il /2 x ( l ) 22 maximizes -
among those linear combinations uncorrelated with the preceding 1, 2, ... , k - 1 canonical variables. � .... � 1;2 - � .... H ere p*1 2 ;;;;. p*2 2 ;;;;. ;;;;. P*p 2 are t h e etgenva ues of � .... 1-11;2� .... 1 2 �.... 22 2 1 1-1 , and e 1 , e2 , . . . , eP are the associated (p 1) eigenvectors. (The quantities p1 2 , p� 2 , , p; 2 are also the p largest eigenvalues of the matrix I:Z] i2 I 2 1 I!l i 1 2 I2F2 with corresponding (q 1) eigenvectors f 1 , f2 , . . . , fP . Each f proport10na to � .... 22- 1;2�....2 1 �•1-11;2 e ; . ) The canonical variates have the properties Var(Uk) = Var(Vd = 1 Cov(Uk, Uc) = Corr(Uk, Uc) = 0 k * f Cov(Vk, Ve) = Corr(Vk, Ve) = 0 k i= f Cov(Uk, Ve) = Corr(Uk, Vc) = 0 k i= f for k, e = 1, 2, . . . , p. Proof. We assume that I 1 1 and I 22 are nonsingular. 1 Introduce the symmet ric square-root matrices 2Ig2 and I��2 with I 1 1 = I g2 Ig2 and I!l = I !li2 I !F2 . 2 [See (2-22). ] Set c = Ig a and d = I�� b, so a = I!ll2 c and b = I:Z]i2 d. Then ·
• • •
X
. . •
· ; IS
·
1
Corr ( a' X ( l) ' b' X (2) )
I
X
=
a· � ..., 1 2 b Ya' I 1 1 a Yb' I22 b
� - 1 /2� � - 1 /2 = c · ..., 1 1 ..., 1 2...,2 2 d � \l'd'd
(10-8)
1 I f I1 1 o r I 22 is singular, one or more variables may be deleted from the appropriate set, and the linear combinations a'X(I) and b ' X(2 ) can be expressed in terms of the reduced set. If p > rank (I1 ) = p 1 , then the nonzero canonical correlations are p� , . . . , p;, . 2
Sec.
1 0.2
Canonical Variates and Canonical Correlations
591
By the Cauchy-Schwarz inequality (2-48), Since
(2-51)
c 1 Ii l 12 I 1 2 Iz"F2 d :,;;; ( c 1 I !l12 I 1 2 I Z21 I 2 1 I 1? f2 c ) 1 12 ( d 1 d ) 1 12 pXp Iil 12 I 1 2 I2] I 2 1 I!l /2
is a
yields
(10-9)
symmetric matrix, the maximization result
� - � � � -11 /2 C :,;;; i C C C 1 �"'-1-11 /2�"'- 1 2"'-22 ll "'-2 1 "'-1 Iil 12 I 1 2 I2d I2 1 I !l f2 . A1 . 2 2 1 e Ii/ I2F I2 1 1. ( b 1 X< 2 > ) = � I
\
(10-10)
where A 1 is the largest eigenvalue of Equality occurs in (10-10) for c =e 1 , a normalized eigenvalue associated with Equality also holds in (10-9) if d is proportional to Thus, max (10-11 ) Corr a1X ( l> , a, b with2 equality occurring for a = I i{ /2 c = Ii/ /2 e 1 and with b proportional to 2 2 I2F I2F I 2 1 Iil 1 e 1 , where the sign is selected to give positive correlation. We take b = :_tz-j12 f 1 . This last correspondence follows by multiplying both sides of � - 1 /2� � - 1 � � - 1 /2 ) e = e ( "'-1 1 "'- 1 2"'-22 "'-2 1 "'- t t l ll l 1 \
' y1e ldmg
� - 1 /2� � - 1 /2 ' . "'-22 "'-2 1 "'-1 1 � - 1 /2� � - � � � - 1 12 ( � - 1 12� � - 1 /2 e ) = , ( � - 1 12 � � - 1 /2 e ) (10- 12) llt "'-22 "'-21 "'-1 1 1 "'-22 "'-2 1 "'-1 1 "'-1 2 "'-22 "'-22 "'-2 1 "'-1 1 1 � �-1 2 �"'-t-t1 /2� � - 1 "" 1 ( llt , e 1 ) . . "" 1 2 "" 22 2 1 "" 1 1 / 2 2 / of i2F I 2 1 I il e 1 -is (A 1 , f 1 ) -with f 1 f1 I2F2 I2 1 Iil i1 2 I2F2 . v1 = r; :.tz-j f2 X (2) = e{ I !l /2 X ( I ) p 'f = � . ( ) = e{ Ii/ 12 I 1 1 I!l 12 e 1 = e{ e 1 = 1, ( ) = 1. 1 a 1 X< 1 > = C 1 Iil /2 X< 1 >
b
Y
Th us, 1'f 1s an e1genva ue-e1genvector pau. for , then an eigenvalue-eigen the normalized form The sign for is chosen to give a positive vector pair for correlation. We have demonstrated that UJ and are the first pair of canonical variables and that their correlation is Also, Var U1 and similarly, Var V1 Continuing, we note that U and an arbitrary linear combination are uncorrelated if 1 2� � 1 2 0 = Cov ( U1 • c � � "'- -I It /2 X< 1 > ) - e 1� � "'-1-1 ; "'-1 1 "'-1-1 1 c - e 11 c ' At the kth stage, we require that c e 1 , e2 , . , ek - t · The maximization result (2-52) then yields 1 2� � - � �"'- t �"'- -t tJ /2 c :o::; /l kc c C1 � for c e 1 , . . . , e k - 1 z "'-1-1 / "'- t z "'- zz and by (10-8), '
.
..L
..
\
I
..L
592
Chap.
10
Canonical Correlation Analysis
with equality2 for1 e k or a = 2 I;;}12 2 e k and b I2F2 fk , as before. Thus, Uk ei i 1l f X< > and Vk = fi i2F X< >, are the kth canonical pair, and they have correlation � p� . Although we did not explicitly require the Vk to be uncorrelated, ifk =l= f � p Also, Cov(Uk, Ve ) ei i1lf2 I 1 2 I2F2 fe = 0, ifk =l= f � p since r; is a multiple of ei i1l f2 I1 2 I2F2 by (10-12). If{2)the original variables are standardized with z ( J ) = rzp> , Z�1 > , . . . ' Z�1 > ] ' 2 2 2 and z [Z� ) ' Z� ) ' ' Z� ) ] from first principles, the canonical variates are of the form Uk -- ak' zO> -- ek' p 11- 1 12 z< 1 > vk (10-13) - bk' Z(2) -- f'k p22- 1 12 z 2> Here, Cov(zO > ) . = p11 , Cov(Z-)1 /2 p22 ,- 1Cov(zO>, and f k are the e1genvectors of Pu P1 2 Pk22 P2 1P 11- 1 /2 z ) i 1, 2, ... , p. Therefore, the canonical coefficients for the standardized variables, z p> cxp>) - JLf!> ) ;v;;; ' are simply related to the canonical coefficients attached to the original variables xp > . Specifically, if a; is the coefficient vector for the kth canonical variate Uk, then a; vg2 is the coefficient vector for the kth canonical variate constructed from the standardized variables =
=
=
Sec.
1 0.2
Canonical Variates and Canonical Correlations
593
zO l . Here Vj{2 is the diagonal matrix with ith diagonal element va:;; . Similarly, 2 b� V�� is the coefficient vector for the canonical variate constructed from the set of standardized variablesY Z (2) . In2lthis case vW is the diagonal matrix with ith diag onal element va:;; = Var(Xf ) . The canonical correlations are unchanged by the standardization. However, the choice of the coefficient vectors ak , bk will not be umque 1" f P*k 2 = P*k 2+ i · The relationship between the canonical coefficients of the standardized vari ables and the canonical coefficients of the original variables follows from the spe cial structure of the matrix (see also (10-16)) I!l12 I 12 I2d I21I !l 12 (or p ],1 12 P 12 Pzi P21 PI1112 ) and, in this book, is unique to canonical correlation analysis. For example, in prin cipal component analysis, if a� is the coefficient vector for the kth principal com ponent obtained from I, then a� ( X - J.t) = a� V 1 12 z, but we cannot infer that a� V 1 /2 is the coefficient vector for the kth principal component derived from p . 0
Example 1 0. 1
(Calculating canonical variates and canonical correlations for standardized variables)
ll.O
l
2l = [Zf2l , Z�2l ]' Suppose z( ll = [Zf' l , Z�l) ]' are standardized1 lvariables and z< 2 are also standardized variables. Let = [Z< , z< l ]' and .4 i .5 .6 ] : _ , _ �_.(! _ = 5� 1 � � � � Cov (Z) = ��!_' . .3 : 1.0 .2 L P21 : P22 J .6 .4 i .2 1.0 Then 681 - .2229 p 11- 1 /2 = [ -.1.20229 1. 0681 J - .2083 P2i [ -.1.20417 083 1.0417 J and .2178 ] P1- 1/2 P12P22- 1 P2 1 P 11- 1 /2 - [ ._42371 178 _1096 The e1genvalues, p1* 2 , p*2 2 , of p 11- 1 /2 p1 2 p22- 1 p2 1p 11- 1 /2 are obtamed from - A .2178 1 -- (.4371 - A) (.1096 - A) - (2. 1 78)2 = 1 . 4 371 .2178 _ 1096 A = A 2 - .5467 A + . 0005 Z
-- 2 ---- -·- - --: -----· -
O
0
0
_
594
Chap.
10
Canonical Correlation Analysis
yielding Pt 2 = .5458 and Pi 2 = .0009. The eigenvector e 1 follows from the vector equation [ .4371 .2178 e r = (.5458) e r .2178 . 1 096 J Thus, e{ [. 8947, .4466] and .8561 ] a r = Pu- 1 /2 e r = [ .2776
959 .2292 ] [ .8561] = [ .4026] b1 P22- 1 Pzr 31 = [ ..53209 .3542 .2776 .5443 We must scale b1 so that Var(V1 ) = Var(b{Z ) = b{p22 b1 = 1 The vector [.4026, .5443]' gives oc
°26 ] = .5460 [.4026, .5443] [1 .·02 1.·20 ] [..54443 Using Y.5460 = .7389, we take 1 [ .4026 ] = [ .5448 ] bl = .7389 .5443 .7366
The first pair of canonical variates is U1 = a; z( l > = . 86zp> + .28Z �1 > VI = b{ Z(Z) = .54Zf2> + .74Zfl and their canonical correlation is Pt = v'Pf2 = v':54s8 = .74 possible between linear combinations of vari This is the largest(l)correlation ables from the z and z (Z) sets. The second canonical correlation, Pi = v:oOo9 = .03, is very small, and consequently, the second pair of canonical variates, although uncorre lated with members of the first pair, conveys very little information about the association between sets. (The calculation of the second pair of canonical variates is considered in Exercise 10. 5 . )
Sec.
1 0.3
Interpreting the Population Canonical Variables
595
We note that U and V , apart from a scale change, are not much dif ferent from the pair 1 1 z p > J = 3zp> + Z�1 > if1 = a' z< 1 > = [3 , 1] [ z!t> zz1z>> v1 = b'z = [1 , 1] [ f J = z F > + Z�2 > For these variates, Var( U1 ) = a' p11 a = 12.4 Var(V1 ) = b' p22 b = 2.4 Cov(U1 , V1 ) = a' p1 2 b = 4.0 and 4.0 = 73 Vi2.4 VzA . The correlation between the rather simple and, perhaps, easily interpretable linear combinations U1 , V1 is almost the maximum value Pt = .74. • The procedure for obtaining the canonical variates presented in Result 10. 1 has certain advantages. The symmetric matrices, whose eigenvectors determine the canonical coefficients, are readily handled2 by computer routines. Moreover, writing 2 = the coefficient vectors as a k = I !ll ek and bk I2}1 fk facilitates analytic descriptions and their geometric interpretations. To ease the computational burden, many people prefer to get the canonical correlations from the eigenvalue equation (10-15) The coefficient vectors a and b follow directly from the eigenvector equations I !t1 It zi2i izt a = p * 2 a (10-16) The matrices I !l i1 2 I2i i2 1 and I2ii2 1 I !li1 2 are, in general, not symmetric. (See Exercise 10.4 for more details.) 1 0. 3 INTERPRETING THE POPULATION CANONICAL VARIABLES
Canonical variables are, in general, artificial. That is, they have no physical mean ing. If the original variables x and X (2) are) used, thez> canonical coefficients a and b have units proportional to those of the x ) = Cov(Ui , u;F2 XL1 > ). Intro ducing the (p p) diagonal matrix V]/ with kth diagonal element u;F2 , we have, in matrix terms, 1 /2 x - A� "'-' 11 1 1 (q x q ) (p Xp) (10-19) � 1 2 v22- 1 /2.' P v, x a special case of a canonical correlation when X (l) has the 1single element xp (p 1). Recall that p 1 (x l'' l - maxb Corr (X i Z (2)]
=
and the sample canonical variates become u A
(p X I)
=
A z(l).
'
z
A
v
(q X I)
=
B z skull length Head (X ) : { xp Xf1 l skull breadth length { xf2l> femur Leg ( x )· X fZ length tibia have the sample correlation matrix =
(I)
=
=
0
R�
t-:;;-tt�J �
=
l-l:.�O��---���2:d i 6� ---'��l .505 i .569
.602
.602
.467 i .926 1 .0
canonical correlation analysis of the head and leg sets of variables using R produces the two canonical correlations and corresponding pairs of variables A
Sec.
1 0.4
The Sample Canonical Variates and Sample Canonical Correlations
V1 = v-1 =
.631
605
.781z�l2l + .345z�l2l .o6od > + . 944d >
and .856d2ll + 1.106z�21 l .057 Vl!_22 == - 2.648d > + 2.475z� > Here zP l , i = 1, 2 and zFl, i = 1, 2 are the standardized data values for sets 1 and 2, respectively. The preceding results were taken from the SAS statis tical software output shown in Panel 10.1. In addition, the correlations of the original variables with the canonical variables are highlighted in that Panel. A
-
•
Example 1 0.5
(Canonical correlation analysis of job satisfaction)
As part of a larger study of the effects of organizational structure on "job sat isfaction," Dunham [4] investigated the extent to which measures of job sat isfaction are related to job characteristics. Using a survey instrument, Dunham obtained measurements of p = 5 job characteristics and q = 7 job satisfaction variables for = 784 executives from the corporate branch of a large retail merchandising corporation. Are measures of job satisfaction associated with job characteristics? The answer may have implications for job design. The2l original job characteristic variables, X (ll and job satisfaction vari ables, x< were respectively defined as n
PAN EL 1 0. 1
SAS ANALYSI S FOR EXAM PLE 10 . 4 U S I N G P ROC CANCORR.
title 'Ca n o nical Co rrelatio n Ana lysis'; d ata sku l l (type corr); _type_ = 'CORR'; i n p ut_na m e_ $ x 1 x2 x3 x4; ca rds; x 1 1 .0 x2 .505 1 .0 x3 .422 1 .0 . 569 x4 .602 . 467 .926 1 .0 =
proc cancorr d ata sku l l vprefix var x 1 x2; with x3 x4; =
=
head wprefix
PROGRAM COMMANDS
=
leg;
(continued)
606
Chap.
10
Canonical Correlation Analysis PAN El l O. l
2
(continued)
I
I
Ca nonical Correlati o n Ana lysis Approx Adj usted Canon ical Canon ica l Sta n d a rd Corre l ation E rror Correlation
I I
0 ..
6310�5.1
· 0.056794
I
0.6282 9 1
S q u a red c a n o n ical
Co rrelation 0 .398268 0 .003226
0.036286 0.060 1 08
tfipiEmtf�r �he 'VAR' V� riabl es i
Raw Ca!l9!lica i 6338
··· o:o6o��l)8775 0.943948961
2.47493889 1 3
Canon ica l Struct u re
Correlations Between the 'VAWVa riables and Their Ca nonical Variables X1
Correlations
X2
HEAD1 0.9548 0.7388
H EAD2 - 0.2974 0.6739
(see ( 1 0-3 4 ) )
Between the 'WITH' Va riables and Thei r Canonica l Variables X3 X4
and
LEG 1 0.9343 0.9997
LEG2 - 0.3564 0.0227
(see ( 1 0-3 4 ) )
Oorf�l�tlons !3e�een · · f h� 'VA R ' va ri a bl�s · ·
the Canonical Variables of the 'WITH' Variables X1 X2
�nd �ne p��pqi9al
LE G 1 0.6025 0. 4663
LEG2 - 0 .0 1 69 0.0383
(see ( 1 0-3 4 ) )
· ��rJI!.I)Ie�. o:f.IH�· .�\IA.Fl' Y� ti� bles
Correlations Between the 'WITH' Va riables
X3 X4
HEAD1 0.5897 0.6309
H EAD2 - 0.0202 0.00 1 3
(see ( 1 0-3 4 ) )
Sec.
1 0.4
The Sample Canonical Variates and Sample Canonical Correlations
607
feedback task significance 1 ) task variety x< = task identity autonomy supervisor satisfaction x (2) career-future satisfaction x's . The sample correlation between the two indices U1 and V1 is iJi = .55. There appears to be some overlap between job characteristics and job satisfaction. We explore this issue further in Example 10.7. • 1•
1,
610
Chap.
10
Canonical Correlation Analysis
Scatter plots of the first ( U1 , l\ ) pair may reveal atypical observations xi requiring further study. If the canq_ nic�l corr�lations M, p;, ... are also moderately large, scatter plots of the pairs ( U2 , V2) , ( U3 , V3) , may also be helpful in this respect. Many analysts suggest plotting "significant" canonical variates against their component variables as an aid in subject-matter interpretation. These plots rein force the correlation coefficients in (10-34). If the sample size is large, it is often desirable to split the sample in half. The first half of the sample can be used to construct and evaluate the sample canoni cal variates and canonical correlations. The results can then be "validated" with the remaining observations. The change (if any) in the nature of the canonical analysis will provide an indication of the sampling variability and the stability of the conclusions. • • •
1 0. 5 ADDITIONAL SAMPLE DESCRIPTIVE MEASURES
If the canonical variates are "good" summaries of their respective sets of variables, then the associations between variables can be described in terms of the canonical variates and their correlations. It is useful to have summary measures of the extent to which the canonical variates account for the variation in their respective sets. It is also useful, on occasion, to calculate the proportion of variance in one set of vari ables explained by the canonical variates of the other set. Matrices of Errors of Approximations
A and B defined in (1Q-32),)et a ang b of the first r canonical variates 0 , 02, , 0, �ith1 their component variables xp , X�1�, X11 > . �imilarly, the first 1r columns of B contain the sample covariances of V 1 , V2 , , V, with their component variables. If only the first r canonical pairs are used, so that for instance, • • •
• •
.,._,
• • •
x- ( 1 )
_- [ A
: a ( I ) : aA ( 2) ::
and
·.
. :: aA (r) J
[ S� ] :
A
u,
(10-38 )
then S 1 2 is approximated by sample Cov (x ( l> , x ) . Continuing, we see that the matrices of errors of approximation are S l l - (a (l) a ( l ) l
+
a (2) a (2) '
+ ... +
a (r) a (r) ' )
= a (r + l ) a (r + l) '
+ ... +
a (P) a (P) '
s 22 - ( b ( l )b (l )l
+
b (2 )b (2 ) 1
+ ... +
b(r)b(r)l )
=
b (r+ l )b (r + 1 ) 1
+ ... +
b(q) b (q)l
S 1 2 - c rr a ( l> b < 1 > '
+
iJ� a ( 2)b '
+ ... +
r ; a (r> b a�1 ) , + a�2la�2) , + . . . + a <j'> a , ) = P
(10-41a)
614
Chap.
10
Canonical Correlation Analysis
Total (standardized) sample variance in second set = tr (R 22 ) = tr ( bzo>t;z< 1 >' + b(z2)b(z2l ' + + b(zq ) b(zq) ' ) · · ·
=
q
(10-41b)
Since the correlations in the first r < p columns of A� 1 and :8; 1 involve only the ,... ,... ,... ,... ,... sample canonical variates U1 , U2 , , U, and V1 , V2 , , V,, respectively, we define the contributions of the first r canonical variates to the total (standardized) sample variances as "
• . .
. . •
l tr ( 3 (z1 ) 3 (l) z
""r � ""P r � z(l) + 3 (z2) 3 (z2) ' + . . . + 3 (z') 3 (z') ' ) = � i= l k=
tr ( b�l) b�1 ) '
+ b�2l b�2) ' + . . . + b �') b�'l' ) = ± f r�' . zf>
and
l
u. ,,
i= l k= !
(
k
)
The proportions of total (standardized) sample variances "explained by" the first r canonical variates then become
R z(l > i u, . u, . . . . , u, = 2
proportion of total standardized sample varian,_ce i� first se; explained by U 1 , U2 , . . . , U, 1 tr ( 3 (z ) 3 oz >, + . . . + a� (r) . a� (.r) ' ) r p "" "" £../ � r o,. z ll) i= 1 k= l p k
and
RzI'l- l 2
V1 , V2 , A
A
. . . , V, A
=
(
proportion of total standardized sample varianc� in �econd �et explained by V1 , V2 , . . . , V, tr ( b(zl ) b(zJ ) , + . . . + b(z') b(z') ' )
)
(10-42)
q
Descriptive measures (10-42) provide some indication of how well the canon ical variates represent their respective sets. They provide single-number descrip tions of the matrices of errors. In particular,
Sec.
1 -
p
tr [R 1 1 - aA (l ) aA (l) ' - aA (2 ) aA ( 2) ' z
z
z
z
• • •
1 0. 6
Large-Sample Inferences
- aA (r) aA (r) ' ] z
z
1 - R2t' ' l . z
61 5
u, , u, . . . . , u, •
•
according to (10-41) and (10-42). Example 1 0.7
(Calculating proportions of sample variance explained by canonical variates)
Consider the job characteristic-job satisfaction data discussed in Example 10.5. Using the table of sample correlation coefficients presented in that example, we find that 1 5 1 2 2 2 ro , , z\' ' = 5 [ (.83) + (.74) + . . . + (.85) ] R;o,ID, = 5
�
.58
1 7 1 2 2 2 = ' R; x;q (a)
pq
(10-44)
where x; q (a) is the upper (100a)th percentile of a chi-square distribution with d.f. If the null hypothesis H0 : I 12 = 0 (p{ = Pi = · · · = p: = 0) is rejected, it is natural to examine the "significance" of the individual canonical correlations. Since the canonical correlations are ordered from the largest to the smallest, we can begin by assuming that the first canonical correlation is nonzero and the remaining - 1 canonical correlations are zero. If this hypothesis is rejected, we assume that the first two canonical correlations are nonzero, but the remaining - 2 canonical correlations are zero, and so forth. Let the implied sequence of hypotheses be
p
p
Hg : p{ H� : p/
-=/= -=/=
0, Pi
-=/=
..
0, . , p:
0 for some i
�
-=/=
0, p:
k+1
+1
=
···
= p1� = 0
(10-45 )
Sec.
1 0.6
Large-Sample Inferences
61 7
Bartlett [2] has argued that the kth hypothesis in (10-45) can be tested by the like lihood ratio criterion. Specifically,
- ( n - 1 - .!2 (p + q + 1) ) In i=k+ IT I (1 - p;*2 ) > Xfr k)(q - k) ( a )
Reject H�"l at significance level a if
(pXfp - k) (q - k) ( a)
(10-46)
where is the upper (100a)th percentile of a chi-square distribution with - k) (q - k) d.f. We point out that the test statistic in (10-46) involves p II (1 - P;* 2 ) , the "residual" after the first k sample canonical correlations have
i=k+
I
been removed from the total criterion A
H