An Introduction to Applied Multivariate Analysis


Author: Tenko Raykov | George A. Marcoulides


Raykov/Introduction to Applied Multivariate Analysis RT20712_C000 Final Proof page i 2.2.2008 2:54pm Compositor Name: BMani

An Introduction to Applied Multivariate Analysis

Tenko Raykov George A. Marcoulides

New York London


Routledge Taylor & Francis Group 270 Madison Avenue New York, NY 10016

Routledge Taylor & Francis Group 2 Park Square Milton Park, Abingdon Oxon OX14 4RN

© 2008 by Taylor & Francis Group, LLC Routledge is an imprint of Taylor & Francis Group, an Informa business Printed in the United States of America on acid-free paper 10 9 8 7 6 5 4 3 2 1 International Standard Book Number-13: 978-0-8058-6375-8 (Hardcover) Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Library of Congress Cataloging-in-Publication Data Introduction to applied multivariate analysis / by Tenko Raykov & George A. Marcoulides. p. cm. Includes bibliographical references and index. ISBN-13: 978-0-8058-6375-8 (hardcover) ISBN-10: 0-8058-6375-3 (hardcover) 1. Multivariate analysis. I. Raykov, Tenko. II. Marcoulides, George A. QA278.I597 2008 519.5’35--dc22 Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the Psychology Press Web site at http://www.psypress.com

2007039834


Contents

Preface

Chapter 1 Introduction to Multivariate Statistics
1.1 Definition of Multivariate Statistics
1.2 Relationship of Multivariate Statistics to Univariate Statistics
1.3 Choice of Variables and Multivariate Method, and the Concept of Optimal Linear Combination
1.4 Data for Multivariate Analyses
1.5 Three Fundamental Matrices in Multivariate Statistics
  1.5.1 Covariance Matrix
  1.5.2 Correlation Matrix
  1.5.3 Sums-of-Squares and Cross-Products Matrix
1.6 Illustration Using Statistical Software

Chapter 2 Elements of Matrix Theory
2.1 Matrix Definition
2.2 Matrix Operations, Determinant, and Trace
2.3 Using SPSS and SAS for Matrix Operations
2.4 General Form of Matrix Multiplications With Vector, and Representation of the Covariance, Correlation, and Sum-of-Squares and Cross-Product Matrices
  2.4.1 Linear Modeling and Matrix Multiplication
  2.4.2 Three Fundamental Matrices of Multivariate Statistics in Compact Form
2.5 Raw Data Points in Higher Dimensions, and Distance Between Them

Chapter 3 Data Screening and Preliminary Analyses
3.1 Initial Data Exploration
3.2 Outliers and the Search for Them
  3.2.1 Univariate Outliers
  3.2.2 Multivariate Outliers
  3.2.3 Handling Outliers: A Revisit
3.3 Checking of Variable Distribution Assumptions
3.4 Variable Transformations

Chapter 4 Multivariate Analysis of Group Differences
4.1 A Start-Up Example
4.2 A Definition of the Multivariate Normal Distribution
4.3 Testing Hypotheses About a Multivariate Mean
  4.3.1 The Case of Known Covariance Matrix
  4.3.2 The Case of Unknown Covariance Matrix
4.4 Testing Hypotheses About Multivariate Means of Two Groups
  4.4.1 Two Related or Matched Samples (Change Over Time)
  4.4.2 Two Unrelated (Independent) Samples
4.5 Testing Hypotheses About Multivariate Means in One-Way and Higher Order Designs (Multivariate Analysis of Variance, MANOVA)
  4.5.1 Statistical Significance Versus Practical Importance
  4.5.2 Higher Order MANOVA Designs
  4.5.3 Other Test Criteria
4.6 MANOVA Follow-Up Analyses
4.7 Limitations and Assumptions of MANOVA

Chapter 5 Repeated Measure Analysis of Variance
5.1 Between-Subject and Within-Subject Factors and Designs
5.2 Univariate Approach to Repeated Measure Analysis
5.3 Multivariate Approach to Repeated Measure Analysis
5.4 Comparison of Univariate and Multivariate Approaches to Repeated Measure Analysis

Chapter 6 Analysis of Covariance
6.1 Logic of Analysis of Covariance
6.2 Multivariate Analysis of Covariance
6.3 Step-Down Analysis (Roy–Bargmann Analysis)
6.4 Assumptions of Analysis of Covariance

Chapter 7 Principal Component Analysis
7.1 Introduction
7.2 Beginnings of Principal Component Analysis
7.3 How Does Principal Component Analysis Proceed?
7.4 Illustrations of Principal Component Analysis
  7.4.1 Analysis of the Covariance Matrix Σ (S) of the Original Variables
  7.4.2 Analysis of the Correlation Matrix P (R) of the Original Variables
7.5 Using Principal Component Analysis in Empirical Research
  7.5.1 Multicollinearity Detection
  7.5.2 PCA With Nearly Uncorrelated Variables Is Meaningless
  7.5.3 Can PCA Be Used as a Method for Observed Variable Elimination?
  7.5.4 Which Matrix Should Be Analyzed?
  7.5.5 PCA as a Helpful Aid in Assessing Multinormality
  7.5.6 PCA as "Orthogonal" Regression
  7.5.7 PCA Is Conducted via Factor Analysis Routines in Some Software
  7.5.8 PCA as a Rotation of Original Coordinate Axes
  7.5.9 PCA as a Data Exploratory Technique

Chapter 8 Exploratory Factor Analysis
8.1 Introduction
8.2 Model of Factor Analysis
8.3 How Does Factor Analysis Proceed?
  8.3.1 Factor Extraction
    8.3.1.1 Principal Component Method
    8.3.1.2 Maximum Likelihood Factor Analysis
  8.3.2 Factor Rotation
    8.3.2.1 Orthogonal Rotation
    8.3.2.2 Oblique Rotation
8.4 Heywood Cases
8.5 Factor Score Estimation
  8.5.1 Weighted Least Squares Method (Generalized Least Squares Method)
  8.5.2 Regression Method
8.6 Comparison of Factor Analysis and Principal Component Analysis

Chapter 9 Confirmatory Factor Analysis
9.1 Introduction
9.2 A Start-Up Example
9.3 Confirmatory Factor Analysis Model
9.4 Fitting Confirmatory Factor Analysis Models
9.5 A Brief Introduction to Mplus, and Fitting the Example Model
9.6 Testing Parameter Restrictions in Confirmatory Factor Analysis Models
9.7 Specification Search and Model Fit Improvement
9.8 Fitting Confirmatory Factor Analysis Models to the Mean and Covariance Structure
9.9 Examining Group Differences on Latent Variables

Chapter 10 Discriminant Function Analysis
10.1 Introduction
10.2 What Is Discriminant Function Analysis?
10.3 Relationship of Discriminant Function Analysis to Other Multivariate Statistical Methods
10.4 Discriminant Function Analysis With Two Groups
10.5 Relationship Between Discriminant Function and Regression Analysis With Two Groups
10.6 Discriminant Function Analysis With More Than Two Groups
10.7 Tests in Discriminant Function Analysis
10.8 Limitations of Discriminant Function Analysis

Chapter 11 Canonical Correlation Analysis
11.1 Introduction
11.2 How Does Canonical Correlation Analysis Proceed?
11.3 Tests and Interpretation of Canonical Variates
11.4 Canonical Correlation Approach to Discriminant Analysis
11.5 Generality of Canonical Correlation Analysis

Chapter 12 An Introduction to the Analysis of Missing Data
12.1 Goals of Missing Data Analysis
12.2 Patterns of Missing Data
12.3 Mechanisms of Missing Data
  12.3.1 Missing Completely at Random
  12.3.2 Missing at Random
  12.3.3 Ignorable Missingness and Nonignorable Missingness Mechanisms
12.4 Traditional Ways of Dealing With Missing Data
  12.4.1 Listwise Deletion
  12.4.2 Pairwise Deletion
  12.4.3 Dummy Variable Adjustment
  12.4.4 Simple Imputation Methods
  12.4.5 Weighting Methods
12.5 Full Information Maximum Likelihood and Multiple Imputation
12.6 Examining Group Differences and Similarities in the Presence of Missing Data
  12.6.1 Examining Group Mean Differences With Incomplete Data
  12.6.2 Testing for Group Differences in the Covariance and Correlation Matrices With Missing Data

Chapter 13 Multivariate Analysis of Change Processes
13.1 Introduction
13.2 Modeling Change Over Time With Time-Invariant and Time-Varying Covariates
  13.2.1 Intercept-and-Slope Model
  13.2.2 Inclusion of Time-Varying and Time-Invariant Covariates
  13.2.3 An Example Application
  13.2.4 Testing Parameter Restrictions
13.3 Modeling General Forms of Change Over Time
  13.3.1 Level-and-Shape Model
  13.3.2 Empirical Illustration
  13.3.3 Testing Special Patterns of Growth or Decline
  13.3.4 Possible Causes of Inadmissible Solutions
13.4 Modeling Change Over Time With Incomplete Data

Appendix: Variable Naming and Order for Data Files
References
Author Index
Subject Index



Preface

Having taught applied multivariate statistics for a number of years, we have been impressed by the broad spectrum of topics that one may be expected to typically cover in a graduate course for students from departments outside of mathematics and statistics. Multivariate statistics has developed over the past few decades into a very extensive field that is hard to master in a single course, even for students aiming at methodological specialization in commonly considered applied fields, such as those within the behavioral, social, and educational disciplines. To meet this challenge, we tried to identify a core set of topics in multivariate statistics, which would be both of fundamental relevance for its understanding and at the same time would allow the student to move on to more advanced pursuits. This book is a result of this effort.

Our goal is to provide a coherent introduction to applied multivariate analysis, which would lay down the basics of the subject that we consider of particular importance in many empirical settings in the social and behavioral sciences. Our approach is based in part on emphasizing, where appropriate, analogies between univariate statistics and multivariate statistics. Although aiming, in principle, at a relatively nontechnical introduction to the subject, we were not able to avoid the use of mathematical formulas, but we employ these primarily in their definitional meaning rather than as elements of proofs or related derivations.

The targeted audience who will find this book most beneficial consists primarily of graduate students, advanced undergraduate students, and researchers in the behavioral, social, as well as educational disciplines, who have limited or no familiarity with multivariate statistics. As prerequisites for this book, an introductory statistics course with exposure to regression analysis is recommended, as is some familiarity with two of the most widely circulated statistical analysis software packages, SPSS and SAS. Without the use of computers, we find that an introduction to applied multivariate statistics is not possible in our technological era, and so we employ extensively these popular packages, SPSS and SAS. In addition, for the purposes of some chapters, we utilize the latent variable modeling program Mplus, which is increasingly used across the social and behavioral sciences. On the book-specific website, www.psypress.com/applied-multivariate-analysis, we supply essentially all data used in the text. (See Appendix for name of data file and of its variables, as well as their order as columns within it.) To aid with clarity, the software code (for SAS and Mplus) or sequence of analytic/menu option selection (for SPSS) is also presented and discussed at appropriate places in the book.

We hope that readers will find this text offering them a useful introduction to and a basic treatment of applied multivariate statistics, as well as preparing them for more advanced studies of this exciting and comprehensive subject. A feature that seems to set apart the book from others in this field is our use of latent variable modeling in later chapters to address some multivariate analysis questions of special interest in the behavioral and social disciplines. These include the study of group mean differences on unobserved (latent) variables, testing of latent structure, and some introductory aspects of missing data analysis and longitudinal modeling.

Many colleagues have at least indirectly helped us in our work on this project. Tenko Raykov acknowledges the skillful introduction to multivariate statistics years ago by K. Fischer and R. Griffiths, as well as many valuable discussions on the subject with S. Penev and Y. Zuo. George A. Marcoulides is most grateful to H. Loether, B. O. Muthén, and D. Nasatir under whose stimulating tutelage many years ago he was first introduced to multivariate analysis. We are also grateful to C. Ames and R. Prawat from Michigan State University for their instrumental support in more than one way, which allowed us to embark on the project of writing this book. Thanks are also due to L. K. Muthén, B. O. Muthén, T. Asparouhov, and T. Nguyen for valuable instruction and discussions on applications of latent variable modeling. We are similarly grateful to P. B. Baltes, F. Dittmann-Kohli, and R. Kliegl for generously granting us access to data from their project "Aging and Plasticity in Fluid Intelligence," parts of which we adopt for our method illustration purposes in several chapters of the book.

Many of our students provided us with very useful feedback on the lecture notes we first developed for our courses in applied multivariate statistics, from which this book emerged. We are also very grateful to Douglas Steinley, University of Missouri-Columbia; Spiridon Penev, University of New South Wales; and Tim Konold, University of Virginia for their critical comments on an earlier draft of the manuscript, as well as to D. Riegert and R. Larsen from Lawrence Erlbaum Associates, and R. Tressider of Taylor & Francis, for their essential assistance during advanced stages of our work on this project. Last but not least, we are more than indebted to our families for their continued support in lots of ways. Tenko Raykov thanks Albena and Anna, and George A. Marcoulides thanks Laura and Katerina.

Tenko Raykov
East Lansing, Michigan

George A. Marcoulides
Riverside, California


1 Introduction to Multivariate Statistics

One of the simplest conceptual definitions of multivariate statistics (MVS) is as a set of methods that deal with the simultaneous analysis of multiple outcome or response variables, frequently also referred to as dependent variables (DVs). This definition of MVS suggests an important relationship to univariate statistics (UVS) that may be considered a group of methods dealing with the analysis of a single DV. In fact, MVS not only exhibits similarities with UVS but can also be considered an extension of it, or conversely UVS can be viewed as a special case of MVS. At the same time, MVS and UVS have a number of distinctions, and this book deals with many of them whenever appropriate.

In this introductory chapter, our main objective is to discuss, from a principled standpoint, some of the similarities and differences between MVS and UVS. More specifically, we (a) define MVS; then (b) discuss some relationships between MVS and UVS; and finally (c) illustrate the use of the popular statistical software SPSS and SAS for a number of initial multiple variable analyses, including obtaining covariance, correlation, and sum-of-squares and cross-product matrices. As will be observed repeatedly throughout the book, these are three matrices of variable interrelationship indices, which play a fundamental role in many MVS methods.
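As a rough illustration of the three matrices just named, the following Python sketch computes them for a tiny made-up data set. The data values and the function names (`sscp`, `covariance`, `correlation`) are hypothetical and are not taken from the book, which works with SPSS and SAS. The sketch also assumes the mean-centered (deviation) form of the sum-of-squares and cross-product matrix.

```python
import math

def column_means(data):
    """Mean of each variable (column) in a list-of-rows data set."""
    n = len(data)
    return [sum(row[j] for row in data) / n for j in range(len(data[0]))]

def sscp(data):
    """Sums of squares and cross-products of deviations from the means (assumed form)."""
    means = column_means(data)
    p = len(means)
    return [[sum((row[j] - means[j]) * (row[k] - means[k]) for row in data)
             for k in range(p)] for j in range(p)]

def covariance(data):
    """Sample covariance matrix: the deviation SSCP matrix divided by n - 1."""
    n = len(data)
    return [[q / (n - 1) for q in q_row] for q_row in sscp(data)]

def correlation(data):
    """Correlation matrix: covariances rescaled by the standard deviations."""
    s = covariance(data)
    sd = [math.sqrt(s[j][j]) for j in range(len(s))]
    return [[s[j][k] / (sd[j] * sd[k]) for k in range(len(s))]
            for j in range(len(s))]

# Hypothetical scores of five persons on two interrelated measures.
scores = [[2.0, 1.0], [4.0, 3.0], [6.0, 4.0], [8.0, 7.0], [10.0, 10.0]]

Q = sscp(scores)         # sum-of-squares and cross-product matrix
S = covariance(scores)   # covariance matrix
R = correlation(scores)  # correlation matrix
```

All three matrices are symmetric and carry the same information about how the variables relate to one another, only on different scales: `S` is `Q` divided by n - 1, and `R` is `S` rescaled so that its diagonal equals 1.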

1.1 Definition of Multivariate Statistics

Behavioral, social, and educational phenomena are often multifaceted, multifactorially determined, and exceedingly complex. Any systematic attempt to understand them, therefore, will typically require the examination of multiple dimensions that are usually intertwined in complicated ways. For these reasons, researchers need to evaluate a number of interrelated variables capturing specific aspects of phenomena under consideration. As a result, scholars in these sciences commonly obtain and have to deal with data sets that contain measurements on many interdependent dimensions, which are collected on subjects sampled from the studied populations. Consequently, in empirical behavioral and social studies, one is very often faced with data sets consisting of multiple interrelated variables that have been observed on a number of persons, and possibly on samples from different populations.

MVS is a scientific field, which for many purposes may be viewed as a branch of mathematics and has been developed to meet these complex challenges. Specifically, MVS represents a multitude of statistical methods to help one analyze potentially numerous interrelated measures considered together rather than separately from one another (i.e., one at a time). Researchers typically resort to using MVS methods when they need to analyze more than one dependent (response, or outcome) variable, possibly along with one or more independent (predictor, or explanatory) variables, which are in general all correlated with each other.

Although the concepts of independent variables (IVs) and DVs are generally well covered in most introductory statistics and research methods treatments, for the aims of this chapter, we deem it useful to briefly discuss them here. IVs are typically different conditions to which subjects might be exposed, or reflect specific characteristics that studied persons bring into a research situation. For example, socioeconomic status (SES), educational level, age, gender, teaching method, training program or treatment are oftentimes considered IVs in various empirical settings. Conversely, DVs are those that are of main interest in an investigation, and whose examination is of focal interest to the researcher. For example, intelligence, aggression, college grade point average (GPA) or Graduate Record Exam (GRE) score, performance on a reading or writing test, math ability score or computer aptitude score can be DVs in a study aimed at explaining variability in any of these measures in terms of some selected IVs. More specifically, the IVs and DVs are defined according to the research question being asked.
For this reason, it is possible that a variable that is an IV for one research query may become a DV for another one, or vice versa. Even within a single study, it is not unlikely that a DV for one question of interest changes status to an IV, or conversely, when pursuing another concern at a different point during the study.

To give an example involving IVs and DVs: suppose an educational scientist were interested in comparing the effectiveness of two teaching methods, a standard method and a new method of teaching number division. To this end, two groups of students are randomly assigned to the new and to the standard method. Assume that a test of number division ability was administered to all students who participated in the study, and that the researcher was interested in explaining the individual differences observed then. In this case, the score on the division test would be a DV. If the scientist had measured initial arithmetic ability as well as collected data on student SES or even hours watching television per week, then all three of these variables may be potential IVs. The particular posited question appears to be relatively simple and in fact may be addressed straightforwardly using UVS as it is phrased in terms of a single DV, namely the score obtained on the number division test.

However, if the study was carried out in such a way that measurements of division ability were collected for each student on each of three consecutive weeks after the two teaching methods were administered, and in addition data on hours of television watched were gathered in each of these weeks, then this question becomes considerably more complicated. This is because there are now three measures of interest (division ability in each of the 3 weeks of measurement), and it may be appropriate to consider them all as DVs. Furthermore, because these measures are taken on the same subjects in the study, they are typically interrelated. In addition, when addressing the original question about comparative effectiveness of the new teaching method relative to the old one, it would make sense at least in some analyses to consider all three so-obtained division ability scores simultaneously. Under these circumstances, UVS cannot provide the sought answer to the research query. This is when MVS is typically used, especially where the goal is to address complicated research questions that cannot be answered directly with UVS.

A main reason for this MVS preference in such situations is that UVS represents a set of statistical methods to deal with just a single DV, while there is effectively no limit on the number of IVs that might be considered. As an example, consider the question of whether observed individual differences in average university freshman grades could be explained by differences in their SAT scores. In this case, the DV is freshman GPA, while the SAT score would play the role of an IV, possibly in addition to, say, gender, SES, and type of high school attended (e.g., public vs. private), in case a pursued research question requires consideration of these as further IVs.
There are many UVS and closely related methods that can be used to address an array of queries varying in their similarity to this one. For example, for prediction goals, one could consider using regression analysis (simple or multiple regression, depending on the number of IVs selected), including the familiar t test. When examination of mean differences across groups is of interest, a traditionally utilized method would be analysis of variance (ANOVA) as well as analysis of covariance (ANCOVA), either approach being a special case of regression analysis. Depending on the nature of available data, one may consider a chi-square analysis of, say, two-way frequency tables (e.g., for testing association of two categorical variables). When certain assumptions are markedly violated, in particular the assumption of normality, and depending on other study aspects, one may also consider nonparametric statistical methods. Most of the latter methods share the common feature that they are typically considered for application when for certain research questions one identifies single DVs.
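The remark that a comparison of group means is a special case of regression analysis can be checked numerically in the simplest setting of two groups. The Python sketch below uses made-up scores and hypothetical variable names. It assumes the pooled-variance (equal variances) form of the t test and a 0/1 dummy-coded group indicator as the single IV.

```python
import math

# Hypothetical test scores for two groups (e.g., two teaching methods).
group0 = [4.0, 5.0, 6.0, 5.0]
group1 = [7.0, 8.0, 6.0, 9.0]

def mean(values):
    return sum(values) / len(values)

def sum_sq_dev(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

# Independent-samples t statistic with pooled variance.
n0, n1 = len(group0), len(group1)
pooled_var = (sum_sq_dev(group0) + sum_sq_dev(group1)) / (n0 + n1 - 2)
t_test = (mean(group1) - mean(group0)) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

# Simple regression of the score (DV) on a 0/1 group indicator (IV).
x = [0.0] * n0 + [1.0] * n1
y = group0 + group1
sxx = sum_sq_dev(x)
sxy = sum((xi - mean(x)) * (yi - mean(y)) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean(y) - slope * mean(x)
residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
slope_se = math.sqrt(residual_ss / (len(y) - 2) / sxx)
t_regression = slope / slope_se
```

In this sketch the regression slope equals the difference between the two group means, the intercept equals the mean of the group coded 0, and the two t statistics coincide up to rounding error.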




By way of contrast, MVS may be viewed as an extension of UVS in the case where one is interested in studying multiple DVs that are interrelated, as they commonly would be in empirical research in the behavioral, social, and educational disciplines. For this reason, MVS typically deals in applications with fairly large data sets on potentially many subjects and in particular on possibly numerous interrelated variables of main interest. Due to this complexity, the underlying theme behind many MVS methods is also simplification, for instance, the reduction of the complexity of available data to several meaningful indices, quantities (parameters), or dimensions. To give merely a sense of the range of questions addressed with MVS, let us consider a few simple examples—we of course return to these issues in greater detail later in this book. Suppose a researcher is interested in determining which characteristics or variables differentiate between achievers and nonachievers in an educational setting. As will be discussed in Chapter 10, these kinds of questions can be answered using a technique called discriminant function analysis (or discriminant analysis for short). Interestingly, discriminant analysis can also be used to predict group membership—in the currently considered example, achievers versus nonachievers—based on the knowledge obtained about differentiating characteristics or variables. Another research question may be whether there is a single underlying (i.e., unobservable, or so-called latent) dimension along which students differ and which is responsible for their observed interrelationships on some battery of intelligence tests. Such a question can be attended to using a method called factor analysis (FA), discussed in Chapters 8 and 9. 
As another example, one may be concerned with finding out whether a set of interrelated tests can be decomposed into groups of measures and accordingly new derived measures obtained so that they account for most of the observed variance of the tests. For these aims, a method called principal component analysis (PCA) is appropriate, to which we turn in Chapter 7. Further, if one were interested in whether there are mean differences between several groups of students exposed to different training programs, say with regard to their scores on a set of mathematical tasks—possibly after accounting for initial differences on algebra, geometry, and trigonometry tests—then multivariate analysis of variance (MANOVA) and multivariate analysis of covariance (MANCOVA) would be applicable. These methods are the subject of Chapters 4 and 6. When of concern are group differences on means of unobserved variables, such as ability, intelligence, neuroticism, or aptitude, a specific form of what is referred to as latent variable modeling could be used (Chapter 9). Last but not least, when studied variables have been repeatedly measured, application of special approaches of ANOVA or latent variable modeling can be considered, as covered in Chapters 5 and 13. All these examples are just a few of the kinds of questions that can be addressed using MVS, and the remaining chapters in this book are
devoted to their discussion. The common theme unifying the research questions underlying these examples is the necessity to deal with potentially multiple correlated variables in such a way that their interrelationships are taken into account rather than ignored.

1.2 Relationship of Multivariate Statistics to Univariate Statistics

The preceding discussion provides leads to elaborate further on the relationship between MVS and UVS. First, as indicated previously, MVS may be considered an extension of UVS to the case of multiple, and commonly interrelated, DVs. Conversely, UVS can be viewed as a special case of MVS, which is obtained when the number of analyzed DVs is reduced to just one. This relationship is additionally highlighted by the observation that for some UVS methods, there is an MVS analog or multivariate generalization. For example, traditional ANOVA is extended to MANOVA in situations involving more than one outcome variable. Similarly, conventional ANCOVA is generalized to MANCOVA whenever more than a single DV is examined, regardless of the number of covariates involved. Further, multiple regression generalizes to multivariate multiple regression (general linear model) in the case of more than one DV. This type of regression analysis may also be viewed as path analysis or structural equation modeling with observed variables only, for which we refer to a number of alternative treatments in the literature (see Raykov & Marcoulides, 2006, for an introduction to the subject). Also, the idea underlying the widely used correlation coefficient, for example, in the context of a bivariate correlation analysis or simple linear regression, is extended to that of canonical correlation. In particular, using canonical correlation analysis (CCA) one may examine the relationships between sets of what may be viewed, for the sake of this example, as multiple IVs and multiple DVs. With this perspective, CCA could in fact be considered as encompassing all MVS methods mentioned so far in this section, with the latter being obtained as specifically defined special cases of CCA.
With multiple DVs, a major distinctive characteristic of MVS relative to UVS is that the former lets one perform a single, simultaneous analysis pertaining to the core of a research question. This approach is in contrast to a series of univariate or even bivariate analyses, like regressions with a single DV, correlation estimation for all pairs of analyzed variables, or ANOVA/ANCOVA for each DV considered in turn (i.e., one at a time). Even though we often follow such a simultaneous multivariate test with further and more focused analyses, the benefit of using MVS is that no matter how many outcome variables are analyzed the overall Type I error
rate is kept at a prespecified level, usually .05 or more generally the one at which the multivariate test is carried out. As a conceivable alternative, one might contemplate conducting multiple univariate analyses, one per DV. However, that approach will be associated with a higher (family-wise) Type I error rate due to the multiple testing involved. These are essentially the same reasons for which in a group mean comparison setup, carrying out a series of t tests for all pairs of groups would be associated with a higher than nominal error rate relative to an ANOVA, and hence make the latter a preferable analytic procedure. At the same time, it is worth noting that with MVS we aim at the "big picture," namely analysis of more than one DV when considered together. This is why with MVS we rarely get "as close" to data as we can with UVS, because we typically do not pursue as focused an analysis in MVS as we do in UVS where a single DV is of concern. We emphasize however that the center of interest, and thus of analysis, depends on the specific research question asked. For example, at any given stage of an empirical study dealing with say a teaching method comparison, we may be interested in comparing two or more methods with regard only to a single DV. In such a case, the use of UVS will be quite appropriate. When alternatively the comparison is to be carried out with respect to several DVs simultaneously, an application of MVS is clearly indicated and preferable.

In conclusion of this section, and by way of summarizing much of the preceding discussion, multivariate analyses are conducted instead of univariate analyses for the following reasons:

1. With more than one DV (say p in number, p > 1), the use of p separate univariate tests inflates the Type I error rate, whereas a pertinent multivariate test preserves the significance level.
2. Univariate tests, no matter how many in number, ignore the interrelationships possible among the DVs, unlike multivariate analyses, and hence potentially waste important information contained in the available sample of data.
3. In many cases, the multivariate test is more powerful than a corresponding univariate test, because the former utilizes the information mentioned in point 2. In such cases, we tend to trust MVS more when its results are at variance with those of UVS (as we also do when of course our concern is primarily with a simultaneous analysis of more than one DV).
4. Many multivariate tests involving means have as a by-product the construction of a linear combination of variables, which provides further information (in case of a significant outcome) about how the variables unite to reject the hypothesis; we deal with these issues in detail later in the book (Chapter 10).
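The Type I error inflation named in the first reason can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that the p univariate tests are independent and that each is run at the same level alpha; under that idealized assumption the family-wise rate is 1 - (1 - alpha)^p. With interrelated DVs, as is typical in MVS settings, the actual rate will differ, although it cannot exceed the Bonferroni bound min(1, p * alpha). Both function names are hypothetical and are not taken from the book.

```python
def familywise_error_rate(p: int, alpha: float = 0.05) -> float:
    """Chance of at least one Type I error across p independent tests at level alpha."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 - (1.0 - alpha) ** p


def bonferroni_bound(p: int, alpha: float = 0.05) -> float:
    """Upper bound on the family-wise rate that holds without assuming independence."""
    return min(1.0, p * alpha)


# Compare the two quantities for a few numbers of DVs.
for p in (1, 2, 5, 10):
    print(p, round(familywise_error_rate(p), 4), round(bonferroni_bound(p), 4))
```

Under these assumptions, even a handful of separate tests pushes the family-wise rate well above the nominal .05, which is the motivation for a single simultaneous multivariate test.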




1.3 Choice of Variables and Multivariate Method, and the Concept of Optimal Linear Combination

Our discussion so far has assumed that we have already selected the variables to be used in a given multivariate analysis. The natural question that arises now is how one actually makes this variable choice. In general, the main requirements for the selection of variables to be used in an MVS analysis, as in any univariate analysis, are those of high psychometric qualities (specifically, high validity and reliability) of the measures used as DVs and IVs, and that they pertain to the research questions being pursued, that is, are measures of aspects of a studied phenomenon that are relevant to these questions. Accordingly, throughout the rest of the book, we assume that the choice of considered variables has already been made in this way.

MVS encompasses an impressive array of analytic and modeling methods, each of which can be used to tackle certain research queries. Consequently, the next logical concern for any researcher is which one(s) of these methods to utilize. In addition to selecting from among those methods that allow answering the questions asked, the choice will also typically depend on the type of measurement of the involved DVs. Oftentimes, with continuous (or approximately so) DVs, such as reaction time, income, GPA, and intelligence test scores, a frequent choice may be made from among MANOVA or MANCOVA, FA, PCA, discriminant function analysis, CCA, or multivariate multiple regression. Alternatively, with discrete DVs (for example, answers to questions from a questionnaire that have a limited number of response options, or items of another type of multiple-component measuring instrument), a choice may often be made from among logistic regression, contingency table analysis, loglinear models, latent variable modeling with categorical outcomes (e.g., latent class analysis), or item response theory models.
Obviously, even within a comprehensive textbook, only a limited number of these methods can be adequately addressed. Having to make such a choice, we elected to center the material in the book around what we found to be, at the level aimed at in this text, the most widely used multivariate methods for analysis of continuous DVs. For discussions of methods for analyzing discrete DVs, the reader is referred to a number of alternative texts (Agresti, 2002; Lord, 1980; Muthén, 2002; Skrondal & Rabe-Hesketh, 2004). When choosing a statistical technique, in particular a multivariate method, it is irrelevant whether the data to be analyzed are experimental (i.e., resulting after random assignment of subjects and manipulation of IVs, in addition to typically exercising control upon so-called extraneous variance) or observational (i.e., obtained from responses to questionnaires or surveys). Statistical analysis will work equally well in either case. However, it is the resultant interpretation which will typically differ. In particular, potential causality attributions, if attempted, can be crucially
affected by whether the data stem from an experimental or an observational (correlational) study. A researcher is in the strongest position to possibly make causal statements in the case of an experimental study. This fundamental matter is discussed at length in the literature, and we refer the reader to alternative sources dealing with experimental design, causality and related issues (Shadish, Cook, & Campbell, 2002). Throughout this book, we will not make or imply causal statements of any form or type.

Once the particular choice of variables and statistical method(s) of analysis is made, a number of MVS techniques will optimally combine the DVs and yield a special linear combination of them with certain features of interest. More specifically, these methods find that linear combination, Y*, of the response measures Y1, Y2, ..., Yp (p > 1), which is defined as

\[
Y^{*} = w_1 Y_1 + w_2 Y_2 + \cdots + w_p Y_p \tag{1.1}
\]

and has special optimality properties (Y* is occasionally referred to as a supervariable). In particular, this constructed variable Y* may be best at differentiating between groups of subjects that are built with respect to some IV(s). As discussed in Chapter 10, this will be the case when using the technique called discriminant function analysis. Alternatively, the variable Y* defined in Equation 1.1 may possess the highest possible variance from all linear combinations of the measures Y1, Y2, ..., Yp that one could come up with. As discussed in Chapter 7, this will be the case when using PCA. As another option, Y* may be constructed so as to possess the highest correlation with another appropriately obtained supervariable, that is, a linear combination of another set of variables. This will be the case in CCA. Such optimal linear combinations are typical for MVS, and in some sense parallel the search conducted in univariate regression analysis for that linear combination of a given set of predictors, which has the highest possible correlation with a prespecified DV. Different MVS methods use specific information about the relationship between the DVs and possibly IVs in evaluating the weights w1, w2, ..., wp in Equation 1.1, so that the resulting linear combination Y* has the corresponding one of the properties mentioned above.
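To illustrate one of these optimality properties, the following sketch finds the weights for which Y* in Equation 1.1 has the largest sample variance, which is the PCA case mentioned above. It is only a sketch under stated assumptions: the data are simulated, the seed and variable names are arbitrary, and the weights are restricted to unit length (w'w = 1), the usual constraint under which such a maximum is defined. It relies on NumPy's `numpy.cov` and `numpy.linalg.eigh`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated scores of n = 200 subjects on p = 3 correlated measures (illustrative only).
common = rng.normal(size=(200, 1))
Y = common + 0.5 * rng.normal(size=(200, 3))

S = np.cov(Y, rowvar=False)            # sample covariance matrix (divisor n - 1)
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues are returned in ascending order
w = eigvecs[:, -1]                     # unit-length weights w1, ..., wp with maximal variance

y_star = Y @ w                         # the linear combination Y* of Equation 1.1
var_star = y_star.var(ddof=1)          # equals the largest eigenvalue of S

# Any other unit-length weight vector gives a variance no larger than var_star.
w_other = rng.normal(size=3)
w_other /= np.linalg.norm(w_other)
var_other = (Y @ w_other).var(ddof=1)

print(var_star, eigvals[-1], var_other)
```

Discriminant function analysis and CCA lead to analogous computations with different optimality criteria, which are not shown here.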

1.4 Data for Multivariate Analyses

In empirical behavioral, social, and educational research, MVS methods are applied to data provided by examined subjects sampled from studied populations. For the analytic procedures considered in this book, these data are typically organized in what is commonly called a data matrix. Because the notion of a data matrix is of special relevance both in UVS and MVS, we attend to it in some detail next.




TABLE 1.1
Data From Four Subjects in a General Mental Ability Study

Student   Test1   Test2   Test3   Gen   SES   MathAbTest
1         45      55      47      1     3     33
2         51      54      57      0     1     23
3         40      51      46      1     2     43
4         49      45      48      0     3     42

Note: Gen = gender; SES = socioeconomic status; MathAbTest = mathematics ability test score.

A data matrix is a rectangular array of collected (recorded) scores from studied subjects. The entries in this matrix are arranged in such a way that each row represents a given person’s data, and each column represents a variable that may be either dependent or independent in a particular analysis, in accordance with a research question asked. For example, consider a study of general mental ability, and let us assume that data are collected from a sample of students on the following six variables: (a) three tests of intelligence, denoted below Test1, Test2, and Test3; (b) a mathematics ability test; (c) information about their gender; and (d) SES. For the particular illustrative purposes here, the data on only four subjects are provided in Table 1.1. As seen from Table 1.1, for the continuous variables—in this case, the three tests of intelligence and the mathematics ability test—the actual performance scores are recorded in the data matrix. However, for discrete variables (here, gender and SES), codes for group membership are typically entered. For more details on coding schemes, especially in studies with multiple groups—a topic commonly discussed at length in most regression analysis treatments—we refer to Pedhazur (1997) or any other introductory to intermediate statistics text. Throughout the rest of this book, we treat the elements of the data matrix as discussed so far in this section. We note, however, that in some repeated assessment studies (or mixed modeling contexts) data per measurement occasion may be recorded on a single line. Such data arrangements will not be considered in the book, yet we mention that they can be readily reordered in the form of data matrices outlined above that will be of relevance in the remainder. 
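As a minimal sketch of this arrangement, the data of Table 1.1 can be held in a two-dimensional NumPy array with one row per student and one column per variable. The names `X` and `variables` are chosen here for illustration only and do not come from the book.

```python
import numpy as np

variables = ["Test1", "Test2", "Test3", "Gen", "SES", "MathAbTest"]

# Rows are students 1 to 4; columns follow the order of `variables` (values from Table 1.1).
X = np.array([
    [45, 55, 47, 1, 3, 33],
    [51, 54, 57, 0, 1, 23],
    [40, 51, 46, 1, 2, 43],
    [49, 45, 48, 0, 3, 42],
])

n, p = X.shape                            # n subjects (rows), p variables (columns)
student_3 = X[2]                          # one row: all scores of the third student
test2 = X[:, variables.index("Test2")]    # one column: Test2 scores of all students

print(n, p)
print(student_3)
print(test2)
```

Gen and SES enter the array as group-membership codes rather than performance scores, in line with the coding remark above.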
Additionally, this text will not be concerned with data that, instead of being in the form of subject performance or related scores, are determined as distances between stimuli or studied persons (e.g., data collected in psychophysics experiments or some areas of social psychology). For discussions on how to handle such settings, we refer to corresponding sources (Johnson & Wichern, 2002). The above data-related remarks also lead us to a major assumption that will be made throughout the book, that of independence of studied
subjects. This assumption may be violated in cases where there is a possibility for examined persons to interact with one another during the process of data collection, or to receive the same type of schooling, treatment instruction, or opportunities. The data resulting in such contexts typically exhibit a hierarchical nature. For example, data stemming from patients who are treated by the same doctor or in the same health care facility, or data provided by students taught by the same teachers, in certain classrooms or in schools (school districts), tend to possess this characteristic. This data property is at times also referred to as nestedness or clustering, and then scores from different subjects cannot be safely considered independent. While the methods discussed in this book may lead to meaningful results when the subject independence assumption is violated to a minor extent, beyond that they cannot be generally trusted. Instead, methods that have been specifically developed to deal with nested data, also referred to as hierarchical, multilevel, or clustered data, should be utilized. Such methods are available within the so-called mixed modeling methodology, and some of them are also known as hierarchical linear (or nonlinear) modeling. For a discussion of these methods, which are not covered in this text, the reader is referred to Heck and Thomas (2000), Hox (2002), or Raudenbush and Bryk (2002) and references therein. Later chapters of this book will, however, handle a special case of hierarchical data, which stem from repeated measure designs. In the latter, measurements from any given subject can be considered nested within that subject, that is, exhibit a two-level hierarchy. Beyond this relatively simple case of nested structure, however, alternative multilevel analysis approaches are recommended with hierarchical data. 
While elaborating on the nature of the data utilized in applications of MVS methods considered in this book, it is also worthwhile to make the following notes. First, we stress that the rows of the data matrix (Table 1.1) represent a random sample from a studied population, or random samples from more than one population in the more general case of a multipopulation investigation. That is, the rows of the data matrix are independent of one another. On the other hand, the columns of the data matrix do not represent a random sample, and are in general not independent of one another but instead typically interrelated. MVS is in fact particularly concerned with this information about the interrelationship between variables (data matrix columns) of main interest in connection to a research question(s) being pursued. Second, in many studies in the behavioral, social, and educational disciplines, some subjects do not provide data on all collected variables. Hence, a researcher has to deal with what is commonly referred to as missing data. Chapter 12 addresses some aspects of this generally difficult issue, but at this point we emphasize the importance of using appropriate and uniform representation of missing values throughout the entire data matrix. In particular, it is strongly recommended to use the same symbol(s) for denoting missing data, a symbol(s) that is not a value any subject could legitimately take on a variable in the study. In addition, as a next step one should also ensure that this value(s) is declared to the software used as denoting missing value(s); failure to do so can cause severely misleading results. For example, if the next two subjects in the previously considered general mental ability study (cf. Table 1.1) had some missing data, the latter could be designated by the uniform symbol (99) as illustrated in Table 1.2. Dealing with missing data is in general a rather difficult and in part "technical" matter, and we refer the reader to Allison (2001) and Little and Rubin (2002) for highly informative and instructive treatments of this issue (see also Raykov, 2005, for a nontechnical introduction in a context of repeated measure analysis). In this book, apart from the discussion in Chapter 12, we assume that used data sets have no missing values (unless otherwise indicated).

TABLE 1.2
Missing Data Declaration in an Empirical Study (cf. Table 1.1)

Student   Test1   Test2   Test3   Gen   SES   MathAbTest
1         45      55      47      1     3     33
2         51      54      57      0     1     23
3         40      51      46      1     2     43
4         49      45      48      0     3     42
5         99      44      99      1     99    44
6         52      99      44      99    2     99

Note: Gen = gender; SES = socioeconomic status; MathAbTest = mathematics ability test score.
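The consequence of not declaring the missing-value code can be seen on the Test1 column of Table 1.2. In the sketch below, the name `MISSING_CODE` is hypothetical, and the available-case mean computed with `numpy.nanmean` serves only to show the effect of the declaration; it is not meant as a recommended treatment of missing data, for which the sources cited above and Chapter 12 should be consulted.

```python
import numpy as np

MISSING_CODE = 99  # the uniform missing-data symbol used in Table 1.2

# Test1 column of Table 1.2; student 5 has a missing score coded as 99.
test1 = np.array([45, 51, 40, 49, 99, 52], dtype=float)

# Undeclared: 99 is treated as if it were a real score.
naive_mean = test1.mean()

# Declared: the code is replaced by NaN before any statistic is computed.
declared = np.where(test1 == MISSING_CODE, np.nan, test1)
mean_observed = np.nanmean(declared)   # mean over the five observed scores

print(naive_mean, mean_observed)
```

With these six values the undeclared mean is 56.0, whereas the mean of the five observed scores is 47.4, so a single undeclared code shifts the estimate by more than eight points.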

1.5 Three Fundamental Matrices in Multivariate Statistics

A number of MVS methods may be conceptually thought of as being based on at least one of three matrices of variable interrelationship indices. For a set of variables to be analyzed, denoted Y1, Y2, ..., Yp (p > 1), these matrices are

1. The covariance matrix, designated S for a given sample
2. The correlation matrix, symbolized R in the sample
3. The sum-of-squares and cross-products (SSCP) matrix, denoted Q in the sample

For the purpose of setting the stage for subsequent developments, we discuss each of these matrices in Sections 1.5.1 through 1.5.3.




1.5.1 Covariance Matrix

A covariance matrix is a symmetric matrix that contains the variable variances on its main diagonal and the variable covariances as remaining elements. The covariance coefficient represents a nonstandardized index of the linear relationship between two variables (in case it is standardized, this index is the correlation coefficient, see Section 1.5.2). For example, consider an intelligence study of n = 500 sixth-grade students, which used five intelligence tests. Assume that the following was the resulting empirical covariance matrix for these measures (for symmetry reasons, only main diagonal elements and those below are displayed, a practice followed throughout the remainder of the book):

\[
S = \begin{bmatrix}
75.73 &       &       &       &       \\
23.55 & 66.77 &       &       &       \\
33.11 & 37.22 & 99.54 &       &       \\
29.56 & 33.41 & 37.41 & 74.44 &       \\
21.99 & 31.25 & 22.58 & 33.66 & 85.32
\end{bmatrix}. \tag{1.2}
\]

We note that sample variances are always positive (unless of course on a given variable all subjects take the same value, a highly uninteresting case). In empirical research, covariances tend to be smaller, on average, in magnitude (absolute value) than variances. In fact, this is a main property of legitimate covariance matrices for observed variables of interest in behavioral and social research. In Chapter 2, we return to this type of matrices and discuss in a more formal way an important concept referred to as positive definiteness. For now, we simply note that unless there is a perfect linear relationship among a given set of observed variables (that take on real values), their covariance matrix will exhibit this feature informally mentioned here. For a given data set on p variables (p > 1), there is a pertinent population covariance matrix (on the assumption of existence of their variances, which is practically not restrictive). This matrix, typically denoted by the capital Greek letter "sigma," consists of the population variances on its main diagonal and population covariances off it for all pairs of variables. In a random sample of n (n > 1) subjects drawn from a studied population, each element of this covariance matrix can be estimated as

\[
s_{ij} = \frac{\sum_{k=1}^{n} (Y_{ki} - \bar{Y}_i)(Y_{kj} - \bar{Y}_j)}{n - 1} \tag{1.3}
\]

(i, j = 1, ..., p), where the bar stands for the arithmetic mean of the variable underneath, while Yki and Ykj denote the score of the kth subject on the ith and jth variable, respectively. (In this book, n generally denotes sample size.) Obviously, in the special case that i = j, from Equation 1.3 one estimates the variable variances in that sample as

\[
s_i^2 = \frac{\sum_{k=1}^{n} (Y_{ki} - \bar{Y}_i)^2}{n - 1}, \tag{1.4}
\]

where i = 1, ..., p. Equations 1.3 and 1.4 are utilized by statistical software to estimate an empirical covariance matrix, once the data matrix with observed raw data has been provided. If data were available on all members of a given finite population (the latter being the typical case in social and behavioral research), then using Equations 1.3 and 1.4, but with the divisor (denominator) n in place of n − 1, would allow one to determine all elements of the population covariance matrix. We note that, as can be seen from Equations 1.3 and 1.4, variances and covariances depend on the specific units of measurement of the variables involved. In particular, the magnitude of either of these two indices is unrestricted. Hence, a change from one unit of measurement to another (such as from inches to centimeter measurements), which in empirical research is usually done via a linear transformation, can substantially increase or decrease the magnitude of variance and covariance coefficients.
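Equations 1.3 and 1.4 can be applied directly to a small data matrix. The sketch below uses the three intelligence test columns of Table 1.1 (four students only, so the numbers serve purely as an illustration), computes the sample covariance matrix from the deviation scores, and compares it with NumPy's `numpy.cov`, which uses the same divisor n − 1 by default. The rescaling factor 2.54 is borrowed from the inches-to-centimeters example, even though test scores are not lengths.

```python
import numpy as np

# Test1, Test2, Test3 scores of the four students in Table 1.1.
Y = np.array([
    [45, 55, 47],
    [51, 54, 57],
    [40, 51, 46],
    [49, 45, 48],
], dtype=float)

n = Y.shape[0]
deviations = Y - Y.mean(axis=0)            # Y_ki minus the mean of variable i
S = deviations.T @ deviations / (n - 1)    # Equation 1.3 for all pairs (i, j) at once

S_numpy = np.cov(Y, rowvar=False)          # library estimate, divisor n - 1

# A linear change of units multiplies every variance and covariance
# by the square of the scaling factor.
S_rescaled = np.cov(2.54 * Y, rowvar=False)

print(np.round(S, 2))
print(np.allclose(S, S_numpy), np.allclose(S_rescaled, 2.54 ** 2 * S))
```

The diagonal of `S` holds the variances of Equation 1.4, and the last line shows the scale dependence noted above.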

1.5.2 Correlation Matrix

The last mentioned feature of scale dependence creates difficulties when trying to interpret variances and covariances. To deal with this problem for two random variables, one can use the correlation coefficient, which is obtained by dividing their covariance by the square-rooted product of their variances (i.e., the product of their standard deviations; see below for the case when this division would not be possible). For a given population, from the population covariance matrix one can determine in this way the population correlation matrix, which is commonly denoted by the capital Greek letter "rho." Similar to the covariance coefficient, in a given sample the correlation coefficient rij between the variables Yi and Yj is evaluated as

\[
r_{ij} = \frac{s_{ij}}{\sqrt{s_i^2 s_j^2}} = \frac{s_{ij}}{s_i s_j}
= \frac{\sum_{k=1}^{n} (Y_{ki} - \bar{Y}_i)(Y_{kj} - \bar{Y}_j)}
{\sqrt{\sum_{k=1}^{n} (Y_{ki} - \bar{Y}_i)^2 \, \sum_{k=1}^{n} (Y_{kj} - \bar{Y}_j)^2}} \tag{1.5}
\]




(i, j = 1, ..., p). A very useful property of the correlation coefficient is that the magnitude of the numerator in Equation 1.5 is never larger than that of the denominator, that is,

\[
|s_{ij}| \le s_i s_j, \tag{1.6}
\]

and hence

\[
-1 \le r_{ij} \le 1 \tag{1.7}
\]

is always true (i, j = 1, ..., p). As a matter of fact, in Inequalities 1.6 and 1.7 the equality sign is obtained only when there is a perfect linear relationship between the two variables involved, that is, if and only if there exist two numbers aij and bij such that Yi = aij + bij Yj (1 ≤ i, j ≤ p). As with the covariance matrix, if data were available on an entire (finite) population, using Equation 1.5 one could determine the population correlation matrix. Also from Equation 1.5 it is noted that the correlation coefficient will not exist, that is, will not be defined, if at least one of the variables involved is constant across the sample or population considered (this is obviously a highly uninteresting and potentially unrealistic case in empirical research). Then obviously the commonly used notion of relationship is void as well. As an example, consider a study examining the relationship between GPA, SAT scores, annual family income, and abstract reasoning test scores obtained from n = 350 tenth-grade students. Let us also assume that in this sample the following correlation matrix was obtained (positioning of variables in the data matrix is as just listed, from left to right and top to bottom):

\[
R = \begin{bmatrix}
1   &     &     &   \\
.69 & 1   &     &   \\
.48 & .52 & 1   &   \\
.75 & .66 & .32 & 1
\end{bmatrix}. \tag{1.8}
\]
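Equation 1.5 can also be applied to an entire covariance matrix at once. As a sketch, the following code converts the covariance matrix of the five intelligence tests in Equation 1.2 into the corresponding correlation matrix and checks Inequality 1.7 numerically. The array names are chosen for illustration only; the entries are those displayed in Equation 1.2, with the upper triangle filled in by symmetry.

```python
import numpy as np

# Lower triangle (including the diagonal) of the covariance matrix S in Equation 1.2.
lower = np.array([
    [75.73,  0.00,  0.00,  0.00,  0.00],
    [23.55, 66.77,  0.00,  0.00,  0.00],
    [33.11, 37.22, 99.54,  0.00,  0.00],
    [29.56, 33.41, 37.41, 74.44,  0.00],
    [21.99, 31.25, 22.58, 33.66, 85.32],
])
S = lower + np.tril(lower, k=-1).T     # fill in the upper triangle by symmetry

sd = np.sqrt(np.diag(S))               # standard deviations s_i
R = S / np.outer(sd, sd)               # r_ij = s_ij / (s_i * s_j), Equation 1.5

print(np.round(R, 2))
print(bool(np.all(np.abs(R) <= 1.0 + 1e-12)))   # Inequality 1.7 holds for every pair
```

The diagonal of `R` consists of ones, and unlike the entries of `S`, its off-diagonal entries do not change under a linear change of measurement units with positive scaling factors.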

As seen by examining the values in the right-hand side of Equation 1.8, most correlations are of medium size, with that between income and abstract reasoning test score being the weakest (.32). We note that while there are no strict rules to be followed when interpreting entries in a correlation matrix, it is generally easier to interpret their squared values (e.g., as in simple linear regression, where the squared correlation equals the R² index furnishing the proportion of explained or shared variance between the two variables in question). We also observe that while in general only the sign of a covariance coefficient can be interpreted, both the sign and magnitude of the correlation coefficient are meaningful. Specifically, the closer the correlation coefficient is to 1 or −1, the stronger (more discernible) the linear relationship is between the two variables involved. In contrast, the closer this coefficient is to 0, being either a positive or negative number, the weaker the linear pattern of relationship between the variables. Like a positive covariance, a positive correlation suggests that persons with above mean (below mean) performance on one of the variables tend to be among those with above mean (below mean) performance on the other measure. Alternatively, a negative covariance (negative correlation) is indicative of the fact that subjects with above (below) mean performance on one of the variables tend to be among those with below (above) mean performance on the other measure. When the absolute value of the correlation coefficient is, for example, in the .90s, one may add that this tendency is strong, while otherwise it is moderate to weak (the closer to 0 this correlation is).

1.5.3 Sums-of-Squares and Cross-Products Matrix

An important feature of both the covariance and correlation matrices is that when determining or estimating their elements an averaging process takes place (see the summation sign and division by (n − 1) in Equations 1.3 through 1.5, and correspondingly for their population counterparts). This does not happen, however, when one is concerned with estimating the entries of another matrix of main relevance in MVS and especially in ANOVA contexts, the so-called sums-of-squares and cross-products (SSCP) matrix. For a sample from a studied population, this symmetric matrix contains along its diagonal the sums of squares, and off this diagonal the sums of cross products for all possible pairs of variables involved. As is well known from discussions of ANOVA in introductory statistics textbooks, the sum of squares for a given variable is

\[
q_{ii} = \sum_{k=1}^{n} (Y_{ki} - \bar{Y}_i)^2 \tag{1.9}
\]

(i = 1, ..., p), which is sometimes also referred to as corrected or deviation sum of squares, due to the fact that the mean is subtracted from the observed score on the variable under consideration. Similarly, the sum of cross products is

\[
q_{ij} = \sum_{k=1}^{n} (Y_{ki} - \bar{Y}_i)(Y_{kj} - \bar{Y}_j), \tag{1.10}
\]

where 1 ≤ i, j ≤ p. Obviously, Equation 1.9 is obtained from Equation 1.10 in the special case when i = j, that is, the sum of squares for a given variable equals the sum of cross products of this variable with itself (i, j = 1, ..., p). As can be readily seen by a comparison of Equations 1.9 and 1.10 with Equations 1.4 and 1.3, respectively, the elements of the SSCP matrix Q = [qij] are obtained by multiplying the corresponding elements of the empirical covariance matrix S by (n − 1). (In the rest of this chapter, we enclose in brackets the general element of a matrix; see Chapter 2 for further notation explication.) Hence, Q has as its elements measures of linear relationship that are not averaged over subjects, unlike the elements of the matrices S and R. As a result, there is no readily conceptualized population analogue of the SSCP matrix Q. This may in fact be one of the reasons why this matrix has been referred to explicitly less often in the literature (in particular applied) than the covariance or correlation matrix. Another type of SSCP matrix that can also be considered and used to obtain the matrix Q is one that reflects the SSCP of the actual raw scores uncorrected for the mean. In this raw score SSCP matrix, U = [uij], the sum of squares for a given variable, say Yi, is defined as

\[
u_{ii} = \sum_{k=1}^{n} Y_{ki}^2, \tag{1.11}
\]

and the sums of its cross products with the remaining variables are

\[
u_{ij} = \sum_{k=1}^{n} Y_{ki} Y_{kj}, \tag{1.12}
\]

where 1 ≤ i, j ≤ p. We note that Equation 1.11 is obtained from Equation 1.12 in the special case when i = j (i, j = 1, ..., p). As an example, consider a study concerning the development of aggression in middle-school students, with p = 3 consecutive assessments on an aggression measure. Suppose the SSCP matrix for these three measures, Q, is as follows:

\[
Q = \begin{bmatrix}
1112.56 &         &         \\
992.76  & 2055.33 &         \\
890.33  & 1001.36 & 2955.36
\end{bmatrix}. \tag{1.13}
\]
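The relationships among Q, U, and S can be checked numerically. The following sketch again uses the three intelligence test columns of Table 1.1 as illustrative data (not the aggression data behind Equation 1.13, whose raw scores are not given). It computes Q from deviation scores, U from raw scores, and verifies that Q equals (n − 1) times S. It also uses the identity qij = uij − n × (mean of Yi) × (mean of Yj), a standard algebraic result that shows one way of obtaining Q from U; the text above does not state this identity explicitly, so it is an addition made here for illustration.

```python
import numpy as np

# Test1, Test2, Test3 scores of the four students in Table 1.1 (illustrative data).
Y = np.array([
    [45, 55, 47],
    [51, 54, 57],
    [40, 51, 46],
    [49, 45, 48],
], dtype=float)

n = Y.shape[0]
mean = Y.mean(axis=0)
deviations = Y - mean

Q = deviations.T @ deviations             # deviation SSCP matrix, Equations 1.9 and 1.10
U = Y.T @ Y                               # raw-score SSCP matrix, Equations 1.11 and 1.12
S = np.cov(Y, rowvar=False)               # sample covariance matrix, Equation 1.3

Q_from_U = U - n * np.outer(mean, mean)   # subtract n times the products of the means

print(np.round(Q, 2))
print(np.allclose(Q, (n - 1) * S), np.allclose(Q, Q_from_U))
```

Because Q is a nonaveraged sum, adding further subjects to `Y` would generally increase the magnitude of its entries, whereas the entries of `S` would not grow systematically.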

We observe that similarly to the covariance matrix, the diagonal entries of Q are positive (unless a variable is a constant), and tend to be larger in magnitude than the off-diagonal ones. This feature can be readily explained with the fact mentioned earlier that the elements of Q equal (n 1) times those of the covariance matrix S, and the similar feature of the covariance matrix. We also note that the elements of Q may grow unrestrictedly, or alternatively decrease unrestrictedly (if negative), when sample size increases; this is due to their definition as nonaveraged sums across subjects. Similarly to the elements of any sample covariance matrix S, due to their nonstandardized feature, the elements of Q cannot in




general be interpreted in magnitude, but only their sign can, in the same way as the sign of the elements of S. Further, the entries in Q also depend on the metric underlying the studied variables.

We stress that the matrices S, R, and Q will be very important in most MVS methods of interest in this book because they contain information on the linear interrelationships among studied variables. It is these interrelationships that are essential for the multivariate methods considered later in the text. Specifically, MVS methods capitalize on this information and re-express it in their results, in addition to other features of the data. To illustrate, consider the following cases. A correlation matrix showing uniformly high variable correlations (e.g., for a battery of tests) may reveal a structure of relationships that is consistent with the assumption of a single dimension (e.g., abstract thinking ability) underlying a set of analyzed variables. Further, a correlation matrix showing two groups of correlations that are similar in size within each group may be consistent with two interrelated dimensions (e.g., reading ability and writing ability in an educational study).

As it turns out, and preempting some of the developments to follow in subsequent chapters, we use the correlation and covariance matrix in FA and PCA; the SSCP matrix in MANOVA, MANCOVA, and discriminant function analysis; and the covariance matrix in confirmatory FA, in studies of group mean differences on unobserved variables, and in studies with repeated measures. (In addition to the covariance matrix, variable means will also be of relevance in the latter two cases, as elaborated in Chapters 9 and 13.) Hence, with some simplification, we may say that these three matrices of variable interrelationships (S, R, and Q) will often play the role of data in this text; that is, they will be the main starting points for applying MVS methods (with the raw data also remaining relevant in MVS in its own right).
We also emphasize that, in this sense, for a given empirical data set, the covariance, correlation, and SSCP matrices are only the beginning and the means rather than the end of MVS applications.
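The relationships among the matrices U, Q, S, and R discussed in this section can be traced on a small numerical example. The following Python sketch is only an illustration: the data set (four subjects, two variables) is hypothetical and not taken from the book, and the identity $q_{ij} = u_{ij} - n\bar{Y}_i\bar{Y}_j$ used in the last step is the standard way of correcting raw-score sums of products for the means.

```python
from math import sqrt

# Hypothetical scores of n = 4 subjects on p = 2 variables (illustration only).
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 6.0]]
n, p = len(data), len(data[0])

means = [sum(row[j] for row in data) / n for j in range(p)]

# Raw-score SSCP matrix U (Equations 1.11 and 1.12): sums of products of raw scores.
U = [[sum(row[i] * row[j] for row in data) for j in range(p)] for i in range(p)]

# Mean-corrected SSCP matrix Q: sums of products of deviations from the means.
Q = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data)
      for j in range(p)] for i in range(p)]

# Covariance matrix S = Q / (n - 1); the correlation matrix R standardizes S.
S = [[Q[i][j] / (n - 1) for j in range(p)] for i in range(p)]
R = [[S[i][j] / sqrt(S[i][i] * S[j][j]) for j in range(p)] for i in range(p)]

# Q obtained from U by removing the contribution of the variable means.
Q_from_U = [[U[i][j] - n * means[i] * means[j] for j in range(p)] for i in range(p)]

print("U =", U)
print("Q =", Q)
print("S =", S)
print("R =", R)
```

In this sketch the elements of Q and U grow with every added subject, whereas those of S and R do not, which is the nonaveraged feature of SSCP matrices noted above.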

1.6 Illustration Using Statistical Software

In this section, we illustrate the previous discussions using two of the most widely circulated statistical analysis software packages, SPSS and SAS. To achieve this goal, we use data from a sample of n = 32 freshmen in a study of the relationship between several variables that are defined below. Before we commence, we note that such relatively small sample size examples will occasionally be used in this book merely for didactic purposes, and stress that in empirical research it is strongly recommended to use large samples whenever possible. The desirability of large sample sizes is a topic that has received a considerable amount of attention in the literature because it




is well recognized that the larger the sample, the more stable the parameter estimates will be, although there are no easy to apply general rules for sample size determination. (This is because the appropriate size of a sample depends in general on many factors, including psychometric properties of the variables selected, the strength of relationships among them, number of observed variables, amount of missing data, and the distributional characteristics of the analyzed variables; Marcoulides & Saunders, 2006; Muthén & Muthén, 2002.)

In this example data set, the following variables are considered: (a) GPA at the beginning of fall semester (called GPA in the data file ch1ex1.dat available from www.psypress.com/applied-multivariate-analysis), (b) an initial math ability test (called INIT_AB in that file), (c) an IQ test score (called IQ in the file), and (d) the number of hours the person watched television last week (called HOURS_TV in the file). For ease of discussion, we also assume that there are no anomalous values in the data on any of the observed variables. In Chapter 3, we revisit the issue of anomalous values and provide some guidance concerning how such values can be examined and assessed.

For the purposes of this illustration, we go through several initial data analysis steps. We begin by studying the frequencies with which scores occur for each of these variables across the studied sample. To accomplish this, in SPSS we choose the following menu options (in the order given next) and then click on the OK button:

    Analyze → Descriptive Statistics → Frequencies

(Upon opening the data file, or reading in the data to be analyzed, the "Analyze" menu choice is available in the toolbar, with the "Descriptive Statistics" becoming available when "Analyze" is chosen, and similarly for "Frequencies." Once the latter choice is clicked, the user must move over the variables of interest into the variable selection window.)
To obtain variable frequencies with SAS, the following set of commands can be used:

    DATA CHAPTER1;
    INFILE 'ch1ex1.dat';
    INPUT GPA INIT_AB IQ HOURS_TV;
    PROC FREQ;
    RUN;

In general, SAS program files normally contain commands that describe the data to be analyzed (the so-called DATA statements), and the type of procedures to be performed (the so-called PROC statements). SAS PROC statements are just like computer programs that perform various manipulations, and then print the results of the analyses (for complete details, see the latest SAS User's Guide and related manuals). The INFILE command statement merely indicates the file name from which the data are to be




read (see also specific software arrangements regarding accessing raw data), and the following INPUT statement invokes retrieval of the data from the named file, in the order of specified free-formatted variables. Another way to inform SAS about the data to use is to include the entire data set (abbreviated below to the first five observations to conserve space) in the program file as follows:

    DATA CHAPTER1;
    INPUT GPA INIT_AB IQ HOURS_TV;
    CARDS;
    2.66 20 101 9
    2.89 22 103 8
    3.28 24 99 9
    2.92 12 100 8
    4 21 121 7
    ;
    PROC FREQ;
    RUN;

Once either of these two command sequences is submitted to SAS, two types of files reporting results are created. One is called the SAS log file and the other is referred to as the SAS output file. The SAS log file contains all commands, messages, and information related to the execution of the program, whereas the SAS output file contains the actual statistical results.

The outputs created by running the software, SPSS and SAS, as indicated above are as follows. For ease of presentation, we separate them by program and insert clarifying comments after each output section accordingly.

SPSS output notes

Frequencies

Notes

    Output Created
    Comments
    Input                    Data                             D:\Teaching\Multivariate.Statistics\Data\Lecture1.sav
                             Filter                           <none>
                             Weight                           <none>
                             Split File                       <none>
                             N of Rows in Working Data File   32
    Missing Value Handling   Definition of Missing            User-defined missing values are treated as missing.
                             Cases Used                       Statistics are based on all cases with valid data.
    Syntax                                                    FREQUENCIES VARIABLES = gpa init_ab iq hours_tv /ORDER = ANALYSIS.
    Resources                Elapsed Time                     0:00:00.09
                             Total Values Allowed             149796

SAS output log file

    NOTE: Copyright (c) 2002–2003 by SAS Institute Inc., Cary, NC, USA.
    NOTE: SAS (r) 9.1 (TS1M3)
          Licensed to CSU FULLERTON-SYSTEMWIDE-T/R, Site 0039713013.
    NOTE: This session is executing on the WIN_PRO platform.
    NOTE: SAS 9.1.3 Service Pack 1
    NOTE: SAS initialization used:
          real time 16.04 seconds
          cpu time 1.54 seconds

    1    DATA CHAPTER1;
    2    INPUT GPA INIT_AB IQ HOURS_TV;
    3    CARDS;

    NOTE: The data set WORK.CHAPTER1 has 32 observations and 5 variables.
    NOTE: DATA statement used (Total process time):
          real time 0.31 seconds
          cpu time 0.06 seconds

    36   ;
    37   PROC FREQ;
    38   RUN;

    NOTE: There were 32 observations read from the data set WORK.CHAPTER1.
    NOTE: PROCEDURE FREQ used (Total process time):
          real time 0.41 seconds
          cpu time 0.04 seconds




To save space, in the remainder of the book we dispense with these beginning output parts for each considered analytic session in both programs, which echo back the input submitted to the program and/or contain information about internal arrangements that the latter invokes in order to meet the analytic requests. We also dispense with the output section titles, and supply instead appropriate headings and subheadings.

SPSS frequencies output

Statistics

                 GPA    INIT_AB    PSI    IQ    HOURS_TV
    N  Valid     32     32         32     32    32
       Missing   0      0          0      0     0

This section confirms that we are dealing with a complete data set, that is, one having no missing values, as indicated in Section 1.4.

Frequency Table

GPA

    Valid    Frequency   Percent   Valid Percent   Cumulative Percent
    2.06     1           3.1       3.1             3.1
    2.39     1           3.1       3.1             6.3
    2.63     1           3.1       3.1             9.4
    2.66     1           3.1       3.1             12.5
    2.67     1           3.1       3.1             15.6
    2.74     1           3.1       3.1             18.8
    2.75     1           3.1       3.1             21.9
    2.76     1           3.1       3.1             25.0
    2.83     2           6.3       6.3             31.3
    2.86     1           3.1       3.1             34.4
    2.87     1           3.1       3.1             37.5
    2.89     2           6.3       6.3             43.8
    2.92     1           3.1       3.1             46.9
    3.03     1           3.1       3.1             50.0
    3.10     1           3.1       3.1             53.1
    3.12     1           3.1       3.1             56.3
    3.16     1           3.1       3.1             59.4
    3.26     1           3.1       3.1             62.5
    3.28     1           3.1       3.1             65.6
    3.32     1           3.1       3.1             68.8
    3.39     1           3.1       3.1             71.9
    3.51     1           3.1       3.1             75.0
    3.53     1           3.1       3.1             78.1
    3.54     1           3.1       3.1             81.3
    3.57     1           3.1       3.1             84.4
    3.62     1           3.1       3.1             87.5
    3.65     1           3.1       3.1             90.6
    3.92     1           3.1       3.1             93.8
    4.00     2           6.3       6.3             100.0
    Total    32          100.0     100.0

INIT_AB

    Valid    Frequency   Percent   Valid Percent   Cumulative Percent
    12.0     1           3.1       3.1             3.1
    14.0     1           3.1       3.1             6.3
    17.0     3           9.4       9.4             15.6
    19.0     3           9.4       9.4             25.0
    20.0     2           6.3       6.3             31.3
    21.0     4           12.5      12.5            43.8
    22.0     2           6.3       6.3             50.0
    23.0     4           12.5      12.5            62.5
    24.0     3           9.4       9.4             71.9
    25.0     4           12.5      12.5            84.4
    26.0     2           6.3       6.3             90.6
    27.0     1           3.1       3.1             93.8
    28.0     1           3.1       3.1             96.9
    29.0     1           3.1       3.1             100.0
    Total    32          100.0     100.0

IQ

    Valid     Frequency   Percent   Valid Percent   Cumulative Percent
    97.00     2           6.3       6.3             6.3
    98.00     4           12.5      12.5            18.8
    99.00     4           12.5      12.5            31.3
    100.00    1           3.1       3.1             34.4
    101.00    7           21.9      21.9            56.3
    102.00    1           3.1       3.1             59.4
    103.00    2           6.3       6.3             65.6
    104.00    1           3.1       3.1             68.8
    107.00    1           3.1       3.1             71.9
    110.00    2           6.3       6.3             78.1
    111.00    1           3.1       3.1             81.3
    112.00    1           3.1       3.1             84.4
    113.00    2           6.3       6.3             90.6
    114.00    1           3.1       3.1             93.8
    119.00    1           3.1       3.1             96.9
    121.00    1           3.1       3.1             100.0
    Total     32          100.0     100.0




HOURS_TV

    Valid    Frequency   Percent   Valid Percent   Cumulative Percent
    6.000    4           12.5      12.5            12.5
    6.500    2           6.3       6.3             18.8
    7.000    6           18.8      18.8            37.5
    7.500    1           3.1       3.1             40.6
    8.000    10          31.3      31.3            71.9
    8.500    2           6.3       6.3             78.1
    9.000    6           18.8      18.8            96.9
    9.500    1           3.1       3.1             100.0
    Total    32          100.0     100.0

SAS frequencies output

    GPA     Frequency   Percent   Cumulative Frequency   Cumulative Percent
    2.06    1           3.13      1                      3.13
    2.39    1           3.13      2                      6.25
    2.63    1           3.13      3                      9.38
    2.66    1           3.13      4                      12.50
    2.67    1           3.13      5                      15.63
    2.74    1           3.13      6                      18.75
    2.75    1           3.13      7                      21.88
    2.76    1           3.13      8                      25.00
    2.83    2           6.25      10                     31.25
    2.86    1           3.13      11                     34.38
    2.87    1           3.13      12                     37.50
    2.89    2           6.25      14                     43.75
    2.92    1           3.13      15                     46.88
    3.03    1           3.13      16                     50.00
    3.1     1           3.13      17                     53.13
    3.12    1           3.13      18                     56.25
    3.16    1           3.13      19                     59.38
    3.26    1           3.13      20                     62.50
    3.28    1           3.13      21                     65.63
    3.32    1           3.13      22                     68.75
    3.39    1           3.13      23                     71.88
    3.51    1           3.13      24                     75.00
    3.53    1           3.13      25                     78.13
    3.54    1           3.13      26                     81.25
    3.57    1           3.13      27                     84.38
    3.62    1           3.13      28                     87.50
    3.65    1           3.13      29                     90.63
    3.92    1           3.13      30                     93.75
    4       2           6.25      32                     100.00




    INIT_AB   Frequency   Percent   Cumulative Frequency   Cumulative Percent
    12        1           3.13      1                      3.13
    14        1           3.13      2                      6.25
    17        3           9.38      5                      15.63
    19        3           9.38      8                      25.00
    20        2           6.25      10                     31.25
    21        4           12.50     14                     43.75
    22        2           6.25      16                     50.00
    23        4           12.50     20                     62.50
    24        3           9.38      23                     71.88
    25        4           12.50     27                     84.38
    26        2           6.25      29                     90.63
    27        1           3.13      30                     93.75
    28        1           3.13      31                     96.88
    29        1           3.13      32                     100.00

    IQ     Frequency   Percent   Cumulative Frequency   Cumulative Percent
    97     2           6.25      2                      6.25
    98     4           12.50     6                      18.75
    99     4           12.50     10                     31.25
    100    1           3.13      11                     34.38
    101    7           21.88     18                     56.25
    102    1           3.13      19                     59.38
    103    2           6.25      21                     65.63
    104    1           3.13      22                     68.75
    107    1           3.13      23                     71.88
    110    2           6.25      25                     78.13
    111    1           3.13      26                     81.25
    112    1           3.13      27                     84.38
    113    2           6.25      29                     90.63
    114    1           3.13      30                     93.75
    119    1           3.13      31                     96.88
    121    1           3.13      32                     100.00

    HOURS_TV   Frequency   Percent   Cumulative Frequency   Cumulative Percent
    6          4           12.50     4                      12.50
    6.5        2           6.25      6                      18.75
    7          6           18.75     12                     37.50
    7.5        1           3.13      13                     40.63
    8          10          31.25     23                     71.88
    8.5        2           6.25      25                     78.13
    9          6           18.75     31                     96.88
    9.5        1           3.13      32                     100.00




These SPSS and SAS frequency tables provide important information regarding the distribution (specifically, the frequencies) of values that each of the variables is taking in the studied sample. In this sense, these output sections also tell us what the data actually are on each variable. In particular, we see that all variables may be considered as (approximately) continuous.

The next step is to examine what the descriptive statistics are for every measure used. From these statistics, we learn much more about the studied variables, namely their means, range of values taken, and standard deviations. To this end, in SPSS we use the following sequence of menu option selections:

    Analyze → Descriptive Statistics → Descriptives

In SAS, the procedure PROC MEANS would need to be used and stated instead of PROC FREQ (or in addition to the latter) in the second-to-last line of either SAS command file presented above. Each of the corresponding software command sets provides identical output (up to roundoff error) shown below.

SPSS output

Descriptive Statistics

                         N     Minimum   Maximum   Mean       Std. Deviation
    GPA                  32    2.06      4.00      3.1172     .46671
    INIT_AB              32    12.0      29.0      21.938     3.9015
    IQ                   32    97.00     121.00    104.0938   6.71234
    HOURS_TV             32    6.000     9.500     7.71875    1.031265
    Valid N (listwise)   32

SAS output

    Variable    N     Mean          Std Dev     Minimum      Maximum
    GPA         32    3.1171875     0.4667128   2.0600000    4.0000000
    INIT_AB     32    21.9375000    3.9015092   12.0000000   29.0000000
    IQ          32    104.0937500   6.7123352   97.0000000   121.0000000
    HOURS_TV    32    7.7187500     1.0312653   6.0000000    9.5000000
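As a cross-check on the descriptive statistics, the mean and standard deviation of a variable can be recomputed from its frequency table. The following Python sketch (an illustration, not part of the SPSS or SAS sessions) does this for HOURS_TV, using the values and frequencies listed in the frequency output above; it should reproduce the reported mean of 7.71875 and a standard deviation of approximately 1.0313.

```python
from math import sqrt

# HOURS_TV values and their frequencies, copied from the frequency table.
freq = {6.0: 4, 6.5: 2, 7.0: 6, 7.5: 1, 8.0: 10, 8.5: 2, 9.0: 6, 9.5: 1}

n = sum(freq.values())                                   # sample size
mean = sum(value * count for value, count in freq.items()) / n
# Sum of squared deviations from the mean, weighted by frequency.
ss = sum(count * (value - mean) ** 2 for value, count in freq.items())
sd = sqrt(ss / (n - 1))                                  # sample standard deviation

print(n, mean, round(sd, 5))
```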

To obtain standardized measures of variable interrelationships, we produce their correlation matrix. In SAS, this is accomplished by including the procedure PROC CORR at the end of the program file used previously


26


(or instead of PROC FREQ or PROC MEANS), whereas in SPSS this is accomplished by requesting the following series of menu options:

    Analyze → Correlations → Bivariate

These SPSS and SAS commands furnish the following output:

SPSS output

Correlations

                                      GPA       INIT_AB   IQ        HOURS_TV
    GPA        Pearson Correlation    1         .387*     .659**    .453**
               Sig. (2-tailed)        .         .029      .000      .009
               N                      32        32        32        32
    INIT_AB    Pearson Correlation    .387*     1         .206      .125
               Sig. (2-tailed)        .029      .         .258      .496
               N                      32        32        32        32
    IQ         Pearson Correlation    .659**    .206      1         .784**
               Sig. (2-tailed)        .000      .258      .         .000
               N                      32        32        32        32
    HOURS_TV   Pearson Correlation    .453**    .125      .784**    1
               Sig. (2-tailed)        .009      .496      .000      .
               N                      32        32        32        32

    *. Correlation is significant at the 0.05 level (2-tailed).
    **. Correlation is significant at the 0.01 level (2-tailed).

SAS output

    Pearson Correlation Coefficients, N = 32
    Prob > |r| under H0: Rho = 0

               GPA        INIT_AB    IQ         HOURS_TV
    GPA        1.00000    0.38699    0.65910
                          0.0287
    IQ

0.65910

1), and $x = [x_1, x_2, \ldots, x_p]'$ is a p × 1 vector, then the quadratic form $x'Ax$ is the following scalar:

$$x'Ax = [x_1, x_2, \ldots, x_p] \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1p} \\ a_{21} & a_{22} & \ldots & a_{2p} \\ \vdots & \vdots & & \vdots \\ a_{p1} & a_{p2} & \ldots & a_{pp} \end{bmatrix} [x_1, x_2, \ldots, x_p]'$$

$$= a_{11}x_1^2 + a_{22}x_2^2 + \ldots + a_{pp}x_p^2 + 2a_{12}x_1x_2 + 2a_{13}x_1x_3 + \ldots + 2a_{p-1,p}x_{p-1}x_p. \qquad (2.23)$$
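Equation 2.23 can be checked numerically. The Python sketch below evaluates a quadratic form in two ways: as the full double sum implied by the matrix product $x'Ax$, and via the expanded right-hand side of Equation 2.23 (squares plus doubled cross products). The symmetric matrix A and the vector x are hypothetical values chosen only for illustration.

```python
def quadratic_form(A, x):
    """x'Ax computed as the double sum over all pairs (i, j)."""
    p = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(p) for j in range(p))


def quadratic_form_expanded(A, x):
    """Right-hand side of Equation 2.23; valid for a symmetric matrix A."""
    p = len(x)
    squares = sum(A[i][i] * x[i] ** 2 for i in range(p))
    cross = sum(2 * A[i][j] * x[i] * x[j]
                for i in range(p) for j in range(i + 1, p))
    return squares + cross


A = [[2, 1, 0],
     [1, 3, -1],
     [0, -1, 4]]          # hypothetical symmetric 3 x 3 matrix
x = [1, 2, 3]             # hypothetical vector

print(quadratic_form(A, x), quadratic_form_expanded(A, x))
```

Both functions return the same scalar for this input, as they must for any symmetric matrix.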

In words, a quadratic form (with a symmetric matrix, as throughout the rest of this text) is the scalar that results as the sum of the squares of successive vector elements multiplied by the corresponding diagonal elements of the matrix involved, plus twice the products of different vector elements multiplied by the pertinent elements of that matrix (with subindexes being those of the vector elements involved).

Matrix inversion. When using multivariate statistical methods, we will often need to be concerned with a (remote) analog of number division. This is the procedure of matrix inversion. Only square matrices may have inverses (for the concerns of this book), although not all square matrices will have inverses. In particular, a matrix with the property that there is a linear relationship between its columns (or rows), e.g., one of the columns being a linear combination of some or all of the remaining columns (rows), does not have an inverse. Such a matrix is called singular, as opposed to matrices that have inverses and which are called nonsingular or invertible. Inversion is denoted by the symbol $(\cdot)^{-1}$, whereby $A^{-1}$ denotes the inverse of the matrix A (when $A^{-1}$ exists), and is typically best carried out by computers. Interestingly, similarly to transposition, matrix inversion works like a "toggle": the inverse of an inverse of a given matrix is the original matrix itself, that is, $(A^{-1})^{-1} = A$, for any invertible matrix A. In addition, as shown in more advanced treatments, the inverse of a matrix is unique when it exists.




How does one work out a matrix inverse? To begin with, there is a recursive rule that allows one to start with an obvious inverse of a 1 × 1 matrix (i.e., a scalar), and define those of higher-order matrices. For a given scalar a, obviously $a^{-1} = 1/a$ is its inverse (assuming a ≠ 0); that is, inversion of a 1 × 1 matrix is the ordinary division of 1 by this scalar. To determine the inverse of a 2 × 2 matrix, we must first introduce the notion of a "determinant" (which we do in the next subsection) and then move on to the topic of matrix inversion for higher-order matrices. An important type of matrix that is needed for this discussion is that of an identity matrix. The identity matrix, for a given size, is a square matrix which has 1's along its main diagonal and 0's off it. For example, the identity matrix with 4 rows and 4 columns, i.e., of size 4 × 4, is

$$I_4 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \qquad (2.24)$$

We note that for each integer number, say m (m > 0), there is only one identity matrix, namely $I_m$. The identity matrix plays a similar role to that of the unity (the number 1) among the real numbers. That is, any matrix multiplied with the identity matrix (with which the former is matrix conform) will remain unchanged, regardless of whether it has been pre- or postmultiplied with that identity matrix. As an example, if

$$A = \begin{bmatrix} 4 & 5 & 6 \\ 66 & 5 & 55 \\ 45 & 32 & 35 \end{bmatrix},$$

then

$$A I_3 = I_3 A = A. \qquad (2.25)$$
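Equation 2.25 is easy to verify directly. The following Python sketch builds an identity matrix of a requested size and multiplies the matrix A of this example by it from either side; the helper names `identity` and `matmul` are our own and serve only this illustration.

```python
def identity(m):
    """Identity matrix of size m x m: 1's on the main diagonal, 0's elsewhere."""
    return [[1 if i == j else 0 for j in range(m)] for i in range(m)]


def matmul(A, B):
    """Product of two multiplication-conform matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


A = [[4, 5, 6],
     [66, 5, 55],
     [45, 32, 35]]
I3 = identity(3)

print(matmul(A, I3) == A, matmul(I3, A) == A)
```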

(Note that this matrix A cannot be pre- or postmultiplied with an identity matrix other than that of size 3 × 3.) Returning now to the issue of matrix inverse, we note its following characteristic feature: if A is an arbitrary (square) matrix that is nonsingular, then $AA^{-1} = I = A^{-1}A$, where I stands for the identity matrix of the size of A. An interesting property of matrix inverses, like of transposition, is that the inverse of a matrix product is the reverse product of the inverses of the matrices involved (assuming all inverses exist). For example, the following would be the case with the product of two multiplication conform matrices: $(AB)^{-1} = B^{-1}A^{-1}$; in case of more than two matrices


Elements of Matrix Theory


involved in the product, say A, B, . . . , Z, where they are multiplication conform in this order, this feature looks as follows:

$$(A B C \ldots Y Z)^{-1} = Z^{-1} Y^{-1} \ldots C^{-1} B^{-1} A^{-1} \qquad (2.26)$$

(assuming all involved matrix inverses exist).

Determinant of a matrix. Matrices in empirical research typically contain many numerical elements, and as such are hard to remember or take note of. Further, a table of numbers (which a matrix is) cannot be readily manipulated. It would therefore be desirable to have available a single number that characterizes a given matrix. One such characterization is the so-called determinant. Only square matrices have determinants. The determinant of a matrix is sometimes regarded as a measure of generalized variance for a given set of random variables, i.e., a random vector. Considering for example the determinant of a sample covariance matrix, it indicates the variability (spread) of the individual raw data it stems from: the larger the determinant of that matrix, the more salient the individual differences on the studied variables as reflected in the raw data, and vice versa. Unfortunately, determinants do not uniquely characterize a matrix; in fact, different matrices can have the same number as their determinant.

The notion of a determinant can also be defined in a recursive manner. To start, let us consider the simple case of a 1 × 1 matrix, i.e., a scalar, say a. In this case, the determinant is that number itself, a. That is, using vertical bars to symbolize a determinant, |a| = a, for any scalar a. In case of a 2 × 2 matrix, the following rule applies when finding the determinant: subtract from the product of its main diagonal elements, the product of the remaining two off-diagonal elements. That is, for the 2 × 2 matrix

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix},$$

its determinant is

$$|A| = ad - bc. \qquad (2.27)$$

For higher-order matrices, their determinants are worked out following a rule that reduces their computation to that of determinants of lower order. Without getting more specific, fortunately this recursive algorithm for determinant computation is programmed into most widely available statistical software, including SPSS and SAS, and we will leave their calculation to the computer throughout the rest of this text. (See discussion later in this chapter concerning pertinent software instructions and resulting


44


output.) For illustrative purposes, the following are two simple numerical examples: |3| = 3 and

$$\begin{vmatrix} 2 & 5 \\ 6 & 7 \end{vmatrix} = 2 \times 7 - 5 \times 6 = -16.$$
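The recursive rule for determinants mentioned above can be sketched in a few lines of Python. The version below expands along the first row (one common choice; statistical software typically uses numerically more efficient algorithms), and the function names `minor` and `det` are our own. The 3 × 3 matrix in the last line is hypothetical.

```python
def minor(A, i, j):
    """Matrix that remains after deleting row i and column j of A."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det(A):
    """Determinant of a square matrix by expansion along its first row."""
    if len(A) == 1:                       # 1 x 1 case: |a| = a
        return A[0][0]
    # Each term reduces the problem to a determinant of lower order.
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j))
               for j in range(len(A)))


print(det([[3]]))                         # the scalar example
print(det([[2, 5], [6, 7]]))              # 2*7 - 5*6, as in Equation 2.27
print(det([[1, 2, 3], [0, 1, 4], [5, 6, 0]]))
```

For a 2 × 2 matrix the expansion reduces exactly to the rule ad − bc of Equation 2.27.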

Now that we have introduced the notion of a determinant, let us return to the issue of matrix inversion. We already know how to render the inverse of a 1 × 1 matrix. In order to find the inverse of a 2 × 2 matrix, one more concept concerning matrices needs to be introduced. This is the concept of the so-called adjoint of a square matrix, $A = [a_{ij}]$, denoted adj(A). To furnish the latter, in a first step one obtains the matrix consisting of the determinants, for each element of A, pertaining to the matrix resulting after deleting the row and column of that element in A. In a second step, one multiplies each of these determinants by $(-1)^q$, where q is the sum of the numbers corresponding to the row and column of the corresponding element of A (i.e., q = i + j for its general element, $a_{ij}$). To exemplify, consider the following 2 × 2 matrix

$$B = \begin{bmatrix} a & b \\ c & d \end{bmatrix},$$

for which the adjoint is

$$\mathrm{adj}(B) = \begin{bmatrix} d & -c \\ -b & a \end{bmatrix}.$$

That is, the adjoint of B is the matrix that results by switching position of elements on the main diagonal, as well as on the diagonal crossing it, and adding the negative sign to the off-diagonal elements of the newly formed matrix.

In order to find the inverse of a nonsingular matrix A, i.e., $A^{-1}$, the following rule needs to be applied (as we will indicate below, the determinant of an invertible matrix is distinct from 0, so the following division is possible):

$$A^{-1} = [\mathrm{adj}(A)]' / |A|. \qquad (2.28)$$

That is, for the last example,

$$B^{-1} = \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \Big/ |B| = \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \Big/ (ad - bc).$$
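The two-step construction of the adjoint and the inversion rule in Equation 2.28 can be followed in code. The Python sketch below is a didactic illustration under our own naming (`minor`, `det`, `adj`, `inverse`, `matmul`), not the algorithm used by SPSS or SAS; it works with exact fractions so that the checks $BB^{-1} = I$ and $(AB)^{-1} = B^{-1}A^{-1}$ hold without rounding error. The matrix A is hypothetical.

```python
from fractions import Fraction


def minor(A, i, j):
    """Matrix that remains after deleting row i and column j of A."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det(A):
    """Determinant by first-row expansion (the empty matrix has determinant 1)."""
    if not A:
        return 1
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j)) for j in range(len(A)))


def adj(A):
    """Adjoint as defined in the text: signed determinants, not yet transposed.
    Zero-based i + j has the same parity as the one-based q = i + j."""
    p = len(A)
    return [[(-1) ** (i + j) * det(minor(A, i, j)) for j in range(p)]
            for i in range(p)]


def inverse(A):
    """Equation 2.28: transposed adjoint divided by the determinant."""
    d = det(A)
    if d == 0:
        raise ValueError("singular matrix: no inverse exists")
    C = adj(A)
    p = len(A)
    return [[Fraction(C[j][i], d) for j in range(p)] for i in range(p)]


def matmul(A, B):
    """Product of two multiplication-conform matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


B = [[2, 5], [6, 7]]
A = [[1, 2], [3, 4]]                      # hypothetical second matrix
B_inv = inverse(B)

print(adj(B))                             # [[d, -c], [-b, a]] pattern
print(matmul(B, B_inv))                   # identity matrix
print(inverse(matmul(A, B)) == matmul(inverse(B), inverse(A)))
```

Attempting `inverse([[1, 2], [2, 4]])` raises an error in this sketch, because the determinant of that matrix is 0 and the division in Equation 2.28 is not possible.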




We stress that the rule stated in Equation 2.28 is valid for any size of a square invertible matrix A. When its size is higher than 2 × 2, the computation of the elements of the adjoint matrix adj(A) is obviously more tedious, though following the above described steps, and is best left to the computer. That is, finding the inverse of higher-order matrices proceeds via use of determinants of matrices of lower order and of the same order. (See discussion later in this chapter for detailed software instructions and resulting output.)

Before finishing this section, let us note the following interesting properties concerning matrix determinant. For any two multiplication conform square matrices A and B,

$$|AB| = |A| \, |B|, \qquad (2.29)$$

that is, the determinant of the product of the two matrices is the product of their determinants. Further, if c is a constant, then the determinant of the product of c with a matrix A is given by $|cA| = c^p |A|$, where the size of A is p × p (p ≥ 1). Last but not least, if the matrix A is singular, then |A| = 0, while its determinant is nonzero if it is invertible. Whether a particular matrix is singular or nonsingular is very important in MVS because, as mentioned previously, only nonsingular matrices have inverses; when a matrix of interest is singular, the inverse fails to exist. For example, in a correlation matrix for a set of observed variables, singularity will occur whenever there is a linear relationship between the variables (either all of them or a subset of them). In fact, it can be shown in general that for a square matrix A, |A| = 0 (i.e., the matrix is singular) if and only if there is a linear relationship between its columns (or rows; e.g., Johnson & Wichern, 2002).

Trace of a matrix. Another candidate for a single number that could be used to characterize a given matrix is its trace. Again, like determinant, we define trace only for square matrices: the trace of a square matrix is the sum of its diagonal elements. That is, if $A = [a_{ij}]$, then

$$\mathrm{tr}(A) = a_{11} + a_{22} + \ldots + a_{pp}, \qquad (2.30)$$

where tr(.) denotes trace and A is of order p × p (p ≥ 1). For example, if

$$A = \begin{bmatrix} 3 & 5 & 7 & 6 \\ 8 & 7 & 9 & 4 \\ 0 & 23 & 34 & 35 \\ 34 & 23 & 22 & 1 \end{bmatrix},$$

then its trace is tr(A) = 3 + 7 + 34 + 1 = 45.
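The properties of determinants in Equation 2.29 and the sentences following it, as well as the trace example, can be verified numerically. In the Python sketch below, the 4 × 4 matrix is the one from the trace example; the 2 × 2 matrices A and B, the constant c, and the singular matrix are hypothetical, and the helper functions are our own.

```python
def minor(A, i, j):
    """Matrix that remains after deleting row i and column j of A."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]


def det(A):
    """Determinant of a square matrix by expansion along its first row."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j)) for j in range(len(A)))


def matmul(A, B):
    """Product of two multiplication-conform matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def trace(A):
    """Sum of the diagonal elements of a square matrix (Equation 2.30)."""
    return sum(A[i][i] for i in range(len(A)))


A4 = [[3, 5, 7, 6], [8, 7, 9, 4], [0, 23, 34, 35], [34, 23, 22, 1]]
A = [[2, 5], [6, 7]]
B = [[1, 2], [3, 4]]
c, p = 3, len(A)
cA = [[c * a for a in row] for row in A]
singular = [[1, 2], [2, 4]]               # second row is twice the first

print(trace(A4))                          # sum of the diagonal elements
print(det(matmul(A, B)), det(A) * det(B)) # Equation 2.29
print(det(cA), c ** p * det(A))           # |cA| = c^p |A|
print(det(singular))                      # linear dependence gives 0
```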


46


We spent a considerable amount of time on the notions of determinant and trace in this section because they are used in MVS as generalizations of the concept of variance for a single variable. In particular, for a given covariance matrix, the trace reflects the overall variability in a studied data set since it equals the sum of the variances of all involved variables. Further, the determinant of a covariance matrix may be seen as representing the generalized variance of a random vector with this covariance matrix. Specifically, large values of the generalized variance tend to go together with a broad data scatter around their mean (mean vector) and conversely, as well as with large amounts of individual differences on studied variables. Similarly, for a correlation matrix, R, small values of |R| signal a high degree of intercorrelation among the variables, whereas large values of |R| are indicative of a limited extent of intercorrelation.
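The last point can be seen in the simplest case of two variables, where Equation 2.27 gives $|R| = 1 - r^2$ for a correlation r. The short Python sketch below prints |R| for a few hypothetical values of r; the function name `det2` is our own.

```python
def det2(M):
    """Determinant of a 2 x 2 matrix: ad - bc (Equation 2.27)."""
    (a, b), (c, d) = M
    return a * d - b * c


# Hypothetical correlations between two variables, from weak to strong.
for r in (0.1, 0.5, 0.9):
    R = [[1.0, r], [r, 1.0]]
    print(r, round(det2(R), 4))           # |R| = 1 - r**2 shrinks as r grows
```

As r approaches 1 (or −1), |R| approaches 0, the value it takes when the two variables are perfectly linearly related and R is singular.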

2.3 Using SPSS and SAS for Matrix Operations

As mentioned on several occasions in this chapter, matrix operations are best left to the computer. Indeed, with just a few simple instructions, one can readily utilize for instance either of the widely circulated software packages SPSS and SAS for these purposes. The illustration given next shows how one can employ SPSS for computing matrix sum, difference, product, determinant, and trace. (This is followed by examples of how one can use SAS for the same aims.) We insert comments preceded by a star to enhance comprehensibility of the following input files. Note the definition of the vectors and matrices involved, which happens with the COMPUTE statement (abbreviated to COMP); elements within a row are delineated by a comma, whereas successive rows in a matrix are separated by a semicolon.

    TITLE 'USING SPSS TO CARRY OUT SOME MATRIX OPERATIONS'.
    * BELOW WE UTILIZE '*' TO INITIATE A COMMENT RE. PRECEDING COMMAND IN LINE.
    * DO NOT CONFUSE IT WITH THE SIGN '*' USED FURTHER BELOW FOR MULTIPLICATION!
    * FIRST WE NEED TO TELL SPSS WE WANT IT TO CARRY OUT MATRIX OPERATIONS FOR US.
    MATRIX.                   * THIS IS HOW WE START THE SPSS MODULE FOR MATRIX OPERATIONS.
    COMP X = {1,3,6,8}.       * NEED TO ENCLOSE MATRIX IN CURLY BRACKETS.
                              * USE COMMAS TO SEPARATE ELEMENTS.
    COMP Y = {6,8,7,5}.
    COMP Z1 = X*T(Y).         * USE 'T' FOR TRANSPOSE AND '*' FOR MULTIPLICATION.
    PRINT Z1.                 * PRINT EACH RESULT SEPARATELY.
    COMP Z2 = T(X)*Y.
    PRINT Z2.
    COMP A = X + Y.
    COMP B = Y - X.
    COMP C = 3*X.
    PRINT A.
    PRINT B.
    PRINT C.
    COMP DET.Z1 = DET(Z1).    * USE 'DET(.)' FOR EVALUATING A DETERMINANT.
    COMP DET.Z2 = DET(Z2).
    PRINT DET.Z1.
    PRINT DET.Z2.
    COMP TR.Z1 = TRACE(Z1).   * USE TRACE(.) TO COMPUTE TRACE.
    COMP TR.Z2 = TRACE(Z2).
    PRINT TR.Z1.
    PRINT TR.Z2.
    END MATRIX.               * THIS IS HOW TO QUIT THE SPSS MATRIX OPERATIONS MODULE.

We point out here that the resulting matrices Z1 and Z2 will turn out to be of different size, even though they are the product of the same constituent matrices (vectors). The reason is, as mentioned before, that Z1 and Z2 result when matrix multiplication is performed in different orders.

To accomplish the same matrix operations with SAS, the following program file utilizing the Interactive Matrix Language procedure (called PROC IML) must be submitted to that software.

    proc iml;            /* THIS IS HOW WE START MATRIX OPERATIONS WITH SAS */
    X = {1 3 6 8};       /* NEED TO ENCLOSE MATRIX IN CURLY BRACKETS */
    Y = {6 8 7 5};       /* ELEMENTS ARE SEPARATED BY SPACES */
    Z1 = X*T(Y);         /* USE 'T' FOR TRANSPOSE AND '*' FOR MULTIPLICATION */
    print Z1;            /* PRINT EACH RESULT SEPARATELY */
    Z2 = T(X)*Y;
    print Z2;
    A = X + Y;
    B = Y - X;
    C = 3*X;
    print A;
    print B;
    print C;
    DETZ1 = det(Z1);     /* USE 'det(.)' FOR EVALUATING DETERMINANT */
    DETZ2 = det(Z2);
    print DETZ1;
    print DETZ2;
    TRZ1 = TRACE(Z1);    /* USE 'TRACE(.)' TO COMPUTE TRACE */
    TRZ2 = TRACE(Z2);
    print TRZ1;
    print TRZ2;
    FINISH;              /* END OF MATRIX MANIPULATIONS */

The outputs created by submitting the above SPSS and SAS command files are given next. For clarity, we present them in a different font from that of the main text, and provide comments at their end.

SPSS output

    USING SPSS TO CARRY OUT SOME MATRIX OPERATIONS

    Run MATRIX procedure:

    Z1
      112

    Z2
       6    8    7    5
      18   24   21   15
      36   48   42   30
      48   64   56   40

    A
       7   11   13   13

    B
       5    5    1   -3

    C
       3    9   18   24

    DET.Z1
      112.0000000

    DET.Z2
      0

    TR.Z1
      112

    TR.Z2
      112

    ------ END MATRIX -----




SAS output

    The SAS System

    Z1
      112

    Z2
       6    8    7    5
      18   24   21   15
      36   48   42   30
      48   64   56   40

    A
       7   11   13   13

    B
       5    5    1   -3

    C
       3    9   18   24

    DETZ1
      112

    DETZ2
      0

    TRZ1
      112

    TRZ2
      112

As can be deduced from the preceding discussion in this chapter, multiplication of the same vectors in each of the two possible orders renders different results: in the first instance a scalar (number) as in the case of Z1, and in the other a matrix as in case of Z2. Indeed, to obtain Z1 we (post)multiply a row vector with a column vector, whereas to obtain Z2 we (post)multiply a column vector with a row vector. Hence, following the earlier discussion on matrix multiplication, the resulting two matrices are correspondingly of size 1 × 1 and 4 × 4: the product $Z1 = x'y$ is a single number, often called inner product, whereas $Z2 = xy'$ is a matrix, sometimes called outer product. We also note that the above matrix Z2 has a determinant equal to 0. The reason is that it is a singular matrix, since any of its rows is a constant multiple of any other row. (In general, a singular matrix usually exhibits a more subtle, but still linear, relationship between some or all of its rows or columns.)
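For readers without access to SPSS or SAS, the inner and outer products just discussed can be reproduced with a few lines of Python. This sketch uses the same two vectors as the program files above; the variable names are our own.

```python
x = [1, 3, 6, 8]
y = [6, 8, 7, 5]

# Inner product: row vector times column vector, a single number (Z1).
inner = sum(a * b for a, b in zip(x, y))

# Outer product: column vector times row vector, a 4 x 4 matrix (Z2).
outer = [[a * b for b in y] for a in x]

# Every row of the outer product is a multiple of y, which is why Z2 is singular.
rows_proportional = all(row == [a * b for b in y] for a, row in zip(x, outer))

# The trace of the outer product equals the inner product.
trace_outer = sum(outer[i][i] for i in range(len(x)))

print(inner)
for row in outer:
    print(row)
print(rows_proportional, trace_outer)
```

The equality of TR.Z1 and TR.Z2 in the output above is therefore no coincidence: the diagonal of the outer product collects exactly the terms summed in the inner product.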


50


2.4 General Form of Matrix Multiplications With Vector, and Representation of the Covariance, Correlation, and Sum-of-Squares and Cross-Product Matrices

Much of the preceding discussion about matrices was mainly confined to a number of specific operations. Nevertheless, it would be of great benefit to the reader to get a sense of how these matrix manipulations can actually be considered in a more general form. In what follows, we move on to illustrate the use of symbols to present matrices (and vectors) and operations with them in their most general form of use for our purposes in this book.

2.4.1 Linear Modeling and Matrix Multiplication

Suppose we are given the vector $y = [y_1, y_2, \ldots, y_n]'$ consisting of n elements (that can be any real numbers), where n > 1, and the vector

$$b = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_p \end{bmatrix} = [b_0, b_1, \ldots, b_p]',$$

which is a vector of p + 1 elements (that can be arbitrary numbers as well), with 0 < p < n. Let us also assume that

$$X = \begin{bmatrix} 1 & x_{11} & \ldots & x_{1p} \\ 1 & x_{21} & \ldots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \ldots & x_{np} \end{bmatrix}$$

is a matrix with data from n subjects on p variables in its last p columns. We can readily observe that the equation

$$y = Xb \qquad (2.31)$$

in actual fact states that each consecutive element of y equals a linear combination of the elements of that row of X, which corresponds to the




location of the element in question within the vector y. For instance, the sth element of y (1 s n) equals from Equation 2.31 ys ¼ b0 þ b1 xs1 þ b2 xs2 þ . . . þ bp xsp ,

(2:32)

that is, represents a linear combination of the elements of the $s$th row of the matrix $\mathbf{X}$. We stress that since Equation 2.32 holds for each $s$ ($s = 1, \dots, n$), this linear combination utilizes the same weights for each element of $\mathbf{y}$. (These weights are the successive elements of the vector $\mathbf{b}$.)

Now let us think of $\mathbf{y}$ as a set of individual (sample) scores on a dependent variable of interest, and of $\mathbf{X}$ as a matrix consisting of the subjects' scores on a set of $p$ independent variables, with the added first column consisting of 1's only. Equation 2.31, and in particular Equation 2.32, is then in fact the equation of how one obtains predicted scores for the dependent variable in a multiple regression analysis session, if the $b$'s were the estimated partial regression weights. Further, if we now add in Equation 2.31 an $n \times 1$ vector of error scores, denoted $\mathbf{e}$, for the considered case of a single dependent variable we get the general equation of the multiple linear regression model:

$$ \mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}. \tag{2.33} $$
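The row-by-row reading of Equation 2.31 can be illustrated with a short Python sketch. The data matrix and the weights below are hypothetical values chosen only so that the arithmetic is easy to follow, and the helper name matvec is our own.

```python
def matvec(X, b):
    """Multiply an n x (p + 1) matrix by a (p + 1)-element vector."""
    return [sum(x_sj * b_j for x_sj, b_j in zip(row, b)) for row in X]

# Hypothetical data: n = 3 subjects, p = 2 predictors, leading column of 1's.
X = [
    [1, 2.0, 5.0],
    [1, 4.0, 1.0],
    [1, 0.0, 3.0],
]
b = [10.0, 2.0, -1.0]   # b0 (intercept), b1, b2: assumed regression weights

# Each element of y_hat is one application of Equation 2.32 to a row of X.
y_hat = matvec(X, b)
print(y_hat)   # [9.0, 17.0, 7.0]
```

Every predicted score uses the same weights b0, b1, b2; only the row of X changes from subject to subject.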

Hence, already when dealing with univariate regression analysis, one has in fact been implicitly carrying out matrix multiplication (of a matrix by vector) any time when obtaining predicted values for a response variable.

2.4.2 Three Fundamental Matrices of Multivariate Statistics in Compact Form

Recall from introductory statistics how we estimate the variance of a single (unidimensional) random variable $X$ with observed sample values $x_1, x_2, \dots, x_n$ ($n > 1$): if we denote that estimator $s_X^2$, then

$$ s_X^2 = \frac{1}{n-1}\left[(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2\right] = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2. \tag{2.34} $$

From Equation 2.34, the sum of squares for this random variable, $X$, is also based on its realizations $x_1, x_2, \dots, x_n$ and is given by

$$ SS_X = (n-1)s_X^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2, \tag{2.35} $$

where $\bar{x}$ is their sample average.
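As a small numerical check of Equations 2.34 and 2.35, the following Python sketch computes the sum of squares and the variance estimate for a hypothetical sample of four values (the data are ours, not taken from the book), and compares the result with the variance function of the Python standard library.

```python
from statistics import mean, variance

x = [2, 4, 4, 6]        # hypothetical sample, n = 4
n = len(x)
x_bar = mean(x)         # sample average

ss_x = sum((xi - x_bar) ** 2 for xi in x)   # sum of squares, Equation 2.35
s2_x = ss_x / (n - 1)                       # variance estimate, Equation 2.34

print(ss_x, s2_x)
print(abs(s2_x - variance(x)) < 1e-12)      # agrees with the library estimator
```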




As one may also recall from univariate statistics (UVS), sums of squares play a very important part in a number of its methods (in particular, analysis of variance). It is instructive to mention here that the role played by sums of squares in UVS is played by the SSCP matrix $\mathbf{Q}$ in MVS. Further, as we mentioned in Chapter 1, the elements of the SSCP matrix $\mathbf{Q}$ are $n - 1$ times the corresponding elements of the covariance matrix $\mathbf{S}$. Hence, using the earlier discussed rule of matrix multiplication with a scalar, it follows that the SSCP matrix can be written as

$$ \mathbf{Q} = (n-1)\mathbf{S}. \tag{2.36} $$

Comparing now Equations 2.35 and 2.36, we see that the latter is a multivariate analog of the former, with Equation 2.35 resulting from Equation 2.36 in the special case of a single variable. This relationship leads us to a more generally valid analogy that facilitates greatly understanding the conceptual ideas behind a number of multivariate statistical methods. To describe it, notice that we could carry out the following two operations in the right-hand side of Equation 2.35 in order to obtain Equation 2.36: (i) exchange single variables with vectors, and (ii) exchange the square with the product of the underlying expression (the one being squared) with its transpose. In this way, following steps (i) and (ii), from Equation 2.35 one directly obtains the formula for the SSCP matrix from that of sum of squares for a given random variable:

$$ \mathbf{Q} = \sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})', \tag{2.37} $$

where $\mathbf{x}_i$ is the vector of scores of the $i$th subject on a set of $p$ studied variables ($i = 1, \dots, n$), and $\bar{\mathbf{x}}$ is the vector with elements being the means of these variables. Note that in the right-hand side of Equation 2.37 we have a sum of $n$ products of a $p \times 1$ column vector with a $1 \times p$ row vector, i.e., a sum of $n$ matrices each of size $p \times p$. With this in mind, it follows that

$$ \mathbf{S} = (n-1)^{-1}\mathbf{Q} = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})', \tag{2.38} $$

which is the "reverse" relationship to that in Equation 2.36, and one that we emphasized earlier in this chapter as well as in Chapter 1. We also note in passing that Equation 2.35 follows as a special case from Equation 2.37, when the dimensionality of the observed variable vector $\mathbf{x}$ in the latter is 1; in that case, from Equation 2.38 follows also the formula in Equation 2.34 for estimation of variance for a given random variable (based on its random realizations in a sample).

Next recall the definitional relationship between the correlation coefficient of two random variables $X_1$ and $X_2$ (denoted $\mathrm{Corr}(X_1, X_2)$), their covariance coefficient $\mathrm{Cov}(X_1, X_2)$, and their standard deviations $s_{X_1}$ and $s_{X_2}$ (assuming the latter are not zero; see Chapter 1):

$$ \mathrm{Corr}(X_1, X_2) = \frac{\mathrm{Cov}(X_1, X_2)}{s_{X_1} s_{X_2}}. \tag{2.39} $$

In the case of more than two variables, Equation 2.39 would relate just one element of the correlation matrix $\mathbf{R}$ with the corresponding element of the covariance matrix $\mathbf{S}$ and the reciprocal of the product of the involved variables' standard deviations. Now, for a given random vector $\mathbf{x}$, that is a set of random variables $X_1, X_2, \dots, X_p$, one can define the following diagonal matrix:

$$ \mathbf{D} = \begin{bmatrix} s_{X_1} & 0 & \dots & 0 \\ 0 & s_{X_2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & s_{X_p} \end{bmatrix}, $$

which has as its only nonzero elements the standard deviations of the corresponding elements of the vector $\mathbf{x}$ along its main diagonal ($p > 1$). The inverse of the matrix $\mathbf{D}$ (i.e., $\mathbf{D}^{-1}$) would simply be a diagonal matrix with the reciprocals of the standard deviations along its main diagonal (as can be found out by direct multiplication of $\mathbf{D}$ and $\mathbf{D}^{-1}$, which renders the unit matrix; recall earlier discussion in this chapter on uniqueness of matrix inverse). Thus, the inverse of $\mathbf{D}$ is

$$ \mathbf{D}^{-1} = \begin{bmatrix} 1/s_{X_1} & 0 & \dots & 0 \\ 0 & 1/s_{X_2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1/s_{X_p} \end{bmatrix}. $$

Based on these considerations, and using the earlier discussed rules of matrix multiplication, one can readily find out that the correlation matrix $\mathbf{R}$ can now be written as

$$ \mathbf{R} = \mathbf{D}^{-1}\mathbf{S}\mathbf{D}^{-1} = \mathbf{D}^{-1}\left[\frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})'\right]\mathbf{D}^{-1}. \tag{2.40} $$




Hence, in Equations 2.37, 2.38, and 2.40 we have expressed in compact form three fundamental matrices in MVS. Thereby, we have instrumentally used the ‘‘uni-to-multivariate analogy’’ indicated previously on several occasions (Rencher, 1995).
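The three compact formulas can be traced numerically with a short Python sketch. The data set below is hypothetical (four subjects, two variables), the function names are our own, and the code is meant only to mirror Equations 2.37, 2.38, and 2.40 rather than to serve as an efficient implementation.

```python
import math

def mean_vector(data):
    """Vector of variable means (x_bar)."""
    n = len(data)
    return [sum(row[j] for row in data) / n for j in range(len(data[0]))]

def sscp(data):
    """Q: sum over subjects of (x_i - x_bar)(x_i - x_bar)' (Equation 2.37)."""
    p = len(data[0])
    x_bar = mean_vector(data)
    Q = [[0.0] * p for _ in range(p)]
    for row in data:
        d = [row[j] - x_bar[j] for j in range(p)]
        for j in range(p):
            for k in range(p):
                Q[j][k] += d[j] * d[k]   # (j, k) element of the outer product
    return Q

def covariance(data):
    """S = Q / (n - 1) (Equation 2.38)."""
    n = len(data)
    return [[q / (n - 1) for q in q_row] for q_row in sscp(data)]

def correlation(data):
    """R = D^{-1} S D^{-1} (Equation 2.40), written out element by element."""
    S = covariance(data)
    sd = [math.sqrt(S[j][j]) for j in range(len(S))]
    return [[S[j][k] / (sd[j] * sd[k]) for k in range(len(S))]
            for j in range(len(S))]

# Hypothetical scores of n = 4 subjects on p = 2 variables.
data = [[1, 2], [2, 4], [3, 6], [6, 4]]
Q = sscp(data)
S = covariance(data)
R = correlation(data)
print(Q)   # [[14.0, 4.0], [4.0, 8.0]]
print(S)
print(R)
```

Pre- and postmultiplying S by the diagonal matrix of reciprocal standard deviations divides its (j, k) element by the product of the jth and kth standard deviations, which is why the last function can work element by element.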

2.5 Raw Data Points in Higher Dimensions, and Distance Between Them

Data points. If we wanted to take a look at the data of a particular person from a given data matrix, e.g., that in Table 1.1 (see Chapter 1), we can just take his/her row and represent it separately, i.e., as a row vector. For example, if we were interested in examining the data for the third person in Table 1.1, they would be represented as the following $1 \times 7$ matrix (row vector; i.e., a horizontal "slice" of the data matrix):

$$ \mathbf{x}' = [3, 40, 51, 46, 1, 2, 43], $$

where we use the prime symbol in compliance with the widely adopted convention to imply a column vector from a simple reference to a vector. Similarly, if we wanted to look at all subjects' data on only 1 variable, say Test 3 in the same data table, we would obtain the following $4 \times 1$ matrix (column vector; a vertical "slice" of the matrix):

$$ \mathbf{y} = \begin{bmatrix} 47 \\ 57 \\ 46 \\ 48 \end{bmatrix}. $$

Note that we can also represent this vector, perhaps more conveniently, by stating its transpose:

$$ \mathbf{y} = [47, 57, 46, 48]'. $$

We could also think of both vectors $\mathbf{x}$ and $\mathbf{y}$ as representing two data points in a multidimensional space that we next turn to.

Multivariable Distance (Mahalanobis Distance)

Many notions underlying MVS cannot be easily or directly visualized because they are typically related to a $q$-dimensional space, where $q > 3$. Regrettably, we are only three-dimensional creatures, and thus so is also our immediate imagination. It will therefore be quite helpful to utilize whenever possible extensions of our usual notions of three-dimensional space, especially if these can assist us in understanding much of MVS, at least at a conceptual level.




Two widely used spaces in MVS. To accomplish such extensions, it will be beneficial to think of each subject's data (on all studied variables) as a point in a $p$-dimensional space, where each row of the data matrix represents a corresponding point. At times it will be similarly useful to think of all subjects' data on each separate variable as a point in an $n$-dimensional space, where each column of the data matrix is represented by a corresponding point. Notice the difference between these two spaces. Specifically, the meaning and coordinates of the actual data point indicated are different in both cases. This difference stems from what the coordinate axes in these two spaces are supposed to mean, as well as their number. In the first case ($p$-dimensional space), one can think of the studied variables being the axes and individuals being points in that space, with these points corresponding to their data. This $p$-dimensional space will be quite useful for most multivariate techniques discussed in this text and is perhaps "more natural," but the second mentioned space above is also quite useful. In the latter, $n$-dimensional space, one may want to think of individuals being the axes while variables are positioned within it according to the columns in an observed data matrix. Preempting some of the discussion in a later chapter, this space will be particularly helpful when considering the technique of factor analysis, and especially when interested in factor rotation (Chapter 8).

In addition to reference to a multivariate space, many MVS procedures can be understood by using the notion of a distance, so we move now to this concept.

Distance in a multivariable space. For a point that is, say, in a $q$-dimensional space, $\mathbf{x} = [x_1, x_2, \dots, x_q]'$ ($q > 1$), we can define its distance to the origin as its length (denoted by $\|\mathbf{x}\|$) as follows:

$$ \|\mathbf{x}\| = \sqrt{x_1^2 + x_2^2 + \dots + x_q^2}. \tag{2.41} $$

(We will refer below to $x_1, x_2, \dots, x_q$ at times also as components of $\mathbf{x}$.) Now, if there are two points in that $q$-dimensional space, say $\mathbf{x}$ and $\mathbf{y}$, we can define their distance $D(\mathbf{x}, \mathbf{y})$ as the length of their difference, $\mathbf{x} - \mathbf{y}$ (i.e., as the distance of $\mathbf{x} - \mathbf{y}$ to the origin), where $\mathbf{x} - \mathbf{y}$ is obviously the vector having as components the corresponding differences in the components of $\mathbf{x}$ and $\mathbf{y}$:

$$ D(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\| = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_q - y_q)^2}. \tag{2.42} $$

For example, if $\mathbf{x} = (2, 8)'$ and $\mathbf{y} = (4, 2)'$, then their distance would be as follows (note that $q = 2$ here, i.e., these points lie in a two-dimensional space):

$$ D(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\| = \sqrt{(2 - 4)^2 + (8 - 2)^2} = \sqrt{4 + 36} = 6.325. $$
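The arithmetic of this example is easy to verify with a few lines of Python. The sketch below simply applies Equation 2.42 to the two points given above; the function name euclidean_distance is our own.

```python
import math

def euclidean_distance(x, y):
    """Length of the difference vector x - y (Equation 2.42)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [2, 8]
y = [4, 2]
d = euclidean_distance(x, y)   # sqrt(4 + 36) = sqrt(40)
print(round(d, 3))
```

Setting y to the zero vector in the same function returns the length of x, in line with Equation 2.41.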

Using the earlier discussed rules of matrix multiplication, it is readily observed that the expression (2.42) can also be written as

$$ \|\mathbf{x} - \mathbf{y}\|^2 = (\mathbf{x} - \mathbf{y})'\,\mathbf{I}_q\,(\mathbf{x} - \mathbf{y}). \tag{2.43} $$

Equation 2.43 defines what is commonly known as "Euclidean distance" between the vectors $\mathbf{x}$ and $\mathbf{y}$. This is the conventional concept used to represent distance between points in a multivariable space. In this definition, each variable (component, or coordinate) participates with the same weight, viz. 1. Note that Equation 2.41 is a special case of Equation 2.42, which results from the latter when $\mathbf{y} = \mathbf{0}$ (the last being the zero vector consisting of only 0's as its elements).

Most of the time in empirical behavioral and social research, however, some observed variables may have larger variances than others, and in addition some variables are more likely to be more strongly related with one another than with other variables. These facts are not taken into account in the Euclidean distance, but are so in what is called multivariable (statistical) distance. In other words, the Euclidean distance depends on the units in which variables (components) are measured, and thus can be influenced by whichever variable takes numerically larger values. This effect of differences in units of measurement is counteracted in the multivariate distance by particular weighting that can be given to different components (studied variables or measures). This is accomplished by employing a prespecified weight matrix $\mathbf{W}$ that is an appropriate square matrix, with which the multivariable distance is more compactly defined as

$$ D_{\mathbf{W}}(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})'\,\mathbf{W}\,(\mathbf{x} - \mathbf{y}). \tag{2.44} $$
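The effect of the weight matrix in Equation 2.44 can be seen in a small Python sketch. The two points are those of the preceding example, whereas the diagonal weight matrix W is a hypothetical choice of ours that simply down-weights the second variable; the function name generalized_distance is also our own.

```python
def generalized_distance(x, y, W):
    """Return (x - y)' W (x - y) for a square weight matrix W (Equation 2.44)."""
    d = [a - b for a, b in zip(x, y)]
    Wd = [sum(w_jk * d_k for w_jk, d_k in zip(row, d)) for row in W]
    return sum(d_j * wd_j for d_j, wd_j in zip(d, Wd))

x, y = [2, 8], [4, 2]

identity = [[1, 0], [0, 1]]   # equal weights: squared Euclidean distance
W = [[1, 0], [0, 0.25]]       # hypothetical weights for illustration

print(generalized_distance(x, y, identity))   # 4 + 36 = 40
print(generalized_distance(x, y, W))          # 4 + 0.25 * 36 = 13.0
```

With the identity matrix as W the function reproduces the squared Euclidean distance of Equation 2.43; with the second weight reduced, the large difference on the second variable contributes far less to the distance.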

The product in the right-hand side of Equation 2.44 is also denoted by $\|\mathbf{x} - \mathbf{y}\|^2_{\mathbf{W}}$. The weight matrix $\mathbf{W}$ can be in general an appropriately sized symmetric matrix with the property that $\mathbf{z}'\mathbf{W}\mathbf{z} > 0$ for any nonzero vector $\mathbf{z}$ (of a dimension that makes it conformable for multiplication with $\mathbf{W}$). Such a matrix is called positive definite, typically denoted by $\mathbf{W} > 0$. Any covariance, correlation, or SSCP matrix of a set of random variables (with real-valued realizations) has this property, if the variables are not linearly related. An interesting feature of a positive definite matrix is that only positive numbers can lie along its main diagonal; that is, if $\mathbf{W} = [w_{ij}]$ is positive definite, then $w_{ii} > 0$ ($i = 1, 2, \dots, q$, where $q$ is the size of the matrix; $q \ge 1$). The distance $D_{\mathbf{W}}(\mathbf{x}, \mathbf{y})$ is also called generalized distance between the points $\mathbf{x}$ and $\mathbf{y}$ with regard to the weight matrix $\mathbf{W}$. In MVS, the $\mathbf{W}$ matrices typically take into account individual differences on the observed variables as well as their interrelationships. In particular, the most important distance measure used in this book will be the so-called Mahalanobis distance (MD; sometimes also called "statistical distance"). This is the distance from an observed data point $\mathbf{x}$ (i.e., the vector of an individual's scores on $p$ observed variables, $p > 1$) to the mean for a sample, $\bar{\mathbf{x}}$, weighted by the inverse of the sample covariance matrix $\mathbf{S}$ of the $p$ variables, and is denoted by $\mathrm{Mah}(\mathbf{x})$:

$$ \mathrm{Mah}(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}})'\,\mathbf{S}^{-1}\,(\mathbf{x} - \bar{\mathbf{x}}). \tag{2.45} $$

Note that this is a distance in the $p$-space, i.e., the variable space, where the role of axes is played by the measured variables. As can be easily seen from Equation 2.45, the MD is a direct generalization or extension of the intuitive notion of univariate distance to the multivariate case. A highly useful form of univariate distance is the well-known z-score:

$$ z = (x - \bar{x})/s, \tag{2.46} $$

where $s$ denotes the standard deviation of the variable $x$. Specifically, if we were to write out Equation 2.46 as follows:

$$ z^2 = (x - \bar{x})'\,(s^2)^{-1}\,(x - \bar{x}) \tag{2.47} $$

and then compare it with Equation 2.45, the analogy is readily apparent. As we have indicated before, this type of "uni-to-multivariate" analogy will turn out to be highly useful as we discuss more complex multivariate methods.

The MD evaluates the distance of each individual data point, in the $p$-dimensional space, to the centroid of the sample data. The centroid is the point with coordinates being the means of the observed variables, i.e., the vector with elements being the means of the observed variables. Data points $\mathbf{x}$ with larger $\mathrm{Mah}(\mathbf{x})$ are further out from the sample mean than points with smaller MD. In fact, points with very large Mah are potentially abnormal observations. (We discuss this issue in greater detail in the next chapter.)

So how can one evaluate in practice the MD for a given point $\mathbf{x}$ with respect to the mean of a data set? To accomplish this aim, one needs to find out the means on all variables and their covariance matrix for the particular data set (sample). Then, using either SPSS or SAS, the distance defined in Equation 2.45 of that point, $\mathbf{x}$, to the centroid of the data set can be evaluated. To illustrate, consider a study involving three tests (i.e., $p = 3$) of writing, reading, and general mental ability, which were administered to a sample of $n = 200$ elementary school children. Suppose one were interested in finding out the distance between the point $\mathbf{x} = [77, 56, 88]'$ of scores obtained by one of the students, to the centroid of the sample data, $\bar{\mathbf{x}} = [46, 55, 65]'$, with the sample covariance matrix of these three variables being

$$ \mathbf{S} = \begin{bmatrix} 994.33 & & \\ 653.30 & 873.24 & \\ 554.12 & 629.88 & 769.67 \end{bmatrix}. \tag{2.48} $$

To work out the MD for this point, $\mathrm{Mah}(\mathbf{x})$, either the following SPSS command file or subsequent SAS command file can be used. (To save space, only comments are inserted pertaining to operations and entities not used previously in the chapter.)

SPSS command file

TITLE 'HOW TO USE SPSS TO WORK OUT MAHALANOBIS DISTANCE'.
MATRIX.
COMP X = {77, 56, 88}.
* NOTE THAT WE DEFINE X AS A ROW VECTOR HERE.
COMP X.BAR = {46, 55, 65}.
COMP S = {994.33, 653.30, 554.12; 653.30, 873.24, 629.88; 554.12, 629.88, 769.67}.
PRINT S.
COMP S.INV = INV(S).
PRINT S.INV.
COMP MAH.DIST = (X-X.BAR)*S.INV*T(X-X.BAR).
* SEE NOTE AT THE BEGINNING OF THIS INPUT FILE.
PRINT MAH.DIST.
END MATRIX.

SAS command file

proc iml;
X = {77 56 88};
XBAR = {46 55 65};
S = {994.33 653.30 554.12, 653.30 873.24 629.88, 554.12 629.88 769.67}; /* NOTE THE USE OF COMMA TO SEPARATE ROWS */
print S;
INVS = INV(S);
print INVS;
DIFF = X - XBAR; /* THIS IS THE DEVIATION SCORE */
TRANDIFF = T(DIFF); /* THIS IS THE TRANSPOSE OF THE DEVIATION SCORE */
MAHDIST = DIFF*INVS*TRANDIFF;
print MAHDIST;
QUIT;

The following output results are furnished by these SPSS and SAS program files:

SPSS output

HOW TO USE SPSS TO WORK OUT MAHALANOBIS DISTANCE

Run MATRIX procedure:

S
   994.3300000   653.3000000   554.1200000
   653.3000000   873.2400000   629.8800000
   554.1200000   629.8800000   769.6700000

S.INV
  10 ** -3   X
   2.067023074  -1.154499684   -.543327095
  -1.154499684   3.439988733  -1.984030478
   -.543327095  -1.984030478   3.314108030

MAH.DIST
   2.805383491

------ END MATRIX -----

SAS output

The SAS System

S
   994.33    653.3   554.12
    653.3   873.24   629.88
   554.12   629.88   769.67

INVS
   0.0020670   -0.001154   -0.000543
   -0.001154   0.0034400   -0.001984
   -0.000543   -0.001984   0.0033141

MAHDIST
   2.8053835




It is important to note that the inverse of the empirical covariance matrix provided in Equation 2.48—which is denoted S.INV in the SPSS command file and INVS in the SAS file—is the matrix that is instrumentally needed for calculating the MD value of 2.805. Further, in SPSS the matrix S.INV is provided in scientific notation, whereas it is not so in SAS. We finalize this chapter by emphasizing once again that the concept of MD renders important information about the distance between points in a p-dimensional variable space, for any given sample of data from a multivariable study. In addition, as we will see in more detail in the next chapter, it allows us to evaluate whether some observations may be very different from the majority in a given sample. Last but not least, it helps us conceptually understand a number of multivariate statistical procedures, especially those related to group differences, which are dealt with in later chapters.
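For readers working outside SPSS and SAS, the computation in Equation 2.45 can also be sketched in plain Python. This is a minimal illustration under our own assumptions rather than a recommended implementation: the function names solve and mahalanobis are ours, and instead of forming the inverse of S explicitly, the sketch solves the linear system S z = (x − x̄) by Gaussian elimination, which yields the same quadratic form.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are left untouched.
    M = [list(map(float, row)) + [float(v)] for row, v in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):          # back substitution
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def mahalanobis(x, x_bar, S):
    """Return (x - x_bar)' S^{-1} (x - x_bar), as in Equation 2.45."""
    d = [a - b for a, b in zip(x, x_bar)]
    z = solve(S, d)                          # z = S^{-1} d
    return sum(d_j * z_j for d_j, z_j in zip(d, z))

x = [77, 56, 88]
x_bar = [46, 55, 65]
S = [
    [994.33, 653.30, 554.12],
    [653.30, 873.24, 629.88],
    [554.12, 629.88, 769.67],
]
md = mahalanobis(x, x_bar, S)
print(round(md, 3))
```

Under these assumptions the printed value should agree with the MD value of 2.805 discussed above.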


3 Data Screening and Preliminary Analyses

Results obtained through application of univariate or multivariate statistical methods will in general depend critically on the quality of the data and on the numerical magnitude of the elements of the data matrix as well as variable relationships. For this reason, after data are collected in an empirical study and before they are analyzed using a particular method(s) to respond to a research question(s) of concern, one needs to conduct what is typically referred to as data screening. These preliminary activities aim (a) to ensure that the data to be analyzed represent correctly the data originally obtained, (b) to search for any potentially very influential observations, and (c) to assess whether assumptions underlying the method(s) to be applied subsequently are plausible. This chapter addresses these issues.

3.1 Initial Data Exploration

To obtain veridical results from an empirical investigation, the data collected in it must have been accurately entered into the data file submitted to the computer for analysis. Mistakes committed during the process of data entry can be very costly and can result in incorrect parameter estimates, standard errors, and test statistics, potentially yielding misleading substantive conclusions. Hence, one needs to spend as much time as necessary to screen the data for entry errors, before proceeding with the application of any uni- or multivariate method aimed at responding to the posited research question(s). Although this process of data screening may be quite time consuming, it is an indispensable prerequisite of a trustworthy data analytic session, and the time invested in data screening will always prove to be worthwhile.

Once a data set is obtained in a study, it is essential to begin with proofreading the available data file. With a small data set, it may be best to check each original record (i.e., each subject's data) for correct entry. With larger data sets, however, this may not be a viable option, and so one may instead arrange to have at least two independent data entry sessions followed by a comparison of the resulting files. Where discrepancies are found, examination of the raw (original) data records must then be carried out in order to correctly represent the data in a computer file to be analyzed subsequently using particular statistical methods. Obviously, the use of independent data entry sessions can prove to be expensive and time consuming. In addition, although such checks may resolve noted discrepancies when entering the data into a file, they will not detect possible common errors across all entry sessions or incorrect records in the original data.

Therefore, for any data set once entered into a computer file and proofread, it is recommended that a researcher carefully examine frequencies and descriptive statistics for each variable across all studied persons. (In situations involving multiple-population studies, this should also be carried out within each group or sample.) Thereby, one should check, in particular, the range of each variable, and specifically whether the recorded maximum and minimum values on it make sense. Further, when examining each variable's frequencies, one should also check if all values listed in the frequency table are legitimate. In this way, errors at the data-recording stage can be spotted and immediately corrected.

To illustrate these very important preliminary activities, let us consider a study in which data were collected from a sample of 40 university freshmen on a measure of their success in an educational program (referred to below as "exam score" and recorded in a percentage correct metric), and its relationship to an aptitude measure, age in years, an intelligence test score, as well as a measure of attention span. (The data for this study can be found in the file named ch3ex1.dat available from www.psypress.com/applied-multivariate-analysis.) To initially screen the data set, we begin by examining the frequencies and descriptive statistics of all variables.
To accomplish this initial data screening in SPSS, we use the following menu options (in the order given next) to obtain the variable frequencies:

Analyze → Descriptive statistics → Frequencies,

and, correspondingly, to furnish their descriptive statistics:

Analyze → Descriptive statistics → Descriptives.

In order to generate the variable frequencies and descriptive statistics in SAS, the following command file can be used. In SAS, there are often a number of different ways to accomplish the same aim. The commands provided below were selected to maintain similarity with the structure of the output rendered by the above SPSS analysis session. In particular, the order of the options in the SAS PROC MEANS statement is structured to create similar output (with the exception of fw=6, which requests the field width of the displayed statistics be set at 6; alternatively, the command "maxdec=6" could be used to specify the maximum number of decimal places to output).



DATA CHAPTER3;
INFILE 'ch3ex1.dat';
INPUT id Exam_Score Aptitude_Measure Age_in_Years Intelligence_Score Attention_Span;
PROC MEANS n range min max mean std fw=6;
var Exam_Score Aptitude_Measure Age_in_Years Intelligence_Score Attention_Span;
RUN;
PROC FREQ;
TABLES Exam_Score Aptitude_Measure Age_in_Years Intelligence_Score Attention_Span;
RUN;

The resulting outputs produced by SPSS and SAS are as follows:

SPSS descriptive statistics output

Descriptive Statistics

                       N    Range   Minimum   Maximum    Mean    Std. Deviation
Exam Score            40     102       50       152      57.60       16.123
Aptitude Measure      40      24       20        44      23.12        3.589
Age in Years          40       9       15        24      18.22        1.441
Intelligence Score    40       8       96       104      99.00        2.418
Attention Span        40       7       16        23      20.02        1.349
Valid N (listwise)    40

SAS descriptive statistics output

The SAS System

The MEANS Procedure

Variable              N    Range     Min      Max     Mean    Std Dev
Exam_Score           40    102.0    50.00    152.0    57.60    16.12
Aptitude_Measure     40    24.00    20.00    44.00    23.13    3.589
Age_in_Years         40    9.000    15.00    24.00    18.23    1.441
Intelligence_Score   40    8.000    96.00    104.0    99.00    2.418
Attention_Span       40    7.000    16.00    23.00    20.03    1.349




By examining the descriptive statistics in either of the above tables, we readily observe the high range on the dependent variable Exam Score. This apparent anomaly is also detected by looking at the frequency distribution of each measure, in particular of the same variable. The pertinent output sections are as follows:

SPSS frequencies output

Frequencies

Exam Score

Valid    Frequency   Percent   Valid Percent   Cumulative Percent
50           5         12.5        12.5               12.5
51           3          7.5         7.5               20.0
52           8         20.0        20.0               40.0
53           5         12.5        12.5               52.5
54           3          7.5         7.5               60.0
55           3          7.5         7.5               67.5
56           1          2.5         2.5               70.0
57           3          7.5         7.5               77.5
62           1          2.5         2.5               80.0
63           3          7.5         7.5               87.5
64           1          2.5         2.5               90.0
65           2          5.0         5.0               95.0
69           1          2.5         2.5               97.5
152          1          2.5         2.5              100.0
Total       40        100.0       100.0

Note how the score 152 ‘‘sticks out’’ from the rest of the values observed on the Exam Score variable—there is no one else having a score even close to 152; the latter finding is also not unexpected because as mentioned this variable was recorded in the metric of percentage correct responses. We continue our examination of the remaining measures in the study and return later to the issue of discussing and dealing with found anomalous, or at least apparently so, values.
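The same screening logic can be sketched in Python for readers who want to check the numbers by hand. The exam scores are rebuilt from the frequency table above; the legal range of 0 to 100 is our assumption, based on the statement that the variable was recorded in a percentage correct metric, and the variable names are our own.

```python
from statistics import mean, stdev

# Exam Score values rebuilt from the frequency table (value: frequency).
exam_freq = {50: 5, 51: 3, 52: 8, 53: 5, 54: 3, 55: 3, 56: 1,
             57: 3, 62: 1, 63: 3, 64: 1, 65: 2, 69: 1, 152: 1}
scores = [value for value, count in exam_freq.items() for _ in range(count)]

# Descriptive statistics comparable to those requested from SPSS and SAS.
summary = {
    "n": len(scores),
    "range": max(scores) - min(scores),
    "min": min(scores),
    "max": max(scores),
    "mean": round(mean(scores), 2),
    "sd": round(stdev(scores), 3),
}
print(summary)

# Assumed legal range for a percentage-correct score.
LOW, HIGH = 0, 100
illegal = sorted(v for v in exam_freq if not LOW <= v <= HIGH)
print(illegal)   # values that cannot be legitimate under this assumption
```

A range check of this kind flags only values that are impossible for the metric; values that are possible but unusual, such as the aptitude score of 44 noted below, still require inspection of the frequency tables.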

Aptitude Measure

Valid    Frequency   Percent   Valid Percent   Cumulative Percent
20           2          5.0         5.0                5.0
21           6         15.0        15.0               20.0
22           8         20.0        20.0               40.0
23          14         35.0        35.0               75.0
24           8         20.0        20.0               95.0
25           1          2.5         2.5               97.5
44           1          2.5         2.5              100.0
Total       40        100.0       100.0




Here we also note a subject whose aptitude score tends to stand out from the rest: the one with a score of 44.

Age in Years

Valid    Frequency   Percent   Valid Percent   Cumulative Percent
15           1          2.5         2.5                2.5
16           1          2.5         2.5                5.0
17           9         22.5        22.5               27.5
18          15         37.5        37.5               65.0
19           9         22.5        22.5               87.5
20           4         10.0        10.0               97.5
24           1          2.5         2.5              100.0
Total       40        100.0       100.0

On the age variable, we observe that a subject seems to be very different from the remaining persons with regard to age, having a low value of 15. Given that this is a study of university freshmen, although not a common phenomenon to encounter someone that young, such an age per se does not seem really unusual for attending college.

Intelligence Score

Valid    Frequency   Percent   Valid Percent   Cumulative Percent
96           9         22.5        22.5               22.5
97           4         10.0        10.0               32.5
98           5         12.5        12.5               45.0
99           5         12.5        12.5               57.5
100          6         15.0        15.0               72.5
101          5         12.5        12.5               85.0
102          2          5.0         5.0               90.0
103          2          5.0         5.0               95.0
104          2          5.0         5.0              100.0
Total       40        100.0       100.0

The range of scores on this measure also seems to be well within what could be considered consistent with expectations in a study involving university freshmen.

Attention Span

Valid    Frequency   Percent   Valid Percent   Cumulative Percent
16           1          2.5         2.5                2.5
18           6         15.0        15.0               17.5
19           2          5.0         5.0               22.5
20          16         40.0        40.0               62.5
21          12         30.0        30.0               92.5
22           2          5.0         5.0               97.5
23           1          2.5         2.5              100.0
Total       40        100.0       100.0




Finally, with regard to the variable attention span, there is no subject that appears to have an excessively high or low score compared to the rest of the available sample.

SAS frequencies output

Because the similarly structured output created by SAS would obviously lead to interpretations akin to those offered above, we dispense with inserting comments in the next presented sections.

The SAS System

The FREQ Procedure

Exam_Score   Frequency   Percent   Cumulative Frequency   Cumulative Percent
50               5        12.50             5                   12.50
51               3         7.50             8                   20.00
52               8        20.00            16                   40.00
53               5        12.50            21                   52.50
54               3         7.50            24                   60.00
55               3         7.50            27                   67.50
56               1         2.50            28                   70.00
57               3         7.50            31                   77.50
62               1         2.50            32                   80.00
63               3         7.50            35                   87.50
64               1         2.50            36                   90.00
65               2         5.00            38                   95.00
69               1         2.50            39                   97.50
152              1         2.50            40                  100.00

Aptitude_Measure   Frequency   Percent   Cumulative Frequency   Cumulative Percent
20                     2         5.00             2                    5.00
21                     6        15.00             8                   20.00
22                     8        20.00            16                   40.00
23                    14        35.00            30                   75.00
24                     8        20.00            38                   95.00
25                     1         2.50            39                   97.50
44                     1         2.50            40                  100.00

Age_in_Years   Frequency   Percent   Cumulative Frequency   Cumulative Percent
15                 1         2.50             1                    2.50
16                 1         2.50             2                    5.00
17                 9        22.50            11                   27.50
18                15        37.50            26                   65.00
19                 9        22.50            35                   87.50
20                 4        10.00            39                   97.50
24                 1         2.50            40                  100.00

Intelligence_Score   Frequency   Percent   Cumulative Frequency   Cumulative Percent
96                       9        22.50             9                   22.50
97                       4        10.00            13                   32.50
98                       5        12.50            18                   45.00
99                       5        12.50            23                   57.50
100                      6        15.00            29                   72.50
101                      5        12.50            34                   85.00
102                      2         5.00            36                   90.00
103                      2         5.00            38                   95.00
104                      2         5.00            40                  100.00

Attention_Span   Frequency   Percent   Cumulative Frequency   Cumulative Percent
16                   1         2.50             1                    2.50
18                   6        15.00             7                   17.50
19                   2         5.00             9                   22.50
20                  16        40.00            25                   62.50
21                  12        30.00            37                   92.50
22                   2         5.00            39                   97.50
23                   1         2.50            40                  100.00

Although examining the descriptive statistics and frequency distributions across all variables is highly informative, in the sense that one learns what the data actually are (especially when looking at their frequency tables), it is worthwhile noting that these statistics and distributions are only available for each variable when considered separately from the others. That is, like the descriptive statistics, frequency distributions provide only univariate information with regard to the relationships among the values that subjects give rise to on a given measure. Hence, when an (apparently) anomalous value is found for a particular variable, neither descriptive statistics nor frequency tables can provide further information about the person(s) with that anomalous score, in particular regarding their scores on some or all of


68


the remaining measures. As a first step toward obtaining such information, it is helpful to extract the data on all variables for any subject exhibiting a seemingly extreme value on one or more of them. For example, to find out who the person with the exam score of 152 was, the corresponding record is extracted from the file in SPSS by using the following menu options/sequence (the variable Exam Score is named "exam_score" in the data file):

Data → Select cases → If condition "exam_score = 152" (check "delete unselected cases").

To accomplish the printing of apparently aberrant data records, the following command line would be added to the above SAS program:

   IF Exam_Score = 152 THEN LIST;

Consequently, each time a score of 152 is detected (in the present example, just once), SAS prints the current input data line in the SAS log file. When this activity is carried out and one takes a look at that person's scores on all variables, it is readily seen that apart from the screening results mentioned, his/her values on the remaining measures are unremarkable (i.e., lie within the variable-specific range for meaningful scores; in actual fact, reference to the original data record would reveal that this subject had an exam score of 52, and the value of 152 in the data file simply resulted from a typographical error).

After the data on all variables are examined for each subject with an anomalous value on at least one of them, the next question that needs to be addressed refers to the reason(s) for this data abnormality. As we have just seen, the latter may result from an incorrect data entry, in which case the value is simply corrected according to the original data record. Alternatively, the extreme score may have been due to a failure to declare a missing value code to the software, so that a data point is read by the computer program as a legitimate value while it is not. (Oftentimes, this may be the result of a too hasty move on to the data analysis phase, even a preliminary one, by a researcher skipping this declaration step.) Another possibility could be that the person(s) with an out-of-range value may actually not be a member of the population intended to be studied, but happened to be included in the investigation for some unrelated reasons. In this case, his/her entire data record would have to be deleted from the data set and the following analyses. Furthermore, and no less importantly, an apparently anomalous value may in fact be a legitimate value for a sample from a population where the distribution of the variable in question is highly skewed. Because of the potential impact such situations can have on data analysis results, these circumstances are addressed in greater detail in a later section of the chapter. We move next to a more formal discussion of extreme scores, which helps additionally in the process of handling abnormal data values.
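The record-extraction step described above can also be sketched outside SPSS and SAS. The following minimal Python example lists the complete record of every case with at least one out-of-range value. Only the exam score of 152 for subject #23 comes from the text; the other values, the variable names, and the assumed valid range of 0 to 100 are hypothetical and serve illustration only.

```python
# Hypothetical records: only subject 23's exam score of 152 comes from the text;
# all other values are made up for illustration.
records = [
    {"id": 22, "exam_score": 54, "aptitude": 30, "age": 18},
    {"id": 23, "exam_score": 152, "aptitude": 31, "age": 19},
    {"id": 24, "exam_score": 50, "aptitude": 28, "age": 18},
]

# Assumed range of meaningful scores (an assumption, not taken from the text).
VALID_RANGE = {"exam_score": (0, 100)}


def out_of_range_cases(rows, valid_range):
    """Return the full record of every case with at least one out-of-range value."""
    flagged = []
    for row in rows:
        for name, (low, high) in valid_range.items():
            if not low <= row[name] <= high:
                flagged.append(row)
                break  # one anomalous variable is enough to list the case
    return flagged


suspects = out_of_range_cases(records, VALID_RANGE)
for row in suspects:
    print(row)  # all variables of the suspect case, not just the anomalous one
```

Listing the whole record, rather than only the offending value, is what allows one to judge whether the remaining scores of that subject look unremarkable.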




3.2 Outliers and the Search for Them

As indicated in Section 3.1, the relevance of an examination for extreme observations, or so-called outliers, follows from the fact that these may exert very strong influence upon the results of ensuing analyses. An outlier is a case with (a) such an extreme value on a given variable, or (b) such an abnormal combination of values on several variables, that it may have a substantial impact on the outcomes of a data analysis and modeling session. In case (a), the observation is called a univariate outlier, while in case (b) it is referred to as a multivariate outlier. Whenever even a single outlier (whether univariate or multivariate) is present in a data set, results generated with and without that observation may be very different, leading to possibly incompatible substantive conclusions. For this reason, it is critically important to also consider some formal means that can be used to routinely search for outliers in a given data set.

3.2.1 Univariate Outliers

Univariate outliers are usually easier to spot than multivariate outliers. Typically, univariate outliers are to be sought among those observations with the following properties: (a) their z-scores are greater than 3 or smaller than −3; and (b) their z-scores are to some extent "disconnected" from the z-scores of the remaining observations. One of the easiest ways to search for univariate outliers is to use descriptive methods and/or graphical methods. The essence of using the descriptive methods is to check for individual observations with the properties (a) and (b) just mentioned. In contrast, graphical methods involve the use of various plots, including boxplots, stem-and-leaf plots, and normal probability (detrended) plots for studied variables.

Before we discuss this topic further, let us mention in passing that often with large samples (at least in the hundreds), there may occasionally be a few apparent extreme observations that need not necessarily be outliers. The reason is that large samples have a relatively high chance of including extreme cases that are legitimate members of the studied population and thus need not be removed from the ensuing analyses.

To illustrate, consider the earlier study of university freshmen on the relationship between success in an educational program, aptitude, age, intelligence, and attention span (see data file ch3ex1.dat available from www.psypress.com/applied-multivariate-analysis). To search for univariate outliers, we first obtain the z-scores for all variables. This is readily achieved with SPSS using the following menu options/sequence:

Analyze → Descriptive statistics → Descriptives (check "save standardized values").

With SAS, the following PROC STANDARD command lines could be used:




   DATA CHAPTER3;
      INFILE 'ch3ex1.dat';
      INPUT id Exam_Score Aptitude_Measure Age_in_Years Intelligence_Score Attention_Span;
      zscore = Exam_Score;          /* copy to be standardized */
   PROC STANDARD mean=0 std=1 out=newscore;
      var zscore;
   RUN;
   PROC print data=newscore;
      var Exam_Score zscore;
      title 'Standardized Exam Scores';
   RUN;

In these SAS statements, PROC STANDARD standardizes the specified variable from the data set (for our illustrative purposes, in this example only the variable exam_score was selected), using a mean of 0 and standard deviation of 1, and then creates a new SAS data set (defined here as the outfile "newscore") that contains the resulting standardized values. The PROC PRINT statement subsequently prints the original values alongside the standardized values for each individual on the named variables.

As a result of these software activities, SPSS and SAS generate an extended data file containing both the original variables plus a "copy" of each one of them, which consists of all subjects' z-scores; to save space, we only provide next the output generated by the above SAS statements (in which the variable "Exam Score" was selected for standardization).

Standardized Test Scores

Obs   Exam_Score     zscore
  1        51       -0.40936
  2        53       -0.28531
  3        50       -0.47139
  4        63        0.33493
  5        65        0.45898
  6        53       -0.28531
  7        52       -0.34734
  8        50       -0.47139
  9        57       -0.03721
 10        54       -0.22329
 11        65        0.45898
 12        50       -0.47139
 13        52       -0.34734
 14        63        0.33493
 15        52       -0.34734
 16        52       -0.34734
 17        51       -0.40936
 18        52       -0.34734
 19        55       -0.16126
 20        55       -0.16126
 21        53       -0.28531
 22        54       -0.22329
 23       152        5.85513
 24        50       -0.47139
 25        63        0.33493
 26        57       -0.03721
 27        52       -0.34734
 28        62        0.27291
 29        52       -0.34734
 30        55       -0.16126
 31        54       -0.22329
 32        56       -0.09924
 33        52       -0.34734
 34        53       -0.28531
 35        64        0.39696
 36        57       -0.03721
 37        50       -0.47139
 38        51       -0.40936
 39        53       -0.28531
 40        69        0.70708

Looking through the column labeled "zscore" in the last output table (and in general each of the columns generated for the remaining variables under consideration), we try to spot the z-scores that are larger than 3 or smaller than −3 and at the same time "stick out" of the remaining values in that column. (With a larger data set, it is also helpful to request the descriptive statistics for each variable along with their corresponding z-scores, and then look for any extreme values.) In this illustrative example, subject #23 clearly has a very large z-score relative to the rest of the observations on exam score (viz. larger than 5, although as discussed above this was clearly a data entry error). If we similarly examined the z-scores on the other variables (not tabled above), we would observe no apparent univariate outliers with respect to the variables Intelligence and Attention Span; however, we would find out that subject #40 had a large z-score on the Aptitude measure (z-score = 5.82), like subject #8 on age (z-score = 4.01). Once possible univariate outliers are located in a data set, the next step is to search for the presence of multivariate outliers. We stress that it may be premature to decide on deleting a univariate outlier before an examination for multivariate outliers is conducted.
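The descriptive screening rule used in this section can be sketched in a few lines of Python as a cross-check on the SAS output. The exam scores are those tabled above; the function name and the decision to flag only cases with an absolute z-score above 3 are illustrative choices, and the sample standard deviation (n − 1 in the denominator) is used because it reproduces the tabled z-scores.

```python
from statistics import mean, stdev

# Exam scores of the 40 subjects, in subject order (from the output table).
exam_scores = [51, 53, 50, 63, 65, 53, 52, 50, 57, 54,
               65, 50, 52, 63, 52, 52, 51, 52, 55, 55,
               53, 54, 152, 50, 63, 57, 52, 62, 52, 55,
               54, 56, 52, 53, 64, 57, 50, 51, 53, 69]


def z_scores(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    m = mean(values)
    s = stdev(values)  # sample SD, n - 1 in the denominator
    return [(v - m) / s for v in values]


z = z_scores(exam_scores)

# Property (a): z-score above 3 or below -3; subject numbers are 1-based.
flagged = [(subject, round(value, 3))
           for subject, value in enumerate(z, start=1)
           if abs(value) > 3]
print(flagged)
```

Property (b), that a z-score is "disconnected" from the rest, still calls for looking at the sorted values or a plot; the sketch only automates property (a).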

3.2.2 Multivariate Outliers

Searching for multivariate outliers is considerably more difficult to carry out than examination for univariate outliers. As mentioned in the




preceding section, a multivariate outlier is an observation with values on several variables that are not necessarily abnormal when each variable is considered separately, but are unusual in their combination. For example, in a study concerning income of college students, someone who reports an income of $100,000 per year is not an unusual observation per se. Similarly, someone who reports that they are 16 years of age would not be considered an unusual observation. However, a case with these two measures in combination is likely to be highly unusual, that is, a possible multivariate outlier (Tabachnick & Fidell, 2007). This example shows the necessity of utilizing such formal means when searching for multivariate outliers, which capitalize in an appropriate way on the individual variable values for each subject and at the same time also take into consideration their interrelationships. A very useful statistic in this regard is the Mahalanobis distance (MD) that we have discussed in Chapter 2. As indicated there, in an empirical setting, the MD represents the distance of a subject’s data to the centroid (mean) of all cases in an available sample, that is, to the point in the multivariate space, which has as coordinates the means of all observed variables. That the MD is so instrumental in searching for multivariate outliers should actually not be unexpected, considering the earlier mentioned fact that it is the multivariate analog of univariate distance, as reflected in the z-score (see pertinent discussion in Chapter 2). As mentioned earlier, the MD is also frequently referred to as statistical distance since it takes into account the variances and covariances for all pairs of studied variables. In particular, from two variables with different variances, the one with larger variability will contribute less to the MD; further, two highly correlated variables will contribute less to the MD than two nearly uncorrelated ones. 
The reason is that the inverse of the empirical covariance matrix participates in the MD, and in effect assigns in this way weights of "importance" to the contribution of each variable to the MD. In addition to being closely related to the concept of univariate distance, it can be shown that with multinormal data on a given set of variables and a large sample, the Mahalanobis distance follows approximately a chi-square distribution with degrees of freedom being the number of these variables (with this approximation becoming much better with larger samples) (Johnson & Wichern, 2002). This characteristic of the MD helps considerably in the search for multivariate outliers. Indeed, given this distributional property, one may consider an observation as a possible multivariate outlier if its MD is larger than the critical point (generally specified at a conservative recommended significance level of α = .001) of the chi-square distribution with degrees of freedom being the number of variables participating in the MD. We note that the MDs for different observations are not unrelated to one another, as can be seen from their formal definition in Chapter 2. This suggests the need for some caution




when using the MD in searching for multivariate outliers, especially with samples that cannot be considered large. We already discussed in Chapter 2 a straightforward way of computing the MD for any particular observation from a data set. Using it to examine a data set for multivariate outliers, however, can be a very tedious and time-consuming activity, especially with large data sets. Instead, one can use alternative approaches that are readily applied with statistical software. Specifically, in the case of SPSS, one can simply regress a variable of no interest (e.g., subject ID, or case number) upon all variables participating in the MD; requesting thereby the MD for each subject yields as a by-product this distance for all observations (Tabachnick & Fidell, 2007). We stress that the results of this multiple regression analysis are of no interest and value per se, apart from providing, of course, each individual's MD.

As an example, consider the earlier study of university freshmen on their success in an educational program in relation to their aptitude, age, intelligence, and attention span. (See data file ch3ex1.dat available from www.psypress.com/applied-multivariate-analysis.) To obtain the MD for each subject, we use in SPSS the following menu options/sequence:

Analyze → Regression → Linear → (ID as DV; all others as IVs) → Save "Mahalanobis Distance"

At the end of this analysis, a new variable is added by the software to the original data file, named MAH_1, which contains the MD values for each subject. (We note in passing that a number of SPSS macros have also been proposed in the literature for the same purposes, which are readily available; De Carlo, 1997.) In order to accomplish the same goal with SAS, several options exist.
One of them is provided by the following PROC IML program:

   title 'Mahalanobis Distance Values';
   DATA CHAPTER3;
      INFILE 'ch3ex1.dat';
      INPUT id $ y1 y2 y3 y4 y5;
   %let id=id;                   /* THE %let IS A MACRO STATEMENT */
   %let var=y1 y2 y3 y4 y5;      /* DEFINES A VARIABLE */
   PROC iml;
   start dsquare;
      use _last_;
      read all var {&var} into y [colname=vars rowname=&id];
      n = nrow(y);
      p = ncol(y);
      r1 = &id;
      mean = y[:,];
      d = y - j(n,1)*mean;
      s = d`*d/(n-1);
      dsq = vecdiag(d*inv(s)*d`);
      r = rank(dsq);             /* ranks the values of dsq */
      val = dsq; dsq[r,] = val;
      val = r1; &id[r] = val;
      result = dsq;
      cl = {'dsq'};
      create dsquare from result [colname=cl rowname=&id];
      append from result [rowname=&id];
   finish;
   print dsquare;
   run dsquare;
   quit;
   PROC print data=dsquare;
      var id dsq;
   run;

The following output results would be obtained by submitting this command file to SAS (since the resulting output from SPSS would lead to the same individual MDs, we only provide next those generated by SAS); the column headings "ID" and "dsq" below correspond to subject ID number and MD, respectively. (Note that the observations are rank ordered according to their MD rather than their identification number.)

Mahalanobis Distance Values

Obs    ID        dsq
  1     6      0.0992
  2    34      0.1810
  3    33      0.4039
  4     3      0.4764
  5    36      0.6769
  6    25      0.7401
  7    16      0.7651
  8    38      0.8257
  9    22      0.8821
 10    32      1.0610
 11    27      1.0714
 12    21      1.1987
 13     7      1.5199
 14    14      1.5487
 15     1      1.6823
 16     2      2.0967
 17    30      2.2345
 18    28      2.5811
 19    18      2.7049
 20    13      2.8883
 21    10      2.9170
 22    31      2.9884
 23     5      3.0018
 24    29      3.0367
 25     9      3.1060
 26    19      3.1308
 27    35      3.1815
 28    12      3.6398
 29    26      3.6548
 30    15      3.8936
 31     4      4.1176
 32    17      4.4722
 33    39      4.5406
 34    24      4.7062
 35    20      5.1592
 36    37     13.0175
 37    11     13.8536
 38     8     17.1867
 39    40     34.0070
 40    23     35.7510
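For readers who want to see the arithmetic behind such values spelled out, the following Python sketch mirrors the main steps of the IML program (centering, sample covariance matrix, quadratic form) for the special case of two variables, where the matrix inverse has a simple closed form. The data are hypothetical and chosen so that the last case is unremarkable on each variable separately but unusual in combination; the function name is likewise an illustrative choice, and the quantity computed is the one the IML program labels dsq.

```python
def mahalanobis_sq_2d(x, y):
    """Distance (dsq) of each (x, y) case to the sample centroid, two variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    # Sample variances and covariance (n - 1 denominator, as in the IML program).
    sxx = sum(a * a for a in dx) / (n - 1)
    syy = sum(b * b for b in dy) / (n - 1)
    sxy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Quadratic form d' S^{-1} d, using the closed-form inverse of a 2 x 2 matrix.
    return [(syy * a * a - 2 * sxy * a * b + sxx * b * b) / det
            for a, b in zip(dx, dy)]


# Hypothetical data: cases 1-8 follow a strong positive relationship,
# case 9 combines a low x with a high y.
x = [2, 3, 4, 5, 6, 7, 8, 9, 3]
y = [2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 7.9, 9.1, 8.5]

dsq = mahalanobis_sq_2d(x, y)

# Rank order the cases by distance, as in the SAS output.
ranked = sorted(enumerate(dsq, start=1), key=lambda pair: pair[1])
for case_id, value in ranked:
    print(f"{case_id:>3} {value:8.4f}")
```

Because the same sample covariance matrix is used for all cases, the distances are tied together: their sum equals (n − 1)p, which is one way to see why the MDs of different observations are not unrelated to one another.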

Mahalanobis distance measures can also be obtained in SAS by using the procedure PROC PRINCOMP along with the STD option. (These are based on computing the uncorrected sum of squared principal component scores within each output observation; see pertinent discussion in Chapters 1 and 7.) Accordingly, the following SAS program would generate the same MD values as displayed above (but ordered by subject ID instead):

   PROC PRINCOMP std out=scores noprint;
      var Exam_Score Aptitude_Measure Age_in_Years Intelligence_Score Attention_Span;
   RUN;
   DATA mahdist;
      set scores;
      md = (uss(of prin1-prin5));
   RUN;
   PROC PRINT;
      var md;
   RUN;

Yet another option available in SAS is to use the multiple regression procedure PROC REG and, similarly to the approach utilized with SPSS




above, regress a variable of no interest (e.g., subject ID) upon all variables participating in the MD. The information of relevance to this discussion is obtained using the INFLUENCE statistics option as illustrated in the next program code.

   PROC REG;
      model id = Exam_Score Aptitude_Measure Age_in_Years Intelligence_Score Attention_Span / INFLUENCE;
   RUN;

This INFLUENCE option approach within PROC REG does not directly provide the values of the MD but a closely related individual statistic called leverage, commonly denoted by $h_i$ and labeled in the SAS output as HAT DIAG H (for further details, see Belsley, Kuh, & Welsch, 1980). However, the leverage statistic can easily be used to determine MD values for each observation in a considered data set. In particular, it has been shown that MD and leverage are related (in the case under consideration) as follows:

$$\mathrm{MD} = (n - 1)\left(h_i - \frac{1}{n}\right), \tag{3.1}$$

where $n$ denotes sample size and $h_i$ is the leverage associated with the $i$th subject ($i = 1, \ldots, n$) (Belsley et al., 1980). Note from Equation 3.1 that MD is an increasing linear function of leverage: as MD grows (decreases), so does leverage. The output resulting from submitting these PROC REG command lines to SAS is given below:

The SAS System
The REG Procedure
Model: MODEL1
Dependent Variable: id
Output Statistics

Obs   Residual   RStudent   Hat Diag H   Cov Ratio    DFFITS
  1    14.9860     1.5872       0.0681      0.8256    0.4292
  2    14.9323     1.5908       0.0788      0.8334    0.4652
  3    18.3368     1.9442       0.0372      0.6481    0.3822
  4     9.9411     1.0687       0.1306      1.1218    0.4142
  5    13.7132     1.4721       0.1020      0.9094    0.4961
  6    14.5586     1.5039       0.0275      0.8264    0.2531
  7     6.2042     0.6358       0.0640      1.1879    0.1662
  8     2.2869     0.3088       0.4657      2.2003    0.2882
  9     7.0221     0.7373       0.1046      1.2112    0.2521
 10     7.2634     0.7610       0.0998      1.1971    0.2534
 11     3.0439     0.3819       0.3802      1.8796    0.2991
 12     4.5687     0.4812       0.1183      1.3010    0.1763
 13     4.0729     0.4240       0.0991      1.2851    0.1406
 14     9.3569     0.9669       0.0647      1.0816    0.2543
 15     1.8641     0.1965       0.1248      1.3572    0.0742
 16     4.7932     0.4850       0.0446      1.1998    0.1048
 17     0.5673     0.0603       0.1397      1.3894    0.0243
 18     0.9985     0.1034       0.0944      1.3182    0.0334
 19    11.8243     1.2612       0.1053      1.0079    0.4326
 20     2.4913     0.2677       0.1573      1.4011    0.1157
 21     6.9400     0.7092       0.0557      1.1569    0.1723
 22     4.7030     0.4765       0.0476      1.2053    0.1066
 23     0.2974     0.1214       0.9417     20.4599    0.4880
 24     8.1462     0.8786       0.1457      1.2187    0.3628
 25     1.8029     0.1818       0.0440      1.2437    0.0390
 26    10.6511     1.1399       0.1187      1.0766    0.4184
 27    12.1511     1.2594       0.0525      0.9525    0.2964
 28     7.7030     0.8040       0.0912      1.1715    0.2547
 29     0.6869     0.0715       0.1029      1.3321    0.0242
 30     6.6844     0.6926       0.0823      1.1953    0.2074
 31     2.0881     0.2173       0.1016      1.3201    0.0731
 32     9.5648     0.9822       0.0522      1.0617    0.2305
 33    12.4692     1.2819       0.0354      0.9264    0.2454
 34    14.2581     1.4725       0.0296      0.8415    0.2574
 35     4.2887     0.4485       0.1066      1.2909    0.1549
 36    15.9407     1.6719       0.0424      0.7669    0.3516
 37     4.0544     0.5009       0.3588      1.7826    0.3746
 38    19.1304     2.0495       0.0462      0.6111    0.4509
 39     6.8041     0.7294       0.1414      1.2657    0.2961
 40     0.4596     0.1411       0.8970     11.5683    0.4165
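Applying Equation 3.1 to the Hat Diag H column is a one-line computation, sketched below in Python for a handful of subjects. The leverage values are copied from the output above; the constant 20.515 is the tabled critical value of the chi-square distribution with 5 degrees of freedom at a significance level of .001, corresponding to the criterion described earlier for the five variables analyzed here. The function and variable names are illustrative.

```python
n = 40              # sample size
CHI2_CRIT = 20.515  # chi-square critical value, df = 5, alpha = .001

# Hat Diag H values copied from the PROC REG output for selected subjects.
leverage = {1: 0.0681, 8: 0.4657, 23: 0.9417, 37: 0.3588, 40: 0.8970}


def md_from_leverage(h, n):
    """Equation 3.1: MD = (n - 1)(h - 1/n)."""
    return (n - 1) * (h - 1 / n)


md = {subject: md_from_leverage(h, n) for subject, h in leverage.items()}

for subject, value in md.items():
    print(f"subject {subject:>2}: leverage = {leverage[subject]:.4f}, MD = {value:.3f}")

# Cases whose MD exceeds the chi-square critical value.
possible_outliers = [subject for subject, value in md.items() if value > CHI2_CRIT]
print(possible_outliers)
```

Up to rounding of the four-decimal leverage values, the converted distances agree with those in the Mahalanobis Distance Values output.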

As can be readily seen, using Equation 3.1 with, say, the obtained leverage value of 0.0681 for subject #1 in the original data file, his/her MD is computed as

$$\mathrm{MD} = (40 - 1)\left(0.0681 - \frac{1}{40}\right) = 1.681, \tag{3.2}$$

which corresponds to his or her MD value in the previously presented output. By inspection of the last displayed output section, it is readily found that subjects #23 and #40 have notably large MD values (above 30) that may fulfill the above-indicated criterion of being possible multivariate outliers. Indeed, since we have analyzed simultaneously p = 5 variables, we are dealing with 5 degrees of freedom for this evaluation, and at




a significance level of α = .001, the corresponding chi-square cutoff is 20.515, which is exceeded by the MD of these two cases. Alternatively, requesting extraction from the data file of all subjects' records for whom their MD value is larger than 20.515 (see preceding section) would yield only these two subjects with values beyond this cutoff, which can thus be potentially considered as multivariate outliers.

With respect to examining leverage values, we note in passing that they range from 0 to 1 with (p + 1)/n being their average (in this empirical example, 0.15). Rules of thumb concerning high values of leverage have also been suggested in the literature, whereby in general observations with leverage greater than a certain cutoff may be considered multivariate outliers (Fung, 1993; Huber, 1981). These cutoffs are based on the above-indicated MD cutoff at a specified significance level α (denoted $\mathrm{MD}_\alpha$). Specifically, the leverage cutoffs are

$$h_{\text{cutoff}} = \frac{\mathrm{MD}_\alpha}{n - 1} + \frac{1}{n}, \tag{3.3}$$

which yields 20.515/39 + 1/40 = .551 for the currently considered example. With the use of Equation 3.3, if one were to utilize the output generated by PROC REG, there is no need to convert the reported leverage values to MD in order to determine the observations that may be considered multivariate outliers. In this way, it can be readily seen that only subjects #23 and #40 could be suggested as multivariate outliers.

Using diagnostic measures to identify an observation as a possible multivariate outlier depends on a potentially rather complicated correlational structure among a set of studied variables. It is therefore quite possible that some observations may have a masking effect upon others. That is, one or more subjects may appear to be possible multivariate outliers, yet if one were to delete them, other observations might emerge then as such. In other words, the former group of observations, while being in the data file, could mask the latter ones that, thus, could not be sensed at an initial inspection as possible outliers. For this reason, if one eventually decides to delete outliers masked by previously removed ones, ensuing analysis findings must be treated with great caution since there is a potential that the latter may have resulted from capitalization on chance fluctuations in the available sample.

3.2.3 Handling Outliers: A Revisit

Multivariate outliers may often be found among those that are univariate outliers, but there may also be cases that do not have extreme values on separately considered variables (one at a time). Either way, once an




observation is deemed to be a possible outlier, a decision needs to be made with respect to handling it. To this end, first one should try to use all available information, or information that it is possible to obtain, to determine what reason(s) may have led to the observation appearing as an outlier. Coding or typographical errors, instrument malfunction or incorrect instructions during its administration, or being a member of another population that is not of interest are often sufficient grounds to correspondingly correct or consider removing the particular observation(s) from further analyses. Second, when there is no such relatively easily found reason, it is important to assess to what degree the observation(s) in question may be reflecting legitimate variability in the studied population. If the latter is the case, instead of subject removal variable transformations may be worth considering, a topic that is discussed later in this chapter. There is a growing literature on robust statistics that deals with methods aimed at down-weighting the contribution of potential outliers to the results of statistical analyses (Wilcox, 2003). Unfortunately, at present there are still no widely available and easily applicable multivariate robust statistical methods. For this reason, we only mention here this direction of current methodological developments that is likely to contribute in the future readily used procedures for differential weighting of observations in multivariate analyses. These procedures will also be worth considering in empirical settings with potential outliers. When one or more possible outliers are identified, it should be borne in mind that any one of these may unduly influence the ensuing statistical analysis results, but need not do so. In particular, an outlier may or may not be an influential observation in this sense. 
The degree to which it is influential is reflected in what are referred to as influence statistics and related quantities (such as the leverage value discussed earlier) (Pedhazur, 1997). These statistics have been developed within a regression analysis framework and made easily available in most statistical software. In fact, it is possible that keeping one or more outliers in the subsequent analyses will not change their results appreciably, and especially their substantive interpretations. In such a case, the decision regarding whether to keep them in the analysis or not does not have a real impact upon the final conclusions. Alternatively, if the results and their interpretation depend on whether the outliers are retained in the analyses, while a clear-cut decision for removal versus no removal cannot be reached, it is important to provide the results and interpretations in both cases. For the case where the outlier is removed, it is also necessary that one explicitly reports the characteristics of the deleted outlier(s), and then restricts the final substantive conclusions to a population that does not contain members with the outliers' values on the studied variables. For example, if one has good reasons to exclude the subject with ID = 8 from




the above study of university freshmen, who was 15 years old, one should also explicitly state in the substantive result interpretations of the following statistical analyses that they do not necessarily generalize to subjects in their mid-teens.
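Reporting results with and without a questionable case, as recommended in this section, can be sketched as follows in Python. The exam scores are those tabled in Section 3.2.1, and subject #23 serves purely as a convenient illustration (in the actual example this value was a typographical error and would simply be corrected); the choice of the mean and standard deviation as the "results" and the function name are assumptions made for brevity.

```python
from statistics import mean, stdev

# Exam scores of the 40 subjects, in subject order.
exam_scores = [51, 53, 50, 63, 65, 53, 52, 50, 57, 54,
               65, 50, 52, 63, 52, 52, 51, 52, 55, 55,
               53, 54, 152, 50, 63, 57, 52, 62, 52, 55,
               54, 56, 52, 53, 64, 57, 50, 51, 53, 69]

suspect_index = 22  # subject #23 (0-based position in the list)


def summarize(values):
    """Sample size, mean, and sample standard deviation, rounded for reporting."""
    return {"n": len(values),
            "mean": round(mean(values), 2),
            "sd": round(stdev(values), 2)}


with_case = summarize(exam_scores)
without_case = summarize(exam_scores[:suspect_index] + exam_scores[suspect_index + 1:])

print("with case   :", with_case)
print("without case:", without_case)
```

Here a single case changes the standard deviation by a factor of about three, which is the kind of dependence that calls for presenting both sets of results when no clear-cut removal decision can be reached.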

3.3 Checking of Variable Distribution Assumptions

The multivariate statistical methods we consider in this text are based on the assumption of multivariate normality for the dependent variables. Although this assumption is not used for parameter estimation purposes, it is needed when statistical tests and inference are performed. Multivariate normality (MVN) holds when and only when any linear combination of the individual variables involved is univariate normal (Roussas, 1997). Hence, testing for multivariate normality per se is not practically possible, since it involves infinitely many tests. However, there are several implications of MVN that can be empirically tested. These represent necessary conditions, rather than sufficient conditions, for multivariate normality. That is, these are implied by MVN, but none of these conditions by itself or in combination with any other condition(s) entails multivariate normality. In particular, if a set of p variables is multivariate normally distributed, then each of them is univariate normal (p > 1). In addition, any pair or subset of k variables from that set is bivariate or k-dimensional normal, respectively (2 < k < p). Further, at any given value for a single variable (or values for a subset of k variables), the remaining variables are jointly multivariate normal, and their variability does not depend on that value (or values, 2 < k < p); moreover, the relationship of any of these variables, and a subset of the remaining ones that are not fixed, is linear.

To examine univariate normality, two distributional indices can be judged: skewness and kurtosis. These are closely related to the third and fourth moments of the underlying variable distribution, respectively. The skewness characterizes the symmetry of the distribution. A univariate normally distributed variable has a skewness index that is equal to zero. Deviations from this value on the positive or negative side indicate asymmetry.
The kurtosis characterizes the shape of the distribution in terms of whether it is peaked or flat relative to a corresponding normal distribution (with the same mean and variance). A univariate normally distributed variable has a kurtosis that is (effectively) equal to zero, whereby positive values are indicative of a leptokurtic distribution and negative values of a platykurtic distribution. Two statistical tests for evaluating univariate normality are also usually considered, the Kolmogorov–Smirnov Test




and the Shapiro–Wilk Test. If the sample size cannot be considered large, the Shapiro–Wilk Test may be preferred, whereas if the sample size is large the Kolmogorov–Smirnov Test is highly trustworthy. In general terms, both tests consider the following null hypothesis H0: "The sampled data have been drawn from a normally distributed population." Rejection of this hypothesis at some prespecified significance level is suggestive of the data not coming from a population where the variable in question is normally distributed.

To examine multivariate normality, two analogous measures of skewness and kurtosis, called Mardia's skewness and kurtosis, have been developed (Mardia, 1970). In cases where the data are multivariate normal, the skewness coefficient is zero and the kurtosis is equal to p(p + 2); for example, in case of bivariate normality, Mardia's skewness is 0 and kurtosis is 8. Consequently, similar to evaluating their univariate counterparts, if the distribution is, say, leptokurtic, Mardia's measure of kurtosis will be comparatively large, whereas if it is platykurtic, the coefficient will be small. Mardia (1970) also showed that these two measures of multivariate normality can be statistically evaluated. Although most statistical analysis programs readily provide output of univariate skewness and kurtosis (see examples and discussion in Section 3.4), multivariate measures are not as yet commonly evaluated by software. For example, in order to obtain Mardia's coefficients with SAS, one could use the macro called %MULTNORM. Similarly, with SPSS, the macro developed by De Carlo (1997) could be utilized. Alternatively, structural equation modeling software may be employed for this purpose (Bentler, 2004; Jöreskog & Sörbom, 1996).

In addition to examining normality by means of the above-mentioned statistical tests, it can also be assessed by using some informal methods.
In case of univariate normality, the so-called normal probability plot (often also referred to as Q–Q plot) or the detrended normal probability plot can be considered. The normal probability plot is a graphical representation in which each observation is plotted against a corresponding theoretical normal distribution value such that the points fall along a diagonal straight line in case of normality. Departures from the straight line indicate violations of the normality assumption. The detrended probability plot is similar, with deviations from that diagonal line effectively plotted horizontally. If the data are normally distributed, the observations will be basically evenly distributed above and below the horizontal line in the latter plot (see illustrations considered in Section 3.4). Another method that can be used to examine multivariate normality is to create a graph that plots the MD for each observation against its ordered chi-square percentile value (see earlier in the chapter). If the data are multivariate normal, the plotted values should be close to a straight line, whereas points that fall far from the line may be multivariate




outliers (Marcoulides & Hershberger, 1997). For example, the following PROC IML program could be used to generate such a plot:

   TITLE 'Chi-Square Plot';
   DATA CHAPTER3;
      INFILE 'ch3ex1.dat';
      INPUT id $ y1 y2 y3 y4 y5;
   %let id=id;
   %let var=y1 y2 y3 y4 y5;
   PROC iml;
   start dsquare;
      use _last_;
      read all var {&var} into y [colname=vars rowname=&id];
      n = nrow(y);
      p = ncol(y);
      r1 = &id;
      mean = y[:,];
      d = y - j(n,1)*mean;
      s = d`*d/(n-1);
      dsq = vecdiag(d*inv(s)*d`);
      r = rank(dsq);
      val = dsq; dsq[r,] = val;
      val = r1; &id[r] = val;
      z = ((1:n)` - .5)/n;
      chisq = 2*gaminv(z, p/2);
      result = dsq||chisq;
      cl = {'dsq' 'chisq'};
      create dsquare from result [colname=cl rowname=&id];
      append from result [rowname=&id];
   finish;
   print dsquare;   /* THIS COMMAND IS ONLY NEEDED IF YOU WISH TO PRINT THE MD */
   RUN dsquare;
   quit;
   PROC print data=dsquare;
      var id dsq chisq;
   RUN;
   PROC gplot data=dsquare;
      plot chisq*dsq;
   RUN;

This command file is quite similar to that presented earlier in Section 3.2.2, with the only difference being that now, in addition to the MD values, ordered chi-square percentile values are computed. Submitting this PROC IML program to SAS for the last considered data set generates the



FIGURE 3.1 Chi-square plot for assessing multivariate normality (vertical axis: Chi sq, 0 to 15; horizontal axis: dsq, 0 to 17).

above multivariate probability plot (if first removing the data lines for subjects #23 and #40 suggested previously as multivariate outliers). An examination of Figure 3.1 reveals that the plotted values are reasonably close to a diagonal straight line, indicating that the data do not deviate considerably from normality (keeping in mind, of course, the relatively small sample size used for this illustration). The discussion in this section suggests that examination of MVN is a difficult yet important topic that has been widely discussed in the literature, and there are a number of excellent and accessible treatments of it (Mardia, 1970; Johnson & Wichern, 2002). In conclusion, we mention that most MVS methods that we deal with in this text can tolerate minor nonnormality (i.e., their results can be viewed also then as trustworthy). However, in empirical applications it is important to consider all the issues discussed in this section, so that a researcher becomes aware of the degree to which the normality assumption may be violated in an analyzed data set.
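The coordinates of a chi-square plot can be sketched in Python for the special case of two variables, where the chi-square quantile has the closed form −2 ln(1 − q) and no gamma-inverse routine is needed. The plotting positions (i − .5)/n are the same as in the IML program of this section; the data are hypothetical and the function names are illustrative.

```python
import math


def mahalanobis_sq_2d(x, y):
    """Distance (dsq) of each (x, y) case to the sample centroid, two variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    sxx = sum(a * a for a in dx) / (n - 1)
    syy = sum(b * b for b in dy) / (n - 1)
    sxy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
    det = sxx * syy - sxy ** 2
    return [(syy * a * a - 2 * sxy * a * b + sxx * b * b) / det
            for a, b in zip(dx, dy)]


def chi_square_plot_points(x, y):
    """Pair the ordered distances with chi-square quantiles for df = 2."""
    dsq = sorted(mahalanobis_sq_2d(x, y))
    n = len(dsq)
    points = []
    for i, d in enumerate(dsq, start=1):
        prob = (i - 0.5) / n                    # plotting position, as in the IML program
        quantile = -2.0 * math.log(1.0 - prob)  # chi-square quantile, valid for df = 2 only
        points.append((d, quantile))
    return points


# Hypothetical bivariate data (not taken from ch3ex1.dat).
x = [2, 3, 4, 5, 6, 7, 8, 9, 3]
y = [2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 7.9, 9.1, 8.5]

points = chi_square_plot_points(x, y)
for d, q in points:
    print(f"dsq = {d:7.4f}   chisq = {q:7.4f}")
```

Plotting the second coordinate against the first gives the analog of Figure 3.1; with more than two variables the quantile step requires a general chi-square (gamma) inverse, as provided by gaminv in the SAS program.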

3.4 Variable Transformations

When data are found to be decidedly nonnormal, in particular on a given variable, it may be possible to transform that variable to be closer to normally distributed, whereupon the set of variables under consideration would likely better comply with the multivariate normality assumption. (There is no guarantee of multinormality as a result of the transformation, however, as indicated in Section 3.3.) In this section, we discuss a class of transformations that can be used to deal with the lack of symmetry of individual variables, an important aspect of deviation from the normal distribution, which as is well known is symmetric. As often happens, dealing with this aspect of normality deviation may also improve variable kurtosis and make it closer to that of the normal distribution.

Before we begin, however, let us emphasize that asymmetry or skewness as well as excessive kurtosis (and consequently nonnormality in general) may be primarily the result of outliers being present in a given data set. Hence, before considering any particular transformation, it is recommended that one first examine the data for potential outliers. In the remainder of this section, we assume that the latter issue has already been handled.

We start with relatively weak transformations that are usually applicable with mild asymmetry (skewness) and gradually move on to stronger transformations that may be used on distributions with considerably longer and heavier tails. If the observed skewness is positive and not very pronounced, chances are that the square root transformation, $Y' = \sqrt{Y}$, where $Y$ is the original variable, will lead to a transformed measure $Y'$ with a distribution that is considerably closer to the normal (assuming that all Y scores are positive). With SPSS, to obtain the square-rooted variable $Y'$, we use Transform → Compute, and then enter in the small left- and right-opened windows correspondingly SQRT_Y=SQRT(Y), where Y is the original variable. In the syntax mode of SPSS, this is equivalent to the command

COMPUTE SQRT_Y=SQRT(Y).

(which as mentioned earlier may be abbreviated to COMP SQRT_Y=SQRT(Y).)

With SAS, this can be accomplished by inserting the following general format data-modifying statement immediately after the INPUT statement (but before any PROC statement is invoked):

New-Variable-Name=Formula-Specifying-Manipulation-of-an-Existing-Variable

For example, the following SAS statement could be used in this way for the square root transformation:

SQRT_Y=SQRT(Y);

which is obviously quite similar to the above syntax with SPSS.




If for some subjects Y < 0, a square root cannot be taken, so we first add the absolute value of the smallest of them to all scores, and then proceed with the following SPSS syntax mode command that is to be executed in the same manner as above:

COMP SQRT_Y=SQRT(Y + |MIN(Y)|).

where |MIN(Y)| denotes the absolute value of the smallest negative Y score, which may have been obtained beforehand, for example, with the descriptives procedure (see discussion earlier in the chapter). With SAS, the same operation could be accomplished using the command:

SQRT_Y=SQRT(Y + ABS(min(Y)));

where ABS(min(Y)) is the absolute value of the smallest negative Y score (which can either be obtained directly or furnished beforehand, as mentioned above).

For variables with more pronounced positive skewness, the stronger logarithmic transformation may be more appropriate. The notion of a "stronger" transformation is used in this section to refer to a transformation with a more pronounced effect upon a variable under consideration. In the presently considered setting, such a transformation would reduce variable skewness more notably; see below. The logarithmic transformation can be carried out with SPSS using the command:

COMP LN_Y=LN(Y).

or with SAS employing the command:

LN_Y=log(Y);

assuming all Y scores are positive, since otherwise the logarithm is not defined. If for some cases Y = 0 (and for none Y < 0 holds), we first add 1 to Y and then take the logarithm, which can be accomplished in SPSS and SAS using respectively the following commands:

COMP LN_Y=LN(Y + 1).
LN_Y=log(Y + 1);

If for some subjects Y < 0, we first add 1 + |MIN(Y)| to all scores, and then take the logarithm (as indicated above).

A yet stronger transformation is the inverse, which is more effective on distributions with larger skewness, for which the logarithm does not render them close to normality.
This transformation is obtained using either of the following SPSS or SAS commands, respectively:

COMP INV_Y=1/Y.
INV_Y=1/Y;

in cases where there are no zero scores. Alternatively, if for some cases Y = 0, we first add 1 to Y before taking the inverse:

COMPUTE INV_Y=1/(Y + 1).

or

INV_Y=1/(Y + 1);

(If there are zero and negative scores in the data, we first add to all scores 1 plus the absolute value of their minimum, and then proceed as in the last two equations.)

An even stronger transformation is the inverse squared, which under the assumption of no zero scores in the data can be obtained using the commands:

COMPUTE INVSQ_Y=1/(Y**2).

or

INVSQ_Y=1/(Y**2);

If there are some cases with negative scores, or zero scores, first add the constant 1 plus the absolute value of their minimum to all subjects' data, and then proceed with this transformation.

When a variable is negatively skewed (i.e., its left tail is longer than its right one), then one needs to first "reflect" the distribution before conducting any further transformations. Such a reflection of the distribution can be accomplished by subtracting each original score from 1 plus their maximum, as illustrated in the following SPSS statement:

COMPUTE Y_NEW=MAX(Y) + 1 - Y.

where MAX(Y) is the highest score in the sample, which may have been obtained beforehand (e.g., with the descriptives procedure). With SAS, this operation is accomplished using the command:

Y_NEW=max(Y) + 1 - Y;

where max(Y) returns the largest value of Y (obtained directly, or using instead that value furnished beforehand via examination of variable descriptive statistics). Once reflected in this way, the variable in question is positively skewed and all above discussion concerning transformations is then applicable.

In an empirical study, it is possible that a weaker transformation does not render a distribution close to normality, for example, when the transformed distribution still has a significant and substantial skewness (see below for a pertinent testing procedure). Therefore, one needs to examine the transformed variable for normality before proceeding with it in any analyses that assume normality.
In this sense, if one transformation is not strong enough, it is advisable that a stronger transformation be chosen. However, if one applies a stronger than necessary transformation, the sign of the skewness may end up being changed (e.g., from positive to negative). Hence, one might better start with the weakest transformation that appears to be worth trying (e.g., square root). Further, and no less important, as indicated above, it is always worthwhile examining whether excessive asymmetry (and kurtosis) may be due to outliers. If the transformed variable exhibits substantial skewness, it is advisable that one examine it, in addition to the pretransformed variable, also for outliers (see Section 3.3).

Before moving on to an example, let us stress that caution is advised when interpreting the results of statistical analyses that use transformed variables. This is because the units and possibly origin of measurement have been changed by the transformation, and thus those of the transformed variable(s) are no longer identical to those of the original measure(s). However, all above transformations (and the ones mentioned at the conclusion of this section) are monotone, that is, they preserve the rank ordering of the studied subjects (the inverse-type transformations do so in reversed direction). Hence, when units of measurement are arbitrary or irrelevant, a transformation may not lead to a considerable loss of substantive interpretability of the final analytic results.

It is also worth mentioning at this point that the discussed transformed variables result from other than linear transformations, and hence their correlational structure is in general different from that of the original variables. This consequence may be particularly relevant in settings where one considers subsequent analysis of the structure underlying the studied variables (such as factor analysis; see Chapter 8). In those cases, the alteration of the relationships among these variables may contribute to a decision perhaps not to transform the variables but instead to use subsequently specific correction methods that are available within the general framework of latent variable modeling, for which we refer to alternative sources (Muthén, 2002; Muthén & Muthén, 2006; for a nontechnical introduction, see Raykov & Marcoulides, 2006).
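The weakest-first strategy described in this section can be tried out with a short Python sketch. It is an illustration only: the function names and the ten scores are hypothetical, skewness is computed with the simple moment-based coefficient (which differs slightly from the adjusted estimate reported by SPSS and SAS), and for simplicity the shift of 1 + |minimum| is applied before every transformation whenever zero or negative scores are present.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def shift_to_positive(values):
    """Add 1 + |min| to all scores when zero or negative scores are present."""
    low = min(values)
    return [v + 1 + abs(low) for v in values] if low <= 0 else list(values)

def reflect(values):
    """Reflect a negatively skewed variable: (max + 1) - score."""
    high = max(values)
    return [high + 1 - v for v in values]

# Transformations ordered from weakest to strongest
LADDER = [
    ("square root", math.sqrt),
    ("logarithm", math.log),
    ("inverse", lambda v: 1.0 / v),
    ("inverse squared", lambda v: 1.0 / v ** 2),
]

def report(values):
    """Print the skewness of the (shifted) scores before and after each transformation."""
    y = shift_to_positive(values)
    print(f"{'original':16s} skewness = {skewness(y):6.3f}")
    for name, func in LADDER:
        print(f"{name:16s} skewness = {skewness([func(v) for v in y]):6.3f}")

# Hypothetical positively skewed scores
scores = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20]
report(scores)
```

For a negatively skewed variable, one would call `report(reflect(values))` instead, in line with the reflection step described above.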
To exemplify the preceding discussion in this section, consider data obtained from a study in which n = 150 students were administered a test of inductive reasoning ability (denoted IR1 in the data file named ch3ex2.dat available from www.psypress.com/applied-multivariate-analysis). To examine the distribution of their scores on this intelligence measure, with SPSS we use the following menu options/sequence:

Analyze → Descriptive statistics → Explore,

whereas with SAS the following command file could be used:

DATA Chapter3EX2;
INFILE 'ch3ex2.dat';
INPUT ir1 group gender sqrt_ir1 ln_ir1;
PROC UNIVARIATE plot normal;
/* Note that instead of the "plot" statement, additional commands like "QQPLOT", "PROBPLOT" or "HISTOGRAM" can be provided in a line below to create separate plots */
var ir1;
RUN;

The resulting outputs produced by SPSS and SAS are as follows (provided in segments to simplify the discussion).

SPSS descriptive statistics output

Descriptives IR1

                                                 Statistic    Std. Error
Mean                                             30.5145      1.20818
95% Confidence Interval for Mean   Lower Bound   28.1272
                                   Upper Bound   32.9019
5% Trimmed Mean                                  29.9512
Median                                           28.5800
Variance                                         218.954
Std. Deviation                                   14.79710
Minimum                                          1.43
Maximum                                          78.60
Range                                            77.17
Interquartile Range                              18.5700
Skewness                                         .643         .198
Kurtosis                                         .158         .394

Extreme Values IR1

                Case Number    Value
Highest   1     100            78.60
          2     60             71.45
          3     16             64.31
          4     107            61.45
          5     20             60.02 (a)
Lowest    1     22             1.43
          2     129            7.15
          3     126            7.15
          4     76             7.15
          5     66             7.15 (b)

a. Only a partial list of cases with the value 60.02 is shown in the table of upper extremes.
b. Only a partial list of cases with the value 7.15 is shown in the table of lower extremes.




SAS descriptive statistics output

The SAS System
The UNIVARIATE Procedure
Variable: ir1

Moments
N                  150           Sum Weights        150
Mean               30.5145333    Sum Observations   4577.18
Std Deviation      14.7971049    Variance           218.954312
Skewness           0.64299511    Kurtosis           0.15756849
Uncorrected SS     172294.704    Corrected SS       32624.1925
Coeff Variation    48.4919913    Std Error Mean     1.20817855

Basic Statistical Measures
Location                 Variability
Mean      30.51453       Std Deviation          14.79710
Median    28.58000       Variance               218.95431
Mode      25.72000       Range                  77.17000
                         Interquartile Range    18.57000

Extreme Observations
Lowest              Highest
Value    Obs        Value    Obs
1.43     22         60.02    78
7.15     129        61.45    107
7.15     126        64.31    16
7.15     76         71.45    60
7.15     66         78.60    100

As can be readily seen by examining the skewness and kurtosis in either of the above sections with descriptive statistics, the skewness of the variable under consideration is positive and quite large (as well as significant, since the ratio of its estimate to its standard error is larger than 2; recall that at α = .05, the cutoff is 1.96 for this ratio, which follows a normal distribution). This is not the case for its kurtosis, however. With respect to the listed extreme values, at this point we withhold judgment about any of these 10 cases, since their being apparently extreme may actually be due to lack of normality. We turn next to this issue.
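The ratio check just described can be reproduced with a few lines of Python. This is a minimal sketch rather than part of the SPSS or SAS output: the function names are hypothetical, and the standard error formulas are the usual ones for the sample skewness and kurtosis coefficients, assumed here because they reproduce the printed standard errors (.198 and .394) for n = 150.

```python
import math

def se_skewness(n):
    """Standard error of the sample skewness coefficient for sample size n."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def se_kurtosis(n):
    """Standard error of the sample (excess) kurtosis coefficient for sample size n."""
    return 2.0 * se_skewness(n) * math.sqrt((n * n - 1.0) / ((n - 3) * (n + 5)))

n = 150
skew, kurt = 0.643, 0.158  # estimates taken from the output above

z_skew = skew / se_skewness(n)
z_kurt = kurt / se_kurtosis(n)

print("SE skewness:", round(se_skewness(n), 3), " ratio:", round(z_skew, 2))
print("SE kurtosis:", round(se_kurtosis(n), 3), " ratio:", round(z_kurt, 2))
print("skewness significant at .05:", abs(z_skew) > 1.96)
print("kurtosis significant at .05:", abs(z_kurt) > 1.96)
```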


90


SPSS tests of normality

Tests of Normality
        Kolmogorov-Smirnov (a)          Shapiro-Wilk
        Statistic   df     Sig.         Statistic   df     Sig.
IR1     .094        150    .003         .968        150    .002

a. Lilliefors Significance Correction

SAS tests of normality

Tests for Normality
Test                  Statistic            p Value
Shapiro-Wilk          W       0.96824      Pr < W       0.0015
Kolmogorov-Smirnov    D       0.093705     Pr > D
Cramer-von Mises      W-Sq    0.224096     Pr > W-Sq
Anderson-Darling      A-Sq    1.348968     Pr > A-Sq

[...] 1) as normally distributed with a mean vector $\mu$ and covariance matrix $\Sigma$ (where $\Sigma > 0$), denoted $X \sim N_p(\mu, \Sigma)$, if its probability density function is

\[ f(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu) \right\}. \tag{4.2} \]

We observe that one formally obtains the right-hand side of Equation 4.2 from the right-hand side of Equation 4.1 by exchanging in the latter unidimensional quantities in the exponent with p-dimensional ones, and the variance with the determinant of the covariance matrix (accounting in addition for the fact that now p dimensions are considered simultaneously rather than a single one). We note in this definition the essential requirement of positive definiteness of the covariance matrix $\Sigma$ (see Chapter 2), because in the alternative case, the right-hand side of Equation 4.2 will not be defined. (In that case, however, which will not be of relevance in the remainder of the book, one could still use the definition of MVN from Chapter 2; Anderson, 1984.)


102


The following two properties, and especially the second one, will be of particular importance later in this chapter. We note that the second property is formally obtained by analogy from the first one.

Property 1: If a sample of size n is drawn from the univariate normal distribution $N(\mu, \sigma^2)$, then the mean $\bar{X}$ of the sample will be distributed as $N(\mu, \sigma^2/n)$, that is, $\bar{X} \sim N(\mu, \sigma^2/n)$. This property provides the rationale as to why the sample mean is more stable than just a single observation from a studied population: the reason is the smaller variance of the mean.

Property 2: If a sample of size n is drawn from the multivariate normal distribution $N_p(\mu, \Sigma)$, then the mean vector $\bar{X}$ of the sample will be distributed as $N_p(\mu, (1/n)\Sigma)$, that is, $\bar{X} \sim N_p(\mu, (1/n)\Sigma)$.

We observe that Property 1 is obviously obtained as a special case of Property 2, namely when p = 1. We conclude this section by noting that we make the multinormality assumption for the rest of the chapter (see Chapter 3 for its examination).

4.3 Testing Hypotheses About a Multivariate Mean

Suppose for a moment that we were interested in examining whether the means of university freshmen on two distinct intelligence test scores were each equal to 100. Since these are typically interrelated measures of mental ability, it would be wasteful of empirical information not to consider them simultaneously but to test instead each one of them separately for equality of its mean to 100 (e.g., using the single-group t test). For this reason, in lieu of the latter univariate approach, what we would like to do is test whether the mean of the two-dimensional vector of this pair of random variables is the point in the two-dimensional space that has both its coordinates equal to 100. In more general terms, we would like to test the hypothesis that the two-dimensional mean $\mu$ of the random vector consisting of these two interrelated intelligence measures equals $\mu_0$, that is, the null hypothesis $H_0: \mu = \mu_0$, where $\mu_0$ is a vector having as elements prespecified numbers. (In the example under consideration, $\mu_0 = [100, 100]'$.) This goal can be accomplished using Property 2 from the preceding section.

A straightforward approach to conducting this test is achieved by using the duality principle between confidence interval (confidence region) and hypothesis testing (Hays, 1994), as well as the concept of Mahalanobis distance (MD). According to this principle, a 95%-confidence region for the multivariate mean $\mu$ is the area consisting of all points $\mu_0$ in the respective multivariate space for which the null hypothesis $H_0: \mu = \mu_0$ is not rejected at the α = .05 significance level. (With another significance level, the confidence level is correspondingly obtained as the complement to 1 of the former.)


Multivariate Analysis of Group Differences


To see this principle at work, let us make use of the univariate-to-multivariate analogy. Recall what is involved in testing the univariate version of the above null hypothesis $H_0$. Accordingly, testing the hypothesis $\mu = \mu_0$ for a prespecified real number $\mu_0$ is effectively the same as finding out whether $\mu_0$ is a plausible value for the mean $\mu$, given the data. Hence, if $\mu_0$ lies within a plausible range of values for the mean, we do not reject this hypothesis; otherwise we do. This is the same as saying that we do not reject the hypothesis $\mu = \mu_0$ if and only if $\mu_0$ belongs to a plausible range of values for $\mu$, that is, falls in the confidence interval of the mean. Thus, we can carry out hypothesis testing by evaluating a confidence interval and checking whether $\mu_0$ is covered by that interval. If it is, we do not have enough evidence to warrant rejection of the null hypothesis $\mu = \mu_0$, but if it is not, then we can reject that hypothesis. Note that the essence of this principle is a logical one and does not really depend on the dimensionality of $\mu$ and $\mu_0$ (whether they are scalars or at least two-dimensional vectors). Therefore, we can use it in the multivariate case as well.

We next note that this testing approach capitalizes on viewing a confidence interval as a set of values for a given parameter, in this case the mean, whose distance from its empirical estimate (here, the sample mean) is sufficiently small. This view is also independent of the dimensionality of the parameter in question. If it is multidimensional, as is the case mostly in this book, the role of distance can obviously be played by the MD that we got familiar with in Chapter 2. That is, returning to our above concern with testing a multivariate null hypothesis, $H_0: \mu = \mu_0$, we can say that we reject it if the MD of the sample mean vector (point) to the vector (point) $\mu_0$ in the p-dimensional space is large enough, while we do not reject this hypothesis if the sample mean is sufficiently close to the hypothetical mean vector $\mu_0$. We formalize next these developments further and thereby take into account two possible situations.
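The duality principle can be made concrete in the univariate case with a small Python sketch. The numbers and function names below are hypothetical and chosen only for illustration (a sample mean of 103 from n = 25 scores with a known standard deviation of 15); the point is that a hypothetical mean is rejected by the z test exactly when it falls outside the 95%-confidence interval.

```python
import math

Z_CRIT = 1.96  # two-sided standard normal cutoff at alpha = .05

def confidence_interval(xbar, sigma, n):
    """95%-confidence interval for the mean when sigma is known."""
    half_width = Z_CRIT * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

def z_test_rejects(xbar, mu0, sigma, n):
    """True if H0: mu = mu0 is rejected at alpha = .05 by the z test."""
    return abs(xbar - mu0) / (sigma / math.sqrt(n)) > Z_CRIT

# Hypothetical data summary
xbar, sigma, n = 103.0, 15.0, 25
low, high = confidence_interval(xbar, sigma, n)
print("95% CI:", round(low, 2), round(high, 2))

for mu0 in (95.0, 100.0, 105.0, 110.0):
    inside = low <= mu0 <= high
    rejected = z_test_rejects(xbar, mu0, sigma, n)
    print(f"mu0 = {mu0}: inside CI = {inside}, rejected = {rejected}")
```

In the multivariate case the interval is replaced by a confidence region and the absolute standardized difference by the MD, but the logic stays the same.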

4.3.1 The Case of Known Covariance Matrix

We consider again an example, but this time using numerical values. Let us assume that reading and writing scores in a population of third graders follow a two-dimensional normal distribution with a population covariance matrix equal to

\[ \Sigma = \begin{bmatrix} 25 & 12 \\ 12 & 27 \end{bmatrix}. \tag{4.3} \]

Suppose also that a sample is taken from this population, which consists of observations on these variables from n = 20 students, with an observed mean vector of $[57, 44]'$. Assume we would like to test whether the population mean is $\mu_0 = [55, 46]'$. That is, the null hypothesis to test is $H_0: \mu = [55, 46]'$. As discussed above, we can use the duality principle between hypothesis testing and confidence interval to accomplish this goal. Specifically, for a given significance level α (with 0 < α < 1), we can check if $\mu_0$ falls in the confidence region for the mean vector at confidence level 1 − α. To this end, we make use of the following result: if a population is studied where a given vector of measures X follows a normal distribution, $X \sim N_p(\mu, \Sigma)$, with $\Sigma > 0$, the 95%-confidence region for the mean vector $\mu$ is the region enclosed by the ellipsoid with contour defined by the equation (Y denoting the running point along the contour)

\[ (Y - \bar{X})' \Sigma^{-1} (Y - \bar{X}) = \frac{\chi^2_{p,\alpha}}{n}, \tag{4.4} \]

where $\chi^2_{p,\alpha}$ is the pertinent cutoff for the chi-square distribution with p degrees of freedom and n is sample size (as usual in this book; Tatsuoka, 1988). By way of illustration, in the three-dimensional case (i.e., when p = 3) the ellipsoid defined by Equation 4.4 resembles a football, whereas in the two-dimensional case (p = 2) it is an ellipse. Just to give an example, a plot of such an ellipse with mean vector (centroid) say at $\mu = [55, 46]'$ for the 95%-confidence level is shown in Figure 4.1 (with the short vertical and horizontal lines assumed to be erected at the scores 55 and 46, respectively, and dispensing for simplicity with representing explicitly the metric of the two coordinate axes and their directions of increase that are implied). We note that the discussed procedure leading to the confidence region in Equation 4.4 is applicable for any p ≥ 1, because nowhere is dimensionality mentioned in it (other than in the degrees of freedom, which are however irrelevant in this regard).

FIGURE 4.1 Two-dimensional ellipsoid with centroid at $\mu = [55, 46]'$, representing a 95%-confidence region.

In addition, we stress that the left-hand side of Equation 4.4 is precisely the multivariate distance (MD) of the running vector Y to the centroid of the data set, $\bar{X}$, whereby this distance is taken with regard to the covariance matrix $\Sigma$ (see Chapter 2). Hence, from Equation 4.4 it follows that all points in the p-dimensional space with an MD from the centroid $\bar{X}$ that is less than $\chi^2_{p,\alpha}/n$ constitute the 95%-confidence region for the mean $\mu$. Therefore, using the duality principle mentioned above, the null hypothesis $H_0: \mu = \mu_0$ may be rejected if the MD of the hypothetical value $\mu_0$ to the data centroid $\bar{X}$ is large enough, namely larger than $\chi^2_{p,\alpha}/n$, that is, if and only if

\[ (\mu_0 - \bar{X})' \Sigma^{-1} (\mu_0 - \bar{X}) > \frac{\chi^2_{p,\alpha}}{n}. \tag{4.5} \]

We readily observe the analogy between Equation 4.5 and the well-known test statistic of the univariate z test:

\[ z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}, \tag{4.6} \]

for which we can write its squared value, distributed now as a chi-square variable with 1 degree of freedom, as

\[ z^2 = (\bar{X} - \mu_0)(\sigma^2/n)^{-1}(\bar{X} - \mu_0) \sim \chi^2_1, \tag{4.7} \]

or simply

\[ z^2/n = (\bar{X} - \mu_0)(\sigma^2)^{-1}(\bar{X} - \mu_0) \sim \chi^2_1/n. \tag{4.8} \]

(Recall that the square of a random variable following a standard normal distribution is a chi-square distributed variable with 1 degree of freedom; Hays, 1994.) A comparison of Equations 4.5 and 4.8 reveals that the latter is a special case of the former (leaving aside the reference to significance level in Equation 4.5 that is irrelevant for this comparison).

Returning to our previous example of the reading and writing ability study, since we are interested in two scores, p = 2 in it. Hence, given that the sample size was n = 20, we easily find that the right-hand side of Equation 4.5 yields here (taking as usual a significance level of α = .05):

\[ \frac{\chi^2_{p,\alpha}}{n} = \frac{\chi^2_{2,.05}}{20} = \frac{5.9915}{20} = .30. \tag{4.9} \]


106


(This critical value can be obtained from chi-square tables, which are provided in most introductory statistics textbooks, for α = .05 and 2 degrees of freedom; Hays, 1994.) To carry out the corresponding hypothesis testing in an empirical setting, we can use either the following SPSS syntax file or the subsequent SAS PROC IML program file, with explanatory comments correspondingly following each. (For details on using such SPSS and SAS command files, the reader can refer to Chapter 2.)

SPSS command file

TITLE 'USING SPSS FOR TWO-GROUP HYPOTHESIS TESTING'.
MATRIX.
COMP X.BAR = {57, 44}.
COMP MU.0 = {55, 46}.
COMP SIGMA = {25, 12; 12, 27}.
COMP XBR.DIST = (MU.0-X.BAR)*INV(SIGMA)*T(MU.0-X.BAR).
PRINT XBR.DIST.
END MATRIX.

In this command sequence, X.BAR stands for the sample mean ($\bar{X}$) and MU.0 for the hypothetical mean vector ($\mu_0$) with coordinates 55 and 46 (actually, its transpose, for convenience reasons and due to software choice); in addition, SIGMA symbolizes the known population covariance matrix $\Sigma$ in Equation 4.3, and XBR.DIST is the MD of the sample mean to the hypothetical mean vector. This command file yields the computed value of XBR.DIST = .57. Since it is larger than the above found cutoff value of .30, we conclude that the sample mean is farther away from the hypothetical mean vector than tolerable under the null hypothesis. This finding suggests rejection of $H_0$.

SAS command file

PROC IML;
XBAR = {57 44};
MUO = {55 46};
SIGMA = {25 12, 12 27}; /* RECALL THE USE OF A COMMA TO SEPARATE ROWS */
DIFF = MUO - XBAR;
TRANDIFF = T(DIFF);
INVS = INV(SIGMA);
XBRDIST = DIFF*INVS*TRANDIFF;
PRINT XBRDIST;
QUIT;




This SAS program (using comparable quantity names to the preceding SPSS input file) generates the same result as that obtained with SPSS.

It is worthwhile noting here that if one were to carry out instead two separate t tests on each of the ability scores from this example, none of the respective univariate null hypotheses would be rejected, that is, both $H_{0,1}: \mu_{\text{reading}} = 55$ and $H_{0,2}: \mu_{\text{writing}} = 46$ would be retained. The reason can be the lack of power that results from wasting the information about the important interrelationship between these two scores, which one commits in general if univariate rather than multivariate tests are carried out on correlated DVs. In particular, in the present example one can find out from Equation 4.3 that there is a correlation of .46 between the two ability scores (e.g., using Equation 2.39 in Chapter 2). This notable correlation is not accounted for by the univariate testing approach just mentioned. In other words, the multivariate hypothesis was rejected here because the multivariate test accumulated information about violation of $H_0$ across two interrelated dimensions, while the univariate approach treated these ability scores as unrelated (which they obviously are not).

This is a typical example of a difference in results between univariate and multivariate analyses that could be carried out on the same data. The difference usually arises as a result of (a) inflated Type I error when univariate tests are conducted separately on each DV, one at a time; and (b) higher power of the multivariate statistical test, due to more efficient use of available sample information. For this reason, when univariate and multivariate tests disagree in the manner shown in this example, one would tend to trust the multivariate results more.
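For readers who wish to verify these numbers outside SPSS or SAS, the computation can be sketched in plain Python. The helper names are hypothetical, the 2 × 2 inverse is written out by hand, and the chi-square cutoff uses the closed form that holds only for 2 degrees of freedom, $\chi^2_{2,\alpha} = -2\ln\alpha$. The separate univariate checks are done here as z tests, since the population variances are taken as known in this subsection.

```python
import math

def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def squared_distance(point, center, cov):
    """Squared Mahalanobis distance of point from center with respect to cov."""
    diff = [point[0] - center[0], point[1] - center[1]]
    inv = inverse_2x2(cov)
    return sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))

xbar, mu0, n = [57.0, 44.0], [55.0, 46.0], 20
sigma = [[25.0, 12.0], [12.0, 27.0]]
alpha = 0.05

dist = squared_distance(mu0, xbar, sigma)   # left-hand side of Inequality 4.5
cutoff = -2.0 * math.log(alpha) / n         # chi-square cutoff for 2 df, divided by n
print("MD:", round(dist, 2), "cutoff:", round(cutoff, 2), "reject H0:", dist > cutoff)

# Separate univariate z tests on each score (known variances 25 and 27)
z_reading = (xbar[0] - mu0[0]) / math.sqrt(sigma[0][0] / n)
z_writing = (xbar[1] - mu0[1]) / math.sqrt(sigma[1][1] / n)
print("z reading:", round(z_reading, 2), "z writing:", round(z_writing, 2))

# Correlation implied by the covariance matrix in Equation 4.3
corr = sigma[0][1] / math.sqrt(sigma[0][0] * sigma[1][1])
print("correlation:", round(corr, 2))
```

Neither univariate ratio exceeds 1.96 in absolute value, whereas the multivariate distance exceeds its cutoff, which mirrors the disagreement discussed above.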

4.3.2 The Case of Unknown Covariance Matrix

The preceding subsection made a fairly strong assumption that we knew the population covariance matrix for the variables of interest. This assumption is rarely fulfilled in empirical social and behavioral research. In this subsection, we consider the more realistic situation when we do not know that covariance matrix to begin with, yet estimate it in a given sample from a studied population by the empirical covariance matrix S, and want to test the same null hypothesis $H_0: \mu = \mu_0$. In this case, one can still employ the same above reasoning but using the MD with regard to the matrix S (because its population counterpart $\Sigma$ is unknown). Then it can be shown that the 95%-confidence region for the mean consists of all those observed values of Y that are close (near) enough to $\bar{X}$ in terms of their MD, namely all those Y for which

\[ (Y - \bar{X})' S^{-1} (Y - \bar{X}) < \frac{p(n-1)}{n(n-p)} F_{p,n-p;\alpha}, \tag{4.10} \]


108


where $F_{p,n-p;\alpha}$ is the pertinent cutoff of the F distribution with p and n − p degrees of freedom (Johnson & Wichern, 2002). As explained above, we stress that this procedure is applicable for any dimensionality p ≥ 1, that is, with any number of DVs. From Inequality 4.10 it is seen that the same approach is applied here as in the preceding subsection, with the only differences that (a) the matrix S (i.e., the sample-based estimate of the population covariance matrix) is used in lieu of $\Sigma$; and (b) the relevant cutoff is from a different distribution. The latter results from the fact that here we estimate $\Sigma$ by S rather than use $\Sigma$ itself, as we do not know it. (Recall that in the univariate setup we use a t distribution for tests on means when the population variance is unknown, rather than the normal distribution as we would if that variance were known.) Hence, by analogy to the last considered testing procedure, and given Inequality 4.10, we reject $H_0: \mu = \mu_0$ if and only if

\[ (\mu_0 - \bar{X})' S^{-1} (\mu_0 - \bar{X}) > \frac{p(n-1)}{n(n-p)} F_{p,n-p;\alpha}, \tag{4.11} \]

that is, if the observed mean vector is far enough from the hypothetical mean vector, $\mu_0$, in terms of the former's MD. When the population covariance matrix $\Sigma$ is unknown, the left-hand side of Inequality 4.11, multiplied by sample size, is called Hotelling's $T^2$. That is, Hotelling's $T^2$ is equal to $n(\mu_0 - \bar{X})' S^{-1} (\mu_0 - \bar{X})$, and in fact represents the multivariate analog of the univariate t statistic for testing hypotheses about the mean. Specifically, $T^2$ is a multivariate generalization of the square of the univariate t ratio for testing $H_0: \mu = \mu_0$, that is

\[ t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}. \tag{4.12} \]

Indeed, squaring both sides of Equation 4.12 leads to

\[ t^2 = n(\bar{X} - \mu_0)(s^2)^{-1}(\bar{X} - \mu_0), \tag{4.13} \]

which is formally identical to $T^2$ in case of p = 1. To complete this univariate-to-multivariate analogy, recall also that the t-distribution's and particular F-distribution's cutoffs are closely related (at a given sample size n): $t^2_{n-1,\alpha} = F_{1,n-1;\alpha}$; the last relationship should also clarify the use of the F distribution in Inequality 4.11.

To illustrate this discussion, let us reconsider the last empirical example, but now asking the same question in the more realistic situation when the population matrix is unknown. Assume that instead of knowing this population matrix, the latter is only estimated by the following sample covariance matrix:

\[ S = \begin{bmatrix} 24.22 & 10.98 \\ 10.98 & 27.87 \end{bmatrix}. \tag{4.14} \]

To test the null hypothesis $H_0: \mu = [55, 46]'$ using either SPSS or SAS, we employ either of the following two command files. (Note that these are obtained from the last presented, respective command files via a minor modification to accommodate the empirical covariance matrix in Equation 4.14 and the right-hand side of Inequality 4.11.) The critical value of F = 3.55 utilized below is obtained from F-distribution tables, which can be found in appendices to most introductory statistics textbooks, based on α = .05, p = 2, and n − p = 18 degrees of freedom (Hays, 1994).

SPSS command file

TITLE "USING SPSS FOR HOTELLING'S TEST".
MATRIX.
COMP X.BAR = {57, 44}.
COMP MU.0 = {55, 46}.
COMP S = {24.22, 10.98; 10.98, 27.87}.
COMP XBR.DIST = (MU.0-X.BAR)*INV(S)*T(MU.0-X.BAR).
PRINT XBR.DIST.
COMP CUTOFF = 2*19/(20*18)*3.55.
PRINT CUTOFF.
END MATRIX.

SAS command file

PROC IML;
XBAR = {57 44};
MUO = {55 46};
S = {24.22 10.98, 10.98 27.87};
DIFF = MUO - XBAR;
TRANDIFF = T(DIFF);
INVS = INV(S);
XBRDIST = DIFF*INVS*TRANDIFF;
PRINT XBRDIST;
CUTOFF = 2*19/(20*18)*3.55;
PRINT CUTOFF;
QUIT;


110


Submitting either of these SPSS or SAS programs yields the values of XBR.DIST = .53 and CUTOFF = .37 (recalling of course from Inequality 4.11 that CUTOFF = p(n − 1)/[n(n − p)] F = 2(19)/[20(18)] 3.55 = 0.37). Because the value of XBR.DIST is larger than the value of the relevant cutoff, we reject $H_0$ and conclude that there is evidence in the analyzed data to warrant rejection of the null hypothesis stating that the reading and writing score means were equal to 55 and 46, respectively.
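The same comparison, together with Hotelling's $T^2$ itself, can be sketched in Python. The helper names are hypothetical and the F cutoff of 3.55 is simply the tabled value quoted above, not something computed here. The last line re-expresses $T^2$ as an F ratio, which is an equivalent way of stating Inequality 4.11.

```python
def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def squared_distance(point, center, cov):
    """Squared Mahalanobis distance of point from center with respect to cov."""
    diff = [point[0] - center[0], point[1] - center[1]]
    inv = inverse_2x2(cov)
    return sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))

n, p = 20, 2
xbar, mu0 = [57.0, 44.0], [55.0, 46.0]
s = [[24.22, 10.98], [10.98, 27.87]]   # sample covariance matrix, Equation 4.14
f_crit = 3.55                          # tabled F cutoff for 2 and 18 df at alpha = .05

dist = squared_distance(mu0, xbar, s)            # left-hand side of Inequality 4.11
cutoff = p * (n - 1) / (n * (n - p)) * f_crit    # right-hand side of Inequality 4.11
t2 = n * dist                                    # Hotelling's T-squared
f_obs = (n - p) / (p * (n - 1)) * t2             # T-squared rescaled to an F ratio

print("MD:", round(dist, 2), "cutoff:", round(cutoff, 2), "reject H0:", dist > cutoff)
print("T2:", round(t2, 2), "F:", round(f_obs, 2), "reject H0:", f_obs > f_crit)
```

Both lines lead to the same decision, because comparing the F ratio with 3.55 is algebraically the same as comparing the MD with the cutoff.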

4.4 Testing Hypotheses About Multivariate Means of Two Groups

Many times in empirical research, we do not have precise enough information to come up with meaningful hypothetical values for the means of analyzed variables. Furthermore, situations may often occur in which there is interest in comparing means across two groups. This section deals with methods that can be used in such circumstances.

4.4.1 Two Related or Matched Samples (Change Over Time)

Suppose $Y = (Y_1, Y_2, \ldots, Y_p)'$ is a set of p multinormal measures that have been administered to n subjects on two occasions, with resulting scores $(y_{11}, y_{12}, \ldots, y_{1p})$ for the first occasion and $(y_{21}, y_{22}, \ldots, y_{2p})$ for the second occasion (with p > 1). We are interested in testing whether there are mean differences across time, that is, whether there is mean change over time. In other words, we are concerned with testing the null hypothesis $H_0: \mu_1 = \mu_2$, where $\mu_1$ and $\mu_2$ are the population mean vectors at first and second assessment occasions, respectively. The following approach is also directly applicable when these measurements result from two related, dependent, or so-called matched samples.

In order to proceed, we can reduce the problem to an already handled case, that of testing the hypothesis $\mu_1 - \mu_2 = 0$. Indeed, we note that the mean difference, $\mu_1 - \mu_2$, is equivalent to the mean of the difference score, that is, $\mu_1 - \mu_2 = \mu_D$, where $D = Y_1 - Y_2$ is the vector of differences on all Y components across the two assessments (with $Y_1$ and $Y_2$ here denoting the score vectors at the first and second occasions). Thus the hypothesis of interest, $H_0: \mu_1 = \mu_2$, being the same as $\mu_1 - \mu_2 = 0$, is also equivalent to $\mu_D = 0$. The latter hypothesis, however, is a special case of the one we have already dealt with in Section 4.3, and is obtained from that hypothesis when $\mu_0 = 0$. Therefore, using Inequality 4.11 with $\mu_0 = 0$, we reject the null hypothesis $H_0$ under consideration if and only if

\[ (0 - \bar{D})' S^{-1} (0 - \bar{D}) > \frac{p(n-1)}{n(n-p)} F_{p,n-p;\alpha}, \tag{4.15} \]



111

or simply, if and only if 0 S1 D > p(n 1) Fp,np;a : D n(n p)

(4:16)

In other words, we reject the hypothesis of equal means when and only when $\bar{D}$ is far enough from 0 (in terms of the former’s MD from the origin; throughout this section, n stands for the number of pairs of subject scores in the analyzed data set; see Section 4.3.2). The test statistic in the left-hand side of Inequality 4.16 is also readily seen to be the multivariate analog of the univariate t statistic for testing differences in two related means. Indeed, that univariate statistic is $t = \bar{d}/(s_d/\sqrt{n})$, where $s_d$ is the standard deviation of the difference score and $\bar{d}$ is its mean in the sample. We emphasize that here, as well as in the rest of this section, n stands for the number of pairs of recorded observations (e.g., studied subjects in a pretest/posttest design, or pairs in a matched-pairs design) rather than for the total number of all available measurements, which is obviously 2n.

To illustrate this discussion, consider the following research study setting. Two intelligence tests, referred to as test 1 and test 2, are administered to a sample of 160 high school students at the beginning and at the end of their 11th-grade year. A researcher is interested in finding out whether there is any change in intelligence, as evaluated by these two measures, across the academic year. To answer this question, let us first denote test 1 by Y1 and test 2 by Y2, and for their two administrations let us add as a second subscript 1 and 2, respectively. Next, with SPSS or SAS we can correspondingly calculate the difference scores for each of the two tests, denoted D1 and D2, using, for example, the following syntax commands (where Y11 corresponds to test 1 on occasion 1, and the other symbols are defined correspondingly):

COMP D1 = Y11 - Y12.
COMP D2 = Y21 - Y22.

or

D1 = Y11 - Y12;
D2 = Y21 - Y22;

Then we compute, as usual, the means $\bar{D}_1$ and $\bar{D}_2$, and the covariance matrix S of the two difference scores from the available sample.

With these estimates, to evaluate the left-hand side of Inequality 4.16 we proceed as follows with either SPSS or SAS. (Clarifying comments are inserted immediately after command lines where needed; for generality of the following two programs, we refer to means, variances, and covariances by symbols/names.)




SPSS command file

TITLE "USING SPSS FOR HOTELLING'S RELATED SAMPLES TEST".
MATRIX.
COMP D.BAR = {D1.BAR, D2.BAR}.
* ENTER HERE THE 2 DIFFERENCE SCORE MEANS FROM THE SAMPLE.
COMP MU.0 = {0, 0}.
COMP S = {S11, S12; S21, S22}.
* ENTER HERE COVARIANCE MATRIX OBTAINED FROM SAMPLE.
COMP DBR.DIST = (MU.0-D.BAR)*INV(S)*T(MU.0-D.BAR).
PRINT DBR.DIST.
COMP CUTOFF = 2*159/(160*158)*F.CUTOFF.
* F.CUTOFF IS THE CUTOFF OF F WITH 2 AND 158 DF'S, AT ALPHA = .05,
* WHICH IS FOUND FROM APPENDICES IN INTRO STATS BOOKS.
PRINT CUTOFF.
END MATRIX.

SAS command file

PROC IML;
DBAR = {D1BAR D2BAR};
/* ENTER HERE THE TWO DIFFERENCE SCORE MEANS */
MUO = {0 0};
S = {S11 S12, S21 S22};
/* ENTER HERE COVARIANCE MATRIX OBTAINED FROM SAMPLE */
DIFF = MUO - DBAR;
TRANDIFF = T(DIFF);
INVS = INV(S);
DBRDIST = DIFF*INVS*TRANDIFF;
PRINT DBRDIST;
CUTOFF = 2*(n-1)/(n*(n-2))*F;
PRINT CUTOFF;
QUIT;
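As a cross-check outside SPSS and SAS, the same computation can be sketched in Python using only the standard library. This is an illustration under stated assumptions rather than part of the original programs: the helper names (inv2x2, mahalanobis_sq, cutoff) are hypothetical, and the numerical inputs are those of the illustrative example discussed next in this subsection (difference score means of 0.7 and 2.6, the accompanying covariance matrix, n = 160 pairs, and the tabled F value exactly as quoted in the text).

```python
# Hotelling's related-samples test for p = 2 difference scores (sketch).

def inv2x2(m):
    """Invert a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis_sq(diff, cov):
    """Return diff' * inv(cov) * diff for a 2-element vector."""
    inv = inv2x2(cov)
    tmp = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
           inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    return diff[0] * tmp[0] + diff[1] * tmp[1]


def cutoff(p, n, f_crit):
    """Right-hand side of Inequality 4.16: p(n-1)/[n(n-p)] * F."""
    return p * (n - 1) / (n * (n - p)) * f_crit


d_bar = [0.7, 2.6]                  # difference score means (from the text)
s = [[22.1, 2.8], [2.8, 90.4]]      # covariance matrix of the difference scores
n, p = 160, 2                       # number of pairs, number of measures
f_crit = 4.46                       # tabled F value as quoted in the text

mu0 = [0.0, 0.0]
diff = [mu0[0] - d_bar[0], mu0[1] - d_bar[1]]
dbr_dist = mahalanobis_sq(diff, s)  # left-hand side of Inequality 4.16
cut = cutoff(p, n, f_crit)          # right-hand side of Inequality 4.16

print(round(dbr_dist, 3), round(cut, 3), dbr_dist > cut)
```

Because the F value is passed in as an argument, the sketch can be rerun with a critical value taken from any other table or program without changing the rest of the code.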

As indicated before, if the computed value of DBR.DIST exceeds CUTOFF in a given data set, we reject the null hypothesis of no mean difference (in this case, of no change over time); otherwise we retain it. We stress that both SPSS and SAS command files above can be readily modified in case a different number of measures are taken at both assessments (or observed in related samples); they produce numerical results as soon as one inserts into the appropriate places empirical statistics (viz., corresponding sample means, variances, and covariances, as well as the cutoff value for the respective F distribution).

As an example, assume that in the last considered empirical study the following results were obtained for the means $\bar{D}_1$ and $\bar{D}_2$: $\bar{D}_1 = 0.7$ and $\bar{D}_2 = 2.6$; and that the relevant covariance matrix was

$$S = \begin{pmatrix} 22.1 & 2.8 \\ 2.8 & 90.4 \end{pmatrix}.$$

Using either of the above SPSS or SAS command files, we readily obtain DBR.DIST = 0.092, which is larger than the CUTOFF value of 0.056 furnished thereby (with 2 and 158 degrees of freedom, $F_{\alpha = .05} = 4.46$). Hence, we can reject H0 and conclude that there is a significant change in intelligence, as evaluated by these two measures, across the academic year.

4.4.2 Two Unrelated (Independent) Samples

We begin this subsection with a motivating example in which two methods of teaching algebra topics are used in a study with an experimental and a control group. Suppose three achievement tests are administered to $n_1 = 45$ subjects designated as the experimental group and $n_2 = 110$ subjects designated as the control group. Assume that a researcher is interested in finding out whether there is a differential effect upon average performance of the two teaching methods. To proceed here, we must make an important assumption, namely that the covariance matrices of the observed variables are the same in the populations from which these samples were drawn (we denote that common covariance matrix by $\Sigma$). That is, designating by X and Y the vectors of three achievement measures in the experimental and control groups, respectively, we assume that $X \sim N_3(\mu_1, \Sigma)$ and $Y \sim N_3(\mu_2, \Sigma)$. Note that this assumption is the multivariate analog of that of equal variances in a corresponding univariate analysis, which underlies the conventional t test for mean differences across two unrelated groups.

Since the common population covariance matrix is unknown, it is desirable to estimate it using the sample data in order to proceed. To this end, we first note that, as can be shown under the above distributional assumptions, the difference between the two group means is normally distributed, and specifically

$$\bar{X} - \bar{Y} \sim N_p\!\left(\mu_1 - \mu_2,\ \left(\frac{1}{n_1} + \frac{1}{n_2}\right)\Sigma\right), \tag{4.17}$$

that is, follows the multinormal distribution with mean $\mu_1 - \mu_2$ and covariance matrix equal to $(1/n_1 + 1/n_2)\Sigma$ (Johnson & Wichern, 2002). Consequently, based on the previous discussion, we can use the following estimator of the common covariance matrix (see Chapter 2):

$$S^* = \frac{\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)(SSCP_X + SSCP_Y)}{df}, \tag{4.18}$$




where $SSCP_X$ and $SSCP_Y$ are the SSCP matrices for the X and Y measures in each of the groups, respectively, and $df = n_1 + n_2 - 2$ are the underlying degrees of freedom. As an aside at this moment, let us note the identity of the degrees of freedom here and in the corresponding univariate setup when estimating a common population variance in the two-group study (we remark on this issue again later). In a univariate setup, Equation 4.18 would correspond to the well-known relationship $s_p^2 (1/n_1 + 1/n_2) = (1/n_1 + 1/n_2)(\sum x_1^2 + \sum y_2^2)/(n_1 + n_2 - 2)$, where $s_p^2$ is the pooled variance estimate and $\sum x_1^2$ and $\sum y_2^2$ are the sums of squared mean deviations in each of the two groups. We refer to the covariance matrix $S^*$ in Equation 4.18 as the pooled covariance matrix (similarly to pooled variance in the univariate case).

Returning now to our initial concern in this subsection, which is to test the null hypothesis H0: $\mu_1 = \mu_2$, or equivalently the null hypothesis $\mu_1 - \mu_2 = 0$, we use the same reasoning as earlier in this chapter. Accordingly, if 0 is close enough to $\mu_1 - \mu_2$, in terms of the latter’s MD (to the origin), we do not reject H0; otherwise we reject H0. How do we find the MD between $\mu_1 - \mu_2$ and the origin 0 in the present situation? According to its definition (see Chapter 2, and notice the analogy to the case of two related samples), this MD is

$$T_2^2 = (\bar{X} - \bar{Y})'\, S^{*-1} (\bar{X} - \bar{Y}), \tag{4.19}$$

which is similarly called Hotelling’s $T^2$ for the two-sample problem. We further note that the right-hand side of Equation 4.19 is a multivariate analog of the t-test statistic in the univariate two-sample setup, in which our interest lies in testing the hypothesis of equality of two independent group means (Hays, 1994). That is,

$$t = \frac{\bar{X} - \bar{Y}}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \tag{4.20}$$

from which it also follows by squaring that

$$t^2 = (\bar{X} - \bar{Y}) \left[ s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) \right]^{-1} (\bar{X} - \bar{Y}), \tag{4.21}$$

where $s_p^2$ is the pooled estimate of the common variance in both samples; the product $s_p^2 (1/n_1 + 1/n_2)$ is the special case of Equation 4.18 with a single studied variable. We note that the expression in the right-hand side of Equation 4.21 is identical to the right-hand side of Equation 4.19 in that special case.




Considering again the empirical example regarding the two methods of teaching algebra topics, let us assume that we have first computed the necessary mean vectors and SSCP matrices using the procedures described in Chapters 1 and 2, based on the common covariance matrix assumption. Now we can use either of the following SPSS or SAS command files to test for group differences. (The raw data are found in the file named ch4ex1.dat available from www.psypress.com/applied-multivariate-analysis, where TEST.1 through TEST.3 denote the three achievement test scores, in percentage correct metric, and GROUP stands for the experimental vs. control group dichotomy.)

SPSS command file

TITLE "HOTELLING'S INDEPENDENT SAMPLES TEST".
MATRIX.
COMP XBAR = {27.469, 46.244, 34.868}.
COMP YBAR = {31.698, 50.434, 37.856}.
COMP DIFF = XBAR - YBAR.
COMP SSCPX = {9838.100, 7702.388, 10132.607;
  7702.388, 11559.293, 8820.509;
  10132.607, 8820.509, 13295.321}.
COMP SSCPY = {23463.794, 15979.138, 23614.533;
  15979.138, 29152.887, 22586.075;
  23614.533, 22586.075, 34080.625}.
* THE LAST ARE THE SSCP MATRICES OBTAINED FROM BOTH GROUPS.
COMP SUMS = SSCPX + SSCPY.
COMP S = ((1/45 + 1/110)*(SUMS))/153.
COMP DBR.DIST = (DIFF)*INV(S)*T(DIFF).
PRINT DBR.DIST.
COMP CUTOFF = (45+110-2)*2/(45+110-3)*2.99.
* 2.99 IS THE F VALUE WITH 2 AND 158 DF'S, AT ALPHA = .05, FROM BOOKS.
PRINT CUTOFF.
END MATRIX.

SAS command file

PROC IML;
XBAR = {27.469 46.244 34.868};
YBAR = {31.698 50.434 37.856};
SSCPX = {9838.100 7702.388 10132.607,
  7702.388 11559.293 8820.509,
  10132.607 8820.509 13295.321};
SSCPY = {23463.794 15979.138 23614.533,
  15979.138 29152.887 22586.075,
  23614.533 22586.075 34080.625};
DIFF = XBAR - YBAR;
TRANDIFF = T(DIFF);
SUMS = SSCPX + SSCPY;
S = ((1/45 + 1/110)*(SUMS))/153;
INVS = INV(S);
DBRDIST = DIFF*INVS*TRANDIFF;
PRINT DBRDIST;
CUTOFF = (45+110-2)*2/(45+110-3)*2.99;
PRINT CUTOFF;
QUIT;

Submitting either of these two program files to the software, with the observed data, yields DBR.DIST = 4.154 (i.e., the MD of the difference between the two sample means from the origin) and CUTOFF = 6.02. Because DBR.DIST is not larger than CUTOFF, we do not reject H0, and can conclude that there is no evidence in the analyzed samples that the two methods of teaching algebra topics differ in their effectiveness.
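The two command files above can also be mirrored in a short Python sketch that needs only the standard library. It is offered as an illustration, not as a replacement for the SPSS or SAS programs: the function name solve and the variable names are hypothetical, the matrix inverse is avoided by solving a linear system with Gaussian elimination, and the constants entering the cutoff are copied exactly as they appear in the command files.

```python
# Two-sample Hotelling computation mirroring the command files (sketch).

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting; a is a square list of lists."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


xbar = [27.469, 46.244, 34.868]
ybar = [31.698, 50.434, 37.856]
sscpx = [[9838.100, 7702.388, 10132.607],
         [7702.388, 11559.293, 8820.509],
         [10132.607, 8820.509, 13295.321]]
sscpy = [[23463.794, 15979.138, 23614.533],
         [15979.138, 29152.887, 22586.075],
         [23614.533, 22586.075, 34080.625]]
n1, n2 = 45, 110

# Equation 4.18: (1/n1 + 1/n2)(SSCP_X + SSCP_Y) / (n1 + n2 - 2)
scale = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
s_star = [[scale * (sscpx[i][j] + sscpy[i][j]) for j in range(3)]
          for i in range(3)]

# Equation 4.19: diff' * inv(S*) * diff, computed as diff' * w with S* w = diff
diff = [x - y for x, y in zip(xbar, ybar)]
w = solve(s_star, diff)
dbr_dist = sum(d * wi for d, wi in zip(diff, w))

# Cutoff with the constants exactly as entered in the command files
cutoff = (n1 + n2 - 2) * 2 / (n1 + n2 - 3) * 2.99

print(round(dbr_dist, 3), round(cutoff, 2), dbr_dist > cutoff)
```

Solving the system instead of forming the inverse explicitly is a design choice of this sketch; for a 3 x 3 matrix both routes give the same distance up to rounding.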

4.5 Testing Hypotheses About Multivariate Means in One-Way and Higher Order Designs (Multivariate Analysis of Variance, MANOVA)

Many empirical studies in the social and behavioral sciences are concerned with designs that have more than two groups. The statistical approach followed so far in this chapter can be generalized to such cases, and the resulting extension is referred to as multivariate analysis of variance (MANOVA). A main concern of MANOVA is the examination of mean differences across several groups when more than one DV is considered simultaneously. That is, a MANOVA is essentially an analysis of variance (ANOVA) with p > 1 response (dependent) variables; conversely, ANOVA is a special case of MANOVA with p = 1 outcome variable. In analogy to ANOVA, a major question in MANOVA is whether there is evidence in an analyzed data set for an “effect” (i.e., a main effect or an interaction), when all p DVs are considered together ($p \geq 1$).

As could be expected, there is a helpful analogy between the statistical procedure behind MANOVA and the one on which its univariate counterpart, ANOVA, is based. We use this analogy in our discussion next. We first recall from UVS that in a one-way ANOVA with, say, g groups ($g \geq 2$), a hypothesis of main interest is that of equality of their means on the single DV, that is,

$$H_0: \mu_1 = \mu_2 = \ldots = \mu_g. \tag{4.22}$$




The testing approach to address this question is then based on the well-known sum of squares partitioning,

$$SS_T = SS_B + SS_W, \tag{4.23}$$

where
$SS_T$ is the total sum of squares,
$SS_B$ is the sum of squares between (for) the means of the groups, and
$SS_W$ is the sum of squares within the groups.

The statistic for testing the null hypothesis in Equation 4.22 is then the following F ratio:

$$F = \frac{\text{Mean } SS_B}{\text{Mean } SS_W} = \frac{SS_B/(g-1)}{SS_W/(n-g)} \sim F_{g-1,\,n-g} \tag{4.24}$$

(Hays, 1994). Turning to the case of a one-way MANOVA design with p DVs ($p \geq 2$), a null hypothesis of main interest is

$$H_0: \mu_1 = \mu_2 = \ldots = \mu_g, \tag{4.25}$$

whereby we stress that here equality of vectors is involved. That is, the hypothesis in Equation 4.25 states that the g population centroids, in the p-dimensional space of concern now, are identical. Note that this hypothesis is the multivariate analog of the above ANOVA null hypothesis H0: $\mu_1 = \mu_2 = \ldots = \mu_g$, and may actually be obtained from the latter by formally exchanging scalars (the $\mu$s in Equation 4.22) with vectors. We next recall our earlier remark (see Chapter 1) that in the multivariate case the SSCP matrix is the analog of the univariate concept of sum of squares. In fact, testing the multivariate null hypothesis in Equation 4.25 can be seen by analogy as being based on the partitioning of the SSCP matrix of observations on the p response variables, which proceeds in essentially the same manner as the one in Equation 4.23. Specifically, for the multivariate case this breakdown is

$$SSCP_T = SSCP_B + SSCP_W, \tag{4.26}$$

where the subindexes T, B, and W stand for total, between, and within (SSCP), and

$$SSCP_W = SSCP_1 + SSCP_2 + \cdots + SSCP_g, \tag{4.27}$$




which represents the sum of the group-specific SSCP matrices across all g groups considered. We stress that each SSCP appearing in the right-hand side of Equation 4.27 not only combines all dependent variability indices (sum of squares, along its main diagonal) but also takes the response variables’ interrelationships into account (reflected in its off-diagonal elements, the cross products). We also note that Equation 4.27 helps clarify the reason why we want to assume homogeneity of group-specific covariance matrices. (We present a detailed discussion of this matter later.) Now, in order to test the null hypothesis H0 in Equation 4.25, we can make use of the so-called likelihood ratio theory (LRT), which has some very attractive properties. This theory is closely related to the method of maximum likelihood (ML), which follows a major principle in statistics and is one of the most widely used estimation approaches in its current applications. Accordingly, as estimates of unknown parameters—here, a mean vector and covariance matrix per group—one takes those values for them, which maximize the ‘‘probability’’ of observing the data at hand. (Strictly speaking, maximized is the likelihood of the data as a function of the model parameters; Roussas, 1997.) These estimates are commonly called ML estimates. The LRT is used to test hypotheses in the form of various parameter restrictions, whenever the ML estimation method is employed. The LRT utilizes instrumentally the ratio of (a) the probability (actually, likelihood in the continuous variable case) of observing the data at hand if a hypothesis under consideration were to be true, to (b) the probability (likelihood) of observing the data without assuming validity of the hypothesis, whereby both (a) and (b) are evaluated at the values for the unknown parameters that make each of these probabilities (likelihoods) maximal. 
This ratio is called the likelihood ratio (LR), and it is the basic quantity involved in hypothesis testing within this framework. If the LR is close to 1, its numerator and denominator are fairly similar, and thus the two probabilities (likelihoods) involved are nearly the same. For this reason, such a finding is interpreted as suggesting that there is lack of evidence against the tested null hypothesis. If alternatively the LR is close to 0, then its numerator is much smaller than its denominator. Hence, the probability (likelihood) of the data if the null hypothesis were true is quite different from the probability (likelihood) of the data without assuming validity of the null hypothesis; this difference represents evidence against the null hypothesis (Wilks, 1932). We note that this application of the LRT considers, for the first time in this book, evidence against H0 as being provided not by a large test statistic value but by a small value of a test statistic (i.e., a value that is close to 0, as opposed to close to 1). We emphasize that the LR cannot be larger than 1 since its numerator cannot exceed its denominator, due to the fact that in the former maximization occurs across a subspace of the multivariate space that is of relevance for the denominator; similarly, the LR cannot be negative as it is a ratio of two probabilities.
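To make the behavior of the LR concrete, the following Python sketch evaluates it in a deliberately simple univariate setting that is not part of the MANOVA development: a test of a normal mean with the standard deviation treated as known. The data, the value sigma = 3, and the function names are hypothetical and serve only as an illustration. The sketch shows an LR close to 1 when the hypothesized mean is near the sample mean, and an LR close to 0 when it is far from it.

```python
import math


def normal_loglik(data, mu, sigma):
    """Log-likelihood of independent normal scores with mean mu and SD sigma."""
    n = len(data)
    ss = sum((y - mu) ** 2 for y in data)
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - ss / (2 * sigma ** 2)


def likelihood_ratio(data, mu0, sigma):
    """LR for H0: mu = mu0, with sigma treated as known (an assumption).

    Numerator: likelihood under H0 (no free parameter left).
    Denominator: likelihood maximized over mu, attained at the sample mean.
    """
    mu_hat = sum(data) / len(data)
    return math.exp(normal_loglik(data, mu0, sigma)
                    - normal_loglik(data, mu_hat, sigma))


scores = [52.0, 49.0, 55.0, 51.0, 48.0, 53.0]  # hypothetical data

lr_near = likelihood_ratio(scores, mu0=51.0, sigma=3.0)  # H0 close to the data
lr_far = likelihood_ratio(scores, mu0=60.0, sigma=3.0)   # H0 far from the data

print(lr_near, lr_far)
```

In both calls the ratio lies between 0 and 1, in line with the bounds noted above, because the restricted maximum cannot exceed the unrestricted one.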




In the current case of one-way MANOVA, it can be shown that the LR test statistic is a monotone function of what is called Wilks’ lambda ($\Lambda$, capital Greek letter lambda):

$$\Lambda = \frac{|SSCP_W|}{|SSCP_T|} = \frac{|SSCP_W|}{|SSCP_B + SSCP_W|}, \tag{4.28}$$

which represents a ratio of the determinants of the within-group and total sample SSCP matrices (Johnson & Wichern, 2002). More specifically, if we denote

$$S_W = \frac{SSCP_W}{n-g} = \frac{SSCP_1 + \cdots + SSCP_g}{n-g},$$

it can also be shown that Wilks’ $\Lambda$ criterion can be written as

$$\Lambda = \frac{|S_W|\,(n-g)^p}{|S_T|\,(n-1)^p}, \tag{4.29}$$

where $S_T$ is the total covariance matrix (of the entire sample, disregarding group membership). The $\Lambda$ criterion in Equations 4.28 and 4.29 can be viewed as a multivariate generalization of the univariate ANOVA F ratio. Since the $\Lambda$ criterion is defined in terms of determinants of appropriate matrices, it takes into account the variances of the variables involved as well as their interrelationships (reflected in the off-diagonal elements of these matrices). Also, it can be shown that in the univariate case $\Lambda$ is inversely related to the ANOVA F ratio, denoted next F:

$$\Lambda_{(p=1)} = \frac{1}{1 + [(g-1)/(n-g)]\,F}. \tag{4.30}$$
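The univariate relation in Equation 4.30 can be checked numerically. The Python sketch below uses a small hypothetical data set with three groups and a single DV; the function name one_way_partition and the scores are our own illustrative choices. With p = 1 the determinants in Equation 4.28 reduce to the sums of squares themselves, so $\Lambda$ equals the within sum of squares divided by the total, and the sketch compares that value with the one obtained from the F ratio of Equation 4.24.

```python
def one_way_partition(groups):
    """Return (ss_between, ss_within) for a list of groups of scores."""
    all_scores = [y for grp in groups for y in grp]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_between = 0.0
    ss_within = 0.0
    for grp in groups:
        mean_grp = sum(grp) / len(grp)
        ss_between += len(grp) * (mean_grp - grand_mean) ** 2
        ss_within += sum((y - mean_grp) ** 2 for y in grp)
    return ss_between, ss_within


# Hypothetical scores on a single DV in three unequal-sized groups
groups = [[4.0, 5.0, 6.0, 5.0], [6.0, 7.0, 8.0, 7.0], [5.0, 9.0, 7.0]]
g = len(groups)                          # number of groups
n = sum(len(grp) for grp in groups)      # total sample size

ssb, ssw = one_way_partition(groups)

f_ratio = (ssb / (g - 1)) / (ssw / (n - g))            # Equation 4.24
wilks = ssw / (ssb + ssw)                              # Equation 4.28 with p = 1
wilks_from_f = 1 / (1 + (g - 1) / (n - g) * f_ratio)   # Equation 4.30

print(f_ratio, wilks, wilks_from_f)
```

The two values of $\Lambda$ agree up to floating-point rounding, and a larger F ratio necessarily yields a smaller $\Lambda$.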

Because the right-hand side of Equation 4.30 is a monotone function of the F ratio, it demonstrates that testing the null hypothesis in Equation 4.25 effectively reduces in the univariate case to the familiar ANOVA F test (Tatsuoka, 1988). In the general case of $p \geq 1$ DVs, as indicated earlier, the smaller $\Lambda$ the more evidence there is in the data against H0; that is, $\Lambda$ is an inverse measure of disparity between groups. In other words, the logic of Wilks’ $\Lambda$ in relation to the null hypothesis of interest is the “reverse” of that of the F ratio. Also, from Equation 4.28 it follows that Wilks’ $\Lambda$ can be presented as

$$\frac{1}{\Lambda} = |SSCP_W^{-1}(SSCP_B + SSCP_W)| = |SSCP_W^{-1}\,SSCP_B + I|, \tag{4.31}$$

where I is the correspondingly sized identity matrix. Another interesting fact is that $\Lambda$ and Hotelling’s $T^2$ are related in the two-group case, for any number of DVs ($p \geq 1$), as follows:

$$\frac{1}{\Lambda} - 1 = \frac{T^2}{n-2} \tag{4.32}$$

(Johnson & Wichern, 2002). Equation 4.32 indicates that the bigger the group disparity as determined by Hotelling’s $T^2$, the smaller the value of $\Lambda$. Consequently, an alternative way of testing mean differences for two groups is to use, in lieu of Hotelling’s $T^2$, Wilks’ $\Lambda$. Although there are also a number of other test criteria that can be used to examine group differences, for the moment we restrict our attention to Wilks’ $\Lambda$ criterion; we discuss those test criteria in Section 4.5.3.

The sampling distribution of Wilks’ $\Lambda$ was worked out a long time ago, under the assumption of homogeneity of the covariance matrix across groups: $\Sigma_1 = \Sigma_2 = \ldots = \Sigma_g$, where the $\Sigma$s denote the covariance matrices of the DVs in the studied populations. This assumption is testable using the so-called Box’s M test, or Bartlett’s homogeneity test (Morrison, 1976). As explained later in this section, SPSS provides Box’s M test and SAS yields Bartlett’s test (see also Chapter 10). These tests represent multivariate generalizations of the test of variance homogeneity in an ANOVA setup (and in particular of the t test for mean differences in a two-group setting). Considerable robustness, however, is in effect against violations of this assumption when large samples of equal size are used (this robustness applies also to the earlier considered case of g = 2 groups, as do all developments in this section unless stated otherwise explicitly).

Box’s M test of homogeneity compares the determinant, that is, the generalized variance (see discussion in Chapter 1), of the pooled covariance matrix estimate with those of the group covariance matrices. Its test statistic is computed as an approximate F-statistic proportional to the difference

$$U = (n-p)\ln|S^*| - \sum_{k} (n_k - 1)\ln|S_k|, \tag{4.33}$$

where n is total sample size (disregarding group membership), p is number of variables, nk the size of kth sample (k ¼ 1, . . . , g), and the summation in the right-hand side of Equation 4.33 runs across all groups. The proportionality constant, not shown in Equation 4.33, simply renders the value of U to follow a known distribution, and is of no intrinsic interest here. Consequently, an evaluation of Box’s M test relative to an F distribution with appropriate degrees of freedom at a prespecified significance level can be used to statistically test the assumption of covariance matrix homogeneity.




Bartlett’s test of homogeneity operates in much the same way as Box’s M test, except that the test statistic of the former is computed as an approximately $\chi^2$-distributed statistic. For this reason, an evaluation of Bartlett’s statistic relative to a value from the $\chi^2$ distribution with $p(p+1)(g-1)/2$ degrees of freedom is used to test the assumption of covariance matrix homogeneity. (As indicated before, the test statistic is provided in a pertinent SAS output section.) We note in passing that both Box’s M and Bartlett’s tests are notoriously sensitive to nonnormality, a fact that we remark on again later in this chapter. As an alternative, one can use structural equation modeling techniques, and specifically robust methods within that framework, to test this homogeneity assumption (Raykov, 2001). As another possibility, with either Box’s M or Bartlett’s test one may consider utilizing a more conservative significance level of, say, $\alpha = .01$, that is, proclaim the assumption violated if the associated p-value is less than .01.

To demonstrate the discussion in this section, let us consider the following example study. In it, three measures of motivation are obtained on three socioeconomic status (SES) groups (lower, medium, and high) with $n_1 = 47$, $n_2 = 50$, and $n_3 = 48$ children, respectively. (The three measures of motivation are denoted MOTIV.1, MOTIV.2, and MOTIV.3 in the file ch4ex2.dat available from www.psypress.com/applied-multivariate-analysis, whereas SES runs from 1 through 3 and designates their respective group membership there.) Let us assume that a researcher is concerned with the question of whether there are any SES group differences in motivation. Such a question can easily be handled via the MANOVA approach in which p = 3 responses are to be compared across g = 3 groups.

This MANOVA can be readily carried out with SPSS using the following menu options/sequence:

Analyze → General Linear Model → Multivariate (choose DVs, SES as “fixed factor”; Options: homogeneity test and means for SES).

With SAS, two procedures can be used to conduct a MANOVA: PROC GLM or PROC ANOVA. The main difference between them, for the purposes of this chapter, has to do with the manner in which the two procedures handle the number of observations in each group. PROC ANOVA was designed to deal with situations where there is an equal number of observations in each group, in which case the data are commonly referred to as being balanced. For the example under consideration, the number of observations in the study groups is not the same; under these circumstances, the data are referred to as unbalanced. PROC GLM was developed to handle such unbalanced data situations. When the data are balanced, both procedures produce identical results. Because PROC ANOVA does not check whether the data are balanced, it can produce incorrect results if used with unbalanced data. Therefore, one could use PROC ANOVA only with balanced data; otherwise one proceeds with PROC GLM. Because the data in the example under consideration are not balanced, we use PROC GLM to conduct the analysis with the following program setup:

DATA motivation;
INFILE 'ch4ex2.dat';
INPUT MOTIV1 MOTIV2 MOTIV3 SES;
PROC GLM;
CLASS SES;
MODEL MOTIV1 MOTIV2 MOTIV3 = SES;
MEANS SES / HOVTEST = LEVENE;
MANOVA H = SES;
RUN;
PROC DISCRIM pool = test;
CLASS SES;
RUN;

A number of new statements are used in this command file and require some clarification. In particular, we need to comment on the statements CLASS, MODEL, MEANS, MANOVA, and PROC DISCRIM. The CLASS statement is used to define the independent variables (IVs) (factors) in the study, in this case SES. The MODEL statement specifies the DVs (the three motivation measures) and IVs: to the left of the equality sign one states the dependent measures, and to the right of it the independent ones. The MEANS statement produces the mean values of the DVs at each level of the IVs specified on the CLASS statement, while HOVTEST = LEVENE requests Levene’s univariate test of homogeneity of variance for each DV. (One may look at this test as a special case of Box’s M test for the univariate case.) The MANOVA statement requests the multivariate tests of significance on the IV and also provides univariate ANOVA results for each DV. Finally, in order to request Bartlett’s test of homogeneity, the “pool = test” option within PROC DISCRIM can be used; this procedure conducts, in general, discriminant function analysis. We note that our only interest at this time in employing PROC DISCRIM is to obtain Bartlett’s test of covariance matrix homogeneity, and we consider the use of this procedure in much more detail in Chapter 10.

The resulting outputs produced by the above SPSS and SAS command sequences follow next. For ease of presentation, the outputs are reorganized into sections and clarifying comments are inserted at their end.




SPSS output

Box's Test of Equality of Covariance Matrices^a

Box's M       13.939
F              1.126
df1               12
df2        97138.282
Sig.            .333

Tests the null hypothesis that the observed covariance matrices of the dependent variables are equal across groups.
a. Design: Intercept+SES

Levene's Test of Equality of Error Variances^a

               F     df1    df2    Sig.
MOTIV.1     .672       2    142    .512
MOTIV.2     .527       2    142    .591
MOTIV.3    1.048       2    142    .353

Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept+SES

SAS output

The GLM Procedure

Levene's Test for Homogeneity of MOTIV1 Variance
ANOVA of Squared Deviations from Group Means

Source     DF    Sum of Squares    Mean Square    F Value    Pr > F
SES         2            155078        77538.9       0.80    0.4503
Error     142          13724135        96648.8

Levene's Test for Homogeneity of MOTIV2 Variance
ANOVA of Squared Deviations from Group Means

Source     DF    Sum of Squares    Mean Square    F Value    Pr > F
SES         2           26282.7        13141.3       0.19    0.8270
Error     142           9810464        69087.8

Levene's Test for Homogeneity of MOTIV3 Variance
ANOVA of Squared Deviations from Group Means

Source     DF    Sum of Squares    Mean Square    F Value    Pr > F
SES         2            209125         104563       0.95    0.3874
Error     142          15554125         109536

The DISCRIM Procedure
Test of Homogeneity of Within Covariance Matrices

Notation:
K    = Number of Groups
P    = Number of Variables
N    = Total Number of Observations - Number of Groups
N(i) = Number of Observations in the i'th Group - 1

$$V = \frac{\prod_i |\text{Within SS Matrix}(i)|^{N(i)/2}}{|\text{Pooled SS Matrix}|^{N/2}}$$

$$\text{RHO} = 1.0 - \left[\sum_i \frac{1}{N(i)} - \frac{1}{N}\right] \frac{2P^2 + 3P - 1}{6(P+1)(K-1)}$$

$$\text{DF} = .5(K-1)P(P+1)$$

Under the null hypothesis:

$$-2\,\text{RHO}\,\ln\left[\frac{N^{PN/2}\,V}{\prod_i N(i)^{PN(i)/2}}\right]$$

is distributed approximately as Chi-Square(DF).

Chi-Square    DF    Pr > ChiSq
13.512944     12    0.3329

Reference: Morrison, D.F. (1976) Multivariate Statistical Methods p252.




The displayed results indicate that according to Box’s M (and Bartlett’s) test, there is not sufficient evidence in the data to warrant rejection of the assumption of equal covariance matrices for the motivation measures considered across the three groups. In particular, given that SPSS yields a value for Box’s M of 13.939 with associated p-value of 0.333, and SAS furnished a Bartlett’s test statistic of 13.513 with associated p = 0.333, we conclude that the dependent measure covariance matrices do not differ significantly across the groups. We note in passing that Levene’s univariate tests of homogeneity of variance also suggest that each motivation measure has equal variances across groups, but these results are not of interest because the counterpart multivariate test is not significant in this example. When covariance matrix homogeneity is rejected, however, Levene’s test can help identify which measure may be contributing singly to such a finding.

SPSS output

Multivariate Tests^c

Effect       Statistic               Value    F            Hypothesis df    Error df    Sig.
Intercept    Pillai's Trace           .902    430.301^a    3.000            140.000     .000
             Wilks' Lambda            .098    430.301^a    3.000            140.000     .000
             Hotelling's Trace       9.221    430.301^a    3.000            140.000     .000
             Roy's Largest Root      9.221    430.301^a    3.000            140.000     .000
SES          Pillai's Trace           .015    .351         6.000            282.000     .909
             Wilks' Lambda            .985    .350^a       6.000            280.000     .910
             Hotelling's Trace        .015    .349         6.000            278.000     .910
             Roy's Largest Root       .015    .686^b       3.000            141.000     .562

a. Exact statistic
b. The statistic is an upper bound on F that yields a lower bound on the significance level.
c. Design: Intercept+SES

In this SPSS output section with multivariate results, we are interested only in the SES part of the table since the one titled “intercept” pertains to the means of the three measures, and whether they are significant or not is really of no particular interest. (Typically, measures used in the social and behavioral sciences yield positive scores and thus it cannot be unexpected that their means are significant; in fact, the latter could be viewed as a trivial finding.) For the moment, in that SES part, we look only at the row for Wilks’ $\Lambda$; we discuss in more detail the other test criteria in Section 4.5.3. An examination of Wilks’ $\Lambda$ and its statistical significance, based on the F distribution with 6 and 280 degrees of freedom, shows no evidence of SES group differences in the means on the three motivation measures when considered together. We note again that the degrees of freedom for Wilks’ test are identical to those in the univariate case. The reason is that the univariate and multivariate designs are the same for this study (it is a three-group investigation regardless of the number of outcome variables), with the only difference being that more dependent measures have been used here, a fact that does not affect degrees of freedom.

SAS output

The GLM Procedure
Multivariate Analysis of Variance

MANOVA Test Criteria and F Approximations for the Hypothesis of No Overall SES Effect
H = Type III SSCP Matrix for SES
E = Error SSCP Matrix
S=2  M=0  N=69

Statistic                  Value         F Value    Num DF    Den DF    Pr > F
Wilks' Lambda              0.98516192    0.35       6         280       0.9095
Pillai's Trace             0.01484454    0.35       6         282       0.9088
Hotelling-Lawley Trace     0.01505500    0.35       6         184.9     0.9092
Roy's Greatest Root        0.01460592    0.69       3         141       0.5617

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.

As can be readily seen by examining this SAS output part, identical results are found with respect to Wilks’ $\Lambda$. The next set of output sections provided by each program corresponds to the means of the SES groups on the three measures of motivation and their univariate ANOVA tests. Given that the research question of interest was conceptualized as a multivariate one in the first instance, and that as already found there was no evidence of SES group differences, the univariate results contained in the subsequent tables are of no particular interest for the moment. We provide them here only for the sake of completeness, so that the reader can get a more comprehensive picture of the output generated by the SPSS and SAS command files under consideration.




SPSS output

Tests of Between-Subjects Effects

Source            Dependent Variable   Type III Sum of Squares   df    Mean Square   F          Sig.
Corrected Model   MOTIV.1              4.639a                    2     2.320         .011       .989
                  MOTIV.2              87.561b                   2     43.781        .164       .849
                  MOTIV.3              67.277c                   2     33.639        .110       .896
Intercept         MOTIV.1              134031.059                1     134031.059    614.101    .000
                  MOTIV.2              346645.340                1     346645.340    1298.135   .000
                  MOTIV.3              196571.511                1     196571.511    642.188    .000
SES               MOTIV.1              4.639                     2     2.320         .011       .989
                  MOTIV.2              87.561                    2     43.781        .164       .849
                  MOTIV.3              67.277                    2     33.639        .110       .896
Error             MOTIV.1              30992.332                 142   218.256
                  MOTIV.2              37918.729                 142   267.033
                  MOTIV.3              43465.713                 142   306.097
Total             MOTIV.1              165115.351                145
                  MOTIV.2              384743.280                145
                  MOTIV.3              240415.613                145
Corrected Total   MOTIV.1              30996.971                 144
                  MOTIV.2              38006.290                 144
                  MOTIV.3              43532.991                 144

a. R Squared = .000 (Adjusted R Squared = -.014)
b. R Squared = .002 (Adjusted R Squared = -.012)
c. R Squared = .002 (Adjusted R Squared = -.013)

Estimated Marginal Means

SES
                                                   95% Confidence Interval
Dependent Variable   SES    Mean     Std. Error   Lower Bound   Upper Bound
MOTIV.1              1.00   30.252   2.155        25.992        34.512
                     2.00   30.323   2.089        26.193        34.454
                     3.00   30.664   2.132        26.449        34.879
MOTIV.2              1.00   48.684   2.384        43.972        53.396
                     2.00   48.096   2.311        43.528        52.665
                     3.00   49.951   2.359        45.289        54.614
MOTIV.3              1.00   36.272   2.552        31.227        41.317
                     2.00   37.783   2.474        32.892        42.674
                     3.00   36.440   2.525        31.448        41.431




SAS output

The GLM Procedure

Dependent Variable: MOTIV1

Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model             2     4.63938          2.31969       0.01      0.9894
Error             142   30992.33186      218.25586
Corrected Total   144   30996.97124

R-Square   Coeff Var   Root MSE   MOTIV1 Mean
0.000150   48.57612    14.77348   30.41306

Source   DF   Type I SS     Mean Square   F Value   Pr > F
SES      2    4.63937832    2.31968916    0.01      0.9894

Source   DF   Type III SS   Mean Square   F Value   Pr > F
SES      2    4.63937832    2.31968916    0.01      0.9894

The GLM Procedure

Dependent Variable: MOTIV2

Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model             2     87.56128         43.78064      0.16      0.8489
Error             142   37918.72865      267.03330
Corrected Total   144   38006.28993

R-Square   Coeff Var   Root MSE   MOTIV2 Mean
0.002304   33.41694    16.34115   48.90081

Source   DF   Type I SS     Mean Square   F Value   Pr > F
SES      2    87.56127654   43.78063827   0.16      0.8489

Source   DF   Type III SS   Mean Square   F Value   Pr > F
SES      2    87.56127654   43.78063827   0.16      0.8489

The GLM Procedure

Dependent Variable: MOTIV3

Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model             2     67.27715         33.63857      0.11      0.8960
Error             142   43465.71338      306.09657
Corrected Total   144   43532.99052

R-Square   Coeff Var   Root MSE   MOTIV3 Mean
0.001545   47.47987    17.49562   36.84849

Source   DF   Type I SS     Mean Square   F Value   Pr > F
SES      2    67.27714571   33.63857286   0.11      0.8960

Source   DF   Type III SS   Mean Square   F Value   Pr > F
SES      2    67.27714571   33.63857286   0.11      0.8960

The GLM Procedure

Level of         ------MOTIV1------         ------MOTIV2------         ------MOTIV3------
SES        N     Mean         Std Dev       Mean         Std Dev       Mean         Std Dev
1          47    30.2522340   15.7199013    48.6838511   16.0554160    36.2722766   18.9356935
2          50    30.3233800   13.1297030    48.0962000   16.9047920    37.7827600   16.2551049
3          48    30.6639583   15.4217050    49.9513958   16.0174023    36.4395000   17.2742199

4.5.1 Statistical Significance Versus Practical Importance

As is the case when examining results obtained in a univariate analysis, finding differences among sample mean vectors in a MANOVA to be statistically significant does not necessarily imply that they are important in a practical sense. With large enough samples, the null hypothesis of no mean differences will be rejected anyway, even if it is violated only to a substantively irrelevant degree. As is well known, one measure that can be used to address the practical relevance of group differences in an ANOVA design is the correlation ratio (also referred to as "eta squared," and sometimes also considered an effect size index):

$$\eta^2 = 1 - \frac{SS_W}{SS_T}. \tag{4.35}$$

In the multivariate case, this equation can be generalized to the so-called multivariate correlation ratio:

$$\eta^2_{\text{mult}} = 1 - \Lambda. \tag{4.36}$$
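As a quick numerical check of Equations 4.35 and 4.36, the following Python sketch evaluates both ratios. The function names are illustrative (they do not come from SPSS or SAS), and the input values are simply copied from the SAS output shown earlier for MOTIV1 and for the overall SES test.

```python
def eta_squared(ss_within: float, ss_total: float) -> float:
    """Univariate correlation ratio, Equation 4.35: 1 - SSW / SST."""
    if ss_total <= 0:
        raise ValueError("total sum of squares must be positive")
    return 1.0 - ss_within / ss_total


def eta_squared_mult(wilks_lambda: float) -> float:
    """Multivariate correlation ratio, Equation 4.36: 1 - Lambda."""
    if not 0.0 <= wilks_lambda <= 1.0:
        raise ValueError("Wilks' Lambda must lie between 0 and 1")
    return 1.0 - wilks_lambda


# Values copied from the SAS output above (MOTIV1 and the overall SES test).
eta2_motiv1 = eta_squared(ss_within=30992.33186, ss_total=30996.97124)
eta2_mult = eta_squared_mult(0.98516192)

print(f"eta squared for MOTIV1: {eta2_motiv1:.6f}")
print(f"multivariate eta squared: {eta2_mult:.6f}")
```

In a one-way design the univariate value should agree, up to rounding, with the R-Square entry that SAS reports for the same dependent variable.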

The ratio in Equation 4.36 is interpretable as the proportion of generalized variance of the set of DVs that is attributable to group membership. In other words, the right-hand side of Equation 4.36 describes the strength of association between IVs and DVs in a sample at hand. It seems to be still one of the most widely used measures of practical significance from a set of related measures, for which we refer the reader to Rencher (1998) and references therein.

4.5.2 Higher Order MANOVA Designs

So far in this section we have considered MANOVA designs with only one factor. (In the example used in the preceding subsection, this was SES group membership.) Oftentimes in social and behavioral research, however, it is necessary to include additional factors, with several levels each. These circumstances lead to two-way, three-way, or higher order multivariate designs. In such designs, once again an analogy to the univariate two-way and higher order ANOVA settings will turn out to be quite helpful. In particular, recall the two-way ANOVA design partitioning of the sum of squares on the response measure:

$$SS_T = SS_A + SS_B + SS_{AB} + SS_W, \tag{4.37}$$

where the factors are denoted using the letters A and B, $SS_W$ is the within-group sum of squares, and $SS_{AB}$ is the sum of squares due to their interaction, that is, it represents DV variation in design cells over and above what could be attributed to the factors A and B. In other words, the interaction component is viewed as a unique effect that cannot be explained from knowledge of the effects of the other two factors.

In a two-way MANOVA design, using the earlier indicated uni-to-multivariate analogy, the same partitioning principle is upheld but with respect to the SSCP matrices:

$$SSCP_T = SSCP_A + SSCP_B + SSCP_{AB} + SSCP_W, \tag{4.38}$$

because as mentioned earlier the SSCP matrix is the multivariate generalization of the univariate sum of squares notion (see Chapter 1), where $SSCP_W$ is the earlier defined sum of within-group SSCP matrices across all cells of the design. (As in a univariate two-way ANOVA design, the number of cells equals the number of combinations of all factor levels.) Because in this multivariate case the design is not changed but only the number of DVs is, as indicated before the degrees of freedom remain equal to those in the univariate case, for each hypothesis test discussed next.

The null hypotheses that can be tested in a two-way MANOVA are as follows, noting the complete design-related analogy with univariate ANOVA tests (observe also that, in part due to the interaction effect being viewed as a unique effect that cannot be predicted from knowledge of the effects of the other two factors, interaction is commonly tested first):

H0,1: Factors A and B do not interact,
H0,2: Factor A does not have an effect (no main effect of A), and
H0,3: Factor B does not have an effect (no main effect of B).

This analogy can be carried further to the test statistics used. Specifically, employing Wilks' $\Lambda$ as a test criterion, it can be written here in its more general form as

$$\Lambda_H = \frac{|SSCP_E|}{|SSCP_E + SSCP_H|}, \tag{4.39}$$

or equivalently its reciprocal as

$$\frac{1}{\Lambda_H} = |SSCP_E^{-1}(SSCP_E + SSCP_H)| = |SSCP_E^{-1} SSCP_H + I|, \tag{4.40}$$

where H denotes each of the above effect hypotheses taken in turn, E stands for error, and I is the appropriately sized identity matrix. Thereby, the effect hypothesis SSCP is chosen as follows (Johnson & Wichern, 2002):

1. For H0,1: $SSCP_H = SSCP_{AB}$ (i.e., formally substitute "H" for "AB");
2. For H0,2: $SSCP_H = SSCP_A$ (substitute "H" for "A"); and
3. For H0,3: $SSCP_H = SSCP_B$ (substitute "H" for "B").

For all three hypotheses, $SSCP_E = SSCP_W$ (i.e., formally, substitute "E" = "W"). The SSCP matrices in Equations 4.39 or 4.40 are readily calculated using software like SPSS or SAS once the design is specified, as are the pertinent test statistics.

For higher order MANOVA designs or more complicated ones, the basic principle in multivariate hypothesis testing follows from Equation 4.39:

1. Compute the SSCP matrix corresponding to each sum of squares in the respective ANOVA design (sum of squares partitioning);
2. Select the appropriate error SSCP matrix (with fixed effects, as in this chapter, it is the $SSCP_W$ from Equation 4.38); and
3. Calculate $\Lambda_H$ from Equation 4.39.
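The last step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two SSCP matrices are hypothetical numbers for p = 2 DVs (they are not taken from any data set in this chapter), and the helper names `determinant` and `wilks_lambda` are ours rather than part of SPSS or SAS.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest absolute entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det


def wilks_lambda(sscp_e, sscp_h):
    """Equation 4.39: |SSCP_E| / |SSCP_E + SSCP_H|."""
    total = [[e + h for e, h in zip(row_e, row_h)]
             for row_e, row_h in zip(sscp_e, sscp_h)]
    return determinant(sscp_e) / determinant(total)


# Hypothetical SSCP matrices for p = 2 DVs (illustration only, not book data).
sscp_e = [[4.0, 1.0], [1.0, 3.0]]
sscp_h = [[2.0, 1.0], [1.0, 1.0]]
lam = wilks_lambda(sscp_e, sscp_h)
print(f"Wilks' Lambda: {lam:.4f}")
```

For these made-up matrices the error determinant is 11 and the determinant of the sum is 20, so the ratio is 0.55. In practice the SSCP matrices would come from the software output for the effect being tested.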




All these steps are carried out by the software used once the needed detail about the underlying design and data is provided. In cases where the effects in the design are not fixed, but random or mixed (i.e., some are fixed and others are random), the choice of the appropriate error SSCP matrix will depend on the particular type of model examined. Since SPSS and SAS use fixed effects as the default design when computing the necessary test statistics, whenever the effects being considered are not fixed the programs must be provided with this information in order to compute the test statistic correctly (e.g., see Marcoulides & Hershberger, 1997, for a table of error SSCP matrices in a two-way MANOVA then).

4.5.3 Other Test Criteria

In case of $g = 2$ groups, and regardless of the number of DVs, a single test statistic is obtained in MANOVA, as indicated before, which we can treat for this reason just as Wilks' $\Lambda$. As soon as we have more than two groups, however, there are three additional test criteria that reflect the complexity of handling multivariate questions and are discussed in this subsection. Before we proceed, we note that it is not infrequent in empirical research for all four test statistics to suggest the same substantive conclusion.

The three other test statistics that are commonly computed are (a) Hotelling's Trace, (b) Pillai's Trace, and (c) Roy's Largest Root. Because each of these tests can be reasonably approximated using an F distribution (Rao, 1952; Schatzoff, 1964), it is also common practice to assess their significance based upon that F distribution rather than looking at the exact value of each of these test statistics. When $p = 1$ (i.e., in the univariate case), as in the case of $g = 2$ groups noted above, all four test statistics provide identical F ratios as a special case.

In order to define these three test statistics, we need to introduce the notion of an eigenvalue, also sometimes referred to as a characteristic root or latent root. We spend a considerable amount of time on this concept again in Chapter 7, including some detailed numerical and graphical illustrations as well as use of software, but for the purposes of this section we briefly touch upon it here.

To this end, suppose that $x$ is a $p \times 1$ data vector of nonzero length, that is, $x \neq 0$. As mentioned in Chapter 1, each data vector (row in the data matrix) can be represented in the multivariate space by a p-dimensional point, and hence also by the vector (or one-way arrow) that extends from the origin and ends at that point. Consider now the vector $y$ defined as $y = Ax$, where $A$ is a $p \times p$ matrix (for example, an SSCP matrix for a set of variables). This definition implies that we can look at $y$ as the result of a transformation working on $x$. In general, $y$ can be in any position and have any direction in the p-dimensional space, as a result of this transformation. However, the particular case when $y$ is collinear with $x$ is of special interest. In this case, both $x$ and $y$ share a common direction, but need not have the same length.




Under such a circumstance, there exists a number that we label $\lambda$, which ensures that $y = \lambda x$. This number $\lambda$ is called an eigenvalue of $A$, while $x$ is called an eigenvector of $A$ that pertains to $\lambda$. The matrices that are of interest and pursued in this book will have as many eigenvalues as their size, denoted say $\lambda_1, \lambda_2, \ldots, \lambda_p$ from largest to smallest, that is, $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p$, whereby the case of some of them being equal is in general not excluded.

For example, if $A$ is the following correlation matrix with $p = 2$ variables,

$$A = \begin{pmatrix} 1 & 0.4 \\ 0.4 & 1 \end{pmatrix},$$

it can be readily found with software that it will have two eigenvalues, which will be equal to $\lambda_1 = 1.4$ and $\lambda_2 = 0.6$. (Specific details for using software to accomplish this aim are postponed to Chapter 7.) Similarly, it can be found that if the matrix were

$$A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$

which implies that the two variables would be unrelated, the two eigenvalues would be $\lambda_1 = 1$ and $\lambda_2 = 1$. Also, if the two variables were perfectly correlated, so that

$$A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},$$

then $\lambda_1 = 2$ and $\lambda_2 = 0$.

We discuss eigenvalues and eigenvectors further in Chapter 7, when we will also be concerned with how they are computed. For the moment, let us emphasize that the eigenvalues carry information characterizing the matrix $A$. In particular, it can be shown that the product of all eigenvalues of a symmetric matrix equals its determinant. Indeed, recalling from Chapter 2 that the determinant of

$$A = \begin{pmatrix} 1 & 0.4 \\ 0.4 & 1 \end{pmatrix}$$

equals $|A| = (1)(1) - (0.4)(0.4) = 0.84$, we readily see that it is the product of its eigenvalues $\lambda_1 = 1.4$ and $\lambda_2 = 0.6$. For this reason, if a symmetric matrix has one or more eigenvalues equal to zero, it will be singular and thus cannot be inverted (i.e., its inverse does not exist; see Chapter 2). Such a problematic case will occur when two variables are perfectly correlated and $A$ is their covariance or correlation matrix.
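The three small examples above can be reproduced without special software. The sketch below uses the closed-form solution for a symmetric 2 x 2 matrix [[a, b], [b, d]]; the function name is ours and the approach is meant only for this small case (general matrices require the numerical routines taken up in Chapter 7).

```python
import math


def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues of [[a, b], [b, d]], largest first (closed form)."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return mean + radius, mean - radius


for label, r in [("r = 0.4", 0.4), ("r = 0", 0.0), ("r = 1", 1.0)]:
    l1, l2 = eigenvalues_2x2_symmetric(1.0, r, 1.0)
    det = 1.0 * 1.0 - r * r
    # The product of the eigenvalues should reproduce the determinant.
    print(label, l1, l2, "product:", l1 * l2, "determinant:", det)
```

For a correlation matrix of two variables with correlation r, this formula reduces to eigenvalues 1 + |r| and 1 - |r|, which makes it plain why a perfect correlation forces the smaller eigenvalue, and hence the determinant, to zero.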
For instance, in the above case where the two eigenvalues of a matrix were equal to $\lambda_1 = 2$ and $\lambda_2 = 0$, it can be inferred that the variables contain redundant information (which also follows from the observation that they are perfectly correlated). Last but not least, as can be directly shown from the definition of eigenvalue and positive definiteness of a matrix, when all eigenvalues of a symmetric matrix are positive, then that matrix is positive definite. (The converse statement is also true: a positive definite matrix has only positive eigenvalues.)

With this discussion of eigenvalues, we can now move on to the additional multivariate test statistics available generally in MANOVA, which are defined as follows:

Hotelling's trace criterion. This statistic, denoted $t$, is the trace of a particular matrix product, specifically

$$t = \operatorname{trace}(SSCP_E^{-1} SSCP_H). \tag{4.41}$$




It can be shown that this definition is equivalent to stating

$$t = \text{sum of the eigenvalues of } SSCP_E^{-1} SSCP_H = \lambda_1 + \lambda_2 + \ldots + \lambda_p,$$

where the $\lambda$s now denote the eigenvalues of the matrix product $SSCP_E^{-1} SSCP_H$ (recalling Equation 4.40, where this matrix product originated). In this definition, a particular feature of the trace operator is used, namely that the trace of a square matrix equals the sum of its eigenvalues.

Roy's largest root criterion. This statistic, denoted $u$, is defined as the following nonlinear function of the first eigenvalue:

$$u = \frac{\lambda_1}{1 + \lambda_1}, \tag{4.42}$$

where $\lambda_1$ denotes the largest eigenvalue of the matrix product $SSCP_E^{-1} SSCP_H$. Equivalently, $u$ is the largest eigenvalue of the related matrix product $(SSCP_E + SSCP_H)^{-1} SSCP_H$.

Pillai-Bartlett trace criterion. This test statistic, denoted $V$, utilizes all eigenvalues rather than only the largest one, and equals the trace of the last matrix product, $(SSCP_E + SSCP_H)^{-1} SSCP_H$:

$$V = \frac{\lambda_1}{1 + \lambda_1} + \frac{\lambda_2}{1 + \lambda_2} + \ldots + \frac{\lambda_p}{1 + \lambda_p}, \tag{4.43}$$

where the $\lambda$s are the same as for Roy's criterion.

We observe that all three criteria are increasing functions of the eigenvalues involved in them. That is, large eigenvalues indicate a strong (and possibly significant) effect that is being tested with the statistics. It can also be shown that Wilks' $\Lambda$ is indirectly related to these three test criteria, since

$$\Lambda = \frac{1}{(1 + \lambda_1)(1 + \lambda_2) \cdots (1 + \lambda_p)}, \tag{4.44}$$

where the $\lambda$s are the eigenvalues participating in Equation 4.43. Notice that, contrary to the other three test statistics, $\Lambda$ is a decreasing function of the eigenvalues involved in it. That is, larger eigenvalues lead to a smaller test statistic value $\Lambda$ (i.e., possibly significant), while entailing higher alternative test statistics.

There is to date insufficient research, and lack of strong consensus in the literature, regarding the issue of which statistic is best used when. Usually, with pronounced (or alternatively only weak) effects being tested, all four would be expected to lead to the same substantive conclusions. Typically, if groups differ mainly along just one of the outcome variables, or along a single direction in the multivariate space (i.e., the group means are




collinear), Roy's criterion $u$ will likely be the most powerful test statistic. With relatively small samples, Pillai-Bartlett's $V$ statistic may be most robust to violation of the covariance matrix homogeneity assumption. Some simulation studies have found that this statistic $V$ can also be more powerful in more general situations, in particular when the group means are not collinear (Olson, 1976; Schatzoff, 1966). In all cases, Wilks' $\Lambda$ compares reasonably well with the other test statistics, however, and this is part of the reason why $\Lambda$ is so popular, in addition to resulting from the framework of the LR theory (ML estimation).

To illustrate a MANOVA using the above test criteria, consider the following example study in which $n = 240$ juniors were examined with respect to their educational aspiration. As part of the study, 155 of them were randomly assigned to an experimental group with an instructional program aimed at enhancing their motivation for further education, and the remaining 85 students were assigned to a control group. At the end of the program administration procedure, $p = 3$ educational aspiration measures were given to both groups, and in addition data on their SES were collected. The research question is whether there are any effects of the program or SES on students' aspiration. (The data are available in the file ch4ex3.dat available from www.psypress.com/applied-multivariate-analysis; in that file, the motivation measures are denoted ASPIRE.1 through ASPIRE.3, with self-explanatory names for the remaining two variables; for SES, 1 denotes low, 2 middle, and 3 high SES; and for GROUP, 0 stands for control and 1 for experimental subjects.)

To conduct a MANOVA of this two-way design using SPSS, the following menu options/sequence would be used:

Analyze → General Linear Model → Multivariate (ASPIRE.1 through ASPIRE.3 as DVs, SES and GROUP as fixed factors; Options: homogeneity test, GROUP * SES means)

With SAS, the following command file could be employed with PROC GLM (detailed descriptions of each command were given in an earlier section of this chapter):

DATA ASPIRATION;
   INFILE 'ch4ex3.dat';
   INPUT MOTIV1 MOTIV2 MOTIV3 GROUP SES;
PROC GLM;
   CLASS GROUP SES;
   /* MODEL statement restored here: it is absent from this copy of the text, but PROC GLM needs it before MEANS and MANOVA */
   MODEL MOTIV1 MOTIV2 MOTIV3 = GROUP SES GROUP*SES;
   MEANS GROUP SES;
   MANOVA H = GROUP SES GROUP*SES;
RUN;




The outputs produced by SPSS and SAS are displayed next. For ease of presentation, the outputs are organized into sections, with clarifying comments inserted at appropriate places.

SPSS output

Box's Test of Equality of Covariance Matrices(a)

Box's M   37.354
F         1.200
df1       30
df2       68197.905
Sig.      .208

Tests the null hypothesis that the observed covariance matrices of the dependent variables are equal across groups.
a. Design: Intercept+GROUP+SES+GROUP * SES

The lack of significance of Box’s M test (and of Bartlett’s test with SAS, thus not presented below) suggests that the covariance matrix homogeneity assumption for the aspiration measures could be considered plausible in this two-way study.

Multivariate Tests(c)

Effect        Statistic            Value    F           Hypothesis df   Error df   Sig.
Intercept     Pillai's Trace       .954     1603.345a   3.000           232.000    .000
              Wilks' Lambda        .046     1603.345a   3.000           232.000    .000
              Hotelling's Trace    20.733   1603.345a   3.000           232.000    .000
              Roy's Largest Root   20.733   1603.345a   3.000           232.000    .000
GROUP         Pillai's Trace       .084     7.052a      3.000           232.000    .000
              Wilks' Lambda        .916     7.052a      3.000           232.000    .000
              Hotelling's Trace    .091     7.052a      3.000           232.000    .000
              Roy's Largest Root   .091     7.052a      3.000           232.000    .000
SES           Pillai's Trace       .050     1.990       6.000           466.000    .066
              Wilks' Lambda        .951     1.986a      6.000           464.000    .066
              Hotelling's Trace    .052     1.983       6.000           462.000    .067
              Roy's Largest Root   .037     2.901b      3.000           233.000    .036
GROUP * SES   Pillai's Trace       .089     3.612       6.000           466.000    .002
              Wilks' Lambda        .912     3.646a      6.000           464.000    .002
              Hotelling's Trace    .096     3.680       6.000           462.000    .001
              Roy's Largest Root   .084     6.559b      3.000           233.000    .000

a. Exact statistic
b. The statistic is an upper bound on F that yields a lower bound on the significance level.
c. Design: Intercept+GROUP+SES+GROUP * SES




SAS output

The GLM Procedure
Multivariate Analysis of Variance

MANOVA Test Criteria and Exact F Statistics for the Hypothesis of No Overall GROUP Effect
H = Type III SSCP Matrix for GROUP
E = Error SSCP Matrix
S=1  M=0.5  N=115

Statistic                 Value         F Value   Num DF   Den DF   Pr > F
Wilks' Lambda             0.91642568    7.05      3        232      0.0001
Pillai's Trace            0.08357432    7.05      3        232      0.0001
Hotelling-Lawley Trace    0.09119596    7.05      3        232      0.0001
Roy's Greatest Root       0.09119596    7.05      3        232      0.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.

MANOVA Test Criteria and F Approximations for the Hypothesis of No Overall SES Effect
H = Type III SSCP Matrix for SES
E = Error SSCP Matrix
S=2  M=0  N=115

Statistic                 Value         F Value   Num DF   Den DF   Pr > F
Wilks' Lambda             0.95054327    1.99      6        464      0.0662
Pillai's Trace            0.04995901    1.99      6        466      0.0657
Hotelling-Lawley Trace    0.05150154    1.99      6        307.56   0.0672
Roy's Greatest Root       0.03735603    2.90      3        233      0.0357

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.

MANOVA Test Criteria and F Approximations for the Hypothesis of No Overall GROUP*SES Effect
H = Type III SSCP Matrix for GROUP*SES
E = Error SSCP Matrix
S=2  M=0  N=115

Statistic                 Value         F Value   Num DF   Den DF   Pr > F
Wilks' Lambda             0.91197238    3.65      6        464      0.0015
Pillai's Trace            0.08888483    3.61      6        466      0.0016
Hotelling-Lawley Trace    0.09558449    3.69      6        307.56   0.0015
Roy's Greatest Root       0.08445494    6.56      3        233      0.0003

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.
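The relations among the four criteria in Equations 4.41 through 4.44 can be checked against the GROUP part of this output. The sketch below is an illustration only: the function name is ours, and it assumes that the single nonzero eigenvalue of the GROUP effect (S = 1) equals the reported Hotelling-Lawley trace, 0.09119596. Judging by the numbers, the programs appear to list the largest eigenvalue itself in the Roy's root row, rather than the ratio in Equation 4.42.

```python
import math


def manova_criteria(eigenvalues):
    """Four MANOVA criteria from the eigenvalues of SSCP_E^{-1} SSCP_H."""
    lams = sorted(eigenvalues, reverse=True)
    return {
        "hotelling_trace": sum(lams),                      # Equation 4.41
        "roy_ratio": lams[0] / (1.0 + lams[0]),            # Equation 4.42
        "pillai_trace": sum(l / (1.0 + l) for l in lams),  # Equation 4.43
        "wilks_lambda": 1.0 / math.prod(1.0 + l for l in lams),  # Equation 4.44
    }


# Assumed single eigenvalue for the GROUP effect, read off the SAS output.
criteria = manova_criteria([0.09119596])
for name, value in criteria.items():
    print(f"{name}: {value:.8f}")
```

With one eigenvalue, Pillai's trace and Wilks' Lambda must add up to 1, which is what the reported values 0.08357432 and 0.91642568 do.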




As in most applications of ANOVA, when examining the output provided by either SPSS or SAS, we usually look first at the interaction term. All four test statistics indicate that it is significant, so we can interpret this finding as evidence that warrants rejection of the null hypothesis of no interaction between GROUP and SES. Hence, the effect of the instructional program on students' aspiration is related to their SES. Since our main research concern here is with whether (and, possibly, when) the program has an effect, we move on to carrying out so-called "simple effect" tests, as we would in a univariate setup in the face of a finding of factorial interaction. To this end, we select the cases from each of the three SES groups in turn, and test within them whether the program has an effect on the aspiration measures. We note that this latter analysis is equivalent to conducting a one-way MANOVA within each of these three groups (in fact, testing then for multivariate mean differences across two independent samples per SES group), and thus can be readily carried out with the corresponding methods discussed earlier in this chapter. In particular, for the low SES group (with value of 1 on this variable), the so-obtained results are as follows (presented only using SPSS, to save space, since the SAS output would lead to identical conclusions). These results indicate the presence of a significant program effect for low SES students, as far as the multivariate mean differences are concerned. In order to look at them more closely, we decide to examine the mean group differences for each aspiration measure separately, as provided by the univariate analysis results next.

SPSS output

Multivariate Tests(b)

Effect      Statistic            Value    F          Hypothesis df   Error df   Sig.
Intercept   Pillai's Trace       .953     517.569a   3.000           76.000     .000
            Wilks' Lambda        .047     517.569a   3.000           76.000     .000
            Hotelling's Trace    20.430   517.569a   3.000           76.000     .000
            Roy's Largest Root   20.430   517.569a   3.000           76.000     .000
GROUP       Pillai's Trace       .189     5.898a     3.000           76.000     .001
            Wilks' Lambda        .811     5.898a     3.000           76.000     .001
            Hotelling's Trace    .233     5.898a     3.000           76.000     .001
            Roy's Largest Root   .233     5.898a     3.000           76.000     .001

a. Exact statistic
b. Design: Intercept+GROUP




Tests of Between-Subjects Effects

Source            Dependent Variable   Type III Sum of Squares   df   Mean Square   F          Sig.
Corrected Model   ASPIRE.1             3888.114                  1    3888.114      11.559     .001
                  ASPIRE.2             4028.128                  1    4028.128      17.940     .000
                  ASPIRE.3             4681.383                  1    4681.383      12.010     .001
Intercept         ASPIRE.1             167353.700                1    167353.700    497.548    .000
                  ASPIRE.2             334775.484                1    334775.484    1490.945   .000
                  ASPIRE.3             196890.228                1    196890.228    505.098    .000
GROUP             ASPIRE.1             3888.114                  1    3888.114      11.559     .001
                  ASPIRE.2             4028.128                  1    4028.128      17.940     .000
                  ASPIRE.3             4681.383                  1    4681.383      12.010     .001
Error             ASPIRE.1             26235.845                 78   336.357
                  ASPIRE.2             17514.046                 78   224.539
                  ASPIRE.3             30404.863                 78   389.806
Total             ASPIRE.1             215241.332                80
                  ASPIRE.2             385735.384                80
                  ASPIRE.3             253025.216                80
Corrected Total   ASPIRE.1             30123.959                 79
                  ASPIRE.2             21542.174                 79
                  ASPIRE.3             35086.246                 79


Accordingly, each of the three aspiration measures shows statistically significant experimental versus control group differences. This finding, plus an inspection of the group means next, suggests that for low SES students the instructional program had an effect on each aspiration measure. For more specific details regarding these effects, we take a close look at the group means (typically referred to as marginal means). Alternatively, we can also examine a graph of these means to aid in visualizing the group differences that are present at each level of the SES factor.

GROUP
                                                     95% Confidence Interval
Dependent Variable   GROUP   Mean     Std. Error   Lower Bound   Upper Bound
ASPIRE.1             .00     39.565   3.242        33.111        46.020
                     1.00    53.796   2.647        48.526        59.066
ASPIRE.2             .00     58.781   2.649        53.507        64.055
                     1.00    73.265   2.163        68.959        77.571
ASPIRE.3             .00     42.825   3.490        35.877        49.774
                     1.00    58.440   2.850        52.767        64.114




From this table it is seen that each of the measures is on average higher in the experimental group, suggesting that the program may have led to some enhancement of aspiration for low SES students. Next, in order to examine the results for students with middle SES, we repeat the same analyses after selecting their group. The following results are then obtained:

Multivariate Tests(b)

Effect      Statistic            Value    F          Hypothesis df   Error df   Sig.
Intercept   Pillai's Trace       .966     715.311a   3.000           76.000     .000
            Wilks' Lambda        .034     715.311a   3.000           76.000     .000
            Hotelling's Trace    28.236   715.311a   3.000           76.000     .000
            Roy's Largest Root   28.236   715.311a   3.000           76.000     .000
GROUP       Pillai's Trace       .128     3.715a     3.000           76.000     .015
            Wilks' Lambda        .872     3.715a     3.000           76.000     .015
            Hotelling's Trace    .147     3.715a     3.000           76.000     .015
            Roy's Largest Root   .147     3.715a     3.000           76.000     .015

a. Exact statistic
b. Design: Intercept+GROUP

Considered together, the aspiration measures differ on average across the experimental and control groups also in the middle SES group, as evinced by the significant p-value of .015 for the group effect in the last table. To examine these differences more closely, we decide again to look at the univariate ANOVA results. None of the three aspiration measures considered separately from the other two appears to have been affected notably by the program for middle SES students. It would seem that the multivariate effect here has capitalized on the relatively limited mean differences on each of the aspiration measures as well as on their interdependency. This phenomenon is consistent with the fact that the practical importance measure of the multivariate effect is quite negligible (as $\eta^2 = 1 - \Lambda = 1 - .872 = .128$). This is also noticed by looking at the cell means below and seeing small differences across groups that are in opposite directions, keeping in mind that they are actually not significant per measure. (Note also the substantial overlap of the confidence intervals for the means of the same measure across groups.)




Tests of Between-Subjects Effects

Source            Dependent Variable   Type III Sum of Squares   df   Mean Square   F          Sig.
Corrected Model   ASPIRE.1             88.228                    1    88.228        .312       .578
                  ASPIRE.2             134.494                   1    134.494       .818       .369
                  ASPIRE.3             743.237                   1    743.237       2.437      .123
Intercept         ASPIRE.1             148239.017                1    148239.017    524.093    .000
                  ASPIRE.2             332086.264                1    332086.264    2018.959   .000
                  ASPIRE.3             185905.713                1    185905.713    609.600    .000
GROUP             ASPIRE.1             88.228                    1    88.228        .312       .578
                  ASPIRE.2             134.494                   1    134.494       .818       .369
                  ASPIRE.3             743.237                   1    743.237       2.437      .123
Error             ASPIRE.1             22062.200                 78   282.849
                  ASPIRE.2             12829.743                 78   164.484
                  ASPIRE.3             23787.163                 78   304.964
Total             ASPIRE.1             185278.464                80
                  ASPIRE.2             389142.549                80
                  ASPIRE.3             223936.342                80
Corrected Total   ASPIRE.1             22150.427                 79
                  ASPIRE.2             12964.237                 79
                  ASPIRE.3             24530.400                 79

GROUP
                                                     95% Confidence Interval
Dependent Variable   GROUP   Mean     Std. Error   Lower Bound   Upper Bound
ASPIRE.1             .00     46.628   3.237        40.184        53.071
                     1.00    44.407   2.310        39.808        49.006
ASPIRE.2             .00     66.756   2.468        61.842        71.670
                     1.00    69.498   1.762        65.991        73.005
ASPIRE.3             .00     54.196   3.361        47.505        60.887
                     1.00    47.750   2.399        42.975        52.526

Finally, when examining the results for the high SES students, we observe the following findings:



Multivariate Tests(b)

Effect      Statistic            Value    F          Hypothesis df   Error df   Sig.
Intercept   Pillai's Trace       .945     436.512a   3.000           76.000     .000
            Wilks' Lambda        .055     436.512a   3.000           76.000     .000
            Hotelling's Trace    17.231   436.512a   3.000           76.000     .000
            Roy's Largest Root   17.231   436.512a   3.000           76.000     .000
GROUP       Pillai's Trace       .138     4.040a     3.000           76.000     .010
            Wilks' Lambda        .862     4.040a     3.000           76.000     .010
            Hotelling's Trace    .159     4.040a     3.000           76.000     .010
            Roy's Largest Root   .159     4.040a     3.000           76.000     .010

a. Exact statistic
b. Design: Intercept+GROUP

As can be seen also here, there is an indication of a program effect when the aspiration measures are analyzed simultaneously and their interrelationship is taken into account. In contrast to students of middle SES, however, there are notable program effects also when each measure is considered on its own, as we see next.

Tests of Between-Subjects Effects

Source            Dependent Variable   Type III Sum of Squares   df   Mean Square   F          Sig.
Corrected Model   ASPIRE.1             1810.066                  1    1810.066      5.510      .021
                  ASPIRE.2             1849.893                  1    1849.893      7.020      .010
                  ASPIRE.3             4048.195                  1    4048.195      9.917      .002
Intercept         ASPIRE.1             134832.701                1    134832.701    410.442    .000
                  ASPIRE.2             306088.252                1    306088.252    1161.537   .000
                  ASPIRE.3             145442.177                1    145442.177    356.303    .000
GROUP             ASPIRE.1             1810.066                  1    1810.066      5.510      .021
                  ASPIRE.2             1849.893                  1    1849.893      7.020      .010
                  ASPIRE.3             4048.195                  1    4048.195      9.917      .002
Error             ASPIRE.1             25623.504                 78   328.506
                  ASPIRE.2             20554.561                 78   263.520
                  ASPIRE.3             31839.425                 78   408.198
Total             ASPIRE.1             193803.985                80
                  ASPIRE.2             390463.464                80
                  ASPIRE.3             221555.322                80
Corrected Total   ASPIRE.1             27433.570                 79
                  ASPIRE.2             22404.453                 79
                  ASPIRE.3             35887.620                 79





To explore these group differences more closely, we take a look at the corresponding means.

GROUP
                                                     95% Confidence Interval
Dependent Variable   GROUP   Mean     Std. Error   Lower Bound   Upper Bound
ASPIRE.1             .00     38.748   3.555        31.671        45.824
                     1.00    48.904   2.466        43.993        53.814
ASPIRE.2             .00     60.899   3.184        54.561        67.237
                     1.00    71.165   2.209        66.768        75.563
ASPIRE.3             .00     37.923   3.962        30.035        45.812
                     1.00    53.111   2.749        47.638        58.585

As is evident from the last table, the experimental group has a higher mean than the control group on each of the aspiration measures, indicating enhanced levels of aspiration after program administration for high SES students as well. We conclude our analyses for this empirical example by stating that there is evidence suggesting a program effect for students with low and students with high SES, for whom effects are found on each aspiration measure as well as overall. For middle SES students, although there is some indication of overall instructed versus uninstructed group mean differences, they are relatively small and inconsistent in direction when particular aspiration measures are considered. In this SES group, there is no evidence of uniformly enhanced aspiration levels across measures that would be due to the program administered.

4.6 MANOVA Follow-Up Analyses

When no significant differences are observed in a MANOVA across $g \geq 2$ studied groups, one usually does not follow it up with univariate analyses. In particular, if the initial research question is multivariate in nature, no subsequent univariate analyses would be necessary or appropriate. However, it is possible (and in empirical research perhaps not infrequently the case) that there is not sufficient information available to begin with that would allow unambiguous formulation of the substantive research question as a multivariate one. Under such circumstances, it may be helpful to examine which DVs show significant group differences and which do not. This latter question may also be of interest when a MANOVA group difference is found to be significant.




The most straightforward approach to the last query would be to examine the univariate test results. Typically, if the latter were not of concern prior to a decision to carry out a MANOVA, they should better be conducted at a lower significance level. To be precise, if one is dealing with p DVs (p 1), then examining each one of them at a significance level of a=p would be recommendable. That is, any of these variables would be proclaimed as significantly differing across groups only if the pertinent p-value is less than a=p. In other words, if a ¼ .05 was the initial choice for a significance level, the tests in question would be carried out each at a significance level a0 ¼ .05=p. For the fixed effects of concern so far in the book, these tests are readily conducted using the output obtained with SPSS, by inspecting the pertinent entries in the output section entitled ‘‘Tests of Between-Subject Effects’’ and correspondingly checking if the p-value for an effect of interest with regard to a given DV is smaller than a0 . In the corresponding SAS output, one would simply look at the univariate ANOVA results provided separately for each DV and check their associated p-value accordingly. The general significance level adjustment procedure underlying this approach is often referred to as Bonferroni correction. It turns out to be somewhat conservative, but if one finds univariate differences with it one can definitely have confidence in that statistical result. The correction is frequently recommended, especially with a relatively small number of DVs, p (up to say half a dozen), instead of counterparts of what has become known as simultaneous multiple testing procedures (Hays, 1994). The latter turn out to yield too wide confidence intervals for mean differences on individual DVs, and for this reason they are not generally recommendable unless p is quite large (such as say p > 10; Johnson & Wichern, 2002). 
These procedures capitalize on the interrelationship between the DVs and aim at securing a prespecified significance level for all possible comparisons between means. This is the reason why their resulting confidence intervals can be excessively wide. These procedures may be viewed as multivariate analogs of the so-called Scheffé multiple-comparison procedure in the context of univariate ANOVA, which is also known to be rather conservative but is well suited for “data-snooping” or a posteriori mean comparisons. As an alternative to Bonferroni’s correction, one may wish to carry out conventional unidimensional F tests (i.e., t tests in cases of $g = 2$ groups), which are the univariate ANOVA tests for group differences on each DV considered separately, at the conventional $\alpha$ level (say .05). By opting for such a strategy only in the case of a significant MANOVA group difference, this approach would be nearly optimal in terms of controlling the overall Type I error and not being overly conservative (Rencher, 1998). Regardless of which of these follow-up analysis approaches is chosen, however, one should keep in mind that their results are not independent of one another across response variables, since the latter are typically




interrelated. Another important alternative that can be used to follow up significant MANOVA tests is discriminant function analysis, which is the subject of Chapter 10.
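As an informal illustration of the Bonferroni-type adjustment described in this section, the following Python sketch compares each univariate p-value with the adjusted level $\alpha' = \alpha/p$. The function name `bonferroni_flags` and the four p-values are hypothetical and serve only for demonstration; they do not stem from any data set used in this book.

```python
def bonferroni_flags(p_values, alpha=0.05):
    """Return the per-test level alpha/p and a significance flag for each DV."""
    p = len(p_values)
    if p < 1:
        raise ValueError("at least one p-value is required")
    alpha_adj = alpha / p  # adjusted significance level alpha' = alpha / p
    flags = [pv < alpha_adj for pv in p_values]
    return alpha_adj, flags


# Hypothetical univariate p-values for p = 4 DVs (invented for illustration).
univariate_p = [0.004, 0.030, 0.012, 0.200]
alpha_adj, flags = bonferroni_flags(univariate_p, alpha=0.05)

print(alpha_adj)
for dv, (pv, significant) in enumerate(zip(univariate_p, flags), start=1):
    print(f"DV {dv}: p = {pv:.3f}, significant at alpha/p: {significant}")
```

With these invented values, a DV with p = .030 would be declared significant at the conventional .05 level but not at the adjusted level, which is the sense in which the correction is conservative.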

4.7 Limitations and Assumptions of MANOVA

MANOVA is a fairly popular procedure among social, behavioral, and educational scientists. It is therefore important when applying it to be aware of its assumptions and limitations. First of all, one cannot base any causality conclusions only on the results of statistical tests carried out within the MANOVA framework. That is, a finding of a significant group mean difference does not justify a statement that an IV in a study under consideration has produced the differences observed in the DVs. Second, in order for the underlying parameter estimation procedure to be computationally feasible, each cell of the design should have more cases than DVs. If this is not the case, some levels of pertinent factors need to be combined to ensure fulfillment of this condition (see also Chapter 7 for an alternative approach). Third, excessively highly correlated DVs present a problem akin to multicollinearity in regression analysis. Thus, when such redundant variables are initially planned as outcome variables in a study, one should either drop one or more of them or instead use a linear combination of an appropriate subset of them that has clear substantive meaning. Furthermore, as mentioned earlier, throughout this chapter we have assumed that the MVN assumption holds. Although the methods discussed have some degree of robustness against violations of this assumption, it is recommended that consideration be given to use of appropriate variable transformations prior to analysis, or of alternative methods (such as path analysis or structural equation modeling; Raykov & Marcoulides, 2006) where corrections for some nonnormality can be carried out on respective fit statistics and parameter standard errors.
It is also noted that, depending on outcome variable relationships, a multivariate test may turn out to be nonsignificant while there are considerable or even significant group differences on individual response measures, a circumstance sometimes referred to as “washing out” of univariate differences. Hence, it is of special importance that a researcher conceptualizes carefully, and before the analysis is commenced, whether they need to use a multivariate test or one or more univariate ones. Another important MANOVA assumption that we mentioned repeatedly in this chapter is that of covariance matrix homogeneity. While it can be tested using Box’s M test or Bartlett’s test, these tests are notorious for being quite sensitive to nonnormality. In such cases, our note in the preceding paragraph applies; that is, consideration is better given to appropriate variable transformations before analysis is begun. Alternatively, covariance matrix homogeneity can be tested using structural equation modeling (Raykov, 2001). The latter approach, when used with corresponding corrections of fit statistics and standard errors, provides another means for these purposes in cases of up to mild nonnormality (and absence of piling of cases at an end of any measurement scale used, e.g., with absence of ceiling and floor effects).


5 Repeated Measure Analysis of Variance

In this chapter, we will be concerned with the analysis of data obtained from studies with more than two related groups. We first introduce the notion of within-subject design (WSD) that underlies investigations where repeated assessments are carried out on a given sample(s) of subjects, and parallel it with that of between-subject design (BSD). Subsequently, we discuss the necessity to have special methods for analyzing repeated observations collected in such studies. Then we deal with the univariate approach to repeated measures analysis (RMA), which is developed within the general framework of analysis of variance (ANOVA). Following that discussion, we will be concerned with the multivariate approach to RMA, which is presented within the context of multivariate analysis of variance (MANOVA). The settings where the methods discussed in this chapter will be applicable are those where several dependent variables result from multiple assessments of a single outcome measure across several occasions, regardless of whether there are any independent variables; when there are, typically a grouping factor will be present (e.g., control vs. experimental group). (Studies with more than one repeatedly administered measure are considered in Chapter 13.) Statistical approaches of relevance for these settings are also sometimes called profile analysis methods, and empirical investigations with such designs are frequently referred to as longitudinal or panel studies in the social and behavioral sciences. Repeated measure designs (RMDs) have some definite advantages relative to cross-sectional designs (CSDs) that have been widely used for a large part of the past century.
In particular, CSDs where say a number of different age groups are measured at one point in time on an outcome or a set of outcomes have as a main limitation the fact that they are not well suited for studying developmental processes that are of special relevance in these disciplines. The reason is that they confound the effects of time and age (cohort). As a viable alternative, RMDs have become popular since the late 1960s and early 1970s. To give an example of such designs, suppose one were interested in studying intellectual development of high school students in grades 10 through 12. A longitudinal design utilizing repeated assessments will then call for measuring a given sample of students say once a year: in grade 10, then in grade 11, and finally in grade 12.



Alternatively, a cross-sectional design (CSD) would proceed by assessing at only one time point a sample from each of the 10th, 11th, and 12th grades. As it turns out, there are numerous threats to the validity of a CSD that limit its utility for studying processes of temporal development. Central amongst these threats is the potential bias that results from the above mentioned confounding. A thorough discussion of the validity threats is beyond the scope of this text, and we refer the reader to some excellent available sources (Campbell & Stanley, 1963; Nesselroade & Baltes, 1979).

5.1 Between-Subject and Within-Subject Factors and Designs

In a repeated measure study, it is particularly important to distinguish between two types of factors: (a) between-subject factors (BSFs; if any), and (b) within-subject factors (WSFs). As indicated in Chapter 4, a BSF is a factor whose levels are represented by independent subjects that thus provide unrelated measurements (observations). For example, the distinction between an experimental and control group represents a BSF since each person can be (usually) either in an experimental or a control group; for this reason, each subject gives rise only to one score on this factor (say “0” if belonging to the control group and “1” if from the experimental group). Similarly, gender (male vs. female) is another BSF since persons come to a study in question with only a single value on gender. Further examples of BSFs would include factors like ethnicity, political party affiliation, or religious affiliation (at least at a given point in time), which may have more than a pair of levels. Hence, in a design with a BSF, the measurements collected from studied subjects are independent from one another across that factor’s levels. By way of contrast, a within-subject factor (WSF) has levels that represent repeated measurements conducted on the same persons or in combination with others related to them (e.g., in studies with matched samples). In a typical longitudinal design, any given assessment occasion is a level of a WSF (frequently referred to as “time factor”). In a design with a WSF, measurements on this factor are not independent of one another across its levels. For example, consider a cognitive development study of boys and girls in grades 7 through 9, in which mental ability is assessed by a test once in each of these three grades on a given sample of students.
Since these are repeated measurements of the same subjects, the assessment (or time) factor is a WSF with three levels, while the gender factor is a BSF with two levels. For convenience, designs having only BSFs are commonly called between-subject designs (BSDs), while those with only WSFs are referred to as within-subject designs (WSDs). In the social and behavioral sciences,



designs that include both BSF(s) and WSF(s) are rather frequently used and are generally called mixed designs. In fact, it may even be fair to say that mixed designs of this kind are at present perhaps typical for repeated measure studies in these disciplines. So far in this text, we have mostly been concerned with designs that include BSFs. For example, one may recall that the designs considered in Chapter 4 were predominantly based on BSFs. Specifically, the tests for mean differences across independent groups, and the entire discussion of MANOVA in that chapter, utilized BSDs. However, we also used a WSD there without explicitly referring to it in this way. Indeed, when testing for mean differences with related samples (or two repeated assessments), a WSF with two levels was involved. Given that we have already devoted considerable attention to BSDs and procedures for analysis of data resulting from such studies, it may be intriguing to know why it would be important to have special methods for handling WSDs. To motivate our interest in such methods, let us consider the following example that utilizes a WSD. Assume we are interested in motivation, which is considered by many researchers a critically important variable for academic learning and achievement across childhood and through adolescence. To highlight better the idea behind the necessity to use special methods in RMDs, we use only a small portion of the data resulting from such a study, which is presented in Table 5.1 (cf. Howell, 2002). Two observations can be readily made when looking at the data in Table 5.1. First, we note that although the average score (across students) is not impressively changing over time, there is in fact some systematic growth within each subject of about the same magnitude as the overall mean change. Second, the variability across the three assessment means is considerably smaller than subject score variability within each measurement occasion.
Recalling that what contributes toward a significant main effect in a univariate ANOVA is the relationship of variability in the group means to that within groups (Hays, 1994), these two observations suggest that if one were to analyze data from a WSD, like the presently considered one, using methods appropriate for between-subject designs, one is likely to face a nonsignificant time effect finding even if all subjects unambiguously demonstrate temporal change in the same direction (e.g., growth or decline). The reason for such an incorrect conclusion would be the fact that unlike BSFs, as mentioned, measurements across levels of WSFs are not independent. This is because repeatedly assessed subjects give rise to correlated measurements across the levels of any WSF (a relationship often referred to as serial correlation). This is clearly contrary to the case with a BSF. Thus, an application of a BSD method to repeated measure data like that in Table 5.1 effectively disregards this correlation (that is, wastes empirical information stemming from the interrelationships among repeated assessments for the same subjects) and therefore is likely to lead to an incorrect conclusion. Hence, different methods from those applicable with BSDs are needed for studies having WSFs. These different methods should utilize that empirical information, i.e., take into account the resulting serial correlation (in a sense “partial it out”) before testing hypotheses of interest. Univariate and multivariate approaches to repeated measure analysis provide such methods for dealing with studies containing within-subject factors, and we turn to them next.

TABLE 5.1
Scores From Four Students in a Motivation Study Across Three Measurement Occasions (Denoted by Time_1 Through Time_3)

Student                  Time_1    Time_2    Time_3
1                          3         4         5
2                         16        17        19
3                          4         5         7
4                         22        24        25
Average total score:      11.25     12.50     14
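The argument of this section can be checked numerically on the Table 5.1 scores. The following Python sketch, which is only an illustration and not part of the SPSS or SAS analyses used in this book, treats the three occasions as if they were independent groups and computes the resulting one-way between-subject F ratio; it also lists each student's change from the first to the last occasion. All variable names are our own.

```python
# Table 5.1 scores: one row per student, one column per occasion.
scores = [
    [3, 4, 5],
    [16, 17, 19],
    [4, 5, 7],
    [22, 24, 25],
]
n = len(scores)      # number of students
m = len(scores[0])   # number of occasions

# Treat the occasions as if they were m independent groups of n scores each.
groups = [[row[j] for row in scores] for j in range(m)]
group_means = [sum(g) / n for g in groups]
grand_mean = sum(sum(g) for g in groups) / (n * m)

ss_between = sum(n * (gm - grand_mean) ** 2 for gm in group_means)
ss_within = sum((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g)

# One-way between-subject F ratio with (m - 1) and m(n - 1) degrees of freedom.
f_naive = (ss_between / (m - 1)) / (ss_within / (m * (n - 1)))

# Change from the first to the last occasion for each student.
changes = [row[-1] - row[0] for row in scores]

print("occasion means:", group_means)
print("naive between-subject F:", round(f_naive, 3))
print("within-student changes:", changes)
```

The F ratio obtained in this way is far below 1, although every student's score increases over time; the large differences between students end up in the within-group term and mask the consistent change.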

5.2 Univariate Approach to Repeated Measure Analysis

This approach, also sometimes referred to as the ANOVA method of RMA, can be viewed as following the classical ANOVA procedure of variance partitioning. For example, suppose that a sample of subjects from a studied population is repeatedly observed on $m$ occasions ($m \geq 2$). The method is then based on the following decomposition (cf. Howell, 2002):

$$SS_T = SS_{BS} + SS_{WS} = SS_{BS} + SS_{BO} + SS_E, \tag{5.1}$$

where the subscript notation used is as follows: T for total, BS for between-subjects, WS for within-subjects, BO for between-occasions, and E for error. According to Equation 5.1, if two observations were randomly picked from the set of all observations for all subjects at all measurement points, then their dependent variable values could differ from one another as a result of these values belonging to two different persons and/or stemming from two distinct occasions, or being due to unexplained reasons. The model underlying the decomposition in Equation 5.1 can be written as

$$X_{ij} = \mu + \pi_i + \tau_j + e_{ij}, \tag{5.2}$$




where the score $X_{ij}$ of the $i$th subject at the $j$th assessment is decomposed into an overall mean ($\mu$), subject effect ($\pi_i$), assessment occasion effect ($\tau_j$), and error ($e_{ij}$) ($i = 1, \ldots, n$, with $n$ denoting as usual sample size; $j = 1, \ldots, m$). As seen from Equation 5.2, no interaction between subject and occasion is assumed in the model. This implies that any possible interaction between subject and occasion is relegated to the error term. In other words, in this model we assume that there is no occasion effect that is differential by subject. Such a model is said to be additive. In case this assumption is invalid, an interaction term between subject and occasion would need to be included in the model that will then be referred to as nonadditive (for an excellent discussion of nonadditive models, see Kirk, 1994). Although a number of tests have been developed for examining whether a model is additive or nonadditive, for the goals of our discussion in this chapter we will assume an additive model. The test for change over time, which is of particular interest here, is based on the following F ratio that compares the between-occasion mean sum of squares to that associated with error:

$$F = \frac{\text{Mean}(SS_{BO})}{\text{Mean}(SS_E)} = \frac{SS_{BO}/(m-1)}{SS_E/[(n-1)(m-1)]} \sim F_{m-1,\,(n-1)(m-1)}. \tag{5.3}$$
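To make the decomposition in Equation 5.1 and the F ratio in Equation 5.3 concrete, the following Python sketch applies them to the four students of Table 5.1 ($n = 4$, $m = 3$). It is meant only as a hand-calculation aid under the additive model of Equation 5.2; the variable names are our own, and the error sum of squares is obtained by subtraction as implied by Equation 5.1.

```python
# Table 5.1 scores: one row per student (subject), one column per occasion.
scores = [
    [3, 4, 5],
    [16, 17, 19],
    [4, 5, 7],
    [22, 24, 25],
]
n = len(scores)      # sample size
m = len(scores[0])   # number of occasions

grand_mean = sum(sum(row) for row in scores) / (n * m)
subject_means = [sum(row) / m for row in scores]
occasion_means = [sum(row[j] for row in scores) / n for j in range(m)]

# Sums of squares in Equation 5.1.
ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
ss_bs = m * sum((sm - grand_mean) ** 2 for sm in subject_means)   # between subjects
ss_bo = n * sum((om - grand_mean) ** 2 for om in occasion_means)  # between occasions
ss_e = ss_total - ss_bs - ss_bo                                   # error

# F ratio in Equation 5.3 with its two degrees of freedom.
df1 = m - 1
df2 = (n - 1) * (m - 1)
f_ratio = (ss_bo / df1) / (ss_e / df2)

print(f"SS_T = {ss_total:.4f}, SS_BS = {ss_bs:.4f}, "
      f"SS_BO = {ss_bo:.4f}, SS_E = {ss_e:.4f}")
print(f"F({df1}, {df2}) = {f_ratio:.2f}")
```

Most of the total variability is absorbed by the between-subject term, so the error term against which the occasion effect is judged becomes very small for these data.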

The validity of this test, that is, the distribution of its F ratio stated in Equation 5.3, is based on a special assumption called sphericity (Maxwell & Delaney, 2004). This condition is fulfilled when the covariance matrix of the repeated assessments, denoted by $C$, has the same quantity along its main diagonal (say $\sigma^2$), and off its diagonal an identical quantity (say $\theta$) appears. That is, sphericity holds for that matrix when

$$C = \begin{bmatrix} \sigma^2 & & & & \\ \theta & \sigma^2 & & & \\ \theta & \theta & \sigma^2 & & \\ \vdots & \vdots & \vdots & \ddots & \\ \theta & \theta & \theta & \cdots & \sigma^2 \end{bmatrix}, \tag{5.4}$$

for two generally distinct numbers $\sigma^2$ and $\theta$ (such that $C$ is positive definite; see Chapter 2). The sphericity condition, in the form of Equation 5.4, implies that any two repeated assessments correlate to the same extent (namely $\theta/\sigma^2$), as can be readily seen by obtaining the correlation of two occasions from the right-hand side of Equation 5.4. For example, if the following covariance matrix involving three repeated assessments were to hold in a studied population, the assumption of sphericity would be fulfilled:


$$C = \begin{bmatrix} 25 & & \\ 10 & 25 & \\ 10 & 10 & 25 \end{bmatrix}.$$
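The properties of this example matrix can be verified directly. The short Python sketch below, offered only as an illustration with variable names of our own, computes the correlation of every pair of occasions and the variance of every pairwise difference score, using the standard identity Var(X_i - X_j) = Var(X_i) + Var(X_j) - 2 Cov(X_i, X_j).

```python
import math
from itertools import combinations

# Example covariance matrix of three repeated assessments (full symmetric form).
C = [
    [25, 10, 10],
    [10, 25, 10],
    [10, 10, 25],
]
m = len(C)
pairs = list(combinations(range(m), 2))

# Correlation of each pair of occasions: covariance / (sd_i * sd_j).
correlations = [C[i][j] / math.sqrt(C[i][i] * C[j][j]) for i, j in pairs]

# Variance of each pairwise difference score.
diff_variances = [C[i][i] + C[j][j] - 2 * C[i][j] for i, j in pairs]

print("pairwise correlations:", correlations)
print("variances of differences:", diff_variances)
```

All pairs of occasions show the same correlation and the same difference-score variance, which is what the pattern in Equation 5.4 entails.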

This assumption is, however, not very likely to be valid in most longitudinal behavioral and social studies, since repeated measurements in these disciplines tend to correlate more highly when taken closely in time than when they are further apart. When the sphericity assumption is tenable, however, the univariate approach provides a powerful method for RMA, even with relatively small samples. We will return to this issue in more detail in a later section in the chapter. We note in passing that the sphericity condition could be viewed, loosely speaking, as an “extension” to the RMD context of the independent-group ANOVA assumption of variance homogeneity, which states that the variances of all groups on a single dependent variable are the same. (Recall that the latter assumption underlies conventional applications of the t test as well, in cases with $g = 2$ groups.) Specifically, the variance homogeneity assumption might be seen as a special case of the sphericity assumption (cf. Howell, 2002) because when dealing with independent groups the corresponding formal covariances are all 0, and thus the value $\theta$ (see Equation 5.4) equals 0 in a respective counterpart of the above matrix (while all its main diagonal entries are the same, when variance homogeneity holds across groups). That is, as an aside, as soon as variance homogeneity is fulfilled in a multiple-population setting with a single outcome variable, the sphericity assumption is also fulfilled in the above sense. The much more important concern with sphericity is, however, in the repeated measure context of interest in this chapter, where variance homogeneity alone is obviously not sufficient for sphericity because the latter also requires covariance coefficient homogeneity (see Equation 5.4). Since the univariate approach to RMA hinges on the sphericity assumption being tenable, it is important to note that it represents a testable condition.
That is, in an empirical repeated measure setting, one can use special statistical tests to ascertain whether sphericity is plausible. A widely available test for this purpose is the so-called Mauchly’s sphericity test, whose result is automatically provided in most statistical software output (see discussion of SAS and SPSS RMA procedures below). An alternative test is also readily obtained within the latent variable modeling framework (Raykov, 2001). When the sphericity assumption is violated, however, the F test in Equation 5.3 is liberal. That is, it tends to reject more often than it should the null hypothesis when the latter is true; in other words, its Type I error rate is higher than the nominal level (usually set at .05). For this reason, when the F test in Equation 5.3 is not significant, one might have considerable confidence in its result and a suggestion based on it not to reject the pertinent null hypothesis.




When the sphericity condition is not plausible for a given data set, one possibility is to employ a different analytic approach, viz. multivariate repeated measure analysis of variance, which is the subject of the next section. (As another option, one could use with large samples unidimensional versions of the models discussed in Chapter 13; see that chapter for further discussion on their utility.) The multivariate approach to RMA is particularly attractive with large samples. Alternatively, and especially with small samples (when the multivariate approach tends to lack power, as indicated in the next section), one could still use the F test in Equation 5.3 but after correcting its degrees of freedom when working out the pertinent cut-off value for significance testing. This results in the so-called $\varepsilon$-correction procedure for the univariate RMA F test. Specifically, the new degrees of freedom to be used with that F test are $df_1 = \varepsilon(m-1)$ and $df_2 = \varepsilon(m-1)(n-1)$, rather than those stated in Equation 5.3 above. We note that the value of the $\varepsilon$ factor is 1 when sphericity is not violated; in all other cases, $\varepsilon$ will be smaller than 1 (see further discussion below), thus ensuring that smaller degrees of freedom values are utilized. This correction effectively increases the cut-off value of the pertinent F distribution of reference for significance testing, thus counteracting the earlier mentioned “liberalism” feature of the F test in Equation 5.3 when sphericity is violated. There are three procedures for accomplishing this degrees of freedom correction: (a) Box’s lower bound, (b) Greenhouse–Geisser’s adjustment, and (c) Huynh–Feldt’s correction. Each of them produces a different value for $\varepsilon$ with which the degrees of freedom need to be modified by multiplication as indicated above.
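To give a feel for how an $\varepsilon$ value relates to a covariance matrix, the following Python sketch computes $\varepsilon$ in the form commonly given for the Greenhouse–Geisser adjustment, namely the squared trace of the double-centered covariance matrix divided by $(m-1)$ times the sum of its squared elements, together with the lower bound $1/(m-1)$. This is a sketch under the assumption that this common textbook form is the one intended; it is not a reproduction of the SPSS or SAS computations, and the second matrix is hypothetical, chosen so that covariances decline with distance in time.

```python
def double_center(cov):
    """Subtract row and column means and add back the grand mean."""
    m = len(cov)
    row_means = [sum(row) / m for row in cov]
    grand_mean = sum(row_means) / m
    return [
        [cov[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(m)]
        for i in range(m)
    ]


def gg_epsilon(cov):
    """Epsilon from a symmetric covariance matrix (common Greenhouse-Geisser form)."""
    m = len(cov)
    d = double_center(cov)
    trace = sum(d[i][i] for i in range(m))
    sum_sq = sum(d[i][j] ** 2 for i in range(m) for j in range(m))
    return trace ** 2 / ((m - 1) * sum_sq)


def lower_bound(m):
    """Smallest possible epsilon for m repeated assessments."""
    return 1 / (m - 1)


# Matrix with equal variances and equal covariances (pattern of Equation 5.4).
spherical = [[25, 10, 10], [10, 25, 10], [10, 10, 25]]
# Hypothetical matrix in which covariances decline with distance in time.
declining = [[25, 15, 5], [15, 25, 15], [5, 15, 25]]

print(gg_epsilon(spherical))
print(gg_epsilon(declining))
print(lower_bound(3))
```

For the first matrix the value is 1 up to rounding error, so no correction of the degrees of freedom results, whereas the second matrix yields a value between the lower bound and 1.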
Hence, each of these three procedures leads to different degrees of freedom for the reference F distribution, against which the test statistic in the right-hand side of Equation 5.3 is to be judged for significance. The effect of correction procedures (a) through (c) is that they make the F test in Equation 5.3 more conservative, in order to compensate for its being liberal when sphericity is violated. However, each of the three corrections actually “overshoots” the target, so that the resulting, corrected F test becomes conservative. That is, it tends to reject the null hypothesis, when true, less often than it should; therefore, the associated Type I error rate is then lower than the nominal level (commonly taken as .05). The three corrections lead to tests showing this “conservatism” feature to a different extent. Box’s procedure is the most conservative, Greenhouse–Geisser’s is less so, and the least conservative of the three is Huynh–Feldt’s adjustment (cf. Timm, 2002). Hence, if Huynh–Feldt’s procedure rejects the null hypothesis (that is, the F test in Equation 5.3 is significant when its value is referred to the F distribution with degrees of freedom $\varepsilon(m-1)$ and $\varepsilon(m-1)(n-1)$, where $\varepsilon$ results from Huynh–Feldt’s adjustment), then we can have trust in its result; otherwise interpretation is ambiguous (within this univariate approach to RMA). To demonstrate this discussion, consider the following example. One hundred and fifty juniors are measured with an intelligence test at a




pretest and twice again after a 2-week training aimed at improving their test-relevant cognitive skills. The research question is whether there is evidence for change over time. Here, as is typical in single-group longitudinal studies, of main interest is the null hypothesis $H_0$: $\mu_1 = \mu_2 = \mu_3$, where $\mu_t$ stands for the population mean at the $t$th occasion ($t = 1, 2, 3$). This hypothesis of no change over time is also sometimes called the flatness hypothesis. (The data for this example are found in the file ch5ex1.dat available from www.psypress.com/applied-multivariate-analysis, where the variables at the three repeated assessments are named Test.1, Test.2, and Test.3.) To answer this question, with SPSS we use the following menu option/sequence:

    General Linear Model → Repeated Measures (WSF name = “time,” 3 levels, click “Add”; define levels as the successive measures; Options: mean for the time factor).

To accomplish this analysis with SAS, the following command statements within the PROC GLM procedure can be used:

    DATA RMA;
    INFILE 'ch5ex1.dat';
    INPUT t1-t3;
    PROC GLM;
    MODEL t1-t3 = / nouni;
    REPEATED time POLYNOMIAL / summary mean printe;
    RUN;

As one may recall from the previous chapter, the MODEL statement names the variables containing the data on the three assessments in the study under consideration. There are, however, a number of new statements in this program setup that require clarification. In particular, we need to comment on the use of the statements “nouni,” “REPEATED,” “POLYNOMIAL,” and “summary mean printe.” The “nouni” statement suppresses the display of output relevant to certain univariate statistics that are in general not of interest with this type of repeated measures design. (In fact, not including the “nouni” subcommand would make SAS compute and print F tests for each of the three assessments, results that are not usually of much interest in within-subjects designs.)
When the variables in the MODEL statement represent repeated measures on each subject, the REPEATED statement enables one to test hypotheses about the WSF (in this case the time factor). For example, in the present study the three levels of the time factor used are ‘‘t1,’’ ‘‘t2,’’ and ‘‘t3.’’ One can also choose to provide a number that reflects exactly the number of assessments considered. If one used this option, the pertinent statement above could also be specified as ‘‘REPEATED 3.’’ As a result, both univariate and




multivariate tests are generated as well as hypothesis tests for various patterns of change over time (also called contrasts; see next section). To specifically request examination of linear, quadratic, and higher-order patterns of change, if applicable, the POLYNOMIAL statement is utilized (see SAS User’s Guide for a complete listing of the various contrasts that can be obtained). Finally, the options “summary mean printe” produce ANOVA tables for each pattern of change under consideration, the means for each of the within-subject variables, and Mauchly’s test of sphericity. The outputs produced by the above SPSS instructions and SAS command file follow next. For ease of presentation, they are reorganized into sections, and clarifying comments are inserted after each of them.

SPSS output

General Linear Model

Within-Subjects Factors
Measure: MEASURE_1

TIME    Dependent Variable
1       TEST.1
2       TEST.2
3       TEST.3

In this table, the levels are explicitly named after the measures comprising the WSF, whose effect we are interested in testing. Next, these three repeated measures are examined with respect to the sphericity assumption.

Mauchly’s Test of Sphericity(b)
Measure: MEASURE_1

                                                                       Epsilon(a)
Within Subjects Effect   Mauchly’s W   Approx. Chi-Square   df   Sig.   Greenhouse-Geisser   Huynh-Feldt   Lower-bound
TIME                     .851          23.888               2    .000   .870                 .880          .500

Tests the null hypothesis that the error covariance matrix of the orthonormalized transformed dependent variables is proportional to an identity matrix.
a. May be used to adjust the degrees of freedom for the averaged tests of significance. Corrected tests are displayed in the Tests of Within-Subjects Effects table.
b. Design: Intercept
   Within Subjects Design: TIME
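The entries of this table can be connected to the earlier formulas with a small Python sketch. It assumes the single-group design of the present example ($m = 3$ occasions, $n = 150$ subjects), applies the three printed $\varepsilon$ values to the degrees of freedom $\varepsilon(m-1)$ and $\varepsilon(m-1)(n-1)$, and also recomputes the degrees of freedom of Mauchly’s chi-square, which for $m$ occasions equal $m(m-1)/2 - 1$. The function names are our own and are not part of any statistical package.

```python
def mauchly_df(m):
    """Degrees of freedom of Mauchly's chi-square test for m repeated assessments."""
    return m * (m - 1) // 2 - 1


def corrected_df(eps, m, n):
    """Epsilon-corrected degrees of freedom for the F test in Equation 5.3."""
    return eps * (m - 1), eps * (m - 1) * (n - 1)


m, n = 3, 150  # occasions and sample size in the present example

# Epsilon values as printed in the SPSS table above.
epsilons = {
    "Greenhouse-Geisser": 0.870,
    "Huynh-Feldt": 0.880,
    "Lower-bound": 0.500,
}

print("Mauchly df:", mauchly_df(m))
print(f"Uncorrected: df1 = {m - 1}, df2 = {(n - 1) * (m - 1)}")
for name, eps in epsilons.items():
    df1, df2 = corrected_df(eps, m, n)
    print(f"{name}: df1 = {df1:.2f}, df2 = {df2:.2f}")
```

The smaller the $\varepsilon$ value, the smaller both degrees of freedom and the larger the cut-off value against which the F ratio is judged.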

From the column titled ‘‘Sig.’’ in this table, Mauchly’s test of sphericity is found to be significant. This means that the data contain evidence warranting rejection of the null hypothesis of sphericity for the repeated




measure covariance matrix (i.e., the covariance matrix resulting when considering the repeated measures as three interrelated variables). Consequently, in order to deal with the lack of sphericity, epsilon corrections may be used in determining the more appropriate F test. (As mentioned earlier, an alternative approach is discussed in the next section, and another one in Chapter 13, both best applied with large samples.) As indicated before, Huynh–Feldt’s adjustment affects the degrees of freedom for the F test in Equation 5.3 the least, because its pertinent $\varepsilon$ factor is closest to 1, while Box’s modification provides a lower bound (labeled “Lower-bound” in the output) of this multiplier. Since degrees of freedom are inversely related to the extent an F test is conservative (when it is so), the last presented output table also demonstrates our earlier comment that Box’s correction leads to the most conservative test statistic whereas Huynh–Feldt’s entails the least conservative F test version, which is therefore preferable.

SAS output

The GLM Procedure
Repeated Measures Analysis of Variance
Repeated Measures Level Information

Dependent Variable    t1    t2    t3
Level of time          1     2     3

Similar to the preceding SPSS output, this table explicitly names each of the measures representing the WSF levels. Next, the sphericity assumption is evaluated.

The GLM Procedure
Repeated Measures Analysis of Variance
Sphericity Tests

Variables                DF    Mauchly’s Criterion    Chi-Square    Pr > ChiSq
Transformed Variates      2    0.8509465              23.888092
Orthogonal Components     2    0.8509465              23.888092

F G-G H-F

1 of repeated assessments), the multivariate approach to RMA treats the new variables $D_1, D_2, \ldots, D_{m-1}$ just like MANOVA treats multiple dependent variables. To demonstrate this discussion, consider the following example. One hundred and sixty-one male and female high school sophomore students are assessed four consecutive times using an induction reasoning test at the beginning of each quarter in a school year. (The data for this example are found in the file ch5ex3.dat available from www.psypress.com/applied-multivariate-analysis, where the variables containing the four successive assessments are named ir.1, ir.2, ir.3, and ir.4.) The research question is whether there are gender differences in the pattern of change over time. To respond to this query, with SPSS we follow the same steps as in the univariate approach when a BSF was involved but focus on those parts of the resulting output that deal with the multivariate approach, viz. the multivariate tests. For completeness of this discussion, we include this menu options sequence:

    General Linear Model → Repeated Measures (WSF name = “time,” 4 levels, click “Add”; define levels as the successive measures; Between-subject factor: Gender; Options: mean for all factors and interactions, homogeneity test).

To accomplish the same analysis with SAS, the following command statements within PROC GLM can be used. We note that with the exception of the added statement “printm,” this command sequence is quite similar




to that used in the previous section for the analysis of measures with a BSF. Including the new statement “printm” ensures that the linearly independent contrasts used are displayed in the output.

    DATA RMA;
    INFILE 'ch5ex3.dat';
    INPUT ir1-ir4 gender;
    PROC GLM;
    CLASS gender;
    MODEL ir1-ir4 = gender / nouni;
    REPEATED time POLYNOMIAL / summary printe printm;
    RUN;

The results produced are as follows (provided again in segments, to simplify the discussion).

SPSS output

General Linear Model

Within-Subjects Factors
Measure: MEASURE_1

TIME    Dependent Variable
1       IR.1
2       IR.2
3       IR.3
4       IR.4

Between-Subjects Factors

                  N
GENDER    .00      50
          1.00    111

Box’s Test of Equality of Covariance Matrices(a)

Box’s M    4.173
F          .403
df1        10
df2        44112.470
Sig.       .946

Tests the null hypothesis that the observed covariance matrices of the dependent variables are equal across groups.
a. Design: Intercept+GENDER
   Within Subjects Design: TIME




SAS output

The GLM Procedure
Class Level Information

Class     Levels    Values
gender    2         0 1

Repeated Measures Level Information

Dependent Variable    ir1    ir2    ir3    ir4
Level of time           1      2      3      4

Number of Observations    161

The preceding output tables identify the variables associated with the four levels of the within-subject factor, referred to as “time,” and indicate the gender (BSF) split. In addition, we note the results of Box’s M test (displayed only in the SPSS output, to save space), which show no evidence for violation of the covariance matrix homogeneity assumption.

SPSS output

Multivariate Tests(b)

Effect                                 Value    F             Hypothesis df    Error df    Sig.
TIME            Pillai’s Trace         .764     169.284(a)    3.000            157.000     .000
                Wilks’ Lambda          .236     169.284(a)    3.000            157.000     .000
                Hotelling’s Trace      3.235    169.284(a)    3.000            157.000     .000
                Roy’s Largest Root     3.235    169.284(a)    3.000            157.000     .000
TIME * GENDER   Pillai’s Trace         .023     1.233(a)      3.000            157.000     .300
                Wilks’ Lambda          .977     1.233(a)      3.000            157.000     .300
                Hotelling’s Trace      .024     1.233(a)      3.000            157.000     .300
                Roy’s Largest Root     .024     1.233(a)      3.000            157.000     .300

a. Exact statistic
b. Design: Intercept+GENDER
   Within Subjects Design: TIME
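The four criteria in this table carry the same information here, which can be checked with a small Python sketch. It rests on the assumption, suggested by the identical exact F values in each block, that each tested effect involves a single nonzero eigenvalue $\lambda$; in that case Wilks’ lambda equals 1 minus Pillai’s trace, Hotelling’s trace equals Pillai’s trace divided by Wilks’ lambda, and the exact F equals Hotelling’s trace multiplied by the ratio of error to hypothesis degrees of freedom. The function name is our own, and small discrepancies from the printed values are expected because the inputs are rounded to three decimals.

```python
def from_pillai(pillai, hyp_df, err_df):
    """Recover the other criteria and the exact F from Pillai's trace,
    assuming a single nonzero eigenvalue for the tested effect."""
    wilks = 1 - pillai
    hotelling = pillai / wilks          # equals the single eigenvalue
    f_value = hotelling * err_df / hyp_df
    return {"wilks": wilks, "hotelling": hotelling, "F": f_value}


# Rounded values as printed in the Multivariate Tests table above.
time_effect = from_pillai(0.764, hyp_df=3, err_df=157)
interaction = from_pillai(0.023, hyp_df=3, err_df=157)

for label, result in [("TIME", time_effect), ("TIME * GENDER", interaction)]:
    print(label, {key: round(value, 3) for key, value in result.items()})
```

Up to rounding, the recomputed values agree with the printed ones, which is why all four criteria lead to the same conclusion for each effect in this example.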

Mauchly’s Test of Sphericity(b)
Measure: MEASURE_1

                                                                       Epsilon(a)
Within Subjects Effect   Mauchly’s W   Approx. Chi-Square   df   Sig.   Greenhouse-Geisser   Huynh-Feldt   Lower-bound
TIME                     .693          57.829               5    .000   .798                 .816          .333

Tests the null hypothesis that the error covariance matrix of the orthonormalized transformed dependent variables is proportional to an identity matrix.
a. May be used to adjust the degrees of freedom for the averaged tests of significance. Corrected tests are displayed in the Tests of Within-Subjects Effects table.
b. Design: Intercept+GENDER
   Within Subjects Design: TIME




SAS output

The GLM Procedure
Repeated Measures Analysis of Variance
Sphericity Tests

Variables                DF    Mauchly’s Criterion    Chi-Square    Pr > ChiSq
Transformed Variates      5    0.6930509              57.829138
Orthogonal Components     5    0.6930509              57.829138
