Advanced Quantitative Data Analysis
Understanding Social Research Series Editor: Alan Bryman
Published titles:
Unobtrusive Methods in Social Research, Raymond M. Lee
Ethnography, John D. Brewer
Surveying the Social World, Alan Aldridge and Ken Levine
Biographical Research, Brian Roberts
Qualitative Data Analysis: Explorations with NVivo, Graham R. Gibbs
Postmodernism and Social Research, Mats Alvesson
Advanced Quantitative Data Analysis, Duncan Cramer
Advanced Quantitative Data Analysis
DUNCAN CRAMER
Open University Press Maidenhead · Philadelphia
Open University Press McGraw-Hill Education McGraw-Hill House Shoppenhangers Road Maidenhead Berkshire England SL6 2QL email:
[email protected] world wide web: www.openup.co.uk and 325 Chestnut Street Philadelphia, PA 19106, USA First published 2003 Copyright © Duncan Cramer 2003 All rights reserved. Except for the quotation of short passages for the purpose of criticism and review, no part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher or a licence from the Copyright Licensing Agency Limited. Details of such licences (for reprographic reproduction) may be obtained from the Copyright Licensing Agency Ltd of 90 Tottenham Court Road, London, W1T 4LP. A catalogue record of this book is available from the British Library ISBN 0335 20059 1 (pb)
0 335 20062 1 (hb)
Library of Congress Cataloging-in-Publication Data Cramer, Duncan, 1948– Advanced quantitative data analysis / Duncan Cramer. p. cm. — (Understanding social research) Includes bibliographical references and index. ISBN 0-335-20062-1 — ISBN 0-335-20059-1 (pbk.) 1. Social sciences — Statistical methods. I. Title. II. Series. HA29 .C7746 2003 001.4′22—dc21 2002042582 Typeset by RefineCatch Limited, Bungay, Suffolk Printed in Great Britain by Bell and Bain Ltd, Glasgow
Contents
Series editor's foreword
Preface

1 Introduction

PART 1 Grouping quantitative variables together
2 Exploratory factor analysis
3 Confirmatory factor analysis
4 Cluster analysis

PART 2 Explaining the variance of a quantitative variable
5 Stepwise multiple regression
6 Hierarchical multiple regression

PART 3 Sequencing the relationships between three or more quantitative variables
7 Path analysis assuming no measurement error
8 Path analysis accounting for measurement error

PART 4 Explaining the probability of a dichotomous variable
9 Binary logistic regression

PART 5 Testing differences between group means
10 An introduction to analysis of variance and covariance
11 Unrelated one-way analysis of covariance
12 Unrelated two-way analysis of variance

PART 6 Discriminating between groups
13 Discriminant analysis

PART 7 Analysing frequency tables with three or more qualitative variables
14 Log-linear analysis

Glossary
References
Index
Series editor’s foreword
This Understanding Social Research series is designed to help students to understand how social research is carried out and to appreciate a variety of issues in social research methodology. It is designed to address the needs of students taking degree programmes in areas such as sociology, social policy, psychology, communication studies, cultural studies, human geography, political science, criminology and organization studies who are required to take modules in social research methods. It is also designed to meet the needs of students who need to carry out a research project as part of their degree requirements. Postgraduate research students and novice researchers will find the books equally helpful.

The series is concerned to help readers to ‘understand’ social research methods and issues. This means developing an appreciation of the pleasures and frustrations of social research, an understanding of how to implement certain techniques, and an awareness of key areas of debate. The relative emphasis on these different features varies from book to book, but in each one the aim is to see the method or issue from the position of a practising researcher and not simply to present a manual of ‘how to’ steps. In the process, the series contains coverage of the major methods of social research and addresses a variety of issues and debates.

Each book in the series is written by a practising researcher who has experience of the technique or debates that he or she is addressing. Authors are encouraged to draw on their own experiences and inside knowledge.
While there are many books that deal with basic quantitative data analysis, there are relatively few that aim to deal with somewhat more advanced forms and aspects of such analysis in an accessible way. This is precisely what Duncan Cramer has succeeded in doing in this book. He builds upon his experience of delivering modules in, and of writing books covering, basic quantitative data analysis. His approach is to make the novice student and researcher familiar with the techniques he covers by taking them through worked examples. In addition, he ties instruction in the techniques to taking readers through the steps that are required to implement the same techniques using computer software. Duncan Cramer emphasizes the use of SPSS, the most widely used suite of statistical analysis programs, whenever feasible. In addition, he takes the reader through much of the output that the programs generate, so that the reader is given help with what to look for in the output and how to interpret the findings.

Duncan Cramer takes the reader through such increasingly used techniques as multiple regression, log-linear analysis, logistic regression, and analysis of variance. Knowledge of such techniques is crucial if the analyst is to move beyond simple and basic manipulation of the data. Moreover, when we analyse we need to know how to conduct the analyses that we learn about, and in the modern setting this amounts to knowing how to use computer software to put into practice the techniques that we are introduced to. This is why Duncan Cramer’s book includes help with instruction in computer software. At the same time, we need to know how to interpret what we find, and in this respect the book will be invaluable for appreciating how to make sense of computer output.

The book is very timely, when students are being encouraged to learn transferable skills and when higher education institutions are being encouraged to develop such skills in their students. Knowing how to perform more advanced forms of quantitative data analysis and how to use computer software in connection with such analysis is an important component of the inculcation of such skills. As such, this book will prove invaluable for students, researchers, and higher education teachers.

Alan Bryman
Preface
The aim of this book is to provide an introduction to some of the major advanced statistical techniques most commonly used in the social and behavioural sciences for analysing quantitative data and to do this in as concise and as non-technical a way as possible. The development of relatively easy-to-use computer software for analysing quantitative data has led not only to the widespread use of these techniques, but also to the increased expectation that researchers will apply these techniques to the analysis of their own quantitative data where appropriate. To understand the results of these analyses, as well as to be able to critically evaluate them, it is necessary to be familiar with the techniques themselves, as the authors of published research papers are generally expected to take this understanding for granted when describing their own work. Despite the relatively long time these techniques have been used, there are few books providing a relatively non-technical account of them. It is hoped that this text will help to fill this gap.

With the exception of the introductory chapter, the other 13 chapters of the book focus on the particular use of a specific technique. Each of these chapters begins by generally describing what the technique is used for, before illustrating the use of the technique with a small set of data consisting of simple numbers. Altogether, there are eight sets of data comprising 9–15 lines of information on 2–9 variables. Consequently, these data sets should be relatively easy to work with. Those aspects of the statistics
considered to be the most important for understanding a technique are covered first. Although this book is concerned with more advanced statistics, statistical terms are generally described when they are first introduced, which should help those who are relatively unfamiliar with statistics. Words rather than symbols are used to refer to these terms so as to avoid having to remember what the symbols signify. Where possible the statistic is calculated to show how the figures are derived and these calculations have been separated from the text by placing them in tables. A concise way of reporting the results of each worked example is also presented where feasible. The exact style of reporting statistics varies a little from one publishing house to another. I have applied one style to all the examples. It should be easy to change this style to one which is preferred or required. Recommendations for further reading have been provided at the end of all but the first chapter. These have been restricted to the less technical presentations available, although these are generally at a higher level than those in this book. Although a better understanding of a statistical technique is often gained by working through some of the calculations for that technique, it is not expected or recommended that the reader analyse their own data through such hand calculations. These computations are done much more efficiently by statistical software. There are several different commercial products readily available. It was not possible in a book of this size to illustrate the use of more than one product for each technique. Because of its apparent widespread popularity, SPSS was chosen to carry out the computations for the worked examples wherever possible. The latest version available at the time the manuscript was complete was used, which was Release 11. This version is similar to the three earlier versions and so this part of the book should also be suitable for people using these earlier versions. SPSS does not offer structural equation modelling. LISREL was chosen to carry out these computations, as it appears to be one of the most widely used programs for doing this. The latest version of this program was used, which is LISREL 8.51. The worked example can be carried out with a free student or limited version of this program, which can be downloaded from the following website: http://www.ssicentral.com/other/entry.htm. Instructions on how to download this program are available at this site. Terms and values that are part of SPSS and LISREL or their output have been printed in bold to indicate this. The space devoted to describing the use of these two programs has been kept as brief as possible so that much of the book may be of value to those with access to, or who are more familiar with, other software. The data for the worked examples were made up to illustrate important statistical points and do not pretend to represent results that may be typical of the research literature on that topic. The examples were also designed to be as simple as possible. The reader may like to generate more relevant examples from those used if the present ones are not sufficiently interesting
or relevant. Making up further examples and analysing them is also useful for testing one’s general understanding of these techniques. This understanding will also be increased by reading examples of published reports of such analyses in one’s own fields of interest. Analysing quantitative data is a skill that benefits from informed practice. It is hoped that this book will help you develop this skill.

I would like to thank, in alphabetical order, Alan Bryman, Tim Liao and Amanda Sacker for their comments on the first draft of the manuscript.

Duncan Cramer
Loughborough University
1
Introduction
A major aim of the social and behavioural sciences is to develop explanations of various aspects of human behaviour. For example, we may be interested in explaining why some people are more aggressive than others. One way of determining the adequacy or validity of explanations is to collect data pertinent to them and to see to what extent the data are consistent with the explanation. Data agreeing with the explanation support the explanation to the extent that they are not also consistent with other explanations. Data inconsistent with the explanation do not support the explanation to the extent that the variables relevant to them have been appropriately operationalized. Operationalization involves either the manipulation of variables or their measurement.
Qualitative and quantitative variables
As the name implies, a variable is a quality or characteristic that varies. If the quality did not vary, it would be called a constant. The sociobehavioural sciences are interested in explaining why particular characteristics vary. The most basic type of variable is a dichotomous one, in which the quality is either present or absent. For example, being female is a dichotomous variable in which the units or cases are classified as being either female or not female. Similarly, being divorced is a dichotomous
variable in which cases can be categorized as being either divorced or not divorced. The two categories making up a dichotomous variable can be represented or coded by any two numbers such as 1 and 2 or 23 and 71. For example, being female may be coded as 1 and not being female as 2. There are two main kinds of variables. One kind is variously called a qualitative, categorical, nominal or frequency variable. An example of a qualitative variable is marital status, which might consist of the following five categories: (1) never married; (2) married; (3) separated; (4) divorced; and (5) widowed. These five categories can be represented or coded with any set of five numbers such as 1, 2, 3, 4 and 5 or 32, 12, 15, 25 and 31. The numbers are simply used to refer to the different categories. The number or frequency of cases falling within each of the categories can only be counted, hence this variable is sometimes called a frequency variable. For example, of 100 people, 30 may never have been married, 40 may be married, 8 may be separated, 12 may be divorced and 10 may be widowed. The frequency of cases in a category can be expressed as a proportion or percentage of the total frequency of cases. For example, the proportion of married people is .40 (40/100 = .40), which, expressed as a percentage, is 40 (40/100 × 100 = 40). Data consisting of qualitative variables is quantitative in the sense that the frequency, proportion or percentage of cases can be quantified. The categories of a qualitative variable can be treated as dichotomous variables. For example, the divorced may form one category of a dichotomous variable and the remaining four groups of the never married, the married, the separated and the widowed the other category. Other examples of qualitative variables include type of food eaten, country of origin, nature of illness and kind of treatment received. The other kind of variable is called a quantitative variable in which numbers are used to order or to represent increasing levels of that variable. The simplest example of a quantitative variable is a dichotomous variable such as sex or gender, where one category is seen as representing more of that quality than the other. For example, if females are coded as 1 and males as 2, then this variable may be seen as reflecting maleness in which the higher score indicates maleness. The next simplest example is a variable consisting of three categories such as social class, which may comprise the three categories of upper, middle and lower. Upper class may be coded as 1, middle as 2 and lower as 3, in which case lower values represent higher social status. These numbers may be treated as a ratio measure or scale. Someone who is coded as 1 is ranked twice as high as someone who is coded as 2 giving a ratio of 1 to 2. Other quantitative variables typically comprise more than three categories such as age, income or the total score on a questionnaire scale. For example, ten questions may be used to assess how aggressive one is. Each question may be answered in terms of ‘Yes’ or ‘No’. A code of 1 may be given to answers that indicate aggressiveness while a code of 0 may be given to answers that show a lack of aggressiveness. The codes for these
ten questions can be added together to give an overall score, which will vary from a minimum of zero to a maximum of 10. Higher scores will signify greater aggressiveness. These numbers may also be considered a ratio measure. A person with a score of 10 will have twice the score of a person with a score of 5, which gives a ratio of 10 to 5, which, simplified, is 2 to 1. If it is considered useful, adjacent categories can always be combined to form a smaller number of categories not less than two. For example, the 11 categories of total aggressiveness scores just mentioned may be recategorized into the three new categories of scores of 0 to 1, 2 to 5 and 6 to 10. The new categories do not have to comprise the same number of scores as here, where the first category consists of two potential scores (0 and 1), the second one of four scores (2, 3, 4 and 5) and the third one of five scores (6, 7, 8, 9 and 10). These three new categories will now have the new numerical codes of, say, 1 for the first group, 2 for the second group and 3 for the third group. Quantitative variables should only be re-categorized in this way if there is a good justification for doing so, because the meaning of these regrouped scores is less clear than that of the original ones and the variation in the original scores is reduced. In the socio-behavioural sciences, we are usually interested in whether one variable is related to one or more other variables. The stronger the relationship between variables, the more they have in common. Bivariate analysis examines the relationship between two variables, whereas multivariate analysis examines the relationships between three or more variables simultaneously. Only one of the major statistical techniques described in this book examines two variables at a time – one-way analysis of variance, which is described in Chapter 10. All the other techniques examine three or more variables at a time. A one-way analysis of variance looks at the relationship between a qualitative variable such as marital status and a quantitative one such as the scores on a measure of depression. A one-way analysis of covariance examines this relationship while controlling for a second quantitative variable that is related to the first. In other words, it involves three variables, one qualitative and the other two quantitative. Consequently, it is a multivariate analysis. Any aspect of human behaviour is likely to be affected or influenced by several different factors. Therefore, a better understanding of human behaviour is likely to be achieved if we are able to examine more than one factor at a time.
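The recategorization just described is easy to check by hand, but for larger data sets it is usually automated. The following is a minimal Python sketch (not part of the book, which uses SPSS; the scores are invented for the illustration) that regroups total aggressiveness scores of 0 to 10 into the three broader categories described above:

```python
import pandas as pd

# Invented total aggressiveness scores (0-10) for a few hypothetical cases
scores = pd.Series([0, 1, 2, 4, 5, 6, 8, 10])

# Recode into three broader categories: 0-1 -> 1, 2-5 -> 2, 6-10 -> 3
# pd.cut uses half-open bins (lower edge excluded), so the first edge is set below 0
groups = pd.cut(scores, bins=[-1, 1, 5, 10], labels=[1, 2, 3])

print(pd.DataFrame({'score': scores, 'group': groups}))
```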
Statistical inference
Socio-behavioural scientists are often interested in determining whether the relationships that they find in their data can be inferred as applying in general to the population from which their sample of data was drawn. In this context, a population refers to the values of the variables rather than to
people or other organisms. A sample is a subset of those values. To establish whether a relationship in a sample can be generalized to the population, we assume that the variables are not related in the population and work out what the probability is of finding a relationship of such a size in the sample if this assumption were true. If the probability of finding such a relationship is .05 or less (i.e. a chance of 1 in 20 or less), we assume that this relationship exists in the population from which the sample was drawn. In other words, the relationship is unlikely to be due to chance. Such a relationship is called a statistically significant one. However, it is possible that a relationship occurring with a probability of .05 or less is a chance one, in which case we have inferred that a relationship exists when it does not exist. Making this kind of mistake is known as a Type I error. Although statistical significance is generally accepted as being .05 or less, this is an arbitrary cut-off point. If the probability of finding a relationship is more than .05, we assume that no relationship exists in the population from which the sample was drawn. Such a relationship is called a statistically non-significant one. However, it is possible that a relationship that has a probability of more than .05 is not a chance one but a real one. This kind of mistake is known as a Type II error.

The larger the sample, the greater the probability of finding a relationship to be statistically significant, so in a large sample even a very small relationship is likely to be declared significant. The smaller the sample, the greater the probability of finding a relationship to be statistically non-significant. In other words, the probability of making a Type II error (saying a relationship does not exist when it does) is greater with smaller samples. Larger relationships are also more likely to be statistically significant. Consequently, when we interpret statistical significance, we need to take into account both the size of the relationship and the size of the sample. All but two of the statistical methods covered in this book entail procedures of statistical significance. The two exceptions are cluster analysis and the factor analytic technique of principal components.
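As a concrete illustration of this decision rule (a sketch only; the book itself carries out all significance tests in SPSS, and the two variables below are invented), the following Python fragment tests the correlation between two variables in a small sample against the .05 criterion:

```python
from scipy.stats import pearsonr

# Invented scores on two variables for the same ten cases
variable_a = [2, 4, 3, 5, 1, 2, 4, 5, 3, 1]
variable_b = [1, 3, 4, 5, 2, 1, 5, 4, 2, 2]

r, p = pearsonr(variable_a, variable_b)

# Reject the assumption of no relationship in the population if p is .05 or less
if p <= 0.05:
    print(f"r = {r:.2f}, p = {p:.3f}: statistically significant")
else:
    print(f"r = {r:.2f}, p = {p:.3f}: statistically non-significant")
```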
Dependent and independent variables
For many of the statistical methods described, a distinction is made between a dependent variable and an independent one. In some techniques such as multiple regression, the dependent variable may be called the criterion or the criterion variable and the independent variable the predictor or predictor variable. The dependent or criterion variable is the variable we are interested in explaining in terms of the independent or predictor variables. The dependent variable is so called because we are assuming that it is influenced by, or dependent on, the independent variables. The independent
variables are presumed to be unaffected by, or independent of, the dependent variable. In path analysis, discussed in Chapters 7 and 8, we may propose an order or sequence of variables in which one variable is assumed to influence another variable, which, in turn, influences a third variable and so on. The variable that starts the sequence is sometimes called an exogenous variable because it is external to the explanation entailed by the path analysis. Variables that follow the exogenous variables are called endogenous variables because they are to be explained within the path analysis. While the exogenous variable can be viewed as an independent variable, the endogenous variables are acting as both independent and dependent variables. They are thought to be influenced by the preceding variable and to influence the subsequent variable. It is also possible that two or more variables influence one another, in which case the relationship is called a reciprocal one. This topic is not covered in this book. It should be made clear that statistical analysis can only determine whether variables are related. It cannot determine whether one variable affects another. For example, we may find a relationship between unemployment and crime in that those who are not in paid employment are more likely to engage in criminal activity or have a criminal conviction. This relationship, however, does not mean that unemployment leads to criminal activity. It is equally plausible that those who engage in criminal activity are less interested in seeking paid employment. It is also possible that both variables influence each other. Determining the causal nature or direction of two variables in the socio-behavioural sciences is usually a complex issue which involves providing support for a well-argued position. The role that statistical analysis plays in this argument is to offer an indication of the size of any observed relationship and the likelihood of such a relationship occurring by chance.
Choosing an appropriate statistical method
A brief outline of the statistical methods covered in this book will be presented here to help readers who are uncertain as to which technique is most appropriate for the analysis of their data and who are primarily interested in learning about that technique alone. In the book, the techniques have been ordered in terms of trying to make sense of a potentially large number of quantitative variables and of outlining statistical ideas that are useful in understanding subsequent methods. For example, multiple regression is introduced before analysis of variance because regression is often used with many variables and to carry out analysis of variance. In providing guidance to the reader as to which technique is likely to be most suitable for their data, it is useful to think of them in terms of the kind of variables to which they are applied.
If the analysis is confined to looking at three or more qualitative variables together, there is only one technique in this book which does that – log-linear analysis, described in Chapter 14. For example, we may be interested in the relationship between psychiatric classification (e.g. anxious, depressed and both anxious and depressed), religious affiliation (e.g. none, Protestant and Catholic) and childhood parental status (e.g. lived with both parents, mother only and father only). Log-linear analysis is used to answer two related kinds of questions. The first question is whether the frequency of cases varies significantly from that expected by chance in terms of the interaction between three or more qualitative variables. In other words, do we need to consider more than two variables to explain the distribution of cases according to those variables? The second question is which of the variables and/or their interactions are necessary to account for the distribution of cases. This question differs from the first in that the influence of single variables and their two-way interactions with other variables is considered as well as higher-order interactions. If one of the qualitative variables is to be considered as the dependent variable (e.g. psychiatric classification) and the other qualitative variables as the independent variables, logistic regression is more appropriate, as it only considers the relationship between the dependent variable and the independent variables and their interactions. In other words, it excludes the relationships between the independent variables on their own (e.g. the relationship between religious affiliation and childhood parental status). The only kind of logistic regression covered in this book is binary logistic regression where the dependent variable consists of two categories; this is described in Chapter 9. Logistic or logit multiple regression is used to determine which qualitative and quantitative variables and which of their interactions are most strongly associated with the probability of a particular category of the dependent variable occurring, taking into account their association with the other predictor variables in the analysis. Qualitative variables need to be transformed into dummy variables. This procedure is described in Chapters 10–12 for analysis of variance. There are three ways of entering the predictors into a logistic regression. In the standard or direct method, all predictors are entered at the same time, although some of these predictors may play little part in maximizing the probability of a category occurring. In the hierarchical or sequential method, predictors are entered in a predetermined order to find out what contribution they make. For example, demographic variables such as age, gender and social class may be entered first to control for the effect of these variables. In the statistical or stepwise method, predictors are selected in terms of the variables that make the most contribution to maximizing the probability of a category occurring. If two predictors are related to each other and have very similar maximizing power, the predictor with the
greater maximizing power will be chosen even if the difference in the maximizing power of the two predictors is trivial. Discriminant function analysis, described in Chapter 13, may be used to determine which quantitative variables best predict which category a case is likely to fall in provided that the data meet the following requirements. The number of cases in the categories of the dependent variable should not be very unequal. The independent variables should be normally distributed and the variances of the groups should be similar or homogeneous. The independent variables form a new composite variable called a discriminant function. The maximum number of discriminant functions that are considered is either the number of predictors or one less the number of groups, whichever of these is the smaller number. As with logistic regression, there are three ways of entering the predictors into a discriminant function. In the standard or direct method, all predictors are entered at the same time, although some of these predictors may not discriminate between the groups. In the hierarchical or sequential method, predictors are entered in a predetermined order to find out what contribution they make. In the statistical or stepwise method, predictors are selected in terms of the variables that make the most contribution to the discrimination. If two predictors are interrelated and have very similar discriminating power, the predictor with the greater discriminating power will be considered for entry even if the difference in the discriminating power of the two predictors is minimal. The other statistical techniques described in this book are for when either the dependent variable is a quantitative one or all the variables are quantitative. Multiple regression is used to determine which quantitative and qualitative variables and which of their interactions are most strongly related to a quantitative criterion variable. Qualitative variables should be treated as dummy variables as described in Chapters 10–12 for analysis of variance. As with logistic regression and discriminant analysis, there are three ways of entering the predictors into a multiple regression. In the standard or direct method, all predictors are entered at the same time, although some of these predictors may not be associated with the criterion. In the statistical or stepwise method, predictors are selected in terms of the variables that explain the maximum amount of variance in the criterion. If two predictors are related to each other and have a very similar association with the criterion, the predictor with the stronger association will be selected even if the difference in the size of this association for the two predictors is minute. This method is described in Chapter 5. In the hierarchical or sequential method, predictors are entered in a predetermined order to find out what contribution they make. This method is described in Chapter 6. Hierarchical multiple regression is also used in the most basic form of path analysis and in analysis of variance and covariance. Path analysis is used to determine the strength of association in a hypothesized sequence or series of quantitative endogenous variables and
the extent to which the pathways selected provide a satisfactory description or fit of the associations between all the variables. Qualitative variables may be included as exogenous variables when transformed into dummy variables, as described in Chapters 10–12. Taking the simplest example of three variables, path analysis may be used to assess the extent to which one of the variables is a direct function of the other two and an indirect function of one of them. The simplest form of path analysis is covered in Chapter 7. A more sophisticated form of path analysis, which takes account of the reliability of the variables, is described in Chapter 8. Analysis of variance is used to determine whether one or more qualitative independent variables and their interactions are significantly associated with a quantitative criterion or dependent variable. If the qualitative variable consists of only two groups, a significant association indicates that the means of those two groups differ significantly. If the qualitative variable comprises more than two groups, a significant association implies that the means of two or more groups differ significantly. If there are strong grounds for predicting which means differ, the significance of these differences can be tested with a one-tailed t test. If no differences had been predicted or if there were no strong reasons for predicting differences, the significance of the differences need to be analysed with a post-hoc test, such as the Scheffé test. An analysis of variance with one qualitative variable is described in Chapter 10 and an analysis of variance with two qualitative variables is described in Chapter 12. Quantitative independent variables which are related to the quantitative dependent variable can be controlled through analysis of covariance. An analysis of covariance with one qualitative independent variable and one quantitative independent variable which is correlated with the quantitative dependent variable is covered in Chapter 11. Finally, the extent to which related quantitative variables can be grouped together to form a smaller number of factors or clusters encompassing them can be determined with the statistical techniques described in Chapters 2–4. For example, we may be interested in finding out whether items designed to measure anxiety and depression, respectively, can be grouped together to produce two factors or clusters representing these two types of items. A form of exploratory factor analysis called principal components is described in Chapter 2. The analysis is exploratory in the sense that the way in which the variables may be grouped together is not predetermined, as is the case with the confirmatory factor analysis described in Chapter 3. Confirmatory factor analysis provides a statistical measure of the extent to which a predetermined or hypothesized structure offers a satisfactory account of the associations of the variables. An alternative method to exploratory factor analysis for grouping variables together is cluster analysis. A form of cluster analysis called hierarchical agglomerative clustering is described in Chapter 4.
It should be pointed out that some of the techniques described in this book appear to be less common in the socio-behavioural literature than others. The least popular techniques appear to be cluster analysis, log-linear analysis and discriminant analysis. Consequently, the reader is less likely to come across these techniques in their reading of the quantitative research literature. If this is the case, it is worthwhile trying to find examples of these techniques by specifically searching for them in electronic bibliographic databases relevant to your own areas of interest to further familiarize yourself with the ways in which these techniques are applied.
Part 1 Grouping quantitative variables together
2
Exploratory factor analysis
Factor analysis is a set of techniques for determining the extent to which variables that are related can be grouped together so that they can be treated as one combined variable or factor rather than as a series of separate variables. Perhaps the most common use of factor analysis in the social and behavioural sciences is to determine whether the responses to a set of items used to measure a particular concept can be grouped together to form an overall index of that concept. For example, we may be interested in assessing how anxious people see themselves as being. We could simply ask some people how anxious they generally are. However, there are three main problems with trying to measure a social concept with a single question or item. First, the potential sensitiveness of the index will be more restricted. For example, if we restrict the possible answers to this question to ‘Yes’ or ‘No’, we can only put people into these two categories. The more questions on anxiety we can generate, the more potential categories we would have. With two questions, we would have four categories; with three questions, six categories; and so on. The second problem with a single-item measure is that it does not enable us to determine how reliable that index is. To take an extreme example, it could be that the people we ask do not know what being anxious means and answer ‘Yes’ or ‘No’ on the basis of a whim rather than their understanding of the term. To determine the reliability of this question, we could ask it
two or more times on the same occasion. If the question was a reliable index, we would expect people to give the same answer each time. Because asking the same question several times within a short space of time may indicate that we are disorganized or don’t trust the person, it is preferable to ask questions which differ but which are directed at the same content. For example, we could ask them if they are generally tense. The third problem with a single-item measure is that it does not enable us to sample different aspects of that concept. For example, being anxious, tense, nervous or easily frightened may describe slightly different characteristics of the concept we have of anxiety. If these different features do reflect the concept of anxiety, we would expect people who describe themselves as being anxious to also describe themselves as being tense, nervous and easily frightened. Similarly, we would expect people who describe themselves as not being anxious to also describe themselves as not being tense, nervous and easily frightened. In other words, we would expect the answers to these four questions to be related to each other and to form a single factor. If this were the case, we could combine the answers of these four questions together to create a single overall score rather than to leave them as four separate scores giving similar information. However, interpreting the results of a factor analysis solely based on the items thought to make up a single index such as anxiety is problematic. If, for example, we find that the answers to the four anxiety items group together to form a single factor, we will not know without further information whether this factor specifically reflects anxiety or represents a more general factor such as a tendency to complain. Alternatively, if these four items grouped together into two separate factors, say one of being anxious and tense and the other of being nervous and easily frightened, we wouldn’t know whether a general factor of anxiety had been broken down into two more specific sub-factors. Consequently, when carrying out a factor analysis, it is useful to include the responses to items that are believed not to be part of the index we are testing. For example, if we believed depression to be separate from anxiety, we could include the responses to items on depression to help interpret the results of our analysis. If the anxiety and depression items grouped together on a single factor, it would appear that people who were anxious were also depressed and that these two characteristics cannot be distinguished at least through selfreport. If the anxiety items grouped together on one factor and the depression items on another factor, we would be more confident that our anxiety items were not simply a measure of people’s tendency to be generally unhappy. We will illustrate the interpretation and some of the computations of a factor analysis by establishing whether self-reported anxiety and depression can be distinguished as separate factors. To keep the example simple, we will restrict ourselves to three short questions on anxiety (A1–A3) and
depression (D1–D3), respectively, although most factor analyses are based on more than six variables:

A1 I am anxious
A2 I get tense
A3 I am calm
D1 I am depressed
D2 I feel useless
D3 I am happy
Each of these questions is answered on a 5-point scale, where 1 represents ‘not at all’, 2 ‘sometimes’, 3 ‘often’, 4 ‘most of the time’ and 5 ‘all of the time’. The recommended minimum size of the sample to be used for a factor analysis varies according to different sources, but there is agreement that it should be bigger than the number of variables. Gorsuch (1983), for example, suggested that there should be no fewer than 100 cases per analysis and a minimum of five cases per variable. A case is the unit of analysis, which, in this example, is a person. However, a case can be any unit of analysis, such as a school, commercial organization or town. To simplify the entering of the data for the analysis, we will use the fictitious answers of only nine individuals, which are shown in Table 2.1. Case 1 says they are sometimes anxious, not at all tense and often calm.
Correlation matrix
The first step in a factor analysis is to create a correlation matrix in which every variable is correlated with every other variable. The correlation matrix for the data in Table 2.1 is displayed in Table 2.2.
Table 2.1  Responses of nine people on six variables

Cases   A1 Anxious   A2 Tense   A3 Calm   D1 Depressed   D2 Useless   D3 Happy
1           2            1          3           1             2            5
2           1            2          3           4             3            3
3           3            3          4           2             1            4
4           4            4          3           3             2            3
5           5            5          2           3             4            4
6           4            5          2           4             3            1
7           4            3          2           5             4            1
8           3            3          4           4             4            3
9           3            5          3           3             4            1
Table 2.2  Triangular correlation matrix for the six variables

Variables      A1 Anxious   A2 Tense   A3 Calm   D1 Depressed   D2 Useless   D3 Happy
A1 Anxious        1.00
A2 Tense           .74        1.00
A3 Calm           −.50        −.40       1.00
D1 Depressed       .22         .30       −.37         1.00
D2 Useless         .28         .39       −.43          .65          1.00
D3 Happy          −.25        −.54        .41         −.74          −.53         1.00
This matrix is known as a triangular matrix because of its shape, in which the correlation for each pair of variables is only shown once. A correlation represents the nature and size of a linear relationship between two variables. Correlations can vary from −1.00 to +1.00. A correlation of −1.00 indicates a perfect inverse relationship between two variables in which the highest score on one variable (say, 5 on being anxious) is associated with the lowest score on the other variable (say, 1 on being calm), the next highest score on the first variable (say, 4 on being anxious) is associated with the next lowest score on the second variable (say, 2 on being calm) and so on. A correlation of +1.00 indicates a perfect relationship between two variables in which the highest score on one variable (say, 5 on being anxious) is associated with the highest score on the other variable (say, 5 on being tense), the next highest score on the first variable (say, 4 on being anxious) is associated with the next highest score on the second variable (say, 4 on being tense) and so on. A correlation of 0.00 represents a lack of a linear relationship between two variables. It is very unusual to find a perfect correlation of +1.00 or −1.00. The perfect correlation of 1.00s in the diagonal of the matrix in Table 2.2 simply represents the correlation of each variable with itself, which, by its very nature, will always be 1.00 and which is of no interest or importance.

The bigger the correlation, regardless of whether it is positive or negative, the stronger the linear association is between two variables. The biggest correlation in Table 2.2 is .74 between being anxious and being tense and −.74 between being depressed and being happy. The next biggest correlation is .65 between being depressed and being useless. The smallest correlation is .22 between being anxious and being depressed. The amount of variation or variance that is shared between two variables can be simply worked out from the correlation by squaring it. So the amount of variance that is shared or that is common to being anxious and being tense is .74² or about .55, while the amount of variance that is shared between being anxious and being depressed is .22² or about .05. In other
words, the amount of variance shared between being anxious and being tense is 11 times the amount of variance shared between being anxious and being depressed. The maximum amount of variance that can be shared with a variable is 1.00 (±1.00² = 1.00), while the minimum amount is .00 (.00² = .00).

If we look at the correlations in Table 2.2, we can see that there is some tendency for the anxiety items to correlate more highly with each other than with the depression items and the depression items to correlate more highly with each other than with the anxiety items, suggesting that they might form two separate groups of anxiety and depression items. For example, being anxious correlates .74 with being tense but only .22 with being depressed. However, the picture is not entirely clear because being tense correlates more highly with the depression item of being happy (−.54) than with the anxiety item of being calm (−.40). It is usually not possible to tell simply from looking at a correlation matrix into how many groups or factors the variables might fall, particularly the more variables there are. Consequently, we need to use a more formal method, such as exploratory factor analysis, to determine how many groups there are.
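For readers who want to check the figures outside SPSS, here is a minimal Python sketch (an illustration only, not the book's procedure) that enters the data of Table 2.1 and computes the correlations of Table 2.2 together with the squared correlations that give the shared variance; the values should match the tables up to rounding:

```python
import pandas as pd

# Data from Table 2.1: nine cases on the three anxiety and three depression items
data = pd.DataFrame({
    'A1_anxious':   [2, 1, 3, 4, 5, 4, 4, 3, 3],
    'A2_tense':     [1, 2, 3, 4, 5, 5, 3, 3, 5],
    'A3_calm':      [3, 3, 4, 3, 2, 2, 2, 4, 3],
    'D1_depressed': [1, 4, 2, 3, 3, 4, 5, 4, 3],
    'D2_useless':   [2, 3, 1, 2, 4, 3, 4, 4, 4],
    'D3_happy':     [5, 3, 4, 3, 4, 1, 1, 3, 1],
})

corr = data.corr()            # Pearson correlations, as in Table 2.2
shared_variance = corr ** 2   # squared correlations = proportion of shared variance

print(corr.round(2))
print(shared_variance.round(2))
```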
Principal components analysis
There are many different kinds of factor analysis but perhaps the simplest and most widely used is principal components analysis. Components is another term for factors and the components in principal components analysis are often referred to as factors. We will use the two terms interchangeably here. In principal components analysis, the amount of variance that is to be explained or accounted for is the number of variables, since the variance or communality of each variable is set to 1.00. So, with six variables there is a total variance of 6.00 to be explained. The number of components to be formed or extracted is always the same as the number of variables in the analysis. So with six variables, six factors will be extracted. The first factor will always explain the largest proportion of the overall variance, the second factor will explain the next largest proportion of variance that is not explained by the first factor and so on, with the last factor explaining the smallest proportion of the overall variance. Each variable is correlated with or loads on each factor. Because the first factor explains the largest proportion of the overall variance, the correlations or loadings of the variables will, on average, be highest for the first factor, next highest for the second factor and so on. To calculate the proportion of the total variance explained by each factor, we simply square the loadings of the variables on that factor, add the squared loadings to give the eigenvalue or latent root of that factor, and divide the eigenvalue by the number of variables.
The six principal components for the data for the six variables in Table 2.1 are shown in Table 2.3. As one can see, the six variables are most highly correlated with the first component apart from being anxious, which is slightly more highly correlated with the second component. To show the proportion of variance explained by these six principal components, the loadings have been squared, summed to form eigenvalues and divided by the number of variables as presented in Table 2.4. So the loading of .66 between being anxious and the first principal component when squared is about .43 (taking into account errors due to rounding to different decimal places). Adding the squared loadings of the variables on the first principal component gives a sum or eigenvalue of 3.26, which, expressed as a proportion of 6, is .54 (3.26/6 = .543). In other words, the first principal component explains about 54 per cent of the total variance of the six variables, the second component a further 20 per cent and so on.
Table 2.3  Initial principal components

                  1      2      3      4      5      6
A1 Anxious       .66    .67    .03    .11    .29   −.14
A2 Tense         .76    .46    .37    .02   −.20    .18
A3 Calm         −.69   −.20    .64    .25    .10   −.05
D1 Depressed     .76   −.53    .05   −.04    .35    .14
D2 Useless       .75   −.34   −.15    .52   −.17   −.06
D3 Happy        −.80    .34   −.27    .35    .14    .17
Table 2.4  Proportion of total variance explained by initial principal components

                  1      2      3      4      5      6    Communalities
A1 Anxious       .43    .45    .00    .01    .08    .02       1.00
A2 Tense         .58    .21    .14    .00    .04    .03       1.00
A3 Calm          .48    .04    .41    .06    .01    .00       1.00
D1 Depressed     .57    .28    .00    .00    .12    .02       1.00
D2 Useless       .56    .11    .02    .27    .03    .00       1.00
D3 Happy         .64    .12    .07    .12    .02    .03       1.00
Eigenvalues     3.26   1.21   0.64   0.47   0.31   0.11       6.00
Proportion       .54    .20    .11    .08    .05    .02
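The calculations behind Tables 2.3 and 2.4 can be sketched in a few lines of Python (an illustration only; `corr` is the correlation matrix from the earlier sketch, the sign of a whole column of loadings is arbitrary, and small rounding differences from the printed tables are to be expected). The eigenvalues of the correlation matrix are the eigenvalues of Table 2.4, and the loadings of each component are its eigenvector scaled by the square root of its eigenvalue:

```python
import numpy as np

R = corr.to_numpy()                            # 6 x 6 correlation matrix from the earlier sketch

# Eigen-decomposition of the correlation matrix; eigh returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]          # largest component first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Loadings: each eigenvector column scaled by the square root of its eigenvalue
loadings = eigenvectors * np.sqrt(eigenvalues)

print(np.round(eigenvalues, 2))                # eigenvalues, as in Table 2.4
print(np.round(eigenvalues / R.shape[0], 2))   # proportion of total variance per component
print(np.round(loadings[:, :2], 2))            # loadings on the first two components (Table 2.3)
```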
Number of principal components to be retained
As there are as many components as variables, we need some criterion to decide how many of the smaller factors we should ignore, as these explain the least amount of the total variance. One of the main criteria used is the Kaiser or Kaiser-Guttman criterion, which is that factors that have eigenvalues of one or less should be ignored. As the maximum amount of variance that can be explained by one variable is one, these factors effectively account for no more than the equivalent of the variance of one variable. From Table 2.4, we can see that only the first two components have eigenvalues of greater than one, while the last four have eigenvalues of less than one. So if we adopted this commonly used criterion, we would only pay further attention to the first two factors and we would ignore the four smaller remaining factors. Cattell (1966) has suggested that the Kaiser criterion may retain too many factors when there are many variables and too few factors when there are few variables. He proposed an alternative criterion, the scree test, which essentially looks for a marked break between the initial big factors that explain the largest proportion of the variance and the later smaller factors that explain very similar and small proportions of the variance. To determine where this break occurs, the eigenvalue of each factor is represented by the vertical axis of a graph, while the factors are arranged in order of decreasing size of eigenvalue along the horizontal axis. The eigenvalues for the six components in Table 2.4 have been plotted in this way in Figure 2.1. ‘Scree’ is a geological term for the debris that lies at the foot of a steep slope and that hides the real base of the slope itself. The number of factors to be retained is indicated by the number of the factor that appears to represent the line of the steep slope itself where the scree starts. The factors forming the slope are seen as being the important factors, while those comprising the scree are thought to be unimportant ones. The scree factors are usually identified by being able to draw a straight line through or very close to the points representing their eigenvalues on the graph. This is not always easy to do, as in Figure 2.1 where it is not entirely clear whether the scree starts at the second or third factor. In this case, it may be useful to compare the variables that correlate on both the first two and the first three factors (after these factors have been rotated) to determine whether two or three factors is the more meaningful solution. The reason for rotating factors will be discussed in the next section. If more than one scree can be identified using straight lines, the uppermost scree determines the number of factors to be retained.
Figure 2.1  Cattell’s scree test for the six principal components.
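A scree plot like Figure 2.1 can be drawn with a few lines of matplotlib (an illustrative sketch; `eigenvalues` is the descending array computed in the previous sketch):

```python
import matplotlib.pyplot as plt

components = range(1, len(eigenvalues) + 1)

plt.plot(components, eigenvalues, marker='o')
plt.axhline(1.0, linestyle='--')   # Kaiser criterion: retain components with eigenvalues above one
plt.xlabel('Component number')
plt.ylabel('Eigenvalue')
plt.title('Scree plot of the six principal components')
plt.show()
```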
Rotated factors
The initial principal components which explain most of the variance in the variables are rotated to make their meaning clearer. There are various ways in which factors may be rotated. We shall discuss two of them.

The most common form of rotation is called varimax, in which the factors are unrelated or orthogonal to one another in that the scores on one factor are not correlated with the scores of the other factors. Varimax tries to maximize the variance explained by factors by increasing the correlation of variables that correlate highly with them and decreasing the correlation of variables that correlate weakly with them. The varimax rotation of the two principal components in Table 2.3 is presented in Table 2.5. Comparing the correlations in Table 2.5 with those in Table 2.3, we can see that some of the correlations have been increased while others have been decreased, making the structure of the rotated factors clearer. The first varimax rotated factor seems to represent depression, in that all three depression items load ±.79 or more on it, whereas all three anxiety items load ±.39 or less on it. The second varimax rotated factor appears to reflect anxiety in that all three anxiety items load ±.60 or more on it, whereas all three depression items load ±.27 or less on it. With many variables in an analysis it is useful to order the items in terms of decreasing size for each factor to see more clearly which variables load most highly on which factors.
Table 2.5  First two varimax rotated principal components

                  1      2
A1 Anxious       .05    .94
A2 Tense         .27    .85
A3 Calm         −.39   −.60
D1 Depressed     .92    .10
D2 Useless       .79    .24
D3 Happy        −.83   −.27
The proportion of variance explained by the two varimax rotated factors is the sum or eigenvalue of the squared loadings for each factor divided by the number of variables. These figures are presented in Table 2.6 for the two varimax rotated principal components in Table 2.5. The proportion of the total variance explained is .40 for the first factor and .35 for the second factor. These proportions are different from those for the initial unrotated principal components because of the change in the loadings of the variables on these factors.

Another method of rotation is called direct oblimin, in which the factors are allowed to be correlated or oblique to one another. There are two ways of presenting the results of an oblique rotation. The first is a pattern matrix, which is the one usually presented. This shows the unique contribution that each variable makes to each factor but not the contribution that is shared between factors if the factors are correlated. The structure matrix indicates the overall contribution that each variable makes to a factor.
Table 2.6  Proportion of total variance explained by the first two varimax rotated principal components

                  1      2
A1 Anxious       .00    .88
A2 Tense         .07    .72
A3 Calm          .15    .36
D1 Depressed     .84    .01
D2 Useless       .62    .06
D3 Happy         .69    .07
Eigenvalues     2.37   2.10
Proportion       .40    .35
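For readers curious about the computation, the varimax criterion can be applied with a short numpy routine (a sketch of the standard algorithm, not the book's SPSS procedure; `loadings` is the matrix from the earlier principal components sketch, and the rotated columns may come out in a different order or with reversed signs compared with Tables 2.5 and 2.6):

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonally rotate a loading matrix to the varimax criterion."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion, solved through a singular value decomposition
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p)
        )
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation

rotated = varimax(loadings[:, :2])               # rotate only the two retained components
print(np.round(rotated, 2))                      # compare with Table 2.5
print(np.round((rotated ** 2).sum(axis=0), 2))   # sums of squared loadings, as in Table 2.6
```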
Table 2.7  Pattern and structure matrix of the first two direct oblimin rotated principal components

                  Pattern matrix     Structure matrix
                    1       2           1       2
A1 Anxious        −.16     .99         .26     .93
A2 Tense           .09     .85         .45     .88
A3 Calm           −.28    −.55        −.51    −.67
D1 Depressed       .97    −.11         .92     .29
D2 Useless         .79     .07         .82     .40
D3 Happy          −.83    −.09        −.87    −.44
Proportion         .39     .34         .47     .42
If the factors are uncorrelated, these two matrices should be similar and it would be simpler and more appropriate to carry out a varimax rotation. If the factors are correlated, it is not meaningful to present the amount of the total variance that each factor accounts for, as the pattern matrix will provide an underestimate and the structure matrix an overestimate. The pattern and structure matrix for the two direct oblimin rotated principal components in Table 2.3 are shown in Table 2.7. The proportion accounted for by these rotated factors is also displayed to show that they differ, although they are usually not presented. The two oblique factors are correlated about 0.42. Because the two factors are correlated, the loadings in the pattern and structure matrix differ. The loadings of the pattern matrix are somewhat easier to interpret than those in the structure matrix, although the results are similar.

The first direct oblimin rotated factor seems to reflect depression because, in the pattern matrix, all three depression items load ±.79 or above on it whereas all three anxiety items load ±.28 or lower on it. The second direct oblimin rotated factor appears to represent anxiety, in that all three anxiety items in the pattern matrix load ±.55 or more on it, whereas all three depression items load ±.11 or less on it. In this respect, the results of the direct oblimin rotation are essentially the same as those of the varimax rotation.
Reporting the results
The way in which a principal components analysis is reported will depend to some extent on the reason for carrying it out. One of the more succinct ways of describing the results for the example used in this chapter is as follows: ‘A principal components analysis was conducted on the correlations
of the six items. Two components were extracted with eigenvalues of more than one. The factors were rotated with both varimax and direct oblimin, giving essentially similar results. The first factor seemed to reflect depression in that all three depression items loaded most highly on it. The second factor appeared to represent anxiety in that all three anxiety items loaded most highly on it. The two varimax factors accounted for about 40 and 35 per cent, respectively, of the total variance. The two direct oblimin factors correlated .42.’
SPSS Windows procedure
The following steps should be followed to produce the essential information for a principal components analysis of the data in Table 2.1. Enter the data in Table 2.1 into the Data Editor as shown in Box 2.1, label them and save them as a file. Select Analyze on the horizontal menu bar near the top of the window, which produces a drop-down menu. Select Data Reduction and then Factor..., which opens the Factor Analysis dialog box in Box 2.2. Select the variables Anxious to Happy and then the first ▶ button to put them in the box under Variables:. Select Descriptives..., which opens the Factor Analysis: Descriptives subdialog box in Box 2.3. Select Coefficients (under Correlation Matrix) to produce a correlation matrix like that shown in Table 2.2. This matrix is the first table in the output, is called Correlation Matrix and is square rather than triangular. The correlations are rounded to three rather than two decimal places. Select Continue to return to the Factor Analysis dialog box in Box 2.2.
Box 2.1 Data in Data Editor
Box 2.2 Factor Analysis dialog box
Box 2.3 Factor Analysis: Descriptives subdialog box
Select Extraction... to open the Factor Analysis: Extraction subdialog box in Box 2.4. Select Scree plot (under Display) to produce the scree plot in Figure 2.1. With other data sets, the analysis may need more than 25 iterations to converge. If so, increase the Maximum iterations for Convergence to, say, 50 or 100 at first and more if necessary.
Box 2.4 Factor Analysis: Extraction subdialog box
If you want to retain fewer or more factors than those given by the Kaiser criterion of eigenvalues greater than one, specify the number in Number of factors (under Extract). The percentage of the variance explained by the principal components that have been retained is shown in the table labelled Total Variance Explained, in the third and fifth columns headed % of Variance (under Initial Eigenvalues and Extraction Sums of Squared Loadings, respectively). The loadings of the retained principal components shown in the second and third columns of Table 2.3 are displayed in the output in the table labelled Component Matrixa, rounded to three decimal places. Select Continue to return to the Factor Analysis dialog box. Select Rotation... to open the Factor Analysis: Rotation subdialog box in Box 2.5. Only one rotation procedure can be carried out during an analysis. Select Varimax to produce the data in Table 2.5, which will be presented in the SPSS output under the heading Rotated Component Matrixa. Some of the values will be rounded to three decimal places, whereas others will be expressed in scientific notation. For example, the correlation between Anxious and the first rotated factor is displayed as 5.237E-02 in scientific notation. The -02 means that the decimal place has to be moved two places to the left to give 0.05. The proportion of the total variance explained by orthogonally rotated factors is displayed in the table labelled Total Variance Explained, in the ninth column headed % of Variance (under Rotation Sums of Squared Loadings). In a subsequent analysis, select Direct Oblimin to produce the pattern and structure matrices in Table 2.7. The two tables in the output are called Pattern Matrixa and Structure Matrix, respectively.
Box 2.5 Factor Analysis: Rotation subdialog box
The correlation between the two oblique factors is shown in the output in the table labelled Component Correlation Matrix and is given to three decimal places. Select Continue to return to the Factor Analysis dialog box. If you want to list the loadings according to their decreasing size, select Options... to open the Factor Analysis: Options subdialog box in Box 2.6. Select Sorted by size (under Coefficient Display Format). Select Continue to return to the Factor Analysis dialog box. Select OK to run the analysis.
Box 2.6 Factor Analysis: Options subdialog box
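For readers who want to see what the Extraction step computes, the following Python sketch (an illustration under the assumption that principal components are obtained from an eigendecomposition of the correlation matrix; it is not the SPSS program itself) extracts the principal components of the six-item correlation matrix in Table 2.2, applies the Kaiser criterion and forms the unrotated loadings. The signs of the loadings may be reversed relative to Table 2.3, as the sign of a component is arbitrary.

import numpy as np

# Correlation matrix of the six items (Table 2.2), written out in full.
R = np.array([
    [1.00,  .74, -.50,  .22,  .28, -.25],
    [ .74, 1.00, -.40,  .30,  .39, -.54],
    [-.50, -.40, 1.00, -.37, -.43,  .41],
    [ .22,  .30, -.37, 1.00,  .65, -.74],
    [ .28,  .39, -.43,  .65, 1.00, -.53],
    [-.25, -.54,  .41, -.74, -.53, 1.00],
])

eigenvalues, eigenvectors = np.linalg.eigh(R)      # eigh returns ascending order
order = np.argsort(eigenvalues)[::-1]              # re-order from largest to smallest
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

retained = eigenvalues > 1                         # Kaiser criterion
loadings = eigenvectors[:, retained] * np.sqrt(eigenvalues[retained])

print(np.round(eigenvalues, 2))                    # the values plotted in the scree plot
print(np.round(100 * eigenvalues / 6, 1))          # % of variance per component
print(np.round(loadings, 2))                       # unrotated loadings (cf. Table 2.3)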
Recommended further reading

Child, D. (1990) The Essentials of Factor Analysis, 2nd edn. London: Cassell. This is a relatively short, non-technical introduction to factor analysis.

Kline, P. (1994) An Easy Guide to Factor Analysis. London: Routledge. This is a longer, non-technical introduction to factor analysis with a useful illustration of how to calculate principal components.

Pedhazur, E.J. and Schmelkin, L.P. (1991) Measurement, Design and Analysis: An Integrated Approach. Hillsdale, NJ: Lawrence Erlbaum Associates. Chapter 22 is a concise but more advanced treatment of factor analysis with comments principally on the procedure used and output produced by SPSS 3.0.

SPSS Inc. (2002) SPSS Base 11.0 User’s Guide Package. Upper Saddle River, NJ: Prentice-Hall. This provides a detailed commentary on the output produced by SPSS 11.0 as well as a useful introduction to factor analysis.

Tabachnick, B.G. and Fidell, L.S. (1996) Using Multivariate Statistics, 3rd edn. New York: HarperCollins. Chapter 13 is a systematic and valuable exposition of factor analysis that compares four different programs, including SPSS 6.0, and shows how a factor analysis should be written up.
3
Confirmatory factor analysis
Whereas exploratory factor analysis is used to determine what is the most likely factor structure for the relationships between a set of variables, confirmatory factor analysis is used to test whether a particular or hypothesized factor structure is supported or confirmed by the data. If the data do not support or fit the postulated factor structure, the data will differ significantly from the assumed factor model. If the data support the model, the data will not differ significantly from the model. More than one model of the data can be tested to establish which model may provide the most appropriate or best explanation of the data. For example, we may wish to determine whether the correlations between six items thought to measure either depression or anxiety are best represented by one general factor or by two factors that can either be related or unrelated. In the one-factor model all six items would load on a single factor, whereas in the two-factor model the depression items would load on only one of the factors while the anxiety items would load only on the other factor. The two factors can be allowed to be related or unrelated to one another. Table 3.1 illustrates which items load (represented by 1) and do not load (represented by 0) on the one- and two-factor models.
Table 3.1 One- and two-factor models for six items

                  One factor      Two factors
                      1            1       2
A1 Anxious            1            1       0
A2 Tense              1            1       0
A3 Calm               1            1       0
D1 Depressed          1            0       1
D2 Useless            1            0       1
D3 Happy              1            0       1
Path diagrams
Another way of graphically displaying the factor model being tested is with a path diagram. We will illustrate the use of confirmatory factor analysis with the example used in the previous chapter on anxiety and depression so that you can compare the results of the two methods. The path diagrams for the one-factor model, the unrelated two-factor model and the related two-factor model are presented in Figures 3.1, 3.2 and 3.3, respectively. These path diagrams have been created with a computer program called LISREL, which is an abbreviation for LInear Structural RELationships.
Figure 3.1 Path diagram for a one-factor model
Figure 3.2 Path diagram for an unrelated two-factor model
Figure 3.3 Path diagram for a related two-factor model
When expressed as general models, they are usually presented without the numerical values shown that are to be calculated or estimated. In addition, the line of information at the bottom of each model is not included. This information gives two measures of the extent to which the model fits the data. Lines with arrows are known as pathways. The items are shown as rectangles, while the factors are portrayed as ellipses. A relationship between an item and a factor is indicated by an arrow from the ellipse to the rectangle. So, for example, in Figure 3.2, there is an arrow between the Anxious item and the Anxiety factor, showing that this item loads or is related to this factor. There are no arrows between the Anxious item and the Depression factor because this item is assumed not to load on this factor. LISREL only prints the first eight characters of a label, so the last two letters of the label for this factor are missing. The arrow points from the factor to the item rather than from the item to the factor to indicate that the factor is expressed in terms of the item. Factors are sometimes called latent variables, as they cannot be measured directly, while items may be referred to as indicators or manifest variables. The values next to these arrows may be loosely thought of as the correlation or loading between the item and the factor in that they generally vary between −1 and +1. So, in Figure 3.2, the loading between the Anxious item and the Anxiety factor is 0.96. The arrows on the left of each item and pointing to that item indicate that each item is not a perfect measure of the factor it is assumed to reflect. The value next to each arrow is the proportion of variance that is assumed to be error. So, in Figure 3.2, the proportion of the variance in the scores of the Anxious item that is thought to represent error is 0.08. The line and value on the right of each factor are often not shown in path diagrams and indicate that the variance of these factors has been standardized as 1.00. The curved line with an arrowhead at both ends, such as that between the Anxiety and Depression factors in Figure 3.3, signifies that these two variables are thought to be related but that the causal direction of this relationship cannot be specified. The value next to this line is the correlation between the two variables, which in this case is .48.
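The role of the loadings, error variances and factor correlation can be illustrated with a short Python sketch. The loadings below are illustrative placeholders rather than the values estimated in Figures 3.1 to 3.3 (only the 0.96 loading and the .48 factor correlation are quoted above); the point is to show how a related two-factor model implies an error variance of one minus the explained variance for each item, and a complete correlation matrix that can be compared with the observed one.

import numpy as np

# Illustrative related two-factor model. Lambda holds the loadings of the six
# items on the Anxiety and Depression factors and Phi the factor correlations.
# These numbers are placeholders, not the estimates shown in Figure 3.3.
Lambda = np.array([
    [ .96,  .00],   # Anxious
    [ .85,  .00],   # Tense
    [-.60,  .00],   # Calm
    [ .00,  .90],   # Depressed
    [ .00,  .75],   # Useless
    [ .00, -.85],   # Happy
])
Phi = np.array([[1.00,  .48],
                [ .48, 1.00]])

# With standardized items, the error (unique) variance of each item is one
# minus the variance the factors explain in it, e.g. 1 - 0.96**2 = 0.08.
theta = 1 - np.diag(Lambda @ Phi @ Lambda.T)

# Correlation matrix implied by the model: Lambda * Phi * Lambda' + Theta-delta.
implied = Lambda @ Phi @ Lambda.T + np.diag(theta)

print(np.round(theta, 2))
print(np.round(implied, 2))   # the matrix the chi-square test compares with the data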
Goodness-of-fit indices
Chi-square

The loadings of the items are generally higher for the two-factor models than for the one-factor model, suggesting that the two-factor models may provide a better fit to the data. A large number of measures of the extent to which a model fits the data have been suggested and there is little agreement as to
which ones are the most appropriate. There is only space to outline a few of these. One of the more popular ones is the first one shown in the line of information at the bottom of each path diagram, which is called chi-square for short. This test is similar to that which compares the observed frequencies in a contingency table with the frequencies expected by chance. This test is discussed further in Chapter 14 on log-linear analysis. The bigger the difference between the observed and expected frequencies, the bigger chi-square will be and the more likely that the differences will not be due to chance. Chi-square will also be bigger the more categories there are in the contingency table, so an adjustment has to be made to chi-square to take the number of categories into account. This is done by noting the degrees of freedom. The chi-square test in confirmatory factor analysis compares the correlation matrix of the original data with the correlation matrix that is produced by the model. The bigger the difference between the original and the model matrix, the bigger chi-square will be and the more likely that the difference will not be due to chance. If the difference cannot be considered due to chance, then the model does not provide an adequate fit to the data. Once again, chi-square is affected by the number of values that have to be estimated in the model. The more such values a model is free to estimate, the more closely it can be made to fit the data and so the smaller chi-square will tend to be. Consequently, the value of chi-square has to take into account the number of such values. This is done through the degrees of freedom, which is the number of observed variances and covariances minus the number of values or parameters to be estimated. The number of observed variances and covariances is given by the general formula n(n + 1)/2, where n is the number of observed variables. As we have six observed variables in all three models, the number of observed variances and covariances is 6(6 + 1)/2 or 21. This can be checked by counting the number of values in Table 2.2, including the 1.00s in the diagonal. The number of values or parameters to be estimated is the number of pathways, which is 12 for the model in Figures 3.1 and 3.2 and 13 for the model in Figure 3.3. Thus the degrees of freedom is 9 (21 − 12 = 9) for the first two models and 8 (21 − 13 = 8) for the third model. The probability of chi-square being significant is less than the .05 two-tailed level for the first two models, which suggests that neither of these adequately fits the data. The chi-square value for the third model is the lowest, implying that this model may fit the data best.

Chi-square difference test for nested models

A model which has more values or parameters to be estimated will generally provide a better fit to the data than one which has fewer values. In the extreme case, a model which has as many parameters as the original data will in effect be the original data and so will reproduce those data exactly. In
this case, chi-square will be zero and the model will provide a perfect fit to the data. One aim in the socio-behavioural sciences is to find the model that has the fewest parameters, as this provides the simplest or most parsimonious explanation of the data. Of our three models, the one with the most parameters is the related two-factor model. This model also seems to provide the best fit to the data in that it has the smallest chi-square value. Whether the fit of two models differs significantly can also be determined with a chi-square test if one or more of the parameters of the model that is being compared can be dropped. For example, the fit of the related two-factor model can be compared with the fit of the unrelated two-factor model because the correlation between the two factors in the related two-factor model has been removed in the unrelated two-factor model. The unrelated two-factor model is called a nested model because it can be derived from the related two-factor model by removing the correlation between the two factors. If the larger model fits the data significantly better than the nested model, the larger model provides the more appropriate account of the data. If the fit of the two models does not differ significantly, then the model with the fewer parameters provides a simpler or more parsimonious model of the data. The fit of the one-factor model can also be compared with the fit of the related two-factor model because the two factors are assumed to be the same in the sense that they are perfectly related. The one-factor model cannot be compared with the unrelated two-factor model because both these models have the same number of parameters. To test whether the fit of two models differs, the chi-square of the model with the larger number of parameters is subtracted from the chi-square of the nested model to produce the chi-square difference. Because the size of chi-square is affected by the degrees of freedom of a model, the chi-square difference has to take account of the difference in the degrees of freedom for the two models. The degrees of freedom of the larger model are subtracted from the degrees of freedom of the nested model to give the degrees of freedom difference. The statistical significance of the chi-square difference can be looked up in a table of the critical values of chi-square against the difference in the degrees of freedom. Table 3.2 presents these calculations for comparing the related two-factor model with the unrelated two-factor model and with the one-factor model, respectively.
Table 3.2 Chi-square difference test for two comparisons

Comparisons    χ² difference              df difference
1 vs 3         96.73 − 46.15 = 50.58      9 − 8 = 1
2 vs 3         58.78 − 46.15 = 12.63      9 − 8 = 1
The difference in the degrees of freedom (df) for both comparisons is 1 (9 − 8 = 1). The difference in chi-square (χ²) for the first comparison is 50.58 (96.73 − 46.15 = 50.58) and for the second is 12.63 (58.78 − 46.15 = 12.63). With one degree of freedom, chi-square has to be 3.84 or larger to be statistically significant at the .05 two-tailed level. As both chi-square differences are larger than 3.84, the related two-factor model provides a significantly better fit than either the unrelated two-factor model or the one-factor model.

Root mean square error of approximation

The second measure of fit that is shown in the last line of information in each path diagram in Figures 3.1 to 3.3 is the root mean square error of approximation (RMSEA). A problem with the chi-square test for confirmatory factor analysis is that it is more likely to be statistically significant the larger the sample is. In general, bigger samples are preferable to smaller ones as they are more likely to provide more accurate estimates of the parameters. In our example, the problem that larger samples are more likely to produce statistically significant results does not really apply, as a sample of 100 is not considered to be large. But in other cases chi-square may not be significant in a sample of this size but may be significant in larger samples. Because of this problem, other measures of fit have been developed that depend less on sample size. One of these is the root mean square error of approximation. Values below 0.100 are considered to be indicative of good fit (Loehlin 1998). The value of this statistic is above 0.100 for all three models, implying that none of them offers a good fit to the data. In this case, both chi-square and the root mean square error of approximation agree in showing that none of the three models provides a satisfactory fit.
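The chi-square difference tests described above can be checked with a few lines of Python using scipy; this sketch simply reproduces the 3.84 critical value and computes exact p values for the two differences in Table 3.2.

from scipy.stats import chi2

# Chi-square differences and degrees of freedom from Table 3.2.
comparisons = {
    "one-factor vs related two-factor": (96.73 - 46.15, 9 - 8),
    "unrelated vs related two-factor": (58.78 - 46.15, 9 - 8),
}

print(round(chi2.ppf(0.95, df=1), 2))   # critical value with 1 df, about 3.84

for label, (difference, df) in comparisons.items():
    p = chi2.sf(difference, df)         # probability of a difference this large by chance
    print(label, round(difference, 2), df, round(p, 4))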
Reporting the results
It is only possible to give a brief indication here of the kinds of information that may be included in reporting the results of the analysis for our example. A very concise report might be as follows: ‘The statistical fit of the one-factor, the unrelated two-factor and the related two-factor models was compared using the maximum likelihood estimation method of confirmatory factor analysis on LISREL 8.51. The results for two of the goodness-of-fit measures are presented in Table 3.3 for the three models. Both measures indicate that neither of the first two models provides a satisfactory fit, as chi-square is statistically significant and the root mean square error of approximation is larger than 0.100. The chi-square difference tests show that the related two-factor model offers a significantly better fit than either
Table 3.3 Two goodness-of-fit measures for the three models

Model                    χ²       df     p       RMSEA
One-factor               96.73     9     .001    .314
Unrelated two-factor     58.78     9     .001    .236
Related two-factor       46.15     8     .052    .167
the unrelated two-factor model (χ² = 12.63, df = 1, p < .001) or the one-factor model (χ² = 50.58, df = 1, p < .001). The standardized maximum likelihood estimates for the parameters of this model are shown in Figure 3.3.’
LISREL procedure
There are several different programs for running confirmatory factor analysis. We will restrict ourselves to demonstrating one of the more popular programs called LISREL. To run one or more of the three analyses illustrated in this chapter, we need to access LISREL. After we have done this, we then need to create a syntax file containing rows or lines of instructions. To do this, we select File from the menu bar near the top of the LISREL window, which produces a drop-down menu. From this menu we select New, which opens the New dialog box. We select Syntax only, which opens a Syntax window into which we type the instructions. When we have finished typing the instructions, we select File and, from the drop-down menu, Run LISREL. If we have not typed in the instructions correctly, the program will produce output which should indicate where the initial problems lie. If the instructions have been correctly typed in, the path diagram will be displayed first. To look at the other output which accompanies this diagram, select Window from the menu bar near the top of the window and select the file ending in .OUT.

One-factor model LISREL procedure

The following instructions need to be typed into the syntax file and run to produce the results for the one-factor model:

CFA: 1 factor
DAta NInputvar=6 NObserv=100
LAbels
Anxious Tense Calm Depressed Useless Happy
KMatrix
1.00
.74 1.00
−.50 −.40 1.00
.22 .30 −.37 1.00
.28 .39 −.43 .65 1.00
−.25 −.54 .41 −.74 −.53 1.00
MOdel NXvar=6 NKvar=1 PHi=FIxed TDelta=DIagonal
FRee LX(1,1) LX(2,1) LX(3,1) LX(4,1) LX(5,1) LX(6,1)
STartval 1 PHi(1,1)
LKvar
Distress
PDiagram
OUtput
We will briefly describe what these instructions do. For further details it is necessary to use the Help option on the menu bar or the manual for the LISREL program. It is useful to give the instructions or program a brief title, as has been done here. The first two characters of this title should not consist of the letter d followed by the letter a, which is the two-letter name or keyword for describing aspects of the data. Instruction keywords can be in upper- or lower-case or any combination of upper- and lower-case. We have followed the usual convention of presenting them in upper-case. The instructions can consist of a series of two-letter names. However, because LISREL ignores all the letters that immediately follow the first two letters, we have used lower-case letters to provide more information about what the name means and does. The line beginning with DA states that the number of input variables that are to be read is 6 and that the number of observations or cases in the data is 100. Rather than read in the raw data, we can read in the data as a correlation matrix. If we do this, we need to tell LISREL the number of cases this correlation matrix is based on. Although the number of cases on which this matrix was based was 9, we have increased it to 100 to increase the probability that the data will differ significantly from the model being tested. Being able to enter the data as a correlation matrix is very useful in analysing data that may not be readily accessible in their raw form, such as those in published studies. The line starting with LA shows that we are going to label the six input variables. Although labelling is optional, it is useful in reminding us what the variables are. The labels are entered on the next line. Only the first eight characters of each label are printed in the output, so Depressed will be shortened to Depresse. The line starting with KM indicates that the data are to be read in as a correlation matrix. The next six lines represent the lower triangular correlation matrix shown in Table 2.2.
The line beginning with MO gives details of the model to be tested. The Number of X variables or items is 6. The Number of K variables or factors is 1. The next instruction is difficult to describe briefly. Together with a later instruction it enables the variance of the K variable or factor to be standardized as 1. The variance–covariance matrix of the K variables or factors is called PHi, which is FIxed so that its values can be specified rather than estimated. This is done with the line beginning with ST, which fixes the starting value of the factor as 1. Although there is only one element in the lower triangular phi matrix, this element needs to be identified, which is done by the numbers in parentheses. The first number refers to the row number the element is in, while the second number refers to the column number the element is in. The final instruction on the line starting with MO states that the lower triangular correlation matrix of the errors of the six items, called Theta Delta, only consists of a diagonal. This model assumes that the errors for the six items are not correlated with one another. The line beginning with FR specifies which of the six items, now called the Lambda X variables, are to be freed so that their factor loadings can be estimated. In this model, it is assumed that all six items load on one factor, which is represented by six rows on one column. The position of these items is indicated by the two values in parentheses. The first value refers to the row number, while the second value refers to the column number. The line opening with LK is optional and enables us to Label the K variable or factor, which we do on the next line. We have called this factor Distress. The line starting with PD provides us with the path diagram shown in Figure 3.1. The final line starting with OU gives us the output.

One-factor model LISREL output

We will only describe some of the output that is displayed. LISREL starts off by more or less reproducing the syntax file we have run, followed by the correlation matrix (called a covariance matrix) with the variables labelled. This output is not shown. The next output indicates which parameters of the model are to be estimated by numbering them and is presented in Table 3.4. For this model, 12 parameters will be estimated. LAMBDA-X will give the loadings or lambdas of the six items or x variables on the k or ksi factor. THETA-DELTA will provide the error or unique variance for the six items. The next output essentially presents the maximum likelihood estimates of these parameters and is shown in Table 3.5.
Table 3.4 LISREL output of parameter specifications for the one-factor model
So the loading of anxiousness on the one factor is 0.43, as shown under LAMBDA-X. In other words, about 0.43², or .18, of the variance in anxiousness is accounted for by this factor, leaving .82 as error or unexplained variance. The value in parentheses immediately under the loading is its standard error, which is 0.10, while the value below that is its t-value, which tests whether the loading differs significantly from zero. The t-value is 4.16. For a sample of 100, the t-value has to be about ±1.98 or larger in absolute value to be significant at the two-tailed .05 level. As 4.16 is bigger than 1.98, this loading is statistically significant at or below the two-tailed .05 level. The variance of this factor has been set as 1.00, as shown under PHI. The error variance of the six items is given under THETA-DELTA. The error variance for anxiousness is 0.82. When an item loads on only one factor, as in this example, the Squared Multiple Correlation shown in the final part of Table 3.5 is the loading of the item squared. For anxiousness, this is 0.43² or 0.18. The last major part of the output is the goodness-of-fit statistics for the model tested, which is shown in Table 3.6. The chi-square value shown in Figure 3.1 is the second one listed and is called the Normal Theory Weighted Least Squares Chi-Square. Its value is 96.73. The degrees of freedom are given two lines above it and are 9. The Root Mean Square Error Of Approximation is displayed on the ninth line of information and is 0.31. Further information about the other indices of fit can be found in Loehlin (1998).
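The arithmetic linking the loading, the error variance and the t-value can be sketched in a few lines of Python. The degrees of freedom used for the critical value here are an assumption (roughly the sample size minus two); the text simply quotes about 1.98 for a sample of 100.

from scipy.stats import t

loading = 0.43    # estimate shown under LAMBDA-X
t_value = 4.16    # value printed beneath the standard error

explained = loading ** 2          # squared multiple correlation, about .18
error_variance = 1 - explained    # THETA-DELTA, about .82

# Two-tailed .05 critical value; df = 98 is an assumption (roughly n - 2 for n = 100).
critical = t.ppf(0.975, df=98)    # about 1.98

print(round(explained, 2), round(error_variance, 2))
print(round(critical, 2), abs(t_value) > critical)   # True: the loading is significant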
Table 3.5 LISREL output of the maximum likelihood estimates of the parameters for the one-factor model
Table 3.6 LISREL output of the goodness-of-fit statistics for the one-factor model
Unrelated two-factor model LISREL procedure

The syntax program for producing the results for the unrelated two-factor model is as follows:

CFA: 2 unrelated factors
DAta NInputvar=6 NObserv=100
LAbels
Anxious Tense Calm Depressed Useless Happy
KMatrix
1.00
.74 1.00
−.50 −.40 1.00
.22 .30 −.37 1.00
.28 .39 −.43 .65 1.00
−.25 −.54 .41 −.74 −.53 1.00
MOdel NXvar=6 NKvar=2 PHi=FIxed TDelta=DIagonal
FRee LX(1,1) LX(2,1) LX(3,1) LX(4,2) LX(5,2) LX(6,2)
STartval 1 PHi(1,1) PHi(2,2)
LKvar
Anxiety Depression
PDiagram
OUtput
Only the main differences between this program and the previous one will be commented on. The model consists of two K variables or factors, so on the MOdel instruction NKvar is set to 2. The first three LX variables (anxious, tense and calm) load on the first factor, whereas the second three LX variables (depressed, useless and happy) load on the second factor. The location of these variables is indicated by the two values in parentheses, the first number referring to the variable or row and the second number to the factor or column. So a second value of 2 signifies that the variable loads on the second factor. The variances of the two factors, PHi(1,1) and PHi(2,2), are set at 1. Because PHi is Fixed, the covariance between the two factors, which is represented by PHi(2,1), is fixed at zero. The two K variables or factors are labelled Anxiety and Depression, respectively.

Unrelated two-factor model LISREL output

The path diagram for this model is presented in Figure 3.2. Only selected aspects of the other output will be presented and discussed. Table 3.7 shows the parameters to be estimated for this model. There are 12 of these, the same number as in the previous model.
Table 3.7 LISREL output of the parameter specifications for the unrelated two-factor model
Table 3.8 displays the maximum likelihood estimates of the loadings of the first three items on the first factor and the second three items on the second factor.

Related two-factor model LISREL procedure

The instructions for producing the results for the related two-factor model are as follows:

CFA: 2 related factors
DAta NInputvar=6 NObserv=100
LAbels
Anxious Tense Calm Depressed Useless Happy
KMatrix
1.00
.74 1.00
−.50 −.40 1.00
.22 .30 −.37 1.00
.28 .39 −.43 .65 1.00
−.25 −.54 .41 −.74 −.53 1.00
MOdel NXvar=6 NKvar=2 PHi=FIxed TDelta=DIagonal
FRee LX(1,1) LX(2,1) LX(3,1) LX(4,2) LX(5,2) LX(6,2)
STartval 1 PHi(1,1) PHi(2,2)
FRee PHi(2,1)
LKvar
Anxiety Depression
PDiagram
OUtput
Table 3.8 LISREL output for the maximum likelihood estimates of some of the parameters for the unrelated two-factor model
Only the main difference between this and the previous program will be commented on. To allow the two K variables or factors to covary, we have to FRee PHi(2,1).

Related two-factor model LISREL output

The path diagram for this model is shown in Figure 3.3. The number of parameters to be estimated is now 13, as we have to estimate the covariance between the two factors, as shown under PHI in Table 3.9.
Table 3.9 LISREL output of the parameter specifications for the related two-factor model
The maximum likelihood estimate for the covariance between the two factors is 0.48, as shown in Table 3.10.

Recommended further reading

Jöreskog, K.G. and Sörbom, D. (1989) LISREL 7: A Guide to the Program and Applications, 2nd edn. Chicago, IL: SPSS Inc. Chapter 3 in particular. This earlier version of the manual seems to be more comprehensive than subsequent ones and provides useful examples of confirmatory factor analysis together with the instructions for carrying them out.

Loehlin, J.C. (1998) Latent Variable Models: An Introduction to Factor, Path, and Structural Analysis, 3rd edn. Mahwah, NJ: Lawrence Erlbaum Associates. Chapter 2 in particular. This is a clear, non-technical exposition of confirmatory factor analysis, including the numerous goodness-of-fit indices that have been proposed.

Pedhazur, E.J. and Schmelkin, L.P. (1991) Measurement, Design and Analysis: An Integrated Approach. Hillsdale, NJ: Lawrence Erlbaum Associates. Chapter 23 is a relatively concise introduction to confirmatory factor analysis, which includes instructions for carrying them out mainly with LISREL 7.16 and comments on the output produced.
Table 3.10 LISREL output for the maximum likelihood estimates of some of the parameters for the related two-factor model
Stevens, J. (1996) Applied Multivariate Statistics for the Social Sciences, 3rd edn. Mahwah, NJ: Lawrence Erlbaum Associates. The latter half of chapter 11 describes the procedure for carrying out a confirmatory factor analysis with LISREL and EQS.

Tabachnick, B.G. and Fidell, L.S. (1996) Using Multivariate Statistics, 3rd edn. New York: HarperCollins. Chapter 14 is a general introduction to structural equation modelling but includes useful information on confirmatory factor analysis.
4
Cluster analysis
Another method for seeing whether variables group together is cluster analysis, although it is much less widely used than factor analysis, at least in the socio-behavioural sciences. It is sometimes advocated more as a method for determining how cases rather than variables group together, but both factor analysis and cluster analysis can be used to ascertain the way that either cases or variables group together. Like factor analysis, there are several different methods for carrying out cluster analysis and there is little agreement as yet as to which are the most appropriate methods. To enable the results of a cluster analysis to be compared with those of a factor analysis, we will illustrate the steps involved in a cluster analysis with the same data that were used to explain factor analysis, namely the scores of nine individuals on the three depression and three anxiety items presented in Table 2.1.
Proximity measures and matrix
The first stage in a cluster analysis is to decide how the similarity or proximity between variables is to be measured. One such measure is a correlation. The more similar the scores are on two variables, the more highly positive those two variables will be correlated. The more dissimilar the scores are, the more highly negative the correlation will be. Perhaps
the most frequently used measure of proximity is the squared Euclidean (i.e. straight-line) distance between two variables. This is simply the sum of the squared differences between the scores on two variables for the sample of cases. Differences are squared partly so that this measure is not affected by the sign or direction of the difference. In other words, a distance of −2 is the same as one of +2. Table 4.1 shows the calculation of the squared Euclidean distance between the scores of the anxiety and tense items, which is 8. These squared Euclidean distances are usually tabulated in the form of a triangular matrix as shown in Table 4.2. We can see that the pairs of variables that are most similar are anxious and tense, and depressed and useless, which both have a squared Euclidean distance of 8. The pair of variables that are most dissimilar are tense and happy, which has a squared Euclidean distance of 56.
Table 4.1 Squared Euclidean distance between the scores of the anxiety and tense items

Cases    Anxious    Tense    Difference    Difference²
1        2          1         1            1
2        1          2        −1            1
3        3          3         0            0
4        4          4         0            0
5        5          5         0            0
6        4          5        −1            1
7        4          3         1            1
8        3          3         0            0
9        3          5        −2            4
                                    Sum =  8
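The same calculation can be written in a couple of lines of Python; this sketch reproduces the squared Euclidean distance of 8 between the anxious and tense items from the scores in Table 4.1.

import numpy as np

# Scores of the nine cases on the anxious and tense items (Table 4.1).
anxious = np.array([2, 1, 3, 4, 5, 4, 4, 3, 3])
tense   = np.array([1, 2, 3, 4, 5, 5, 3, 3, 5])

difference = anxious - tense
squared_distance = int(np.sum(difference ** 2))   # sum of squared differences

print(difference)          # [ 1 -1  0  0  0 -1  1  0 -2]
print(squared_distance)    # 8, as in Table 4.1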
Table 4.2 Proximity matrix of squared Euclidean distances for the anxiety and depression items

             Anxious   Tense   Calm   Depressed   Useless   Happy
Anxious
Tense           8
Calm           25        31
Depressed      18        20     23
Useless        16        18     21        8
Happy          38        56     15       52         42
Hierarchical agglomerative clustering
One of the more widely used methods for forming clusters is hierarchical agglomerative clustering. In hierarchical agglomerative clustering, there are initially as many clusters as there are variables. Forming clusters occurs in a series or hierarchy of stages. At the first stage, the two variables that have the shortest distance between them are grouped together to form one cluster. For example, depressed may be grouped with useless. At the second stage, either a third variable is added or agglomerated to the first cluster containing the two variables or two other variables are grouped together to form a new cluster. For example, anxious may be grouped with depressed and useless or with tense. At the third stage, two variables may be grouped together, a third variable may be added to an existing group of variables or two groups may be combined. So, at each stage only one new cluster is formed. At the final stage, all the variables are grouped into a single cluster.
Average linkage between groups method for combining clusters
There are various methods for determining how the clusters are formed at each stage. One of the most commonly used methods is the average linkage between groups. At the first stage, the two variables that have the shortest distance between them are grouped together to form a cluster. In our example, the shortest distance is 8 and there are two pairs of variables that have this distance. So, the first cluster will consist of either anxious and tense or depressed and useless. Let us assume that the first cluster is depressed–useless. At the second stage, the shortest average distance between clusters is used as the criterion for forming the next cluster. We have only one group at this stage. The average distance between this group and, say, the calm variable would be the average of the distances of the following pairs of variables: depressed–calm and useless–calm. This average is 22 [(23 + 21)/2 = 22]. Thus, the average distance between groups or clusters is the average of the distances of each variable in one cluster paired with each variable in the other cluster. As we only have one variable in one cluster (calm) and two variables in the other cluster (depressed and useless), the average is only based on two pairs of variables (calm–depressed and calm–useless). In a similar way, we would work out the average distance between this two-variable group (depressed–useless) and each of the other three variables (anxious, tense and happy). We could present the distance indices between clusters after the first stage as a triangular proximity matrix, as shown in Table 4.3. As the shortest distance between two clusters is still 8, the next cluster to be formed is between anxious and tense. We proceed in this way until all the variables are in one cluster. The results of the cluster analysis of the depression and anxiety items are summarized in Table 4.4.
Table 4.3 Proximity matrix of distances between clusters after the first stage of clustering

Clusters             Anxious   Tense   Calm   Depressed–useless   Happy
Anxious
Tense                   8
Calm                   25        31
Depressed–useless      17        19     22
Happy                  38        56     15          47

Table 4.4 Hierarchical agglomerative cluster analysis

Stage   Clusters                               Cluster distance   Number of clusters
0       (A1) (T2) (C3) (D4) (U5) (H6)                                      6
1       (A1) (T2) (C3) (D4–U5) (H6)                    8                   5
2       (A1–T2) (C3) (D4–U5) (H6)                      8                   4
3       (A1–T2) (D4–U5) (C3–H6)                       15                   3
4       (A1–T2–D4–U5) (C3–H6)                         18                   2
5       (A1–T2–D4–U5–C3–H6)                           36                   1
To save space, each variable is identified by two characters. The first is the initial letter of the label, whereas the second is the order of the variable. So anxious is identified as A1, tense as T2 and so on. The first column enumerates the stage of the analysis. Initially, there are as many clusters as variables. At each stage one new cluster is formed. The second column shows which variables are clustered together at each stage. Parentheses represent a cluster. The cluster formed at a particular stage is the grouping that appears for the first time in that row. The first cluster to be formed is depressed and useless, the second anxious and tense and so on. The third column shows the distance between the two clusters that have been paired together at that stage. Where a cluster consists of more than one variable, this is the average distance between groups. The final column shows the number of clusters at each stage. At the last stage there is one cluster containing all six variables.
Agglomeration schedule
The results of a cluster analysis may be tabulated in terms of an agglomeration schedule, such as that shown in Table 4.5, which was produced by SPSS.
Table 4.5 SPSS output of an Agglomeration Schedule

                 Cluster Combined                    Stage Cluster First Appears
Stage    Cluster 1    Cluster 2    Coefficients    Cluster 1    Cluster 2    Next Stage
1        4            5              8.000         0            0            4
2        1            2              8.000         0            0            4
3        3            6             15.000         0            0            5
4        1            4             18.000         2            1            5
5        1            3             36.000         4            3            0
The first column numbers the stages of the analysis. The second and third columns show the two clusters paired at each stage. The clusters are identified by the original order in the data of the first variable in that cluster. So the first cluster to be formed consists of depressed and useless. Depressed is 4th in the list of variables and useless is 5th. The fourth cluster to be created is made up of a cluster containing anxious and tense and a cluster comprising depressed and useless. Anxious is 1st in the list of variables and depressed is 4th. The fourth column displays the distance or average distance between clusters. So the distance between the cluster of depressed and the cluster of useless which are paired at the 1st stage is 8.00. The fifth and sixth columns show the stage at which the first and second clusters were created, respectively. The first three clusters that are formed are made up of single variables that existed before clustering and this initial state is represented as 0. The cluster that is created at the 4th stage consists of one cluster that was formed at the 2nd stage (anxious–tense) and another cluster (depressed–useless) that was produced at the 1st stage. The final column presents the next stage at which a cluster is paired with another cluster. So the next stage at which the cluster of depressed–useless is paired with another cluster is at the 4th stage.
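The agglomeration schedule can be reproduced outside SPSS. The Python sketch below (an illustration using scipy rather than the SPSS procedure) feeds the squared Euclidean distances of Table 4.2 to average linkage between groups; the merge distances should match the coefficients 8, 8, 15, 18 and 36 in Table 4.5.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Squared Euclidean distances between the six items (Table 4.2); item order is
# Anxious, Tense, Calm, Depressed, Useless, Happy.
D = np.array([
    [ 0,  8, 25, 18, 16, 38],
    [ 8,  0, 31, 20, 18, 56],
    [25, 31,  0, 23, 21, 15],
    [18, 20, 23,  0,  8, 52],
    [16, 18, 21,  8,  0, 42],
    [38, 56, 15, 52, 42,  0],
], dtype=float)

Z = linkage(squareform(D), method='average')   # average linkage between groups
print(np.round(Z, 2))   # columns: clusters joined, merge distance, cluster size
# The merge distances should be 8, 8, 15, 18 and 36, as in Table 4.5. Passing Z to
# scipy.cluster.hierarchy.dendrogram would draw a tree like the one in Figure 4.1.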
Dendrogram
One way of presenting the results of a cluster analysis graphically is in terms of a dendrogram like that shown in Figure 4.1, which was produced by SPSS. Dendron is the Greek word for tree and the diagram is somewhat like the branches of a tree in the way in which they divide into two.
Figure 4.1 SPSS output of a dendrogram
The Labels on the left refer to variables and not CASEs, and the Numbers to the order of the variables. The first variables to be paired are depressed and useless, followed by anxious and tense and then calm and happy. The dashed vertical line numbered from 0 to 25 shows the distance or average distance between the clusters regraded from 1, the smallest distance, to 25, the largest distance. So, 1 represents the distance of 8 and 25 the distance of 36. The distance of 15 is 7 on this regraded scale [((15 − 8)/(36 − 8) × 24) + 1 = 7.00], while that of 18 is about 9.57 [((18 − 8)/(36 − 8) × 24) + 1 = 9.57].

Icicle plot
Another way of graphically displaying the results of a cluster analysis is in terms of a vertical icicle plot like that presented in Figure 4.2, which was produced by SPSS. The X’s may be thought of as a row of icicles hanging from a horizontal surface. The first column shows the Number of clusters, which starts with the final solution of one cluster and ends with the first cluster in which depressed is paired with useless. The variables are represented by X’s in all of the rows in the column that they head.
Figure 4.2 SPSS output of a vertical icicle plot
A cluster is shown by placing an X in the column or columns which separate each of the variables. So, in the final row of the plot, there is an X in the column that separates the depressed column from the useless column. As we move up the plot, a new cluster is formed at each row.
Choosing the number of clusters
In this example, we appear to have relatively little choice as to which is the appropriate number of clusters to select as providing a useful summary of the relationships between our variables. The choice seems to be between a two-cluster or a three-cluster solution. An argument could be made for either solution. There is no statistical criterion for making this choice. The three-cluster solution consists of clusters representing depression, anxiety and positive feelings respectively, whereas the two-cluster solution comprises clusters reflecting negative and positive feelings. One criterion may be to select the number of clusters at the point where there appears to be a large break in the average distance between clusters. As shown in Figure 4.1, this break appears to occur after the fourth cluster, in that the distances at which the third and fourth clusters are formed are relatively close together, whereas the final cluster is formed at a much greater distance.
Reporting the results
The most appropriate way of reporting the results of a cluster analysis will depend on the particular rationale for the analysis. A very concise report
may be worded as follows: ‘A proximity matrix of the squared Euclidean distances based on the responses to the six items was subjected to a hierarchical agglomerative cluster analysis using the average linkage between groups method to combine clusters. A dendrogram of the analysis is presented in Figure 4.1. In the two-cluster solution, the first cluster contains items of negative affect and the second cluster items of positive affect.’
SPSS Windows procedure
The following procedure should be used to carry out the cluster analysis described in this chapter. Enter the data in Table 2.1 into the Data Editor as shown in Box 2.1. If these data have been saved as a file, retrieve the file by selecting File, Open, Data..., the file’s name from the Open File dialog box and Open. Select Analyze on the horizontal menu bar near the top of the window, which produces a drop-down menu. Select Classify and then Hierarchical Cluster..., which opens the Hierarchical Cluster Analysis dialog box in Box 4.1. Select the variables Anxious to Happy and then the first ▶ button to put them in the box under Variable(s):.
Box 4.1 Hierarchical Cluster Analysis dialog box
Select Variables under Cluster to carry out a cluster analysis of the variables rather than the cases. Select Statistics..., which opens the Hierarchical Cluster Analysis: Statistics sub-dialog box in Box 4.2. Select Proximity matrix to produce a proximity matrix similar to the one shown in Table 4.2, except that this will be a square matrix in which distances are given to three decimal places. The Agglomeration schedule has already been selected. This produces the schedule shown in Table 4.5. Select Continue to return to the Hierarchical Cluster Analysis dialog box in Box 4.1. Select Plots..., which opens the Hierarchical Cluster Analysis: Plots sub-dialog box in Box 4.3.
Box 4.2 Hierarchical Cluster Analysis: Statistics sub-dialog box
Box 4.3 Hierarchical Cluster Analysis: Plots sub-dialog box
Box 4.4 Hierarchical Cluster Analysis: Method sub-dialog box
Select Dendrogram to produce the dendrogram displayed in Figure 4.1. Icicle has already been selected. This presents the icicle plot shown in Figure 4.2. Select Continue to return to the Hierarchical Cluster Analysis dialog box. Select Method..., which opens the Hierarchical Cluster Analysis: Method sub-dialog box in Box 4.4. This box shows the procedures carried out if we make no changes. These are known as the default procedures, which are the ones we need to use. The default Cluster Method: is the Between-groups linkage, which is the average linkage between groups method. The default Measure to be used in the analysis is the Squared Euclidean distance. Select Continue to return to the Hierarchical Cluster Analysis dialog box. Select OK to run the analysis.
Recommended further reading

Diekhoff, G. (1992) Statistics for the Social and Behavioral Sciences. Dubuque, IA: Wm. C. Brown. Chapter 17 offers a very brief but clear account of the type of cluster analysis described in this chapter.
Hair, J.F., Jr., Anderson, R.E., Tatham, R.L. and Black, W.C. (1998) Multivariate Data Analysis, 5th edn. Upper Saddle River, NJ: Prentice-Hall. Chapter 9 offers a clear and simple introduction to cluster analysis.

SPSS Inc. (2002) SPSS Base 11.0 User’s Guide Package. Upper Saddle River, NJ: Prentice-Hall. Provides a detailed commentary on the output produced by SPSS 11.0 and a useful introduction to cluster analysis.
Part 2 Explaining the variance of a quantitative variable
5
Stepwise multiple regression
Multiple regression is a statistical technique for determining what proportion of the variance of a continuous, preferably normally distributed, variable is associated with, or explained by, two or more other variables, taking into account the associations between those other variables. Suppose, for example, we wish to ascertain what variables are most closely associated with academic achievement in children. It is likely that a number of different factors are related to academic achievement, including the child’s intelligence, the child’s interest in school work, the parents’ interest in the child’s academic achievement, the teachers’ interest in the child’s academic achievement and so on. It is expected that these factors are related to one another so that more intelligent children may be more interested in school work and may have parents and teachers who are more interested in how well they do at school. Multiple regression can determine what proportion of the variance in children’s academic achievement is explained by these factors, taking into account that they may be interrelated and whether the proportion of variance explained by each of these factors is significantly greater than that expected by chance. There are two main ways in which multiple regression is used. One way is to determine which variables explain the greatest and significant proportions of the variance in the variable of interest and what these proportions are. This is most usually done with a method called stepwise multiple regression, which is the topic of this chapter. The other way is to determine whether a particular variable or set of variables explains a significant proportion of the variance in the variable of interest after certain variables have been taken into account and what this proportion is. This is carried out with a method called hierarchical multiple regression, which is covered in the next chapter.
Table 5.1 Scores of nine cases on five variables

Cases   Child’s achievement   Child’s ability   Child’s interest   Parents’ interest   Teachers’ interest
1             1                     1                 1                   2                   1
2             2                     3                 3                   1                   2
3             2                     2                 3                   3                   4
4             3                     3                 2                   2                   4
5             3                     4                 4                   3                   2
6             4                     2                 3                   2                   3
7             4                     3                 5                   3                   4
8             5                     4                 2                   3                   2
9             5                     3                 4                   2                   3
Predictors can be either quantitative variables such as the variables above or qualitative ones such as country of birth, ethnicity and religious affiliation. To determine the influence of qualitative variables such as these, we have to convert them into dummy variables. This procedure is described in Chapters 10–12 in the context of analysis of variance. We will illustrate stepwise multiple regression with the quantitative variables we have already introduced. Table 5.1 shows a small set of data we have made up for this example. For each of the variables, the scores range from 1 to 5, with higher scores indicating more of that quality. So, for example, higher scores on academic achievement reflect greater achievement. We would not carry out a multiple regression on such a small sample. The size of sample to be used will depend on various factors, such as the size of the correlations expected between the variable of interest and the other variables. Correlations are more likely to be significant the larger the sample. We will artificially increase the size of the sample by reproducing this data set 30 times so that the sample is increased from 9 to 270. Increasing the size of the sample in this way will not affect the size or direction of the correlations between the variables because the pattern of the data has not changed, but it will make the correlations significant.

Correlation matrix and the first predictor
A useful step in trying to understand what stepwise multiple regression does is to produce a correlation matrix of the variables in the analysis, as shown in Table 5.2.
Table 5.2 Triangular correlation matrix for the five variables

                        Child’s achievement   Child’s ability   Child’s interest   Parents’ interest   Teachers’ interest
Child’s achievement          1.00
Child’s ability               .59                 1.00
Child’s interest              .44                  .42               1.00
Parents’ interest             .30                  .30                .29               1.00
Teachers’ interest            .28                  .07                .47                .27                1.00
The variable that we want to explain is called the dependent or criterion variable. The variables that we use to explain or predict the criterion variable are known as independent or predictor variables. The predictor variable that has the highest correlation with the criterion variable is always entered first into the regression analysis if the correlation is statistically significant. The predictor variable that has the highest correlation with the child’s academic achievement is the child’s intellectual ability. This correlation is .59 and is statistically significant at below the .05 two-tailed level. Consequently, the child’s intellectual ability will be entered first into the multiple regression. To obtain the proportion of the variance in academic achievement that the child’s ability explains, we simply square this correlation, which is about .34 (.59² = .34). In other words, about 34 per cent of the variance in the children’s academic achievement is explained by their intellectual ability. As the sign of the correlation between academic achievement and the child’s ability is positive, this means that more intelligent children do better at school. If the sign of the correlation had been negative, this would mean that less intelligent children do better at school. If two or more predictor variables have very similar correlations with the criterion variable, then the predictor variable that has the highest correlation will always be entered first even if the difference is very small and either variable would explain a similar proportion of the variance. For example, if the correlation between the child’s academic achievement and the child’s interest in school was, say, .60 instead of .44, then child’s interest would be entered first, even though both variables have very similar correlations with academic achievement and, as such, explain very similar proportions of its variance. In this case, the child’s interest would explain .36 (.60² = .36) of the variance in academic achievement. In other words, stepwise multiple regression operates on purely statistical criteria.
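The selection of the first predictor can be reproduced with a short Python sketch (an illustration, not the SPSS stepwise procedure): it rebuilds the data in Table 5.1, replicates them 30 times, computes the correlation matrix in Table 5.2 and picks the predictor with the highest correlation with achievement, squaring it to obtain the proportion of variance explained.

import numpy as np

# Scores of the nine cases on the five variables (Table 5.1). Columns:
# achievement, ability, interest, parents' interest, teachers' interest.
scores = np.array([
    [1, 1, 1, 2, 1],
    [2, 3, 3, 1, 2],
    [2, 2, 3, 3, 4],
    [3, 3, 2, 2, 4],
    [3, 4, 4, 3, 2],
    [4, 2, 3, 2, 3],
    [4, 3, 5, 3, 4],
    [5, 4, 2, 3, 2],
    [5, 3, 4, 2, 3],
])
scores = np.tile(scores, (30, 1))            # replicate the data set 30 times (n = 270)

R = np.corrcoef(scores, rowvar=False)        # correlation matrix (cf. Table 5.2)
print(np.round(R, 2))

r_criterion = R[0, 1:]                       # correlations with child's achievement
best = int(np.argmax(np.abs(r_criterion)))   # index 0 corresponds to child's ability
print(best, round(r_criterion[best], 2),     # about .59
      round(r_criterion[best] ** 2, 2))      # about .34 of the variance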
Subsequent predictors
The next variable to be considered for entry into the regression analysis is the one that has the highest partial correlation with the criterion variable controlling for the variable that has already been entered. If this partial correlation is significant, this second variable is entered into the regression equation. If the first variable still explains a significant proportion of the variance in the criterion variable when the second variable is controlled, the first variable is kept in the regression analysis. If the first variable does not explain a significant proportion of the variance in the criterion variable, then it is dropped from the regression analysis. This process continues in this stepwise fashion until no further significant increase in the proportion of the variance in the criterion variable is explained by predictor variables. It is not possible to tell from the correlation matrix in Table 5.2 which predictor variable has the highest partial correlation with academic achievement when ability is controlled. For example, the child’s interest has the next highest correlation (.44) with academic achievement but child’s interest is also correlated (.42) with child’s ability. Perhaps much of the correlation between child’s interest and child’s academic achievement has already been explained by child’s ability and so child’s interest may not explain a significant further proportion of the variance in academic achievement.

Partial correlation
For the next stage, therefore, it is necessary to work out the partial correlations between academic achievement and each of the other predictor variables controlling for child's ability. This can be done using the following formula:

r12.3 = [r12 − (r13 × r23)] / √[(1 − r13²) × (1 − r23²)]

where r refers to a correlation, subscript 1 to the variable of academic achievement, subscript 2 to one of the predictor variables other than child's ability and subscript 3 to the variable of child's ability. If we substitute the appropriate correlations from Table 5.2 into this formula for working out the partial correlation between academic achievement and child's interest controlling for child's ability, we find that it is about .26:

[.44 − (.59 × .42)] / √[(1 − .59²) × (1 − .42²)] = (.44 − .25) / √(.65 × .82) = .19/√.53 = .19/.73 = .26
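For readers who want to check these values outside SPSS, the following minimal Python sketch implements the first-order partial correlation formula above. It is offered purely as an illustration; the function name and the use of Python are ours and are not part of the book's SPSS-based procedure:

import math

def partial_r(r12, r13, r23):
    # First-order partial correlation between variables 1 and 2 controlling for 3
    return (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))

# Correlations from Table 5.2: achievement-interest .44, achievement-ability .59,
# interest-ability .42; achievement-teachers .28, teachers-ability .07
print(round(partial_r(0.44, 0.59, 0.42), 2))  # child's interest: about .26
print(round(partial_r(0.28, 0.59, 0.07), 2))  # teachers' interest: about .30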
The partial correlation between academic achievement and parents’ interest is .15 and that between academic achievement and teachers’ interest is .30. As the highest partial correlation is for teachers’ interest and this partial correlation is significant, teachers’ interest is the second predictor variable to be entered into the regression equation. Because this partial correlation is positive, this means that the more interest in the child’s work the teachers show, the better the child does at school. If this partial correlation had been negative, it would mean that the less interest teachers showed in the child’s work, the better the child would do. Since the proportion of the variance in academic achievement that child’s ability explains is still significant when teachers’ interest is taken into account, child’s ability remains in the regression analysis together with teachers’ interest. The partial correlation between academic achievement and child’s ability controlling for teachers’ interest is about .60 and is significant. In the third step, the predictor variable that has the highest partial correlation with academic achievement when child’s ability and teachers’ interest are controlled is considered for entry into the regression analysis. The formula for calculating this second-order partial correlation is more complicated than that for the first-order partial correlation shown above and so it will not be presented. It can be found elsewhere (e.g. Cramer 1998: 159). The second-order partial correlations for child’s interest and parents’ interest are .13 and .08, respectively. As the partial correlation for child’s interest is higher than that for parents’ interest and as it is significant, child’s interest is the third predictor to enter into the regression analysis. Once again, this partial correlation is positive, indicating that children who are more interested in school do better at school. If this partial correlation had been negative, it would mean that children who were less interested in school did better at school. Both child’s ability and teachers’ interest remain as predictors in the regression analysis because they both still explain a significant proportion of the variance in academic achievement when child’s interest is included. The second-order partial correlation between academic achievement and teachers’ interest controlling for child’s ability and child’s interest is .21 and is significant. The second-order partial correlation between academic achievement and child’s ability controlling for child’s interest and teachers’ interest is .53 and is significant. The final predictor of parents’ interest is not entered into the regression analysis because the third-order partial correlation of .07 between academic achievement and parents’ interest controlling for child’s ability, teachers’ interest and child’s interest is not significant. Therefore, the three predictor variables that explain a significant proportion of the variance in academic achievement are child’s ability, teachers’ interest and child’s interest. It is important to remember that these results are based on fictitious data, which have been used solely to show what stepwise multiple regression entails. The data do not come from a study which has actually looked at this issue.
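The formulae for the second- and third-order partial correlations referred to above are not reproduced in the text, but they can be obtained by applying the first-order formula recursively, each time controlling for one further variable. The Python sketch below illustrates this standard recursion; it assumes the full correlation matrix is available (only part of it is reproduced in Table 5.2), and the function name is ours rather than anything used in the book or in SPSS:

import math

def partial(r, i, j, controls):
    # Partial correlation between variables i and j controlling for the variables in
    # `controls`, computed recursively from a full correlation matrix r
    if not controls:
        return r[i][j]
    k, rest = controls[0], controls[1:]
    r_ij = partial(r, i, j, rest)
    r_ik = partial(r, i, k, rest)
    r_jk = partial(r, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik**2) * (1 - r_jk**2))

# Achievement (0), child's interest (1) and child's ability (2), from Table 5.2
r = [[1.00, 0.44, 0.59],
     [0.44, 1.00, 0.42],
     [0.59, 0.42, 1.00]]
print(round(partial(r, 0, 1, [2]), 2))  # first-order coefficient: about .26
# Given the full matrix, partial(r, 0, 1, [2, 3]) would give a second-order coefficient.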
Proportion of variance explained
The proportion of the variance in the criterion that is explained by the first predictor is simply the correlation between these two variables squared. As the correlation between academic achievement and child's ability is .59, the proportion of the variance in academic achievement that is explained by child's ability is about .34 (.59² = .34). The proportion of the additional variance in academic achievement that is explained by each of the other predictors can be calculated by squaring the part correlation.
Part correlation
The formula for the first-order part correlation is

r12.3 = [r12 − (r13 × r23)] / √(1 − r23²)

The difference between the first-order part correlation and the first-order partial correlation lies in the divisor or denominator of the formula. The part correlation is expressed in terms of the non-shared variance of the first and second predictor. The partial correlation is expressed in terms of this as well as the non-shared variance of the criterion and the first predictor. To determine the first-order part correlation between academic achievement and teachers' interest controlling for child's ability, we substitute the appropriate correlations in Table 5.2 into this formula, which works out as about .24:

[.28 − (.59 × .07)] / √(1 − .07²) = (.28 − .04) / √(1 − .00) = .24/√1.00 = .24/1.00 = .24

To work out the proportion of the variance in academic achievement that teachers' interest explains over and above that explained by child's ability, we square this part correlation, which gives about .06 (.24² = .06). The formula for a second-order part correlation is more complicated and will not be presented. However, if we work out the second-order part correlation between academic achievement and child's interest controlling for child's ability and teachers' interest, we find that it is about .10, which squared is about .01 (.10² = .01). In other words, about .01 of the variance in academic achievement is explained by child's interest over and above that explained by child's ability and teachers' interest.
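The first-order part correlation can be checked with a similarly minimal Python sketch of the formula above (again, this is only an illustration and not part of the SPSS procedure):

import math

def part_r(r12, r13, r23):
    # First-order part (semipartial) correlation of variable 2 with variable 1,
    # removing variable 3 from variable 2 only
    return (r12 - r13 * r23) / math.sqrt(1 - r23**2)

# achievement-teachers .28, achievement-ability .59, teachers-ability .07
sr = part_r(0.28, 0.59, 0.07)
print(round(sr, 2), round(sr**2, 2))  # about .24, and about .06 of the variance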
Statistical significance of the variance explained
The statistical significance of the proportion of the variance in the criterion explained by a predictor is determined by the F ratio, which has the following formula:

F = [R² change/number of predictors added] / [(1 − R²)/(N − number of predictors included − 1)]

The F ratio has two degrees of freedom, one for the top half or numerator in the formula and one for the bottom half or denominator. In stepwise multiple regression, the degree of freedom for the numerator is always 1. The degrees of freedom for the denominator are the number of cases (N) minus the number of predictors minus 1. Note that this expression forms the second part of the denominator of the formula for the F ratio. The number of predictors includes those previously entered as well as the one entered at this stage. So, there is one predictor in the first step, two in the second step and so on. The statistical significance of the F ratio can be looked up in a table which shows this but is given in the output of a statistical package such as SPSS. The bigger the F ratio, the more likely it is to be statistically significant.

With only one predictor in the regression analysis, R² change and R² are the same and are simply the square of the correlation between the criterion and the predictor. So, the F ratio for the proportion of the variance in academic achievement explained by child's ability is about 145.83, as the squared correlation between these two variables is about .35 (.59² = .35):

(.35/1) / [(1 − .35)/(270 − 1 − 1)] = .35/(.65/268) = .35/.0024 = 145.83

The two degrees of freedom for this F ratio are 1 and 268, respectively. This F ratio is significant at less than the .001 level.

With more than one predictor in the regression analysis, R² change is the square of the part correlation for the predictor at that step and R² is the sum of the R² change for all the previous steps including the present one. The squared part correlation between academic achievement and teachers' interest controlling for child's ability is about .06 (.24² = .06). Consequently, its F ratio is about 27.27:

(.06/1) / {[1 − (.35 + .06)]/(270 − 2 − 1)} = .06/(.59/267) = .06/.0022 = 27.27

The two degrees of freedom for this F ratio are 1 and 267, respectively. This F ratio is significant at less than the .001 level.
Table 5.3  Main results of a stepwise multiple regression

Steps  Predictors          R²    R² change  F        df1  df2  p
1      Child's ability     .35   .35        145.83   1    268  .001
2      Teachers' interest  .41   .06        27.27    1    267  .001
3      Child's interest    .42   .01        4.55     1    266  .05
The squared part correlation between academic achievement and child's interest controlling for child's ability and teachers' interest is about .01 (.10² = .01). Consequently, its F ratio is about 4.55:

(.01/1) / {[1 − (.35 + .06 + .01)]/(270 − 3 − 1)} = .01/(.58/266) = .01/.0022 = 4.55

The two degrees of freedom for this F ratio are 1 and 266, respectively. This F ratio is significant at less than the .05 level. The results of this stepwise multiple regression are summarized in Table 5.3. Other statistics are an important part of multiple regression but are not essential for an understanding of stepwise multiple regression. Some of these statistics will be introduced in Chapter 6.
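The F ratios for these increments can be reproduced with the short Python sketch below, which follows the formula given above. The function is ours, not SPSS output; the small differences from the figures in the text arise because the text rounds the intermediate values to two decimal places:

from scipy.stats import f

def f_change(r2_change, r2_total, n, k):
    # F ratio for the increment in R squared when one predictor is added and the
    # model now contains k predictors in total
    df1, df2 = 1, n - k - 1
    F = (r2_change / df1) / ((1 - r2_total) / df2)
    p = f.sf(F, df1, df2)  # upper-tail probability of the F distribution
    return F, df1, df2, p

print(f_change(0.35, 0.35, 270, 1))  # step 1: about 144, p < .001
print(f_change(0.06, 0.41, 270, 2))  # step 2: about 27, p < .001
print(f_change(0.01, 0.42, 270, 3))  # step 3: about 4.6, p < .05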
Reporting the results
There are various ways of writing up the results of the stepwise multiple regression illustrated in this chapter. A very succinct report of the results may be worded as follows: ‘In the stepwise multiple regression, the child’s intellectual ability was entered first and explained about 35 per cent of the variance in the child’s academic achievement (F1,268 = 146.45, p < .001). Teachers’ interest was entered second and explained a further 6 per cent (F1,267 = 27.27, p < .001). The child’s interest was entered third and explained another 1 per cent (F1,266 = 4.92, p < .05). Parents’ interest did not explain a significant increment in the proportion of variance explained. Greater academic achievement was associated with greater child’s ability, teachers’ interest and child’s interest.’
SPSS Windows procedure
Follow the procedure below to carry out the stepwise multiple regression described in this chapter. Enter the data in Table 5.1 into the Data Editor as shown in Box 5.1. A sixth column of data called frequency has been created for the sole purpose
of stipulating that the data for each row or case are to be reproduced 30 times, thereby increasing the size of the sample from 9 to 270. The five variables have been given the more extended labels of Child's achievement, Child's ability, Child's interest, Parents' interest and Teachers' interest, respectively, to make it clear to what they refer. Save these data as a file for further use.

Box 5.1  The scores of the five variables for the nine cases in the Data Editor

To weight or to increase the cases in this way, select Data from the horizontal menu bar near the top of the window and Weight Cases. . . from the drop-down menu to open the Weight Cases dialog box shown in Box 5.2. Select Weight cases by (by clicking on it). Select freq from the list of variables, the ▶ button to place freq in the box under Frequency Variable: and then OK to close the dialog box. In the bottom right corner of the Data
Editor window are the words Weight On to remind you that the data are weighted.

Box 5.2  Weight Cases dialog box

Select Analyze on the horizontal menu bar near the top of the window, Regression from the drop-down menu and then Linear. . ., which opens the Linear Regression dialog box in Box 5.3. Select Child's achievement and then the first ▶ button to put this variable in the box under Dependent: (variable). Select Child's ability to Teachers' interest and then the second ▶ button to put these variables in the box under Independent(s):. Select Enter beside Method:, which produces a drop-down menu. Select Stepwise for a stepwise multiple regression. Select Statistics. . ., which opens the Linear Regression: Statistics sub-dialog box in Box 5.4. Select R squared change to produce the statistics in the columns under Change Statistics in Table 5.4. Select Descriptives to display the means, standard deviations and number of cases of the five variables followed by a full matrix of their correlations, the statistical significance of these correlations and the number of cases on which these are based.
Box 5.3
Linear Regression dialog box
Box 5.4
Linear Regression: Statistics sub-dialog box
Select Part and partial correlations to display these statistics as shown in the last two columns of Table 5.5. Select Continue to close this sub-dialog box and to return to the dialog box. Select OK to run this analysis.
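Readers working outside SPSS may find the following rough Python sketch of forward entry useful. It assumes the Table 5.1 data are available as NumPy arrays (y for the criterion and a dictionary of predictor columns); the names are placeholders of our own. Note that it only enters predictors and, unlike SPSS's stepwise method, does not re-test previously entered predictors for removal:

import numpy as np
import statsmodels.api as sm
from scipy.stats import f

def stepwise(y, predictors, alpha=0.05):
    # Forward selection on purely statistical criteria: at each step enter the
    # predictor giving the largest increase in R squared (equivalently, the
    # largest squared partial correlation), provided the F for the change is significant
    selected, r2_prev, n = [], 0.0, len(y)
    remaining = dict(predictors)
    while remaining:
        best = None
        for name, x in remaining.items():
            X = sm.add_constant(np.column_stack([predictors[v] for v in selected] + [x]))
            r2 = sm.OLS(y, X).fit().rsquared
            if best is None or r2 > best[1]:
                best = (name, r2)
        name, r2 = best
        k = len(selected) + 1
        F = (r2 - r2_prev) / ((1 - r2) / (n - k - 1))
        if f.sf(F, 1, n - k - 1) >= alpha:
            break
        selected.append(name)
        remaining.pop(name)
        r2_prev = r2
    return selected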
SPSS output
SPSS produces far more statistics than are covered in our description of stepwise multiple regression. Only selected aspects of this output will be reproduced and commented on. Table 5.4 reproduces the Model Summary of the analysis, which contains the information that we used in our report of the results. Each step in the analysis is called a model. In this case, there are three steps or models. The predictors that are entered on each of these three steps are shown immediately below the table. So, Child's ability is entered in the first step, Child's ability and Teachers' interest in the second step and so on. The proportion of variance explained at each step is shown to three decimal places in the sixth column under R Square Change. So, .353 of the variance in academic achievement is explained by child's ability in the first model. An additional .060 of the variance in academic achievement is explained by teachers' interest and a further .011 by child's interest.
Table 5.4  SPSS output of the Model Summary

Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate  R Square Change  F Change  df1  df2  Sig. F Change
1      .594a  .353      .351               1.061                       .353             146.451   1    268  .000
2      .643b  .413      .409               1.013                       .060             27.118    1    267  .000
3      .651c  .424      .417               1.006                       .011             4.917     1    266  .027

(The last five columns appear in the output under the heading Change Statistics.)

a Predictors: (Constant), Child's ability.
b Predictors: (Constant), Child's ability, Teachers' interest.
c Predictors: (Constant), Child's ability, Teachers' interest, Child's interest.
Table 5.5  SPSS output of the Coefficients table

Coefficients(a)

                        Unstandardized Coefficients   Standardized Coefficients                   Correlations
Model                   B            Std. Error       Beta                        t        Sig.   Zero-order  Partial  Part
1  (Constant)           .853         .206                                         4.137    .000
   Child's ability      .853         .070             .594                        12.102   .000   .594        .594     .594
2  (Constant)           4.946E-02    .250                                         .198     .843
   Child's ability      .830         .067             .578                        12.310   .000   .594        .602     .577
   Teachers' interest   .312         .060             .245                        5.208    .000   .283        .304     .244
3  (Constant)           1.146E-02    .249                                         .046     .963
   Child's ability      .757         .075             .528                        10.147   .000   .594        .528     .472
   Teachers' interest   .239         .068             .187                        3.510    .001   .283        .210     .163
   Child's interest     .148         .067             .130                        2.218    .027   .439        .135     .103

a Dependent Variable: Child's achievement.
The F ratio for each step in the analysis is displayed in the seventh column under F Change. The F ratio for the first step is 146.451. This figure is slightly higher than the 145.83 we calculated because we rounded our figures in the calculation to two decimal places while SPSS uses more decimal places than this. The degrees of freedom for the numerator and denominator of this F ratio are presented in the eighth and ninth column under df1 and df2, respectively. They are 1 and 268, respectively. The statistical significance or probability of the F ratio is produced in the tenth column under Sig. F Change. The probability is given to three decimal places and so is less than 0.0005 as it can never be zero.

The squared multiple correlation or R Square for any step is simply the sum of R Square Change up to and including that step. So, for the second step R Square is .413 (.353 + .060 = .413). The multiple correlation or R is the square root of R Square. So, for the second step the square root of .413 is .643. The squared multiple correlation will overestimate the population value, increasingly so the more predictors and the fewer cases there are. A less biased estimate is provided by the following formula:

Adjusted R² = R² − [(1 − R²) × number of predictors] / (N − number of predictors − 1)

To illustrate the use of this formula, we will calculate the adjusted squared multiple correlation for the second model, which is about .409:

.413 − [(1 − .413) × 2] / (270 − 2 − 1) = .413 − (.587 × 2)/267 = .413 − .0044 = .4086
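The adjusted value is straightforward to verify with a one-line Python function following the formula above (an illustrative sketch of ours, not SPSS code):

def adjusted_r2(r2, n, k):
    # Less biased estimate of the squared multiple correlation (k predictors, n cases)
    return r2 - (1 - r2) * k / (n - k - 1)

print(round(adjusted_r2(0.413, 270, 2), 4))  # about .4086, as in the second model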
The partial and part correlations are presented to three decimal places in the last two columns of the Coefficients table shown in Table 5.5. The partial correlation between academic achievement and teachers’ interest controlling for child’s ability for the second step is .304. The part correlation between academic achievement and teachers’ interest controlling for child’s ability for the second step is .244.
Recommended further reading Cohen, J. and Cohen, P. (1983) Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 2nd edn. Hillsdale, NJ: Lawrence Erlbaum Associates. Chapter 5 provides a clear account of how to create dummy variables to examine the influence of qualitative variables. Cramer, D. (1998) Fundamental Statistics for Social Research: Step-by-Step Calculations and Computer Techniques Using SPSS for Windows. London: Routledge. Chapter 7 shows how many of the statistics in the SPSS output of multiple regression are calculated.
Pedhazur, E.J. (1982) Multiple Regression in Behavioral Research: Explanation and Prediction, 2nd edn. New York: Holt, Rinehart & Winston. Although fairly technical, chapters 5 and 6 provide a useful guide to the rationale underlying multiple regression. Pedhazur, E.J. and Schmelkin, L.P. (1991) Measurement, Design and Analysis: An Integrated Approach. Hillsdale, NJ: Lawrence Erlbaum Associates. Chapter 18 provides a valuable overview of multiple regression with useful comments on the output produced by SPSS 3.0. SPSS Inc. (2002) SPSS Base 11.0 User's Guide Package. Upper Saddle River, NJ: Prentice-Hall. Provides a detailed commentary on the output produced by SPSS 11.0 and a useful introduction to multiple regression. Tabachnick, B.G. and Fidell, L.S. (1996) Using Multivariate Statistics, 3rd edn. New York: HarperCollins. Chapter 5 offers a systematic and general account of multiple regression, comparing the procedures of four different programs (including SPSS 6.0) and showing how to write up the results of two worked examples.
6
Hierarchical multiple regression
Hierarchical multiple regression is used to determine what proportion of the variance in a particular variable is explained by other variables when these variables are entered into the regression analysis in a certain order and whether these proportions are significantly greater than would be expected by chance. For example, we may be interested in finding out what proportion of the variance in academic achievement is initially explained by the teachers’ and parents’ interest in how well the child does at school, followed by the child’s own interest and then followed by the child’s intellectual ability. In other words, we may wish to know what additional proportion of the variance in academic achievement is explained by the child’s interest and child’s ability when other more external factors such as the influence of parents and teachers are controlled. We may believe that the proportion of the variance in academic achievement that is explained by the child’s ability is much less when these other factors are taken into account, which may suggest that the relationship between academic achievement and intellectual ability is to some extent due to the possibility that they are assessing the same kind of content. Of course, the exact question that we ask will depend on the particular ideas that we are interested in testing in this area. This example was chosen to show the difference between a hierarchical and a stepwise multiple regression and not to demonstrate the soundness of the reasoning for ordering the variables in that way. In hierarchical multiple regression, we decide
which variables are to be entered at each stage of the regression analysis, whereas in stepwise multiple regression, this decision is based purely on statistical criteria. In hierarchical multiple regression, more than one variable can be entered into the analysis at each stage, whereas in stepwise multiple regression, only one variable is entered at each stage. In all other respects, the procedures for calculating the statistics in a stepwise and a hierarchical multiple regression are the same.
Moderating effects
Hierarchical multiple regression has also been advocated as a more appropriate method for determining whether a quantitative variable has a moderating effect on the relationship between two other quantitative variables (Baron and Kenny 1986). For example, we might think that the relationship between depression and social support may be stronger when people are experiencing stress than when they are not. When experiencing stress, people who are able to talk about it to others may feel less depressed than those who are not able to talk about it. For people not experiencing stress, this relationship may be weaker or non-existent. If this is the case, we would say that stress is thought to moderate the relationship between depression and social support. The stress people experience is likely to vary in differential amounts from not at all to a great deal. One way of ascertaining whether stress has a moderating effect on this relationship is to divide the sample into a high and a low stress group, compute the correlation between depression and social support in these two groups and see whether these two correlations differ significantly from each other. If the correlations do not differ significantly, then there is no evidence for a moderating effect. One disadvantage with this approach is that dividing the sample into two groups will create two smaller samples, which will reduce the chances of the correlations differing, particularly if the initial sample is quite small. An alternative approach is to use hierarchical multiple regression in which the moderating effect of stress is represented by multiplying the social support scores with the stress scores to produce what is known as the interaction between stress and social support. Multiplying the scores in this way enables the joint effect of these two variables to be analysed. However, this interaction also reflects the influence of the two separate variables that go to form it. We have to remove the influence of these two variables to determine what proportion of the variance in depression is due solely to the interaction of the two variables. We do this by first entering stress and social support into the regression analysis followed by the interaction term. If the interaction term explains a significant increment in the variance of depression, then a moderating effect is present. We can determine what this
moderating effect is by splitting the sample into two groups representing low and high stress and examining the relationship between depression and social support. We will illustrate hierarchical multiple regression with the same data as used for stepwise multiple regression, which are presented in Table 5.1. The two variables of parents’ and teachers’ interest will be entered in the first step, child’s interest will be entered in the second step and child’s ability will be entered in the third step. The way we would calculate the proportion of variance contributed at each step would be exactly the same as the way we did it for the stepwise multiple regression if only one variable was entered at each step. So, the proportion of variance accounted for by the predictor entered first is simply the square of the correlation between that predictor and the criterion. The proportion of the variance explained by subsequent predictors is the square of the part correlation between that predictor and the criterion controlling for any previous predictors.
Two or more predictors in a step
When there are two or more predictors in a step of a multiple regression, it is not possible to calculate the proportion of variance in the criterion that is accounted for by these predictors by simply summing the squared part correlations between each predictor and the criterion. Doing this would only take account of the unique variance that the predictors contributed to the criterion and would not include the variance that the predictors shared with each other and the criterion. For example, if we wanted to determine the proportion of variance in academic achievement that was accounted for by both parents’ and teachers’ interest, we could not simply add together the squared part correlations for parents’ and teachers’ interest. The squared part correlation between academic achievement and parents’ interest excludes the contribution that teachers’ interest shares with both academic achievement and parents’ interest. This same contribution is also ignored by the squared part correlation between academic achievement and teachers’ interest. Consequently, both these squared part correlations omit this contribution and simply reflect the unique variance that each of them shares with academic achievement.
Squared multiple correlation
The proportion of variance in the criterion that is accounted for by the predictors at any stage of a multiple regression analysis is given by the squared multiple correlation. Where two or more predictors are entered at a stage, this proportion can be calculated by summing the product of
the standardized partial regression coefficient for that predictor and its correlation with the criterion across all the predictors:

squared multiple correlation = (standardized partial regression coefficient × correlation with the criterion), summed across the predictors

We will work out the squared multiple correlation for parents' and teachers' interest before describing what the standardized partial regression coefficient is. The correlation of academic achievement with parents' and with teachers' interest is .30 and .28, respectively. The standardized partial regression coefficient for parents' and teachers' interest, which we shall calculate subsequently, is about .24 and .22, respectively. Consequently, the squared multiple correlation for parents' and teachers' interest is about .13:

(.30 × .24) + (.28 × .22) = .07 + .06 = .13

In other words, about .13 of the variance in academic achievement is explained by both parents' and teachers' interest.

Standardized partial regression coefficient
The standardized partial regression coefficient or beta (β) can be thought of as the weight attached to a predictor in a multiple regression analysis with two or more predictors, taking into account its relationship with the other variables in the analysis. Because these coefficients have been standardized, they generally vary between −1.00 and +1.00. The higher the absolute value, the stronger the relationship between that predictor and the criterion. A negative coefficient simply means that higher scores on one variable are associated with lower scores on the other variable. In our example, the standardized partial regression coefficients have a similar weight of about .23 and are positive. These weights are used to determine what proportion of the relationship between the criterion and a predictor is explained by that predictor. As both the correlations and the standardized partial regression coefficients for parents' and teachers' interest are similar in size, the proportions of the variance in academic achievement that each explains are similar in magnitude. The formula for a standardized partial regression coefficient (β2) that controls for one other predictor is as follows:

β2 = [r12 − (r13 × r23)] / (1 − r23²)

This formula is the same as that for the first-order part correlation except that there is no square root in the denominator. We can work out the standardized partial regression coefficient for parents' interest if subscript 1 refers to academic achievement, subscript 2 to parents' interest and subscript 3 to teachers' interest. By substituting the appropriate correlations
from Table 5.2 we find that the standardized partial regression coefficient is about .24:

[.30 − (.28 × .27)] / (1 − .27²) = (.30 − .08)/(1 − .07) = .22/.93 = .24

Similarly, we can calculate the standardized partial regression coefficient for teachers' interest if subscript 1 refers to academic achievement, subscript 2 to teachers' interest and subscript 3 to parents' interest. By substituting the appropriate correlations from Table 5.2 we find that the standardized partial regression coefficient is about .22:

[.28 − (.30 × .27)] / (1 − .27²) = (.28 − .08)/(1 − .07) = .20/.93 = .22

Further predictors
We could use the same procedure to work out the proportion of variance accounted for by the predictors entered in the second and third steps of the multiple regression. Child’s interest is entered in the second step. The squared multiple correlation or the proportion of variance in academic achievement accounted for by the three predictors of parents’, teachers’ and child’s interest is the sum of the products of the standardized partial regression coefficient and the correlation for these three predictors. The standardized partial regression coefficients for parents’, teachers’ and child’s interest are about .17, .07 and .36, respectively. The formulae for second- and third-order standardized partial regression coefficients will not be presented as they become progressively more complicated. Note that the standardized partial regression coefficients for parents’ and teachers’ interest are not the same as they were in the first stage because there is now another variable to be taken into account. The correlations between academic achievement and parents’, teachers’ and child’s interest are .30, .28 and .44, respectively. Consequently, the squared multiple correlation for these three predictors is about .23: (.17 × .30) + (.07 × .28) + (.36 × .44) = .05 + .02 + .16 = .23 To work out the proportion of variance in academic achievement that child’s interest explains over and above that explained by parents’ and teachers’ interest, we subtract from this squared multiple correlation the squared multiple correlation for parents’ and teachers’ interest, which gives us about .10 (.23 − .13 = .10). Child’s ability is the fourth and last variable to be entered into the hierarchical multiple regression. The standardized partial regression coefficients for parents’ interest, teachers’ interest, child’s interest and child’s ability are
about .06, .18, .13 and .53, respectively. The correlations between academic achievement and parents' interest, teachers' interest, child's interest and child's ability are .30, .28, .44 and .59, respectively. Consequently, the squared multiple correlation for these four predictors is about .44:

(.06 × .30) + (.18 × .28) + (.13 × .44) + (.53 × .59) = .02 + .05 + .06 + .31 = .44

The proportion of the variance in academic achievement that child's ability explains over and above that of parents', teachers' and child's interest is arrived at by subtracting from this squared multiple correlation the squared multiple correlation for parents', teachers' and child's interest, which is about .21 (.44 − .23 = .21).
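Although the higher-order formulae are not presented in the text, the standardized partial regression coefficients for any step can be obtained from the correlations alone by solving the normal equations, and the squared multiple correlation then follows as the sum of beta × correlation described above. The Python sketch below is illustrative only; it reproduces the two-predictor first step and would extend to the later steps once the full correlation matrix among the predictors is available:

import numpy as np

def betas_and_r2(r_xx, r_xy):
    # Standardized partial regression coefficients and the squared multiple
    # correlation, computed from correlations alone
    betas = np.linalg.solve(np.asarray(r_xx), np.asarray(r_xy))
    r2 = float(betas @ np.asarray(r_xy))
    return betas, r2

# First step: parents' and teachers' interest (correlations with achievement .30 and
# .28; correlation between the two predictors .27, as given in the text)
betas, r2 = betas_and_r2([[1.0, 0.27], [0.27, 1.0]], [0.30, 0.28])
print(np.round(betas, 2), round(r2, 2))  # about [.24 .22] and .13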
Statistical significance of variance explained

The statistical significance of the change in the proportion of the variance in the criterion that is accounted for by the predictor(s) entered on a step is calculated with an F ratio in exactly the same way as for stepwise multiple regression. The formula for this F ratio is:

F = [R² change/number of predictors added] / [(1 − R²)/(N − number of predictors included − 1)]

The F ratio has two degrees of freedom, one for the top half or numerator in the formula and one for the bottom half or denominator. The degrees of freedom for the numerator are the number of predictors that have been added at that stage. In our example, this was two for the first stage and one each for the second and third stages. The degrees of freedom for the denominator are the number of cases minus the number of predictors up to and including that stage minus 1.

For the first step in the regression analysis, R² change and R² are the same and are about .13. The F ratio for the proportion of the variance in academic achievement explained by parents' and teachers' interest is about 19.70:

(.13/2) / [(1 − .13)/(270 − 2 − 1)] = .065/(.87/267) = .065/.0033 = 19.70

The two degrees of freedom for this F ratio are 2 and 267, respectively. This F ratio is significant at less than the .001 level. For the second step, R² change is .10 and R² is .23. The F ratio for the proportion of the variance in academic achievement explained by child's interest is about 34.48:

(.10/1) / [(1 − .23)/(270 − 3 − 1)] = .10/(.77/266) = .10/.0029 = 34.48
Table 6.1  Main results of a hierarchical multiple regression

Steps  Variables                              R²   R² change  F       df1  df2  p
1      Parents' interest, Teachers' interest  .13  .13        19.70   2    267  .001
2      Child's interest                       .23  .10        34.48   1    266  .001
3      Child's ability                        .44  .21        100.00  1    265  .001
The two degrees of freedom for this F ratio are 1 and 266, respectively. This F ratio is significant at less than the .001 level. For the third step, R² change is .21 and R² is .44. The F ratio for the proportion of the variance in academic achievement explained by child's ability is about 100.00:

(.21/1) / [(1 − .44)/(270 − 4 − 1)] = .21/(.56/265) = .21/.0021 = 100.00

The two degrees of freedom for this F ratio are 1 and 265, respectively. This F ratio is significant at less than the .001 level. The results of this hierarchical multiple regression are summarized in Table 6.1.
Statistical significance of the partial regression coefficient
When there are two or more predictors in a step as in our example, the F ratio only determines whether the increment in the proportion of variance that is explained by all the predictors in that step is statistically significant. In such circumstances, it is useful to know whether the partial regression coefficients are statistically significant. The statistical significance of a partial regression coefficient is given by a t-test in which the unstandardized partial regression coefficient (B) is divided by its standard error:

t = unstandardized partial regression coefficient / standard error
The bigger the standard error, the less confident we can be that the value of the unstandardized partial regression coefficient represents its true value in the population. In other words, the more likely it is that the value of this coefficient will vary from its true value. Bigger t values are more likely to be statistically significant. The degrees of freedom for this t-test are the number of cases minus 2.
The formulae for calculating the unstandardized partial regression coefficient and its standard error can be found elsewhere (e.g. Pedhazur 1982; Cramer 1998). The unstandardized partial regression coefficients for parents’ and teachers’ interest are about .47 and .28, respectively, while their standard errors are about .12 and .08, respectively. Consequently, the t values are about 3.92 (.47/.12 = 3.92) and 3.50 (.28/.08 = 3.50), respectively, which with 268 degrees of freedom are statistically significant at less than the .001 level.
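These t values and their two-tailed probabilities can be checked with a brief Python sketch of the ratio above, using the degrees of freedom given in the text (the function name is ours and the snippet is illustrative rather than SPSS output):

from scipy.stats import t

def coef_t(b, se, df):
    # t value and two-tailed p for an unstandardized partial regression coefficient
    t_value = b / se
    return t_value, 2 * t.sf(abs(t_value), df)

print(coef_t(0.47, 0.12, 268))  # parents' interest: about t = 3.92, p < .001
print(coef_t(0.28, 0.08, 268))  # teachers' interest: about t = 3.50, p < .001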
Reporting the results
One succinct way of describing the results of this hierarchical multiple regression analysis is as follows: ‘In the hierarchical multiple regression, parents’ and teachers’ interest were entered together in the first step and explained about 13 per cent of the variance in children’s academic achievement (F2,267 = 19.70, p < .001), each explaining a similar proportion of the variance. The partial regression coefficients were statistically significant for both parents’ interest (B = 0.47, t268 = 3.92, p < .001) and teachers’ interest (B = 0.28, t268 = 3.50, p < .001). Child’s interest was entered second and explained a further 10 per cent (F1,266 = 34.48, p < .001) of the variance. Child’s ability was entered third and explained another 21 per cent (F1,265 = 100.00, p < .001). Greater academic achievement was associated with greater parents’ interest, teachers’ interest, child’s interest and child’s ability.’
SPSS Windows procedure
To carry out the hierarchical multiple regression described in this chapter, follow the procedure outlined below. If the data in Table 5.1 have already been saved as a file, retrieve the file in the Data Editor by selecting File, Open, Data. . ., the file's name from the Open File dialog box and Open. Otherwise, enter the data as shown in Box 5.1 and weight the cases by following the Weight Cases. . . procedure described in the previous chapter. Select Analyze on the horizontal menu bar near the top of the window, Regression from the drop-down menu and then Linear. . ., which opens the Linear Regression dialog box in Box 5.3. Select Child's achievement and then the first ▶ button to put this variable in the box under Dependent: (variable). Select Parents' interest and Teachers' interest and then the second ▶ button to put these variables in the box under Independent(s):. Select Next immediately above Independent(s): and beside Block 1 of 1
to enter the second block or set of variables, which in our case consists of one variable. Select Child's interest and then the second ▶ button to put this variable in the box under Independent(s):. Select Next immediately above Independent(s): and beside Block 2 of 2 to enter the third and last block. Select Child's ability and then the second ▶ button to put this variable in the box under Independent(s):. Select Statistics. . ., which opens the Linear Regression: Statistics sub-dialog box in Box 5.4. Select R squared change to produce the statistics in the columns under Change Statistics in Table 6.2. Select Descriptives to display the means, standard deviations and number of cases of the five variables followed by a full matrix of their correlations, the statistical significance of these correlations and the number of cases on which these are based. Select Part and partial correlations to display these statistics as shown in the last two columns of Table 6.3. Select Continue to close this sub-dialog box and to return to the dialog box. Select OK to run this analysis.
SPSS output
Only two of the tables in the SPSS output will be reproduced, the Model Summary table shown in Table 6.2 and the Coefficients table presented in Table 6.3. Any discrepancies between the values in these tables and the calculations above are due to rounding error with the figures in the output being more accurate. The Coefficients table shows the Standardized Coefficients (or Betas) and the Unstandardized Coefficients (or Bs), their standard errors (Std. Error), t values and their statistical significance (Sig.). The proportion of variance accounted for by each step can be checked by adding the products of the standardized coefficients and the correlations. For example, this proportion is .132 for the first step, which is the same as the value in the Model Summary table: (.237 × .296) + (.219 × .283) = .070 + .062 = .132 The t values can be checked by dividing the unstandardized coefficients by their standard error. So, the t value for the unstandardized coefficient for parents’ interest in the first step is about 3.99 (.467/.117 = 3.99), which is similar to that of 4.000.
Table 6.2  SPSS output of Model Summary table

Model Summary

Model  R      R Square  Adjusted R Square  Std. Error of the Estimate  R Square Change  F Change  df1  df2  Sig. F Change
1      .363a  .132      .125               1.232                       .132             20.274    2    267  .000
2      .477b  .228      .219               1.164                       .096             33.000    1    266  .000
3      .653c  .426      .418               1.005                       .199             91.791    1    265  .000

(The last five columns appear in the output under the heading Change Statistics.)

a Predictors: (Constant), Teachers' interest, Parents' interest.
b Predictors: (Constant), Teachers' interest, Parents' interest, Child's interest.
c Predictors: (Constant), Teachers' interest, Parents' interest, Child's interest, Child's ability.
Table 6.3  SPSS output of Coefficients table

Coefficients(a)

                        Unstandardized Coefficients   Standardized Coefficients                    Correlations
Model                   B            Std. Error       Beta                        t        Sig.    Zero-order  Partial  Part
1  (Constant)           1.357        .305                                         4.449    .000
   Parents' interest    .467         .117             .237                        4.000    .000    .296        .238     .228
   Teachers' interest   .279         .076             .219                        3.693    .000    .283        .220     .211
2  (Constant)           .958         .297                                         3.229    .001
   Parents' interest    .344         .112             .174                        3.057    .002    .296        .184     .165
   Teachers' interest   8.807E-02    .079             .069                        1.118    .264    .283        .068     .060
   Child's interest     .406         .071             .357                        5.745    .000    .439        .332     .310
3  (Constant)           −.133        .280                                         −.476    .635
   Parents' interest    .112         .100             .057                        1.121    .263    .296        .069     .052
   Teachers' interest   .223         .069             .175                        3.216    .001    .283        .194     .150
   Child's interest     .143         .067             .125                        2.135    .034    .439        .130     .099
   Child's ability      .736         .077             .513                        9.581    .000    .594        .507     .446

a Dependent Variable: Child's achievement.
Recommended further reading Cramer, D. (1998) Fundamental Statistics for Social Research: Step-by-Step Calculations and Computer Techniques Using SPSS for Windows. London: Routledge. Chapter 7 shows how many of the statistics in the SPSS output are calculated. Pedhazur, E.J. (1982) Multiple Regression in Behavioral Research: Explanation and Prediction, 2nd edn. New York: Holt, Rinehart & Winston. Although fairly technical, chapters 5 and 6 provide a useful guide to the rationale underlying multiple regression. Pedhazur, E.J. and Schmelkin, L.P. (1991) Measurement, Design and Analysis: An Integrated Approach. Hillsdale, NJ: Lawrence Erlbaum Associates. Chapter 18 provides a valuable overview of multiple regression with useful comments on the output produced by SPSS 3.0. SPSS Inc. (2002) SPSS Base 11.0 User’s Guide Package. Upper Saddle River, NJ: Prentice-Hall. Provides a detailed commentary on the output produced by SPSS 11.0 and a useful introduction to multiple regression. Tabachnick, B.G. and Fidell, L.S. (1996). Using Multivariate Statistics, 3rd edn. New York: HarperCollins. Chapter 5 offers a systematic and general account of multiple regression, comparing the procedures of four different programs (including SPSS 6.0) and showing how to write up the results of two worked examples.
Part 3 Sequencing the relationships between three or more quantitative variables
7
Path analysis assuming no measurement error
Path analysis involves setting up a model showing the way in which three or more variables are thought to be related to one another. It is often useful to depict this model using a path diagram where the variables may be ordered from left to right in terms of the direction of their influence and where a relationship between two variables is shown by drawing a line between them. Let us take the simplest case of three variables and let us say that the three variables are the child’s academic achievement, the child’s interest in doing well academically and the parents’ interest in their child doing well, with all three variables being measured simultaneously. There are various models we could set up using these three variables. One decision we have to make is which variable or variables we are interested in explaining. With three variables, we could be interested in explaining any one or two of them. So, for example, we may think that both the child’s and the parents’ interest in the child’s academic achievement are influenced by the child’s academic achievement and that the child’s interest and the parents’ interest are related to each other. We could depict this model with the path diagram in Figure 7.1. Child’s academic achievement is placed on the left of the diagram because it is thought to influence child’s and parents’ interest, which are placed to its right. The direction of influence is also shown by a straight right-pointing arrow drawn from child’s academic achievement to both child’s interest and parents’ interest. The fact that child’s and parents’ interest are related to one another is
shown by a curved line with an arrow at either end.

Figure 7.1  Path diagram showing academic achievement influencing child's and parents' interest with the latter two covarying.

The double-headed curved arrow and the two single-headed straight arrows are generally called parameters. The strength of these parameters is indicated by a coefficient which can be standardized so that it varies between 0 and ±1.00. The number of parameters in a model is limited by the number of variables that can be used to define those parameters. With three variables, we can only have three parameters. Such a model is called just-identified because we have just a sufficient number of variables to be able to identify or estimate the three parameters. If we had four or more parameters, the model would be under-identified because we would not have sufficient variables to identify or estimate all the parameters. An under-identified model is presented in Figure 7.2, which shows a reciprocal or bi-directional relationship between child's and parents' interest with child's interest influencing parents' interest and parents' interest influencing child's interest.

Figure 7.2  Path diagram showing academic achievement influencing child's and parents' interest with the latter two influencing each other.

In this case, the influence of child's interest on parents' interest cannot be estimated independently of the influence of the parents' interest on child's interest because both of these parameters can only be estimated in terms of the same variable of child's academic achievement. Models for which the number of parameters is less than the number of variables are over-identified because not all the possible parameters are used. For example, the model in Figure 7.1 would be over-identified if
child’s interest and parents’ interest were not free to covary. Although the value of these models may not be apparent when there are only three variables, they are of increasing interest the more variables there are because we may wish to determine which model accounts for the data in terms of the fewest parameters. Path analysis is sometimes referred to as causal analysis (e.g. James et al. 1982) or causal modelling (e.g. Asher 1983). These terms may be misleading if they are interpreted as implying that path analysis can be used to determine whether changes in one variable cause changes in another variable. Path analysis can only be used to determine whether two or more variables are associated. It cannot be used to determine whether this association is a causal one. The appropriate procedure for determining causality is a true experimental design in which the causal or independent variable is manipulated, while the effect or dependent variable is measured and all other variables are held constant. When variables have been simply measured at the same point in time, as in this example on academic achievement, it is not possible to determine the causal ordering of the variables. In terms of the relationship between, say, the child’s academic achievement and the child’s interest, it is possible that: (1) the child’s interest determines the child’s achievement; (2) the child’s achievement determines the child’s interest; (3) they both influence each other; or (4) the association is spurious and is due to the influence of some other variable such as the child’s socio-economic background. We will assume that the variable that we are interested in explaining is not child’s or parents’ interest but child’s academic achievement. We will illustrate path analysis with the three models shown in Figure 7.3. The first model we will call the ‘Correlated Direct’ model because child’s interest is assumed to covary with parents’ interest and both child’s and parents’ interest are assumed to influence child’s academic achievement. The second model we will call the ‘Direct and Indirect’ model because child’s academic achievement is assumed to be directly influenced by the parents’ interest and indirectly influenced by the parents’ interest through their effect on the child’s interest. The third model we will call the ‘Indirect’ model because the parents’ interest is assumed to only have an indirect influence on the child’s academic achievement through their effect on the child’s interest. There are two points that are worth noting about these models. The first is that they do not exhaust the models we can test. For example, we could test a model in which child’s and parents’ interest are not allowed to covary or in which child’s interest influences parents’ interest. The second point is that we cannot determine whether the first (Correlated Direct) or the second (Direct and Indirect) model provides the more appropriate account of the data because both models are just-identified and so both will give a complete explanation of the data. However, we will be able to determine
whether the third model (Indirect) provides as adequate a model of the data as the second model (Direct and Indirect) because the third model is a subset of the second one.

Figure 7.3  Three path models: (a) Correlated Direct, (b) Direct and Indirect, (c) Indirect.

We will illustrate the calculations involved in path analysis with the fictitious data in Table 7.1. Each of the three variables of child's academic achievement (AA), child's interest (CI) and parents' interest (PI) is made up of three measures that can vary between 1 and 9. Higher scores indicate higher levels of these variables. The mean score for each of these three variables is shown in Table 7.2. Academic achievement is correlated .53 with the child's interest and .45 with the parents' interest, while child's interest is correlated .38 with parents' interest.
Multiple regression
The simplest way of carrying out a path analysis is to use multiple regression. As the standardized regression coefficient between a predictor and a criterion is the same as the correlation coefficient between the two variables, we only need to use multiple regression to calculate the standardized partial regression coefficient when we have to take account of one or more other predictor variables. The path coefficients for the three models in Figure 7.3 are displayed in Figure 7.4. Only for the first two models are there pathways where we need to take account of another variable and that is the pathway
Table 7.1  Scores on nine measures assessing three variables

Cases  AA1  AA2  AA3  CI1  CI2  CI3  PI1  PI2  PI3
1      3    5    3    4    5    2    3    2    4
2      2    3    4    3    2    2    2    4    2
3      3    4    5    4    6    3    5    2    4
4      4    5    5    6    5    7    4    3    3
5      5    5    4    4    3    5    2    4    4
6      3    6    5    4    3    3    5    3    4
7      4    3    4    5    5    4    4    3    6
8      6    7    6    6    5    5    7    8    9
9      4    7    4    4    6    5    7    8    5
10     3    2    3    3    4    3    5    3    4
11     7    5    4    3    3    2    3    2    3
12     9    6    6    5    6    5    6    4    5
13     7    6    5    8    8    6    3    4    3
14     5    4    4    8    5    6    6    4    5

Abbreviations: AA = academic achievement, CI = child's interest, PI = parents' interest.
Table 7.2  Mean scores for the three variables

Cases  AA    CI    PI
1      3.67  3.67  3.00
2      3.00  2.33  2.67
3      4.00  4.33  3.67
4      4.67  6.00  3.33
5      4.67  4.00  3.33
6      4.67  3.33  4.00
7      3.67  4.67  4.33
8      6.33  5.33  8.00
9      5.00  5.00  6.67
10     2.67  3.33  4.00
11     5.33  2.67  2.67
12     7.00  5.33  5.00
13     6.00  7.33  3.33
14     4.33  6.33  5.00

Abbreviations: AA = academic achievement, CI = child's interest, PI = parents' interest.
between parents’ interest and academic achievement and between child’s interest and academic achievement. In both cases, the appropriate path coefficients are the standardized partial regression coefficients for a multiple regression in which the criterion is academic achievement and the predictors are child’s interest and parents’ interest. The other path coefficients in these models, such as that between parents’ interest and child’s interest, do not involve taking other variables into account and so are the correlation coefficients. From the path coefficients shown in Figure 7.4 we can see that the values are the same for the first two models and so there is no way of distinguishing between the two in terms of the statistics involved. Whether the third model provides a simpler and as acceptable a model of the data as the second one depends on whether the standardized partial regression coefficient between parents’ interest and academic achievement in the second model is statistically significant. If this coefficient is significant, the second model is the more appropriate one. If the coefficient is not significant, the third model is the simpler one. The statistical significance of the path coefficients will depend on the size of the sample. With only 14 cases, none of the path coefficients is statistically significant at the two-tailed .05 level. Using such a small sample would not be acceptable in research of this kind, which would need to consist of at least 30 cases. With 140 cases, all of the path coefficients would be statistically significant at this level. A significant path coefficient between parents’ interest and child’s interest
and between child's interest and academic achievement implies that child's interest mediates the association between parents' interest and academic achievement. The size of the indirect association between parents' interest and academic achievement is calculated by multiplying the path coefficient between parents' interest and child's interest by the path coefficient between child's interest and academic achievement. This gives an indirect effect of .16 (.38 × .42 = .16) for the second model in Figure 7.3 and .20 (.38 × .53 = .20) for the third model.

Figure 7.4  Three path models with path coefficients and proportion of variance unexplained: (a) Correlated Direct, (b) Direct and Indirect, (c) Indirect [n = 140; *p < .001 (two-tailed)].

The proportion of variance that is not explained by the variables in a path model is also often displayed in a path diagram. This proportion is shown in Figure 7.4 by the value which is placed just below the upward-pointing arrows. In the first model, there is only one variable which is explained by other variables in the path model and that is academic achievement. Consequently, there is only one upward-pointing arrow. The proportion of variance in academic achievement that is not explained by parents' interest and child's interest is .65. This value is obtained by subtracting from 1 the unadjusted squared multiple correlation (1 − .346 = .654) for the regression of academic achievement on parents' interest and child's interest. In the second and third models, there is also an upward-pointing arrow for child's interest which is explained by parents' interest. The proportion of unexplained variance in academic achievement in the second model is calculated in the same way as that for the first model and so is .65. The proportion of unexplained variance in child's interest is the same in both the second and third model and is derived by squaring the correlation between parents' and child's interest (.38² = .14) and subtracting this value from 1, which gives .86 (1 − .14 = .86). The proportion of unexplained variance in academic achievement in the third model is worked out similarly by squaring the correlation between child's interest and academic achievement (.53² = .28) and subtracting it from 1, which is .72 (1 − .28 = .72).

One problem with carrying out path analysis with multiple regression is that no index of the fit of a model is provided in terms of the extent to which the estimated parameters in the model can reproduce the original correlations in the data. Consequently, it is not possible to compare the fit of a model which is a subset of another model such as models two and three in Figure 7.4. A second problem is that multiple regression does not take account of the fact that there is usually error involved in measuring variables, which has the effect of lowering the size of the association between them. Comparing the size of path coefficients becomes problematic when measurement error varies for the variables in a model. Structural equation modelling both provides a measure of the fit of a model and takes account of measurement error. How this is done will be outlined in the next chapter. However, structural equation modelling can also be used to work out the path coefficients for the models in Figure 7.4 as shown below.
To familiarize yourself with the program and output for carrying out structural equation modelling, it may be helpful initially to carry out the simplest path analysis which does not involve trying to take account of measurement error. Furthermore, it should make you more aware of the difference between path analyses which either do or do not take account of measurement error. One of the measures of the goodness-of-fit of a model is the normal theory weighted least squares chi-square. Because the first two models in Figure 7.4 are just-identified, they provide a perfect fit to the data with a chi-square of 0.00 and no degrees of freedom. The chi-square of a model that is over-identified will be affected by the size of the sample. With 14 cases, the chi-square for the third model is 1.31 with 1 degree of freedom. This value is not statistically significant, with p being greater than .05. A path analysis would not be carried out on such a small sample. With a more reasonably sized sample of 140 cases, the degree of freedom is still 1 but chi-square is now 13.96, which is statistically significant with p less than .001. As the third model is a subset or nested version of the second model, we can compare the fit of the third model with the second one by subtracting chi-square and the degrees of freedom for the second model from the corresponding statistics for the third model. Because the second model provides a perfect fit to the data, the difference in chi-square and the degrees of freedom between these two models will be the same as the chi-square (e.g. 13.96 − 0.00 = 13.96), the degrees of freedom (1 − 0 = 1) and the statistical significance for the third model. With a sample of 140, the difference in chi-square would be statistically significant, indicating that the third model provides a significantly poorer fit to the data than the second model. The degrees of freedom for a model are determined by subtracting the number of parameters to be estimated from the number of observed variances and covariances. The number of observed variances and covariances is given by the general formula: n(n + 1)/2, where n is the number of observed variables. The number of observed variances and covariances for all three models in Figure 7.4 is 6 [3(3 + 1)/2 = 6]. The number of parameters to be estimated is the number of postulated relationships and an error term for each of the variables in the models in Figure 7.4. This number is 6 for the first two models and 5 for the third model.
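The degrees-of-freedom formula and the chi-square difference test can likewise be checked directly. The sketch below assumes the chi-square values reported above and uses scipy only for the p value; it is an illustration, not a reproduction of LISREL output.

```python
# Degrees of freedom and the chi-square difference test for nested path models.
from scipy.stats import chi2

n_vars = 3
n_var_cov = n_vars * (n_vars + 1) // 2     # n(n + 1)/2 = 6 observed variances and covariances

df_second = n_var_cov - 6                  # just-identified: 0 degrees of freedom
df_third = n_var_cov - 5                   # over-identified: 1 degree of freedom

chi2_second, chi2_third = 0.00, 13.96      # values reported for the 140-case sample
diff_chi2 = chi2_third - chi2_second       # 13.96
diff_df = df_third - df_second             # 1

print(df_second, df_third, diff_chi2, diff_df)
print(chi2.sf(diff_chi2, diff_df) < .001)  # True: the third model fits significantly worse
```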
Reporting the results
It would be more appropriate to analyse the example in this chapter with structural equation modelling, which takes account of measurement error as shown in the next chapter. If this was not possible, a succinct way of reporting one interpretation of the findings for this rather simple example is
as follows: ‘Academic achievement was found to be significantly positively related both directly to parents’ interest (β = .29, d.f. = 138, two-tailed p < .001) and indirectly via child’s interest (β = .20, d.f. = 138, two-tailed p < .05). Child’s interest was significantly positively related to both academic achievement (β = .42, d.f. = 138, p < .001) and parents’ interest (r = .38, d.f. = 138, p < .001).’
SPSS Windows procedure for multiple regression
To produce the standardized partial regression coefficients for the models in Figure 7.4, first enter the data in Table 7.2 into the Data Editor and weight the frequency of each case by 10 as described in Chapter 5. Then select Analyze on the horizontal menu bar near the top of the window, Regression from the drop-down menu and then Linear..., which opens the Linear Regression dialog box in Box 5.3. Select AA and then the first ▸ (arrow) button to put this variable in the box under Dependent: (variable). Select CI and PI and then the second ▸ (arrow) button to put these variables in the box under Independent(s):. Select Statistics..., which opens the Linear Regression: Statistics sub-dialog box in Box 5.4. Select Descriptives to produce in the output the correlations between these three variables and their statistical significance. Select Continue to close this sub-dialog box and return to the dialog box. Select OK to run this analysis.
SPSS output
Of the output, only the table containing the partial regression coefficients and their statistical significance will be presented as shown in Table 7.3. The standardized partial regression coefficient of academic achievement (AA) with child’s interest (CI) and parents’ interest (PI) is .417 and .286, respectively.
LISREL procedure for analysis without error
We will use the structural equation modelling program called LISREL to illustrate how to produce the results for the second and third model. Once we have accessed LISREL we select File from the menu bar near the top of
Table 7.3  SPSS output of partial regression coefficients

Coefficients(a)

                   Unstandardized Coefficients       Standardized Coefficients
Model 1            B          Std. Error             Beta                         t        Sig.
  (Constant)       2.044      .316                                                6.468    .000
  PIMEAN           .231       .060                   .286                         3.834    .000
  CIMEAN           .358       .064                   .417                         5.584    .000

a  Dependent Variable: AAMEAN.
the LISREL Window, which produces a drop-down menu. From this menu we select New, which opens the New dialog box. We select Syntax only, which opens a Syntax window into which we type the instructions. When we have finished typing the instructions, we select File and, from the drop-down menu, Run LISREL. If we have not correctly typed in the instructions, the program will produce output which should indicate where the initial problems are encountered. If the instructions have been correctly typed in, the path diagram will be displayed first. To look at the other output that accompanies this diagram, select Window from the menu bar near the top of the Window and select the file ending in .OUT. The instructions below in bold need to be run to produce the appropriate output for the third model:

PA: Indirect Model without Error
DAta NInputvar=3 NObserv=140
LAbels
ChiInt AcaAch ParInt
KMatrix
1.00
.53 1.00
.38 .45 1.00
MOdel NYvar=2 NXvar=1 GAmma=FRee BEta=SD PSi=DIagonal
FIxed GAmma(2,1)
PDiagram
OUtput EFfects
We will briefly describe what these instructions do. The line beginning with PA (short for Path Analysis) provides a title which will appear on the top of each ‘page’ of the output. The instructions can be shortened by restricting keywords such as DAta and LAbels to their first two
letters, which have been capitalized to distinguish them from the rest of the word. The line starting with DA provides information about the data to be analysed. It states that the number of variables inputted into the analysis is 3 and the number of observations or cases is 140. The line starting with LA states what labels we are giving the three variables that are listed on the next line. Only the first eight characters of the labels are printed. Variables without any arrows leading to them (Parents’ Interest) are listed last. They are often called exogenous variables. Variables with arrows pointing to them (Child’s Interest and Academic Achievement) are listed first. They are usually called endogenous variables. The line beginning with KM indicates that the data are to be read as a correlation matrix. The variables in the triangular correlation matrix are listed in the same order as their labels. For example, the first line of the matrix shows the correlation of Child’s Interest with itself, which is 1.00 of course. The line starting with MO defines the model to be tested, which consists of 2 endogenous variables and 1 exogenous variable. The parameters of the model are contained in a number of matrices. The pathway between an exogenous and an endogenous variable is defined by the GAmma matrix, whereas that between two endogenous variables is defined by the BEta matrix. The path coefficients are free to vary in the GAmma matrix. As there is no pathway in this model between Parents’ Interest and Academic Achievement, this path coefficient has to be FIxed so as not to be free to vary. This is done with the line starting with FI. The path coefficient is defined by its position in the GAmma matrix, which is the 2nd row of the 1st and, in this case, only column. To analyse the second model we simply delete this line, thereby freeing this parameter. The BEta matrix is set so that the single path coefficient between Child’s Interest and Academic Achievement is free to vary. The PSi matrix is set to be diagonal and provides the proportion of variance left unexplained in the two endogenous variables (Child’s Interest and Academic Achievement). The line starting with PD provides a path diagram for this model. The line starting with OU specifies the output to be produced. The total and indirect effects are produced by adding EF.
LISREL output for analysis without errors
Because of the need to save space, only a few aspects of the output will be commented on or presented. The path diagram displayed is similar to that shown for the third model in Figure 7.4, except that it excludes the asterisks
Table 7.4 LISREL output of path coefficients for the third model without error
indicating statistical significance and includes details of the normal theory weighted least squares chi-square and the root mean square error of approximation (RMSEA). The path coefficients for the model are also shown in the matrices in Table 7.4. These path coefficients have been standardized because a correlation rather than a variance–covariance matrix was read in. The path coefficient between Child’s Interest and Academic Achievement is displayed in the BETA matrix and is 0.53. The value in parentheses immediately under it is its standard error, which is 0.07. The value below that is the t value, which tests whether the coefficient differs significantly from zero. The t value is 7.34. For a sample of 140 and at the two-tailed level, the critical value of t is about ± 1.98 at the .05 level, ± 2.62 at the .01 level and ± 3.37 at the .001 level. As 7.34 is greater than 3.37, this coefficient is statistically significant at below the two-tailed .001 level. The path coefficient between Parents’ Interest and Child’s Interest is shown in the GAMMA matrix and is 0.38. With a t value of 4.83, this coefficient is also significant at below the two-tailed .001 level. The proportion of variance that is left unexplained in the two endogenous variables of Child’s Interest and Academic Achievement is also shown in the PSI matrix in Table 7.5 and is 0.86 and 0.72, respectively. The indirect effect of Parents’ Interest on Academic Achievement is shown in the matrix in Table 7.6 and is 0.20. With a t value of 4.03, this effect is significant at below the two-tailed .001 level.
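As a rough check on the logic of these t tests (not a reproduction of LISREL’s internal computations), an estimate divided by its standard error can be compared with two-tailed critical values of the t distribution. The sketch below does this for the Child’s Interest–Academic Achievement coefficient; the degrees of freedom and the small discrepancy from the printed t of 7.34 are our assumptions, reflecting rounding of the displayed estimate and standard error.

```python
# Comparing an estimate/standard-error ratio with two-tailed critical values of t.
from scipy.stats import t

estimate, se = 0.53, 0.07      # CI -> AA path coefficient and its standard error (as printed)
df = 138                       # roughly n - 2 for a sample of 140 (our assumption)

t_ratio = estimate / se                    # about 7.6 from these rounded figures (LISREL prints 7.34)
crit_05 = t.ppf(1 - .05 / 2, df)           # about 1.98
crit_001 = t.ppf(1 - .001 / 2, df)         # about 3.36

print(round(t_ratio, 2), round(crit_05, 2), round(crit_001, 2))
print(t_ratio > crit_001)                  # True: significant beyond the two-tailed .001 level
```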
Table 7.5 LISREL output of the proportion of unexplained variance for the third model without error
Table 7.6 LISREL output of the indirect effect for the third model without error
Recommended further reading

Bryman, A. and Cramer, D. (2001) Quantitative Data Analysis with SPSS Release 10 for Windows. Hove: Routledge. A few pages of chapter 10 outline path analysis using multiple regression and a four-variable model.

Pedhazur, E.J. (1982) Multiple Regression in Behavioral Research: Explanation and Prediction, 2nd edn. New York: Holt, Rinehart & Winston. Although fairly technical, chapters 15 and 16 provide a useful guide to the rationale underlying path analysis using multiple regression and LISREL.

Pedhazur, E.J. and Schmelkin, L.P. (1991) Measurement, Design and Analysis: An Integrated Approach. Hillsdale, NJ: Lawrence Erlbaum Associates. Chapter 24 provides a valuable overview of various kinds of path analysis using LISREL.
8
Path analysis accounting for measurement error
This chapter looks at two of the main ways in which measurement error is handled in path analysis when variables are measured with two or more indices or indicators, as in the example in the last chapter. The same example will be used in this chapter to enable the results of the different methods to be compared.
Measurement reliability
The simpler method of taking account of measurement error is to obtain a measure of the reliability with which the variables are assessed and to use this in the model. One such measure is Cronbach’s (1951) alpha or index of internal consistency. Alpha reliability may be thought of as the average of all possible split-half reliability coefficients for the indicators of a variable, taking into account the number of indicators in those split halves. The split-half reliability coefficient is simply the correlation of the aggregated scores of half the indicators with the aggregated scores of the remaining half of the indicators. With only three indicators for each of the three variables in our example, one half will consist of one item, while the other half will comprise the other two items. Thus, there will be three possible split-half reliability coefficients (item 1 vs items 2 and 3; item 2 vs items 1 and 3; and item 3 vs items 1 and 2). Alpha reliability coefficients range from
zero to one. Values of .80 and above are considered to indicate a reliable measure. The alpha reliabilities of our three variables of parents’ interest, child’s interest and academic achievement are .83, .85 and .73, respectively. Taking these reliabilities into account, the standardized path coefficients for our three path models are shown in Figure 8.1. The indicator or measure of a variable may be referred to as a manifest variable and is represented by a rectangle or square, while the variable itself may be called a latent variable and is displayed as a circle or ellipse. The arrow moves from the latent to the manifest variable to show that the indicator is a reflection of the underlying variable. For example, greater parental interest will be reflected in the indicator used to measure it. The value at the end of the arrow leading to the manifest variable or indicator is the proportion of error or unique variance in the indicator. It is calculated by subtracting the reliability coefficient from the value of one. So, the proportion of error variance in parents’ interest is .17 (1 − .83 = .17). Because error variance is taken into account, the path coefficients are greater in Figure 8.1 than in Figure 7.4. So the path coefficient between parents’ interest and academic achievement is .34 in Figure 8.1 compared with .29 in Figure 7.4. The size of the indirect effect between parents’ interest and academic achievement is calculated in exactly the same way as it was when path analysis was carried out with multiple regression. The path coefficient between parents’ and child’s interest is multiplied by that between child’s interest and academic achievement, which gives an indirect effect of about .23 (.45 × .52 = .23) for the second model and .34 (.49 × .70 = .34) for the third model in Figure 8.1. With only 14 cases, none of the path coefficients in the three models in Figure 8.1 is statistically significant. We would not conduct a path analysis on such a small sample. With a more reasonably sized sample of 140 cases, all the path coefficients would be significant. The first two models in Figure 8.1 are just-identified and so provide a perfect or saturated fit to the data with a minimum fit function chi-square of 0.00 and no degrees of freedom. The chi-square of a model which is over-identified will be affected by the size of the sample. With 14 cases, the chi-square for the third model is 1.00 with 1 degree of freedom. This value is not statistically significant with p being greater than .05. With 140 cases, the degree of freedom is still 1 but chi-square is now 10.65, which is statistically significant, with a p of about .001. As the third model is a subset or nested version of the second model, we can compare the fit of the third model with the second one by subtracting chi-square and the degrees of freedom of the second model from the corresponding statistics of the third model. Because the second model provides a perfect fit to the data, the difference in chi-square and the degrees of freedom between these two models will be the same as the chi-square (e.g. 10.65 − 0.00 = 10.65), the degrees of freedom
Figure 8.1 Three path models with measurement error: (a) Correlated Direct (b) Direct and Indirect (c) Indirect [n = 140; *p < .05, **p < .001 (two-tailed)].
(1 − 0 = 1) and the statistical significance for the third model. With a sample of 140, the difference in chi-square is statistically significant, indicating that the third model provides a significantly poorer fit to the data than the second model.
The degrees of freedom for a model are determined by subtracting the number of parameters to be estimated from the number of observed variances and covariances. The number of observed variances and covariances is given by the general formula: n(n + 1)/2, where n is the number of observed variables. The number of observed variances and covariances for all three models in Figure 8.1 is 6 [3(3 + 1)/2 = 6]. The number of parameters to be estimated is 6 for the first two models and 5 for the third model. The parameters to be estimated are shown in the output provided by the LISREL program for conducting the structural equation modelling.
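Before turning to the measurement-model approach, it may help to see where alpha reliabilities such as the .83, .85 and .73 quoted above come from. The Python sketch below computes Cronbach’s alpha from item-level scores; the three-indicator data it uses are invented purely for illustration and are not the book’s data.

```python
# Cronbach's alpha for a small, invented set of three indicators (illustration only).
import numpy as np

# rows = cases, columns = the three indicators of one variable (hypothetical scores)
items = np.array([
    [1, 2, 1],
    [3, 3, 2],
    [2, 3, 3],
    [4, 3, 4],
    [2, 2, 3],
    [5, 4, 4],
])

k = items.shape[1]                                  # number of indicators
sum_item_variances = items.var(axis=0, ddof=1).sum()
total_variance = items.sum(axis=1).var(ddof=1)      # variance of the summed scale

alpha = (k / (k - 1)) * (1 - sum_item_variances / total_variance)
print(round(alpha, 2))   # about .88 for these invented scores
```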
Measurement model
The more complicated method for handling measurement error in structural equation modelling when variables are measured with two or more indicators is to represent the variable as a factor, as is done in confirmatory factor analysis. Indicators of a particular variable are specified as loading on the factor reflecting that variable. For the model not to be under-identified, it is necessary to fix one of the parameters of each factor. This is usually done by fixing the path coefficient between one of the indicators of the variable and its factor as 1, which sets the metric for the latent variable at this value. The standardized path coefficients for the path models are shown in Figure 8.2, where the path coefficient between each factor and the first indicator for that factor has been fixed as 1 (aa1, ci1 and pi1). The value for the fit of the model and the path coefficients between the latent variables are the same regardless of which of the three indicators is chosen. The size of the indirect path coefficient between parents’ interest and academic achievement is obtained by multiplying the path coefficient between parents’ and child’s interest by that between child’s interest and academic achievement. It is about .18 (.41 × .43 = .18) for the second model and .28 (.44 × .63 = .28) for the third model. With a sample of 14 cases, none of the path coefficients between the three latent variables is significant, whereas with a sample of 140 cases, all of them are significant. Because the number of parameters to be estimated in all three models is smaller than the number of observed variances and covariances, all three models are over-identified. As before, the fit of the first two models is the same and cannot be compared. For over-identified models, chi-square is affected by the size of the sample, which here has been set at the more reasonable size of 140. Because the third model is a subset of the second model, the difference in fit between the two models can be compared. This is done by subtracting the chi-square of the second model from that of the third model (107.81 − 90.27 = 17.54) and noting the significance of this difference in chi-square against the appropriate difference in the degrees of freedom for the two models, which is 1 (25 − 24 = 1). With 1 degree of
Figure 8.2 Three path models with measurement models: (a) Correlated Direct (χ² = 90.27, d.f. = 24, p < .001), (b) Direct and Indirect (χ² = 90.27, d.f. = 24, p < .001), (c) Indirect (χ² = 107.81, d.f. = 25, p < .001) [n = 140; *p < .001 (two-tailed)].
freedom, chi-square has to be 3.84 or more to be significant at the two-tailed .05 level, which it is. Consequently, the third model provides a significantly worse fit to the data than the second model. However, the fact that chi-square for the first and second model is significant implies that these two models do not provide an adequate fit to the data using chi-square as a measure of goodness-of-fit. A number of different measures of goodness-of-fit have been developed and are provided by structural equation modelling programs such as LISREL. Discussion of these measures can be found elsewhere (e.g. Loehlin 1998). These measures also indicate that the fit of these models is less than desirable. In these circumstances, it may be worth considering freeing one or more further parameters to improve the fit of the model.
Reporting the results
We do not have space here to illustrate the reporting of the results of structural equation modelling, which often includes a fairly lengthy explanation of what was done. Path diagrams for the models to be tested are usually presented. Where the interest is solely in the results of these models, the path coefficients and the proportion of error variance left unexplained in the measures can also be displayed in these diagrams, as shown in Figures 8.1 and 8.2. Where the best-fitting models differ from those initially put forward, the results for these models may be presented in further path diagrams. Several goodness-of-fit indices are usually provided for the models tested.
LISREL procedure for analysis with measurement reliability
The instructions below in bold have to be run to produce the appropriate results for the third model in Figure 8.1:

PA: Indirect model with measurement reliability
DAta NInputvar=3 NObserv=140
LAbels
CI AA PI
KMatrix
1.00
.53 1.00
.38 .45 1.00
MOdel NYvar=2 NXvar=1 NKvar=1 NEvar=2 c
LY=DIagonal LX=DIagonal c
TEpsilon=FIxed TDelta=FIxed c
GAmma=FRee BEta=SD PSi=DIagonal
FIxed GAmma(2,1)
LEta
CI AA
LKsi
PI
MAtrix LY * .92 .85
MAtrix LX * .91
MAtrix TEpsilon .15 .27
MAtrix TDelta * .17
PDiagram
OUtput Effect
We will briefly comment on these instructions. Keywords such as DAta and LAbels can be shortened to the first two letters, which have been capitalized to indicate this. The first line provides a title on each ‘page’ of the output. The line starting with DA states that the data consist of 3 variables and 140 observations or cases. The line beginning with LA indicates that labels will be provided for the three variables on the subsequent line. The labels are listed in the same order as the variables in the correlation matrix, which is placed below the line starting with KM. Endogenous variables (such as Child’s Interest and Academic Achievement) are listed first. Exogenous variables (such as Parents’ Interest) are listed last. Lines in LISREL can contain up to 127 characters including spaces. The line starting with MO has been broken up into four lines. Although the unbroken line would only contain 119 characters and spaces, it has been broken up to reduce its extent across the page and to make it easier to refer to. To indicate that the line continues, c is inserted at the end of the line after a particular instruction has been completed. The model consists of 2 endogenous manifest (NY) and latent (NE) variables and 1 exogenous manifest (NX) and latent (NK) variable. The parameters of the model are contained in a number of matrices. The matrix for each of the latent variables (LY and LX) has been made to consist only of parameters in the Diagonal of the matrix so that there is only one path coefficient between the manifest and the latent variable. These path
coefficients have been fixed to the square roots of the alpha reliabilities of the manifest variables in the lines starting with MAtrix LY and LX. For example, the path coefficient between the manifest and latent variable of the first endogenous variable (CI) has been fixed as .92 (√.85 = .92) and that of the second endogenous variable (AA) as .85 (√.73 = .85). The matrices for the proportion of error variance in the indicators for the exogenous (TDelta) and the endogenous (TEpsilon) latent variables are diagonal by default and have been fixed so that the proportion of error variance can be specified in the lines starting with MAtrix TEpsilon and TDelta. These values are derived by subtracting the alpha reliability of the variable from 1. For instance, the proportion of error variance in the indicator of the first endogenous variable (CI) has been fixed as .15 (1 − .85 = .15) and that of the second endogenous variable (AA) as .27 (1 − .73 = .27). The pathway between an exogenous and an endogenous variable is defined by the GAmma matrix, whereas that between two endogenous variables is defined by the BEta matrix. The path coefficients are free to vary in the GAmma matrix. As there is no pathway in this model between Parents’ Interest and Academic Achievement, this path coefficient has to be FIxed so as not to be free to vary. This is done in the line starting with FI. The path coefficient is defined by its position in the GAmma matrix, which is the 2nd row of the 1st column. To analyse the second model we simply delete this line. The BEta matrix is set so that the single path coefficient between Child’s Interest and Academic Achievement is free to vary. The PSi matrix is set to be diagonal and provides an estimate of the error for the two endogenous variables (Child’s Interest and Academic Achievement). The line starting with LE labels the two eta or latent endogenous variables of Child’s Interest (CI) and Academic Achievement (AA). The two labels are put on the next line. The line starting with LK labels the ksi or latent exogenous variable of Parents’ Interest (PI). The label is placed on the subsequent line. The line starting with PD provides a path diagram for this model. The line starting with OU specifies the output to be produced. The total and indirect effects are produced by adding EF.
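The fixed values in the MAtrix lines follow directly from the alpha reliabilities; a minimal check in Python (our own sketch, not LISREL code):

```python
# Deriving the fixed LISREL values from the alpha reliabilities reported earlier.
import math

alphas = {'CI': .85, 'AA': .73, 'PI': .83}

loadings = {k: round(math.sqrt(v), 2) for k, v in alphas.items()}   # MAtrix LY / LX values
error_vars = {k: round(1 - v, 2) for k, v in alphas.items()}        # MAtrix TEpsilon / TDelta values

print(loadings)    # {'CI': 0.92, 'AA': 0.85, 'PI': 0.91}
print(error_vars)  # {'CI': 0.15, 'AA': 0.27, 'PI': 0.17}
```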
LISREL output for analysis with measurement reliability

Because of the need to conserve space, only a few aspects of the output will be commented on or presented. The path diagram produced is similar to that shown for the third model in Figure 8.1, except that it excludes the asterisks indicating statistical significance and includes details of the normal theory weighted least squares chi-square and the root mean square error of approximation (RMSEA). The path coefficients are also shown in
the matrices in Table 8.1. For example, the path coefficient between the manifest and latent variable of Child’s Interest (CI) is shown in the Lambda-Y matrix and is 0.92. The path coefficient between the latent endogenous variables of Child’s Interest (CI) and Academic Achievement (AA) is presented in the BETA matrix and is 0.70. The value in parentheses immediately below it is its standard error, which is 0.09. The value below that is the t value, which tests whether the coefficient differs significantly from zero. The t value is 7.65. For a sample of 140 and at the two-tailed level, the critical value of t is about ± 1.98 at the .05 level, ± 2.62 at the .01 level and ± 3.37 at the .001 level. As 7.65 is larger than 3.37, this coefficient is statistically significant at below the two-tailed .001 level.
Table 8.1 LISREL output of the path coefficients for the third model with measurement reliability
The proportion of the error variance in the indicators is also presented in the matrices in Table 8.2. For instance, the proportion for the indicator of Child’s Interest (CI) is displayed in the THETA-EPS matrix and is 0.15. The indirect effect of the latent variable of Parents’ Interest (PI) on that of Academic Achievement (AA) is shown in the matrix in Table 8.3 and is 0.34. With a t value of 4.47, this effect is significant at below the two-tailed .001 level.
Table 8.2 LISREL output of the error variance estimates of the third model with measurement reliability
Table 8.3 LISREL output of the indirect effect for the third model with measurement reliability
LISREL procedure for analysis with measurement model
The instructions in bold below have to be run to give the results for the third model shown in Figure 8.2:

PA: Indirect model with measurement model
DAta NInputvar=9 NObserv=140
LAbels
ci1 ci2 ci3 aa1 aa2 aa3 pi1 pi2 pi3
KMatrix
1.00
.64 1.00
.79 .54 1.00
.42 .40 .41 1.00
.29 .37 .38 .49 1.00
.41 .35 .45 .54 .59 1.00
.25 .39 .33 .19 .39 .41 1.00
.23 .21 .42 .19 .58 .34 .59 1.00
.27 .24 .26 .25 .35 .38 .69 .61 1.00
MOdel NYvar=6 NXvar=3 NKvar=1 NEvar=2 c
GAmma=FRee BEta=SD PSi=DIagonal
FIxed GAmma(2,1)
FRee LX(2,1) LX(3,1) c
LY(2,1) LY(3,1) LY(5,2) LY(6,2)
STartval 1 LX(1,1) c
LY(1,1) LY(4,2)
LEta
CI AA
LKsi
PI
PDiagram
OUtput SS EF
Only differences between this and the previous set of instructions will be commented on. The number of input variables is 9 and these are labelled ci1 to pi3. The correlation matrix contains the correlations for these variables. There are 6 endogenous (NY) and 3 exogenous (NX) manifest variables. The matrices for the latent variables (LX and LY) are fixed and rectangular or square by default. We need to free the path coefficients or loadings of the second and third indicators for each latent variable, which we do with the line starting with FR. We need to fix the path coefficient of the first indicator for each latent variable as 1, which we do with the line beginning with ST. The first number in the parentheses refers to the row and the second number to the column. As there is only one exogenous latent variable (LX),
there is one column with three rows. There are two endogenous latent variables, so there are two columns with six rows. The first three rows of the first column are the indicators for Child’s Interest, whereas the last three rows of the second column are the indicators for Academic Achievement. The four instructions starting with MA in the previous set are omitted, as these values will be calculated for each of the indicators for the latent variables. To produce the standardized coefficients, SS is added to the line starting with OU. As previously, omit the line starting with FIxed GAmma to run the analysis for the second model.
LISREL output for analysis with measurement model
The path diagram provides the unstandardized path coefficients for the model. As these may vary according to the scale of the measures used, it is more useful to look at the standardized coefficients when we are interested in the relative size of the coefficients. These can be changed by selecting View from the menu bar near the top of the LISREL window, Estimations from the first drop-down menu and Standardized Solution from the second drop-down menu. Alternatively, they can be found in the output under Standardized Solution, as shown in Table 8.4. For example, the standardized path coefficient between Parents’ Interest (PI) and Academic Achievement (AA) is presented in the GAMMA matrix and is 0.44. The statistical significance of the path coefficients is provided by the t values. These can be obtained by selecting t-Values from the second dropdown menu, which displays them in the path diagram. Alternatively, they can be obtained under LISREL estimates in the output on the third line of the appropriate matrix. For example, the t value for the path coefficient between Parents’ Interest (PI) and Academic Achievement (AA) is presented in the GAMMA matrix and is 4.58. This is statistically significant at less than the two-tailed .001 level. The standardized indirect effect of the latent variable of Parents’ Interest (PI) on that of Academic Achievement (AA) is shown in the matrix in Table 8.6 and is 0.28. The t value of this path coefficient is shown in the third line of the matrix in Table 8.7 and is 3.73. This effect is significant at below the two-tailed .001 level.
Table 8.4 LISREL output of the standardized path coefficients for the third model with the measurement model
Table 8.5 LISREL output of the unstandardized path coefficients of the third model with the measurement model
Table 8.6 LISREL output of the standardized indirect effect for the third model with the measurement model
Table 8.7 LISREL output of the unstandardized indirect effect for the third model with the measurement model
Recommended further reading

Byrne, B.M. (1998) Structural Equation Modeling with LISREL, PRELIS, and SIMPLIS. Mahwah, NJ: Lawrence Erlbaum Associates. This book is a relatively accessible introduction to the various ways in which structural equation modelling can be applied using version 8 of LISREL.

Hair, J.F., Jr., Anderson, R.E., Tatham, R.L. and Black, W.C. (1998) Multivariate Data Analysis, 5th edn. Upper Saddle River, NJ: Prentice-Hall. Chapter 11 provides a non-technical introduction to structural equation modelling with useful references for further details.

Pedhazur, E.J. (1982) Multiple Regression in Behavioral Research: Explanation and Prediction, 2nd edn. New York: Holt, Rinehart & Winston. Although fairly technical, chapters 15 and 16 provide a useful guide to the rationale underlying path analysis using multiple regression and LISREL.

Pedhazur, E.J. and Schmelkin, L.P. (1991) Measurement, Design and Analysis: An Integrated Approach. Hillsdale, NJ: Lawrence Erlbaum Associates. Chapter 24 provides a valuable overview of various kinds of path analysis using LISREL.
Part 4 Explaining the probability of a dichotomous variable
9
Binary logistic regression
Logistic or logit multiple regression is used to determine which variables are most strongly associated with the probability of a particular category in another variable occurring. This category can be part of a dichotomous or binary variable having only two categories such as passing or failing a test, being diagnosed as having or not having a particular illness, being found guilty or not guilty and so on. Alternatively, the category can be one category of a polychotomous, polytomous or multinomial variable having three or more categories such as being diagnosed as suffering from anxiety, depression, both or neither. Logistic multiple regression is called binary logistic multiple regression when the criterion or dependent variable is dichotomous and multinomial logistic multiple regression when the criterion is polychotomous. This chapter will only be concerned with binary logistic multiple regression, which we shall call logistic regression for short. Multinomial logistic multiple regression is covered elsewhere (e.g. Hosmer and Lemeshow 1989). In trying to understand logistic regression, it may be useful to compare it with multiple regression. Both techniques are similar in many respects and can be used in similar ways. However, the statistics involved in logistic regression are more complicated. To understand these it may help to compare them with the corresponding statistics in multiple regression. Logistic regression is seen as being a more appropriate technique than multiple regression for analysing data where the dependent or criterion variable is
qualitative rather than quantitative. The reasons for this will be explained later. Predictor variables can be entered in a predetermined order as in hierarchical multiple regression or according to statistical criteria as in stepwise multiple regression. We can think of hierarchical logistic regression as determining whether one or more predictor variables significantly maximizes the probability of a category being present or absent. Similarly, we can view statistical or stepwise logistic regression as finding out which predictors significantly maximize the probability of a category being present or not. We will illustrate logistic regression with the data in Table 5.1, except that we will convert the continuous variable of child’s academic achievement into the dichotomous variable of being classified as a pass or a fail. Children with a score of 2 or less are considered to have failed and will be coded 0. Those with 3 or more are deemed to have passed and are coded as 1 as shown in Table 9.1. Predictor variables can be either quantitative or qualitative but we will only demonstrate quantitative variables. We will use stepwise multiple regression and a comparable procedure for logistic regression that will be described later.
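As a concrete illustration of this recoding, the sketch below dichotomizes a set of achievement scores at the 2/3 boundary; the raw scores are invented for illustration and simply reproduce the 0/1 pattern of Table 9.1.

```python
# Recoding a quantitative achievement score into a pass/fail dichotomy (0 = fail, 1 = pass).
import numpy as np

achievement = np.array([1, 2, 2, 3, 4, 3, 5, 3, 4])   # hypothetical scores for nine cases
passed = (achievement >= 3).astype(int)               # 2 or less -> 0, 3 or more -> 1

print(passed)   # [0 0 0 1 1 1 1 1 1]
```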
Predicted score
The results of this stepwise multiple regression differ from those in Chapter 5 because the values of child’s academic achievement have been changed, which has altered the correlations between achievement and the predictor variables. The predictor that explains the greatest significant proportion of variance in academic achievement is child’s intellectual ability, followed by
Table 9.1  Child’s academic achievement coded as a dichotomous variable

Cases   Child’s        Child’s    Child’s     Parents’    Teachers’
        achievement    ability    interest    interest    interest
1       0              1          1           2           1
2       0              3          3           1           2
3       0              2          3           3           4
4       1              3          2           2           4
5       1              4          4           3           2
6       1              2          3           2           3
7       1              3          5           3           4
8       1              4          2           3           2
9       1              3          4           2           3
Table 9.2  Values for determining the squared multiple correlation

Cases   Actual score   Predicted score   (Predicted − mean)²        (Actual − mean)²
1       0              −0.05             (−0.05 − 0.67)² = .52      (0 − 0.67)² = .45
2       0               0.53             (0.53 − 0.67)² = .02       (0 − 0.67)² = .45
3       0               0.65             (0.65 − 0.67)² = .00       (0 − 0.67)² = .45
4       1               0.84             (0.84 − 0.67)² = .03       (1 − 0.67)² = .11
5       1               0.99             (0.99 − 0.67)² = .10       (1 − 0.67)² = .11
6       1               0.45             (0.45 − 0.67)² = .05       (1 − 0.67)² = .11
7       1               0.93             (0.93 − 0.67)² = .07       (1 − 0.67)² = .11
8       1               0.99             (0.99 − 0.67)² = .10       (1 − 0.67)² = .11
9       1               0.73             (0.73 − 0.67)² = .00       (1 − 0.67)² = .11
Sum     6                                .89                        2.00
Mean    0.67
teachers’ interest and then parents’ interest. We can work out what a child’s academic achievement score will be based on three predictors if we use the following regression equation:

predicted achievement score = child’s ability score + teachers’ interest score + parents’ interest score + adjustment constant

where the predictor scores are multiplied by their respective unstandardized partial regression coefficients. These coefficients are about .28, .11 and .09, respectively. The adjustment constant that we have to add to this equation is −.62. Using these values we can calculate what the predicted achievement score is for each of the nine cases in Table 9.1. For example, the predicted score is −0.05 for the first case, where the child’s ability score is 1, their teachers’ interest score is 1 and their parents’ interest score is 2:

(.28 × 1) + (.11 × 1) + (.09 × 2) + (−.62) = .28 + .11 + .18 − .62 = −0.05

The actual score was 0, so the predicted score of −0.05 is very close to the actual score. The predicted score for each of the nine cases is shown in Table 9.2.
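A minimal sketch of this prediction for the first case, using the coefficients reported above (variable names are ours):

```python
# Predicted achievement score for case 1 from the unstandardized regression coefficients.
b_ability, b_teachers, b_parents = .28, .11, .09   # unstandardized partial coefficients
constant = -.62                                    # adjustment constant

ability, teachers, parents = 1, 1, 2               # case 1 in Table 9.1

predicted = b_ability * ability + b_teachers * teachers + b_parents * parents + constant
print(round(predicted, 2))   # -0.05, very close to the actual score of 0
```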
Squared multiple correlation

From the predicted and actual scores, we can work out what proportion of the variance in academic achievement is explained by these three predictors. This proportion is given by the squared multiple correlation, the formula for which is:
squared multiple correlation = sum of (predicted score − mean score)² / sum of (actual score − mean score)²
In other words, the proportion of variance accounted for is basically the variance of the predicted scores divided by the variance of the actual scores. These variances have been worked out in Table 9.2. The actual variances are 30 times as big because our sample is 270 and not 9, but this does not matter as 30 appears in both the numerator and denominator and so cancels out. The squared multiple correlation is about .445 (.89/2.00 = .445). Using this method we could work out the squared multiple correlation for any set of predictors and we could determine the statistical significance of adding predictors by using the F ratio described in Chapter 7.
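The same arithmetic can be carried out directly from the predicted and actual scores in Table 9.2; a minimal sketch:

```python
# Squared multiple correlation from the predicted and actual scores in Table 9.2.
import numpy as np

actual = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1])
predicted = np.array([-0.05, 0.53, 0.65, 0.84, 0.99, 0.45, 0.93, 0.99, 0.73])

mean = actual.mean()   # about 0.67
r_squared = ((predicted - mean) ** 2).sum() / ((actual - mean) ** 2).sum()
print(round(r_squared, 3))   # about .45; the text reports .445 from the rounded sums .89/2.00
```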
Predicted probability
In logistic regression we work out the predicted probability of a category occurring rather than the predicted value of a variable. The probability of a category occurring is the number of times that category is present divided by the total number of times it could be present:

probability of a category = frequency of a category / frequency of all categories
Probability varies from a minimum of 0 to a maximum of 1. The probability of a child passing without taking account of the predictors is simply the number of children who pass divided by the total number of children, which in this case is about .67 (6/9 = .667). The probability of a category occurring can be re-expressed in terms of odds according to the following formula:

probability of a category = odds of a category / (1 + odds of a category)
The odds of a category occurring is the ratio of the number of times the category occurs to the number of times it does not occur. So the odds of children passing are 6 to 3, which can be simplified as 2 to 1. If we express these odds as a fraction (2/1 = 2.0) and insert them into the above formula, we can see that the probability of passing is .67 [2.0/(1 + 2.0) = .667]. The reason for defining probability in terms of odds is that the formula that logistic regression uses to determine the probability of an event occurring is expressed in this way. Multiple regression assumes that the relationship between the criterion and the predictors can be best represented by a straight line. Figure 9.1 shows this regression line for the criterion of passing and the single
predictor of child’s ability. A single predictor was chosen to make the explanation simpler. The unstandardized regression coefficient for child’s ability on its own is about .31. This means that for every change of one unit in child’s ability, the corresponding change in failing or passing is .31 of a unit. This relationship is the same for both low and high levels of child’s ability. Logistic regression assumes that the relationship between the criterion and the predictors is an S-shaped or sigmoidal one, as shown in Figure 9.1. This means that this relationship is strongest around the midpoint of 0.5 between failing and passing and weakest at either end of the curve. A change of a particular size will have a much bigger effect at the midpoint of the curve than at either end. The points representing the values of the criterion and the predictor will lie closer to this S-shaped curve than a straight line when the criterion is dichotomous.
Logged odds
This S-shaped relationship is expressed in terms of what is variously called the log of the odds, the logged odds or the logit. We can demonstrate this relationship by working out the log of the odds for 15 levels of probability ranging from .0001 to .9999, as shown in Table 9.3, and plotting these probabilities against the log of their odds as portrayed in Figure 9.2.
Figure 9.1  Multiple and logistic regression lines.
Figure 9.2  Probability plotted against the log of the odds.
The odds are calculated by dividing the probability of a category occurring by the probability of it not occurring. The probability of a category not occurring is the probability of the category occurring subtracted from 1. So, if a category has a .7 probability of occurring, it has a .3 probability of not occurring (1 − .7 = .3). The odds of that category occurring are about 2.33 (.7/.3 = 2.33). We take the natural or Naperian logarithm of the odds to obtain the logit. For example, the natural logarithm of 2.33 is about 0.85. To convert the logits into odds we raise or exponentiate the base of e, the natural logarithm, which is about 2.718, to the power of the logit. So the odds are about 2.33 for a logit of 0.85 (2.718^0.85 = 2.33). The odds of the children in our example passing are about 2.00 (.667/.333 = 2.00) and the logged odds are about 0.69 (natural log of 2.00 = 0.693). The logged odds of 0.69 gives an odds of 2.00 (2.718^0.693 = 2.00). Whereas in multiple regression the regression coefficients and an adjustment constant are used to calculate the predicted value of a case, in logistic regression they are used to calculate the logged odds. The logged odds are then converted into odds and the odds are used to calculate the predicted probability of a case, as shown by the following formula:

probability of a category = odds / (1 + odds) = 2.718^(logged odds) / (1 + 2.718^(logged odds))
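These conversions are easy to verify; the sketch below reproduces a few rows of Table 9.3, using e itself rather than the rounded value of 2.718.

```python
# Converting probabilities to odds and logits, and back again (cf. Table 9.3).
import math

for p in (.3, .5, .7, .9):
    odds = p / (1 - p)                               # e.g. .7 -> 2.3333
    logit = math.log(odds)                           # e.g. .7 -> 0.8473
    back = math.exp(logit) / (1 + math.exp(logit))   # recovers the original probability
    print(p, round(odds, 4), round(logit, 4), round(back, 4))
```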
We will illustrate the calculation of the probability of children passing for
Table 9.3  Probabilities, odds and logits

Probabilities   1 − probability   Odds         Logits
.0001           .9999             .0001        −9.2102
.0010           .9990             .0010        −6.9068
.0100           .9900             .0101        −4.5951
.1000           .9000             .1111        −2.1972
.2000           .8000             .2500        −1.3863
.3000           .7000             .4286        −.8473
.4000           .6000             .6667        −.4055
.5000           .5000             1.0000       .0000
.6000           .4000             1.5000       .4055
.7000           .3000             2.3333       .8473
.8000           .2000             4.0000       1.3863
.9000           .1000             9.0000       2.1972
.9900           .0100             99.0000      4.5951
.9990           .0010             999.0000     6.9068
.9999           .0001             9999.0000    9.2102
the pattern of scores shown by the first case in our sample. The unstandardized logistic regression coefficients for child’s ability, parents’ interest and teachers’ interest are about 2.24, 0.60 and 0.77, respectively, while the constant is about −8.86 for the sample. The logged odds of children passing having values of 1 for child’s ability, 2 for parents’ interest and 1 for teachers’ interest is about −4.65:

(2.24 × 1) + (.60 × 2) + (.77 × 1) + (−8.86) = 2.24 + 1.20 + .77 − 8.86 = −4.65

A negative logit means that the odds are against children with these predictor values passing. A positive logit indicates that the odds are in favour of the children passing and a zero logit that the odds are even. A logit of −4.65 gives an odds of about 0.01 (2.718^−4.65 = 0.01), which gives a predicted probability of about .01 [.01/(1 + .01) = .01]. The predicted probabilities for the nine cases are presented in the third column of Table 9.4. A regression coefficient of 2.24 for child’s ability means that for every change of one unit in child’s ability, the corresponding change is about 2.24 in the logged odds. This is a multiple of about 9.39 (2.718^2.24 = 9.39) in the odds of failing or passing. We can illustrate this by changing child’s ability from 1 to 2, which gives us a logged odds of about −2.41:

(2.24 × 2) + (.60 × 2) + (.77 × 1) + (−8.86) = 4.48 + 1.20 + .77 − 8.86 = −2.41

A logit of −2.41 gives an odds of about 0.09 (2.718^−2.41 = 0.09), which is about 9.39 times as big as 0.01 (0.01 × 9.39 = 0.09). The odds ratio is the
Table 9.4  Predicted probabilities and calculation of log likelihood for the three-predictor model

Cases   Outcome (O)   Predicted probability (P)   Log P    O × log P   1 − O   1 − P   Log (1 − P)   (1 − O) × log (1 − P)   Log likelihood
1       0             .01                         −4.61     0.00       1       .99     −0.01         −.01                    −0.01
2       0             .50                         −0.69     0.00       1       .50     −0.69         −.69                    −0.69
3       0             .62                         −0.48     0.00       1       .38     −0.97         −.97                    −0.97
4       1             .89                         −0.12    −0.12       0       .11     −2.21          .00                    −0.12
5       1             .97                         −0.03    −0.03       0       .03     −3.51          .00                    −0.03
6       1             .30                         −1.20    −1.20       0       .70     −0.36          .00                    −1.20
7       1             .94                         −0.06    −0.06       0       .06     −2.81          .00                    −0.06
8       1             .97                         −0.03    −0.03       0       .03     −3.51          .00                    −0.03
9       1             .80                         −0.22    −0.22       0       .20     −1.61          .00                    −0.22
Sum                                                                                                                          −3.33
number by which we multiply the odds of a category occurring for a change of one unit in a predictor variable while controlling for any other predictors.
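A minimal sketch of the predicted-probability and odds-ratio arithmetic just described, using the coefficients reported in the text:

```python
# Predicted probability of passing for case 1, and the odds ratio for child's ability.
import math

b_ability, b_parents, b_teachers, constant = 2.24, .60, .77, -8.86

def prob_pass(ability, parents, teachers):
    logit = b_ability * ability + b_parents * parents + b_teachers * teachers + constant
    odds = math.exp(logit)
    return logit, odds, odds / (1 + odds)

logit1, odds1, p1 = prob_pass(1, 2, 1)   # case 1: about -4.65, 0.01 and .01
logit2, odds2, p2 = prob_pass(2, 2, 1)   # child's ability raised by one unit

print(round(logit1, 2), round(p1, 2))
print(round(odds2 / odds1, 2))           # odds ratio, about 9.39 (= e to the power 2.24)
```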
Log likelihood
The statistic used to determine whether the predictors included in a model provide a good fit to the data is the log likelihood, which is usually multiplied by −2 so that it takes the approximate form of a chi-square distribution. A perfect fit is indicated by 0, while larger values signify progressively poorer fits. Because the −2 log likelihood will be bigger the larger the sample, the size of the sample is controlled by subtracting the −2 log likelihood value of a model containing the predictors from the −2 log likelihood value of a model containing only the adjustment constant. The predicted probability of a category occurring for all cases in the constant-only model is the overall probability of that category occurring in the sample. This is .67 for our example, as shown in Table 9.5. The −2 log likelihood value for our example in which the three predictors of child’s ability, parents’ interest and teachers’ interest are included to predict whether the child passes is about 199.80 (−2 × −3.33 × 30 = 199.80). The −2 log likelihood of the model with just the constant is about 343.80 (−2 × −5.73 × 30 = 343.80). The difference between the two −2 log likelihood values gives a chi-square value of about 144.00 (343.80 − 199.80 = 144.00). The degrees of freedom for a model are the number of predictors in the model. The degrees of freedom are 0 for the constant-only model and 3 for the three-predictor model. The difference in the degrees of freedom for the two models is 3 (3 − 0 = 3). Chi-square has to be 7.82 or larger to be significant at the .05 level. As the chi-square of 144.00 is larger than 7.82, chi-square is statistically significant. Therefore, we would conclude that the model containing these three predictors provides a satisfactory fit to the data. The log likelihood is the sum, across cases, of the logged probability associated with the predicted and actual outcome for each case, which can be calculated using the following formula:

[outcome × log of predicted probability] + [(1 − outcome) × log of (1 − predicted probability)]

The steps in this calculation for the nine cases are shown in Table 9.4 for the three-predictor model and in Table 9.5 for the constant-only model. As the nine cases are each repeated 30 times, the sum of the log likelihood needs to be multiplied by 30. To obtain the −2 log likelihood we need to multiply this figure by −2. This gives us a value of about 199.80 (−3.33 × 30 × −2 = 199.80) for the three-predictor model. The corresponding value for the constant-only model is about 343.80 (−5.73 × 30 × −2 = 343.80). A model with predictors can also be compared with models without one
Table 9.5  Predicted probabilities and calculation of log likelihood for the constant-only model

Cases   Outcome (O)   Predicted probability (P)   Log P    O × log P   1 − O   1 − P   Log (1 − P)   (1 − O) × log (1 − P)   Log likelihood
1       0             .67                         −0.40     0.00       1       .33     −1.11         −1.11                   −1.11
2       0             .67                         −0.40     0.00       1       .33     −1.11         −1.11                   −1.11
3       0             .67                         −0.40     0.00       1       .33     −1.11         −1.11                   −1.11
4       1             .67                         −0.40    −0.40       0       .33     −1.11          0.00                   −0.40
5       1             .67                         −0.40    −0.40       0       .33     −1.11          0.00                   −0.40
6       1             .67                         −0.40    −0.40       0       .33     −1.11          0.00                   −0.40
7       1             .67                         −0.40    −0.40       0       .33     −1.11          0.00                   −0.40
8       1             .67                         −0.40    −0.40       0       .33     −1.11          0.00                   −0.40
9       1             .67                         −0.40    −0.40       0       .33     −1.11          0.00                   −0.40
Sum                                                                                                                          −5.73
or more of those predictors using this chi-square test. For example, the model with the three predictors of child’s ability, parents’ interest and teachers’ interest can be compared with a model containing all four predictors and with one containing the two predictors of child’s ability and parents’ interest. We have already seen that the −2 log likelihood of the three-predictor model is about 199.80. The comparable values for the four- and two-predictor models are 198.20 and 207.45, respectively. The difference between the three- and four-predictor model gives a chi-square of about 1.60 (199.80 − 198.20 = 1.60) with 1 degree of freedom (4 − 3 = 1). With 1 degree of freedom, chi-square has to be 3.84 or larger to be statistically significant at the .05 level. As a chi-square of 1.60 is smaller than 3.84, the four-predictor model – which has the additional predictor of child’s interest – does not provide a statistically significant increment in the fit to the data. The difference between the three- and the two-predictor model produces a chi-square value of about 7.65 (207.45 − 199.80 = 7.65) with 1 degree of freedom (3 − 2 = 1). As a chi-square of 7.65 is larger than 3.84, the three-predictor model with the additional predictor of teachers’ interest provides a statistically significant increment in the fit to the data compared with the two-predictor model, which omits teachers’ interest. The statistics for these comparisons are summarized in Table 9.6.
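The log likelihood bookkeeping of Tables 9.4 and 9.5 and the model comparison can be reproduced in a few lines. The sketch below uses the rounded predicted probabilities from Table 9.4, so its totals differ slightly from the published 199.80.

```python
# -2 log likelihood from per-case predicted probabilities, and a chi-square difference test.
import numpy as np
from scipy.stats import chi2

outcome = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1])
p_three = np.array([.01, .50, .62, .89, .97, .30, .94, .97, .80])   # three-predictor model (Table 9.4)
p_const = np.full(9, 6 / 9)                                         # constant-only model (Table 9.5)

def minus_2_log_likelihood(p, y, weight=30):    # each pattern occurs 30 times, giving n = 270
    log_lik = y * np.log(p) + (1 - y) * np.log(1 - p)
    return -2 * weight * log_lik.sum()

m2ll_three = minus_2_log_likelihood(p_three, outcome)   # close to the 199.80 reported in the text
m2ll_const = minus_2_log_likelihood(p_const, outcome)   # about 343.8

difference = m2ll_const - m2ll_three                    # about 144 on 3 degrees of freedom
print(round(m2ll_three, 1), round(m2ll_const, 1), round(difference, 1))
print(chi2.sf(difference, 3) < .05)                     # True: a significant improvement in fit
```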
Stepwise selection
Various statistical criteria can be used to enter or remove predictors from a logistic regression analysis. In the forward selection method, which was used here, the predictor with the most statistically significant Rao’s (1973) efficient score statistic is entered first. This score statistic is a measure of association in logistic regression. The larger it is, the stronger the association. If none of the score statistics is statistically significant, the procedure stops indicating that none of the predictors provides a good fit to the data. In our example, the most significant score statistic was for child’s ability, which was entered first. This has a value of about 97.28, which is significant at less than the .001 level. If the −2 log likelihood difference between the model with this predictor
Table 9.6  Comparing the fit of logistic regression models

Models compared      Log likelihood difference   χ²     df difference   p
3 vs 4 predictors    199.80 − 198.20 =           1.60   4 − 3 = 1       ns
2 vs 3 predictors    207.45 − 199.80 =           7.65   3 − 2 = 1       .05
and the one without it is statistically significant at less than the .10 level, the predictor is retained. If the difference is not statistically significant at this level, the procedure ends, indicating that none of the predictors provides a good fit to the data. The −2 log likelihood difference between this model and the constant-only model is about 110.47 (343.80 − 233.33 = 110.47), which with 1 degree of freedom is significant at this level. So child’s ability is the first predictor to be entered into the model. The relevant statistics for this and subsequent steps are presented in Table 9.7. The predictor with the next most statistically significant score statistic is entered next. This is parents’ interest, which has a score statistic of about 26.27. The −2 log likelihood difference between the model with and without this predictor is about 25.88 (233.33 − 207.45 = 25.88), which with 1 degree of freedom is statistically significant at the .10 level. Consequently, parents’ interest is the second predictor to be entered into the model. The −2 log likelihood difference test is also applied to the first predictor of child’s ability. The difference between the model with this predictor and the one without it is about 102.05 (309.50 − 207.45 = 102.05), which with 1 degree of freedom is statistically significant at the .10 level. Thus child’s ability remains in the model. The predictor with the next most statistically significant score statistic is teachers’ interest, which has a value of about 6.33. The −2 log likelihood difference between the model with and without this predictor is about 7.65 (207.45 − 199.80 = 7.65), which with 1 degree of freedom is statistically significant at the .10 level. Consequently, teachers’ interest is the third predictor to be entered into the model. Child’s ability and parents’ interest remain in the model because the −2 log likelihood differences of 95.47 (295.27 − 199.80 = 95.47) and 5.08 (204.88 − 199.80 = 5.08), respectively, are significant at the .10 level. Because the score statistic for the fourth variable of child’s interest is not statistically significant, the analysis stops here. Thus, the three predictors of child’s ability, parents’ interest and teachers’ interest provide a statistically significant fit to the data.
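In practice these statistics come straight out of a logistic regression routine. The sketch below fits the final three-predictor model with the statsmodels library rather than SPSS (an assumption on our part, with hypothetical column names); it is intended only to show where the coefficients and −2 log likelihood values would come from, not to reproduce the SPSS output exactly.

```python
# Fitting the three-predictor logistic regression with statsmodels (illustrative only).
import numpy as np
import pandas as pd
import statsmodels.api as sm

# The nine data patterns of Table 9.1, each repeated 30 times to give n = 270.
patterns = pd.DataFrame({
    'passed':   [0, 0, 0, 1, 1, 1, 1, 1, 1],
    'ability':  [1, 3, 2, 3, 4, 2, 3, 4, 3],
    'parents':  [2, 1, 3, 2, 3, 2, 3, 3, 2],
    'teachers': [1, 2, 4, 4, 2, 3, 4, 2, 3],
})
data = patterns.loc[np.repeat(patterns.index, 30)].reset_index(drop=True)

X = sm.add_constant(data[['ability', 'parents', 'teachers']])
fit = sm.Logit(data['passed'], X).fit(disp=0)

print(fit.params.round(2))        # coefficients of roughly the size reported in the text
print(round(-2 * fit.llf, 1))     # -2 log likelihood of the fitted model (close to 199.80)
print(round(-2 * fit.llnull, 1))  # -2 log likelihood of the constant-only model (about 343.8)
```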
Distribution of predicted probabilities
One way of viewing the improvement in the fit of the data provided by a model is in terms of the difference in predicted probabilities of the category occurring and not occurring. With a perfect model the predicted probability of the category occurring (e.g. passing) is 1 and that of the category not occurring (e.g. failing) is 0. The difference in the predicted probability of the category occurring and not occurring is a maximum of 1. With the constant-only model, the predicted probability of the category occurring and not occurring is the same and in our example is .67 for failing and
Table 9.7  Summary of statistics for forward selection of predictors in logistic regression

                                 Entry criteria              Removal criteria
Steps  Predictors                Score statistic   p         Log likelihood difference   χ²       df difference   p
1      Child's ability           97.28             .001      343.80 − 233.33 =           110.47   1 − 0 = 1       .001
       Child's interest          45.00             .001
       Parents' interest         33.75             .001
       Teachers' interest        25.12             .001
2      Child's ability                                       309.50 − 207.45 =           102.05   2 − 1 = 1       .001
       Parents' interest         26.27             .001      233.33 − 207.45 =           25.88    2 − 1 = 1       .001
       Teachers' interest        23.86             .001
       Child's interest          8.84              .01
3      Child's ability                                       295.27 − 199.80 =           95.47    3 − 2 = 1       .001
       Parents' interest                                     204.88 − 199.80 =           5.08     3 − 2 = 1       .05
       Teachers' interest        6.33              .05       207.45 − 199.80 =           7.65     3 − 2 = 1       .01
       Child's interest          2.41              ns
4      Child's interest          3.39              ns
passing. The difference in the predicted probability of the category occurring and not occurring is a minimum of 0. Better fitting models will increase the difference in the predicted probability of the category occurring and not occurring. We can see this in our example. The predicted probabilities of the three models together with the constant-only and perfect models are presented in Table 9.8. The means of the predicted probabilities for the cases who failed and who passed are also shown, together with the difference between the two means. This difference increases from .37 for the one-predictor model of child's ability to .43 for the three-predictor model of child's ability, parents' interest and teachers' interest. In other words, the three-predictor model best discriminates between those who passed and those who failed.

Reporting the results
There are various ways of writing up the results of the analysis illustrated in this chapter, which will vary according to its precise aims. A very succinct report may be phrased as follows: ‘A forward stepwise binary logistic regression was performed. Predictors were entered based on the most significant score statistic with a p of .05 or less and were removed if the p of the −2 log likelihood test was greater than .10. Child’s intellectual ability
Table 9.8  Predicted probabilities of those passing and failing for five logistic regression models

                            Models
Cases        Outcome   Constant   1     2     3     Perfect
1            0         .67        .08   .04   .01   .00
2            0         .67        .81   .53   .50   .00
3            0         .67        .38   .61   .62   .00
Mean                   .67        .42   .39   .38   .00
4            1         .67        .81   .80   .90   1.00
5            1         .67        .97   .99   .97   1.00
6            1         .67        .38   .30   .30   1.00
7            1         .67        .81   .94   .94   1.00
8            1         .67        .97   .99   .97   1.00
9            1         .67        .81   .80   .80   1.00
Mean                   .67        .79   .80   .81   1.00
Difference             .00        .37   .41   .43   1.00
was entered first (χ2(1) = 110.47, p < .001), parents' interest second (χ2(1) = 25.88, p < .001) and teachers' interest third (χ2(1) = 7.65, p < .05). Child's interest did not provide a significant increment in the fit of the model. The probability of passing was associated with greater child's ability, parents' interest and teachers' interest.'
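The chi-square values quoted in this write-up can be checked directly from the −2 log likelihoods reported earlier in the chapter (343.80 for the constant-only model and 233.33, 207.45 and 199.80 for the one-, two- and three-predictor models). A short Python check, offered only as an illustration (scipy is my choice of tool, not something used in the book):

    from scipy.stats import chi2

    minus2ll = {"constant": 343.80, "1 predictor": 233.33,
                "2 predictors": 207.45, "3 predictors": 199.80}

    steps = [("constant", "1 predictor"),       # child's ability entered
             ("1 predictor", "2 predictors"),   # parents' interest entered
             ("2 predictors", "3 predictors")]  # teachers' interest entered

    for smaller, larger in steps:
        diff = minus2ll[smaller] - minus2ll[larger]
        print(f"{larger}: chi-square = {diff:.2f}, p = {chi2.sf(diff, 1):.4f}")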
SPSS Windows procedure
To perform the logistic regression described in this chapter, follow the procedure outlined below. If the data in Table 5.1 have already been saved as a file, retrieve the file in the Data Editor by selecting File, Open, Data. . ., the file's name from the Open File dialog box and Open. Otherwise, enter the data as shown in Box 5.1. Change the child's achievement scores so that a score of 2 or less is coded 0 and a score of 3 or more is coded as 1, as shown in Table 9.1. Weight the cases by following the Weight Cases. . . procedure described in Chapter 5. Select Analyze on the horizontal menu bar near the top of the window, Regression from the drop-down menu and then Binary Logistic. . ., which opens the Logistic Regression dialog box in Box 9.1. Select Child's achievement and then the first ▶ button to put this variable in the box under Dependent: (variable).
Box 9.1  Logistic Regression dialog box
Box 9.2  Logistic Regression: Save New Variables sub-dialog box
Select Child's ability to Teachers' interest and then the second ▶ button to put these variables in the box under Covariates:. Select Enter beside Method:, which produces a drop-down menu. Select Forward: LR for a forward stepwise logistic regression in which the −2 log likelihood (ratio) test is used to consider entered predictors for removal. If you want the predicted probabilities for this model (as shown in Tables 9.4 and 9.8), select Save. . ., which opens the Logistic Regression: Save New Variables sub-dialog box in Box 9.2. Select Probabilities under Predicted Values and then Continue to close this sub-dialog box and return to the main dialog box. When the analysis is performed, these predicted probabilities will be placed in the column next to freq in the Data Editor. Select OK to carry out the analysis.
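Readers without SPSS can fit the same kind of model in Python. The sketch below is my own illustration (the file name and column names are hypothetical placeholders): it fits a binary logistic regression on three predictors with statsmodels and stores the predicted probabilities alongside the data, which is what the Save Probabilities option does in SPSS.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical file and column names standing in for the book's data set
    df = pd.read_csv("achievement.csv")   # one row per case
    predictors = ["child_ability", "parents_interest", "teachers_interest"]

    X = sm.add_constant(df[predictors])
    result = sm.Logit(df["achievement"], X).fit(disp=0)

    print(result.summary())                    # coefficients, Wald tests, log likelihood
    df["predicted_prob"] = result.predict(X)   # saved predicted probabilities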
SPSS output
The output for all the tables will be shown apart from the first four. These tables will be presented in the same sequence as in the output. Table 9.9 shows the Score statistics of the four predictors and their statistical Significance to be considered for entry on the first step, which is numbered 0. The highest score statistic is 97.279 for Child’s ability, which has a probability of .000 (i.e. less than .001). Table 9.10 presents the chi-square test for the three steps in which a predictor provided a significant improvement in the fit of the model. For example, in Step 3 the Step chi-square is 5.847 for adding teachers’ interest to the previous model of child’s ability and parents’ interest. This differs slightly from the figure we gave of 7.65 due to rounding error. It is the difference in the −2 log likelihood of the two models. The Block and Model
Table 9.9  SPSS output of score statistics of the predictors for the first step

Variables not in the Equation
                               Score     df   Sig.
Step 0   Variables   CHABIL    97.279    1    .000
                     CHINTER   45.000    1    .000
                     PARINTER  33.750    1    .000
                     TEAINTER  25.116    1    .000
         Overall Statistics    120.399   4    .000

Table 9.10  SPSS output of chi-square test for the three steps

Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     110.386      1    .000
         Block    110.386      1    .000
         Model    110.386      1    .000
Step 2   Step     25.884       1    .000
         Block    136.270      2    .000
         Model    136.270      2    .000
Step 3   Step     5.847        1    .016
         Block    142.117      3    .000
         Model    142.117      3    .000
chi-square of 142.117 is the difference in the −2 log likelihood between this three-predictor model and the constant-only model, which we worked out as 144.00 (343.80 − 199.80 = 144.00). Table 9.11 shows the −2 log likelihood of the model in each of the three steps. For instance, the −2 Log likelihood for the three-predictor model in Step 3 is 201.600, which we calculated as 199.80. Table 9.12 presents the classification table for the three steps. This table categorizes the predicted probabilities shown in Table 9.8 into a predicted value of 0 (fail) below a cut-off probability of .500 and of 1 (pass) on .500 and above. It then presents the number of observed cases falling into these two predicted categories. In Step 3, for example, 30 failed and were predicted to fail (case 1, which multiplied 30 times is 30), 30 passed but were
Table 9.11  SPSS output of −2 log likelihood for the model in each of the three steps

Model Summary
Step   −2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      233.331             .336                   .466
2      207.447             .396                   .550
3      201.600             .409                   .568

Table 9.12  SPSS output of the classification table for the three steps

Classification Table (a)
                                        Predicted
                                        Child's achievement     Percentage
Observed                                0          1            Correct
Step 1   Child's achievement   0        60         30           66.7
                               1        30         150          83.3
         Overall Percentage                                     77.8
Step 2   Child's achievement   0        30         60           33.3
                               1        30         150          83.3
         Overall Percentage                                     66.7
Step 3   Child's achievement   0        30         60           33.3
                               1        30         150          83.3
         Overall Percentage                                     66.7
a  The cut value is .500.
predicted to fail (case 6), 60 failed but were predicted to pass (cases 2 and 3) and 150 passed and were predicted to pass (cases 4, 5, 7, 8 and 9). Thus 66.7 per cent correct predictions were made [(30 + 150) × 100/270 = 66.67]. Note that in terms of this cut-off point, the percentage of correct predictions decreases from 77.8 for the first model to 66.7 for the subsequent model. However, as we have already seen in Table 9.8, there is a slight but significant increase in the predicted probability of passing from the one- to the three-predictor model. Thus, these classification tables may be misleading in terms of the adequacy of the model.
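A classification table of this kind is easy to reproduce from a column of predicted probabilities, as the following Python fragment illustrates (an assumption of mine rather than anything in the book; the variable names are placeholders and the .500 cut-off matches the SPSS default).

    import numpy as np
    import pandas as pd

    def classification_table(observed, predicted_prob, cut=0.5):
        """Cross-tabulate observed 0/1 outcomes against predictions at a cut-off."""
        predicted = (np.asarray(predicted_prob) >= cut).astype(int)
        table = pd.crosstab(pd.Series(observed, name="Observed"),
                            pd.Series(predicted, name="Predicted"))
        percent_correct = 100 * (np.asarray(observed) == predicted).mean()
        return table, percent_correct

    # Hypothetical usage: table, pct = classification_table(df["achievement"], df["predicted_prob"])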
Table 9.13  SPSS output of the predictors on each step and their unstandardized regression coefficients

Variables in the Equation
                        B        S.E.    Wald     df   Sig.   Exp(B)
Step 1 (a)   CHABIL     1.906    .237    64.822   1    .000   6.723
             Constant   −4.295   .625    47.162   1    .000   .014
Step 2 (b)   CHABIL     2.265    .323    49.051   1    .000   9.634
             PARINTER   1.290    .270    22.900   1    .000   3.634
             Constant   −7.974   1.183   45.450   1    .000   .000
Step 3 (c)   CHABIL     2.240    .326    47.144   1    .000   9.397
             PARINTER   .600     .345    3.027    1    .082   1.822
             TEAINTER   .770     .332    5.392    1    .020   2.161
             Constant   −8.859   1.357   42.588   1    .000   .000
a  Variable(s) entered on step 1: CHABIL.
b  Variable(s) entered on step 2: PARINTER.
c  Variable(s) entered on step 3: TEAINTER.
Table 9.13 shows the predictors selected for each step and their unstandardized regression coefficients. Child’s ability was entered in the first step, parents’ interests in the second step and teachers’ interests in the third and final step. For Step 3, the regression coefficients or Bs of child’s ability, parents’ interest and teachers’ interest are 2.240, .600 and .770, respectively. The Constant is −8.859. A regression coefficient of 2.240 for child’s ability means that one unit of child’s ability (1 on the 5-point scale) adds 2.240 to the logged odds of passing and increases the odds of passing by the Exponent of (B) or a multiple of 9.397. Table 9.14 displays the Change in −2 Log Likelihood when predictors are removed from the model at each step and the statistical significance of this change. In Step 3, for example, removing teachers’ interest from the model results in a change in the −2 log likelihood of 5.847 (previously calculated as 7.65), which with 1 degree of freedom is significant at .016 or less. This table also presents the Log Likelihood of the model without each of the predictors in the model. So, for instance, the log likelihood for the model in Step 3 without teachers’ interest is −103.724, which multiplied by −2 gives a log likelihood of 207.448. Finally, Table 9.15 shows the score statistic and its significance for predictors not in the model at each step. In Step 3, for example, the three predictors of child’s ability, parents’ interest and teachers’ interest have been entered leaving child’s interest out of the model. The Score statistic for
Table 9.14  SPSS output of change in −2 log likelihood with predictors removed at each step

Model if Term Removed
                      Model Log     Change in −2
Variable              Likelihood    Log Likelihood   df   Sig. of the Change
Step 1   CHABIL       −171.859      110.386          1    .000
Step 2   CHABIL       −154.751      102.054          1    .000
         PARINTER     −116.666      25.884           1    .000
Step 3   CHABIL       −147.636      93.673           1    .000
         PARINTER     −102.442      3.283            1    .070
         TEAINTER     −103.724      5.847            1    .016

Table 9.15  SPSS output of the score statistic and its significance for predictors not in the model at each step

Variables not in the Equation
                                 Score    df   Sig.
Step 1   Variables   CHINTER     8.842    1    .003
                     PARINTER    26.274   1    .000
                     TEAINTER    23.859   1    .000
         Overall Statistics      31.939   3    .000
Step 2   Variables   CHINTER     2.411    1    .120
                     TEAINTER    6.329    1    .012
         Overall Statistics      11.082   2    .004
Step 3   Variables   CHINTER     3.385    1    .066
         Overall Statistics      3.385    1    .066
child’s interest is 3.385, which with a probability of .066 is not statistically significant and so is not entered into the fourth step.
Recommended further reading

Menard, S. (1995) Applied Logistic Regression Analysis. Thousand Oaks, CA: Sage. This is a concise and non-technical account of logistic regression with examples of SPSS output, although the data set is not provided.
Pampel, F.C. (2000) Logistic Regression: A Primer. Thousand Oaks, CA: Sage. This is another short and non-technical account of logistic regression that includes examples of SPSS output, although the data set is not given.

SPSS Inc. (2002) SPSS 11.0 Regression Models. Upper Saddle River, NJ: Prentice-Hall. Provides a detailed commentary on the output produced by SPSS 11.0 and a useful introduction to logistic regression.

Tabachnick, B.G. and Fidell, L.S. (1996) Using Multivariate Statistics, 3rd edn. New York: HarperCollins. Chapter 12 offers a systematic and general account of logistic regression, comparing the procedures of four different programs (including SPSS 6.0) and showing how to write up the results of two worked examples.
Part 5 Testing differences between group means
10
An introduction to analysis of variance and covariance
Analysis of variance (usually abbreviated as ANOVA) and analysis of covariance (ANCOVA) are parametric statistical techniques for determining whether the variance in a quantitative variable differs significantly from that expected by chance for a qualitative variable or its interaction with one or more other qualitative variables. If the variance differs in this way and the qualitative variable only consists of two groups or categories, this indicates that the means of these two groups differ. Where the qualitative variable consists of more than two groups, a significant difference indicates that the means of two or more of these groups are likely to differ. Which of the means differ needs to be determined using further statistical tests. The qualitative variable is often referred to as an independent variable or factor. An analysis of variance may have one or more factors. An analysis of variance with one factor is called a one-way analysis of variance. One with two factors is called a two-way analysis of variance, one with three factors a three-way analysis of variance and so on. The groups or categories within a factor are frequently called levels. Where the analysis of variance has two or more factors, the number of levels for each factor may be given instead. So, for example, a two-way analysis of variance which has two groups in each factor may be referred to as a 2 × 2 analysis of variance. A three-way analysis of variance which has two groups in one factor and three groups in the other two factors may be referred to as a 2 × 3 × 3 analysis of variance and so forth.
The quantitative variable is usually called the dependent variable, of which there may be more than one. Analysis of variance generally refers to one dependent variable being examined at a time. Multivariate analysis of variance (MANOVA) and covariance (MANCOVA) refer to two or more dependent variables being examined at the same time. A one-way MANOVA is the same as a discriminant analysis, which is covered in Chapter 13. Analysis of covariance is used to control for the influence of one or more quantitative variables that have been found to be correlated with the dependent variable. The scores on the dependent variable for the groups within a factor may come from the same or different cases. Where the scores come from different cases, the factor may be called an unrelated factor. The sex or gender of a case is an example of an unrelated factor. Where the scores come from the same, matched or related cases, the factor may be called a related factor. The same variable measured on two different occasions on the same cases is an example of a related factor. An analysis of variance may also be referred to in terms of whether it consists of unrelated factors only (an unrelated analysis of variance), related factors only (a related analysis of variance) or a combination of both related and unrelated factors (a mixed analysis of variance). The aim of analysis of variance is to determine which factors and which interactions of those factors account for a significant proportion of the overall variance in a variable. The simplest analysis of variance is a one-way analysis of variance consisting of two unrelated groups, which is the equivalent of an unrelated t-test, which assumes the variances of the two groups are equal. As we would use an unrelated t-test to analyse the data for such a design, we will illustrate the computation and use of analysis of variance in this chapter with the next simplest analysis of variance, which is a one-way unrelated analysis of variance with three groups. We will demonstrate a one-way analysis of covariance in the next chapter and a 2 × 2 unrelated analysis of variance in the chapter after that.
One-way unrelated analysis of variance
Suppose we have three groups of 2, 4 and 3 individuals as shown in Table 10.1. The three groups of individuals could be depressed patients who had been assigned to one of three conditions consisting of no treatment, a drug treatment and a psychotherapy treatment. The use of analysis of variance does not depend on whether patients had or had not been randomly assigned to the three treatments. For this example, we will say that the three groups represent the three marital statuses of being divorced, married or never married, respectively. The dependent variable is a 9-point measure of depression in which higher scores indicate higher levels of depression. The mean depression score is highest for the divorced (6.00) followed by the
Table 10.1  Depression scores for three groups of cases

Cases   Group   Depression
1       1       7
2       1       5
3       2       2
4       2       3
5       2       4
6       2       3
7       3       5
8       3       3
9       3       4
never married (4.00) and the married (3.00), as shown in Table 10.2. We want to know whether the difference between the means of these groups (the between-groups difference) is significantly greater than would be expected by chance. We also usually wish to know which, if any, of the three means differ from one another. The mean score of the group can be thought of as representing the ‘true’ score for that group, while the variation in scores around the mean within that group can be thought of as chance, error or unexplained variation. In analysis of variance, we compare the between-groups variance estimate with the within-groups variance estimate. If the between-groups variance estimate is considerably bigger than the within-groups variance estimate, the differences between the means are unlikely to be due to chance or error. Dividing the between-groups variance estimate by the within-groups variance estimate is known as the F-test in tribute to Sir Ronald Fisher who developed analysis of variance: F=
(between-groups variance estimate) / (within-groups variance estimate)
The bigger the between-groups variance estimate is in relation to the within-groups variance estimate, the bigger F will be and so the more likely it is that the means of the groups differ significantly from each other. To calculate the between-groups variance estimate, we assume that the true score for an individual in a group is the mean score for that group. From each of those true scores we subtract the grand or overall mean for the whole sample, which is 4.00. We then square these differences or deviations
Table 10.2  Calculation of the F-test for a one-way analysis of variance

Cases   Group   Depression   Between-groups squared deviations   Within-groups squared deviations
1       1       7            (6 − 4)² = 4                        (6 − 7)² = 1
2       1       5            (6 − 4)² = 4                        (6 − 5)² = 1
Group mean      12/2 = 6.00
3       2       2            (3 − 4)² = 1                        (3 − 2)² = 1
4       2       3            (3 − 4)² = 1                        (3 − 3)² = 0
5       2       4            (3 − 4)² = 1                        (3 − 4)² = 1
6       2       3            (3 − 4)² = 1                        (3 − 3)² = 0
Group mean      12/4 = 3.00
7       3       5            (4 − 4)² = 0                        (4 − 5)² = 1
8       3       3            (4 − 4)² = 0                        (4 − 3)² = 1
9       3       4            (4 − 4)² = 0                        (4 − 4)² = 0
Group mean      12/3 = 4.00
Grand mean      36/9 = 4.00
Sum of squares (SS)          12                                  6
Degrees of freedom (df)      3 − 1 = 2                           9 − 3 = 6
Mean square (MS)             12/2 = 6.00                         6/6 = 1.00
F                            6.00/1.00 = 6.00
and add them together to form what is known as the sum of squares (which is short for the sum of squared deviations). We divide this sum of squares by the between-groups degrees of freedom, which is the number of groups minus 1, to form the mean square (which is short for the mean squared deviation and which is the variance estimate). As shown in Table 10.2, the between-groups sum of squares, degrees of freedom and mean square are 12.00, 2 and 6.00, respectively. To calculate the within-groups variance estimate, we subtract the group mean from the individual score of the case in that group. We then square these deviations and add them together to form the sum of squares. We divide this sum of squares by the within-groups degrees of freedom to form the mean square or variance estimate. The within-groups degrees of freedom is the number of cases minus the number of groups. This is the same as subtracting 1 from the number of cases in a group and summing this difference across the groups. As presented in Table 10.2, the within-groups sum of squares, degrees of freedom and mean square are 6.00, 6 and 1.00, respectively. The F-test is the between-groups variance estimate divided by the within-groups variance estimate, which is 6.00 (6.00/1.00 = 6.00). We can look up
Table 10.3  A one-way analysis of variance table

Source of variation   Sum of squares (SS)   Degrees of freedom (df)   Mean square (MS)   F      p
Between-groups        12.00                 2                         6.00               6.00   .05
Within-groups         6.00                  6                         1.00
Total                 18.00                 8
the statistical significance of this F value in a table of its critical values, although this will be provided by SPSS. With 2 degrees of freedom in the numerator (for the between-groups variance estimate) and 6 degrees of freedom in the denominator (for the within-groups variance estimate), F has to be 5.15 or larger to be statistically significant at the two-tailed .05 level. As 6.00 is bigger than this critical value, we can conclude that the mean depression scores differ significantly across the three marital status groups. The results of an analysis of variance are often summarized in the kind of table shown in Table 10.3. Note that the total sum of squares and degrees of freedom is the sum of these values for the between- and within-groups sources of variation. The total sum of squares can be calculated independently by subtracting each individual score from the grand mean, squaring the deviation and adding them together. The total degrees of freedom are the number of cases minus 1.
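The whole calculation can be checked in a few lines of Python using the depression scores in Table 10.1. This is my own illustration (scipy and numpy are not tools used in the book), but the sums of squares and the F ratio it produces are those worked out above.

    import numpy as np
    from scipy import stats

    divorced      = np.array([7, 5])
    married       = np.array([2, 3, 4, 3])
    never_married = np.array([5, 3, 4])
    groups = [divorced, married, never_married]

    grand_mean = np.mean(np.concatenate(groups))                              # 4.00
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # 12.00
    ss_within  = sum(((g - g.mean()) ** 2).sum() for g in groups)             # 6.00
    f_by_hand  = (ss_between / 2) / (ss_within / 6)                           # 6.00

    f, p = stats.f_oneway(divorced, married, never_married)
    print(f_by_hand, f, p)   # p is about .037, matching the SPSS output later in the chapter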
Homogeneity of variance
The analysis of variance is based on the assumption that the populations from which the samples are drawn have a normal distribution and equal or homogeneous variances. Statistical tests exist for determining whether the skewness (or asymmetry) and the kurtosis (or flatness) of a distribution of scores differs significantly from zero, and they are described elsewhere (Cramer 1998). If the variances of one or more of the groups are considerably larger than those of the others, the larger variance will increase the size of the within-groups variance, which will reduce the chance of the F ratio being significant. The means of the groups with the smaller variances may differ from each other but this difference is hidden by the inclusion of the groups with the larger variances in the within-groups variance. One test for assessing whether variances are homogeneous is Levene's test, which is simply a one-way analysis of variance on the absolute deviation of each score from the mean for that group. The absolute deviation
scores for our example are either 1 or 0. For instance, the absolute deviation score for the first case is the absolute difference between the group mean of 6 and the individual score of 7, which is 1. The F-test for these absolute deviations is calculated in Table 10.4 and is about 0.61. As 0.61 is less than the critical value of 5.15, there are no significant differences in the means of the absolute deviations of the depression scores and so the variances are homogeneous. Had the variances differed significantly, we could have tried to make them more equal by transforming the original depression scores by, for example, taking the square root or natural logarithm of them.

Comparing means
A significant F value for a factor only consisting of two groups indicates that the means of those two groups differ. Where a factor comprises three or more groups, we need to determine which of the means of the groups differ by comparing two groups at a time. With three groups, we can make three comparisons (Group 1 vs Group 2; Group 1 vs Group 3; and Group 2 vs Group 3). Where we have good grounds before collecting the data for expecting which means would differ, we could test for these differences using an a priori or planned test such as the unrelated t-test using the onetailed level. The t value, degrees of freedom and p value for these three comparisons are shown in Table 10.5. Only the means for the first and second groups differ. The mean depression score for the divorced group (6.00) is significantly higher than that for the married group (3.00). Where we have no good grounds for predicting which means might differ, we could use various post-hoc, a posteriori or unplanned tests to determine which, if any, of the means differed. These tests take into account the fact that the more comparisons we make, the more likely we are to find that some of these comparisons will differ by chance. For example, at the .05 probability level, we would expect one of 20 comparisons to be significant by chance at that level. This figure would rise to five if we made 100 comparisons. One of the most general and versatile post-hoc tests is the Scheffé test, which can be used for groups of differing size. This test is conservative in the sense that it is less likely to find differences to be significant. The Scheffé test is an F-test in which the squared difference between the means of the two groups being compared is divided by the within-groups mean square for all the groups, but which is weighted to reflect the number of the cases in the two groups being compared (n1 and n2): F=
(Group 1 mean − Group 2 mean)² / [within-groups mean square × (n1 + n2)/(n1 × n2)]
The significance of this F-test is evaluated against the appropriate critical value of F, which is weighted or multiplied by the between-groups degrees
Table 10.4  Calculation of Levene's test for a one-way analysis of variance

Cases   Group   Depression   Absolute deviations   Between-groups squared deviations   Within-groups squared deviations
1       1       7            6 − 7 = 1             (1.00 − 0.67)² = 0.11               (1.00 − 1.00)² = 0.00
2       1       5            6 − 5 = 1             (1.00 − 0.67)² = 0.11               (1.00 − 1.00)² = 0.00
Group mean      12/2 = 6.00  2/2 = 1.00
3       2       2            3 − 2 = 1             (0.50 − 0.67)² = 0.03               (0.50 − 1.00)² = 0.25
4       2       3            3 − 3 = 0             (0.50 − 0.67)² = 0.03               (0.50 − 0.00)² = 0.25
5       2       4            3 − 4 = 1             (0.50 − 0.67)² = 0.03               (0.50 − 1.00)² = 0.25
6       2       3            3 − 3 = 0             (0.50 − 0.67)² = 0.03               (0.50 − 0.00)² = 0.25
Group mean      12/4 = 3.00  2/4 = 0.50
7       3       5            4 − 5 = 1             (0.67 − 0.67)² = 0.00               (0.67 − 1.00)² = 0.11
8       3       3            4 − 3 = 1             (0.67 − 0.67)² = 0.00               (0.67 − 1.00)² = 0.11
9       3       4            4 − 4 = 0             (0.67 − 0.67)² = 0.00               (0.67 − 0.00)² = 0.45
Group mean      12/3 = 4.00  2/3 = 0.67
Grand mean      36/9 = 4.00  6/9 = 0.67
Sum of squares (SS)                                0.34                                1.67
Degrees of freedom (df)                            3 − 1 = 2                           9 − 3 = 6
Mean square (MS)                                   0.34/2 = 0.17                       1.67/6 = 0.28
F                                                  0.17/0.28 = 0.61
Table 10.5  Unrelated t- and Scheffé tests comparing three means

Comparisons      t      df   p      Scheffé   df     p
Group 1 vs 2     3.46   4    .026   12.00     2, 6   .037
Group 1 vs 3     1.90   3    ns     4.80      2, 6   ns
Group 2 vs 3     1.46   5    ns     1.72      2, 6   ns
of freedom (i.e. the number of groups minus 1). The .05 critical value of F with 2 degrees of freedom in the numerator and 6 degrees of freedom in the denominator is about 5.15. Consequently, the appropriate .05 critical value of F for this Scheffé test is 10.30 [5.15 × (3 − 1) = 10.30]. Scheffé F for comparing the mean scores of the divorced and the married groups is 12.00:

(6.00 − 3.00)² / [1.00 × (2 + 4)/(2 × 4)] = 3.00²/(1.00 × 0.75) = 9.00/0.75 = 12.00

As 12.00 is larger than 10.30, the means of these two groups differ significantly. The Scheffé F, the degrees of freedom and the p value for these three comparisons are shown in Table 10.5. Only the means of these two groups differ. Although this difference is significant in terms of both the unrelated t-test and the Scheffé test, the exact probability is higher for the Scheffé test (.037) than the two-tailed value of the unrelated t-test (.026), as would be expected for a more conservative test.

General linear model
Analysis of variance can also be computed with multiple regression. Both these techniques can be derived from the general linear model. In multiple regression, the squared multiple correlation (R2) is the proportion of variance in the dependent or criterion variable that is explained by the independent or predictor variables. Variance is the sum of squares divided by the degrees of freedom, which is the number of cases minus 1. As the degrees of freedom are the same for the three sources of variation in the analysis of variance, we can ignore them. The squared multiple correlation can, therefore, be obtained by dividing the between-groups sum of squares by the total sum of squares: R2 =
between-groups sum of squares / total sum of squares
In analysis of variance, this statistic is called eta squared (η2) or the correlation ratio. For our example, the squared multiple correlation or correlation ratio is about .67 (12.00/18.00 = .667).
As mentioned in Chapter 6, one way of calculating the squared multiple correlation in multiple regression is to sum the product of the standardized partial regression coefficient for each predictor and its correlation with the criterion across all the predictors:

R2 = [standardized partial regression coefficient × correlation with the criterion] summed across predictors

Although marital status is a single factor or variable, it is a categorical variable. Consequently, we need to ensure that each group can be identified separately by a predictor. The number of predictors that is required to do this is always one less than the number of categories. So, to identify the three marital status groups we need two predictors. The simplest way of identifying a group is using dummy coding where that group is coded as 1 and the other groups are coded as 0. For example, the divorced group could be coded as 1 and the married and never married groups coded as 0, as shown in Table 10.6. This new variable is known as a dummy variable. In the second dummy variable that is needed to identify the three groups, the married group could be coded as 1 and the divorced and never married groups as 0. Using these two dummy variables, we can see that the divorced group is identified by 1 on the first dummy variable and 0 on the second, the married group by 0 on the first and 1 on the second, and the never married group by 0 on both. If we regress the depression score on these two dummy variables, the standardized partial regression coefficient for the first dummy variable is .59 and that for the second is −.35. The correlation of the depression score with the first dummy variable is .76 and that with the second is −.63.
Table 10.6  Dummy coding of dummy variables for three groups

                             Dummy variables
Cases   Group   Depression   Divorced   Married
1       1       7            1          0
2       1       5            1          0
3       2       2            0          1
4       2       3            0          1
5       2       4            0          1
6       2       3            0          1
7       3       5            0          0
8       3       3            0          0
9       3       4            0          0
Thus, the squared multiple correlation is .67:

(.59 × .76) + (−.35 × −.63) = .45 + .22 = .67

The F-test for the squared multiple correlation, as we saw in Chapter 6, is: F =
[R2 (change)/number of predictors] / [(1 − R2)/(N − number of predictors − 1)]
If we use a sufficient number of decimal places, the F-test for our example is 6.00, which is the same as that for the analysis of variance:

.6667/2 / [(1 − .6667)/(9 − 2 − 1)] = .3334/(.3333/6) = .3334/.0556 = 5.996

With 2 degrees of freedom in the numerator and 6 degrees of freedom in the denominator, F has to be 5.15 or larger to be significant at the .05 level, which it is. There are three reasons why it is useful to know the relationship between multiple regression and analysis of variance. First, categorical variables such as marital status need to be coded as dummy variables and entered as part of the same step in a multiple regression. When entered together in a single step, the proportion of variance accounted for by the dummy variables is the same as the eta squared for an analysis of variance. Second, analysis of variance is often carried out in statistical software using multiple regression. In an analysis of variance with more than one factor and where the numbers of cases in the factors are unequal or disproportionate, the results may differ depending on the method used for entering the variables. An understanding of these differences may help you choose the method most appropriate for your analysis. Third, the calculations involved in an analysis of covariance are easier to explain in terms of multiple regression.

Reporting the results
One succinct way of reporting the results for this example of an analysis of variance is as follows: 'A one-way unrelated analysis of variance found marital status to have a significant effect on depression (F(2,6) = 6.00, p < .05).' If we had good grounds for predicting that depression would be higher in the divorced than in the married group, we would add the following kind of statement: 'As predicted, depression was significantly higher (unrelated t(4) = 3.46, one-tailed p < .05) among the divorced (M = 6.00, SD = 1.41) than the married (M = 3.00, SD = 0.82). None of the other means differed significantly.' If we did not have sound reasons for any predictions, we could add the following: 'The Scheffé test showed that mean depression only differed significantly (Scheffé F(2,6) = 12.00, p < .05) for the divorced and the married
group. Mean depression was higher among the divorced (M = 6.00, SD = 1.41) than the married (M = 3.00, SD = 0.82).’
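The two follow-up tests reported above can also be checked in Python. The sketch below is my own illustration (scipy is not a tool used in the book): the t-test is scipy's standard unrelated t-test, and the Scheffé value is computed directly from the formula given earlier in the chapter, with its probability obtained by comparing F divided by the number of groups minus 1 with the ordinary F distribution, which is equivalent to the weighting of the critical value described above.

    import numpy as np
    from scipy import stats

    divorced, married = np.array([7, 5]), np.array([2, 3, 4, 3])

    # Unrelated t-test assuming equal variances
    t, p_two_tailed = stats.ttest_ind(divorced, married, equal_var=True)
    print(t, p_two_tailed)                  # about 3.46 and .026, as in Table 10.5

    # Scheffé F for the same comparison, using the within-groups mean square of 1.00
    ms_within, n1, n2 = 1.00, len(divorced), len(married)
    scheffe_f = (divorced.mean() - married.mean()) ** 2 / (ms_within * (n1 + n2) / (n1 * n2))
    p_scheffe = stats.f.sf(scheffe_f / 2, 2, 6)
    print(scheffe_f, p_scheffe)             # 12.00 and about .037, as in Table 10.5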
SPSS Windows procedure
Use the following procedure to conduct the analysis of variance described in this chapter. Enter the data into the Data Editor as shown in Box 10.1. The first column, labelled group, contains the code for the three groups. The second column, labelled depress, holds the depression scores. Label the three groups as Divorced, Married and Never married, respectively. Select Analyze on the horizontal menu bar near the top of the window, General Linear Model from the drop-down menu and then Univariate. . ., which opens the Univariate dialog box in Box 10.2. Select depress and then the first ▶ button to put this variable in the box under Dependent: (variable). Select group and then the second ▶ button to put this variable in the box under Fixed Factor(s):. To carry out post-hoc tests, select Post Hoc. . ., which opens the Univariate: Post Hoc Multiple Comparisons for Observed Means sub-dialog box in Box 10.3. Select group under the Factor(s): and then the ▶ button to put this variable in the box under Post Hoc Tests for:. To carry out a Scheffé test, select Scheffe and then Continue to return to the Univariate dialog box.
Box 10.1  Data in the Data Editor for a one-way analysis of variance
Box 10.2  Univariate dialog box
Select Options. . . to open the Univariate: Options sub-dialog box in Box 10.4. Select Descriptive statistics to display the means and standard deviations of the three groups. Select Homogeneity tests to carry out Levene’s test for homogeneity of variances. Select Continue to close this sub-dialog box and to return to the main dialog box. Select OK to run this analysis. To run the regression analysis, enter the codes for the two dummy variables as shown in Table 10.6 into the third and fourth columns of the Data Editor file. Regress depress onto these two variables.
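The regression route can also be followed outside SPSS. The sketch below is an illustration of mine using statsmodels (not a package used in the book) with the dummy coding of Table 10.6; it reproduces the R squared of .667, the F of 6.00 and the unstandardized coefficients shown later in Table 10.10.

    import numpy as np
    import statsmodels.api as sm

    depress  = np.array([7, 5, 2, 3, 4, 3, 5, 3, 4])
    divorced = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0])   # first dummy variable
    married  = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0])   # second dummy variable

    X = sm.add_constant(np.column_stack([divorced, married]))
    ols = sm.OLS(depress, X).fit()
    print(ols.rsquared, ols.fvalue, ols.f_pvalue)   # about .667, 6.00 and .037
    print(ols.params)                               # constant 4.00, B1 = 2.00, B2 = -1.00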
SPSS output
The table of the means and standard deviations for the three groups is presented first, which is not shown here. After this is the table for Levene’s
Box 10.3  Univariate: Post Hoc Multiple Comparisons for Observed Means sub-dialog box
test, as displayed in Table 10.7. The variances do not differ significantly as the F value of .600 has a significance level (.579) of greater than .05. The analysis of variance table is shown next as presented in Table 10.8. Only three of the six sources listed are relevant to us and these are labelled GROUP, Error and Corrected Total, which correspond to those presented in Table 10.3. The F value of 6.000 has a significance level of .037, which is significant as it is less than .05. The results of the Scheffé tests are shown last. The first of the two tables is presented in Table 10.9 as it is the more relevant to us. The comparisons in this table are repeated twice. For example, the first line compares the means of the Divorced and the Married groups, while the third line compares the means of the Married and the Divorced groups. Only the significance of the comparisons is presented, which is .037 for this comparison. The F values and the degrees of freedom can be worked out as described earlier. The table with the standardized partial regression coefficients for the multiple regression with the two dummy variables is shown in Table 10.10. The first coefficient is .588 and the second is −.351.
Box 10.4  Univariate: Options sub-dialog box
Table 10.7  SPSS output of Levene's test for the one-way analysis of variance

Levene's Test of Equality of Error Variances (a)
Dependent Variable: DEPRESS
F      df1   df2   Sig.
.600   2     6     .579
Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a  Design: Intercept + GROUP.
Table 10.8  SPSS output of the one-way analysis of variance table

Tests of Between-Subjects Effects
Dependent Variable: DEPRESS
Source            Type III Sum of Squares   df   Mean Square   F         Sig.
Corrected Model   12.000 (a)                2    6.000         6.000     .037
Intercept         156.000                   1    156.000       156.000   .000
GROUP             12.000                    2    6.000         6.000     .037
Error             6.000                     6    1.000
Total             162.000                   9
Corrected Total   18.000                    8
a  R Squared = .667 (Adjusted R Squared = .556).
Table 10.9  SPSS output of the Scheffé tests for the one-way analysis of variance

Multiple Comparisons
Dependent Variable: DEPRESS
Scheffé
                                                                       95% Confidence Interval
(I) GROUP   (J) GROUP   Mean Difference (I − J)   Std. Error   Sig.    Lower Bound   Upper Bound
1           2           3.00*                     .866         .037    .22           5.78
            3           2.00                      .913         .171    −.93          4.93
2           1           −3.00*                    .866         .037    −5.78         −.22
            3           −1.00                     .764         .471    −3.45         1.45
3           1           −2.00                     .913         .171    −4.93         .93
            2           1.00                      .764         .471    −1.45         3.45
Based on observed means.
* The mean difference is significant at the .05 level.
Table 10.10  SPSS output of the standardized partial regression coefficients for the two dummy variables

Coefficients (a)
                  Unstandardized Coefficients   Standardized Coefficients
Model             B         Std. Error          Beta                        t        Sig.
1   (Constant)    4.000     .577                                            6.928    .000
    D1            2.000     .913                .588                        2.191    .071
    D2            −1.000    .764                −.351                       −1.309   .238
a  Dependent Variable: DEPRESS.
Recommended further reading

Cohen, J. (1968) Multiple regression as a general data-analytic system. Psychological Bulletin, 70: 426–43. This paper demonstrates how a one-way analysis of variance can be computed with multiple regression.

Cramer, D. (1998) Fundamental Statistics for Social Research: Step-by-Step Calculations and Computer Techniques Using SPSS for Windows. London: Routledge. Using a different example, chapter 5 covers much of the material presented here.

Diekhoff, G. (1992) Statistics for the Social and Behavioral Sciences. Dubuque, IA: Wm. C. Brown. Chapter 8 offers a concise and very clear account of a one-way analysis of variance.

Pedhazur, E.J. (1982) Multiple Regression in Behavioral Research: Explanation and Prediction, 2nd edn. New York: Holt, Rinehart & Winston. Although fairly technical, chapter 9 explains the Scheffé test and shows how a one-way analysis of variance is related to and can be calculated with multiple regression.

SPSS Inc. (2002) SPSS Base 11.0 User's Guide Package. Upper Saddle River, NJ: Prentice-Hall. Provides a detailed commentary on the output produced by SPSS 11.0 as well as a useful introduction to analysis of variance.
11
Unrelated one-way analysis of covariance
In an analysis of covariance (ANCOVA), the variance in the dependent variable that is shared with one or more quantitative variables is removed before determining whether the remaining variance can be attributed to one or more factors and their interactions. Shared variance is known as covariance. Quantitative variables which share variance with the dependent variable and which are removed in this way are called covariates. In other words, covariates are correlated with the dependent variable. When the means of a covariate are the same for the different levels or groups in a factor, the means of the dependent variable for those different levels do not have to be adjusted for that covariate. In these circumstances, the variance that is attributable to a factor will not differ from that in an analysis of variance, but the error term will be smaller than in an analysis of variance as some of the error variance will now be attributed to the covariate. Consequently, the variance attributed to a factor is more likely to be statistically significant in an analysis of covariance than in an analysis of variance, indicating that the means of the groups are more likely to differ. Where the means of the covariate are not the same for the different levels of a factor, the means of the dependent variable have to be adjusted for the covariate. The greater the difference in the means of the covariate, the greater the adjustment is to the means of the dependent variable. As a consequence of this adjustment, adjusted means may differ less than unadjusted means. Furthermore, the order of the adjusted means in terms
of their size may differ from that of the unadjusted means. An extreme example of this is where the group with the highest unadjusted mean has the lowest adjusted mean. The means of the covariate are most likely to be the same in a true experiment where cases or participants are randomly assigned to the different levels in a factor. Random assignment is used to ensure that the cases are unlikely to differ in any way other than according to the level or treatment to which they have been assigned. For example, patients taking part in an experiment to evaluate the effectiveness of different treatments for depression are likely to differ in terms of the severity of their depression, with some patients being more depressed than others. Patients who are more depressed before treatment may be less responsive to treatment and so may show less improvement than less depressed patients. Alternatively, they may show more improvement because they have more room for improvement. Consequently, the results of such a study will be more difficult to interpret if patients are not similar in terms of the severity of their depression before treatment begins. If patients are similar in terms of their depression before treatment and if depression before treatment is positively correlated with depression immediately after treatment has ended, analysis of covariance can be used to provide a more sensitive test of the effectiveness of the treatments by controlling for variations in depression before treatment within the groups of patients assigned to the different treatments. However, random assignment does not always ensure that cases are similar in all respects, especially if the number of cases used is relatively small. For example, it is possible that the mean depression score before treatment may be higher in some groups than in others. There are various ways of handling this, none of which is entirely satisfactory. Many social scientists believe that the preferred method for pre-treatment differences is analysis of covariance (e.g. Stevens 1996). In this analysis, the pre-treatment means are assumed to be the same and the post-treatment means are adjusted to take account of the original pre-treatment differences, which makes the interpretation of the results more difficult. The means of covariates for cases that have not been randomly assigned to the different levels in a factor are more likely to differ across groups. For example, the divorced, the married and the never married are likely to differ in various ways, one of which may be age. The divorced may in general be older than the married who, in turn, may be older than the never married. If age is found to be correlated with the dependent variable of depression, age may be controlled by carrying out an analysis of covariance. Analysis of covariance is frequently used in this way, although it should be remembered that this method assumes that the means of the covariate are the same across the groups. We will illustrate a one-way analysis of covariance using the same data as in Table 10.1 so the results of the analysis of covariance can be compared
with those of an analysis of variance. As before, the single factor consists of the three levels or groups of the divorced, the married and the never married. We will have one covariate in this analysis, which we will call physical ill-health. Like the dependent variable of depression, this variable has nine points in which higher scores indicate greater ill-health. Although it is highly unlikely in practice that the means of the covariate will be exactly the same for the levels of a factor, we have generated one set of data where this is the case and another set where the means differ in order to compare these two situations. These two sets of data are shown in Table 11.1 together with the data for the dependent variable and the means for the three groups and the whole sample for all three variables. The means for ill-health for the first example are all 5, whereas for the second example they are 8, 5 and 3, respectively. To carry out an analysis of covariance, the covariate has to be correlated with the dependent variable, which it is in both these examples. The correlation is .39 where the covariate means are equal and .31 where they are unequal. The results of an analysis of covariance can be summarized in table form in the same way as those for an analysis of variance as shown in Table 10.3. It is often useful, however, to present the variation or sum of squares in the dependent variable that is accounted for by the covariate as displayed in Table 11.2. The main summary statistics for the analysis of
Table 11.1  Raw and mean scores for analysis of covariance

                                       Ill-health
Cases        Groups   Depression   Equal means   Unequal means
1            1        7            6             7
2            1        5            4             9
Group mean            12/2 = 6     10/2 = 5      16/2 = 8
3            2        2            4             5
4            2        3            6             6
5            2        4            5             4
6            2        3            5             5
Group mean            12/4 = 3     20/4 = 5      20/4 = 5
7            3        3            5             4
8            3        4            4             2
9            3        5            6             3
Group mean            12/3 = 4     15/3 = 5      9/3 = 3
Grand mean            36/9 = 4     45/9 = 5      45/9 = 5

Table 11.2  Comparison of analysis of variance and analysis of covariance

                                        Analysis of covariance
            Analysis of variance        Equal means                  Unequal means
Source      SS      df   MS     F       SS      df   MS     F        SS      df   MS     F
Covariate                               2.67    1    2.67   4.00     2.67    1    2.67   4.00
Groups      12.00   2    6.00   6.00*   12.00   2    6.00   9.00*    12.89   2    6.44   9.67*
Error       6.00    6    1.00           3.33    5    0.67            3.33    5    0.67
Total       18.00   8                   18.00   8                    18.00   8

* p < .05.
variance and the two analyses of covariance for the data in Table 11.1 are presented for comparison in Table 11.2. We can see that the F ratio for the analysis of covariance for equal means (9.00) is larger than that for the analysis of variance (6.00) because a large proportion of the error sum of squares in the dependent variable (2.67/6.00 = .45) is shared with the covariate. Removing this shared sum of squares in the analysis of covariance makes the error sum of squares smaller (6.00 − 2.67 = 3.33), thereby decreasing the mean square (3.33/5 = 0.67) and increasing the F ratio. Note that one degree of freedom has been lost from the error term in the analysis of covariance. This means that the F ratio has to be slightly bigger for the analysis of covariance than the analysis of variance to be statistically significant. With 2 degrees of freedom in the numerator and 5 degrees of freedom in the denominator, the F ratio has to be 5.79 or more to be statistically significant at the .05 level rather than 5.15 or more as with the analysis of variance. However, the increase in the size of the F ratio is substantially greater than the increase in the critical value of F, so that the probability of this result being due to chance is smaller for the analysis of covariance (p = .022) than the analysis of variance (p = .037). Because the means of the covariate are exactly the same for the three groups, the means of the dependent variable will remain unchanged with the divorced having the highest depression (6.00) followed by the never married (4.00) and the married (3.00). The F ratio for the analysis of covariance for unequal means (9.67) is slightly larger than that for equal means (9.00) and so is a little more significant (p = .019). Because the means of the covariate are not exactly the same for the three groups, the means of the dependent variable have to be adjusted. The adjusted mean is highest for the divorced (8.00) followed by the married (3.00) and the never married (2.67). When the means are adjusted for differences in the covariate, the never married (2.67) are slightly less depressed than the married (3.00), whereas when the means are not adjusted the never married (4.00) are somewhat more depressed than the married (3.00). In other words, the order of the size of the adjusted means differs somewhat from that for the unadjusted means. One way of calculating these adjusted means will be described later.
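These figures can be reproduced with ordinary regression in Python. The sketch below is my own illustration (statsmodels is an assumption, not a package used in the book) using the Table 11.1 data with the unequal covariate means and the dummy coding introduced in the previous chapter.

    import numpy as np
    import statsmodels.api as sm

    depress   = np.array([7, 5, 2, 3, 4, 3, 3, 4, 5])
    illhealth = np.array([7, 9, 5, 6, 4, 5, 4, 2, 3])   # covariate, unequal means
    divorced  = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0])
    married   = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0])

    full     = sm.OLS(depress, sm.add_constant(np.column_stack([illhealth, divorced, married]))).fit()
    cov_only = sm.OLS(depress, sm.add_constant(illhealth)).fit()

    ss_total  = ((depress - depress.mean()) ** 2).sum()             # 18.00
    ss_groups = (full.rsquared - cov_only.rsquared) * ss_total      # about 12.89
    ss_error  = (1 - full.rsquared) * ss_total                      # about 3.33
    f_groups  = (ss_groups / 2) / (ss_error / 5)                    # about 9.67
    print(ss_groups, ss_error, f_groups)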
Comparing means
As with analysis of variance, where a factor consists of more than two groups, further tests have to be carried out to determine which group means differ. One test for doing this is Fisher’s Protected LSD (Least Significant Difference) test (Huitema, 1980) which has the following formula:
t = (adjusted group 1 mean − adjusted group 2 mean) / √{adjusted mean square error × [(1/n1 + 1/n2) + (covariate group 1 mean − covariate group 2 mean)²/covariate error sum of squares]}
We will illustrate the use of this formula comparing the divorced and the married where the means are unequal. The covariate error sum of squares is 6.00 for both unequal and equal means. Substituting the appropriate values into the formula we find that t is 4.07:

t = (8.00 − 3.00) / √{0.67 × [(1/2 + 1/4) + (8.00 − 5.00)²/6.00]} = 5.00/√[0.67 × (0.75 + 1.50)] = 5.00/√1.51 = 5.00/1.23 = 4.07

The degrees of freedom for this t-test are the degrees of freedom for the adjusted mean square error, which is the number of cases minus the number of groups minus 1. In this case, it is 5 (9 − 3 − 1 = 5). Where we have no strong grounds for predicting which means would differ, we would use the two-tailed .05 level, whereas if we have good reasons for predicting which means would be bigger we would use the one-tailed .05 level. With 5 degrees of freedom, the .05 critical value of t is 2.58 for the two-tailed test and 2.02 for the one-tailed test. As the t value of 4.07 is larger than both these critical values, we would conclude that the divorced are significantly more depressed than the married. The results of the two-tailed test for all three comparisons for the covariate with equal and unequal means are shown in Table 11.3. Unlike the analyses carried out in the previous chapter, which showed that the divorced were not significantly more depressed than the never married, these results indicate that the divorced are also significantly more depressed than the never married.
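The arithmetic of this comparison is easy to check with a few lines of Python (a simple transcription of the formula above, using only values already given in the text):

    import math

    adj_mean_1, adj_mean_2 = 8.00, 3.00   # adjusted depression means, divorced vs married
    cov_mean_1, cov_mean_2 = 8.00, 5.00   # covariate (ill-health) means
    n1, n2 = 2, 4
    ms_error_adj = 0.67                   # adjusted error mean square
    ss_cov_error = 6.00                   # covariate error sum of squares

    se = math.sqrt(ms_error_adj * ((1/n1 + 1/n2) +
                   (cov_mean_1 - cov_mean_2) ** 2 / ss_cov_error))
    t = (adj_mean_1 - adj_mean_2) / se
    print(round(t, 2))                    # about 4.07 on 5 degrees of freedom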
Table 11.3  Comparing pairs of means with Fisher's Protected LSD test

                 Equal means          Unequal means
Comparisons      t      df   p        t      df   p
Group 1 vs 2     4.23   5    .01      4.07   5    .01
Group 1 vs 3     2.68   5    .05      2.91   5    .05
Group 2 vs 3     1.60   5    ns       0.36   5    ns
Homogeneity of variance and regression
Like analysis of variance, analysis of covariance is based on the assumption that the variances within groups are similar and do not differ significantly. If they do differ, transforming the scores may reduce any differences. One test for assessing whether variances are homogeneous is Levene’s test, which is simply a one-way analysis of variance on the absolute deviation of each score from the score predicted by the group factor and the covariate. The degrees of freedom for the between-groups sum of squares is the number of groups minus 1, while those for the within-groups sum of squares is the number of cases minus the number of groups. For this example, the degrees of freedom for the between-groups sum of squares are 2 (3 − 1 = 2) and for the within-groups sum of squares 6 (9 − 3 = 6). The .05 critical value of F for these degrees of freedom is 5.15. As the F value for this test (0.516) is less than 5.15, this assumption is met. The method for calculating these absolute deviations will be described later. Another assumption that needs to be met for analysis of covariance is that the linear relationship between the dependent variable and the covariate within a group does not differ significantly between the groups. If it does differ for one or more groups, the adjustment made for those groups will not be appropriate. This linear relationship is expressed as a regression coefficient and the assumption is known as homogeneity of regression. It is tested by an F test, in which the adjusted within-groups sum of squares is divided into the between-regressions sum of squares and the remaining sum of squares. The F test is the between-regressions mean square divided by the remainder mean square. The between-regressions mean square is the between-regressions sum of squares divided by its degrees of freedom, which is the number of groups minus 1. The remaining mean square is the remaining sum of squares divided by its degrees of freedom, which is the number of cases minus twice the number of groups. For this example, the degrees of freedom for the between-regressions sum of squares are 2 (3 − 1 = 2) and for the remaining sum of squares 3 [9 − (2 × 3) = 3]. The .05 critical value of F for these degrees of freedom is 9.55. As the F value for the covariate with equal (0.167) and unequal means (0.167) is less than 9.55, this assumption is met. One way of calculating this F ratio will be described shortly.
Analysis of covariance and multiple regression
Perhaps the simplest way of deriving the major statistics for an analysis of covariance is through multiple regression. As we saw in the last chapter, the factor of marital status can be represented by the two dummy variables in Table 10.6. The covariate remains as it is. To determine the F-test for the homogeneity of regression, we need to create two further dummy variables
which represent an interaction between the covariate and the factor. If the regression coefficient differs significantly between the groups, a significant proportion of the variance will be accounted for by these two dummy variables. These two dummy variables are created by multiplying each of the first two dummy variables by the covariate, as shown in Table 11.4. Those not wishing to follow the detailed calculations involved for this analysis should proceed to the next section on adjusted means. As mentioned in the last chapter, the F-test can be derived from the following equation: F=
[R2 (change)/number of predictors] / [(1 − R2)/(N − number of predictors − 1)]
The squared multiple correlation (R2) represents the proportion of the total variance or sum of squares accounted for in the dependent variable. If we know the total sum of squares, we can work out the sum of squares for the different terms in the analysis of covariance table. In the last chapter, we calculated the total sum of squares for depression to be 18.00. If we know the squared multiple correlation for the covariate and for the covariate and the first two dummy variables, we can work out the F ratios, the sum of squares and the mean square in Table 11.2. If we know the squared multiple correlation for the covariate and the four dummy variables, we can calculate the F ratio for the homogeneity of regression. The squared multiple correlations for the unequal means example for these three sets of predictors are .099, .815 and .833, respectively. The proportion of variance accounted for by the groups is the change or
Table 11.4  Dummy variables for analysis of covariance with covariate for unequal means

                                                 Dummy variables
Cases   Group   Depression   Covariate   Divorced   Married   DivXCov   MarXCov
1       1       7            7           1          0         7         0
2       1       5            9           1          0         9         0
3       2       2            5           0          1         0         5
4       2       3            6           0          1         0         6
5       2       4            4           0          1         0         4
6       2       3            5           0          1         0         5
7       3       3            4           0          0         0         0
8       3       4            2           0          0         0         0
9       3       5            3           0          0         0         0
difference between the squared multiple correlation for the covariate and the two dummy variables (.815) and that for the covariate (.099), which is .716 (.815 − .099 = .716). The proportion of variance that is left over or is considered error is the squared multiple correlation for the covariate and the two dummy variables (.815) subtracted from 1, which is .185 (1 − .815 = .185). In other words, the error term for marital status on its own (1 − .716 = .284) is reduced by taking account of the covariate (1 − .815 = .185). The number of predictors involved in the change in the squared multiple correlation is the two dummy variables representing the three groups. The number of predictors involved in the error are the covariate and the two dummy variables. Consequently, the degrees of freedom are 2 for the numerator and 5 (9 − 3 − 1 = 5) for the denominator. Substituting these figures into the formula for this F ratio gives an F value of 9.68, which is similar to the figure of 9.67 in Table 11.2:

F = (.716/2)/(.185/5) = .358/.037 = 9.68

To calculate the sum of squares for the groups, we simply multiply the change in the squared multiple correlation (.716) by the total sum of squares for the dependent variable (18.00), which gives 12.888 (.716 × 18.00 = 12.888), which is similar to that of 12.89 in Table 11.2. To work out the error sum of squares, we multiply the squared multiple correlation representing error (.185) by the total sum of squares for the dependent variable (18.00), which gives 3.33 (.185 × 18.00 = 3.33), which is the same as that in Table 11.2.

The F ratio for the homogeneity of regression involves the proportion of the variance or sum of squares accounted for by the interaction between the covariate and the two dummy variables, which is represented by the last two dummy variables. This proportion is the change or difference between the squared multiple correlation for the covariate and the four dummy variables (.833) and that for the covariate and the first two dummy variables (.815), which is .018 (.833 − .815 = .018). The proportion of variance left over is the squared multiple correlation for the covariate and the four dummy variables (.833) subtracted from 1, which gives .167 (1 − .833 = .167). The number of predictors involved in the change in the squared multiple correlation is the two dummy variables representing the interaction between the covariate and the three groups. The number of predictors involved in the remaining variance is the covariate and the four dummy variables. Consequently, the degrees of freedom are 2 for the numerator and 3 for the denominator (9 − 5 − 1 = 3). Substituting these values in the formula for the F ratio gives an F value of 0.161, which is similar to that of 0.167 previously given:

F = (.018/2)/(.167/3) = .009/.056 = 0.161
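The squared multiple correlations and F ratios used in these calculations can also be reproduced outside SPSS. The sketch below is not part of the original text; it assumes numpy is available and uses the scores and dummy codes of Table 11.4.

```python
# Hierarchical regressions for the analysis of covariance via R-squared changes.
import numpy as np

depress = np.array([7, 5, 2, 3, 4, 3, 3, 4, 5], dtype=float)
covar   = np.array([7, 9, 5, 6, 4, 5, 4, 2, 3], dtype=float)   # ill-health covariate
div     = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float)   # divorced dummy
mar     = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0], dtype=float)   # married dummy

def r_squared(*predictors):
    X = np.column_stack([np.ones_like(depress)] + list(predictors))
    fitted = X @ np.linalg.lstsq(X, depress, rcond=None)[0]
    return 1 - np.sum((depress - fitted) ** 2) / np.sum((depress - depress.mean()) ** 2)

r2_cov    = r_squared(covar)                                      # about .099
r2_groups = r_squared(covar, div, mar)                            # about .815
r2_all    = r_squared(covar, div, mar, div * covar, mar * covar)  # about .833

# F for the group factor and for homogeneity of regression, as in the text
f_groups = ((r2_groups - r2_cov) / 2) / ((1 - r2_groups) / (9 - 3 - 1))   # about 9.67
f_homog  = ((r2_all - r2_groups) / 2) / ((1 - r2_all) / (9 - 5 - 1))      # about 0.17
print(round(r2_cov, 3), round(r2_groups, 3), round(r2_all, 3))
print(round(f_groups, 2), round(f_homog, 2))
```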
The F ratio for the covariate is often omitted from analysis of covariance tables but has been included in Table 11.2 to show the way in which the error term has been reduced. This F ratio determines whether the proportion of variance that is attributable to the covariate after the group factor has been taken into account is significant. This proportion is the change or difference between the squared multiple correlation for the covariate and the first two dummy variables (.815) and that for the first two dummy variables (.667), which is .148 (.815 − .667 = .148). The proportion of variance left over is the squared multiple correlation for the covariate and the first two dummy variables (.815) subtracted from 1, which gives .185 (1 − .815 = .185). The degrees of freedom is 1 (the covariate) for the numerator and 5 for the denominator (9 − 3 − 1 = 5). Substituting these values in the formula for the F ratio gives an F value of 4.00, which is the same as that in Table 11.2:

F = (.148/1)/(.185/5) = .148/.037 = 4.00

Note that when the means of the covariate are exactly the same, the sum of squares for the group factor (12.00) is the same as that for the analysis of variance (12.00) because the means of the dependent variable are unchanged. When the means of the covariate are not exactly the same, the sum of squares for the groups factor can be larger or smaller than that for the analysis of variance. In this case, it is slightly larger (12.89). The total sum of squares is that of the dependent variable. When the covariate means are unequal, these separate sums of squares when added together will not be the same as the total.
Adjusted means
One way of working out the adjusted mean for each group is to use the following formula: adjusted group mean = unadjusted group mean − [covariate unstandardized regression coefficient × (covariate group mean − covariate grand mean)] The unstandardized regression coefficient for the covariate is −0.67. The group and the grand means for the dependent variable and the covariate are presented in Table 11.1. Substituting the appropriate values in this formula for the divorced gives an adjusted mean of about 8.01, which, given rounding error, is very similar to that previously presented: 6.00 − [−0.67 × (8.00 − 5.00)] = 6.00 − (−2.01) = 8.01
An alternative and less simple way of computing the adjusted group mean is with the following general formula adjusted group mean = (dummy variable 1 code × its unstandardized regression coefficient) + . . . (covariate unstandardized regression coefficient × covariate grand mean) + constant The unstandardized regression coefficients for the two dummy variables representing the factor are 5.33 and 0.33, respectively. The constant is 6.00. Inserting the relevant values for the divorced gives an adjusted mean of about 7.98, which, with rounding error, is very similar to that of 8.00: (5.33 × 1) + (0.33 × 0) + (−0.67 × 5.00) + 6.00 = 5.33 + 0 + (−3.35) + 6.00 = 7.98 It should be clearer from this latter formula that the calculation of the adjusted group mean only requires knowing what the covariate grand mean is and not the covariate group mean. In other words, the calculation of the adjusted means assumes that the covariate means are the same.
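Both routes to the adjusted means can be scripted in a few lines. The following sketch is not from the book; it uses the rounded coefficients quoted above and the group means implied by Table 11.4, so the results agree with the SPSS values only to rounding error.

```python
# Adjusted group means computed in the two ways described above.
b_cov, b_div, b_mar, constant = -0.667, 5.333, 0.333, 6.000
covariate_grand_mean = 5.00

# First formula: unadjusted group mean minus the covariate adjustment
dep_means = {'divorced': 6.00, 'married': 3.00, 'never married': 4.00}
cov_means = {'divorced': 8.00, 'married': 5.00, 'never married': 3.00}
for group in dep_means:
    adjusted = dep_means[group] - b_cov * (cov_means[group] - covariate_grand_mean)
    print(group, round(adjusted, 2))

# Second formula: dummy codes times coefficients, covariate at its grand mean
dummy_codes = {'divorced': (1, 0), 'married': (0, 1), 'never married': (0, 0)}
for group, (d1, d2) in dummy_codes.items():
    adjusted = d1 * b_div + d2 * b_mar + b_cov * covariate_grand_mean + constant
    print(group, round(adjusted, 2))
```

Both loops print values close to the adjusted means of 8.00, 3.00 and 2.67 shown in Table 11.8.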
Absolute deviations for Levene’s test
To obtain the absolute deviations for Levene’s test, we need to work out the predicted score for each individual, based on the group to which they belong and their score on the covariate, and to subtract this predicted score from their actual score. The predicted depression score is the sum of the constant of the regression equation and the product of the unstandardized regression coefficient and the value for each of the predictors in the regression equation. In this example, the regression equation consists of the two dummy variables and the covariate. The unstandardized regression coefficients for these three variables are 5.333, 0.333 and −0.667 respectively, while the constant is 6.000. Inserting the relevant values into this equation, we find that the predicted score for the first case is about 6.66: 6.000 + (5.333 × 1) + (0.333 × 0) + (−0.667 × 7) = 6.000 + 5.333 + 0 − 4.669 = 6.664 As this individual’s depression score is 7, the absolute deviation for this person is about 0.34. The depression score, the predicted score and the absolute deviation for each of the nine individuals are shown in Table 11.5. Levene’s test is a one-way analysis of variance on these absolute deviations, which gives an F ratio of about .516.
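The whole procedure can be scripted directly. The sketch below is not part of the original text; it assumes numpy and scipy are available and uses the rounded regression coefficients quoted above.

```python
# Levene's test as a one-way ANOVA on absolute deviations from the predicted scores.
import numpy as np
from scipy.stats import f_oneway

depress = np.array([7, 5, 2, 3, 4, 3, 3, 4, 5], dtype=float)
covar   = np.array([7, 9, 5, 6, 4, 5, 4, 2, 3], dtype=float)
div     = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float)
mar     = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0], dtype=float)
group   = np.array([1, 1, 2, 2, 2, 2, 3, 3, 3])

# Predicted scores from the regression equation quoted in the text
predicted = 6.000 + 5.333 * div + 0.333 * mar - 0.667 * covar
abs_dev = np.abs(depress - predicted)

F, p = f_oneway(abs_dev[group == 1], abs_dev[group == 2], abs_dev[group == 3])
print(round(F, 2), round(p, 3))   # F about 0.52, p about .62
```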
Table 11.5  Actual, predicted and absolute deviation scores for depression

Cases  Group  Actual  Predicted  Absolute deviation
1      1      7       6.66       7 − 6.66 = 0.34
2      1      5       5.33       5 − 5.33 = 0.33
3      2      2       3.00       2 − 3.00 = 1.00
4      2      3       2.33       3 − 2.33 = 0.67
5      2      4       3.67       4 − 3.67 = 0.33
6      2      3       3.00       3 − 3.00 = 0.00
7      3      3       3.33       3 − 3.33 = 0.33
8      3      4       4.67       4 − 4.67 = 0.67
9      3      5       4.00       5 − 4.00 = 1.00
Reporting the results
We will illustrate one concise way of writing up the results of this analysis for the covariate with unequal means, as this is more likely to be the case. Such a report may take the following form: ‘To control for the effects of physical ill-health, which was correlated (r = .31, p = ns) with depression, a one-way analysis of covariance was carried out. Marital status had a significant effect on depression (F2,5 = 9.67, p < .05).’ If we did not have sound reasons for predicting any differences, we could add the following: ‘Fisher’s Protected LSD test showed that mean depression was significantly higher in the divorced (adjusted M = 8.00) than in either the married (adjusted M = 3.00, t5 = 4.07, two-tailed p < .01) or the never married (adjusted M = 2.67, t5 = 2.91, two-tailed p < .05).’
SPSS Windows procedure
To carry out the analysis of covariance on the covariate with unequal means, use the following procedure. Enter the data into the Data Editor as shown in Box 11.1. As the data in the first two columns are the same as those used in the previous chapter, we could retrieve this data file and add the values in the third column, which we could label illhealt. Select Analyze on the horizontal menu bar near the top of the window, General Linear Model from the drop-down menu and then Univariate. . ., which opens the Univariate dialog box in Box 10.2.
Box 11.1
Data in the Data Editor for a one-way analysis of covariance
Select depress and then the first 䉴 button to put this variable in the box under Dependent: (variable). Select group and then the second 䉴 button to put this variable in the box under Fixed Factor(s):. Select illhealt and then the fourth 䉴 button to put this variable in the box under Covariate(s):. Select Options. . . to open the Univariate: Options sub-dialog box in Box 10.4. Select Descriptive statistics to display the means and standard deviations of the three groups. Select Homogeneity tests to carry out Levene’s test for homogeneity of variances. Select group in the box under Factor(s) and Factor Interactions: and then the 䉴 button to put this variable in the box under Display Means for: to produce the adjusted means. Select Compare main effects and LSD (none) in the box under Confidence interval adjustment to give the significance levels for Fisher’s Protected LSD test. Select Continue to close this sub-dialog box and to return to the main dialog box. Select OK to run this analysis. To test for homogeneity of regression, enter the variables into the Univariate dialog box as above and then select Model. . . to open the Univariate: Model sub-dialog box in Box 11.2. Select Custom. Select group(F) and then the 䉴 button under Build Term(s) to put this variable in the box under Model:.
Select illhealt(C) and then the 䉴 button under Build Term(s) to put this variable in the box under Model:.
Box 11.2
Univariate: Model sub-dialog box
Select group(F) and illhealt(C), interaction under the 䉴 button and then the 䉴 button to put the interaction between these two variables in the box under Model:. Select Continue to close this sub-dialog box and to return to the main dialog box. Select OK to run this analysis. To run the regression analysis, enter the codes for the four dummy variables as shown in Table 11.4 into the fourth to seventh columns of the Data Editor file. Regress depress onto the covariate first, the first two dummy variables second and the last two dummy variables third.

SPSS output
Not all of the SPSS output will be presented and discussed. The results for Levene’s test are displayed in Table 11.6. As the F value of .524 has a significance value of .617, the variances do not differ significantly. The analysis of covariance table is shown in Table 11.7. The values for the Corrected Model, the Intercept and the Total are often not included in such tables, as is the case in Table 11.2. The adjusted means are presented in Table 11.8.
Table 11.6  SPSS output of Levene’s test

Levene’s Test of Equality of Error Variances
Dependent Variable: DEPRESS

F      df1   df2   Sig.
.524   2     6     .617

Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a  Design: Intercept + ILLHEALT + GROUP.
Table 11.7  SPSS output of the analysis of covariance table

Tests of Between-Subjects Effects
Dependent Variable: DEPRESS

Source            Type III Sum of Squares   df   Mean Square   F        Sig.
Corrected Model   14.667a                   3    4.889         7.333    .028
Intercept         12.803                    1    12.803        19.204   .007
ILLHEALT          2.667                     1    2.667         4.000    .102
GROUP             12.889                    2    6.444         9.667    .019
Error             3.333                     5    .667
Total             162.000                   9
Corrected Total   18.000                    8

a  R Squared = .815 (Adjusted R Squared = .704).
Table 11.8  SPSS output of the adjusted means

Estimates
Dependent Variable: DEPRESS

                                     95% Confidence Interval
GROUP   Mean     Std. Error   Lower Bound   Upper Bound
1       8.000a   1.155        5.032         10.968
2       3.000a   .408         1.951         4.049
3       2.667a   .816         .568          4.766

a  Evaluated at covariates appeared in the model: ILLHEALT = 5.00.
Table 11.9  SPSS output for Fisher’s Protected LSD test

Pairwise Comparisons
Dependent Variable: DEPRESS

(I) GROUP   (J) GROUP   Mean Difference (I − J)   Std. Error   Sig.a   Lower Bound   Upper Bound
1           2           5.000*                    1.225        .010    1.852         8.148
1           3           5.333*                    1.826        .033    .640          10.027
2           1           −5.000*                   1.225        .010    −8.148        −1.852
2           3           .333                      .913         .730    −2.013        2.680
3           1           −5.333*                   1.826        .033    −10.027       −.640
3           2           −.333                     .913         .730    −2.680        2.013

Based on estimated marginal means. The Lower and Upper Bounds give the 95% confidence interval for the difference.
* The mean difference is significant at the .05 level.
a  Adjustment for multiple comparisons: Least Significant Difference (equivalent to no adjustments).
The significance levels of Fisher’s Protected LSD test are shown in Table 11.9. For example, the significance level for the comparison of the divorced and the married is .010. The homogeneity of regression test for the analysis of covariance is given by the values for the GROUP * ILLHEALT term in the analysis of covariance table in Table 11.10. The F value of .167 has a significance value of .854. As
Table 11.10  SPSS output for the homogeneity of regression test

Tests of Between-Subjects Effects
Dependent Variable: DEPRESS

Source             Type III Sum of Squares   df   Mean Square   F        Sig.
Corrected Model    15.000a                   5    3.000         3.000    .197
Intercept          12.479                    1    12.479        12.479   .039
GROUP              2.007                     2    1.003         1.003    .464
ILLHEALT           2.667                     1    2.667         2.667    .201
GROUP * ILLHEALT   .333                      2    .167          .167     .854
Error              3.000                     3    1.000
Total              162.000                   9
Corrected Total    18.000                    8

a  R Squared = .833 (Adjusted R Squared = .556).
this value is greater than .05, the F value is not significant, which means that the regression coefficient does not differ significantly between the groups. The squared multiple correlations (R Square) for the covariate (.099), the covariate and the first two dummy variables (.815) and the covariate and all four dummy variables (.833) are shown in Table 11.11. The constant (6.000) and the unstandardized regression coefficients for the covariate (−.667), the first (5.333) and the second (.333) dummy variables are displayed in Table 11.12.
Table 11.11  SPSS output of the squared multiple correlations

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .314a   .099       −.030               1.522
2       .903b   .815       .704                .816
3       .913c   .833       .556                1.000

a  Predictors: (Constant), ILLHEALT.
b  Predictors: (Constant), ILLHEALT, D2, D1.
c  Predictors: (Constant), ILLHEALT, D2, D1, D4, D3.
Table 11.12  SPSS output of the unstandardized regression coefficients for the covariate and the first two dummy variables

Coefficients

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B        Std. Error           Beta                        t        Sig.
(Constant)    6.000    1.106                                            5.427    .003
ILLHEALT      −.667    .333                 −.943                       −2.000   .102
D1            5.333    1.826                1.568                       2.921    .033
D2            .333     .913                 .117                        .365     .730

a  Dependent Variable: DEPRESS.
Recommended further reading Huitema, B.E. (1980) The Analysis of Covariance and Alternatives. New York: Wiley. Chapter 4 is a short and relatively non-technical account of carrying out a one-way analysis of covariance with multiple regression. Pedhazur, E.J. (1982) Multiple Regression in Behavioral Research: Explanation and Prediction, 2nd edn. New York: Holt, Rinehart & Winston. Although fairly technical, chapter 13 shows how a one-way analysis of covariance with three groups can be calculated with multiple regression. Pedhazur, E.J. and Schmelkin, L.P. (1991) Measurement, Design and Analysis: An Integrated Approach. Hillsdale, NJ: Lawrence Erlbaum Associates. The latter part of chapter 21 provides a less technical account of analysis of covariance, using two groups with equal and unequal means on the covariate and shows how it can be calculated with multiple regression. SPSS Inc. (2002) SPSS Base 11.0 User’s Guide Package. Upper Saddle River, NJ: Prentice-Hall. Provides a detailed commentary on the output produced by SPSS 11.0 as well as a useful introduction to factor analysis. Tabachnick, B.G. and Fidell, L.S. (1996) Using Multivariate Statistics, 3rd edn. New York: HarperCollins. Chapter 8 offers a systematic and general account of analysis of covariance, comparing the procedures of four different programs (including SPSS 6.0) and showing how to write up the results of two worked examples.
12
Unrelated two-way analysis of variance
Conducting an analysis of variance with more than one unrelated factor has two major strengths. The first is that it enables the interaction between factors to be examined. An interaction occurs when the mean score of at least one group on one factor varies according to two or more groups on one or more other factors. For example, the mean depression score of women in relation to men may differ significantly according to whether they are married or not. The mean depression score may be significantly higher in never married men than never married women but may not differ significantly between married men and married women. In this case we would have a two-way interaction between the factor of sex and the factor of marital status. If there is no significant interaction, any significant effect for the factors that make up that interaction can be interpreted without taking account of those other factors. For example, if there is no significant interaction between marital status and sex but there is a significant effect for marital status, then we can ignore any effects of sex when talking about differences in depression due to marital status. If, however, there is a significant interaction, a significant effect for a factor making up that interaction needs to be explained in terms of those other factors. For instance, if there is a significant interaction between sex and marital status and there is also a significant effect for marital status, then we should discuss differences in depression in relation to both marital status and sex and not to marital status alone.
The second advantage of an analysis of variance with more than one unrelated factor is that it is a more sensitive test of a factor in that it is more likely to be statistically significant if the error variance is reduced because some of that variance is now accounted for by the other factors and their interactions. For example, the factor of marital status may become statistically significant if the variance attributable to sex and/or its interaction with marital status reduces the error term sufficiently. An analysis of variance with more than one unrelated factor is sometimes called a factorial analysis of variance. The effects due to factors may be referred to as main effects as opposed to interaction effects. We will illustrate the calculation and interpretation of a factorial analysis of variance with the simplest two-way analysis of variance in which both factors only consist of two levels or groups. The first factor of marital status comprises the two groups of the never married and the married. The second factor of sex is made up of the two groups of women and men. The dependent variable is the same 9-point depression scale used in the examples in the previous two chapters. The individual or raw scores of depression for the four groups are shown in Table 12.1 together with their mean scores for the four groups, the two factors and the whole sample. The number of cases in the four groups varies. There are 2 never married women, 6 never married men, 4 married women and 3 married men. If we look at the mean depression score for marital status on its own, it is higher for the never married (6.00) than for the married (3.14). If we do the same for sex, it is higher for men (5.33) than for women (3.67). Finally, if we consider marital status and sex together, it is higher for married women (4.00) than for never married women (3.00) but lower for married men (2.00) than for never married men (7.00). In other words, there appears to be an interaction between marital status and sex in that the mean depression score for marital status appears to depend on the sex of the person. It may be easier to grasp the relationship between two or more factors when it is expressed as a graph as shown in Figure 12.1. The vertical axis or ordinate of the graph represents the dependent variable, which is the mean depression score. Higher points on this vertical axis indicate higher means. The horizontal axis or abscissa depicts the factor of marital status with the left marker representing the never married and the right one the married. The other factor of sex is indicated by two different styles of line. Women are shown by a continuous line with a diamond shape at either end. Men are shown by a dashed line with a square at either end. It does not matter which factor is represented on the horizontal line and which factor by the styles of line. No interaction is indicated when the two lines representing the other factor are more or less parallel. For example, if the line representing men was more or less parallel to the line representing women, there would be no interaction between marital status and sex. An interaction occurs when the two lines are non-parallel, as they are here. Whether this interaction is
Table 12.1  Raw and mean depression scores for an unrelated two-way analysis of variance

                 Women         Men                 Row total
Never married    2, 4          8, 6, 8, 6, 7, 7
  Sum            6             42                  48
  n              2             6                   8
  Mean           3.00          7.00                6.00
Married          3, 5, 4, 4    1, 3, 2
  Sum            16            6                   22
  n              4             3                   7
  Mean           4.00          2.00                3.14
Column total
  Sum            22            48                  70
  n              6             9                   15
  Mean           3.67          5.33                4.67

Figure 12.1  Graph showing an interaction between marital status and sex.
statistically significant depends on whether the F ratio for the interaction in the analysis of variance is statistically significant. A two-way analysis of variance has two main effects and an interaction effect, as shown in the left column of Table 12.2. The F ratio for each effect is the mean square for that effect divided by the mean square for the variation that is not accounted for by any of the three effects. This latter mean square is variously known as the within-groups, error or residual mean square. For example, the F ratio for the interaction is the mean square for the interaction (28.80) divided by the mean square for the error (0.91), which gives 31.65 (28.80/0.91 = 31.65). The degrees of freedom for the interaction effect is one less than the number of groups in the first factor multiplied by one less than the number of groups in the second factor. As there are only two groups in each factor, the degrees of freedom for the interaction is 1 [(2 − 1) × (2 − 1) = 1]. The degrees of freedom for the error term is the number of cases minus the total number of groups. The total number of groups can be calculated by multiplying the number of groups in the first factor by the number of groups in the second factor. As there are two groups in each factor, the total number of groups or cells is 4 (2 × 2 = 4). As there are 15 cases, the degrees of freedom for the error term is 11 (15 − 4 = 11). With 1 degree of freedom for the numerator and 11 degrees of freedom for the denominator, F has to be 4.84 or bigger to be statistically significant at the .05 level, which it is. Consequently, the interaction between marital status and sex is statistically significant.
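As an aside not in the original text, the same two-way analysis can be run outside SPSS; the sketch below assumes pandas and statsmodels are available. With Sum (effect) coding of the factors, requesting Type III sums of squares should reproduce the Type III column of Table 12.2, and typ=2 the Type II column discussed in the next section, up to rounding.

```python
# A 2 x 2 unrelated analysis of variance on the data of Table 12.1.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    'depress': [2, 4, 8, 6, 8, 6, 7, 7, 3, 5, 4, 4, 1, 3, 2],
    'marital': ['never married'] * 8 + ['married'] * 7,
    'sex':     ['women'] * 2 + ['men'] * 6 + ['women'] * 4 + ['men'] * 3,
})

model = ols('depress ~ C(marital, Sum) * C(sex, Sum)', data=data).fit()
print(sm.stats.anova_lm(model, typ=3))   # Type III sums of squares
print(sm.stats.anova_lm(model, typ=2))   # Type II sums of squares
```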
Unequal and disproportionate cell frequencies
When the number of cases in each of the cells of a factorial analysis of variance is not the same and is not proportionate, as here, the three effects may not be unrelated to or independent of each other. There are three main ways for calculating the effects in these circumstances (Overall and Spiegel 1969). The results for the interaction (and the error) will be the same for all three methods but those for the main effects may differ, as shown in Table 12.2. For example, although the F ratio for marital status is statistically significant for all three methods, it is 14.07 for the first method presented, 24.79 for the second method and 33.49 for the third method. When the number of cases is the same in each of the cells of a factorial analysis of variance, the three effects are unrelated and all three methods give the same results for the main effects. All three methods use multiple regression to calculate the sum of squares for the effects, but the way in which this is done differs. The first method presented is called Type III in SPSS and is the default method. It has also been called Method 1, the regression, the unweighted means or the unique approach. Each effect is adjusted for all other effects (including any
Table 12.2  Main results for three types of factorial analysis of variance with unequal numbers

                           Type III                  Type II                   Type I
Source of variation   df   SS     MS     F           SS     MS     F           SS     MS     F
Marital status (M)    1    12.80  12.80  14.07*      22.53  22.53  24.79*      30.48  30.48  33.49*
Sex (S)               1    3.20   3.20   3.52        2.06   2.06   2.26        2.06   2.06   2.26
M × S                 1    28.80  28.80  31.65*      28.80  28.80  31.65*      28.80  28.80  31.65*
Error                 11   10.00  0.91               10.00  0.91               10.00  0.91
Sum of effects        14   54.80                     63.36                     71.34

* p < .05.
covariates). The variance explained by each effect is unique to that effect and is not shared with or related to any other effect. For example, the variance for marital status is that which remains after the effect of sex and the interaction has been taken into account. This can be expressed as the difference between the variance of sex and the interaction and the variance of all three effects, as shown in Table 12.3. The variance for sex is that which is left over after the effect of marital status and the interaction has been taken into account. In other words, it is the difference between the variance of marital status and the interaction and the variance of all three effects. Tabachnick and Fidell (1996) recommended this approach for experimental designs where each cell is expected to be equally important. It may be argued, however, that this approach is also suitable for non-experimental designs where the unique effect for each variable is of interest. The second method shown in Table 12.2 is called Type II in SPSS. It has also been referred to as Method 2, the classical experimental or the least squares approach. Main effects are adjusted for other main effects (and covariates), while interactions are adjusted for all other effects apart from higher-order interactions. For example, the variance for marital status is that which is not already explained by sex but which includes any variance that marital status shares with the interaction. It is the difference between the variance for marital status and sex and the variance for sex, as shown in Table 12.3. Tabachnick and Fidell (1996) recommended this approach for non-experimental designs where greater weight is attached to the main effects. The third and final method presented is called Type I in SPSS. It has also been called Method 3, the hierarchical or the sequential approach. The effects are ordered in a particular sequence according to the investigator. For example, the investigator may be primarily interested in the effect of marital status, followed by the effect of sex and then the interaction between marital status and sex. In this case, the variance explained by marital status may include variance which is shared with sex and the interaction between marital status and sex. The variance that is explained by the next effect of sex will exclude that which has already been explained by marital status. In Table 12.3, we can see that the way that the effect for sex is calculated is exactly the same in the Type I and II methods, which is why the results for these two methods in Table 12.2 are the same. Where the number of cases in each cell is not equal but is proportionate, the Type II method will give the same results as the Type I method because the main effects will not be related to one another. A design with proportionate frequencies is presented in Table 12.4. The ratio of the row frequencies (2:4 and 3:6) is 1:2. The ratio of the column frequencies (2:3 and 4:6) is 2:3. Cell frequencies are generally proportionate when the frequency of a cell is equal to the product of the row and column frequency for that cell divided by the total frequency. This is the case for all four cells in
Table 12.3  Regression equations for the three methods of analysis of variance

Effects              Type III                   Type II              Type I (M, S, M × S order)
Marital status (M)   M, S, M × S − S, M × S     M, S − S             M
Sex (S)              M, S, M × S − M, M × S     M, S − M             M, S − M
M × S                M, S, M × S − M, S         M, S, M × S − M, S   M, S, M × S − M, S
Table 12.4  A 2 × 2 design with proportional cell frequencies

                Women   Men   Row total
Never married   2       3     5
Married         4       6     10
Column total    6       9     15
Table 12.4. For example, the frequency of the cell in the first row and first column is 2, which is the same as 5 × 6/15 (30/15 = 2).

Dummy variables with effect coding
To demonstrate the calculation of a factorial analysis of variance and that effects may be correlated when cell frequencies are unequal and disproportionate, we need to represent the main effects and interaction with dummy variables with effect coding as shown in Table 12.5. In effect coding, one group is coded as 1, another group as −1 and the other groups as 0. To
Table 12.5  Effect coding for a 2 × 2 analysis of variance

                                   Dummy variables
Marital status   Sex     Marriage   Sex   Marriage × Sex
Never married    Women   1          1     1
Never married    Women   1          1     1
Never married    Men     1          −1    −1
Never married    Men     1          −1    −1
Never married    Men     1          −1    −1
Never married    Men     1          −1    −1
Never married    Men     1          −1    −1
Never married    Men     1          −1    −1
Married          Women   −1         1     −1
Married          Women   −1         1     −1
Married          Women   −1         1     −1
Married          Women   −1         1     −1
Married          Men     −1         −1    1
Married          Men     −1         −1    1
Married          Men     −1         −1    1
represent a factor we need one less dummy variable than the number of groups in that factor. As each factor has only two groups, we only need one dummy variable to represent each factor. One group is coded 1 and the other as −1. It does not matter which group is coded 1 and −1. We have coded the first group in each factor as 1 and the other as −1. As we have only two groups in each factor, we do not use the code of 0. The dummy variable for representing the interaction is produced by multiplying the dummy variables representing the two factors. If we correlate the three dummy variables in Table 12.5, we find that the first dummy variable correlates −.33 and −.19 with the second and third dummy variables, respectively, while the second dummy variable correlates .00 with the third dummy variable. In other words, marital status is correlated with both sex and the interaction, while sex is not correlated with the interaction. If we create three new dummy variables to represent the proportionate cell frequency design in Table 12.4 and correlate them, we find that the main effect of marital status is correlated .00 with the main effect of sex. In other words, the two main effects are not correlated. The third dummy variable correlates −.19 and −.33 with the first and second dummy variable, respectively. Finally, if we create three new dummy variables to represent a 2 × 2 design with 4 cases in each cell, we find that none of the effects are correlated with each other.
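The correlations just quoted are easy to verify. The following brief sketch is not in the original; it assumes numpy and uses the effect codes of Table 12.5.

```python
# Correlations among the effect-coded predictors with unequal, disproportionate cells.
import numpy as np

marital = np.array([1] * 8 + [-1] * 7)                   # 1 = never married, -1 = married
sex = np.array([1, 1] + [-1] * 6 + [1] * 4 + [-1] * 3)   # 1 = women, -1 = men
interaction = marital * sex

print(round(np.corrcoef(marital, sex)[0, 1], 2))           # about -.33
print(round(np.corrcoef(marital, interaction)[0, 1], 2))   # about -.19
print(round(np.corrcoef(sex, interaction)[0, 1], 2))       # .00
```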
Multiple regression
We will illustrate the calculation of the F ratios and sum of squares for the Type III method where the values of the factors are fixed and not random, as is usually the case and is the case here. A random factor is one where the values of that factor have been selected at random from some population of values. For example, if we were interested in the effects of level of background white noise on performance and we were restricted to three loudness levels of no more than 90 decibels, then we may select those three levels at random from the 90 whole values available. As mentioned in the previous two chapters, the F-test can be derived from the following equation: F=
[R² (change)/number of predictors] / [(1 − R²)/(N − number of predictors − 1)]
The squared multiple correlation (R2) represents the proportion of the total variance or sum of squares accounted for in the dependent variable. If we know the total sum of squares, we can work out the sum of squares for the different terms in the analysis of variance table shown in Table 12.2. The total sum of squares is the sum of the squared deviation of each score from
the total or grand mean. The grand mean for our example is 4.67, as shown in Table 12.1. The total sum of squares is 71.33. We can work out the F ratio for marital status if we know the squared multiple correlation for sex and the interaction and for all three effects. The squared multiple correlation for sex and the interaction is .680 and for all three effects is .860. Consequently, the change in the squared multiple correlation for the one predictor of marital status is .180 (.860 − .680 = .180). Note that this change provides an estimate of the size of an effect, which for marital status is .18. Substituting the appropriate values in the formula for the F ratio, we find an F ratio of about 14.17, which, given rounding error, is similar to the figure of 14.07 in Table 12.2:

F = (.180/1)/[(1 − .860)/(15 − 3 − 1)] = .180/(.140/11) = .180/.0127 = 14.17

The sum of squares for marital status is the squared multiple correlation for marital status (.180) multiplied by the total sum of squares (71.33), which is about 12.84 (.180 × 71.33 = 12.84), a value near that of 12.80 in Table 12.2. To calculate the F ratio for sex, we need to know the squared multiple correlation for marital status and the interaction, which is .815. Consequently, the change in the squared multiple correlation for the one predictor of sex is .045 (.860 − .815 = .045). Inserting the appropriate values in the formula for the F ratio gives an F ratio of about 3.54, which is similar to the figure of 3.52 in Table 12.2:

F = (.045/1)/[(1 − .860)/(15 − 3 − 1)] = .045/(.140/11) = .045/.0127 = 3.54

The sum of squares for sex is its squared multiple correlation (.045) multiplied by the total sum of squares (71.33), which is about 3.21 (.045 × 71.33 = 3.21), a figure close to that of 3.20 in Table 12.2. To calculate the F ratio for the interaction, we have to know the squared multiple correlation for marital status and sex, which is .456. The change in the squared multiple correlation for the one predictor of the interaction is .404 (.860 − .456 = .404). Putting the relevant values in the formula for the F ratio gives an F ratio of about 31.81, which, given rounding error, is similar to the value 31.65 in Table 12.2:

F = (.404/1)/[(1 − .860)/(15 − 3 − 1)] = .404/(.140/11) = .404/.0127 = 31.81

The sum of squares for the interaction is its squared multiple correlation (.404) multiplied by the total sum of squares (71.33), which is about 28.82 (.404 × 71.33 = 28.82), a figure very close to that of 28.80 in Table 12.2.
Finally, the error sum of squares is the proportion of variance that is not explained by the three effects (1 − .860 = .140) multiplied by the total sum of squares (71.33), which is about 9.99 (.140 × 71.33 = 9.99), a figure very near that of 10.00 in Table 12.2.
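The R² changes used in these calculations can be obtained from three reduced regressions and the full regression. The following compact sketch is not from the book and assumes numpy only.

```python
# R-squared changes for the Type III effects, using the effect codes of Table 12.5.
import numpy as np

depress = np.array([2, 4, 8, 6, 8, 6, 7, 7, 3, 5, 4, 4, 1, 3, 2], dtype=float)
m = np.array([1] * 8 + [-1] * 7, dtype=float)                      # marital status
s = np.array([1, 1] + [-1] * 6 + [1] * 4 + [-1] * 3, dtype=float)  # sex
ms = m * s                                                         # interaction

def r_squared(*predictors):
    X = np.column_stack([np.ones_like(depress)] + list(predictors))
    fitted = X @ np.linalg.lstsq(X, depress, rcond=None)[0]
    return 1 - np.sum((depress - fitted) ** 2) / np.sum((depress - depress.mean()) ** 2)

r2_full = r_squared(m, s, ms)                     # about .860
print(round(r2_full - r_squared(s, ms), 3))       # marital status change, about .180
print(round(r2_full - r_squared(m, ms), 3))       # sex change, about .045
print(round(r2_full - r_squared(m, s), 3))        # interaction change, about .404
```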
Reduction of error
To demonstrate that a two-way analysis of variance may provide a more sensitive test of a factor than a one-way analysis of variance, the results of a one-way analysis of variance for marital status are presented in Table 12.6 together with those for the two-way analysis of variance. The error sum of squares is considerably smaller for the two-way analysis of variance (10.00) than the one-way analysis of variance (40.86) because some of the error is due to sex (3.20) and to the interaction (28.80). Consequently, the F ratio for marital status is larger for the two-way analysis of variance (14.07) than the one-way analysis of variance (9.71). For the two-way analysis of variance with 1 and 11 degrees of freedom in the numerator and denominator, respectively, F has to be 4.84 or larger to be statistically significant at the .05 level. For the one-way analysis of variance with 1 and 13 degrees of freedom in the numerator and denominator, respectively, F has to be 4.67 or larger to be statistically significant at the .05 level. Although the critical value of F for the one-way analysis of variance is lower than that for the two-way analysis of variance, the difference is relatively small (4.84 − 4.67 = 0.17) and is much smaller than the difference between the F value for marital status (14.07 − 9.71 = 4.36). Consequently, the effect of marital status is more significant for the two-way analysis of variance (p = .003) than for the one-way analysis of variance (p = .008).
Table 12.6  One- versus two-way analysis of variance for marital status

                      One-way ANOVA                   Two-way ANOVA
Source of variation   SS     df   MS     F            SS     df   MS     F
Marital status        30.48  1    30.48  9.71*        12.80  1    12.80  14.07*
(Sex)                                                 3.20   1    3.20   3.52
(Interaction)                                         28.80  1    28.80  31.65*
Error                 40.86  13   3.14                10.00  11   0.91
Total                 71.33  14                       71.33  14

* p < .05.
Homogeneity of variance
Like the one-way analysis of variance, the factorial analysis of variance assumes that the populations from which the samples are drawn have a normal distribution and equal or homogeneous variances. Statistical tests for determining whether the skewness (or asymmetry) and the kurtosis (or flatness) of a distribution of scores differs significantly from zero are described elsewhere (Cramer 1998). One test for assessing whether variances are homogeneous is Levene’s test, which is simply a one-way analysis of variance on the absolute deviation of each score from the mean for that group. The calculations for this test are shown in Table 12.7. The F value for Levene’s test is 0.407. With 3 and 11 degrees of freedom in the numerator and denominator, respectively, F has to be 5.59 or larger to be statistically significant, which it is not. Had this test been significant, transforming the scores by, for example, taking their square root should be tried to make the variances more equal.
Comparing groups
For our example, we have a significant effect for marital status and for the interaction between marital status and gender. As the factor of marital status consists of only two groups, the significant effect means that the depression score for the never married (6.00) is significantly higher than that for the married (3.14). The significance level for the F ratio is two-tailed. If we have no strong grounds for predicting a difference between the two groups, we would use the two-tailed level for assessing the significance of this effect. If we have good reasons for predicting that the never married would be more depressed than the married, we could use the one-tailed level by halving the two-tailed probability value. If we had three or more groups and we had a significant effect, we would have had to compare the means of two groups at a time to see which of the group means differed from each other. If we had good grounds for making specific predictions, we would use unrelated t-tests to test pairs of means. If we did not have strong grounds for making predictions, we would use a post-hoc test, such as the Scheffé test, to determine which means differed. The interaction effect is made up of four groups. A significant interaction effect does not tell us which of the means for these groups differ from each other. With four groups, we can make six pairwise comparisons, which are shown in the first column of Table 12.8. If we had good grounds for making specific predictions about which groups differed, we would use the unrelated t-test with a one-tailed probability. The results for this test are presented in Table 12.8. The only means that do not differ significantly from
each other are those of the never married women (3.00) versus those of the married women (4.00) and the married men (2.00). If we did not have good reasons for making specific predictions about which groups differed, we would use a post-hoc test, such as Scheffé. The Scheffé test is an F test in which the squared difference between the means of the two groups being compared is divided by the error mean square for all the groups, but which is weighted to reflect the number of the cases in the two groups being compared (n1 and n2): F=
(group 1 mean − group 2 mean)² / [error mean square × (n1 + n2)/(n1 × n2)]
The significance of this F-test is evaluated against the appropriate critical value of F, which is weighted or multiplied by the degrees of freedom for the interaction. These degrees of freedom are the number of groups minus 1, which in this case is 3 (4 − 1 = 3). The .05 critical value of F with 3 and 11 degrees of freedom in the numerator and denominator, respectively, is about 3.59. Consequently, the appropriate .05 critical value of F for this Scheffé test is 10.77 (3.59 × 3 = 10.77). The Scheffé F for comparing the mean scores of the never married women and men is 26.23:

F = (3.00 − 7.00)²/[0.91 × (2 + 6)/(2 × 6)] = (−4.00)²/(0.91 × 0.67) = 16.00/0.61 = 26.23

As 26.23 is larger than 10.77, the means of these two groups differ significantly. The Scheffé F, the degrees of freedom and the p value for these six comparisons are shown in Table 12.8. In addition to the two pairwise comparisons which were not significant for the unrelated t-test, the mean for the married women (4.00) did not differ from that for the married men (2.00).
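The Scheffé calculation is easily wrapped in a small function. The sketch below is not part of the original text; it assumes scipy is available, and because it keeps full precision the F value comes out slightly above the hand-calculated 26.23.

```python
# Scheffé F for a pairwise comparison, using the error mean square from the two-way ANOVA.
from scipy.stats import f

def scheffe_f(mean1, mean2, n1, n2, error_ms):
    return (mean1 - mean2) ** 2 / (error_ms * (n1 + n2) / (n1 * n2))

# Never married women (3.00, n = 2) versus never married men (7.00, n = 6)
print(round(scheffe_f(3.00, 7.00, 2, 6, 0.91), 2))   # about 26.4 (26.23 with rounded terms)

# Critical value: the .05 value of F with 3 and 11 df, multiplied by 3
print(round(3 * f.ppf(0.95, 3, 11), 2))              # about 10.8 (10.77 in the text)
```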
Reporting the results
One concise way of writing up the results of this analysis is as follows: ‘A 2 × 2 unrelated ANOVA was carried out using the regression approach. Significant effects were found for marital status (F1,11 = 14.07, p < .05) and the interaction between marital status and sex (F1,11 = 31.65, p < .001). Depression was significantly higher in the never married (M = 6.00, SD = 2.07) than in the married (M = 3.15, SD = 1.35).’ If we did not have strong reasons for predicting an interaction effect, we could add the following: ‘The Scheffé test showed that never married men (M = 7.00, SD = 0.89) were significantly more depressed than married men (M = 2.00, SD = 1.00), never married women (M = 3.00, SD = 1.41) and married women (M = 4.00, SD = 0.82).’
Table 12.7  Calculation of Levene’s test for a two-way analysis of variance

Cases        Group   Depression    Absolute deviations   Between-groups squared deviations   Within-groups squared deviations
1            1, 1    2             3 − 2 = 1             (1.00 − 0.67)² = 0.11               (1.00 − 1.00)² = 0.00
2            1, 1    4             3 − 4 = 1             (1.00 − 0.67)² = 0.11               (1.00 − 1.00)² = 0.00
Group mean           6/2 = 3.00    2/2 = 1.00
3            1, 2    8             7 − 8 = 1             (0.67 − 0.67)² = 0.00               (0.67 − 1.00)² = 0.11
4            1, 2    6             7 − 6 = 1             (0.67 − 0.67)² = 0.00               (0.67 − 1.00)² = 0.11
5            1, 2    8             7 − 8 = 1             (0.67 − 0.67)² = 0.00               (0.67 − 1.00)² = 0.11
6            1, 2    6             7 − 6 = 1             (0.67 − 0.67)² = 0.00               (0.67 − 1.00)² = 0.11
7            1, 2    7             7 − 7 = 0             (0.67 − 0.67)² = 0.00               (0.67 − 0.00)² = 0.45
8            1, 2    7             7 − 7 = 0             (0.67 − 0.67)² = 0.00               (0.67 − 0.00)² = 0.45
Group mean           42/6 = 7.00   4/6 = 0.67
9            2, 1    3             4 − 3 = 1             (0.50 − 0.67)² = 0.03               (0.50 − 1.00)² = 0.25
10           2, 1    5             4 − 5 = 1             (0.50 − 0.67)² = 0.03               (0.50 − 1.00)² = 0.25
11           2, 1    4             4 − 4 = 0             (0.50 − 0.67)² = 0.03               (0.50 − 0.00)² = 0.25
12           2, 1    4             4 − 4 = 0             (0.50 − 0.67)² = 0.03               (0.50 − 0.00)² = 0.25
Group mean           16/4 = 4.00   2/4 = 0.50
13           2, 2    1             2 − 1 = 1             (0.67 − 0.67)² = 0.00               (0.67 − 1.00)² = 0.11
14           2, 2    3             2 − 3 = 1             (0.67 − 0.67)² = 0.00               (0.67 − 1.00)² = 0.11
15           2, 2    2             2 − 2 = 0             (0.67 − 0.67)² = 0.00               (0.67 − 0.00)² = 0.45
Group mean           6/3 = 2.00    2/3 = 0.67
Grand mean                         10/15 = 0.67
Sum of squares                                           0.34                                3.01
Degrees of freedom                                       4 − 1 = 3                           15 − 4 = 11
Mean square                                              0.34/3 = 0.11                       3.01/11 = 0.27
F                                                        0.11/0.27 = 0.407
Table 12.8  Unrelated t- and Scheffé tests comparing four means

Comparisons                                  t      df   p       Scheffé   df     p
Never married women vs never married men     4.90   6    .01     26.23     3,11   .01
Never married women vs married women         1.16   4    ns      1.47      3,11   ns
Never married women vs married men           0.95   3    ns      1.32      3,11   ns
Never married men vs married women           5.37   8    .001    23.68     3,11   .01
Never married men vs married men             7.64   7    .0001   54.35     3,11   .001
Married women vs married men                 2.93   5    .05     7.55      3,11   ns
SPSS windows procedure
To carry out this analysis of variance, use the following procedure. Enter the data into the Data Editor as shown in Box 12.1. Marital status, labelled marital, is coded 1 for never married and 2 for married. Sex, labelled as such, is coded 1 for women and 2 for men. Depression has been called depress.
Box 12.1
Data in the Data Editor for a two-way analysis of variance
Select Analyze on the horizontal menu bar near the top of the window, General Linear Model from the drop-down menu and then Univariate. . ., which opens the Univariate dialog box in Box 10.2. Select depress and then the first 䉴 button to put this variable in the box under Dependent: (variable). Select marital and sex and then the second 䉴 button to put these two variables in the box under Fixed Factor(s):. Type III is the default method as shown in Box 11.2. To select another method, select Model. . . to open the Univariate: Model sub-dialog box. Select the downward arrow next to Type III to display and select the appropriate option. To produce a graph similar to that in Figure 12.1, select Plots. . . to open the Univariate: Profile Plots sub-dialog box in Box 12.2. Select marital under Factor(s): and the first 䉴 button to put this variable under the Horizontal Axis: box. Select sex and the second 䉴 button to put this variable under the Separate Lines: box. Select Add to put marital*sex under the Plots: box. Select Continue to close this sub-dialog box and to return to the main dialog box. Select Options. . . to open the Univariate: Options sub-dialog box in Box 10.4.
Box 12.2
Univariate: Profile Plots sub-dialog box
Select Descriptive statistics to display the means and standard deviations for the three effects. Select Homogeneity tests to carry out Levene’s test for homogeneity of variances. Select Continue to return to the main dialog box. Select OK to run this analysis. To run the Scheffé test on the four groups of the interaction terms, create a fourth variable (called, say, groups) in the Data Editor in which the four groups are coded 1 (never married women) to 4 (married men). Run a Univariate. . . analysis of variance in which depress is the Dependent: variable and groups is the Fixed Factor(s):. Select Post Hoc. . ., which opens the Univariate: Post Hoc Multiple Comparisons for Observed Means sub-dialog box in Box 10.3. Select groups under the Factor(s): and then the 䉴 button to put this variable in the box under Post Hoc Tests for:. Select Scheffe and then Continue to return to the Univariate dialog box. Select OK to run this analysis. To run the regression analysis, enter the codes for the three dummy variables as shown in Table 12.5 into the fourth to sixth columns of the Data Editor file. To produce the change in the squared multiple correlation for marital status, regress depress on the last two dummy variables and then the first one.
SPSS output
Not all of the SPSS output will be presented or discussed. The results for Levene’s test are displayed in Table 12.9. As the F value of .407 has a significance value of .751, the variances do not differ significantly.
Table 12.9  SPSS output of Levene’s test for the two-way analysis of variance

Levene’s Test of Equality of Error Variances
Dependent Variable: DEPRESS

F      df1   df2   Sig.
.407   3     11    .751

Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a  Design: Intercept + MARITAL + SEX + MARITAL * SEX.
Table 12.10  SPSS output of the two-way analysis of variance table

Tests of Between-Subjects Effects
Dependent Variable: DEPRESS

Source            Type III Sum of Squares   df   Mean Square   F         Sig.
Corrected Model   61.333a                   3    20.444        22.489    .000
Intercept         204.800                   1    204.800       225.280   .000
MARITAL           12.800                    1    12.800        14.080    .003
SEX               3.200                     1    3.200         3.520     .087
MARITAL * SEX     28.800                    1    28.800        31.680    .000
Error             10.000                    11   .909
Total             398.000                   15
Corrected Total   71.333                    14

a  R Squared = .860 (Adjusted R Squared = .822).
Table 12.11  SPSS output of the Scheffé tests for the two-way analysis of variance

Multiple Comparisons
Dependent Variable: DEPRESS
Scheffé

(I) GROUPS   (J) GROUPS   Mean Difference (I − J)   Std. Error   Sig.   Lower Bound   Upper Bound
1            2            −4.00*                    .778         .003   −6.55         −1.45
1            3            −1.00                     .826         .697   −3.71         1.71
1            4            1.00                      .870         .729   −1.86         3.86
2            1            4.00*                     .778         .003   1.45          6.55
2            3            3.00*                     .615         .004   .98           5.02
2            4            5.00*                     .674         .000   2.79          7.21
3            1            1.00                      .826         .697   −1.71         3.71
3            2            −3.00*                    .615         .004   −5.02         −.98
3            4            2.00                      .728         .112   −.39          4.39
4            1            −1.00                     .870         .729   −3.86         1.86
4            2            −5.00*                    .674         .000   −7.21         −2.79
4            3            −2.00                     .728         .112   −4.39         .39

Based on observed means. The Lower and Upper Bounds give the 95% confidence interval for the difference.
* The mean difference is significant at the .05 level.
Table 12.12  SPSS output of the change in the squared multiple correlation for marital status

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .825a   .680       .627                1.378                        .680              12.772     2     12    .001
2       .927b   .860       .822                .953                         .179              14.080     1     11    .003

a  Predictors: (Constant), E1 × E2, EFFECT2.
b  Predictors: (Constant), E1 × E2, EFFECT2, EFFECT1.
The analysis of variance table is shown next, as presented in Table 12.10. Five of the eight sources listed are of interest to us and these are those labelled MARITAL, SEX, MARITAL * SEX, Error and Corrected Total, which correspond to those presented in Table 12.6. For example, the F value of 14.080 has a significance value of .003, which is significant as it is less than .05. The results of the first of two tables for the Scheffé analysis are displayed in Table 12.11, as it is the more relevant to us. The comparisons in this table are repeated twice. For example, the first line compares the means of the never married women (1) and the never married men (2), which is presented again but the other way round in the fourth line. Only the significance of the comparisons is displayed, which is .003 for this comparison. The results for the change in the squared multiple correlation for marital status is shown in Table 12.12. This change is shown in the second line of the column labelled R Square Change and is .179.
Recommended further reading Cramer, D. (1998) Fundamental Statistics for Social Research: Step-by-Step Calculations and Computer Techniques Using SPSS for Windows. London: Routledge. Using a different example, chapter 9 covers much of the material presented here. Overall, J.E. and Spiegel, D.K. (1969) Concerning least squares analysis of experimental data, Psychological Bulletin, 72: 311–22. This paper presents a short and clear account of the three main methods used for carrying out a factorial analysis of variance with unequal and disproportionate cell frequencies. Pedhazur, E.J. (1982) Multiple Regression in Behavioral Research: Explanation and Prediction, 2nd edn. New York: Holt, Rinehart & Winston. Although fairly technical, chapter 10 shows how an unrelated 3 × 3 analysis of variance can be calculated with multiple regression. Pedhazur, E.J. and Schmelkin, L.P. (1991) Measurement, Design and Analysis: An Integrated Approach. Hillsdale, NJ: Lawrence Erlbaum Associates. Chapter 20 provides a less technical account of how an unrelated 2 × 3 analysis of variance can be calculated with multiple regression SPSS Inc. (2002) SPSS Base 11.0 User’s Guide Package. Upper Saddle River, NJ: Prentice-Hall. Provides a detailed commentary on the output produced by SPSS 11.0 as well as a useful introduction to analysis of variance.
Part 6 Discriminating between groups
13
Discriminant analysis
Discriminant, or discriminant function, analysis is a parametric technique used to determine which weightings of quantitative variables or predictors best discriminate between two or more groups of cases and do so better than chance. The weightings of variables form a new composite variable, which is known as a discriminant function and which is a linear combination of the weightings and scores on these variables. The maximum number of such functions is either the number of predictors or the number of groups minus one, whichever of these two values is the smaller. For example, there will only be one discriminant function if there are either two groups or one predictor. There will be two discriminant functions if there are either three groups or two predictors. Where there is more than one discriminant function, the discriminant functions will be unrelated or orthogonal to each other. Each function will consist of all predictors, although their weight will not be the same on all the discriminant functions. The accuracy of the discriminant functions in classifying cases into their groups can be determined. There are three ways of entering predictors into a discriminant analysis as there are in multiple regression. In the standard or direct method, all predictors are entered at the same time, although some of these predictors may play little part in discriminating between the groups. In the hierarchical or sequential method, predictors are entered in a predetermined order to find out what contribution they make. For example, demographic variables
such as age, gender and social class may be entered first to control for the effect of these variables. In the statistical or stepwise method, predictors are selected in terms of variables that make the most contribution to the discrimination. If two predictors are related to each other and have very similar discriminating power, the predictor with the greater discriminating power will be chosen even if the difference in discriminating power of the two predictors is trivial. We will illustrate the use of discriminant analysis in discriminating three groups in terms of four predictors. The three groups are people who have been diagnosed as suffering from anxiety, depression or neither anxiety nor depression (i.e. normal). Each group consists of five people, although the numbers in each group need not be equal for a discriminant analysis. The four predictors comprise 5-point rating scales of the four symptoms of anxiety, restlessness, depression and hopelessness. Higher scores indicate greater severity of these symptoms. The ratings on these predictors for the three groups of cases are presented in Table 13.1. The normal, anxious and depressed groups have been coded as 1, 2 and 3, respectively. Tabachnick and Fidell (1996) recommended that the size of the smallest group should be bigger than the number of predictors, which is the case in this sample, which has been kept deliberately small. Any outliers or extreme scores in the data, of which there are none here, should be transformed or omitted. The
Table 13.1  Individual ratings on four predictors for three groups of cases

Cases   Groups   Anxious   Restless   Depressed   Hopeless
1       1        2         3          1           2
2       1        1         3          3           3
3       1        3         2          2           1
4       1        4         2          3           2
5       1        1         2          1           2
6       2        4         3          3           2
7       2        5         4          2           4
8       2        4         4          2           3
9       2        3         2          3           2
10      2        4         5          1           2
11      3        4         3          5           4
12      3        2         2          3           3
13      3        3         3          5           4
14      3        2         4          4           5
15      3        2         1          4           3
aim is to determine which symptoms are needed to distinguish the three groups from each other.
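For comparison only, and not as part of the author’s SPSS-based treatment, the same analysis can be sketched with scikit-learn, assuming numpy and scikit-learn are installed. Its discriminant scores are scaled differently from SPSS’s unstandardized canonical coefficients, so only the group separation and the classification accuracy are directly comparable.

```python
# Discriminant analysis of the ratings in Table 13.1 with scikit-learn.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Columns: anxious, restless, depressed, hopeless
X = np.array([
    [2, 3, 1, 2], [1, 3, 3, 3], [3, 2, 2, 1], [4, 2, 3, 2], [1, 2, 1, 2],   # normal
    [4, 3, 3, 2], [5, 4, 2, 4], [4, 4, 2, 3], [3, 2, 3, 2], [4, 5, 1, 2],   # anxious
    [4, 3, 5, 4], [2, 2, 3, 3], [3, 3, 5, 4], [2, 4, 4, 5], [2, 1, 4, 3],   # depressed
])
y = np.array([1] * 5 + [2] * 5 + [3] * 5)   # 1 = normal, 2 = anxious, 3 = depressed

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
scores = lda.transform(X)        # case scores on the two discriminant functions
print(lda.score(X, y))           # proportion of cases classified correctly
```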
Group means and standard deviations
As a first step, it is useful to look at the means and standard deviations of the ratings of the four predictors for the three groups to determine whether they appear to differ between the three groups and, if so, in what way. We could also carry out a one-way analysis of variance on the four predictors to establish which of them on their own best discriminate the three groups in terms of their level of significance. The predictors having the highest level of significance would be the best discriminators. The means, standard deviations and significance levels of the four predictors for the three groups are presented in Table 13.2. From Table 13.2, we can see that the most significant predictor that discriminates between the three groups is feeling depressed (p = .004). The depressed group feels more depressed (4.20) than the anxious (2.20) and the normal (2.00) group. Consequently, it is possible that the first discriminant function will differentiate the depressed group from the other two groups and that the most heavily weighted predictor on this discriminant function will be the symptom of feeling depressed. The next most significant predictor is feeling hopeless (p = .013). The depressed group feels more hopeless (3.80) than the anxious (2.60) and the normal (2.00) group. Thus, feeling hopeless may be the second most heavily weighted predictor on this discriminant function. The third most significant predictor is feeling anxious (p = .035). The anxious group feels more anxious (4.00) than the depressed (2.60) and the normal (2.00) group. Therefore, it is possible that
Table 13.2  Means, standard deviations (SD) and significance levels of the four predictors for the three groups

Predictors           Normals   Anxious   Depressed     p
Anxious     Mean       2.00      4.00       2.60     .035
            SD         1.30      0.71       0.89
Restless    Mean       2.40      3.60       2.60     .161
            SD         0.55      1.14       1.14
Depressed   Mean       2.00      2.20       4.20     .004
            SD         1.00      0.84       0.84
Hopeless    Mean       2.00      2.60       3.80     .013
            SD         0.71      0.89       0.84
Therefore, it is possible that feeling anxious will be the most heavily weighted predictor on a second discriminant function which distinguishes the anxious group from the other two groups.

Note that if we have one discriminant function that distinguishes the depressed group from the other two groups and another that distinguishes the anxious group from the other two groups, there is no need for a third discriminant function to distinguish the normal group from the other two, because the first two discriminant functions already provide this information. If, for example, a high score on the first discriminant function represents the depressed group and a high score on the second discriminant function represents the anxious group, then the normal group will be represented by low scores on both functions. This is why the maximum number of discriminant functions, as determined by the number of groups, is always one less than the number of groups.
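Before turning to the discriminant functions themselves, the comparison in Table 13.2 can be reproduced from the data frame sketched after Table 13.1; a minimal sketch, assuming scipy is available for the one-way analyses of variance:

    from scipy import stats

    for predictor in ['anxious', 'restless', 'depressed', 'hopeless']:
        # Group means and standard deviations (cf. Table 13.2)
        print(data.groupby('group')[predictor].agg(['mean', 'std']).round(2))
        # One-way analysis of variance comparing the three groups
        samples = [data.loc[data['group'] == g, predictor] for g in (1, 2, 3)]
        f_value, p_value = stats.f_oneway(*samples)
        print(predictor, 'p =', round(p_value, 3))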
Discriminant functions
The first discriminant function provides the maximum or best separation between the groups. The second discriminant function provides the next best separation that is unrelated or orthogonal to the first discriminant function, and so on. If we had more than three groups, we would have more than two discriminant functions. A discriminant function is like a regression equation in which each predictor is weighted and there is a constant. So, for our example, both discriminant functions take the following general form:

discriminant function score = (weight1 × anxious) + (weight2 × restless) + (weight3 × depressed) + (weight4 × hopeless) + constant

The value of the constant and the unstandardized weights differ for the two functions; they are shown in Table 13.3. We can see that hopelessness is the predictor most heavily weighted on the first discriminant function (.834), followed by feeling depressed (.723). Feeling anxious is the predictor most heavily weighted on the second discriminant function (.753). These values are calculated with matrix algebra and so their derivation will not be demonstrated here.

To work out the score of each case on the two discriminant functions, we take its ratings on the predictors as shown in Table 13.1, multiply them by the appropriate weights and add the products together with the constant. So, the first case has a score of about −1.63 on the first discriminant function (see Table 13.4):

(2 × −0.421) + (3 × −0.397) + (1 × 0.723) + (2 × 0.834) + (−1.986)
= (−0.842) + (−1.191) + (0.723) + (1.668) + (−1.986)
= −1.628
Table 13.3  SPSS output of weights or coefficients of the predictors and constant for the two discriminant functions

Canonical Discriminant Function Coefficients

                 Function
                 1          2
Anxious        −.421       .753
Restless       −.397       .304
Depressed       .723       .113
Hopeless        .834       .358
(Constant)    −1.986     −4.401

Unstandardized coefficients.
Table 13.4  Scores on the two discriminant functions for the 15 cases

Cases   Groups   Function 1   Function 2
 1        1        −1.63        −1.15
 2        1         1.07        −1.32
 3        1        −1.76        −0.95
 4        1        −0.63         0.28
 5        1        −0.81        −2.21
 6        2        −1.02         0.58
 7        2        −0.90         2.24
 8        2        −1.31         1.13
 9        2        −0.21        −0.48
10        2        −3.27         0.96
11        3         2.09         1.52
12        3         1.05        −0.87
13        3         2.51         0.77
14        3         2.64         0.57
15        3         2.17        −1.06
The first case similarly has a score of about −1.15 on the second discriminant function:

(2 × 0.753) + (3 × 0.304) + (1 × 0.113) + (2 × 0.358) + (−4.401)
= (1.506) + (0.912) + (0.113) + (0.716) + (−4.401)
= −1.154

The scores on the two discriminant functions for all 15 cases are presented in Table 13.4.
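The calculation just shown for the first case can be repeated for all 15 cases. The sketch below types in the unstandardized coefficients and constants from Table 13.3 and should reproduce the scores in Table 13.4 to roughly two decimal places (small discrepancies are possible because the published coefficients are rounded).

    # Unstandardized coefficients and constants from Table 13.3
    coefficients = {
        1: {'anxious': -0.421, 'restless': -0.397, 'depressed': 0.723,
            'hopeless': 0.834, 'constant': -1.986},
        2: {'anxious': 0.753, 'restless': 0.304, 'depressed': 0.113,
            'hopeless': 0.358, 'constant': -4.401},
    }

    for f in (1, 2):
        data[f'function{f}'] = coefficients[f]['constant'] + sum(
            coefficients[f][p] * data[p]
            for p in ('anxious', 'restless', 'depressed', 'hopeless'))

    print(data[['group', 'function1', 'function2']].round(2))  # cf. Table 13.4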
The means of the two discriminant functions for the three groups are displayed in Table 13.5. For the first discriminant function, the mean for the depressed group (2.09) is higher than that for the normal (−.752) and the anxious (−1.341) groups, suggesting that this function differentiates the depressed group from the other two. For the second discriminant function, the mean of the anxious group (.887) is higher than that for the depressed (.184) and the normal (−1.071) groups, indicating that this function distinguishes the anxious group from the other two.

We can see that the first discriminant function offers the best separation between the groups if we carry out a one-way analysis of variance on the scores of the two discriminant functions, the results of which are shown in Table 13.6 together with two other measures of discrimination. The between-groups sum of squares of about 33.70 for the first discriminant function is larger than that of about 9.84 for the second. The discriminating power of the discriminant functions can also be expressed in terms of their eigenvalue and canonical correlation. The eigenvalue is the ratio of the between-groups to the within-groups sum of squares:

eigenvalue = between-groups sum of squares / within-groups sum of squares
The eigenvalue of 2.81 (33.70/12.00 = 2.81) for the first discriminant function is larger than that of 0.82 (9.84/12.00 = 0.82) for the second discriminant function.
Table 13.5  SPSS output of the means of the two discriminant functions for the three groups

Functions at Group Centroids

               Function
GROUP          1          2
Normal        −.752     −1.071
Anxious      −1.341       .887
Depressed     2.092       .184

Unstandardized canonical discriminant functions evaluated at group means.
Table 13.6  One-way analysis of variance results for the two discriminant functions with their eigenvalues and canonical correlations

                             Function 1                Function 2
Between-groups SS               33.70                      9.84
Within-groups SS                12.00                     12.00
Total SS                        45.70                     21.84
Eigenvalue               33.70/12.00 = 2.81         9.84/12.00 = 0.82
Canonical correlation    √(33.70/45.70) = .86       √(9.84/21.84) = .67
The eigenvalue can be expressed as a percentage of the total variance of the discriminant function scores by dividing the eigenvalue of a discriminant function by the total of the eigenvalues of all the discriminant functions and multiplying the result by 100. So, the first discriminant function explains 77.41 per cent [100 × 2.81/(2.81 + 0.82) = 77.41] of the total variance, while the second discriminant function explains the remaining 22.59 per cent [100 × 0.82/(2.81 + 0.82) = 22.59].

The canonical correlation is the square root of the ratio of the between-groups to the total sum of squares:

canonical correlation = √(between-groups sum of squares / total sum of squares)

The canonical correlation of .86 [√(33.70/45.70) = .86] for the first discriminant function is larger than that of .67 [√(9.84/21.84) = .67] for the second discriminant function.
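These quantities can be checked from the discriminant scores computed in the earlier sketch: the between- and within-groups sums of squares of each function give its eigenvalue and canonical correlation directly. The figures should agree with Table 13.6 up to rounding, since the scores were built from rounded coefficients.

    for f in ('function1', 'function2'):
        scores = data[f]
        grand_mean = scores.mean()
        group_mean = data.groupby('group')[f].transform('mean')
        between_ss = ((group_mean - grand_mean) ** 2).sum()
        within_ss = ((scores - group_mean) ** 2).sum()
        eigenvalue = between_ss / within_ss
        canonical_r = (between_ss / (between_ss + within_ss)) ** 0.5
        print(f, round(between_ss, 2), round(within_ss, 2),
              round(eigenvalue, 2), round(canonical_r, 2))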
Statistical significance of discriminant functions

Generally, we are only interested in discriminant functions that discriminate between the groups at a level greater than chance. The procedure is first to determine whether all the discriminant functions taken together are statistically significant. This test is based on Wilks’ lambda. For a single predictor or discriminant function, Wilks’ lambda is the ratio of the within-groups to the total sum of squares. For all the discriminant functions analysed together, it is the product of the Wilks’ lambdas of the separate discriminant functions. So, Wilks’ lambda for the two discriminant functions combined is about .144:

(12.00/45.70) × (12.00/21.84) = .263 × .549 = .144

Wilks’ lambda varies from 0 to 1. A lambda of 1 indicates that the means of all the groups have the same value and so do not differ. Lambdas close to 0 signify that the means of the groups differ.
Table 13.7  SPSS output for the statistical significance of discriminant functions

Wilks’ Lambda

Test of Function(s)   Wilks’ Lambda   Chi-square   df   Sig.
1 through 2               .144          20.330      8   .009
2                         .549           6.289      3   .098
Wilks’ lambda can be transformed into a chi-square (Stevens 1996) whose significance level can then be determined. The chi-square value for a Wilks’ lambda of .144 is about 20.33, as displayed in Table 13.7, which shows the relevant SPSS output. The degrees of freedom for this chi-square are the number of predictors multiplied by the number of groups minus one; in this case, we have 8 degrees of freedom [4 × (3 − 1) = 8]. With 8 degrees of freedom, chi-square has to be 15.51 or larger to be significant at the .05 two-tailed level, which it is. If this chi-square were not significant, we would not need to proceed any further, as this would tell us that none of the discriminant functions differentiates the groups significantly.

If chi-square is significant, as it is here, the Wilks’ lambda for the first discriminant function is removed from the product and a test is carried out to determine whether the remaining discriminant functions are significant. In this example, there are only two discriminant functions, so this test simply determines whether Wilks’ lambda for the second discriminant function is significant. Wilks’ lambda for the second discriminant function is .549 (12.00/21.84 = .549), which has a chi-square value of about 6.29. The degrees of freedom for this chi-square are the number of predictors minus one multiplied by the number of groups minus two, that is 3 [(4 − 1) × (3 − 2) = 3]. With 3 degrees of freedom, chi-square has to be 7.82 or bigger to be significant at the .05 two-tailed level, which it is not. This means that the first but not the second discriminant function is significant.

The general formula for the degrees of freedom after the first n discriminant functions have been removed is:

(number of predictors − n) × [number of groups − (n + 1)]

Applying this formula with n = 1 gives us 3 {(4 − 1) × [3 − (1 + 1)] = 3} degrees of freedom for the test of the second discriminant function.
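The figures in Table 13.7 can be checked with Bartlett’s approximation, χ² = −[N − 1 − (predictors + groups)/2] × ln(lambda), which I am assuming is the transformation Stevens (1996) describes; with these data it reproduces the tabled chi-square values of 20.33 and 6.29. A minimal sketch:

    import math
    from scipy import stats

    N, n_predictors, n_groups = 15, 4, 3
    # Wilks' lambda of each function = within-groups SS / total SS (Table 13.6)
    lambdas = [12.00 / 45.70, 12.00 / 21.84]

    # Test all functions together, then the functions left after removing the first
    for removed in range(len(lambdas)):
        wilks = math.prod(lambdas[removed:])
        chi_square = -(N - 1 - (n_predictors + n_groups) / 2) * math.log(wilks)
        df = (n_predictors - removed) * (n_groups - removed - 1)
        p = stats.chi2.sf(chi_square, df)
        print(round(wilks, 3), round(chi_square, 2), df, round(p, 3))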
Interpreting discriminant functions
When interpreting discriminant functions, it may not be appropriate to use the unstandardized canonical discriminant function coefficients as shown in
Table 13.3, particularly if the standard deviations of the predictors vary substantially. The discriminant functions are generally interpreted in terms of the absolute size of either the standardized coefficients shown in Table 13.8 or the pooled within-groups correlations between the predictors and the standardized discriminant functions displayed in Table 13.9. The results of the two methods differ somewhat.
Table 13.8  SPSS output of the standardized canonical discriminant function coefficients

Standardized Canonical Discriminant Function Coefficients

              Function
              1         2
Anxious     −.421      .753
Restless    −.391      .299
Depressed    .646      .101
Hopeless     .681      .292
Table 13.9  SPSS output of the pooled within-groups correlations between the predictors and the standardized canonical discriminant functions

Structure Matrix

              Function
              1          2
Depressed    .719*      .330
Hopeless     .538*      .537
Anxious     −.234       .849*
Restless    −.180       .569*

Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.
* Largest absolute correlation between each variable and any discriminant function.
For the first discriminant function, the largest standardized coefficient is for feeling hopeless (.681), followed by feeling depressed (.646), while the largest correlation is for feeling depressed (.719), followed by feeling hopeless (.538). For the second discriminant function, feeling anxious has the largest standardized coefficient (.753) and also the strongest correlation (.849). To obtain a standardized coefficient, we multiply the unstandardized coefficient by the pooled within-groups standard deviation of that predictor, which is the square root of the corresponding diagonal element of the pooled within-groups covariance matrix. For example, for feeling anxious this standard deviation is 1.000 and the unstandardized coefficient is −.421 (see Table 13.3), so the standardized coefficient remains −.421. The computation of the pooled within-groups correlations is less easily derived and so will not be described.
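The link between Tables 13.3 and 13.8 can be checked numerically from the raw data: each unstandardized coefficient of the first function is multiplied by the pooled within-groups standard deviation of its predictor. A sketch, continuing from the earlier data frame (small differences from Table 13.8 are again due to rounding):

    predictors = ['anxious', 'restless', 'depressed', 'hopeless']

    # Pooled within-groups variance = sum of within-group sums of squares / (N - groups)
    within_ss = sum(
        ((data.loc[data['group'] == g, predictors]
          - data.loc[data['group'] == g, predictors].mean()) ** 2).sum()
        for g in (1, 2, 3))
    pooled_sd = (within_ss / (15 - 3)) ** 0.5

    # Unstandardized coefficients of the first function (Table 13.3)
    unstandardized = pd.Series({'anxious': -0.421, 'restless': -0.397,
                                'depressed': 0.723, 'hopeless': 0.834})
    print((unstandardized * pooled_sd).round(3))  # cf. Table 13.8, function 1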
Classification
The value of a discriminant analysis can be judged by the percentage of cases it correctly identifies as belonging to their group compared with the percentage that would be expected by chance alone. In our example, where there is the same number of people in each of the three groups, the probability of correctly identifying group membership on the basis of chance alone is .33, or 33 per cent. The ways in which the results of a discriminant analysis can be used to classify cases into groups are too complicated to discuss in detail here (a simple nearest-centroid rule is sketched after Table 13.11). The group membership predicted by SPSS is shown in Table 13.10 together with the actual membership of the 15 cases. From this table, we can see that three cases (2, 4 and 9) have been misclassified, giving an overall correct identification rate of 80 per cent (12/15 × 100 = 80).

The information in Table 13.10 can be converted into the classification table shown in Table 13.11, which displays the number and the percentage of cases that have been correctly and incorrectly identified for each group. The number and percentage of correctly identified cases for each group are presented in the diagonal cells of the table. For example, 3, or 60 per cent (3/5 × 100 = 60), of the normal group are correctly identified as normal. The number and percentage of incorrectly identified cases for each group are shown in the off-diagonal cells. For example, 1, or 20 per cent (1/5 × 100 = 20), of the normal group are incorrectly identified as being in the anxious group. From this table, we can see that the percentage of correct identification is highest for the depressed group and lowest for the normal group.

The number of cases in the groups is often not equal in discriminant analysis. The procedure for assigning cases to groups may need to take this into account, and the percentage of correct classifications expected by chance will have to be adjusted.
Table 13.10  Actual and predicted group membership for 15 cases

            Group membership
Cases     Actual     Predicted
 1          1            1
 2          1            3
 3          1            1
 4          1            2
 5          1            1
 6          2            2
 7          2            2
 8          2            2
 9          2            1
10          2            2
11          3            3
12          3            3
13          3            3
14          3            3
15          3            3
Table 13.11  SPSS output of the classification table

Classification Results(a)

                                 Predicted Group Membership
                    GROUP        Normal   Anxious   Depressed    Total
Original   Count    Normal          3        1          1          5
                    Anxious         1        4          0          5
                    Depressed       0        0          5          5
           %        Normal       60.0     20.0       20.0       100.0
                    Anxious      20.0     80.0         .0       100.0
                    Depressed      .0       .0      100.0       100.0

a. 80.0% of original grouped cases correctly classified.
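Although the classification procedure itself is not described in detail here, one simple rule, assigning each case to the group whose centroid in Table 13.5 is closest in the space of the two discriminant functions, can be sketched as follows. For these data it yields the same predictions as Table 13.10 (case 2 is a near tie), but it is offered as an illustration rather than as the exact procedure SPSS uses.

    # Group centroids from Table 13.5: (function 1, function 2)
    centroids = {1: (-0.752, -1.071), 2: (-1.341, 0.887), 3: (2.092, 0.184)}

    def nearest_group(f1, f2):
        # Squared Euclidean distance to each centroid; return the closest group
        return min(centroids, key=lambda g: (f1 - centroids[g][0]) ** 2
                                            + (f2 - centroids[g][1]) ** 2)

    data['predicted'] = [nearest_group(f1, f2) for f1, f2
                         in zip(data['function1'], data['function2'])]
    print((data['predicted'] == data['group']).mean())    # .80, i.e. 80 per cent correct
    print(pd.crosstab(data['group'], data['predicted']))  # cf. Table 13.11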
Suppose that we have 2, 5 and 8 cases in groups 1, 2 and 3, respectively, as shown in Table 13.12. The probability of being in these groups will be .13 (2/15 = .13), .33 (5/15 = .33) and .53 (8/15 = .53), respectively. The number of cases in each group expected to be correctly classified by chance will be 0.26 (.13 × 2 = 0.26), 1.65 (.33 × 5 = 1.65) and 4.24 (.53 × 8 = 4.24), respectively. The percentage of cases expected to be correctly classified by chance will therefore be about 41 [100 × (0.26 + 1.65 + 4.24)/15 = 41.00].
Table 13.12  Calculating the proportion of correctly identified cases expected by chance

Group   Frequency   Probability   Chance-correct frequency
1           2           .13                0.26
2           5           .33                1.65
3           8           .53                4.24
Sum        15           .99                6.15
                                     6.15/15 = .41
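The same calculation can be written in a couple of lines; note that Table 13.12 arrives at 6.15 rather than the exact 6.20 because the probabilities are rounded to two decimal places before being multiplied by the frequencies.

    frequencies = {1: 2, 2: 5, 3: 8}     # the unequal group sizes of Table 13.12
    total = sum(frequencies.values())    # 15

    # Expected number correct by chance = sum of (group proportion x group size)
    expected_correct = sum((n / total) * n for n in frequencies.values())
    print(round(expected_correct, 2))          # 6.2 (6.15 in Table 13.12 after rounding)
    print(round(expected_correct / total, 2))  # .41, i.e. about 41 per cent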
When the numbers of cases in the groups are small and unequal, cases may be misclassified if the within-groups covariance matrices are also unequal or heterogeneous. Equality or homogeneity of the within-groups covariance matrices can be assessed with Box’s M test. If this test is not significant, the covariances can be treated as equal. If the test is significant, the covariances are unequal or heterogeneous; it may be possible to make them less unequal by transforming the values of the predictors, for example by taking their square root or logarithm. Box’s M test for our example is displayed in Table 13.13, where it can be seen that the significance level of .578 is greater than .05, so the test is not significant.
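Box’s M itself can be computed from the separate group covariance matrices with the usual formula, M = (N − g) ln|S_pooled| − Σ (n_i − 1) ln|S_i|. The sketch below follows that formula and should come close to the M value in Table 13.13, although the approximate F statistic that SPSS reports involves further scaling constants that are not reproduced here.

    import numpy as np

    predictors = ['anxious', 'restless', 'depressed', 'hopeless']
    samples = [data.loc[data['group'] == g, predictors].to_numpy() for g in (1, 2, 3)]
    N, g = 15, 3

    # Separate and pooled within-groups covariance matrices (denominator n - 1)
    covariances = [np.cov(x, rowvar=False) for x in samples]
    pooled = sum((len(x) - 1) * c for x, c in zip(samples, covariances)) / (N - g)

    # Box's M compares the log determinant of the pooled matrix with those of the groups
    box_m = (N - g) * np.log(np.linalg.det(pooled)) - sum(
        (len(x) - 1) * np.log(np.linalg.det(c)) for x, c in zip(samples, covariances))
    print(round(box_m, 2))  # compare with Box's M in Table 13.13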
Reporting the results

As usual, the form in which the results of an analysis are reported will depend on its main purpose. One succinct way of writing up the results of the analysis described in this chapter is as follows: ‘A direct discriminant analysis was carried out using the four predictors of feeling anxious, restless, depressed and hopeless to determine whether people were
Table 13.13  SPSS output of Box’s M test of equality of covariance matrices

Test Results

Box’s M            37.535
F       Approx.      .907
        df1            20
        df2       516.896
        Sig.         .578

Tests null hypothesis of equal population covariance matrices.
diagnosed as normal, anxious or depressed. Two discriminant functions were calculated, explaining about 77 per cent and 23 per cent of the variance, respectively. Wilks’ lambda was significant for the combined functions (χ²₈ = 20.33, p