Categorical Variables in
Developmental Research
Categorical Variables in Developmental Research
Methods of Analysis
Edited by
Alexander von Eye Michigan State University East Lansing, Michigan
Clifford C. Clogg* Pennsylvania State University, University Park, Pennsylvania (*Deceased)
ACADEMIC PRESS
San Diego New York Boston
London
Sydney Tokyo Toronto
This book is printed on acid-free paper. Copyright © 1996 by ACADEMIC PRESS, INC. All Rights Reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Academic Press, Inc., A Division of Harcourt Brace & Company, 525 B Street, Suite 1900, San Diego, California 92101-4495
United Kingdom Edition published by Academic Press Limited, 24-28 Oval Road, London NW1 7DX

Library of Congress Cataloging-in-Publication Data
Categorical variables in developmental research : methods of analysis / edited by Alexander von Eye, Clifford C. Clogg.
p. cm.
Includes bibliographical references and index.
ISBN 0-12-724965-6
1. Psychometrics. 2. Psychology--Statistical methods. 3. Categories (Mathematics) I. Eye, Alexander von. II. Clogg, Clifford C.
BF39.C294 1995
155'.072--dc20  95-21950

PRINTED IN THE UNITED STATES OF AMERICA
95 96 97 98 99 00 QW 9 8 7 6 5 4 3 2 1
Contents
Contributors xi Preface xiii Acknowledgments xvii In Memoriam xix
PART 1 Measurement and Repeated Observations of Categorical Data
1. Measurement Criteria for Choosing among Models with Graded Responses
David Andrich
1. Introduction 3 2. Measurement Criteria for a Model for Graded Responses 4
3. Models for Graded Responses 11  4. Examples 28  5. Summary and Discussion 32  References 34
2. Growth Modeling with Binary Responses
Bengt O. Muthén
1. Introduction 37  2. Conventional Modeling and Estimation with Binary Longitudinal Data 39  3. More General Binary Growth Modeling 42  4. Analyses 46  5. Conclusions 52  References 52
3. Probit Models for the Analysis of Limited-Dependent Panel Data
Gerhard Arminger
1. Introduction 55  2. Model Specification 56  3. Estimation Method 61  4. Analysis of Production Output from German Business Test Data 68  5. Conclusion 72  References 73
PART 2 Catastrophe Theory
4. Catastrophe Analysis of Discontinuous Development
Han L. J. van der Maas and Peter C. M. Molenaar
1. Introduction 77  2. Catastrophe Theory 79  3. Issues in Conservation 80  4. The Cusp Model 84  5. Empirical Studies 89  6. Discussion 101  References 104
5. Catastrophe Theory of Stage Transitions in Metrical and Discrete Stochastic Systems
Peter C. M. Molenaar and Pascal Hartelman
1. Introduction 107 2. Elementary Catastrophe Theory 111 3. Catastrophe Theory for Metrical Stochastic Systems 115 4. Catastrophe Theory for Discrete Stochastic Systems 125 5. General Discussion and Conclusion 128 References 129
PART 3 Latent Class and Log-Linear Models
6. Some Practical Issues Related to the Estimation of Latent Class and Latent Transition Parameters
Linda M. Collins, Penny L. Fidler, and Stuart E. Wugalter
1. Introduction 133 2. Methods 137 3. Discussion 144 References 146
7. Contingency Tables and Between-Subject Variability
Thomas D. Wickens
1. Introduction 147  2. Association Variability 148  3. The Simulation Procedure 150  4. Tests Based on Multinomial Variability 152  5. Tests Based on Between-Subject Variability 156  6. Procedures with Two Types of Variability 161  7. Discussion 163  References 167
8. Assessing Reliability of Categorical Measurements Using Latent Class Models
Clifford C. Clogg and Wendy D. Manning
1. Introduction 169  2. The Latent Class Model: A Nonparametric Method of Assessing Reliability 171  3. Reliability of Dichotomous Measurements in a Prototypical Case 174  4. Assessment of Reliability by Group or by Time 178  5. Conclusion 181  References 182
9. Partitioning Chi-Square: Something Old, Something New, Something Borrowed, but Nothing BLUE (Just ML)
David Rindskopf
1. Introduction 183  2. Partitioning Independence Models 184  3. Analyzing Change and Stability 190  4. How to Partition Chi-Square 195  5. Discussion 199  References 201
10. Nonstandard Log-Linear Models for Measuring Change in Categorical Variables
Alexander von Eye and Christiane Spiel
1. Introduction 203  2. Bowker's Test 204  3. Log-Linear Models for Axial Symmetry 205  4. Axial Symmetry in Terms of a Nonstandard Log-Linear Model 206  5. Group Comparisons 209  6. Quasi-Symmetry 210  7. Discussion 213  References 214
11. Application of the Multigraph Representation of Hierarchical Log-Linear Models
H. J. Khamis
1. Introduction 215  2. Notation and Review 216  3. The Generator Multigraph 217  4. Maximum Likelihood Estimation and Fundamental Conditional Independencies 219  5. Examples 222  6. Summary 228  References 229
PART 4 Applications
12. Correlation and Categorization under a Matching Hypothesis
Michael J. Rovine and Alexander von Eye
1. Introduction 233  2. An Interesting Plot 234  3. The Binomial Effect Size Display 236  4. An Organizing Principle for Interval-Level Variables 237  5. Definition of the Matching Hypothesis 237  6. A Data Example 238  7. Correlation as a Count of Matches 241  8. Correlation as a Count of How Many Fall within a Set Range 243  9. Data Simulation 244  10. Building Uncertainties from Rounding Error into the Interpretation of a Correlation 246  11. Discussion 247  References 248
13. Residualized Categorical Phenotypes and Behavioral Genetic Modeling Scott L. Hershberger
1. The Problem 249  2. Weighted Least-Squares Estimation 250  3. Proportional Effects Genotype-Environment Correlation Model 253  4. Method 256  5. Results 258  6. Conclusions 271  References 273
Index 275
Contributors
Numbers in parentheses indicate the pages on which the authors' contributions begin.
David Andrich (3) School of Education, Murdoch University, Murdoch, Western Australia 6150, Australia
Gerhard Arminger (55) Department of Economy, Bergische Universität, Wuppertal, Germany
Clifford C. Clogg (169) Departments of Sociology and Statistics and Population Research Institute, Pennsylvania State University, University Park, Pennsylvania 16802
Linda M. Collins (133) The Methodology Center and Department of Human Development and Family Studies, Pennsylvania State University, University Park, Pennsylvania 16802
Penny L. Fidler (133) J. P. Guilford Laboratory of Quantitative Psychology, University of Southern California, Los Angeles, California 90089
Pascal Hartelman (107) Faculty of Psychology, University of Amsterdam, The Netherlands
Scott L. Hershberger (249) Department of Psychology, University of Kansas, Lawrence, Kansas 66045
H. J. Khamis (215) Statistical Consulting Center and Department of Community Health, School of Medicine, Wright State University, Dayton, Ohio 45435
Wendy D. Manning (169) Department of Sociology, Bowling Green State University, Bowling Green, Ohio 43403
Peter C. M. Molenaar (77, 107) Faculty of Psychology, University of Amsterdam, The Netherlands
Bengt O. Muthén (37) Graduate School of Education, University of California, Los Angeles, Los Angeles, California 90095
David Rindskopf (183) Educational Psychology, City University of New York Graduate School, Chestnut Ridge, New York 10977
Michael J. Rovine (233) Human Development and Family Studies, Pennsylvania State University, University Park, Pennsylvania 16802
Christiane Spiel (203) Department of Psychology, University of Vienna, Vienna, Austria
Han L. J. van der Maas (77) Department of Developmental Psychology, University of Amsterdam, The Netherlands
Alexander von Eye (203, 233) Department of Psychology, Michigan State University, East Lansing, Michigan 48824
Thomas D. Wickens (147) Department of Psychology, University of California, Los Angeles, Los Angeles, California 90095
Stuart E. Wugalter (133) J. P. Guilford Laboratory of Quantitative Psychology, University of Southern California, Los Angeles, California 90089
Preface
Categorical variables come in many forms. Examples include classes that are naturally categorical, such as gender or species; groups that have been formed by definition, such as classes of belief systems or classifications into users versus abusers of alcohol; groups that result from analysis of data, such as latent classes; classes that reflect grading of, for instance, performance or age; or categories that were formed for the purpose of discriminating, as, for instance, nosological units. Developmental researchers have been reluctant to include categorical variables in their studies. The main reason for this reluctance is that it often is thought difficult to include categorical variables in plans for data analysis. Methods and computer programs for continuous variables seem to be more readily accessible. This volume presents methods for analysis of categorical data in developmental research. Thus, it fills the void perceived by many, by providing
information, in a very understandable fashion, about powerful methods for analysis of categorical data. These methods go beyond the more elementary tabulation methods still in widespread use.

This volume covers a broad range of methods, concepts, and approaches. It is subdivided into the following four sections:
1. Measurement and Repeated Observations of Categorical Data
2. Catastrophe Theory
3. Latent Class and Log-Linear Models
4. Applications

The first section, Measurement and Repeated Observations of Categorical Data, contains three chapters. The first of these, by David Andrich, discusses measurement criteria for choosing among models with graded responses. Specifically, this chapter is concerned with the important issues of criteria for measurement and choice of a model that satisfies these criteria. Proper selection of a measurement model ensures that measurement characteristics can be fully exploited for data analysis.

The second chapter in the first section is by Bengt Muthén. Titled Growth Modeling with Binary Responses, it describes methods for analysis of individual differences in growth or decline. It presents modeling with random intercepts and slopes when responses are binary. The importance of this paper lies in its demonstration of how to allow for restrictions on thresholds across time and differences across time in the variances of the underlying variables.

The third chapter in this section, by Gerhard Arminger, is on Probit Models for the Analysis of Limited-Dependent Panel Data. It presents methods for analysis of equidistant panel data. Such data often involve metric and nonmetric variables that may be censored metric, dichotomous, ordered, and unordered categorical. The chapter presents an extension of Heckman's (1981) models. The new methods allow researchers to analyze censored metric and ordered categorical variables and any blend of such variables using general threshold models.

The second section of this volume contains two chapters on Catastrophe Theory. The first of these chapters, authored by Han van der Maas and Peter Molenaar, is Catastrophe Analysis of Discontinuous Development. This chapter presents results from application of catastrophe theory to the study of children's acquisition of the skill of conservation. This skill, well-known since Piaget's investigations, is a very important step in a child's cognitive development. However, proper methodological analysis has been elusive. This paper presents methods and results at both the individual and the aggregate level.

The second chapter in this section, by Peter Molenaar and Pascal Hartelman, addresses issues of Catastrophe Theory of Stage Transitions in Metrical and Discrete Stochastic Systems. This paper is conceptual in nature. It first gives an outline of elementary catastrophe theory and catastrophe theory for stochastic metrical systems. The chapter presents a principled approach to dealing with
problems from Cobb's approach and introduces an approach for dealing with catastrophe theory of discrete stochastic systems. The importance of this chapter lies in the new course it charts for analysis of both continuous and categorical developmental change processes using catastrophe theory.

The third section covers Latent Class and Log-Linear Models. Because of the widespread use of these models, they were allotted more space in this volume. The section contains six chapters. The first chapter, by Linda Collins, Penny Fidler, and Stuart Wugalter, addresses Some Practical Issues Related to Estimation of Latent Class and Latent Transition Parameters. Both of the issues addressed are of major practical relevance. The first concerns estimability of parameters for latent class and latent transition models when sample sizes are small. The second issue concerns the calculation of standard errors. The paper presents computer simulations that yielded very encouraging results.

The second chapter in this section, by Thomas Wickens, covers issues of Contingency Tables and Between-Subject Variability. Specifically, this chapter addresses problems that arise when between-subject variability prevents researchers from aggregating over subjects. Like the chapter before, this chapter presents simulation studies that show that various statistical tests differ in Type-I error rates and power when there is between-subject variability. The importance of this contribution lies in the presentation of recommendations for analysis of categorical data with between-subject variability.

The third chapter in this section, by Clifford Clogg and Wendy Manning, is Assessing Reliability of Categorical Measurements Using Latent Class Models. This paper addresses the important topic of reliability assessment in categorical variables. The authors propose using the framework of the latent class model for assessing reliability of categorical variables. The model can be applied without assuming sufficiency of pairwise correlations and without assuming a special form for the underlying latent variable. The importance of this paper lies in that it provides a nonparametric approach to reliability assessment.

The fourth chapter, by David Rindskopf, is Partitioning Chi-Square: Something Old, Something New, Something Borrowed, but Nothing BLUE (Just ML). Partitioning chi-square is performed to identify reasons for departure from independence models. Rindskopf introduces methods for partitioning the likelihood chi-square. These methods avoid problems with the Lancaster methods of partitioning the Pearson chi-square. They are exact without needing complex adjustment formulas. Change-related hypotheses can be addressed using Rindskopf's methods.

The fifth chapter, by Alexander von Eye and Christiane Spiel, is Extending the Bowker Test for Symmetry Using Nonstandard Log-Linear Models for Measuring Change in Categorical Variables. The chapter presents three ways to formulate tests of axial symmetry in square cross-classifications: the Bowker test, standard log-linear models, and nonstandard log-linear models.
The advantage of the latter is that one can devise designs that allow one to simultaneously test symmetry and other developmental hypotheses.

The sixth chapter in this section, by Harry Khamis, introduces readers to the Application of the Multigraph Representation of Hierarchical Log-Linear Models. The chapter focuses on hierarchical log-linear models. It introduces the generator multigraph and shows, using such graph-theoretic concepts as maximum spanning trees and edge cutsets, that the generator multigraph provides a useful tool for representing and interpreting hierarchical log-linear models.

The fourth section of this volume contains application-oriented chapters. The first of these, written by Michael Rovine and Alexander von Eye, reinterprets Correlation and Categorization under a Matching Hypothesis. The authors show the relationship between the magnitude of a correlation coefficient and the number of times cases fall into certain segments of the range of values. The relationship of the correlation coefficient with the binomial effect size display is shown.

The second chapter of this section, contributed by Scott Hershberger, discusses methods for Residualized Categorical Phenotypes and Behavioral Genetic Modeling. The paper presents methods for behavior genetic modeling of dichotomous variables that describe dichotomous phenotypes.
Alexander von Eye Clifford C. Clogg
Acknowledgments
We are indebted to many people who supported us during the production phase of this book. First there is the Dean of the Pennsylvania State College of Health and Human Development, Gerald McClearn. We thank him for his generous support. We thank Tina M. Meyers for help and support, particularly in phases of moves and transition. We thank a number of outside reviewers for investing time and effort to improve the quality of the chapters. Among them are Holger Weßels of the NIH, Ralph Levine of MSU, and Phil Wood of the University of Missouri, who all read and commented on several chapters. We thank the authors of this volume for submitting excellent chapters and for responding to our requests for changes in a very professional fashion. We thank Academic Press, and especially Nikki Fine, for their interest in this project and for smooth and flexible collaboration during all phases of production. Most of all we thank our families, the ones we come from and the ones we live in. They make it possible for us to be what we are.
In Memoriam
Clifford C. Clogg died on May 7, 1995. In Cliff we all lose a caring friend, a dedicated family man, a generous colleague, and a sociologist and statistician of the highest caliber. Cliff was a person with convictions, heart, and an incredibly sharp mind; he was a person to admire and to like. We needed him then and we still do now. The manuscripts for this book were submitted to the publisher two weeks before this tragedy. This includes the preface, which I left unchanged. Alexander von Eye
PART 1
Measurement and Repeated Observations of Categorical Data
1. Measurement Criteria for Choosing among Models with Graded Responses
David Andrich
Murdoch University Western Australia
1. INTRODUCTION

It is generally accepted that measurement has been central to the advancement of empirical physical science. The prototype of measurement is the use of an instrument to map the amount of a property on a real line divided into equal intervals by thresholds sufficiently fine that their own width can be ignored; in its elementary form, this is understood readily by young school children. However, the function of measurement in science goes much deeper: it is central to the simultaneous definition of variables and the formulation of quantitative scientific theories and laws (Kuhn, 1961). Furthermore, when proper measurement has taken place, these laws take on a simple multiplicative structure (Ramsay, 1975). Although central to formulating physical laws, it is understood that measurement inevitably contains error. In expressions of deterministic theories, these errors are considered sufficiently small that they are ignored. In practice, the mean of independent repeated measurements, the variance of which is inversely
proportional to the number of measurements, can be taken to increase precision to a point where errors can indeed be ignored. It is also understood that instruments have operating ranges; in principle, however, the measurement of an entity is not a function of the operating range of any one instrument, but of the size of the entity. Graded responses of one kind or another are used in social and other sciences when no measuring instrument is available, and these kinds of graded responses mirror the prototype of measurement in important ways. First, the property is envisaged to be continuous, such as an ability to perform in some domain, or an intensity of attitude, or the degree of a disease; second, the continuum is partitioned into ordered adjacent (contiguous) intervals, usually termed categories, that correspond to the units of length on the continuum. In elementary treatments of graded responses, the prototype of measurement is followed closely in that the successive categories are simply assigned successive integers, and these are then treated as measurements. In advanced treatments, a model with a random component is formalized for the response and classification processes, the sizes of the intervals are not presumed equal, and the number of categories is finite. This chapter is concerned with criteria, and the choice of a model that satisfies these criteria, so that the full force of measurement can be exploited with graded responses of this kind. The criteria are not applied to ordered variables where instruments for measurement already exist, such as age, height, income expressed in a given currency, and the like, which in a well-defined sense already meet the criteria. Although one new mathematical result is presented, this chapter is also not about statistical matters such as estimation, fit, and the like, which are already well established in the literature. Instead, it is about looking at a relatively familiar statistical situation from a relatively nonstandard perspective of measurement in science. Whether a variable is defined through levels of graded responses that characterize more or less of a property, or whether it is defined through the special case of measurement in which the accumulation of successive amounts of the property can be characterized in equal units, central to its definition is an understanding of what constitutes more or less of the property and what brings about changes in the property. It is in expressing this relationship between the definition and changes in terms of a model for measurement that generalizes to graded responses, and in articulating a perspective of empirical enquiry that backs up this expression, that this chapter contributes to the theme of the analysis of categorical variables in developmental research.
2. MEASUREMENT CRITERIA FOR A MODEL FOR GRADED RESPONSES

In this section, three features of the relationship between measurement and theory are developed as criteria for models for measurement: first, the dominant direc-
tion of the relationship between theory and measurement; second, the structure of the models that might be expected to apply when measurements have been used; and third, the invariance of the measurement under different partitions of the continuum. These are termed measurement criteria. It is stressed that in this argument, the criteria established are independent and a priori to any data to which they might apply. In effect, the model chosen is a formal rendition of the criteria; it expresses in mathematical terms the requirements to which the graded responses must conform if they are to be like measurements, and therefore the model itself must exhibit these criteria. Thus the model is not a description of any set of data, although it is expected that data sets composed of graded responses can be made to conform to the model, and that even some existing ones may do so. Moreover, the criteria are not assumptions about any set of data that might be analyzed by the model. If data collected in the form of graded responses do not accord with the model, then they do not meet the criteria embedded in the model, but this will not be evidence against the criteria or the model. Thus it is argued that graded responses, just like measurements, should subscribe to certain properties that can be expressed in mathematical terms, and also that the data should conform to the chosen model and not the other way around; that is, the model should not be chosen to summarize the data. This position may seem nonstandard, and because it is an aspect of a different perspective, it has been declared at the outset. It is not, however, novel, having been presented in one form or another by Thurstone (1928), Guttman (1950), and Rasch (1960/1980), and reinforced by Duncan (1984) and Wright (1984).
2.1. Theory Precedes Measurement Thomas Kuhn is well known for his theory of scientific revolutions (Kuhn, 1970). In this chapter, I will invoke a part of his case, apparently much less known than the revolutionary theory itself, concerning the function of measurement in science (Kuhn, 1961) in which he stands the relationship between measurement and theory as traditionally perceived on its head: In text books, the numbers that result from measurements usually appear as the archetypes of the "irreducible and stubborn facts" to which the scientist must, by struggle, make his theories conform. But scientific practice, as seen through the journal literature, the scientist often seems rather to be struggling with the facts, trying to force them to conformity with a theory he does not doubt. Quantitative facts cease to seem simply "the given." They must be fought for and with, and in this fight the theory with which they are to be compared proves the most potent weapon. Often scientists cannot get numbers that compare well with theory until they know what numbers they should be making nature yield. (Kuhn, 1961, p. 171)
Kuhn (1961) elaborates the . . . "paper's most persistent thesis: The road from scientific law to scientific measurement can rarely be traveled in the reverse direction" (p. 219, emphasis in original). If this road can be seldom traveled in the physical sciences, then it is unlikely to be traveled in the social ' sciences. Yet, I suggest that social scientists attempt to travel this route most of the time by modeling available data, that is, by trying to find models that will account for the data as they appear. In relentlessly searching for statistical models that will account for the data as given, and finding them, the social scientist will eschew one of the main functions of measurement, the identification of anomalies: To the extent that measurement and quantitative technique play an especially significant role in scientific discovery, they do so precisely because, by displaying serious anomaly, they tell scientists when and where to look for a new qualitative phenomenon. To the nature of that phenomenon, they usually provide no clues. (Kuhn, 1961, p. 180) And this is because When measurement departs from theory, it is likely to yield mere numbers, and their very neutrality make them particularly sterile as a source of remedial suggestions. But numbers register the departure from theory with an authority and finesse that no qualitative technique can duplicate, and that departure often is enough to start a search. (Kuhn, 1961, p. 180) Although relevant in general, these remarks are specifically relevant to the role of measurement and therefore to the role that graded responses can have. Measurement formalizes quantitatively and efficiently a theoretical concept that can be summarized as a variable in terms of degree, similar in kind but greater or lesser in intensity, in terms of more or less, greater or smaller, stronger, or weaker, better or worse, and so on, that is to be studied empirically. If it is granted that the variable is an expression of a theory, that is, that it is an expression of what constitutes more or less, greater or smaller, and so on, according to the theory, then when studied empirically, it becomes important to invoke another principle of scientific enquiry, that of falsifiability (Popper, 1961). Although Popper and Kuhn have disagreed on significant aspects of the philosophy of science, the idea that any attempt at measurement arises from theory, and that measurement may provide, sooner or later, evidence against the theory, is common in both philosophies. The implication of this principle to the case of graded responses is that even the operational ordering of the categories, which defines the meaning of more or less of the variable, should be treated as a hypothesis. Clearly, when a researcher decides on the operational definition of the ordered categories, there is a strong conviction about their ordering, and any chosen model should reflect this. However, if the categories are to be treated as a hypothesis, then the model used to characterize the response process must itself
also have the ordering as a hypothesis. Thus, the model should not provide an ordering irrespective of the data but should permit evidence to the contrary to arise: If the data refute the ordering, then an anomaly that must be explained is revealed. The explanation might involve no more than identifying a coding error, although most likely it will clarify the theory: however, whatever its source, the anomaly must be explained.
2.2. Fundamental Measurement and Laws in Physical Science Because of the strong analogy to be made between measurement in the physical sciences and the construction and operation of graded responses, any model chosen should be consistent with how measurements operate in the physical sciences. As already noted, graded responses differ from traditional measurements in that (a) they are finite in number, (b) the categories (units) are not equal in size, and (c) assignment of an entity to a category contains an uncertainty that must be formalized. Therefore, the model should specialize to the case in which the number of categories from an origin is, in principle, not finite, in which the sizes of the categories are equal (a constant unit), and in which measurement precision is increased to the point where the random component can be ignored. In general terms, the model should specialize to fundamental measurement. Fundamental measurement is an abstraction from the idea of concatenation of entities in which the successive addition of measurements corresponds to the concatenation (Wright, 1985, 1988). Measurements of mass exemplify fundamental measurement, in that if two entities are amalgamated, then the amalgamated entity behaves with respect to acceleration as a single body with a mass the sum of the masses of the original entities. It seems that because successive concatenations of a unit can be expressed as a multiplication of the unit, measurement leads to laws that have a multiplicative structure among the variables: "Throughout the gigantic range of physical knowledge, numerical laws assume a remarkably simple form provided fundamental measurement has taken place" (Ramsay, 1975, p. 262) and "Virtually all the laws of physics can be expressed numerically as multiplications or divisions of measurements" (p. 258). Thus it might be expected that any model that is to form a criterion for measurement would have this multiplicative structure, although this alone would not be sufficient to meet the criterion: it would still have to be demonstrated that when specialized to the case of physical measurement, the model leads to increased precision as the units are made smaller, and that the model reflects the process of concatenation.
2.3. Invariance of Location across an Arbitrary Partitioning of the Continuum

Because an entity can be measured with different units, it is expected that the location of any entity on the continuum should be invariant when measured in
different units, although with smaller units greater precision is expected. This difference in units corresponds to different partitions of the continuum. Therefore, this invariance of location across different partitions of the continuum should be met by graded responses (of which measurement is a special case).
2.4. Summary of Measurement Criteria

In summary, the measurement criteria for a model for graded responses are
1. it should contain the ordering of the categories as a falsifiable hypothesis,
2. it should specialize to fundamental measurement, and
3. the location of an entity should be invariant under different partitions of the continuum.
It should not come as a surprise that the preceding criteria are in some sense related, but they have been listed separately to focus on the salient features of measurement and its function. A model that satisfies these criteria provides a special opportunity for learning about variables in the form of graded responses. In the next section, these criteria are taken in reverse order, and the model that satisfies the first criterion by definition is presented first and then is shown to satisfy the other two criteria as well.
2.5. Statistical Criteria

Partly because contrasts are helpful in any exposition, and partly because there is historically an alternate model for graded responses, this alternate model is outlined here. This model, too, has been presented as satisfying certain principles of scientific enquiry (McCullagh, 1980, 1985) for graded responses. These criteria, in contrast to what have been referred to as measurement criteria, will be referred to as statistical criteria:
(i) If, as is usually the case, the response scale contains useful and scientifically important information such as order or factorial structure, the statistical model should take this information into account.
(ii) If the response categories are on an arbitrary but ordinal scale, it is nearly always appropriate to consider models that are invariant under the grouping of adjacent categories (McCullagh, 1980).
(iii) If sufficient care is taken in the design and execution of the experiment or survey, standard probability distributions such as the Poisson or multinomial may be applicable, but usually the data are overdispersed . . . (McCullagh, 1985, p. 39).
These statistical criteria seem compatible with the measurement criteria. However, it will be seen that the models chosen to meet each set of criteria are in
fact incompatible and serve different situations. It is in part to help understand the basis of the choices and the change in perspective entailed in choosing the former model ahead of the latter for graded responses that the measurement criteria have been set in a wider context of the relationship between data, theory, and models. The distinction made here between measurement and statistical criteria is compatible with the same kind of distinction made in Duncan and Stenbeck (1988).
3. MODELS FOR GRADED RESPONSES

The model that satisfies the criteria of invariance with graded responses arises from the Rasch (1961) class of models, and its properties have been elaborated by Andersen (1977) and Andrich (1978, 1985). The historically prior model is based on the work of Thurstone (Edwards & Thurstone, 1952) and has been further elaborated by Samejima (1969), Bock (1975), and McCullagh (1980, 1985).
3.1. Rasch Cumulative Threshold Model Rasch (1961) derived a class of models for measurement explicitly from the requirements of invariant comparisons. Because a measurement is the comparison of a number of units to a standard unit, and therefore the comparison of two measurements is indirectly a comparison of these comparisons with a standard, it is perhaps not surprising that the relationship can be taken in reverse, so that if conditions are imposed on comparisons, they may lead to measurement. Rasch's parsimonious criteria are the following. A. "The comparison between two stimuli should be independent of which particular individuals were instrumental for the comparison; and it should also be independent of which other stimuli within the considered class were or might also have been compared." B. "Symmetrically, a comparison between two individuals should be independent of which particular stimuli within the class considered were instrumental for comparison; and it should also be independent of which other individuals were also compared, on the same or on some other occasion (Rasch, 1961, p. 322)." The probabilistic models that satisfy these conditions have sufficient statistics for the parameters of the individuals and for the stimuli, which will be referred to in general now as entities and instruments, respectively. The existence of sufficient statistics means that the outcome space can be partitioned so that within the subspaces the distribution of responses depends only on either the
parameters of the instrument or the entity, but not on both, thus providing the required invariance of comparison, either among instruments or among entities. The model for the probability Pr{x_pi} of a response with respect to entity p and instrument i that satisfies these conditions of invariance can be expressed in different forms, one efficient form being

$$\Pr\{x_{pi}\} = \frac{1}{\gamma_{pi}} \exp\left(-\sum_{k=1}^{x_{pi}} \tau_{ki} + x_{pi}\beta_p\right), \qquad (1)$$

where $\gamma_{pi} = \sum_{x=0}^{m} \exp\left(-\sum_{k=1}^{x} \tau_{ki} + x\beta_p\right)$, $x_{pi} \in \{0, 1, \ldots, m\}$, and $-\infty < \beta_p, \tau_{ki} < \infty$. In the multiplicative metric, with $\xi_p = \exp(\beta_p)$ and $\omega_{ki} = \exp(\tau_{ki})$, the model takes the form

$$\Pr\{x_{pi}\} = \frac{1}{\gamma_{pi}} \frac{\xi_p^{x_{pi}}}{\prod_{k=0}^{x_{pi}} \omega_{ki}}, \qquad \xi_p > 0, \quad \omega_{ki} > 0, \qquad (4)$$

where now the parameters must be greater than 0. Thus in the model, and in the multiplicative metric, a value of 0 provides the natural origin, as $\tau_{0i} \equiv 0$, $\omega_{0i} \equiv 1$. Suppose that the first threshold in instrument i has the value $\omega_i$ as before, and then that each of the successive thresholds is an additional distance $\omega_i$ from the previous threshold. Thus the first threshold can be conceived of as one unit from the origin, with $\omega_i$ being the unit of measurement, just as in the prototype of measurement. Then the thresholds take on the values

$$\omega_{0i} = 1, \qquad \omega_{xi} = x\,\omega_i, \quad x = 1, \ldots, m, \qquad \text{and} \qquad \prod_{k=0}^{x_{pi}} \omega_{ki} = x_{pi}!\,\omega_i^{x_{pi}}.$$

Next, suppose that the number of categories (units) is not finite, but that in principle it may be any count of the number of units, this being determined by the amount of the property and not by the finite range of a particular instrument. Replacing $m$ by $\infty$ and $\prod_{k=0}^{x_{pi}} \omega_{ki}$ by $x_{pi}!\,\omega_i^{x_{pi}}$ in Equation (4) gives

$$\Pr\{x_{pi}\} = \frac{1}{\gamma_{pi}} \frac{(\xi_p/\omega_i)^{x_{pi}}}{x_{pi}!}, \qquad x_{pi} = 0, 1, 2, \ldots. \qquad (5)$$

However, with $\gamma_{pi} = \exp(\xi_p/\omega_i)$, Equation (5) is the Poisson distribution, giving, in the conventional mode of expression,

$$\Pr\{x_{pi}\} = e^{-\xi_p/\omega_i}\, \frac{(\xi_p/\omega_i)^{x_{pi}}}{x_{pi}!}. \qquad (6)$$

Thus, repeated measurements with error specialize to the Poisson distribution as a function of the location of the entity and the size of the unit of the instrument. This result in itself, novel and perhaps surprising, has many implications. Lest they detract from the point of the chapter, only two will be considered here, the precision of measurement as a function of the size of the unit and the addition of measurements in relation to concatenation of entities. These are central to the specialization of graded responses to the physical measure-
ment and can be studied by using only the well-known result that the mean and variance of a Poisson distribution are given by its parameter: $E[X_{pi}] = V[X_{pi}] = \xi_p/\omega_i$.
Suppose that, as in physical measurement, a unit $\omega_I = 1$ of instrument I is chosen as a standard. In this unit, $\xi_p/\omega_I = \xi_p$, giving $E[X_{pI}] = \xi_p$, and just as in physical measurement, any observed count of units exceeding the origin is taken immediately as an unbiased estimate of the location of the entity. Now, suppose that the units are reduced to $\omega_n = \omega_I/n$ in some instrument n, and that the count in these units is denoted by the random variable $X_{pn}$. Then in these units

$$E[X_{pn}] = \frac{\xi_p}{\omega_I/n} = n\xi_p.$$

Thus reducing the units to (1/n)th of an original unit gives n times the value in the new units. Let the values of the variable $X_{pn}$ be denoted by $\tilde{x}^{(I)}_{pn}$ when expressed in standard units of instrument I; then, if $x_{pn} \in \{0, 1, 2, 3, 4, \ldots\}$ in units of $\omega_n$, $\tilde{x}^{(I)}_{pn} \in \{0, \tfrac{1}{n}, \tfrac{2}{n}, \tfrac{3}{n}, \tfrac{4}{n}, \ldots\}$ in units of $\omega_I$. (For example, if the original units are in centimeters and the new ones are in millimeters, then measurements in millimeters when expressed in centimeters will be 1/10th their millimeter values.) Therefore, expressed in terms of the original unit $\omega_I$,

$$E[\tilde{X}^{(I)}_{pn}] = E\!\left[\frac{X_{pn}}{n}\right] = \frac{n\xi_p}{n} = \xi_p,$$

as required to show the invariance of location of entity p when measured with different instruments when the measurement is carried out in different units (but expressed in the same unit). In some sense this result is in fact rather trivial, but it is important because it highlights the need to distinguish between the effect of changing the unit in relation to the original unit in making the actual measurements and the effect of reexpressing the measurement in the new unit in terms of the original unit. In the former, the changed unit actually changes the parameter of the distribution, and therefore the probabilities, whereas in the latter, the parameter and the probabilities do not change, only the values of the measurements change into the original units. The same logic then can be used to establish the variance, and this is not trivial. Thus

$$V[\tilde{X}^{(I)}_{pn}] = V\!\left[\frac{X_{pn}}{n}\right] = \frac{V[X_{pn}]}{n^2} = \frac{n\xi_p}{n^2} = \frac{\xi_p}{n},$$
indicating that when the unit is (1/n)th of the original unit, the variance is also (1/n)th of the original variance when expressed in the original metric and therefore increases the precision in the location of $\xi_p$. Although beyond the scope of this chapter, it can be readily shown that if the measurements are Poisson distributed according to Equation (6), then the distribution of the mean of n independent measurements in a particular unit is identical to the distribution of the measurements in a unit (1/n)th the size of that unit. Thus taking n measurements and finding the mean to estimate the location has an identical effect as taking one measurement in a unit (1/n)th the size of the original unit. This result itself is rather impressive, but here the only point emphasized is that it is consistent with what is expected in measurements, namely, that a measurement in smaller units should be more precise than a measurement in larger units. Figure 2 shows the distribution of measurements of Equation (5) for an entity with $\xi_p = 1$ in standard units, and when the units are halved successively up to units of $\omega_I/8$. The figure clearly shows that the distribution also tends to the normal, and this arises because as the value of its parameter increases, the Poisson distribution tends to the normal. This is all as expected in physical measurement: When expressed in the original metric, the expected value remains constant, while the precision becomes greater as the unit is reduced and the distribution of measurements is expected to become normal. If the unit is made sufficiently small relative to the size of the entity measured, then it might be small enough to be ignored in a deterministic theory.

FIGURE 2. Distributions of measurements of an entity of size 1 and units starting with 1.0, then halved for each new set of measurements (units 1, 1/2, 1/4, 1/8; horizontal axis: measurement x; vertical axis: Pr{x}).

In addition to behaving as measurements with respect to their location, dispersion, and distribution in physical science, the model also incorporates the characteristic of concatenation. As is well known, the sum of two Poisson distributions is Poisson again, with the new parameter equal to the sum of parameters of the individual distributions. Suppose entities p and q with location values $\xi_p$ and $\xi_q$, respectively, are concatenated and then measured with instrument i with unit $\omega_i$. Then, from Equation (6) and some straightforward algebra,

$$\Pr\{x_{p+q} = x_p + x_q\} = e^{-(\xi_p + \xi_q)/\omega_i}\, \frac{[(\xi_p + \xi_q)/\omega_i]^{x_p + x_q}}{(x_p + x_q)!}, \qquad \text{with} \qquad E[x_{p+q}] = \frac{\xi_p + \xi_q}{\omega_i}, \qquad (7)$$
indicating that the value of the new concatenated entity is the sum of the values of the original entities and that an estimate of this value is the sum of the measurements of the original entities.
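The two results just described can be checked directly by simulation. The following sketch is not from the chapter; it is a minimal Python illustration with hypothetical values (entity sizes and a standard unit of 1) that compares the mean of n measurements in the standard unit with a single measurement taken in a unit 1/n of its size, and verifies that measurements of concatenated entities add.

```python
import numpy as np

rng = np.random.default_rng(0)

xi_p, xi_q = 1.0, 2.5     # hypothetical entity sizes (xi)
omega = 1.0               # standard unit of the instrument
n = 8                     # refinement factor: the new unit is omega / n
reps = 100_000

# Measurements in the standard unit: X ~ Poisson(xi / omega), as in Equation (6)
x_standard = rng.poisson(xi_p / omega, size=(reps, n))

# Mean of n measurements in the standard unit ...
mean_of_n = x_standard.mean(axis=1)

# ... versus one measurement in a unit omega/n, re-expressed in the standard unit
x_fine = rng.poisson(xi_p / (omega / n), size=reps) * (omega / n)

print(mean_of_n.mean(), x_fine.mean())   # both approximately xi_p = 1.0
print(mean_of_n.var(), x_fine.var())     # both approximately xi_p / n = 0.125

# Concatenation: the combined entity behaves as a single Poisson count with
# parameter (xi_p + xi_q) / omega, so the expected values add.
x_concat = rng.poisson((xi_p + xi_q) / omega, size=reps)
print(x_concat.mean())                   # approximately xi_p + xi_q = 3.5
```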
3.1.3. Falsification of the Ordering of the Categories

Rasch (1961) originally specialized a multidimensional version of a response model that satisfied criteria A and B and expressed it in the form equivalent to

$$\Pr\{x_{pi}\} = \frac{1}{\gamma_{pi}} \exp\{\kappa_{xi} + \phi_{xi}\beta_p\}, \qquad (8)$$

where the coefficient $\kappa_{xi}$ and scoring function $\phi_{xi}$ characterized each category x. Andersen (1977) showed that the model had sufficient statistics only if $\phi_{(x+1)i} - \phi_{xi} = \phi_{xi} - \phi_{(x-1)i}$ and that categories x and x + 1 could be combined only if $\phi_{(x+1)i} = \phi_{xi}$. Andrich (1978) constructed the $\kappa$'s and $\phi$'s in terms of the prototype of measurement. The steps in the construction, reviewed here with an emphasis on falsification of threshold order, are

1. assume in the first instance independent dichotomous decisions at each threshold x, x = 1, ..., m;
2. characterize the dichotomous decision by the model $\Pr\{y_{xpi} = 1\} = \exp[\alpha_{xi}(\beta_p - \tau_{xi})]/\eta_{xpi}$, $\Pr\{y_{xpi} = 0\} = 1/\eta_{xpi}$, where $y_{xpi}$ is a Bernoulli random variable in which $y_{xpi} = 1$ denotes threshold x is exceeded, $y_{xpi} = 0$ denotes threshold x is not exceeded, $\alpha_{xi}$ is the discrimination at threshold x, and $\eta_{xpi} = 1 + \exp \alpha_{xi}(\beta_p - \tau_{xi})$; and
3. restrict and renormalize the original outcome space $\Omega = \{(0,0,0,\ldots,0), (1,0,0,\ldots,0), \ldots, (1,1,1,\ldots,1)\}$ of all possible $2^m$ outcomes to the subspace $\Omega'$ of those outcomes which conform to the Guttman pattern consistent with the threshold order, according to $(0,0,0,\ldots,0)$ corresponding to $x_{pi} = 0$, $(1,0,0,\ldots,0)$ corresponding to $x_{pi} = 1$, $(1,1,0,\ldots,0)$ corresponding to $x_{pi} = 2$, and $(1,1,1,\ldots,1)$ corresponding to $x_{pi} = m$.

This restriction makes the final outcome at each threshold dependent on the location of all thresholds (Andrich, 1992). For example, consider the case of a continuum partitioned into two thresholds. Then $\Omega = \{(0,0), (1,0), (0,1), (1,1)\}$ and

$$\begin{aligned}
\Pr\{(0,0)\} &= (1)(1)/\eta_{1pi}\eta_{2pi},\\
\Pr\{(1,0)\} &= [\exp \alpha_{1i}(\beta_p - \tau_{1i})](1)/\eta_{1pi}\eta_{2pi},\\
\Pr\{(0,1)\} &= (1)[\exp \alpha_{2i}(\beta_p - \tau_{2i})]/\eta_{1pi}\eta_{2pi},\\
\Pr\{(1,1)\} &= [\exp \alpha_{1i}(\beta_p - \tau_{1i}) \exp \alpha_{2i}(\beta_p - \tau_{2i})]/\eta_{1pi}\eta_{2pi}.
\end{aligned}$$

However, the outcome must be in one of only three categories, and this is given by the subspace $\Omega' = \{(0,0), (1,0), (1,1)\}$, in which case a response in the first category $x_{pi} = 0$ implies failure at both thresholds, $x_{pi} = 1$ implies success at the first threshold and failure at the second, and $x_{pi} = 2$ implies success at both thresholds. The element (0,1) is excluded because it implies exceeding the second, more difficult threshold and failing the first, easier threshold, which violates the ordering of the thresholds. If the element (1,0) had been excluded and the element (0,1) included, then the intended ordering of the thresholds and direction of the continuum would have been reversed. The choice of elements which are legitimate, and therefore others which are inadmissible, is emphasized because it is a key step in defining the direction of the continuum and the interpretation of the application of the model. The probability of the outcomes conditional on the subspace $\Omega'$, and after some simplification, is given by

$$\begin{aligned}
\Pr\{(0,0) \mid \Omega'\} &= \Pr\{x_{pi} = 0\} = \frac{(1)(1)}{\gamma_{pi}},\\
\Pr\{(1,0) \mid \Omega'\} &= \Pr\{x_{pi} = 1\} = \frac{1}{\gamma_{pi}}\{\exp(-\alpha_{1i}\tau_{1i} + \alpha_{1i}\beta_p)\},\\
\Pr\{(1,1) \mid \Omega'\} &= \Pr\{x_{pi} = 2\} = \frac{1}{\gamma_{pi}}\{\exp(-\alpha_{1i}\tau_{1i} - \alpha_{2i}\tau_{2i} + (\alpha_{1i} + \alpha_{2i})\beta_p)\}.
\end{aligned}$$

Defining the successive elements of the Guttman pattern $\{(0,0), (1,0), (1,1)\}$ by the integer random variable X, $x \in \{0,1,2\}$, and generalizing, gives

$$\Pr\{x_{pi}\} = \frac{1}{\gamma_{pi}} \exp\{\kappa_{xi} + \phi_{xi}\beta_p\},$$

which is Equation (8), in which $\phi_{xi}$ and $\kappa_{xi}$ are identified as

$$\phi_{0i} \equiv 0, \quad \phi_{xi} = \alpha_{1i} + \alpha_{2i} + \cdots + \alpha_{xi} \qquad \text{and} \qquad \kappa_{0i} \equiv 0, \quad \kappa_{xi} = -\alpha_{1i}\tau_{1i} - \alpha_{2i}\tau_{2i} - \cdots - \alpha_{xi}\tau_{xi}, \quad x = 1, \ldots, m,$$

for $-\infty < \tau_{xi} < \infty$, $x = 1, \ldots, m$; $\tau_{xi} > \tau_{(x-1)i}$, $x = 2, \ldots, m$. Thus the scoring functions $\phi_{xi}$ are the sums of the successive discriminations at the thresholds, and if $\alpha_{xi} > 0$, $x = 1, \ldots, m$, as would normally be required, the successive scoring functions increase, $\phi_{(x+1)i} > \phi_{xi}$, $x = 1, \ldots, m$. If the discriminations are constrained to be equal, $\alpha_{1i} = \alpha_{2i} = \alpha_{3i} = \cdots = \alpha_{xi} = \cdots = \alpha_{mi} = \alpha_i > 0$, then the scoring functions and the category coefficients become

$$\phi_{0i} = 0, \quad \phi_{xi} = x\alpha_i \qquad \text{and} \qquad \kappa_{0i} = 0, \quad \kappa_{xi} = -\alpha_i(\tau_{1i} + \tau_{2i} + \cdots + \tau_{xi}), \quad x = 1, \ldots, m.$$

It can now be seen that the successive scoring functions have the property $\phi_{x+1} - \phi_x = \phi_x - \phi_{x-1} = \alpha$ specified by Andersen (1977). Thus the mathematical requirement of sufficient statistics, which arises from the requirement of invariant comparisons, leads to the specialization of equal discriminations at the thresholds, exactly as in the prototype of measurement. Furthermore, if at threshold x, $\alpha_{xi} = 0$, then $\phi_{(x+1)i} = \phi_{xi}$, which explains why categories can be combined only if threshold x does not discriminate, that is, if it is in any case artificial. At the same time, if a threshold does not discriminate, then the categories should not be combined. This analysis of the construction of the CTM provides an explanation of the known result that the collapsing of categories with such a model has nontrivial implications for interpreting relationships among variables (Clogg and Shihadeh, 1994). The parameter $\alpha$ can be absorbed into the other parameters without loss of generality, giving the model of Equation (1), in which the scoring functions now are simply the successive counts $x \in \{0, 1, 2, \ldots, m\}$ of the number of thresholds exceeded, as in the prototype of measurement. The category coefficients then become simply the opposite of the sums of the thresholds up to the category of the coefficient x, giving in summary

$$\phi_{0i} = 0, \quad \phi_{xi} = x \qquad \text{and} \qquad \kappa_{0i} = 0, \quad \kappa_{xi} = -(\tau_{1i} + \tau_{2i} + \cdots + \tau_{xi}), \quad x = 1, \ldots, m,$$

and the resultant model is the CTM. It is important to note that the scoring of successive categories with successive integers does not require any assumption that the distances between thresholds are equal; these differences are estimated through the estimates of the thresholds.
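As a concrete illustration of the resulting model, the following sketch (not part of the chapter; a Python illustration with hypothetical threshold and location values) evaluates Equation (1) directly: category probabilities are obtained from the integer scoring of categories and the cumulative sums of the thresholds, with no assumption that the distances between thresholds are equal.

```python
import numpy as np

def ctm_probabilities(beta, tau):
    """Category probabilities of the cumulative threshold model, Equation (1).

    beta : location of the entity on the continuum.
    tau  : thresholds tau_1, ..., tau_m (giving m + 1 categories).
    """
    tau = np.asarray(tau, dtype=float)
    x = np.arange(len(tau) + 1)                        # categories 0, 1, ..., m
    cum_tau = np.concatenate(([0.0], np.cumsum(tau)))  # sum of thresholds up to x
    kernel = np.exp(-cum_tau + x * beta)               # exp(kappa_x + phi_x * beta)
    return kernel / kernel.sum()                       # divide by gamma_pi

# Hypothetical, unequally spaced thresholds for a five-category item
tau = [-1.5, -0.2, 0.4, 1.8]
for beta in (-1.0, 0.3, 2.0):
    print(beta, np.round(ctm_probabilities(beta, tau), 3))
# For ordered thresholds the distribution over categories is unimodal,
# and its mode moves to higher categories as beta increases.
```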
The requirement of equal discriminations at the thresholds, and the strict ordering of the thresholds, is therefore central to the construction of the model. However, it turns out that there is nothing in the structure of the parameters or in the way the summary statistics appear in any of the estimation equations (irrespective of method of estimation) that constrains the threshold estimates to be ordered; and in the estimates, the parameters may appear in an order different from that intended. If they appear in the incorrect order, it is taken that the thresholds are not operating as intended, and that the hypothesis of threshold order must be rejected. This reversal of order in the estimates will happen readily if the discriminations at the thresholds are not equal in the data and the data are analyzed using the CTM, which requires equal discriminations. A common reaction to this evidence that the model permits the estimates to be disordered is that the disordered estimates should simply be accepted--either there is something wrong with the model or the parameters should be interpreted as showing something other than that the data fail to satisfy the ordering requirement. This perspective is supported by the feature that the usual statistical tests of fit based on the chi-square distribution can be constructed and that the data can be shown to fit the model according to such a test of fit. However, such reasoning does not take into account that none of these tests of fit, which operate with different powers, are necessary and sufficient to conclude that the data fit the model. One hypothesis is that the thresholds are ordered correctly. Thus the fit can be satisfied with a global test of fit or with respect to some specific hypothesis about the data, even though some other specific hypothesis may have to be rejected. In addition, such tests of fit involve the data in the estimates of parameters, and degrees of freedom are lost to the test of fit: the test of fit checks whether, given these estimates and the model, the detail of these very same data can be recovered. In the CTM, in which the threshold estimates can show any order, the test of fit involves those estimates and so it may not reveal any misfit, and, in particular, it cannot in itself reveal anything special about the reversed estimates of the thresholds. It is a test of fit of the internal consistency of the data given the parameter estimates, and features of the parameter estimates such as order have to be tested separately. Thus, evidence from this kind of statistical test of fit is incomplete and does not obviate the need to study the estimates in relation to other evidence and other criteria related to the construction of the model.

This argument is so central to the application of the model that it is now elaborated from an alternative perspective. This perspective involves the log-odds of successive categories:

$$\log\left(\frac{\Pr\{(x)i\}}{\Pr\{(x-1)i\}}\right) = (\beta_p - \delta_i) - \tau_{xi}, \qquad (9)$$

where $\delta_i$ is the location of instrument i. If $\tau_{(x+1)i} > \tau_{xi}$, then from Equation (9),

$$\log\left(\frac{\Pr\{(x)i\}}{\Pr\{(x-1)i\}}\right) - \log\left(\frac{\Pr\{(x+1)i\}}{\Pr\{(x)i\}}\right) = [(\beta_p - \delta_i) - \tau_{xi}] - [(\beta_p - \delta_i) - \tau_{(x+1)i}] = \tau_{(x+1)i} - \tau_{xi} > 0,$$

and it follows that

$$\frac{(\Pr\{x_{pi}\})^2}{(\Pr\{(x-1)_{pi}\})(\Pr\{(x+1)_{pi}\})} > 1,$$
which ensures a unimodal distribution of Xpi. Specifically, no matter how close two successive thresholds are, if "~x)i < "r~x+~); and if "rex); < [3p < "r~x+~)i, then the probability of the outcome x is greater than the probability of an outcome in any other category. Figure 3 illustrates this feature where thresholds 'r3i and "r4i are close together. When successive thresholds are in an identical location, or reversed, unimodality no longer holds. The unimodal shape when the thresholds are ordered is consistent with interpreting the values Xpi to parallel measurements, because the distribution of Ypi is then simply a distribution of random error conditional on the location of the entity and the thresholds, and upon replication, regression effects would produce a unimodal distribution. The Poisson distribution which has been shown to be the special case when thresholds are equidistant in the multiplicative metric, exemplifies this feature. Other random error distributions, such as the binomial and negative binomial, as well as the normal
and other continuous distributions are all unimodal.

FIGURE 3. Probability of response in seven categories with thresholds 3 and 4 close together.

It is stressed, however, that unimodality is a property of the model, and that there is no guarantee that in any set of data this unimodal property would hold--that is an empirical issue--and if it does not hold, then reversed threshold estimates will be obtained. When one tests for the normal, binomial, or Poisson distributions in usual circumstances, one can readily find that the data do not fit the model because they are bimodal. To return to the idea that the threshold order is a hypothesis about the data, there must be a myriad of ways in empirical work in which data may be collected in graded responses where the discriminations at the thresholds will not be equal, or where some thresholds do not discriminate at all, or where some may even discriminate negatively, so that a nonunimodal distribution, with reversed estimates of thresholds, is obtained. If the estimates show threshold disorder, then it is taken that the intended ordering has been refuted by the data and that an anomaly has been disclosed, an anomaly that needs substantive investigation. The CTM permits the data to reveal whether or not they are ordered.

3.1.4. Summary of CTM and Measurement Criteria

The CTM therefore satisfies the criteria specified, and also therefore characterizes the responses in ordered response categories in a way that reflects measurement. The only difference is that the categories into which the continuum is partitioned are not equal units and the number of categories is finite. This means that an estimate of the location is not available explicitly but must be obtained by solving an equation iteratively; otherwise, it has the same features. How each of the criteria, and how the model's characterization of these criteria, can be exploited in data is reflected in the examples provided in the next section, after the competing model is presented.
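The threshold-order check described above can be made explicit in a few lines. The following sketch is an illustration only (Python, with hypothetical threshold values, not from the chapter): it computes the adjacent-category log-odds of Equation (9); when the differences of successive log-odds are all positive the thresholds are ordered and the distribution over categories is unimodal, whereas a negative difference signals reversed thresholds, the anomaly the CTM is designed to reveal.

```python
import numpy as np

def ctm_probabilities(beta, tau):
    """Category probabilities of the CTM for location beta and thresholds tau."""
    tau = np.asarray(tau, dtype=float)
    cum_tau = np.concatenate(([0.0], np.cumsum(tau)))
    x = np.arange(len(tau) + 1)
    kernel = np.exp(-cum_tau + x * beta)
    return kernel / kernel.sum()

def adjacent_log_odds(beta, tau):
    p = ctm_probabilities(beta, tau)
    return np.log(p[1:] / p[:-1])        # log(Pr{x} / Pr{x - 1}), x = 1, ..., m

beta = 0.5
ordered   = [-1.0, 0.0, 1.0, 2.0]        # hypothetical ordered thresholds
reversed_ = [-1.0, 1.0, 0.0, 2.0]        # thresholds 2 and 3 reversed

for tau in (ordered, reversed_):
    diffs = -np.diff(adjacent_log_odds(beta, tau))   # tau_{x+1} - tau_x
    print(np.round(diffs, 3),
          "unimodal" if np.all(diffs >= 0) else "reversal detected")
```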
3.2. Thurstone Cumulative Probability Model The model historically prior to the CTM for ordered response categories is also based on the partitioning of a continuum into categories. The derivation of the model is based on the plausible assumption of a single continuous response process so that the entity is classified into a category depending on the realization of this process. If "rli, "rzi, "r3i . . . . . "rxi . . . . 'rmi are again m-ordered thresholds dividing the continuum into m + 1 categories, then Figure 4 shows the construction and the continuous process that Thurstone originally assumed to be normal. However, because it is much more tractable, and because with a scaling constant it is virtually indistinguishable from the normal (Bock, 1975), the process is now often assumed to be the double exponential density function f ( y ) = (exp(y))/(1 + exp(y)) 2. In addition, various parameterizations have been
1. Measurement Criteria for Choosing among Models with Graded Responses
FIGURE 4.
23
The double exponential density function with four thresholds.
used but the differences among these is not relevant to the chapter; therefore only the one that is the simplest and closest to the parameterization of the CTM, the one with only the location and thresholds parameters, is examined. Thus if Ypi is a random continuous process on the continuum about [3p, and if successive categories are denoted by successive integers Xpi, then an outcome "rxi -r and y = 0 otherwise. Here, the progress over time of the latent response variable y* is described as Yik = cti + [3itk + ~l~Vik + ~ik,
(1)
where i denotes an individual, tk denotes a time-related variable with tk = k (e.g., k = 0, 1, 2 . . . . . K - 1), oLi is a random intercept at t = 0, [3i is a random slope, ~/k are fixed slopes, vik is a time-varying covariate, and ~ik is a residual, --- In(0, ~ ) . Furthermore, { O~i -'- ~Lor + "IToLW i "+" ~ ori ~ i -- ~ f3 + 3T [3W i -+" ~ [3i'
(e)
where IX~, tx~, rr~,, and 7r~ are parameters, w i is a time-invariant covariate, and ~ , ~ are residuals assumed to have a bivariate normal distribution,
With tk = k and a linear function of time, for example, k = 0, 1, 2 . . . . . K - 1, the variables a and [3 can be interpreted as the initial status level and the rate of growth/decline, respectively.
40
Bengt O. Muth~n
The residuals of ~ik a r e commonly assumed to be uncorrelated across time 9 In line with Gibbons and Bock (1987), however, a first-order autoregressive structure over time for these residuals is presented. Letting x = (w, v)', the model implies multivariate normality for y* conditional on x with
Yi.o !X c7
I
~l~176
YillX 9
(
~
~]lvil
= T tx~ + qTe~Wi jr_ ~ ~ -Jr-"rr~3W i ]
"~2Vi2
k Y i K - 1 IX
(4)
~K-lViK-I
and 1 p p2
V(y* Ix) = Ttlr~/f~T ' + t~;
9 pK-1
p
13 2
1 p
p 1
o pK-2
.
.
.
... ...
. pK--3
. . .
pK-- 1 pK-2 pK--3
(5) 1
where with linear growth or decline
r ___
1 1 1
0 1 2
.
(6)
.
1
K
1
and ~ , / ~ is the 2 x 2 covariance matrix in Eq. (3). For given tk, w, and v, the model expresses the probability of a certain observed response Yi~: as a function of the random coefficients oL and [3,
P(Yi~
=
I I OLi, ~i' X) = P(Yik > a'[ oti, =
[~i' X)
f: tp(slz, qJ~)ds
(7) where q~ is a (univariate) normal density, ~; denotes the standard deviation of ~, and
Z -- OLi At- ~itk -t- ~lkVik.
(8)
To identify the model, the standardization a" = 0, ~ -- 1 can be used as in conventional probit regression (see, e.g., Gibbons & Bock, 1987).
2. Growth Modeling with Binary Responses
41
The probability of a certain response may be expressed as P(Yo . . . . .
YK- 1 IX) --
.,-~f+~j -~ P(Yo, Y l . . . . .
YK-1 [OL, ~3, X) q~(oL, ~31x)doL d~3,
(9)
where P(Yo, Yl . . . .
fc - ( Yi o)
YK-iI e~, [3, X) = 9
9 9fc
- ( YiK- 1)
q~(Yo, Yl*. . . . .
YK-1 * I OL, ~,
x)dYio..,
*
dYiK_
1,
(10)
where c - ( Y i k ) denotes the integration domain for Yik given that the kth variable takes on the value Yik. Here, the integration domain is either ( - ~ , "r) or ('r, +~). In the special case of uncorrelated residuals, that is, p = 0 in Equation (5), the y* variables are independent when conditioning on a, [3, and x so that P(Yo, Yl . . . . . YK-iI a, [3, x) simplifies considerably, K--1
P(Yo,
Yl . . . . .
YK-1 Io~, 13, x) - k=oI-[ -(y.~,q~(YikloL,
fc
*
[3, x)dYik. *
(11)
In this case, only univariate normal distribution functions are involved so that the essential computations of Equation (9) involve the two-dimensional integral over ~ and [3. Perhaps because of the computational simplifications, the special case of p = 0 appears to be the standard model used in growth analysis with binary response 9 This model was used in Gibbons and Bock (1987; see also Gibbons & Hedeker, 1993). The analogous model with logit link was studied in Stiratelli et al. (1984) and in Zeger and Karim (1991). Gibbons and Bock (1987) considered maximum likelihood estimation using Fisher scoring and EM procedures developed for binary factor analysis in Bock and Lieberman (1970) and Bock and Aitkin (1981). Stiratelli et al. (1984) considered restricted maximum likelihood using the EM algorithm. Gibbons and Bock (1987) used a computational simplification obtained by orthogonalizing the bivariate normal variables oL and [3 using a Cholesky factor so that the bivariate normal density is written as a product of two univariate normal densities. They used numerical integration by Gauss-Hermite quadrature, with the weights being the product of the onedimensional weights. For the case of p 4: 0, Gibbons and Bock (1987) used the Clark algorithm to approximate the probabilities of the multivariate normal distribution for y* in Equation (10). Even when p = 0 the computations are heavy when there is a large number of distinct x values in the sample. Zeger and Karim (1991) employed a Bayesian approach using the Gibbs sampler algorithm. For recent overviews, see Fitzmaurice, Laird, and Rotnitzky (1993); Longford (1993); Diggle, Liang, and Zeger (1994); and Rutter and Elashoff (1994).
42
Bengt O. Muth~,n
3. MORE GENERAL BINARY GROWTH MODELING 3.1. Critique of Conventional Approaches In this section, weaknesses in conventional growth modeling with binary data are presented, along with a more general model and its estimation. The maximum likelihood approach to binary growth modeling leads to heavy computations when p 4: 0. This seems to have caused a tendency to restrict modeling to an assumption of p = 0. Experience with continuous response variables, that is, when y =- y*, indicates that p = 0 is not always a realistic assumption. The assumption of a single p parameter that is different from zero, as in the Gibbons-Bock first-order autoregressive model, also may not be realistic in some cases. Instead, it appears necessary to include a separate parameter for at least the correlations among residuals that are adjacent in time. Furthermore, the conventional model specification of a- = 0, qJ~ = 1 has no effect when, as in standard probit regression, there is only a single equation that is being estimated. It is important to note, however, that this is not the case in longitudinal analysis. The longitudinal analysis can be characterized as a multivariate probit regression in which the multivariate response consists of the same response variable at different time points. This has the following consequences. First, the standardization of "r to zero at all time points needs clarification. In the binary case, this does not lead to incorrect results but does not show the generalization to the case of ordered categorical response or to the case of multiple indicators. The threshold a" is a parameter describing a measurement characteristic of the variable y, namely, the level (proportion) of y with zero values on x. Because the same y variable is measured at all time points, equality of this measurement characteristic over time is the natural model specification. In the binary case, however, the equality of the level of y across time points is accomplished by tx~, in Equation (2) affecting y equally over time as a result of the unit coefficient of oLi in Equation (1), which is not explicitly shown. Setting a- = 0 is therefore correct, although an equivalent specification would take "r as a parameter held equal over time points while fixing tx~, at zero. In the ordered categorical case, however, there are several "r parameters involved for a y variable and equality over time of such "r's is called for. In this case, tx~, cannot be separately estimated but may be fixed at zero. The multiple indicator case will be discussed in the next section. Second, ~ is the standard deviation of the residual variation of the latent response variable y*, and fixing it at unity implicitly assumes that the residual variation has the same value over time. This is not realistic because over time different sources of variation not accounted for by the time-varying variable vi~, are likely to be introduced. Again, experiences with continuous response variables indicate that the residual variance often changes over time.
2. Growth Modeling with Binary Responses
43
In presentations using the logit version of the model, the parameters of a" and ~ are usually not mentioned (see, e.g., Diggle et al., 1994). This is probably because the threshold formulation, often used in the probit case, is seldom used in the logit case. This has inadvertently lead to an unnecessarily restrictive logit formulation in growth modeling.
3.2. The Approach of Muth6n An important methodological consideration is whether computational difficulties should lead to a simplified model, such as using p = 0 or t~ = 1, or whether it is better to maintain a general model and instead use a simpler estimator. Here, I describe the latter approach, building on the model of Equations (1) and (2) to consider a more general model and a limited-information estimator. First, the ~ik variables of Equation (1) are allowed to be correlated among themselves and are allowed to have different variances over time. Second, multiple indicators Yikj, J = 1, 2 . . . . . p are allowed at each time point, (12)
Yil,j = kjxlik + eikj,
where kj is a measurement (slope) parameter for indicator j, eikj is a measurement error residual for variable j at time k, and Yikj 1 if Yikj > "rj. The multiple indicator case is illustrated by Example 2 in which four measurements of a single construct "neurotic illness" ('q) were considered. Given that the a"s and the k's are measurement parameters, a natural model specification would impose equality over time for each of these parameters. Using normality assumptions for all three types of residuals, ~, ~, and e, again leads to a multivariate probit regression model. This generalized binary growth model is a special case of the structural equation model of Muth6n (1983, 1984). The longitudinal modeling issues just discussed were also brought up in Muth6n (1983), where a random intercept model like Equations (1), (2), and (12) was fitted to the Example 2 data. The problems with standardization issues related to ~- and t~ have also been emphasized by Arminger (see, e.g., ch. 3, this volume) and Muth6n and Christofferson (1981) in the context of structural equation modeling. In the approach of Muth6n (1983, 1984), conditional mean and covariance matrix expressions corresponding to Equations (4) and (5) are considered. This is sufficient given the conditional normality assumptions. Muth6n (1983, 1984) introduced a diagonal scaling matrix A containing the inverse of the conditional standard deviations of the latent response variable at each time point, =
A = d i a g [ V ( y * Ix)] -1/2.
(13)
Muth6n (1983, 1984) describes three model parts. Using the single-indicator growth model example of Equations (4) and (5),
44
Bengt O. MuthOn 0.1 = ATIx 0" 2 -- A[T-rr ~/] 0"3 = 1 ) = A ( T ~ , ~ / ~ T ' + ~ g g ) A ,
(14) (15) (16)
where ~ is the K • K covariance matrix of ~ (cf. Eq. 5). The three parts correspond to the intercepts, slopes, and residual correlation matrix of a multivariate probit regression. Note that 1) is a correlation matrix. The general model of Muth6n (1983, 1984), including multiple indicators as well as multiple factors at each time point, can be expressed as follows. Consider a set of measurement relations for a p-dimensional vector y*, y* = Axl + e,
(17)
and a set of structural relations for an m-dimensional vector of latent variable constructs -q, n =a+
13n + F x +
~,
(18)
where A, a, B, F are parameter arrays, e is a residual (measurement error) vector with mean zero and covariance matrix 19, and ~ is a residual vector with mean zero and covariance matrix ~ . The scaling matrix ZX is also included in this general framework, as is the ability to analyze independent samples from multiple populations simultaneously. Muthrn (1983, 1984, 1987) used a least-squares estimator where with 0. (if'l, 0.;, 0.;)t, F = (s - 0.)' W - I ( s
--
0.),
(19)
where the s elements are arranged in line with 0. and are maximum likelihood estimates of the intercepts, slopes, and residual correlations. Here, s I and s 2 are estimates from probit regressions of each y variable on all the x variables, whereas each s 3 element is a residual correlation from a bivariate probit regression of a pair of y variables regressed on all x variables. A generalized leastsquares estimator is obtained when the weight matrix W is a consistent estimate of the asymptotic covariance matrix of s. In this case, a chi-square test of model fit is obtained as n. F, where n is the sample size and F refers to the minimum value of the function in Equation (19). (For additional technical details on the asymptotic theory behind this approach, see Muthrn & Satorra, 1995.) Muthrn (1984, 1987) presented a general computer program LISCOMP which carries out these calculations.
3.3. Model Identification Given that the response variables are categorical, the general binary growth model needs to be studied carefully in terms of parameter identification. Under
2. Growth Modeling with Binary Responses
45
the normality assumptions of the model, the number of distinct elements of cr represents the total number of parameters that can be identified, therefore the number of growth model parameters can be no larger than this. The growth model parameters are identified if and only if they are identified in terms of the elements of er. It is instructive to consider first the conditional y* variance in some detail for the case of binary growth modeling. Let [A]~ 2 denote the conditional variance of Ykk given x. For simplicity, the focus is on the case with linear growth. With four time points, the conditional variances of y* can be expressed in model parameter terms as [A]o2 = + ~ + tb~o~o [z~]H2 = +~,~ + 2tb~ + tb~ + tb~l~l [A]222 = tb~,~, + 4tb~,~ + 4tb~ + ~2~2 [A]332 = ~,,~ + 6~,~ + 9qs~ + ~3~3"
(20) (21) (22) (23)
Note that the A elements are different because of across-time differences in contributions from ~,~/~ as well as ~ . In Equation (5), using the Gibbons-Beck standardization of +~ - 1, there are no free ~ parameters to be estimated. Contrary to this conventional approach, there are four different ~ parameters in Equations (18) through (21). Because the y* variables are not directly observed, not all of these parameters are identifiable. Instead of assuming ~ = 1 for all time points, as in Gibbons-Beck, the first diagonal element of A can be fixed to unity, corresponding to the first time point. For the remaining time points, the A elements in Equations (19) through (21) are the unrestricted parameters instead of the residual y* variances of ~. The residual variances are not taken as free parameters to be estimated, but can be obtained from the other parameters using Equations (18) through (21). Allowing the A parameters to be different across time allows the residual variances to be different across time. With four time points, this adds three parameters to the model relative to the conventional model. The (co)variance-related parameters of the model are in this case the three free/X elements (not the the's) and the three elements of ~,~/~. With covariates, added parameters are lX~, tx~, -rr~, -rr~, and ~/o. . . . . ~/K-1. It can be shown that with four time points, covariances between pairs of residuals at adjacent time points can also be identified in this model (see Muthdn & Liu, 1994). As opposed to the case of a single response variable, multiple indicator models allow for separate identification of the residual variances and the measurement errors of each indicator. In this case, there is the additional advantage that the residual variances for the latent variable constructs xl are identified at all time points. Multiple-indicator models would assume equality of the measurement parameters (the -r' s and the )t' s) for the same response variable across time. In this case, the p~,~intercept is fixed at zero. The A matrix scaling is general-
46
Bengt O. Muth~n
ized as follows. The scaling factors of A are fixed at the first time point for all of the indicators to eliminate the indeterminacy of scale for each different y* variable (corresponding to each indicator) at this time point. The scaling factors of A are free for all indicators at later time points so that the measurement error variances are not restricted to be equal across time points for the same response variable.
3.4. Implementation in Latent Variable Modeling Software In the case of continuous response variables, Meredith and Tisak (1984, 1990) have shown that the random coefficient model of the previous section can be formulated as a latent variable model. For applications in psychology, see McArdle and Epstein (1987); for applications in education, see Muth6n (1993) and Willett and Sayer (1993); and for applications in mental health, see Muth6n (1983, 1991). For a pedagogical introduction to the continuous case, see Muth6n, Khoo, and Nelson Goff (1994) and Willett and Sayer (1993). Muth6n (1983, 1993) pointed out that this idea could be carried over to the binary and ordered categorical case. The basic idea is easy to describe. In Equation 1, c~i is unobserved and varies randomly across individuals. Hence, it is a latent variable. Furthermore, in the product t e r m ~itk, ~i is a latent variable multiplied by a term tk which is constant over individuals and can therefore be treated as a parameter. The tks may be fixed as in Equation (6), but with three or more time points they may be estimated for the third and later time points to represent nonlinear growth. More than one growth factor may also be used.
4. ANALYSES Simulated and real data will now be used to illustrate analyses using the general growth model with binary data.
4.1. A Monte Carlo Study A limited Monte Carlo study was carried out to demonstrate the sampling behavior of the generalized least-squares estimator in the binary case. The model chosen for the study has a single binary response variable observed at four time points. There is one time-invariant covariate and one time-varying covariate (one for each time point). The simulated data can be thought of as being in line with the Example 1 situation in which the probability of a problem behavior declines over time. Linear decline is specified with T as in Equation (6). Both the random intercept (et) and the random slope ([3) show individual variation as represented both by their common dependence on the time-invariant covariate (w) and their residual variation (represented by ~ / ~ ) . The e~ variable regression has a
2. Growth Modeling with Binary Responses
47
positive intercept (Ix~) and a positive slope (Try,) for w, whereas the [3 variable regression has a negative intercept (tx~) and a negative slope ('rr~) for w. In this way, the time-invariant covariate can be seen as a risk factor, which with increasing value increases o~, that is, it increases the initial probability of the problem and decreases [3, making the rate of decline larger (this latter means that the higher the risk factor value, the more likely the improvement is in the problem behavior over time). The regression coefficient for the response variable on the time-varying covariates (the vk's) is positive and the same at all time points. The residual variances ( ~ ) are changing over time and there is a nonzero residual covariance between adjacent pairs of residuals that is assumed to be equal. The time-varying covariates are correlated 0.5 and are each correlated 0.25 with the time-invariant covariate. All covariates have means of 0 and variances of 1. The population values of the model parameters are given in Table 1.
TABLE 1 Monte Carlo Study: 500 Replications Using LISCOMP GLS for Binary Response Variables Parameter
True value
n = 1000
n = 250
~ ~
0.50 -0.50
0.50 (0.05, 0.05) -0.51 (0.04, 0.04)
0.50 (0.10, 0.09) -0.50 (0.10, 0.09)
"rr~ "rr~ ~/1 ~/2 ~/3 ~/4
0.50 -0.50 0.70 0.70 0.70 0.70
0.50 -0.51 0.70 0.72 0.71 0.71
0.50 -0.50 0.70 0.76 0.71 0.71
qJe~ot ~[3ot ~[313
0.50 -0.10 0.10
0.49 (0.13, 0.13) -0.09 (0.06, 0.05) 0.10 (0.04, 0.03)
0.48 (0.29, 0.26) -0.08 (0.13, 0.11) 0.10 (0.09, 0.08)
~k+ 1, k
0.20
0.22 (0.09, 0.08)
0.26 (0.31, 0.22)
All
1.00 1.00 1.00
0.99 (0.13, 0.13) 1.00 (0.11, 0.11) 1.00 (0.11, 0.11)
1.01 (0.29, 0.25) 1.02 (0.25, 0.22) 1.03 (0.25, 0.23)
14.75 5.53 5.2 1.0
15.51 5.61 5.4 2.0
A22 A33 X 2 Average (df= 15) SD 5% Reject proportion 1% Reject proportion
(0.05, (0.04, (0.06, (0.11, (0.09, (0.08,
0.05) 0.04) 0.05) 0.11) 0.09) 0.08)
(0.10, 0.10) (0.09, (0.11, (0.40, (0.19, (0.18,
0.09) 0.11) 0.29) 0.18) 0.17)
Note. In parentheses are empirical standard deviations, standard errors, df degrees of freedom; SD, standard deviation.
48
Bengt O. Muth#n
Two sample sizes were used, a larger sample size of n = 1000 and a smaller sample of n = 250. Multivariate normal data were generated for y* and x, and the y* variables were dichotomized at zero. The generalized least-squares estimator was used. The parameter values chosen (see Table 1) imply that the proportion of y = 1 at the four time points is .64, .50, .34, and .25. The parameters estimated were IX~, IX[3, "rr,~, 7r[3, ~/o, ~tl, '5/2, ~/3, q/oto~, q/[3o~,11/[313,q/~l,+z~k(a single parameter), [A] ll, [A]22, and [A]33. The threshold parameter -r was fixed at zero and the scaling factor [A]lz was fixed at one. As discussed in section 3.3, the four residual variances of qJ;~ are not free parameters to be estimated but they are still allowed to differ freely across time (their population values are .5, .6, .5, and .2) because the A parameters are free. The degrees of freedom for the chi-square model test is 15. A total of 500 replications were used for both sample sizes in Table 1. Table 1 gives the parameter estimates, the empirical standard deviation of the estimates across the 500 replications, the mean of the estimated standard errors for the 500 replications, and a summary of the chi-square test of model fit for the 500 replications. As seen in Table 1, the estimates for the n = 1000 case show almost no parameter bias, the empirical variation is very close to the mean of the standard errors, and the chi-square test behaves correctly. As expected, the empirical standard deviation is cut in half when reducing the sample size to a quarter, from 1000 to 250. Exceptions to this, however, are the regression slope for the second time point "Y2and the residual covariance ~+,~k" The cause for these anomalies needs further research. In these cases, the standard errors are also strongly underestimated. In the remaining cases, the standard errors agree rather well with the empirical variation, with perhaps a minor tendency to underestimate the standard errors for the (co)variance-related parameters of ~ and A. At n = 250, the variation in the regression intercept and slope parameter (IX and 7r) estimates is low enough for the hypotheses of zero values to be rejected at the 5% level. For the (co)variance-related parameters of +, however, this is not the case and the A parameters also have relatively large variation. The chi-square test behavior at n = 250 is quite good.
4.2. Analysis of Example 2 Data The model used for the preceding simulation study will now be applied to the Example 2 data of neurotic illness as described previously (for more details, see Henderson et al., 1981). Each of the four response variables will be modeled separately. They can also be analyzed together as multiple indicators of neurotic illness, but this will not be done here. Previous longitudinal analyses of these data were done in Muth6n (1983, 1991). Summaries of the data are given in Table 2. As is shown in Table 2, there is a certain drop from the first to the remaining occasions in the proportion of
2. Growth Modeling with Binary Responses
49
TABLE2 Descriptive Statistics for Example 2 Data Response variables Percentage yes: Anxiety Depression Irritability Nervousness
Time 1
Time 2
Time 3
Time 4
26.4 25.5 40.3 24.2
16.5 14.7 31.2 19.0
15.6 17.7 28.1 16.0
16.0 13.9 29.0 15.6
Covariates
N L1 L2 L3 L4
Means
Variances
9.31 3.86 3.17 2.58 2.42
20.66 6.54 5.89 4.90 5.27 Correlations
N L1 L2 L3 L4
1.00 0.22 0.16 0.18 0.21
1.00 0.54 0.50 0.53 .
.
.
1.00 0.49 0.49
1.00 0.51
1.00
.
people answering yes to the neuroticism items. There is also a corresponding drop in the mean of the life event score. Because the latter is used as a timevarying covariate, this means that the data could be fit by a model that does not include a factor for a decline in the response variable, but which instead uses only a random intercept factor model. Given previous analysis results, gender is dropped as a time-invariant covariate. Only the N score, the long-term susceptibility to neurosis, will be used to predict the variation in the random intercept factor. Two types of models will be fit to the data. First, the general binary growth model will be fit, allowing for across-time variation in the latent response variable residual variance and nonzero covariances between pairs of residuals (the residual covariances are restricted to being equal). Second, the conventional binary growth model, in which these features are not allowed for, will be fit as a comparison. In both cases, the same quantities as in Table 1 will be studied, along with two types of summary statistics. One summary statistic is R 2, that is, the proportion of variation is the e~ factor accounted for by N. A second statistic is the proportion that the e~ factor variation makes up of the total variation in the latent response variable y*, calculated at all time points. The oL factor
50
Bengt O. Muth6n
represents individual variation in a neurotic illness trait, variation that is present at all time points. In addition to this variation, the y* variation is also influenced by time-specific variation caused by time-varying, measured covariates (Ls) and time-specific unmeasured residuals (~s). In this way, the proportion is a time-varying measure of how predominant the trait variation is in the responses. Table 3 shows the results for the general binary growth model. The model fits each of the four response variables very well. As expected, the N score has a significantly positive influence (4r~) on the random intercept factor and the L scores have significantly positive influences (C/s) on the probability of yes answers for the response variables. For none of these four response variables is the residual covariance significantly different from zero. Note, however, from the simulation study at n = 250 that the variation in this estimate is quite large and that a large sample size is required to reject zero covariance. As shown in the simulation study, the point estimate of the covariance may be of reasonable magnitude. The model with
TABLE3 Analysis of Example 2 Data Using the General Growth Model ,,
,
Anxiety
Depression
Irritability
Nervousness
ixoL
-1.46 (0.21)
-2.56 (0.34)
-1.21 (0.20)
-2.43 (0.27)
"rre~ yl y2 ~3 ~4
0.06 (0.01) 0.08 (0.03) 0.05 (0.02) 0.06 (0.03) 0.03 (0.02)
0.13 (0.02) 0.14 (0.04) 0.02 (0.03) 0.12 (0.03) 0.04 (0.05)
0.06 (0.01) 0.09 (0.03) 0.08 (0.02) 0.07 (0.02) 0.06 (0.02)
0.14 (0.02) 0.08 (0.03) 0.03 (0.02) 0.05 (0.03) 0.01 (0.03)
~o~o~
0.31 (0.08)
0.39 (0.10)
0.27 (0.07)
0.69 (0.09)
-0.01 (0.05)
-0.12 (0.10)
-0.04 (0.04)
-0.11 (0.06)
1.37 (0.19) 1.48 (0.21) 1.20 (0.18)
0.88 (0.15) 1.08 (0.19) 0.88 (0.14)
1.41 (0.22) 1.43 (0.24) 1.32 (0.22)
1.01 (0.11) 1.14 (0.11) 1.00 (0.11)
~k, k+l All A22 A33 X2(19) p-value R2ot P1 P2 P3 P4
16.49 .624 0.19 0.34 0.31 0.31 0.30
21.73 .298 0.47 0.50 0.41 0.42 0.41
26.02 .130 0.22 0.30 0.28 0.29 0.28
23.30 .225 0.37 0.75 0.52 0.54 0.52
2. Growth Modeling with Binary Responses
51
nonzero residual covariance is therefore maintained. In this particular application, the estimate is small. The scaling factors of A are not significantly different from unity at the 5% level for all but one of the cases. Because in this model there is no [3 factor, A is a function of the oL factor residual variance + , ~ and the residual variance thcc (cf. Eqs. 18-21). Unit values for the A scaling factors would therefore indicate that the residual variances are constant over time in this application. Note, however, from the Table 1 simulation results that the sampling variation in the A estimates is quite large at n = 250 which makes it difficult to reject equality of residual variances over time. The Table 1 results also indicate that the point estimates for A are good. Table 4 shows the results for the conventional binary growth model. This model cannot be rejected at the 5% level in these applications. The parameter estimates are, in most cases, similar to those for the generalized model of Table 3. Differences do, however, show up in the values for the trait variance proportions, labeled P~ through P4 in Tables 3 and 4. Relative to the more general model, the conventional model overestimates these proportions for three out of the four response variables. For example, the conventional model indicates that there is a considerable dominance of trait variation in the response variable Nervousness, with a proportion of .81 for the last three time points (see Table 4).
TABLE 4
Analysis of Example 2 Data Using the Conventional Growth Model Anxiety
Depression
Irritability
Nervousness
~OL
-1.79 (0.17)
-2.40 (0.16)
-1.56 (0.16)
-2.56 (0.23)
9roL ~1 ~/2 "y3 ~/4
0.07 (0.01) 0.12 (0.02) 0.06 (0.03) 0.05 (0.04) 0.04 (0.03)
0.12 (0.01) 0.13 (0.02) 0.03 (0.03) 0.10 (0.03) 0.05 (0.03)
0.08 (0.01) 0.12 (0.02) 0.10 (0.02) 0.09 (0.03) 0.07 (0.03)
0.15 (0.02) 0.09 (0.03) 0.03 (0.02) 0.03 (0.03) 0.01 (0.03)
~c~c~
0.50 (0.51)
0.31 (0.06)
0.42 (0.06)
0.72 (0.05)
X2(23) p-value RZoL P1 P2 P3 P4
26.12 .295 0.17 0.50 0.54 0.54 0.54
26.29 .287 0.49 0.43 0.47 0.45 0.46
33.97 .066 0.24 0.45 0.46 0.47 0.48
29.02 .180 0.39 0.78 0.81 0.81 0.81
52
Bengt O. Muth~n
The more general model of Table 3 points to a much lower range of values for the last three time points, .52 to .54.
5. CONCLUSIONS This chapter has discussed a general framework for longitudinal analysis with binary response variables. As compared with conventional random effects modeling with binary response, this general approach allows for residuals that are correlated over time and variances that vary over time. It also allows for multiple indicators of latent variable constructs, in which case it is possible to identify separately residual variation and measurement error variation. The more general model can be estimated by a limited-information generalized least-squares estimator. The general approach fits into an existing latent variable modeling framework for which software has been developed. A Monte Carlo study showed that the limited-information generalized leastsquares estimator performed well with sample sizes at least as low as n - 250. At this sample size, the sampling variability is not unduly large for the regression parameters of the model, but it is rather high for the (co)variance-related parameters of the model. Analyses of a real data set indicated that the differences in key estimates obtained by the conventional model are not always markedly different from those obtained by the more general model, but can lead to quite different conclusions about certain aspects of the phenomenon that is being modeled. The general approach should be of value for developmental studies in which variables are often binary and in which the variables are often very skewed and essentially binary. The general model allows for a flexible analysis which has so far been used very little with binary responses. A multiple-cohort analysis of this type is carried out in Muthdn and Muthdn (1995), which describes the development of heavy drinking over age for young adults.
ACKNOWLEDGMENTS This research was supported by Grant AA 08651-01 from NIAAA for the project "Psychometric Advances for Alcohol and Depression Studies" and Grant 40859 from the National Institute of Mental Health.
REFERENCES Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
2. Growth Modeling with Binary Responses
53
Bock, R. D., & Lieberman (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179-197. Diggle, P. J., Liang, K. Y., & Zeger, S. (1994). Analysis of longitudinal data. Oxford: Oxford University Press. Fitzmaurice, G. M., Laird, N. M., & Rotnitzky, A. G. (1993). Regression models for discrete longitudinal data. Statistical Science, 8, 284-309. Gibbons, R. D., & Bock, R. D. (1987). Trend in correlated proportions. Psychometrika, 52, 113-124. Gibbons, R. D., & Hedeker, D. R. (1993). Application of random-effects probit regression models (Technical report). Chicago: University of Illinois at Chicago, UIC Biometric Laboratory. Henderson, A. S., Byrne, D. G., & Duncan-Jones, E (1981). Neurosis and the social environment. Sydney: Academic Press. J6reskog, K. G., & S6rbom, D. (1977). Statistical models and methods for analysis of longitudinal data. In D. J. Aigner & A. S. Goldberger (Eds.), Latent variables in socio-economic models (pp. 285-325). Amsterdam: North-Holland. Laird, N. M., & Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963-974. Longford, N. T. (1993). Random coefficient models. Oxford: Oxford University Press. McArdle, J. J., & Epstein, D. (1987). Latent growth curves within developmental structural equation models. Child Development, 58, 110-133. Meredith, W., & Tisak, J. (1984). "Tuckerizing" curves. Paper presented at the annual meetings of the Psychometric Society, Santa Barbara, CA. Meredith, W., & Tisak, J. (1990). Latent curve analysis. Psychometrika, 55, 107-122. Muth6n, B. (1983). Latent variable structural equation modeling with categorical data. Journal of Econometrics, 22, 43-65. Muth6n, B. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49, 115-132. Muth6n, B. (1987). LISCOMP. Analysis of linear structural equations with a comprehensive measurement model. Theoretical integration and user's guide. Mooresville, IN: Scientific Software. Muth6n, B. (1991). Analysis of longitudinal data using latent variable models with varying parameters. In L. Collins & J. Horn (Eds.), Best methods for the analysis of change. Recent advances, unanswered questions, future directions (pp. 1-17). Washington, DC: American Psychological Association. Muth6n, B. (1993). Latent variable modeling of growth with missing data and multilevel data. In C. R. Rao & C. M. Cuadras (Eds.), Multivariate analysis: Future directions 2 (pp. 199-210). Amsterdam: North-Holland. Muth6n, B., & Christofferson, A. (1981). Simultaneous factor analysis of dichotomous variables in several groups. Psychometrika, 46, 407-419. Muth6n, B., Khoo, S. T., & Nelson Goff, G. (1994). Longitudinal studies of achievement growth using latent variable modeling (Technical report). Los Angeles: UCLA, Graduate School of Education. Muth6n, B., & Liu, G. (1994). Identification issues related to binary growth modeling (Technical report). Manuscript in preparation. Muth6n, B., & Muth6n, L. (1995). Longitudinal modeling of non-normal data with latent variable techniques: Applications to developmental curves of heavy drinking among young adults (Technical report). Manuscript in preparation. Muth6n, B., & Satorra, A. (1995). Technical aspects of Muth6n's LISCOMP approach to estimation of latent variable relations with a comprehensive measurement model. Psychometrika. Rutter, C. M., & Elashoff, R. M. (1994). Analysis of longitudinal data: Random coefficient regression modeling. 
Statistics in Medicine, 13, 1211-1231.
54
Bengt O. Muth~n
Stiratelli, R., Laird, N., & Ware, J. H. (1984). Random-effects models for serial observations with binary response. Biometrics, 40, 961-971. Wheaton, B., Muth6n, B., Alwin, D., & Summers, G. (1977). Assessing reliability and stability in panel models. In D. R. Heise (Ed.), Sociological methodology 1977 (pp. 84-136). San Francisco: Jossey-Bass. Willett, J. B., & Sayer, A. G. (1993). Using covariance structure analysis to detect correlates and predictors of individual change over time. Psychological Bulletin. Zeger, S. L., & Karim, M. R. (1991). Generalized linear models with random effects; a Gibbs sampling approach. Journal of the American Statistical Association, 86, 79-86.
Probit Models for the Analysis of Limited Dependent Panel Data GerhardArminger
Bergische Universit~t Wuppertal, Germany
1. INTRODUCTION Panel data are observations from a random sample of n elements from a population collected over a fixed number T of time points. Usually, n is fairly large (e.g., approximately 5000 households in the German Socio-Economic Panel [GSOEP; Wagner, Schupp, & Rendtel, 1991] or approximately 4000 firms in the 1993 panel of the German bureau of labor) and T is fairly small (T = 10 panel waves in the GSOEP). Usually, the time points t = 1. . . . . T are equidistant. In psychology, panel data are often referred to as repeated measurements; in epidemiology they are referred to as cohort data. In analyzing panel data using regression models, one often finds that the dependent variables of interest are nonmetric, that is, dichotomous, censored metric, and ordered or unordered categorical. Typical examples are found in the following areas: 9 The analysis of individual behavior in the labor market. Flaig, Licht, and Steiner (1993) model and test hypotheses about whether a person is unemployed Categorical Variables in Developmental Research: Methods of Analysis Copyright 9 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.
55
56
GerhardArminger
or not at time t depending on variables such as unemployment at t - l, length of former unemployment, education, professional experience, and other personal characteristics using data from the GSOEE Here, the dependent variable is dichotomous with the categories "employed" versus "unemployed." 9 The analysis of the behavior of individual firms. Arminger and Ronning (1991) analyze the simultaneous dependence of changes in output, prices, and stock using data from the German business test conducted quarterly by IFO Institute in Munich. Here, the dependent variables of change in output (less, equal, and more than at time t - 1) and change in price are ordered trichotomous and the variable stock is metric as it is measured in production months. The special problem in specifying a model for these data is that trichotomous variables appear on the left a n d on the right side of regression equations. 9 The analysis of marketing data. Marketing data come often in the form of preferences measured on a Likert scale with five or seven ordered categorical outcomes such as "1 = I like product A very much" to "5 = I don't like product A at all." Again, the variables are ordered categorical and often a great number of variables must be reduced to a smaller set of variables by using factor analytic models for ordered categorical data. Ordered categorical variables may again appear on the left and on the right side of regression equations. In the following section the construction principles of Heckman (1981a, 1981b) on the specification and estimation of dichotomous outcomes in panel data are extended to include ordered categorical and censored dependent variables as well as simultaneous equation systems of nonmetric dependent variables. The parameters of models in which strict exogeneity of error terms holds can be estimated by assuming multivariate normality of the error terms in threshold models and by using conditional polychoric and polyserial covariance coefficients in the framework of mean and covariance structure models for nonmetric dependent variables. These models and estimation techniques have been introduced by Muth6n (1984) and extended by Ktisters (1987) and Schepers and Arminger (1992). The estimation of special models in which strict exogeneity of error terms does not hold is also briefly discussed. Special attention is given to the problem of initial states. As an example, the trichotomous output variable of a four-wave panel of 656 firms from the German business test conducted by the IFO Institute is analyzed.
2. MODEL SPECIFICATION 2.1. Heckman's Model for Dichotomous Variables Heckman (1981a, chap. 3.3) considers the following model for an unobserved variable Yit, i = 1 . . . . . n, t = 1 ..... T , where i denotes the individual and t denotes a sequence of equispaced time points:
3. Probit Model and Limited Dependent Panel Data
Yit -- ~it + 6-it"
57
(1)
The term txi, is the expected value of Y~t, which itself may be written as a function of the exogenous variables x;,, lagged observed variables Yi,t-j, and lagged unobserved variables Yi,,-j: vc
I'Lit - Xit~ + E
vc
3"t--j, tYi,t--j + E
j=l
j--1
j
K
~kJ,t--J ]--I Yi,t--I + E l--1
k=l
~
9
kYi,t-k"
(2)
The error terms 6-it are collected in a vector and are assumed to be normally distributed: 6-i "-" (6-il
6-iT)" 6-i "~ t (0, Y,).
.....
(3)
A threshold model is used to model the relation between the unobserved variable Yit and the observed variable Yit: if Yzt > 0 if Yit 1, which is called the true state dependence. If 3't-l,, = 3'J 4 : 0 and 3"t--j,t 0 for all j > 1 and t - 1. . . . . T, we have a simple Markov model. Note that the inclusion of former states requires the knowledge of initial states Y~o, Y i , - 1 . . . . . depending on the specification of the parameters 3't-i,t. If the initial states are known and are nonstochastic, they can be included in the vectors xit as additional explanatory variables. If the initial states are themselves outcomes of the process that generates Yit, the distribution of the initial states must be taken into account, as discussed by Heckman (1981 b) and in Section 3 of this chapter. Note that the effects of the former states Y;,t-~ may change for each time point. This is captured by 3't-j,t. In most applications, 3't-j,, is set to 3',_j and almost all of the parameters are set to 0. =
50
GerhardArminger
The third component of [JLi, models the dependence of Yit on the duration of the state Yi,t = 1. Again, the effects of duration may be different for each time length. These different effects are parameterized in ha,,_j. If hj,,_j = h, this is the simple case of a linear duration effect. The fourth component of ].Lit models the dependence of Yit on former values Yi,t-j o f the unobserved endogenous variables. Structures that incorporate this component are called models with habit formation or habit persistence. The idea behind such models is that Yit is not dependent on the actual former state of the observed variable, but rather on the former disposition or habit identified with Yi,t-j instead of Yi,t-j. If the initial dispositions Yi,o, Yi,-1 . . . . . Yi,-(K-I) are known and nonstochastic, they may be included in the list of explanatory variables. Otherwise, assumptions about the distribution of the initial variables Yi,o, Yi,-1 . . . . . Y i , - ( K - 1 ) m u s t be made. Now turn to the specification of the error term Eit in Equation (3). The error term is often decomposed in the form (5)
~'it-- oti at- ~-it'
where oti denotes the error term which varies across individuals but not over time and which may be considered to be an unobserved heterogeneity just as in the metric case (cf. Hsiao, 1986). The values of oti may be considered as fixed effects for each i or may be considered as random effects such that oti "" I "(0, o-2). In the first case, oti is an individual specific parameter. If Yi, is modeled by Yit = xitf3 + oti Jr- E.it and is actually observed, as in the metric case, then oti may be eliminated by taking the first differences Y i , - Y i , t - ~ - ( x i , - x/,t-1)[3 + ei.t - %,-1. This technique does not work for nonmetric models such as the probit model. In the case of a dichotomous logit model, the otis may be eliminated by conditioning on a sufficient statistic, as shown in Hsiao (1986) and Hamerle and Ronning (1995). If oti is a random variable, then it is assumed to be uncorrelated with xit and ~-it. If ~-it has constant variance %2 and is serially uncorrelated, then Ei has the typical covariance structure 2+ orot
2 oroL
2 ore
2 + 2 O'o~ ore
211' + or2I,
v( *.] "2 orc~
"2 orc~
9 9 9
2 _1_ 2 O"OL ore
where 1 is a T X 1 vector of ones and I is the T • T identity matrix 9 More generally, a serial or a factor analytic structure may be assumed for V(e i). (Details are found in Heckman, 1981 a, or Arminger, 1992.) Discussion of the estimation of the parameters of this model under various assumption is deferred to section 3.
3. Probit Model and Limited Dependent Panel Data
59
2.2. Extension to General Threshold Models We now extend Heckman's (198 la) dichotomous models in a systematic way to censored, metrically classified and ordered categorical dependent variables and simultaneous equation systems that allow as dependent variables any mixture of metric and/or limited dependent variables. Only random effect models are considered. The T x 1 vector Yi of utilities Yit is formulated as a multivariate linear regression model
Y i -- ~
-+- 1-[x i "~ e-i,
where x i is a R • 1 vector of observed explanatory variables, ~/is a T • 1 vector of regression constants, and II is a T X R matrix of regression coefficients. The T X 1 vector of error terms follows a normal distribution L I ~ Note that there is a slight change in notation compared with section 2.2. The R X 1 vector x; may be interpreted as the vectorized form of X i in section 2.1 and may additionally include dummy variables denoting the lagged values Y~t-~, Y~t-2 . . . . and duration of observed states. The Heckman model of the form P~i,- x;,[3, t = 1. . . . . T, which in matrix form is written as p~; = XJ3, is then written as ~J'~i =
I-Ixi
with
x i,
!' H
0
...
0 )
[3' 0
... ...
0 [3'
Xi2
,
)Ci ~
\XiT
If the regression parameters are not serially constant, [3' in the first row is ret placed by [3 l, in the second row by [3'2, and so forth. Together with the specification of E through a model for unobserved heterogeneity and for serial correlation, the preceding specification of Yi-~l + Hxi + Ei yields a conditional mean and covariance structure in the latent variable vector Yi, with Yi " ~ t~(~/+ IIxi, E). The model is now extended by allowing not only the dichotomous threshold model of Equation (4), but any one of the following threshold models that maps Yit onto the observed variable y;, (cf. Schepers, Arminger, & Kiisters, 1991). For convenience, the case index i = 1. . . . . n is omitted. 9 Variable Yt is metric (identity relation). Examples are variables such as monthly income and psychological test scores. y, = y,.
(6)
60
GerhardArminger
9 Variable Yt is ordered categorical with unknown thresholds n-t,~ < -rt,2 < . . . < "rt,K, and categories y, = 1. . . . . K t + 1 (ordinal probit relation; McKelvey & Zavoina, 1975). Examples are dichotomous variables such as e m p l o y m e n t (employed vs. unemployed) and five-point Likert scales with categories ranging from "I like very much" (1) to "I don't like at all" (5). yt - k r
yt
E[Tt,k_I, Tt,k)
with ['r,,o, n,, l) - (-oc, ,r,,l) and "r,,K,+l -- +oo.
(7)
Note that for reasons of identification, the threshold "rt,, is set to 0 and the variance of the reduced form error term o52 is set to 1 The parameters in [3, are only identified up to scale. If one considers simultaneous equation models or analyzes two or more panel waves simultaneously, only hypotheses of proportionality of regression coefficients across equations can be tested in general. Hypotheses of equality of regression coefficients across equations can only be tested under additional, and sometimes nontestable, assumptions (Sobel & Arminger, 1992). 9 Classified metric variables may be treated analogously to the orginal probit case with the difference that the class limits are now used as known thresholds (Stewart, 1983). No identification restrictions are necessary. An important example is grouped income with known class boundaries. 9 Variable Yt is one-sided censored with a threshold value q't, 1 known a priori (tobit relation; Tobin, 1958). The classical example is household expenditure for durable goods, such as cars, when only a subset of the households in a sample acquires a car during a given time interval. Yt Y' = ['rt, ,
if Yt > "rt,1 ifyt ( d t - ~ . , - d,_,), then the shock is positive, that is, the demand is higher than the past expectation; otherwise, the shock is 0 or negative. The effect of s t may be estimated by setting the parameter of the demand (dt - dt-~) to [3,3 and the parameter of the past business expectation (dr.,-~ - d r - ~ ) to -[3t3. Note that the variables Ay,, (a,_ 1 - a,_ 1), (d,.,_ l - dr), and ( d , - dr_ 1) are only observed at an ordinal scale where the following observation rule is supposed to hold:
Ot =
i
if if if
Ayt_ "r~21).
The observation rules for AB t, GL t, and D t are analogous. For identification, the first threshold is set to 0. Because AB t, GL t, and Dt are only observed at an ordinal scale, it is assumed that ( a t - l - at*-l), (dt,t-~ - dr), and ( d r - d r - ~ ) are endogenous variables with means conditioned by the exogenous variables In K,
\
In ~
rt-I/
,
In
:,-I
, u t,
71
3. Probit Model and Limited Dependent Panel Data
and v, and a multivariate normal error vector that is uncorrelated with % The model for all endogenous variables in the first wave is therefore given by a0 - a0 d21 - dl d I - do
IX1 [&2 -
d l , o - do Yl - Yo
q-
~4
~5 Yll
-q-
~1,3
Y12
0 0
0 0
0 0
0 0
0
0
0
0
0 [~51
0 ~52
0 [~53 -InK
Y14
Y,5~
/ln
"~31
"~32
'~34
'~/42
~44
Y35 ] Y45 ]
~
'~/41 Y51
Y52
Y54
Y55 /
Ik
\
0 0
a0 - a0 d2, l - d l
0
d I - do
0 0
dl, o - d o Yl - Yo
ra
In
f~
+
.
(41)
u1 P1
The models for wave 2 and wave 3 are constructed in similar ways. Note that the variable GL occurs two times in the first wave as GL o and GL~. In the second wave, GL~ must be taken from the first wave. Hence the whole model consists of 13 equations. The focus here, however, is only on Equations 5, 9, and 13, that is, y~ - Y o , Y 2 - Y ~ , and Y3 -- Y2. The first model does not take into account restrictions by proportionality of coefficients for each wave and unobserved heterogeneity. The parameter estimates obtained from MECOSA are shown in Table 2. The pseudo R2s of McKelvey and Zavoina (1975) show that only a small portion of the variance of the output is explained. The output increases in the second wave in comparison to the first and third waves. Judged by the z-values, the variables stock order (AB) and surprise effect are more important than business expectation (GL). The firms react primarily to shocks of the recent past. If the shock has been positive, more output is produced. The variables stock of raw materials and stock of unsold finished products are of lesser importance than the dependence on the state of the period before. Here, however, only the decrease of output in the past period matters. The covariances between the errors are rather small, indicating that the assumption of uncorrelated errors over time given the former states is correct. In Table 3, the results of the restricted parameter estimation under the hypothesis of proportionality of the regression coefficients except for the constants and the effect of the number of employees are shown. The hypothesis of proportionality is not rejected at the 0.05 test level. The error variance of the second wave is greater in absolute terms than the error variances of the first and second waves as judged from the inverse of the proportionality coefficient ~.
72
GerhardArminger
Unrestricted Parameter Estimates for IFO Output Model
TABLE 2
Explanatory variables
"r~ a"2 IX (a,_~ - a~;~) d,+ ~., - d, s, In K In
Wave 1
2.175 0.604 0.439 0.053 0.413 --0.049
r,
0 (28.989) (1.613) (5.685) (1.046) (10.993) (-- 1.433)
Wave 2
2.175 1.147 0.305 0.089 0.312 --0.023
0 (28.989) (3.281) (4.161) (2.319) (9.844) (--0.707)
0.097 (0.679)
--0.077 (0.481)
Wave 3
2.175 0.599 0.293 0.125 0.385 0.015
0 (28.989) (2.793) (6.236) (2.822) (13.532) (0.468)
0.086(0.632)
Ft-- 1
f, ft-~ u, v,
0.238 (2.973)
-0.031 (-0.369)
-0.058 (-0.796)
-0.543 (-3.040) 0.129 (1.039)
-0.834 (-5.916) -0.204 (-1.697)
-0.227 (-2.386) 0.429 (3.217)
R 2MZ
0.089
0.184
0.116
0.514 O. 127 0.045
0.715 -0.010
0.616
In
Covariances Wave l Wave 2 Wave 3 Note: z-values are in parentheses.
5. CONCLUSION The models discussed in this chapter serve as an introduction to the wide range of models that have been and may be formulated for the analysis of nonmetric longitudinal data with few time points observed in many independent sample elements. Other models include extensions of legit models and of loglinear models for count data (cf. Hamerle & Ronning, 1995). The estimation methods given here have been proven to be useful. However, new approaches for estimation are emerging. Beck and Gibbons (1994) exploit new techniques for highdimensional numerical integration. Muth6n and Arminger (1994) report first results for using the Gibbs sampler to describe the a posteriori distribution of parameters in MECOSA-type models.
ACKNOWLEDGMENTS I am grateful to the IFO Institute Munich for providing the business test data and to Professor G. Ronning of the University of Ttibingen and to R. Jung of the University of Konstanz for preparing
3. Probit Model and Limited Dependent Panel Data
TABLE3
Restricted Parameter Estimates for IFO Output Model
Explanatory variables
-rI "r2 tx (a,_~ - a~"__~) d,+~.,- d, s, In K In
73
Wave 1
2.133 0.850 0.393 0.097 0.413 --0.073
0 (30.454) (2.514) (9.130) (3.3169) (13.255) (--2.292)
Wave 2
2.133 1.212 0.393 0.097 0.413 --0.036
0 (30.454) (3.051) (9.130) (3.3169) (13.255) (--0.881)
Wave 3
2.133 0.731 0.393 0.097 0.413 0.006
0 (30.454) (3.072) (9.130) (3.3169) (13.255) (0.172)
r, rt-l
0.125 (1.505)
0.125 (1.505)
0.125 (1.505)
In f' fr-J u, v,
0.036 (0.774)
0.036 (0.774)
0.036 (0.774)
--0.501 (--7.466) 0.014 (0.224)
--0.501 (--7.466) 0.014 (0.224)
--0.501 (--7.466) 0.014 (0.224)
0.753
0.908
df
15
X2 Statistic
5.781
Note. z-values are in parentheses.
the data. Helpful comments on an earlier version of the chapter have been provided by an unkown reviewer. Comments should be sent to Gerhard Arminger, Bergische Universit~it Wuppertal, Department of Economics (FB 6), D-42097 Wuppertal, Germany.
REFERENCES Andersen, E. B. (1973). Conditional inference and models for measuring. Copenhagen: Mentalhygiejnisk Forsknings Institut. Arminger, G. (1992). Analyzing panel dam with non-metric dependent variables: Probit models, generalized estimating equations, missing data and absorbing states, (Discuss. Pap. No. 59). Berlin: Deutsches Institut for Wirtschaftsforschung. Arminger, G., & Ronning, G. (1991). Ein Strukturmodell ftir Preis-, Produktions- und Lagerhaltungsentscheidungen von Firmen, IFO-STUDIEN. Zeischrift fiir Empirische Wirtschaftsforschung, 37, 229-254. Bhargava, A. and Sargan, G. D. (1983). Estimating Dynamic Random Effect Models from Panel Data Covering Short Time Periods, Econometrica, 51, 1635-1359. Bock, R. D., & Gibbons, R. D. (1994). High-dimensional multivariate probit analysis. Unpublished manuscript. University of Chicago, Department of Psychology. Flaig, G., Licht, G., & Steiner, V. (1993). Testing for state dependence effects in a dynamic model of male unemployment behavior (Discuss. Pap. No. 93-07). Mannheim: Zentrum ftir Europ~iische Wirtschaftsforschung Gmbh. Hamerle, A., & Ronning, G. (1995). Analysis of discrete panel data. In G. Arminger, C. C. Clogg,
74
GerhardArminger
& M. E. Sobel (Eds.), Handbook of statistical modeling for the behavioral sciences. (pp. 401-451). New York: Plenum. Heckman, J. J. (1981a). Statistical models for discrete panel data. In C. F. Manski & D. McFadden (Eds.), Structural analysis of discrete data with econometric applications (pp. 114-178). Heckman, J. J. (198 l b). The incidental parameters problem and the problem of initial conditions in estimating a discrete time-discrete stochastic process. In C. E Manski & D. McFadden (Eds.), Structural analysis of discrete data with econometric applications (pp. 179-195). Hsiao, C. (1986). Analysis of panel data. Cambridge, MA: Cambridge University Press. Keane, M. E, & Runkle, D. E. (1992). On the estimation of panel-data models with serial correlation when instruments are not strictly exogenous. Journal of Business & Economic Statistics, 10(1), 1-29. Kiefer, J., & Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887-906. KiJsters, U. (1987). Hierarchische Mittelwert- und Kovarianzstrukturmodelle mit nichtmetrischen endogenen Variablen. Heidelberg: Physica Verlag. McKelvey, R. D., & Zavoina, W. (1975). A statistical model for the analysis of ordinal level dependent variables. Journal of Mathematical Sociology, 4, 103-120. Muth6n, B. O. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49, 115-132. Muth6n, B. O., & Arminger, G. (1994). Bayesian latent variable regression for binary and continuous response variables using the Gibbs sampler. Unpublished manuscript, UCLA, Graduate School of Education. Neyman, J., & Scott, E. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1-32. Rosett, R. N., & Nelson, F. D. (1975). Estimation of the two-limit probit regression model. Econometrica, 43, 141-146. Schepers, A., & Arminger, G. (1992). MECOSA: A program for the analysis of general mean- and covariance structures with non-metric variables, user guide. Fraunenfeld, Switzerland: SLI-AG. Schepers, A., Arminger, G., & Kiisters, U. (1991). The analysis of non-metric endogenous variables in latent variable models: The MECOSA approach. In E Gruber (Ed.), Econometric decision models: New methods of modeling and applications (pp. 459-472). Heidelberg: Springer-Verlag. Sobel, M., & Arminger, G. (1992). Modeling household fertility decisions: A nonlinear simultaneous probit model. Journal of the American Statistical Association, 87, 38-47. Stewart, M. B. (1983). On least squares estimation when the dependent variable is grouped. Review of Economic Studies, 50, 737-753. Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica, 26, 24-36. Wagner, G., Schupp, J., & Rendtel, U. (1991). The Socio-Economic Panel (SOEP)for GermanyMethods of production and management of longitudinal data (Discuss. Pap. No. 3 l a). Berlin: Deutsches Institut ftir Wirtschaftsforschung.
PART 2
Catastrophe Theory
Catastrophe Analysis of Discontinuous Development

Han L. J. van der Maas and Peter C. M. Molenaar
University of Amsterdam
The Netherlands
1. INTRODUCTION

The theme of this book is categorical data analysis. Advanced statistical tools are being presented for the study of discrete things behaving discretely. Many relevant social science questions can be captured by these tools, but there is also a tension between categorical data analysis and noncategorical data analysis, which centers on ordinal-, interval-, and ratio-scaled data. Zeeman (1993) presents an original view on this conflict (see Table 1). He distinguishes four types of applied mathematics according to whether things are discrete or continuous, and whether their behavior is discrete or continuous. Of special interest is Pandora's box. In contrast to the other three types of applied mathematics, discrete behavior of continuous things often gives rise to controversies. The mathematical modeling of music, the harmonics of vibrating strings, led to a major debate in the eighteenth century. The foundations of quantum theory are still controversial. A final example is catastrophe theory, which is concerned with the modeling of discontinuities. Researchers in developmental psychology are well aware of the controversial nature of discontinuities in, for example, cognitive development.
TABLE 1
Four Types of Applied Mathematics

                                        Things
Behavior      Discrete                              Continuous
Discrete      Dice, Symmetry                        Music, harmony; Light; Discontinuities
              DISCRETE BOX                          PANDORA'S BOX
              Finite probability, Finite groups     Fourier series, Quantum theory,
                                                    Catastrophe theory
Continuous    Planets, Populations                  Waves, Elasticity
              TIME BOX                              CONTINUOUS BOX
              Ordinary differential equations       Partial differential equations

Note. From Zeeman (1993).
Note that categorical behavior is not explicitly included in Zeeman's boxes. Categorical data, nominal measurements, should be labeled as discrete, whereas measurements on an ordinal, interval, or ratio level are covered by continuous behavior. Probably, the critical distinction here is between quantitative and qualitative differences in responding. If this distinction is crucial, then the definition of discreteness in catastrophe theory includes categorical data. We will not discuss Zeeman's classification further. We hope that it illustrates the relationship between categorical data analysis and this chapter, the goal of which is to present our results, obtained by the application of catastrophe theory, to the study of conservation acquisition. Conservation acquisition is a kind of benchmark problem in the study of discontinuous cognitive development. Discontinuous development is strongly associated with the stage theory of Piaget (1960, 1971; Piaget & Inhelder, 1969). The complexity of debate on this theory is enormous (the Pandora box), and in our presentation we will necessarily neglect a large body of literature on alternative models, other relevant data, and severe criticism of Piaget's work. It is common knowledge that Piaget's theory, especially concerning stages and discontinuities, has lost its dominant position in developmental psychology. Yet, we will argue that our results generally confirm Piaget's ideas on the transition from nonconservation to conservation. We start with a short introduction to catastrophe theory, issues in conservation research, and alternative models of conservation acquisition. We proceed by explaining our so-called "cusp" model of conservation acquisition and the possibilities for testing this model. The last part of this chapter discusses the
experiments that we conducted to collect evidence for the model. We first explain our test of conservation, a computer variant of the traditional test. Finally, data from a cross-sectional and from a longitudinal experiment are discussed in relation to three phenomena of catastrophe models: sudden jumps, anomalous variance, and hysteresis.
2. CATASTROPHE THEORY

There are several good introductions to catastrophe theory. In order of increasing difficulty, we can recommend Zeeman (1976), Saunders (1980), Poston and Stewart (1978), Gilmore (1981), and Castrigiano and Hayes (1993). Catastrophe theory is concerned with the classification of the equilibrium behavior of systems in the neighborhood of singularities of different degrees. Singularities are points where, besides the first derivative, higher order derivatives of the potential function are zero. The mathematical basis of catastrophe theory consists of a proof that the dynamics of systems in such singular points can be locally modeled by seven elementary catastrophes (for systems with up to four independent variables). The elementary behavior of systems in singular points depends only on the number of independent variables. In the case of two independent variables, systems characterized by singular behavior can be transformed (by a well-defined set of transformations) to the so-called cusp form. Our preliminary model of conservation acquisition is formulated as a cusp model.

In contrast to the well-known quadratic minima, the equilibrium behavior of singular systems is characterized by strange behavior such as sudden jumps and splitting of equilibria. The quadratic minimum is assumed in the application of linear and nonlinear regression models. At each point of the regression line a normal distribution of observed scores is expected. In contrast, singular minima lead to bimodal or multimodal distributions. If the system moves between modes, this change is called a qualitative change (of a quantitative variable). The comparison with regression analysis is helpful. As in regression models, catastrophe theory distinguishes between dependent (behavioral) and independent variables (control variables). These variables are related by a deterministic formula which can be adjusted for statistical analysis. The cusp catastrophe is denoted by
V(X; a, b) = 1/4 X^4 - 1/2 a X^2 - b X,    (1)

which has as equilibria (first derivative set to zero):

X^3 - a X - b = 0.    (2)
If we compare the latter function to regression models in the general form of
X = f(a, b),    (3)
we can see an important difference. The cusp function, Equation (2), is an implicit function, a cubic that cannot be formulated in the general regression form. This difference has far-reaching consequences. In the cusp, sudden jumps occur under small continuous variations of the independent variables a and b. In contrast, in regression models, either linear or nonlinear, small continuous variation of the independent variables may lead to an acceleration in X but not to genuine discontinuities. Of course f, in the regression model, can be an explicit discontinuous threshold function, but then a purely descriptive position is taken. In catastrophe functions, the equilibrium surfaces are continuous (smooth), hence discontinuities are not built in. In catastrophe theory, qualitative changes in quantitative behavior are modeled in a way that is clearly distinguished not only from (nonlinear) regression models but also from Markov chain models (Brainerd, 1979) and special formulations of Rasch models (Wilson, 1989).
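To make the contrast concrete, the following sketch (ours, not part of the original chapter; the values of a and b are purely illustrative) computes the equilibria of Equation (2) for a fixed value of a and a sweep of b. Inside the bifurcation set the cubic has three real roots, two stable modes and one inaccessible mode; tracking the occupied mode while b is swept up and then down produces the sudden jump at different positions in the two directions, that is, hysteresis.

```python
import numpy as np

def equilibria(a, b):
    """Real roots of X^3 - a*X - b = 0, the equilibria of the cusp (Equation 2)."""
    roots = np.roots([1.0, 0.0, -a, -b])
    return np.sort(roots[np.abs(roots.imag) < 1e-9].real)

a = 3.0                          # hypothetical splitting value, inside the cusp region
sweep = np.linspace(-3, 3, 13)   # hypothetical sweep of the normal variable b

for direction, bs in [("up", sweep), ("down", sweep[::-1])]:
    x = equilibria(a, bs[0])[0] if direction == "up" else equilibria(a, bs[0])[-1]
    for b in bs:
        eq = equilibria(a, b)
        # Stay on the nearest equilibrium; a jump occurs only when the occupied mode vanishes.
        x = eq[np.argmin(np.abs(eq - x))]
        print(f"{direction:4s}  b = {b:+.1f}  modes = {len(eq)}  X = {x:+.2f}")
```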
3. ISSUES IN CONSERVATION

Conservation is the invariance of certain properties in spite of a transformation of form. An example of a conservation of liquid task is shown in Figure 1. The liquid of one of two identical glasses (B) is poured into a third, nonidentical glass (C). The child's task is to judge whether the quantity in C is now more than, equal to, or less than in A. A conserver understands that the quantities remain equal; a nonconserver systematically chooses the glass with the highest level as having more. A typical property of conservation tests is that the lowest test score is systematically below chance score.
FIGURE 1. A standard equality item of conservation of liquid. In the initial situation A and B have to be compared; in the final situation, after the transformation, A and C are to be compared (as indicated by the arrows below the glasses).
Piaget and co-workers used conservation repeatedly as an example to explain the general equilibration theory. Piaget's idea of epigenetic development implies that children actively construct new knowledge. In conservation development, a number of events can be differentiated:

(a) focusing on the height of the liquid columns only (nonconservation)
(b) focusing on the width of the liquid columns only
(c) the conflict between the height and the width cue
(d) the conflict between the height cue and the conserving transformation
(e) constructing new operations, f and g
(f) focusing on the conserving transformation
(g) understanding of compensation (multiplication of height and width)

In Piaget's model, the sequence of events is probably a, b, c, e, f, and g. In the Piagetian view, f and g are connected. Whether d should be included, or whether d is connected to c, is unclear. Many authors give alternative sequences and deny the active construction of the new operations. Bruner (1967), for instance, argues that f and g already exist in the nonconservation period but are hampered by the perceptual misleading factor. Peill (1975) gives an excellent overview of the various proposals about the sequence of events a to g.

Thousands of studies have been conducted to uncover the secrets of conservation development. Many models have been introduced which have been tested in training settings and in sequence and concurrence designs by test criteria varying from generalization to habituation. Mostly, consensus has not been reached. In our opinion, conservation (or nonconservation) cannot be reduced to a perceptual, linguistic, or social artifact of the conservation test procedure, although perceptual, linguistic, and social factors do play a role, especially in the transitional phase. In a later section, we will explain how we developed a computer test of conservation to overcome the major limitations of clinical conservation testing.

An important approach to the study of conservation focuses on conservation strategies (also rules and operations). At present, the best methodology for assessing conservation and several related abilities has been introduced by Siegler (1981, 1986). He uses several item types to distinguish between four rules. These rules roughly coincide with Piaget's phases in conservation development.

Rule 1: Users focus on the dominant dimension (height).
Rule 2: Users follow rule 1 except when the values of the dominant dimension are equal; then they base their judgment on the subordinate dimension (width).
Rule 3: Users do as rule 2 users except when the values of both the dominant and the subordinate dimensions differ; then they guess.
Rule 4: Users multiply the dimensional values (in the case of the balance beam) and/or understand that the transformation does not affect the quantity of liquid. Multiplying dimensions (or compensation) is
correct on conservation tasks too, but more difficult and rarely used (Siegler, 1981).

Siegler's methodology has been criticized mainly for rule 3. Guessing (muddling through) is a rather general strategy of another order. Undoubtedly, it can occur, and should be controlled for. Several alternative strategies have been proposed, for instance, addition of dimensions (Anderson & Cuneo, 1978), maximization (Kerkman & Wright, 1988), and the buggy rule (van Maanen, Been, & Sijtsma, 1989). The validation of these alternatives requires many test items but is not impossible.

Another important concept in the study of conservation is that of the cognitive conflict. Piaget himself associated the concept of cognitive conflict with conservation. Most authors agree that conservation performance on Piagetian conservation tasks should be explained as (or can be described as) a conflict between perceptual and cognitive factors. Concepts such as salience, field dependency, and perceptual misleadingness have been used to define the perceptual factor in conservation (Bruner, 1967; Odom, 1972; Pascual-Leone, 1989). Siegler applies it in the dominant/subordinate distinction. We will use this factor as an independent variable in our model. The cognitive factor is less clearly understood. It can be defined as cognitive capacity, short-term memory, or as a more specific cognitive process such as learning about quantities. Also, the very general concept of maturation can be used. In cusp models, one has to choose two independent variables. In the case of conservation, this very important choice is difficult, which is one of the reasons that we present our cusp model as preliminary.

Flavell and Wohlwill (1969) took Piaget's view on the acquisition of conservation as a starting point. Piaget distinguished between four phases: nonconservation, pretransitional, transitional, and conservation. Flavell and Wohlwill (1969) formulated these Piagetian phases in the form of a simple equation:

P(+) = Pa * Pb^(1-k),    (4)
where P(+) is the probability that a child with person parameters Pa and k and item parameter Pb succeeds on a particular task. Pa is the probability that the relevant operation is functional in a given child. Pb is a difficulty coefficient between 0 and 1. The parameter k is the weight to be attached to the Pb factor in a given child. In the first phase, Pa and k are supposed to be 0, and consequently P(+) is 0. In the second phase, Pa changes from 0 to 1, whereas k remains 0. For a task of intermediate difficulty (Pb = .5), P(+) = .25. According to Piaget, and Flavell and Wohlwill add this to their model, the child should manifest oscillations and intermediary forms of reasoning. Notice that this does not follow from the equation. The third phase is a period of stabilization and consolidation: Pa is 1 and k increases. In the fourth phase, Pa and k are 1 and full conservation has been reached.
Of course, this model also has its disadvantages. It predicts a kind of growth curve that is typical for many models, and it predicts little besides what it assumes. In addition, the empirical verification suffers from the lack of operationalization of the parameters. As Flavell and Wohlwill admit, task analysis will not easily lead to an estimation of Pb; Pa is a nonmeasurable competence factor; and k does not have a clear psychological meaning and therefore also cannot be assessed. Yet, the reader will notice that this 25-year-old model has a close relationship to modern latent trait models. The simplest one, the Rasch model, can be denoted as

P(+) = 1 / (1 + e^-(Pa - Pb)).    (5)
Pa is the latent trait and Pb is the item difficulty. Several modifications are known. Parameters for discrimination, guessing, and so on, can be added. Pa and Pb have a similar meaning in both models. In latent trait models, one cannot find a direct equivalent for k, although some latent trait models contain person-specific weights for item parameters. If we compare Equations (4) and (5), we can conclude that Equation (4) is a kind of early latent trait model.

The theory of latent trait models has developed into a dominant approach in current psychometrics. Much is known of the statistical properties of these models. The most appealing advantage is that item parameters can be estimated independent of the sample of subjects, and person parameters independent of the sample of items. Besides assumptions such as local independence, some constraints on the item-specific curves are required. For instance, in the Rasch model, item curves have equal discriminatory power. More complex latent trait models do not demand this, but other constraints remain. The important constraint for the discussion here is that in all latent trait models, for fixed values of Pa and Pb, one and only one value of P(+) is predicted. The item characteristic curve suggested by catastrophe theory does not accommodate this last constraint. Hence, we can conclude that the behavior in the Flavell and Wohlwill model, similar to more advanced latent trait models (e.g., Saltus, proposed by Wilson, 1989), does not vary discontinuously. The behavior may show a continuous acceleration in the increase of P(+), but is not saltatory in a strict sense. A second important conclusion is that the model of Flavell and Wohlwill, contrary to their suggestion, does not predict the oscillations and intermediary forms of reasoning in the transitional phases. We will see that the cusp model explicitly predicts these phenomena. This latter point is also of importance with regard to the Markov models of Brainerd (1979) and the transition model of Pascual-Leone (1970). In Molenaar (1986) it is explained why such models do not test discontinuities. Of course, many more models of conservation have been proposed, often in a nonmathematical fashion; but as far as they concern discontinuity explicitly, it seems that they will be in
accordance with the catastrophe interpretation or they will not, that is, belong to the class of models discussed in this chapter.
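To illustrate the point numerically, the short sketch below (ours; all parameter values are arbitrary and purely illustrative) evaluates Equations (4) and (5). For any fixed parameter values, both models return exactly one value of P(+), and that value changes smoothly as the person parameter increases, which is the constraint the cusp model does not share.

```python
import math

def p_flavell_wohlwill(pa, pb, k):
    """Equation (4): P(+) = Pa * Pb**(1 - k)."""
    return pa * pb ** (1.0 - k)

def p_rasch(pa, pb):
    """Equation (5): P(+) = 1 / (1 + exp(-(Pa - Pb)))."""
    return 1.0 / (1.0 + math.exp(-(pa - pb)))

# Arbitrary illustrative values for a task of intermediate difficulty.
for pa in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"Flavell-Wohlwill: Pa = {pa:.2f}, Pb = .5, k = 0 -> P(+) = {p_flavell_wohlwill(pa, 0.5, 0.0):.3f}")
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"Rasch: Pa = {theta:+.1f}, Pb = 0 -> P(+) = {p_rasch(theta, 0.0):.3f}")
```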
4. THE CUSP MODEL

In this section, we explain the cusp model in general and its specification as a model of conservation. The type of cusp model we apply, typified by a rotation of axes, is also found in Zeeman (1976). It is especially useful if the dynamics are understood in terms of a conflict. In our application of the cusp model, we use perceptual misleadingness and cognitive capacity as independent variables and conservation as a dependent variable. The model is represented in Figure 2.
FIGURE 2. The cusp model of conservation acquisition holds two independent and one behavioral variable: perceptual misleadingness, cognitive capacity, and conservation, respectively. The cusp surface is shown above the plane defined by the independent variables. The cusp surface has a folded form at the front that vanishes at the back of the surface. The folding implies that for certain values of the independent variables, three values of the behavioral variable, that is, three modes, are predicted. The mode in the middle is unstable and repelling and therefore called inaccessible. The remaining two modes lead to bimodality. The area in the control plane, defined by the independent variables, for which more behavioral modes exist is the bifurcation set. If the system is in its upper mode and moves from right to left, a sudden jump to the lower mode will take place at the moment that the upper mode disappears. If the system then moves to the right, another jump to the upper mode will occur when leaving the bifurcation set at the right. This change in the position of the jumps is hysteresis. The placement of groups R, NC, TR, and C is explained in the text. Van der Maas and Molenaar (1992). Copyright 1992 by the American Psychological Association. Adapted by permission of the publisher.
A three-dimensional surface consisting of the solutions of Equation (2) represents the behavior for different values of the independent variables. The independent variables define the control plane beneath the surface. The bifurcation set consists of the values of the independent variables for which three modes of behavior exist. Two of the three modes are stable; the one in the middle is unstable. Outside the bifurcation set, only one mode exists. Changes in the independent variables lead to continuous changes in the behavior variable, except at places on the edges of the bifurcation set; there, where the upper or lower mode of behavior disappears, a sudden jump takes place to the other stable mode. In terms of conservation, we interpret this to mean that an increase in cognitive capacity, or a decrease in perceptual misleadingness, leads to an increase of conservation, in most cases continuously and sometimes in a saltatory fashion. The reverse process is also possible, but is usually not the dominant direction in development. However, artificial manipulation of cognitive capacity (by shared attention tasks) or misleadingness (by stimulus manipulation) should lead to saltatory regressions.

This cusp model suggests the discrimination of four groups of children. At the neutral point, the back of the surface, the probability of a correct response equals chance level. Both independent variables have low values. At the left front, the scores are below chance level. Perceptual misleadingness is high, capacity is low. In the middle, both independent variables are high. Two behavior modes are possible, above and below chance level. Finally, at the right front, the high scores occur. Here, perceptual misleadingness is low and capacity is high. Associated with these states are four groups which we call the residual group (R), nonconservers (NC), the transitional group (TR), and conservers (C), respectively. The residual group consists of children who guess or follow irrelevant strategies because of a lack of understanding of test instructions or a lack of interest. This group is normally not incorporated into models of conservation development. In this model, it is expected that this group is characterized by low values for both independent variables. The nonconserver and conserver children are included in each model of conservation development. Most models also include a transitional group, defined, however, according to all kinds of criteria. In the cusp model, the transition period is defined as the period during which the subjects stay in the bifurcation set. In this period, the possibility of the sudden jump is present. The sudden jump itself is part of this transition phase. The transition includes more than just the sudden jump. There are at least seven other behavioral phenomena typical for the transition period (see section 4.2).
4.1. Routes Through Control Plane

The classification of four groups suggests an expected sequence, but this does not directly follow from the cusp model. We need an additional assumption on
the direction of change of the independent variables as a function of age (or as a function of experimental manipulation). To get at a sequence of residual, nonconservation, transitional, and conservation, it should be assumed that at first both factors are low; then perceptual misleadingness increases; third, cognitive capacity increases too; and fourth, perceptual misleadingness decreases back to a low level. We discuss this assumption because from the literature we know that very young children indeed score at chance level, 4- to 6-year-olds score below chance level, and later on, they score above chance level (McShane & Morrison, 1983). The phases of the model of Flavell and Wohlwill (1969) coincide with the nonconservation, transitional, and conservation sequence. The phase in which very young children score at chance level is included neither in their model nor in the model of Piaget. This is partly because of issues of measurement. Guessing is not helpful on the classical Piagetian test of conservation, in which valid verbal arguments are required. Other test criteria (the judgment-only criterion or measures of looking time and surprise) allow for false positive classifications, whereas Piaget's criterion suffers from false negatives. In the cusp model, we assume the use of the judgment-only criterion of conservation. What should be clear now is that some assumption on the change of the independent variables is necessary to specify a sequence of behavior. Of course, this is also required in the model of Flavell and Wohlwill (1969).
4.2. Test of the Cusp Model: The Catastrophe Flags

The transitional period, associated with the bifurcation set, is characterized by eight catastrophe flags derived from catastrophe theory by Gilmore (1981). Three of these flags can also occur outside the bifurcation set and may predict the transition. Together, they can be used to test the model. These catastrophe flags are sudden jump, bimodality, inaccessibility, hysteresis, divergence, anomalous variance, divergence of linear response, and critical slowing down. Some are well known in developmental research, others only intuitively or not at all.

The sudden jump is the most obvious criterion. In the case of developmental research, however, it is quite problematic. Empirical verification of this criterion requires a dense time-series design. In spite of the many studies on conservation development, a statistically reliable demonstration of the sudden jump in conservation ability is lacking. In the following, we present data that demonstrate the sudden jump.

Bimodality is also known in conservation research. In van der Maas and Molenaar (1992), we present a reanalysis of the data of Bentler (1970) which clearly demonstrates bimodality. Although this seems to be the simplest criterion, applicable to cross-sectional group data of the behavior variable only, some
issues arise. In the prediction of bimodal score distributions, we combine two flags, bimodality and inaccessibility. Inaccessibility is implied by the unstable mode in between two stable modes. In a mixture distribution model (Everitt & Hand, 1981), this inaccessible mode is missing. In these models, a mixture of two normal (or binomial, etc.) distributions is fitted to empirical distributions:

F(X) = p N(μ1, σ1) + (1 - p) N(μ2, σ2),    (6)

where N is the normal distribution, p defines the proportions in the modes, and μ and σ are the characteristics of the modes. This mixture model takes five parameters, whereas a mixture of binomials takes three parameters. The number of components can be varied and tested in a hierarchical procedure. An alternative formulation is found in the work of Cobb and Zacks (1985). They apply the cusp equation to define the distributions of a stochastic cusp model:

F(X) = λ e^-(Z^4 - a Z^2 - b Z),    Z = s X - l,    (7)
where a and b (the independent variables) define the form of the distribution, s and l linearly scale X, and λ is the integration constant. The four parameters a, b, s, and l are estimated. It is possible to fit unimodal and bimodal distributions by constraints on the parameters a and b and to compare the fits. The method of Cobb has some limitations (see Molenaar, this volume), and it is difficult to fit to data (e.g., in computing λ). Yet, it takes into account the impact of the inaccessible mode, and the computational problems are solvable. A comparison of the possible forms of Equations (6) and (7) leads to the conclusion that their relationship is rather complex. In Equation (6), distributions can be fitted to data defined by μ1 = μ2 and σ1 >> σ2. Such distributions are not allowed in Equation (7); the modes must be separated by the inaccessible mode.

Hysteresis is easily demonstrated in simpler physical catastrophic processes, but is probably very difficult in psychological experimentation. The degree of the hysteresis effect, the distance between the jumps up and down, depends on the disturbance or noise in the system (leading to the so-called Maxwell condition). Later on we discuss our first attempt to detect hysteresis in conservation.

Divergence has a close relation to what is usually called a bifurcation. In terms of the chosen independent variables it means that if, in the case of a residual child, perceptual misleadingness as well as cognitive capacity are increased, the paths split between the upper and lower sheets, that is, between the high and low scores. Two children with almost similar start values (both independent variables very low) and following the same path through the control plane can show strongly diverging behavior. Again, this is not easily found in an experiment. The manipulation of only one independent variable is already difficult. Another choice of independent variables (e.g., a motivation factor or
optimal conditions factor in another rotation of independent variables) may lead to an empirical test. Anomalous variance is a very important flag. Gilmore (1981) proves that the variance of the behavioral variable increases strongly in the neighborhood of the bifurcation set and that drops occur in the correlation structure of various behavioral measures in this period. In developmental literature, many kinds of anomalous behaviors, from oscillations to rare intermediate strategies, are known. In our view, this prediction, used here as a criterion, is what Flavell and Wohlwill call "manifestations of oscillations and intermediary forms of reasoning." Divergence of linear response and critical slowing down, the last flags, concern the reaction to perturbations. They imply the occurrence of large oscillations and a delayed recovery of equilibrium behavior, respectively. These flags are not studied in our experiments. Reaction time measures are possibly the solution for testing these two flags. The flags differ importantly in strength. Bimodality and the sudden jump, certainly when they are tested by regression analysis and mixture distributions, are not unique for the cusp model. It is very difficult to differentiate acceleration and cusp models by these flags. Inaccessibility changes this statement somewhat. It is unique, but as indicated before, usually incorporated in the bimodality test. Anomalous variance is unique, as other mathematical models do not derive this prediction from the model itself. In this respect, the many instances of oscillations and intermediary forms of reasoning count as evidence for our model. Hysteresis and divergence are certainly unique for catastrophe models. Divergence of linear response and critical slowing down seem to be unique, too. Such a rough description of the strength and importance of each flag is only valid in contrast with predictions of alternative models. In the case of conservation research, bimodality and the sudden jump are thought of as necessary but not sufficient criteria, the others as both necessary and sufficient. The general strategy should be to find as many flags as possible. The demonstration of the concurrence of all flags forms the most convincing argument for the discontinuity hypothesis. Yet, the flags constitute an indirect method for testing catastrophe models. However, the advantage is that, except for divergence and hysteresis, the flags only pertain to the behavioral variable. We suggested a perceptual and a cognitive factor in our cusp model of conservation acquisition, but the test of the majority of flags does not depend on this choice. These flags do answer the question whether a discontinuity, as restrictively defined by catastrophe theory, occurs or not. The definitions of hysteresis and divergence include variation along the independent variables. In the case of the cusp conservation model, we can use variation of perceptual cues of conservation items or divided attention tasks to
force the occurrence of hysteresis and divergence. On the one hand, it is a pity that these appealing flags cannot be demonstrated without knowledge of the independent variables. On the other hand, these flags make it possible to test hypotheses about the independent variables.
4.3. Statistical Fit of Cusp Models

A direct test of the cusp model of conservation acquisition requires a statistical fit of the cusp model to measurements of the dependent as well as the independent variables. The statistical fit of catastrophe models to empirical data is difficult and a developing area of research. In the comparison of Equations (2) and (3), the cusp model and (nonlinear) regression models, we did not discuss the statistical fit. It is common practice to fit regression models with one dependent and two independent variables to data and to test whether the fit is sufficient. Fitting data to the cusp model is much more problematic. There have been two attempts to fit Equation (2) in terms of Equation (3) (Guastello, 1988; Oliva, Desarbo, Day, & Jedidi, 1987). Alexander, Herbert, DeShon, and Hanges (1992) heavily criticize the method of Guastello. The main problem with Guastello's method is the odd characteristic that the fit increases when the measurement error increases. Gemcat by Oliva et al. (1987) does not have this problem but needs an additional penalty function for the inaccessible mode. A different approach is taken by Cobb and Zacks (1985). They developed stochastic catastrophe functions which can be fitted to data. The basic function is shown in Equation (7). If one fits a cusp model, a and b are, for example, linear functions of observed independent variables. Examples of applications of Cobb's method can be found in Ta'eed, Ta'eed, and Wright (1988) and in Stewart and Peregoy (1983). We are currently modifying and testing Cobb's method. Simulation studies should reveal the characteristics of the parameter estimates and test statistics, and the requirements on the data. At present, we have not fit data of measurements of conservation, perceptual misleadingness, and cognitive capacity directly to the cusp model of conservation acquisition.
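As an impression of what fitting along the lines of Cobb and Zacks involves, the sketch below (ours; the parameter values are purely illustrative) evaluates the density of Equation (7) for given a, b, s, and l. It computes the integration constant λ by numerical quadrature, which is the step the text notes as awkward, and then counts the modes of the resulting density.

```python
import numpy as np
from scipy.integrate import quad

def cusp_density(x, a, b, s, l):
    """Density of Equation (7): f(X) = lambda * exp(-(Z^4 - a*Z^2 - b*Z)), with Z = s*X - l."""
    kernel = lambda z: np.exp(-(z**4 - a * z**2 - b * z))
    norm, _ = quad(kernel, -10, 10)          # integration constant (1/lambda) over Z
    z = s * np.asarray(x) - l
    return kernel(z) * abs(s) / norm         # change of variable from Z back to X

# Purely illustrative values; a > 0 places the system inside the bimodal region.
a, b, s, l = 2.0, 0.1, 1.0, 0.0
xs = np.linspace(-3, 3, 601)
f = cusp_density(xs, a, b, s, l)
n_modes = int(np.sum((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])))
print(f"number of modes for a = {a}, b = {b}: {n_modes}")
```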
5. EMPIRICAL STUDIES

5.1. Introduction

In the experiments, we have concentrated on the catastrophe flags. Evidence for bimodality, inaccessibility, and anomalous variance can be found in the literature (van der Maas & Molenaar, 1992). This evidence comes from
predominantly cross-sectional experiments. Analyses of time series from longitudinal experiments have rarely been applied. The focus on cross-sectional research hampers the assessment of the other flags, most importantly, the sudden jump itself. We do not know of any study that demonstrates the sudden jump (or a sharp continuous acceleration) convincingly. This is, of course, a difficult task, as dense time series are required.

The presented evidence for anomalous variance is also obtained from cross-sectional studies. In such studies, conservation scores on some test are collected for a sample of children in the appropriate age range. Children with consistently low and high scores are classified as nonconservers and conservers, respectively. The remaining children are classified as transitional. Then, some measure of anomalous variance is applied (rare strategies, inconsistent responses to a second set of items, inconsistencies in verbal/nonverbal behavior). By an analysis of variance (or better, by a nonparametric equivalent) it is decided whether the transitional group shows anomalous variance more than the other groups. In comparison with the strong demands, in the form of the catastrophe flags, that we put on the detection of a transition, this procedure of detecting transitional subjects is rather imprecise. However, it is a very common procedure in conservation research, and is also applied to decide whether training exclusively benefits transitional subjects. In light of the importance of anomalous variance for our verification, we want a more severe test.

For these reasons, a longitudinal experiment is required. Special requirements are dense time series and precise operationalizations of the flags. This appears to be a difficult task when we rely on the clinical test procedure of conservation. Many retests in a short period create a heavy load on our resources, as well as on the time of the children and the continuation of normal school activities. Moreover, this clinical test procedure has been heavily criticized. We chose to construct a new test of conservation. We describe this test and its statistical properties elsewhere (van der Maas, 1995); therefore, only a short summary is given here.
5.2. Instrument: A Computer Test of Conservation

5.2.1. Items

In the clinical test procedure, the child is interviewed by an experimenter who shows the pouring of liquids to the child and asks a verbal explanation for the judgment. Only a few tasks are used and the verbal justification is crucial in scoring the response and detecting the strategy that is applied by the child. The rule assessment methodology of Siegler (1981), discussed previously, has a different approach. Siegler applies many more items that are designed to detect strategies on the basis of the judgments only. Although verbal justifications have an additional value, the level of conservation performance is determined by simple responses to the items only.
The computer test is based on four out of six item types of Siegler's methodology. We call them the guess equality, guess inequality, standard equality, and standard inequality item types. The guess equality and guess inequality item types compare with Siegler's dominant and equal items. In the guess item types, the dimensions of the glasses are equal. In the equality variant, amounts, heights, and widths of the liquid are equal before and after the transformation. In the inequality variant, one of the glasses has less liquid, is equal in width, but differs in height. In these item types, the perceptual height cue points to the correct response, and the items should therefore be correctly solved by all children. Consequently, these items can be used to detect children who apply guessing or irrelevant strategies. In view of the criticism of the judgment-only criterion concerning the possibility of guessing, items like these should be included.

The standard item types, equality and inequality, compare with the conflict equal and subordinate items of Siegler. The standard equality item is shown in Figure 1. The dimensions of the glasses and liquids differ, whereas the amounts are equal. In the initial situation, the dimensions are clearly equal. Understanding the conserving transformation is sufficient to understand that the amounts are equal in the final situation, although multiplication of dimensions in the final situation suffices as well. The standard inequality item starts with an initial situation in which widths are equal but heights and amounts differ. In the final situation, the liquid of one of the glasses is poured into a glass of such width that the heights of the liquid columns are then exactly equal (see Fig. 3).

5.2.2. Strategies

For all of these item types, a large variation in dimensions is allowed. For example, in the item in Figure 3, the differences in width can be made more or less salient. We expect that these variations do not alter the classification of children. Siegler makes the same assumption. A close examination of the results of Ferretti and Butterfield (1986) shows that only very large differences in dimensional values have a positive effect on classification. Siegler uses six item types (four items each) to distinguish between four rules. We use instead the two standard item types (Figs. 1 and 3) and classify according to the following strategy classification schema (Table 2).
FIGURE 3. A standard inequality item of conservation of liquid.
The conservation items have three answer alternatives: left more, equal, and right more. These three alternatives can be interpreted as the responses highest more, equal, and widest more in the case of equality items, and as the responses equal, smallest more, and widest more in the case of inequality items. The combination of responses on equality and inequality items is interpreted in terms of a strategy. Children who consistently choose highest more on equality items and equal on inequality items follow the nonconserver height rule (NC.h), or, in Siegler's terms, rule 1. Conservers (C), or rule 4 users, choose the correct response on both equality and inequality items. Some children will fail the equality items but succeed on the inequality items (widest more). If they prefer the highest more response on the equality items, they are called NC.p, or rule 2 users in Siegler's terms. If they choose the widest more response, they probably focus on width instead of height. The dominant and subordinate dimensions are exchanged (i.e., from a to b in the list of events in conservation development). We call this the nonconserver width rule (NC.w). The nonconserver equal rule (NC.=) refers to the possibility that children prefer the equal response to all items, including the inequality items. We initially interpret this rule as an overgeneralization strategy. Children, discovering that the amounts are equal on the equality items, may generalize this to inequality items. If this interpretation is correct, NC.= may be a typical transitional strategy. The cell denoted by i1 does not have a clear interpretation. Why should a child focus on width in the case of equality items, and on height in the case of inequality items? This combination makes no sense to us. The same is true for the second row of Table 2, determined by the smallest more response on inequality items. This response should not occur at all, so this row is expected to be empty. The results of the experiments will show how often children apply the strategies related to the nine cells. If a response pattern has an equal distance to two cells, it is classified as a tie. The number of ties should be small. Notice that ties can only occur when more than one equality and more than one inequality item are applied. The possibility of ties implies the possibility of a formal test of this strategy classification procedure.
TABLE 2
Strategy Classification Schema

                                      Standard equality items
Standard inequality items     Highest more     Equal        Widest more
Equal                         NC.h/rule 1      NC.=         i1
Smallest more                 i2               i3           i4
Widest more                   NC.p/rule 2      C/rule 4     NC.w
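To make the schema concrete, the sketch below (ours, not the authors' program; the item coding and the example pattern are hypothetical) assigns a pattern of responses to the standard items to the nearest of the nine ideal patterns in Table 2, and marks a tie when two cells are equally close.

```python
# Per-item coding (hypothetical): 'H' = highest more, 'E' = equal, 'W' = widest more
# for equality items; 'E' = equal, 'S' = smallest more, 'W' = widest more for inequality items.
IDEAL = {                                     # (equality, inequality) response -> cell of Table 2
    ("H", "E"): "NC.h", ("E", "E"): "NC.=", ("W", "E"): "i1",
    ("H", "S"): "i2",   ("E", "S"): "i3",   ("W", "S"): "i4",
    ("H", "W"): "NC.p", ("E", "W"): "C",    ("W", "W"): "NC.w",
}

def classify(eq_responses, ineq_responses):
    """Assign a response pattern to the nearest ideal pattern; equal distance gives a tie."""
    distances = {}
    for (eq_ideal, ineq_ideal), label in IDEAL.items():
        d = sum(r != eq_ideal for r in eq_responses)
        d += sum(r != ineq_ideal for r in ineq_responses)
        distances[label] = d
    best = min(distances.values())
    winners = [lab for lab, d in distances.items() if d == best]
    return winners[0] if len(winners) == 1 else "tie"

# Hypothetical pattern over three standard equality and three standard inequality items.
print(classify(["E", "E", "H"], ["W", "W", "W"]))   # closest to C (rule 4)
```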
The NC.=, NC.w, and i1 rules do not occur in Siegler's classification of rules, although they could be directly assessed with Siegler's methodology. Rule 3 of Siegler, muddling through or guessing on conflict items, does not occur in this classification schema. Proposed alternatives for rule 3, like addition or maximization, are also not included. Additional item types are required to detect these strategies.

5.2.3. Test

Items are shown on a computer screen. Three marked keys on the keyboard are used for responding. They represent this side more (i.e., left), equal, and that side more (right). First, the initial situation appears: two glasses filled with liquid are shown. The subject has to judge this initial situation first. Then, an arrow denotes the pouring of liquid into a third glass, which appears on the screen at the same time as the arrow; the liquid disappears from the second glass and appears in the third. Small arrows below the glasses point to the glasses that should be compared (A and C). In the practice phase of the computer test, the pouring is indicated not only by the arrow but also by the sound of pouring liquid. This schematic presentation of conservation items is easily understood by the subjects.

The application of the strategy classification schema is more reliable when more items are used. We chose to apply one guess equality, one guess inequality, three standard equality, and three standard inequality items in the longitudinal experiment, and two guess equality, two guess inequality, four standard equality, and four standard inequality items in the cross-sectional studies. Guessers are detected by responses to the guess items and to the initial situations of the standard items, which are all non-misleading situations. Two types of data are obtained, number correct and strategies.

Apart from the preceding test items, four other items are applied. We call these items compensation construction items. On the computer screen, two glasses are shown, one filled and one empty. The child is asked to fill the empty glass until the amounts of the liquids are equal. The subject can both increase and decrease the amount until he or she is satisfied with the result. Notice that a conserving transformation does not take place; hence, the correct amount can only be achieved by compensation. We will only refer briefly to this additional part of the computer test. It is, however, interesting from a methodological point of view, as the scoring of these items is not discrete but continuous, that is, on a ratio scale.
5.3. Experiment 1: Reliability and Validity

The first experiment is a cross-sectional study in which 94 subjects ranging in age from 6 to 11 years participated. We administered both the clinical conservation test (Goldschmid & Bentler, 1968) and an extended computer test.
The reliability of the conservation items of the computer test turned out to be .88. Four of 94 subjects did not pass the guess criteria; they failed the clinical conservation test as well. This clinical test consisted of a standard equality and a standard inequality item, scored according to three criteria: Piagetian, Goldschmid and Bentler, and judgment-only. The classifications obtained by these three criteria correlate above .96; hence, to ease presentation, we only apply the Piagetian criterion. The correlation between the numbers correct of the clinical test and the computer test is .75. In terms of classifications, 79 of 94 subjects are classified concordantly as nonconservers and conservers. A more concise summary is given in Table 3. Here we can see that the classifications in NC.h and C do not differ importantly. For the NC.p, NC.=, and NC.b strategies, no definite statements can be made because of the small number of subjects. It seems that the ties appear among conservers and nonconservers on the clinical test. The percentages correct on both tests show that the tests do not differ in difficulty (62% vs. 59%, paired t-test: t = .98, df = 93, p = .33). We can state that the computer test seems to be reliable and valid (if the clinical test is taken as the criterion).
5.4. Experiment 2: Longitudinal Investigation of the Sudden Jump and Anomalous Variance

In the second study, 101 subjects from four classes of one school participated. At the beginning of the experiment, the ages varied between 6.2 and 10.6 years. The four classes are parallel classes containing children of age groups 6, 7, and 8 years.
TABLE 3
Classifications on the Computer Test Versus the Clinical Test Consisting of an Equality and an Inequality Item

Clinical test scores
Strategy on computer test
=
≠
NC.h
NC.p
NC.=
0
0
1
0
19
3
0
1
0
0
0
1
2
2
0
1
1
2
1
2
C
NC.b
Ties
1
0
4
3
30
2
0
0
1
4
3
0
0
0
7
40
1
7
0
53
Guess
Total
Note. Four score patterns of correct (1) and incorrect (0) can occur on the clinical test. These are compared with the strategies found on the computer test (strategies i1 to i4 did not occur at all). The cells contain the number of subjects.
We placed a computer in each class and individually trained the children to use it. During 7 months, 11 sessions took place. Except for the first session, children took the test by themselves as part of their normal individual education. This method of closely following subjects is similar to what Siegler and Jenkins (1989) call the microgenetic method, except for the verbal statements on responses, which they use as additional information. We can call our method a computerized microgenetic method. Its main advantage is that it takes less effort, because, ideally, the investigator has only to back up computer diskettes.

5.4.1. Sudden jump

To find evidence for the sudden jump, we classified subjects into four groups (nonconservers, transitional, conservers, and residual) on the basis of the strategies applied. The transitional subjects are subjects who applied both conserver and nonconserver strategies during the experiment. Twenty-four of the 101 subjects show a sharp increase in the use of conserver strategies. We corrected the time series for the latency of the transition points. The resulting individual plots are shown in Figure 4. This figure demonstrates a very sharp increase in the conservation score. To judge how sharp, we applied a multiple regression analysis in which the conservation score serves as the dependent variable and the session (linear) and a binary template of a jump as independent variables. This latter jump indicator consists of zeros for sessions 1 to 10 and ones for sessions 11 to 19. Together, the independent variables explain 88% of the variance (F(2,183) = 666.6, p = .0001). The t-values associated with the beta coefficients of the independent variables are t = 1.565 (p = .12) and t = 20.25 (p = .0001) for the linear and jump indicators, respectively. Actually, the jump indicator explains more variance than a sixth-order polynomial of the session variable. Yet, this statistical result does not prove that this sharp increase is catastrophic. The data can also be explained by a continuous acceleration model, as the density of sessions over time may be insufficient. What this plot does prove is that this large increase in conservation level takes place within 3 weeks, between two sessions, for 24 of 31 potentially transitional subjects.

For the remaining 70 subjects, classified as nonconservers, conservers, and residual subjects on the basis of strategy use, no significant increases in scores are found. The mean scores on sessions 1 and 11 for these subjects do not differ significantly, F(1,114) = .008, p = .92. The scores of these subjects stay at a constant level. The transitional group shows important individual differences. A few subjects show regressions to the nonconserver responses, some apply rare strategies during some sessions, but the majority show a sharp increase between sessions. One subject jumped within one test session, showing consistent nonconserver scores on all preceding sessions and conserver scores on all subsequent test sessions.
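The regression just described can be reproduced, on data of the same shape, by an ordinary least-squares fit of the aligned conservation scores on an intercept, a linear session term, and a binary jump template. The sketch below is ours; the scores are simulated placeholders for the real aligned series, so the printed numbers will not match the reported statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
sessions = np.arange(1, 20)                    # 19 aligned time points
jump = (sessions >= 11).astype(float)          # 0 for sessions 1-10, 1 for sessions 11-19

n_subjects = 24                                # transitional subjects, stacked into one vector
t = np.tile(sessions, n_subjects).astype(float)
j = np.tile(jump, n_subjects)
y = 1.0 + 5.0 * j + rng.normal(0.0, 0.7, t.size)   # placeholder low-then-high scores

X = np.column_stack([np.ones_like(t), t, j])   # intercept, linear session, jump template
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
r_squared = 1.0 - resid.var() / y.var()
print("coefficients (intercept, session, jump):", np.round(beta, 2))
print("R^2 =", round(r_squared, 3))
```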
FIGURE 4. Individual transition plots aligned to time point 11. On the vertical axis, raw scores on the conservation test are shown. Two guess items and six standard items were used.
5.4.2. Anomalous variance

A special problem in the analysis of anomalous variance is that the conservation test consists of dichotomous items. The variance depends on the test score. We looked for three solutions to this problem, using inconsistency, alternations, and transitional strategies. The inconsistency measure is achieved by an addition to the computer test, a repetition of four standard items at the end of the test. These additional items are not included in the test score. The inconsistency measure is the number of responses that differ from the responses on the first presentation. We do not present the results here. In summary, inconsistencies do occur in the responses of transitional subjects but occur more in the responses of residual subjects. The other two measures concern the application of strategies. The analysis of alternations did not indicate a transitional characteristic. The analysis of strategies, however, did. To explain this, we will now describe the results in terms of strategies. The other measures are discussed in van der Maas, Walma van der Molen, and Molenaar (1995).

The responses to the six standard items are analyzed by the strategy classification schema. The results are shown in Table 4. This table shows some important things. The number of ties is low; hence, in the large majority of patterns, the classification schema applies well. The NC.h and the C strategy are dominant (80%). The other cells are almost empty or concern uncommon strategies. NC.p (rule 2 in Siegler's classification), NC.=, i1, and NC.w make up 14% of the response patterns. Focusing on the subordinate dimension (NC.w) is very rare. A statistical test should reveal whether the small number, 6 if guessers are removed,
TABLE 4
Responses to Six Standard Items Analyzed by Strategy Classification Schema

                                      Standard equality items
Standard inequality items     Highest more       Equal           Widest more
Equal                         NC.h 417 (15)      NC.= 36 (9)     i1 20 (5)
Smallest more                 i2 4 (0)           i3 5 (3)        i4 2 (2)
Widest more                   NC.p 50 (11)       C 258 (4)       NC.w 11 (5)

Ties 43 (8); Missing 265

Note. For 11 sessions × 101 subjects = 1111 response patterns can be classified. There were 265 missing patterns. Forty-three response patterns, ties, could not be classified because of equal distance to the ideal patterns associated with the nine cells. The 801 remaining patterns can be uniquely classified and are distributed as shown in the table. The number of guessers is displayed in parentheses. A response pattern is classified as a guess strategy if two guess items are incorrect or if one guess item and more than 25% of the responses to the non-misleading initial situations of the standard items are incorrect.
can be ascribed to chance or cannot be neglected. Latent class analysis may be of help here.

For the classification of the time series, that is, the response patterns over 11 sessions, we make use of the translation of raw scores into strategies. The classification is rather simple. If NC.h occurs at least once and C does not occur, the series is classified as NC. If C occurs at least once and NC.h does not occur, the series is classified as C. If both C and NC.h occur in the time series, it is classified as TR. The remaining subjects, and those who apply guess and irrelevant (i2, i3, and i4) strategies on the majority of sessions, are classified as R. According to these criteria, there are 31 TR, 42 NC, 20 C, and 8 R subjects. Note that this classification of time series does not depend on the use of the uncommon strategies NC.p, NC.=, i1, and NC.w. Thus we can look at the frequency of use of these strategies in the four groups of subjects (see Table 5). NC.p, rule 2 in Siegler's classification, does not seem to be a transitional characteristic; in most cases, it is used by the subjects of the nonconserver group. The strategies i1 and NC.w are too rare to justify any conclusion. In 15 of 27 cases of application of NC.=, it is applied by the transitional subjects. The null hypothesis that NC.= is distributed evenly over the four groups is rejected (χ2 = 11.3, df = 3, p = .01). Furthermore, seven transitional subjects apply NC.= just before they start using the C strategy. This analysis demonstrates much more convincingly than cross-sectional studies that transitional subjects manifest what has been called intermediary forms of reasoning. Whether we can ascribe the occurrence of the NC.= strategy to anomalous variance is not entirely clear.
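The classification rule just described can be written down directly. The sketch below is ours; it assumes each subject's time series is given as a list of session-level strategy labels using the names from Table 4, and it folds the guess/irrelevant-majority condition into the residual group.

```python
def classify_series(strategies):
    """Classify one subject's series of session strategies into NC, C, TR, or R."""
    irrelevant = {"guess", "i2", "i3", "i4"}
    if sum(s in irrelevant for s in strategies) > len(strategies) / 2:
        return "R"                            # guess or irrelevant strategies on the majority of sessions
    has_nc, has_c = "NC.h" in strategies, "C" in strategies
    if has_nc and has_c:
        return "TR"
    if has_nc:
        return "NC"
    if has_c:
        return "C"
    return "R"                                # remaining subjects

# Hypothetical series over 11 sessions: nonconserver strategies, then NC.=, then conserver strategies.
print(classify_series(["NC.h"] * 6 + ["NC.="] + ["C"] * 4))   # -> TR
```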
TABLE 5
Application of Uncommon Strategies by Nonconservers, Conservers, Transitional, and Residual Subjects

                                     Group
Strategy     NC             C             TR            R
NC.p         21 (5.5%)      4 (2.5%)      7 (2.8%)      7 (11.8%)
NC.w         1 (0.3%)       0 (0%)        3 (1.2%)      2 (3.4%)
NC.=         5 (1.3%)       4 (2.5%)      15 (6.0%)     3 (5.1%)
i1           5 (1.3%)       0 (0%)        7 (2.8%)      3 (5.1%)
Other        349 (91.6%)    149 (95%)     217 (87%)     44 (74.6%)
Total        381 (100%)     157 (100%)    249 (100%)    59 (100%)

Note. Other strategies are C, NC.h, guess, ties, and irrelevant strategies (i2 to i4). Raw numbers and percentages of use in each group are displayed.
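The χ2 test reported above can be reproduced from the counts in Table 5. The sketch below is ours; it takes the expected NC.= frequencies as proportional to the total number of classified patterns in each group, which recovers the reported value of about 11.3 with df = 3.

```python
from scipy.stats import chisquare

nc_equal = [5, 4, 15, 3]             # NC.= counts in the NC, C, TR, and R groups (Table 5)
group_totals = [381, 157, 249, 59]   # total classified patterns per group (Table 5)

total = sum(nc_equal)
expected = [total * g / sum(group_totals) for g in group_totals]

stat, p = chisquare(f_obs=nc_equal, f_exp=expected)
print(f"chi-square = {stat:.1f}, df = {len(nc_equal) - 1}, p = {p:.3f}")
```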
Note that the search for anomalous strategies comes from the fact that traditional variance is not useful in a binomially distributed test score. In this respect, the compensation construction items are of interest. These measure conservation (in fact, compensation) ability on a continuous ratio scale (amount of liquid or height of liquid level). In this measure, a more direct test of anomalous variance may be possible.
5.5. Experiment 3: Hysteresis The last flag that we will discuss here is hysteresis. We build the cusp model of conservation acquisition on the assumption that perceptual and cognitive factors conflict with each other. Hysteresis should occur under systematic increasing and decreasing variation in these factors. Preece (1980) proposed a cusp model that is a close cousin of ours. Perceptual misleadingness plays a role in his model, too. We will not explain Preece's model here, but it differs from our preliminary model, for example, with respect to the behavior modes to which the system jumps. The difference can be described in terms of the conservation events (see section 3). Preece's model concerns event c, whereas our model concerns event d, resulting in a choice between the height more and weight more responses and between the height more and equal responses, respectively. So Preece expects a hysteresis effect between two incorrect responses, whereas we expect one between correct and incorrect responses. The idea suggested by Preece is to vary the misleading cue in conservation tasks. In a conservation of weight test, a ball of clay is rolled into sausages of lengths 10, 20, 40, 80, 40, 20, and 10 cm, respectively (see Table 6). In this test, a child shows hysteresis if he or she changes the judgment twice: once during the increasing sequence and once during the decreasing sequence. If both jumps take place, hysteresis occurs either according to the delay convention (Hyst D) or to the Maxwell convention (Hyst M). In the latter case, the jumps take place at the same position, for instance, between 20 and 40 cm. If only one jump occurs, we classify the pattern as "jump." This weight test was performed as a clinical observation test and was administered to 65 children. Forty-three subjects correctly solved all items. Two subjects judged consistently that the standard ball was heavier, and two judged the sausage as heavier. In Table 6 the response patterns of all subjects are shown; the consistent subjects are in the lower part. The codes -1, 0, and 1 mean ball more, equal weight, and sausage more, respectively. The first subject shows a good example of a hysteresis response pattern. When the ball is rolled into a sausage of a length of 10 cm, he judges the ball as heavier. When the sausage of 10 cm is then rolled into a length of 20 cm, the child persists in his opinion. But when the sausage of 20 cm is rolled into a sausage of 40 cm, he changes his judgment: now the roll is heavier.
TABLE 6
Response Patterns of 65 Children in a Cross-Sectional Experiment on Hysteresis

[The table lists, for each subject (nr), the seven judgments given as the sausage length runs through 10, 20, 40, 80, 40, 20, and 10 cm, coded -1 (ball more), 0 (equal weight), and 1 (sausage more), together with the pattern classification (hyst D, hyst D', hyst M, or jump) and the alternation type (inc/inc or c/inc). The lower part of the table groups the consistent subjects: 2 subjects who always judged the ball heavier, 2 who always judged the sausage heavier, and 43 who solved all items correctly.]

Note. Explanation of abbreviations is given in the text.
He continues to think this when the sausage reaches its maximum length of 80 cm. Then the sequence is reversed. When the roll of 80 cm is folded to 40 cm, the child judges that the roll is still heavier. This continues until the roll is folded to 10 cm. On this last item the child changes his opinion: the ball is heavier again. The first subject was tested twice. The second time, hysteresis according to the delay convention again takes place. However, the jumps have exchanged relative position: the jump takes place earlier in the increasing rather than in the decreasing sequence of items. We denoted this strange phenomenon as Hyst D'. This subject shows hysteresis between incorrect responses (inc/inc); subject 42 shows hysteresis between correct and incorrect responses (c/inc). This subject judges the ball and the sausage to be equal until the sausage is rolled into a
length of 80 cm. Then the ball is heavier. The child persists in this judgment when the sausage is folded to 40 cm. Finally, in the last two items, the child's responses are "equal" again. Eleven instances of hysteresis and jumps occur among the tested children. The other patterns are not interpretable in these terms. The hysteresis results on the conservation of weight test are not conclusive. It would be naive to expect a convincing demonstration of hysteresis in the first attempt. The number of items in the sequence and the size of steps between items are difficult to choose. Perhaps another kind of test is needed to vary misleadingness continuously. Furthermore, only the transitional subjects in the sample are expected to show these phenomena. In this regard, the results in Table 6 are promising. We also constructed a computer test of liquid conservation for hysteresis. Only two instances of the jump pattern and four of the hysteresis pattern occurred in the responses of 80 subjects (from the longitudinal sample). Three of these subjects came from the transitional group. One of them responded as a nonconserver until we applied the hysteresis test (between sessions 7 and 8) and responded as a conserver in all subsequent sessions. These tests of hysteresis have some relationship to a typical aspect of Piaget's clinical test procedure of conservation. This aspect is called resistance to countersuggestion, which is not applied in the computer test of conservation. It means that, after the child has made his or her judgment and justification of this judgment, the experimenter suggests the opposite judgment and justification to the child. If the child does not resist this countersuggestion, he or she is classified as transitional. That countersuggestions work has been shown in many studies. Whether subjects who accept countersuggestion are transitional (or socially adaptive) is not always clear; as said before, cross-sectional studies are of limited use in deciding this. The method of countersuggestion can be interpreted in the cusp model as pushing the behavior into the new or old mode of behavior to see whether this mode still or already exists and has some stability. If countersuggestion is resisted, the second mode apparently does not exist. In the cusp model, the second mode only exists in the bifurcation set. Note that we defined the transition as being the period during which the system remains in this set. If the subject switches opinions back and forth as a result of repeated countersuggestions, this behavior can be taken to be hysteresis. Whether the manipulation of countersuggestion yields an independent variable or not is less clear.
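The scoring of the length-sweep patterns can be made explicit in code. The sketch below is a hypothetical classifier consistent with the rules stated above, not the scoring procedure actually used; how multiple reversals within one sweep are handled, and the Hyst D versus Hyst D' distinction, are assumptions:

```python
def reversal_positions(judgments):
    # indices at which the judgment changes relative to the previous item
    return [i for i in range(1, len(judgments)) if judgments[i] != judgments[i - 1]]

def classify(pattern):
    """pattern: seven judgments (-1 ball more, 0 equal, 1 sausage more) for the
    sausage lengths 10, 20, 40, 80, 40, 20, 10 cm; the 80-cm item closes the
    increasing sweep and opens the decreasing sweep."""
    up, down = pattern[:4], pattern[3:]
    r_up, r_down = reversal_positions(up), reversal_positions(down)
    if len(r_up) > 1 or len(r_down) > 1:
        return "other"                      # not interpretable in hysteresis terms
    if r_up and r_down:
        # both jumps at the same pair of lengths (e.g., between 20 and 40 cm) -> Maxwell
        return "hyst M" if r_up[0] + r_down[0] == 4 else "hyst D"
    if r_up or r_down:
        return "jump"
    return "consistent"

# The pattern described in the text for the first subject: ball heavier at 10 and 20 cm,
# sausage heavier from 40 cm up to the folded 20 cm, ball heavier again at the final 10 cm.
print(classify([-1, -1, 1, 1, 1, 1, -1]))   # hyst D
```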
6. DISCUSSION In this chapter, we gave an overview of our work on the application of catastrophe theory to the problem of conservation development. Not all of the ideas
or all of the empirical results are discussed, but the most important outcomes are presented. What can be learned from this investigation? One may think that catastrophe theory is a rather complex and extensive tool for something as simple as conservation learning. Some readers may hold the saltatory development of conservation to be self-evident (see Zahler and Sussmann, 1977). With respect to the first criticism, we state that there are theoretical and empirical reasons for using catastrophe theory. Theoretically, the reason for applying catastrophe theory is that the rather vague concept of epigenesis in Piaget's theory can be understood as equivalent to self-organization in nonlinear system theory. Empirically, the lack of criteria for testing the saltatory development implied by Piagetian theory has been an important reason for the loss of interest in this important paradigm (van der Maas & Molenaar, 1992). The second criticism centers on the argument that our conservation items are dichotomous items, so behavior is necessarily discrete. The criticism is not in accordance with test theory. A set of dichotomous items yields, in principle, a continuous scale. On other tests in which a set of dichotomous items of the same difficulty are used, the sum score is unimodally distributed. In contrast, the sum scores on the conservation test are bimodally distributed. Yet, the use of dichotomous items complicates the interpretation of results. It raises a question about the relationship between catastrophe models and standard scaling techniques, especially latent trait theory. This is a subject that needs considerable study. We suggested the use of compensation construction items to exclude these complications radically. This criticism of intrinsic discreteness is not commonly used by developmental psychologists. The traditional rejection of Piaget's theory assumes the opposite hypothesis of continuous development. With regard to this position, the presented data have clear implications. First, the sudden jump in the responses of a quarter of our longitudinal sample has not been established before. It is disputable whether this observation excludes alternative models that predict sharp continuous accelerations. We can argue that at least one subject showed the jump within a test session, during a pause of approximately 30 s between two parts of the computer test. Hence, we demonstrated an immediate sudden jump for at least one subject. Second, evidence is presented for the occurrence of transitional strategies. Note that, because this result comes from a longitudinal study, we are much more certain that the subjects classified as transitional are indeed transitional. Transitional subjects appear to respond "equal" to both equality and inequality items. This may be interpreted to mean that they overgeneralize their discovery of the solution of standard equality items to the inequality items. Verbal justifications are required to decide whether this interpretation is correct. In our view the
demonstration of a typical transitional strategy is quite important: with respect to Piaget's idea of knowledge construction, this result is confirmatory. It also makes an argument for using the rule assessment methodology in studying Piagetian concepts. We presented this result under the heading of anomalous variance. We relate the catastrophe flag anomalous variance to what Flavell and Wohlwill (1969) call oscillations and intermediary forms of reasoning. We suggest that the findings of Perry, Breckinridge Church, and Goldin Meadow (1988) concerning inconsistencies between nonverbal gestures and verbal justifications should be understood in terms of this flag, too. Third, the few instances of hysteresis response patterns support the choice of the cusp model of conservation. We admit that more experimentation is required here. The test should be improved, and perhaps we need other designs and manipulations. We have already suggested using countersuggestion as a manipulation. Finally, evidence for other flags can be found in the literature, especially for bimodality and inaccessibility, for which strong evidence already exists (Bentler, 1970; van der Maas, 1995). Divergence, divergence of linear response, and critical slowing down have not been demonstrated. Although the presented results together make a rather strong argument for the hypothesis of discontinuous development, we will attempt to find evidence for these flags, too. The presented evidence for the catastrophe flags offers a strong argument for the discontinuity hypothesis concerning conservation development. That is, the data bear evidence for catastrophe models of conservation development. However, we cannot claim that this evidence directly applies to the specific cusp model that we presented. The choice of independent variables is not fully justified. The hysteresis experiment gives some evidence for perceptual misleadingness as an independent variable, but this variable is, for instance, also used by Preece (1980). On the other hand, there are some theoretical problems for which the model of Preece (1980) can be rejected (see van der Maas & Molenaar, 1992). We did not put forward evidence for cognitive capacity as an independent variable; in future studies, we will focus on this issue. However, it is in accordance with the views of many supporters and opponents of Piaget that conservation acquisition should be understood in terms of a conflict between perceptual and cognitive factors. We hope that we have made clear that a catastrophe approach to discontinuous behavior has fruitful implications. Catastrophe theory concerns qualitative (categorical) behavior of continuous variables. It suggests a complex relation between continuous and categorical variables that falls outside the scope of standard categorical models and data analysis methods. Yet, the catastrophe models are not unrelated to standard notions; the question of how catastrophe models should be incorporated into standard techniques is far from answered.
ACKNOWLEDGMENTS This research was supported by the Dutch organization for scientific research (NWO) and the Department of Psychology of the University of Amsterdam (UVA).
REFERENCES

Alexander, R. A., Herbert, G. R., DeShon, R. P., & Hanges, P. J. (1992). An examination of least squares regression modeling of catastrophe theory. Psychological Bulletin, 111(2), 366-374.
Anderson, N. H., & Cuneo, D. O. (1978). The height + width rule in children's judgments of quantity. Journal of Experimental Psychology: General, 107(4), 335-378.
Bentler, P. M. (1970). Evidence regarding stages in the development of conservation. Perceptual and Motor Skills, 31, 855-859.
Brainerd, C. J. (1979). Markovian interpretations of conservation learning. Psychological Review, 86, 181-213.
Bruner, J. S. (1967). On the conservation of liquids. In J. S. Bruner, R. R. Olver, & P. M. Greenfield (Eds.), Studies in cognitive growth (pp. 183-207). New York: Wiley.
Castrigiano, D. P. L., & Hayes, S. A. (1993). Catastrophe theory. Reading, MA: Addison-Wesley.
Cobb, L., & Zacks, S. (1985). Applications of catastrophe theory for statistical modeling in the biosciences. Journal of the American Statistical Association, 80(392), 793-802.
Everitt, B. S., & Hand, D. J. (1981). Finite mixture distributions. London: Chapman & Hall.
Ferretti, R. P., & Butterfield, E. C. (1986). Are children's rule assessment classifications invariant across instances of problem types? Child Development, 57, 1419-1428.
Flavell, J. H., & Wohlwill, J. F. (1969). Formal and functional aspects of cognitive development. In D. Elkind & J. H. Flavell (Eds.), Studies in cognitive development: Essays in honor of Jean Piaget (pp. 67-120). New York: Oxford University Press.
Gilmore, R. (1981). Catastrophe theory for scientists and engineers. New York: Wiley.
Goldschmid, M. L., & Bentler, P. M. (1968). Manual conservation assessment kit. San Diego, CA: Educational and Industrial Testing Service.
Guastello, S. J. (1988). Catastrophe modeling of the accident process: Organizational subunit size. Psychological Bulletin, 103(2), 246-255.
Kerkman, D. D., & Wright, J. C. (1988). An exegesis of compensation development: Sequential decision theory and information integration theory. Developmental Review, 8, 323-360.
McShane, J., & Morrison, D. L. (1983). How young children pour equal quantities: A case of pseudoconservation. Journal of Experimental Child Psychology, 35, 21-29.
Molenaar, P. C. M. (1986). Issues with a rule-sampling theory of conservation learning from a structuralist point of view. Human Development, 29, 137-144.
Odom, R. D. (1972). Effects of perceptual salience on the recall of relevant and incidental dimensional values: A developmental study. Journal of Experimental Psychology, 92, 185-291.
Oliva, T. A., DeSarbo, W. S., Day, D. L., & Jedidi, K. (1987). GEMCAT: A general multivariate methodology for estimating catastrophe models. Behavioral Science, 32, 121-137.
Pascual-Leone, J. (1970). A mathematical model for the transition rule in Piaget's developmental stages. Acta Psychologica, 32, 301-345.
Pascual-Leone, J. (1989). An organismic process model of Witkin's field-dependency-independency. In T. Globerson & T. Zelniker (Eds.), Cognitive style and cognitive development (Vol. 3, pp. 36-70). Norwood, NJ: Ablex.
Peill, E. J. (1975). Invention and discovery of reality: The acquisition of conservation of amount. London: Wiley.
Perry, M., Breckinridge Church, R., & Goldin Meadow, S. (1988). Transitional knowledge in the acquisition of concepts. Cognitive Development, 3, 359-400.
Piaget, J. (1960). The general problems of the psychobiological development of the child. In J. M. Tanner & B. Inhelder (Eds.), Discussions on child development (Vol. 4, pp. 3-27). London: Tavistock.
Piaget, J. (1971). The theory of stages in cognitive development. In D. R. Green, M. P. Ford, & G. B. Flamer (Eds.), Measurement and Piaget (pp. 1-11). New York: McGraw-Hill.
Piaget, J., & Inhelder, B. (1969). The psychology of the child. New York: Basic Books.
Poston, T., & Stewart, I. (1978). Catastrophe theory and its applications. London: Pitman.
Preece, P. F. W. (1980). A geometrical model of Piagetian conservation. Psychological Reports, 46, 143-148.
Saunders, P. T. (1980). An introduction to catastrophe theory. Cambridge, UK: Cambridge University Press.
Siegler, R. S. (1981). Developmental sequences within and between concepts. Monographs of the Society for Research in Child Development, 46(2), 84.
Siegler, R. S. (1986). Unities across domains in children's strategy choices. Minnesota Symposia on Child Psychology, 19, 1-46.
Siegler, R. S., & Jenkins, E. (1989). How children discover new strategies. Hillsdale, NJ: Erlbaum.
Stewart, I. N., & Peregoy, P. L. (1983). Catastrophe theory modeling in psychology. Psychological Bulletin, 94(2), 336-362.
Ta'eed, L. K., Ta'eed, O., & Wright, J. E. (1988). Determinants involved in the perception of the Necker Cube: An application of catastrophe theory. Behavioral Science, 33, 97-115.
van Maanen, L., Been, P., & Sijtsma, K. (1989). The linear logistic test model and heterogeneity of cognitive strategies. In E. E. Roskam (Ed.), Mathematical psychology in progress (pp. 267-287). Berlin: Springer-Verlag.
van der Maas, H. L. J., & Molenaar, P. C. M. (1992). Stagewise cognitive development: An application of catastrophe theory. Psychological Review, 99(3), 395-417.
van der Maas, H. L. J. (1995). Nonverbal assessment of conservation. Manuscript submitted for publication.
van der Maas, H. L. J., Walma van der Molen, J., & Molenaar, P. C. M. (1995). Discontinuity in conservation acquisition: A longitudinal experiment. Manuscript submitted for publication.
Wilson, M. (1989). Saltus: A psychometric model of discontinuity in cognitive development. Psychological Bulletin, 105(2), 276-289.
Zahler, R. S., & Sussmann, H. J. (1977). Claims and accomplishments of applied catastrophe theory. Nature (London), 269(10), 759-763.
Zeeman, E. C. (1976). Catastrophe theory. Scientific American, 234(4), 65-83.
Zeeman, E. C. (1993). Controversy in science: The ideas of Daniel Bernoulli and René Thom. Johann Bernoulli Lecture. Groningen, The Netherlands: University of Groningen.
Catastrophe Theory of Stage Transitions in Metrical and Discrete Stochastic Systems
Peter C. M. Molenaar and Pascal Hartelman
University of Amsterdam, The Netherlands
1. INTRODUCTION The current spectacular progress in the analysis of nonlinear systems has important implications for theories of development. Until recently, central theoretical constructs in developmental biology and psychology, such as epigenesis and emergence, seemed to defy any attempt at causal modeling. For instance, when addressing the problem of causal inference in experimental studies of development, Wohlwill (1973, p. 319) believed that it would be impossible to isolate sufficient causes of normal developmental processes which he described as "acting independently of particular specifiable external agents or conditions." As to this, the situation has changed considerably in that we now have available a range of nonlinear dynamical models that can explain the self-regulating mechanisms of normal development (such as "catch-up growth" after temporary inhi-
bition which served as the empirical example in Wohlwill's discussion of causal inference). Moreover, the mathematical theory of nonlinear dynamics has provided innovative, rigorous approaches to the empirical study of epigenetical processes (cf. Molenaar, 1986). This chapter is devoted to one of these approaches. The founders of modern nonlinear dynamics have always been quite aware of the relevance of their work to the analysis of developmental processes. Prigogine's work with Nicolis on applications of nonequilibrium thermodynamics is an excellent example (Nicolis & Prigogine, 1977), as is Prigogine's (1980) classic monograph on the physics of becoming. Haken (1983) applied his synergetic approach to self-organization in evolutionary processes. And last but not least, Thom's (1975) main treatise on catastrophe theory is entirely inspired by morphogenesis. A common theme in all of these applications is the causal modeling of self-organization in nonlinear developmental systems. Furthermore, this is accomplished by using the same basic paradigm: bifurcation analysis. In a bifurcation analysis, a nonlinear system is subjected to smooth variation of its parameters. One might conceive of this variation as the result of maturation, giving rise to slow continuous changes (so-called quasi-static changes) in the system parameters. Most of the time this only yields continuous variation of the behavior of the system, but for particular parameter values, the system may undergo a sudden change in which new types of behavior emerge and/or old behavior types disappear. Such a discontinuous change in a system's behavior marks the point when its dynamics become unstable, after which a spontaneous shift occurs to a new, stable dynamical regime. Hence, sudden transitions in the dynamics of systems undergoing smooth quasi-static parameter variation constitute a hallmark of self-organization. Notice that bifurcation analysis is a mathematical technique which is applied to a mathematical model of a given nonlinear system. For many developmental processes, however, no adequate mathematical model is available. For instance, it is unknown what would constitute a proper dynamical model of (Piaget's theory of) cognitive development. One could proceed by making an educated guess (called an "Ansatz") of what might be a plausible model and subject that model to a bifurcation analysis. This approach is commonly used in applied synergetics (cf. Haken, 1983). Yet, in the context of cognitive development, even the formulation of a plausible Ansatz is a formidable endeavor. Far too little is known about the dynamics of cognitive systems, and their operation can only indirectly be observed. To study self-organization in such ill-defined systems, a mathematical technique is needed that does not require the availability of specific dynamic models. This is provided by catastrophe theory. Catastrophe theory deals with the critical points or equilibria of gradient systems, that is, systems whose dynamics are governed by the gradient of a so-called potential. A potential is a smooth scalar function of the state variables of a dynamical system. For the moment, a potential can be best conceived of as a
convenient mathematical construct; later, we present possible interpretations in the context of developmental processes. Let x(t) denote the vector of state variables at time t (we will denote vectors and matrices by bold lower- and uppercase letters). The rate of change of the state variables in a gradient system, then, equals the gradient of a potential V[x(t)] with respect to these state variables: d/dt x(t) = grad x V[x(t)], where grad x = [∂/∂x_1, ∂/∂x_2, ..., ∂/∂x_n] and ∂/∂x_i denotes the partial derivative with respect to the state coordinate x_i. This implies that the equilibria of a gradient system are given by the zeroes of grad x V[x]. Catastrophe theory, however, does not deal with an individual gradient system but yields a characterization of the equilibria of entire families of parameterized gradient systems: grad x V[x; c] = 0, where c is a set of parameters. Hence catastrophe theory is the study of how the critical points of V[x; c] change as the parameters in c change. The main result of catastrophe theory can be summarized as follows: The equilibria of a subset of all potentials V[x; c] can be characterized in terms of a few canonical forms that depend only on the number of parameters in c and on the rank of the Hessian, that is, the matrix of second-order derivatives of V[x; c] with respect to the state variables. If the number of parameters in c does not exceed 7 and if the rank of the Hessian is not less than n - 2, where n is the dimension of the state vector x, then this result holds. The implications of this main result of catastrophe theory are profound: For gradient systems, one only needs to specify two numbers to give a characterization of the way in which their equilibria change as the parameters change. This is precisely the kind of result that is needed for the study of self-organization in ill-defined dynamical systems. In fact, Gilmore (1981) derived characteristic features of sudden changes in equilibria that do not even depend on these two numbers but instead require an absolute minimum of a priori information for their application. These so-called catastrophe flags are extensively discussed in van der Maas and Molenaar (1992) and in the chapter by van der Maas and Molenaar in this volume. A key question now presents itself: Are gradient systems general enough to capture the essential self-organizing characteristics of most developmental processes? Apart from the catastrophe flags which apply to any gradient (and conservative Hamiltonian) system, catastrophe theory proper applies to a subset of the gradient systems and the entire class of gradient systems itself only constitutes a subset of the set of all possible dynamical systems. Notice that a definite answer to this question can only be obtained by empirical means. For instance, whether or not a gradient system constitutes a satisfactory causal model of cognitive development (or, more realistically, of an aspect of cognitive development such as conservation acquisition) can only be determined in dedicated empirical research. Having said this, however, one can put forward some a priori considerations that are indicative of the viability of applications of catastrophe theory to biological and psychological developmental processes. As alluded to earlier,
these processes are in general ill-defined in that we are unable to give a complete specification of the underlying dynamics. If catastrophe theory is applied and the (unknown) true dynamics do not conform to a gradient system, then what is the error of approximation? More specifically, let the true dynamical equation be d/dt x(t) = F[x(t); c], where F now is an arbitrary vector-valued function. Then, according to Jackson (1989, p. 117): "If F[x; c] = 0 implies that grad x V[x; c] = 0, for some [scalar potential] V, then Thom's theorem [i.e., catastrophe theory] can be applied to V[x; c]. It is not required that F = grad x V everywhere in the [state] space." Hence, if the equilibria of a nongradient system can be locally represented by the equilibria of a potential, then there is no error of approximation. The class of potential systems, that is, systems whose dynamics depend on the gradient of a potential, includes gradient systems, (conservative) Hamiltonian systems, and Hamiltonian systems with positive damping. Thus the class of potential systems is rather large, covering many physical processes. Huseyin (1986) has extended the catastrophe theory program to this class of potential systems. Furthermore, Huseyin shows that the (so-called static) instabilities of autonomous nonpotential systems can be characterized in much the same way as those of potential systems (thus corroborating Jackson's observation). Taken together, these a priori considerations imply that the range of applicability of catastrophe theory may be rather large (although one would like to have a more detailed formal specification of the approximation error in the general case). This is a fortunate state of affairs because catastrophe theory may be the only principled method available to study self-organization in many ill-defined systems. In what follows, we will first present an outline of elementary catastrophe theory for metrical deterministic gradient systems. This presentation is based on Gilmore (1981) and follows a constructive approach that is distinct from the more abstract, geometrical approaches followed by most other authors (e.g., Thom, 1975). Then, we move on to a consideration of catastrophe theory for stochastic metrical systems. After having presented a typology of stochastic systems to indicate the type of systems we are interested in, a concise overview is given of Cobb's innovative work in this area (e.g., Cobb & Zacks, 1985). It will be argued that Cobb's approach does not meet the basic tenets of catastrophe theory, and therefore a revision of his approach is formulated that agrees with the goals of elementary catastrophe theory. Even then, however, there remain some fundamental problems with extending the catastrophe theory program to metrical stochastic systems. We discuss these problems in a heuristic way and suggest a principled approach to deal with them. This part of the chapter draws on ongoing research and therefore the reader should not expect definite solutions. The same can be said about the final part of the chapter dealing with catastrophe theory for discrete stochastic systems. Here, we restrict the discussion
to a characterization of these systems in a way that makes them compatible with the catastrophe theory program.
2. ELEMENTARY CATASTROPHE THEORY The basic tenet of catastrophe theory is to arrive at a classification of the possible equilibrium forms of gradient systems undergoing continuous quasi-static variation of their parameters. This implies a classification of the critical points of perturbed parameterized potentials V[x; c]. Critical points belong to the same class if there exists a diffeomorphic relationship between the corresponding potentials. Roughly speaking, a diffeomorphism changes x and/or c in a smooth manner (i.e., a local diffeomorphism at a point p is an invertible coordinate transformation for which derivatives of arbitrary order exist). Hence the program of elementary catastrophe theory is effected by smooth coordinate transformations. Actually the picture is more complicated, as will be indicated in a later section, but this suffices for our present purposes.
2.1. Coordinate Transformations We consider here transformations of a coordinate system [x_1, x_2, ..., x_n] for processes in R^n. Such coordinate transformations play an important role in effecting the program of elementary catastrophe theory and will also figure predominantly in the extension of this program to stochastic metrical systems. A transformation of the coordinate system [x_1, x_2, ..., x_n] into [y_1, y_2, ..., y_n] then is defined by

y_i = y_i(x_1, x_2, ..., x_n), i = 1, ..., n,   (1)
at any point p in R^n at which the Jacobian det |∂y_i/∂x_j| is nonzero. As Equation (1) is understood to be a local diffeomorphism, it is useful to express the new y_i coordinates in terms of a Taylor series expansion of the old x coordinate system. Suppose a given potential is represented in a particular coordinate system: V[x]. Then of course a coordinate transformation will change the functional form of this potential: U[y]. Yet the potential itself remains identical, hence U[y] = V[x]. In contrast, two different potentials will have different functional forms in the same coordinate system; for example, V[x] will not be equal to U[x]. However, in the words of Gilmore (1981, p. 18), these two potentials are "qualitatively similar if we can find a smooth change of coordinates so that the functional form of U, expressed in terms of the new coordinates, is equal to the functional form of V in the original coordinate system." Hence if V[x] differs from U[x], but we can find a coordinate transformation so that U[y] = V[x],
then these two potentials are qualitatively similar. In particular, the number and types of their critical points are the same. The way in which coordinate transformations are used in effecting the program of catastrophe theory can now be summarized as follows. A potential is expressed in a Taylor series expansion about its critical point. A coordinate transformation is sought that reduces this potential to a canonical form. All potentials that can be reduced to the same canonical form then constitute an equivalence class having the same type of critical point. This will be elaborated more fully in the next sections.
2.2. Morse Form To start, we consider a critical point x* of a potential V[x] at which the Hessian is nonsingular:

grad x V[x] = 0 at x*; V_ij = {∂²V[x]/∂x_i∂x_j | i, j = 1, ..., n} is nonsingular at x*.   (2)
Only the quadratic and higher degree terms of the Taylor series expansion of V[x] about x* will be nonzero. It now can be proved that there always exists a coordinate transformation so that in a neighborhood of x*, V[x] can be expressed as

U[y] = Σ_{i=1,n} λ_i y_i²,   (3)

where λ_i are the eigenvalues of the Hessian matrix V_ij. This simple quadratic form is the canonical Morse form for critical points of a potential at which the Hessian is nonsingular. The constructive proof given by Gilmore (1981, pp. 20-23) is elementary and consists of counting the number of disposable coefficients in the Taylor series expansion of a coordinate transformation that can be used to kill (i.e., transform to zero) coefficients in the Taylor series expansion of the transformed potential. From Equation (3) it follows that all critical points of potentials at which the Hessian is nonsingular belong to a single equivalence class. Hence, all of the equilibria of arbitrary gradient systems for which the Hessian is nonsingular are qualitatively similar.
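As a small numerical illustration of the Morse form (the potential below is an arbitrary example, not one from the text), the local type of a nondegenerate critical point is fixed by the Hessian eigenvalues alone:

```python
import numpy as np

# Example potential V[x1, x2] = x1**2 + x1*x2 + x2**2 + x1**3 * x2,
# which has a critical point at the origin (both partial derivatives vanish there).
# Its Hessian at the origin involves only the quadratic part:
hessian = np.array([[2.0, 1.0],
                    [1.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(hessian)
print(eigenvalues)   # [1. 3.]: nonzero, so the critical point is nondegenerate
# By Equation (3) the potential is locally equivalent to U[y] = 1*y1**2 + 3*y2**2;
# the cubic term x1**3 * x2 does not affect the local type of this critical point.
```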
2.3. Thom Forms If the Hessian at a critical point of a potential is singular (has zero eigenvalues), then it cannot be transformed to the Morse form, Equation (3). Suppose that the Hessian V_ij of a potential V[x] has m zero eigenvalues (in technical jargon: has nullity m). Then, it follows from the so-called splitting lemma that this potential can be reduced to

U[y] = F_dg[y_1, y_2, ..., y_m] + Σ_{i=m+1,n} λ_i y_i².   (4)
Thus the potential can be reduced to a form that consists of two parts: a degenerate part F_dg associated with the m degenerate eigenvalues of the Hessian V_ij and another part that is simply the Morse form associated with the nondegenerate eigenvalues of V_ij. The y-coordinates associated with degenerate and nondegenerate directions in state space are split, hence the name of the lemma. Notice that in Equation (4) the part conforming to the Morse form already has a canonical form. Can the degenerate part F_dg also be transformed into a canonical form? To answer this question, we first note that, thus far, the parameters c have been left implicit because they play no role in the Morse form. That is, there is only one canonical Morse form, irrespective of the number of parameters in c. To arrive at canonical forms for F_dg, however, its dependence on c has to be made explicit: F_dg[y_1, ..., y_m; c*], where c* denotes the value of c at the critical point. We will see that canonical forms for F_dg depend on the number k of parameters in c as well as on m. Specifically, such canonical forms only exist if m is not larger than 2 and k is at most 7. To illustrate, let m = 1 and let there be k parameters. Then, the Taylor series expansion of F_dg about the degenerate critical point, as obtained from the splitting lemma, is given by
F_dg[y_1; c*] = Σ_i t_i y_1^i, i = 3, 4, ...,   (5)
where it is understood that the critical point has been translated to y* = 0, and where the term for i = 0 is unimportant, the term for i = 1 is lacking because t_1 = 0 at a critical point, and the term for i = 2 is lacking because of the degeneracy. It can be shown (cf. Gilmore, 1981, pp. 26-27) that coordinate transformations of both y_1 and c can reduce Equation (5) to the following canonical form:

CG[1; k] = s z^(k+2); s = 1 or -1,   (6)
where, in general, CG[m; k] stands for the so-called catastrophe germ of a k-parameter potential whose Hessian at the critical point has m vanishing eigenvalues, and where s is a dummy variable (there are in fact two canonical forms, one is the negative of the other). The canonical form, Equation (6), is based on a Taylor series expansion in terms of the state variables of the potential. This implies that Equation (6) is valid in an open neighborhood of the critical point in state space. But it is only valid at the point c = c*, because no Taylor series expansion of the parameters has been considered. The extension of the validity of Equation (6) to a neighborhood of the critical point in parameter space amounts to considering the effect of perturbations of c* on the catastrophe germ. The constructive approach followed by Gilmore to establish this extension is the same as before, that is, determine which terms in the Taylor series expansion of the perturbed catastrophe germ can always be removed by a coordinate transformation. In this way, a canonical form of all possible perturbations of a catastrophe germ is obtained,
which will be denoted by Pert[m; k]. Note that the canonical perturbations are different for different catastrophe germs and thus depend on m and k. The final result now reads, schematically, Cat[m; k] = CG[m; k] + Pert[m; k],
(7)
where Cat[m; k] stands for (elementary) catastrophe. The canonical forms of Equation (7) are called the Thom forms. In particular, if k = 2, then it can be shown that the canonical perturbation of Equation (6) is given by Pert[1; 2] = c_1 z + c_2 z², yielding the cusp catastrophe which figures predominantly in the chapter by van der Maas and Molenaar:

Cat[1; 2] = s z⁴ + c_2 z² + c_1 z; s = 1 or -1.
(8)
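A minimal numerical sketch of Equation (8) (with s = 1 and control values chosen only for illustration) shows how the number of critical points of the cusp changes with c_1 and c_2:

```python
import numpy as np

def n_critical_points(c1, c2, s=1):
    # Critical points of Cat[1; 2] = s*z**4 + c2*z**2 + c1*z solve
    # 4*s*z**3 + 2*c2*z + c1 = 0; count the distinct real roots.
    roots = np.roots([4 * s, 0.0, 2 * c2, c1])
    real = roots[np.abs(roots.imag) < 1e-8].real
    return len(np.unique(np.round(real, 6)))

for c2 in (-2.0, 0.5):
    print(c2, [n_critical_points(c1, c2) for c1 in np.linspace(-3, 3, 13)])
# For c2 = -2 an interval of c1 values yields three distinct critical points (the regime
# with two coexisting behavior modes); for c2 = 0.5 there is a single critical point
# for every c1.
```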
2.4. Discussion Our heuristic outline of elementary catastrophe theory focuses on the role of coordinate transformations in the derivation of canonical forms for the critical points of potentials. There are several reasons for choosing this particular point of view. The mathematical theory underlying Equation (7) is very abstract and therefore may not be readily accessible to developmental biologists and psychologists. The constructive transformational approach used by Gilmore (1981) may fare better in creating a valid mental picture within the confines of a single chapter. Also, this constructive approach may be less known because most textbooks present a much more geometrically inspired perspective on catastrophe theory (e.g., Castrigiano & Hayes, 1993; Poston & Stewart, 1978; Thom, 1975). Last but not least, as will be indicated in the next section, it turns out that the problems arising from the extension of catastrophe theory to stochastic metrical systems are closely linked to the way in which coordinate transformations for these systems are defined. As alluded to earlier, elementary catastrophe theory constitutes a powerful approach to the study of self-organization in ill-defined systems. Self-organization typically takes place in a "punctuated" way, in which the stability of the equilibrium of a system gradually decreases and becomes degenerate as a result of parameter variation over an extended period of time, after which a sudden shift to a new, qualitatively different equilibrium occurs. The ongoing behavior of a system is organized around its equilibrium states and therefore a qualitative shift in equilibrium will manifest itself by the emergence of new behavioral types. The results of catastrophe theory, based on mathematical deduction, show that under mild conditions (see the quotation of Jackson, 1989, in section 1) the occurrence of such stage transitions in epigenetical processes can be detected and modeled without a priori knowledge of the system dynamics. To apply elementary catastrophe theory, one only has to identify the degenerate
coordinates of a system's equilibrium point and the parameters whose quasi-static variation induces the loss of stability. In the jargon of catastrophe theory, the degenerate coordinates are called the behavioral variables, and the parameters controlling system stability are called the control variables. Given a set of empirical measurements of candidate behavioral and control variables, and allowing for all possible coordinate transformations, the appropriate Thom forms then can be fitted to the data to determine whether a genuine stage transition is actually present and to which class it belongs. In closing this section, it must be noted that our outline of catastrophe theory, although providing a convenient stepping stone to the next section, is incomplete in almost all respects. It is one thing to present a constructive approach; it is quite another to specify the necessary and sufficient conditions for such an approach to hold. Also, we do not give a complete list of the Thom forms. All of this is contained in the excellent monograph by Castrigiano and Hayes (1993).
3. CATASTROPHE THEORY FOR METRICAL STOCHASTIC SYSTEMS Elementary catastrophe theory pertains to deterministic processes generated by gradient systems. In contrast, real developmental processes encountered in the biological and social sciences almost never can be forecasted with complete certainty and therefore do not constitute deterministic but stochastic processes. To extend catastrophe theory to the analysis of these stochastic processes, one first has to specify the relevant ways in which gradient systems can be generalized to include random influences. There are several distinct ways in which a gradient system can become stochastic (cf. Molenaar, 1990), but it will be sufficient for our purposes to restrict attention to two of them. First, consider the system:

d/dt x(t) = grad x V[x], y(t) = x(t) + e(t),
(9)
where e(t) denotes random measurement noise. Here, the manifest process y(t) is stochastic, but the latent gradient system is still deterministic. The random measurement noise in (9) does not affect the system dynamics and thus prediction of x(t) by means of sequential nonlinear regression techniques will become perfect when t approaches infinity. In contrast, the role of random influences is entirely different in the following system:

d/dt x(t) = grad x V[x] + w(t), y(t) = x(t),
(10)
where the random process w(t), called the innovations process, directly affects the system dynamics. This implies that x(t) can never be predicted with complete certainty, even if t approaches infinity.
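The different roles played by e(t) in (9) and w(t) in (10) can be seen directly in simulation; the sketch below assumes a simple linear drift, grad V[x] = -x, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
dt, n = 0.01, 1000

def grad_V(x):
    return -x          # illustrative potential with a single stable equilibrium at 0

x9 = np.zeros(n); y9 = np.zeros(n)   # system (9): deterministic state, noisy observation
x10 = np.zeros(n)                    # system (10): innovations drive the state itself
x9[0] = x10[0] = 2.0
for t in range(1, n):
    x9[t] = x9[t - 1] + grad_V(x9[t - 1]) * dt
    y9[t] = x9[t] + 0.3 * rng.standard_normal()                                        # e(t)
    x10[t] = x10[t - 1] + grad_V(x10[t - 1]) * dt + np.sqrt(dt) * rng.standard_normal()  # dw(t)

# The latent state of (9) settles deterministically at its equilibrium, so x(t) is
# eventually perfectly predictable despite the measurement noise in y(t); the state
# of (10) keeps fluctuating and can never be predicted exactly.
print(x9[-1], x10[-1])
```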
In what follows, we restrict attention to (10) where the system dynamics is intrinsically stochastic. This constitutes by far the most interesting (and, admittedly, also the most problematic) case. Although it is possible to allow for measurement noise e(t) in (10), we will avoid this complication because it is not essential to our purposes. Hence the second equation of (10) can be dropped because it involves a trivial identity transformation, after which x(t) will denote the manifest process. We are thus left with the first equation of (10), which constitutes an instance of a stochastic differential equation (SDE). Following the notational conventions appropriate for SDEs (to be explained in the next section), the starting point of our discussion is then given by the following metrical stochastic system: dx(t) = grad x V[x]dt + dw(t).
(11)
3.1. Stochastic Differential Equations The theory of SDEs can be formulated in a number of ways (e.g., employing martingale methods or semigroup methods; cf. Revuz & Yor, 1991). We will have to leave these elegant formal constructions for what they are, however, and concentrate on the mere presentation of a few basic notions that are of importance. The first of these is the Wiener process w(t). Consider the following process in discrete time: A (presumably drunken) man moves along a line, taking, at random, steps to the left or to the right with equal probability. We want to know the probability that he reaches a given point on the line after a given elapsed time. This process is the well-known random walk. If now both the unit of time and the step size become infinitesimally small, then the one-dimensional Wiener process w(t) is obtained. Given that this Wiener process starts at the origin, w(0) = 0, the probability that it reaches point w at time t, p(w(t) = w | w(0) = 0), can be shown to be Gaussian with mean zero and variance proportional to t. The sample paths of a Wiener process are continuous, that is, each realization is a continuous function of time. Yet, it can be shown (cf. Gardiner, 1990) that these sample paths are nondifferentiable. This implies that a realization of the Wiener process is exceedingly irregular, as is also reflected by the linear dependence of the (conditional) variance on elapsed time t. Of particular importance is the statistical independence of the increments of w(t). Let δw(t_i) = w(t_i) - w(t_{i-1}) denote these increments, where t_i, i = 1, 2, ..., n, is a partition of the time axis; then the joint probability of the δw(t_i) is the product of normal distributions, each with mean zero and variance δt_i = t_i - t_{i-1}. We now define the differential dw(t) by letting δt_i become infinitesimally small. But w(t) is nondifferentiable, hence dw(t) does not exist. On the other hand, the integral of dw(t) exists and is the Wiener process w(t). In particular:
w(t) - w(0) = ∫_0^t dw(s),   (12)
and this integral equation can be interpreted consistently. Thus dw(t) can be conceived of as a convenient abstraction to be used to arrive at a consistent definition of stochastic differential equations. We are now in a position to provide a definition of an n-dimensional SDE. Let w(t) denote an n-dimensional Wiener process with uncorrelated components: E[w_i(t)w_j(s)] = min(t, s)δ_ij, where E denotes the expectation operator and δ_ij is Kronecker's delta, that is, δ_ij = 1 if i = j and δ_ij = 0 otherwise. In addition, let a[x] denote an n-dimensional so-called drift vector and B[x] an (n × n)-dimensional diffusion matrix. Both drift and diffusion are arbitrary functions of x(t). Then consider the SDE

dx(t) = a[x]dt + B[x]dw(t),
(13)
which again can be conceived of as the formal derivative of the integral equation

x(t) - x(0) = ∫_0^t a[x]dt + ∫_0^t B[x]dw(t).   (14)
Note that if a[x] = grad x V[x] and B[x] = I_n, that is, the (n × n) identity matrix, then Equation (13) reduces to Equation (11). For our purposes, it suffices to consider two aspects of Equation (13). First, it should be noted that this SDE does not obey the usual transformation rules for differential equations. Instead, the famous Ito formula describes the way in which Equation (13) transforms under a change of variables. Specifically, let n = 1 for ease of presentation and consider an arbitrary smooth function of x(t): f[x]. Then, Ito's formula is given by
df[x] = {a[x]f'[x] + b[x]²f''[x]/2}dt + b[x]f'[x]dw(t),
(15)
where f'[x] and f''[x] denote, respectively, the first- and second-order derivatives of f[x] with respect to x(t). Second, it should be noted that, given initial conditions, Equation (13) specifies the evolution of the probability density function of x(t) as a function of time t. Because the drift and diffusion terms in Equation (13) do not explicitly depend on time, it follows that the probability density function of x(t) becomes stationary if t approaches infinity. That is,

p(x, t) → p(x) as t → ∞.   (16)
The stationary probability density p(x) is a function of the drift a[x] and the diffusion matrix B[x]. This functional dependence takes on a particularly simple form if x(t) obeys the so-called detailed balance condition. Heuristically speaking, a process satisfies detailed balance if in the stationary situation each possible transition (from x(t) = x_1 to x(t + dt) = x_2) balances with the reverse transition (cf. Gardiner, 1990, pp. 148-170). If this is the case, that is, if a[x] and B[x] obey the formal criteria for detailed balance, then the stationary probability density is given by

p(x) = exp(-Φ[x]),
(17)
where

Φ[x] = ∫^x dy · z[a, B, y],
z_i[a, B, x] = Σ_k B_ik^(-2)[x]{2a_k[x] - Σ_j ∂/∂x_j B_kj²[x]}, i = 1, 2, ..., n.
The dot in the second equation denotes the vector inner product (of the n-dimensional vectors dy and z). The important point conveyed by (17) is that under detailed balance, the stationary probability density of a homogeneous process x(t) is given by p(x) = exp(-Φ[x]), where Φ[x] is a potential. Moreover, this potential Φ[x] is a nonlinear function of the drift and diffusion terms in the SDE.
3.2. Cobb's Approach In a number of papers, Cobb has elaborated a stochastic catastrophe theory for metrical systems (e.g., Cobb, 1978, 1981; Cobb, Koppstein, & Chen, 1983). We will focus on the paper by Cobb and Zacks (1985) in which the link with the SDEs is explicitly addressed. What follows is a synopsis of the relevant parts of the latter paper. Let n = 1 and consider the following instance of Equation (13): dx(t) = grad x V[x]dt + b[x]dw(t),
(18)
which also can be regarded as a slight generalization of Equation (11). Now the first step in Cobb and Zacks's approach consists of substituting an elementary catastrophe for the potential V[x]. In particular, V[x] is taken to be the cusp catastrophe given by Equation (8). In the second step, the stationary probability density of Equation (18) is determined. An application of (17) to (18) yields, after some rewriting, the stationary density in the form as given by Cobb and Zacks:

p(x) = exp(2∫^x {g[y]/b²[y]}dy),
(19)
where g[x] = V'[x] - b'[x]b[x], and the apostrophe denotes differentiation with respect to x. Cobb and Zacks then proceed by considering several functional forms for b²[x]: a constant, a linear, or quadratic function of the state x, and so on. This yields distinct family types of stationary densities, one for each functional form of b²[x] (and g[x]).
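Equation (19) can be evaluated numerically to see how the resulting stationary density depends on the choice of b²[x]; the drift and diffusion functions below are arbitrary examples, not the families Cobb and Zacks actually catalogued:

```python
import numpy as np

xs = np.linspace(-3, 3, 601)
dx = xs[1] - xs[0]

def stationary_density(V_prime, b_func):
    # Equation (19): p(x) proportional to exp(2 * integral of g/b**2), with g = V' - b'*b
    b = b_func(xs)
    g = V_prime(xs) - np.gradient(b, xs) * b
    log_p = 2 * np.cumsum(g / b**2) * dx
    p = np.exp(log_p - log_p.max())
    return p / (p.sum() * dx)

V_prime = lambda x: 2*x - x**3                                     # a cusp-type V' (arbitrary controls)
p_const = stationary_density(V_prime, lambda x: np.ones_like(x))   # constant diffusion b**2
p_quad = stationary_density(V_prime, lambda x: 1 + x**2)           # quadratic diffusion b**2
# The same V yields differently shaped densities for the two diffusion functions,
# which is exactly the dependence on b**2 discussed in the next section.
print(p_const.argmax(), p_quad.argmax())
```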
3.3. Issues with Cobb's Approach As alluded to earlier, the gist of catastrophe theory is to arrive at a minimal classification of the possible equilibrium forms of gradient systems undergoing quasi-static parameter variation. This is accomplished by transforming away all inessential details of particular instantiations of gradient systems and concentrating on the canonical forms thus obtained. In this way, an infinite variety of equilibrium forms is reduced to a minimal classificatory scheme that can be applied without specification of the underlying dynamical system equations. It seems, however, that Cobb's approach does not conform to this basic reductionistic tenet of catastrophe theory. The typology of stationary probability densities arrived at in his approach depends on b²[x] and, through g[x], on V[x]. This implies that each elementary catastrophe that is substituted for V[x] gives rise to a potentially infinite number of density types, one for each distinct form of b²[x]. Hence, it would seem that Cobb's approach creates a plethora of canonical forms, which reintroduces the need to consider the details of each system in order to determine its equilibrium type. Note, however, that this presumed state of affairs rests on the assumption that Equation (19) indeed represents the proper canonical form of stationary probability densities. That is, (19) should represent what is left after transforming away all inessential details of particular instantiations of stochastic gradient systems. But obviously this is not the case. It is always possible to find a change of variable y = f[x] such that application of Ito's formula, Equation (15), to Equation (18) will yield a transformed SDE in which the diffusion term b²[y] = 1. In fact, for each given instance of Equation (18), the diffusion term can be transformed into an infinite variety of forms by judicious application of Ito's formula. Consequently, the stationary density types in Cobb's approach are not canonical in the intended sense because they depend on the specific forms of diffusion terms, whereas these diffusion terms are not invariant under smooth coordinate transformations.
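That the diffusion term can always be transformed to unity is easy to verify symbolically: choosing y = f[x] with f'[x] = 1/b[x] makes the factor b[x]f'[x] in Ito's formula (15) equal to one. In the sketch below the particular drift and diffusion functions are arbitrary choices for illustration:

```python
import sympy as sp

x = sp.symbols('x')
a = 2*x - x**3           # drift (the cusp-type drift used later in the chapter)
b = 1 + x**2             # an arbitrary positive diffusion function

f = sp.integrate(1 / b, x)                              # y = f(x) with f'(x) = 1/b(x)
new_diffusion = sp.simplify(b * sp.diff(f, x))          # b[x]*f'[x] from Ito's formula (15)
new_drift = sp.simplify(a * sp.diff(f, x) + b**2 * sp.diff(f, x, 2) / 2)

print(new_diffusion)     # 1: the transformed SDE has unit diffusion
print(new_drift)         # the x-dependence has been absorbed into the drift instead
```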
3.4. An Alternative Approach The classificatory scheme of Cobb depends on a bivariate feature vector: the potential V[x] and the diffusion term b²[x]. Given a fixed form of V[x], one can get different density types for different forms of b²[x]; and vice versa, given a
fixed form of b²[x], different density types are obtained for different forms of V[x]. Pertaining to this, it was shown in the previous section that the diffusion term is not invariant under smooth coordinate transformations and therefore is not suitable to index canonical forms. Hence, one plausible alternative approach consists of trying to redefine the feature vector underlying the classificatory scheme so that it no longer depends on the diffusion term in the way envisaged by Cobb. This could be accomplished by simply removing the diffusion term from the feature vector, which would result in a univariate feature vector composed only of the potential. Note, however, that this implies a redefinition of the featured potential itself. We address this important point first. The potential in Cobb's approach is defined as the entity whose gradient yields the drift term in Equation (18). This potential is a component of g[x], where g[x] itself is a component of the stationary density, Equation (19). One can only recover Cobb's potential from the stationary density (19) if the diffusion term b²[x] is given, which is why Cobb's approach depends on a bivariate feature vector. Thus in order to arrive at a classificatory scheme of stationary densities that depends only on a potential, Cobb's definition of the featured potential will not do. Instead, we will have to turn to the definition of the stationary probability density given by Equation (17), taking n = 1: p(x) = exp(-Φ[x]). Note that the potential Φ[x] in Equation (17) is entirely different from Cobb's potential V[x] underlying the drift term in Equation (18). In terms of Cobb's scheme based on (18), Φ[x] is a nonlinear function of V[x] and the diffusion term b²[x]. Moreover, Φ[x] completely characterizes stationary densities and hence can be used as the univariate feature indexing canonical forms for these densities. In summary, it would seem that a minimal classificatory scheme for stationary probability densities can be based on the canonical forms of the potential Φ[x] in Equation (17). It is then required that x(t) obey the detailed balance condition (which is automatically the case if n = 1). Taking Φ[x] as the featured potential implies that, in contrast with Cobb's approach, no distinct reference is made to the drift term and/or the diffusion term in an SDE. This alleviates the problems in Cobb's approach, which, as alluded to earlier, arise from transformation of the diffusion term. We now can concentrate on the way Φ[x] behaves under coordinate transformations in which the transformations concerned are given by the Ito formula, Equation (15). The main question then becomes: Does application of Ito's formula (15) to Φ[x] in (17) (or its n > 1 analogue) yield the same elementary catastrophes as the application of local diffeomorphisms, Equation (1), to the potential associated with a deterministic gradient system? In other words, is there a direct transfer of elementary catastrophe theory to metrical stochastic systems? The answer is, unfortunately, negative and leads to some rather fundamental problems that we have studied intensely for the past year.
3.4.1. The main problem
It can be shown that the collection of all possible potential forms in Equation (17) constitutes a single equivalence class (type). That is, each given form of Φ[x] can be transformed into any other possible form by means of Ito's formula. In the words of Zeeman (1988), all potential forms in Equation (17) are "diffeomorphically measure equivalent." For instance, let Φ[x] be the cusp catastrophe Cat[1;2] given by Equation (8), and let the values of the control variables c₁ and c₂ be chosen so that (8) has three critical points (two of which are stable and one is unstable). Then, the associated stationary probability density p(x) = exp(-Cat[1;2]) has two modes corresponding to the two stable critical points of the cusp catastrophe. This bimodal density with cuspoid potential, however, can always be transformed by means of Ito's formula into a unimodal density for which the potential has Morse form (Hartelman, van der Maas, & Molenaar, 1995). Thus the distinction between Thom forms and Morse forms, a distinction that is central to the program of catastrophe theory, collapses at the level of modes of stationary densities of metrical stochastic systems. To illustrate this collapse at the level of modes of stationary densities, consider the simulated realization of dx(t) = (2x(t) - x(t)³)dt + dw(t), 0 < t < 100, shown in Figure 1. The drift term of x(t) is given by the gradient of the cusp catastrophe and the diffusion term is 1; in the notation of Equation (13), a[x] = 2x(t) - x(t)³, and b[x] = 1. It then follows directly from Equation (17) that the stationary probability density of x(t) is given by p(x) = c⁻¹exp(2x² - x⁴/2), where c⁻¹ is a normalizing constant. Hence, the stationary density p(x) is bimodal.
FIGURE 1. Simulated realization by means of numerical integration of a univariate SDE, dx(t) = a[x]dt + b[x]dw(t), 0 < t < 100, where the drift a[x] = 2x - x³ and the diffusion term b[x] = 1.
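The construction behind Figures 1 through 4 is easy to reproduce. The sketch below is only an illustration under assumed settings (Euler-Maruyama integration, step size 0.01, an arbitrary random seed, and 40 histogram bins are choices made here, not settings reported for the original figures): it simulates x(t), histograms the realized values, and then histograms the transformed values y = x + x³ discussed below.

    import numpy as np

    rng = np.random.default_rng(0)            # arbitrary seed, for illustration only
    dt, T = 0.01, 100.0
    n_steps = int(T / dt)
    x = np.zeros(n_steps)
    for k in range(n_steps - 1):
        drift = 2.0 * x[k] - x[k] ** 3        # a[x] = 2x - x^3, the cusp gradient
        x[k + 1] = x[k] + drift * dt + np.sqrt(dt) * rng.standard_normal()

    counts_x, edges_x = np.histogram(x, bins=40)   # histogram of the realized values, cf. Figures 1 and 2

    y = x + x ** 3                                 # the Ito transformation considered below
    counts_y, edges_y = np.histogram(y, bins=40)   # histogram of the transformed values, cf. Figures 3 and 4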
A rough estimate of p(x) can be obtained by determining a histogram of the realized values shown in Figure 1. This histogram is presented in Figure 2 and clearly has the expected bimodal form. We next consider the following transformation of x(t): y(t) = f[x] = x(t) + x(t)³. An application of Ito's formula (15) this time yields the drift and diffusion terms of the y(t) process: a[y] = 5x(t) + 5x(t)³ - 3x(t)⁵, b[y] = 1 + 3x(t)². Figure 3 shows a realization of this y(t) process, where again 0 < t < 100, and Figure 4 shows the histogram associated with this realization. It turns out that this histogram, and hence the stationary density of the transformed x(t) process, is now unimodal, illustrating the main problem discussed in this section.
3.4.2. Zeeman's solution
The stationary density, Equation (17), constitutes a probability measure. Smooth coordinate transformations (diffeomorphisms) in the space of measures are defined by Ito's formula. In contrast, the potential associated with a deterministic gradient system constitutes a function. The space of continuous functions is dual to the space of measures, and smooth coordinate transformations in function space are defined by Equation (1). We have already seen that the crucial distinction in function space between Morse forms and Thom forms collapses in measure space because these forms are diffeomorphically measure equivalent. To avoid this collapse, Zeeman (1988) proposes to treat the measure defined by Equation (17) as a function. More specifically, he states (1988):
FIGURE 2. Histogram of the realized x(t) values depicted in Figure 1.
FIGURE 3. Simulated realization by means of numerical integration, 0 < t < 100, of the transformed x(t) process depicted in Figure 1. The transformation is y(t) = f[x] = x + x³.
We regard the steady state u [our stationary density (17)] as a tool, a quantitative property of v [the deterministic gradient system acting as drift term in an SDE]. And then, having constructed the tool, we can use the tool in any way we please to study v. ... Therefore we can apply the qualitative theory of functions to the tool to obtain a qualitative description of v. (p. 132)
FIGURE 4. Histogram of the realized y(t) values depicted in Figure 3.
An application of Zeeman's solution to avoid the collapse of distinct canonical forms seems to be straightforward. For a given stochastic process, one first determines the stationary probability density. Then the potential Φ[x] associated with this stationary density can be conceived of as a function; that is, Φ[x] is treated in the same way as the potential V[x] associated with deterministic gradient systems and thus is amenable to smooth coordinate transformations defined by Equation (1). It then follows that the elementary catastrophes obtained for V[x] also yield the canonical forms for the critical points of Φ[x] (and hence for the modes of the associated stationary density).
3.4.3. Discussion
Zeeman's solution is presented in a fundamental paper (Zeeman, 1988) that introduces a new definition of the stability of dynamical systems. It is one of the rare technical publications dealing with catastrophe theory of stochastic systems, apart from Cobb's work, and presents a challenging and innovative point of view. Here, however, we have to restrict our discussion to the shift from measure space to function space proposed by Zeeman. It is clear that this shift prevents the collapse of Morse and Thom forms referred to in section 3.4.1. However, there are some issues that require further scrutiny. Perhaps the most important of these concerns the "timing" of the shift, that is, at which stage in the derivation of the stationary density in measure space does one shift to a function space perspective? To elaborate, suppose we are given a realization x(t), for t in the interval [0, T], of a homogeneous stochastic process. Then first the stationary density has to be estimated, for instance, by a cubic spline fit to the empirical histogram. Let the estimated potential obtained be denoted by Φ[x|T]. Is this the potential that according to Zeeman's approach has to be treated as the function characterizing the observed process? Or should Ito's formula be applied to Φ[x|T], yielding a transformed potential Φ[f[x]|T], before the shift to a function space perspective is made? It would seem that Ito's transformation should not be made, because that would open up the door again for the problem discussed in section 3.4.1 (e.g., Φ[x|T] might be a Thom form which is Ito transformed to a Morse form Φ[f[x]|T]). Accordingly, it appears to be the estimated potential Φ[x|T] in the stationary density of the given realization x(t) to which Zeeman's solution applies. This assigns a special status to the actual measurements, to their operationalization, dimensionality, and scale. For instance, it may make a difference to Zeeman's solution whether the behavior of a system is measured in terms of amplitude or in terms of energy (which is proportional to the second power of amplitude). This special status of the given (chosen) measurements, however, will be problematic for the social sciences, where fundamental measures (dimensions) are rare and it is often a matter of taste as to which measurement scales are being used. It should be noted that Zeeman's aims in his 1988 paper are different from
ours. Zeeman introduces SDEs with known properties (e.g., their variance ε, where ε > 0, is small) to define a so-called ε-smoothing of the equilibria of deterministic gradient systems and thereby arrive at new definitions of equivalence and stability of gradient systems. In social scientific applications, however, such a priori information is almost always lacking. It then appears that Zeeman's solution presents some problems of its own that require further elaboration. Yet, a shift from the space of measures to function space as stipulated by Zeeman would seem unavoidable in order to prevent the collapse of canonical forms for the potentials in stationary densities. Moreover, the focus on stationary probability densities in stochastic catastrophe theory is plausible because these stationary densities constitute direct analogues of the equilibria of deterministic gradient systems that figure in elementary catastrophe theory. If, however, one were to change the focus on stationary densities, then one could look for an alternative characterization of stochastic processes for which the canonical forms in measure space keep the distinction between elementary catastrophes intact. In such an approach, it would no longer be necessary to shift from a measure space to a function space perspective. In closing this section, we present the outline of one such alternative approach. To reiterate, we are looking for an alternative measure that characterizes the equilibria of homogeneous stochastic systems while avoiding the problems with stationary probability density mentioned in section 3.4.1. Such a characterization should not cause a collapse into a single canonical form under Ito transformation, but instead keep intact in measure space the distinct canonical forms of elementary catastrophe theory. Recently, Hartelman has identified such an alternative characterization: the number of level-crossings n_l(T) of a stochastic process, where l is the level and [0, T] the interval of observation (cf. Florens-Zmirou, 1991, for a detailed discussion of level-crossings in SDEs). It can be shown (Hartelman et al., 1995) that the number of modes of the measure n_l(T) stays invariant under Ito transformation. This would seem sufficient to establish something approaching a one-to-one correspondence between canonical forms of n_l(T) in measure space and elementary catastrophes in function space. Thus it appears to be possible to arrive at a sensible catastrophe theory of stochastic systems in measure space which does not require the shift to a function space perspective underlying Zeeman's approach.
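A rough, purely illustrative way to look at the level-crossing characterization is sketched below. The simulated path, the grid of 50 levels, and the particular transformation y = x + x³ are assumptions made for this example only, not part of the formal results of Hartelman et al. (1995).

    import numpy as np

    def level_crossings(path, levels):
        # number of times the path crosses each level l during the observation interval
        counts = []
        for l in levels:
            s = np.sign(path - l)
            s = s[s != 0]                    # ignore samples lying exactly on the level
            counts.append(int(np.sum(s[1:] != s[:-1])))
        return np.array(counts)

    rng = np.random.default_rng(1)
    dt, n_steps = 0.01, 10_000
    x = np.zeros(n_steps)
    for k in range(n_steps - 1):             # the cusp SDE of section 3.4.1
        x[k + 1] = x[k] + (2 * x[k] - x[k] ** 3) * dt + np.sqrt(dt) * rng.standard_normal()
    y = x + x ** 3                           # an Ito transformation of the same path

    n_x = level_crossings(x, np.linspace(x.min(), x.max(), 50))
    n_y = level_crossings(y, np.linspace(y.min(), y.max(), 50))
    # the shape of n_l(T) as a function of l can now be compared for x(t) and y(t)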
4. CATASTROPHE THEORY FOR DISCRETE STOCHASTIC SYSTEMS
Until now, we have been considering metrical deterministic and stochastic systems. That is, systems whose state (behavioral) variables are real-valued. Moreover, we have considered the evolution of these systems in continuous
time under smooth quasistatic variation of the system parameters (control variables). In this setting of continuous variation of state variables, parameters, and time, it was shown that discrete changes (elementary catastrophes) emerge in the equilibrium forms of the systems concerned. These discrete changes in equilibrium type constitute a rather special instance of what is the main theme of this book, categorical variables. In this section, however, we consider categorical-valued processes in another, more direct sense. In particular, we outline a possible extension of the program of catastrophe theory to stochastic systems with categorical state variables. The deliberations will remain at a rather general level, mainly concentrating on the possibility of deriving potentials for discrete stochastic systems as a prerequisite for the application of catastrophe theory. We start with potential theory for denumerable Markov chains in discrete time.
4.1. Potentials for Markov Chains
Let S denote the set of categories making up the state space of a Markov chain. We denote representative elements of S by i, j, k, and so on. Let x_n, where n is integer-valued, represent the Markov chain. Then it holds that

    Pr[x_{n+1} = i | x_n = j, x_{n-1} = k, ..., x_0 = l] = Pr[x_{n+1} = i | x_n = j],        (20)
where Pr[a | b] denotes the conditional probability of a given b. In addition, we assume that the Markov chain is homogeneous:

    Pr[x_{n+1} = i | x_n = j] = Pr[x_{n+m+1} = i | x_{n+m} = j].        (21)
For the state space S, the complete set of conditional probabilities given by Equation (21) can be conveniently collected in a square, so-called transition matrix P. Then, given the starting probabilities Pr[x_0 = i], the measure on the state space S of a Markov chain is completely determined by the transition matrix P. Now the question we want to address is whether a potential can be defined for a Markov chain. Perhaps the most extensive treatment of this question can be found in a monograph by Kemeny, Snell, and Knapp (1976). They show that almost all results of potential theory generalize to Markov chains, although some require further assumptions that are analogous to the detailed balance condition for metrical stochastic systems alluded to earlier in section 3.1. More specifically, they present Markov chain analogues of potential theory concepts such as charge, potential operator, harmonic function, and so on. In particular, the Markov chain analogues of the potential are

    lim e_l(I + P + ... + Pⁿ)   or   lim (I + P + ... + Pⁿ)e_r,        (22)

where lim denotes the limit for n approaching infinity, e_l is a finite normed row vector called left charge, and e_r a finite normed column vector called right charge.
In closing this section, note that Equation (22) gives two Markov chain analogues of the potential, one associated with a left charge e_l and the other with a right charge e_r. Kemeny et al. (1976) refer to the analogue associated with the right charge as the potential function, whereas the analogue associated with the left charge is denoted by the potential measure. To appreciate this difference, note that Kemeny et al. define Lebesgue integration on the denumerable state space S of a Markov chain by a vector inner product in which a nonnegative row vector constitutes the measure and a column vector constitutes the function (1976, p. 23). This would seem to imply that only the potential function associated with the right charge can be conceived of as the Markov chain analogue of the potential in elementary catastrophe theory. But regarding the potential measure associated with the left charge, they state (Kemeny et al., 1976, p. 182): "Classically, potentials are left as point functions and are never transformed into set functions because such a transformation is frequently impossible. In Markov chain potential theory, however, every column vector can be transformed into a row vector by the duality mapping." Note that this resembles the shift from measure space to function space in Zeeman's approach to stochastic catastrophe theory for metrical stochastic systems (cf. section 3.4.2). At present, however, nothing more can be said about this interesting correspondence because a catastrophe theoretical analysis of the Markov chain potentials given by Equation (22) is still lacking.
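As a purely numerical illustration of Equation (22): for a transient chain the series I + P + P² + ... converges to (I - P)⁻¹, so both analogues can be evaluated directly. The substochastic transition matrix and the two charge vectors below are hypothetical and chosen only to show the computation.

    import numpy as np

    # hypothetical substochastic transition matrix among transient states (row sums < 1)
    P = np.array([[0.5, 0.3],
                  [0.2, 0.4]])
    N = np.linalg.inv(np.eye(2) - P)     # limit of I + P + P^2 + ... for a transient chain

    e_l = np.array([0.7, 0.3])           # a left charge (row vector)
    e_r = np.array([1.0, -1.0])          # a right charge (column vector)

    potential_measure = e_l @ N          # analogue associated with the left charge
    potential_function = N @ e_r         # analogue associated with the right charge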
4.2. Potentials for Markov Processes
A continuous time homogeneous Markov process x(t) with denumerable state space S obeys conditions similar to Equations (20) and (21). For such processes, it is possible to derive a so-called master equation (e.g., Van Kampen, 1981). Denoting Pr[x(t) = i] by p_i(t), the master equation is given by
    dp_i(t)/dt = Σ_j [w_ij p_j(t) - w_ji p_i(t)],        (23)
where w_ij is the transition probability per unit time from state j to state i. If for the state space S the complete set of transition probabilities is collected in a square matrix W, where the ith diagonal element of W is defined by w_ii = -Σ_j w_ji for j unequal to i, then we obtain a representation that is similar to the one for Markov chains given in the preceding section. On the basis of this representation, Markov process analogues for potential theory concepts can again be derived. For this, we refer to the large literature on potential theory for Markov processes, including the momentous monograph by Doob (1984).
4.3. Discussion
The message of the preceding two sections can be summarized in a simple statement: there exists a well-developed potential theory for denumerable Markov
chains and processes. This would seem to provide us with a convenient stepping stone for the application of catastrophe theory to these types of stochastic systems. Unfortunately, such a catastrophe theoretical analysis has not yet been undertaken. At present, we do not know whether an application of catastrophe theory to the potential analogues of Markov chains and processes will prove to be fruitful. Perhaps alternative approaches may fare better, such as the approximation of denumerable Markov processes by stochastic differential equations to generalize the results presented in section 3. We have concentrated on the representation of homogeneous Markov chains and processes in terms of the transition matrices P and W, respectively. It should be noted that these representations are inherently linear and therefore would seem to be uninteresting for further catastrophe theory analysis in which the focus is on nonlinear potentials. Yet, the latter conjecture does not hold: for instance, Gilmore (1981) shows how the program of catastrophe theory is extended to matrices. More specifically, suppose that the transition rates w_ij in the master equation (23) are nonlinear functions in i, j and depend on the control variables c. Suppose also that the detailed balance condition holds, implying that the W matrix is symmetric. Then it follows that the stationary probability density of the Markov process is given by the eigenvector of W associated with the largest (zero) eigenvalue (cf. Van Kampen, 1981, p. 126). The components of this eigenvector are nonlinear functions in the states i, j, and will depend smoothly on the control variables in c (Gilmore, 1981). Hence, despite the linearity of the master equation in W, its stationary solution provides the proper setting for further catastrophe theoretical analysis.
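The eigenvector computation referred to above can be sketched as follows; the three-state rate matrix is hypothetical and serves only to show the mechanics of extracting the stationary distribution from W.

    import numpy as np

    # hypothetical transition rates: w[i, j] is the rate from state j to state i
    w = np.array([[0.0, 0.4, 0.1],
                  [0.3, 0.0, 0.5],
                  [0.2, 0.1, 0.0]])
    W = w - np.diag(w.sum(axis=0))       # w_ii = -sum_j w_ji, so each column of W sums to zero

    vals, vecs = np.linalg.eig(W)
    k = np.argmax(vals.real)             # the largest (zero) eigenvalue of a rate matrix
    p_stat = np.real(vecs[:, k])
    p_stat = p_stat / p_stat.sum()       # normalized stationary distribution of the master equation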
5. GENERAL DISCUSSION AND CONCLUSION
Catastrophe theory is an advanced mathematical subject, and so is the (potential) theory of stochastic processes. In this chapter, we have considered some forms of integration of these two theories. Of course, this is an enormous task and we could only present a few preliminary remarks about possible approaches. Yet, the task has to be addressed in order to arrive at principled applications of catastrophe theory to the stochastic processes typically observed in the biological and social sciences. As a first contribution to that, we presented two main results, one negative and one positive. The negative result pertains to Cobb's approach to catastrophe theory for metrical stochastic systems, which was shown to be flawed in one rather fundamental respect. The positive result concerns the presentation of principled ways to extend catastrophe theory to metrical stochastic systems, in particular Zeeman's approach. It should be remembered that in this chapter we introduced several assumptions of a restrictive nature. For instance, it was assumed that the processes un-
der scrutiny are homogeneous and obey the detailed balance condition. Although these assumptions considerably eased the presentation, it remains to be established whether they are necessary and plausible. In fact, this also applies to the basic concept of a potential; we considered its possibly restrictive nature in section 1, but still have to address its interpretation in the context of developmental processes. We could take a formal route and refer to the stationary density of Equation (17): p(x) = exp(-Φ[x]). If Φ[x] has a Morse form, then Equation (17) reduces to the (multivariate) Normal density. In the latter case, the potential is proportional to the quadratic form x′Θx, where Θ denotes the inverse of the covariance matrix of x, and hence this potential can be interpreted as an analogue of energy. More informally, we can use the interpretation given by Helbing (1994) of the potential in his model for the behavior of pairwise interacting individuals. According to Helbing, this potential can be understood as a social field. Analogously, the potential in the stationary density (17) of developmental processes could be interpreted as a developmental field. This would correspond nicely with the prominence of field concepts in mathematical biology and embryology (e.g., Meinhardt, 1982). In conclusion, the extension of the program of elementary catastrophe theory to stochastic systems leads to a number of deep and challenging questions which for the most part are still unexplored. The issues at stake not only pertain to the formal framework of catastrophe theory for stochastic systems but also relate to basic questions in developmental theory. For instance, there is the question of how to arrive at an integral conceptualization of intra- and interindividual differences in the timing and pattern (horizontal and vertical décalages) of discrete stage transitions marking the emergence of qualitatively new behavior. Another is the specification of causal models of probabilistic epigenesis. We hope to have made clear that there exist principled approaches to resolve these questions. Thus we expect that further elaboration of the program of stochastic catastrophe theory will lead to fundamental progress in developmental theory.
REFERENCES
Castrigiano, D. P. L., & Hayes, S. A. (1993). Catastrophe theory. Reading, MA: Addison-Wesley.
Cobb, L. (1978). Stochastic catastrophe models and multinomial distributions. Behavioral Science, 23, 360-374.
Cobb, L. (1981). Parameter estimation for the cusp catastrophe model. Behavioral Science, 26, 75-78.
Cobb, L., Koppstein, P., & Chen, N. H. (1983). Estimation and moment recursion relations for multimodal distributions of the exponential family. Journal of the American Statistical Association, 78, 124-130.
Cobb, L., & Zacks, S. (1985). Applications of catastrophe theory for statistical modeling in the biosciences. Journal of the American Statistical Association, 80, 793-802.
Doob, J. L. (1984). Classical potential theory and its probabilistic counterpart. New York: Springer-Verlag.
Florens-Zmirou, D. (1991). Statistics on crossings of discretized diffusions and local time. Stochastic Processes and their Applications, 39, 139-151.
Gardiner, C. W. (1990). Handbook of stochastic methods for physics, chemistry and the natural sciences (2nd ed.). Berlin: Springer-Verlag.
Gilmore, R. (1981). Catastrophe theory for scientists and engineers. New York: Wiley.
Haken, H. (1983). Advanced synergetics. Berlin: Springer-Verlag.
Hartelman, P., van der Maas, H. L. J., & Molenaar, P. C. M. (1995). Catastrophe analysis of stochastic metrical systems in measure space. Amsterdam: University of Amsterdam.
Helbing, D. (1994). A mathematical model for the behavior of individuals in a social field. Journal of Mathematical Sociology, 19, 189-219.
Huseyin, K. (1986). Multiple parameter stability theory and its applications: Bifurcations, catastrophes, instabilities. Oxford: Clarendon Press.
Jackson, E. A. (1989). Perspectives of nonlinear dynamics (Vol. 1). Cambridge, UK: Cambridge University Press.
Kemeny, J. G., Snell, J. L., & Knapp, A. W. (1976). Denumerable Markov chains. New York: Springer-Verlag.
Meinhardt, H. (1982). Models of biological pattern formation. London: Academic Press.
Molenaar, P. C. M. (1986). On the impossibility of acquiring more powerful structures: A neglected alternative. Human Development, 29, 245-251.
Molenaar, P. C. M. (1990). Neural network simulation of a discrete model of continuous effects of irrelevant stimuli. Acta Psychologica, 74, 237-258.
Nicolis, G., & Prigogine, I. (1977). Self-organization in nonequilibrium systems. New York: Wiley.
Poston, T., & Stewart, I. N. (1978). Catastrophe theory and its applications. London: Pitman.
Prigogine, I. (1980). From being to becoming: Time and complexity in the physical sciences. San Francisco: Freeman.
Revuz, D., & Yor, M. (1991). Continuous martingales and Brownian motion. Berlin: Springer-Verlag.
Thom, R. (1975). Structural stability and morphogenesis. Reading, MA: Benjamin.
van der Maas, H. L. J., & Molenaar, P. C. M. (1992). Stagewise cognitive development: An application of catastrophe theory. Psychological Review, 99, 395-417.
Van Kampen, N. G. (1981). Stochastic processes in physics and chemistry. Amsterdam: North-Holland.
Wohlwill, J. F. (1973). The study of behavioral development. New York: Academic Press.
Zeeman, E. C. (1988). Stability of dynamical systems. Nonlinearity, 1, 115-155.
PART 3
Latent Class and Log-Linear Models
Some Practical Issues Related to the Estimation of Latent Class and Latent Transition Parameters
Linda M. Collins
Pennsylvania State University, University Park, Pennsylvania

Stuart E. Wugalter
University of Southern California, Los Angeles, California

Penny L. Fidler
University of Southern California, Los Angeles, California
1. INTRODUCTION
Sometimes developmental models involve qualitatively different subgroups of individuals, such as learning disabled versus normal learners, children who have been abused versus children who have not, or substance use abstainers versus experimenters versus regular users. Other developmental models hypothesize that individuals go through a sequence of discrete stages, such as the stages in Piaget's theories of development (Piaget, 1952) or Kohlberg's (1969) stages of moral development. Both types of models are characterized by the involvement
of discrete latent variables. The first models can be tested by means of a measurement model for discrete latent variables, namely, latent class analysis (Clogg & Goodman, 1984; Goodman, 1974; Lazarsfeld & Henry, 1968). Stage sequential models can be tested by means of latent transition analysis (LTA) (Collins & Wugalter, 1992; see also Hagenaars & Luijkx, 1987; van de Pol, Langeheine, & de Jong, 1989), which is an extension of latent class models to longitudinal data. Typically, in latent class and latent transition models, data are analyzed in the form of frequencies or proportions associated with "response patterns." Conceptually, a response pattern is a set of possible subject responses to the manifest variables. For example, suppose the variables are two yes/no items. Then, No,No is one response pattern, No,Yes is another response pattern, and so on. Technically, a response pattern proportion is simply a cell proportion, in which the cell is part of the cross-classification of all the manifest variables involved in the problem of interest. Latent class models usually involve data collected at one time only. For example, let Y = {i, j, k} represent a response pattern made up of response i to Item 1, response j to Item 2, and response k to Item 3. Then,

    P(Y) = Σ_{c=1}^{C} γ_c ρ_{i|c} ρ_{j|c} ρ_{k|c},
where γ_c is the proportion of individuals in latent class c, and ρ_{i|c} is the probability of response i to Item 1, conditional on membership in latent class c. These parameters are conceptually similar to factor loadings in that they reflect the strength of the relationship between the manifest variables and the latent variable. They are different from factor loadings in that they are estimated probabilities, not regression coefficients. In latent class models, a strong relationship between the latent and manifest variables is reflected in a strong ρ parameter, that is, a parameter close to zero or one. We refer to the ρ parameters as measurement parameters. Latent transition models can involve some variables that are measured only once (because they are unchanging, such as an experimental condition) and other variables that are measured longitudinally (because they are expected to change over time, such as stage membership). Let Y = {m, i, j, k, i', j', k', i'', j'', k''} represent a response pattern that is made up of a single response to a manifest indicator (m) of an exogenous variable and responses to three items at times t (i, j, and k), t + 1 (i', j', and k'), and t + 2 (i'', j'', and k''). Then, the estimated proportion of a particular response pattern P(Y) is expressed as follows (for a first-order model):

    P(Y) = Σ_{c=1}^{C} Σ_{p=1}^{S} Σ_{q=1}^{S} Σ_{r=1}^{S} γ_c ρ_{m|c} δ_{p|c} ρ_{i|p,c} ρ_{j|p,c} ρ_{k|p,c} τ_{q|p,c} ρ_{i'|q,c} ρ_{j'|q,c} ρ_{k'|q,c} τ_{r|q,c} ρ_{i''|r,c} ρ_{j''|r,c} ρ_{k''|r,c},
where γ_c is the proportion in latent class c. In this model, individuals do not change latent class membership over time. The factor δ_{p|c} is the proportion in latent status p at time t conditional on membership in latent class c. This array of parameters represents the distribution across latent statuses at the first occasion of measurement. Latent status membership can and does change over time. The factor τ_{q|p,c} is an element of the latent transition probability matrix. These parameters represent how the sample changes latent status membership over time. The factor ρ_{i|p,c} is the probability of response i to item 1 at time t, conditional on membership in latent status p at time t and on membership in latent class c. These measurement parameters are interpreted in the same way as their counterparts in latent class models. The purpose of this chapter is to address two practical issues related to parameter estimation in latent class and latent transition models. These practical issues arise frequently in developmental research. The first issue has to do with parameter estimation when the number of subjects is small in relation to the number of response patterns. As latent class and latent transition models become used more extensively, it is inevitable that researchers will attempt to test increasingly complex models. In our work, for example, we test models of early substance use onset in samples of adolescents. These models often involve four or more manifest variables used as indicators, measured at two or more time points. The cross-classification of all of these manifest variables measured repeatedly can become quite large. For a problem involving four dichotomous variables measured at two times, there are 2⁸ = 256 cells in the cross-classification. If there are more than two occasions of measurement, if the items have more than two response options, and/or if there are additional items, the table becomes much larger. The size of the cross-classification is not a problem in and of itself, but it often indirectly causes a problem when the sample size is small in relation to the number of cells, that is, under conditions of "sparseness." There is a general, informal consensus that a minimum sample size in relation to the number of cells (N/k) is needed to perform latent class analyses successfully. However, to our knowledge, there are no firm guidelines, either about what the minimum N or N/k should be or about what are the likely consequences if one performs analyses on data that are too sparse. The source of the concern about sparseness is partly the well-established literature on the difficulties of conducting hypothesis testing under sparseness conditions. This literature has shown in numerous settings (see Read & Cressie, 1988) that when the data matrix is sparse, the distributions of commonly used goodness-of-fit indices diverge considerably from that of the chi-squared, making accurate hypothesis testing impossible. Collins, Fidler, Wugalter, and Long (1993; see also Holt & Macready, 1989) showed that under conditions of sparseness in latent class models, the expectation of the likelihood ratio statistic can
be considerably less than the expectation of the chi-square distribution, and that although the expectation of the Pearson statistic is much closer to that of the chi-square, its variance is unacceptably large. Collins et al. (1993) recommended using Monte Carlo simulations to estimate the expectation of the test statistic under the null hypothesis. The demonstrated problems with hypothesis testing under sparseness conditions have left researchers uneasy about parameter estimation under these conditions. It is not known what the effects of sparseness are, if any, on parameter estimation. One possible consequence of sparseness is bias, that is, the expectation of the parameter estimates could be greater than or less than the parameter being estimated. Another consequence could be an unacceptably large mean squared error (MSE), indicating a lack of precision in parameter estimation. Also, it is not known what the limits of sparseness are, that is, how small can the number of subjects be in relation to the number of cells if parameters are to be estimated with acceptable accuracy? Another important question related to sparseness is whether, given a fixed N, adding a manifest indicator is a benefit, because it increases the amount of information, or a detriment, because it increases sparseness. The second practical issue addressed by this chapter is the estimation of standard errors for parameters. Estimation for latent class and latent transition models is usually performed using the EM algorithm (Dempster, Laird, & Rubin, 1977). Less often, estimation is performed using Fisher's method of scoring (Dayton & Macready, 1976). The EM algorithm has the advantage of being a very robust estimation procedure. Its disadvantages are that it is slow compared with Fisher's method of scoring, and that it does not yield standard errors of the parameters it estimates. Unlike the Fisher scoring method, the EM algorithm does not require computation of the information matrix, hence standard errors are not a by-product of the procedure. It has been suggested (e.g., Rao, 1965, p. 305) that standard errors can be obtained by inverting the information matrix computed at the final iteration of the EM procedure. This sounds reasonable, but just how well this works has never been investigated. It also remains to be seen how well standard errors can be estimated in very sparse data matrices.
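To make the suggestion concrete, one generic way to carry it out is sketched below: approximate the observed information as the numerical Hessian of the negative log-likelihood at the final estimates, invert it, and take square roots of the diagonal. The function is an illustration only (the step size h and the central-difference scheme are arbitrary choices here), not the procedure of any particular latent class program.

    import numpy as np

    def standard_errors(neg_loglik, theta_hat, h=1e-5):
        # invert a central-difference Hessian of the negative log-likelihood at theta_hat
        theta_hat = np.asarray(theta_hat, dtype=float)
        p = len(theta_hat)
        H = np.zeros((p, p))
        for a in range(p):
            for b in range(p):
                def f(da, db):
                    t = theta_hat.copy()
                    t[a] += da
                    t[b] += db
                    return neg_loglik(t)
                H[a, b] = (f(h, h) - f(h, -h) - f(-h, h) + f(-h, -h)) / (4.0 * h * h)
        return np.sqrt(np.diag(np.linalg.inv(H)))    # estimated standard errors

Here neg_loglik would be the negative log-likelihood of the observed response-pattern frequencies, evaluated at a full parameter vector.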
2. METHODS
To investigate the issues presented in the first section, we performed a simulation study. The purpose of the simulation was to generate data based on latent class models with known parameters while systematically varying sparseness and other factors. We estimated the parameters in the data we generated and examined whether the factors we varied affected parameter estimation.
The following independent variables were selected as likely to affect the accuracy of estimation:
1. Number of items. We used four-, six-, and eight-item models. Because dichotomous items were used, this resulted in data sets with 16, 64, and 256 response patterns, respectively.
2. Size of ρ. We reasoned that the size of ρ might be an important factor because all else being equal, latent class models with extreme values (close to zero or one) for the ρ parameters generate sparser data matrices than latent class models with less extreme ρ parameters. This factor had two levels: ρ = .65 and ρ = .9.
3. Sparseness. This was operationalized as N/k, that is, the number of subjects divided by the number of cells (in this case, response patterns). Four levels of N/k were crossed with the other factors in the study: 1, 2, 4, and 16, with N/k of 1 being a level that many researchers would feel is too low and N/k of 16 being a level that most researchers would agree is "safe."
Thus the design as discussed here involves the following fully crossed factors: four levels of N/k, three levels of number of items, and two levels of size of ρ, or 24 cells. To investigate extreme sparseness, four more cells were run with N/k = .5. One cell had six items, ρ = .65; one had six items, ρ = .9; one had eight items, ρ = .65; and one had eight items, ρ = .9. We did not cross N/k = .5 fully with number of items, because the four-item conditions would have involved only eight subjects. Table 1 shows the design, including the number of artificial "subjects" in each data set, computed by multiplying N/k times the number of response patterns. As Table 1 shows, the design was set up so that there would be multiple cells with the same N in order to make it possible to investigate the effects of adding items while holding N constant.

TABLE 1
Design of Simulation and Numbers of Subjects Used to Generate Data

                                   N/k
           N of items      .5      1      2      4      16
ρ = .65        4            *     16     32     64     256
               6           32     64    128    256    1024
               8          128    256    512   1024    4096
ρ = .90        4            *     16     32     64     256
               6           32     64    128    256    1024
               8          128    256    512   1024    4096

*Cell not included in design.
One thousand random data sets were generated for each cell. Parameters of the known true model were estimated in each data set using the EM algorithm (Dempster et al., 1977). All of the data sets were generated using models with two latent classes, so one γ parameter was estimated in each data set. In the four-item data sets, all eight ρ parameters were freely estimated. To keep the total number of parameters estimated constant across data sets, constraints were imposed on the ρ parameters in the six- and eight-item conditions. In the six-item models, the ρ parameters associated with the fifth and sixth items were constrained to be equal to those for the first and second items, respectively. In the eight-item models, the ρ parameters associated with the fifth through eighth items were constrained to be equal to those associated with the first through fourth items, respectively. Thus each data set involved estimation of one γ parameter and eight ρ parameters.
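For readers who want to see the estimation step spelled out, a minimal EM cycle for an unconstrained two-class model with dichotomous items might look like the sketch below. It is a generic illustration (random starting values, a fixed number of iterations, no equality constraints), not the program used in this study.

    import numpy as np

    def em_two_class(data, n_iter=500, seed=0):
        # data: subjects x items array of 0/1 responses
        n, k = data.shape
        rng = np.random.default_rng(seed)
        gamma = np.array([0.5, 0.5])                    # latent class proportions
        rho = rng.uniform(0.3, 0.7, size=(2, k))        # P(pass item | class)
        for _ in range(n_iter):
            # E-step: posterior probability of each class for every subject
            like = np.zeros((n, 2))
            for c in range(2):
                like[:, c] = gamma[c] * np.prod(
                    rho[c] ** data * (1.0 - rho[c]) ** (1 - data), axis=1)
            post = like / like.sum(axis=1, keepdims=True)
            # M-step: update gamma and rho from the posterior weights
            gamma = post.mean(axis=0)
            rho = (post.T @ data) / post.sum(axis=0)[:, None]
        return gamma, rho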
2.1. Data Generation
There is a finite set of possible response patterns that can be generated for a given set of items, and each latent class model produces a corresponding vector of predicted response pattern probabilities. We used a total of six vectors associated with the three numbers of items and the two measurement strengths. A single vector was used to generate data for each of the four levels of N/k. The data were generated using a uniform random number generator written in FORTRAN. Data were generated for a single subject by randomly selecting a number from zero to one from the uniform distribution. This number was then compared with the cumulative response pattern probability vector, which determined a particular subject's response pattern. A mastery model was used to generate the data, in which one of the latent classes in each model was a "master" latent class, where the probability of passing the item was large for each item, and the other was a "nonmaster" latent class, where the probability of passing was small for each item. Of course, in individual data sets the parameter estimates did not always come out this way. In a few instances, the two latent classes did not seem distinct at all, so that it was difficult or impossible to denote one class as a mastery class and one as a nonmastery class. An example is a solution where the probability of passing each item is estimated as .9 for one latent class and .6 for the other. Under these circumstances, it cannot be determined whether a ρ parameter is conditional on membership in the mastery latent class or on membership in the nonmastery latent class. Because the true values for the ρ parameters that are conditional on the mastery latent class are different from those conditional on the nonmastery latent class, when this occurs it is impossible to compare the parameter estimates to their true values, and bias and MSE cannot be assessed. We examined each replication to identify instances of this indeterminacy. Table 2 shows the number of these solutions in each cell.
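The generation scheme just described can be sketched as follows. The two-class mastery model below uses γ = .5 and passing probabilities of .9 for the master class and .1 for the nonmaster class as one reading of the ρ = .9 condition; these values, the four-item setup, the seed, and N = 64 are illustrative assumptions, not a reproduction of the FORTRAN program used in the study.

    import numpy as np
    from itertools import product

    def pattern_probabilities(gamma, rho):
        # rho[c, item] = P(pass item | class c); returns all response patterns and their probabilities
        k = rho.shape[1]
        patterns = np.array(list(product([0, 1], repeat=k)))
        probs = np.zeros(len(patterns))
        for c, g in enumerate(gamma):
            probs += g * np.prod(rho[c] ** patterns * (1.0 - rho[c]) ** (1 - patterns), axis=1)
        return patterns, probs

    gamma = np.array([0.5, 0.5])
    rho = np.array([[0.9, 0.9, 0.9, 0.9],     # "master" class: high passing probabilities
                    [0.1, 0.1, 0.1, 0.1]])    # "nonmaster" class: low passing probabilities
    patterns, probs = pattern_probabilities(gamma, rho)

    rng = np.random.default_rng(0)
    cum = np.cumsum(probs)                    # cumulative response-pattern probability vector
    idx = np.searchsorted(cum, rng.uniform(size=64))
    idx = np.minimum(idx, len(patterns) - 1)  # guard against floating-point rounding at the top
    sample = patterns[idx]                    # 64 simulated subjects' response patterns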
TABLE 2
Number of Indeterminate Solutions Out of 1000 Replications Per Cell

                                   N/k
           N of items      .5      1      2      4      16
ρ = .65        4            *     87     59     35      0
               6           34     19      7      1      0
               8            1      0      0      0      0
ρ = .90        4            *      0      0      0      0
               6            0      0      0      0      0
               8            0      0      0      0      0

*Cell not included in design.

Most of the indeterminate solutions occurred when ρ = .65 and there were four items, with the largest number (87) occurring when N/k = 1, which was the sparsest matrix included for four items. There were no such solutions when ρ = .9, and none when N/k = 16. There was only one instance of indeterminacy when there were eight items, occurring when N/k = .5 and ρ = .65. Data sets yielding indeterminate solutions were removed from the simulation and new random data sets were generated until a total of one thousand suitable solutions was obtained for each cell. The remaining results are based on this amended data set.
2.2. Parameter Estimates
We evaluate the parameter estimates in terms of bias and MSE. Table 3 shows bias and MSE for the γ parameter, and Table 4 shows these quantities for one of the ρ parameters. The pattern of results was virtually identical for all eight ρ parameters, so only one set of results is shown here. In the population from which these data were sampled, the latent class parameter equals .50 and the ρ parameter equals either .65 or .90. There is some bias present in the sample-based estimates of both types of parameters, particularly when ρ = .65. The overall amount of bias is quite small for the latent class parameter; in no cell is it larger than .026. There is considerably more bias in the estimation of the ρ parameters, where bias is more than .03 in six cells. For both parameters, bias virtually disappears when ρ = .90, even for conditions where N/k is as low as .5. Bias generally decreases as N/k increases and as the number of items increases, although there is some slight nonmonotonicity in the bias associated with the latent class parameter. Because certain groups of these parameters sum to one and the parameters are bounded by zero and one, positive bias in the estimation of one parameter in a set implies negative bias somewhere else in the set, and vice versa. For example, the ρ's associated with a
TABLE 3  Bias in Estimation of the γ Parameter

TABLE 4  Bias in Estimation of the ρ Parameter
particular item and a particular latent class must sum to one across response categories. This means that although the bias in Table 4 is positive overall, if we examined the complement of this parameter, we would find that the bias is negative overall. Although there can be a substantial amount of bias when N/k is small, holding N/k constant and increasing the number of items generally tends to decrease the bias. (Note that to maintain a constant N/k, the overall N must be increased when more cells are added.) This pattern is not completely consistent for the latent class parameter. Essentially the same pattern holds for the MSE. The MSE is much smaller overall for both the latent class parameter and the ρ parameter when ρ = .90 than it is when ρ = .65. The MSE generally decreases as N/k increases, and as the number of items increases, although there are some inconsistencies evident in the MSEs for the latent class parameter. Even with N/k = .5, the MSE is small when eight items are used. Given a particular N, is there an increase in bias or MSE when items are added to a latent class model? Tables 5 and 6 show that as the number of items
TABLE 5
Bias in Estimation of the γ Parameter as a Function of N

                                   N of items
           N of subjects       4              6              8
ρ = .65         32         .026 (.041)    .016 (.047)
                64         .008 (.051)    .024 (.054)
               256         .003 (.058)   -.001 (.029)   -.001 (.001)
              1024                         .002 (.007)   -.000 (.000)
ρ = .90         32         .000 (.009)   -.002 (.015)
                64         .001 (.005)   -.000 (.003)
               256        -.000 (.001)   -.000 (.008)   -.001 (.001)
              1024                         .003 (.004)   -.000 (.000)

Note. Mean squared errors are given in parentheses.
TABLE 6
Bias in Estimation of the ρ Parameter as a Function of N

                                   N of items
           N of subjects       4              6              8
ρ = .65         32         .053 (.050)    .040 (.052)
                64         .043 (.039)    .031 (.035)
               256         .038 (.020)    .010 (.009)    .007 (.005)
              1024                         .002 (.002)    .002 (.001)
ρ = .90         32        -.002 (.008)    .003 (.006)
                64         .001 (.003)    .002 (.003)
               256        -.001 (.001)   -.000 (.001)   -.003 (.001)
              1024                        -.001 (.000)    .000 (.000)

Note. Mean squared errors are given in parentheses.
increases for a fixed N, bias and MSE sometimes do increase slightly, at least when the overall N is 64 or less. Under conditions when N is larger than this, generally bias and MSE remain about the same or even decrease slightly as the number of items is increased.
2.3. Estimation of Standard Errors
Each cell of the simulation contains a sampling distribution of each parameter estimated. This sampling distribution is made up of the 1000 replications in each cell. The standard deviation of a parameter estimate across the 1000 replications is our definition of the true standard error of the parameter. To estimate the bias and MSE associated with estimating the standard errors, we subtracted this standard deviation from the estimate of the standard error obtained by inverting the information matrix at each replication. Table 7 contains bias and MSE for the estimation of the standard error of the latent class parameter, and Table 8 contains this information for the estimation of the standard error of the ρ parameter.
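In code, the comparison just described amounts to something like the following for a single cell and a single parameter; the function name and the use of the sample standard deviation are assumptions made for illustration.

    import numpy as np

    def se_bias_and_mse(estimates, se_estimates):
        # estimates, se_estimates: one value per replication (e.g., 1000 per cell)
        true_se = np.std(estimates, ddof=1)          # SD of the empirical sampling distribution
        errors = np.asarray(se_estimates) - true_se  # error of each information-matrix SE
        return errors.mean(), np.mean(errors ** 2)   # bias and MSE of the standard error estimates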
TABLE 7
Bias in Estimation of the Standard Error of the γ Parameter

                                         N/k
           N of items       .5             1             2             4             16
ρ = .65        4             *         .284 (.714)   .167 (.131)   .146 (.150)  -.023 (.015)
               6        .018 (.039)   -.035 (.022)  -.066 (.011)  -.037 (.004)  -.004 (.000)
               8       -.049 (.004)   -.023 (.001)  -.005 (.000)  -.004 (.000)   .000 (.000)
ρ = .90        4             *         .222 (.979)   .011 (.004)  -.003 (.000)   .001 (.000)
               6       -.002 (.000)    .000 (.000)  -.001 (.000)   .000 (.000)  -.000 (.000)
               8       -.000 (.000)   -.001 (.000)   .000 (.000)   .000 (.000)   .000 (.000)

Note. Mean squared errors are given in parentheses.
*Cell not included in design.
TABLE 8
Bias in Estimation of the Standard Error of the ρ Parameter

                                         N/k
           N of items       .5             1             2             4             16
ρ = .65        4             *         .056 (.157)   .057 (.089)   .057 (.104)  -.003 (.013)
               6       -.016 (.037)   -.016 (.026)  -.023 (.009)  -.014 (.001)  -.001 (.000)
               8       -.020 (.001)   -.009 (.000)  -.003 (.000)  -.001 (.000)   .001 (.000)
ρ = .90        4             *         .086 (.366)  -.008 (.007)  -.005 (.000)   .000 (.000)
               6       -.013 (.002)   -.002 (.000)  -.002 (.000)   .000 (.000)   .000 (.000)
               8       -.002 (.000)   -.000 (.000)   .000 (.000)   .000 (.000)   .000 (.000)

Note. Mean squared errors are given in parentheses.
*Cell not included in design.
Tables 7 and 8 show that although generally the standard errors are estimated well, there is considerable bias in the estimation of the standard error for the latent class parameter when ρ = .65, there are four items, and N/k is low. Under these conditions, inverting the information matrix tends to produce a standard error that is larger than the standard deviation of the empirical sampling distribution, particularly for the three lowest values of N/k. The MSE is also large in these conditions, especially when N/k = 1. For both the latent class and ρ parameters, bias tends to be positive in the four-item conditions and negative in the six- and eight-item conditions. Inverting the information matrix produces highly accurate estimates of the standard error in almost all of the ρ = .9 conditions. The one exception is the four-item, N/k = 1 condition, where there is considerable positive bias and the MSE is unacceptably large. The overall pattern of the results is the same for the ρ parameters, except that there is less bias in the estimate of the standard error and the MSEs are smaller in most cases.
3. DISCUSSION
The results of this study are very encouraging for the estimation of latent class models. They suggest that highly accurate estimates of latent class parameters can be obtained even in very sparse data matrices, particularly when manifest items are closely related to the latent variable. Even when N/k was .5 or 1, bias was negligible for estimating the latent class parameter and only slight for estimating the ρ parameters. Both bias and MSE were smallest when ρ = .9, indicating that in circumstances in which the items are strongly related to the latent classes, a smaller N is needed for good estimation compared with circumstances in which the items are less strongly related to the latent classes. Usually, a researcher has only so much data to work with, that is, a fixed N, and is debating how many indicators to include. With measurement models for continuous data, there is usually no debate; more indicators are better. But with latent class and latent transition models, the researcher is faced with a nagging question: Will adding an indicator be a benefit by increasing measurement precision, or will it be a detriment by increasing sparseness? The results of our study suggest that when the ρ parameters are close to zero or one, adding additional items has little effect on estimation. When the ρ parameters are weaker, adding items generally decreases the MSE for overall Ns of 256 or greater, and may increase it slightly for smaller Ns. The effect of the addition of items on bias is less consistent, but generally small. It is important to note that in the design of this simulation, we used constraints so that when additional items were added the total number of parameters estimated remained the same. This amounts to treating some items as replications of each other; conceptually, it is similar to constraining factor loadings to be equal in a confirmatory factor analysis.
Without such constraints, as items are added, more parameters are estimated. If the addition of items is accompanied by estimation of additional parameters, this may change the conclusions discussed in this paragraph. This study also indicates that standard errors of the parameters in latent class models can be estimated well in most circumstances by inverting the information matrix after parameter estimation has been completed by means of the EM algorithm. However, there can be a substantial positive bias in the estimate of the standard error, particularly when ρ is weak and N/k is small. This approach to estimation of standard errors probably should not be attempted when N/k is one or less or there are four or fewer manifest indicators. Serendipitously, this study also revealed a little about indeterminate results, that is, results for which the latent classes are not clearly distinguished. We should note that such results are not indeterminate if they reflect the true model that generated the data. However, in our study, the data generation models involved clearly distinguished latent classes. We found that, again, strong measurement parameters were very important; none of the indeterminate cases occurred when ρ = .9. It was also evident that more items and a greater N/k helped to prevent indeterminate solutions. It is interesting that strong measurement showed itself to be so unambiguously beneficial in this study. On the one hand, ρ parameters close to zero or one are analogous to large factor loadings in factor analysis, clearly indicating a close relationship between manifest indicators and a latent variable. On the other hand, all else being equal, ρ parameters close to zero and one are also indicative of more sparseness. Given a particular N, the least sparse data would come from subject responses spread evenly across all possible response patterns. When the ρ parameters are such that some responses have very high probabilities and others have very low probabilities, subject responses will tend to be clumped together in the high-probability response patterns, whereas the low-probability response patterns will be empty or nearly empty. For this reason, it might have been expected that strong measurement would tend to result in more bias or larger MSEs. This simulation shows that the sparseness caused by extreme measurement parameters is unimportant. Like all simulations, this one can only provide information about conditions that were included. We did not include any conditions where there were more than two latent classes. As mentioned previously, we controlled the number of parameters estimated rather than let the number increase as more items were added. However, in many studies, researchers will wish to estimate ρ parameters freely for any items they add to a model, which will result in an increase in the total number of parameters to be estimated. It would be worthwhile to investigate the effects of this increased load on estimation. Finally, we did not mix the strengths of the measurement parameters. Each data set contained measurement parameters of one strength only. Of course, in empirical research settings
different variables would be associated with different measurement strengths, so the effect of this should be investigated also.
ACKNOWLEDGMENTS
This research was supported by National Institute on Drug Abuse Grant DA04111.
REFERENCES
Clogg, C. C., & Goodman, L. A. (1984). Latent structure analysis of a set of multidimensional contingency tables. Journal of the American Statistical Association, 79, 762-771.
Collins, L. M., Fidler, P. L., Wugalter, S. E., & Long, J. D. (1993). Goodness-of-fit testing for latent class models. Multivariate Behavioral Research, 28, 375-389.
Collins, L. M., & Wugalter, S. E. (1992). Latent class models for stage-sequential dynamic latent variables. Multivariate Behavioral Research, 27, 131-157.
Dayton, C. M., & Macready, G. B. (1976). A probabilistic model for validation of behavioral hierarchies. Psychometrika, 41, 189-204.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215-231.
Hagenaars, J. A., & Luijkx, R. (1987). LCAG: Latent class analysis models and other loglinear models with latent variables: Manual LCAG (Working Paper Series 17). Tilburg, Netherlands: Tilburg University, Department of Sociology.
Holt, J. A., & Macready, G. B. (1989). A simulation study of the difference chi-square statistic for comparing latent class models under violations of regularity conditions. Applied Psychological Measurement, 13, 221-231.
Kohlberg, L. (1969). Stage and sequence: The cognitive-developmental approach to socialization. In D. A. Goslin (Ed.), Handbook of socialization theory and research. Chicago: Rand McNally.
Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston: Houghton Mifflin.
Piaget, J. (1952). The origins of intelligence in children. New York: International Universities Press.
Rao, C. R. (1965). Linear statistical inference and its applications. New York: Wiley.
Read, T. R. C., & Cressie, N. A. C. (1988). Goodness-of-fit statistics for discrete multivariate data. New York: Springer-Verlag.
van de Pol, F., Langeheine, R., & de Jong, W. (1989). PANMARK user manual: PANel analysis using MARKov chains. Voorburg: Netherlands Central Bureau of Statistics.
Contingency Tables and Between-Subject Variability

Thomas D. Wickens
University of California, Los Angeles
Los Angeles, California
1. INTRODUCTION
An important problem in the analysis of contingency tables is how to treat data when several observations come from each subject. How should these dependent responses be combined into a single analysis? The problem is, of course, an old one; I found reference to the effects of heterogeneity of observations going back at least to Yule's textbook in 1911. The dangers of pooling dependent observations in the Pearson statistic are well known, although many researchers are not really sure what to do about it. For example, Lewis and Burke pointed the problem out in 1949 in their discussion of the use and misuse of chi-square tests, and yet Delucchi (1983), in his reexamination of the outcome of their recommendations, noted that errors are still too common. The introduction of new statistical techniques for frequency data has both helped and hurt. On the one hand, we now have both the procedures to represent and analyze the between-subject component of variability in these designs and the computer power needed to implement these procedures. On the other hand, the variety of new techniques, and the incomplete understanding many users have of them, has partially obscured the problem. For example, I have
spoken to several researchers who believe that using "log-linear models" somehow eliminates the problems of combining over subjects. However, this is not really the case. The issue was brought to my attention some time ago when a colleague asked me to look at a paper he was reviewing and to comment on the statistics. The data in question contained results from a group of subjects with several observations per subject, and thus involved a mixture of between-subject and withinsubject variability. They had been analyzed with standard log-linear techniques. The analysis bothered me, and eventually I identified the problem as a failure to treat the between-subject variability correctly. I also realized that the issue was more general than this particular instance, and that many treatments of categorical designs for researchers, my own included (Wickens, 1989), did not discuss how to analyze such data. In particular, they gave no practical recommendations for the everyday researcher. Thinking about the problem, and about what to recommend, led me to explore several approaches to this type of analysis (reported in Wickens, 1993). Here I describe, and to some degree expand upon, that work. My thoughts are influenced both by the statistical problems and by the need to make recommendations that are likely to be used.
2. ASSOCIATION VARIABILITY

To begin, remember that there are two common situations in which we are called on to test for association between two categorizations. In one, each subject produces a response that is classified in two dimensions, producing one entry in a contingency table. In the other situation, there are two or more types of subjects, each generating a response that is classified along a single dimension. Again, a two-way table results, although the underlying sampling models are different. Essentially, the difference between the two sampling situations is the same as that between the correlation model and the regression model. I will concentrate on the first situation here, but will mention the second situation again later.¹

¹There is a third possibility in which the design constrains both marginal distributions (Fisher's, 1973, lady tasting tea), but it is less often used and I will not discuss it here.

Consider the following example, loosely adapted from the design that called the problem to my attention. Suppose that you are interested in determining whether there is a link between the behavior of a caregiver and that of an infant. You want to know whether a certain caregiver behavior is followed by a particular target behavior of the infant. You observe the interaction of a caregiver-infant dyad for some fixed duration. You divide this period into a series of short intervals and select certain pairs of intervals, say one pair every 5 min, for examination. In the first of any pair of two consecutive intervals, you record whether the caregiver made the antecedent behavior and in the second whether the infant made the target behavior. Aggregating over the pairs, you obtain a 2 × 2 table that describes the contingency of the two behaviors:

                     Infant
                   Yes    No
Caregiver   Yes     42    17
            No      13    28
Each count in this table records the outcome from one pair of intervals. You can discover whether the behaviors are contingent by looking for an association between the row categorization and the column categorization. For a single dyad, these data are easy to analyze. You just test for an association in the 2 × 2 table. There are many ways to conduct this test, all of which have similar properties. I do not consider their differences here; an ordinary Pearson statistic suffices.

Of course, you would not normally run the study with only one dyad. You would obtain data from several caregiver-infant pairs and want to generalize over these pairs. Table 1 shows some data, simulated from the model that I describe in the next section. These data form a three-way contingency table, the dimensions of which are the antecedent behavior, the target behavior, and the subject (classification factors A, B, and S, respectively). The goal of an analysis is either to test the hypothesis that there is no association between the behavior classifications or to estimate the magnitude of this association.
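As a quick illustration of the per-dyad test mentioned above, the example table can be analyzed with an ordinary Pearson statistic; the following minimal sketch uses scipy, which is simply one convenient choice rather than anything prescribed by the chapter.

```python
import numpy as np
from scipy import stats

# Example dyad: rows = caregiver (Yes, No), columns = infant (Yes, No)
table = np.array([[42, 17],
                  [13, 28]])
chi2, p, df, expected = stats.chi2_contingency(table, correction=False)
print(round(chi2, 2), round(p, 4))   # Pearson X^2 on 1 df for this single dyad
```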
TABLE 1
Simulated Data for Eight Dyads

                    Responses                Log odds ratio
Pair       YY     YN     NY     NN          y_k       s_k^2
  1        42     17     13     28         1.634      0.190
  2        44     30     12     14         0.526      0.204
  3        42      9     34     15         0.698      0.222
  4        29     29     23     19        -0.187      0.162
  5        42     15     24     19         0.780      0.180
  6        24     24     27     25        -0.076      0.157
  7        38     24     15     19         0.868      0.174
  8        45     16     19     20         1.064      0.183

Note. The estimated log odds ratios y_k and their sampling variances s_k^2 are obtained from Equations (1) and (2).
Two sources of variability influence these observations, one occurring within a dyad, the other between dyads. The within-dyad source is embodied in the scatter of the responses within each 2 × 2 table; the between-dyad source is embodied in the differences among these tables. Some between-dyad scatter is expected because of within-dyad response variability, but there is more variability among the observations in Table 1 than can be attributed to such within-dyad influences.

Considered in terms of these sources of variability, there are three ways that one might test a hypothesis of independence or unrelatedness. The most general approach is to find a method that explicitly uses estimates of both the between-subject and the within-subject components of the variability. Presumably, the tests in this class should be the most accurate. However, such tests are relatively complicated and not well incorporated into the conventional statistical packages. Their complexity creates a problem in their application. Many researchers and some journal editors pay little attention to new statistical methods unless a very strong case can be made for them, preferring to fall back on more familiar procedures. Therefore it is also important to look at tests based on simpler variability structures.

The more familiar types of tests are based on estimates of variability of only one type, either between-subject or within-subject. If there is much between-subject variability, then the class of tests that estimate and use it should be the next best alternative, although they might be inferior to the tests that use both sources. Tests that use only the within-subject variation are likely to be the least accurate. Although these notions make sense, I was unclear about the size and nature of the biases and of the relative advantages of each class. The simulations I describe in this chapter address these similarities and differences. My goal is to make some practical recommendations.
3. THE SIMULATION PROCEDURE

To look at the characteristics of the different testing procedures, I simulated experiments with various amounts of between-subject and within-subject variability, applied a variety of test statistics to them, and looked at the biases and power of these tests. The first step in this program was to create a model for the data that included both components of variation.

The model I used has between-subject and within-subject portions. The within-subject portion is conventional. The kth subject (or dyad) gives a 2 × 2 table of frequencies x_{ijk}. In a design in which each subject gives a doubly classified response, these frequencies are a multinomial sample of m_k observations from a four-level probability distribution with probabilities \pi_{ijk}. The multinomial distribution is, of course, a standard model for categorical data analysis. The between-subject portion of the model provides the probabilities \pi_{ijk} and the number of observations m_k. The probabilities derive from a set of parameters that are tied to the processes being modeled.
A 2 × 2 table of probabilities \pi_{ij} can be generated by the two marginal distributions and the association. Specifically, the marginal distributions are expressed by a pair of marginal logits

\xi_1 = \log \frac{\pi_{11} + \pi_{12}}{\pi_{21} + \pi_{22}}  \quad\text{and}\quad  \xi_2 = \log \frac{\pi_{11} + \pi_{21}}{\pi_{12} + \pi_{22}},

and the association by the log odds ratio

\eta = \log \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}.

In the example, the parameters \xi_1 and \xi_2 measure how likely the caregiver and infant are to emit the antecedent and target behaviors, respectively. They are essentially nuisance parameters in that example, although in other research their relationship is the principal focus of the work. As in many studies of association, one wants to make an inference about \eta, either estimating it or testing the null hypothesis that it equals zero.

The between-subject portion of the model specifies the distribution of these quantities over the population of subjects. I represented them as sampled from normal distributions, with unknown means and variances:

\xi_1 \sim N(\mu_{\xi_1}, \sigma^2_{\xi_1}),  \quad  \xi_2 \sim N(\mu_{\xi_2}, \sigma^2_{\xi_2}),  \quad\text{and}\quad  \eta \sim N(\mu_\eta, \sigma^2_\eta).

Properly speaking, the triplet (\xi_1, \xi_2, \eta) should have a trivariate distribution to allow intercorrelations, but I will not follow up on this multivariate aspect. For one thing, the values of the correlations depend closely on the particular study being modeled and I wanted to be more general; for another, I suspect that their effects are secondary to those that I discuss here.

In a given experiment, n subjects are observed, each corresponding to an independently sampled triplet of distribution parameters. Over the experiment, there are n of these triplets (\tilde{\xi}_{1k}, \tilde{\xi}_{2k}, \tilde{\eta}_k), which I have shown by adding a subscript for the subject and placing a tilde above the symbol, implying that, from the point of view of the study as a whole, they are realizations of random variables and not true parameters. For any given experiment, the mean \bar{\tilde{\eta}} of the log odds ratios is not, except accidentally, equal to the mean population association; typically,

\bar{\tilde{\eta}} = \frac{1}{n} \sum_{k=1}^{n} \tilde{\eta}_k \neq \mu_\eta.

The three subject parameters \tilde{\xi}_{1k}, \tilde{\xi}_{2k}, and \tilde{\eta}_k determine a set of probabilities \tilde{\pi}_{ijk} for the four events.

The other portion of the model concerns the item sample sizes m_k. In most of my simulations m_k was fixed, as if it were set by the experimental procedure.
In the example, this situation would occur if the dyads were sampled at regular intervals for a fixed period of time. A few simulations let m_k vary. Suppose that instead of taking observations at fixed times, every instance of a particular type of caregiver behavior was classified, along with the subsequent behavior of the infant. The number of observations would then depend on the caregiver's behavior rate. I modeled this situation by assuming that instances of the behavior were generated by a Poisson process (like the multinomial, a standard sampling model) and that the rate parameter of this process varied over subjects according to a gamma distribution. This mixture model implies that m_k has a negative binomial distribution (e.g., Johnson, Kotz, & Kemp, 1992). Both the mean and the variance of this distribution must be chosen.

This model was the basis of the simulations. Each condition was based on values of the mean and variance for the distributions of \tilde{\xi}_{1k}, \tilde{\xi}_{2k}, and \tilde{\eta}_k, and on the value of n and the distribution of m_k. For each of the n subjects in the "experiment," I sampled an independent random triplet (\tilde{\xi}_{1k}, \tilde{\xi}_{2k}, \tilde{\eta}_k) according to the relevant normal distributions and converted it to cell probabilities \tilde{\pi}_{ijk}. If m_k was random, I generated its value from a negative binomial distribution, and for either the fixed or random case, I generated m_k multinomially distributed responses. I then analyzed these data using a variety of test statistics, some based only on the within-subject variability, some only on a between-subject estimate of variability, and some that separated the two sources. To get stable estimates of such characteristics as the probability of rejecting a null hypothesis, I replicated the "experiment" between 2,000 and 10,000 times, depending on the particular part of the study, and calculated the observed proportion of null hypotheses rejected at the 5% level.

Over the course of the study, I varied several characteristics of the experiments. Two of these characteristics concerned properties of the experiment: the fixed sample size n and the (possibly random) observation sample size m_k. Two other characteristics concerned the ancillary characteristics of the subject population: the degree to which the marginal distribution of responses was skewed toward one alternative and the amount of between-subject variability. Finally, I varied the size of the association effect, so that I could investigate both Type I error rates and power. I also looked at some studies with more than two response categories. I will not describe all of the simulations and their details here, nor will I mention every statistic that I tried. Some of these are reported in Wickens (1993), others I have not reported formally but they are essentially similar to the ones I describe here.
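As a concrete illustration of this data-generating model, here is a minimal simulation sketch. The chapter does not say how the triplet (\xi_1, \xi_2, \eta) was converted to cell probabilities; the sketch uses the standard closed-form solution for a 2 × 2 table with given margins and odds ratio (a Plackett-type formula), and the default parameter values and function names are my assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def cell_probs(xi1, xi2, eta):
    """2 x 2 cell probabilities (YY, YN, NY, NN) implied by marginal logits
    xi1, xi2 and log odds ratio eta."""
    r = 1 / (1 + np.exp(-xi1))          # P(caregiver behavior present)
    c = 1 / (1 + np.exp(-xi2))          # P(infant behavior present)
    psi = np.exp(eta)
    if abs(psi - 1) < 1e-10:
        p11 = r * c
    else:
        s = 1 + (r + c) * (psi - 1)
        p11 = (s - np.sqrt(s * s - 4 * psi * (psi - 1) * r * c)) / (2 * (psi - 1))
    return np.array([p11, r - p11, c - p11, 1 - r - c + p11])

def simulate_experiment(n=10, m=100, mu=(0.3, 0.3, 0.0), sd=(0.2, 0.2, 1.0)):
    """One simulated experiment: n dyads, m paired intervals per dyad.
    mu and sd give the normal distributions of (xi1, xi2, eta) across dyads."""
    tables = []
    for _ in range(n):
        xi1, xi2, eta = rng.normal(mu, sd)                             # between-dyad sampling
        tables.append(rng.multinomial(m, cell_probs(xi1, xi2, eta)))   # within-dyad sampling
    return np.array(tables)   # rows: dyads; columns: YY, YN, NY, NN counts
```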
4. TESTS BASED ON MULTINOMIAL VARIABILITY

The most familiar class of test statistics contains tests that are based on the within-subject multinomial variability alone.
The simplest procedure is to collapse the data over the subjects (obtaining the marginal distribution of Table 1) and test for association in the resulting two-way table. This approach is unsatisfactory for several reasons. First, it violates the rules about the independence of observations in "chi-square tests." Second, collapsing the three-way table into its margin can artificially alter the association if, as is likely, the relative frequencies of the two behaviors are related to the subject classification; this is the effect that in its more pernicious manifestation gives Simpson's paradox. This test also shows all of the same biases as the other tests in this class, so I will not linger on it, but will turn instead to more plausible candidates.

The second test procedure is a likelihood-ratio test based on a pair of hierarchical log-linear models fitted to the three-way A by B by S table. One model expresses the conditional independence of the responses given the subject; in the bracket notation that denotes the fitted marginals it is [AS][BS], and in log-linear form it is

\log \mu_{ijk}^{CI} = \lambda + \lambda_i^A + \lambda_j^B + \lambda_k^S + \lambda_{ik}^{AS} + \lambda_{jk}^{BS}.

This model allows for relationships between the subjects and the behavior rates, but excludes any association between the two behaviors. The second model is of no interaction in the association. It adds to the conditional-independence model a parameter that expresses a constant level of association; in bracket notation it is [AS][BS][AB], and in log-linear form it is

\log \mu_{ijk}^{NI} = \lambda + \lambda_i^A + \lambda_j^B + \lambda_k^S + \lambda_{ij}^{AB} + \lambda_{ik}^{AS} + \lambda_{jk}^{BS}.

These models are hierarchical, and the second perforce fits the data better. The need for an association term is tested with a likelihood-ratio test:

\Delta G^2_B = 2 \sum_{\text{cells}} x_{ijk} \log \frac{\hat{\mu}_{ijk}^{NI}}{\hat{\mu}_{ijk}^{CI}} = G^2_{[AS][BS]} - G^2_{[AB][AS][BS]},

where the \hat{\mu} are fitted frequencies. This statistic is referred to a chi-square distribution with 1 df. Because this test explicitly includes the subjects as a factor in the analysis, it appears to solve the problems associated with their pooling.

The likelihood-ratio test is appropriate only when the more general model adequately fits the data, so some authorities recommend testing the no-interaction model [AB][AS][BS] first and only proceeding to the difference test when it cannot be rejected. Retention of the no-interaction model implies that there are few differences in association among the subjects. The recommendation to use this test as a gatekeeper is not very useful, however. For one thing, the goodness-of-fit test is neither easy to run nor particularly powerful when the per-subject frequencies m_k are small. More seriously, it leaves the researcher stranded when homogeneity of association is rejected. In practice, I think the recommendation is ignored more often than followed.
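Both G^2 values, and hence \Delta G^2_B, can be obtained from Poisson log-linear fits. The following is a minimal sketch using statsmodels; the long-format construction, column names, and function name are illustrative assumptions, not part of the chapter.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

def delta_g2_test(tables):
    """tables: per-subject counts, each row ordered YY, YN, NY, NN."""
    rows = []
    for k, (yy, yn, ny, nn) in enumerate(tables):
        rows += [(k, "Y", "Y", yy), (k, "Y", "N", yn),
                 (k, "N", "Y", ny), (k, "N", "N", nn)]
    df = pd.DataFrame(rows, columns=["S", "A", "B", "count"])
    fam = sm.families.Poisson()
    ci = smf.glm("count ~ C(A)*C(S) + C(B)*C(S)", df, family=fam).fit()              # [AS][BS]
    ni = smf.glm("count ~ C(A)*C(S) + C(B)*C(S) + C(A):C(B)", df, family=fam).fit()  # [AB][AS][BS]
    dg2 = ci.deviance - ni.deviance    # Delta G^2_B on 1 df
    return dg2, stats.chi2.sf(dg2, df=1)
```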
The third test in this class approaches the problem through an estimate of the log odds ratio. Any of several estimates can be used here. I describe one based on Woolf's average of the log odds ratios (Woolf, 1955) in some detail because I will be using an extension of it here; another one, which I do not describe, uses the Mantel-Haenszel estimate (Mantel, 1963). For a single subject, the log odds ratio \eta_k is estimated by the ratio

y_k = \log \frac{(x_{11k} + \tfrac{1}{2})(x_{22k} + \tfrac{1}{2})}{(x_{12k} + \tfrac{1}{2})(x_{21k} + \tfrac{1}{2})}.   (1)

The \tfrac{1}{2} added to each frequency reduces bias and prevents unseemly behavior when an observed frequency is zero (Gart & Zweifel, 1967). With multinomial sampling, the sampling variance of y_k approximately equals the sum of the reciprocals of the cell frequencies, again with the \tfrac{1}{2} adjustment:

s_k^2 = \frac{1}{x_{11k} + \tfrac{1}{2}} + \frac{1}{x_{12k} + \tfrac{1}{2}} + \frac{1}{x_{21k} + \tfrac{1}{2}} + \frac{1}{x_{22k} + \tfrac{1}{2}}.   (2)

Table 1 gives these estimates for the individual subjects. Aggregation over subjects uses a weighted average with weights equal to the reciprocal of the multinomial sampling variability:

\bar{y}_w = \frac{\sum_k w_k y_k}{\sum_k w_k}  \quad\text{with}\quad  w_k = \frac{1}{s_k^2}.   (3)

The sampling variance of an average that uses these weights is the reciprocal of the sum of the weights:

\operatorname{var}(\bar{y}_w) = \frac{1}{\sum_k w_k}.

The hypothesis that the population value corresponding to \bar{y}_w equals zero is tested by the ratio of \bar{y}_w to its standard error, or equivalently by referring the ratio

X^2 = \frac{\bar{y}_w^2}{\operatorname{var}(\bar{y}_w)}   (4)

to a chi-square distribution with 1 df.
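In code, the Woolf-type test is only a few lines. The sketch below (function and variable names are mine) takes per-subject tables such as the eight rows of Table 1.

```python
import numpy as np
from scipy import stats

def woolf_test(tables):
    """Pooled log-odds-ratio test of Eqs. (1)-(4); tables: rows of counts YY, YN, NY, NN."""
    x = np.asarray(tables, float) + 0.5                     # the 1/2 adjustment
    y = np.log(x[:, 0] * x[:, 3] / (x[:, 1] * x[:, 2]))     # y_k, Eq. (1)
    s2 = (1.0 / x).sum(axis=1)                              # s_k^2, Eq. (2)
    w = 1 / s2                                              # weights, Eq. (3)
    y_w = np.sum(w * y) / np.sum(w)
    x2 = y_w**2 * np.sum(w)                                 # Eq. (4), since var = 1 / sum(w)
    return y_w, x2, stats.chi2.sf(x2, df=1)
```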
Test statistics in this class are unsatisfactory in the presence of between-subject variability. Figure 1 shows the proportion of null hypotheses rejected at the nominal 5% level as a function of the association standard deviation for experiments with n = 10 subjects, m = 100 observations per subject, and marginal distributions that were nearly symmetrical (\mu_\xi = 0.3 and \sigma_\xi = 0.2, giving approximately a 57:43 distribution in the marginal categories).
FIGURE 1. Type I error rates at the nominal 5% level for tests based on multinomial variability. Subject sample size n = 10, fixed response sample size m_k = 100, and near-symmetrical marginal distributions. Each point is based on 10,000 simulated experiments.
If the tests were unbiased, then these lines would lie along the 5% line at the bottom of the figure. Instead, the proportion of rejected hypotheses increases with the intersubject variability. The bias is substantial when the association varies widely among subjects.

One might now ask how the biases in Figure 1 depend on the sample size. In particular, conventional lore about sample sizes would suggest that large samples of subjects or items would reduce or eliminate the bias. Unfortunately, this supposition is wrong, as Figure 2 shows for the statistic \Delta G^2_B (the other statistics are similar). The bias of the test increases (not decreases) with the number of observations per subject and is essentially unaffected by the number of subjects. The problems with these tests are too fundamental to be fixed by larger samples.
FIGURE 2. Type I error rates at the nominal 5% level for the statistic \Delta G^2_B as a function of subject and item sample sizes. The association standard deviation is \sigma_\eta = 1.0 and each point is based on 10,000 simulations.
The difficulty with this class of tests, and the source of the biases in Figures 1 and 2, is that the wrong null hypothesis is being tested. There are three null hypotheses that could be tested. The strongest of these asserts that there is no relationship at all between the behavior categories. In terms of the variability model, this hypothesis implies that every \tilde{\eta}_k is zero. It is a compound hypothesis, as it implies that both the mean association \mu_\eta and the variability \sigma^2_\eta are zero:

H_{01}: \mu_\eta = 0  \quad\text{and}\quad  \sigma^2_\eta = 0.

This hypothesis is equivalent to a hypothesis of the conditional independence of the two responses given by the subjects and is tested by the fit of the log-linear model [AS][BS]. A much less restrictive null hypothesis is that there is no association between categories in the sample:

H_{02}: \bar{\tilde{\eta}} = 0.

This hypothesis focuses attention on the particular sample. It does not involve the population parameters. The third null hypothesis refers only to the mean association:

H_{03}: \mu_\eta = 0.

It does not preclude intersubject variability and is also less restrictive than H_{01}. When there is no intersubject variability, that is, when \sigma^2_\eta = 0, these three null hypotheses are equivalent. When \sigma^2_\eta > 0, they differ. The statistics that incorporate only the multinomial variability test hypothesis H_{02}. Although these statistics are satisfactory ways to test this hypothesis, when the subjects vary, as they usually do, a researcher probably wants to test the population hypothesis H_{03} instead of the sample hypothesis H_{02}. The researcher's intent is to generalize to a population parameter, not to restrict conclusions to the particular sample. A test of H_{03} must accommodate the subject variation.
5. TESTS BASED ON BETWEEN-SUBJECT VARIABILITY

There is a handy rule for treating between-subject effects that I have often found useful. If you wish to analyze anything that varies across subjects, you first find a way to measure it for each individual, ideally to reduce each subject's data to a single number. These numbers can then be investigated using a standard statistical procedure that accommodates their variability.² In this spirit, I consider statistics from this class that test whether the mean of the distribution of the log odds ratios differs from zero. All these statistics are based on the observed variability of the y_k.
The prototype for a test based on an empirical estimate of variability is the t test. To apply it here, the association between categories for each subject is estimated by y_k. The observed mean \bar{y}_U and variance s_U^2 are calculated, and the hypothesis that the population mean corresponding to these observations is equal to any particular value is tested with a single-sample t test; for H_{03},

t_U = \frac{\bar{y}_U}{s_U / \sqrt{n}}.   (5)

The ordinary t test might prove unsatisfactory for a variety of reasons. One potential problem is heterogeneity of variability. The sampling variability of the log odds ratio y_k depends both on the underlying distribution \tilde{\pi}_{ijk} (and through it on the subject-level parameters \tilde{\xi}_{1k}, \tilde{\xi}_{2k}, and \tilde{\eta}_k) and on the sample sizes m_k. When these values vary from one subject to the next, so does the sampling variance of y_k. One way to accommodate this heterogeneity is to use a weighted t test. Each observation is weighted by the inverse of its sampling variance, as in Equation (3), so that

\bar{y}_w = \frac{\sum_k w_k y_k}{\sum_k w_k}  \quad\text{and}\quad  s_w^2 = \frac{\sum_k w_k (y_k - \bar{y}_w)^2}{n - 1}.

The test statistic t_w is calculated as in the unweighted case (Eq. 5).

It is instructive to relate these t tests to the log-linear models. One can write either t statistic as an F ratio; for example, using the weighted statistic,

F_w = t_w^2 = \frac{SS_{\text{assn}}/1}{SS_{\text{hetero}}/(n - 1)},

where

SS_{\text{assn}} = \sum_k w_k \bar{y}_w^2  \quad\text{and}\quad  SS_{\text{hetero}} = \sum_k w_k (y_k - \bar{y}_w)^2.   (6)

The numerator and denominator of F_w are ratios of chi-square distributed sums of squares to their degrees of freedom, one expressing the variability of the mean about its hypothesized value, the other the variability of the scores about their mean. Comparable interpretations attach to two of the likelihood-ratio statistics.
²An example of this principle in another domain is a test of a hypothesis about the value of a contrast in a repeated-measures analysis of variance. To test the hypothesis that a contrast \psi differs from zero, one compares the sum of squares associated with the contrast to the sum of squares for the \psi \times S interaction. This error term measures the variation of the individual estimates \hat{\psi}_k. The F statistic that results is equivalent to a single-sample t test that checks whether the population mean of \psi is zero.
FIGURE 3. Type I error rates for tests based on between-subject variability. The conditions are the same as in Figure 1, but the ordinate is expanded.
The presence of association, comparable to SS_{\text{assn}}, is expressed by the difference statistic \Delta G^2_B, and the variability of association, comparable to SS_{\text{hetero}}, is expressed by the goodness of fit of the no-interaction model [AB][AS][BS]. Both quantities have chi-square distributions under the null hypothesis H_{03}. Although I do not know whether these statistics have the independence necessary to construct a proper F statistic, their ratio might provide a subject-based test of association:³

F_B = \frac{\Delta G^2_B / 1}{G^2_{[AB][AS][BS]} / (n - 1)} = \frac{(n - 1)\left(G^2_{[AS][BS]} - G^2_{[AB][AS][BS]}\right)}{G^2_{[AB][AS][BS]}}.

This test is handy when one has the report of a log-linear analysis but no access to the original data. I call this statistic a pseudo F.
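Computationally, the unweighted and weighted t tests need only the per-subject estimates; a minimal sketch (the names are mine) that could be applied to the rows of Table 1 follows. The pseudo F could be assembled in the same spirit from the two G^2 values of a log-linear fit such as the one sketched in the previous section.

```python
import numpy as np
from scipy import stats

def between_subject_t_tests(tables):
    """Unweighted and weighted t tests of H03; tables: rows of counts YY, YN, NY, NN."""
    x = np.asarray(tables, float) + 0.5
    y = np.log(x[:, 0] * x[:, 3] / (x[:, 1] * x[:, 2]))   # y_k, Eq. (1)
    s2 = (1.0 / x).sum(axis=1)                            # s_k^2, Eq. (2)
    n = len(y)

    t_u, p_u = stats.ttest_1samp(y, 0.0)                  # ordinary t test, Eq. (5)

    w = 1 / s2                                            # weighted t test
    y_w = np.sum(w * y) / np.sum(w)
    s2_w = np.sum(w * (y - y_w) ** 2) / (n - 1)
    t_w = y_w / np.sqrt(s2_w / np.sum(w))
    p_w = 2 * stats.t.sf(abs(t_w), df=n - 1)
    return (t_u, p_u), (t_w, p_w)
```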
As might be expected, the simulations show that these tests solve the Type I error problem. In Figure 3, the Type I error rates for all three tests lie approximately at the nominal 5% level. The ordinary t test, plotted as a solid line, does very well here. Moreover, these error rates are not substantially modulated by the parameters of the simulations.

A better picture of the differences among these statistics is obtained from their power functions. When the marginal distributions are approximately symmetrical and the sample sizes are moderate, there is very little difference among them. Figure 4 shows power functions when m = 100 and n = 20 and the marginal distributions are nearly symmetrical. The panels differ in the degree of between-subject variability. The three tests have almost identical power functions unless the association variability is very large. When the subjects are very heterogeneous, the weighted t test is least satisfactory, with a power function that is displaced away from zero.

³This test is certainly inappropriate for tables larger than 2 × 2. In such tables, \Delta G^2_B and G^2_{[AB][AS][BS]} may be influenced by different components of the effect, and thus may not be comparable.
FIGURE 4. Power of tests based on between-subject variability. Subject sample size n = 20, fixed response sample size m_k = 100, and nearly symmetrical marginal distributions. Curves are based on 5,000 simulated experiments and have been smoothed.
When the association is highly variable, there are some subjects for which one cell is empty or nearly so. These cells may be at the root of the weighted t problem. Evidence for this contention is obtained from the power functions in Figure 5, for which a highly asymmetrical set of marginal frequencies was used (\mu_\xi = 1.5, \sigma_\xi = 0.2; marginal categories distributed roughly as 82:18). These marginals also create cells in which \pi_{ijk} is tiny and small frequencies are likely. The unbalanced frequencies in this extreme case exaggerate the differences among the statistics. Again, the weighted t function is displaced along the abscissa. These effects probably result from the way the procedures treat cells with small observed frequencies, as they vanish when the item sample sizes m_k are large. In general, these curves demonstrate the robustness of both the unweighted t and the pseudo F statistics to violations of the large-sample assumptions. I think the important point here is that the standard t does as well as any of the other statistics.

It is possible that variation in the size of the multinomial samples m_k might make the weighted t test necessary. This possibility was my main motivation for running the simulations with a negative binomial distribution of m_k. I tried
FIGURE 5. Power of tests based on between-subject variability with \sigma_\eta = 0.5. Subject sample size n = 20, fixed response sample size m_k = 100, and asymmetrical marginal distributions. Curves are based on 5,000 simulated experiments and have been smoothed.
distributions with both small and large amounts of variability relative to the mean. However, these changes rarely affected the ordering of the statistics. Figure 6 shows the power functions for experiments in which the variance of the sample sizes was 5 times their mean \mu_m. Even here, the standard t is more satisfactory than the weighted t, whose power functions are still displaced.

I have two conclusions to draw about the class of tests based on between-subject variability. First, when the samples are large and there are no very improbable cells, all of the statistics are satisfactory and roughly equivalent.
FIGURE 6. Power for tests based on between-subject variability with a negative binomial distribution of response sample sizes. Subject sample size n = 25, association standard deviation \sigma_\eta = 1.0, and near-symmetrical marginal distributions. The variance of the distribution of m_k is 5 times the mean. Curves are based on 5,000 simulated experiments and have been smoothed.
Second, the tests are not equally robust to unlikely cells. The ordinary t test is better than some alternatives and no worse than any other statistic. Somewhat to my surprise, it appears to be the most accurate procedure.
6. PROCEDURES WITH TWO TYPES OF VARIABILITY

Of the procedures examined so far, the t test appears to work best. This test is based on an estimate of the between-subject variability only. The question remains whether better results could be obtained from tests that separately estimate the between-subject and the within-subject components of variability.

There is a substantial literature on this problem, dispersed over several domains. The fact that real categorical data may be overdispersed relative to the multinomial has been recognized for many years (e.g., Cochran, 1943). Some of this work concerns essentially the same problem that I am discussing here, that is, the accumulation of separate instances (responses) from a sampled population of larger units (subjects or dyads). Another line of work derives from the theory of survey sampling, in which the heterogeneity of geographical areas or respondent subpopulations introduces the extra variation. A third body of work involves the meta-analysis of categorical experiments, particularly in the medical domain, where experiments that differ in their conditions and size of effect are combined. In these studies, individual subjects play the role that I have assigned to responses, and the experiments correspond to my subjects.

Most (although not all) of these methods begin with a sampling model similar to the one I have described. Many are derived from the Generalized Linear Model framework (McCullagh & Nelder, 1989). Unfortunately, almost all of these models suffer from some degree of mathematical intractability. Except for a procedure that merely adjusts the sample sizes downward to accommodate the variability (Rao & Scott, 1992), it is difficult to estimate the excess variance component. Either one must apply a computationally intensive procedure such as the EM algorithm or numerical minimization, or use estimators that may be suboptimal. Most of these procedures require some special-purpose programming.

I have not looked at all of the candidate tests here, as many of them were too complex and time consuming to be used in the type of simulation work I have described. I did, however, look at two procedures, one from the meta-analysis literature (DerSimonian & Laird, 1986) and the other a random-effect model (Miller & Landis, 1991). Both procedures obtain their estimates of the between-subject parameters by looking at the excess variability over the multinomial variability. This strategy is at the core of most of these tests; for example, it is used in the Generalized Linear Model approach. As an example of these approaches, I will describe the DerSimonian-Laird procedure, which illustrates what is going on more transparently than the other statistics.
This test is almost like the test of Woolf's statistic (Eq. 4), but it includes an adjustment for the between-subject variability in the weights. One begins by writing the variance of the log odds ratio y_k as the sum of two parts, a between-subject part \sigma^2_\eta and a multinomial within-subject part s_k^2:

\sigma^2_{y_k} = \sigma^2_\eta + s_k^2.

The within-subject component s_k^2 is the multinomial sampling variance, as estimated by Equation (2). The between-subject component is estimated from the observed variability. First, calculate the weighted mean association \bar{y}_w (Eq. 3) and the weighted sum of squares of deviations about it (Eq. 6) using weights based on the multinomial variability:

SS_{\text{hetero}} = \sum_k w_k (y_k - \bar{y}_w)^2  \quad\text{with}\quad  w_k = 1/s_k^2.

When the association is homogeneous over the subjects, SS_{\text{hetero}} is distributed as a \chi^2 with n - 1 df and has an expected value of n - 1; but when there is between-subject variability, its expected value exceeds n - 1. The size of the between-subject component is estimated from this excess. After some algebra, a method-of-moments estimator of \sigma^2_\eta is obtained:

\hat{\sigma}^2_\eta = \frac{\max[0,\ SS_{\text{hetero}} - (n - 1)]}{\sum_k w_k - \sum_k w_k^2 / \sum_k w_k}.   (7)

The difference between the observed SS_{\text{hetero}} and its expectation in the absence of between-subject differences is truncated at zero to avoid negative variance estimates. The balance of the test parallels the test of the Woolf statistic. First, new weights are created that include both sources of variability:

w_k^* = \frac{1}{\hat{\sigma}^2_{y_k}} = \frac{1}{\hat{\sigma}^2_\eta + s_k^2}.   (8)

Next, a pooled estimate of the association and its variance is found using these weights:

\bar{y}_w^* = \frac{\sum_k w_k^* y_k}{\sum_k w_k^*}  \quad\text{with}\quad  \operatorname{var}(\bar{y}_w^*) = \frac{1}{\sum_k w_k^*}.

Finally, the ratio of \bar{y}_w^* to its variance gives a test statistic for the hypothesis of no association:

X^{*2} = \frac{(\bar{y}_w^*)^2}{\operatorname{var}(\bar{y}_w^*)}.

Similar to the test for the Woolf statistic, this ratio is referred to a chi-square distribution with one degree of freedom. The adjustment of the weights makes the test more conservative. When between-subject variance is present, it increases \operatorname{var}(\bar{y}_w^*) relative to the uncorrected \operatorname{var}(\bar{y}_w) and reduces the test statistic accordingly.
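A minimal sketch of the DerSimonian-Laird calculation follows (the function and variable names are mine), taking the y_k and s_k^2 of Equations (1) and (2) as inputs.

```python
import numpy as np
from scipy import stats

def dersimonian_laird_test(y, s2):
    """y: per-subject log odds ratios; s2: their multinomial variances, Eq. (2)."""
    y, s2 = np.asarray(y, float), np.asarray(s2, float)
    n = len(y)
    w = 1 / s2
    y_w = np.sum(w * y) / np.sum(w)
    ss_hetero = np.sum(w * (y - y_w) ** 2)
    # Method-of-moments estimate of the between-subject variance, Eq. (7)
    tau2 = max(0.0, (ss_hetero - (n - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    # Re-weight using both variance components, Eq. (8)
    w_star = 1 / (tau2 + s2)
    y_star = np.sum(w_star * y) / np.sum(w_star)
    x2 = y_star**2 * np.sum(w_star)          # ratio of the pooled estimate to its variance
    return tau2, y_star, x2, stats.chi2.sf(x2, df=1)
```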
The simulations based on these statistics make three points. First, the Type I error rates are imperfect. There is some negative bias when the association variability is small and some positive bias when it is large, although these biases are small compared with those of the multinomial-based tests and improve with increased m_k. These biases show both in the Type I error functions (Fig. 7, plotted on the same scale as Fig. 3 and including the t test for comparison) and in the power functions (Fig. 8). I suspect that these biases are the result of the truncation at zero used in the method-of-moments estimators. Second, the power functions are displaced when an improbable cell is created by a high marginal bias (Fig. 8, right). The simpler adjusted Woolf statistic is worse off than the more complex Miller-Landis test, although both are affected in the most extreme case. At their best, these tests are comparable in power to the between-subject tests such as the t test. Apparently, as testing procedures, these statistics, which are based on two sources of variability, do not improve on the t test, which does not separate between-subject and within-subject components. Although the new tests provide separate estimates of the between-subject variance components, they do not give more powerful tests.
7. DISCUSSION

These results are a little disconcerting at first.
FIGURE 7. Type I error rates for tests using both between-subject and within-subject variability. Subject sample size n = 10, fixed response sample size m_k = 100, and near-symmetrical marginal distributions. Each point is based on 10,000 simulated experiments.
FIGURE 8. Power of tests based on both between-subject and within-subject variability. Subject sample size n = 10 and fixed response sample size m_k = 100. Curves are based on 5,000 simulated experiments and have been smoothed.
Why do the procedures that accommodate two sources of variability work no better than those that use only a between-subject source? The answer lies in how "between-subject" variability is defined and estimated in the two types of tests. A test that only takes account of a "between-subject" component measures it directly from the observed variability. This estimate encompasses both differences among subjects and any sampling fluctuation intrinsic to the subject. The tests that involve both components of the variability attempt to separate these two sources. In these tests, the between-subject portion refers only to the variability that exceeds an estimate of the within-subject part. Generally, the two variance components must be combined to test for between-group effects. These steps are clear in the DerSimonian-Laird model: the total variability \sigma^2_{y_k} is partitioned into \sigma^2_\eta and s_k^2, and this partition is the basis for the estimate of \sigma^2_\eta in Equation (7). The two terms are recombined when the weights are calculated in Equation (8). The other procedures operate in similar, if sometimes less obvious, ways. In retrospect, it is not surprising that they look like the between-subject tests.

Much the same kind of thing happens in other domains in which both between-subject and within-subject variability can be measured. Consider the analysis of variance. In a mixed design, say one with a groups of n subjects and m observations per subject, one can estimate distinct components of variability \sigma^2_S and \sigma^2_R, the first deriving from between-subject sources and the second from within-subject (or repetition) sources. The test of a between-group effect uses an error term that combines both sources; the expected mean square for the error term is

E(MS_{S/A}) = m\sigma^2_S + \sigma^2_R.

Although this error term contains two theoretical parts, it is estimated by a mean square that uses only differences among the subjects, not those among the repeated scores. To calculate it, you need only the data aggregated within a subject:

MS_{S/A} = \frac{m \sum_{i,k} (\bar{y}_{i\cdot k} - \bar{y}_{i\cdot\cdot})^2}{a(n - 1)},

where y_{ijk} is the jth score of the kth subject in group i and dots indicate averages. The model contains both between-subject and within-subject terms, but the test itself is calculated from between-subject information only.

It helps here to remember the difference between the sampling models for frequencies and those for continuous scores. Both the binomial and the Poisson sampling models for an observation depend on a single parameter, so that their means and variances are functionally related. Although this relationship lets one construct tests in situations in which no independent error estimate is available, it also prevents the model from representing arbitrary variability. The addition of an independent variance parameter, however formulated, puts them in a category analogous to that of the models more typically used for normally distributed scores. As I keep being reminded, the t test and the analysis of variance are robust members of this class.

The principle I have applied here to test the association against between-subject variability can be used with other hypotheses. Consider two examples. Many experiments with categorical responses use independent groups of subjects. When one observation is obtained from each subject, a two-way table results, say with the subject classification specified by the row and response by the column. From a contingency table perspective, the hypothesis of homogeneous response distributions over the subject groups is tested in this table by the same test that would be used to test the independence of a double classification. When repeated observations are present, things are a little different. Although their specific forms are somewhat different, three null hypotheses can be written. The hypothesis H_{01} asserts that the response probabilities are homogeneous, both between and within the populations from which the groups were drawn; the hypothesis H_{02} asserts the absence of differences in mean response probabilities for the particular samples that were tested, but allows for individual differences; and the hypothesis H_{03} asserts that the mean response probabilities are identical across the populations. Typically, a researcher is only interested in H_{03}. The tests described in the earlier sections of this paper cannot be applied directly, as the lack of matched subjects prevents association coefficients such
as y_k from being calculated. The natural approach is to calculate a performance measure for each subject and compare the groups on this measure, using something like a two-sample t test or analysis of variance. This is the approach that would be taken by most researchers. To be concordant with the log-linear structure of most frequency analyses, one might choose to make the comparison using logits. For data in which x_{ijk} refers to the number of responses in category j made by the kth subject in group i, calculate the estimated logits

y_{1k} = \log \frac{x_{11k} + \tfrac{1}{2}}{x_{12k} + \tfrac{1}{2}}  \quad\text{and}\quad  y_{2k} = \log \frac{x_{21k} + \tfrac{1}{2}}{x_{22k} + \tfrac{1}{2}},

and then compare the two sets of numbers.
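A minimal sketch of this group comparison follows; the two-sample t test is used as one reasonable choice, and the function and argument names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def group_logit_test(counts_g1, counts_g2):
    """counts_gi: arrays of shape (n_i, 2); row k holds subject k's counts
    in the two response categories for group i."""
    def logits(x):
        x = np.asarray(x, float) + 0.5      # the 1/2 adjustment
        return np.log(x[:, 0] / x[:, 1])    # one logit per subject
    return stats.ttest_ind(logits(counts_g1), logits(counts_g2))
```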
I have not done elaborate simulations of this situation, but I believe the same principles that apply to the association tests also apply here. In some other work in which I compared parametric and contingency table approaches to categorical data, I was impressed by the stability and power of the t test, even when the number of observations from each subject is very small.

The second example returns to a two-classification situation exemplified by my dyad example. Suppose that you wish to compare the marginal frequencies in the two tables; in the previous example, to look at whether the behavior rates for the caregivers and the infants differ. The null hypothesis here, expressed in logarithmic form, is that the two marginal logits \xi_1 and \xi_2 are identical. With one observation per subject, the test is one of marginal homogeneity, which in a 2 × 2 table is McNemar's test of correlated proportions. When repeated observations from each subject are available, the marginal logits are estimated by
z_{1k} = \log \frac{x_{11k} + x_{12k} + \tfrac{1}{2}}{x_{21k} + x_{22k} + \tfrac{1}{2}}  \quad\text{and}\quad  z_{2k} = \log \frac{x_{11k} + x_{21k} + \tfrac{1}{2}}{x_{12k} + x_{22k} + \tfrac{1}{2}}.
These values are compared with a dependent-scores t test or some comparable test. Equivalently, the difference between z_{1k} and z_{2k} is a single quantity that expresses the difference in marginal rates for that subject:
d_k = \log \frac{(x_{11k} + x_{12k} + \tfrac{1}{2})(x_{12k} + x_{22k} + \tfrac{1}{2})}{(x_{21k} + x_{22k} + \tfrac{1}{2})(x_{11k} + x_{21k} + \tfrac{1}{2})}.

This measure is tested against zero using a procedure based on estimates of the between-subject variability.
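In code, this marginal comparison is again a per-subject measure followed by a between-subject test; a minimal sketch (names are mine) is:

```python
import numpy as np
from scipy import stats

def marginal_rate_difference_test(tables):
    """tables: (n, 4) per-subject counts ordered YY, YN, NY, NN (caregiver, infant)."""
    x = np.asarray(tables, float)
    z1 = np.log((x[:, 0] + x[:, 1] + 0.5) / (x[:, 2] + x[:, 3] + 0.5))  # caregiver margin, z_1k
    z2 = np.log((x[:, 0] + x[:, 2] + 0.5) / (x[:, 1] + x[:, 3] + 0.5))  # infant margin, z_2k
    d = z1 - z2                                                          # d_k for each subject
    return stats.ttest_1samp(d, 0.0)    # between-subject test of equal marginal rates
```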
I do not want to give the impression that I think that t tests and analyses of variance solve every problem involving between-subject variation. The tests that separate the components of variability give information that is unavailable when the variability is pooled. They may also be more appropriate in situations in which the data are sparse and in which the statistics I have used here, such as y_k, are variable, impossible to calculate, or depend on such semiarbitrary things as the \tfrac{1}{2} correction. Potentially, one of the techniques that I have not investigated, such as the computationally less convenient likelihood- or quasi-likelihood-based procedures, may avoid some of the bias that appeared in my simulations. However, I do not really expect these methods to improve the power of the tests. In one form or another, they base their between-subject variability estimate on the observed between-subject variability, just as do the simpler tests.

As I mentioned at the start, I believe that it is important to be able to provide recommendations to others that are feasible, and if not optimal, at least near to it. Contingency data can be particularly confusing. I think that by clarifying which null hypothesis should be tested and identifying a satisfactory way to test it, a good practical solution to the problem of between-subject variability in these designs is achieved.
ACKNOWLEDGMENTS

This research was supported by a University of California, Los Angeles, University Research Grant.
REFERENCES

Cochran, W. G. (1943). Analysis of variance for percentages based on unequal numbers. Journal of the American Statistical Association, 38, 287-301.
Delucchi, K. L. (1983). The use and misuse of chi-square: Lewis and Burke revisited. Psychological Bulletin, 94, 166-176.
DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7, 177-188.
Fisher, R. A. (1973). Statistical methods for research workers (14th ed.). New York: Hafner.
Gart, J. J., & Zweifel, J. R. (1967). On the bias of various estimators of the logit and its variance with application to quantile bioassay. Biometrika, 54, 181-187.
Johnson, N. L., Kotz, S., & Kemp, A. W. (1992). Univariate discrete distributions (2nd ed.). New York: Wiley.
Lewis, D., & Burke, C. J. (1949). The use and misuse of chi-square. Psychological Bulletin, 46, 433-489.
Mantel, N. (1963). Chi-square tests with one degree of freedom: Extensions of the Mantel-Haenszel procedure. Journal of the American Statistical Association, 58, 690-700.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). London: Chapman & Hall.
Miller, M. E., & Landis, J. R. (1991). Generalized variance component models for clustered categorical response variables. Biometrics, 47, 33-44.
Rao, J. N. K., & Scott, A. J. (1992). A simple method for the analysis of clustered binary data. Biometrics, 48, 577-585.
Wickens, T. D. (1989). Multiway contingency tables analysis for the social sciences. Hillsdale, NJ: Erlbaum.
Wickens, T. D. (1993). The analysis of contingency tables with between-subjects variability. Psychological Bulletin, 113, 191-204.
Woolf, B. (1955). On estimating the relationship between blood group and disease. Annals of Human Genetics, 19, 251-253.
Yule, G. U. (1911). An introduction to the theory of statistics. London: Griffin.
Assessing Reliability of Categorical Measurements Using Latent Class Models
Clifford C. Clogg¹
Pennsylvania State University University Park, Pennsylvania
Wendy D. Manning
Department of Sociology Bowling Green State University Bowling Green, Ohio
1. INTRODUCTION

To assess the reliability of continuous or quantitative measurements of one or more underlying variables, standard methods based on "average" correlations or model-based assessments using factor analysis are available (for a compact survey, see Carmines & Zeller, 1979). With the exception of Bartholomew and Schuessler (1991), who developed reliability indices for use with latent trait models, there is little explicit attention to reliability and its measurement for categorical measures.

¹Clifford C. Clogg passed away shortly after the completion of this manuscript. I am honored to have had the opportunity to collaborate with such a wonderful colleague. W.D.M.
Reliability is most often conceived of in terms of errors in measurement, so any measurement-error model is automatically a model for assessing reliability. Latent class models provide a natural framework for assessing reliability of categorical measures (cf. Schwartz, 1986). In this chapter, we explore various definitions of reliability as they might be operationalized with standard latent class models. Clogg (1995) surveys the latent class model, including various areas of application, and provides an exhaustive list of references. For the most part, the technical underpinning of this chapter can be found in Clogg (1995) or in references cited there, and for this reason we do not include an exhaustive list of references here. The fundamental paper on modern latent class analysis is Goodman (1974). Except for slight changes in notation consistent with recommendations given in Clogg (1995), the model formulation put forth in Goodman's fundamental work is the one used here.

Reliability of categorical measurements is examined in this chapter in the context of an empirical study of the measurement properties of various indicators of social support. Social support refers to the provision of or the receipt of aid from family members, close friends, parents, grandparents, and so on. There are many possible types of social support. A rigid empiricist view would regard each type of support as a social type, not as an indicator of something more fundamental. Such a view, however, prohibits the assessment of reliability unless an independent data source with "true" or at least more exact measurements is available which would allow comparison of true measures with the other measures whose reliability is in question. The view adopted here is that (a) each type of support represents an indicator of a broad pattern or structure of social support; (b) each specific indicator is an instance of a more abstract social condition or support structure and gives clues about that structure; and (c) the items as a set measure the structure, or the items measure an underlying variable that defines the structure, given assumptions. On the basis of this orientation, latent structure methods can be used to characterize the pattern of support that can be inferred from either unrestricted (and nonparametric) views of the structure or restricted (and more "parametric") views of the structure.

The data source is the popular National Survey of Families and Households (NSFH; Sweet, Bumpass, & Call, 1988). The NSFH is a nationally representative sample of approximately 13,000 individuals in the United States. The indicators of social support contained in these data have been analyzed in several publications (Hogan, Eggebeen, & Clogg, 1993; Manning & Hogan, 1993). In this chapter, several of the indicators used in these sources are examined to determine reliability of the measures and related aspects of the data. Textbook accounts refer to reliability, in the abstract, as average correlations in some sample that generalize in an obvious way to some population, without regard for the context or the group to which these correlations refer. We examine reliability in several
senses, including reliability inferred from group comparisons that are relevant for inferring measurement properties of the indicators and for the social structures or patterns associated with these measures in particular contexts or in particular groups.
2. THE LATENT CLASS MODEL: A NONPARAMETRIC METHOD OF ASSESSING RELIABILITY

We first give the general form of the latent class model to show that the parameters in this model can be used to infer reliability and other aspects of the measurements. The exposition is nontechnical. We focus mostly on what is assumed and what is not assumed in using this model to make inferences about reliability. We also give measures of reliability suited for different questions that arise in the assessment of reliability outside the world of correlations.

Suppose that we have three dichotomous indicators of some underlying characteristic, variable, or trait. Let these be denoted as A, B, and C; and let the levels of those variables be denoted by subscripts i, j, and k, respectively. Level 1 might refer to receiving and level 2 to not receiving support. The items might refer to the receipt of aid or support of various kinds, such as childcare, advice, or material benefits. The probability of cell (i, j, k) in the observable cross-classification of A, B, and C is denoted \pi_{ABC}(ijk). Now suppose that each item measures an underlying variable X, with possibly different levels of reliability. How reliability ought to be measured or indexed is the main objective of this chapter, but the most significant feature of the latent class model is that X is modeled nonparametrically so that reliability can be assessed without bringing in extra assumptions. That is, we do not assume that X is continuous or even quantitative; rather, we assume only that X has a specific number of categories or "latent classes." Let T denote the number of latent classes. If X could be observed, the "data" could be arranged in a four-way contingency table cross-classifying A, B, C, and X. Let t denote the level of the latent variable X, and let \pi_{ABCX}(ijkt) denote the probability of cell (i, j, k, t) in this indirectly observed contingency table. The latent class model first assumes that

\pi_{ABC}(ijk) = \sum_{t=1}^{T} \pi_{ABCX}(ijkt).   (1)
In words, what we observe is obtained by collapsing over the levels of the hidden variable. The number T of latent classes can be viewed as a parameter. This integer number reflects the number of latent classes or groups in the population, up to the limits imposed by identifiability. As a practical matter, we can always find an integer value of T for which the relationship in Equation (1) must hold.
That is, by letting the value of T denote a parameter, the preceding assumption is not really an assumption at all, but rather an algebraic property of decomposing observed or observable contingency tables into a certain number of indirectly observed contingency tables, where this number is determined by the number of variables, the "size" of the contingency table, and other features of the data structure. The expression in Equation (1) is thus not a restrictive aspect of the model as such. Of course, by specifying a particular value for T (T = 2, T = 3, etc.), the model imposes an assumption, and this assumption can be tested. But nothing has been assumed about the latent variable X. If the true underlying variable is discrete with T classes, then the T-class model is exactly suited for this situation. If the true underlying variable is metric (or continuous or even normally distributed), then X in this model represents a nonparametric or discretized version of this latent variable. The distribution of the modeled latent variable can be represented by the so-called latent class proportions, \pi_X(t) = Pr(X = t), with \sum_{t=1}^{T} \pi_X(t) = 1 and \pi_X(t) > 0 for each t. Note that the case in which \pi_X(t^*) = 0
for some t* need not be considered; if the t*th class is void (has zero probability), then the model has T - 1 latent classes, not T latent classes. The T-class latent structure refers to T nonvoid latent classes. For item A, let \pi_{A|X=t}(i) = Pr(A = i | X = t) denote the conditional probability that item A takes on level i given that latent variable X takes on level t, for i = 1, 2 and t = 1, ..., T. Let \pi_{B|X=t}(j) and \pi_{C|X=t}(k) denote similarly defined conditional probabilities for the other items. The latent class model is now completed by the expression, or the model,

\pi_{ABCX}(ijkt) = \pi_X(t)\,\pi_{A|X=t}(i)\,\pi_{B|X=t}(j)\,\pi_{C|X=t}(k),   (2)
which is posited for all cells (i, j, k, t). This part of the model expresses a nontrivial assumption that the observed items are conditionally independent given the latent variable X. This is the so-called assumption of local independence that defines virtually all latent structure models (Lazarsfeld & Henry, 1968). Much has been written about this assumption. There are many models that relax this assumption to some extent, but for ordinary assessments of reliability, we think that this assumption serves as a principle of measurement. We take this assumption as a given; virtually all model-based assessments of reliability begin with this or a similar assumption.

Note that reliability of measurement can be defined as functions of the conditional probability parameters. For example, suppose that level i = 1 of item A corresponds to level t = 1 of X and that level i = 2 of item A corresponds to level t = 2 of X, as with observed receiving and "latent" receiving of aid. Then item level-specific reliability is defined completely, and nonparametrically, in terms
of the conditional probabilities for item A. That is, level i = 1 of item A perfectly measures level t = 1 of X if π_{A|X=1}(1) = 1; values less than unity reflect some degree of unreliability. Level i = 2 of item A perfectly measures level t = 2 of X if π_{A|X=2}(2) = 1. Item-specific reliability can be defined completely, and nonparametrically, in terms of the set of conditional probabilities for the given item. We would say, for example, that item A is a perfectly reliable indicator of X if π_{A|X=1}(1) = π_{A|X=2}(2) = 1, so that item reliability in this case is a composite of the two relevant parameters for item-level reliability. (Perfect item reliability is the same as perfect item-level reliability at each level of the item.) In the preceding form, reliability can be viewed as the degree to which X predicts the level of a given observed item. We can also ask how well X is predicted by a given item, which gives us another concept of reliability. (Note that for correlation-based approaches, prediction of X from an item and prediction of an item by X are equivalent and hence are not distinguished from each other. This is because the correlation between two variables is symmetric.) Using the definition of conditional probability, we can define reliability in terms of prediction of X as follows. Instead of the conditional probabilities in Equation (2), use the reverse conditionals,

π_{X|A=i}(t) = π_X(t) π_{A|X=t}(i) / π_A(i),    (3)
where π_A(i) = Pr(A = i) is the marginal probability that A takes on level i. (Note that this expression is equivalent to Pr(X = t) Pr(A = i | X = t) / Pr(A = i) = Pr(A = i, X = t) / Pr(A = i).) Item level-specific and item-specific reliability can be defined in these terms. Finally, a margin-free measure of item-level reliability can be defined in terms of the (partial or conditional) odds ratio between X and A,

O_AX = π_{A|X=1}(1) π_{A|X=2}(2) / [π_{A|X=1}(2) π_{A|X=2}(1)].    (4)
Generalizations of this function can be applied when the variables involved are polytomous rather than dichotomous. It is easily verified that the same result is obtained when the reverse conditionals are used. A correlation-type measure based on these values is

Q_AX = (O_AX - 1) / (O_AX + 1),    (5)
which is just Yule's Q transform of the odds ratio as given in elementary texts. Note that all quantities can also be defined for items B and C, or for however many items are included in the model. We can generalize the concept of reliability-as-predictability further. Rather than focus on item level-specific reliability or item-level reliability, we can instead focus on item-set reliability. For a set S of the original items, say S = {A, B}, the following probability can be calculated: π_{X|S}(t) = Pr(X = t | S = s), for example, Pr(X = 1 | A = i, B = j). For items in a given set S, the item-set
reliability can be operationalized in terms of these quantities. For the special case in which S corresponds to a single item, item-set reliability is already defined. In cases in which S is the complete set of items, the reliability of the entire set of items as measures of X can be measured in terms of

π_{X|ABC}(t) = π_{ABCX}(ijkt) / π_{ABC}(ijk),    (6)
where the right-hand quantities already appear in the two fundamental equations of latent class analysis, Equations (1) and (2). Convenient summary indices of reliability, or predictability, can be defined in terms of these equations or models. To summarize, the nonparametric latent class model assumes (a) that the observed contingency table is obtained by collapsing over the T levels of the unobserved variable and that (b) the observed association among items is explained by the latent variable. That is, the assumption of local independence is used to define the latent variable or the latent structure. Given this model, various definitions of reliability of measurements are possible, including item level-specific reliability conceived as the prediction of observed items from the latent variable or as the prediction of the latent variable from the observed items, and various definitions of item-specific reliability, even including a correlation-type measure based on Yule's Q transform. Finally, reliability of sets of items or even of the entire set of items can be defined easily in terms of probabilistic prediction. These various measures of reliability are all direct functions of the parameters in the latent class model; indeed, in some accounts, these are viewed as the "parameters" of the model. We next illustrate these in a concrete case.
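To make these definitions concrete, the following sketch (ours, not the authors'; it assumes NumPy and uses illustrative parameter values rather than NSFH estimates) evaluates Equations (2) through (6) for a two-class model with three dichotomous items.

```python
# A minimal sketch of the quantities defined above for T = 2 latent classes and
# three dichotomous items A, B, C.  All parameter values are illustrative placeholders.
import itertools
import numpy as np

pi_X = np.array([0.5, 0.5])                        # latent class proportions, Eq. (1)
# conditional probabilities pi_{item|X=t}(level), indexed [t, level]
pi_item = {
    "A": np.array([[0.8, 0.2], [0.3, 0.7]]),
    "B": np.array([[0.8, 0.2], [0.3, 0.7]]),
    "C": np.array([[0.8, 0.2], [0.2, 0.8]]),
}

def joint(i, j, k, t):
    """pi_ABCX(ijkt), Eq. (2): local independence given X."""
    return pi_X[t] * pi_item["A"][t, i] * pi_item["B"][t, j] * pi_item["C"][t, k]

def posterior(i, j, k):
    """pi_{X|ABC}(t), Eq. (6): reliability of the full item set."""
    cell = np.array([joint(i, j, k, t) for t in range(2)])
    return cell / cell.sum()

# reliability as prediction of X from item A, Eq. (3)
pi_A = np.array([sum(pi_X[t] * pi_item["A"][t, i] for t in range(2)) for i in range(2)])
pi_X_given_A = np.array([[pi_X[t] * pi_item["A"][t, i] / pi_A[i] for t in range(2)]
                         for i in range(2)])

# margin-free item-level reliability: odds ratio, Eq. (4), and Yule's Q, Eq. (5)
pA = pi_item["A"]
odds_ratio = (pA[0, 0] * pA[1, 1]) / (pA[0, 1] * pA[1, 0])
yules_q = (odds_ratio - 1) / (odds_ratio + 1)

for cell in itertools.product(range(2), repeat=3):
    print(cell, posterior(*cell).round(3))
print("odds ratio:", round(odds_ratio, 2), "Yule's Q:", round(yules_q, 2))
```

With fitted parameter values in place of these placeholders, the same few lines yield all of the reliability measures discussed in this section.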
3. RELIABILITY OF DICHOTOMOUS MEASUREMENTS IN A PROTOTYPICAL CASE

The simplest possible case to consider is a 2 × 2 × 2 table that cross-classifies three dichotomous indicators. In this case, the two-class latent structure is exactly identified (Goodman, 1974). Table 1 gives a three-way cross-classification of three items from the NSFH. The items pertain to receipt of support from parents. The items are A (receive instrumental help), B (receive advice), and C (receive childcare), with level 1 = did not receive aid and level 2 = received aid. The time referent is "in the past month." The group is white, non-Hispanic mothers less than 30 years of age, with n = 682. Thus, the items cross-classified in this table have specific group, time, and source (i.e., parental) referents, and these have to be kept in mind when assessing reliability. We also provide the expected frequencies under the model of mutual independence, the log-linear hypothesis that fits the univariate marginals {(A), (B), (C)}. This model is equivalent to the latent class model with T = 1 "latent"
TABLE 1
Cross-Classification of Three Indicators of Social Support

Cell (C, B, A)   Observed frequency   Expected frequency   Pr(X = 1)   Pr(X = 2)   X
(1, 1, 1)              193                 104.3               .97         .03      1
(1, 1, 2)               56                  91.1               .70         .30      1
(1, 2, 1)               49                  76.7               .76         .26      1
(1, 2, 2)               41                  67.0               .19         .81      2
(2, 1, 1)               65                 105.5               .62         .38      1
(2, 1, 2)               79                  92.1               .11         .89      2
(2, 2, 1)               57                  77.6               .14         .86      2
(2, 2, 2)              142                  67.8               .01         .99      2

Note. See text for definition of items. The expected frequencies are for the model of mutual independence, the "one-class" latent structure. The expected frequencies for the two-class model are the same as the observed frequencies. The columns labeled Pr(X = t) correspond to the conditional probabilities in Equation (6) (sample estimates) for the two-class model. X is the predicted latent distribution.
classes. The variables are strongly associated with each other; the Pearson chi-square statistic is X² = 213.28, and the likelihood-ratio chi-square statistic is L² = 186.72 on 4 df. We see that the independence model underpredicts the extremes in the table, that is, when mothers receive all types of aid or do not receive any type of aid. Note the "skew" in the observed joint distribution that might be important to take into account. In this case, a model using only two-way interactions (or pairwise correlations) would be sufficient to explain the association. The log-linear hypothesis that fits all two-way marginal tables gives X² = .97, L² = .97 with 1 df.² This standard model does not, however, lend itself easily to an analysis of reliability simply because it does not incorporate a well-specified relation of observed items to the unobserved variable that the items were designed to measure. Two other columns in this table give estimated prediction probabilities suited for the assessment of the reliability of the entire set of items; by using these, we obtain the predicted latent distribution for the two-class model given in the last column of the table. The two-class latent structure is "saturated" with respect to any 2 × 2 × 2 contingency table. This means that the latent class model, with T = 2 latent classes, is consistent with any such distribution, or with any model or assumed distribution for the data.

² Adding just one variable to the set analyzed produces a data structure that cannot be described in terms of two-way marginals, so the sufficiency of pairwise correlations is contradicted by the data in this case once other items are included. We have chosen just three of the available items for expository purposes.
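The fit statistics just quoted can be checked directly from the observed frequencies in Table 1. The following sketch (not part of the chapter; it assumes NumPy) computes the expected frequencies under mutual independence and the corresponding Pearson and likelihood-ratio statistics.

```python
# A small sketch verifying the one-class (mutual independence) fit for Table 1.
import numpy as np

# observed frequencies for cells (C, B, A), in the order listed in Table 1
obs = np.array([193, 56, 49, 41, 65, 79, 57, 142], dtype=float).reshape(2, 2, 2)
n = obs.sum()

# one-way marginal proportions of C, B, and A
pC = obs.sum(axis=(1, 2)) / n
pB = obs.sum(axis=(0, 2)) / n
pA = obs.sum(axis=(0, 1)) / n

# expected frequencies under the independence hypothesis {(A), (B), (C)}
exp = n * pC[:, None, None] * pB[None, :, None] * pA[None, None, :]

X2 = ((obs - exp) ** 2 / exp).sum()        # Pearson chi-square
G2 = 2 * (obs * np.log(obs / exp)).sum()   # likelihood-ratio chi-square
print(round(X2, 2), round(G2, 2))          # about 213.3 and 186.7 on 4 df
```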
If the "true" variable measured by the items is in fact continuous, with known or unknown distribution, the two-class model still provides a summary of the measurement properties of the items. That is, there is no additional statistical information in the data that cannot be obtained from an assumption that the true variable is dichotomous rather than continuous. Because the two-class model is saturated, the chi-square values are identically zero and the model has zero degrees of freedom. We examine reliability of the measurements A, B, and C using the operational definitions provided in the previous section. What is the total reliability of the measures? This question is answered by considering the conditional probabilities in Equation (6), which can be called the posterior distribution of X given the items. (This is item-set reliability in which all items are used to define the set.) The maximum likelihood estimates of these probabilities appear in the columns headed Pr(X = t) in Table 1. For members of cell (1, 1, 1), with observed frequency f111 = 193, the model predicts that 97% are in latent class 1 and 3% are in latent class 2. That is, 193 × .97 = 187.2 is the estimated expected number of cases in this cell that belong to the first latent class, and 193 - 187.2 = 5.8 is the expected number in the second latent class. By using these quantities, we can next define the cell-specific reliability, for example, as the maximum of the two probabilities, or perhaps the odds of the two classes. Continuing in this fashion leads to an expected number correctly predicted equal to 597.3, or 87.6% of the total sample. In other words, the items as a set can be said to be almost 90% reliable. An alternative measure of this reliability is the lambda measure of association between X and the observed items, λ = .74. (See Clogg, 1995, for the definition of lambda in the latent class setting.) This measure is consistent with the assignment rule that created the predicted distribution in Table 1. The parameter values for the model as expressed in Equation (2) appear in Table 2. We see that level 1 of each item (not receiving aid) is associated with level 1 of X and that level 2 of each item (receiving aid) is associated with level 2 of X, so that level 1 of X can be characterized as the latent receiving class and
TABLE 2
Estimated Parameter Values for the Two-Class Model Applied to the Data in Table 1, Including Measures of Reliability

Item        π(item | X = 1)   π(item | X = 2)   O(item·X)   Q(item·X)   π(X = 1 | item)   π(X = 2 | item)
A (i = 1)        .83               .26             13.6        .86            .75               .25
A (i = 2)        .17               .74                                        .28               .72
B (j = 1)        .83               .33              9.9        .82            .70               .30
B (j = 2)        .17               .67                                        .41               .59
C (k = 1)        .82               .19             19.6        .90            .80               .20
C (k = 2)        .18               .81                                        .20               .80
X                .48               .52
level 2 of X can be characterized as the latent not-receiving class. The estimated latent distribution is π_X(1) = .485, π_X(2) = .515. The item level-specific measures of reliability, defined in terms of predictability of an item from X, are the estimated conditional probabilities reported in this table. That is, for item A, X = 1 predicts A = 1 with 83% reliability, and X = 2 predicts A = 2 with 74% reliability. Level 1 of X is an equally good predictor of level 1 of A and B (reliability is 83%), and level 2 of X best predicts level 2 of C (reliability is 81%). The item level-specific measures of reliability, defined in terms of prediction of X from the items, require the use of Equation (3). These measures are given in the last two columns of Table 2. Taking into account the marginal distributions of the items and the latent variable (see Eq. 3), we see that item C is the best predictor of X for each level, with item level-specific reliabilities of 80% for each level. Finally, the most reliable indicator for variable reliability is also item C, with an odds ratio of 19.6, corresponding to a Yule's Q of .90. The temptation to produce an overall index of reliability from these measures is strong, and therefore we next propose overall indices that should prove to be meaningful. To measure the average item level-specific reliability, with reliability defined in terms of prediction of an item from X, we merely take the relevant average of the conditional probabilities π_{item|X}. For level 1 of the items, this average is .83, and for level 2 it is .74. This means that the X variable is a better predictor of level 1 of the items on average. To measure the average item level-specific reliability, with reliability defined in terms of prediction of X from an item, we merely take the relevant average of the conditional probabilities π_{X|item}. For level 1 of X, this average is .75, and for level 2 of X it is .70. Thus, with this concept of reliability, level 1 of X is more reliably predicted from the items than level 2. Averages of the odds ratios (or their logarithms) as well as averages of the Q transforms of these can serve as average item-level reliabilities; the average Q is .86, for example. Averages of the relevant prediction probabilities can be used to define average cell-specific reliability. For the prediction of level X = 1, for example, the average of the π_{X|ABC}(1) for cells where X = 1 is predicted is .76. For the prediction of level X = 2, the average of these posterior probabilities for cells where X = 2 is predicted is .89. Thus, cell-specific reliability as predictability of X is higher for the prediction of X = 2 than for the prediction of X = 1. The overall predictability indices given earlier (lambda and the percentage correctly predicted) summarize reliability of the prediction of the variable X. As this analysis demonstrates, reliability can be viewed in many ways, as item level-specific, variable-specific, cell-specific, or item set-specific aspects of the same problem. Whether these various facets of reliability are viewed as satisfactory or not depends on the purpose of the analysis, the sample chosen, the specific indicators used, and so on. By carefully examining what the latent class model says about measurement, we see that a rich portrait of reliability
can be obtained. We hasten to add that this prototypical example, with the simplest relevant case in which latent class models can be used, indicates how the definitions of reliability put forward here can be used in general settings with many items, including polytomous items. We conclude this section by examining whether item or item-level reliabilities differ across items. Restricted latent class models can be used for this purpose. Consider the model to which the following condition is applied:

π_{A|X=1}(1) = π_{B|X=1}(1) = π_{C|X=1}(1).    (7)
This condition says that the item-level reliability (with X viewed as a predictor of the item) is constant across items for level 1 of the items. The model with this constraint produces L² = X² = .06 on 2 df, so this condition is consistent with the data. The common value of this reliability is, under the model, .83. Next, consider the model with the following condition applied:

π_{A|X=2}(1) = π_{B|X=2}(1) = π_{C|X=2}(1).    (8)
This condition says that item-level reliability (with X viewed as a predictor of the item) is constant across items for level 2 of the items. The model with this constraint applied gives L² = 9.79, X² = 9.87 on 2 df. Thus this condition is not consistent with the data, so this type of item-level reliability is not constant across items. The model with both of the preceding constraints applied gives L² = X² = 14.61 on 4 df. Given these results, the main source of lack of fit of this model is nonconstant reliability in the prediction of the second (not-receiving) level of the items. Various hypotheses about variability in reliabilities can be considered using constraints of this general kind (see Clogg, 1995, and references cited there).
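Maximum likelihood estimates for models of this kind are usually obtained with special-purpose latent class software (see Clogg, 1995). As a rough illustration of what such software does, the sketch below (ours, not the authors'; it assumes NumPy) fits the unrestricted two-class model to the Table 1 counts with a bare-bones EM algorithm; a constraint such as Equation (7) would be imposed by pooling the corresponding conditional-probability updates across items.

```python
# A minimal EM sketch for the unrestricted two-class latent class model applied
# to a 2 x 2 x 2 table of counts.  Starting values are arbitrary.
import numpy as np

# observed counts for cells (C, B, A) of Table 1, levels coded 0/1 for "1"/"2"
cells = [((0, 0, 0), 193), ((0, 0, 1), 56), ((0, 1, 0), 49), ((0, 1, 1), 41),
         ((1, 0, 0), 65), ((1, 0, 1), 79), ((1, 1, 0), 57), ((1, 1, 1), 142)]
n = sum(cnt for _, cnt in cells)

pi = np.array([0.6, 0.4])              # latent class proportions (starting values)
theta = np.array([[0.7, 0.3],          # theta[m, t] = P(item m at level "1" | class t)
                  [0.7, 0.3],
                  [0.7, 0.3]])

def cell_prob(levels, t):
    """pi_X(t) times the product of conditional probabilities, as in Eq. (2)."""
    p = pi[t]
    for m, lev in enumerate(levels):
        p *= theta[m, t] if lev == 0 else 1.0 - theta[m, t]
    return p

for _ in range(2000):                  # EM iterations; no convergence test in this sketch
    # E step: posterior class probabilities, Eq. (6), for every cell
    post = {}
    for lv, _ in cells:
        p = np.array([cell_prob(lv, t) for t in (0, 1)])
        post[lv] = p / p.sum()
    # M step: update pi and theta from expected class-specific counts
    n_t = np.zeros(2)
    lev1 = np.zeros((3, 2))
    for lv, cnt in cells:
        n_t += cnt * post[lv]
        for m in range(3):
            if lv[m] == 0:
                lev1[m] += cnt * post[lv]
    pi = n_t / n
    theta = lev1 / n_t

fitted = {lv: n * (cell_prob(lv, 0) + cell_prob(lv, 1)) for lv, _ in cells}
G2 = 2 * sum(cnt * np.log(cnt / fitted[lv]) for lv, cnt in cells)
print(np.round(pi, 3), round(G2, 4))   # G2 should be essentially zero: the model is saturated here
```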
4. ASSESSMENT OF RELIABILITY BY GROUP OR BY TIME

Important checks on reliability of measurement can be made by examining group differences in parameters of the latent class model. The groups might represent observations (on different individuals) at two or more points in time, which permits one kind of temporal assessment of reliability common in educational testing, or the groups might represent gender, age, or other factors, as is common in cross-sectional surveys. For the example used in the previous section, it is natural to consider two relevant social groups for which subsequent analyses using the measures are key predictors. The sample was divided into married and not married, because the provision of social support is expected to vary by this grouping. These two groups are perhaps most relevant for the consideration of differentials in support, and it is therefore natural to ask whether the items measure equally in the
two groups. The three-way table for each group corresponding to Table 1 appears in Table 3. We use this example to illustrate how multiple-group latent class models can be used to extend the study of reliability using virtually the same concepts of reliability as in the previous example. The predicted latent distribution under this model (two-class model for married and unmarried women) also appears in the table. Now, compare Table 1 and Table 3. Under the fitted model, we see that cell-specific predictions of X differ between the two groups for cell (2, 1, 1), with unmarried women in this cell predicted to be in the second latent class and married women in this cell predicted to be in the first latent class. We can estimate the two-class model separately for each group, producing analyses that are exactly analogous to the single-group case in the previous section. After some exploratory fitting of models, we were led to select the model for which all conditional probabilities were constrained to be homogeneous across groups. The model with these constraints fits the data remarkably well, with L² = 6.71, X² = 6.70 on 6 df. The estimated latent distribution (π_X(t) for each group) was .56, .44 for married mothers, and .39, .61 for unmarried mothers. In other words, the two latent distributions are quite different, with married women much more likely to receive aid from parents than unmarried women (56% vs. 39%). The model with homogeneous (across-group) reliabilities for all levels and all items is consistent with the data, however, so that the only statistically relevant group difference is in the latent distributions. Such a finding is painfully difficult to obtain in many cases, but in this case we can say that the indicators measure similarly, and with equal reliability, in both of these relevant groups. The conditional probabilities in Table 4 are nearly the same as those reported earlier (Table 2) for the combined groups, and as a result, virtually the same
TABLE 3
Cross-Classification of Three Indicators of Support: Married Versus Unmarried Women

                    Married women         Unmarried women
Cell (C, B, A)     Frequency     X       Frequency     X
(1, 1, 1)              88        1           105       1
(1, 1, 2)              16        1            40       1
(1, 2, 1)              13        1            36       1
(1, 2, 2)              12        2            29       2
(2, 1, 1)              21        1            44       2
(2, 1, 2)              22        2            57       2
(2, 2, 1)              16        2            41       2
(2, 2, 2)              42        2           100       2
Note. The estimated percentage correctly allocated into the predicted latent distribution over both groups is 86.3%; lambda = .77.
180
c. c. Clogg and W. D. Manning
TABLE 4
Estimated Parameter Values for a Multiple-Group Model Applied to the Data in Table 3

Item        π(item | X = 1)   π(item | X = 2)   O(item·X)   Q(item·X)
A (i = 1)        .85               .28             14.4        .87
A (i = 2)        .15               .72
B (j = 1)        .85               .35             10.5        .83
B (j = 2)        .15               .65
C (k = 1)        .84               .21             19.4        .90
C (k = 2)        .16               .78
Note. The quantities apply to both groups, that is, to both married and unmarried women, because the model constrained these parameter values to be homogeneous across the groups.
conclusions are reached about reliability values, regardless of definition. (For the analysis of item-level reliability viewed as predictability of X, the inferences are somewhat different because of the different latent distributions in the two groups and because of the different item marginals in the two groups.) We conclude that these items measure with equal reliability in the two groups, apart from some sampling fluctuation that is consistent with the model used. To save space, other reliability indices will not be reported. The analyses throughout these examples were restricted to reliability assessment at one point in time and in one point in the life course, early motherhood. The recently released second wave of NSFH data will permit analyses of social support structures over time. For just two points in time, we can illustrate an approach that might be used as follows. Suppose that the initial measurements are denoted (A1, B1, C1) and that the second-wave measurements are denoted (A2, B2, C2). A natural model to consider in this case would posit a latent variable X1 for the first-wave measurements and a latent variable X2 for the second-wave measurements. The concepts, measures, and statistical methods that can be used for this case are virtually the same as those presented in this chapter. The ideal situation would be one in which the X1-(A1, B1, C1) reliabilities were high and equal to the X2-(A2, B2, C2) reliabilities. Standard methods summarized in Clogg (1995) can be used to operationalize such a model, and the reliability indices described in this chapter could be defined easily for this case. If such a measurement model were consistent with the data, then the main question would be how X1 differed from X2. For example, the change at the latent level could be attributed to developmental change. If more than two waves of measurement are available, then more dynamic models ought to be used (van de Pol & Langeheine, 1990). But even for the broader class of latent Markov models covered in van de Pol and Langeheine, the measures of reliability presented here can be used to advantage.
5. CONCLUSION

Model-based assessment of reliability is an important aspect of social measurement. For categorical measures, it would be inappropriate to assess reliability with correlation-based measures for the same reason that linear models are not generally appropriate for categorical data (Agresti, 1990). The nonparametric approach provided by the ordinary latent class model seems well suited for the measurement of reliability of categorical measures, and the parameters in this model provide the necessary components for several alternative measures or indices of reliability. For polytomous items, the same general procedures can be used, and for T-class models (with T > 2), level-specific reliability has to be defined carefully. To our knowledge, existing computer programs for latent class analysis (see Clogg, 1995) do not automatically provide indices of reliability of the sort provided here, with the exception of the overall summaries of prediction of X. Reliability is always a relative concept, and its measurement should always take account of a standard for comparison suited for the particular measures under consideration. For the example analyzed previously, most researchers would say that the indices of reliability provide evidence that high or very high reliability has been obtained. For the group comparisons involving married and unmarried women, these inferences were strengthened because the items were equally reliable (in terms of most indices) for both groups. It will be important to examine other group comparisons that are relevant for the problem at hand (e.g., by racial group or over time) to bolster confidence in the measurements. It might also be important to examine whether similar measures have the same measurement properties when other social contexts are considered (such as the giving of aid, or receipt of aid from grandparents). These are all modeling questions, however, and not simply exercises in producing reliability indices. We hope that the definitions of reliability put forward here and illustrated with simple examples will help to bridge the gap between reliability assessment of continuous variables and reliability assessment of categorical variables.
ACKNOWLEDGMENTS

This work was supported in part by Grant SBR-9310101 from the National Science Foundation (C. C. C.) and in part by the Population Research Institute, Pennsylvania State University, with core support from National Institute of Child Health and Human Development Grant 1-HD28263-01 and NIA Grant T32A000-208. The authors are indebted to Alexander von Eye for helpful comments.
REFERENCES

Agresti, A. (1990). Categorical data analysis. New York: Wiley.
Bartholomew, D. J., & Schuessler, K. F. (1991). Reliability of attitude scores based on a latent trait model. In P. V. Marsden (Ed.), Sociological methodology 1991 (Vol. 21, pp. 97-124). New York: Basil Blackwell.
Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. Beverly Hills, CA: Sage.
Clogg, C. C. (1995). Latent class models. In G. Arminger, C. C. Clogg, & M. E. Sobel (Eds.), Handbook of statistical modeling for the social and behavioral sciences (pp. 311-359). New York: Plenum.
Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215-231.
Hogan, D. P., Eggebeen, D. J., & Clogg, C. C. (1993). The structure of intergenerational exchanges in American families. American Journal of Sociology, 99, 1428-1458.
Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston: Houghton Mifflin.
Manning, W. D., & Hogan, D. P. (1993). Patterns of family formation and support networks. Paper presented at the annual meetings of the Gerontological Society of America, New Orleans, LA.
Schwartz, J. E. (1986). A general reliability model for categorical data, applied to Guttman scales and current status data. In N. B. Tuma (Ed.), Sociological methodology 1986 (pp. 79-119). San Francisco: Jossey-Bass.
Sweet, J., Bumpass, L., & Call, V. (1988). The design and content of the National Survey of Families and Households (NSFH Working Paper No. 1). Madison: University of Wisconsin.
van de Pol, F. J. R., & Langeheine, R. (1990). Mixed Markov latent class models. In C. C. Clogg (Ed.), Sociological methodology 1990 (pp. 213-247). Oxford: Basil Blackwell.
Partitioning Chi-Square: Something Old, Something New, Something Borrowed, but Nothing BLUE (Just ML)
David Rindskopf
City University of New York New York, New York
1. INTRODUCTION

Reading the typical statistics textbook gives the impression that the analysis of a two-way contingency table generally consists of (a) calculating a statistic that has a chi-square distribution if the rows and columns are independent, and (b) either rejecting or not rejecting the null hypothesis of independence. If the null hypothesis is rejected, the researcher is left to conclude that "there's something going on in my table, but I don't know what." On the other hand, the overall test of independence might not be rejected, even though some real effect is
being masked by many small effects. Furthermore, a researcher might have a theory that predicts the existence of specific relationships, much like preplanned comparisons in the analysis of variance. Although most researchers are acquainted with the idea of contrasts in the analysis of variance, they are unaware that similar techniques can be used with categorical data. The simplest such technique, called partitioning chi-square, can be used to test a wide variety of preplanned and post hoc hypotheses about the structure of categorical data. The overall chi-square statistic is divided into components, each of which tests a specific hypothesis. Partitioning chi-square has been used at least since 1930 (Fisher, 1930) and appeared in quite a few books on categorical data analysis (e.g., Everitt, 1977; Maxwell, 1961; Reynolds, 1977) until recently. The failure of the technique to attain wide usage is probably the result of three factors: (a) partitioning the Pearson fit statistic, calculated automatically by most computer programs, involves complicated formulas that have no intuitive basis; (b) in many expositions of partitioning, the technique was applied mechanically, without considering comparisons that would be natural in the context of a particular piece of research; and (c) the development of log-linear models for dealing with complex contingency tables has drawn attention away from the simpler two-way table. (A pleasant exception to these generalizations is Wickens, 1989). The modern use of partitioning chi-square uses the likelihood-ratio statistic, which partitions exactly without the use of special formulas. It is simple to understand, easy to do and to interpret, and encourages the testing of focused hypotheses of substantive interest. (Those acquainted with log-linear models will recognize that the partitioning of chi-square discussed in this chapter is different from the partitioning to compare nested log-linear models.) This chapter first presents partitioning in the context of examples; these include partitioning in two-way tables, partitioning in multiway tables, and partitioning of the equiprobability model in assessing change. Then, general schemes for partitioning are described, and finally, potential weaknesses are discussed.
2. PARTITIONING INDEPENDENCE MODELS

2.1. Partitioning Two-Way Tables: Breast Cancer Data

A simple example using a two-way table will show how partitioning chi-square can allow researchers to address the questions they consider important, rather than being limited to the usual global hypothesis tests. Consider the cross-tabulation shown in Table 1, adapted from Goleman (1985), which shows how well breast cancer patients with various psychological attitudes survive 10 years after treatment.
TABLE 1
Ten-Year Survival of Breast Cancer in Patients with Various Psychological Attitudes

                 Response
Attitude      Alive    Dead
Denial          5        5
Fighting        7        3
Stoic           8       24
Helpless        1        4

LR = 7.95; P = 8.01

Note. Adapted from Goleman (1985). LR is the likelihood-ratio goodness-of-fit statistic; P is the Pearson goodness-of-fit statistic. Each test has 3 df.
Almost every researcher would know to do a test of independence for the data in this table; the familiar Pearson chi-square statistic is 8.01 with 3 df, and the likelihood-ratio (LR) statistic is 7.95. Those not familiar with the likelihood-ratio test can think of it as a measure similar to Pearson's for deciding whether or not the model of independence can be rejected. The likelihood-ratio statistic, like the more familiar Pearson statistic, compares the observed frequencies in a table with the corresponding frequencies that would be expected if a hypothesized model were true. In this case, the hypothesized model is one of independence of row and column classifications. The formula for the likelihood-ratio statistic is

LR = 2 Σ_i O_i ln(O_i / E_i),

where a single subscript i is used to indicate individual cells in a table, and Σ_i indicates summation over all of the cells in the table. As an aid to intuition, imagine what would happen if a model fit perfectly so that O_i = E_i for each cell. Then, the ratio O_i / E_i = 1 for each cell; and because the logarithm of one is zero, it follows that LR = 0. For more information on the likelihood-ratio statistic, see Reynolds (1977), Everitt (1977), or Wickens (1989). For the data in Table 1, p < .05 for both the Pearson and LR statistic, so there is a relationship between the two variables: attitude is related to survival. From the traditional point of view, that is that; there is nothing else to say. The issue of where the relationship lies is generally not considered. Suppose, however, that in this example the researchers had a theory that active responses to cancer, such as fighting and denial, are beneficial, compared with passive responses such as stoicism and helplessness. Furthermore, suppose that they were not sure whether patients with different active modes of response
David Rindskopf
differ in survival rate, or whether patients with different passive modes have different survival rates. The theory immediately suggests that instead of a single overall test of independence, three tests should be done. The first should test whether fighters and deniers differ; the second, whether stoics and helpless patients differ; and the third, whether the active responders differ from the passive responders. Each of these tests is displayed in Table 2, along with both Pearson and LR chi-square tests. Each of the three tests has 1 df, and the results are as hypothesized: fighters and deniers do not differ in survival rate, nor do stoics and helpless differ, but those with active modes of responding survive better than those with passive modes. Note that the likelihood-ratio statistics for the three tests of the specific hypotheses sum to the value of the test of the overall hypothesis of independence. That is, the overall chi-square has been partitioned into three components, each of which tests a specific hypothesis about comparisons among the groups. (The Pearson test statistics, although not partitioning exactly, are still valid tests of the same hypotheses.) Because the test of the overall hypothesis of independence was barely rejected at the .05 level, one could ask what would happen had it not been rejected? Using the usual method of analysis, this would have been the end, but partitioning chi-square according to preplanned comparisons makes the overall test irrelevant. Just as in analysis of variance, the overall test might not be sig-
TABLE2 Partition of Chi-Square for Attitude and Cancer Survival Data Response Attitude
Denial (D) Fighting (F) Stoic (S) Helpless (H) D+F S+H
Alive
Dead
5 5 7 3 LR = .84; P = .83 8 24 1 4 LR = .06; P = .06 12 8 8 28 LR = 7.05; P = 7.10
Note: LR is the likelihood-ratio goodness-of-fit statistic, and P is the Pearson goodness-of-fit statistic, for the 2 X 2 table that precedes them. Each test has 1 df.
Testing specific rather than general hypotheses increases statistical power, which is especially advantageous with small sample sizes.
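The partition in Table 2 is easy to reproduce by computing the likelihood-ratio statistic for each subtable. The following sketch (not from the chapter; it assumes NumPy) does so and confirms that the three 1-df components sum to the overall 3-df statistic.

```python
# A small sketch reproducing the partition in Table 2.
# g2() computes the likelihood-ratio statistic for independence in a two-way table.
import numpy as np

def g2(table):
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return 2 * (table * np.log(table / expected)).sum()

data = {"Denial": [5, 5], "Fighting": [7, 3], "Stoic": [8, 24], "Helpless": [1, 4]}

full = np.array(list(data.values()))
active = full[:2]                      # Denial, Fighting
passive = full[2:]                     # Stoic, Helpless
collapsed = np.vstack([active.sum(axis=0), passive.sum(axis=0)])

parts = [g2(active), g2(passive), g2(collapsed)]
print([round(p, 2) for p in parts], round(g2(full), 2))
# -> [0.84, 0.06, 7.05] and 7.95: the three 1-df components sum to the overall 3-df statistic
```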
2.2. Partitioning Multiway Tables: Aspirin and Stroke Data

The Canadian Cooperative Study Group (1978) studied the effectiveness of aspirin and sulfinpyrazone in prevention of stroke. The four-way cross-tabulation of sex, aspirin (yes/no), sulfinpyrazone (yes/no), and stroke (yes/no) is presented in Table 3. Because stroke is considered an outcome variable, the table is presented as if it were an 8 × 2 table, with the rows representing all combinations of sex, aspirin, and sulfinpyrazone. A test of independence in this table produces an LR chi-square of 15.77 with 7 df (p = .027), so there is a relationship between group and stroke. To pinpoint the nature of the relationship of sex, aspirin, and sulfinpyrazone to stroke, we partition the overall chi-square by successively testing each of the three effects. First, within Sex × Aspirin categories, the relationship between sulfinpyrazone and stroke is tested. As shown in part (a) of Table 4, there is no relationship between sulfinpyrazone and stroke for any combination of sex and aspirin, so sulfinpyrazone was ineffective in reducing stroke in this trial. Next, we collapse over sulfinpyrazone and test whether aspirin is related to stroke within each level of sex. The second section (b) of Table 4 shows that aspirin is related to stroke for males, but not for females; inspection of the frequencies shows that aspirin reduces the incidence of stroke in males.
TABLE 3
Effectiveness of Aspirin and Sulfinpyrazone in Preventing Stroke: Observed Frequencies by Sex

                                     Stroke
Sex       Aspirin    Sulfa       Yes      No     P(Stroke)
Male      No         No           22      69       .24
          No         Yes          34      81       .30
          Yes        No           17      81       .17
          Yes        Yes          12      90       .12
Female    No         No            8      40       .17
          No         Yes           4      37       .10
          Yes        No            9      37       .20
          Yes        Yes           8      36       .18
Note: Adapted from Canadian Cooperative Study Group (1978).
TABLE 4
Partition of Chi-Square for Aspirin, Sulfinpyrazone, and Stroke Data

(a) Test of independence of sulfa and stroke, within each of the four Sex × Aspirin groups

Group                    LR     df      p
Male, no aspirin         .75     1     .39
Male, aspirin           1.26     1     .26
Female, no aspirin       .92     1     .34
Female, aspirin          .03     1     .87
Sum                     2.96     4

(b) Test of independence of aspirin and stroke, separately for each sex

Sex                      LR     df      p
Male                   10.01     1     .002
Female                   .97     1     .33
Sum                    10.98     2

(c) Test of independence of sex and stroke

                         LR     df      p
                        1.82     1     .18
Note. LR is the likelihood-ratio goodness-of-fit statistic.
Last, we collapse over levels of aspirin, and as shown in the third section (c) of Table 4, we find that sex is independent of stroke. Another partitioning of chi-square for these data gives additional insight. First, split the original table into two subtables, the first subtable consisting of the first two rows (males who did not take aspirin), and the second subtable consisting of the last six rows. For the first subtable, the model of independence is not rejected (LR = .75, df = 1); similarly, in the second subtable, independence is not rejected (LR = 3.39, df = 5). Finally, we test the table consisting of the row totals of each of the two subtables to see whether the incidence of stroke is different for males who do not take aspirin than for the remaining groups (all females, and males who took aspirin). Independence is rejected (LR = 11.63, df = 1); examining the proportions who had a stroke, we see that males without aspirin had a higher rate of stroke than females or males who took aspirin. The sum of the LR statistics for these three tests of independence is 15.77, and the sum of the degrees of freedom is 7, which corresponds to the test of independence in the 8 × 2 table.
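The additivity just described can be verified directly. The sketch below (ours, not from the chapter; it assumes NumPy and reuses a small helper for the likelihood-ratio statistic) carries out this second, "splitting" partition of the aspirin data.

```python
# A sketch of the splitting partition: males without aspirin versus everyone else.
import numpy as np

def g2(table):
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return 2 * (table * np.log(table / expected)).sum()

# rows ordered as in Table 3: (sex, aspirin, sulfa) x (stroke yes, stroke no)
rows = np.array([[22, 69], [34, 81], [17, 81], [12, 90],
                 [8, 40], [4, 37], [9, 37], [8, 36]])

sub1 = rows[:2]                                   # males who did not take aspirin
sub2 = rows[2:]                                   # the remaining six groups
totals = np.vstack([sub1.sum(axis=0), sub2.sum(axis=0)])

print(round(g2(sub1), 2), round(g2(sub2), 2), round(g2(totals), 2), round(g2(rows), 2))
# the three components (about .75, 3.39, and 11.63) sum to the overall 15.77 on 7 df
```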
This example shows two ways in which partitioning chi-square can be used in multiway contingency tables: (a) a nested partition that follows the explicit design of the study, and (b) special comparisons to establish homogeneity of sets of subgroups of people. Without partitioning, one typical approach would be to do the overall test of independence in the 8 × 2 table, as was shown previously for the partitioning. But this approach merely shows that there is a relationship between group and stroke. Another approach would be to use log-linear or logit models on the four-way table. This method would show that sulfinpyrazone was ineffective and that there was an interaction of sex and aspirin effects on stroke (in the usual analysis of variance terms). However, the nature of the interaction would not be illuminated by the usual log-linear or logit model approach; with partitioning chi-square, we can describe the nature of the interaction.
2.3. Illuminating Interactions in Hierarchical Log-Linear Models: U.C. Berkeley Admissions Data

The data in Table 5 are from the University of California at Berkeley, which was investigating possible sex bias in admissions to graduate school. Data from six major areas were presented and discussed in Freedman, Pisani, and Purves (1978), and so the general results are well known. For the purposes of this chapter, they illustrate the shortcomings of log-linear models (as they are usually applied) and the advantage of partitioning chi-square.
TABLE 5
Graduate Admissions Data from University of California, Berkeley

Major area    Gender    % Admitted
A             M             62
              F             82
B             M             63
              F             68
C             M             37
              F             34
D             M             33
              F             35
E             M             28
              F             24
F             M              6
              F              7

Note. Adapted from Freedman, Pisani, and Purves (1978).
TABLE 6
Test of Independence of Sex and Admission by Major Area for Berkeley Admissions Data

Major      LR      df      p
A        19.26      1    .0001
B          .26      1    .61
C          .75      1    .39
D          .30      1    .59
E          .99      1    .32
F          .03      1    .87
Sum      21.59      6    .0014
Note. LR is the likelihood-ratio goodness-of-fit statistic.
Table 5 presents the percentage of students admitted, by sex and major area. A log-linear model of the Major x Sex x Admission table shows that no unsaturated model fits the data. In other words, there is a Sex x Admissions relationship that is not the same for each major, so the description of the relationships among variables is not simplified by the log-linear approach. According to most sources about log-linear models, there is nothing more to do. But a simple visual examination of the percentage of students admitted indicates that most likely Major Area A is the only one in which males differ from females. The partitioning in Table 6 confirms this: Sex and admission are independent in Major Areas B through F, but not in Major Area A, in which males are admitted at a lower rate than females. The sum of the LR statistics for testing independence within each major area is the LR statistic for testing the log-linear model [MS] [MA]; that is, sex is independent of admissions, given major area. The partitioning locates the reason for failure of this model to fit the data.
3. ANALYZING CHANGE AND STABILITY

3.1. Hypothesis Tests for One Group

Partitioning chi-square is also useful in the analysis of change and stability of responses over time. As an example, we use data reanalyzed by Marascuilo and Serlin (1979). In this study, students in the ninth grade were asked whether they agreed with the statement "The most important qualities of a husband are determination and ambition." The question was asked of them again when they were in the 12th grade. Table 7 shows the cross-tabulation of Time 1 by Time 2 responses for whites and blacks combined into one group.
TABLE 7
Cross-Tabulation for a Group of Teenagers of Their Opinion at Two Times

                    Time 2
Time 1        Disagree    Agree
Agree            238       315
Disagree         422       142
Note: Teenagers were asked whether they agree that "the most important qualities of a husband are determination and ambition." Adapted from Marascuilo and Serlin (1979).
The usual analysis of such a table is McNemar's test, which is a test of whether the frequency in cell 10 is the same as in cell 01; that is, whether change is equal in both directions (equivalently, it is a test of marginal homogeneity). But that is only one hypothesis that might be tested; there are two others that, combined with the first, will partition the chi-square test of equiprobability in the table. The three tests are displayed in Table 8.
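The three components of the equiprobability partition can be computed directly from the four cell counts. The following sketch (not from the chapter; it assumes NumPy) reproduces the statistics reported in Table 8.

```python
# A sketch of the partition of the equiprobability model for the Table 7 change data.
import numpy as np

def g2_equiprobable(counts):
    counts = np.asarray(counts, dtype=float)
    expected = np.full_like(counts, counts.sum() / counts.size)
    return 2 * (counts * np.log(counts / expected)).sum()

n10, n01, n11, n00 = 238, 142, 315, 422        # change and stability patterns

tests = {
    "10 vs 01 (McNemar-type)":      [n10, n01],
    "00 vs 11 (stable responders)": [n00, n11],
    "changers vs nonchangers":      [n10 + n01, n00 + n11],
}
for label, cells in tests.items():
    print(label, round(g2_equiprobable(cells), 3))
print("total", round(g2_equiprobable([n10, n01, n11, n00]), 3))
# components are about 24.52, 15.59, and 116.13, summing to 156.23 on 3 df
```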
TABLE 8
Partition of Equiprobability Model for Change Data

            Frequencies              LR        df
10              238
01              142                24.517       1
00              422
11              315                15.590       1
10 + 01         380
00 + 11         737               116.126       1
Total                             156.233       3

Note. In the context of the Marascuilo and Serlin data set, 0 means "disagree," 1 means "agree"; in each pair, the first number is the response at Time 1, the second number is the response at Time 2. LR is the likelihood-ratio goodness-of-fit statistic.
The first test is equivalent to McNemar's test, except that the likelihood-ratio statistic is used rather than the Pearson statistic. The test is significant, indicating change across time; in this case, fewer people agreed with the statement at Time 2. The second test compares the two categories of people who did not change, and answers the question of whether there were more stable people who agreed with the statement than who disagreed with it. The test statistic indicates that the null hypothesis of equality is rejected; more of the teenagers with stable responses disagreed with the statement than agreed with it. Finally, the data are collapsed and we test whether there is the same number of changers as stable responders. This hypothesis is also rejected; more people responded the same at both times than changed. The sum of the fit statistics for testing these three hypotheses (and their degrees of freedom) is equal to the statistic for testing equiprobability in the original four-cell table: 156.233 with 3 df.
3.2. Hypothesis Tests for Multiple Groups

With multiple groups, partitioning chi-square can be extended in an obvious way. Consider the full data, displayed in Table 9, for members of five ethnic groups answering the question about important qualities of a husband at two time points. The data form a 5 × 2 × 2 table, but the table can also be interpreted as a 5 (Ethnic Group) × 4 (Response Pattern) table, which better suits the purposes here. Testing for independence in the 5 × 4 table results in an LR chi-square of 98.732 with 12 df, showing that ethnic group is related to response pattern. The partition of chi-square takes place for both rows (ethnic group) and columns (response pattern) of the 5 × 4 table. The partition of columns follows the scheme discussed in the previous section: compare the two response patterns in which change occurred; compare the two response patterns in which no
TABLE 9
Cross-Tabulation for a Group of Teenagers of Their Opinion at Two Times by Ethnic Group

                          Time 1, Time 2
                      1 1     1 0     0 1     0 0
Whites                243     208     112     381
Asians                 60      50      22      68
Blacks                 72      30      30      41
Hispanics              62      29      19      25
Native Americans       86      28      47      39
Note: Teenagers were asked whether they agree that "the most important qualities of a husband are determination and ambition." Adapted from Marascuilo and Serlin (1979).
change occurred; compare the changers to those who did not change. The partition of rows could be done in an a priori manner if hypotheses had been made about similarities and differences among ethnic groups; however, in this case, the partitioning of ethnic groups was done in a post hoc manner. Because of the post hoc nature of these tests, it might be appropriate to consider a way to control for the Type I error level, which is discussed in detail later. One possibility is an analogy to the Scheffe test in analysis of variance: use the critical value for the overall test of independence when evaluating the follow-up tests. Table 10 shows the partition of the independence chi-square for those who changed. The overall test of independence produces an LR chi-square of 24.239 with 4 df. We first compare Whites to Asians; they do not differ (LR = .523, df = 1). Next, we compare Blacks and Hispanics, who do not differ (LR = 1.171, df = 1). We then combine the Blacks and Hispanics and compare
TABLE 10
Partition of Chi-Square Test of Independence for People Who Changed from Time 1 to Time 2

(a) All ethnic groups (overall test of independence)

                      10      01     P(10)       LR       df
Whites               208     112      .65
Asians                50      22      .69
Blacks                30      30      .50
Hispanics             29      19      .60
Native Americans      28      47      .37      24.239      4

(b) Whites vs. Asians

Whites               208     112      .65
Asians                50      22      .69        .523      1

(c) Blacks vs. Hispanics

Blacks                30      30      .50
Hispanics             29      19      .60       1.171      1

(d) Blacks (B) and Hispanics (H) vs. Native Americans (NA)

B + H                 59      49      .55
NA                    28      47      .37       5.351      1

(e) Whites (W) and Asians (A) vs. Blacks (B), Hispanics (H), and Native Americans (NA)

W + A                258     134      .66
B + H + NA            87      96      .48      17.193      1

Note. 10 means a change from 1 at Time 1 to 0 at Time 2; 01 means the opposite; P(10) is the proportion of changers who changed from 1 to 0; LR is the likelihood-ratio goodness-of-fit statistic; df is degrees of freedom.
them with Native Americans; there is a difference here (LR = 5.351, df = 1) if the usual critical value of chi-square with 1 df (3.84 at the .05 level of significance) is used, but not if the critical value with 4 df (9.49), as in a Scheffe-like test, is used. Finally, we compare the Whites and Asians with the Blacks, Hispanics, and Native Americans and find that there is a difference (LR = 17.193, df = 1): The Whites and Asians were more likely to change toward disagreement than the other groups. The sum of the four statistics, each of which has 1 df, equals the statistic for testing independence in the 5 x 2 table, showing that the chi-square statistic for the overall test has indeed been partitioned. In Table 11, we present the partitioning for the people who did not change from Time 1 to Time 2. The test of independence of ethnic group and response is significant: LR = 73.357 with 4 df. The second panel of Table 11 shows the test of independence for Blacks, Hispanics, and Native Americans; because the
TABLE 11
Partition of Chi-Square Test of Independence for People Who Did Not Change from Time 1 to Time 2

(a) All ethnic groups (overall test of independence)

                      11      00     P(11)       LR       df
Whites               243     381      .39
Asians                60      68      .47
Blacks                72      41      .64
Hispanics             62      25      .71
Native Americans      86      39      .69      73.357      4

(b) Blacks vs. Hispanics vs. Native Americans

Blacks                72      41      .64
Hispanics             62      25      .71
Native Americans      86      39      .69
(Sum)                220     105                1.389      2

(c) Whites vs. Asians

Whites               243     381      .39
Asians                60      68      .47
(Sum)                303     449                2.747      1

(d) Blacks (B), Hispanics (H), and Native Americans (NA) vs. Whites (W) and Asians (A)

B + H + NA           220     105      .68
W + A                303     449      .40      69.221      1

Note. 11 indicates individuals who agreed with the statement at both times; 00 indicates individuals who disagreed with the statement at both times. LR is the likelihood-ratio goodness-of-fit statistic.
LR = 1.389 with 2 df, these three groups do not differ. The third panel compares Whites and Asians; LR = 2.747 with 1 df, so Whites and Asians do not differ. Finally, the fourth panel compares the combined Blacks, Hispanics, and Native Americans to the combined White and Asian students: the LR = 69.221 with 1 df. This contrast obviously explains all of the group differences: Blacks, Hispanics, and Native Americans who gave the same response at both times tended to agree at both times, whereas Whites and Asians were more likely to disagree at both times. (Again, the sum of the LR chi-squares and degrees of freedom add up to the total LR chi-square and degrees of freedom testing independence of ethnic group and response among nonchangers.) Finally, we can compare the changers with the nonchangers to see whether ethnic groups differ in the proportion who are changers. The LR statistic for testing independence is 1.136 with 4 df; the ethnic groups do not differ in the ratio of changers to nonchangers. As a check on the overall calculations, the LR chi-squares can be summed over the last three tables to get 73.357 + 24.239 + 1.136 = 98.732, which is the LR for testing independence of ethnic group and response in the original 5 × 4 table. We sum the degrees of freedom also to get 4 + 4 + 4 = 12.
4. HOW TO PARTITION CHI-SQUARE

A complete partition of chi-square will, at the last stage, result in the analysis of as many 2 × 2 tables as there are degrees of freedom for testing independence (or other statistic that is partitioned) in the original table. For a 5 × 4 table, for example, there are 4 × 3 = 12 degrees of freedom for testing independence, and as many one degree of freedom tests in a complete partition. In many cases, one need not do a complete partition, because only those hypotheses that make theoretical sense will be tested. There are well-known systematic ways for partitioning chi-square in the literature, but most do not take into account the nature of the investigator's hypotheses about the structure of the data. Two approaches are available that will reflect specific research hypotheses, and in the end, they both produce the same partition, so the choice between them will be based on which one feels most comfortable for the analyst. The two methods are called joining and splitting to reflect what happens to the rows or columns of the original table as the method is implemented.
4.1. Joining

Joining is illustrated schematically in Figure 1. The procedure will be illustrated for rows of the table; following (or instead of) this, partitioning of columns can also be performed.
FIGURE 1. The joining technique for partitioning chi-square.

First, extract any two rows of the table (that are thought to have the same row proportions) and test the table consisting of these two rows for independence. Next, replace the two rows in the original table with one row consisting of their sum. Repeat this procedure until only one row remains. For each (two-row) subtable extracted, perform the same procedure on the columns to obtain a complete partition of chi-square. During the process of combining rows or sets of rows, one or more statistical tests may indicate lack of independence. The rows (or sets of rows) can still be combined, although the comparisons that follow will involve a nonhomogeneous set of people; that is, groups that differ in their row proportions will have been combined. Such a row will then represent a (weighted) average of the groups representing the constituent rows. Whether the use of that average provides a meaningful comparison with the remaining groups will usually be determined within the context of a particular data set; if not, the rows should not be combined and the partitioning will not be reduced completely to one degree of freedom comparisons. If a point is reached where all remaining rows (and columns) differ significantly, then the analyst has found the important parts of the data structure; in such a case, a reduction to one degree of freedom contrasts would not add useful information. The joining procedure has two common patterns, each of which corresponds to a common coding method for contrasts in analysis of variance; these are illustrated in Figure 2.
FIGURE 2. Two joining techniques: "piling on" and nesting.
The first might be called "piling on," which corresponds to Helmert contrasts. In this procedure, two groups are compared and combined; these are compared to and then combined with a third, and so on. In the figure, rows 1 and 2 are extracted and tested for independence first. Then, rows 1 and 2 are summed and compared with row 3. Finally, the sum of rows 1, 2, and 3 is compared with row 4. An application of this method might involve the comparison of three groups administered drugs and a fourth group given a placebo. Suppose that two of the drugs were closely related chemically; call them Drug 1A and Drug 1B, and that the third drug, called Drug 2, was not related to either 1A or 1B. One sensible analysis would first compare Drug 1A to Drug 1B, then combine them to compare drugs of type 1 to Drug 2, and finally combine all subjects given drugs to compare them with subjects not given drugs. The second common joining method is used for a nested or hierarchical structure in which the rows are divided into sets of pairs that are compared and combined. The resulting combined rows are then paired, compared, and combined, and so on. This is illustrated in the bottom part of Figure 2. First, rows 1 and 2 are compared and combined, as are rows 3 and 4, then 5 and 6, and finally 7 and 8.
FIGURE 3. The splitting technique for partitioning chi-square.

Next, row (1 + 2) is compared with row (3 + 4), and row (5 + 6) is compared with (7 + 8), where the notation refers to sums of rows in the original table. Finally, row (1 + 2 + 3 + 4) is compared with row (5 + 6 + 7 + 8). This method was used for the first partitioning analysis of the data on aspirin and stroke in which rows were successively collapsed over levels of the variables sulfinpyrazone, aspirin, and sex. Combinations of these methods might also be used (as was done for the data on attitudes toward the role of husbands for five ethnic groups in a previous section).
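The "piling on" pattern is simple enough to automate. The sketch below (ours, not from the chapter; it assumes NumPy, and the drug-trial counts are made up purely for illustration) joins rows in a fixed order and returns the resulting 1-df components, which sum to the overall independence statistic.

```python
# A sketch of the "piling on" joining scheme: successively test a new row
# against the sum of all rows already joined.
import numpy as np

def g2(table):
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return 2 * (table * np.log(table / expected)).sum()

def piling_on(rows):
    """Return the 1-df components from joining rows in the given order."""
    rows = np.asarray(rows, dtype=float)
    components = []
    combined = rows[0]
    for row in rows[1:]:
        components.append(g2(np.vstack([combined, row])))
        combined = combined + row
    return components

# hypothetical counts for Drug 1A, Drug 1B, Drug 2, and placebo (success, failure)
example = [[30, 10], [28, 12], [20, 20], [15, 25]]
parts = piling_on(example)
print([round(p, 2) for p in parts], round(g2(example), 2))  # components sum to the overall statistic
```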
4.2. Splitting

The splitting algorithm, illustrated in Figure 3, begins by dividing the original table into two subtables. In addition, a table is created of the row sums of each of the two subtables. The chi-square for the original table is now partitioned into three parts, one for each of the subtables and one for the table of row totals. The test in each subtable tests homogeneity of the rows within that subtable, and the test of the row totals tests the difference between the subtables. This procedure may be repeated for each of the two original subtables and, finally, columns may be split. The splitting method was used in the analysis of data on attitude and cancer survival in which the original table was split into two parts, within each of which homogeneity held. The remaining hypothesis test was of the sum of the rows in the two subtables created in the first step,
showing that the first two rows differed from the second two rows. This method was also used for the second partitioning of the aspirin and stroke data. Of course, splitting may be used on the rows and joining on the columns, or vice versa.
5. DISCUSSION

5.1. Advantages

Partitioning chi-square has several obvious advantages. First and foremost, it allows researchers to test specific hypotheses of interest rather than more general null hypotheses. One could say that it is a context-sensitive statistical technique, because it can be implemented in a way that reflects content area concerns and should not be done mechanically. The emphasis on testing specific hypotheses can add statistical power, because in some cases an overall test may not be significant, hiding one or more significant effects. The first example, of cancer survival data, came close to fitting this description: the overall test of independence was barely rejected even though there was a large difference detected in the partitioning. Second, the technique of partitioning is as close to foolproof as can be; no great knowledge of mathematics or statistics is necessary. Finally, partitioning can be done by anyone with access to a statistical analysis program, as every general statistical program will produce a chi-square statistic for a contingency table. Some programs may provide Pearson rather than likelihood-ratio statistics, but this is no problem; the total chi-square will not partition exactly, but all of the hypothesis tests are still valid.
5.2. Cautions and Problems

5.2.1. Power

One area for caution is that the statistical power is not the same for all hypotheses being tested. As rows (or columns) are collapsed, the sample size (and power) for a test gets larger; as tables are split, the sample size gets smaller. Effects of the same size might not be detected as significant in a small part of the table, but could be when larger segments of the table are involved in a comparison. Researchers may want to calculate effect size estimates, including standard measures of association, in various subtables for exploratory purposes.

5.2.2. Post Hoc Tests

A second area for caution that was mentioned earlier is that in doing post hoc tests, the researcher might want to control for the level of Type I error. (I do not consider a priori hypothesis tests to be problematic, but others might.) The simplest technique would be to use, in an analogy with
the Scheffé method in analysis of variance, an adjusted critical value. For partitioning chi-square, this would be the critical value for the overall test in the full table. For a partition of independence in an R × C table, the degrees of freedom are (R − 1)(C − 1), which would determine the critical value for all hypothesis tests. Like the Scheffé test, the use of this critical value would guarantee that if the overall test is not significant, no post hoc test would be significant either. This procedure is very conservative, as is the Scheffé test. If only rows or only columns are partitioned, then the degrees of freedom for the critical value can be chosen consistent with that partitioning (see Marascuilo & Levin, 1983, for a readable discussion of the use of the Scheffé procedure with categorical data).

The Bonferroni procedure is widely used in other contexts to control the overall Type I error level and can be less conservative than the Scheffé test if only a small number of hypotheses are tested. To implement the Bonferroni procedure as a post hoc test, one must know the total number of possible hypotheses that might have been tested. Then calculate a new alpha value, dividing the desired overall alpha level (usually .05) by the potential number of hypotheses tested, and reject the null hypothesis only if the outcome would be rejected at the adjusted alpha level. All of these methods are discussed by Santner and Duffy (1989), based on research by Goodman (1964, 1965; see also Goodman, 1969).

Informal techniques for judging significance of partitioned tests are also available. A complete partition of chi-square will ultimately produce (R − 1)(C − 1) hypothesis tests, each with 1 df. The square root of a chi-square statistic with 1 df is a standard normal deviate. A half-normal plot of these deviates, as illustrated in Fienberg (1980), may help in judging which elements of the partition are likely to represent real effects and which are not. (Those familiar with factor analysis will see a similarity to using a scree plot of eigenvalues to judge the number of factors in a set of variables.)
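As a small illustration of the two adjustments just described, the following sketch computes a Scheffé-type critical value (the critical value of the overall test applied to every 1-df component) and a Bonferroni-adjusted alpha level; the table dimensions and alpha level are arbitrary example values.

```python
from scipy.stats import chi2

R, C = 4, 3                              # example table dimensions
alpha = 0.05
df_overall = (R - 1) * (C - 1)

# Scheffe-type protection: judge every partitioned 1-df statistic against
# the critical value of the overall test in the full table.
scheffe_crit = chi2.ppf(1 - alpha, df_overall)

# Bonferroni: divide alpha by the number of single-df tests a complete
# partition could produce, here (R - 1)(C - 1).
bonferroni_alpha = alpha / df_overall
bonferroni_crit = chi2.ppf(1 - bonferroni_alpha, 1)

print(f"Scheffe-type critical value (df = {df_overall}): {scheffe_crit:.3f}")
print(f"Bonferroni alpha = {bonferroni_alpha:.4f}, 1-df critical value = {bonferroni_crit:.3f}")
```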
5.3. Relationship to Nonstandard Log-Linear Models

Researchers acquainted with advanced statistical methods will realize that partitioning is a simple-minded way of doing what could also be accomplished within a generalized linear model framework. The hypotheses tested here do not fit in the framework of the usual hierarchical log-linear models, but are what have been called nonstandard log-linear models (Rindskopf, 1990). Nonstandard log-linear models can be used to test every kind of hypothesis that can be tested using partitioning, and more. Why, then, should partitioning of chi-square be used? Because partitioning is simple, any researcher who knows how to test independence in contingency tables can do a partitioning of chi-square. The hypotheses being tested are obvious, and inspection of row or column proportions will give the researcher a good idea about whether the hypothesis tests have
been done correctly. Using nonstandard log-linear models, on the other hand, involves setting up a model matrix by coding variables to test the desired hypotheses. This is sometimes tricky, and even experienced researchers cannot always correctly interpret parameters when variables have been coded in nonstandard ways. Comparing the power of different contrasts is simple when using nonstandard models: The standard errors will vary with the nature of the comparison; larger standard errors indicate lower power.

In conclusion, partitioning chi-square is a simple technique that allows researchers to test hypotheses specified by their theories. Partitioning can be done by anyone with access to a general statistics package, and because of its simplicity it is more difficult to misuse than nonstandard log-linear models. Caution is needed because of possible increases in Type I error rates when post hoc tests are conducted, and because different stages of a partitioning can have different powers. Partitioning has been unjustly neglected in the recent literature because most earlier expositions either used complex formulas to make the Pearson statistic additive or used mechanical partitioning schemes that did not reflect the important scientific hypotheses researchers wished to test.
ACKNOWLEDGMENTS

The author thanks Howard Ehrlichman, Bengt Muthén, Laurie Hopp Rindskopf, Alex von Eye, David Andrich, and Bob Newcomb, Kim Romney, and their colleagues and students at the University of California, Irvine, for helpful comments and suggestions.
REFERENCES

Canadian Cooperative Study Group. (1978). A randomized trial of aspirin and sulfinpyrazone in threatened stroke. New England Journal of Medicine, 299, 53-59.
Everitt, B. S. (1977). The analysis of contingency tables. London: Chapman & Hall.
Fienberg, S. E. (1980). The analysis of cross-classified categorical data. Cambridge, MA: MIT Press.
Fisher, R. A. (1930). Statistical methods for research workers (3rd ed.). Edinburgh: Oliver & Boyd.
Freedman, D., Pisani, R., & Purves, R. (1978). Statistics. New York: Norton.
Goleman, D. (1985, October 22). Strong emotional response to disease may bolster patient's immune system. The New York Times, pp. C1, C3.
Goodman, L. A. (1964). Simultaneous confidence limits for cross-product ratios in contingency tables. Journal of the Royal Statistical Society, Series B, 26, 86-102.
Goodman, L. A. (1965). On simultaneous confidence intervals for multinomial proportions. Technometrics, 7, 247-254.
Goodman, L. A. (1969). How to ransack social mobility tables and other kinds of cross-classification tables. American Journal of Sociology, 75, 1-40.
Marascuilo, L. A., & Levin, J. R. (1983). Multivariate statistics in the social sciences: A researcher's guide. Monterey, CA: Brooks/Cole.
Marascuilo, L. A., & Serlin, R. C. (1979). Tests and contrasts for comparing change parameters for a multiple sample McNemar data model. British Journal of Mathematical and Statistical Psychology, 32, 105-112.
Maxwell, A. E. (1961). Analysing qualitative data. London: Methuen.
Reynolds, H. T. (1977). The analysis of cross-classifications. New York: Free Press.
Rindskopf, D. (1990). Nonstandard loglinear models. Psychological Bulletin, 108, 150-162.
Santner, T. J., & Duffy, D. E. (1989). The statistical analysis of discrete data. New York: Springer-Verlag.
Wickens, T. D. (1989). Multiway contingency tables analysis for the social sciences. Hillsdale, NJ: Erlbaum.
Nonstandard Log-Linear Models for Measuring Change in Categorical Variables
Alexander von Eye
Michigan State University
East Lansing, Michigan
Christiane Spiel
University of Vienna Vienna, Austria
1. INTRODUCTION

Many statistical tests are special cases of more general statistical models. When teaching statistics, it is often seen as a didactical plus if tests are introduced via both the "classical" formulas and within the framework of statistical models. This chapter proposes recasting statistical tests of axial symmetry and quasi-symmetry in terms of nonstandard log-linear models. First, three equivalent forms of the well-known Bowker test are presented. The first form is the test statistic originally proposed by Bowker (1948); the other two are log-linear
models. Second, quasi-symmetry is recast in terms of nonstandard log-linear models.
2. BOWKER'S TEST

Known since 1948, Bowker's test allows researchers to assess axial symmetry in a square cross-tabulation. The test was originally proposed as a generalization of McNemar's (1947) chi-square test which assesses axial symmetry in 2 × 2 tables. The axial symmetry concept implies that for cell frequencies $F_{ij}^{AB}$ in a square cross-tabulation,

$$F_{ij}^{AB} = F_{ji}^{AB}, \quad \text{for } i > j, \qquad (1)$$
holds, where superscript A denotes the row variable and B denotes the column variable. If changes from one category to another are symmetric, the marginal distributions stay the same. Typically, researchers apply McNemar's and Bowker's tests when a categorical variable is observed twice (e.g., Sands, Terry, & Meredith, 1989). The null hypothesis states that changes from one category to another are random in nature. Textbooks illustrate the tests using examples from many areas, such as from pharmacology, when subjects are asked twice about the effects of a drug, or when effects of drugs are compared with effects of placebos in randomized repeated measures designs (cf. Bortz, Lienert, & Boehnke, 1990). In developmental research, for example, the Bowker test is used in research concerning change and stability of intellectual functioning (Sands et al., 1989).

For the following description of the McNemar and the Bowker tests, consider a categorical variable with k categories (k ≥ 2). This variable is observed twice on the same individuals. The following test statistic, known as the McNemar test if k = 2, and as the Bowker test if k > 2, is approximately distributed as X² with $\binom{k}{2}$ degrees of freedom:

$$X^2 = \sum_{i=1}^{k} \sum_{j=1}^{k} \frac{(f_{ij} - f_{ji})^2}{f_{ij} + f_{ji}}, \quad \text{for } i > j, \qquad (2)$$
where i, j = 1, …, k, and f_ij is the observed frequency in cell ij. The following example (adapted from Bortz et al., 1990) illustrates the application of Equation (2). A sample of N = 100 subjects took vitamin pills over the course of 2 weeks. Twice during the experiment, subjects indicated their well-being by using the three categories positive, so-so, and negative. Table 1 displays the 3 × 3 cross-tabulation of the two observations of the well-being variable.
TABLE 1. Cross-Tabulation of Two Reports About Effects of Vitamin Pill and Placebo

                                    Responses at second observation
Responses at first observation    Positive    So-so    Negative      Sums
Positive                               14         7          9         30
So-so                                   5        26         19         50
Negative                                1         7         12         20
Sums                                   20        40         40    N = 100
Inserting the frequencies from Table 1 into Equation (2) yields

$$X^2 = \frac{(5 - 7)^2}{5 + 7} + \frac{(1 - 9)^2}{1 + 9} + \frac{(7 - 19)^2}{7 + 19} = 12.27.$$

This value has, for df = $\binom{3}{2}$ = 3, a tail probability of p = .0065. Thus, the null
hypothesis of axial symmetry must be rejected. There have been attempts to improve and generalize the preceding formulation. Examples include Krauth's (1973) proposal of an exact test, Meredith and Sands' (1987) embedding of Bowker's test in the framework of latent trait theory, the proposal of Zwick, Neuhoff, Marascuilo, and Levin (1982) of using simultaneous multiple comparisons instead of Bowker's test (cf. Havránek & Lienert, 1986), and reformulation of axial symmetry in terms of log-linear models (Bishop, Fienberg, & Holland, 1975; cf. Wickens, 1989). Benefits of reformulating such tests as the Bowker test in terms of log-linear models include embedding it in a more general framework and the possibility of parameter interpretation (e.g., Meredith & Sands, 1987). The following section summarizes Bishop et al.'s (1975) formulation of axial symmetry.
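A minimal sketch of the computation in Equation (2), applied to the frequencies of Table 1, is given below (general statistical libraries offer equivalent symmetry tests, but the direct computation keeps the formula visible; the function name is illustrative only).

```python
import numpy as np
from scipy.stats import chi2

def bowker(table):
    """Bowker test of axial symmetry for a square k x k table (Equation 2)."""
    f = np.asarray(table, dtype=float)
    k = f.shape[0]
    stat, df = 0.0, 0
    for i in range(k):
        for j in range(i):
            if f[i, j] + f[j, i] > 0:
                stat += (f[i, j] - f[j, i]) ** 2 / (f[i, j] + f[j, i])
            df += 1                      # df = k(k - 1) / 2 pairs
    return stat, df, chi2.sf(stat, df)

# Table 1: well-being reported twice (rows = first, columns = second observation)
table1 = [[14, 7, 9],
          [5, 26, 19],
          [1, 7, 12]]

stat, df, p = bowker(table1)
print(f"Bowker X2 = {stat:.2f}, df = {df}, p = {p:.4f}")   # 12.27, df = 3, p = .0065
```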
3. LOG-LINEAR MODELS FOR AXIAL SYMMETRY

The log-linear model that allows testing of axial symmetry can be given as follows:

$$\log F_{ij}^{AB} = \lambda_0 + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}, \qquad (3)$$

with side constraints

$$\lambda_i^A = \lambda_j^B, \quad \text{for } i = j, \qquad (4)$$

and

$$\lambda_{ij}^{AB} = \lambda_{ji}^{AB}, \quad \text{for } i > j. \qquad (5)$$

The side constraints (4) and (5) result in estimated expected cell frequencies

$$\hat{E}_{ij}^{AB} = \frac{f_{ij}^{AB} + f_{ji}^{AB}}{2}, \quad \text{for } i \neq j, \qquad (6)$$

and

$$\hat{E}_{ii}^{AB} = f_{ii}^{AB}. \qquad (7)$$
The following section shows how this model formulation can be recast in terms of a nonstandard log-linear model.
4. AXIAL SYMMETRY IN TERMS OF A NONSTANDARD LOG-LINEAR MODEL

The saturated log-linear model for an I × J table can be expressed as

$$\log F_{ij}^{AB} = \lambda_0 + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}, \qquad (8)$$

where λ₀ is the "grand mean" parameter; λ_i^A are the parameters for the main effect of the row variable, A; λ_j^B are the parameters for the main effect of the column variable, B; and λ_ij^AB are the parameters for the A × B interaction. Equation (8) is a special case of

$$\log F = X\lambda, \qquad (9)$$

where F is an array of frequencies, X is a design matrix, and λ is a parameter vector. One of the main benefits from nonstandard log-linear modeling is that the researcher can specify constraints and contrasts in the design matrix, X (Clogg, Eliason, & Grego, 1990; Evers & Namboodiri, 1978; Rindskopf, 1990; von Eye, Brandtstädter, & Rovine, 1993). Here, we translate the constraints of the axial symmetry model into vectors for X. The specification in Equation (3), together with side constraints (4) and (5), requires a design matrix with two sets of vectors:

1. Vectors that exclude the frequencies in the main diagonal from the estimation process (structural frequencies).
2. Vectors that specify which pairs of cells are assumed to contain the same frequencies.

(For alternative ways of specifying symmetry models using design matrices, see Clogg et al., 1990.) When all pairs of a table are to be tested, the design matrix contains, in addition to the constant vector, k − 1 vectors specifying structural frequencies. Thus, the degrees of freedom for the model of axial symmetry are

$$df = \binom{k}{2}. \qquad (10)$$
The following sections illustrate the design matrix approach in two examples. The first example uses Table 1 again. This table contains data that contradict the model of axial symmetry. The log-linear main effect model, here equivalent to Pearson's X², yields a test statistic of X² = 22.225, which, for df = 4, has a tail probability of p = .0002. The design matrix given in Table 2 was used to specify the model of axial symmetry. The first two columns after the cell indices in Table 2 contain the vectors needed for the frequencies in the main diagonal to meet side constraint (4). Because of these vectors, the frequencies in the main diagonal cells are estimated as observed. The next three vectors specify one pair of cells each to meet side constraint (5). Specifically, the following three null hypotheses are put forth: vector 3 posits that the frequencies in cells 12 and 21 are, statistically, the same; vector 4 posits that the frequencies in cells 13 and 31 are, statistically, the same; and vector 5 posits that the frequencies in cells 23 and 32 are, statistically, the same. For the data in Table 1, this model yields a Pearson X² = 12.272, which is identical to the result from applying Equation (2). Application of (3) through (7) also yields the same results. Thus, the model does not hold and parameters cannot be interpreted. To illustrate parameter interpretation, the second example
TABLE 2. Design Matrix for Model of Axial Symmetry in 3 × 3 Cross-Tabulation

                     Vectors
Cell index    1    2    3    4    5
11            1    0    0    0    0
12            0    0    1    0    0
13            0    0    0    1    0
21            0    0    1    0    0
22            0    1    0    0    0
23            0    0    0    0    1
31            0    0    0    1    0
32            0    0    0    0    1
33            0    0    0    0    0
presents a case in which the symmetry model fits. Consider a sample of N = 89 children who were asked about their preferred vacations. All children were in elementary school. The children had spent their last vacations (Time 1) at the beach (B), at amusement parks (A), or in the mountains (M). When their families planned for their vacations the following year (Time 2), children were asked where they would like to spend these vacations. Alternatives included going to the same type of place or switching to one of the other places. Table 3 displays the cross-tabulation of preferences.

The log-linear main effect model of the data in Table 3 indicates that the preferences at the two occasions are not independent (Pearson X² = 75.59, df = 4, p < .01). The strong diagonal suggests that most children stay with the places they used to go. The symmetry model asks whether those children who switch to another place do this in some systematic fashion. The design matrix given in Table 2 applies again. The symmetry model provides a good fit. The Pearson X² = 5.436 has, for df = 3, a tail probability of p = .1425. Thus there is no need to reject this model, and parameters (typically not estimated for the Bowker or the McNemar tests) can be interpreted. All three parameters that correspond to the symmetry model are statistically significant. The first parameter is λ̂₁/se₁ = −3.485; for the second, we calculate λ̂₂/se₂ = −4.326; and for the third, λ̂₃/se₃ = −4.411. Thus, each of these vectors accounts for a statistically significant portion of the variability in Table 3. Substantively, these parameters suggest that shifts from beach vacations to amusement parks are as likely as shifts from amusement parks to beach vacations. Shifts from beach vacations to mountain vacations are as likely as inverse shifts. Shifts from amusement parks to mountain vacations are as likely as inverse shifts. Application of Bowker's test yields
$$X^2 = \frac{(10 - 3)^2}{10 + 3} + \frac{(2 - 4)^2}{2 + 4} + \frac{(1 - 3)^2}{1 + 3} = 5.436,$$
TABLE 3. Cross-Tabulation of Children's Preferences at Two Occasions

                       Vacations at Time 2
Vacations at Time 1      B      A      M    Sums
B                       25     10      2      37
A                        3     19      1      23
M                        4      3     22      29
Sums                    32     32     25    N = 89
a value that is identical with the Pearson X² for the nonstandard log-linear version of the axial symmetry model.
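Because the nonstandard model is simply a Poisson log-linear model with a user-supplied design matrix, it can be fit with any GLM routine. The following sketch does this for the vacation data of Table 3 using the design matrix of Table 2 and the Python package statsmodels, as a stand-in for the log-linear software used in the chapter; the Pearson X² it computes should reproduce the value 5.436 reported above.

```python
import numpy as np
import statsmodels.api as sm

# Cells in the order 11, 12, 13, 21, 22, 23, 31, 32, 33 (Table 3 frequencies)
f = np.array([25, 10, 2, 3, 19, 1, 4, 3, 22])

# Design matrix of Table 2: two diagonal vectors, three symmetry vectors
X = np.array([
    # v1 v2 v3 v4 v5
    [1, 0, 0, 0, 0],   # 11
    [0, 0, 1, 0, 0],   # 12
    [0, 0, 0, 1, 0],   # 13
    [0, 0, 1, 0, 0],   # 21
    [0, 1, 0, 0, 0],   # 22
    [0, 0, 0, 0, 1],   # 23
    [0, 0, 0, 1, 0],   # 31
    [0, 0, 0, 0, 1],   # 32
    [0, 0, 0, 0, 0],   # 33
])
X = sm.add_constant(X)

fit = sm.GLM(f, X, family=sm.families.Poisson()).fit()
expected = fit.fittedvalues
pearson = np.sum((f - expected) ** 2 / expected)
print("Expected frequencies:", np.round(expected, 2))
print(f"Pearson X2 = {pearson:.3f}, df = {len(f) - X.shape[1]}")
```

The fitted off-diagonal frequencies equal the pairwise means (f_ij + f_ji)/2 of Equation (6), and the diagonal cells are reproduced exactly, as the design matrix requires.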
5. GROUP COMPARISONS

One of the major benefits from recasting a specific statistical test in terms of more general statistical models, such as the log-linear model, is that generalizations and applications in various contexts become possible. This section applies the design matrix approach of nonstandard log-linear modeling of axial symmetry to group comparisons. More specifically, we ask whether the model of axial symmetry applies to two or more groups in the same way, that is, by using the Model of Parallel Axial Symmetry, in which parameters are estimated simultaneously for all groups.

To illustrate the Model of Parallel Axial Symmetry, we use data that describe two groups of kindergartners. The children were observed twice, the first time 1 month after entering kindergarten and the second time 6 months later. One of the research questions concerned the popularity of individual children. Specifically, the question was asked whether there was a systematic pattern of shifts in popularity in those children who did not display stable popularity ratings. Popularity was rated by kindergarten teachers on a 3-point Likert scale, with 1 indicating low popularity and 3 indicating high popularity. Ratings described popularity in two groups of children. The first group contained N = 86 kindergartners from day care centers in Vienna, Austria (Spiel, 1994). The second group contained N = 92 children from day care centers in rural areas in Austria. Table 4 displays the Group (2; G) × Time 1 (3; T1) × Time 2 (3; T2) cross-tabulation of popularity ratings, the estimated expected cell frequencies, and the standardized residuals for the two groups of children.

Table 5 presents the design matrix used for estimating the expected cell frequencies in Table 4 for the Model of Parallel Axial Symmetry. The model has 6 df; one parameter is invested in each of the 11 vectors in Table 5 and one in the constant vector (not shown in Table 5). The design matrix in Table 5 contains vectors that make two types of propositions. First, there are vectors that guarantee that the cells in the main diagonals of the subtables are estimated as observed. These are vectors 1, 2, 3, 7, and 8. Second, there are vectors that specify the conditions of axial symmetry. These are vectors 4, 5, 6, 9, 10, and 11. Vectors 4, 5, and 6 posit that the frequencies in cell pairs 12 and 21, 13 and 31, and 23 and 32 are the same in the first group of children. Vectors 9, 10, and 11 posit the same for the second group of children. Goodness-of-fit for this model is good (Pearson X² = 10.445, df = 6, p = .107). The parameter estimates for the symmetry model are
TABLE 4. Cross-Tabulation of Popularity Ratings of Two Groups of Children over Two Observations, Evaluated Using Model of Simultaneous Axial Symmetry

Cell indexes                        Cell frequencies
G*T1*T2        Observed    Expected    Standardized residuals
111                   1        1.0         0.0
112                   3        3.5        -0.27
113                   0        0.25       -0.01
121                   4        3.5         0.27
122                  43       43.0         0.0
123                  18       11.0         2.11*
131                   0        0.0        -0.01
132                   4       11.0        -2.11*
133                  13       13.0         0.0
211                   7        7.0         0.0
212                   4        3.5         0.27
213                   0        0.5        -0.71
221                   3        3.5        -0.27
222                  36       36.0         0.0
223                   9        8.0         0.35
231                   1        0.5         0.71
232                   7        8.0        -0.35
233                  25       25.0         0.0
λ̂₄/se₄ = −4.598, λ̂₅/se₅ = −0.171, λ̂₆/se₆ = −2.808, λ̂₉/se₉ = −4.589, λ̂₁₀/se₁₀ = −3.836, and λ̂₁₁/se₁₁ = −3.559. Only the second of these parameters
is not statistically significant. Note that the second of these parameters is hard to estimate because the observed frequencies for both of the cells, 13 and 31, are zero. This may be a case for which the Delta option would be useful, where a constant, for example, 0.5, is added to each cell frequency.
¹Using the Delta option (δ = 0.5) has the following consequences for the current example: the sample size is artificially increased by 9; the Pearson X² now is X² = 9.507, df = 6, p = .1470; the second parameter estimate now is λ̂₅/se₅ = 3.875, thus suggesting that the pair of cells 13 and 31 also accounts for a statistically significant portion of the overall variation; all other parameter estimates are very close to what they were without adding a constant of 0.5 to each cell.

6. QUASI-SYMMETRY

The model of quasi-symmetry puts constraints only on interaction parameters. Thus, the side constraint on the main effect parameters specified in Equation (4) does not apply.
TABLE 5. Design Matrix for Model of Simultaneous Axial Symmetry in 2 × 3 × 3 Cross-Tabulation for Two Groups

                              Vectors
Cell index    1   2   3   4   5   6   7   8   9  10  11
111           1   0   0   0   0   0   0   0   0   0   0
112           0   0   0   1   0   0   0   0   0   0   0
113           0   0   0   0   1   0   0   0   0   0   0
121           0   0   0   1   0   0   0   0   0   0   0
122           0   1   0   0   0   0   0   0   0   0   0
123           0   0   0   0   0   1   0   0   0   0   0
131           0   0   0   0   1   0   0   0   0   0   0
132           0   0   0   0   0   1   0   0   0   0   0
133           0   0   1   0   0   0   0   0   0   0   0
211           0   0   0   0   0   0   1   0   0   0   0
212           0   0   0   0   0   0   0   0   1   0   0
213           0   0   0   0   0   0   0   0   0   1   0
221           0   0   0   0   0   0   0   0   1   0   0
222           0   0   0   0   0   0   0   1   0   0   0
223           0   0   0   0   0   0   0   0   0   0   1
231           0   0   0   0   0   0   0   0   0   1   0
232           0   0   0   0   0   0   0   0   0   0   1
233           0   0   0   0   0   0   0   0   0   0   0
A quasi-symmetry model describes data by (a) reproducing the marginal frequencies, and (b) by estimating expected cell frequencies such that

$$e_{ij} + e_{ji} = f_{ij} + f_{ji}, \qquad (11)$$
where e_ij and e_ji denote the estimated expected cell frequencies. The following example presents a design matrix for the quasi-symmetry model of a 3 × 3 cross-tabulation. The data describe repeat restaurant visitors' choices of main dishes at two occasions. The sample includes N = 94 customers who had selected on both visits from the following dishes: prime rib (R), sole (S), or vegetarian plate (V). Table 6 contains the cross-tabulation of the choices, the observed cell frequencies, the expected cell frequencies estimated for the quasi-symmetry model, and the standardized residuals. The design matrix for this analysis appears in Table 7.

The main effect model for the data in Table 6 suggests a lack of independence of the first and second visit choices (X² = 15.863; df = 4; p = .0032). The quasi-symmetry model asks whether selection of dishes changes in symmetrical fashion without placing constraints on each pair of cells. The likelihood-ratio X² suggests that this model describes the data adequately (X² = 2.966;
TABLE 6. Quasi-Symmetry of Meal Choices Made by Repeat Restaurant Customers

                              Frequencies
Cell indexes    Observed    Expected    Standardized residuals
RR                    25       25.00         0.0
RS                    14       12.30          .48
RV                     3        4.70         -.78
SR                    11       12.70         -.48
SS                    17       17.00         0.0
SV                     6        4.30          .82
VR                     7        5.30          .74
VS                     3        4.70         -.78
VV                     8        8.00         0.0
df = 1; p = .085). The test statistics for the three parameters estimated for the quasi-symmetry model are λ̂₅/se₅ = −1.967, λ̂₆/se₆ = −2.769, and λ̂₇/se₇ = −2.404, thus suggesting that for each pair of cells the condition specified in Equation (11) accounts for a substantial amount of variation in the table. The expected cell frequencies in Table 6 show that the design matrix given in Table 7 does indeed lead to estimated expected cell frequencies that meet condition (11). Specifically, both the observed and the expected cell frequencies for cell pair RS-SR add up to 25,
TABLE 7. Design Matrix for Quasi-Symmetry Model for 3 × 3 Cross-Classification in Table 6

Main effect          Main effect
first occasion       second occasion      Symmetry
  1    0               1    0             0   0   0
  0    1               1    0             1   0   0
 -1   -1               1    0             0   1   0
  1    0               0    1             1   0   0
  0    1               0    1             0   0   0
 -1   -1               0    1             0   0   1
  1    0              -1   -1             0   1   0
  0    1              -1   -1             0   0   1
 -1   -1              -1   -1             0   0   0
both the observed and the expected cell frequencies for cell pair RV-VR add up to 10, and the observed and the expected cell frequencies for cell pair SV-VS add up to 9.
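The quasi-symmetry model can be fit in the same GLM fashion as the symmetry model above. The sketch below combines effect-coded main effect vectors for both occasions with the three symmetry vectors, following the description of Table 7; it should reproduce the expected frequencies of Table 6 and the likelihood-ratio X² of 2.966 with 1 df. The helper names and cell ordering conventions are assumptions of this illustration, not part of the original analysis.

```python
import numpy as np
import statsmodels.api as sm

# Cells in the order RR, RS, RV, SR, SS, SV, VR, VS, VV (Table 6 frequencies)
f = np.array([25, 14, 3, 11, 17, 6, 7, 3, 8])

first = np.repeat([0, 1, 2], 3)     # dish chosen at the first visit
second = np.tile([0, 1, 2], 3)      # dish chosen at the second visit

def effect_code(levels):
    """Two effect-coded columns for a three-level factor (last level = -1, -1)."""
    codes = np.array([[1, 0], [0, 1], [-1, -1]])
    return codes[levels]

# Symmetry vectors: one indicator per unordered off-diagonal pair of dishes
sym = np.zeros((9, 3))
for col, (a, b) in enumerate([(0, 1), (0, 2), (1, 2)]):
    sym[(first == a) & (second == b), col] = 1
    sym[(first == b) & (second == a), col] = 1

X = sm.add_constant(np.column_stack([effect_code(first), effect_code(second), sym]))
fit = sm.GLM(f, X, family=sm.families.Poisson()).fit()
print("Expected frequencies:", np.round(fit.fittedvalues, 2))     # cf. Table 6
print(f"Likelihood-ratio X2 = {fit.deviance:.3f}, df = {fit.df_resid}")
```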
7. DISCUSSION

This chapter presented nonstandard log-linear models for two well-known tests and log-linear models for axial symmetry and quasi-symmetry. First, three equivalent forms of Bowker's symmetry test were considered. The first form, originally proposed by Bowker (1948), generates a test statistic that is approximately distributed as chi-squared. The second form is a log-linear model with side constraints that result in a formula for estimation of model fit that is the same as the one proposed by Bowker. The third form equivalently recasts the log-linear model as a nonstandard model that allows researchers to express model specifications in terms of coding vectors of a design matrix.

Recasting statistical tests equivalently in terms of more general statistical models results in a number of benefits. For example, the new form may enable the user to understand the characteristics of a test. For the Bowker test, the design matrix shows, for instance, that main effects (marginal frequencies) are not considered, and that the change patterns are evaluated regardless of the size of the frequencies in the main diagonal (diagonal cells are blanked out). For the quasi-symmetry model, the design matrix shows that main effects are considered. In addition, the design matrix approach allows researchers to analyze data without having to create three-dimensional tables (see Bishop et al., 1975, pp. 289ff.; for an illustration, see Upton, 1978, pp. 120ff., or Wickens, 1989, pp. 260ff.), and it allows instructors to introduce axial and quasi-symmetry models in a unified approach.

Yet, there are benefits beyond the presentation of equivalent forms and beyond didactical advances. For instance, log-linear model parameters and residuals can be interpreted. Thus, researchers can identify those pairs of cells for which symmetry holds and those for which symmetry is violated. In addition, other options of log-linear modeling can be used in tandem with the original test form. Examples of these options include consideration of the ordinal nature of variables (Agresti, 1984) and the incorporation of symmetry testing in multigroup comparisons. This chapter presented the Model of Parallel Axial Symmetry as one example of how to incorporate symmetry testing in multigroup comparisons. The model proposes that axial symmetry holds across two (or more) groups of subjects. Differences in group size can be considered, and so can assumptions that constrain axial symmetry to specific pairs of cells. Additional extensions include models for more than two observation points.
ACKNOWLEDGMENTS

Parts of this chapter were written while Alexander von Eye was Visiting Professor at the University of Vienna, Austria. The support of the University is gratefully acknowledged. The authors are also indebted to Clifford C. Clogg, G. A. Lienert, and Michael J. Rovine for helpful comments on earlier versions of this chapter. Parts of Alexander von Eye's work on this chapter were supported by NIA Grant 5T32 AG00110-07.
REFERENCES

Agresti, A. (1984). Analysis of ordinal categorical data. New York: Wiley.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press.
Bortz, J., Lienert, G. A., & Boehnke, K. (1990). Verteilungsfreie Methoden in der Biostatistik [Distribution-free methods for biostatistics]. Berlin: Springer-Verlag.
Bowker, A. H. (1948). A test for symmetry in contingency tables. Journal of the American Statistical Association, 43, 572-574.
Clogg, C. C., Eliason, S. R., & Grego, J. M. (1990). Models for the analysis of change in discrete variables. In A. von Eye (Ed.), Statistical methods in longitudinal research: Vol. 2. Time series and categorical longitudinal data (pp. 409-441). San Diego, CA: Academic Press.
Evers, M., & Namboodiri, N. K. (1978). On the design matrix strategy in the analysis of categorical data. In K. F. Schuessler (Ed.), Sociological methodology (pp. 86-111). San Francisco: Jossey-Bass.
Havránek, T., & Lienert, G. A. (1986). Pre-post treatment evaluation by symmetry testing in square contingency tables. Biometrical Journal, 28, 927-935.
Krauth, J. (1973). Nichtparametrische Ansätze zur Auswertung von Verlaufskurven [Non-parametric approaches to analyzing time series]. Biometrical Journal, 15, 557-566.
McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12, 153-157.
Meredith, W., & Sands, L. P. (1987). A note on latent trait theory and Bowker's test. Psychometrika, 52, 269-271.
Rindskopf, D. (1990). Nonstandard log-linear models. Psychological Bulletin, 108, 150-162.
Sands, L. P., Terry, H., & Meredith, W. (1989). Change and stability in adult intellectual functioning assessed by Wechsler item responses. Psychology and Aging, 4, 79-87.
Spiel, C. (1994). Risks to development in infancy and childhood. Manuscript submitted for publication.
Upton, G. J. G. (1978). The analysis of cross-tabulated data. Chichester: Wiley.
von Eye, A., Brandtstädter, J., & Rovine, M. J. (1993). Models for prediction analysis. Journal of Mathematical Sociology, 18, 65-80.
Wickens, T. (1989). Multiway contingency tables analysis for the social sciences. Hillsdale, NJ: Erlbaum.
Zwick, R., Neuhoff, V., Marascuilo, L. A., & Levin, J. R. (1982). Statistical tests for correlated proportions: Some extensions. Psychological Bulletin, 92, 258-271.
Application of the Multigraph Representation of Hierarchical Log-linear Models

H. J. Khamis
Wright State University Dayton, Ohio
1. INTRODUCTION

In developmental research, as indeed in other forms of research, it has not been uncommon in recent decades to confront studies in which large quantities of data are amassed. In the categorical case, this leads to large contingency tables that are not necessarily sparse. This is especially true given that standard rules for the adequacy of asymptotic approximations, such as the minimum expected cell size should be at least 5, are too conservative (see Fienberg, 1979). More appropriate rules of thumb are that the minimum expected cell size should be one or more, or that the total sample size should be at least 4 or 5 times the number of cells (Fienberg, 1979). Although the analytical techniques and software used for analyzing the structures of association among variables in such tables are well known, the
techniques for interpreting and using the complex models (e.g., loglinear models) associated with the tables have not kept up. In particular, it is often of interest to identify the conditional independencies resulting from a given contingency table, and from these, the collapsibility conditions for the table. Although this is not difficult when just a few variables are involved, for complex models involving five or more variables the task can be quite cumbersome because there is no coherent, efficient methodology for these analyses. A very useful technique for analyzing and interpreting hierarchical loglinear models in a graphical way was introduced by Darroch, Lauritzen, and Speed (1980). Although it does not seem to be in widespread use, it is included in some more recent categorical data analysis textbooks, such as Wickens (1989) and Christensen (1990). The usefulness of the approach by Darroch et al. (1980) is principally due to the simple graphical characterization of models that can be understood purely in terms of conditional independence relationships. In this chapter, I introduce an alternative approach to that of Darroch et al. (1980) that uses the generator multigraph. The multigraph approach has several strategic advantages over the first-order interaction graph used by Darroch et al. (1980). The focus of this chapter, however, is on how to use the multigraph approach in maximum likelihood estimation and in identifying conditional independencies in hierarchical loglinear models. All theoretical details (theorems and proofs) have been left out (they are contained in McKee and Khamis, 1996, and are available from the authors upon request). The next section establishes the notation necessary for the application of the multigraph approach.
2. NOTATION AND REVIEW

I have assumed that the reader is familiar with the technique of log-linear model analysis of multidimensional contingency tables, such as that presented in Bishop, Fienberg, and Holland (1975), Wickens (1989), or Agresti (1990). It is also helpful to know the rudimentary principles of mathematical graphs; a knowledge of graphical models would be useful but is not essential for understanding this chapter (for a review of the literature concerning graphical models, see Khamis and McKee, 1996). Attention will be confined to those models of discrete data that are most practically useful, namely, the hierarchical log-linear models (HLLMs). These models are uniquely characterized by their generating class or minimal sufficient configuration, which establishes the correspondence between the λ-terms (using the notation of Agresti, 1990) in the model and the minimal sufficient statistics.

Consider the following model of conditional independence in the three-dimensional table,

$$\log m_{ijk} = \lambda + \lambda_i^1 + \lambda_j^2 + \lambda_k^3 + \lambda_{ij}^{12} + \lambda_{ik}^{13}, \qquad (1)$$
where m_ijk denotes the expected cell frequency for the ith row, jth column, and kth layer, and the parameters on the right side of Equation (1) represent certain contrasts of logarithms of m_ijk. The generating class for this model is denoted by [12][13] and corresponds to the inclusion-maximal sets of indices in the model (called the generators of the model). For the I × J × K table with x_ijk denoting the observed cell frequency for the ith row, jth column, and kth layer, the minimal sufficient statistics for the parameters of this model then are {x_ij+}, i = 1, 2, …, I; j = 1, 2, …, J; and {x_i+k}, i = 1, 2, …, I; k = 1, 2, …, K, where x_ijk represents the observed cell frequency and the "+" in the subscript corresponds to summation over the index replaced. This model corresponds to conditional independence of Factors 2 and 3 given Factor 1. Using Goodman's (1970) notation, it can be written as [2 ⊗ 3 | 1].

Decomposable models (also called models of Markov type, multiplicative models, or direct models) are those HLLMs for which the cell probability (or, equivalently, expected frequency) can be factored according to the indices in the generators of the model. For instance, in the preceding example, m_ijk = m_ij+ m_i+k / m_i++, and this allows for an explicit solution to the maximum likelihood estimation problem. In fact, for this model, the maximum likelihood estimator for the expected cell frequency m_ijk is x_ij+ · x_i+k / x_i++. Because models with closed-form maximum likelihood estimators have closed-form expressions for asymptotic variance (Lee, 1977), the importance of decomposable models can be seen in theoretical and methodological research, for example, in the study of large, sparse contingency tables (see, e.g., Fienberg, 1979; Koehler, 1986).
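A quick numerical illustration of the closed form: for the model [12][13], the fitted frequencies can be computed directly from the margins, without iterative fitting. The counts in the sketch below are invented solely for the example.

```python
import numpy as np

# Hypothetical 2 x 2 x 2 table of observed counts, indexed [i, j, k]
x = np.array([[[10, 5], [8, 7]],
              [[6, 9], [12, 4]]], dtype=float)

x_ij = x.sum(axis=2)          # x_{ij+}
x_ik = x.sum(axis=1)          # x_{i+k}
x_i = x.sum(axis=(1, 2))      # x_{i++}

# Closed-form MLE for the model [12][13]: m_ijk = x_{ij+} x_{i+k} / x_{i++}
m_hat = x_ij[:, :, None] * x_ik[:, None, :] / x_i[:, None, None]

print(np.round(m_hat, 3))
print("Margins preserved:",
      np.allclose(m_hat.sum(axis=2), x_ij), np.allclose(m_hat.sum(axis=1), x_ik))
```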
3. THE GENERATOR MULTIGRAPH

The generator multigraph, or simply multigraph, is introduced as a graphical technique to analyze and interpret HLLMs. In the multigraph, the vertex set is the set of generators of the model, and two vertices are joined by edges that are equal in number to the number of indices shared by the two vertices. The multigraph M₁ for the model in Equation (1) ([12][13]) is given in Figure 1. Note that the vertices for the multigraph consist of the two generators of the model, [12] and [13], and because {1, 2} ∩ {1, 3} = {1}, there is a single edge joining the two vertices.
FIGURE 1. Generator multigraph M₁ for [12][13].
FIGURE 2. Generator multigraph M₂ for [135][245][345].
The multigraph M₂ for the generating class [135][245][345], corresponding to a five-dimensional table, is given in Figure 2. Here, there are two double edges ({1, 3, 5} ∩ {3, 4, 5} = {3, 5} and {2, 4, 5} ∩ {3, 4, 5} = {4, 5}) and one single edge ({1, 3, 5} ∩ {2, 4, 5} = {5}).
3.1. Maximum Spanning Trees

A fundamental concept for this examination of multigraphs is the standard graph-theoretic notion of a maximum spanning tree T of a multigraph M: a tree, or equivalently, a connected graph with no circuits (or closed loops), which includes each vertex of M such that the sum of all of the edges is maximum. Maximum spanning trees always exist and can be found by using, for example, Kruskal's algorithm (Kruskal, 1956). Kruskal's algorithm simply calls for the successive selection of multiedges with maximum multiplicity so that no circuits are formed and such that all vertices are included. Each maximum spanning tree T of M consists of a family of sets of factor indices called the branches of the tree. For the multigraph M₁ in Figure 1, the maximum spanning tree is trivially the edge (branch) joining the two vertices, and it is denoted by T₁ = {1}, namely the set containing the factor index corresponding to that edge. For the generating class [135][245][345] with multigraph M₂ in Figure 2, the maximum spanning tree is T₂ = {{3, 5}, {4, 5}}, with branches {3, 5} and {4, 5}. For the nondecomposable model [12][23][34][14], with multigraph M₃ given in Figure 3, there are four distinct possible maximum spanning trees, each of the form T₃ = {{i}, {j}, {k}}. Maximum spanning trees will be used in the next section
FIGURE 3. Generator multigraph M₃ of [12][23][34][14].
to provide a remarkably easy method for identifying decomposable models, for factoring the joint distribution of such models in terms of their generators, and for identifying conditional independencies in these models.
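The construction lends itself to a few lines of code. In the sketch below (Python with networkx, one possible implementation), each generator becomes a vertex, pairs of generators are joined by an edge weighted by the number of shared indices (a weighted simple graph is an equivalent stand-in for the multiedges), and Kruskal's algorithm returns a maximum spanning tree whose branches are the shared index sets. The generating classes are those of Figures 1-3; the function names are illustrative.

```python
from itertools import combinations
import networkx as nx

def multigraph(generating_class):
    """Generator multigraph, with edge weight = number of shared factor indices."""
    g = nx.Graph()
    g.add_nodes_from(frozenset(gen) for gen in generating_class)
    for a, b in combinations(generating_class, 2):
        shared = set(a) & set(b)
        if shared:
            g.add_edge(frozenset(a), frozenset(b), weight=len(shared), shared=shared)
    return g

def max_spanning_tree(g):
    """Maximum spanning tree (Kruskal); branches are the shared index sets."""
    t = nx.maximum_spanning_tree(g, weight="weight")
    return [d["shared"] for _, _, d in t.edges(data=True)]

for gc in ([{1, 2}, {1, 3}],                      # M1, Figure 1
           [{1, 3, 5}, {2, 4, 5}, {3, 4, 5}],     # M2, Figure 2
           [{1, 2}, {2, 3}, {3, 4}, {1, 4}]):     # M3, Figure 3
    print(gc, "-> branches:", max_spanning_tree(multigraph(gc)))
```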
3.2. Edge Cutsets

Another fundamental concept used in working with the multigraph is the edge cutset. An edge cutset of a multigraph M is an inclusion-minimal set of multiedges whose removal disconnects M. For the model [12][13] with multigraph M₁ given in Figure 1, there is a single edge cutset that disconnects the two vertices, and it is trivially the minimum number of edges that does so. We denote this edge cutset by {1}, the factor index associated with the edge whose removal disconnects M₁. For the multigraph M₂ given in Figure 2, there are three edge cutsets and each disconnects a single vertex: (a) the edge cutset {4, 5} (corresponding to the single edge {5} and the double edge {4, 5}) disconnects the vertex 245, (b) the edge cutset {3, 5} disconnects the vertex 135, and (c) the edge cutset {3, 4, 5} disconnects the vertex 345.

For the nondecomposable model [12][23][34][14] with multigraph M₃ given in Figure 3, there is a total of six edge cutsets: there are four edge cutsets that each disconnects a single vertex (these edge cutsets are {1, 2}, {2, 3}, {3, 4}, and {1, 4}); there is an edge cutset corresponding to the two horizontal edges (namely, {2, 4}); and there is an edge cutset corresponding to the two vertical edges (namely, {1, 3}). So, for example, removal of the two edges associated with indices 1 and 2 in Figure 3, corresponding to the preceding set {1, 2}, would disconnect the vertex 12 from the rest of the multigraph, and this is the minimum number of edges that will do so. One convenient way of keeping track of edge cutsets is to draw dotted lines that disconnect the graph. Those edges that the dotted lines intersect are contained in an edge cutset, as illustrated in Figure 4 for the multigraph M₃. Section 2.2 of Gibbons (1985) contains a standard mechanical procedure for finding all edge cutsets. This relatively efficient procedure will be important in the next section for identifying conditional independencies in nondecomposable models.
4. MAXIMUM LIKELIHOOD ESTIMATION AND FUNDAMENTAL CONDITIONAL INDEPENDENCIES

4.1. Maximum Likelihood Estimation

McKee and Khamis (1996) show that a HLLM is decomposable if and only if the number of indices added over the branches in any maximum spanning tree
FIGURE 4. Identification of edge cutsets in M₃.
of the multigraph subtracted from the number of indices added over the vertices of the multigraph is equal to the dimensionality of the table; that is,

$$d = \sum_{S \in V(T)} |S| - \sum_{S \in B(T)} |S|, \qquad (2)$$
where d is the number of categorical variables in the contingency table, T is any maximum spanning tree of the multigraph, and V(T) and B(T) are the set of vertices and set of branches, respectively, of T. For example, in Figure 1, d = 3, V(T) = {{1, 2}, {1, 3}}, and B(T) = {1}. Therefore the formula in (2) becomes 3 = (2 + 2) − 1; because this equality is true, the model [12][13] is decomposable (as was shown in section 2). For the generating class [135][245][345] with multigraph M₂ given in Figure 2, d = 5, V(T) = {{1, 3, 5}, {2, 4, 5}, {3, 4, 5}}, B(T) = {{3, 5}, {4, 5}}, and the formula in (2) becomes 5 = (3 + 3 + 3) − (2 + 2), so that [135][245][345] is decomposable. In Figure 3, 4 ≠ 8 − 3; therefore [12][23][34][14] is nondecomposable.

For decomposable models, the multigraph can be used directly to factor the joint distribution of the contingency table in terms of the generators of the model. In particular, let M be the multigraph of a decomposable generating class, and let T be any maximum spanning tree with set V(T) of vertices and set B(T) of branches. Then, the joint distribution for the associated contingency table is
$$P[v_1, v_2, \ldots, v_d] = \frac{\prod_{S \in V(T)} P[v : v \in S]}{\prod_{S \in B(T)} P[v : v \in S]}, \qquad (3)$$
where P[v₁, v₂, …, v_d] represents the probability associated with level v₁ of the first factor, level v₂ of the second factor, …, and level v_d of the dth factor; P[v : v ∈ S] denotes the marginal probability indexed on those indices contained in S (and summing over all other indices).
Consider the generating class [12][13] with multigraph M₁ given in Figure 1. Because V(T) = {{1, 2}, {1, 3}} and B(T) = {1}, from Equation (3) we get, using simpler notation, p_ijk = p_ij+ p_i+k / p_i++; note that the terms in the numerator are indexed on factors corresponding to V(T), namely Factors 1 and 2 (p_ij+) and Factors 1 and 3 (p_i+k), and the term in the denominator is indexed on the factor corresponding to B(T), namely Factor 1 (p_i++). This formula agrees with the one given in section 2 for this model. Consider the model [135][245][345] with multigraph M₂ given in Figure 2. Here, V(T) = {{1, 3, 5}, {2, 4, 5}, {3, 4, 5}} and B(T) = {{3, 5}, {4, 5}}, so that Equation (3) gives p_ijklm = p_i+k+m p_+j+lm p_++klm / (p_++k+m p_+++lm).
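Continuing the networkx sketch from section 3.1 (it reuses the multigraph and max_spanning_tree helpers defined there), the decomposability criterion of Equation (2) reduces to a one-line comparison:

```python
def decomposable(generating_class, d):
    """Equation (2): sum of |S| over vertices minus sum over branches equals d."""
    branches = max_spanning_tree(multigraph(generating_class))
    return sum(len(g) for g in generating_class) - sum(len(b) for b in branches) == d

print(decomposable([{1, 2}, {1, 3}], d=3))                    # True:  4 - 1 = 3
print(decomposable([{1, 3, 5}, {2, 4, 5}, {3, 4, 5}], d=5))   # True:  9 - 4 = 5
print(decomposable([{1, 2}, {2, 3}, {3, 4}, {1, 4}], d=4))    # False: 8 - 3 = 5
```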
4.2. Fundamental Conditional Independencies

Consider a partition of the d factors in a contingency table, for example, C₁, C₂, …, C_k, S, where 2 ≤ k ≤ d − 1. McKee and Khamis (1996) show that each generating class uniquely determines a set of fundamental conditional independencies (FCIs), each of the form [C₁ ⊗ C₂ ⊗ ⋯ ⊗ C_k | S] with k ≥ 2, such that all other conditional independencies can be deduced from them by replacing S with S′ such that S ⊆ S′, replacing each C_i with C_i′ such that C_i′ ⊆ C_i, subject to (C₁′ ∪ C₂′ ∪ ⋯ ∪ C_k′) ∩ S′ = ∅, and forming appropriate conjunctions. For example, if Factors 1 and 2 are independent of Factor 3 conditional on Factor 4, then (a) Factor 1 is independent of Factor 3 conditional on Factors 2 and 4, and (b) Factor 2 is independent of Factor 3 conditional on Factors 1 and 4. Notationally, [1, 2 ⊗ 3 | 4] ⇒ [1 ⊗ 3 | 2, 4] ∩ [2 ⊗ 3 | 1, 4].

The FCIs are determined from the multigraph as follows. For a given multigraph M and set of factors S, construct the multigraph M/S by removing each factor of S from each generator (vertex in the multigraph) and removing each edge corresponding to that factor. For decomposable models, S is chosen to be a branch of any maximum spanning tree of M, and for nondecomposable models, S is chosen to be the factors corresponding to an edge cutset of M. Then, the FCI corresponds to the mutual independence of the sets of factors in the disconnected components of M/S conditional on S.

A few examples should make the technique clear. Consider the multigraph M₁ given in Figure 1. Select S to be the branch of the maximum spanning tree T₁, that is, S = {1}. Then, M₁/S is constructed by removing the index 1 from each vertex in the multigraph and removing the edge corresponding to 1. The resulting multigraph is given in Figure 5. The disconnected components correspond to Factors 2 and 3, so Factors 2 and 3 are independent given Factor 1, as indicated in section 2. Consider the decomposable model [135][245][345] with maximum spanning tree T₂ = {{3, 5}, {4, 5}} and multigraph M₂ given in Figure 2. The multigraphs
FIGURE 5. The multigraph M₁/S and corresponding FCI [2 ⊗ 3 | 1] for S = {1}.
M₂/S for S = {3, 5} and S = {4, 5} are given in Figure 6, along with the FCI derived from each. For the nondecomposable model [12][23][34][14] with multigraph M₃ given in Figure 3, I have chosen S to be the set of factors associated with an edge cutset. The six edge cutsets for this multigraph are {1, 2}, {2, 3}, {3, 4}, {1, 4}, {2, 4}, and {1, 3} (see section 3), so there are six possible sets S; however, only the latter two edge cutsets yield an FCI, as the others do not produce a multigraph M₃/S with more than one component (see Fig. 7). More details concerning how to work with the generator multigraph and additional examples are given in McKee and Khamis (1996).
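The construction of M/S and the reading off of an FCI can also be scripted. The self-contained sketch below removes the factors in S from each generator, finds the connected components of what remains, and prints the corresponding conditional independence statement for the examples of Figures 5-7; the function name is illustrative.

```python
from itertools import combinations
import networkx as nx

def fci(generating_class, S):
    """Components of M/S: remove the factors in S from each generator, then
    connect two reduced generators whenever they still share a factor."""
    reduced = [frozenset(gen) - frozenset(S) for gen in generating_class]
    g = nx.Graph()
    g.add_nodes_from(range(len(reduced)))
    for i, j in combinations(range(len(reduced)), 2):
        if reduced[i] & reduced[j]:
            g.add_edge(i, j)
    components = [set().union(*(reduced[i] for i in comp))
                  for comp in nx.connected_components(g)]
    components = [c for c in components if c]     # drop empty reduced vertices
    if len(components) > 1:
        print(sorted(map(sorted, components)), "mutually independent given", sorted(S))
    else:
        print("S =", sorted(S), "yields no FCI (M/S has a single component)")

m2 = [{1, 3, 5}, {2, 4, 5}, {3, 4, 5}]
fci(m2, S={3, 5})     # -> [[1], [2, 4]] given [3, 5]
fci(m2, S={4, 5})     # -> [[1, 3], [2]] given [4, 5]

m3 = [{1, 2}, {2, 3}, {3, 4}, {1, 4}]
fci(m3, S={2, 4})     # -> [[1], [3]] given [2, 4]
fci(m3, S={1, 2})     # -> no FCI, matching the text
```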
5. EXAMPLES

In the following examples, instead of using numbers, as was done in the preceding discussion, capital letters are used to represent factors. In this way, identification of the factors will be easier. Edwards and Kreiner (1983) analyzed a set of data in the form of a five-way contingency table from an investigation conducted at the Institute for Social Research, Copenhagen, collected during 1978 and 1979. A sample of 1592 em-
FIGURE 6. The multigraphs M₂/S and corresponding FCIs for (a) S = {3, 5}: [1 ⊗ 2, 4 | 3, 5], and (b) S = {4, 5}: [2 ⊗ 1, 3 | 4, 5].
FIGURE 7. The multigraphs M₃/S and corresponding FCIs for (a) S = {2, 4}: [1 ⊗ 3 | 2, 4], and (b) S = {1, 3}: [2 ⊗ 4 | 1, 3].
ployed men, 18 to 67 years old, were asked whether in the preceding year they had done any work which before they would have paid a craftsman to do. The variables included in the study are as follows.

Variable               Symbol    Levels
Age category           A
Response               R
Mode of residence      M
Employment             E
Type of residence      T